* [PATCH 01/14] scripts/update-linux-headers: Add typelimits.h
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-07 7:54 ` Cédric Le Goater
2026-05-05 8:14 ` [PATCH 02/14] linux-headers: Update to Linux v7.1-rc1 Avihai Horon
` (12 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Upstream Linux added include/uapi/linux/typelimits.h and includes it
from ethtool.h [1][2].
Teach update-linux-headers.sh to install that header into
standard-headers to be able to update kernel headers to versions that
include the above changes.
[1] ca9d74eb5f6a ("uapi: add INT_MAX and INT_MIN constants")
[2] a8a11e5237ae ("ethtool: uapi: Use UAPI definition of INT_MAX")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
scripts/update-linux-headers.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 386d7a38e7..da367acee7 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -60,6 +60,7 @@ cp_portable() {
-e 'drm.h' \
-e 'limits' \
-e 'linux/const' \
+ -e 'linux/typelimits' \
-e 'linux/kernel' \
-e 'linux/sysinfo' \
-e 'asm/setup_data.h' \
@@ -250,6 +251,7 @@ for i in "$hdrdir"/include/linux/*virtio*.h \
"$hdrdir/include/linux/pci_regs.h" \
"$hdrdir/include/linux/ethtool.h" \
"$hdrdir/include/linux/const.h" \
+ "$hdrdir/include/linux/typelimits.h" \
"$hdrdir/include/linux/kernel.h" \
"$hdrdir/include/linux/kvm_para.h" \
"$hdrdir/include/linux/vhost_types.h" \
--
2.40.1
^ permalink raw reply related [flat|nested] 31+ messages in thread

* Re: [PATCH 01/14] scripts/update-linux-headers: Add typelimits.h
2026-05-05 8:14 ` [PATCH 01/14] scripts/update-linux-headers: Add typelimits.h Avihai Horon
@ 2026-05-07 7:54 ` Cédric Le Goater
2026-05-07 9:07 ` gaosong
0 siblings, 1 reply; 31+ messages in thread
From: Cédric Le Goater @ 2026-05-07 7:54 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb, gaosong
+ Song Gao
On 5/5/26 10:14, Avihai Horon wrote:
> Upstream Linux added include/uapi/linux/typelimits.h and includes it
> from ethtool.h [1][2].
>
> Teach update-linux-headers.sh to install that header into
> standard-headers to be able to update kernel headers to versions that
> include the above changes.
>
> [1] ca9d74eb5f6a ("uapi: add INT_MAX and INT_MIN constants")
> [2] a8a11e5237ae ("ethtool: uapi: Use UAPI definition of INT_MAX")
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> scripts/update-linux-headers.sh | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> index 386d7a38e7..da367acee7 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -60,6 +60,7 @@ cp_portable() {
> -e 'drm.h' \
> -e 'limits' \
> -e 'linux/const' \
> + -e 'linux/typelimits' \
> -e 'linux/kernel' \
> -e 'linux/sysinfo' \
> -e 'asm/setup_data.h' \
> @@ -250,6 +251,7 @@ for i in "$hdrdir"/include/linux/*virtio*.h \
> "$hdrdir/include/linux/pci_regs.h" \
> "$hdrdir/include/linux/ethtool.h" \
> "$hdrdir/include/linux/const.h" \
> + "$hdrdir/include/linux/typelimits.h" \
> "$hdrdir/include/linux/kernel.h" \
> "$hdrdir/include/linux/kvm_para.h" \
> "$hdrdir/include/linux/vhost_types.h" \
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
^ permalink raw reply [flat|nested] 31+ messages in thread

* Re: [PATCH 01/14] scripts/update-linux-headers: Add typelimits.h
2026-05-07 7:54 ` Cédric Le Goater
@ 2026-05-07 9:07 ` gaosong
0 siblings, 0 replies; 31+ messages in thread
From: gaosong @ 2026-05-07 9:07 UTC (permalink / raw)
To: Cédric Le Goater, Avihai Horon, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
Hi,
I will take this patch for my series and update the linux-headers.
On 2026/5/7 3:54 PM, Cédric Le Goater wrote:
> + Song Gao
>
> On 5/5/26 10:14, Avihai Horon wrote:
>> Upstream Linux added include/uapi/linux/typelimits.h and includes it
>> from ethtool.h [1][2].
>>
>> Teach update-linux-headers.sh to install that header into
>> standard-headers to be able to update kernel headers to versions that
>> include the above changes.
>>
>> [1] ca9d74eb5f6a ("uapi: add INT_MAX and INT_MIN constants")
>> [2] a8a11e5237ae ("ethtool: uapi: Use UAPI definition of INT_MAX")
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>> scripts/update-linux-headers.sh | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/scripts/update-linux-headers.sh
>> b/scripts/update-linux-headers.sh
>> index 386d7a38e7..da367acee7 100755
>> --- a/scripts/update-linux-headers.sh
>> +++ b/scripts/update-linux-headers.sh
>> @@ -60,6 +60,7 @@ cp_portable() {
>> -e 'drm.h' \
>> -e 'limits' \
>> -e 'linux/const' \
>> + -e 'linux/typelimits' \
>> -e 'linux/kernel' \
>> -e 'linux/sysinfo' \
>> -e 'asm/setup_data.h' \
>> @@ -250,6 +251,7 @@ for i in "$hdrdir"/include/linux/*virtio*.h \
>> "$hdrdir/include/linux/pci_regs.h" \
>> "$hdrdir/include/linux/ethtool.h" \
>> "$hdrdir/include/linux/const.h" \
>> + "$hdrdir/include/linux/typelimits.h" \
>> "$hdrdir/include/linux/kernel.h" \
>> "$hdrdir/include/linux/kvm_para.h" \
>> "$hdrdir/include/linux/vhost_types.h" \
>
>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>
Reviewed-by: Song Gao <gaosong@loongson.cn>
Thanks.
Song Gao
> Thanks,
>
> C.
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 02/14] linux-headers: Update to Linux v7.1-rc1
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
2026-05-05 8:14 ` [PATCH 01/14] scripts/update-linux-headers: Add typelimits.h Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-07 9:16 ` Cédric Le Goater
2026-05-05 8:14 ` [PATCH 03/14] migration: Propagate errors in migration_completion_precopy() Avihai Horon
` (11 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Update Linux headers to get VFIO_PRECOPY_INFO_REINIT feature of VFIO
migration uAPI.
The update was done by running `update-linux-headers.sh` on commit
254f49634ee1 ("Linux 7.1-rc1") in the Linux tree.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
include/standard-headers/drm/drm_fourcc.h | 28 +-
include/standard-headers/linux/const.h | 18 +
include/standard-headers/linux/ethtool.h | 28 +-
.../linux/input-event-codes.h | 13 +
include/standard-headers/linux/pci_regs.h | 71 ++-
include/standard-headers/linux/typelimits.h | 8 +
include/standard-headers/linux/virtio_ring.h | 3 +-
include/standard-headers/linux/virtio_rtc.h | 237 ++++++++++
include/standard-headers/linux/vmclock-abi.h | 20 +
| 1 +
| 1 +
| 5 +-
| 5 +
| 1 +
| 2 +
| 1 +
| 1 +
| 1 +
| 1 +
| 1 +
| 11 +-
| 37 ++
| 1 +
| 1 +
| 446 ------------------
| 1 +
| 21 +-
| 1 +
| 1 +
| 1 +
| 18 +
| 48 ++
| 46 +-
| 4 +-
| 2 +-
| 4 +
| 85 +++-
| 30 +-
38 files changed, 711 insertions(+), 493 deletions(-)
create mode 100644 include/standard-headers/linux/typelimits.h
create mode 100644 include/standard-headers/linux/virtio_rtc.h
delete mode 100644 linux-headers/asm-s390/unistd_32.h
diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index b39e197cc7..4bad457cc2 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -400,8 +400,8 @@ extern "C" {
* implementation can multiply the values by 2^6=64. For that reason the padding
* must only contain zeros.
* index 0 = Y plane, [15:0] z:Y [6:10] little endian
- * index 1 = Cr plane, [15:0] z:Cr [6:10] little endian
- * index 2 = Cb plane, [15:0] z:Cb [6:10] little endian
+ * index 1 = Cb plane, [15:0] z:Cb [6:10] little endian
+ * index 2 = Cr plane, [15:0] z:Cr [6:10] little endian
*/
#define DRM_FORMAT_S010 fourcc_code('S', '0', '1', '0') /* 2x2 subsampled Cb (1) and Cr (2) planes 10 bits per channel */
#define DRM_FORMAT_S210 fourcc_code('S', '2', '1', '0') /* 2x1 subsampled Cb (1) and Cr (2) planes 10 bits per channel */
@@ -413,8 +413,8 @@ extern "C" {
* implementation can multiply the values by 2^4=16. For that reason the padding
* must only contain zeros.
* index 0 = Y plane, [15:0] z:Y [4:12] little endian
- * index 1 = Cr plane, [15:0] z:Cr [4:12] little endian
- * index 2 = Cb plane, [15:0] z:Cb [4:12] little endian
+ * index 1 = Cb plane, [15:0] z:Cb [4:12] little endian
+ * index 2 = Cr plane, [15:0] z:Cr [4:12] little endian
*/
#define DRM_FORMAT_S012 fourcc_code('S', '0', '1', '2') /* 2x2 subsampled Cb (1) and Cr (2) planes 12 bits per channel */
#define DRM_FORMAT_S212 fourcc_code('S', '2', '1', '2') /* 2x1 subsampled Cb (1) and Cr (2) planes 12 bits per channel */
@@ -423,8 +423,8 @@ extern "C" {
/*
* 3 plane YCbCr
* index 0 = Y plane, [15:0] Y little endian
- * index 1 = Cr plane, [15:0] Cr little endian
- * index 2 = Cb plane, [15:0] Cb little endian
+ * index 1 = Cb plane, [15:0] Cb little endian
+ * index 2 = Cr plane, [15:0] Cr little endian
*/
#define DRM_FORMAT_S016 fourcc_code('S', '0', '1', '6') /* 2x2 subsampled Cb (1) and Cr (2) planes 16 bits per channel */
#define DRM_FORMAT_S216 fourcc_code('S', '2', '1', '6') /* 2x1 subsampled Cb (1) and Cr (2) planes 16 bits per channel */
@@ -1421,6 +1421,22 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t modifier)
#define DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED \
DRM_FORMAT_MOD_ARM_CODE(DRM_FORMAT_MOD_ARM_TYPE_MISC, 1ULL)
+/*
+ * ARM 64k interleaved modifier
+ *
+ * This is used by ARM Mali v10+ GPUs. With this modifier, the plane is divided
+ * into 64k byte 1:1 or 2:1 -sided tiles. The 64k tiles are laid out linearly.
+ * Each 64k tile is divided into blocks of 16x16 texel blocks, which are
+ * themselves laid out linearly within a 64k tile. Then within each 16x16
+ * block, texel blocks are laid out according to U order, similar to
+ * 16X16_BLOCK_U_INTERLEAVED.
+ *
+ * Note that unlike 16X16_BLOCK_U_INTERLEAVED, the layout does not change
+ * depending on whether a format is compressed or not.
+ */
+#define DRM_FORMAT_MOD_ARM_INTERLEAVED_64K \
+ DRM_FORMAT_MOD_ARM_CODE(DRM_FORMAT_MOD_ARM_TYPE_MISC, 2ULL)
+
/*
* Allwinner tiled modifier
*
diff --git a/include/standard-headers/linux/const.h b/include/standard-headers/linux/const.h
index 95ede23342..c6a9d0c983 100644
--- a/include/standard-headers/linux/const.h
+++ b/include/standard-headers/linux/const.h
@@ -50,4 +50,22 @@
#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+/*
+ * Divide positive or negative dividend by positive or negative divisor
+ * and round to closest integer. Result is undefined for negative
+ * divisors if the dividend variable type is unsigned and for negative
+ * dividends if the divisor variable type is unsigned.
+ */
+#define __KERNEL_DIV_ROUND_CLOSEST(x, divisor) \
+({ \
+ __typeof__(x) __x = x; \
+ __typeof__(divisor) __d = divisor; \
+ \
+ (((__typeof__(x))-1) > 0 || \
+ ((__typeof__(divisor))-1) > 0 || \
+ (((__x) > 0) == ((__d) > 0))) ? \
+ (((__x) + ((__d) / 2)) / (__d)) : \
+ (((__x) - ((__d) / 2)) / (__d)); \
+})
+
#endif /* _LINUX_CONST_H */
diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h
index d0f7a63f10..5d82126cd7 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -17,11 +17,10 @@
#include "net/eth.h"
#include "standard-headers/linux/const.h"
+#include "standard-headers/linux/typelimits.h"
#include "standard-headers/linux/types.h"
#include "standard-headers/linux/if_ether.h"
-#include <limits.h> /* for INT_MAX */
-
/* All structures exposed to userland should be defined such that they
* have the same layout for 32-bit and 64-bit userland.
*/
@@ -228,7 +227,7 @@ enum tunable_id {
ETHTOOL_ID_UNSPEC,
ETHTOOL_RX_COPYBREAK,
ETHTOOL_TX_COPYBREAK,
- ETHTOOL_PFC_PREVENTION_TOUT, /* timeout in msecs */
+ ETHTOOL_PFC_PREVENTION_TOUT, /* both pause and pfc, see man ethtool */
ETHTOOL_TX_COPYBREAK_BUF_SIZE,
/*
* Add your fresh new tunable attribute above and remember to update
@@ -603,6 +602,8 @@ enum ethtool_link_ext_state {
ETHTOOL_LINK_EXT_STATE_POWER_BUDGET_EXCEEDED,
ETHTOOL_LINK_EXT_STATE_OVERHEAT,
ETHTOOL_LINK_EXT_STATE_MODULE,
+ ETHTOOL_LINK_EXT_STATE_OTP_SPEED_VIOLATION,
+ ETHTOOL_LINK_EXT_STATE_BMC_REQUEST_DOWN,
};
/* More information in addition to ETHTOOL_LINK_EXT_STATE_AUTONEG. */
@@ -1094,13 +1095,20 @@ enum ethtool_module_fw_flash_status {
* struct ethtool_gstrings - string set for data tagging
* @cmd: Command number = %ETHTOOL_GSTRINGS
* @string_set: String set ID; one of &enum ethtool_stringset
- * @len: On return, the number of strings in the string set
+ * @len: Number of strings in the string set
* @data: Buffer for strings. Each string is null-padded to a size of
* %ETH_GSTRING_LEN.
*
* Users must use %ETHTOOL_GSSET_INFO to find the number of strings in
* the string set. They must allocate a buffer of the appropriate
* size immediately following this structure.
+ *
+ * Setting @len on input is optional (though preferred), but must be zeroed
+ * otherwise.
+ * When set, @len will return the requested count if it matches the actual
+ * count; otherwise, it will be zero.
+ * This prevents issues when the number of strings is different than the
+ * userspace allocation.
*/
struct ethtool_gstrings {
uint32_t cmd;
@@ -1177,13 +1185,20 @@ struct ethtool_test {
/**
* struct ethtool_stats - device-specific statistics
* @cmd: Command number = %ETHTOOL_GSTATS
- * @n_stats: On return, the number of statistics
+ * @n_stats: Number of statistics
* @data: Array of statistics
*
* Users must use %ETHTOOL_GSSET_INFO or %ETHTOOL_GDRVINFO to find the
* number of statistics that will be returned. They must allocate a
* buffer of the appropriate size (8 * number of statistics)
* immediately following this structure.
+ *
+ * Setting @n_stats on input is optional (though preferred), but must be zeroed
+ * otherwise.
+ * When set, @n_stats will return the requested count if it matches the actual
+ * count; otherwise, it will be zero.
+ * This prevents issues when the number of stats is different than the
+ * userspace allocation.
*/
struct ethtool_stats {
uint32_t cmd;
@@ -2190,6 +2205,7 @@ enum ethtool_link_mode_bit_indices {
#define SPEED_40000 40000
#define SPEED_50000 50000
#define SPEED_56000 56000
+#define SPEED_80000 80000
#define SPEED_100000 100000
#define SPEED_200000 200000
#define SPEED_400000 400000
@@ -2200,7 +2216,7 @@ enum ethtool_link_mode_bit_indices {
static inline int ethtool_validate_speed(uint32_t speed)
{
- return speed <= INT_MAX || speed == (uint32_t)SPEED_UNKNOWN;
+ return speed <= __KERNEL_INT_MAX || speed == (uint32_t)SPEED_UNKNOWN;
}
/* Duplex, half or full. */
diff --git a/include/standard-headers/linux/input-event-codes.h b/include/standard-headers/linux/input-event-codes.h
index ede79c6ae4..dd7c986106 100644
--- a/include/standard-headers/linux/input-event-codes.h
+++ b/include/standard-headers/linux/input-event-codes.h
@@ -643,6 +643,10 @@
#define KEY_EPRIVACY_SCREEN_ON 0x252
#define KEY_EPRIVACY_SCREEN_OFF 0x253
+#define KEY_ACTION_ON_SELECTION 0x254 /* AL Action on Selection (HUTRR119) */
+#define KEY_CONTEXTUAL_INSERT 0x255 /* AL Contextual Insertion (HUTRR119) */
+#define KEY_CONTEXTUAL_QUERY 0x256 /* AL Contextual Query (HUTRR119) */
+
#define KEY_KBDINPUTASSIST_PREV 0x260
#define KEY_KBDINPUTASSIST_NEXT 0x261
#define KEY_KBDINPUTASSIST_PREVGROUP 0x262
@@ -891,6 +895,7 @@
#define ABS_VOLUME 0x20
#define ABS_PROFILE 0x21
+#define ABS_SND_PROFILE 0x22
#define ABS_MISC 0x28
@@ -1000,4 +1005,12 @@
#define SND_MAX 0x07
#define SND_CNT (SND_MAX+1)
+/*
+ * ABS_SND_PROFILE values
+ */
+
+#define SND_PROFILE_SILENT 0x00
+#define SND_PROFILE_VIBRATE 0x01
+#define SND_PROFILE_RING 0x02
+
#endif
diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
index 3add74ae25..14f634ab93 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -132,6 +132,11 @@
#define PCI_SECONDARY_BUS 0x19 /* Secondary bus number */
#define PCI_SUBORDINATE_BUS 0x1a /* Highest bus number behind the bridge */
#define PCI_SEC_LATENCY_TIMER 0x1b /* Latency timer for secondary interface */
+/* Masks for dword-sized processing of Bus Number and Sec Latency Timer fields */
+#define PCI_PRIMARY_BUS_MASK 0x000000ff
+#define PCI_SECONDARY_BUS_MASK 0x0000ff00
+#define PCI_SUBORDINATE_BUS_MASK 0x00ff0000
+#define PCI_SEC_LATENCY_TIMER_MASK 0xff000000
#define PCI_IO_BASE 0x1c /* I/O range behind the bridge */
#define PCI_IO_LIMIT 0x1d
#define PCI_IO_RANGE_TYPE_MASK 0x0fUL /* I/O bridging type */
@@ -707,7 +712,7 @@
#define PCI_EXP_LNKCTL2_HASD 0x0020 /* HW Autonomous Speed Disable */
#define PCI_EXP_LNKSTA2 0x32 /* Link Status 2 */
#define PCI_EXP_LNKSTA2_FLIT 0x0400 /* Flit Mode Status */
-#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 0x32 /* end of v2 EPs w/ link */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 0x34 /* end of v2 EPs w/ link */
#define PCI_EXP_SLTCAP2 0x34 /* Slot Capabilities 2 */
#define PCI_EXP_SLTCAP2_IBPD 0x00000001 /* In-band PD Disable Supported */
#define PCI_EXP_SLTCTL2 0x38 /* Slot Control 2 */
@@ -1253,11 +1258,6 @@
#define PCI_DEV3_STA 0x0c /* Device 3 Status Register */
#define PCI_DEV3_STA_SEGMENT 0x8 /* Segment Captured (end-to-end flit-mode detected) */
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
-#define PCI_DVSEC_CXL_PORT 3
-#define PCI_DVSEC_CXL_PORT_CTL 0x0c
-#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR 0x00000001
-
/* Integrity and Data Encryption Extended Capability */
#define PCI_IDE_CAP 0x04
#define PCI_IDE_CAP_LINK 0x1 /* Link IDE Stream Supported */
@@ -1338,4 +1338,63 @@
#define PCI_IDE_SEL_ADDR_3(x) (28 + (x) * PCI_IDE_SEL_ADDR_BLOCK_SIZE)
#define PCI_IDE_SEL_BLOCK_SIZE(nr_assoc) (20 + PCI_IDE_SEL_ADDR_BLOCK_SIZE * (nr_assoc))
+/*
+ * Compute Express Link (CXL r4.0, sec 8.1)
+ *
+ * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
+ * is "disconnected" (CXL r4.0, sec 9.12.3). Re-enumerate these
+ * registers on downstream link-up events.
+ */
+
+/* CXL r4.0, 8.1.3: PCIe DVSEC for CXL Device */
+#define PCI_DVSEC_CXL_DEVICE 0
+#define PCI_DVSEC_CXL_CAP 0xA
+#define PCI_DVSEC_CXL_MEM_CAPABLE _BITUL(2)
+#define PCI_DVSEC_CXL_HDM_COUNT __GENMASK(5, 4)
+#define PCI_DVSEC_CXL_CTRL 0xC
+#define PCI_DVSEC_CXL_MEM_ENABLE _BITUL(2)
+#define PCI_DVSEC_CXL_RANGE_SIZE_HIGH(i) (0x18 + (i * 0x10))
+#define PCI_DVSEC_CXL_RANGE_SIZE_LOW(i) (0x1C + (i * 0x10))
+#define PCI_DVSEC_CXL_MEM_INFO_VALID _BITUL(0)
+#define PCI_DVSEC_CXL_MEM_ACTIVE _BITUL(1)
+#define PCI_DVSEC_CXL_MEM_SIZE_LOW __GENMASK(31, 28)
+#define PCI_DVSEC_CXL_RANGE_BASE_HIGH(i) (0x20 + (i * 0x10))
+#define PCI_DVSEC_CXL_RANGE_BASE_LOW(i) (0x24 + (i * 0x10))
+#define PCI_DVSEC_CXL_MEM_BASE_LOW __GENMASK(31, 28)
+
+#define CXL_DVSEC_RANGE_MAX 2
+
+/* CXL r4.0, 8.1.4: Non-CXL Function Map DVSEC */
+#define PCI_DVSEC_CXL_FUNCTION_MAP 2
+
+/* CXL r4.0, 8.1.5: Extensions DVSEC for Ports */
+#define PCI_DVSEC_CXL_PORT 3
+#define PCI_DVSEC_CXL_PORT_CTL 0x0c
+#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR 0x00000001
+
+/* CXL r4.0, 8.1.6: GPF DVSEC for CXL Port */
+#define PCI_DVSEC_CXL_PORT_GPF 4
+#define PCI_DVSEC_CXL_PORT_GPF_PHASE_1_CONTROL 0x0C
+#define PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_BASE __GENMASK(3, 0)
+#define PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_SCALE __GENMASK(11, 8)
+#define PCI_DVSEC_CXL_PORT_GPF_PHASE_2_CONTROL 0xE
+#define PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_BASE __GENMASK(3, 0)
+#define PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_SCALE __GENMASK(11, 8)
+
+/* CXL r4.0, 8.1.7: GPF DVSEC for CXL Device */
+#define PCI_DVSEC_CXL_DEVICE_GPF 5
+
+/* CXL r4.0, 8.1.8: Flex Bus DVSEC */
+#define PCI_DVSEC_CXL_FLEXBUS_PORT 7
+#define PCI_DVSEC_CXL_FLEXBUS_PORT_STATUS 0xE
+#define PCI_DVSEC_CXL_FLEXBUS_PORT_STATUS_CACHE _BITUL(0)
+#define PCI_DVSEC_CXL_FLEXBUS_PORT_STATUS_MEM _BITUL(2)
+
+/* CXL r4.0, 8.1.9: Register Locator DVSEC */
+#define PCI_DVSEC_CXL_REG_LOCATOR 8
+#define PCI_DVSEC_CXL_REG_LOCATOR_BLOCK1 0xC
+#define PCI_DVSEC_CXL_REG_LOCATOR_BIR __GENMASK(2, 0)
+#define PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_ID __GENMASK(15, 8)
+#define PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_OFF_LOW __GENMASK(31, 16)
+
#endif /* LINUX_PCI_REGS_H */
diff --git a/include/standard-headers/linux/typelimits.h b/include/standard-headers/linux/typelimits.h
new file mode 100644
index 0000000000..1304520082
--- /dev/null
+++ b/include/standard-headers/linux/typelimits.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_TYPELIMITS_H
+#define _LINUX_TYPELIMITS_H
+
+#define __KERNEL_INT_MAX ((int)(~0U >> 1))
+#define __KERNEL_INT_MIN (-__KERNEL_INT_MAX - 1)
+
+#endif /* _LINUX_TYPELIMITS_H */
diff --git a/include/standard-headers/linux/virtio_ring.h b/include/standard-headers/linux/virtio_ring.h
index 22f6eb8ca7..7baf1968a3 100644
--- a/include/standard-headers/linux/virtio_ring.h
+++ b/include/standard-headers/linux/virtio_ring.h
@@ -31,7 +31,6 @@
* SUCH DAMAGE.
*
* Copyright Rusty Russell IBM Corporation 2007. */
-#include <stdint.h>
#include "standard-headers/linux/types.h"
#include "standard-headers/linux/virtio_types.h"
@@ -200,7 +199,7 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p,
vr->num = num;
vr->desc = p;
vr->avail = (struct vring_avail *)((char *)p + num * sizeof(struct vring_desc));
- vr->used = (void *)(((uintptr_t)&vr->avail->ring[num] + sizeof(__virtio16)
+ vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__virtio16)
+ align-1) & ~(align - 1));
}
diff --git a/include/standard-headers/linux/virtio_rtc.h b/include/standard-headers/linux/virtio_rtc.h
new file mode 100644
index 0000000000..7e2c21ebff
--- /dev/null
+++ b/include/standard-headers/linux/virtio_rtc.h
@@ -0,0 +1,237 @@
+/* SPDX-License-Identifier: ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) */
+/*
+ * Copyright (C) 2022-2024 OpenSynergy GmbH
+ * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#ifndef _LINUX_VIRTIO_RTC_H
+#define _LINUX_VIRTIO_RTC_H
+
+#include "standard-headers/linux/types.h"
+
+/* alarm feature */
+#define VIRTIO_RTC_F_ALARM 0
+
+/* read request message types */
+
+#define VIRTIO_RTC_REQ_READ 0x0001
+#define VIRTIO_RTC_REQ_READ_CROSS 0x0002
+
+/* control request message types */
+
+#define VIRTIO_RTC_REQ_CFG 0x1000
+#define VIRTIO_RTC_REQ_CLOCK_CAP 0x1001
+#define VIRTIO_RTC_REQ_CROSS_CAP 0x1002
+#define VIRTIO_RTC_REQ_READ_ALARM 0x1003
+#define VIRTIO_RTC_REQ_SET_ALARM 0x1004
+#define VIRTIO_RTC_REQ_SET_ALARM_ENABLED 0x1005
+
+/* alarmq message types */
+
+#define VIRTIO_RTC_NOTIF_ALARM 0x2000
+
+/* Message headers */
+
+/** common request header */
+struct virtio_rtc_req_head {
+ uint16_t msg_type;
+ uint8_t reserved[6];
+};
+
+/** common response header */
+struct virtio_rtc_resp_head {
+#define VIRTIO_RTC_S_OK 0
+#define VIRTIO_RTC_S_EOPNOTSUPP 2
+#define VIRTIO_RTC_S_ENODEV 3
+#define VIRTIO_RTC_S_EINVAL 4
+#define VIRTIO_RTC_S_EIO 5
+ uint8_t status;
+ uint8_t reserved[7];
+};
+
+/** common notification header */
+struct virtio_rtc_notif_head {
+ uint16_t msg_type;
+ uint8_t reserved[6];
+};
+
+/* read requests */
+
+/* VIRTIO_RTC_REQ_READ message */
+
+struct virtio_rtc_req_read {
+ struct virtio_rtc_req_head head;
+ uint16_t clock_id;
+ uint8_t reserved[6];
+};
+
+struct virtio_rtc_resp_read {
+ struct virtio_rtc_resp_head head;
+ uint64_t clock_reading;
+};
+
+/* VIRTIO_RTC_REQ_READ_CROSS message */
+
+struct virtio_rtc_req_read_cross {
+ struct virtio_rtc_req_head head;
+ uint16_t clock_id;
+/* Arm Generic Timer Counter-timer Virtual Count Register (CNTVCT_EL0) */
+#define VIRTIO_RTC_COUNTER_ARM_VCT 0
+/* x86 Time-Stamp Counter */
+#define VIRTIO_RTC_COUNTER_X86_TSC 1
+/* Invalid */
+#define VIRTIO_RTC_COUNTER_INVALID 0xFF
+ uint8_t hw_counter;
+ uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_read_cross {
+ struct virtio_rtc_resp_head head;
+ uint64_t clock_reading;
+ uint64_t counter_cycles;
+};
+
+/* control requests */
+
+/* VIRTIO_RTC_REQ_CFG message */
+
+struct virtio_rtc_req_cfg {
+ struct virtio_rtc_req_head head;
+ /* no request params */
+};
+
+struct virtio_rtc_resp_cfg {
+ struct virtio_rtc_resp_head head;
+ /** # of clocks -> clock ids < num_clocks are valid */
+ uint16_t num_clocks;
+ uint8_t reserved[6];
+};
+
+/* VIRTIO_RTC_REQ_CLOCK_CAP message */
+
+struct virtio_rtc_req_clock_cap {
+ struct virtio_rtc_req_head head;
+ uint16_t clock_id;
+ uint8_t reserved[6];
+};
+
+struct virtio_rtc_resp_clock_cap {
+ struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONOTONIC 2
+#define VIRTIO_RTC_CLOCK_UTC_SMEARED 3
+#define VIRTIO_RTC_CLOCK_UTC_MAYBE_SMEARED 4
+ uint8_t type;
+#define VIRTIO_RTC_SMEAR_UNSPECIFIED 0
+#define VIRTIO_RTC_SMEAR_NOON_LINEAR 1
+#define VIRTIO_RTC_SMEAR_UTC_SLS 2
+ uint8_t leap_second_smearing;
+#define VIRTIO_RTC_FLAG_ALARM_CAP (1 << 0)
+ uint8_t flags;
+ uint8_t reserved[5];
+};
+
+/* VIRTIO_RTC_REQ_CROSS_CAP message */
+
+struct virtio_rtc_req_cross_cap {
+ struct virtio_rtc_req_head head;
+ uint16_t clock_id;
+ uint8_t hw_counter;
+ uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_cross_cap {
+ struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_FLAG_CROSS_CAP (1 << 0)
+ uint8_t flags;
+ uint8_t reserved[7];
+};
+
+/* VIRTIO_RTC_REQ_READ_ALARM message */
+
+struct virtio_rtc_req_read_alarm {
+ struct virtio_rtc_req_head head;
+ uint16_t clock_id;
+ uint8_t reserved[6];
+};
+
+struct virtio_rtc_resp_read_alarm {
+ struct virtio_rtc_resp_head head;
+ uint64_t alarm_time;
+#define VIRTIO_RTC_FLAG_ALARM_ENABLED (1 << 0)
+ uint8_t flags;
+ uint8_t reserved[7];
+};
+
+/* VIRTIO_RTC_REQ_SET_ALARM message */
+
+struct virtio_rtc_req_set_alarm {
+ struct virtio_rtc_req_head head;
+ uint64_t alarm_time;
+ uint16_t clock_id;
+ /* flag VIRTIO_RTC_FLAG_ALARM_ENABLED */
+ uint8_t flags;
+ uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_set_alarm {
+ struct virtio_rtc_resp_head head;
+ /* no response params */
+};
+
+/* VIRTIO_RTC_REQ_SET_ALARM_ENABLED message */
+
+struct virtio_rtc_req_set_alarm_enabled {
+ struct virtio_rtc_req_head head;
+ uint16_t clock_id;
+ /* flag VIRTIO_RTC_ALARM_ENABLED */
+ uint8_t flags;
+ uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_set_alarm_enabled {
+ struct virtio_rtc_resp_head head;
+ /* no response params */
+};
+
+/** Union of request types for requestq */
+union virtio_rtc_req_requestq {
+ struct virtio_rtc_req_read read;
+ struct virtio_rtc_req_read_cross read_cross;
+ struct virtio_rtc_req_cfg cfg;
+ struct virtio_rtc_req_clock_cap clock_cap;
+ struct virtio_rtc_req_cross_cap cross_cap;
+ struct virtio_rtc_req_read_alarm read_alarm;
+ struct virtio_rtc_req_set_alarm set_alarm;
+ struct virtio_rtc_req_set_alarm_enabled set_alarm_enabled;
+};
+
+/** Union of response types for requestq */
+union virtio_rtc_resp_requestq {
+ struct virtio_rtc_resp_read read;
+ struct virtio_rtc_resp_read_cross read_cross;
+ struct virtio_rtc_resp_cfg cfg;
+ struct virtio_rtc_resp_clock_cap clock_cap;
+ struct virtio_rtc_resp_cross_cap cross_cap;
+ struct virtio_rtc_resp_read_alarm read_alarm;
+ struct virtio_rtc_resp_set_alarm set_alarm;
+ struct virtio_rtc_resp_set_alarm_enabled set_alarm_enabled;
+};
+
+/* alarmq notifications */
+
+/* VIRTIO_RTC_NOTIF_ALARM notification */
+
+struct virtio_rtc_notif_alarm {
+ struct virtio_rtc_notif_head head;
+ uint16_t clock_id;
+ uint8_t reserved[6];
+};
+
+/** Union of notification types for alarmq */
+union virtio_rtc_notif_alarmq {
+ struct virtio_rtc_notif_alarm alarm;
+};
+
+#endif /* _LINUX_VIRTIO_RTC_H */
diff --git a/include/standard-headers/linux/vmclock-abi.h b/include/standard-headers/linux/vmclock-abi.h
index 15b0316cb4..fe824badc0 100644
--- a/include/standard-headers/linux/vmclock-abi.h
+++ b/include/standard-headers/linux/vmclock-abi.h
@@ -115,6 +115,17 @@ struct vmclock_abi {
* bit again after the update, using the about-to-be-valid fields.
*/
#define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7)
+ /*
+ * If the VM_GEN_COUNTER_PRESENT flag is set, the hypervisor will
+ * bump the vm_generation_counter field every time the guest is
+ * loaded from some save state (restored from a snapshot).
+ */
+#define VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT (1 << 8)
+ /*
+ * If the NOTIFICATION_PRESENT flag is set, the hypervisor will send
+ * a notification every time it updates seq_count to a new even number.
+ */
+#define VMCLOCK_FLAG_NOTIFICATION_PRESENT (1 << 9)
uint8_t pad[2];
uint8_t clock_status;
@@ -177,6 +188,15 @@ struct vmclock_abi {
uint64_t time_frac_sec; /* Units of 1/2^64 of a second */
uint64_t time_esterror_nanosec;
uint64_t time_maxerror_nanosec;
+
+ /*
+ * This field changes to another non-repeating value when the guest
+ * has been loaded from a snapshot. In addition to handling a
+ * disruption in time (which will also be signalled through the
+ * disruption_marker field), a guest may wish to discard UUIDs,
+ * reset network connections, reseed entropy, etc.
+ */
+ uint64_t vm_generation_counter;
};
#endif /* __VMCLOCK_ABI_H__ */
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index 46ffbddab5..6aefe79738 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -416,6 +416,7 @@ enum {
#define KVM_DEV_ARM_ITS_RESTORE_TABLES 2
#define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
#define KVM_DEV_ARM_ITS_CTRL_RESET 4
+#define KVM_DEV_ARM_VGIC_USERSPACE_PPIS 5
/* Device Control API on vcpu fd */
#define KVM_ARM_VCPU_PMU_V3_CTRL 0
diff --git a/linux-headers/asm-arm64/unistd_64.h b/linux-headers/asm-arm64/unistd_64.h
index 1ef9c40813..70b3754a42 100644
--- a/linux-headers/asm-arm64/unistd_64.h
+++ b/linux-headers/asm-arm64/unistd_64.h
@@ -327,6 +327,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-generic/unistd.h b/linux-headers/asm-generic/unistd.h
index 942370b3f5..a627acc8fb 100644
--- a/linux-headers/asm-generic/unistd.h
+++ b/linux-headers/asm-generic/unistd.h
@@ -860,8 +860,11 @@ __SYSCALL(__NR_file_setattr, sys_file_setattr)
#define __NR_listns 470
__SYSCALL(__NR_listns, sys_listns)
+#define __NR_rseq_slice_yield 471
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
#undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472
/*
* 32 bit systems traditionally used different
diff --git a/linux-headers/asm-loongarch/kvm.h b/linux-headers/asm-loongarch/kvm.h
index de6c3f18e4..cd0b5c11ca 100644
--- a/linux-headers/asm-loongarch/kvm.h
+++ b/linux-headers/asm-loongarch/kvm.h
@@ -105,6 +105,7 @@ struct kvm_fpu {
#define KVM_LOONGARCH_VM_FEAT_PV_STEALTIME 7
#define KVM_LOONGARCH_VM_FEAT_PTW 8
#define KVM_LOONGARCH_VM_FEAT_MSGINT 9
+#define KVM_LOONGARCH_VM_FEAT_PV_PREEMPT 10
/* Device Control API on vcpu fd */
#define KVM_LOONGARCH_VCPU_CPUCFG 0
@@ -154,4 +155,8 @@ struct kvm_iocsr_entry {
#define KVM_DEV_LOONGARCH_PCH_PIC_GRP_CTRL 0x40000006
#define KVM_DEV_LOONGARCH_PCH_PIC_CTRL_INIT 0
+#define KVM_DEV_LOONGARCH_DMSINTC_GRP_CTRL 0x40000007
+#define KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_BASE 0x0
+#define KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_SIZE 0x1
+
#endif /* __UAPI_ASM_LOONGARCH_KVM_H */
diff --git a/linux-headers/asm-loongarch/kvm_para.h b/linux-headers/asm-loongarch/kvm_para.h
index fd7f40713d..3fd87a096b 100644
--- a/linux-headers/asm-loongarch/kvm_para.h
+++ b/linux-headers/asm-loongarch/kvm_para.h
@@ -15,6 +15,7 @@
#define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4)
#define KVM_FEATURE_IPI 1
#define KVM_FEATURE_STEAL_TIME 2
+#define KVM_FEATURE_PREEMPT 3
/* BIT 24 - 31 are features configurable by user space vmm */
#define KVM_FEATURE_VIRT_EXTIOI 24
#define KVM_FEATURE_USER_HCALL 25
diff --git a/linux-headers/asm-loongarch/unistd_64.h b/linux-headers/asm-loongarch/unistd_64.h
index aa5daac4ef..3a29d86e1d 100644
--- a/linux-headers/asm-loongarch/unistd_64.h
+++ b/linux-headers/asm-loongarch/unistd_64.h
@@ -300,6 +300,7 @@
#define __NR_landlock_create_ruleset 444
#define __NR_landlock_add_rule 445
#define __NR_landlock_restrict_self 446
+#define __NR_memfd_secret 447
#define __NR_process_mrelease 448
#define __NR_futex_waitv 449
#define __NR_set_mempolicy_home_node 450
@@ -323,6 +324,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-mips/unistd_n32.h b/linux-headers/asm-mips/unistd_n32.h
index a33d106dca..5fa1ee0cb4 100644
--- a/linux-headers/asm-mips/unistd_n32.h
+++ b/linux-headers/asm-mips/unistd_n32.h
@@ -399,5 +399,6 @@
#define __NR_file_getattr (__NR_Linux + 468)
#define __NR_file_setattr (__NR_Linux + 469)
#define __NR_listns (__NR_Linux + 470)
+#define __NR_rseq_slice_yield (__NR_Linux + 471)
#endif /* _ASM_UNISTD_N32_H */
diff --git a/linux-headers/asm-mips/unistd_n64.h b/linux-headers/asm-mips/unistd_n64.h
index 1bc251e450..e1f873d83a 100644
--- a/linux-headers/asm-mips/unistd_n64.h
+++ b/linux-headers/asm-mips/unistd_n64.h
@@ -375,5 +375,6 @@
#define __NR_file_getattr (__NR_Linux + 468)
#define __NR_file_setattr (__NR_Linux + 469)
#define __NR_listns (__NR_Linux + 470)
+#define __NR_rseq_slice_yield (__NR_Linux + 471)
#endif /* _ASM_UNISTD_N64_H */
diff --git a/linux-headers/asm-mips/unistd_o32.h b/linux-headers/asm-mips/unistd_o32.h
index c57175d496..8207e9ca4f 100644
--- a/linux-headers/asm-mips/unistd_o32.h
+++ b/linux-headers/asm-mips/unistd_o32.h
@@ -445,5 +445,6 @@
#define __NR_file_getattr (__NR_Linux + 468)
#define __NR_file_setattr (__NR_Linux + 469)
#define __NR_listns (__NR_Linux + 470)
+#define __NR_rseq_slice_yield (__NR_Linux + 471)
#endif /* _ASM_UNISTD_O32_H */
diff --git a/linux-headers/asm-powerpc/unistd_32.h b/linux-headers/asm-powerpc/unistd_32.h
index a3f4aa2fe2..1f63360120 100644
--- a/linux-headers/asm-powerpc/unistd_32.h
+++ b/linux-headers/asm-powerpc/unistd_32.h
@@ -452,6 +452,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-powerpc/unistd_64.h b/linux-headers/asm-powerpc/unistd_64.h
index d4444557f1..87439c53c1 100644
--- a/linux-headers/asm-powerpc/unistd_64.h
+++ b/linux-headers/asm-powerpc/unistd_64.h
@@ -424,6 +424,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-riscv/kvm.h b/linux-headers/asm-riscv/kvm.h
index 54f3ad7ed2..504e733053 100644
--- a/linux-headers/asm-riscv/kvm.h
+++ b/linux-headers/asm-riscv/kvm.h
@@ -110,6 +110,10 @@ struct kvm_riscv_timer {
__u64 state;
};
+/* Possible states for kvm_riscv_timer */
+#define KVM_RISCV_TIMER_STATE_OFF 0
+#define KVM_RISCV_TIMER_STATE_ON 1
+
/*
* ISA extension IDs specific to KVM. This is not the same as the host ISA
* extension IDs as that is internal to the host and should not be exposed
@@ -192,6 +196,9 @@ enum KVM_RISCV_ISA_EXT_ID {
KVM_RISCV_ISA_EXT_ZFBFMIN,
KVM_RISCV_ISA_EXT_ZVFBFMIN,
KVM_RISCV_ISA_EXT_ZVFBFWMA,
+ KVM_RISCV_ISA_EXT_ZCLSD,
+ KVM_RISCV_ISA_EXT_ZILSD,
+ KVM_RISCV_ISA_EXT_ZALASR,
KVM_RISCV_ISA_EXT_MAX,
};
@@ -235,10 +242,6 @@ struct kvm_riscv_sbi_fwft {
struct kvm_riscv_sbi_fwft_feature pointer_masking;
};
-/* Possible states for kvm_riscv_timer */
-#define KVM_RISCV_TIMER_STATE_OFF 0
-#define KVM_RISCV_TIMER_STATE_ON 1
-
/* If you need to interpret the index values, here is the key: */
#define KVM_REG_RISCV_TYPE_MASK 0x00000000FF000000
#define KVM_REG_RISCV_TYPE_SHIFT 24
diff --git a/linux-headers/asm-riscv/ptrace.h b/linux-headers/asm-riscv/ptrace.h
index a3f8211ede..cf87642994 100644
--- a/linux-headers/asm-riscv/ptrace.h
+++ b/linux-headers/asm-riscv/ptrace.h
@@ -9,6 +9,7 @@
#ifndef __ASSEMBLER__
#include <linux/types.h>
+#include <linux/const.h>
#define PTRACE_GETFDPIC 33
@@ -127,6 +128,42 @@ struct __riscv_v_regset_state {
*/
#define RISCV_MAX_VLENB (8192)
+struct __sc_riscv_cfi_state {
+ unsigned long ss_ptr; /* shadow stack pointer */
+};
+
+#define PTRACE_CFI_BRANCH_LANDING_PAD_EN_BIT 0
+#define PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_BIT 1
+#define PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_BIT 2
+#define PTRACE_CFI_SHADOW_STACK_EN_BIT 3
+#define PTRACE_CFI_SHADOW_STACK_LOCK_BIT 4
+#define PTRACE_CFI_SHADOW_STACK_PTR_BIT 5
+
+#define PTRACE_CFI_BRANCH_LANDING_PAD_EN_STATE _BITUL(PTRACE_CFI_BRANCH_LANDING_PAD_EN_BIT)
+#define PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_STATE \
+ _BITUL(PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_BIT)
+#define PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_STATE \
+ _BITUL(PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_BIT)
+#define PTRACE_CFI_SHADOW_STACK_EN_STATE _BITUL(PTRACE_CFI_SHADOW_STACK_EN_BIT)
+#define PTRACE_CFI_SHADOW_STACK_LOCK_STATE _BITUL(PTRACE_CFI_SHADOW_STACK_LOCK_BIT)
+#define PTRACE_CFI_SHADOW_STACK_PTR_STATE _BITUL(PTRACE_CFI_SHADOW_STACK_PTR_BIT)
+
+#define PTRACE_CFI_STATE_INVALID_MASK ~(PTRACE_CFI_BRANCH_LANDING_PAD_EN_STATE | \
+ PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_STATE | \
+ PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_STATE | \
+ PTRACE_CFI_SHADOW_STACK_EN_STATE | \
+ PTRACE_CFI_SHADOW_STACK_LOCK_STATE | \
+ PTRACE_CFI_SHADOW_STACK_PTR_STATE)
+
+struct __cfi_status {
+ __u64 cfi_state;
+};
+
+struct user_cfi_state {
+ struct __cfi_status cfi_status;
+ __u64 shstk_ptr;
+};
+
#endif /* __ASSEMBLER__ */
#endif /* _ASM_RISCV_PTRACE_H */
diff --git a/linux-headers/asm-riscv/unistd_32.h b/linux-headers/asm-riscv/unistd_32.h
index 9f33956246..828f3c2b9d 100644
--- a/linux-headers/asm-riscv/unistd_32.h
+++ b/linux-headers/asm-riscv/unistd_32.h
@@ -318,6 +318,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-riscv/unistd_64.h b/linux-headers/asm-riscv/unistd_64.h
index c2e7258916..8fa59835a3 100644
--- a/linux-headers/asm-riscv/unistd_64.h
+++ b/linux-headers/asm-riscv/unistd_64.h
@@ -328,6 +328,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-s390/unistd_32.h b/linux-headers/asm-s390/unistd_32.h
deleted file mode 100644
index 37b8f6f358..0000000000
--- a/linux-headers/asm-s390/unistd_32.h
+++ /dev/null
@@ -1,446 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-#ifndef _ASM_S390_UNISTD_32_H
-#define _ASM_S390_UNISTD_32_H
-
-#define __NR_exit 1
-#define __NR_fork 2
-#define __NR_read 3
-#define __NR_write 4
-#define __NR_open 5
-#define __NR_close 6
-#define __NR_restart_syscall 7
-#define __NR_creat 8
-#define __NR_link 9
-#define __NR_unlink 10
-#define __NR_execve 11
-#define __NR_chdir 12
-#define __NR_time 13
-#define __NR_mknod 14
-#define __NR_chmod 15
-#define __NR_lchown 16
-#define __NR_lseek 19
-#define __NR_getpid 20
-#define __NR_mount 21
-#define __NR_umount 22
-#define __NR_setuid 23
-#define __NR_getuid 24
-#define __NR_stime 25
-#define __NR_ptrace 26
-#define __NR_alarm 27
-#define __NR_pause 29
-#define __NR_utime 30
-#define __NR_access 33
-#define __NR_nice 34
-#define __NR_sync 36
-#define __NR_kill 37
-#define __NR_rename 38
-#define __NR_mkdir 39
-#define __NR_rmdir 40
-#define __NR_dup 41
-#define __NR_pipe 42
-#define __NR_times 43
-#define __NR_brk 45
-#define __NR_setgid 46
-#define __NR_getgid 47
-#define __NR_signal 48
-#define __NR_geteuid 49
-#define __NR_getegid 50
-#define __NR_acct 51
-#define __NR_umount2 52
-#define __NR_ioctl 54
-#define __NR_fcntl 55
-#define __NR_setpgid 57
-#define __NR_umask 60
-#define __NR_chroot 61
-#define __NR_ustat 62
-#define __NR_dup2 63
-#define __NR_getppid 64
-#define __NR_getpgrp 65
-#define __NR_setsid 66
-#define __NR_sigaction 67
-#define __NR_setreuid 70
-#define __NR_setregid 71
-#define __NR_sigsuspend 72
-#define __NR_sigpending 73
-#define __NR_sethostname 74
-#define __NR_setrlimit 75
-#define __NR_getrlimit 76
-#define __NR_getrusage 77
-#define __NR_gettimeofday 78
-#define __NR_settimeofday 79
-#define __NR_getgroups 80
-#define __NR_setgroups 81
-#define __NR_symlink 83
-#define __NR_readlink 85
-#define __NR_uselib 86
-#define __NR_swapon 87
-#define __NR_reboot 88
-#define __NR_readdir 89
-#define __NR_mmap 90
-#define __NR_munmap 91
-#define __NR_truncate 92
-#define __NR_ftruncate 93
-#define __NR_fchmod 94
-#define __NR_fchown 95
-#define __NR_getpriority 96
-#define __NR_setpriority 97
-#define __NR_statfs 99
-#define __NR_fstatfs 100
-#define __NR_ioperm 101
-#define __NR_socketcall 102
-#define __NR_syslog 103
-#define __NR_setitimer 104
-#define __NR_getitimer 105
-#define __NR_stat 106
-#define __NR_lstat 107
-#define __NR_fstat 108
-#define __NR_lookup_dcookie 110
-#define __NR_vhangup 111
-#define __NR_idle 112
-#define __NR_wait4 114
-#define __NR_swapoff 115
-#define __NR_sysinfo 116
-#define __NR_ipc 117
-#define __NR_fsync 118
-#define __NR_sigreturn 119
-#define __NR_clone 120
-#define __NR_setdomainname 121
-#define __NR_uname 122
-#define __NR_adjtimex 124
-#define __NR_mprotect 125
-#define __NR_sigprocmask 126
-#define __NR_create_module 127
-#define __NR_init_module 128
-#define __NR_delete_module 129
-#define __NR_get_kernel_syms 130
-#define __NR_quotactl 131
-#define __NR_getpgid 132
-#define __NR_fchdir 133
-#define __NR_bdflush 134
-#define __NR_sysfs 135
-#define __NR_personality 136
-#define __NR_afs_syscall 137
-#define __NR_setfsuid 138
-#define __NR_setfsgid 139
-#define __NR__llseek 140
-#define __NR_getdents 141
-#define __NR__newselect 142
-#define __NR_flock 143
-#define __NR_msync 144
-#define __NR_readv 145
-#define __NR_writev 146
-#define __NR_getsid 147
-#define __NR_fdatasync 148
-#define __NR__sysctl 149
-#define __NR_mlock 150
-#define __NR_munlock 151
-#define __NR_mlockall 152
-#define __NR_munlockall 153
-#define __NR_sched_setparam 154
-#define __NR_sched_getparam 155
-#define __NR_sched_setscheduler 156
-#define __NR_sched_getscheduler 157
-#define __NR_sched_yield 158
-#define __NR_sched_get_priority_max 159
-#define __NR_sched_get_priority_min 160
-#define __NR_sched_rr_get_interval 161
-#define __NR_nanosleep 162
-#define __NR_mremap 163
-#define __NR_setresuid 164
-#define __NR_getresuid 165
-#define __NR_query_module 167
-#define __NR_poll 168
-#define __NR_nfsservctl 169
-#define __NR_setresgid 170
-#define __NR_getresgid 171
-#define __NR_prctl 172
-#define __NR_rt_sigreturn 173
-#define __NR_rt_sigaction 174
-#define __NR_rt_sigprocmask 175
-#define __NR_rt_sigpending 176
-#define __NR_rt_sigtimedwait 177
-#define __NR_rt_sigqueueinfo 178
-#define __NR_rt_sigsuspend 179
-#define __NR_pread64 180
-#define __NR_pwrite64 181
-#define __NR_chown 182
-#define __NR_getcwd 183
-#define __NR_capget 184
-#define __NR_capset 185
-#define __NR_sigaltstack 186
-#define __NR_sendfile 187
-#define __NR_getpmsg 188
-#define __NR_putpmsg 189
-#define __NR_vfork 190
-#define __NR_ugetrlimit 191
-#define __NR_mmap2 192
-#define __NR_truncate64 193
-#define __NR_ftruncate64 194
-#define __NR_stat64 195
-#define __NR_lstat64 196
-#define __NR_fstat64 197
-#define __NR_lchown32 198
-#define __NR_getuid32 199
-#define __NR_getgid32 200
-#define __NR_geteuid32 201
-#define __NR_getegid32 202
-#define __NR_setreuid32 203
-#define __NR_setregid32 204
-#define __NR_getgroups32 205
-#define __NR_setgroups32 206
-#define __NR_fchown32 207
-#define __NR_setresuid32 208
-#define __NR_getresuid32 209
-#define __NR_setresgid32 210
-#define __NR_getresgid32 211
-#define __NR_chown32 212
-#define __NR_setuid32 213
-#define __NR_setgid32 214
-#define __NR_setfsuid32 215
-#define __NR_setfsgid32 216
-#define __NR_pivot_root 217
-#define __NR_mincore 218
-#define __NR_madvise 219
-#define __NR_getdents64 220
-#define __NR_fcntl64 221
-#define __NR_readahead 222
-#define __NR_sendfile64 223
-#define __NR_setxattr 224
-#define __NR_lsetxattr 225
-#define __NR_fsetxattr 226
-#define __NR_getxattr 227
-#define __NR_lgetxattr 228
-#define __NR_fgetxattr 229
-#define __NR_listxattr 230
-#define __NR_llistxattr 231
-#define __NR_flistxattr 232
-#define __NR_removexattr 233
-#define __NR_lremovexattr 234
-#define __NR_fremovexattr 235
-#define __NR_gettid 236
-#define __NR_tkill 237
-#define __NR_futex 238
-#define __NR_sched_setaffinity 239
-#define __NR_sched_getaffinity 240
-#define __NR_tgkill 241
-#define __NR_io_setup 243
-#define __NR_io_destroy 244
-#define __NR_io_getevents 245
-#define __NR_io_submit 246
-#define __NR_io_cancel 247
-#define __NR_exit_group 248
-#define __NR_epoll_create 249
-#define __NR_epoll_ctl 250
-#define __NR_epoll_wait 251
-#define __NR_set_tid_address 252
-#define __NR_fadvise64 253
-#define __NR_timer_create 254
-#define __NR_timer_settime 255
-#define __NR_timer_gettime 256
-#define __NR_timer_getoverrun 257
-#define __NR_timer_delete 258
-#define __NR_clock_settime 259
-#define __NR_clock_gettime 260
-#define __NR_clock_getres 261
-#define __NR_clock_nanosleep 262
-#define __NR_fadvise64_64 264
-#define __NR_statfs64 265
-#define __NR_fstatfs64 266
-#define __NR_remap_file_pages 267
-#define __NR_mbind 268
-#define __NR_get_mempolicy 269
-#define __NR_set_mempolicy 270
-#define __NR_mq_open 271
-#define __NR_mq_unlink 272
-#define __NR_mq_timedsend 273
-#define __NR_mq_timedreceive 274
-#define __NR_mq_notify 275
-#define __NR_mq_getsetattr 276
-#define __NR_kexec_load 277
-#define __NR_add_key 278
-#define __NR_request_key 279
-#define __NR_keyctl 280
-#define __NR_waitid 281
-#define __NR_ioprio_set 282
-#define __NR_ioprio_get 283
-#define __NR_inotify_init 284
-#define __NR_inotify_add_watch 285
-#define __NR_inotify_rm_watch 286
-#define __NR_migrate_pages 287
-#define __NR_openat 288
-#define __NR_mkdirat 289
-#define __NR_mknodat 290
-#define __NR_fchownat 291
-#define __NR_futimesat 292
-#define __NR_fstatat64 293
-#define __NR_unlinkat 294
-#define __NR_renameat 295
-#define __NR_linkat 296
-#define __NR_symlinkat 297
-#define __NR_readlinkat 298
-#define __NR_fchmodat 299
-#define __NR_faccessat 300
-#define __NR_pselect6 301
-#define __NR_ppoll 302
-#define __NR_unshare 303
-#define __NR_set_robust_list 304
-#define __NR_get_robust_list 305
-#define __NR_splice 306
-#define __NR_sync_file_range 307
-#define __NR_tee 308
-#define __NR_vmsplice 309
-#define __NR_move_pages 310
-#define __NR_getcpu 311
-#define __NR_epoll_pwait 312
-#define __NR_utimes 313
-#define __NR_fallocate 314
-#define __NR_utimensat 315
-#define __NR_signalfd 316
-#define __NR_timerfd 317
-#define __NR_eventfd 318
-#define __NR_timerfd_create 319
-#define __NR_timerfd_settime 320
-#define __NR_timerfd_gettime 321
-#define __NR_signalfd4 322
-#define __NR_eventfd2 323
-#define __NR_inotify_init1 324
-#define __NR_pipe2 325
-#define __NR_dup3 326
-#define __NR_epoll_create1 327
-#define __NR_preadv 328
-#define __NR_pwritev 329
-#define __NR_rt_tgsigqueueinfo 330
-#define __NR_perf_event_open 331
-#define __NR_fanotify_init 332
-#define __NR_fanotify_mark 333
-#define __NR_prlimit64 334
-#define __NR_name_to_handle_at 335
-#define __NR_open_by_handle_at 336
-#define __NR_clock_adjtime 337
-#define __NR_syncfs 338
-#define __NR_setns 339
-#define __NR_process_vm_readv 340
-#define __NR_process_vm_writev 341
-#define __NR_s390_runtime_instr 342
-#define __NR_kcmp 343
-#define __NR_finit_module 344
-#define __NR_sched_setattr 345
-#define __NR_sched_getattr 346
-#define __NR_renameat2 347
-#define __NR_seccomp 348
-#define __NR_getrandom 349
-#define __NR_memfd_create 350
-#define __NR_bpf 351
-#define __NR_s390_pci_mmio_write 352
-#define __NR_s390_pci_mmio_read 353
-#define __NR_execveat 354
-#define __NR_userfaultfd 355
-#define __NR_membarrier 356
-#define __NR_recvmmsg 357
-#define __NR_sendmmsg 358
-#define __NR_socket 359
-#define __NR_socketpair 360
-#define __NR_bind 361
-#define __NR_connect 362
-#define __NR_listen 363
-#define __NR_accept4 364
-#define __NR_getsockopt 365
-#define __NR_setsockopt 366
-#define __NR_getsockname 367
-#define __NR_getpeername 368
-#define __NR_sendto 369
-#define __NR_sendmsg 370
-#define __NR_recvfrom 371
-#define __NR_recvmsg 372
-#define __NR_shutdown 373
-#define __NR_mlock2 374
-#define __NR_copy_file_range 375
-#define __NR_preadv2 376
-#define __NR_pwritev2 377
-#define __NR_s390_guarded_storage 378
-#define __NR_statx 379
-#define __NR_s390_sthyi 380
-#define __NR_kexec_file_load 381
-#define __NR_io_pgetevents 382
-#define __NR_rseq 383
-#define __NR_pkey_mprotect 384
-#define __NR_pkey_alloc 385
-#define __NR_pkey_free 386
-#define __NR_semget 393
-#define __NR_semctl 394
-#define __NR_shmget 395
-#define __NR_shmctl 396
-#define __NR_shmat 397
-#define __NR_shmdt 398
-#define __NR_msgget 399
-#define __NR_msgsnd 400
-#define __NR_msgrcv 401
-#define __NR_msgctl 402
-#define __NR_clock_gettime64 403
-#define __NR_clock_settime64 404
-#define __NR_clock_adjtime64 405
-#define __NR_clock_getres_time64 406
-#define __NR_clock_nanosleep_time64 407
-#define __NR_timer_gettime64 408
-#define __NR_timer_settime64 409
-#define __NR_timerfd_gettime64 410
-#define __NR_timerfd_settime64 411
-#define __NR_utimensat_time64 412
-#define __NR_pselect6_time64 413
-#define __NR_ppoll_time64 414
-#define __NR_io_pgetevents_time64 416
-#define __NR_recvmmsg_time64 417
-#define __NR_mq_timedsend_time64 418
-#define __NR_mq_timedreceive_time64 419
-#define __NR_semtimedop_time64 420
-#define __NR_rt_sigtimedwait_time64 421
-#define __NR_futex_time64 422
-#define __NR_sched_rr_get_interval_time64 423
-#define __NR_pidfd_send_signal 424
-#define __NR_io_uring_setup 425
-#define __NR_io_uring_enter 426
-#define __NR_io_uring_register 427
-#define __NR_open_tree 428
-#define __NR_move_mount 429
-#define __NR_fsopen 430
-#define __NR_fsconfig 431
-#define __NR_fsmount 432
-#define __NR_fspick 433
-#define __NR_pidfd_open 434
-#define __NR_clone3 435
-#define __NR_close_range 436
-#define __NR_openat2 437
-#define __NR_pidfd_getfd 438
-#define __NR_faccessat2 439
-#define __NR_process_madvise 440
-#define __NR_epoll_pwait2 441
-#define __NR_mount_setattr 442
-#define __NR_quotactl_fd 443
-#define __NR_landlock_create_ruleset 444
-#define __NR_landlock_add_rule 445
-#define __NR_landlock_restrict_self 446
-#define __NR_memfd_secret 447
-#define __NR_process_mrelease 448
-#define __NR_futex_waitv 449
-#define __NR_set_mempolicy_home_node 450
-#define __NR_cachestat 451
-#define __NR_fchmodat2 452
-#define __NR_map_shadow_stack 453
-#define __NR_futex_wake 454
-#define __NR_futex_wait 455
-#define __NR_futex_requeue 456
-#define __NR_statmount 457
-#define __NR_listmount 458
-#define __NR_lsm_get_self_attr 459
-#define __NR_lsm_set_self_attr 460
-#define __NR_lsm_list_modules 461
-#define __NR_mseal 462
-#define __NR_setxattrat 463
-#define __NR_getxattrat 464
-#define __NR_listxattrat 465
-#define __NR_removexattrat 466
-#define __NR_open_tree_attr 467
-#define __NR_file_getattr 468
-#define __NR_file_setattr 469
-
-#endif /* _ASM_S390_UNISTD_32_H */
diff --git a/linux-headers/asm-s390/unistd_64.h b/linux-headers/asm-s390/unistd_64.h
index 8d9e579ef5..01f674c1bc 100644
--- a/linux-headers/asm-s390/unistd_64.h
+++ b/linux-headers/asm-s390/unistd_64.h
@@ -390,6 +390,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index b804fd25a2..01d46e2929 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -197,13 +197,13 @@ struct kvm_msrs {
__u32 nmsrs; /* number of msrs in entries */
__u32 pad;
- struct kvm_msr_entry entries[];
+ __DECLARE_FLEX_ARRAY(struct kvm_msr_entry, entries);
};
/* for KVM_GET_MSR_INDEX_LIST */
struct kvm_msr_list {
__u32 nmsrs; /* number of msrs in entries */
- __u32 indices[];
+ __DECLARE_FLEX_ARRAY(__u32, indices);
};
/* Maximum size of any access bitmap in bytes */
@@ -243,7 +243,7 @@ struct kvm_cpuid_entry {
struct kvm_cpuid {
__u32 nent;
__u32 padding;
- struct kvm_cpuid_entry entries[];
+ __DECLARE_FLEX_ARRAY(struct kvm_cpuid_entry, entries);
};
struct kvm_cpuid_entry2 {
@@ -265,7 +265,7 @@ struct kvm_cpuid_entry2 {
struct kvm_cpuid2 {
__u32 nent;
__u32 padding;
- struct kvm_cpuid_entry2 entries[];
+ __DECLARE_FLEX_ARRAY(struct kvm_cpuid_entry2, entries);
};
/* for KVM_GET_PIT and KVM_SET_PIT */
@@ -396,7 +396,7 @@ struct kvm_xsave {
* the contents of CPUID leaf 0xD on the host.
*/
__u32 region[1024];
- __u32 extra[];
+ __DECLARE_FLEX_ARRAY(__u32, extra);
};
#define KVM_MAX_XCRS 16
@@ -474,6 +474,7 @@ struct kvm_sync_regs {
#define KVM_X86_QUIRK_SLOT_ZAP_ALL (1 << 7)
#define KVM_X86_QUIRK_STUFF_FEATURE_MSRS (1 << 8)
#define KVM_X86_QUIRK_IGNORE_GUEST_PAT (1 << 9)
+#define KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM (1 << 10)
#define KVM_STATE_NESTED_FORMAT_VMX 0
#define KVM_STATE_NESTED_FORMAT_SVM 1
@@ -501,6 +502,7 @@ struct kvm_sync_regs {
#define KVM_X86_GRP_SEV 1
# define KVM_X86_SEV_VMSA_FEATURES 0
# define KVM_X86_SNP_POLICY_BITS 1
+# define KVM_X86_SEV_SNP_REQ_CERTS 2
struct kvm_vmx_nested_state_data {
__u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
@@ -562,7 +564,7 @@ struct kvm_pmu_event_filter {
__u32 fixed_counter_bitmap;
__u32 flags;
__u32 pad[4];
- __u64 events[];
+ __DECLARE_FLEX_ARRAY(__u64, events);
};
#define KVM_PMU_EVENT_ALLOW 0
@@ -741,6 +743,7 @@ enum sev_cmd_id {
KVM_SEV_SNP_LAUNCH_START = 100,
KVM_SEV_SNP_LAUNCH_UPDATE,
KVM_SEV_SNP_LAUNCH_FINISH,
+ KVM_SEV_SNP_ENABLE_REQ_CERTS,
KVM_SEV_NR_MAX,
};
@@ -912,8 +915,10 @@ struct kvm_sev_snp_launch_finish {
__u64 pad1[4];
};
-#define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0)
-#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1)
+#define KVM_X2APIC_API_USE_32BIT_IDS _BITULL(0)
+#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK _BITULL(1)
+#define KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST _BITULL(2)
+#define KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST _BITULL(3)
struct kvm_hyperv_eventfd {
__u32 conn_id;
diff --git a/linux-headers/asm-x86/unistd_32.h b/linux-headers/asm-x86/unistd_32.h
index 34255aac64..e945468829 100644
--- a/linux-headers/asm-x86/unistd_32.h
+++ b/linux-headers/asm-x86/unistd_32.h
@@ -461,6 +461,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-x86/unistd_64.h b/linux-headers/asm-x86/unistd_64.h
index 07f242a5fa..3c49b00ed1 100644
--- a/linux-headers/asm-x86/unistd_64.h
+++ b/linux-headers/asm-x86/unistd_64.h
@@ -385,6 +385,7 @@
#define __NR_file_getattr 468
#define __NR_file_setattr 469
#define __NR_listns 470
+#define __NR_rseq_slice_yield 471
#endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/unistd_x32.h b/linux-headers/asm-x86/unistd_x32.h
index 08fc9da2fa..bd2af9ad08 100644
--- a/linux-headers/asm-x86/unistd_x32.h
+++ b/linux-headers/asm-x86/unistd_x32.h
@@ -338,6 +338,7 @@
#define __NR_file_getattr (__X32_SYSCALL_BIT + 468)
#define __NR_file_setattr (__X32_SYSCALL_BIT + 469)
#define __NR_listns (__X32_SYSCALL_BIT + 470)
+#define __NR_rseq_slice_yield (__X32_SYSCALL_BIT + 471)
#define __NR_rt_sigaction (__X32_SYSCALL_BIT + 512)
#define __NR_rt_sigreturn (__X32_SYSCALL_BIT + 513)
#define __NR_ioctl (__X32_SYSCALL_BIT + 514)
diff --git a/linux-headers/linux/const.h b/linux-headers/linux/const.h
index 95ede23342..c6a9d0c983 100644
--- a/linux-headers/linux/const.h
+++ b/linux-headers/linux/const.h
@@ -50,4 +50,22 @@
#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+/*
+ * Divide positive or negative dividend by positive or negative divisor
+ * and round to closest integer. Result is undefined for negative
+ * divisors if the dividend variable type is unsigned and for negative
+ * dividends if the divisor variable type is unsigned.
+ */
+#define __KERNEL_DIV_ROUND_CLOSEST(x, divisor) \
+({ \
+ __typeof__(x) __x = x; \
+ __typeof__(divisor) __d = divisor; \
+ \
+ (((__typeof__(x))-1) > 0 || \
+ ((__typeof__(divisor))-1) > 0 || \
+ (((__x) > 0) == ((__d) > 0))) ? \
+ (((__x) + ((__d) / 2)) / (__d)) : \
+ (((__x) - ((__d) / 2)) / (__d)); \
+})
+
#endif /* _LINUX_CONST_H */
diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
index 384183a403..82587c7d62 100644
--- a/linux-headers/linux/iommufd.h
+++ b/linux-headers/linux/iommufd.h
@@ -465,16 +465,27 @@ struct iommu_hwpt_arm_smmuv3 {
__aligned_le64 ste[2];
};
+/**
+ * struct iommu_hwpt_amd_guest - AMD IOMMU guest I/O page table data
+ * (IOMMU_HWPT_DATA_AMD_GUEST)
+ * @dte: Guest Device Table Entry (DTE)
+ */
+struct iommu_hwpt_amd_guest {
+ __aligned_u64 dte[4];
+};
+
/**
* enum iommu_hwpt_data_type - IOMMU HWPT Data Type
* @IOMMU_HWPT_DATA_NONE: no data
* @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
* @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
+ * @IOMMU_HWPT_DATA_AMD_GUEST: AMD IOMMU guest page table
*/
enum iommu_hwpt_data_type {
IOMMU_HWPT_DATA_NONE = 0,
IOMMU_HWPT_DATA_VTD_S1 = 1,
IOMMU_HWPT_DATA_ARM_SMMUV3 = 2,
+ IOMMU_HWPT_DATA_AMD_GUEST = 3,
};
/**
@@ -623,6 +634,32 @@ struct iommu_hw_info_tegra241_cmdqv {
__u8 __reserved;
};
+/**
+ * struct iommu_hw_info_amd - AMD IOMMU device info
+ *
+ * @efr : Value of AMD IOMMU Extended Feature Register (EFR)
+ * @efr2: Value of AMD IOMMU Extended Feature 2 Register (EFR2)
+ *
+ * Please See description of these registers in the following sections of
+ * the AMD I/O Virtualization Technology (IOMMU) Specification.
+ * (https://docs.amd.com/v/u/en-US/48882_3.10_PUB)
+ *
+ * - MMIO Offset 0030h IOMMU Extended Feature Register
+ * - MMIO Offset 01A0h IOMMU Extended Feature 2 Register
+ *
+ * Note: The EFR and EFR2 are raw values reported by hardware.
+ * VMM is responsible to determine the appropriate flags to be exposed to
+ * the VM since certain features are not currently supported by the kernel
+ * for HW-vIOMMU.
+ *
+ * Current VMM-allowed list of feature flags are:
+ * - EFR[GTSup, GASup, GioSup, PPRSup, EPHSup, GATS, GLX, PASmax]
+ */
+struct iommu_hw_info_amd {
+ __aligned_u64 efr;
+ __aligned_u64 efr2;
+};
+
/**
* enum iommu_hw_info_type - IOMMU Hardware Info Types
* @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
@@ -632,6 +669,7 @@ struct iommu_hw_info_tegra241_cmdqv {
* @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
* @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
* SMMUv3) info type
+ * @IOMMU_HW_INFO_TYPE_AMD: AMD IOMMU info type
*/
enum iommu_hw_info_type {
IOMMU_HW_INFO_TYPE_NONE = 0,
@@ -639,6 +677,7 @@ enum iommu_hw_info_type {
IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV = 3,
+ IOMMU_HW_INFO_TYPE_AMD = 4,
};
/**
@@ -656,11 +695,15 @@ enum iommu_hw_info_type {
* @IOMMU_HW_CAP_PCI_PASID_PRIV: Privileged Mode Supported, user ignores it
* when the struct
* iommu_hw_info::out_max_pasid_log2 is zero.
+ * @IOMMU_HW_CAP_PCI_ATS_NOT_SUPPORTED: ATS is not supported or cannot be used
+ * on this device (absence implies ATS
+ * may be enabled)
*/
enum iommufd_hw_capabilities {
IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
IOMMU_HW_CAP_PCI_PASID_EXEC = 1 << 1,
IOMMU_HW_CAP_PCI_PASID_PRIV = 1 << 2,
+ IOMMU_HW_CAP_PCI_ATS_NOT_SUPPORTED = 1 << 3,
};
/**
@@ -1013,6 +1056,11 @@ struct iommu_fault_alloc {
enum iommu_viommu_type {
IOMMU_VIOMMU_TYPE_DEFAULT = 0,
IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1,
+ /*
+ * TEGRA241_CMDQV requirements (otherwise, VCMDQs will not work)
+ * - Kernel will allocate a VINTF (HYP_OWN=0) to back this VIOMMU. So,
+ * VMM must wire the HYP_OWN bit to 0 in guest VINTF_CONFIG register
+ */
IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2,
};
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index a4ab42dcba..50e87ed72c 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -11,9 +11,11 @@
#include <linux/const.h>
#include <linux/types.h>
+#include <linux/stddef.h>
#include <linux/ioctl.h>
#include <asm/kvm.h>
+
#define KVM_API_VERSION 12
/*
@@ -135,6 +137,12 @@ struct kvm_xen_exit {
} u;
};
+struct kvm_exit_snp_req_certs {
+ __u64 gpa;
+ __u64 npages;
+ __u64 ret;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576
@@ -180,6 +188,8 @@ struct kvm_xen_exit {
#define KVM_EXIT_MEMORY_FAULT 39
#define KVM_EXIT_TDX 40
#define KVM_EXIT_ARM_SEA 41
+#define KVM_EXIT_ARM_LDST64B 42
+#define KVM_EXIT_SNP_REQ_CERTS 43
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -394,7 +404,7 @@ struct kvm_run {
} eoi;
/* KVM_EXIT_HYPERV */
struct kvm_hyperv_exit hyperv;
- /* KVM_EXIT_ARM_NISV */
+ /* KVM_EXIT_ARM_NISV / KVM_EXIT_ARM_LDST64B */
struct {
__u64 esr_iss;
__u64 fault_ipa;
@@ -474,6 +484,8 @@ struct kvm_run {
__u64 gva;
__u64 gpa;
} arm_sea;
+ /* KVM_EXIT_SNP_REQ_CERTS */
+ struct kvm_exit_snp_req_certs snp_req_certs;
/* Fix the size of the union. */
char padding[256];
};
@@ -520,7 +532,7 @@ struct kvm_coalesced_mmio {
struct kvm_coalesced_mmio_ring {
__u32 first, last;
- struct kvm_coalesced_mmio coalesced_mmio[];
+ __DECLARE_FLEX_ARRAY(struct kvm_coalesced_mmio, coalesced_mmio);
};
#define KVM_COALESCED_MMIO_MAX \
@@ -570,7 +582,7 @@ struct kvm_clear_dirty_log {
/* for KVM_SET_SIGNAL_MASK */
struct kvm_signal_mask {
__u32 len;
- __u8 sigset[];
+ __DECLARE_FLEX_ARRAY(__u8, sigset);
};
/* for KVM_TPR_ACCESS_REPORTING */
@@ -681,6 +693,11 @@ struct kvm_enable_cap {
#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
+#define KVM_VM_TYPE_ARM_PROTECTED (1UL << 31)
+#define KVM_VM_TYPE_ARM_MASK (KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
+ KVM_VM_TYPE_ARM_PROTECTED)
+
/*
* ioctls for /dev/kvm fds:
*/
@@ -966,6 +983,8 @@ struct kvm_enable_cap {
#define KVM_CAP_GUEST_MEMFD_FLAGS 244
#define KVM_CAP_ARM_SEA_TO_USER 245
#define KVM_CAP_S390_USER_OPEREXEC 246
+#define KVM_CAP_S390_KEYOP 247
+#define KVM_CAP_S390_VSIE_ESAMODE 248
struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1028,7 +1047,7 @@ struct kvm_irq_routing_entry {
struct kvm_irq_routing {
__u32 nr;
__u32 flags;
- struct kvm_irq_routing_entry entries[];
+ __DECLARE_FLEX_ARRAY(struct kvm_irq_routing_entry, entries);
};
#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
@@ -1119,7 +1138,7 @@ struct kvm_dirty_tlb {
struct kvm_reg_list {
__u64 n; /* number of regs */
- __u64 reg[];
+ __DECLARE_FLEX_ARRAY(__u64, reg);
};
struct kvm_one_reg {
@@ -1201,6 +1220,10 @@ enum kvm_device_type {
#define KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_EIOINTC
KVM_DEV_TYPE_LOONGARCH_PCHPIC,
#define KVM_DEV_TYPE_LOONGARCH_PCHPIC KVM_DEV_TYPE_LOONGARCH_PCHPIC
+ KVM_DEV_TYPE_LOONGARCH_DMSINTC,
+#define KVM_DEV_TYPE_LOONGARCH_DMSINTC KVM_DEV_TYPE_LOONGARCH_DMSINTC
+ KVM_DEV_TYPE_ARM_VGIC_V5,
+#define KVM_DEV_TYPE_ARM_VGIC_V5 KVM_DEV_TYPE_ARM_VGIC_V5
KVM_DEV_TYPE_MAX,
@@ -1211,6 +1234,16 @@ struct kvm_vfio_spapr_tce {
__s32 tablefd;
};
+#define KVM_S390_KEYOP_ISKE 0x01
+#define KVM_S390_KEYOP_RRBE 0x02
+#define KVM_S390_KEYOP_SSKE 0x03
+struct kvm_s390_keyop {
+ __u64 guest_addr;
+ __u8 key;
+ __u8 operation;
+ __u8 pad[6];
+};
+
/*
* KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
* a vcpu fd.
@@ -1230,6 +1263,7 @@ struct kvm_vfio_spapr_tce {
#define KVM_S390_UCAS_MAP _IOW(KVMIO, 0x50, struct kvm_s390_ucas_mapping)
#define KVM_S390_UCAS_UNMAP _IOW(KVMIO, 0x51, struct kvm_s390_ucas_mapping)
#define KVM_S390_VCPU_FAULT _IOW(KVMIO, 0x52, unsigned long)
+#define KVM_S390_KEYOP _IOWR(KVMIO, 0x53, struct kvm_s390_keyop)
/* Device model IOC */
#define KVM_CREATE_IRQCHIP _IO(KVMIO, 0x60)
@@ -1571,7 +1605,7 @@ struct kvm_stats_desc {
__u16 size;
__u32 offset;
__u32 bucket_size;
- char name[];
+ __DECLARE_FLEX_ARRAY(char, name);
};
#define KVM_GET_STATS_FD _IO(KVMIO, 0xce)
--git a/linux-headers/linux/mshv.h b/linux-headers/linux/mshv.h
index acceeddc1c..6c7d3a9316 100644
--- a/linux-headers/linux/mshv.h
+++ b/linux-headers/linux/mshv.h
@@ -27,6 +27,8 @@ enum {
MSHV_PT_BIT_X2APIC,
MSHV_PT_BIT_GPA_SUPER_PAGES,
MSHV_PT_BIT_CPU_AND_XSAVE_FEATURES,
+ MSHV_PT_BIT_NESTED_VIRTUALIZATION,
+ MSHV_PT_BIT_SMT_ENABLED_GUEST,
MSHV_PT_BIT_COUNT,
};
@@ -355,7 +357,7 @@ struct mshv_vtl_sint_post_msg {
struct mshv_vtl_ram_disposition {
__u64 start_pfn;
- __u64 last_pfn;
+ __u64 last_pfn; /* last_pfn is excluded from the range [start_pfn, last_pfn) */
};
struct mshv_vtl_set_poll_file {
--git a/linux-headers/linux/psp-sev.h b/linux-headers/linux/psp-sev.h
index 9479928a4a..7df5002259 100644
--- a/linux-headers/linux/psp-sev.h
+++ b/linux-headers/linux/psp-sev.h
@@ -277,7 +277,7 @@ struct sev_user_data_snp_wrapped_vlek_hashstick {
* struct sev_issue_cmd - SEV ioctl parameters
*
* @cmd: SEV commands to execute
- * @opaque: pointer to the command structure
+ * @data: pointer to the command structure
* @error: SEV FW return code on failure
*/
struct sev_issue_cmd {
--git a/linux-headers/linux/stddef.h b/linux-headers/linux/stddef.h
index 48ee4438e0..4574982594 100644
--- a/linux-headers/linux/stddef.h
+++ b/linux-headers/linux/stddef.h
@@ -69,6 +69,10 @@
#define __counted_by_be(m)
#endif
+#ifndef __counted_by_ptr
+#define __counted_by_ptr(m)
+#endif
+
#define __kernel_nonstring
#endif /* _LINUX_STDDEF_H */
--git a/linux-headers/linux/vduse.h b/linux-headers/linux/vduse.h
index da6ac89af1..e19b3c0f51 100644
--- a/linux-headers/linux/vduse.h
+++ b/linux-headers/linux/vduse.h
@@ -10,6 +10,10 @@
#define VDUSE_API_VERSION 0
+/* VQ groups and ASID support */
+
+#define VDUSE_API_VERSION_1 1
+
/*
* Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
* This is used for future extension.
@@ -27,6 +31,8 @@
* @features: virtio features
* @vq_num: the number of virtqueues
* @vq_align: the allocation alignment of virtqueue's metadata
+ * @ngroups: number of vq groups that VDUSE device declares
+ * @nas: number of address spaces that VDUSE device declares
* @reserved: for future use, needs to be initialized to zero
* @config_size: the size of the configuration space
* @config: the buffer of the configuration space
@@ -41,7 +47,9 @@ struct vduse_dev_config {
__u64 features;
__u32 vq_num;
__u32 vq_align;
- __u32 reserved[13];
+ __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
+ __u32 nas; /* if VDUSE_API_VERSION >= 1 */
+ __u32 reserved[11];
__u32 config_size;
__u8 config[];
};
@@ -118,14 +126,18 @@ struct vduse_config_data {
* struct vduse_vq_config - basic configuration of a virtqueue
* @index: virtqueue index
* @max_size: the max size of virtqueue
- * @reserved: for future use, needs to be initialized to zero
+ * @reserved1: for future use, needs to be initialized to zero
+ * @group: virtqueue group
+ * @reserved2: for future use, needs to be initialized to zero
*
* Structure used by VDUSE_VQ_SETUP ioctl to setup a virtqueue.
*/
struct vduse_vq_config {
__u32 index;
__u16 max_size;
- __u16 reserved[13];
+ __u16 reserved1;
+ __u32 group;
+ __u16 reserved2[10];
};
/*
@@ -156,6 +168,16 @@ struct vduse_vq_state_packed {
__u16 last_used_idx;
};
+/**
+ * struct vduse_vq_group_asid - virtqueue group ASID
+ * @group: Index of the virtqueue group
+ * @asid: Address space ID of the group
+ */
+struct vduse_vq_group_asid {
+ __u32 group;
+ __u32 asid;
+};
+
/**
* struct vduse_vq_info - information of a virtqueue
* @index: virtqueue index
@@ -215,6 +237,7 @@ struct vduse_vq_eventfd {
* @uaddr: start address of userspace memory, it must be aligned to page size
* @iova: start of the IOVA region
* @size: size of the IOVA region
+ * @asid: Address space ID of the IOVA region
* @reserved: for future use, needs to be initialized to zero
*
* Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
@@ -224,7 +247,8 @@ struct vduse_iova_umem {
__u64 uaddr;
__u64 iova;
__u64 size;
- __u64 reserved[3];
+ __u32 asid;
+ __u32 reserved[5];
};
/* Register userspace memory for IOVA regions */
@@ -238,6 +262,7 @@ struct vduse_iova_umem {
* @start: start of the IOVA region
* @last: last of the IOVA region
* @capability: capability of the IOVA region
+ * @asid: Address space ID of the IOVA region, only if device API version >= 1
* @reserved: for future use, needs to be initialized to zero
*
* Structure used by VDUSE_IOTLB_GET_INFO ioctl to get information of
@@ -248,7 +273,8 @@ struct vduse_iova_info {
__u64 last;
#define VDUSE_IOVA_CAP_UMEM (1 << 0)
__u64 capability;
- __u64 reserved[3];
+ __u32 asid; /* Only if device API version >= 1 */
+ __u32 reserved[5];
};
/*
@@ -257,6 +283,32 @@ struct vduse_iova_info {
*/
#define VDUSE_IOTLB_GET_INFO _IOWR(VDUSE_BASE, 0x1a, struct vduse_iova_info)
+/**
+ * struct vduse_iotlb_entry_v2 - entry of IOTLB to describe one IOVA region
+ *
+ * @v1: the original vduse_iotlb_entry
+ * @asid: address space ID of the IOVA region
+ * @reserved: for future use, needs to be initialized to zero
+ *
+ * Structure used by VDUSE_IOTLB_GET_FD2 ioctl to find an overlapped IOVA region.
+ */
+struct vduse_iotlb_entry_v2 {
+ __u64 offset;
+ __u64 start;
+ __u64 last;
+ __u8 perm;
+ __u8 padding[7];
+ __u32 asid;
+ __u32 reserved[11];
+};
+
+/*
+ * Same as VDUSE_IOTLB_GET_FD but with vduse_iotlb_entry_v2 argument that
+ * support extra fields.
+ */
+#define VDUSE_IOTLB_GET_FD2 _IOWR(VDUSE_BASE, 0x1b, struct vduse_iotlb_entry_v2)
+
+
/* The control messages definition for read(2)/write(2) on /dev/vduse/$NAME */
/**
@@ -265,11 +317,14 @@ struct vduse_iova_info {
* @VDUSE_SET_STATUS: set the device status
* @VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping for
* specified IOVA range via VDUSE_IOTLB_GET_FD ioctl
+ * @VDUSE_SET_VQ_GROUP_ASID: Notify userspace to update the address space of a
+ * virtqueue group.
*/
enum vduse_req_type {
VDUSE_GET_VQ_STATE,
VDUSE_SET_STATUS,
VDUSE_UPDATE_IOTLB,
+ VDUSE_SET_VQ_GROUP_ASID,
};
/**
@@ -304,6 +359,19 @@ struct vduse_iova_range {
__u64 last;
};
+/**
+ * struct vduse_iova_range_v2 - IOVA range [start, last] if API_VERSION >= 1
+ * @start: start of the IOVA range
+ * @last: last of the IOVA range
+ * @asid: address space ID of the IOVA range
+ */
+struct vduse_iova_range_v2 {
+ __u64 start;
+ __u64 last;
+ __u32 asid;
+ __u32 padding;
+};
+
/**
* struct vduse_dev_request - control request
* @type: request type
@@ -312,6 +380,8 @@ struct vduse_iova_range {
* @vq_state: virtqueue state, only index field is available
* @s: device status
* @iova: IOVA range for updating
+ * @iova_v2: IOVA range for updating if API_VERSION >= 1
+ * @vq_group_asid: ASID of a virtqueue group
* @padding: padding
*
* Structure used by read(2) on /dev/vduse/$NAME.
@@ -324,6 +394,11 @@ struct vduse_dev_request {
struct vduse_vq_state vq_state;
struct vduse_dev_status s;
struct vduse_iova_range iova;
+ /* Following members but padding exist only if vduse api
+ * version >= 1
+ */
+ struct vduse_iova_range_v2 iova_v2;
+ struct vduse_vq_group_asid vq_group_asid;
__u32 padding[32];
};
};
--git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 720edfee7a..f3282b8e86 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -141,7 +141,7 @@ struct vfio_info_cap_header {
*
* Retrieve information about the group. Fills in provided
* struct vfio_group_info. Caller sets argsz.
- * Return: 0 on succes, -errno on failure.
+ * Return: 0 on success, -errno on failure.
* Availability: Always
*/
struct vfio_group_status {
@@ -964,6 +964,10 @@ struct vfio_device_bind_iommufd {
* hwpt corresponding to the given pt_id.
*
* Return: 0 on success, -errno on failure.
+ *
+ * When a device is resetting, -EBUSY will be returned to reject any concurrent
+ * attachment to the resetting device itself or any sibling device in the IOMMU
+ * group having the resetting device.
*/
struct vfio_device_attach_iommufd_pt {
__u32 argsz;
@@ -1262,6 +1266,19 @@ enum vfio_device_mig_state {
* The initial_bytes field indicates the amount of initial precopy
* data available from the device. This field should have a non-zero initial
* value and decrease as migration data is read from the device.
+ * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates
+ * that new initial data is present on the stream.
+ * The new initial data may result, for example, from device reconfiguration
+ * during migration that requires additional initialization data.
+ * In that case initial_bytes may report a non-zero value irrespective of
+ * any previously reported values, which progresses towards zero as precopy
+ * data is read from the data stream. dirty_bytes is also reset
+ * to zero and represents the state change of the device relative to the new
+ * initial_bytes.
+ * VFIO_PRECOPY_INFO_REINIT can be reported only after userspace opts in to
+ * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2. Without this opt-in, the flags field
+ * of struct vfio_precopy_info is reserved for bug-compatibility reasons.
+ *
* It is recommended to leave PRE_COPY for STOP_COPY only after this field
* reaches zero. Leaving PRE_COPY earlier might make things slower.
*
@@ -1297,6 +1314,7 @@ enum vfio_device_mig_state {
struct vfio_precopy_info {
__u32 argsz;
__u32 flags;
+#define VFIO_PRECOPY_INFO_REINIT (1 << 0) /* output - new initial data is present */
__aligned_u64 initial_bytes;
__aligned_u64 dirty_bytes;
};
@@ -1506,6 +1524,16 @@ struct vfio_device_feature_dma_buf {
struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
};
+/*
+ * Enables the migration precopy_info_v2 behaviour.
+ *
+ * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2.
+ *
+ * On SET, enables the v2 pre_copy_info behaviour, where the
+ * vfio_precopy_info.flags is a valid output field.
+ */
+#define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.40.1
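The REINIT semantics added to vfio.h above can be illustrated with a small userspace-side sketch. The struct and flag are re-declared locally for illustration only (real consumers include <linux/vfio.h>, where the 64-bit fields are __aligned_u64); the tracking structure and handler are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Local re-declarations for illustration; real users get these
 * definitions from <linux/vfio.h> once this update lands. The flag
 * value matches the hunk above. */
#define VFIO_PRECOPY_INFO_REINIT (1 << 0)

struct vfio_precopy_info {
    uint32_t argsz;
    uint32_t flags;
    uint64_t initial_bytes;
    uint64_t dirty_bytes;
};

/* Hypothetical bookkeeping a userspace precopy loop might keep. */
struct precopy_state {
    uint64_t initial_seen;
    int reinit_count;
};

/* When REINIT is reported, previously read initial data is stale:
 * restart tracking from the newly reported counters. */
static void handle_precopy_info(struct precopy_state *s,
                                const struct vfio_precopy_info *info)
{
    if (info->flags & VFIO_PRECOPY_INFO_REINIT) {
        s->reinit_count++;
    }
    s->initial_seen = info->initial_bytes;
}
```

A real loop would fill vfio_precopy_info from the precopy-info ioctl on the migration data fd, and only consult flags after opting in to VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 as described above.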
^ permalink raw reply related [flat|nested] 31+ messages in thread

* Re: [PATCH 02/14] linux-headers: Update to Linux v7.1-rc1
2026-05-05 8:14 ` [PATCH 02/14] linux-headers: Update to Linux v7.1-rc1 Avihai Horon
@ 2026-05-07 9:16 ` Cédric Le Goater
0 siblings, 0 replies; 31+ messages in thread
From: Cédric Le Goater @ 2026-05-07 9:16 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb, Song Gao
+ Song Gao
On 5/5/26 10:14, Avihai Horon wrote:
> Update Linux headers to get VFIO_PRECOPY_INFO_REINIT feature of VFIO
> migration uAPI.
>
> The update was done by running `update-linux-headers.sh` on commit
> 254f49634ee1 ("Linux 7.1-rc1") in the Linux tree.
>
> Signed-off-by: Avihai Horon<avihaih@nvidia.com>
> ---
> include/standard-headers/drm/drm_fourcc.h | 28 +-
> include/standard-headers/linux/const.h | 18 +
> include/standard-headers/linux/ethtool.h | 28 +-
> .../linux/input-event-codes.h | 13 +
> include/standard-headers/linux/pci_regs.h | 71 ++-
> include/standard-headers/linux/typelimits.h | 8 +
> include/standard-headers/linux/virtio_ring.h | 3 +-
> include/standard-headers/linux/virtio_rtc.h | 237 ++++++++++
> include/standard-headers/linux/vmclock-abi.h | 20 +
> linux-headers/asm-arm64/kvm.h | 1 +
> linux-headers/asm-arm64/unistd_64.h | 1 +
> linux-headers/asm-generic/unistd.h | 5 +-
> linux-headers/asm-loongarch/kvm.h | 5 +
> linux-headers/asm-loongarch/kvm_para.h | 1 +
> linux-headers/asm-loongarch/unistd_64.h | 2 +
> linux-headers/asm-mips/unistd_n32.h | 1 +
> linux-headers/asm-mips/unistd_n64.h | 1 +
> linux-headers/asm-mips/unistd_o32.h | 1 +
> linux-headers/asm-powerpc/unistd_32.h | 1 +
> linux-headers/asm-powerpc/unistd_64.h | 1 +
> linux-headers/asm-riscv/kvm.h | 11 +-
> linux-headers/asm-riscv/ptrace.h | 37 ++
> linux-headers/asm-riscv/unistd_32.h | 1 +
> linux-headers/asm-riscv/unistd_64.h | 1 +
> linux-headers/asm-s390/unistd_32.h | 446 ------------------
> linux-headers/asm-s390/unistd_64.h | 1 +
> linux-headers/asm-x86/kvm.h | 21 +-
> linux-headers/asm-x86/unistd_32.h | 1 +
> linux-headers/asm-x86/unistd_64.h | 1 +
> linux-headers/asm-x86/unistd_x32.h | 1 +
> linux-headers/linux/const.h | 18 +
> linux-headers/linux/iommufd.h | 48 ++
> linux-headers/linux/kvm.h | 46 +-
> linux-headers/linux/mshv.h | 4 +-
> linux-headers/linux/psp-sev.h | 2 +-
> linux-headers/linux/stddef.h | 4 +
> linux-headers/linux/vduse.h | 85 +++-
> linux-headers/linux/vfio.h | 30 +-
> 38 files changed, 711 insertions(+), 493 deletions(-)
> create mode 100644 include/standard-headers/linux/typelimits.h
> create mode 100644 include/standard-headers/linux/virtio_rtc.h
> delete mode 100644 linux-headers/asm-s390/unistd_32.h
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
* [PATCH 03/14] migration: Propagate errors in migration_completion_precopy()
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
2026-05-05 8:14 ` [PATCH 01/14] scripts/update-linux-headers: Add typelimits.h Avihai Horon
2026-05-05 8:14 ` [PATCH 02/14] linux-headers: Update to Linux v7.1-rc1 Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-07 8:03 ` Cédric Le Goater
2026-05-05 8:14 ` [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover() Avihai Horon
` (10 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
migration_completion_precopy() doesn't propagate errors to migration
core which leads to error information loss. Fix that.
This prepares for a follow-up where migration_switchover_start() can
fail on switchover-ack and still report a useful error. Errors from
qemu_savevm_state_complete_precopy() are not propagated yet as it
requires more plumbing.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
migration/migration.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 6fd89995a2..a5c7ca6796 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2780,23 +2780,28 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
static int migration_completion_precopy(MigrationState *s)
{
int ret;
+ Error *local_err = NULL;
bql_lock();
if (!migrate_mode_is_cpr()) {
ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
if (ret < 0) {
+ error_setg_errno(&local_err, -ret, "Failed to stop the VM");
goto out_unlock;
}
}
- if (!migration_switchover_start(s, NULL)) {
+ if (!migration_switchover_start(s, &local_err)) {
ret = -EFAULT;
goto out_unlock;
}
ret = qemu_savevm_state_complete_precopy(s);
out_unlock:
+ if (local_err) {
+ migrate_error_propagate(s, local_err);
+ }
bql_unlock();
return ret;
}
--
2.40.1
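To make the control flow of the hunk above concrete, here is a self-contained toy mirroring the pattern: collect a local error on each failure path, propagate it once at the common exit. The Error type and helpers below are simplified stand-ins for QEMU's qapi/error.h API and migrate_error_propagate(), not the real implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for QEMU's Error (qapi/error.h); simplified. */
typedef struct Error { char msg[128]; } Error;

static void error_setg_errno(Error **errp, int err, const char *msg)
{
    if (!errp) {
        return;
    }
    *errp = malloc(sizeof(Error));
    snprintf((*errp)->msg, sizeof((*errp)->msg), "%s: %s",
             msg, strerror(err));
}

/* Stand-in for migrate_error_propagate(): hands the error to a central
 * record, as the patch does for MigrationState. */
static Error *migration_error; /* hypothetical central record */
static void migrate_error_propagate(Error *err)
{
    migration_error = err;
}

/* Mirrors the control flow added by the patch: any failure path sets
 * local_err, which is propagated once at the single exit point. */
static int completion_precopy(int stop_vm_ret)
{
    int ret = stop_vm_ret;
    Error *local_err = NULL;

    if (ret < 0) {
        error_setg_errno(&local_err, -ret, "Failed to stop the VM");
        goto out;
    }
    /* ... switchover and device-state saving would follow here ... */
out:
    if (local_err) {
        migrate_error_propagate(local_err);
    }
    return ret;
}
```

The benefit over passing NULL is visible in the failure case: the propagated record carries a formatted message instead of a bare return code.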
* Re: [PATCH 03/14] migration: Propagate errors in migration_completion_precopy()
2026-05-05 8:14 ` [PATCH 03/14] migration: Propagate errors in migration_completion_precopy() Avihai Horon
@ 2026-05-07 8:03 ` Cédric Le Goater
2026-05-08 13:01 ` Avihai Horon
0 siblings, 1 reply; 31+ messages in thread
From: Cédric Le Goater @ 2026-05-07 8:03 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On 5/5/26 10:14, Avihai Horon wrote:
> migration_completion_precopy() doesn't propagate errors to migration
> core which leads to error information loss. Fix that.
>
> This prepares for a follow-up where migration_switchover_start() can
> fail on switchover-ack and still report a useful error. Errors from
> qemu_savevm_state_complete_precopy() are not propagated yet as it
> requires more plumbing.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> migration/migration.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 6fd89995a2..a5c7ca6796 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2780,23 +2780,28 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
> static int migration_completion_precopy(MigrationState *s)
> {
> int ret;
> + Error *local_err = NULL;
>
> bql_lock();
>
> if (!migrate_mode_is_cpr()) {
> ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
> if (ret < 0) {
> + error_setg_errno(&local_err, -ret, "Failed to stop the VM");
> goto out_unlock;
> }
> }
>
> - if (!migration_switchover_start(s, NULL)) {
> + if (!migration_switchover_start(s, &local_err)) {
> ret = -EFAULT;
> goto out_unlock;
> }
>
> ret = qemu_savevm_state_complete_precopy(s);
> out_unlock:
> + if (local_err) {
> + migrate_error_propagate(s, local_err);
> + }
> bql_unlock();
> return ret;
> }
Instead, I would modify migration_completion_precopy() to use the Error
variable in migration_completion() :
static void migration_completion(MigrationState *s)
{
int ret = 0;
Error *local_err = NULL;
if (s->state == MIGRATION_STATUS_ACTIVE) {
ret = migration_completion_precopy(s);
...
Thanks,
C.
* Re: [PATCH 03/14] migration: Propagate errors in migration_completion_precopy()
2026-05-07 8:03 ` Cédric Le Goater
@ 2026-05-08 13:01 ` Avihai Horon
2026-05-15 15:20 ` Peter Xu
0 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-08 13:01 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On 5/7/2026 11:03 AM, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> On 5/5/26 10:14, Avihai Horon wrote:
>> migration_completion_precopy() doesn't propagate errors to migration
>> core which leads to error information loss. Fix that.
>>
>> This prepares for a follow-up where migration_switchover_start() can
>> fail on switchover-ack and still report a useful error. Errors from
>> qemu_savevm_state_complete_precopy() are not propagated yet as it
>> requires more plumbing.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>> migration/migration.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 6fd89995a2..a5c7ca6796 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2780,23 +2780,28 @@ static bool
>> migration_switchover_start(MigrationState *s, Error **errp)
>> static int migration_completion_precopy(MigrationState *s)
>> {
>> int ret;
>> + Error *local_err = NULL;
>>
>> bql_lock();
>>
>> if (!migrate_mode_is_cpr()) {
>> ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>> if (ret < 0) {
>> + error_setg_errno(&local_err, -ret, "Failed to stop the
>> VM");
>> goto out_unlock;
>> }
>> }
>>
>> - if (!migration_switchover_start(s, NULL)) {
>> + if (!migration_switchover_start(s, &local_err)) {
>> ret = -EFAULT;
>> goto out_unlock;
>> }
>>
>> ret = qemu_savevm_state_complete_precopy(s);
>> out_unlock:
>> + if (local_err) {
>> + migrate_error_propagate(s, local_err);
>> + }
>> bql_unlock();
>> return ret;
>> }
>
> Instead, I would modify migration_completion_precopy() to use the Error
> variable in migration_completion() :
>
> static void migration_completion(MigrationState *s)
> {
> int ret = 0;
> Error *local_err = NULL;
>
> if (s->state == MIGRATION_STATUS_ACTIVE) {
> ret = migration_completion_precopy(s);
>
> ...
>
I'd rather keep this change limited and not involve migration_completion().
The error reporting in this path is a bit convoluted (mixing error
reporting via qemu_file) and I think it deserves a separate series
cleaning things up there.
Unless I am missing something here and the above should be easy?
Thanks.
* Re: [PATCH 03/14] migration: Propagate errors in migration_completion_precopy()
2026-05-08 13:01 ` Avihai Horon
@ 2026-05-15 15:20 ` Peter Xu
0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2026-05-15 15:20 UTC (permalink / raw)
To: Avihai Horon
Cc: Cédric Le Goater, qemu-devel, Alex Williamson, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On Fri, May 08, 2026 at 04:01:43PM +0300, Avihai Horon wrote:
>
> On 5/7/2026 11:03 AM, Cédric Le Goater wrote:
> >
> >
> > On 5/5/26 10:14, Avihai Horon wrote:
> > > migration_completion_precopy() doesn't propagate errors to migration
> > > core which leads to error information loss. Fix that.
> > >
> > > This prepares for a follow-up where migration_switchover_start() can
> > > fail on switchover-ack and still report a useful error. Errors from
> > > qemu_savevm_state_complete_precopy() are not propagated yet as it
> > > requires more plumbing.
> > >
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > > ---
> > > migration/migration.c | 7 ++++++-
> > > 1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 6fd89995a2..a5c7ca6796 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -2780,23 +2780,28 @@ static bool
> > > migration_switchover_start(MigrationState *s, Error **errp)
> > > static int migration_completion_precopy(MigrationState *s)
> > > {
> > > int ret;
> > > + Error *local_err = NULL;
> > >
> > > bql_lock();
> > >
> > > if (!migrate_mode_is_cpr()) {
> > > ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
> > > if (ret < 0) {
> > > + error_setg_errno(&local_err, -ret, "Failed to stop the
> > > VM");
> > > goto out_unlock;
> > > }
> > > }
> > >
> > > - if (!migration_switchover_start(s, NULL)) {
> > > + if (!migration_switchover_start(s, &local_err)) {
> > > ret = -EFAULT;
> > > goto out_unlock;
> > > }
> > >
> > > ret = qemu_savevm_state_complete_precopy(s);
> > > out_unlock:
> > > + if (local_err) {
> > > + migrate_error_propagate(s, local_err);
> > > + }
> > > bql_unlock();
> > > return ret;
> > > }
> >
> > Instead, I would modify migration_completion_precopy() to use the Error
> > variable in migration_completion() :
> >
> > static void migration_completion(MigrationState *s)
> > {
> > int ret = 0;
> > Error *local_err = NULL;
> >
> > if (s->state == MIGRATION_STATUS_ACTIVE) {
> > ret = migration_completion_precopy(s);
> >
> > ...
> >
> I'd rather keep this change limited and not involve migration_completion().
> The error reporting in this path is a bit convoluted (mixing error reporting
> via qemu_file) and I think it deserves a separate series cleaning things up
> there.
>
> Unless I am missing something here and the above should be easy?
Not easy to keep everything as before, but I tend to agree with Cedric.
The hard part is to maintain the same error when something fails in
migration_completion(), but IMHO that's a legacy problem we'll need to
tackle sooner or later. We can do it now, facing the risk that some
error messages might change: I think it's worthwhile to try.
So it also avoids introducing yet another migrate_error_propagate() call
deep in the stack. Ideally we keep moving it upward so the invocations
become fewer as time goes on.
The old priority to handle errors in migration_completion() is:
1. if qemu_file_get_error_obj() succeeded, use it first, otherwise,
2. if ret!=0, generate an error for retval
Side note: (1) is currently slightly off when qemu_file_get_error_obj()
returns non-zero but without an Error attached.. but let's ignore it for
now.
After this patch, we could prioritize Error* whenever set, hence:
1. if error non-null, use it directly,
2. if qemu_file_get_error_obj() succeeded, use it first, otherwise,
3. if ret!=0, generate an error for retval
I think this order makes sense because neither qemufile error nor retcode
is better than a literal error passed over.
Thanks,
--
Peter Xu
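The three-level priority proposed above can be sketched as a small selector. The Error type and names below are toy stand-ins for illustration; QEMU's real types live in qapi/error.h and the qemu_file error is obtained via qemu_file_get_error_obj():

```c
#include <assert.h>
#include <stddef.h>

typedef struct Error { const char *msg; } Error; /* toy stand-in */

/* Selects the error to report, per the proposed ordering:
 * 1. an explicit Error passed up the stack, if set,
 * 2. else the Error attached to the QEMUFile,
 * 3. else one synthesized from the return code. */
static const char *pick_error(Error *local_err, Error *qemufile_err, int ret)
{
    if (local_err) {
        return local_err->msg;
    }
    if (qemufile_err) {
        return qemufile_err->msg;
    }
    if (ret) {
        return "completion failed (retval only)"; /* synthesized */
    }
    return NULL; /* no error: completion succeeded */
}
```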
* [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover()
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (2 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 03/14] migration: Propagate errors in migration_completion_precopy() Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-07 8:09 ` Cédric Le Goater
2026-05-15 15:24 ` Peter Xu
2026-05-05 8:14 ` [PATCH 05/14] migration: Replace switchover_ack_needed SaveVMHandler Avihai Horon
` (9 subsequent siblings)
13 siblings, 2 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Pass the device name that approved switchover to
qemu_loadvm_approve_switchover() and log it in the trace for debugging
purposes.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
migration/savevm.h | 2 +-
hw/vfio/migration.c | 2 +-
migration/savevm.c | 4 ++--
migration/trace-events | 2 +-
4 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/migration/savevm.h b/migration/savevm.h
index 96fdf96d4e..1efbe1d79d 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,7 +70,7 @@ void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
Error **errp);
int qemu_load_device_state(QEMUFile *f, Error **errp);
-int qemu_loadvm_approve_switchover(void);
+int qemu_loadvm_approve_switchover(const char *approver);
int qemu_savevm_state_non_iterable(QEMUFile *f, Error **errp);
int qemu_savevm_state_non_iterable_early(QEMUFile *f,
JSONWriter *vmdesc,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 150e28656e..c3dd30d619 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -818,7 +818,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return -EINVAL;
}
- ret = qemu_loadvm_approve_switchover();
+ ret = qemu_loadvm_approve_switchover(vbasedev->name);
if (ret) {
error_report(
"%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
diff --git a/migration/savevm.c b/migration/savevm.c
index d1dd696c17..bd8c6ada5a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3158,7 +3158,7 @@ int qemu_load_device_state(QEMUFile *f, Error **errp)
return 0;
}
-int qemu_loadvm_approve_switchover(void)
+int qemu_loadvm_approve_switchover(const char *approver)
{
MigrationIncomingState *mis = migration_incoming_get_current();
@@ -3167,7 +3167,7 @@ int qemu_loadvm_approve_switchover(void)
}
mis->switchover_ack_pending_num--;
- trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
+ trace_loadvm_approve_switchover(approver, mis->switchover_ack_pending_num);
if (mis->switchover_ack_pending_num) {
return 0;
diff --git a/migration/trace-events b/migration/trace-events
index de99d976ab..d9084a2692 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -24,7 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
loadvm_process_command_ping(uint32_t val) "0x%x"
-loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
+loadvm_approve_switchover(const char *approver, unsigned int switchover_ack_pending_num) "Approver %s, switchover_ack_pending_num %u"
postcopy_ram_listen_thread_exit(void) ""
postcopy_ram_listen_thread_start(void) ""
qemu_savevm_send_postcopy_advise(void) ""
--
2.40.1
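As a rough, self-contained model of the counter bookkeeping in qemu_loadvm_approve_switchover() after this patch: each opted-in device decrements the pending count, and switchover proceeds once it reaches zero. This is simplified for illustration; the real function operates on MigrationIncomingState and uses the trace event shown above rather than printf:

```c
#include <assert.h>
#include <stdio.h>

/* Each device that opted into switchover-ack decrements the pending
 * counter; switchover is approved when it reaches zero. Returns 1 when
 * all acks arrived, 0 while acks are still pending, -1 on a spurious
 * extra ack (a bug in the caller). */
static int approve_switchover(unsigned *pending, const char *approver)
{
    if (*pending == 0) {
        return -1; /* more acks than opted-in devices */
    }
    (*pending)--;
    /* Mirrors the updated trace: logs who approved and how many remain. */
    printf("approver %s, switchover_ack_pending_num %u\n",
           approver, *pending);
    return *pending == 0;
}
```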
* Re: [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover()
2026-05-05 8:14 ` [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover() Avihai Horon
@ 2026-05-07 8:09 ` Cédric Le Goater
2026-05-08 13:07 ` Avihai Horon
2026-05-15 15:24 ` Peter Xu
1 sibling, 1 reply; 31+ messages in thread
From: Cédric Le Goater @ 2026-05-07 8:09 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On 5/5/26 10:14, Avihai Horon wrote:
> Pass the device name that approved switchover to
> qemu_loadvm_approve_switchover() and log it in the trace for debugging
> purposes.
hmm, isn't the trace event in vfio_load_state() enough :
trace_vfio_load_state(vbasedev->name, data);
May be we can improve it instead ?
Anyhow, I won't object to the change.
Thanks,
C.
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> migration/savevm.h | 2 +-
> hw/vfio/migration.c | 2 +-
> migration/savevm.c | 4 ++--
> migration/trace-events | 2 +-
> 4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 96fdf96d4e..1efbe1d79d 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -70,7 +70,7 @@ void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
> int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> Error **errp);
> int qemu_load_device_state(QEMUFile *f, Error **errp);
> -int qemu_loadvm_approve_switchover(void);
> +int qemu_loadvm_approve_switchover(const char *approver);
> int qemu_savevm_state_non_iterable(QEMUFile *f, Error **errp);
> int qemu_savevm_state_non_iterable_early(QEMUFile *f,
> JSONWriter *vmdesc,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 150e28656e..c3dd30d619 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -818,7 +818,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> return -EINVAL;
> }
>
> - ret = qemu_loadvm_approve_switchover();
> + ret = qemu_loadvm_approve_switchover(vbasedev->name);
> if (ret) {
> error_report(
> "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d1dd696c17..bd8c6ada5a 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3158,7 +3158,7 @@ int qemu_load_device_state(QEMUFile *f, Error **errp)
> return 0;
> }
>
> -int qemu_loadvm_approve_switchover(void)
> +int qemu_loadvm_approve_switchover(const char *approver)
> {
> MigrationIncomingState *mis = migration_incoming_get_current();
>
> @@ -3167,7 +3167,7 @@ int qemu_loadvm_approve_switchover(void)
> }
>
> mis->switchover_ack_pending_num--;
> - trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
> + trace_loadvm_approve_switchover(approver, mis->switchover_ack_pending_num);
>
> if (mis->switchover_ack_pending_num) {
> return 0;
> diff --git a/migration/trace-events b/migration/trace-events
> index de99d976ab..d9084a2692 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -24,7 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
> loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
> loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
> loadvm_process_command_ping(uint32_t val) "0x%x"
> -loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> +loadvm_approve_switchover(const char *approver, unsigned int switchover_ack_pending_num) "Approver %s, switchover_ack_pending_num %u"
> postcopy_ram_listen_thread_exit(void) ""
> postcopy_ram_listen_thread_start(void) ""
> qemu_savevm_send_postcopy_advise(void) ""
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover()
2026-05-07 8:09 ` Cédric Le Goater
@ 2026-05-08 13:07 ` Avihai Horon
0 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-08 13:07 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On 5/7/2026 11:09 AM, Cédric Le Goater wrote:
>
> On 5/5/26 10:14, Avihai Horon wrote:
>> Pass the device name that approved switchover to
>> qemu_loadvm_approve_switchover() and log it in the trace for debugging
>> purposes.
> hmm, isn't the trace event in vfio_load_state() enough:
>
> trace_vfio_load_state(vbasedev->name, data);
>
> Maybe we can improve it instead?
>
> Anyhow, I won't object to the change.
We could print "data" more nicely, but the main benefit of having a
separate trace is that we can enable only the switchover-ack related
traces without getting noise from the traces of the other VFIO_MIG_FLAG_*.
Plus, future switchover-ack users won't need to add a trace of their own
and we can also log switchover_ack_pending_num.
I found it very convenient when debugging/testing this feature.
Thanks.
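For illustration, a minimal sketch of enabling just such a dedicated trace event via QEMU's glob-pattern `-trace` option. The binary name, the `-incoming` address, and the exact event names are assumptions for this example (event names follow this series); the command is echoed rather than executed:

```shell
# Hypothetical destination-side invocation enabling only the
# switchover-ack related trace events, without the noise of the
# other VFIO_MIG_FLAG_* traces. Printed for illustration only.
echo 'qemu-system-x86_64 -incoming tcp:0:4444 \
    -trace "enable=loadvm_approve_switchover" \
    -trace "enable=migration_request_switchover_ack*"'
```

QEMU's `-trace enable=PATTERN` accepts glob patterns, so a single pattern can cover a family of related events.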
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover()
2026-05-05 8:14 ` [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover() Avihai Horon
2026-05-07 8:09 ` Cédric Le Goater
@ 2026-05-15 15:24 ` Peter Xu
1 sibling, 0 replies; 31+ messages in thread
From: Peter Xu @ 2026-05-15 15:24 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On Tue, May 05, 2026 at 11:14:13AM +0300, Avihai Horon wrote:
> Pass the device name that approved switchover to
> qemu_loadvm_approve_switchover() and log it in the trace for debugging
> purposes.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 05/14] migration: Replace switchover_ack_needed SaveVMHandler
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (3 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 04/14] migration: Log the approver in qemu_loadvm_approve_switchover() Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-15 15:27 ` Peter Xu
2026-05-05 8:14 ` [PATCH 06/14] migration: Rename switchover-ack code to legacy Avihai Horon
` (8 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
A new switchover-ack mechanism, which will replace the existing one, will
be added in the following patches. The new mechanism will have a new
SaveVMHandler and will not use switchover_ack_needed. However, the old
mechanism must still be kept for backward compatibility.
To keep things clear and reduce the API surface of the old code, replace
the switchover_ack_needed SaveVMHandler with a regular function,
migration_request_switchover_ack().
No functional changes intended.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
docs/devel/migration/vfio.rst | 3 ---
include/migration/misc.h | 2 ++
include/migration/register.h | 13 -------------
hw/vfio/migration.c | 18 ++++++++++--------
migration/migration.c | 15 +++++++++++++++
migration/savevm.c | 21 ---------------------
migration/trace-events | 2 +-
7 files changed, 28 insertions(+), 46 deletions(-)
diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index 691061d182..854277b11c 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -59,9 +59,6 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``save_live_iterate`` function that reads the VFIO device's data from the
vendor driver during iterative pre-copy phase.
-* A ``switchover_ack_needed`` function that checks if the VFIO device uses
- "switchover-ack" migration capability when this capability is enabled.
-
* A ``switchover_start`` function that in the multifd mode starts a thread that
reassembles the multifd received data and loads it in-order into the device.
In the non-multifd mode this function is a NOP.
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 3159a5e53c..a2219c981b 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -156,4 +156,6 @@ bool multifd_device_state_save_thread_should_exit(void);
void multifd_abort_device_state_save_threads(void);
bool multifd_join_device_state_save_threads(void);
+void migration_request_switchover_ack(const char *requester);
+
#endif
diff --git a/include/migration/register.h b/include/migration/register.h
index 5e5e0ee432..eae4c4ffca 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -299,19 +299,6 @@ typedef struct SaveVMHandlers {
*/
int (*resume_prepare)(MigrationState *s, void *opaque);
- /**
- * @switchover_ack_needed
- *
- * Checks if switchover ack should be used. Called only on
- * destination.
- *
- * @opaque: data pointer passed to register_savevm_live()
- *
- * Returns true if switchover ack should be used and false
- * otherwise
- */
- bool (*switchover_ack_needed)(void *opaque);
-
/**
* @switchover_start
*
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index c3dd30d619..4dab569e62 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -463,6 +463,14 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
}
+static void vfio_request_switchover_ack(VFIODevice *vbasedev)
+{
+ if (vfio_precopy_supported(vbasedev)) {
+ /* Precopy support implies switchover-ack is needed */
+ migration_request_switchover_ack(vbasedev->name);
+ }
+}
+
/* ---------------------------------------------------------------------- */
static int vfio_save_prepare(void *opaque, Error **errp)
@@ -747,6 +755,8 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
return ret;
}
+ vfio_request_switchover_ack(vbasedev);
+
return 0;
}
@@ -845,13 +855,6 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
-static bool vfio_switchover_ack_needed(void *opaque)
-{
- VFIODevice *vbasedev = opaque;
-
- return vfio_precopy_supported(vbasedev);
-}
-
static int vfio_switchover_start(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -875,7 +878,6 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
- .switchover_ack_needed = vfio_switchover_ack_needed,
/*
* Multifd support
*/
diff --git a/migration/migration.c b/migration/migration.c
index a5c7ca6796..1db2d296bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2164,6 +2164,21 @@ void migration_rp_kick(MigrationState *s)
qemu_sem_post(&s->rp_state.rp_sem);
}
+/* This is called only on destination side */
+void migration_request_switchover_ack(const char *requester)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ if (!migrate_switchover_ack()) {
+ return;
+ }
+
+ mis->switchover_ack_pending_num++;
+
+ trace_migration_request_switchover_ack(requester,
+ mis->switchover_ack_pending_num);
+}
+
static struct rp_cmd_args {
ssize_t len; /* -1 = variable */
const char *name;
diff --git a/migration/savevm.c b/migration/savevm.c
index bd8c6ada5a..2bc3414363 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2784,23 +2784,6 @@ static int qemu_loadvm_state_header(QEMUFile *f, Error **errp)
return 0;
}
-static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
-{
- SaveStateEntry *se;
-
- QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
- if (!se->ops || !se->ops->switchover_ack_needed) {
- continue;
- }
-
- if (se->ops->switchover_ack_needed(se->opaque)) {
- mis->switchover_ack_pending_num++;
- }
- }
-
- trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
-}
-
static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
{
ERRP_GUARD();
@@ -3062,10 +3045,6 @@ int qemu_loadvm_state(QEMUFile *f, Error **errp)
return -EINVAL;
}
- if (migrate_switchover_ack()) {
- qemu_loadvm_state_switchover_ack_needed(mis);
- }
-
cpu_synchronize_all_pre_loadvm();
ret = qemu_loadvm_state_main(f, mis, errp);
diff --git a/migration/trace-events b/migration/trace-events
index d9084a2692..9f5f6142d3 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -8,7 +8,6 @@ qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
-loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
loadvm_handle_cmd_packaged(unsigned int length) "%u"
@@ -199,6 +198,7 @@ process_incoming_migration_co_postcopy_end_main(void) ""
postcopy_preempt_enabled(bool value) "%d"
migration_precopy_complete(void) ""
migration_call_notifiers(int type) "type=%d"
+migration_request_switchover_ack(const char *requester, unsigned int switchover_ack_pending_num) "Requester %s, switchover_ack_pending_num %u"
# migration-stats
migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
--
2.40.1
^ permalink raw reply related [flat|nested] 31+ messages in thread* Re: [PATCH 05/14] migration: Replace switchover_ack_needed SaveVMHandler
2026-05-05 8:14 ` [PATCH 05/14] migration: Replace switchover_ack_needed SaveVMHandler Avihai Horon
@ 2026-05-15 15:27 ` Peter Xu
0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2026-05-15 15:27 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On Tue, May 05, 2026 at 11:14:14AM +0300, Avihai Horon wrote:
> A new switchover-ack mechanism that will replace the existing one will
> be added in the following patches. The new mechanism will have a new
> SaveVMHandler and will not use switchover_ack_needed. However, the old
> mechanism must still be kept for backward compatibility.
>
> To keep things clear and decrease API surface of old code, replace
> switchover_ack_needed SaveVMHandler with a regular function
> migration_request_switchover_ack().
>
> No functional changes intended.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 06/14] migration: Rename switchover-ack code to legacy
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (4 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 05/14] migration: Replace switchover_ack_needed SaveVMHandler Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-05 8:14 ` [PATCH 07/14] migration: Make switchover-ack re-usable Avihai Horon
` (7 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
A new switchover-ack mechanism will be added in the following patches.
However, the old mechanism must still be kept for backward
compatibility.
Rename existing code that will be used only by the old switchover-ack
mechanism as legacy. This helps distinguish legacy code from new code,
making it more readable and easier to remove later when no longer
needed.
No functional change intended.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
include/migration/misc.h | 2 +-
migration/migration.h | 2 +-
migration/savevm.h | 2 +-
hw/vfio/migration.c | 14 +++++------
migration/migration.c | 8 +++----
migration/savevm.c | 51 ++++++++++++++++++++++++++--------------
migration/trace-events | 4 ++--
7 files changed, 50 insertions(+), 33 deletions(-)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index a2219c981b..4b43413aee 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -156,6 +156,6 @@ bool multifd_device_state_save_thread_should_exit(void);
void multifd_abort_device_state_save_threads(void);
bool multifd_join_device_state_save_threads(void);
-void migration_request_switchover_ack(const char *requester);
+void migration_request_switchover_ack_legacy(const char *requester);
#endif
diff --git a/migration/migration.h b/migration/migration.h
index ba0f9e0f9c..6099bac512 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -246,7 +246,7 @@ struct MigrationIncomingState {
* zero an ACK that it's OK to do switchover is sent to the source. No lock
* is needed as this field is updated serially.
*/
- unsigned int switchover_ack_pending_num;
+ unsigned int switchover_ack_pending_num_legacy;
/* Do exit on incoming migration failure */
bool exit_on_error;
diff --git a/migration/savevm.h b/migration/savevm.h
index 1efbe1d79d..fd0c4d3329 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,7 +70,7 @@ void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
Error **errp);
int qemu_load_device_state(QEMUFile *f, Error **errp);
-int qemu_loadvm_approve_switchover(const char *approver);
+int qemu_loadvm_approve_switchover_legacy(const char *approver);
int qemu_savevm_state_non_iterable(QEMUFile *f, Error **errp);
int qemu_savevm_state_non_iterable_early(QEMUFile *f,
JSONWriter *vmdesc,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4dab569e62..314095235d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -463,11 +463,11 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
}
-static void vfio_request_switchover_ack(VFIODevice *vbasedev)
+static void vfio_request_switchover_ack_legacy(VFIODevice *vbasedev)
{
if (vfio_precopy_supported(vbasedev)) {
/* Precopy support implies switchover-ack is needed */
- migration_request_switchover_ack(vbasedev->name);
+ migration_request_switchover_ack_legacy(vbasedev->name);
}
}
@@ -755,7 +755,7 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
return ret;
}
- vfio_request_switchover_ack(vbasedev);
+ vfio_request_switchover_ack_legacy(vbasedev);
return 0;
}
@@ -828,11 +828,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return -EINVAL;
}
- ret = qemu_loadvm_approve_switchover(vbasedev->name);
+ ret = qemu_loadvm_approve_switchover_legacy(vbasedev->name);
if (ret) {
- error_report(
- "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
- vbasedev->name, ret, strerror(-ret));
+ error_report("%s: qemu_loadvm_approve_switchover_legacy "
+ "failed, err=%d (%s)",
+ vbasedev->name, ret, strerror(-ret));
}
return ret;
diff --git a/migration/migration.c b/migration/migration.c
index 1db2d296bc..3c4385b5f7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2165,7 +2165,7 @@ void migration_rp_kick(MigrationState *s)
}
/* This is called only on destination side */
-void migration_request_switchover_ack(const char *requester)
+void migration_request_switchover_ack_legacy(const char *requester)
{
MigrationIncomingState *mis = migration_incoming_get_current();
@@ -2173,10 +2173,10 @@ void migration_request_switchover_ack(const char *requester)
return;
}
- mis->switchover_ack_pending_num++;
+ mis->switchover_ack_pending_num_legacy++;
- trace_migration_request_switchover_ack(requester,
- mis->switchover_ack_pending_num);
+ trace_migration_request_switchover_ack_legacy(
+ requester, mis->switchover_ack_pending_num_legacy);
}
static struct rp_cmd_args {
diff --git a/migration/savevm.c b/migration/savevm.c
index 2bc3414363..687d6761cc 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2462,6 +2462,31 @@ static int loadvm_postcopy_handle_switchover_start(Error **errp)
return 0;
}
+/*
+ * If legacy switchover-ack is enabled but no device uses it, need to send an
+ * ACK to source that it's OK to switchover.
+ */
+static int loadvm_switchover_ack_no_users_legacy(MigrationIncomingState *mis,
+ Error **errp)
+{
+ int ret;
+
+ if (!migrate_switchover_ack()) {
+ return 0;
+ }
+
+ if (!mis->switchover_ack_pending_num_legacy) {
+ ret = migrate_send_rp_switchover_ack(mis);
+ if (ret) {
+ error_setg_errno(errp, -ret,
+ "Could not send switchover ack RP MSG");
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
/*
* Process an incoming 'QEMU_VM_COMMAND'
* 0 just a normal return
@@ -2511,18 +2536,9 @@ static int loadvm_process_command(QEMUFile *f, Error **errp)
}
mis->to_src_file = qemu_file_get_return_path(f);
- /*
- * Switchover ack is enabled but no device uses it, so send an ACK to
- * source that it's OK to switchover. Do it here, after return path has
- * been created.
- */
- if (migrate_switchover_ack() && !mis->switchover_ack_pending_num) {
- ret = migrate_send_rp_switchover_ack(mis);
- if (ret) {
- error_setg_errno(errp, -ret,
- "Could not send switchover ack RP MSG");
- return ret;
- }
+ ret = loadvm_switchover_ack_no_users_legacy(mis, errp);
+ if (ret) {
+ return ret;
}
return 0;
@@ -3137,18 +3153,19 @@ int qemu_load_device_state(QEMUFile *f, Error **errp)
return 0;
}
-int qemu_loadvm_approve_switchover(const char *approver)
+int qemu_loadvm_approve_switchover_legacy(const char *approver)
{
MigrationIncomingState *mis = migration_incoming_get_current();
- if (!mis->switchover_ack_pending_num) {
+ if (!mis->switchover_ack_pending_num_legacy) {
return -EINVAL;
}
- mis->switchover_ack_pending_num--;
- trace_loadvm_approve_switchover(approver, mis->switchover_ack_pending_num);
+ mis->switchover_ack_pending_num_legacy--;
+ trace_loadvm_approve_switchover_legacy(
+ approver, mis->switchover_ack_pending_num_legacy);
- if (mis->switchover_ack_pending_num) {
+ if (mis->switchover_ack_pending_num_legacy) {
return 0;
}
diff --git a/migration/trace-events b/migration/trace-events
index 9f5f6142d3..d6795c64c7 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -23,7 +23,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
loadvm_process_command_ping(uint32_t val) "0x%x"
-loadvm_approve_switchover(const char *approver, unsigned int switchover_ack_pending_num) "Approver %s, switchover_ack_pending_num %u"
+loadvm_approve_switchover_legacy(const char *approver, unsigned int switchover_ack_pending_num_legacy) "Approver %s, switchover_ack_pending_num_legacy %u"
postcopy_ram_listen_thread_exit(void) ""
postcopy_ram_listen_thread_start(void) ""
qemu_savevm_send_postcopy_advise(void) ""
@@ -198,7 +198,7 @@ process_incoming_migration_co_postcopy_end_main(void) ""
postcopy_preempt_enabled(bool value) "%d"
migration_precopy_complete(void) ""
migration_call_notifiers(int type) "type=%d"
-migration_request_switchover_ack(const char *requester, unsigned int switchover_ack_pending_num) "Requester %s, switchover_ack_pending_num %u"
+migration_request_switchover_ack_legacy(const char *requester, unsigned int switchover_ack_pending_num_legacy) "Requester %s, switchover_ack_pending_num_legacy %u"
# migration-stats
migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
--
2.40.1
^ permalink raw reply related [flat|nested] 31+ messages in thread* [PATCH 07/14] migration: Make switchover-ack re-usable
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (5 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 06/14] migration: Rename switchover-ack code to legacy Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-07 14:10 ` Fabiano Rosas
2026-05-05 8:14 ` [PATCH 08/14] migration: Check switchover-ack during switchover phase Avihai Horon
` (6 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Switchover-ack is a mechanism for synchronizing the source and
destination QEMU during migration, to prevent the source from switching
over prematurely.
VFIO uses switchover-ack to ensure that switchover happens only after the
destination side has loaded the precopy initial bytes. This is important
for VFIO, as otherwise downtime could be higher.
In its current state, switchover-ack is a one-time mechanism: switchover
is acked only once, and after that another ACK cannot be requested. This
was sufficient until now, as VFIO precopy initial bytes were defined to
be monotonically decreasing. Thus, when precopy initial bytes reached
zero for all VFIO devices, a single ACK would be sent and its validity
would hold.
However, the new VFIO_PRECOPY_INFO_REINIT feature allows precopy initial
bytes to be re-initialized during precopy. Specifically, initial bytes
can grow after reaching zero, which would invalidate a previously sent
switchover ACK.
To solve this, make switchover-ack reusable and allow devices to request
another switchover ACK when needed.
To avoid scattering ACK requests all over the code, switchover ACKs are
requested through a new request_switchover_ack handler, which is called
at specific places.
Since a switchover ACK can now be requested for a specific device and at
different times, make the switchover ACK per-device (instead of a single
ACK for all devices) and let the source side do the pending-ACK
accounting.
Keep the legacy switchover-ack mechanism for backward compatibility and
turn it on via a compatibility property for older machines. Keep the
property enabled until VFIO implements the new switchover-ack.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
include/migration/client-options.h | 1 +
include/migration/register.h | 21 +++++++++
migration/migration.h | 15 +++++--
migration/savevm.h | 4 +-
hw/core/machine.c | 4 +-
hw/vfio/migration.c | 8 ++--
migration/migration.c | 38 +++++++++++++---
migration/options.c | 10 +++++
migration/savevm.c | 69 +++++++++++++++++++++++++++++-
migration/trace-events | 4 +-
10 files changed, 156 insertions(+), 18 deletions(-)
diff --git a/include/migration/client-options.h b/include/migration/client-options.h
index 289c9d7762..78b1daa1a6 100644
--- a/include/migration/client-options.h
+++ b/include/migration/client-options.h
@@ -13,6 +13,7 @@
/* properties */
bool migrate_send_switchover_start(void);
+bool migrate_switchover_ack_legacy(void);
/* capabilities */
diff --git a/include/migration/register.h b/include/migration/register.h
index eae4c4ffca..f43f47a679 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -30,6 +30,11 @@ typedef struct MigPendingData {
uint64_t total_bytes;
} MigPendingData;
+enum MigSwitchoverAckRequestStage {
+ MIG_SWITCHOVER_ACK_REQUEST_STAGE_SETUP,
+ MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT,
+};
+
/**
* struct SaveVMHandlers: handler structure to finely control
* migration of complex subsystems and devices, such as RAM, block and
@@ -299,6 +304,22 @@ typedef struct SaveVMHandlers {
*/
int (*resume_prepare)(MigrationState *s, void *opaque);
+ /**
+ * @request_switchover_ack
+ *
+ * Checks if a new switchover ACK is requested. Called only on source side
+ * in the stages specified in enum MigSwitchoverAckRequestStage.
+ *
+ * @stage: the stage in which the handler was called
+ * @opaque: data pointer passed to register_savevm_live()
+ * @requester: output pointer to be set to the name of the requester of the
+ * switchover ACK (for logging purposes). If not set, idstr will be used.
+ *
+ * Returns true to request switchover ACK and false otherwise
+ */
+ bool (*request_switchover_ack)(enum MigSwitchoverAckRequestStage stage,
+ void *opaque, const char **requester);
+
/**
* @switchover_start
*
diff --git a/migration/migration.h b/migration/migration.h
index 6099bac512..d46ecd967f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -494,6 +494,12 @@ struct MigrationState {
*/
uint8_t clear_bitmap_shift;
+ /*
+ * This decides whether to use legacy switchover ack (send ACK once for all
+ * devices) or new switchover ack (send ACK for each device).
+ */
+ bool switchover_ack_legacy;
+
/*
* This save hostname when out-going migration starts
*/
@@ -503,10 +509,13 @@ struct MigrationState {
JSONWriter *vmdesc;
/*
- * Indicates whether an ACK from the destination that it's OK to do
- * switchover has been received.
+ * Indicates the number of pending ACKs from the destination. The value may
+ * increase or decrease during precopy as new ACKs are requested or
+ * received. When zero is reached, it's OK to switchover. In legacy
+ * switchover-ack, it's initialized to 1 and decreased to zero upon ACK.
*/
- bool switchover_acked;
+ uint32_t switchover_ack_pending_num;
+
/* Is this a rdma migration */
bool rdma_migration;
diff --git a/migration/savevm.h b/migration/savevm.h
index fd0c4d3329..937acfa84c 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -37,6 +37,8 @@ bool qemu_savevm_state_blocked(Error **errp);
void qemu_savevm_non_migratable_list(strList **reasons);
int qemu_savevm_state_prepare(Error **errp);
int qemu_savevm_state_do_setup(QEMUFile *f, Error **errp);
+int qemu_savevm_request_switchover_ack(enum MigSwitchoverAckRequestStage stage,
+ Error **errp);
bool qemu_savevm_state_guest_unplug_pending(void);
int qemu_savevm_state_resume_prepare(MigrationState *s);
void qemu_savevm_send_header(QEMUFile *f);
@@ -70,7 +72,7 @@ void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
Error **errp);
int qemu_load_device_state(QEMUFile *f, Error **errp);
-int qemu_loadvm_approve_switchover_legacy(const char *approver);
+int qemu_loadvm_approve_switchover(const char *approver);
int qemu_savevm_state_non_iterable(QEMUFile *f, Error **errp);
int qemu_savevm_state_non_iterable_early(QEMUFile *f,
JSONWriter *vmdesc,
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 1b661fd36a..4f82813e8b 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -39,7 +39,9 @@
#include "hw/acpi/generic_event_device.h"
#include "qemu/audio.h"
-GlobalProperty hw_compat_11_0[] = {};
+GlobalProperty hw_compat_11_0[] = {
+ { "migration", "switchover-ack-legacy", "on" },
+};
const size_t hw_compat_11_0_len = G_N_ELEMENTS(hw_compat_11_0);
GlobalProperty hw_compat_10_2[] = {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 314095235d..2911583ee1 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -828,11 +828,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return -EINVAL;
}
- ret = qemu_loadvm_approve_switchover_legacy(vbasedev->name);
+ ret = qemu_loadvm_approve_switchover(vbasedev->name);
if (ret) {
- error_report("%s: qemu_loadvm_approve_switchover_legacy "
- "failed, err=%d (%s)",
- vbasedev->name, ret, strerror(-ret));
+ error_report(
+ "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
+ vbasedev->name, ret, strerror(-ret));
}
return ret;
diff --git a/migration/migration.c b/migration/migration.c
index 3c4385b5f7..b86ceea6ff 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1684,7 +1684,9 @@ int migrate_init(MigrationState *s, Error **errp)
s->vm_old_state = -1;
s->iteration_initial_bytes = 0;
s->threshold_size = 0;
- s->switchover_acked = false;
+ /* Legacy switchover-ack sends a single ACK for all devices */
+ qatomic_set(&s->switchover_ack_pending_num,
+ migrate_switchover_ack_legacy() ? 1 : 0);
s->rdma_migration = false;
/*
@@ -2169,7 +2171,7 @@ void migration_request_switchover_ack_legacy(const char *requester)
{
MigrationIncomingState *mis = migration_incoming_get_current();
- if (!migrate_switchover_ack()) {
+ if (!migrate_switchover_ack() || !migrate_switchover_ack_legacy()) {
return;
}
@@ -2425,9 +2427,18 @@ static void *source_return_path_thread(void *opaque)
break;
case MIG_RP_MSG_SWITCHOVER_ACK:
- ms->switchover_acked = true;
- trace_source_return_path_thread_switchover_acked();
+ {
+ uint32_t pending_num;
+
+ pending_num = qatomic_dec_fetch(&ms->switchover_ack_pending_num);
+ trace_source_return_path_thread_switchover_acked(pending_num);
+ if (pending_num == UINT32_MAX) {
+ error_setg(&err, "Switchover ack pending num underflowed");
+ goto out;
+ }
+
break;
+ }
default:
break;
@@ -3221,7 +3232,7 @@ static bool migration_can_switchover(MigrationState *s)
return true;
}
- return s->switchover_acked;
+ return qatomic_read(&s->switchover_ack_pending_num) == 0;
}
/* Migration thread iteration status */
@@ -3311,9 +3322,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
Error *local_err = NULL;
bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
- bool can_switchover = migration_can_switchover(s);
+ bool can_switchover;
MigPendingData pending = { };
bool complete_ready;
+ int ret;
/* Fast path - get the estimated amount of pending data */
qemu_savevm_query_pending(&pending, false);
@@ -3346,8 +3358,18 @@ static MigIterateState migration_iteration_run(MigrationState *s)
*/
if (migration_iteration_next_ready(s, &pending)) {
migration_iteration_go_next(&pending);
+ ret = qemu_savevm_request_switchover_ack(
+ MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT, &local_err);
+ if (ret < 0) {
+ migrate_error_propagate(s, local_err);
+ qemu_file_set_error(s->to_dst_file, ret);
+ return MIG_ITERATE_RESUME;
+ }
}
+ /* Check can switchover after qemu_savevm_request_switchover_ack() */
+ can_switchover = migration_can_switchover(s);
+
/* Should we switch to postcopy now? */
if (can_switchover && postcopy_should_start(s, &pending)) {
if (postcopy_start(s, &local_err)) {
@@ -3638,6 +3660,10 @@ static void *migration_thread(void *opaque)
bql_lock();
ret = qemu_savevm_state_do_setup(s->to_dst_file, &local_err);
bql_unlock();
+ if (!ret) {
+ ret = qemu_savevm_request_switchover_ack(
+ MIG_SWITCHOVER_ACK_REQUEST_STAGE_SETUP, &local_err);
+ }
qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
MIGRATION_STATUS_ACTIVE);
diff --git a/migration/options.c b/migration/options.c
index 7556fbc06b..44327c588f 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -108,6 +108,9 @@ const Property migration_properties[] = {
preempt_pre_7_2, false),
DEFINE_PROP_BOOL("multifd-clean-tls-termination", MigrationState,
multifd_clean_tls_termination, true),
+ /* Use legacy until VFIO implements new switchover-ack */
+ DEFINE_PROP_BOOL("switchover-ack-legacy", MigrationState,
+ switchover_ack_legacy, true),
/* Migration parameters */
DEFINE_PROP_UINT8("x-throttle-trigger-threshold", MigrationState,
@@ -462,6 +465,13 @@ bool migrate_rdma(void)
return s->rdma_migration;
}
+bool migrate_switchover_ack_legacy(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->switchover_ack_legacy;
+}
+
typedef enum WriteTrackingSupport {
WT_SUPPORT_UNKNOWN = 0,
WT_SUPPORT_ABSENT,
diff --git a/migration/savevm.c b/migration/savevm.c
index 687d6761cc..b6076579de 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1472,6 +1472,54 @@ int qemu_savevm_state_do_setup(QEMUFile *f, Error **errp)
return precopy_notify(PRECOPY_NOTIFY_SETUP, errp);
}
+static const char *
+switchover_ack_stage_to_str(enum MigSwitchoverAckRequestStage stage)
+{
+ switch (stage) {
+ case MIG_SWITCHOVER_ACK_REQUEST_STAGE_SETUP:
+ return "SETUP";
+ case MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT:
+ return "PENDING_EXACT";
+ default:
+ return "UNKNOWN";
+ }
+}
+
+int qemu_savevm_request_switchover_ack(enum MigSwitchoverAckRequestStage stage,
+ Error **errp)
+{
+ MigrationState *s = migrate_get_current();
+ uint32_t pending_num;
+ SaveStateEntry *se;
+ const char *requester;
+
+ if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
+ return 0;
+ }
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || !se->ops->request_switchover_ack) {
+ continue;
+ }
+
+ requester = NULL;
+ if (se->ops->request_switchover_ack(stage, se->opaque, &requester)) {
+ requester = requester ?: se->idstr;
+ pending_num = qatomic_inc_fetch(&s->switchover_ack_pending_num);
+ if (pending_num == 0) {
+ error_setg(errp, "Switchover ack pending num overflowed by %s",
+ requester);
+ return -EOVERFLOW;
+ }
+
+ trace_savevm_request_switchover_ack(
+ switchover_ack_stage_to_str(stage), requester, pending_num);
+ }
+ }
+
+ return 0;
+}
+
int qemu_savevm_state_resume_prepare(MigrationState *s)
{
SaveStateEntry *se;
@@ -2471,7 +2519,7 @@ static int loadvm_switchover_ack_no_users_legacy(MigrationIncomingState *mis,
{
int ret;
- if (!migrate_switchover_ack()) {
+ if (!migrate_switchover_ack() || !migrate_switchover_ack_legacy()) {
return 0;
}
@@ -3153,7 +3201,7 @@ int qemu_load_device_state(QEMUFile *f, Error **errp)
return 0;
}
-int qemu_loadvm_approve_switchover_legacy(const char *approver)
+static int qemu_loadvm_approve_switchover_legacy(const char *approver)
{
MigrationIncomingState *mis = migration_incoming_get_current();
@@ -3172,6 +3220,23 @@ int qemu_loadvm_approve_switchover_legacy(const char *approver)
return migrate_send_rp_switchover_ack(mis);
}
+int qemu_loadvm_approve_switchover(const char *approver)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ if (!migrate_switchover_ack()) {
+ return 0;
+ }
+
+ if (migrate_switchover_ack_legacy()) {
+ return qemu_loadvm_approve_switchover_legacy(approver);
+ }
+
+ trace_loadvm_approve_switchover(approver);
+
+ return migrate_send_rp_switchover_ack(mis);
+}
+
bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
char *buf, size_t len, Error **errp)
{
diff --git a/migration/trace-events b/migration/trace-events
index d6795c64c7..be3e688c71 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -24,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s:
loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
loadvm_process_command_ping(uint32_t val) "0x%x"
loadvm_approve_switchover_legacy(const char *approver, unsigned int switchover_ack_pending_num_legacy) "Approver %s, switchover_ack_pending_num_legacy %u"
+loadvm_approve_switchover(const char *approver) "Approver %s"
postcopy_ram_listen_thread_exit(void) ""
postcopy_ram_listen_thread_start(void) ""
qemu_savevm_send_postcopy_advise(void) ""
@@ -40,6 +41,7 @@ savevm_send_postcopy_resume(void) ""
savevm_send_recv_bitmap(char *name) "%s"
savevm_send_switchover_start(void) ""
savevm_state_setup(void) ""
+savevm_request_switchover_ack(const char *stage, const char *requester, uint32_t pending_num) "Stage %s, requester %s, switchover_ack_pending_num %" PRIu32
savevm_state_resume_prepare(void) ""
savevm_state_header(void) ""
savevm_state_iterate(void) ""
@@ -189,7 +191,7 @@ source_return_path_thread_loop_top(void) ""
source_return_path_thread_pong(uint32_t val) "0x%x"
source_return_path_thread_shut(uint32_t val) "0x%x"
source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
-source_return_path_thread_switchover_acked(void) ""
+source_return_path_thread_switchover_acked(uint32_t pending_num) "switchover_ack_pending_num %" PRIu32
source_return_path_thread_postcopy_package_loaded(void) ""
migration_thread_low_pending(uint64_t pending) "%" PRIu64
migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
--
2.40.1
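The pending-ACK accounting in the patch above relies on unsigned wraparound to detect bookkeeping bugs: qemu_savevm_request_switchover_ack() treats qatomic_inc_fetch() returning 0 as an overflow, and the return-path thread treats qatomic_dec_fetch() returning UINT32_MAX as an ACK with no outstanding request. A standalone sketch of the same pattern using C11 atomics (the names here are illustrative, not QEMU's):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <errno.h>

/* Illustrative stand-in for MigrationState.switchover_ack_pending_num */
static _Atomic uint32_t pending_num;

/* Source side: a device requested another switchover ACK.
 * Returns -EOVERFLOW if the counter wrapped around to zero. */
int ack_request(void)
{
    /* fetch_add returns the old value; +1 gives the incremented value,
     * like qatomic_inc_fetch() in the patch. */
    uint32_t n = atomic_fetch_add(&pending_num, 1) + 1;
    return (n == 0) ? -EOVERFLOW : 0;
}

/* Return path: one MIG_RP_MSG_SWITCHOVER_ACK arrived.
 * Returns -EINVAL if no ACK was outstanding (counter underflow). */
int ack_received(void)
{
    uint32_t n = atomic_fetch_sub(&pending_num, 1) - 1;
    return (n == UINT32_MAX) ? -EINVAL : 0;
}

/* Switchover is allowed only when no ACKs are pending. */
int can_switchover(void)
{
    return atomic_load(&pending_num) == 0;
}
```

An unexpected ACK underflows to UINT32_MAX and is reported as an error rather than silently re-arming the counter, matching the `goto out` error path in source_return_path_thread().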
^ permalink raw reply related [flat|nested] 31+ messages in thread* Re: [PATCH 07/14] migration: Make switchover-ack re-usable
2026-05-05 8:14 ` [PATCH 07/14] migration: Make switchover-ack re-usable Avihai Horon
@ 2026-05-07 14:10 ` Fabiano Rosas
0 siblings, 0 replies; 31+ messages in thread
From: Fabiano Rosas @ 2026-05-07 14:10 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Avihai Horon <avihaih@nvidia.com> writes:
> Switchover-ack is a mechanism to synchronize between source and
> destination QEMU during migration to prevent the source from switching
> over prematurely.
>
> VFIO uses switchover-ack to ensure switchover happens only after
> destination side has loaded the precopy initial bytes. This is important
> for VFIO, as otherwise downtime could be impacted and be higher.
>
> In its current state, switchover-ack is a one-time mechanism, meaning
> that switchover is acked only once and past that another ACK cannot be
> requested again. This was sufficient until now, as VFIO precopy initial
> bytes was defined to be monotonically decreasing. Thus, when precopy
> initial bytes reached zero for all VFIO devices, a single ACK would be
> sent and its validity would hold.
>
> However, now the new VFIO_PRECOPY_INFO_REINIT feature allows precopy
> initial bytes to be re-initialized during precopy. Specifically, it
> means that initial bytes can grow after reaching zero, which would
> invalidate a previously sent switchover ACK.
>
> To solve this, make switchover-ack reusable and allow devices to request
> another switchover ACK when needed.
>
> To avoid scattering them all over, switchover ACKs are requested through
> a new request_switchover_ack handler which is called in specific places.
>
> Since now switchover ACK can be requested for a specific device and in
> different times, make switchover ACK per-device (instead of a single ACK
> for all devices) and let source side do the pending ACKs accounting.
>
> Keep the legacy switchover-ack mechanism for backward compatibility and
> turn it on by a compatibility property for older machines. Enable the
> property until VFIO implements the new switchover-ack.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 08/14] migration: Check switchover-ack during switchover phase
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (6 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 07/14] migration: Make switchover-ack re-usable Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-05 8:14 ` [PATCH 09/14] vfio/migration: Re-query precopy size before sending VFIO_MIG_FLAG_DEV_INIT_DATA_SENT Avihai Horon
` (5 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Switchover ACK is checked only during precopy while the guest is still
running. The last migration_can_switchover() decision and guest stop are
not atomic, so a device may want to request another switchover ACK in
the gap after the switchover decision has been made but before the
guest is stopped. Migration would then miss that request, which can increase
downtime.
Cover this case by checking switchover ACK again during the switchover
phase, after the guest has been stopped. If a new ACK is required at
that point, fail migration so the condition is not ignored.
Ideally, precopy iterations would be resumed in this case. However,
VFIO doesn't support going back to precopy after being stopped, so
implementing such logic would require non-trivial changes to the guest
start/stop flow. Given the above and that this case should be rare,
failing the migration seems reasonable.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
include/migration/register.h | 1 +
migration/migration.c | 7 +++++++
migration/savevm.c | 9 +++++++++
3 files changed, 17 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index f43f47a679..755e590676 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -33,6 +33,7 @@ typedef struct MigPendingData {
enum MigSwitchoverAckRequestStage {
MIG_SWITCHOVER_ACK_REQUEST_STAGE_SETUP,
MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT,
+ MIG_SWITCHOVER_ACK_REQUEST_STAGE_COMPLETE,
};
/**
diff --git a/migration/migration.c b/migration/migration.c
index b86ceea6ff..94980fcd37 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2776,6 +2776,13 @@ static bool migration_switchover_prepare(MigrationState *s)
static bool migration_switchover_start(MigrationState *s, Error **errp)
{
ERRP_GUARD();
+ int ret;
+
+ ret = qemu_savevm_request_switchover_ack(
+ MIG_SWITCHOVER_ACK_REQUEST_STAGE_COMPLETE, errp);
+ if (ret < 0) {
+ return false;
+ }
if (!migration_switchover_prepare(s)) {
error_setg(errp, "Switchover is interrupted");
diff --git a/migration/savevm.c b/migration/savevm.c
index b6076579de..9150cb93ad 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1480,6 +1480,8 @@ switchover_ack_stage_to_str(enum MigSwitchoverAckRequestStage stage)
return "SETUP";
case MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT:
return "PENDING_EXACT";
+ case MIG_SWITCHOVER_ACK_REQUEST_STAGE_COMPLETE:
+ return "COMPLETE";
default:
return "UNKNOWN";
}
@@ -1505,6 +1507,13 @@ int qemu_savevm_request_switchover_ack(enum MigSwitchoverAckRequestStage stage,
requester = NULL;
if (se->ops->request_switchover_ack(stage, se->opaque, &requester)) {
requester = requester ?: se->idstr;
+ if (stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_COMPLETE) {
+ error_setg(errp,
+ "Switchover ACK requested by %s during switchover",
+ requester);
+ return -EPERM;
+ }
+
pending_num = qatomic_inc_fetch(&s->switchover_ack_pending_num);
if (pending_num == 0) {
error_setg(errp, "Switchover ack pending num overflowed by %s",
--
2.40.1
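The stage handling added above can be condensed into a standalone sketch: an ACK request is accounted normally during the earlier stages, but at the COMPLETE stage (guest already stopped) it fails the migration instead of being ignored. This is a simplified model with made-up names, not QEMU's actual handler signature:

```c
#include <errno.h>
#include <stdint.h>

enum stage { STAGE_SETUP, STAGE_PENDING_EXACT, STAGE_COMPLETE };

static uint32_t pending;  /* simplified: no atomics in this sketch */

/* Mimics qemu_savevm_request_switchover_ack() for a single device;
 * device_wants_ack models the request_switchover_ack handler result. */
int request_switchover_ack(enum stage s, int device_wants_ack)
{
    if (!device_wants_ack) {
        return 0;
    }
    if (s == STAGE_COMPLETE) {
        /* The guest is stopped and going back to precopy is
         * unsupported, so fail rather than ignore the request. */
        return -EPERM;
    }
    pending++;
    return 0;
}
```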
^ permalink raw reply related [flat|nested] 31+ messages in thread* [PATCH 09/14] vfio/migration: Re-query precopy size before sending VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (7 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 08/14] migration: Check switchover-ack during switchover phase Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-07 8:24 ` Cédric Le Goater
2026-05-05 8:14 ` [PATCH 10/14] vfio/migration: Add Error ** parameter to vfio_migration_init() Avihai Horon
` (4 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
When precopy initial_bytes reaches zero, the
VFIO_MIG_FLAG_DEV_INIT_DATA_SENT flag is sent to the destination to
indicate that the initial data has been sent, so the destination can
signal back to the source when it has finished loading it.
To get a more accurate estimation of initial_bytes, re-query precopy
size before sending the flag. Extract the flag sending logic from
vfio_save_iterate() to a new helper for clarity.
This may prevent premature sending of the
VFIO_MIG_FLAG_DEV_INIT_DATA_SENT flag if, for example, the previously
queried initial_bytes was lower than it actually is. Additionally, it
prevents sending the flag if vfio_query_precopy_size() fails.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
hw/vfio/migration.c | 37 ++++++++++++++++++++++++++++++++-----
hw/vfio/trace-events | 1 +
2 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2911583ee1..243624b5fe 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -456,6 +456,37 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
data_size);
}
+/* Returns true if the init data flag was sent, false otherwise */
+static bool vfio_send_init_data_flag(QEMUFile *f, VFIOMigration *migration)
+{
+ VFIODevice *vbasedev = migration->vbasedev;
+ int ret;
+
+ if (!migrate_switchover_ack()) {
+ return false;
+ }
+
+ if (migration->precopy_init_size || migration->initial_data_sent) {
+ return false;
+ }
+
+ /*
+ * precopy_init_size holds an estimation of the initial data size, re-query
+ * precopy size to ensure it's really zero before sending init data flag.
+ * Don't send the flag if query fails.
+ */
+ ret = vfio_query_precopy_size(migration);
+ if (ret || migration->precopy_init_size) {
+ return false;
+ }
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
+ migration->initial_data_sent = true;
+ trace_vfio_send_init_data_flag(vbasedev->name);
+
+ return true;
+}
+
static bool vfio_precopy_supported(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;
@@ -664,11 +695,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
vfio_update_estimated_pending_data(migration, data_size);
- if (migrate_switchover_ack() && !migration->precopy_init_size &&
- !migration->initial_data_sent) {
- qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
- migration->initial_data_sent = true;
- } else {
+ if (!vfio_send_init_data_flag(f, migration)) {
qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ab27ff5ea2..e91858354c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -176,6 +176,7 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
vfio_save_iterate_start(const char *name) " (%s)"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool exact) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64 " exact %d"
+vfio_send_init_data_flag(const char *name) " (%s)"
vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
--
2.40.1
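The decision flow of the new helper can be sketched standalone: the cached estimate gates a re-query, and the flag is reported as sent only if the fresh query succeeds and confirms zero. The query function here is a stub standing in for the VFIO_DEVICE_FEATURE ioctl behind vfio_query_precopy_size(); field and function names are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

struct mig {
    uint64_t precopy_init_size;   /* cached estimate */
    bool initial_data_sent;
    bool switchover_ack_enabled;
};

/* Stub for vfio_query_precopy_size(): refreshes the cached size;
 * a non-zero ret models an ioctl failure. */
static int query_precopy_size(struct mig *m, int ret, uint64_t fresh)
{
    if (ret) {
        return ret;
    }
    m->precopy_init_size = fresh;
    return 0;
}

/* Returns true if the init-data flag should be sent now. */
bool should_send_init_data_flag(struct mig *m, int query_ret, uint64_t fresh)
{
    if (!m->switchover_ack_enabled) {
        return false;
    }
    if (m->precopy_init_size || m->initial_data_sent) {
        return false;
    }
    /* Estimate says zero: re-query to make sure, and don't send the
     * flag if the query fails or the fresh value is non-zero. */
    if (query_precopy_size(m, query_ret, fresh) || m->precopy_init_size) {
        return false;
    }
    m->initial_data_sent = true;
    return true;
}
```

The caller would emit VFIO_MIG_FLAG_DEV_INIT_DATA_SENT on true and VFIO_MIG_FLAG_END_OF_STATE otherwise, as in vfio_save_iterate() above.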
^ permalink raw reply related [flat|nested] 31+ messages in thread* Re: [PATCH 09/14] vfio/migration: Re-query precopy size before sending VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
2026-05-05 8:14 ` [PATCH 09/14] vfio/migration: Re-query precopy size before sending VFIO_MIG_FLAG_DEV_INIT_DATA_SENT Avihai Horon
@ 2026-05-07 8:24 ` Cédric Le Goater
2026-05-08 13:10 ` Avihai Horon
0 siblings, 1 reply; 31+ messages in thread
From: Cédric Le Goater @ 2026-05-07 8:24 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
Hello Avihai
On 5/5/26 10:14, Avihai Horon wrote:
> When precopy initial_bytes reaches zero VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
> flag is sent to the destination to indicate that initial data has been
> sent, so destination can indicate back to source when it finished
> loading it.
>
> To get a more accurate estimation of initial_bytes, re-query precopy
> size before sending the flag. Extract the flag sending logic from
> vfio_save_iterate() to a new helper for clarity.
I would prefer to separate the changes, so a patch for this new routine,
and then :
>
> This may prevent premature sending of VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
> flag if, for example, the previously queried initial_bytes was lower
> than actually is. Additionally, it prevents sending the flag if
> vfio_query_precopy_size() failed.
Thanks,
C.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> hw/vfio/migration.c | 37 ++++++++++++++++++++++++++++++++-----
> hw/vfio/trace-events | 1 +
> 2 files changed, 33 insertions(+), 5 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 2911583ee1..243624b5fe 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -456,6 +456,37 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> data_size);
> }
>
> +/* Returns true if the init data flag was sent, false otherwise */
> +static bool vfio_send_init_data_flag(QEMUFile *f, VFIOMigration *migration)
> +{
> + VFIODevice *vbasedev = migration->vbasedev;
> + int ret;
> +
> + if (!migrate_switchover_ack()) {
> + return false;
> + }
> +
> + if (migration->precopy_init_size || migration->initial_data_sent) {
> + return false;
> + }
> +
> + /*
> + * precopy_init_size holds an estimation of the initial data size, re-query
> + * precopy size to ensure it's really zero before sending init data flag.
> + * Don't send the flag if query fails.
> + */
> + ret = vfio_query_precopy_size(migration);
> + if (ret || migration->precopy_init_size) {
> + return false;
> + }
> +
> + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
> + migration->initial_data_sent = true;
> + trace_vfio_send_init_data_flag(vbasedev->name);
> +
> + return true;
> +}
> +
> static bool vfio_precopy_supported(VFIODevice *vbasedev)
> {
> VFIOMigration *migration = vbasedev->migration;
> @@ -664,11 +695,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>
> vfio_update_estimated_pending_data(migration, data_size);
>
> - if (migrate_switchover_ack() && !migration->precopy_init_size &&
> - !migration->initial_data_sent) {
> - qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
> - migration->initial_data_sent = true;
> - } else {
> + if (!vfio_send_init_data_flag(f, migration)) {
> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> }
>
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ab27ff5ea2..e91858354c 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -176,6 +176,7 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
> vfio_save_iterate_start(const char *name) " (%s)"
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
> vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool exact) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64 " exact %d"
> +vfio_send_init_data_flag(const char *name) " (%s)"
> vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [PATCH 09/14] vfio/migration: Re-query precopy size before sending VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
2026-05-07 8:24 ` Cédric Le Goater
@ 2026-05-08 13:10 ` Avihai Horon
0 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-08 13:10 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On 5/7/2026 11:24 AM, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> Hello Avihai
>
> On 5/5/26 10:14, Avihai Horon wrote:
>> When precopy initial_bytes reaches zero VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
>> flag is sent to the destination to indicate that initial data has been
>> sent, so destination can indicate back to source when it finished
>> loading it.
>>
>> To get a more accurate estimation of initial_bytes, re-query precopy
>> size before sending the flag. Extract the flag sending logic from
>> vfio_save_iterate() to a new helper for clarity.
>
> I would prefer to separate the changes, so a patch for this new routine,
> and then :
>
>>
>> This may prevent premature sending of VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
>> flag if, for example, the previously queried initial_bytes was lower
>> than actually is. Additionally, it prevents sending the flag if
>> vfio_query_precopy_size() failed.
>
Of course, will do.
Thanks.
>
>
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>> hw/vfio/migration.c | 37 ++++++++++++++++++++++++++++++++-----
>> hw/vfio/trace-events | 1 +
>> 2 files changed, 33 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 2911583ee1..243624b5fe 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -456,6 +456,37 @@ static void
>> vfio_update_estimated_pending_data(VFIOMigration *migration,
>> data_size);
>> }
>>
>> +/* Returns true if the init data flag was sent, false otherwise */
>> +static bool vfio_send_init_data_flag(QEMUFile *f, VFIOMigration
>> *migration)
>> +{
>> + VFIODevice *vbasedev = migration->vbasedev;
>> + int ret;
>> +
>> + if (!migrate_switchover_ack()) {
>> + return false;
>> + }
>> +
>> + if (migration->precopy_init_size || migration->initial_data_sent) {
>> + return false;
>> + }
>> +
>> + /*
>> + * precopy_init_size holds an estimation of the initial data
>> size, re-query
>> + * precopy size to ensure it's really zero before sending init
>> data flag.
>> + * Don't send the flag if query fails.
>> + */
>> + ret = vfio_query_precopy_size(migration);
>> + if (ret || migration->precopy_init_size) {
>> + return false;
>> + }
>> +
>> + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>> + migration->initial_data_sent = true;
>> + trace_vfio_send_init_data_flag(vbasedev->name);
>> +
>> + return true;
>> +}
>> +
>> static bool vfio_precopy_supported(VFIODevice *vbasedev)
>> {
>> VFIOMigration *migration = vbasedev->migration;
>> @@ -664,11 +695,7 @@ static int vfio_save_iterate(QEMUFile *f, void
>> *opaque)
>>
>> vfio_update_estimated_pending_data(migration, data_size);
>>
>> - if (migrate_switchover_ack() && !migration->precopy_init_size &&
>> - !migration->initial_data_sent) {
>> - qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>> - migration->initial_data_sent = true;
>> - } else {
>> + if (!vfio_send_init_data_flag(f, migration)) {
>> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> }
>>
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index ab27ff5ea2..e91858354c 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -176,6 +176,7 @@ vfio_save_iterate(const char *name, uint64_t
>> precopy_init_size, uint64_t precopy
>> vfio_save_iterate_start(const char *name) " (%s)"
>> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s)
>> data buffer size %"PRIu64
>> vfio_state_pending(const char *name, uint64_t stopcopy_size,
>> uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool exact)
>> " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy
>> dirty size %"PRIu64 " exact %d"
>> +vfio_send_init_data_flag(const char *name) " (%s)"
>> vfio_vmstate_change(const char *name, int running, const char
>> *reason, const char *dev_state) " (%s) running %d reason %s device
>> state %s"
>> vfio_vmstate_change_prepare(const char *name, int running, const
>> char *reason, const char *dev_state) " (%s) running %d reason %s
>> device state %s"
>>
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 10/14] vfio/migration: Add Error ** parameter to vfio_migration_init()
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (8 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 09/14] vfio/migration: Re-query precopy size before sending VFIO_MIG_FLAG_DEV_INIT_DATA_SENT Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-07 7:59 ` Cédric Le Goater
2026-05-05 8:14 ` [PATCH 11/14] vfio/migration: Add new switchover-ack mechanism Avihai Horon
` (3 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
vfio_migration_init() already has many failure points and a new one will
be added in the next patch.
Add Error ** parameter to vfio_migration_init() to report a detailed
error message through it.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
hw/vfio/migration.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 243624b5fe..b7e929274a 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1038,7 +1038,7 @@ static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
}
-static int vfio_migration_init(VFIODevice *vbasedev)
+static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
{
int ret;
Object *obj;
@@ -1047,23 +1047,38 @@ static int vfio_migration_init(VFIODevice *vbasedev)
g_autofree char *path = NULL, *oid = NULL;
uint64_t mig_flags = 0;
VMChangeStateHandler *prepare_cb;
+ g_autofree char *error_prefix =
+ g_strdup_printf("%s: VFIO migration init failed:", vbasedev->name);
if (!vbasedev->ops->vfio_get_object) {
+ error_setg(errp, "%s no vfio_get_object handler", error_prefix);
return -EINVAL;
}
obj = vbasedev->ops->vfio_get_object(vbasedev);
if (!obj) {
+ error_setg(errp, "%s failed to get object", error_prefix);
return -EINVAL;
}
ret = vfio_migration_query_flags(vbasedev, &mig_flags);
if (ret) {
+ if (ret == -ENOTTY) {
+ error_setg_errno(errp, -ret,
+ "%s migration is not supported in kernel",
+ error_prefix);
+ } else {
+ error_setg_errno(errp, -ret, "%s failed to query migration flags",
+ error_prefix);
+ }
+
return ret;
}
/* Basic migration functionality must be supported */
if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+ error_setg(errp, "%s VFIO_MIGRATION_STOP_COPY is not supported",
+ error_prefix);
return -EOPNOTSUPP;
}
@@ -1261,18 +1276,8 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
return !vfio_block_migration(vbasedev, err, errp);
}
- ret = vfio_migration_init(vbasedev);
+ ret = vfio_migration_init(vbasedev, &err);
if (ret) {
- if (ret == -ENOTTY) {
- error_setg(&err, "%s: VFIO migration is not supported in kernel",
- vbasedev->name);
- } else {
- error_setg(&err,
- "%s: Migration couldn't be initialized for VFIO device, "
- "err: %d (%s)",
- vbasedev->name, ret, strerror(-ret));
- }
-
return !vfio_block_migration(vbasedev, err, errp);
}
--
2.40.1
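The conversion above follows a common refactoring pattern: the callee builds the detailed message and the caller only propagates it. A minimal standalone mimic of the Error ** out-parameter idiom, not QEMU's real Error API (error_setg()/error_prepend() there carry errno, source location, and more):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Error { char msg[128]; } Error;

/* Toy stand-in for error_setg(): allocate and format a message. */
static void error_set(Error **errp, const char *what)
{
    if (errp) {
        *errp = malloc(sizeof(**errp));
        snprintf((*errp)->msg, sizeof((*errp)->msg),
                 "VFIO migration init failed: %s", what);
    }
}

/* The callee reports *why* it failed through errp, like the reworked
 * vfio_migration_init(); the caller no longer re-derives the reason
 * from the errno-style return value. */
static int migration_init(int has_get_object, Error **errp)
{
    if (!has_get_object) {
        error_set(errp, "no vfio_get_object handler");
        return -EINVAL;
    }
    return 0;
}
```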
^ permalink raw reply related [flat|nested] 31+ messages in thread* Re: [PATCH 10/14] vfio/migration: Add Error ** parameter to vfio_migration_init()
2026-05-05 8:14 ` [PATCH 10/14] vfio/migration: Add Error ** parameter to vfio_migration_init() Avihai Horon
@ 2026-05-07 7:59 ` Cédric Le Goater
2026-05-08 13:18 ` Avihai Horon
0 siblings, 1 reply; 31+ messages in thread
From: Cédric Le Goater @ 2026-05-07 7:59 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
Hello Avihai,
On 5/5/26 10:14, Avihai Horon wrote:
> vfio_migration_init() already has many failure points and a new one will
> be added in next patch.
>
> Add Error ** parameter to vfio_migration_init() to report a detailed
> error message through it.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> hw/vfio/migration.c | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 243624b5fe..b7e929274a 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -1038,7 +1038,7 @@ static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
> return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
> }
>
> -static int vfio_migration_init(VFIODevice *vbasedev)
> +static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
> {
> int ret;
> Object *obj;
> @@ -1047,23 +1047,38 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> g_autofree char *path = NULL, *oid = NULL;
> uint64_t mig_flags = 0;
> VMChangeStateHandler *prepare_cb;
> + g_autofree char *error_prefix =
> + g_strdup_printf("%s: VFIO migration init failed:", vbasedev->name);
We have error_prepend() for this purpose.
Thanks,
C.
>
> if (!vbasedev->ops->vfio_get_object) {
> + error_setg(errp, "%s no vfio_get_object handler", error_prefix);
> return -EINVAL;
> }
>
> obj = vbasedev->ops->vfio_get_object(vbasedev);
> if (!obj) {
> + error_setg(errp, "%s failed to get object", error_prefix);
> return -EINVAL;
> }
>
> ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> if (ret) {
> + if (ret == -ENOTTY) {
> + error_setg_errno(errp, -ret,
> + "%s migration is not supported in kernel",
> + error_prefix);
> + } else {
> + error_setg_errno(errp, -ret, "%s failed to query migration flags",
> + error_prefix);
> + }
> +
> return ret;
> }
>
> /* Basic migration functionality must be supported */
> if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> + error_setg(errp, "%s VFIO_MIGRATION_STOP_COPY is not supported",
> + error_prefix);
> return -EOPNOTSUPP;
> }
>
> @@ -1261,18 +1276,8 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
> return !vfio_block_migration(vbasedev, err, errp);
> }
>
> - ret = vfio_migration_init(vbasedev);
> + ret = vfio_migration_init(vbasedev, &err);
> if (ret) {
> - if (ret == -ENOTTY) {
> - error_setg(&err, "%s: VFIO migration is not supported in kernel",
> - vbasedev->name);
> - } else {
> - error_setg(&err,
> - "%s: Migration couldn't be initialized for VFIO device, "
> - "err: %d (%s)",
> - vbasedev->name, ret, strerror(-ret));
> - }
> -
> return !vfio_block_migration(vbasedev, err, errp);
> }
>
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [PATCH 10/14] vfio/migration: Add Error ** parameter to vfio_migration_init()
2026-05-07 7:59 ` Cédric Le Goater
@ 2026-05-08 13:18 ` Avihai Horon
2026-05-08 13:21 ` Avihai Horon
0 siblings, 1 reply; 31+ messages in thread
From: Avihai Horon @ 2026-05-08 13:18 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On 5/7/2026 10:59 AM, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> Hello Avihai,
>
> On 5/5/26 10:14, Avihai Horon wrote:
>> vfio_migration_init() already has many failure points and a new one will
>> be added in next patch.
>>
>> Add Error ** parameter to vfio_migration_init() to report a detailed
>> error message through it.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>> hw/vfio/migration.c | 29 +++++++++++++++++------------
>> 1 file changed, 17 insertions(+), 12 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 243624b5fe..b7e929274a 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -1038,7 +1038,7 @@ static bool
>> vfio_dma_logging_supported(VFIODevice *vbasedev)
>> return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
>> }
>>
>> -static int vfio_migration_init(VFIODevice *vbasedev)
>> +static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
>> {
>> int ret;
>> Object *obj;
>> @@ -1047,23 +1047,38 @@ static int vfio_migration_init(VFIODevice
>> *vbasedev)
>> g_autofree char *path = NULL, *oid = NULL;
>> uint64_t mig_flags = 0;
>> VMChangeStateHandler *prepare_cb;
>> + g_autofree char *error_prefix =
>> + g_strdup_printf("%s: VFIO migration init failed:",
>> vbasedev->name);
>
> We have error_prepend() for this purpose.
Right.
I was trying to avoid duplicating the prefix on each fail branch.
Do you suggest adding a common "err:" goto label at the bottom and putting
there a single:
error_prepend(errp, "%s: VFIO migration init failed:", vbasedev->name);
?
Thanks.
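(For illustration, the goto-label pattern being discussed could be sketched
roughly like below. The Error struct and error_setg()/error_prepend() here
are minimal standalone stand-ins, not QEMU's real Error API, and the device
name and messages are made up.)

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-ins for QEMU's Error API, just to show the shape. */
typedef struct Error { char msg[256]; } Error;

static void error_setg(Error **errp, const char *msg)
{
    *errp = malloc(sizeof(Error));
    snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
}

static void error_prepend(Error **errp, const char *prefix)
{
    char tmp[512];
    snprintf(tmp, sizeof(tmp), "%s %s", prefix, (*errp)->msg);
    snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", tmp);
}

/* The pattern under discussion: set a specific error at each failure
 * point, then prepend the common prefix once at a shared "err" label. */
static int migration_init_sketch(int fail_point, Error **errp)
{
    int ret = 0;

    if (fail_point == 1) {
        error_setg(errp, "no vfio_get_object handler");
        ret = -1;
        goto err;
    }
    if (fail_point == 2) {
        error_setg(errp, "failed to get object");
        ret = -1;
        goto err;
    }
    return 0;

err:
    error_prepend(errp, "vfio0: VFIO migration init failed:");
    return ret;
}
```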
>
>
>>
>> if (!vbasedev->ops->vfio_get_object) {
>> + error_setg(errp, "%s no vfio_get_object handler",
>> error_prefix);
>> return -EINVAL;
>> }
>>
>> obj = vbasedev->ops->vfio_get_object(vbasedev);
>> if (!obj) {
>> + error_setg(errp, "%s failed to get object", error_prefix);
>> return -EINVAL;
>> }
>>
>> ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>> if (ret) {
>> + if (ret == -ENOTTY) {
>> + error_setg_errno(errp, -ret,
>> + "%s migration is not supported in kernel",
>> + error_prefix);
>> + } else {
>> + error_setg_errno(errp, -ret, "%s failed to query
>> migration flags",
>> + error_prefix);
>> + }
>> +
>> return ret;
>> }
>>
>> /* Basic migration functionality must be supported */
>> if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>> + error_setg(errp, "%s VFIO_MIGRATION_STOP_COPY is not
>> supported",
>> + error_prefix);
>> return -EOPNOTSUPP;
>> }
>>
>> @@ -1261,18 +1276,8 @@ bool vfio_migration_realize(VFIODevice
>> *vbasedev, Error **errp)
>> return !vfio_block_migration(vbasedev, err, errp);
>> }
>>
>> - ret = vfio_migration_init(vbasedev);
>> + ret = vfio_migration_init(vbasedev, &err);
>> if (ret) {
>> - if (ret == -ENOTTY) {
>> - error_setg(&err, "%s: VFIO migration is not supported in
>> kernel",
>> - vbasedev->name);
>> - } else {
>> - error_setg(&err,
>> - "%s: Migration couldn't be initialized for
>> VFIO device, "
>> - "err: %d (%s)",
>> - vbasedev->name, ret, strerror(-ret));
>> - }
>> -
>> return !vfio_block_migration(vbasedev, err, errp);
>> }
>>
>
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [PATCH 10/14] vfio/migration: Add Error ** parameter to vfio_migration_init()
2026-05-08 13:18 ` Avihai Horon
@ 2026-05-08 13:21 ` Avihai Horon
0 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-08 13:21 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Pierrick Bouvier,
Philippe Mathieu-Daudé, Zhao Liu, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Maor Gottlieb
On 5/8/2026 4:18 PM, Avihai Horon wrote:
>
> On 5/7/2026 10:59 AM, Cédric Le Goater wrote:
>>
>>
>> Hello Avihai,
>>
>> On 5/5/26 10:14, Avihai Horon wrote:
>>> vfio_migration_init() already has many failure points and a new one
>>> will
>>> be added in next patch.
>>>
>>> Add Error ** parameter to vfio_migration_init() to report a detailed
>>> error message through it.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>> hw/vfio/migration.c | 29 +++++++++++++++++------------
>>> 1 file changed, 17 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 243624b5fe..b7e929274a 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -1038,7 +1038,7 @@ static bool
>>> vfio_dma_logging_supported(VFIODevice *vbasedev)
>>> return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
>>> }
>>>
>>> -static int vfio_migration_init(VFIODevice *vbasedev)
>>> +static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
>>> {
>>> int ret;
>>> Object *obj;
>>> @@ -1047,23 +1047,38 @@ static int vfio_migration_init(VFIODevice
>>> *vbasedev)
>>> g_autofree char *path = NULL, *oid = NULL;
>>> uint64_t mig_flags = 0;
>>> VMChangeStateHandler *prepare_cb;
>>> + g_autofree char *error_prefix =
>>> + g_strdup_printf("%s: VFIO migration init failed:",
>>> vbasedev->name);
>>
>> We have error_prepend() for this purpose.
>
> Right.
> I was trying to avoid duplicating the prefix on each fail branch.
>
> Do you suggest to add a common "err:" goto label at the bottom and put
> there a single:
>
> error_prepend(errp, "%s: VFIO migration init failed:", vbasedev->name);
>
> ?
Or alternatively, put it in vfio_migration_realize():
ret = vfio_migration_init(vbasedev, &err);
if (ret) {
error_prepend(&err, "%s: VFIO migration init failed:",
vbasedev->name);
return !vfio_block_migration(vbasedev, err, errp);
}
?
>
>
> Thanks.
>
>>
>>
>>>
>>> if (!vbasedev->ops->vfio_get_object) {
>>> + error_setg(errp, "%s no vfio_get_object handler",
>>> error_prefix);
>>> return -EINVAL;
>>> }
>>>
>>> obj = vbasedev->ops->vfio_get_object(vbasedev);
>>> if (!obj) {
>>> + error_setg(errp, "%s failed to get object", error_prefix);
>>> return -EINVAL;
>>> }
>>>
>>> ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>>> if (ret) {
>>> + if (ret == -ENOTTY) {
>>> + error_setg_errno(errp, -ret,
>>> + "%s migration is not supported in
>>> kernel",
>>> + error_prefix);
>>> + } else {
>>> + error_setg_errno(errp, -ret, "%s failed to query
>>> migration flags",
>>> + error_prefix);
>>> + }
>>> +
>>> return ret;
>>> }
>>>
>>> /* Basic migration functionality must be supported */
>>> if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>>> + error_setg(errp, "%s VFIO_MIGRATION_STOP_COPY is not
>>> supported",
>>> + error_prefix);
>>> return -EOPNOTSUPP;
>>> }
>>>
>>> @@ -1261,18 +1276,8 @@ bool vfio_migration_realize(VFIODevice
>>> *vbasedev, Error **errp)
>>> return !vfio_block_migration(vbasedev, err, errp);
>>> }
>>>
>>> - ret = vfio_migration_init(vbasedev);
>>> + ret = vfio_migration_init(vbasedev, &err);
>>> if (ret) {
>>> - if (ret == -ENOTTY) {
>>> - error_setg(&err, "%s: VFIO migration is not supported
>>> in kernel",
>>> - vbasedev->name);
>>> - } else {
>>> - error_setg(&err,
>>> - "%s: Migration couldn't be initialized for
>>> VFIO device, "
>>> - "err: %d (%s)",
>>> - vbasedev->name, ret, strerror(-ret));
>>> - }
>>> -
>>> return !vfio_block_migration(vbasedev, err, errp);
>>> }
>>>
>>
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 11/14] vfio/migration: Add new switchover-ack mechanism
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (9 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 10/14] vfio/migration: Add Error ** parameter to vfio_migration_init() Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-05 8:14 ` [PATCH 12/14] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature Avihai Horon
` (2 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Add support for the new switchover-ack mechanism. This includes
implementing a request_switchover_ack handler that requests a switchover
ACK on setup.
Keep legacy switchover-ack functionality for backward compatibility.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
docs/devel/migration/vfio.rst | 3 +++
hw/vfio/migration.c | 16 ++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index 854277b11c..7f4e065506 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -59,6 +59,9 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``save_live_iterate`` function that reads the VFIO device's data from the
vendor driver during iterative pre-copy phase.
+* A ``request_switchover_ack`` function that requests switchover ACKs when
+ needed.
+
* A ``switchover_start`` function that in the multifd mode starts a thread that
reassembles the multifd received data and loads it in-order into the device.
In the non-multifd mode this function is a NOP.
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index b7e929274a..2607ce4cec 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -882,6 +882,21 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
+static bool vfio_request_switchover_ack(enum MigSwitchoverAckRequestStage stage,
+ void *opaque, const char **requester)
+{
+ VFIODevice *vbasedev = opaque;
+
+ *requester = vbasedev->name;
+
+ if (stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_SETUP) {
+ /* Precopy support implies switchover-ack is needed */
+ return vfio_precopy_supported(vbasedev);
+ }
+
+ return false;
+}
+
static int vfio_switchover_start(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -905,6 +920,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
+ .request_switchover_ack = vfio_request_switchover_ack,
/*
* Multifd support
*/
--
2.40.1
^ permalink raw reply related [flat|nested] 31+ messages in thread* [PATCH 12/14] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (10 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 11/14] vfio/migration: Add new switchover-ack mechanism Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-05 8:14 ` [PATCH 13/14] vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover Avihai Horon
2026-05-05 8:14 ` [PATCH 14/14] migration: Enable new switchover-ack Avihai Horon
13 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
According to the VFIO uAPI, precopy initial_bytes is considered critical
data that should be transferred and loaded prior to moving to the STOP_COPY
state, to ensure the precopy phase is effective.
As currently defined, initial_bytes can only decrease as it's being read
from the data fd. However, there are cases where a new chunk of
initial_bytes should be transferred during precopy.
The new VFIO_PRECOPY_INFO_REINIT feature addresses this and allows
reporting a new value for initial_bytes regardless of any previously
reported values.
Implement VFIO_PRECOPY_INFO_REINIT feature:
1. Opt-in for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 to make
VFIO_PRECOPY_INFO_REINIT available.
2. Request a new switchover ACK if initial_bytes increases after a
previous switchover ACK. This ensures the device is not moved to
STOP_COPY before initial_bytes has reached zero again.
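(The re-ack decision in point 2 can be sketched as below. This is a
simplified standalone model for illustration, not QEMU code; the field names
mirror those in the diff that follows.)

```c
#include <assert.h>
#include <stdbool.h>

struct mig_state {
    bool initial_data_sent;       /* a switchover ACK was already sent */
    bool request_switchover_ack;  /* ask for another ACK before STOP_COPY */
};

/*
 * A REINIT report after the initial bytes were already transferred means
 * new critical data appeared: withdraw the previous ACK and request a
 * fresh one, unless the legacy switchover-ack mechanism is in use.
 */
static void handle_precopy_info(struct mig_state *s, bool reinit,
                                bool legacy_switchover_ack)
{
    if (reinit && s->initial_data_sent && !legacy_switchover_ack) {
        s->initial_data_sent = false;
        s->request_switchover_ack = true;
    }
}
```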
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
hw/vfio/vfio-migration-internal.h | 2 +
hw/vfio/migration.c | 73 +++++++++++++++++++++++++++++--
hw/vfio/trace-events | 4 +-
3 files changed, 74 insertions(+), 5 deletions(-)
diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index a15fc74703..a1c58b1126 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -45,6 +45,7 @@ typedef struct VFIOMigration {
void *data_buffer;
size_t data_buffer_size;
uint64_t mig_flags;
+ bool precopy_info_v2_used;
/*
* NOTE: all three sizes cached are reported from VFIO's uAPI, which
* are defined as estimate only. QEMU should not trust these values
@@ -58,6 +59,7 @@ typedef struct VFIOMigration {
bool multifd_transfer;
VFIOMultifd *multifd;
bool initial_data_sent;
+ bool request_switchover_ack;
bool event_save_iterate_started;
bool event_precopy_empty_hit;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2607ce4cec..6eb363d3f3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -358,9 +358,11 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev)
static int vfio_query_precopy_size(VFIOMigration *migration)
{
+ VFIODevice *vbasedev = migration->vbasedev;
struct vfio_precopy_info precopy = {
.argsz = sizeof(precopy),
};
+ bool reinit = false;
int ret;
if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
@@ -368,16 +370,34 @@ static int vfio_query_precopy_size(VFIOMigration *migration)
migration->precopy_dirty_size = 0;
ret = -errno;
warn_report_once("VFIO device %s ioctl(VFIO_MIG_GET_PRECOPY_INFO) "
- "failed (%d)", migration->vbasedev->name, ret);
+ "failed (%d)", vbasedev->name, ret);
} else {
migration->precopy_init_size = precopy.initial_bytes;
migration->precopy_dirty_size = precopy.dirty_bytes;
+ /*
+ * struct vfio_precopy_info.flags is valid only if
+ * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 is used.
+ */
+ if (migration->precopy_info_v2_used) {
+ reinit = precopy.flags & VFIO_PRECOPY_INFO_REINIT;
+ }
ret = 0;
}
- trace_vfio_query_precopy_size(migration->vbasedev->name,
- migration->precopy_init_size,
- migration->precopy_dirty_size, ret);
+ trace_vfio_query_precopy_size(vbasedev->name, migration->precopy_init_size,
+ migration->precopy_dirty_size, reinit, ret);
+
+ /*
+ * If we got new initial_bytes after previous initial_bytes were
+ * transferred, request a new switchover ACK. Don't request if legacy
+ * switchover-ack is used.
+ */
+ if (reinit && migration->initial_data_sent &&
+ !migrate_switchover_ack_legacy()) {
+ migration->initial_data_sent = false;
+ migration->request_switchover_ack = true;
+ trace_vfio_query_precopy_size_request_switchover_ack(vbasedev->name);
+ }
return ret;
}
@@ -558,6 +578,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
migration->event_save_iterate_started = false;
migration->event_precopy_empty_hit = false;
+ migration->request_switchover_ack = false;
if (vfio_precopy_supported(vbasedev)) {
switch (migration->device_state) {
@@ -886,12 +907,19 @@ static bool vfio_request_switchover_ack(enum MigSwitchoverAckRequestStage stage,
void *opaque, const char **requester)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ bool request;
*requester = vbasedev->name;
if (stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_SETUP) {
/* Precopy support implies switchover-ack is needed */
return vfio_precopy_supported(vbasedev);
+ } else if (stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT) {
+ request = migration->request_switchover_ack;
+ migration->request_switchover_ack = false;
+
+ return request;
}
return false;
@@ -1041,6 +1069,27 @@ static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
return 0;
}
+/* Returns 1 on success, 0 if not supported and negative errno on failure */
+static int vfio_migration_set_precopy_info_v2(VFIODevice *vbasedev)
+{
+ uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
+ sizeof(uint64_t))] = {};
+ struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+
+ feature->argsz = sizeof(buf);
+ feature->flags =
+ VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2;
+ if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+ if (errno == ENOTTY) {
+ return 0;
+ }
+
+ return -errno;
+ }
+
+ return 1;
+}
+
static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
{
uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
@@ -1062,6 +1111,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
char id[256] = "";
g_autofree char *path = NULL, *oid = NULL;
uint64_t mig_flags = 0;
+ bool precopy_info_v2_used = false;
VMChangeStateHandler *prepare_cb;
g_autofree char *error_prefix =
g_strdup_printf("%s: VFIO migration init failed:", vbasedev->name);
@@ -1098,12 +1148,23 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
return -EOPNOTSUPP;
}
+ if (mig_flags & VFIO_MIGRATION_PRE_COPY) {
+ ret = vfio_migration_set_precopy_info_v2(vbasedev);
+ if (ret < 0) {
+ error_setg_errno(errp, -ret, "%s failed to set precopy info v2",
+ error_prefix);
+ return ret;
+ }
+ precopy_info_v2_used = ret;
+ }
+
vbasedev->migration = g_new0(VFIOMigration, 1);
migration = vbasedev->migration;
migration->vbasedev = vbasedev;
migration->device_state = VFIO_DEVICE_STATE_RUNNING;
migration->data_fd = -1;
migration->mig_flags = mig_flags;
+ migration->precopy_info_v2_used = precopy_info_v2_used;
vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
@@ -1126,6 +1187,10 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
migration_add_notifier(&migration->migration_state,
vfio_migration_state_notifier);
+ trace_vfio_migration_init(vbasedev->name, migration->mig_flags,
+ migration->precopy_info_v2_used,
+ vbasedev->dirty_pages_supported);
+
return 0;
}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index e91858354c..b6cda19394 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -158,12 +158,14 @@ vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx
vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
vfio_load_state_device_buffer_end(const char *name) " (%s)"
+vfio_migration_init(const char *name, uint64_t mig_flags, bool precopy_info_v2_used, bool dirty_pages_supported) " (%s) mig_flags 0x%"PRIx64", precopy_info_v2_used %d, dirty_pages_supported %d"
vfio_migration_realize(const char *name) " (%s)"
vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
+vfio_query_precopy_size(const char *name, uint64_t init_size, uint64_t dirty_size, bool reinit, int ret) " (%s) init %"PRIu64", dirty %"PRIu64", reinit %d, ret %d"
+vfio_query_precopy_size_request_switchover_ack(const char *name) " (%s)"
vfio_query_stop_copy_size(const char *name, uint64_t size, int ret) " (%s) stopcopy size %"PRIu64" ret %d"
-vfio_query_precopy_size(const char *name, uint64_t init_size, uint64_t dirty_size, int ret) " (%s) init %"PRIu64" dirty %"PRIu64" ret %d"
vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
vfio_save_cleanup(const char *name) " (%s)"
--
2.40.1
^ permalink raw reply related [flat|nested] 31+ messages in thread* [PATCH 13/14] vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (11 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 12/14] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
2026-05-05 8:14 ` [PATCH 14/14] migration: Enable new switchover-ack Avihai Horon
13 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
VFIO_PRECOPY_INFO_REINIT is checked only during precopy, while the guest
is running. However, the switchover decision and guest stop are not
atomic, so a VFIO device may want to set VFIO_PRECOPY_INFO_REINIT and
request another switchover ACK in the gap after the switchover decision has
been made but before the guest is stopped. This would be missed and may
increase downtime.
Solve this by checking if VFIO_PRECOPY_INFO_REINIT was set during that
gap, and requesting a new switchover ACK in the COMPLETE stage. Query
precopy info after vCPUs are stopped but before transitioning from the
PRE_COPY state, while it's still valid to call the ioctl.
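(The resulting stage handling can be sketched as below. This is a
simplified standalone model for illustration, not QEMU code; the stage and
flag names loosely mirror those in the diff that follows.)

```c
#include <assert.h>
#include <stdbool.h>

enum ack_stage { STAGE_SETUP, STAGE_PENDING_EXACT, STAGE_COMPLETE };

struct dev {
    bool precopy_supported;
    bool request_switchover_ack;  /* one-shot, set when REINIT is seen */
};

/*
 * SETUP requests an ACK whenever precopy is supported; the later stages
 * consume the one-shot flag, so a REINIT observed in the gap between the
 * switchover decision and guest stop is still reported at COMPLETE.
 */
static bool request_switchover_ack(struct dev *d, enum ack_stage stage)
{
    if (stage == STAGE_SETUP) {
        return d->precopy_supported;
    }
    if (stage == STAGE_PENDING_EXACT || stage == STAGE_COMPLETE) {
        bool request = d->request_switchover_ack;
        d->request_switchover_ack = false;
        return request;
    }
    return false;
}
```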
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
hw/vfio/migration.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 6eb363d3f3..675560f1e0 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -915,7 +915,8 @@ static bool vfio_request_switchover_ack(enum MigSwitchoverAckRequestStage stage,
if (stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_SETUP) {
/* Precopy support implies switchover-ack is needed */
return vfio_precopy_supported(vbasedev);
- } else if (stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT) {
+ } else if (stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_PENDING_EXACT ||
+ stage == MIG_SWITCHOVER_ACK_REQUEST_STAGE_COMPLETE) {
request = migration->request_switchover_ack;
migration->request_switchover_ack = false;
@@ -959,6 +960,26 @@ static const SaveVMHandlers savevm_vfio_handlers = {
/* ---------------------------------------------------------------------- */
+static void vfio_final_precopy_reinit_check(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+
+ if (!migration->precopy_info_v2_used || !migrate_switchover_ack() ||
+ migrate_switchover_ack_legacy()) {
+ return;
+ }
+
+ ret = vfio_query_precopy_size(migration);
+ if (ret) {
+ error_report("%s: Final precopy reinit check failed (err: %d)",
+ vbasedev->name, ret);
+ /* If query failed, assume reinit and request switchover-ack */
+ migration->request_switchover_ack = true;
+ migration->initial_data_sent = false;
+ }
+}
+
static void vfio_vmstate_change_prepare(void *opaque, bool running,
RunState state)
{
@@ -972,6 +993,15 @@ static void vfio_vmstate_change_prepare(void *opaque, bool running,
VFIO_DEVICE_STATE_PRE_COPY_P2P :
VFIO_DEVICE_STATE_RUNNING_P2P;
+ if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+ /*
+ * Now that vCPUs are stopped, check if new init_bytes are available
+ * since switchover decision, to be reported in switchover-ack COMPLETE
+ * stage.
+ */
+ vfio_final_precopy_reinit_check(vbasedev);
+ }
+
ret = vfio_migration_set_state_or_reset(vbasedev, new_state, &local_err);
if (ret) {
/*
--
2.40.1
^ permalink raw reply related [flat|nested] 31+ messages in thread* [PATCH 14/14] migration: Enable new switchover-ack
2026-05-05 8:14 [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
` (12 preceding siblings ...)
2026-05-05 8:14 ` [PATCH 13/14] vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover Avihai Horon
@ 2026-05-05 8:14 ` Avihai Horon
13 siblings, 0 replies; 31+ messages in thread
From: Avihai Horon @ 2026-05-05 8:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini, Maor Gottlieb,
Avihai Horon
Now that VFIO has implemented the new switchover-ack mechanism, enable it.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
migration/options.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/migration/options.c b/migration/options.c
index 44327c588f..91571f9d30 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -108,9 +108,8 @@ const Property migration_properties[] = {
preempt_pre_7_2, false),
DEFINE_PROP_BOOL("multifd-clean-tls-termination", MigrationState,
multifd_clean_tls_termination, true),
- /* Use legacy until VFIO implements new switchover-ack */
DEFINE_PROP_BOOL("switchover-ack-legacy", MigrationState,
- switchover_ack_legacy, true),
+ switchover_ack_legacy, false),
/* Migration parameters */
DEFINE_PROP_UINT8("x-throttle-trigger-threshold", MigrationState,
--
2.40.1
^ permalink raw reply related [flat|nested] 31+ messages in thread