[PATCH V1 0/8] Live update: vfio
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Support vfio devices with the cpr-exec live migration mode.
See the commit messages of the individual patches for details.
No user-visible interfaces are added.
This series is extracted from the following and updated for the latest QEMU:
[PATCH V9 00/46] Live Update
https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
This series depends on the following, which is based on commit 44b7329de469:
[PATCH V2 00/11] Live update: cpr-exec
https://lore.kernel.org/qemu-devel/1719776434-435013-1-git-send-email-steven.sistare@oracle.com/
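For illustration only (this sketch is not part of the series), the kernel
vaddr-update protocol that cpr-exec relies on works roughly as follows: old
qemu invalidates the userland virtual addresses of its DMA mappings before
exec, DMA continues, and new qemu supplies the new addresses. Here
container_fd, new_vaddr, iova, and size are assumed placeholders:

    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
    };
    /* old qemu, before exec: forget all vaddrs; DMA keeps running */
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);

    /* new qemu, after cpr load: re-register each mapping's new vaddr */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_VADDR,
        .vaddr = (__u64)(uintptr_t)new_vaddr,
        .iova  = iova,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);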
Steve Sistare (8):
migration: cpr_needed_for_reuse
pci: export msix_is_pending
vfio-pci: refactor for cpr
vfio-pci: cpr part 1 (fd and dma)
vfio-pci: cpr part 2 (msi)
vfio-pci: cpr part 3 (intx)
vfio: vfio_find_ram_discard_listener
vfio-pci: recover from unmap-all-vaddr failure
hw/pci/msix.c | 2 +-
hw/pci/pci.c | 13 ++
hw/vfio/common.c | 88 ++++++++--
hw/vfio/container.c | 139 ++++++++++++---
hw/vfio/cpr-legacy.c | 162 ++++++++++++++++++
hw/vfio/cpr.c | 24 ++-
hw/vfio/meson.build | 3 +-
hw/vfio/pci.c | 308 +++++++++++++++++++++++++++++-----
include/hw/pci/msix.h | 1 +
include/hw/vfio/vfio-common.h | 10 ++
include/hw/vfio/vfio-container-base.h | 7 +
include/migration/cpr.h | 1 +
include/migration/vmstate.h | 2 +
migration/cpr.c | 5 +
14 files changed, 682 insertions(+), 83 deletions(-)
create mode 100644 hw/vfio/cpr-legacy.c
--
1.8.3.1
[PATCH V1 1/8] migration: cpr_needed_for_reuse
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Define a vmstate "needed" helper that returns true when the migration mode
is cpr-exec. This will be moved to the preceding patch series "Live update:
cpr-exec" because it is needed by multiple devices.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/migration/cpr.h | 1 +
migration/cpr.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index c6c60f8..8d20d3e 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -24,6 +24,7 @@ void cpr_resave_fd(const char *name, int id, int fd);
int cpr_state_save(Error **errp);
int cpr_state_load(Error **errp);
+bool cpr_needed_for_reuse(void *opaque);
QEMUFile *cpr_exec_output(Error **errp);
QEMUFile *cpr_exec_input(Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index f756c15..843241c 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -236,3 +236,8 @@ int cpr_state_load(Error **errp)
return ret;
}
+bool cpr_needed_for_reuse(void *opaque)
+{
+ MigMode mode = migrate_mode();
+ return mode == MIG_MODE_CPR_EXEC;
+}
--
1.8.3.1
[PATCH V1 2/8] pci: export msix_is_pending
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Export msix_is_pending for use by cpr. No functional change.
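For context, patch 5 uses the export roughly as follows when reclaiming
vectors after exec, to re-flag a vector that was pending while masked:

    if (msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
        set_bit(i, vdev->msix->pending);   /* redeliver when unmasked */
    }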
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
hw/pci/msix.c | 2 +-
include/hw/pci/msix.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 487e498..17ef2b0 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -71,7 +71,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
return dev->msix_pba + vector / 8;
}
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
{
return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 0e6f257..11ef945 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
bool msix_is_masked(PCIDevice *dev, unsigned vector);
void msix_set_pending(PCIDevice *dev, unsigned vector);
void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
void msix_vector_use(PCIDevice *dev, unsigned vector);
void msix_vector_unuse(PCIDevice *dev, unsigned vector);
--
1.8.3.1
[PATCH V1 3/8] vfio-pci: refactor for cpr
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Refactor vector-use setup into a new helper, vfio_vector_init.
Add vfio_notifier_init and vfio_notifier_cleanup for named notifiers,
and pass additional arguments to vfio_remove_kvm_msi_virq.
All of this is for use by CPR in a subsequent patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 106 +++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 68 insertions(+), 38 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e03d9f3..ca3c22a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,6 +54,32 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+/* Create new or reuse existing eventfd */
+static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr)
+{
+ int fd = -1; /* placeholder until a subsequent patch */
+ int ret = 0;
+
+ if (fd >= 0) {
+ event_notifier_init_fd(e, fd);
+ } else {
+ ret = event_notifier_init(e, 0);
+ if (ret) {
+ Error *err = NULL;
+ error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
+ error_report_err(err);
+ }
+ }
+ return ret;
+}
+
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr)
+{
+ event_notifier_cleanup(e);
+}
+
/*
* Disabling BAR mmaping can be slow, but toggling it around INTx can
* also be a huge overhead. We try to get the best of both worlds by
@@ -134,8 +160,8 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
pci_irq_deassert(&vdev->pdev);
/* Get an eventfd for resample/unmask */
- if (event_notifier_init(&vdev->intx.unmask, 0)) {
- error_setg(errp, "event_notifier_init failed eoi");
+ if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
+ error_setg(errp, "vfio_notifier_init intx-unmask failed");
goto fail;
}
@@ -167,7 +193,7 @@ fail_vfio:
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
vdev->intx.route.irq);
fail_irqfd:
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
fail:
qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -199,7 +225,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
}
/* We only need to close the eventfd for VFIO to cleanup the kernel side */
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
/* QEMU starts listening for interrupt events. */
qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -266,7 +292,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
Error *err = NULL;
int32_t fd;
- int ret;
if (!pin) {
@@ -289,9 +314,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
}
#endif
- ret = event_notifier_init(&vdev->intx.interrupt, 0);
- if (ret) {
- error_setg_errno(errp, -ret, "event_notifier_init failed");
+ if (vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0)) {
return false;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -300,7 +323,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
return false;
}
@@ -327,7 +350,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
vdev->interrupt = VFIO_INT_NONE;
@@ -471,13 +494,15 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
vector_n, &vdev->pdev);
}
-static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
{
+ const char *name = "kvm_interrupt";
+
if (vector->virq < 0) {
return;
}
- if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+ if (vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr)) {
goto fail_notifier;
}
@@ -489,19 +514,20 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
return;
fail_kvm:
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
fail_notifier:
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
}
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int nr)
{
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
vector->virq);
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
}
static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -511,6 +537,20 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
kvm_irqchip_commit_routes(kvm_state);
}
+static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+ VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+ PCIDevice *pdev = &vdev->pdev;
+
+ vector->vdev = vdev;
+ vector->virq = -1;
+ vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr);
+ vector->use = true;
+ if (vdev->interrupt == VFIO_INT_MSIX) {
+ msix_vector_use(pdev, nr);
+ }
+}
+
static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
MSIMessage *msg, IOHandler *handler)
{
@@ -524,13 +564,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vector = &vdev->msi_vectors[nr];
if (!vector->use) {
- vector->vdev = vdev;
- vector->virq = -1;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
- }
- vector->use = true;
- msix_vector_use(pdev, nr);
+ vfio_vector_init(vdev, nr);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -542,7 +576,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
*/
if (vector->virq >= 0) {
if (!msg) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, nr);
} else {
vfio_update_kvm_msi_virq(vector, *msg, pdev);
}
@@ -554,7 +588,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
vfio_add_kvm_msi_virq(vdev, vector, nr, true);
kvm_irqchip_commit_route_changes(&vfio_route_change);
- vfio_connect_kvm_msi_virq(vector);
+ vfio_connect_kvm_msi_virq(vector, nr);
}
}
}
@@ -661,7 +695,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
kvm_irqchip_commit_route_changes(&vfio_route_change);
for (i = 0; i < vdev->nr_vectors; i++) {
- vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
+ vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
}
}
@@ -741,9 +775,7 @@ retry:
vector->virq = -1;
vector->use = true;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
- }
+ vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i);
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
vfio_msi_interrupt, NULL, vector);
@@ -797,11 +829,11 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
VFIOMSIVector *vector = &vdev->msi_vectors[i];
if (vdev->msi_vectors[i].use) {
if (vector->virq >= 0) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, i);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
NULL, NULL, NULL);
- event_notifier_cleanup(&vector->interrupt);
+ vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
}
}
@@ -2855,8 +2887,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->err_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for error detection");
+ if (vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0)) {
vdev->pci_aer = false;
return;
}
@@ -2868,7 +2899,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
vdev->pci_aer = false;
}
}
@@ -2887,7 +2918,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
}
static void vfio_req_notifier_handler(void *opaque)
@@ -2921,8 +2952,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->req_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for device request");
+ if (vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0)) {
return;
}
@@ -2933,7 +2963,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
} else {
vdev->req_enabled = true;
}
@@ -2953,7 +2983,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
vdev->req_enabled = false;
}
--
1.8.3.1
[PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu.

At vfio creation time, save the values of the vfio container, group, and
device descriptors in CPR state.

In the container pre_save handler, suspend the use of virtual addresses
in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will
be remapped at a different VA after exec. DMA to already-mapped pages
continues. Save the MSI message area as part of vfio-pci vmstate, and
save the interrupt and notifier eventfds in vmstate.

On qemu restart, vfio_realize() finds the saved descriptors, reuses them,
and notes that the device is being reused. Device and iommu state is
already configured, so operations in vfio_realize that would modify the
configuration are skipped for a reused device, including vfio ioctls and
writes to PCI configuration space. VFIO PCI device reset is also
suppressed. The result is that vfio_realize constructs qemu data
structures that reflect the current state of the device. However, the
reconstruction is not complete until migrate_incoming is called.
migrate_incoming loads the MSI data, the vfio post_load handler finds
eventfds in CPR state, rebuilds vector data structures, and attaches the
interrupts to the new KVM instance. The container post_load handler then
invokes the main vfio listener callback, which walks the flattened ranges
of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the
kernel of the new VAs. Lastly, migration resumes the VM.

This functionality is delivered by 3 patches for clarity. Part 1 handles
device file descriptors and DMA. Part 2 adds eventfd and MSI/MSI-X vector
support. Part 3 adds INTx support.
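The descriptor save/reuse pattern, simplified from the group code below
(error handling omitted; the same idea applies to the container and
device fds):

    int fd = cpr_find_fd("vfio_group", groupid);   /* found after exec */
    bool reused = (fd >= 0);
    if (!reused) {
        fd = qemu_open_old(path, O_RDWR);          /* first launch */
    }
    ...
    cpr_resave_fd("vfio_group", groupid, fd);      /* saved for next exec */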
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/pci/pci.c | 13 ++++
hw/vfio/common.c | 12 +++
hw/vfio/container.c | 139 ++++++++++++++++++++++++++++------
hw/vfio/cpr-legacy.c | 118 +++++++++++++++++++++++++++++
hw/vfio/cpr.c | 24 +++++-
hw/vfio/meson.build | 3 +-
hw/vfio/pci.c | 38 ++++++++++
include/hw/vfio/vfio-common.h | 8 ++
include/hw/vfio/vfio-container-base.h | 6 ++
include/migration/vmstate.h | 2 +
10 files changed, 336 insertions(+), 27 deletions(-)
create mode 100644 hw/vfio/cpr-legacy.c
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 4c7be52..42513dd 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -32,6 +32,7 @@
#include "hw/pci/pci_host.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "migration/misc.h"
#include "migration/qemu-file-types.h"
#include "migration/vmstate.h"
#include "net/net.h"
@@ -389,6 +390,18 @@ static void pci_reset_regions(PCIDevice *dev)
static void pci_do_device_reset(PCIDevice *dev)
{
+ /*
+ * A PCI device that is resuming for cpr is already configured, so do
+ * not reset it here when we are called from qemu_system_reset prior to
+ * cpr load, else interrupts may be lost for vfio-pci devices. It is
+ * safe to skip this reset for all PCI devices, because cpr load will set
+ * all fields that would have been set here.
+ */
+ MigMode mode = migrate_mode();
+ if (mode == MIG_MODE_CPR_EXEC) {
+ return;
+ }
+
pci_device_deassert_intx(dev);
assert(dev->irq_state == 0);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7cdb969..72a692a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -566,6 +566,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
{
VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
listener);
+ vfio_container_region_add(bcontainer, section);
+}
+
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section)
+{
hwaddr iova, end;
Int128 llend, llsize;
void *vaddr;
@@ -1395,6 +1401,12 @@ const MemoryListener vfio_memory_listener = {
.log_sync = vfio_listener_log_sync,
};
+void vfio_listener_register(VFIOContainerBase *bcontainer)
+{
+ bcontainer->listener = vfio_memory_listener;
+ memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+}
+
void vfio_reset_handler(void *opaque)
{
VFIODevice *vbasedev;
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 88ede91..9970463 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -31,6 +31,7 @@
#include "sysemu/reset.h"
#include "trace.h"
#include "qapi/error.h"
+#include "migration/cpr.h"
#include "pci.h"
VFIOGroupList vfio_group_list =
@@ -131,6 +132,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
int ret;
Error *local_err = NULL;
+ assert(!bcontainer->reused);
+
if (iotlb && vfio_devices_all_running_and_mig_active(bcontainer)) {
if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
bcontainer->dirty_pages_supported) {
@@ -182,12 +185,24 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
bcontainer);
struct vfio_iommu_type1_dma_map map = {
.argsz = sizeof(map),
- .flags = VFIO_DMA_MAP_FLAG_READ,
.vaddr = (__u64)(uintptr_t)vaddr,
.iova = iova,
.size = size,
};
+ /*
+ * Set the new vaddr for any mappings registered during cpr load.
+ * Reused is cleared thereafter.
+ */
+ if (bcontainer->reused) {
+ map.flags = VFIO_DMA_MAP_FLAG_VADDR;
+ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+ goto fail;
+ }
+ return 0;
+ }
+
+ map.flags = VFIO_DMA_MAP_FLAG_READ;
if (!readonly) {
map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
}
@@ -204,7 +219,11 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
return 0;
}
- error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+fail:
+ error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
+ (bcontainer->reused ? "VADDR" : ""), iova, size, vaddr,
+ strerror(errno));
+
return -errno;
}
@@ -415,12 +434,28 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
}
static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
- Error **errp)
+ bool reused, Error **errp)
{
int iommu_type;
const char *vioc_name;
VFIOContainer *container;
+ /*
+ * If container is reused, just set its type and skip the ioctls, as the
+ * container and group are already configured in the kernel.
+ * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
+ */
+ if (reused) {
+ if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
+ iommu_type = VFIO_TYPE1v2_IOMMU;
+ goto skip_iommu;
+ } else {
+ error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
+ "is not supported");
+ return NULL;
+ }
+ }
+
iommu_type = vfio_get_iommu_type(fd, errp);
if (iommu_type < 0) {
return NULL;
@@ -430,10 +465,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
return NULL;
}
+skip_iommu:
vioc_name = vfio_get_iommu_class_name(iommu_type);
container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
container->fd = fd;
+ container->bcontainer.reused = reused;
container->iommu_type = iommu_type;
return container;
}
@@ -543,10 +580,13 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
VFIOContainer *container;
VFIOContainerBase *bcontainer;
int ret, fd;
+ bool reused;
VFIOAddressSpace *space;
VFIOIOMMUClass *vioc;
space = vfio_get_address_space(as);
+ fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+ reused = (fd > 0);
/*
* VFIO is currently incompatible with discarding of RAM insofar as the
@@ -579,28 +619,50 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
* details once we know which type of IOMMU we are using.
*/
+ /*
+ * If the container is reused, then the group is already attached in the
+ * kernel. If a container with matching fd is found, then update the
+ * userland group list and return. If not, then after the loop, create
+ * the container struct and group list.
+ */
+
QLIST_FOREACH(bcontainer, &space->containers, next) {
container = container_of(bcontainer, VFIOContainer, bcontainer);
- if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
- ret = vfio_ram_block_discard_disable(container, true);
- if (ret) {
- error_setg_errno(errp, -ret,
- "Cannot set discarding of RAM broken");
- if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
- &container->fd)) {
- error_report("vfio: error disconnecting group %d from"
- " container", group->groupid);
- }
- return false;
+
+ if (reused) {
+ if (container->fd != fd) {
+ continue;
}
- group->container = container;
- QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+ } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+ continue;
+ }
+
+ ret = vfio_ram_block_discard_disable(container, true);
+ if (ret) {
+ error_setg_errno(errp, -ret,
+ "Cannot set discarding of RAM broken");
+ if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
+ &container->fd)) {
+ error_report("vfio: error disconnecting group %d from"
+ " container", group->groupid);
+
+ }
+ goto delete_fd_exit;
+ }
+ group->container = container;
+ QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+ if (!reused) {
vfio_kvm_device_add_group(group);
- return true;
+ cpr_save_fd("vfio_container_for_group", group->groupid,
+ container->fd);
}
+ return true;
+ }
+
+ if (!reused) {
+ fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
}
- fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
if (fd < 0) {
error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
goto put_space_exit;
@@ -613,11 +675,12 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
goto close_fd_exit;
}
- container = vfio_create_container(fd, group, errp);
+ container = vfio_create_container(fd, group, reused, errp);
if (!container) {
goto close_fd_exit;
}
bcontainer = &container->bcontainer;
+ bcontainer->reused = reused;
if (!vfio_cpr_register_container(bcontainer, errp)) {
goto free_container_exit;
@@ -643,8 +706,16 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
group->container = container;
QLIST_INSERT_HEAD(&container->group_list, group, container_next);
- bcontainer->listener = vfio_memory_listener;
- memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+ /*
+ * If reused, register the listener later, after all state that may
+ * affect regions and mapping boundaries has been cpr load'ed. Later,
+ * the listener will invoke its callback on each flat section and call
+ * vfio_dma_map to supply the new vaddr, and the calls will match the
+ * mappings remembered by the kernel.
+ */
+ if (!reused) {
+ vfio_listener_register(bcontainer);
+ }
if (bcontainer->error) {
error_propagate_prepend(errp, bcontainer->error,
@@ -653,6 +724,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
}
bcontainer->initialized = true;
+ cpr_resave_fd("vfio_container_for_group", group->groupid, fd);
return true;
listener_release_exit:
@@ -679,6 +751,8 @@ close_fd_exit:
put_space_exit:
vfio_put_address_space(space);
+delete_fd_exit:
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
return false;
}
@@ -690,6 +764,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
QLIST_REMOVE(group, container_next);
group->container = NULL;
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
/*
* Explicitly release the listener first before unset container,
@@ -743,7 +818,12 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
group = g_malloc0(sizeof(*group));
snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
- group->fd = qemu_open_old(path, O_RDWR);
+
+ group->fd = cpr_find_fd("vfio_group", groupid);
+ if (group->fd < 0) {
+ group->fd = qemu_open_old(path, O_RDWR);
+ }
+
if (group->fd < 0) {
error_setg_errno(errp, errno, "failed to open %s", path);
goto free_group_exit;
@@ -772,6 +852,7 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
}
QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+ cpr_resave_fd("vfio_group", groupid, group->fd);
return group;
@@ -797,6 +878,7 @@ static void vfio_put_group(VFIOGroup *group)
vfio_disconnect_container(group);
QLIST_REMOVE(group, next);
trace_vfio_put_group(group->fd);
+ cpr_delete_fd("vfio_group", group->groupid);
close(group->fd);
g_free(group);
}
@@ -806,8 +888,14 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
{
g_autofree struct vfio_device_info *info = NULL;
int fd;
+ bool reused;
+
+ fd = cpr_find_fd(name, 0);
+ reused = (fd >= 0);
+ if (!reused) {
+ fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+ }
- fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
if (fd < 0) {
error_setg_errno(errp, errno, "error getting device from group %d",
group->groupid);
@@ -852,6 +940,8 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
vbasedev->num_irqs = info->num_irqs;
vbasedev->num_regions = info->num_regions;
vbasedev->flags = info->flags;
+ vbasedev->reused = reused;
+ cpr_resave_fd(name, 0, fd);
trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
@@ -868,6 +958,7 @@ static void vfio_put_base_device(VFIODevice *vbasedev)
QLIST_REMOVE(vbasedev, next);
vbasedev->group = NULL;
trace_vfio_put_base_device(vbasedev->fd);
+ cpr_delete_fd(vbasedev->name, 0);
close(vbasedev->fd);
}
@@ -1136,6 +1227,8 @@ static void vfio_iommu_legacy_class_init(ObjectClass *klass, void *data)
vioc->set_dirty_page_tracking = vfio_legacy_set_dirty_page_tracking;
vioc->query_dirty_bitmap = vfio_legacy_query_dirty_bitmap;
vioc->pci_hot_reset = vfio_legacy_pci_hot_reset;
+ vioc->cpr_register = vfio_legacy_cpr_register_container;
+ vioc->cpr_unregister = vfio_legacy_cpr_unregister_container;
};
static bool hiod_legacy_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
new file mode 100644
index 0000000..bc51ebe
--- /dev/null
+++ b/hw/vfio/cpr-legacy.c
@@ -0,0 +1,118 @@
+/*
+ * Copyright (c) 2021-2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-common.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "qapi/error.h"
+#include "migration/vmstate.h"
+
+#define VFIO_CONTAINER(base) container_of(base, VFIOContainer, bcontainer)
+
+static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+ struct vfio_iommu_type1_dma_unmap unmap = {
+ .argsz = sizeof(unmap),
+ .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+ .iova = 0,
+ .size = 0,
+ };
+ if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+ error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+ return false;
+ }
+ return true;
+}
+
+static bool vfio_can_cpr_exec(VFIOContainer *container, Error **errp)
+{
+ if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
+ error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
+ return false;
+
+ } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+ error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
+ return false;
+
+ } else {
+ return true;
+ }
+}
+
+static int vfio_container_pre_save(void *opaque)
+{
+ VFIOContainer *container = opaque;
+ Error *err = NULL;
+
+ if (!vfio_can_cpr_exec(container, &err) ||
+ !vfio_dma_unmap_vaddr_all(container, &err)) {
+ error_report_err(err);
+ return -1;
+ }
+ return 0;
+}
+
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+ VFIOContainer *container = opaque;
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ VFIOGroup *group;
+ Error *err = NULL;
+ VFIODevice *vbasedev;
+
+ if (!vfio_can_cpr_exec(container, &err)) {
+ error_report_err(err);
+ return -1;
+ }
+ vfio_listener_register(bcontainer);
+ bcontainer->reused = false;
+
+ QLIST_FOREACH(group, &container->group_list, container_next) {
+ QLIST_FOREACH(vbasedev, &group->device_list, next) {
+ vbasedev->reused = false;
+ }
+ }
+ return 0;
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+ .name = "vfio-container",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .pre_save = vfio_container_pre_save,
+ .post_load = vfio_container_post_load,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer,
+ Error **errp)
+{
+ VFIOContainer *container = VFIO_CONTAINER(bcontainer);
+
+ if (!vfio_can_cpr_exec(container, &bcontainer->cpr_blocker)) {
+ return migrate_add_blocker_modes(&bcontainer->cpr_blocker, errp,
+ MIG_MODE_CPR_EXEC, -1);
+ }
+
+ vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+
+ return true;
+}
+
+void vfio_legacy_cpr_unregister_container(VFIOContainerBase *bcontainer)
+{
+ VFIOContainer *container = VFIO_CONTAINER(bcontainer);
+
+ vmstate_unregister(NULL, &vfio_container_vmstate, container);
+}
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 87e51fc..4474bc3 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -6,10 +6,12 @@
*/
#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
#include "hw/vfio/vfio-common.h"
-#include "migration/misc.h"
+#include "migration/blocker.h"
+#include "migration/migration.h"
#include "qapi/error.h"
-#include "sysemu/runstate.h"
static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
MigrationEvent *e, Error **errp)
@@ -27,13 +29,29 @@ static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
{
+ VFIOIOMMUClass *ops = VFIO_IOMMU_GET_CLASS(bcontainer);
+
migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
vfio_cpr_reboot_notifier,
MIG_MODE_CPR_REBOOT);
- return true;
+
+ if (!ops->cpr_register) {
+ error_setg(&bcontainer->cpr_blocker,
+ "VFIO container does not support cpr_register");
+ return migrate_add_blocker_modes(&bcontainer->cpr_blocker, errp,
+ MIG_MODE_CPR_EXEC, -1) == 0;
+ }
+
+ return ops->cpr_register(bcontainer, errp);
}
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
{
+ VFIOIOMMUClass *ops = VFIO_IOMMU_GET_CLASS(bcontainer);
+
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+ migrate_del_blocker(&bcontainer->cpr_blocker);
+ if (ops->cpr_unregister) {
+ ops->cpr_unregister(bcontainer);
+ }
}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index bba776f..5487815 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,13 +5,14 @@ vfio_ss.add(files(
'container-base.c',
'container.c',
'migration.c',
- 'cpr.c',
))
vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
'iommufd.c',
))
vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+ 'cpr.c',
+ 'cpr-legacy.c',
'display.c',
'pci-quirks.c',
'pci.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ca3c22a..2485236 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,8 @@
#include "hw/pci/pci_bridge.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "migration/misc.h"
+#include "migration/cpr.h"
#include "migration/vmstate.h"
#include "qapi/qmp/qdict.h"
#include "qemu/error-report.h"
@@ -3326,6 +3328,11 @@ static void vfio_pci_reset(DeviceState *dev)
{
VFIOPCIDevice *vdev = VFIO_PCI(dev);
+ /* Do not reset the device during qemu_system_reset prior to cpr load */
+ if (vdev->vbasedev.reused) {
+ return;
+ }
+
trace_vfio_pci_reset(vdev->vbasedev.name);
vfio_pci_pre_reset(vdev);
@@ -3447,6 +3454,36 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
}
#endif
+/*
+ * The kernel may change non-emulated config bits. Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_pci_pre_load(void *opaque)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int size = MIN(pci_config_size(pdev), vdev->config_size);
+ int i;
+
+ for (i = 0; i < size; i++) {
+ pdev->cmask[i] &= vdev->emulated_config_bits[i];
+ }
+
+ return 0;
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+ .name = "vfio-pci",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .priority = MIG_PRI_VFIO_PCI, /* must load before container */
+ .pre_load = vfio_pci_pre_load,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
{
DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3457,6 +3494,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
#ifdef CONFIG_IOMMUFD
object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
#endif
+ dc->vmsd = &vfio_pci_vmstate;
dc->desc = "VFIO-based PCI device assignment";
set_bit(DEVICE_CATEGORY_MISC, dc->categories);
pdc->realize = vfio_realize;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e8ddf92..7c4283b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -122,6 +122,7 @@ typedef struct VFIODevice {
bool ram_block_discard_allowed;
OnOffAuto enable_migration;
bool migration_events;
+ bool reused;
VFIODeviceOps *ops;
unsigned int num_irqs;
unsigned int num_regions;
@@ -240,6 +241,9 @@ int vfio_kvm_device_del_fd(int fd, Error **errp);
bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
+bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer,
+ Error **errp);
+void vfio_legacy_cpr_unregister_container(VFIOContainerBase *bcontainer);
extern const MemoryRegionOps vfio_region_ops;
typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
@@ -287,6 +291,10 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
uint64_t size, ram_addr_t ram_addr, Error **errp);
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section);
+void vfio_listener_register(VFIOContainerBase *bcontainer);
+
/* Returns 0 on success, or a negative errno. */
bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 419e45e..82ccf0c 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -39,6 +39,7 @@ typedef struct VFIOContainerBase {
MemoryListener listener;
Error *error;
bool initialized;
+ bool reused;
uint64_t dirty_pgsizes;
uint64_t max_dirty_bitmap_size;
unsigned long pgsizes;
@@ -50,6 +51,7 @@ typedef struct VFIOContainerBase {
QLIST_HEAD(, VFIODevice) device_list;
GList *iova_ranges;
NotifierWithReturn cpr_reboot_notifier;
+ Error *cpr_blocker;
} VFIOContainerBase;
typedef struct VFIOGuestIOMMU {
@@ -152,5 +154,9 @@ struct VFIOIOMMUClass {
void (*del_window)(VFIOContainerBase *bcontainer,
MemoryRegionSection *section);
void (*release)(VFIOContainerBase *bcontainer);
+
+ /* CPR */
+ bool (*cpr_register)(VFIOContainerBase *bcontainer, Error **errp);
+ void (*cpr_unregister)(VFIOContainerBase *bcontainer);
};
#endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index f313f2f..87cb5b0 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -162,6 +162,8 @@ typedef enum {
MIG_PRI_GICV3_ITS, /* Must happen before PCI devices */
MIG_PRI_GICV3, /* Must happen before the ITS */
MIG_PRI_MAX,
+ MIG_PRI_VFIO_PCI =
+ MIG_PRI_DEFAULT + 1, /* Must happen before vfio containers */
} MigrationPriority;
struct VMStateField {
--
1.8.3.1
[PATCH V1 5/8] vfio-pci: cpr part 2 (msi)
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Finish CPR for vfio-pci MSI/MSI-X devices by preserving eventfds and
vector state.
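Each eventfd is keyed in CPR state by device name, notifier name, and
vector number, so the new qemu can look it up by the same key. In outline
(simplified from save_event_fd below):

    g_autofree char *fdname =
        g_strdup_printf("%s_%s", vdev->vbasedev.name, "kvm_interrupt");
    cpr_resave_fd(fdname, nr, event_notifier_get_fd(ev));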
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 116 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2485236..f0213e0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -56,11 +56,37 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+/* The parameter is named ev_name so it cannot capture vbasedev.name below */
+#define EVENT_FD_NAME(vdev, ev_name) \
+ g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (ev_name))
+
+static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+ EventNotifier *ev)
+{
+ int fd = event_notifier_get_fd(ev);
+
+ if (fd >= 0) {
+ g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+ cpr_resave_fd(fdname, nr, fd);
+ }
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+ return cpr_find_fd(fdname, nr);
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+ cpr_delete_fd(fdname, nr);
+}
+
/* Create new or reuse existing eventfd */
static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
{
- int fd = -1; /* placeholder until a subsequent patch */
+ int fd = load_event_fd(vdev, name, nr);
int ret = 0;
if (fd >= 0) {
@@ -71,6 +97,8 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
Error *err = NULL;
error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
error_report_err(err);
+ } else {
+ save_event_fd(vdev, name, nr, e);
}
}
return ret;
@@ -79,6 +107,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
{
+ delete_event_fd(vdev, name, nr);
event_notifier_cleanup(e);
}
@@ -561,6 +590,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
int ret;
bool resizing = !!(vdev->nr_vectors < nr + 1);
+ /*
+ * Ignore the callback from msix_set_vector_notifiers during resume.
+ * The necessary subset of these actions is called from vfio_claim_vectors
+ * during post load.
+ */
+ if (vdev->vbasedev.reused) {
+ return 0;
+ }
+
trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
vector = &vdev->msi_vectors[nr];
@@ -2897,6 +2935,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->err_notifier);
qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (vdev->vbasedev.reused) {
+ return;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2961,6 +3004,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->req_notifier);
qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (vdev->vbasedev.reused) {
+ vdev->req_enabled = true;
+ return;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3454,6 +3503,46 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
}
#endif
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
+{
+ int i, fd;
+ bool pending = false;
+ PCIDevice *pdev = &vdev->pdev;
+
+ vdev->nr_vectors = nr_vectors;
+ vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+ vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+ vfio_prepare_kvm_msi_virq_batch(vdev);
+
+ for (i = 0; i < nr_vectors; i++) {
+ VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+ fd = load_event_fd(vdev, "interrupt", i);
+ if (fd >= 0) {
+ vfio_vector_init(vdev, i);
+ qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+ }
+
+ if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
+ vfio_add_kvm_msi_virq(vdev, vector, i, msix);
+ } else {
+ vdev->msi_vectors[i].virq = -1;
+ }
+
+ if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+ set_bit(i, vdev->msix->pending);
+ pending = true;
+ }
+ }
+
+ vfio_commit_kvm_msi_virq_batch(vdev);
+
+ if (msix) {
+ memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+ }
+}
+
/*
* The kernel may change non-emulated config bits. Exclude them from the
* changed-bits check in get_pci_config_device.
@@ -3472,14 +3561,40 @@ static int vfio_pci_pre_load(void *opaque)
return 0;
}
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int nr_vectors;
+
+ if (msix_enabled(pdev)) {
+ msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
+ vfio_msix_vector_release, NULL);
+ nr_vectors = vdev->msix->entries;
+ vfio_claim_vectors(vdev, nr_vectors, true);
+
+ } else if (msi_enabled(pdev)) {
+ nr_vectors = msi_nr_vectors_allocated(pdev);
+ vfio_claim_vectors(vdev, nr_vectors, false);
+
+ } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+ g_assert_not_reached(); /* completed in a subsequent patch */
+ }
+
+ return 0;
+}
+
static const VMStateDescription vfio_pci_vmstate = {
.name = "vfio-pci",
.version_id = 0,
.minimum_version_id = 0,
.priority = MIG_PRI_VFIO_PCI, /* must load before container */
.pre_load = vfio_pci_pre_load,
+ .post_load = vfio_pci_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
+ VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+ VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
VMSTATE_END_OF_LIST()
}
};
--
1.8.3.1
[PATCH V1 6/8] vfio-pci: cpr part 3 (intx)
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Preserve vfio INTx state across cpr-exec. Preserve VFIOINTx fields as
follows:

pin : Recover this from the vfio config in kernel space.
interrupt : Preserve its eventfd descriptor across exec.
unmask : Ditto.
route.irq : This could perhaps be recovered in vfio_pci_post_load by
calling pci_device_route_intx_to_irq(pin), whose implementation reads
config space for a bridge device such as ich9. However, there is no
guarantee that the bridge vmstate is read before vfio vmstate. Rather
than fiddling with MigrationPriority for vmstate handlers, explicitly
save route.irq in vfio vmstate.
pending : Save in vfio vmstate.
mmap_timeout, mmap_timer : Re-initialize.
kvm_accel : Re-initialize.

In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_pci_post_load. Modify vfio_intx_enable and
vfio_intx_enable_kvm to skip vfio initialization, but still perform
kvm initialization.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 47 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f0213e0..b5e7592 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -184,12 +184,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
return true;
}
+ if (vdev->vbasedev.reused) {
+ goto skip_state;
+ }
+
/* Get to a known interrupt state */
qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
vdev->intx.pending = false;
pci_irq_deassert(&vdev->pdev);
+skip_state:
/* Get an eventfd for resample/unmask */
if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
error_setg(errp, "vfio_notifier_init intx-unmask failed");
@@ -204,6 +209,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
goto fail_irqfd;
}
+ if (vdev->vbasedev.reused) {
+ goto skip_irq;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_UNMASK,
event_notifier_get_fd(&vdev->intx.unmask),
@@ -214,6 +223,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
/* Let'em rip */
vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+skip_irq:
vdev->intx.kvm_accel = true;
trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
@@ -329,7 +339,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
return true;
}
- vfio_disable_interrupts(vdev);
+ /*
+ * Do not alter interrupt state during vfio_realize and cpr load. The
+ * reused flag is cleared thereafter.
+ */
+ if (!vdev->vbasedev.reused) {
+ vfio_disable_interrupts(vdev);
+ }
vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -351,7 +367,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
- if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
+ if (!vdev->vbasedev.reused &&
+ !vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
@@ -3262,7 +3279,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vfio_intx_routing_notifier);
vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
- if (!vfio_intx_enable(vdev, errp)) {
+ /* Wait until cpr load reads intx routing data to enable */
+ if (!vdev->vbasedev.reused && !vfio_intx_enable(vdev, errp)) {
goto out_deregister;
}
}
@@ -3578,12 +3596,36 @@ static int vfio_pci_post_load(void *opaque, int version_id)
vfio_claim_vectors(vdev, nr_vectors, false);
} else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
- g_assert_not_reached(); /* completed in a subsequent patch */
+ Error *err = NULL;
+ if (!vfio_intx_enable(vdev, &err)) {
+ error_report_err(err);
+ return -1;
+ }
}
return 0;
}
+static const VMStateDescription vfio_intx_vmstate = {
+ .name = "vfio-intx",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .fields = (VMStateField[]) {
+ VMSTATE_BOOL(pending, VFIOINTx),
+ VMSTATE_UINT32(route.mode, VFIOINTx),
+ VMSTATE_INT32(route.irq, VFIOINTx),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) { \
+ .name = (stringify(_field)), \
+ .size = sizeof(VFIOINTx), \
+ .vmsd = &vfio_intx_vmstate, \
+ .flags = VMS_STRUCT, \
+ .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
+}
+
static const VMStateDescription vfio_pci_vmstate = {
.name = "vfio-pci",
.version_id = 0,
@@ -3595,6 +3637,7 @@ static const VMStateDescription vfio_pci_vmstate = {
.fields = (VMStateField[]) {
VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+ VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
VMSTATE_END_OF_LIST()
}
};
--
1.8.3.1
[PATCH V1 7/8] vfio: vfio_find_ram_discard_listener
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
Define vfio_find_ram_discard_listener as a subroutine so additional calls to
it may be added in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/common.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 72a692a..5c7baad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -561,6 +561,26 @@ static bool vfio_get_section_iova_range(VFIOContainerBase *bcontainer,
return true;
}
+static VFIORamDiscardListener *vfio_find_ram_discard_listener(
+ VFIOContainerBase *bcontainer, MemoryRegionSection *section)
+{
+ VFIORamDiscardListener *vrdl = NULL;
+
+ QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
+ if (vrdl->mr == section->mr &&
+ vrdl->offset_within_address_space ==
+ section->offset_within_address_space) {
+ break;
+ }
+ }
+
+ if (!vrdl) {
+ hw_error("vfio: Trying to sync missing RAM discard listener");
+ /* does not return */
+ }
+ return vrdl;
+}
+
static void vfio_listener_region_add(MemoryListener *listener,
MemoryRegionSection *section)
{
@@ -1285,19 +1305,8 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
MemoryRegionSection *section)
{
RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
- VFIORamDiscardListener *vrdl = NULL;
-
- QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
- if (vrdl->mr == section->mr &&
- vrdl->offset_within_address_space ==
- section->offset_within_address_space) {
- break;
- }
- }
-
- if (!vrdl) {
- hw_error("vfio: Trying to sync missing RAM discard listener");
- }
+ VFIORamDiscardListener *vrdl =
+ vfio_find_ram_discard_listener(bcontainer, section);
/*
* We only want/can synchronize the bitmap for actually mapped parts -
--
1.8.3.1
[PATCH V1 8/8] vfio-pci: recover from unmap-all-vaddr failure
From: Steve Sistare @ 2024-07-09 20:58 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, Steve Sistare
If there are multiple containers and unmap-all fails for some container, we
must restore the vaddrs of the other containers for which unmap-all
succeeded. Recover by walking all address ranges of all containers to
restore the vaddr for each. Do so by invoking the vfio listener callback
with a new "remap" flag that tells it to restore a mapping without
re-allocating new userland data structures.
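In outline, the recovery registers a throwaway MemoryListener whose
region_add callback remaps each flat section (a simplified sketch based on
the code below; the matching unregister call is assumed):

    bcontainer->remap_listener = (MemoryListener) {
        .name = "vfio recover",
        .region_add = vfio_region_remap,  /* ends in vfio_dma_map w/ VADDR */
    };
    memory_listener_register(&bcontainer->remap_listener,
                             bcontainer->space->as);
    memory_listener_unregister(&bcontainer->remap_listener);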
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/common.c | 45 ++++++++++++++++++++++++++++++++---
hw/vfio/cpr-legacy.c | 44 ++++++++++++++++++++++++++++++++++
include/hw/vfio/vfio-common.h | 4 +++-
include/hw/vfio/vfio-container-base.h | 1 +
4 files changed, 90 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5c7baad..da2e0ec 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -586,11 +586,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
{
VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
listener);
- vfio_container_region_add(bcontainer, section);
+ vfio_container_region_add(bcontainer, section, false);
}
void vfio_container_region_add(VFIOContainerBase *bcontainer,
- MemoryRegionSection *section)
+ MemoryRegionSection *section,
+ bool remap)
{
hwaddr iova, end;
Int128 llend, llsize;
@@ -626,6 +627,30 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
int iommu_idx;
trace_vfio_listener_region_add_iommu(iova, end);
+
+ /*
+ * If remap, then unmap with VFIO_DMA_UNMAP_FLAG_VADDR has been called,
+ * and we want to remap the vaddr. vfio_container_region_add was already
+ * called in the past, so the giommu already exists. Find it and
+ * replay it, which calls vfio_dma_map further down the stack.
+ */
+
+ if (remap) {
+ hwaddr as_offset = section->offset_within_address_space;
+ hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+ QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
+ if (giommu->iommu_mr == iommu_mr &&
+ giommu->iommu_offset == iommu_offset) {
+ memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+ return;
+ }
+ }
+ error_report("Container cannot find iommu region %s offset %lx",
+ memory_region_name(section->mr), iommu_offset);
+ goto fail;
+ }
+
/*
* FIXME: For VFIO iommu types which have KVM acceleration to
* avoid bouncing all map/unmaps through qemu this way, this
@@ -676,7 +701,21 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
* about changes.
*/
if (memory_region_has_ram_discard_manager(section->mr)) {
- vfio_register_ram_discard_listener(bcontainer, section);
+ /*
+ * If remap, then unmap with VFIO_DMA_UNMAP_FLAG_VADDR has been called,
+ * and we want to remap the vaddr. vfio_container_region_add was already
+ * called in the past, so the ram discard listener already exists.
+ * Call its populate function directly, which calls vfio_dma_map.
+ */
+ if (remap) {
+ VFIORamDiscardListener *vrdl =
+ vfio_find_ram_discard_listener(bcontainer, section);
+ if (vrdl->listener.notify_populate(&vrdl->listener, section)) {
+ error_report("listener.notify_populate failed");
+ }
+ } else {
+ vfio_register_ram_discard_listener(bcontainer, section);
+ }
return;
}
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index bc51ebe..c4b95a8 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -29,9 +29,18 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
return false;
}
+ container->vaddr_unmapped = true;
return true;
}
+static void vfio_region_remap(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+ VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
+ remap_listener);
+ vfio_container_region_add(bcontainer, section, true);
+}
+
static bool vfio_can_cpr_exec(VFIOContainer *container, Error **errp)
{
if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -95,6 +104,37 @@ static const VMStateDescription vfio_container_vmstate = {
}
};
+static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e, Error **errp)
+{
+ VFIOContainer *container =
+ container_of(notifier, VFIOContainer, cpr_exec_notifier);
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ if (e->type != MIG_EVENT_PRECOPY_FAILED) {
+ return 0;
+ }
+
+ if (container->vaddr_unmapped) {
+ /*
+ * Force a call to vfio_region_remap for each mapped section by
+ * temporarily registering a listener, which calls vfio_dma_map
+ * further down the stack. Set reused so vfio_dma_map restores vaddr.
+ */
+ bcontainer->reused = true;
+ bcontainer->remap_listener = (MemoryListener) {
+ .name = "vfio recover",
+ .region_add = vfio_region_remap
+ };
+ memory_listener_register(&bcontainer->remap_listener,
+ bcontainer->space->as);
+ memory_listener_unregister(&bcontainer->remap_listener);
+ bcontainer->reused = false;
+ container->vaddr_unmapped = false;
+ }
+ return 0;
+}
+
bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer,
Error **errp)
{
@@ -107,6 +147,9 @@ bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer,
vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+ migration_add_notifier_mode(&container->cpr_exec_notifier,
+ vfio_cpr_fail_notifier,
+ MIG_MODE_CPR_EXEC);
return true;
}
@@ -115,4 +158,5 @@ void vfio_legacy_cpr_unregister_container(VFIOContainerBase *bcontainer)
VFIOContainer *container = VFIO_CONTAINER(bcontainer);
vmstate_unregister(NULL, &vfio_container_vmstate, container);
+ migration_remove_notifier(&container->cpr_exec_notifier);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 7c4283b..1902c8f 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -81,6 +81,8 @@ typedef struct VFIOContainer {
VFIOContainerBase bcontainer;
int fd; /* /dev/vfio/vfio, empowered by the attached groups */
unsigned iommu_type;
+ NotifierWithReturn cpr_exec_notifier;
+ bool vaddr_unmapped;
QLIST_HEAD(, VFIOGroup) group_list;
} VFIOContainer;
@@ -292,7 +294,7 @@ int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
uint64_t size, ram_addr_t ram_addr, Error **errp);
void vfio_container_region_add(VFIOContainerBase *bcontainer,
- MemoryRegionSection *section);
+ MemoryRegionSection *section, bool remap);
void vfio_listener_register(VFIOContainerBase *bcontainer);
/* Returns 0 on success, or a negative errno. */
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 82ccf0c..3d30365 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -37,6 +37,7 @@ typedef struct VFIOContainerBase {
Object parent;
VFIOAddressSpace *space;
MemoryListener listener;
+ MemoryListener remap_listener;
Error *error;
bool initialized;
bool reused;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)
2024-07-09 20:58 ` [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
@ 2024-07-10 20:03 ` Alex Williamson
2024-07-10 20:32 ` Steven Sistare
2024-07-16 14:42 ` Steven Sistare
1 sibling, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2024-07-10 20:03 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On Tue, 9 Jul 2024 13:58:53 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:
> Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu.
>
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in CPR state.
>
> In the container pre_save handler, suspend the use of virtual addresses
> in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will
> be remapped at a different VA after exec. DMA to already-mapped pages
> continues. Save the msi message area as part of vfio-pci vmstate, and
> save the interrupt and notifier eventfd's in vmstate.
>
> On qemu restart, vfio_realize() finds the saved descriptors, uses the
> descriptors, and notes that the device is being reused. Device and iommu
> state is already configured, so operations in vfio_realize that would
> modify the configuration are skipped for a reused device, including vfio
> ioctl's and writes to PCI configuration space. Vfio PCI device reset
> is also suppressed. The result is that vfio_realize constructs qemu
> data structures that reflect the current state of the device. However,
> the reconstruction is not complete until migrate_incoming is called.
> migrate_incoming loads the msi data, the vfio post_load handler finds
> eventfds in CPR state, rebuilds vector data structures, and attaches the
> interrupts to the new KVM instance. The container post_load handler then
> invokes the main vfio listener callback, which walks the flattened ranges
> of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the
> kernel of the new VA's. Lastly, migration resumes the VM.
Hi Steve,
What's the iommufd plan for cpr? Thanks,
Alex
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)
2024-07-10 20:03 ` Alex Williamson
@ 2024-07-10 20:32 ` Steven Sistare
0 siblings, 0 replies; 13+ messages in thread
From: Steven Sistare @ 2024-07-10 20:32 UTC (permalink / raw)
To: Alex Williamson
Cc: qemu-devel, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 7/10/2024 4:03 PM, Alex Williamson wrote:
> On Tue, 9 Jul 2024 13:58:53 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
>
>> Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in CPR state.
>>
>> In the container pre_save handler, suspend the use of virtual addresses
>> in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will
>> be remapped at a different VA after exec. DMA to already-mapped pages
>> continues. Save the msi message area as part of vfio-pci vmstate, and
>> save the interrupt and notifier eventfd's in vmstate.
>>
>> On qemu restart, vfio_realize() finds the saved descriptors, uses the
>> descriptors, and notes that the device is being reused. Device and iommu
>> state is already configured, so operations in vfio_realize that would
>> modify the configuration are skipped for a reused device, including vfio
>> ioctl's and writes to PCI configuration space. Vfio PCI device reset
>> is also suppressed. The result is that vfio_realize constructs qemu
>> data structures that reflect the current state of the device. However,
>> the reconstruction is not complete until migrate_incoming is called.
>> migrate_incoming loads the msi data, the vfio post_load handler finds
>> eventfds in CPR state, rebuilds vector data structures, and attaches the
>> interrupts to the new KVM instance. The container post_load handler then
>> invokes the main vfio listener callback, which walks the flattened ranges
>> of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the
>> kernel of the new VA's. Lastly, migration resumes the VM.
>
> Hi Steve,
>
> What's the iommufd plan for cpr? Thanks,
I am working on vdpa and iommufd as we speak, with working prototypes for both.
I plan to submit the kernel and qemu RFC for vdpa next week, followed by vacation,
and iommufd in the weeks after that.
- Steve
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)
2024-07-09 20:58 ` [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
2024-07-10 20:03 ` Alex Williamson
@ 2024-07-16 14:42 ` Steven Sistare
1 sibling, 0 replies; 13+ messages in thread
From: Steven Sistare @ 2024-07-16 14:42 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 7/9/2024 4:58 PM, Steve Sistare wrote:
> Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu.
>
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in CPR state.
>
> In the container pre_save handler, suspend the use of virtual addresses
> in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will
> be remapped at a different VA after exec. DMA to already-mapped pages
> continues. Save the msi message area as part of vfio-pci vmstate, and
> save the interrupt and notifier eventfd's in vmstate.
>
> On qemu restart, vfio_realize() finds the saved descriptors, uses the
> descriptors, and notes that the device is being reused. Device and iommu
> state is already configured, so operations in vfio_realize that would
> modify the configuration are skipped for a reused device, including vfio
> ioctl's and writes to PCI configuration space. Vfio PCI device reset
> is also suppressed. The result is that vfio_realize constructs qemu
> data structures that reflect the current state of the device. However,
> the reconstruction is not complete until migrate_incoming is called.
> migrate_incoming loads the msi data, the vfio post_load handler finds
> eventfds in CPR state, rebuilds vector data structures, and attaches the
> interrupts to the new KVM instance. The container post_load handler then
> invokes the main vfio listener callback, which walks the flattened ranges
> of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the
> kernel of the new VA's. Lastly, migration resumes the VM.
>
> This functionality is delivered by 3 patches for clarity. Part 1 handles
> device file descriptors and DMA. Part 2 adds eventfd and MSI/MSI-X vector
> support. Part 3 adds INTX support.
> [...]
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> new file mode 100644
> index 0000000..bc51ebe
> --- /dev/null
> +++ b/hw/vfio/cpr-legacy.c
> [...]
> +
> +bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer,
> + Error **errp)
> +{
> + VFIOContainer *container = VFIO_CONTAINER(bcontainer);
> +
> + if (!vfio_can_cpr_exec(container, &bcontainer->cpr_blocker)) {
> + return migrate_add_blocker_modes(&bcontainer->cpr_blocker, errp,
> + MIG_MODE_CPR_EXEC, -1);
This is a bug. With the change in cpr_register return type to bool, this
should be:
return migrate_add_blocker_modes(...) == 0;
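In context, the corrected hunk would read (sketch with the fix applied; the
elided body is unchanged):

    bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer,
                                            Error **errp)
    {
        VFIOContainer *container = VFIO_CONTAINER(bcontainer);

        if (!vfio_can_cpr_exec(container, &bcontainer->cpr_blocker)) {
            /* migrate_add_blocker_modes() returns 0 on success, so
             * convert to the bool return type explicitly. */
            return migrate_add_blocker_modes(&bcontainer->cpr_blocker, errp,
                                             MIG_MODE_CPR_EXEC, -1) == 0;
        }
        ...
    }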
- Steve
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V1 0/8] Live update: vfio
2024-07-09 20:58 [PATCH V1 0/8] Live update: vfio Steve Sistare
` (7 preceding siblings ...)
2024-07-09 20:58 ` [PATCH V1 8/8] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
@ 2024-08-12 18:19 ` Steven Sistare
8 siblings, 0 replies; 13+ messages in thread
From: Steven Sistare @ 2024-08-12 18:19 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas
Hi all, any comments or RBs? This should be a slam dunk. Alex reviewed
9 versions of this code and all feedback has been incorporated. The only
significant change in this version is the addition of support for the
two container types: legacy and iommufd.
- Steve
On 7/9/2024 4:58 PM, Steve Sistare wrote:
> Support vfio devices with the cpr-exec live migration mode.
> See the commit messages of the individual patches for details.
> No user-visible interfaces are added.
>
> This series is extracted from the following and updated for the latest QEMU:
> [PATCH V9 00/46] Live Update
> https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
>
> This series depends on the following, which is based on commit 44b7329de469:
> [PATCH V2 00/11] Live update: cpr-exec
> https://lore.kernel.org/qemu-devel/1719776434-435013-1-git-send-email-steven.sistare@oracle.com/
>
> Steve Sistare (8):
> migration: cpr_needed_for_reuse
> pci: export msix_is_pending
> vfio-pci: refactor for cpr
> vfio-pci: cpr part 1 (fd and dma)
> vfio-pci: cpr part 2 (msi)
> vfio-pci: cpr part 3 (intx)
> vfio: vfio_find_ram_discard_listener
> vfio-pci: recover from unmap-all-vaddr failure
>
> hw/pci/msix.c | 2 +-
> hw/pci/pci.c | 13 ++
> hw/vfio/common.c | 88 ++++++++--
> hw/vfio/container.c | 139 ++++++++++++---
> hw/vfio/cpr-legacy.c | 162 ++++++++++++++++++
> hw/vfio/cpr.c | 24 ++-
> hw/vfio/meson.build | 3 +-
> hw/vfio/pci.c | 308 +++++++++++++++++++++++++++++-----
> include/hw/pci/msix.h | 1 +
> include/hw/vfio/vfio-common.h | 10 ++
> include/hw/vfio/vfio-container-base.h | 7 +
> include/migration/cpr.h | 1 +
> include/migration/vmstate.h | 2 +
> migration/cpr.c | 5 +
> 14 files changed, 682 insertions(+), 83 deletions(-)
> create mode 100644 hw/vfio/cpr-legacy.c
>
^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages
2024-07-09 20:58 [PATCH V1 0/8] Live update: vfio Steve Sistare
2024-07-09 20:58 ` [PATCH V1 1/8] migration: cpr_needed_for_reuse Steve Sistare
2024-07-09 20:58 ` [PATCH V1 2/8] pci: export msix_is_pending Steve Sistare
2024-07-09 20:58 ` [PATCH V1 3/8] vfio-pci: refactor for cpr Steve Sistare
2024-07-09 20:58 ` [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
2024-07-10 20:03 ` Alex Williamson
2024-07-10 20:32 ` Steven Sistare
2024-07-16 14:42 ` Steven Sistare
2024-07-09 20:58 ` [PATCH V1 5/8] vfio-pci: cpr part 2 (msi) Steve Sistare
2024-07-09 20:58 ` [PATCH V1 6/8] vfio-pci: cpr part 3 (intx) Steve Sistare
2024-07-09 20:58 ` [PATCH V1 7/8] vfio: vfio_find_ram_discard_listener Steve Sistare
2024-07-09 20:58 ` [PATCH V1 8/8] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
2024-08-12 18:19 ` [PATCH V1 0/8] Live update: vfio Steven Sistare