* [PATCH V2 01/45] MAINTAINERS: Add reviewer for CPR
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:53 ` Peter Xu
2025-02-14 14:13 ` [PATCH V2 02/45] migration: cpr helpers Steve Sistare
` (44 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
CPR is integrated with live migration, and has the same maintainers.
But, add a CPR section to add a reviewer.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
MAINTAINERS | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 3848d37..2f9a6da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2951,6 +2951,17 @@ F: include/qemu/co-shared-resource.h
T: git https://gitlab.com/jsnow/qemu.git jobs
T: git https://gitlab.com/vsementsov/qemu.git block
+CheckPoint and Restart (CPR)
+M: Peter Xu <peterx@redhat.com>
+M: Fabiano Rosas <farosas@suse.de>
+R: Steve Sistare <steven.sistare@oracle.com>
+S: Supported
+F: hw/vfio/cpr*
+F: include/migration/cpr.h
+F: migration/cpr*
+F: tests/qtest/migration/cpr*
+F: docs/devel/migration/CPR.rst
+
Compute Express Link
M: Jonathan Cameron <jonathan.cameron@huawei.com>
R: Fan Ni <fan.ni@samsung.com>
--
1.8.3.1
* Re: [PATCH V2 01/45] MAINTAINERS: Add reviewer for CPR
2025-02-14 14:13 ` [PATCH V2 01/45] MAINTAINERS: Add reviewer for CPR Steve Sistare
@ 2025-02-14 14:53 ` Peter Xu
2025-02-14 20:14 ` Steven Sistare
0 siblings, 1 reply; 72+ messages in thread
From: Peter Xu @ 2025-02-14 14:53 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 06:13:43AM -0800, Steve Sistare wrote:
> CPR is integrated with live migration, and has the same maintainers.
> But, add a CPR section to add a reviewer.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> MAINTAINERS | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3848d37..2f9a6da 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2951,6 +2951,17 @@ F: include/qemu/co-shared-resource.h
> T: git https://gitlab.com/jsnow/qemu.git jobs
> T: git https://gitlab.com/vsementsov/qemu.git block
>
> +CheckPoint and Restart (CPR)
> +M: Peter Xu <peterx@redhat.com>
> +M: Fabiano Rosas <farosas@suse.de>
> +R: Steve Sistare <steven.sistare@oracle.com>
> +S: Supported
> +F: hw/vfio/cpr*
> +F: include/migration/cpr.h
> +F: migration/cpr*
> +F: tests/qtest/migration/cpr*
> +F: docs/devel/migration/CPR.rst
All of the above files are covered by either migration or vfio.
If the plan is for CPR to be part of the existing subsystems, IMHO we could
drop the M: entries here and keep only R:. Or, make one M: entry for
yourself.
With that, IIUC anyone using get_maintainer.pl will always get the right
people to copy: it goes to VFIO if it's under the 1st entry (hw/vfio/cpr*),
or it goes to migration if it's one of the other four entries. Meanwhile, if
any of the above is touched you'll get copied too.
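
The routing described above can be approximated in a few lines of C. This is
only a rough model, not how the maintainer lookup script is implemented: it
treats each F: entry as a shell glob via POSIX fnmatch(), and the names
demo_section and demo_section_matches are hypothetical.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_PATTERNS 8

/* Hypothetical, simplified model of one MAINTAINERS section. */
struct demo_section {
    const char *name;
    const char *patterns[DEMO_MAX_PATTERNS];   /* F: globs, NULL-terminated */
};

/* Return true if any F: pattern of the section matches the given path. */
static bool demo_section_matches(const struct demo_section *s,
                                 const char *path)
{
    assert(s && path);

    for (size_t i = 0; i < DEMO_MAX_PATTERNS && s->patterns[i]; i++) {
        /* Flags of 0: '*' is allowed to match '/' as well. */
        if (fnmatch(s->patterns[i], path, 0) == 0) {
            return true;
        }
    }
    return false;
}
```

A path is reported to every section whose patterns match it, so a CPR section
that carries only an R: entry would add the reviewer on top of whichever
maintainers the broader vfio or migration sections already select.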
> +
> Compute Express Link
> M: Jonathan Cameron <jonathan.cameron@huawei.com>
> R: Fan Ni <fan.ni@samsung.com>
> --
> 1.8.3.1
>
--
Peter Xu
* Re: [PATCH V2 01/45] MAINTAINERS: Add reviewer for CPR
2025-02-14 14:53 ` Peter Xu
@ 2025-02-14 20:14 ` Steven Sistare
0 siblings, 0 replies; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 20:14 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/14/2025 9:53 AM, Peter Xu wrote:
> On Fri, Feb 14, 2025 at 06:13:43AM -0800, Steve Sistare wrote:
>> CPR is integrated with live migration, and has the same maintainers.
>> But, add a CPR section to add a reviewer.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> MAINTAINERS | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 3848d37..2f9a6da 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2951,6 +2951,17 @@ F: include/qemu/co-shared-resource.h
>> T: git https://gitlab.com/jsnow/qemu.git jobs
>> T: git https://gitlab.com/vsementsov/qemu.git block
>>
>> +CheckPoint and Restart (CPR)
>> +M: Peter Xu <peterx@redhat.com>
>> +M: Fabiano Rosas <farosas@suse.de>
>> +R: Steve Sistare <steven.sistare@oracle.com>
>> +S: Supported
>> +F: hw/vfio/cpr*
>> +F: include/migration/cpr.h
>> +F: migration/cpr*
>> +F: tests/qtest/migration/cpr*
>> +F: docs/devel/migration/CPR.rst
>
> All of the above files are covered by either migration or vfio.
>
> If the plan is for CPR to be part of the existing subsystems, IMHO we could
> drop the M: entries here and keep only R:. Or, make one M: entry for
> yourself.
>
> With that, IIUC anyone using get_maintainer.pl will always get the right
> people to copy: it goes to VFIO if it's under the 1st entry (hw/vfio/cpr*),
> or it goes to migration if it's one of the other four entries. Meanwhile, if
> any of the above is touched you'll get copied too.
OK, I'll remove the M entries - steve
>> +
>> Compute Express Link
>> M: Jonathan Cameron <jonathan.cameron@huawei.com>
>> R: Fan Ni <fan.ni@samsung.com>
>> --
>> 1.8.3.1
>>
>
* [PATCH V2 02/45] migration: cpr helpers
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
2025-02-14 14:13 ` [PATCH V2 01/45] MAINTAINERS: Add reviewer for CPR Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 16:37 ` Peter Xu
2025-02-14 14:13 ` [PATCH V2 03/45] migration: lower handler priority Steve Sistare
` (43 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add the cpr_needed_for_reuse, cpr_resave_fd, cpr_is_incoming, and
cpr_open_fd helpers, for use when adding cpr support for vfio and iommufd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/migration/cpr.h | 6 ++++++
migration/cpr.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 3a6deb7..6ad04d4 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -18,15 +18,21 @@
void cpr_save_fd(const char *name, int id, int fd);
void cpr_delete_fd(const char *name, int id);
int cpr_find_fd(const char *name, int id);
+void cpr_resave_fd(const char *name, int id, int fd);
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+ bool *reused, Error **errp);
MigMode cpr_get_incoming_mode(void);
void cpr_set_incoming_mode(MigMode mode);
+bool cpr_is_incoming(void);
int cpr_state_save(MigrationChannel *channel, Error **errp);
int cpr_state_load(MigrationChannel *channel, Error **errp);
void cpr_state_close(void);
struct QIOChannel *cpr_state_ioc(void);
+bool cpr_needed_for_reuse(void *opaque);
+
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 584b0b9..12c489b 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -95,6 +95,39 @@ int cpr_find_fd(const char *name, int id)
trace_cpr_find_fd(name, id, fd);
return fd;
}
+
+void cpr_resave_fd(const char *name, int id, int fd)
+{
+ CprFd *elem = find_fd(&cpr_state.fds, name, id);
+ int old_fd = elem ? elem->fd : -1;
+
+ if (old_fd < 0) {
+ cpr_save_fd(name, id, fd);
+ } else if (old_fd != fd) {
+ error_setg(&error_fatal,
+ "internal error: cpr fd '%s' id %d value %d "
+ "already saved with a different value %d",
+ name, id, fd, old_fd);
+ }
+}
+
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+ bool *reused, Error **errp)
+{
+ int fd = cpr_find_fd(name, id);
+
+ if (reused) {
+ *reused = (fd >= 0);
+ }
+ if (fd < 0) {
+ fd = qemu_open(path, flags, errp);
+ if (fd >= 0) {
+ cpr_save_fd(name, id, fd);
+ }
+ }
+ return fd;
+}
+
/*************************************************************************/
#define CPR_STATE "CprState"
@@ -128,6 +161,11 @@ void cpr_set_incoming_mode(MigMode mode)
incoming_mode = mode;
}
+bool cpr_is_incoming(void)
+{
+ return incoming_mode != MIG_MODE_NONE;
+}
+
int cpr_state_save(MigrationChannel *channel, Error **errp)
{
int ret;
@@ -222,3 +260,9 @@ void cpr_state_close(void)
cpr_state_file = NULL;
}
}
+
+bool cpr_needed_for_reuse(void *opaque)
+{
+ MigMode mode = migrate_mode();
+ return mode == MIG_MODE_CPR_TRANSFER;
+}
--
1.8.3.1
* Re: [PATCH V2 02/45] migration: cpr helpers
2025-02-14 14:13 ` [PATCH V2 02/45] migration: cpr helpers Steve Sistare
@ 2025-02-14 16:37 ` Peter Xu
2025-02-14 20:31 ` Steven Sistare
0 siblings, 1 reply; 72+ messages in thread
From: Peter Xu @ 2025-02-14 16:37 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 06:13:44AM -0800, Steve Sistare wrote:
> Add cpr_needed_for_reuse, cpr_resave_fd helpers, cpr_is_incoming, and
> cpr_open_fd, for use when adding cpr support for vfio and iommufd.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/migration/cpr.h | 6 ++++++
> migration/cpr.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 50 insertions(+)
>
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 3a6deb7..6ad04d4 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -18,15 +18,21 @@
> void cpr_save_fd(const char *name, int id, int fd);
> void cpr_delete_fd(const char *name, int id);
> int cpr_find_fd(const char *name, int id);
> +void cpr_resave_fd(const char *name, int id, int fd);
> +int cpr_open_fd(const char *path, int flags, const char *name, int id,
> + bool *reused, Error **errp);
>
> MigMode cpr_get_incoming_mode(void);
> void cpr_set_incoming_mode(MigMode mode);
> +bool cpr_is_incoming(void);
>
> int cpr_state_save(MigrationChannel *channel, Error **errp);
> int cpr_state_load(MigrationChannel *channel, Error **errp);
> void cpr_state_close(void);
> struct QIOChannel *cpr_state_ioc(void);
>
> +bool cpr_needed_for_reuse(void *opaque);
> +
> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 584b0b9..12c489b 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -95,6 +95,39 @@ int cpr_find_fd(const char *name, int id)
> trace_cpr_find_fd(name, id, fd);
> return fd;
> }
> +
> +void cpr_resave_fd(const char *name, int id, int fd)
> +{
> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
> + int old_fd = elem ? elem->fd : -1;
> +
> + if (old_fd < 0) {
> + cpr_save_fd(name, id, fd);
> + } else if (old_fd != fd) {
> + error_setg(&error_fatal,
> + "internal error: cpr fd '%s' id %d value %d "
> + "already saved with a different value %d",
> + name, id, fd, old_fd);
How bad is it to trigger this?
I wonder if cpr_save_fd() should have checked this already on duplicated
entries; it looks risky there too if this happens to existing cpr_save_fd()
callers.
> + }
> +}
> +
> +int cpr_open_fd(const char *path, int flags, const char *name, int id,
> + bool *reused, Error **errp)
> +{
> + int fd = cpr_find_fd(name, id);
> +
> + if (reused) {
> + *reused = (fd >= 0);
> + }
> + if (fd < 0) {
> + fd = qemu_open(path, flags, errp);
> + if (fd >= 0) {
> + cpr_save_fd(name, id, fd);
> + }
> + }
> + return fd;
> +}
> +
> /*************************************************************************/
> #define CPR_STATE "CprState"
>
> @@ -128,6 +161,11 @@ void cpr_set_incoming_mode(MigMode mode)
> incoming_mode = mode;
> }
>
> +bool cpr_is_incoming(void)
> +{
> + return incoming_mode != MIG_MODE_NONE;
> +}
Maybe it'll be helpful to document either this function or incoming_mode;
it's probably not yet obvious to most readers that incoming_mode is only set
to !NONE during a small window when the VM loads.
> +
> int cpr_state_save(MigrationChannel *channel, Error **errp)
> {
> int ret;
> @@ -222,3 +260,9 @@ void cpr_state_close(void)
> cpr_state_file = NULL;
> }
> }
> +
> +bool cpr_needed_for_reuse(void *opaque)
> +{
> + MigMode mode = migrate_mode();
Nit: can drop the var.
> + return mode == MIG_MODE_CPR_TRANSFER;
> +}
> --
> 1.8.3.1
>
--
Peter Xu
* Re: [PATCH V2 02/45] migration: cpr helpers
2025-02-14 16:37 ` Peter Xu
@ 2025-02-14 20:31 ` Steven Sistare
2025-02-18 16:26 ` Peter Xu
0 siblings, 1 reply; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 20:31 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/14/2025 11:37 AM, Peter Xu wrote:
> On Fri, Feb 14, 2025 at 06:13:44AM -0800, Steve Sistare wrote:
>> Add cpr_needed_for_reuse, cpr_resave_fd helpers, cpr_is_incoming, and
>> cpr_open_fd, for use when adding cpr support for vfio and iommufd.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/migration/cpr.h | 6 ++++++
>> migration/cpr.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 50 insertions(+)
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index 3a6deb7..6ad04d4 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -18,15 +18,21 @@
>> void cpr_save_fd(const char *name, int id, int fd);
>> void cpr_delete_fd(const char *name, int id);
>> int cpr_find_fd(const char *name, int id);
>> +void cpr_resave_fd(const char *name, int id, int fd);
>> +int cpr_open_fd(const char *path, int flags, const char *name, int id,
>> + bool *reused, Error **errp);
>>
>> MigMode cpr_get_incoming_mode(void);
>> void cpr_set_incoming_mode(MigMode mode);
>> +bool cpr_is_incoming(void);
>>
>> int cpr_state_save(MigrationChannel *channel, Error **errp);
>> int cpr_state_load(MigrationChannel *channel, Error **errp);
>> void cpr_state_close(void);
>> struct QIOChannel *cpr_state_ioc(void);
>>
>> +bool cpr_needed_for_reuse(void *opaque);
>> +
>> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>> QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 584b0b9..12c489b 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -95,6 +95,39 @@ int cpr_find_fd(const char *name, int id)
>> trace_cpr_find_fd(name, id, fd);
>> return fd;
>> }
>> +
>> +void cpr_resave_fd(const char *name, int id, int fd)
>> +{
>> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
>> + int old_fd = elem ? elem->fd : -1;
>> +
>> + if (old_fd < 0) {
>> + cpr_save_fd(name, id, fd);
>> + } else if (old_fd != fd) {
>> + error_setg(&error_fatal,
>> + "internal error: cpr fd '%s' id %d value %d "
>> + "already saved with a different value %d",
>> + name, id, fd, old_fd);
>
> How bad is it to trigger this?
Bad, cpr will likely fail the next time it is used.
I suppose I could add a blocker instead of using error_fatal.
But, fundamentally something unknown has gone wrong, like for
any assertion failure, so continuing to run in an uncertain
state seems unwise.
I have only ever seen this during development after adding buggy code.
> I wonder if cpr_save_fd() should have checked this already on duplicated
> entries; it looks risky there too if this happens to existing cpr_save_fd()
> callers.
Yes, I could check for dups in cpr_save_fd, though it would cost O(N) instead
of O(1). That seems like overkill for a bug that should only bite during new
code development.
cpr_resave_fd is O(N), but not for error checking. Callers use it when they
know the fd was (or may have been) already created. It is a programming
convenience that simplifies the call sites.
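
The cost difference described above can be seen in a small standalone model.
It is a sketch under assumptions, not the code in migration/cpr.c: the
demo_* names and the fixed-size name buffer are hypothetical, and the
conflict case returns -1 here, whereas the patch reports it through
error_fatal.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a saved-fd list element. */
struct demo_fd {
    char name[32];
    int id;
    int fd;
    struct demo_fd *next;
};

static struct demo_fd *demo_fds;   /* list head */

/* O(N): walk the list looking for (name, id); NULL if absent. */
static struct demo_fd *demo_find(const char *name, int id)
{
    for (struct demo_fd *e = demo_fds; e; e = e->next) {
        if (e->id == id && strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* O(1): push at the head, with no check for an existing entry. */
static void demo_save_fd(const char *name, int id, int fd)
{
    struct demo_fd *e = malloc(sizeof(*e));

    assert(e);
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->id = id;
    e->fd = fd;
    e->next = demo_fds;
    demo_fds = e;
}

/*
 * O(N): save only if absent. Returns 0 if the entry was added or already
 * holds the same fd, and -1 if it holds a different fd.
 */
static int demo_resave_fd(const char *name, int id, int fd)
{
    struct demo_fd *e = demo_find(name, id);

    if (!e) {
        demo_save_fd(name, id, fd);
        return 0;
    }
    return e->fd == fd ? 0 : -1;
}
```

In this model, an unchecked demo_save_fd of a duplicate key silently shadows
the older entry, which is the kind of risk raised in the review; checking
for it would require the same list walk that demo_resave_fd performs.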
>> + }
>> +}
>> +
>> +int cpr_open_fd(const char *path, int flags, const char *name, int id,
>> + bool *reused, Error **errp)
>> +{
>> + int fd = cpr_find_fd(name, id);
>> +
>> + if (reused) {
>> + *reused = (fd >= 0);
>> + }
>> + if (fd < 0) {
>> + fd = qemu_open(path, flags, errp);
>> + if (fd >= 0) {
>> + cpr_save_fd(name, id, fd);
>> + }
>> + }
>> + return fd;
>> +}
>> +
>> /*************************************************************************/
>> #define CPR_STATE "CprState"
>>
>> @@ -128,6 +161,11 @@ void cpr_set_incoming_mode(MigMode mode)
>> incoming_mode = mode;
>> }
>>
>> +bool cpr_is_incoming(void)
>> +{
>> + return incoming_mode != MIG_MODE_NONE;
>> +}
>
> Maybe it'll be helpful to document either this function or incoming_mode;
> it's probably not yet obvious to most readers that incoming_mode is only set
> to !NONE during a small window when the VM loads.
OK, I'll add a function header comment.
>> +
>> int cpr_state_save(MigrationChannel *channel, Error **errp)
>> {
>> int ret;
>> @@ -222,3 +260,9 @@ void cpr_state_close(void)
>> cpr_state_file = NULL;
>> }
>> }
>> +
>> +bool cpr_needed_for_reuse(void *opaque)
>> +{
>> + MigMode mode = migrate_mode();
>
> Nit: can drop the var.
OK.
- Steve
>
>> + return mode == MIG_MODE_CPR_TRANSFER;
>> +}
>> --
>> 1.8.3.1
>>
>
* Re: [PATCH V2 02/45] migration: cpr helpers
2025-02-14 20:31 ` Steven Sistare
@ 2025-02-18 16:26 ` Peter Xu
2025-02-24 16:51 ` Steven Sistare
0 siblings, 1 reply; 72+ messages in thread
From: Peter Xu @ 2025-02-18 16:26 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 03:31:29PM -0500, Steven Sistare wrote:
> On 2/14/2025 11:37 AM, Peter Xu wrote:
> > On Fri, Feb 14, 2025 at 06:13:44AM -0800, Steve Sistare wrote:
> > > Add cpr_needed_for_reuse, cpr_resave_fd helpers, cpr_is_incoming, and
> > > cpr_open_fd, for use when adding cpr support for vfio and iommufd.
> > >
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > > include/migration/cpr.h | 6 ++++++
> > > migration/cpr.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 50 insertions(+)
> > >
> > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > > index 3a6deb7..6ad04d4 100644
> > > --- a/include/migration/cpr.h
> > > +++ b/include/migration/cpr.h
> > > @@ -18,15 +18,21 @@
> > > void cpr_save_fd(const char *name, int id, int fd);
> > > void cpr_delete_fd(const char *name, int id);
> > > int cpr_find_fd(const char *name, int id);
> > > +void cpr_resave_fd(const char *name, int id, int fd);
> > > +int cpr_open_fd(const char *path, int flags, const char *name, int id,
> > > + bool *reused, Error **errp);
> > > MigMode cpr_get_incoming_mode(void);
> > > void cpr_set_incoming_mode(MigMode mode);
> > > +bool cpr_is_incoming(void);
> > > int cpr_state_save(MigrationChannel *channel, Error **errp);
> > > int cpr_state_load(MigrationChannel *channel, Error **errp);
> > > void cpr_state_close(void);
> > > struct QIOChannel *cpr_state_ioc(void);
> > > +bool cpr_needed_for_reuse(void *opaque);
> > > +
> > > QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> > > QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
> > > diff --git a/migration/cpr.c b/migration/cpr.c
> > > index 584b0b9..12c489b 100644
> > > --- a/migration/cpr.c
> > > +++ b/migration/cpr.c
> > > @@ -95,6 +95,39 @@ int cpr_find_fd(const char *name, int id)
> > > trace_cpr_find_fd(name, id, fd);
> > > return fd;
> > > }
> > > +
> > > +void cpr_resave_fd(const char *name, int id, int fd)
> > > +{
> > > + CprFd *elem = find_fd(&cpr_state.fds, name, id);
> > > + int old_fd = elem ? elem->fd : -1;
> > > +
> > > + if (old_fd < 0) {
> > > + cpr_save_fd(name, id, fd);
> > > + } else if (old_fd != fd) {
> > > + error_setg(&error_fatal,
> > > + "internal error: cpr fd '%s' id %d value %d "
> > > + "already saved with a different value %d",
> > > + name, id, fd, old_fd);
> >
> > How bad is it to trigger this?
>
> Bad, cpr will likely fail the next time it is used.
> I suppose I could add a blocker instead of using error_fatal.
> But, fundamentally something unknown has gone wrong, like for
> any assertion failure, so continuing to run in an uncertain
> state seems unwise.
>
> I have only ever seen this during development after adding buggy code.
>
> > I wonder if cpr_save_fd() should have checked this already on duplicated
> > entries; it looks risky there too if this happens to existing cpr_save_fd()
> > callers.
>
> Yes, I could check for dups in cpr_save_fd, though it would cost O(N) instead
> of O(1). That seems like overkill for a bug that should only bite during new
> code development.
>
> cpr_resave_fd is O(N), but not for error checking. Callers use it when they
> know the fd was (or may have been) already created. It is a programming
> convenience that simplifies the call sites.
If the caller knows the fd was created, then IIUC the caller shouldn't
invoke the call.
For the other case, could you give an example where the fd may or may not
have been created already? I'm a bit surprised we have such a use case.
--
Peter Xu
* Re: [PATCH V2 02/45] migration: cpr helpers
2025-02-18 16:26 ` Peter Xu
@ 2025-02-24 16:51 ` Steven Sistare
0 siblings, 0 replies; 72+ messages in thread
From: Steven Sistare @ 2025-02-24 16:51 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/18/2025 11:26 AM, Peter Xu wrote:
> On Fri, Feb 14, 2025 at 03:31:29PM -0500, Steven Sistare wrote:
>> On 2/14/2025 11:37 AM, Peter Xu wrote:
>>> On Fri, Feb 14, 2025 at 06:13:44AM -0800, Steve Sistare wrote:
>>>> Add cpr_needed_for_reuse, cpr_resave_fd helpers, cpr_is_incoming, and
>>>> cpr_open_fd, for use when adding cpr support for vfio and iommufd.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> include/migration/cpr.h | 6 ++++++
>>>> migration/cpr.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 50 insertions(+)
>>>>
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index 3a6deb7..6ad04d4 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -18,15 +18,21 @@
>>>> void cpr_save_fd(const char *name, int id, int fd);
>>>> void cpr_delete_fd(const char *name, int id);
>>>> int cpr_find_fd(const char *name, int id);
>>>> +void cpr_resave_fd(const char *name, int id, int fd);
>>>> +int cpr_open_fd(const char *path, int flags, const char *name, int id,
>>>> + bool *reused, Error **errp);
>>>> MigMode cpr_get_incoming_mode(void);
>>>> void cpr_set_incoming_mode(MigMode mode);
>>>> +bool cpr_is_incoming(void);
>>>> int cpr_state_save(MigrationChannel *channel, Error **errp);
>>>> int cpr_state_load(MigrationChannel *channel, Error **errp);
>>>> void cpr_state_close(void);
>>>> struct QIOChannel *cpr_state_ioc(void);
>>>> +bool cpr_needed_for_reuse(void *opaque);
>>>> +
>>>> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>> QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index 584b0b9..12c489b 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -95,6 +95,39 @@ int cpr_find_fd(const char *name, int id)
>>>> trace_cpr_find_fd(name, id, fd);
>>>> return fd;
>>>> }
>>>> +
>>>> +void cpr_resave_fd(const char *name, int id, int fd)
>>>> +{
>>>> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
>>>> + int old_fd = elem ? elem->fd : -1;
>>>> +
>>>> + if (old_fd < 0) {
>>>> + cpr_save_fd(name, id, fd);
>>>> + } else if (old_fd != fd) {
>>>> + error_setg(&error_fatal,
>>>> + "internal error: cpr fd '%s' id %d value %d "
>>>> + "already saved with a different value %d",
>>>> + name, id, fd, old_fd);
>>>
>>> How bad is it to trigger this?
>>
>> Bad, cpr will likely fail the next time it is used.
>> I suppose I could add a blocker instead of using error_fatal.
>> But, fundamentally something unknown has gone wrong, like for
>> any assertion failure, so continuing to run in an uncertain
>> state seems unwise.
>>
>> I have only ever seen this during development after adding buggy code.
>>
>>> I wonder if cpr_save_fd() should have checked this already on duplicated
>>> entries; it looks risky there too if this happens to existing cpr_save_fd()
>>> callers.
>>
>> Yes, I could check for dups in cpr_save_fd, though it would cost O(N) instead
>> of O(1). That seems like overkill for a bug that should only bite during new
>> code development.
>>
>> cpr_resave_fd is O(N), but not for error checking. Callers use it when they
>> know the fd was (or may have been) already created. It is a programming
>> convenience that simplifies the call sites.
>
> If the caller knows the fd was created, then IIUC the caller shouldn't
> invoke the call.
>
> For the other case, could you give an example where the fd may or may not
> have been created already? I'm a bit surprised we have such a use case.
It avoids the need to remember that an fd was reused, and test that fact before
calling cpr_save_fd. And sometimes those operations occur in different functions.
Thus resave saves a few lines of code. Trivial, though. I will just delete
cpr_resave_fd.
- Steve
* [PATCH V2 03/45] migration: lower handler priority
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
2025-02-14 14:13 ` [PATCH V2 01/45] MAINTAINERS: Add reviewer for CPR Steve Sistare
2025-02-14 14:13 ` [PATCH V2 02/45] migration: cpr helpers Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 15:58 ` Peter Xu
2025-02-14 14:13 ` [PATCH V2 04/45] vfio: vfio_find_ram_discard_listener Steve Sistare
` (42 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define a vmstate priority that is lower than the default, so its handlers
run after all default priority handlers. Since 0 is no longer the default
priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
CPR for vfio will use this to install handlers for containers that run
after handlers for the devices that they contain.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
include/migration/vmstate.h | 6 +++++-
migration/savevm.c | 4 ++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index a1dfab4..1ff7bd9 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -155,7 +155,11 @@ enum VMStateFlags {
};
typedef enum {
- MIG_PRI_DEFAULT = 0,
+ MIG_PRI_UNINITIALIZED = 0, /* An uninitialized priority field maps to */
+ /* MIG_PRI_DEFAULT in save_state_priority */
+
+ MIG_PRI_LOW, /* Must happen after default */
+ MIG_PRI_DEFAULT,
MIG_PRI_IOMMU, /* Must happen before PCI devices */
MIG_PRI_PCI_BUS, /* Must happen before IOMMU */
MIG_PRI_VIRTIO_MEM, /* Must happen before IOMMU */
diff --git a/migration/savevm.c b/migration/savevm.c
index 85a3559..7eee90d 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -232,7 +232,7 @@ typedef struct SaveState {
static SaveState savevm_state = {
.handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
- .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
+ .handler_pri_head = { [0 ... MIG_PRI_MAX] = NULL },
.global_section_id = 0,
};
@@ -704,7 +704,7 @@ static int calculate_compat_instance_id(const char *idstr)
static inline MigrationPriority save_state_priority(SaveStateEntry *se)
{
- if (se->vmsd) {
+ if (se->vmsd && se->vmsd->priority) {
return se->vmsd->priority;
}
return MIG_PRI_DEFAULT;
--
1.8.3.1
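
The priority translation in patch 03 can be modeled in isolation. The sketch
below is an assumption-laden illustration rather than the QEMU code: the
Demo* enum lists only a few of the real values, and demo_vmsd and
demo_state_priority are hypothetical stand-ins for the vmstate description
and save_state_priority.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    DEMO_PRI_UNINITIALIZED = 0,   /* zero-filled field: nothing was set */
    DEMO_PRI_LOW,                 /* must run after default */
    DEMO_PRI_DEFAULT,
    DEMO_PRI_IOMMU,
    DEMO_PRI_MAX,
} DemoPriority;

/* The new low priority has to sort below the default one. */
static_assert(DEMO_PRI_LOW < DEMO_PRI_DEFAULT,
              "low priority must sort below default");

/* Hypothetical stand-in for a vmstate description. */
struct demo_vmsd {
    const char *name;
    DemoPriority priority;
};

/*
 * Map a missing description or an unset (zero) priority to the default,
 * so descriptions that never set the field keep their old ordering.
 */
static DemoPriority demo_state_priority(const struct demo_vmsd *vmsd)
{
    if (vmsd && vmsd->priority) {
        return vmsd->priority;
    }
    return DEMO_PRI_DEFAULT;
}
```

Because the default is no longer the value 0, a description that leaves the
field unset would otherwise sort below DEMO_PRI_LOW; the translation keeps
such entries ahead of the low priority handlers.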
* Re: [PATCH V2 03/45] migration: lower handler priority
2025-02-14 14:13 ` [PATCH V2 03/45] migration: lower handler priority Steve Sistare
@ 2025-02-14 15:58 ` Peter Xu
0 siblings, 0 replies; 72+ messages in thread
From: Peter Xu @ 2025-02-14 15:58 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 06:13:45AM -0800, Steve Sistare wrote:
> Define a vmstate priority that is lower than the default, so its handlers
> run after all default priority handlers. Since 0 is no longer the default
> priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
>
> CPR for vfio will use this to install handlers for containers that run
> after handlers for the devices that they contain.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* [PATCH V2 04/45] vfio: vfio_find_ram_discard_listener
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (2 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 03/45] migration: lower handler priority Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 05/45] vfio/container: ram discard disable helper Steve Sistare
` (41 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define vfio_find_ram_discard_listener as a subroutine so additional calls to
it may be added in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/common.c | 35 ++++++++++++++++++++++-------------
include/hw/vfio/vfio-common.h | 3 +++
2 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index abbdc56..e3e1da0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -567,6 +567,26 @@ static void vfio_device_error_append(VFIODevice *vbasedev, Error **errp)
}
}
+VFIORamDiscardListener *vfio_find_ram_discard_listener(
+ VFIOContainerBase *bcontainer, MemoryRegionSection *section)
+{
+ VFIORamDiscardListener *vrdl = NULL;
+
+ QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
+ if (vrdl->mr == section->mr &&
+ vrdl->offset_within_address_space ==
+ section->offset_within_address_space) {
+ break;
+ }
+ }
+
+ if (!vrdl) {
+ hw_error("vfio: Trying to sync missing RAM discard listener");
+ /* does not return */
+ }
+ return vrdl;
+}
+
static void vfio_listener_region_add(MemoryListener *listener,
MemoryRegionSection *section)
{
@@ -1284,19 +1304,8 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
MemoryRegionSection *section)
{
RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
- VFIORamDiscardListener *vrdl = NULL;
-
- QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
- if (vrdl->mr == section->mr &&
- vrdl->offset_within_address_space ==
- section->offset_within_address_space) {
- break;
- }
- }
-
- if (!vrdl) {
- hw_error("vfio: Trying to sync missing RAM discard listener");
- }
+ VFIORamDiscardListener *vrdl =
+ vfio_find_ram_discard_listener(bcontainer, section);
/*
* We only want/can synchronize the bitmap for actually mapped parts -
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ac35136..d601eea 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -306,6 +306,9 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
uint64_t size, ram_addr_t ram_addr, Error **errp);
+VFIORamDiscardListener *vfio_find_ram_discard_listener(
+ VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+
/* Returns 0 on success, or a negative errno. */
bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
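The refactoring in patch 04 can be sketched outside QEMU as follows. This is a minimal standalone illustration, not the QEMU code: the Listener type, listener_lookup(), listener_find(), and listener_example() are hypothetical names, and a plain singly linked list stands in for QLIST. It shows the same idea of moving the search loop and the fatal "not found" check into one helper that several call sites can share.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for VFIORamDiscardListener; not the QEMU type. */
typedef struct Listener {
    const void *mr;          /* identity of the memory region */
    unsigned long offset;    /* offset within the address space */
    struct Listener *next;   /* singly linked list, NULL terminated */
} Listener;

/* Return the listener matching (mr, offset), or NULL if there is none. */
static Listener *listener_lookup(Listener *head, const void *mr,
                                 unsigned long offset)
{
    Listener *l;

    for (l = head; l != NULL; l = l->next) {
        if (l->mr == mr && l->offset == offset) {
            return l;
        }
    }
    return NULL;
}

/* As in the patch, a miss is a fatal programming error: does not return. */
static Listener *listener_find(Listener *head, const void *mr,
                               unsigned long offset)
{
    Listener *l = listener_lookup(head, mr, offset);

    if (!l) {
        fprintf(stderr, "missing listener\n");
        abort();
    }
    return l;
}

/* Usage: two listeners on a list, looked up by region and offset. */
void listener_example(void)
{
    int region_a, region_b;   /* only their addresses are used, as keys */
    Listener second = { &region_b, 16, NULL };
    Listener first = { &region_a, 0, &second };

    assert(listener_find(&first, &region_b, 16) == &second);
    assert(listener_lookup(&first, &region_a, 16) == NULL);
}
```

Splitting the non-fatal lookup from the fatal wrapper is a choice made for this sketch only; the patch itself keeps a single function that calls hw_error() on a miss.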
* [PATCH V2 05/45] vfio/container: ram discard disable helper
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (3 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 04/45] vfio: vfio_find_ram_discard_listener Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-17 17:58 ` Cédric Le Goater
2025-02-14 14:13 ` [PATCH V2 06/45] vfio/container: reform vfio_connect_container cleanup Steve Sistare
` (40 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define a helper to set ram discard disable, generate error messages,
and clean up on failure. The second vfio_ram_block_discard_disable
call site now performs VFIO_GROUP_UNSET_CONTAINER immediately on failure,
instead of relying on the close of the container fd to do so in the kernel,
but this is equivalent.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/container.c | 48 +++++++++++++++++++++++++++---------------------
1 file changed, 27 insertions(+), 21 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 7c57bdd2..0f17d53 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -535,16 +535,10 @@ static bool vfio_legacy_setup(VFIOContainerBase *bcontainer, Error **errp)
return true;
}
-static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
- Error **errp)
+static bool vfio_attach_discard_disable(VFIOContainer *container,
+ VFIOGroup *group, Error **errp)
{
- VFIOContainer *container;
- VFIOContainerBase *bcontainer;
- int ret, fd;
- VFIOAddressSpace *space;
- VFIOIOMMUClass *vioc;
-
- space = vfio_get_address_space(as);
+ int ret;
/*
* VFIO is currently incompatible with discarding of RAM insofar as the
@@ -577,18 +571,32 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
* details once we know which type of IOMMU we are using.
*/
+ ret = vfio_ram_block_discard_disable(container, true);
+ if (ret) {
+ error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
+ if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+ error_report("vfio: error disconnecting group %d from"
+ " container", group->groupid);
+ }
+ }
+ return !ret;
+}
+
+static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
+ Error **errp)
+{
+ VFIOContainer *container;
+ VFIOContainerBase *bcontainer;
+ int ret, fd;
+ VFIOAddressSpace *space;
+ VFIOIOMMUClass *vioc;
+
+ space = vfio_get_address_space(as);
+
QLIST_FOREACH(bcontainer, &space->containers, next) {
container = container_of(bcontainer, VFIOContainer, bcontainer);
if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
- ret = vfio_ram_block_discard_disable(container, true);
- if (ret) {
- error_setg_errno(errp, -ret,
- "Cannot set discarding of RAM broken");
- if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
- &container->fd)) {
- error_report("vfio: error disconnecting group %d from"
- " container", group->groupid);
- }
+ if (!vfio_attach_discard_disable(container, group, errp)) {
return false;
}
group->container = container;
@@ -620,9 +628,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
goto free_container_exit;
}
- ret = vfio_ram_block_discard_disable(container, true);
- if (ret) {
- error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
+ if (!vfio_attach_discard_disable(container, group, errp)) {
goto unregister_container_exit;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH V2 05/45] vfio/container: ram discard disable helper
2025-02-14 14:13 ` [PATCH V2 05/45] vfio/container: ram discard disable helper Steve Sistare
@ 2025-02-17 17:58 ` Cédric Le Goater
0 siblings, 0 replies; 72+ messages in thread
From: Cédric Le Goater @ 2025-02-17 17:58 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/14/25 15:13, Steve Sistare wrote:
> Define a helper to set ram discard disable, generate error messages,
> and clean up on failure. The second vfio_ram_block_discard_disable
> call site now performs VFIO_GROUP_UNSET_CONTAINER immediately on failure,
> instead of relying on the close of the container fd to do so in the kernel,
> but this is equivalent.
should be ok.
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/container.c | 48 +++++++++++++++++++++++++++---------------------
> 1 file changed, 27 insertions(+), 21 deletions(-)
>
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 7c57bdd2..0f17d53 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -535,16 +535,10 @@ static bool vfio_legacy_setup(VFIOContainerBase *bcontainer, Error **errp)
> return true;
> }
>
> -static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> - Error **errp)
> +static bool vfio_attach_discard_disable(VFIOContainer *container,
> + VFIOGroup *group, Error **errp)
> {
> - VFIOContainer *container;
> - VFIOContainerBase *bcontainer;
> - int ret, fd;
> - VFIOAddressSpace *space;
> - VFIOIOMMUClass *vioc;
> -
> - space = vfio_get_address_space(as);
> + int ret;
>
> /*
> * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -577,18 +571,32 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> * details once we know which type of IOMMU we are using.
> */
>
> + ret = vfio_ram_block_discard_disable(container, true);
> + if (ret) {
> + error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
> + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> + error_report("vfio: error disconnecting group %d from"
> + " container", group->groupid);
> + }
> + }
> + return !ret;
> +}
> +
> +static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> + Error **errp)
> +{
> + VFIOContainer *container;
> + VFIOContainerBase *bcontainer;
> + int ret, fd;
> + VFIOAddressSpace *space;
> + VFIOIOMMUClass *vioc;
> +
> + space = vfio_get_address_space(as);
> +
> QLIST_FOREACH(bcontainer, &space->containers, next) {
> container = container_of(bcontainer, VFIOContainer, bcontainer);
> if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> - ret = vfio_ram_block_discard_disable(container, true);
> - if (ret) {
> - error_setg_errno(errp, -ret,
> - "Cannot set discarding of RAM broken");
> - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> - &container->fd)) {
> - error_report("vfio: error disconnecting group %d from"
> - " container", group->groupid);
> - }
> + if (!vfio_attach_discard_disable(container, group, errp)) {
> return false;
> }
> group->container = container;
> @@ -620,9 +628,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> goto free_container_exit;
> }
>
> - ret = vfio_ram_block_discard_disable(container, true);
> - if (ret) {
> - error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
> + if (!vfio_attach_discard_disable(container, group, errp)) {
> goto unregister_container_exit;
> }
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH V2 06/45] vfio/container: reform vfio_connect_container cleanup
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (4 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 05/45] vfio/container: ram discard disable helper Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-17 18:01 ` Cédric Le Goater
2025-02-14 14:13 ` [PATCH V2 07/45] vfio/container: vfio_container_group_add Steve Sistare
` (39 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Replace the proliferation of exit labels in vfio_connect_container with
conditionals for cleaning each piece of state. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/container.c | 61 +++++++++++++++++++++++++++++------------------------
1 file changed, 33 insertions(+), 28 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 0f17d53..c668d07 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -587,9 +587,12 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
{
VFIOContainer *container;
VFIOContainerBase *bcontainer;
- int ret, fd;
+ int ret, fd = -1;
VFIOAddressSpace *space;
- VFIOIOMMUClass *vioc;
+ VFIOIOMMUClass *vioc = NULL;
+ bool new_container = false;
+ bool group_was_added = false;
+ bool discard_disabled = false;
space = vfio_get_address_space(as);
@@ -608,35 +611,37 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
if (fd < 0) {
- goto put_space_exit;
+ goto fail;
}
ret = ioctl(fd, VFIO_GET_API_VERSION);
if (ret != VFIO_API_VERSION) {
error_setg(errp, "supported vfio version: %d, "
"reported version: %d", VFIO_API_VERSION, ret);
- goto close_fd_exit;
+ goto fail;
}
container = vfio_create_container(fd, group, errp);
if (!container) {
- goto close_fd_exit;
+ goto fail;
}
+ new_container = true;
bcontainer = &container->bcontainer;
if (!vfio_cpr_register_container(bcontainer, errp)) {
- goto free_container_exit;
+ goto fail;
}
if (!vfio_attach_discard_disable(container, group, errp)) {
- goto unregister_container_exit;
+ goto fail;
}
+ discard_disabled = true;
vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
assert(vioc->setup);
if (!vioc->setup(bcontainer, errp)) {
- goto enable_discards_exit;
+ goto fail;
}
vfio_kvm_device_add_group(group);
@@ -645,6 +650,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
group->container = container;
QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+ group_was_added = true;
bcontainer->listener = vfio_memory_listener;
memory_listener_register(&bcontainer->listener, bcontainer->space->as);
@@ -652,35 +658,34 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
if (bcontainer->error) {
error_propagate_prepend(errp, bcontainer->error,
"memory listener initialization failed: ");
- goto listener_release_exit;
+ goto fail;
}
bcontainer->initialized = true;
return true;
-listener_release_exit:
- QLIST_REMOVE(group, container_next);
- vfio_kvm_device_del_group(group);
+
+fail:
memory_listener_unregister(&bcontainer->listener);
- if (vioc->release) {
+
+ if (group_was_added) {
+ QLIST_REMOVE(group, container_next);
+ vfio_kvm_device_del_group(group);
+ }
+ if (vioc && vioc->release) {
vioc->release(bcontainer);
}
-
-enable_discards_exit:
- vfio_ram_block_discard_disable(container, false);
-
-unregister_container_exit:
- vfio_cpr_unregister_container(bcontainer);
-
-free_container_exit:
- object_unref(container);
-
-close_fd_exit:
- close(fd);
-
-put_space_exit:
+ if (discard_disabled) {
+ vfio_ram_block_discard_disable(container, false);
+ }
+ if (new_container) {
+ vfio_cpr_unregister_container(bcontainer);
+ object_unref(container);
+ }
+ if (fd >= 0) {
+ close(fd);
+ }
vfio_put_address_space(space);
-
return false;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
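The cleanup style that patch 06 moves to, one fail label plus a conditional per piece of state, can be sketched in isolation as below. This is an assumption-laden illustration rather than the QEMU function: Conn, res_alloc(), res_free(), conn_setup(), conn_teardown(), and the fail_step parameter are all hypothetical, and the two counters exist only so the result can be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_allocs;     /* outstanding allocations, to detect leaks */
static int registrations;   /* outstanding registrations */

static int *res_alloc(void)
{
    int *p = malloc(sizeof(*p));

    if (p) {
        live_allocs++;
    }
    return p;
}

static void res_free(int *p)
{
    if (p) {
        free(p);
        live_allocs--;
    }
}

/* Hypothetical object that is built in several steps. */
typedef struct Conn {
    int *a;
    int *b;
} Conn;

/*
 * fail_step forces a failure for the demo: 0 = none, 1 = first allocation,
 * 2 = second allocation, 3 = registration, 4 = final check.
 */
static bool conn_setup(Conn *c, int fail_step)
{
    int *a = NULL;              /* NULL means "not acquired yet" */
    int *b = NULL;
    bool registered = false;    /* set only after the step succeeds */

    a = (fail_step == 1) ? NULL : res_alloc();
    if (!a) {
        goto fail;
    }
    b = (fail_step == 2) ? NULL : res_alloc();
    if (!b) {
        goto fail;
    }
    if (fail_step == 3) {
        goto fail;
    }
    registrations++;
    registered = true;

    if (fail_step == 4) {
        goto fail;
    }
    c->a = a;
    c->b = b;
    return true;

fail:
    /* Undo in reverse order; each piece is released only if acquired. */
    if (registered) {
        registrations--;
    }
    res_free(b);
    res_free(a);
    return false;
}

/* Release a fully constructed Conn. */
static void conn_teardown(Conn *c)
{
    assert(c->a != NULL && c->b != NULL);
    registrations--;
    res_free(c->b);
    res_free(c->a);
    c->a = NULL;
    c->b = NULL;
}
```

The property this style depends on is that every tracking variable is initialized to its "not acquired" value before the first goto, which is why the patch also initializes fd to -1 and vioc to NULL.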
* Re: [PATCH V2 06/45] vfio/container: reform vfio_connect_container cleanup
2025-02-14 14:13 ` [PATCH V2 06/45] vfio/container: reform vfio_connect_container cleanup Steve Sistare
@ 2025-02-17 18:01 ` Cédric Le Goater
0 siblings, 0 replies; 72+ messages in thread
From: Cédric Le Goater @ 2025-02-17 18:01 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/14/25 15:13, Steve Sistare wrote:
> Replace the proliferation of exit labels in vfio_connect_container with
> conditionals for cleaning each piece of state. No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> hw/vfio/container.c | 61 +++++++++++++++++++++++++++++------------------------
> 1 file changed, 33 insertions(+), 28 deletions(-)
>
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 0f17d53..c668d07 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -587,9 +587,12 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> {
> VFIOContainer *container;
> VFIOContainerBase *bcontainer;
> - int ret, fd;
> + int ret, fd = -1;
> VFIOAddressSpace *space;
> - VFIOIOMMUClass *vioc;
> + VFIOIOMMUClass *vioc = NULL;
> + bool new_container = false;
> + bool group_was_added = false;
> + bool discard_disabled = false;
>
> space = vfio_get_address_space(as);
>
> @@ -608,35 +611,37 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>
> fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
> if (fd < 0) {
> - goto put_space_exit;
> + goto fail;
> }
>
> ret = ioctl(fd, VFIO_GET_API_VERSION);
> if (ret != VFIO_API_VERSION) {
> error_setg(errp, "supported vfio version: %d, "
> "reported version: %d", VFIO_API_VERSION, ret);
> - goto close_fd_exit;
> + goto fail;
> }
>
> container = vfio_create_container(fd, group, errp);
> if (!container) {
> - goto close_fd_exit;
> + goto fail;
> }
> + new_container = true;
> bcontainer = &container->bcontainer;
>
> if (!vfio_cpr_register_container(bcontainer, errp)) {
> - goto free_container_exit;
> + goto fail;
> }
>
> if (!vfio_attach_discard_disable(container, group, errp)) {
> - goto unregister_container_exit;
> + goto fail;
> }
> + discard_disabled = true;
>
> vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> assert(vioc->setup);
>
> if (!vioc->setup(bcontainer, errp)) {
> - goto enable_discards_exit;
> + goto fail;
> }
>
> vfio_kvm_device_add_group(group);
> @@ -645,6 +650,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>
> group->container = container;
> QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> + group_was_added = true;
>
> bcontainer->listener = vfio_memory_listener;
> memory_listener_register(&bcontainer->listener, bcontainer->space->as);
> @@ -652,35 +658,34 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> if (bcontainer->error) {
> error_propagate_prepend(errp, bcontainer->error,
> "memory listener initialization failed: ");
> - goto listener_release_exit;
> + goto fail;
> }
>
> bcontainer->initialized = true;
>
> return true;
> -listener_release_exit:
> - QLIST_REMOVE(group, container_next);
> - vfio_kvm_device_del_group(group);
> +
> +fail:
> memory_listener_unregister(&bcontainer->listener);
> - if (vioc->release) {
> +
> + if (group_was_added) {
> + QLIST_REMOVE(group, container_next);
> + vfio_kvm_device_del_group(group);
> + }
> + if (vioc && vioc->release) {
> vioc->release(bcontainer);
> }
> -
> -enable_discards_exit:
> - vfio_ram_block_discard_disable(container, false);
> -
> -unregister_container_exit:
> - vfio_cpr_unregister_container(bcontainer);
> -
> -free_container_exit:
> - object_unref(container);
> -
> -close_fd_exit:
> - close(fd);
> -
> -put_space_exit:
> + if (discard_disabled) {
> + vfio_ram_block_discard_disable(container, false);
> + }
> + if (new_container) {
> + vfio_cpr_unregister_container(bcontainer);
> + object_unref(container);
> + }
> + if (fd >= 0) {
> + close(fd);
> + }
> vfio_put_address_space(space);
> -
> return false;
> }
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH V2 07/45] vfio/container: vfio_container_group_add
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (5 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 06/45] vfio/container: reform vfio_connect_container cleanup Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-17 18:02 ` Cédric Le Goater
2025-02-14 14:13 ` [PATCH V2 08/45] vfio/container: register container for cpr Steve Sistare
` (38 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add vfio_container_group_add to de-dup some code. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/container.c | 47 +++++++++++++++++++++++++----------------------
1 file changed, 25 insertions(+), 22 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index c668d07..c5bbb03 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -582,6 +582,26 @@ static bool vfio_attach_discard_disable(VFIOContainer *container,
return !ret;
}
+static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
+ Error **errp)
+{
+ if (!vfio_attach_discard_disable(container, group, errp)) {
+ return false;
+ }
+ group->container = container;
+ QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+ vfio_kvm_device_add_group(group);
+ return true;
+}
+
+static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
+{
+ QLIST_REMOVE(group, container_next);
+ group->container = NULL;
+ vfio_kvm_device_del_group(group);
+ vfio_ram_block_discard_disable(container, false);
+}
+
static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
Error **errp)
{
@@ -592,20 +612,13 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
VFIOIOMMUClass *vioc = NULL;
bool new_container = false;
bool group_was_added = false;
- bool discard_disabled = false;
space = vfio_get_address_space(as);
QLIST_FOREACH(bcontainer, &space->containers, next) {
container = container_of(bcontainer, VFIOContainer, bcontainer);
if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
- if (!vfio_attach_discard_disable(container, group, errp)) {
- return false;
- }
- group->container = container;
- QLIST_INSERT_HEAD(&container->group_list, group, container_next);
- vfio_kvm_device_add_group(group);
- return true;
+ return vfio_container_group_add(container, group, errp);
}
}
@@ -632,11 +645,6 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
goto fail;
}
- if (!vfio_attach_discard_disable(container, group, errp)) {
- goto fail;
- }
- discard_disabled = true;
-
vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
assert(vioc->setup);
@@ -644,12 +652,11 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
goto fail;
}
- vfio_kvm_device_add_group(group);
-
vfio_address_space_insert(space, bcontainer);
- group->container = container;
- QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+ if (!vfio_container_group_add(container, group, errp)) {
+ goto fail;
+ }
group_was_added = true;
bcontainer->listener = vfio_memory_listener;
@@ -669,15 +676,11 @@ fail:
memory_listener_unregister(&bcontainer->listener);
if (group_was_added) {
- QLIST_REMOVE(group, container_next);
- vfio_kvm_device_del_group(group);
+ vfio_container_group_del(container, group);
}
if (vioc && vioc->release) {
vioc->release(bcontainer);
}
- if (discard_disabled) {
- vfio_ram_block_discard_disable(container, false);
- }
if (new_container) {
vfio_cpr_unregister_container(bcontainer);
object_unref(container);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
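The add/del pairing introduced in patch 07 can be illustrated with a small standalone sketch. It is not the QEMU code: Group, Container, container_group_add(), and container_group_del() are hypothetical, a counter stands in for vfio_ram_block_discard_disable(), and the disable_ok parameter simulates that call succeeding or failing. The point is that the del helper undoes exactly what the add helper did, in reverse order, so a caller needs only one flag to know whether to call it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct Container;

typedef struct Group {
    struct Container *container;   /* owner, NULL when detached */
    struct Group *next;
} Group;

typedef struct Container {
    Group *groups;                 /* head of the attached group list */
    int discard_disabled;          /* disable requests still in effect */
} Container;

/* Attach g to c. On failure nothing has changed, so nothing is undone. */
static bool container_group_add(Container *c, Group *g, bool disable_ok)
{
    if (!disable_ok) {
        return false;
    }
    c->discard_disabled++;
    g->container = c;
    g->next = c->groups;
    c->groups = g;
    return true;
}

/* Detach g from c, reversing each step of container_group_add(). */
static void container_group_del(Container *c, Group *g)
{
    Group **pp;

    assert(g->container == c);
    for (pp = &c->groups; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == g) {
            *pp = g->next;
            break;
        }
    }
    g->next = NULL;
    g->container = NULL;
    c->discard_disabled--;
}
```

With this pairing, the separate discard_disabled flag that patch 06 added to the caller becomes unnecessary, which matches what the diff above removes.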
* Re: [PATCH V2 07/45] vfio/container: vfio_container_group_add
2025-02-14 14:13 ` [PATCH V2 07/45] vfio/container: vfio_container_group_add Steve Sistare
@ 2025-02-17 18:02 ` Cédric Le Goater
0 siblings, 0 replies; 72+ messages in thread
From: Cédric Le Goater @ 2025-02-17 18:02 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/14/25 15:13, Steve Sistare wrote:
> Add vfio_container_group_add to de-dup some code. No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> hw/vfio/container.c | 47 +++++++++++++++++++++++++----------------------
> 1 file changed, 25 insertions(+), 22 deletions(-)
>
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index c668d07..c5bbb03 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -582,6 +582,26 @@ static bool vfio_attach_discard_disable(VFIOContainer *container,
> return !ret;
> }
>
> +static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
> + Error **errp)
> +{
> + if (!vfio_attach_discard_disable(container, group, errp)) {
> + return false;
> + }
> + group->container = container;
> + QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> + vfio_kvm_device_add_group(group);
> + return true;
> +}
> +
> +static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
> +{
> + QLIST_REMOVE(group, container_next);
> + group->container = NULL;
> + vfio_kvm_device_del_group(group);
> + vfio_ram_block_discard_disable(container, false);
> +}
> +
> static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> Error **errp)
> {
> @@ -592,20 +612,13 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> VFIOIOMMUClass *vioc = NULL;
> bool new_container = false;
> bool group_was_added = false;
> - bool discard_disabled = false;
>
> space = vfio_get_address_space(as);
>
> QLIST_FOREACH(bcontainer, &space->containers, next) {
> container = container_of(bcontainer, VFIOContainer, bcontainer);
> if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> - if (!vfio_attach_discard_disable(container, group, errp)) {
> - return false;
> - }
> - group->container = container;
> - QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> - vfio_kvm_device_add_group(group);
> - return true;
> + return vfio_container_group_add(container, group, errp);
> }
> }
>
> @@ -632,11 +645,6 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> goto fail;
> }
>
> - if (!vfio_attach_discard_disable(container, group, errp)) {
> - goto fail;
> - }
> - discard_disabled = true;
> -
> vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> assert(vioc->setup);
>
> @@ -644,12 +652,11 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> goto fail;
> }
>
> - vfio_kvm_device_add_group(group);
> -
> vfio_address_space_insert(space, bcontainer);
>
> - group->container = container;
> - QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> + if (!vfio_container_group_add(container, group, errp)) {
> + goto fail;
> + }
> group_was_added = true;
>
> bcontainer->listener = vfio_memory_listener;
> @@ -669,15 +676,11 @@ fail:
> memory_listener_unregister(&bcontainer->listener);
>
> if (group_was_added) {
> - QLIST_REMOVE(group, container_next);
> - vfio_kvm_device_del_group(group);
> + vfio_container_group_del(container, group);
> }
> if (vioc && vioc->release) {
> vioc->release(bcontainer);
> }
> - if (discard_disabled) {
> - vfio_ram_block_discard_disable(container, false);
> - }
> if (new_container) {
> vfio_cpr_unregister_container(bcontainer);
> object_unref(container);
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH V2 08/45] vfio/container: register container for cpr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (6 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 07/45] vfio/container: vfio_container_group_add Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 09/45] vfio/container: preserve descriptors Steve Sistare
` (37 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Register a legacy container for cpr-transfer, replacing the generic CPR
register call with a more specific legacy container register call. Add a
blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
This is mostly boilerplate. The fields to be saved and restored are added
in subsequent patches.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
MAINTAINERS | 1 +
hw/vfio/container.c | 6 ++--
hw/vfio/cpr-legacy.c | 69 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/cpr.c | 6 ++--
hw/vfio/meson.build | 3 +-
include/hw/vfio/vfio-common.h | 2 ++
include/hw/vfio/vfio-cpr.h | 25 ++++++++++++++++
7 files changed, 105 insertions(+), 7 deletions(-)
create mode 100644 hw/vfio/cpr-legacy.c
create mode 100644 include/hw/vfio/vfio-cpr.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 2f9a6da..aee1342 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2957,6 +2957,7 @@ M: Fabiano Rosas <farosas@suse.de>
R: Steve Sistare <steven.sistare@oracle.com>
S: Supported
F: hw/vfio/cpr*
+F: include/hw/vfio/vfio-cpr.h
F: include/migration/cpr.h
F: migration/cpr*
F: tests/qtest/migration/cpr*
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index c5bbb03..eca3362 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -641,7 +641,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
new_container = true;
bcontainer = &container->bcontainer;
- if (!vfio_cpr_register_container(bcontainer, errp)) {
+ if (!vfio_legacy_cpr_register_container(container, errp)) {
goto fail;
}
@@ -682,7 +682,7 @@ fail:
vioc->release(bcontainer);
}
if (new_container) {
- vfio_cpr_unregister_container(bcontainer);
+ vfio_legacy_cpr_unregister_container(container);
object_unref(container);
}
if (fd >= 0) {
@@ -722,7 +722,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
VFIOAddressSpace *space = bcontainer->space;
trace_vfio_disconnect_container(container->fd);
- vfio_cpr_unregister_container(bcontainer);
+ vfio_legacy_cpr_unregister_container(container);
close(container->fd);
object_unref(container);
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
new file mode 100644
index 0000000..d0557af
--- /dev/null
+++ b/hw/vfio/cpr-legacy.c
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include <sys/ioctl.h>
+#include "qemu/osdep.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio-cpr.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+
+static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
+{
+ if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
+ error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
+ return false;
+
+ } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+ error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
+ return false;
+
+ } else {
+ return true;
+ }
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+ .name = "vfio-container",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ Error **cpr_blocker = &container->cpr.blocker;
+
+ migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+ vfio_cpr_reboot_notifier,
+ MIG_MODE_CPR_REBOOT);
+
+ if (!vfio_cpr_supported(container, cpr_blocker)) {
+ return migrate_add_blocker_modes(cpr_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
+
+ vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+
+ return true;
+}
+
+void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+ migrate_del_blocker(&container->cpr.blocker);
+ vmstate_unregister(NULL, &vfio_container_vmstate, container);
+}
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 3d1c8d2..6790f8a 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -7,12 +7,12 @@
#include "qemu/osdep.h"
#include "hw/vfio/vfio-common.h"
-#include "migration/misc.h"
+#include "hw/vfio/vfio-cpr.h"
#include "qapi/error.h"
#include "system/runstate.h"
-static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
- MigrationEvent *e, Error **errp)
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e, Error **errp)
{
if (e->type == MIG_EVENT_PRECOPY_SETUP &&
!runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index bba776f..5487815 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,13 +5,14 @@ vfio_ss.add(files(
'container-base.c',
'container.c',
'migration.c',
- 'cpr.c',
))
vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
'iommufd.c',
))
vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+ 'cpr.c',
+ 'cpr-legacy.c',
'display.c',
'pci-quirks.c',
'pci.c',
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d601eea..c482364 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -31,6 +31,7 @@
#endif
#include "system/system.h"
#include "hw/vfio/vfio-container-base.h"
+#include "hw/vfio/vfio-cpr.h"
#include "system/host_iommu_device.h"
#include "system/iommufd.h"
@@ -85,6 +86,7 @@ typedef struct VFIOContainer {
int fd; /* /dev/vfio/vfio, empowered by the attached groups */
unsigned iommu_type;
QLIST_HEAD(, VFIOGroup) group_list;
+ VFIOContainerCPR cpr;
} VFIOContainer;
OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
new file mode 100644
index 0000000..d4f8346
--- /dev/null
+++ b/include/hw/vfio/vfio-cpr.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef HW_VFIO_VFIO_CPR_H
+#define HW_VFIO_VFIO_CPR_H
+
+#include "migration/misc.h"
+
+typedef struct VFIOContainerCPR {
+ Error *blocker;
+} VFIOContainerCPR;
+
+struct VFIOContainer;
+
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
+ Error **errp);
+
+bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
+ Error **errp);
+void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
+#endif
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 09/45] vfio/container: preserve descriptors
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (7 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 08/45] vfio/container: register container for cpr Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 10/45] vfio/container: export vfio_legacy_dma_map Steve Sistare
` (36 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
At vfio creation time, save the values of the vfio container, group, and
device descriptors in CPR state. On qemu restart, vfio_realize() finds and
uses the saved descriptors, and remembers the reused status for subsequent
patches. The reused status is cleared when vmstate load finishes.
During reuse, device and iommu state is already configured, so operations
in vfio_realize that would modify the configuration, such as vfio ioctls,
are skipped. The result is that vfio_realize constructs qemu data
structures that reflect the current state of the device.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
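Illustration only, not part of the patch: a minimal sketch of the
save/find/reuse pattern the commit message describes. Every sketch_* name
is hypothetical and is not QEMU's cpr API. The table lives in one process
only, whereas real CPR state is handed to the new QEMU. The sketch only
shows the lookup order: reuse a descriptor saved under (name, id) if one
exists, otherwise open a fresh one and save it.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for CPR state: a small (name, id) -> fd table. */
#define SKETCH_MAX_FDS 16

struct sketch_fd {
    char name[64];      /* longer names are truncated in this sketch */
    int id;
    int fd;
    int used;
};

static struct sketch_fd sketch_fds[SKETCH_MAX_FDS];

static struct sketch_fd *sketch_lookup(const char *name, int id)
{
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd *e = &sketch_fds[i];

        if (e->used && e->id == id && !strcmp(e->name, name)) {
            return e;
        }
    }
    return NULL;
}

/* Return the fd saved under (name, id), or -1 if none was saved. */
static int sketch_find_fd(const char *name, int id)
{
    struct sketch_fd *e = sketch_lookup(name, id);

    return e ? e->fd : -1;
}

/* Forget (name, id). A no-op if it was never saved. */
static void sketch_delete_fd(const char *name, int id)
{
    struct sketch_fd *e = sketch_lookup(name, id);

    if (e) {
        e->used = 0;
    }
}

/* Save fd under (name, id), replacing any old value. -1 if table is full. */
static int sketch_resave_fd(const char *name, int id, int fd)
{
    struct sketch_fd *e = sketch_lookup(name, id);

    for (int i = 0; !e && i < SKETCH_MAX_FDS; i++) {
        if (!sketch_fds[i].used) {
            e = &sketch_fds[i];
        }
    }
    if (!e) {
        return -1;
    }
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->id = id;
    e->fd = fd;
    e->used = 1;
    return 0;
}

/*
 * Reuse a saved descriptor if there is one, else open path and save the
 * result. fd 0 is a valid descriptor, so "found" means fd >= 0.
 */
static int sketch_open_fd(const char *path, int flags, const char *name,
                          int id, int *reused)
{
    int fd = sketch_find_fd(name, id);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = open(path, flags);
        if (fd >= 0 && sketch_resave_fd(name, id, fd) < 0) {
            close(fd);
            return -1;
        }
    }
    return fd;
}
```

A caller would test *reused after sketch_open_fd() and skip any
configuration step that the kernel already holds for that descriptor,
which is the role cpr_reused plays in the diff below.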
hw/vfio/container.c | 57 +++++++++++++++++++++++++++++++++++++------
hw/vfio/cpr-legacy.c | 45 ++++++++++++++++++++++++++++++++++
include/hw/vfio/vfio-common.h | 1 +
include/hw/vfio/vfio-cpr.h | 9 +++++++
4 files changed, 104 insertions(+), 8 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index eca3362..21f2706 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -31,6 +31,8 @@
#include "system/reset.h"
#include "trace.h"
#include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/blocker.h"
#include "pci.h"
VFIOGroupList vfio_group_list =
@@ -413,7 +415,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
}
static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
- Error **errp)
+ bool cpr_reused, Error **errp)
{
int iommu_type;
const char *vioc_name;
@@ -424,7 +426,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
return NULL;
}
- if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
+ /*
+ * If container is reused, just set its type and skip the ioctls, as the
+ * container and group are already configured in the kernel.
+ */
+ if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
return NULL;
}
@@ -432,6 +438,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
container->fd = fd;
+ container->cpr.reused = cpr_reused;
container->iommu_type = iommu_type;
return container;
}
@@ -591,6 +598,7 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
group->container = container;
QLIST_INSERT_HEAD(&container->group_list, group, container_next);
vfio_kvm_device_add_group(group);
+ cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
return true;
}
@@ -600,6 +608,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
group->container = NULL;
vfio_kvm_device_del_group(group);
vfio_ram_block_discard_disable(container, false);
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
}
static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
@@ -612,17 +621,37 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
VFIOIOMMUClass *vioc = NULL;
bool new_container = false;
bool group_was_added = false;
+ bool cpr_reused;
space = vfio_get_address_space(as);
+ fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+ cpr_reused = (fd >= 0);
+
+ /*
+ * If the container is reused, then the group is already attached in the
+ * kernel. If a container with matching fd is found, then update the
+ * userland group list and return. If not, then after the loop, create
+ * the container struct and group list.
+ */
QLIST_FOREACH(bcontainer, &space->containers, next) {
container = container_of(bcontainer, VFIOContainer, bcontainer);
- if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
- return vfio_container_group_add(container, group, errp);
+
+ if (cpr_reused) {
+ if (!vfio_cpr_container_match(container, group, &fd)) {
+ continue;
+ }
+ } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+ continue;
}
+
+ return vfio_container_group_add(container, group, errp);
+ }
+
+ if (!cpr_reused) {
+ fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
}
- fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
if (fd < 0) {
goto fail;
}
@@ -634,7 +663,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
goto fail;
}
- container = vfio_create_container(fd, group, errp);
+ container = vfio_create_container(fd, group, cpr_reused, errp);
if (!container) {
goto fail;
}
@@ -700,6 +729,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
QLIST_REMOVE(group, container_next);
group->container = NULL;
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
/*
* Explicitly release the listener first before unset container,
@@ -753,7 +783,7 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
group = g_malloc0(sizeof(*group));
snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
- group->fd = qemu_open(path, O_RDWR, errp);
+ group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
if (group->fd < 0) {
goto free_group_exit;
}
@@ -785,6 +815,7 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
return group;
close_fd_exit:
+ cpr_delete_fd("vfio_group", groupid);
close(group->fd);
free_group_exit:
@@ -806,6 +837,7 @@ static void vfio_put_group(VFIOGroup *group)
vfio_disconnect_container(group);
QLIST_REMOVE(group, next);
trace_vfio_put_group(group->fd);
+ cpr_delete_fd("vfio_group", group->groupid);
close(group->fd);
g_free(group);
}
@@ -815,8 +847,14 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
{
g_autofree struct vfio_device_info *info = NULL;
int fd;
+ bool cpr_reused;
+
+ fd = cpr_find_fd(name, 0);
+ cpr_reused = (fd >= 0);
+ if (!cpr_reused) {
+ fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+ }
- fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
if (fd < 0) {
error_setg_errno(errp, errno, "error getting device from group %d",
group->groupid);
@@ -861,6 +899,8 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
vbasedev->num_irqs = info->num_irqs;
vbasedev->num_regions = info->num_regions;
vbasedev->flags = info->flags;
+ vbasedev->cpr.reused = cpr_reused;
+ cpr_resave_fd(name, 0, fd);
trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
@@ -877,6 +917,7 @@ static void vfio_put_base_device(VFIODevice *vbasedev)
QLIST_REMOVE(vbasedev, next);
vbasedev->group = NULL;
trace_vfio_put_base_device(vbasedev->fd);
+ cpr_delete_fd(vbasedev->name, 0);
close(vbasedev->fd);
}
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index d0557af..cee0f4e 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -30,10 +30,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
}
}
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+ VFIOContainer *container = opaque;
+ VFIOGroup *group;
+ VFIODevice *vbasedev;
+
+ container->cpr.reused = false;
+
+ QLIST_FOREACH(group, &container->group_list, container_next) {
+ QLIST_FOREACH(vbasedev, &group->device_list, next) {
+ vbasedev->cpr.reused = false;
+ }
+ }
+ return 0;
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-container",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
@@ -67,3 +84,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
migrate_del_blocker(&container->cpr.blocker);
vmstate_unregister(NULL, &vfio_container_vmstate, container);
}
+
+static bool same_device(int fd1, int fd2)
+{
+ struct stat st1, st2;
+
+ return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
+}
+
+bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
+ int *pfd)
+{
+ if (container->fd == *pfd) {
+ return true;
+ }
+ if (!same_device(container->fd, *pfd)) {
+ return false;
+ }
+ /*
+ * Same device, different fd. This occurs when the container fd is
+ * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
+ * produces duplicates. De-dup it.
+ */
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
+ close(*pfd);
+ cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
+ *pfd = container->fd;
+ return true;
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c482364..780646e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -152,6 +152,7 @@ typedef struct VFIODevice {
IOMMUFDBackend *iommufd;
VFIOIOASHwpt *hwpt;
QLIST_ENTRY(VFIODevice) hwpt_next;
+ VFIODeviceCPR cpr;
} VFIODevice;
struct VFIODeviceOps {
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index d4f8346..1a3eee9 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -12,9 +12,15 @@
typedef struct VFIOContainerCPR {
Error *blocker;
+ bool reused;
} VFIOContainerCPR;
+typedef struct VFIODeviceCPR {
+ bool reused;
+} VFIODeviceCPR;
+
struct VFIOContainer;
+struct VFIOGroup;
int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
Error **errp);
@@ -22,4 +28,7 @@ int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
Error **errp);
void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
+
+bool vfio_cpr_container_match(struct VFIOContainer *container,
+ struct VFIOGroup *group, int *fd);
#endif
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 10/45] vfio/container: export vfio_legacy_dma_map
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (8 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 09/45] vfio/container: preserve descriptors Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 11/45] vfio/container: discard old DMA vaddr Steve Sistare
` (35 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export vfio_legacy_dma_map so it may be referenced outside the file
in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/container.c | 4 ++--
include/hw/vfio/vfio-common.h | 3 +++
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 21f2706..931c435 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -176,8 +176,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
return 0;
}
-static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
- ram_addr_t size, void *vaddr, bool readonly)
+int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
+ ram_addr_t size, void *vaddr, bool readonly)
{
const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
bcontainer);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 780646e..7f33476 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -280,6 +280,9 @@ void vfio_reset_bytes_transferred(void);
bool vfio_device_state_is_running(VFIODevice *vbasedev);
bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
+int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
+ ram_addr_t size, void *vaddr, bool readonly);
+
#ifdef CONFIG_LINUX
int vfio_get_region_info(VFIODevice *vbasedev, int index,
struct vfio_region_info **info);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 11/45] vfio/container: discard old DMA vaddr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (9 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 10/45] vfio/container: export vfio_legacy_dma_map Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 12/45] vfio/container: restore " Steve Sistare
` (34 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
In the container pre_save handler, discard the virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will be
remapped at a different VA in new QEMU. DMA to already-mapped
pages continues.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
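Illustration only, not part of the patch: a toy model of what discarding
the vaddr means. The struct and function names are hypothetical and do
not describe the kernel's real data structures. The point is that the
iova range of each mapping survives and only the userland virtual
address is marked invalid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one DMA mapping; not the kernel's representation. */
struct toy_mapping {
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
    bool vaddr_valid;   /* false once the vaddr has been discarded */
};

/*
 * Model of an unmap request with the VADDR and ALL flags: every mapping
 * keeps its iova and size, and every vaddr is dropped.
 */
static void toy_unmap_vaddr_all(struct toy_mapping *maps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        maps[i].vaddr = 0;
        maps[i].vaddr_valid = false;
    }
}

/* Count the mappings that still have a usable vaddr. */
static size_t toy_count_valid(const struct toy_mapping *maps, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (maps[i].vaddr_valid) {
            count++;
        }
    }
    return count;
}
```

After toy_unmap_vaddr_all() the table still lists every iova range, which
mirrors the statement above that DMA to already-mapped pages continues.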
hw/vfio/cpr-legacy.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index cee0f4e..97a64e7 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -15,6 +15,22 @@
#include "migration/vmstate.h"
#include "qapi/error.h"
+static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+ struct vfio_iommu_type1_dma_unmap unmap = {
+ .argsz = sizeof(unmap),
+ .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+ .iova = 0,
+ .size = 0,
+ };
+ if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+ error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+ return false;
+ }
+ return true;
+}
+
+
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -30,6 +46,18 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
}
}
+static int vfio_container_pre_save(void *opaque)
+{
+ VFIOContainer *container = opaque;
+ Error *err = NULL;
+
+ if (!vfio_dma_unmap_vaddr_all(container, &err)) {
+ error_report_err(err);
+ return -1;
+ }
+ return 0;
+}
+
static int vfio_container_post_load(void *opaque, int version_id)
{
VFIOContainer *container = opaque;
@@ -50,6 +78,7 @@ static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-container",
.version_id = 0,
.minimum_version_id = 0,
+ .pre_save = vfio_container_pre_save,
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 12/45] vfio/container: restore DMA vaddr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (10 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 11/45] vfio/container: discard old DMA vaddr Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 13/45] vfio/container: mdev cpr blocker Steve Sistare
` (33 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
In new QEMU, do not register the memory listener at device creation time.
Register it later, in the container post_load handler, after all vmstate
that may affect regions and mapping boundaries has been loaded. The
post_load registration will cause the listener to invoke its callback on
each flat section, and the calls will match the mappings remembered by the
kernel.
The listener calls a special dma_map handler that passes the new VA of each
section to the kernel using VFIO_DMA_MAP_FLAG_VADDR. Restore the normal
handler at the end.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
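Illustration only, not part of the patch: a sketch of the "divert
dma_map, replay the sections, restore the handler" sequence described
above. All toy_* names are hypothetical. For simplicity the handler
pointer is kept per container here, whereas the patch swaps it in the
VFIOIOMMUClass.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RANGES 8

struct toy_container;

typedef int (*toy_dma_map_fn)(struct toy_container *c, uint64_t iova,
                              uint64_t size, uint64_t vaddr);

struct toy_range {
    uint64_t iova, size, vaddr;
    int vaddr_valid;
};

struct toy_container {
    struct toy_range ranges[TOY_MAX_RANGES];
    size_t nranges;
    toy_dma_map_fn dma_map;     /* current handler, normal or diverted */
};

/* Normal handler: record a brand new mapping. */
static int toy_normal_map(struct toy_container *c, uint64_t iova,
                          uint64_t size, uint64_t vaddr)
{
    if (c->nranges == TOY_MAX_RANGES) {
        return -1;
    }
    c->ranges[c->nranges++] = (struct toy_range){ iova, size, vaddr, 1 };
    return 0;
}

/*
 * Diverted handler: supply a new vaddr for a mapping that already
 * exists. The iova and size must match an existing range exactly.
 */
static int toy_cpr_map(struct toy_container *c, uint64_t iova,
                       uint64_t size, uint64_t vaddr)
{
    for (size_t i = 0; i < c->nranges; i++) {
        struct toy_range *r = &c->ranges[i];

        if (r->iova == iova && r->size == size) {
            r->vaddr = vaddr;
            r->vaddr_valid = 1;
            return 0;
        }
    }
    return -1;
}

/*
 * Model of post_load: replay each section through the current handler,
 * stopping at the first error, then restore the normal handler.
 */
static int toy_post_load(struct toy_container *c,
                         const struct toy_range *sections, size_t n)
{
    int ret = 0;

    for (size_t i = 0; i < n && ret == 0; i++) {
        ret = c->dma_map(c, sections[i].iova, sections[i].size,
                         sections[i].vaddr);
    }
    c->dma_map = toy_normal_map;
    return ret;
}
```

The replay only succeeds when the sections seen by the new process have
the same boundaries as the ranges recorded earlier, which is why the
listener registration is deferred until the state that shapes those
boundaries has been loaded.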
hw/vfio/container.c | 15 +++++++++++++--
hw/vfio/cpr-legacy.c | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 931c435..d26f78e 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -133,6 +133,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
int ret;
Error *local_err = NULL;
+ assert(!container->cpr.reused);
+
if (iotlb && vfio_devices_all_dirty_tracking_started(bcontainer)) {
if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
bcontainer->dirty_pages_supported) {
@@ -688,8 +690,17 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
}
group_was_added = true;
- bcontainer->listener = vfio_memory_listener;
- memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+ /*
+ * If reused, register the listener later, after all state that may
+ * affect regions and mapping boundaries has been cpr load'ed. Later,
+ * the listener will invoke its callback on each flat section and call
+ * dma_map to supply the new vaddr, and the calls will match the mappings
+ * remembered by the kernel.
+ */
+ if (!cpr_reused) {
+ bcontainer->listener = vfio_memory_listener;
+ memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+ }
if (bcontainer->error) {
error_propagate_prepend(errp, bcontainer->error,
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 97a64e7..bb5f802 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -14,6 +14,7 @@
#include "migration/migration.h"
#include "migration/vmstate.h"
#include "qapi/error.h"
+#include "qemu/error-report.h"
static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
{
@@ -30,6 +31,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
return true;
}
+/*
+ * Set the new @vaddr for any mappings registered during cpr load.
+ * Reused is cleared thereafter.
+ */
+static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size, void *vaddr,
+ bool readonly)
+{
+ const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
+ bcontainer);
+ struct vfio_iommu_type1_dma_map map = {
+ .argsz = sizeof(map),
+ .flags = VFIO_DMA_MAP_FLAG_VADDR,
+ .vaddr = (__u64)(uintptr_t)vaddr,
+ .iova = iova,
+ .size = size,
+ };
+
+ assert(container->cpr.reused);
+
+ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+ error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
+ iova, size, vaddr, strerror(errno));
+ return -errno;
+ }
+
+ return 0;
+}
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
@@ -61,12 +90,20 @@ static int vfio_container_pre_save(void *opaque)
static int vfio_container_post_load(void *opaque, int version_id)
{
VFIOContainer *container = opaque;
+ VFIOContainerBase *bcontainer = &container->bcontainer;
VFIOGroup *group;
VFIODevice *vbasedev;
+ bcontainer->listener = vfio_memory_listener;
+ memory_listener_register(&bcontainer->listener, bcontainer->space->as);
container->cpr.reused = false;
QLIST_FOREACH(group, &container->group_list, container_next) {
+ VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+
+ /* Restore original dma_map function */
+ vioc->dma_map = vfio_legacy_dma_map;
+
QLIST_FOREACH(vbasedev, &group->device_list, next) {
vbasedev->cpr.reused = false;
}
@@ -78,6 +115,7 @@ static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-container",
.version_id = 0,
.minimum_version_id = 0,
+ .priority = MIG_PRI_LOW, /* Must happen after devices and groups */
.pre_save = vfio_container_pre_save,
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
@@ -102,6 +140,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+ /* During incoming CPR, divert calls to dma_map. */
+ if (container->cpr.reused) {
+ VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+ vioc->dma_map = vfio_legacy_cpr_dma_map;
+ }
return true;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 13/45] vfio/container: mdev cpr blocker
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (11 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 12/45] vfio/container: restore " Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 14/45] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
` (32 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
During CPR, after VFIO_DMA_UNMAP_FLAG_VADDR, the vaddr is temporarily
invalid, so mediated devices cannot be supported. Add a blocker for them.
This restriction will not apply to iommufd containers when CPR is added
for them in a future patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/container.c | 8 ++++++++
include/hw/vfio/vfio-cpr.h | 1 +
2 files changed, 9 insertions(+)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index d26f78e..8130d1f 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -999,6 +999,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
return false;
}
+ if (vbasedev->mdev) {
+ error_setg(&vbasedev->cpr.mdev_blocker,
+ "CPR does not support vfio mdev %s", vbasedev->name);
+ migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, &error_fatal,
+ MIG_MODE_CPR_TRANSFER, -1);
+ }
+
bcontainer = &group->container->bcontainer;
vbasedev->bcontainer = bcontainer;
QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
@@ -1015,6 +1022,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
QLIST_REMOVE(vbasedev, container_next);
vbasedev->bcontainer = NULL;
trace_vfio_detach_device(vbasedev->name, group->groupid);
+ migrate_del_blocker(&vbasedev->cpr.mdev_blocker);
vfio_put_base_device(vbasedev);
vfio_put_group(group);
}
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 1a3eee9..25ac944 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -17,6 +17,7 @@ typedef struct VFIOContainerCPR {
typedef struct VFIODeviceCPR {
bool reused;
+ Error *mdev_blocker;
} VFIODeviceCPR;
struct VFIOContainer;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 14/45] vfio/container: recover from unmap-all-vaddr failure
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (12 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 13/45] vfio/container: mdev cpr blocker Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 15/45] pci: export msix_is_pending Steve Sistare
` (31 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
If there are multiple containers and unmap-all fails for some container, we
need to remap vaddr for the other containers for which unmap-all succeeded.
Recover by walking all address ranges of all containers to restore the vaddr
for each. Do so by invoking the vfio listener callback, and passing a new
"remap" flag that tells it to restore a mapping without allocating new
userland data structures.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
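Illustration only, not part of the patch: a toy model of the recovery
rule. The names are hypothetical, and inject_failure is a test hook that
stands in for an ioctl error. In the patch each container has its own
pre_save handler and notifier. The sketch collapses them into loops over
an array to show which containers need their vaddr restored.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_container {
    bool vaddr_valid;       /* mappings currently have a usable vaddr */
    bool vaddr_unmapped;    /* set only when unmap-all succeeded */
    bool inject_failure;    /* test hook: make unmap-all fail */
};

/* Model of unmap-all-vaddr for one container. */
static bool toy_unmap_vaddr_all(struct toy_container *c)
{
    if (c->inject_failure) {
        return false;
    }
    c->vaddr_valid = false;
    c->vaddr_unmapped = true;
    return true;
}

/* Model of the save phase: stop at the first container that fails. */
static bool toy_pre_save_all(struct toy_container *cs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!toy_unmap_vaddr_all(&cs[i])) {
            return false;
        }
    }
    return true;
}

/*
 * Model of the failure notifier: restore vaddr only for containers whose
 * unmap-all succeeded. Returns the number of containers remapped.
 */
static size_t toy_fail_notifier(struct toy_container *cs, size_t n)
{
    size_t remapped = 0;

    for (size_t i = 0; i < n; i++) {
        if (cs[i].vaddr_unmapped) {
            cs[i].vaddr_valid = true;
            cs[i].vaddr_unmapped = false;
            remapped++;
        }
    }
    return remapped;
}
```

The per-container flag is what lets recovery skip containers that were
never touched, so a container whose vaddr is still valid is not remapped
a second time.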
hw/vfio/common.c | 19 ++++++++-
hw/vfio/cpr-legacy.c | 91 +++++++++++++++++++++++++++++++++++++++++++
include/hw/vfio/vfio-common.h | 3 ++
include/hw/vfio/vfio-cpr.h | 11 ++++++
4 files changed, 123 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e3e1da0..48663ad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -592,6 +592,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
{
VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
listener);
+ vfio_container_region_add(bcontainer, section, false);
+}
+
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section,
+ bool cpr_remap)
+{
hwaddr iova, end;
Int128 llend, llsize;
void *vaddr;
@@ -627,6 +634,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
int iommu_idx;
trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
+
+ if (cpr_remap) {
+ vfio_cpr_giommu_remap(bcontainer, section);
+ }
+
/*
* FIXME: For VFIO iommu types which have KVM acceleration to
* avoid bouncing all map/unmaps through qemu this way, this
@@ -669,7 +681,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
* about changes.
*/
if (memory_region_has_ram_discard_manager(section->mr)) {
- vfio_register_ram_discard_listener(bcontainer, section);
+ if (!cpr_remap) {
+ vfio_register_ram_discard_listener(bcontainer, section);
+ } else if (!vfio_cpr_register_ram_discard_listener(bcontainer,
+ section)) {
+ goto fail;
+ }
return;
}
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index bb5f802..90f9f14 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -28,6 +28,7 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
return false;
}
+ container->cpr.vaddr_unmapped = true;
return true;
}
@@ -60,6 +61,14 @@ static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
return 0;
}
+static void vfio_region_remap(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+ VFIOContainer *container = container_of(listener, VFIOContainer,
+ cpr.remap_listener);
+ vfio_container_region_add(&container->bcontainer, section, true);
+}
+
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -124,6 +133,40 @@ static const VMStateDescription vfio_container_vmstate = {
}
};
+static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e, Error **errp)
+{
+ VFIOContainer *container =
+ container_of(notifier, VFIOContainer, cpr.transfer_notifier);
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ if (e->type != MIG_EVENT_PRECOPY_FAILED) {
+ return 0;
+ }
+
+ if (container->cpr.vaddr_unmapped) {
+ /*
+ * Force a call to vfio_region_remap for each mapped section by
+ * temporarily registering a listener, and temporarily diverting
+ * dma_map to vfio_legacy_cpr_dma_map. The latter restores vaddr.
+ */
+
+ VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+ vioc->dma_map = vfio_legacy_cpr_dma_map;
+
+ container->cpr.remap_listener = (MemoryListener) {
+ .name = "vfio cpr recover",
+ .region_add = vfio_region_remap
+ };
+ memory_listener_register(&container->cpr.remap_listener,
+ bcontainer->space->as);
+ memory_listener_unregister(&container->cpr.remap_listener);
+ container->cpr.vaddr_unmapped = false;
+ vioc->dma_map = vfio_legacy_dma_map;
+ }
+ return 0;
+}
+
bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
{
VFIOContainerBase *bcontainer = &container->bcontainer;
@@ -145,6 +188,10 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
vioc->dma_map = vfio_legacy_cpr_dma_map;
}
+
+ migration_add_notifier_mode(&container->cpr.transfer_notifier,
+ vfio_cpr_fail_notifier,
+ MIG_MODE_CPR_TRANSFER);
return true;
}
@@ -155,6 +202,50 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
migrate_del_blocker(&container->cpr.blocker);
vmstate_unregister(NULL, &vfio_container_vmstate, container);
+ migration_remove_notifier(&container->cpr.transfer_notifier);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr. Call this
+ * to restore vaddr for a section with a giommu.
+ *
+ * The giommu already exists. Find it and replay it, which calls
+ * vfio_legacy_cpr_dma_map further down the stack.
+ */
+void vfio_cpr_giommu_remap(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section)
+{
+ VFIOGuestIOMMU *giommu = NULL;
+ hwaddr as_offset = section->offset_within_address_space;
+ hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+ QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
+ if (giommu->iommu_mr == IOMMU_MEMORY_REGION(section->mr) &&
+ giommu->iommu_offset == iommu_offset) {
+ break;
+ }
+ }
+ g_assert(giommu);
+ memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr. Call this
+ * to restore vaddr for a section with a RamDiscardManager.
+ *
+ * The ram discard listener already exists. Call its populate function
+ * directly, which calls vfio_legacy_cpr_dma_map.
+ */
+bool vfio_cpr_register_ram_discard_listener(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section)
+{
+ VFIORamDiscardListener *vrdl =
+ vfio_find_ram_discard_listener(bcontainer, section);
+
+ g_assert(vrdl);
+ return vrdl->listener.notify_populate(&vrdl->listener, section) == 0;
}
static bool same_device(int fd1, int fd2)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 7f33476..1563f3a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -315,6 +315,9 @@ int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
VFIORamDiscardListener *vfio_find_ram_discard_listener(
VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section, bool cpr_remap);
+
/* Returns 0 on success, or a negative errno. */
bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 25ac944..40d4f06 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -8,11 +8,15 @@
#ifndef HW_VFIO_VFIO_CPR_H
#define HW_VFIO_VFIO_CPR_H
+#include "exec/memory.h"
#include "migration/misc.h"
typedef struct VFIOContainerCPR {
Error *blocker;
bool reused;
+ bool vaddr_unmapped;
+ NotifierWithReturn transfer_notifier;
+ MemoryListener remap_listener;
} VFIOContainerCPR;
typedef struct VFIODeviceCPR {
@@ -22,6 +26,7 @@ typedef struct VFIODeviceCPR {
struct VFIOContainer;
struct VFIOGroup;
+struct VFIOContainerBase;
int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
Error **errp);
@@ -32,4 +37,10 @@ void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
bool vfio_cpr_container_match(struct VFIOContainer *container,
struct VFIOGroup *group, int *fd);
+
+void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section);
+
+bool vfio_cpr_register_ram_discard_listener(
+ struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
#endif
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 15/45] pci: export msix_is_pending
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (13 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 14/45] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:45 ` Steven Sistare
2025-02-14 14:13 ` [PATCH V2 16/45] pci: skip reset during cpr Steve Sistare
` (30 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export msix_is_pending for use by cpr. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
hw/pci/msix.c | 2 +-
include/hw/pci/msix.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 57ec708..c7b40cd 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -71,7 +71,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
return dev->msix_pba + vector / 8;
}
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
{
return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 0e6f257..11ef945 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
bool msix_is_masked(PCIDevice *dev, unsigned vector);
void msix_set_pending(PCIDevice *dev, unsigned vector);
void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
void msix_vector_use(PCIDevice *dev, unsigned vector);
void msix_vector_unuse(PCIDevice *dev, unsigned vector);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
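Editorial sketch, not part of the patch: the newly exported msix_is_pending tests one bit of the MSI-X pending bit array, using the byte at vector / 8 as shown in msix_pending_byte above. The standalone toy below illustrates that lookup. ToyPBA, TOY_NR_VECTORS and the toy_* names are hypothetical, and the mask derivation (1 << (vector % 8)) is an assumption, since msix_pending_mask does not appear in this diff. The toy returns bool, whereas the real function returns the masked byte as an int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_VECTORS 16   /* hypothetical table size */

/* Toy stand-in for a device's pending bit array; not QEMU's PCIDevice. */
typedef struct ToyPBA {
    uint8_t bits[(TOY_NR_VECTORS + 7) / 8];
} ToyPBA;

/* Locate the byte holding the pending bit for a vector. */
static uint8_t *toy_pending_byte(ToyPBA *pba, unsigned vector)
{
    assert(vector < TOY_NR_VECTORS);
    return pba->bits + vector / 8;
}

/* Assumed bit layout: one bit per vector, low bit first within a byte. */
static uint8_t toy_pending_mask(unsigned vector)
{
    return (uint8_t)(1u << (vector % 8));
}

static void toy_set_pending(ToyPBA *pba, unsigned vector)
{
    *toy_pending_byte(pba, vector) |= toy_pending_mask(vector);
}

static void toy_clr_pending(ToyPBA *pba, unsigned vector)
{
    *toy_pending_byte(pba, vector) &= (uint8_t)~toy_pending_mask(vector);
}

/* Non-zero masked byte means the vector is pending. */
static bool toy_is_pending(ToyPBA *pba, unsigned vector)
{
    return (*toy_pending_byte(pba, vector) & toy_pending_mask(vector)) != 0;
}
```

With this layout, setting vector 9 sets bit 1 of byte 1 and leaves vector 8 untouched.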
* Re: [PATCH V2 15/45] pci: export msix_is_pending
2025-02-14 14:13 ` [PATCH V2 15/45] pci: export msix_is_pending Steve Sistare
@ 2025-02-14 14:45 ` Steven Sistare
2025-02-14 14:46 ` Steven Sistare
0 siblings, 1 reply; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 14:45 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
qemu-devel
Hi Michael,
You previously acked "pci: export msix_is_pending" -- thank you!
This patch also needs your attention.
There are no other changes in the core pci area.
- Steve
On 2/14/2025 9:13 AM, Steve Sistare wrote:
> Export msix_is_pending for use by cpr. No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> hw/pci/msix.c | 2 +-
> include/hw/pci/msix.h | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index 57ec708..c7b40cd 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -71,7 +71,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
> return dev->msix_pba + vector / 8;
> }
>
> -static int msix_is_pending(PCIDevice *dev, int vector)
> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
> {
> return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
> }
> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
> index 0e6f257..11ef945 100644
> --- a/include/hw/pci/msix.h
> +++ b/include/hw/pci/msix.h
> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
> bool msix_is_masked(PCIDevice *dev, unsigned vector);
> void msix_set_pending(PCIDevice *dev, unsigned vector);
> void msix_clr_pending(PCIDevice *dev, int vector);
> +int msix_is_pending(PCIDevice *dev, unsigned vector);
>
> void msix_vector_use(PCIDevice *dev, unsigned vector);
> void msix_vector_unuse(PCIDevice *dev, unsigned vector);
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH V2 15/45] pci: export msix_is_pending
2025-02-14 14:45 ` Steven Sistare
@ 2025-02-14 14:46 ` Steven Sistare
0 siblings, 0 replies; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 14:46 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
qemu-devel
On 2/14/2025 9:45 AM, Steven Sistare wrote:
> Hi Michael,
>
> You previously acked "pci: export msix_is_pending" -- thank you!
>
> This patch also needs your attention.
Off by one -- patch 16/45 "pci: skip reset during cpr" needs your attention.
- Steve
> There are no other changes in the core pci area.
>
> - Steve
>
> On 2/14/2025 9:13 AM, Steve Sistare wrote:
>> Export msix_is_pending for use by cpr. No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>> ---
>> hw/pci/msix.c | 2 +-
>> include/hw/pci/msix.h | 1 +
>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
>> index 57ec708..c7b40cd 100644
>> --- a/hw/pci/msix.c
>> +++ b/hw/pci/msix.c
>> @@ -71,7 +71,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
>> return dev->msix_pba + vector / 8;
>> }
>> -static int msix_is_pending(PCIDevice *dev, int vector)
>> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
>> {
>> return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
>> }
>> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
>> index 0e6f257..11ef945 100644
>> --- a/include/hw/pci/msix.h
>> +++ b/include/hw/pci/msix.h
>> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
>> bool msix_is_masked(PCIDevice *dev, unsigned vector);
>> void msix_set_pending(PCIDevice *dev, unsigned vector);
>> void msix_clr_pending(PCIDevice *dev, int vector);
>> +int msix_is_pending(PCIDevice *dev, unsigned vector);
>> void msix_vector_use(PCIDevice *dev, unsigned vector);
>> void msix_vector_unuse(PCIDevice *dev, unsigned vector);
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH V2 16/45] pci: skip reset during cpr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (14 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 15/45] pci: export msix_is_pending Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:13 ` [PATCH V2 17/45] vfio-pci: " Steve Sistare
` (29 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Do not reset PCI devices when resuming for CPR, else interrupts may be
lost for vfio-pci devices.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/pci/pci.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 2afa423..2fa8884 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -32,6 +32,8 @@
#include "hw/pci/pci_host.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "migration/cpr.h"
+#include "migration/misc.h"
#include "migration/qemu-file-types.h"
#include "migration/vmstate.h"
#include "net/net.h"
@@ -459,6 +461,17 @@ static void pci_reset_regions(PCIDevice *dev)
static void pci_do_device_reset(PCIDevice *dev)
{
+ /*
+ * A PCI device that is resuming for cpr is already configured, so do
+ * not reset it here when we are called from qemu_system_reset prior to
+ * cpr load, else interrupts may be lost for vfio-pci devices. It is
+ * safe to skip this reset for all PCI devices, because vmstate load will
+ * set all fields that would have been set here.
+ */
+ if (cpr_is_incoming()) {
+ return;
+ }
+
pci_device_deassert_intx(dev);
assert(dev->irq_state == 0);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 17/45] vfio-pci: skip reset during cpr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (15 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 16/45] pci: skip reset during cpr Steve Sistare
@ 2025-02-14 14:13 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 18/45] vfio/pci: vfio_vector_init Steve Sistare
` (28 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:13 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Do not reset a vfio-pci device during CPR, and do not complain if the
kernel's PCI config space changes for non-emulated bits between the
vmstate save and load, which can happen due to ongoing interrupt activity.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr.c | 31 +++++++++++++++++++++++++++++++
hw/vfio/pci.c | 6 ++++++
include/hw/vfio/vfio-cpr.h | 2 ++
3 files changed, 39 insertions(+)
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 6790f8a..8268c0c 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -8,6 +8,8 @@
#include "qemu/osdep.h"
#include "hw/vfio/vfio-common.h"
#include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/pci.h"
+#include "migration/cpr.h"
#include "qapi/error.h"
#include "system/runstate.h"
@@ -37,3 +39,32 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
{
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
}
+
+/*
+ * The kernel may change non-emulated config bits. Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_cpr_pci_pre_load(void *opaque)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int size = MIN(pci_config_size(pdev), vdev->config_size);
+ int i;
+
+ for (i = 0; i < size; i++) {
+ pdev->cmask[i] &= vdev->emulated_config_bits[i];
+ }
+
+ return 0;
+}
+
+const VMStateDescription vfio_cpr_pci_vmstate = {
+ .name = "vfio-cpr-pci",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .pre_load = vfio_cpr_pci_pre_load,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 89d900e..bd080ea 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3294,6 +3294,11 @@ static void vfio_pci_reset(DeviceState *dev)
{
VFIOPCIDevice *vdev = VFIO_PCI(dev);
+ /* Do not reset the device during qemu_system_reset prior to cpr load */
+ if (vdev->vbasedev.cpr.reused) {
+ return;
+ }
+
trace_vfio_pci_reset(vdev->vbasedev.name);
vfio_pci_pre_reset(vdev);
@@ -3427,6 +3432,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
#ifdef CONFIG_IOMMUFD
object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
#endif
+ dc->vmsd = &vfio_cpr_pci_vmstate;
dc->desc = "VFIO-based PCI device assignment";
set_bit(DEVICE_CATEGORY_MISC, dc->categories);
pdc->realize = vfio_realize;
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 40d4f06..f5480de 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -43,4 +43,6 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
bool vfio_cpr_register_ram_discard_listener(
struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+
+extern const VMStateDescription vfio_cpr_pci_vmstate;
#endif
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
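Editorial sketch, not part of the patch: vfio_cpr_pci_pre_load narrows pdev->cmask so that only emulated config bits take part in the changed-bits check at load time. The standalone toy below shows the effect of that masking on a simplified comparison. All toy_* names and the byte values are hypothetical, and the comparison is a deliberate simplification; the real check in get_pci_config_device may consult additional masks that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: keep only emulated bits in the check mask (mirrors the pre_load loop). */
static void toy_narrow_cmask(uint8_t *cmask, const uint8_t *emulated,
                             size_t size)
{
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated[i];
    }
}

/*
 * Step 2: simplified changed-bits check. Returns true when the saved and
 * current config agree on every bit selected by cmask.
 */
static bool toy_config_matches(const uint8_t *saved, const uint8_t *current,
                               const uint8_t *cmask, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ current[i]) & cmask[i]) {
            return false;
        }
    }
    return true;
}
```

A difference in a non-emulated bit fails the check before narrowing and passes after it, while a difference in an emulated bit is still caught.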
* [PATCH V2 18/45] vfio/pci: vfio_vector_init
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (16 preceding siblings ...)
2025-02-14 14:13 ` [PATCH V2 17/45] vfio-pci: " Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 19/45] vfio/pci: vfio_notifier_init Steve Sistare
` (27 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Extract a subroutine vfio_vector_init. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bd080ea..883257a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -511,6 +511,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
kvm_irqchip_commit_routes(kvm_state);
}
+static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+ VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+ PCIDevice *pdev = &vdev->pdev;
+
+ vector->vdev = vdev;
+ vector->virq = -1;
+ if (event_notifier_init(&vector->interrupt, 0)) {
+ error_report("vfio: Error: event_notifier_init failed");
+ }
+ vector->use = true;
+ if (vdev->interrupt == VFIO_INT_MSIX) {
+ msix_vector_use(pdev, nr);
+ }
+}
+
static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
MSIMessage *msg, IOHandler *handler)
{
@@ -524,13 +540,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vector = &vdev->msi_vectors[nr];
if (!vector->use) {
- vector->vdev = vdev;
- vector->virq = -1;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
- }
- vector->use = true;
- msix_vector_use(pdev, nr);
+ vfio_vector_init(vdev, nr);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 19/45] vfio/pci: vfio_notifier_init
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (17 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 18/45] vfio/pci: vfio_vector_init Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 20/45] vfio/pci: pass vector to virq functions Steve Sistare
` (26 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Move event_notifier_init calls to a helper vfio_notifier_init.
This version is trivial, but it will be expanded to support CPR
in subsequent patches. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 883257a..688b7d3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,6 +54,16 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+{
+ int ret = event_notifier_init(e, 0);
+
+ if (ret) {
+ error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+ }
+ return !ret;
+}
+
/*
* Disabling BAR mmaping can be slow, but toggling it around INTx can
* also be a huge overhead. We try to get the best of both worlds by
@@ -134,8 +144,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
pci_irq_deassert(&vdev->pdev);
/* Get an eventfd for resample/unmask */
- if (event_notifier_init(&vdev->intx.unmask, 0)) {
- error_setg(errp, "event_notifier_init failed eoi");
+ if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
goto fail;
}
@@ -266,7 +275,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
Error *err = NULL;
int32_t fd;
- int ret;
if (!pin) {
@@ -289,9 +297,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
}
#endif
- ret = event_notifier_init(&vdev->intx.interrupt, 0);
- if (ret) {
- error_setg_errno(errp, -ret, "event_notifier_init failed");
+ if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
return false;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -473,11 +479,13 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
{
+ const char *name = "kvm_interrupt";
+
if (vector->virq < 0) {
return;
}
- if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+ if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
goto fail_notifier;
}
@@ -515,11 +523,12 @@ static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
{
VFIOMSIVector *vector = &vdev->msi_vectors[nr];
PCIDevice *pdev = &vdev->pdev;
+ Error *err = NULL;
vector->vdev = vdev;
vector->virq = -1;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
+ if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+ error_report_err(err);
}
vector->use = true;
if (vdev->interrupt == VFIO_INT_MSIX) {
@@ -746,13 +755,14 @@ retry:
for (i = 0; i < vdev->nr_vectors; i++) {
VFIOMSIVector *vector = &vdev->msi_vectors[i];
+ Error *err = NULL;
vector->vdev = vdev;
vector->virq = -1;
vector->use = true;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
+ if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+ error_report_err(err);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -2864,8 +2874,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->err_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for error detection");
+ if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+ error_report_err(err);
vdev->pci_aer = false;
return;
}
@@ -2930,8 +2940,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->req_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for device request");
+ if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+ error_report_err(err);
return;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
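Editorial sketch, not part of the patch: vfio_notifier_init wraps an init call that returns 0 or a negative errno into a helper that returns bool and reports failure through an optional error argument, so every caller shares one error message format. The standalone toy below mirrors that shape without any QEMU dependencies. ToyError, toy_event_init and the simulate_failure flag are hypothetical stand-ins for Error, event_notifier_init and a real failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's Error: just a formatted message. */
typedef struct ToyError {
    char msg[128];
} ToyError;

/* Hypothetical init call: returns 0 on success, -errno on failure. */
static int toy_event_init(int *fd, bool simulate_failure)
{
    if (simulate_failure) {
        return -EMFILE;
    }
    *fd = 42;   /* placeholder descriptor value */
    return 0;
}

/*
 * Wrapper in the style of vfio_notifier_init: returns true on success.
 * On failure, formats a message naming the notifier, unless the caller
 * passed NULL to ignore the details.
 */
static bool toy_notifier_init(int *fd, const char *name,
                              bool simulate_failure, ToyError *err)
{
    int ret = toy_event_init(fd, simulate_failure);

    if (ret && err) {
        snprintf(err->msg, sizeof(err->msg), "toy_notifier_init %s failed: %s",
                 name, strerror(-ret));
    }
    return !ret;
}
```

Callers then test the boolean result directly, as the converted call sites in the diff do, and may pass NULL when they do not report the error themselves.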
* [PATCH V2 20/45] vfio/pci: pass vector to virq functions
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (18 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 19/45] vfio/pci: vfio_notifier_init Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 21/45] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
` (25 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Pass the vector number to vfio_connect_kvm_msi_virq and
vfio_remove_kvm_msi_virq, so it can be passed to their subroutines in
a subsequent patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 688b7d3..7b2d185 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -477,7 +477,7 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
vector_n, &vdev->pdev);
}
-static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
{
const char *name = "kvm_interrupt";
@@ -503,7 +503,8 @@ fail_notifier:
vector->virq = -1;
}
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int nr)
{
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
vector->virq);
@@ -561,7 +562,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
*/
if (vector->virq >= 0) {
if (!msg) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, nr);
} else {
vfio_update_kvm_msi_virq(vector, *msg, pdev);
}
@@ -573,7 +574,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
vfio_add_kvm_msi_virq(vdev, vector, nr, true);
kvm_irqchip_commit_route_changes(&vfio_route_change);
- vfio_connect_kvm_msi_virq(vector);
+ vfio_connect_kvm_msi_virq(vector, nr);
}
}
}
@@ -680,7 +681,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
kvm_irqchip_commit_route_changes(&vfio_route_change);
for (i = 0; i < vdev->nr_vectors; i++) {
- vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
+ vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
}
}
@@ -817,7 +818,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
VFIOMSIVector *vector = &vdev->msi_vectors[i];
if (vdev->msi_vectors[i].use) {
if (vector->virq >= 0) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, i);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
NULL, NULL, NULL);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 21/45] vfio/pci: vfio_notifier_init cpr parameters
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (19 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 20/45] vfio/pci: pass vector to virq functions Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 22/45] vfio/pci: vfio_notifier_cleanup Steve Sistare
` (24 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Pass vdev and nr to vfio_notifier_init, for use by CPR in a subsequent
patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7b2d185..63fb2d9 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,7 +54,8 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
-static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr, Error **errp)
{
int ret = event_notifier_init(e, 0);
@@ -144,7 +145,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
pci_irq_deassert(&vdev->pdev);
/* Get an eventfd for resample/unmask */
- if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
+ if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
goto fail;
}
@@ -297,7 +298,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
}
#endif
- if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
+ if (!vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0,
+ errp)) {
return false;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -485,7 +487,8 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
return;
}
- if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
+ if (!vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr,
+ NULL)) {
goto fail_notifier;
}
@@ -528,7 +531,7 @@ static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
vector->vdev = vdev;
vector->virq = -1;
- if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+ if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr, &err)) {
error_report_err(err);
}
vector->use = true;
@@ -762,7 +765,8 @@ retry:
vector->virq = -1;
vector->use = true;
- if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+ if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i,
+ &err)) {
error_report_err(err);
}
@@ -2875,7 +2879,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
return;
}
- if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+ if (!vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0,
+ &err)) {
error_report_err(err);
vdev->pci_aer = false;
return;
@@ -2941,7 +2946,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
return;
}
- if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+ if (!vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0,
+ &err)) {
error_report_err(err);
return;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 22/45] vfio/pci: vfio_notifier_cleanup
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (20 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 21/45] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 23/45] vfio/pci: export MSI functions Steve Sistare
` (23 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Move event_notifier_cleanup calls to a helper vfio_notifier_cleanup.
This version is trivial, and does not yet use the vdev and nr parameters.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 63fb2d9..fd51b9e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -65,6 +65,12 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
return !ret;
}
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr)
+{
+ event_notifier_cleanup(e);
+}
+
/*
* Disabling BAR mmaping can be slow, but toggling it around INTx can
* also be a huge overhead. We try to get the best of both worlds by
@@ -177,7 +183,7 @@ fail_vfio:
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
vdev->intx.route.irq);
fail_irqfd:
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
fail:
qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -209,7 +215,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
}
/* We only need to close the eventfd for VFIO to cleanup the kernel side */
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
/* QEMU starts listening for interrupt events. */
qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -308,7 +314,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
return false;
}
@@ -335,7 +341,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
vdev->interrupt = VFIO_INT_NONE;
@@ -500,7 +506,7 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
return;
fail_kvm:
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
fail_notifier:
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
@@ -513,7 +519,7 @@ static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
vector->virq);
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
}
static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -826,7 +832,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
NULL, NULL, NULL);
- event_notifier_cleanup(&vector->interrupt);
+ vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
}
}
@@ -2893,7 +2899,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
vdev->pci_aer = false;
}
}
@@ -2912,7 +2918,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
}
static void vfio_req_notifier_handler(void *opaque)
@@ -2959,7 +2965,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
} else {
vdev->req_enabled = true;
}
@@ -2979,7 +2985,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
vdev->req_enabled = false;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH V2 23/45] vfio/pci: export MSI functions
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (21 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 22/45] vfio/pci: vfio_notifier_cleanup Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 24/45] vfio-pci: preserve MSI Steve Sistare
` (22 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export various MSI functions, for use by CPR in subsequent patches.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 21 ++++++++++-----------
hw/vfio/pci.h | 12 ++++++++++++
2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index fd51b9e..29a5b3d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -277,7 +277,7 @@ static void vfio_irqchip_change(Notifier *notify, void *data)
vfio_intx_update(vdev, &vdev->intx.route);
}
-static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
+bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
{
uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
Error *err = NULL;
@@ -351,7 +351,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
/*
* MSI/X
*/
-static void vfio_msi_interrupt(void *opaque)
+void vfio_msi_interrupt(void *opaque)
{
VFIOMSIVector *vector = opaque;
VFIOPCIDevice *vdev = vector->vdev;
@@ -474,8 +474,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
return ret;
}
-static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
- int vector_n, bool msix)
+void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int vector_n, bool msix)
{
if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
return;
@@ -529,7 +529,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
kvm_irqchip_commit_routes(kvm_state);
}
-static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
{
VFIOMSIVector *vector = &vdev->msi_vectors[nr];
PCIDevice *pdev = &vdev->pdev;
@@ -640,13 +640,12 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
return 0;
}
-static int vfio_msix_vector_use(PCIDevice *pdev,
- unsigned int nr, MSIMessage msg)
+int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg)
{
return vfio_msix_vector_do_use(pdev, nr, &msg, vfio_msi_interrupt);
}
-static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
+void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
{
VFIOPCIDevice *vdev = VFIO_PCI(pdev);
VFIOMSIVector *vector = &vdev->msi_vectors[nr];
@@ -673,14 +672,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
}
}
-static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
{
assert(!vdev->defer_kvm_irq_routing);
vdev->defer_kvm_irq_routing = true;
vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
}
-static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
{
int i;
@@ -2587,7 +2586,7 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
return OBJECT(vdev);
}
-static bool vfio_msix_present(void *opaque, int version_id)
+bool vfio_msix_present(void *opaque, int version_id)
{
PCIDevice *pdev = opaque;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 43c1666..49faf57 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -197,6 +197,18 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
return class == PCI_CLASS_DISPLAY_VGA;
}
+/* MSI/MSI-X/INTx */
+void vfio_vector_init(VFIOPCIDevice *vdev, int nr);
+void vfio_msi_interrupt(void *opaque);
+void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int vector_n, bool msix);
+int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
+void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);
+bool vfio_msix_present(void *opaque, int version_id);
+void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);
+
uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
void vfio_pci_write_config(PCIDevice *pdev,
uint32_t addr, uint32_t val, int len);
--
1.8.3.1
* [PATCH V2 24/45] vfio-pci: preserve MSI
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (22 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 23/45] vfio/pci: export MSI functions Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 25/45] vfio-pci: preserve INTx Steve Sistare
` (21 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Save the MSI message area as part of vfio-pci vmstate, and preserve the
interrupt and notifier eventfds. migrate_incoming loads the MSI data,
then the vfio-pci post_load handler finds the eventfds in CPR state,
rebuilds vector data structures, and attaches the interrupts to the new
KVM instance.
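To illustrate the lookup that the post_load handler relies on, here is a
minimal standalone sketch of a descriptor table keyed by name and index.
It is not the QEMU CPR code: the sketch_* functions, the fixed-size table,
and the device name used when exercising it are hypothetical. Only the
"<device>_<kind>" key format is taken from the STRDUP_VECTOR_FD_NAME macro
in this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for CPR state: a small table keyed by (name, index). */
#define MAX_SAVED_FDS 16

struct saved_fd {
    char name[64];
    int index;
    int fd;
    int used;
};

static struct saved_fd saved_fds[MAX_SAVED_FDS];

/* Return the slot holding (name, index), or -1 if there is none. */
static int sketch_lookup(const char *name, int index)
{
    for (int i = 0; i < MAX_SAVED_FDS; i++) {
        if (saved_fds[i].used && saved_fds[i].index == index &&
            strcmp(saved_fds[i].name, name) == 0) {
            return i;
        }
    }
    return -1;
}

/* Return the saved fd for (name, index), or -1 if none was saved. */
static int sketch_find_fd(const char *name, int index)
{
    int slot = sketch_lookup(name, index);
    return slot < 0 ? -1 : saved_fds[slot].fd;
}

/* Save the fd for (name, index), overwriting any previous value. */
static int sketch_resave_fd(const char *name, int index, int fd)
{
    int slot = sketch_lookup(name, index);

    for (int i = 0; slot < 0 && i < MAX_SAVED_FDS; i++) {
        if (!saved_fds[i].used) {
            slot = i;
        }
    }
    if (slot < 0) {
        return -1; /* table full */
    }
    snprintf(saved_fds[slot].name, sizeof(saved_fds[slot].name), "%s", name);
    saved_fds[slot].index = index;
    saved_fds[slot].fd = fd;
    saved_fds[slot].used = 1;
    return 0;
}

/* Forget the fd for (name, index); a no-op if it was never saved. */
static void sketch_delete_fd(const char *name, int index)
{
    int slot = sketch_lookup(name, index);
    if (slot >= 0) {
        saved_fds[slot].used = 0;
    }
}

/* Build the "<device>_<kind>" key, mirroring STRDUP_VECTOR_FD_NAME. */
static void sketch_vector_fd_name(char *buf, size_t len,
                                  const char *dev, const char *kind)
{
    snprintf(buf, len, "%s_%s", dev, kind);
}
```

A lookup that returns -1 corresponds to the "create a new eventfd" branch
of vfio_notifier_init, and a non-negative result to the "reuse" branch.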
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++
hw/vfio/pci.c | 40 ++++++++++++++++++--
include/hw/vfio/vfio-cpr.h | 8 ++++
3 files changed, 136 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 8268c0c..96eb10a 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -9,6 +9,8 @@
#include "hw/vfio/vfio-common.h"
#include "hw/vfio/vfio-cpr.h"
#include "hw/vfio/pci.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/msi.h"
#include "migration/cpr.h"
#include "qapi/error.h"
#include "system/runstate.h"
@@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
}
+#define STRDUP_VECTOR_FD_NAME(vdev, name) \
+ g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+ int fd)
+{
+ g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+ cpr_resave_fd(fdname, nr, fd);
+}
+
+int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+ return cpr_find_fd(fdname, nr);
+}
+
+void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+ cpr_delete_fd(fdname, nr);
+}
+
+static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
+ bool msix)
+{
+ int i, fd;
+ bool pending = false;
+ PCIDevice *pdev = &vdev->pdev;
+
+ vdev->nr_vectors = nr_vectors;
+ vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+ vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+ vfio_prepare_kvm_msi_virq_batch(vdev);
+
+ for (i = 0; i < nr_vectors; i++) {
+ VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+ fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
+ if (fd >= 0) {
+ vfio_vector_init(vdev, i);
+ qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+ }
+
+ if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
+ vfio_add_kvm_msi_virq(vdev, vector, i, msix);
+ } else {
+ vdev->msi_vectors[i].virq = -1;
+ }
+
+ if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+ set_bit(i, vdev->msix->pending);
+ pending = true;
+ }
+ }
+
+ vfio_commit_kvm_msi_virq_batch(vdev);
+
+ if (msix) {
+ memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+ }
+}
+
/*
* The kernel may change non-emulated config bits. Exclude them from the
* changed-bits check in get_pci_config_device.
@@ -58,13 +123,39 @@ static int vfio_cpr_pci_pre_load(void *opaque)
return 0;
}
+static int vfio_cpr_pci_post_load(void *opaque, int version_id)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int nr_vectors;
+
+ if (msix_enabled(pdev)) {
+ msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
+ vfio_msix_vector_release, NULL);
+ nr_vectors = vdev->msix->entries;
+ vfio_cpr_claim_vectors(vdev, nr_vectors, true);
+
+ } else if (msi_enabled(pdev)) {
+ nr_vectors = msi_nr_vectors_allocated(pdev);
+ vfio_cpr_claim_vectors(vdev, nr_vectors, false);
+
+ } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+ g_assert_not_reached(); /* completed in a subsequent patch */
+ }
+
+ return 0;
+}
+
const VMStateDescription vfio_cpr_pci_vmstate = {
.name = "vfio-cpr-pci",
.version_id = 0,
.minimum_version_id = 0,
.pre_load = vfio_cpr_pci_pre_load,
+ .post_load = vfio_cpr_pci_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
+ VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+ VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 29a5b3d..465ca6b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,7 @@
#include "hw/pci/pci_bridge.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "hw/vfio/vfio-cpr.h"
#include "migration/vmstate.h"
#include "qobject/qdict.h"
#include "qemu/error-report.h"
@@ -54,13 +55,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+/* Create new or reuse existing eventfd */
static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr, Error **errp)
{
- int ret = event_notifier_init(e, 0);
+ int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
+ int ret = 0;
- if (ret) {
- error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+ if (fd >= 0) {
+ event_notifier_init_fd(e, fd);
+ } else {
+ ret = event_notifier_init(e, 0);
+ if (ret) {
+ error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+ } else {
+ fd = event_notifier_get_fd(e);
+ if (fd >= 0) {
+ vfio_cpr_save_vector_fd(vdev, name, nr, fd);
+ }
+ }
}
return !ret;
}
@@ -68,6 +81,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
{
+ vfio_cpr_delete_vector_fd(vdev, name, nr);
event_notifier_cleanup(e);
}
@@ -554,6 +568,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
int ret;
bool resizing = !!(vdev->nr_vectors < nr + 1);
+ /*
+ * Ignore the callback from msix_set_vector_notifiers during resume.
+ * The necessary subset of these actions is called from
+ * vfio_cpr_claim_vectors during post load.
+ */
+ if (vdev->vbasedev.cpr.reused) {
+ return 0;
+ }
+
trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
vector = &vdev->msi_vectors[nr];
@@ -2894,6 +2917,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->err_notifier);
qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (vdev->vbasedev.cpr.reused) {
+ return;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2960,6 +2988,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->req_notifier);
qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (vdev->vbasedev.cpr.reused) {
+ vdev->req_enabled = true;
+ return;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index f5480de..a9f2fbe 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -27,6 +27,7 @@ typedef struct VFIODeviceCPR {
struct VFIOContainer;
struct VFIOGroup;
struct VFIOContainerBase;
+struct VFIOPCIDevice;
int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
Error **errp);
@@ -44,5 +45,12 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
bool vfio_cpr_register_ram_discard_listener(
struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+ int nr, int fd);
+int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+ int nr);
+void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+ int nr);
+
extern const VMStateDescription vfio_cpr_pci_vmstate;
#endif
--
1.8.3.1
* [PATCH V2 25/45] vfio-pci: preserve INTx
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (23 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 24/45] vfio-pci: preserve MSI Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 26/45] migration: close kvm after cpr Steve Sistare
` (20 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Preserve vfio INTx state across cpr-transfer. Preserve VFIOINTx fields as
follows:
  pin : Recover this from the vfio config in kernel space.
  interrupt : Preserve its eventfd descriptor across exec.
  unmask : Ditto.
  route.irq : This could perhaps be recovered in vfio_cpr_pci_post_load by
    calling pci_device_route_intx_to_irq(pin), whose implementation reads
    config space for a bridge device such as ich9. However, there is no
    guarantee that the bridge vmstate is read before vfio vmstate. Rather
    than fiddling with MigrationPriority for vmstate handlers, explicitly
    save route.irq in vfio vmstate.
  pending : Save in vfio vmstate.
  mmap_timeout, mmap_timer : Re-initialize.
  kvm_accel : Re-initialize.

In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_cpr_pci_post_load. Modify vfio_intx_enable and
vfio_intx_enable_kvm to skip vfio initialization, but still perform
kvm initialization.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr.c | 27 ++++++++++++++++++++++++++-
hw/vfio/pci.c | 28 +++++++++++++++++++++++++---
2 files changed, 51 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 96eb10a..a2400ca 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -140,12 +140,36 @@ static int vfio_cpr_pci_post_load(void *opaque, int version_id)
vfio_cpr_claim_vectors(vdev, nr_vectors, false);
} else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
- g_assert_not_reached(); /* completed in a subsequent patch */
+ Error *err = NULL;
+ if (!vfio_intx_enable(vdev, &err)) {
+ error_report_err(err);
+ return -1;
+ }
}
return 0;
}
+static const VMStateDescription vfio_intx_vmstate = {
+ .name = "vfio-cpr-intx",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .fields = (VMStateField[]) {
+ VMSTATE_BOOL(pending, VFIOINTx),
+ VMSTATE_UINT32(route.mode, VFIOINTx),
+ VMSTATE_INT32(route.irq, VFIOINTx),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) { \
+ .name = (stringify(_field)), \
+ .size = sizeof(VFIOINTx), \
+ .vmsd = &vfio_intx_vmstate, \
+ .flags = VMS_STRUCT, \
+ .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
+}
+
const VMStateDescription vfio_cpr_pci_vmstate = {
.name = "vfio-cpr-pci",
.version_id = 0,
@@ -156,6 +180,7 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
.fields = (VMStateField[]) {
VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+ VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 465ca6b..c5470d0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -158,12 +158,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
return true;
}
+ if (vdev->vbasedev.cpr.reused) {
+ goto skip_state;
+ }
+
/* Get to a known interrupt state */
qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
vdev->intx.pending = false;
pci_irq_deassert(&vdev->pdev);
+skip_state:
/* Get an eventfd for resample/unmask */
if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
goto fail;
@@ -177,6 +182,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
goto fail_irqfd;
}
+ if (vdev->vbasedev.cpr.reused) {
+ goto skip_irq;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_UNMASK,
event_notifier_get_fd(&vdev->intx.unmask),
@@ -187,6 +196,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
/* Let'em rip */
vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+skip_irq:
vdev->intx.kvm_accel = true;
trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
@@ -302,7 +312,13 @@ bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
return true;
}
- vfio_disable_interrupts(vdev);
+ /*
+ * Do not alter interrupt state during vfio_realize and cpr load. The
+ * reused flag is cleared thereafter.
+ */
+ if (!vdev->vbasedev.cpr.reused) {
+ vfio_disable_interrupts(vdev);
+ }
vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -325,7 +341,8 @@ bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
- if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
+ if (!vdev->vbasedev.cpr.reused &&
+ !vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
@@ -3241,7 +3258,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vfio_intx_routing_notifier);
vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
- if (!vfio_intx_enable(vdev, errp)) {
+ /*
+ * During CPR, do not call vfio_intx_enable at this time. Instead,
+ * call it from vfio_pci_post_load after the intx routing data has
+ * been loaded from vmstate.
+ */
+ if (!vdev->vbasedev.cpr.reused && !vfio_intx_enable(vdev, errp)) {
goto out_deregister;
}
}
--
1.8.3.1
* [PATCH V2 26/45] migration: close kvm after cpr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (24 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 25/45] vfio-pci: preserve INTx Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 15:51 ` Steven Sistare
2025-02-14 14:14 ` [PATCH V2 27/45] migration: cpr_get_fd_param helper Steve Sistare
` (19 subsequent siblings)
45 siblings, 1 reply; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
cpr-transfer breaks vfio network connectivity to and from the guest, and
the host system log shows:
irq bypass consumer (token 00000000a03c32e5) registration fails: -16
which is EBUSY. This occurs because KVM descriptors are still open in
the old QEMU process. Close them.
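As a quick check of the errno value quoted above, the following sketch
decodes a kernel-style negative return code. The helper name
describe_kernel_error is hypothetical, and the numeric value 16 for EBUSY
is the Linux value.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Kernel log messages usually print errors as negative errno values.
 * Map such a value to its textual description.
 */
static const char *describe_kernel_error(int ret)
{
    return strerror(ret < 0 ? -ret : ret);
}
```

On Linux, describe_kernel_error(-16) returns the same string as
strerror(EBUSY).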
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
accel/kvm/kvm-all.c | 28 ++++++++++++++++++++++++++++
hw/vfio/common.c | 8 ++++++++
include/hw/vfio/vfio-common.h | 1 +
include/migration/cpr.h | 2 ++
include/system/kvm.h | 1 +
migration/cpr-transfer.c | 18 ++++++++++++++++++
migration/cpr.c | 8 ++++++++
migration/migration.c | 1 +
8 files changed, 67 insertions(+)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c65b790..cdbe91c 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -507,16 +507,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
goto err;
}
+ /* If I am the CPU that created coalesced_mmio_ring, then discard it */
+ if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
+ s->coalesced_mmio_ring = NULL;
+ }
+
ret = munmap(cpu->kvm_run, mmap_size);
if (ret < 0) {
goto err;
}
+ cpu->kvm_run = NULL;
if (cpu->kvm_dirty_gfns) {
ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
if (ret < 0) {
goto err;
}
+ cpu->kvm_dirty_gfns = NULL;
}
kvm_park_vcpu(cpu);
@@ -595,6 +602,27 @@ err:
return ret;
}
+void kvm_close(void)
+{
+ CPUState *cpu;
+
+ CPU_FOREACH(cpu) {
+ cpu_remove_sync(cpu);
+ close(cpu->kvm_fd);
+ cpu->kvm_fd = -1;
+ close(cpu->kvm_vcpu_stats_fd);
+ cpu->kvm_vcpu_stats_fd = -1;
+ }
+
+ if (kvm_state && kvm_state->fd != -1) {
+ close(kvm_state->vmfd);
+ kvm_state->vmfd = -1;
+ close(kvm_state->fd);
+ kvm_state->fd = -1;
+ }
+ kvm_state = NULL;
+}
+
/*
* dirty pages logging control
*/
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 48663ad..c536698 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1501,6 +1501,14 @@ int vfio_kvm_device_del_fd(int fd, Error **errp)
return 0;
}
+void vfio_kvm_device_close(void)
+{
+ if (vfio_kvm_device_fd != -1) {
+ close(vfio_kvm_device_fd);
+ vfio_kvm_device_fd = -1;
+ }
+}
+
VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
{
VFIOAddressSpace *space;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1563f3a..78e4f12 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -259,6 +259,7 @@ VFIODevice *vfio_get_vfio_device(Object *obj);
int vfio_kvm_device_add_fd(int fd, Error **errp);
int vfio_kvm_device_del_fd(int fd, Error **errp);
+void vfio_kvm_device_close(void);
bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 6ad04d4..c5c191d 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -32,7 +32,9 @@ void cpr_state_close(void);
struct QIOChannel *cpr_state_ioc(void);
bool cpr_needed_for_reuse(void *opaque);
+void cpr_kvm_close(void);
+void cpr_transfer_init(void);
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
diff --git a/include/system/kvm.h b/include/system/kvm.h
index ab17c09..ad5c55e 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
int kvm_has_vcpu_events(void);
int kvm_max_nested_state_length(void);
int kvm_has_gsi_routing(void);
+void kvm_close(void);
/**
* kvm_arm_supports_user_irq
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
index e1f1403..396558f 100644
--- a/migration/cpr-transfer.c
+++ b/migration/cpr-transfer.c
@@ -17,6 +17,24 @@
#include "migration/vmstate.h"
#include "trace.h"
+static int cpr_transfer_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e,
+ Error **errp)
+{
+ if (e->type == MIG_EVENT_PRECOPY_DONE) {
+ cpr_kvm_close();
+ }
+ return 0;
+}
+
+void cpr_transfer_init(void)
+{
+ static NotifierWithReturn notifier;
+
+ migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
+ MIG_MODE_CPR_TRANSFER);
+}
+
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
{
MigrationAddress *addr = channel->addr;
diff --git a/migration/cpr.c b/migration/cpr.c
index 12c489b..351e12d 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,12 +7,14 @@
#include "qemu/osdep.h"
#include "qapi/error.h"
+#include "hw/vfio/vfio-common.h"
#include "migration/cpr.h"
#include "migration/misc.h"
#include "migration/options.h"
#include "migration/qemu-file.h"
#include "migration/savevm.h"
#include "migration/vmstate.h"
+#include "system/kvm.h"
#include "system/runstate.h"
#include "trace.h"
@@ -266,3 +268,9 @@ bool cpr_needed_for_reuse(void *opaque)
MigMode mode = migrate_mode();
return mode == MIG_MODE_CPR_TRANSFER;
}
+
+void cpr_kvm_close(void)
+{
+ kvm_close();
+ vfio_kvm_device_close();
+}
diff --git a/migration/migration.c b/migration/migration.c
index 3969285..bdc5255 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -296,6 +296,7 @@ void migration_object_init(void)
ram_mig_init();
dirty_bitmap_mig_init();
+ cpr_transfer_init();
/* Initialize cpu throttle timers */
cpu_throttle_init();
--
1.8.3.1
* Re: [PATCH V2 26/45] migration: close kvm after cpr
2025-02-14 14:14 ` [PATCH V2 26/45] migration: close kvm after cpr Steve Sistare
@ 2025-02-14 15:51 ` Steven Sistare
0 siblings, 0 replies; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 15:51 UTC (permalink / raw)
To: qemu-devel, kvm
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Paolo Bonzini
cc kvm reviewers.
The series is here:
https://lore.kernel.org/qemu-devel/1739542467-226739-1-git-send-email-steven.sistare@oracle.com/
- Steve
On 2/14/2025 9:14 AM, Steve Sistare wrote:
> cpr-transfer breaks vfio network connectivity to and from the guest, and
> the host system log shows:
> irq bypass consumer (token 00000000a03c32e5) registration fails: -16
> which is EBUSY. This occurs because KVM descriptors are still open in
> the old QEMU process. Close them.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> accel/kvm/kvm-all.c | 28 ++++++++++++++++++++++++++++
> hw/vfio/common.c | 8 ++++++++
> include/hw/vfio/vfio-common.h | 1 +
> include/migration/cpr.h | 2 ++
> include/system/kvm.h | 1 +
> migration/cpr-transfer.c | 18 ++++++++++++++++++
> migration/cpr.c | 8 ++++++++
> migration/migration.c | 1 +
> 8 files changed, 67 insertions(+)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c65b790..cdbe91c 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -507,16 +507,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
> goto err;
> }
>
> + /* If I am the CPU that created coalesced_mmio_ring, then discard it */
> + if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
> + s->coalesced_mmio_ring = NULL;
> + }
> +
> ret = munmap(cpu->kvm_run, mmap_size);
> if (ret < 0) {
> goto err;
> }
> + cpu->kvm_run = NULL;
>
> if (cpu->kvm_dirty_gfns) {
> ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
> if (ret < 0) {
> goto err;
> }
> + cpu->kvm_dirty_gfns = NULL;
> }
>
> kvm_park_vcpu(cpu);
> @@ -595,6 +602,27 @@ err:
> return ret;
> }
>
> +void kvm_close(void)
> +{
> + CPUState *cpu;
> +
> + CPU_FOREACH(cpu) {
> + cpu_remove_sync(cpu);
> + close(cpu->kvm_fd);
> + cpu->kvm_fd = -1;
> + close(cpu->kvm_vcpu_stats_fd);
> + cpu->kvm_vcpu_stats_fd = -1;
> + }
> +
> + if (kvm_state && kvm_state->fd != -1) {
> + close(kvm_state->vmfd);
> + kvm_state->vmfd = -1;
> + close(kvm_state->fd);
> + kvm_state->fd = -1;
> + }
> + kvm_state = NULL;
> +}
> +
> /*
> * dirty pages logging control
> */
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 48663ad..c536698 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1501,6 +1501,14 @@ int vfio_kvm_device_del_fd(int fd, Error **errp)
> return 0;
> }
>
> +void vfio_kvm_device_close(void)
> +{
> + if (vfio_kvm_device_fd != -1) {
> + close(vfio_kvm_device_fd);
> + vfio_kvm_device_fd = -1;
> + }
> +}
> +
> VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> {
> VFIOAddressSpace *space;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1563f3a..78e4f12 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -259,6 +259,7 @@ VFIODevice *vfio_get_vfio_device(Object *obj);
>
> int vfio_kvm_device_add_fd(int fd, Error **errp);
> int vfio_kvm_device_del_fd(int fd, Error **errp);
> +void vfio_kvm_device_close(void);
>
> bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
> void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 6ad04d4..c5c191d 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -32,7 +32,9 @@ void cpr_state_close(void);
> struct QIOChannel *cpr_state_ioc(void);
>
> bool cpr_needed_for_reuse(void *opaque);
> +void cpr_kvm_close(void);
>
> +void cpr_transfer_init(void);
> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>
> diff --git a/include/system/kvm.h b/include/system/kvm.h
> index ab17c09..ad5c55e 100644
> --- a/include/system/kvm.h
> +++ b/include/system/kvm.h
> @@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
> int kvm_has_vcpu_events(void);
> int kvm_max_nested_state_length(void);
> int kvm_has_gsi_routing(void);
> +void kvm_close(void);
>
> /**
> * kvm_arm_supports_user_irq
> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
> index e1f1403..396558f 100644
> --- a/migration/cpr-transfer.c
> +++ b/migration/cpr-transfer.c
> @@ -17,6 +17,24 @@
> #include "migration/vmstate.h"
> #include "trace.h"
>
> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
> + MigrationEvent *e,
> + Error **errp)
> +{
> + if (e->type == MIG_EVENT_PRECOPY_DONE) {
> + cpr_kvm_close();
> + }
> + return 0;
> +}
> +
> +void cpr_transfer_init(void)
> +{
> + static NotifierWithReturn notifier;
> +
> + migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
> + MIG_MODE_CPR_TRANSFER);
> +}
> +
> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
> {
> MigrationAddress *addr = channel->addr;
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 12c489b..351e12d 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,12 +7,14 @@
>
> #include "qemu/osdep.h"
> #include "qapi/error.h"
> +#include "hw/vfio/vfio-common.h"
> #include "migration/cpr.h"
> #include "migration/misc.h"
> #include "migration/options.h"
> #include "migration/qemu-file.h"
> #include "migration/savevm.h"
> #include "migration/vmstate.h"
> +#include "system/kvm.h"
> #include "system/runstate.h"
> #include "trace.h"
>
> @@ -266,3 +268,9 @@ bool cpr_needed_for_reuse(void *opaque)
> MigMode mode = migrate_mode();
> return mode == MIG_MODE_CPR_TRANSFER;
> }
> +
> +void cpr_kvm_close(void)
> +{
> + kvm_close();
> + vfio_kvm_device_close();
> +}
> diff --git a/migration/migration.c b/migration/migration.c
> index 3969285..bdc5255 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -296,6 +296,7 @@ void migration_object_init(void)
>
> ram_mig_init();
> dirty_bitmap_mig_init();
> + cpr_transfer_init();
>
> /* Initialize cpu throttle timers */
> cpu_throttle_init();
* [PATCH V2 27/45] migration: cpr_get_fd_param helper
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (25 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 26/45] migration: close kvm after cpr Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr Steve Sistare
` (18 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add the helper function cpr_get_fd_param, to use when preserving
a file descriptor that is opened externally and passed to QEMU.
cpr_get_fd_param returns a descriptor number from a QEMU
command-line parameter, from a getfd command, or from CPR state.
When a descriptor is passed to new QEMU via SCM_RIGHTS, its number
changes. Hence, during CPR, the command-line parameter is ignored
in new QEMU, and overridden by the value found in CPR state.
Similarly, if the descriptor was originally specified by a getfd
command in old QEMU, the fd number is not known outside of QEMU,
and it changes when sent to new QEMU via SCM_RIGHTS. Hence the
user cannot send getfd to new QEMU, but when the user sends a
hotplug command that references the fd, cpr_get_fd_param finds
its value in CPR state.
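For readers unfamiliar with why the number changes, here is a minimal
standalone sketch of passing a descriptor over a UNIX socket with
SCM_RIGHTS. It is not taken from QEMU; send_fd and recv_fd are
hypothetical helper names, and error handling is reduced to a -1 return.
The receiver gets a newly allocated descriptor number that refers to the
same open file, so any number recorded by the sender is meaningless to
the receiver.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one descriptor over a UNIX socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd number, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

Even when both ends live in one process, as with a socketpair, the value
returned by recv_fd differs from the value given to send_fd, which is the
reason the saved value in CPR state must take precedence.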
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/migration/cpr.h | 2 ++
migration/cpr.c | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index c5c191d..23d0af4 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -33,6 +33,8 @@ struct QIOChannel *cpr_state_ioc(void);
bool cpr_needed_for_reuse(void *opaque);
void cpr_kvm_close(void);
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+ bool *reused, Error **errp);
void cpr_transfer_init(void);
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 351e12d..c903b24 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -14,6 +14,7 @@
#include "migration/qemu-file.h"
#include "migration/savevm.h"
#include "migration/vmstate.h"
+#include "monitor/monitor.h"
#include "system/kvm.h"
#include "system/runstate.h"
#include "trace.h"
@@ -274,3 +275,42 @@ void cpr_kvm_close(void)
kvm_close();
vfio_kvm_device_close();
}
+
+/*
+ * cpr_get_fd_param: find a descriptor and return its value.
+ *
+ * @name: CPR name for the descriptor
+ * @fdname: An integer-valued string, or a name passed to a getfd command
+ * @index: CPR index of the descriptor
+ * @reused: returns true if the fd is found in CPR state, else false.
+ * @errp: returned error message
+ *
+ * If CPR is not being performed, then use @fdname to find the fd.
+ * If CPR is being performed, then ignore @fdname, and look for @name
+ * and @index in CPR state.
+ *
+ * On success returns the fd value, else returns -1.
+ */
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+ bool *reused, Error **errp)
+{
+ ERRP_GUARD();
+ int fd;
+
+ if (cpr_is_incoming()) {
+ fd = cpr_find_fd(name, index);
+ if (fd < 0) {
+ error_setg(errp, "cannot find saved value for fd %s", fdname);
+ }
+ *reused = true;
+ } else {
+ fd = monitor_fd_param(monitor_cur(), fdname, errp);
+ if (fd >= 0) {
+ cpr_save_fd(name, index, fd);
+ } else {
+ error_prepend(errp, "Could not parse object fd %s:", fdname);
+ }
+ *reused = false;
+ }
+ return fd;
+}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
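The lookup rule described in the commit message above can be sketched as a small standalone model. This is only an illustration under stated assumptions: every toy_* name below is hypothetical, the table stands in for CPR state, toy_incoming stands in for cpr_is_incoming(), and toy_parse_fd accepts only an integer-valued string (the real monitor_fd_param also resolves names passed to getfd).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CPR state: a small table of saved descriptors. */
#define TOY_MAX_FDS 8

struct toy_saved_fd {
    const char *name;
    int index;
    int fd;
};

static struct toy_saved_fd toy_state[TOY_MAX_FDS];
static int toy_state_count;
static bool toy_incoming;   /* stands in for cpr_is_incoming() */

/* Record (name, index) -> fd, as old QEMU would before the update. */
static void toy_save_fd(const char *name, int index, int fd)
{
    if (toy_state_count < TOY_MAX_FDS) {
        toy_state[toy_state_count++] = (struct toy_saved_fd){ name, index, fd };
    }
}

/* Return the saved fd for (name, index), or -1 if there is none. */
static int toy_find_fd(const char *name, int index)
{
    for (int i = 0; i < toy_state_count; i++) {
        if (toy_state[i].index == index && !strcmp(toy_state[i].name, name)) {
            return toy_state[i].fd;
        }
    }
    return -1;
}

/* Simplified parser: accepts only a non-negative integer string. */
static int toy_parse_fd(const char *fdname)
{
    char *end;
    long val = strtol(fdname, &end, 10);

    if (end == fdname || *end != '\0' || val < 0 || val > INT_MAX) {
        return -1;
    }
    return (int)val;
}

/*
 * Incoming side: ignore fdname and use the saved value.
 * Otherwise: resolve fdname and remember the result for a later update.
 */
static int toy_get_fd_param(const char *name, const char *fdname, int index,
                            bool *reused)
{
    int fd;

    assert(name && fdname && reused);
    if (toy_incoming) {
        fd = toy_find_fd(name, index);
        *reused = true;
    } else {
        fd = toy_parse_fd(fdname);
        if (fd >= 0) {
            toy_save_fd(name, index, fd);
        }
        *reused = false;
    }
    return fd;
}
```

In this model, a caller on the incoming side gets the saved number back regardless of what fdname says, which mirrors why the command-line value is ignored in new QEMU.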
* [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (26 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 27/45] migration: cpr_get_fd_param helper Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:38 ` Steven Sistare
2025-02-14 16:48 ` Peter Xu
2025-02-14 14:14 ` [PATCH V2 29/45] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
` (17 subsequent siblings)
45 siblings, 2 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
region that the translated address is found in. This will be needed by
CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
Also return the xlat offset, so we can simplify the interface by removing
the out parameters that can be trivially derived from mr and xlat.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/common.c | 21 ++++++++++++++-------
hw/virtio/vhost-vdpa.c | 8 ++++++--
include/exec/memory.h | 6 +++---
system/memory.c | 19 ++++---------------
4 files changed, 27 insertions(+), 27 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c536698..3b0c520 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -246,14 +246,13 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
}
/* Called with rcu_read_lock held. */
-static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
- ram_addr_t *ram_addr, bool *read_only,
- Error **errp)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
+ hwaddr *xlat_p, Error **errp)
{
bool ret, mr_has_discard_manager;
- ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
- &mr_has_discard_manager, errp);
+ ret = memory_get_xlat_addr(iotlb, &mr_has_discard_manager, mr_p, xlat_p,
+ errp);
if (ret && mr_has_discard_manager) {
/*
* Malicious VMs might trigger discarding of IOMMU-mapped memory. The
@@ -281,6 +280,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
VFIOContainerBase *bcontainer = giommu->bcontainer;
hwaddr iova = iotlb->iova + giommu->iommu_offset;
+ MemoryRegion *mr;
+ hwaddr xlat;
void *vaddr;
int ret;
Error *local_err = NULL;
@@ -300,10 +301,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
bool read_only;
- if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
+ if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
error_report_err(local_err);
goto out;
}
+ vaddr = memory_region_get_ram_ptr(mr) + xlat;
+ read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
+
/*
* vaddr is only valid until rcu_read_unlock(). But after
* vfio_dma_map has set up the mapping the pages will be
@@ -1259,6 +1263,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
ram_addr_t translated_addr;
Error *local_err = NULL;
int ret = -EINVAL;
+ MemoryRegion *mr;
+ ram_addr_t xlat;
trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
@@ -1269,10 +1275,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
}
rcu_read_lock();
- if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
+ if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
error_report_err(local_err);
goto out_unlock;
}
+ translated_addr = memory_region_get_ram_addr(mr) + xlat;
ret = vfio_get_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
translated_addr, &local_err);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3cdaa12..5dfe51e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
int ret;
Int128 llend;
Error *local_err = NULL;
+ MemoryRegion *mr;
+ hwaddr xlat;
if (iotlb->target_as != &address_space_memory) {
error_report("Wrong target AS \"%s\", only system memory is allowed",
@@ -228,11 +230,13 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
bool read_only;
- if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
- &local_err)) {
+ if (!memory_get_xlat_addr(iotlb, NULL, &mr, &xlat, &local_err)) {
error_report_err(local_err);
return;
}
+ vaddr = memory_region_get_ram_ptr(mr) + xlat;
+ read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
+
ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
iotlb->addr_mask + 1, vaddr, read_only);
if (ret) {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index ea5d33a..8590838 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -747,13 +747,13 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
* @read_only: indicates if writes are allowed
* @mr_has_discard_manager: indicates memory is controlled by a
* RamDiscardManager
+ * @mr_p: return the MemoryRegion containing the @iotlb translated addr
* @errp: pointer to Error*, to store an error if it happens.
*
* Return: true on success, else false setting @errp with error.
*/
-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
- ram_addr_t *ram_addr, bool *read_only,
- bool *mr_has_discard_manager, Error **errp);
+bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
+ MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp);
typedef struct CoalescedMemoryRange CoalescedMemoryRange;
typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
diff --git a/system/memory.c b/system/memory.c
index 4c82979..755eafe 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2183,9 +2183,8 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
}
/* Called with rcu_read_lock held. */
-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
- ram_addr_t *ram_addr, bool *read_only,
- bool *mr_has_discard_manager, Error **errp)
+bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
+ MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp)
{
MemoryRegion *mr;
hwaddr xlat;
@@ -2238,18 +2237,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
return false;
}
- if (vaddr) {
- *vaddr = memory_region_get_ram_ptr(mr) + xlat;
- }
-
- if (ram_addr) {
- *ram_addr = memory_region_get_ram_addr(mr) + xlat;
- }
-
- if (read_only) {
- *read_only = !writable || mr->readonly;
- }
-
+ *xlat_p = xlat;
+ *mr_p = mr;
return true;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
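The patch above replaces three out parameters with (mr, xlat) and lets each caller derive what it needs. A minimal standalone sketch of those derivations follows, assuming a toy region type: the toy_* names and the permission values are hypothetical stand-ins, not QEMU definitions; only the shape of the three expressions is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission bits; write permission is a separate bit. */
enum toy_perm { TOY_NONE = 0, TOY_RO = 1, TOY_WO = 2, TOY_RW = 3 };

/* Hypothetical region: just the fields the three derivations need. */
struct toy_region {
    uint8_t *host_base;   /* plays the role of memory_region_get_ram_ptr(mr) */
    uint64_t ram_base;    /* plays the role of memory_region_get_ram_addr(mr) */
    bool readonly;
};

/* Host virtual address of the translated location. */
static void *toy_vaddr(const struct toy_region *mr, uint64_t xlat)
{
    assert(mr);
    return mr->host_base + xlat;
}

/* RAM address of the translated location. */
static uint64_t toy_ram_addr(const struct toy_region *mr, uint64_t xlat)
{
    assert(mr);
    return mr->ram_base + xlat;
}

/* Read-only if the entry lacks write permission or the region forbids it. */
static bool toy_read_only(const struct toy_region *mr, enum toy_perm perm)
{
    assert(mr);
    return !(perm & TOY_WO) || mr->readonly;
}
```

Each caller computes only the value it uses, so the lookup itself no longer needs optional NULL-able out parameters.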
* Re: [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-14 14:14 ` [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-02-14 14:38 ` Steven Sistare
2025-02-14 16:48 ` Peter Xu
1 sibling, 0 replies; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 14:38 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, David Hildenbrand, Philippe Mathieu-Daude
cc memory reviewers.
The series is here:
https://lore.kernel.org/qemu-devel/1739542467-226739-1-git-send-email-steven.sistare@oracle.com/
- Steve
On 2/14/2025 9:14 AM, Steve Sistare wrote:
> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
> region that the translated address is found in. This will be needed by
> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>
> Also return the xlat offset, so we can simplify the interface by removing
> the out parameters that can be trivially derived from mr and xlat.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/common.c | 21 ++++++++++++++-------
> hw/virtio/vhost-vdpa.c | 8 ++++++--
> include/exec/memory.h | 6 +++---
> system/memory.c | 19 ++++---------------
> 4 files changed, 27 insertions(+), 27 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index c536698..3b0c520 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -246,14 +246,13 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> }
>
> /* Called with rcu_read_lock held. */
> -static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> - ram_addr_t *ram_addr, bool *read_only,
> - Error **errp)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
> + hwaddr *xlat_p, Error **errp)
> {
> bool ret, mr_has_discard_manager;
>
> - ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
> - &mr_has_discard_manager, errp);
> + ret = memory_get_xlat_addr(iotlb, &mr_has_discard_manager, mr_p, xlat_p,
> + errp);
> if (ret && mr_has_discard_manager) {
> /*
> * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
> @@ -281,6 +280,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> VFIOContainerBase *bcontainer = giommu->bcontainer;
> hwaddr iova = iotlb->iova + giommu->iommu_offset;
> + MemoryRegion *mr;
> + hwaddr xlat;
> void *vaddr;
> int ret;
> Error *local_err = NULL;
> @@ -300,10 +301,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> bool read_only;
>
> - if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
> + if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
> error_report_err(local_err);
> goto out;
> }
> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
> + read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
> +
> /*
> * vaddr is only valid until rcu_read_unlock(). But after
> * vfio_dma_map has set up the mapping the pages will be
> @@ -1259,6 +1263,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> ram_addr_t translated_addr;
> Error *local_err = NULL;
> int ret = -EINVAL;
> + MemoryRegion *mr;
> + ram_addr_t xlat;
>
> trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
>
> @@ -1269,10 +1275,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> }
>
> rcu_read_lock();
> - if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
> + if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
> error_report_err(local_err);
> goto out_unlock;
> }
> + translated_addr = memory_region_get_ram_addr(mr) + xlat;
>
> ret = vfio_get_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
> translated_addr, &local_err);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 3cdaa12..5dfe51e 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> int ret;
> Int128 llend;
> Error *local_err = NULL;
> + MemoryRegion *mr;
> + hwaddr xlat;
>
> if (iotlb->target_as != &address_space_memory) {
> error_report("Wrong target AS \"%s\", only system memory is allowed",
> @@ -228,11 +230,13 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> bool read_only;
>
> - if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
> - &local_err)) {
> + if (!memory_get_xlat_addr(iotlb, NULL, &mr, &xlat, &local_err)) {
> error_report_err(local_err);
> return;
> }
> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
> + read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
> +
> ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
> iotlb->addr_mask + 1, vaddr, read_only);
> if (ret) {
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index ea5d33a..8590838 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -747,13 +747,13 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> * @read_only: indicates if writes are allowed
> * @mr_has_discard_manager: indicates memory is controlled by a
> * RamDiscardManager
> + * @mr_p: return the MemoryRegion containing the @iotlb translated addr
> * @errp: pointer to Error*, to store an error if it happens.
> *
> * Return: true on success, else false setting @errp with error.
> */
> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> - ram_addr_t *ram_addr, bool *read_only,
> - bool *mr_has_discard_manager, Error **errp);
> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
> + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp);
>
> typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
> diff --git a/system/memory.c b/system/memory.c
> index 4c82979..755eafe 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2183,9 +2183,8 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> }
>
> /* Called with rcu_read_lock held. */
> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> - ram_addr_t *ram_addr, bool *read_only,
> - bool *mr_has_discard_manager, Error **errp)
> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
> + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp)
> {
> MemoryRegion *mr;
> hwaddr xlat;
> @@ -2238,18 +2237,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> return false;
> }
>
> - if (vaddr) {
> - *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> - }
> -
> - if (ram_addr) {
> - *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> - }
> -
> - if (read_only) {
> - *read_only = !writable || mr->readonly;
> - }
> -
> + *xlat_p = xlat;
> + *mr_p = mr;
> return true;
> }
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-14 14:14 ` [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr Steve Sistare
2025-02-14 14:38 ` Steven Sistare
@ 2025-02-14 16:48 ` Peter Xu
2025-02-14 20:40 ` Steven Sistare
1 sibling, 1 reply; 72+ messages in thread
From: Peter Xu @ 2025-02-14 16:48 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 06:14:10AM -0800, Steve Sistare wrote:
> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
> region that the translated address is found in. This will be needed by
> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>
> Also return the xlat offset, so we can simplify the interface by removing
> the out parameters that can be trivially derived from mr and xlat.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/common.c | 21 ++++++++++++++-------
> hw/virtio/vhost-vdpa.c | 8 ++++++--
> include/exec/memory.h | 6 +++---
> system/memory.c | 19 ++++---------------
> 4 files changed, 27 insertions(+), 27 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index c536698..3b0c520 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -246,14 +246,13 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> }
>
> /* Called with rcu_read_lock held. */
> -static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> - ram_addr_t *ram_addr, bool *read_only,
> - Error **errp)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
> + hwaddr *xlat_p, Error **errp)
> {
> bool ret, mr_has_discard_manager;
>
> - ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
> - &mr_has_discard_manager, errp);
> + ret = memory_get_xlat_addr(iotlb, &mr_has_discard_manager, mr_p, xlat_p,
> + errp);
> if (ret && mr_has_discard_manager) {
> /*
> * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
> @@ -281,6 +280,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> VFIOContainerBase *bcontainer = giommu->bcontainer;
> hwaddr iova = iotlb->iova + giommu->iommu_offset;
> + MemoryRegion *mr;
> + hwaddr xlat;
> void *vaddr;
> int ret;
> Error *local_err = NULL;
> @@ -300,10 +301,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> bool read_only;
>
> - if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
> + if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
> error_report_err(local_err);
> goto out;
> }
> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
> + read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
> +
> /*
> * vaddr is only valid until rcu_read_unlock(). But after
> * vfio_dma_map has set up the mapping the pages will be
> @@ -1259,6 +1263,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> ram_addr_t translated_addr;
> Error *local_err = NULL;
> int ret = -EINVAL;
> + MemoryRegion *mr;
> + ram_addr_t xlat;
>
> trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
>
> @@ -1269,10 +1275,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> }
>
> rcu_read_lock();
> - if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
> + if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
> error_report_err(local_err);
> goto out_unlock;
> }
> + translated_addr = memory_region_get_ram_addr(mr) + xlat;
>
> ret = vfio_get_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
> translated_addr, &local_err);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 3cdaa12..5dfe51e 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> int ret;
> Int128 llend;
> Error *local_err = NULL;
> + MemoryRegion *mr;
> + hwaddr xlat;
>
> if (iotlb->target_as != &address_space_memory) {
> error_report("Wrong target AS \"%s\", only system memory is allowed",
> @@ -228,11 +230,13 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> bool read_only;
>
> - if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
> - &local_err)) {
> + if (!memory_get_xlat_addr(iotlb, NULL, &mr, &xlat, &local_err)) {
> error_report_err(local_err);
> return;
> }
> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
> + read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
> +
> ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
> iotlb->addr_mask + 1, vaddr, read_only);
> if (ret) {
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index ea5d33a..8590838 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -747,13 +747,13 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> * @read_only: indicates if writes are allowed
> * @mr_has_discard_manager: indicates memory is controlled by a
> * RamDiscardManager
(some prior fields are prone to removal)
> + * @mr_p: return the MemoryRegion containing the @iotlb translated addr
> * @errp: pointer to Error*, to store an error if it happens.
> *
> * Return: true on success, else false setting @errp with error.
> */
> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> - ram_addr_t *ram_addr, bool *read_only,
> - bool *mr_has_discard_manager, Error **errp);
> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
> + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp);
>
> typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
> diff --git a/system/memory.c b/system/memory.c
> index 4c82979..755eafe 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2183,9 +2183,8 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> }
>
> /* Called with rcu_read_lock held. */
> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> - ram_addr_t *ram_addr, bool *read_only,
> - bool *mr_has_discard_manager, Error **errp)
> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
> + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp)
If we're going to return the MR anyway, probably we can drop
mr_has_discard_manager altogether..
> {
> MemoryRegion *mr;
> hwaddr xlat;
> @@ -2238,18 +2237,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> return false;
> }
>
> - if (vaddr) {
> - *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> - }
> -
> - if (ram_addr) {
> - *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> - }
> -
> - if (read_only) {
> - *read_only = !writable || mr->readonly;
> - }
> -
> + *xlat_p = xlat;
> + *mr_p = mr;
I suppose the current uses in the callers are still under RCU, so this
looks ok, but that will need to be well documented.
A better way is to always take an MR reference with memory_region_ref()
when the MR pointer is returned. Then it stays valid even if accessed by
accident after rcu_read_unlock(), and the caller should unref() after use.
> return true;
> }
>
> --
> 1.8.3.1
>
--
Peter Xu
^ permalink raw reply [flat|nested] 72+ messages in thread
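The reference-on-return idea suggested in the reply above can be sketched with a toy refcounted region. This is a simplified model, not the QEMU implementation: the toy_* names are hypothetical, the counter is a plain int rather than an atomic object reference, and the bounds check merely stands in for a translation that can fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical refcounted region. */
struct toy_region {
    int refcount;
    uint64_t size;
};

static void toy_region_ref(struct toy_region *mr)
{
    mr->refcount++;
}

static void toy_region_unref(struct toy_region *mr)
{
    assert(mr->refcount > 0);
    mr->refcount--;
}

/*
 * Lookup that hands back the region with a reference already taken.
 * On failure no reference is taken and the out parameters are untouched,
 * so the caller must unref only after a successful call.
 */
static bool toy_lookup(struct toy_region *candidate, uint64_t offset,
                       struct toy_region **mr_p, uint64_t *xlat_p)
{
    if (offset >= candidate->size) {
        return false;
    }
    toy_region_ref(candidate);
    *mr_p = candidate;
    *xlat_p = offset;
    return true;
}
```

With this shape, the returned pointer stays usable for as long as the caller holds its reference, at the cost of one ref/unref pair per successful lookup.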
* Re: [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-14 16:48 ` Peter Xu
@ 2025-02-14 20:40 ` Steven Sistare
2025-02-14 22:42 ` Peter Xu
0 siblings, 1 reply; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 20:40 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/14/2025 11:48 AM, Peter Xu wrote:
> On Fri, Feb 14, 2025 at 06:14:10AM -0800, Steve Sistare wrote:
>> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
>> region that the translated address is found in. This will be needed by
>> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>>
>> Also return the xlat offset, so we can simplify the interface by removing
>> the out parameters that can be trivially derived from mr and xlat.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/common.c | 21 ++++++++++++++-------
>> hw/virtio/vhost-vdpa.c | 8 ++++++--
>> include/exec/memory.h | 6 +++---
>> system/memory.c | 19 ++++---------------
>> 4 files changed, 27 insertions(+), 27 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index c536698..3b0c520 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -246,14 +246,13 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> }
>>
>> /* Called with rcu_read_lock held. */
>> -static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> - ram_addr_t *ram_addr, bool *read_only,
>> - Error **errp)
>> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
>> + hwaddr *xlat_p, Error **errp)
>> {
>> bool ret, mr_has_discard_manager;
>>
>> - ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
>> - &mr_has_discard_manager, errp);
>> + ret = memory_get_xlat_addr(iotlb, &mr_has_discard_manager, mr_p, xlat_p,
>> + errp);
>> if (ret && mr_has_discard_manager) {
>> /*
>> * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
>> @@ -281,6 +280,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>> VFIOContainerBase *bcontainer = giommu->bcontainer;
>> hwaddr iova = iotlb->iova + giommu->iommu_offset;
>> + MemoryRegion *mr;
>> + hwaddr xlat;
>> void *vaddr;
>> int ret;
>> Error *local_err = NULL;
>> @@ -300,10 +301,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>> bool read_only;
>>
>> - if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
>> + if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>> error_report_err(local_err);
>> goto out;
>> }
>> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> + read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
>> +
>> /*
>> * vaddr is only valid until rcu_read_unlock(). But after
>> * vfio_dma_map has set up the mapping the pages will be
>> @@ -1259,6 +1263,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> ram_addr_t translated_addr;
>> Error *local_err = NULL;
>> int ret = -EINVAL;
>> + MemoryRegion *mr;
>> + ram_addr_t xlat;
>>
>> trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
>>
>> @@ -1269,10 +1275,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> }
>>
>> rcu_read_lock();
>> - if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
>> + if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>> error_report_err(local_err);
>> goto out_unlock;
>> }
>> + translated_addr = memory_region_get_ram_addr(mr) + xlat;
>>
>> ret = vfio_get_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
>> translated_addr, &local_err);
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index 3cdaa12..5dfe51e 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> int ret;
>> Int128 llend;
>> Error *local_err = NULL;
>> + MemoryRegion *mr;
>> + hwaddr xlat;
>>
>> if (iotlb->target_as != &address_space_memory) {
>> error_report("Wrong target AS \"%s\", only system memory is allowed",
>> @@ -228,11 +230,13 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>> bool read_only;
>>
>> - if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
>> - &local_err)) {
>> + if (!memory_get_xlat_addr(iotlb, NULL, &mr, &xlat, &local_err)) {
>> error_report_err(local_err);
>> return;
>> }
>> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> + read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
>> +
>> ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
>> iotlb->addr_mask + 1, vaddr, read_only);
>> if (ret) {
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index ea5d33a..8590838 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -747,13 +747,13 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>> * @read_only: indicates if writes are allowed
>> * @mr_has_discard_manager: indicates memory is controlled by a
>> * RamDiscardManager
>
> (some prior fields are prone to removal)
My bad, thanks. I'll delete vaddr, ram_addr, read_only, and add xlat.
>> + * @mr_p: return the MemoryRegion containing the @iotlb translated addr
>> * @errp: pointer to Error*, to store an error if it happens.
>> *
>> * Return: true on success, else false setting @errp with error.
>> */
>> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> - ram_addr_t *ram_addr, bool *read_only,
>> - bool *mr_has_discard_manager, Error **errp);
>> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
>> + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp);
>>
>> typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>> typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>> diff --git a/system/memory.c b/system/memory.c
>> index 4c82979..755eafe 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2183,9 +2183,8 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>> }
>>
>> /* Called with rcu_read_lock held. */
>> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> - ram_addr_t *ram_addr, bool *read_only,
>> - bool *mr_has_discard_manager, Error **errp)
>> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
>> + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp)
>
> If we're going to return the MR anyway, probably we can drop
> mr_has_discard_manager altogether..
To hoist mr_has_discard_manager to the vfio caller, I would need to return len.
Your call.
>> {
>> MemoryRegion *mr;
>> hwaddr xlat;
>> @@ -2238,18 +2237,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> return false;
>> }
>>
>> - if (vaddr) {
>> - *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> - }
>> -
>> - if (ram_addr) {
>> - *ram_addr = memory_region_get_ram_addr(mr) + xlat;
>> - }
>> -
>> - if (read_only) {
>> - *read_only = !writable || mr->readonly;
>> - }
>> -
>> + *xlat_p = xlat;
>> + *mr_p = mr;
>
> I suppose the current uses in the callers are still under RCU, so this
> looks ok, but that will need to be well documented.
I can do that, or ...
> A better way is to always take an MR reference with memory_region_ref()
> when the MR pointer is returned. Then it stays valid even if accessed by
> accident after rcu_read_unlock(), and the caller should unref() after use.
I can do that, but it would add cycles. Is this considered a high performance
path that may be called frequently?
- Steve
>
>> return true;
>> }
>>
>> --
>> 1.8.3.1
>>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-14 20:40 ` Steven Sistare
@ 2025-02-14 22:42 ` Peter Xu
2025-02-24 16:50 ` Steven Sistare
0 siblings, 1 reply; 72+ messages in thread
From: Peter Xu @ 2025-02-14 22:42 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 03:40:57PM -0500, Steven Sistare wrote:
> > > diff --git a/system/memory.c b/system/memory.c
> > > index 4c82979..755eafe 100644
> > > --- a/system/memory.c
> > > +++ b/system/memory.c
> > > @@ -2183,9 +2183,8 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> > > }
> > > /* Called with rcu_read_lock held. */
> > > -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> > > - ram_addr_t *ram_addr, bool *read_only,
> > > - bool *mr_has_discard_manager, Error **errp)
> > > +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
> > > + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp)
> >
> > If we're going to return the MR anyway, probably we can drop
> > mr_has_discard_manager altogether..
>
> To hoist mr_has_discard_manager to the vfio caller, I would need to return len.
> Your call.
I meant dropping only the mr_has_discard_manager parameter from the
function interface, not the ram_discard_manager_is_populated() check.
>
> > > {
> > > MemoryRegion *mr;
> > > hwaddr xlat;
> > > @@ -2238,18 +2237,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> > > return false;
> > > }
> > > - if (vaddr) {
> > > - *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> > > - }
> > > -
> > > - if (ram_addr) {
> > > - *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> > > - }
> > > -
> > > - if (read_only) {
> > > - *read_only = !writable || mr->readonly;
> > > - }
> > > -
> > > + *xlat_p = xlat;
> > > + *mr_p = mr;
> >
> > I suppose current use on the callers are still under RCU so looks ok, but
> > that'll need to be rich-documented.
>
> I can do that, or ...
>
> > Better way is always taking a MR reference when the MR pointer is returned,
> > with memory_region_ref(). Then it is even valid if by accident accessed
> > after rcu_read_unlock(), and caller should unref() after use.
>
> I can do that, but it would add cycles. Is this considered a high performance
> path that may be called frequently?
AFAICT, vIOMMU mapping isn't a high-performance path. In this specific path,
the cost of the refcount op should be buried in the DMA map operation itself.
Personally I slightly prefer this option, because it's always safer to take a
refcount along with a pointer, and it's easier to follow.
--
Peter Xu
* Re: [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-14 22:42 ` Peter Xu
@ 2025-02-24 16:50 ` Steven Sistare
2025-02-24 19:20 ` Peter Xu
0 siblings, 1 reply; 72+ messages in thread
From: Steven Sistare @ 2025-02-24 16:50 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/14/2025 5:42 PM, Peter Xu wrote:
> On Fri, Feb 14, 2025 at 03:40:57PM -0500, Steven Sistare wrote:
>>>> diff --git a/system/memory.c b/system/memory.c
>>>> index 4c82979..755eafe 100644
>>>> --- a/system/memory.c
>>>> +++ b/system/memory.c
>>>> @@ -2183,9 +2183,8 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>>>> }
>>>> /* Called with rcu_read_lock held. */
>>>> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>>>> - ram_addr_t *ram_addr, bool *read_only,
>>>> - bool *mr_has_discard_manager, Error **errp)
>>>> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, bool *mr_has_discard_manager,
>>>> + MemoryRegion **mr_p, hwaddr *xlat_p, Error **errp)
>>>
>>> If we're going to return the MR anyway, probably we can drop
>>> mr_has_discard_manager altogether..
>>
>> To hoist mr_has_discard_manager to the vfio caller, I would need to return len.
>> Your call.
>
> I meant only dropping mr_has_discard_manager parameter from the function
> interface, not the ram_discard_manager_is_populated() check.
Got it, will do.
>>>> {
>>>> MemoryRegion *mr;
>>>> hwaddr xlat;
>>>> @@ -2238,18 +2237,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>>>> return false;
>>>> }
>>>> - if (vaddr) {
>>>> - *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>>>> - }
>>>> -
>>>> - if (ram_addr) {
>>>> - *ram_addr = memory_region_get_ram_addr(mr) + xlat;
>>>> - }
>>>> -
>>>> - if (read_only) {
>>>> - *read_only = !writable || mr->readonly;
>>>> - }
>>>> -
>>>> + *xlat_p = xlat;
>>>> + *mr_p = mr;
>>>
>>> I suppose current use on the callers are still under RCU so looks ok, but
>>> that'll need to be rich-documented.
>>
>> I can do that, or ...
>>
>>> Better way is always taking a MR reference when the MR pointer is returned,
>>> with memory_region_ref(). Then it is even valid if by accident accessed
>>> after rcu_read_unlock(), and caller should unref() after use.
>>
>> I can do that, but it would add cycles. Is this considered a high performance
>> path that may be called frequently?
>
> AFAICT, any vIOMMU mapping isn't high perf path. In this specific path,
> the refcount op should be buried in any dma map operations..
memory_region_ref contains a comment that implies we should avoid taking a
ref if possible:
* Memory regions without an owner are supposed to never go away;
* we do not ref/unref them because it slows down DMA sensibly.
- Steve
> Personally I slightly prefer this one because it's always safer to take a
> refcount along with a pointer.. easier to follow.
* Re: [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-24 16:50 ` Steven Sistare
@ 2025-02-24 19:20 ` Peter Xu
2025-02-24 19:35 ` Steven Sistare
0 siblings, 1 reply; 72+ messages in thread
From: Peter Xu @ 2025-02-24 19:20 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Mon, Feb 24, 2025 at 11:50:50AM -0500, Steven Sistare wrote:
> > > I can do that, but it would add cycles. Is this considered a high performance
> > > path that may be called frequently?
> >
> > AFAICT, any vIOMMU mapping isn't high perf path. In this specific path,
> > the refcount op should be buried in any dma map operations..
>
> memory_region_ref contains a comment that implies we should avoid taking a
> ref if possible:
> * Memory regions without an owner are supposed to never go away;
> * we do not ref/unref them because it slows down DMA sensibly.
That's for internal / permanent MRs that don't have an owner, to speed
things up, and that should be orthogonal to this, as that'll automatically
take effect even if we use memory_region_[un]ref() here.
AFAIU we should always suggest using memory_region_[un]ref() in callers
when they hold MR references.
I'm also ok without the boosted refcount, but in that case please document
the RCU implications of referencing the MR: the MR reference must not be
used after rcu_read_unlock().
Thanks,
--
Peter Xu
* Re: [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr
2025-02-24 19:20 ` Peter Xu
@ 2025-02-24 19:35 ` Steven Sistare
0 siblings, 0 replies; 72+ messages in thread
From: Steven Sistare @ 2025-02-24 19:35 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/24/2025 2:20 PM, Peter Xu wrote:
> On Mon, Feb 24, 2025 at 11:50:50AM -0500, Steven Sistare wrote:
>>>> I can do that, but it would add cycles. Is this considered a high performance
>>>> path that may be called frequently?
>>>
>>> AFAICT, any vIOMMU mapping isn't high perf path. In this specific path,
>>> the refcount op should be buried in any dma map operations..
>>
>> memory_region_ref contains a comment that implies we should avoid taking a
>> ref if possible:
>> * Memory regions without an owner are supposed to never go away;
>> * we do not ref/unref them because it slows down DMA sensibly.
>
> That's for internal / permanent MRs that don't have an owner to speed
> things up, and that should be orthogonal to this, as that'll automatically
> take effect even if we use memory_region_[un]ref() here.
Yes, I understand that is a different type of mr, but my point is that the
author claims that memory_region_[un]ref() can noticeably slow DMA.
> AFAIU we should always suggest using memory_region_[un]ref() in callers
> when there's MR references.
>
> I'm also ok without the boosted refcount, but then please document the RCU
> implications on referencing the MR, and the MR reference must not be used
> after rcu_read_unlock(), in that case.
I'll do that, to avoid any possibility of performance regression.
- Steve
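For readers following the trade-off discussed above, here is a minimal
sketch of the "take a reference along with the pointer" option, including
the owner-less fast path quoted from the memory_region_ref comment. The
types and functions (ToyOwner, ToyRegion, toy_region_ref, toy_region_unref,
toy_lookup_ref) are hypothetical stand-ins, not QEMU's MemoryRegion API, and
the RCU calls are only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT QEMU's Object or MemoryRegion types. */
typedef struct ToyOwner {
    int refcount;
} ToyOwner;

typedef struct ToyRegion {
    ToyOwner *owner;    /* NULL for permanent regions that never go away */
} ToyRegion;

/* Take a reference only when the region has an owner. */
static void toy_region_ref(ToyRegion *mr)
{
    if (mr && mr->owner) {
        mr->owner->refcount++;
    }
}

/* Drop a reference taken by toy_region_ref(); no-op for owner-less regions. */
static void toy_region_unref(ToyRegion *mr)
{
    if (mr && mr->owner) {
        assert(mr->owner->refcount > 0);
        mr->owner->refcount--;
    }
}

/*
 * Return the region with a reference held, so the pointer stays valid
 * after the (imagined) RCU read-side critical section ends.
 * The caller must call toy_region_unref() when done.
 */
static ToyRegion *toy_lookup_ref(ToyRegion *candidate)
{
    /* imagine rcu_read_lock() here */
    toy_region_ref(candidate);
    /* imagine rcu_read_unlock() here */
    return candidate;
}
```

In this sketch an owner-less region costs nothing to "reference", while an
owned region pays one increment and one decrement per lookup. The
alternative chosen at the end of the thread (no reference, pointer valid
only until rcu_read_unlock()) avoids even that cost, at the price of a
documented usage rule.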
* [PATCH V2 29/45] vfio: pass ramblock to vfio_container_dma_map
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (27 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 28/45] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 30/45] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
` (16 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Pass ramblock to vfio_container_dma_map for use in a subsequent patch.
The ramblock's attributes will be needed to map the block using
IOMMU_IOAS_MAP_FILE. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/common.c | 8 +++++---
hw/vfio/container-base.c | 3 ++-
include/hw/vfio/vfio-container-base.h | 3 ++-
3 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3b0c520..350d5aa 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -317,7 +317,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
*/
ret = vfio_container_dma_map(bcontainer, iova,
iotlb->addr_mask + 1, vaddr,
- read_only);
+ read_only, mr->ram_block);
if (ret) {
error_report("vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
"0x%"HWADDR_PRIx", %p) = %d (%s)",
@@ -382,7 +382,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
vaddr = memory_region_get_ram_ptr(section->mr) + start;
ret = vfio_container_dma_map(bcontainer, iova, next - start,
- vaddr, section->readonly);
+ vaddr, section->readonly,
+ section->mr->ram_block);
if (ret) {
/* Rollback */
vfio_ram_discard_notify_discard(rdl, section);
@@ -716,7 +717,8 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
}
ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),
- vaddr, section->readonly);
+ vaddr, section->readonly,
+ section->mr->ram_block);
if (ret) {
error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
"0x%"HWADDR_PRIx", %p) = %d (%s)",
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 749a3fd..302cd4c 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -17,7 +17,8 @@
int vfio_container_dma_map(VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
- void *vaddr, bool readonly)
+ void *vaddr, bool readonly,
+ RAMBlock *rb)
{
VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 4cff994..d82e256 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -73,7 +73,8 @@ typedef struct VFIORamDiscardListener {
int vfio_container_dma_map(VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
- void *vaddr, bool readonly);
+ void *vaddr, bool readonly,
+ RAMBlock *rb);
int vfio_container_dma_unmap(VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
IOMMUTLBEntry *iotlb);
--
1.8.3.1
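As a rough illustration of why the map entry point might want the backing
block, the sketch below dispatches to a file-based mapping callback when a
file descriptor is available and the backend provides one, and otherwise
falls back to the address-based callback. ToyIommuOps, toy_container_dma_map
and the two toy backends are hypothetical names invented for this sketch;
they only mirror the shape of the real code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified backend ops; not QEMU's VFIOIOMMUClass. */
typedef struct ToyIommuOps {
    int (*dma_map)(uint64_t iova, uint64_t size, void *vaddr);
    /* Optional: NULL when the backend cannot map by file descriptor. */
    int (*dma_map_file)(uint64_t iova, uint64_t size, int fd, uint64_t start);
} ToyIommuOps;

/* Prefer the file-based callback; always return the callback's result. */
static int toy_container_dma_map(const ToyIommuOps *ops, uint64_t iova,
                                 uint64_t size, void *vaddr,
                                 int fd, uint64_t file_start)
{
    if (fd >= 0 && ops->dma_map_file) {
        return ops->dma_map_file(iova, size, fd, file_start);
    }
    assert(ops->dma_map);
    return ops->dma_map(iova, size, vaddr);
}

/* Two trivial backends used only to exercise the dispatcher. */
static int toy_map_ok(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova;
    (void)size;
    (void)vaddr;
    return 0;
}

static int toy_map_file_efault(uint64_t iova, uint64_t size, int fd,
                               uint64_t start)
{
    (void)iova;
    (void)size;
    (void)fd;
    (void)start;
    return -EFAULT;
}
```

The sketch returns whatever the selected callback returns, so a failed file
mapping is visible to the caller in the same way as a failed address-based
mapping.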
* [PATCH V2 30/45] backends/iommufd: iommufd_backend_map_file_dma
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (28 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 29/45] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 31/45] backends/iommufd: change process ioctl Steve Sistare
` (15 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
This will be called as a substitute for iommufd_backend_map_dma, so
the error conditions for BARs are copied as-is from that function.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
backends/iommufd.c | 36 ++++++++++++++++++++++++++++++++++++
backends/trace-events | 1 +
include/system/iommufd.h | 3 +++
3 files changed, 40 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index d57da44..612de78 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -172,6 +172,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
return ret;
}
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+ hwaddr iova, ram_addr_t size,
+ int mfd, unsigned long start, bool readonly)
+{
+ int ret, fd = be->fd;
+ struct iommu_ioas_map_file map = {
+ .size = sizeof(map),
+ .flags = IOMMU_IOAS_MAP_READABLE |
+ IOMMU_IOAS_MAP_FIXED_IOVA,
+ .ioas_id = ioas_id,
+ .fd = mfd,
+ .start = start,
+ .iova = iova,
+ .length = size,
+ };
+
+ if (!readonly) {
+ map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+ }
+
+ ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
+ trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
+ readonly, ret);
+ if (ret) {
+ ret = -errno;
+
+ /* TODO: Not support mapping hardware PCI BAR region for now. */
+ if (errno == EFAULT) {
+ warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
+ } else {
+ error_report("IOMMU_IOAS_MAP_FILE failed: %m");
+ }
+ }
+ return ret;
+}
+
int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
hwaddr iova, ram_addr_t size)
{
diff --git a/backends/trace-events b/backends/trace-events
index 40811a3..f478e18 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d user
iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
+iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%lu readonly=%d (%d)"
iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index cbab75b..ac700b8 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be);
bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
Error **errp);
void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+ hwaddr iova, ram_addr_t size, int fd,
+ unsigned long start, bool readonly);
int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
ram_addr_t size, void *vaddr, bool readonly);
int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
--
1.8.3.1
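Two small pieces of the function above can be sketched in isolation: how
the flag word depends on readonly, and how a failed ioctl is turned into a
negative errno value. The SKETCH_* constants are stand-ins defined here for
illustration (real code takes the flags from <linux/iommufd.h>), and
sketch_map_flags and sketch_ioctl_errno are hypothetical helper names. The
helper reads errno immediately after the failing call, a common defensive
habit because intervening library calls are allowed to overwrite it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Stand-in flag bits for illustration only. */
enum {
    SKETCH_MAP_FIXED_IOVA = 1 << 0,
    SKETCH_MAP_WRITEABLE  = 1 << 1,
    SKETCH_MAP_READABLE   = 1 << 2,
};

/* Mappings are always readable at a fixed IOVA; writable unless readonly. */
static uint32_t sketch_map_flags(bool readonly)
{
    uint32_t flags = SKETCH_MAP_READABLE | SKETCH_MAP_FIXED_IOVA;

    if (!readonly) {
        flags |= SKETCH_MAP_WRITEABLE;
    }
    return flags;
}

/* Issue an ioctl; return 0 on success or -errno on failure. */
static int sketch_ioctl_errno(int fd, unsigned long request, void *arg)
{
    if (ioctl(fd, request, arg) < 0) {
        return -errno;    /* read errno before any other call can change it */
    }
    return 0;
}
```

With an invalid descriptor the helper reports -EBADF, which is the same
0-or-negative-errno convention that iommufd_backend_map_file_dma exposes to
its callers.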
* [PATCH V2 31/45] backends/iommufd: change process ioctl
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (29 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 30/45] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 32/45] physmem: qemu_ram_get_fd_offset Steve Sistare
` (14 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define iommufd_change_process_capable and iommufd_change_process, which
wrap the IOMMU_IOAS_CHANGE_PROCESS ioctl.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
backends/iommufd.c | 20 ++++++++++++++++++++
backends/trace-events | 1 +
include/system/iommufd.h | 2 ++
3 files changed, 23 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 612de78..cc3dcff 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, void *data)
object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
}
+bool iommufd_change_process_capable(IOMMUFDBackend *be)
+{
+ struct iommu_ioas_change_process args = {.size = sizeof(args)};
+
+ return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+}
+
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
+{
+ struct iommu_ioas_change_process args = {.size = sizeof(args)};
+ bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+
+ if (!ret) {
+ error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d failed",
+ be->fd);
+ }
+ trace_iommufd_change_process(be->fd, ret);
+ return ret;
+}
+
bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
{
int fd;
diff --git a/backends/trace-events b/backends/trace-events
index f478e18..5ccdf90 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
dbus_vmstate_saving(const char *id) "id: %s"
# iommufd.c
+iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index ac700b8..db9ed53 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -64,6 +64,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
uint64_t iova, ram_addr_t size,
uint64_t page_size, uint64_t *data,
Error **errp);
+bool iommufd_change_process_capable(IOMMUFDBackend *be);
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
#define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
#endif
--
1.8.3.1
* [PATCH V2 32/45] physmem: qemu_ram_get_fd_offset
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (30 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 31/45] backends/iommufd: change process ioctl Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:39 ` Steven Sistare
2025-02-14 16:49 ` Peter Xu
2025-02-14 14:14 ` [PATCH V2 33/45] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
` (13 subsequent siblings)
45 siblings, 2 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define qemu_ram_get_fd_offset, so CPR can map a memory region using
IOMMU_IOAS_MAP_FILE in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/exec/cpu-common.h | 1 +
system/physmem.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index b1d76d6..0cab252 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -95,6 +95,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
const char *qemu_ram_get_idstr(RAMBlock *rb);
void *qemu_ram_get_host_addr(RAMBlock *rb);
ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
bool qemu_ram_is_shared(RAMBlock *rb);
diff --git a/system/physmem.c b/system/physmem.c
index 0bcfc6c..c41a80b 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1569,6 +1569,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
return rb->offset;
}
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
+{
+ return rb->fd_offset;
+}
+
ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
{
return rb->used_length;
--
1.8.3.1
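The offset accessor matters because a RAM block need not begin at offset 0
of its backing file. A minimal sketch of the resulting arithmetic, assuming
a simplified block description (ToyRamBlock and toy_file_offset are
hypothetical names, not QEMU's RAMBlock API): the file offset of a host
address is its distance from the start of the block plus the block's own
offset within the file.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a file-backed RAM block. */
typedef struct ToyRamBlock {
    uint8_t *host;         /* start of the block's mapping in this process */
    int fd;                /* backing file descriptor, or -1 if anonymous */
    uint64_t fd_offset;    /* offset of the block within the backing file */
} ToyRamBlock;

/* Translate a host address inside the block to an offset in its file. */
static uint64_t toy_file_offset(const ToyRamBlock *rb, const void *vaddr)
{
    uint64_t start = (uint64_t)((const uint8_t *)vaddr - rb->host);

    return start + rb->fd_offset;
}
```

For example, with fd_offset 0x1000, an address 16 bytes into the block
corresponds to file offset 0x1010. The result depends only on the file
layout, not on where the block happens to be mapped in the process.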
* Re: [PATCH V2 32/45] physmem: qemu_ram_get_fd_offset
2025-02-14 14:14 ` [PATCH V2 32/45] physmem: qemu_ram_get_fd_offset Steve Sistare
@ 2025-02-14 14:39 ` Steven Sistare
2025-02-14 16:49 ` Peter Xu
1 sibling, 0 replies; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 14:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, David Hildenbrand, Philippe Mathieu-Daude
cc memory reviewers.
The series is here:
https://lore.kernel.org/qemu-devel/1739542467-226739-1-git-send-email-steven.sistare@oracle.com/
- Steve
On 2/14/2025 9:14 AM, Steve Sistare wrote:
> Define qemu_ram_get_fd_offset, so CPR can map a memory region using
> IOMMU_IOAS_MAP_FILE in a subsequent patch.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/exec/cpu-common.h | 1 +
> system/physmem.c | 5 +++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index b1d76d6..0cab252 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -95,6 +95,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
> const char *qemu_ram_get_idstr(RAMBlock *rb);
> void *qemu_ram_get_host_addr(RAMBlock *rb);
> ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
> +ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
> ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
> ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
> bool qemu_ram_is_shared(RAMBlock *rb);
> diff --git a/system/physmem.c b/system/physmem.c
> index 0bcfc6c..c41a80b 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1569,6 +1569,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
> return rb->offset;
> }
>
> +ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
> +{
> + return rb->fd_offset;
> +}
> +
> ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
> {
> return rb->used_length;
* Re: [PATCH V2 32/45] physmem: qemu_ram_get_fd_offset
2025-02-14 14:14 ` [PATCH V2 32/45] physmem: qemu_ram_get_fd_offset Steve Sistare
2025-02-14 14:39 ` Steven Sistare
@ 2025-02-14 16:49 ` Peter Xu
1 sibling, 0 replies; 72+ messages in thread
From: Peter Xu @ 2025-02-14 16:49 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 06:14:14AM -0800, Steve Sistare wrote:
> Define qemu_ram_get_fd_offset, so CPR can map a memory region using
> IOMMU_IOAS_MAP_FILE in a subsequent patch.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* [PATCH V2 33/45] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (31 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 32/45] physmem: qemu_ram_get_fd_offset Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 34/45] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
` (12 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
Such a mapping can be preserved without modification during CPR,
because it depends on the file's address space, which does not change,
rather than on the process's address space, which does change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/container-base.c | 9 +++++++++
hw/vfio/iommufd.c | 13 +++++++++++++
include/hw/vfio/vfio-container-base.h | 3 +++
3 files changed, 25 insertions(+)
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 302cd4c..fbaf04a 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -21,7 +21,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
RAMBlock *rb)
{
VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+ int mfd = rb ? qemu_ram_get_fd(rb) : -1;
+ if (mfd >= 0 && vioc->dma_map_file) {
+ unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
+ unsigned long offset = qemu_ram_get_fd_offset(rb);
+
+ return vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
+ readonly);
+ }
g_assert(vioc->dma_map);
return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index df61edf..ea40da5 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -38,6 +38,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
iova, size, vaddr, readonly);
}
+static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ int fd, unsigned long start, bool readonly)
+{
+ const VFIOIOMMUFDContainer *container =
+ container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+
+ return iommufd_backend_map_file_dma(container->be,
+ container->ioas_id,
+ iova, size, fd, start, readonly);
+}
+
static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
IOMMUTLBEntry *iotlb)
@@ -794,6 +806,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
vioc->hiod_typename = TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO;
vioc->dma_map = iommufd_cdev_map;
+ vioc->dma_map_file = iommufd_cdev_map_file;
vioc->dma_unmap = iommufd_cdev_unmap;
vioc->attach_device = iommufd_cdev_attach;
vioc->detach_device = iommufd_cdev_detach;
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index d82e256..4daa5f8 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -115,6 +115,9 @@ struct VFIOIOMMUClass {
int (*dma_map)(const VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
void *vaddr, bool readonly);
+ int (*dma_map_file)(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ int fd, unsigned long start, bool readonly);
int (*dma_unmap)(const VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
IOMMUTLBEntry *iotlb);
--
1.8.3.1
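The commit message's argument, that a file range identifies the same memory
regardless of which virtual address maps it, can be demonstrated in user
space. The sketch below is an assumption-laden illustration, not part of
the series: it uses a temporary file from tmpfile() in place of guest RAM,
maps the same range twice within one process, and checks that data stored
through the first mapping is visible through the second. The function name
same_file_range_is_shared is made up for this example.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same file range at two different virtual addresses and check
 * that a store through one mapping is visible through the other.
 * Returns 1 if visible, 0 if not, -1 if the setup failed.
 */
static int same_file_range_is_shared(void)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    int result = -1;

    if (!f) {
        return -1;
    }
    if (page > 0 && ftruncate(fileno(f), page) == 0) {
        size_t len = (size_t)page;
        char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fileno(f), 0);
        char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fileno(f), 0);

        if (a != MAP_FAILED && b != MAP_FAILED) {
            strcpy(a, "guest data");                  /* store via mapping a */
            result = strcmp(b, "guest data") == 0;    /* load via mapping b */
        }
        if (a != MAP_FAILED) {
            munmap(a, len);
        }
        if (b != MAP_FAILED) {
            munmap(b, len);
        }
    }
    fclose(f);
    return result;
}
```

The two mappings stand in for old and new QEMU: the virtual addresses
differ, but descriptor plus offset name the same pages, which is the
property a file-based IOMMU mapping relies on.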
* [PATCH V2 34/45] vfio/iommufd: export iommufd_cdev_get_info_iova_range
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (32 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 33/45] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 35/45] vfio/iommufd: define hwpt constructors Steve Sistare
` (11 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export iommufd_cdev_get_info_iova_range, for use by CPR in a subsequent
patch to reconstruct the userland device state. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/iommufd.c | 4 ++--
include/hw/vfio/vfio-common.h | 3 +++
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index ea40da5..a6e24a7 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -429,8 +429,8 @@ static int iommufd_cdev_ram_block_discard_disable(bool state)
return ram_block_uncoordinated_discard_disable(state);
}
-static bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
- uint32_t ioas_id, Error **errp)
+bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
+ uint32_t ioas_id, Error **errp)
{
VFIOContainerBase *bcontainer = &container->bcontainer;
g_autofree struct iommu_ioas_iova_ranges *info = NULL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 78e4f12..9ca40d0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -264,6 +264,9 @@ void vfio_kvm_device_close(void);
bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
+bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
+ uint32_t ioas_id, Error **errp);
+
extern const MemoryRegionOps vfio_region_ops;
typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList;
--
1.8.3.1
* [PATCH V2 35/45] vfio/iommufd: define hwpt constructors
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (33 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 34/45] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 36/45] vfio/iommufd: invariant device name Steve Sistare
` (10 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Extract hwpt creation code from iommufd_cdev_autodomains_get into the
helpers iommufd_cdev_set_hwpt and iommufd_cdev_make_hwpt. These will
be used by CPR in a subsequent patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/iommufd.c | 52 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 34 insertions(+), 18 deletions(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index a6e24a7..7c0cdd7 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -287,6 +287,34 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
return true;
}
+static void iommufd_cdev_set_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
+{
+ vbasedev->hwpt = hwpt;
+ vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
+ QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
+}
+
+static VFIOIOASHwpt *iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
+ VFIOIOMMUFDContainer *container,
+ uint32_t hwpt_id)
+{
+ VFIOIOASHwpt *hwpt = g_malloc0(sizeof(*hwpt));
+ uint32_t flags = 0;
+
+ if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
+ flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
+ }
+
+ hwpt->hwpt_id = hwpt_id;
+ hwpt->hwpt_flags = flags;
+ QLIST_INIT(&hwpt->device_list);
+
+ QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
+ container->bcontainer.dirty_pages_supported |=
+ vbasedev->iommu_dirty_tracking;
+ return hwpt;
+}
+
static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
VFIOIOMMUFDContainer *container,
Error **errp)
@@ -316,13 +344,10 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
return false;
} else {
- vbasedev->hwpt = hwpt;
- QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
- vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
return true;
}
}
-
/*
* This is quite early and VFIO Migration state isn't yet fully
* initialized, thus rely only on IOMMU hardware capabilities as to
@@ -341,24 +366,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
return false;
}
- hwpt = g_malloc0(sizeof(*hwpt));
- hwpt->hwpt_id = hwpt_id;
- hwpt->hwpt_flags = flags;
- QLIST_INIT(&hwpt->device_list);
-
- ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+ ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
if (ret) {
- iommufd_backend_free_id(container->be, hwpt->hwpt_id);
- g_free(hwpt);
+ iommufd_backend_free_id(container->be, hwpt_id);
return false;
}
- vbasedev->hwpt = hwpt;
- vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
- QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
- QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
- container->bcontainer.dirty_pages_supported |=
- vbasedev->iommu_dirty_tracking;
+ hwpt = iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id);
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
+
if (container->bcontainer.dirty_pages_supported &&
!vbasedev->iommu_dirty_tracking) {
warn_report("IOMMU instance for device %s doesn't support dirty tracking",
--
1.8.3.1
* [PATCH V2 36/45] vfio/iommufd: invariant device name
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (34 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 35/45] vfio/iommufd: define hwpt constructors Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 37/45] vfio/iommufd: fix cpr register Steve Sistare
` (9 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
cpr-transfer will use the device name as a key to find the value
of the device descriptor in new QEMU. However, if the descriptor
number is specified by a command-line fd parameter, then
vfio_device_get_name creates a name that includes the fd number.
This causes a chicken-and-egg problem: new QEMU must know the fd
number to construct a name to find the fd number.
To fix, create an invariant name based on the id command-line
parameter. If id is not defined, add a CPR blocker.
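The lookup problem can be illustrated with a minimal sketch. Everything here is hypothetical and only stands in for CPR state: `fd_table`, `table_save`, `table_find` and `make_name` are not QEMU functions. It assumes a simple name-to-fd table, and shows that a name built from the id can be used as a key, while the fd-based fallback name cannot.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 8
#define NAME_LEN 32

/* Hypothetical name -> fd table standing in for CPR state. */
struct fd_entry {
    char name[NAME_LEN];
    int fd;
};

struct fd_table {
    struct fd_entry entries[MAX_ENTRIES];
    int count;
};

/* Record fd under name; returns 0 on success, -1 if the table is full. */
static int table_save(struct fd_table *t, const char *name, int fd)
{
    if (t->count == MAX_ENTRIES) {
        return -1;
    }
    snprintf(t->entries[t->count].name, NAME_LEN, "%s", name);
    t->entries[t->count].fd = fd;
    t->count++;
    return 0;
}

/* Return the fd saved under name, or -1 if there is none. */
static int table_find(const struct fd_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->entries[i].name, name) == 0) {
            return t->entries[i].fd;
        }
    }
    return -1;
}

/*
 * Build a device name. Returns 1 if the name is invariant (built from id),
 * 0 if it is the fd-based fallback, which changes across processes.
 */
static int make_name(char *buf, size_t len, const char *id, int fd)
{
    assert(buf != NULL && len > 0);
    if (id != NULL) {
        snprintf(buf, len, "%s", id);
        return 1;
    }
    snprintf(buf, len, "VFIO_FD%d", fd);
    return 0;
}
```

With an id, the key is known before the fd is. Without one, the key can only be built once the fd number is already in hand, which is the case the patch covers with a blocker.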
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr.c | 21 +++++++++++++++++++++
hw/vfio/helpers.c | 10 ++++------
hw/vfio/iommufd.c | 2 ++
include/hw/vfio/vfio-cpr.h | 4 ++++
4 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index a2400ca..e3ea2bf 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -11,6 +11,7 @@
#include "hw/vfio/pci.h"
#include "hw/pci/msix.h"
#include "hw/pci/msi.h"
+#include "migration/blocker.h"
#include "migration/cpr.h"
#include "qapi/error.h"
#include "system/runstate.h"
@@ -184,3 +185,23 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
VMSTATE_END_OF_LIST()
}
};
+
+bool vfio_cpr_set_device_name(VFIODevice *vbasedev, Error **errp)
+{
+ if (vbasedev->dev->id) {
+ vbasedev->name = g_strdup(vbasedev->dev->id);
+ return true;
+ } else {
+ /*
+ * Assign a name so any function printing it will not break, but the
+ * fd number changes across processes, so this cannot be used as an
+ * invariant name for CPR.
+ */
+ vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+ error_setg(&vbasedev->cpr.id_blocker,
+ "vfio device with fd=%d needs an id property",
+ vbasedev->fd);
+ return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
+}
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index 4b255d4..4ff794c 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -29,6 +29,7 @@
#include "qapi/error.h"
#include "qemu/error-report.h"
#include "qemu/units.h"
+#include "migration/cpr.h"
#include "monitor/monitor.h"
/*
@@ -637,6 +638,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
{
ERRP_GUARD();
struct stat st;
+ bool ret = true;
if (vbasedev->fd < 0) {
if (stat(vbasedev->sysfsdev, &st) < 0) {
@@ -653,16 +655,12 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
error_setg(errp, "Use FD passing only with iommufd backend");
return false;
}
- /*
- * Give a name with fd so any function printing out vbasedev->name
- * will not break.
- */
if (!vbasedev->name) {
- vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+ ret = vfio_cpr_set_device_name(vbasedev, errp);
}
}
- return true;
+ return ret;
}
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 7c0cdd7..2de2811 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -24,6 +24,7 @@
#include "system/reset.h"
#include "qemu/cutils.h"
#include "qemu/chardev_open.h"
+#include "migration/blocker.h"
#include "pci.h"
#include "exec/ram_addr.h"
@@ -661,6 +662,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
iommufd_cdev_container_destroy(container);
vfio_put_address_space(space);
+ migrate_del_blocker(&vbasedev->cpr.id_blocker);
iommufd_cdev_unbind_and_disconnect(vbasedev);
close(vbasedev->fd);
}
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index a9f2fbe..8a30d30 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -22,12 +22,14 @@ typedef struct VFIOContainerCPR {
typedef struct VFIODeviceCPR {
bool reused;
Error *mdev_blocker;
+ Error *id_blocker;
} VFIODeviceCPR;
struct VFIOContainer;
struct VFIOGroup;
struct VFIOContainerBase;
struct VFIOPCIDevice;
+struct VFIODevice;
int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
Error **errp);
@@ -53,4 +55,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
int nr);
extern const VMStateDescription vfio_cpr_pci_vmstate;
+
+bool vfio_cpr_set_device_name(struct VFIODevice *vbasedev, Error **errp);
#endif
--
1.8.3.1
* [PATCH V2 37/45] vfio/iommufd: fix cpr register
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (35 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 36/45] vfio/iommufd: invariant device name Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 38/45] vfio/iommufd: register container for cpr Steve Sistare
` (8 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
iommufd_cdev_attach should not call vfio_cpr_register_container if an
existing container is found. Fix that by registering earlier in the
code flow, which requires an additional call to unregister during error
recovery. Note it is safe to call unregister even if register has not
been called.
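The intended control flow can be sketched roughly as follows. This is a simplified model, not the QEMU code: `struct container`, `cpr_register`, `cpr_unregister` and `attach` are hypothetical names, and the flags only simulate the "existing container found" and "late failure" cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a container; not QEMU's VFIOContainerBase. */
struct container {
    bool registered;     /* true while a registration is active */
    int register_calls;  /* how many times register actually ran */
};

static bool cpr_register(struct container *c)
{
    /* Registering an already registered container is the bug being fixed. */
    assert(!c->registered);
    c->registered = true;
    c->register_calls++;
    return true;
}

/* Safe to call even if cpr_register was never called. */
static void cpr_unregister(struct container *c)
{
    c->registered = false;
}

/*
 * found_existing: an initialized container was found, so setup is skipped.
 * fail_late: simulate a failure after the found_container label.
 */
static bool attach(struct container *c, bool found_existing, bool fail_late)
{
    if (found_existing) {
        goto found_container;
    }

    /* New container: register once, before joining the shared path. */
    if (!cpr_register(c)) {
        goto err;
    }

found_container:
    if (fail_late) {
        goto err;
    }
    return true;

err:
    /* As in the patch, the error path unregisters unconditionally. */
    cpr_unregister(c);
    return false;
}
```

Because registration happens before the `found_container` label, a second device that joins an existing container never reaches it.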
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/iommufd.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 2de2811..87c3bc2c 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -596,6 +596,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
bcontainer->initialized = true;
+ if (!vfio_cpr_register_container(bcontainer, errp)) {
+ goto err_listener_register;
+ }
+
found_container:
ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info);
if (ret) {
@@ -603,10 +607,6 @@ found_container:
goto err_listener_register;
}
- if (!vfio_cpr_register_container(bcontainer, errp)) {
- goto err_listener_register;
- }
-
/*
* TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
* for discarding incompatibility check as well?
@@ -629,6 +629,7 @@ found_container:
return true;
err_listener_register:
+ vfio_cpr_unregister_container(bcontainer);
iommufd_cdev_ram_block_discard_disable(false);
err_discard_disable:
iommufd_cdev_detach_container(vbasedev, container);
--
1.8.3.1
* [PATCH V2 38/45] vfio/iommufd: register container for cpr
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (36 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 37/45] vfio/iommufd: fix cpr register Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 39/45] vfio/iommufd: preserve descriptors Steve Sistare
` (7 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Register a vfio iommufd container and device for CPR, replacing the generic
CPR register call with a more specific iommufd register call. Add a
blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
This is mostly boilerplate. The fields to be saved and restored are added
in subsequent patches.
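The blocker approach can be sketched as below. This is an illustrative model under simplifying assumptions: the `mode` enum, `struct blockers`, `add_blocker`, `mode_allowed` and `register_container` are hypothetical, and unlike the real blocker call, adding a blocker here cannot fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical migration modes; the names only echo the ones in the patch. */
enum mode { MODE_NORMAL, MODE_CPR_REBOOT, MODE_CPR_TRANSFER, MODE_COUNT };

/* One optional blocking reason per mode; NULL means the mode is allowed. */
struct blockers {
    const char *reason[MODE_COUNT];
};

static void add_blocker(struct blockers *b, enum mode m, const char *reason)
{
    assert((int)m >= 0 && (int)m < MODE_COUNT);
    b->reason[m] = reason;
}

static bool mode_allowed(const struct blockers *b, enum mode m)
{
    assert((int)m >= 0 && (int)m < MODE_COUNT);
    return b->reason[m] == NULL;
}

/*
 * Registration succeeds either way. A missing kernel capability does not
 * fail the device; it only blocks the one mode that depends on it.
 */
static bool register_container(struct blockers *b, bool change_process_capable)
{
    if (!change_process_capable) {
        add_blocker(b, MODE_CPR_TRANSFER,
                    "container does not support IOMMU_IOAS_CHANGE_PROCESS");
    }
    return true;
}
```

The design choice is that the guest can still start and run on an older kernel; only the cpr-transfer request is refused later, with the recorded reason.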
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 97 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/iommufd.c | 8 ++--
hw/vfio/meson.build | 1 +
include/hw/vfio/vfio-common.h | 1 +
include/hw/vfio/vfio-cpr.h | 8 ++++
5 files changed, 112 insertions(+), 3 deletions(-)
create mode 100644 hw/vfio/cpr-iommufd.c
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
new file mode 100644
index 0000000..9fd0c82
--- /dev/null
+++ b/hw/vfio/cpr-iommufd.c
@@ -0,0 +1,97 @@
+/*
+ * Copyright (c) 2024-2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio-cpr.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "system/iommufd.h"
+
+static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
+{
+ if (!iommufd_change_process_capable(container->be)) {
+ error_setg(errp,
+ "VFIO container does not support IOMMU_IOAS_CHANGE_PROCESS");
+ return false;
+ }
+ return true;
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+ .name = "vfio-iommufd-container",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+static const VMStateDescription iommufd_cpr_vmstate = {
+ .name = "iommufd",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
+ Error **errp)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ Error **cpr_blocker = &container->cpr_blocker;
+
+ migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+ vfio_cpr_reboot_notifier,
+ MIG_MODE_CPR_REBOOT);
+
+ if (!vfio_cpr_supported(container, cpr_blocker)) {
+ return migrate_add_blocker_modes(cpr_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
+
+ vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+ vmstate_register(NULL, -1, &iommufd_cpr_vmstate, container->be);
+
+ return true;
+}
+
+void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ vmstate_unregister(NULL, &iommufd_cpr_vmstate, container->be);
+ vmstate_unregister(NULL, &vfio_container_vmstate, container);
+ migrate_del_blocker(&container->cpr_blocker);
+ migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+}
+
+static const VMStateDescription vfio_device_vmstate = {
+ .name = "vfio-iommufd-device",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
+{
+ vmstate_register(NULL, -1, &vfio_device_vmstate, vbasedev);
+}
+
+void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
+{
+ vmstate_unregister(NULL, &vfio_device_vmstate, vbasedev);
+}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 87c3bc2c..ddb3d23 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -596,7 +596,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
bcontainer->initialized = true;
- if (!vfio_cpr_register_container(bcontainer, errp)) {
+ if (!vfio_iommufd_cpr_register_container(container, errp)) {
goto err_listener_register;
}
@@ -623,13 +623,14 @@ found_container:
vbasedev->bcontainer = bcontainer;
QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
+ vfio_iommufd_cpr_register_device(vbasedev);
trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
vbasedev->num_regions, vbasedev->flags);
return true;
err_listener_register:
- vfio_cpr_unregister_container(bcontainer);
+ vfio_iommufd_cpr_unregister_container(container);
iommufd_cdev_ram_block_discard_disable(false);
err_discard_disable:
iommufd_cdev_detach_container(vbasedev, container);
@@ -658,12 +659,13 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
iommufd_cdev_ram_block_discard_disable(false);
}
- vfio_cpr_unregister_container(bcontainer);
+ vfio_iommufd_cpr_unregister_container(container);
iommufd_cdev_detach_container(vbasedev, container);
iommufd_cdev_container_destroy(container);
vfio_put_address_space(space);
migrate_del_blocker(&vbasedev->cpr.id_blocker);
+ vfio_iommufd_cpr_unregister_device(vbasedev);
iommufd_cdev_unbind_and_disconnect(vbasedev);
close(vbasedev->fd);
}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 5487815..998adb5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -13,6 +13,7 @@ vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
'cpr.c',
'cpr-legacy.c',
+ 'cpr-iommufd.c',
'display.c',
'pci-quirks.c',
'pci.c',
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9ca40d0..6701393 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -110,6 +110,7 @@ typedef struct VFIOIOASHwpt {
typedef struct VFIOIOMMUFDContainer {
VFIOContainerBase bcontainer;
IOMMUFDBackend *be;
+ Error *cpr_blocker;
uint32_t ioas_id;
QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
} VFIOIOMMUFDContainer;
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 8a30d30..fa4b928 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -30,6 +30,7 @@ struct VFIOGroup;
struct VFIOContainerBase;
struct VFIOPCIDevice;
struct VFIODevice;
+struct VFIOIOMMUFDContainer;
int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
Error **errp);
@@ -38,6 +39,13 @@ bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
Error **errp);
void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
+bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
+ Error **errp);
+void vfio_iommufd_cpr_unregister_container(
+ struct VFIOIOMMUFDContainer *container);
+void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
+void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
+
bool vfio_cpr_container_match(struct VFIOContainer *container,
struct VFIOGroup *group, int *fd);
--
1.8.3.1
* [PATCH V2 39/45] vfio/iommufd: preserve descriptors
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (37 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 38/45] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 40/45] vfio/iommufd: reconstruct device Steve Sistare
` (6 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Save the iommu and vfio device fds in CPR state when they are created.
After CPR, each fd number is found in CPR state and reused. Remember
the reused status for subsequent patches. The reused status is cleared
when vmstate load finishes.
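The find-or-open pattern can be sketched as follows. This is a hedged illustration, not the cpr helper API: `struct cpr_state`, `state_find_fd`, `state_save_fd`, `open_or_reuse` and `fake_open` are hypothetical, and the opener callback stands in for the real open call.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SAVED 8

/* Hypothetical saved-descriptor table; names must outlive the table. */
struct saved_fd {
    const char *name;
    int fd;
};

struct cpr_state {
    struct saved_fd saved[MAX_SAVED];
    int count;
};

static int open_calls;  /* counts real opens, for illustration only */

/* Hypothetical opener standing in for opening the device node. */
static int fake_open(void)
{
    open_calls++;
    return 42;
}

static int state_find_fd(const struct cpr_state *s, const char *name)
{
    for (int i = 0; i < s->count; i++) {
        if (strcmp(s->saved[i].name, name) == 0) {
            return s->saved[i].fd;
        }
    }
    return -1;
}

static void state_save_fd(struct cpr_state *s, const char *name, int fd)
{
    assert(s->count < MAX_SAVED);
    s->saved[s->count].name = name;
    s->saved[s->count].fd = fd;
    s->count++;
}

/*
 * Reuse the fd saved under name if there is one; otherwise open a new one
 * and save it. *reused tells the caller which path was taken.
 */
static int open_or_reuse(struct cpr_state *s, const char *name,
                         int (*opener)(void), bool *reused)
{
    int fd = state_find_fd(s, name);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = opener();
        if (fd >= 0) {
            state_save_fd(s, name, fd);
        }
    }
    return fd;
}
```

The first call models old QEMU, which opens and saves the descriptor. The second models new QEMU, which finds the saved entry and skips the open.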
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
backends/iommufd.c | 19 ++++++++++---------
hw/vfio/cpr-iommufd.c | 15 +++++++++++++++
hw/vfio/helpers.c | 10 ++--------
hw/vfio/iommufd.c | 10 +++++++++-
include/system/iommufd.h | 1 +
5 files changed, 37 insertions(+), 18 deletions(-)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index cc3dcff..da90b21 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -16,12 +16,18 @@
#include "qemu/module.h"
#include "qom/object_interfaces.h"
#include "qemu/error-report.h"
+#include "migration/cpr.h"
#include "monitor/monitor.h"
#include "trace.h"
#include "hw/vfio/vfio-common.h"
#include <sys/ioctl.h>
#include <linux/iommufd.h>
+static const char *iommufd_fd_name(IOMMUFDBackend *be)
+{
+ return object_get_canonical_path_component(OBJECT(be));
+}
+
static void iommufd_backend_init(Object *obj)
{
IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
@@ -47,9 +53,8 @@ static void iommufd_backend_set_fd(Object *obj, const char *str, Error **errp)
IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
int fd = -1;
- fd = monitor_fd_param(monitor_cur(), str, errp);
+ fd = cpr_get_fd_param(iommufd_fd_name(be), str, 0, &be->cpr_reused, errp);
if (fd == -1) {
- error_prepend(errp, "Could not parse remote object fd %s:", str);
return;
}
be->fd = fd;
@@ -95,14 +100,9 @@ bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
{
- int fd;
-
if (be->owned && !be->users) {
- fd = qemu_open("/dev/iommu", O_RDWR, errp);
- if (fd < 0) {
- return false;
- }
- be->fd = fd;
+ be->fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0,
+ &be->cpr_reused, errp);
}
be->users++;
@@ -121,6 +121,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
be->fd = -1;
}
out:
+ cpr_delete_fd(iommufd_fd_name(be), 0);
trace_iommufd_backend_disconnect(be->fd, be->users);
}
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 9fd0c82..8c2fa3a 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -25,10 +25,25 @@ static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
return true;
}
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+ VFIOIOMMUFDContainer *container = opaque;
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ VFIODevice *vbasedev;
+
+ QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
+ vbasedev->cpr.reused = false;
+ }
+ container->be->cpr_reused = false;
+
+ return 0;
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-iommufd-container",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index 4ff794c..679d33b 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -665,14 +665,8 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
{
- ERRP_GUARD();
- int fd = monitor_fd_param(monitor_cur(), str, errp);
-
- if (fd < 0) {
- error_prepend(errp, "Could not parse remote object fd %s:", str);
- return;
- }
- vbasedev->fd = fd;
+ vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
+ &vbasedev->cpr.reused, errp);
}
void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index ddb3d23..6c44303 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -25,6 +25,7 @@
#include "qemu/cutils.h"
#include "qemu/chardev_open.h"
#include "migration/blocker.h"
+#include "migration/cpr.h"
#include "pci.h"
#include "exec/ram_addr.h"
@@ -502,13 +503,18 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
if (vbasedev->fd < 0) {
- devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
+ devfd = cpr_find_fd(vbasedev->name, 0);
+ vbasedev->cpr.reused = (devfd >= 0);
+ if (!vbasedev->cpr.reused) {
+ devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
+ }
if (devfd < 0) {
return false;
}
vbasedev->fd = devfd;
} else {
devfd = vbasedev->fd;
+ /* reused was set in vfio_device_set_fd */
}
if (!iommufd_cdev_connect_and_bind(vbasedev, errp)) {
@@ -624,6 +630,7 @@ found_container:
QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
vfio_iommufd_cpr_register_device(vbasedev);
+ cpr_resave_fd(vbasedev->name, 0, devfd);
trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
vbasedev->num_regions, vbasedev->flags);
@@ -666,6 +673,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
migrate_del_blocker(&vbasedev->cpr.id_blocker);
vfio_iommufd_cpr_unregister_device(vbasedev);
+ cpr_delete_fd(vbasedev->name, 0);
iommufd_cdev_unbind_and_disconnect(vbasedev);
close(vbasedev->fd);
}
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index db9ed53..5c17abd 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -32,6 +32,7 @@ struct IOMMUFDBackend {
/*< protected >*/
int fd; /* /dev/iommu file descriptor */
bool owned; /* is the /dev/iommu opened internally */
+ bool cpr_reused; /* fd is reused after CPR */
uint32_t users;
/*< public >*/
--
1.8.3.1
* [PATCH V2 40/45] vfio/iommufd: reconstruct device
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (38 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 39/45] vfio/iommufd: preserve descriptors Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 41/45] vfio/iommufd: reconstruct hw_caps Steve Sistare
` (5 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Reconstruct userland device state after CPR. During vfio_realize, skip
all ioctls that configure the device, as it was already configured in old
QEMU.
Save the ioas_id in vmstate, and skip its allocation in vfio_realize.
Because the ioctls are skipped, the ioas_id is not needed at realize time.
However, the range info is needed, so defer the call to
iommufd_cdev_get_info_iova_range to a post_load handler, at which time the
ioas_id is known.
This reconstruction is not complete. hwpt_id and devid need special
treatment, handled in subsequent patches.
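The split between realize time and post-load time can be sketched roughly like this. It is a simplified model under stated assumptions: `struct container`, `kernel_alloc_ioas`, `query_range`, `realize` and `post_load` are hypothetical stand-ins, and the counter only records whether a configuring call was made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOAS_ID_UNKNOWN UINT32_MAX

/* Hypothetical container model; not QEMU's VFIOIOMMUFDContainer. */
struct container {
    uint32_t ioas_id;
    bool have_range;   /* set once the range query has run */
    int kernel_calls;  /* counts configuring calls, for illustration */
};

/* Stand-in for the allocation ioctl; always hands out id 7 here. */
static uint32_t kernel_alloc_ioas(struct container *c)
{
    c->kernel_calls++;
    return 7;
}

/* The range query is only meaningful once the ioas_id is known. */
static void query_range(struct container *c)
{
    assert(c->ioas_id != IOAS_ID_UNKNOWN);
    c->have_range = true;
}

static void realize(struct container *c, bool reused)
{
    if (reused) {
        /* Already configured by the old process; the id arrives later. */
        c->ioas_id = IOAS_ID_UNKNOWN;
        return;
    }
    c->ioas_id = kernel_alloc_ioas(c);
    query_range(c);
}

/* Runs after saved state is loaded, so the id is finally available. */
static void post_load(struct container *c, uint32_t saved_ioas_id)
{
    c->ioas_id = saved_ioas_id;
    query_range(c);
}
```

In the reused case no configuring call is made at all. The only work deferred to post-load is the part that needs the restored id.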
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 8 ++++++++
hw/vfio/iommufd.c | 23 +++++++++++++++++++++--
2 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 8c2fa3a..8453d76 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -30,6 +30,13 @@ static int vfio_container_post_load(void *opaque, int version_id)
VFIOIOMMUFDContainer *container = opaque;
VFIOContainerBase *bcontainer = &container->bcontainer;
VFIODevice *vbasedev;
+ Error *err = NULL;
+ uint32_t ioas_id = container->ioas_id;
+
+ if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
+ error_report_err(err);
+ return -1;
+ }
QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
vbasedev->cpr.reused = false;
@@ -46,6 +53,7 @@ static const VMStateDescription vfio_container_vmstate = {
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
+ VMSTATE_UINT32(ioas_id, VFIOIOMMUFDContainer),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 6c44303..3fc530d 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -99,6 +99,10 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
goto err_kvm_device_add;
}
+ if (vbasedev->cpr.reused) {
+ goto skip_bind;
+ }
+
/* Bind device to iommufd */
bind.iommufd = iommufd->fd;
if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
@@ -110,6 +114,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
vbasedev->devid = bind.out_devid;
trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
vbasedev->fd, vbasedev->devid);
+
+skip_bind:
return true;
err_bind:
iommufd_cdev_kvm_device_del(vbasedev);
@@ -541,7 +547,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
vbasedev->iommufd != container->be) {
continue;
}
- if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
+ if (!vbasedev->cpr.reused &&
+ !iommufd_cdev_attach_container(vbasedev, container, &err)) {
const char *msg = error_get_pretty(err);
trace_iommufd_cdev_fail_attach_existing_container(msg);
@@ -558,6 +565,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
}
}
+ if (vbasedev->cpr.reused) {
+ ioas_id = -1; /* ioas_id will be received from vmstate */
+ goto skip_ioas_alloc;
+ }
+
/* Need to allocate a new dedicated container */
if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
goto err_alloc_ioas;
@@ -565,6 +577,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
+skip_ioas_alloc:
container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
container->be = vbasedev->iommufd;
container->ioas_id = ioas_id;
@@ -573,7 +586,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
bcontainer = &container->bcontainer;
vfio_address_space_insert(space, bcontainer);
- if (!iommufd_cdev_attach_container(vbasedev, container, errp)) {
+ if (!vbasedev->cpr.reused &&
+ !iommufd_cdev_attach_container(vbasedev, container, errp)) {
goto err_attach_container;
}
@@ -583,6 +597,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
goto err_discard_disable;
}
+ if (vbasedev->cpr.reused) {
+ goto skip_info;
+ }
+
if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
error_append_hint(&err,
"Fallback to default 64bit IOVA range and 4K page size\n");
@@ -591,6 +609,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
bcontainer->pgsizes = qemu_real_host_page_size();
}
+skip_info:
bcontainer->listener = vfio_memory_listener;
memory_listener_register(&bcontainer->listener, bcontainer->space->as);
--
1.8.3.1
* [PATCH V2 41/45] vfio/iommufd: reconstruct hw_caps
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (39 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 40/45] vfio/iommufd: reconstruct device Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 42/45] vfio/iommufd: reconstruct hwpt Steve Sistare
` (4 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
hw_caps is normally derived during realize, at vfio_device_hiod_realize ->
hiod_iommufd_vfio_realize -> iommufd_backend_get_device_info. However,
this depends on the devid, which is not preserved during CPR.
Save devid in vmstate. Defer the vfio_device_hiod_realize call to
post_load time, after devid has been recovered from vmstate.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 14 ++++++++++++++
hw/vfio/iommufd.c | 3 ++-
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 8453d76..a1ac517 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -99,12 +99,26 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
}
+static int vfio_device_post_load(void *opaque, int version_id)
+{
+ VFIODevice *vbasedev = opaque;
+ Error *err = NULL;
+
+ if (!vfio_device_hiod_realize(vbasedev, &err)) {
+ error_report_err(err);
+ return false;
+ }
+ return true;
+}
+
static const VMStateDescription vfio_device_vmstate = {
.name = "vfio-iommufd-device",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = vfio_device_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
+ VMSTATE_INT32(devid, VFIODevice),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 3fc530d..693ed19 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -536,7 +536,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
* FD to be connected and having a devid to be able to successfully call
* iommufd_backend_get_device_info().
*/
- if (!vfio_device_hiod_realize(vbasedev, errp)) {
+ if (!vbasedev->cpr.reused &&
+ !vfio_device_hiod_realize(vbasedev, errp)) {
goto err_alloc_ioas;
}
--
1.8.3.1
* [PATCH V2 42/45] vfio/iommufd: reconstruct hwpt
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (40 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 41/45] vfio/iommufd: reconstruct hw_caps Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 43/45] vfio/iommufd: change process Steve Sistare
` (3 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Save the hwpt_id in vmstate. In realize, skip its allocation from
iommufd_cdev_attach -> iommufd_cdev_attach_container ->
iommufd_cdev_autodomains_get. Rebuild userland structures to hold
hwpt_id by calling iommufd_cdev_rebuild_hwpt at post load time.
This depends on hw_caps, which was restored by the post_load call to
vfio_device_hiod_realize.
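The rebuild step amounts to a find-or-create on the container's hwpt list, which can be sketched as below. This is an illustration only: the structs and the `make_hwpt`, `set_hwpt` and `rebuild_hwpt` helpers are hypothetical simplifications that use a plain singly linked list and omit the dirty-tracking flags.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userland bookkeeping for one hardware page table. */
struct hwpt {
    uint32_t id;
    int ndevices;       /* devices currently pointing at this hwpt */
    struct hwpt *next;
};

struct container {
    struct hwpt *hwpt_list;
};

struct device {
    uint32_t saved_hwpt_id;  /* restored from saved state */
    struct hwpt *hwpt;
};

/* Allocate bookkeeping for id and link it into the container's list. */
static struct hwpt *make_hwpt(struct container *c, uint32_t id)
{
    struct hwpt *h = calloc(1, sizeof(*h));

    if (h == NULL) {
        return NULL;
    }
    h->id = id;
    h->next = c->hwpt_list;
    c->hwpt_list = h;
    return h;
}

static void set_hwpt(struct device *d, struct hwpt *h)
{
    assert(h != NULL);
    d->hwpt = h;
    h->ndevices++;
}

/*
 * Attach the device to the bookkeeping entry for its saved id, creating
 * the entry only if no earlier device already rebuilt it.
 * Returns 0 on success, -1 on allocation failure.
 */
static int rebuild_hwpt(struct device *d, struct container *c)
{
    struct hwpt *h;

    for (h = c->hwpt_list; h != NULL; h = h->next) {
        if (h->id == d->saved_hwpt_id) {
            set_hwpt(d, h);
            return 0;
        }
    }
    h = make_hwpt(c, d->saved_hwpt_id);
    if (h == NULL) {
        return -1;
    }
    set_hwpt(d, h);
    return 0;
}
```

Devices that shared a hwpt in the old process end up sharing one entry again, because the first device to load creates it and later devices find it by id. No kernel allocation is involved, since the kernel object survives the update.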
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 7 +++++++
hw/vfio/iommufd.c | 19 +++++++++++++++++++
hw/vfio/trace-events | 1 +
include/hw/vfio/vfio-common.h | 2 ++
include/hw/vfio/vfio-cpr.h | 1 +
5 files changed, 30 insertions(+)
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index a1ac517..4b78ebf 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -108,6 +108,12 @@ static int vfio_device_post_load(void *opaque, int version_id)
error_report_err(err);
return false;
}
+ if (!vbasedev->mdev) {
+ VFIOIOMMUFDContainer *container = container_of(vbasedev->bcontainer,
+ VFIOIOMMUFDContainer,
+ bcontainer);
+ iommufd_cdev_rebuild_hwpt(vbasedev, container);
+ }
return true;
}
@@ -119,6 +125,7 @@ static const VMStateDescription vfio_device_vmstate = {
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
VMSTATE_INT32(devid, VFIODevice),
+ VMSTATE_UINT32(cpr.hwpt_id, VFIODevice),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 693ed19..cf92c96 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -298,6 +298,7 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
static void iommufd_cdev_set_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
{
vbasedev->hwpt = hwpt;
+ vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
}
@@ -323,6 +324,24 @@ static VFIOIOASHwpt *iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
return hwpt;
}
+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
+ VFIOIOMMUFDContainer *container)
+{
+ VFIOIOASHwpt *hwpt;
+ int hwpt_id = vbasedev->cpr.hwpt_id;
+
+ trace_iommufd_cdev_rebuild_hwpt(container->be->fd, hwpt_id);
+
+ QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+ if (hwpt->hwpt_id == hwpt_id) {
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
+ return;
+ }
+ }
+ hwpt = iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id);
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
+}
+
static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
VFIOIOMMUFDContainer *container,
Error **errp)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index cab1cf1..25ff04c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -176,6 +176,7 @@ iommufd_cdev_connect_and_bind(int iommufd, const char *name, int devfd, int devi
iommufd_cdev_getfd(const char *dev, int devfd) " %s (fd=%d)"
iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int id) " [iommufd=%d] Successfully attached device %s (%d) to id=%d"
iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) " [iommufd=%d] Successfully detached %s"
+iommufd_cdev_rebuild_hwpt(int iommufd, int hwpt_id) " [iommufd=%d] hwpt %d"
iommufd_cdev_fail_attach_existing_container(const char *msg) " %s"
iommufd_cdev_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d"
iommufd_cdev_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 6701393..00831b7 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -265,6 +265,8 @@ void vfio_kvm_device_close(void);
bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
+ VFIOIOMMUFDContainer *container);
bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
uint32_t ioas_id, Error **errp);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index fa4b928..d195ce6 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -23,6 +23,7 @@ typedef struct VFIODeviceCPR {
bool reused;
Error *mdev_blocker;
Error *id_blocker;
+ uint32_t hwpt_id;
} VFIODeviceCPR;
struct VFIOContainer;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
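As a reading aid for the patch above, the lookup in iommufd_cdev_rebuild_hwpt is a find-or-create by saved id: reuse the userland record if another device already rebuilt it, otherwise create one, then attach the device. The following is a minimal standalone sketch of that pattern only. The types Hwpt, Container and Device and the function rebuild_hwpt are hypothetical stand-ins, not QEMU's structures, and no kernel object is allocated here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the userland record of a hardware page table. */
typedef struct Hwpt {
    uint32_t id;
    int ndevices;            /* number of devices attached to this record */
    struct Hwpt *next;
} Hwpt;

/* Hypothetical stand-in for a container holding a list of hwpt records. */
typedef struct Container {
    Hwpt *hwpt_list;
} Container;

/* Hypothetical stand-in for a device whose hwpt id was restored from vmstate. */
typedef struct Device {
    uint32_t saved_hwpt_id;
    Hwpt *hwpt;
} Device;

/*
 * Attach dev to the record whose id matches the saved id, creating the
 * record first if no earlier device has rebuilt it.  Only userland
 * bookkeeping is touched.  Returns NULL if allocation fails.
 */
static Hwpt *rebuild_hwpt(Container *c, Device *dev)
{
    Hwpt *h;

    for (h = c->hwpt_list; h != NULL; h = h->next) {
        if (h->id == dev->saved_hwpt_id) {
            break;
        }
    }
    if (h == NULL) {
        h = calloc(1, sizeof(*h));
        if (h == NULL) {
            return NULL;
        }
        h->id = dev->saved_hwpt_id;
        h->next = c->hwpt_list;
        c->hwpt_list = h;
    }
    h->ndevices++;
    dev->hwpt = h;
    return h;
}
```

Two devices that saved the same id end up sharing one record, which mirrors the QLIST_FOREACH match in the patch; a device with a different id gets a fresh record.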
* [PATCH V2 43/45] vfio/iommufd: change process
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (41 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 42/45] vfio/iommufd: reconstruct hwpt Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 44/45] iommufd: preserve DMA mappings Steve Sistare
` (2 subsequent siblings)
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Finish CPR by changing the owning process of the iommufd device in
post load.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 4b78ebf..92b101d 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -46,10 +46,27 @@ static int vfio_container_post_load(void *opaque, int version_id)
return 0;
}
+static int vfio_container_pre_save(void *opaque)
+{
+ VFIOIOMMUFDContainer *container = opaque;
+ Error *err = NULL;
+
+ /*
+ * The process has not changed yet, but proactively call the ioctl,
+ * and it will fail if any DMA mappings are not supported.
+ */
+ if (!iommufd_change_process(container->be, &err)) {
+ error_report_err(err);
+ return -1;
+ }
+ return 0;
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-iommufd-container",
.version_id = 0,
.minimum_version_id = 0,
+ .pre_save = vfio_container_pre_save,
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
@@ -58,10 +75,23 @@ static const VMStateDescription vfio_container_vmstate = {
}
};
+static int iommufd_cpr_post_load(void *opaque, int version_id)
+{
+ IOMMUFDBackend *be = opaque;
+ Error *err = NULL;
+
+ if (!iommufd_change_process(be, &err)) {
+ error_report_err(err);
+ return -1;
+ }
+ return 0;
+}
+
static const VMStateDescription iommufd_cpr_vmstate = {
.name = "iommufd",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = iommufd_cpr_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
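The patch above calls the same operation twice: once in pre_save as a probe, so that an unsupported mapping fails the save before anything is handed over, and once in post_load to do the real ownership change. A minimal standalone sketch of that probe-then-commit pattern follows. Backend, change_process and the two handlers are hypothetical stand-ins, and the unsupported_mappings counter merely simulates a kernel refusal; none of this is the iommufd API.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the iommufd backend. */
typedef struct Backend {
    int unsupported_mappings;  /* simulated: mappings that cannot change owner */
    int successful_calls;      /* simulated: how often the operation succeeded */
} Backend;

/* Simulated change-process operation: fails if any mapping is unsupported. */
static bool change_process(Backend *be)
{
    if (be->unsupported_mappings > 0) {
        return false;
    }
    be->successful_calls++;
    return true;
}

/* Old process: probe early so the save fails before the handover begins. */
static int container_pre_save(Backend *be)
{
    return change_process(be) ? 0 : -1;
}

/* New process: perform the ownership change for real. */
static int backend_post_load(Backend *be)
{
    return change_process(be) ? 0 : -1;
}
```

The sketch assumes, as the comment in the patch states, that calling the operation before the process has changed is harmless when it succeeds.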
* [PATCH V2 44/45] iommufd: preserve DMA mappings
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (42 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 43/45] vfio/iommufd: change process Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 14:14 ` [PATCH V2 45/45] vfio/container: delete old cpr register Steve Sistare
2025-02-14 15:56 ` [PATCH V2 00/45] Live update: vfio and iommufd Steven Sistare
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
During cpr-transfer load in new QEMU, the vfio_memory_listener causes
spurious calls to map and unmap DMA regions, as devices are created and
the address space is built. This memory was already mapped by the
device in old QEMU, so suppress the map and unmap callbacks during CPR,
i.e., while the reused flag is set. The reused flag is cleared in the
post_load handler.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
backends/iommufd.c | 8 ++++++++
hw/vfio/cpr-iommufd.c | 1 +
2 files changed, 9 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index da90b21..dfcfd6b 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -209,6 +209,10 @@ int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
.length = size,
};
+ if (be->cpr_reused) {
+ return 0;
+ }
+
if (!readonly) {
map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
}
@@ -240,6 +244,10 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
.length = size,
};
+ if (be->cpr_reused) {
+ return 0;
+ }
+
ret = ioctl(fd, IOMMU_IOAS_UNMAP, &unmap);
/*
* IOMMUFD takes mapping as some kind of object, unmapping
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 92b101d..286597a 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -66,6 +66,7 @@ static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-iommufd-container",
.version_id = 0,
.minimum_version_id = 0,
+ .priority = MIG_PRI_LOW, /* Must happen after devices and groups */
.pre_save = vfio_container_pre_save,
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
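The early returns added above can be read as a simple gate: while the reused flag is set, map and unmap report success without reaching the kernel, and normal behavior resumes once post_load clears the flag. The standalone sketch below illustrates only that gating. Backend, backend_map, backend_unmap and backend_post_load are hypothetical stand-ins, and the counters replace the real ioctl calls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the iommufd backend. */
typedef struct Backend {
    bool cpr_reused;   /* set while new QEMU is loading CPR state */
    int map_calls;     /* simulated: map requests that would reach the kernel */
    int unmap_calls;   /* simulated: unmap requests that would reach the kernel */
} Backend;

/* Map a region, unless the mapping is inherited from the old process. */
static int backend_map(Backend *be, uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    if (be->cpr_reused) {
        return 0;      /* already mapped by old QEMU: nothing to do */
    }
    be->map_calls++;
    return 0;
}

/* Unmap a region, unless the mapping must be preserved across the update. */
static int backend_unmap(Backend *be, uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    if (be->cpr_reused) {
        return 0;      /* keep the inherited mapping intact */
    }
    be->unmap_calls++;
    return 0;
}

/* Loading is complete: later map and unmap requests are real again. */
static void backend_post_load(Backend *be)
{
    be->cpr_reused = false;
}
```

Requests issued while the flag is set leave both counters at zero; the first request after backend_post_load is counted.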
* [PATCH V2 45/45] vfio/container: delete old cpr register
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (43 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 44/45] iommufd: preserve DMA mappings Steve Sistare
@ 2025-02-14 14:14 ` Steve Sistare
2025-02-14 15:56 ` [PATCH V2 00/45] Live update: vfio and iommufd Steven Sistare
45 siblings, 0 replies; 72+ messages in thread
From: Steve Sistare @ 2025-02-14 14:14 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
vfio_cpr_[un]register_container are no longer used since they were
subsumed by container type-specific registration. Delete them.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr.c | 14 --------------
include/hw/vfio/vfio-common.h | 3 ---
2 files changed, 17 deletions(-)
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index e3ea2bf..5387b31 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -6,7 +6,6 @@
*/
#include "qemu/osdep.h"
-#include "hw/vfio/vfio-common.h"
#include "hw/vfio/vfio-cpr.h"
#include "hw/vfio/pci.h"
#include "hw/pci/msix.h"
@@ -30,19 +29,6 @@ int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
return 0;
}
-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
-{
- migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
- vfio_cpr_reboot_notifier,
- MIG_MODE_CPR_REBOOT);
- return true;
-}
-
-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
-{
- migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
-}
-
#define STRDUP_VECTOR_FD_NAME(vdev, name) \
g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 00831b7..d8c6510 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -262,9 +262,6 @@ int vfio_kvm_device_add_fd(int fd, Error **errp);
int vfio_kvm_device_del_fd(int fd, Error **errp);
void vfio_kvm_device_close(void);
-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
-
void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
VFIOIOMMUFDContainer *container);
bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
--
1.8.3.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH V2 00/45] Live update: vfio and iommufd
2025-02-14 14:13 [PATCH V2 00/45] Live update: vfio and iommufd Steve Sistare
` (44 preceding siblings ...)
2025-02-14 14:14 ` [PATCH V2 45/45] vfio/container: delete old cpr register Steve Sistare
@ 2025-02-14 15:56 ` Steven Sistare
2025-02-14 16:06 ` Peter Xu
45 siblings, 1 reply; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 15:56 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas
Hi all, it would be nice to get this into qemu 10.0. Without it, the
basic support for cpr-transfer already in 10.0 is much less interesting.
Soft feature freeze is 2024-03-12.
- Steve
On 2/14/2025 9:13 AM, Steve Sistare wrote:
> Support vfio and iommufd devices with the cpr-transfer live migration mode.
> Devices that do not support live migration can still support cpr-transfer,
> allowing live update to a new version of QEMU on the same host, with no loss
> of guest connectivity.
>
> No user-visible interfaces are added.
>
> For legacy containers:
>
> Pass vfio device descriptors to new QEMU. In new QEMU, during vfio_realize,
> skip the ioctls that configure the device, because it is already configured.
>
> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VA's for DMA mapped
> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
> QEMU and update the locked memory accounting. The physical pages remain
> pinned, because the descriptor of the device that locked them remains open,
> so DMA to those pages continues without interruption. Mediated devices are
> not supported, however, because they require the VA to always be valid, and
> there is a brief window where no VA is registered.
>
> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
> and notifier eventfd's to new QEMU. New QEMU loads the MSI data, then the
> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
> data structures, and attaches the interrupts to the new KVM instance. This
> logic also applies to iommufd containers.
>
> For iommufd containers:
>
> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
> backed by a file (including a memfd), so DMA mappings do not depend on VA,
> which can differ after live update. This allows mediated devices to be
> supported.
>
> Pass the iommufd and vfio device descriptors from old to new QEMU. In new
> QEMU, during vfio_realize, skip the ioctls that configure the device, because
> it is already configured.
>
> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
> locked memory accounting.
>
> Patches 5 to 14 are specific to legacy containers.
> Patches 27 to 44 are specific to iommufd containers.
> The remainder apply to both.
>
> Changes from previous versions:
> * V1 of this series contains minor changes from the "Live update: vfio" and
> "Live update: iommufd" series, mainly bug fixes and refactored patches.
>
> Changes in V2:
> * refactored various vfio code snippets into new cpr helpers
> * refactored vfio struct members into cpr-specific structures
> * refactored various small changes into their own patches
> * split complex patches. Notably:
> - split "refactor for cpr" into 5 patches
> - split "reconstruct device" into 4 patches
> * refactored vfio_connect_container using helpers and made its
> error recovery more robust.
> * moved vfio pci msi/vector/intx cpr functions to cpr.c
> * renamed "reused" to cpr_reused and cpr.reused
> * squashed vfio_cpr_[un]register_container to their call sites
> * simplified iommu_type setting after cpr
> * added cpr_open_fd and cpr_is_incoming helpers
> * removed changes from vfio_legacy_dma_map, and instead temporarily
> override dma_map and dma_unmap ops.
> * deleted error_report and returned Error to callers where possible.
> * simplified the memory_get_xlat_addr interface
> * fixed flags passed to iommufd_backend_alloc_hwpt
> * defined MIG_PRI_UNINITIALIZED
> * added maintainers
>
> Steve Sistare (45):
> MAINTAINERS: Add reviewer for CPR
> migration: cpr helpers
> migration: lower handler priority
> vfio: vfio_find_ram_discard_listener
> vfio/container: ram discard disable helper
> vfio/container: reform vfio_connect_container cleanup
> vfio/container: vfio_container_group_add
> vfio/container: register container for cpr
> vfio/container: preserve descriptors
> vfio/container: export vfio_legacy_dma_map
> vfio/container: discard old DMA vaddr
> vfio/container: restore DMA vaddr
> vfio/container: mdev cpr blocker
> vfio/container: recover from unmap-all-vaddr failure
> pci: export msix_is_pending
> pci: skip reset during cpr
> vfio-pci: skip reset during cpr
> vfio/pci: vfio_vector_init
> vfio/pci: vfio_notifier_init
> vfio/pci: pass vector to virq functions
> vfio/pci: vfio_notifier_init cpr parameters
> vfio/pci: vfio_notifier_cleanup
> vfio/pci: export MSI functions
> vfio-pci: preserve MSI
> vfio-pci: preserve INTx
> migration: close kvm after cpr
> migration: cpr_get_fd_param helper
> vfio: return mr from vfio_get_xlat_addr
> vfio: pass ramblock to vfio_container_dma_map
> backends/iommufd: iommufd_backend_map_file_dma
> backends/iommufd: change process ioctl
> physmem: qemu_ram_get_fd_offset
> vfio/iommufd: use IOMMU_IOAS_MAP_FILE
> vfio/iommufd: export iommufd_cdev_get_info_iova_range
> vfio/iommufd: define hwpt constructors
> vfio/iommufd: invariant device name
> vfio/iommufd: fix cpr register
> vfio/iommufd: register container for cpr
> vfio/iommufd: preserve descriptors
> vfio/iommufd: reconstruct device
> vfio/iommufd: reconstruct hw_caps
> vfio/iommufd: reconstruct hwpt
> vfio/iommufd: change process
> iommufd: preserve DMA mappings
> vfio/container: delete old cpr register
>
> MAINTAINERS | 12 ++
> accel/kvm/kvm-all.c | 28 ++++
> backends/iommufd.c | 83 ++++++++--
> backends/trace-events | 2 +
> hw/pci/msix.c | 2 +-
> hw/pci/pci.c | 13 ++
> hw/vfio/common.c | 91 ++++++++---
> hw/vfio/container-base.c | 12 +-
> hw/vfio/container.c | 216 +++++++++++++++++---------
> hw/vfio/cpr-iommufd.c | 172 +++++++++++++++++++++
> hw/vfio/cpr-legacy.c | 277 ++++++++++++++++++++++++++++++++++
> hw/vfio/cpr.c | 176 +++++++++++++++++++--
> hw/vfio/helpers.c | 20 +--
> hw/vfio/iommufd.c | 139 +++++++++++++----
> hw/vfio/meson.build | 4 +-
> hw/vfio/pci.c | 194 ++++++++++++++++++------
> hw/vfio/pci.h | 12 ++
> hw/vfio/trace-events | 1 +
> hw/virtio/vhost-vdpa.c | 8 +-
> include/exec/cpu-common.h | 1 +
> include/exec/memory.h | 6 +-
> include/hw/pci/msix.h | 1 +
> include/hw/vfio/vfio-common.h | 20 ++-
> include/hw/vfio/vfio-container-base.h | 6 +-
> include/hw/vfio/vfio-cpr.h | 69 +++++++++
> include/migration/cpr.h | 10 ++
> include/migration/vmstate.h | 6 +-
> include/system/iommufd.h | 6 +
> include/system/kvm.h | 1 +
> migration/cpr-transfer.c | 18 +++
> migration/cpr.c | 92 +++++++++++
> migration/migration.c | 1 +
> migration/savevm.c | 4 +-
> system/memory.c | 19 +--
> system/physmem.c | 5 +
> 35 files changed, 1490 insertions(+), 237 deletions(-)
> create mode 100644 hw/vfio/cpr-iommufd.c
> create mode 100644 hw/vfio/cpr-legacy.c
> create mode 100644 include/hw/vfio/vfio-cpr.h
>
> base-commit: de278e54aefed143526174335f8286f7437d20be
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH V2 00/45] Live update: vfio and iommufd
2025-02-14 15:56 ` [PATCH V2 00/45] Live update: vfio and iommufd Steven Sistare
@ 2025-02-14 16:06 ` Peter Xu
2025-02-14 16:20 ` Steven Sistare
0 siblings, 1 reply; 72+ messages in thread
From: Peter Xu @ 2025-02-14 16:06 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Fri, Feb 14, 2025 at 10:56:02AM -0500, Steven Sistare wrote:
> Hi all, it would be nice to get this into qemu 10.0. Without it, the
> basic support for cpr-transfer already in 10.0 is much less interesting.
True..
> Soft feature freeze is 2024-03-12.
Said that, targeting 10.0 for such a huge series across multiple modules,
and especially during the time VFIO review is on heavy load.. may not be
easily achievable. It might be more practical, IMHO, to target this 10.1.
Review can still happen during / after soft-freeze.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH V2 00/45] Live update: vfio and iommufd
2025-02-14 16:06 ` Peter Xu
@ 2025-02-14 16:20 ` Steven Sistare
2025-02-14 16:48 ` Cédric Le Goater
0 siblings, 1 reply; 72+ messages in thread
From: Steven Sistare @ 2025-02-14 16:20 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/14/2025 11:06 AM, Peter Xu wrote:
> On Fri, Feb 14, 2025 at 10:56:02AM -0500, Steven Sistare wrote:
>> Hi all, it would be nice to get this into qemu 10.0. Without it, the
>> basic support for cpr-transfer already in 10.0 is much less interesting.
>
> True..
>
>> Soft feature freeze is 2024-03-12.
>
> Said that, targeting 10.0 for such a huge series across multiple modules,
> and especially during the time VFIO review is on heavy load.. may not be
> easily achievable. It might be more practical, IMHO, to target this 10.1.
> Review can still happen during / after soft-freeze.
Understood. Let me know if I can do anything to help.
BTW, the series is less huge than it looks. I divided it into small patches
as requested.
- Steve
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH V2 00/45] Live update: vfio and iommufd
2025-02-14 16:20 ` Steven Sistare
@ 2025-02-14 16:48 ` Cédric Le Goater
0 siblings, 0 replies; 72+ messages in thread
From: Cédric Le Goater @ 2025-02-14 16:48 UTC (permalink / raw)
To: Steven Sistare, Peter Xu
Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Fabiano Rosas
On 2/14/25 17:20, Steven Sistare wrote:
> On 2/14/2025 11:06 AM, Peter Xu wrote:
>> On Fri, Feb 14, 2025 at 10:56:02AM -0500, Steven Sistare wrote:
>>> Hi all, it would be nice to get this into qemu 10.0. Without it, the
>>> basic support for cpr-transfer already in 10.0 is much less interesting.
>>
>> True..
>>
>>> Soft feature freeze is 2024-03-12.
>>
>> Said that, targeting 10.0 for such a huge series across multiple modules,
>> and especially during the time VFIO review is on heavy load.. may not be
>> easily achievable. It might be more practical, IMHO, to target this 10.1.
>> Review can still happen during / after soft-freeze.
yes. It is *very* optimistic and it is also a question of stability and
maintenance. One "big" feature per release is more than enough. "multifd
support for VFIO migration" is the next candidate.
And I am sorry Steve, I still haven't looked at your answers on v1 ...
They are next on my TODO list.
> Understood. Let me know if I can do anything to help.
Well, what bothers me today is that we have been adding a lot of new features
in the VFIO subsystem these last years (migration, IOMMUFD, etc) and we still
lack decent documentation in QEMU. That would be a great addition. For the
series "multifd support for VFIO migration" too.
> BTW, the series is less huge than it looks. I divided it into small patches
> as requested.
That's better. Maybe we can merge the cleanup patches that prepare the
ground for the larger series.
Thanks,
C.
^ permalink raw reply [flat|nested] 72+ messages in thread