* [PATCH 1/4] reset: qemu_register_reset_one()
2024-01-17 9:15 [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices peterx
@ 2024-01-17 9:15 ` peterx
2024-01-17 10:29 ` Eric Auger
2024-01-17 9:15 ` [PATCH 2/4] reset: Allow multiple stages of system resets peterx
` (4 subsequent siblings)
5 siblings, 1 reply; 21+ messages in thread
From: peterx @ 2024-01-17 9:15 UTC (permalink / raw)
To: qemu-devel
Cc: Eric Auger, Michael S . Tsirkin, Jason Wang, Alex Williamson,
Igor Mammedov, peterx
From: Peter Xu <peterx@redhat.com>
Clean up the code to use a single entry point for registering reset hooks.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/core/reset.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/hw/core/reset.c b/hw/core/reset.c
index d3263b613e..8cf60b2b09 100644
--- a/hw/core/reset.c
+++ b/hw/core/reset.c
@@ -39,23 +39,26 @@ typedef struct QEMUResetEntry {
static QTAILQ_HEAD(, QEMUResetEntry) reset_handlers =
QTAILQ_HEAD_INITIALIZER(reset_handlers);
-void qemu_register_reset(QEMUResetHandler *func, void *opaque)
+static void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
+ bool skip_snap)
{
QEMUResetEntry *re = g_new0(QEMUResetEntry, 1);
re->func = func;
re->opaque = opaque;
+ re->skip_on_snapshot_load = skip_snap;
QTAILQ_INSERT_TAIL(&reset_handlers, re, entry);
}
-void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque)
+void qemu_register_reset(QEMUResetHandler *func, void *opaque)
{
- QEMUResetEntry *re = g_new0(QEMUResetEntry, 1);
+ /* By default, do not skip during load of a snapshot */
+ qemu_register_reset_one(func, opaque, false);
+}
- re->func = func;
- re->opaque = opaque;
- re->skip_on_snapshot_load = true;
- QTAILQ_INSERT_TAIL(&reset_handlers, re, entry);
+void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque)
+{
+ qemu_register_reset_one(func, opaque, true);
}
void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
--
2.43.0
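The refactoring above can be modelled in isolation: both public helpers funnel through one internal entry point that fills in a list entry. The sketch below uses invented names and a plain linked list in place of QEMU's QTAILQ, so it only illustrates the shape of the pattern, not the actual hw/core/reset.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef void ResetHandler(void *opaque);

typedef struct ResetEntry {
    ResetHandler *func;
    void *opaque;
    bool skip_on_snapshot_load;
    struct ResetEntry *next;
} ResetEntry;

static ResetEntry *handlers;

/* Single internal entry point: both public wrappers funnel through here. */
static void register_reset_one(ResetHandler *func, void *opaque, bool skip_snap)
{
    ResetEntry *re = calloc(1, sizeof(*re));

    re->func = func;
    re->opaque = opaque;
    re->skip_on_snapshot_load = skip_snap;
    re->next = handlers;
    handlers = re;
}

void register_reset(ResetHandler *func, void *opaque)
{
    /* By default, do not skip during load of a snapshot */
    register_reset_one(func, opaque, false);
}

void register_reset_nosnapshotload(ResetHandler *func, void *opaque)
{
    register_reset_one(func, opaque, true);
}

void run_resets(bool snapshot_load)
{
    for (ResetEntry *re = handlers; re; re = re->next) {
        if (snapshot_load && re->skip_on_snapshot_load) {
            continue;
        }
        re->func(re->opaque);
    }
}
```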
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/4] reset: qemu_register_reset_one()
2024-01-17 9:15 ` [PATCH 1/4] reset: qemu_register_reset_one() peterx
@ 2024-01-17 10:29 ` Eric Auger
0 siblings, 0 replies; 21+ messages in thread
From: Eric Auger @ 2024-01-17 10:29 UTC (permalink / raw)
To: peterx, qemu-devel
Cc: Michael S . Tsirkin, Jason Wang, Alex Williamson, Igor Mammedov
On 1/17/24 10:15, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> Clean up the code to use a single entry point for registering reset hooks.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
> ---
> hw/core/reset.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/hw/core/reset.c b/hw/core/reset.c
> index d3263b613e..8cf60b2b09 100644
> --- a/hw/core/reset.c
> +++ b/hw/core/reset.c
> @@ -39,23 +39,26 @@ typedef struct QEMUResetEntry {
> static QTAILQ_HEAD(, QEMUResetEntry) reset_handlers =
> QTAILQ_HEAD_INITIALIZER(reset_handlers);
>
> -void qemu_register_reset(QEMUResetHandler *func, void *opaque)
> +static void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
> + bool skip_snap)
> {
> QEMUResetEntry *re = g_new0(QEMUResetEntry, 1);
>
> re->func = func;
> re->opaque = opaque;
> + re->skip_on_snapshot_load = skip_snap;
> QTAILQ_INSERT_TAIL(&reset_handlers, re, entry);
> }
>
> -void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque)
> +void qemu_register_reset(QEMUResetHandler *func, void *opaque)
> {
> - QEMUResetEntry *re = g_new0(QEMUResetEntry, 1);
> + /* By default, do not skip during load of a snapshot */
> + qemu_register_reset_one(func, opaque, false);
> +}
>
> - re->func = func;
> - re->opaque = opaque;
> - re->skip_on_snapshot_load = true;
> - QTAILQ_INSERT_TAIL(&reset_handlers, re, entry);
> +void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque)
> +{
> + qemu_register_reset_one(func, opaque, true);
> }
>
> void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
* [PATCH 2/4] reset: Allow multiple stages of system resets
2024-01-17 9:15 [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices peterx
2024-01-17 9:15 ` [PATCH 1/4] reset: qemu_register_reset_one() peterx
@ 2024-01-17 9:15 ` peterx
2024-01-17 10:28 ` Eric Auger
2024-01-17 17:46 ` Peter Maydell
2024-01-17 9:15 ` [PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset peterx
` (3 subsequent siblings)
5 siblings, 2 replies; 21+ messages in thread
From: peterx @ 2024-01-17 9:15 UTC (permalink / raw)
To: qemu-devel
Cc: Eric Auger, Michael S . Tsirkin, Jason Wang, Alex Williamson,
Igor Mammedov, peterx
From: Peter Xu <peterx@redhat.com>
QEMU resets currently have no way to order reset hooks. Add one coarse-grained
reset stage so that some devices can be reset later than others.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/sysemu/reset.h | 5 ++++
hw/core/reset.c | 60 +++++++++++++++++++++++++++++++-----------
2 files changed, 49 insertions(+), 16 deletions(-)
diff --git a/include/sysemu/reset.h b/include/sysemu/reset.h
index 609e4d50c2..0de697ce9f 100644
--- a/include/sysemu/reset.h
+++ b/include/sysemu/reset.h
@@ -5,9 +5,14 @@
typedef void QEMUResetHandler(void *opaque);
+#define QEMU_RESET_STAGES_N 2
+
void qemu_register_reset(QEMUResetHandler *func, void *opaque);
+void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
+ bool skip_snap, int stage);
void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque);
void qemu_unregister_reset(QEMUResetHandler *func, void *opaque);
+void qemu_unregister_reset_one(QEMUResetHandler *func, void *opaque, int stage);
void qemu_devices_reset(ShutdownCause reason);
#endif
diff --git a/hw/core/reset.c b/hw/core/reset.c
index 8cf60b2b09..a84c9bee84 100644
--- a/hw/core/reset.c
+++ b/hw/core/reset.c
@@ -36,55 +36,83 @@ typedef struct QEMUResetEntry {
bool skip_on_snapshot_load;
} QEMUResetEntry;
-static QTAILQ_HEAD(, QEMUResetEntry) reset_handlers =
- QTAILQ_HEAD_INITIALIZER(reset_handlers);
+typedef QTAILQ_HEAD(QEMUResetList, QEMUResetEntry) QEMUResetList;
+static QEMUResetList reset_handlers[QEMU_RESET_STAGES_N];
-static void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
- bool skip_snap)
+static void __attribute__((__constructor__)) qemu_reset_handlers_init(void)
+{
+ QEMUResetList *head;
+ int i = 0;
+
+ for (i = 0; i < QEMU_RESET_STAGES_N; i++) {
+ head = &reset_handlers[i];
+ QTAILQ_INIT(head);
+ }
+}
+
+void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
+ bool skip_snap, int stage)
{
QEMUResetEntry *re = g_new0(QEMUResetEntry, 1);
+ QEMUResetList *head;
+
+ assert(stage >= 0 && stage < QEMU_RESET_STAGES_N);
+ head = &reset_handlers[stage];
re->func = func;
re->opaque = opaque;
re->skip_on_snapshot_load = skip_snap;
- QTAILQ_INSERT_TAIL(&reset_handlers, re, entry);
+ QTAILQ_INSERT_TAIL(head, re, entry);
}
void qemu_register_reset(QEMUResetHandler *func, void *opaque)
{
- /* By default, do not skip during load of a snapshot */
- qemu_register_reset_one(func, opaque, false);
+ qemu_register_reset_one(func, opaque, false, 0);
}
void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque)
{
- qemu_register_reset_one(func, opaque, true);
+ qemu_register_reset_one(func, opaque, true, 0);
}
-void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
+void qemu_unregister_reset_one(QEMUResetHandler *func, void *opaque, int stage)
{
+ QEMUResetList *head;
QEMUResetEntry *re;
- QTAILQ_FOREACH(re, &reset_handlers, entry) {
+ assert(stage >= 0 && stage < QEMU_RESET_STAGES_N);
+ head = &reset_handlers[stage];
+
+ QTAILQ_FOREACH(re, head, entry) {
if (re->func == func && re->opaque == opaque) {
- QTAILQ_REMOVE(&reset_handlers, re, entry);
+ QTAILQ_REMOVE(head, re, entry);
g_free(re);
return;
}
}
}
+void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
+{
+ qemu_unregister_reset_one(func, opaque, 0);
+}
+
void qemu_devices_reset(ShutdownCause reason)
{
QEMUResetEntry *re, *nre;
+ QEMUResetList *head;
+ int stage;
/* reset all devices */
- QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
- if (reason == SHUTDOWN_CAUSE_SNAPSHOT_LOAD &&
- re->skip_on_snapshot_load) {
- continue;
+ for (stage = 0; stage < QEMU_RESET_STAGES_N; stage++) {
+ head = &reset_handlers[stage];
+ QTAILQ_FOREACH_SAFE(re, head, entry, nre) {
+ if (reason == SHUTDOWN_CAUSE_SNAPSHOT_LOAD &&
+ re->skip_on_snapshot_load) {
+ continue;
+ }
+ re->func(re->opaque);
}
- re->func(re->opaque);
}
}
--
2.43.0
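The staged design above can also be modelled standalone. The sketch below uses invented names and one linked list per stage instead of QEMU's QTAILQ arrays; the point it demonstrates is that lower stages are handled first, so a vIOMMU-like handler registered at stage 1 runs only after every stage-0 handler.

```c
#include <assert.h>
#include <stdlib.h>

#define RESET_STAGES_N 2

typedef void ResetHandler(void *opaque);

typedef struct ResetEntry {
    ResetHandler *func;
    void *opaque;
    struct ResetEntry *next;
} ResetEntry;

/* One list per stage; stage 0 runs first, stage 1 last. */
static ResetEntry *stage_handlers[RESET_STAGES_N];

void register_reset_one(ResetHandler *func, void *opaque, int stage)
{
    assert(stage >= 0 && stage < RESET_STAGES_N);
    ResetEntry *re = calloc(1, sizeof(*re));

    re->func = func;
    re->opaque = opaque;
    re->next = stage_handlers[stage];
    stage_handlers[stage] = re;
}

void run_resets(void)
{
    /* Lower stages are handled first, so a handler at a higher stage
       is only called after every handler of all lower stages. */
    for (int stage = 0; stage < RESET_STAGES_N; stage++) {
        for (ResetEntry *re = stage_handlers[stage]; re; re = re->next) {
            re->func(re->opaque);
        }
    }
}
```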
* Re: [PATCH 2/4] reset: Allow multiple stages of system resets
2024-01-17 9:15 ` [PATCH 2/4] reset: Allow multiple stages of system resets peterx
@ 2024-01-17 10:28 ` Eric Auger
2024-01-17 13:58 ` Cédric Le Goater
2024-01-17 17:46 ` Peter Maydell
1 sibling, 1 reply; 21+ messages in thread
From: Eric Auger @ 2024-01-17 10:28 UTC (permalink / raw)
To: peterx, qemu-devel
Cc: Michael S . Tsirkin, Jason Wang, Alex Williamson, Igor Mammedov
Hi Peter,
On 1/17/24 10:15, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> QEMU resets currently have no way to order reset hooks. Add one coarse-grained
> reset stage so that some devices can be reset later than others.
I would clarify that the lowest stage has the highest priority and is
handled first.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/sysemu/reset.h | 5 ++++
> hw/core/reset.c | 60 +++++++++++++++++++++++++++++++-----------
> 2 files changed, 49 insertions(+), 16 deletions(-)
>
> diff --git a/include/sysemu/reset.h b/include/sysemu/reset.h
> index 609e4d50c2..0de697ce9f 100644
> --- a/include/sysemu/reset.h
> +++ b/include/sysemu/reset.h
> @@ -5,9 +5,14 @@
>
> typedef void QEMUResetHandler(void *opaque);
>
> +#define QEMU_RESET_STAGES_N 2
> +
> void qemu_register_reset(QEMUResetHandler *func, void *opaque);
> +void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
> + bool skip_snap, int stage);
> void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque);
> void qemu_unregister_reset(QEMUResetHandler *func, void *opaque);
> +void qemu_unregister_reset_one(QEMUResetHandler *func, void *opaque, int stage);
> void qemu_devices_reset(ShutdownCause reason);
>
> #endif
> diff --git a/hw/core/reset.c b/hw/core/reset.c
> index 8cf60b2b09..a84c9bee84 100644
> --- a/hw/core/reset.c
> +++ b/hw/core/reset.c
> @@ -36,55 +36,83 @@ typedef struct QEMUResetEntry {
> bool skip_on_snapshot_load;
> } QEMUResetEntry;
>
> -static QTAILQ_HEAD(, QEMUResetEntry) reset_handlers =
> - QTAILQ_HEAD_INITIALIZER(reset_handlers);
> +typedef QTAILQ_HEAD(QEMUResetList, QEMUResetEntry) QEMUResetList;
> +static QEMUResetList reset_handlers[QEMU_RESET_STAGES_N];
>
> -static void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
> - bool skip_snap)
> +static void __attribute__((__constructor__)) qemu_reset_handlers_init(void)
> +{
> + QEMUResetList *head;
> + int i = 0;
nit: you may put the declarations within the block
> +
> + for (i = 0; i < QEMU_RESET_STAGES_N; i++) {
> + head = &reset_handlers[i];
> + QTAILQ_INIT(head);
> + }
> +}
> +
> +void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
> + bool skip_snap, int stage)
> {
> QEMUResetEntry *re = g_new0(QEMUResetEntry, 1);
> + QEMUResetList *head;
> +
> + assert(stage >= 0 && stage < QEMU_RESET_STAGES_N);
> + head = &reset_handlers[stage];
>
> re->func = func;
> re->opaque = opaque;
> re->skip_on_snapshot_load = skip_snap;
> - QTAILQ_INSERT_TAIL(&reset_handlers, re, entry);
> + QTAILQ_INSERT_TAIL(head, re, entry);
> }
>
> void qemu_register_reset(QEMUResetHandler *func, void *opaque)
> {
> - /* By default, do not skip during load of a snapshot */
Shouldn't the above comment stay, since the statement is not affected by
this patch? Or remove it in the previous patch?
> - qemu_register_reset_one(func, opaque, false);
> + qemu_register_reset_one(func, opaque, false, 0);
> }
>
> void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque)
> {
> - qemu_register_reset_one(func, opaque, true);
> + qemu_register_reset_one(func, opaque, true, 0);
> }
>
> -void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
> +void qemu_unregister_reset_one(QEMUResetHandler *func, void *opaque, int stage)
> {
> + QEMUResetList *head;
> QEMUResetEntry *re;
>
> - QTAILQ_FOREACH(re, &reset_handlers, entry) {
> + assert(stage >= 0 && stage < QEMU_RESET_STAGES_N);
> + head = &reset_handlers[stage];
> +
> + QTAILQ_FOREACH(re, head, entry) {
> if (re->func == func && re->opaque == opaque) {
> - QTAILQ_REMOVE(&reset_handlers, re, entry);
> + QTAILQ_REMOVE(head, re, entry);
> g_free(re);
> return;
> }
> }
> }
>
> +void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
> +{
> + qemu_unregister_reset_one(func, opaque, 0);
> +}
> +
> void qemu_devices_reset(ShutdownCause reason)
> {
> QEMUResetEntry *re, *nre;
> + QEMUResetList *head;
> + int stage;
>
> /* reset all devices */
> - QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
> - if (reason == SHUTDOWN_CAUSE_SNAPSHOT_LOAD &&
> - re->skip_on_snapshot_load) {
> - continue;
> + for (stage = 0; stage < QEMU_RESET_STAGES_N; stage++) {
> + head = &reset_handlers[stage];
> + QTAILQ_FOREACH_SAFE(re, head, entry, nre) {
> + if (reason == SHUTDOWN_CAUSE_SNAPSHOT_LOAD &&
> + re->skip_on_snapshot_load) {
> + continue;
> + }
> + re->func(re->opaque);
> }
> - re->func(re->opaque);
> }
> }
>
Thanks
Eric
* Re: [PATCH 2/4] reset: Allow multiple stages of system resets
2024-01-17 10:28 ` Eric Auger
@ 2024-01-17 13:58 ` Cédric Le Goater
0 siblings, 0 replies; 21+ messages in thread
From: Cédric Le Goater @ 2024-01-17 13:58 UTC (permalink / raw)
To: eric.auger, peterx, qemu-devel
Cc: Michael S . Tsirkin, Jason Wang, Alex Williamson, Igor Mammedov
On 1/17/24 11:28, Eric Auger wrote:
> Hi Peter,
> On 1/17/24 10:15, peterx@redhat.com wrote:
>> From: Peter Xu <peterx@redhat.com>
>>
>> QEMU resets currently have no way to order reset hooks. Add one coarse-grained
>> reset stage so that some devices can be reset later than others.
> I would clarify that the lowest stage has the highest priority and is
> handled first.
Yes. Maybe add an enum like we have for migration:
typedef enum {
MIG_PRI_DEFAULT = 0,
MIG_PRI_IOMMU, /* Must happen before PCI devices */
MIG_PRI_PCI_BUS, /* Must happen before IOMMU */
MIG_PRI_VIRTIO_MEM, /* Must happen before IOMMU */
MIG_PRI_GICV3_ITS, /* Must happen before PCI devices */
MIG_PRI_GICV3, /* Must happen before the ITS */
MIG_PRI_MAX,
} MigrationPriority;
I think it would help with understanding the reset ordering and with
maintenance when grepping for qemu_register_reset_one().
Thanks,
C.
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>> include/sysemu/reset.h | 5 ++++
>> hw/core/reset.c | 60 +++++++++++++++++++++++++++++++-----------
>> 2 files changed, 49 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/sysemu/reset.h b/include/sysemu/reset.h
>> index 609e4d50c2..0de697ce9f 100644
>> --- a/include/sysemu/reset.h
>> +++ b/include/sysemu/reset.h
>> @@ -5,9 +5,14 @@
>>
>> typedef void QEMUResetHandler(void *opaque);
>>
>> +#define QEMU_RESET_STAGES_N 2
>> +
>> void qemu_register_reset(QEMUResetHandler *func, void *opaque);
>> +void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
>> + bool skip_snap, int stage);
>> void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque);
>> void qemu_unregister_reset(QEMUResetHandler *func, void *opaque);
>> +void qemu_unregister_reset_one(QEMUResetHandler *func, void *opaque, int stage);
>> void qemu_devices_reset(ShutdownCause reason);
>>
>> #endif
>> diff --git a/hw/core/reset.c b/hw/core/reset.c
>> index 8cf60b2b09..a84c9bee84 100644
>> --- a/hw/core/reset.c
>> +++ b/hw/core/reset.c
>> @@ -36,55 +36,83 @@ typedef struct QEMUResetEntry {
>> bool skip_on_snapshot_load;
>> } QEMUResetEntry;
>>
>> -static QTAILQ_HEAD(, QEMUResetEntry) reset_handlers =
>> - QTAILQ_HEAD_INITIALIZER(reset_handlers);
>> +typedef QTAILQ_HEAD(QEMUResetList, QEMUResetEntry) QEMUResetList;
>> +static QEMUResetList reset_handlers[QEMU_RESET_STAGES_N];
>>
>> -static void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
>> - bool skip_snap)
>> +static void __attribute__((__constructor__)) qemu_reset_handlers_init(void)
>> +{
>> + QEMUResetList *head;
>> + int i = 0;
> nit: you may put the declarations within the block
>> +
>> + for (i = 0; i < QEMU_RESET_STAGES_N; i++) {
>> + head = &reset_handlers[i];
>> + QTAILQ_INIT(head);
>> + }
>> +}
>> +
>> +void qemu_register_reset_one(QEMUResetHandler *func, void *opaque,
>> + bool skip_snap, int stage)
>> {
>> QEMUResetEntry *re = g_new0(QEMUResetEntry, 1);
>> + QEMUResetList *head;
>> +
>> + assert(stage >= 0 && stage < QEMU_RESET_STAGES_N);
>> + head = &reset_handlers[stage];
>>
>> re->func = func;
>> re->opaque = opaque;
>> re->skip_on_snapshot_load = skip_snap;
>> - QTAILQ_INSERT_TAIL(&reset_handlers, re, entry);
>> + QTAILQ_INSERT_TAIL(head, re, entry);
>> }
>>
>> void qemu_register_reset(QEMUResetHandler *func, void *opaque)
>> {
>> - /* By default, do not skip during load of a snapshot */
> Shouldn't the above comment stay since the statement is not affected by
> this patch? Or remove it in previous patch?
>> - qemu_register_reset_one(func, opaque, false);
>> + qemu_register_reset_one(func, opaque, false, 0);
>> }
>>
>> void qemu_register_reset_nosnapshotload(QEMUResetHandler *func, void *opaque)
>> {
>> - qemu_register_reset_one(func, opaque, true);
>> + qemu_register_reset_one(func, opaque, true, 0);
>> }
>>
>> -void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
>> +void qemu_unregister_reset_one(QEMUResetHandler *func, void *opaque, int stage)
>> {
>> + QEMUResetList *head;
>> QEMUResetEntry *re;
>>
>> - QTAILQ_FOREACH(re, &reset_handlers, entry) {
>> + assert(stage >= 0 && stage < QEMU_RESET_STAGES_N);
>> + head = &reset_handlers[stage];
>> +
>> + QTAILQ_FOREACH(re, head, entry) {
>> if (re->func == func && re->opaque == opaque) {
>> - QTAILQ_REMOVE(&reset_handlers, re, entry);
>> + QTAILQ_REMOVE(head, re, entry);
>> g_free(re);
>> return;
>> }
>> }
>> }
>>
>> +void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
>> +{
>> + qemu_unregister_reset_one(func, opaque, 0);
>> +}
>> +
>> void qemu_devices_reset(ShutdownCause reason)
>> {
>> QEMUResetEntry *re, *nre;
>> + QEMUResetList *head;
>> + int stage;
>>
>> /* reset all devices */
>> - QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
>> - if (reason == SHUTDOWN_CAUSE_SNAPSHOT_LOAD &&
>> - re->skip_on_snapshot_load) {
>> - continue;
>> + for (stage = 0; stage < QEMU_RESET_STAGES_N; stage++) {
>> + head = &reset_handlers[stage];
>> + QTAILQ_FOREACH_SAFE(re, head, entry, nre) {
>> + if (reason == SHUTDOWN_CAUSE_SNAPSHOT_LOAD &&
>> + re->skip_on_snapshot_load) {
>> + continue;
>> + }
>> + re->func(re->opaque);
>> }
>> - re->func(re->opaque);
>> }
>> }
>>
> Thanks
>
> Eric
>
>
* Re: [PATCH 2/4] reset: Allow multiple stages of system resets
2024-01-17 9:15 ` [PATCH 2/4] reset: Allow multiple stages of system resets peterx
2024-01-17 10:28 ` Eric Auger
@ 2024-01-17 17:46 ` Peter Maydell
2024-01-18 15:53 ` Philippe Mathieu-Daudé
1 sibling, 1 reply; 21+ messages in thread
From: Peter Maydell @ 2024-01-17 17:46 UTC (permalink / raw)
To: peterx
Cc: qemu-devel, Eric Auger, Michael S . Tsirkin, Jason Wang,
Alex Williamson, Igor Mammedov
On Wed, 17 Jan 2024 at 09:16, <peterx@redhat.com> wrote:
>
> From: Peter Xu <peterx@redhat.com>
>
> QEMU resets currently have no way to order reset hooks. Add one coarse-grained
> reset stage so that some devices can be reset later than others.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/sysemu/reset.h | 5 ++++
> hw/core/reset.c | 60 +++++++++++++++++++++++++++++++-----------
> 2 files changed, 49 insertions(+), 16 deletions(-)
>
> diff --git a/include/sysemu/reset.h b/include/sysemu/reset.h
> index 609e4d50c2..0de697ce9f 100644
> --- a/include/sysemu/reset.h
> +++ b/include/sysemu/reset.h
> @@ -5,9 +5,14 @@
>
> typedef void QEMUResetHandler(void *opaque);
>
> +#define QEMU_RESET_STAGES_N 2
> +
Our reset handling APIs are already pretty complicated, and
raw qemu_register_reset() is kind of a legacy API that I would
prefer that we try to move away from, not add extra complexity to.
Our device reset design already has a multiple-phase system
(see docs/devel/reset.rst), part of the point of which is to
try to give us a way to deal with reset-ordering problems.
I feel like the right way to handle the issue you're trying to
address is to ensure that the thing that needs to happen last is
done in the 'exit' phase rather than the 'hold' phase (which is
where legacy reset methods get called).
There are some annoying wrinkles here, notably that legacy
qemu_register_reset() doesn't support 3-phase reset and so
the phasing only happens for devices reset via the device/bus
tree hierarchy. But I think the way to go is to try to move
forward with that design (i.e. expand 3-phase reset to
qemu_register_reset() and/or move things using qemu_register_reset()
to device reset where that makes sense), not to ignore it and
put a completely different reset-ordering API in a different place.
thanks
-- PMM
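For reference, the multiple-phase system described above (docs/devel/reset.rst) completes each phase across all devices before starting the next: every 'enter' runs, then every 'hold', then every 'exit'. That is what gives cross-device ordering without a separate stage API. The sketch below is a toy standalone model of that guarantee, not QEMU's Resettable code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: each device has optional enter/hold/exit callbacks; the
   reset controller finishes each phase for all devices before moving
   to the next phase. */
typedef struct Device {
    const char *name;
    void (*enter)(struct Device *);
    void (*hold)(struct Device *);
    void (*exit)(struct Device *);
} Device;

static char trace[64];

static void log_phase(const char *name, const char *phase)
{
    strcat(trace, name);
    strcat(trace, phase);
    strcat(trace, " ");
}

static void vfio_hold(Device *d)  { log_phase(d->name, ".hold"); }
static void iommu_exit(Device *d) { log_phase(d->name, ".exit"); }

void three_phase_reset(Device **devs, size_t n)
{
    for (size_t i = 0; i < n; i++) if (devs[i]->enter) devs[i]->enter(devs[i]);
    for (size_t i = 0; i < n; i++) if (devs[i]->hold)  devs[i]->hold(devs[i]);
    for (size_t i = 0; i < n; i++) if (devs[i]->exit)  devs[i]->exit(devs[i]);
}
```
Work placed in a device's exit phase is thus ordered after every other device's hold phase, regardless of where that device sits in traversal order.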
* Re: [PATCH 2/4] reset: Allow multiple stages of system resets
2024-01-17 17:46 ` Peter Maydell
@ 2024-01-18 15:53 ` Philippe Mathieu-Daudé
2024-01-18 16:15 ` Peter Maydell
2024-01-19 11:10 ` Peter Xu
0 siblings, 2 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-01-18 15:53 UTC (permalink / raw)
To: Peter Maydell, peterx
Cc: qemu-devel, Eric Auger, Michael S . Tsirkin, Jason Wang,
Alex Williamson, Igor Mammedov
On 17/1/24 18:46, Peter Maydell wrote:
> On Wed, 17 Jan 2024 at 09:16, <peterx@redhat.com> wrote:
>>
>> From: Peter Xu <peterx@redhat.com>
>>
>> QEMU resets currently have no way to order reset hooks. Add one coarse-grained
>> reset stage so that some devices can be reset later than others.
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>> include/sysemu/reset.h | 5 ++++
>> hw/core/reset.c | 60 +++++++++++++++++++++++++++++++-----------
>> 2 files changed, 49 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/sysemu/reset.h b/include/sysemu/reset.h
>> index 609e4d50c2..0de697ce9f 100644
>> --- a/include/sysemu/reset.h
>> +++ b/include/sysemu/reset.h
>> @@ -5,9 +5,14 @@
>>
>> typedef void QEMUResetHandler(void *opaque);
>>
>> +#define QEMU_RESET_STAGES_N 2
>> +
>
> Our reset handling APIs are already pretty complicated, and
> raw qemu_register_reset() is kind of a legacy API that I would
> prefer that we try to move away from, not add extra complexity to.
>
> Our device reset design already has a multiple-phase system
> (see docs/devel/reset.rst), part of the point of which is to
> try to give us a way to deal with reset-ordering problems.
> I feel like the right way to handle the issue you're trying to
> address is to ensure that the thing that needs to happen last is
> done in the 'exit' phase rather than the 'hold' phase (which is
> where legacy reset methods get called).
I concur. Device reset is hard, but bus reset is even harder.
Having a quick look, the issues tracked by Alex & Peter might
come from the PCI bridges using the legacy DeviceReset. In
particular functions like:
- hw/pci-bridge/pcie_root_port.c
46 static void rp_reset_hold(Object *obj)
47 {
48 PCIDevice *d = PCI_DEVICE(obj);
49 DeviceState *qdev = DEVICE(obj);
50
51 rp_aer_vector_update(d);
52 pcie_cap_root_reset(d);
53 pcie_cap_deverr_reset(d);
54 pcie_cap_slot_reset(d);
55 pcie_cap_arifwd_reset(d);
56 pcie_acs_reset(d);
57 pcie_aer_root_reset(d);
58 pci_bridge_reset(qdev);
59 pci_bridge_disable_base_limit(d);
60 }
- hw/pci-bridge/pcie_pci_bridge.c
107 static void pcie_pci_bridge_reset(DeviceState *qdev)
108 {
109 PCIDevice *d = PCI_DEVICE(qdev);
110 pci_bridge_reset(qdev);
111 if (msi_present(d)) {
112 msi_reset(d);
113 }
114 shpc_reset(d);
115 }
- hw/pci-bridge/xio3130_downstream.c
56 static void xio3130_downstream_reset(DeviceState *qdev)
57 {
58 PCIDevice *d = PCI_DEVICE(qdev);
59
60 pcie_cap_deverr_reset(d);
61 pcie_cap_slot_reset(d);
62 pcie_cap_arifwd_reset(d);
63 pci_bridge_reset(qdev);
64 }
should really be split and converted to ResettablePhases.
pci_bridge_reset() likely belongs in the exit phase.
> There are some annoying wrinkles here, notably that legacy
> qemu_register_reset() doesn't support 3-phase reset and so
> the phasing only happens for devices reset via the device/bus
> tree hierarchy. But I think the way to go is to try to move
> forward with that design (i.e. expand 3-phase reset to
> qemu_register_reset() and/or move things using qemu_register_reset()
> to device reset where that makes sense), not to ignore it and
> put a completely different reset-ordering API in a different place.
Unfortunately, despite DeviceReset having been deprecated for 4 years since
commit c11256aa6f ("hw/core: add Resettable support to BusClass
and DeviceClass"), we keep adding code using this legacy API; for
example, in the last 4 months:
- e029bb00a7 ("hw/pci-host: Add Astro system bus adapter found on
PA-RISC machines")
- 2880e676c0 ("Add virtio-sound device stub")
- 4a58330343 ("hw/cxl: Add a switch mailbox CCI function")
- 95e1019a4a ("vhost-user-scsi: free the inflight area when reset")
- 6f9c3aaa34 ("fsl-imx: add simple RTC emulation for i.MX6 and i.MX7
boards")
Regards,
Phil.
* Re: [PATCH 2/4] reset: Allow multiple stages of system resets
2024-01-18 15:53 ` Philippe Mathieu-Daudé
@ 2024-01-18 16:15 ` Peter Maydell
2024-01-19 11:10 ` Peter Xu
1 sibling, 0 replies; 21+ messages in thread
From: Peter Maydell @ 2024-01-18 16:15 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: peterx, qemu-devel, Eric Auger, Michael S . Tsirkin, Jason Wang,
Alex Williamson, Igor Mammedov
On Thu, 18 Jan 2024 at 15:53, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
> Unfortunately, despite DeviceReset having been deprecated for 4 years since
> commit c11256aa6f ("hw/core: add Resettable support to BusClass
> and DeviceClass"), we keep adding code using this legacy API; for
> example, in the last 4 months:
>
> - e029bb00a7 ("hw/pci-host: Add Astro system bus adapter found on
> PA-RISC machines")
> - 2880e676c0 ("Add virtio-sound device stub")
> - 4a58330343 ("hw/cxl: Add a switch mailbox CCI function")
> - 95e1019a4a ("vhost-user-scsi: free the inflight area when reset")
> - 6f9c3aaa34 ("fsl-imx: add simple RTC emulation for i.MX6 and i.MX7
> boards")
For plain old leaf device reset methods, I'm not too fussed
about this, because a reset method is exactly equivalent
to a single 'hold' phase method, and the relatively
small amount of code to convert from one to the other isn't
introducing a lot of extra complication.
The cleanup I really want to get back to is the bus reset
handling, because we only have a few buses that don't use
3-phase reset, and if we convert those then we can get rid
of the handling of transitional reset from BusClass
completely. Unfortunately I ran into a regression somewhere
in the PCI reset handling in the patchsets I was working on,
which I never got back to tracking down.
thanks
-- PMM
* Re: [PATCH 2/4] reset: Allow multiple stages of system resets
2024-01-18 15:53 ` Philippe Mathieu-Daudé
2024-01-18 16:15 ` Peter Maydell
@ 2024-01-19 11:10 ` Peter Xu
1 sibling, 0 replies; 21+ messages in thread
From: Peter Xu @ 2024-01-19 11:10 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: Peter Maydell, qemu-devel, Eric Auger, Michael S . Tsirkin,
Jason Wang, Alex Williamson, Igor Mammedov
Hello, Phil, PeterM,
On Thu, Jan 18, 2024 at 04:53:42PM +0100, Philippe Mathieu-Daudé wrote:
> I concur. Device reset is hard, but bus reset is even harder.
> Having a quick look, the issues tracked by Alex & Peter might
> come from the PCI bridges using the legacy DeviceReset.
The challenges we're facing with VFIO on vIOMMU are actually listed in
patch 4's large comment I added, here:
https://lore.kernel.org/qemu-devel/20240117091559.144730-5-peterx@redhat.com/
+ /*
+ * vIOMMU reset may require proper ordering with other devices. There
+ * are two complexities so that normal DeviceState.reset() may not
+ * work properly for vIOMMUs:
+ *
+ * (1) Device depth-first reset hierarchy doesn't yet work for vIOMMUs
+ * (reference: resettable_cold_reset_fn())
+ *
+ * Currently, vIOMMU devices are created as normal '-device'
+ * cmdlines. It means in many ways it has the same attributes with
+ * most of the rest devices, even if the rest devices should
+ * logically be under control of the vIOMMU unit.
+ *
+ * One side effect of it is vIOMMU devices will be currently put
+ * randomly under qdev tree hierarchy, which is the source of
+ * device reset ordering in current QEMU (depth-first traversal).
+ * It means vIOMMU now can be reset before some devices. For fully
+ * emulated devices that's not a problem, because the traversal
+ * holds BQL for the whole process. However it is a problem if DMA
+ * can happen without BQL, like VFIO, vDPA or remote device process.
+ *
+ * TODO: one ideal solution can be that we make vIOMMU the parent
+ * of the whole pci host bridge. Hence vIOMMU can be reset after
+ * all the devices are reset and quiesced.
+ *
+ * (2) Some devices register its own reset functions
+ *
+ * Even if above issue solved, if devices register its own reset
+ * functions for some reason via QEMU reset hooks, vIOMMU can still
+ * be reset before the device. One example is vfio_reset_handler()
+ * where FLR is not supported on the device.
+ *
+ * TODO: merge relevant reset functions into the device tree reset
+ * framework.
+ *
+ * Neither of the above TODOs may be trivial. To make it work for now,
+ * leverage reset stages and always reset the vIOMMU at a later stage
+ * than the default. It means it needs to be reset after at least:
+ *
+ * - resettable_cold_reset_fn(): machine qdev tree reset
+ * - vfio_reset_handler(): VFIO reset for !FLR
+ */
What you're asking makes sense to me, because that's also what I would
prefer to consider (and that's why I called my series "slightly ugly" in
the cover letter), even if I don't yet know much about the other reset
challenges QEMU is already facing.
The issue is just what I mentioned in the cover letter: that complete
solution can be non-trivial and can take a lot of time, and I probably
won't have time in the near term to pursue such a solution.
To fix the issue cleanly, I assume we need to start by making sure all
VFIO paths use only the Resettable interface to reset.
The second part will involve moving the vIOMMU device in the QOM tree to
be the parent of whatever it manages. I didn't follow recent changes in
this area, but currently the vIOMMU is probably an anonymous object
dangling somewhere, more or less orphaned, while "-device" is for now the
interface to create the vIOMMU, which might be too generic. I'm not sure
whether we'll already need a new interface to create the vIOMMU; e.g., it
may ideally need to be the parent of the root pci bus(es) that it manages,
one or multiple. And we need to make sure the vIOMMU is present when the
root pci buses are created, I think. IIUC that can be before parsing
"-device"s.
For the 2nd issue, I assume a ->hold() phase for VT-d to reset address
spaces might be good enough: as long as the reset is done depth-first,
VFIO's ->hold()s will be called before that point. I consider that after
all devices' ->hold() has run there should be no DMA anymore, so exit()
hopefully shouldn't matter.
However, even if hold() works, the hierarchy change is still challenging.
Considering that this issue so far shouldn't cause any functional
breakage, maybe one option is that we gradually fix all of the above,
and until then declare it a known issue.
Any other suggestions would also be greatly welcomed.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset
2024-01-17 9:15 [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices peterx
2024-01-17 9:15 ` [PATCH 1/4] reset: qemu_register_reset_one() peterx
2024-01-17 9:15 ` [PATCH 2/4] reset: Allow multiple stages of system resets peterx
@ 2024-01-17 9:15 ` peterx
2024-01-17 10:29 ` Eric Auger
2024-01-18 8:09 ` Philippe Mathieu-Daudé
2024-01-17 9:15 ` [PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset peterx
` (2 subsequent siblings)
5 siblings, 2 replies; 21+ messages in thread
From: peterx @ 2024-01-17 9:15 UTC (permalink / raw)
To: qemu-devel
Cc: Eric Auger, Michael S . Tsirkin, Jason Wang, Alex Williamson,
Igor Mammedov, peterx
From: Peter Xu <peterx@redhat.com>
No bug report for this, but logically tearing down of existing address
space should happen before reset of IOMMU state / registers, because the
current address spaces may still rely on those information.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/i386/intel_iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1a07faddb4..8b467cbbd2 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4090,8 +4090,8 @@ static void vtd_reset(DeviceState *dev)
{
IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
- vtd_init(s);
vtd_address_space_refresh_all(s);
+ vtd_init(s);
}
static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset
2024-01-17 9:15 ` [PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset peterx
@ 2024-01-17 10:29 ` Eric Auger
2024-01-18 8:09 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 21+ messages in thread
From: Eric Auger @ 2024-01-17 10:29 UTC (permalink / raw)
To: peterx, qemu-devel
Cc: Michael S . Tsirkin, Jason Wang, Alex Williamson, Igor Mammedov
Hi Peter,
On 1/17/24 10:15, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> No bug report for this, but logically tearing down of existing address
> space should happen before reset of IOMMU state / registers, because the
> current address spaces may still rely on those information.
do you mean that the vtd_address_space_refresh_all() implementation may
rely on data reset by vtd_init()?
By the way, the comment before the function is a bit confusing because
it says that we should not reset the address spaces on reset, but isn't
that what is done here?
Thanks
Eric
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/i386/intel_iommu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1a07faddb4..8b467cbbd2 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4090,8 +4090,8 @@ static void vtd_reset(DeviceState *dev)
> {
> IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>
> - vtd_init(s);
> vtd_address_space_refresh_all(s);
> + vtd_init(s);
> }
>
> static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset
2024-01-17 9:15 ` [PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset peterx
2024-01-17 10:29 ` Eric Auger
@ 2024-01-18 8:09 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-01-18 8:09 UTC (permalink / raw)
To: peterx, qemu-devel
Cc: Eric Auger, Michael S . Tsirkin, Jason Wang, Alex Williamson,
Igor Mammedov
Hi Peter,
On 17/1/24 10:15, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> No bug report for this, but logically tearing down of existing address
> space should happen before reset of IOMMU state / registers, because the
> current address spaces may still rely on those information.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/i386/intel_iommu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1a07faddb4..8b467cbbd2 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4090,8 +4090,8 @@ static void vtd_reset(DeviceState *dev)
> {
> IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>
> - vtd_init(s);
> vtd_address_space_refresh_all(s);
> + vtd_init(s);
> }
You might want to convert to 3-phases reset API here, calling
vtd_address_space_refresh_all() in a ResettableEnterPhase handler
and vtd_init() in ResettableHoldPhase (or ResettableExitPhase?).
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset
2024-01-17 9:15 [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices peterx
` (2 preceding siblings ...)
2024-01-17 9:15 ` [PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset peterx
@ 2024-01-17 9:15 ` peterx
2024-01-17 10:38 ` Eric Auger
2024-01-17 10:29 ` [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices Eric Auger
2025-01-23 9:16 ` Eric Auger
5 siblings, 1 reply; 21+ messages in thread
From: peterx @ 2024-01-17 9:15 UTC (permalink / raw)
To: qemu-devel
Cc: Eric Auger, Michael S . Tsirkin, Jason Wang, Alex Williamson,
Igor Mammedov, peterx, YangHang Liu
From: Peter Xu <peterx@redhat.com>
We got report from Yanghang Liu on an unexpected host DMA error when system
resets with VFIO attached to vIOMMU in the VM context. Alex Williamson
quickly spot that there can be ordering issues on resets. A further test
verified that the issue is indeed caused by such wrong ordering.
vIOMMU is a fundamentally infrustructural device, its reset is currently
problematic because no ordering is guaranteed against other PCI devices
which may DMA through the vIOMMU device.
The reset order is tricky, not only because it's current representation as
a normal "-device" (so it kind of follow the qdev tree depth-first reset,
but at a wrong place in the qtree; ideally it should be the parent
somewhere for all pci buses, or just part of pci host bridge), but also
because customized device reset hooks registered over the system reset
framework, so that the ordering of the vIOMMU reset is not guaranteed.
For example, VFIO can register its reset hook with vfio_reset_handler() if
some device does not support FLR. That will not so far follow the
depth-first travelsal reset mechanism provided by QEMU reset framework.
To remedy both of the issues with limited code changes, leverage the newly
introduced reset stage framework to reset vIOMMUs at the last stage of the
rest devices. More information can be found in the comments in the patch,
which I decided to persist even with the code to make the problem even
clearer (with potential TODOs for the future, if possible).
Buglink: https://issues.redhat.com/browse/RHEL-7188
Analyzed-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: YangHang Liu <yanghliu@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/i386/intel_iommu.c | 54 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 52 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8b467cbbd2..5a8fbcad7a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -35,6 +35,7 @@
#include "sysemu/kvm.h"
#include "sysemu/dma.h"
#include "sysemu/sysemu.h"
+#include "sysemu/reset.h"
#include "hw/i386/apic_internal.h"
#include "kvm/kvm_i386.h"
#include "migration/vmstate.h"
@@ -4086,7 +4087,7 @@ static void vtd_init(IntelIOMMUState *s)
/* Should not reset address_spaces when reset because devices will still use
* the address space they got at first (won't ask the bus again).
*/
-static void vtd_reset(DeviceState *dev)
+static void vtd_reset(void *dev)
{
IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
@@ -4242,7 +4243,6 @@ static void vtd_class_init(ObjectClass *klass, void *data)
DeviceClass *dc = DEVICE_CLASS(klass);
X86IOMMUClass *x86_class = X86_IOMMU_DEVICE_CLASS(klass);
- dc->reset = vtd_reset;
dc->vmsd = &vtd_vmstate;
device_class_set_props(dc, vtd_properties);
dc->hotpluggable = false;
@@ -4254,10 +4254,60 @@ static void vtd_class_init(ObjectClass *klass, void *data)
dc->desc = "Intel IOMMU (VT-d) DMA Remapping device";
}
+static void vtd_instance_init(Object *obj)
+{
+ IntelIOMMUState *s = INTEL_IOMMU_DEVICE(obj);
+
+ /*
+ * vIOMMU reset may require proper ordering with other devices. There
+ * are two complexities so that normal DeviceState.reset() may not
+ * work properly for vIOMMUs:
+ *
+ * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs
+ * (reference: resettable_cold_reset_fn())
+ *
+ * Currently, vIOMMU devices are created as normal '-device'
+ * cmdlines. It means in many ways it has the same attributes with
+ * most of the rest devices, even if the rest devices should
+ * logically be under control of the vIOMMU unit.
+ *
+ * One side effect of it is vIOMMU devices will be currently put
+ * randomly under qdev tree hierarchy, which is the source of
+ * device reset ordering in current QEMU (depth-first traversal).
+ * It means vIOMMU now can be reset before some devices. For fully
+ * emulated devices that's not a problem, because the traversal
+ * holds BQL for the whole process. However it is a problem if DMA
+ * can happen without BQL, like VFIO, vDPA or remote device process.
+ *
+ * TODO: one ideal solution can be that we make vIOMMU the parent
+ * of the whole pci host bridge. Hence vIOMMU can be reset after
+ * all the devices are reset and quiesced.
+ *
+ * (2) Some devices register its own reset functions
+ *
+ * Even if above issue solved, if devices register its own reset
+ * functions for some reason via QEMU reset hooks, vIOMMU can still
+ * be reset before the device. One example is vfio_reset_handler()
+ * where FLR is not supported on the device.
+ *
+ * TODO: merge relevant reset functions into the device tree reset
+ * framework.
+ *
+ * Neither of the above TODO may be trivial. To make it work for now,
+ * leverage reset stages and reset vIOMMU always at latter stage of the
+ * default. It means it needs to be reset after at least:
+ *
+ * - resettable_cold_reset_fn(): machine qdev tree reset
+ * - vfio_reset_handler(): VFIO reset for !FLR
+ */
+ qemu_register_reset_one(vtd_reset, s, false, 1);
+}
+
static const TypeInfo vtd_info = {
.name = TYPE_INTEL_IOMMU_DEVICE,
.parent = TYPE_X86_IOMMU_DEVICE,
.instance_size = sizeof(IntelIOMMUState),
+ .instance_init = vtd_instance_init,
.class_init = vtd_class_init,
};
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset
2024-01-17 9:15 ` [PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset peterx
@ 2024-01-17 10:38 ` Eric Auger
0 siblings, 0 replies; 21+ messages in thread
From: Eric Auger @ 2024-01-17 10:38 UTC (permalink / raw)
To: peterx, qemu-devel
Cc: Michael S . Tsirkin, Jason Wang, Alex Williamson, Igor Mammedov,
YangHang Liu
Hi Peter,
On 1/17/24 10:15, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> We got report from Yanghang Liu on an unexpected host DMA error when system
> resets with VFIO attached to vIOMMU in the VM context. Alex Williamson
> quickly spot that there can be ordering issues on resets. A further test
> verified that the issue is indeed caused by such wrong ordering.
nit: not sure the commit msg should contain people info (cover letter
does it already + credits below)
>
> vIOMMU is a fundamentally infrustructural device, its reset is currently
infrastructural
> problematic because no ordering is guaranteed against other PCI devices
> which may DMA through the vIOMMU device.
>
> The reset order is tricky, not only because it's current representation as
s/it's/its
> a normal "-device" (so it kind of follow the qdev tree depth-first reset,
> but at a wrong place in the qtree; ideally it should be the parent
> somewhere for all pci buses, or just part of pci host bridge), but also
> because customized device reset hooks registered over the system reset
> framework, so that the ordering of the vIOMMU reset is not guaranteed.
>
> For example, VFIO can register its reset hook with vfio_reset_handler() if
> some device does not support FLR. That will not so far follow the
> depth-first travelsal reset mechanism provided by QEMU reset framework.
traversal
>
> To remedy both of the issues with limited code changes, leverage the newly
> introduced reset stage framework to reset vIOMMUs at the last stage of the
> rest devices. More information can be found in the comments in the patch,
> which I decided to persist even with the code to make the problem even
> clearer (with potential TODOs for the future, if possible).
>
> Buglink: https://issues.redhat.com/browse/RHEL-7188
> Analyzed-by: Alex Williamson <alex.williamson@redhat.com>
> Reported-by: YangHang Liu <yanghliu@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/i386/intel_iommu.c | 54 +++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 52 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 8b467cbbd2..5a8fbcad7a 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -35,6 +35,7 @@
> #include "sysemu/kvm.h"
> #include "sysemu/dma.h"
> #include "sysemu/sysemu.h"
> +#include "sysemu/reset.h"
> #include "hw/i386/apic_internal.h"
> #include "kvm/kvm_i386.h"
> #include "migration/vmstate.h"
> @@ -4086,7 +4087,7 @@ static void vtd_init(IntelIOMMUState *s)
> /* Should not reset address_spaces when reset because devices will still use
> * the address space they got at first (won't ask the bus again).
> */
> -static void vtd_reset(DeviceState *dev)
> +static void vtd_reset(void *dev)
> {
> IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>
> @@ -4242,7 +4243,6 @@ static void vtd_class_init(ObjectClass *klass, void *data)
> DeviceClass *dc = DEVICE_CLASS(klass);
> X86IOMMUClass *x86_class = X86_IOMMU_DEVICE_CLASS(klass);
>
> - dc->reset = vtd_reset;
> dc->vmsd = &vtd_vmstate;
> device_class_set_props(dc, vtd_properties);
> dc->hotpluggable = false;
> @@ -4254,10 +4254,60 @@ static void vtd_class_init(ObjectClass *klass, void *data)
> dc->desc = "Intel IOMMU (VT-d) DMA Remapping device";
> }
>
> +static void vtd_instance_init(Object *obj)
> +{
> + IntelIOMMUState *s = INTEL_IOMMU_DEVICE(obj);
> +
> + /*
> + * vIOMMU reset may require proper ordering with other devices. There
> + * are two complexities so that normal DeviceState.reset() may not
> + * work properly for vIOMMUs:
> + *
> + * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs
> + * (reference: resettable_cold_reset_fn())
> + *
> + * Currently, vIOMMU devices are created as normal '-device'
> + * cmdlines. It means in many ways it has the same attributes with
s/with/as
> + * most of the rest devices, even if the rest devices should
s/rest/other
> + * logically be under control of the vIOMMU unit.
> + *
> + * One side effect of it is vIOMMU devices will be currently put
> + * randomly under qdev tree hierarchy, which is the source of
> + * device reset ordering in current QEMU (depth-first traversal).
> + * It means vIOMMU now can be reset before some devices. For fully
> + * emulated devices that's not a problem, because the traversal
> + * holds BQL for the whole process. However it is a problem if DMA
> + * can happen without BQL, like VFIO, vDPA or remote device process.
> + *
> + * TODO: one ideal solution can be that we make vIOMMU the parent
> + * of the whole pci host bridge. Hence vIOMMU can be reset after
> + * all the devices are reset and quiesced.
In hw/pci/pci.c there is also the iommu_bus info, so when resetting a
pci device we know whether it is attached to a vIOMMU. I don't know if
we could plug the reset priority mechanism in at this level.
> + *
> + * (2) Some devices register its own reset functions
> + *
> + * Even if above issue solved, if devices register its own reset
s/its/their
> + * functions for some reason via QEMU reset hooks, vIOMMU can still
> + * be reset before the device. One example is vfio_reset_handler()
> + * where FLR is not supported on the device.
> + *
> + * TODO: merge relevant reset functions into the device tree reset
> + * framework.
> + *
> + * Neither of the above TODO may be trivial. To make it work for now,
> + * leverage reset stages and reset vIOMMU always at latter stage of the
> + * default. It means it needs to be reset after at least:
> + *
> + * - resettable_cold_reset_fn(): machine qdev tree reset
> + * - vfio_reset_handler(): VFIO reset for !FLR
> + */
> + qemu_register_reset_one(vtd_reset, s, false, 1);
introducing enum values may be better (just as we have for migration prio)
> +}
> +
> static const TypeInfo vtd_info = {
> .name = TYPE_INTEL_IOMMU_DEVICE,
> .parent = TYPE_X86_IOMMU_DEVICE,
> .instance_size = sizeof(IntelIOMMUState),
> + .instance_init = vtd_instance_init,
> .class_init = vtd_class_init,
> };
>
Thanks
Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices
2024-01-17 9:15 [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices peterx
` (3 preceding siblings ...)
2024-01-17 9:15 ` [PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset peterx
@ 2024-01-17 10:29 ` Eric Auger
2024-01-19 10:46 ` Peter Xu
2025-01-23 9:16 ` Eric Auger
5 siblings, 1 reply; 21+ messages in thread
From: Eric Auger @ 2024-01-17 10:29 UTC (permalink / raw)
To: peterx, qemu-devel
Cc: Michael S . Tsirkin, Jason Wang, Alex Williamson, Igor Mammedov
Hi Peter,
On 1/17/24 10:15, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> There're issue reported that when syetem_reset the VM with an intel iommu
system_reset
> device and MT2892 PF(mlx5_core driver), the host kernel throws DMAR error.
>
> https://issues.redhat.com/browse/RHEL-7188
>
> Alex quickly spot a possible issue on ordering of device resets.
>
> It's verified by our QE team then that it is indeed the root cause of the
> problem. Consider when vIOMMU is reset before a VFIO device in a system
> reset: the device can be doing DMAs even if the vIOMMU is gone; in this
> specific context it means the shadow mapping can already be completely
> destroyed. Host will see these DMAs as malicious and report.
That's curious we did not get this earlier?
>
> To fix it, we'll need to make sure all devices under the vIOMMU device
> hierachy will be reset before the vIOMMU itself. There's plenty of trick
> inside, one can get those by reading the last patch.
Not sure what you meant here ;-)
>
> I didn't check other vIOMMUs, but this series should fix the issue for VT-d
> as of now. The solution can be slightly ugly, but a beautiful one can be
> very non-trivial.
>
> Review comments welcomed, thanks.
Thanks
Eric
>
> Peter Xu (4):
> reset: qemu_register_reset_one()
> reset: Allow multiple stages of system resets
> intel_iommu: Tear down address spaces before IOMMU reset
> intel_iommu: Reset vIOMMU at the last stage of system reset
>
> include/sysemu/reset.h | 5 ++++
> hw/core/reset.c | 67 ++++++++++++++++++++++++++++++------------
> hw/i386/intel_iommu.c | 56 +++++++++++++++++++++++++++++++++--
> 3 files changed, 107 insertions(+), 21 deletions(-)
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices
2024-01-17 10:29 ` [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices Eric Auger
@ 2024-01-19 10:46 ` Peter Xu
0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2024-01-19 10:46 UTC (permalink / raw)
To: Eric Auger
Cc: qemu-devel, Michael S . Tsirkin, Jason Wang, Alex Williamson,
Igor Mammedov
On Wed, Jan 17, 2024 at 11:29:08AM +0100, Eric Auger wrote:
> Hi Peter,
Hi, Eric,
Thanks for the reviews!
>
> On 1/17/24 10:15, peterx@redhat.com wrote:
> > From: Peter Xu <peterx@redhat.com>
> >
> > There're issue reported that when syetem_reset the VM with an intel iommu
> system_reset
> > device and MT2892 PF(mlx5_core driver), the host kernel throws DMAR error.
> >
> > https://issues.redhat.com/browse/RHEL-7188
> >
> > Alex quickly spot a possible issue on ordering of device resets.
> >
> > It's verified by our QE team then that it is indeed the root cause of the
> > problem. Consider when vIOMMU is reset before a VFIO device in a system
> > reset: the device can be doing DMAs even if the vIOMMU is gone; in this
> > specific context it means the shadow mapping can already be completely
> > destroyed. Host will see these DMAs as malicious and report.
> That's curious we did not get this earlier?
I sincerely don't know. It could be that we just didn't test much on
system resets. Or, we could have overlooked the host dmesgs; after all the
error messages can be benign from functional pov.
> >
> > To fix it, we'll need to make sure all devices under the vIOMMU device
> > hierachy will be reset before the vIOMMU itself. There's plenty of trick
> > inside, one can get those by reading the last patch.
> Not sure what you meant here ;-)
I meant that "how to make sure all the vIOMMU-managed devices will be
reset before the vIOMMU" is tricky to implement. I didn't reference any
of those tricks in the cover letter because I stated most of it in patch
4; I want to reference that patch for the details. Since I think it's
very tricky, I left that major comment in the code to persist.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices
2024-01-17 9:15 [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices peterx
` (4 preceding siblings ...)
2024-01-17 10:29 ` [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices Eric Auger
@ 2025-01-23 9:16 ` Eric Auger
2025-01-23 17:57 ` Peter Xu
5 siblings, 1 reply; 21+ messages in thread
From: Eric Auger @ 2025-01-23 9:16 UTC (permalink / raw)
To: peterx, qemu-devel, Peter Maydell
Cc: Michael S . Tsirkin, Jason Wang, Alex Williamson, Igor Mammedov,
Cedric Le Goater, Zhenzhong Duan
Hi,
On 1/17/24 10:15 AM, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> There're issue reported that when syetem_reset the VM with an intel iommu
> device and MT2892 PF(mlx5_core driver), the host kernel throws DMAR error.
>
> https://issues.redhat.com/browse/RHEL-7188
>
> Alex quickly spot a possible issue on ordering of device resets.
>
> It's verified by our QE team then that it is indeed the root cause of the
> problem. Consider when vIOMMU is reset before a VFIO device in a system
> reset: the device can be doing DMAs even if the vIOMMU is gone; in this
> specific context it means the shadow mapping can already be completely
> destroyed. Host will see these DMAs as malicious and report.
>
> To fix it, we'll need to make sure all devices under the vIOMMU device
> hierachy will be reset before the vIOMMU itself. There's plenty of trick
> inside, one can get those by reading the last patch.
I have a case with intel iommu and vhost-net where I see that the vIOMMU
gets disabled by the guest before vhost_dev_stop() causing some spurious
lookup failures. This happens when rebooting the guest (see [PATCH]
hw/virtio/vhost: Disable IOTLB callbacks when IOMMU gets disabled
https://lore.kernel.org/all/20250120173339.865681-1-eric.auger@redhat.com/
for more context).
I observe that
1) the guest disables the IOMMU through the vtd_handle_gcmd_write
2) vtd_reset (initiated from qemu_system_reset)
3) vhost_dev_stop (initiated from qemu_system_reset)
So I can effectively see the vIOMMU is reset before vhost-net stops. 2)
is already an issue that looks the same as the one addressed in this
series. Now I am also in trouble wrt 1). I don't know yet which chain of
events causes the VTD GCMD TE bit to be written, but this would also
cause spurious vhost IOTLB misses.
I haven't seen any follow-up on this series. Is anyone still looking at
this issue? Peter gave some guidance about the way to rework the reset
chain. Is it still up to date?
Thanks
Eric
>
> I didn't check other vIOMMUs, but this series should fix the issue for VT-d
> as of now. The solution can be slightly ugly, but a beautiful one can be
> very non-trivial.
>
> Review comments welcomed, thanks.
>
> Peter Xu (4):
> reset: qemu_register_reset_one()
> reset: Allow multiple stages of system resets
> intel_iommu: Tear down address spaces before IOMMU reset
> intel_iommu: Reset vIOMMU at the last stage of system reset
>
> include/sysemu/reset.h | 5 ++++
> hw/core/reset.c | 67 ++++++++++++++++++++++++++++++------------
> hw/i386/intel_iommu.c | 56 +++++++++++++++++++++++++++++++++--
> 3 files changed, 107 insertions(+), 21 deletions(-)
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices
2025-01-23 9:16 ` Eric Auger
@ 2025-01-23 17:57 ` Peter Xu
2025-01-23 18:02 ` Eric Auger
2025-01-29 18:22 ` Eric Auger
0 siblings, 2 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-23 17:57 UTC (permalink / raw)
To: Eric Auger
Cc: qemu-devel, Peter Maydell, Michael S . Tsirkin, Jason Wang,
Alex Williamson, Igor Mammedov, Cedric Le Goater, Zhenzhong Duan
On Thu, Jan 23, 2025 at 10:16:23AM +0100, Eric Auger wrote:
> I haven't seen any follow-up on this series. Is anyone still looking at
> this issue? Peter gave some guidance about the way to rework the reset
> chain. Is it still up to date?
I haven't continued looking at this issue since that time (and also
stopped working on vIOMMU stuff). No plan to continue from my side. I
suppose nobody else has either, or I should have gotten some email like
this. :)
It may indeed not be up to date, so it may be worth rechecking its
validity.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices
2025-01-23 17:57 ` Peter Xu
@ 2025-01-23 18:02 ` Eric Auger
2025-01-29 18:22 ` Eric Auger
1 sibling, 0 replies; 21+ messages in thread
From: Eric Auger @ 2025-01-23 18:02 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Peter Maydell, Michael S . Tsirkin, Jason Wang,
Alex Williamson, Igor Mammedov, Cedric Le Goater, Zhenzhong Duan
Hi Peter,
On 1/23/25 6:57 PM, Peter Xu wrote:
> On Thu, Jan 23, 2025 at 10:16:23AM +0100, Eric Auger wrote:
>> I haven't seen any follow-up on this series. Is anyone still looking at
>> this issue? Peter gave some guidance about the way to rework the reset
>> chain. Is it still up to date?
> I didn't continue looking at this issue since that time (and also stopped
> working on vIOMMU stuff). No plan to continue from my side.. I suppose
> nobody else has either, or I should have got some email like this. :)
>
> It may not be uptodate indeed, so may worth rechecking its validity.
>
> Thanks,
>
Thank you for the confirmation
Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices
2025-01-23 17:57 ` Peter Xu
2025-01-23 18:02 ` Eric Auger
@ 2025-01-29 18:22 ` Eric Auger
1 sibling, 0 replies; 21+ messages in thread
From: Eric Auger @ 2025-01-29 18:22 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Peter Maydell, Michael S . Tsirkin, Jason Wang,
Alex Williamson, Igor Mammedov, Cedric Le Goater, Zhenzhong Duan
Hi Peter,
On 1/23/25 6:57 PM, Peter Xu wrote:
> On Thu, Jan 23, 2025 at 10:16:23AM +0100, Eric Auger wrote:
>> I haven't seen any follow-up on this series. Is anyone still looking at
>> this issue? Peter gave some guidance about the way to rework the reset
>> chain. Is it still up to date?
> I didn't continue looking at this issue since that time (and also stopped
> working on vIOMMU stuff). No plan to continue from my side.. I suppose
> nobody else has either, or I should have got some email like this. :)
>
> It may not be uptodate indeed, so may worth rechecking its validity.
Thanks for the update. I will try to pursue the effort then.
Eric
>
> Thanks,
>
^ permalink raw reply [flat|nested] 21+ messages in thread