[PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis
@ 2024-09-20 22:29 Tomasz Lis
  2024-09-20 22:29 ` [PATCH 1/4] drm/xe/vf: React to MIGRATED interrupt Tomasz Lis
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Tomasz Lis @ 2024-09-20 22:29 UTC (permalink / raw)
  To: intel-xe; +Cc: Michał Winiarski, Michał Wajdeczko

To support VF Migration, it is necessary to do fixups to any
non-virtualized resources. These fixups need to be applied within
VM, on the KMD working with VF.

This series introduces a post-migration recovery worker, executed
after migration to perform fixups.

Specific fixups will be added later in further series.

Tomasz Lis (4):
  drm/xe/vf: React to MIGRATED interrupt
  drm/xe/vf: Send RESFIX_DONE message at end of VF restore
  drm/xe/vf: Start post-migration fixups with GuC MMIO handshake
  drm/xe/vf: Defer fixups if migrated twice fast

 drivers/gpu/drm/xe/Makefile                   |   1 +
 .../gpu/drm/xe/abi/guc_actions_sriov_abi.h    |  38 ++++
 drivers/gpu/drm/xe/xe_device_types.h          |   1 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c           |  33 ++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h           |   1 +
 drivers/gpu/drm/xe/xe_guc.c                   |   3 +
 drivers/gpu/drm/xe/xe_memirq.c                |   3 +
 drivers/gpu/drm/xe/xe_sriov.c                 |  15 ++
 drivers/gpu/drm/xe/xe_sriov.h                 |   1 +
 drivers/gpu/drm/xe/xe_sriov_types.h           |   6 +
 drivers/gpu/drm/xe/xe_sriov_vf.c              | 170 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_sriov_vf.h              |  17 ++
 12 files changed, 289 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_sriov_vf.c
 create mode 100644 drivers/gpu/drm/xe/xe_sriov_vf.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] drm/xe/vf: React to MIGRATED interrupt
  2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
@ 2024-09-20 22:29 ` Tomasz Lis
  2024-09-23  8:39   ` Michal Wajdeczko
  2024-09-20 22:29 ` [PATCH 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore Tomasz Lis
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Tomasz Lis @ 2024-09-20 22:29 UTC (permalink / raw)
  To: intel-xe; +Cc: Michał Winiarski, Michał Wajdeczko

To properly support VF Save/Restore procedure, fixups need to be
applied after PF driver finishes its part of VF Restore. Those
fixups are applied by the VF driver within a VM.

A VF driver gets informed that it was migrated by receiving an
interrupt from each GuC. That should be the trigger for fixups.

The VF can safely do post-migration fixups on resources associated
to each GuC only after that GuC issued the MIGRATED interrupt.

This change introduces a worker to be used for post-migration fixups,
and a mechanism to schedule said worker when all GuCs sent the irq.

Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/Makefile          |  1 +
 drivers/gpu/drm/xe/xe_device_types.h |  1 +
 drivers/gpu/drm/xe/xe_guc.c          |  3 +
 drivers/gpu/drm/xe/xe_memirq.c       |  3 +
 drivers/gpu/drm/xe/xe_sriov.c        | 15 +++++
 drivers/gpu/drm/xe/xe_sriov.h        |  1 +
 drivers/gpu/drm/xe/xe_sriov_types.h  |  6 ++
 drivers/gpu/drm/xe/xe_sriov_vf.c     | 82 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_sriov_vf.h     | 17 ++++++
 9 files changed, 129 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_sriov_vf.c
 create mode 100644 drivers/gpu/drm/xe/xe_sriov_vf.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index ae245fbd91ee..aa06644dffd5 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -123,6 +123,7 @@ xe-y += \
 	xe_gt_sriov_vf.o \
 	xe_guc_relay.o \
 	xe_memirq.o \
+	xe_sriov_vf.o \
 	xe_sriov.o
 
 xe-$(CONFIG_PCI_IOV) += \
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 5ad96d283a71..331b55b457ab 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -374,6 +374,7 @@ struct xe_device {
 
 		/** @sriov.pf: PF specific data */
 		struct xe_device_pf pf;
+		struct xe_device_vf vf;
 
 		/** @sriov.wq: workqueue used by the virtualization workers */
 		struct workqueue_struct *wq;
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index b6cd5e941f19..65cfd6bd68f1 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -1096,6 +1096,9 @@ void xe_guc_irq_handler(struct xe_guc *guc, const u16 iir)
 {
 	if (iir & GUC_INTR_GUC2HOST)
 		xe_guc_ct_irq_handler(&guc->ct);
+
+	if (iir & GUC_INTR_SW_INT_0)
+		xe_sriov_migrated_event_handler(guc);
 }
 
 void xe_guc_sanitize(struct xe_guc *guc)
diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
index 3f8d4ca64302..2d2f40378942 100644
--- a/drivers/gpu/drm/xe/xe_memirq.c
+++ b/drivers/gpu/drm/xe/xe_memirq.c
@@ -435,6 +435,9 @@ static void memirq_dispatch_guc(struct xe_memirq *memirq, struct iosys_map *stat
 
 	if (memirq_received(memirq, status, ilog2(GUC_INTR_GUC2HOST), name))
 		xe_guc_irq_handler(guc, GUC_INTR_GUC2HOST);
+
+	if (memirq_received(memirq, status, ilog2(GUC_INTR_SW_INT_0), name))
+		xe_guc_irq_handler(guc, GUC_INTR_SW_INT_0);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_sriov.c b/drivers/gpu/drm/xe/xe_sriov.c
index 69a066ef20c0..7447d85104e5 100644
--- a/drivers/gpu/drm/xe/xe_sriov.c
+++ b/drivers/gpu/drm/xe/xe_sriov.c
@@ -9,9 +9,11 @@
 
 #include "xe_assert.h"
 #include "xe_device.h"
+#include "xe_guc.h"
 #include "xe_mmio.h"
 #include "xe_sriov.h"
 #include "xe_sriov_pf.h"
+#include "xe_sriov_vf.h"
 
 /**
  * xe_sriov_mode_to_string - Convert enum value to string.
@@ -112,6 +114,9 @@ int xe_sriov_init(struct xe_device *xe)
 			return err;
 	}
 
+	if (IS_SRIOV_VF(xe))
+		xe_sriov_vf_init_early(xe);
+
 	xe_assert(xe, !xe->sriov.wq);
 	xe->sriov.wq = alloc_workqueue("xe-sriov-wq", 0, 0);
 	if (!xe->sriov.wq)
@@ -150,3 +155,13 @@ const char *xe_sriov_function_name(unsigned int n, char *buf, size_t size)
 		strscpy(buf, "PF", size);
 	return buf;
 }
+
+int xe_sriov_migrated_event_handler(struct xe_guc *guc)
+{
+	struct xe_gt *gt = guc_to_gt(guc);
+
+	if (!IS_SRIOV_VF(gt_to_xe(gt)))
+		return 0;
+
+	return xe_sriov_vf_migrated_event_handler(gt);
+}
diff --git a/drivers/gpu/drm/xe/xe_sriov.h b/drivers/gpu/drm/xe/xe_sriov.h
index 688fbabf08f1..f7575177b75a 100644
--- a/drivers/gpu/drm/xe/xe_sriov.h
+++ b/drivers/gpu/drm/xe/xe_sriov.h
@@ -13,6 +13,7 @@
 struct drm_printer;
 
 const char *xe_sriov_mode_to_string(enum xe_sriov_mode mode);
+int xe_sriov_migrated_event_handler(struct xe_guc *guc);
 const char *xe_sriov_function_name(unsigned int n, char *buf, size_t len);
 
 void xe_sriov_probe_early(struct xe_device *xe);
diff --git a/drivers/gpu/drm/xe/xe_sriov_types.h b/drivers/gpu/drm/xe/xe_sriov_types.h
index c7b7ad4af5c8..a0b590bc7ffa 100644
--- a/drivers/gpu/drm/xe/xe_sriov_types.h
+++ b/drivers/gpu/drm/xe/xe_sriov_types.h
@@ -56,4 +56,10 @@ struct xe_device_pf {
 	struct mutex master_lock;
 };
 
+struct xe_device_vf {
+	/** @migration_worker: migration recovery worker */
+	struct work_struct migration_worker;
+	unsigned long migration_gt_flags;
+};
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
new file mode 100644
index 000000000000..b068c57b2bdc
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023-2024 Intel Corporation
+ */
+
+#include <drm/drm_managed.h>
+
+#include "xe_assert.h"
+#include "xe_device.h"
+#include "xe_guc_ct.h"
+#include "xe_module.h"
+#include "xe_sriov.h"
+#include "xe_sriov_vf.h"
+#include "xe_sriov_printk.h"
+
+static void migration_worker_func(struct work_struct *w);
+
+void xe_sriov_vf_init_early(struct xe_device *xe)
+{
+	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
+}
+
+static void vf_post_migration_recovery(struct xe_device *xe)
+{
+	drm_dbg(&xe->drm, "migration recovery in progress\n");
+	/* FIXME: add the recovery steps */
+	drm_notice(&xe->drm, "migration recovery completed\n");
+}
+
+static void migration_worker_func(struct work_struct *w)
+{
+	struct xe_device *xe = container_of(w, struct xe_device,
+						     sriov.vf.migration_worker);
+
+	vf_post_migration_recovery(xe);
+}
+
+/**
+ * xe_sriov_vf_start_migration_recovery - Start VF migration recovery.
+ * @xe: the &xe_device to start recovery on
+ *
+ * This function shall be called only by VF.
+ */
+static void xe_sriov_vf_start_migration_recovery(struct xe_device *xe)
+{
+	bool started;
+
+	XE_WARN_ON(!IS_SRIOV_VF(xe));
+
+	WRITE_ONCE(xe->sriov.vf.migration_gt_flags, 0);
+	smp_mb();
+
+	started = queue_work(xe->sriov.wq, &xe->sriov.vf.migration_worker);
+	dev_info(xe->drm.dev, "VF migration recovery %s\n", started ?
+		 "scheduled" : "already in progress");
+}
+
+static bool vf_ready_to_recovery_on_all_tiles(struct xe_device *xe)
+{
+	struct xe_gt *gt;
+	unsigned int id;
+
+	for_each_gt(gt, xe, id) {
+		if (!test_bit(id, &xe->sriov.vf.migration_gt_flags))
+			return false;
+	}
+	return true;
+}
+
+int xe_sriov_vf_migrated_event_handler(struct xe_gt *gt)
+{
+	struct xe_device *xe = gt_to_xe(gt);
+
+	set_bit(gt->info.id, &xe->sriov.vf.migration_gt_flags);
+	smp_mb__after_atomic();
+	dev_info(xe->drm.dev, "VF migration recovery ready on gt%d\n",
+		 gt->info.id);
+	if (vf_ready_to_recovery_on_all_tiles(xe))
+		xe_sriov_vf_start_migration_recovery(xe);
+
+	return -EREMOTEIO;
+}
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.h b/drivers/gpu/drm/xe/xe_sriov_vf.h
new file mode 100644
index 000000000000..d70cd84747e5
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023-2024 Intel Corporation
+ */
+
+#ifndef _XE_SRIOV_VF_H_
+#define _XE_SRIOV_VF_H_
+
+#include <linux/types.h>
+
+struct xe_device;
+struct xe_gt;
+
+void xe_sriov_vf_init_early(struct xe_device *xe);
+int xe_sriov_vf_migrated_event_handler(struct xe_gt *gt);
+
+#endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] drm/xe/vf: React to MIGRATED interrupt
  2024-09-20 22:29 ` [PATCH 1/4] drm/xe/vf: React to MIGRATED interrupt Tomasz Lis
@ 2024-09-23  8:39   ` Michal Wajdeczko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Wajdeczko @ 2024-09-23  8:39 UTC (permalink / raw)
  To: Tomasz Lis, intel-xe; +Cc: Michał Winiarski



On 21.09.2024 00:29, Tomasz Lis wrote:
> To properly support VF Save/Restore procedure, fixups need to be
> applied after PF driver finishes its part of VF Restore. Those
> fixups are applied by the VF driver within a VM.
> 
> A VF driver gets informed that it was migrated by receiving an
> interrupt from each GuC. That should be the trigger for fixups.
> 
> The VF can safely do post-migration fixups on resources associated
> to each GuC only after that GuC issued the MIGRATED interrupt.
> 
> This change introduces a worker to be used for post-migration fixups,
> and a mechanism to schedule said worker when all GuCs sent the irq.
> 
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile          |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  1 +
>  drivers/gpu/drm/xe/xe_guc.c          |  3 +
>  drivers/gpu/drm/xe/xe_memirq.c       |  3 +
>  drivers/gpu/drm/xe/xe_sriov.c        | 15 +++++
>  drivers/gpu/drm/xe/xe_sriov.h        |  1 +
>  drivers/gpu/drm/xe/xe_sriov_types.h  |  6 ++
>  drivers/gpu/drm/xe/xe_sriov_vf.c     | 82 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_sriov_vf.h     | 17 ++++++
>  9 files changed, 129 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_sriov_vf.c
>  create mode 100644 drivers/gpu/drm/xe/xe_sriov_vf.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index ae245fbd91ee..aa06644dffd5 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -123,6 +123,7 @@ xe-y += \
>  	xe_gt_sriov_vf.o \
>  	xe_guc_relay.o \
>  	xe_memirq.o \
> +	xe_sriov_vf.o \
>  	xe_sriov.o
>  
>  xe-$(CONFIG_PCI_IOV) += \
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 5ad96d283a71..331b55b457ab 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -374,6 +374,7 @@ struct xe_device {
>  
>  		/** @sriov.pf: PF specific data */
>  		struct xe_device_pf pf;
> +		struct xe_device_vf vf;

missing kernel-doc

>  
>  		/** @sriov.wq: workqueue used by the virtualization workers */
>  		struct workqueue_struct *wq;
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index b6cd5e941f19..65cfd6bd68f1 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -1096,6 +1096,9 @@ void xe_guc_irq_handler(struct xe_guc *guc, const u16 iir)
>  {
>  	if (iir & GUC_INTR_GUC2HOST)
>  		xe_guc_ct_irq_handler(&guc->ct);
> +
> +	if (iir & GUC_INTR_SW_INT_0)
> +		xe_sriov_migrated_event_handler(guc);

this likely should be

	xe_gt_sriov_vf_migrated_event_handler(guc_to_gt(guc));

>  }
>  
>  void xe_guc_sanitize(struct xe_guc *guc)
> diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
> index 3f8d4ca64302..2d2f40378942 100644
> --- a/drivers/gpu/drm/xe/xe_memirq.c
> +++ b/drivers/gpu/drm/xe/xe_memirq.c
> @@ -435,6 +435,9 @@ static void memirq_dispatch_guc(struct xe_memirq *memirq, struct iosys_map *stat
>  
>  	if (memirq_received(memirq, status, ilog2(GUC_INTR_GUC2HOST), name))
>  		xe_guc_irq_handler(guc, GUC_INTR_GUC2HOST);
> +
> +	if (memirq_received(memirq, status, ilog2(GUC_INTR_SW_INT_0), name))
> +		xe_guc_irq_handler(guc, GUC_INTR_SW_INT_0);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_sriov.c b/drivers/gpu/drm/xe/xe_sriov.c
> index 69a066ef20c0..7447d85104e5 100644
> --- a/drivers/gpu/drm/xe/xe_sriov.c
> +++ b/drivers/gpu/drm/xe/xe_sriov.c
> @@ -9,9 +9,11 @@
>  
>  #include "xe_assert.h"
>  #include "xe_device.h"
> +#include "xe_guc.h"
>  #include "xe_mmio.h"
>  #include "xe_sriov.h"
>  #include "xe_sriov_pf.h"
> +#include "xe_sriov_vf.h"
>  
>  /**
>   * xe_sriov_mode_to_string - Convert enum value to string.
> @@ -112,6 +114,9 @@ int xe_sriov_init(struct xe_device *xe)
>  			return err;
>  	}
>  
> +	if (IS_SRIOV_VF(xe))
> +		xe_sriov_vf_init_early(xe);
> +
>  	xe_assert(xe, !xe->sriov.wq);
>  	xe->sriov.wq = alloc_workqueue("xe-sriov-wq", 0, 0);
>  	if (!xe->sriov.wq)
> @@ -150,3 +155,13 @@ const char *xe_sriov_function_name(unsigned int n, char *buf, size_t size)
>  		strscpy(buf, "PF", size);
>  	return buf;
>  }
> +
> +int xe_sriov_migrated_event_handler(struct xe_guc *guc)

if this function is strictly GuC related then it should have prefix

	xe_guc_...

and likely be placed in xe_guc.c file

but if we treat this function is GT related then it should have prefix

	xe_gt_sriov_...

and since it's VF specific it's name should also include VF:

	xe_gt_sriov_vf_...

and be in VF specific file xe_gt_sriov_vf.c

and maybe this function can be void

> +{
> +	struct xe_gt *gt = guc_to_gt(guc);
> +
> +	if (!IS_SRIOV_VF(gt_to_xe(gt)))
> +		return 0;

either this should be xe_gt_assert() or function should be named like

	xe_guc_sw_0_irq_handler(guc)

and then call migration stuff only if VF

> +
> +	return xe_sriov_vf_migrated_event_handler(gt);

if this is GT related function then it should start with:

	xe_gt_sriov_vf_...

> +}
> diff --git a/drivers/gpu/drm/xe/xe_sriov.h b/drivers/gpu/drm/xe/xe_sriov.h
> index 688fbabf08f1..f7575177b75a 100644
> --- a/drivers/gpu/drm/xe/xe_sriov.h
> +++ b/drivers/gpu/drm/xe/xe_sriov.h
> @@ -13,6 +13,7 @@
>  struct drm_printer;
>  
>  const char *xe_sriov_mode_to_string(enum xe_sriov_mode mode);
> +int xe_sriov_migrated_event_handler(struct xe_guc *guc);

this header is not for VF (or PF) specific functions
nor for GuC/GT oriented functions

>  const char *xe_sriov_function_name(unsigned int n, char *buf, size_t len);
>  
>  void xe_sriov_probe_early(struct xe_device *xe);
> diff --git a/drivers/gpu/drm/xe/xe_sriov_types.h b/drivers/gpu/drm/xe/xe_sriov_types.h
> index c7b7ad4af5c8..a0b590bc7ffa 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_types.h
> +++ b/drivers/gpu/drm/xe/xe_sriov_types.h
> @@ -56,4 +56,10 @@ struct xe_device_pf {
>  	struct mutex master_lock;
>  };
>  
> +struct xe_device_vf {

missing kernel-doc for the struct

> +	/** @migration_worker: migration recovery worker */
> +	struct work_struct migration_worker;
> +	unsigned long migration_gt_flags;

missing kernel-doc for the flags

btw, instead of adding 'migration' prefix better option seems to be to
define anonymous struct named 'migration' with .flags and .worker


> +};
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> new file mode 100644
> index 000000000000..b068c57b2bdc
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -0,0 +1,82 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023-2024 Intel Corporation
> + */
> +
> +#include <drm/drm_managed.h>
> +
> +#include "xe_assert.h"
> +#include "xe_device.h"
> +#include "xe_guc_ct.h"

is this include needed ?

> +#include "xe_module.h"

is this include needed ?

> +#include "xe_sriov.h"
> +#include "xe_sriov_vf.h"
> +#include "xe_sriov_printk.h"
> +
> +static void migration_worker_func(struct work_struct *w);
> +
> +void xe_sriov_vf_init_early(struct xe_device *xe)

add kernel-doc

> +{
> +	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
> +}
> +
> +static void vf_post_migration_recovery(struct xe_device *xe)
> +{
> +	drm_dbg(&xe->drm, "migration recovery in progress\n");
> +	/* FIXME: add the recovery steps */
> +	drm_notice(&xe->drm, "migration recovery completed\n");

maybe add this message only after adding those still missing recovery
steps, as otherwise it will be little misleading

> +}
> +
> +static void migration_worker_func(struct work_struct *w)
> +{
> +	struct xe_device *xe = container_of(w, struct xe_device,
> +						     sriov.vf.migration_worker);

looks like this line unaligned to above (

> +
> +	vf_post_migration_recovery(xe);
> +}
> +
> +/**
> + * xe_sriov_vf_start_migration_recovery - Start VF migration recovery.

usually we don't add full kernel-doc to static functions

> + * @xe: the &xe_device to start recovery on
> + *
> + * This function shall be called only by VF.
> + */
> +static void xe_sriov_vf_start_migration_recovery(struct xe_device *xe)
> +{
> +	bool started;
> +
> +	XE_WARN_ON(!IS_SRIOV_VF(xe));

use xe_assert() instead

> +
> +	WRITE_ONCE(xe->sriov.vf.migration_gt_flags, 0);
> +	smp_mb();
> +
> +	started = queue_work(xe->sriov.wq, &xe->sriov.vf.migration_worker);
> +	dev_info(xe->drm.dev, "VF migration recovery %s\n", started ?
> +		 "scheduled" : "already in progress");

in xe.ko we prefer drm_info() over dev_info()
and xe_gt_info() if something is GT related

> +}
> +
> +static bool vf_ready_to_recovery_on_all_tiles(struct xe_device *xe)
> +{
> +	struct xe_gt *gt;
> +	unsigned int id;
> +
> +	for_each_gt(gt, xe, id) {
> +		if (!test_bit(id, &xe->sriov.vf.migration_gt_flags))
> +			return false;
> +	}
> +	return true;
> +}
> +
> +int xe_sriov_vf_migrated_event_handler(struct xe_gt *gt)

since this is a GT-level function, it should be rather placed in the
xe_gt_sriov_vf.c file

and also please add kernel-doc as this is public function

> +{
> +	struct xe_device *xe = gt_to_xe(gt);
> +
> +	set_bit(gt->info.id, &xe->sriov.vf.migration_gt_flags);
> +	smp_mb__after_atomic();
> +	dev_info(xe->drm.dev, "VF migration recovery ready on gt%d\n",
> +		 gt->info.id);

probably it would be better to just use xe_gt_info() here

> +	if (vf_ready_to_recovery_on_all_tiles(xe))

maybe this could be part of the function below?

> +		xe_sriov_vf_start_migration_recovery(xe);
> +
> +	return -EREMOTEIO;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.h b/drivers/gpu/drm/xe/xe_sriov_vf.h
> new file mode 100644
> index 000000000000..d70cd84747e5
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2023-2024 Intel Corporation
> + */
> +
> +#ifndef _XE_SRIOV_VF_H_
> +#define _XE_SRIOV_VF_H_
> +
> +#include <linux/types.h>

likely you don't need this yet here

> +
> +struct xe_device;
> +struct xe_gt;
> +
> +void xe_sriov_vf_init_early(struct xe_device *xe);
> +int xe_sriov_vf_migrated_event_handler(struct xe_gt *gt);
> +
> +#endif


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore
  2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
  2024-09-20 22:29 ` [PATCH 1/4] drm/xe/vf: React to MIGRATED interrupt Tomasz Lis
@ 2024-09-20 22:29 ` Tomasz Lis
  2024-09-23 11:55   ` Michal Wajdeczko
  2024-09-20 22:29 ` [PATCH 3/4] drm/xe/vf: Start post-migration fixups with GuC MMIO handshake Tomasz Lis
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Tomasz Lis @ 2024-09-20 22:29 UTC (permalink / raw)
  To: intel-xe; +Cc: Michał Winiarski, Michał Wajdeczko

After restore, GuC will not answer to any messages from VF KMD until
fixups are applied. When that is done, VF KMD sends RESFIX_DONE
message to GuC, at which point GuC resumes normal operation.

This patch implements sending the RESFIX_DONE message at end of
post-migration recovery.

Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 .../gpu/drm/xe/abi/guc_actions_sriov_abi.h    | 38 +++++++++++++++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c           | 33 ++++++++++++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h           |  1 +
 drivers/gpu/drm/xe/xe_sriov_vf.c              | 23 +++++++++++
 4 files changed, 95 insertions(+)

diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
index b6a1852749dd..74874bbba10c 100644
--- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
+++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
@@ -501,6 +501,44 @@
 #define VF2GUC_VF_RESET_RESPONSE_MSG_LEN		GUC_HXG_RESPONSE_MSG_MIN_LEN
 #define VF2GUC_VF_RESET_RESPONSE_MSG_0_MBZ		GUC_HXG_RESPONSE_MSG_0_DATA0
 
+/**
+ * DOC: VF2GUC_NOTIFY_RESFIX_DONE
+ *
+ * This action is used by VF to notify the GuC that the VF KMD has completed
+ * post-migration recovery steps.
+ *
+ * This message must be sent as `MMIO HXG Message`_.
+ *
+ *  +---+-------+--------------------------------------------------------------+
+ *  |   | Bits  | Description                                                  |
+ *  +===+=======+==============================================================+
+ *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_HOST_                                |
+ *  |   +-------+--------------------------------------------------------------+
+ *  |   | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_                                 |
+ *  |   +-------+--------------------------------------------------------------+
+ *  |   | 27:16 | DATA0 = MBZ                                                  |
+ *  |   +-------+--------------------------------------------------------------+
+ *  |   |  15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE` = 0x5508    |
+ *  +---+-------+--------------------------------------------------------------+
+ *
+ *  +---+-------+--------------------------------------------------------------+
+ *  |   | Bits  | Description                                                  |
+ *  +===+=======+==============================================================+
+ *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_GUC_                                 |
+ *  |   +-------+--------------------------------------------------------------+
+ *  |   | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_                        |
+ *  |   +-------+--------------------------------------------------------------+
+ *  |   |  27:0 | DATA0 = MBZ                                                  |
+ *  +---+-------+--------------------------------------------------------------+
+ */
+#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE		0x5508
+
+#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_LEN	GUC_HXG_REQUEST_MSG_MIN_LEN
+#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_0_MBZ	GUC_HXG_REQUEST_MSG_0_DATA0
+
+#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_LEN	GUC_HXG_RESPONSE_MSG_MIN_LEN
+#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_0_MBZ	GUC_HXG_RESPONSE_MSG_0_DATA0
+
 /**
  * DOC: VF2GUC_QUERY_SINGLE_KLV
  *
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index d3baba50f085..08b5f6912923 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -223,6 +223,39 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt)
 	return 0;
 }
 
+static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
+{
+	u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
+		FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
+		FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
+		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
+	};
+	int ret;
+
+	ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
+
+	return ret > 0 ? -EPROTO : ret;
+}
+
+/**
+ * xe_gt_sriov_vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
+ * @gt: the &xe_gt struct instance linked to target GuC
+ */
+int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
+{
+	struct xe_guc *guc = &gt->uc.guc;
+	int err;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
+
+	err = guc_action_vf_notify_resfix_done(guc);
+	if (unlikely(err))
+		xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
+				ERR_PTR(err));
+
+	return err;
+}
+
 static int guc_action_query_single_klv(struct xe_guc *guc, u32 key,
 				       u32 *value, u32 value_len)
 {
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index e541ce57bec2..97e8c76a641a 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -17,6 +17,7 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
 int xe_gt_sriov_vf_connect(struct xe_gt *gt);
 int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
 int xe_gt_sriov_vf_prepare_ggtt(struct xe_gt *gt);
+int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
 u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index b068c57b2bdc..459fa936aaba 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -7,8 +7,10 @@
 
 #include "xe_assert.h"
 #include "xe_device.h"
+#include "xe_gt_sriov_vf.h"
 #include "xe_guc_ct.h"
 #include "xe_module.h"
+#include "xe_pm.h"
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
 #include "xe_sriov_printk.h"
@@ -20,10 +22,31 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
 	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
 }
 
+/*
+ * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
+ * @xe: the &xe_device struct instance
+ */
+static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
+{
+	struct xe_gt *gt;
+	unsigned int id;
+	int err, num_sent = 0;
+
+	xe_pm_runtime_get(xe);
+	for_each_gt(gt, xe, id) {
+		err = xe_gt_sriov_vf_notify_resfix_done(gt);
+		if (!err)
+			num_sent++;
+	}
+	xe_pm_runtime_put(xe);
+	drm_dbg(&xe->drm, "sent %d VF resource fixups done notifications\n", num_sent);
+}
+
 static void vf_post_migration_recovery(struct xe_device *xe)
 {
 	drm_dbg(&xe->drm, "migration recovery in progress\n");
 	/* FIXME: add the recovery steps */
+	vf_post_migration_notify_resfix_done(xe);
 	drm_notice(&xe->drm, "migration recovery completed\n");
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore
  2024-09-20 22:29 ` [PATCH 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore Tomasz Lis
@ 2024-09-23 11:55   ` Michal Wajdeczko
  2024-09-23 20:52     ` Lis, Tomasz
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Wajdeczko @ 2024-09-23 11:55 UTC (permalink / raw)
  To: Tomasz Lis, intel-xe; +Cc: Michał Winiarski



On 21.09.2024 00:29, Tomasz Lis wrote:
> After restore, GuC will not answer to any messages from VF KMD until
> fixups are applied. When that is done, VF KMD sends RESFIX_DONE
> message to GuC, at which point GuC resumes normal operation.
> 
> This patch implements sending the RESFIX_DONE message at end of
> post-migration recovery.
> 
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>  .../gpu/drm/xe/abi/guc_actions_sriov_abi.h    | 38 +++++++++++++++++++
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c           | 33 ++++++++++++++++
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.h           |  1 +
>  drivers/gpu/drm/xe/xe_sriov_vf.c              | 23 +++++++++++
>  4 files changed, 95 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> index b6a1852749dd..74874bbba10c 100644
> --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> @@ -501,6 +501,44 @@
>  #define VF2GUC_VF_RESET_RESPONSE_MSG_LEN		GUC_HXG_RESPONSE_MSG_MIN_LEN
>  #define VF2GUC_VF_RESET_RESPONSE_MSG_0_MBZ		GUC_HXG_RESPONSE_MSG_0_DATA0
>  
> +/**
> + * DOC: VF2GUC_NOTIFY_RESFIX_DONE
> + *
> + * This action is used by VF to notify the GuC that the VF KMD has completed
> + * post-migration recovery steps.
> + *
> + * This message must be sent as `MMIO HXG Message`_.
> + *
> + *  +---+-------+--------------------------------------------------------------+
> + *  |   | Bits  | Description                                                  |
> + *  +===+=======+==============================================================+
> + *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_HOST_                                |
> + *  |   +-------+--------------------------------------------------------------+
> + *  |   | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_                                 |
> + *  |   +-------+--------------------------------------------------------------+
> + *  |   | 27:16 | DATA0 = MBZ                                                  |
> + *  |   +-------+--------------------------------------------------------------+
> + *  |   |  15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE` = 0x5508    |
> + *  +---+-------+--------------------------------------------------------------+
> + *
> + *  +---+-------+--------------------------------------------------------------+
> + *  |   | Bits  | Description                                                  |
> + *  +===+=======+==============================================================+
> + *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_GUC_                                 |
> + *  |   +-------+--------------------------------------------------------------+
> + *  |   | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_                        |
> + *  |   +-------+--------------------------------------------------------------+
> + *  |   |  27:0 | DATA0 = MBZ                                                  |
> + *  +---+-------+--------------------------------------------------------------+
> + */
> +#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE		0x5508

please make it unsigned (like all newer definitions)	0x5508u

> +
> +#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_LEN	GUC_HXG_REQUEST_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_0_MBZ	GUC_HXG_REQUEST_MSG_0_DATA0
> +
> +#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_LEN	GUC_HXG_RESPONSE_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_0_MBZ	GUC_HXG_RESPONSE_MSG_0_DATA0
> +
>  /**
>   * DOC: VF2GUC_QUERY_SINGLE_KLV
>   *
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index d3baba50f085..08b5f6912923 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -223,6 +223,39 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt)
>  	return 0;
>  }
>  
> +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> +{
> +	u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> +		FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> +		FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> +		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
> +	};
> +	int ret;
> +
> +	ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
> +
> +	return ret > 0 ? -EPROTO : ret;
> +}
> +
> +/**
> + * xe_gt_sriov_vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> + * @gt: the &xe_gt struct instance linked to target GuC

don't forget about "Return:"

> + */
> +int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
> +{
> +	struct xe_guc *guc = &gt->uc.guc;
> +	int err;
> +
> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> +	err = guc_action_vf_notify_resfix_done(guc);
> +	if (unlikely(err))
> +		xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
> +				ERR_PTR(err));
> +
> +	return err;
> +}
> +
>  static int guc_action_query_single_klv(struct xe_guc *guc, u32 key,
>  				       u32 *value, u32 value_len)
>  {
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index e541ce57bec2..97e8c76a641a 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -17,6 +17,7 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
>  int xe_gt_sriov_vf_connect(struct xe_gt *gt);
>  int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
>  int xe_gt_sriov_vf_prepare_ggtt(struct xe_gt *gt);
> +int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
>  
>  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index b068c57b2bdc..459fa936aaba 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -7,8 +7,10 @@
>  
>  #include "xe_assert.h"
>  #include "xe_device.h"
> +#include "xe_gt_sriov_vf.h"
>  #include "xe_guc_ct.h"
>  #include "xe_module.h"
> +#include "xe_pm.h"
>  #include "xe_sriov.h"
>  #include "xe_sriov_vf.h"
>  #include "xe_sriov_printk.h"
> @@ -20,10 +22,31 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
>  	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
>  }
>  
> +/*
> + * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
> + * @xe: the &xe_device struct instance
> + */
> +static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
> +{
> +	struct xe_gt *gt;
> +	unsigned int id;
> +	int err, num_sent = 0;
> +
> +	xe_pm_runtime_get(xe);

maybe pm_get/put can done once inside the top level recovery function:
vf_post_migration_recovery()

> +	for_each_gt(gt, xe, id) {
> +		err = xe_gt_sriov_vf_notify_resfix_done(gt);
> +		if (!err)
> +			num_sent++;
> +	}
> +	xe_pm_runtime_put(xe);
> +	drm_dbg(&xe->drm, "sent %d VF resource fixups done notifications\n", num_sent);

hmm, so what's the plan for handling the cases when one or all
notify-fixup-done fail? are we going wedge or will silently ignore like
here?

> +}
> +
>  static void vf_post_migration_recovery(struct xe_device *xe)
>  {
>  	drm_dbg(&xe->drm, "migration recovery in progress\n");
>  	/* FIXME: add the recovery steps */
> +	vf_post_migration_notify_resfix_done(xe);

hmm, shouldn't we have some kind of guard to actually track whether we
succeed with the fixups and only then send notify-done?

otherwise in addition that we're already cheating the user with below
message 'recovery-completed', now we cheating the GuC that 'fixup-done'

there likely always be a cases where something went wrong, so we either
end up with reset or wedge, so maybe we should also start with that and
only send 'fixup-done' when passing all post-migration steps

>  	drm_notice(&xe->drm, "migration recovery completed\n");
>  }
>  


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore
  2024-09-23 11:55   ` Michal Wajdeczko
@ 2024-09-23 20:52     ` Lis, Tomasz
  0 siblings, 0 replies; 13+ messages in thread
From: Lis, Tomasz @ 2024-09-23 20:52 UTC (permalink / raw)
  To: Michal Wajdeczko, intel-xe; +Cc: Michał Winiarski


On 23.09.2024 13:55, Michal Wajdeczko wrote:
>
> On 21.09.2024 00:29, Tomasz Lis wrote:
>> After restore, GuC will not answer to any messages from VF KMD until
>> fixups are applied. When that is done, VF KMD sends RESFIX_DONE
>> message to GuC, at which point GuC resumes normal operation.
>>
>> This patch implements sending the RESFIX_DONE message at end of
>> post-migration recovery.
>>
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   .../gpu/drm/xe/abi/guc_actions_sriov_abi.h    | 38 +++++++++++++++++++
>>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c           | 33 ++++++++++++++++
>>   drivers/gpu/drm/xe/xe_gt_sriov_vf.h           |  1 +
>>   drivers/gpu/drm/xe/xe_sriov_vf.c              | 23 +++++++++++
>>   4 files changed, 95 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
>> index b6a1852749dd..74874bbba10c 100644
>> --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
>> +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
>> @@ -501,6 +501,44 @@
>>   #define VF2GUC_VF_RESET_RESPONSE_MSG_LEN		GUC_HXG_RESPONSE_MSG_MIN_LEN
>>   #define VF2GUC_VF_RESET_RESPONSE_MSG_0_MBZ		GUC_HXG_RESPONSE_MSG_0_DATA0
>>   
>> +/**
>> + * DOC: VF2GUC_NOTIFY_RESFIX_DONE
>> + *
>> + * This action is used by VF to notify the GuC that the VF KMD has completed
>> + * post-migration recovery steps.
>> + *
>> + * This message must be sent as `MMIO HXG Message`_.
>> + *
>> + *  +---+-------+--------------------------------------------------------------+
>> + *  |   | Bits  | Description                                                  |
>> + *  +===+=======+==============================================================+
>> + *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_HOST_                                |
>> + *  |   +-------+--------------------------------------------------------------+
>> + *  |   | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_                                 |
>> + *  |   +-------+--------------------------------------------------------------+
>> + *  |   | 27:16 | DATA0 = MBZ                                                  |
>> + *  |   +-------+--------------------------------------------------------------+
>> + *  |   |  15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE` = 0x5508    |
>> + *  +---+-------+--------------------------------------------------------------+
>> + *
>> + *  +---+-------+--------------------------------------------------------------+
>> + *  |   | Bits  | Description                                                  |
>> + *  +===+=======+==============================================================+
>> + *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_GUC_                                 |
>> + *  |   +-------+--------------------------------------------------------------+
>> + *  |   | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_                        |
>> + *  |   +-------+--------------------------------------------------------------+
>> + *  |   |  27:0 | DATA0 = MBZ                                                  |
>> + *  +---+-------+--------------------------------------------------------------+
>> + */
>> +#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE		0x5508
> please make it unsigned (like all newer definitions)	0x5508u
ack
>
>> +
>> +#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_LEN	GUC_HXG_REQUEST_MSG_MIN_LEN
>> +#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_0_MBZ	GUC_HXG_REQUEST_MSG_0_DATA0
>> +
>> +#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_LEN	GUC_HXG_RESPONSE_MSG_MIN_LEN
>> +#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_0_MBZ	GUC_HXG_RESPONSE_MSG_0_DATA0
>> +
>>   /**
>>    * DOC: VF2GUC_QUERY_SINGLE_KLV
>>    *
>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>> index d3baba50f085..08b5f6912923 100644
>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>> @@ -223,6 +223,39 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt)
>>   	return 0;
>>   }
>>   
>> +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
>> +{
>> +	u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
>> +		FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
>> +		FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
>> +		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
>> +	};
>> +	int ret;
>> +
>> +	ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
>> +
>> +	return ret > 0 ? -EPROTO : ret;
>> +}
>> +
>> +/**
>> + * xe_gt_sriov_vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
>> + * @gt: the &xe_gt struct instance linked to target GuC
> don't forget about "Return:"
ok
>
>> + */
>> +int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
>> +{
>> +	struct xe_guc *guc = &gt->uc.guc;
>> +	int err;
>> +
>> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>> +
>> +	err = guc_action_vf_notify_resfix_done(guc);
>> +	if (unlikely(err))
>> +		xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
>> +				ERR_PTR(err));
>> +
>> +	return err;
>> +}
>> +
>>   static int guc_action_query_single_klv(struct xe_guc *guc, u32 key,
>>   				       u32 *value, u32 value_len)
>>   {
>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>> index e541ce57bec2..97e8c76a641a 100644
>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>> @@ -17,6 +17,7 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
>>   int xe_gt_sriov_vf_connect(struct xe_gt *gt);
>>   int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
>>   int xe_gt_sriov_vf_prepare_ggtt(struct xe_gt *gt);
>> +int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
>>   
>>   u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
>>   u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> index b068c57b2bdc..459fa936aaba 100644
>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> @@ -7,8 +7,10 @@
>>   
>>   #include "xe_assert.h"
>>   #include "xe_device.h"
>> +#include "xe_gt_sriov_vf.h"
>>   #include "xe_guc_ct.h"
>>   #include "xe_module.h"
>> +#include "xe_pm.h"
>>   #include "xe_sriov.h"
>>   #include "xe_sriov_vf.h"
>>   #include "xe_sriov_printk.h"
>> @@ -20,10 +22,31 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
>>   	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
>>   }
>>   
>> +/*
>> + * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
>> + * @xe: the &xe_device struct instance
>> + */
>> +static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
>> +{
>> +	struct xe_gt *gt;
>> +	unsigned int id;
>> +	int err, num_sent = 0;
>> +
>> +	xe_pm_runtime_get(xe);
> maybe pm_get/put can done once inside the top level recovery function:
> vf_post_migration_recovery()

It would work. But I don't think there is a need for that.

While applying the actual fixups, we do not require the hardware - it is 
necessary only to query new provisioning at start, and to send 
RESFIX_DONE at end.

>
>> +	for_each_gt(gt, xe, id) {
>> +		err = xe_gt_sriov_vf_notify_resfix_done(gt);
>> +		if (!err)
>> +			num_sent++;
>> +	}
>> +	xe_pm_runtime_put(xe);
>> +	drm_dbg(&xe->drm, "sent %d VF resource fixups done notifications\n", num_sent);
> hmm, so what's the plan for handling the cases when one or all
> notify-fixup-done fail? are we going wedge or will silently ignore like
> here?

If RESFIX_DONE gives error, then the GuC was either not in RESFIX state 
or is not responding. If not in RESFIX state, that would mean the 
current procedure comes from a second migration just before sending 
RESFIX_DONE on recovery from the first, or GuC got reset by PF. So we 
have 3 cases:

* not in RESFIX state due to RESFIX_DONE from previous migration being sent:

If this happened, we were applying fixups on a living GuC. While many 
things could have gone wrong in that case, it is generally more likely 
that we've made it through fixups and everything will work now. The job 
which was executing without fixups might have hanged, but then all jobs 
were fixed, so this single job will timeout, get removed, and execution 
will continue.

There is no need to wedge in this case.

* not in RESFIX state due to PF reset:

This case should be handled by VMM, not by the kernel running on the VF 
- the whole VM should be marked as damaged. The VF KMD could wedge, but 
we have no way of distinguishing this situation from the previous one.

* GuC is not responding:

This will lead to PF reset, so see the point above. This time we can 
distinguish the situation as we've got ETIME from send function, but why 
treat this case in a different manner?

To conclude:

I don't think the VF KMD should wedge on no response, because it is 
possible that everything will work, and we have no way of distinguishing 
these cases - if there will be no progress or no response from hardware, 
we will get through standard reset and wedge procedures after the 
recovery ends. There is no need for extraordinary speedup of the demise 
in case of migration.

>
>> +}
>> +
>>   static void vf_post_migration_recovery(struct xe_device *xe)
>>   {
>>   	drm_dbg(&xe->drm, "migration recovery in progress\n");
>>   	/* FIXME: add the recovery steps */
>> +	vf_post_migration_notify_resfix_done(xe);
> hmm, shouldn't we have some kind of guard to actually track whether we
> succeed with the fixups and only then send notify-done?
>
> otherwise in addition that we're already cheating the user with below
> message 'recovery-completed', now we cheating the GuC that 'fixup-done'
>
> there likely always be a cases where something went wrong, so we either
> end up with reset or wedge, so maybe we should also start with that and
> only send 'fixup-done' when passing all post-migration steps

The fixups do not communicate with hardware, so they have no way of 
failing. They are a CPU-only procedure of iterating through internal KMD 
structures. Them failing would mean internal KMD structures are damaged.

The series under review does include a mechanism for fail, and a 
mechanism for defer. In both these mechanisms the RESFIX_DONE is skipped.

-Tomasz

>
>>   	drm_notice(&xe->drm, "migration recovery completed\n");
>>   }
>>   

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/4] drm/xe/vf: Start post-migration fixups with GuC MMIO handshake
  2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
  2024-09-20 22:29 ` [PATCH 1/4] drm/xe/vf: React to MIGRATED interrupt Tomasz Lis
  2024-09-20 22:29 ` [PATCH 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore Tomasz Lis
@ 2024-09-20 22:29 ` Tomasz Lis
  2024-09-23 12:02   ` Michal Wajdeczko
  2024-09-20 22:29 ` [PATCH 4/4] drm/xe/vf: Defer fixups if migrated twice fast Tomasz Lis
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Tomasz Lis @ 2024-09-20 22:29 UTC (permalink / raw)
  To: intel-xe; +Cc: Michał Winiarski, Michał Wajdeczko

During post-migration recovery, only MMIO communication to GuC is
allowed. But that communication requires initialization.

Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_sriov_vf.c | 40 ++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index 459fa936aaba..3cea2d21525f 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -22,6 +22,36 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
 	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
 }
 
+/**
+ * vf_post_migration_reinit_guc - Re-initialize GuC communication.
+ * @xe: the &xe_device struct instance
+ *
+ * After migration, we need to reestablish communication with GuC and
+ * re-query all VF configuration to make sure they match previous
+ * provisioning. Note that most of VF provisioning shall be the same,
+ * except GGTT range, since GGTT is not virtualized per-VF.
+ *
+ * Returns: 0 if the operation completed successfully, or a negative error
+ * code otherwise.
+ */
+static int vf_post_migration_reinit_guc(struct xe_device *xe)
+{
+	struct xe_gt *gt;
+	unsigned int id;
+	int err, ret;
+
+	err = 0;
+	xe_pm_runtime_get(xe);
+	for_each_gt(gt, xe, id) {
+		ret = xe_gt_sriov_vf_bootstrap(gt);
+		if (!err)
+			err = ret;
+	}
+	xe_pm_runtime_put(xe);
+
+	return err;
+}
+
 /*
  * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
  * @xe: the &xe_device struct instance
@@ -44,10 +74,20 @@ static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
 
 static void vf_post_migration_recovery(struct xe_device *xe)
 {
+	int err;
+
 	drm_dbg(&xe->drm, "migration recovery in progress\n");
+	err = vf_post_migration_reinit_guc(xe);
+	if (unlikely(err))
+		goto fail;
+
 	/* FIXME: add the recovery steps */
 	vf_post_migration_notify_resfix_done(xe);
 	drm_notice(&xe->drm, "migration recovery completed\n");
+	return;
+fail:
+	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
+	xe_device_declare_wedged(xe);
 }
 
 static void migration_worker_func(struct work_struct *w)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] drm/xe/vf: Start post-migration fixups with GuC MMIO handshake
  2024-09-20 22:29 ` [PATCH 3/4] drm/xe/vf: Start post-migration fixups with GuC MMIO handshake Tomasz Lis
@ 2024-09-23 12:02   ` Michal Wajdeczko
  2024-09-23 21:11     ` Lis, Tomasz
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Wajdeczko @ 2024-09-23 12:02 UTC (permalink / raw)
  To: Tomasz Lis, intel-xe; +Cc: Michał Winiarski



On 21.09.2024 00:29, Tomasz Lis wrote:
> During post-migration recovery, only MMIO communication to GuC is
> allowed. But that communication requires initialization.

shouldn't this be patch 2/4 ie. before actually trying to send anything

> 
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_sriov_vf.c | 40 ++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index 459fa936aaba..3cea2d21525f 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -22,6 +22,36 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
>  	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
>  }
>  
> +/**
> + * vf_post_migration_reinit_guc - Re-initialize GuC communication.
> + * @xe: the &xe_device struct instance
> + *
> + * After migration, we need to reestablish communication with GuC and
> + * re-query all VF configuration to make sure they match previous
> + * provisioning. Note that most of VF provisioning shall be the same,
> + * except GGTT range, since GGTT is not virtualized per-VF.
> + *
> + * Returns: 0 if the operation completed successfully, or a negative error

correct tag is "Return:" see [1]

[1] https://docs.kernel.org/doc-guide/kernel-doc.html#function-documentation

> + * code otherwise.
> + */
> +static int vf_post_migration_reinit_guc(struct xe_device *xe)
> +{
> +	struct xe_gt *gt;
> +	unsigned int id;
> +	int err, ret;
> +
> +	err = 0;
> +	xe_pm_runtime_get(xe);

again, maybe PM can be done once in vf_post_migration_recovery()

> +	for_each_gt(gt, xe, id) {
> +		ret = xe_gt_sriov_vf_bootstrap(gt);
> +		if (!err)
> +			err = ret;
> +	}
> +	xe_pm_runtime_put(xe);
> +
> +	return err;

do we care about sending a reset to those GuCs that successfully
completed handshake or we assume that going wedge is sufficient?

> +}
> +
>  /*
>   * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
>   * @xe: the &xe_device struct instance
> @@ -44,10 +74,20 @@ static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
>  
>  static void vf_post_migration_recovery(struct xe_device *xe)
>  {
> +	int err;
> +
>  	drm_dbg(&xe->drm, "migration recovery in progress\n");
> +	err = vf_post_migration_reinit_guc(xe);
> +	if (unlikely(err))
> +		goto fail;
> +
>  	/* FIXME: add the recovery steps */
>  	vf_post_migration_notify_resfix_done(xe);
>  	drm_notice(&xe->drm, "migration recovery completed\n");
> +	return;
> +fail:
> +	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
> +	xe_device_declare_wedged(xe);
>  }
>  
>  static void migration_worker_func(struct work_struct *w)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] drm/xe/vf: Start post-migration fixups with GuC MMIO handshake
  2024-09-23 12:02   ` Michal Wajdeczko
@ 2024-09-23 21:11     ` Lis, Tomasz
  0 siblings, 0 replies; 13+ messages in thread
From: Lis, Tomasz @ 2024-09-23 21:11 UTC (permalink / raw)
  To: Michal Wajdeczko, intel-xe; +Cc: Michał Winiarski


On 23.09.2024 14:02, Michal Wajdeczko wrote:
>
> On 21.09.2024 00:29, Tomasz Lis wrote:
>> During post-migration recovery, only MMIO communication to GuC is
>> allowed. But that communication requires initialization.
> shouldn't this be patch 2/4 ie. before actually trying to send anything

GuC is obligated to accept the RESFIX_DONE or RESET while in the RESFIX 
state. Regardless of whether the KMD says "Hi" first or not.

This means the patches are in the correct order.

>
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_sriov_vf.c | 40 ++++++++++++++++++++++++++++++++
>>   1 file changed, 40 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> index 459fa936aaba..3cea2d21525f 100644
>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> @@ -22,6 +22,36 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
>>   	INIT_WORK(&xe->sriov.vf.migration_worker, migration_worker_func);
>>   }
>>   
>> +/**
>> + * vf_post_migration_reinit_guc - Re-initialize GuC communication.
>> + * @xe: the &xe_device struct instance
>> + *
>> + * After migration, we need to reestablish communication with GuC and
>> + * re-query all VF configuration to make sure they match previous
>> + * provisioning. Note that most of VF provisioning shall be the same,
>> + * except GGTT range, since GGTT is not virtualized per-VF.
>> + *
>> + * Returns: 0 if the operation completed successfully, or a negative error
> correct tag is "Return:" see [1]
>
> [1] https://docs.kernel.org/doc-guide/kernel-doc.html#function-documentation
ack
>
>> + * code otherwise.
>> + */
>> +static int vf_post_migration_reinit_guc(struct xe_device *xe)
>> +{
>> +	struct xe_gt *gt;
>> +	unsigned int id;
>> +	int err, ret;
>> +
>> +	err = 0;
>> +	xe_pm_runtime_get(xe);
> again, maybe PM can be done once in vf_post_migration_recovery()

We could, but we do not need to keep it during fixups. From what I 
understand, the lower the granularity, the better.

>
>> +	for_each_gt(gt, xe, id) {
>> +		ret = xe_gt_sriov_vf_bootstrap(gt);
>> +		if (!err)
>> +			err = ret;
>> +	}
>> +	xe_pm_runtime_put(xe);
>> +
>> +	return err;
> do we care about sending a reset to those GuCs that successfully
> completed handshake or we assume that going wedge is sufficient?

On wedge, all we should do is silence the hardware. If left in RESFIX 
state, the hardware is silent.

We can't assure that GuCs won't be in RESFIX state anyway, because 
another migration might happen after the wedging.

-Tomasz

>
>> +}
>> +
>>   /*
>>    * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
>>    * @xe: the &xe_device struct instance
>> @@ -44,10 +74,20 @@ static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
>>   
>>   static void vf_post_migration_recovery(struct xe_device *xe)
>>   {
>> +	int err;
>> +
>>   	drm_dbg(&xe->drm, "migration recovery in progress\n");
>> +	err = vf_post_migration_reinit_guc(xe);
>> +	if (unlikely(err))
>> +		goto fail;
>> +
>>   	/* FIXME: add the recovery steps */
>>   	vf_post_migration_notify_resfix_done(xe);
>>   	drm_notice(&xe->drm, "migration recovery completed\n");
>> +	return;
>> +fail:
>> +	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
>> +	xe_device_declare_wedged(xe);
>>   }
>>   
>>   static void migration_worker_func(struct work_struct *w)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 4/4] drm/xe/vf: Defer fixups if migrated twice fast
  2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
                   ` (2 preceding siblings ...)
  2024-09-20 22:29 ` [PATCH 3/4] drm/xe/vf: Start post-migration fixups with GuC MMIO handshake Tomasz Lis
@ 2024-09-20 22:29 ` Tomasz Lis
  2024-09-20 22:34 ` ✓ CI.Patch_applied: success for drm/xe/vf: Post-migration recovery worker basis Patchwork
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Tomasz Lis @ 2024-09-20 22:29 UTC (permalink / raw)
  To: intel-xe; +Cc: Michał Winiarski, Michał Wajdeczko

If another VF migration happened during post-migration recovery,
then the current worker should be finished to allow the next
one start swiftly and cleanly.

Check for defer in two places: before fixups, and before
sending RESFIX_DONE.

Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_sriov_vf.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index 3cea2d21525f..93817cf21701 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -52,6 +52,19 @@ static int vf_post_migration_reinit_guc(struct xe_device *xe)
 	return err;
 }
 
+/*
+ * vf_post_migration_imminent - Check if post-restore recovery is coming.
+ * @xe: the &xe_device struct instance
+ *
+ * Returns if migration recovery worker will soon be running. Any worker currently executing
+ * does not affect the result.
+ */
+static bool vf_post_migration_imminent(struct xe_device *xe)
+{
+	return xe->sriov.vf.migration_gt_flags != 0 ||
+	work_pending(&xe->sriov.vf.migration_worker);
+}
+
 /*
  * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
  * @xe: the &xe_device struct instance
@@ -64,12 +77,19 @@ static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
 
 	xe_pm_runtime_get(xe);
 	for_each_gt(gt, xe, id) {
+		if (vf_post_migration_imminent(xe))
+			goto skip;
 		err = xe_gt_sriov_vf_notify_resfix_done(gt);
 		if (!err)
 			num_sent++;
 	}
 	xe_pm_runtime_put(xe);
 	drm_dbg(&xe->drm, "sent %d VF resource fixups done notifications\n", num_sent);
+	return;
+
+skip:
+	xe_pm_runtime_put(xe);
+	drm_dbg(&xe->drm, "another recovery imminent, skipping notifications\n");
 }
 
 static void vf_post_migration_recovery(struct xe_device *xe)
@@ -78,6 +98,8 @@ static void vf_post_migration_recovery(struct xe_device *xe)
 
 	drm_dbg(&xe->drm, "migration recovery in progress\n");
 	err = vf_post_migration_reinit_guc(xe);
+	if (vf_post_migration_imminent(xe))
+		goto defer;
 	if (unlikely(err))
 		goto fail;
 
@@ -85,6 +107,9 @@ static void vf_post_migration_recovery(struct xe_device *xe)
 	vf_post_migration_notify_resfix_done(xe);
 	drm_notice(&xe->drm, "migration recovery completed\n");
 	return;
+defer:
+	drm_dbg(&xe->drm, "migration recovery deferred\n");
+	return;
 fail:
 	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
 	xe_device_declare_wedged(xe);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* ✓ CI.Patch_applied: success for drm/xe/vf: Post-migration recovery worker basis
  2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
                   ` (3 preceding siblings ...)
  2024-09-20 22:29 ` [PATCH 4/4] drm/xe/vf: Defer fixups if migrated twice fast Tomasz Lis
@ 2024-09-20 22:34 ` Patchwork
  2024-09-20 22:35 ` ✗ CI.checkpatch: warning " Patchwork
  2024-09-20 22:35 ` ✗ CI.KUnit: failure " Patchwork
  6 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2024-09-20 22:34 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-xe

== Series Details ==

Series: drm/xe/vf: Post-migration recovery worker basis
URL   : https://patchwork.freedesktop.org/series/138935/
State : success

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: caa511d74cfe drm-tip: 2024y-09m-20d-20h-39m-19s UTC integration manifest
=== git am output follows ===
Applying: drm/xe/vf: React to MIGRATED interrupt
Applying: drm/xe/vf: Send RESFIX_DONE message at end of VF restore
Applying: drm/xe/vf: Start post-migration fixups with GuC MMIO handshake
Applying: drm/xe/vf: Defer fixups if migrated twice fast



^ permalink raw reply	[flat|nested] 13+ messages in thread

* ✗ CI.checkpatch: warning for drm/xe/vf: Post-migration recovery worker basis
  2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
                   ` (4 preceding siblings ...)
  2024-09-20 22:34 ` ✓ CI.Patch_applied: success for drm/xe/vf: Post-migration recovery worker basis Patchwork
@ 2024-09-20 22:35 ` Patchwork
  2024-09-20 22:35 ` ✗ CI.KUnit: failure " Patchwork
  6 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2024-09-20 22:35 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-xe

== Series Details ==

Series: drm/xe/vf: Post-migration recovery worker basis
URL   : https://patchwork.freedesktop.org/series/138935/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
30ab6715fc09baee6cc14cb3c89ad8858688d474
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 0254b7b8013a737605c0b11d97e60a2ca2bd2a25
Author: Tomasz Lis <tomasz.lis@intel.com>
Date:   Sat Sep 21 00:29:26 2024 +0200

    drm/xe/vf: Defer fixups if migrated twice fast
    
    If another VF migration happened during post-migration recovery,
    then the current worker should be finished to allow the next
    one start swiftly and cleanly.
    
    Check for defer in two places: before fixups, and before
    sending RESFIX_DONE.
    
    Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
+ /mt/dim checkpatch caa511d74cfe1960912443997a75a3e4abc67dbb drm-intel
8e2bd615ae9e drm/xe/vf: React to MIGRATED interrupt
-:141: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#141: 
new file mode 100644

-:196: WARNING:MEMORY_BARRIER: memory barrier without comment
#196: FILE: drivers/gpu/drm/xe/xe_sriov_vf.c:51:
+	smp_mb();

-:220: WARNING:MEMORY_BARRIER: memory barrier without comment
#220: FILE: drivers/gpu/drm/xe/xe_sriov_vf.c:75:
+	smp_mb__after_atomic();

total: 0 errors, 3 warnings, 0 checks, 181 lines checked
2b5e1048a020 drm/xe/vf: Send RESFIX_DONE message at end of VF restore
daf267332301 drm/xe/vf: Start post-migration fixups with GuC MMIO handshake
0254b7b8013a drm/xe/vf: Defer fixups if migrated twice fast



^ permalink raw reply	[flat|nested] 13+ messages in thread

* ✗ CI.KUnit: failure for drm/xe/vf: Post-migration recovery worker basis
  2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
                   ` (5 preceding siblings ...)
  2024-09-20 22:35 ` ✗ CI.checkpatch: warning " Patchwork
@ 2024-09-20 22:35 ` Patchwork
  6 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2024-09-20 22:35 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-xe

== Series Details ==

Series: drm/xe/vf: Post-migration recovery worker basis
URL   : https://patchwork.freedesktop.org/series/138935/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../lib/iomap.c:156:5: warning: no previous prototype for ‘ioread64_lo_hi’ [-Wmissing-prototypes]
  156 | u64 ioread64_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:163:5: warning: no previous prototype for ‘ioread64_hi_lo’ [-Wmissing-prototypes]
  163 | u64 ioread64_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:170:5: warning: no previous prototype for ‘ioread64be_lo_hi’ [-Wmissing-prototypes]
  170 | u64 ioread64be_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:178:5: warning: no previous prototype for ‘ioread64be_hi_lo’ [-Wmissing-prototypes]
  178 | u64 ioread64be_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:264:6: warning: no previous prototype for ‘iowrite64_lo_hi’ [-Wmissing-prototypes]
  264 | void iowrite64_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:272:6: warning: no previous prototype for ‘iowrite64_hi_lo’ [-Wmissing-prototypes]
  272 | void iowrite64_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:280:6: warning: no previous prototype for ‘iowrite64be_lo_hi’ [-Wmissing-prototypes]
  280 | void iowrite64be_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
../lib/iomap.c:288:6: warning: no previous prototype for ‘iowrite64be_hi_lo’ [-Wmissing-prototypes]
  288 | void iowrite64be_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_sriov_vf.c: In function ‘xe_sriov_vf_start_migration_recovery’:
../drivers/gpu/drm/xe/xe_sriov_vf.c:136:2: error: implicit declaration of function ‘XE_WARN_ON’; did you mean ‘VM_WARN_ON’? [-Werror=implicit-function-declaration]
  136 |  XE_WARN_ON(!IS_SRIOV_VF(xe));
      |  ^~~~~~~~~~
      |  VM_WARN_ON
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:244: drivers/gpu/drm/xe/xe_sriov_vf.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:485: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:485: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:485: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:485: drivers] Error 2
make[2]: *** [/kernel/Makefile:1926: .] Error 2
make[1]: *** [/kernel/Makefile:224: __sub-make] Error 2
make: *** [Makefile:224: __sub-make] Error 2

[22:35:04] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[22:35:08] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-09-23 21:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-20 22:29 [PATCH 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
2024-09-20 22:29 ` [PATCH 1/4] drm/xe/vf: React to MIGRATED interrupt Tomasz Lis
2024-09-23  8:39   ` Michal Wajdeczko
2024-09-20 22:29 ` [PATCH 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore Tomasz Lis
2024-09-23 11:55   ` Michal Wajdeczko
2024-09-23 20:52     ` Lis, Tomasz
2024-09-20 22:29 ` [PATCH 3/4] drm/xe/vf: Start post-migration fixups with GuC MMIO handshake Tomasz Lis
2024-09-23 12:02   ` Michal Wajdeczko
2024-09-23 21:11     ` Lis, Tomasz
2024-09-20 22:29 ` [PATCH 4/4] drm/xe/vf: Defer fixups if migrated twice fast Tomasz Lis
2024-09-20 22:34 ` ✓ CI.Patch_applied: success for drm/xe/vf: Post-migration recovery worker basis Patchwork
2024-09-20 22:35 ` ✗ CI.checkpatch: warning " Patchwork
2024-09-20 22:35 ` ✗ CI.KUnit: failure " Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox