[RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs

* [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
@ 2025-01-14  9:54 Jiang Liu
  2025-01-14  9:54 ` [RFC v1 1/2] drm/amdgpu: update cached vram base addresses on resume Jiang Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jiang Liu @ 2025-01-14  9:54 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

For virtual machines with AMD SR-IOV vGPUs, the following workflow may be
used to support virtual machine hibernation (suspend):
1) Suspend a virtual machine with AMD vGPU A.
2) The hypervisor dumps guest RAM content to a disk image.
3) The hypervisor loads the guest system image from disk.
4) Resume the guest OS with a different AMD vGPU B.

Step 4 above is special because we are resuming with a different
AMD vGPU device and the amdgpu driver may observe changed device
properties. To support the above workflow, we need to fix up the changed
device properties cached by the amdgpu driver.

Based on the amdgpu driver source code (we haven't read the
corresponding hardware specs yet), we have identified the following
changed device properties:
1) PCI MMIO address. This can be fixed by the hypervisor.
2) serial_number, unique_id, xgmi_device_id, fru_id in sysfs. These
   seem to be informational only.
3) xgmi_physical_id if xgmi is enabled, which affects the VRAM MC address.
4) mc_fb_offset, which affects the VRAM physical address.

We will focus on the VRAM address related changes here, because GPU
functionality is sensitive to them. The original data sources include
.get_mc_fb_offset(), .get_fb_location() and the xgmi hardware registers.
The main data cached by the amdgpu driver are adev->gmc.vram_start and
adev->vm_manager.vram_base_offset. The major consumers of the
cached information are ip_block.hw_init() and the GMU page table builder.

After code analysis, we found that most consumers of adev->gmc.vram_start
and adev->vm_manager.vram_base_offset read the values from these
two variables on demand instead of caching them. So if we fix these
two cached fields on resume, everything should work as expected.

But there is one very important exception: callers
of amdgpu_bo_create_kernel()/amdgpu_bo_create_reserved() may cache
VRAM addresses. On further analysis, the callers of these interfaces
follow three different patterns:
1) This pattern is safe.
   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
   - call amdgpu_bo_free_kernel() in ip_block.suspend()
   - call amdgpu_bo_create_reserved() in ip_block.resume()
2) This pattern works with the current implementation of amdgpu_bo_create_reserved(),
   but bo.pin_count becomes incorrect.
   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
   - call amdgpu_bo_create_reserved() in ip_block.resume()
3) This pattern needs to be enhanced.
   - call amdgpu_bo_create_reserved() in ip_block.sw_init()

So my question is: which pattern should we use here? Personally I prefer
pattern 2, with an enhancement to fix bo.pin_count.

Currently there are still bugs in SR-IOV suspend/resume, so we can't test
our hypothesis. We are also not sure whether there are other
blockers to enabling resume with different AMD SR-IOV vGPUs.

Help is needed to identify more task items to enable resume with
different AMD SR-IOV vGPUs :)

Jiang Liu (2):
  drm/amdgpu: update cached vram base addresses on resume
  drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 15 +++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h      |  6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  6 ++++++
 7 files changed, 51 insertions(+), 2 deletions(-)

-- 
2.43.5


* [RFC v1 1/2] drm/amdgpu: update cached vram base addresses on resume
  2025-01-14  9:54 [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Jiang Liu
@ 2025-01-14  9:54 ` Jiang Liu
  2025-01-14  9:54 ` [RFC v1 2/2] drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr() Jiang Liu
  2025-01-14 10:46 ` [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Christian König
  2 siblings, 0 replies; 12+ messages in thread
From: Jiang Liu @ 2025-01-14  9:54 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

When resuming on a different SR-IOV vGPU device, the VRAM base addresses
may have changed, so we need to update the cached addresses.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h    |  6 ++++--
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c     |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 36053b3d48b3..c9df54127417 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4970,6 +4970,21 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
 		return 0;
 
+	/* Get xgmi info again for sriov to detect device changes */
+	if (amdgpu_sriov_vf(adev) &&
+	    !(adev->flags & AMD_IS_APU) &&
+	    adev->gmc.xgmi.supported &&
+	    !adev->gmc.xgmi.connected_to_cpu) {
+		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
+		r = adev->gfxhub.funcs->get_xgmi_info(adev);
+		if (r)
+			return r;
+		if (adev->gmc.xgmi.physical_node_id != adev->gmc.xgmi.prev_physical_node_id)
+			adev->gmc.xgmi.physical_node_id_changed = true;
+		else
+			adev->gmc.xgmi.physical_node_id_changed = false;
+	}
+
 	if (adev->in_s0ix)
 		amdgpu_dpm_gfx_state_change(adev, sGpuChangeState_D0Entry);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 459a30fe239f..a32ab7b61cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -184,10 +184,12 @@ struct amdgpu_xgmi {
 	u64 hive_id;
 	/* fixed per family */
 	u64 node_segment_size;
-	/* physical node (0-3) */
-	unsigned physical_node_id;
 	/* number of nodes (0-4) */
 	unsigned num_physical_nodes;
+	/* physical node (0-3) */
+	unsigned physical_node_id;
+	unsigned prev_physical_node_id;
+	bool physical_node_id_changed;
 	/* gpu list in the same hive */
 	struct list_head head;
 	bool supported;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 9bedca9a79c6..94d86e9a08e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -1065,8 +1065,15 @@ static int gmc_v10_0_suspend(struct amdgpu_ip_block *ip_block)
 
 static int gmc_v10_0_resume(struct amdgpu_ip_block *ip_block)
 {
+	struct amdgpu_device *adev = ip_block->adev;
 	int r;
 
+	if (adev->gmc.xgmi.physical_node_id_changed) {
+		r = gmc_v10_0_mc_init(adev);
+		if (r)
+			return r;
+	}
+
 	r = gmc_v10_0_hw_init(ip_block);
 	if (r)
 		return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 291549765c38..db118e07efde 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -2532,6 +2532,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
 		adev->gmc.reset_flags &= ~AMDGPU_GMC_INIT_RESET_NPS;
 	}
 
+	if (adev->gmc.xgmi.physical_node_id_changed) {
+		r = gmc_v9_0_mc_init(adev);
+		if (r)
+			return r;
+	}
+
 	r = gmc_v9_0_hw_init(ip_block);
 	if (r)
 		return r;
-- 
2.43.5



* [RFC v1 2/2] drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()
  2025-01-14  9:54 [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Jiang Liu
  2025-01-14  9:54 ` [RFC v1 1/2] drm/amdgpu: update cached vram base addresses on resume Jiang Liu
@ 2025-01-14  9:54 ` Jiang Liu
  2025-01-14 10:35   ` Christian König
  2025-01-14 10:46 ` [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Christian König
  2 siblings, 1 reply; 12+ messages in thread
From: Jiang Liu @ 2025-01-14  9:54 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

Introduce helper amdgpu_bo_get_pinned_gpu_addr(), which will be
used to update the GPU address of a pinned kernel BO during resume.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   | 9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c | 9 +++++++++
 3 files changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4f057996ef35..bce939a63a99 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1555,6 +1555,15 @@ u64 amdgpu_bo_gpu_offset_no_check(struct amdgpu_bo *bo)
 	return amdgpu_gmc_sign_extend(offset);
 }
 
+/**
+ * amdgpu_bo_get_pinned_gpu_addr - get GPU address of pinned kernel BO
+ */
+void amdgpu_bo_get_pinned_gpu_addr(struct amdgpu_bo *bo, u64 *gpu_addr)
+{
+	if (bo && bo->tbo.pin_count && gpu_addr)
+		*gpu_addr = amdgpu_bo_gpu_offset(bo);
+}
+
 /**
  * amdgpu_bo_get_preferred_domain - get preferred domain
  * @adev: amdgpu device object
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index ab3fe7b42da7..9022592291a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -305,6 +305,7 @@ int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
 int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
 u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
 u64 amdgpu_bo_gpu_offset_no_check(struct amdgpu_bo *bo);
+void amdgpu_bo_get_pinned_gpu_addr(struct amdgpu_bo *bo, u64 *gpu_addr);
 void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
 			  struct amdgpu_mem_stats *stats,
 			  unsigned int size);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c
index dde15c6a96e1..40605749b5d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c
@@ -881,6 +881,15 @@ static int umsch_mm_suspend(struct amdgpu_ip_block *ip_block)
 
 static int umsch_mm_resume(struct amdgpu_ip_block *ip_block)
 {
+	struct amdgpu_device *adev = ip_block->adev;
+
+	adev->umsch_mm.sch_ctx_gpu_addr = adev->wb.gpu_addr +
+					  (adev->umsch_mm.wb_index * 4);
+	amdgpu_bo_get_pinned_gpu_addr(adev->umsch_mm.cmd_buf_obj,
+				      &adev->umsch_mm.cmd_buf_gpu_addr);
+	amdgpu_bo_get_pinned_gpu_addr(adev->umsch_mm.dbglog_bo,
+				      &adev->umsch_mm.log_gpu_addr);
+
 	return umsch_mm_hw_init(ip_block);
 }
 
-- 
2.43.5



* Re: [RFC v1 2/2] drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()
  2025-01-14  9:54 ` [RFC v1 2/2] drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr() Jiang Liu
@ 2025-01-14 10:35   ` Christian König
  0 siblings, 0 replies; 12+ messages in thread
From: Christian König @ 2025-01-14 10:35 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

On 14.01.25 at 10:54, Jiang Liu wrote:
> Introduce helper amdgpu_bo_get_pinned_gpu_addr(), which will be
> used to update GPU address of pinned kernel BO during resume.

Clear NAK to the whole approach. Pinned means that the address *never* 
changes.

Hacks like those here are a complete no-go since some firmware uses the 
location of temporary buffers inside their firmware state.

So you always need to resume to the exact same location as it was before 
suspend.

I'm going to reply on the cover letter as well.

Regards,
Christian.




* Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-14  9:54 [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Jiang Liu
  2025-01-14  9:54 ` [RFC v1 1/2] drm/amdgpu: update cached vram base addresses on resume Jiang Liu
  2025-01-14  9:54 ` [RFC v1 2/2] drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr() Jiang Liu
@ 2025-01-14 10:46 ` Christian König
  2025-01-14 11:03   ` Gerry Liu
  2 siblings, 1 reply; 12+ messages in thread
From: Christian König @ 2025-01-14 10:46 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

Hi Jiang,

Some of the firmware, especially the multimedia ones, keep FW pointers 
to buffers in the suspend/resume state.

In other words the firmware needs to be in the exact same location 
before and after resume. That's why we don't unpin the firmware BOs, but 
rather save their content and restore it. See function 
amdgpu_vcn_save_vcpu_bo() for reference.

In addition to that, the serial numbers, IDs etc. are used for things like
TMZ. So anything which uses HW encryption won't work any more.

Then even two identical boards can have different harvest and memory
channel configurations. It could be that we are able to abstract that
with SR-IOV, but I wouldn't rely on that.

To summarize, this looks like a completely futile effort which most
likely won't work reliably in a production environment.

Regards,
Christian.




* Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-14 10:46 ` [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Christian König
@ 2025-01-14 11:03   ` Gerry Liu
  2025-01-14 12:43     ` Christian König
  0 siblings, 1 reply; 12+ messages in thread
From: Gerry Liu @ 2025-01-14 11:03 UTC (permalink / raw)
  To: Christian König
  Cc: alexander.deucher, Xinhui.Pan, airlied, simona, sunil.khatri,
	lijo.lazar, Hawking.Zhang, mario.limonciello, xiaogang.chen,
	Kent.Russell, shuox.liu, amd-gfx



> On January 14, 2025, at 18:46, Christian König <christian.koenig@amd.com> wrote:
> 
> Hi Jiang,
> 
> Some of the firmware, especially the multimedia ones, keep FW pointers to buffers in the suspend/resume state.
> 
> In other words the firmware needs to be in the exact same location before and after resume. That's why we don't unpin the firmware BOs, but rather save their content and restore it. See function amdgpu_vcn_save_vcpu_bo() for reference.
> 
> Additional to that the serial numbers, IDs etc are used for things like TMZ. So anything which uses HW encryption won't work any more.
> 
> Then even two identical boards can have different harvest and memory channel configurations. Could be that we might be able to abstract that with SR-IOV but I won't rely on that.
> 
> To summarize that looks like a completely futile effort which most likely won't work reliable in a production environment.
Hi Christian,
	Thanks for the information. I previously assumed that we could reset the ASIC and reload all firmware on resume, but I missed the VCN IP block, which saves and restores firmware VRAM content during suspend/resume. Are there any other IP blocks which save and restore firmware RAM content?
	Our usage scenario targets GPGPU workloads (amdkfd) with an AMD GPU in single SR-IOV vGPU mode. Is it possible to resume on a different vGPU device in such a case?
Regards,
Gerry 





* Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-14 11:03   ` Gerry Liu
@ 2025-01-14 12:43     ` Christian König
  2025-01-14 18:00       ` Liu, Shaoyun
  0 siblings, 1 reply; 12+ messages in thread
From: Christian König @ 2025-01-14 12:43 UTC (permalink / raw)
  To: Gerry Liu
  Cc: alexander.deucher, Xinhui.Pan, airlied, simona, sunil.khatri,
	lijo.lazar, Hawking.Zhang, mario.limonciello, xiaogang.chen,
	Kent.Russell, shuox.liu, amd-gfx

Hi Gerry,

On 14.01.25 at 12:03, Gerry Liu wrote:
>> On January 14, 2025, at 18:46, Christian König <christian.koenig@amd.com> wrote:
>>
>> Hi Jiang,
>>
>> Some of the firmware, especially the multimedia ones, keep FW pointers to buffers in the suspend/resume state.
>>
>> In other words the firmware needs to be in the exact same location before and after resume. That's why we don't unpin the firmware BOs, but rather save their content and restore it. See function amdgpu_vcn_save_vcpu_bo() for reference.
>>
>> Additional to that the serial numbers, IDs etc are used for things like TMZ. So anything which uses HW encryption won't work any more.
>>
>> Then even two identical boards can have different harvest and memory channel configurations. Could be that we might be able to abstract that with SR-IOV but I won't rely on that.
>>
>> To summarize that looks like a completely futile effort which most likely won't work reliable in a production environment.
> Hi Christian,
> 	Thanks for the information. Previously I assume that we may reset the asic and reload all firmwares on resume, but missed the vcn ip block which save and restore firmware vram content during suspend/resume. Is there any other IP blocks which save and restore firmware ram content?

Not that I know of offhand, but neither the hypervisor nor the
driver stack was designed with something like this in mind. So it could be
that there are other dependencies I don't know about.

I do remember that this idea of resuming on different HW than the one
suspended on came up a while ago and was rejected by multiple parties as too
complicated and error-prone.

So we never looked more deeply into the possibility of doing that.

> 	Our usage scenario targets GPGPU workload (amdkfd) with AMD GPU in single SR-IOV vGPU mode. Is it possible to resume on a different vGPU device in such a case?

If I'm not completely mistaken you can use checkpoint/restore for that. 
It's still under development, but as far as I can see it should solve 
your problem quite nicely.

Regards,
Christian.



* RE: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-14 12:43     ` Christian König
@ 2025-01-14 18:00       ` Liu, Shaoyun
  2025-01-15  1:47         ` Gerry Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Liu, Shaoyun @ 2025-01-14 18:00 UTC (permalink / raw)
  To: Koenig, Christian, Gerry Liu
  Cc: Deucher, Alexander, Pan, Xinhui, airlied@gmail.com,
	simona@ffwll.ch, Khatri, Sunil, Lazar, Lijo, Zhang, Hawking,
	Limonciello, Mario, Chen, Xiaogang, Russell, Kent,
	shuox.liu@linux.alibaba.com, amd-gfx@lists.freedesktop.org

[AMD Official Use Only - AMD Internal Distribution Only]

I think resuming with a different SR-IOV vGPU depends on whether the hypervisor supports live migration. Different hypervisors have different implementations; basically, the hypervisor calls into the host GPU driver at different stages, and the host side performs the hardware-related migration, including the FW state.

Regards
Shaoyun.liu

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Tuesday, January 14, 2025 7:44 AM
To: Gerry Liu <gerry@linux.alibaba.com>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri, Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>; Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs

Hi Gerry,

On 14.01.25 at 12:03, Gerry Liu wrote:

On January 14, 2025, at 18:46, Christian König <christian.koenig@amd.com> wrote:

Hi Jiang,

Some of the firmware, especially the multimedia ones, keep FW pointers to buffers in the suspend/resume state.

In other words the firmware needs to be in the exact same location before and after resume. That's why we don't unpin the firmware BOs, but rather save their content and restore it. See function amdgpu_vcn_save_vcpu_bo() for reference.
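
The "keep the buffer in place, save and restore only its content" approach described above can be modeled in user space. This is only a sketch under stated assumptions: struct fake_fw_bo, fake_fw_save() and fake_fw_restore() are hypothetical names invented for illustration, and the code is not taken from amdgpu_vcn_save_vcpu_bo().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical firmware buffer that stays pinned at a fixed address. */
struct fake_fw_bo {
	unsigned char *cpu_addr;	/* mapping of the pinned buffer */
	size_t size;			/* size of the buffer in bytes */
	unsigned char *saved;		/* shadow copy held in system memory */
};

/* Suspend: copy the content out, but leave the buffer where it is. */
static int fake_fw_save(struct fake_fw_bo *bo)
{
	assert(bo && bo->cpu_addr);
	bo->saved = malloc(bo->size);
	if (!bo->saved)
		return -1;
	memcpy(bo->saved, bo->cpu_addr, bo->size);
	return 0;
}

/* Resume: copy the content back to the very same location. */
static int fake_fw_restore(struct fake_fw_bo *bo)
{
	assert(bo && bo->cpu_addr);
	if (!bo->saved)
		return -1;
	memcpy(bo->cpu_addr, bo->saved, bo->size);
	free(bo->saved);
	bo->saved = NULL;
	return 0;
}
```

In this model the address in cpu_addr never changes across save and restore, which is why any pointer the firmware kept into the buffer would stay valid only if the buffer lands at the same location after resume.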

In addition, the serial numbers, IDs, etc. are used for things like TMZ, so anything which uses HW encryption won't work any more.

Then even two identical boards can have different harvest and memory channel configurations. It could be that we are able to abstract that with SR-IOV, but I wouldn't rely on that.

To summarize, that looks like a completely futile effort which most likely won't work reliably in a production environment.

Hi Christian,

  Thanks for the information. Previously I assumed that we could reset the ASIC and reload all firmware on resume, but I missed the VCN IP block, which saves and restores firmware VRAM content during suspend/resume. Are there any other IP blocks which save and restore firmware RAM content?

Not that I know of offhand, but neither the hypervisor nor the driver stack was designed with something like this in mind. So it could be that there are other dependencies I don't know about.

I do remember that this idea of resuming on different HW than suspending came up a while ago and was rejected by multiple parties as too complicated and error prone.

So we never looked more deeply into the possibility of doing that.

  Our usage scenario targets GPGPU workloads (amdkfd) with an AMD GPU in single SR-IOV vGPU mode. Is it possible to resume on a different vGPU device in such a case?

If I'm not completely mistaken you can use checkpoint/restore for that. It's still under development, but as far as I can see it should solve your problem quite nicely.

Regards,
Christian.

Regards,

Gerry

Regards,

Christian.

On 14.01.25 at 10:54, Jiang Liu wrote:

For virtual machines with AMD SR-IOV vGPUs, the following workflow may be
used to support virtual machine hibernation (suspend):
1) suspend a virtual machine with AMD vGPU A.
2) hypervisor dumps guest RAM content to a disk image.
3) hypervisor loads the guest system image from disk.
4) resume the guest OS with a different AMD vGPU B.

Step 4 above is special because we are resuming with a different
AMD vGPU device and the amdgpu driver may observe changed device
properties. To support the above workflow, we need to fix those changed
device properties cached by the amdgpu driver.

With information from the amdgpu driver source code (haven't read the
corresponding hardware specs yet), we have identified the following changed
device properties:
1) PCI MMIO address. This can be fixed by the hypervisor.
2) serial_number, unique_id, xgmi_device_id, fru_id in sysfs. These seem
   to be informational only.
3) xgmi_physical_id if xgmi is enabled, which affects the VRAM MC address.
4) mc_fb_offset, which affects the VRAM physical address.

We will focus on the VRAM address related changes here, because GPU
functionality is sensitive to them. The original data sources include
.get_mc_fb_offset(), .get_fb_location() and xgmi hardware registers.
The main data cached by the amdgpu driver are adev->gmc.vram_start and
adev->vm_manager.vram_base_offset. And the major consumers of the
cached information are ip_block.hw_init() and GMU page table builder.

After code analysis, we found that most consumers of adev->gmc.vram_start
and adev->vm_manager.vram_base_offset read the value directly from these
two variables on demand instead of caching it. So if we fix these
two cached fields on resume, everything should work as expected.
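
The difference between a consumer that reads the base on demand and one that caches a derived address can be shown with a small user-space model. This is a sketch only: struct fake_gmc, struct fake_bo, fake_bo_gpu_addr(), fake_gmc_resume() and the numeric base values are hypothetical and do not come from the amdgpu sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the cached adev->gmc.vram_start field. */
struct fake_gmc {
	uint64_t vram_start;
};

/* Hypothetical buffer object: only its offset inside VRAM is stable. */
struct fake_bo {
	uint64_t vram_offset;
};

/* On-demand consumer: recomputes the address from the current base. */
static uint64_t fake_bo_gpu_addr(const struct fake_gmc *gmc,
				 const struct fake_bo *bo)
{
	assert(gmc && bo);
	return gmc->vram_start + bo->vram_offset;
}

/* Models "fix the cached field on resume" with the base of the new vGPU. */
static void fake_gmc_resume(struct fake_gmc *gmc, uint64_t new_vram_start)
{
	assert(gmc);
	gmc->vram_start = new_vram_start;
}
```

A caller that stored the result of fake_bo_gpu_addr() before fake_gmc_resume() would keep an address relative to the old base, while a caller that invokes it again afterwards picks up the new base automatically.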

But there is an exception, and a very important one: callers
of amdgpu_bo_create_kernel()/amdgpu_bo_create_reserved() may cache
VRAM addresses. With further analysis, the callers of these interfaces
follow three different patterns:
1) This pattern is safe.
   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
   - call amdgpu_bo_free_kernel() in ip_block.suspend()
   - call amdgpu_bo_create_reserved() in ip_block.resume()
2) This pattern works with the current implementation of amdgpu_bo_create_reserved()
   but bo.pin_count becomes incorrect.
   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
   - call amdgpu_bo_create_reserved() in ip_block.resume()
3) This pattern needs to be enhanced.
   - call amdgpu_bo_create_reserved() in ip_block.sw_init()
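
The pin count concern in pattern 2 can be illustrated with a toy model. It assumes, as the description of pattern 2 implies, that a create call on an already existing buffer reuses the buffer but pins it again. The names fake_bo, fake_bo_create_reserved() and fake_bo_free_kernel() are hypothetical stand-ins and do not reproduce the real amdgpu functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical buffer object that tracks only its pin count. */
struct fake_bo {
	unsigned int pin_count;
};

/* Assumed behavior: allocate on first use, reuse afterwards, always pin. */
static int fake_bo_create_reserved(struct fake_bo **bo_ptr)
{
	assert(bo_ptr);
	if (!*bo_ptr) {
		*bo_ptr = calloc(1, sizeof(**bo_ptr));
		if (!*bo_ptr)
			return -1;
	}
	(*bo_ptr)->pin_count++;
	return 0;
}

/* Simplified free: drop the buffer and clear the caller's pointer. */
static void fake_bo_free_kernel(struct fake_bo **bo_ptr)
{
	assert(bo_ptr);
	free(*bo_ptr);
	*bo_ptr = NULL;
}
```

In this model, pattern 1 (create, free, create) ends with a pin count of 1, while pattern 2 (create, create) ends with a pin count of 2 for a buffer that is logically pinned once.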

So my question is which pattern should we use here? Personally I prefer
pattern 2 with an enhancement to fix the bo.pin_count.

Currently there are still bugs in SRIOV suspend/resume, so we can't test
our hypothesis. And we are not sure whether there are still other
blockers to enabling resume with different AMD SR-IOV vGPUs.

Help is needed to identify more task items to enable resume with
different AMD SR-IOV vGPUs :)

Jiang Liu (2):
  drm/amdgpu: update cached vram base addresses on resume
  drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 15 +++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h      |  6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  6 ++++++
 7 files changed, 51 insertions(+), 2 deletions(-)


* Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-14 18:00       ` Liu, Shaoyun
@ 2025-01-15  1:47         ` Gerry Liu
  2025-01-15  4:03           ` Liu, Shaoyun
  0 siblings, 1 reply; 12+ messages in thread
From: Gerry Liu @ 2025-01-15  1:47 UTC (permalink / raw)
  To: Liu, Shaoyun
  Cc: Koenig, Christian, Deucher, Alexander, Pan, Xinhui,
	airlied@gmail.com, simona@ffwll.ch, Khatri, Sunil, Lazar, Lijo,
	Zhang, Hawking, Limonciello, Mario, Chen, Xiaogang, Russell, Kent,
	shuox.liu@linux.alibaba.com, amd-gfx@lists.freedesktop.org



> On January 15, 2025, at 02:00, Liu, Shaoyun <Shaoyun.Liu@amd.com> wrote:
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> I think to resume with different SRIOV vGPUs depends on the  hypervisor has the live migration support . Different Hypervisor have different implementation , basically  it will call into the  host gpu driver in different stage and  host side do the  hw related  migration including the FW state.
Hi Shaoyun,
	Great news! That sounds like what I’m looking for:)
	Is there any documentation about how to enable this with an in-house implemented hypervisor? Will the hypervisor need to cooperate with the gim driver to enable resume with different vGPUs?
Regards
Gerry

>  
> Regards
> Shaoyun.liu
>  
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Tuesday, January 14, 2025 7:44 AM
> To: Gerry Liu <gerry@linux.alibaba.com>
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri, Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>; Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com; amd-gfx@lists.freedesktop.org
> Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
>  
> Hi Gerry,
> 
> On 14.01.25 at 12:03, Gerry Liu wrote:
> On January 14, 2025, at 18:46, Christian König <christian.koenig@amd.com> wrote:
>  
> Hi Jiang,
>  
> Some of the firmware, especially the multimedia ones, keep FW pointers to buffers in the suspend/resume state.
>  
> In other words the firmware needs to be in the exact same location before and after resume. That's why we don't unpin the firmware BOs, but rather save their content and restore it. See function amdgpu_vcn_save_vcpu_bo() for reference.
>  
> Additional to that the serial numbers, IDs etc are used for things like TMZ. So anything which uses HW encryption won't work any more.
>  
> Then even two identical boards can have different harvest and memory channel configurations. Could be that we might be able to abstract that with SR-IOV but I won't rely on that.
>  
> To summarize that looks like a completely futile effort which most likely won't work reliable in a production environment.
> Hi Christian,
>   Thanks for the information. Previously I assume that we may reset the asic and reload all firmwares on resume, but missed the vcn ip block which save and restore firmware vram content during suspend/resume. Is there any other IP blocks which save and restore firmware ram content?
> 
> Not that I of hand know of any, but neither the hypervisor nor the driver stack was designed with something like this in mind. So could be that there are other dependencies I don't know about.
> 
> I do remember that this idea of resuming on different HW than suspending came up a while ago and was rejected by multiple parties as to complicated and error prone.
> 
> So we never looked more deeply into the possibility of doing that.
> 
> 
>  
>   Our usage scenario targets GPGPU workload (amdkfd) with AMD GPU in single SR-IOV vGPU mode. Is it possible to resume on a different vGPU device in such a case?
> 
> If I'm not completely mistaken you can use checkpoint/restore for that. It's still under development, but as far as I can see it should solve your problem quite nicely.
> 
> Regards,
> Christian.
> 
> 
>  
> Regards,
> Gerry 
>  
>  
>  
> Regards,
> Christian.
>  
> On 14.01.25 at 10:54, Jiang Liu wrote:
> For virtual machines with AMD SR-IOV vGPUs, following work flow may be
> used to support virtual machine hibernation(suspend):
> 1) suspends a virtual machine with AMD vGPU A.
> 2) hypervisor dumps guest RAM content to a disk image.
> 3) hypervisor loads the guest system image from disk.
> 4) resumes the guest OS with a different AMD vGPU B.
>  
> The step 4 above is special because we are resuming with a different
> AMD vGPU device and the amdgpu driver may observe changed device
> properties. To support above work flow, we need to fix those changed
> device properties cached by the amdgpu drivers.
>  
> With information from the amdgpu driver source code (haven't read
> corresponding hardware specs yet), we have identified following changed
> device properties:
> 1) PCI MMIO address. This can be fixed by hypervisor.
> 2) serial_number, unique_id, xgmi_device_id, fru_id in sysfs. Seems
>    they are information only.
> 3) xgmi_physical_id if xgmi is enabled, which affects VRAM MC address.
> 4) mc_fb_offset, which affects VRAM physical address.
>  
> We will focus on the VRAM address related changes here, because it's
> sensitive to the GPU functionalities. The original data sources include
> .get_mc_fb_offset(), .get_fb_location() and xgmi hardware registers.
> The main data cached by amdgpu driver are adev->gmc.vram_start and
> adev->vm_manager.vram_base_offset. And the major consumers of the
> cached information are ip_block.hw_init() and GMU page table builder.
>  
> After code analysis, we found that most consumers of dev->gmc.vram_start
> and adev->vm_manager.vram_base_offset directly read value from these
> two variables on demand instead of caching them. So if we fix these
> two cached fields on resume, everything should work as expected.
>  
> But there's an exception, and an very import exception, that callers
> of amdgpu_bo_create_kernel()/amdgpu_bo_create_reserved() may cache
> VRAM addresses. With further analysis, the callers of these interface
> have three different patterns:
> 1) This pattern is safe.
>    - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>    - call amdgpu_bo_free_kernel() in ip_block.suspend()
>    - call amdgpu_bo_create_reserved() in ip_block.resume()
> 2) This pattern works with current implementaiton of amdgpu_bo_create_reserved()
>    but bo.pin_count gets incorrect.
>    - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>    - call amdgpu_bo_create_reserved() in ip_block.resume()
> 3) This pattern needs to be enhanced.
>    - call amdgpu_bo_create_reserved() in ip_block.sw_init()
>  
> So my question is which pattern should we use here? Personally I prefer
> pattern 2 with enhancement to fix the bo.pin_count.
>  
> Currently there're still bugs in SRIOV suspend/resume, so we can't test
> our hypothesis. And we are not sure whether there are still other
> blocking to enable resume with different AMD SR-IOV vGPUs.
>  
> Help is needed to identify more task items to enable resume with
> different AMD SR-IOV vGPUs:)
>  
> Jiang Liu (2):
>   drm/amdgpu: update cached vram base addresses on resume
>   drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()
>  
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 15 +++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h      |  6 ++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   |  9 +++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c |  9 +++++++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |  7 +++++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  6 ++++++
>  7 files changed, 51 insertions(+), 2 deletions(-)



* RE: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-15  1:47         ` Gerry Liu
@ 2025-01-15  4:03           ` Liu, Shaoyun
  2025-01-15  5:24             ` Gerry Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Liu, Shaoyun @ 2025-01-15  4:03 UTC (permalink / raw)
  To: Gerry Liu
  Cc: Koenig, Christian, Deucher, Alexander, Pan, Xinhui,
	airlied@gmail.com, simona@ffwll.ch, Khatri, Sunil, Lazar, Lijo,
	Zhang, Hawking, Limonciello, Mario, Chen, Xiaogang, Russell, Kent,
	shuox.liu@linux.alibaba.com, amd-gfx@lists.freedesktop.org

[AMD Official Use Only - AMD Internal Distribution Only]

I might have misunderstood your requirement. Live migration is transparent to the guest. The guest can be in a running state (e.g. running some compute workload); the hypervisor and the gim driver together handle the GPU HW state migration from the source vGPU to another identical vGPU. It doesn't require the guest to do a suspend/resume. You can contact other engineers who work on SR-IOV for more information on live migration support.

Regards
Shaoyun.liu

-----Original Message-----
From: Gerry Liu <gerry@linux.alibaba.com>
Sent: Tuesday, January 14, 2025 8:48 PM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com>
Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri, Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>; Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs



> On January 15, 2025, at 02:00, Liu, Shaoyun <Shaoyun.Liu@amd.com> wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> I think to resume with different SRIOV vGPUs depends on the  hypervisor has the live migration support . Different Hypervisor have different implementation , basically  it will call into the  host gpu driver in different stage and  host side do the  hw related  migration including the FW state.
Hi Shaoyun,
        Great news! That sounds like what I’m looking for:)
        Is there any documentation about how to enable this with an in-house implemented hypervisor? Will the hypervisor need to cooperate with the gim driver to enable resume with different vGPUs?
Regards
Gerry

>
> Regards
> Shaoyun.liu
>
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Christian König
> Sent: Tuesday, January 14, 2025 7:44 AM
> To: Gerry Liu <gerry@linux.alibaba.com>
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui
> <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri,
> Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang,
> Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario
> <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>;
> Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com;
> amd-gfx@lists.freedesktop.org
> Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
>
> Hi Gerry,
>
> On 14.01.25 at 12:03, Gerry Liu wrote:
> On January 14, 2025, at 18:46, Christian König <christian.koenig@amd.com> wrote:
>
> Hi Jiang,
>
> Some of the firmware, especially the multimedia ones, keep FW pointers to buffers in the suspend/resume state.
>
> In other words the firmware needs to be in the exact same location before and after resume. That's why we don't unpin the firmware BOs, but rather save their content and restore it. See function amdgpu_vcn_save_vcpu_bo() for reference.
>
> Additional to that the serial numbers, IDs etc are used for things like TMZ. So anything which uses HW encryption won't work any more.
>
> Then even two identical boards can have different harvest and memory channel configurations. Could be that we might be able to abstract that with SR-IOV but I won't rely on that.
>
> To summarize that looks like a completely futile effort which most likely won't work reliable in a production environment.
> Hi Christian,
>   Thanks for the information. Previously I assume that we may reset the asic and reload all firmwares on resume, but missed the vcn ip block which save and restore firmware vram content during suspend/resume. Is there any other IP blocks which save and restore firmware ram content?
>
> Not that I of hand know of any, but neither the hypervisor nor the driver stack was designed with something like this in mind. So could be that there are other dependencies I don't know about.
>
> I do remember that this idea of resuming on different HW than suspending came up a while ago and was rejected by multiple parties as to complicated and error prone.
>
> So we never looked more deeply into the possibility of doing that.
>
>
>
>   Our usage scenario targets GPGPU workload (amdkfd) with AMD GPU in single SR-IOV vGPU mode. Is it possible to resume on a different vGPU device in such a case?
>
> If I'm not completely mistaken you can use checkpoint/restore for that. It's still under development, but as far as I can see it should solve your problem quite nicely.
>
> Regards,
> Christian.
>
>
>
> Regards,
> Gerry
>
>
>
> Regards,
> Christian.
>
> On 14.01.25 at 10:54, Jiang Liu wrote:
> For virtual machines with AMD SR-IOV vGPUs, following work flow may be
> used to support virtual machine hibernation(suspend):
> 1) suspends a virtual machine with AMD vGPU A.
> 2) hypervisor dumps guest RAM content to a disk image.
> 3) hypervisor loads the guest system image from disk.
> 4) resumes the guest OS with a different AMD vGPU B.
>
> The step 4 above is special because we are resuming with a different
> AMD vGPU device and the amdgpu driver may observe changed device
> properties. To support above work flow, we need to fix those changed
> device properties cached by the amdgpu drivers.
>
> With information from the amdgpu driver source code (haven't read
> corresponding hardware specs yet), we have identified following
> changed device properties:
> 1) PCI MMIO address. This can be fixed by hypervisor.
> 2) serial_number, unique_id, xgmi_device_id, fru_id in sysfs. Seems
>    they are information only.
> 3) xgmi_physical_id if xgmi is enabled, which affects VRAM MC address.
> 4) mc_fb_offset, which affects VRAM physical address.
>
> We will focus on the VRAM address related changes here, because it's
> sensitive to the GPU functionalities. The original data sources
> include .get_mc_fb_offset(), .get_fb_location() and xgmi hardware registers.
> The main data cached by amdgpu driver are adev->gmc.vram_start and
> adev->vm_manager.vram_base_offset. And the major consumers of the
> cached information are ip_block.hw_init() and GMU page table builder.
>
> After code analysis, we found that most consumers of
> dev->gmc.vram_start and adev->vm_manager.vram_base_offset directly
> read value from these two variables on demand instead of caching them.
> So if we fix these two cached fields on resume, everything should work as expected.
>
> But there's an exception, and an very import exception, that callers
> of amdgpu_bo_create_kernel()/amdgpu_bo_create_reserved() may cache
> VRAM addresses. With further analysis, the callers of these interface
> have three different patterns:
> 1) This pattern is safe.
>    - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>    - call amdgpu_bo_free_kernel() in ip_block.suspend()
>    - call amdgpu_bo_create_reserved() in ip_block.resume()
> 2) This pattern works with current implementaiton of amdgpu_bo_create_reserved()
>    but bo.pin_count gets incorrect.
>    - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>    - call amdgpu_bo_create_reserved() in ip_block.resume()
> 3) This pattern needs to be enhanced.
>    - call amdgpu_bo_create_reserved() in ip_block.sw_init()
>
> So my question is which pattern should we use here? Personally I
> prefer pattern 2 with enhancement to fix the bo.pin_count.
>
> Currently there're still bugs in SRIOV suspend/resume, so we can't
> test our hypothesis. And we are not sure whether there are still other
> blocking to enable resume with different AMD SR-IOV vGPUs.
>
> Help is needed to identify more task items to enable resume with
> different AMD SR-IOV vGPUs:)
>
> Jiang Liu (2):
>   drm/amdgpu: update cached vram base addresses on resume
>   drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 15 +++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h      |  6 ++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   |  9 +++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c |  9 +++++++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |  7 +++++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  6 ++++++
>  7 files changed, 51 insertions(+), 2 deletions(-)



* Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-15  4:03           ` Liu, Shaoyun
@ 2025-01-15  5:24             ` Gerry Liu
  2025-01-15 11:23               ` Christian König
  0 siblings, 1 reply; 12+ messages in thread
From: Gerry Liu @ 2025-01-15  5:24 UTC (permalink / raw)
  To: Liu, Shaoyun
  Cc: Koenig, Christian, Deucher, Alexander, Pan, Xinhui,
	airlied@gmail.com, simona@ffwll.ch, Khatri, Sunil, Lazar, Lijo,
	Zhang, Hawking, Limonciello, Mario, Chen, Xiaogang, Russell, Kent,
	shuox.liu@linux.alibaba.com, amd-gfx@lists.freedesktop.org



> On January 15, 2025, at 12:03, Liu, Shaoyun <Shaoyun.Liu@amd.com> wrote:
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> I might misunderstood your requirement . For live migration, it's transparent  to the guest.  The guest can be  in running  state (ex. like running  some compute stuff),  hypervisor     and gim driver together will handle the GPU HW state migration from source vGPU to other  identical  vGPU .  It doesn't requires the guest to do the suspend/resume.  You can contact other engineers that work on SRIOV for more live  migration support info.
Yeah, there are different usage scenarios:
1) live migration
2) hibernate/suspend/resume
3) snapshot and clone
Currently we are focusing on live migration and hibernation, and we hope to base both on common underlying technologies.

> 
> Regards
> Shaoyun.liu
> 
> -----Original Message-----
> From: Gerry Liu <gerry@linux.alibaba.com>
> Sent: Tuesday, January 14, 2025 8:48 PM
> To: Liu, Shaoyun <Shaoyun.Liu@amd.com>
> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri, Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>; Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com; amd-gfx@lists.freedesktop.org
> Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
> 
> 
> 
>> On January 15, 2025, at 02:00, Liu, Shaoyun <Shaoyun.Liu@amd.com> wrote:
>> 
>> [AMD Official Use Only - AMD Internal Distribution Only]
>> 
>> I think to resume with different SRIOV vGPUs depends on the  hypervisor has the live migration support . Different Hypervisor have different implementation , basically  it will call into the  host gpu driver in different stage and  host side do the  hw related  migration including the FW state.
> Hi Shaoyun,
>        Great news! That sounds like what I’m looking for:)
>        Is there any documentation about how to enable this with an in-house implemented hypervisor? Will the hypervisor need to cooperate with the gim driver to enable resume with different vGPUs?
> Regards
> Gerry
> 
>> 
>> Regards
>> Shaoyun.liu
>> 
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Christian König
>> Sent: Tuesday, January 14, 2025 7:44 AM
>> To: Gerry Liu <gerry@linux.alibaba.com>
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui
>> <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri,
>> Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang,
>> Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario
>> <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>;
>> Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com;
>> amd-gfx@lists.freedesktop.org
>> Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
>> 
>> Hi Gerry,
>> 
>> On 14.01.25 at 12:03, Gerry Liu wrote:
>> On January 14, 2025, at 18:46, Christian König <christian.koenig@amd.com> wrote:
>> 
>> Hi Jiang,
>> 
>> Some of the firmware, especially the multimedia ones, keep FW pointers to buffers in the suspend/resume state.
>> 
>> In other words the firmware needs to be in the exact same location before and after resume. That's why we don't unpin the firmware BOs, but rather save their content and restore it. See function amdgpu_vcn_save_vcpu_bo() for reference.
>> 
>> Additional to that the serial numbers, IDs etc are used for things like TMZ. So anything which uses HW encryption won't work any more.
>> 
>> Then even two identical boards can have different harvest and memory channel configurations. Could be that we might be able to abstract that with SR-IOV but I won't rely on that.
>> 
>> To summarize that looks like a completely futile effort which most likely won't work reliable in a production environment.
>> Hi Christian,
>>  Thanks for the information. Previously I assume that we may reset the asic and reload all firmwares on resume, but missed the vcn ip block which save and restore firmware vram content during suspend/resume. Is there any other IP blocks which save and restore firmware ram content?
>> 
>> Not that I of hand know of any, but neither the hypervisor nor the driver stack was designed with something like this in mind. So could be that there are other dependencies I don't know about.
>> 
>> I do remember that this idea of resuming on different HW than suspending came up a while ago and was rejected by multiple parties as to complicated and error prone.
>> 
>> So we never looked more deeply into the possibility of doing that.
>> 
>> 
>> 
>>  Our usage scenario targets GPGPU workload (amdkfd) with AMD GPU in single SR-IOV vGPU mode. Is it possible to resume on a different vGPU device in such a case?
>> 
>> If I'm not completely mistaken you can use checkpoint/restore for that. It's still under development, but as far as I can see it should solve your problem quite nicely.
>> 
>> Regards,
>> Christian.
>> 
>> 
>> 
>> Regards,
>> Gerry
>> 
>> 
>> 
>> Regards,
>> Christian.
>> 
>> On 14.01.25 at 10:54, Jiang Liu wrote:
>> For virtual machines with AMD SR-IOV vGPUs, following work flow may be
>> used to support virtual machine hibernation(suspend):
>> 1) suspends a virtual machine with AMD vGPU A.
>> 2) hypervisor dumps guest RAM content to a disk image.
>> 3) hypervisor loads the guest system image from disk.
>> 4) resumes the guest OS with a different AMD vGPU B.
>> 
>> The step 4 above is special because we are resuming with a different
>> AMD vGPU device and the amdgpu driver may observe changed device
>> properties. To support above work flow, we need to fix those changed
>> device properties cached by the amdgpu drivers.
>> 
>> With information from the amdgpu driver source code (haven't read
>> corresponding hardware specs yet), we have identified following
>> changed device properties:
>> 1) PCI MMIO address. This can be fixed by hypervisor.
>> 2) serial_number, unique_id, xgmi_device_id, fru_id in sysfs. Seems
>>   they are information only.
>> 3) xgmi_physical_id if xgmi is enabled, which affects VRAM MC address.
>> 4) mc_fb_offset, which affects VRAM physical address.
>> 
>> We will focus on the VRAM address related changes here, because it's
>> sensitive to the GPU functionalities. The original data sources
>> include .get_mc_fb_offset(), .get_fb_location() and xgmi hardware registers.
>> The main data cached by amdgpu driver are adev->gmc.vram_start and
>> adev->vm_manager.vram_base_offset. And the major consumers of the
>> cached information are ip_block.hw_init() and GMU page table builder.
>> 
>> After code analysis, we found that most consumers of
>> dev->gmc.vram_start and adev->vm_manager.vram_base_offset directly
>> read value from these two variables on demand instead of caching them.
>> So if we fix these two cached fields on resume, everything should work as expected.
>> 
>> But there's an exception, and an very import exception, that callers
>> of amdgpu_bo_create_kernel()/amdgpu_bo_create_reserved() may cache
>> VRAM addresses. With further analysis, the callers of these interface
>> have three different patterns:
>> 1) This pattern is safe.
>>   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>>   - call amdgpu_bo_free_kernel() in ip_block.suspend()
>>   - call amdgpu_bo_create_reserved() in ip_block.resume()
>> 2) This pattern works with current implementaiton of amdgpu_bo_create_reserved()
>>   but bo.pin_count gets incorrect.
>>   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>>   - call amdgpu_bo_create_reserved() in ip_block.resume()
>> 3) This pattern needs to be enhanced.
>>   - call amdgpu_bo_create_reserved() in ip_block.sw_init()
>> 
>> So my question is which pattern should we use here? Personally I
>> prefer pattern 2 with enhancement to fix the bo.pin_count.
>> 
>> Currently there're still bugs in SRIOV suspend/resume, so we can't
>> test our hypothesis. And we are not sure whether there are still other
>> blocking to enable resume with different AMD SR-IOV vGPUs.
>> 
>> Help is needed to identify more task items to enable resume with
>> different AMD SR-IOV vGPUs:)
>> 
>> Jiang Liu (2):
>>  drm/amdgpu: update cached vram base addresses on resume
>>  drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()
>> 
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 15 +++++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h      |  6 ++++--
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   |  9 +++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c |  9 +++++++++
>> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |  7 +++++++
>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  6 ++++++
>> 7 files changed, 51 insertions(+), 2 deletions(-)
> 



* Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
  2025-01-15  5:24             ` Gerry Liu
@ 2025-01-15 11:23               ` Christian König
  0 siblings, 0 replies; 12+ messages in thread
From: Christian König @ 2025-01-15 11:23 UTC (permalink / raw)
  To: Gerry Liu, Liu, Shaoyun
  Cc: Koenig, Christian, Deucher, Alexander, Pan, Xinhui,
	airlied@gmail.com, simona@ffwll.ch, Khatri, Sunil, Lazar, Lijo,
	Zhang, Hawking, Limonciello, Mario, Chen, Xiaogang, Russell, Kent,
	shuox.liu@linux.alibaba.com, amd-gfx@lists.freedesktop.org

Hi Shaoyun and Gerry,

Yes, good idea; live migration is certainly an option as well.

In live migration the hypervisor transparently presents the client with
the same GPU configuration on the new GPU as it saw before on the old
GPU. In other words, the full state, including VRAM content and
locations as well as FW and HW state, is transferred from one GPU to another.

Checkpoint/restore on the other hand works on the application level by 
writing the state of a specific application into a checkpoint file with 
the help of the user mode and kernel drivers.

The hibernation approach you proposed kind of sits in between those two
ideas. But the hypervisor of the new GPU is not aware of the old GPU
configuration, and the user mode driver isn't aware that it has been
moved to different hardware either, while the configuration parameters
still change. That's why you have to add those hacks to update the
location of pinned BOs in the firmware.

But as far as I can see that will never work 100% reliably, because you
can't look into the firmware code and update all the pointers there,
especially when the firmware stack etc. is relative to parameters the
hypervisor has set.
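
To illustrate the pointer problem described above, here is a minimal
sketch. It is not amdgpu code, and every name in it is hypothetical. It
assumes a buffer whose offset inside VRAM is stable while the VRAM base
changes between suspend and resume. An address recomputed on demand
follows the new base, but an absolute address handed to firmware once
goes stale, and the driver cannot patch it inside opaque firmware state.

```c
#include <stdint.h>

/* Hypothetical stand-in for per-device state the driver caches
 * (loosely modelled on the idea of a VRAM base; not a real struct). */
struct fake_gpu {
	uint64_t vram_start;	/* base address of VRAM for this vGPU */
};

/* Hypothetical pinned buffer: only its offset inside VRAM is stable. */
struct fake_bo {
	uint64_t vram_offset;
};

/* Hypothetical opaque firmware state: it keeps an absolute address
 * that was handed over once at load time. */
struct fake_fw_state {
	uint64_t saved_ptr;
};

/* Recompute the buffer address on demand from the current base. */
uint64_t fake_bo_gpu_addr(const struct fake_gpu *gpu,
			  const struct fake_bo *bo)
{
	return gpu->vram_start + bo->vram_offset;
}

/* Hand the absolute address to the firmware, which stores it. */
void fake_fw_load(struct fake_fw_state *fw, const struct fake_gpu *gpu,
		  const struct fake_bo *bo)
{
	fw->saved_ptr = fake_bo_gpu_addr(gpu, bo);
}

/* Returns 1 if the address stored by the firmware still matches the
 * buffer's current location, 0 if it has gone stale. */
int fake_fw_ptr_valid(const struct fake_fw_state *fw,
		      const struct fake_gpu *gpu,
		      const struct fake_bo *bo)
{
	return fw->saved_ptr == fake_bo_gpu_addr(gpu, bo);
}
```

If vram_start is changed after fake_fw_load() has run, fake_bo_gpu_addr()
returns the new address while fake_fw_ptr_valid() reports the stored
pointer as stale. Fixing the cached base covers only the first case.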

Regards,
Christian.

On 15.01.25 at 06:24, Gerry Liu wrote:
>
>> On January 15, 2025, at 12:03, Liu, Shaoyun <Shaoyun.Liu@amd.com> wrote:
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> I might have misunderstood your requirement. Live migration is transparent to the guest. The guest can be in a running state (e.g. running some compute work); the hypervisor and the gim driver together handle the GPU HW state migration from the source vGPU to another identical vGPU. It doesn't require the guest to do a suspend/resume. You can contact other engineers who work on SR-IOV for more information on live migration support.
> Yeah, there are different usage scenarios:
> 1) live migration
> 2) hibernate/suspend/resume
> 3) snapshot and clone
> Currently we are focusing on live migration and hibernation, and hope that both can be based on common underlying technologies.
>
>> Regards
>> Shaoyun.liu
>>
>> -----Original Message-----
>> From: Gerry Liu <gerry@linux.alibaba.com>
>> Sent: Tuesday, January 14, 2025 8:48 PM
>> To: Liu, Shaoyun <Shaoyun.Liu@amd.com>
>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri, Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>; Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com; amd-gfx@lists.freedesktop.org
>> Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
>>
>>
>>
>>> On January 15, 2025, at 02:00, Liu, Shaoyun <Shaoyun.Liu@amd.com> wrote:
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> I think resuming with different SR-IOV vGPUs depends on the hypervisor having live migration support. Different hypervisors have different implementations; basically, the hypervisor calls into the host GPU driver at different stages, and the host side does the HW-related migration, including the FW state.
>> Hi Shaoyun,
>>         Great news! That sounds like what I’m looking for:)
>>         Is there any documentation about how to enable this with an in-house implemented hypervisor? Will the hypervisor need to cooperate with the gim driver to enable resume with different vGPUs?
>> Regards
>> Gerry
>>
>>> Regards
>>> Shaoyun.liu
>>>
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>> Christian König
>>> Sent: Tuesday, January 14, 2025 7:44 AM
>>> To: Gerry Liu <gerry@linux.alibaba.com>
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Pan, Xinhui
>>> <Xinhui.Pan@amd.com>; airlied@gmail.com; simona@ffwll.ch; Khatri,
>>> Sunil <Sunil.Khatri@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Zhang,
>>> Hawking <Hawking.Zhang@amd.com>; Limonciello, Mario
>>> <Mario.Limonciello@amd.com>; Chen, Xiaogang <Xiaogang.Chen@amd.com>;
>>> Russell, Kent <Kent.Russell@amd.com>; shuox.liu@linux.alibaba.com;
>>> amd-gfx@lists.freedesktop.org
>>> Subject: Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
>>>
>>> Hi Gerry,
>>>
>>> On 14.01.25 at 12:03, Gerry Liu wrote:
>>> On January 14, 2025, at 18:46, Christian König <christian.koenig@amd.com> wrote:
>>>
>>> Hi Jiang,
>>>
>>> Some of the firmware, especially the multimedia firmware, keeps FW pointers to buffers in its suspend/resume state.
>>>
>>> In other words, the firmware needs to be at the exact same location before and after resume. That's why we don't unpin the firmware BOs, but rather save their content and restore it. See the function amdgpu_vcn_save_vcpu_bo() for reference.
>>>
>>> In addition to that, the serial numbers, IDs etc. are used for things like TMZ. So anything which uses HW encryption won't work any more.
>>>
>>> Then even two identical boards can have different harvest and memory channel configurations. It could be that we are able to abstract that with SR-IOV, but I wouldn't rely on that.
>>>
>>> To summarize, that looks like a completely futile effort which most likely won't work reliably in a production environment.
>>> Hi Christian,
>>>   Thanks for the information. Previously I assumed that we could reset the ASIC and reload all firmware on resume, but I missed the VCN IP block, which saves and restores firmware VRAM content during suspend/resume. Are there any other IP blocks that save and restore firmware RAM content?
>>>
>>> Not that I know of offhand, but neither the hypervisor nor the driver stack was designed with something like this in mind. So it could be that there are other dependencies I don't know about.
>>>
>>> I do remember that this idea of resuming on different HW than the one suspended on came up a while ago and was rejected by multiple parties as too complicated and error prone.
>>>
>>> So we never looked more deeply into the possibility of doing that.
>>>
>>>
>>>
>>>   Our usage scenario targets GPGPU workload (amdkfd) with AMD GPU in single SR-IOV vGPU mode. Is it possible to resume on a different vGPU device in such a case?
>>>
>>> If I'm not completely mistaken you can use checkpoint/restore for that. It's still under development, but as far as I can see it should solve your problem quite nicely.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>
>>> Regards,
>>> Gerry
>>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 14.01.25 at 10:54, Jiang Liu wrote:
>>> For virtual machines with AMD SR-IOV vGPUs, the following workflow may be
>>> used to support virtual machine hibernation (suspend):
>>> 1) Suspend a virtual machine with AMD vGPU A.
>>> 2) The hypervisor dumps guest RAM content to a disk image.
>>> 3) The hypervisor loads the guest system image from disk.
>>> 4) Resume the guest OS with a different AMD vGPU B.
>>>
>>> Step 4 above is special because we are resuming with a different
>>> AMD vGPU device and the amdgpu driver may observe changed device
>>> properties. To support the above workflow, we need to fix those changed
>>> device properties cached by the amdgpu driver.
>>>
>>> Based on the amdgpu driver source code (we haven't read the
>>> corresponding hardware specs yet), we have identified the following
>>> changed device properties:
>>> 1) PCI MMIO address. This can be fixed by the hypervisor.
>>> 2) serial_number, unique_id, xgmi_device_id, fru_id in sysfs. These seem
>>>    to be informational only.
>>> 3) xgmi_physical_id if xgmi is enabled, which affects the VRAM MC address.
>>> 4) mc_fb_offset, which affects the VRAM physical address.
>>>
>>> We will focus on the VRAM address related changes here, because GPU
>>> functionality is sensitive to them. The original data sources
>>> include .get_mc_fb_offset(), .get_fb_location() and xgmi hardware registers.
>>> The main data cached by the amdgpu driver are adev->gmc.vram_start and
>>> adev->vm_manager.vram_base_offset. The major consumers of the
>>> cached information are ip_block.hw_init() and the GMU page table builder.
>>>
>>> After code analysis, we found that most consumers of
>>> adev->gmc.vram_start and adev->vm_manager.vram_base_offset read the
>>> values directly from these two variables on demand instead of caching them.
>>> So if we fix these two cached fields on resume, everything should work as expected.
>>>
>>> But there's an exception, and a very important one: callers
>>> of amdgpu_bo_create_kernel()/amdgpu_bo_create_reserved() may cache
>>> VRAM addresses. With further analysis, the callers of these interfaces
>>> follow three different patterns:
>>> 1) This pattern is safe.
>>>    - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>>>    - call amdgpu_bo_free_kernel() in ip_block.suspend()
>>>    - call amdgpu_bo_create_reserved() in ip_block.resume()
>>> 2) This pattern works with the current implementation of amdgpu_bo_create_reserved(),
>>>    but bo.pin_count becomes incorrect.
>>>    - call amdgpu_bo_create_reserved() in ip_block.hw_init()
>>>    - call amdgpu_bo_create_reserved() in ip_block.resume()
>>> 3) This pattern needs to be enhanced.
>>>    - call amdgpu_bo_create_reserved() in ip_block.sw_init()
>>>
>>> So my question is: which pattern should we use here? Personally I
>>> prefer pattern 2, with an enhancement to fix bo.pin_count.
>>>
>>> Currently there are still bugs in SR-IOV suspend/resume, so we can't
>>> test our hypothesis. And we are not sure whether there are other
>>> blockers to enabling resume with different AMD SR-IOV vGPUs.
>>>
>>> Help is needed to identify more task items to enable resume with
>>> different AMD SR-IOV vGPUs :)
>>>
>>> Jiang Liu (2):
>>>   drm/amdgpu: update cached vram base addresses on resume
>>>   drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 15 +++++++++++++++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h      |  6 ++++--
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   |  9 +++++++++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c |  9 +++++++++
>>> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |  7 +++++++
>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  6 ++++++
>>> 7 files changed, 51 insertions(+), 2 deletions(-)



Thread overview: 12+ messages
2025-01-14  9:54 [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Jiang Liu
2025-01-14  9:54 ` [RFC v1 1/2] drm/amdgpu: update cached vram base addresses on resume Jiang Liu
2025-01-14  9:54 ` [RFC v1 2/2] drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr() Jiang Liu
2025-01-14 10:35   ` Christian König
2025-01-14 10:46 ` [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs Christian König
2025-01-14 11:03   ` Gerry Liu
2025-01-14 12:43     ` Christian König
2025-01-14 18:00       ` Liu, Shaoyun
2025-01-15  1:47         ` Gerry Liu
2025-01-15  4:03           ` Liu, Shaoyun
2025-01-15  5:24             ` Gerry Liu
2025-01-15 11:23               ` Christian König