[PATCH v6 0/4] enable xgmi node migration support for hibernate on SRIOV.

* [PATCH v6 0/4] enable xgmi node migration support for hibernate on SRIOV.
@ 2025-05-19  8:20 Samuel Zhang
  2025-05-19  8:20 ` [PATCH v6 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume Samuel Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-19  8:20 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Lijo.Lazar

In SRIOV VM environments, a customer may need to switch to new vGPU indexes after hibernating and then resuming the VM. For GPUs with XGMI, `vram_start` changes in this case, so the FB aperture GPU addresses of VRAM BOs change as well.
These GPU addresses need to be updated on resume, but they are cached all over the KMD codebase, and updating each of them individually is error-prone and not acceptable.

The solution is to use the pdb0 page table to cover both VRAM and GART memory, and to use pdb0-based virtual GPU addresses instead. When the GPU indexes change, the virtual GPU addresses do not.

For PSP and SMU, pdb0-based GPU addresses do not work, so the original FB aperture GPU addresses are used instead. They need to be updated on resume when the vGPUs have changed.
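
As a rough illustration of the two address schemes described above, here is a minimal user-space sketch in C. It is not driver code: `struct node_model`, `fb_aperture_addr()` and `pdb0_virtual_addr()` are hypothetical names, and the model assumes that VRAM is mapped from virtual address 0 when pdb0 is used (which is what setting `vram_start` to 0 amounts to).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of one XGMI node; not the amdgpu structs. */
struct node_model {
	uint64_t fb_base;           /* FB aperture base of the hive */
	uint64_t node_segment_size; /* VRAM bytes owned by each node */
	uint32_t physical_node_id;  /* may change across hibernate/resume */
};

/* FB aperture address: includes the node id, so it moves when the id changes. */
static uint64_t fb_aperture_addr(const struct node_model *n, uint64_t bo_offset)
{
	assert(bo_offset < n->node_segment_size);
	return n->fb_base +
	       (uint64_t)n->physical_node_id * n->node_segment_size +
	       bo_offset;
}

/*
 * pdb0-based virtual address (assumption: VRAM is mapped starting at
 * virtual address 0). The node id only affects what the page table
 * points to, not the address handed out to the rest of the driver.
 */
static uint64_t pdb0_virtual_addr(const struct node_model *n, uint64_t bo_offset)
{
	assert(bo_offset < n->node_segment_size);
	return bo_offset;
}
```

With a 16 GiB segment and an FB base of 0, moving the same BO from node 1 to node 3 shifts its FB aperture address from 0x400001000 to 0xC00001000, while its pdb0-based address stays at 0x1000. Only the consumers that must use FB aperture addresses (PSP and SMU) need a fixup.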

v2:
- remove physical_node_id_changed
- set vram_start to 0 to switch cached gpu addr to gart aperture
- cleanup pdb0 patch
v3:
- remove gmc_v9_0_init_sw_mem_ranges() call
- remove vram_offset member
- add 4 refactoring patches to remove cached gpu addr
- cleanup pdb0 patch
v4:
- remove gmc_v9_0_mc_init() call and `refresh` update.
- do not set `fb_start` in mmhub_v1_8_get_fb_location() when pdb0 enabled.
v5:
- add amdgpu_virt_xgmi_migrate_enabled() check
- move vram_base_offset update to pdb0 patch
- remove 4 refactoring patches to remove cached gpu addr
- add patch to fix IH not working issue when resume with new VF
v6:
- rename amdgpu_device_update_xgmi_info() to amdgpu_virt_resume()
- merge xgmi node and vram_base_offset update, IH fix into amdgpu_virt_resume()
- remove 2 unnecessary gpu addr update changes

Samuel Zhang (4):
  drm/amdgpu: update xgmi info and vram_base_offset on resume
  drm/amdgpu: update GPU addresses for SMU and PSP
  drm/amdgpu: enable pdb0 for hibernation on SRIOV
  drm/amdgpu: fix fence fallback timer expired error

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 35 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    | 32 +++++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 20 +++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 23 ++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  7 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      | 10 ++++---
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c    |  6 ++--
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 18 +++++++++++
 14 files changed, 146 insertions(+), 15 deletions(-)

-- 
2.43.5



* [PATCH v6 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume
  2025-05-19  8:20 [PATCH v6 0/4] enable xgmi node migration support for hibernate on SRIOV Samuel Zhang
@ 2025-05-19  8:20 ` Samuel Zhang
  2025-05-19 13:41   ` Christian König
  2025-05-19  8:20 ` [PATCH v6 2/4] drm/amdgpu: update GPU addresses for SMU and PSP Samuel Zhang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Samuel Zhang @ 2025-05-19  8:20 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Lijo.Lazar, Jiang Liu

On SRIOV VMs with XGMI-enabled systems, the XGMI physical node id may
change when the VM hibernates and resumes with a different VF.

Update the XGMI info and vram_base_offset on resume for SRIOV VFs with
GC IP version 9.4.4 (gfx444).
Add amdgpu_virt_xgmi_migrate_enabled() as the feature flag.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 32 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  7 +++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d477a901af84..e5bb46effb6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2732,6 +2732,12 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev)
 	if (!amdgpu_device_pcie_dynamic_switching_supported(adev))
 		adev->pm.pp_feature &= ~PP_PCIE_DPM_MASK;
 
+	adev->virt.is_xgmi_node_migrate_enabled = false;
+	if (amdgpu_sriov_vf(adev)) {
+		adev->virt.is_xgmi_node_migrate_enabled =
+			amdgpu_ip_version((adev), GC_HWIP, 0) == IP_VERSION(9, 4, 4);
+	}
+
 	total = true;
 	for (i = 0; i < adev->num_ip_blocks; i++) {
 		ip_block = &adev->ip_blocks[i];
@@ -5040,6 +5046,28 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 	return 0;
 }
 
+static inline int amdgpu_virt_resume(struct amdgpu_device *adev)
+{
+	int r;
+	unsigned int prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
+
+	if (!amdgpu_virt_xgmi_migrate_enabled(adev))
+		return 0;
+
+	r = adev->gfxhub.funcs->get_xgmi_info(adev);
+	if (r)
+		return r;
+
+	dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
+		prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
+
+	adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
+	adev->vm_manager.vram_base_offset +=
+		adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
+
+	return 0;
+}
+
 /**
  * amdgpu_device_resume - initiate device resume
  *
@@ -5061,6 +5089,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 			return r;
 	}
 
+	r = amdgpu_virt_resume(adev);
+	if (r)
+		goto exit;
+
 	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
 		return 0;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index df03dba67ab8..532b92628171 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -295,6 +295,9 @@ struct amdgpu_virt {
 	union amd_sriov_ras_caps ras_telemetry_en_caps;
 	struct amdgpu_virt_ras ras;
 	struct amd_sriov_ras_telemetry_error_count count_cache;
+
+	/* hibernate and resume with different VF feature for xgmi enabled system */
+	bool is_xgmi_node_migrate_enabled;
 };
 
 struct amdgpu_video_codec_info;
@@ -376,6 +379,10 @@ static inline bool is_virtual_machine(void)
 	((adev)->virt.gim_feature & AMDGIM_FEATURE_VCN_RB_DECOUPLE)
 #define amdgpu_sriov_is_mes_info_enable(adev) \
 	((adev)->virt.gim_feature & AMDGIM_FEATURE_MES_INFO_ENABLE)
+
+#define amdgpu_virt_xgmi_migrate_enabled(adev) \
+	((adev)->virt.is_xgmi_node_migrate_enabled)
+
 bool amdgpu_virt_mmio_blocked(struct amdgpu_device *adev);
 void amdgpu_virt_init_setting(struct amdgpu_device *adev);
 int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
-- 
2.43.5



* [PATCH v6 2/4] drm/amdgpu: update GPU addresses for SMU and PSP
  2025-05-19  8:20 [PATCH v6 0/4] enable xgmi node migration support for hibernate on SRIOV Samuel Zhang
  2025-05-19  8:20 ` [PATCH v6 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume Samuel Zhang
@ 2025-05-19  8:20 ` Samuel Zhang
  2025-05-19 13:43   ` Christian König
  2025-05-19  8:20 ` [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV Samuel Zhang
  2025-05-19  8:20 ` [PATCH v6 4/4] drm/amdgpu: fix fence fallback timer expired error Samuel Zhang
  3 siblings, 1 reply; 14+ messages in thread
From: Samuel Zhang @ 2025-05-19  8:20 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Lijo.Lazar, Jiang Liu

Add amdgpu_bo_fb_aper_addr() and update the cached GPU addresses to use
the FB aperture address for SMU and PSP.

There are two reasons for this change:
1. When pdb0 is enabled, the GPU address returned by
amdgpu_bo_create_kernel() is a GART aperture address, which is not
compatible with SMU and PSP. It needs to be replaced with the FB
aperture address.
2. The FB aperture address changes after switching to a new GPU index
after hibernation, so it needs to be updated on resume.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 20 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 23 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  |  3 +++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 18 +++++++++++++++++
 5 files changed, 65 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4e794d546b61..3dde57cd5b81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1513,6 +1513,26 @@ u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo)
 	return amdgpu_bo_gpu_offset_no_check(bo);
 }
 
+/**
+ * amdgpu_bo_fb_aper_addr - return FB aperture GPU offset of the VRAM bo
+ * @bo:	amdgpu VRAM buffer object for which we query the offset
+ *
+ * Returns:
+ * current FB aperture GPU offset of the object.
+ */
+u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo)
+{
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+	uint64_t offset, fb_base;
+
+	WARN_ON_ONCE(bo->tbo.resource->mem_type != TTM_PL_VRAM);
+
+	fb_base = adev->mmhub.funcs->get_fb_location(adev);
+	fb_base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
+	offset = (bo->tbo.resource->start << PAGE_SHIFT) + fb_base;
+	return amdgpu_gmc_sign_extend(offset);
+}
+
 /**
  * amdgpu_bo_gpu_offset_no_check - return GPU offset of bo
  * @bo:	amdgpu object for which we query the offset
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index dcce362bfad3..c8a63e38f5d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -320,6 +320,7 @@ int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
 			     bool intr);
 int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
 u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
+u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo);
 u64 amdgpu_bo_gpu_offset_no_check(struct amdgpu_bo *bo);
 uint32_t amdgpu_bo_mem_stats_placement(struct amdgpu_bo *bo);
 uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index e1e658a97b36..3fc4b8e6fc59 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -871,6 +871,8 @@ static int psp_tmr_init(struct psp_context *psp)
 					      &psp->tmr_bo, &psp->tmr_mc_addr,
 					      pptr);
 	}
+	if (amdgpu_virt_xgmi_migrate_enabled(psp->adev) && psp->tmr_bo)
+		psp->tmr_mc_addr = amdgpu_bo_fb_aper_addr(psp->tmr_bo);
 
 	return ret;
 }
@@ -1270,6 +1272,11 @@ int psp_ta_load(struct psp_context *psp, struct ta_context *context)
 	psp_copy_fw(psp, context->bin_desc.start_addr,
 		    context->bin_desc.size_bytes);
 
+	if (amdgpu_virt_xgmi_migrate_enabled(psp->adev) &&
+		context->mem_context.shared_bo)
+		context->mem_context.shared_mc_addr =
+			amdgpu_bo_fb_aper_addr(context->mem_context.shared_bo);
+
 	psp_prep_ta_load_cmd_buf(cmd, psp->fw_pri_mc_addr, context);
 
 	ret = psp_cmd_submit_buf(psp, NULL, cmd,
@@ -2336,11 +2343,27 @@ bool amdgpu_psp_tos_reload_needed(struct amdgpu_device *adev)
 	return false;
 }
 
+static void psp_update_gpu_addresses(struct amdgpu_device *adev)
+{
+	struct psp_context *psp = &adev->psp;
+
+	if (psp->cmd_buf_bo && psp->cmd_buf_mem) {
+		psp->fw_pri_mc_addr = amdgpu_bo_fb_aper_addr(psp->fw_pri_bo);
+		psp->fence_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->fence_buf_bo);
+		psp->cmd_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->cmd_buf_bo);
+	}
+	if (adev->firmware.rbuf && psp->km_ring.ring_mem)
+		psp->km_ring.ring_mem_mc_addr = amdgpu_bo_fb_aper_addr(adev->firmware.rbuf);
+}
+
 static int psp_hw_start(struct psp_context *psp)
 {
 	struct amdgpu_device *adev = psp->adev;
 	int ret;
 
+	if (amdgpu_virt_xgmi_migrate_enabled(adev))
+		psp_update_gpu_addresses(adev);
+
 	if (!amdgpu_sriov_vf(adev)) {
 		if ((is_psp_fw_valid(psp->kdb)) &&
 		    (psp->funcs->bootloader_load_kdb != NULL)) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index 3d9e9fdc10b4..bf9013f8d12e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -1152,6 +1152,9 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
 		adev->firmware.max_ucodes = AMDGPU_UCODE_ID_MAXIMUM;
 	}
 
+	if (amdgpu_virt_xgmi_migrate_enabled(adev) && adev->firmware.fw_buf)
+		adev->firmware.fw_buf_mc = amdgpu_bo_fb_aper_addr(adev->firmware.fw_buf);
+
 	for (i = 0; i < adev->firmware.max_ucodes; i++) {
 		ucode = &adev->firmware.ucode[i];
 		if (ucode->fw) {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 315b0856bf02..f9f49f37dfcd 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1000,6 +1000,21 @@ static int smu_fini_fb_allocations(struct smu_context *smu)
 	return 0;
 }
 
+static void smu_update_gpu_addresses(struct smu_context *smu)
+{
+	struct smu_table_context *smu_table = &smu->smu_table;
+	struct smu_table *pm_status_table = smu_table->tables + SMU_TABLE_PMSTATUSLOG;
+	struct smu_table *driver_table = &(smu_table->driver_table);
+	struct smu_table *dummy_read_1_table = &smu_table->dummy_read_1_table;
+
+	if (pm_status_table->bo)
+		pm_status_table->mc_address = amdgpu_bo_fb_aper_addr(pm_status_table->bo);
+	if (driver_table->bo)
+		driver_table->mc_address = amdgpu_bo_fb_aper_addr(driver_table->bo);
+	if (dummy_read_1_table->bo)
+		dummy_read_1_table->mc_address = amdgpu_bo_fb_aper_addr(dummy_read_1_table->bo);
+}
+
 /**
  * smu_alloc_memory_pool - allocate memory pool in the system memory
  *
@@ -1789,6 +1804,9 @@ static int smu_start_smc_engine(struct smu_context *smu)
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0;
 
+	if (amdgpu_virt_xgmi_migrate_enabled(adev))
+		smu_update_gpu_addresses(smu);
+
 	smu->smc_fw_state = SMU_FW_INIT;
 
 	if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
-- 
2.43.5
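
The address computation in amdgpu_bo_fb_aper_addr() above can be tried out in user space with the following minimal sketch. All `model_`/`MODEL_` names are hypothetical. The 4 KiB page shift and the two hole constants are assumptions about a 48-bit GPU virtual address space, chosen to mirror what amdgpu_gmc_sign_extend() is understood to do; the patch itself does not show them.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12                    /* assumption: 4 KiB pages */
#define MODEL_HOLE_START 0x0000800000000000ULL /* assumption: 48-bit VA hole */
#define MODEL_HOLE_END   0xffff800000000000ULL

/* Set the upper bits for addresses in the upper half of the VA space. */
static uint64_t model_sign_extend(uint64_t addr)
{
	if (addr >= MODEL_HOLE_START)
		addr |= MODEL_HOLE_END;
	return addr;
}

/*
 * fb_location:       FB aperture base as read back from the hub
 * node_id:           current XGMI physical node id
 * node_segment_size: VRAM bytes per XGMI node
 * start_page:        first page of the BO inside this node's VRAM
 */
static uint64_t model_fb_aper_addr(uint64_t fb_location, uint32_t node_id,
				   uint64_t node_segment_size,
				   uint64_t start_page)
{
	uint64_t fb_base = fb_location + (uint64_t)node_id * node_segment_size;

	assert(start_page <= (UINT64_MAX >> MODEL_PAGE_SHIFT));
	return model_sign_extend((start_page << MODEL_PAGE_SHIFT) + fb_base);
}
```

Because the node id is an input, the result has to be recomputed on every resume. That is why the patch refreshes the cached PSP and SMU addresses in psp_hw_start() and smu_start_smc_engine() rather than only at allocation time.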



* [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-19  8:20 [PATCH v6 0/4] enable xgmi node migration support for hibernate on SRIOV Samuel Zhang
  2025-05-19  8:20 ` [PATCH v6 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume Samuel Zhang
  2025-05-19  8:20 ` [PATCH v6 2/4] drm/amdgpu: update GPU addresses for SMU and PSP Samuel Zhang
@ 2025-05-19  8:20 ` Samuel Zhang
  2025-05-19 13:57   ` Christian König
  2025-05-19  8:20 ` [PATCH v6 4/4] drm/amdgpu: fix fence fallback timer expired error Samuel Zhang
  3 siblings, 1 reply; 14+ messages in thread
From: Samuel Zhang @ 2025-05-19  8:20 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Lijo.Lazar, Emily Deng

When switching to a new GPU index after hibernation and then resuming,
the VRAM offset of each VRAM BO changes, and the cached GPU addresses
need to be updated.

Enable pdb0 and switch to pdb0-based virtual GPU addresses by default
in amdgpu_bo_create_reserved(). Since the virtual addresses do not
change, this avoids the need to update cached GPU addresses all over
the codebase.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 32 ++++++++++++++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c    | 10 +++++---
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c  |  6 +++--
 5 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index d1fa5e8e3937..265d6c777af5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -38,6 +38,8 @@
 #include <drm/drm_drv.h>
 #include <drm/ttm/ttm_tt.h>
 
+static const u64 four_gb = 0x100000000ULL;
+
 /**
  * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
  *
@@ -249,15 +251,24 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
 {
 	u64 hive_vram_start = 0;
 	u64 hive_vram_end = mc->xgmi.node_segment_size * mc->xgmi.num_physical_nodes - 1;
-	mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
-	mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
-	mc->gart_start = hive_vram_end + 1;
+
+	if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
+		/* set mc->vram_start to 0 to switch the returned GPU address of
+		 * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
+		 */
+		amdgpu_gmc_vram_location(adev, mc, 0);
+	} else {
+		mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
+		mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
+		dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
+				mc->mc_vram_size >> 20, mc->vram_start,
+				mc->vram_end, mc->real_vram_size >> 20);
+	}
+	/* node_segment_size may not be 4GB aligned on SRIOV, so align it up. */
+	mc->gart_start = ALIGN(hive_vram_end + 1, four_gb);
 	mc->gart_end = mc->gart_start + mc->gart_size - 1;
 	mc->fb_start = hive_vram_start;
 	mc->fb_end = hive_vram_end;
-	dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
-			mc->mc_vram_size >> 20, mc->vram_start,
-			mc->vram_end, mc->real_vram_size >> 20);
 	dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
 			mc->gart_size >> 20, mc->gart_start, mc->gart_end);
 }
@@ -276,7 +287,6 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
 void amdgpu_gmc_gart_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
 			      enum amdgpu_gart_placement gart_placement)
 {
-	const uint64_t four_gb = 0x100000000ULL;
 	u64 size_af, size_bf;
 	/*To avoid the hole, limit the max mc address to AMDGPU_GMC_HOLE_START*/
 	u64 max_mc_address = min(adev->gmc.mc_mask, AMDGPU_GMC_HOLE_START - 1);
@@ -1068,6 +1078,14 @@ void amdgpu_gmc_init_pdb0(struct amdgpu_device *adev)
 	flags |= AMDGPU_PTE_FRAG((adev->gmc.vmid0_page_table_block_size + 9*1));
 	flags |= AMDGPU_PDE_PTE_FLAG(adev);
 
+	if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
+		/* always start from current device so that the GART address can keep
+		 * consistent when hibernate-resume with different GPUs.
+		 */
+		vram_addr = adev->vm_manager.vram_base_offset;
+		vram_end = vram_addr + vram_size;
+	}
+
 	/* The first n PDE0 entries are used as PTE,
 	 * pointing to vram
 	 */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index bd7fc123b8f9..46fac7ca7dfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -307,6 +307,7 @@ struct amdgpu_gmc {
 	struct amdgpu_bo		*pdb0_bo;
 	/* CPU kmapped address of pdb0*/
 	void				*ptr_pdb0;
+	bool pdb0_enabled;
 
 	/* MALL size */
 	u64 mall_size;
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
index cb25f7f0dfc1..e6165f6d0763 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
@@ -180,7 +180,7 @@ gfxhub_v1_2_xcc_init_system_aperture_regs(struct amdgpu_device *adev,
 		/* In the case squeezing vram into GART aperture, we don't use
 		 * FB aperture and AGP aperture. Disable them.
 		 */
-		if (adev->gmc.pdb0_bo) {
+		if (adev->gmc.pdb0_bo && !amdgpu_virt_xgmi_migrate_enabled(adev)) {
 			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_TOP, 0);
 			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_BASE, 0x00FFFFFF);
 			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_AGP_TOP, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 59385da80185..04fb99c64b37 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1682,6 +1682,8 @@ static int gmc_v9_0_early_init(struct amdgpu_ip_block *ip_block)
 		adev->gmc.private_aperture_start + (4ULL << 30) - 1;
 	adev->gmc.noretry_flags = AMDGPU_VM_NORETRY_FLAGS_TF;
 
+	adev->gmc.pdb0_enabled = adev->gmc.xgmi.connected_to_cpu ||
+		amdgpu_virt_xgmi_migrate_enabled(adev);
 	return 0;
 }
 
@@ -1726,7 +1728,7 @@ static void gmc_v9_0_vram_gtt_location(struct amdgpu_device *adev,
 
 	/* add the xgmi offset of the physical node */
 	base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
-	if (adev->gmc.xgmi.connected_to_cpu) {
+	if (adev->gmc.pdb0_enabled) {
 		amdgpu_gmc_sysvm_location(adev, mc);
 	} else {
 		amdgpu_gmc_vram_location(adev, mc, base);
@@ -1841,7 +1843,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
 		return 0;
 	}
 
-	if (adev->gmc.xgmi.connected_to_cpu) {
+	if (adev->gmc.pdb0_enabled) {
 		adev->gmc.vmid0_page_table_depth = 1;
 		adev->gmc.vmid0_page_table_block_size = 12;
 	} else {
@@ -1867,7 +1869,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
 		if (r)
 			return r;
 
-		if (adev->gmc.xgmi.connected_to_cpu)
+		if (adev->gmc.pdb0_enabled)
 			r = amdgpu_gmc_pdb0_alloc(adev);
 	}
 
@@ -2372,7 +2374,7 @@ static int gmc_v9_0_gart_enable(struct amdgpu_device *adev)
 {
 	int r;
 
-	if (adev->gmc.xgmi.connected_to_cpu)
+	if (adev->gmc.pdb0_enabled)
 		amdgpu_gmc_init_pdb0(adev);
 
 	if (adev->gart.bo == NULL) {
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
index 84cde1239ee4..18e80aa78aff 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
@@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
 	top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
 	top <<= 24;
 
-	adev->gmc.fb_start = base;
-	adev->gmc.fb_end = top;
+	if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
+		adev->gmc.fb_start = base;
+		adev->gmc.fb_end = top;
+	}
 
 	return base;
 }
-- 
2.43.5
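
The 4 GB alignment of gart_start in amdgpu_gmc_sysvm_location() above can be checked with a small user-space sketch. `align_up()` is a hypothetical stand-in for the kernel's ALIGN() macro for power-of-two alignments, and `model_gart_start()` is an illustrative helper, not a driver function.

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t four_gb = 0x100000000ULL;

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Start of the GART aperture: first 4 GB boundary at or above the hive's VRAM. */
static uint64_t model_gart_start(uint64_t node_segment_size,
				 uint32_t num_physical_nodes)
{
	uint64_t hive_vram_end = node_segment_size * num_physical_nodes - 1;

	return align_up(hive_vram_end + 1, four_gb);
}
```

For example, four nodes of 0x17E000000 bytes each end at 0x5F8000000, so the GART aperture would start at 0x600000000. When the segment size is already 4 GB aligned, the result equals hive_vram_end + 1, as before the patch.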



* [PATCH v6 4/4] drm/amdgpu: fix fence fallback timer expired error
  2025-05-19  8:20 [PATCH v6 0/4] enable xgmi node migration support for hibernate on SRIOV Samuel Zhang
                   ` (2 preceding siblings ...)
  2025-05-19  8:20 ` [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV Samuel Zhang
@ 2025-05-19  8:20 ` Samuel Zhang
  3 siblings, 0 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-19  8:20 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Lijo.Lazar

IH does not work after switching to a new GPU index for the first time.

The MSI-X table in a virtual machine is faked. The real MSI-X table is
programmed by QEMU when the guest enables or disables the MSI-X
interrupt. But QEMU's access to the VF MSI-X table (register
GFXMSIX_VECT0_ADDR_LO) is blocked by nBIF protection if the VF is not
in exclusive access at that time.

Call amdgpu_restore_msix() on resume to restore the MSI-X table.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h    | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e5bb46effb6c..91066c6a5861 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5051,6 +5051,9 @@ static inline int amdgpu_virt_resume(struct amdgpu_device *adev)
 	int r;
 	unsigned int prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
 
+	if (amdgpu_sriov_vf(adev))
+		amdgpu_restore_msix(adev);
+
 	if (!amdgpu_virt_xgmi_migrate_enabled(adev))
 		return 0;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 0e890f2785b1..f080354efec8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -245,7 +245,7 @@ static bool amdgpu_msi_ok(struct amdgpu_device *adev)
 	return true;
 }
 
-static void amdgpu_restore_msix(struct amdgpu_device *adev)
+void amdgpu_restore_msix(struct amdgpu_device *adev)
 {
 	u16 ctrl;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
index aef5c216b191..f52bd7e6d988 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
@@ -149,5 +149,6 @@ void amdgpu_irq_gpu_reset_resume_helper(struct amdgpu_device *adev);
 int amdgpu_irq_add_domain(struct amdgpu_device *adev);
 void amdgpu_irq_remove_domain(struct amdgpu_device *adev);
 unsigned amdgpu_irq_create_mapping(struct amdgpu_device *adev, unsigned src_id);
+void amdgpu_restore_msix(struct amdgpu_device *adev);
 
 #endif
-- 
2.43.5
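
The patch does not show the body of amdgpu_restore_msix(). The sketch below only illustrates one way such a restore can work, under the assumption that the helper clears and then re-sets the MSI-X enable bit in the Message Control register, so that the hypervisor sees a fresh enable and reprograms the real table. Every `model_` name is hypothetical and the "device" is a plain struct. Only the position of the enable bit (bit 15) is taken from the PCI specification.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MSIX_FLAGS_ENABLE 0x8000u /* bit 15 of MSI-X Message Control */

/* Hypothetical stand-in for a device's MSI-X Message Control register. */
struct model_msix_cap {
	uint16_t ctrl;
	unsigned int enable_edges; /* counts 0 -> 1 transitions of the enable bit */
};

/* Model of a config-space write as observed by the hypervisor. */
static void model_write_ctrl(struct model_msix_cap *cap, uint16_t val)
{
	if (!(cap->ctrl & MODEL_MSIX_FLAGS_ENABLE) &&
	    (val & MODEL_MSIX_FLAGS_ENABLE))
		cap->enable_edges++;
	cap->ctrl = val;
}

/* Toggle the enable bit off and on again if MSI-X is currently in use. */
static void model_restore_msix(struct model_msix_cap *cap)
{
	uint16_t ctrl = cap->ctrl;

	if (!(ctrl & MODEL_MSIX_FLAGS_ENABLE))
		return; /* MSI-X not enabled, nothing to restore */

	model_write_ctrl(cap, (uint16_t)(ctrl & ~MODEL_MSIX_FLAGS_ENABLE));
	model_write_ctrl(cap, (uint16_t)(ctrl | MODEL_MSIX_FLAGS_ENABLE));
}
```

In this model the register ends up with its original value, but the hypervisor observes exactly one new enable transition. On a real VF that transition would have to happen while the VF holds exclusive access, which matches the reason given above for calling the helper on resume.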



* Re: [PATCH v6 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume
  2025-05-19  8:20 ` [PATCH v6 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume Samuel Zhang
@ 2025-05-19 13:41   ` Christian König
  0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2025-05-19 13:41 UTC (permalink / raw)
  To: Samuel Zhang, amd-gfx
  Cc: victor.zhao, haijun.chang, Alexander.Deucher, Owen.Zhang2,
	Qing.Ma, Lijo.Lazar, Jiang Liu

On 5/19/25 10:20, Samuel Zhang wrote:
> For SRIOV VM env with XGMI enabled systems, XGMI physical node id may
> change when hibernate and resume with different VF.
> 
> Update XGMI info and vram_base_offset on resume for gfx444 SRIOV env.
> Add amdgpu_virt_xgmi_migrate_enabled() as the feature flag.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 32 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  7 +++++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d477a901af84..e5bb46effb6c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2732,6 +2732,12 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev)
>  	if (!amdgpu_device_pcie_dynamic_switching_supported(adev))
>  		adev->pm.pp_feature &= ~PP_PCIE_DPM_MASK;
>  
> +	adev->virt.is_xgmi_node_migrate_enabled = false;
> +	if (amdgpu_sriov_vf(adev)) {
> +		adev->virt.is_xgmi_node_migrate_enabled =
> +			amdgpu_ip_version((adev), GC_HWIP, 0) == IP_VERSION(9, 4, 4);
> +	}
> +
>  	total = true;
>  	for (i = 0; i < adev->num_ip_blocks; i++) {
>  		ip_block = &adev->ip_blocks[i];
> @@ -5040,6 +5046,28 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>  	return 0;
>  }
>  
> +static inline int amdgpu_virt_resume(struct amdgpu_device *adev)
> +{
> +	int r;
> +	unsigned int prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
> +
> +	if (!amdgpu_virt_xgmi_migrate_enabled(adev))
> +		return 0;
> +
> +	r = adev->gfxhub.funcs->get_xgmi_info(adev);
> +	if (r)
> +		return r;
> +
> +	dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
> +		prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
> +
> +	adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
> +	adev->vm_manager.vram_base_offset +=
> +		adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
> +
> +	return 0;
> +}
> +
>  /**
>   * amdgpu_device_resume - initiate device resume
>   *
> @@ -5061,6 +5089,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>  			return r;
>  	}
>  
> +	r = amdgpu_virt_resume(adev);
> +	if (r)
> +		goto exit;
> +
>  	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>  		return 0;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index df03dba67ab8..532b92628171 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -295,6 +295,9 @@ struct amdgpu_virt {
>  	union amd_sriov_ras_caps ras_telemetry_en_caps;
>  	struct amdgpu_virt_ras ras;
>  	struct amd_sriov_ras_telemetry_error_count count_cache;
> +
> +	/* hibernate and resume with different VF feature for xgmi enabled system */
> +	bool is_xgmi_node_migrate_enabled;
>  };
>  
>  struct amdgpu_video_codec_info;
> @@ -376,6 +379,10 @@ static inline bool is_virtual_machine(void)
>  	((adev)->virt.gim_feature & AMDGIM_FEATURE_VCN_RB_DECOUPLE)
>  #define amdgpu_sriov_is_mes_info_enable(adev) \
>  	((adev)->virt.gim_feature & AMDGIM_FEATURE_MES_INFO_ENABLE)
> +
> +#define amdgpu_virt_xgmi_migrate_enabled(adev) \
> +	((adev)->virt.is_xgmi_node_migrate_enabled)
> +
>  bool amdgpu_virt_mmio_blocked(struct amdgpu_device *adev);
>  void amdgpu_virt_init_setting(struct amdgpu_device *adev);
>  int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);



* Re: [PATCH v6 2/4] drm/amdgpu: update GPU addresses for SMU and PSP
  2025-05-19  8:20 ` [PATCH v6 2/4] drm/amdgpu: update GPU addresses for SMU and PSP Samuel Zhang
@ 2025-05-19 13:43   ` Christian König
  0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2025-05-19 13:43 UTC (permalink / raw)
  To: Samuel Zhang, amd-gfx
  Cc: victor.zhao, haijun.chang, Alexander.Deucher, Owen.Zhang2,
	Qing.Ma, Lijo.Lazar, Jiang Liu

On 5/19/25 10:20, Samuel Zhang wrote:
> add amdgpu_bo_fb_aper_addr() and update the cached GPU addresses to use
> the FB aperture address for SMU and PSP.
> 
> 2 reasons for this change:
> 1. when pdb0 is enabled, gpu addr from amdgpu_bo_create_kernel() is GART
> aperture address, it is not compatible with SMU and PSP, it need to be
> updated to use FB aperture address.
> 2. Since FB aperture address will change after switching to new GPU
> index after hibernation, it need to be updated on resume.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

I don't know the PSP code well enough to give an rb, but the amdgpu_bo_fb_aper_addr() function looks good to me now.

Lijo, it would be good if you took a look as well.

Thanks,
Christian.

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 20 +++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 23 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  |  3 +++
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 18 +++++++++++++++++
>  5 files changed, 65 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 4e794d546b61..3dde57cd5b81 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1513,6 +1513,26 @@ u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo)
>  	return amdgpu_bo_gpu_offset_no_check(bo);
>  }
>  
> +/**
> + * amdgpu_bo_fb_aper_addr - return FB aperture GPU offset of the VRAM bo
> + * @bo:	amdgpu VRAM buffer object for which we query the offset
> + *
> + * Returns:
> + * current FB aperture GPU offset of the object.
> + */
> +u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo)
> +{
> +	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> +	uint64_t offset, fb_base;
> +
> +	WARN_ON_ONCE(bo->tbo.resource->mem_type != TTM_PL_VRAM);
> +
> +	fb_base = adev->mmhub.funcs->get_fb_location(adev);
> +	fb_base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
> +	offset = (bo->tbo.resource->start << PAGE_SHIFT) + fb_base;
> +	return amdgpu_gmc_sign_extend(offset);
> +}
> +
>  /**
>   * amdgpu_bo_gpu_offset_no_check - return GPU offset of bo
>   * @bo:	amdgpu object for which we query the offset
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> index dcce362bfad3..c8a63e38f5d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> @@ -320,6 +320,7 @@ int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
>  			     bool intr);
>  int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
>  u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
> +u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo);
>  u64 amdgpu_bo_gpu_offset_no_check(struct amdgpu_bo *bo);
>  uint32_t amdgpu_bo_mem_stats_placement(struct amdgpu_bo *bo);
>  uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index e1e658a97b36..3fc4b8e6fc59 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -871,6 +871,8 @@ static int psp_tmr_init(struct psp_context *psp)
>  					      &psp->tmr_bo, &psp->tmr_mc_addr,
>  					      pptr);
>  	}
> +	if (amdgpu_virt_xgmi_migrate_enabled(psp->adev) && psp->tmr_bo)
> +		psp->tmr_mc_addr = amdgpu_bo_fb_aper_addr(psp->tmr_bo);
>  
>  	return ret;
>  }
> @@ -1270,6 +1272,11 @@ int psp_ta_load(struct psp_context *psp, struct ta_context *context)
>  	psp_copy_fw(psp, context->bin_desc.start_addr,
>  		    context->bin_desc.size_bytes);
>  
> +	if (amdgpu_virt_xgmi_migrate_enabled(psp->adev) &&
> +		context->mem_context.shared_bo)
> +		context->mem_context.shared_mc_addr =
> +			amdgpu_bo_fb_aper_addr(context->mem_context.shared_bo);
> +
>  	psp_prep_ta_load_cmd_buf(cmd, psp->fw_pri_mc_addr, context);
>  
>  	ret = psp_cmd_submit_buf(psp, NULL, cmd,
> @@ -2336,11 +2343,27 @@ bool amdgpu_psp_tos_reload_needed(struct amdgpu_device *adev)
>  	return false;
>  }
>  
> +static void psp_update_gpu_addresses(struct amdgpu_device *adev)
> +{
> +	struct psp_context *psp = &adev->psp;
> +
> +	if (psp->cmd_buf_bo && psp->cmd_buf_mem) {
> +		psp->fw_pri_mc_addr = amdgpu_bo_fb_aper_addr(psp->fw_pri_bo);
> +		psp->fence_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->fence_buf_bo);
> +		psp->cmd_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->cmd_buf_bo);
> +	}
> +	if (adev->firmware.rbuf && psp->km_ring.ring_mem)
> +		psp->km_ring.ring_mem_mc_addr = amdgpu_bo_fb_aper_addr(adev->firmware.rbuf);
> +}
> +
>  static int psp_hw_start(struct psp_context *psp)
>  {
>  	struct amdgpu_device *adev = psp->adev;
>  	int ret;
>  
> +	if (amdgpu_virt_xgmi_migrate_enabled(adev))
> +		psp_update_gpu_addresses(adev);
> +
>  	if (!amdgpu_sriov_vf(adev)) {
>  		if ((is_psp_fw_valid(psp->kdb)) &&
>  		    (psp->funcs->bootloader_load_kdb != NULL)) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> index 3d9e9fdc10b4..bf9013f8d12e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> @@ -1152,6 +1152,9 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
>  		adev->firmware.max_ucodes = AMDGPU_UCODE_ID_MAXIMUM;
>  	}
>  
> +	if (amdgpu_virt_xgmi_migrate_enabled(adev) && adev->firmware.fw_buf)
> +		adev->firmware.fw_buf_mc = amdgpu_bo_fb_aper_addr(adev->firmware.fw_buf);
> +
>  	for (i = 0; i < adev->firmware.max_ucodes; i++) {
>  		ucode = &adev->firmware.ucode[i];
>  		if (ucode->fw) {
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 315b0856bf02..f9f49f37dfcd 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1000,6 +1000,21 @@ static int smu_fini_fb_allocations(struct smu_context *smu)
>  	return 0;
>  }
>  
> +static void smu_update_gpu_addresses(struct smu_context *smu)
> +{
> +	struct smu_table_context *smu_table = &smu->smu_table;
> +	struct smu_table *pm_status_table = smu_table->tables + SMU_TABLE_PMSTATUSLOG;
> +	struct smu_table *driver_table = &(smu_table->driver_table);
> +	struct smu_table *dummy_read_1_table = &smu_table->dummy_read_1_table;
> +
> +	if (pm_status_table->bo)
> +		pm_status_table->mc_address = amdgpu_bo_fb_aper_addr(pm_status_table->bo);
> +	if (driver_table->bo)
> +		driver_table->mc_address = amdgpu_bo_fb_aper_addr(driver_table->bo);
> +	if (dummy_read_1_table->bo)
> +		dummy_read_1_table->mc_address = amdgpu_bo_fb_aper_addr(dummy_read_1_table->bo);
> +}
> +
>  /**
>   * smu_alloc_memory_pool - allocate memory pool in the system memory
>   *
> @@ -1789,6 +1804,9 @@ static int smu_start_smc_engine(struct smu_context *smu)
>  	struct amdgpu_device *adev = smu->adev;
>  	int ret = 0;
>  
> +	if (amdgpu_virt_xgmi_migrate_enabled(adev))
> +		smu_update_gpu_addresses(smu);
> +
>  	smu->smc_fw_state = SMU_FW_INIT;
>  
>  	if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
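
For readers following the thread, the address arithmetic in the quoted amdgpu_bo_fb_aper_addr() can be tried out in a standalone sketch. Everything prefixed with sketch_ or SKETCH_ is hypothetical and only mirrors the quoted hunk: the 4 KiB page shift and the 48-bit sign extension are assumptions of this sketch, not something the patch states.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the driver state read by the quoted helper. */
#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct sketch_xgmi {
	uint64_t physical_node_id;	/* index of this GPU in the hive */
	uint64_t node_segment_size;	/* bytes of VRAM per XGMI node */
};

/* Assumption: addresses are 48 bits wide and sign-extended to 64 bits. */
static uint64_t sketch_sign_extend48(uint64_t addr)
{
	if (addr & (1ULL << 47))
		addr |= 0xFFFF000000000000ULL;
	return addr;
}

/*
 * FB aperture address = FB location + this node's segment offset
 *                       + the BO's page offset inside VRAM.
 */
static uint64_t sketch_fb_aper_addr(uint64_t fb_location,
				    const struct sketch_xgmi *xgmi,
				    uint64_t start_page)
{
	uint64_t fb_base = fb_location +
		xgmi->physical_node_id * xgmi->node_segment_size;

	return sketch_sign_extend48((start_page << SKETCH_PAGE_SHIFT) + fb_base);
}
```

Because physical_node_id feeds directly into the result, the same BO yields a different FB aperture address once the VM resumes on a different vGPU index, which is why the cached PSP and SMU addresses are recomputed on resume.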



* Re: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-19  8:20 ` [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV Samuel Zhang
@ 2025-05-19 13:57   ` Christian König
  2025-05-20  5:10     ` Zhang, GuoQing (Sam)
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2025-05-19 13:57 UTC (permalink / raw)
  To: Samuel Zhang, amd-gfx
  Cc: victor.zhao, haijun.chang, Alexander.Deucher, Owen.Zhang2,
	Qing.Ma, Lijo.Lazar, Emily Deng

On 5/19/25 10:20, Samuel Zhang wrote:
> When switching to new GPU index after hibernation and then resume,
> VRAM offset of each VRAM BO will be changed, and the cached gpu
> addresses needed to updated.
> 
> This is to enable pdb0 and switch to use pdb0-based virtual gpu
> address by default in amdgpu_bo_create_reserved(). since the virtual
> addresses do not change, this can avoid the need to update all
> cached gpu addresses all over the codebase.
> 
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 32 ++++++++++++++++++------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  1 +
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c |  2 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c    | 10 +++++---
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c  |  6 +++--
>  5 files changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index d1fa5e8e3937..265d6c777af5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -38,6 +38,8 @@
>  #include <drm/drm_drv.h>
>  #include <drm/ttm/ttm_tt.h>
>  
> +static const u64 four_gb = 0x100000000ULL;
> +
>  /**
>   * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
>   *
> @@ -249,15 +251,24 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
>  {
>  	u64 hive_vram_start = 0;
>  	u64 hive_vram_end = mc->xgmi.node_segment_size * mc->xgmi.num_physical_nodes - 1;
> -	mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
> -	mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
> -	mc->gart_start = hive_vram_end + 1;
> +
> +	if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
> +		/* set mc->vram_start to 0 to switch the returned GPU address of
> +		 * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
> +		 */
> +		amdgpu_gmc_vram_location(adev, mc, 0);

This function does a lot more than just setting mc->vram_start and mc->vram_end.

You should probably just update those two fields and not call amdgpu_gmc_vram_location() at all.

> +	} else {
> +		mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
> +		mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
> +		dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
> +				mc->mc_vram_size >> 20, mc->vram_start,
> +				mc->vram_end, mc->real_vram_size >> 20);
> +	}
> +	/* node_segment_size may not 4GB aligned on SRIOV, align up is needed. */
> +	mc->gart_start = ALIGN(hive_vram_end + 1, four_gb);
>  	mc->gart_end = mc->gart_start + mc->gart_size - 1;
>  	mc->fb_start = hive_vram_start;
>  	mc->fb_end = hive_vram_end;
> -	dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
> -			mc->mc_vram_size >> 20, mc->vram_start,
> -			mc->vram_end, mc->real_vram_size >> 20);
>  	dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
>  			mc->gart_size >> 20, mc->gart_start, mc->gart_end);
>  }
> @@ -276,7 +287,6 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
>  void amdgpu_gmc_gart_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
>  			      enum amdgpu_gart_placement gart_placement)
>  {
> -	const uint64_t four_gb = 0x100000000ULL;
>  	u64 size_af, size_bf;
>  	/*To avoid the hole, limit the max mc address to AMDGPU_GMC_HOLE_START*/
>  	u64 max_mc_address = min(adev->gmc.mc_mask, AMDGPU_GMC_HOLE_START - 1);
> @@ -1068,6 +1078,14 @@ void amdgpu_gmc_init_pdb0(struct amdgpu_device *adev)
>  	flags |= AMDGPU_PTE_FRAG((adev->gmc.vmid0_page_table_block_size + 9*1));
>  	flags |= AMDGPU_PDE_PTE_FLAG(adev);
>  
> +	if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
> +		/* always start from current device so that the GART address can keep
> +		 * consistent when hibernate-resume with different GPUs.
> +		 */
> +		vram_addr = adev->vm_manager.vram_base_offset;
> +		vram_end = vram_addr + vram_size;
> +	}
> +
>  	/* The first n PDE0 entries are used as PTE,
>  	 * pointing to vram
>  	 */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> index bd7fc123b8f9..46fac7ca7dfa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> @@ -307,6 +307,7 @@ struct amdgpu_gmc {
>  	struct amdgpu_bo		*pdb0_bo;
>  	/* CPU kmapped address of pdb0*/
>  	void				*ptr_pdb0;
> +	bool pdb0_enabled;

This isn't needed. Just always check (adev->gmc.xgmi.connected_to_cpu || amdgpu_virt_xgmi_migrate_enabled(adev)), and make a helper function for that if necessary.
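
A minimal sketch of what such a helper could look like, written as standalone C so it can be compiled outside the kernel. The struct and function names are hypothetical; the two booleans stand in for adev->gmc.xgmi.connected_to_cpu and the result of amdgpu_virt_xgmi_migrate_enabled(adev).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two conditions named in the review. */
struct sketch_gmc_state {
	bool connected_to_cpu;		/* adev->gmc.xgmi.connected_to_cpu */
	bool xgmi_migrate_enabled;	/* amdgpu_virt_xgmi_migrate_enabled(adev) */
};

/* Derive the answer on every call instead of caching it in a struct field. */
static inline bool sketch_pdb0_enabled(const struct sketch_gmc_state *s)
{
	return s->connected_to_cpu || s->xgmi_migrate_enabled;
}
```

Computing the value on demand means there is no extra field that can go stale if either condition changes after early init.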

>  
>  	/* MALL size */
>  	u64 mall_size;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> index cb25f7f0dfc1..e6165f6d0763 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> @@ -180,7 +180,7 @@ gfxhub_v1_2_xcc_init_system_aperture_regs(struct amdgpu_device *adev,
>  		/* In the case squeezing vram into GART aperture, we don't use
>  		 * FB aperture and AGP aperture. Disable them.
>  		 */
> -		if (adev->gmc.pdb0_bo) {
> +		if (adev->gmc.pdb0_bo && !amdgpu_virt_xgmi_migrate_enabled(adev)) {
>  			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_TOP, 0);
>  			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_BASE, 0x00FFFFFF);
>  			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_AGP_TOP, 0);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 59385da80185..04fb99c64b37 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1682,6 +1682,8 @@ static int gmc_v9_0_early_init(struct amdgpu_ip_block *ip_block)
>  		adev->gmc.private_aperture_start + (4ULL << 30) - 1;
>  	adev->gmc.noretry_flags = AMDGPU_VM_NORETRY_FLAGS_TF;
>  
> +	adev->gmc.pdb0_enabled = adev->gmc.xgmi.connected_to_cpu ||
> +		amdgpu_virt_xgmi_migrate_enabled(adev);
>  	return 0;
>  }
>  
> @@ -1726,7 +1728,7 @@ static void gmc_v9_0_vram_gtt_location(struct amdgpu_device *adev,
>  
>  	/* add the xgmi offset of the physical node */
>  	base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
> -	if (adev->gmc.xgmi.connected_to_cpu) {
> +	if (adev->gmc.pdb0_enabled) {
>  		amdgpu_gmc_sysvm_location(adev, mc);
>  	} else {
>  		amdgpu_gmc_vram_location(adev, mc, base);
> @@ -1841,7 +1843,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
>  		return 0;
>  	}
>  
> -	if (adev->gmc.xgmi.connected_to_cpu) {
> +	if (adev->gmc.pdb0_enabled) {
>  		adev->gmc.vmid0_page_table_depth = 1;
>  		adev->gmc.vmid0_page_table_block_size = 12;
>  	} else {
> @@ -1867,7 +1869,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
>  		if (r)
>  			return r;
>  
> -		if (adev->gmc.xgmi.connected_to_cpu)
> +		if (adev->gmc.pdb0_enabled)
>  			r = amdgpu_gmc_pdb0_alloc(adev);
>  	}
>  
> @@ -2372,7 +2374,7 @@ static int gmc_v9_0_gart_enable(struct amdgpu_device *adev)
>  {
>  	int r;
>  
> -	if (adev->gmc.xgmi.connected_to_cpu)
> +	if (adev->gmc.pdb0_enabled)
>  		amdgpu_gmc_init_pdb0(adev);
>  
>  	if (adev->gart.bo == NULL) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> index 84cde1239ee4..18e80aa78aff 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>  	top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>  	top <<= 24;
>  
> -	adev->gmc.fb_start = base;
> -	adev->gmc.fb_end = top;
> +	if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
> +		adev->gmc.fb_start = base;
> +		adev->gmc.fb_end = top;
> +	}

We should probably avoid calling this in the first place.

The function gmc_v9_0_vram_gtt_location() should probably be adjusted.

Regards,
Christian.

>  
>  	return base;
>  }
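
As an aside on the 4GB round-up of gart_start in the quoted amdgpu_gmc_sysvm_location() hunk, the arithmetic can be reproduced in a standalone sketch. The sketch_ names are hypothetical, and the power-of-two round-up below is assumed to behave like the kernel's ALIGN() macro.

```c
#include <stdint.h>

#define SKETCH_FOUR_GB 0x100000000ULL

/* Round x up to the next multiple of a; a must be a power of two. */
static uint64_t sketch_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* GART starts at the first 4GB boundary at or after the end of hive VRAM. */
static uint64_t sketch_gart_start(uint64_t node_segment_size,
				  uint64_t num_physical_nodes)
{
	uint64_t hive_vram_end = node_segment_size * num_physical_nodes - 1;

	return sketch_align_up(hive_vram_end + 1, SKETCH_FOUR_GB);
}
```

With a segment size of 0x2FEC000000 bytes (196288M) and an assumed eight nodes, the hive ends at 0x17F5FFFFFFF and sketch_gart_start() returns 0x18000000000. Without the round-up, GART would start at 0x17F60000000, which is not 4GB aligned.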



* Re: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-19 13:57   ` Christian König
@ 2025-05-20  5:10     ` Zhang, GuoQing (Sam)
  2025-05-21  7:33       ` Zhang, Owen(SRDC)
  2025-05-21  8:06       ` Christian König
  0 siblings, 2 replies; 14+ messages in thread
From: Zhang, GuoQing (Sam) @ 2025-05-20  5:10 UTC (permalink / raw)
  To: Koenig, Christian, Zhang, GuoQing (Sam),
	amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander,
	Zhang, Owen(SRDC), Ma, Qing (Mark), Lazar, Lijo, Deng, Emily



On 2025/5/19 21:57, Christian König wrote:
> On 5/19/25 10:20, Samuel Zhang wrote:
>> When switching to new GPU index after hibernation and then resume,
>> VRAM offset of each VRAM BO will be changed, and the cached gpu
>> addresses needed to updated.
>>
>> This is to enable pdb0 and switch to use pdb0-based virtual gpu
>> address by default in amdgpu_bo_create_reserved(). since the virtual
>> addresses do not change, this can avoid the need to update all
>> cached gpu addresses all over the codebase.
>>
>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 32 ++++++++++++++++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  1 +
>>   drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c    | 10 +++++---
>>   drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c  |  6 +++--
>>   5 files changed, 37 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index d1fa5e8e3937..265d6c777af5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -38,6 +38,8 @@
>>   #include <drm/drm_drv.h>
>>   #include <drm/ttm/ttm_tt.h>
>>
>> +static const u64 four_gb = 0x100000000ULL;
>> +
>>   /**
>>    * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
>>    *
>> @@ -249,15 +251,24 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
>>   {
>>       u64 hive_vram_start = 0;
>>       u64 hive_vram_end = mc->xgmi.node_segment_size * mc->xgmi.num_physical_nodes - 1;
>> -    mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
>> -    mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
>> -    mc->gart_start = hive_vram_end + 1;
>> +
>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>> +            /* set mc->vram_start to 0 to switch the returned GPU address of
>> +             * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>> +             */
>> +            amdgpu_gmc_vram_location(adev, mc, 0);
> This function does a lot more than just setting mc->vram_start and mc->vram_end.
>
> You should probably just update the two setting and not call amdgpu_gmc_vram_location() at all.

I tried setting only mc->vram_start and mc->vram_end, but the KMD then
fails to load with the following errors:

[  329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
[  329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M 0x0000018000000000 - 0x000001801FFFFFFF
[  329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
[  329.314386] [drm] RAM width 8192bits HBM
[  329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate kernel bo
[  329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v9_0> failed -22
[  329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed


It seems that the mc->visible_vram_size and mc->visible_vram_size
fields also need to be set. In that case, calling
amdgpu_gmc_vram_location() is better than inlining the logic, I think.


>
>> +    } else {
>> +            mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
>> +            mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
>> +            dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
>> +                            mc->mc_vram_size >> 20, mc->vram_start,
>> +                            mc->vram_end, mc->real_vram_size >> 20);
>> +    }
>> +    /* node_segment_size may not 4GB aligned on SRIOV, align up is needed. */
>> +    mc->gart_start = ALIGN(hive_vram_end + 1, four_gb);
>>       mc->gart_end = mc->gart_start + mc->gart_size - 1;
>>       mc->fb_start = hive_vram_start;
>>       mc->fb_end = hive_vram_end;
>> -    dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
>> -                    mc->mc_vram_size >> 20, mc->vram_start,
>> -                    mc->vram_end, mc->real_vram_size >> 20);
>>       dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
>>                       mc->gart_size >> 20, mc->gart_start, mc->gart_end);
>>   }
>> @@ -276,7 +287,6 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
>>   void amdgpu_gmc_gart_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
>>                             enum amdgpu_gart_placement gart_placement)
>>   {
>> -    const uint64_t four_gb = 0x100000000ULL;
>>       u64 size_af, size_bf;
>>       /*To avoid the hole, limit the max mc address to AMDGPU_GMC_HOLE_START*/
>>       u64 max_mc_address = min(adev->gmc.mc_mask, AMDGPU_GMC_HOLE_START - 1);
>> @@ -1068,6 +1078,14 @@ void amdgpu_gmc_init_pdb0(struct amdgpu_device *adev)
>>       flags |= AMDGPU_PTE_FRAG((adev->gmc.vmid0_page_table_block_size + 9*1));
>>       flags |= AMDGPU_PDE_PTE_FLAG(adev);
>>
>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>> +            /* always start from current device so that the GART address can keep
>> +             * consistent when hibernate-resume with different GPUs.
>> +             */
>> +            vram_addr = adev->vm_manager.vram_base_offset;
>> +            vram_end = vram_addr + vram_size;
>> +    }
>> +
>>       /* The first n PDE0 entries are used as PTE,
>>        * pointing to vram
>>        */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> index bd7fc123b8f9..46fac7ca7dfa 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> @@ -307,6 +307,7 @@ struct amdgpu_gmc {
>>       struct amdgpu_bo                *pdb0_bo;
>>       /* CPU kmapped address of pdb0*/
>>       void                            *ptr_pdb0;
>> +    bool pdb0_enabled;
> This isn't needed, just always check (adev->gmc.xgmi.connected_to_cpu || amdgpu_virt_xgmi_migrate_enabled(adev)), make a function for that if necessary.

Ok, I will update it in the next patch version.


>
>>
>>       /* MALL size */
>>       u64 mall_size;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> index cb25f7f0dfc1..e6165f6d0763 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> @@ -180,7 +180,7 @@ gfxhub_v1_2_xcc_init_system_aperture_regs(struct amdgpu_device *adev,
>>               /* In the case squeezing vram into GART aperture, we don't use
>>                * FB aperture and AGP aperture. Disable them.
>>                */
>> -            if (adev->gmc.pdb0_bo) {
>> +            if (adev->gmc.pdb0_bo && !amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>                       WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_TOP, 0);
>>                       WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_BASE, 0x00FFFFFF);
>>                       WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_AGP_TOP, 0);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index 59385da80185..04fb99c64b37 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -1682,6 +1682,8 @@ static int gmc_v9_0_early_init(struct amdgpu_ip_block *ip_block)
>>               adev->gmc.private_aperture_start + (4ULL << 30) - 1;
>>       adev->gmc.noretry_flags = AMDGPU_VM_NORETRY_FLAGS_TF;
>>
>> +    adev->gmc.pdb0_enabled = adev->gmc.xgmi.connected_to_cpu ||
>> +            amdgpu_virt_xgmi_migrate_enabled(adev);
>>       return 0;
>>   }
>>
>> @@ -1726,7 +1728,7 @@ static void gmc_v9_0_vram_gtt_location(struct amdgpu_device *adev,
>>
>>       /* add the xgmi offset of the physical node */
>>       base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
>> -    if (adev->gmc.xgmi.connected_to_cpu) {
>> +    if (adev->gmc.pdb0_enabled) {
>>               amdgpu_gmc_sysvm_location(adev, mc);
>>       } else {
>>               amdgpu_gmc_vram_location(adev, mc, base);
>> @@ -1841,7 +1843,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
>>               return 0;
>>       }
>>
>> -    if (adev->gmc.xgmi.connected_to_cpu) {
>> +    if (adev->gmc.pdb0_enabled) {
>>               adev->gmc.vmid0_page_table_depth = 1;
>>               adev->gmc.vmid0_page_table_block_size = 12;
>>       } else {
>> @@ -1867,7 +1869,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
>>               if (r)
>>                       return r;
>>
>> -            if (adev->gmc.xgmi.connected_to_cpu)
>> +            if (adev->gmc.pdb0_enabled)
>>                       r = amdgpu_gmc_pdb0_alloc(adev);
>>       }
>>
>> @@ -2372,7 +2374,7 @@ static int gmc_v9_0_gart_enable(struct amdgpu_device *adev)
>>   {
>>       int r;
>>
>> -    if (adev->gmc.xgmi.connected_to_cpu)
>> +    if (adev->gmc.pdb0_enabled)
>>               amdgpu_gmc_init_pdb0(adev);
>>
>>       if (adev->gart.bo == NULL) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> index 84cde1239ee4..18e80aa78aff 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>       top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>       top <<= 24;
>>
>> -    adev->gmc.fb_start = base;
>> -    adev->gmc.fb_end = top;
>> +    if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>> +            adev->gmc.fb_start = base;
>> +            adev->gmc.fb_end = top;
>> +    }
> We should probably avoid calling this in the first place.
>
> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.

mmhub_v1_8_get_fb_location() is also called by the new
amdgpu_bo_fb_aper_addr(), not just by gmc_v9_0_vram_gtt_location().
Judging by its name, mmhub_v1_8_get_fb_location() is supposed to be a
query API, so having such a side effect is very surprising.

Another approach is to set the right fb_start and fb_end in the new
amdgpu_virt_resume(), similar to how vram_base_offset is updated.

Which approach do you prefer? Or do you have a better suggestion? Thank you.
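
To make the side-effect concern concrete, here is a rough standalone sketch of keeping the query separate from the cache update. All sketch_ names are hypothetical and this is not the driver's code. The shift by 24 follows the `top <<= 24` line in the quoted hunk, and applying the same shift to the base field is an assumption.

```c
#include <stdint.h>

/* Hypothetical register snapshot: masked base/top fields of the FB location. */
struct sketch_fb_regs {
	uint32_t base_field;
	uint32_t top_field;
};

/* Hypothetical cached range, standing in for gmc.fb_start and gmc.fb_end. */
struct sketch_fb_range {
	uint64_t fb_start;
	uint64_t fb_end;
};

/* Pure queries: decode the register fields without touching cached state. */
static uint64_t sketch_query_fb_base(const struct sketch_fb_regs *regs)
{
	return (uint64_t)regs->base_field << 24;
}

static uint64_t sketch_query_fb_top(const struct sketch_fb_regs *regs)
{
	return (uint64_t)regs->top_field << 24;
}

/* Explicit update, called only from paths that really want to recache. */
static void sketch_cache_fb_range(struct sketch_fb_range *range,
				  const struct sketch_fb_regs *regs)
{
	range->fb_start = sketch_query_fb_base(regs);
	range->fb_end = sketch_query_fb_top(regs);
}
```

With this split, a caller such as an address helper can use the query freely, and whether fb_start/fb_end change is decided at one explicit call site instead of inside the getter.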


Regards
Sam



>
> Regards,
> Christian.
>
>>
>>       return base;
>>   }



* RE: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-20  5:10     ` Zhang, GuoQing (Sam)
@ 2025-05-21  7:33       ` Zhang, Owen(SRDC)
  2025-05-21  8:06       ` Christian König
  1 sibling, 0 replies; 14+ messages in thread
From: Zhang, Owen(SRDC) @ 2025-05-21  7:33 UTC (permalink / raw)
  To: Zhang, GuoQing (Sam), Koenig, Christian,
	amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander, Ma, Qing (Mark),
	Lazar, Lijo, Deng, Emily



Ping... @Koenig, Christian, could you kindly review and give feedback? Thank you very much!


Rgds/Owen

From: Zhang, GuoQing (Sam) <GuoQing.Zhang@amd.com>
Sent: Tuesday, May 20, 2025 1:11 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhang, GuoQing (Sam) <GuoQing.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhao, Victor <Victor.Zhao@amd.com>; Chang, HaiJun <HaiJun.Chang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Owen(SRDC) <Owen.Zhang2@amd.com>; Ma, Qing (Mark) <Qing.Ma@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Deng, Emily <Emily.Deng@amd.com>
Subject: Re: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV


On 2025/5/19 21:57, Christian König wrote:
> On 5/19/25 10:20, Samuel Zhang wrote:
>> When switching to new GPU index after hibernation and then resume,
>> VRAM offset of each VRAM BO will be changed, and the cached gpu
>> addresses needed to updated.
>>
>> This is to enable pdb0 and switch to use pdb0-based virtual gpu
>> address by default in amdgpu_bo_create_reserved(). since the virtual
>> addresses do not change, this can avoid the need to update all
>> cached gpu addresses all over the codebase.
>>
>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 32 ++++++++++++++++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  1 +
>>   drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c    | 10 +++++---
>>   drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c  |  6 +++--
>>   5 files changed, 37 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index d1fa5e8e3937..265d6c777af5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -38,6 +38,8 @@
>>   #include <drm/drm_drv.h>
>>   #include <drm/ttm/ttm_tt.h>
>>
>> +static const u64 four_gb = 0x100000000ULL;
>> +
>>   /**
>>    * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
>>    *
>> @@ -249,15 +251,24 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
>>   {
>>       u64 hive_vram_start = 0;
>>       u64 hive_vram_end = mc->xgmi.node_segment_size * mc->xgmi.num_physical_nodes - 1;
>> -    mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
>> -    mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
>> -    mc->gart_start = hive_vram_end + 1;
>> +
>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>> +            /* set mc->vram_start to 0 to switch the returned GPU address of
>> +             * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>> +             */
>> +            amdgpu_gmc_vram_location(adev, mc, 0);
> This function does a lot more than just setting mc->vram_start and mc->vram_end.
>
> You should probably just update the two setting and not call amdgpu_gmc_vram_location() at all.

I tried setting only mc->vram_start and mc->vram_end, but the KMD then
fails to load with the following errors:

[  329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
[  329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M 0x0000018000000000 - 0x000001801FFFFFFF
[  329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
[  329.314386] [drm] RAM width 8192bits HBM
[  329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate kernel bo
[  329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v9_0> failed -22
[  329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed


It seems that the mc->visible_vram_size and mc->visible_vram_size
fields also need to be set. In that case, calling
amdgpu_gmc_vram_location() is better than inlining the logic, I think.


>
>> +    } else {
>> +            mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
>> +            mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
>> +            dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
>> +                            mc->mc_vram_size >> 20, mc->vram_start,
>> +                            mc->vram_end, mc->real_vram_size >> 20);
>> +    }
>> +    /* node_segment_size may not 4GB aligned on SRIOV, align up is needed. */
>> +    mc->gart_start = ALIGN(hive_vram_end + 1, four_gb);
>>       mc->gart_end = mc->gart_start + mc->gart_size - 1;
>>       mc->fb_start = hive_vram_start;
>>       mc->fb_end = hive_vram_end;
>> -    dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
>> -                    mc->mc_vram_size >> 20, mc->vram_start,
>> -                    mc->vram_end, mc->real_vram_size >> 20);
>>       dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
>>                       mc->gart_size >> 20, mc->gart_start, mc->gart_end);
>>   }
>> @@ -276,7 +287,6 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
>>   void amdgpu_gmc_gart_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
>>                             enum amdgpu_gart_placement gart_placement)
>>   {
>> -    const uint64_t four_gb = 0x100000000ULL;
>>       u64 size_af, size_bf;
>>       /*To avoid the hole, limit the max mc address to AMDGPU_GMC_HOLE_START*/
>>       u64 max_mc_address = min(adev->gmc.mc_mask, AMDGPU_GMC_HOLE_START - 1);
>> @@ -1068,6 +1078,14 @@ void amdgpu_gmc_init_pdb0(struct amdgpu_device *adev)
>>       flags |= AMDGPU_PTE_FRAG((adev->gmc.vmid0_page_table_block_size + 9*1));
>>       flags |= AMDGPU_PDE_PTE_FLAG(adev);
>>
>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>> +            /* always start from current device so that the GART address can keep
>> +             * consistent when hibernate-resume with different GPUs.
>> +             */
>> +            vram_addr = adev->vm_manager.vram_base_offset;
>> +            vram_end = vram_addr + vram_size;
>> +    }
>> +
>>       /* The first n PDE0 entries are used as PTE,
>>        * pointing to vram
>>        */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> index bd7fc123b8f9..46fac7ca7dfa 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> @@ -307,6 +307,7 @@ struct amdgpu_gmc {
>>       struct amdgpu_bo                *pdb0_bo;
>>       /* CPU kmapped address of pdb0*/
>>       void                            *ptr_pdb0;
>> +    bool pdb0_enabled;
> This isn't needed, just always check (adev->gmc.xgmi.connected_to_cpu || amdgpu_virt_xgmi_migrate_enabled(adev)), make a function for that if necessary.

Ok, I will update it in the next patch version.
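
For illustration only, the helper suggested above could look roughly like the sketch below. It is a user-space approximation under stated assumptions: `struct sketch_device` and `sketch_is_pdb0_enabled()` are hypothetical names, and the two bool fields stand in for adev->gmc.xgmi.connected_to_cpu and the result of amdgpu_virt_xgmi_migrate_enabled(adev). It is not the actual driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct amdgpu_device. */
struct sketch_device {
	bool xgmi_connected_to_cpu; /* mirrors adev->gmc.xgmi.connected_to_cpu */
	bool virt_xgmi_migrate;     /* stand-in for amdgpu_virt_xgmi_migrate_enabled(adev) */
};

/* Derive the pdb0 state on demand instead of caching it in a struct field. */
static bool sketch_is_pdb0_enabled(const struct sketch_device *adev)
{
	assert(adev != NULL);
	return adev->xgmi_connected_to_cpu || adev->virt_xgmi_migrate;
}
```

Computing the condition on each call means there is no cached flag that could go stale if either input changes.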


>
>>
>>       /* MALL size */
>>       u64 mall_size;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> index cb25f7f0dfc1..e6165f6d0763 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> @@ -180,7 +180,7 @@ gfxhub_v1_2_xcc_init_system_aperture_regs(struct amdgpu_device *adev,
>>               /* In the case squeezing vram into GART aperture, we don't use
>>                * FB aperture and AGP aperture. Disable them.
>>                */
>> -            if (adev->gmc.pdb0_bo) {
>> +            if (adev->gmc.pdb0_bo && !amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>                       WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_TOP, 0);
>>                       WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_BASE, 0x00FFFFFF);
>>                       WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_AGP_TOP, 0);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index 59385da80185..04fb99c64b37 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -1682,6 +1682,8 @@ static int gmc_v9_0_early_init(struct amdgpu_ip_block *ip_block)
>>               adev->gmc.private_aperture_start + (4ULL << 30) - 1;
>>       adev->gmc.noretry_flags = AMDGPU_VM_NORETRY_FLAGS_TF;
>>
>> +    adev->gmc.pdb0_enabled = adev->gmc.xgmi.connected_to_cpu ||
>> +            amdgpu_virt_xgmi_migrate_enabled(adev);
>>       return 0;
>>   }
>>
>> @@ -1726,7 +1728,7 @@ static void gmc_v9_0_vram_gtt_location(struct amdgpu_device *adev,
>>
>>       /* add the xgmi offset of the physical node */
>>       base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
>> -    if (adev->gmc.xgmi.connected_to_cpu) {
>> +    if (adev->gmc.pdb0_enabled) {
>>               amdgpu_gmc_sysvm_location(adev, mc);
>>       } else {
>>               amdgpu_gmc_vram_location(adev, mc, base);
>> @@ -1841,7 +1843,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
>>               return 0;
>>       }
>>
>> -    if (adev->gmc.xgmi.connected_to_cpu) {
>> +    if (adev->gmc.pdb0_enabled) {
>>               adev->gmc.vmid0_page_table_depth = 1;
>>               adev->gmc.vmid0_page_table_block_size = 12;
>>       } else {
>> @@ -1867,7 +1869,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
>>               if (r)
>>                       return r;
>>
>> -            if (adev->gmc.xgmi.connected_to_cpu)
>> +            if (adev->gmc.pdb0_enabled)
>>                       r = amdgpu_gmc_pdb0_alloc(adev);
>>       }
>>
>> @@ -2372,7 +2374,7 @@ static int gmc_v9_0_gart_enable(struct amdgpu_device *adev)
>>   {
>>       int r;
>>
>> -    if (adev->gmc.xgmi.connected_to_cpu)
>> +    if (adev->gmc.pdb0_enabled)
>>               amdgpu_gmc_init_pdb0(adev);
>>
>>       if (adev->gart.bo == NULL) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> index 84cde1239ee4..18e80aa78aff 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>       top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>       top <<= 24;
>>
>> -    adev->gmc.fb_start = base;
>> -    adev->gmc.fb_end = top;
>> +    if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>> +            adev->gmc.fb_start = base;
>> +            adev->gmc.fb_end = top;
>> +    }
> We should probably avoid calling this in the first place.
>
> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.

mmhub_v1_8_get_fb_location() is also called by the new
amdgpu_bo_fb_aper_addr(), not just by gmc_v9_0_vram_gtt_location().
Judging by its name, mmhub_v1_8_get_fb_location() is supposed to be a
query API, so having it overwrite fb_start and fb_end as a side effect is
very surprising.
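
To make the concern concrete, here is a minimal user-space sketch of keeping the query free of side effects and updating the cached range in a separate, explicit step. All names (`sketch_fb_regs`, `sketch_fb_query()`, `sketch_fb_cache_update()`) are hypothetical, and the 24-bit mask is an assumption; only the shift by 24 is taken from the quoted hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FB_FIELD_MASK  0x00FFFFFFu /* assumed field width */
#define SKETCH_FB_FIELD_SHIFT 24          /* same shift as in the quoted hunk */

/* Hypothetical stand-in for the raw FB_LOCATION_BASE/TOP register values. */
struct sketch_fb_regs {
	uint32_t base;
	uint32_t top;
};

struct sketch_fb_range {
	uint64_t start;
	uint64_t end;
};

/* Pure query: decodes the register values and touches no cached state. */
static struct sketch_fb_range sketch_fb_query(const struct sketch_fb_regs *regs)
{
	struct sketch_fb_range r;

	assert(regs != NULL);
	r.start = (uint64_t)(regs->base & SKETCH_FB_FIELD_MASK) << SKETCH_FB_FIELD_SHIFT;
	r.end = (uint64_t)(regs->top & SKETCH_FB_FIELD_MASK) << SKETCH_FB_FIELD_SHIFT;
	return r;
}

/* Explicit update: the caller decides when the cached range is refreshed. */
static void sketch_fb_cache_update(struct sketch_fb_range *cache,
				   const struct sketch_fb_regs *regs)
{
	assert(cache != NULL);
	*cache = sketch_fb_query(regs);
}
```

With this split, a caller that only needs the base address can call the query without disturbing the cached values.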

Another approach is to set the right fb_start and fb_end in the new
amdgpu_virt_resume(), in the same way vram_base_offset is updated.

Which approach do you prefer? Or do you have a better suggestion? Thank you.


Regards
Sam



>
> Regards,
> Christian.
>
>>
>>       return base;
>>   }


* Re: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-20  5:10     ` Zhang, GuoQing (Sam)
  2025-05-21  7:33       ` Zhang, Owen(SRDC)
@ 2025-05-21  8:06       ` Christian König
  2025-05-21 11:55         ` Zhang, GuoQing (Sam)
  1 sibling, 1 reply; 14+ messages in thread
From: Christian König @ 2025-05-21  8:06 UTC (permalink / raw)
  To: Zhang, GuoQing (Sam), Koenig, Christian,
	amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander,
	Zhang, Owen(SRDC), Ma, Qing (Mark), Lazar, Lijo, Deng, Emily

On 5/20/25 07:10, Zhang, GuoQing (Sam) wrote:
>>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>> +            /* set mc->vram_start to 0 to switch the returned GPU address of
>>> +             * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>>> +             */
>>> +            amdgpu_gmc_vram_location(adev, mc, 0);
>> This function does a lot more than just setting mc->vram_start and mc->vram_end.
>>
>> You should probably just update the two setting and not call amdgpu_gmc_vram_location() at all.
> 
> I tried only setting mc->vram_start and mc->vram_end. But KMD load will
> fail with following error logs.
> 
> [  329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M
> 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
> [  329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M
> 0x0000018000000000 - 0x000001801FFFFFFF
> [  329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
> [  329.314386] [drm] RAM width 8192bits HBM
> [  329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate
> kernel bo
> [  329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
> block <gmc_v9_0> failed -22
> [  329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed
> 
> 
> It seems like setting mc->visible_vram_size and mc->visible_vram_size
> fields are also needed. In this case call amdgpu_gmc_vram_location() is
> better than inline the logic, I think.

Yeah, exactly that is not a good idea.

The mc->visible_vram_size and mc->real_vram_size should have been initialized by gmc_v9_0_mc_init(). Why didn't that happen?

>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>> index 84cde1239ee4..18e80aa78aff 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>>       top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>>       top <<= 24;
>>>  
>>> -    adev->gmc.fb_start = base;
>>> -    adev->gmc.fb_end = top;
>>> +    if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>> +            adev->gmc.fb_start = base;
>>> +            adev->gmc.fb_end = top;
>>> +    }
>> We should probably avoid calling this in the first place.
>>
>> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.
> 
> mmhub_v1_8_get_fb_location() is called by the new
> amdgpu_bo_fb_aper_addr() as well, not just gmc_v9_0_vram_gtt_location().

Oh, that is probably a bad idea. The function amdgpu_bo_fb_aper_addr() should only rely on cached data.

> mmhub_v1_8_get_fb_location() is supposed to be a query api according to
> its name. having such side effect is very surprising.
> 
> Another approach is set the right fb_start and fb_end in the new
> amdgpu_virt_resume(), like updating vram_base_offset.

That is probably better. And skip setting fb_start and fb_end in amdgpu_gmc_sysvm_location() for this use case.

That was done only because we re-program those registers on bare metal.

Regards,
Christian.

> 
> Which approach do you prefer? Or any better suggestions? Thank you.
> 
> 
> Regards
> Sam
> 
> 
> 
>>
>> Regards,
>> Christian.
>>
>>>  
>>>       return base;
>>>   }
> 



* Re: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-21  8:06       ` Christian König
@ 2025-05-21 11:55         ` Zhang, GuoQing (Sam)
  2025-05-21 12:00           ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Zhang, GuoQing (Sam) @ 2025-05-21 11:55 UTC (permalink / raw)
  To: Koenig, Christian, Zhang, GuoQing (Sam),
	amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander,
	Zhang, Owen(SRDC), Ma, Qing (Mark), Lazar, Lijo, Deng, Emily



On 2025/5/21 16:06, Christian König wrote:
> On 5/20/25 07:10, Zhang, GuoQing (Sam) wrote:
>>>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>> +            /* set mc->vram_start to 0 to switch the returned GPU address of
>>>> +             * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>>>> +             */
>>>> +            amdgpu_gmc_vram_location(adev, mc, 0);
>>> This function does a lot more than just setting mc->vram_start and mc->vram_end.
>>>
>>> You should probably just update the two setting and not call amdgpu_gmc_vram_location() at all.
>> I tried only setting mc->vram_start and mc->vram_end. But KMD load will
>> fail with following error logs.
>>
>> [  329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M
>> 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
>> [  329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M
>> 0x0000018000000000 - 0x000001801FFFFFFF
>> [  329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
>> [  329.314386] [drm] RAM width 8192bits HBM
>> [  329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate
>> kernel bo
>> [  329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
>> block <gmc_v9_0> failed -22
>> [  329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed
>>
>>
>> It seems like setting mc->visible_vram_size and mc->visible_vram_size
>> fields are also needed. In this case call amdgpu_gmc_vram_location() is
>> better than inline the logic, I think.
> Yeah, exactly that is not a good idea.
>
> The mc->visible_vram_size and mc->real_vram_size should have been initialized by gmc_v9_0_mc_init(). Why didn't that happen?


[Sam] visible_vram_size is set to 0x4000000000 (256G) from
`pci_resource_len(adev->pdev, 0)` in `gmc_v9_0_mc_init()`.
It is then clamped to real_vram_size, 0x2fec000000 (about 192G), in
amdgpu_gmc_vram_location().

Should I update the 3 variables inline and not call
amdgpu_gmc_vram_location()?

         mc->vram_start = 0;
         mc->vram_end = mc->vram_start + mc->mc_vram_size - 1;
         if (mc->real_vram_size < mc->visible_vram_size)
             mc->visible_vram_size = mc->real_vram_size;
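
As a quick sanity check of the three assignments above, here is a small stand-alone sketch using the sizes from this thread. `struct sketch_gmc` and `sketch_place_vram_at_zero()` are hypothetical stand-ins for `struct amdgpu_gmc` and the inlined logic; this is an illustration, not the driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of struct amdgpu_gmc, reduced to the fields used here. */
struct sketch_gmc {
	uint64_t vram_start;
	uint64_t vram_end;
	uint64_t mc_vram_size;
	uint64_t real_vram_size;
	uint64_t visible_vram_size;
};

/* Place VRAM at address 0 and clamp the CPU-visible size to the real size. */
static void sketch_place_vram_at_zero(struct sketch_gmc *mc)
{
	assert(mc != NULL && mc->mc_vram_size > 0);

	mc->vram_start = 0;
	mc->vram_end = mc->vram_start + mc->mc_vram_size - 1;
	if (mc->real_vram_size < mc->visible_vram_size)
		mc->visible_vram_size = mc->real_vram_size;
}
```

With mc_vram_size and real_vram_size both 0x2fec000000 and visible_vram_size 0x4000000000, vram_end becomes 0x2febffffff, matching the VRAM range in the log above, and visible_vram_size drops to 0x2fec000000.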


>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>> index 84cde1239ee4..18e80aa78aff 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>>>         top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>>>         top <<= 24;
>>>>
>>>> -    adev->gmc.fb_start = base;
>>>> -    adev->gmc.fb_end = top;
>>>> +    if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>> +            adev->gmc.fb_start = base;
>>>> +            adev->gmc.fb_end = top;
>>>> +    }
>>> We should probably avoid calling this in the first place.
>>>
>>> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.
>> mmhub_v1_8_get_fb_location() is called by the new
>> amdgpu_bo_fb_aper_addr() as well, not just gmc_v9_0_vram_gtt_location().
> Oh, that is probably a bad idea. The function amdgpu_bo_fb_aper_addr() should only rely on cached data.


[Sam] Can I add a new `fb_base` field to `struct amdgpu_gmc` to cache the
value of `get_fb_location()`?
With this approach, we no longer need to set fb_start and fb_end on resume,
since the two fields are only reset by the
mmhub_v1_8_get_fb_location() calls made from amdgpu_bo_fb_aper_addr().
Please see the code change below.

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -259,6 +259,7 @@ struct amdgpu_gmc {
          */
         u64                     fb_start;
         u64                     fb_end;
+       u64                     fb_base;
         unsigned                vram_width;
         u64                     real_vram_size;
         int                     vram_mtrr;

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1527,7 +1527,7 @@ u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo)

         WARN_ON_ONCE(bo->tbo.resource->mem_type != TTM_PL_VRAM);

-       fb_base = adev->mmhub.funcs->get_fb_location(adev);
+       fb_base = adev->gmc.fb_base;
         fb_base += adev->gmc.xgmi.physical_node_id *
adev->gmc.xgmi.node_segment_size;
         offset = (bo->tbo.resource->start << PAGE_SHIFT) + fb_base;
         return amdgpu_gmc_sign_extend(offset);

--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1728,6 +1728,7 @@ static void gmc_v9_0_vram_gtt_location(struct
amdgpu_device *adev,
                                         struct amdgpu_gmc *mc)
  {
         u64 base = adev->mmhub.funcs->get_fb_location(adev);
+       mc->fb_base = base;

         /* add the xgmi offset of the physical node */
         base += adev->gmc.xgmi.physical_node_id *
adev->gmc.xgmi.node_segment_size;

--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
@@ -45,10 +45,8 @@ static u64 mmhub_v1_8_get_fb_location(struct
amdgpu_device *adev)
         top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
         top <<= 24;

-       if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
-               adev->gmc.fb_start = base;
-               adev->gmc.fb_end = top;
-       }
+       adev->gmc.fb_start = base;
+       adev->gmc.fb_end = top;


Regards
Sam


>
>> mmhub_v1_8_get_fb_location() is supposed to be a query api according to
>> its name. having such side effect is very surprising.
>>
>> Another approach is set the right fb_start and fb_end in the new
>> amdgpu_virt_resume(), like updating vram_base_offset.
> That is probably better. And skip setting fb_start and fb_end in amdgpu_gmc_sysvm_location() for this use case.
>
> That was done only because we re-program those registers on bare metal.
>
> Regards,
> Christian.
>
>> Which approach do you prefer? Or any better suggestions? Thank you.
>>
>>
>> Regards
>> Sam
>>
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>         return base;
>>>>     }


* Re: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-21 11:55         ` Zhang, GuoQing (Sam)
@ 2025-05-21 12:00           ` Christian König
  2025-05-22  3:49             ` Zhang, GuoQing (Sam)
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2025-05-21 12:00 UTC (permalink / raw)
  To: Zhang, GuoQing (Sam), amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander,
	Zhang, Owen(SRDC), Ma, Qing (Mark), Lazar, Lijo, Deng, Emily

On 5/21/25 13:55, Zhang, GuoQing (Sam) wrote:
> 
> On 2025/5/21 16:06, Christian König wrote:
>> On 5/20/25 07:10, Zhang, GuoQing (Sam) wrote:
>>>>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>> +            /* set mc->vram_start to 0 to switch the returned GPU address of
>>>>> +             * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>>>>> +             */
>>>>> +            amdgpu_gmc_vram_location(adev, mc, 0);
>>>> This function does a lot more than just setting mc->vram_start and mc->vram_end.
>>>>
>>>> You should probably just update the two setting and not call amdgpu_gmc_vram_location() at all.
>>> I tried only setting mc->vram_start and mc->vram_end. But KMD load will
>>> fail with following error logs.
>>>
>>> [  329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M
>>> 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
>>> [  329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M
>>> 0x0000018000000000 - 0x000001801FFFFFFF
>>> [  329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
>>> [  329.314386] [drm] RAM width 8192bits HBM
>>> [  329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate
>>> kernel bo
>>> [  329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
>>> block <gmc_v9_0> failed -22
>>> [  329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed
>>>
>>>
>>> It seems like setting mc->visible_vram_size and mc->visible_vram_size
>>> fields are also needed. In this case call amdgpu_gmc_vram_location() is
>>> better than inline the logic, I think.
>> Yeah, exactly that is not a good idea.
>>
>> The mc->visible_vram_size and mc->real_vram_size should have been initialized by gmc_v9_0_mc_init(). Why didn't that happen?
> 
> 
> [Sam] visible_vram_size is set to 0x4000000000 (256G) from
> `pci_resource_len(adev->pdev, 0)` in `gmc_v9_0_mc_init()`.
> It is set to real_vram_size 0x2fec000000(192G) in
> amdgpu_gmc_vram_location().
> 
> Should I update the 3 variables inline and not call
> amdgpu_gmc_vram_location()?
> 
>          mc->vram_start = 0;
>          mc->vram_end = mc->vram_start + mc->mc_vram_size - 1;
>          if (mc->real_vram_size < mc->visible_vram_size)
>              mc->visible_vram_size = mc->real_vram_size;

Yeah that seems to make sense.

> 
> 
>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>> index 84cde1239ee4..18e80aa78aff 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>>>>         top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>>>>         top <<= 24;
>>>>>   
>>>>> -    adev->gmc.fb_start = base;
>>>>> -    adev->gmc.fb_end = top;
>>>>> +    if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>> +            adev->gmc.fb_start = base;
>>>>> +            adev->gmc.fb_end = top;
>>>>> +    }
>>>> We should probably avoid calling this in the first place.
>>>>
>>>> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.
>>> mmhub_v1_8_get_fb_location() is called by the new
>>> amdgpu_bo_fb_aper_addr() as well, not just gmc_v9_0_vram_gtt_location().
>> Oh, that is probably a bad idea. The function amdgpu_bo_fb_aper_addr() should only rely on cached data.
> 
> 
> [Sam] Can I add new `fb_base` field in `struct amdgpu_gmc` to cache the
> value of `get_fb_location()`?

No, please try to avoid that.

> Using this approach, we don't need to set fb_start and fb_end on resume
> any more, since the reset of the 2 field is caused by
> mmhub_v1_8_get_fb_location() calls from amdgpu_bo_fb_aper_addr().
> Please see the code change below.

What is wrong with setting fb_start and fb_end on resume?

Regards,
Christian.

> 
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> @@ -259,6 +259,7 @@ struct amdgpu_gmc {
>           */
>          u64                     fb_start;
>          u64                     fb_end;
> +       u64                     fb_base;
>          unsigned                vram_width;
>          u64                     real_vram_size;
>          int                     vram_mtrr;
> 
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1527,7 +1527,7 @@ u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo)
> 
>          WARN_ON_ONCE(bo->tbo.resource->mem_type != TTM_PL_VRAM);
> 
> -       fb_base = adev->mmhub.funcs->get_fb_location(adev);
> +       fb_base = adev->gmc.fb_base;
>          fb_base += adev->gmc.xgmi.physical_node_id *
> adev->gmc.xgmi.node_segment_size;
>          offset = (bo->tbo.resource->start << PAGE_SHIFT) + fb_base;
>          return amdgpu_gmc_sign_extend(offset);
> 
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1728,6 +1728,7 @@ static void gmc_v9_0_vram_gtt_location(struct
> amdgpu_device *adev,
>                                          struct amdgpu_gmc *mc)
>   {
>          u64 base = adev->mmhub.funcs->get_fb_location(adev);
> +       mc->fb_base = base;
> 
>          /* add the xgmi offset of the physical node */
>          base += adev->gmc.xgmi.physical_node_id *
> adev->gmc.xgmi.node_segment_size;
> 
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> @@ -45,10 +45,8 @@ static u64 mmhub_v1_8_get_fb_location(struct
> amdgpu_device *adev)
>          top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>          top <<= 24;
> 
> -       if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
> -               adev->gmc.fb_start = base;
> -               adev->gmc.fb_end = top;
> -       }
> +       adev->gmc.fb_start = base;
> +       adev->gmc.fb_end = top;
> 
> 
> Regards
> Sam
> 
> 
>>
>>> mmhub_v1_8_get_fb_location() is supposed to be a query api according to
>>> its name. having such side effect is very surprising.
>>>
>>> Another approach is set the right fb_start and fb_end in the new
>>> amdgpu_virt_resume(), like updating vram_base_offset.
>> That is probably better. And skip setting fb_start and fb_end in amdgpu_gmc_sysvm_location() for this use case.
>>
>> That was done only because we re-program those registers on bare metal.
>>
>> Regards,
>> Christian.
>>
>>> Which approach do you prefer? Or any better suggestions? Thank you.
>>>
>>>
>>> Regards
>>> Sam
>>>
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>   
>>>>>         return base;
>>>>>     }
> 



* Re: [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-21 12:00           ` Christian König
@ 2025-05-22  3:49             ` Zhang, GuoQing (Sam)
  0 siblings, 0 replies; 14+ messages in thread
From: Zhang, GuoQing (Sam) @ 2025-05-22  3:49 UTC (permalink / raw)
  To: Koenig, Christian, Zhang, GuoQing (Sam),
	amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander,
	Zhang, Owen(SRDC), Ma, Qing (Mark), Lazar, Lijo, Deng, Emily



On 2025/5/21 20:00, Christian König wrote:
> On 5/21/25 13:55, Zhang, GuoQing (Sam) wrote:
>> On 2025/5/21 16:06, Christian König wrote:
>>> On 5/20/25 07:10, Zhang, GuoQing (Sam) wrote:
>>>>>> +    if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>>> +            /* set mc->vram_start to 0 to switch the returned GPU address of
>>>>>> +             * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>>>>>> +             */
>>>>>> +            amdgpu_gmc_vram_location(adev, mc, 0);
>>>>> This function does a lot more than just setting mc->vram_start and mc->vram_end.
>>>>>
>>>>> You should probably just update the two setting and not call amdgpu_gmc_vram_location() at all.
>>>> I tried only setting mc->vram_start and mc->vram_end. But KMD load will
>>>> fail with following error logs.
>>>>
>>>> [  329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M
>>>> 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
>>>> [  329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M
>>>> 0x0000018000000000 - 0x000001801FFFFFFF
>>>> [  329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
>>>> [  329.314386] [drm] RAM width 8192bits HBM
>>>> [  329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate
>>>> kernel bo
>>>> [  329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP
>>>> block <gmc_v9_0> failed -22
>>>> [  329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed
>>>>
>>>>
>>>> It seems like setting mc->visible_vram_size and mc->visible_vram_size
>>>> fields are also needed. In this case call amdgpu_gmc_vram_location() is
>>>> better than inline the logic, I think.
>>> Yeah, exactly that is not a good idea.
>>>
>>> The mc->visible_vram_size and mc->real_vram_size should have been initialized by gmc_v9_0_mc_init(). Why didn't that happen?
>>
>> [Sam] visible_vram_size is set to 0x4000000000 (256G) from
>> `pci_resource_len(adev->pdev, 0)` in `gmc_v9_0_mc_init()`.
>> It is set to real_vram_size 0x2fec000000(192G) in
>> amdgpu_gmc_vram_location().
>>
>> Should I update the 3 variables inline and not call
>> amdgpu_gmc_vram_location()?
>>
>>           mc->vram_start = 0;
>>           mc->vram_end = mc->vram_start + mc->mc_vram_size - 1;
>>           if (mc->real_vram_size < mc->visible_vram_size)
>>               mc->visible_vram_size = mc->real_vram_size;
> Yeah that seems to make sense.
>
>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> index 84cde1239ee4..18e80aa78aff 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>>>>>           top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>>>>>           top <<= 24;
>>>>>>
>>>>>> -    adev->gmc.fb_start = base;
>>>>>> -    adev->gmc.fb_end = top;
>>>>>> +    if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>>> +            adev->gmc.fb_start = base;
>>>>>> +            adev->gmc.fb_end = top;
>>>>>> +    }
>>>>> We should probably avoid calling this in the first place.
>>>>>
>>>>> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.
>>>> mmhub_v1_8_get_fb_location() is called by the new
>>>> amdgpu_bo_fb_aper_addr() as well, not just gmc_v9_0_vram_gtt_location().
>>> Oh, that is probably a bad idea. The function amdgpu_bo_fb_aper_addr() should only rely on cached data.
>>
>> [Sam] Can I add new `fb_base` field in `struct amdgpu_gmc` to cache the
>> value of `get_fb_location()`?
> No, please try to avoid that.

OK. So the requirement that "amdgpu_bo_fb_aper_addr() should only rely on
cached data" no longer applies, and I don't need to change the current
amdgpu_bo_fb_aper_addr() implementation, right?


>
>> Using this approach, we don't need to set fb_start and fb_end on resume
>> any more, since the reset of the 2 field is caused by
>> mmhub_v1_8_get_fb_location() calls from amdgpu_bo_fb_aper_addr().
>> Please see the code change below.
> What is wrong with setting fb_start and fb_end on resume?

It works. I have updated the patch in this way.

>>>> mmhub_v1_8_get_fb_location() is supposed to be a query api according to
>>>> its name. having such side effect is very surprising.
>>>>
>>>> Another approach is set the right fb_start and fb_end in the new
>>>> amdgpu_virt_resume(), like updating vram_base_offset.
>>> That is probably better. And skip setting fb_start and fb_end in amdgpu_gmc_sysvm_location() for this use case.

Setting fb_start and fb_end in amdgpu_gmc_sysvm_location() is still needed
for a normal KMD load, since amdgpu_virt_resume() is not called in that
case.

I have sent out the v7 patch series. Please take another look. Thank you!

mail titles:
[PATCH v7 0/4] enable xgmi node migration support for hibernate on SRIOV
[PATCH v7 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume
[PATCH v7 2/4] drm/amdgpu: update GPU addresses for SMU and PSP
[PATCH v7 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
[PATCH v7 4/4] drm/amdgpu: fix fence fallback timer expired error

Changes:
- remove pdb0_enabled and add gmc_v9_0_is_pdb0_enabled()
- remove amdgpu_gmc_vram_location() call in amdgpu_gmc_sysvm_location()
- remove check in mmhub_v1_8_get_fb_location() and update
  fb_start/fb_end on resume

Regards
Sam


>>>
>>> That was done only because we re-program those registers on bare metal.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Which approach do you prefer? Or any better suggestions? Thank you.
>>>>
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>           return base;
>>>>>>       }


end of thread, other threads:[~2025-05-22  4:46 UTC | newest]

Thread overview: 14+ messages
2025-05-19  8:20 [PATCH v6 0/4] enable xgmi node migration support for hibernate on SRIOV Samuel Zhang
2025-05-19  8:20 ` [PATCH v6 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume Samuel Zhang
2025-05-19 13:41   ` Christian König
2025-05-19  8:20 ` [PATCH v6 2/4] drm/amdgpu: update GPU addresses for SMU and PSP Samuel Zhang
2025-05-19 13:43   ` Christian König
2025-05-19  8:20 ` [PATCH v6 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV Samuel Zhang
2025-05-19 13:57   ` Christian König
2025-05-20  5:10     ` Zhang, GuoQing (Sam)
2025-05-21  7:33       ` Zhang, Owen(SRDC)
2025-05-21  8:06       ` Christian König
2025-05-21 11:55         ` Zhang, GuoQing (Sam)
2025-05-21 12:00           ` Christian König
2025-05-22  3:49             ` Zhang, GuoQing (Sam)
2025-05-19  8:20 ` [PATCH v6 4/4] drm/amdgpu: fix fence fallback timer expired error Samuel Zhang
