AMD-GFX Archive on lore.kernel.org
* [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV.
@ 2025-05-08  5:09 Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume Samuel Zhang
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma

In an SRIOV VM environment, customers may need to switch to new vGPU indexes
after hibernation and then resume the VM. For GPUs with XGMI, `vram_start`
changes in this case, and the FB aperture GPU address of each VRAM BO changes
with it. These GPU addresses need to be updated on resume, but they are cached
all over the KMD codebase, so updating each of them individually is error-prone
and not acceptable.

The solution is to use the pdb0 page table to cover both VRAM and GART memory
and to use pdb0 virtual GPU addresses instead. When the GPU index changes, the
virtual GPU addresses do not change.

For PSP and SMU, pdb0's GPU address does not work, so the original FB aperture
GPU address is used instead. These addresses need to be updated on resume when
the vGPUs have changed.
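
The reason pdb0-based addressing survives an index change can be sketched in a few lines (a minimal illustration with made-up constants and hypothetical helper names, not driver code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical segment size for illustration; the real value comes
 * from XGMI init. */
#define NODE_SEGMENT_SIZE (32ULL << 30)	/* 32 GB per XGMI node */

/* FB aperture address of a VRAM BO: offset by the physical node id,
 * so it changes when the VM resumes on a different vGPU index. */
static uint64_t fb_aper_addr(uint64_t bo_offset, unsigned int node_id)
{
	return (uint64_t)node_id * NODE_SEGMENT_SIZE + bo_offset;
}

/* pdb0-based virtual address: the VMID0 page table remaps local VRAM
 * to a fixed virtual range, so the address does not depend on which
 * physical node the vGPU currently occupies. */
static uint64_t pdb0_virt_addr(uint64_t bo_offset)
{
	return bo_offset;	/* vram_start == 0 in the pdb0 layout */
}
```

In this model the same BO at offset 0x1000 gets a different FB aperture address on node 0 versus node 2, while the pdb0 virtual address is identical in both cases.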

v4: 
- remove gmc_v9_0_mc_init() call and `refresh` update. 
- do not set `fb_start` in mmhub_v1_8_get_fb_location() when pdb0 enabled.

Samuel Zhang (7):
  drm/amdgpu: update XGMI info on resume
  drm/amdgpu: update GPU addresses for SMU and PSP
  drm/amdgpu: enable pdb0 for hibernation on SRIOV
  drm/amdgpu: remove cached gpu addr: amdgpu_firmware.fw_buf_mc
  drm/amdgpu: remove cached gpu addr: ta_mem_context.shared_mc_addr
  drm/amdgpu: remove cached gpu addr: psp_context.tmr_mc_addr
  drm/amdgpu: remove cached gpu addr: psp_context.cmd_buf_mc_addr

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    | 32 ++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 20 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 49 +++++++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  3 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  | 12 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h  |  1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      | 19 +++++++--
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c    |  6 ++-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 17 ++++++++
 14 files changed, 153 insertions(+), 38 deletions(-)

-- 
2.43.5



* [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume
  2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
@ 2025-05-08  5:09 ` Samuel Zhang
  2025-05-08  8:12   ` Lazar, Lijo
  2025-05-08  9:27   ` Christian König
  2025-05-08  5:09 ` [PATCH v4 2/7] drm/amdgpu: update GPU addresses for SMU and PSP Samuel Zhang
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Jiang Liu

For a virtual machine with vGPUs in SRIOV single-device mode and XGMI
enabled, the XGMI physical node ids may change when waking up from
hibernation with different vGPU devices. So update the XGMI info on resume.
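
The detection itself reduces to comparing the node id saved before suspend with the one re-queried after the host grants the GPU back. A minimal sketch with simplified state (the real code keeps these fields in `adev->gmc.xgmi`):

```c
#include <assert.h>

/* Simplified stand-in for the XGMI fields this patch adds. */
struct xgmi_state {
	unsigned int physical_node_id;
	unsigned int prev_physical_node_id;
};

/* On resume: remember the old id, then record the re-queried one. */
static void update_xgmi_info(struct xgmi_state *x, unsigned int new_id)
{
	x->prev_physical_node_id = x->physical_node_id;
	x->physical_node_id = new_id;
}

static int node_changed(const struct xgmi_state *x)
{
	return x->prev_physical_node_id != x->physical_node_id;
}
```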

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
 3 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d477a901af84..843a3b0a9a07 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		r = adev->gfxhub.funcs->get_xgmi_info(adev);
 		if (r)
 			return r;
+		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
 	}
 
 	/* enable PCIE atomic ops */
@@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 	return 0;
 }
 
+static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
+{
+	int r;
+
+	/* Get xgmi info again for sriov to detect device changes */
+	if (amdgpu_sriov_vf(adev) &&
+	    !(adev->flags & AMD_IS_APU) &&
+	    adev->gmc.xgmi.supported &&
+	    !adev->gmc.xgmi.connected_to_cpu) {
+		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
+		r = adev->gfxhub.funcs->get_xgmi_info(adev);
+		if (r)
+			return r;
+
+		dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
+			adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
+	}
+	return 0;
+}
+
 /**
  * amdgpu_device_resume - initiate device resume
  *
@@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 		r = amdgpu_virt_request_full_gpu(adev, true);
 		if (r)
 			return r;
+		r = amdgpu_device_update_xgmi_info(adev);
+		if (r)
+			return r;
 	}
 
 	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index 32dabba4062f..1387901576f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -89,6 +89,7 @@ struct amdgpu_xgmi {
 	u64 node_segment_size;
 	/* physical node (0-3) */
 	unsigned physical_node_id;
+	unsigned prev_physical_node_id;
 	/* number of nodes (0-4) */
 	unsigned num_physical_nodes;
 	/* gpu list in the same hive */
@@ -101,6 +102,9 @@ struct amdgpu_xgmi {
 	uint8_t max_width;
 };
 
+#define amdgpu_xmgi_is_node_changed(adev) \
+	(adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)
+
 struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
 void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);
 int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 59385da80185..7c0ca2721eb3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
 	struct amdgpu_device *adev = ip_block->adev;
 	int r;
 
+	if (amdgpu_xmgi_is_node_changed(adev)) {
+		adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
+		adev->vm_manager.vram_base_offset +=
+			adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
+	}
+
 	/* If a reset is done for NPS mode switch, read the memory range
 	 * information again.
 	 */
-- 
2.43.5



* [PATCH v4 2/7] drm/amdgpu: update GPU addresses for SMU and PSP
  2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume Samuel Zhang
@ 2025-05-08  5:09 ` Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 3/7] drm/amdgpu: enable pdb0 for hibernation on SRIOV Samuel Zhang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Jiang Liu

Add amdgpu_bo_fb_aper_addr() and update the cached GPU addresses for SMU
and PSP to use the FB aperture address.

There are two reasons for this change:
1. When pdb0 is enabled, the GPU address returned by amdgpu_bo_create_kernel()
is a GART aperture address. It is not compatible with SMU and PSP and needs
to be updated to an FB aperture address.
2. Since the FB aperture address changes after switching to a new GPU index
after hibernation, it needs to be updated on resume.
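
The address the new helper produces is essentially the BO's VRAM offset plus the per-node FB base, sign-extended into the canonical GPU address range. A minimal sketch (the bit-47 hole boundary mirrors what `amdgpu_gmc_sign_extend()` does, but treat that detail as an assumption here):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Assumed hole boundary: addresses at or above 1ULL << 47 are
 * sign-extended to the top of the 64-bit space. */
static uint64_t sign_extend(uint64_t addr)
{
	if (addr >= (1ULL << 47))
		addr |= ~((1ULL << 48) - 1);
	return addr;
}

/* FB aperture address of a BO page, roughly as in
 * amdgpu_bo_fb_aper_addr(): per-node FB base plus the BO's page
 * offset within VRAM. */
static uint64_t bo_fb_aper_addr(uint64_t fb_location, unsigned int node_id,
				uint64_t node_segment_size, uint64_t bo_page)
{
	uint64_t fb_base = fb_location +
			   (uint64_t)node_id * node_segment_size;
	return sign_extend((bo_page << PAGE_SHIFT) + fb_base);
}
```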

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 20 ++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 22 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  |  3 +++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 17 +++++++++++++++++
 5 files changed, 63 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4e794d546b61..3dde57cd5b81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1513,6 +1513,26 @@ u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo)
 	return amdgpu_bo_gpu_offset_no_check(bo);
 }
 
+/**
+ * amdgpu_bo_fb_aper_addr - return FB aperture GPU offset of the VRAM bo
+ * @bo:	amdgpu VRAM buffer object for which we query the offset
+ *
+ * Returns:
+ * current FB aperture GPU offset of the object.
+ */
+u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo)
+{
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+	uint64_t offset, fb_base;
+
+	WARN_ON_ONCE(bo->tbo.resource->mem_type != TTM_PL_VRAM);
+
+	fb_base = adev->mmhub.funcs->get_fb_location(adev);
+	fb_base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
+	offset = (bo->tbo.resource->start << PAGE_SHIFT) + fb_base;
+	return amdgpu_gmc_sign_extend(offset);
+}
+
 /**
  * amdgpu_bo_gpu_offset_no_check - return GPU offset of bo
  * @bo:	amdgpu object for which we query the offset
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index dcce362bfad3..c8a63e38f5d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -320,6 +320,7 @@ int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
 			     bool intr);
 int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
 u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
+u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo);
 u64 amdgpu_bo_gpu_offset_no_check(struct amdgpu_bo *bo);
 uint32_t amdgpu_bo_mem_stats_placement(struct amdgpu_bo *bo);
 uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index e1e658a97b36..bdab40b42983 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -871,6 +871,8 @@ static int psp_tmr_init(struct psp_context *psp)
 					      &psp->tmr_bo, &psp->tmr_mc_addr,
 					      pptr);
 	}
+	if (psp->tmr_bo)
+		psp->tmr_mc_addr = amdgpu_bo_fb_aper_addr(psp->tmr_bo);
 
 	return ret;
 }
@@ -1210,6 +1212,9 @@ static void psp_prep_ta_load_cmd_buf(struct psp_gfx_cmd_resp *cmd,
 	cmd->cmd.cmd_load_ta.app_phy_addr_hi	= upper_32_bits(ta_bin_mc);
 	cmd->cmd.cmd_load_ta.app_len		= context->bin_desc.size_bytes;
 
+	if (context->mem_context.shared_bo)
+		context->mem_context.shared_mc_addr = amdgpu_bo_fb_aper_addr(context->mem_context.shared_bo);
+
 	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_lo =
 		lower_32_bits(context->mem_context.shared_mc_addr);
 	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_hi =
@@ -2336,11 +2341,26 @@ bool amdgpu_psp_tos_reload_needed(struct amdgpu_device *adev)
 	return false;
 }
 
+static void psp_update_gpu_addresses(struct amdgpu_device *adev)
+{
+	struct psp_context *psp = &adev->psp;
+
+	if (psp->cmd_buf_bo && psp->cmd_buf_mem) {
+		psp->fw_pri_mc_addr = amdgpu_bo_fb_aper_addr(psp->fw_pri_bo);
+		psp->fence_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->fence_buf_bo);
+		psp->cmd_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->cmd_buf_bo);
+	}
+	if (adev->firmware.rbuf && psp->km_ring.ring_mem)
+		psp->km_ring.ring_mem_mc_addr = amdgpu_bo_fb_aper_addr(adev->firmware.rbuf);
+}
+
 static int psp_hw_start(struct psp_context *psp)
 {
 	struct amdgpu_device *adev = psp->adev;
 	int ret;
 
+	psp_update_gpu_addresses(adev);
+
 	if (!amdgpu_sriov_vf(adev)) {
 		if ((is_psp_fw_valid(psp->kdb)) &&
 		    (psp->funcs->bootloader_load_kdb != NULL)) {
@@ -3976,6 +3996,7 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
 	memcpy_toio(fw_pri_cpu_addr, usbc_pd_fw->data, usbc_pd_fw->size);
 
 	mutex_lock(&adev->psp.mutex);
+	fw_pri_mc_addr = amdgpu_bo_fb_aper_addr(fw_buf_bo);
 	ret = psp_load_usbc_pd_fw(&adev->psp, fw_pri_mc_addr);
 	mutex_unlock(&adev->psp.mutex);
 
@@ -4085,6 +4106,7 @@ static ssize_t amdgpu_psp_vbflash_read(struct file *filp, struct kobject *kobj,
 	memcpy_toio(fw_pri_cpu_addr, adev->psp.vbflash_tmp_buf, adev->psp.vbflash_image_size);
 
 	mutex_lock(&adev->psp.mutex);
+	fw_pri_mc_addr = amdgpu_bo_fb_aper_addr(fw_buf_bo);
 	ret = psp_update_spirom(&adev->psp, fw_pri_mc_addr);
 	mutex_unlock(&adev->psp.mutex);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index 3d9e9fdc10b4..f3b56c219e7e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -1152,6 +1152,9 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
 		adev->firmware.max_ucodes = AMDGPU_UCODE_ID_MAXIMUM;
 	}
 
+	if (adev->firmware.fw_buf)
+		adev->firmware.fw_buf_mc = amdgpu_bo_fb_aper_addr(adev->firmware.fw_buf);
+
 	for (i = 0; i < adev->firmware.max_ucodes; i++) {
 		ucode = &adev->firmware.ucode[i];
 		if (ucode->fw) {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 315b0856bf02..dfdda98cf0c5 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1000,6 +1000,21 @@ static int smu_fini_fb_allocations(struct smu_context *smu)
 	return 0;
 }
 
+static void smu_update_gpu_addresses(struct smu_context *smu)
+{
+	struct smu_table_context *smu_table = &smu->smu_table;
+	struct smu_table *pm_status_table = smu_table->tables + SMU_TABLE_PMSTATUSLOG;
+	struct smu_table *driver_table = &(smu_table->driver_table);
+	struct smu_table *dummy_read_1_table = &smu_table->dummy_read_1_table;
+
+	if (pm_status_table->bo)
+		pm_status_table->mc_address = amdgpu_bo_fb_aper_addr(pm_status_table->bo);
+	if (driver_table->bo)
+		driver_table->mc_address = amdgpu_bo_fb_aper_addr(driver_table->bo);
+	if (dummy_read_1_table->bo)
+		dummy_read_1_table->mc_address = amdgpu_bo_fb_aper_addr(dummy_read_1_table->bo);
+}
+
 /**
  * smu_alloc_memory_pool - allocate memory pool in the system memory
  *
@@ -1789,6 +1804,8 @@ static int smu_start_smc_engine(struct smu_context *smu)
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0;
 
+	smu_update_gpu_addresses(smu);
+
 	smu->smc_fw_state = SMU_FW_INIT;
 
 	if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
-- 
2.43.5



* [PATCH v4 3/7] drm/amdgpu: enable pdb0 for hibernation on SRIOV
  2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 2/7] drm/amdgpu: update GPU addresses for SMU and PSP Samuel Zhang
@ 2025-05-08  5:09 ` Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 4/7] drm/amdgpu: remove cached gpu addr: amdgpu_firmware.fw_buf_mc Samuel Zhang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma, Emily Deng

When switching to a new GPU index after hibernation and then resuming,
the VRAM offset of each VRAM BO changes, and the cached GPU addresses
need to be updated.

This change enables pdb0 and switches to pdb0-based virtual GPU
addresses by default in amdgpu_bo_create_reserved(). Since the virtual
addresses do not change, this avoids the need to update all the cached
GPU addresses all over the codebase.
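
One detail worth calling out is the GART placement: on SRIOV the node segment size is not guaranteed to be 4GB-aligned, so the hive VRAM end is rounded up before the GART range begins. A standalone sketch of that calculation:

```c
#include <assert.h>
#include <stdint.h>

#define FOUR_GB 0x100000000ULL
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* GART starts right after the hive VRAM range, rounded up to a 4GB
 * boundary as amdgpu_gmc_sysvm_location() now does in this patch. */
static uint64_t gart_start(uint64_t node_segment_size,
			   unsigned int num_physical_nodes)
{
	uint64_t hive_vram_end =
		node_segment_size * num_physical_nodes - 1;
	return ALIGN_UP(hive_vram_end + 1, FOUR_GB);
}
```

For example, a 5.75 GB segment on a single node pushes the GART start from 0x170000000 up to the next 4GB boundary at 0x200000000.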

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 32 ++++++++++++++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c    | 13 +++++++---
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c  |  6 +++--
 5 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index d1fa5e8e3937..c7179e5f7f78 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -38,6 +38,8 @@
 #include <drm/drm_drv.h>
 #include <drm/ttm/ttm_tt.h>
 
+static const u64 four_gb = 0x100000000ULL;
+
 /**
  * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
  *
@@ -249,15 +251,24 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
 {
 	u64 hive_vram_start = 0;
 	u64 hive_vram_end = mc->xgmi.node_segment_size * mc->xgmi.num_physical_nodes - 1;
-	mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
-	mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
-	mc->gart_start = hive_vram_end + 1;
+
+	if (adev->gmc.xgmi.connected_to_cpu) {
+		mc->vram_start = mc->xgmi.node_segment_size * mc->xgmi.physical_node_id;
+		mc->vram_end = mc->vram_start + mc->xgmi.node_segment_size - 1;
+		dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
+				mc->mc_vram_size >> 20, mc->vram_start,
+				mc->vram_end, mc->real_vram_size >> 20);
+	} else {
+		/* set mc->vram_start to 0 to switch the returned GPU address of
+		 * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
+		 */
+		amdgpu_gmc_vram_location(adev, mc, 0);
+	}
+	/* node_segment_size may not 4GB aligned on SRIOV, align up is needed. */
+	mc->gart_start = ALIGN(hive_vram_end + 1, four_gb);
 	mc->gart_end = mc->gart_start + mc->gart_size - 1;
 	mc->fb_start = hive_vram_start;
 	mc->fb_end = hive_vram_end;
-	dev_info(adev->dev, "VRAM: %lluM 0x%016llX - 0x%016llX (%lluM used)\n",
-			mc->mc_vram_size >> 20, mc->vram_start,
-			mc->vram_end, mc->real_vram_size >> 20);
 	dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
 			mc->gart_size >> 20, mc->gart_start, mc->gart_end);
 }
@@ -276,7 +287,6 @@ void amdgpu_gmc_sysvm_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc
 void amdgpu_gmc_gart_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
 			      enum amdgpu_gart_placement gart_placement)
 {
-	const uint64_t four_gb = 0x100000000ULL;
 	u64 size_af, size_bf;
 	/*To avoid the hole, limit the max mc address to AMDGPU_GMC_HOLE_START*/
 	u64 max_mc_address = min(adev->gmc.mc_mask, AMDGPU_GMC_HOLE_START - 1);
@@ -1068,6 +1078,14 @@ void amdgpu_gmc_init_pdb0(struct amdgpu_device *adev)
 	flags |= AMDGPU_PTE_FRAG((adev->gmc.vmid0_page_table_block_size + 9*1));
 	flags |= AMDGPU_PDE_PTE_FLAG(adev);
 
+	if (!adev->gmc.xgmi.connected_to_cpu) {
+		/* always start from current device so that the GART address can keep
+		 * consistent when hibernate-resume with different GPUs.
+		 */
+		vram_addr = adev->vm_manager.vram_base_offset;
+		vram_end = vram_addr + vram_size;
+	}
+
 	/* The first n PDE0 entries are used as PTE,
 	 * pointing to vram
 	 */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index bd7fc123b8f9..758b47240c6f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -307,6 +307,7 @@ struct amdgpu_gmc {
 	struct amdgpu_bo		*pdb0_bo;
 	/* CPU kmapped address of pdb0*/
 	void				*ptr_pdb0;
+	bool enable_pdb0;
 
 	/* MALL size */
 	u64 mall_size;
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
index cb25f7f0dfc1..5ebb92ac9fd7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
@@ -180,7 +180,7 @@ gfxhub_v1_2_xcc_init_system_aperture_regs(struct amdgpu_device *adev,
 		/* In the case squeezing vram into GART aperture, we don't use
 		 * FB aperture and AGP aperture. Disable them.
 		 */
-		if (adev->gmc.pdb0_bo) {
+		if (adev->gmc.pdb0_bo && !amdgpu_sriov_vf(adev)) {
 			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_TOP, 0);
 			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_FB_LOCATION_BASE, 0x00FFFFFF);
 			WREG32_SOC15(GC, GET_INST(GC, i), regMC_VM_AGP_TOP, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 7c0ca2721eb3..356b0fe5a538 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1682,6 +1682,11 @@ static int gmc_v9_0_early_init(struct amdgpu_ip_block *ip_block)
 		adev->gmc.private_aperture_start + (4ULL << 30) - 1;
 	adev->gmc.noretry_flags = AMDGPU_VM_NORETRY_FLAGS_TF;
 
+	adev->gmc.enable_pdb0 = adev->gmc.xgmi.connected_to_cpu;
+	if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
+	    amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4) ||
+	    amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 5, 0))
+	    adev->gmc.enable_pdb0 = amdgpu_sriov_vf(adev);
 	return 0;
 }
 
@@ -1726,7 +1731,7 @@ static void gmc_v9_0_vram_gtt_location(struct amdgpu_device *adev,
 
 	/* add the xgmi offset of the physical node */
 	base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
-	if (adev->gmc.xgmi.connected_to_cpu) {
+	if (adev->gmc.enable_pdb0) {
 		amdgpu_gmc_sysvm_location(adev, mc);
 	} else {
 		amdgpu_gmc_vram_location(adev, mc, base);
@@ -1841,7 +1846,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
 		return 0;
 	}
 
-	if (adev->gmc.xgmi.connected_to_cpu) {
+	if (adev->gmc.enable_pdb0) {
 		adev->gmc.vmid0_page_table_depth = 1;
 		adev->gmc.vmid0_page_table_block_size = 12;
 	} else {
@@ -1867,7 +1872,7 @@ static int gmc_v9_0_gart_init(struct amdgpu_device *adev)
 		if (r)
 			return r;
 
-		if (adev->gmc.xgmi.connected_to_cpu)
+		if (adev->gmc.enable_pdb0)
 			r = amdgpu_gmc_pdb0_alloc(adev);
 	}
 
@@ -2372,7 +2377,7 @@ static int gmc_v9_0_gart_enable(struct amdgpu_device *adev)
 {
 	int r;
 
-	if (adev->gmc.xgmi.connected_to_cpu)
+	if (adev->gmc.enable_pdb0)
 		amdgpu_gmc_init_pdb0(adev);
 
 	if (adev->gart.bo == NULL) {
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
index 84cde1239ee4..9b3d1ac138ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
@@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
 	top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
 	top <<= 24;
 
-	adev->gmc.fb_start = base;
-	adev->gmc.fb_end = top;
+	if (!adev->gmc.enable_pdb0) {
+		adev->gmc.fb_start = base;
+		adev->gmc.fb_end = top;
+	}
 
 	return base;
 }
-- 
2.43.5



* [PATCH v4 4/7] drm/amdgpu: remove cached gpu addr: amdgpu_firmware.fw_buf_mc
  2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
                   ` (2 preceding siblings ...)
  2025-05-08  5:09 ` [PATCH v4 3/7] drm/amdgpu: enable pdb0 for hibernation on SRIOV Samuel Zhang
@ 2025-05-08  5:09 ` Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 5/7] drm/amdgpu: remove cached gpu addr: ta_mem_context.shared_mc_addr Samuel Zhang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma

When pdb0 is enabled, the cached GPU address is not compatible with SMU
and PSP and always needs to be updated. Remove the cached GPU address
and use a local variable instead.
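
The pattern shared by patches 4-7 is the same: stop storing the address at allocation time and derive it at the point of use, so it can never go stale across a resume. A minimal sketch with hypothetical names, not the actual driver structures:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical BO stand-in: only the VRAM offset matters here. */
struct bo {
	uint64_t vram_offset;
};

/* Stand-in for amdgpu_bo_fb_aper_addr(): recomputed from current
 * state on every call, so it is always valid after resume. */
static uint64_t fb_aper_addr_of(const struct bo *bo, uint64_t fb_base)
{
	return fb_base + bo->vram_offset;
}

/* Point of use: compute into a local instead of reading a cached
 * field that may be stale after switching vGPU index. */
static uint64_t current_fw_addr(const struct bo *bo, uint64_t fb_base)
{
	uint64_t fw_buf_mc = fb_aper_addr_of(bo, fb_base); /* local, not cached */
	return fw_buf_mc;
}
```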

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 13 ++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h |  1 -
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index f3b56c219e7e..0ffa6c315158 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -1111,7 +1111,7 @@ int amdgpu_ucode_create_bo(struct amdgpu_device *adev)
 			(amdgpu_sriov_vf(adev) || adev->debug_use_vram_fw_buf) ?
 			AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
 			&adev->firmware.fw_buf,
-			&adev->firmware.fw_buf_mc,
+			NULL,
 			&adev->firmware.fw_buf_ptr);
 		if (!adev->firmware.fw_buf) {
 			dev_err(adev->dev, "failed to create kernel buffer for firmware.fw_buf\n");
@@ -1126,13 +1126,13 @@ int amdgpu_ucode_create_bo(struct amdgpu_device *adev)
 void amdgpu_ucode_free_bo(struct amdgpu_device *adev)
 {
 	amdgpu_bo_free_kernel(&adev->firmware.fw_buf,
-		&adev->firmware.fw_buf_mc,
+		NULL,
 		&adev->firmware.fw_buf_ptr);
 }
 
 int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
 {
-	uint64_t fw_offset = 0;
+	uint64_t fw_offset = 0, fw_buf_mc;
 	int i;
 	struct amdgpu_firmware_info *ucode = NULL;
 
@@ -1152,20 +1152,19 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
 		adev->firmware.max_ucodes = AMDGPU_UCODE_ID_MAXIMUM;
 	}
 
-	if (adev->firmware.fw_buf)
-		adev->firmware.fw_buf_mc = amdgpu_bo_fb_aper_addr(adev->firmware.fw_buf);
+	fw_buf_mc = amdgpu_bo_fb_aper_addr(adev->firmware.fw_buf);
 
 	for (i = 0; i < adev->firmware.max_ucodes; i++) {
 		ucode = &adev->firmware.ucode[i];
 		if (ucode->fw) {
-			amdgpu_ucode_init_single_fw(adev, ucode, adev->firmware.fw_buf_mc + fw_offset,
+			amdgpu_ucode_init_single_fw(adev, ucode, fw_buf_mc + fw_offset,
 						    adev->firmware.fw_buf_ptr + fw_offset);
 			if (i == AMDGPU_UCODE_ID_CP_MEC1 &&
 			    adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
 				const struct gfx_firmware_header_v1_0 *cp_hdr;
 
 				cp_hdr = (const struct gfx_firmware_header_v1_0 *)ucode->fw->data;
-				amdgpu_ucode_patch_jt(ucode,  adev->firmware.fw_buf_mc + fw_offset,
+				amdgpu_ucode_patch_jt(ucode,  fw_buf_mc + fw_offset,
 						    adev->firmware.fw_buf_ptr + fw_offset);
 				fw_offset += ALIGN(le32_to_cpu(cp_hdr->jt_size) << 2, PAGE_SIZE);
 			}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h
index 4eedd92f000b..47825c82a3ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h
@@ -599,7 +599,6 @@ struct amdgpu_firmware {
 	const struct firmware *gpu_info_fw;
 
 	void *fw_buf_ptr;
-	uint64_t fw_buf_mc;
 };
 
 void amdgpu_ucode_print_mc_hdr(const struct common_firmware_header *hdr);
-- 
2.43.5



* [PATCH v4 5/7] drm/amdgpu: remove cached gpu addr: ta_mem_context.shared_mc_addr
  2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
                   ` (3 preceding siblings ...)
  2025-05-08  5:09 ` [PATCH v4 4/7] drm/amdgpu: remove cached gpu addr: amdgpu_firmware.fw_buf_mc Samuel Zhang
@ 2025-05-08  5:09 ` Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 6/7] drm/amdgpu: remove cached gpu addr: psp_context.tmr_mc_addr Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 7/7] drm/amdgpu: remove cached gpu addr: psp_context.cmd_buf_mc_addr Samuel Zhang
  6 siblings, 0 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma

When pdb0 is enabled, the cached GPU address is not compatible with SMU
and PSP and always needs to be updated. Remove the cached GPU address
and use a local variable instead.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 18 +++++++-----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  1 -
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index bdab40b42983..153c0c868546 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -276,8 +276,7 @@ static int psp_early_init(struct amdgpu_ip_block *ip_block)
 
 void psp_ta_free_shared_buf(struct ta_mem_context *mem_ctx)
 {
-	amdgpu_bo_free_kernel(&mem_ctx->shared_bo, &mem_ctx->shared_mc_addr,
-			      &mem_ctx->shared_buf);
+	amdgpu_bo_free_kernel(&mem_ctx->shared_bo, NULL, &mem_ctx->shared_buf);
 	mem_ctx->shared_bo = NULL;
 }
 
@@ -1123,7 +1122,6 @@ static int psp_asd_initialize(struct psp_context *psp)
 	    amdgpu_ip_version(psp->adev, MP0_HWIP, 0) >= IP_VERSION(13, 0, 10))
 		return 0;
 
-	psp->asd_context.mem_context.shared_mc_addr  = 0;
 	psp->asd_context.mem_context.shared_mem_size = PSP_ASD_SHARED_MEM_SIZE;
 	psp->asd_context.ta_load_type                = GFX_CMD_ID_LOAD_ASD;
 
@@ -1207,18 +1205,16 @@ static void psp_prep_ta_load_cmd_buf(struct psp_gfx_cmd_resp *cmd,
 				     uint64_t ta_bin_mc,
 				     struct ta_context *context)
 {
+	uint64_t shared_mc_addr;
+
 	cmd->cmd_id				= context->ta_load_type;
 	cmd->cmd.cmd_load_ta.app_phy_addr_lo	= lower_32_bits(ta_bin_mc);
 	cmd->cmd.cmd_load_ta.app_phy_addr_hi	= upper_32_bits(ta_bin_mc);
 	cmd->cmd.cmd_load_ta.app_len		= context->bin_desc.size_bytes;
 
-	if (context->mem_context.shared_bo)
-		context->mem_context.shared_mc_addr = amdgpu_bo_fb_aper_addr(context->mem_context.shared_bo);
-
-	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_lo =
-		lower_32_bits(context->mem_context.shared_mc_addr);
-	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_hi =
-		upper_32_bits(context->mem_context.shared_mc_addr);
+	shared_mc_addr = amdgpu_bo_fb_aper_addr(context->mem_context.shared_bo);
+	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_lo = lower_32_bits(shared_mc_addr);
+	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_hi = upper_32_bits(shared_mc_addr);
 	cmd->cmd.cmd_load_ta.cmd_buf_len = context->mem_context.shared_mem_size;
 }
 
@@ -1233,7 +1229,7 @@ int psp_ta_init_shared_buf(struct psp_context *psp,
 				      PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM |
 				      AMDGPU_GEM_DOMAIN_GTT,
 				      &mem_ctx->shared_bo,
-				      &mem_ctx->shared_mc_addr,
+				      NULL,
 				      &mem_ctx->shared_buf);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 6ea5c21edc4e..106d07aaf8e1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -174,7 +174,6 @@ struct psp_bin_desc {
 
 struct ta_mem_context {
 	struct amdgpu_bo		*shared_bo;
-	uint64_t		shared_mc_addr;
 	void			*shared_buf;
 	enum psp_shared_mem_size	shared_mem_size;
 };
-- 
2.43.5



* [PATCH v4 6/7] drm/amdgpu: remove cached gpu addr: psp_context.tmr_mc_addr
  2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
                   ` (4 preceding siblings ...)
  2025-05-08  5:09 ` [PATCH v4 5/7] drm/amdgpu: remove cached gpu addr: ta_mem_context.shared_mc_addr Samuel Zhang
@ 2025-05-08  5:09 ` Samuel Zhang
  2025-05-08  5:09 ` [PATCH v4 7/7] drm/amdgpu: remove cached gpu addr: psp_context.cmd_buf_mc_addr Samuel Zhang
  6 siblings, 0 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma

When pdb0 is enabled, the cached GPU address is not compatible with SMU
and PSP and always needs to be updated. Remove the cached GPU address
and use a local variable instead.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  1 -
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 153c0c868546..3b71ff298f21 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -287,7 +287,7 @@ static void psp_free_shared_bufs(struct psp_context *psp)
 
 	/* free TMR memory buffer */
 	pptr = amdgpu_sriov_vf(psp->adev) ? &tmr_buf : NULL;
-	amdgpu_bo_free_kernel(&psp->tmr_bo, &psp->tmr_mc_addr, pptr);
+	amdgpu_bo_free_kernel(&psp->tmr_bo, NULL, pptr);
 	psp->tmr_bo = NULL;
 
 	/* free xgmi shared memory */
@@ -867,11 +867,9 @@ static int psp_tmr_init(struct psp_context *psp)
 					      AMDGPU_HAS_VRAM(psp->adev) ?
 					      AMDGPU_GEM_DOMAIN_VRAM :
 					      AMDGPU_GEM_DOMAIN_GTT,
-					      &psp->tmr_bo, &psp->tmr_mc_addr,
+					      &psp->tmr_bo, NULL,
 					      pptr);
 	}
-	if (psp->tmr_bo)
-		psp->tmr_mc_addr = amdgpu_bo_fb_aper_addr(psp->tmr_bo);
 
 	return ret;
 }
@@ -896,6 +894,7 @@ static int psp_tmr_load(struct psp_context *psp)
 {
 	int ret;
 	struct psp_gfx_cmd_resp *cmd;
+	uint64_t tmr_mc_addr;
 
 	/* For Navi12 and CHIP_SIENNA_CICHLID SRIOV, do not set up TMR.
 	 * Already set up by host driver.
@@ -905,10 +904,11 @@ static int psp_tmr_load(struct psp_context *psp)
 
 	cmd = acquire_psp_cmd_buf(psp);
 
-	psp_prep_tmr_cmd_buf(psp, cmd, psp->tmr_mc_addr, psp->tmr_bo);
+	tmr_mc_addr = amdgpu_bo_fb_aper_addr(psp->tmr_bo);
+	psp_prep_tmr_cmd_buf(psp, cmd, tmr_mc_addr, psp->tmr_bo);
 	if (psp->tmr_bo)
 		dev_info(psp->adev->dev, "reserve 0x%lx from 0x%llx for PSP TMR\n",
-			 amdgpu_bo_size(psp->tmr_bo), psp->tmr_mc_addr);
+			 amdgpu_bo_size(psp->tmr_bo), tmr_mc_addr);
 
 	ret = psp_cmd_submit_buf(psp, NULL, cmd,
 				 psp->fence_buf_mc_addr);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 106d07aaf8e1..d3f5c17ead7e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -348,7 +348,6 @@ struct psp_context {
 
 	/* tmr buffer */
 	struct amdgpu_bo		*tmr_bo;
-	uint64_t			tmr_mc_addr;
 
 	/* asd firmware */
 	const struct firmware		*asd_fw;
-- 
2.43.5



* [PATCH v4 7/7] drm/amdgpu: remove cached gpu addr: psp_context.cmd_buf_mc_addr
  2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
                   ` (5 preceding siblings ...)
  2025-05-08  5:09 ` [PATCH v4 6/7] drm/amdgpu: remove cached gpu addr: psp_context.tmr_mc_addr Samuel Zhang
@ 2025-05-08  5:09 ` Samuel Zhang
  6 siblings, 0 replies; 14+ messages in thread
From: Samuel Zhang @ 2025-05-08  5:09 UTC (permalink / raw)
  To: amd-gfx
  Cc: victor.zhao, haijun.chang, guoqing.zhang, Christian.Koenig,
	Alexander.Deucher, Owen.Zhang2, Qing.Ma

When pdb0 is enabled, the cached gpu addr is not compatible with SMU and
PSP and always needs to be updated. Remove the cached gpu addr and use a
local variable instead.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 +++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h | 1 -
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 3b71ff298f21..c89e593d4819 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -524,7 +524,7 @@ static int psp_sw_init(struct amdgpu_ip_block *ip_block)
 	ret = amdgpu_bo_create_kernel(adev, PSP_CMD_BUFFER_SIZE, PAGE_SIZE,
 				      AMDGPU_GEM_DOMAIN_VRAM |
 				      AMDGPU_GEM_DOMAIN_GTT,
-				      &psp->cmd_buf_bo, &psp->cmd_buf_mc_addr,
+				      &psp->cmd_buf_bo, NULL,
 				      (void **)&psp->cmd_buf_mem);
 	if (ret)
 		goto failed2;
@@ -567,7 +567,7 @@ static int psp_sw_fini(struct amdgpu_ip_block *ip_block)
 			      &psp->fw_pri_mc_addr, &psp->fw_pri_buf);
 	amdgpu_bo_free_kernel(&psp->fence_buf_bo,
 			      &psp->fence_buf_mc_addr, &psp->fence_buf);
-	amdgpu_bo_free_kernel(&psp->cmd_buf_bo, &psp->cmd_buf_mc_addr,
+	amdgpu_bo_free_kernel(&psp->cmd_buf_bo, NULL,
 			      (void **)&psp->cmd_buf_mem);
 
 	return 0;
@@ -682,6 +682,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	int timeout = psp->adev->psp_timeout;
 	bool ras_intr = false;
 	bool skip_unsupport = false;
+	uint64_t cmd_buf_mc_addr;
 
 	if (psp->adev->no_hw_access)
 		return 0;
@@ -691,7 +692,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	memcpy(psp->cmd_buf_mem, cmd, sizeof(struct psp_gfx_cmd_resp));
 
 	index = atomic_inc_return(&psp->fence_value);
-	ret = psp_ring_cmd_submit(psp, psp->cmd_buf_mc_addr, fence_mc_addr, index);
+	cmd_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->cmd_buf_bo);
+	ret = psp_ring_cmd_submit(psp, cmd_buf_mc_addr, fence_mc_addr, index);
 	if (ret) {
 		atomic_dec(&psp->fence_value);
 		goto exit;
@@ -2344,7 +2346,6 @@ static void psp_update_gpu_addresses(struct amdgpu_device *adev)
 	if (psp->cmd_buf_bo && psp->cmd_buf_mem) {
 		psp->fw_pri_mc_addr = amdgpu_bo_fb_aper_addr(psp->fw_pri_bo);
 		psp->fence_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->fence_buf_bo);
-		psp->cmd_buf_mc_addr = amdgpu_bo_fb_aper_addr(psp->cmd_buf_bo);
 	}
 	if (adev->firmware.rbuf && psp->km_ring.ring_mem)
 		psp->km_ring.ring_mem_mc_addr = amdgpu_bo_fb_aper_addr(adev->firmware.rbuf);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index d3f5c17ead7e..491697e1f141 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -365,7 +365,6 @@ struct psp_context {
 
 	/* cmd buffer */
 	struct amdgpu_bo		*cmd_buf_bo;
-	uint64_t			cmd_buf_mc_addr;
 	struct psp_gfx_cmd_resp		*cmd_buf_mem;
 
 	/* fence value associated with cmd buffer */
-- 
2.43.5



* Re: [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume
  2025-05-08  5:09 ` [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume Samuel Zhang
@ 2025-05-08  8:12   ` Lazar, Lijo
  2025-05-08 10:03     ` Zhang, GuoQing (Sam)
  2025-05-08 10:56     ` Christian König
  2025-05-08  9:27   ` Christian König
  1 sibling, 2 replies; 14+ messages in thread
From: Lazar, Lijo @ 2025-05-08  8:12 UTC (permalink / raw)
  To: Samuel Zhang, amd-gfx
  Cc: victor.zhao, haijun.chang, Christian.Koenig, Alexander.Deucher,
	Owen.Zhang2, Qing.Ma, Jiang Liu



On 5/8/2025 10:39 AM, Samuel Zhang wrote:
> For a virtual machine with vGPUs in SRIOV single device mode and XGMI
> enabled, XGMI physical node ids may change when waking up from
> hibernation with different vGPU devices. So update XGMI info on resume.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d477a901af84..843a3b0a9a07 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  		r = adev->gfxhub.funcs->get_xgmi_info(adev);
>  		if (r)
>  			return r;
> +		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>  	}
>  
>  	/* enable PCIE atomic ops */
> @@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>  	return 0;
>  }
>  
> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
> +{
> +	int r;
> +
> +	/* Get xgmi info again for sriov to detect device changes */
> +	if (amdgpu_sriov_vf(adev) &&
> +	    !(adev->flags & AMD_IS_APU) &&
> +	    adev->gmc.xgmi.supported &&
> +	    !adev->gmc.xgmi.connected_to_cpu) {
> +		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
> +		r = adev->gfxhub.funcs->get_xgmi_info(adev);
> +		if (r)
> +			return r;
> +
> +		dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
> +			adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
> +	}
> +	return 0;
> +}
> +
>  /**
>   * amdgpu_device_resume - initiate device resume
>   *
> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>  		r = amdgpu_virt_request_full_gpu(adev, true);
>  		if (r)
>  			return r;
> +		r = amdgpu_device_update_xgmi_info(adev);
> +		if (r)
> +			return r;
>  	}
>  
>  	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> index 32dabba4062f..1387901576f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> @@ -89,6 +89,7 @@ struct amdgpu_xgmi {
>  	u64 node_segment_size;
>  	/* physical node (0-3) */
>  	unsigned physical_node_id;
> +	unsigned prev_physical_node_id;
>  	/* number of nodes (0-4) */
>  	unsigned num_physical_nodes;
>  	/* gpu list in the same hive */
> @@ -101,6 +102,9 @@ struct amdgpu_xgmi {
>  	uint8_t max_width;
>  };
>  
> +#define amdgpu_xmgi_is_node_changed(adev) \

Typo - xgmi

> +	(adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)

Since prev_physical_node_id is updated only for VF, the check should be
here as well.

Otherwise, you may have something like below in
amdgpu_device_update_xgmi_info()

amdgpu_xgmi.node_changed = false;
if (check_condition) {
	prev_node = adev->gmc.xgmi.physical_node_id;
	adev->gfxhub.funcs->get_xgmi_info(adev)
	amdgpu_xgmi.node_changed = (prev_node != adev->gmc.xgmi.physical_node_id);
}

To make it clearer -

Would still prefer to wrap under amdgpu_virt_migration_xyz() to make it
clear that this is done for node migration.

Ex:

bool amdgpu_virt_migration_detected()
{
	return amdgpu_xgmi.node_changed; // And any other combination checks
which could come up in the future.
}

The check needs to be done for any further changes down the series like

if (amdgpu_virt_migration_detected())
	psp_update_gpu_addresses();

Thanks,
Lijo

> +
>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>  void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);
>  int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 59385da80185..7c0ca2721eb3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
>  	struct amdgpu_device *adev = ip_block->adev;
>  	int r;
>  
> +	if (amdgpu_xmgi_is_node_changed(adev)) {
> +		adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
> +		adev->vm_manager.vram_base_offset +=
> +			adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
> +	}
> +
>  	/* If a reset is done for NPS mode switch, read the memory range
>  	 * information again.
>  	 */



* Re: [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume
  2025-05-08  5:09 ` [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume Samuel Zhang
  2025-05-08  8:12   ` Lazar, Lijo
@ 2025-05-08  9:27   ` Christian König
  2025-05-08 10:03     ` Zhang, GuoQing (Sam)
  1 sibling, 1 reply; 14+ messages in thread
From: Christian König @ 2025-05-08  9:27 UTC (permalink / raw)
  To: Samuel Zhang, amd-gfx
  Cc: victor.zhao, haijun.chang, Christian.Koenig, Alexander.Deucher,
	Owen.Zhang2, Qing.Ma, Jiang Liu



On 5/8/25 07:09, Samuel Zhang wrote:
> For a virtual machine with vGPUs in SRIOV single device mode and XGMI
> enabled, XGMI physical node ids may change when waking up from
> hibernation with different vGPU devices. So update XGMI info on resume.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d477a901af84..843a3b0a9a07 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  		r = adev->gfxhub.funcs->get_xgmi_info(adev);
>  		if (r)
>  			return r;
> +		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>  	}
>  
>  	/* enable PCIE atomic ops */
> @@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>  	return 0;
>  }
>  
> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
> +{
> +	int r;
> +
> +	/* Get xgmi info again for sriov to detect device changes */
> +	if (amdgpu_sriov_vf(adev) &&
> +	    !(adev->flags & AMD_IS_APU) &&
> +	    adev->gmc.xgmi.supported &&
> +	    !adev->gmc.xgmi.connected_to_cpu) {
> +		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
> +		r = adev->gfxhub.funcs->get_xgmi_info(adev);
> +		if (r)
> +			return r;
> +
> +		dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
> +			adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
> +	}
> +	return 0;
> +}
> +
>  /**
>   * amdgpu_device_resume - initiate device resume
>   *
> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>  		r = amdgpu_virt_request_full_gpu(adev, true);
>  		if (r)
>  			return r;
> +		r = amdgpu_device_update_xgmi_info(adev);
> +		if (r)
> +			return r;
>  	}
>  
>  	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> index 32dabba4062f..1387901576f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> @@ -89,6 +89,7 @@ struct amdgpu_xgmi {
>  	u64 node_segment_size;
>  	/* physical node (0-3) */
>  	unsigned physical_node_id;
> +	unsigned prev_physical_node_id;
>  	/* number of nodes (0-4) */
>  	unsigned num_physical_nodes;
>  	/* gpu list in the same hive */
> @@ -101,6 +102,9 @@ struct amdgpu_xgmi {
>  	uint8_t max_width;
>  };
>  
> +#define amdgpu_xmgi_is_node_changed(adev) \
> +	(adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)


Please drop that function and the related checks.

If this is necessary we want to update the relevant parameters all the time and not just when something changed.

Regards,
Christian.

> +
>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>  void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);
>  int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 59385da80185..7c0ca2721eb3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
>  	struct amdgpu_device *adev = ip_block->adev;
>  	int r;
>  
> +	if (amdgpu_xmgi_is_node_changed(adev)) {
> +		adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
> +		adev->vm_manager.vram_base_offset +=
> +			adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
> +	}
> +
>  	/* If a reset is done for NPS mode switch, read the memory range
>  	 * information again.
>  	 */



* Re: [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume
  2025-05-08  8:12   ` Lazar, Lijo
@ 2025-05-08 10:03     ` Zhang, GuoQing (Sam)
  2025-05-08 10:56     ` Christian König
  1 sibling, 0 replies; 14+ messages in thread
From: Zhang, GuoQing (Sam) @ 2025-05-08 10:03 UTC (permalink / raw)
  To: Lazar, Lijo, Zhang, GuoQing (Sam), amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Koenig, Christian,
	Deucher, Alexander, Zhang, Owen(SRDC), Ma, Qing (Mark), Jiang Liu



On 2025/5/8 16:12, Lazar, Lijo wrote:
> On 5/8/2025 10:39 AM, Samuel Zhang wrote:
>> For a virtual machine with vGPUs in SRIOV single device mode and XGMI
>> enabled, XGMI physical node ids may change when waking up from
>> hibernation with different vGPU devices. So update XGMI info on resume.
>>
>> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com>
>> Signed-off-by: Samuel Zhang<guoqing.zhang@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++++
>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
>>   3 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d477a901af84..843a3b0a9a07 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>               r = adev->gfxhub.funcs->get_xgmi_info(adev);
>>               if (r)
>>                       return r;
>> +            adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>>       }
>>
>>       /* enable PCIE atomic ops */
>> @@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>>       return 0;
>>   }
>>
>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
>> +{
>> +    int r;
>> +
>> +    /* Get xgmi info again for sriov to detect device changes */
>> +    if (amdgpu_sriov_vf(adev) &&
>> +        !(adev->flags & AMD_IS_APU) &&
>> +        adev->gmc.xgmi.supported &&
>> +        !adev->gmc.xgmi.connected_to_cpu) {
>> +            adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>> +            r = adev->gfxhub.funcs->get_xgmi_info(adev);
>> +            if (r)
>> +                    return r;
>> +
>> +            dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
>> +                    adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
>> +    }
>> +    return 0;
>> +}
>> +
>>   /**
>>    * amdgpu_device_resume - initiate device resume
>>    *
>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>               r = amdgpu_virt_request_full_gpu(adev, true);
>>               if (r)
>>                       return r;
>> +            r = amdgpu_device_update_xgmi_info(adev);
>> +            if (r)
>> +                    return r;
>>       }
>>
>>       if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> index 32dabba4062f..1387901576f1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> @@ -89,6 +89,7 @@ struct amdgpu_xgmi {
>>       u64 node_segment_size;
>>       /* physical node (0-3) */
>>       unsigned physical_node_id;
>> +    unsigned prev_physical_node_id;
>>       /* number of nodes (0-4) */
>>       unsigned num_physical_nodes;
>>       /* gpu list in the same hive */
>> @@ -101,6 +102,9 @@ struct amdgpu_xgmi {
>>       uint8_t max_width;
>>   };
>>
>> +#define amdgpu_xmgi_is_node_changed(adev) \
> Typo - xgmi
>
>> +    (adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)
> Since prev_physical_node_id is updated only for VF, the check should be
> here as well.
>
> Otherwise, you may have something like below in
> amdgpu_device_update_xgmi_info()
>
> amdgpu_xgmi.node_changed = false;
> if (check_condition) {
>        prev_node = adev->gmc.xgmi.physical_node_id;
>        adev->gfxhub.funcs->get_xgmi_info(adev)
>        amdgpu_xgmi.node_changed = (prev_node != adev->gmc.xgmi.physical_node_id);
> }
>
> To make it clearer -
>
> Would still prefer to wrap under amdgpu_virt_migration_xyz() to make it
> clear that this is done for node migration.
>
> Ex:
>
> bool amdgpu_virt_migration_detected()
> {
>        return amdgpu_xgmi.node_changed; // And any other combination checks
> which could come up in the future.
> }
>
> The check needs to be done for any further changes down the series like
>
> if (amdgpu_virt_migration_detected())
>        psp_update_gpu_addresses();


psp_update_gpu_addresses() and the other gpu address updating logic will
always be needed when pdb0 is enabled, not just when an xgmi node change
is detected. When pdb0 is enabled, the gpu addr returned from
amdgpu_bo_create_reserved() will be in the GART aperture, which is not
compatible with PSP and SMU. It needs to be updated to the FB aperture
addr using the new `amdgpu_bo_fb_aper_addr()`.

That's the reason for the last 4 refactoring patches: to remove the
cached gpu addresses completely and convert them to local variables.

Regards
Sam


> Thanks,
> Lijo
>
>> +
>>   struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>>   void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);
>>   int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index 59385da80185..7c0ca2721eb3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
>>       struct amdgpu_device *adev = ip_block->adev;
>>       int r;
>>
>> +    if (amdgpu_xmgi_is_node_changed(adev)) {
>> +            adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
>> +            adev->vm_manager.vram_base_offset +=
>> +                    adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
>> +    }
>> +
>>       /* If a reset is done for NPS mode switch, read the memory range
>>        * information again.
>>        */



* Re: [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume
  2025-05-08  9:27   ` Christian König
@ 2025-05-08 10:03     ` Zhang, GuoQing (Sam)
  0 siblings, 0 replies; 14+ messages in thread
From: Zhang, GuoQing (Sam) @ 2025-05-08 10:03 UTC (permalink / raw)
  To: Koenig, Christian, Zhang, GuoQing (Sam),
	amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander,
	Zhang, Owen(SRDC), Ma, Qing (Mark), Jiang Liu



On 2025/5/8 17:27, Christian König wrote:
>
> On 5/8/25 07:09, Samuel Zhang wrote:
>> For a virtual machine with vGPUs in SRIOV single device mode and XGMI
>> enabled, XGMI physical node ids may change when waking up from
>> hibernation with different vGPU devices. So update XGMI info on resume.
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++++
>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
>>   3 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d477a901af84..843a3b0a9a07 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>               r = adev->gfxhub.funcs->get_xgmi_info(adev);
>>               if (r)
>>                       return r;
>> +            adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>>       }
>>
>>       /* enable PCIE atomic ops */
>> @@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>>       return 0;
>>   }
>>
>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
>> +{
>> +    int r;
>> +
>> +    /* Get xgmi info again for sriov to detect device changes */
>> +    if (amdgpu_sriov_vf(adev) &&
>> +        !(adev->flags & AMD_IS_APU) &&
>> +        adev->gmc.xgmi.supported &&
>> +        !adev->gmc.xgmi.connected_to_cpu) {
>> +            adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>> +            r = adev->gfxhub.funcs->get_xgmi_info(adev);
>> +            if (r)
>> +                    return r;
>> +
>> +            dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
>> +                    adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
>> +    }
>> +    return 0;
>> +}
>> +
>>   /**
>>    * amdgpu_device_resume - initiate device resume
>>    *
>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>               r = amdgpu_virt_request_full_gpu(adev, true);
>>               if (r)
>>                       return r;
>> +            r = amdgpu_device_update_xgmi_info(adev);
>> +            if (r)
>> +                    return r;
>>       }
>>
>>       if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> index 32dabba4062f..1387901576f1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> @@ -89,6 +89,7 @@ struct amdgpu_xgmi {
>>       u64 node_segment_size;
>>       /* physical node (0-3) */
>>       unsigned physical_node_id;
>> +    unsigned prev_physical_node_id;
>>       /* number of nodes (0-4) */
>>       unsigned num_physical_nodes;
>>       /* gpu list in the same hive */
>> @@ -101,6 +102,9 @@ struct amdgpu_xgmi {
>>       uint8_t max_width;
>>   };
>>
>> +#define amdgpu_xmgi_is_node_changed(adev) \
>> +    (adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)
>
> Please drop that function and the related checks.
>
> If this is necessary we want to update the relevant parameters all the time and not just when something changed.


Hi @Lazar, Lijo <mailto:Lijo.Lazar@amd.com> and @Koenig, Christian
<mailto:Christian.Koenig@amd.com>, you have different opinions on whether
the check is needed. Could you align on this? Thank you!

Regards
Sam


>
> Regards,
> Christian.
>
>> +
>>   struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>>   void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);
>>   int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index 59385da80185..7c0ca2721eb3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
>>       struct amdgpu_device *adev = ip_block->adev;
>>       int r;
>>
>> +    if (amdgpu_xmgi_is_node_changed(adev)) {
>> +            adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
>> +            adev->vm_manager.vram_base_offset +=
>> +                    adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
>> +    }
>> +
>>       /* If a reset is done for NPS mode switch, read the memory range
>>        * information again.
>>        */



* Re: [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume
  2025-05-08  8:12   ` Lazar, Lijo
  2025-05-08 10:03     ` Zhang, GuoQing (Sam)
@ 2025-05-08 10:56     ` Christian König
  2025-05-09  4:22       ` Zhang, GuoQing (Sam)
  1 sibling, 1 reply; 14+ messages in thread
From: Christian König @ 2025-05-08 10:56 UTC (permalink / raw)
  To: Lazar, Lijo, Samuel Zhang, amd-gfx
  Cc: victor.zhao, haijun.chang, Alexander.Deucher, Owen.Zhang2,
	Qing.Ma, Jiang Liu

On 5/8/25 10:12, Lazar, Lijo wrote:
> 
> 
> On 5/8/2025 10:39 AM, Samuel Zhang wrote:
>> For a virtual machine with vGPUs in SRIOV single device mode and XGMI
>> enabled, XGMI physical node ids may change when waking up from
>> hibernation with different vGPU devices. So update XGMI info on resume.
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++++
>>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
>>  3 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d477a901af84..843a3b0a9a07 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>  		r = adev->gfxhub.funcs->get_xgmi_info(adev);
>>  		if (r)
>>  			return r;
>> +		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>>  	}
>>  
>>  	/* enable PCIE atomic ops */
>> @@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>>  	return 0;
>>  }
>>  
>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
>> +{
>> +	int r;
>> +
>> +	/* Get xgmi info again for sriov to detect device changes */
>> +	if (amdgpu_sriov_vf(adev) &&
>> +	    !(adev->flags & AMD_IS_APU) &&
>> +	    adev->gmc.xgmi.supported &&
>> +	    !adev->gmc.xgmi.connected_to_cpu) {
>> +		adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>> +		r = adev->gfxhub.funcs->get_xgmi_info(adev);
>> +		if (r)
>> +			return r;
>> +
>> +		dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
>> +			adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
>> +	}
>> +	return 0;
>> +}
>> +
>>  /**
>>   * amdgpu_device_resume - initiate device resume
>>   *
>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>  		r = amdgpu_virt_request_full_gpu(adev, true);
>>  		if (r)
>>  			return r;
>> +		r = amdgpu_device_update_xgmi_info(adev);
>> +		if (r)
>> +			return r;
>>  	}
>>  
>>  	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> index 32dabba4062f..1387901576f1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> @@ -89,6 +89,7 @@ struct amdgpu_xgmi {
>>  	u64 node_segment_size;
>>  	/* physical node (0-3) */
>>  	unsigned physical_node_id;
>> +	unsigned prev_physical_node_id;
>>  	/* number of nodes (0-4) */
>>  	unsigned num_physical_nodes;
>>  	/* gpu list in the same hive */
>> @@ -101,6 +102,9 @@ struct amdgpu_xgmi {
>>  	uint8_t max_width;
>>  };
>>  
>> +#define amdgpu_xmgi_is_node_changed(adev) \
> 
> Typo - xgmi
> 
>> +	(adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)
> 
> Since prev_physical_node_id is updated only for VF, the check should be
> here as well.
> 
> Otherwise, you may have something like below in
> amdgpu_device_update_xgmi_info()
> 
> amdgpu_xgmi.node_changed = false;
> if (check_condition) {
> 	prev_node = adev->gmc.xgmi.physical_node_id;
> 	adev->gfxhub.funcs->get_xgmi_info(adev)
> 	amdgpu_xgmi.node_changed = (prev_node != adev->gmc.xgmi.physical_node_id);
> }
> 
> To make it clearer -
> 
> Would still prefer to wrap under amdgpu_virt_migration_xyz() to make it
> clear that this is done for node migration.


Yeah, that is a rather good idea as well.

And we should *always* execute that and not just when the physical node id changes.

Otherwise we won't always test the code path and potentially break it at some point.

Regards,
Christian.

> 
> Ex:
> 
> bool amdgpu_virt_migration_detected()
> {
> 	return amdgpu_xgmi.node_changed; // And any other combination checks
> which could come up in the future.
> }
> 
> The check needs to be done for any further changes down the series like
> 
> if (amdgpu_virt_migration_detected())
> 	psp_update_gpu_addresses();
> 
> Thanks,
> Lijo
> 
>> +
>>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>>  void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);
>>  int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index 59385da80185..7c0ca2721eb3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
>>  	struct amdgpu_device *adev = ip_block->adev;
>>  	int r;
>>  
>> +	if (amdgpu_xmgi_is_node_changed(adev)) {
>> +		adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
>> +		adev->vm_manager.vram_base_offset +=
>> +			adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
>> +	}
>> +
>>  	/* If a reset is done for NPS mode switch, read the memory range
>>  	 * information again.
>>  	 */
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume
  2025-05-08 10:56     ` Christian König
@ 2025-05-09  4:22       ` Zhang, GuoQing (Sam)
  0 siblings, 0 replies; 14+ messages in thread
From: Zhang, GuoQing (Sam) @ 2025-05-09  4:22 UTC (permalink / raw)
  To: Koenig, Christian, Lazar, Lijo, Zhang, GuoQing (Sam),
	amd-gfx@lists.freedesktop.org
  Cc: Zhao, Victor, Chang, HaiJun, Deucher, Alexander,
	Zhang, Owen(SRDC), Ma, Qing (Mark), Jiang Liu



On 2025/5/8 18:56, Christian König wrote:
> On 5/8/25 10:12, Lazar, Lijo wrote:
>>
>> On 5/8/2025 10:39 AM, Samuel Zhang wrote:
>>> For virtual machine with vGPUs in SRIOV single device mode and XGMI
>>> is enabled, XGMI physical node ids may change when waking up from
>>> hibernation with different vGPU devices. So update XGMI info on resume.
>>>
>>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  4 ++++
>>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 ++++++
>>>   3 files changed, 34 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index d477a901af84..843a3b0a9a07 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>              r = adev->gfxhub.funcs->get_xgmi_info(adev);
>>>              if (r)
>>>                      return r;
>>> +           adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>>>      }
>>>
>>>      /* enable PCIE atomic ops */
>>> @@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>>>      return 0;
>>>   }
>>>
>>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
>>> +{
>>> +   int r;
>>> +
>>> +   /* Get xgmi info again for sriov to detect device changes */
>>> +   if (amdgpu_sriov_vf(adev) &&
>>> +       !(adev->flags & AMD_IS_APU) &&
>>> +       adev->gmc.xgmi.supported &&
>>> +       !adev->gmc.xgmi.connected_to_cpu) {
>>> +           adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>>> +           r = adev->gfxhub.funcs->get_xgmi_info(adev);
>>> +           if (r)
>>> +                   return r;
>>> +
>>> +           dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
>>> +                   adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);
>>> +   }
>>> +   return 0;
>>> +}
>>> +
>>>   /**
>>>    * amdgpu_device_resume - initiate device resume
>>>    *
>>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>>              r = amdgpu_virt_request_full_gpu(adev, true);
>>>              if (r)
>>>                      return r;
>>> +           r = amdgpu_device_update_xgmi_info(adev);
>>> +           if (r)
>>> +                   return r;
>>>      }
>>>
>>>      if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>>> index 32dabba4062f..1387901576f1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>>> @@ -89,6 +89,7 @@ struct amdgpu_xgmi {
>>>      u64 node_segment_size;
>>>      /* physical node (0-3) */
>>>      unsigned physical_node_id;
>>> +   unsigned prev_physical_node_id;
>>>      /* number of nodes (0-4) */
>>>      unsigned num_physical_nodes;
>>>      /* gpu list in the same hive */
>>> @@ -101,6 +102,9 @@ struct amdgpu_xgmi {
>>>      uint8_t max_width;
>>>   };
>>>
>>> +#define amdgpu_xmgi_is_node_changed(adev) \
>> Typo - xgmi
>>
>>> +   (adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)
>> Since prev_physical_node_id is updated only for VF, the check should be
>> here as well.
>>
>> Otherwise, you may have something like below in
>> amdgpu_device_update_xgmi_info()
>>
>> amdgpu_xgmi.node_changed = false;
>> if (check_condition) {
>>       prev_node = adev->gmc.xgmi.physical_node_id;
>>       adev->gfxhub.funcs->get_xgmi_info(adev)
>>       amdgpu_xgmi.node_changed = (prev_node != adev->gmc.xgmi.physical_node_id);
>> }
>>
>> To make it clearer -
>>
>> Would still prefer to wrap under amdgpu_virt_migration_xyz() to make it
>> clear that this is done for node migration.
>
> Yeah, that is a rather good idea as well.
>
> And we should *always* execute that and not just when the physical node id changes.
>
> Otherwise we won't always test the code path and potentially break it at some point.


Hi @Koenig, Christian <mailto:Christian.Koenig@amd.com>, thank you for
the explanation.

Hi @Lazar, Lijo <mailto:Lijo.Lazar@amd.com>, are you OK with removing
the check? I can do a reset test using quark in a VM to ensure no
regression is introduced. Please comment if you still have a different
opinion. Thank you!

Regards
Sam


> Regards,
> Christian.
>
>> Ex:
>>
>> bool amdgpu_virt_migration_detected()
>> {
>>       return amdgpu_xgmi.node_changed; // And any other combination checks
>> which could come up in the future.
>> }
>>
>> The check needs to be done for any further changes down the series like
>>
>> if (amdgpu_virt_migration_detected())
>>       psp_update_gpu_addresses();
>>
>> Thanks,
>> Lijo
>>
>>> +
>>>   struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>>>   void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);
>>>   int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> index 59385da80185..7c0ca2721eb3 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> @@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)
>>>      struct amdgpu_device *adev = ip_block->adev;
>>>      int r;
>>>
>>> +   if (amdgpu_xmgi_is_node_changed(adev)) {
>>> +           adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);
>>> +           adev->vm_manager.vram_base_offset +=
>>> +                   adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;
>>> +   }
>>> +
>>>      /* If a reset is done for NPS mode switch, read the memory range
>>>       * information again.
>>>       */



end of thread, other threads:[~2025-05-09  4:22 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-08  5:09 [PATCH v4 0/7] enable switching to new gpu index for hibernate on SRIOV Samuel Zhang
2025-05-08  5:09 ` [PATCH v4 1/7] drm/amdgpu: update XGMI info on resume Samuel Zhang
2025-05-08  8:12   ` Lazar, Lijo
2025-05-08 10:03     ` Zhang, GuoQing (Sam)
2025-05-08 10:56     ` Christian König
2025-05-09  4:22       ` Zhang, GuoQing (Sam)
2025-05-08  9:27   ` Christian König
2025-05-08 10:03     ` Zhang, GuoQing (Sam)
2025-05-08  5:09 ` [PATCH v4 2/7] drm/amdgpu: update GPU addresses for SMU and PSP Samuel Zhang
2025-05-08  5:09 ` [PATCH v4 3/7] drm/amdgpu: enable pdb0 for hibernation on SRIOV Samuel Zhang
2025-05-08  5:09 ` [PATCH v4 4/7] drm/amdgpu: remove cached gpu addr: amdgpu_firmware.fw_buf_mc Samuel Zhang
2025-05-08  5:09 ` [PATCH v4 5/7] drm/amdgpu: remove cached gpu addr: ta_mem_context.shared_mc_addr Samuel Zhang
2025-05-08  5:09 ` [PATCH v4 6/7] drm/amdgpu: remove cached gpu addr: psp_context.tmr_mc_addr Samuel Zhang
2025-05-08  5:09 ` [PATCH v4 7/7] drm/amdgpu: remove cached gpu addr: psp_context.cmd_buf_mc_addr Samuel Zhang
