[PATCH 0/3] reduce system memory requirement for hibernation

Linux Power Management development
 help / color / mirror / Atom feed

* [PATCH 0/3] reduce system memory requirement for hibernation
@ 2025-06-30 10:41 Samuel Zhang
  2025-06-30 10:41 ` [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction " Samuel Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Samuel Zhang @ 2025-06-30 10:41 UTC (permalink / raw)
  To: rafael, len.brown, pavel, alexander.deucher, christian.koenig,
	mario.limonciello, lijo.lazar
  Cc: victor.zhao, haijun.chang, Qing.Ma, amd-gfx, dri-devel, linux-pm,
	linux-kernel, Samuel Zhang

Modern data center dGPUs are usually equipped with very large VRAM. On
server with such dGPUs(192GB VRAM * 8) and 2TB system memory, hibernate
will fail due to no enough free memory.

The root cause is that during hibernation all VRAM memory get evicted to
GTT or shmem. In both case, it is in system memory and kernel will try to 
copy the pages to hibernation image. In the worst case, this causes 2 
copies of VRAM memory in system memory, 2TB is not enough for the 
hibernation image. 192GB * 8 * 2 = 3TB > 2TB.

The fix includes following 2 changes. With 2 changes, there's much less 
pages needed to be copied to hibernate image and hibernation can succeed.
1. move GTT to shmem after evicting VRAM. then the GTT pages can be freed.
2. force write shmem pages to swap disk and free shmem pages.

After swapout GTT to shmem in hibernation prepare stage, swapin and 
restore BOs in thaw stage takes lots of time(50 mintues observed for 
8 dGPUs). And it's not necessary since the follow-up hibernate stages do 
not use GPU for hibernation successful case. The third patch is just skip 
the BOs restore in thaw stage to reduce the hibernation time.

Samuel Zhang (3):
  drm/amdgpu: move GTT to SHM after eviction for hibernation
  PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
  drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    | 13 ++++++++++++-
 drivers/gpu/drm/ttm/ttm_resource.c         | 18 ++++++++++++++++++
 include/drm/ttm/ttm_resource.h             |  1 +
 kernel/power/hibernate.c                   | 13 +++++++++++++
 6 files changed, 47 insertions(+), 2 deletions(-)

-- 
2.43.5

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
  2025-06-30 10:41 [PATCH 0/3] reduce system memory requirement for hibernation Samuel Zhang
@ 2025-06-30 10:41 ` Samuel Zhang
  2025-06-30 11:54   ` Christian König
  2025-06-30 10:41 ` [PATCH 2/3] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
  2025-06-30 10:41 ` [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw() Samuel Zhang
  2 siblings, 1 reply; 16+ messages in thread
From: Samuel Zhang @ 2025-06-30 10:41 UTC (permalink / raw)
  To: rafael, len.brown, pavel, alexander.deucher, christian.koenig,
	mario.limonciello, lijo.lazar
  Cc: victor.zhao, haijun.chang, Qing.Ma, amd-gfx, dri-devel, linux-pm,
	linux-kernel, Samuel Zhang

When hibernate with data center dGPUs, huge number of VRAM BOs evicted
to GTT and takes too much system memory. This will cause hibernation
fail due to insufficient memory for creating the hibernation image.

Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
hibernation code to make room for hibernation image.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
 drivers/gpu/drm/ttm/ttm_resource.c      | 18 ++++++++++++++++++
 include/drm/ttm/ttm_resource.h          |  1 +
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 4d57269c9ca8..5aede907a591 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -2889,6 +2889,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
 int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
 {
 	struct ttm_resource_manager *man;
+	int r;
 
 	switch (mem_type) {
 	case TTM_PL_VRAM:
@@ -2903,7 +2904,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
 		return -EINVAL;
 	}
 
-	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
+	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
+	if (r) {
+		DRM_ERROR("Failed to evict memory type %d\n", mem_type);
+		return r;
+	}
+	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
+		r = ttm_resource_manager_swapout();
+		if (r)
+			DRM_ERROR("Failed to swap out, %d\n", r);
+	}
+	return r;
 }
 
 #if defined(CONFIG_DEBUG_FS)
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index fd41b56e2c66..07b1f5a5afc2 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -534,6 +534,24 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
 }
 EXPORT_SYMBOL(ttm_resource_manager_init);
 
+int ttm_resource_manager_swapout(void)
+{
+	struct ttm_operation_ctx ctx = {
+		.interruptible = false,
+		.no_wait_gpu = false,
+		.force_alloc = true
+	};
+	int ret;
+
+	while (true) {
+		ret = ttm_global_swapout(&ctx, GFP_KERNEL);
+		if (ret <= 0)
+			break;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(ttm_resource_manager_swapout);
+
 /*
  * ttm_resource_manager_evict_all
  *
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index b873be9597e2..46181758068e 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -463,6 +463,7 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
 
 int ttm_resource_manager_evict_all(struct ttm_device *bdev,
 				   struct ttm_resource_manager *man);
+int ttm_resource_manager_swapout(void);
 
 uint64_t ttm_resource_manager_usage(struct ttm_resource_manager *man);
 void ttm_resource_manager_debug(struct ttm_resource_manager *man,
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/3] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
  2025-06-30 10:41 [PATCH 0/3] reduce system memory requirement for hibernation Samuel Zhang
  2025-06-30 10:41 ` [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction " Samuel Zhang
@ 2025-06-30 10:41 ` Samuel Zhang
  2025-06-30 20:21   ` Rafael J. Wysocki
  2025-06-30 10:41 ` [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw() Samuel Zhang
  2 siblings, 1 reply; 16+ messages in thread
From: Samuel Zhang @ 2025-06-30 10:41 UTC (permalink / raw)
  To: rafael, len.brown, pavel, alexander.deucher, christian.koenig,
	mario.limonciello, lijo.lazar
  Cc: victor.zhao, haijun.chang, Qing.Ma, amd-gfx, dri-devel, linux-pm,
	linux-kernel, Samuel Zhang

When hibernate with data center dGPUs, huge number of VRAM data will be
moved to shmem during dev_pm_ops.prepare(). These shmem pages take a lot
of system memory so that there's no enough free memory for creating the
hibernation image. This will cause hibernation fail and abort.

After dev_pm_ops.prepare(), call shrink_all_memory() to force move shmem
pages to swap disk and reclaim the pages, so that there's enough system
memory for hibernation image and less pages needed to copy to the image.

This patch can only flush and free about half shmem pages. It will be
better to flush and free more pages, even all of shmem pages, so that
there're less pages to be copied to the hibernation image and the overall
hibernation time can be reduced.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 kernel/power/hibernate.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 10a01af63a80..913a298c1d01 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -370,6 +370,17 @@ static int create_image(int platform_mode)
 	return error;
 }
 
+static void shrink_shmem_memory(void)
+{
+	struct sysinfo info;
+	unsigned long pages, freed;
+
+	si_meminfo(&info);
+	pages = info.sharedram;
+	freed = shrink_all_memory(pages);
+	pr_debug("requested to reclaim %lu pages, freed %lu pages\n", pages, freed);
+}
+
 /**
  * hibernation_snapshot - Quiesce devices and create a hibernation image.
  * @platform_mode: If set, use platform driver to prepare for the transition.
@@ -411,6 +422,8 @@ int hibernation_snapshot(int platform_mode)
 		goto Thaw;
 	}
 
+	shrink_shmem_memory();
+
 	suspend_console();
 	pm_restrict_gfp_mask();
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
  2025-06-30 10:41 [PATCH 0/3] reduce system memory requirement for hibernation Samuel Zhang
  2025-06-30 10:41 ` [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction " Samuel Zhang
  2025-06-30 10:41 ` [PATCH 2/3] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
@ 2025-06-30 10:41 ` Samuel Zhang
  2025-06-30 11:58   ` Christian König
  2 siblings, 1 reply; 16+ messages in thread
From: Samuel Zhang @ 2025-06-30 10:41 UTC (permalink / raw)
  To: rafael, len.brown, pavel, alexander.deucher, christian.koenig,
	mario.limonciello, lijo.lazar
  Cc: victor.zhao, haijun.chang, Qing.Ma, amd-gfx, dri-devel, linux-pm,
	linux-kernel, Samuel Zhang

The hibernation successful workflow:
- prepare: evict VRAM and swapout GTT BOs
- freeze
- create the hibernation image in system memory
- thaw: swapin and restore BOs
- complete
- write hibernation image to disk
- amdgpu_pci_shutdown
- goto S5, turn off the system.

During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
shmem. Then in thaw stage, all BOs will be swapin and restored.

On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
the swapin and restore BOs takes too long (50 minutes) and it is not
necessary since the follow-up stages does not use GPU.

This patch is to skip BOs restore during thaw to reduce the hibernation
time.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a8f4697deb1b..b550d07190a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 		amdgpu_virt_init_data_exchange(adev);
 		amdgpu_virt_release_full_gpu(adev, true);
 
-		if (!adev->in_s0ix && !r && !adev->in_runpm)
+		if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
 			r = amdgpu_amdkfd_resume_process(adev);
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 571b70da4562..23b76e8ac2fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
 static int amdgpu_pmops_restore(struct device *dev)
 {
 	struct drm_device *drm_dev = dev_get_drvdata(dev);
+	struct amdgpu_device *adev = drm_to_adev(drm_dev);
 
+	adev->in_s4 = false;
 	return amdgpu_device_resume(drm_dev, true);
 }
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
  2025-06-30 10:41 ` [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction " Samuel Zhang
@ 2025-06-30 11:54   ` Christian König
       [not found]     ` <DM4PR12MB5937FFB3E121E489A261785DE541A@DM4PR12MB5937.namprd12.prod.outlook.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Christian König @ 2025-06-30 11:54 UTC (permalink / raw)
  To: Samuel Zhang, rafael, len.brown, pavel, alexander.deucher,
	mario.limonciello, lijo.lazar
  Cc: victor.zhao, haijun.chang, Qing.Ma, amd-gfx, dri-devel, linux-pm,
	linux-kernel

On 30.06.25 12:41, Samuel Zhang wrote:
> When hibernate with data center dGPUs, huge number of VRAM BOs evicted
> to GTT and takes too much system memory. This will cause hibernation
> fail due to insufficient memory for creating the hibernation image.
> 
> Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
> hibernation code to make room for hibernation image.

This should probably be two patches, one for TTM and then an amdgpu patch to forward the event.

> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>  drivers/gpu/drm/ttm/ttm_resource.c      | 18 ++++++++++++++++++
>  include/drm/ttm/ttm_resource.h          |  1 +
>  3 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 4d57269c9ca8..5aede907a591 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -2889,6 +2889,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>  int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  {
>  	struct ttm_resource_manager *man;
> +	int r;
>  
>  	switch (mem_type) {
>  	case TTM_PL_VRAM:
> @@ -2903,7 +2904,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  		return -EINVAL;
>  	}
>  
> -	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	if (r) {
> +		DRM_ERROR("Failed to evict memory type %d\n", mem_type);
> +		return r;
> +	}
> +	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
> +		r = ttm_resource_manager_swapout();
> +		if (r)
> +			DRM_ERROR("Failed to swap out, %d\n", r);
> +	}
> +	return r;
>  }
>  
>  #if defined(CONFIG_DEBUG_FS)
> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
> index fd41b56e2c66..07b1f5a5afc2 100644
> --- a/drivers/gpu/drm/ttm/ttm_resource.c
> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> @@ -534,6 +534,24 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>  }
>  EXPORT_SYMBOL(ttm_resource_manager_init);
>  
> +int ttm_resource_manager_swapout(void)

This needs documentation, better placement and a better name.

First of all put it into ttm_device.c instead of the resource manager.

Then call it something like ttm_device_prepare_hibernation or similar.


> +{
> +	struct ttm_operation_ctx ctx = {
> +		.interruptible = false,
> +		.no_wait_gpu = false,
> +		.force_alloc = true
> +	};
> +	int ret;
> +
> +	while (true) {

Make that:

do {
	ret = ...
} while (ret > 0);

> +		ret = ttm_global_swapout(&ctx, GFP_KERNEL);
> +		if (ret <= 0)
> +			break;
> +	}
> +	return ret;

It's rather pointless to return the number of swapped out pages.

Make that "return ret < 0 ? ret : 0;

Regards,
Christian.

> +}
> +EXPORT_SYMBOL(ttm_resource_manager_swapout);
> +
>  /*
>   * ttm_resource_manager_evict_all
>   *
> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
> index b873be9597e2..46181758068e 100644
> --- a/include/drm/ttm/ttm_resource.h
> +++ b/include/drm/ttm/ttm_resource.h
> @@ -463,6 +463,7 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>  
>  int ttm_resource_manager_evict_all(struct ttm_device *bdev,
>  				   struct ttm_resource_manager *man);
> +int ttm_resource_manager_swapout(void);
>  
>  uint64_t ttm_resource_manager_usage(struct ttm_resource_manager *man);
>  void ttm_resource_manager_debug(struct ttm_resource_manager *man,


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
  2025-06-30 10:41 ` [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw() Samuel Zhang
@ 2025-06-30 11:58   ` Christian König
       [not found]     ` <8eb1700d-4d60-4a1e-9d09-718f65baaf1e@amd.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Christian König @ 2025-06-30 11:58 UTC (permalink / raw)
  To: Samuel Zhang, rafael, len.brown, pavel, alexander.deucher,
	mario.limonciello, lijo.lazar
  Cc: victor.zhao, haijun.chang, Qing.Ma, amd-gfx, dri-devel, linux-pm,
	linux-kernel

On 30.06.25 12:41, Samuel Zhang wrote:
> The hibernation successful workflow:
> - prepare: evict VRAM and swapout GTT BOs
> - freeze
> - create the hibernation image in system memory
> - thaw: swapin and restore BOs

Why should a thaw happen here in between?

> - complete
> - write hibernation image to disk
> - amdgpu_pci_shutdown
> - goto S5, turn off the system.
> 
> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
> shmem. Then in thaw stage, all BOs will be swapin and restored.

That's not correct. This is done by the application starting again and not during thaw.

> 
> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
> the swapin and restore BOs takes too long (50 minutes) and it is not
> necessary since the follow-up stages does not use GPU.
> 
> This patch is to skip BOs restore during thaw to reduce the hibernation
> time.

As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.

Regards,
Christian.

> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a8f4697deb1b..b550d07190a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>  		amdgpu_virt_init_data_exchange(adev);
>  		amdgpu_virt_release_full_gpu(adev, true);
>  
> -		if (!adev->in_s0ix && !r && !adev->in_runpm)
> +		if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
>  			r = amdgpu_amdkfd_resume_process(adev);
>  	}
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 571b70da4562..23b76e8ac2fd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
>  static int amdgpu_pmops_restore(struct device *dev)
>  {
>  	struct drm_device *drm_dev = dev_get_drvdata(dev);
> +	struct amdgpu_device *adev = drm_to_adev(drm_dev);
>  
> +	adev->in_s4 = false;
>  	return amdgpu_device_resume(drm_dev, true);
>  }
>  


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
  2025-06-30 10:41 ` [PATCH 2/3] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
@ 2025-06-30 20:21   ` Rafael J. Wysocki
  0 siblings, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2025-06-30 20:21 UTC (permalink / raw)
  To: Samuel Zhang
  Cc: rafael, len.brown, pavel, alexander.deucher, christian.koenig,
	mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	amd-gfx, dri-devel, linux-pm, linux-kernel

On Mon, Jun 30, 2025 at 12:41 PM Samuel Zhang <guoqing.zhang@amd.com> wrote:
>
> When hibernate with data center dGPUs, huge number of VRAM data will be
> moved to shmem during dev_pm_ops.prepare(). These shmem pages take a lot
> of system memory so that there's no enough free memory for creating the
> hibernation image. This will cause hibernation fail and abort.
>
> After dev_pm_ops.prepare(), call shrink_all_memory() to force move shmem
> pages to swap disk and reclaim the pages, so that there's enough system
> memory for hibernation image and less pages needed to copy to the image.
>
> This patch can only flush and free about half shmem pages. It will be
> better to flush and free more pages, even all of shmem pages, so that
> there're less pages to be copied to the hibernation image and the overall
> hibernation time can be reduced.
>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  kernel/power/hibernate.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 10a01af63a80..913a298c1d01 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -370,6 +370,17 @@ static int create_image(int platform_mode)
>         return error;
>  }
>
> +static void shrink_shmem_memory(void)
> +{
> +       struct sysinfo info;
> +       unsigned long pages, freed;
> +

Please add a comment explaining what is going on here.

> +       si_meminfo(&info);
> +       pages = info.sharedram;
> +       freed = shrink_all_memory(pages);
> +       pr_debug("requested to reclaim %lu pages, freed %lu pages\n", pages, freed);

This message will be hard to decode without any context.

> +}
> +
>  /**
>   * hibernation_snapshot - Quiesce devices and create a hibernation image.
>   * @platform_mode: If set, use platform driver to prepare for the transition.
> @@ -411,6 +422,8 @@ int hibernation_snapshot(int platform_mode)
>                 goto Thaw;
>         }
>

Please add a comment explaining why the below is done.

> +       shrink_shmem_memory();
> +
>         suspend_console();
>         pm_restrict_gfp_mask();
>
> --
> 2.43.5
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
       [not found]     ` <DM4PR12MB5937FFB3E121E489A261785DE541A@DM4PR12MB5937.namprd12.prod.outlook.com>
@ 2025-07-01  8:22       ` Christian König
  2025-07-02  7:28         ` Samuel Zhang
  0 siblings, 1 reply; 16+ messages in thread
From: Christian König @ 2025-07-01  8:22 UTC (permalink / raw)
  To: Zhang, GuoQing (Sam), rafael@kernel.org, len.brown@intel.com,
	pavel@kernel.org, Deucher, Alexander, Limonciello, Mario,
	Lazar, Lijo
  Cc: Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org

On 01.07.25 10:18, Zhang, GuoQing (Sam) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> 
> Hi Christian,
> 
>  
> 
> Thank you for the feedback.
> 
>  
> 
> For “return ret < 0 ? ret : 0;”, it is equivalent to “return ret;” since ret is always <= 0 after the loop.

No it isn't.

ttm_global_swapout() returns the number of pages swapped out and only a negative error code if something went wrong.

And it's probably not a good idea to return that from the new function.

Regards,
Christian.

> 
>  
> 
> For all other comments, I will revise the patch accordingly in v2.
> 
>  
> 
> Regards
> 
> Sam
> 
>  
> 
>  
> 
> *From: *Koenig, Christian <Christian.Koenig@amd.com>
> *Date: *Monday, June 30, 2025 at 19:54
> *To: *Zhang, GuoQing (Sam) <GuoQing.Zhang@amd.com>, rafael@kernel.org <rafael@kernel.org>, len.brown@intel.com <len.brown@intel.com>, pavel@kernel.org <pavel@kernel.org>, Deucher, Alexander <Alexander.Deucher@amd.com>, Limonciello, Mario <Mario.Limonciello@amd.com>, Lazar, Lijo <Lijo.Lazar@amd.com>
> *Cc: *Zhao, Victor <Victor.Zhao@amd.com>, Chang, HaiJun <HaiJun.Chang@amd.com>, Ma, Qing (Mark) <Qing.Ma@amd.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>, dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>, linux-pm@vger.kernel.org <linux-pm@vger.kernel.org>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> *Subject: *Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
> 
> On 30.06.25 12:41, Samuel Zhang wrote:
>> When hibernate with data center dGPUs, huge number of VRAM BOs evicted
>> to GTT and takes too much system memory. This will cause hibernation
>> fail due to insufficient memory for creating the hibernation image.
>> 
>> Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
>> hibernation code to make room for hibernation image.
> 
> This should probably be two patches, one for TTM and then an amdgpu patch to forward the event.
> 
>> 
>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>>  drivers/gpu/drm/ttm/ttm_resource.c      | 18 ++++++++++++++++++
>>  include/drm/ttm/ttm_resource.h          |  1 +
>>  3 files changed, 31 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 4d57269c9ca8..5aede907a591 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -2889,6 +2889,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>  int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>>  {
>>        struct ttm_resource_manager *man;
>> +     int r;
>>  
>>        switch (mem_type) {
>>        case TTM_PL_VRAM:
>> @@ -2903,7 +2904,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>>                return -EINVAL;
>>        }
>>  
>> -     return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
>> +     r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
>> +     if (r) {
>> +             DRM_ERROR("Failed to evict memory type %d\n", mem_type);
>> +             return r;
>> +     }
>> +     if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
>> +             r = ttm_resource_manager_swapout();
>> +             if (r)
>> +                     DRM_ERROR("Failed to swap out, %d\n", r);
>> +     }
>> +     return r;
>>  }
>>  
>>  #if defined(CONFIG_DEBUG_FS)
>> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
>> index fd41b56e2c66..07b1f5a5afc2 100644
>> --- a/drivers/gpu/drm/ttm/ttm_resource.c
>> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
>> @@ -534,6 +534,24 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>>  }
>>  EXPORT_SYMBOL(ttm_resource_manager_init);
>>  
>> +int ttm_resource_manager_swapout(void)
> 
> This needs documentation, better placement and a better name.
> 
> First of all put it into ttm_device.c instead of the resource manager.
> 
> Then call it something like ttm_device_prepare_hibernation or similar.
> 
> 
>> +{
>> +     struct ttm_operation_ctx ctx = {
>> +             .interruptible = false,
>> +             .no_wait_gpu = false,
>> +             .force_alloc = true
>> +     };
>> +     int ret;
>> +
>> +     while (true) {
> 
> Make that:
> 
> do {
>         ret = ...
> } while (ret > 0);
> 
>> +             ret = ttm_global_swapout(&ctx, GFP_KERNEL);
>> +             if (ret <= 0)
>> +                     break;
>> +     }
>> +     return ret;
> 
> It's rather pointless to return the number of swapped out pages.
> 
> Make that "return ret < 0 ? ret : 0;
> 
> Regards,
> Christian.
> 
>> +}
>> +EXPORT_SYMBOL(ttm_resource_manager_swapout);
>> +
>>  /*
>>   * ttm_resource_manager_evict_all
>>   *
>> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
>> index b873be9597e2..46181758068e 100644
>> --- a/include/drm/ttm/ttm_resource.h
>> +++ b/include/drm/ttm/ttm_resource.h
>> @@ -463,6 +463,7 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>>  
>>  int ttm_resource_manager_evict_all(struct ttm_device *bdev,
>>                                   struct ttm_resource_manager *man);
>> +int ttm_resource_manager_swapout(void);
>>  
>>  uint64_t ttm_resource_manager_usage(struct ttm_resource_manager *man);
>>  void ttm_resource_manager_debug(struct ttm_resource_manager *man,
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
       [not found]     ` <8eb1700d-4d60-4a1e-9d09-718f65baaf1e@amd.com>
@ 2025-07-01  8:32       ` Christian König
  2025-07-01 16:07         ` Alex Deucher
  0 siblings, 1 reply; 16+ messages in thread
From: Christian König @ 2025-07-01  8:32 UTC (permalink / raw)
  To: Zhang, GuoQing (Sam), rafael@kernel.org, len.brown@intel.com,
	pavel@kernel.org, Deucher, Alexander, Limonciello, Mario,
	Lazar, Lijo
  Cc: Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org

On 01.07.25 10:03, Zhang, GuoQing (Sam) wrote:
> 
> thaw() is called before writing the hiberation image to swap disk. See
> the doc here.
> https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552 <https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552>
> 
> And amdgpu implemented thaw() callback by calling amdgpu_device_resume().
> https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572 <https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572>
> 
> This patch is skip amdgpu_amdkfd_resume_process() call in thaw() during
> hibernation. it is not skipped in restore() during resume from
> hibernation when system boot again.
> 
> 
> I just found the following kernel doc. Thaw() is intended to resume the
> storage device for saving the hibernation image.

Ah, that makes much more sense.

> Our GPU is not involved
> in it, it is not necessary to resume our GPU in thaw().
> https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588 <https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588>
> 
> So another implementation is to remove the amdgpu_device_resume() call
> in amdgpu_pmops_thaw(), and skip amdgpu_device_ip_suspend() call in
> amdgpu_pci_shutdown()for hibernation.
> Initial tests show it's working fine for hibernation successful case.
> Should I switch to this implementation?

No idea. Alex and the KFD guys need to take a look at that.

> But thaw() is also called to restore the GPU when hibernation is aborted
> due to some error in hibernation image creation stage. In this case,
> amdgpu_device_resume() is needed in thaw().
> 
> So I need a method to check if hibernation is aborted or not to
> conditionally skip amdgpu_device_resume() in thaw(). Currently I don't
> know how to do this.

Yeah that approach here looks fishy to me, but I don't know how to properly fix it either.

@Alex any idea?

Regards,
Christian.

> 
> 
> Regards
> Sam
> 
> 
> On 2025/6/30 19:58, Christian König wrote:
>> On 30.06.25 12:41, Samuel Zhang wrote:
>>> The hibernation successful workflow:
>>> - prepare: evict VRAM and swapout GTT BOs
>>> - freeze
>>> - create the hibernation image in system memory
>>> - thaw: swapin and restore BOs
>> Why should a thaw happen here in between?
>>
>>> - complete
>>> - write hibernation image to disk
>>> - amdgpu_pci_shutdown
>>> - goto S5, turn off the system.
>>>
>>> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
>>> shmem. Then in thaw stage, all BOs will be swapin and restored.
>> That's not correct. This is done by the application starting again and not during thaw.
>>
>>> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
>>> the swapin and restore BOs takes too long (50 minutes) and it is not
>>> necessary since the follow-up stages does not use GPU.
>>>
>>> This patch is to skip BOs restore during thaw to reduce the hibernation
>>> time.
>> As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
>>>   2 files changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index a8f4697deb1b..b550d07190a2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>>               amdgpu_virt_init_data_exchange(adev);
>>>               amdgpu_virt_release_full_gpu(adev, true);
>>>   
>>> -            if (!adev->in_s0ix && !r && !adev->in_runpm)
>>> +            if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
>>>                       r = amdgpu_amdkfd_resume_process(adev);
>>>       }
>>>   
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 571b70da4562..23b76e8ac2fd 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
>>>   static int amdgpu_pmops_restore(struct device *dev)
>>>   {
>>>       struct drm_device *drm_dev = dev_get_drvdata(dev);
>>> +    struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>   
>>> +    adev->in_s4 = false;
>>>       return amdgpu_device_resume(drm_dev, true);
>>>   }
>>>   
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
  2025-07-01  8:32       ` Christian König
@ 2025-07-01 16:07         ` Alex Deucher
  2025-07-02  7:23           ` Sam
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Deucher @ 2025-07-01 16:07 UTC (permalink / raw)
  To: Christian König
  Cc: Zhang, GuoQing (Sam), rafael@kernel.org, len.brown@intel.com,
	pavel@kernel.org, Deucher, Alexander, Limonciello, Mario,
	Lazar, Lijo, Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Jul 1, 2025 at 4:32 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 01.07.25 10:03, Zhang, GuoQing (Sam) wrote:
> >
> > thaw() is called before writing the hiberation image to swap disk. See
> > the doc here.
> > https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552 <https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552>
> >
> > And amdgpu implemented thaw() callback by calling amdgpu_device_resume().
> > https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572 <https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572>
> >
> > This patch is skip amdgpu_amdkfd_resume_process() call in thaw() during
> > hibernation. it is not skipped in restore() during resume from
> > hibernation when system boot again.
> >
> >
> > I just found the following kernel doc. Thaw() is intended to resume the
> > storage device for saving the hibernation image.
>
> Ah, that makes much more sense.
>
> > Our GPU is not involved
> > in it, it is not necessary to resume our GPU in thaw().
> > https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588 <https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588>
> >
> > So another implementation is to remove the amdgpu_device_resume() call
> > in amdgpu_pmops_thaw(), and skip amdgpu_device_ip_suspend() call in
> > amdgpu_pci_shutdown()for hibernation.
> > Initial tests show it's working fine for hibernation successful case.
> > Should I switch to this implementation?
>
> No idea. Alex and the KFD guys need to take a look at that.
>
> > But thaw() is also called to restore the GPU when hibernation is aborted
> > due to some error in hibernation image creation stage. In this case,
> > amdgpu_device_resume() is needed in thaw().
> >
> > So I need a method to check if hibernation is aborted or not to
> > conditionally skip amdgpu_device_resume() in thaw(). Currently I don't
> > know how to do this.
>
> Yeah that approach here looks fishy to me, but I don't know how to properly fix it either.
>
> @Alex any idea?

Yeah, I'm not sure how to handle that.  I don't see a way to avoid
having all of the callbacks.  We could ideally skip some of the steps.
Maybe we could optimize the freeze and thaw routines if we had some
hint from the pm core about why we were getting called.  E.g., thaw
after a failed hibernation restore.

Alex

>
> Regards,
> Christian.
>
> >
> >
> > Regards
> > Sam
> >
> >
> > On 2025/6/30 19:58, Christian König wrote:
> >> On 30.06.25 12:41, Samuel Zhang wrote:
> >>> The hibernation successful workflow:
> >>> - prepare: evict VRAM and swapout GTT BOs
> >>> - freeze
> >>> - create the hibernation image in system memory
> >>> - thaw: swapin and restore BOs
> >> Why should a thaw happen here in between?
> >>
> >>> - complete
> >>> - write hibernation image to disk
> >>> - amdgpu_pci_shutdown
> >>> - goto S5, turn off the system.
> >>>
> >>> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
> >>> shmem. Then in thaw stage, all BOs will be swapin and restored.
> >> That's not correct. This is done by the application starting again and not during thaw.
> >>
> >>> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
> >>> the swapin and restore BOs takes too long (50 minutes) and it is not
> >>> necessary since the follow-up stages does not use GPU.
> >>>
> >>> This patch is to skip BOs restore during thaw to reduce the hibernation
> >>> time.
> >> As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
> >>>   2 files changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index a8f4697deb1b..b550d07190a2 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
> >>>               amdgpu_virt_init_data_exchange(adev);
> >>>               amdgpu_virt_release_full_gpu(adev, true);
> >>>
> >>> -            if (!adev->in_s0ix && !r && !adev->in_runpm)
> >>> +            if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
> >>>                       r = amdgpu_amdkfd_resume_process(adev);
> >>>       }
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> index 571b70da4562..23b76e8ac2fd 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
> >>>   static int amdgpu_pmops_restore(struct device *dev)
> >>>   {
> >>>       struct drm_device *drm_dev = dev_get_drvdata(dev);
> >>> +    struct amdgpu_device *adev = drm_to_adev(drm_dev);
> >>>
> >>> +    adev->in_s4 = false;
> >>>       return amdgpu_device_resume(drm_dev, true);
> >>>   }
> >>>
> >
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
  2025-07-01 16:07         ` Alex Deucher
@ 2025-07-02  7:23           ` Sam
  2025-07-02 13:54             ` Alex Deucher
  0 siblings, 1 reply; 16+ messages in thread
From: Sam @ 2025-07-02  7:23 UTC (permalink / raw)
  To: Alex Deucher, Christian König
  Cc: Zhang, GuoQing (Sam), rafael@kernel.org, len.brown@intel.com,
	pavel@kernel.org, Deucher, Alexander, Limonciello, Mario,
	Lazar, Lijo, Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org


On 2025/7/2 00:07, Alex Deucher wrote:
> On Tue, Jul 1, 2025 at 4:32 AM Christian König <christian.koenig@amd.com> wrote:
>> On 01.07.25 10:03, Zhang, GuoQing (Sam) wrote:
>>> thaw() is called before writing the hiberation image to swap disk. See
>>> the doc here.
>>> https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552 <https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552>
>>>
>>> And amdgpu implemented thaw() callback by calling amdgpu_device_resume().
>>> https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572 <https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572>
>>>
>>> This patch is skip amdgpu_amdkfd_resume_process() call in thaw() during
>>> hibernation. it is not skipped in restore() during resume from
>>> hibernation when system boot again.
>>>
>>>
>>> I just found the following kernel doc. Thaw() is intended to resume the
>>> storage device for saving the hibernation image.
>> Ah, that makes much more sense.
>>
>>> Our GPU is not involved
>>> in it, it is not necessary to resume our GPU in thaw().
>>> https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588 <https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588>
>>>
>>> So another implementation is to remove the amdgpu_device_resume() call
>>> in amdgpu_pmops_thaw(), and skip amdgpu_device_ip_suspend() call in
>>> amdgpu_pci_shutdown()for hibernation.
>>> Initial tests show it's working fine for hibernation successful case.
>>> Should I switch to this implementation?
>> No idea. Alex and the KFD guys need to take a look at that.
>>
>>> But thaw() is also called to restore the GPU when hibernation is aborted
>>> due to some error in hibernation image creation stage. In this case,
>>> amdgpu_device_resume() is needed in thaw().
>>>
>>> So I need a method to check if hibernation is aborted or not to
>>> conditionally skip amdgpu_device_resume() in thaw(). Currently I don't
>>> know how to do this.
>> Yeah that approach here looks fishy to me, but I don't know how to properly fix it either.
>>
>> @Alex any idea?
> Yeah, I'm not sure how to handle that.  I don't see a way to avoid
> having all of the callbacks.  We could ideally skip some of the steps.
> Maybe we could optimize the freeze and thaw routines if we had some
> hint from the pm core about why we were getting called.  E.g., thaw
> after a failed hibernation restore.
>
> Alex


I just found pm_transition variable can be used to check if hibernation 
is cancelled (PM_EVENT_RECOVER) or not(PM_EVENT_THAW) in thaw(). I just 
need to export this variable in kernel.
https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L64

Provided pm_transition is available, should we skip 
amdgpu_amdkfd_resume_process() only, or skip amdgpu_device_resume() 
completely?

Regards
Sam


>
>> Regards,
>> Christian.
>>
>>>
>>> Regards
>>> Sam
>>>
>>>
>>> On 2025/6/30 19:58, Christian König wrote:
>>>> On 30.06.25 12:41, Samuel Zhang wrote:
>>>>> The hibernation successful workflow:
>>>>> - prepare: evict VRAM and swapout GTT BOs
>>>>> - freeze
>>>>> - create the hibernation image in system memory
>>>>> - thaw: swapin and restore BOs
>>>> Why should a thaw happen here in between?
>>>>
>>>>> - complete
>>>>> - write hibernation image to disk
>>>>> - amdgpu_pci_shutdown
>>>>> - goto S5, turn off the system.
>>>>>
>>>>> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
>>>>> shmem. Then in thaw stage, all BOs will be swapin and restored.
>>>> That's not correct. This is done by the application starting again and not during thaw.
>>>>
>>>>> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
>>>>> the swapin and restore BOs takes too long (50 minutes) and it is not
>>>>> necessary since the follow-up stages does not use GPU.
>>>>>
>>>>> This patch is to skip BOs restore during thaw to reduce the hibernation
>>>>> time.
>>>> As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
>>>>>    2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index a8f4697deb1b..b550d07190a2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>>>>                amdgpu_virt_init_data_exchange(adev);
>>>>>                amdgpu_virt_release_full_gpu(adev, true);
>>>>>
>>>>> -            if (!adev->in_s0ix && !r && !adev->in_runpm)
>>>>> +            if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
>>>>>                        r = amdgpu_amdkfd_resume_process(adev);
>>>>>        }
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index 571b70da4562..23b76e8ac2fd 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
>>>>>    static int amdgpu_pmops_restore(struct device *dev)
>>>>>    {
>>>>>        struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>>> +    struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>>>
>>>>> +    adev->in_s4 = false;
>>>>>        return amdgpu_device_resume(drm_dev, true);
>>>>>    }
>>>>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
  2025-07-01  8:22       ` Christian König
@ 2025-07-02  7:28         ` Samuel Zhang
  2025-07-02  7:48           ` Christian König
  0 siblings, 1 reply; 16+ messages in thread
From: Samuel Zhang @ 2025-07-02  7:28 UTC (permalink / raw)
  To: Christian König, Zhang, GuoQing (Sam), rafael@kernel.org,
	len.brown@intel.com, pavel@kernel.org, Deucher, Alexander,
	Limonciello, Mario, Lazar, Lijo
  Cc: Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org


On 2025/7/1 16:22, Christian König wrote:
> On 01.07.25 10:18, Zhang, GuoQing (Sam) wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>
>> Hi Christian,
>>
>>   
>>
>> Thank you for the feedback.
>>
>>   
>>
>> For “return ret < 0 ? ret : 0;”, it is equivalent to “return ret;” since ret is always <= 0 after the loop.
> No it isn't.
>
> ttm_global_swapout() returns the number of pages swapped out and only a negative error code if something went wrong.


/**
  * move GTT BOs to shmem for hibernation.
  *
  * returns 0 on success, negative on failure.
  */
int ttm_device_prepare_hibernation(void)
{
     struct ttm_operation_ctx ctx = {
         .interruptible = false,
         .no_wait_gpu = false,
         .force_alloc = true
     };
     int ret;

     do {
         ret = ttm_global_swapout(&ctx, GFP_KERNEL);
     } while (ret > 0);
     return ret;
}

This is the new code version.
If ttm_global_swapout() return positive number, the while loop will 
continue to the next iteration.
The while loop stops only when ttm_global_swapout() returns 0 or 
negative number. In both case, the new function can just return the ret.

The ret values printed in the do while loop:
[   53.745892] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 512
[   53.950975] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 35840
[   53.951713] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 9
[   67.712196] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 2187264
[   67.713726] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 512
[   67.759212] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 32768
[   67.761946] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1024
[   67.762685] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 85
[   67.763518] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 175
[   67.767318] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 2367
[   67.767942] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
[   67.768499] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
[   67.769054] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
...
[   67.783554] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
[   67.785755] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
[   67.788607] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
[   67.789906] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 0


Regards
Sam



>
> And it's probably not a good idea to return that from the new function.
>
> Regards,
> Christian.
>
>>   
>>
>> For all other comments, I will revise the patch accordingly in v2.
>>
>>   
>>
>> Regards
>>
>> Sam
>>
>>   
>>
>>   
>>
>> *From: *Koenig, Christian <Christian.Koenig@amd.com>
>> *Date: *Monday, June 30, 2025 at 19:54
>> *To: *Zhang, GuoQing (Sam) <GuoQing.Zhang@amd.com>, rafael@kernel.org <rafael@kernel.org>, len.brown@intel.com <len.brown@intel.com>, pavel@kernel.org <pavel@kernel.org>, Deucher, Alexander <Alexander.Deucher@amd.com>, Limonciello, Mario <Mario.Limonciello@amd.com>, Lazar, Lijo <Lijo.Lazar@amd.com>
>> *Cc: *Zhao, Victor <Victor.Zhao@amd.com>, Chang, HaiJun <HaiJun.Chang@amd.com>, Ma, Qing (Mark) <Qing.Ma@amd.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>, dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>, linux-pm@vger.kernel.org <linux-pm@vger.kernel.org>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
>> *Subject: *Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
>>
>> On 30.06.25 12:41, Samuel Zhang wrote:
>>> When hibernate with data center dGPUs, huge number of VRAM BOs evicted
>>> to GTT and takes too much system memory. This will cause hibernation
>>> fail due to insufficient memory for creating the hibernation image.
>>>
>>> Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
>>> hibernation code to make room for hibernation image.
>> This should probably be two patches, one for TTM and then an amdgpu patch to forward the event.
>>
>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>>>    drivers/gpu/drm/ttm/ttm_resource.c      | 18 ++++++++++++++++++
>>>    include/drm/ttm/ttm_resource.h          |  1 +
>>>    3 files changed, 31 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index 4d57269c9ca8..5aede907a591 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -2889,6 +2889,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>    int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>>>    {
>>>          struct ttm_resource_manager *man;
>>> +     int r;
>>>    
>>>          switch (mem_type) {
>>>          case TTM_PL_VRAM:
>>> @@ -2903,7 +2904,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>>>                  return -EINVAL;
>>>          }
>>>    
>>> -     return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
>>> +     r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
>>> +     if (r) {
>>> +             DRM_ERROR("Failed to evict memory type %d\n", mem_type);
>>> +             return r;
>>> +     }
>>> +     if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
>>> +             r = ttm_resource_manager_swapout();
>>> +             if (r)
>>> +                     DRM_ERROR("Failed to swap out, %d\n", r);
>>> +     }
>>> +     return r;
>>>    }
>>>    
>>>    #if defined(CONFIG_DEBUG_FS)
>>> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
>>> index fd41b56e2c66..07b1f5a5afc2 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_resource.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
>>> @@ -534,6 +534,24 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>>>    }
>>>    EXPORT_SYMBOL(ttm_resource_manager_init);
>>>    
>>> +int ttm_resource_manager_swapout(void)
>> This needs documentation, better placement and a better name.
>>
>> First of all put it into ttm_device.c instead of the resource manager.
>>
>> Then call it something like ttm_device_prepare_hibernation or similar.
>>
>>
>>> +{
>>> +     struct ttm_operation_ctx ctx = {
>>> +             .interruptible = false,
>>> +             .no_wait_gpu = false,
>>> +             .force_alloc = true
>>> +     };
>>> +     int ret;
>>> +
>>> +     while (true) {
>> Make that:
>>
>> do {
>>          ret = ...
>> } while (ret > 0);
>>
>>> +             ret = ttm_global_swapout(&ctx, GFP_KERNEL);
>>> +             if (ret <= 0)
>>> +                     break;
>>> +     }
>>> +     return ret;
>> It's rather pointless to return the number of swapped out pages.
>>
>> Make that "return ret < 0 ? ret : 0;
>>
>> Regards,
>> Christian.
>>
>>> +}
>>> +EXPORT_SYMBOL(ttm_resource_manager_swapout);
>>> +
>>>    /*
>>>     * ttm_resource_manager_evict_all
>>>     *
>>> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
>>> index b873be9597e2..46181758068e 100644
>>> --- a/include/drm/ttm/ttm_resource.h
>>> +++ b/include/drm/ttm/ttm_resource.h
>>> @@ -463,6 +463,7 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>>>    
>>>    int ttm_resource_manager_evict_all(struct ttm_device *bdev,
>>>                                     struct ttm_resource_manager *man);
>>> +int ttm_resource_manager_swapout(void);
>>>    
>>>    uint64_t ttm_resource_manager_usage(struct ttm_resource_manager *man);
>>>    void ttm_resource_manager_debug(struct ttm_resource_manager *man,

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
  2025-07-02  7:28         ` Samuel Zhang
@ 2025-07-02  7:48           ` Christian König
  0 siblings, 0 replies; 16+ messages in thread
From: Christian König @ 2025-07-02  7:48 UTC (permalink / raw)
  To: Samuel Zhang, Zhang, GuoQing (Sam), rafael@kernel.org,
	len.brown@intel.com, pavel@kernel.org, Deucher, Alexander,
	Limonciello, Mario, Lazar, Lijo
  Cc: Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org

On 02.07.25 09:28, Samuel Zhang wrote:
> 
> On 2025/7/1 16:22, Christian König wrote:
>> On 01.07.25 10:18, Zhang, GuoQing (Sam) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>
>>> Hi Christian,
>>>
>>>  
>>> Thank you for the feedback.
>>>
>>>  
>>> For “return ret < 0 ? ret : 0;”, it is equivalent to “return ret;” since ret is always <= 0 after the loop.
>> No it isn't.
>>
>> ttm_global_swapout() returns the number of pages swapped out and only a negative error code if something went wrong.
> 
> 
> /**
>  * move GTT BOs to shmem for hibernation.
>  *
>  * returns 0 on success, negative on failure.
>  */
> int ttm_device_prepare_hibernation(void)
> {
>     struct ttm_operation_ctx ctx = {
>         .interruptible = false,
>         .no_wait_gpu = false,
>         .force_alloc = true
>     };
>     int ret;
> 
>     do {
>         ret = ttm_global_swapout(&ctx, GFP_KERNEL);
>     } while (ret > 0);
>     return ret;
> }
> 
> This is the new code version.
> If ttm_global_swapout() return positive number, the while loop will continue to the next iteration.
> The while loop stops only when ttm_global_swapout() returns 0 or negative number. In both case, the new function can just return the ret.

Ok, now I at least got what you wanted to do. But that is not really what I had in mind and isn't really good coding style.

Please use ttm_device_swapout() instead of ttm_global_swapout(), apart from that we can probably keep it that way.

Regards,
Christian.

> 
> The ret values printed in the do while loop:
> [   53.745892] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 512
> [   53.950975] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 35840
> [   53.951713] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 9
> [   67.712196] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 2187264
> [   67.713726] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 512
> [   67.759212] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 32768
> [   67.761946] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1024
> [   67.762685] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 85
> [   67.763518] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 175
> [   67.767318] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 2367
> [   67.767942] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
> [   67.768499] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
> [   67.769054] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
> ...
> [   67.783554] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
> [   67.785755] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
> [   67.788607] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1
> [   67.789906] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 0
> 
> 
> Regards
> Sam
> 
> 
> 
>>
>> And it's probably not a good idea to return that from the new function.
>>
>> Regards,
>> Christian.
>>
>>>  
>>> For all other comments, I will revise the patch accordingly in v2.
>>>
>>>  
>>> Regards
>>>
>>> Sam
>>>
>>>  
>>>  
>>> *From: *Koenig, Christian <Christian.Koenig@amd.com>
>>> *Date: *Monday, June 30, 2025 at 19:54
>>> *To: *Zhang, GuoQing (Sam) <GuoQing.Zhang@amd.com>, rafael@kernel.org <rafael@kernel.org>, len.brown@intel.com <len.brown@intel.com>, pavel@kernel.org <pavel@kernel.org>, Deucher, Alexander <Alexander.Deucher@amd.com>, Limonciello, Mario <Mario.Limonciello@amd.com>, Lazar, Lijo <Lijo.Lazar@amd.com>
>>> *Cc: *Zhao, Victor <Victor.Zhao@amd.com>, Chang, HaiJun <HaiJun.Chang@amd.com>, Ma, Qing (Mark) <Qing.Ma@amd.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>, dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>, linux-pm@vger.kernel.org <linux-pm@vger.kernel.org>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
>>> *Subject: *Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for hibernation
>>>
>>> On 30.06.25 12:41, Samuel Zhang wrote:
>>>> When hibernate with data center dGPUs, huge number of VRAM BOs evicted
>>>> to GTT and takes too much system memory. This will cause hibernation
>>>> fail due to insufficient memory for creating the hibernation image.
>>>>
>>>> Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
>>>> hibernation code to make room for hibernation image.
>>> This should probably be two patches, one for TTM and then an amdgpu patch to forward the event.
>>>
>>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>>>>    drivers/gpu/drm/ttm/ttm_resource.c      | 18 ++++++++++++++++++
>>>>    include/drm/ttm/ttm_resource.h          |  1 +
>>>>    3 files changed, 31 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index 4d57269c9ca8..5aede907a591 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -2889,6 +2889,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>    int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>>>>    {
>>>>          struct ttm_resource_manager *man;
>>>> +     int r;
>>>>             switch (mem_type) {
>>>>          case TTM_PL_VRAM:
>>>> @@ -2903,7 +2904,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>>>>                  return -EINVAL;
>>>>          }
>>>>    -     return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
>>>> +     r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
>>>> +     if (r) {
>>>> +             DRM_ERROR("Failed to evict memory type %d\n", mem_type);
>>>> +             return r;
>>>> +     }
>>>> +     if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
>>>> +             r = ttm_resource_manager_swapout();
>>>> +             if (r)
>>>> +                     DRM_ERROR("Failed to swap out, %d\n", r);
>>>> +     }
>>>> +     return r;
>>>>    }
>>>>       #if defined(CONFIG_DEBUG_FS)
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
>>>> index fd41b56e2c66..07b1f5a5afc2 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_resource.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
>>>> @@ -534,6 +534,24 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>>>>    }
>>>>    EXPORT_SYMBOL(ttm_resource_manager_init);
>>>>    +int ttm_resource_manager_swapout(void)
>>> This needs documentation, better placement and a better name.
>>>
>>> First of all put it into ttm_device.c instead of the resource manager.
>>>
>>> Then call it something like ttm_device_prepare_hibernation or similar.
>>>
>>>
>>>> +{
>>>> +     struct ttm_operation_ctx ctx = {
>>>> +             .interruptible = false,
>>>> +             .no_wait_gpu = false,
>>>> +             .force_alloc = true
>>>> +     };
>>>> +     int ret;
>>>> +
>>>> +     while (true) {
>>> Make that:
>>>
>>> do {
>>>          ret = ...
>>> } while (ret > 0);
>>>
>>>> +             ret = ttm_global_swapout(&ctx, GFP_KERNEL);
>>>> +             if (ret <= 0)
>>>> +                     break;
>>>> +     }
>>>> +     return ret;
>>> It's rather pointless to return the number of swapped out pages.
>>>
>>> Make that "return ret < 0 ? ret : 0;
>>>
>>> Regards,
>>> Christian.
>>>
>>>> +}
>>>> +EXPORT_SYMBOL(ttm_resource_manager_swapout);
>>>> +
>>>>    /*
>>>>     * ttm_resource_manager_evict_all
>>>>     *
>>>> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
>>>> index b873be9597e2..46181758068e 100644
>>>> --- a/include/drm/ttm/ttm_resource.h
>>>> +++ b/include/drm/ttm/ttm_resource.h
>>>> @@ -463,6 +463,7 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>>>>       int ttm_resource_manager_evict_all(struct ttm_device *bdev,
>>>>                                     struct ttm_resource_manager *man);
>>>> +int ttm_resource_manager_swapout(void);
>>>>       uint64_t ttm_resource_manager_usage(struct ttm_resource_manager *man);
>>>>    void ttm_resource_manager_debug(struct ttm_resource_manager *man,


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
  2025-07-02  7:23           ` Sam
@ 2025-07-02 13:54             ` Alex Deucher
  2025-07-02 14:07               ` Lazar, Lijo
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Deucher @ 2025-07-02 13:54 UTC (permalink / raw)
  To: Sam
  Cc: Christian König, Zhang, GuoQing (Sam), rafael@kernel.org,
	len.brown@intel.com, pavel@kernel.org, Deucher, Alexander,
	Limonciello, Mario, Lazar, Lijo, Zhao, Victor, Chang, HaiJun,
	Ma, Qing (Mark), amd-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, linux-pm@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Jul 2, 2025 at 3:24 AM Sam <guoqzhan@amd.com> wrote:
>
>
> On 2025/7/2 00:07, Alex Deucher wrote:
> > On Tue, Jul 1, 2025 at 4:32 AM Christian König <christian.koenig@amd.com> wrote:
> >> On 01.07.25 10:03, Zhang, GuoQing (Sam) wrote:
> >>> thaw() is called before writing the hiberation image to swap disk. See
> >>> the doc here.
> >>> https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552 <https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552>
> >>>
> >>> And amdgpu implemented thaw() callback by calling amdgpu_device_resume().
> >>> https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572 <https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572>
> >>>
> >>> This patch is skip amdgpu_amdkfd_resume_process() call in thaw() during
> >>> hibernation. it is not skipped in restore() during resume from
> >>> hibernation when system boot again.
> >>>
> >>>
> >>> I just found the following kernel doc. Thaw() is intended to resume the
> >>> storage device for saving the hibernation image.
> >> Ah, that makes much more sense.
> >>
> >>> Our GPU is not involved
> >>> in it, it is not necessary to resume our GPU in thaw().
> >>> https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588 <https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588>
> >>>
> >>> So another implementation is to remove the amdgpu_device_resume() call
> >>> in amdgpu_pmops_thaw(), and skip amdgpu_device_ip_suspend() call in
> >>> amdgpu_pci_shutdown()for hibernation.
> >>> Initial tests show it's working fine for hibernation successful case.
> >>> Should I switch to this implementation?
> >> No idea. Alex and the KFD guys need to take a look at that.
> >>
> >>> But thaw() is also called to restore the GPU when hibernation is aborted
> >>> due to some error in hibernation image creation stage. In this case,
> >>> amdgpu_device_resume() is needed in thaw().
> >>>
> >>> So I need a method to check if hibernation is aborted or not to
> >>> conditionally skip amdgpu_device_resume() in thaw(). Currently I don't
> >>> know how to do this.
> >> Yeah that approach here looks fishy to me, but I don't know how to properly fix it either.
> >>
> >> @Alex any idea?
> > Yeah, I'm not sure how to handle that.  I don't see a way to avoid
> > having all of the callbacks.  We could ideally skip some of the steps.
> > Maybe we could optimize the freeze and thaw routines if we had some
> > hint from the pm core about why we were getting called.  E.g., thaw
> > after a failed hibernation restore.
> >
> > Alex
>
>
> I just found pm_transition variable can be used to check if hibernation
> is cancelled (PM_EVENT_RECOVER) or not(PM_EVENT_THAW) in thaw(). I just
> need to export this variable in kernel.
> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L64
>
> Provided pm_transition is available, should we skip
> amdgpu_amdkfd_resume_process() only, or skip amdgpu_device_resume()
> completely?

Hmmm.  Still not sure how best to handle this.  For entering
hibernation, all we really need is freeze().  Once we are done with
that we don't need thaw() or poweroff() for hibernation as we've
already suspended in freeze() so there is nothing else to do.  For
exiting hibernation, we need freeze() to suspend and then either
thaw() (if the hibernation image is bad) or restore() (if the
hibernation image is good) to resume.

Alex

>
> Regards
> Sam
>
>
> >
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Regards
> >>> Sam
> >>>
> >>>
> >>> On 2025/6/30 19:58, Christian König wrote:
> >>>> On 30.06.25 12:41, Samuel Zhang wrote:
> >>>>> The hibernation successful workflow:
> >>>>> - prepare: evict VRAM and swapout GTT BOs
> >>>>> - freeze
> >>>>> - create the hibernation image in system memory
> >>>>> - thaw: swapin and restore BOs
> >>>> Why should a thaw happen here in between?
> >>>>
> >>>>> - complete
> >>>>> - write hibernation image to disk
> >>>>> - amdgpu_pci_shutdown
> >>>>> - goto S5, turn off the system.
> >>>>>
> >>>>> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
> >>>>> shmem. Then in thaw stage, all BOs will be swapin and restored.
> >>>> That's not correct. This is done by the application starting again and not during thaw.
> >>>>
> >>>>> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
> >>>>> the swapin and restore BOs takes too long (50 minutes) and it is not
> >>>>> necessary since the follow-up stages does not use GPU.
> >>>>>
> >>>>> This patch is to skip BOs restore during thaw to reduce the hibernation
> >>>>> time.
> >>>> As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> >>>>> ---
> >>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> >>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
> >>>>>    2 files changed, 3 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> index a8f4697deb1b..b550d07190a2 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
> >>>>>                amdgpu_virt_init_data_exchange(adev);
> >>>>>                amdgpu_virt_release_full_gpu(adev, true);
> >>>>>
> >>>>> -            if (!adev->in_s0ix && !r && !adev->in_runpm)
> >>>>> +            if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
> >>>>>                        r = amdgpu_amdkfd_resume_process(adev);
> >>>>>        }
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> index 571b70da4562..23b76e8ac2fd 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
> >>>>>    static int amdgpu_pmops_restore(struct device *dev)
> >>>>>    {
> >>>>>        struct drm_device *drm_dev = dev_get_drvdata(dev);
> >>>>> +    struct amdgpu_device *adev = drm_to_adev(drm_dev);
> >>>>>
> >>>>> +    adev->in_s4 = false;
> >>>>>        return amdgpu_device_resume(drm_dev, true);
> >>>>>    }
> >>>>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
  2025-07-02 13:54             ` Alex Deucher
@ 2025-07-02 14:07               ` Lazar, Lijo
  2025-07-04 10:24                 ` Zhang, GuoQing (Sam)
  0 siblings, 1 reply; 16+ messages in thread
From: Lazar, Lijo @ 2025-07-02 14:07 UTC (permalink / raw)
  To: Alex Deucher, Sam
  Cc: Christian König, Zhang, GuoQing (Sam), rafael@kernel.org,
	len.brown@intel.com, pavel@kernel.org, Deucher, Alexander,
	Limonciello, Mario, Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org



On 7/2/2025 7:24 PM, Alex Deucher wrote:
> On Wed, Jul 2, 2025 at 3:24 AM Sam <guoqzhan@amd.com> wrote:
>>
>>
>> On 2025/7/2 00:07, Alex Deucher wrote:
>>> On Tue, Jul 1, 2025 at 4:32 AM Christian König <christian.koenig@amd.com> wrote:
>>>> On 01.07.25 10:03, Zhang, GuoQing (Sam) wrote:
>>>>> thaw() is called before writing the hiberation image to swap disk. See
>>>>> the doc here.
>>>>> https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552 <https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552>
>>>>>
>>>>> And amdgpu implemented thaw() callback by calling amdgpu_device_resume().
>>>>> https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572 <https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572>
>>>>>
>>>>> This patch is skip amdgpu_amdkfd_resume_process() call in thaw() during
>>>>> hibernation. it is not skipped in restore() during resume from
>>>>> hibernation when system boot again.
>>>>>
>>>>>
>>>>> I just found the following kernel doc. Thaw() is intended to resume the
>>>>> storage device for saving the hibernation image.
>>>> Ah, that makes much more sense.
>>>>
>>>>> Our GPU is not involved
>>>>> in it, it is not necessary to resume our GPU in thaw().
>>>>> https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588 <https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588>
>>>>>
>>>>> So another implementation is to remove the amdgpu_device_resume() call
>>>>> in amdgpu_pmops_thaw(), and skip amdgpu_device_ip_suspend() call in
>>>>> amdgpu_pci_shutdown()for hibernation.
>>>>> Initial tests show it's working fine for hibernation successful case.
>>>>> Should I switch to this implementation?
>>>> No idea. Alex and the KFD guys need to take a look at that.
>>>>
>>>>> But thaw() is also called to restore the GPU when hibernation is aborted
>>>>> due to some error in hibernation image creation stage. In this case,
>>>>> amdgpu_device_resume() is needed in thaw().
>>>>>
>>>>> So I need a method to check if hibernation is aborted or not to
>>>>> conditionally skip amdgpu_device_resume() in thaw(). Currently I don't
>>>>> know how to do this.
>>>> Yeah that approach here looks fishy to me, but I don't know how to properly fix it either.
>>>>
>>>> @Alex any idea?
>>> Yeah, I'm not sure how to handle that.  I don't see a way to avoid
>>> having all of the callbacks.  We could ideally skip some of the steps.
>>> Maybe we could optimize the freeze and thaw routines if we had some
>>> hint from the pm core about why we were getting called.  E.g., thaw
>>> after a failed hibernation restore.
>>>
>>> Alex
>>
>>
>> I just found pm_transition variable can be used to check if hibernation
>> is cancelled (PM_EVENT_RECOVER) or not(PM_EVENT_THAW) in thaw(). I just
>> need to export this variable in kernel.
>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L64
>>
>> Provided pm_transition is available, should we skip
>> amdgpu_amdkfd_resume_process() only, or skip amdgpu_device_resume()
>> completely?
> 
> Hmmm.  Still not sure how best to handle this.  For entering
> hibernation, all we really need is freeze().  Once we are done with
> that we don't need thaw() or poweroff() for hibernation as we've
> already suspended in freeze() so there is nothing else to do.  For
> exiting hibernation, we need freeze() to suspend and then either
> thaw() (if the hibernation image is bad) or restore() (if the
> hibernation image is good) to resume.
> 

If pm_transition is available, we can keep thaw() as we have now and do
resume only if pm_transition =  PM_EVENT_RECOVER. shutdown() may check
in_s4 and do nothing.

Thanks,
Lijo


> Alex
> 
>>
>> Regards
>> Sam
>>
>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>>
>>>>> On 2025/6/30 19:58, Christian König wrote:
>>>>>> On 30.06.25 12:41, Samuel Zhang wrote:
>>>>>>> The hibernation successful workflow:
>>>>>>> - prepare: evict VRAM and swapout GTT BOs
>>>>>>> - freeze
>>>>>>> - create the hibernation image in system memory
>>>>>>> - thaw: swapin and restore BOs
>>>>>> Why should a thaw happen here in between?
>>>>>>
>>>>>>> - complete
>>>>>>> - write hibernation image to disk
>>>>>>> - amdgpu_pci_shutdown
>>>>>>> - goto S5, turn off the system.
>>>>>>>
>>>>>>> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
>>>>>>> shmem. Then in thaw stage, all BOs will be swapin and restored.
>>>>>> That's not correct. This is done by the application starting again and not during thaw.
>>>>>>
>>>>>>> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
>>>>>>> the swapin and restore BOs takes too long (50 minutes) and it is not
>>>>>>> necessary since the follow-up stages does not use GPU.
>>>>>>>
>>>>>>> This patch is to skip BOs restore during thaw to reduce the hibernation
>>>>>>> time.
>>>>>> As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
>>>>>>>    2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index a8f4697deb1b..b550d07190a2 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>>>>>>                amdgpu_virt_init_data_exchange(adev);
>>>>>>>                amdgpu_virt_release_full_gpu(adev, true);
>>>>>>>
>>>>>>> -            if (!adev->in_s0ix && !r && !adev->in_runpm)
>>>>>>> +            if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
>>>>>>>                        r = amdgpu_amdkfd_resume_process(adev);
>>>>>>>        }
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> index 571b70da4562..23b76e8ac2fd 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
>>>>>>>    static int amdgpu_pmops_restore(struct device *dev)
>>>>>>>    {
>>>>>>>        struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>>>>> +    struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>>>>>
>>>>>>> +    adev->in_s4 = false;
>>>>>>>        return amdgpu_device_resume(drm_dev, true);
>>>>>>>    }
>>>>>>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
  2025-07-02 14:07               ` Lazar, Lijo
@ 2025-07-04 10:24                 ` Zhang, GuoQing (Sam)
  0 siblings, 0 replies; 16+ messages in thread
From: Zhang, GuoQing (Sam) @ 2025-07-04 10:24 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher
  Cc: Christian König, Zhang, GuoQing (Sam), rafael@kernel.org,
	len.brown@intel.com, pavel@kernel.org, Deucher, Alexander,
	Limonciello, Mario, Zhao, Victor, Chang, HaiJun, Ma, Qing (Mark),
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org


On 2025/7/2 22:07, Lazar, Lijo wrote:
>
> On 7/2/2025 7:24 PM, Alex Deucher wrote:
>> On Wed, Jul 2, 2025 at 3:24 AM Sam <guoqzhan@amd.com> wrote:
>>>
>>> On 2025/7/2 00:07, Alex Deucher wrote:
>>>> On Tue, Jul 1, 2025 at 4:32 AM Christian König <christian.koenig@amd.com> wrote:
>>>>> On 01.07.25 10:03, Zhang, GuoQing (Sam) wrote:
>>>>>> thaw() is called before writing the hiberation image to swap disk. See
>>>>>> the doc here.
>>>>>> https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552 <https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552>
>>>>>>
>>>>>> And amdgpu implemented thaw() callback by calling amdgpu_device_resume().
>>>>>> https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572 <https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572>
>>>>>>
>>>>>> This patch is skip amdgpu_amdkfd_resume_process() call in thaw() during
>>>>>> hibernation. it is not skipped in restore() during resume from
>>>>>> hibernation when system boot again.
>>>>>>
>>>>>>
>>>>>> I just found the following kernel doc. Thaw() is intended to resume the
>>>>>> storage device for saving the hibernation image.
>>>>> Ah, that makes much more sense.
>>>>>
>>>>>> Our GPU is not involved
>>>>>> in it, it is not necessary to resume our GPU in thaw().
>>>>>> https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588 <https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588>
>>>>>>
>>>>>> So another implementation is to remove the amdgpu_device_resume() call
>>>>>> in amdgpu_pmops_thaw(), and skip amdgpu_device_ip_suspend() call in
>>>>>> amdgpu_pci_shutdown()for hibernation.
>>>>>> Initial tests show it's working fine for hibernation successful case.
>>>>>> Should I switch to this implementation?
>>>>> No idea. Alex and the KFD guys need to take a look at that.
>>>>>
>>>>>> But thaw() is also called to restore the GPU when hibernation is aborted
>>>>>> due to some error in hibernation image creation stage. In this case,
>>>>>> amdgpu_device_resume() is needed in thaw().
>>>>>>
>>>>>> So I need a method to check if hibernation is aborted or not to
>>>>>> conditionally skip amdgpu_device_resume() in thaw(). Currently I don't
>>>>>> know how to do this.
>>>>> Yeah that approach here looks fishy to me, but I don't know how to properly fix it either.
>>>>>
>>>>> @Alex any idea?
>>>> Yeah, I'm not sure how to handle that.  I don't see a way to avoid
>>>> having all of the callbacks.  We could ideally skip some of the steps.
>>>> Maybe we could optimize the freeze and thaw routines if we had some
>>>> hint from the pm core about why we were getting called.  E.g., thaw
>>>> after a failed hibernation restore.
>>>>
>>>> Alex
>>>
>>> I just found pm_transition variable can be used to check if hibernation
>>> is cancelled (PM_EVENT_RECOVER) or not(PM_EVENT_THAW) in thaw(). I just
>>> need to export this variable in kernel.
>>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L64
>>>
>>> Provided pm_transition is available, should we skip
>>> amdgpu_amdkfd_resume_process() only, or skip amdgpu_device_resume()
>>> completely?
>> Hmmm.  Still not sure how best to handle this.  For entering
>> hibernation, all we really need is freeze().  Once we are done with
>> that we don't need thaw() or poweroff() for hibernation as we've
>> already suspended in freeze() so there is nothing else to do.  For
>> exiting hibernation, we need freeze() to suspend and then either
>> thaw() (if the hibernation image is bad) or restore() (if the
>> hibernation image is good) to resume.
>>
> If pm_transition is available, we can keep thaw() as we have now and do
> resume only if pm_transition =  PM_EVENT_RECOVER. shutdown() may check
> in_s4 and do nothing.


Hi Lijo, Christian and Alex,
Thank you for the suggestion and feedback. I have implemented this 
behavior and send out the v2 patch list. Please help review. Thank you!

[PATCH v2 0/5] reduce system memory requirement for hibernation
[PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api
[PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation
[PATCH v2 3/5] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
[PATCH v2 4/5] PM: hibernate: export variable pm_transition
[PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal 
hibernation

Regards
Sam


>
> Thanks,
> Lijo
>
>
>> Alex
>>
>>> Regards
>>> Sam
>>>
>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Regards
>>>>>> Sam
>>>>>>
>>>>>>
>>>>>> On 2025/6/30 19:58, Christian König wrote:
>>>>>>> On 30.06.25 12:41, Samuel Zhang wrote:
>>>>>>>> The hibernation successful workflow:
>>>>>>>> - prepare: evict VRAM and swapout GTT BOs
>>>>>>>> - freeze
>>>>>>>> - create the hibernation image in system memory
>>>>>>>> - thaw: swapin and restore BOs
>>>>>>> Why should a thaw happen here in between?
>>>>>>>
>>>>>>>> - complete
>>>>>>>> - write hibernation image to disk
>>>>>>>> - amdgpu_pci_shutdown
>>>>>>>> - goto S5, turn off the system.
>>>>>>>>
>>>>>>>> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
>>>>>>>> shmem. Then in thaw stage, all BOs will be swapin and restored.
>>>>>>> That's not correct. This is done by the application starting again and not during thaw.
>>>>>>>
>>>>>>>> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
>>>>>>>> the swapin and restore BOs takes too long (50 minutes) and it is not
>>>>>>>> necessary since the follow-up stages does not use GPU.
>>>>>>>>
>>>>>>>> This patch is to skip BOs restore during thaw to reduce the hibernation
>>>>>>>> time.
>>>>>>> As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
>>>>>>>>     2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index a8f4697deb1b..b550d07190a2 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>>>>>>>>                 amdgpu_virt_init_data_exchange(adev);
>>>>>>>>                 amdgpu_virt_release_full_gpu(adev, true);
>>>>>>>>
>>>>>>>> -            if (!adev->in_s0ix && !r && !adev->in_runpm)
>>>>>>>> +            if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
>>>>>>>>                         r = amdgpu_amdkfd_resume_process(adev);
>>>>>>>>         }
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> index 571b70da4562..23b76e8ac2fd 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
>>>>>>>>     static int amdgpu_pmops_restore(struct device *dev)
>>>>>>>>     {
>>>>>>>>         struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>>>>>> +    struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>>>>>>
>>>>>>>> +    adev->in_s4 = false;
>>>>>>>>         return amdgpu_device_resume(drm_dev, true);
>>>>>>>>     }
>>>>>>>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-07-04 10:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-30 10:41 [PATCH 0/3] reduce system memory requirement for hibernation Samuel Zhang
2025-06-30 10:41 ` [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction " Samuel Zhang
2025-06-30 11:54   ` Christian König
     [not found]     ` <DM4PR12MB5937FFB3E121E489A261785DE541A@DM4PR12MB5937.namprd12.prod.outlook.com>
2025-07-01  8:22       ` Christian König
2025-07-02  7:28         ` Samuel Zhang
2025-07-02  7:48           ` Christian König
2025-06-30 10:41 ` [PATCH 2/3] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
2025-06-30 20:21   ` Rafael J. Wysocki
2025-06-30 10:41 ` [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw() Samuel Zhang
2025-06-30 11:58   ` Christian König
     [not found]     ` <8eb1700d-4d60-4a1e-9d09-718f65baaf1e@amd.com>
2025-07-01  8:32       ` Christian König
2025-07-01 16:07         ` Alex Deucher
2025-07-02  7:23           ` Sam
2025-07-02 13:54             ` Alex Deucher
2025-07-02 14:07               ` Lazar, Lijo
2025-07-04 10:24                 ` Zhang, GuoQing (Sam)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox