[PATCH v2 0/5] reduce system memory requirement for hibernation

dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/5] reduce system memory requirement for hibernation
@ 2025-07-04 10:12 Samuel Zhang
  2025-07-04 10:12 ` [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api Samuel Zhang
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Samuel Zhang @ 2025-07-04 10:12 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, rafael, len.brown, pavel,
	gregkh, dakr, airlied, simona, ray.huang, matthew.auld,
	matthew.brost, maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel, Samuel Zhang

Modern data center dGPUs are usually equipped with very large VRAM. On
server with such dGPUs(192GB VRAM * 8) and 2TB system memory, hibernate
will fail due to no enough free memory.

The root cause is that during hibernation all VRAM memory get evicted to
GTT or shmem. In both case, it is in system memory and kernel will try to 
copy the pages to hibernation image. In the worst case, this causes 2 
copies of VRAM memory in system memory, 2TB is not enough for the 
hibernation image. 192GB * 8 * 2 = 3TB > 2TB.

The fix includes following changes. With these changes, there's much less
pages needed to be copied to hibernate image and hibernation can succeed.
* patch 1 and 2: move GTT to shmem after evicting VRAM. so that the GTT 
  pages can be freed.
* patch 3: force write shmem pages to swap disk and free shmem pages.

After swapout GTT to shmem in hibernation prepare stage, the GPU will be
resumed again in thaw stage. The swapin and restore BOs of resume takes
lots of time (50 mintues observed for 8 dGPUs). And it's unnecessary since
writing hibernation image do not need GPU for hibernate successful case.
* patch 4 and 5: skip resume of device in thaw stage for successful
  hibernation case to reduce the hibernation time.

v2:
* split first patch to 2 patches, 1 for ttm, 1 for amdgpu
* refined the new ttm api
* add more comments for shrink_shmem_memory() and its callsite
* export variable pm_transition in kernel
* skip resume in thaw() for successful hibernation case

Samuel Zhang (5):
1. drm/ttm: add ttm_device_prepare_hibernation() api
2. drm/amdgpu: move GTT to shmem after eviction for hibernation
3. PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
4. PM: hibernate: export variable pm_transition
5. drm/amdgpu: do not resume device in thaw for normal hibernation

 drivers/base/power/main.c                    |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c      | 10 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      | 13 ++++++++-
 drivers/gpu/drm/amd/dkms/config/config.h     |  3 ++
 drivers/gpu/drm/amd/dkms/m4/pm_transition.m4 | 15 ++++++++++
 drivers/gpu/drm/ttm/ttm_device.c             | 29 ++++++++++++++++++++
 include/drm/ttm/ttm_device.h                 |  1 +
 include/linux/pm.h                           |  2 ++
 kernel/power/hibernate.c                     | 26 ++++++++++++++++++
 9 files changed, 100 insertions(+), 2 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/dkms/m4/pm_transition.m4

-- 
2.43.5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api
  2025-07-04 10:12 [PATCH v2 0/5] reduce system memory requirement for hibernation Samuel Zhang
@ 2025-07-04 10:12 ` Samuel Zhang
  2025-07-06 20:44   ` Mario Limonciello
  2025-07-07  9:13   ` Christian König
  2025-07-04 10:12 ` [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation Samuel Zhang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Samuel Zhang @ 2025-07-04 10:12 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, rafael, len.brown, pavel,
	gregkh, dakr, airlied, simona, ray.huang, matthew.auld,
	matthew.brost, maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel, Samuel Zhang

This new api is used for hibernation to move GTT BOs to shmem after
VRAM eviction. shmem will be flushed to swap disk later to reduce
the system memory usage for hibernation.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/ttm/ttm_device.c | 29 +++++++++++++++++++++++++++++
 include/drm/ttm/ttm_device.h     |  1 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 02e797fd1891..19ab35ffeead 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -123,6 +123,35 @@ static int ttm_global_init(void)
 	return ret;
 }
 
+/**
+ * move GTT BOs to shmem for hibernation.
+ *
+ * returns 0 on success, negative on failure.
+ */
+int ttm_device_prepare_hibernation(void)
+{
+	struct ttm_operation_ctx ctx = {
+		.interruptible = false,
+		.no_wait_gpu = false,
+		.force_alloc = true
+	};
+	struct ttm_global *glob = &ttm_glob;
+	struct ttm_device *bdev;
+	int ret = 0;
+
+	mutex_lock(&ttm_global_mutex);
+	list_for_each_entry(bdev, &glob->device_list, device_list) {
+		do {
+			ret = ttm_device_swapout(bdev, &ctx, GFP_KERNEL);
+		} while (ret > 0);
+		if (ret < 0)
+			break;
+	}
+	mutex_unlock(&ttm_global_mutex);
+	return ret;
+}
+EXPORT_SYMBOL(ttm_device_prepare_hibernation);
+
 /*
  * A buffer object shrink method that tries to swap out the first
  * buffer object on the global::swap_lru list.
diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
index 39b8636b1845..b45498b398dd 100644
--- a/include/drm/ttm/ttm_device.h
+++ b/include/drm/ttm/ttm_device.h
@@ -272,6 +272,7 @@ struct ttm_device {
 int ttm_global_swapout(struct ttm_operation_ctx *ctx, gfp_t gfp_flags);
 int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
 		       gfp_t gfp_flags);
+int ttm_device_prepare_hibernation(void);
 
 static inline struct ttm_resource_manager *
 ttm_manager_type(struct ttm_device *bdev, int mem_type)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation
  2025-07-04 10:12 [PATCH v2 0/5] reduce system memory requirement for hibernation Samuel Zhang
  2025-07-04 10:12 ` [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api Samuel Zhang
@ 2025-07-04 10:12 ` Samuel Zhang
  2025-07-06 20:41   ` Mario Limonciello
  2025-07-07  9:16   ` Christian König
  2025-07-04 10:12 ` [PATCH v2 3/5] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Samuel Zhang @ 2025-07-04 10:12 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, rafael, len.brown, pavel,
	gregkh, dakr, airlied, simona, ray.huang, matthew.auld,
	matthew.brost, maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel, Samuel Zhang

When hibernate with data center dGPUs, huge number of VRAM BOs evicted
to GTT and takes too much system memory. This will cause hibernation
fail due to insufficient memory for creating the hibernation image.

Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
hibernation code to make room for hibernation image.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 27ab4e754b2a..a0b0682236e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -2414,6 +2414,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
 int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
 {
 	struct ttm_resource_manager *man;
+	int r;
 
 	switch (mem_type) {
 	case TTM_PL_VRAM:
@@ -2428,7 +2429,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
 		return -EINVAL;
 	}
 
-	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
+	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
+	if (r) {
+		DRM_ERROR("Failed to evict memory type %d\n", mem_type);
+		return r;
+	}
+	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
+		r = ttm_device_prepare_hibernation();
+		if (r)
+			DRM_ERROR("Failed to swap out, %d\n", r);
+	}
+	return r;
 }
 
 #if defined(CONFIG_DEBUG_FS)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/5] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
  2025-07-04 10:12 [PATCH v2 0/5] reduce system memory requirement for hibernation Samuel Zhang
  2025-07-04 10:12 ` [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api Samuel Zhang
  2025-07-04 10:12 ` [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation Samuel Zhang
@ 2025-07-04 10:12 ` Samuel Zhang
  2025-07-07 18:45   ` Rafael J. Wysocki
  2025-07-04 10:12 ` [PATCH v2 4/5] PM: hibernate: export variable pm_transition Samuel Zhang
  2025-07-04 10:12 ` [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation Samuel Zhang
  4 siblings, 1 reply; 17+ messages in thread
From: Samuel Zhang @ 2025-07-04 10:12 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, rafael, len.brown, pavel,
	gregkh, dakr, airlied, simona, ray.huang, matthew.auld,
	matthew.brost, maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel, Samuel Zhang

When hibernate with data center dGPUs, huge number of VRAM data will be
moved to shmem during dev_pm_ops.prepare(). These shmem pages take a lot
of system memory so that there's no enough free memory for creating the
hibernation image. This will cause hibernation fail and abort.

After dev_pm_ops.prepare(), call shrink_all_memory() to force move shmem
pages to swap disk and reclaim the pages, so that there's enough system
memory for hibernation image and less pages needed to copy to the image.

This patch can only flush and free about half shmem pages. It will be
better to flush and free more pages, even all of shmem pages, so that
there're less pages to be copied to the hibernation image and the overall
hibernation time can be reduced.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 kernel/power/hibernate.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 10a01af63a80..7ae9d9a7aa1d 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -370,6 +370,23 @@ static int create_image(int platform_mode)
 	return error;
 }
 
+static void shrink_shmem_memory(void)
+{
+	struct sysinfo info;
+	unsigned long nr_shmem_pages, nr_freed_pages;
+
+	si_meminfo(&info);
+	nr_shmem_pages = info.sharedram; /* current page count used for shmem */
+	/*
+	 * The intent is to reclaim all shmem pages. Though shrink_all_memory() can
+	 * only reclaim about half of them, it's enough for creating the hibernation
+	 * image.
+	 */
+	nr_freed_pages = shrink_all_memory(nr_shmem_pages);
+	pr_debug("requested to reclaim %lu shmem pages, actually freed %lu pages\n",
+			nr_shmem_pages, nr_freed_pages);
+}
+
 /**
  * hibernation_snapshot - Quiesce devices and create a hibernation image.
  * @platform_mode: If set, use platform driver to prepare for the transition.
@@ -411,6 +428,15 @@ int hibernation_snapshot(int platform_mode)
 		goto Thaw;
 	}
 
+	/*
+	 * Device drivers may move lots of data to shmem in dpm_prepare(). The shmem
+	 * pages will use lots of system memory, causing hibernation image creation
+	 * fail due to insufficient free memory.
+	 * This call is to force flush the shmem pages to swap disk and reclaim
+	 * the system memory so that image creation can succeed.
+	 */
+	shrink_shmem_memory();
+
 	suspend_console();
 	pm_restrict_gfp_mask();
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 4/5] PM: hibernate: export variable pm_transition
  2025-07-04 10:12 [PATCH v2 0/5] reduce system memory requirement for hibernation Samuel Zhang
                   ` (2 preceding siblings ...)
  2025-07-04 10:12 ` [PATCH v2 3/5] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
@ 2025-07-04 10:12 ` Samuel Zhang
  2025-07-06 20:40   ` Mario Limonciello
  2025-07-04 10:12 ` [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation Samuel Zhang
  4 siblings, 1 reply; 17+ messages in thread
From: Samuel Zhang @ 2025-07-04 10:12 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, rafael, len.brown, pavel,
	gregkh, dakr, airlied, simona, ray.huang, matthew.auld,
	matthew.brost, maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel, Samuel Zhang

https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588
Per this kernel doc, dev_pm_ops.thaw() is called mainly for resume
storage devices for saving the hibernation image. Other devices that not
involved in the image saving do not need to resume the device.

But dev_pm_ops.thaw() is also called to restore devices when hibernation
is aborted due to some error in hibernation image creation stage.

So there need to be a way to query in thaw() to know if hibernation is
aborted or not and conditionally resume devices. Exported pm_transition
is such a way. When thaw() is called, the value is:
- PM_EVENT_THAW: normal hibernate, no need to resume non-storage devices.
- PM_EVENT_RECOVER: cancelled hibernation, need to resume devices.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/base/power/main.c | 3 ++-
 include/linux/pm.h        | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 40e1d8d8a589..d50f58c0121b 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -61,7 +61,8 @@ static LIST_HEAD(dpm_late_early_list);
 static LIST_HEAD(dpm_noirq_list);
 
 static DEFINE_MUTEX(dpm_list_mtx);
-static pm_message_t pm_transition;
+pm_message_t pm_transition;
+EXPORT_SYMBOL_GPL(pm_transition);
 
 static int async_error;
 
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 78855d794342..f01846852a90 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -657,6 +657,8 @@ struct pm_subsys_data {
 #define DPM_FLAG_SMART_SUSPEND		BIT(2)
 #define DPM_FLAG_MAY_SKIP_RESUME	BIT(3)
 
+extern pm_message_t pm_transition;
+
 struct dev_pm_info {
 	pm_message_t		power_state;
 	bool			can_wakeup:1;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation
  2025-07-04 10:12 [PATCH v2 0/5] reduce system memory requirement for hibernation Samuel Zhang
                   ` (3 preceding siblings ...)
  2025-07-04 10:12 ` [PATCH v2 4/5] PM: hibernate: export variable pm_transition Samuel Zhang
@ 2025-07-04 10:12 ` Samuel Zhang
  2025-07-06 20:34   ` Mario Limonciello
  4 siblings, 1 reply; 17+ messages in thread
From: Samuel Zhang @ 2025-07-04 10:12 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, rafael, len.brown, pavel,
	gregkh, dakr, airlied, simona, ray.huang, matthew.auld,
	matthew.brost, maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel, Samuel Zhang

For normal hibernation, GPU do not need to be resumed in thaw since it
is not involved in writing the hibernation image. Skip resume in this
case can reduce the hibernation time.

For cancelled hibernation, GPU need to be resumed.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4f8632737574..e064816aae4d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2541,6 +2541,10 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
 	if (amdgpu_ras_intr_triggered())
 		return;
 
+	/* device maybe not resumed here, return immediately in this case */
+	if (adev->in_s4 && adev->in_suspend)
+		return;
+
 	/* if we are running in a VM, make sure the device
 	 * torn down properly on reboot/shutdown.
 	 * unfortunately we can't detect certain
@@ -2655,6 +2659,10 @@ static int amdgpu_pmops_thaw(struct device *dev)
 {
 	struct drm_device *drm_dev = dev_get_drvdata(dev);
 
+	/* do not resume device for normal hibernation */
+	if (pm_transition.event == PM_EVENT_THAW)
+		return 0;
+
 	return amdgpu_device_resume(drm_dev, true);
 }
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation
  2025-07-04 10:12 ` [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation Samuel Zhang
@ 2025-07-06 20:34   ` Mario Limonciello
  2025-07-07  2:28     ` Lazar, Lijo
  0 siblings, 1 reply; 17+ messages in thread
From: Mario Limonciello @ 2025-07-06 20:34 UTC (permalink / raw)
  To: Samuel Zhang, alexander.deucher, christian.koenig, rafael,
	len.brown, pavel, gregkh, dakr, airlied, simona, ray.huang,
	matthew.auld, matthew.brost, maarten.lankhorst, mripard,
	tzimmermann
  Cc: lijo.lazar, victor.zhao, haijun.chang, Qing.Ma, linux-pm,
	linux-kernel, amd-gfx, dri-devel

On 7/4/2025 6:12 AM, Samuel Zhang wrote:
> For normal hibernation, GPU do not need to be resumed in thaw since it
> is not involved in writing the hibernation image. Skip resume in this
> case can reduce the hibernation time.

Since you have the measurements would you mind including them in the 
commit message for reference?

> 
> For cancelled hibernation, GPU need to be resumed.

If I'm following right you are actually handling two different things in 
this patch aren't you?

1) A change in thaw() to only resume on aborted hibernation
2) A change in shutdown() to skip running if the in s4 when shutdown() 
is called.

So I think it would be more logical to split this into two patches.

> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 4f8632737574..e064816aae4d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2541,6 +2541,10 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>   	if (amdgpu_ras_intr_triggered())
>   		return;
>   
> +	/* device maybe not resumed here, return immediately in this case */
> +	if (adev->in_s4 && adev->in_suspend)
> +		return;
> +
>   	/* if we are running in a VM, make sure the device
>   	 * torn down properly on reboot/shutdown.
>   	 * unfortunately we can't detect certain
> @@ -2655,6 +2659,10 @@ static int amdgpu_pmops_thaw(struct device *dev)
>   {
>   	struct drm_device *drm_dev = dev_get_drvdata(dev);
>   
> +	/* do not resume device for normal hibernation */
> +	if (pm_transition.event == PM_EVENT_THAW)
> +		return 0;
> +

Without digging into pm.h documentation I think it's not going to be 
very obvious next time we look at this code that amdgpu_device_resume() 
is only intended for the aborted case.

How would you feel about a switch/case?

Something like this:

switch (pm_transition.event) {
/* normal hibernation */
case PM_EVENT_THAW:
	return 0;
/* for aborted hibernation */
case PM_EVENT_RECOVER:
	return amdgpu_device_resume(drm_dev, true);
default:
	return -EOPNOTSUP;
}


>   	return amdgpu_device_resume(drm_dev, true);
>   }
>   


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/5] PM: hibernate: export variable pm_transition
  2025-07-04 10:12 ` [PATCH v2 4/5] PM: hibernate: export variable pm_transition Samuel Zhang
@ 2025-07-06 20:40   ` Mario Limonciello
  2025-07-07 19:17     ` Rafael J. Wysocki
  0 siblings, 1 reply; 17+ messages in thread
From: Mario Limonciello @ 2025-07-06 20:40 UTC (permalink / raw)
  To: Samuel Zhang, Rafael J. Wysocki
  Cc: lijo.lazar, victor.zhao, haijun.chang, Qing.Ma, linux-pm,
	linux-kernel, amd-gfx, dri-devel, alexander.deucher,
	christian.koenig, rafael, len.brown, pavel, gregkh, dakr, airlied,
	simona, ray.huang, matthew.auld, matthew.brost, maarten.lankhorst,
	mripard, tzimmermann

On 7/4/2025 6:12 AM, Samuel Zhang wrote:
> https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588
> Per this kernel doc, dev_pm_ops.thaw() is called mainly for resume

Proper way to do this is to put the URL in a 'Link' tag above your SoB. 
That being said I don't think we need to reference the rst file.  Just 
reference the html file.

Something like this:

Per the PCI power management documentation [1] dev_pm_ops.thaw() is 
called mainly for resume.

.
.
.

Link: https://docs.kernel.org/power/pci.html [1]
S-o-b: Foo bar <foo@bar.com>

> storage devices for saving the hibernation image. Other devices that not

that are not

> involved in the image saving do not need to resume the device.
> 
> But dev_pm_ops.thaw() is also called to restore devices when hibernation
> is aborted due to some error in hibernation image creation stage.
> 
> So there need to be a way to query in thaw() to know if hibernation is
> aborted or not and conditionally resume devices. Exported pm_transition
> is such a way. When thaw() is called, the value is:
> - PM_EVENT_THAW: normal hibernate, no need to resume non-storage devices.
> - PM_EVENT_RECOVER: cancelled hibernation, need to resume devices.

If these events are being exported out for driver use I think that we 
also need matching kernel doc exported too.

That is the comments in include/linux/pm.h need to be converted into 
kernel doc.

Before you make any changes like that though let's see what Rafael 
thinks of this approach.

He might not want to export this symbol out and would prefer a new 
helper for drivers to use like:

inline bool pm_aborted_hibernate();

If that's the direction he prefers you'll need to make kernel doc for 
that instead.

> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>   drivers/base/power/main.c | 3 ++-
>   include/linux/pm.h        | 2 ++
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index 40e1d8d8a589..d50f58c0121b 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -61,7 +61,8 @@ static LIST_HEAD(dpm_late_early_list);
>   static LIST_HEAD(dpm_noirq_list);
>   
>   static DEFINE_MUTEX(dpm_list_mtx);
> -static pm_message_t pm_transition;
> +pm_message_t pm_transition;
> +EXPORT_SYMBOL_GPL(pm_transition);
>   
>   static int async_error;
>   
> diff --git a/include/linux/pm.h b/include/linux/pm.h
> index 78855d794342..f01846852a90 100644
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -657,6 +657,8 @@ struct pm_subsys_data {
>   #define DPM_FLAG_SMART_SUSPEND		BIT(2)
>   #define DPM_FLAG_MAY_SKIP_RESUME	BIT(3)
>   
> +extern pm_message_t pm_transition;
> +
>   struct dev_pm_info {
>   	pm_message_t		power_state;
>   	bool			can_wakeup:1;


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation
  2025-07-04 10:12 ` [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation Samuel Zhang
@ 2025-07-06 20:41   ` Mario Limonciello
  2025-07-07  9:16   ` Christian König
  1 sibling, 0 replies; 17+ messages in thread
From: Mario Limonciello @ 2025-07-06 20:41 UTC (permalink / raw)
  To: Samuel Zhang, alexander.deucher, christian.koenig, rafael,
	len.brown, pavel, gregkh, dakr, airlied, simona, ray.huang,
	matthew.auld, matthew.brost, maarten.lankhorst, mripard,
	tzimmermann
  Cc: lijo.lazar, victor.zhao, haijun.chang, Qing.Ma, linux-pm,
	linux-kernel, amd-gfx, dri-devel

On 7/4/2025 6:12 AM, Samuel Zhang wrote:
> When hibernate with data center dGPUs, huge number of VRAM BOs evicted
> to GTT and takes too much system memory. This will cause hibernation
> fail due to insufficient memory for creating the hibernation image.
> 
> Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
> hibernation code to make room for hibernation image.
> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>   1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 27ab4e754b2a..a0b0682236e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -2414,6 +2414,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>   int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>   {
>   	struct ttm_resource_manager *man;
> +	int r;
>   
>   	switch (mem_type) {
>   	case TTM_PL_VRAM:
> @@ -2428,7 +2429,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>   		return -EINVAL;
>   	}
>   
> -	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	if (r) {
> +		DRM_ERROR("Failed to evict memory type %d\n", mem_type);

For new code can you please use the drm_err() macro instead.  This will 
help show which GPU had the problem with eviction.

> +		return r;
> +	}
> +	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
> +		r = ttm_device_prepare_hibernation();
> +		if (r)
> +			DRM_ERROR("Failed to swap out, %d\n", r);

For new code can you please use the drm_err() macro instead.

> +	}
> +	return r;
>   }
>   
>   #if defined(CONFIG_DEBUG_FS)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api
  2025-07-04 10:12 ` [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api Samuel Zhang
@ 2025-07-06 20:44   ` Mario Limonciello
  2025-07-07  0:48     ` Matthew Brost
  2025-07-07  9:13   ` Christian König
  1 sibling, 1 reply; 17+ messages in thread
From: Mario Limonciello @ 2025-07-06 20:44 UTC (permalink / raw)
  To: Samuel Zhang, alexander.deucher, christian.koenig, rafael,
	len.brown, pavel, gregkh, dakr, airlied, simona, ray.huang,
	matthew.auld, matthew.brost, maarten.lankhorst, mripard,
	tzimmermann
  Cc: lijo.lazar, victor.zhao, haijun.chang, Qing.Ma, linux-pm,
	linux-kernel, amd-gfx, dri-devel

On 7/4/2025 6:12 AM, Samuel Zhang wrote:
> This new api is used for hibernation to move GTT BOs to shmem after
> VRAM eviction. shmem will be flushed to swap disk later to reduce
> the system memory usage for hibernation.
> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>   drivers/gpu/drm/ttm/ttm_device.c | 29 +++++++++++++++++++++++++++++
>   include/drm/ttm/ttm_device.h     |  1 +
>   2 files changed, 30 insertions(+)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 02e797fd1891..19ab35ffeead 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -123,6 +123,35 @@ static int ttm_global_init(void)
>   	return ret;
>   }
>   
> +/**
> + * move GTT BOs to shmem for hibernation.
> + *
> + * returns 0 on success, negative on failure.
> + */
> +int ttm_device_prepare_hibernation(void)
> +{
> +	struct ttm_operation_ctx ctx = {
> +		.interruptible = false,
> +		.no_wait_gpu = false,
> +		.force_alloc = true
> +	};
> +	struct ttm_global *glob = &ttm_glob;
> +	struct ttm_device *bdev;
> +	int ret = 0;
> +
> +	mutex_lock(&ttm_global_mutex);
> +	list_for_each_entry(bdev, &glob->device_list, device_list) {
> +		do {
> +			ret = ttm_device_swapout(bdev, &ctx, GFP_KERNEL);
> +		} while (ret > 0);
> +		if (ret < 0)
> +			break;
> +	}
> +	mutex_unlock(&ttm_global_mutex);
> +	return ret;

I'd personally rather see scoped guard here so you can return 
immediately and the guard will clean up but up to Christian what he thinks.

int ret;

scoped_guard(mutex, &ttm_global_mutex) {
	list_for_each_entry(bdev, &glob->device_list, device_list) {
		do {
			ret = ttm_device_swapout(bdev, &ctx, GFP_KERNEL);
		} while (ret > 0);
		if (ret)
			return ret;
}

return 0;

> +}
> +EXPORT_SYMBOL(ttm_device_prepare_hibernation);
> +
>   /*
>    * A buffer object shrink method that tries to swap out the first
>    * buffer object on the global::swap_lru list.
> diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
> index 39b8636b1845..b45498b398dd 100644
> --- a/include/drm/ttm/ttm_device.h
> +++ b/include/drm/ttm/ttm_device.h
> @@ -272,6 +272,7 @@ struct ttm_device {
>   int ttm_global_swapout(struct ttm_operation_ctx *ctx, gfp_t gfp_flags);
>   int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
>   		       gfp_t gfp_flags);
> +int ttm_device_prepare_hibernation(void);
>   
>   static inline struct ttm_resource_manager *
>   ttm_manager_type(struct ttm_device *bdev, int mem_type)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api
  2025-07-06 20:44   ` Mario Limonciello
@ 2025-07-07  0:48     ` Matthew Brost
  0 siblings, 0 replies; 17+ messages in thread
From: Matthew Brost @ 2025-07-07  0:48 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: Samuel Zhang, alexander.deucher, christian.koenig, rafael,
	len.brown, pavel, gregkh, dakr, airlied, simona, ray.huang,
	matthew.auld, maarten.lankhorst, mripard, tzimmermann, lijo.lazar,
	victor.zhao, haijun.chang, Qing.Ma, linux-pm, linux-kernel,
	amd-gfx, dri-devel

On Sun, Jul 06, 2025 at 04:44:27PM -0400, Mario Limonciello wrote:
> On 7/4/2025 6:12 AM, Samuel Zhang wrote:
> > This new api is used for hibernation to move GTT BOs to shmem after
> > VRAM eviction. shmem will be flushed to swap disk later to reduce
> > the system memory usage for hibernation.
> > 
> > Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> > ---
> >   drivers/gpu/drm/ttm/ttm_device.c | 29 +++++++++++++++++++++++++++++
> >   include/drm/ttm/ttm_device.h     |  1 +
> >   2 files changed, 30 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> > index 02e797fd1891..19ab35ffeead 100644
> > --- a/drivers/gpu/drm/ttm/ttm_device.c
> > +++ b/drivers/gpu/drm/ttm/ttm_device.c
> > @@ -123,6 +123,35 @@ static int ttm_global_init(void)
> >   	return ret;
> >   }
> > +/**
> > + * move GTT BOs to shmem for hibernation.
> > + *
> > + * returns 0 on success, negative on failure.
> > + */
> > +int ttm_device_prepare_hibernation(void)
> > +{
> > +	struct ttm_operation_ctx ctx = {
> > +		.interruptible = false,
> > +		.no_wait_gpu = false,
> > +		.force_alloc = true
> > +	};
> > +	struct ttm_global *glob = &ttm_glob;
> > +	struct ttm_device *bdev;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&ttm_global_mutex);
> > +	list_for_each_entry(bdev, &glob->device_list, device_list) {
> > +		do {
> > +			ret = ttm_device_swapout(bdev, &ctx, GFP_KERNEL);
> > +		} while (ret > 0);
> > +		if (ret < 0)
> > +			break;
> > +	}
> > +	mutex_unlock(&ttm_global_mutex);
> > +	return ret;
> 
> I'd personally rather see scoped guard here so you can return immediately
> and the guard will clean up but up to Christian what he thinks.
> 
> int ret;
> 
> scoped_guard(mutex, &ttm_global_mutex) {

guard(mutex)(&ttm_global_mutex) would be more apporiate for this as the
scope of the guard is the entire function.

Matt

> 	list_for_each_entry(bdev, &glob->device_list, device_list) {
> 		do {
> 			ret = ttm_device_swapout(bdev, &ctx, GFP_KERNEL);
> 		} while (ret > 0);
> 		if (ret)
> 			return ret;
> }
> 
> return 0;
> 
> > +}
> > +EXPORT_SYMBOL(ttm_device_prepare_hibernation);
> > +
> >   /*
> >    * A buffer object shrink method that tries to swap out the first
> >    * buffer object on the global::swap_lru list.
> > diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
> > index 39b8636b1845..b45498b398dd 100644
> > --- a/include/drm/ttm/ttm_device.h
> > +++ b/include/drm/ttm/ttm_device.h
> > @@ -272,6 +272,7 @@ struct ttm_device {
> >   int ttm_global_swapout(struct ttm_operation_ctx *ctx, gfp_t gfp_flags);
> >   int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
> >   		       gfp_t gfp_flags);
> > +int ttm_device_prepare_hibernation(void);
> >   static inline struct ttm_resource_manager *
> >   ttm_manager_type(struct ttm_device *bdev, int mem_type)
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation
  2025-07-06 20:34   ` Mario Limonciello
@ 2025-07-07  2:28     ` Lazar, Lijo
  2025-07-07  2:29       ` Mario Limonciello
  0 siblings, 1 reply; 17+ messages in thread
From: Lazar, Lijo @ 2025-07-07  2:28 UTC (permalink / raw)
  To: Mario Limonciello, Samuel Zhang, alexander.deucher,
	christian.koenig, rafael, len.brown, pavel, gregkh, dakr, airlied,
	simona, ray.huang, matthew.auld, matthew.brost, maarten.lankhorst,
	mripard, tzimmermann
  Cc: victor.zhao, haijun.chang, Qing.Ma, linux-pm, linux-kernel,
	amd-gfx, dri-devel



On 7/7/2025 2:04 AM, Mario Limonciello wrote:
> On 7/4/2025 6:12 AM, Samuel Zhang wrote:
>> For normal hibernation, GPU do not need to be resumed in thaw since it
>> is not involved in writing the hibernation image. Skip resume in this
>> case can reduce the hibernation time.
> 
> Since you have the measurements would you mind including them in the
> commit message for reference?
> 
>>
>> For cancelled hibernation, GPU need to be resumed.
> 
> If I'm following right you are actually handling two different things in
> this patch aren't you?
> 
> 1) A change in thaw() to only resume on aborted hibernation
> 2) A change in shutdown() to skip running if the in s4 when shutdown()
> is called.
> 
> So I think it would be more logical to split this into two patches.
> 

This is doing only one thing - Keep the device in suspended state for
thaw() operation during a successful hibernation. Splitting into two
could break hibernation during integration of the first part - it will
attempt another suspend during shutdown. I think we don't take care of
consecutive suspend calls.

Thanks,
Lijo

>>
>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/
>> drm/amd/amdgpu/amdgpu_drv.c
>> index 4f8632737574..e064816aae4d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2541,6 +2541,10 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>>       if (amdgpu_ras_intr_triggered())
>>           return;
>>   +    /* device maybe not resumed here, return immediately in this
>> case */
>> +    if (adev->in_s4 && adev->in_suspend)
>> +        return;
>> +
>>       /* if we are running in a VM, make sure the device
>>        * torn down properly on reboot/shutdown.
>>        * unfortunately we can't detect certain
>> @@ -2655,6 +2659,10 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>   {
>>       struct drm_device *drm_dev = dev_get_drvdata(dev);
>>   +    /* do not resume device for normal hibernation */
>> +    if (pm_transition.event == PM_EVENT_THAW)
>> +        return 0;
>> +
> 
> Without digging into pm.h documentation I think it's not going to be
> very obvious next time we look at this code that amdgpu_device_resume()
> is only intended for the aborted case.
> 
> How would you feel about a switch/case?
> 
> Something like this:
> 
> switch (pm_transition.event) {
> /* normal hibernation */
> case PM_EVENT_THAW:
>     return 0;
> /* for aborted hibernation */
> case PM_EVENT_RECOVER:
>     return amdgpu_device_resume(drm_dev, true);
> default:
>     return -EOPNOTSUP;
> }
> 
> 
>>       return amdgpu_device_resume(drm_dev, true);
>>   }
>>   
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation
  2025-07-07  2:28     ` Lazar, Lijo
@ 2025-07-07  2:29       ` Mario Limonciello
  0 siblings, 0 replies; 17+ messages in thread
From: Mario Limonciello @ 2025-07-07  2:29 UTC (permalink / raw)
  To: Lazar, Lijo, Samuel Zhang, alexander.deucher, christian.koenig,
	rafael, len.brown, pavel, gregkh, dakr, airlied, simona,
	ray.huang, matthew.auld, matthew.brost, maarten.lankhorst,
	mripard, tzimmermann
  Cc: victor.zhao, haijun.chang, Qing.Ma, linux-pm, linux-kernel,
	amd-gfx, dri-devel

On 7/6/2025 10:28 PM, Lazar, Lijo wrote:
> 
> 
> On 7/7/2025 2:04 AM, Mario Limonciello wrote:
>> On 7/4/2025 6:12 AM, Samuel Zhang wrote:
>>> For normal hibernation, GPU do not need to be resumed in thaw since it
>>> is not involved in writing the hibernation image. Skip resume in this
>>> case can reduce the hibernation time.
>>
>> Since you have the measurements would you mind including them in the
>> commit message for reference?
>>
>>>
>>> For cancelled hibernation, GPU need to be resumed.
>>
>> If I'm following right you are actually handling two different things in
>> this patch aren't you?
>>
>> 1) A change in thaw() to only resume on aborted hibernation
>> 2) A change in shutdown() to skip running if the in s4 when shutdown()
>> is called.
>>
>> So I think it would be more logical to split this into two patches.
>>
> 
> This is doing only one thing - Keep the device in suspended state for
> thaw() operation during a successful hibernation. Splitting into two
> could break hibernation during integration of the first part - it will
> attempt another suspend during shutdown. I think we don't take care of
> consecutive suspend calls.
> 
> Thanks,
> Lijo

Got it; thanks for clarification.

> 
>>>
>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/
>>> drm/amd/amdgpu/amdgpu_drv.c
>>> index 4f8632737574..e064816aae4d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -2541,6 +2541,10 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>>>        if (amdgpu_ras_intr_triggered())
>>>            return;
>>>    +    /* device maybe not resumed here, return immediately in this
>>> case */
>>> +    if (adev->in_s4 && adev->in_suspend)
>>> +        return;
>>> +
>>>        /* if we are running in a VM, make sure the device
>>>         * torn down properly on reboot/shutdown.
>>>         * unfortunately we can't detect certain
>>> @@ -2655,6 +2659,10 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>    {
>>>        struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>    +    /* do not resume device for normal hibernation */
>>> +    if (pm_transition.event == PM_EVENT_THAW)
>>> +        return 0;
>>> +
>>
>> Without digging into pm.h documentation I think it's not going to be
>> very obvious next time we look at this code that amdgpu_device_resume()
>> is only intended for the aborted case.
>>
>> How would you feel about a switch/case?
>>
>> Something like this:
>>
>> switch (pm_transition.event) {
>> /* normal hibernation */
>> case PM_EVENT_THAW:
>>      return 0;
>> /* for aborted hibernation */
>> case PM_EVENT_RECOVER:
>>      return amdgpu_device_resume(drm_dev, true);
>> default:
>>      return -EOPNOTSUP;
>> }
>>
>>
>>>        return amdgpu_device_resume(drm_dev, true);
>>>    }
>>>    
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api
  2025-07-04 10:12 ` [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api Samuel Zhang
  2025-07-06 20:44   ` Mario Limonciello
@ 2025-07-07  9:13   ` Christian König
  1 sibling, 0 replies; 17+ messages in thread
From: Christian König @ 2025-07-07  9:13 UTC (permalink / raw)
  To: Samuel Zhang, alexander.deucher, rafael, len.brown, pavel, gregkh,
	dakr, airlied, simona, ray.huang, matthew.auld, matthew.brost,
	maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel

On 04.07.25 12:12, Samuel Zhang wrote:
> This new api is used for hibernation to move GTT BOs to shmem after
> VRAM eviction. shmem will be flushed to swap disk later to reduce
> the system memory usage for hibernation.
> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/ttm/ttm_device.c | 29 +++++++++++++++++++++++++++++
>  include/drm/ttm/ttm_device.h     |  1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 02e797fd1891..19ab35ffeead 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -123,6 +123,35 @@ static int ttm_global_init(void)
>  	return ret;
>  }
>  
> +/**
> + * move GTT BOs to shmem for hibernation.
> + *
> + * returns 0 on success, negative on failure.
> + */
> +int ttm_device_prepare_hibernation(void)

This needs the device as argument.

> +{
> +	struct ttm_operation_ctx ctx = {
> +		.interruptible = false,
> +		.no_wait_gpu = false,
> +		.force_alloc = true
> +	};
> +	struct ttm_global *glob = &ttm_glob;
> +	struct ttm_device *bdev;
> +	int ret = 0;
> +
> +	mutex_lock(&ttm_global_mutex);
> +	list_for_each_entry(bdev, &glob->device_list, device_list) {
> +		do {
> +			ret = ttm_device_swapout(bdev, &ctx, GFP_KERNEL);
> +		} while (ret > 0);
> +		if (ret < 0)
> +			break;
> +	}

In other words call ttm_device_swapout() in a loop for a specific device and not for all devices.

Regards,
Christian.

> +	mutex_unlock(&ttm_global_mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ttm_device_prepare_hibernation);
> +
>  /*
>   * A buffer object shrink method that tries to swap out the first
>   * buffer object on the global::swap_lru list.
> diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
> index 39b8636b1845..b45498b398dd 100644
> --- a/include/drm/ttm/ttm_device.h
> +++ b/include/drm/ttm/ttm_device.h
> @@ -272,6 +272,7 @@ struct ttm_device {
>  int ttm_global_swapout(struct ttm_operation_ctx *ctx, gfp_t gfp_flags);
>  int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
>  		       gfp_t gfp_flags);
> +int ttm_device_prepare_hibernation(void);
>  
>  static inline struct ttm_resource_manager *
>  ttm_manager_type(struct ttm_device *bdev, int mem_type)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation
  2025-07-04 10:12 ` [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation Samuel Zhang
  2025-07-06 20:41   ` Mario Limonciello
@ 2025-07-07  9:16   ` Christian König
  1 sibling, 0 replies; 17+ messages in thread
From: Christian König @ 2025-07-07  9:16 UTC (permalink / raw)
  To: Samuel Zhang, alexander.deucher, rafael, len.brown, pavel, gregkh,
	dakr, airlied, simona, ray.huang, matthew.auld, matthew.brost,
	maarten.lankhorst, mripard, tzimmermann
  Cc: mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel



On 04.07.25 12:12, Samuel Zhang wrote:
> When hibernate with data center dGPUs, huge number of VRAM BOs evicted
> to GTT and takes too much system memory. This will cause hibernation
> fail due to insufficient memory for creating the hibernation image.
> 
> Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
> hibernation code to make room for hibernation image.
> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 27ab4e754b2a..a0b0682236e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -2414,6 +2414,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>  int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  {
>  	struct ttm_resource_manager *man;
> +	int r;
>  
>  	switch (mem_type) {
>  	case TTM_PL_VRAM:
> @@ -2428,7 +2429,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  		return -EINVAL;
>  	}
>  
> -	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	if (r) {
> +		DRM_ERROR("Failed to evict memory type %d\n", mem_type);
> +		return r;
> +	}
> +	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
> +		r = ttm_device_prepare_hibernation();
> +		if (r)
> +			DRM_ERROR("Failed to swap out, %d\n", r);
> +	}
> +	return r;

That call needs to go into a separate amdgpu_ttm_* function and only be called from amdgpu_device_evict_resources().

Otherwise the debugfs tests will trigger it as well which is undesirable.

Regards,
Christian.

>  }
>  
>  #if defined(CONFIG_DEBUG_FS)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/5] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
  2025-07-04 10:12 ` [PATCH v2 3/5] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
@ 2025-07-07 18:45   ` Rafael J. Wysocki
  0 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2025-07-07 18:45 UTC (permalink / raw)
  To: Samuel Zhang
  Cc: alexander.deucher, christian.koenig, rafael, len.brown, pavel,
	gregkh, dakr, airlied, simona, ray.huang, matthew.auld,
	matthew.brost, maarten.lankhorst, mripard, tzimmermann,
	mario.limonciello, lijo.lazar, victor.zhao, haijun.chang, Qing.Ma,
	linux-pm, linux-kernel, amd-gfx, dri-devel

On Fri, Jul 4, 2025 at 12:14 PM Samuel Zhang <guoqing.zhang@amd.com> wrote:
>
> When hibernate with data center dGPUs, huge number of VRAM data will be
> moved to shmem during dev_pm_ops.prepare(). These shmem pages take a lot
> of system memory so that there's no enough free memory for creating the
> hibernation image. This will cause hibernation fail and abort.
>
> After dev_pm_ops.prepare(), call shrink_all_memory() to force move shmem
> pages to swap disk and reclaim the pages, so that there's enough system
> memory for hibernation image and less pages needed to copy to the image.
>
> This patch can only flush and free about half shmem pages. It will be
> better to flush and free more pages, even all of shmem pages, so that
> there're less pages to be copied to the hibernation image and the overall
> hibernation time can be reduced.
>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>

No fundamental objections from me, so

Acked-by: Rafael J. Wysocki <rafael@kernel.org>

> ---
>  kernel/power/hibernate.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 10a01af63a80..7ae9d9a7aa1d 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -370,6 +370,23 @@ static int create_image(int platform_mode)
>         return error;
>  }
>
> +static void shrink_shmem_memory(void)
> +{
> +       struct sysinfo info;
> +       unsigned long nr_shmem_pages, nr_freed_pages;
> +
> +       si_meminfo(&info);
> +       nr_shmem_pages = info.sharedram; /* current page count used for shmem */
> +       /*
> +        * The intent is to reclaim all shmem pages. Though shrink_all_memory() can
> +        * only reclaim about half of them, it's enough for creating the hibernation
> +        * image.
> +        */
> +       nr_freed_pages = shrink_all_memory(nr_shmem_pages);
> +       pr_debug("requested to reclaim %lu shmem pages, actually freed %lu pages\n",
> +                       nr_shmem_pages, nr_freed_pages);
> +}
> +
>  /**
>   * hibernation_snapshot - Quiesce devices and create a hibernation image.
>   * @platform_mode: If set, use platform driver to prepare for the transition.
> @@ -411,6 +428,15 @@ int hibernation_snapshot(int platform_mode)
>                 goto Thaw;
>         }
>
> +       /*
> +        * Device drivers may move lots of data to shmem in dpm_prepare(). The shmem
> +        * pages will use lots of system memory, causing hibernation image creation
> +        * fail due to insufficient free memory.
> +        * This call is to force flush the shmem pages to swap disk and reclaim
> +        * the system memory so that image creation can succeed.
> +        */
> +       shrink_shmem_memory();
> +
>         suspend_console();
>         pm_restrict_gfp_mask();
>
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/5] PM: hibernate: export variable pm_transition
  2025-07-06 20:40   ` Mario Limonciello
@ 2025-07-07 19:17     ` Rafael J. Wysocki
  0 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2025-07-07 19:17 UTC (permalink / raw)
  To: Mario Limonciello, Samuel Zhang
  Cc: lijo.lazar, victor.zhao, haijun.chang, Qing.Ma, linux-pm,
	linux-kernel, amd-gfx, dri-devel, alexander.deucher,
	christian.koenig, len.brown, pavel, gregkh, dakr, airlied, simona,
	ray.huang, matthew.auld, matthew.brost, maarten.lankhorst,
	mripard, tzimmermann

On Sun, Jul 6, 2025 at 10:40 PM Mario Limonciello
<mario.limonciello@amd.com> wrote:
>
> On 7/4/2025 6:12 AM, Samuel Zhang wrote:
> > https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588
> > Per this kernel doc, dev_pm_ops.thaw() is called mainly for resume
>
> Proper way to do this is to put the URL in a 'Link' tag above your SoB.
> That being said I don't think we need to reference the rst file.  Just
> reference the html file.
>
> Something like this:
>
> Per the PCI power management documentation [1] dev_pm_ops.thaw() is
> called mainly for resume.
>
> .
> .
> .
>
> Link: https://docs.kernel.org/power/pci.html [1]
> S-o-b: Foo bar <foo@bar.com>
>
> > storage devices for saving the hibernation image. Other devices that not
>
> that are not
>
> > involved in the image saving do not need to resume the device.
> >
> > But dev_pm_ops.thaw() is also called to restore devices when hibernation
> > is aborted due to some error in hibernation image creation stage.

This isn't factually correct.

dev_pm_ops.thaw() can be called in an error path in two cases: (1) the
"freeze" transition before the creation of a memory snapshot image
fails and (2) the "freeze" transition during restore (before jumping
back to the image kernel) fails.

> > So there need to be a way to query in thaw() to know if hibernation is
> > aborted or not and conditionally resume devices. Exported pm_transition
> > is such a way. When thaw() is called, the value is:
> > - PM_EVENT_THAW: normal hibernate, no need to resume non-storage devices.
> > - PM_EVENT_RECOVER: cancelled hibernation, need to resume devices.
>
> If these events are being exported out for driver use I think that we
> also need matching kernel doc exported too.
>
> That is the comments in include/linux/pm.h need to be converted into
> kernel doc.
>
> Before you make any changes like that though let's see what Rafael
> thinks of this approach.
>
> He might not want to export this symbol out and would prefer a new
> helper for drivers to use like:
>
> inline bool pm_aborted_hibernate();
>
> If that's the direction he prefers you'll need to make kernel doc for
> that instead.

I would prefer a wrapper around pm_transition returning pm_transition.event.

It can be called pm_transition_event() even as far as I'm concerned.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-07-07 19:17 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-04 10:12 [PATCH v2 0/5] reduce system memory requirement for hibernation Samuel Zhang
2025-07-04 10:12 ` [PATCH v2 1/5] drm/ttm: add ttm_device_prepare_hibernation() api Samuel Zhang
2025-07-06 20:44   ` Mario Limonciello
2025-07-07  0:48     ` Matthew Brost
2025-07-07  9:13   ` Christian König
2025-07-04 10:12 ` [PATCH v2 2/5] drm/amdgpu: move GTT to shmem after eviction for hibernation Samuel Zhang
2025-07-06 20:41   ` Mario Limonciello
2025-07-07  9:16   ` Christian König
2025-07-04 10:12 ` [PATCH v2 3/5] PM: hibernate: shrink shmem pages after dev_pm_ops.prepare() Samuel Zhang
2025-07-07 18:45   ` Rafael J. Wysocki
2025-07-04 10:12 ` [PATCH v2 4/5] PM: hibernate: export variable pm_transition Samuel Zhang
2025-07-06 20:40   ` Mario Limonciello
2025-07-07 19:17     ` Rafael J. Wysocki
2025-07-04 10:12 ` [PATCH v2 5/5] drm/amdgpu: do not resume device in thaw for normal hibernation Samuel Zhang
2025-07-06 20:34   ` Mario Limonciello
2025-07-07  2:28     ` Lazar, Lijo
2025-07-07  2:29       ` Mario Limonciello

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).