[PATCH v2] drm/amdgpu: fix gpu page fault after hibernation on PF passthrough

* [PATCH v2] drm/amdgpu: fix gpu page fault after hibernation on PF passthrough
@ 2025-11-03  4:05 Samuel Zhang
From: Samuel Zhang @ 2025-11-03  4:05 UTC
  To: lijo.lazar
  Cc: victor.zhao, haijun.chang, Qing.Ma, Owen.Zhang2, amd-gfx,
	Samuel Zhang

In a PF passthrough environment, running coralgemm after a hibernate and
resume cycle causes a GPU page fault.

A mode1 reset happens during hibernation, but the partition mode is not
restored on resume, so the registers mmCP_HYP_XCP_CTL and mmCP_PSP_XCP_CTL
hold incorrect values after resume. When the CP accesses the MQD BO, it
uses the wrong stride size, which causes an out-of-bounds access on the
MQD BO and results in a page fault.

The fix is to ensure gfx_v9_4_3_switch_compute_partition() is called
when resuming from hibernation.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c | 3 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
index 811124ff88a8..75357e8a35b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
+++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
@@ -407,7 +407,8 @@ static int aqua_vanjaram_switch_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr,
 		return -EINVAL;
 	}
 
-	if (adev->kfd.init_complete && !amdgpu_in_reset(adev))
+	if (adev->kfd.init_complete && !amdgpu_in_reset(adev) &&
+		!adev->in_s4)
 		flags |= AMDGPU_XCP_OPS_KFD;
 
 	if (flags & AMDGPU_XCP_OPS_KFD) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index c4c551ef6b87..a12c72213a79 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -2291,7 +2291,9 @@ static int gfx_v9_4_3_cp_resume(struct amdgpu_device *adev)
 		r = amdgpu_xcp_init(adev->xcp_mgr, num_xcp, mode);
 
 	} else {
-		if (amdgpu_xcp_query_partition_mode(adev->xcp_mgr,
+		if (adev->in_s4) /* Restore if resuming from suspend */
+			amdgpu_xcp_restore_partition_mode(adev->xcp_mgr);
+		else if (amdgpu_xcp_query_partition_mode(adev->xcp_mgr,
 						    AMDGPU_XCP_FL_NONE) ==
 		    AMDGPU_UNKNOWN_COMPUTE_PARTITION_MODE)
 			r = amdgpu_xcp_switch_partition_mode(
-- 
2.27.0
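
To illustrate the failure mode the commit message describes, here is a
minimal, self-contained sketch of how a reader using the wrong stride can
run past the end of a buffer. It is not driver code: struct mqd_bo_model,
the helper names, and the sizes are hypothetical and chosen only to show
the arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not amdgpu code: a buffer holding fixed-size slots
 * laid out back to back. */
struct mqd_bo_model {
	size_t stride;    /* bytes per slot, as the buffer was allocated */
	size_t num_slots; /* number of slots in the buffer */
};

/* Total size of the modelled buffer in bytes. */
static size_t mqd_bo_size(const struct mqd_bo_model *bo)
{
	return bo->stride * bo->num_slots;
}

/* Byte offset of slot idx when the reader assumes hw_stride. */
static size_t mqd_offset(size_t idx, size_t hw_stride)
{
	return idx * hw_stride;
}

/* True if reading one slot at idx with hw_stride stays inside the buffer. */
static bool mqd_access_in_bounds(const struct mqd_bo_model *bo, size_t idx,
				 size_t hw_stride)
{
	assert(idx < bo->num_slots);
	return mqd_offset(idx, hw_stride) + hw_stride <= mqd_bo_size(bo);
}
```

With a buffer of four 4096-byte slots, slot 3 is in bounds when the reader
also assumes a 4096-byte stride, but falls outside the buffer if the
reader assumes 8192 bytes per slot.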


* Re: [PATCH v2] drm/amdgpu: fix gpu page fault after hibernation on PF passthrough
  2025-11-03  4:05 [PATCH v2] drm/amdgpu: fix gpu page fault after hibernation on PF passthrough Samuel Zhang
@ 2025-11-03  8:59 ` Lazar, Lijo
  0 siblings, 0 replies; 2+ messages in thread
From: Lazar, Lijo @ 2025-11-03  8:59 UTC
  To: Samuel Zhang; +Cc: victor.zhao, haijun.chang, Qing.Ma, Owen.Zhang2, amd-gfx



On 11/3/2025 9:35 AM, Samuel Zhang wrote:
> In a PF passthrough environment, running coralgemm after a hibernate and
> resume cycle causes a GPU page fault.
> 
> A mode1 reset happens during hibernation, but the partition mode is not
> restored on resume, so the registers mmCP_HYP_XCP_CTL and mmCP_PSP_XCP_CTL
> hold incorrect values after resume. When the CP accesses the MQD BO, it
> uses the wrong stride size, which causes an out-of-bounds access on the
> MQD BO and results in a page fault.
> 
> The fix is to ensure gfx_v9_4_3_switch_compute_partition() is called
> when resuming from hibernation.
> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c | 3 ++-
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 4 +++-
>   2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> index 811124ff88a8..75357e8a35b2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> @@ -407,7 +407,8 @@ static int aqua_vanjaram_switch_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr,
>   		return -EINVAL;
>   	}
>   
> -	if (adev->kfd.init_complete && !amdgpu_in_reset(adev))
> +	if (adev->kfd.init_complete && !amdgpu_in_reset(adev) &&
> +		!adev->in_s4)

The logic should be the same for a generic suspend scenario. Please make 
it in_suspend (here and in the cp_resume call).

Thanks,
Lijo

>   		flags |= AMDGPU_XCP_OPS_KFD;
>   
>   	if (flags & AMDGPU_XCP_OPS_KFD) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index c4c551ef6b87..a12c72213a79 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -2291,7 +2291,9 @@ static int gfx_v9_4_3_cp_resume(struct amdgpu_device *adev)
>   		r = amdgpu_xcp_init(adev->xcp_mgr, num_xcp, mode);
>   
>   	} else {
> -		if (amdgpu_xcp_query_partition_mode(adev->xcp_mgr,
> +		if (adev->in_s4) /* Restore if resuming from suspend */
> +			amdgpu_xcp_restore_partition_mode(adev->xcp_mgr);
> +		else if (amdgpu_xcp_query_partition_mode(adev->xcp_mgr,
>   						    AMDGPU_XCP_FL_NONE) ==
>   		    AMDGPU_UNKNOWN_COMPUTE_PARTITION_MODE)
>   			r = amdgpu_xcp_switch_partition_mode(
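
The review comment in this reply asks for the check to cover any suspend,
not only hibernation. A rough sketch of what that condition might look
like follows. It uses a stand-in struct instead of the real
struct amdgpu_device; the helper names, the enum, and the flag value are
hypothetical, and it assumes in_suspend is set for every suspend path,
including S4. It is not the actual follow-up patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the few pieces of device state the two hunks look at. */
struct adev_model {
	bool kfd_init_complete; /* models adev->kfd.init_complete */
	bool in_reset;          /* models amdgpu_in_reset(adev) */
	bool in_suspend;        /* assumed true during any suspend or resume */
};

#define XCP_OPS_KFD_MODEL 0x1u /* hypothetical flag value */

enum resume_action { ACTION_NONE, ACTION_RESTORE, ACTION_SWITCH };

/* Models the aqua_vanjaram hunk: skip the KFD ops while suspending. */
static unsigned int xcp_switch_flags(const struct adev_model *adev,
				     unsigned int flags)
{
	assert(adev != NULL);
	if (adev->kfd_init_complete && !adev->in_reset && !adev->in_suspend)
		flags |= XCP_OPS_KFD_MODEL;
	return flags;
}

/* Models the cp_resume hunk: restore the saved mode on resume, otherwise
 * switch only if the current mode is unknown. */
static enum resume_action cp_resume_action(const struct adev_model *adev,
					   bool mode_known)
{
	assert(adev != NULL);
	if (adev->in_suspend)
		return ACTION_RESTORE;
	if (!mode_known)
		return ACTION_SWITCH;
	return ACTION_NONE;
}
```

Keying both decisions on the same field keeps the two hunks consistent:
whenever the KFD ops are skipped because of a suspend, the resume path
takes the restore branch.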

