* [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover
@ 2023-10-26 13:34 Yifan Zhang
  2023-10-26 14:05 ` Christian König
  0 siblings, 1 reply; 2+ messages in thread
From: Yifan Zhang @ 2023-10-26 13:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Tim.Huang, Yifan Zhang, christian.koenig,
	li.ma

The GPU TLB flush is skipped while the reset semaphore is held. This
makes mes_self_test fail, since it involves add_hw_queue/remove_hw_queue,
which need a functional TLB flush. Remove mes_self_test from the GPU
recover sequence.
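
The failure mode described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: the
struct and the model_* functions are hypothetical names invented for
illustration, not amdgpu APIs, and a plain flag stands in for the reset
semaphore.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in amdgpu. */
struct model_dev {
	bool reset_sem_held;      /* stands in for the reset semaphore */
	unsigned int tlb_flushes; /* number of flushes actually performed */
};

/* Perform a flush only when no reset is in progress; otherwise skip it. */
static bool model_flush_tlb(struct model_dev *dev)
{
	if (dev->reset_sem_held)
		return false;     /* reset in progress: flush is skipped */
	dev->tlb_flushes++;
	return true;
}

/*
 * Stand-in for a self test that adds and removes a hardware queue and
 * therefore depends on the flush having really happened.
 */
static int model_self_test(struct model_dev *dev)
{
	return model_flush_tlb(dev) ? 0 : -1;
}
```

In this model, calling model_self_test() with reset_sem_held set always
returns -1 and leaves tlb_flushes unchanged, which mirrors why the self
test cannot succeed when it is run from inside the recover path.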

This patch fixes the recovery failure on gfx11.

[ 1831.768292] [drm] ring sdma_32769.3.3 was added
[ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass
[ 1831.768337] [drm] ring compute_32769.2.2 ib test pass
[ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process  pid 0 thread  pid 0)
[ 1831.768434] amdgpu 0000:c2:00.0: amdgpu:   in page starting at address 0x0000aec200000000 from client 10
[ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30
[ 1831.768473] amdgpu 0000:c2:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
[ 1831.768489] amdgpu 0000:c2:00.0: amdgpu:      MORE_FAULTS: 0x0
[ 1831.768501] amdgpu 0000:c2:00.0: amdgpu:      WALKER_ERROR: 0x0
[ 1831.768513] amdgpu 0000:c2:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[ 1831.768521] amdgpu 0000:c2:00.0: amdgpu:      MAPPING_ERROR: 0x0
[ 1831.768529] amdgpu 0000:c2:00.0: amdgpu:      RW: 0x0
[ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110)
[ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3

Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2")
Reported-by: Li Ma <li.ma@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7ec32b44df05..5169e55b7fd2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5557,10 +5557,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			drm_sched_start(&ring->sched, true);
 		}
 
-		if (adev->enable_mes &&
-		    amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3))
-			amdgpu_mes_self_test(tmp_adev);
-
 		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
 			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
 
-- 
2.37.3



* Re: [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover
  2023-10-26 13:34 [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover Yifan Zhang
@ 2023-10-26 14:05 ` Christian König
  0 siblings, 0 replies; 2+ messages in thread
From: Christian König @ 2023-10-26 14:05 UTC (permalink / raw)
  To: Yifan Zhang, amd-gfx
  Cc: Alexander.Deucher, Tim.Huang, christian.koenig, li.ma

On 26.10.23 at 15:34, Yifan Zhang wrote:
> The GPU TLB flush is skipped while the reset semaphore is held. This
> makes mes_self_test fail, since it involves add_hw_queue/remove_hw_queue,
> which need a functional TLB flush. Remove mes_self_test from the GPU
> recover sequence.

Oh, the TLB issue is actually only the tip of the iceberg. That only 
worked by coincidence.

During GPU reset you are not allowed to make any memory allocations, 
otherwise you can run into a deadlock. So doing a MES test is a complete 
no-go in the first place.

>
> This patch fixes the recovery failure on gfx11.
>
> [ 1831.768292] [drm] ring sdma_32769.3.3 was added
> [ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass
> [ 1831.768337] [drm] ring compute_32769.2.2 ib test pass
> [ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process  pid 0 thread  pid 0)
> [ 1831.768434] amdgpu 0000:c2:00.0: amdgpu:   in page starting at address 0x0000aec200000000 from client 10
> [ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30
> [ 1831.768473] amdgpu 0000:c2:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
> [ 1831.768489] amdgpu 0000:c2:00.0: amdgpu:      MORE_FAULTS: 0x0
> [ 1831.768501] amdgpu 0000:c2:00.0: amdgpu:      WALKER_ERROR: 0x0
> [ 1831.768513] amdgpu 0000:c2:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
> [ 1831.768521] amdgpu 0000:c2:00.0: amdgpu:      MAPPING_ERROR: 0x0
> [ 1831.768529] amdgpu 0000:c2:00.0: amdgpu:      RW: 0x0
> [ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110)
> [ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
> [ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3
>
> Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2")
> Reported-by: Li Ma <li.ma@amd.com>
> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

Thanks for looking into this,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
>   1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7ec32b44df05..5169e55b7fd2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5557,10 +5557,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			drm_sched_start(&ring->sched, true);
>   		}
>   
> -		if (adev->enable_mes &&
> -		    amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3))
> -			amdgpu_mes_self_test(tmp_adev);
> -
>   		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
>   			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
>   

