* [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover
@ 2023-10-26 13:34 Yifan Zhang
2023-10-26 14:05 ` Christian König
0 siblings, 1 reply; 2+ messages in thread
From: Yifan Zhang @ 2023-10-26 13:34 UTC (permalink / raw)
To: amd-gfx; +Cc: Alexander.Deucher, Tim.Huang, Yifan Zhang, christian.koenig, li.ma
The GPU TLB flush is skipped while the reset semaphore is held. This
makes mes_self_test fail, since it calls add_hw_queue/remove_hw_queue,
which need a functional TLB flush. Remove mes_self_test from the GPU
recover sequence.
This patch fixes the recover failure on gfx11.
[ 1831.768292] [drm] ring sdma_32769.3.3 was added
[ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass
[ 1831.768337] [drm] ring compute_32769.2.2 ib test pass
[ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process pid 0 thread pid 0)
[ 1831.768434] amdgpu 0000:c2:00.0: amdgpu: in page starting at address 0x0000aec200000000 from client 10
[ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30
[ 1831.768473] amdgpu 0000:c2:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 1831.768489] amdgpu 0000:c2:00.0: amdgpu: MORE_FAULTS: 0x0
[ 1831.768501] amdgpu 0000:c2:00.0: amdgpu: WALKER_ERROR: 0x0
[ 1831.768513] amdgpu 0000:c2:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 1831.768521] amdgpu 0000:c2:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 1831.768529] amdgpu 0000:c2:00.0: amdgpu: RW: 0x0
[ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110)
[ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3
Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2")
Reported-by: Li Ma <li.ma@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7ec32b44df05..5169e55b7fd2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5557,10 +5557,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
drm_sched_start(&ring->sched, true);
}
- if (adev->enable_mes &&
- amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3))
- amdgpu_mes_self_test(tmp_adev);
-
if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
--
2.37.3
* Re: [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover
2023-10-26 13:34 [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover Yifan Zhang
@ 2023-10-26 14:05 ` Christian König
0 siblings, 0 replies; 2+ messages in thread
From: Christian König @ 2023-10-26 14:05 UTC (permalink / raw)
To: Yifan Zhang, amd-gfx
Cc: Alexander.Deucher, Tim.Huang, christian.koenig, li.ma
On 26.10.23 at 15:34, Yifan Zhang wrote:
> The GPU TLB flush is skipped while the reset semaphore is held. This
> makes mes_self_test fail, since it calls add_hw_queue/remove_hw_queue,
> which need a functional TLB flush. Remove mes_self_test from the GPU
> recover sequence.
Oh, the TLB issue is actually only the tip of the iceberg. That only
worked by coincidence.
During GPU reset you are not allowed to make any memory allocations,
otherwise you can run into a deadlock. So doing a MES test is a complete
no-go in the first place.
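For readers following along, the TLB part of the failure described in the patch can be sketched with a toy model. This is only an illustration and not amdgpu code: struct toy_dev and the toy_* functions, the reset_in_progress flag and the -1 error value are all hypothetical names made up here. The sketch models only the "flush is skipped while the reset lock is held" behaviour; it does not model the memory allocation deadlock mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for device state; NOT the real struct amdgpu_device. */
struct toy_dev {
	bool reset_in_progress;	/* models the reset semaphore being held */
	int tlb_flushes;	/* number of flushes actually performed */
	bool tlb_stale;		/* mappings changed without a flush */
};

/* Models a TLB flush that is skipped while a reset is in progress. */
static bool toy_flush_tlb(struct toy_dev *dev)
{
	assert(dev);
	if (dev->reset_in_progress)
		return false;	/* skipped: the reset path owns the lock */
	dev->tlb_flushes++;
	dev->tlb_stale = false;
	return true;
}

/* Models a self test that adds a queue mapping and needs a working flush. */
static int toy_self_test(struct toy_dev *dev)
{
	dev->tlb_stale = true;	/* new mapping created for the test queue */
	toy_flush_tlb(dev);
	return dev->tlb_stale ? -1 : 0;	/* -1 is an arbitrary failure code */
}

/* Models the recover sequence, with or without the self test inside it. */
static int toy_recover(struct toy_dev *dev, bool run_self_test)
{
	int ret = 0;

	dev->reset_in_progress = true;
	if (run_self_test)
		ret = toy_self_test(dev);	/* flush is skipped, test fails */
	dev->reset_in_progress = false;
	return ret;
}
```

In this model, toy_recover(&dev, true) fails because the flush never happens while reset_in_progress is set, whereas running toy_self_test() after recovery has finished succeeds.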
>
> This patch fixes the recover failure on gfx11.
>
> [ 1831.768292] [drm] ring sdma_32769.3.3 was added
> [ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass
> [ 1831.768337] [drm] ring compute_32769.2.2 ib test pass
> [ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process pid 0 thread pid 0)
> [ 1831.768434] amdgpu 0000:c2:00.0: amdgpu: in page starting at address 0x0000aec200000000 from client 10
> [ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30
> [ 1831.768473] amdgpu 0000:c2:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
> [ 1831.768489] amdgpu 0000:c2:00.0: amdgpu: MORE_FAULTS: 0x0
> [ 1831.768501] amdgpu 0000:c2:00.0: amdgpu: WALKER_ERROR: 0x0
> [ 1831.768513] amdgpu 0000:c2:00.0: amdgpu: PERMISSION_FAULTS: 0x3
> [ 1831.768521] amdgpu 0000:c2:00.0: amdgpu: MAPPING_ERROR: 0x0
> [ 1831.768529] amdgpu 0000:c2:00.0: amdgpu: RW: 0x0
> [ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110)
> [ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
> [ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3
>
> Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2")
> Reported-by: Li Ma <li.ma@amd.com>
> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Thanks for looking into this,
Christian.
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7ec32b44df05..5169e55b7fd2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5557,10 +5557,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> drm_sched_start(&ring->sched, true);
> }
>
> - if (adev->enable_mes &&
> - amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3))
> - amdgpu_mes_self_test(tmp_adev);
> -
> if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
> drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
>