From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Yifan Zhang <yifan1.zhang@amd.com>, amd-gfx@lists.freedesktop.org
Cc: Alexander.Deucher@amd.com, Tim.Huang@amd.com,
	christian.koenig@amd.com, li.ma@amd.com
Subject: Re: [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover
Date: Thu, 26 Oct 2023 16:05:47 +0200
Message-ID: <9ddff9df-ea17-465c-964e-cdfef10e2c6b@gmail.com>
In-Reply-To: <20231026133436.1716057-1-yifan1.zhang@amd.com>

On 26.10.23 at 15:34, Yifan Zhang wrote:
> The GPU TLB flush is skipped while the reset semaphore is held. This
> makes mes_self_test fail, since it involves add_hw_queue/remove_hw_queue,
> which need a functional TLB flush. Remove mes_self_test from the GPU
> recover sequence.

Oh, the TLB issue is actually only the tip of the iceberg. That only
worked by coincidence.

During GPU reset you are not allowed to make any memory allocations,
otherwise you can run into a deadlock. So doing a MES test is a complete
no-go in the first place.
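
To illustrate the constraint, here is a minimal user-space sketch in C. It is
not amdgpu code: in_reset, reset_begin(), reset_end(), guarded_alloc(),
self_test() and recover() are all hypothetical names, and refusing the
allocation merely stands in for the deadlock a real allocation could cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag: true while the simulated reset is in flight. */
static bool in_reset;

static void reset_begin(void)
{
	in_reset = true;
}

static void reset_end(void)
{
	assert(in_reset);	/* reset_end() must pair with reset_begin() */
	in_reset = false;
}

/*
 * Hypothetical allocator wrapper. It refuses to allocate during the
 * reset; in this model the refusal stands in for the deadlock risk.
 */
static void *guarded_alloc(size_t size)
{
	if (in_reset)
		return NULL;
	return malloc(size);
}

/*
 * Stand-in for a self test that needs memory (for example to set up a
 * queue). Returns 0 on success, -1 if the allocation was refused.
 */
static int self_test(void)
{
	void *queue = guarded_alloc(64);

	if (!queue)
		return -1;
	free(queue);
	return 0;
}

/* Model of a recovery path, with or without the self test inside it. */
static int recover(bool run_self_test_inside_reset)
{
	int ret = 0;

	reset_begin();
	if (run_self_test_inside_reset)
		ret = self_test();	/* needs memory while in_reset is set */
	reset_end();
	return ret;
}
```

In this model recover(true) fails because the self test asks for memory inside
the reset window, while recover(false) succeeds and self_test() still works
once the reset has finished. That is the same shape as the fix here: the test
is taken out of the recover sequence rather than made to work inside it.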

>
> This patch fixes the recovery failure on gfx11.
>
> [ 1831.768292] [drm] ring sdma_32769.3.3 was added
> [ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass
> [ 1831.768337] [drm] ring compute_32769.2.2 ib test pass
> [ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process  pid 0 thread  pid 0)
> [ 1831.768434] amdgpu 0000:c2:00.0: amdgpu:   in page starting at address 0x0000aec200000000 from client 10
> [ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30
> [ 1831.768473] amdgpu 0000:c2:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
> [ 1831.768489] amdgpu 0000:c2:00.0: amdgpu:      MORE_FAULTS: 0x0
> [ 1831.768501] amdgpu 0000:c2:00.0: amdgpu:      WALKER_ERROR: 0x0
> [ 1831.768513] amdgpu 0000:c2:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
> [ 1831.768521] amdgpu 0000:c2:00.0: amdgpu:      MAPPING_ERROR: 0x0
> [ 1831.768529] amdgpu 0000:c2:00.0: amdgpu:      RW: 0x0
> [ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110)
> [ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
> [ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3
>
> Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2")
> Reported-by: Li Ma <li.ma@amd.com>
> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

Thanks for looking into this,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
>   1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7ec32b44df05..5169e55b7fd2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5557,10 +5557,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			drm_sched_start(&ring->sched, true);
>   		}
>   
> -		if (adev->enable_mes &&
> -		    amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3))
> -			amdgpu_mes_self_test(tmp_adev);
> -
>   		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
>   			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
>   

