From: Felix Kuehling <felix.kuehling@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
"Yunxiang Li" <Yunxiang.Li@amd.com>,
amd-gfx@lists.freedesktop.org
Cc: Alexander.Deucher@amd.com, Likun.Gao@amd.com, Hawking.Zhang@amd.com
Subject: Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
Date: Fri, 31 May 2024 11:47:43 -0400 [thread overview]
Message-ID: <4758e38d-09af-495c-b10b-d54ed908514f@amd.com> (raw)
In-Reply-To: <c848dcc1-0816-4c2b-bc87-8d5918046cd7@amd.com>
On 2024-05-31 2:52, Christian König wrote:
> Am 31.05.24 um 00:02 schrieb Felix Kuehling:
>> On 2024-05-28 13:23, Yunxiang Li wrote:
>>> These functions are missing the lock for reset domain.
>>>
>>> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 4 +++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>> drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>>> 3 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> index eb172388d99e..ddc5e9972da8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> @@ -34,6 +34,7 @@
>>> #include <asm/set_memory.h>
>>> #endif
>>> #include "amdgpu.h"
>>> +#include "amdgpu_reset.h"
>>> #include <drm/drm_drv.h>
>>> #include <drm/ttm/ttm_tt.h>
>>> @@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
>>> {
>>> int i;
>>> - if (!adev->gart.ptr)
>>> + if (!adev->gart.ptr || !down_read_trylock(&adev->reset_domain->sem))
>>> return;
>>> mb();
>>> amdgpu_device_flush_hdp(adev, NULL);
>>> for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>>> amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
>>> + up_read(&adev->reset_domain->sem);
>>> }
>>> /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index e4742b65032d..52a3170d15b7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>>> dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
>>> ring->name);
>>> } else {
>>> - r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>>> - &fence);
>>> + r = -ETIME;
>>> + if (down_read_trylock(&adev->reset_domain->sem)) {
>>> + r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>> + job, &fence);
>>> + up_read(&adev->reset_domain->sem);
>>> + }
>>> if (r)
>>> dev_err(adev->dev,
>>> "Error scheduling IBs (%d) in ring(%s)", r,
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> index 86ea610b16f3..21f5a1fb3bf8 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> @@ -28,6 +28,7 @@
>>> #include "kfd_priv.h"
>>> #include "kfd_kernel_queue.h"
>>> #include "amdgpu_amdkfd.h"
>>> +#include "amdgpu_reset.h"
>>> static inline struct process_queue_node *get_queue_by_qid(
>>> struct process_queue_manager *pqm, unsigned int qid)
>>> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
>>> return;
>>> dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
>>> - if (dev->kfd->shared_resources.enable_mes)
>>> - amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
>>> + if (dev->kfd->shared_resources.enable_mes &&
>>> + down_read_trylock(&dev->adev->reset_domain->sem)) {
>>> + amdgpu_mes_flush_shader_debugger(dev->adev,
>>> + pdd->proc_ctx_gpu_addr);
>>> +
>>
>> It's not clear to me what's the requirement for reset domain locking around MES calls. We have a lot more of them in kfd_device_queue_manager.c (mostly calling adev->mes.funcs->... directly). Do they all need to be wrapped individually?
>
> Whenever you call a MES function (or any other function directly interacting with the rings or the HW registers) you need to make sure that at least the read side of the reset lock is held.
Having to do that for each caller of amdgpu_mes functions seems error prone.
Would it make sense to wrap that inside amdgpu_mes_lock/unlock? Maybe turn it into amdgpu_mes_trylock/unlock and make sure that all the amdgpu_mes functions that take that lock can fail and return an error code. Add an attribute so the compiler can flag callers that ignore the return values. This would make it easier to let the compiler spot places that don't handle errors due to reset lock failures.
Regards,
Felix
>
> Regards,
> Christian.
>
>>
>> Regards,
>> Felix
>>
>>
>>> up_read(&dev->adev->reset_domain->sem);
>>> + }
>>> pdd->already_dequeued = true;
>>> }
>
next prev parent reply other threads:[~2024-05-31 15:48 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-29 6:36 ` Christian König
2024-05-28 17:23 ` [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-05-29 6:41 ` Christian König
2024-05-28 17:23 ` [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started Yunxiang Li
2024-05-29 6:38 ` Christian König
2024-05-29 13:22 ` Li, Yunxiang (Teddy)
2024-05-29 13:31 ` Christian König
2024-05-29 13:44 ` Li, Yunxiang (Teddy)
2024-05-29 13:55 ` Christian König
2024-05-29 14:31 ` Li, Yunxiang (Teddy)
2024-05-29 14:35 ` Christian König
2024-05-29 14:48 ` Li, Yunxiang (Teddy)
2024-05-29 15:19 ` Christian König
2024-05-31 14:44 ` Liu, Shaoyun
2024-06-03 10:58 ` Christian König
2024-06-03 18:28 ` Liu, Shaoyun
2024-06-04 8:07 ` Christian König
2024-06-05 12:32 ` Liu, Shaoyun
2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-29 6:41 ` Christian König
2024-05-29 23:04 ` Felix Kuehling
2024-05-30 0:06 ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-29 6:43 ` Christian König
2024-05-28 17:23 ` [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-29 6:45 ` Christian König
2024-05-28 17:23 ` [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-29 6:46 ` Christian König
2024-05-28 17:23 ` [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-29 6:49 ` Christian König
2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-29 6:55 ` Christian König
2024-05-30 22:02 ` Felix Kuehling
2024-05-30 22:35 ` Li, Yunxiang (Teddy)
2024-05-31 6:52 ` Christian König
2024-05-31 15:47 ` Felix Kuehling [this message]
2024-06-04 12:52 ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" Yunxiang Li
2024-05-28 19:04 ` Skvortsov, Victor
2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-30 21:47 ` [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-30 21:47 ` [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-06-05 1:12 ` Deng, Emily
2024-05-30 21:48 ` [PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-31 6:50 ` Christian König
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4758e38d-09af-495c-b10b-d54ed908514f@amd.com \
--to=felix.kuehling@amd.com \
--cc=Alexander.Deucher@amd.com \
--cc=Hawking.Zhang@amd.com \
--cc=Likun.Gao@amd.com \
--cc=Yunxiang.Li@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox