From: "Christian König" <christian.koenig@amd.com>
To: Felix Kuehling <felix.kuehling@amd.com>,
Yunxiang Li <Yunxiang.Li@amd.com>,
amd-gfx@lists.freedesktop.org
Cc: Alexander.Deucher@amd.com, Likun.Gao@amd.com, Hawking.Zhang@amd.com
Subject: Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
Date: Fri, 31 May 2024 08:52:01 +0200 [thread overview]
Message-ID: <c848dcc1-0816-4c2b-bc87-8d5918046cd7@amd.com> (raw)
In-Reply-To: <29490f15-eb5f-4438-bfd7-ffdebff6e678@amd.com>
Am 31.05.24 um 00:02 schrieb Felix Kuehling:
> On 2024-05-28 13:23, Yunxiang Li wrote:
>> These functions are missing the lock for reset domain.
>>
>> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 4 +++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>> drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>> 3 files changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> index eb172388d99e..ddc5e9972da8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> @@ -34,6 +34,7 @@
>> #include <asm/set_memory.h>
>> #endif
>> #include "amdgpu.h"
>> +#include "amdgpu_reset.h"
>> #include <drm/drm_drv.h>
>> #include <drm/ttm/ttm_tt.h>
>> @@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct
>> amdgpu_device *adev)
>> {
>> int i;
>> - if (!adev->gart.ptr)
>> + if (!adev->gart.ptr ||
>> !down_read_trylock(&adev->reset_domain->sem))
>> return;
>> mb();
>> amdgpu_device_flush_hdp(adev, NULL);
>> for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>> amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
>> + up_read(&adev->reset_domain->sem);
>> }
>> /**
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index e4742b65032d..52a3170d15b7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct
>> drm_sched_job *sched_job)
>> dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
>> ring->name);
>> } else {
>> - r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>> - &fence);
>> + r = -ETIME;
>> + if (down_read_trylock(&adev->reset_domain->sem)) {
>> + r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>> + job, &fence);
>> + up_read(&adev->reset_domain->sem);
>> + }
>> if (r)
>> dev_err(adev->dev,
>> "Error scheduling IBs (%d) in ring(%s)", r,
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> index 86ea610b16f3..21f5a1fb3bf8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> @@ -28,6 +28,7 @@
>> #include "kfd_priv.h"
>> #include "kfd_kernel_queue.h"
>> #include "amdgpu_amdkfd.h"
>> +#include "amdgpu_reset.h"
>> static inline struct process_queue_node *get_queue_by_qid(
>> struct process_queue_manager *pqm, unsigned int qid)
>> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct
>> kfd_process_device *pdd)
>> return;
>> dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
>> - if (dev->kfd->shared_resources.enable_mes)
>> - amdgpu_mes_flush_shader_debugger(dev->adev,
>> pdd->proc_ctx_gpu_addr);
>> + if (dev->kfd->shared_resources.enable_mes &&
>> + down_read_trylock(&dev->adev->reset_domain->sem)) {
>> + amdgpu_mes_flush_shader_debugger(dev->adev,
>> + pdd->proc_ctx_gpu_addr);
>> +
>
> It's not clear to me what's the requirement for reset domain locking
> around MES calls. We have a lot more of them in
> kfd_device_queue_manager.c (mostly calling adev->mes.funcs->...
> directly). Do they all need to be wrapped individually?
Whenever you call a MES function (or any other function directly
interacting with the rings or the HW registers) you need to make sure
that at least the read side of the reset lock is held.
Regards,
Christian.
>
> Regards,
> Felix
>
>
>> up_read(&dev->adev->reset_domain->sem);
>> + }
>> pdd->already_dequeued = true;
>> }
next prev parent reply other threads:[~2024-05-31 6:52 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-29 6:36 ` Christian König
2024-05-28 17:23 ` [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-05-29 6:41 ` Christian König
2024-05-28 17:23 ` [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started Yunxiang Li
2024-05-29 6:38 ` Christian König
2024-05-29 13:22 ` Li, Yunxiang (Teddy)
2024-05-29 13:31 ` Christian König
2024-05-29 13:44 ` Li, Yunxiang (Teddy)
2024-05-29 13:55 ` Christian König
2024-05-29 14:31 ` Li, Yunxiang (Teddy)
2024-05-29 14:35 ` Christian König
2024-05-29 14:48 ` Li, Yunxiang (Teddy)
2024-05-29 15:19 ` Christian König
2024-05-31 14:44 ` Liu, Shaoyun
2024-06-03 10:58 ` Christian König
2024-06-03 18:28 ` Liu, Shaoyun
2024-06-04 8:07 ` Christian König
2024-06-05 12:32 ` Liu, Shaoyun
2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-29 6:41 ` Christian König
2024-05-29 23:04 ` Felix Kuehling
2024-05-30 0:06 ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-29 6:43 ` Christian König
2024-05-28 17:23 ` [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-29 6:45 ` Christian König
2024-05-28 17:23 ` [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-29 6:46 ` Christian König
2024-05-28 17:23 ` [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-29 6:49 ` Christian König
2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-29 6:55 ` Christian König
2024-05-30 22:02 ` Felix Kuehling
2024-05-30 22:35 ` Li, Yunxiang (Teddy)
2024-05-31 6:52 ` Christian König [this message]
2024-05-31 15:47 ` Felix Kuehling
2024-06-04 12:52 ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" Yunxiang Li
2024-05-28 19:04 ` Skvortsov, Victor
2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-30 21:47 ` [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-30 21:47 ` [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-06-05 1:12 ` Deng, Emily
2024-05-30 21:48 ` [PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-30 21:48 ` [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-31 6:50 ` Christian König
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c848dcc1-0816-4c2b-bc87-8d5918046cd7@amd.com \
--to=christian.koenig@amd.com \
--cc=Alexander.Deucher@amd.com \
--cc=Hawking.Zhang@amd.com \
--cc=Likun.Gao@amd.com \
--cc=Yunxiang.Li@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=felix.kuehling@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox