From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Felix Kuehling <felix.kuehling@amd.com>,
Philip Yang <philip.yang@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
Sasha Levin <sashal@kernel.org>,
Felix.Kuehling@amd.com, christian.koenig@amd.com,
Xinhui.Pan@amd.com, airlied@gmail.com, daniel@ffwll.ch,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.6 19/31] drm/amdkfd: Fix lock dependency warning
Date: Sun, 28 Jan 2024 11:12:49 -0500 [thread overview]
Message-ID: <20240128161315.201999-19-sashal@kernel.org> (raw)
In-Reply-To: <20240128161315.201999-1-sashal@kernel.org>
From: Felix Kuehling <felix.kuehling@amd.com>
[ Upstream commit 47bf0f83fc86df1bf42b385a91aadb910137c5c9 ]
======================================================
WARNING: possible circular locking dependency detected
6.5.0-kfd-fkuehlin #276 Not tainted
------------------------------------------------------
kworker/8:2/2676 is trying to acquire lock:
ffff9435aae95c88 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}, at: __flush_work+0x52/0x550
but task is already holding lock:
ffff9435cd8e1720 (&svms->lock){+.+.}-{3:3}, at: svm_range_deferred_list_work+0xe8/0x340 [amdgpu]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&svms->lock){+.+.}-{3:3}:
__mutex_lock+0x97/0xd30
kfd_ioctl_alloc_memory_of_gpu+0x6d/0x3c0 [amdgpu]
kfd_ioctl+0x1b2/0x5d0 [amdgpu]
__x64_sys_ioctl+0x86/0xc0
do_syscall_64+0x39/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #1 (&mm->mmap_lock){++++}-{3:3}:
down_read+0x42/0x160
svm_range_evict_svm_bo_worker+0x8b/0x340 [amdgpu]
process_one_work+0x27a/0x540
worker_thread+0x53/0x3e0
kthread+0xeb/0x120
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x11/0x20
-> #0 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}:
__lock_acquire+0x1426/0x2200
lock_acquire+0xc1/0x2b0
__flush_work+0x80/0x550
__cancel_work_timer+0x109/0x190
svm_range_bo_release+0xdc/0x1c0 [amdgpu]
svm_range_free+0x175/0x180 [amdgpu]
svm_range_deferred_list_work+0x15d/0x340 [amdgpu]
process_one_work+0x27a/0x540
worker_thread+0x53/0x3e0
kthread+0xeb/0x120
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x11/0x20
other info that might help us debug this:
Chain exists of:
(work_completion)(&svm_bo->eviction_work) --> &mm->mmap_lock --> &svms->lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&svms->lock);
lock(&mm->mmap_lock);
lock(&svms->lock);
lock((work_completion)(&svm_bo->eviction_work));
I believe this cannot really lead to a deadlock in practice, because
svm_range_evict_svm_bo_worker only takes the mmap_read_lock if the BO
refcount is non-0. That means it's impossible that svm_range_bo_release
is running concurrently. However, there is no good way to annotate this.
To avoid the problem, take a BO reference in
svm_range_schedule_evict_svm_bo instead of in the worker. That way it's
impossible for a BO to get freed while eviction work is pending and the
cancel_work_sync call in svm_range_bo_release can be eliminated.
v2: Use svm_bo_ref_unless_zero and explained why that's safe. Also
removed redundant checks that are already done in
amdkfd_fence_enable_signaling.
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 26 ++++++++++----------------
1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 8e368e4659fd..a4c911fa1675 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -391,14 +391,9 @@ static void svm_range_bo_release(struct kref *kref)
spin_lock(&svm_bo->list_lock);
}
spin_unlock(&svm_bo->list_lock);
- if (!dma_fence_is_signaled(&svm_bo->eviction_fence->base)) {
- /* We're not in the eviction worker.
- * Signal the fence and synchronize with any
- * pending eviction work.
- */
+ if (!dma_fence_is_signaled(&svm_bo->eviction_fence->base))
+ /* We're not in the eviction worker. Signal the fence. */
dma_fence_signal(&svm_bo->eviction_fence->base);
- cancel_work_sync(&svm_bo->eviction_work);
- }
dma_fence_put(&svm_bo->eviction_fence->base);
amdgpu_bo_unref(&svm_bo->bo);
kfree(svm_bo);
@@ -3424,13 +3419,14 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
int svm_range_schedule_evict_svm_bo(struct amdgpu_amdkfd_fence *fence)
{
- if (!fence)
- return -EINVAL;
-
- if (dma_fence_is_signaled(&fence->base))
- return 0;
-
- if (fence->svm_bo) {
+ /* Dereferencing fence->svm_bo is safe here because the fence hasn't
+ * signaled yet and we're under the protection of the fence->lock.
+ * After the fence is signaled in svm_range_bo_release, we cannot get
+ * here any more.
+ *
+ * Reference is dropped in svm_range_evict_svm_bo_worker.
+ */
+ if (svm_bo_ref_unless_zero(fence->svm_bo)) {
WRITE_ONCE(fence->svm_bo->evicting, 1);
schedule_work(&fence->svm_bo->eviction_work);
}
@@ -3445,8 +3441,6 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
int r = 0;
svm_bo = container_of(work, struct svm_range_bo, eviction_work);
- if (!svm_bo_ref_unless_zero(svm_bo))
- return; /* svm_bo was freed while eviction was pending */
if (mmget_not_zero(svm_bo->eviction_fence->mm)) {
mm = svm_bo->eviction_fence->mm;
--
2.43.0
next prev parent reply other threads:[~2024-01-28 16:13 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-28 16:12 [PATCH AUTOSEL 6.6 01/31] PCI: Only override AMD USB controller if required Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 02/31] PCI: switchtec: Fix stdev_release() crash after surprise hot remove Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 03/31] perf cs-etm: Bump minimum OpenCSD version to ensure a bugfix is present Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 04/31] xhci: fix possible null pointer deref during xhci urb enqueue Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 05/31] extcon: fix possible name leak in extcon_dev_register() Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 06/31] usb: hub: Replace hardcoded quirk value with BIT() macro Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 07/31] usb: hub: Add quirk to decrease IN-ep poll interval for Microchip USB491x hub Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 08/31] selftests/sgx: Fix linker script asserts Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 09/31] tty: allow TIOCSLCKTRMIOS with CAP_CHECKPOINT_RESTORE Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 10/31] fs/kernfs/dir: obey S_ISGID Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 11/31] spmi: mediatek: Fix UAF on device remove Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 12/31] power: supply: qcom_battmgr: Register the power supplies after PDR is up Sasha Levin
2024-01-29 13:04 ` Sebastian Reichel
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 13/31] PCI: Fix 64GT/s effective data rate calculation Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 14/31] PCI/AER: Decode Requester ID when no error info found Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 15/31] 9p: Fix initialisation of netfs_inode for 9p Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 16/31] usb: xhci-plat: fix usb disconnect issue after s4 Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 17/31] misc: lis3lv02d_i2c: Add missing setting of the reg_ctrl callback Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 18/31] libsubcmd: Fix memory leak in uniq() Sasha Levin
2024-01-28 16:12 ` Sasha Levin [this message]
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 20/31] drm/amdkfd: Fix lock dependency warning with srcu Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 21/31] virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 22/31] blk-mq: fix IO hang from sbitmap wakeup race Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 23/31] ceph: reinitialize mds feature bit even when session in open Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 24/31] ceph: fix deadlock or deadcode of misusing dget() Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 25/31] ceph: fix invalid pointer access if get_quota_realm return ERR_PTR Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 26/31] drm/amdgpu: fix avg vs input power reporting on smu7 Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 27/31] drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 'get_platform_power_management_table()' Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 28/31] drm/amdgpu: Fix with right return code '-EIO' in 'amdgpu_gmc_vram_checking()' Sasha Levin
2024-01-28 16:12 ` [PATCH AUTOSEL 6.6 29/31] drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()' Sasha Levin
2024-01-28 16:13 ` [PATCH AUTOSEL 6.6 30/31] drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()' Sasha Levin
2024-01-28 16:13 ` [PATCH AUTOSEL 6.6 31/31] i2c: rk3x: Adjust mask/value offset for i2c2 on rv1126 Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240128161315.201999-19-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=Xinhui.Pan@amd.com \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=daniel@ffwll.ch \
--cc=dri-devel@lists.freedesktop.org \
--cc=felix.kuehling@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=philip.yang@amd.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox