* [PATCH] drm/amdgpu: Fix a race of IB test @ 2021-09-11 1:55 xinhui pan 2021-09-11 7:45 ` Christian König 0 siblings, 1 reply; 6+ messages in thread From: xinhui pan @ 2021-09-11 1:55 UTC (permalink / raw) To: amd-gfx; +Cc: alexander.deucher, christian.koenig, xinhui pan Direct IB submission should be exclusive. So use write lock. Signed-off-by: xinhui pan <xinhui.pan@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 19323b4cce7b..acbe02928791 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) } /* Avoid accidently unparking the sched thread during GPU reset */ - r = down_read_killable(&adev->reset_sem); + r = down_write_killable(&adev->reset_sem); if (r) return r; + /* Avoid concurrently IB test but not cancel it as I don't know whether we + * would add more code in the delayed init work. + */ + flush_delayed_work(&adev->delayed_init_work); + /* hold on the scheduler */ for (i = 0; i < AMDGPU_MAX_RINGS; i++) { struct amdgpu_ring *ring = adev->rings[i]; @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) kthread_unpark(ring->sched.thread); } - up_read(&adev->reset_sem); + up_write(&adev->reset_sem); pm_runtime_mark_last_busy(dev->dev); pm_runtime_put_autosuspend(dev->dev); -- 2.25.1 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a race of IB test 2021-09-11 1:55 [PATCH] drm/amdgpu: Fix a race of IB test xinhui pan @ 2021-09-11 7:45 ` Christian König 2021-09-11 10:18 ` 回复: " Pan, Xinhui 0 siblings, 1 reply; 6+ messages in thread From: Christian König @ 2021-09-11 7:45 UTC (permalink / raw) To: xinhui pan, amd-gfx; +Cc: alexander.deucher, christian.koenig Am 11.09.21 um 03:55 schrieb xinhui pan: > Direct IB submission should be exclusive. So use write lock. > > Signed-off-by: xinhui pan <xinhui.pan@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > index 19323b4cce7b..acbe02928791 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) > } > > /* Avoid accidently unparking the sched thread during GPU reset */ > - r = down_read_killable(&adev->reset_sem); > + r = down_write_killable(&adev->reset_sem); > if (r) > return r; > > + /* Avoid concurrently IB test but not cancel it as I don't know whether we > + * would add more code in the delayed init work. > + */ > + flush_delayed_work(&adev->delayed_init_work); > + That won't work. It's at least theoretical possible that the delayed init work waits for the reset_sem which we are holding here. Very unlikely to happen, but lockdep might be able to point that out with a nice backtrace in the logs. On the other hand delayed init work and direct IB test through this interface should work at the same time, so I would just drop it. Christian. > /* hold on the scheduler */ > for (i = 0; i < AMDGPU_MAX_RINGS; i++) { > struct amdgpu_ring *ring = adev->rings[i]; > @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) > kthread_unpark(ring->sched.thread); > } > > - up_read(&adev->reset_sem); > + up_write(&adev->reset_sem); > > pm_runtime_mark_last_busy(dev->dev); > pm_runtime_put_autosuspend(dev->dev); ^ permalink raw reply [flat|nested] 6+ messages in thread
* 回复: [PATCH] drm/amdgpu: Fix a race of IB test 2021-09-11 7:45 ` Christian König @ 2021-09-11 10:18 ` Pan, Xinhui 2021-09-13 6:19 ` Christian König 0 siblings, 1 reply; 6+ messages in thread From: Pan, Xinhui @ 2021-09-11 10:18 UTC (permalink / raw) To: Christian König, amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander, Koenig, Christian [AMD Official Use Only] For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing. But I am not aware why delayed init work would try to lock reset_sem. delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method. But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed. I will drop this flush in V2. ________________________________________ 发件人: Christian König <ckoenig.leichtzumerken@gmail.com> 发送时间: 2021年9月11日 15:45 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org 抄送: Deucher, Alexander; Koenig, Christian 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test Am 11.09.21 um 03:55 schrieb xinhui pan: > Direct IB submission should be exclusive. So use write lock. > > Signed-off-by: xinhui pan <xinhui.pan@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > index 19323b4cce7b..acbe02928791 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) > } > > /* Avoid accidently unparking the sched thread during GPU reset */ > - r = down_read_killable(&adev->reset_sem); > + r = down_write_killable(&adev->reset_sem); > if (r) > return r; > > + /* Avoid concurrently IB test but not cancel it as I don't know whether we > + * would add more code in the delayed init work. > + */ > + flush_delayed_work(&adev->delayed_init_work); > + That won't work. It's at least theoretical possible that the delayed init work waits for the reset_sem which we are holding here. Very unlikely to happen, but lockdep might be able to point that out with a nice backtrace in the logs. On the other hand delayed init work and direct IB test through this interface should work at the same time, so I would just drop it. Christian. > /* hold on the scheduler */ > for (i = 0; i < AMDGPU_MAX_RINGS; i++) { > struct amdgpu_ring *ring = adev->rings[i]; > @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) > kthread_unpark(ring->sched.thread); > } > > - up_read(&adev->reset_sem); > + up_write(&adev->reset_sem); > > pm_runtime_mark_last_busy(dev->dev); > pm_runtime_put_autosuspend(dev->dev); ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: 回复: [PATCH] drm/amdgpu: Fix a race of IB test 2021-09-11 10:18 ` 回复: " Pan, Xinhui @ 2021-09-13 6:19 ` Christian König 2021-09-13 6:55 ` 回复: " Pan, Xinhui 0 siblings, 1 reply; 6+ messages in thread From: Christian König @ 2021-09-13 6:19 UTC (permalink / raw) To: Pan, Xinhui, Christian König, amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander Well is the delayed init work using direct submission or submission through the scheduler? If the later we have the down_write of the reset semaphore pulled in through the scheduler dependency. Anyway just having the sync before taking the lock should work. Christian. Am 11.09.21 um 12:18 schrieb Pan, Xinhui: > [AMD Official Use Only] > > For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing. > But I am not aware why delayed init work would try to lock reset_sem. > > delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method. > But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed. > > I will drop this flush in V2. > ________________________________________ > 发件人: Christian König <ckoenig.leichtzumerken@gmail.com> > 发送时间: 2021年9月11日 15:45 > 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org > 抄送: Deucher, Alexander; Koenig, Christian > 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test > > > > Am 11.09.21 um 03:55 schrieb xinhui pan: >> Direct IB submission should be exclusive. So use write lock. >> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> index 19323b4cce7b..acbe02928791 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >> } >> >> /* Avoid accidently unparking the sched thread during GPU reset */ >> - r = down_read_killable(&adev->reset_sem); >> + r = down_write_killable(&adev->reset_sem); >> if (r) >> return r; >> >> + /* Avoid concurrently IB test but not cancel it as I don't know whether we >> + * would add more code in the delayed init work. >> + */ >> + flush_delayed_work(&adev->delayed_init_work); >> + > That won't work. It's at least theoretical possible that the delayed > init work waits for the reset_sem which we are holding here. > > Very unlikely to happen, but lockdep might be able to point that out > with a nice backtrace in the logs. > > On the other hand delayed init work and direct IB test through this > interface should work at the same time, so I would just drop it. > > Christian. > >> /* hold on the scheduler */ >> for (i = 0; i < AMDGPU_MAX_RINGS; i++) { >> struct amdgpu_ring *ring = adev->rings[i]; >> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >> kthread_unpark(ring->sched.thread); >> } >> >> - up_read(&adev->reset_sem); >> + up_write(&adev->reset_sem); >> >> pm_runtime_mark_last_busy(dev->dev); >> pm_runtime_put_autosuspend(dev->dev); ^ permalink raw reply [flat|nested] 6+ messages in thread
* 回复: 回复: [PATCH] drm/amdgpu: Fix a race of IB test 2021-09-13 6:19 ` Christian König @ 2021-09-13 6:55 ` Pan, Xinhui 2021-09-13 6:58 ` Christian König 0 siblings, 1 reply; 6+ messages in thread From: Pan, Xinhui @ 2021-09-13 6:55 UTC (permalink / raw) To: Koenig, Christian, Christian König, amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander [AMD Official Use Only] These IB tests are all using direct IB submission including the delayed init work. ________________________________________ 发件人: Koenig, Christian <Christian.Koenig@amd.com> 发送时间: 2021年9月13日 14:19 收件人: Pan, Xinhui; Christian König; amd-gfx@lists.freedesktop.org 抄送: Deucher, Alexander 主题: Re: 回复: [PATCH] drm/amdgpu: Fix a race of IB test Well is the delayed init work using direct submission or submission through the scheduler? If the later we have the down_write of the reset semaphore pulled in through the scheduler dependency. Anyway just having the sync before taking the lock should work. Christian. Am 11.09.21 um 12:18 schrieb Pan, Xinhui: > [AMD Official Use Only] > > For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing. > But I am not aware why delayed init work would try to lock reset_sem. > > delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method. > But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed. > > I will drop this flush in V2. > ________________________________________ > 发件人: Christian König <ckoenig.leichtzumerken@gmail.com> > 发送时间: 2021年9月11日 15:45 > 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org > 抄送: Deucher, Alexander; Koenig, Christian > 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test > > > > Am 11.09.21 um 03:55 schrieb xinhui pan: >> Direct IB submission should be exclusive. So use write lock. >> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> index 19323b4cce7b..acbe02928791 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >> } >> >> /* Avoid accidently unparking the sched thread during GPU reset */ >> - r = down_read_killable(&adev->reset_sem); >> + r = down_write_killable(&adev->reset_sem); >> if (r) >> return r; >> >> + /* Avoid concurrently IB test but not cancel it as I don't know whether we >> + * would add more code in the delayed init work. >> + */ >> + flush_delayed_work(&adev->delayed_init_work); >> + > That won't work. It's at least theoretical possible that the delayed > init work waits for the reset_sem which we are holding here. > > Very unlikely to happen, but lockdep might be able to point that out > with a nice backtrace in the logs. > > On the other hand delayed init work and direct IB test through this > interface should work at the same time, so I would just drop it. > > Christian. > >> /* hold on the scheduler */ >> for (i = 0; i < AMDGPU_MAX_RINGS; i++) { >> struct amdgpu_ring *ring = adev->rings[i]; >> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >> kthread_unpark(ring->sched.thread); >> } >> >> - up_read(&adev->reset_sem); >> + up_write(&adev->reset_sem); >> >> pm_runtime_mark_last_busy(dev->dev); >> pm_runtime_put_autosuspend(dev->dev); ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: 回复: 回复: [PATCH] drm/amdgpu: Fix a race of IB test 2021-09-13 6:55 ` 回复: " Pan, Xinhui @ 2021-09-13 6:58 ` Christian König 0 siblings, 0 replies; 6+ messages in thread From: Christian König @ 2021-09-13 6:58 UTC (permalink / raw) To: Pan, Xinhui, Koenig, Christian, amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander Yeah, because we avoid need to allocate an entity otherwise. Ok, all that comes swapped back into my head once more. As far as I can see that should work, but I would ask Andrey as well since he now takes care of GPU reset. Christian. Am 13.09.21 um 08:55 schrieb Pan, Xinhui: > [AMD Official Use Only] > > These IB tests are all using direct IB submission including the delayed init work. > ________________________________________ > 发件人: Koenig, Christian <Christian.Koenig@amd.com> > 发送时间: 2021年9月13日 14:19 > 收件人: Pan, Xinhui; Christian König; amd-gfx@lists.freedesktop.org > 抄送: Deucher, Alexander > 主题: Re: 回复: [PATCH] drm/amdgpu: Fix a race of IB test > > Well is the delayed init work using direct submission or submission > through the scheduler? > > If the later we have the down_write of the reset semaphore pulled in > through the scheduler dependency. > > Anyway just having the sync before taking the lock should work. > > Christian. > > Am 11.09.21 um 12:18 schrieb Pan, Xinhui: >> [AMD Official Use Only] >> >> For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing. >> But I am not aware why delayed init work would try to lock reset_sem. >> >> delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method. >> But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed. >> >> I will drop this flush in V2. >> ________________________________________ >> 发件人: Christian König <ckoenig.leichtzumerken@gmail.com> >> 发送时间: 2021年9月11日 15:45 >> 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org >> 抄送: Deucher, Alexander; Koenig, Christian >> 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test >> >> >> >> Am 11.09.21 um 03:55 schrieb xinhui pan: >>> Direct IB submission should be exclusive. So use write lock. >>> >>> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- >>> 1 file changed, 7 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>> index 19323b4cce7b..acbe02928791 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >>> } >>> >>> /* Avoid accidently unparking the sched thread during GPU reset */ >>> - r = down_read_killable(&adev->reset_sem); >>> + r = down_write_killable(&adev->reset_sem); >>> if (r) >>> return r; >>> >>> + /* Avoid concurrently IB test but not cancel it as I don't know whether we >>> + * would add more code in the delayed init work. >>> + */ >>> + flush_delayed_work(&adev->delayed_init_work); >>> + >> That won't work. It's at least theoretical possible that the delayed >> init work waits for the reset_sem which we are holding here. >> >> Very unlikely to happen, but lockdep might be able to point that out >> with a nice backtrace in the logs. >> >> On the other hand delayed init work and direct IB test through this >> interface should work at the same time, so I would just drop it. >> >> Christian. >> >>> /* hold on the scheduler */ >>> for (i = 0; i < AMDGPU_MAX_RINGS; i++) { >>> struct amdgpu_ring *ring = adev->rings[i]; >>> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >>> kthread_unpark(ring->sched.thread); >>> } >>> >>> - up_read(&adev->reset_sem); >>> + up_write(&adev->reset_sem); >>> >>> pm_runtime_mark_last_busy(dev->dev); >>> pm_runtime_put_autosuspend(dev->dev); ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2021-09-13 6:58 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2021-09-11 1:55 [PATCH] drm/amdgpu: Fix a race of IB test xinhui pan 2021-09-11 7:45 ` Christian König 2021-09-11 10:18 ` 回复: " Pan, Xinhui 2021-09-13 6:19 ` Christian König 2021-09-13 6:55 ` 回复: " Pan, Xinhui 2021-09-13 6:58 ` Christian König
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox