* [RFC 0/3] Potential context entity destroy race
@ 2025-05-19 16:37 Tvrtko Ursulin
2025-05-19 16:37 ` [RFC 1/3] drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static Tvrtko Ursulin
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Tvrtko Ursulin @ 2025-05-19 16:37 UTC (permalink / raw)
To: amd-gfx; +Cc: Tvrtko Ursulin, Alex Deucher, Christian König
One weird KASAN failure observed one time only. More details in the last patch.
If plausible I will need to fix the commit message in the last patch.
First two should be non-controversial cleanups, hopefully.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Tvrtko Ursulin (3):
drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static
drm/amdgpu: Remove duplicated "context still alive" check
drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini()
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 +++-------------
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
2 files changed, 3 insertions(+), 14 deletions(-)
--
2.48.0
^ permalink raw reply [flat|nested] 11+ messages in thread* [RFC 1/3] drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static 2025-05-19 16:37 [RFC 0/3] Potential context entity destroy race Tvrtko Ursulin @ 2025-05-19 16:37 ` Tvrtko Ursulin 2025-05-20 7:31 ` Christian König 2025-05-19 16:37 ` [RFC 2/3] drm/amdgpu: Remove duplicated "context still alive" check Tvrtko Ursulin 2025-05-19 16:37 ` [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() Tvrtko Ursulin 2 siblings, 1 reply; 11+ messages in thread From: Tvrtko Ursulin @ 2025-05-19 16:37 UTC (permalink / raw) To: amd-gfx; +Cc: Tvrtko Ursulin, Alex Deucher, Christian König Function amdgpu_ctx_mgr_entity_fini() only has a single local caller so lets make it local. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 - 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c index c43d1b6e5d66..4ff8552e872d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c @@ -919,7 +919,7 @@ long amdgpu_ctx_mgr_entity_flush(struct amdgpu_ctx_mgr *mgr, long timeout) return timeout; } -void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) +static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) { struct amdgpu_ctx *ctx; struct idr *idp; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h index 85376baaa92f..090dfe86f75b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h @@ -92,7 +92,6 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, void amdgpu_ctx_mgr_init(struct amdgpu_ctx_mgr *mgr, struct amdgpu_device *adev); -void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr); long amdgpu_ctx_mgr_entity_flush(struct amdgpu_ctx_mgr *mgr, long timeout); void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr); void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr, -- 2.48.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC 1/3] drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static 2025-05-19 16:37 ` [RFC 1/3] drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static Tvrtko Ursulin @ 2025-05-20 7:31 ` Christian König 0 siblings, 0 replies; 11+ messages in thread From: Christian König @ 2025-05-20 7:31 UTC (permalink / raw) To: Tvrtko Ursulin, amd-gfx; +Cc: Alex Deucher On 5/19/25 18:37, Tvrtko Ursulin wrote: > Function amdgpu_ctx_mgr_entity_fini() only has a single local caller so > lets make it local. > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> > Cc: Alex Deucher <alexander.deucher@amd.com> > Cc: Christian König <christian.koenig@amd.com> Alternatively we could also just inline it, but either way Reviewed-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 - > 2 files changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > index c43d1b6e5d66..4ff8552e872d 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > @@ -919,7 +919,7 @@ long amdgpu_ctx_mgr_entity_flush(struct amdgpu_ctx_mgr *mgr, long timeout) > return timeout; > } > > -void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) > +static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) > { > struct amdgpu_ctx *ctx; > struct idr *idp; > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h > index 85376baaa92f..090dfe86f75b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h > @@ -92,7 +92,6 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, > > void amdgpu_ctx_mgr_init(struct amdgpu_ctx_mgr *mgr, > struct amdgpu_device *adev); > -void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr); > long amdgpu_ctx_mgr_entity_flush(struct amdgpu_ctx_mgr *mgr, long timeout); > void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr); > void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr, ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC 2/3] drm/amdgpu: Remove duplicated "context still alive" check 2025-05-19 16:37 [RFC 0/3] Potential context entity destroy race Tvrtko Ursulin 2025-05-19 16:37 ` [RFC 1/3] drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static Tvrtko Ursulin @ 2025-05-19 16:37 ` Tvrtko Ursulin 2025-05-20 7:34 ` Christian König 2025-05-19 16:37 ` [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() Tvrtko Ursulin 2 siblings, 1 reply; 11+ messages in thread From: Tvrtko Ursulin @ 2025-05-19 16:37 UTC (permalink / raw) To: amd-gfx; +Cc: Tvrtko Ursulin, Alex Deucher, Christian König When amdgpu_ctx_mgr_fini() calls amdgpu_ctx_mgr_entity_fini() it contains the exact same "context still alive" check as it will do next. Remove the duplicated copy. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c index 4ff8552e872d..85567d0d9545 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c @@ -949,19 +949,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) { - struct amdgpu_ctx *ctx; - struct idr *idp; - uint32_t id; - amdgpu_ctx_mgr_entity_fini(mgr); - - idp = &mgr->ctx_handles; - - idr_for_each_entry(idp, ctx, id) { - if (kref_put(&ctx->refcount, amdgpu_ctx_fini) != 1) - DRM_ERROR("ctx %p is still alive\n", ctx); - } - idr_destroy(&mgr->ctx_handles); mutex_destroy(&mgr->lock); } -- 2.48.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC 2/3] drm/amdgpu: Remove duplicated "context still alive" check 2025-05-19 16:37 ` [RFC 2/3] drm/amdgpu: Remove duplicated "context still alive" check Tvrtko Ursulin @ 2025-05-20 7:34 ` Christian König 2025-05-20 14:17 ` Alex Deucher 0 siblings, 1 reply; 11+ messages in thread From: Christian König @ 2025-05-20 7:34 UTC (permalink / raw) To: Tvrtko Ursulin, amd-gfx; +Cc: Alex Deucher On 5/19/25 18:37, Tvrtko Ursulin wrote: > When amdgpu_ctx_mgr_fini() calls amdgpu_ctx_mgr_entity_fini() it contains > the exact same "context still alive" check as it will do next. Remove the > duplicated copy. > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> > Cc: Alex Deucher <alexander.deucher@amd.com> > Cc: Christian König <christian.koenig@amd.com> I think we could also completely remove this check from both places. IIRC it was only added because somebody suggested that CTX could potentially outlive the file descriptor. We fortunately abandoned that idea even before amdgpu went upstream. Either way this here is clearly superfluous, so feel free to add Reviewed-by: Christian König <christian.koenig@amd.com> Regards, Christian. > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 12 ------------ > 1 file changed, 12 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > index 4ff8552e872d..85567d0d9545 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > @@ -949,19 +949,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) > > void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) > { > - struct amdgpu_ctx *ctx; > - struct idr *idp; > - uint32_t id; > - > amdgpu_ctx_mgr_entity_fini(mgr); > - > - idp = &mgr->ctx_handles; > - > - idr_for_each_entry(idp, ctx, id) { > - if (kref_put(&ctx->refcount, amdgpu_ctx_fini) != 1) > - DRM_ERROR("ctx %p is still alive\n", ctx); > - } > - > idr_destroy(&mgr->ctx_handles); > mutex_destroy(&mgr->lock); > } ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC 2/3] drm/amdgpu: Remove duplicated "context still alive" check 2025-05-20 7:34 ` Christian König @ 2025-05-20 14:17 ` Alex Deucher 0 siblings, 0 replies; 11+ messages in thread From: Alex Deucher @ 2025-05-20 14:17 UTC (permalink / raw) To: Christian König; +Cc: Tvrtko Ursulin, amd-gfx, Alex Deucher Applied patches 1 and 2. Alex On Tue, May 20, 2025 at 3:59 AM Christian König <christian.koenig@amd.com> wrote: > > On 5/19/25 18:37, Tvrtko Ursulin wrote: > > When amdgpu_ctx_mgr_fini() calls amdgpu_ctx_mgr_entity_fini() it contains > > the exact same "context still alive" check as it will do next. Remove the > > duplicated copy. > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> > > Cc: Alex Deucher <alexander.deucher@amd.com> > > Cc: Christian König <christian.koenig@amd.com> > > I think we could also completely remove this check from both places. > > IIRC it was only added because somebody suggested that CTX could potentially outlive the file descriptor. > > We fortunately abandoned that idea even before amdgpu went upstream. > > Either way this here is clearly superfluous, so feel free to add Reviewed-by: Christian König <christian.koenig@amd.com> > > Regards, > Christian. > > > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 12 ------------ > > 1 file changed, 12 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > > index 4ff8552e872d..85567d0d9545 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > > @@ -949,19 +949,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) > > > > void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) > > { > > - struct amdgpu_ctx *ctx; > > - struct idr *idp; > > - uint32_t id; > > - > > amdgpu_ctx_mgr_entity_fini(mgr); > > - > > - idp = &mgr->ctx_handles; > > - > > - idr_for_each_entry(idp, ctx, id) { > > - if (kref_put(&ctx->refcount, amdgpu_ctx_fini) != 1) > > - DRM_ERROR("ctx %p is still alive\n", ctx); > > - } > > - > > idr_destroy(&mgr->ctx_handles); > > mutex_destroy(&mgr->lock); > > } > ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() 2025-05-19 16:37 [RFC 0/3] Potential context entity destroy race Tvrtko Ursulin 2025-05-19 16:37 ` [RFC 1/3] drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static Tvrtko Ursulin 2025-05-19 16:37 ` [RFC 2/3] drm/amdgpu: Remove duplicated "context still alive" check Tvrtko Ursulin @ 2025-05-19 16:37 ` Tvrtko Ursulin 2025-05-20 7:39 ` Christian König 2 siblings, 1 reply; 11+ messages in thread From: Tvrtko Ursulin @ 2025-05-19 16:37 UTC (permalink / raw) To: amd-gfx; +Cc: Tvrtko Ursulin, Alex Deucher, Christian König Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is could it in theory see a stale entry and attempt to destroy the drm_sched_entity twice? Problem is I have hit this on a KASAN enabled kernel only _once_ and never since. In that case it reported amdgpu_ctx_ioctl() had freed the entity already which would mean the question is could we possibly go through the mutex_unlock() on one CPU, and another CPU to follow immediately with file->release (DRM postclose) and see the stale entry. Throwing it out there not to forget about it, since I have manage to lose the KASAN trace already.. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c index 85567d0d9545..95b005ed839e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) idp = &mgr->ctx_handles; + mutex_lock(&mgr->lock); idr_for_each_entry(idp, ctx, id) { if (kref_read(&ctx->refcount) != 1) { DRM_ERROR("ctx %p is still alive\n", ctx); @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) } } } + mutex_unlock(&mgr->lock); } void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) -- 2.48.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() 2025-05-19 16:37 ` [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() Tvrtko Ursulin @ 2025-05-20 7:39 ` Christian König 2025-05-20 7:50 ` Tvrtko Ursulin 0 siblings, 1 reply; 11+ messages in thread From: Christian König @ 2025-05-20 7:39 UTC (permalink / raw) To: Tvrtko Ursulin, amd-gfx; +Cc: Alex Deucher On 5/19/25 18:37, Tvrtko Ursulin wrote: > Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is > could it in theory see a stale entry and attempt to destroy the > drm_sched_entity twice? No, when this function is called when the file descriptor is freed up. So there should never ever be any concurrent user of this. > Problem is I have hit this on a KASAN enabled kernel only _once_ and never > since. In that case it reported amdgpu_ctx_ioctl() had freed the entity > already which would mean the question is could we possibly go through the > mutex_unlock() on one CPU, and another CPU to follow immediately with > file->release (DRM postclose) and see the stale entry. Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing. I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band aids the issue, but not really fixes it. > Throwing it out there not to forget about it, since I have manage to > lose the KASAN trace already.. If you manage to trigger that again please send it to me ASAP. Thanks, Christian. > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> > Cc: Alex Deucher <alexander.deucher@amd.com> > Cc: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > index 85567d0d9545..95b005ed839e 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) > > idp = &mgr->ctx_handles; > > + mutex_lock(&mgr->lock); > idr_for_each_entry(idp, ctx, id) { > if (kref_read(&ctx->refcount) != 1) { > DRM_ERROR("ctx %p is still alive\n", ctx); > @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) > } > } > } > + mutex_unlock(&mgr->lock); > } > > void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() 2025-05-20 7:39 ` Christian König @ 2025-05-20 7:50 ` Tvrtko Ursulin 2025-05-20 8:40 ` Christian König 0 siblings, 1 reply; 11+ messages in thread From: Tvrtko Ursulin @ 2025-05-20 7:50 UTC (permalink / raw) To: Christian König, amd-gfx; +Cc: Alex Deucher On 20/05/2025 08:39, Christian König wrote: > On 5/19/25 18:37, Tvrtko Ursulin wrote: >> Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is >> could it in theory see a stale entry and attempt to destroy the >> drm_sched_entity twice? > > No, when this function is called when the file descriptor is freed up. > > So there should never ever be any concurrent user of this. > >> Problem is I have hit this on a KASAN enabled kernel only _once_ and never >> since. In that case it reported amdgpu_ctx_ioctl() had freed the entity >> already which would mean the question is could we possibly go through the >> mutex_unlock() on one CPU, and another CPU to follow immediately with >> file->release (DRM postclose) and see the stale entry. > > Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing. I know, like if the VFS side of things was broken. > I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band aids the issue, but not really fixes it. So the part I wasn't sure about was not the ->release() running actually in parallel with ->ioctl(), but straight afterwards, but on a different CPU. If there is any chance the missing mutex_lock() before the IDR walk could mean a lacking memory barrier, so the entry that was just removed by ->ioctl() is seen in ->release(). Thread A Thread B amdgpu_ctx_ioctl -> amdgpu_ctx_free mutex_lock idr_remove mutex_unlock fput() fput() ->release() amdgpu_ctx_mgr_fini -> amdgpu_ctx_mgr_entity_fini idr_for_each_entry stale entry ??? -> BOOM Question is does the mutex_unlock() in Thread A ensures that the write into the IDR array would be seen in Thread B, given no mutex_lock() in the latter. Regards, Tvrtko >> Throwing it out there not to forget about it, since I have manage to >> lose the KASAN trace already.. > > If you manage to trigger that again please send it to me ASAP. > > Thanks, > Christian. > >> >> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> >> Cc: Alex Deucher <alexander.deucher@amd.com> >> Cc: Christian König <christian.koenig@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >> index 85567d0d9545..95b005ed839e 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) >> >> idp = &mgr->ctx_handles; >> >> + mutex_lock(&mgr->lock); >> idr_for_each_entry(idp, ctx, id) { >> if (kref_read(&ctx->refcount) != 1) { >> DRM_ERROR("ctx %p is still alive\n", ctx); >> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) >> } >> } >> } >> + mutex_unlock(&mgr->lock); >> } >> >> void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() 2025-05-20 7:50 ` Tvrtko Ursulin @ 2025-05-20 8:40 ` Christian König 2025-05-20 16:23 ` Tvrtko Ursulin 0 siblings, 1 reply; 11+ messages in thread From: Christian König @ 2025-05-20 8:40 UTC (permalink / raw) To: Tvrtko Ursulin, amd-gfx; +Cc: Alex Deucher On 5/20/25 09:50, Tvrtko Ursulin wrote: > > On 20/05/2025 08:39, Christian König wrote: >> On 5/19/25 18:37, Tvrtko Ursulin wrote: >>> Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is >>> could it in theory see a stale entry and attempt to destroy the >>> drm_sched_entity twice? >> >> No, when this function is called when the file descriptor is freed up. >> >> So there should never ever be any concurrent user of this. >> >>> Problem is I have hit this on a KASAN enabled kernel only _once_ and never >>> since. In that case it reported amdgpu_ctx_ioctl() had freed the entity >>> already which would mean the question is could we possibly go through the >>> mutex_unlock() on one CPU, and another CPU to follow immediately with >>> file->release (DRM postclose) and see the stale entry. >> >> Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing. > > I know, like if the VFS side of things was broken. > >> I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band aids the issue, but not really fixes it. > > So the part I wasn't sure about was not the ->release() running actually in parallel with ->ioctl(), but straight afterwards, but on a different CPU. > > If there is any chance the missing mutex_lock() before the IDR walk could mean a lacking memory barrier, so the entry that was just removed by ->ioctl() is seen in ->release(). > > Thread A Thread B > amdgpu_ctx_ioctl > -> amdgpu_ctx_free > mutex_lock > idr_remove > mutex_unlock > fput() fput() > ->release() > amdgpu_ctx_mgr_fini > -> amdgpu_ctx_mgr_entity_fini > idr_for_each_entry > stale entry ??? -> BOOM > > Question is does the mutex_unlock() in Thread A ensures that the write into the IDR array would be seen in Thread B, given no mutex_lock() in the latter. Oh, good question! I have absolutely no idea. The mutex_unlock() certainly doesn't provide any memory barrier for the other CPU which does the _fini, but the fput() or the ->release code path might. As far as I know it is rather common to not lock anything in the destroy path. I will try to double check. Thanks, Christian. > > Regards, > > Tvrtko > >> Throwing it out there not to forget about it, since I have manage to >>> lose the KASAN trace already.. >> >> If you manage to trigger that again please send it to me ASAP. >> >> Thanks, >> Christian. >> >>> >>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> >>> Cc: Alex Deucher <alexander.deucher@amd.com> >>> Cc: Christian König <christian.koenig@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>> index 85567d0d9545..95b005ed839e 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) >>> idp = &mgr->ctx_handles; >>> + mutex_lock(&mgr->lock); >>> idr_for_each_entry(idp, ctx, id) { >>> if (kref_read(&ctx->refcount) != 1) { >>> DRM_ERROR("ctx %p is still alive\n", ctx); >>> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) >>> } >>> } >>> } >>> + mutex_unlock(&mgr->lock); >>> } >>> void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) >> > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() 2025-05-20 8:40 ` Christian König @ 2025-05-20 16:23 ` Tvrtko Ursulin 0 siblings, 0 replies; 11+ messages in thread From: Tvrtko Ursulin @ 2025-05-20 16:23 UTC (permalink / raw) To: Christian König, amd-gfx; +Cc: Alex Deucher On 20/05/2025 09:40, Christian König wrote: > On 5/20/25 09:50, Tvrtko Ursulin wrote: >> >> On 20/05/2025 08:39, Christian König wrote: >>> On 5/19/25 18:37, Tvrtko Ursulin wrote: >>>> Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is >>>> could it in theory see a stale entry and attempt to destroy the >>>> drm_sched_entity twice? >>> >>> No, when this function is called when the file descriptor is freed up. >>> >>> So there should never ever be any concurrent user of this. >>> >>>> Problem is I have hit this on a KASAN enabled kernel only _once_ and never >>>> since. In that case it reported amdgpu_ctx_ioctl() had freed the entity >>>> already which would mean the question is could we possibly go through the >>>> mutex_unlock() on one CPU, and another CPU to follow immediately with >>>> file->release (DRM postclose) and see the stale entry. >>> >>> Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing. >> >> I know, like if the VFS side of things was broken. >> >>> I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band aids the issue, but not really fixes it. >> >> So the part I wasn't sure about was not the ->release() running actually in parallel with ->ioctl(), but straight afterwards, but on a different CPU. >> >> If there is any chance the missing mutex_lock() before the IDR walk could mean a lacking memory barrier, so the entry that was just removed by ->ioctl() is seen in ->release(). >> >> Thread A Thread B >> amdgpu_ctx_ioctl >> -> amdgpu_ctx_free >> mutex_lock >> idr_remove >> mutex_unlock >> fput() fput() >> ->release() >> amdgpu_ctx_mgr_fini >> -> amdgpu_ctx_mgr_entity_fini >> idr_for_each_entry >> stale entry ??? -> BOOM >> >> Question is does the mutex_unlock() in Thread A ensures that the write into the IDR array would be seen in Thread B, given no mutex_lock() in the latter. > > Oh, good question! I have absolutely no idea. > > The mutex_unlock() certainly doesn't provide any memory barrier for the other CPU which does the _fini, but the fput() or the ->release code path might. > > As far as I know it is rather common to not lock anything in the destroy path. I will try to double check. FWIW I tried to provoke this with a freshly made IGT today but failed. I will park it for now since it does not feel a very plausible race. Regards, Tvrtko >> >> Throwing it out there not to forget about it, since I have manage to >>>> lose the KASAN trace already.. >>> >>> If you manage to trigger that again please send it to me ASAP. >>> >>> Thanks, >>> Christian. >>> >>>> >>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> >>>> Cc: Alex Deucher <alexander.deucher@amd.com> >>>> Cc: Christian König <christian.koenig@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++ >>>> 1 file changed, 2 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>> index 85567d0d9545..95b005ed839e 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) >>>> idp = &mgr->ctx_handles; >>>> + mutex_lock(&mgr->lock); >>>> idr_for_each_entry(idp, ctx, id) { >>>> if (kref_read(&ctx->refcount) != 1) { >>>> DRM_ERROR("ctx %p is still alive\n", ctx); >>>> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr) >>>> } >>>> } >>>> } >>>> + mutex_unlock(&mgr->lock); >>>> } >>>> void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr) >>> >> > ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-05-20 16:23 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-05-19 16:37 [RFC 0/3] Potential context entity destroy race Tvrtko Ursulin 2025-05-19 16:37 ` [RFC 1/3] drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static Tvrtko Ursulin 2025-05-20 7:31 ` Christian König 2025-05-19 16:37 ` [RFC 2/3] drm/amdgpu: Remove duplicated "context still alive" check Tvrtko Ursulin 2025-05-20 7:34 ` Christian König 2025-05-20 14:17 ` Alex Deucher 2025-05-19 16:37 ` [RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini() Tvrtko Ursulin 2025-05-20 7:39 ` Christian König 2025-05-20 7:50 ` Tvrtko Ursulin 2025-05-20 8:40 ` Christian König 2025-05-20 16:23 ` Tvrtko Ursulin
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.