All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Philipp Stanner <pstanner@redhat.com>,
	dakr@kernel.org, dri-devel@lists.freedesktop.org,
	ltuikov89@gmail.com
Subject: Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini
Date: Mon, 23 Sep 2024 17:24:10 +0200	[thread overview]
Message-ID: <cef7c754-df50-409b-a7ee-4c184afafa5c@gmail.com> (raw)
In-Reply-To: <2f0b15d47576f25b65927de6c039a6d9839dbb81.camel@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4352 bytes --]

Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
>>> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>>>> Tearing down the scheduler with jobs still on the pending list
>>>> can
>>>> lead to use after free issues. Add a warning if drivers try to
>>>> destroy a scheduler which still has work pushed to the HW.
>>> Did you have time yet to look into my proposed waitque-solution?
>> I don't remember seeing anything. What have I missed?
> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/

Mhm, I didn't got that in my inbox for some reason.

Interesting approach, I'm just not sure if we can or should wait in 
drm_sched_fini().

Probably better to make that a separate function, something like 
drm_sched_flush() or similar.

>>>> When there are still entities with jobs the situation is even
>>>> worse
>>>> since the dma_fences for those jobs can never signal we can just
>>>> choose between potentially locking up core memory management and
>>>> random memory corruption. When drivers really mess it up that
>>>> well
>>>> let them run into a BUG_ON().
>>>>
>>>> Signed-off-by: Christian König<christian.koenig@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>>>    1 file changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index f093616fe53c..8a46fab5cdc8 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
>>>> drm_gpu_scheduler
>>>> *sched)
>>> I agree with Sima that it should first be documented in the
>>> function's
>>> docstring what the user is expected to have done before calling the
>>> function.
>> Good point, going to update the documentation as well.
> Cool thing, thx.
> Would be great if everything (not totally trivial) necessary to be done
> before _fini() is mentioned.
>
> One could also think about providing a hint at how the driver can do
> that. AFAICS the only way for the driver to ensure that is to maintain
> its own, separate list of submitted jobs.

Even with a duplicated pending list it's actually currently impossible 
to do this fully cleanly.

The problem is that the dma_fence object gives no guarantee when 
callbacks are processed, e.g. they can be both processed from interrupt 
context as well as from a CPU which called dma_fence_is_signaled().

So when a driver (or drm_sched_fini) waits for the last submitted fence 
it actually can be that the drm_sched object still needs to do some 
processing. See the hack in amdgpu_vm_tlb_seq() for more background on 
the problem.

Regards,
Christian.

>
> P.
>
>> Thanks,
>> Christian.
>>
>>> P.
>>>
>>>>    
>>>>    	drm_sched_wqueue_stop(sched);
>>>>    
>>>> +	/*
>>>> +	 * Tearing down the scheduler wile there are still
>>>> unprocessed jobs can
>>>> +	 * lead to use after free issues in the scheduler fence.
>>>> +	 */
>>>> +	WARN_ON(!list_empty(&sched->pending_list));
>>>> +
>>>>    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
>>>> i++)
>>>> {
>>>>    		struct drm_sched_rq *rq = sched->sched_rq[i];
>>>>    
>>>>    		spin_lock(&rq->lock);
>>>> -		list_for_each_entry(s_entity, &rq->entities,
>>>> list)
>>>> +		list_for_each_entry(s_entity, &rq->entities,
>>>> list) {
>>>> +			/*
>>>> +			 * The justification for this BUG_ON()
>>>> is
>>>> that tearing
>>>> +			 * down the scheduler while jobs are
>>>> pending
>>>> leaves
>>>> +			 * dma_fences unsignaled. Since we have
>>>> dependencies
>>>> +			 * from the core memory management to
>>>> eventually signal
>>>> +			 * dma_fences this can trivially lead to
>>>> a
>>>> system wide
>>>> +			 * stop because of a locked up memory
>>>> management.
>>>> +			 */
>>>> +			BUG_ON(spsc_queue_count(&s_entity-
>>>>> job_queue));
>>>> +
>>>>    			/*
>>>>    			 * Prevents reinsertion and marks
>>>> job_queue
>>>> as idle,
>>>>    			 * it will removed from rq in
>>>> drm_sched_entity_fini
>>>>    			 * eventually
>>>>    			 */
>>>>    			s_entity->stopped = true;
>>>> +		}
>>>>    		spin_unlock(&rq->lock);
>>>>    		kfree(sched->sched_rq[i]);
>>>>    	}

[-- Attachment #2: Type: text/html, Size: 6490 bytes --]

  reply	other threads:[~2024-09-23 15:24 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-18 13:39 [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini Christian König
2024-09-18 13:39 ` [PATCH 2/2] drm/sched: clarify the documentation on drm_sched_entity_error Christian König
2024-11-25 10:00   ` Philipp Stanner
2024-09-18 14:41 ` [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini Jani Nikula
2024-09-18 14:56   ` Christian König
2024-09-20  8:57 ` Philipp Stanner
2024-09-20 10:33   ` Christian König
2024-09-20 13:26     ` Philipp Stanner
2024-09-23 15:24       ` Christian König [this message]
2024-09-24 11:18         ` Simona Vetter
2024-09-25 14:53           ` Philipp Stanner
2024-09-27  9:04             ` Christian König
2024-10-14 14:56               ` Danilo Krummrich
2024-10-18 12:07                 ` Philipp Stanner
2024-10-29  7:22                   ` Philipp Stanner
2024-11-15 11:26                     ` Christian König
2024-11-15 13:55                       ` Philipp Stanner
2024-11-15 14:39                         ` Christian König
2024-11-15 16:07                           ` Philipp Stanner
2024-11-18 11:48                             ` Christian König
2024-11-21  9:06                               ` Philipp Stanner
2024-11-21 18:10                                 ` Matthew Brost
2024-11-21 18:05                     ` Matthew Brost
2024-11-22 16:24                       ` Philipp Stanner
2024-11-22 16:39                         ` Matthew Brost
2024-10-14 14:49           ` Danilo Krummrich
2024-11-08  4:00 ` Dave Airlie
2024-11-12 15:26   ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cef7c754-df50-409b-a7ee-4c184afafa5c@gmail.com \
    --to=ckoenig.leichtzumerken@gmail.com \
    --cc=dakr@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=ltuikov89@gmail.com \
    --cc=pstanner@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.