dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] drm/sched: Document potential forever-hang
@ 2025-10-28 13:46 Philipp Stanner
  2025-10-28 13:46 ` [PATCH 1/2] drm/sched: Fix comment in drm_sched_run_job_work() Philipp Stanner
  2025-10-28 13:46 ` [PATCH 2/2] drm/sched: Add FIXME detailing potential hang Philipp Stanner
  0 siblings, 2 replies; 7+ messages in thread
From: Philipp Stanner @ 2025-10-28 13:46 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, tursulin
  Cc: dri-devel, linux-kernel

While implementing a similar feature for drm_jobqueue I was looking for
inspiration in drm_sched and found this problem here. Now I have two
problems instead of one ^^'

Anyways, this is not a huge issue since it's unlikely to occur – but we
should document it.

Philipp Stanner (2):
  drm/sched: Fix comment in drm_sched_run_job_work()
  drm/sched: Add FIXME detailing potential hang

 drivers/gpu/drm/scheduler/sched_main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] drm/sched: Fix comment in drm_sched_run_job_work()
  2025-10-28 13:46 [PATCH 0/2] drm/sched: Document potential forever-hang Philipp Stanner
@ 2025-10-28 13:46 ` Philipp Stanner
  2025-10-29 16:54   ` Matthew Brost
  2025-10-28 13:46 ` [PATCH 2/2] drm/sched: Add FIXME detailing potential hang Philipp Stanner
  1 sibling, 1 reply; 7+ messages in thread
From: Philipp Stanner @ 2025-10-28 13:46 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, tursulin
  Cc: dri-devel, linux-kernel

drm_sched_run_job_work() contains a comment which explains that an
entity being NULL means that there is no more work to do. It can,
however, also mean that there is work, but the scheduler doesn't have
enough credits to process the jobs right now.

Provide this detail in the comment.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c39f0245e3a9..492e8af639db 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1237,8 +1237,13 @@ static void drm_sched_run_job_work(struct work_struct *w)
 
 	/* Find entity with a ready job */
 	entity = drm_sched_select_entity(sched);
-	if (!entity)
-		return;	/* No more work */
+	if (!entity) {
+		/*
+		 * Either no more work to do, or the next ready job needs more
+		 * credits than the scheduler has currently available.
+		 */
+		return;
+	}
 
 	sched_job = drm_sched_entity_pop_job(entity);
 	if (!sched_job) {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] drm/sched: Add FIXME detailing potential hang
  2025-10-28 13:46 [PATCH 0/2] drm/sched: Document potential forever-hang Philipp Stanner
  2025-10-28 13:46 ` [PATCH 1/2] drm/sched: Fix comment in drm_sched_run_job_work() Philipp Stanner
@ 2025-10-28 13:46 ` Philipp Stanner
  2025-10-28 19:43   ` Matthew Brost
  1 sibling, 1 reply; 7+ messages in thread
From: Philipp Stanner @ 2025-10-28 13:46 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, tursulin
  Cc: dri-devel, linux-kernel

If a job from a ready entity needs more credits than are currently
available, drm_sched_run_job_work() (a work item) simply returns and
doesn't reschedule itself. The scheduler is only woken up again when the
next job gets pushed with drm_sched_entity_push_job().

If someone submits a job that needs too many credits and doesn't submit
more jobs afterwards, this would lead to the scheduler never pulling the
too-expensive job, effectively hanging forever.

Document this problem as a FIXME.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 492e8af639db..eaf8d17b2a66 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
 
 	/* Find entity with a ready job */
 	entity = drm_sched_select_entity(sched);
+	/*
+	 * FIXME:
+	 * The entity can be NULL when the scheduler currently has no capacity
+	 * (credits) for more jobs. If that happens, the work item terminates
+	 * itself here, without rescheduling itself.
+	 *
+	 * It only gets started again in drm_sched_entity_push_job(). IOW, the
+	 * scheduler might hang forever if a job that needs too many credits
+	 * gets submitted to an entity and no other, subsequent jobs are.
+	 */
 	if (!entity) {
 		/*
 		 * Either no more work to do, or the next ready job needs more
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] drm/sched: Add FIXME detailing potential hang
  2025-10-28 13:46 ` [PATCH 2/2] drm/sched: Add FIXME detailing potential hang Philipp Stanner
@ 2025-10-28 19:43   ` Matthew Brost
  2025-10-29  7:30     ` Philipp Stanner
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Brost @ 2025-10-28 19:43 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Danilo Krummrich, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	tursulin, dri-devel, linux-kernel

On Tue, Oct 28, 2025 at 02:46:02PM +0100, Philipp Stanner wrote:
> If a job from a ready entity needs more credits than are currently
> available, drm_sched_run_job_work() (a work item) simply returns and
> doesn't reschedule itself. The scheduler is only woken up again when the
> next job gets pushed with drm_sched_entity_push_job().
> 
> If someone submits a job that needs too many credits and doesn't submit
> more jobs afterwards, this would lead to the scheduler never pulling the
> too-expensive job, effectively hanging forever.
> 
> Document this problem as a FIXME.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 492e8af639db..eaf8d17b2a66 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  
>  	/* Find entity with a ready job */
>  	entity = drm_sched_select_entity(sched);
> +	/*
> +	 * FIXME:
> +	 * The entity can be NULL when the scheduler currently has no capacity
> +	 * (credits) for more jobs. If that happens, the work item terminates
> +	 * itself here, without rescheduling itself.
> +	 *
> +	 * It only gets started again in drm_sched_entity_push_job(). IOW, the
> +	 * scheduler might hang forever if a job that needs too many credits
> +	 * gets submitted to an entity and no other, subsequent jobs are.
> +	 */

drm_sched_job_done frees the credits, which triggers
drm_sched_free_job_work, and that in turn triggers
drm_sched_run_job_work.

This flow could be refined a bit, but I do believe it works—unless I'm
missing something. I'm pretty sure we have tests in Xe that exhaust the
credits, though it might be continuous submissions; I'd have to check.

Matt 

>  	if (!entity) {
>  		/*
>  		 * Either no more work to do, or the next ready job needs more
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] drm/sched: Add FIXME detailing potential hang
  2025-10-28 19:43   ` Matthew Brost
@ 2025-10-29  7:30     ` Philipp Stanner
  0 siblings, 0 replies; 7+ messages in thread
From: Philipp Stanner @ 2025-10-29  7:30 UTC (permalink / raw)
  To: Matthew Brost, Philipp Stanner
  Cc: Danilo Krummrich, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	tursulin, dri-devel, linux-kernel

On Tue, 2025-10-28 at 12:43 -0700, Matthew Brost wrote:
> On Tue, Oct 28, 2025 at 02:46:02PM +0100, Philipp Stanner wrote:
> > If a job from a ready entity needs more credits than are currently
> > available, drm_sched_run_job_work() (a work item) simply returns and
> > doesn't reschedule itself. The scheduler is only woken up again when the
> > next job gets pushed with drm_sched_entity_push_job().
> > 
> > If someone submits a job that needs too many credits and doesn't submit
> > more jobs afterwards, this would lead to the scheduler never pulling the
> > too-expensive job, effectively hanging forever.
> > 
> > Document this problem as a FIXME.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 492e8af639db..eaf8d17b2a66 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >  
> >  	/* Find entity with a ready job */
> >  	entity = drm_sched_select_entity(sched);
> > +	/*
> > +	 * FIXME:
> > +	 * The entity can be NULL when the scheduler currently has no capacity
> > +	 * (credits) for more jobs. If that happens, the work item terminates
> > +	 * itself here, without rescheduling itself.
> > +	 *
> > +	 * It only gets started again in drm_sched_entity_push_job(). IOW, the
> > +	 * scheduler might hang forever if a job that needs too many credits
> > +	 * gets submitted to an entity and no other, subsequent jobs are.
> > +	 */
> 
> drm_sched_job_done frees the credits, which triggers
> drm_sched_free_job_work, and that in turn triggers
> drm_sched_run_job_work.

Sounds correct to me.

We can still merge #1, though, for a bit more clearness.

P.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] drm/sched: Fix comment in drm_sched_run_job_work()
  2025-10-28 13:46 ` [PATCH 1/2] drm/sched: Fix comment in drm_sched_run_job_work() Philipp Stanner
@ 2025-10-29 16:54   ` Matthew Brost
  2025-10-31 10:10     ` Philipp Stanner
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Brost @ 2025-10-29 16:54 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Danilo Krummrich, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	tursulin, dri-devel, linux-kernel

On Tue, Oct 28, 2025 at 02:46:01PM +0100, Philipp Stanner wrote:
> drm_sched_run_job_work() contains a comment which explains that an
> entity being NULL means that there is no more work to do. It can,
> however, also mean that there is work, but the scheduler doesn't have
> enough credits to process the jobs right now.
> 
> Provide this detail in the comment.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index c39f0245e3a9..492e8af639db 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1237,8 +1237,13 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  
>  	/* Find entity with a ready job */
>  	entity = drm_sched_select_entity(sched);
> -	if (!entity)
> -		return;	/* No more work */
> +	if (!entity) {
> +		/*
> +		 * Either no more work to do, or the next ready job needs more
> +		 * credits than the scheduler has currently available.
> +		 */
> +		return;
> +	}
>  
>  	sched_job = drm_sched_entity_pop_job(entity);
>  	if (!sched_job) {
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] drm/sched: Fix comment in drm_sched_run_job_work()
  2025-10-29 16:54   ` Matthew Brost
@ 2025-10-31 10:10     ` Philipp Stanner
  0 siblings, 0 replies; 7+ messages in thread
From: Philipp Stanner @ 2025-10-31 10:10 UTC (permalink / raw)
  To: Matthew Brost, Philipp Stanner
  Cc: Danilo Krummrich, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	tursulin, dri-devel, linux-kernel

On Wed, 2025-10-29 at 09:54 -0700, Matthew Brost wrote:
> On Tue, Oct 28, 2025 at 02:46:01PM +0100, Philipp Stanner wrote:
> > drm_sched_run_job_work() contains a comment which explains that an
> > entity being NULL means that there is no more work to do. It can,
> > however, also mean that there is work, but the scheduler doesn't have
> > enough credits to process the jobs right now.
> > 
> > Provide this detail in the comment.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> 
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>

Applied to drm-misc-next.
Dropped the other patch.

Thx
P.

> 
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index c39f0245e3a9..492e8af639db 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1237,8 +1237,13 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >  
> >  	/* Find entity with a ready job */
> >  	entity = drm_sched_select_entity(sched);
> > -	if (!entity)
> > -		return;	/* No more work */
> > +	if (!entity) {
> > +		/*
> > +		 * Either no more work to do, or the next ready job needs more
> > +		 * credits than the scheduler has currently available.
> > +		 */
> > +		return;
> > +	}
> >  
> >  	sched_job = drm_sched_entity_pop_job(entity);
> >  	if (!sched_job) {
> > -- 
> > 2.49.0
> > 


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-10-31 10:10 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-28 13:46 [PATCH 0/2] drm/sched: Document potential forever-hang Philipp Stanner
2025-10-28 13:46 ` [PATCH 1/2] drm/sched: Fix comment in drm_sched_run_job_work() Philipp Stanner
2025-10-29 16:54   ` Matthew Brost
2025-10-31 10:10     ` Philipp Stanner
2025-10-28 13:46 ` [PATCH 2/2] drm/sched: Add FIXME detailing potential hang Philipp Stanner
2025-10-28 19:43   ` Matthew Brost
2025-10-29  7:30     ` Philipp Stanner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).