[PATCH 1/2] drm/sched: Fix outdated comments referencing thread

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/2] drm/sched: Fix outdated comments referencing thread
@ 2025-02-25 13:13 Philipp Stanner
  2025-02-25 13:13 ` [PATCH 2/2] drm/sched: Remove kthread header Philipp Stanner
  2025-02-25 13:53 ` [PATCH 1/2] drm/sched: Fix outdated comments referencing thread Danilo Krummrich
  0 siblings, 2 replies; 6+ messages in thread
From: Philipp Stanner @ 2025-02-25 13:13 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Tvrtko Ursulin
  Cc: dri-devel, linux-kernel

The GPU scheduler's comments refer to a "thread" at various places.
Those are leftovers stemming from a rework in which the scheduler was
ported from using a kthread to workqueues.

Replace all references to kthreads.

Fixes: a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread")
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_entity.c |  8 ++++----
 drivers/gpu/drm/scheduler/sched_main.c   | 21 +++++++++++----------
 include/drm/gpu_scheduler.h              | 12 ++++++------
 3 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 87f88259ddf6..f9811420c787 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -538,10 +538,10 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 		return;
 
 	/*
-	 * Only when the queue is empty are we guaranteed that the scheduler
-	 * thread cannot change ->last_scheduled. To enforce ordering we need
-	 * a read barrier here. See drm_sched_entity_pop_job() for the other
-	 * side.
+	 * Only when the queue is empty are we guaranteed that
+	 * drm_sched_run_job_work() cannot change entity->last_scheduled. To
+	 * enforce ordering we need a read barrier here. See
+	 * drm_sched_entity_pop_job() for the other side.
 	 */
 	smp_rmb();
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c634993f1346..015ee327fe52 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -389,7 +389,7 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
  * drm_sched_job_done - complete a job
  * @s_job: pointer to the job which is done
  *
- * Finish the job's fence and wake up the worker thread.
+ * Finish the job's fence and wake up the work item.
  */
 static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 {
@@ -550,8 +550,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	if (job) {
 		/*
 		 * Remove the bad job so it cannot be freed by concurrent
-		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-		 * is parked at which point it's safe.
+		 * drm_sched_cleanup_jobs. It will be reinserted back after the
+		 * scheduler's workqueues are stopped at which point it's safe.
 		 */
 		list_del_init(&job->list);
 		spin_unlock(&sched->job_list_lock);
@@ -597,10 +597,10 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 
 	/*
 	 * Reinsert back the bad job here - now it's safe as
-	 * drm_sched_get_finished_job cannot race against us and release the
+	 * drm_sched_get_finished_job() cannot race against us and release the
 	 * bad job at this point - we parked (waited for) any in progress
-	 * (earlier) cleanups and drm_sched_get_finished_job will not be called
-	 * now until the scheduler thread is unparked.
+	 * (earlier) cleanups and drm_sched_get_finished_job() will not be
+	 * called now until the scheduler's workqueues are unparked.
 	 */
 	if (bad && bad->sched == sched)
 		/*
@@ -613,7 +613,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 	 * Iterate the job list from later to  earlier one and either deactive
 	 * their HW callbacks or remove them from pending list if they already
 	 * signaled.
-	 * This iteration is thread safe as sched thread is stopped.
+	 * This iteration is thread safe as the scheduler's workqueues are
+	 * stopped.
 	 */
 	list_for_each_entry_safe_reverse(s_job, tmp, &sched->pending_list,
 					 list) {
@@ -678,9 +679,9 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
 	struct drm_sched_job *s_job, *tmp;
 
 	/*
-	 * Locking the list is not required here as the sched thread is parked
-	 * so no new jobs are being inserted or removed. Also concurrent
-	 * GPU recovers can't run in parallel.
+	 * Locking the list is not required here as the scheduler's workqueues
+	 * are paused, so no new jobs are being inserted or removed. Also
+	 * concurrent GPU recovers can't run in parallel.
 	 */
 	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
 		struct dma_fence *fence = s_job->s_fence->parent;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 50928a7ae98e..7da7b0b52a7e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -192,8 +192,8 @@ struct drm_sched_entity {
 	 * @last_scheduled:
 	 *
 	 * Points to the finished fence of the last scheduled job. Only written
-	 * by the scheduler thread, can be accessed locklessly from
-	 * drm_sched_job_arm() if the queue is empty.
+	 * by &struct drm_gpu_scheduler.submit_wq. Can be accessed locklessly
+	 * from drm_sched_job_arm() if the queue is empty.
 	 */
 	struct dma_fence __rcu		*last_scheduled;
 
@@ -426,14 +426,14 @@ struct drm_sched_backend_ops {
 	 * Drivers typically issue a reset to recover from GPU hangs, and this
 	 * procedure usually follows the following workflow:
 	 *
-	 * 1. Stop the scheduler using drm_sched_stop(). This will park the
-	 *    scheduler thread and cancel the timeout work, guaranteeing that
-	 *    nothing is queued while we reset the hardware queue
+	 * 1. Stop the scheduler using drm_sched_stop(). This will stop the
+	 *    scheduler's workqueues and cancel the timeout work, guaranteeing
+	 *    that  nothing is queued while we reset the hardware queue
 	 * 2. Try to gracefully stop non-faulty jobs (optional)
 	 * 3. Issue a GPU reset (driver-specific)
 	 * 4. Re-submit jobs using drm_sched_resubmit_jobs()
 	 * 5. Restart the scheduler using drm_sched_start(). At that point, new
-	 *    jobs can be queued, and the scheduler thread is unblocked
+	 *    jobs can be queued, and the scheduler's workqueues be started again.
 	 *
 	 * Note that some GPUs have distinct hardware queues but need to reset
 	 * the GPU globally, which requires extra synchronization between the
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] drm/sched: Remove kthread header
  2025-02-25 13:13 [PATCH 1/2] drm/sched: Fix outdated comments referencing thread Philipp Stanner
@ 2025-02-25 13:13 ` Philipp Stanner
  2025-02-25 13:55   ` Danilo Krummrich
  2025-02-25 13:53 ` [PATCH 1/2] drm/sched: Fix outdated comments referencing thread Danilo Krummrich
  1 sibling, 1 reply; 6+ messages in thread
From: Philipp Stanner @ 2025-02-25 13:13 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Tvrtko Ursulin
  Cc: dri-devel, linux-kernel

The kthread header doesn't need to be included anymore. It's a relict
from the days when the scheduler was still using a kthread instead of
workqueues.

Remove the unneeded includes.

Fixes: a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread")
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
I'm not sure whether we should Cc the stable kernel. It's inconvenient
and makes build times slower, but isn't really a bug.
---
 drivers/gpu/drm/scheduler/sched_entity.c | 1 -
 drivers/gpu/drm/scheduler/sched_fence.c  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index f9811420c787..e55b98af8a50 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -21,7 +21,6 @@
  *
  */
 
-#include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/completion.h>
 
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index e971528504a5..d6239e015b66 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -21,7 +21,6 @@
  *
  */
 
-#include <linux/kthread.h>
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] drm/sched: Remove kthread header
  2025-02-25 13:13 ` [PATCH 2/2] drm/sched: Remove kthread header Philipp Stanner
@ 2025-02-25 13:55   ` Danilo Krummrich
  0 siblings, 0 replies; 6+ messages in thread
From: Danilo Krummrich @ 2025-02-25 13:55 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Matthew Brost, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Tvrtko Ursulin, dri-devel, linux-kernel

On Tue, Feb 25, 2025 at 02:13:34PM +0100, Philipp Stanner wrote:
> The kthread header doesn't need to be included anymore. It's a relict
> from the days when the scheduler was still using a kthread instead of
> workqueues.
> 
> Remove the unneeded includes.
> 
> Fixes: a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread")
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> I'm not sure whether we should Cc the stable kernel. It's inconvenient
> and makes build times slower, but isn't really a bug.

I don't think that's needed. Analogous to the previous patch of this series, I
don't even think it should have a 'Fixes' tag.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] drm/sched: Fix outdated comments referencing thread
  2025-02-25 13:13 [PATCH 1/2] drm/sched: Fix outdated comments referencing thread Philipp Stanner
  2025-02-25 13:13 ` [PATCH 2/2] drm/sched: Remove kthread header Philipp Stanner
@ 2025-02-25 13:53 ` Danilo Krummrich
  2025-03-03 15:48   ` Philipp Stanner
  1 sibling, 1 reply; 6+ messages in thread
From: Danilo Krummrich @ 2025-02-25 13:53 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Matthew Brost, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Tvrtko Ursulin, dri-devel, linux-kernel

On Tue, Feb 25, 2025 at 02:13:32PM +0100, Philipp Stanner wrote:
> The GPU scheduler's comments refer to a "thread" at various places.
> Those are leftovers stemming from a rework in which the scheduler was

Maybe "leftovers from commit a6149f039369 ("drm/sched: Convert drm scheduler to
use a work queue rather than kthread") [...]".

> ported from using a kthread to workqueues.
> 
> Replace all references to kthreads.
> 
> Fixes: a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread")

I suggest to drop the 'Fixes' tag, it's not a fix in the sense of this tag.

> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c |  8 ++++----
>  drivers/gpu/drm/scheduler/sched_main.c   | 21 +++++++++++----------
>  include/drm/gpu_scheduler.h              | 12 ++++++------
>  3 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 87f88259ddf6..f9811420c787 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -538,10 +538,10 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>  		return;
>  
>  	/*
> -	 * Only when the queue is empty are we guaranteed that the scheduler
> -	 * thread cannot change ->last_scheduled. To enforce ordering we need
> -	 * a read barrier here. See drm_sched_entity_pop_job() for the other
> -	 * side.
> +	 * Only when the queue is empty are we guaranteed that
> +	 * drm_sched_run_job_work() cannot change entity->last_scheduled. To
> +	 * enforce ordering we need a read barrier here. See
> +	 * drm_sched_entity_pop_job() for the other side.
>  	 */
>  	smp_rmb();
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index c634993f1346..015ee327fe52 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -389,7 +389,7 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>   * drm_sched_job_done - complete a job
>   * @s_job: pointer to the job which is done
>   *
> - * Finish the job's fence and wake up the worker thread.
> + * Finish the job's fence and wake up the work item.
>   */
>  static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>  {
> @@ -550,8 +550,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  	if (job) {
>  		/*
>  		 * Remove the bad job so it cannot be freed by concurrent
> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread

Not related to your patch, but I assume this means "cannot be freed by
concurrent calls to drm_sched_job_cleanup()", which would still be incorrect,
since drm_sched_job_cleanup() doesn't free the job itself. Maybe you want to fix
this as well?

> -		 * is parked at which point it's safe.
> +		 * drm_sched_cleanup_jobs. It will be reinserted back after the
> +		 * scheduler's workqueues are stopped at which point it's safe.

You don't know whether the workqueues are "stopped". I think you want to say
that run_job / free_job work isn't scheduled or running.

Same for a couple more instances below.

>  		 */
>  		list_del_init(&job->list);
>  		spin_unlock(&sched->job_list_lock);
> @@ -597,10 +597,10 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  
>  	/*
>  	 * Reinsert back the bad job here - now it's safe as
> -	 * drm_sched_get_finished_job cannot race against us and release the
> +	 * drm_sched_get_finished_job() cannot race against us and release the

Not related to the patch, but fine for me, since you do it for consistency with
the change below.

>  	 * bad job at this point - we parked (waited for) any in progress
> -	 * (earlier) cleanups and drm_sched_get_finished_job will not be called
> -	 * now until the scheduler thread is unparked.
> +	 * (earlier) cleanups and drm_sched_get_finished_job() will not be
> +	 * called now until the scheduler's workqueues are unparked.

workqueues are unparked?

>  	 */
>  	if (bad && bad->sched == sched)
>  		/*
> @@ -613,7 +613,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  	 * Iterate the job list from later to  earlier one and either deactive
>  	 * their HW callbacks or remove them from pending list if they already
>  	 * signaled.
> -	 * This iteration is thread safe as sched thread is stopped.
> +	 * This iteration is thread safe as the scheduler's workqueues are
> +	 * stopped.
>  	 */
>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->pending_list,
>  					 list) {
> @@ -678,9 +679,9 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>  	struct drm_sched_job *s_job, *tmp;
>  
>  	/*
> -	 * Locking the list is not required here as the sched thread is parked
> -	 * so no new jobs are being inserted or removed. Also concurrent
> -	 * GPU recovers can't run in parallel.
> +	 * Locking the list is not required here as the scheduler's workqueues
> +	 * are paused, so no new jobs are being inserted or removed. Also
> +	 * concurrent GPU recovers can't run in parallel.
>  	 */
>  	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>  		struct dma_fence *fence = s_job->s_fence->parent;
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 50928a7ae98e..7da7b0b52a7e 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -192,8 +192,8 @@ struct drm_sched_entity {
>  	 * @last_scheduled:
>  	 *
>  	 * Points to the finished fence of the last scheduled job. Only written
> -	 * by the scheduler thread, can be accessed locklessly from
> -	 * drm_sched_job_arm() if the queue is empty.
> +	 * by &struct drm_gpu_scheduler.submit_wq. Can be accessed locklessly
> +	 * from drm_sched_job_arm() if the queue is empty.
>  	 */
>  	struct dma_fence __rcu		*last_scheduled;
>  
> @@ -426,14 +426,14 @@ struct drm_sched_backend_ops {
>  	 * Drivers typically issue a reset to recover from GPU hangs, and this
>  	 * procedure usually follows the following workflow:
>  	 *
> -	 * 1. Stop the scheduler using drm_sched_stop(). This will park the
> -	 *    scheduler thread and cancel the timeout work, guaranteeing that
> -	 *    nothing is queued while we reset the hardware queue
> +	 * 1. Stop the scheduler using drm_sched_stop(). This will stop the
> +	 *    scheduler's workqueues and cancel the timeout work, guaranteeing
> +	 *    that  nothing is queued while we reset the hardware queue
>  	 * 2. Try to gracefully stop non-faulty jobs (optional)
>  	 * 3. Issue a GPU reset (driver-specific)
>  	 * 4. Re-submit jobs using drm_sched_resubmit_jobs()
>  	 * 5. Restart the scheduler using drm_sched_start(). At that point, new
> -	 *    jobs can be queued, and the scheduler thread is unblocked
> +	 *    jobs can be queued, and the scheduler's workqueues be started again.
>  	 *
>  	 * Note that some GPUs have distinct hardware queues but need to reset
>  	 * the GPU globally, which requires extra synchronization between the
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] drm/sched: Fix outdated comments referencing thread
  2025-02-25 13:53 ` [PATCH 1/2] drm/sched: Fix outdated comments referencing thread Danilo Krummrich
@ 2025-03-03 15:48   ` Philipp Stanner
  2025-03-03 15:53     ` Danilo Krummrich
  0 siblings, 1 reply; 6+ messages in thread
From: Philipp Stanner @ 2025-03-03 15:48 UTC (permalink / raw)
  To: Danilo Krummrich, Philipp Stanner
  Cc: Matthew Brost, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Tvrtko Ursulin, dri-devel, linux-kernel

On Tue, 2025-02-25 at 14:53 +0100, Danilo Krummrich wrote:
> On Tue, Feb 25, 2025 at 02:13:32PM +0100, Philipp Stanner wrote:
> > The GPU scheduler's comments refer to a "thread" at various places.
> > Those are leftovers stemming from a rework in which the scheduler
> > was
> 
> Maybe "leftovers from commit a6149f039369 ("drm/sched: Convert drm
> scheduler to
> use a work queue rather than kthread") [...]".
> 
> > ported from using a kthread to workqueues.
> > 
> > Replace all references to kthreads.
> > 
> > Fixes: a6149f039369 ("drm/sched: Convert drm scheduler to use a
> > work queue rather than kthread")
> 
> I suggest to drop the 'Fixes' tag, it's not a fix in the sense of
> this tag.
> 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_entity.c |  8 ++++----
> >  drivers/gpu/drm/scheduler/sched_main.c   | 21 +++++++++++---------
> > -
> >  include/drm/gpu_scheduler.h              | 12 ++++++------
> >  3 files changed, 21 insertions(+), 20 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 87f88259ddf6..f9811420c787 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -538,10 +538,10 @@ void drm_sched_entity_select_rq(struct
> > drm_sched_entity *entity)
> >  		return;
> >  
> >  	/*
> > -	 * Only when the queue is empty are we guaranteed that the
> > scheduler
> > -	 * thread cannot change ->last_scheduled. To enforce
> > ordering we need
> > -	 * a read barrier here. See drm_sched_entity_pop_job() for
> > the other
> > -	 * side.
> > +	 * Only when the queue is empty are we guaranteed that
> > +	 * drm_sched_run_job_work() cannot change entity-
> > >last_scheduled. To
> > +	 * enforce ordering we need a read barrier here. See
> > +	 * drm_sched_entity_pop_job() for the other side.
> >  	 */
> >  	smp_rmb();
> >  
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index c634993f1346..015ee327fe52 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -389,7 +389,7 @@ static void drm_sched_run_free_queue(struct
> > drm_gpu_scheduler *sched)
> >   * drm_sched_job_done - complete a job
> >   * @s_job: pointer to the job which is done
> >   *
> > - * Finish the job's fence and wake up the worker thread.
> > + * Finish the job's fence and wake up the work item.
> >   */
> >  static void drm_sched_job_done(struct drm_sched_job *s_job, int
> > result)
> >  {
> > @@ -550,8 +550,8 @@ static void drm_sched_job_timedout(struct
> > work_struct *work)
> >  	if (job) {
> >  		/*
> >  		 * Remove the bad job so it cannot be freed by
> > concurrent
> > -		 * drm_sched_cleanup_jobs. It will be reinserted
> > back after sched->thread
> 
> Not related to your patch, but I assume this means "cannot be freed
> by
> concurrent calls to drm_sched_job_cleanup()", which would still be
> incorrect,
> since drm_sched_job_cleanup() doesn't free the job itself. Maybe you
> want to fix
> this as well?
> 
> > -		 * is parked at which point it's safe.
> > +		 * drm_sched_cleanup_jobs. It will be reinserted
> > back after the
> > +		 * scheduler's workqueues are stopped at which
> > point it's safe.
> 
> You don't know whether the workqueues are "stopped". I think you want
> to say
> that run_job / free_job work isn't scheduled or running.

How about "after the scheduler's work items have been cancelled"?

P.

> 
> Same for a couple more instances below.
> 
> >  		 */
> >  		list_del_init(&job->list);
> >  		spin_unlock(&sched->job_list_lock);
> > @@ -597,10 +597,10 @@ void drm_sched_stop(struct drm_gpu_scheduler
> > *sched, struct drm_sched_job *bad)
> >  
> >  	/*
> >  	 * Reinsert back the bad job here - now it's safe as
> > -	 * drm_sched_get_finished_job cannot race against us and
> > release the
> > +	 * drm_sched_get_finished_job() cannot race against us and
> > release the
> 
> Not related to the patch, but fine for me, since you do it for
> consistency with
> the change below.
> 
> >  	 * bad job at this point - we parked (waited for) any in
> > progress
> > -	 * (earlier) cleanups and drm_sched_get_finished_job will
> > not be called
> > -	 * now until the scheduler thread is unparked.
> > +	 * (earlier) cleanups and drm_sched_get_finished_job()
> > will not be
> > +	 * called now until the scheduler's workqueues are
> > unparked.
> 
> workqueues are unparked?
> 
> >  	 */
> >  	if (bad && bad->sched == sched)
> >  		/*
> > @@ -613,7 +613,8 @@ void drm_sched_stop(struct drm_gpu_scheduler
> > *sched, struct drm_sched_job *bad)
> >  	 * Iterate the job list from later to  earlier one and
> > either deactive
> >  	 * their HW callbacks or remove them from pending list if
> > they already
> >  	 * signaled.
> > -	 * This iteration is thread safe as sched thread is
> > stopped.
> > +	 * This iteration is thread safe as the scheduler's
> > workqueues are
> > +	 * stopped.
> >  	 */
> >  	list_for_each_entry_safe_reverse(s_job, tmp, &sched-
> > >pending_list,
> >  					 list) {
> > @@ -678,9 +679,9 @@ void drm_sched_start(struct drm_gpu_scheduler
> > *sched, int errno)
> >  	struct drm_sched_job *s_job, *tmp;
> >  
> >  	/*
> > -	 * Locking the list is not required here as the sched
> > thread is parked
> > -	 * so no new jobs are being inserted or removed. Also
> > concurrent
> > -	 * GPU recovers can't run in parallel.
> > +	 * Locking the list is not required here as the
> > scheduler's workqueues
> > +	 * are paused, so no new jobs are being inserted or
> > removed. Also
> > +	 * concurrent GPU recovers can't run in parallel.
> >  	 */
> >  	list_for_each_entry_safe(s_job, tmp, &sched->pending_list,
> > list) {
> >  		struct dma_fence *fence = s_job->s_fence->parent;
> > diff --git a/include/drm/gpu_scheduler.h
> > b/include/drm/gpu_scheduler.h
> > index 50928a7ae98e..7da7b0b52a7e 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -192,8 +192,8 @@ struct drm_sched_entity {
> >  	 * @last_scheduled:
> >  	 *
> >  	 * Points to the finished fence of the last scheduled job.
> > Only written
> > -	 * by the scheduler thread, can be accessed locklessly
> > from
> > -	 * drm_sched_job_arm() if the queue is empty.
> > +	 * by &struct drm_gpu_scheduler.submit_wq. Can be accessed
> > locklessly
> > +	 * from drm_sched_job_arm() if the queue is empty.
> >  	 */
> >  	struct dma_fence __rcu		*last_scheduled;
> >  
> > @@ -426,14 +426,14 @@ struct drm_sched_backend_ops {
> >  	 * Drivers typically issue a reset to recover from GPU
> > hangs, and this
> >  	 * procedure usually follows the following workflow:
> >  	 *
> > -	 * 1. Stop the scheduler using drm_sched_stop(). This will
> > park the
> > -	 *    scheduler thread and cancel the timeout work,
> > guaranteeing that
> > -	 *    nothing is queued while we reset the hardware queue
> > +	 * 1. Stop the scheduler using drm_sched_stop(). This will
> > stop the
> > +	 *    scheduler's workqueues and cancel the timeout work,
> > guaranteeing
> > +	 *    that  nothing is queued while we reset the hardware
> > queue
> >  	 * 2. Try to gracefully stop non-faulty jobs (optional)
> >  	 * 3. Issue a GPU reset (driver-specific)
> >  	 * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> >  	 * 5. Restart the scheduler using drm_sched_start(). At
> > that point, new
> > -	 *    jobs can be queued, and the scheduler thread is
> > unblocked
> > +	 *    jobs can be queued, and the scheduler's workqueues
> > be started again.
> >  	 *
> >  	 * Note that some GPUs have distinct hardware queues but
> > need to reset
> >  	 * the GPU globally, which requires extra synchronization
> > between the
> > -- 
> > 2.48.1
> > 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] drm/sched: Fix outdated comments referencing thread
  2025-03-03 15:48   ` Philipp Stanner
@ 2025-03-03 15:53     ` Danilo Krummrich
  0 siblings, 0 replies; 6+ messages in thread
From: Danilo Krummrich @ 2025-03-03 15:53 UTC (permalink / raw)
  To: phasta
  Cc: Matthew Brost, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Tvrtko Ursulin, dri-devel, linux-kernel

On Mon, Mar 03, 2025 at 04:48:36PM +0100, Philipp Stanner wrote:
> On Tue, 2025-02-25 at 14:53 +0100, Danilo Krummrich wrote:
> > On Tue, Feb 25, 2025 at 02:13:32PM +0100, Philipp Stanner wrote:
> 
> > > -		 * is parked at which point it's safe.
> > > +		 * drm_sched_cleanup_jobs. It will be reinserted
> > > back after the
> > > +		 * scheduler's workqueues are stopped at which
> > > point it's safe.
> > 
> > You don't know whether the workqueues are "stopped". I think you want
> > to say
> > that run_job / free_job work isn't scheduled or running.
> 
> How about "after the scheduler's work items have been cancelled"?

Sure, that's fine too. But be aware that you may have to change this wording in
case another (unrelated) work item is gets added to the scheduler for some
reason.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-03-03 15:53 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-25 13:13 [PATCH 1/2] drm/sched: Fix outdated comments referencing thread Philipp Stanner
2025-02-25 13:13 ` [PATCH 2/2] drm/sched: Remove kthread header Philipp Stanner
2025-02-25 13:55   ` Danilo Krummrich
2025-02-25 13:53 ` [PATCH 1/2] drm/sched: Fix outdated comments referencing thread Danilo Krummrich
2025-03-03 15:48   ` Philipp Stanner
2025-03-03 15:53     ` Danilo Krummrich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox