All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()
@ 2023-06-13  9:28 Boris Brezillon
  2023-06-13 10:05 ` Boris Brezillon
  2023-06-16 10:19 ` Boris Brezillon
  0 siblings, 2 replies; 3+ messages in thread
From: Boris Brezillon @ 2023-06-13  9:28 UTC (permalink / raw)
  To: dri-devel
  Cc: Luben Tuikov, Sarah Walker, Sumit Semwal, Boris Brezillon,
	Donald Robson, Christian König

drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
from the dependency array that was waited upon before
drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
so we're basically waiting for all dependencies except one.

In theory, this wait shouldn't be needed because resources should have
their users registered to the dma_resv object, thus guaranteeing that
future jobs wanting to access these resources wait on all the previous
users (depending on the access type, of course). But we want to keep
these explicit waits in the kill entity path just in case.

Let's make sure we keep all dependencies in the array in
drm_sched_job_dependency(), so we can iterate over the array and wait
in drm_sched_entity_kill_jobs_cb().

We also make sure we wait on drm_sched_fence::finished if we were asked
to wait on drm_sched_fence::scheduled, but the intent was probably to
delegate the wait to the GPU, but we want resources to be completely
idle when killing jobs.

v3:
- Always wait for drm_sched_fence::finished fences in
  drm_sched_entity_kill_jobs_cb() when we see a sched_fence

v2:
- Don't evict deps in drm_sched_job_dependency()

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Suggested-by: "Christian König" <christian.koenig@amd.com>
Cc: Frank Binns <frank.binns@imgtec.com>
Cc: Sarah Walker <sarah.walker@imgtec.com>
Cc: Donald Robson <donald.robson@imgtec.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 28 ++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 68e807ae136a..bc1bc3d47f7f 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -176,13 +176,24 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 {
 	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 						 finish_cb);
+	unsigned long index;
 	int r;
 
 	dma_fence_put(f);
 
 	/* Wait for all dependencies to avoid data corruptions */
-	while (!xa_empty(&job->dependencies)) {
-		f = xa_erase(&job->dependencies, job->last_dependency++);
+	xa_for_each(&job->dependencies, index, f) {
+		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
+
+		/* Make sure we wait for the finished fence here, so we can
+		 * guarantee that any job we depend on that is still accessing
+		 * resources is done before we signal this job finished fence
+		 * and unblock further accesses on those resources.
+		 */
+		if (s_fence && f == &s_fence->scheduled)
+			f = &s_fence->finished;
+
+		xa_erase(&job->dependencies, index);
 		r = dma_fence_add_callback(f, &job->finish_cb,
 					   drm_sched_entity_kill_jobs_cb);
 		if (!r)
@@ -415,8 +426,17 @@ static struct dma_fence *
 drm_sched_job_dependency(struct drm_sched_job *job,
 			 struct drm_sched_entity *entity)
 {
-	if (!xa_empty(&job->dependencies))
-		return xa_erase(&job->dependencies, job->last_dependency++);
+	struct dma_fence *f;
+
+	/* We keep the fence around, so we can iterate over all dependencies
+	 * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
+	 * before killing the job.
+	 */
+	f = xa_load(&job->dependencies, job->last_dependency);
+	if (f) {
+		job->last_dependency++;
+		return dma_fence_get(f);
+	}
 
 	if (job->sched->ops->prepare_job)
 		return job->sched->ops->prepare_job(job, entity);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()
  2023-06-13  9:28 [PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb() Boris Brezillon
@ 2023-06-13 10:05 ` Boris Brezillon
  2023-06-16 10:19 ` Boris Brezillon
  1 sibling, 0 replies; 3+ messages in thread
From: Boris Brezillon @ 2023-06-13 10:05 UTC (permalink / raw)
  To: dri-devel
  Cc: Sarah Walker, Sumit Semwal, Luben Tuikov, Donald Robson,
	Christian König

On Tue, 13 Jun 2023 11:28:45 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> We also make sure we wait on drm_sched_fence::finished if we were asked
> to wait on drm_sched_fence::scheduled, but the intent was probably to
> delegate the wait to the GPU, but we want resources to be completely
> idle when killing jobs.

Uh, I need to rephrase that part:

"
We also make sure we wait on drm_sched_fence::finished if we were
originally asked to wait on drm_sched_fence::scheduled. In that case,
we assume the intent was to delegate the wait to the firmware/GPU or
rely on the pipelining done at the entity/scheduler level, but when
killing jobs, we really want to wait for completion not just scheduling.
"

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()
  2023-06-13  9:28 [PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb() Boris Brezillon
  2023-06-13 10:05 ` Boris Brezillon
@ 2023-06-16 10:19 ` Boris Brezillon
  1 sibling, 0 replies; 3+ messages in thread
From: Boris Brezillon @ 2023-06-16 10:19 UTC (permalink / raw)
  To: dri-devel
  Cc: Sarah Walker, Sumit Semwal, Luben Tuikov, Donald Robson,
	Christian König

On Tue, 13 Jun 2023 11:28:45 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
> from the dependency array that was waited upon before
> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> so we're basically waiting for all dependencies except one.
> 
> In theory, this wait shouldn't be needed because resources should have
> their users registered to the dma_resv object, thus guaranteeing that
> future jobs wanting to access these resources wait on all the previous
> users (depending on the access type, of course). But we want to keep
> these explicit waits in the kill entity path just in case.
> 
> Let's make sure we keep all dependencies in the array in
> drm_sched_job_dependency(), so we can iterate over the array and wait
> in drm_sched_entity_kill_jobs_cb().
> 
> We also make sure we wait on drm_sched_fence::finished if we were asked
> to wait on drm_sched_fence::scheduled, but the intent was probably to
> delegate the wait to the GPU, but we want resources to be completely
> idle when killing jobs.
> 
> v3:
> - Always wait for drm_sched_fence::finished fences in
>   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
> 
> v2:
> - Don't evict deps in drm_sched_job_dependency()
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Suggested-by: "Christian König" <christian.koenig@amd.com>
> Cc: Frank Binns <frank.binns@imgtec.com>
> Cc: Sarah Walker <sarah.walker@imgtec.com>
> Cc: Donald Robson <donald.robson@imgtec.com>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: "Christian König" <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 28 ++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 68e807ae136a..bc1bc3d47f7f 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -176,13 +176,24 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>  {
>  	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
>  						 finish_cb);
> +	unsigned long index;
>  	int r;
>  
>  	dma_fence_put(f);
>  
>  	/* Wait for all dependencies to avoid data corruptions */
> -	while (!xa_empty(&job->dependencies)) {
> -		f = xa_erase(&job->dependencies, job->last_dependency++);
> +	xa_for_each(&job->dependencies, index, f) {
> +		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> +
> +		/* Make sure we wait for the finished fence here, so we can
> +		 * guarantee that any job we depend on that is still accessing
> +		 * resources is done before we signal this job finished fence
> +		 * and unblock further accesses on those resources.
> +		 */
> +		if (s_fence && f == &s_fence->scheduled)
> +			f = &s_fence->finished;

Oops. There's a use-after-free+leak bug here: when switching from
scheduled to finished fence, we need to grab a ref on the finished
fence and release the one we had on the scheduled fence.

Sending a v4 to fix that.

> +
> +		xa_erase(&job->dependencies, index);
>  		r = dma_fence_add_callback(f, &job->finish_cb,
>  					   drm_sched_entity_kill_jobs_cb);
>  		if (!r)
> @@ -415,8 +426,17 @@ static struct dma_fence *
>  drm_sched_job_dependency(struct drm_sched_job *job,
>  			 struct drm_sched_entity *entity)
>  {
> -	if (!xa_empty(&job->dependencies))
> -		return xa_erase(&job->dependencies, job->last_dependency++);
> +	struct dma_fence *f;
> +
> +	/* We keep the fence around, so we can iterate over all dependencies
> +	 * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
> +	 * before killing the job.
> +	 */
> +	f = xa_load(&job->dependencies, job->last_dependency);
> +	if (f) {
> +		job->last_dependency++;
> +		return dma_fence_get(f);
> +	}
>  
>  	if (job->sched->ops->prepare_job)
>  		return job->sched->ops->prepare_job(job, entity);


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-06-16 10:19 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-13  9:28 [PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb() Boris Brezillon
2023-06-13 10:05 ` Boris Brezillon
2023-06-16 10:19 ` Boris Brezillon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.