From: Matthew Brost <matthew.brost@intel.com>
To: Matthew Auld <matthew.auld@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2] drm/xe: Take ref to job and job's fence in xe_sched_job_arm
Date: Wed, 25 Sep 2024 15:32:59 +0000 [thread overview]
Message-ID: <ZvQtK78Jo+OxtTpy@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <0a1bbe76-c0ba-4d6e-aae7-0ff723bc334f@intel.com>
On Wed, Sep 25, 2024 at 03:29:20PM +0100, Matthew Auld wrote:
> On 24/09/2024 19:45, Matthew Brost wrote:
> > Fixes two possible races:
> >
> > - Submission to hardware signals the job's fence before the dma_fence_get
> > at the end of run_job
> > - TDR fires, signaling the fence and freeing the job, before run_job
> > completes
> >
> > Taking refs in xe_sched_job_arm to the job and the job's fence solves these
> > by ensuring all refs are collected before entering the DRM scheduler. The
> > refs are dropped in run_job and the DRM scheduler respectively. This is safe
> > as once xe_sched_job_arm is called, execution of the job through the DRM
> > sched is guaranteed.
> >
> > v2:
> > - Take job ref on resubmit (Matt Auld)
> >
> > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2811
>
> Maybe also:
> https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2843
>
> ?
Yes, looks like the same issue.
>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: <stable@vger.kernel.org> # v6.8+
> > ---
> > drivers/gpu/drm/xe/xe_execlist.c | 4 +++-
> > drivers/gpu/drm/xe/xe_gpu_scheduler.c | 17 +++++++++++++++++
> > drivers/gpu/drm/xe/xe_gpu_scheduler.h | 6 +-----
> > drivers/gpu/drm/xe/xe_guc_submit.c | 11 +++++++----
> > drivers/gpu/drm/xe/xe_sched_job.c | 5 ++---
> > drivers/gpu/drm/xe/xe_sched_job_types.h | 1 -
> > 6 files changed, 30 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
> > index f3b71fe7a96d..b70706c9caf2 100644
> > --- a/drivers/gpu/drm/xe/xe_execlist.c
> > +++ b/drivers/gpu/drm/xe/xe_execlist.c
> > @@ -309,11 +309,13 @@ execlist_run_job(struct drm_sched_job *drm_job)
> > struct xe_sched_job *job = to_xe_sched_job(drm_job);
> > struct xe_exec_queue *q = job->q;
> > struct xe_execlist_exec_queue *exl = job->q->execlist;
> > + struct dma_fence *fence = job->fence;
> > q->ring_ops->emit_job(job);
> > xe_execlist_make_active(exl);
> > + xe_sched_job_put(job);
> > - return dma_fence_get(job->fence);
> > + return fence;
> > }
> > static void execlist_job_free(struct drm_sched_job *drm_job)
> > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > index c518d1d16d82..7ea0c8e9e7a9 100644
> > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > @@ -4,6 +4,7 @@
> > */
> > #include "xe_gpu_scheduler.h"
> > +#include "xe_sched_job.h"
> > static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
> > {
> > @@ -106,3 +107,19 @@ void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
> > list_add_tail(&msg->link, &sched->msgs);
> > xe_sched_process_msg_queue(sched);
> > }
> > +
> > +/**
> > + * xe_sched_resubmit_jobs() - Resubmit scheduler jobs
> > + * @sched: Xe GPU scheduler
> > + *
> > + * Take a ref to all jobs on the scheduler and resubmit them.
> > + */
> > +void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
> > +{
> > + struct drm_sched_job *s_job;
> > +
> > + list_for_each_entry(s_job, &sched->base.pending_list, list)
> > + xe_sched_job_get(to_xe_sched_job(s_job)); /* Paired with put in run_job */
> > +
> > + drm_sched_resubmit_jobs(&sched->base);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > index cee9c6809fc0..ecbe5dd6664e 100644
> > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > @@ -26,6 +26,7 @@ void xe_sched_add_msg(struct xe_gpu_scheduler *sched,
> > struct xe_sched_msg *msg);
> > void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
> > struct xe_sched_msg *msg);
> > +void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched);
> > static inline void xe_sched_msg_lock(struct xe_gpu_scheduler *sched)
> > {
> > @@ -47,11 +48,6 @@ static inline void xe_sched_tdr_queue_imm(struct xe_gpu_scheduler *sched)
> > drm_sched_tdr_queue_imm(&sched->base);
> > }
> > -static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
> > -{
> > - drm_sched_resubmit_jobs(&sched->base);
> > -}
> > -
> > static inline bool
> > xe_sched_invalidate_job(struct xe_sched_job *job, int threshold)
> > {
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index fbbe6a487bbb..689279fdef80 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -766,6 +766,7 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
> > struct xe_guc *guc = exec_queue_to_guc(q);
> > struct xe_device *xe = guc_to_xe(guc);
> > bool lr = xe_exec_queue_is_lr(q);
> > + struct dma_fence *fence = NULL;
> > xe_assert(xe, !(exec_queue_destroyed(q) || exec_queue_pending_disable(q)) ||
> > exec_queue_banned(q) || exec_queue_suspended(q));
> > @@ -782,12 +783,14 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
> > if (lr) {
> > xe_sched_job_set_error(job, -EOPNOTSUPP);
> > - return NULL;
> > - } else if (test_and_set_bit(JOB_FLAG_SUBMIT, &job->fence->flags)) {
> > - return job->fence;
> > + dma_fence_put(job->fence); /* Drop ref from xe_sched_job_arm */
> > } else {
> > - return dma_fence_get(job->fence);
> > + fence = job->fence;
> > }
> > +
> > + xe_sched_job_put(job); /* Pairs with get from xe_sched_job_arm */
>
> Only doubt is job being destroyed here. I think you were saying that
> guc_exec_queue_free_job(drm_job) can potentially happen before run_job()
> completes. But if that's the case can't the refcount reach zero here, and
> then caller of run_job() goes down in flames, since the drm_job is no longer
> a valid pointer, assuming the put() here frees the memory for it?
>
Free job just puts the job (the creation ref), so we still hold the
reference from xe_sched_job_arm here. This put could potentially free the
job's memory, but it is safe at this point in time as only the job's fence
is needed after this. The job's fence is decoupled from the job and
refcounted too.
Matt
> > +
> > + return fence;
> > }
> > static void guc_exec_queue_free_job(struct drm_sched_job *drm_job)
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
> > index eeccc1c318ae..d0f4b908411f 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job.c
> > +++ b/drivers/gpu/drm/xe/xe_sched_job.c
> > @@ -280,16 +280,15 @@ void xe_sched_job_arm(struct xe_sched_job *job)
> > fence = &chain->base;
> > }
> > - job->fence = fence;
> > + xe_sched_job_get(job); /* Pairs with put in run_job */
> > + job->fence = dma_fence_get(fence); /* Pairs with put in scheduler */
> > drm_sched_job_arm(&job->drm);
> > }
> > void xe_sched_job_push(struct xe_sched_job *job)
> > {
> > - xe_sched_job_get(job);
> > trace_xe_sched_job_exec(job);
> > drm_sched_entity_push_job(&job->drm);
> > - xe_sched_job_put(job);
> > }
> > /**
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
> > index 0d3f76fb05ce..8ed95e1a378f 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job_types.h
> > +++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
> > @@ -40,7 +40,6 @@ struct xe_sched_job {
> > * @fence: dma fence to indicate completion. 1 way relationship - job
> > * can safely reference fence, fence cannot safely reference job.
> > */
> > -#define JOB_FLAG_SUBMIT DMA_FENCE_FLAG_USER_BITS
> > struct dma_fence *fence;
> > /** @user_fence: write back value when BB is complete */
> > struct {
Thread overview: 13+ messages
2024-09-24 18:45 [PATCH v2] drm/xe: Take ref to job and job's fence in xe_sched_job_arm Matthew Brost
2024-09-25 14:29 ` Matthew Auld
2024-09-25 15:32 ` Matthew Brost [this message]
2024-09-25 15:58 ` Matthew Auld
2024-09-25 16:22 ` Matthew Brost
2024-09-26 3:51 ` ✓ CI.Patch_applied: success for drm/xe: Take ref to job and job's fence in xe_sched_job_arm (rev2) Patchwork
2024-09-26 3:51 ` ✓ CI.checkpatch: " Patchwork
2024-09-26 3:52 ` ✓ CI.KUnit: " Patchwork
2024-09-26 4:04 ` ✓ CI.Build: " Patchwork
2024-09-26 4:06 ` ✓ CI.Hooks: " Patchwork
2024-09-26 4:07 ` ✓ CI.checksparse: " Patchwork
2024-09-26 4:53 ` ✓ CI.BAT: " Patchwork
2024-09-26 15:34 ` ✗ CI.FULL: failure " Patchwork