From: Jingwen Chen <Jingwen.Chen2@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
"Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
amd-gfx@lists.freedesktop.org
Cc: <monk.liu@amd.com>
Subject: Re: [PATCH] Revert "drm/scheduler: Avoid accessing freed bad job."
Date: Wed, 18 Aug 2021 16:08:37 +0800 [thread overview]
Message-ID: <20210818080837.gnadncffueb6appu@wayne-dev> (raw)
In-Reply-To: <2126dab8-3484-7fd6-b99e-b94e830fd50b@amd.com>
On Tue Aug 17, 2021 at 03:43:58PM +0200, Christian König wrote:
>
>
> Am 17.08.21 um 15:37 schrieb Andrey Grodzovsky:
> > On 2021-08-17 12:28 a.m., Jingwen Chen wrote:
> > > [Why]
> > > for bailing job, this commit will delete it from pending list thus the
> > > bailing job will never have a chance to be resubmitted even in advance
> > > tdr mode.
> > >
> > > [How]
> > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > this commit tries to work around is completely solved.So revert this
> > > commit.
> > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> >
> >
> > Can you elaborate please how this solves the race ?
> > As far as I see and with this patch reverted, in drm_sched_job_timedout
> > you get a pointer to next job to process in timed out handler,
> > immediately
> > next this job is actually finished and it's fence signaled, this in turn
> > triggers
> > drm_sched_get_cleanup_job which fetches this job and returns to
Hi Andrey,
if drm_sched_job_timedout is triggered first, drm_sched_get_cleanup_job will return
NULL when the timeout worker is running according to this code:
if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
!cancel_delayed_work(&sched->work_tdr)) ||
kthread_should_park())
return NULL;
But yes a dma_fence_get(job->s_fence->parent) is needed before
processing timedout_job. When the bad job is signaled just before
processing, the amdgpu_fence_free can be triggered by the dma_fence_put() of HW fence.
And if I'm understanding this race condition correctly, the spin_lock is
still needed here to avoid the drm_sched_get_cleanup_job get the
spin_lock first and then enter the tdr work.
> > drm_sched_main
> > which in turn call free_job callabck->...->amdgpu_fence_free which frees
> > the job
> > from the HW dma_fence release callback. After that you proceed with a
> > freed job
> > in timed out handler.
> >
> > If you could take the HW fence reference from drm_sched_job_timedout
> > before
> > starting processing then yes, I think it would work.
>
> Yes, precisely that's what I had in mind as well and seems to be missing
> from this patch.
>
> Regards,
> Christian.
>
> >
> > Andrey
> >
> >
> > >
> > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > ---
> > > drivers/gpu/drm/scheduler/sched_main.c | 27 --------------------------
> > > 1 file changed, 27 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index a2a953693b45..31d1176d939f 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -317,21 +317,10 @@ static void drm_sched_job_timedout(struct
> > > work_struct *work)
> > > enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > sched = container_of(work, struct drm_gpu_scheduler,
> > > work_tdr.work);
> > > -
> > > - /* Protects against concurrent deletion in
> > > drm_sched_get_cleanup_job */
> > > - spin_lock(&sched->job_list_lock);
> > > job = list_first_entry_or_null(&sched->pending_list,
> > > struct drm_sched_job, list);
> > > if (job) {
> > > - /*
> > > - * Remove the bad job so it cannot be freed by concurrent
> > > - * drm_sched_cleanup_jobs. It will be reinserted back after
> > > sched->thread
> > > - * is parked at which point it's safe.
> > > - */
> > > - list_del_init(&job->list);
> > > - spin_unlock(&sched->job_list_lock);
> > > -
> > > status = job->sched->ops->timedout_job(job);
> > > /*
> > > @@ -342,8 +331,6 @@ static void drm_sched_job_timedout(struct
> > > work_struct *work)
> > > job->sched->ops->free_job(job);
> > > sched->free_guilty = false;
> > > }
> > > - } else {
> > > - spin_unlock(&sched->job_list_lock);
> > > }
> > > if (status != DRM_GPU_SCHED_STAT_ENODEV) {
> > > @@ -392,20 +379,6 @@ void drm_sched_stop(struct drm_gpu_scheduler
> > > *sched, struct drm_sched_job *bad)
> > > kthread_park(sched->thread);
> > > - /*
> > > - * Reinsert back the bad job here - now it's safe as
> > > - * drm_sched_get_cleanup_job cannot race against us and release the
> > > - * bad job at this point - we parked (waited for) any in progress
> > > - * (earlier) cleanups and drm_sched_get_cleanup_job will not be
> > > called
> > > - * now until the scheduler thread is unparked.
> > > - */
> > > - if (bad && bad->sched == sched)
> > > - /*
> > > - * Add at the head of the queue to reflect it was the earliest
> > > - * job extracted.
> > > - */
> > > - list_add(&bad->list, &sched->pending_list);
> > > -
> > > /*
> > > * Iterate the job list from later to earlier one and either
> > > deactive
> > > * their HW callbacks or remove them from pending list if they
> > > already
>
next prev parent reply other threads:[~2021-08-18 8:08 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-17 4:28 [PATCH] Revert "drm/scheduler: Avoid accessing freed bad job." Jingwen Chen
2021-08-17 6:35 ` Liu, Monk
2021-08-17 13:37 ` Andrey Grodzovsky
2021-08-17 13:43 ` Christian König
2021-08-18 8:08 ` Jingwen Chen [this message]
2021-08-18 10:39 ` Jingwen Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210818080837.gnadncffueb6appu@wayne-dev \
--to=jingwen.chen2@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=andrey.grodzovsky@amd.com \
--cc=christian.koenig@amd.com \
--cc=monk.liu@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.