From: Philipp Stanner <phasta@mailbox.org>
To: "Maíra Canal" <mcanal@igalia.com>,
phasta@kernel.org, "Matthew Brost" <matthew.brost@intel.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
"Simona Vetter" <simona@ffwll.ch>,
"David Airlie" <airlied@gmail.com>,
"Melissa Wen" <mwen@igalia.com>,
"Lucas Stach" <l.stach@pengutronix.de>,
"Russell King" <linux+etnaviv@armlinux.org.uk>,
"Christian Gmeiner" <christian.gmeiner@gmail.com>,
"Lucas De Marchi" <lucas.demarchi@intel.com>,
"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
"Boris Brezillon" <boris.brezillon@collabora.com>,
"Rob Herring" <robh@kernel.org>,
"Steven Price" <steven.price@arm.com>,
"Liviu Dudau" <liviu.dudau@arm.com>
Cc: kernel-dev@igalia.com, dri-devel@lists.freedesktop.org,
etnaviv@lists.freedesktop.org, intel-xe@lists.freedesktop.org
Subject: Re: [PATCH v2 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
Date: Mon, 02 Jun 2025 13:59:14 +0200 [thread overview]
Message-ID: <bcc0ed477f8a6f3bb06665b1756bcb98fb7af871.camel@mailbox.org> (raw)
In-Reply-To: <177ea4e7-b05d-4ee0-8b5e-e3dfd67491a0@igalia.com>
On Mon, 2025-06-02 at 08:36 -0300, Maíra Canal wrote:
> Hi Philipp,
>
> On 02/06/25 04:28, Philipp Stanner wrote:
> > On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote:
>
> [...]
>
> > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > index
> > > 7146069a98492f5fab2a49d96e2054f649e1fe3d..46f5391e84a12232b247886
> > > cf13
> > > 11f8e09f42f04 100644
> > > --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > @@ -40,11 +40,11 @@ static enum drm_gpu_sched_stat
> > > etnaviv_sched_timedout_job(struct drm_sched_job
> > > int change;
> > >
> > > /*
> > > - * If the GPU managed to complete this jobs fence, the
> > > timout is
> > > - * spurious. Bail out.
> > > + * If the GPU managed to complete this jobs fence, the
> > > timeout has
> > > + * fired before free-job worker. The timeout is
> > > spurious, so
> > > bail out.
> > > */
> > > if (dma_fence_is_signaled(submit->out_fence))
> > > - goto out_no_timeout;
> > > + return DRM_GPU_SCHED_STAT_NO_HANG;
> > >
> > > /*
> > > * If the GPU is still making forward progress on the
> > > front-
> > > end (which
> > > @@ -70,7 +70,7 @@ static enum drm_gpu_sched_stat
> > > etnaviv_sched_timedout_job(struct drm_sched_job
> > > gpu->hangcheck_dma_addr = dma_addr;
> > > gpu->hangcheck_primid = primid;
> > > gpu->hangcheck_fence = gpu->completed_fence;
> > > - goto out_no_timeout;
> > > + return DRM_GPU_SCHED_STAT_NO_HANG;
> > > }
> > >
> > > /* block scheduler */
> > > @@ -86,10 +86,7 @@ static enum drm_gpu_sched_stat
> > > etnaviv_sched_timedout_job(struct drm_sched_job
> > > drm_sched_resubmit_jobs(&gpu->sched);
> > >
> > > drm_sched_start(&gpu->sched, 0);
> > > - return DRM_GPU_SCHED_STAT_RESET;
> > >
> > > -out_no_timeout:
> > > - list_add(&sched_job->list, &sched_job->sched-
> > > >pending_list);
> >
> > Here you actually remove the manipulation of the scheduler
> > internals,
> > but you didn't in v3d. Just to point that out.
> >
> >
> > And BTW I'm just seeing that the pending_list gets manipulated here
> > with the scheduler's workqueues running and no locks being hold.
> >
> > Oh man :(
> >
> > That is most certainly a bug, and I recommend that the etnaviv
> > maintainers at least add the appropriate lock here and backport
> > that
> > since it can race any time.
> >
> >
> > But thx for working on that, Maíra. Good that we can remove the
> > stuff
> > this way.
> >
> > Thinking about it, this patch even fixes a bug. So could contain a
> > Fixes: tag. But I'm not sure if it's worth it to mark the entire
> > series
> > for Stable. Opinions?
>
> I believe the best way to fix this bug would be doing something
> similar
> to what I did to v3d: send a temporary fix adding the locks, which
> will
> be backported to stable. I'll send a fix to Etnaviv today.
Having the locking is significantly better than not having it, and
adding it gets my blessing, but it still doesn't solve all issues.
As Matt pointed out and as you address in your patch №2, a job could
still leak if no one makes sure that free_job_work gets kicked off. And
exposing yet another scheduler function publicly is obviously
unacceptable, so it's either live with the potential leak, or use the
new status code.
P.
>
> Thanks for the review, Phillip!
>
> Best Regards,
> - Maíra
>
> >
> >
> > P.
> >
> >
> > > return DRM_GPU_SCHED_STAT_RESET;
> > > }
> > >
> > >
> >
>
next prev parent reply other threads:[~2025-06-13 13:59 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-30 14:01 [PATCH v2 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-05-30 14:01 ` [PATCH v2 1/8] drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET Maíra Canal
2025-05-30 14:01 ` [PATCH v2 2/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-06-02 7:06 ` Philipp Stanner
2025-05-30 14:01 ` [PATCH v2 3/8] drm/sched: Reduce scheduler's timeout for timeout tests Maíra Canal
2025-06-02 8:54 ` Philipp Stanner
2025-06-02 9:06 ` Tvrtko Ursulin
2025-05-30 14:01 ` [PATCH v2 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-06-02 9:34 ` Tvrtko Ursulin
2025-05-30 14:01 ` [PATCH v2 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset Maíra Canal
2025-06-02 7:13 ` Philipp Stanner
2025-06-02 11:27 ` Maíra Canal
2025-05-30 14:01 ` [PATCH v2 6/8] drm/etnaviv: " Maíra Canal
2025-06-02 7:28 ` Philipp Stanner
2025-06-02 11:36 ` Maíra Canal
2025-06-02 11:59 ` Philipp Stanner [this message]
2025-05-30 14:01 ` [PATCH v2 7/8] drm/xe: " Maíra Canal
2025-06-02 7:47 ` Philipp Stanner
2025-05-30 14:01 ` [PATCH v2 8/8] drm/panfrost: " Maíra Canal
2025-05-30 14:08 ` ✗ CI.Patch_applied: failure for drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bcc0ed477f8a6f3bb06665b1756bcb98fb7af871.camel@mailbox.org \
--to=phasta@mailbox.org \
--cc=airlied@gmail.com \
--cc=boris.brezillon@collabora.com \
--cc=christian.gmeiner@gmail.com \
--cc=ckoenig.leichtzumerken@gmail.com \
--cc=dakr@kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=etnaviv@lists.freedesktop.org \
--cc=intel-xe@lists.freedesktop.org \
--cc=kernel-dev@igalia.com \
--cc=l.stach@pengutronix.de \
--cc=linux+etnaviv@armlinux.org.uk \
--cc=liviu.dudau@arm.com \
--cc=lucas.demarchi@intel.com \
--cc=matthew.brost@intel.com \
--cc=mcanal@igalia.com \
--cc=mwen@igalia.com \
--cc=phasta@kernel.org \
--cc=robh@kernel.org \
--cc=rodrigo.vivi@intel.com \
--cc=simona@ffwll.ch \
--cc=steven.price@arm.com \
--cc=thomas.hellstrom@linux.intel.com \
--cc=tvrtko.ursulin@igalia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox