From: Lucas Stach <l.stach@pengutronix.de>
To: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: kernel@pengutronix.de, etnaviv@lists.freedesktop.org,
dri-devel@lists.freedesktop.org, patchwork-lst@pengutronix.de
Subject: Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
Date: Tue, 03 Jul 2018 11:54:43 +0200 [thread overview]
Message-ID: <1530611683.2242.1.camel@pengutronix.de> (raw)
In-Reply-To: <20180703094412.GP17271@n2100.armlinux.org.uk>
Am Dienstag, den 03.07.2018, 10:44 +0100 schrieb Russell King - ARM Linux:
> On Wed, Jun 27, 2018 at 04:34:27PM +0200, Lucas Stach wrote:
> > When the hangcheck handler was replaced by the DRM scheduler timeout
> > handling we dropped the forward progress check, as this might allow
> > clients to hog the GPU for a long time with a big job.
> >
> > It turns out that even reasonably well behaved clients like the
> > Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> > back the forward progress check to get rid of the userspace regression.
> >
> > We would still like to fix userspace to submit smaller batches
> > if possible, but that is for another day.
> >
> > Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
> > > > Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> > ---
> > drivers/gpu/drm/etnaviv/etnaviv_gpu.h | 3 +++
> > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 24 ++++++++++++++++++++++++
> > 2 files changed, 27 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> > index dd430f0f8ff5..90f17ff7888e 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> > @@ -131,6 +131,9 @@ struct etnaviv_gpu {
> > > > struct work_struct sync_point_work;
> > > > int sync_point_event;
> >
> > > > + /* hang detection */
> > > > + u32 hangcheck_dma_addr;
> > +
> > > > void __iomem *mmio;
> > > > int irq;
> >
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > index a74eb57af15b..50d6b88cb7aa 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > @@ -10,6 +10,7 @@
> > #include "etnaviv_gem.h"
> > #include "etnaviv_gpu.h"
> > #include "etnaviv_sched.h"
> > +#include "state.xml.h"
> >
> > static int etnaviv_job_hang_limit = 0;
> > module_param_named(job_hang_limit, etnaviv_job_hang_limit, int , 0444);
> > @@ -85,6 +86,29 @@ static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
> > {
> > > > struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
> > > > struct etnaviv_gpu *gpu = submit->gpu;
> > > > + u32 dma_addr;
> > > > + int change;
> > +
> > > > + /*
> > > > + * If the GPU managed to complete this jobs fence, the timout is
> > > > + * spurious. Bail out.
> > > > + */
> > > > + if (fence_completed(gpu, submit->out_fence->seqno))
> > > > + return;
> > +
> > > > + /*
> > > > + * If the GPU is still making forward progress on the front-end (which
> > > > + * should never loop) we shift out the timeout to give it a chance to
> > > > + * finish the job.
> > > > + */
> > > > + dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
> > > > + change = dma_addr - gpu->hangcheck_dma_addr;
> > > > + if (change < 0 || change > 16) {
> > > > + gpu->hangcheck_dma_addr = dma_addr;
> > > > + schedule_delayed_work(&sched_job->work_tdr,
> > > > + sched_job->sched->timeout);
> > > > + return;
> > + }
>
> Doesn't this patch, by ignoring the timeout, have the effect of completely
> disabling the job timeout after its first instance of firing? From my
> reading of the gpu scheduler code, this seems to be the case.
The schedule_delayed_work() in the code above is what rearms the
timeout in case we detect progress, so it should fire again.
> work_tdr is only armed when:
> - a job completes and there is another job waiting (we're waiting for it...)
> - a job is started, and is the first job on the ring (which won't happen)
> - drm_sched_job_recovery() is called (which it isn't)
>
> So, what would appear to happen is we timeout the job, but detect we
> have progress, so the timeout is ignored. The timeout is not re-armed,
> so if there was some progress and /then/ the GPU gets stuck, there is
> nothing to re-check for progress or timeout.
>
> I suspect that will completely defeat the timeout mechanism, since the
> first (and only) timeout will always update the progress check and cause
> the timeout to be ignored.
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
next prev parent reply other threads:[~2018-07-03 9:54 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-06-27 14:34 [PATCH] drm/etnaviv: bring back progress check in job timeout handler Lucas Stach
2018-06-27 16:15 ` Fabio Estevam
2018-06-27 17:25 ` Eric Anholt
2018-06-28 8:40 ` Lucas Stach
2018-06-28 17:35 ` Eric Anholt
2018-07-03 9:26 ` Lucas Stach
2018-07-03 9:44 ` Russell King - ARM Linux
2018-07-03 9:54 ` Lucas Stach [this message]
2018-07-11 20:14 ` Russell King - ARM Linux
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1530611683.2242.1.camel@pengutronix.de \
--to=l.stach@pengutronix.de \
--cc=dri-devel@lists.freedesktop.org \
--cc=etnaviv@lists.freedesktop.org \
--cc=kernel@pengutronix.de \
--cc=linux@armlinux.org.uk \
--cc=patchwork-lst@pengutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).