From: Ben Skeggs <bskeggs-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Marcin Slusarz <marcin.slusarz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH v2 4/4] drm/nouveau: gpu lockup recovery
Date: Wed, 02 May 2012 21:28:57 +1000
Message-ID: <1335958137.1898.6.camel@nisroch>
In-Reply-To: <20120428144956.GA10116-OI9uyE9O0yo@public.gmane.org>

On Sat, 2012-04-28 at 16:49 +0200, Marcin Slusarz wrote:
> On Thu, Apr 26, 2012 at 05:32:29PM +1000, Ben Skeggs wrote:
> > On Wed, 2012-04-25 at 23:20 +0200, Marcin Slusarz wrote:
> > > Overall idea:
> > > Detect lockups by watching for timeouts (vm flush / fence), return -EIO,
> > > handle it at the ioctl level, reset the GPU and repeat the last ioctl.
> > > 
> > > GPU reset is done by doing a suspend / resume cycle with a few tweaks:
> > > - CPU-only bo eviction
> > > - ignoring vm flush / fence timeouts
> > > - shortening waits
> > Okay.  I've thought about this a bit for a couple of days and think I'll
> > be able to coherently share my thoughts on this issue now :)
> > 
> > Firstly, while I agree that we need to become more resilient to errors,
> > I don't think that following in the radeon/intel footsteps with
> > something (imo, hackish) like this is necessarily the right choice
> > for us.
> 
> This is not only the radeon/intel way. Windows, since Vista SP1, does the
> same - see http://msdn.microsoft.com/en-us/windows/hardware/gg487368.
> It's funny how similar it is to this patch (I hadn't seen this page before).
Yes, I am aware of this feature in Windows.  And I'm not arguing that
something like it isn't necessary.

> 
> If you fear people will stop reporting bugs - don't. GPU reset is painfully
> slow and can take up to 50 seconds (BO eviction is the most time consuming
> part), so people will be annoyed enough to report them.
> Currently, GPU lockups make users so angry that they frequently switch to
> the blob without even thinking about reporting anything.
I'm not so concerned about the lost bug reports, I expect the same
people that are actually willing to report bugs now will continue to do
so :)

> 
> > The *vast* majority of "lockups" we have are as a result of us badly
> > mishandling exceptions reported to us by the GPU.  There are a couple of
> > exceptions; however, they're very rare.
> 
> > A very common example is where people get DMA_PUSHER errors for whatever
> > reason, and things go haywire eventually.
> 
> Nope, I had tens of lockups during testing, and only once did I get a
> DMA_PUSHER before the GPU lockup was detected.
Out of curiosity, what were the lockup situations you were triggering
exactly?

> 
> > To handle a DMA_PUSHER
> > sanely, generally you have to drop all pending commands for the channel
> > (set GET=PUT, etc) and continue on.  However, this leaves us with fences
> > and semaphores unsignalled etc, causing issues further up the stack with
> > perfectly good channels hanging on attempting to sync with the crashed
> > channel etc.
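
The DMA_PUSHER recovery described above, and the orphaned fences it leaves
behind, can be illustrated with a small toy model. This is only a sketch in
Python, not nouveau code: the Channel class, its fields, and fence_wait() are
hypothetical names invented for illustration, and the bounded wait that gives
up with -EIO is an assumption about how a waiter could break out rather than
hang.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Toy model of a command channel; not nouveau's real data structures."""
    ring: list = field(default_factory=list)  # queued (kind, argument) entries
    get: int = 0        # index of the next entry the GPU will fetch
    signalled: int = 0  # highest fence sequence the GPU has completed

    @property
    def put(self) -> int:
        # Index one past the last entry the CPU has submitted.
        return len(self.ring)

    def emit_cmd(self, name: str) -> None:
        self.ring.append(("cmd", name))

    def emit_fence(self, seq: int) -> None:
        self.ring.append(("fence", seq))

    def run(self, steps: int) -> None:
        # Execute up to `steps` queued entries, signalling fences on the way.
        for _ in range(steps):
            if self.get == self.put:
                return
            kind, arg = self.ring[self.get]
            if kind == "fence":
                self.signalled = arg
            self.get += 1

    def drop_pending(self) -> None:
        # Recovery as described above: skip everything queued (GET = PUT).
        # Fences among the skipped entries are never signalled.
        self.get = self.put


def fence_wait(chan: Channel, seq: int, max_polls: int) -> int:
    # Bounded wait (assumption): 0 once the fence signals, -EIO if it never does.
    for _ in range(max_polls):
        if chan.signalled >= seq:
            return 0
        chan.run(1)  # model the GPU making progress between polls
    return 0 if chan.signalled >= seq else -errno.EIO
```

In this model, queueing a command, fence 1, a faulting command and fence 2,
running the first two entries and then calling drop_pending() leaves fence 1
signalled and fence 2 permanently unsignalled. An unbounded wait on fence 2
would hang forever, which is the "perfectly good channels hanging" problem;
the bounded fence_wait() reports -EIO instead.
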
> > 
> > The next most common example I can think of is nv4x hardware, getting a
> > LIMIT_COLOR/ZETA exception from PGRAPH, and then a hang.  The solution
> > is simple, learn how to handle the exception, log it, and PGRAPH
> > survives.
> > 
> > I strongly believe that if we focused our efforts on dealing with what
> > the GPU reports to us a lot better, we'll find we really don't need such
> > "lockup recovery".
> 
> While I agree we need to improve on error handling to make "lockup recovery"
> not needed, the reality is we can't predict everything and the driver needs
> to cope with its own bugs.
Right, again, I don't disagree :)  I think we can improve a lot on the
big-hammer-suspend-the-gpu solution though, and instead reset only the
faulting engine.  It's (in theory) almost possible for us to do now, but
I have a couple of reworks to areas related to this pending (basically,
making the various driver subsystems more independent), which should be
ready soon.  This'll go a long way to making it very easy to reset a
single engine, and likely result in *far* faster recovery from hangs.
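
As a rough illustration of the difference between the two approaches, here is
a toy model in Python of the flow discussed in this thread: a timeout surfaces
as -EIO, the ioctl level catches it, only the faulting engine is reset, and
the request is repeated. It is a sketch under stated assumptions, not driver
code: Engine, ioctl_with_recovery() and the "PCOPY" label used for a second
engine are hypothetical names chosen for the example.

```python
import errno


class Engine:
    """Toy stand-in for one GPU engine (e.g. PGRAPH); hypothetical, not driver code."""

    def __init__(self, name: str):
        self.name = name
        self.hung = False
        self.resets = 0

    def submit(self) -> int:
        # A hung engine never completes, which the caller sees as a timeout (-EIO).
        return -errno.EIO if self.hung else 0

    def reset(self) -> None:
        # Reset this engine only; other engines are left untouched.
        self.hung = False
        self.resets += 1


def ioctl_with_recovery(engines: dict, target: str, max_retries: int = 1) -> int:
    # Run the request; on -EIO reset only the faulting engine and repeat it.
    engine = engines[target]
    ret = engine.submit()
    for _ in range(max_retries):
        if ret != -errno.EIO:
            break
        engine.reset()
        ret = engine.submit()
    return ret
```

With engines = {"PGRAPH": Engine("PGRAPH"), "PCOPY": Engine("PCOPY")} and
PGRAPH marked as hung, ioctl_with_recovery(engines, "PGRAPH") resets PGRAPH
once and then succeeds, while PCOPY is never reset. A full suspend / resume
cycle would instead touch every engine, and evict buffers, for a single fault.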

> 
> > I am, however, considering pulling the vm flush timeout error
> > propagation and the break-out-of-waits-on-signals that builds on it, as
> > we really do need to become better at having killable processes if
> > things go wrong :)
> 
> Good :)
> 
> Marcin
> _______________________________________________
> Nouveau mailing list
> Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> http://lists.freedesktop.org/mailman/listinfo/nouveau
