Re: [PATCH v2 4/4] drm/nouveau: gpu lockup recovery


From: Martin Peres <martin.peres-GANU6spQydw@public.gmane.org>
To: bskeggs-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH v2 4/4] drm/nouveau: gpu lockup recovery
Date: Wed, 02 May 2012 15:53:39 +0200
Message-ID: <4FA13C63.4090806@free.fr>
In-Reply-To: <1335966493.1898.15.camel@nisroch>

On 02/05/2012 15:48, Ben Skeggs wrote:
> On Wed, 2012-05-02 at 15:33 +0200, Martin Peres wrote:
>> On 02/05/2012 13:28, Ben Skeggs wrote:
>>> Right, again, I don't disagree :)  I think we can improve a lot on the
>>> big-hammer-suspend-the-gpu solution though, and instead reset only the
>>> faulting engine.  It's (in theory) almost possible for us to do now, but
>>> I have a couple of reworks to areas related to this pending (basically,
>>> making the various driver subsystems more independent), which should be
>>> ready soon.  This'll go a long way to making it very easy to reset a
>>> single engine, and likely result in *far* faster recovery from hangs.
>> Hey,
>>
>> What about kicking a channel that put the card in a bad state? Wouldn't
>> that be possible?
>>
>> This way, we don't lose the context of other channels and only the
>> application that hung the card will be exited.
> That's pretty much the idea.  The trouble comes in where PFIFO will hang
> waiting for the stuck engine to report that it's done (eg. it will wait
> for PGRAPH to go "i've finished unloading my context now" after it's
> told PGRAPH to do so).
>
> Hence why it's important to be able to (preferably) un-stick the stuck
> engine (usually handling the appropriate interrupts properly will
> achieve this), and failing that, reset it and lose the context for just
> that channel.
>
> The work I'm doing at the moment will, among other nice things, make
> handling all of this a lot nicer.  And it should be nice and speedy in
> comparison to the suspend/resume option, we won't have to evict all
> buffers from vram without accel, which can take quite a while (not to
> mention that it might not even be possible to get to the VRAM not mapped
> into the FB BAR on earlier chipsets if accel dies).
I get it, that seems nice and good.
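
To check my understanding, the escalation described above (first try to
un-stick the engine by handling its interrupts, and only if that fails
reset that one engine and lose the context of just the faulting channel)
could be sketched roughly as below. This is only a toy model under my own
assumptions, not nouveau code: the struct layouts, the hook names, and the
stub hooks are all invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not nouveau code: every name here is invented. */
enum recovery_result {
    RECOVERED_BY_IRQ,    /* engine un-stuck, no context lost */
    RECOVERED_BY_RESET,  /* engine reset, one channel's context lost */
    RECOVERY_FAILED      /* caller must fall back to a bigger hammer */
};

struct channel {
    int id;
    bool alive;          /* false once the channel has been kicked */
};

struct engine {
    const char *name;
    bool stuck;
    /* Hooks supplied by the caller; each returns true on success. */
    bool (*handle_pending_irqs)(struct engine *eng);
    bool (*reset)(struct engine *eng);
};

/* Stub hooks so the sketch can be exercised without hardware. */
bool stub_hook_ok(struct engine *eng)   { (void)eng; return true; }
bool stub_hook_fail(struct engine *eng) { (void)eng; return false; }

enum recovery_result recover_engine(struct engine *eng,
                                    struct channel *faulting)
{
    assert(eng != NULL && faulting != NULL);

    /* Step 1: servicing the pending interrupts may un-stick the engine. */
    if (eng->handle_pending_irqs && eng->handle_pending_irqs(eng)) {
        eng->stuck = false;
        return RECOVERED_BY_IRQ;
    }

    /* Step 2: reset only this engine and kick only the faulting channel. */
    if (eng->reset && eng->reset(eng)) {
        eng->stuck = false;
        faulting->alive = false;
        return RECOVERED_BY_RESET;
    }

    /* Neither worked: leave all state untouched for the fallback path. */
    return RECOVERY_FAILED;
}
```

The point of returning three distinct results is that only the last one
would need the suspend/resume path; the other channels keep their context
in the first two cases.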
>
>> I wonder how pfifo handles commands sent to a non-existing channel, but
>> I'm sure it shouldn't hang or anything.
> It can't happen anyway, if we destroyed the fifo context for a channel
> we wouldn't be telling it to execute commands still :)
Right, but there may still be some commands left in the IB ring buffer, 
right?
>
>> Anyway, if this is not possible to only kick one channel, then what
>> about kicking all channels, rePOSTING the card and using KMS to output
>> the lockup report (and send a notification of the report through udev
>> and store the report in a sysfs file)?
>>
>> Let's not try to be perfect, let us just be able to do better bug reports.
> I'm still skeptical about how useful any kind of generic "lockup report"
> can possibly be, beyond kernel logs..  However, as part of the work I'm
> working on, there may be some additional information available via
> debugfs..  I don't want to elaborate on this too much yet until I wrap
> my head around what exactly I want to achieve, but I'll give you a
> heads-up once I do :)
Well, a good report is important so that we can get an idea of what went
wrong. It would also allow us to differentiate bug reports.

Basically, I'm now convinced that the nvaX random lockup is not actually
a single issue.
Having such an enhanced bug report could allow us to verify this theory.
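
As a rough illustration of what I mean by differentiating reports: if each
report carried a few machine-readable fields, we could count how many
distinct failure signatures we actually see. The sketch below is purely
hypothetical; the field names and the choice of what makes up a
"signature" are my assumptions, not an existing nouveau format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical report layout: none of these fields exist in nouveau. */
struct lockup_report {
    uint32_t chipset;        /* assumed encoding, e.g. 0xa3 for nva3 */
    uint32_t engine;         /* made-up index of the engine that hung */
    uint32_t engine_status;  /* made-up status word captured at hang time */
};

/* Two reports share a signature when every captured field matches. */
static int same_signature(const struct lockup_report *a,
                          const struct lockup_report *b)
{
    return a->chipset == b->chipset &&
           a->engine == b->engine &&
           a->engine_status == b->engine_status;
}

/* Count distinct signatures; O(n^2), which is fine for a handful of reports. */
size_t count_distinct_lockups(const struct lockup_report *reports, size_t n)
{
    size_t distinct = 0;

    assert(reports != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        size_t j;

        /* Look for an earlier report with the same signature. */
        for (j = 0; j < i; j++) {
            if (same_signature(&reports[i], &reports[j]))
                break;
        }
        if (j == i)          /* none found: this is a new signature */
            distinct++;
    }
    return distinct;
}
```

If a pile of "random nvaX lockup" reports came back with more than one
distinct signature, that would support the theory that it is several bugs.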

PS: Speaking of nvaX lockups: I still get lockups (nva3/5), and I suspect
that the problem comes from the context-switching microcode. Not losing
the email I'm writing simply because kwin's channel crashed would be a
big win for me.

Martin

Thread overview (12+ messages):
2012-04-25 21:20 [PATCH v2 4/4] drm/nouveau: gpu lockup recovery Marcin Slusarz
2012-04-25 21:32 ` Marcin Slusarz
     [not found] ` <1335388836-13127-4-git-send-email-marcin.slusarz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2012-04-26  7:32   ` Ben Skeggs
2012-04-28 14:49     ` Marcin Slusarz
     [not found]       ` <20120428144956.GA10116-OI9uyE9O0yo@public.gmane.org>
2012-05-02 11:28         ` Ben Skeggs
2012-05-02 13:33           ` Martin Peres
     [not found]             ` <4FA137C4.3000900-GANU6spQydw@public.gmane.org>
2012-05-02 13:48               ` Ben Skeggs
2012-05-02 13:53                 ` Martin Peres [this message]
2012-04-28 14:56   ` Marcin Slusarz
     [not found]     ` <20120428145615.GB10116-OI9uyE9O0yo@public.gmane.org>
2012-04-30  9:47       ` Martin Peres
2012-05-27 19:52   ` Marcin Slusarz
2012-08-05 21:15   ` Marcin Slusarz
