Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed

From: Daniel Vetter <daniel@ffwll.ch>
To: Ben Widawsky <ben@bwidawsk.net>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
Date: Mon, 2 Jul 2012 18:47:48 +0200	[thread overview]
Message-ID: <20120702164748.GC5064@phenom.ffwll.local> (raw)
In-Reply-To: <20120702090448.119b02cc@bwidawsk.net>

On Mon, Jul 02, 2012 at 09:04:48AM -0700, Ben Widawsky wrote:
> On Sun, 1 Jul 2012 12:41:19 +0200
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben@bwidawsk.net> wrote:
> > > On Tue, 26 Jun 2012 23:08:50 +0200
> > > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > >> Having this early check in intel_ring_begin doesn't buy us anything,
> > >> since we'll be calling into wait_request in the usual case already
> > >> anyway. In the corner case of not waiting for free space using the
> > >> last_retired_head we simply need to do the same check, too.
> > >>
> > >> With these changes we'll only ever get an -EIO from intel_ring_begin
> > >> if the gpu has truely been declared dead.
> > >>
> > >> v2: Don't conflate the change to prevent intel_ring_begin from returning
> > >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> > >> just prep work to avoid returning -EAGAIN in callsites that can't handle
> > >> syscall restarting.
> > >
> > > I'm really scared by this change. It's diffuclt to review because there
> > > are SO many callers of intel_ring_begin, and figuring out if they all
> > > work in the wedged case is simply too difficult. I've yet to review the
> > > rest of the series, but it may make more sense to put this change last
> > > perhaps?
> > 
> > Well, this patch doesn't really affect much what error codes the
> > callers get - we'll still throw both -EGAIN and -EIO at them (later on
> > patches will fix this).
> > 
> > What this patch does though is prevent us from returning too many
> > -EIO. Imagine the gpu died and we've noticed already (hence
> > dev_priv->mm.wedged is set), but some process is stuck churning
> > through the execbuf ioctl, holding dev->struct_mutex. While processing
> > the execbuf we call intel_ring_begin to emit a few commands. Now
> > usually, even when the gpu is dead, there is enough space in the ring
> > to do so, which allows us to complete the execbuf ioctl and then later
> > on we can block properly when trying to grab the mutex in the next
> > ioctl for the gpu reset work handler to complete.
> 
> That in itself is a pretty big change, don't you think? It seems rather
> strange and dangerous to modify HW (which we will if we allow execbuf to
> continue when we write the tail pointer). I think we want some way to
> not write the tail ptr in such a case. Now you might respond, well, who
> cares about writes? But this imposes a pretty large restriction on any
> code that can't work well after the GPU is hung.
> 
> I see the irony. I'm complaining that you can make GPU hangs unstable,
> and the patch series is fixing GPU reset. Call it paranoia, it still
> seems unsafe to me, and makes us have to think a bit more whenever
> adding any code. Am I over-thinking this?

I think so. It takes us a few seconds before we notice that the gpu died -
only when the hangcheck timer expired at least twice we'll declare the hw
wedged. Imo a few seconds is plenty of time to sneak in a few ringbuffer
emissions (and hence tail pointer writes) ;-)

So I think trying to avoid touching the gpu for another few usecs _is_
overthinking the problem, because fundamentally we can't avoid touching a
dead gpu - we will only ever notice its demise after the fact.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

next prev parent reply	other threads:[~2012-07-02 16:47 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-26 21:08 [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
2012-06-26 21:08 ` [PATCH 2/4] drm/i915: don't trylock in the gpu reset code Daniel Vetter
2012-06-26 22:08   ` Łukasz Kuryło
2012-06-26 22:34     ` Daniel Vetter
2012-06-26 21:08 ` [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter
2012-06-27 15:19   ` Ben Widawsky
2012-06-27 16:36     ` Daniel Vetter
2012-06-26 21:08 ` [PATCH 4/4] drm/i915: don't hange userspace when the gpu reset is stuck Daniel Vetter
2012-07-01  3:09 ` [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Ben Widawsky
2012-07-01 10:41   ` Daniel Vetter
2012-07-02 16:04     ` Ben Widawsky
2012-07-02 16:47       ` Daniel Vetter [this message]
2012-07-03 15:59 ` Chris Wilson
2012-07-03 18:11   ` Daniel Vetter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120702164748.GC5064@phenom.ffwll.local \
    --to=daniel@ffwll.ch \
    --cc=ben@bwidawsk.net \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox