* [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
@ 2012-06-26 21:08 Daniel Vetter
2012-06-26 21:08 ` [PATCH 2/4] drm/i915: don't trylock in the gpu reset code Daniel Vetter
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Daniel Vetter @ 2012-06-26 21:08 UTC (permalink / raw)
To: Intel Graphics Development; +Cc: Daniel Vetter
Having this early check in intel_ring_begin doesn't buy us anything,
since we'll be calling into wait_request in the usual case already
anyway. In the corner case of not waiting for free space using the
last_retired_head we simply need to do the same check, too.
With these changes we'll only ever get an -EIO from intel_ring_begin
if the gpu has truly been declared dead.
v2: Don't conflate the change to prevent intel_ring_begin from returning
a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
just prep work to avoid returning -EAGAIN in callsites that can't handle
syscall restarting.
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
drivers/gpu/drm/i915/intel_ringbuffer.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index f30a53a..501546e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
int intel_ring_begin(struct intel_ring_buffer *ring,
int num_dwords)
{
- struct drm_i915_private *dev_priv = ring->dev->dev_private;
int n = 4*num_dwords;
int ret;
- if (unlikely(atomic_read(&dev_priv->mm.wedged)))
- return -EIO;
-
if (unlikely(ring->tail + n > ring->effective_size)) {
ret = intel_wrap_ring_buffer(ring);
if (unlikely(ret))
--
1.7.10
^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] drm/i915: don't trylock in the gpu reset code
  2012-06-26 21:08 [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
@ 2012-06-26 21:08 ` Daniel Vetter
  2012-06-26 22:08   ` Łukasz Kuryło
  2012-06-26 21:08 ` [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Daniel Vetter @ 2012-06-26 21:08 UTC (permalink / raw)
To: Intel Graphics Development; +Cc: Daniel Vetter

Simply failing to reset the gpu because someone else might still hold
the mutex isn't a great idea - I see reliable silent reset failures.
And gpu reset simply needs to be reliable and Just Work.

"But ... the deadlocks!"

We already kick all processes waiting for the gpu before launching
the reset work item. New waiters need to check the wedging state
anyway and then bail out. If we have places that can deadlock, we
simply need to fix them.

"But ... testing!"

We have the gpu hangman, and if the current gpu load, gem_exec_nop,
isn't good enough to hit a specific case, we can add a new one.

"But ... don't we return -EAGAIN for non-interruptible calls to
wait_seqno now?"

Yep, but this problem already exists in the current code. A follow-up
patch will remedy this by returning -EIO for non-interruptible sleeps
if the gpu died and the low-level wait bails out with -EAGAIN.
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 79be879..0e114f7 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -866,8 +866,7 @@ int i915_reset(struct drm_device *dev)
 if (!i915_try_reset)
 return 0;

- if (!mutex_trylock(&dev->struct_mutex))
- return -EBUSY;
+ mutex_lock(&dev->struct_mutex);

 i915_gem_reset(dev);
--
1.7.10

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] drm/i915: don't trylock in the gpu reset code
  2012-06-26 21:08 ` [PATCH 2/4] drm/i915: don't trylock in the gpu reset code Daniel Vetter
@ 2012-06-26 22:08   ` Łukasz Kuryło
  2012-06-26 22:34     ` Daniel Vetter
  0 siblings, 1 reply; 14+ messages in thread
From: Łukasz Kuryło @ 2012-06-26 22:08 UTC (permalink / raw)
To: Daniel Vetter; +Cc: Intel Graphics Development

>
> - if (!mutex_trylock(&dev->struct_mutex))
> - return -EBUSY;
> + mutex_lock(&dev->struct_mutex);
>
> i915_gem_reset(dev);
>

But ... the original code:

Correct me if I'm wrong. In every manual I've found, mutex_trylock(...)
returns 0 on success. So

if (!mutex_trylock(&dev->struct_mutex))
    return -EBUSY;

would actually execute when the mutex was acquired as requested. We then
return without releasing it and possibly end up with a deadlock. On the
other hand, if the mutex was already locked, by another or maybe even the
same thread, mutex_trylock returns an error (some nonzero value); nonzero
means true ... if (!true) -> if (false) means do not execute the "if"
code. In this case, in spite of not getting the mutex, we would proceed.

You've written: "Simply failing to reset the gpu because someone else
might still hold the mutex isn't a great idea - I see reliable silent
reset failures. And gpu reset simply needs to be reliable and Just Work."

I believe that was not the case with the original code. From my
understanding this procedure failed to reset the gpu when the mutex was
unlocked (locking it and bailing out) and tried to reset it if the mutex
could not be acquired.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] drm/i915: don't trylock in the gpu reset code
  2012-06-26 22:08   ` Łukasz Kuryło
@ 2012-06-26 22:34     ` Daniel Vetter
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Vetter @ 2012-06-26 22:34 UTC (permalink / raw)
To: Łukasz Kuryło; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Jun 27, 2012 at 12:08:33AM +0200, Łukasz Kuryło wrote:
> >
> > - if (!mutex_trylock(&dev->struct_mutex))
> > - return -EBUSY;
> > + mutex_lock(&dev->struct_mutex);
> >
> > i915_gem_reset(dev);
> >
>
> But ... the original code:
>
> Correct me if I'm wrong.
> In every manual I've found mutex_trylock(...) returns 0 on success.
> So

Quoting the kernel doc for mutex_trylock:

"Try to acquire the mutex atomically. Returns 1 if the mutex has been
acquired successfully, and 0 on contention."

Also, it's open-source, so you could double-check the implementation ...
-Daniel
--
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN
  2012-06-26 21:08 [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
  2012-06-26 21:08 ` [PATCH 2/4] drm/i915: don't trylock in the gpu reset code Daniel Vetter
@ 2012-06-26 21:08 ` Daniel Vetter
  2012-06-27 15:19   ` Ben Widawsky
  2012-06-26 21:08 ` [PATCH 4/4] drm/i915: don't hange userspace when the gpu reset is stuck Daniel Vetter
  ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Daniel Vetter @ 2012-06-26 21:08 UTC (permalink / raw)
To: Intel Graphics Development; +Cc: Daniel Vetter

So don't return -EAGAIN, even in the case of a gpu hang. Remap it to
-EIO instead. This is a bit ugly because intel_ring_begin is all
non-interruptible and hence only returns -EIO. But as the comment in
there says, auditing all the callsites would be a pain.

To avoid duplicating code, reuse i915_gem_check_wedge in __wait_seqno
and intel_wait_ring_buffer. Also use the opportunity to clarify the
different cases in i915_gem_check_wedge a bit with comments.

v2: Don't access dev_priv->mm.interruptible from check_wedge - we
might not hold dev->struct_mutex, making this racy. Instead pass
interruptible in as a parameter. I've noticed this because I've hit a
BUG_ON(!mutex_is_locked) at the top of check_wedge. This has been
added in

commit b4aca0106c466b5a0329318203f65bac2d91b682
Author: Ben Widawsky <ben@bwidawsk.net>
Date: Wed Apr 25 20:50:12 2012 -0700

    drm/i915: extract some common olr+wedge code

although that commit is missing any justification for this. I guess
it's just copy&paste, because the same commit adds the same BUG_ON
check to check_olr, where it indeed makes sense.

But in check_wedge everything we access is protected by other means,
so this is superfluous. And because it now gets in the way (we add a
new caller in __wait_seqno, which can be called without
dev->struct_mutex) let's just remove it.
v3: Group all the i915_gem_check_wedge refactoring into this patch, so that this patch here is all about not returning -EAGAIN to callsites that can't handle syscall restarting. Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch> --- drivers/gpu/drm/i915/i915_drv.h | 2 ++ drivers/gpu/drm/i915/i915_gem.c | 26 ++++++++++++++++++-------- drivers/gpu/drm/i915/intel_ringbuffer.c | 6 ++++-- 3 files changed, 24 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index a0c15ab..ab9ade0 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1330,6 +1330,8 @@ i915_gem_object_unpin_fence(struct drm_i915_gem_object *obj) void i915_gem_retire_requests(struct drm_device *dev); void i915_gem_retire_requests_ring(struct intel_ring_buffer *ring); +int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv, + bool interruptible); void i915_gem_reset(struct drm_device *dev); void i915_gem_clflush_object(struct drm_i915_gem_object *obj); diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 6a98c06..af6a510 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -1863,11 +1863,10 @@ i915_gem_retire_work_handler(struct work_struct *work) mutex_unlock(&dev->struct_mutex); } -static int -i915_gem_check_wedge(struct drm_i915_private *dev_priv) +int +i915_gem_check_wedge(struct drm_i915_private *dev_priv, + bool interruptible) { - BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex)); - if (atomic_read(&dev_priv->mm.wedged)) { struct completion *x = &dev_priv->error_completion; bool recovery_complete; @@ -1878,7 +1877,16 @@ i915_gem_check_wedge(struct drm_i915_private *dev_priv) recovery_complete = x->done > 0; spin_unlock_irqrestore(&x->wait.lock, flags); - return recovery_complete ? -EIO : -EAGAIN; + /* Non-interruptible callers can't handle -EAGAIN, hence return + * -EIO unconditionally for these. 
*/ + if (!interruptible) + return -EIO; + + /* Recovery complete, but still wedged means reset failure. */ + if (recovery_complete) + return -EIO; + + return -EAGAIN; } return 0; @@ -1932,6 +1940,7 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno, unsigned long timeout_jiffies; long end; bool wait_forever = true; + int ret; if (i915_seqno_passed(ring->get_seqno(ring), seqno)) return 0; @@ -1963,8 +1972,9 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno, end = wait_event_timeout(ring->irq_queue, EXIT_COND, timeout_jiffies); - if (atomic_read(&dev_priv->mm.wedged)) - end = -EAGAIN; + ret = i915_gem_check_wedge(dev_priv, interruptible); + if (ret) + end = ret; } while (end == 0 && wait_forever); getrawmonotonic(&now); @@ -2004,7 +2014,7 @@ i915_wait_seqno(struct intel_ring_buffer *ring, uint32_t seqno) BUG_ON(seqno == 0); - ret = i915_gem_check_wedge(dev_priv); + ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible); if (ret) return ret; diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c index 501546e..6c024d4 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c @@ -1220,8 +1220,10 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n) } msleep(1); - if (atomic_read(&dev_priv->mm.wedged)) - return -EAGAIN; + + ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible); + if (ret) + return ret; } while (!time_after(jiffies, end)); trace_i915_ring_wait_end(ring); return -EBUSY; -- 1.7.10 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN 2012-06-26 21:08 ` [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter @ 2012-06-27 15:19 ` Ben Widawsky 2012-06-27 16:36 ` Daniel Vetter 0 siblings, 1 reply; 14+ messages in thread From: Ben Widawsky @ 2012-06-27 15:19 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel Graphics Development On Tue, 26 Jun 2012 23:08:52 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > So don't return -EAGAIN, even in the case of a gpu hang. Remap it to -EIO > instead. What I'd really like to see in this rather long commit message is what exactly happens in this case that's being fixed (maybe I should know, but I don't). > > This is a bit ugly because intel_ring_begin is all non-interruptible > and hence only returns -EIO. But as the comment in there says, > auditing all the callsites would be a pain. > > To avoid duplicating code, reuse i915_gem_check_wedge in __wait_seqno > and intel_wait_ring_buffer. Also use the opportunity to clarify the > different cases in i915_gem_check_wedge a bit with comments. > > v2: Don't access dev_priv->mm.interruptible from check_wedge - we > might not hold dev->struct_mutex, making this racy. Instead pass > interruptible in as a parameter. I've noticed this because I've hit a > BUG_ON(!mutex_is_locked) at the top of check_wedge. This has been > added in > > commit b4aca0106c466b5a0329318203f65bac2d91b682 > Author: Ben Widawsky <ben@bwidawsk.net> > Date: Wed Apr 25 20:50:12 2012 -0700 > > drm/i915: extract some common olr+wedge code > > although that commit is missing any justification for this it. I guess > it's just copy&paste, because the same commit add the same BUG_ON > check to check_olr, where it indeed makes sense. > > But in check_wedge everything we access is protected by other means, > so this is superflous. 
And because it now gets in the way (we add a > new caller in __wait_seqno, which can be called without > dev->struct_mutext) let's just remove it. > > v3: Group all the i915_gem_check_wedge refactoring into this patch, so > that this patch here is all about not returning -EAGAIN to callsites > that can't handle syscall restarting. > > Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Ben Widawsky <ben@bwidawsk.net> > --- > drivers/gpu/drm/i915/i915_drv.h | 2 ++ > drivers/gpu/drm/i915/i915_gem.c | 26 ++++++++++++++++++-------- > drivers/gpu/drm/i915/intel_ringbuffer.c | 6 ++++-- > 3 files changed, 24 insertions(+), 10 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > index a0c15ab..ab9ade0 100644 > --- a/drivers/gpu/drm/i915/i915_drv.h > +++ b/drivers/gpu/drm/i915/i915_drv.h > @@ -1330,6 +1330,8 @@ i915_gem_object_unpin_fence(struct drm_i915_gem_object *obj) > > void i915_gem_retire_requests(struct drm_device *dev); > void i915_gem_retire_requests_ring(struct intel_ring_buffer *ring); > +int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv, > + bool interruptible); > > void i915_gem_reset(struct drm_device *dev); > void i915_gem_clflush_object(struct drm_i915_gem_object *obj); > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 6a98c06..af6a510 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -1863,11 +1863,10 @@ i915_gem_retire_work_handler(struct work_struct *work) > mutex_unlock(&dev->struct_mutex); > } > > -static int > -i915_gem_check_wedge(struct drm_i915_private *dev_priv) > +int > +i915_gem_check_wedge(struct drm_i915_private *dev_priv, > + bool interruptible) > { > - BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex)); > - > if (atomic_read(&dev_priv->mm.wedged)) { > struct completion *x = &dev_priv->error_completion; > bool recovery_complete; > @@ -1878,7 +1877,16 @@ 
i915_gem_check_wedge(struct drm_i915_private *dev_priv) > recovery_complete = x->done > 0; > spin_unlock_irqrestore(&x->wait.lock, flags); > > - return recovery_complete ? -EIO : -EAGAIN; > + /* Non-interruptible callers can't handle -EAGAIN, hence return > + * -EIO unconditionally for these. */ > + if (!interruptible) > + return -EIO; > + > + /* Recovery complete, but still wedged means reset failure. */ > + if (recovery_complete) > + return -EIO; > + > + return -EAGAIN; > } > > return 0; > @@ -1932,6 +1940,7 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno, > unsigned long timeout_jiffies; > long end; > bool wait_forever = true; > + int ret; > > if (i915_seqno_passed(ring->get_seqno(ring), seqno)) > return 0; > @@ -1963,8 +1972,9 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno, > end = wait_event_timeout(ring->irq_queue, EXIT_COND, > timeout_jiffies); > > - if (atomic_read(&dev_priv->mm.wedged)) > - end = -EAGAIN; > + ret = i915_gem_check_wedge(dev_priv, interruptible); > + if (ret) > + end = ret; > } while (end == 0 && wait_forever); > > getrawmonotonic(&now); > @@ -2004,7 +2014,7 @@ i915_wait_seqno(struct intel_ring_buffer *ring, uint32_t seqno) > > BUG_ON(seqno == 0); > > - ret = i915_gem_check_wedge(dev_priv); > + ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible); > if (ret) > return ret; > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c > index 501546e..6c024d4 100644 > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c > @@ -1220,8 +1220,10 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n) > } > > msleep(1); > - if (atomic_read(&dev_priv->mm.wedged)) > - return -EAGAIN; > + > + ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible); > + if (ret) > + return ret; > } while (!time_after(jiffies, end)); > trace_i915_ring_wait_end(ring); > return -EBUSY; -- Ben Widawsky, Intel Open Source 
Technology Center ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN 2012-06-27 15:19 ` Ben Widawsky @ 2012-06-27 16:36 ` Daniel Vetter 0 siblings, 0 replies; 14+ messages in thread From: Daniel Vetter @ 2012-06-27 16:36 UTC (permalink / raw) To: Ben Widawsky; +Cc: Daniel Vetter, Intel Graphics Development On Wed, Jun 27, 2012 at 08:19:03AM -0700, Ben Widawsky wrote: > On Tue, 26 Jun 2012 23:08:52 +0200 > Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > > So don't return -EAGAIN, even in the case of a gpu hang. Remap it to -EIO > > instead. > > What I'd really like to see in this rather long commit message is what > exactly happens in this case that's being fixed (maybe I should know, > but I don't). What about adding: "Note that this isn't really an issue with interruptability, but more that we have quite a few codepaths (mostly around kms stuff) that simply can't handle any errors and hence not even -EGAIN. Instead of adding proper failure paths so that we could restart these ioctls we've opted for the cheap way out of sleeping non-interruptibly. Which works everywhere but when the gpu dies, which this patch fixes. So essentially interruptible == false means 'wait for the gpu or die trying'." -Daniel > > > > > This is a bit ugly because intel_ring_begin is all non-interruptible > > and hence only returns -EIO. But as the comment in there says, > > auditing all the callsites would be a pain. > > > > To avoid duplicating code, reuse i915_gem_check_wedge in __wait_seqno > > and intel_wait_ring_buffer. Also use the opportunity to clarify the > > different cases in i915_gem_check_wedge a bit with comments. > > > > v2: Don't access dev_priv->mm.interruptible from check_wedge - we > > might not hold dev->struct_mutex, making this racy. Instead pass > > interruptible in as a parameter. I've noticed this because I've hit a > > BUG_ON(!mutex_is_locked) at the top of check_wedge. 
This has been > > added in > > > > commit b4aca0106c466b5a0329318203f65bac2d91b682 > > Author: Ben Widawsky <ben@bwidawsk.net> > > Date: Wed Apr 25 20:50:12 2012 -0700 > > > > drm/i915: extract some common olr+wedge code > > > > although that commit is missing any justification for this it. I guess > > it's just copy&paste, because the same commit add the same BUG_ON > > check to check_olr, where it indeed makes sense. > > > > But in check_wedge everything we access is protected by other means, > > so this is superflous. And because it now gets in the way (we add a > > new caller in __wait_seqno, which can be called without > > dev->struct_mutext) let's just remove it. > > > > v3: Group all the i915_gem_check_wedge refactoring into this patch, so > > that this patch here is all about not returning -EAGAIN to callsites > > that can't handle syscall restarting. > > > > Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch> > > Reviewed-by: Ben Widawsky <ben@bwidawsk.net> > > > --- > > drivers/gpu/drm/i915/i915_drv.h | 2 ++ > > drivers/gpu/drm/i915/i915_gem.c | 26 ++++++++++++++++++-------- > > drivers/gpu/drm/i915/intel_ringbuffer.c | 6 ++++-- > > 3 files changed, 24 insertions(+), 10 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > > index a0c15ab..ab9ade0 100644 > > --- a/drivers/gpu/drm/i915/i915_drv.h > > +++ b/drivers/gpu/drm/i915/i915_drv.h > > @@ -1330,6 +1330,8 @@ i915_gem_object_unpin_fence(struct drm_i915_gem_object *obj) > > > > void i915_gem_retire_requests(struct drm_device *dev); > > void i915_gem_retire_requests_ring(struct intel_ring_buffer *ring); > > +int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv, > > + bool interruptible); > > > > void i915_gem_reset(struct drm_device *dev); > > void i915_gem_clflush_object(struct drm_i915_gem_object *obj); > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > > index 6a98c06..af6a510 100644 > > --- 
a/drivers/gpu/drm/i915/i915_gem.c > > +++ b/drivers/gpu/drm/i915/i915_gem.c > > @@ -1863,11 +1863,10 @@ i915_gem_retire_work_handler(struct work_struct *work) > > mutex_unlock(&dev->struct_mutex); > > } > > > > -static int > > -i915_gem_check_wedge(struct drm_i915_private *dev_priv) > > +int > > +i915_gem_check_wedge(struct drm_i915_private *dev_priv, > > + bool interruptible) > > { > > - BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex)); > > - > > if (atomic_read(&dev_priv->mm.wedged)) { > > struct completion *x = &dev_priv->error_completion; > > bool recovery_complete; > > @@ -1878,7 +1877,16 @@ i915_gem_check_wedge(struct drm_i915_private *dev_priv) > > recovery_complete = x->done > 0; > > spin_unlock_irqrestore(&x->wait.lock, flags); > > > > - return recovery_complete ? -EIO : -EAGAIN; > > + /* Non-interruptible callers can't handle -EAGAIN, hence return > > + * -EIO unconditionally for these. */ > > + if (!interruptible) > > + return -EIO; > > + > > + /* Recovery complete, but still wedged means reset failure. 
*/ > > + if (recovery_complete) > > + return -EIO; > > + > > + return -EAGAIN; > > } > > > > return 0; > > @@ -1932,6 +1940,7 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno, > > unsigned long timeout_jiffies; > > long end; > > bool wait_forever = true; > > + int ret; > > > > if (i915_seqno_passed(ring->get_seqno(ring), seqno)) > > return 0; > > @@ -1963,8 +1972,9 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno, > > end = wait_event_timeout(ring->irq_queue, EXIT_COND, > > timeout_jiffies); > > > > - if (atomic_read(&dev_priv->mm.wedged)) > > - end = -EAGAIN; > > + ret = i915_gem_check_wedge(dev_priv, interruptible); > > + if (ret) > > + end = ret; > > } while (end == 0 && wait_forever); > > > > getrawmonotonic(&now); > > @@ -2004,7 +2014,7 @@ i915_wait_seqno(struct intel_ring_buffer *ring, uint32_t seqno) > > > > BUG_ON(seqno == 0); > > > > - ret = i915_gem_check_wedge(dev_priv); > > + ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible); > > if (ret) > > return ret; > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c > > index 501546e..6c024d4 100644 > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c > > @@ -1220,8 +1220,10 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n) > > } > > > > msleep(1); > > - if (atomic_read(&dev_priv->mm.wedged)) > > - return -EAGAIN; > > + > > + ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible); > > + if (ret) > > + return ret; > > } while (!time_after(jiffies, end)); > > trace_i915_ring_wait_end(ring); > > return -EBUSY; > > > > -- > Ben Widawsky, Intel Open Source Technology Center -- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 4/4] drm/i915: don't hange userspace when the gpu reset is stuck
  2012-06-26 21:08 [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
  2012-06-26 21:08 ` [PATCH 2/4] drm/i915: don't trylock in the gpu reset code Daniel Vetter
  2012-06-26 21:08 ` [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter
@ 2012-06-26 21:08 ` Daniel Vetter
  2012-07-01  3:09 ` [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Ben Widawsky
  2012-07-03 15:59 ` Chris Wilson
  4 siblings, 0 replies; 14+ messages in thread
From: Daniel Vetter @ 2012-06-26 21:08 UTC (permalink / raw)
To: Intel Graphics Development; +Cc: Daniel Vetter

With the gpu reset no longer using a trylock we've increased the
chances of userspace getting stuck quite a bit. To make that
(hopefully) rare case more palatable, time out when waiting for the
gpu reset code to complete and signal this little issue to the caller
by returning -EIO.

This should help userspace to somewhat gracefully fall back and
hopefully allow the user to grab some logs and reboot the machine
(instead of staring at a frozen X screen in agony).

Suggested by Chris Wilson because I've been stubborn about allowing
the gpu reset code to fail.

Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index af6a510..7d28555 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -96,9 +96,18 @@ i915_gem_wait_for_error(struct drm_device *dev)
 if (!atomic_read(&dev_priv->mm.wedged))
 return 0;

- ret = wait_for_completion_interruptible(x);
- if (ret)
+ /*
+ * Only wait 10 seconds for the gpu reset to complete to avoid hanging
+ * userspace. If it takes that long something really bad is going on and
+ * we should simply try to bail out and fail as gracefully as possible.
+ */
+ ret = wait_for_completion_interruptible_timeout(x, 10*HZ);
+ if (ret == 0) {
+ DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
+ return -EIO;
+ } else if (ret < 0) {
 return ret;
+ }

 if (atomic_read(&dev_priv->mm.wedged)) {
 /* GPU is hung, bump the completion count to account for
--
1.7.10

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-06-26 21:08 [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
  ` (2 preceding siblings ...)
  2012-06-26 21:08 ` [PATCH 4/4] drm/i915: don't hange userspace when the gpu reset is stuck Daniel Vetter
@ 2012-07-01  3:09 ` Ben Widawsky
  2012-07-01 10:41   ` Daniel Vetter
  2012-07-03 15:59 ` Chris Wilson
  4 siblings, 1 reply; 14+ messages in thread
From: Ben Widawsky @ 2012-07-01 3:09 UTC (permalink / raw)
To: intel-gfx

On Tue, 26 Jun 2012 23:08:50 +0200
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> Having this early check in intel_ring_begin doesn't buy us anything,
> since we'll be calling into wait_request in the usual case already
> anyway. In the corner case of not waiting for free space using the
> last_retired_head we simply need to do the same check, too.
>
> With these changes we'll only ever get an -EIO from intel_ring_begin
> if the gpu has truly been declared dead.
>
> v2: Don't conflate the change to prevent intel_ring_begin from returning
> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> just prep work to avoid returning -EAGAIN in callsites that can't handle
> syscall restarting.

I'm really scared by this change. It's difficult to review because there
are SO many callers of intel_ring_begin, and figuring out if they all
work in the wedged case is simply too difficult. I've yet to review the
rest of the series, but it may make more sense to put this change last
perhaps?

>
> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
> drivers/gpu/drm/i915/intel_ringbuffer.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index f30a53a..501546e 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
> int intel_ring_begin(struct intel_ring_buffer *ring,
> int num_dwords)
> {
> - struct drm_i915_private *dev_priv = ring->dev->dev_private;
> int n = 4*num_dwords;
> int ret;
>
> - if (unlikely(atomic_read(&dev_priv->mm.wedged)))
> - return -EIO;
> -
> if (unlikely(ring->tail + n > ring->effective_size)) {
> ret = intel_wrap_ring_buffer(ring);
> if (unlikely(ret))

--
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-07-01  3:09 ` [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Ben Widawsky
@ 2012-07-01 10:41   ` Daniel Vetter
  2012-07-02 16:04     ` Ben Widawsky
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Vetter @ 2012-07-01 10:41 UTC (permalink / raw)
To: Ben Widawsky; +Cc: intel-gfx

On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben@bwidawsk.net> wrote:
> On Tue, 26 Jun 2012 23:08:50 +0200
> Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
>> Having this early check in intel_ring_begin doesn't buy us anything,
>> since we'll be calling into wait_request in the usual case already
>> anyway. In the corner case of not waiting for free space using the
>> last_retired_head we simply need to do the same check, too.
>>
>> With these changes we'll only ever get an -EIO from intel_ring_begin
>> if the gpu has truly been declared dead.
>>
>> v2: Don't conflate the change to prevent intel_ring_begin from returning
>> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
>> just prep work to avoid returning -EAGAIN in callsites that can't handle
>> syscall restarting.
>
> I'm really scared by this change. It's difficult to review because there
> are SO many callers of intel_ring_begin, and figuring out if they all
> work in the wedged case is simply too difficult. I've yet to review the
> rest of the series, but it may make more sense to put this change last
> perhaps?

Well, this patch doesn't really affect much what error codes the
callers get - we'll still throw both -EAGAIN and -EIO at them (later on
patches will fix this).

What this patch does though is prevent us from returning too many
-EIO. Imagine the gpu died and we've noticed already (hence
dev_priv->mm.wedged is set), but some process is stuck churning
through the execbuf ioctl, holding dev->struct_mutex. While processing
the execbuf we call intel_ring_begin to emit a few commands. Now
usually, even when the gpu is dead, there is enough space in the ring
to do so, which allows us to complete the execbuf ioctl, and then later
on we can block properly when trying to grab the mutex in the next
ioctl for the gpu reset work handler to complete.

But thanks to that wedged check in intel_ring_begin we'll instead
return -EIO, despite the fact that later on we could successfully
reset the gpu. Returning -EIO forces the X server to fall back to s/w
rendering and disabling dri2, and in the case of a 3d compositor
usually results in an abort. After this patch we can still return -EIO
if the gpu is dead but the reset work hasn't completed yet, but only
so if the ring is full (which in many cases is unlikely).

Cheers, Daniel

>
>>
>> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>> ---
>> drivers/gpu/drm/i915/intel_ringbuffer.c | 4 ----
>> 1 file changed, 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
>> index f30a53a..501546e 100644
>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
>> @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
>> int intel_ring_begin(struct intel_ring_buffer *ring,
>> int num_dwords)
>> {
>> - struct drm_i915_private *dev_priv = ring->dev->dev_private;
>> int n = 4*num_dwords;
>> int ret;
>>
>> - if (unlikely(atomic_read(&dev_priv->mm.wedged)))
>> - return -EIO;
>> -
>> if (unlikely(ring->tail + n > ring->effective_size)) {
>> ret = intel_wrap_ring_buffer(ring);
>> if (unlikely(ret))
>
>
>
> --
> Ben Widawsky, Intel Open Source Technology Center
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

--
Daniel Vetter
daniel.vetter@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-07-01 10:41   ` Daniel Vetter
@ 2012-07-02 16:04     ` Ben Widawsky
  2012-07-02 16:47       ` Daniel Vetter
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Widawsky @ 2012-07-02 16:04 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx

On Sun, 1 Jul 2012 12:41:19 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben@bwidawsk.net> wrote:
> > On Tue, 26 Jun 2012 23:08:50 +0200
> > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> >> Having this early check in intel_ring_begin doesn't buy us anything,
> >> since we'll be calling into wait_request in the usual case already
> >> anyway. In the corner case of not waiting for free space using the
> >> last_retired_head we simply need to do the same check, too.
> >>
> >> With these changes we'll only ever get an -EIO from intel_ring_begin
> >> if the gpu has truly been declared dead.
> >>
> >> v2: Don't conflate the change to prevent intel_ring_begin from returning
> >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> >> just prep work to avoid returning -EAGAIN in callsites that can't handle
> >> syscall restarting.
> >
> > I'm really scared by this change. It's difficult to review because there
> > are SO many callers of intel_ring_begin, and figuring out whether they all
> > work in the wedged case is simply too difficult. I've yet to review the
> > rest of the series, but it may make more sense to put this change last
> > perhaps?
>
> Well, this patch doesn't really affect much what error codes the
> callers get - we'll still throw both -EAGAIN and -EIO at them (later
> patches will fix this).
>
> What this patch does though is prevent us from returning too many
> -EIO. Imagine the gpu died and we've noticed already (hence
> dev_priv->mm.wedged is set), but some process is stuck churning
> through the execbuf ioctl, holding dev->struct_mutex.
> While processing
> the execbuf we call intel_ring_begin to emit a few commands. Now
> usually, even when the gpu is dead, there is enough space in the ring
> to do so, which allows us to complete the execbuf ioctl and then later
> on we can block properly when trying to grab the mutex in the next
> ioctl for the gpu reset work handler to complete.

That in itself is a pretty big change, don't you think? It seems rather
strange and dangerous to modify HW (which we will if we allow execbuf to
continue, since we write the tail pointer). I think we want some way to
not write the tail ptr in such a case. Now you might respond, well, who
cares about writes? But this imposes a pretty large restriction on any
code that can't work well after the GPU is hung.

I see the irony: I'm complaining that this change can make GPU hangs
unstable, while the patch series is fixing GPU reset. Call it paranoia,
but it still seems unsafe to me, and it makes us have to think a bit more
whenever adding any code. Am I over-thinking this?

> But thanks to that wedged check in intel_ring_begin we'll instead
> return -EIO, despite the fact that later on we could successfully
> reset the gpu. Returning -EIO forces the X server to fall back to s/w
> rendering and disabling dri2, and in the case of a 3d compositor
> usually results in an abort. After this patch we can still return -EIO
> if the gpu is dead but the reset work hasn't completed yet, but only
> so if the ring is full (which in many cases is unlikely).
>
> Cheers, Daniel
>
> >
> >> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> >> ---
> >>  drivers/gpu/drm/i915/intel_ringbuffer.c | 4 ----
> >>  1 file changed, 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> index f30a53a..501546e 100644
> >> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
> >>  int intel_ring_begin(struct intel_ring_buffer *ring,
> >>                       int num_dwords)
> >>  {
> >> -     struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >>       int n = 4*num_dwords;
> >>       int ret;
> >>
> >> -     if (unlikely(atomic_read(&dev_priv->mm.wedged)))
> >> -             return -EIO;
> >> -
> >>       if (unlikely(ring->tail + n > ring->effective_size)) {
> >>               ret = intel_wrap_ring_buffer(ring);
> >>               if (unlikely(ret))
> >
> > --
> > Ben Widawsky, Intel Open Source Technology Center
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

--
Ben Widawsky, Intel Open Source Technology Center
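Ben's suggestion of "some way to not write the tail ptr in such a case" could be sketched as a guard around the hardware tail write: software ring state keeps advancing so the ioctl can complete, but the MMIO poke is skipped once the hang is declared. This is a hypothetical illustration of the idea only; the driver did not necessarily adopt this approach, and all `toy_*` names are invented.

```c
#include <stdbool.h>
#include <assert.h>

/* Toy ring state; hw_tail_writes stands in for the MMIO tail register. */
struct toy_ring {
	unsigned int tail;		/* software copy of the tail offset */
	bool wedged;			/* hang has been declared */
	unsigned int hw_tail_writes;	/* counts simulated I915_WRITE_TAIL()s */
};

/* Sketch of a guarded ring advance: software state always moves
 * forward, but dead hardware is left untouched. */
static void toy_advance_ring(struct toy_ring *ring, unsigned int new_tail)
{
	ring->tail = new_tail;		/* execbuf bookkeeping still completes */
	if (ring->wedged)
		return;			/* skip poking the dead gpu */
	ring->hw_tail_writes++;		/* the real MMIO write would go here */
}
```

The trade-off Daniel raises in his reply below still applies: since the hang is only detected after the fact, some tail writes to a dead gpu are unavoidable no matter where this guard sits.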
* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-07-02 16:04     ` Ben Widawsky
@ 2012-07-02 16:47       ` Daniel Vetter
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Vetter @ 2012-07-02 16:47 UTC (permalink / raw)
To: Ben Widawsky; +Cc: intel-gfx

On Mon, Jul 02, 2012 at 09:04:48AM -0700, Ben Widawsky wrote:
> On Sun, 1 Jul 2012 12:41:19 +0200
> Daniel Vetter <daniel@ffwll.ch> wrote:
>
> > On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben@bwidawsk.net> wrote:
> > > On Tue, 26 Jun 2012 23:08:50 +0200
> > > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > >> Having this early check in intel_ring_begin doesn't buy us anything,
> > >> since we'll be calling into wait_request in the usual case already
> > >> anyway. In the corner case of not waiting for free space using the
> > >> last_retired_head we simply need to do the same check, too.
> > >>
> > >> With these changes we'll only ever get an -EIO from intel_ring_begin
> > >> if the gpu has truly been declared dead.
> > >>
> > >> v2: Don't conflate the change to prevent intel_ring_begin from returning
> > >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> > >> just prep work to avoid returning -EAGAIN in callsites that can't handle
> > >> syscall restarting.
> > >
> > > I'm really scared by this change. It's difficult to review because there
> > > are SO many callers of intel_ring_begin, and figuring out whether they all
> > > work in the wedged case is simply too difficult. I've yet to review the
> > > rest of the series, but it may make more sense to put this change last
> > > perhaps?
> >
> > Well, this patch doesn't really affect much what error codes the
> > callers get - we'll still throw both -EAGAIN and -EIO at them (later
> > patches will fix this).
> >
> > What this patch does though is prevent us from returning too many
> > -EIO.
> > Imagine the gpu died and we've noticed already (hence
> > dev_priv->mm.wedged is set), but some process is stuck churning
> > through the execbuf ioctl, holding dev->struct_mutex. While processing
> > the execbuf we call intel_ring_begin to emit a few commands. Now
> > usually, even when the gpu is dead, there is enough space in the ring
> > to do so, which allows us to complete the execbuf ioctl and then later
> > on we can block properly when trying to grab the mutex in the next
> > ioctl for the gpu reset work handler to complete.
>
> That in itself is a pretty big change, don't you think? It seems rather
> strange and dangerous to modify HW (which we will if we allow execbuf to
> continue when we write the tail pointer). I think we want some way to
> not write the tail ptr in such a case. Now you might respond, well, who
> cares about writes? But this imposes a pretty large restriction on any
> code that can't work well after the GPU is hung.
>
> I see the irony. I'm complaining that this change can make GPU hangs
> unstable, and the patch series is fixing GPU reset. Call it paranoia, it
> still seems unsafe to me, and makes us have to think a bit more whenever
> adding any code. Am I over-thinking this?

I think so. It takes us a few seconds before we notice that the gpu
died - only when the hangcheck timer has expired at least twice do we
declare the hw wedged. Imo a few seconds is plenty of time to sneak in
a few ringbuffer emissions (and hence tail pointer writes) ;-) So I
think trying to avoid touching the gpu for another few usecs _is_
overthinking the problem, because fundamentally we can't avoid touching
a dead gpu - we will only ever notice its demise after the fact.

Yours, Daniel

--
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48
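The "hangcheck timer must expire at least twice" rule Daniel relies on can be modelled as a small two-strikes state machine: a periodic tick compares the hardware progress counter against the previous sample, and only two consecutive samples without progress declare the gpu wedged. Again a toy sketch with invented names, not the driver's hangcheck implementation.

```c
#include <stdbool.h>
#include <assert.h>

/* Toy hangcheck state sampled by a periodic timer. */
struct toy_hangcheck {
	unsigned int last_seqno;	/* progress counter at the previous tick */
	unsigned int strikes;		/* consecutive ticks without progress */
	bool wedged;			/* hang declared, reset work scheduled */
};

/* Called once per timer period with the current hardware seqno. */
static void toy_hangcheck_tick(struct toy_hangcheck *hc, unsigned int seqno)
{
	if (seqno != hc->last_seqno) {
		hc->last_seqno = seqno;
		hc->strikes = 0;	/* gpu made progress: start over */
		return;
	}
	if (++hc->strikes >= 2)
		hc->wedged = true;	/* two strikes: declare the hw dead */
}
```

This is why the window Daniel mentions exists: between the real hang and the second expired tick, userspace can keep emitting ringbuffer commands, so the driver can never fully avoid touching a gpu that has already died.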
* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-06-26 21:08 [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
                   ` (3 preceding siblings ...)
  2012-07-01  3:09 ` [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Ben Widawsky
@ 2012-07-03 15:59 ` Chris Wilson
  2012-07-03 18:11   ` Daniel Vetter
  4 siblings, 1 reply; 14+ messages in thread
From: Chris Wilson @ 2012-07-03 15:59 UTC (permalink / raw)
To: Intel Graphics Development; +Cc: Daniel Vetter

My experience with these patches is that they make it less likely that
the hang is reported to userspace in a timely fashion (filling the
ring full leads to lots of lost rendering), and worse, they make it much
more likely that i915_gem_fault() hits an -EIO and goes bang. That is
unacceptable and trivial to hit with these patches. I have not yet
reproduced that issue using the same broken renderer without these
patches.

I do think the patches are a step in the right direction, but with the
change in userspace behaviour it has to be a NAK for the time being.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-07-03 15:59 ` Chris Wilson
@ 2012-07-03 18:11   ` Daniel Vetter
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Vetter @ 2012-07-03 18:11 UTC (permalink / raw)
To: Chris Wilson; +Cc: Intel Graphics Development

On Tue, Jul 3, 2012 at 5:59 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> My experience with these patches is that they make it less likely that
> the hang is reported to userspace in a timely fashion (filling the
> ring full leads to lots of lost rendering) and worse make it much more
> likely that i915_gem_fault() hits an -EIO and goes bang. That is
> unacceptable and trivial to hit with these patches. I have not yet
> reproduced that issue using the same broken renderer without these
> patches.

Hm, I don't see how these patches allow much more rendering to be
queued up before we stop everything - for single-threaded userspace
that should be at most one additional batch (which might have been the
one that could catch the spurious -EIO). All subsequent execbuf ioctl
calls should stall when trying to grab the mutex. Same for the case
where the gpu reset failed: userspace should be able to submit one more
batch afaict until it gets an -EIO. So can you please dig into what
exactly you're seeing a bit more and unconfuse me?

> I do think the patches are a step in the right direction, but with the
> change in userspace behaviour it has to be a NAK for the time being.

Ok, I guess I'll have to throw the -EIO sigbus eater into the mix,
too. After all, userspace is supposed to call set_domain(GTT) before
accessing the gtt mmap, so it should still get notified when the gpu
has died and rendering might be incomplete. Imo we should still return
-EIO for all ioctls; userspace should be able to cope with these (in
the future we might even add optional behaviour to signal potentially
dropped rendering due to a gpu reset at wait_rendering, for userspace
that really cares).
So would the sigbus eater be good enough, or do we need more?

Thanks, Daniel

--
Daniel Vetter
daniel.vetter@ffwll.ch - +41 (0) 79 365 57 48 - http://blog.ffwll.ch
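The "-EIO sigbus eater" Daniel floats could take roughly this shape in the fault path: instead of letting a dead gpu's -EIO escape as a hard failure out of i915_gem_fault(), translate it into a SIGBUS for the faulting process, while a pending reset maps to a retry. This is a speculative sketch of the idea in the mail above; the constants and the error mapping are invented, and the eventual driver fix may have differed.

```c
#include <errno.h>
#include <assert.h>

/* Stand-ins for the kernel's VM_FAULT_* codes; values are arbitrary. */
#define TOY_FAULT_NOPAGE 0	/* page installed, fault handled */
#define TOY_FAULT_SIGBUS 1	/* deliver SIGBUS to the faulting task */
#define TOY_FAULT_RETRY  2	/* drop locks and fault again later */

/* Sketch of a fault handler mapping the bind/set-domain error onto a
 * fault result: eat -EIO into SIGBUS rather than failing hard. */
static int toy_gem_fault(int bind_err)
{
	switch (bind_err) {
	case 0:
		return TOY_FAULT_NOPAGE;
	case -EAGAIN:
		return TOY_FAULT_RETRY;		/* reset in progress: try again */
	case -EIO:
		return TOY_FAULT_SIGBUS;	/* gpu truly dead: eat the -EIO */
	default:
		return TOY_FAULT_SIGBUS;	/* any other error: fail the access */
	}
}
```

The point of the mapping is the one made in the mail: userspace that called set_domain(GTT) first already learned about the dead gpu via -EIO, so a fault that still trips over it can safely be turned into a signal instead of a driver failure.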
end of thread, other threads: [~2012-07-03 18:11 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2012-06-26 21:08 [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
2012-06-26 21:08 ` [PATCH 2/4] drm/i915: don't trylock in the gpu reset code Daniel Vetter
2012-06-26 22:08   ` Łukasz Kuryło
2012-06-26 22:34     ` Daniel Vetter
2012-06-26 21:08 ` [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter
2012-06-27 15:19   ` Ben Widawsky
2012-06-27 16:36     ` Daniel Vetter
2012-06-26 21:08 ` [PATCH 4/4] drm/i915: don't hange userspace when the gpu reset is stuck Daniel Vetter
2012-07-01  3:09 ` [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Ben Widawsky
2012-07-01 10:41   ` Daniel Vetter
2012-07-02 16:04     ` Ben Widawsky
2012-07-02 16:47       ` Daniel Vetter
2012-07-03 15:59 ` Chris Wilson
2012-07-03 18:11   ` Daniel Vetter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox