* [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
@ 2015-06-11 16:14 ville.syrjala
2015-06-11 17:05 ` [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips ville.syrjala
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: ville.syrjala @ 2015-06-11 16:14 UTC (permalink / raw)
To: intel-gfx
From: Ville Syrjälä <ville.syrjala@linux.intel.com>
When the GPU gets reset __i915_wait_request() returns -EIO to the
mmio flip worker. Currently we WARN whenever we get anything other
than 0. Ignore the -EIO too since it's a perfectly normal thing
to get during a GPU reset.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
drivers/gpu/drm/i915/intel_display.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 9bf759c..3cd0935 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -11327,11 +11327,13 @@ static void intel_mmio_flip_work_func(struct work_struct *work)
 	struct intel_mmio_flip *mmio_flip =
 		container_of(work, struct intel_mmio_flip, work);
 
-	if (mmio_flip->req)
-		WARN_ON(__i915_wait_request(mmio_flip->req,
-					    mmio_flip->crtc->reset_counter,
-					    false, NULL,
-					    &mmio_flip->i915->rps.mmioflips));
+	if (mmio_flip->req) {
+		int ret = __i915_wait_request(mmio_flip->req,
+					      mmio_flip->crtc->reset_counter,
+					      false, NULL,
+					      &mmio_flip->i915->rps.mmioflips);
+		WARN_ON(ret != 0 && ret != -EIO);
+	}
 
 	intel_do_mmio_flip(mmio_flip->crtc);
 
--
2.3.6
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 16+ messages in thread
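The condition the patch passes to WARN_ON() can be read as a small predicate. The following is a standalone sketch for illustration only, not kernel code; the helper name wait_result_is_unexpected() is hypothetical and does not appear in the driver.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only: returns true when a wait
 * result would trigger the warning under the patched logic. Success (0)
 * and -EIO (reported while the GPU is being reset) are both tolerated;
 * any other value is still treated as unexpected.
 */
bool wait_result_is_unexpected(int ret)
{
	return ret != 0 && ret != -EIO;
}
```

With this predicate, 0 and -EIO pass silently, while something like -EAGAIN would still warn.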
* [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips
2015-06-11 16:14 [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip ville.syrjala
@ 2015-06-11 17:05 ` ville.syrjala
2015-06-15 4:01 ` shuang.he
2015-06-29 2:53 ` shuang.he
2015-06-11 20:01 ` [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip Chris Wilson
` (2 subsequent siblings)
3 siblings, 2 replies; 16+ messages in thread
From: ville.syrjala @ 2015-06-11 17:05 UTC (permalink / raw)
To: intel-gfx; +Cc: Ander Conselvan de Oliveira
From: Ville Syrjälä <ville.syrjala@linux.intel.com>
When the GPU gets reset __i915_wait_request() returns -EIO to the
mmio flip worker. Currently we WARN whenever we get anything other
than 0. Ignore the -EIO too since it's a perfectly normal thing
to get during a GPU reset.
Also give intel_finish_fb() the same treatment, which triggers now at
least with CS flips on my gen4.
The intel_finish_fb() warning got added in
commit 2e2f351dbf29681d54a3a0f1003c5bb9bc832072
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Apr 27 13:41:14 2015 +0100
drm/i915: Remove domain flubbing from i915_gem_object_finish_gpu()
The mmio flip one in
commit 536f5b5e86b225dab94c7ff8061ae482b6077387
Author: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
Date: Thu Nov 6 11:03:40 2014 +0200
drm/i915: Make mmio flip wait for seqno in the work function
v2: Ignore -EIO in intel_finish_fb() too
Cc: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
drivers/gpu/drm/i915/intel_display.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 9bf759c..0e4720e 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3279,7 +3279,7 @@ intel_finish_fb(struct drm_framebuffer *old_fb)
 	ret = i915_gem_object_wait_rendering(obj, true);
 	dev_priv->mm.interruptible = was_interruptible;
 
-	WARN_ON(ret);
+	WARN_ON(ret != 0 && ret != -EIO);
 }
 
 static bool intel_crtc_has_pending_flip(struct drm_crtc *crtc)
@@ -11327,11 +11327,13 @@ static void intel_mmio_flip_work_func(struct work_struct *work)
 	struct intel_mmio_flip *mmio_flip =
 		container_of(work, struct intel_mmio_flip, work);
 
-	if (mmio_flip->req)
-		WARN_ON(__i915_wait_request(mmio_flip->req,
-					    mmio_flip->crtc->reset_counter,
-					    false, NULL,
-					    &mmio_flip->i915->rps.mmioflips));
+	if (mmio_flip->req) {
+		int ret = __i915_wait_request(mmio_flip->req,
+					      mmio_flip->crtc->reset_counter,
+					      false, NULL,
+					      &mmio_flip->i915->rps.mmioflips);
+		WARN_ON(ret != 0 && ret != -EIO);
+	}
 
 	intel_do_mmio_flip(mmio_flip->crtc);
 
--
2.3.6
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-11 16:14 [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip ville.syrjala
2015-06-11 17:05 ` [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips ville.syrjala
@ 2015-06-11 20:01 ` Chris Wilson
2015-06-15 16:34 ` Daniel Vetter
2015-06-15 1:40 ` shuang.he
2015-06-29 9:11 ` shuang.he
3 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-06-11 20:01 UTC (permalink / raw)
To: ville.syrjala; +Cc: intel-gfx
On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala@linux.intel.com wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
>
> When the GPU gets reset __i915_wait_request() returns -EIO to the
> mmio flip worker. Currently we WARN whenever we get anything other
> than 0. Ignore the -EIO too since it's a perfectly normal thing
> to get during a GPU reset.
Nak. I consider this a bug in __i915_wait_request(). I am discussing
with Thomas Elf how to fix this wrt the next generation of individual
ring resets.
In the meantime I prefer a fix along the lines of
http://patchwork.freedesktop.org/patch/46607/ which addresses this and
more, such as the false SIGBUSes.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-11 16:14 [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip ville.syrjala
2015-06-11 17:05 ` [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips ville.syrjala
2015-06-11 20:01 ` [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip Chris Wilson
@ 2015-06-15 1:40 ` shuang.he
2015-06-29 9:11 ` shuang.he
3 siblings, 0 replies; 16+ messages in thread
From: shuang.he @ 2015-06-15 1:40 UTC (permalink / raw)
To: shuang.he, lei.a.liu, intel-gfx, ville.syrjala
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6571
-------------------------------------Summary-------------------------------------
Platform     Delta     drm-intel-nightly     Series Applied
PNV                    276/276               276/276
ILK                    303/303               303/303
SNB                    312/312               312/312
IVB                    343/343               343/343
BYT                    287/287               287/287
BDW                    321/321               321/321
-------------------------------------Detailed-------------------------------------
Platform     Test      drm-intel-nightly     Series Applied
Note: Pay particular attention to lines starting with '*'
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips
2015-06-11 17:05 ` [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips ville.syrjala
@ 2015-06-15 4:01 ` shuang.he
2015-06-29 2:53 ` shuang.he
1 sibling, 0 replies; 16+ messages in thread
From: shuang.he @ 2015-06-15 4:01 UTC (permalink / raw)
To: shuang.he, lei.a.liu, intel-gfx, ville.syrjala
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6572
-------------------------------------Summary-------------------------------------
Platform     Delta     drm-intel-nightly     Series Applied
PNV                    276/276               276/276
ILK                    303/303               303/303
SNB                    312/312               312/312
IVB                    343/343               343/343
BYT                    287/287               287/287
BDW                    321/321               321/321
-------------------------------------Detailed-------------------------------------
Platform     Test      drm-intel-nightly     Series Applied
Note: Pay particular attention to lines starting with '*'
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-11 20:01 ` [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip Chris Wilson
@ 2015-06-15 16:34 ` Daniel Vetter
2015-06-16 12:10 ` Chris Wilson
0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-06-15 16:34 UTC (permalink / raw)
To: Chris Wilson, ville.syrjala, intel-gfx
On Thu, Jun 11, 2015 at 09:01:08PM +0100, Chris Wilson wrote:
> On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala@linux.intel.com wrote:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >
> > When the GPU gets reset __i915_wait_request() returns -EIO to the
> > mmio flip worker. Currently we WARN whenever we get anything other
> > than 0. Ignore the -EIO too since it's a perfectly normal thing
> > to get during a GPU reset.
>
> Nak. I consider it is a bug in __i915_wait_request(). I am discussing
> with Thomas Elf how to fix this wrt the next generation of individual
> ring resets.
We should only get an -EIO if the gpu is truly gone, but an -EAGAIN when
the reset is ongoing. Neither is currently handled. For lockless users we
probably want a version of wait_request which just dtrt (of waiting for
the reset handler to complete without trying to grab the mutex and then
returning). Or some other means of retrying.
Returning -EIO from the low-level wait function still seems appropriate,
but callers need to eat/handle it appropriately. WARN_ON isn't it here
ofc.
Also we have piles of flip vs. gpu hang testcases ... do they fail to
provoke this or is this another case of bug lost in bugzilla? In any case
needs a Testcase: line.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 16+ messages in thread
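The distinction drawn in the message above (-EAGAIN while a reset is ongoing, -EIO only once the GPU is truly gone, with lockless users retrying) can be sketched as a userspace model. Everything here is an assumption made for illustration: the gpu_state enum, wait_status_for() and lockless_wait() are hypothetical names, and the array of states merely stands in for what a caller would observe on successive attempts.

```c
#include <errno.h>

/* Hypothetical GPU states, used only by this sketch. */
enum gpu_state { GPU_OK, GPU_RESET_IN_PROGRESS, GPU_WEDGED };

/* Map an observed state to the error a wait would report in this model. */
int wait_status_for(enum gpu_state state)
{
	switch (state) {
	case GPU_RESET_IN_PROGRESS:
		return -EAGAIN;	/* reset ongoing: the caller should retry */
	case GPU_WEDGED:
		return -EIO;	/* GPU is truly gone */
	default:
		return 0;
	}
}

/*
 * Lockless caller: keep retrying while a reset is pending. states[i]
 * models the state seen on attempt i; *attempts reports how many
 * attempts were made before a final result was reached.
 */
int lockless_wait(const enum gpu_state *states, int nstates, int *attempts)
{
	int ret = 0;
	int i;

	*attempts = 0;
	for (i = 0; i < nstates; i++) {
		(*attempts)++;
		ret = wait_status_for(states[i]);
		if (ret != -EAGAIN)
			break;
	}
	return ret;
}
```

In this model a caller that sees two pending resets followed by a healthy GPU finishes with 0 on the third attempt, while a caller that hits a wedged GPU stops with -EIO.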
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-15 16:34 ` Daniel Vetter
@ 2015-06-16 12:10 ` Chris Wilson
2015-06-16 16:21 ` Daniel Vetter
0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-06-16 12:10 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx
On Mon, Jun 15, 2015 at 06:34:51PM +0200, Daniel Vetter wrote:
> On Thu, Jun 11, 2015 at 09:01:08PM +0100, Chris Wilson wrote:
> > On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala@linux.intel.com wrote:
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > >
> > > When the GPU gets reset __i915_wait_request() returns -EIO to the
> > > mmio flip worker. Currently we WARN whenever we get anything other
> > > than 0. Ignore the -EIO too since it's a perfectly normal thing
> > > to get during a GPU reset.
> >
> > Nak. I consider it is a bug in __i915_wait_request(). I am discussing
> > with Thomas Elf how to fix this wrt the next generation of individual
> > ring resets.
>
> We should only get an -EIO if the gpu is truly gone, but an -EAGAIN when
> the reset is ongoing. Neither is currently handled. For lockless users we
> probably want a version of wait_request which just dtrt (of waiting for
> the reset handler to complete without trying to grab the mutex and then
> returning). Or some other means of retrying.
>
> Returning -EIO from the low-level wait function still seems appropriate,
> but callers need to eat/handle it appropriately. WARN_ON isn't it here
> ofc.
Bleh, a few years ago you decided not to take the EIO handling along the
call paths that don't care.
I disagree. There are two classes of callers, those that care about
EIO/EAGAIN and those that simply want to know when the GPU is no longer
processing that request. That latter class is still popping up in
bugzilla with frozen displays. For the former, we actually only care
about backoff if we are holding the mutex - and that is only required
for EAGAIN. The only user that cares about EIO is throttle().
> Also we have piles of flip vs. gpu hang testcases ... do they fail to
> provoke this or is this another case of bug lost in bugzilla?
We have a few bugs every year for incorrect EIOs returned by
wait_request, but none for this case.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 16+ messages in thread
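The two classes of callers described above can be sketched as two thin wrappers around a raw wait result. This is only an illustration under stated assumptions: both function names are hypothetical, and treating -EAGAIN the same as -EIO for the second class is one reading of "simply want to know when the GPU is no longer processing that request".

```c
#include <errno.h>

/*
 * Class 1 (hypothetical name): callers that act on the error, for
 * example to back off while holding the mutex. The raw result is
 * passed through unchanged.
 */
int wait_propagating_errors(int raw_ret)
{
	return raw_ret;
}

/*
 * Class 2 (hypothetical name): callers that only need to know the GPU
 * is no longer processing the request. Assumption of this sketch: a
 * reset or wedge means exactly that, so both are reported as success.
 */
int wait_for_gpu_idle(int raw_ret)
{
	if (raw_ret == -EIO || raw_ret == -EAGAIN)
		return 0;
	return raw_ret;
}
```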
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-16 12:10 ` Chris Wilson
@ 2015-06-16 16:21 ` Daniel Vetter
2015-06-16 16:30 ` Chris Wilson
0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-06-16 16:21 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, ville.syrjala, intel-gfx
On Tue, Jun 16, 2015 at 01:10:33PM +0100, Chris Wilson wrote:
> On Mon, Jun 15, 2015 at 06:34:51PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 11, 2015 at 09:01:08PM +0100, Chris Wilson wrote:
> > > On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala@linux.intel.com wrote:
> > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > >
> > > > When the GPU gets reset __i915_wait_request() returns -EIO to the
> > > > mmio flip worker. Currently we WARN whenever we get anything other
> > > > than 0. Ignore the -EIO too since it's a perfectly normal thing
> > > > to get during a GPU reset.
> > >
> > > Nak. I consider it is a bug in __i915_wait_request(). I am discussing
> > > with Thomas Elf how to fix this wrt the next generation of individual
> > > ring resets.
> >
> > We should only get an -EIO if the gpu is truly gone, but an -EAGAIN when
> > the reset is ongoing. Neither is currently handled. For lockless users we
> > probably want a version of wait_request which just dtrt (of waiting for
> > the reset handler to complete without trying to grab the mutex and then
> > returning). Or some other means of retrying.
> >
> > Returning -EIO from the low-level wait function still seems appropriate,
> > but callers need to eat/handle it appropriately. WARN_ON isn't it here
> > ofc.
>
> Bleh, a few years ago you decided not to take the EIO handling along the
> call paths that don't care.
>
> I disagree. There are two classes of callers, those that care about
> EIO/EAGAIN and those that simply want to know when the GPU is no longer
> processing that request. That latter class is still popping up in
> bugzilla with frozen displays. For the former, we actually only care
> about backoff if we are holding the mutex - and that is only required
> for EAGAIN. The only user that cares about EIO is throttle().
Hm, right now the design is that for non-interruptible designs we indeed
return -EIO or -EAGAIN, but the reset handler will fix up outstanding
flips. So I guess removing the WARN_ON here is indeed the right thing to
do. We should probably change this once we have atomic (where the wait
doesn't need a lock really, at least for async commits which is what
matters here) and loop until completion.
I'm still wary of eating -EIO in general since it's so hard to test all
this for correctness. Maybe we need a __check_wedge which can return -EIO
and a check_wedge which eats it. And then decide once for where to put
special checks, probably just execbuf and throttle.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 16+ messages in thread
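The pair of helpers suggested above (one that can return -EIO and one that eats it) might look roughly like the following userspace sketch. The struct gpu_error_state and its fields are hypothetical stand-ins, check_wedge_raw() plays the role of the __check_wedge mentioned in the mail (renamed here to avoid a reserved identifier outside the kernel), and the ordering of the checks is an assumption.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's GPU error state. */
struct gpu_error_state {
	bool reset_in_progress;
	bool wedged;
};

/* Strict variant: reports a terminally wedged GPU as -EIO. */
int check_wedge_raw(const struct gpu_error_state *e)
{
	if (e->wedged)
		return -EIO;
	if (e->reset_in_progress)
		return -EAGAIN;
	return 0;
}

/* Lenient variant: eats -EIO but still asks the caller to back off. */
int check_wedge(const struct gpu_error_state *e)
{
	int ret = check_wedge_raw(e);

	return ret == -EIO ? 0 : ret;
}
```

Under this split, only the few call sites that must report a dead GPU would use the strict variant; everything else would call the lenient one.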
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-16 16:21 ` Daniel Vetter
@ 2015-06-16 16:30 ` Chris Wilson
2015-06-17 11:53 ` Daniel Vetter
0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-06-16 16:30 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx
On Tue, Jun 16, 2015 at 06:21:53PM +0200, Daniel Vetter wrote:
> On Tue, Jun 16, 2015 at 01:10:33PM +0100, Chris Wilson wrote:
> > On Mon, Jun 15, 2015 at 06:34:51PM +0200, Daniel Vetter wrote:
> > > On Thu, Jun 11, 2015 at 09:01:08PM +0100, Chris Wilson wrote:
> > > > On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala@linux.intel.com wrote:
> > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > >
> > > > > When the GPU gets reset __i915_wait_request() returns -EIO to the
> > > > > mmio flip worker. Currently we WARN whenever we get anything other
> > > > > than 0. Ignore the -EIO too since it's a perfectly normal thing
> > > > > to get during a GPU reset.
> > > >
> > > > Nak. I consider it is a bug in __i915_wait_request(). I am discussing
> > > > with Thomas Elf how to fix this wrt the next generation of individual
> > > > ring resets.
> > >
> > > We should only get an -EIO if the gpu is truly gone, but an -EAGAIN when
> > > the reset is ongoing. Neither is currently handled. For lockless users we
> > > probably want a version of wait_request which just dtrt (of waiting for
> > > the reset handler to complete without trying to grab the mutex and then
> > > returning). Or some other means of retrying.
> > >
> > > Returning -EIO from the low-level wait function still seems appropriate,
> > > but callers need to eat/handle it appropriately. WARN_ON isn't it here
> > > ofc.
> >
> > Bleh, a few years ago you decided not to take the EIO handling along the
> > call paths that don't care.
> >
> > I disagree. There are two classes of callers, those that care about
> > EIO/EAGAIN and those that simply want to know when the GPU is no longer
> > processing that request. That latter class is still popping up in
> > bugzilla with frozen displays. For the former, we actually only care
> > about backoff if we are holding the mutex - and that is only required
> > for EAGAIN. The only user that cares about EIO is throttle().
>
> Hm, right now the design is that for non-interruptible designs we indeed
> return -EIO or -EAGAIN, but the reset handler will fix up outstanding
> flips. So I guess removing the WARN_ON here is indeed the right thing to
> do. We should probably change this once we have atomic (where the wait
> doesn't need a lock really, at least for async commits which is what
> matters here) and loop until completion.
>
> I'm still wary of eating -EIO in general since it's so hard to test all
> this for correctness. Maybe we need a __check_wedge which can return -EIO
> and a check_wedge which eats it. And then decide once for where to put
> special checks, probably just execbuf and throttle.
Even execbuf really doesn't care. If the GPU didn't complete the earlier
request (principally for semaphore sw sync), it makes no difference for
us now. The content is either corrupt, or we bail when we spot the
wedged GPU upon writing to the ring. Reporting EIO because of an earlier
failure is a poor substitute for the async reset notification. But here
we still need EAGAIN backoff ofc.
I really think eating EIO is the right thing to do in most circumstances
and is correct with the semantics of the callers.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-16 16:30 ` Chris Wilson
@ 2015-06-17 11:53 ` Daniel Vetter
2015-06-17 13:05 ` Chris Wilson
0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-06-17 11:53 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, ville.syrjala, intel-gfx
On Tue, Jun 16, 2015 at 05:30:19PM +0100, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 06:21:53PM +0200, Daniel Vetter wrote:
> > On Tue, Jun 16, 2015 at 01:10:33PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 15, 2015 at 06:34:51PM +0200, Daniel Vetter wrote:
> > > > On Thu, Jun 11, 2015 at 09:01:08PM +0100, Chris Wilson wrote:
> > > > > On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala@linux.intel.com wrote:
> > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > >
> > > > > > When the GPU gets reset __i915_wait_request() returns -EIO to the
> > > > > > mmio flip worker. Currently we WARN whenever we get anything other
> > > > > > than 0. Ignore the -EIO too since it's a perfectly normal thing
> > > > > > to get during a GPU reset.
> > > > >
> > > > > Nak. I consider it is a bug in __i915_wait_request(). I am discussing
> > > > > with Thomas Elf how to fix this wrt the next generation of individual
> > > > > ring resets.
> > > >
> > > > We should only get an -EIO if the gpu is truly gone, but an -EAGAIN when
> > > > the reset is ongoing. Neither is currently handled. For lockless users we
> > > > probably want a version of wait_request which just dtrt (of waiting for
> > > > the reset handler to complete without trying to grab the mutex and then
> > > > returning). Or some other means of retrying.
> > > >
> > > > Returning -EIO from the low-level wait function still seems appropriate,
> > > > but callers need to eat/handle it appropriately. WARN_ON isn't it here
> > > > ofc.
> > >
> > > Bleh, a few years ago you decided not to take the EIO handling along the
> > > call paths that don't care.
> > >
> > > I disagree. There are two classes of callers, those that care about
> > > EIO/EAGAIN and those that simply want to know when the GPU is no longer
> > > processing that request. That latter class is still popping up in
> > > bugzilla with frozen displays. For the former, we actually only care
> > > about backoff if we are holding the mutex - and that is only required
> > > for EAGAIN. The only user that cares about EIO is throttle().
> >
> > Hm, right now the design is that for non-interruptible designs we indeed
> > return -EIO or -EAGAIN, but the reset handler will fix up outstanding
> > flips. So I guess removing the WARN_ON here is indeed the right thing to
> > do. We should probably change this once we have atomic (where the wait
> > doesn't need a lock really, at least for async commits which is what
> > matters here) and loop until completion.
> >
> > I'm still wary of eating -EIO in general since it's so hard to test all
> > this for correctness. Maybe we need a __check_wedge which can return -EIO
> > and a check_wedge which eats it. And then decide once for where to put
> > special checks, probably just execbuf and throttle.
>
> Even execbuf really doesn't care. If the GPU didn't complete the earlier
> request (principally for semaphore sw sync), it makes no difference for
> us now. The content is either corrupt, or we bail when we spot the
> wedged GPU upon writing to the ring. Reporting EIO because of an earlier
> failure is a poor substitute for the async reset notification. But here
> we still need EAGAIN backoff ofc.
>
> I really think eating EIO is the right thing to do in most circumstances
> and is correct with the semantics of the callers.
Well we once had the transparent sw fallback at least in the ddx for -EIO.
Mesa never coped for obvious reasons, and given that a modern desktop
can't survive without GL there's not all that much point any more. But still
I think if the gpu is terminally dead we need to tell this to userspace
somehow.
What I'm unclear about is which ioctl that should be, and my assumption
thus has been that it's execbuf.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-17 11:53 ` Daniel Vetter
@ 2015-06-17 13:05 ` Chris Wilson
2015-06-17 14:16 ` Chris Wilson
0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-06-17 13:05 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx
On Wed, Jun 17, 2015 at 01:53:55PM +0200, Daniel Vetter wrote:
> On Tue, Jun 16, 2015 at 05:30:19PM +0100, Chris Wilson wrote:
> > On Tue, Jun 16, 2015 at 06:21:53PM +0200, Daniel Vetter wrote:
> > > On Tue, Jun 16, 2015 at 01:10:33PM +0100, Chris Wilson wrote:
> > > > On Mon, Jun 15, 2015 at 06:34:51PM +0200, Daniel Vetter wrote:
> > > > > On Thu, Jun 11, 2015 at 09:01:08PM +0100, Chris Wilson wrote:
> > > > > > On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala@linux.intel.com wrote:
> > > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > >
> > > > > > > When the GPU gets reset __i915_wait_request() returns -EIO to the
> > > > > > > mmio flip worker. Currently we WARN whenever we get anything other
> > > > > > > than 0. Ignore the -EIO too since it's a perfectly normal thing
> > > > > > > to get during a GPU reset.
> > > > > >
> > > > > > Nak. I consider it is a bug in __i915_wait_request(). I am discussing
> > > > > > with Thomas Elf how to fix this wrt the next generation of individual
> > > > > > ring resets.
> > > > >
> > > > > We should only get an -EIO if the gpu is truly gone, but an -EAGAIN when
> > > > > the reset is ongoing. Neither is currently handled. For lockless users we
> > > > > probably want a version of wait_request which just dtrt (of waiting for
> > > > > the reset handler to complete without trying to grab the mutex and then
> > > > > returning). Or some other means of retrying.
> > > > >
> > > > > Returning -EIO from the low-level wait function still seems appropriate,
> > > > > but callers need to eat/handle it appropriately. WARN_ON isn't it here
> > > > > ofc.
> > > >
> > > > Bleh, a few years ago you decided not to take the EIO handling along the
> > > > call paths that don't care.
> > > >
> > > > I disagree. There are two classes of callers, those that care about
> > > > EIO/EAGAIN and those that simply want to know when the GPU is no longer
> > > > processing that request. That latter class is still popping up in
> > > > bugzilla with frozen displays. For the former, we actually only care
> > > > about backoff if we are holding the mutex - and that is only required
> > > > for EAGAIN. The only user that cares about EIO is throttle().
> > >
> > > Hm, right now the design is that for non-interruptible designs we indeed
> > > return -EIO or -EAGAIN, but the reset handler will fix up outstanding
> > > flips. So I guess removing the WARN_ON here is indeed the right thing to
> > > do. We should probably change this once we have atomic (where the wait
> > > doesn't need a lock really, at least for async commits which is what
> > > matters here) and loop until completion.
> > >
> > > I'm still wary of eating -EIO in general since it's so hard to test all
> > > this for correctness. Maybe we need a __check_wedge which can return -EIO
> > > and a check_wedge which eats it. And then decide once for where to put
> > > special checks, probably just execbuf and throttle.
> >
> > Even execbuf really doesn't care. If the GPU didn't complete the earlier
> > request (principally for semaphore sw sync), it makes no difference for
> > us now. The content is either corrupt, or we bail when we spot the
> > wedged GPU upon writing to the ring. Reporting EIO because of an earlier
> > failure is a poor substitute for the async reset notification. But here
> > we still need EAGAIN backoff ofc.
> >
> > I really think eating EIO is the right thing to do in most circumstances
> > and is correct with the semantics of the callers.
>
> Well we once had the transparent sw fallback at least in the ddx for -EIO.
> Mesa never coped for obvious reasons, and given that a modern desktop
> can't survive without GL there's not all that much point any more. But still
> I think if the gpu is terminally dead we need to tell this to userspace
> somehow.
The DDX checks throttle() for that purpose. Error returns from
execbuffer usually indicate that the kernel is broken and we promptly
ignore them. Having execbuf report EIO is superfluous since it is an async
error from before.
> What I'm unclear about is which ioctl that should be, and my assumption
> thus has been that it's execbuf.
Nope. It's throttle.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-17 13:05 ` Chris Wilson
@ 2015-06-17 14:16 ` Chris Wilson
2015-06-17 15:07 ` Daniel Vetter
0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-06-17 14:16 UTC (permalink / raw)
To: Daniel Vetter, ville.syrjala, intel-gfx
We have gone far off topic.
The question is how we want __i915_wait_request() to handle a wedged
GPU.
It currently reports EIO, and my argument is that is wrong wrt the
semantics of the wait completion and that no caller actually cares about
EIO from __i915_wait_request().
* Correction: one caller cares!
If we regard a wedged GPU as terminal (and in the short term a reset is
equally terminal to an outstanding request), then the GPU can no longer be
accessing that request and the wait can be safely completed. Imo it is
correct to return 0 in all circumstances. (Reset pending needs to return
-EAGAIN if we need to backoff, but for the lockless consumers we can
just ignore the reset notification.)
That is, set-domain, mmioflip and modesetting do not care if the request
succeeded, just that it completed.
Throttle() has an -EIO in its ABI for reporting a wedged GPU - this is
used by X to detect when the GPU is unusable prior to use, e.g. when
waking up, and also during its periodic flushes.
Overlay reports -EIO when turning on and hanging the GPU. To be fair, it
can equally report that failure the very next time it touches the ring.
Execbuf itself doesn't rely on wait request reporting EIO, just that we
report EIO prior to submitting work to a dead GPU/context. Execbuf uses
wait_request via two paths: syncing to an old request on another ring,
and flushing requests from the ringbuffer to make room for new
commands. This is the tricky part, the only instance where we rely on
aborting after waiting but before further operation (we won't even
notice a dead GPU prior to starting a request and running out of space
otherwise). Since it is the only instance, we can move the terminal
detection of a dead GPU from the wait request into
ring_wait_for_space(). This is in keeping with the ethos that we do not
report -EIO until we attempt to access the GPU.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 16+ messages in thread
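The proposal above (the wait completes with 0 even on a wedged GPU, and the terminal -EIO check moves into ring_wait_for_space()) can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the driver's code: struct gpu, struct request and both function bodies are hypothetical, and -ETIME merely stands in for a wait that did not complete.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, minimal models of the GPU and an outstanding request. */
struct gpu { bool wedged; };
struct request { bool completed; };

/*
 * Wait as proposed: a wedged GPU can no longer be accessing the
 * request, so the wait is treated as complete and returns 0.
 */
int wait_request(const struct gpu *gpu, const struct request *rq)
{
	if (rq->completed || gpu->wedged)
		return 0;
	return -ETIME;	/* stand-in for "still busy" in this model */
}

/*
 * Making room in the ring: the terminal dead-GPU detection lives here,
 * so -EIO is only reported when we are about to touch the GPU again.
 */
int ring_wait_for_space(const struct gpu *gpu, const struct request *oldest)
{
	int ret = wait_request(gpu, oldest);

	if (ret)
		return ret;
	if (gpu->wedged)
		return -EIO;
	return 0;
}
```

In this model set-domain, mmio flip and modesetting style callers would use wait_request() directly and never see -EIO, while the execbuf path that needs ring space still aborts on a dead GPU.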
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-17 14:16 ` Chris Wilson
@ 2015-06-17 15:07 ` Daniel Vetter
2015-06-17 15:46 ` Chris Wilson
0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-06-17 15:07 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, ville.syrjala, intel-gfx
On Wed, Jun 17, 2015 at 03:16:10PM +0100, Chris Wilson wrote:
> We have gone far off topic.
>
> The question is how we want __i915_wait_request() to handle a wedged
> GPU.
>
> It currently reports EIO, and my argument is that is wrong wrt to the
> semantics of the wait completion and that no caller actually cares about
> EIO from __i915_wait_request().
>
> * Correction: one caller cares!
>
> If we consider a wedged GPU (and in the short term a reset is equally
> terminal to an outstanding request) then the GPU can no longer be
> accessing that request and the wait can be safely completed. Imo it is
> correct to return 0 in all circumstances. (Reset pending needs to return
> -EAGAIN if we need to back off, but for the lockless consumers we can
> just ignore the reset notification.)
>
> That is, set-domain, mmioflip and modesetting do not care if the request
> succeeded, just that it completed.
>
> Throttle() has an -EIO in its ABI for reporting a wedged GPU - this is
> used by X to detect when the GPU is unusable prior to use, e.g. when
> waking up, and also during its periodic flushes.
>
> Overlay reports -EIO when turning on and hanging the GPU. To be fair, it
> can equally report that failure the very next time it touches the ring.
>
> Execbuf itself doesn't rely on wait request reporting EIO, just that we
> report EIO prior to submitting work to a dead GPU/context. Execbuf uses
> wait_request via two paths: syncing to an old request on another ring,
> and flushing requests from the ringbuffer to make room for new
> commands. This is the tricky part, the only instance where we rely on
> aborting after waiting but before further operation (we won't even
> notice a dead GPU prior to starting a request and running out of space
> otherwise). Since it is the only instance, we can move the terminal
> detection of a dead GPU from the wait request into
> ring_wait_for_space(). This is in keeping with the ethos that we do not
> report -EIO until we attempt to access the GPU.
Ok, following up with my side of the irc discussion we've had. I agree
that there's only 2 places where we must report an EIO if the gpu is
terminally wedged:
- throttle
- execbuf

How that's done doesn't matter, and when it's racy wrt concurrent gpu
deaths that also doesn't matter, i.e. we don't need wait_request to EIO
immediately as long as we check terminally_wedged somewhere in these
ioctls.

My main concern is that if we remove the EIO from wait_request we'll
accidentally also remove the EIO from execbuf. And we've had kernels where
the only EIO left was the wait_request from ring_begin ...

But if we add a small igt to manually wedge the gpu through debugfs and
then check that throttle/execbuf do EIO that risk is averted and I'd be ok
with eating EIO from wait_request with extreme prejudice. Since indeed we
still have trouble with EIO at least temporarily totally wrecking modeset
ioctls and other things that really always should work.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
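
[Editorial sketch, not part of the original mail.] The check described above
(wedge the GPU, then confirm that throttle and execbuf both report EIO) can
be outlined against a toy model. This is not an igt test and does not touch
debugfs or any real ioctl: model_set_wedged(), model_throttle() and
model_execbuf() are hypothetical stand-ins, assumed here only to show the
shape of the assertion sequence.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the driver state the two ioctls consult. */
struct model_gpu {
	bool terminally_wedged;	/* set when the GPU is declared dead */
	unsigned int submitted;	/* batches accepted so far */
};

/* Stand-in for manually wedging the GPU (debugfs in the real driver). */
static void model_set_wedged(struct model_gpu *gpu)
{
	gpu->terminally_wedged = true;
}

/* Throttle must report -EIO on a terminally wedged GPU. */
static int model_throttle(const struct model_gpu *gpu)
{
	return gpu->terminally_wedged ? -EIO : 0;
}

/* Execbuf must refuse to submit work to a terminally wedged GPU. */
static int model_execbuf(struct model_gpu *gpu)
{
	if (gpu->terminally_wedged)
		return -EIO;

	gpu->submitted++;
	return 0;
}

/*
 * The test sequence: both entry points succeed on a healthy GPU, then
 * both must return -EIO once it is wedged. Returns 0 when the
 * expectations hold, -1 otherwise.
 */
static int model_check_wedged_reporting(struct model_gpu *gpu)
{
	if (model_throttle(gpu) != 0 || model_execbuf(gpu) != 0)
		return -1;

	model_set_wedged(gpu);

	if (model_throttle(gpu) != -EIO)
		return -1;
	if (model_execbuf(gpu) != -EIO)
		return -1;

	return 0;
}
```

A real test would replace the three model functions with the debugfs write
and the actual ioctls; the ordering of checks is what the sketch is meant to
convey.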
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-17 15:07 ` Daniel Vetter
@ 2015-06-17 15:46 ` Chris Wilson
0 siblings, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2015-06-17 15:46 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx
On Wed, Jun 17, 2015 at 05:07:31PM +0200, Daniel Vetter wrote:
> On Wed, Jun 17, 2015 at 03:16:10PM +0100, Chris Wilson wrote:
> > We have gone far off topic.
> >
> > The question is how we want __i915_wait_request() to handle a wedged
> > GPU.
> >
> > It currently reports EIO, and my argument is that is wrong wrt to the
> > semantics of the wait completion and that no caller actually cares about
> > EIO from __i915_wait_request().
> >
> > * Correction: one caller cares!
> >
> > If we consider a wedged GPU (and in the short term a reset is equally
> > terminal to an outstanding request) then the GPU can no longer be
> > accessing that request and the wait can be safely completed. Imo it is
> > correct to return 0 in all circumstances. (Reset pending needs to return
> > -EAGAIN if we need to back off, but for the lockless consumers we can
> > just ignore the reset notification.)
> >
> > That is, set-domain, mmioflip and modesetting do not care if the request
> > succeeded, just that it completed.
> >
> > Throttle() has an -EIO in its ABI for reporting a wedged GPU - this is
> > used by X to detect when the GPU is unusable prior to use, e.g. when
> > waking up, and also during its periodic flushes.
> >
> > Overlay reports -EIO when turning on and hanging the GPU. To be fair, it
> > can equally report that failure the very next time it touches the ring.
> >
> > Execbuf itself doesn't rely on wait request reporting EIO, just that we
> > report EIO prior to submitting work to a dead GPU/context. Execbuf uses
> > wait_request via two paths: syncing to an old request on another ring,
> > and flushing requests from the ringbuffer to make room for new
> > commands. This is the tricky part, the only instance where we rely on
> > aborting after waiting but before further operation (we won't even
> > notice a dead GPU prior to starting a request and running out of space
> > otherwise). Since it is the only instance, we can move the terminal
> > detection of a dead GPU from the wait request into
> > ring_wait_for_space(). This is in keeping with the ethos that we do not
> > report -EIO until we attempt to access the GPU.
>
> Ok, following up with my side of the irc discussion we've had. I agree
> that there's only 2 places where we must report an EIO if the gpu is
> terminally wedged:
> - throttle
> - execbuf
>
> How that's done doesn't matter, and when it's racy wrt concurrent gpu
> deaths that also doesn't matter, i.e. we don't need wait_request to EIO
> immediately as long as we check terminally_wedged somewhere in these
> ioctls.
>
> My main concern is that if we remove the EIO from wait_request we'll
> accidentally also remove the EIO from execbuf. And we've had kernels where
> the only EIO left was the wait_request from ring_begin ...
My plan has been to do an is-wedged check on allocating a request
(that's guaranteed to be the first action before writing into a ring),
and then double-check that no events have taken place before submitting
that request (i.e. on finishing the block). To supplement that, we do
the explicit is-wedged check after waiting for ring space.
> But if we add a small igt to manually wedge the gpu through debugfs and
> then check that throttle/execbuf do EIO that risk is averted and I'd be ok
> with eating EIO from wait_request with extreme prejudice. Since indeed we
> still have trouble with EIO at least temporarily totally wrecking modeset
> ioctls and other things that really always should work.
Well, I've successfully wedged the GPU, broke the driver but verified
that throttle reports -EIO. Just needs a little TLC.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
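
[Editorial sketch, not part of the original mail.] The plan above (check for
a wedged GPU when a request is allocated, then confirm nothing happened in
between before submitting it) might look roughly like the following
userspace model. All names are hypothetical, and two details are assumptions
rather than anything stated in the thread: that "no events have taken place"
is detected by comparing a reset counter sampled at allocation time, and
that an intervening reset is reported as -EAGAIN.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; not the real i915 structures. */
struct model_gpu {
	bool terminally_wedged;		/* GPU is dead for good */
	unsigned int reset_counter;	/* bumped on every GPU reset */
};

struct model_request {
	unsigned int reset_counter;	/* counter sampled at allocation */
	bool submitted;
};

/*
 * Allocating a request is the first step before writing into a ring, so
 * this is where a dead GPU is first reported.
 */
static int model_request_alloc(const struct model_gpu *gpu,
			       struct model_request *rq)
{
	if (gpu->terminally_wedged)
		return -EIO;

	rq->reset_counter = gpu->reset_counter;
	rq->submitted = false;
	return 0;
}

/*
 * Double-check before submission: the GPU may have died, or been reset,
 * while the request was being built.
 */
static int model_request_submit(const struct model_gpu *gpu,
				struct model_request *rq)
{
	if (gpu->terminally_wedged)
		return -EIO;

	/* Assumption: a reset since allocation asks the caller to retry. */
	if (rq->reset_counter != gpu->reset_counter)
		return -EAGAIN;

	rq->submitted = true;
	return 0;
}
```

The third check mentioned in the mail, after waiting for ring space, would
sit between these two calls in whatever helper reserves space for commands.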
* Re: [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips
2015-06-11 17:05 ` [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips ville.syrjala
2015-06-15 4:01 ` shuang.he
@ 2015-06-29 2:53 ` shuang.he
1 sibling, 0 replies; 16+ messages in thread
From: shuang.he @ 2015-06-29 2:53 UTC (permalink / raw)
To: shuang.he, lei.a.liu, intel-gfx, ville.syrjala
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6632
-------------------------------------Summary-------------------------------------
Platform  Delta  drm-intel-nightly  Series Applied
ILK       -1     302/302            301/302
SNB              312/316            312/316
IVB              343/343            343/343
BYT       -1     287/287            286/287
HSW              380/380            380/380
-------------------------------------Detailed-------------------------------------
Platform  Test                                                      drm-intel-nightly  Series Applied
*ILK      igt@gem_persistent_relocs@forked-interruptible-thrashing  PASS(1)            DMESG_WARN(1)
(dmesg patch applied)drm:i915_hangcheck_elapsed[i915]]*ERROR*Hangcheck_timer_elapsed...bsd_ring_idle@Hangcheck timer elapsed... bsd ring idle
*BYT      igt@gem_partial_pwrite_pread@reads                        PASS(1)            FAIL(1)
Note: Pay particular attention to lines starting with '*'.
* Re: [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip
2015-06-11 16:14 [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip ville.syrjala
` (2 preceding siblings ...)
2015-06-15 1:40 ` shuang.he
@ 2015-06-29 9:11 ` shuang.he
3 siblings, 0 replies; 16+ messages in thread
From: shuang.he @ 2015-06-29 9:11 UTC (permalink / raw)
To: shuang.he, lei.a.liu, intel-gfx, ville.syrjala
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6643
-------------------------------------Summary-------------------------------------
Platform  Delta  drm-intel-nightly  Series Applied
ILK              302/302            302/302
SNB              312/316            312/316
IVB              343/343            343/343
BYT       -2     287/287            285/287
HSW              380/380            380/380
-------------------------------------Detailed-------------------------------------
Platform  Test                                      drm-intel-nightly  Series Applied
*BYT      igt@gem_partial_pwrite_pread@reads        PASS(1)            FAIL(1)
*BYT      igt@gem_tiled_partial_pwrite_pread@reads  PASS(1)            FAIL(1)
Note: Pay particular attention to lines starting with '*'.
end of thread, other threads:[~2015-06-29 9:11 UTC | newest]
Thread overview: 16+ messages
2015-06-11 16:14 [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip ville.syrjala
2015-06-11 17:05 ` [PATCH v2] drm/i915: Ignore -EIO from __i915_wait_request() during flips ville.syrjala
2015-06-15 4:01 ` shuang.he
2015-06-29 2:53 ` shuang.he
2015-06-11 20:01 ` [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip Chris Wilson
2015-06-15 16:34 ` Daniel Vetter
2015-06-16 12:10 ` Chris Wilson
2015-06-16 16:21 ` Daniel Vetter
2015-06-16 16:30 ` Chris Wilson
2015-06-17 11:53 ` Daniel Vetter
2015-06-17 13:05 ` Chris Wilson
2015-06-17 14:16 ` Chris Wilson
2015-06-17 15:07 ` Daniel Vetter
2015-06-17 15:46 ` Chris Wilson
2015-06-15 1:40 ` shuang.he
2015-06-29 9:11 ` shuang.he