* [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
@ 2017-01-03 11:05 Chris Wilson
2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Chris Wilson @ 2017-01-03 11:05 UTC (permalink / raw)
To: dri-devel; +Cc: intel-gfx
As the fence->status is an optional field that may be set before
dma_fence_signal() is called to convey that the fence completed with an
error, we have to ensure that it is always set to zero on initialisation
so that the typical use (i.e. unset) always flags a successful completion.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
drivers/dma-buf/dma-fence.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 3444f293ad4a..9130f790ebf3 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
fence->context = context;
fence->seqno = seqno;
fence->flags = 0UL;
+ fence->status = 0;
trace_dma_fence_init(fence);
}
--
2.11.0
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson @ 2017-01-03 11:05 ` Chris Wilson 2017-01-03 11:34 ` Tvrtko Ursulin 2017-01-03 11:53 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() Patchwork 2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin 2 siblings, 1 reply; 19+ messages in thread From: Chris Wilson @ 2017-01-03 11:05 UTC (permalink / raw) To: dri-devel; +Cc: Mika Kuoppala, intel-gfx, Tvrtko Ursulin The struct dma_fence carries a status field exposed to userspace by sync_file. This is inspected after the fence is signaled and can convey whether or not the request completed successfully, or in our case if we detected a hang during the request (signaled via -EIO in SYNC_IOC_FILE_INFO). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> --- drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 204c4a673bf3..bc99c0e292d8 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) ring_hung = false; } - if (ring_hung) + if (ring_hung) { i915_gem_context_mark_guilty(request->ctx); - else + request->fence.status = -EIO; + } else { i915_gem_context_mark_innocent(request->ctx); + } if (!ring_hung) return; -- 2.11.0 _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson @ 2017-01-03 11:34 ` Tvrtko Ursulin 2017-01-03 11:46 ` [Intel-gfx] " Chris Wilson 0 siblings, 1 reply; 19+ messages in thread From: Tvrtko Ursulin @ 2017-01-03 11:34 UTC (permalink / raw) To: Chris Wilson, dri-devel; +Cc: intel-gfx On 03/01/2017 11:05, Chris Wilson wrote: > The struct dma_fence carries a status field exposed to userspace by > sync_file. This is inspected after the fence is signaled and can convey > whether or not the request completed successfully, or in our case if we > detected a hang during the request (signaled via -EIO in > SYNC_IOC_FILE_INFO). > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> > --- > drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 204c4a673bf3..bc99c0e292d8 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) > ring_hung = false; > } > > - if (ring_hung) > + if (ring_hung) { > i915_gem_context_mark_guilty(request->ctx); > - else > + request->fence.status = -EIO; > + } else { > i915_gem_context_mark_innocent(request->ctx); > + } > > if (!ring_hung) > return; > Reading what happens later in this function, should we set the status of all the other requests we are about to clear? However one thing I don't understand is how this scheme interacts with the current userspace. We will clear/no-nop some of the submitted requests since the state is corrupt. But how will userspace notice this before it submits more requets? Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 11:34 ` Tvrtko Ursulin @ 2017-01-03 11:46 ` Chris Wilson 2017-01-03 11:57 ` Tvrtko Ursulin 0 siblings, 1 reply; 19+ messages in thread From: Chris Wilson @ 2017-01-03 11:46 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: > > On 03/01/2017 11:05, Chris Wilson wrote: > >The struct dma_fence carries a status field exposed to userspace by > >sync_file. This is inspected after the fence is signaled and can convey > >whether or not the request completed successfully, or in our case if we > >detected a hang during the request (signaled via -EIO in > >SYNC_IOC_FILE_INFO). > > > >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > >Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> > >--- > > drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >index 204c4a673bf3..bc99c0e292d8 100644 > >--- a/drivers/gpu/drm/i915/i915_gem.c > >+++ b/drivers/gpu/drm/i915/i915_gem.c > >@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) > > ring_hung = false; > > } > > > >- if (ring_hung) > >+ if (ring_hung) { > > i915_gem_context_mark_guilty(request->ctx); > >- else > >+ request->fence.status = -EIO; > >+ } else { > > i915_gem_context_mark_innocent(request->ctx); > >+ } > > > > if (!ring_hung) > > return; > > > > Reading what happens later in this function, should we set the > status of all the other requests we are about to clear? > > However one thing I don't understand is how this scheme interacts > with the current userspace. We will clear/no-nop some of the > submitted requests since the state is corrupt. But how will > userspace notice this before it submits more requets? There is no mechanism currently for user space to be able to detect a hung request. (It can use the uevent for async notification of the hang/reset, but that will not tell you who caused the hang.) Userspace can track the number of hangs it caused, but the delay makes any roundtripping impractical (i.e. you have to synchronise before all rendering if you must detect the event immediately). Note also that we do not want to give out interprocess information (i.e. to allow one client to spy on another), which makes things harder to get right. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 11:46 ` [Intel-gfx] " Chris Wilson @ 2017-01-03 11:57 ` Tvrtko Ursulin 2017-01-03 12:13 ` Chris Wilson 0 siblings, 1 reply; 19+ messages in thread From: Tvrtko Ursulin @ 2017-01-03 11:57 UTC (permalink / raw) To: Chris Wilson, dri-devel, intel-gfx On 03/01/2017 11:46, Chris Wilson wrote: > On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: >> >> On 03/01/2017 11:05, Chris Wilson wrote: >>> The struct dma_fence carries a status field exposed to userspace by >>> sync_file. This is inspected after the fence is signaled and can convey >>> whether or not the request completed successfully, or in our case if we >>> detected a hang during the request (signaled via -EIO in >>> SYNC_IOC_FILE_INFO). >>> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>> --- >>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- >>> 1 file changed, 4 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c >>> index 204c4a673bf3..bc99c0e292d8 100644 >>> --- a/drivers/gpu/drm/i915/i915_gem.c >>> +++ b/drivers/gpu/drm/i915/i915_gem.c >>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) >>> ring_hung = false; >>> } >>> >>> - if (ring_hung) >>> + if (ring_hung) { >>> i915_gem_context_mark_guilty(request->ctx); >>> - else >>> + request->fence.status = -EIO; >>> + } else { >>> i915_gem_context_mark_innocent(request->ctx); >>> + } >>> >>> if (!ring_hung) >>> return; >>> >> >> Reading what happens later in this function, should we set the >> status of all the other requests we are about to clear? >> >> However one thing I don't understand is how this scheme interacts >> with the current userspace. We will clear/no-nop some of the >> submitted requests since the state is corrupt. But how will >> userspace notice this before it submits more requets? > > There is no mechanism currently for user space to be able to detect a > hung request. (It can use the uevent for async notification of the > hang/reset, but that will not tell you who caused the hang.) Userspace > can track the number of hangs it caused, but the delay makes any > roundtripping impractical (i.e. you have to synchronise before all > rendering if you must detect the event immediately). Note also that we > do not want to give out interprocess information (i.e. to allow one > client to spy on another), which makes things harder to get right. So idea is to clear already submitted requests _if_ the userspace is synchronising before all rendering and looking at reset stats, to make it theoretically possible to detect the corrupt state? Still with the fences do you agree error status needs to be set on those as well? Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 11:57 ` Tvrtko Ursulin @ 2017-01-03 12:13 ` Chris Wilson 2017-01-03 12:34 ` [Intel-gfx] " Tvrtko Ursulin 0 siblings, 1 reply; 19+ messages in thread From: Chris Wilson @ 2017-01-03 12:13 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote: > > On 03/01/2017 11:46, Chris Wilson wrote: > >On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: > >> > >>On 03/01/2017 11:05, Chris Wilson wrote: > >>>The struct dma_fence carries a status field exposed to userspace by > >>>sync_file. This is inspected after the fence is signaled and can convey > >>>whether or not the request completed successfully, or in our case if we > >>>detected a hang during the request (signaled via -EIO in > >>>SYNC_IOC_FILE_INFO). > >>> > >>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > >>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >>>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> > >>>--- > >>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- > >>>1 file changed, 4 insertions(+), 2 deletions(-) > >>> > >>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >>>index 204c4a673bf3..bc99c0e292d8 100644 > >>>--- a/drivers/gpu/drm/i915/i915_gem.c > >>>+++ b/drivers/gpu/drm/i915/i915_gem.c > >>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) > >>> ring_hung = false; > >>> } > >>> > >>>- if (ring_hung) > >>>+ if (ring_hung) { > >>> i915_gem_context_mark_guilty(request->ctx); > >>>- else > >>>+ request->fence.status = -EIO; > >>>+ } else { > >>> i915_gem_context_mark_innocent(request->ctx); > >>>+ } > >>> > >>> if (!ring_hung) > >>> return; > >>> > >> > >>Reading what happens later in this function, should we set the > >>status of all the other requests we are about to clear? > >> > >>However one thing I don't understand is how this scheme interacts > >>with the current userspace. We will clear/no-nop some of the > >>submitted requests since the state is corrupt. But how will > >>userspace notice this before it submits more requets? > > > >There is no mechanism currently for user space to be able to detect a > >hung request. (It can use the uevent for async notification of the > >hang/reset, but that will not tell you who caused the hang.) Userspace > >can track the number of hangs it caused, but the delay makes any > >roundtripping impractical (i.e. you have to synchronise before all > >rendering if you must detect the event immediately). Note also that we > >do not want to give out interprocess information (i.e. to allow one > >client to spy on another), which makes things harder to get right. > > So idea is to clear already submitted requests _if_ the userspace is > synchronising before all rendering and looking at reset stats, to > make it theoretically possible to detect the corrupt state? No, I'm just don't see a way that userspace can detect the hang without testing after seeing the request signaled (either by waiting on the batch or by waiting on the fence), i.e. by being completely synchronous (or at least chosing its synchronous points very carefully, such as around IPC). It can either poll reset-count or use sync_file (which requires fence exporting). The current robustness interfaces is a basic query on whether any reset occurred within the context, not when. > Still with the fences do you agree error status needs to be set on > those as well? Yes. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 12:13 ` Chris Wilson @ 2017-01-03 12:34 ` Tvrtko Ursulin 2017-01-03 12:38 ` Chris Wilson 0 siblings, 1 reply; 19+ messages in thread From: Tvrtko Ursulin @ 2017-01-03 12:34 UTC (permalink / raw) To: Chris Wilson, dri-devel, intel-gfx On 03/01/2017 12:13, Chris Wilson wrote: > On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote: >> >> On 03/01/2017 11:46, Chris Wilson wrote: >>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: >>>> >>>> On 03/01/2017 11:05, Chris Wilson wrote: >>>>> The struct dma_fence carries a status field exposed to userspace by >>>>> sync_file. This is inspected after the fence is signaled and can convey >>>>> whether or not the request completed successfully, or in our case if we >>>>> detected a hang during the request (signaled via -EIO in >>>>> SYNC_IOC_FILE_INFO). >>>>> >>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> >>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>>>> --- >>>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- >>>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c >>>>> index 204c4a673bf3..bc99c0e292d8 100644 >>>>> --- a/drivers/gpu/drm/i915/i915_gem.c >>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c >>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) >>>>> ring_hung = false; >>>>> } >>>>> >>>>> - if (ring_hung) >>>>> + if (ring_hung) { >>>>> i915_gem_context_mark_guilty(request->ctx); >>>>> - else >>>>> + request->fence.status = -EIO; >>>>> + } else { >>>>> i915_gem_context_mark_innocent(request->ctx); >>>>> + } >>>>> >>>>> if (!ring_hung) >>>>> return; >>>>> >>>> >>>> Reading what happens later in this function, should we set the >>>> status of all the other requests we are about to clear? >>>> >>>> However one thing I don't understand is how this scheme interacts >>>> with the current userspace. We will clear/no-nop some of the >>>> submitted requests since the state is corrupt. But how will >>>> userspace notice this before it submits more requets? >>> >>> There is no mechanism currently for user space to be able to detect a >>> hung request. (It can use the uevent for async notification of the >>> hang/reset, but that will not tell you who caused the hang.) Userspace >>> can track the number of hangs it caused, but the delay makes any >>> roundtripping impractical (i.e. you have to synchronise before all >>> rendering if you must detect the event immediately). Note also that we >>> do not want to give out interprocess information (i.e. to allow one >>> client to spy on another), which makes things harder to get right. >> >> So idea is to clear already submitted requests _if_ the userspace is >> synchronising before all rendering and looking at reset stats, to >> make it theoretically possible to detect the corrupt state? > > No, I'm just don't see a way that userspace can detect the hang without > testing after seeing the request signaled (either by waiting on the > batch or by waiting on the fence), i.e. by being completely synchronous > (or at least chosing its synchronous points very carefully, such as > around IPC). It can either poll reset-count or use sync_file (which > requires fence exporting). > > The current robustness interfaces is a basic query on whether any reset > occurred within the context, not when. Why do we bother with clearing the submitted requests then? Regards, Tvrtko _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 12:34 ` [Intel-gfx] " Tvrtko Ursulin @ 2017-01-03 12:38 ` Chris Wilson 2017-01-03 13:17 ` Tvrtko Ursulin 0 siblings, 1 reply; 19+ messages in thread From: Chris Wilson @ 2017-01-03 12:38 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote: > > On 03/01/2017 12:13, Chris Wilson wrote: > >On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote: > >> > >>On 03/01/2017 11:46, Chris Wilson wrote: > >>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: > >>>> > >>>>On 03/01/2017 11:05, Chris Wilson wrote: > >>>>>The struct dma_fence carries a status field exposed to userspace by > >>>>>sync_file. This is inspected after the fence is signaled and can convey > >>>>>whether or not the request completed successfully, or in our case if we > >>>>>detected a hang during the request (signaled via -EIO in > >>>>>SYNC_IOC_FILE_INFO). > >>>>> > >>>>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > >>>>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >>>>>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> > >>>>>--- > >>>>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- > >>>>>1 file changed, 4 insertions(+), 2 deletions(-) > >>>>> > >>>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >>>>>index 204c4a673bf3..bc99c0e292d8 100644 > >>>>>--- a/drivers/gpu/drm/i915/i915_gem.c > >>>>>+++ b/drivers/gpu/drm/i915/i915_gem.c > >>>>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) > >>>>> ring_hung = false; > >>>>> } > >>>>> > >>>>>- if (ring_hung) > >>>>>+ if (ring_hung) { > >>>>> i915_gem_context_mark_guilty(request->ctx); > >>>>>- else > >>>>>+ request->fence.status = -EIO; > >>>>>+ } else { > >>>>> i915_gem_context_mark_innocent(request->ctx); > >>>>>+ } > >>>>> > >>>>> if (!ring_hung) > >>>>> return; > >>>>> > >>>> > >>>>Reading what happens later in this function, should we set the > >>>>status of all the other requests we are about to clear? > >>>> > >>>>However one thing I don't understand is how this scheme interacts > >>>>with the current userspace. We will clear/no-nop some of the > >>>>submitted requests since the state is corrupt. But how will > >>>>userspace notice this before it submits more requets? > >>> > >>>There is no mechanism currently for user space to be able to detect a > >>>hung request. (It can use the uevent for async notification of the > >>>hang/reset, but that will not tell you who caused the hang.) Userspace > >>>can track the number of hangs it caused, but the delay makes any > >>>roundtripping impractical (i.e. you have to synchronise before all > >>>rendering if you must detect the event immediately). Note also that we > >>>do not want to give out interprocess information (i.e. to allow one > >>>client to spy on another), which makes things harder to get right. > >> > >>So idea is to clear already submitted requests _if_ the userspace is > >>synchronising before all rendering and looking at reset stats, to > >>make it theoretically possible to detect the corrupt state? > > > >No, I'm just don't see a way that userspace can detect the hang without > >testing after seeing the request signaled (either by waiting on the > >batch or by waiting on the fence), i.e. by being completely synchronous > >(or at least chosing its synchronous points very carefully, such as > >around IPC). It can either poll reset-count or use sync_file (which > >requires fence exporting). > > > >The current robustness interfaces is a basic query on whether any reset > >occurred within the context, not when. > > Why do we bother with clearing the submitted requests then? The same reason we ban processes from submitting new requests if they cause repeated hangs. If before we ban that client, it has already submitted 1000 hanging requests, it has successfully locked the machine up for a couple of hours. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 12:38 ` Chris Wilson @ 2017-01-03 13:17 ` Tvrtko Ursulin 2017-01-03 13:25 ` Chris Wilson 0 siblings, 1 reply; 19+ messages in thread From: Tvrtko Ursulin @ 2017-01-03 13:17 UTC (permalink / raw) To: Chris Wilson, dri-devel, intel-gfx On 03/01/2017 12:38, Chris Wilson wrote: > On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote: >> >> On 03/01/2017 12:13, Chris Wilson wrote: >>> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote: >>>> >>>> On 03/01/2017 11:46, Chris Wilson wrote: >>>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: >>>>>> >>>>>> On 03/01/2017 11:05, Chris Wilson wrote: >>>>>>> The struct dma_fence carries a status field exposed to userspace by >>>>>>> sync_file. This is inspected after the fence is signaled and can convey >>>>>>> whether or not the request completed successfully, or in our case if we >>>>>>> detected a hang during the request (signaled via -EIO in >>>>>>> SYNC_IOC_FILE_INFO). >>>>>>> >>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> >>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>>>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>>>>>> --- >>>>>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- >>>>>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c >>>>>>> index 204c4a673bf3..bc99c0e292d8 100644 >>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c >>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c >>>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) >>>>>>> ring_hung = false; >>>>>>> } >>>>>>> >>>>>>> - if (ring_hung) >>>>>>> + if (ring_hung) { >>>>>>> i915_gem_context_mark_guilty(request->ctx); >>>>>>> - else >>>>>>> + request->fence.status = -EIO; >>>>>>> + } else { >>>>>>> i915_gem_context_mark_innocent(request->ctx); >>>>>>> + } >>>>>>> >>>>>>> if (!ring_hung) >>>>>>> return; >>>>>>> >>>>>> >>>>>> Reading what happens later in this function, should we set the >>>>>> status of all the other requests we are about to clear? >>>>>> >>>>>> However one thing I don't understand is how this scheme interacts >>>>>> with the current userspace. We will clear/no-nop some of the >>>>>> submitted requests since the state is corrupt. But how will >>>>>> userspace notice this before it submits more requets? >>>>> >>>>> There is no mechanism currently for user space to be able to detect a >>>>> hung request. (It can use the uevent for async notification of the >>>>> hang/reset, but that will not tell you who caused the hang.) Userspace >>>>> can track the number of hangs it caused, but the delay makes any >>>>> roundtripping impractical (i.e. you have to synchronise before all >>>>> rendering if you must detect the event immediately). Note also that we >>>>> do not want to give out interprocess information (i.e. to allow one >>>>> client to spy on another), which makes things harder to get right. >>>> >>>> So idea is to clear already submitted requests _if_ the userspace is >>>> synchronising before all rendering and looking at reset stats, to >>>> make it theoretically possible to detect the corrupt state? >>> >>> No, I'm just don't see a way that userspace can detect the hang without >>> testing after seeing the request signaled (either by waiting on the >>> batch or by waiting on the fence), i.e. by being completely synchronous >>> (or at least chosing its synchronous points very carefully, such as >>> around IPC). It can either poll reset-count or use sync_file (which >>> requires fence exporting). >>> >>> The current robustness interfaces is a basic query on whether any reset >>> occurred within the context, not when. >> >> Why do we bother with clearing the submitted requests then? > > The same reason we ban processes from submitting new requests if they > cause repeated hangs. If before we ban that client, it has already > submitted 1000 hanging requests, it has successfully locked the machine > up for a couple of hours. So we would need to gate clearing on the transition to banned state I think. Because currently it does in unconditionally. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang 2017-01-03 13:17 ` Tvrtko Ursulin @ 2017-01-03 13:25 ` Chris Wilson 0 siblings, 0 replies; 19+ messages in thread From: Chris Wilson @ 2017-01-03 13:25 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel On Tue, Jan 03, 2017 at 01:17:19PM +0000, Tvrtko Ursulin wrote: > > On 03/01/2017 12:38, Chris Wilson wrote: > >On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote: > >> > >>On 03/01/2017 12:13, Chris Wilson wrote: > >>>On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote: > >>>> > >>>>On 03/01/2017 11:46, Chris Wilson wrote: > >>>>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: > >>>>>> > >>>>>>On 03/01/2017 11:05, Chris Wilson wrote: > >>>>>>>The struct dma_fence carries a status field exposed to userspace by > >>>>>>>sync_file. This is inspected after the fence is signaled and can convey > >>>>>>>whether or not the request completed successfully, or in our case if we > >>>>>>>detected a hang during the request (signaled via -EIO in > >>>>>>>SYNC_IOC_FILE_INFO). > >>>>>>> > >>>>>>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > >>>>>>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >>>>>>>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> > >>>>>>>--- > >>>>>>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- > >>>>>>>1 file changed, 4 insertions(+), 2 deletions(-) > >>>>>>> > >>>>>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >>>>>>>index 204c4a673bf3..bc99c0e292d8 100644 > >>>>>>>--- a/drivers/gpu/drm/i915/i915_gem.c > >>>>>>>+++ b/drivers/gpu/drm/i915/i915_gem.c > >>>>>>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) > >>>>>>> ring_hung = false; > >>>>>>> } > >>>>>>> > >>>>>>>- if (ring_hung) > >>>>>>>+ if (ring_hung) { > >>>>>>> i915_gem_context_mark_guilty(request->ctx); > >>>>>>>- else > >>>>>>>+ request->fence.status = -EIO; > >>>>>>>+ } else { > >>>>>>> i915_gem_context_mark_innocent(request->ctx); > >>>>>>>+ } > >>>>>>> > >>>>>>> if (!ring_hung) > >>>>>>> return; > >>>>>>> > >>>>>> > >>>>>>Reading what happens later in this function, should we set the > >>>>>>status of all the other requests we are about to clear? > >>>>>> > >>>>>>However one thing I don't understand is how this scheme interacts > >>>>>>with the current userspace. We will clear/no-nop some of the > >>>>>>submitted requests since the state is corrupt. But how will > >>>>>>userspace notice this before it submits more requets? > >>>>> > >>>>>There is no mechanism currently for user space to be able to detect a > >>>>>hung request. (It can use the uevent for async notification of the > >>>>>hang/reset, but that will not tell you who caused the hang.) Userspace > >>>>>can track the number of hangs it caused, but the delay makes any > >>>>>roundtripping impractical (i.e. you have to synchronise before all > >>>>>rendering if you must detect the event immediately). Note also that we > >>>>>do not want to give out interprocess information (i.e. to allow one > >>>>>client to spy on another), which makes things harder to get right. > >>>> > >>>>So idea is to clear already submitted requests _if_ the userspace is > >>>>synchronising before all rendering and looking at reset stats, to > >>>>make it theoretically possible to detect the corrupt state? > >>> > >>>No, I'm just don't see a way that userspace can detect the hang without > >>>testing after seeing the request signaled (either by waiting on the > >>>batch or by waiting on the fence), i.e. by being completely synchronous > >>>(or at least chosing its synchronous points very carefully, such as > >>>around IPC). It can either poll reset-count or use sync_file (which > >>>requires fence exporting). > >>> > >>>The current robustness interfaces is a basic query on whether any reset > >>>occurred within the context, not when. > >> > >>Why do we bother with clearing the submitted requests then? > > > >The same reason we ban processes from submitting new requests if they > >cause repeated hangs. If before we ban that client, it has already > >submitted 1000 hanging requests, it has successfully locked the machine > >up for a couple of hours. > > So we would need to gate clearing on the transition to banned state > I think. Because currently it does in unconditionally. Yes, see the other patch :) -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson 2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson @ 2017-01-03 11:53 ` Patchwork 2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin 2 siblings, 0 replies; 19+ messages in thread From: Patchwork @ 2017-01-03 11:53 UTC (permalink / raw) To: Chris Wilson; +Cc: intel-gfx == Series Details == Series: series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() URL : https://patchwork.freedesktop.org/series/17403/ State : failure == Summary == Series 17403v1 Series without cover letter https://patchwork.freedesktop.org/api/1.0/series/17403/revisions/1/mbox/ Test gem_busy: Subgroup basic-hang-default: fail -> PASS (fi-hsw-4770r) Test kms_pipe_crc_basic: Subgroup suspend-read-crc-pipe-b: pass -> INCOMPLETE (fi-snb-2600) fi-bdw-5557u total:246 pass:232 dwarn:0 dfail:0 fail:0 skip:14 fi-bsw-n3050 total:246 pass:207 dwarn:0 dfail:0 fail:0 skip:39 fi-bxt-j4205 total:246 pass:224 dwarn:0 dfail:0 fail:0 skip:22 fi-bxt-t5700 total:82 pass:69 dwarn:0 dfail:0 fail:0 skip:12 fi-byt-j1900 total:246 pass:219 dwarn:0 dfail:0 fail:0 skip:27 fi-byt-n2820 total:246 pass:215 dwarn:0 dfail:0 fail:0 skip:31 fi-hsw-4770 total:246 pass:227 dwarn:0 dfail:0 fail:0 skip:19 fi-hsw-4770r total:246 pass:227 dwarn:0 dfail:0 fail:0 skip:19 fi-ivb-3520m total:246 pass:225 dwarn:0 dfail:0 fail:0 skip:21 fi-ivb-3770 total:246 pass:225 dwarn:0 dfail:0 fail:0 skip:21 fi-kbl-7500u total:246 pass:225 dwarn:0 dfail:0 fail:0 skip:21 fi-skl-6260u total:246 pass:233 dwarn:0 dfail:0 fail:0 skip:13 fi-skl-6700hq total:246 pass:226 dwarn:0 dfail:0 fail:0 skip:20 fi-skl-6700k total:246 pass:222 dwarn:3 dfail:0 fail:0 skip:21 fi-skl-6770hq total:246 pass:233 dwarn:0 dfail:0 fail:0 skip:13 fi-snb-2520m total:246 pass:215 dwarn:0 dfail:0 fail:0 skip:31 fi-snb-2600 total:209 pass:181 dwarn:0 dfail:0 fail:0 skip:27 c882640c3f6f9571df175774c5609a798e859fe2 drm-tip: 2017y-01m-03d-10h-16m-34s UTC integration manifest a0752fa drm/i915: Set guilty-flag on fence after detecting a hang 0269bc0 dma-fence: Clear fence->status during dma_fence_init() == Logs == For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3421/ _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson 2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson 2017-01-03 11:53 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() Patchwork @ 2017-01-03 14:04 ` Tvrtko Ursulin 2017-01-04 9:15 ` [Intel-gfx] " Daniel Vetter 2 siblings, 1 reply; 19+ messages in thread From: Tvrtko Ursulin @ 2017-01-03 14:04 UTC (permalink / raw) To: Chris Wilson, dri-devel; +Cc: intel-gfx On 03/01/2017 11:05, Chris Wilson wrote: > As the fence->status is an optional field that may be set before > dma_fence_signal() is called to convey that the fence completed with an > error, we have to ensure that it is always set to zero on initialisation > so that the typical use (i.e. unset) always flags a successful completion. > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > --- > drivers/dma-buf/dma-fence.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > index 3444f293ad4a..9130f790ebf3 100644 > --- a/drivers/dma-buf/dma-fence.c > +++ b/drivers/dma-buf/dma-fence.c > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > fence->context = context; > fence->seqno = seqno; > fence->flags = 0UL; > + fence->status = 0; > > trace_dma_fence_init(fence); > } > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Intel-gfx] [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin @ 2017-01-04 9:15 ` Daniel Vetter 2017-01-04 9:24 ` Chris Wilson 0 siblings, 1 reply; 19+ messages in thread From: Daniel Vetter @ 2017-01-04 9:15 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: gustavo.padovan, intel-gfx, dri-devel On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote: > > On 03/01/2017 11:05, Chris Wilson wrote: > > As the fence->status is an optional field that may be set before > > dma_fence_signal() is called to convey that the fence completed with an > > error, we have to ensure that it is always set to zero on initialisation > > so that the typical use (i.e. unset) always flags a successful completion. > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > --- > > drivers/dma-buf/dma-fence.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > index 3444f293ad4a..9130f790ebf3 100644 > > --- a/drivers/dma-buf/dma-fence.c > > +++ b/drivers/dma-buf/dma-fence.c > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > fence->context = context; > > fence->seqno = seqno; > > fence->flags = 0UL; > > + fence->status = 0; > > > > trace_dma_fence_init(fence); > > } > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Looking at existing users there's only the sync_file stuff. And that gets it wrong, because afaics no one ever sets fence->status to anything useful. But sync_file blindly assumes it's correct. Gustavo, Maarten? -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-04 9:15 ` [Intel-gfx] " Daniel Vetter @ 2017-01-04 9:24 ` Chris Wilson 2017-01-04 9:37 ` [Intel-gfx] " Daniel Vetter 0 siblings, 1 reply; 19+ messages in thread From: Chris Wilson @ 2017-01-04 9:24 UTC (permalink / raw) To: Daniel Vetter; +Cc: gustavo.padovan, intel-gfx, dri-devel On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote: > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote: > > > > On 03/01/2017 11:05, Chris Wilson wrote: > > > As the fence->status is an optional field that may be set before > > > dma_fence_signal() is called to convey that the fence completed with an > > > error, we have to ensure that it is always set to zero on initialisation > > > so that the typical use (i.e. unset) always flags a successful completion. > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > --- > > > drivers/dma-buf/dma-fence.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > index 3444f293ad4a..9130f790ebf3 100644 > > > --- a/drivers/dma-buf/dma-fence.c > > > +++ b/drivers/dma-buf/dma-fence.c > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > fence->context = context; > > > fence->seqno = seqno; > > > fence->flags = 0UL; > > > + fence->status = 0; > > > > > > trace_dma_fence_init(fence); > > > } > > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Looking at existing users there's only the sync_file stuff. And that gets > it wrong, because afaics no one ever sets fence->status to anything > useful. But sync_file blindly assumes it's correct. In terms of doc, sync_file is using it correctly, and dma-fence isn't living up to its doc. The documented behaviour (sync_file) seems useful. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Intel-gfx] [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-04 9:24 ` Chris Wilson @ 2017-01-04 9:37 ` Daniel Vetter 2017-01-04 9:43 ` Chris Wilson 0 siblings, 1 reply; 19+ messages in thread From: Daniel Vetter @ 2017-01-04 9:37 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, dri-devel, intel-gfx, Maarten Lankhorst, gustavo.padovan On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote: > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote: > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote: > > > > > > On 03/01/2017 11:05, Chris Wilson wrote: > > > > As the fence->status is an optional field that may be set before > > > > dma_fence_signal() is called to convey that the fence completed with an > > > > error, we have to ensure that it is always set to zero on initialisation > > > > so that the typical use (i.e. unset) always flags a successful completion. > > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > --- > > > > drivers/dma-buf/dma-fence.c | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > index 3444f293ad4a..9130f790ebf3 100644 > > > > --- a/drivers/dma-buf/dma-fence.c > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > > fence->context = context; > > > > fence->seqno = seqno; > > > > fence->flags = 0UL; > > > > + fence->status = 0; > > > > > > > > trace_dma_fence_init(fence); > > > > } > > > > > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > Looking at existing users there's only the sync_file stuff. And that gets > > it wrong, because afaics no one ever sets fence->status to anything > > useful. But sync_file blindly assumes it's correct. > > In terms of doc, sync_file is using it correctly, and dma-fence isn't > living up to its doc. The documented behaviour (sync_file) seems useful. Yeah, it looks like an oversight in merging sync_file and dma_fence together, but instead of piecemeal fixing (like this patch does), fixing it all&properly might be better. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-04 9:37 ` [Intel-gfx] " Daniel Vetter @ 2017-01-04 9:43 ` Chris Wilson 2017-01-04 10:18 ` Daniel Vetter 0 siblings, 1 reply; 19+ messages in thread From: Chris Wilson @ 2017-01-04 9:43 UTC (permalink / raw) To: Daniel Vetter; +Cc: gustavo.padovan, intel-gfx, dri-devel On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote: > On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote: > > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote: > > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote: > > > > > > > > On 03/01/2017 11:05, Chris Wilson wrote: > > > > > As the fence->status is an optional field that may be set before > > > > > dma_fence_signal() is called to convey that the fence completed with an > > > > > error, we have to ensure that it is always set to zero on initialisation > > > > > so that the typical use (i.e. unset) always flags a successful completion. > > > > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > > --- > > > > > drivers/dma-buf/dma-fence.c | 1 + > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > index 3444f293ad4a..9130f790ebf3 100644 > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > > > fence->context = context; > > > > > fence->seqno = seqno; > > > > > fence->flags = 0UL; > > > > > + fence->status = 0; > > > > > > > > > > trace_dma_fence_init(fence); > > > > > } > > > > > > > > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > > > Looking at existing users there's only the sync_file stuff. And that gets > > > it wrong, because afaics no one ever sets fence->status to anything > > > useful. But sync_file blindly assumes it's correct. > > > > In terms of doc, sync_file is using it correctly, and dma-fence isn't > > living up to its doc. The documented behaviour (sync_file) seems useful. > > Yeah, it looks like an oversight in merging sync_file and dma_fence > together, but instead of piecemeal fixing (like this patch does), fixing > it all&properly might be better. What's missing? * @status: Optional, only valid if < 0, must be set before calling * dma_fence_signal, indicates that the fence has completed with an * error. dma-fence must then guarantee that is zeroed during init, then the drivers can supply the optional information prior to calling dma_fence_signal() A dma_fence_set_status(fence, status) { BUG_ON(test_bit(SIGNALED, &fence->flags); BUG_ON(!IS_ERR_VALUE(status)); BUG_ON(fence->status); fence->status = status; } ? -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-04 9:43 ` Chris Wilson @ 2017-01-04 10:18 ` Daniel Vetter 2017-01-04 10:26 ` Chris Wilson 0 siblings, 1 reply; 19+ messages in thread From: Daniel Vetter @ 2017-01-04 10:18 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, dri-devel, intel-gfx, Maarten Lankhorst, gustavo.padovan On Wed, Jan 04, 2017 at 09:43:51AM +0000, Chris Wilson wrote: > On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote: > > On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote: > > > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote: > > > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote: > > > > > > > > > > On 03/01/2017 11:05, Chris Wilson wrote: > > > > > > As the fence->status is an optional field that may be set before > > > > > > dma_fence_signal() is called to convey that the fence completed with an > > > > > > error, we have to ensure that it is always set to zero on initialisation > > > > > > so that the typical use (i.e. unset) always flags a successful completion. > > > > > > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > > > --- > > > > > > drivers/dma-buf/dma-fence.c | 1 + > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > index 3444f293ad4a..9130f790ebf3 100644 > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > > > > fence->context = context; > > > > > > fence->seqno = seqno; > > > > > > fence->flags = 0UL; > > > > > > + fence->status = 0; > > > > > > > > > > > > trace_dma_fence_init(fence); > > > > > > } > > > > > > > > > > > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > > > > > Looking at existing users there's only the sync_file stuff. And that gets > > > > it wrong, because afaics no one ever sets fence->status to anything > > > > useful. But sync_file blindly assumes it's correct. > > > > > > In terms of doc, sync_file is using it correctly, and dma-fence isn't > > > living up to its doc. The documented behaviour (sync_file) seems useful. > > > > Yeah, it looks like an oversight in merging sync_file and dma_fence > > together, but instead of piecemeal fixing (like this patch does), fixing > > it all&properly might be better. > > What's missing? > > * @status: Optional, only valid if < 0, must be set before calling > * dma_fence_signal, indicates that the fence has completed with an > * error. > > dma-fence must then guarantee that is zeroed during init, then the > drivers can supply the optional information prior to calling > dma_fence_signal() > > A dma_fence_set_status(fence, status) { > BUG_ON(test_bit(SIGNALED, &fence->flags); > BUG_ON(!IS_ERR_VALUE(status)); > BUG_ON(fence->status); > fence->status = status; > } ? The trouble I'm seeing is that sync_file.c and sync_debug.c _only_ look at fence->status, and assume that statue > 0 means the fence is singalled. That code doesn't check fence->flags at all ... So maybe we need to rename status to error_status to make it clear it's for failure modes only, and fix the sync_file code to look at flags, too. Please maybe even some userspace tests to sweeten the deal? Perhaps in our hangstats tests. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-04 10:18 ` Daniel Vetter @ 2017-01-04 10:26 ` Chris Wilson 2017-01-04 11:31 ` [Intel-gfx] " Daniel Vetter 0 siblings, 1 reply; 19+ messages in thread From: Chris Wilson @ 2017-01-04 10:26 UTC (permalink / raw) To: Daniel Vetter; +Cc: gustavo.padovan, intel-gfx, dri-devel On Wed, Jan 04, 2017 at 11:18:58AM +0100, Daniel Vetter wrote: > On Wed, Jan 04, 2017 at 09:43:51AM +0000, Chris Wilson wrote: > > On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote: > > > On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote: > > > > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote: > > > > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote: > > > > > > > > > > > > On 03/01/2017 11:05, Chris Wilson wrote: > > > > > > > As the fence->status is an optional field that may be set before > > > > > > > dma_fence_signal() is called to convey that the fence completed with an > > > > > > > error, we have to ensure that it is always set to zero on initialisation > > > > > > > so that the typical use (i.e. unset) always flags a successful completion. > > > > > > > > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > > > > --- > > > > > > > drivers/dma-buf/dma-fence.c | 1 + > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > index 3444f293ad4a..9130f790ebf3 100644 > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > > > > > fence->context = context; > > > > > > > fence->seqno = seqno; > > > > > > > fence->flags = 0UL; > > > > > > > + fence->status = 0; > > > > > > > > > > > > > > trace_dma_fence_init(fence); > > > > > > > } > > > > > > > > > > > > > > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > > > > > > > Looking at existing users there's only the sync_file stuff. And that gets > > > > > it wrong, because afaics no one ever sets fence->status to anything > > > > > useful. But sync_file blindly assumes it's correct. > > > > > > > > In terms of doc, sync_file is using it correctly, and dma-fence isn't > > > > living up to its doc. The documented behaviour (sync_file) seems useful. > > > > > > Yeah, it looks like an oversight in merging sync_file and dma_fence > > > together, but instead of piecemeal fixing (like this patch does), fixing > > > it all&properly might be better. > > > > What's missing? > > > > * @status: Optional, only valid if < 0, must be set before calling > > * dma_fence_signal, indicates that the fence has completed with an > > * error. > > > > dma-fence must then guarantee that is zeroed during init, then the > > drivers can supply the optional information prior to calling > > dma_fence_signal() > > > > A dma_fence_set_status(fence, status) { > > BUG_ON(test_bit(SIGNALED, &fence->flags); > > BUG_ON(!IS_ERR_VALUE(status)); > > BUG_ON(fence->status); > > fence->status = status; > > } ? > > The trouble I'm seeing is that sync_file.c and sync_debug.c _only_ look at > fence->status, and assume that statue > 0 means the fence is singalled. > That code doesn't check fence->flags at all ... sync_file: sync_fill_fence_info() { if (dma_fence_is_signaled(fence)) info->status = fence->status >= 0 ? 1 : fence->status; else info->status = 0; } The no_fences: path is confusing since it sets info.status, but that's a different struct entirely. sync_debug: sync_print_fence() { status = 1; if (dma_fence_is_is_signaled_locked(fence)) status = fence->status; That's backwards... > So maybe we need to rename status to error_status to make it clear it's > for failure modes only, and fix the sync_file code to look at flags, too. It is already ^^. > Please maybe even some userspace tests to sweeten the deal? Perhaps in our > hangstats tests. I am already querying the error code in igt. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Intel-gfx] [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() 2017-01-04 10:26 ` Chris Wilson @ 2017-01-04 11:31 ` Daniel Vetter 0 siblings, 0 replies; 19+ messages in thread From: Daniel Vetter @ 2017-01-04 11:31 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, dri-devel, intel-gfx, Maarten Lankhorst, gustavo.padovan On Wed, Jan 04, 2017 at 10:26:49AM +0000, Chris Wilson wrote: > On Wed, Jan 04, 2017 at 11:18:58AM +0100, Daniel Vetter wrote: > > On Wed, Jan 04, 2017 at 09:43:51AM +0000, Chris Wilson wrote: > > > On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote: > > > > On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote: > > > > > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote: > > > > > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote: > > > > > > > > > > > > > > On 03/01/2017 11:05, Chris Wilson wrote: > > > > > > > > As the fence->status is an optional field that may be set before > > > > > > > > dma_fence_signal() is called to convey that the fence completed with an > > > > > > > > error, we have to ensure that it is always set to zero on initialisation > > > > > > > > so that the typical use (i.e. unset) always flags a successful completion. > > > > > > > > > > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > > > > > --- > > > > > > > > drivers/dma-buf/dma-fence.c | 1 + > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > > index 3444f293ad4a..9130f790ebf3 100644 > > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > > > > > > fence->context = context; > > > > > > > > fence->seqno = seqno; > > > > > > > > fence->flags = 0UL; > > > > > > > > + fence->status = 0; > > > > > > > > > > > > > > > > trace_dma_fence_init(fence); > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > > > > > > > > > Looking at existing users there's only the sync_file stuff. And that gets > > > > > > it wrong, because afaics no one ever sets fence->status to anything > > > > > > useful. But sync_file blindly assumes it's correct. > > > > > > > > > > In terms of doc, sync_file is using it correctly, and dma-fence isn't > > > > > living up to its doc. The documented behaviour (sync_file) seems useful. > > > > > > > > Yeah, it looks like an oversight in merging sync_file and dma_fence > > > > together, but instead of piecemeal fixing (like this patch does), fixing > > > > it all&properly might be better. > > > > > > What's missing? > > > > > > * @status: Optional, only valid if < 0, must be set before calling > > > * dma_fence_signal, indicates that the fence has completed with an > > > * error. > > > > > > dma-fence must then guarantee that is zeroed during init, then the > > > drivers can supply the optional information prior to calling > > > dma_fence_signal() > > > > > > A dma_fence_set_status(fence, status) { > > > BUG_ON(test_bit(SIGNALED, &fence->flags); > > > BUG_ON(!IS_ERR_VALUE(status)); > > > BUG_ON(fence->status); > > > fence->status = status; > > > } ? > > > > The trouble I'm seeing is that sync_file.c and sync_debug.c _only_ look at > > fence->status, and assume that statue > 0 means the fence is singalled. > > That code doesn't check fence->flags at all ... > > sync_file: > sync_fill_fence_info() { > if (dma_fence_is_signaled(fence)) > info->status = fence->status >= 0 ? 1 : fence->status; > else > info->status = 0; > } > > The no_fences: path is confusing since it sets info.status, but that's a > different struct entirely. Ah right, coffee didn't properly kick in this morning ;-) > sync_debug: > sync_print_fence() { > status = 1; > if (dma_fence_is_is_signaled_locked(fence)) > status = fence->status; > > That's backwards... Yeah, that's the one I've meant. But it's for debug only > > So maybe we need to rename status to error_status to make it clear it's > > for failure modes only, and fix the sync_file code to look at flags, too. > > It is already ^^. > > > Please maybe even some userspace tests to sweeten the deal? Perhaps in our > > hangstats tests. > > I am already querying the error code in igt. Excellent, and since sync_file works that explains why igt works ;-) Can you pls spin some patch to fix up sync_debug, so I can pull it all in? And I still think s/status/error_status/ would help clarity a lot. And then maybe even the dma_fence_set_error_status helper from above. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2017-01-04 11:31 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson 2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson 2017-01-03 11:34 ` Tvrtko Ursulin 2017-01-03 11:46 ` [Intel-gfx] " Chris Wilson 2017-01-03 11:57 ` Tvrtko Ursulin 2017-01-03 12:13 ` Chris Wilson 2017-01-03 12:34 ` [Intel-gfx] " Tvrtko Ursulin 2017-01-03 12:38 ` Chris Wilson 2017-01-03 13:17 ` Tvrtko Ursulin 2017-01-03 13:25 ` Chris Wilson 2017-01-03 11:53 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() Patchwork 2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin 2017-01-04 9:15 ` [Intel-gfx] " Daniel Vetter 2017-01-04 9:24 ` Chris Wilson 2017-01-04 9:37 ` [Intel-gfx] " Daniel Vetter 2017-01-04 9:43 ` Chris Wilson 2017-01-04 10:18 ` Daniel Vetter 2017-01-04 10:26 ` Chris Wilson 2017-01-04 11:31 ` [Intel-gfx] " Daniel Vetter
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.