From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Subject: Re: [PATCH] drm/i915: Break modeset deadlocks on reset
Date: Wed, 21 Jun 2017 14:37:33 +0100 [thread overview]
Message-ID: <f865412e-babc-13d5-7c47-3dfc551301af@linux.intel.com> (raw)
In-Reply-To: <149804561157.15021.16207511831696299140@mail.alporthouse.com>
On 21/06/2017 12:46, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-06-21 12:30:15)
>>
>> On 20/06/2017 20:55, Chris Wilson wrote:
>>> Trying to do a modeset from within a reset is fraught with danger. We
>>> can fall into a cyclic deadlock where the modeset is waiting on a
>>> previous modeset that is waiting on a request, and since the GPU hung
>>> that request completion is waiting on the reset. As modesetting doesn't
>>> allow its locks to be broken and restarted, or for its *own* reset
>>> mechanism to take over the display, we have to do something very
>>> evil instead. If we detect that we are stuck waiting to prepare the
>>> display reset (by using a very simple timeout), resort to cancelling all
>>> in-flight requests and throwing the user data into /dev/null, which is
>>> marginally better than the driver locking up and keeping that data to
>>> itself.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gem.c | 18 ++++--------------
>>> drivers/gpu/drm/i915/i915_irq.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>>> 2 files changed, 44 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index c2213016fd86..973f4f9e6b84 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -3040,7 +3040,8 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>>> /* Mark all executing requests as skipped */
>>> spin_lock_irqsave(&engine->timeline->lock, flags);
>>> list_for_each_entry(request, &engine->timeline->requests, link)
>>> - dma_fence_set_error(&request->fence, -EIO);
>>> + if (!i915_gem_request_completed(request))
>>> + dma_fence_set_error(&request->fence, -EIO);
>>> spin_unlock_irqrestore(&engine->timeline->lock, flags);
>>>
>>> /* Mark all pending requests as complete so that any concurrent
>>> @@ -3079,6 +3080,7 @@ static int __i915_gem_set_wedged_BKL(void *data)
>>> struct intel_engine_cs *engine;
>>> enum intel_engine_id id;
>>>
>>> + set_bit(I915_WEDGED, &i915->gpu_error.flags);
>>> for_each_engine(engine, i915, id)
>>> engine_set_wedged(engine);
>>>
>>> @@ -3087,20 +3089,7 @@ static int __i915_gem_set_wedged_BKL(void *data)
>>>
>>> void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
>>> {
>>> - lockdep_assert_held(&dev_priv->drm.struct_mutex);
>>> - set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
>>> -
>>> - /* Retire completed requests first so the list of inflight/incomplete
>>> - * requests is accurate and we don't try and mark successful requests
>>> - * as in error during __i915_gem_set_wedged_BKL().
>>> - */
>>> - i915_gem_retire_requests(dev_priv);
>>> -
>>> stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
>>> -
>>> - i915_gem_contexts_lost(dev_priv);
>>> -
>>> - mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
>>
>> What was the purpose of this, since it is now gone?
>
> It was to clean up the device after explicitly cancelling everything. In
> the pre-reset path, I felt it was overkill and replaced with the retire_work
> (which will then call idle_work as required).
>
> Hmm, thinking about the reset cycle, the retirement is forced anyway. My
> main concern is that we lose the cleanliness of a "normal" wedging.
>
>>> bool i915_gem_unset_wedged(struct drm_i915_private *i915)
>>> @@ -3155,6 +3144,7 @@ bool i915_gem_unset_wedged(struct drm_i915_private *i915)
>>> * context and do not require stop_machine().
>>> */
>>> intel_engines_reset_default_submission(i915);
>>> + i915_gem_contexts_lost(i915);
>>>
>>> smp_mb__before_atomic(); /* complete takeover before enabling execbuf */
>>> clear_bit(I915_WEDGED, &i915->gpu_error.flags);
>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>> index bce2d1feceb1..6585bcdf61a6 100644
>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>> @@ -2599,6 +2599,40 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>>> return ret;
>>> }
>>>
>>> +struct wedge_me {
>>> + struct delayed_work work;
>>> + struct drm_i915_private *i915;
>>> + const char *name;
>>> +};
>>> +
>>> +static void wedge_me(struct work_struct *work)
>>> +{
>>> + struct wedge_me *w = container_of(work, typeof(*w), work.work);
>>> +
>>> + DRM_ERROR("%s timed out, cancelling all in-flight rendering.\n",
>>> + w->name);
>>
>> Do you plan to call this from other places to need the name string?
>
> I was keeping that option open. >
>>
>>> + i915_gem_set_wedged(w->i915);
>>
>> Is it safe to do the execlist_port manipulation at this point? Couldn't
>> we receive an interrupt for one request completing after we have cleared
>> it from ports but before the actual reset?
>
> We do the port manipulation and queue cancellation inside stop_machine,
> i.e. in complete isolation. That ensures we won't succumb to a race
> there, we just have to be careful that the execlists cancellation works
> once the machine process the interrupts afterwards. That I'm not sure
> about...
>
>>> +static bool init_wedge_on_timeout(struct wedge_me *w,
>>> + struct drm_i915_private *i915,
>>> + long timeout,
>>> + const char *name)
>>> +{
>>> + w->i915 = i915;
>>> + w->name = name;
>>> + INIT_DELAYED_WORK_ONSTACK(&w->work, wedge_me);
>>> +
>>> + schedule_delayed_work(&w->work, HZ);
>>
>> One second is pessimistic enough etc?
>
> It's a hard call. If we make it too long, then the modesetting times out
> with even worse results. Too short and we may timeout in the middle of a
> slow eDP cycle.
>
>>
>>> + return true;
>>
>> Is the return value needed?
>
> That's just for syntatic sugar. What I wanted was something like
>
> i915_try_or_wedge(params) {
> dangerous_func().
> }
>
> if (init_wedge()) { } was easy with no macro abuse.
>
> The best macro abuse I can think of is for(init_wedge(w, params); try_wedge(w); )
>
>>> +}
>>> +
>>> +static void destroy_wedge_on_timeout(struct wedge_me *w)
>>> +{
>>> + cancel_delayed_work_sync(&w->work);
>>> + destroy_delayed_work_on_stack(&w->work);
>>> +}
>>> +
>>
>> Slight objection to the "on stack" API since it is disconnected from the
>> allocation so does not directly see whether it is really on stack. But
>> it is close enough to it and with just one user so could be passable.
>
> I hear you, but the stack properties and ensuring a strictly controlled
> critical section were too nice.
>
>>> /**
>>> * i915_reset_and_wakeup - do process context error handling work
>>> * @dev_priv: i915 device private
>>> @@ -2612,13 +2646,17 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>
>> Will need rebase.
>
> git is magic.
>
>>> char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>>> char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>>> char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
>>> + struct wedge_me w;
>>>
>>> kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>>>
>>> DRM_DEBUG_DRIVER("resetting chip\n");
>>> kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>>>
>>> - intel_prepare_reset(dev_priv);
>>> + if (init_wedge_on_timeout(&w, dev_priv, HZ, "intel_prepare_reset")) {
>>> + intel_prepare_reset(dev_priv);
>>> + destroy_wedge_on_timeout(&w);
>>> + }
>>>
>>> set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>> wake_up_all(&dev_priv->gpu_error.wait_queue);
>>> @@ -2642,6 +2680,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>> HZ));
>>>
>>> intel_finish_reset(dev_priv);
>>> + mod_delayed_work(dev_priv->wq, &dev_priv->gt.retire_work, 0);
>>
>> A comment explaining why this?
>
> It's the replacement for the removed idle_work, but now I realise that we
> are guaranteed a i915_gem_retire_requests() as part of the reset procedure.
So will be respinning with this removed?
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2017-06-21 13:37 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-20 16:12 [PATCH] drm/i915: Break modeset deadlocks on reset Chris Wilson
2017-06-20 19:55 ` Chris Wilson
2017-06-21 11:30 ` Tvrtko Ursulin
2017-06-21 11:46 ` Chris Wilson
2017-06-21 12:40 ` Chris Wilson
2017-06-21 13:37 ` Tvrtko Ursulin [this message]
2017-06-21 13:48 ` Chris Wilson
2017-06-20 20:11 ` ✓ Fi.CI.BAT: success for drm/i915: Break modeset deadlocks on reset (rev2) Patchwork
2017-06-21 14:24 ` [PATCH v2] drm/i915: Break modeset deadlocks on reset Chris Wilson
2017-06-21 15:55 ` Maarten Lankhorst
2017-06-21 16:06 ` Chris Wilson
2017-06-21 17:27 ` Maarten Lankhorst
2017-06-21 15:39 ` ✓ Fi.CI.BAT: success for drm/i915: Break modeset deadlocks on reset (rev3) Patchwork
2017-06-22 10:56 ` [PATCH v3] drm/i915: Break modeset deadlocks on reset Chris Wilson
2017-06-23 12:35 ` Tvrtko Ursulin
2017-06-23 12:44 ` Chris Wilson
2017-06-22 11:33 ` ✓ Fi.CI.BAT: success for drm/i915: Break modeset deadlocks on reset (rev4) Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f865412e-babc-13d5-7c47-3dfc551301af@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=intel-gfx@lists.freedesktop.org \
--cc=mika.kuoppala@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.