From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
intel-gfx@lists.freedesktop.org,
Mika Kuoppala <mika.kuoppala@linux.intel.com>
Subject: Re: [PATCH 15/15] drm/i915/guc: Split hw submission for replay after GPU reset
Date: Mon, 28 Nov 2016 15:55:42 +0000 [thread overview]
Message-ID: <d995f795-7633-06e4-604b-ecdc37e2a48b@linux.intel.com> (raw)
In-Reply-To: <20161128141932.GE17384@nuc-i3427.alporthouse.com>
On 28/11/2016 14:19, Chris Wilson wrote:
> On Mon, Nov 28, 2016 at 02:02:33PM +0000, Tvrtko Ursulin wrote:
>>
>> On 25/11/2016 09:30, Chris Wilson wrote:
>>> Something I missed before sending off the partial series was that the
>>> non-scheduler guc reset path was broken (in the full series, this is
>>> pushed to the execlists reset handler). The issue is that after a reset,
>>> we have to refill the GuC workqueues, which we do by resubmitting the
>>> requests. However, if we already have submitted them, the fences within
>>> them have already been used and triggering them again is an error.
>>> Instead, just repopulate the guc workqueue.
>>>
>>> [ 115.858560] [IGT] gem_busy: starting subtest hang-render
>>> [ 135.839867] [drm] GPU HANG: ecode 9:0:0xe757fefe, in gem_busy [1716], reason: Hang on render ring, action: reset
>>> [ 135.839902] drm/i915: Resetting chip after gpu hang
>>> [ 135.839957] [drm] RC6 on
>>> [ 135.858351] ------------[ cut here ]------------
>>> [ 135.858357] WARNING: CPU: 2 PID: 45 at drivers/gpu/drm/i915/i915_sw_fence.c:108 i915_sw_fence_complete+0x25/0x30
>>> [ 135.858357] Modules linked in: rfcomm bnep binfmt_misc nls_iso8859_1 input_leds snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core btusb btrtl snd_hwdep snd_pcm 8250_dw snd_seq_midi hid_lenovo snd_seq_midi_event snd_rawmidi iwlwifi x86_pkg_temp_thermal coretemp snd_seq crct10dif_pclmul snd_seq_device hci_uart snd_timer crc32_pclmul ghash_clmulni_intel idma64 aesni_intel virt_dma btbcm snd btqca aes_x86_64 btintel lrw cfg80211 bluetooth gf128mul glue_helper ablk_helper cryptd soundcore intel_lpss_pci intel_pch_thermal intel_lpss_acpi intel_lpss acpi_als mfd_core kfifo_buf acpi_pad industrialio autofs4 hid_plantronics usbhid dm_mirror dm_region_hash dm_log sdhci_pci ahci sdhci libahci i2c_hid hid
>>> [ 135.858389] CPU: 2 PID: 45 Comm: kworker/2:1 Tainted: G W 4.9.0-rc4+ #238
>>> [ 135.858389] Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
>>> [ 135.858392] Workqueue: events_long i915_hangcheck_elapsed
>>> [ 135.858394] ffffc900001bf9b8 ffffffff812bb238 0000000000000000 0000000000000000
>>> [ 135.858396] ffffc900001bf9f8 ffffffff8104f621 0000006c00000000 ffff8808296137f8
>>> [ 135.858398] 0000000000000a00 ffff8808457a0000 ffff880845764e60 ffff880845760000
>>> [ 135.858399] Call Trace:
>>> [ 135.858403] [<ffffffff812bb238>] dump_stack+0x4d/0x65
>>> [ 135.858405] [<ffffffff8104f621>] __warn+0xc1/0xe0
>>> [ 135.858406] [<ffffffff8104f748>] warn_slowpath_null+0x18/0x20
>>> [ 135.858408] [<ffffffff813f8c15>] i915_sw_fence_complete+0x25/0x30
>>> [ 135.858410] [<ffffffff813f8fad>] i915_sw_fence_commit+0xd/0x30
>>> [ 135.858412] [<ffffffff8142e591>] __i915_gem_request_submit+0xe1/0xf0
>>> [ 135.858413] [<ffffffff8142e5c8>] i915_gem_request_submit+0x28/0x40
>>> [ 135.858415] [<ffffffff814433e7>] i915_guc_submit+0x47/0x210
>>> [ 135.858417] [<ffffffff81443e98>] i915_guc_submission_enable+0x468/0x540
>>> [ 135.858419] [<ffffffff81442495>] intel_guc_setup+0x715/0x810
>>> [ 135.858421] [<ffffffff8142b6b4>] i915_gem_init_hw+0x114/0x2a0
>>> [ 135.858423] [<ffffffff813eeaa8>] i915_reset+0xe8/0x120
>>> [ 135.858424] [<ffffffff813f3937>] i915_reset_and_wakeup+0x157/0x180
>>> [ 135.858426] [<ffffffff813f79db>] i915_handle_error+0x1ab/0x230
>>> [ 135.858428] [<ffffffff812c760d>] ? scnprintf+0x4d/0x90
>>> [ 135.858430] [<ffffffff81435985>] i915_hangcheck_elapsed+0x275/0x3d0
>>> [ 135.858432] [<ffffffff810668cf>] process_one_work+0x12f/0x410
>>> [ 135.858433] [<ffffffff81066bf3>] worker_thread+0x43/0x4d0
>>> [ 135.858435] [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
>>> [ 135.858436] [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
>>> [ 135.858438] [<ffffffff8106bbb4>] kthread+0xd4/0xf0
>>> [ 135.858440] [<ffffffff8106bae0>] ? kthread_park+0x60/0x60
>>>
>>> v2: Only resubmit submitted requests
>>> v3: Don't forget the pending requests have reserved space.
>>>
>>> Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_guc_submission.c | 37 ++++++++++++++++--------------
>>> 1 file changed, 20 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
>>> index 00b5fa871644..e12ff881d38d 100644
>>> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
>>> @@ -602,12 +602,9 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
>>> }
>>>
>>> /**
>>> - * i915_guc_submit() - Submit commands through GuC
>>> + * __i915_guc_submit() - Submit commands through GuC
>>> * @rq: request associated with the commands
>>> *
>>> - * Return: 0 on success, otherwise an errno.
>>> - * (Note: nonzero really shouldn't happen!)
>>> - *
>>> * The caller must have already called i915_guc_wq_reserve() above with
>>> * a result of 0 (success), guaranteeing that there is space in the work
>>> * queue for the new request, so enqueuing the item cannot fail.
>>> @@ -619,7 +616,7 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
>>> * The only error here arises if the doorbell hardware isn't functioning
>>> * as expected, which really shouln't happen.
>>> */
>>> -static void i915_guc_submit(struct drm_i915_gem_request *rq)
>>> +static void __i915_guc_submit(struct drm_i915_gem_request *rq)
>>> {
>>> struct drm_i915_private *dev_priv = rq->i915;
>>> struct intel_engine_cs *engine = rq->engine;
>>> @@ -628,17 +625,6 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
>>> struct i915_guc_client *client = guc->execbuf_client;
>>> int b_ret;
>>>
>>> - /* We keep the previous context alive until we retire the following
>>> - * request. This ensures that any the context object is still pinned
>>> - * for any residual writes the HW makes into it on the context switch
>>> - * into the next object following the breadcrumb. Otherwise, we may
>>> - * retire the context too early.
>>> - */
>>> - rq->previous_context = engine->last_context;
>>> - engine->last_context = rq->ctx;
>>> -
>>> - i915_gem_request_submit(rq);
>>> -
>>> spin_lock(&client->wq_lock);
>>> guc_wq_item_append(client, rq);
>>>
>>> @@ -658,6 +644,23 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
>>> spin_unlock(&client->wq_lock);
>>> }
>>>
>>> +static void i915_guc_submit(struct drm_i915_gem_request *rq)
>>> +{
>>> + struct intel_engine_cs *engine = rq->engine;
>>> +
>>> + /* We keep the previous context alive until we retire the following
>>> + * request. This ensures that any the context object is still pinned
>>> + * for any residual writes the HW makes into it on the context switch
>>> + * into the next object following the breadcrumb. Otherwise, we may
>>> + * retire the context too early.
>>> + */
>>> + rq->previous_context = engine->last_context;
>>> + engine->last_context = rq->ctx;
>>> +
>>> + i915_gem_request_submit(rq);
>>> + __i915_guc_submit(rq);
>>> +}
>>> +
>>> /*
>>> * Everything below here is concerned with setup & teardown, and is
>>> * therefore not part of the somewhat time-critical batch-submission
>>> @@ -1556,7 +1559,7 @@ int i915_guc_submission_enable(struct drm_i915_private *dev_priv)
>>> /* Replay the current set of previously submitted requests */
>>> list_for_each_entry(rq, &engine->timeline->requests, link) {
>>> client->wq_rsvd += sizeof(struct guc_wq_item);
>>> - i915_guc_submit(rq);
>>> + __i915_guc_submit(rq);
>>> }
>>> }
>>>
>>>
>>
>> The only thing bothering me here is how this interacts with the
>> common, or "execlist" part of the reset.
>
> Also applies to ringbuffer (some parts at least).
>
>> Copying Mika in hope he can help here. :)
>>
>> I see that i915_gem_reset_engine runs before this GuC wq refill and
>> does some magic with the requests on the engine->timeline->requests
>> lists.
>>
>> Would it be possible to somehow better integrate the two things?
>
> I thought they were nicely separated. The common work to disable the
> guilty context is handled by the core reset function. The specific work
> required to restart and replay the requests handled by the different
> backends.
>
> i915_gem_reset_engine: is for disabling the guilty context
> engine->reset: updates the context / ring for restarting past a hung
> batch (if required)
> engine->init: starts the engine, processing the pending requests if any
>
> engine->reset is a bad name, since we have per-engine resets as well,
> but I haven't a better verb yet for handling the request bumping after a
> reset.
Yes but final part of the reset when GuC is enabled, is not in any of
those and is not even called from i915_gem_reset. It happens later via
i915_gem_init_hw in i915_guc_submission_enable.
I was wondering if that is the most logical way of doing it - does it
belong with the GEM reset, or the hardware reset?
For example could we, for GuC, not use reset_common_ring but add
guc_reset_engine (yes, why did you name it ring when we went to engines?).
But I see that the ordering of operations would make it very
problematic, since GuC is not setup until later following a reset.
So it all becomes very complicated and since it also predates this patch
it doesn't matter. Just confused me a little bit, and probably will
again in the future. :)
So..
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2016-11-28 15:55 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-25 9:30 GuC vs multiple timelines Chris Wilson
2016-11-25 9:30 ` [PATCH 01/15] drm: Hold mode_config.lock to prevent hotplug whilst setting up crtcs Chris Wilson
2016-11-25 9:30 ` [PATCH 02/15] drm: Pull together probe + setup for drm_fb_helper Chris Wilson
2016-11-25 9:30 ` [PATCH 03/15] drm: Protect fb_helper list manipulation with a mutex Chris Wilson
2016-11-25 9:30 ` [PATCH 04/15] drm/i915/perf: Wrap 64bit divides in do_div() Chris Wilson
2016-11-25 9:30 ` [PATCH 05/15] drm/i915: Add is-completed assert to request retire entrypoint Chris Wilson
2016-11-25 10:36 ` Joonas Lahtinen
2016-11-25 9:30 ` [PATCH 06/15] drm/i915: Assert no external observers when unwind a failed request alloc Chris Wilson
2016-11-25 10:39 ` Joonas Lahtinen
2016-11-25 11:27 ` Chris Wilson
2016-11-25 9:30 ` [PATCH 07/15] drm/i915: Hold a reference on the request for its fence chain Chris Wilson
2016-11-25 10:31 ` Joonas Lahtinen
2016-11-25 10:46 ` Chris Wilson
2016-11-25 9:30 ` [PATCH 08/15] drm/i915: Integrate i915_sw_fence with debugobjects Chris Wilson
2016-11-25 10:24 ` Joonas Lahtinen
2016-11-25 9:30 ` [PATCH 09/15] drm/i915: Enable swfence debugobject support for i915.ko Chris Wilson
2016-11-25 10:32 ` Joonas Lahtinen
2016-11-25 9:30 ` [PATCH 10/15] HAX drm/i915: Enable guc submission Chris Wilson
2016-11-25 9:30 ` [PATCH 11/15] drm/i915: Trim i915_guc_info() stack usage Chris Wilson
2016-11-28 11:15 ` Tvrtko Ursulin
2016-11-28 11:35 ` Chris Wilson
2016-11-28 12:17 ` Tvrtko Ursulin
2016-11-25 9:30 ` [PATCH 12/15] drm/i915/guc: Rename client->cookie to match use Chris Wilson
2016-11-28 11:37 ` Tvrtko Ursulin
2016-11-25 9:30 ` [PATCH 13/15] drm/i915/guc: Initialise doorbell cookie to matching value Chris Wilson
2016-11-28 12:09 ` Tvrtko Ursulin
2016-11-28 12:18 ` Chris Wilson
2016-11-25 9:30 ` [PATCH 14/15] drm/i915/guc: Keep the execbuf client allocated across reset Chris Wilson
2016-11-28 13:49 ` Tvrtko Ursulin
2016-11-28 14:11 ` Chris Wilson
2016-11-28 15:44 ` Tvrtko Ursulin
2016-11-28 16:01 ` Chris Wilson
2016-11-25 9:30 ` [PATCH 15/15] drm/i915/guc: Split hw submission for replay after GPU reset Chris Wilson
2016-11-28 14:02 ` Tvrtko Ursulin
2016-11-28 14:19 ` Chris Wilson
2016-11-28 15:20 ` Mika Kuoppala
2016-11-28 15:55 ` Tvrtko Ursulin [this message]
2016-11-28 16:06 ` Chris Wilson
2016-11-25 10:16 ` ✗ Fi.CI.BAT: warning for series starting with [01/15] drm: Hold mode_config.lock to prevent hotplug whilst setting up crtcs Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d995f795-7633-06e4-604b-ecdc37e2a48b@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=intel-gfx@lists.freedesktop.org \
--cc=mika.kuoppala@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.