From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH v4 3/4] drm/i915: Make the heartbeat play nice with long pre-emption timeouts
Date: Mon, 3 Oct 2022 08:53:37 +0100 [thread overview]
Message-ID: <c61d540a-9b3a-76f5-2641-c508a6e2bcbd@linux.intel.com> (raw)
In-Reply-To: <89566262-2cd0-f456-e8b2-c7bc6ad6fe36@intel.com>
On 30/09/2022 18:44, John Harrison wrote:
> On 9/30/2022 02:22, Tvrtko Ursulin wrote:
>> On 29/09/2022 17:21, John Harrison wrote:
>>> On 9/29/2022 00:42, Tvrtko Ursulin wrote:
>>>> On 29/09/2022 03:18, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> Compute workloads are inherently not pre-emptible for long periods on
>>>>> current hardware. As a workaround for this, the pre-emption timeout
>>>>> for compute capable engines was disabled. This is undesirable with GuC
>>>>> submission as it prevents per engine reset of hung contexts. Hence the
>>>>> next patch will re-enable the timeout but bumped up by an order of
>>>>> magnitude.
>>>>>
>>>>> However, the heartbeat might not respect that. Depending upon current
>>>>> activity, a pre-emption to the heartbeat pulse might not even be
>>>>> attempted until the last heartbeat period. Which means that only one
>>>>> period is granted for the pre-emption to occur. With the aforesaid
>>>>> bump, the pre-emption timeout could be significantly larger than this
>>>>> heartbeat period.
>>>>>
>>>>> So adjust the heartbeat code to take the pre-emption timeout into
>>>>> account. When it reaches the final (high priority) period, it now
>>>>> ensures the delay before hitting reset is bigger than the pre-emption
>>>>> timeout.
>>>>>
>>>>> v2: Fix for selftests which adjust the heartbeat period manually.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>> .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 19
>>>>> +++++++++++++++++++
>>>>> 1 file changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> index a3698f611f457..823a790a0e2ae 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> @@ -22,9 +22,28 @@
>>>>> static bool next_heartbeat(struct intel_engine_cs *engine)
>>>>> {
>>>>> + struct i915_request *rq;
>>>>> long delay;
>>>>> delay = READ_ONCE(engine->props.heartbeat_interval_ms);
>>>>> +
>>>>> + rq = engine->heartbeat.systole;
>>>>> +
>>>>> + if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
>>>>> + delay == engine->defaults.heartbeat_interval_ms) {
>>>>
>>>> Maybe I forgot but what is the reason for the check against the
>>>> default heartbeat interval?
>>> That's the 'v2: fix for selftests that manually adjust the
>>> heartbeat'. If something (or someone) has explicitly set an override
>>> of the heartbeat then it has to be assumed that they know what they
>>> are doing, and if things don't work any more that's their problem.
>>> But if we don't respect their override then they won't get the
>>> timings they expect and the selftest will fail.
>>
>> Isn't this a bit too strict for the non-selftest case? If the new
>> concept is extending the last pulse to guarantee preemption, then I
>> think we could allow tweaking of the heartbeat period. Like what if
>> user wants 1s, or 10s instead of 2.5s - why would that need to break
>> the improvement from this patch?
> Then the user is back to where they were before this patch.
>
>>
>> In what ways selftests fail? Are they trying to guess time to reset
>> based on the hearbeat period set? If so perhaps add a helper to query
>> it based on the last pulse extension.
>
> I don't recall. It was six months ago when I was actually working on
> this. And right now I do not have the time to go back and re-run all the
> testing and re-write a bunch of self tests with whole new helpers and
> algorithms and whatever else might be necessary to polish this to
> perfection. And in the meantime, all the existing issues are still
> present - there is no range checking on any of this stuff, it is very
> possible for a driver with default settings to break a legal workload
> because the heartbeat and pre-emption are fighting with each other, we
> don't even have per engine resets enabled, etc.
>
> Maybe it could be even better with a follow up patch. Feel free to do
> that. But as it stands, this patch set significantly improves the
> situation without making anything worse.
As we seem to be in agreement that the check against default heartbeat
is a hack with only purpose to work around assumptions made by
selftests, then please file a Jira about removing it (this hack). Then
work can be assigned to someone to clean it up. With that done I would
agree the series is indeed an improvement and it would have my ack.
Regards,
Tvrtko
next prev parent reply other threads:[~2022-10-03 7:53 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-29 2:18 [Intel-gfx] [PATCH v4 0/4] Improve anti-pre-emption w/a for compute workloads John.C.Harrison
2022-09-29 2:18 ` [Intel-gfx] [PATCH v4 1/4] drm/i915/guc: Limit scheduling properties to avoid overflow John.C.Harrison
2022-09-29 7:39 ` Tvrtko Ursulin
2022-09-29 2:18 ` [Intel-gfx] [PATCH v4 2/4] drm/i915: Fix compute pre-emption w/a to apply to compute engines John.C.Harrison
2022-09-29 2:18 ` [Intel-gfx] [PATCH v4 3/4] drm/i915: Make the heartbeat play nice with long pre-emption timeouts John.C.Harrison
2022-09-29 7:42 ` Tvrtko Ursulin
2022-09-29 16:21 ` John Harrison
2022-09-30 9:22 ` Tvrtko Ursulin
2022-09-30 17:44 ` John Harrison
2022-10-03 7:53 ` Tvrtko Ursulin [this message]
2022-10-03 12:00 ` Tvrtko Ursulin
2022-10-05 18:48 ` John Harrison
2022-10-06 10:03 ` Tvrtko Ursulin
2022-09-29 2:18 ` [Intel-gfx] [PATCH v4 4/4] drm/i915: Improve long running compute w/a for GuC submission John.C.Harrison
2022-09-29 7:44 ` Tvrtko Ursulin
2022-09-29 2:33 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Improve anti-pre-emption w/a for compute workloads (rev7) Patchwork
2022-09-29 2:33 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-09-29 2:53 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-09-30 2:28 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c61d540a-9b3a-76f5-2641-c508a6e2bcbd@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=DRI-Devel@Lists.FreeDesktop.Org \
--cc=Intel-GFX@Lists.FreeDesktop.Org \
--cc=john.c.harrison@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox