Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH v4 3/4] drm/i915: Make the heartbeat play nice with long pre-emption timeouts
Date: Mon, 3 Oct 2022 08:53:37 +0100	[thread overview]
Message-ID: <c61d540a-9b3a-76f5-2641-c508a6e2bcbd@linux.intel.com> (raw)
In-Reply-To: <89566262-2cd0-f456-e8b2-c7bc6ad6fe36@intel.com>


On 30/09/2022 18:44, John Harrison wrote:
> On 9/30/2022 02:22, Tvrtko Ursulin wrote:
>> On 29/09/2022 17:21, John Harrison wrote:
>>> On 9/29/2022 00:42, Tvrtko Ursulin wrote:
>>>> On 29/09/2022 03:18, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> Compute workloads are inherently not pre-emptible for long periods on
>>>>> current hardware. As a workaround for this, the pre-emption timeout
>>>>> for compute capable engines was disabled. This is undesirable with GuC
>>>>> submission as it prevents per engine reset of hung contexts. Hence the
>>>>> next patch will re-enable the timeout but bumped up by an order of
>>>>> magnitude.
>>>>>
>>>>> However, the heartbeat might not respect that. Depending upon current
>>>>> activity, a pre-emption to the heartbeat pulse might not even be
>>>>> attempted until the last heartbeat period. Which means that only one
>>>>> period is granted for the pre-emption to occur. With the aforesaid
>>>>> bump, the pre-emption timeout could be significantly larger than this
>>>>> heartbeat period.
>>>>>
>>>>> So adjust the heartbeat code to take the pre-emption timeout into
>>>>> account. When it reaches the final (high priority) period, it now
>>>>> ensures the delay before hitting reset is bigger than the pre-emption
>>>>> timeout.
>>>>>
>>>>> v2: Fix for selftests which adjust the heartbeat period manually.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 19 
>>>>> +++++++++++++++++++
>>>>>   1 file changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> index a3698f611f457..823a790a0e2ae 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>>>> @@ -22,9 +22,28 @@
>>>>>     static bool next_heartbeat(struct intel_engine_cs *engine)
>>>>>   {
>>>>> +    struct i915_request *rq;
>>>>>       long delay;
>>>>>         delay = READ_ONCE(engine->props.heartbeat_interval_ms);
>>>>> +
>>>>> +    rq = engine->heartbeat.systole;
>>>>> +
>>>>> +    if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
>>>>> +        delay == engine->defaults.heartbeat_interval_ms) {
>>>>
>>>> Maybe I forgot but what is the reason for the check against the 
>>>> default heartbeat interval?
>>> That's the 'v2: fix for selftests that manually adjust the 
>>> heartbeat'. If something (or someone) has explicitly set an override 
>>> of the heartbeat then it has to be assumed that they know what they 
>>> are doing, and if things don't work any more that's their problem. 
>>> But if we don't respect their override then they won't get the 
>>> timings they expect and the selftest will fail.
>>
>> Isn't this a bit too strict for the non-selftest case? If the new 
>> concept is extending the last pulse to guarantee preemption, then I 
>> think we could allow tweaking of the heartbeat period. Like what if 
>> user wants 1s, or 10s instead of 2.5s - why would that need to break 
>> the improvement from this patch?
> Then the user is back to where they were before this patch.
> 
>>
>> In what ways selftests fail? Are they trying to guess time to reset 
>> based on the hearbeat period set? If so perhaps add a helper to query 
>> it based on the last pulse extension.
> 
> I don't recall. It was six months ago when I was actually working on 
> this. And right now I do not have the time to go back and re-run all the 
> testing and re-write a bunch of self tests with whole new helpers and 
> algorithms and whatever else might be necessary to polish this to 
> perfection. And in the meantime, all the existing issues are still 
> present - there is no range checking on any of this stuff, it is very 
> possible for a driver with default settings to break a legal workload 
> because the heartbeat and pre-emption are fighting with each other, we 
> don't even have per engine resets enabled, etc.
> 
> Maybe it could be even better with a follow up patch. Feel free to do 
> that. But as it stands, this patch set significantly improves the 
> situation without making anything worse.

As we seem to be in agreement that the check against default heartbeat 
is a hack with only purpose to work around assumptions made by 
selftests, then please file a Jira about removing it (this hack). Then 
work can be assigned to someone to clean it up. With that done I would 
agree the series is indeed an improvement and it would have my ack.

Regards,

Tvrtko

  reply	other threads:[~2022-10-03  7:53 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-29  2:18 [Intel-gfx] [PATCH v4 0/4] Improve anti-pre-emption w/a for compute workloads John.C.Harrison
2022-09-29  2:18 ` [Intel-gfx] [PATCH v4 1/4] drm/i915/guc: Limit scheduling properties to avoid overflow John.C.Harrison
2022-09-29  7:39   ` Tvrtko Ursulin
2022-09-29  2:18 ` [Intel-gfx] [PATCH v4 2/4] drm/i915: Fix compute pre-emption w/a to apply to compute engines John.C.Harrison
2022-09-29  2:18 ` [Intel-gfx] [PATCH v4 3/4] drm/i915: Make the heartbeat play nice with long pre-emption timeouts John.C.Harrison
2022-09-29  7:42   ` Tvrtko Ursulin
2022-09-29 16:21     ` John Harrison
2022-09-30  9:22       ` Tvrtko Ursulin
2022-09-30 17:44         ` John Harrison
2022-10-03  7:53           ` Tvrtko Ursulin [this message]
2022-10-03 12:00             ` Tvrtko Ursulin
2022-10-05 18:48               ` John Harrison
2022-10-06 10:03                 ` Tvrtko Ursulin
2022-09-29  2:18 ` [Intel-gfx] [PATCH v4 4/4] drm/i915: Improve long running compute w/a for GuC submission John.C.Harrison
2022-09-29  7:44   ` Tvrtko Ursulin
2022-09-29  2:33 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Improve anti-pre-emption w/a for compute workloads (rev7) Patchwork
2022-09-29  2:33 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-09-29  2:53 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-09-30  2:28 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c61d540a-9b3a-76f5-2641-c508a6e2bcbd@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=john.c.harrison@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox