All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Andrzej Hajda <andrzej.hajda@intel.com>,
	John Harrison <john.c.harrison@intel.com>,
	intel-gfx@lists.freedesktop.org
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Subject: Re: [Intel-gfx] [PATCH] drm/i915: fix exiting context timeout calculation
Date: Fri, 2 Dec 2022 09:14:25 +0000	[thread overview]
Message-ID: <fdabe479-b063-e6d5-0a1e-e2d3cebb705f@linux.intel.com> (raw)
In-Reply-To: <c0bb381e-43d2-3af9-0ada-2bc60027a4f5@intel.com>


On 01/12/2022 16:36, Andrzej Hajda wrote:
> On 01.12.2022 11:28, Tvrtko Ursulin wrote:
>>
>> On 01/12/2022 00:22, John Harrison wrote:
>>> On 11/29/2022 00:43, Tvrtko Ursulin wrote:
>>>> On 28/11/2022 16:52, Andrzej Hajda wrote:
>>>>> In case context is exiting preempt_timeout_ms is used for timeout,
>>>>> but since introduction of DRM_I915_PREEMPT_TIMEOUT_COMPUTE it 
>>>>> increases
>>>>> to 7.5 seconds. Heartbeat occurs earlier but it is still 2.5s.
>>>>>
>>>>> Fixes: d7a8680ec9fb21 ("drm/i915: Improve long running compute w/a 
>>>>> for GuC submission")
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2410
>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>> ---
>>>>> Hi all,
>>>>>
>>>>> I am not sure what is expected solution here, and if my patch does not
>>>>> actually reverts intentions of patch d7a8680ec9fb21. Feel free to 
>>>>> propose
>>>>> something better.
>>>>> Other alternative would be to increase t/o in IGT tests, but I am 
>>>>> not sure
>>>>> if this is good direction.
>>>>
>>>> Is it the hack with the FIXME marker from 47daf84a8bfb ("drm/i915: 
>>>> Make the heartbeat play nice with long pre-emption timeouts") that 
>>>> actually breaks things? (If IGT modifies the preempt timeout the 
>>>> heartbeat extension will not work as intended.)
>>>>
>>>> If so, I think we agreed during review that was a weakness which 
>>>> needs to be addressed, but I would need to re-read the old threads 
>>>> to remember what was the plan. Regardless what it was it may be time 
>>>> is now to continue with those improvements.
>>>>
>>> What is the actual issue? Just that closing contexts are taking 
>>> forever to actually close? That would be the whole point of the 
>>> 'context_is_exiting' patch. Which I still totally disagree with.
>>>
>>> If the context is being closed 'gracefully' and it is intended that 
>>> it should be allowed time to pre-empt without being killed via an 
>>> engine reset then the 7.5s delay is required. That is the officially 
>>> agreed upon timeout to allow compute capable contexts to reach a 
>>> pre-emption point before they should be killed. If an IGT is failing 
>>> because it enforces a shorter timeout then the IGT needs to be 
>>> updated to account for the fact that i915 has to support slow compute 
>>> workloads.
>>>
>>> If the context is being closed 'forcefully' and should be killed 
>>> immediately then you should be using the 'BANNED_PREEMPT_TIMEOUT' 
>>> value not the sysfs/config value.
>>>
>>> Regarding heartbeats...
>>>
>>> The heartbeat period is 2.5s. But there are up to five heartbeat 
>>> periods between the heartbeat starting and it declaring a hang. The 
>>> patch you mention also introduced a check on the pre-emption timeout 
>>> when the last period starts. If the pre-emption timeout is longer 
>>> than the heartbeat period then the last period is extended to 
>>> guarantee that a full pre-emption time is granted before declaring 
>>> the hang.
>>>
>>> Are you saying that a heartbeat timeout is occurring and killing the 
>>> system? Or are you just worried that something doesn't align correctly?
>>
>> I leave this to Andrzej since I am not the one debugging this. I just 
>> glanced over the IGT and saw that there's code in there which sets 
>> both the preempt timeout and heartbeat interval to non-default values. 
>> And then I remembered this:
> 
> The test is gem_ctx_persistence@many-contexts. It does not modify sysfs 
> timeouts, but it assumes 1sec is enough to wait for exiting context 
> (no-preemption). It works with bcs, vcs, vecs, but fails on rcs since it 
> has
> timeout set to 7.5sec (btw it works with GuC submissions enabled). It 
> seemed to me somehow inconsistent, but if this is how it should work
> I will just adjust the test.

This looks odd then. That test is using non-preemptable spinners and 
AFAICT it keeps submitting them for 30s, across all engines, and then it 
stops and waits for one second for all of them to exit.

With the 7.5 preempt timeout I'd expect test should fail both with GuC 
and execlists.

What should happen is that every context is marked as "exiting" and is 
revoked. On the next scheduling event they would all be dropped.

So I think two questions - how did increase of preempt timeout to 7.5s 
pass CI - is the failure sporadic for instance?

Second question - you are saying with GuC test always passes - how does 
GuC manages to revoke a non-preemptible spinner in less than one second 
if preempt timeout is 7.5s.. colour me confused.

Anyway those questions are secondary.. Fix here I think pretty obviously 
is for many_contexts() to fetch the preempt timeout from sysfs and allow 
for that much time (plus a safety factor). Use the longest timeout 
between all engines since all are submitted to.

Regards,

Tvrtko

> 
> Regards
> Andrzej
> 
> 
>>
>> next_heartbeat():
>> ...
>>          /*
>>           * FIXME: The final period extension is disabled if the 
>> period has been
>>           * modified from the default. This is to prevent issues with 
>> certain
>>           * selftests which override the value and expect specific 
>> behaviour.
>>           * Once the selftests have been updated to either cope with 
>> variable
>>           * heartbeat periods (or to override the pre-emption timeout 
>> as well,
>>           * or just to add a selftest specific override of the 
>> extension), the
>>           * generic override can be removed.
>>           */
>>          if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
>>              delay == engine->defaults.heartbeat_interval_ms) {
>>
>> Which then wouldn't dtrt with last heartbeat pulse extensions, if the 
>> IGT would be relying on that. Don't know, just pointing out to check 
>> and see if this FIXME needs to be prioritised.
>>
>> Regards,
>>
>> Tvrtko
> 

  reply	other threads:[~2022-12-02  9:14 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-28 16:52 [Intel-gfx] [PATCH] drm/i915: fix exiting context timeout calculation Andrzej Hajda
2022-11-28 18:06 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2022-11-28 18:28 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-11-29  0:32 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2022-11-29  8:43 ` [Intel-gfx] [PATCH] " Tvrtko Ursulin
2022-12-01  0:22   ` John Harrison
2022-12-01 10:28     ` Tvrtko Ursulin
2022-12-01 16:36       ` Andrzej Hajda
2022-12-02  9:14         ` Tvrtko Ursulin [this message]
2022-12-02 12:19           ` Andrzej Hajda
2022-12-02 13:13             ` Tvrtko Ursulin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fdabe479-b063-e6d5-0a1e-e2d3cebb705f@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=andrzej.hajda@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=john.c.harrison@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.