dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Jani Nikula <jani.nikula@linux.intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Don't wait forever in drop_caches
Date: Thu, 3 Nov 2022 09:38:14 +0000	[thread overview]
Message-ID: <2b2eb780-08f7-c7df-0397-a7f732da272d@linux.intel.com> (raw)
In-Reply-To: <1855f0f2-8a1c-7bf4-76c0-76a4354ea8e8@linux.intel.com>


On 03/11/2022 09:18, Tvrtko Ursulin wrote:
> 
> On 03/11/2022 01:33, John Harrison wrote:
>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>>
>>>> sysfs?
>> Sorry, that was meant to say debugfs. I've also been working on some 
>> sysfs IGT issues and evidently got my wires crossed!
>>
>>>>
>>>>> special flags set. One of the possible paths waits for idle with an
>>>>> infinite timeout. That causes problems for debugging issues when CI
>>>>> catches a "can't go idle" test failure. Best case, the CI system times
>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>>> Sometimes a serial port log of dmesg might be available, sometimes 
>>>>> not.
>>>>>
>>>>> So rather than making life hard for ourselves, change the timeout to
>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>> working system (if possible).
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>             DROP_RESET_ACTIVE | \
>>>>>             DROP_RESET_SEQNO | \
>>>>>             DROP_RCU)
>>>>> +
>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>
>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
>>>> here.
>>>
>>> So move here, dropping i915 prefix, next to the newly proposed one?
>> Sure, can do that.
>>
>>>
>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>> gt/intel_gt.c.
>>>
>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>
>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.
>>>
>>> No action needed, maybe drop i915 prefix if wanted.
>>>
>> These two are totally unrelated and in code not being touched by this 
>> change. I would rather not conflate changing random other things with 
>> fixing this specific issue.
>>
>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>
>>> Add _MS suffix if wanted.
>>>
>>>> My head spins.
>>>
>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies 
>>> to DROP_ACTIVE and not only DROP_IDLE.
>> My original intention for the name was that is the 'drop caches 
>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful and 
>> hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that 
>> name can be conflated with the DROP_IDLE flag. Will rename.
>>
>>
>>>
>>> Things get refactored, code moves around, bits get left behind, who 
>>> knows. No reason to get too worked up. :) As long as people are 
>>> taking a wider view when touching the code base, and are not afraid 
>>> to send cleanups, things should be good.
>> On the other hand, if every patch gets blocked in code review because 
>> someone points out some completely unrelated piece of code could be a 
>> bit better then nothing ever gets fixed. If you spot something that 
>> you think should be improved, isn't the general idea that you should 
>> post a patch yourself to improve it?
> 
> There's two maintainers per branch and an order of magnitude or two more 
> developers so it'd be nice if cleanups would just be incoming on 
> self-initiative basis. ;)
> 
>>> For the actual functional change at hand - it would be nice if code 
>>> paths in question could handle SIGINT and then we could punt the 
>>> decision on how long someone wants to wait purely to userspace. But 
>>> it's probably hard and it's only debugfs so whatever.
>>>
>> The code paths in question will already abort on a signal won't they? 
>> Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), 
>> which is where the uc_wait_for_idle eventually ends up, have an 
>> 'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like 
>> what you are asking for is a change in the IGT libraries and/or CI 
>> framework to start sending signals after some specific timeout. That 
>> seems like a significantly more complex change (in terms of the number 
>> of entities affected and number of groups involved) and unnecessary.
> 
> If you say so, I haven't looked at them all. But if the code path in 
> question already aborts on signals then I am not sure what is the patch 
> fixing? I assumed you are trying to avoid the write stuck in D forever, 
> which then prevents driver unload and everything, requiring the test 
> runner to eventually reboot. If you say SIGINT works then you can 
> already recover from userspace, no?
> 
>>> Whether or not 10s is enough CI will hopefully tell us. I'd probably 
>>> err on the side of safety and make it longer, but at most half from 
>>> the test runner timeout.
>> This is supposed to be test clean up. This is not about how long a 
>> particular test takes to complete but about how long it takes to 
>> declare the system broken after the test has already finished. I would 
>> argue that even 10s is massively longer than required.
>>
>>>
>>> I am not convinced that wedging is correct though. Conceptually could 
>>> be just that the timeout is too short. What does wedging really give 
>>> us, on top of limiting the wait, when latter AFAIU is the key factor 
>>> which would prevent the need to reboot the machine?
>>>
>> It gives us a system that knows what state it is in. If we can't idle 
>> the GT then something is very badly wrong. Wedging indicates that. It 
>> also ensure that a full GT reset will be attempted before the next 
>> test is run. Helping to prevent a failure on test X from propagating 
>> into failures of unrelated tests X+1, X+2, ... And if the GT reset 
>> does not work because the system is really that badly broken then 
>> future tests will not run rather than report erroneous failures.
>>
>> This is not about getting a more stable system for end users by 
>> sweeping issues under the carpet and pretending all is well. End users 
>> don't run IGTs or explicitly call dodgy debugfs entry points. The sole 
>> motivation here is to get more accurate results from CI. That is, 
>> correctly identifying which test has hit a problem, getting valid 
>> debug analysis for that test (logs and such) and allowing further 
>> testing to complete correctly in the case where the system can be 
>> recovered.
> 
> I don't really oppose shortening of the timeout in principle, just want 
> a clear statement if this is something IGT / test runner could already 
> do or not. It can apply a timeout, it can also send SIGINT, and it could 
> even trigger a reset from outside. Sure it is debugfs hacks so general 
> "kernel should not implement policy" need not be strictly followed, but 
> lets have it clear what are the options.

One conceptual problem with applying this policy is that the code is:

	if (val & (DROP_IDLE | DROP_ACTIVE)) {
		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
		if (ret)
			return ret;
	}

	if (val & DROP_IDLE) {
		ret = intel_gt_pm_wait_for_idle(gt);
		if (ret)
			return ret;
	}

So if someone passes in DROP_IDLE and then why would only the first 
branch have a short timeout and wedge. Yeah some bug happens to be there 
at the moment, but put a bug in a different place and you hang on the 
second branch and then need another patch. Versus perhaps making it all 
respect SIGINT and handle from outside.

Regards,

Tvrtko

  reply	other threads:[~2022-11-03  9:38 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-01 23:50 [PATCH] drm/i915: Don't wait forever in drop_caches John.C.Harrison
2022-11-02 12:12 ` Jani Nikula
2022-11-02 14:20   ` [Intel-gfx] " Tvrtko Ursulin
2022-11-03  1:33     ` John Harrison
2022-11-03  9:18       ` Tvrtko Ursulin
2022-11-03  9:38         ` Tvrtko Ursulin [this message]
2022-11-03 19:16           ` John Harrison
2022-11-04 10:01             ` Tvrtko Ursulin
2022-11-04 17:45               ` John Harrison
2022-11-07 14:09                 ` Tvrtko Ursulin
2022-11-07 19:45                   ` John Harrison
2022-11-08  9:08                     ` Tvrtko Ursulin
2022-11-08 19:37                       ` John Harrison
2022-11-09 11:35                         ` Tvrtko Ursulin
2022-11-10  6:20                           ` John Harrison
2022-11-03 19:37         ` John Harrison
2022-11-03 10:45       ` Jani Nikula
2022-11-03 19:39         ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2b2eb780-08f7-c7df-0397-a7f732da272d@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=jani.nikula@linux.intel.com \
    --cc=john.c.harrison@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).