public inbox for intel-gfx@lists.freedesktop.org
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: "Ceraolo Spurio, Daniele" <daniele.ceraolospurio@intel.com>,
	"Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>,
	"intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>
Subject: Re: [Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis
Date: Tue, 20 Sep 2022 15:33:09 +0100	[thread overview]
Message-ID: <cd372139-2a33-c556-e375-03a578da157c@linux.intel.com> (raw)
In-Reply-To: <c6e968c8-aaa0-949c-61c9-857ef5d9f617@intel.com>



On 16/09/2022 16:36, Ceraolo Spurio, Daniele wrote:
> 
> 
> On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
>>
>> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:

[snip]

>>>>
>>>>> +    /*
>>>>> +     * If the context gets closed while the execbuf is ongoing, the
>>>>> +     * context close code will race with the below code to cancel the
>>>>> +     * delayed work. If the context close wins the race and cancels
>>>>> +     * the work, it will immediately call the sched disable (see
>>>>> +     * guc_context_close), so there is a chance we can get past this
>>>>> +     * check while the sched_disable code is being executed. To make
>>>>> +     * sure that code completes before we check the status further
>>>>> +     * down, we wait for the close process to complete.
>>>>> +     */
>>>>> +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
>>>>> +        intel_context_sched_disable_unpin(ce);
>>>>> +    else if (intel_context_is_closed(ce))
>>>>> +        wait_for(context_close_done(ce), 1);
>>>>
>>>> Comment makes it sound important to handle the race, although it
>>>> doesn't really explain the consequences. But most importantly, what if
>>>> close doesn't complete in 1ms?
>>>
>>> Will add the consequence (I believe the consequence is that we could
>>> prematurely read the context flags bit indicating its guc_id is still
>>> registered and, after skipping re-registration, find that the context
>>> gets closed and the guc_id freed).
>>>
>>> Yes, the 1 second is arbitrary and unnervingly short. Just spent some
>>> time trying to
>>
>> One millisecond even. :)
> 
> Normally 1ms is plenty of time for this. We can only hit the wait if 
> the context_close call has already called cancel_delayed_work, so 
> the only thing left to do in that function is to send the H2G, which is 
> relatively fast. However, I've just realized that if the H2G buffer is 
> full the H2G will stall, so in that case it can take longer (the maximum 
> stall time before a hang is declared is 1.5s).
> 
>>
>> What would be the consequence of prematurely reading the still 
>> registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
>> Something else?
> 
> i915 thinks that a request is pending with the GuC, while the GuC 
> thinks we disabled it (because the completion of the delayed_disable 
> happens after the new request has been submitted). The heartbeat will 
> still go through, so no reset will be triggered; the request is just 
> lost. A new request submission on the same context should be able to 
> recover it, but we didn't test that.
> 
> 
>>
>> And why can't this race with context close happen at any other point 
>> than this particular spot in guc_request_alloc? Immediately after the 
>> added checks? After atomic_add_unless?
> 
> The race is tied to the canceling of the delayed work. context_close 
> only does something if it cancels the delayed work, so if the 
> cancel_delayed_work_sync here does the cancelling then context_close is 
> a no-op.

It's a bit involved* to follow, so I'll give up and declare a "whatever" if that's okay with you.

*)
intel_context_close
   set_bit(CONTEXT_CLOSED_BIT, &ce->flags)
   ->guc_context_close
	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
		  ((which is:
			spin_lock_irqsave(&ce->guc_state.lock, flags);
			..stuff..
			unlock
		  ))
  
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	set_context_close_done(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

It takes and releases the same lock twice, so I have no idea why, and even less whether it is safe and race free.

Regards,

Tvrtko

Thread overview: 14+ messages
2022-09-15  2:12 [Intel-gfx] [PATCH 0/1] Delay disabling GuC scheduling of an idle context Alan Previn
2022-09-15  2:12 ` [Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis Alan Previn
2022-09-15  8:48   ` Tvrtko Ursulin
2022-09-16  7:53     ` Teres Alexis, Alan Previn
2022-09-16  8:58       ` Tvrtko Ursulin
2022-09-16 15:36         ` Ceraolo Spurio, Daniele
2022-09-20  6:22           ` Teres Alexis, Alan Previn
2022-09-20 14:33           ` Tvrtko Ursulin [this message]
2022-09-21  7:59           ` Teres Alexis, Alan Previn
2022-09-21 22:30           ` Teres Alexis, Alan Previn
2022-09-21 22:30         ` Teres Alexis, Alan Previn
2022-09-15  3:22 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Delay disabling GuC scheduling of an idle context Patchwork
2022-09-15  3:41 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-09-15 20:41 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
