Re: [PATCH 1/4] drm/i915: Unify execlist and legacy request life-cycles

From: Nick Hoath <nicholas.hoath@intel.com>
To: "intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>,
	Daniel Vetter <daniel@ffwll.ch>
Subject: Re: [PATCH 1/4] drm/i915: Unify execlist and legacy request life-cycles
Date: Wed, 14 Oct 2015 17:19:11 +0100
Message-ID: <561E807F.7060105@intel.com>
In-Reply-To: <561E69D2.7050900@intel.com>

On 14/10/2015 15:42, Dave Gordon wrote:
> On 13/10/15 12:36, Chris Wilson wrote:
>> On Tue, Oct 13, 2015 at 01:29:56PM +0200, Daniel Vetter wrote:
>>> On Fri, Oct 09, 2015 at 06:23:50PM +0100, Chris Wilson wrote:
>>>> On Fri, Oct 09, 2015 at 07:18:21PM +0200, Daniel Vetter wrote:
>>>>> On Fri, Oct 09, 2015 at 10:45:35AM +0100, Chris Wilson wrote:
>>>>>> On Fri, Oct 09, 2015 at 11:15:08AM +0200, Daniel Vetter wrote:
>>>>>>> My idea was to create a new request for 3. which gets signalled by the
>>>>>>> scheduler in intel_lrc_irq_handler. My idea was that we'd only create
>>>>>>> these when a ctx switch might occur to avoid overhead, but I guess if we
>>>>>>> just outright delay all requests a notch if need be, that might work too. But
>>>>>>> I'm really not sure on the implications of that (i.e. does the hardware
>>>>>>> really unload the ctx if it's idle?), and whether that would fly still with
>>>>>>> the scheduler.
>>>>>>>
>>>>>>> But figuring this one out here seems to be the cornerstone of this reorg.
>>>>>>> Without it we can't just throw contexts onto the active list.
>>>>>>
>>>>>> (Let me see if I understand it correctly)
>>>>>>
>>>>>> Basically the problem is that we can't trust the context object to be
>>>>>> synchronized until after the status interrupt. The way we handled that
>>>>>> for legacy is to track the currently bound context and keep the
>>>>>> vma->pin_count asserted until the request containing the switch away completes.
>>>>>> Doing the same for execlists would trivially fix the issue and if done
>>>>>> smartly allows us to share more code (been there, done that).
>>>>>>
>>>>>> That satisfies me for keeping requests as a basic fence in the GPU
>>>>>> timeline and should keep everyone happy that the context can't vanish
>>>>>> until after it is complete. The only caveat is that we cannot evict the
>>>>>> most recent context. For legacy, we do a switch back to the always
>>>>>> pinned default context. For execlists we don't, but it still means we
>>>>>> should only have one context which cannot be evicted (like legacy). But
>>>>>> it does leave us with the issue that i915_gpu_idle() returns early and
>>>>>> i915_gem_context_fini() must keep the explicit gpu reset to be
>>>>>> absolutely sure that the pending context writes are completed before the
>>>>>> final context is unbound.
>>>>>
>>>>> Yes, and that was what I originally had in mind. Meanwhile the scheduler
>>>>> (will) happen and that means we won't have FIFO ordering. Which means when
>>>>> we switch contexts (as opposed to just adding more to the ringbuffer of
>>>>> the current one) we won't have any idea which context will be the next
>>>>> one. Which also means we don't know which request to pick to retire the
>>>>> old context. Hence why I think we need to be better.
>>>>
>>>> But the scheduler does - it is also in charge of making sure the
>>>> retirement queue is in order. The essence is that we only actually pin
>>>> engine->last_context, which is chosen as we submit stuff to the hw.
>>>
>>> Well I'm not sure how much it will reorder, but I'd expect it wants to
>>> reorder stuff pretty freely. And as soon as it reorders contexts (ofc they
>>> can't depend on each other) then the legacy hw ctx tracking won't work.
>>>
>>> I think at least ...
>>
>> Not the way it is written today, but the principle behind it still
>> stands. The last_context submitted to the hardware is pinned until a new
>> one is submitted (such that it remains bound in the GGTT until after the
>> context switch is complete due to the active reference). Instead of
>> doing the context tracking at the start of the execbuffer, the context
>> tracking needs to be pushed down to the submission backend/middleman.
>> -Chris
>
> Does anyone actually know what guarantees (if any) the GPU provides
> w.r.t access to context images vs. USER_INTERRUPTs and CSB-updated
> interrupts? Does 'active->idle' really mean that the context has been
> fully updated in memory (and can therefore be unmapped), or just that
> the engine has stopped processing (but the context might not be saved
> until it's known that it isn't going to be reactivated)?
>
> For example, it could implement this:
>
> (End of last batch in current context)
> 	1.	Update seqno
> 	2.	Generate USER_INTERRUPT
> 	3.	Engine finishes work
> 		(HEAD == TAIL and no further contexts queued in ELSP)
> 	4.	Save all per-context registers to context image
> 	5.	Flush to memory and invalidate
> 	6.	Update CSB
> 	7.	Flush to memory
> 	8.	Generate CSB-update interrupt.
>
> (New batch in same context submitted via ELSP)
> 	9.	Reload entire context image from memory
> 	10.	Update CSB
> 	11.	Generate CSB-update interrupt
>
> Or this:
> 	1. Update seqno
> 	2. Generate USER_INTERRUPT
> 	3. Engine finishes work
> 		(HEAD == TAIL and no further contexts queued in ELSP)
> 	4. Update CSB
> 	5. Generate CSB-update interrupt.
>
> (New batch in DIFFERENT context submitted via ELSP)
> 	6. Save all per-context registers to old context image
> 	7. Load entire context image from new image
> 	8. Update CSB
> 	9. Generate CSB-update interrupt
>
> The former is synchronous and relatively easy to model, the latter is
> more like the way legacy mode works. And various other permutations are
> possible (sync save vs async save vs deferred save, full reload vs
> lite-restore, etc). So I think we either need to know what really
> happens (and assume future chips will work the same way), or make only
> minimal assumptions and code something that will work no matter how the
> hardware actually behaves. That probably precludes any attempt at
> tracking individual context-switches at the CSB level, which in any case
> aren't passed to the CPU in GuC submission mode.
>
> .Dave.
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

Tracking context via last_context at request retirement.

In LRC/ELSP mode:
	At startup:
		- Double refcount default context
		- Set last_context to default context

	When a request is complete:
		- If last_context == current_context
			- queue request for cleanup
		- If last_context != current_context
			- unref last_context
			- update last_context to current_context
			- queue request for cleanup

What this achieves:
	- Brings the code path closer to legacy submission
	- Allows active_list tracking to be used for contexts & ringbuffers

Additional work 1:
	- When there is no work pending on an engine, at some point:
		- Send a nop request on the default context
			This moves last_context to be default context,
			allowing previous last_context to be unref'd
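
To make the refcounting concrete, here is a rough standalone sketch of the
retirement rule and the idle nop described above. It is not driver code:
struct ctx, struct engine, struct request and every function name below are
hypothetical stand-ins. It also assumes that each request takes its own
context reference at submission and that this reference is handed over to
last_context when the context changes; the outline above does not spell
that part out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real i915 structures. */
struct ctx {
	int refcount;
};

struct request {
	struct ctx *ctx;
};

struct engine {
	struct ctx *default_context;
	struct ctx *last_context;  /* holds one reference on behalf of the hw */
	int cleanup_queued;        /* requests handed over for cleanup */
};

void ctx_ref(struct ctx *c)
{
	c->refcount++;
}

void ctx_unref(struct ctx *c)
{
	assert(c->refcount > 0);
	c->refcount--;
}

/* At startup: take a second reference on the default context, track it. */
void engine_init(struct engine *e, struct ctx *default_ctx)
{
	ctx_ref(default_ctx);
	e->default_context = default_ctx;
	e->last_context = default_ctx;
	e->cleanup_queued = 0;
}

/* Assumption: every request takes its own context reference at submission. */
struct request request_submit(struct ctx *c)
{
	struct request rq = { c };

	ctx_ref(c);
	return rq;
}

/* Request completion, following the two cases in the outline. */
void request_complete(struct engine *e, struct request *rq)
{
	if (e->last_context == rq->ctx) {
		/* Same context: last_context already pins it, so the
		 * request's own reference can go (simplified: dropped
		 * here rather than at cleanup time). */
		ctx_unref(rq->ctx);
	} else {
		/* Context changed: release the old one; the request's
		 * reference is handed over to last_context. */
		ctx_unref(e->last_context);
		e->last_context = rq->ctx;
	}
	e->cleanup_queued++;
}

/* Additional work 1: on an idle engine, retire a nop on the default
 * context so that the previous last_context can be released. */
void engine_idle_nop(struct engine *e)
{
	struct request nop = request_submit(e->default_context);

	request_complete(e, &nop);
}
```

In this model a user context only loses the reference held on behalf of
the hardware once a request on a different context has completed, which
is why an idle engine needs the nop on the default context.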

Additional work 2:
	- Change legacy mode to use last_context post request completion
		This will allow us to unify the code paths.
