Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
Date: Wed, 26 Apr 2017 13:13:41 +0100	[thread overview]
Message-ID: <7c70131c-e660-556c-e3d6-52fe3dca7ff3@linux.intel.com> (raw)
In-Reply-To: <20170426111849.GP11432@nuc-i3427.alporthouse.com>

On 26/04/2017 12:18, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 11:54:08AM +0100, Tvrtko Ursulin wrote:
>>>>> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
>>>>> + * context id to the last u32 fence seqno waited upon from that context.
>>>>> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
>>>>> + * the root. This allows us to access the whole tree via a single pointer
>>>>> + * to the most recently used layer. We expect fence contexts to be dense
>>>>> + * and most reuse to be on the same i915_gem_context but on neighbouring
>>>>> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
>>>>> + * effective lookup cache. If the new lookup is not on the same leaf, we
>>>>> + * expect it to be on the neighbouring branch.
>>>>> + *
>>>>> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
>>>>> + * allows us to store whether a particular seqno is valid (i.e. allows us
>>>>> + * to distinguish unset from 0).
>>>>> + *
>>>>> + * A branch holds an array of layer pointers, and has height > 0, and always
>>>>> + * has at least 2 layers (either branches or leaves) below it.
>>>>> + *
>>>>> + */
>>>>
>>>> @_@ :)
>>>>
>>>> Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
>>>> because of u64 keys. :( How about a hash table? It would be much
>>>> simpler to review. :) Seriously, if it would perform close enough it
>>>> would be a much much simpler implementation.
>>>
>>> You want a resizable hashtable. rht is appallingly slow, so you want a
>>> custom resizeable ht. They are not as simple as this codewise ;)
>>> (Plus a compressed radixtree is part of my plan for scalability
>>> improvements for struct reservation_object.)
>>
>> Why resizable? I was thinking a normal one. If at any given time we
>> have an active set of contexts, or at least lookups are as you say
>> below, to neighbouring contexts, that would mean we are talking
>> about lookups to different hash buckets.  And for the typical
>> working set we would expect many collisions so longer lists in each
>> bucket? So maybe NUM_ENGINES * some typical load constant number
>> buckets would not be that bad?
>
> Consider a long running display server that will accumulate 10,000s of
> thousands of clients in its lifetime, each with their own contents that
> get shared by passing around fences/framebuffers. (Or on a shorter scale
> any of the context stress tests in igt.) Due to the non-recycling of the
> context ids, we can grow to very large tables - but we have no knowledge
> of what contexts are no longer required.
>
> To compensate we need to occasionally prune the sync points. For a ht we
> could just scrap it, For an idr, we could store last use and delete
> stale leaves.

Hm, pruning yes.. but you don't have pruning in this patch either. So 
that's something which needs to be addressed either way.

> But first we have a question of how many buckets do we give the static
> ht? Most processes will be sharing between 2 contexts (render context,
> presentation context) except for a display server who may have 10-100 of
> clients - and possibly where eliminating repeated syncs is going to be
> most valuable. That suggests 256 buckets for every timeline (assuming
> just a pair of engines across shared contexts). Overkill for the
> majority, and going to be miserable in stress tests.
>
> Finally what do you store in the ht? Fences are the obvious candidate,
> but need reaping. Or you just create a ht for the request and only
> squash repeated fences inside a single request - that doesn't benefit
> from timeline tracking and squashing between requests (but does avoid
> keeping fences around forever). Hence why I went with tracking seqno. To
> avoid allocations for the ht nodes, we could create an open-addressed ht
> with the {context, seqno} embedded in it. It would be efficient, but
> needs online resizing and is a fair chunk of new code.

I was thinking of exactly the same thing as this patch does, u64 context 
id as key, u32 seqnos (wrapped in a container with hlist_node).

Ok, so the key is pruning to keep the display server scenario in check.

Free the key from i915_fence_release?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx