All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
Date: Wed, 26 Apr 2017 11:54:08 +0100	[thread overview]
Message-ID: <3fdcfbc7-e404-498e-9e85-fa328f99ffc1@linux.intel.com> (raw)
In-Reply-To: <20170426103846.GM11432@nuc-i3427.alporthouse.com>


On 26/04/2017 11:38, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 11:20:16AM +0100, Tvrtko Ursulin wrote:
>>
>> On 19/04/2017 10:41, Chris Wilson wrote:
>>> Track the latest fence waited upon on each context, and only add a new
>>> asynchronous wait if the new fence is more recent than the recorded
>>> fence for that context. This requires us to filter out unordered
>>> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
>>> absence of a universal identifier, we have to use our own
>>> i915->mm.unordered_timeline token.
>>>
>>> v2: Throw around the debug crutches
>>> v3: Inline the likely case of the pre-allocation cache being full.
>>> v4: Drop the pre-allocation support, we can lose the most recent fence
>>> in case of allocation failure -- it just means we may emit more awaits
>>> than strictly necessary but will not break.
>>> v5: Trim allocation size for leaf nodes, they only need an array of u32
>>> not pointers.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gem_request.c            |  67 +++---
>>> drivers/gpu/drm/i915/i915_gem_timeline.c           | 260 +++++++++++++++++++++
>>> drivers/gpu/drm/i915/i915_gem_timeline.h           |  14 ++
>>> drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 123 ++++++++++
>>> .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
>>> 5 files changed, 438 insertions(+), 27 deletions(-)
>>> create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
>>> index 97c07986b7c1..fb6c31ba3ef9 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
>>> @@ -730,9 +730,7 @@ int
>>> i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>>> 				 struct dma_fence *fence)
>>> {
>>> -	struct dma_fence_array *array;
>>> 	int ret;
>>> -	int i;
>>>
>>> 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> 		return 0;
>>> @@ -744,39 +742,54 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>>> 	if (fence->context == req->fence.context)
>>> 		return 0;
>>>
>>> -	if (dma_fence_is_i915(fence))
>>> -		return i915_gem_request_await_request(req, to_request(fence));
>>> +	/* Squash repeated waits to the same timelines, picking the latest */
>>> +	if (fence->context != req->i915->mm.unordered_timeline &&
>>> +	    intel_timeline_sync_get(req->timeline,
>>> +				    fence->context, fence->seqno))
>>
>> Function name is non-intuitive to me. It doesn't seem to get
>> anything, but is more like query? Since it ends up with
>> i915_seqno_passed, maybe intel_timeline_sync_is_newer/older ? (give
>> or take)
>
> _get was choosen as the partner for _set, which seemed to make sense.
> Keep intel_timeline_sync_set() and replace _get with
> intel_timeline_sync_passed() ?
> intel_timeline_sync_is_later() ?

Both are better in my opinion. _get just makes it sounds like it is 
returning something from the object, which it is not. So whichever you 
prefer.

>>> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
>>> index b596ca7ee058..f2b734dda895 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
>>> @@ -24,6 +24,254 @@
>>>
>>> #include "i915_drv.h"
>>>
>>> +#define NSYNC 16
>>> +#define SHIFT ilog2(NSYNC)
>>> +#define MASK (NSYNC - 1)
>>> +
>>> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
>>> + * context id to the last u32 fence seqno waited upon from that context.
>>> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
>>> + * the root. This allows us to access the whole tree via a single pointer
>>> + * to the most recently used layer. We expect fence contexts to be dense
>>> + * and most reuse to be on the same i915_gem_context but on neighbouring
>>> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
>>> + * effective lookup cache. If the new lookup is not on the same leaf, we
>>> + * expect it to be on the neighbouring branch.
>>> + *
>>> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
>>> + * allows us to store whether a particular seqno is valid (i.e. allows us
>>> + * to distinguish unset from 0).
>>> + *
>>> + * A branch holds an array of layer pointers, and has height > 0, and always
>>> + * has at least 2 layers (either branches or leaves) below it.
>>> + *
>>> + */
>>
>> @_@ :)
>>
>> Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
>> because of u64 keys. :( How about a hash table? It would be much
>> simpler to review. :) Seriously, if it would perform close enough it
>> would be a much much simpler implementation.
>
> You want a resizable hashtable. rht is appallingly slow, so you want a
> custom resizeable ht. They are not as simple as this codewise ;)
> (Plus a compressed radixtree is part of my plan for scalability
> improvements for struct reservation_object.)

Why resizable? I was thinking a normal one. If at any given time we have 
an active set of contexts, or at least lookups are as you say below, to 
neighbouring contexts, that would mean we are talking about lookups to 
different hash buckets.  And for the typical working set we would expect 
many collisions so longer lists in each bucket? So maybe NUM_ENGINES * 
some typical load constant number buckets would not be that bad?

> This is designed around the idea that most lookups are to neighbouring
> contexts (i.e. same i915_gem_context, different engines) and so are on
> the same leaf and so cached. (A goal here is to be cheaper than the cost
> of repetitions along the fence signaling. They are indirect costs that
> show up in a couple of places, but are reasonably cheap. Offsetting the
> cost is the benefit of moving it off the signal->exec path.)
>
> Plus radixtree scrapped the idr lookup cache, which is a negative for
> most of our code :( Fortunately for execbuf, we do have a bypass planned.

I trust the data structure is a great, but would like to understand if 
something simpler could perhaps get us 99% of the performance (or some 
number).

>>> +static int igt_seqmap(void *arg)
>>> +{
>>> +	struct drm_i915_private *i915 = arg;
>>> +	const struct {
>>> +		const char *name;
>>> +		u32 seqno;
>>> +		bool expected;
>>> +		bool set;
>>> +	} pass[] = {
>>> +		{ "unset", 0, false, false },
>>> +		{ "new", 0, false, true },
>>> +		{ "0a", 0, true, true },
>>> +		{ "1a", 1, false, true },
>>> +		{ "1b", 1, true, true },
>>> +		{ "0b", 0, true, false },
>>> +		{ "2a", 2, false, true },
>>> +		{ "4", 4, false, true },
>>> +		{ "INT_MAX", INT_MAX, false, true },
>>> +		{ "INT_MAX-1", INT_MAX-1, true, false },
>>> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
>>> +		{ "INT_MAX", INT_MAX, true, false },
>>> +		{ "UINT_MAX", UINT_MAX, false, true },
>>> +		{ "wrap", 0, false, true },
>>> +		{ "unwrap", UINT_MAX, true, false },
>>> +		{},
>>> +	}, *p;
>>> +	struct intel_timeline *tl;
>>> +	int order, offset;
>>> +	int ret;
>>> +
>>> +	tl = &i915->gt.global_timeline.engine[RCS];
>>
>> Unless I am missing something, it looks like you could get away with
>> a lighter solution of implementing a mock_timeline instead of the
>> whole mock_gem_device. I think it would be preferable.
>
> Fine, I was just using a familiar pattern.

If it is more than a few lines (I thought it wouldn't be) to add a mock 
timeline then don't bother. I just thought unit tests should preferable 
stay as lean as possible.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2017-04-26 10:54 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
2017-04-19  9:41 ` [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically Chris Wilson
2017-04-20  7:42   ` Joonas Lahtinen
2017-04-19  9:41 ` [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes Chris Wilson
2017-04-19 16:52   ` Dongwon Kim
2017-04-19 17:15     ` Chris Wilson
2017-04-19 17:46     ` Chris Wilson
2017-04-19 18:08     ` Chris Wilson
2017-04-19 18:13       ` Dongwon Kim
2017-04-19 18:26         ` Chris Wilson
2017-04-19 20:30           ` Dongwon Kim
2017-04-19 20:49           ` Dongwon Kim
2017-04-19  9:41 ` [PATCH 03/27] drm/i915: Mark up clflushes as belonging to an unordered timeline Chris Wilson
2017-04-19  9:41 ` [PATCH 04/27] drm/i915: Lift timeline ordering to await_dma_fence Chris Wilson
2017-04-19  9:41 ` [PATCH 05/27] drm/i915: Make ptr_unpack_bits() more function-like Chris Wilson
2017-04-19  9:41 ` [PATCH 06/27] drm/i915: Redefine ptr_pack_bits() and friends Chris Wilson
2017-04-19  9:41 ` [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence Chris Wilson
2017-04-24 13:03   ` Tvrtko Ursulin
2017-04-24 13:19     ` Chris Wilson
2017-04-24 13:31       ` Chris Wilson
2017-04-26 10:20   ` Tvrtko Ursulin
2017-04-26 10:38     ` Chris Wilson
2017-04-26 10:54       ` Tvrtko Ursulin [this message]
2017-04-26 11:18         ` Chris Wilson
2017-04-26 12:13           ` Tvrtko Ursulin
2017-04-26 12:23             ` Chris Wilson
2017-04-26 14:36               ` Tvrtko Ursulin
2017-04-26 14:55                 ` Chris Wilson
2017-04-26 15:04                 ` Chris Wilson
2017-04-26 18:56             ` Chris Wilson
2017-04-26 22:22               ` Chris Wilson
2017-04-27  9:20                 ` Tvrtko Ursulin
2017-04-27  9:47                   ` Chris Wilson
2017-04-27  7:06   ` [PATCH v8] " Chris Wilson
2017-04-27  7:14     ` Chris Wilson
2017-04-27  9:50     ` Chris Wilson
2017-04-27 11:42       ` Chris Wilson
2017-04-27 11:48     ` [PATCH v9] " Chris Wilson
2017-04-27 16:47       ` Tvrtko Ursulin
2017-04-27 17:25         ` Chris Wilson
2017-04-27 20:34           ` Chris Wilson
2017-04-27 20:53             ` Chris Wilson
2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
2017-04-28  7:59         ` Chris Wilson
2017-04-28  9:32         ` Tvrtko Ursulin
2017-04-28  9:54           ` Chris Wilson
2017-04-28  9:55         ` Tvrtko Ursulin
2017-04-28 10:11           ` Chris Wilson
2017-04-28 14:12         ` [PATCH v13] " Chris Wilson
2017-04-28 19:02           ` [PATCH v14] " Chris Wilson
2017-05-02 12:24             ` Tvrtko Ursulin
2017-05-02 14:45               ` Chris Wilson
2017-05-02 15:11                 ` Chris Wilson
2017-05-02 15:17                 ` Tvrtko Ursulin
2017-05-02 14:50               ` Chris Wilson
2017-04-19  9:41 ` [PATCH 08/27] drm/i915: Rename intel_timeline.sync_seqno[] to .global_sync[] Chris Wilson
2017-04-19  9:41 ` [PATCH 09/27] drm/i915: Confirm the request is still active before adding it to the await Chris Wilson
2017-04-19  9:41 ` [PATCH 10/27] drm/i915: Do not record a successful syncpoint for a dma-await Chris Wilson
2017-04-19  9:41 ` [PATCH 11/27] drm/i915: Switch the global i915.semaphores check to a local predicate Chris Wilson
2017-04-19  9:41 ` [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep Chris Wilson
2017-04-20 13:30   ` Tvrtko Ursulin
2017-04-20 13:57     ` Chris Wilson
2017-04-19  9:41 ` [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request Chris Wilson
2017-04-20 14:58   ` Tvrtko Ursulin
2017-04-27 14:37     ` Chris Wilson
2017-04-28 12:02       ` Tvrtko Ursulin
2017-04-28 12:21         ` Chris Wilson
2017-04-19  9:41 ` [PATCH 14/27] drm/i915: Don't mark an execlists context-switch when idle Chris Wilson
2017-04-20  8:53   ` Joonas Lahtinen
2017-04-19  9:41 ` [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list Chris Wilson
2017-04-24 10:28   ` Tvrtko Ursulin
2017-04-24 11:07     ` Chris Wilson
2017-04-24 12:18       ` Chris Wilson
2017-04-24 12:44       ` Tvrtko Ursulin
2017-04-24 13:06         ` Chris Wilson
2017-04-19  9:41 ` [PATCH 16/27] drm/i915: Reinstate reservation_object zapping for batch_pool objects Chris Wilson
2017-04-28 12:20   ` Tvrtko Ursulin
2017-04-19  9:41 ` [PATCH 17/27] drm/i915: Amalgamate execbuffer parameter structures Chris Wilson
2017-04-19  9:41 ` [PATCH 18/27] drm/i915: Use vma->exec_entry as our double-entry placeholder Chris Wilson
2017-04-19  9:41 ` [PATCH 19/27] drm/i915: Split vma exec_link/evict_link Chris Wilson
2017-04-19  9:41 ` [PATCH 20/27] drm/i915: Store a direct lookup from object handle to vma Chris Wilson
2017-04-19  9:41 ` [PATCH 21/27] drm/i915: Pass vma to relocate entry Chris Wilson
2017-04-19  9:41 ` [PATCH 22/27] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
2017-04-20  8:49   ` Joonas Lahtinen
2017-04-19  9:41 ` [PATCH 23/27] drm/i915: First try the previous execbuffer location Chris Wilson
2017-04-19  9:41 ` [PATCH 24/27] drm/i915: Wait upon userptr get-user-pages within execbuffer Chris Wilson
2017-04-19  9:41 ` [PATCH 25/27] drm/i915: Allow execbuffer to use the first object as the batch Chris Wilson
2017-04-19  9:41 ` [PATCH 26/27] drm/i915: Async GPU relocation processing Chris Wilson
2017-04-19  9:41 ` [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities Chris Wilson
2017-04-19 10:09   ` Chris Wilson
2017-04-19 11:07     ` Tvrtko Ursulin
2017-04-19 10:01 ` ✗ Fi.CI.BAT: failure for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically Patchwork
2017-04-27  7:27 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev2) Patchwork
2017-04-28 14:31 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev5) Patchwork
2017-04-28 19:22 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev6) Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3fdcfbc7-e404-498e-9e85-fa328f99ffc1@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.