Re: [igt-dev] [PATCH i-g-t] tests/perf_pmu: Improve accuracy by waiting on spinner to start

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	Tvrtko Ursulin <tursulin@ursulin.net>,
	igt-dev@lists.freedesktop.org
Cc: Intel-gfx@lists.freedesktop.org,
	Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Subject: Re: [igt-dev] [PATCH i-g-t] tests/perf_pmu: Improve accuracy by waiting on spinner to start
Date: Thu, 15 Mar 2018 14:37:59 +0000	[thread overview]
Message-ID: <24aaf956-edc7-d497-72e3-89a27f9ea9db@linux.intel.com> (raw)
In-Reply-To: <152112155077.25315.9635916282186626812@mail.alporthouse.com>

On 15/03/2018 13:45, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-03-15 13:36:26)
>>
>> On 15/03/2018 13:14, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-03-15 12:56:17)
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>
>>>> More than one test assumes that the spinner is running pretty much
>>>> immediately after we have create or submitted it.
>>>>
>>>> In actuality there is a variable delay, especially on execlists platforms,
>>>> between submission and spin batch starting to run on the hardware.
>>>>
>>>> To enable tests which care about this level of timing to account for this,
>>>> we add a new spin batch constructor which provides an output field which
>>>> can be polled to determine when the batch actually started running.
>>>>
>>>> This is implemented via MI_STOREDW_IMM from the spin batch, writing into
>>>> memory mapped page shared with userspace.
>>>>
>>>> Using this facility from perf_pmu, where applicable, should improve very
>>>> occasional test fails across the set and platforms.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>> ---
>>>>    lib/igt_dummyload.c |  99 +++++++++++++++++++++++++++++++----
>>>>    lib/igt_dummyload.h |   9 ++++
>>>>    tests/perf_pmu.c    | 145 +++++++++++++++++++++++++++++++++++-----------------
>>>>    3 files changed, 196 insertions(+), 57 deletions(-)
>>>>
>>>> diff --git a/lib/igt_dummyload.c b/lib/igt_dummyload.c
>>>> index 4b20f23dfe26..0447d2f14d57 100644
>>>> --- a/lib/igt_dummyload.c
>>>> +++ b/lib/igt_dummyload.c
>>>> @@ -74,9 +74,12 @@ fill_reloc(struct drm_i915_gem_relocation_entry *reloc,
>>>>           reloc->write_domain = write_domains;
>>>>    }
>>>>    
>>>> -static int emit_recursive_batch(igt_spin_t *spin,
>>>> -                               int fd, uint32_t ctx, unsigned engine,
>>>> -                               uint32_t dep, bool out_fence)
>>>> +#define OUT_FENCE      (1 << 0)
>>>> +#define POLL_RUN       (1 << 1)
>>>> +
>>>> +static int
>>>> +emit_recursive_batch(igt_spin_t *spin, int fd, uint32_t ctx, unsigned engine,
>>>> +                    uint32_t dep, unsigned int flags)
>>>>    {
>>>>    #define SCRATCH 0
>>>>    #define BATCH 1
>>>> @@ -116,6 +119,8 @@ static int emit_recursive_batch(igt_spin_t *spin,
>>>>           execbuf.buffer_count++;
>>>>    
>>>>           if (dep) {
>>>> +               igt_assert(!(flags & POLL_RUN));
>>>> +
>>>
>>> Challenge left to the reader :)
>>
>> Well not the reader, whoever gets to need both. :)
>>
>>>>                   /* dummy write to dependency */
>>>>                   obj[SCRATCH].handle = dep;
>>>>                   fill_reloc(&relocs[obj[BATCH].relocation_count++],
>>>> @@ -123,6 +128,41 @@ static int emit_recursive_batch(igt_spin_t *spin,
>>>>                              I915_GEM_DOMAIN_RENDER,
>>>>                              I915_GEM_DOMAIN_RENDER);
>>>>                   execbuf.buffer_count++;
>>>> +       } else if (flags & POLL_RUN) {
>>>> +               unsigned int offset;
>>>> +
>>>> +               igt_assert(!dep);
>>>> +
>>>> +               spin->poll_handle = gem_create(fd, 4096);
>>>> +               spin->running = __gem_mmap__wc(fd, spin->poll_handle,
>>>> +                                              0, 4096, PROT_READ | PROT_WRITE);
>>>
>>> Use mmap_cpu and gem_set_caching().
>>
>> Wouldn't that get us into coherency issues on some platforms? I keep the
>> page mapped for API users to poll on.
> 
> bxt-a? The point of using gem_set_caching() is that it is coherent with
> the CPU cache even on !llc via snooping. It's then essentially the same
> as how we handle breadcrumbs.
> 
> Now admittedly, we should do
> 
> if (__gem_set_caching() == 0)
> 	running = __gem_mmap__wb();
> else
> 	running = __gem_mmap__wc();
> 
> The caller need be known the wiser; except having to assume the worst
> and so __sync_synchronize() if they do *running = x themselves.

Ok.

>>>> +               igt_assert(spin->running);
>>>> +               igt_assert_eq(*spin->running, 0);
>>>> +
>>>> +               *batch++ = MI_STORE_DWORD_IMM | (gen < 6 ? 1 << 22 : 0);
>>>
>>> Hmm, have we forgot the (len-2) or is this an unusual command that knows
>>> its own length?
>>
>> I lifted the code from elsewhere.
> 
> I checked, we have the same bug everywhere or nowhere. :|
> 
>>>> +/**
>>>> + * igt_spin_batch_new_poll:
>>>> + * @fd: open i915 drm file descriptor
>>>> + * @engine: Ring to execute batch OR'd with execbuf flags. If value is less
>>>> + *          than 0, execute on all available rings.
>>>> + *
>>>> + * Start a recursive batch on a ring. Immediately returns a #igt_spin_t that
>>>> + * contains the batch's handle that can be waited upon. The returned structure
>>>> + * must be passed to igt_spin_batch_free() for post-processing.
>>>> + *
>>>> + * igt_spin_t->running will containt a pointer which target will change from
>>>> + * zero to one once the spinner actually starts executing on the GPU.
>>>> + *
>>>> + * Returns:
>>>> + * Structure with helper internal state for igt_spin_batch_free().
>>>> + */
>>>> +igt_spin_t *
>>>> +igt_spin_batch_new_poll(int fd, uint32_t ctx, unsigned engine)
>>>> +{
>>>> +       igt_spin_t *spin;
>>>> +
>>>> +       igt_require_gem(fd);
>>>> +       igt_require(gem_mmap__has_wc(fd));
>>>
>>> igt_require(gem_can_store_dword(fd, engine));
>>>
>>> Not all platforms have a MI_STORE_DWORD/DATA_IMM (with virtual addresses
>>> at least) and some platforms will die (*cough* snb *cough*).
>>
>> Grr that makes it all problematic. Well, maybe not completely, I can
>> just fall back to less accurate method on those platforms.
> 
> It's only a few, I don't think in the grand scheme of things it's enough
> to worry about. We should lose just a few pmu tests on snb.
> 
>>>>    static void
>>>>    __submit_spin_batch(int gem_fd,
>>>> +                   igt_spin_t *spin,
>>>>                       struct drm_i915_gem_exec_object2 *obj,
>>>>                       const struct intel_execution_engine2 *e)
>>>>    {
>>>>           struct drm_i915_gem_execbuffer2 eb = {
>>>> -               .buffer_count = 1,
>>>>                   .buffers_ptr = to_user_pointer(obj),
>>>>                   .flags = e2ring(gem_fd, e),
>>>>           };
>>>>    
>>>> +       if (spin->running) {
>>>> +               obj[0].handle = spin->poll_handle;
>>>> +               obj[0].flags = EXEC_OBJECT_ASYNC;
>>>> +               obj[1].handle = spin->handle;
>>>> +               eb.buffer_count = 2;
>>>> +       } else {
>>>> +               obj[0].handle = spin->handle;
>>>> +               eb.buffer_count = 1;
>>>> +       }
>>>
>>> obj[] must be set up by the caller; the EXEC_OBJECT_PINNED are
>>> essential. Or else the kernel *will* move spin->poll_handle and then it
>>> is fubar.
>>
>> Why the caller has to do it? It is providing obj array which gets
>> populated by the helper and by the kernel. If I add EXEC_OBJECT_PINNED
>> to the helper is there a remaining problem?
> 
> Yes. The caller needs to ensure that flags = PINNED *and* the offset is
> correct. We can't just randomly stuff PINNED in there as that pretty
> much guarantees the object will be moved, breaking the implicit
> relocations.

Nevermind I was confused, thought I was always populating the obj array.

> As we are making changes to igt_spin_t, one of the ideas was that we put
> the obj[] array there (with the offsets and flags setup correctly) so
> that we could just feed that in again later without having to worry
> about the relocations.

I tried that before but we couldn't agree on resubmit semantics.

My patch had igt_spin_batch_restart(fd, spin) - so emitting the exact 
same batch, including the dependency. That would actually work well for 
this use case.

So if you are happy with that, I can resurrect that patch, add one more 
to implement stuff from this patch, and rebase perf_pmu changes to follow.

Actually I would change it to igt_spin_batch_restart(fd, spin, engine) - 
so the engine can be changed.

Regards,

Tvrtko

> -Chris
> 
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev