Re: [RFC] drm/i915/bdw+: Do not emit user interrupts when not needed

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	Intel-gfx@lists.freedesktop.org,
	Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Subject: Re: [RFC] drm/i915/bdw+: Do not emit user interrupts when not needed
Date: Fri, 18 Dec 2015 13:51:38 +0000	[thread overview]
Message-ID: <56740F6A.2080907@linux.intel.com> (raw)
In-Reply-To: <20151218122836.GH26780@nuc-i3427.alporthouse.com>

On 18/12/15 12:28, Chris Wilson wrote:
> On Fri, Dec 18, 2015 at 11:59:41AM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> We can rely on context complete interrupt to wake up the waiters
>> apart in the case where requests are merged into a single ELSP
>> submission. In this case we inject MI_USER_INTERRUPTS in the
>> ring buffer to ensure prompt wake-ups.
>>
>> This optimization has the effect on for example GLBenchmark
>> Egypt off-screen test of decreasing the number of generated
>> interrupts per second by a factor of two, and context switched
>> by factor of five to six.
>
> I half like it. Are the interupts a limiting factor in this case though?
> This should be ~100 waits/second with ~1000 batches/second, right? What
> is the delay between request completion and client wakeup - difficult to
> measure after you remove the user interrupt though! But I estimate it
> should be on the order of just a few GPU cycles.

Neither of the two benchmarks I ran (trex onscreen and egypt offscreen) 
show any framerate improvements.

The only thing I did manage to measure is that CPU energy usage goes 
down with the optimisation. Roughly 8-10%, courtesy of RAPL script 
someone posted here.

Benchmarking is generally very hard so it is a pity we don't have a farm 
similar to CI which does it all in a repeatable and solid manner.

>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 27f06198a51e..d9be878dbde7 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -359,6 +359,13 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
>>   	spin_unlock(&dev_priv->uncore.lock);
>>   }
>>
>> +static void execlists_emit_user_interrupt(struct drm_i915_gem_request *req)
>> +{
>> +	struct intel_ringbuffer *ringbuf = req->ringbuf;
>> +
>> +	iowrite32(MI_USER_INTERRUPT, ringbuf->virtual_start + req->tail - 8);
>> +}
>> +
>>   static int execlists_update_context(struct drm_i915_gem_request *rq)
>>   {
>>   	struct intel_engine_cs *ring = rq->ring;
>> @@ -433,6 +440,12 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
>>   			cursor->elsp_submitted = req0->elsp_submitted;
>>   			list_move_tail(&req0->execlist_link,
>>   				       &ring->execlist_retired_req_list);
>> +			/*
>> +			 * When merging requests make sure there is still
>> +			 * something after each batch buffer to wake up waiters.
>> +			 */
>> +			if (cursor != req0)
>> +				execlists_emit_user_interrupt(req0);
>
> You may have already missed this instruction as you patch it, and keep
> doing so as long as the context is resubmitted. I think to be safe, you
> need to patch cursor as well. You could then MI_NOOP out the MI_INTERUPT
> on the terminal request.

I don't at the moment see it could miss it? We don't do preemption, but 
granted I don't understand this code fully.

But patching it out definitely looks safer. And I even don't have to 
unbreak GuC in that case. So I'll try that approach.

> An interesting igt experiement I think would be:
>
> thread A, keep queuing batches with just a single MI_STORE_DWORD_IMM *addr
> thread B, waits on batch from A, reads *addr (asynchronously), measures
> latency (actual value - expected(batch))
>
> Run for 10s, report min/max/median latency.
>
> Repeat for more threads/contexts and more waiters. Ah, that may be the
> demonstration for the thundering herd I've been looking for!

Hm I'll think about it.

Wrt your second reply, that is an interesting question.

All I can tell that empirically it looks interrupts do arrive split, 
otherwise there would be no reduction in interrupt numbers. But why are 
they split I don't know.

I'll try adding some counters to get a feel how often does that happen 
in various scenarios.

Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx