Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test

From: Dave Gordon <david.s.gordon@intel.com>
To: John Harrison <John.C.Harrison@Intel.com>,
	intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
Date: Thu, 18 Aug 2016 16:27:06 +0100
Message-ID: <4d49e32f-21eb-ff72-1bbc-5787f89f54ae@intel.com>
In-Reply-To: <c4895c3d-ffec-9310-5e99-a6bec05116d9@Intel.com>

On 18/08/16 13:01, John Harrison wrote:
> On 03/08/2016 17:05, Dave Gordon wrote:
>> On 03/08/16 16:45, Chris Wilson wrote:
>>> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>>>> The parallel execution test in gem_exec_nop chooses a pessimal
>>>> distribution of work to multiple engines; specifically, it
>>>> round-robins one batch to each engine in turn. As the workloads
>>>> are trivial (NOPs), this results in each engine becoming idle
>>>> between batches. Hence parallel submission is seen to take LONGER
>>>> than the same number of batches executed sequentially.
>>>>
>>>> If on the other hand we send enough work to each engine to keep
>>>> it busy until the next time we add to its queue, (i.e. round-robin
>>>> some larger number of batches to each engine in turn) then we can
>>>> get true parallel execution and should find that it is FASTER than
>>>> sequential execution.
>>>>
>>>> By experiment, burst sizes of between 8 and 256 are sufficient to
>>>> keep multiple engines loaded, with the optimum (for this trivial
>>>> workload) being around 64. This is expected to be lower (possibly
>>>> as low as one) for more realistic (heavier) workloads.
>>>
>>> Quite funny. The driver submission overhead of A...A vs ABAB... engines
>>> is nearly identical, at least as far as the analysis presented here goes.
>>> -Chris
>>
>> Correct; but because the workloads are so trivial, if we hand out jobs
>> one at a time to each engine, the first will have finished the one
>> batch it's been given before we get round to giving it a second one
>> (even in execlist mode). If there are N engines, submitting a single
>> batch takes S seconds, and the workload takes W seconds to execute,
>> then if W < N*S the engine will be idle between batches. For example,
>> if N is 4, W is 2us, and S is 1us, then the engine will be idle some
>> 50% of the time.
>>
>> This wouldn't be an issue for more realistic workloads, where W >> S.
>> It only looks problematic because of the trivial nature of the work.
>
> Can you post the numbers that you get?
>
> I seem to get massive variability on my BDW. The render ring always
> gives me around 2.9us/batch but the other rings sometimes give me in the
> region of 1.2us and sometimes 7-8us.

skylake# ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc1-dsg-10839-g5e5a29c-z-tvrtko-fwname x86_64)
Using GuC submission
render: 594,944 cycles: 3.366us/batch
bsd: 737,280 cycles: 2.715us/batch
blt: 833,536 cycles: 2.400us/batch
vebox: 710,656 cycles: 2.818us/batch
Slowest engine was render, 3.366us/batch
Total for all 4 engines is 11.300us per cycle, average 2.825us/batch
All 4 engines (parallel/64): 5,324,800 cycles, average 1.878us/batch, overlap 90.1%
Subtest basic: SUCCESS (18.013s)

These are the results of running the modified test on SKL with GuC 
submission.
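
As a quick sanity check on the figures above, here is a small Python
sketch (not part of the test itself) that recomputes the total and the
average from the four per-engine times and compares the average with the
parallel/64 result. The variable names are made up for illustration; the
numbers are copied from the output. The "overlap" figure is not
recomputed, because its formula isn't given here.

```python
# Per-engine times in us/batch, copied from the test output above.
per_engine_us = {"render": 3.366, "bsd": 2.715, "blt": 2.400, "vebox": 2.818}

# Average us/batch reported for "All 4 engines (parallel/64)" in the same run.
parallel_us = 1.878

total_us = sum(per_engine_us.values())
average_us = total_us / len(per_engine_us)
slowest = max(per_engine_us, key=per_engine_us.get)

# Ratio of the sequential per-batch average to the parallel per-batch average.
speedup = average_us / parallel_us

print(f"total {total_us:.3f}us per cycle, average {average_us:.3f}us/batch")
print(f"slowest engine: {slowest}")
print(f"parallel/64 is about {speedup:.2f}x faster per batch")
```

The recomputed total differs from the printed 11.300us in the last
digit, presumably because the test sums unrounded values.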

If the GPU could execute a trivial batch in less time than it takes the 
CPU to submit one, then CPU/driver/GuC performance would become the 
determining factor -- every batch would be completed before the next one 
was submitted to the GPU even when they're going to the same engine.

If the GPU takes longer to execute a batch than N times the time taken 
for the driver to submit it (where N is the number of engines), then the 
GPU performance would become the limiting factor; the CPU would be able 
to hand out one batch to each engine, and by the time it returned to the 
first, that engine would still not be idle.

But in crossover territory, where the batch takes longer to execute than 
the time to submit it, but less than N times as long, the round-robin 
burst size (number of batches sent to each engine before moving to the 
next) can make a big difference, primarily because the submission 
mechanism gets the opportunity to use dual submission and/or lite 
restore, effectively reducing the number of separate writes to the ELSP 
and hence the s/w overhead per batch.
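
To make the three cases concrete, here is a rough Python model. It is
only a sketch: it assumes a fixed per-batch submission cost S and a fixed
execution time W, ignores dual submission and lite restore entirely, and
the function names are hypothetical. How the exact boundaries (W == S,
W == N*S) are classified is an arbitrary choice, since the description
above doesn't cover them.

```python
def submission_regime(w_us, s_us, n_engines):
    """Classify a workload into the three cases described above.

    w_us: time for the GPU to execute one batch (W)
    s_us: time for the CPU/driver to submit one batch (S)
    n_engines: number of engines in the round-robin (N)
    """
    if w_us < s_us:
        return "submission-bound"   # batch done before the next submit
    if w_us > n_engines * s_us:
        return "gpu-bound"          # engine still busy when we come back
    return "crossover"              # burst size can matter here


def idle_fraction(w_us, s_us, n_engines):
    """Idle fraction of one engine under 1-at-a-time round-robin.

    Assumed model: each engine receives one batch every N*S, and is
    busy for W of that period (or the whole period if W is larger).
    """
    period = max(w_us, n_engines * s_us)
    return 1.0 - w_us / period


# The example quoted earlier in the thread: N = 4, W = 2us, S = 1us.
print(submission_regime(2.0, 1.0, 4))  # prints "crossover"
print(idle_fraction(2.0, 1.0, 4))      # prints 0.5
```

With N = 4, W = 2us and S = 1us this reproduces the 50% idle figure from
the earlier mail, and puts that workload in the crossover case.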

Note that SKL GuC firmware 6.1 didn't support dual submission or lite 
restore, whereas the next version (8.11) does. Therefore, with the 6.1 
firmware we don't see the same slowdown when going to 1-at-a-time 
round-robin. I have a different (new) test that shows this more clearly.

.Dave.