Re: Making IGT runnable by CI and developers

From: Jani Nikula <jani.nikula@linux.intel.com>
To: Daniel Vetter <daniel@ffwll.ch>,
	Martin Peres <martin.peres@linux.intel.com>
Cc: "intel-gfx@lists.freedesktop.org" <intel-gfx@lists.freedesktop.org>
Subject: Re: Making IGT runnable by CI and developers
Date: Fri, 21 Jul 2017 18:13:37 +0300	[thread overview]
Message-ID: <87mv7xn7q6.fsf@intel.com> (raw)
In-Reply-To: <CAKMK7uFzd9JQLLV9_5hZD13Htr56gHdXqCScgkc7xDTPiZ4YYg@mail.gmail.com>

On Fri, 21 Jul 2017, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, Jul 20, 2017 at 6:23 PM, Martin Peres
> <martin.peres@linux.intel.com> wrote:
>> Hi everyone,
>>
>> As some of you may already know, we have made great strides in making our CI
>> system usable, especially in the last 6 months when everything started
>> clicking together.
>>
>> The CI team is no longer overwhelmed with fires and bug reports, so we
>> started working on increasing the coverage from just fast-feedback, to a
>> bigger set of IGT tests.
>>
>> As some of you may know, running IGT has been a challenge that few manage to
>> overcome. Not only is the execution time counted in machine months, but it
>> can also lead to disk corruption, which does not encourage developers to run
>> it either. One test takes 21 days, on its own, and it is a subset of another
>> test which we never ran for obvious reasons.
>>
>> I would thus like to get the CI team and developers to work together to
>> decrease sharply the execution time of IGT, and get these tests run multiple
>> times per day!
>>
>> There are three usages that the CI team envision (up for debate):
>>  - Basic acceptance testing: Meant for developers and CI to check quickly if
>> a patch series is not completely breaking the world (< 10 minutes, timeout
>> per test of 30s)
>>  - Full run: Meant to be ran overnight by developers and users (< 6 hours)
>>  - Stress tests: They can be in the test suite as a way to catch rare
>> issues, but they cannot be part of the default run mode. They likely should
>> be run on a case-by-case basis, on demand of a developer. Each test could be
>> allowed to take up to 1h.
>>
>> There are multiple ways of getting to this situation (up for debate):
>>
>>  1) All the tests exposed by default are fast and meant to be run:
>>   - Fast-feedback is provided by a testlist, for BAT
>>   - Stress tests ran using a special command, kept for on-demand testing
>>
>>  2) Tests are all tagged with information about their exec time:
>>   - igt@basic@.*: Meant for BAT
>>   - igt@complete@.*: Meant for FULL
>>   - igt@stress@.*: The stress tests

Ugh. I don't want any scheme that relies on modifying or renaming the
tests themselves to categorize them. IMO the names of tests should only
be informative. Categorization should be external to that.

>>
>>  3) Testlists all the way:
>>   - fast-feedback: for BAT
>>   - all: the tests that people are expected to run (CI will run them)
>>   - Stress tests will not be part of any testlist.
>>
>> Whatever decision is being accepted, the CI team is mandating global
>> timeouts for both BAT and FULL testing, in order to guarantee throughput.
>> This will require the team as a whole to agree on time quotas per
>> sub-systems, and enforce them.
>>
>> Can we try to get some healthy debate and reach a consensus on this? Our CI
>> efforts are being limited by this issue right now, and we will be doing
>> whatever we can until the test suite becomes saner and runnable, but this
>> may be unfair to some developers.
>>
>> Looking forward to some constructive feedback and intelligent discussions!
>> Martin
>
> Imo the critical bit for the full run (which should regression test
> all features while being fast enough that we can use it for pre-merge
> testing) must be the default set. Default here means what you get
> without any special cmdline options (to either the test or piglit),
> and without any special testlist that are separately maintained.
> Default also means that it will be included by default if you do a new
> testcase. There's two reasons for that:
>
> - Maintaining a separate test list is a pain. Also, it encourages
> adding tons of tests that no one runs.
>
> - If tests aren't run by default we can't test them pre-merging before
> they land in igt and wreak havoc.

I agree the goal should be to run all tests by default. And this means
we should start being more critical of the tests we add.

For stress tests I would like to look more into splitting up the tests
in a way that you could run one iteration fast (as part of default), and
repeat the tests for more stress and coverage. I don't know how feasible
this is, and if it requires carrying over state from one iteration to
other, but I like the goal of running also some of this by default. This
would better catch silly bugs in tests too. (We discussed this offline
with Martin and Tomi.)

> Second, we must have a reasonable runtime, and reasonable runtime here
> means a few hours of machine time for everything, total. There's two
> reasons for that:
> - Only pre-merge is early enough to catch regressions. We can lament
> all day long, but fact is that post-merge regressions don't get fixed
> or handled in a timely manner, except when they're really serious.
> This means any testing strategy that depends upon lots of post-merge
> testing, or expects such post-merge testing to work, is bound to fail.
> Either we can test everything pre-merge, or there's no regression
> testing at all.

It's rarely as black and white as you make it out to be, but it's easy
to agree pre-merge is the thing that really motivates people to figure
stuff out because it blocks their patch from being merged.

> - We can't mix together multiple patch series bisect autobisecting is
> too unreliable. I've been promised an autobisector for 3 years by
> about 4 different teams now, making that happen in a reliable way is
> _really_ hard. Blocking CI on this is not reasonable.
>
> Also, the testsuite really should be fast enough that developers can
> run it locally on their machines in a work day. Current plan is that
> we can only test on HSW for now, until more budget appears (again, we
> can lament about this, but it's not going to change), which means
> developers _must_ be able to run stuff on e.g. SKL in a reasonable
> amount of time.
>
> Right now we have a runtime of the gem|prime tests of around 24 days,
> and estimated 10 months for the stress tests included. I think the
> actual machine time we'll have available in the near future, on this
> HSW farm is going to allow 2-3h for gem tests. That's the time budget
> for this default set of regression tests.
>
> Wrt actually implementing it: I don't care, as long as it fulfills the
> above. So tagging, per-test comdline options, outright deleting all
> the tests we can't run anyway, disabling them in the build system or
> whatever else is all fine with me, as long as the default set doesn't
> require any special action. For tags this would mean that untagged
> tests are _all_ included.

Off-topic, but IMO test lists are just an implementation of tags. In
that sense, we already have tagging.

BR,
Jani.

-- 
Jani Nikula, Intel Open Source Technology Center
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx