public inbox for kernelci@lists.linux.dev
* Dealing with test results
@ 2018-07-17 13:39 Guillaume Tucker
  2018-07-18 19:37 ` [kernelci] " dan.rue
  2018-07-26  7:24 ` Ana Guerrero Lopez
  0 siblings, 2 replies; 7+ messages in thread
From: Guillaume Tucker @ 2018-07-17 13:39 UTC (permalink / raw)
  To: kernelci

Hi,

As we're expanding the number of tests being run, one crucial point
to consider is how to store the results in the backend.  It needs
to be designed in such a way as to enable relevant reports, searches,
visualisation and a remote API.  It's also important to be able to
detect regressions and run bisections with the correct data set.

So on one hand, I think we can start revisiting what we have in our
database model.  On the other hand, we need to think about the
useful information we want to be able to extract from the database.


At the moment, we have 3 collections to store these results.  Here's
a simplified model:

test suite
* suite name
* build info (revision, defconfig...)
* lab name
* test sets
* test cases

test set
* set name
* test cases

test case
* case name
* status
* measurements

Here's an example:

   https://staging.kernelci.org/test/suite/5b489cc8cf3a0fe42f9d9145/

The first thing I can see here is that we don't actually use the test
sets: each test suite has exactly one test set called "default", with
all the test cases stored both in the suite and the set.  So I think
we could simplify things by having only 2 collections: test suite and
test case.  Does anyone know what the test sets were intended for?
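
For illustration, a simplified model with only two collections could
look roughly like this (a sketch only, the field names below are
placeholders rather than the actual schema):

  # Hypothetical documents for a two-collection model (names are
  # illustrative, not the real schema).
  test_suite = {
      "_id": "suite-object-id",
      "name": "example-suite",
      "lab_name": "example-lab",
      "build": {
          "revision": "v4.18-rc5",
          "defconfig": "x86_64_defconfig",
      },
  }

  test_case = {
      "_id": "case-object-id",
      "suite_id": "suite-object-id",     # reference to the test suite
      "name": "example-case",
      "status": "pass",                  # error, fail, pass or skip
      "measurements": {"duration": 1.23},
  }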


Then the next thing to look into is actually about the results
themselves.  They are currently stored as "status" and
"measurements".  Status can take one of 4 values: error, fail, pass
or skip.  Measurements are an arbitrary dictionary.  This works fine
when the test case has an absolute pass/fail result, and when the
measurement is only additional information such as the time it took
to run it.

It's not that simple for test results which use the measurement to
determine the pass/fail criteria.  For these, there needs to be some
logic with some thresholds stored somewhere to determine whether the
measurement results in pass or fail.  This could either be done as
part of the test case, or in the backend.  Then some similar logic
needs to be run to detect regressions, as some tests don't have an
absolute threshold but must not give lower scores than previous
runs.

It seems to me that having all the logic related to the test case
stored in the test definition would be ideal, to keep it
self-contained.  For example, previous test results could be fetched
from the backend API and passed as meta-data to the LAVA job to
determine whether the new result is a pass or fail.  The concept of
pass/fail in this case may actually not be very accurate; rather, a
score drop needs to be detected as a regression.  The advantage of
this approach is that there is no need for any test-specific logic in
the backend, regressions would still just be based on the status
field.
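
As a rough sketch of how that could look on the device side, assuming
the previous score and an allowed margin are passed in through the
job meta-data as environment variables (all the names below are made
up for the example):

  import os

  def run_benchmark():
      # Placeholder for the actual test: return the measured score.
      return 42.0

  # Hypothetical values injected by the job generator from previous
  # results fetched via the backend API.
  previous_score = float(os.environ.get("PREVIOUS_SCORE", "0"))
  max_drop_percent = float(os.environ.get("MAX_DROP_PERCENT", "5"))

  score = run_benchmark()

  # No absolute threshold: the result only becomes a fail (i.e. a
  # regression) if the score dropped by more than the allowed margin.
  limit = previous_score * (1 - max_drop_percent / 100)
  status = "pass" if score >= limit else "fail"
  print("score: {}  status: {}".format(score, status))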

How does that all sound?


This is only a starting point; as Kevin mentioned, we should probably
get some advice from experts in the field of data in general.  This
thread could easily split into several to discuss the different
aspects of this issue.  Still, reviewing what we have and making basic
changes should help us scale to what we're currently aiming for with
small test suites.

Then the second part of this discussion would be, what do we want to
get out of the database? (emails, visualisation, post-processing...)
It seems worth gathering people's thoughts on this and looking for
some common ground.

Best wishes,
Guillaume

* Dealing with test results
@ 2018-11-12 13:58 Guillaume Tucker
  2018-11-12 14:47 ` [kernelci] " Milosz Wasilewski
  0 siblings, 1 reply; 7+ messages in thread
From: Guillaume Tucker @ 2018-11-12 13:58 UTC (permalink / raw)
  To: kernelci

A recurring topic is how to deal with test results, from the
point in the test code where they're generated to how the result
is stored in a database.  This was brought up again during last
week's meeting while discussing kernel warnings in boot tests, so
let's take another look at it and try to break it down into
smaller problems to solve:


* generating the test results

Each test suite currently has its own way of generating test
results, typically with some arbitrary format on stdout.  This
means a custom parser for each test suite, which is tedious to
maintain and error-prone, but a first step at getting results.
In some cases, such as boot testing, there isn't any real
alternative.

There are, however, several standards for encoding test results; I
think this is being discussed quite a lot already (LKFT people?).
There's also a thread on linux-media about this, following the
work we've done with them to improve testing in that area of the
kernel:

  https://www.spinics.net/lists/linux-media/msg142520.html

The bottom line is: we need a good machine-readable test output
format, and we should try to align the test suites to be compatible
with it.
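
For example, a test suite could emit one structured record per test
case, which then needs no test-specific parser (the format below is
just an illustration, not a proposal for the actual standard):

  import json

  def report(name, status, **measurements):
      # One JSON object per test case on stdout: machine-readable,
      # and the same parser works for every test suite.
      print(json.dumps({
          "test_case": name,
          "status": status,              # pass, fail, skip or error
          "measurements": measurements,
      }))

  report("boot", "pass", time=12.3)
  report("dmesg-warnings", "fail", count=2)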


* handling the test results

The next step is about writing the results somewhere: on the
console, in a file, or even directly to a remote API.  Some
devices may not have a functional network interface with access
to the internet, so it's hard to require the devices to push
results directly.  The least common denominator is that the
results need to eventually land in the database, so how they get
there isn't necessarily relevant.  It is useful though to store
the full log of the job to do some manual investigation later on.


* importing the results

This is the crucial part here in my opinion: turning the output
of a test suite into results that can be stored into the
database.  It can be done in several places, and how to do it is
often directly linked to the definition of the test: the format
of the results may depend on options passed when calling the test
etc...

The standard way to do this with LAVA is to call "lava-test-case"
and "lava-test-set" while the test is running on the device, then
have the resulting data sent to a remote API via the callback
mechanism.  This seems to be working rather well, with some
things that can probably be improved (sub-groups limited to 1
level, noise in the log with LAVA messages...).

Another place where this could be done is on a test server,
between the device and the database API.  In the case of LAVA,
this may be the dispatcher which has direct access to the device.
I believe this is how the "regex" pattern approach works.  The
drawback here is that the test server needs to have the
capability to parse the results, so doing custom things may not
always be possible.

Then it's also possible in principle to send all the raw results
as-is to a remote API, which would parse them itself and store them
in the database directly.  The difference with the current LAVA
callback approach is that the callback already provides the pass/fail
data populated for each test case.  It seems to me that adding more
parsing capability in the backend is only sustainable if the
results are provided in a structured format, as having test-suite
specific parsers in the backend is bound to break when the test
suites change.
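
To make that concrete, a generic importer in the backend could look
roughly like this, assuming the results arrive as the kind of
structured payload described above and using the simplified
two-collection model from the July thread (pymongo-style calls, field
names are illustrative):

  def import_results(payload, db):
      # payload: structured results as posted to the remote API, e.g.
      # {"suite": "example-suite", "lab": "example-lab",
      #  "build": {...}, "test_cases": [{"name": ..., "status": ...}]}
      # db: a pymongo-style database handle.
      suite_id = db["test_suite"].insert_one({
          "name": payload["suite"],
          "lab_name": payload["lab"],
          "build": payload.get("build", {}),
      }).inserted_id

      for case in payload["test_cases"]:
          db["test_case"].insert_one({
              "suite_id": suite_id,
              "name": case["name"],
              "status": case["status"],
              "measurements": case.get("measurements", {}),
          })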


* determining pass / fail based on measurements

While it's not something we've been doing much yet, we can't
ignore this aspect either.  Things like power consumption, CPU
cycles and memory usage don't have an absolute pass/fail criteria
but are based on previous results.  We need to be able to access
at least the last results for the same test configuration to
determine whether a new result is pass or fail.

To keep all the test logic self-contained, it could also be done
on the device as long as it has access to the required
data (i.e. last result and relative thresholds).  With LAVA, the
part doing the test generation (a Python script in Jenkins) could
query the database API for this and pass the data to the test
suite via the job definition.

Just like for importing the results, this could be done
elsewhere (test server, database API...) but I think it would
also mean fragmenting the test suite definition.
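
For instance, the script generating the job could do something along
these lines before submitting it (only a sketch; the API endpoint,
parameters and job fields are made up for the example):

  import requests

  API = "https://api.example.org"   # backend URL, for illustration

  def add_reference_results(job, suite, case_name, token):
      # Hypothetical query for the last result of the same test
      # configuration; the real endpoint and parameters may differ.
      resp = requests.get(
          "{}/test/case".format(API),
          params={"suite": suite, "name": case_name, "limit": 1},
          headers={"Authorization": token})
      results = resp.json().get("result", [])
      if results:
          # Pass the previous measurements to the device through the
          # job definition meta-data.
          job.setdefault("metadata", {})["previous"] = \
              results[0].get("measurements", {})
      return job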


* storing the results

Right now the kernelci-backend API has a built-in Mongo database.
As previously discussed, it would seem good to be able to replace
the database with any engine and keep the logic we have as a
separate service.  That way, we could still use our
kernelci-specific code that tracks regressions, triggers
bisections, sends emails etc... but also enable arbitrary
searches and direct access to the test data.
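
One way to get there could be a thin storage interface that the
kernelci-specific services code against, so the engine behind it can
be swapped (a minimal sketch, class and method names are made up):

  import abc

  class ResultStore(abc.ABC):
      # Interface used by the regression tracking, bisection and
      # email services, independent of the actual database engine.

      @abc.abstractmethod
      def save_test_case(self, case):
          raise NotImplementedError

      @abc.abstractmethod
      def last_result(self, suite, case_name):
          raise NotImplementedError

  class MongoStore(ResultStore):
      def __init__(self, db):
          self._db = db              # pymongo-style database handle

      def save_test_case(self, case):
          return self._db["test_case"].insert_one(case).inserted_id

      def last_result(self, suite, case_name):
          return self._db["test_case"].find_one(
              {"suite": suite, "name": case_name},
              sort=[("created_on", -1)])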


I think that covers the main aspects of the problem; now we
probably need to look for a few solutions.  Overall, it would
seem like a good idea to describe a "reference" workflow based on
LAVA with a standard format for test results etc...  Still, we
have to allow non-LAVA labs and other test frameworks to
contribute to KernelCI, especially as new members are likely to be
joining the LF project with their own mature test infrastructure.

Does anyone see anything important missing here?  We could start
a wiki page to explain what the reference workflow would look
like.

Cheers,
Guillaume

