public inbox for kernelci@lists.linux.dev
* Dealing with test results
@ 2018-07-17 13:39 Guillaume Tucker
  2018-07-18 19:37 ` [kernelci] " dan.rue
  2018-07-26  7:24 ` Ana Guerrero Lopez
  0 siblings, 2 replies; 7+ messages in thread
From: Guillaume Tucker @ 2018-07-17 13:39 UTC (permalink / raw)
  To: kernelci

Hi,

As we're expanding the number of tests being run, one crucial point
to consider is how to store the results in the backend.  It needs
to be designed in such a way as to enable relevant reports, searches,
visualisation and a remote API.  It's also important to be able to
detect regressions and run bisections with the correct data set.

So on one hand, I think we can start revisiting what we have in our
database model.  On the other hand, we need to think about the
useful information we want to be able to extract from the database.


At the moment, we have 3 collections to store these results.  Here's
a simplified model:

test suite
* suite name
* build info (revision, defconfig...)
* lab name
* test sets
* test cases

test set
* set name
* test cases

test case
* case name
* status
* measurements

Here's an example:

   https://staging.kernelci.org/test/suite/5b489cc8cf3a0fe42f9d9145/

The first thing I can see here is that we don't actually use the test
sets: each test suite has exactly one test set called "default", with
all the test cases stored both in the suite and the set.  So I think
we could simplify things by having only 2 collections: test suite and
test case.  Does anyone know what the test sets were intended for?
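
For illustration, a simplified model with only two collections could
look roughly like this (a sketch only, the field names below are
placeholders rather than the actual schema):

  # Hypothetical documents for a two-collection model (names are
  # illustrative, not the real schema).
  test_suite = {
      "_id": "suite-object-id",
      "name": "example-suite",
      "lab_name": "example-lab",
      "build": {
          "revision": "v4.18-rc5",
          "defconfig": "x86_64_defconfig",
      },
  }

  test_case = {
      "_id": "case-object-id",
      "suite_id": "suite-object-id",     # reference to the test suite
      "name": "example-case",
      "status": "pass",                  # error, fail, pass or skip
      "measurements": {"duration": 1.23},
  }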


Then the next thing to look into is actually about the results
themselves.  They are currently stored as "status" and
"measurements".  Status can take one of 4 values: error, fail, pass
or skip.  Measurements are an arbitrary dictionary.  This works fine
when the test case has an absolute pass/fail result, and when the
measurement is only additional information such as the time it took
to run it.

It's not that simple for test results which use the measurement to
determine the pass/fail criteria.  For these, there needs to be some
logic with some thresholds stored somewhere to determine whether the
measurement results in pass or fail.  This could either be done as
part of the test case, or in the backend.  Then some similar logic
needs to be run to detect regressions, as some tests don't have an
absolute threshold but must not give lower scores than previous
runs.

It seems to me that having all the logic related to the test case
stored in the test definition would be ideal, to keep it
self-contained.  For example, previous test results could be fetched
from the backend API and passed as meta-data to the LAVA job to
determine whether the new result is a pass or fail.  The concept of
pass/fail in this case may actually not be very accurate; rather, a
score drop needs to be detected as a regression.  The advantage of
this approach is that there is no need for any test-specific logic in
the backend, regressions would still just be based on the status
field.
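
As a rough sketch of how that could look on the device side, assuming
the previous score and an allowed margin are passed in through the
job meta-data as environment variables (all the names below are made
up for the example):

  import os

  def run_benchmark():
      # Placeholder for the actual test: return the measured score.
      return 42.0

  # Hypothetical values injected by the job generator from previous
  # results fetched via the backend API.
  previous_score = float(os.environ.get("PREVIOUS_SCORE", "0"))
  max_drop_percent = float(os.environ.get("MAX_DROP_PERCENT", "5"))

  score = run_benchmark()

  # No absolute threshold: the result only becomes a fail (i.e. a
  # regression) if the score dropped by more than the allowed margin.
  limit = previous_score * (1 - max_drop_percent / 100)
  status = "pass" if score >= limit else "fail"
  print("score: {}  status: {}".format(score, status))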

How does that all sound?


This is only a starting point; as Kevin mentioned, we should probably
get some advice from experts in the field of data in general.  This
thread could easily split into several to discuss the different
aspects of this issue.  Still, reviewing what we have and making basic
changes should help us scale to what we're currently aiming for with
small test suites.

Then the second part of this discussion would be, what do we want to
get out of the database? (emails, visualisation, post-processing...)
It seems worth gathering people's thoughts on this and looking for
some common ground.

Best wishes,
Guillaume

* Dealing with test results
@ 2018-11-12 13:58 Guillaume Tucker
  2018-11-12 14:47 ` [kernelci] " Milosz Wasilewski
  0 siblings, 1 reply; 7+ messages in thread
From: Guillaume Tucker @ 2018-11-12 13:58 UTC (permalink / raw)
  To: kernelci

A recurring topic is how to deal with test results, from the
point in the test code where they're generated to how the result
is stored in a database.  This was brought up again during last
week's meeting while discussing kernel warnings in boot tests, so
let's take another look at it and try to break it down into
smaller problems to solve:


* generating the test results

Each test suite currently has its own way of generating test
results, typically with some arbitrary format on stdout.  This
means a custom parser for each test suite, which is tedious to
maintain and error-prone, but a first step at getting results.
In some cases, such as boot testing, there isn't any real
alternative.

There are, however, several standards for encoding test results; I
think this is being discussed quite a lot already (LKFT people?).
There's also a thread on linux-media about this, following the
work we've done with them to improve testing in that area of the
kernel:

  https://www.spinics.net/lists/linux-media/msg142520.html

The bottom line is: we need a good machine-readable test output
format, and we should try to align the test suites to be compatible
with it.
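
For example, a test suite could emit one structured record per test
case, which then needs no test-specific parser (the format below is
just an illustration, not a proposal for the actual standard):

  import json

  def report(name, status, **measurements):
      # One JSON object per test case on stdout: machine-readable,
      # and the same parser works for every test suite.
      print(json.dumps({
          "test_case": name,
          "status": status,              # pass, fail, skip or error
          "measurements": measurements,
      }))

  report("boot", "pass", time=12.3)
  report("dmesg-warnings", "fail", count=2)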


* handling the test results

The next step is about writing the results somewhere: on the
console, in a file, or even directly to a remote API.  Some
devices may not have a functional network interface with access
to the internet, so it's hard to require the devices to push
results directly.  The least common denominator is that the
results need to eventually land in the database, so how they get
there isn't necessarily relevant.  It is useful though to store
the full log of the job to do some manual investigation later on.


* importing the results

This is the crucial part here in my opinion: turning the output
of a test suite into results that can be stored into the
database.  It can be done in several places, and how to do it is
often directly linked to the definition of the test: the format
of the results may depend on options passed when calling the test
etc...

The standard way to do this with LAVA is to call "lava-test-case"
and "lava-test-set" while the test is running on the device, then
have the resulting data sent to a remote API via the callback
mechanism.  This seems to be working rather well, with some
things that can probably be improved (sub-groups limited to 1
level, noise in the log with LAVA messages...).

Another place where this could be done is on a test server,
between the device and the database API.  In the case of LAVA,
this may be the dispatcher which has direct access to the device.
I believe this is how the "regex" pattern approach works.  The
drawback here is that the test server needs to have the
capability to parse the results, so doing custom things may not
always be possible.

Then it's also possible in principle to send all the raw results
as-is to a remote API, which would parse them itself and store them
in the database directly.  The difference with the current LAVA
callback approach is that the callback already provides the pass/fail
data populated for each test case.  It seems to me that adding more
parsing capability in the backend is only sustainable if the
results are provided in a structured format, as having test-suite
specific parsers in the backend is bound to break when the test
suites change.
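
To make that concrete, a generic importer in the backend could look
roughly like this, assuming the results arrive as the kind of
structured payload described above and using the simplified
two-collection model from the July thread (pymongo-style calls, field
names are illustrative):

  def import_results(payload, db):
      # payload: structured results as posted to the remote API, e.g.
      # {"suite": "example-suite", "lab": "example-lab",
      #  "build": {...}, "test_cases": [{"name": ..., "status": ...}]}
      # db: a pymongo-style database handle.
      suite_id = db["test_suite"].insert_one({
          "name": payload["suite"],
          "lab_name": payload["lab"],
          "build": payload.get("build", {}),
      }).inserted_id

      for case in payload["test_cases"]:
          db["test_case"].insert_one({
              "suite_id": suite_id,
              "name": case["name"],
              "status": case["status"],
              "measurements": case.get("measurements", {}),
          })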


* determining pass / fail based on measurements

While it's not something we've been doing much yet, we can't
ignore this aspect either.  Things like power consumption, CPU
cycles and memory usage don't have an absolute pass/fail criteria
but are based on previous results.  We need to be able to access
at least the last results for the same test configuration to
determine whether a new result is pass or fail.

To keep all the test logic self-contained, it could also be done
on the device as long as it has access to the required
data (i.e. last result and relative thresholds).  With LAVA, the
part doing the test generation (a Python script in Jenkins) could
query the database API for this and pass the data to the test
suite via the job definition.

Just like for importing the results, this could be done
elsewhere (test server, database API...) but I think it would
also mean fragmenting the test suite definition.
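
For instance, the script generating the job could do something along
these lines before submitting it (only a sketch; the API endpoint,
parameters and job fields are made up for the example):

  import requests

  API = "https://api.example.org"   # backend URL, for illustration

  def add_reference_results(job, suite, case_name, token):
      # Hypothetical query for the last result of the same test
      # configuration; the real endpoint and parameters may differ.
      resp = requests.get(
          "{}/test/case".format(API),
          params={"suite": suite, "name": case_name, "limit": 1},
          headers={"Authorization": token})
      results = resp.json().get("result", [])
      if results:
          # Pass the previous measurements to the device through the
          # job definition meta-data.
          job.setdefault("metadata", {})["previous"] = \
              results[0].get("measurements", {})
      return job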


* storing the results

Right now the kernelci-backend API has a built-in Mongo database.
As previously discussed, it would seem good to be able to replace
the database with any engine and keep the logic we have as a
separate service.  That way, we could still use our
kernelci-specific code that tracks regressions, triggers
bisections, sends emails etc... but also enable arbitrary
searches and direct access to the test data.
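
One way to get there could be a thin storage interface that the
kernelci-specific services code against, so the engine behind it can
be swapped (a minimal sketch, class and method names are made up):

  import abc

  class ResultStore(abc.ABC):
      # Interface used by the regression tracking, bisection and
      # email services, independent of the actual database engine.

      @abc.abstractmethod
      def save_test_case(self, case):
          raise NotImplementedError

      @abc.abstractmethod
      def last_result(self, suite, case_name):
          raise NotImplementedError

  class MongoStore(ResultStore):
      def __init__(self, db):
          self._db = db              # pymongo-style database handle

      def save_test_case(self, case):
          return self._db["test_case"].insert_one(case).inserted_id

      def last_result(self, suite, case_name):
          return self._db["test_case"].find_one(
              {"suite": suite, "name": case_name},
              sort=[("created_on", -1)])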


I think that covers the main aspects of the problem; now we
probably need to look for a few solutions.  Overall, it would
seem like a good idea to describe a "reference" workflow based on
LAVA with a standard format for test results etc...  Still, we
have to allow non-LAVA labs and other test frameworks to
contribute to KernelCI, especially as new members are likely to be
joining the LF project with their own mature test infrastructure.

Does anyone see anything important missing here?  We could start
a wiki page to explain what the reference workflow would look
like.

Cheers,
Guillaume

