From: Cleber Rosa <crosa@redhat.com>
To: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: "Thomas Huth" <thuth@redhat.com>,
	virt-ci-maint-team@redhat.com, qemu-devel@nongnu.org,
	"Wainer dos Santos Moschetta" <wainersm@redhat.com>,
	"Willian Rampazzo" <wrampazz@redhat.com>,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: Re: [RFC PATCH-for-5.2] gitlab-ci: Do not automatically run Avocado integration tests anymore
Date: Mon, 30 Nov 2020 22:48:25 -0500	[thread overview]
Message-ID: <20201201034825.GA2343635@localhost.localdomain> (raw)
In-Reply-To: <20201127174110.1932671-1-philmd@redhat.com>

On Fri, Nov 27, 2020 at 06:41:10PM +0100, Philippe Mathieu-Daudé wrote:
> We lately realized that the Avocado framework was not designed
> to be regularly run on CI environments. Therefore, as of 5.2

Hi Phil,

First of all, let me say that I understand your overall goal, and
although I don't agree with the strategy, I believe we're in agreement
wrt the destination.

The main issue that you seem to address here is the fact that some CI
tests may fail more often than others, which will lead to jobs that
will fail more than others, which will ultimately taint the overall CI
status.  Does that sound like an appropriate overall grasp of your
motivation?

Assuming I got it right, let me say that having an "always green CI"
is a noble goal, but it can also be extremely frustrating if one
doesn't apply some safeguarding measures.  More on that later.

As you certainly know, I'm also interested in understanding the "not
designed to be regularly run on CI environments" part.  The best
correlation I could make was to link that to these two points you
raised elsewhere:

 1) Failing randomly
 2) Images hardcoded in tests are being removed from public servers

With regards to point #1, this is probably unavoidable as a whole.
I've had some experience running dedicated test jobs for close to a
decade, and maybe the only way to get close to avoiding random
failures in integration tests is to run close to nothing in those jobs.
Running "/bin/true" has a very low chance of failing randomly.

In my own experience, the only way to address point #1 is to
babysit jobs.  That means:

 a) assume they will produce some messy stuff at no particular time
 b) act as quickly and effectively as possible
 c) be compassionate, that is, waive the unavoidable mess incidents

Building on the previous analogy, if you decide not to have a baby
but a plant, you'll probably need to do a lot less of those.
If you get a pet, then some more.  Now a human baby will probably
(I guess) require a whole lot more of those.  And as those age
and reach maturity, they'll (hopefully) require less babysitting,
but they can still mess up at any given time.

Analogies and jokes aside, the urgent *action item* here has been
discussed both publicly, and internally at Red Hat. It consists of
having an "always on" maintainer for those jobs.  In the specific case
of the "Acceptance" jobs, Willian Rampazzo has volunteered to,
initially, be this person.  He'll manage all related information
on the jobs' issues.

We're still discussing the tools to use to give the visibility that
the QEMU project needs.  I personally would be happy enough to start
with a publicly accessible spreadsheet that builds upon the
information produced by GitLab.  A proper application is also being
considered.  A sample of the requirements includes:

   I) waive failures (say a job and tests failed because of a
      network outage)
  II) build trends (show how stable all executions of test "foo"
      were during the last: week, month, eternity, etc).
 III) keep a list of the known issues and relate them to waivers
      and currently skipped tests
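To make requirement (II) a bit more concrete, here is a rough sketch of
how per-test stability could be computed from past job results.  The
record format (simple test name/status pairs) is an assumption for
illustration, not GitLab's actual data model:

```python
from collections import defaultdict


def pass_rates(results):
    """Given (test_name, status) records collected from past jobs,
    return the fraction of runs in which each test passed."""
    totals = defaultdict(int)   # runs per test
    passes = defaultdict(int)   # passing runs per test
    for name, status in results:
        totals[name] += 1
        if status == "PASS":
            passes[name] += 1
    return {name: passes[name] / totals[name] for name in totals}


# A test that failed once in two runs has a 0.5 pass rate:
rates = pass_rates([("boot_linux", "PASS"), ("boot_linux", "FAIL"),
                    ("migration", "PASS")])
```

A real tool would bucket these rates by week/month windows, but the
trend question reduces to this kind of aggregation.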

Getting back to point #2, I have two main points about it.  First is
that I've had a lot of experience with tests having copies of images,
both on local filesystems and on nearby NFS servers.  Local
filesystems would fail at provisioning/sync time, and NFS-based ones
would fail every now and then for various network or NFS server
issues.  It goes back to my point about not being able to escape the
babysitting ever.

Second is that this is somehow related to features and improvements
that could/should be added to whatever supporting code (aka
framework) we use.  Right now, we have some specific features
scheduled to be introduced in Avocado 84.0 (due in ~2 weeks).  They
can be seen with the "customer:QEMU" label:

  https://github.com/avocado-framework/avocado/issues?q=is%3Aissue+is%3Aopen+label%3Acustomer%3AQEMU

A number of other features have already landed on previous versions,
but I was unable to send patches in time for 5.2, so my expectation is
to bundle more of them and bump Avocado to 84.0 at once (instead of
82.0 or 83.0).

> we deprecate the gitlab-ci jobs using Avocado. To not disrupt
> current users, it is possible to keep the current behavior by
> setting the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE variable
> (see [*]).
> From now on, using these jobs (or adding new tests to them)
> is strongly discouraged.
>

These jobs run `make check-acceptance`, which will pick up new tests.
So how do you suggest we *not add new tests* to those jobs?  Are
you suggesting that no new acceptance test be added?

> Tests based on Avocado will be ported to new job schemes during
> the next releases, with better documentation and templates.
>

This is a very good approach to move forward.  For what it's worth,
Avocado has invested in an API specifically for that, the Job API.
The goal is to have smarter jobs for different purposes that behave
appropriately and account for the environment (such as host platform,
CI, etc.).  Example of my upcoming "job-kvm-only.py":

------------------------------------------------------------------------------

#!/usr/bin/env python3

import os
import sys

from qemu.accel import kvm_available
from avocado.core.job import Job


def main():
    # Nothing to do on hosts without KVM support; exit successfully.
    if not kvm_available():
        sys.exit(0)

    # Run only the acceptance tests tagged for KVM on the host architecture.
    config = {'run.references': ['tests/acceptance/'],
              'filter.by_tags.tags': ['accel:kvm,arch:%s' % os.uname()[4]]}
    with Job.from_config(config) as job:
        return job.run()


if __name__ == '__main__':
    sys.exit(main())

------------------------------------------------------------------------------

Other examples of Jobs using this API can be seen here:

   https://github.com/avocado-framework/avocado/tree/master/examples/jobs

And the documentation on the features one can use by setting
configuration keys can be found here:

   https://avocado-framework.readthedocs.io/en/83.0/config/index.html

So for example, if one wants to ignore errors while fetching assets in a job,
there is this:

   https://avocado-framework.readthedocs.io/en/83.0/config/index.html#assets-fetch-ignore-errors
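Assuming that key can be passed through the Job API configuration dict
the same way as the keys in the "job-kvm-only.py" example above (a
guess on my part, to be confirmed against the docs linked), it would
look like:

```python
# Hypothetical job config: tolerate asset-fetch failures instead of
# failing the whole job.  'assets.fetch.ignore_errors' is the key
# documented in the Avocado 83.0 configuration reference.
config = {'run.references': ['tests/acceptance/'],
          'assets.fetch.ignore_errors': True}
```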

> [*] https://docs.gitlab.com/ee/ci/variables/README.html#create-a-custom-variable-in-the-ui
> 
> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
> ---
>  .gitlab-ci.yml | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
> index d0173e82b16..2674407cd13 100644
> --- a/.gitlab-ci.yml
> +++ b/.gitlab-ci.yml
> @@ -66,6 +66,15 @@ include:
>      - cd build
>      - python3 -c 'import json; r = json.load(open("tests/results/latest/results.json")); [print(t["logfile"]) for t in r["tests"] if t["status"] not in ("PASS", "SKIP", "CANCEL")]' | xargs cat
>      - du -chs ${CI_PROJECT_DIR}/avocado-cache
> +  rules:
> +  # As of QEMU 5.2, Avocado is not yet ready to run in CI environments, therefore
> +  # the jobs based on this template are not run automatically (except if the user
> +  # explicitly sets the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE environment
> +  # variable). Adding new jobs on top of this template is strongly discouraged.
> +  - if: $QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE == null
> +    when: manual
> +    allow_failure: true
> +  - when: always
>

I believe the best way to move forward is a bit different from what
you propose here.  I'd go with the babysitting approach, and
aggressively disable tests at the first sign of failure, instead of
"muting" all of them at once.  My perception is that without the
babysitting and quick actions, new jobs will end up in the same
situation once enough tests are added.

Anyway, thanks for bringing up the discussion here.

- Cleber.

>  build-system-ubuntu:
>    <<: *native_build_job_definition
> -- 
> 2.26.2
> 


