Re: [PATCH v3] test/service: fix spurious failures by extending timeout

From: Thomas Monjalon <thomas@monjalon.net>
To: "Van Haaren, Harry" <harry.van.haaren@intel.com>
Cc: "David Marchand" <david.marchand@redhat.com>,
	dev@dpdk.org, "dpdklab@iol.unh.edu" <dpdklab@iol.unh.edu>,
	"ci@dpdk.org" <ci@dpdk.org>,
	"Honnappa.Nagarahalli@arm.com" <Honnappa.Nagarahalli@arm.com>,
	"mattias. ronnblom" <mattias.ronnblom@ericsson.com>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"Tyler Retzlaff" <roretzla@linux.microsoft.com>,
	"Aaron Conole" <aconole@redhat.com>,
	bruce.richardson@intel.com
Subject: Re: [PATCH v3] test/service: fix spurious failures by extending timeout
Date: Thu, 23 Feb 2023 21:15:03 +0100
Message-ID: <4205390.Fh7cpCN91P@thomas>
In-Reply-To: <BN0PR11MB571285C339B02AD6D6EE71E7D7D79@BN0PR11MB5712.namprd11.prod.outlook.com>

03/02/2023 17:09, Van Haaren, Harry:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 03/02/2023 16:03, Van Haaren, Harry:
> > > From: Van Haaren, Harry
> > > > > The timeout approach just does not have its place in a functional test.
> > > > > Either this test is rewritten, or it must go to the performance tests
> > > > > list so that we stop getting false positives.
> > > > > Can you work on this?
> > > >
> > > > I'll investigate various approaches on Thursday and reply here with suggested
> > > > next steps.
> > >
> > > I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
> > > have different delays: 100 ms, 200 ms and 1000 ms.
> > > In the CI, the service-core just hasn't been scheduled (yet) and causes the
> > > "failure".
> > >
> > > Option 1)
> > > One option is to while(1) loop, waiting for the service-thread to be scheduled.
> > > This can be seen as "increasing the timeout", however in this case the test-case
> > > would be errored not in the test-code, but in the meson-test runner as a timeout
> > > (with a 10sec default?)
> > > The benefit here is that massively increasing (~1sec or less to 10 sec) will cover
> > > all/many of the CI timeouts.
> > >
> > > Option 2)
> > > Move to perf-tests, and not run these in a noisy-CI environment where the
> > > results are not consistent enough to have value. This would mean that the
> > > tests are not run in CI.
> > > The 3 checks in question are below, they all *require* the service core to be
> > > scheduled:
> > > service_attr_get() -> requires service core to run for service stats to increment
> > > service_lcore_attr_get() -> requires service core to run for lcore stats to
> > > increment
> > > service_lcore_start_stop() -> requires service to run to ensure service-func
> > > itself executes.
> > >
> > > I don't see how we can "improve" option 2 to not require the service-thread to
> > > be scheduled by the OS..
> > > And the only way to make the OS schedule it in the CI more consistently is to
> > > give it more time?
> > 
> > We are talking about seconds.
> > There are setups where scheduling a thread is taking seconds?
> 
> Apparently so - otherwise these tests would always pass.
> 
> They *only* fail at random runs in CI, and reliably pass everywhere else.. I've not had
> them fail locally, and that includes running in a loop for hours with a busy system..
> but not a low-priority CI VM in a busy datacenter.
> 
> 
> [Bruce wrote in separate mail]

Bruce was not Cc'ed in this reply.

> >>> For me, the question is - why hasn't the service-core been scheduled? Can
> >>> we use sched-yield or some other mechanism to force a wakeup of it?
> 
> I'm not aware of a way to make *a specific other pthread* wake up.  We could sacrifice
> the current lcore that's waiting for the service-lcore, with a sched_yield() as you suggest.
> It would potentially "churn" the scheduler enough to give the service core some CPU?
> It's a guess/gamble in the end, kind of like the timeouts we have today..
> 
> > > Thoughts and input welcomed, I'm happy to make the code changes
> > > themselves, it's small effort for both option 1 & 2.
> > 
> > For time-sensitive tests, yes they should be in perf tests category.
> > As David said earlier, no timeout approach in functional tests.
> 
> Ok, as before, option 1) is to while(1) and wait for "success". Then there's
> no timeout in the test code, but our meson test runner will time-out/fail after ~10sec IIRC.
> 
> Or we move the tests to perf-tests, as per Option 2), and these simply won't run in CI.
> 
> I'm OK with all 3 (including testing with sched_yield() for a month or two to see if that helps?)
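
For reference, a rough sketch of what option 1 combined with the
sched_yield() idea could look like. This is not the DPDK test code: a plain
pthread stands in for the service lcore, and every name below
(service_calls, service_thread, wait_for_service_run, run_service_check) is
hypothetical. The point is only that the waiting side has no timeout of its
own: it yields the CPU until the other thread has demonstrably run, and a
thread that is never scheduled shows up as a test-runner timeout instead.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the service call counter and a stop flag. */
static atomic_uint_fast64_t service_calls;
static atomic_bool service_stop;

/* Stand-in for the service lcore: bumps the counter until told to stop. */
void *service_thread(void *arg)
{
	(void)arg;
	while (!atomic_load(&service_stop))
		atomic_fetch_add(&service_calls, 1);
	return NULL;
}

/*
 * No timeout in the test code: poll until the service thread has run at
 * least once, giving up the CPU on every iteration. If the thread is never
 * scheduled this never returns, and the external test runner's timeout is
 * what reports the failure.
 */
uint64_t wait_for_service_run(void)
{
	uint64_t calls;

	while ((calls = atomic_load(&service_calls)) == 0)
		sched_yield();
	return calls;
}

/*
 * Start the stand-in service thread, wait for it to run, then stop it.
 * Returns the observed call count, or 0 if the thread could not be created.
 */
uint64_t run_service_check(void)
{
	pthread_t tid;
	uint64_t calls;

	atomic_store(&service_calls, 0);
	atomic_store(&service_stop, false);
	if (pthread_create(&tid, NULL, service_thread, NULL) != 0)
		return 0;

	calls = wait_for_service_run();
	assert(calls > 0); /* the service function really executed */

	atomic_store(&service_stop, true);
	pthread_join(tid, NULL);
	return calls;
}
```

Calling run_service_check() from a test body returns a non-zero count once
the thread has run. Note that sched_yield() only gives up the caller's CPU;
it does not guarantee that the service thread is the one that gets
scheduled, so this is still the "guess/gamble" described above.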

Did you send a patch to go in one direction or another?
If not, please move the test to perf-test as suggested before.
We are still hitting the issues in the CI and it is *very* annoying.
It is consuming time of a lot of people for a lot of patches,
just to check it is again an issue with this test.

Please let's remove this test from the CI now.
