netdev.vger.kernel.org archive mirror
* [TEST] The no-kvm CI instances going away
@ 2024-02-06  1:41 Jakub Kicinski
  2024-02-06 11:16 ` Matthieu Baerts
  2024-02-07 17:45 ` David Ahern
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Kicinski @ 2024-02-06  1:41 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

because cloud computing is expensive I'm shutting down the instances
which were running without KVM support. We're left with the KVM-enabled
instances only (metal) - one normal and one with debug configs enabled.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [TEST] The no-kvm CI instances going away
  2024-02-06  1:41 [TEST] The no-kvm CI instances going away Jakub Kicinski
@ 2024-02-06 11:16 ` Matthieu Baerts
  2024-02-07  1:44   ` Jakub Kicinski
  2024-02-07 17:45 ` David Ahern
  1 sibling, 1 reply; 12+ messages in thread
From: Matthieu Baerts @ 2024-02-06 11:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, MPTCP Upstream, Paolo Abeni,
	Mat Martineau

Hi Jakub,

On 06/02/2024 02:41, Jakub Kicinski wrote:
> because cloud computing is expensive I'm shutting down the instances
> which were running without KVM support. We're left with the KVM-enabled
> instances only (metal) - one normal and one with debug configs enabled.

Thank you for the notification!

It sounds like good news if the lack of KVM support was causing issues :)

I think we can then no longer ignore the two MPTCP tests that were
unstable in the previous environment.

The results from the different tests running on the -dbg instances don't
look good. Maybe some debug kconfig options have too big an impact? [1]

For MPTCP, one test always hits the selftest timeout [2] when using a
debug kconfig. I don't know what to do in this case: if we need to set a
timeout value that is supported by debug environments, the value will be
so high, it will no longer catch issues "early enough" in "normal"
environments.
Or could it be possible to ignore or double the timeout value in this
debug environment?

Also, what is the plan with this debug env? It looks like the results
are not reported to patchwork for the moment. Maybe only "important"
issues, like kernel warnings, could be reported? Failed tests could be
reported as "Warning" instead of "Fail"?

[1]
https://lore.kernel.org/netdev/90c6d9b6-0bc4-468a-95fe-ebc2a23fffc1@kernel.org/
[2]
https://netdev-3.bots.linux.dev/vmksft-mptcp-dbg/results/453502/1-mptcp-join-sh/stdout

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


* Re: [TEST] The no-kvm CI instances going away
  2024-02-06 11:16 ` Matthieu Baerts
@ 2024-02-07  1:44   ` Jakub Kicinski
  2024-02-07  9:44     ` Matthieu Baerts
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2024-02-07  1:44 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: netdev@vger.kernel.org, MPTCP Upstream, Paolo Abeni,
	Mat Martineau

On Tue, 6 Feb 2024 12:16:43 +0100 Matthieu Baerts wrote:
> Hi Jakub,
> 
> On 06/02/2024 02:41, Jakub Kicinski wrote:
> > because cloud computing is expensive I'm shutting down the instances
> > which were running without KVM support. We're left with the KVM-enabled
> > instances only (metal) - one normal and one with debug configs enabled.  
> 
> Thank you for the notification!
> 
> It sounds like good news if the lack of KVM support was causing issues :)
> 
> I think we can then no longer ignore the two MPTCP tests that were
> unstable in the previous environment.
> 
> The results from the different tests running on the -dbg instances don't
> look good. Maybe some debug kconfig options have too big an impact? [1]

Sorry, I'm behind on reading the list. FWIW if you want to reach me
quickly make sure the To: doesn't include anyone else. That gets sorted
to a higher prio folder :S

> For MPTCP, one test always hits the selftest timeout [2] when using a
> debug kconfig. I don't know what to do in this case: if we need to set a
> timeout value that is supported by debug environments, the value will be
> so high, it will no longer catch issues "early enough" in "normal"
> environments.
> Or could it be possible to ignore or double the timeout value in this
> debug environment?
> 
> Also, what is the plan with this debug env? It looks like the results
> are not reported to patchwork for the moment. Maybe only "important"
> issues, like kernel warnings, could be reported? Failed tests could be
> reported as "Warning" instead of "Fail"?

Unfortunately I'm really behind on my "real job". I don't have a clear
plan. I think we should scale the timeout by 2x or so, but I haven't
looked how to do that.

I wish the selftest subsystem had some basic guidance.


* Re: [TEST] The no-kvm CI instances going away
  2024-02-07  1:44   ` Jakub Kicinski
@ 2024-02-07  9:44     ` Matthieu Baerts
  2024-02-07 14:25       ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Matthieu Baerts @ 2024-02-07  9:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, MPTCP Upstream, Paolo Abeni,
	Mat Martineau

Hi Jakub,

On 07/02/2024 02:44, Jakub Kicinski wrote:
> On Tue, 6 Feb 2024 12:16:43 +0100 Matthieu Baerts wrote:
>> Hi Jakub,
>>
>> On 06/02/2024 02:41, Jakub Kicinski wrote:
>>> because cloud computing is expensive I'm shutting down the instances
>>> which were running without KVM support. We're left with the KVM-enabled
>>> instances only (metal) - one normal and one with debug configs enabled.  
>>
>> Thank you for the notification!
>>
>> It sounds like good news if the lack of KVM support was causing issues :)
>>
>> I think we can then no longer ignore the two MPTCP tests that were
>> unstable in the previous environment.
>>
>> The results from the different tests running on the -dbg instances don't
>> look good. Maybe some debug kconfig options have too big an impact? [1]
> 
> Sorry, I'm behind on reading the list. FWIW if you want to reach me
> quickly make sure the To: doesn't include anyone else. That gets sorted
> to a higher prio folder :S

Sorry, there was no urgency, I only wanted to add a link to the previous
discussion for those who wanted more details about that.

Thank you for the note!

>> For MPTCP, one test always hits the selftest timeout [2] when using a
>> debug kconfig. I don't know what to do in this case: if we need to set a
>> timeout value that is supported by debug environments, the value will be
>> so high, it will no longer catch issues "early enough" in "normal"
>> environments.
>> Or could it be possible to ignore or double the timeout value in this
>> debug environment?
>>
>> Also, what is the plan with this debug env? It looks like the results
>> are not reported to patchwork for the moment. Maybe only "important"
>> issues, like kernel warnings, could be reported? Failed tests could be
>> reported as "Warning" instead of "Fail"?
> 
> Unfortunately I'm really behind on my "real job". I don't have a clear
> plan. I think we should scale the timeout by 2x or so, but I haven't
> looked how to do that.

No hurry, I understand.

It is not clear to me how the patches you add on top of the ones from
patchwork are managed. Then, I don't know if it can help, but on the
debug instance, this command could be launched before starting the tests
to double the timeout values in all the "net" selftests:

  $ find tools/testing/selftests/net -name settings -print0 | xargs -0 \
       awk -i inplace -F '=' \
           '{if ($1 == "timeout") { print $1 "=" $2*2 } else { print }}'
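The same rewrite could also be sketched in Python, for anyone without GNU awk's
inplace extension. This is only an illustration of the doubling, not something
wired into the CI:

```python
# Illustration only: double the timeout= values in all "net" selftest
# settings files, same effect as the awk one-liner above.
from pathlib import Path

def double_timeouts(root):
    for settings in Path(root).rglob("settings"):
        out = []
        for line in settings.read_text().splitlines():
            key, sep, value = line.partition("=")
            if sep and key == "timeout":
                line = f"{key}={int(value) * 2}"
            out.append(line)
        settings.write_text("\n".join(out) + "\n")

if __name__ == "__main__":
    double_timeouts("tools/testing/selftests/net")
```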

> I wish the selftest subsystem had some basic guidance.

Me too :)

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


* Re: [TEST] The no-kvm CI instances going away
  2024-02-07  9:44     ` Matthieu Baerts
@ 2024-02-07 14:25       ` Jakub Kicinski
  2024-02-07 14:37         ` Matthieu Baerts
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2024-02-07 14:25 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: netdev@vger.kernel.org, MPTCP Upstream, Paolo Abeni,
	Mat Martineau

On Wed, 7 Feb 2024 10:44:14 +0100 Matthieu Baerts wrote:
> > Unfortunately I'm really behind on my "real job". I don't have a clear
> > plan. I think we should scale the timeout by 2x or so, but I haven't
> > looked how to do that.  
> 
> No hurry, I understand.
> 
> It is not clear to me how the patches you add on top of the ones from
> patchwork are managed. Then, I don't know if it can help, but on the
> debug instance, this command could be launched before starting the tests
> to double the timeout values in all the "net" selftests:
> 
>   $ find tools/testing/selftests/net -name settings -print0 | xargs -0 \
>        awk -i inplace -F '=' \
>            '{if ($1 == "timeout") { print $1 "=" $2*2 } else { print }}'

I'd rather not modify the tree. Poking around - this seems to work:

  export kselftest_override_timeout=1

Now it's just a matter of finding 15min to code it up :)


* Re: [TEST] The no-kvm CI instances going away
  2024-02-07 14:25       ` Jakub Kicinski
@ 2024-02-07 14:37         ` Matthieu Baerts
  2024-02-07 15:29           ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Matthieu Baerts @ 2024-02-07 14:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, MPTCP Upstream, Paolo Abeni,
	Mat Martineau

On 07/02/2024 15:25, Jakub Kicinski wrote:
> On Wed, 7 Feb 2024 10:44:14 +0100 Matthieu Baerts wrote:
>>> Unfortunately I'm really behind on my "real job". I don't have a clear
>>> plan. I think we should scale the timeout by 2x or so, but I haven't
>>> looked how to do that.  
>>
>> No hurry, I understand.
>>
>> It is not clear to me how the patches you add on top of the ones from
>> patchwork are managed. Then, I don't know if it can help, but on the
>> debug instance, this command could be launched before starting the tests
>> to double the timeout values in all the "net" selftests:
>>
>>   $ find tools/testing/selftests/net -name settings -print0 | xargs -0 \
>>        awk -i inplace -F '=' \
>>            '{if ($1 == "timeout") { print $1 "=" $2*2 } else { print }}'
> 
> I'd rather not modify the tree. Poking around - this seems to work:
> 
>   export kselftest_override_timeout=1

Even better :)

  f=tools/testing/selftests/net/settings
  kselftest_override_timeout=$(awk -F = '/^timeout=/ {print $2*2}' $f)

> Now it's just a matter of finding 15min to code it up :)

I'm not sure if I can help here :)

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


* Re: [TEST] The no-kvm CI instances going away
  2024-02-07 14:37         ` Matthieu Baerts
@ 2024-02-07 15:29           ` Jakub Kicinski
  2024-02-07 16:06             ` Matthieu Baerts
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2024-02-07 15:29 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: netdev@vger.kernel.org, MPTCP Upstream, Paolo Abeni,
	Mat Martineau

On Wed, 7 Feb 2024 15:37:22 +0100 Matthieu Baerts wrote:
> > I'd rather not modify the tree. Poking around - this seems to work:
> > 
> >   export kselftest_override_timeout=1  
> 
> Even better :)
> 
>   f=tools/testing/selftests/net/settings
>   kselftest_override_timeout=$(awk -F = '/^timeout=/ {print $2*2}' $f)
> 
> > Now it's just a matter of finding 15min to code it up :)  
> I'm not sure if I can help here :)

If you're willing to touch my nasty Python - that'd be very welcome! :)

Right now I put this in the configs:

[vm]
exports=KSFT_MACHINE_SLOW=yes

We can leave the support for adding exports in place, but we should add
a new config option like

[vm]
slowdown=2

(exact name up for debate); if it's present, generate the
KSFT_MACHINE_SLOW=yes export automatically, without the explicit entry
in the config. And add the export for the timeout matching the logic
you propose (but in Python).
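Something like this, perhaps. The option name, section layout, and the helper
are only my guesses at how it could look in the runner, not the actual NIPA
code:

```python
# Sketch of the proposed [vm] slowdown= option. All names here
# (build_exports, the config layout) are assumptions for illustration.
import configparser
from pathlib import Path

def build_exports(config_path, tree):
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    exports = {}
    slowdown = cfg.getint("vm", "slowdown", fallback=1)
    if slowdown > 1:
        # Implied by slowdown, no explicit exports= entry needed.
        exports["KSFT_MACHINE_SLOW"] = "yes"
        # Scale the target's timeout, mirroring the shell snippet above.
        target = cfg.get("ksft", "target", fallback="")
        settings = Path(tree, "tools/testing/selftests", target, "settings")
        if settings.is_file():
            for line in settings.read_text().splitlines():
                key, sep, value = line.partition("=")
                if sep and key == "timeout":
                    exports["kselftest_override_timeout"] = str(int(value) * slowdown)
    return exports
```

With slowdown=2, a settings file carrying timeout=300 would then yield
kselftest_override_timeout=600 on top of KSFT_MACHINE_SLOW=yes.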

FWIW here's the configs we currently use:

$ cat remote.config
[remote]
branches=https://netdev.bots.linux.dev/static/nipa/branches.json
filters=https://netdev.bots.linux.dev/contest/filters.json
[local]
results_path=results
json_path=jsons
[env]
paths=/home/virtme/virtme-ng/
[vm]
paths=/home/virtme/tools/fs/bin:/home/virtme/tools/fs/sbin:/home/virtme/tools/fs/usr/bin:/home/virtme/tools/fs/usr/sbin
ld_paths=/lib64/:/home/virtme/tools/fs/usr/lib/:/home/virtme/tools/fs/lib64/:/home/virtme/tools/fs/usr/lib64/
init_prompt=bash-5.2#
default_timeout=200
boot_timeout=45


$ cat ksft-mptcp-dbg.config
[executor]
name=vmksft-mptcp-dbg
[local]
tree_path=/home/virtme/testing-12/
base_path=/home/virtme/outputs-12/
[www]
url=https://netdev-3.bots.linux.dev/vmksft-mptcp-dbg
[vm]
exports=KSFT_MACHINE_SLOW=yes
configs=kernel/configs/debug.config,kernel/configs/x86_debug.config
default_timeout=300
cpus=4
[ksft]
target=net/mptcp
[cfg]
thread_cnt=1



* Re: [TEST] The no-kvm CI instances going away
  2024-02-07 15:29           ` Jakub Kicinski
@ 2024-02-07 16:06             ` Matthieu Baerts
  0 siblings, 0 replies; 12+ messages in thread
From: Matthieu Baerts @ 2024-02-07 16:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, MPTCP Upstream, Paolo Abeni,
	Mat Martineau

On 07/02/2024 16:29, Jakub Kicinski wrote:
> On Wed, 7 Feb 2024 15:37:22 +0100 Matthieu Baerts wrote:
>>> I'd rather not modify the tree. Poking around - this seems to work:
>>>
>>>   export kselftest_override_timeout=1  
>>
>> Even better :)
>>
>>   f=tools/testing/selftests/net/settings
>>   kselftest_override_timeout=$(awk -F = '/^timeout=/ {print $2*2}' $f)
>>
>>> Now it's just a matter of finding 15min to code it up :)  
>> I'm not sure if I can help here :)
> 
> If you're willing to touch my nasty Python - that'd be very welcome! :)

Sure, I can have a look (but not today) :)

Thank you for the guide! Like with my PR to support subtests, I might
only test the code I modify, to avoid having to set everything up; I
hope that is fine :)

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


* Re: [TEST] The no-kvm CI instances going away
  2024-02-06  1:41 [TEST] The no-kvm CI instances going away Jakub Kicinski
  2024-02-06 11:16 ` Matthieu Baerts
@ 2024-02-07 17:45 ` David Ahern
  2024-02-07 18:55   ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: David Ahern @ 2024-02-07 17:45 UTC (permalink / raw)
  To: Jakub Kicinski, netdev@vger.kernel.org

On 2/5/24 6:41 PM, Jakub Kicinski wrote:
> because cloud computing is expensive I'm shutting down the instances
> which were running without KVM support. We're left with the KVM-enabled
> instances only (metal) - one normal and one with debug configs enabled.


Who is covering the cost of the cloud VMs? Have you considered cheaper
alternatives to AWS?


* Re: [TEST] The no-kvm CI instances going away
  2024-02-07 17:45 ` David Ahern
@ 2024-02-07 18:55   ` Jakub Kicinski
  2024-02-07 19:45     ` David Ahern
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2024-02-07 18:55 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev@vger.kernel.org

On Wed, 7 Feb 2024 10:45:26 -0700 David Ahern wrote:
> On 2/5/24 6:41 PM, Jakub Kicinski wrote:
> > because cloud computing is expensive I'm shutting down the instances
> > which were running without KVM support. We're left with the KVM-enabled
> > instances only (metal) - one normal and one with debug configs enabled.  
> 
> Who is covering the cost of the cloud VMs?

Meta

> Have you considered cheaper alternatives to AWS?

If I'm completely honest it's more a time thing than cost thing.
I have set a budget for the project in internal tooling to 3x
what I expected just the build bot to consume, so it can fit one
large instance without me having to jump thru any hoops.
I will slowly jump thru hoops to get more as time allows,
but I figured the VM instance was a mistake in the first place,
so I might as well just kill it off already. The -dbg runners
are also slow. Or do you see benefit to running without KVM?
Another potential extension is running on ARM.

And yes, it was much cheaper when the builder ran in Digital Ocean.

But why do you ask? :) Just to offer cheaper alternatives or do you
happen to have the ability to get a check written to support the
effort? :)


* Re: [TEST] The no-kvm CI instances going away
  2024-02-07 18:55   ` Jakub Kicinski
@ 2024-02-07 19:45     ` David Ahern
  2024-02-07 20:07       ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: David Ahern @ 2024-02-07 19:45 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev@vger.kernel.org

On 2/7/24 11:55 AM, Jakub Kicinski wrote:
> On Wed, 7 Feb 2024 10:45:26 -0700 David Ahern wrote:
>> On 2/5/24 6:41 PM, Jakub Kicinski wrote:
>>> because cloud computing is expensive I'm shutting down the instances
>>> which were running without KVM support. We're left with the KVM-enabled
>>> instances only (metal) - one normal and one with debug configs enabled.  
>>
>> Who is covering the cost of the cloud VMs?
> 
> Meta
> 
>> Have you considered cheaper alternatives to AWS?
> 
> If I'm completely honest it's more a time thing than cost thing.
> I have set a budget for the project in internal tooling to 3x
> what I expected just the build bot to consume, so it can fit one
> large instance without me having to jump thru any hoops.
> I will slowly jump thru hoops to get more as time allows,
> but I figured the VM instance was a mistake in the first place,
> so I might as well just kill it off already. The -dbg runners
> are also slow. Or do you see benefit to running without KVM?
> Another potential extension is running on ARM.
> 
> And yes, it was much cheaper when the builder ran in Digital Ocean.
> 
> But why do you ask? :) Just to offer cheaper alternatives or do you
> happen to have the ability to get a check written to support the
> effort? :)

I have no such ability :-) I cover the costs myself when I use VMs on
DigitalOcean and Vultr for Linux development and testing.

Kernel builds and selftests just need raw compute power, none of the
fancy enterprise features that AWS provides (and bills accordingly).

The first question about who is covering the cost is to avoid
assumptions and acknowledge the service (and costs) provided to the
community. Having the selftests tied to patchsets is really helpful to
proactively identify potential regressions.

For the second question I was just curious as to whether you had tried
the cheaper options (DO, Vultr, Linode, ...) and if they worked OK for
you, i.e., why AWS. I like the range of OS versions that are accessible
within minutes.


* Re: [TEST] The no-kvm CI instances going away
  2024-02-07 19:45     ` David Ahern
@ 2024-02-07 20:07       ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2024-02-07 20:07 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev

On Wed, 7 Feb 2024 12:45:00 -0700 David Ahern wrote:
> I have no such ability :-) I cover the costs myself when I use VMs on
> DigitalOcean and Vultr for Linux development and testing.
> 
> Kernel builds and selftests just need raw compute power, none of the
> fancy enterprise features that AWS provides (and bills accordingly).
> 
> The first question about who is covering the cost is to avoid
> assumptions and acknowledge the service (and costs) provided to the
> community. Having the selftests tied to patchsets is really helpful to
> proactively identify potential regressions.

Thanks! Meta is pretty great at covering various community costs.
I try to keep the asks reasonable, but I haven't been told "no"
yet. The reason why it'd be great to have more supporters is that
(at a cost of slightly more paperwork) we could externalize the
whole shebang, and give out SSH keys more freely :(

I also heard that some clouds or LF have free cloud credits for open
source CI. But IDK how I could connect those to my corporate account..

Anyway. In my ideal world we'd write a check for LF and get a bunch
of VMs in DO or the like. In the present reality we have AWS, which only
I can access 🥴️

Good thing is the system is distributed. Hopefully over time more
people will join in by reporting the results (like Mojatatu does
with TCD).


end of thread, other threads:[~2024-02-07 20:07 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-06  1:41 [TEST] The no-kvm CI instances going away Jakub Kicinski
2024-02-06 11:16 ` Matthieu Baerts
2024-02-07  1:44   ` Jakub Kicinski
2024-02-07  9:44     ` Matthieu Baerts
2024-02-07 14:25       ` Jakub Kicinski
2024-02-07 14:37         ` Matthieu Baerts
2024-02-07 15:29           ` Jakub Kicinski
2024-02-07 16:06             ` Matthieu Baerts
2024-02-07 17:45 ` David Ahern
2024-02-07 18:55   ` Jakub Kicinski
2024-02-07 19:45     ` David Ahern
2024-02-07 20:07       ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).