* Azure infrastructure update
@ 2023-06-28 10:44 Paolo Bonzini
  2023-06-28 11:28 ` Daniel P. Berrangé
  0 siblings, 1 reply; 4+ messages in thread
From: Paolo Bonzini @ 2023-06-28 10:44 UTC (permalink / raw)
  To: qemu
  Cc: qemu-devel, Camilla Conte, Richard Henderson, Daniel P. Berrangé,
	Thomas Huth, Markus Armbruster

Hi all,

a small update on the infrastructure we have set up on Azure and the
expected costs. Remember that we have $10000/year credits from the
Microsoft open source program, therefore the actual cost to the
project is zero unless we exceed the threshold.

Historically, QEMU's infrastructure was hosted on virtual machines
sponsored by Rackspace's open source infrastructure program. When the
program was abruptly terminated, QEMU faced a cost of roughly
$1500/month, mostly due to bandwidth.

As an initial step to cut these costs, downloads were moved to Azure.
However, bandwidth costs remained high and in 2022 we exceeded the
credits from the sponsorship and we had to pay roughly $4000 to
Microsoft, in addition to roughly $2000 for VMs that were still hosted
on Rackspace. While not a definitive solution, this saved the project
an expense of over $10000.

Fortunately, the GNOME project stepped in and offered to host
downloads for QEMU on their CDN. This freed up all the Azure credits
for more interesting uses. In particular, Stefan and I moved the
Rackspace VMs over to Azure, after which the Rackspace bill went down
to zero.

This resulted in two VMs, both running CentOS Stream 9:
- a larger one (E2s instance type) for Patchew and wiki.qemu.org,
costing ~$1900/year between VMs and disks. The websites on this VM are
implemented as podman containers + a simple nginx front-end on ports
80/443.
- a smaller one (D2s instance type) that proxies qemu.org and
git.qemu.org to gitlab and provides an SSH mirror of the QEMU
downloads, costing $1200/year between VMs and disks. This was a more
traditional monolithic setup.

We also have two virtual machines from OSUOSL (Oregon State University
Open Source Labs); one is unused and can be decommissioned; the other
(also running CentOS Stream 9) is running Patchew background jobs to
import patches and apply them.

Last April, Camilla Conte also added Kubernetes-based private runners
for QEMU CI to our Azure setup. Private runners avoid hitting the
GitLab limits on shared runners and shorten the time it takes to run
individual test jobs. This is because CI, thanks to its bursty
nature, can use larger VMs than "pet" VMs such as the ones above.
Currently we are using 8 vCPU / 32 GB VMs for the Kubernetes nodes,
and each node is assigned 4 vCPUs.
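
For the curious, the per-job CPU assignment lives in the runner's
Kubernetes executor configuration; a minimal sketch of the relevant part
of a gitlab-runner Helm values file (illustrative values only, assuming
a Helm-based deployment -- our actual manifests may differ):

    runners:
      config: |
        [[runners]]
          [runners.kubernetes]
            # request 4 vCPUs per job pod on the 8 vCPU / 32 GB nodes
            cpu_request = "4"
            # the memory value below is purely illustrative
            memory_request = "12Gi"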

Starting June 1, all pipelines running in qemu-project/qemu have been
using the private runners. Besides benefiting from the higher number
of vCPUs per job, this leaves the GitLab shared runner allowance
to Windows jobs as well as updates to qemu-web. It also made it
possible to estimate the cost of running Linux jobs on Azure at all
times, and to compare the costs with the credits that are made
available through the sponsorship.

Finally, earlier this month I noticed that the OSUOSL mirror for
download.qemu.org was not being updated. Therefore, I rebuilt the
qemu.org and git.qemu.org proxies as containers and moved them to the
same VM running Patchew, wiki.qemu.org and now the KVM Forum website
too. This made it possible to delete the second VM mentioned above. We
will re-evaluate how to provide the source for mirroring
download.qemu.org.

Our consumption of Azure credits was as follows:
* $2005 as of Jun 1, of which $371 used for the now-deleted D2s VM
* $2673 as of Jun 28, of which $457 used for the now-deleted D2s VM

Based on the credits consumed from Jun 1 to Jun 28, which should be
representative of normal resource use, I am estimating the Azure costs
as follows:

$6700 for this year, of which:
- $1650 for the E2s VM
- $450 for the now-deleted D2s VM
- $1600 for the Kubernetes compute nodes
- $2500 for AKS (Azure Kubernetes Service) including system nodes,
load balancing, monitoring and a few more itemized services(*)
- $500 for bandwidth and IP address allocation

$7800 starting next year, of which:
- $1900 for the E2s VM
- $2250 for the Kubernetes compute nodes
- $3100 for AKS-related services
- $550 for bandwidth and IP address allocation
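
(For reference, the run rate implied by the figures above, excluding the
now-deleted D2s VM:

    ($2673 - $457) - ($2005 - $371) = $582 over the 27 days from Jun 1 to Jun 28
    $582 / 27 * 365 ~= $7870/year

which lines up with the ~$7800 estimate for next year.)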

This fits within the allowance of the Azure open source credits
program, while leaving some leeway in case of increased costs or
increased usage of the private runners. As a contingency plan in case
costs surge, we can always disable usage of the private runners and
revert to wider usage of shared runners.

That said, the cost for the compute nodes is not small. In particular,
at the last QEMU Summit we discussed the possibility of adopting a
merge request workflow for maintainer pull requests. These merge
requests would replace the pipelines that are run by committers as
part of merging trees, and therefore should not introduce excessive
costs. However, as things stand, in case of a more generalized
adoption of GitLab MRs(**) the QEMU project will *not* be able to
shoulder the cost of running our (pretty expensive) CI on private
runners for all merge requests.

Thanks,

Paolo

(*) not that we use any of this, but they are added automatically when
you set up AKS

(**) which was NOT considered at QEMU Summit




* Re: Azure infrastructure update
  2023-06-28 10:44 Azure infrastructure update Paolo Bonzini
@ 2023-06-28 11:28 ` Daniel P. Berrangé
  2023-06-28 11:41   ` Paolo Bonzini
  0 siblings, 1 reply; 4+ messages in thread
From: Daniel P. Berrangé @ 2023-06-28 11:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu, qemu-devel, Camilla Conte, Richard Henderson, Thomas Huth,
	Markus Armbruster

On Wed, Jun 28, 2023 at 12:44:33PM +0200, Paolo Bonzini wrote:
> Starting June 1, all pipelines running in qemu-project/qemu have been
> using the private runners. Besides benefiting from the higher number
> of vCPUs per job, this leaves the GitLab shared runner allowance
> to Windows jobs as well as updates to qemu-web.

Also the python-qemu-qmp.git CI is on shared runners currently.

> Our consumption of Azure credits was as follows:
> * $2005 as of Jun 1, of which $371 used for the now-deleted D2s VM
> * $2673 as of Jun 28, of which $457 used for the now-deleted D2s VM
> 
> Based on the credits consumed from Jun 1 to Jun 28, which should be
> representative of normal resource use, I am estimating the Azure costs
> as follows:

The only caveat is that June did not coincide with a soft freeze. My
impression is that our CI pipeline usage has a spike in the weeks
around the freeze.

> $6700 for this year, of which:
> - $1650 for the E2s VM
> - $450 for the now-deleted D2s VM
> - $1600 for the Kubernetes compute nodes
> - $2500 for AKS (Azure Kubernetes Service) including system nodes,
> load balancing, monitoring and a few more itemized services(*)
> - $500 for bandwidth and IP address allocation
> 
> $7800 starting next year, of which:
> - $1900 for the E2s VM

Same size VM as last year, but more? Is this simply you
anticipating possible price increases from Azure?

> - $2250 for the Kubernetes compute nodes

IIUC, the $1600 for this year will cover about 7.5 months' worth
of usage (Jun -> Dec), which would imply around $2500 for a
full 12 months, possibly more if we add in the peaks around soft freeze.
IOW it could conceivably be closer to the $3k mark without much difficulty,
especially if we also start doing more pipelines for stable branches
on a regular basis, now that we have CI working properly for stable.

> - $3100 for AKS-related services

Same question about anticipated prices?

> - $550 for bandwidth and IP address allocation
> 
> This fits within the allowance of the Azure open source credits
> program, while leaving some leeway in case of increased costs or
> increased usage of the private runners. As a contingency plan in case
> costs surge, we can always disable usage of the private runners and
> revert to wider usage of shared runners.

We also still have Eldondev's physical machine set up as a runner,
assuming that's going to be available indefinitely if we need
the resource.

> That said, the cost for the compute nodes is not small. In particular,
> at the last QEMU Summit we discussed the possibility of adopting a
> merge request workflow for maintainer pull requests. These merge
> requests would replace the pipelines that are run by committers as
> part of merging trees, and therefore should not introduce excessive
> costs.

Depending on how we set up the CI workflow, it might increase our
usage, potentially doubling it quite easily.


Right now, whoever is doing CI for merging pull requests is fully
serializing CI pipelines, so what's tested 100% matches what is
merged to master.

With a merge request workflow it can be slightly different based
on a couple of variables.

When a maintainer pushes their branch to their fork, prior to opening
the merge request, CI credits are burnt in their fork for every push,
based on whatever the HEAD of their branch is. This might be behind
the current upstream 'master' by some amount.


Typically when using merge requests though, you would change the
gitlab CI workflow rules to trigger CI pipelines from merge request
actions, instead of branch push actions.
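
Roughly speaking (an illustration, not our actual .gitlab-ci.yml), that
is a workflow:rules change along these lines:

    workflow:
      rules:
        # run a pipeline for merge request events...
        - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
        # ...while keeping the current push-triggered pipelines for the
        # staging branch, if we still want those
        - if: '$CI_PIPELINE_SOURCE == "push" && $CI_COMMIT_BRANCH == "staging"'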

If we do this, then when opening a merge request, an initial pipeline
would be triggered.

If-and-only-if the maintainer has "Developer" on gitlab.com/qemu-project,
then that merge request initial pipeline will burn upstream CI credits.

If they are not a "Developer", it will burn their own fork credits. If
they don't have any credits left, then someone with "Developer" role
will have to spawn a pipeline on their behalf, which will run in
upstream context and burn upstream credits. The latter is tedious,
so I think expectation is that anyone who submits pull requests would
be expected to have 'Developer' role on qemu-project. We want that
anyway really so we can tag maintainers in issues on gitlab too.


IOW, assume that any maintainer opening a merge req will be burning
upstream CI credits on their merge request pipelines.


This initial pipeline will run against a merge commit that grafts
the head of the pull request onto 'master' as it was at the time the
pipeline was triggered.

In a default config, if we apply the merge request at that point it
would go into master with no further pipeline run.

Merge requests are not serialized though.

So if a second merge request has been applied to master after the
time the first merge request's pipeline started, the pipeline for the
first merge request is potentially invalid. Compared to our use of
the (serialized) pipelines on the 'staging' branch, this setup would
be a regression in coverage.

To address this would require using GitLab's  "merge trains" feature.

When merge trains are enabled, when someone hits the button to apply
a merge request to master, an *additional* CI pipeline is started
based on the exact content that will be applied to master. Crucially,
as the name hints, the merge train pipelines  are serialized. IOW,
if you request to apply 4 merge requests in quick succession a queue
of pipelines will be created and run one after the other. If any
pipeline fails, that MR is kicked out of the queue, and the
following pipelines carry on.

IOW, the merge trains feature replicates what we achieve with the
serialized 'staging' branch.

What you can see here though, is that every merge request will have
at least 2 pipelines - one when the MR is opened, and one when it
is applied to master - both consuming upstream CI credits.

IOW, we potentially double our CI usage in this model if we don't
make any changes to how CI pipelines are triggered.


Essentially the idea with merge requests is that the initial
pipeline upon opening the merge request does full validation
and catches all the silly stuff. Failures are ok because this
is all parallelized with other MRs, so failures don't delay
anything/anyone else. The merge train is then the safety net
to prove the original pipeline results are still valid for
current HEAD at time of applying it. You want the merge train
pipelines to essentially never fail as that's disruptive to
anything following on.

If we can afford the CI credits, I'd keep things simple and
just accept the increased CI burn, but with your figures above
I fear we'd be too close to the limit to be relaxed about it.

The extra Eldondev runner could possibly come into play here.

If we can't afford the double pipelines, then we would have
to write our GitLab CI yml rules to exclude the initial
pipeline, or just do a very minimalist "smoke test", and
focus the bulk of CI usage on the merge train pipeline.
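
E.g. something along these lines (hypothetical job names, an untested
sketch rather than a worked-out ruleset):

    # heavyweight jobs: only run in the merge train pipeline
    full-build-and-test:
      rules:
        - if: '$CI_MERGE_REQUEST_EVENT_TYPE == "merge_train"'
      script:
        - make check          # placeholder for the real build/test steps

    # cheap smoke test: run when the MR is opened or updated
    smoke-test:
      rules:
        - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_EVENT_TYPE != "merge_train"'
      script:
        - make check-build    # placeholder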

This is all solvable in one way or another. We just need to
figure out the right tradeoffs we want.

>          However, as things stand, in case of a more generalized
> adoption of GitLab MRs(**) the QEMU project will *not* be able to
> shoulder the cost of running our (pretty expensive) CI on private
> runners for all merge requests.

With more generalized adoption of the MR workflow for all contributions,
bear in mind that many of the contributors will NOT have the
'Developer' role on gitlab.com/qemu-project. Thus their merge
request pipelines would run in fork context and consume their own
CI credits, unless a "Developer" manually triggered a pipeline
on their behalf.

So yes, I agree that full adoption of MRs would definitely increase
our CI usage, but not by quite such a horrendous amount as you might
first think. We would definitely need more resources whichever way
you look at it though.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: Azure infrastructure update
  2023-06-28 11:28 ` Daniel P. Berrangé
@ 2023-06-28 11:41   ` Paolo Bonzini
  2023-06-28 12:07     ` Daniel P. Berrangé
  0 siblings, 1 reply; 4+ messages in thread
From: Paolo Bonzini @ 2023-06-28 11:41 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu, qemu-devel, Camilla Conte, Richard Henderson, Thomas Huth,
	Markus Armbruster

On Wed, Jun 28, 2023 at 1:28 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > $6700 for this year, of which:
> > - $1650 for the E2s VM
> > - $450 for the now-deleted D2s VM
> > - $1600 for the Kubernetes compute nodes
> > - $2500 for AKS (Azure Kubernetes Service) including system nodes,
> > load balancing, monitoring and a few more itemized services(*)
> > - $500 for bandwidth and IP address allocation
> >
> > $7800 starting next year, of which:
> > - $1900 for the E2s VM
>
> Same size VM as last year, but more? Is this simply you
> anticipating possible price increases from Azure?

No, it's just ~11 vs. 12 months, because we didn't set it up from the first day of the year.

> > - $2250 for the Kubernetes compute nodes
>
> IIUC, the $1600 for this year will cover about 7.5 months' worth
> of usage (Jun -> Dec), which would imply around $2500 for a
> full 12 months, possibly more if we add in the peaks around soft freeze.
> IOW it could conceivably be closer to the $3k mark without much difficulty,
> especially if we also start doing more pipelines for stable branches
> on a regular basis, now that we have CI working properly for stable.

It also covers several weeks in May. In any case, I've saved the
broken-down data and will redo the estimate after 4 months (Jun 1 -> Sep 30,
covering both a soft freeze and a hard freeze).

> > - $3100 for AKS-related services
>
> Same question about anticipated prices?

Same answer about 9-10 months vs. 12. :)

> > That said, the cost for the compute nodes is not small. In particular,
> > at the last QEMU Summit we discussed the possibility of adopting a
> > merge request workflow for maintainer pull requests. These merge
> > requests would replace the pipelines that are run by committers as
> > part of merging trees, and therefore should not introduce excessive
> > costs.
>
> Depending on how we set up the CI workflow, it might increase our
> usage, potentially doubling it quite easily.
>
> Right now, whoever is doing CI for merging pull requests is fully
> serializing CI pipelines, so what's tested 100% matches what is
> merged to master.
>
> With a merge request workflow it can be slightly different based
> on a couple of variables.
>
> When a maintainer pushes their branch to their fork, prior to opening
> the merge request, CI credits are burnt in their fork for every push,
> based on whatever the HEAD of their branch is. This might be behind
> the current upstream 'master' by some amount.
>
> Typically when using merge requests though, you would change the
> gitlab CI workflow rules to trigger CI pipelines from merge request
> actions, instead of branch push actions.

Yes, that was my idea as well.

> If we do this, then when opening a merge request, an initial pipeline
> would be triggered.
>
> If-and-only-if the maintainer has "Developer" on gitlab.com/qemu-project,
> then that merge request initial pipeline will burn upstream CI credits.
>
> If they are not a "Developer", it will burn their own fork credits. If
> they don't have any credits left, then someone with "Developer" role
> will have to spawn a pipeline on their behalf, which will run in
> upstream context and burn upstream credits. The latter is tedious,
> so I think expectation is that anyone who submits pull requests would
> be expected to have 'Developer' role on qemu-project. We want that
> anyway really so we can tag maintainers in issues on gitlab too.

Agreed. Is there no option to have the "Developer" use his own credits?

> IOW, assume that any maintainer opening a merge req will be burning
> upstream CI credits on their merge request pipelines. [...]
> Merge requests are not serialized though. [...]
> To address this would require using GitLab's  "merge trains" feature.
>
> When merge trains are enabled, when someone hits the button to apply
> a merge request to master, an *additional* CI pipeline is started
> based on the exact content that will be applied to master. Crucially,
> as the name hints, the merge train pipelines  are serialized. IOW,
> if you request to apply 4 merge requests in quick succession a queue
> of pipelines will be created and run one after the other. If any
> pipeline fails, that MR is kicked out of the queue, and the
> following pipelines carry on.
>
> IOW, the merge trains feature replicates what we achieve with the
> serialized 'staging' branch.
>
> What you can see here though, is that every merge request will have
> at least 2 pipelines - one when the MR is opened, and one when it
> is applied to master - both consuming upstream CI credits.
>
> IOW, we potentially double our CI usage in this model if we don't
> make any changes to how CI pipelines are triggered. [...]
> If we can afford the CI credits, I'd keep things simple and
> just accept the increased CI burn, but with your figures above
> I fear we'd be too close to the limit to be relaxed about it.

Hmm, now that I think about it I'm not sure the merge request CI would
use private runners. Would it use the CI variables that are set in
settings/ci_cd? If not, the pipeline would not tag the jobs for
private runners, and therefore the merge request would use shared
runners (thus burning project minutes, but that's a different
problem).

> If we can't afford the double pipelines, then we would have
> to write our GitLab CI yml rules to exclude the initial
> pipeline, or just do a very minimalist "smoke test", and
> focus the bulk of CI usage on the merge train pipeline.
>
> This is all solvable in one way or another. We just need to
> figure out the right tradeoffs we want.
>
> >          However, as things stand, in case of a more generalized
> > adoption of GitLab MRs(**) the QEMU project will *not* be able to
> > shoulder the cost of running our (pretty expensive) CI on private
> > runners for all merge requests.
>
> With more generalized adoption of the MR workflow for all contributions,
> bear in mind that many of the contributors will NOT have the
> 'Developer' role on gitlab.com/qemu-project. Thus their merge
> request pipelines would run in fork context and consume their own
> CI credits, unless a "Developer" manually triggered a pipeline
> on their behalf.

I would expect most MRs to come from Developers but yes, that's not a given.

Paolo




* Re: Azure infrastructure update
  2023-06-28 11:41   ` Paolo Bonzini
@ 2023-06-28 12:07     ` Daniel P. Berrangé
  0 siblings, 0 replies; 4+ messages in thread
From: Daniel P. Berrangé @ 2023-06-28 12:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu, qemu-devel, Camilla Conte, Richard Henderson, Thomas Huth,
	Markus Armbruster

On Wed, Jun 28, 2023 at 01:41:03PM +0200, Paolo Bonzini wrote:
> > If we do this, then when opening a merge request, an initial pipeline
> > would be triggered.
> >
> > If-and-only-if the maintainer has "Developer" on gitlab.com/qemu-project,
> > then that merge request initial pipeline will burn upstream CI credits.
> >
> > If they are not a "Developer", it will burn their own fork credits. If
> > they don't have any credits left, then someone with "Developer" role
> > will have to spawn a pipeline on their behalf, which will run in
> > upstream context and burn upstream credits. The latter is tedious,
> > so I think expectation is that anyone who submits pull requests would
> > be expected to have 'Developer' role on qemu-project. We want that
> > anyway really so we can tag maintainers in issues on gitlab too.
> 
> Agreed. Is there no option to have the "Developer" use his own credits?

I've never found any way to control / force this behaviour.

> > IOW, assume that any maintainer opening a merge req will be burning
> > upstream CI credits on their merge request pipelines. [...]
> > Merge requests are not serialized though. [...]
> > To address this would require using GitLab's  "merge trains" feature.
> >
> > When merge trains are enabled, when someone hits the button to apply
> > a merge request to master, an *additional* CI pipeline is started
> > based on the exact content that will be applied to master. Crucially,
> > as the name hints, the merge train pipelines  are serialized. IOW,
> > if you request to apply 4 merge requests in quick succession a queue
> > of pipelines will be created and run one after the other. If any
> > pipeline fails, that MR is kicked out of the queue, and the
> > following pipelines carry on.
> >
> > IOW, the merge trains feature replicates what we achieve with the
> > serialized 'staging' branch.
> >
> > What you can see here though, is that every merge request will have
> > at least 2 pipelines - one when the MR is opened, and one when it
> > is applied to master - both consuming upstream CI credits.
> >
> > IOW, we potentially double our CI usage in this model if we don't
> > make any changes to how CI pipelines are triggered. [...]
> > If we can afford the CI credits, I'd keep things simple and
> > just accept the increased CI burn, but with your figures above
> > I fear we'd be too close to the limit to be relaxed about it.
> 
> Hmm, now that I think about it I'm not sure the merge request CI would
> use private runners. Would it use the CI variables that are set in
> settings/ci_cd? If not, the pipeline would not tag the jobs for
> private runners, and therefore the merge request would use shared
> runners (thus burning project minutes, but that's a different
> problem).

The repo/project global CI env variable settings should be
honoured for all pipelines in that repo, regardless of what
action triggers the pipeline. So I'd expect merge request
triggered pipelines to "just work" with the runner tagging.
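
E.g. (hypothetical variable name, just to illustrate the mechanism) a
project-level variable expanded into a job's tags should select the
private runners for MR pipelines just like it does for branch pipelines:

    some-linux-job:
      tags:
        # set in Settings -> CI/CD -> Variables to the private k8s runner tag
        - $QEMU_CI_RUNNER_TAG
      script:
        - make check          # placeholder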

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


