Re: Sketch of an idea for handling the "mixed workload" problem

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Demi Marie Obenour <demi@invisiblethingslab.com>
To: George Dunlap <george.dunlap@cloud.com>
Cc: Xen-devel <xen-devel@lists.xenproject.org>,
	"Juergen Gross" <jgross@suse.com>,
	"Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
Subject: Re: Sketch of an idea for handling the "mixed workload" problem
Date: Tue, 23 Jan 2024 11:58:07 -0500	[thread overview]
Message-ID: <Za_wImJMW0UuNJue@itl-email> (raw)
In-Reply-To: <CA+zSX=YNjVaGn8=kio=2iT8onHAP61pzP-dicMrr4pKJQ827gw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9310 bytes --]

On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> On Mon, Jan 22, 2024 at 12:31 AM Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > > if everyone has the same weight
> > >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > > weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > > credit is allowed to run
> > >
> > > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > > reset: everyone gets another 10ms, and can carry over at most 2ms of
> > > credit over the reset.
> > >
> > > Generally speaking, vcpus that use less than their quota and have lots
> > > of interrupts are scheduled immediately, since when they wake up they
> > > always have more credit than the vcpus who are burning through their
> > > slices.
> > >
> > > But what about a situation as described recently on Matrix, where a VM
> > > uses a non-negligible amount of cpu doing un-accelerated encryption
> > > and decryption, which can be delayed by a few MS, as well as handling
> > > audio events?  How can we make sure that:
> > >
> > > 1. We can run whenever interrupts happen
> > > 2. We get no more than our fair share of the cpu?
> > >
> > > The counter-intuitive key here is that in order to achieve the above,
> > > you need to *deschedule or preempt early*, so that when the interrupt
> > > comes, you have spare credit to run the interrupt handler.  How do we
> > > manage that?
> > >
> > > The idea I'm working out comes from a phrase I used in the Matrix
> > > discussion, about a vcpu that "foolishly burned all its credits".
> > > Naturally the thing you want to do to have credits available is to
> > > save them up.
> > >
> > > So the idea would be this.  Each vcpu would have a "boost credit
> > > ratio" and a "default boost interval"; there would be sensible
> > > defaults based on typical workloads, but these could be tweaked for
> > > individual VMs.
> > >
> > > When credit is assigned, all VMs would get the same amount of credit,
> > > but divided into two "buckets", according to the boost credit ratio.
> > >
> > > Under certain conditions, a vcpu would be considered "boosted"; this
> > > state would last either until the default boost interval, or until
> > > some other event (such as a de-boost yield).
> > >
> > > The queue would be sorted thus:
> > >
> > > * Boosted vcpus, by boost credit available
> > > * Non-boosted vcpus, by non-boost credit available
> > >
> > > Getting more boost credit means having lower priority when not
> > > boosted; and burning through your boost credit means not being
> > > scheduled when you need to be.
> > >
> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu interrupts are
> > > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > > one)
> > > * Xen is about to preempt, but finds that the (currently out-of-tree)
> > > "dont_desched" bit has been set in the shared memory area
> >
> > I think both of these would be good.  Another one would be when Xen is
> > about to deliver an interrupt to a guest, provided that there is no
> > storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
> > spike through what I presume is an interrupt storm, and I suspect that
> > others have observed similar behavior with USB external drives.
> 
> How would you determine that a given interrupt was part of a "storm",
> and what would you do differently as a result of determining that?

I’m not sure.  One heuristic might be that if a device assigned to a VM
is interrupting Xen too many times while Xen is running other VMs,
interrupts from that device are blocked as needed to ensure other VMs
get to execute.  Theoretically, an interrupt from a USB storage device
should be safe to block until Xen is no longer running boosted
workloads, but an interrupt from a USB microphone or speaker is not.

> > > Other ways to consider de-boosting:
> > > * There's a way to trigger a VMEXIT when interrupts have been
> > > re-enabled; setting this up when the VM is in the boost state
> >
> > That’s a good idea, but should be conditional on “dont_desched” _not_
> > being set.  This handles the case where the guest is running a realtime
> > thread.
> 
> In which case we need some way for the "enlightened" guest to know how
> to de-boost itself; a yield might do.

That would be sufficient.

> > Generally, I’d like to see something like this:
> >
> > - A vCPU with sufficient boost credit is boosted by Xen under the
> >   following conditions:
> >
> >   1. Xen interrupts the guest.
> 
> I take it you mean, "delivers an interrupt to the guest"?

Yes.

> >   2. Xen is about to preempt, but detects that “dont_desched” is set.
> >   3. Xen is about to preempt, but detects that interrupts are disabled.
> >
> > - A vCPU is deboosted if:
> >
> >   1. It runs out of boost credit, even if “dont_desched” is set.
> >   2. An interrupt handler returns, but only if “dont_desched” is not set.
> >   3. Interrupts are re-enabled, but only if “dont_desched” is not set.
> >
> >   The first case is an abnormal condition and typically means that
> >   either the system is overloaded or a vCPU is running boosted for too
> >   long.  To help debug this situation, Xen will log a warning and
> >   increment both a system-wide and a per-domain counter.  dom0 can
> >   retrieve counters for any domain, and a domain can read its own
> >   counter.
> >
> > - When to set “dont_desched” is entirely up to the guest kernel, but
> >   there are some general rules guests should follow:
> >
> >   - Only set “dont_desched” if there is a good reason, and unset it as
> >     soon as possible.  Xen gives vCPUs with “dont_desched” set priority
> >     over all other vCPUs on the system, but the amount of time a vCPU is
> >     allowed to run with an elevated priority is limited.  Xen will log a
> >     warning if a guest tries to run with elevated priority for too long.
> >
> >   - Xen boosts vCPUs before delivering an interrupt, but there should be
> >     a way for a vCPU to deboost itself even before returning from the
> >     interrupt handler.
> >
> >   - Guests should always set “dont_desched” when running hard-realtime
> >     threads (used for e.g. audio processing), even when the thread is in
> >     userspace.  This ensures that Xen gives the underlying vCPU priority
> >     over vCPUs
> >
> >   - Guests should always set “dont_desched” when holding a spin lock,
> >     but it is even better to use paravirtualized spin locks (which make
> >     a hypercall into Xen and therefore allow other vCPUs to run).
> >
> >   - Xen does not implement priority inheritance, so guests need to do
> >     that.
> >
> > - Max boost credits can be set by dom0 via a hypercall.
> >
> > The advantage of this approach is that it keeps almost all policy out of
> > Xen.  The only exception is the boosting when an interrupt is received,
> > but a well-behaved guest will deboost itself very quickly (by enabling
> > interrupts) if the boost was not actually needed, so this should have
> > very limited impact.  I think this should be enough for realtime audio,
> > and it is somewhat related to (but hopefully simpler than) the KVM RFC
> > from Google [1].
> >
> > Any thoughts on this?
> 
> Overall sounds good.  I think a good approach would be to start by
> implementing it without the "dont_desched" flag, and then add that on
> top later.  It sounds like you have a clear vision for what you want,
> so it shouldn't be too hard to write such that adding the
> "dont_desched" doesn't require a lot of pointless refactoring.
> 
> The other issue I have with this (and essentially where I got stuck
> developing credit2 in the first place) is testing: how do you ensure
> that it has the properties that you expect?  How do you develop a
> "regression test" to make sure that server-based workloads don't have
> issues in this sort of case?

I don’t have any server workloads myself.  Would it be reasonable to ask
those who do have such workloads to develop such a test?  They would be
in a much better position to check for regressions on these workloads,
and have server hardware that they can use to benchmark such workloads.
I just have my laptop and a test laptop, both running Qubes OS.

It’s also possible that some of these changes will improve latency at
the expense of throughput.  In that case, I could add a Xen command-line
option (or even a runtime toggle) that controls whether Xen honors the
boost state.  I do expect that the rest of the logic should have very
little overhead in this case.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

     prev parent reply	other threads:[~2024-01-23 16:58 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-29 16:42 Sketch of an idea for handling the "mixed workload" problem George Dunlap
2023-09-30 23:28 ` Demi Marie Obenour
2023-10-02 11:20   ` George Dunlap
2024-01-21 23:46     ` Demi Marie Obenour
2024-01-22  0:31 ` Demi Marie Obenour
2024-01-22 11:54   ` George Dunlap
2024-01-22 12:17     ` Marek Marczykowski-Górecki
2024-01-22 12:25       ` George Dunlap
2024-01-22 12:50         ` Marek Marczykowski-Górecki
2024-01-22 13:02           ` George Dunlap
2024-01-22 13:03             ` George Dunlap
2024-01-23 16:58     ` Demi Marie Obenour [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Za_wImJMW0UuNJue@itl-email \
    --to=demi@invisiblethingslab.com \
    --cc=george.dunlap@cloud.com \
    --cc=jgross@suse.com \
    --cc=marmarek@invisiblethingslab.com \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.