Re: Sketch of an idea for handling the "mixed workload" problem

From: Demi Marie Obenour <demi@invisiblethingslab.com>
To: George Dunlap <george.dunlap@cloud.com>,
	Xen-devel <xen-devel@lists.xenproject.org>
Cc: "Juergen Gross" <jgross@suse.com>,
	"Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
Subject: Re: Sketch of an idea for handling the "mixed workload" problem
Date: Sun, 21 Jan 2024 19:31:41 -0500
Message-ID: <Za23cKyEOl1WTvhZ@itl-email>
In-Reply-To: <CA+zSX=Z904nF0yD1grRZc1miEOhdTHqAd4j-S1j8GY+1bo9COw@mail.gmail.com>

On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> The basic credit2 algorithm goes something like this:
> 
> 1. All vcpus start with the same number of credits; about 10ms worth
> if everyone has the same weight
> 
> 2. vcpus burn credits as they consume cpu, based on the relative
> weights: higher weights burn slower, lower weights burn faster
> 
> 3. At any given point in time, the runnable vcpu with the highest
> credit is allowed to run
> 
> 4. When the "next runnable vcpu" on a runqueue has negative credit,
> credit is reset: everyone gets another 10ms, and can carry at most
> 2ms of credit across the reset.
> 
> Generally speaking, vcpus that use less than their quota and have lots
> of interrupts are scheduled immediately, since when they wake up they
> always have more credit than the vcpus who are burning through their
> slices.
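
The four steps above can be sketched in a few lines of Python.  This is
only an illustration of the description, not Xen's actual code: the
names (Vcpu, burn, reset_credits, pick_next) are made up, the burn rate
is assumed to scale as max_weight / weight, and negative credit is
assumed to be forgiven at reset.

```python
from dataclasses import dataclass

CREDIT_INIT_MS = 10.0      # credit handed out at each reset
CREDIT_CARRYOVER_MS = 2.0  # at most this much survives a reset


@dataclass
class Vcpu:
    name: str
    weight: int
    credit: float = CREDIT_INIT_MS
    runnable: bool = True


def burn(vcpu, ran_ms, max_weight):
    """Charge a vcpu for ran_ms of CPU time (assumed scaling rule).

    Higher weights burn slower: the charge is scaled by max_weight / weight.
    """
    vcpu.credit -= ran_ms * max_weight / vcpu.weight


def reset_credits(vcpus):
    """Give everyone a fresh allotment, carrying over at most 2ms.

    Assumption: negative credit is dropped rather than carried over.
    """
    for v in vcpus:
        carried = max(0.0, min(v.credit, CREDIT_CARRYOVER_MS))
        v.credit = CREDIT_INIT_MS + carried


def pick_next(vcpus):
    """Return the runnable vcpu with the most credit, resetting if needed."""
    runnable = [v for v in vcpus if v.runnable]
    if not runnable:
        return None
    best = max(runnable, key=lambda v: v.credit)
    if best.credit < 0:
        # The next vcpu to run is in debt, so the whole runqueue is reset.
        reset_credits(vcpus)
        best = max(runnable, key=lambda v: v.credit)
    return best
```

A vcpu that sleeps most of the time keeps its credit, so it wins the
max() as soon as it becomes runnable again.
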
> 
> But what about a situation as described recently on Matrix, where a VM
> uses a non-negligible amount of cpu doing un-accelerated encryption
> and decryption, which can be delayed by a few ms, as well as handling
> audio events?  How can we make sure that:
> 
> 1. We can run whenever interrupts happen
> 2. We get no more than our fair share of the cpu?
> 
> The counter-intuitive key here is that in order to achieve the above,
> you need to *deschedule or preempt early*, so that when the interrupt
> comes, you have spare credit to run the interrupt handler.  How do we
> manage that?
> 
> The idea I'm working out comes from a phrase I used in the Matrix
> discussion, about a vcpu that "foolishly burned all its credits".
> Naturally the thing you want to do to have credits available is to
> save them up.
> 
> So the idea would be this.  Each vcpu would have a "boost credit
> ratio" and a "default boost interval"; there would be sensible
> defaults based on typical workloads, but these could be tweaked for
> individual VMs.
> 
> When credit is assigned, all VMs would get the same amount of credit,
> but divided into two "buckets", according to the boost credit ratio.
> 
> Under certain conditions, a vcpu would be considered "boosted"; this
> state would last either until the default boost interval, or until
> some other event (such as a de-boost yield).
> 
> The queue would be sorted thus:
> 
> * Boosted vcpus, by boost credit available
> * Non-boosted vcpus, by non-boost credit available
> 
> Getting more boost credit means having lower priority when not
> boosted; and burning through your boost credit means not being
> scheduled when you need to be.
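
A minimal Python sketch of the two-bucket split and the queue ordering
described above.  The names (Vcpu, split_credit, runqueue_order) and the
representation of the boost credit ratio as a fraction between 0 and 1
are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Vcpu:
    name: str
    boost_credit: float
    normal_credit: float
    boosted: bool = False


def split_credit(total, boost_ratio):
    """Split one credit allotment into (boost, non-boost) buckets."""
    boost = total * boost_ratio
    return boost, total - boost


def runqueue_order(vcpus):
    """Boosted vcpus first, by boost credit; then the rest, by non-boost credit."""
    def key(v):
        if v.boosted:
            return (0, -v.boost_credit)
        return (1, -v.normal_credit)
    return sorted(vcpus, key=key)
```

With a 10ms allotment, a vcpu with ratio 0.5 gets 5ms in each bucket.
While boosted it sorts ahead of everything else; once de-boosted it
sorts behind a vcpu with ratio 0, which kept all 10ms in its non-boost
bucket.  That is the trade-off described in the last paragraph above.
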
> 
> Other ways we could consider putting a vcpu into a boosted state (some
> discussed on Matrix or emails linked from Matrix):
> * Xen is about to preempt, but finds that the vcpu interrupts are
> blocked (this sort of overlaps with the "when we deliver an interrupt"
> one)
> * Xen is about to preempt, but finds that the (currently out-of-tree)
> "dont_desched" bit has been set in the shared memory area

I think both of these would be good.  Another one would be when Xen is
about to deliver an interrupt to a guest, provided that there is no
storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
spike through what I presume is an interrupt storm, and I suspect that
others have observed similar behavior with USB external drives.

> Other ways to consider de-boosting:
> * There's a way to trigger a VMEXIT when interrupts have been
> re-enabled; we could set this up when the VM is in the boost state

That’s a good idea, but it should be conditional on “dont_desched” _not_
being set.  This handles the case where the guest is running a realtime
thread.

Generally, I’d like to see something like this:

- A vCPU with sufficient boost credit is boosted by Xen under the
  following conditions:

  1. Xen interrupts the guest.
  2. Xen is about to preempt, but detects that “dont_desched” is set.
  3. Xen is about to preempt, but detects that interrupts are disabled.

- A vCPU is deboosted if:

  1. It runs out of boost credit, even if “dont_desched” is set.
  2. An interrupt handler returns, but only if “dont_desched” is not set.
  3. Interrupts are re-enabled, but only if “dont_desched” is not set.

  The first case is an abnormal condition and typically means that
  either the system is overloaded or a vCPU is running boosted for too
  long.  To help debug this situation, Xen will log a warning and
  increment both a system-wide and a per-domain counter.  dom0 can
  retrieve counters for any domain, and a domain can read its own
  counter.

- When to set “dont_desched” is entirely up to the guest kernel, but
  there are some general rules guests should follow:

  - Only set “dont_desched” if there is a good reason, and unset it as
    soon as possible.  Xen gives vCPUs with “dont_desched” set priority
    over all other vCPUs on the system, but the amount of time a vCPU is
    allowed to run with an elevated priority is limited.  Xen will log a
    warning if a guest tries to run with elevated priority for too long.
    
  - Xen boosts vCPUs before delivering an interrupt, but there should be
    a way for a vCPU to deboost itself even before returning from the
    interrupt handler.

  - Guests should always set “dont_desched” when running hard-realtime
    threads (used for e.g. audio processing), even when the thread is in
    userspace.  This ensures that Xen gives the underlying vCPU priority
    over other vCPUs.

  - Guests should always set “dont_desched” when holding a spin lock,
    but it is even better to use paravirtualized spin locks (which make
    a hypercall into Xen and therefore allow other vCPUs to run).

  - Xen does not implement priority inheritance, so guests need to
    implement it themselves.

- Max boost credits can be set by dom0 via a hypercall.
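
The boost and deboost rules in the list above can be written as a small
state machine.  This is a Python sketch for discussion, not proposed
Xen code: the Event names and the handle() function are made up,
"sufficient boost credit" is assumed to mean "more than zero", and the
overrun counter is kept per vCPU here for brevity.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    INTERRUPT_DELIVERED = auto()  # Xen interrupts the guest
    PREEMPT_ATTEMPT = auto()      # Xen is about to preempt the vCPU
    HANDLER_RETURNED = auto()     # an interrupt handler returns
    INTERRUPTS_ENABLED = auto()   # the guest re-enables interrupts


@dataclass
class Vcpu:
    boost_credit: float
    dont_desched: bool = False
    interrupts_disabled: bool = False
    boosted: bool = False
    overrun_count: int = 0  # bumped when boost credit runs out


def handle(v, event):
    """Apply one event and return whether the vCPU is boosted afterwards."""
    if v.boosted:
        if v.boost_credit <= 0:
            # Deboost rule 1: out of boost credit, even with dont_desched set.
            v.boosted = False
            v.overrun_count += 1
        elif (event in (Event.HANDLER_RETURNED, Event.INTERRUPTS_ENABLED)
              and not v.dont_desched):
            # Deboost rules 2 and 3: only when dont_desched is clear.
            v.boosted = False
    elif v.boost_credit > 0:
        if event is Event.INTERRUPT_DELIVERED:
            v.boosted = True  # boost rule 1
        elif event is Event.PREEMPT_ATTEMPT and (
                v.dont_desched or v.interrupts_disabled):
            v.boosted = True  # boost rules 2 and 3
    return v.boosted
```

A vCPU with “dont_desched” set stays boosted across
INTERRUPTS_ENABLED and only loses the boost once its boost credit is
gone, which is the case that would be logged and counted.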

The advantage of this approach is that it keeps almost all policy out of
Xen.  The only exception is the boosting when an interrupt is received,
but a well-behaved guest will deboost itself very quickly (by enabling
interrupts) if the boost was not actually needed, so this should have
very limited impact.  I think this should be enough for realtime audio,
and it is somewhat related to (but hopefully simpler than) the KVM RFC
from Google [1].

Any thoughts on this?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[1]: https://lore.kernel.org/kvm/20231214024727.3503870-1-vineeth@bitbyteword.org/