Sketch of an idea for handling the "mixed workload" problem

* Sketch of an idea for handling the "mixed workload" problem
@ 2023-09-29 16:42 George Dunlap
  2023-09-30 23:28 ` Demi Marie Obenour
  2024-01-22  0:31 ` Demi Marie Obenour
  0 siblings, 2 replies; 12+ messages in thread
From: George Dunlap @ 2023-09-29 16:42 UTC (permalink / raw)
  To: Xen-devel
  Cc: Juergen Gross, Demi Marie Obenour,
	Marek Marczykowski-Górecki

The basic credit2 algorithm goes something like this:

1. All vcpus start with the same number of credits; about 10ms worth
if everyone has the same weight

2. vcpus burn credits as they consume cpu, based on the relative
weights: higher weights burn slower, lower weights burn faster

3. At any given point in time, the runnable vcpu with the highest
credit is allowed to run

4. When the "next runnable vcpu" on a runqueue is negative, credit is
reset: everyone gets another 10ms, and can carry over at most 2ms of
credit over the reset.
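The four steps above can be sketched as a toy model. This is not Xen's code: the names (`Vcpu`, `burn`, `reset_credit`, `pick_next`) are made up for illustration, credit is counted in microseconds, and letting negative credit carry across a reset is an assumption of the sketch.

```python
from dataclasses import dataclass

CREDIT_INIT_US = 10_000   # ~10ms of credit handed out at each reset
CREDIT_CARRY_US = 2_000   # at most 2ms of old credit survives a reset


@dataclass
class Vcpu:
    name: str
    weight: int
    credit: float = CREDIT_INIT_US


def burn(vcpu, ran_us, max_weight):
    """Step 2: charge for ran_us of cpu time; higher weight burns slower."""
    vcpu.credit -= ran_us * max_weight / vcpu.weight


def reset_credit(vcpus):
    """Step 4: everyone gets another 10ms, keeping at most 2ms of old credit."""
    for v in vcpus:
        v.credit = CREDIT_INIT_US + min(v.credit, CREDIT_CARRY_US)


def pick_next(runnable, everyone):
    """Step 3: highest credit runs; reset first if that credit is negative."""
    best = max(runnable, key=lambda v: v.credit)
    if best.credit < 0:
        reset_credit(everyone)
        best = max(runnable, key=lambda v: v.credit)
    return best
```

In this model a vcpu that mostly sleeps keeps its credit, so when it wakes it sorts ahead of vcpus that have been burning theirs.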

Generally speaking, vcpus that use less than their quota and have lots
of interrupts are scheduled immediately, since when they wake up they
always have more credit than the vcpus who are burning through their
slices.

But what about a situation as described recently on Matrix, where a VM
uses a non-negligible amount of cpu doing un-accelerated encryption
and decryption, which can be delayed by a few ms, as well as handling
audio events?  How can we make sure that:

1. We can run whenever interrupts happen
2. We get no more than our fair share of the cpu?

The counter-intuitive key here is that in order to achieve the above,
you need to *deschedule or preempt early*, so that when the interrupt
comes, you have spare credit to run the interrupt handler.  How do we
manage that?

The idea I'm working out comes from a phrase I used in the Matrix
discussion, about a vcpu that "foolishly burned all its credits".
Naturally the thing you want to do to have credits available is to
save them up.

So the idea would be this.  Each vcpu would have a "boost credit
ratio" and a "default boost interval"; there would be sensible
defaults based on typical workloads, but these could be tweaked for
individual VMs.

When credit is assigned, all VMs would get the same amount of credit,
but divided into two "buckets", according to the boost credit ratio.

Under certain conditions, a vcpu would be considered "boosted"; this
state would last either until the default boost interval expires, or
until some other event (such as a de-boost yield).

The queue would be sorted thus:

* Boosted vcpus, by boost credit available
* Non-boosted vcpus, by non-boost credit available

Getting more boost credit means having lower priority when not
boosted; and burning through your boost credit means not being
scheduled when you need to be.
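A minimal sketch of the two buckets and the proposed queue order. Everything here is hypothetical: the field names, the percentage representation of the "boost credit ratio", and the 10ms total are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Vcpu:
    name: str
    boost_ratio_pct: int      # hypothetical per-vcpu "boost credit ratio"
    boost_credit: int = 0     # bucket used while boosted
    normal_credit: int = 0    # bucket used while not boosted
    boosted: bool = False


def assign_credit(vcpu, total_us=10_000):
    """Same total for everyone, split by the vcpu's boost credit ratio."""
    vcpu.boost_credit = total_us * vcpu.boost_ratio_pct // 100
    vcpu.normal_credit = total_us - vcpu.boost_credit


def sort_key(vcpu):
    """Boosted vcpus first, each group ordered by its own bucket, highest first."""
    if vcpu.boosted:
        return (0, -vcpu.boost_credit)
    return (1, -vcpu.normal_credit)


def sorted_runq(vcpus):
    return sorted(vcpus, key=sort_key)
```

A vcpu with a 25% ratio sorts behind a 0% vcpu while neither is boosted (7.5ms against 10ms of ordinary credit), and ahead of it once boosted. That is the trade-off described above.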

Other ways we could consider putting a vcpu into a boosted state (some
discussed on Matrix or emails linked from Matrix):
* Xen is about to preempt, but finds that the vcpu interrupts are
blocked (this sort of overlaps with the "when we deliver an interrupt"
one)
* Xen is about to preempt, but finds that the (currently out-of-tree)
"dont_desched" bit has been set in the shared memory area

Other ways to consider de-boosting:
* There's a way to trigger a VMEXIT when interrupts have been
re-enabled; setting this up when the VM is in the boost state

Getting the defaults right might take some thinking.  If you set the
default "boost credit ratio" to 25% and the "default boost interval"
to 500ms, then you'd basically have five "boosts" per scheduling
window.  The window depends on how active other vcpus are, but if it's
longer than 20ms your system is too overloaded.

Thoughts?  Demi, what kinds of interrupt counts are you getting for your VM?

 -George

* Re: Sketch of an idea for handling the "mixed workload" problem
  2023-09-29 16:42 Sketch of an idea for handling the "mixed workload" problem George Dunlap
@ 2023-09-30 23:28 ` Demi Marie Obenour
  2023-10-02 11:20   ` George Dunlap
  2024-01-22  0:31 ` Demi Marie Obenour
  1 sibling, 1 reply; 12+ messages in thread
From: Demi Marie Obenour @ 2023-09-30 23:28 UTC (permalink / raw)
  To: George Dunlap, Xen-devel; +Cc: Juergen Gross, Marek Marczykowski-Górecki

On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> The basic credit2 algorithm goes something like this:
> 
> 1. All vcpus start with the same number of credits; about 10ms worth
> if everyone has the same weight

> 2. vcpus burn credits as they consume cpu, based on the relative
> weights: higher weights burn slower, lower weights burn faster
> 
> 3. At any given point in time, the runnable vcpu with the highest
> credit is allowed to run
> 
> 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> reset: everyone gets another 10ms, and can carry over at most 2ms of
> credit over the reset.

One relevant aspect of Qubes OS is that it is very very heavily
oversubscribed: having more VMs running than physical CPUs is (at least
in my usage) not uncommon, and each of those VMs will typically have at
least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
a vCPU not being allowed to execute for 200ms or more.  For audio or
video workloads, this is a disaster.

10ms is a LOT for desktop workloads or for anyone who cares about
latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
heavily contended system frame drops are guaranteed.
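The frame arithmetic can be checked with a few lines. The helper name is made up, and the sketch takes this message's premise that a vCPU runs for a full 10ms slice.

```python
from fractions import Fraction


def slice_as_frames(slice_ms, refresh_hz):
    """Express a time slice as a fraction of one display frame."""
    frame_ms = Fraction(1000, refresh_hz)
    return Fraction(slice_ms) / frame_ms
```

`slice_as_frames(10, 60)` is `Fraction(3, 5)`, and `slice_as_frames(10, 120)` is `Fraction(6, 5)`, i.e. longer than a whole frame at 120Hz.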

> Generally speaking, vcpus that use less than their quota and have lots
> of interrupts are scheduled immediately, since when they wake up they
> always have more credit than the vcpus who are burning through their
> slices.
> 
> But what about a situation as described recently on Matrix, where a VM
> uses a non-negligible amount of cpu doing un-accelerated encryption
> and decryption, which can be delayed by a few MS, as well as handling
> audio events?  How can we make sure that:
> 
> 1. We can run whenever interrupts happen
> 2. We get no more than our fair share of the cpu?
> 
> The counter-intuitive key here is that in order to achieve the above,
> you need to *deschedule or preempt early*, so that when the interrupt
> comes, you have spare credit to run the interrupt handler.  How do we
> manage that?
> 
> The idea I'm working out comes from a phrase I used in the Matrix
> discussion, about a vcpu that "foolishly burned all its credits".
> Naturally the thing you want to do to have credits available is to
> save them up.
> 
> So the idea would be this.  Each vcpu would have a "boost credit
> ratio" and a "default boost interval"; there would be sensible
> defaults based on typical workloads, but these could be tweaked for
> individual VMs.
> 
> When credit is assigned, all VMs would get the same amount of credit,
> but divided into two "buckets", according to the boost credit ratio.
> 
> Under certain conditions, a vcpu would be considered "boosted"; this
> state would last either until the default boost interval, or until
> some other event (such as a de-boost yield).
> 
> The queue would be sorted thus:
> 
> * Boosted vcpus, by boost credit available
> * Non-boosted vcpus, by non-boost credit available
> 
> Getting more boost credit means having lower priority when not
> boosted; and burning through your boost credit means not being
> scheduled when you need to be.
> 
> Other ways we could consider putting a vcpu into a boosted state (some
> discussed on Matrix or emails linked from Matrix):
> * Xen is about to preempt, but finds that the vcpu interrupts are
> blocked (this sort of overlaps with the "when we deliver an interrupt"
> one)

This is also a good heuristic for "vCPU owns a spinlock", which is
definitely a bad time to preempt.

> * Xen is about to preempt, but finds that the (currently out-of-tree)
> "dont_desched" bit has been set in the shared memory area
> 
> Other ways to consider de-boosting:
> * There's a way to trigger a VMEXIT when interrupts have been
> re-enabled; setting this up when the VM is in the boost state

This is a good idea.

> Getting the defaults right might take some thinking.  If you set the
> default "boost credit ratio" to 25% and the "default boost interval"
> to 500ms, then you'd basically have five "boosts" per scheduling
> window.  The window depends on how active other vcpus are, but if it's
> longer than 20ms your system is too overloaded.

An interval of 500ms seems rather long to me.  Did you mean 500μs?

> Thoughts?

My first thought when I had the problem is that Xen's scheduling quantum
was too long.  This is consistent with the observation that dom0 (which
was not very busy IIRC) fell behind in its delivery of audio samples.
Presumably it had plenty of credit, but simply did not get scheduled in
time, perhaps because Xen did not preempt soon enough.  It’s also worth
noting that Qubes makes heavy use of vchans, and I expect the latency of
these to be directly proportional to the time between preemption
interrupts.

Audio is not very demanding on throughput, but is extremely sensitive to
latency.  Therefore, the top priority is making sure that every runnable
vCPU gets a chance to execute periodically.  One way to solve this would
be for both the credits (both the initial credit and the maximum credit
carried over) and the interval between preemptions to be inversely
proportional to the number of runnable vCPUs, so that the time needed to
cycle through all runnable vCPUs is roughly constant.  Specifically,
they would be proportional to Lmax/runnable_vCPUs, where Lmax is the
latency target (1ms or so).  This also ensures that even Xen-unaware VMs
(such as a Windows guest running Microsoft Teams or Skype) get to run
periodically.  There would need to be a limit to prevent Xen from
hogging more than e.g. 10% of CPU time just doing preemption, but if
this is hit, Xen should log something and possibly notify dom0 so that a
warning can be displayed to the user.  Additionally, a certain amount of
CPU time (such as 10%) should be reserved for dom0, so that the system
remains responsive.
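A rough sketch of the Lmax/runnable_vCPUs idea with the overhead limit. The function name, the 2us context-switch cost, and the way overhead is estimated (switch cost divided by slice length) are all assumptions made for illustration, not measurements.

```python
def timeslice_us(runnable_vcpus, lmax_us=1000, switch_cost_us=2,
                 max_overhead_pct=10):
    """Return (slice_us, clamped) for a latency target of lmax_us.

    The slice shrinks as more vCPUs become runnable, so one pass over all
    of them takes about lmax_us.  It is never allowed below the length at
    which switching alone would cost max_overhead_pct of cpu time.
    """
    if runnable_vcpus < 1:
        raise ValueError("need at least one runnable vCPU")
    wanted = lmax_us / runnable_vcpus
    floor = switch_cost_us * 100 / max_overhead_pct
    if wanted < floor:
        return floor, True    # limit hit: the case where Xen would log
    return wanted, False
```

With these made-up numbers, 4 runnable vCPUs get 250us each, while 100 runnable vCPUs hit the 20us floor and the latency target can no longer be met.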

Qubes OS could also help here.  If a VM is allowed to record audio, it
(and the VMs providing network to it, transitively) should get a boost
in priority, so that if the system is overloaded other guests are more
likely to be delayed in their execution.

> Demi, what kinds of interrupt counts are you getting for your VM?

I didn't measure it, but I can check the next time I am on a video call
or doing audio recording.

>  -George

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Sketch of an idea for handling the "mixed workload" problem
  2023-09-30 23:28 ` Demi Marie Obenour
@ 2023-10-02 11:20   ` George Dunlap
  2024-01-21 23:46     ` Demi Marie Obenour
  0 siblings, 1 reply; 12+ messages in thread
From: George Dunlap @ 2023-10-02 11:20 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xen-devel, Juergen Gross, Marek Marczykowski-Górecki

On Sun, Oct 1, 2023 at 12:28 AM Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > The basic credit2 algorithm goes something like this:
> >
> > 1. All vcpus start with the same number of credits; about 10ms worth
> > if everyone has the same weight
>
> > 2. vcpus burn credits as they consume cpu, based on the relative
> > weights: higher weights burn slower, lower weights burn faster
> >
> > 3. At any given point in time, the runnable vcpu with the highest
> > credit is allowed to run
> >
> > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > reset: everyone gets another 10ms, and can carry over at most 2ms of
> > credit over the reset.
>
> One relevant aspect of Qubes OS is that it is very very heavily
> oversubscribed: having more VMs running than physical CPUs is (at least
> in my usage) not uncommon, and each of those VMs will typically have at
> least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
> a vCPU not being allowed to execute for 200ms or more.  For audio or
> video, workloads, this is a disaster.
>
> 10ms is a LOT for desktop workloads or for anyone who cares about
> latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
> heavily contended system frame drops are guaranteed.

You'd probably benefit from understanding better how the various
algorithms actually work.  I'm sorry I don't have any really good
"virtualization scheduling for dummies" resources; the best I have is
a few talks I gave on the subject; e.g.:

https://www.youtube.com/watch?v=C3jjvkr6fgQ

For one, when I say "oversubscribed", I don't mean "vcpus / pcpus"; I
mean "requested vcpu execution time / vcpus".  If you have 18 vcpus on
a single pcpu, and all of them *on an empty system* would have run at
5%, you're totally fine.  If you have 18 vcpus on a single pcpu, and
all of them on an empty system would have averaged 100%, there's only
so much the scheduler can do to avoid problems.
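The two 18-vcpu cases can be put in numbers. This sketch reads "oversubscribed" as total standalone demand exceeding physical cpu capacity, which is what the example works out to; the function names are invented.

```python
def pcpu_demand_pct(standalone_pct, pcpus=1):
    """Total demand per pcpu, given what each vcpu would use on an empty system."""
    return sum(standalone_pct) / pcpus


def is_oversubscribed(standalone_pct, pcpus=1):
    return pcpu_demand_pct(standalone_pct, pcpus) > 100
```

Eighteen vcpus at 5% each ask for 90% of one pcpu, which fits. Eighteen vcpus at 100% each ask for 1800%, which no scheduler can satisfy.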

Secondly, while on credit1 a vcpu is allowed to run for 10ms without
stopping (and then must wait for 18x that time to get the same credit
back, if there are 18 other vcpus running on that same pcpu), this is
not the case for credit2.  The exact calculation can be found in
xen/common/sched/credit2.c:csched2_runtime(), but here's the general
algorithm from the comment:

/* General algorithm:
 * 1) Run until snext's credit will be 0.
 * 2) But if someone is waiting, run until snext's credit is equal
 *    to his.
 * 3) But, if we are capped, never run more than our budget.
 * 4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
 *    the ratelimit time.
 */

Default MIN_TIMER is 500us, and is configurable via sysctl; default
MAX_TIMER is... hmm, I'm pretty sure this started out as 2ms, but now
it seems to be 10ms.  Looks like this was changed in da92ec5bd1 ("xen:
credit2: "relax" CSCHED2_MAX_TIMER") in 2016.  (MAX_TIMER isn't
configurable, but arguably it should be; and making it configurable
should just be a matter of duplicating the logic around MIN_TIMER.)
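The general algorithm in that comment can be sketched as a small function. This is a simplified model and not the Xen implementation: credit is treated as microseconds of runtime (weights are ignored), the parameter names are invented, and applying the budget before the final clamp is an assumption of the sketch.

```python
MIN_TIMER_US = 500       # default lower bound mentioned above
MAX_TIMER_US = 10_000    # current upper bound mentioned above


def runtime_us(snext_credit_us, waiter_credit_us=None, budget_us=None,
               ratelimit_us=0):
    """How long the next vcpu may run before the scheduler looks again."""
    # 1) Run until snext's credit would reach 0.
    time_us = snext_credit_us
    # 2) If someone is waiting, run only until the two credits are equal.
    if waiter_credit_us is not None:
        time_us = snext_credit_us - waiter_credit_us
    # 3) A capped vcpu never runs past its budget.
    if budget_us is not None:
        time_us = min(time_us, budget_us)
    # 4) Never longer than MAX_TIMER, never shorter than MIN_TIMER or
    #    the ratelimit.
    lower = max(MIN_TIMER_US, ratelimit_us)
    return max(lower, min(time_us, MAX_TIMER_US))
```

In this model a vcpu only gets the full 10ms when nobody with comparable credit is waiting: with a waiter 3ms of credit behind, the slice is 3ms.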

That's not yet the last word though: if a vcpu that was asleep wakes
up, and it has more credit than the running vcpu, then it will generally
preempt that vcpu.

All that to say that it should be very rare for a vcpu to run for a
full 10ms under credit2.

> > Other ways we could consider putting a vcpu into a boosted state (some
> > discussed on Matrix or emails linked from Matrix):
> > * Xen is about to preempt, but finds that the vcpu interrupts are
> > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > one)
>
> This is also a good heuristic for "vCPU owns a spinlock", which is
> definitely a bad time to preempt.

Not all spinlocks disable IRQs, but certainly some do.

> > Getting the defaults right might take some thinking.  If you set the
> > default "boost credit ratio" to 25% and the "default boost interval"
> > to 500ms, then you'd basically have five "boosts" per scheduling
> > window.  The window depends on how active other vcpus are, but if it's
> > longer than 20ms your system is too overloaded.
>
> An interval of 500ms seems rather long to me.  Did you mean 500μs?

Yes, I did mean 500us, sorry.
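With the interval corrected to 500us, the "five boosts" figure works out as below. The sketch assumes 10ms of credit per reset, equal weights, and that each boost consumes a full boost interval; the function name is made up.

```python
def boosts_per_reset(credit_us=10_000, boost_ratio_pct=25,
                     boost_interval_us=500):
    """How many full boost intervals the boost bucket can pay for."""
    boost_credit_us = credit_us * boost_ratio_pct // 100
    return boost_credit_us // boost_interval_us
```

`boosts_per_reset()` gives 5 (2.5ms of boost credit at 500us per boost), whereas an interval of 500ms would give 0.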

I'll respond to the other suggestions later.

> > Demi, what kinds of interrupt counts are you getting for your VM?
>
> I didn't measure it, but I can check the next time I am on a video call
> or doing audio recoring.

xentrace traces would be really interesting too; those are another
good way to nerd-snipe me. :-)

 -George

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Sketch of an idea for handling the "mixed workload" problem
  2023-10-02 11:20   ` George Dunlap
@ 2024-01-21 23:46     ` Demi Marie Obenour
  0 siblings, 0 replies; 12+ messages in thread
From: Demi Marie Obenour @ 2024-01-21 23:46 UTC (permalink / raw)
  To: George Dunlap; +Cc: Xen-devel, Juergen Gross, Marek Marczykowski-Górecki

On Mon, Oct 02, 2023 at 12:20:31PM +0100, George Dunlap wrote:
> On Sun, Oct 1, 2023 at 12:28 AM Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > > if everyone has the same weight
> >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > > weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > > credit is allowed to run
> > >
> > > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > > reset: everyone gets another 10ms, and can carry over at most 2ms of
> > > credit over the reset.
> >
> > One relevant aspect of Qubes OS is that it is very very heavily
> > oversubscribed: having more VMs running than physical CPUs is (at least
> > in my usage) not uncommon, and each of those VMs will typically have at
> > least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
> > a vCPU not being allowed to execute for 200ms or more.  For audio or
> > video, workloads, this is a disaster.
> >
> > 10ms is a LOT for desktop workloads or for anyone who cares about
> > latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
> > heavily contended system frame drops are guaranteed.
> 
> You'd probably benefit from understanding better how the various
> algorithms actually work.  I'm sorry I don't have any really good
> "virtualization scheduling for dummies" resources; the best I have is
> a few talks I gave on the subject; e.g.:
> 
> https://www.youtube.com/watch?v=C3jjvkr6fgQ
> 
> For one, when I say "oversubscribed", I don't mean "vcpus / pcpus"; I
> mean "requested vcpu execution time / vcpus".  If you have 18 vcpus on
> a single pcpu, and all of them *on an empty system* would have run at
> 5%, you're totally fine.  If you have 18 vcpus on a single pcpu, and
> all of them on an empty system would have averaged 100%, there's only
> so much the scheduler can do to avoid problems.

If each vCPU would have spent 4% of its time doing realtime tasks, it
should be possible to give all of the realtime tasks all the time they
need, while the remaining 100 - 4 * 18 = 28% of time is available to
non-realtime tasks.  That’s not awesome, but it might be enough to
prevent audio from glitching.

> Secondly, while on credit1 a vcpu is allowed to run for 10ms without
> stopping (and then must wait for 18x that time to get the same credit
> back, if there are 18 other vcpus running on that same pcpu), this is
> not the case for credit2.  The exact calculation can be found in
> xen/common/sched/credit2.c:sched2_runtime(), but generally here's the
> general algorithm from the comment:
> 
> /* General algorithm:
>  * 1) Run until snext's credit will be 0.
>  * 2) But if someone is waiting, run until snext's credit is equal
>  *    to his.
>  * 3) But, if we are capped, never run more than our budget.
>  * 4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
>  *    the ratelimit time.
>  */
> 
> Default MIN_TIMER is 500us, and is configurable via sysctl; default
> MAX_TIMER is... hmm, I'm pretty sure this started out as 2ms, but now
> it seems to be 10ms.  Looks like this was changed in da92ec5bd1 ("xen:
> credit2: "relax" CSCHED2_MAX_TIMER") in 2016.  (MAX_TIMER isn't
> configurable, but arguably it should be; and making it configurable
> should just be a matter of duplicating the logic around MIN_TIMER.)

Maybe MAX_TIMER should be lowered to e.g. 1ms?

> That's not yet the last word though: If a VM that was a sleep wakes
> up, and it has credit than the running vcpu, then it will generally
> preempt that cpu.
> 
> All that to say, that it should be very rare for a cpu to run for a
> full 10ms under credit2.

That’s good.

> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu interrupts are
> > > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > > one)
> >
> > This is also a good heuristic for "vCPU owns a spinlock", which is
> > definitely a bad time to preempt.
> 
> Not all spinlocks disable IRQs, but certainly some do.
> 
> > > Getting the defaults right might take some thinking.  If you set the
> > > default "boost credit ratio" to 25% and the "default boost interval"
> > > to 500ms, then you'd basically have five "boosts" per scheduling
> > > window.  The window depends on how active other vcpus are, but if it's
> > > longer than 20ms your system is too overloaded.
> >
> > An interval of 500ms seems rather long to me.  Did you mean 500μs?
> 
> Yes, I did mean 500us, sorry.
> 
> I'll respond to the other suggestions later.
> 
> > > Demi, what kinds of interrupt counts are you getting for your VM?
> >
> > I didn't measure it, but I can check the next time I am on a video call
> > or doing audio recoring.
> 
> Running xentrace would be really interesting too; those are another
> good way to nerd-snipe me. :-)
> 
>  -George

That would certainly be a good idea!
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Sketch of an idea for handling the "mixed workload" problem
  2023-09-29 16:42 Sketch of an idea for handling the "mixed workload" problem George Dunlap
  2023-09-30 23:28 ` Demi Marie Obenour
@ 2024-01-22  0:31 ` Demi Marie Obenour
  2024-01-22 11:54   ` George Dunlap
  1 sibling, 1 reply; 12+ messages in thread
From: Demi Marie Obenour @ 2024-01-22  0:31 UTC (permalink / raw)
  To: George Dunlap, Xen-devel; +Cc: Juergen Gross, Marek Marczykowski-Górecki

On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> The basic credit2 algorithm goes something like this:
> 
> 1. All vcpus start with the same number of credits; about 10ms worth
> if everyone has the same weight
> 
> 2. vcpus burn credits as they consume cpu, based on the relative
> weights: higher weights burn slower, lower weights burn faster
> 
> 3. At any given point in time, the runnable vcpu with the highest
> credit is allowed to run
> 
> 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> reset: everyone gets another 10ms, and can carry over at most 2ms of
> credit over the reset.
> 
> Generally speaking, vcpus that use less than their quota and have lots
> of interrupts are scheduled immediately, since when they wake up they
> always have more credit than the vcpus who are burning through their
> slices.
> 
> But what about a situation as described recently on Matrix, where a VM
> uses a non-negligible amount of cpu doing un-accelerated encryption
> and decryption, which can be delayed by a few MS, as well as handling
> audio events?  How can we make sure that:
> 
> 1. We can run whenever interrupts happen
> 2. We get no more than our fair share of the cpu?
> 
> The counter-intuitive key here is that in order to achieve the above,
> you need to *deschedule or preempt early*, so that when the interrupt
> comes, you have spare credit to run the interrupt handler.  How do we
> manage that?
> 
> The idea I'm working out comes from a phrase I used in the Matrix
> discussion, about a vcpu that "foolishly burned all its credits".
> Naturally the thing you want to do to have credits available is to
> save them up.
> 
> So the idea would be this.  Each vcpu would have a "boost credit
> ratio" and a "default boost interval"; there would be sensible
> defaults based on typical workloads, but these could be tweaked for
> individual VMs.
> 
> When credit is assigned, all VMs would get the same amount of credit,
> but divided into two "buckets", according to the boost credit ratio.
> 
> Under certain conditions, a vcpu would be considered "boosted"; this
> state would last either until the default boost interval, or until
> some other event (such as a de-boost yield).
> 
> The queue would be sorted thus:
> 
> * Boosted vcpus, by boost credit available
> * Non-boosted vcpus, by non-boost credit available
> 
> Getting more boost credit means having lower priority when not
> boosted; and burning through your boost credit means not being
> scheduled when you need to be.
> 
> Other ways we could consider putting a vcpu into a boosted state (some
> discussed on Matrix or emails linked from Matrix):
> * Xen is about to preempt, but finds that the vcpu interrupts are
> blocked (this sort of overlaps with the "when we deliver an interrupt"
> one)
> * Xen is about to preempt, but finds that the (currently out-of-tree)
> "dont_desched" bit has been set in the shared memory area

I think both of these would be good.  Another one would be when Xen is
about to deliver an interrupt to a guest, provided that there is no
storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
spike through what I presume is an interrupt storm, and I suspect that
others have observed similar behavior with USB external drives.

> Other ways to consider de-boosting:
> * There's a way to trigger a VMEXIT when interrupts have been
> re-enabled; setting this up when the VM is in the boost state

That’s a good idea, but should be conditional on “dont_desched” _not_
being set.  This handles the case where the guest is running a realtime
thread.

Generally, I’d like to see something like this:

- A vCPU with sufficient boost credit is boosted by Xen under the
  following conditions:

  1. Xen interrupts the guest.
  2. Xen is about to preempt, but detects that “dont_desched” is set.
  3. Xen is about to preempt, but detects that interrupts are disabled.

- A vCPU is deboosted if:

  1. It runs out of boost credit, even if “dont_desched” is set.
  2. An interrupt handler returns, but only if “dont_desched” is not set.
  3. Interrupts are re-enabled, but only if “dont_desched” is not set.

  The first case is an abnormal condition and typically means that
  either the system is overloaded or a vCPU is running boosted for too
  long.  To help debug this situation, Xen will log a warning and
  increment both a system-wide and a per-domain counter.  dom0 can
  retrieve counters for any domain, and a domain can read its own
  counter.

- When to set “dont_desched” is entirely up to the guest kernel, but
  there are some general rules guests should follow:

  - Only set “dont_desched” if there is a good reason, and unset it as
    soon as possible.  Xen gives vCPUs with “dont_desched” set priority
    over all other vCPUs on the system, but the amount of time a vCPU is
    allowed to run with an elevated priority is limited.  Xen will log a
    warning if a guest tries to run with elevated priority for too long.
    
  - Xen boosts vCPUs before delivering an interrupt, but there should be
    a way for a vCPU to deboost itself even before returning from the
    interrupt handler.

  - Guests should always set “dont_desched” when running hard-realtime
    threads (used for e.g. audio processing), even when the thread is in
    userspace.  This ensures that Xen gives the underlying vCPU priority
    over vCPUs 

  - Guests should always set “dont_desched” when holding a spin lock,
    but it is even better to use paravirtualized spin locks (which make
    a hypercall into Xen and therefore allow other vCPUs to run).

  - Xen does not implement priority inheritance, so guests need to do
    that.

- Max boost credits can be set by dom0 via a hypercall.
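The boost and deboost rules in this list can be sketched as a small state machine. It is only an illustration: the event names, the fields, the reading of "sufficient boost credit" as "more than zero", and the single counter standing in for the system-wide and per-domain counters are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    INTERRUPT_DELIVERED = auto()
    PREEMPT_ATTEMPT = auto()
    HANDLER_RETURNED = auto()
    INTERRUPTS_ENABLED = auto()


@dataclass
class Vcpu:
    boost_credit_us: int
    boosted: bool = False
    dont_desched: bool = False
    irqs_disabled: bool = False
    exhausted_count: int = 0    # stand-in for the proposed warning counters


def handle(vcpu, event):
    """Apply the boost rules (1-3) and deboost rules (2-3)."""
    has_credit = vcpu.boost_credit_us > 0
    if event is Event.INTERRUPT_DELIVERED:
        vcpu.boosted = vcpu.boosted or has_credit
    elif event is Event.PREEMPT_ATTEMPT:
        if (vcpu.dont_desched or vcpu.irqs_disabled) and has_credit:
            vcpu.boosted = True
    elif not vcpu.dont_desched:
        # Handler returned or interrupts re-enabled, and no dont_desched.
        vcpu.boosted = False


def account(vcpu, ran_us):
    """Deboost rule 1: running out of boost credit always deboosts."""
    if not vcpu.boosted:
        return
    vcpu.boost_credit_us = max(0, vcpu.boost_credit_us - ran_us)
    if vcpu.boost_credit_us == 0:
        vcpu.boosted = False
        vcpu.exhausted_count += 1
```

In this sketch `dont_desched` keeps a vcpu boosted across handler returns, but it cannot keep it boosted once the boost bucket is empty.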

The advantage of this approach is that it keeps almost all policy out of
Xen.  The only exception is the boosting when an interrupt is received,
but a well-behaved guest will deboost itself very quickly (by enabling
interrupts) if the boost was not actually needed, so this should have
very limited impact.  I think this should be enough for realtime audio,
and it is somewhat related to (but hopefully simpler than) the KVM RFC
from Google [1].

Any thoughts on this?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[1]: https://lore.kernel.org/kvm/20231214024727.3503870-1-vineeth@bitbyteword.org/


* Re: Sketch of an idea for handling the "mixed workload" problem
  2024-01-22  0:31 ` Demi Marie Obenour
@ 2024-01-22 11:54   ` George Dunlap
  2024-01-22 12:17     ` Marek Marczykowski-Górecki
  2024-01-23 16:58     ` Demi Marie Obenour
  0 siblings, 2 replies; 12+ messages in thread
From: George Dunlap @ 2024-01-22 11:54 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xen-devel, Juergen Gross, Marek Marczykowski-Górecki

On Mon, Jan 22, 2024 at 12:31 AM Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > The basic credit2 algorithm goes something like this:
> >
> > 1. All vcpus start with the same number of credits; about 10ms worth
> > if everyone has the same weight
> >
> > 2. vcpus burn credits as they consume cpu, based on the relative
> > weights: higher weights burn slower, lower weights burn faster
> >
> > 3. At any given point in time, the runnable vcpu with the highest
> > credit is allowed to run
> >
> > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > reset: everyone gets another 10ms, and can carry over at most 2ms of
> > credit over the reset.
> >
> > Generally speaking, vcpus that use less than their quota and have lots
> > of interrupts are scheduled immediately, since when they wake up they
> > always have more credit than the vcpus who are burning through their
> > slices.
> >
> > But what about a situation as described recently on Matrix, where a VM
> > uses a non-negligible amount of cpu doing un-accelerated encryption
> > and decryption, which can be delayed by a few MS, as well as handling
> > audio events?  How can we make sure that:
> >
> > 1. We can run whenever interrupts happen
> > 2. We get no more than our fair share of the cpu?
> >
> > The counter-intuitive key here is that in order to achieve the above,
> > you need to *deschedule or preempt early*, so that when the interrupt
> > comes, you have spare credit to run the interrupt handler.  How do we
> > manage that?
> >
> > The idea I'm working out comes from a phrase I used in the Matrix
> > discussion, about a vcpu that "foolishly burned all its credits".
> > Naturally the thing you want to do to have credits available is to
> > save them up.
> >
> > So the idea would be this.  Each vcpu would have a "boost credit
> > ratio" and a "default boost interval"; there would be sensible
> > defaults based on typical workloads, but these could be tweaked for
> > individual VMs.
> >
> > When credit is assigned, all VMs would get the same amount of credit,
> > but divided into two "buckets", according to the boost credit ratio.
> >
> > Under certain conditions, a vcpu would be considered "boosted"; this
> > state would last either until the default boost interval, or until
> > some other event (such as a de-boost yield).
> >
> > The queue would be sorted thus:
> >
> > * Boosted vcpus, by boost credit available
> > * Non-boosted vcpus, by non-boost credit available
> >
> > Getting more boost credit means having lower priority when not
> > boosted; and burning through your boost credit means not being
> > scheduled when you need to be.
> >
> > Other ways we could consider putting a vcpu into a boosted state (some
> > discussed on Matrix or emails linked from Matrix):
> > * Xen is about to preempt, but finds that the vcpu interrupts are
> > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > one)
> > * Xen is about to preempt, but finds that the (currently out-of-tree)
> > "dont_desched" bit has been set in the shared memory area
>
> I think both of these would be good.  Another one would be when Xen is
> about to deliver an interrupt to a guest, provided that there is no
> storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
> spike through what I presume is an interrupt storm, and I suspect that
> others have observed similar behavior with USB external drives.

How would you determine that a given interrupt was part of a "storm",
and what would you do differently as a result of determining that?

> > Other ways to consider de-boosting:
> > * There's a way to trigger a VMEXIT when interrupts have been
> > re-enabled; setting this up when the VM is in the boost state
>
> That’s a good idea, but should be conditional on “dont_desched” _not_
> being set.  This handles the case where the guest is running a realtime
> thread.

In which case we need some way for the "enlightened" guest to know how
to de-boost itself; a yield might do.

> Generally, I’d like to see something like this:
>
> - A vCPU with sufficient boost credit is boosted by Xen under the
>   following conditions:
>
>   1. Xen interrupts the guest.

I take it you mean, "delivers an interrupt to the guest"?


>   2. Xen is about to preempt, but detects that “dont_desched” is set.
>   3. Xen is about to preempt, but detects that interrupts are disabled.
>
> - A vCPU is deboosted if:
>
>   1. It runs out of boost credit, even if “dont_desched” is set.
>   2. An interrupt handler returns, but only if “dont_desched” is not set.
>   3. Interrupts are re-enabled, but only if “dont_desched” is not set.
>
>   The first case is an abnormal condition and typically means that
>   either the system is overloaded or a vCPU is running boosted for too
>   long.  To help debug this situation, Xen will log a warning and
>   increment both a system-wide and a per-domain counter.  dom0 can
>   retrieve counters for any domain, and a domain can read its own
>   counter.
>
> - When to set “dont_desched” is entirely up to the guest kernel, but
>   there are some general rules guests should follow:
>
>   - Only set “dont_desched” if there is a good reason, and unset it as
>     soon as possible.  Xen gives vCPUs with “dont_desched” set priority
>     over all other vCPUs on the system, but the amount of time a vCPU is
>     allowed to run with an elevated priority is limited.  Xen will log a
>     warning if a guest tries to run with elevated priority for too long.
>
>   - Xen boosts vCPUs before delivering an interrupt, but there should be
>     a way for a vCPU to deboost itself even before returning from the
>     interrupt handler.
>
>   - Guests should always set “dont_desched” when running hard-realtime
>     threads (used for e.g. audio processing), even when the thread is in
>     userspace.  This ensures that Xen gives the underlying vCPU priority
>     over vCPUs
>
>   - Guests should always set “dont_desched” when holding a spin lock,
>     but it is even better to use paravirtualized spin locks (which make
>     a hypercall into Xen and therefore allow other vCPUs to run).
>
>   - Xen does not implement priority inheritance, so guests need to do
>     that.
>
> - Max boost credits can be set by dom0 via a hypercall.
>
> The advantage of this approach is that it keeps almost all policy out of
> Xen.  The only exception is the boosting when an interrupt is received,
> but a well-behaved guest will deboost itself very quickly (by enabling
> interrupts) if the boost was not actually needed, so this should have
> very limited impact.  I think this should be enough for realtime audio,
> and it is somewhat related to (but hopefully simpler than) the KVM RFC
> from Google [1].
>
> Any thoughts on this?

Overall sounds good.  I think a good approach would be to start by
implementing it without the "dont_desched" flag, and then add that on
top later.  It sounds like you have a clear vision for what you want,
so it shouldn't be too hard to write such that adding the
"dont_desched" doesn't require a lot of pointless refactoring.
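
To make that first step concrete, here is a rough sketch of the bucket
split and queue ordering without "dont_desched".  It is Python rather
than Xen C, and every name in it (Vcpu, assign_credit, boost_percent,
runqueue_order and so on) is made up for illustration; none of this
exists in credit2 today, and the units are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Vcpu:
    name: str
    boost_credit: int = 0    # hypothetical: credit usable only while boosted
    normal_credit: int = 0   # hypothetical: credit usable while not boosted
    boosted: bool = False


def assign_credit(v, total, boost_percent):
    """Split one reset's worth of credit into the two buckets."""
    v.boost_credit = total * boost_percent // 100
    v.normal_credit = total - v.boost_credit


def on_interrupt_delivery(v):
    """Boost when an interrupt is delivered, but only if boost credit remains."""
    if v.boost_credit > 0:
        v.boosted = True


def burn(v, amount):
    """Charge run time to the active bucket; de-boost when boost credit runs out."""
    if v.boosted:
        v.boost_credit -= amount
        if v.boost_credit <= 0:
            v.boosted = False
    else:
        v.normal_credit -= amount


def deboost(v):
    """Explicit de-boost, e.g. on a yield or when interrupts are re-enabled."""
    v.boosted = False


def runqueue_order(vcpus):
    """Boosted vcpus first, by boost credit; then the rest, by non-boost credit."""
    def key(v):
        credit = v.boost_credit if v.boosted else v.normal_credit
        return (not v.boosted, -credit)
    return sorted(vcpus, key=key)
```

Note the trade-off from the original proposal shows up directly: a
larger boost_percent leaves less normal_credit, so the vcpu sorts lower
whenever it is not boosted.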

The other issue I have with this (and essentially where I got stuck
developing credit2 in the first place) is testing: how do you ensure
that it has the properties that you expect?  How do you develop a
"regression test" to make sure that server-based workloads don't have
issues in this sort of case?

 -George


* Re: Sketch of an idea for handling the "mixed workload" problem
  2024-01-22 11:54   ` George Dunlap
@ 2024-01-22 12:17     ` Marek Marczykowski-Górecki
  2024-01-22 12:25       ` George Dunlap
  2024-01-23 16:58     ` Demi Marie Obenour
  1 sibling, 1 reply; 12+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-01-22 12:17 UTC (permalink / raw)
  To: George Dunlap; +Cc: Demi Marie Obenour, Xen-devel, Juergen Gross

On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> The other issue I have with this (and essentially where I got stuck
> developing credit2 in the first place) is testing: how do you ensure
> that it has the properties that you expect?  

Audio is actually quite a nice use case for this, since it's quite
sensitive to scheduling jitter. I think even a simple "PCI passthrough a
sound card and play/record something" should show results. In particular,
you can measure how hard you can push the system (for example artificial
load in other domains) until it breaks.

> How do you develop a
> "regression test" to make sure that server-based workloads don't have
> issues in this sort of case?

For this I believe there are several benchmarking methods already,
starting with old trusty "Linux kernel build time".

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: Sketch of an idea for handling the "mixed workload" problem
  2024-01-22 12:17     ` Marek Marczykowski-Górecki
@ 2024-01-22 12:25       ` George Dunlap
  2024-01-22 12:50         ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 12+ messages in thread
From: George Dunlap @ 2024-01-22 12:25 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Demi Marie Obenour, Xen-devel, Juergen Gross

On Mon, Jan 22, 2024 at 12:17 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
>
> On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> > The other issue I have with this (and essentially where I got stuck
> > developing credit2 in the first place) is testing: how do you ensure
> > that it has the properties that you expect?
>
> Audio is actually quite a nice use case for this, since it's quite
> sensitive to scheduling jitter. I think even a simple "PCI passthrough a
> sound card and play/record something" should show results. In particular,
> you can measure how hard you can push the system (for example artificial
> load in other domains) until it breaks.

Are we going to have a gitlab runner which says, "Marek sits in front of
his test machine and listens to audio for pops"? :-)

>
> > How do you develop a
> > "regression test" to make sure that server-based workloads don't have
> > issues in this sort of case?
>
> For this I believe there are several benchmarking methods already,
> starting with old trusty "Linux kernel build time".

First of all, AFAICT "Linux kernel build time" is not representative
of almost any actual server workload; and the end-to-end throughput
completely misses what most server workloads will actually care about,
like latency.

Secondly, what you're testing isn't the performance of a single
workload on an empty system; you're testing how workloads *interact*.
If you want ideal throughput for a single workload on an empty system,
use the null scheduler; more complex schedulers are only necessary
when multiple different workloads interact.

FWIW this was my first stab at trying to be systematic about testing
the scheduler:

https://github.com/gwd/schedbench

The rump kernel project has basically died AFAIK, so anyone trying to
resurrect this would probably have to try to rebase that bit of it
against something like XTF or unikernels.

 -George

* Re: Sketch of an idea for handling the "mixed workload" problem
  2024-01-22 12:25       ` George Dunlap
@ 2024-01-22 12:50         ` Marek Marczykowski-Górecki
  2024-01-22 13:02           ` George Dunlap
  0 siblings, 1 reply; 12+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-01-22 12:50 UTC (permalink / raw)
  To: George Dunlap; +Cc: Demi Marie Obenour, Xen-devel, Juergen Gross

On Mon, Jan 22, 2024 at 12:25:58PM +0000, George Dunlap wrote:
> On Mon, Jan 22, 2024 at 12:17 PM Marek Marczykowski-Górecki
> <marmarek@invisiblethingslab.com> wrote:
> >
> > On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> > > The other issue I have with this (and essentially where I got stuck
> > > developing credit2 in the first place) is testing: how do you ensure
> > > that it has the properties that you expect?
> >
> > > Audio is actually quite a nice use case for this, since it's quite
> > > sensitive to scheduling jitter. I think even a simple "PCI passthrough a
> > > sound card and play/record something" should show results. In particular,
> > > you can measure how hard you can push the system (for example artificial
> > > load in other domains) until it breaks.
> 
> Are we going to have a gitlab runner which says, "Marek sits in front of
> his test machine and listens to audio for pops"? :-)

Kinda ;)
We already have audio tests in Qubes CI. They do more or less the above,
but using our audio virtualization. Play something, record in another
domain, and compare. Running the very same thing in gitlab-ci may be too
complicated (it requires bringing in some Qubes infrastructure to make PV
audio work), but maybe a similar test can be done based on qemu-emulated
audio or another PV audio solution?

> > > How do you develop a
> > > "regression test" to make sure that server-based workloads don't have
> > > issues in this sort of case?
> >
> > For this I believe there are several benchmarking methods already,
> > starting with old trusty "Linux kernel build time".
> 
> First of all, AFAICT "Linux kernel build time" is not representative
> of almost any actual server workload; and the end-to-end throughput
> completely misses what most server workloads will actually care about,
> like latency.
> 
> Secondly, what you're testing isn't the performance of a single
> workload on an empty system; you're testing how workloads *interact*.
> If you want ideal throughput for a single workload on an empty system,
> use the null scheduler; more complex schedulers are only necessary
> when multiple different workloads interact.

I should have clarified I meant `make -jNN`. But still, that's the same
workload on multiple vCPUs.

> FWIW this was my first stab at trying to be systematic about testing
> the scheduler:
> 
> https://github.com/gwd/schedbench
> 
> The rump kernel project has basically died AFAIK, so anyone trying to
> resurrect this would probably have to try to rebase that bit of it
> against something like XTF or unikernels.
> 
>  -George

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: Sketch of an idea for handling the "mixed workload" problem
  2024-01-22 12:50         ` Marek Marczykowski-Górecki
@ 2024-01-22 13:02           ` George Dunlap
  2024-01-22 13:03             ` George Dunlap
  0 siblings, 1 reply; 12+ messages in thread
From: George Dunlap @ 2024-01-22 13:02 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Demi Marie Obenour, Xen-devel, Juergen Gross

On Mon, Jan 22, 2024 at 12:50 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
>
> On Mon, Jan 22, 2024 at 12:25:58PM +0000, George Dunlap wrote:
> > On Mon, Jan 22, 2024 at 12:17 PM Marek Marczykowski-Górecki
> > <marmarek@invisiblethingslab.com> wrote:
> > >
> > > On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> > > > The other issue I have with this (and essentially where I got stuck
> > > > developing credit2 in the first place) is testing: how do you ensure
> > > > that it has the properties that you expect?
> > >
> > > > Audio is actually quite a nice use case for this, since it's quite
> > > > sensitive to scheduling jitter. I think even a simple "PCI passthrough a
> > > > sound card and play/record something" should show results. In particular,
> > > > you can measure how hard you can push the system (for example artificial
> > > > load in other domains) until it breaks.
> >
> > Are we going to have a gitlab runner which says, "Marek sits in front of
> > his test machine and listens to audio for pops"? :-)
>
> Kinda ;)
> We already have audio tests in Qubes CI. They do more or less the above,
> but using our audio virtualization. Play something, record in another
> domain, and compare. Running the very same thing in gitlab-ci may be too
> complicated (it requires bringing in some Qubes infrastructure to make PV
> audio work), but maybe a similar test can be done based on qemu-emulated
> audio or another PV audio solution?
>
> > > > How do you develop a
> > > > "regression test" to make sure that server-based workloads don't have
> > > > issues in this sort of case?
> > >
> > > For this I believe there are several benchmarking methods already,
> > > starting with old trusty "Linux kernel build time".
> >
> > First of all, AFAICT "Linux kernel build time" is not representative
> > of almost any actual server workload; and the end-to-end throughput
> > completely misses what most server workloads will actually care about,
> > like latency.
> >
> > Secondly, what you're testing isn't the performance of a single
> > workload on an empty system; you're testing how workloads *interact*.
> > If you want ideal throughput for a single workload on an empty system,
> > use the null scheduler; more complex schedulers are only necessary
> > when multiple different workloads interact.
>
> I should have clarified I meant `make -jNN`. But still, that's the same
> workload on multiple vCPUs.

See, you're still not getting it. :-)

What you need is not multiple vcpus across a single VM, but multiple
instances of different workloads across different VMs.  For example:

1. One VM running kernbench
2. two VMs running kernbench, but not competing for vcpu
3. four VMs running kernbench, competing for vcpus
4. three VMs running kernbench, and one playing audio
5. four VMs running kernbench, one of which is *also* playing audio

And then you have to collect several metrics:

1. Total kernbench throughput of entire system
2. Kernbench performance of each VM, compared with expected "fair share"
3. Some measure of latency for the audio VM

And figure out how to compare trade-offs -- how much total throughput
hit should we tolerate to increase fairness?  How much fairness hit
should we take to decrease latency?

And as I said, kernbench isn't really a great server workload; you
should do something request-based, measuring both throughput and
latency.
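
As a sketch of what collecting those metrics might look like (Python for
brevity; fair_share_report and percentile are hypothetical helper names,
and the inputs are assumed to have been gathered from the VMs by some
harness that doesn't exist yet):

```python
def fair_share_report(throughput, weights):
    """Return (total throughput, {vm: achieved / expected fair share}).

    throughput: work completed per VM (e.g. builds per hour).
    weights:    scheduler weight per VM; equal weights mean equal shares.
    A ratio of 1.0 means the VM got exactly its weighted fair share.
    """
    total = sum(throughput.values())
    total_weight = sum(weights.values())
    report = {}
    for vm, got in throughput.items():
        expected = total * weights[vm] / total_weight
        report[vm] = got / expected
    return total, report


def percentile(samples, pct):
    """Simple index-based percentile of latency samples (pct in 0..100)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, (len(ordered) * pct) // 100)
    return ordered[idx]
```

Tail percentiles of wakeup latency for the audio VM are probably more
telling than the mean, since a single late wakeup is what produces an
audible pop.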

 -George


* Re: Sketch of an idea for handling the "mixed workload" problem
  2024-01-22 13:02           ` George Dunlap
@ 2024-01-22 13:03             ` George Dunlap
  0 siblings, 0 replies; 12+ messages in thread
From: George Dunlap @ 2024-01-22 13:03 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Demi Marie Obenour, Xen-devel, Juergen Gross

On Mon, Jan 22, 2024 at 1:02 PM George Dunlap <george.dunlap@cloud.com> wrote:
> 2. two VMs running kernbench, but not competing for vcpu
> 3. four VMs running kernbench, competing for vcpus

Sorry, this should be competing for *P*cpus

 -George


* Re: Sketch of an idea for handling the "mixed workload" problem
  2024-01-22 11:54   ` George Dunlap
  2024-01-22 12:17     ` Marek Marczykowski-Górecki
@ 2024-01-23 16:58     ` Demi Marie Obenour
  1 sibling, 0 replies; 12+ messages in thread
From: Demi Marie Obenour @ 2024-01-23 16:58 UTC (permalink / raw)
  To: George Dunlap; +Cc: Xen-devel, Juergen Gross, Marek Marczykowski-Górecki

On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> On Mon, Jan 22, 2024 at 12:31 AM Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > > if everyone has the same weight
> > >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > > weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > > credit is allowed to run
> > >
> > > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > > reset: everyone gets another 10ms, and can carry over at most 2ms of
> > > credit over the reset.
> > >
> > > Generally speaking, vcpus that use less than their quota and have lots
> > > of interrupts are scheduled immediately, since when they wake up they
> > > always have more credit than the vcpus who are burning through their
> > > slices.
> > >
> > > But what about a situation as described recently on Matrix, where a VM
> > > uses a non-negligible amount of cpu doing un-accelerated encryption
> > > and decryption, which can be delayed by a few ms, as well as handling
> > > audio events?  How can we make sure that:
> > >
> > > 1. We can run whenever interrupts happen
> > > 2. We get no more than our fair share of the cpu?
> > >
> > > The counter-intuitive key here is that in order to achieve the above,
> > > you need to *deschedule or preempt early*, so that when the interrupt
> > > comes, you have spare credit to run the interrupt handler.  How do we
> > > manage that?
> > >
> > > The idea I'm working out comes from a phrase I used in the Matrix
> > > discussion, about a vcpu that "foolishly burned all its credits".
> > > Naturally the thing you want to do to have credits available is to
> > > save them up.
> > >
> > > So the idea would be this.  Each vcpu would have a "boost credit
> > > ratio" and a "default boost interval"; there would be sensible
> > > defaults based on typical workloads, but these could be tweaked for
> > > individual VMs.
> > >
> > > When credit is assigned, all VMs would get the same amount of credit,
> > > but divided into two "buckets", according to the boost credit ratio.
> > >
> > > Under certain conditions, a vcpu would be considered "boosted"; this
> > > state would last either until the default boost interval, or until
> > > some other event (such as a de-boost yield).
> > >
> > > The queue would be sorted thus:
> > >
> > > * Boosted vcpus, by boost credit available
> > > * Non-boosted vcpus, by non-boost credit available
> > >
> > > Getting more boost credit means having lower priority when not
> > > boosted; and burning through your boost credit means not being
> > > scheduled when you need to be.
> > >
> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu interrupts are
> > > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > > one)
> > > * Xen is about to preempt, but finds that the (currently out-of-tree)
> > > "dont_desched" bit has been set in the shared memory area
> >
> > I think both of these would be good.  Another one would be when Xen is
> > about to deliver an interrupt to a guest, provided that there is no
> > storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
> > spike through what I presume is an interrupt storm, and I suspect that
> > others have observed similar behavior with USB external drives.
> 
> How would you determine that a given interrupt was part of a "storm",
> and what would you do differently as a result of determining that?

I’m not sure.  One heuristic might be that if a device assigned to a VM
is interrupting Xen too many times while Xen is running other VMs,
interrupts from that device are blocked as needed to ensure other VMs
get to execute.  Theoretically, an interrupt from a USB storage device
should be safe to block until Xen is no longer running boosted
workloads, but an interrupt from a USB microphone or speaker is not.
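
A very rough sketch of the "too many times" part of that heuristic, as a
sliding-window counter (Python for brevity; StormDetector,
max_interrupts and window_us are all hypothetical names, and sensible
thresholds would need measurement):

```python
from collections import deque


class StormDetector:
    """Per-device sliding-window interrupt counter (illustrative only)."""

    def __init__(self, max_interrupts, window_us):
        self.max_interrupts = max_interrupts  # budget per window
        self.window_us = window_us            # window length in microseconds
        self.stamps = deque()                 # timestamps of recent interrupts

    def record(self, now_us):
        """Record an interrupt; return True while the device is within budget."""
        # Drop timestamps that have fallen out of the window.
        while self.stamps and now_us - self.stamps[0] >= self.window_us:
            self.stamps.popleft()
        self.stamps.append(now_us)
        return len(self.stamps) <= self.max_interrupts
```

When record() returns False, the interrupt would not boost the target
vCPU, and for devices where deferral is safe it could be held back until
no boosted vCPU is running.  Deciding which devices are safe to defer is
the part this sketch does not address.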

> > > Other ways to consider de-boosting:
> > > * There's a way to trigger a VMEXIT when interrupts have been
> > > re-enabled; setting this up when the VM is in the boost state
> >
> > That’s a good idea, but should be conditional on “dont_desched” _not_
> > being set.  This handles the case where the guest is running a realtime
> > thread.
> 
> In which case we need some way for the "enlightened" guest to know how
> to de-boost itself; a yield might do.

That would be sufficient.

> > Generally, I’d like to see something like this:
> >
> > - A vCPU with sufficient boost credit is boosted by Xen under the
> >   following conditions:
> >
> >   1. Xen interrupts the guest.
> 
> I take it you mean, "delivers an interrupt to the guest"?

Yes.

> >   2. Xen is about to preempt, but detects that “dont_desched” is set.
> >   3. Xen is about to preempt, but detects that interrupts are disabled.
> >
> > - A vCPU is deboosted if:
> >
> >   1. It runs out of boost credit, even if “dont_desched” is set.
> >   2. An interrupt handler returns, but only if “dont_desched” is not set.
> >   3. Interrupts are re-enabled, but only if “dont_desched” is not set.
> >
> >   The first case is an abnormal condition and typically means that
> >   either the system is overloaded or a vCPU is running boosted for too
> >   long.  To help debug this situation, Xen will log a warning and
> >   increment both a system-wide and a per-domain counter.  dom0 can
> >   retrieve counters for any domain, and a domain can read its own
> >   counter.
> >
> > - When to set “dont_desched” is entirely up to the guest kernel, but
> >   there are some general rules guests should follow:
> >
> >   - Only set “dont_desched” if there is a good reason, and unset it as
> >     soon as possible.  Xen gives vCPUs with “dont_desched” set priority
> >     over all other vCPUs on the system, but the amount of time a vCPU is
> >     allowed to run with an elevated priority is limited.  Xen will log a
> >     warning if a guest tries to run with elevated priority for too long.
> >
> >   - Xen boosts vCPUs before delivering an interrupt, but there should be
> >     a way for a vCPU to deboost itself even before returning from the
> >     interrupt handler.
> >
> >   - Guests should always set “dont_desched” when running hard-realtime
> >     threads (used for e.g. audio processing), even when the thread is in
> >     userspace.  This ensures that Xen gives the underlying vCPU priority
> >     over vCPUs
> >
> >   - Guests should always set “dont_desched” when holding a spin lock,
> >     but it is even better to use paravirtualized spin locks (which make
> >     a hypercall into Xen and therefore allow other vCPUs to run).
> >
> >   - Xen does not implement priority inheritance, so guests need to do
> >     that.
> >
> > - Max boost credits can be set by dom0 via a hypercall.
> >
> > The advantage of this approach is that it keeps almost all policy out of
> > Xen.  The only exception is the boosting when an interrupt is received,
> > but a well-behaved guest will deboost itself very quickly (by enabling
> > interrupts) if the boost was not actually needed, so this should have
> > very limited impact.  I think this should be enough for realtime audio,
> > and it is somewhat related to (but hopefully simpler than) the KVM RFC
> > from Google [1].
> >
> > Any thoughts on this?
> 
> Overall sounds good.  I think a good approach would be to start by
> implementing it without the "dont_desched" flag, and then add that on
> top later.  It sounds like you have a clear vision for what you want,
> so it shouldn't be too hard to write such that adding the
> "dont_desched" doesn't require a lot of pointless refactoring.
> 
> The other issue I have with this (and essentially where I got stuck
> developing credit2 in the first place) is testing: how do you ensure
> that it has the properties that you expect?  How do you develop a
> "regression test" to make sure that server-based workloads don't have
> issues in this sort of case?

I don’t have any server workloads myself.  Would it be reasonable to ask
those who do have such workloads to develop such a test?  They would be
in a much better position to check for regressions on these workloads,
and have server hardware that they can use to benchmark such workloads.
I just have my laptop and a test laptop, both running Qubes OS.

It’s also possible that some of these changes will improve latency at
the expense of throughput.  In that case, I could add a Xen command-line
option (or even a runtime toggle) that controls whether Xen honors the
boost state.  I do expect that the rest of the logic should have very
little overhead in this case.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


Thread overview: 12+ messages
2023-09-29 16:42 Sketch of an idea for handling the "mixed workload" problem George Dunlap
2023-09-30 23:28 ` Demi Marie Obenour
2023-10-02 11:20   ` George Dunlap
2024-01-21 23:46     ` Demi Marie Obenour
2024-01-22  0:31 ` Demi Marie Obenour
2024-01-22 11:54   ` George Dunlap
2024-01-22 12:17     ` Marek Marczykowski-Górecki
2024-01-22 12:25       ` George Dunlap
2024-01-22 12:50         ` Marek Marczykowski-Górecki
2024-01-22 13:02           ` George Dunlap
2024-01-22 13:03             ` George Dunlap
2024-01-23 16:58     ` Demi Marie Obenour
