List-Id: Xen developer discussion
Date: Tue, 23 Jan 2024 11:58:07 -0500
From: Demi Marie Obenour
To: George Dunlap
Cc: Xen-devel, Juergen Gross, Marek Marczykowski-Górecki
Subject: Re: Sketch of an idea for handling the "mixed workload" problem

On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> On Mon, Jan 22, 2024 at 12:31 AM Demi Marie Obenour wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > >    if everyone has the same weight
> > >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > >    weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > >    credit is allowed to run
> > >
> > > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > >    reset: everyone gets another 10ms, and can carry over at most 2ms of
> > >    credit over the reset.
> > >
> > > Generally speaking, vcpus that use less than their quota and have lots
> > > of interrupts are scheduled immediately, since when they wake up they
> > > always have more credit than the vcpus who are burning through their
> > > slices.
> > >
> > > But what about a situation as described recently on Matrix, where a VM
> > > uses a non-negligible amount of cpu doing un-accelerated encryption
> > > and decryption, which can be delayed by a few MS, as well as handling
> > > audio events? How can we make sure that:
> > >
> > > 1. We can run whenever interrupts happen
> > > 2. We get no more than our fair share of the cpu?
> > >
> > > The counter-intuitive key here is that in order to achieve the above,
> > > you need to *deschedule or preempt early*, so that when the interrupt
> > > comes, you have spare credit to run the interrupt handler. How do we
> > > manage that?
> > >
> > > The idea I'm working out comes from a phrase I used in the Matrix
> > > discussion, about a vcpu that "foolishly burned all its credits".
> > > Naturally the thing you want to do to have credits available is to
> > > save them up.
> > >
> > > So the idea would be this. Each vcpu would have a "boost credit
> > > ratio" and a "default boost interval"; there would be sensible
> > > defaults based on typical workloads, but these could be tweaked for
> > > individual VMs.
> > >
> > > When credit is assigned, all VMs would get the same amount of credit,
> > > but divided into two "buckets", according to the boost credit ratio.
> > >
> > > Under certain conditions, a vcpu would be considered "boosted"; this
> > > state would last either until the default boost interval, or until
> > > some other event (such as a de-boost yield).
> > >
> > > The queue would be sorted thus:
> > >
> > > * Boosted vcpus, by boost credit available
> > > * Non-boosted vcpus, by non-boost credit available
> > >
> > > Getting more boost credit means having lower priority when not
> > > boosted; and burning through your boost credit means not being
> > > scheduled when you need to be.
> > >
> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu interrupts are
> > >   blocked (this sort of overlaps with the "when we deliver an interrupt"
> > >   one)
> > > * Xen is about to preempt, but finds that the (currently out-of-tree)
> > >   "dont_desched" bit has been set in the shared memory area
> >
> > I think both of these would be good. Another one would be when Xen is
> > about to deliver an interrupt to a guest, provided that there is no
> > storm of interrupts. I’ve seen a USB webcam cause a system-wide latency
> > spike through what I presume is an interrupt storm, and I suspect that
> > others have observed similar behavior with USB external drives.
>
> How would you determine that a given interrupt was part of a "storm",
> and what would you do differently as a result of determining that?

I’m not sure. One heuristic might be that if a device assigned to a VM
is interrupting Xen too many times while Xen is running other VMs,
interrupts from that device are blocked as needed to ensure other VMs
get to execute. Theoretically, an interrupt from a USB storage device
should be safe to block until Xen is no longer running boosted
workloads, but an interrupt from a USB microphone or speaker is not.
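
To make that heuristic a bit more concrete, here is a rough sketch in
Python (a model only, not proposed Xen code). It counts interrupts from
one device that arrive while some other VM is running, over a sliding
time window, and says to block the device once the count exceeds a
threshold. The class name, the window length, the threshold, and the
latency_sensitive flag are all assumptions made up for illustration;
how Xen would learn that a device is latency-sensitive is left open.

```python
from collections import deque


class StormDetector:
    """Per-device sliding-window interrupt counter (hypothetical)."""

    def __init__(self, window_us=1000, max_irqs=50, latency_sensitive=False):
        self.window_us = window_us            # assumed window length
        self.max_irqs = max_irqs              # assumed storm threshold
        self.latency_sensitive = latency_sensitive  # e.g. USB audio
        self.stamps = deque()                 # times of counted interrupts

    def should_block(self, now_us, owner_running):
        """Record one interrupt; return True if it should be blocked."""
        if owner_running:
            # Only interrupts that steal time from *other* VMs count.
            return False
        self.stamps.append(now_us)
        # Drop timestamps that have fallen out of the window.
        while self.stamps and now_us - self.stamps[0] > self.window_us:
            self.stamps.popleft()
        storming = len(self.stamps) > self.max_irqs
        # Never block devices marked latency-sensitive (microphone,
        # speaker); storage-like devices can wait.
        return storming and not self.latency_sensitive
```

A storage device would be created with the defaults, a microphone with
latency_sensitive=True, and should_block() would be consulted on each
interrupt from that device.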

> > > Other ways to consider de-boosting:
> > > * There's a way to trigger a VMEXIT when interrupts have been
> > >   re-enabled; setting this up when the VM is in the boost state
> >
> > That’s a good idea, but should be conditional on “dont_desched” _not_
> > being set. This handles the case where the guest is running a realtime
> > thread.
>
> In which case we need some way for the "enlightened" guest to know how
> to de-boost itself; a yield might do.

That would be sufficient.

> > Generally, I’d like to see something like this:
> >
> > - A vCPU with sufficient boost credit is boosted by Xen under the
> >   following conditions:
> >
> >   1. Xen interrupts the guest.
>
> I take it you mean, "delivers an interrupt to the guest"?

Yes.

> >   2. Xen is about to preempt, but detects that “dont_desched” is set.
> >   3. Xen is about to preempt, but detects that interrupts are disabled.
> >
> > - A vCPU is deboosted if:
> >
> >   1. It runs out of boost credit, even if “dont_desched” is set.
> >   2. An interrupt handler returns, but only if “dont_desched” is not set.
> >   3. Interrupts are re-enabled, but only if “dont_desched” is not set.
> >
> >   The first case is an abnormal condition and typically means that
> >   either the system is overloaded or a vCPU is running boosted for too
> >   long. To help debug this situation, Xen will log a warning and
> >   increment both a system-wide and a per-domain counter. dom0 can
> >   retrieve counters for any domain, and a domain can read its own
> >   counter.
> >
> > - When to set “dont_desched” is entirely up to the guest kernel, but
> >   there are some general rules guests should follow:
> >
> >   - Only set “dont_desched” if there is a good reason, and unset it as
> >     soon as possible. Xen gives vCPUs with “dont_desched” set priority
> >     over all other vCPUs on the system, but the amount of time a vCPU is
> >     allowed to run with an elevated priority is limited. Xen will log a
> >     warning if a guest tries to run with elevated priority for too long.
> >
> >   - Xen boosts vCPUs before delivering an interrupt, but there should be
> >     a way for a vCPU to deboost itself even before returning from the
> >     interrupt handler.
> >
> >   - Guests should always set “dont_desched” when running hard-realtime
> >     threads (used for e.g. audio processing), even when the thread is in
> >     userspace. This ensures that Xen gives the underlying vCPU priority
> >     over vCPUs
> >
> >   - Guests should always set “dont_desched” when holding a spin lock,
> >     but it is even better to use paravirtualized spin locks (which make
> >     a hypercall into Xen and therefore allow other vCPUs to run).
> >
> >   - Xen does not implement priority inheritance, so guests need to do
> >     that.
> >
> > - Max boost credits can be set by dom0 via a hypercall.
> >
> > The advantage of this approach is that it keeps almost all policy out of
> > Xen. The only exception is the boosting when an interrupt is received,
> > but a well-behaved guest will deboost itself very quickly (by enabling
> > interrupts) if the boost was not actually needed, so this should have
> > very limited impact. I think this should be enough for realtime audio,
> > and it is somewhat related to (but hopefully simpler than) the KVM RFC
> > from Google [1].
> >
> > Any thoughts on this?
>
> Overall sounds good. I think a good approach would be to start by
> implementing it without the "dont_desched" flag, and then add that on
> top later. It sounds like you have a clear vision for what you want,
> so it shouldn't be too hard to write such that adding the
> "dont_desched" doesn't require a lot of pointless refactoring.
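
As a reference point for that layering, the boost and deboost rules
listed above, plus the proposed queue ordering, fit in a small state
model. The following is an illustrative Python sketch, not Xen code:
the event names, the BoostVcpu fields, and the reading of "sufficient
boost credit" as "more than zero" are all assumptions. Leaving
dont_desched at False gives the behaviour of a first implementation
without the flag.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    IRQ_DELIVERED = auto()          # Xen delivers an interrupt to the guest
    PREEMPT_PENDING = auto()        # Xen is about to preempt the vCPU
    IRQ_HANDLER_RETURNED = auto()
    IRQS_REENABLED = auto()
    CREDIT_ACCOUNTED = auto()       # periodic credit accounting pass


@dataclass
class BoostVcpu:
    boost_credit: int
    normal_credit: int
    boosted: bool = False
    dont_desched: bool = False
    irqs_disabled: bool = False
    overrun_count: int = 0          # stands in for the proposed counters


def update_boost(v: BoostVcpu, ev: Event) -> None:
    """Apply the boost/deboost rules for one event."""
    if v.boost_credit <= 0:
        # Deboost rule 1: out of boost credit, even with dont_desched set.
        if v.boosted:
            v.overrun_count += 1    # abnormal: Xen would also log a warning
        v.boosted = False
    elif ev is Event.IRQ_DELIVERED:
        v.boosted = True
    elif ev is Event.PREEMPT_PENDING:
        if v.dont_desched or v.irqs_disabled:
            v.boosted = True
    elif ev in (Event.IRQ_HANDLER_RETURNED, Event.IRQS_REENABLED):
        if not v.dont_desched:
            v.boosted = False


def queue_key(v: BoostVcpu) -> tuple[int, int]:
    """Sort key: boosted vCPUs first by boost credit, then the rest
    by non-boost credit (highest credit first in both groups)."""
    if v.boosted:
        return (0, -v.boost_credit)
    return (1, -v.normal_credit)
```

Sorting a runqueue with sorted(vcpus, key=queue_key) then yields the
order George described: boosted vCPUs by boost credit, followed by
non-boosted vCPUs by non-boost credit.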
>
> The other issue I have with this (and essentially where I got stuck
> developing credit2 in the first place) is testing: how do you ensure
> that it has the properties that you expect? How do you develop a
> "regression test" to make sure that server-based workloads don't have
> issues in this sort of case?

I don’t have any server workloads myself. Would it be reasonable to ask
those who do have such workloads to develop such a test? They would be
in a much better position to check for regressions on these workloads,
and have server hardware that they can use to benchmark such workloads.
I just have my laptop and a test laptop, both running Qubes OS.

It’s also possible that some of these changes will improve latency at
the expense of throughput. In that case, I could add a Xen command-line
option (or even a runtime toggle) that controls whether Xen honors the
boost state. I do expect that the rest of the logic should have very
little overhead in this case.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab