From: Sean Christopherson <seanjc@google.com>
To: Vineeth Remanan Pillai <vineeth@bitbyteword.org>
Cc: Ben Segall <bsegall@google.com>, Borislav Petkov <bp@alien8.de>,
	 Daniel Bristot de Oliveira <bristot@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	 Dietmar Eggemann <dietmar.eggemann@arm.com>,
	"H . Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
	 Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>,
	 Paolo Bonzini <pbonzini@redhat.com>,
	Andy Lutomirski <luto@kernel.org>,
	 Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	 Thomas Gleixner <tglx@linutronix.de>,
	Valentin Schneider <vschneid@redhat.com>,
	 Vincent Guittot <vincent.guittot@linaro.org>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	 Wanpeng Li <wanpengli@tencent.com>,
	Suleiman Souhlal <suleiman@google.com>,
	 Masami Hiramatsu <mhiramat@google.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	 x86@kernel.org, Tejun Heo <tj@kernel.org>,
	Josh Don <joshdon@google.com>,  Barret Rhoden <brho@google.com>,
	David Vernet <dvernet@meta.com>,
	 Joel Fernandes <joel@joelfernandes.org>
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
Date: Fri, 15 Dec 2023 08:56:57 -0800
Message-ID: <ZXyFWTSU3KRk7EtQ@google.com>
In-Reply-To: <CAO7JXPik9eMgef6amjCk5JPeEhg66ghDXowWQESBrd_fAaEsCA@mail.gmail.com>

On Fri, Dec 15, 2023, Vineeth Remanan Pillai wrote:
> > > >
> > > I get your point. A generic way would have been preferable, but I
> > > feel the scenario we are tackling is a bit more time critical and kvm
> > > is better equipped to handle this. kvm has control over the VM/vcpu
> > > execution and hence it can take action in the most effective way.
> >
> > No, KVM most definitely does not.  Between sched, KVM, and userspace, I would
> > rank KVM a very distant third.  Userspace controls when to do KVM_RUN, to which
> > cgroup(s) a vCPU task is assigned, the affinity of the task, etc.  sched decides
> > when and where to run a vCPU task based on input from userspace.
> >
> > Only in some edge cases that are largely unique to overcommitted CPUs does KVM
> > have any input on scheduling whatsoever.   And even then, KVM's view is largely
> > limited to a single VM, e.g. teaching KVM to yield to a vCPU running in a different
> > VM would be interesting, to say the least.
> >
> The overcommitted case is exactly what we are trying to tackle.

Yes, I know.  I was objecting to the assertion that "kvm has control over the
VM/vcpu execution and hence it can take action in the most effective way".  In
overcommit use cases, KVM has some *influence*, and in non-overcommit use cases,
KVM is essentially not in the picture at all.

> Sorry for not making this clear in the cover letter. ChromeOS runs on low-end
> devices (e.g. 2C/2T CPUs) and does not have enough compute capacity to
> offload scheduling decisions. In-band scheduling decisions gave the
> best results.
> 
> > > One example is the place where we handle boost/unboost. By the time
> > > you come out of kvm to userspace it would be too late.
> >
> > Making scheduling decisions in userspace doesn't require KVM to exit to userspace.
> > It doesn't even need to require a VM-Exit to KVM.  E.g. if the scheduler (whether
> > it's in kernel or userspace) is running on a different logical CPU(s), then there's
> > no need to trigger a VM-Exit because the scheduler can incorporate information
> > about a vCPU in real time, and interrupt the vCPU if and only if something else
> > needs to run on that associated CPU.  From the sched_ext cover letter:
> >
> >  : Google has also experimented with some promising, novel scheduling policies.
> >  : One example is “central” scheduling, wherein a single CPU makes all
> >  : scheduling decisions for the entire system. This allows most cores on the
> >  : system to be fully dedicated to running workloads, and can have significant
> >  : performance improvements for certain use cases. For example, central
> >  : scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
> >  : instead delegating the responsibility of preemption checks from the tick to
> >  : a single CPU. See scx_central.bpf.c for a simple example of a central
> >  : scheduling policy built in sched_ext.
> >
> This makes sense when the host has enough compute resources for
> offloading scheduling decisions.

Yeah, again, I know.  The point I am trying to get across is that this RFC only
benefits/handles one use case, and doesn't have line of sight to being extensible
to other use cases.

> > > As you mentioned, custom contract between guest and host userspace is
> > > really flexible, but I believe tackling scheduling (especially latency)
> > > issues is a bit more difficult with generic approaches. Here kvm does
> > > have some information known only to kvm (which could be shared, e.g.
> > > interrupt injection) but more importantly kvm has some unique
> > > capabilities when it comes to scheduling. kvm and scheduler are
> > > cooperating currently for various cases like steal time accounting,
> > > vcpu preemption state, spinlock handling etc. We could possibly try to
> > > extend it a little further in a non-intrusive way.
> >
> > I'm not too worried about the code being intrusive, I'm worried about the
> > maintainability, longevity, and applicability of this approach.
> >
> > IMO, this has a significantly lower ceiling than what is possible with something
> > like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
> > because it'd require a kernel-defined ABI, would essentially be limited to knobs
> > that are broadly useful.  I.e. every bit of information that you want to add to
> > the guest/host ABI will need to get approval from at least the affected subsystems
> > in the guest, from KVM, and possibly from the host scheduler too.  That's going
> > to make for a very high bar.
> >
> Just thinking out loud: the ABI could be very simple to start with. A
> shared page with dedicated guest and host areas. Guest fills details
> about its priority requirements, host fills details about the actions
> it took (boost/unboost, priority/sched class etc). Passing this
> information could be in-band or out-of-band. Out-of-band could be used
> by dedicated userland schedulers. If both guest and host agree on
> in-band during guest startup, kvm could hand over the data to the
> scheduler using a scheduler callback. I feel this small addition to
> kvm could be maintainable and, by leaving the protocol for interpreting
> shared memory to guest and host, this would be very generic and cater
> to multiple use cases. Something like the above could be used both by
> low-end devices and high-end server-like systems, and guest and host
> could have custom protocols to interpret the data and make decisions.
> 
> In this RFC, we have a miniature form of the above, where we have a
> shared memory area and the scheduler callback is basically
> sched_setscheduler. But it could be made very generic as part of ABI
> design. For out-of-band schedulers, this callback could be set up by
> sched_ext, a userland scheduler, or any similar out-of-band scheduler.
> 
> I agree, getting a consensus and approval is non-trivial. IMHO, this
> use case is compelling for such an ABI because out-of-band schedulers
> might not give the desired results for low-end devices.
> 
> > > Having a formal paravirt scheduling ABI is something we would want to
> > > pursue (as I mentioned in the cover letter) and this could help not
> > > only with latencies, but optimal task placement for efficiency, power
> > > utilization etc. kvm's role could be to set the stage and share
> > > information with minimum delay and less resource overhead.
> >
> > Making KVM a middle-man is most definitely not going to provide minimum delay or
> > overhead.  Minimum delay would be the guest directly communicating with the host
> > scheduler.  I get that convincing the sched folks to add a bunch of paravirt
> > stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
> > sched_ext folks.
> >
> As mentioned above, guest directly talking to host scheduler without
> involving kvm would mean an out-of-band scheduler and the
> effectiveness depends on how fast the scheduler gets to run.

No, the "host scheduler" could very well be a dedicated in-kernel paravirt
scheduler.  It could be a sched_ext BPF program that for all intents and purposes
is in-band.

You are basically proposing that KVM bounce-buffer data between guest and host.
I'm saying there's no _technical_ reason to use a bounce-buffer, just do zero copy.

> In low-end compute devices, that would pose a challenge. In such scenarios, kvm
> seems to be a better option to provide minimum delay and cpu overhead.
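
For concreteness, the "shared page with dedicated guest and host areas" idea
quoted above, combined with the zero-copy point, could look roughly like the
sketch below. This is only an illustration under stated assumptions: every
name in it (pv_sched_page, guest_request, host_poll, the PV_ACTION_* values)
is hypothetical and comes from neither the RFC nor KVM, and plain C11 atomics
stand in for whatever accessors a real guest/host implementation would use.
Both sides operate on the same memory, so nothing is copied in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: none of these names come from the RFC or from KVM. */
enum pv_boost_action {
    PV_ACTION_NONE = 0,
    PV_ACTION_BOOSTED = 1,
    PV_ACTION_UNBOOSTED = 2,
};

/* Area written by the guest and read by the host. */
struct pv_guest_area {
    _Atomic uint32_t seq;          /* bumped after every request update */
    _Atomic uint32_t prio_request; /* 0 = normal, nonzero = wants a boost */
};

/* Area written by the host and read by the guest. */
struct pv_host_area {
    _Atomic uint32_t seen_seq;     /* last guest seq the host acted on */
    _Atomic uint32_t action;       /* enum pv_boost_action taken by host */
};

struct pv_sched_page {
    struct pv_guest_area guest;
    struct pv_host_area host;
};

static_assert(sizeof(struct pv_sched_page) <= 4096,
              "layout is assumed to fit in a single 4 KiB page");

/* Guest side: publish a priority request directly into the shared page. */
static void guest_request(struct pv_sched_page *p, uint32_t prio)
{
    atomic_store_explicit(&p->guest.prio_request, prio,
                          memory_order_relaxed);
    /* Release: the request is visible before the new seq is. */
    atomic_fetch_add_explicit(&p->guest.seq, 1, memory_order_release);
}

/*
 * Host side: read the request in place (no intermediate copy) and record
 * the action taken. Returns 1 if a new request was handled, 0 otherwise.
 */
static int host_poll(struct pv_sched_page *p)
{
    uint32_t seq = atomic_load_explicit(&p->guest.seq,
                                        memory_order_acquire);

    if (seq == atomic_load_explicit(&p->host.seen_seq,
                                    memory_order_relaxed))
        return 0;

    uint32_t prio = atomic_load_explicit(&p->guest.prio_request,
                                         memory_order_relaxed);

    atomic_store_explicit(&p->host.action,
                          prio ? PV_ACTION_BOOSTED : PV_ACTION_UNBOOSTED,
                          memory_order_relaxed);
    atomic_store_explicit(&p->host.seen_seq, seq, memory_order_release);
    return 1;
}
```

Who calls host_poll() (KVM, an in-kernel paravirt scheduler, a sched_ext BPF
program) is exactly the open question in this thread; the sketch deliberately
leaves that out and does not model the actual priority change.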
