From: Sean Christopherson <seanjc@google.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>,
Vineeth Remanan Pillai <vineeth@bitbyteword.org>,
Ben Segall <bsegall@google.com>, Borislav Petkov <bp@alien8.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
"H . Peter Anvin" <hpa@zytor.com>,
Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>,
Paolo Bonzini <pbonzini@redhat.com>,
Andy Lutomirski <luto@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Valentin Schneider <vschneid@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Wanpeng Li <wanpengli@tencent.com>,
Steven Rostedt <rostedt@goodmis.org>,
Suleiman Souhlal <suleiman@google.com>,
Masami Hiramatsu <mhiramat@kernel.org>,
himadrics@inria.fr, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, x86@kernel.org, graf@amazon.com,
drjunior.org@gmail.com
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
Date: Fri, 12 Jul 2024 07:48:10 -0700
Message-ID: <ZpFCKrRKluacu58x@google.com>
In-Reply-To: <19ecf8c8-d5ac-4cfb-a650-cf072ced81ce@efficios.com>

On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
> On 2024-07-12 08:57, Joel Fernandes wrote:
> > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> > > Existing use cases
> > > -------------------------
> > >
> > > - A latency sensitive workload on the guest might need more than one
> > > time slice to complete, but should not block any higher priority task
> > > in the host. In our design, the latency sensitive workload shares its
> > > priority requirements with the host (RT priority, CFS nice value, etc.). The
> > > host implementation of the protocol sets the priority of the vcpu task
> > > accordingly so that the host scheduler can make an educated decision
> > > on the next task to run. This makes sure that host processes and vcpu
> > > tasks compete fairly for the cpu resource.
>
> AFAIU, the information you need to convey to achieve this is the priority
> of the task within the guest. This information needs to reach the host
> scheduler so it can make an informed decision.
>
> One thing that is unclear is the acceptable overhead/latency for pushing
> this information from guest to host. Is a hypercall OK, or does it need
> to be exchanged over a memory mapping shared between guest and host?
>
> Hypercalls provide simple ABIs across guest/host, and they allow
> the guest to immediately notify the host (similar to an interrupt).

Hypercalls have myriad problems. They require a VM-Exit, which largely defeats
the purpose of boosting the vCPU priority for performance reasons. They don't
allow for delegation as there's no way for the hypervisor to know if a hypercall
from guest userspace should be allowed, versus anything memory based where the
ability for guest userspace to access the memory demonstrates permission (else
the guest kernel wouldn't have mapped the memory into userspace).

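
As a rough illustration of the memory-based alternative, the sketch below shows
a guest publishing a priority hint through a shared page and the host sampling
it later. Everything in it is hypothetical: `struct pv_sched_hint`,
`guest_request_prio()` and `host_read_prio()` are invented for this example and
are not part of the patch series or of any existing ABI, and both sides are
modeled as plain C11 code operating on the same memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical layout of a per-vCPU hint page shared between guest and host.
 * The name and the single field are assumptions made for this sketch.
 */
struct pv_sched_hint {
	_Atomic uint32_t requested_prio;	/* written by guest, read by host */
};

/* The hint is meant to live in one shared page (4096 bytes assumed). */
static_assert(sizeof(struct pv_sched_hint) <= 4096, "hint must fit in a page");

/* Guest side: publish a new priority with an ordinary store, no VM-Exit. */
void guest_request_prio(struct pv_sched_hint *hint, uint32_t prio)
{
	atomic_store_explicit(&hint->requested_prio, prio, memory_order_release);
}

/* Host side: sample the hint the next time a scheduling decision is made. */
uint32_t host_read_prio(struct pv_sched_hint *hint)
{
	return atomic_load_explicit(&hint->requested_prio, memory_order_acquire);
}
```

The store itself never traps, which is the point. The trade-off is that the host
only notices the new value the next time it looks at the page, whereas a
hypercall notifies it immediately.
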
> > > Ideas brought up during offlist discussion
> > > -------------------------------------------------------
> > >
> > > 1. rseq based timeslice extension mechanism[1]
> > >
> > > While the rseq based mechanism helps in giving the vcpu task one more
> > > time slice, it will not help in the other use cases. We had a chat
> > > with Steve and the rseq mechanism was mainly for improving lock
> > > contention and would not work best with vcpu boosting considering all
> > > the use cases above. RT or high priority tasks in the VM would often
> > > need more than one time slice to complete their work and, at the same
> > > time, should not hurt the host workloads. The goal for the above use
> > > cases is not requesting an extra slice, but to modify the priority in
> > > such a way that host processes and guest processes get a fair way to
> > > compete for cpu resources. This also means that vcpu task can request
> > > a lower priority when it is running lower priority tasks in the VM.

Then figure out a way to let userspace boost a task's priority without needing a
syscall. vCPUs are not directly schedulable entities; the task doing KVM_RUN
on the vCPU fd is what the scheduler sees. Any scheduling enhancement that
benefits vCPUs by definition can benefit userspace tasks.

> > I was looking at the rseq on request from the KVM call, however it does not
> > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > kernel. rseq is for userspace to kernel, not VM to kernel.

Any memory that is exposed to host userspace can be exposed to the guest. Things
like this are implemented via "overlay" pages, where the guest asks host userspace
to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a
memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
address of the page containing the rseq structure associated with the vCPU (in
pretty much every modern VMM, each vCPU has a dedicated task/thread).

At that point, the vCPU can read/write the rseq structure directly.

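
To make the overlay idea concrete, here is a minimal sketch of the host
userspace side, assuming a VMM that already holds a VM fd and knows the
page-aligned host address of the page containing the vCPU thread's rseq
structure. `KVM_SET_USER_MEMORY_REGION` and `struct kvm_userspace_memory_region`
are the existing KVM uAPI; `OVERLAY_SLOT`, `PAGE_SZ`, `make_overlay_region()` and
`install_overlay()` are names made up for illustration. KVM rejects memslots
that overlap in guest physical address space, so a real VMM would also have to
carve GPA 'x' out of its existing RAM slot first; that step is omitted here.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical values for this sketch: a free slot id and a 4 KiB page. */
#define OVERLAY_SLOT 10u
#define PAGE_SZ      4096u

/*
 * Describe a one-page memslot that maps guest physical address 'gpa' onto
 * host virtual address 'hva'. Both must be page aligned.
 */
struct kvm_userspace_memory_region make_overlay_region(uint64_t gpa, void *hva)
{
	struct kvm_userspace_memory_region region;

	assert((gpa & (PAGE_SZ - 1)) == 0);
	assert(((uintptr_t)hva & (PAGE_SZ - 1)) == 0);

	memset(&region, 0, sizeof(region));
	region.slot            = OVERLAY_SLOT;
	region.guest_phys_addr = gpa;				/* GPA 'x' */
	region.memory_size     = PAGE_SZ;
	region.userspace_addr  = (uint64_t)(uintptr_t)hva;	/* host VA 'y' */
	return region;
}

/* Install the overlay on an existing VM fd; returns 0, or -1 with errno set. */
int install_overlay(int vm_fd, uint64_t gpa, void *hva)
{
	struct kvm_userspace_memory_region region = make_overlay_region(gpa, hva);

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}
```

Once the ioctl succeeds, guest accesses to GPA 'x' are ordinary memory accesses
to the same page the vCPU thread's rseq area lives in on the host.
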
The reason us KVM folks are pushing y'all towards something like rseq is that
(again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
is actually just priority boosting a task. So rather than invent something
virtualization specific, invent a mechanism for priority boosting from userspace
without a syscall, and then extend it to the virtualization use case.

> > Steven Rostedt said as much as well, thoughts? Add Mathieu as well.
>
> I'm not sure that rseq would help at all here, but I think we may want to
> borrow concepts of data sitting in shared memory across privilege levels
> and apply them to VMs.
>
> If some of the ideas end up being useful *outside* of the context of VMs,

Modulo the assertion above that this is about boosting priority instead of
requesting an extended time slice, this is essentially the same thing as the
"delay resched" discussion[*]. The only difference is that the vCPU is in a
critical section, e.g. an IRQ handler, versus the userspace task being in a
critical section.

[*] https://lore.kernel.org/all/20231025054219.1acaa3dd@gandalf.local.home

> then I'd be willing to consider adding fields to rseq. But as long as it is
> VM-specific, I suspect you'd be better with dedicated per-vcpu pages which
> you can safely share across host/guest kernels.