public inbox for kvm@vger.kernel.org
From: Sean Christopherson <seanjc@google.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>,
	 Vineeth Remanan Pillai <vineeth@bitbyteword.org>,
	Ben Segall <bsegall@google.com>,  Borislav Petkov <bp@alien8.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	 "H . Peter Anvin" <hpa@zytor.com>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,  Mel Gorman <mgorman@suse.de>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Andy Lutomirski <luto@kernel.org>,
	 Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	 Valentin Schneider <vschneid@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	 Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	 Steven Rostedt <rostedt@goodmis.org>,
	Suleiman Souhlal <suleiman@google.com>,
	 Masami Hiramatsu <mhiramat@kernel.org>,
	himadrics@inria.fr, kvm@vger.kernel.org,
	 linux-kernel@vger.kernel.org, x86@kernel.org, graf@amazon.com,
	 drjunior.org@gmail.com
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
Date: Fri, 12 Jul 2024 09:14:52 -0700
Message-ID: <ZpFWfInsXQdPJC0V@google.com>
In-Reply-To: <01c3e7de-0c1a-45e0-aed6-c11e9fa763df@efficios.com>

On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
> On 2024-07-12 10:48, Sean Christopherson wrote:
> > > > I was looking at the rseq on request from the KVM call, however it does not
> > > > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > > > kernel.  rseq is for userspace to kernel, not VM to kernel.
> > 
> > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > 
> > At that point, the vCPU can read/write the rseq structure directly.
> 
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).

Yeah, exposing the full rseq structure to the guest is probably a terrible idea.
The above isn't intended to be a design; the goal is just to illustrate how an
rseq-like mechanism can be extended to the guest without needing a
virtualization-specific ABI and without needing new KVM functionality.
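
To make the "overlay" idea concrete, here is a rough sketch of how a VMM could
describe such a mapping using the existing KVM_SET_USER_MEMORY_REGION ABI.  The
helper name, slot number and page size are made up for illustration, and the
sketch assumes the VMM has already carved GPA 'x' out of its normal RAM memslot,
since KVM rejects overlapping memslots.

```c
#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: describe a one-page memslot that backs guest physical
 * address 'gpa' with the host page at 'hva' (e.g. the page holding the
 * per-vCPU shared structure).  Nothing here is new KVM functionality.
 */
static struct kvm_userspace_memory_region
overlay_region(uint32_t slot, uint64_t gpa, void *hva, uint64_t page_size)
{
	struct kvm_userspace_memory_region r;

	memset(&r, 0, sizeof(r));
	r.slot = slot;				/* VMM-chosen, otherwise unused slot */
	r.flags = 0;				/* plain read/write mapping */
	r.guest_phys_addr = gpa;		/* GPA 'x' requested by the guest */
	r.memory_size = page_size;		/* a single page */
	r.userspace_addr = (uint64_t)(uintptr_t)hva;	/* host VA 'y' */
	return r;
}

/* The VMM would then issue: ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r); */
```

Only filling in the descriptor is shown; issuing the ioctl needs a VM file
descriptor and is left out on purpose.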

> But this is something we can improve upon once we understand what we
> are trying to achieve.
> 
> > 
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task.  So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> > 
> [...]
> 
> OK, so how about we expose "offsets" tuning the base values ?
> 
> - The task doing KVM_RUN, just like any other task, has its "priority"
>   value as set by setpriority(2).
> 
> - We introduce two new fields in the per-thread struct rseq, which is
>   mapped in the host task doing KVM_RUN and readable from the scheduler:
> 
>   - __s32 prio_offset; /* Priority offset to apply on the current task priority. */
> 
>   - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */

Ideally, there won't be a virtualization-specific structure.  A vCPU-specific
field might make sense (or it might not), but I really want to avoid defining a
structure that is unique to virtualization.  E.g. a userspace doing M:N scheduling
can likely benefit from any capacity hooks/information that would benefit a guest
scheduler.  I.e. rather than a vcpu_sched structure, have a user_sched structure
(or whatever name makes sense), and then have two struct pointers in rseq.

Though I'm skeptical that having two structs in play would be necessary or sane.
E.g. if both userspace and the guest can adjust priority, then they'll need to coordinate
in order to avoid unexpected results.  I can definitely see wanting to let the
userspace VMM bound the priority of a vCPU, but that should be a relatively static
decision, i.e. can be done via syscall or something similarly "slow".
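
As a rough sketch of that split (all names below are hypothetical, nothing like
this exists today): the fast-path request lives in a generic shared structure,
while the bounds are set rarely via some slow path and are simply applied when
the request is read.

```c
#include <stdint.h>

/* Hypothetical generic structure, usable by a vCPU or an M:N userspace scheduler. */
struct user_sched {
	uint32_t len;		/* length of active fields */
	int32_t  prio_offset;	/* requested offset, written without a syscall */
};

/* Hypothetical bounds, assumed to be set rarely, e.g. by the VMM via a syscall. */
struct prio_bounds {
	int32_t min_offset;
	int32_t max_offset;
};

/* Clamp the requested offset into the range the bounding entity allowed. */
static int32_t bounded_prio_offset(const struct user_sched *us,
				   const struct prio_bounds *b)
{
	int32_t req = us->prio_offset;

	if (req < b->min_offset)
		return b->min_offset;
	if (req > b->max_offset)
		return b->max_offset;
	return req;
}
```

With a single writer for the request and a mostly static bound, there is
nothing for userspace and the guest to coordinate on in the fast path.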

>     vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
>     which would be typically NULL except for tasks doing KVM_RUN. This would
>     sit in its own pages per vcpu, which takes care of isolation between guest
>     kernel and host process. Those would be RW by the guest kernel as
>     well and contain e.g.:
> 
>     struct vcpu_sched {
>         __u32 len;  /* Length of active fields. */
> 
>         __s32 prio_offset;
>         __s32 cpu_capacity_offset;
>         [...]
>     };
> 
> So when the host kernel tries to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority, and offset
> by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
> 
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
> 
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
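
For reference, the calculation described above could look something like the
sketch below.  This is only an illustration under several assumptions: the
struct and function names are invented, vcpu_sched is a plain pointer rather
than a __u64, the result is clamped to the nice range used by setpriority(2),
and the sign convention of the offsets is left open.  A real implementation
would also have to read both structures as untrusted user memory.

```c
#include <stddef.h>
#include <stdint.h>

struct vcpu_sched {
	uint32_t len;			/* length of active fields */
	int32_t  prio_offset;
	int32_t  cpu_capacity_offset;
};

/* Stand-in for the two proposed rseq fields; not the real struct rseq. */
struct rseq_sketch {
	int32_t prio_offset;
	const struct vcpu_sched *vcpu_sched;	/* NULL unless doing KVM_RUN */
};

/* A field counts as active only if 'len' covers it entirely. */
#define FIELD_ACTIVE(vs, field) \
	((vs)->len >= offsetof(struct vcpu_sched, field) + sizeof((vs)->field))

static int effective_prio(int base, const struct rseq_sketch *rs)
{
	/* 64-bit intermediate so hostile offsets cannot overflow the sum. */
	int64_t prio = (int64_t)base + rs->prio_offset;
	const struct vcpu_sched *vs = rs->vcpu_sched;

	if (vs && FIELD_ACTIVE(vs, prio_offset))
		prio += vs->prio_offset;

	/* Assumed bound: the nice range, -20..19. */
	if (prio < -20)
		prio = -20;
	if (prio > 19)
		prio = 19;
	return (int)prio;
}
```
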
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 
