public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Steven Rostedt <rostedt@goodmis.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sean Christopherson <seanjc@google.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Vineeth Remanan Pillai <vineeth@bitbyteword.org>,
	Ben Segall <bsegall@google.com>, Borislav Petkov <bp@alien8.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	"H . Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Suleiman Souhlal <suleiman@google.com>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	himadrics@inria.fr, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, x86@kernel.org, graf@amazon.com,
	drjunior.org@gmail.com
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)
Date: Fri, 12 Jul 2024 12:30:19 -0400	[thread overview]
Message-ID: <20240712123019.7e18c67a@rorschach.local.home> (raw)
In-Reply-To: <01c3e7de-0c1a-45e0-aed6-c11e9fa763df@efficios.com>

On Fri, 12 Jul 2024 11:32:30 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >>> I was looking at the rseq on request from the KVM call, however it does not
> >>> make sense to me yet how to expose the rseq area via the Guest VA to the host
> >>> kernel.  rseq is for userspace to kernel, not VM to kernel.  
> > 
> > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > 
> > A that point, the vCPU can read/write the rseq structure directly.  

So basically, the vCPU thread can just create a virtio device that
exposes the rseq memory to the guest kernel?

One other issue we need to worry about is that IIUC rseq memory is
allocated by the guest/user, not the host kernel. This means it can be
swapped out. The code that handles this needs to be able to handle user
page faults.

> 
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).
> But this is something we can improve upon once we understand what we
> are trying to achieve.
> 
> > 
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task.  So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> >   
> [...]
> 
> OK, so how about we expose "offsets" tuning the base values ?
> 
> - The task doing KVM_RUN, just like any other task, has its "priority"
>    value as set by setpriority(2).
> 
> - We introduce two new fields in the per-thread struct rseq, which is
>    mapped in the host task doing KVM_RUN and readable from the scheduler:
> 
>    - __s32 prio_offset; /* Priority offset to apply on the current task priority. */
> 
>    - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */
> 
>      vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
>      which would be typically NULL except for tasks doing KVM_RUN. This would
>      sit in its own pages per vcpu, which takes care of isolation between guest
>      kernel and host process. Those would be RW by the guest kernel as
>      well and contain e.g.:

Hmm, maybe not make this only vcpu specific, but perhaps this can be
useful for user space tasks that want to dynamically change their
priority without a system call. It could do the same thing. Yeah, yeah,
I may be coming up with a solution in search of a problem ;-)

-- Steve

> 
>      struct vcpu_sched {
>          __u32 len;  /* Length of active fields. */
> 
>          __s32 prio_offset;
>          __s32 cpu_capacity_offset;
>          [...]
>      };
> 
> So when the host kernel try to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority, and offset
> by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
> 
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
> 
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 


  parent reply	other threads:[~2024-07-12 16:30 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-03 14:01 [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Pillai (Google)
2024-04-03 14:01 ` [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework Vineeth Pillai (Google)
2024-04-08 13:57   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm Vineeth Pillai (Google)
2024-04-08 13:58   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs Vineeth Pillai (Google)
2024-04-08 13:59   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 4/5] pvsched: bpf support for pvsched Vineeth Pillai (Google)
2024-04-08 14:00   ` Vineeth Remanan Pillai
2024-04-03 14:01 ` [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver Vineeth Pillai (Google)
2024-04-08 14:01   ` Vineeth Remanan Pillai
2024-04-08 13:54 ` [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) Vineeth Remanan Pillai
2024-05-01 15:29 ` Sean Christopherson
2024-05-02 13:42   ` Vineeth Remanan Pillai
2024-06-24 11:01     ` Vineeth Remanan Pillai
2024-07-12 12:57       ` Joel Fernandes
2024-07-12 14:09         ` Mathieu Desnoyers
2024-07-12 14:48           ` Sean Christopherson
2024-07-12 15:32             ` Mathieu Desnoyers
2024-07-12 16:14               ` Sean Christopherson
2024-07-12 16:30               ` Steven Rostedt [this message]
2024-07-12 16:39                 ` Sean Christopherson
2024-07-12 17:02                   ` Steven Rostedt
2024-07-12 16:24           ` Steven Rostedt
2024-07-12 16:44             ` Sean Christopherson
2024-07-12 16:50               ` Joel Fernandes
2024-07-12 17:08                 ` Sean Christopherson
2024-07-12 17:14                   ` Steven Rostedt
2024-07-12 17:12               ` Steven Rostedt
2024-07-16 23:44                 ` Sean Christopherson
2024-07-17  0:13                   ` Steven Rostedt
2024-07-17  5:16                   ` Joel Fernandes
2024-07-17 14:14                     ` Sean Christopherson
2024-07-17 14:36                       ` Steven Rostedt
2024-07-17 14:52                         ` Steven Rostedt
2024-07-17 15:20                           ` Steven Rostedt
2024-07-17 17:03                             ` Suleiman Souhlal
2024-07-17 20:57                             ` Joel Fernandes
2024-07-17 21:00                               ` Steven Rostedt
2024-07-17 21:09                                 ` Joel Fernandes
2024-07-12 16:24           ` Joel Fernandes
2024-07-12 17:28             ` Mathieu Desnoyers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240712123019.7e18c67a@rorschach.local.home \
    --to=rostedt@goodmis.org \
    --cc=bp@alien8.de \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=drjunior.org@gmail.com \
    --cc=graf@amazon.com \
    --cc=himadrics@inria.fr \
    --cc=hpa@zytor.com \
    --cc=joel@joelfernandes.org \
    --cc=juri.lelli@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mgorman@suse.de \
    --cc=mhiramat@kernel.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=seanjc@google.com \
    --cc=suleiman@google.com \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=vineeth@bitbyteword.org \
    --cc=vkuznets@redhat.com \
    --cc=vschneid@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox