From: "Sieber, Fernand" <sieberf@amazon.com>
To: "seanjc@google.com" <seanjc@google.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"x86@kernel.org" <x86@kernel.org>,
"peterz@infradead.org" <peterz@infradead.org>,
"mingo@redhat.com" <mingo@redhat.com>,
"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"nh-open-source@amazon.com" <nh-open-source@amazon.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
Date: Thu, 27 Feb 2025 07:20:11 +0000 [thread overview]
Message-ID: <f114eb3a8a21e1cd1a120db32258340504464458.camel@amazon.com> (raw)
In-Reply-To: <Z7-A76KjcYB8HAP8@google.com>
On Wed, 2025-02-26 at 13:00 -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Fernand Sieber wrote:
> > On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > > In this RFC we introduce the concept of guest halted time to
> > > > address these concerns. Guest halted time (gtime_halted)
> > > > accounts for cycles spent in guest mode while the CPU is
> > > > halted. gtime_halted relies on sampling the MPERF MSR (x86)
> > > > around VM enter/exits to compute the number of unhalted
> > > > cycles; halted cycles are then derived from the TSC delta
> > > > minus the MPERF delta.
> > >
> > > IMO, there are better ways to solve this than having KVM sample
> > > MPERF on every entry and exit.
> > >
> > > The kernel already samples APERF/MPERF on every tick and
> > > provides that information via /proc/cpuinfo, just use that. If
> > > your userspace is unable to use /proc/cpuinfo or similar, that
> > > needs to be explained.
> >
> > If I understand correctly, you are suggesting that userspace
> > regularly sample these values to detect the most idle CPUs and
> > then use CPU affinity to repin housekeeping tasks to them. While
> > possible, this essentially requires implementing another
> > scheduling layer in userspace through constant re-pinning of
> > tasks. It also requires constantly identifying the full set of
> > tasks that can induce undesirable overhead so that they can be
> > pinned accordingly. For these reasons we would rather have the
> > logic implemented directly in the scheduler.
> >
> > > And if you're running vCPUs on tickless CPUs, and you're doing
> > > HLT/MWAIT passthrough, *and* you want to schedule other tasks
> > > on those CPUs, then IMO you're abusing all of those things and
> > > it's not KVM's problem to solve, especially now that sched_ext
> > > is a thing.
> >
> > We are running vCPUs with ticks; the rest of your observations
> > are correct.
>
> If there's a host tick, why do you need KVM's help to make
> scheduling decisions?  It sounds like what you want is a scheduler
> that is primarily driven by MPERF (and APERF?), and sched_tick() =>
> arch_scale_freq_tick() already knows about MPERF.
Having the measurement around VM enter/exit makes it easy to attribute
the unhalted cycles to a specific task (vCPU), which solves both of
our use cases, VM metrics and scheduling. That said, we may be able to
avoid it and achieve the same results.
i.e.:
* the VM metrics use case can be solved by reading /proc/cpuinfo from
  userspace;
* for the scheduling use case, the tick-based sampling of MPERF means
  we could introduce a correcting factor on the PELT accounting of
  pinned vCPU tasks based on its value (similar to what I do in the
  last patch of the series).
The combination of these would remove the need to add any logic around
VM enter/exit to support our use cases. I'm happy to prototype that if
we think it's going in the right direction?
Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07
Thread overview: 12+ messages
2025-02-18 20:26 [RFC PATCH 0/3] kvm,sched: Add gtime halted Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 1/3] fs/proc: Add gtime halted to proc/<pid>/stat Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 2/3] kvm/x86: Add support for gtime halted Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware Fernand Sieber
2025-02-27 7:34 ` Vincent Guittot
2025-02-27 8:27 ` [RFC PATCH 3/3] sched, x86: " Sieber, Fernand
2025-02-27 9:03 ` Vincent Guittot
2025-02-26 2:17 ` [RFC PATCH 0/3] kvm,sched: Add gtime halted Sean Christopherson
2025-02-26 20:27 ` Sieber, Fernand
2025-02-26 21:00 ` Sean Christopherson
2025-02-27 7:20 ` Sieber, Fernand [this message]
2025-02-27 14:39 ` Sean Christopherson