Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

public inbox for kvm@vger.kernel.org

From: Sean Christopherson <seanjc@google.com>
To: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@redhat.com, juri.lelli@redhat.com,
	 vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org,  bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com,  linux-kernel@vger.kernel.org,
	kprateek.nayak@amd.com,  wuyun.abel@bytedance.com,
	youssefesmat@chromium.org, tglx@linutronix.de,  efault@gmx.de,
	kvm@vger.kernel.org
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
Date: Wed, 9 Oct 2024 19:49:54 -0700
Message-ID: <ZwdA0sbA2tJA3IKh@google.com>
In-Reply-To: <5618d029-769a-4690-a581-2df8939f26a9@samsung.com>

+KVM

On Thu, Aug 29, 2024, Marek Szyprowski wrote:
> On 27.07.2024 12:27, Peter Zijlstra wrote:
> > Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> > noting that lag is fundamentally a temporal measure. It should not be
> > carried around indefinitely.
> >
> > OTOH it should also not be instantly discarded, doing so will allow a
> > task to game the system by purposefully (micro) sleeping at the end of
> > its time quantum.
> >
> > Since lag is intimately tied to the virtual time base, a wall-time
> > based decay is also insufficient, notably competition is required for
> > any of this to make sense.
> >
> > Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> > competing until they are eligible.
> >
> > Strictly speaking, we only care about keeping them until the 0-lag
> > point, but that is a difficult proposition, instead carry them around
> > until they get picked again, and dequeue them at that point.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> This patch landed recently in linux-next as commit 152e11f6df29 
> ("sched/fair: Implement delayed dequeue"). In my tests on some of the 
> ARM 32bit boards it causes a regression in rtcwake tool behavior - from 
> time to time this simple call never ends:
> 
> # time rtcwake -s 10 -m on
> 
> Reverting this commit (together with its compile dependencies) on top of 
> linux-next fixes this issue. Let me know how can I help debugging this 
> issue.

This commit broke KVM's posted interrupt handling (and other things), and the root
cause may be the same underlying issue.

TL;DR: Code that checks task_struct.on_rq may be broken by this commit.

KVM's breakage boils down to the preempt notifiers, i.e. kvm_sched_out(), being
invoked with current->on_rq "true" after KVM has explicitly called schedule().
kvm_sched_out() uses current->on_rq to determine if the vCPU is being preempted
(voluntarily or not, doesn't matter), and so waiting until some later point in
time to call __block_task() causes KVM to think the task was preempted, when in
reality it was not.

  static void kvm_sched_out(struct preempt_notifier *pn,
 			  struct task_struct *next)
  {
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	WRITE_ONCE(vcpu->scheduled_out, true);

	if (current->on_rq && vcpu->wants_to_run) {  <================
		WRITE_ONCE(vcpu->preempted, true);
		WRITE_ONCE(vcpu->ready, true);
	}
	kvm_arch_vcpu_put(vcpu);
	__this_cpu_write(kvm_running_vcpu, NULL);
  }
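
To make the false positive concrete, here is a minimal userspace sketch of the
on_rq test above.  Everything in it is hypothetical (struct model_task,
model_block(), model_sched_out_preempted()); it is not kernel code, and it
assumes wants_to_run stays true while the vCPU blocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are NOT the kernel's structures. */
struct model_task {
	bool on_rq;		/* models task_struct.on_rq */
	bool wants_to_run;	/* models vcpu->wants_to_run */
};

/*
 * Models schedule() for a task that blocks voluntarily.  Without delayed
 * dequeue the task has left the runqueue by the time the sched-out
 * notifier runs; with delayed dequeue it stays queued until picked again.
 */
static void model_block(struct model_task *t, bool delayed_dequeue)
{
	assert(t);
	t->on_rq = delayed_dequeue;
}

/* Models the on_rq test in kvm_sched_out(): the "preempted" verdict. */
static bool model_sched_out_preempted(const struct model_task *t)
{
	assert(t);
	return t->on_rq && t->wants_to_run;
}
```

With delayed_dequeue set, a voluntary block is indistinguishable from a
preemption as far as this test is concerned.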

KVM uses vcpu->preempted for a variety of things, but the most visibly problematic
is waking a vCPU from (virtual) HLT via posted interrupt wakeup.  When a vCPU
HLTs, KVM ultimately calls schedule() to schedule out the vCPU until it receives
a wake event.

When a device or another vCPU can post an interrupt as a wake event, KVM mucks
with the blocking vCPU's posted interrupt descriptor so that posted interrupts
that should be wake events get delivered on a dedicated host IRQ vector, which
lets KVM kick and wake the target vCPU.

But when vcpu->preempted is true, KVM suppresses posted interrupt notifications,
knowing that the vCPU will be scheduled back in.  Because a vCPU (task) can be
preempted while KVM is emulating HLT, KVM keys off vcpu->preempted to set PID.SN,
and doesn't exempt the blocking case.  In short, KVM uses vcpu->preempted, i.e.
current->on_rq, to differentiate between the vCPU getting preempted and KVM
executing schedule().

As a result, the false positive for vcpu->preempted causes KVM to suppress posted
interrupt notifications and the target vCPU never gets its wake event.
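
The lost wakeup can be sketched as a toy model, assuming the suppress bit simply
mirrors the preempted verdict recorded at sched-out.  The names (struct
model_vcpu, model_sched_out(), model_post_interrupt()) are made up for
illustration and do not correspond to KVM functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a blocking vCPU; field names are illustrative. */
struct model_vcpu {
	bool preempted;	/* verdict recorded at sched-out */
	bool sn;	/* models PID.SN: suppress notifications */
	bool woken;	/* set once a wake event reaches the vCPU */
};

/* At sched-out, notifications are suppressed iff the vCPU looks preempted. */
static void model_sched_out(struct model_vcpu *v, bool preempted)
{
	assert(v);
	v->preempted = preempted;
	v->sn = preempted;
	v->woken = false;
}

/* A posted interrupt only produces a wake event when SN is clear. */
static void model_post_interrupt(struct model_vcpu *v)
{
	assert(v);
	if (!v->sn)
		v->woken = true;
}
```

If model_sched_out() is handed a false "preempted == true" for a vCPU that is
actually blocking, model_post_interrupt() never sets woken, which is the hang
described above.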

Peter,

Any thoughts on how best to handle this?  The below hack-a-fix resolves the issue,
but it's obviously not appropriate.  KVM uses vcpu->preempted for more than just
posted interrupts, so KVM needs equivalent functionality to current->on_rq as it
was before this commit.

@@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,

        WRITE_ONCE(vcpu->scheduled_out, true);

-       if (current->on_rq && vcpu->wants_to_run) {
+       if (se_runnable(&current->se) && vcpu->wants_to_run) {
                WRITE_ONCE(vcpu->preempted, true);
                WRITE_ONCE(vcpu->ready, true);
        }
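
For illustration only, the difference between the two checks can be modelled in
userspace as below.  This assumes a delayed-dequeue entity is marked by a flag
(called sched_delayed here) and that a runnable-style check treats such an
entity as not runnable; struct model_se and both helpers are hypothetical and
are not the scheduler's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a scheduling entity; not the kernel's layout. */
struct model_se {
	bool on_rq;		/* still linked on the runqueue */
	bool sched_delayed;	/* assumed flag: dequeue was deferred */
};

/* Old-style check: anything still on the runqueue counts as runnable. */
static bool model_on_rq(const struct model_se *se)
{
	assert(se);
	return se->on_rq;
}

/* Runnable-style check: a delayed entity is queued but not runnable. */
static bool model_se_runnable(const struct model_se *se)
{
	assert(se);
	if (se->sched_delayed)
		return false;
	return se->on_rq;
}
```

For a blocked task whose dequeue was deferred, model_on_rq() returns true while
model_se_runnable() returns false, so only the latter matches the pre-commit
meaning of on_rq.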
