From: Sean Christopherson <seanjc@google.com>
To: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
mingo@redhat.com, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, linux-kernel@vger.kernel.org,
kprateek.nayak@amd.com, wuyun.abel@bytedance.com,
youssefesmat@chromium.org, tglx@linutronix.de, efault@gmx.de,
kvm@vger.kernel.org
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
Date: Wed, 9 Oct 2024 19:49:54 -0700
Message-ID: <ZwdA0sbA2tJA3IKh@google.com>
In-Reply-To: <5618d029-769a-4690-a581-2df8939f26a9@samsung.com>
+KVM
On Thu, Aug 29, 2024, Marek Szyprowski wrote:
> On 27.07.2024 12:27, Peter Zijlstra wrote:
> > Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> > noting that lag is fundamentally a temporal measure. It should not be
> > carried around indefinitely.
> >
> > OTOH it should also not be instantly discarded, doing so will allow a
> > task to game the system by purposefully (micro) sleeping at the end of
> > its time quantum.
> >
> > Since lag is intimately tied to the virtual time base, a wall-time
> > based decay is also insufficient, notably competition is required for
> > any of this to make sense.
> >
> > Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> > competing until they are eligible.
> >
> > Strictly speaking, we only care about keeping them until the 0-lag
> > point, but that is a difficult proposition, instead carry them around
> > until they get picked again, and dequeue them at that point.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> This patch landed recently in linux-next as commit 152e11f6df29
> ("sched/fair: Implement delayed dequeue"). In my tests on some of the
> ARM 32bit boards it causes a regression in rtcwake tool behavior - from
> time to time this simple call never ends:
>
> # time rtcwake -s 10 -m on
>
> Reverting this commit (together with its compile dependencies) on top of
> linux-next fixes this issue. Let me know how can I help debugging this
> issue.

This commit broke KVM's posted interrupt handling (and other things), and the root
cause may be the same underlying issue.

TL;DR: Code that checks task_struct.on_rq may be broken by this commit.

KVM's breakage boils down to the preempt notifiers, i.e. kvm_sched_out(), being
invoked with current->on_rq "true" after KVM has explicitly called schedule().
kvm_sched_out() uses current->on_rq to determine if the vCPU is being preempted
(voluntarily or not, doesn't matter), and so waiting until some later point in
time to call __block_task() causes KVM to think the task was preempted, when in
reality it was not.

  static void kvm_sched_out(struct preempt_notifier *pn,
			    struct task_struct *next)
  {
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	WRITE_ONCE(vcpu->scheduled_out, true);

	if (current->on_rq && vcpu->wants_to_run) {  <================
		WRITE_ONCE(vcpu->preempted, true);
		WRITE_ONCE(vcpu->ready, true);
	}
	kvm_arch_vcpu_put(vcpu);
	__this_cpu_write(kvm_running_vcpu, NULL);
  }

KVM uses vcpu->preempted for a variety of things, but the most visibly problematic
is waking a vCPU from (virtual) HLT via posted interrupt wakeup.  When a vCPU
HLTs, KVM ultimately calls schedule() to schedule out the vCPU until it receives
a wake event.

When a device or another vCPU can post an interrupt as a wake event, KVM mucks
with the blocking vCPU's posted interrupt descriptor so that posted interrupts
that should be wake events get delivered on a dedicated host IRQ vector, so that
KVM can kick and wake the target vCPU.

But when vcpu->preempted is true, KVM suppresses posted interrupt notifications,
knowing that the vCPU will be scheduled back in.  Because a vCPU (task) can be
preempted while KVM is emulating HLT, KVM keys off vcpu->preempted to set PID.SN,
and doesn't exempt the blocking case.  In short, KVM uses vcpu->preempted, i.e.
current->on_rq, to differentiate between the vCPU getting preempted and KVM
executing schedule().

As a result, the false positive for vcpu->preempted causes KVM to suppress posted
interrupt notifications and the target vCPU never gets its wake event.
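
To make the chain of events concrete, here is a minimal userspace sketch of the
failure.  It is purely a model: every model_* name is hypothetical, nothing here
is kernel or KVM code, and it assumes vcpu->wants_to_run is true while the vCPU
is blocking in HLT emulation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the bits of task_struct that matter here. */
struct model_task {
	bool on_rq;		/* stands in for task_struct.on_rq */
	bool sched_delayed;	/* dequeue was deferred by the scheduler */
};

/* Hypothetical stand-in for the relevant kvm_vcpu state. */
struct model_vcpu {
	bool wants_to_run;
	bool preempted;
	bool notifications_suppressed;	/* stands in for PID.SN */
};

/*
 * Model of a voluntary schedule().  Without delayed dequeue the task
 * leaves the runqueue right away; with it, on_rq stays set until the
 * task is picked again.
 */
static void model_block(struct model_task *t, bool delay_dequeue)
{
	assert(t->on_rq);	/* a running task is queued before it blocks */

	if (delay_dequeue)
		t->sched_delayed = true;
	else
		t->on_rq = false;
}

/* Mirrors the on_rq test in kvm_sched_out() quoted above. */
static void model_sched_out(struct model_vcpu *v, const struct model_task *t)
{
	v->preempted = t->on_rq && v->wants_to_run;
	/* A "preempted" vCPU gets its posted interrupt notifications suppressed. */
	v->notifications_suppressed = v->preempted;
}

/* Returns true if a posted interrupt would wake the blocked vCPU. */
static bool model_run(bool delay_dequeue)
{
	struct model_task t = { .on_rq = true, .sched_delayed = false };
	struct model_vcpu v = { .wants_to_run = true };

	model_block(&t, delay_dequeue);
	model_sched_out(&v, &t);

	return !v.notifications_suppressed;
}
```

In this model, model_run(false) reports that the wake event is delivered, while
model_run(true) reports that it is lost, which is the hang described above.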

Peter,

Any thoughts on how best to handle this?  The below hack-a-fix resolves the issue,
but it's obviously not appropriate.  KVM uses vcpu->preempted for more than just
posted interrupts, so KVM needs equivalent functionality to current->on_rq as it
was before this commit.

@@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 
 	WRITE_ONCE(vcpu->scheduled_out, true);
 
-	if (current->on_rq && vcpu->wants_to_run) {
+	if (se_runnable(&current->se) && vcpu->wants_to_run) {
 		WRITE_ONCE(vcpu->preempted, true);
 		WRITE_ONCE(vcpu->ready, true);
 	}
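
For reference, a small userspace sketch of how the two predicates differ across
the three task states in play.  The model_* names are hypothetical, and
model_se_runnable() only approximates what I understand se_runnable() to do for
a task entity (treat a delayed-dequeue entity as not runnable); treat that as an
assumption, not as the scheduler's actual definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant sched_entity fields. */
struct model_se {
	bool on_rq;		/* still linked on the runqueue */
	bool sched_delayed;	/* blocked, but dequeue was deferred */
};

/* Assumed approximation of se_runnable() for a task entity. */
static bool model_se_runnable(const struct model_se *se)
{
	if (se->sched_delayed)
		return false;

	return se->on_rq;
}

/* The check as it is today: on_rq alone decides "preempted". */
static bool model_preempted_old(const struct model_se *se, bool wants_to_run)
{
	return se->on_rq && wants_to_run;
}

/* The check from the hack above: ignore delayed-dequeue entities. */
static bool model_preempted_new(const struct model_se *se, bool wants_to_run)
{
	return model_se_runnable(se) && wants_to_run;
}
```

The two checks agree for a truly preempted task (on_rq set, not delayed) and for
a fully dequeued task (on_rq clear).  They differ only for a task that blocked
but is still queued: the old check reports "preempted", the new one does not.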