* [RFC -v4 PATCH 0/3] directed yield for Pause Loop Exiting
From: Rik van Riel @ 2011-01-13 5:21 UTC (permalink / raw)
To: kvm
Cc: linux-kernel, Avi Kivity, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.
Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.
The current KVM code sleeps for a bit whenever that happens, which
results in e.g. a 64 VCPU Windows guest taking forever and a bit to
boot up. This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.
In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others. Instead of spinning, it ends up simply not running
much at all.
This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got
preempted, hopefully the lock holder.
v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but
turned out to give weird interactions in practice
v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some
situations and has little downside potential
v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
yet, since it is not entirely clear how to do that right
(suggestions welcome)
--
All rights reversed.
* [RFC -v4 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
From: Rik van Riel @ 2011-01-13 5:22 UTC (permalink / raw)
To: kvm
Cc: linux-kernel, Avi Kivity, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
Keep track of which task is running a KVM vcpu. This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.
Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single "run" of the vcpu.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
- move vcpu->task manipulation as suggested by Chris Wright
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..180085b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
#endif
int vcpu_id;
struct mutex mutex;
+ struct task_struct *task;
int cpu;
atomic_t guest_mode;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..c95bad1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2248,6 +2248,7 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+ vcpu->task = NULL;
kvm_arch_vcpu_load(vcpu, cpu);
}
@@ -2256,6 +2257,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+ vcpu->task = current;
kvm_arch_vcpu_put(vcpu);
}
* [RFC -v4 PATCH 2/3] sched: Add yield_to(task, preempt) functionality.
From: Rik van Riel @ 2011-01-13 5:26 UTC (permalink / raw)
To: kvm
Cc: linux-kernel, Avi Kivity, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
From: Mike Galbraith <efault@gmx.de>
Currently only implemented for fair class tasks.
Add a yield_to_task() method to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group or
task group.
Implemented via a scheduler hint, using cfs_rq->next to encourage
selection of the target. We can rely on pick_next_entity to keep things
fair, so no one can accelerate a thread that has already used its fair
share of CPU time.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..4492d84 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
+ bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
@@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
# define rt_mutex_adjust_pi(p) do { } while (0)
#endif
+extern void yield_to(struct task_struct *p, bool preempt);
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..f154a2e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5270,6 +5270,62 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ * @p: target task
+ * @preempt: whether task preemption is allowed or not
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, bool preempt)
+{
+ struct task_struct *curr = current;
+ struct rq *rq, *p_rq;
+ unsigned long flags;
+ bool yield = false;
+
+ local_irq_save(flags);
+ rq = this_rq();
+
+again:
+ p_rq = task_rq(p);
+ double_rq_lock(rq, p_rq);
+ while (task_rq(p) != p_rq) {
+ double_rq_unlock(rq, p_rq);
+ goto again;
+ }
+
+ if (!curr->sched_class->yield_to_task)
+ goto out;
+
+ if (curr->sched_class != p->sched_class)
+ goto out;
+
+ if (task_running(p_rq, p) || p->state)
+ goto out;
+
+ if (!same_thread_group(p, curr))
+ goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ if (task_group(p) != task_group(curr))
+ goto out;
+#endif
+
+ yield = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+ double_rq_unlock(rq, p_rq);
+ local_irq_restore(flags);
+
+ if (yield) {
+ set_current_state(TASK_RUNNING);
+ schedule();
+ }
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
/*
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 00ebd76..5006f35 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1742,6 +1742,23 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
}
}
+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
+{
+ struct sched_entity *se = &p->se;
+
+ if (!se->on_rq)
+ return false;
+
+ /* Tell the scheduler that we'd really like pse to run next. */
+ set_next_buddy(se);
+
+ /* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
+ if (preempt)
+ resched_task(rq->curr);
+
+ return true;
+}
+
#ifdef CONFIG_SMP
/**************************************************
* Fair scheduling class load-balancing methods:
@@ -3935,6 +3952,7 @@ static const struct sched_class fair_sched_class = {
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
+ .yield_to_task = yield_to_task_fair,
.check_preempt_curr = check_preempt_wakeup,
* [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
From: Rik van Riel @ 2011-01-13 5:27 UTC (permalink / raw)
To: kvm
Cc: linux-kernel, Avi Kivity, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to hand
the rest of our timeslice to another vcpu in the same KVM guest.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c011ba3..ad3cb4a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -185,6 +185,7 @@ struct kvm {
#endif
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
atomic_t online_vcpus;
+ int last_boosted_vcpu;
struct list_head vm_list;
struct mutex lock;
struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 21f816c..5822246 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1878,18 +1878,44 @@ void kvm_resched(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_resched);
-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
- ktime_t expires;
- DEFINE_WAIT(wait);
-
- prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
- /* Sleep for 100 us, and hope lock-holder got scheduled */
- expires = ktime_add_ns(ktime_get(), 100000UL);
- schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+ struct kvm *kvm = me->kvm;
+ struct kvm_vcpu *vcpu;
+ int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+ int yielded = 0;
+ int pass;
+ int i;
- finish_wait(&vcpu->wq, &wait);
+ /*
+ * We boost the priority of a VCPU that is runnable but not
+ * currently running, because it got preempted by something
+ * else and called schedule in __vcpu_run. Hopefully that
+ * VCPU is holding the lock that we need and will release it.
+ * We approximate round-robin by starting at the last boosted VCPU.
+ */
+ for (pass = 0; pass < 2 && !yielded; pass++) {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct task_struct *task = vcpu->task;
+ if (!pass && i < last_boosted_vcpu) {
+ i = last_boosted_vcpu;
+ continue;
+ } else if (pass && i > last_boosted_vcpu)
+ break;
+ if (vcpu == me)
+ continue;
+ if (!task)
+ continue;
+ if (waitqueue_active(&vcpu->wq))
+ continue;
+ if (task->flags & PF_VCPU)
+ continue;
+ kvm->last_boosted_vcpu = i;
+ yielded = 1;
+ yield_to(task, 1);
+ break;
+ }
+ }
}
EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
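The two-pass loop above approximates round-robin: the first pass jumps
ahead to the VCPU after the last boosted one, and the second pass wraps
around to cover the earlier indices, stopping where the scan began. A
small userspace sketch (hypothetical names, plain indices instead of
kvm_for_each_vcpu) shows the resulting visit order:

```c
/*
 * Userspace sketch (not KVM code) of the two-pass scan in
 * kvm_vcpu_on_spin: visit vcpu indices round-robin starting after
 * last_boosted, wrapping around, and skipping ourselves ('self').
 * Writes the visit order into 'order' and returns how many were visited.
 */
int scan_order(int nr_vcpus, int last_boosted, int self, int *order)
{
	int n = 0;
	int pass, i;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < nr_vcpus; i++) {
			if (!pass && i < last_boosted) {
				/* first pass: jump ahead of last_boosted */
				i = last_boosted;
				continue;
			} else if (pass && i > last_boosted)
				break;	/* second pass: stop where we began */
			if (i == self)
				continue;
			order[n++] = i;
		}
	}
	return n;
}
```

With 4 VCPUs, last_boosted = 2 and the spinner at index 0, the
candidates are visited in the order 3, 1, 2: everything after the last
boosted VCPU first, then the wrap-around, ending at last_boosted itself.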
* Re: [RFC -v4 PATCH 0/3] directed yield for Pause Loop Exiting
From: Avi Kivity @ 2011-01-13 13:12 UTC (permalink / raw)
To: Rik van Riel
Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
On 01/13/2011 07:21 AM, Rik van Riel wrote:
> When running SMP virtual machines, it is possible for one VCPU to be
> spinning on a spinlock, while the VCPU that holds the spinlock is not
> currently running, because the host scheduler preempted it to run
> something else.
>
> Both Intel and AMD CPUs have a feature that detects when a virtual
> CPU is spinning on a lock and will trap to the host.
>
> The current KVM code sleeps for a bit whenever that happens, which
> results in e.g. a 64 VCPU Windows guest taking forever and a bit to
> boot up. This is because the VCPU holding the lock is actually
> running and not sleeping, so the pause is counter-productive.
>
> In other workloads a pause can also be counter-productive, with
> spinlock detection resulting in one guest giving up its CPU time
> to the others. Instead of spinning, it ends up simply not running
> much at all.
>
> This patch series aims to fix that, by having a VCPU that spins
> give the remainder of its timeslice to another VCPU in the same
> guest before yielding the CPU - one that is runnable but got
> preempted, hopefully the lock holder.
Can you share some benchmark results?
I'm mostly interested in moderately sized guests (4-8 vcpus) under
conditions of no overcommit, and high overcommit (2x).
For no overcommit, I'd like to see comparisons against mainline with PLE
disabled, to be sure there aren't significant regressions. For
overcommit, comparisons against the no overcommit case. Comparisons
against mainline, with or without PLE disabled, are uninteresting since
we know it sucks both ways.
--
error compiling committee.c: too many arguments to function
* Re: [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
From: Avi Kivity @ 2011-01-13 13:16 UTC (permalink / raw)
To: Rik van Riel
Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
On 01/13/2011 07:27 AM, Rik van Riel wrote:
> Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
> slowdowns of certain workloads, we instead use yield_to to hand
> the rest of our timeslice to another vcpu in the same KVM guest.
>
>
> + for (pass = 0; pass < 2 && !yielded; pass++) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct task_struct *task = vcpu->task;
> + if (!pass && i < last_boosted_vcpu) {
> + i = last_boosted_vcpu;
> + continue;
> + } else if (pass && i > last_boosted_vcpu)
> + break;
> + if (vcpu == me)
> + continue;
> + if (!task)
> + continue;
> + if (waitqueue_active(&vcpu->wq))
> + continue;
Suppose the vcpu exits at this point, and its task terminates.
> + if (task->flags & PF_VCPU)
> + continue;
Here you dereference freed memory.
> + kvm->last_boosted_vcpu = i;
> + yielded = 1;
> + yield_to(task, 1);
And here you do unimaginable things to that freed memory.
I think the first patch needs some reference counting... I'd move it to
the outermost KVM_RUN loop to reduce the performance impact.
> + break;
> + }
> + }
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
--
error compiling committee.c: too many arguments to function
* Re: [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
From: Rik van Riel @ 2011-01-13 15:06 UTC (permalink / raw)
To: Avi Kivity
Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
On 01/13/2011 08:16 AM, Avi Kivity wrote:
>> + for (pass = 0; pass < 2 && !yielded; pass++) {
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + struct task_struct *task = vcpu->task;
>> + if (!pass && i < last_boosted_vcpu) {
>> + i = last_boosted_vcpu;
>> + continue;
>> + } else if (pass && i > last_boosted_vcpu)
>> + break;
>> + if (vcpu == me)
>> + continue;
>> + if (!task)
>> + continue;
>> + if (waitqueue_active(&vcpu->wq))
>> + continue;
>
> Suppose the vcpu exits at this point, and its task terminates.
Arghh, good point.
> I think the first patch needs some reference counting... I'd move it to
> the outermost KVM_RUN loop to reduce the performance impact.
I don't see how refcounting from that other thread could
possibly help, and I now see that the task_struct_cachep
does not have SLAB_DESTROY_BY_RCU, either :(
What do you have in mind here that would both work and
be acceptable to you as KVM maintainer?
--
All rights reversed
* Re: [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
From: Avi Kivity @ 2011-01-13 15:23 UTC (permalink / raw)
To: Rik van Riel
Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
On 01/13/2011 05:06 PM, Rik van Riel wrote:
>
>> I think the first patch needs some reference counting... I'd move it to
>> the outermost KVM_RUN loop to reduce the performance impact.
>
> I don't see how refcounting from that other thread could
> possibly help, and I now see that the task_struct_cachep
> does not have SLAB_DESTROY_BY_LRU, either :(
>
> What do you have in mind here that would both work and
> be acceptable to you as KVM maintainer?
>
I think a 'struct pid' fits the bill here.
--
error compiling committee.c: too many arguments to function
* Re: [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
From: Rik van Riel @ 2011-01-14 0:10 UTC (permalink / raw)
To: Avi Kivity
Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
Mike Galbraith, Chris Wright
On 01/13/2011 10:23 AM, Avi Kivity wrote:
> On 01/13/2011 05:06 PM, Rik van Riel wrote:
>>
>>> I think the first patch needs some reference counting... I'd move it to
>>> the outermost KVM_RUN loop to reduce the performance impact.
>>
>> I don't see how refcounting from that other thread could
>> possibly help, and I now see that the task_struct_cachep
>> does not have SLAB_DESTROY_BY_RCU, either :(
>>
>> What do you have in mind here that would both work and
>> be acceptable to you as KVM maintainer?
>>
>
> I think a 'struct pid' fits the bill here.
Indeed it does. New patches on the way later, after I have
tested them some more.
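For the archives, a sketch of what the 'struct pid' approach might look
like (kernel-context fragments, not compilable standalone; the exact
placement of the calls is an assumption drawn from this discussion):

```c
/* In struct kvm_vcpu, replace the raw task pointer with a pid: */
struct pid *pid;	/* pid of the task last running this vcpu */

/* When the vcpu starts running, pin our own pid (refcounted): */
vcpu->pid = get_task_pid(current, PIDTYPE_PID);

/* In kvm_vcpu_on_spin, resolve the pid to a task safely: */
struct task_struct *task = get_pid_task(vcpu->pid, PIDTYPE_PID);
if (task) {
	if (task != current)
		yield_to(task, 1);
	put_task_struct(task);	/* drop the ref get_pid_task took */
}

/* When the vcpu stops running or is torn down: */
put_pid(vcpu->pid);
```

Since get_pid_task() takes its own reference on the task under RCU, the
task_struct cannot be freed while we use it, which avoids the
use-after-free Avi pointed out.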
--
All rights reversed