[RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting
@ 2011-01-14  8:02 Rik van Riel
  2011-01-14  8:03 ` [RFC -v5 PATCH 1/4] kvm: keep track of which task is running a KVM vcpu Rik van Riel
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Rik van Riel @ 2011-01-14  8:02 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got 
preempted, hopefully the lock holder.

v5:
- fix the race condition Avi pointed out, by tracking vcpu->pid
- also allows us to yield to vcpu tasks that got preempted while in qemu
  userspace
v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but
  turned out to give weird interactions in practice
v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some
  situations and has little downside potential
v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
  yet, since it is not entirely clear how to do that right
  (suggestions welcome)

Benchmark "results":

Two 4-CPU KVM guests are pinned to the same 4 physical CPUs.

One guest runs the AMQP performance test, the other guest runs
0, 2 or 4 infinite loops, for CPU overcommit factors of 0, 1.5
and 4.

The AMQP perftest is run 30 times, with 8 and 16 threads.

8thr	no overcommit	1.5x overcommit		2x overcommit

no PLE	198918		132625			90523.5
PLE	213904		127507			95098.5

16thr	no overcommit	1.5x overcommit		2x overcommit

no PLE	197526		127941			87187.8
PLE	210696		136874			87005.9	

Note: there seems to be something wrong with CPU balancing,
possibly related to cgroups.  The AMQP guest only got about
80% CPU time (of 400% total) when running with 2x overcommit,
as opposed to the expected 200%.  Without PLE, the guest
seems to get closer to 100% CPU time, which is still far
below the expected.

Without overcommit, the AMQP guest gets about 340-350%
CPU time without the PLE code, and around 380% CPU time
with the PLE code kicking the scheduler around.

Unfortunately, it looks like this test ended up more as a 
demonstration of other scheduler issues, than as a performance 
test of the PLE code.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC -v5 PATCH 1/4] kvm: keep track of which task is running a KVM vcpu
  2011-01-14  8:02 [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
@ 2011-01-14  8:03 ` Rik van Riel
  2011-01-16 15:17   ` Avi Kivity
  2011-01-14  8:03 ` [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality Rik van Riel
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2011-01-14  8:03 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

Keep track of which task is running a KVM vcpu.  This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single "run" of the vcpu.

Signed-off-by: Rik van Riel <riel@redhat.com>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..9d56ed5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
 	int vcpu_id;
 	struct mutex mutex;
+	struct pid *pid;
 	int   cpu;
 	atomic_t guest_mode;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..65e997a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -185,6 +185,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->cpu = -1;
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
+	vcpu->pid = 0;
 	init_waitqueue_head(&vcpu->wq);
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -208,6 +209,8 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_init);
 
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
+	if (vcpu->pid)
+		put_pid(vcpu->pid);
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
 }
@@ -1456,6 +1459,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		r = -EINVAL;
 		if (arg)
 			goto out;
+		if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
+			/* The thread running this VCPU changed. */
+			struct pid *oldpid = vcpu->pid;
+			vcpu->pid = get_task_pid(current, PIDTYPE_PID);
+			put_pid(oldpid);
+		}
 		r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
 		break;
 	case KVM_GET_REGS: {


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 1/4] kvm: keep track of which task is running a KVM vcpu
  2011-01-14  8:03 ` [RFC -v5 PATCH 1/4] kvm: keep track of which task is running a KVM vcpu Rik van Riel
@ 2011-01-16 15:17   ` Avi Kivity
  0 siblings, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2011-01-16 15:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

On 01/14/2011 10:03 AM, Rik van Riel wrote:
> Keep track of which task is running a KVM vcpu.  This helps us
> figure out later what task to wake up if we want to boost a
> vcpu that got preempted.
>
> Unfortunately there are no guarantees that the same task
> always keeps the same vcpu, so we can only track the task
> across a single "run" of the vcpu.
>
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5225052..65e997a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -185,6 +185,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>   	vcpu->cpu = -1;
>   	vcpu->kvm = kvm;
>   	vcpu->vcpu_id = id;
> +	vcpu->pid = 0;

NULL

> @@ -1456,6 +1459,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
>   		r = -EINVAL;
>   		if (arg)
>   			goto out;
> +		if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
> +			/* The thread running this VCPU changed. */
> +			struct pid *oldpid = vcpu->pid;
> +			vcpu->pid = get_task_pid(current, PIDTYPE_PID);
> +			put_pid(oldpid);
> +		}

This is subject to the same race as before.  If another vcpu picks up 
vcpu->pid before the assignment (that is, oldpid), but dereferences it 
after put_pid(), it hits freed memory.

You want something like

     struct pid *oldpid = vcpu->pid;
     rcu_assign_pointer(vcpu->pid, get_task_pid());
     synchronize_rcu();
     put_pid(oldpid);

with rcu_read_lock() / rcu_dereference() protection on the reader side.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality.
  2011-01-14  8:02 [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
  2011-01-14  8:03 ` [RFC -v5 PATCH 1/4] kvm: keep track of which task is running a KVM vcpu Rik van Riel
@ 2011-01-14  8:03 ` Rik van Riel
  2011-01-14 17:15   ` Peter Zijlstra
  2011-01-14 17:47   ` Srivatsa Vaddagiri
  2011-01-14  8:04 ` [RFC -v5 PATCH 3/4] export pid symbols needed for kvm_vcpu_on_spin Rik van Riel
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Rik van Riel @ 2011-01-14  8:03 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

From: Mike Galbraith <efault@gmx.de>

Currently only implemented for fair class tasks.

Add a yield_to_task method() to the fair scheduling class. allowing the
caller of yield_to() to accelerate another thread in it's thread group,
task group.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.  We can rely on pick_next_entity to keep things
fair, so noone can accelerate a thread that has already used its fair
share of CPU time.

This also means callers should only call yield_to when they really
mean it.  Calling it too often can result in the scheduler just
ignoring the hint.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..6c43fc4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
@@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif
 
+extern bool yield_to(struct task_struct *p, bool preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..d47b282 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5270,6 +5270,66 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ *
+ * Returns true if we indeed boosted the target task.
+ */
+bool __sched yield_to(struct task_struct *p, bool preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	bool yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+
+	return yield;
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 00ebd76..5006f35 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1742,6 +1742,23 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	}
 }
 
+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
+{
+	struct sched_entity *se = &p->se;
+
+	if (!se->on_rq)
+		return false;
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	set_next_buddy(se);
+
+	/* Make p's CPU reschedule; pick_next_entry takes care of fairness. */
+	if (preempt)
+		resched_task(rq->curr);
+
+	return true;
+}
+
 #ifdef CONFIG_SMP
 /**************************************************
  * Fair scheduling class load-balancing methods:
@@ -3935,6 +3952,7 @@ static const struct sched_class fair_sched_class = {
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,
 
 	.check_preempt_curr	= check_preempt_wakeup,
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality.
  2011-01-14  8:03 ` [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality Rik van Riel
@ 2011-01-14 17:15   ` Peter Zijlstra
  2011-01-14 17:47   ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2011-01-14 17:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Mike Galbraith,
	Chris Wright, ttracy, dshaks

On Fri, 2011-01-14 at 03:03 -0500, Rik van Riel wrote:
> From: Mike Galbraith <efault@gmx.de>
> 
> Currently only implemented for fair class tasks.
> 
> Add a yield_to_task method() to the fair scheduling class. allowing the
> caller of yield_to() to accelerate another thread in it's thread group,
> task group.
> 
> Implemented via a scheduler hint, using cfs_rq->next to encourage the
> target being selected.  We can rely on pick_next_entity to keep things
> fair, so noone can accelerate a thread that has already used its fair
> share of CPU time.
> 
> This also means callers should only call yield_to when they really
> mean it.  Calling it too often can result in the scheduler just
> ignoring the hint.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Mike Galbraith <efault@gmx.de> 

Looks good to me, do you want me to merge this or will you merge it
through the kvm tree with all other patches?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality.
  2011-01-14  8:03 ` [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality Rik van Riel
  2011-01-14 17:15   ` Peter Zijlstra
@ 2011-01-14 17:47   ` Srivatsa Vaddagiri
  2011-01-14 18:29     ` Rik van Riel
  1 sibling, 1 reply; 13+ messages in thread
From: Srivatsa Vaddagiri @ 2011-01-14 17:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Peter Zijlstra, Mike Galbraith,
	Chris Wright, ttracy, dshaks

On Fri, Jan 14, 2011 at 03:03:57AM -0500, Rik van Riel wrote:
> From: Mike Galbraith <efault@gmx.de>
> 
> Currently only implemented for fair class tasks.
> 
> Add a yield_to_task method() to the fair scheduling class. allowing the
> caller of yield_to() to accelerate another thread in it's thread group,
> task group.
> 
> Implemented via a scheduler hint, using cfs_rq->next to encourage the
> target being selected.  We can rely on pick_next_entity to keep things
> fair, so noone can accelerate a thread that has already used its fair
> share of CPU time.

If I recall correctly, one of the motivations for yield_to_task (rather than
a simple yield) was to avoid leaking bandwidth to other guests i.e we don't want
the remaining timeslice of spinning vcpu to be given away to other guests but
rather donate it to another (lock-holding) vcpu and thus retain the bandwidth
allocated to the guest.

I am not sure whether we are meeting that objective via this patch, as
lock-spinning vcpu would simply yield after setting next buddy to preferred
vcpu on target pcpu, thereby leaking some amount of bandwidth on the pcpu
where it is spinning. Would be nice to see what kind of fairness impact this 
has under heavy contention scenario.

- vatsa

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality.
  2011-01-14 17:47   ` Srivatsa Vaddagiri
@ 2011-01-14 18:29     ` Rik van Riel
  2011-01-17 15:53       ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2011-01-14 18:29 UTC (permalink / raw)
  To: vatsa
  Cc: kvm, linux-kernel, Avi Kiviti, Peter Zijlstra, Mike Galbraith,
	Chris Wright, ttracy, dshaks

On 01/14/2011 12:47 PM, Srivatsa Vaddagiri wrote:

> If I recall correctly, one of the motivations for yield_to_task (rather than
> a simple yield) was to avoid leaking bandwidth to other guests i.e we don't want
> the remaining timeslice of spinning vcpu to be given away to other guests but
> rather donate it to another (lock-holding) vcpu and thus retain the bandwidth
> allocated to the guest.

No, that was not the motivation.   The motivation was to try
and get the lock holder to run soon, so it can release the
lock.

What you describe is merely one of the mechanisms considered
for meeting that objective.

> I am not sure whether we are meeting that objective via this patch, as
> lock-spinning vcpu would simply yield after setting next buddy to preferred
> vcpu on target pcpu, thereby leaking some amount of bandwidth on the pcpu
> where it is spinning.

Have you read the patch?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality.
  2011-01-14 18:29     ` Rik van Riel
@ 2011-01-17 15:53       ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 13+ messages in thread
From: Srivatsa Vaddagiri @ 2011-01-17 15:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Peter Zijlstra, Mike Galbraith,
	Chris Wright, ttracy, dshaks

On Fri, Jan 14, 2011 at 01:29:52PM -0500, Rik van Riel wrote:
> >I am not sure whether we are meeting that objective via this patch, as
> >lock-spinning vcpu would simply yield after setting next buddy to preferred
> >vcpu on target pcpu, thereby leaking some amount of bandwidth on the pcpu
> >where it is spinning.
> 
> Have you read the patch?

Sorry had mis-read the patch!

On reviewing it further, I am wondering if we can optimize yield_to() further
for case when target and current are on same pcpu, by swapping vruntimes of two
tasks (to let target run in current's place - as we do in task_fork_fair()).

- vatsa

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC -v5 PATCH 3/4] export pid symbols needed for kvm_vcpu_on_spin
  2011-01-14  8:02 [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
  2011-01-14  8:03 ` [RFC -v5 PATCH 1/4] kvm: keep track of which task is running a KVM vcpu Rik van Riel
  2011-01-14  8:03 ` [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality Rik van Riel
@ 2011-01-14  8:04 ` Rik van Riel
  2011-01-16 15:18   ` Avi Kivity
  2011-01-14  8:05 ` [RFC -v5 PATCH 4/4] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin Rik van Riel
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2011-01-14  8:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

Export the symbols required for a race-free kvm_vcpu_on_spin.

Signed-off-by: Rik van Riel <riel@redhat.com>

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..adc8f47 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+EXPORT_SYMBOL_GPL(__put_task_struct);
 
 /*
  * macro override instead of weak attribute alias, to workaround
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..02f2212 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -435,6 +435,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 	rcu_read_unlock();
 	return pid;
 }
+EXPORT_SYMBOL_GPL(get_task_pid);
 
 struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 {
@@ -446,6 +447,7 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 	rcu_read_unlock();
 	return result;
 }
+EXPORT_SYMBOL_GPL(get_pid_task);
 
 struct pid *find_get_pid(pid_t nr)
 {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 3/4] export pid symbols needed for kvm_vcpu_on_spin
  2011-01-14  8:04 ` [RFC -v5 PATCH 3/4] export pid symbols needed for kvm_vcpu_on_spin Rik van Riel
@ 2011-01-16 15:18   ` Avi Kivity
  0 siblings, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2011-01-16 15:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

On 01/14/2011 10:04 AM, Rik van Riel wrote:
> Export the symbols required for a race-free kvm_vcpu_on_spin.
>

Needs to be reordered with the first patch for bisectability.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC -v5 PATCH 4/4] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
  2011-01-14  8:02 [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
                   ` (2 preceding siblings ...)
  2011-01-14  8:04 ` [RFC -v5 PATCH 3/4] export pid symbols needed for kvm_vcpu_on_spin Rik van Riel
@ 2011-01-14  8:05 ` Rik van Riel
  2011-01-14 17:34 ` [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
  2011-01-14 21:29 ` Rik van Riel
  5 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2011-01-14  8:05 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to hand
the rest of our timeslice to another vcpu in the same KVM guest.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d56ed5..fab2250 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -187,6 +187,7 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
+	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 65e997a..a7c45c8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1292,18 +1292,52 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-	ktime_t expires;
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
-	/* Sleep for 100 us, and hope lock-holder got scheduled */
-	expires = ktime_add_ns(ktime_get(), 100000UL);
-	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int yielded = 0;
+	int pass;
+	int i;
 
-	finish_wait(&vcpu->wq, &wait);
+	/*
+	 * We boost the priority of a VCPU that is runnable but not
+	 * currently running, because it got preempted by something
+	 * else and called schedule in __vcpu_run.  Hopefully that
+	 * VCPU is holding the lock that we need and will release it.
+	 * We approximate round-robin by starting at the last boosted VCPU.
+	 */
+	for (pass = 0; pass < 2 && !yielded; pass++) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			struct task_struct *task;
+			if (!pass && i < last_boosted_vcpu) {
+				i = last_boosted_vcpu;
+				continue;
+			} else if (pass && i > last_boosted_vcpu)
+				break;
+			if (vcpu == me)
+				continue;
+			if (!vcpu->pid)
+				continue;
+			if (waitqueue_active(&vcpu->wq))
+				continue;
+			task = get_pid_task(vcpu->pid, PIDTYPE_PID);
+			if (!task)
+				continue;
+			if (task->flags & PF_VCPU) {
+				put_task_struct(task);
+				continue;
+			}
+			if (yield_to(task, 1)) {
+				put_task_struct(task);
+				kvm->last_boosted_vcpu = i;
+				yielded = 1;
+				break;
+			}
+			put_task_struct(task);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting
  2011-01-14  8:02 [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
                   ` (3 preceding siblings ...)
  2011-01-14  8:05 ` [RFC -v5 PATCH 4/4] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin Rik van Riel
@ 2011-01-14 17:34 ` Rik van Riel
  2011-01-14 21:29 ` Rik van Riel
  5 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2011-01-14 17:34 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

On 01/14/2011 03:02 AM, Rik van Riel wrote:

> Benchmark "results":
>
> Two 4-CPU KVM guests are pinned to the same 4 physical CPUs.

I just discovered that I had in fact pinned the 4-CPU KVM
guests to 4 HT threads across 2 cores, and the scheduler
has all kinds of special magic for dealing with HT siblings.

I am now rerunning the tests with the KVM guests bound to
cores 0,2,4,6 to see if that makes a difference.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting
  2011-01-14  8:02 [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
                   ` (4 preceding siblings ...)
  2011-01-14 17:34 ` [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
@ 2011-01-14 21:29 ` Rik van Riel
  5 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2011-01-14 21:29 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright, ttracy, dshaks

On 01/14/2011 03:02 AM, Rik van Riel wrote:

> Benchmark "results":
>
> Two 4-CPU KVM guests are pinned to the same 4 physical CPUs.

Unfortunately, it turned out I was running my benchmark on
only two CPU cores, using two HT threads of each core.

I have re-run the benchmark with the guests bound to 4
different CPU cores, one HT on each core.

> One guest runs the AMQP performance test, the other guest runs
> 0, 2 or 4 infinite loops, for CPU overcommit factors of 0, 1.5
> and 4.
>
> The AMQP perftest is run 30 times, with 8 and 16 threads.

8thr	no overcommit	1.5x overcommit		2x overcommit

no PLE	224934		139311			94216.6
PLE	226413		142830			87836.4

16thr	no overcommit	1.5x overcommit		2x overcommit

no PLE	224266		134819			92123.1
PLE	224985		137280			100832

The other conclusions hold - it looks like this test is
doing more to expose issues with the scheduler, than
testing the PLE code.

I have some ideas on how to improve yield(), so it can
do the right thing even in the presence of cgroups.

> Note: there seems to be something wrong with CPU balancing,
> possibly related to cgroups.  The AMQP guest only got about
> 80% CPU time (of 400% total) when running with 2x overcommit,
> as opposed to the expected 200%.  Without PLE, the guest
> seems to get closer to 100% CPU time, which is still far
> below the expected.

> Unfortunately, it looks like this test ended up more as a
> demonstration of other scheduler issues, than as a performance
> test of the PLE code.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-01-17 15:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-01-14  8:02 [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
2011-01-14  8:03 ` [RFC -v5 PATCH 1/4] kvm: keep track of which task is running a KVM vcpu Rik van Riel
2011-01-16 15:17   ` Avi Kivity
2011-01-14  8:03 ` [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality Rik van Riel
2011-01-14 17:15   ` Peter Zijlstra
2011-01-14 17:47   ` Srivatsa Vaddagiri
2011-01-14 18:29     ` Rik van Riel
2011-01-17 15:53       ` Srivatsa Vaddagiri
2011-01-14  8:04 ` [RFC -v5 PATCH 3/4] export pid symbols needed for kvm_vcpu_on_spin Rik van Riel
2011-01-16 15:18   ` Avi Kivity
2011-01-14  8:05 ` [RFC -v5 PATCH 4/4] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin Rik van Riel
2011-01-14 17:34 ` [RFC -v5 PATCH 0/4] directed yield for Pause Loop Exiting Rik van Riel
2011-01-14 21:29 ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox