linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] directed yield for Pause Loop Exiting
@ 2010-12-02 19:41 Rik van Riel
  2010-12-02 19:43 ` [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu Rik van Riel
                   ` (4 more replies)
  0 siblings, 5 replies; 65+ messages in thread
From: Rik van Riel @ 2010-12-02 19:41 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kivity, Srivatsa Vaddagiri, Peter Zijlstra,
	Ingo Molnar, Anthony Liguori

When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.
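
For illustration, the guest-side loop being detected looks roughly
like the sketch below (not actual guest code; cpu_relax() compiles
to the PAUSE instruction that the hardware counts):

	static inline void guest_spin_lock(atomic_t *lock)
	{
		/* long runs of PAUSE with no forward progress trigger
		 * the PLE/Pause-Filter vmexit to the host */
		while (atomic_cmpxchg(lock, 0, 1) != 0)
			cpu_relax();
	}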

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest - one that is runnable but got preempted, hopefully the
lock holder - before yielding the CPU.
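
In rough outline, the directed yield looks like this (a simplified
sketch, not the patch code; vcpu_preempted() stands in for the real
checks):

	static void directed_yield(struct kvm_vcpu *me)
	{
		struct kvm_vcpu *vcpu;
		int i;

		kvm_for_each_vcpu(i, vcpu, me->kvm) {
			if (vcpu == me)
				continue;
			/* runnable but preempted: hopefully the lock holder */
			if (vcpu_preempted(vcpu)) {
				yield_to(vcpu->task, NULL);
				return;
			}
		}
		/* nobody suitable found: fall back to a short sleep */
	}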

Scheduler people, please flame me with anything I may have done
wrong, so I can do it right for a next version :)

-- 
All rights reversed.

* Re: [RFC PATCH 2/3] sched: add yield_to function
@ 2011-01-02 11:43 Hillf Danton
  2011-01-02 17:19 ` Rik van Riel
  0 siblings, 1 reply; 65+ messages in thread
From: Hillf Danton @ 2011-01-02 11:43 UTC (permalink / raw)
  To: Rik van Riel, Marcelo Tosatti
  Cc: Peter Zijlstra, Srivatsa Vaddagiri, linux-kernel

On Thu, 2 Dec 2010 14:44:23 -0500, Rik van Riel wrote:
> Add a yield_to function to the scheduler code, allowing us to
> give the remainder of our timeslice to another thread.
>
> We may want to use this to provide a sys_yield_to system call
> one day.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Hey all

The following work is based on what Rik posted, with a few changes.

[1] the added requeue_task() is replaced with resched_task().
[2] there is no longer a slice_remain() change in the scheduling class.
[3] the schedule_hrtimeout() in KVM still plays its role, though it
     would look nicer to factor the search for a target task out of
     kvm_vcpu_on_spin() into its own function.

The compensation of the yielded nanoseconds is not considered or
corrected in this work, for either lender or borrower, but that
does not look like a weakness, since the lock contention is
detected by the CPU, as Rik mentioned, and since both lender and
borrower are marked with PF_VCPU.
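
For reference, PF_VCPU is set and cleared around guest entry and
exit; roughly (from include/linux/kvm_host.h of this era):

	static inline void kvm_guest_enter(void)
	{
		account_system_vtime(current);
		current->flags |= PF_VCPU;	/* now running guest code */
	}

	static inline void kvm_guest_exit(void)
	{
		account_system_vtime(current);
		current->flags &= ~PF_VCPU;
	}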

What more should the scheduler consider for PF_VCPU tasks?

Cheers
Hillf
---

--- a/include/linux/sched.h	2010-11-01 19:54:12.000000000 +0800
+++ b/include/linux/sched.h	2011-01-02 18:09:38.000000000 +0800
@@ -1945,6 +1945,7 @@ static inline int rt_mutex_getprio(struc
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
+extern void yield_to(struct task_struct *, u64 *);
 extern int can_nice(const struct task_struct *p, const int nice);
 extern int task_curr(const struct task_struct *p);
 extern int idle_cpu(int cpu);
--- a/kernel/sched.c	2010-11-01 19:54:12.000000000 +0800
+++ b/kernel/sched.c	2011-01-02 18:14:20.000000000 +0800
@@ -5151,6 +5151,44 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t
 	return ret;
 }

+/**
+ * yield_to - give the remainder of our time slice to task @p
+ * @p: the task to hand the CPU to
+ * @nsecs: if non-NULL, reports back the nanoseconds yielded
+ *
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when @p holds a resource other threads are waiting for.
+ * Giving priority to @p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p, u64 *nsecs)
+{
+	unsigned long flags;
+	struct sched_entity *se = &p->se;
+	struct rq *rq;
+	struct cfs_rq *cfs_rq;
+	u64 vruntime;
+
+	rq = task_rq_lock(p, &flags);
+	if (task_running(rq, p) || task_has_rt_policy(p))
+		goto out;
+	cfs_rq = cfs_rq_of(se);
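+	/* move p to the head of its queue by granting it the min vruntime */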
+	vruntime = se->vruntime;
+	se->vruntime = cfs_rq->min_vruntime;
+	if (nsecs) {
+		if (vruntime > se->vruntime)
+			vruntime -= se->vruntime;
+		else
+			vruntime = 0;
+		*nsecs = vruntime;
+	}
+	/* preempt whatever is running on p's runqueue, so p can get in */
+	resched_task(rq->curr);
+ out:
+	task_rq_unlock(rq, &flags);
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /**
  * sys_sched_yield - yield the current processor to other threads.
  *
--- a/include/linux/kvm_host.h	2010-11-01 19:54:12.000000000 +0800
+++ b/include/linux/kvm_host.h	2011-01-02 17:43:26.000000000 +0800
@@ -91,6 +91,7 @@ struct kvm_vcpu {
 	int fpu_active;
 	int guest_fpu_loaded, guest_xcr0_loaded;
 	wait_queue_head_t wq;
+	int spinning;
 	int sigset_active;
 	sigset_t sigset;
 	struct kvm_vcpu_stat stat;
@@ -186,6 +187,7 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
+	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
--- a/virt/kvm/kvm_main.c	2010-11-01 19:54:12.000000000 +0800
+++ b/virt/kvm/kvm_main.c	2011-01-02 18:03:42.000000000 +0800
@@ -1289,18 +1289,70 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);

-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	ktime_t expires;
 	DEFINE_WAIT(wait);
+	u64 nsecs;
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	struct task_struct *task = NULL;
+	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int first_round = 1;
+	int i;

-	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+	me->spinning = 1;

+	/*
+	 * We boost the priority of a VCPU that is runnable but not
+	 * currently running, because it got preempted by something
+	 * else and called schedule in __vcpu_run.  Hopefully that
+	 * VCPU is holding the lock that we need and will release it.
+	 * We approximate round-robin by starting at the last boosted VCPU.
+	 */
+ again:
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (first_round && i < last_boosted_vcpu) {
+			i = last_boosted_vcpu;
+			continue;
+		} else if (!first_round && i > last_boosted_vcpu)
+			break;
+		if (vcpu == me)
+			continue;
+		/* Skip VCPUs that are themselves spinning. */
+		if (vcpu->spinning)
+			continue;
+		task = vcpu->task;
+		if (!task)
+			continue;
+		/* A VCPU waiting on its waitqueue is halted, not preempted. */
+		if (waitqueue_active(&vcpu->wq))
+			continue;
+		/* PF_VCPU means the task is in guest mode, i.e. running. */
+		if (task->flags & PF_VCPU)
+			continue;
+		kvm->last_boosted_vcpu = i;
+		goto yield;
+	}
+	if (first_round && last_boosted_vcpu == kvm->last_boosted_vcpu) {
+		/* We have not found anyone yet. */
+		first_round = 0;
+		goto again;
+	}
+	me->spinning = 0;
+	return;
+ yield:
+	nsecs = 0;
+	yield_to(task, &nsecs);
 	/* Sleep for 100 us, and hope lock-holder got scheduled */
-	expires = ktime_add_ns(ktime_get(), 100000UL);
+	if (nsecs < 100000)
+		nsecs = 100000;
+	prepare_to_wait(&me->wq, &wait, TASK_INTERRUPTIBLE);
+	expires = ktime_add_ns(ktime_get(), nsecs);
 	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
-
-	finish_wait(&vcpu->wq, &wait);
+	finish_wait(&me->wq, &wait);
+
+	me->spinning = 0;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
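
As a usage sketch of the exported interface (a hypothetical caller,
not part of this patch):

	u64 yielded = 0;

	/* hand the rest of our slice to 'target'; on return 'yielded'
	 * holds roughly the nanoseconds donated, 0 if target could
	 * not be boosted */
	yield_to(target, &yielded);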


Thread overview: 65+ messages
2010-12-02 19:41 [RFC PATCH 0/3] directed yield for Pause Loop Exiting Rik van Riel
2010-12-02 19:43 ` [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu Rik van Riel
2010-12-03  1:18   ` Chris Wright
2010-12-03 14:50     ` Rik van Riel
2010-12-03 15:55       ` Chris Wright
2010-12-05 12:40       ` Avi Kivity
2010-12-03 12:17   ` Srivatsa Vaddagiri
2010-12-03 14:16     ` Rik van Riel
2010-12-05 12:59       ` Avi Kivity
2010-12-02 19:44 ` [RFC PATCH 2/3] sched: add yield_to function Rik van Riel
2010-12-03  0:50   ` Chris Wright
2010-12-03 18:27     ` Rik van Riel
2010-12-03 19:30       ` Chris Wright
2010-12-03 21:30       ` Peter Zijlstra
2010-12-03  5:54   ` Mike Galbraith
2010-12-03 13:46     ` Srivatsa Vaddagiri
2010-12-03 14:45       ` Mike Galbraith
2010-12-03 14:48         ` Rik van Riel
2010-12-03 15:09           ` Mike Galbraith
2010-12-03 15:35             ` Rik van Riel
2010-12-03 16:20               ` Srivatsa Vaddagiri
2010-12-03 17:09                 ` Rik van Riel
2010-12-03 17:29                   ` Srivatsa Vaddagiri
2010-12-03 17:33                     ` Rik van Riel
2010-12-03 17:45                       ` Srivatsa Vaddagiri
2010-12-03 20:05               ` Mike Galbraith
2010-12-03 21:26             ` Peter Zijlstra
2010-12-03 13:23   ` Peter Zijlstra
2010-12-03 13:30     ` Srivatsa Vaddagiri
2010-12-03 14:03       ` Peter Zijlstra
2010-12-03 14:06         ` Srivatsa Vaddagiri
2010-12-03 14:10           ` Srivatsa Vaddagiri
2010-12-03 21:23             ` Peter Zijlstra
2010-12-04 13:02               ` Rik van Riel
2010-12-10  4:34           ` Rik van Riel
2010-12-10  8:39             ` Srivatsa Vaddagiri
2010-12-10 14:55               ` Rik van Riel
2010-12-08 17:55     ` Rik van Riel
2010-12-08 20:00       ` Peter Zijlstra
2010-12-08 20:04         ` Peter Zijlstra
2010-12-08 22:59         ` Rik van Riel
2010-12-02 19:45 ` [RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin Rik van Riel
2010-12-03  2:24   ` Chris Wright
2010-12-05 12:58     ` Avi Kivity
2010-12-05 12:56   ` Avi Kivity
2010-12-08 22:38     ` Rik van Riel
2010-12-09 10:28       ` Avi Kivity
2010-12-09 17:07         ` Rik van Riel
2010-12-11  7:27           ` Avi Kivity
2010-12-02 22:41 ` [RFC PATCH 0/3] directed yield for Pause Loop Exiting Chris Wright
2010-12-05 13:02   ` Avi Kivity
2010-12-10  5:03 ` Balbir Singh
2010-12-10 14:54   ` Rik van Riel
2010-12-11  7:31   ` Avi Kivity
2010-12-11 13:57     ` Balbir Singh
2010-12-13 11:57       ` Avi Kivity
2010-12-13 12:39         ` Balbir Singh
2010-12-13 12:42           ` Avi Kivity
2010-12-13 17:02       ` Rik van Riel
2010-12-14  9:25         ` Balbir Singh
  -- strict thread matches above, loose matches on Subject: below --
2011-01-02 11:43 [RFC PATCH 2/3] sched: add yield_to function Hillf Danton
2011-01-02 17:19 ` Rik van Riel
2011-01-03  4:18   ` Hillf Danton
2011-01-03  5:04     ` Rik van Riel
2011-01-04 12:44       ` Hillf Danton
