[RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
@ 2011-01-03 21:26 Rik van Riel
  2011-01-03 21:27 ` [RFC -v3 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu Rik van Riel
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-03 21:26 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got 
preempted, hopefully the lock holder.

v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some
  situations and has little downside potential
v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
  yet, since it is not entirely clear how to do that right
  (suggestions welcome)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC -v3 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
  2011-01-03 21:26 [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Rik van Riel
@ 2011-01-03 21:27 ` Rik van Riel
  2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-03 21:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

Keep track of which task is running a KVM vcpu.  This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single "run" of the vcpu.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
- move vcpu->task manipulation as suggested by Chris Wright

 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..180085b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
 	int vcpu_id;
 	struct mutex mutex;
+	struct task_struct *task;
 	int   cpu;
 	atomic_t guest_mode;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..c95bad1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2248,6 +2248,7 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+	vcpu->task = NULL;
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -2256,6 +2257,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+	vcpu->task = current;
 	kvm_arch_vcpu_put(vcpu);
 }
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-03 21:26 [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Rik van Riel
  2011-01-03 21:27 ` [RFC -v3 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu Rik van Riel
@ 2011-01-03 21:29 ` Rik van Riel
  2011-01-04  1:51   ` Mike Galbraith
                     ` (4 more replies)
  2011-01-03 21:30 ` [RFC -v3 PATCH 3/3] Subject: kvm: use yield_to instead of sleep in kvm_vcpu_on_spin Rik van Riel
                   ` (2 subsequent siblings)
  4 siblings, 5 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-03 21:29 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

From: Mike Galbraith <efault@gmx.de>

Add a yield_to function to the scheduler code, allowing us to
give enough of our timeslice to another thread to allow it to
run and release whatever resource we need it to release.

We may want to use this to provide a sys_yield_to system call
one day.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Not-signed-off-by: Mike Galbraith <efault@gmx.de>

--- 
Mike, want to change the above into a Signed-off-by: ? :)
This code seems to work well.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..0b8a3e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,6 +1083,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
@@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif
 
+extern void yield_to(struct task_struct *p, int preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ffa7a9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6901,6 +6901,53 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
+			!same_thread_group(p, curr) ||
+			!curr->sched_class->yield_to_task ||
+			curr->sched_class != p->sched_class) {
+		goto out;
+	}
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL(yield_to);
+
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..3288e7c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct sched_entity *curr = &(task_rq(p)->curr)->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int yield = this_rq() == task_rq(p);
+	int want_preempt = preempt;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (cfs_rq->tg != p_cfs_rq->tg)
+		return 0;
+
+	/* Preemption only allowed within the same task group. */
+	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
+		preempt = 0;
+#endif
+	/* Preemption only allowed within the same thread group. */
+	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
+		preempt = 0;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 */
+	if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
+			cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), smp_processor_id());
+		p_cfs_rq = cfs_rq_of(pse);
+		yield = 1;
+	}
+#endif
+
+	if (yield)
+		clear_buddies(cfs_rq, se);
+	else if (preempt)
+		clear_buddies(p_cfs_rq, curr);
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;
+
+	if (!yield && preempt)
+		resched_task(task_of(p_cfs_rq->curr));
+
+	return yield;
+}
+
+#ifdef CONFIG_SMP
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
@@ -2081,6 +2136,7 @@ static const struct sched_class fair_sched_class = {
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,
 
 	.check_preempt_curr	= check_preempt_wakeup,
 


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC -v3 PATCH 3/3] Subject: kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
  2011-01-03 21:26 [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Rik van Riel
  2011-01-03 21:27 ` [RFC -v3 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu Rik van Riel
  2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
@ 2011-01-03 21:30 ` Rik van Riel
  2011-01-04  6:42 ` [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Mike Galbraith
  2011-01-04  9:14 ` Avi Kivity
  4 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-03 21:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to hand
the rest of our timeslice to another vcpu in the same KVM guest.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c011ba3..ad3cb4a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -185,6 +185,7 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
+	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 21f816c..5822246 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1878,18 +1878,44 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-	ktime_t expires;
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
-	/* Sleep for 100 us, and hope lock-holder got scheduled */
-	expires = ktime_add_ns(ktime_get(), 100000UL);
-	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int yielded = 0;
+	int pass;
+	int i;
 
-	finish_wait(&vcpu->wq, &wait);
+	/*
+	 * We boost the priority of a VCPU that is runnable but not
+	 * currently running, because it got preempted by something
+	 * else and called schedule in __vcpu_run.  Hopefully that
+	 * VCPU is holding the lock that we need and will release it.
+	 * We approximate round-robin by starting at the last boosted VCPU.
+	 */
+	for (pass = 0; pass < 2 && !yielded; pass++) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			struct task_struct *task = vcpu->task;
+			if (!pass && i < last_boosted_vcpu) {
+				i = last_boosted_vcpu;
+				continue;
+			} else if (pass && i > last_boosted_vcpu)
+				break;
+			if (vcpu == me)
+				continue;
+			if (!task)
+				continue;
+			if (waitqueue_active(&vcpu->wq))
+				continue;
+			if (task->flags & PF_VCPU)
+				continue;
+			kvm->last_boosted_vcpu = i;
+			yielded = 1;
+			yield_to(task, 1);
+			break;
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
@ 2011-01-04  1:51   ` Mike Galbraith
  2011-01-04  6:14   ` KOSAKI Motohiro
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-01-04  1:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Chris Wright

On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:
> From: Mike Galbraith <efault@gmx.de>
> 
> Add a yield_to function to the scheduler code, allowing us to
> give enough of our timeslice to another thread to allow it to
> run and release whatever resource we need it to release.
> 
> We may want to use this to provide a sys_yield_to system call
> one day.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> 
> --- 
> Mike, want to change the above into a Signed-off-by: ? :)
> This code seems to work well.

Done.

	-Mike

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
  2011-01-04  1:51   ` Mike Galbraith
@ 2011-01-04  6:14   ` KOSAKI Motohiro
  2011-01-04 12:03     ` Avi Kivity
  2011-01-04 17:16     ` Peter Zijlstra
  2011-01-04 14:28   ` Hillf Danton
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 52+ messages in thread
From: KOSAKI Motohiro @ 2011-01-04  6:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, kvm, linux-kernel, Avi Kiviti,
	Srivatsa Vaddagiri, Peter Zijlstra, Mike Galbraith, Chris Wright

> From: Mike Galbraith <efault@gmx.de>
> 
> Add a yield_to function to the scheduler code, allowing us to
> give enough of our timeslice to another thread to allow it to
> run and release whatever resource we need it to release.
> 
> We may want to use this to provide a sys_yield_to system call
> one day.

At least I want. Ruby has GIL(giant interpreter lock). And giant lock
naturally enforce an app to implement cooperative multithreading model.
Therefore it has similar problem with your one. Python solved this issue
by slowing lock mechanism (two pthread-cond wakeup each GIL releasing), 
but I don't want it.

Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
avoid almost Nehalem (and other P2P cache arch) lock unfairness 
problem. (probaby creating pthread_condattr_setautoyield_np or similar
knob is good one)




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
  2011-01-03 21:26 [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Rik van Riel
                   ` (2 preceding siblings ...)
  2011-01-03 21:30 ` [RFC -v3 PATCH 3/3] Subject: kvm: use yield_to instead of sleep in kvm_vcpu_on_spin Rik van Riel
@ 2011-01-04  6:42 ` Mike Galbraith
  2011-01-04  9:09   ` Avi Kivity
  2011-01-04  9:14 ` Avi Kivity
  4 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-01-04  6:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Chris Wright

A couple questions.

On Mon, 2011-01-03 at 16:26 -0500, Rik van Riel wrote:
> When running SMP virtual machines, it is possible for one VCPU to be
> spinning on a spinlock, while the VCPU that holds the spinlock is not
> currently running, because the host scheduler preempted it to run
> something else.

Do you have any numbers?

If I were to, say, run a 256 CPU VM on my quad, would this help me get
more hackbench or whatever oomph from my (256X80386/20:) box?

> Both Intel and AMD CPUs have a feature that detects when a virtual
> CPU is spinning on a lock and will trap to the host.

Does an Intel Q6600 have this trap gizmo (iow will this do anything at
all for my little box if I were to try it out).

	-Mike

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
  2011-01-04  6:42 ` [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Mike Galbraith
@ 2011-01-04  9:09   ` Avi Kivity
  2011-01-04 10:32     ` Mike Galbraith
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-04  9:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Chris Wright

On 01/04/2011 08:42 AM, Mike Galbraith wrote:
> A couple questions.
>
> On Mon, 2011-01-03 at 16:26 -0500, Rik van Riel wrote:
> >  When running SMP virtual machines, it is possible for one VCPU to be
> >  spinning on a spinlock, while the VCPU that holds the spinlock is not
> >  currently running, because the host scheduler preempted it to run
> >  something else.
>
> Do you have any numbers?
>
> If I were to, say, run a 256 CPU VM on my quad, would this help me get
> more hackbench or whatever oomph from my (256X80386/20:) box?

First of all, you can't run 256 guests on x86 kvm.  Second, you'll never 
see better performance when you overcommit.  What this patchset does is 
reduce the degradation from utterly ridiculous to something manageable.  
This allows a host to deliver reasonable performance when overcommitted 
vcpus are actually used, but it's not a good idea to run 64 vcpus on a 
32 cpu host.

> >  Both Intel and AMD CPUs have a feature that detects when a virtual
> >  CPU is spinning on a lock and will trap to the host.
>
> Does an Intel Q6600 have this trap gizmo (iow will this do anything at
> all for my little box if I were to try it out).

Likely not.  Run 
http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=blob_plain;f=kvm/scripts/vmxcap;hb=HEAD, 
look for 'PAUSE-loop exiting'.  I think the first processors to include 
them were the Nehalem-EXs, and Westmeres have them as well.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
  2011-01-03 21:26 [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Rik van Riel
                   ` (3 preceding siblings ...)
  2011-01-04  6:42 ` [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Mike Galbraith
@ 2011-01-04  9:14 ` Avi Kivity
  2011-01-04 10:26   ` Mike Galbraith
  4 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-04  9:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

On 01/03/2011 11:26 PM, Rik van Riel wrote:
> When running SMP virtual machines, it is possible for one VCPU to be
> spinning on a spinlock, while the VCPU that holds the spinlock is not
> currently running, because the host scheduler preempted it to run
> something else.
>
> Both Intel and AMD CPUs have a feature that detects when a virtual
> CPU is spinning on a lock and will trap to the host.
>
> The current KVM code sleeps for a bit whenever that happens, which
> results in eg. a 64 VCPU Windows guest taking forever and a bit to
> boot up.  This is because the VCPU holding the lock is actually
> running and not sleeping, so the pause is counter-productive.
>
> In other workloads a pause can also be counter-productive, with
> spinlock detection resulting in one guest giving up its CPU time
> to the others.  Instead of spinning, it ends up simply not running
> much at all.
>
> This patch series aims to fix that, by having a VCPU that spins
> give the remainder of its timeslice to another VCPU in the same
> guest before yielding the CPU - one that is runnable but got
> preempted, hopefully the lock holder.
>

Looks great.

Assuming there are no objections, Mike, can you get 2/3 into a 
fast-forward-only branch of tip.git?  I'll then merge it into kvm.git 
and apply 1/3 and 2/3.

Since the merge window is about to open I'd like to merge this for 
2.6.39 so it gets to stew a while in tip.git, kvm.git, and linux-next.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
  2011-01-04  9:14 ` Avi Kivity
@ 2011-01-04 10:26   ` Mike Galbraith
  2011-01-04 17:20     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-01-04 10:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Chris Wright

On Tue, 2011-01-04 at 11:14 +0200, Avi Kivity wrote:

> Assuming there are no objections, Mike, can you get 2/3 into a 
> fast-forward-only branch of tip.git?  I'll then merge it into kvm.git 
> and apply 1/3 and 2/3.

Wrong guy.  Fast-forward is Peter's department.

	-Mike

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
  2011-01-04  9:09   ` Avi Kivity
@ 2011-01-04 10:32     ` Mike Galbraith
  2011-01-04 10:35       ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-01-04 10:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Chris Wright

On Tue, 2011-01-04 at 11:09 +0200, Avi Kivity wrote:
> On 01/04/2011 08:42 AM, Mike Galbraith wrote:

> > If I were to, say, run a 256 CPU VM on my quad, would this help me get
> > more hackbench or whatever oomph from my (256X80386/20:) box?
> 
> First of all, you can't run 256 guests on x86 kvm.

I didn't want 256 guests. Just one great big dog slow virtual playpen :)

> > >  Both Intel and AMD CPUs have a feature that detects when a virtual
> > >  CPU is spinning on a lock and will trap to the host.
> >
> > Does an Intel Q6600 have this trap gizmo (iow will this do anything at
> > all for my little box if I were to try it out).
> 
> Likely not.  Run 
> http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=blob_plain;f=kvm/scripts/vmxcap;hb=HEAD, 
> look for 'PAUSE-loop exiting'.  I think the first processors to include 
> them were the Nehalem-EXs, and Westmeres have them as well.

Oh darn, need new box to get all the toys.  Thanks.

	-Mike

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
  2011-01-04 10:32     ` Mike Galbraith
@ 2011-01-04 10:35       ` Avi Kivity
  0 siblings, 0 replies; 52+ messages in thread
From: Avi Kivity @ 2011-01-04 10:35 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Chris Wright

On 01/04/2011 12:32 PM, Mike Galbraith wrote:
> On Tue, 2011-01-04 at 11:09 +0200, Avi Kivity wrote:
> >  On 01/04/2011 08:42 AM, Mike Galbraith wrote:
>
> >  >  If I were to, say, run a 256 CPU VM on my quad, would this help me get
> >  >  more hackbench or whatever oomph from my (256X80386/20:) box?
> >
> >  First of all, you can't run 256 guests on x86 kvm.
>
> I didn't want 256 guests. Just one great big dog slow virtual playpen :)

I meant a 256 vcpu guest.  Certainly you can run 256 guests, you're 
limited only by virtual memory.  Current kvm limit is 64 vcpus.

> >  >  >   Both Intel and AMD CPUs have a feature that detects when a virtual
> >  >  >   CPU is spinning on a lock and will trap to the host.
> >  >
> >  >  Does an Intel Q6600 have this trap gizmo (iow will this do anything at
> >  >  all for my little box if I were to try it out).
> >
> >  Likely not.  Run
> >  http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=blob_plain;f=kvm/scripts/vmxcap;hb=HEAD,
> >  look for 'PAUSE-loop exiting'.  I think the first processors to include
> >  them were the Nehalem-EXs, and Westmeres have them as well.
>
> Oh darn, need new box to get all the toys.  Thanks.
>

I have a patchset somewhere that emulates pause-loop exiting by looking 
at rip during timer interrupts.  Unfortunately timer interrupts are very 
rare compared to pause-loop exits, so it's not very helpful.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04  6:14   ` KOSAKI Motohiro
@ 2011-01-04 12:03     ` Avi Kivity
  2011-01-05  2:39       ` KOSAKI Motohiro
  2011-01-04 17:16     ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-04 12:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Chris Wright

On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> avoid almost Nehalem (and other P2P cache arch) lock unfairness
> problem. (probaby creating pthread_condattr_setautoyield_np or similar
> knob is good one)

Often, the thread calling pthread_cond_signal() wants to continue 
executing, not yield.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
  2011-01-04  1:51   ` Mike Galbraith
  2011-01-04  6:14   ` KOSAKI Motohiro
@ 2011-01-04 14:28   ` Hillf Danton
  2011-01-04 16:41   ` Hillf Danton
  2011-01-04 18:04   ` Peter Zijlstra
  4 siblings, 0 replies; 52+ messages in thread
From: Hillf Danton @ 2011-01-04 14:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel <riel@redhat.com> wrote:
> From: Mike Galbraith <efault@gmx.de>
>
> Add a yield_to function to the scheduler code, allowing us to
> give enough of our timeslice to another thread to allow it to
> run and release whatever resource we need it to release.
>
> We may want to use this to provide a sys_yield_to system call
> one day.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Not-signed-off-by: Mike Galbraith <efault@gmx.de>
>
> ---
> Mike, want to change the above into a Signed-off-by: ? :)
> This code seems to work well.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c5f926c..0b8a3e6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1083,6 +1083,7 @@ struct sched_class {
>        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
>        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
>        void (*yield_task) (struct rq *rq);
> +       int (*yield_to_task) (struct task_struct *p, int preempt);
>
>        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>
> @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
>  # define rt_mutex_adjust_pi(p)         do { } while (0)
>  #endif
>
> +extern void yield_to(struct task_struct *p, int preempt);
>  extern void set_user_nice(struct task_struct *p, long nice);
>  extern int task_prio(const struct task_struct *p);
>  extern int task_nice(const struct task_struct *p);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index f8e5a25..ffa7a9d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6901,6 +6901,53 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>
> +/**
> + * yield_to - yield the current processor to another thread in
> + * your thread group, or accelerate that thread toward the
> + * processor it's on.
> + *
> + * It's the caller's job to ensure that the target task struct
> + * can't go away on us before we can do any checks.
> + */
> +void __sched yield_to(struct task_struct *p, int preempt)
> +{
> +       struct task_struct *curr = current;
> +       struct rq *rq, *p_rq;
> +       unsigned long flags;
> +       int yield = 0;
> +
> +       local_irq_save(flags);
> +       rq = this_rq();
> +
> +again:
> +       p_rq = task_rq(p);
> +       double_rq_lock(rq, p_rq);
> +       while (task_rq(p) != p_rq) {
> +               double_rq_unlock(rq, p_rq);
> +               goto again;
> +       }
> +
> +       if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
> +                       !same_thread_group(p, curr) ||
> +                       !curr->sched_class->yield_to_task ||
> +                       curr->sched_class != p->sched_class) {
> +               goto out;
> +       }
> +
> +       yield = curr->sched_class->yield_to_task(p, preempt);
> +
> +out:
> +       double_rq_unlock(rq, p_rq);
> +       local_irq_restore(flags);
> +
> +       if (yield) {
> +               set_current_state(TASK_RUNNING);
> +               schedule();
> +       }
> +}
> +EXPORT_SYMBOL(yield_to);
> +
> +
>  /*
>  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
>  * that process accounting knows that this is a task in IO wait state.
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5119b08..3288e7c 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
>  }
>
>  #ifdef CONFIG_SMP
> +static void pull_task(struct rq *src_rq, struct task_struct *p,
> +                     struct rq *this_rq, int this_cpu);
> +#endif
> +
> +static int yield_to_task_fair(struct task_struct *p, int preempt)
> +{
> +       struct sched_entity *se = &current->se;
> +       struct sched_entity *pse = &p->se;
> +       struct sched_entity *curr = &(task_rq(p)->curr)->se;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
> +       int yield = this_rq() == task_rq(p);
> +       int want_preempt = preempt;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       if (cfs_rq->tg != p_cfs_rq->tg)
> +               return 0;
> +
> +       /* Preemption only allowed within the same task group. */
> +       if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
> +               preempt = 0;
> +#endif
> +       /* Preemption only allowed within the same thread group. */
> +       if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
> +               preempt = 0;
> +
> +#ifdef CONFIG_SMP
> +       /*
> +        * If this yield is important enough to want to preempt instead
> +        * of only dropping a ->next hint, we're alone, and the target
> +        * is not alone, pull the target to this cpu.
> +        */
> +       if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
> +                       cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
> +               pull_task(task_rq(p), p, this_rq(), smp_processor_id());
> +               p_cfs_rq = cfs_rq_of(pse);
> +               yield = 1;
> +       }
> +#endif
> +
> +       if (yield)
> +               clear_buddies(cfs_rq, se);
> +       else if (preempt)
> +               clear_buddies(p_cfs_rq, curr);
> +
> +       /* Tell the scheduler that we'd really like pse to run next. */
> +       p_cfs_rq->next = pse;

If not pulled and this_rq() != task_rq(p), only assigning ->next could
kick p onto its CPU?

If not, how is the lock contention eased then?

A few words to explain please.

thanks
Hillf

> +
> +       if (!yield && preempt)
> +               resched_task(task_of(p_cfs_rq->curr));
> +
> +       return yield;
> +}
> +
> +#ifdef CONFIG_SMP
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> @@ -2081,6 +2136,7 @@ static const struct sched_class fair_sched_class = {
>        .enqueue_task           = enqueue_task_fair,
>        .dequeue_task           = dequeue_task_fair,
>        .yield_task             = yield_task_fair,
> +       .yield_to_task          = yield_to_task_fair,
>
>        .check_preempt_curr     = check_preempt_wakeup,
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
                     ` (2 preceding siblings ...)
  2011-01-04 14:28   ` Hillf Danton
@ 2011-01-04 16:41   ` Hillf Danton
  2011-01-04 16:44     ` Rik van Riel
  2011-01-04 18:04   ` Peter Zijlstra
  4 siblings, 1 reply; 52+ messages in thread
From: Hillf Danton @ 2011-01-04 16:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel <riel@redhat.com> wrote:
> From: Mike Galbraith <efault@gmx.de>
>
> Add a yield_to function to the scheduler code, allowing us to
> give enough of our timeslice to another thread to allow it to
> run and release whatever resource we need it to release.
>
> We may want to use this to provide a sys_yield_to system call
> one day.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Not-signed-off-by: Mike Galbraith <efault@gmx.de>
>
> ---
> Mike, want to change the above into a Signed-off-by: ? :)
> This code seems to work well.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c5f926c..0b8a3e6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1083,6 +1083,7 @@ struct sched_class {
>        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
>        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
>        void (*yield_task) (struct rq *rq);
> +       int (*yield_to_task) (struct task_struct *p, int preempt);
>
>        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>
> @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
>  # define rt_mutex_adjust_pi(p)         do { } while (0)
>  #endif
>
> +extern void yield_to(struct task_struct *p, int preempt);
>  extern void set_user_nice(struct task_struct *p, long nice);
>  extern int task_prio(const struct task_struct *p);
>  extern int task_nice(const struct task_struct *p);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index f8e5a25..ffa7a9d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6901,6 +6901,53 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>
> +/**
> + * yield_to - yield the current processor to another thread in
> + * your thread group, or accelerate that thread toward the
> + * processor it's on.
> + *
> + * It's the caller's job to ensure that the target task struct
> + * can't go away on us before we can do any checks.
> + */
> +void __sched yield_to(struct task_struct *p, int preempt)
> +{
> +       struct task_struct *curr = current;

            struct task_struct *next;

> +       struct rq *rq, *p_rq;
> +       unsigned long flags;
> +       int yield = 0;
> +
> +       local_irq_save(flags);
> +       rq = this_rq();
> +
> +again:
> +       p_rq = task_rq(p);
> +       double_rq_lock(rq, p_rq);
> +       while (task_rq(p) != p_rq) {
> +               double_rq_unlock(rq, p_rq);
> +               goto again;
> +       }
> +
> +       if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
> +                       !same_thread_group(p, curr) ||

/*                       !curr->sched_class->yield_to_task ||        */

> +                       curr->sched_class != p->sched_class) {
> +               goto out;
> +       }
> +
	/*
         * ask scheduler to compute the next for successfully kicking
@p onto its CPU
         * what if p_rq is rt_class to do?
         */
	next = pick_next_task(p_rq);
	if (next != p)
		p->se.vruntime = next->se.vruntime -1;
	deactivate_task(p_rq, p, 0);
	activate_task(p_rq, p, 0);
	if (rq == p_rq)
		schedule();
	else
		resched_task(p_rq->curr);
	yield = 0;

/*       yield = curr->sched_class->yield_to_task(p, preempt); */

> +
> +out:
> +       double_rq_unlock(rq, p_rq);
> +       local_irq_restore(flags);
> +
> +       if (yield) {
> +               set_current_state(TASK_RUNNING);
> +               schedule();
> +       }
> +}
> +EXPORT_SYMBOL(yield_to);
> +
> +
>  /*
>  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
>  * that process accounting knows that this is a task in IO wait state.
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5119b08..3288e7c 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
>  }
>
>  #ifdef CONFIG_SMP
> +static void pull_task(struct rq *src_rq, struct task_struct *p,
> +                     struct rq *this_rq, int this_cpu);
> +#endif
> +
> +static int yield_to_task_fair(struct task_struct *p, int preempt)
> +{
> +       struct sched_entity *se = &current->se;
> +       struct sched_entity *pse = &p->se;
> +       struct sched_entity *curr = &(task_rq(p)->curr)->se;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
> +       int yield = this_rq() == task_rq(p);
> +       int want_preempt = preempt;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       if (cfs_rq->tg != p_cfs_rq->tg)
> +               return 0;
> +
> +       /* Preemption only allowed within the same task group. */
> +       if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
> +               preempt = 0;
> +#endif
> +       /* Preemption only allowed within the same thread group. */
> +       if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
> +               preempt = 0;
> +
> +#ifdef CONFIG_SMP
> +       /*
> +        * If this yield is important enough to want to preempt instead
> +        * of only dropping a ->next hint, we're alone, and the target
> +        * is not alone, pull the target to this cpu.
> +        */
> +       if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
> +                       cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
> +               pull_task(task_rq(p), p, this_rq(), smp_processor_id());
> +               p_cfs_rq = cfs_rq_of(pse);
> +               yield = 1;
> +       }
> +#endif
> +
> +       if (yield)
> +               clear_buddies(cfs_rq, se);
> +       else if (preempt)
> +               clear_buddies(p_cfs_rq, curr);
> +
> +       /* Tell the scheduler that we'd really like pse to run next. */
> +       p_cfs_rq->next = pse;
> +
> +       if (!yield && preempt)
> +               resched_task(task_of(p_cfs_rq->curr));
> +
> +       return yield;
> +}
> +
> +#ifdef CONFIG_SMP
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> @@ -2081,6 +2136,7 @@ static const struct sched_class fair_sched_class = {
>        .enqueue_task           = enqueue_task_fair,
>        .dequeue_task           = dequeue_task_fair,
>        .yield_task             = yield_task_fair,
> +       .yield_to_task          = yield_to_task_fair,
>
>        .check_preempt_curr     = check_preempt_wakeup,
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 16:41   ` Hillf Danton
@ 2011-01-04 16:44     ` Rik van Riel
  2011-01-04 16:51       ` Hillf Danton
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-04 16:44 UTC (permalink / raw)
  To: Hillf Danton
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

On 01/04/2011 11:41 AM, Hillf Danton wrote:

> /*                       !curr->sched_class->yield_to_task ||        */
>
>> +                       curr->sched_class != p->sched_class) {
>> +               goto out;
>> +       }
>> +
> 	/*
>           * ask scheduler to compute the next for successfully kicking
> @p onto its CPU
>           * what if p_rq is rt_class to do?
>           */
> 	next = pick_next_task(p_rq);
> 	if (next != p)
> 		p->se.vruntime = next->se.vruntime -1;
> 	deactivate_task(p_rq, p, 0);
> 	activate_task(p_rq, p, 0);
> 	if (rq == p_rq)
> 		schedule();
> 	else
> 		resched_task(p_rq->curr);
> 	yield = 0;

Wouldn't that break for FIFO and RR tasks?

There's a reason all the scheduler folks wanted a
per-class yield_to_task function :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 16:44     ` Rik van Riel
@ 2011-01-04 16:51       ` Hillf Danton
  2011-01-04 16:54         ` Rik van Riel
  2011-01-04 17:08         ` Peter Zijlstra
  0 siblings, 2 replies; 52+ messages in thread
From: Hillf Danton @ 2011-01-04 16:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

On Wed, Jan 5, 2011 at 12:44 AM, Rik van Riel <riel@redhat.com> wrote:
> On 01/04/2011 11:41 AM, Hillf Danton wrote:
>
>> /*                       !curr->sched_class->yield_to_task ||        */
>>
>>> +                       curr->sched_class != p->sched_class) {
>>> +               goto out;
>>> +       }
>>> +
>>
>>        /*
>>          * ask scheduler to compute the next for successfully kicking
>> @p onto its CPU
>>          * what if p_rq is rt_class to do?
>>          */
>>        next = pick_next_task(p_rq);
>>        if (next != p)
>>                p->se.vruntime = next->se.vruntime -1;
>>        deactivate_task(p_rq, p, 0);
>>        activate_task(p_rq, p, 0);
>>        if (rq == p_rq)
>>                schedule();
>>        else
>>                resched_task(p_rq->curr);
>>        yield = 0;
>
> Wouldn't that break for FIFO and RR tasks?
>
> There's a reason all the scheduler folks wanted a
> per-class yield_to_task function :)
>

Where is the yield_to callback in the patch for RT schedule class?
If @p is RT, what could you do?

Hillf

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 16:51       ` Hillf Danton
@ 2011-01-04 16:54         ` Rik van Riel
  2011-01-04 17:02           ` Hillf Danton
  2011-01-04 17:08         ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-04 16:54 UTC (permalink / raw)
  To: Hillf Danton
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

On 01/04/2011 11:51 AM, Hillf Danton wrote:

>> Wouldn't that break for FIFO and RR tasks?
>>
>> There's a reason all the scheduler folks wanted a
>> per-class yield_to_task function :)
>>
>
> Where is the yield_to callback in the patch for RT schedule class?
> If @p is RT, what could you do?

If the user chooses to overcommit the CPU with realtime
tasks, the user cannot expect realtime response.

For realtime, I have not implemented the yield_to callback
at all because it would probably break realtime semantics
and I assume people will not overcommit the CPU with realtime
tasks anyway.

I could see running a few realtime guests on a system, with
the number of realtime VCPUs not exceeding the number of
physical CPUs.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 16:54         ` Rik van Riel
@ 2011-01-04 17:02           ` Hillf Danton
  0 siblings, 0 replies; 52+ messages in thread
From: Hillf Danton @ 2011-01-04 17:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Chris Wright

On Wed, Jan 5, 2011 at 12:54 AM, Rik van Riel <riel@redhat.com> wrote:
> On 01/04/2011 11:51 AM, Hillf Danton wrote:
>
>>> Wouldn't that break for FIFO and RR tasks?
>>>
>>> There's a reason all the scheduler folks wanted a
>>> per-class yield_to_task function :)
>>>
>>
>> Where is the yield_to callback in the patch for RT schedule class?
>> If @p is RT, what could you do?
>
> If the user chooses to overcommit the CPU with realtime
> tasks, the user cannot expect realtime response.
>
> For realtime, I have not implemented the yield_to callback
> at all because it would probably break realtime semantics
> and I assume people will not overcommit the CPU with realtime
> tasks anyway.
>
> I could see running a few realtime guests on a system, with
> the number of realtime VCPUs not exceeding the number of
> physical CPUs.
>
Then it looks curr->sched_class != p->sched_class is not enough,
and yield_to can not ease the lock contention in KVM in case where
p->rq->curr is RT.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 16:51       ` Hillf Danton
  2011-01-04 16:54         ` Rik van Riel
@ 2011-01-04 17:08         ` Peter Zijlstra
  2011-01-04 17:12           ` Hillf Danton
  2011-01-04 17:53           ` Rik van Riel
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-04 17:08 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:
> Where is the yield_to callback in the patch for RT schedule class?
> If @p is RT, what could you do? 

RT guests are a pipe dream, you first need to get the hypervisor (kvm in
this case) to be RT, which it isn't. Then you either need to very
statically set-up the host and the guest scheduling constraints (not
possible with RR/FIFO) or have a complete paravirt RT scheduler which
communicates its requirements to the host.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 17:08         ` Peter Zijlstra
@ 2011-01-04 17:12           ` Hillf Danton
  2011-01-04 17:22             ` Peter Zijlstra
  2011-01-04 17:53           ` Rik van Riel
  1 sibling, 1 reply; 52+ messages in thread
From: Hillf Danton @ 2011-01-04 17:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Wed, Jan 5, 2011 at 1:08 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:
>> Where is the yield_to callback in the patch for RT schedule class?
>> If @p is RT, what could you do?
>
> RT guests are a pipe dream, you first need to get the hypervisor (kvm in
> this case) to be RT, which it isn't. Then you either need to very
> statically set-up the host and the guest scheduling constraints (not
> possible with RR/FIFO) or have a complete paravirt RT scheduler which
> communicates its requirements to the host.
>
Even guest is not RT, you could not prevent it from being preempted by
RT task which has nothing to do guests.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04  6:14   ` KOSAKI Motohiro
  2011-01-04 12:03     ` Avi Kivity
@ 2011-01-04 17:16     ` Peter Zijlstra
  2011-01-05  3:17       ` KOSAKI Motohiro
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-04 17:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Tue, 2011-01-04 at 15:14 +0900, KOSAKI Motohiro wrote:
> > From: Mike Galbraith <efault@gmx.de>
> > 
> > Add a yield_to function to the scheduler code, allowing us to
> > give enough of our timeslice to another thread to allow it to
> > run and release whatever resource we need it to release.
> > 
> > We may want to use this to provide a sys_yield_to system call
> > one day.
> 
> At least I want. Ruby has GIL(giant interpreter lock). And giant lock
> naturally enforce an app to implement cooperative multithreading model.
> Therefore it has similar problem with your one. Python solved this issue
> by slowing lock mechanism (two pthread-cond wakeup each GIL releasing), 
> but I don't want it.
> 
> Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> avoid almost Nehalem (and other P2P cache arch) lock unfairness 
> problem. (probaby creating pthread_condattr_setautoyield_np or similar
> knob is good one)

NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it
is because its wants to be utter crap (run unmodified guests).

There is plenty of sane serialization primitives for userspace, fix your
locking mess instead of pushing crap.

The only reason I'm maybe half-way considering this is because its a
pure in-kernel interface which we can 'fix' once unmodified guests
aren't important anymore.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting
  2011-01-04 10:26   ` Mike Galbraith
@ 2011-01-04 17:20     ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-04 17:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Avi Kivity, Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Chris Wright

On Tue, 2011-01-04 at 11:26 +0100, Mike Galbraith wrote:
> On Tue, 2011-01-04 at 11:14 +0200, Avi Kivity wrote:
> 
> > Assuming there are no objections, Mike, can you get 2/3 into a 
> > fast-forward-only branch of tip.git?  I'll then merge it into kvm.git 
> > and apply 1/3 and 2/3.
> 
> Wrong guy.  Fast-forward is Peter's department.

Right, let me have a look, I put this thread on pause for a while, need
to catch up.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 17:12           ` Hillf Danton
@ 2011-01-04 17:22             ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-04 17:22 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Wed, 2011-01-05 at 01:12 +0800, Hillf Danton wrote:
> On Wed, Jan 5, 2011 at 1:08 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:
> >> Where is the yield_to callback in the patch for RT schedule class?
> >> If @p is RT, what could you do?
> >
> > RT guests are a pipe dream, you first need to get the hypervisor (kvm in
> > this case) to be RT, which it isn't. Then you either need to very
> > statically set-up the host and the guest scheduling constraints (not
> > possible with RR/FIFO) or have a complete paravirt RT scheduler which
> > communicates its requirements to the host.
> >
> Even guest is not RT, you could not prevent it from being preempted by
> RT task which has nothing to do guests.

Sure, but yield_to() being a NOP for RT tasks is perfectly fine. Pretty
much all yield*() usage is a bug anyway.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 17:08         ` Peter Zijlstra
  2011-01-04 17:12           ` Hillf Danton
@ 2011-01-04 17:53           ` Rik van Riel
  2011-01-04 18:05             ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-04 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On 01/04/2011 12:08 PM, Peter Zijlstra wrote:
> On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:
>> Where is the yield_to callback in the patch for RT schedule class?
>> If @p is RT, what could you do?
>
> RT guests are a pipe dream, you first need to get the hypervisor (kvm in
> this case) to be RT, which it isn't. Then you either need to very
> statically set-up the host and the guest scheduling constraints (not
> possible with RR/FIFO) or have a complete paravirt RT scheduler which
> communicates its requirements to the host.

There's a limited use case.

One host can have a few RT guests, say a host with 8 CPUs
can have up to 6 or 7 RT VCPUs.  Those guests get top
priority.

The host can then have some other, low priority, guests
that scavenge remaining CPU time.

In this case, no yield_to is required for RT guests,
because they do not do overcommit.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
                     ` (3 preceding siblings ...)
  2011-01-04 16:41   ` Hillf Danton
@ 2011-01-04 18:04   ` Peter Zijlstra
  2011-01-04 18:53     ` Mike Galbraith
                       ` (2 more replies)
  4 siblings, 3 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-04 18:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri, Mike Galbraith,
	Chris Wright

On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:
> From: Mike Galbraith <efault@gmx.de>
> 
> Add a yield_to function to the scheduler code, allowing us to
> give enough of our timeslice to another thread to allow it to
> run and release whatever resource we need it to release.
> 
> We may want to use this to provide a sys_yield_to system call
> one day.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Not-signed-off-by: Mike Galbraith <efault@gmx.de>
> 
> --- 
> Mike, want to change the above into a Signed-off-by: ? :)
> This code seems to work well.
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c5f926c..0b8a3e6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1083,6 +1083,7 @@ struct sched_class {
>  	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
>  	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
>  	void (*yield_task) (struct rq *rq);
> +	int (*yield_to_task) (struct task_struct *p, int preempt);
>  
>  	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>  
> @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
>  # define rt_mutex_adjust_pi(p)		do { } while (0)
>  #endif
>  
> +extern void yield_to(struct task_struct *p, int preempt);
>  extern void set_user_nice(struct task_struct *p, long nice);
>  extern int task_prio(const struct task_struct *p);
>  extern int task_nice(const struct task_struct *p);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index f8e5a25..ffa7a9d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6901,6 +6901,53 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>  
> +/**
> + * yield_to - yield the current processor to another thread in
> + * your thread group, or accelerate that thread toward the
> + * processor it's on.
> + *
> + * It's the caller's job to ensure that the target task struct
> + * can't go away on us before we can do any checks.
> + */
> +void __sched yield_to(struct task_struct *p, int preempt)
> +{
> +	struct task_struct *curr = current;
> +	struct rq *rq, *p_rq;
> +	unsigned long flags;
> +	int yield = 0;
> +
> +	local_irq_save(flags);
> +	rq = this_rq();
> +
> +again:
> +	p_rq = task_rq(p);
> +	double_rq_lock(rq, p_rq);
> +	while (task_rq(p) != p_rq) {
> +		double_rq_unlock(rq, p_rq);
> +		goto again;
> +	}
> +
> +	if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
> +			!same_thread_group(p, curr) ||
> +			!curr->sched_class->yield_to_task ||
> +			curr->sched_class != p->sched_class) {
> +		goto out;
> +	}
> +
> +	yield = curr->sched_class->yield_to_task(p, preempt);
> +
> +out:
> +	double_rq_unlock(rq, p_rq);
> +	local_irq_restore(flags);
> +
> +	if (yield) {
> +		set_current_state(TASK_RUNNING);
> +		schedule();
> +	}
> +}
> +EXPORT_SYMBOL(yield_to);

This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
I'd make it so only kvm.o could use it. It really sucks that kvm is a
module.

>  /*
>   * This task is about to go to sleep on IO. Increment rq->nr_iowait so
>   * that process accounting knows that this is a task in IO wait state.
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5119b08..3288e7c 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
>  }
>  
>  #ifdef CONFIG_SMP
> +static void pull_task(struct rq *src_rq, struct task_struct *p,
> +		      struct rq *this_rq, int this_cpu);
> +#endif
> +
> +static int yield_to_task_fair(struct task_struct *p, int preempt)
> +{
> +	struct sched_entity *se = &current->se;
> +	struct sched_entity *pse = &p->se;
> +	struct sched_entity *curr = &(task_rq(p)->curr)->se;
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
> +	int yield = this_rq() == task_rq(p);
> +	int want_preempt = preempt;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +	if (cfs_rq->tg != p_cfs_rq->tg)
> +		return 0;
> +
> +	/* Preemption only allowed within the same task group. */
> +	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
> +		preempt = 0;
> +#endif

I'd simply bail if its not the same cgroup, who cares about that case
anyway, all KVM vcpu threads should be in the same cgroup I think.

> +	/* Preemption only allowed within the same thread group. */
> +	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
> +		preempt = 0;

The calling site already checks for same_thread_group(), we never even
get here if that's not the case.

> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this yield is important enough to want to preempt instead
> +	 * of only dropping a ->next hint, we're alone, and the target
> +	 * is not alone, pull the target to this cpu.
> +	 */
> +	if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
> +			cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
> +		pull_task(task_rq(p), p, this_rq(), smp_processor_id());

This only works by the grace that the caller checked p->se.on_rq. A
comment might be in order.

> +		p_cfs_rq = cfs_rq_of(pse);
> +		yield = 1;
> +	}
> +#endif
> +
> +	if (yield)
> +		clear_buddies(cfs_rq, se);
> +	else if (preempt)
> +		clear_buddies(p_cfs_rq, curr);
> +
> +	/* Tell the scheduler that we'd really like pse to run next. */
> +	p_cfs_rq->next = pse;
> +
> +	if (!yield && preempt)
> +		resched_task(task_of(p_cfs_rq->curr));

I don't get this.. Why would you resched the remote cpu, surely you
didn't just pull its current task over..

> +	return yield;
> +}
> +
> +#ifdef CONFIG_SMP
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> @@ -2081,6 +2136,7 @@ static const struct sched_class fair_sched_class = {
>  	.enqueue_task		= enqueue_task_fair,
>  	.dequeue_task		= dequeue_task_fair,
>  	.yield_task		= yield_task_fair,
> +	.yield_to_task		= yield_to_task_fair,
>  
>  	.check_preempt_curr	= check_preempt_wakeup,
>  
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 17:53           ` Rik van Riel
@ 2011-01-04 18:05             ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-04 18:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hillf Danton, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Tue, 2011-01-04 at 12:53 -0500, Rik van Riel wrote:
> One host can have a few RT guests, say a host with 8 CPUs
> can have up to 6 or 7 RT VCPUs.  Those guests get top
> priority. 

RT guests don't make sense, there's nowhere near enough infrastructure
for that to work well.

I'd argue that KVM running with RT priority is a bug.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 18:04   ` Peter Zijlstra
@ 2011-01-04 18:53     ` Mike Galbraith
  2011-01-05 16:57     ` Mike Galbraith
  2011-01-05 17:10     ` Avi Kivity
  2 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-01-04 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

On Tue, 2011-01-04 at 19:04 +0100, Peter Zijlstra wrote:

> > +		p_cfs_rq = cfs_rq_of(pse);
> > +		yield = 1;
> > +	}
> > +#endif
> > +
> > +	if (yield)
> > +		clear_buddies(cfs_rq, se);
> > +	else if (preempt)
> > +		clear_buddies(p_cfs_rq, curr);
> > +
> > +	/* Tell the scheduler that we'd really like pse to run next. */
> > +	p_cfs_rq->next = pse;
> > +
> > +	if (!yield && preempt)
> > +		resched_task(task_of(p_cfs_rq->curr));
> 
> I don't get this.. Why would you resched the remote cpu, surely you
> didn't just pull its current task over..

:) hope not.   Caller bails on task_running(p_rq, p).  wants a comment
too I suppose.

Target doesn't live here, preempt is still set/allowed, so we want the
remote cpu to schedule.

	-Mike

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 12:03     ` Avi Kivity
@ 2011-01-05  2:39       ` KOSAKI Motohiro
  2011-01-05  8:35         ` Avi Kivity
  2011-01-05 10:04         ` Peter Zijlstra
  0 siblings, 2 replies; 52+ messages in thread
From: KOSAKI Motohiro @ 2011-01-05  2:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kosaki.motohiro, Rik van Riel, kvm, linux-kernel,
	Srivatsa Vaddagiri, Peter Zijlstra, Mike Galbraith, Chris Wright

> On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> > Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> > avoid almost Nehalem (and other P2P cache arch) lock unfairness
> > problem. (probaby creating pthread_condattr_setautoyield_np or similar
> > knob is good one)
> 
> Often, the thread calling pthread_cond_signal() wants to continue 
> executing, not yield.

Then, it doesn't work.

After calling pthread_cond_signal(), T1 which cond_signal caller and T2
which waked start to GIL grab race. But usually T1 is always win because
lock variable is in T1's cpu cache. Why kernel and userland have so much
different result? One of a reason is glibc doesn't have any ticket lock scheme. 

If you are interesting GIL mess and issue, please feel free to ask more.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 17:16     ` Peter Zijlstra
@ 2011-01-05  3:17       ` KOSAKI Motohiro
  0 siblings, 0 replies; 52+ messages in thread
From: KOSAKI Motohiro @ 2011-01-05  3:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Rik van Riel, kvm, linux-kernel, Avi Kiviti,
	Srivatsa Vaddagiri, Mike Galbraith, Chris Wright

> NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it
> is because its wants to be utter crap (run unmodified guests).
> 
> There is plenty of sane serialization primitives for userspace, fix your
> locking mess instead of pushing crap.
> 
> The only reason I'm maybe half-way considering this is because its a
> pure in-kernel interface which we can 'fix' once unmodified guests
> aren't important anymore.

KVM also have slow and inefficient mecanism. Userspace have a lot of
slow primitives too.

But okey, I'm withdraw this claim even though I disagree userspace have
plenty of sane serialization. It's offtopic and we need more discuss
at proper place.

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05  2:39       ` KOSAKI Motohiro
@ 2011-01-05  8:35         ` Avi Kivity
  2011-01-05  8:40           ` KOSAKI Motohiro
  2011-01-05 10:04         ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-05  8:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Chris Wright

On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
> >  On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> >  >  Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> >  >  avoid almost Nehalem (and other P2P cache arch) lock unfairness
> >  >  problem. (probaby creating pthread_condattr_setautoyield_np or similar
> >  >  knob is good one)
> >
> >  Often, the thread calling pthread_cond_signal() wants to continue
> >  executing, not yield.
>
> Then, it doesn't work.
>
> After calling pthread_cond_signal(), T1 which cond_signal caller and T2
> which waked start to GIL grab race. But usually T1 is always win because
> lock variable is in T1's cpu cache. Why kernel and userland have so much
> different result? One of a reason is glibc doesn't have any ticket lock scheme.
>
> If you are interesting GIL mess and issue, please feel free to ask more.

I suggest looking into an explicit round-robin scheme, where each thread 
adds itself to a queue and an unlock wakes up the first waiter.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05  8:35         ` Avi Kivity
@ 2011-01-05  8:40           ` KOSAKI Motohiro
  2011-01-05  9:08             ` Avi Kivity
  2011-01-05 10:24             ` Peter Zijlstra
  0 siblings, 2 replies; 52+ messages in thread
From: KOSAKI Motohiro @ 2011-01-05  8:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kosaki.motohiro, Rik van Riel, kvm, linux-kernel,
	Srivatsa Vaddagiri, Peter Zijlstra, Mike Galbraith, Chris Wright

> On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
> > >  On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> > >  >  Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> > >  >  avoid almost Nehalem (and other P2P cache arch) lock unfairness
> > >  >  problem. (probaby creating pthread_condattr_setautoyield_np or similar
> > >  >  knob is good one)
> > >
> > >  Often, the thread calling pthread_cond_signal() wants to continue
> > >  executing, not yield.
> >
> > Then, it doesn't work.
> >
> > After calling pthread_cond_signal(), T1 which cond_signal caller and T2
> > which waked start to GIL grab race. But usually T1 is always win because
> > lock variable is in T1's cpu cache. Why kernel and userland have so much
> > different result? One of a reason is glibc doesn't have any ticket lock scheme.
> >
> > If you are interesting GIL mess and issue, please feel free to ask more.
> 
> I suggest looking into an explicit round-robin scheme, where each thread 
> adds itself to a queue and an unlock wakes up the first waiter.

I'm sure you haven't try your scheme. but I did. It's slow.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05  8:40           ` KOSAKI Motohiro
@ 2011-01-05  9:08             ` Avi Kivity
  2011-01-05  9:30               ` KOSAKI Motohiro
  2011-01-05 10:24             ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-05  9:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Chris Wright

On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:
> >  On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
> >  >  >   On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> >  >  >   >   Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> >  >  >   >   avoid almost Nehalem (and other P2P cache arch) lock unfairness
> >  >  >   >   problem. (probaby creating pthread_condattr_setautoyield_np or similar
> >  >  >   >   knob is good one)
> >  >  >
> >  >  >   Often, the thread calling pthread_cond_signal() wants to continue
> >  >  >   executing, not yield.
> >  >
> >  >  Then, it doesn't work.
> >  >
> >  >  After calling pthread_cond_signal(), T1 which cond_signal caller and T2
> >  >  which waked start to GIL grab race. But usually T1 is always win because
> >  >  lock variable is in T1's cpu cache. Why kernel and userland have so much
> >  >  different result? One of a reason is glibc doesn't have any ticket lock scheme.
> >  >
> >  >  If you are interesting GIL mess and issue, please feel free to ask more.
> >
> >  I suggest looking into an explicit round-robin scheme, where each thread
> >  adds itself to a queue and an unlock wakes up the first waiter.
>
> I'm sure you haven't try your scheme. but I did. It's slow.

Won't anything with a heavily contented global/giant lock be slow?

What's the average lock hold time per thread? 10%? 50%? 90%?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05  9:08             ` Avi Kivity
@ 2011-01-05  9:30               ` KOSAKI Motohiro
  2011-01-05  9:34                 ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: KOSAKI Motohiro @ 2011-01-05  9:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kosaki.motohiro, Rik van Riel, kvm, linux-kernel,
	Srivatsa Vaddagiri, Peter Zijlstra, Mike Galbraith, Chris Wright

> On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:
> > >  On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
> > >  >  >   On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> > >  >  >   >   Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> > >  >  >   >   avoid almost Nehalem (and other P2P cache arch) lock unfairness
> > >  >  >   >   problem. (probaby creating pthread_condattr_setautoyield_np or similar
> > >  >  >   >   knob is good one)
> > >  >  >
> > >  >  >   Often, the thread calling pthread_cond_signal() wants to continue
> > >  >  >   executing, not yield.
> > >  >
> > >  >  Then, it doesn't work.
> > >  >
> > >  >  After calling pthread_cond_signal(), T1 which cond_signal caller and T2
> > >  >  which waked start to GIL grab race. But usually T1 is always win because
> > >  >  lock variable is in T1's cpu cache. Why kernel and userland have so much
> > >  >  different result? One of a reason is glibc doesn't have any ticket lock scheme.
> > >  >
> > >  >  If you are interesting GIL mess and issue, please feel free to ask more.
> > >
> > >  I suggest looking into an explicit round-robin scheme, where each thread
> > >  adds itself to a queue and an unlock wakes up the first waiter.
> >
> > I'm sure you haven't try your scheme. but I did. It's slow.
> 
> Won't anything with a heavily contented global/giant lock be slow?
> What's the average lock hold time per thread? 10%? 50%? 90%?

Well, Of cource all of heavily contetion are slow. but we don't have to
compare heavily contended with light contended. we have to compare
heavily contended with heavily contended or light contended with light
contended. If we are talking a scripting language VM, pipe benchmark
show impressively FIFO overhead which like your propsed. Because
pipe bench makes frequently GIL grab/ungrab storm. Similar to pipe 
bench showed our (very) old kernel's bottleneck. Sadly userspace have
no way to implement per-cpu runqueue. I think.

And, if we are talking a language VM, I can't say any average time. It
depend on running script.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05  9:30               ` KOSAKI Motohiro
@ 2011-01-05  9:34                 ` Avi Kivity
  0 siblings, 0 replies; 52+ messages in thread
From: Avi Kivity @ 2011-01-05  9:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Chris Wright

On 01/05/2011 11:30 AM, KOSAKI Motohiro wrote:
> >  On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:
> >  >  >   On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
> >  >  >   >   >    On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
> >  >  >   >   >    >    Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
> >  >  >   >   >    >    avoid almost Nehalem (and other P2P cache arch) lock unfairness
> >  >  >   >   >    >    problem. (probaby creating pthread_condattr_setautoyield_np or similar
> >  >  >   >   >    >    knob is good one)
> >  >  >   >   >
> >  >  >   >   >    Often, the thread calling pthread_cond_signal() wants to continue
> >  >  >   >   >    executing, not yield.
> >  >  >   >
> >  >  >   >   Then, it doesn't work.
> >  >  >   >
> >  >  >   >   After calling pthread_cond_signal(), T1 which cond_signal caller and T2
> >  >  >   >   which waked start to GIL grab race. But usually T1 is always win because
> >  >  >   >   lock variable is in T1's cpu cache. Why kernel and userland have so much
> >  >  >   >   different result? One of a reason is glibc doesn't have any ticket lock scheme.
> >  >  >   >
> >  >  >   >   If you are interesting GIL mess and issue, please feel free to ask more.
> >  >  >
> >  >  >   I suggest looking into an explicit round-robin scheme, where each thread
> >  >  >   adds itself to a queue and an unlock wakes up the first waiter.
> >  >
> >  >  I'm sure you haven't try your scheme. but I did. It's slow.
> >
> >  Won't anything with a heavily contented global/giant lock be slow?
> >  What's the average lock hold time per thread? 10%? 50%? 90%?
>
> Well, Of cource all of heavily contetion are slow. but we don't have to
> compare heavily contended with light contended. we have to compare
> heavily contended with heavily contended or light contended with light
> contended. If we are talking a scripting language VM, pipe benchmark
> show impressively FIFO overhead which like your propsed. Because
> pipe bench makes frequently GIL grab/ungrab storm. Similar to pipe
> bench showed our (very) old kernel's bottleneck. Sadly userspace have
> no way to implement per-cpu runqueue. I think.

A completely fair lock will likely be slower than an unfair lock.

> And, if we are talking a language VM, I can't say any average time. It
> depend on running script.

Pick some parallel compute intensive script, please.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05  2:39       ` KOSAKI Motohiro
  2011-01-05  8:35         ` Avi Kivity
@ 2011-01-05 10:04         ` Peter Zijlstra
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-05 10:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Avi Kivity, Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright, Darren Hart, tglx

On Wed, 2011-01-05 at 11:39 +0900, KOSAKI Motohiro wrote:

> After calling pthread_cond_signal(), T1 which cond_signal caller and T2
> which waked start to GIL grab race. But usually T1 is always win because
> lock variable is in T1's cpu cache. Why kernel and userland have so much
> different result? One of a reason is glibc doesn't have any ticket lock scheme. 

The problem is that making locks strictly fair is that that sucks for
performance, iirc most futex ops are fifo-ordered when they his the
block path, but we do allow for lock-stealing.

Lock-stealing greatly improves performance since it avoids lots of
block/wakeup cycles, but it does make things unfair.

I'm not sure we have a futex option to disable lock-stealing, nor do I
think you really want to, performance suffers really badly.

[This btw is the reason why people reported a performance improvement
when they wrapped all mmap() calls in a pthread_mutex, the
rwsem_down_write() thing doesn't allow for lock-stealing since it needs
to be strict fifo-fair]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05  8:40           ` KOSAKI Motohiro
  2011-01-05  9:08             ` Avi Kivity
@ 2011-01-05 10:24             ` Peter Zijlstra
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-05 10:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Avi Kivity, Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Wed, 2011-01-05 at 17:40 +0900, KOSAKI Motohiro wrote:
> > > If you are interesting GIL mess and issue, please feel free to ask more.
> > 
> > I suggest looking into an explicit round-robin scheme, where each thread 
> > adds itself to a queue and an unlock wakes up the first waiter.
> 
> I'm sure you haven't try your scheme. but I did. It's slow.

Of course it is, but then your locking scheme (GIL) is the problem, not
the locking constructs.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 18:04   ` Peter Zijlstra
  2011-01-04 18:53     ` Mike Galbraith
@ 2011-01-05 16:57     ` Mike Galbraith
  2011-01-05 17:04       ` Peter Zijlstra
  2011-01-06 14:33       ` Hillf Danton
  2011-01-05 17:10     ` Avi Kivity
  2 siblings, 2 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-01-05 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

It's Rik's patch to do with whatever he wants (I donated suggestion;),
but I took the liberty of cleaning it up a bit, see below.  It's
essentially the same, but fixes a bug where the caller may have targeted
a task in it's thread group etc, but when preemption decision time came,
may have preempted a forbidden task.

On Tue, 2011-01-04 at 19:04 +0100, Peter Zijlstra wrote:
> On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:
> >  
> > +out:
> > +	double_rq_unlock(rq, p_rq);
> > +	local_irq_restore(flags);
> > +
> > +	if (yield) {
> > +		set_current_state(TASK_RUNNING);
> > +		schedule();
> > +	}
> > +}
> > +EXPORT_SYMBOL(yield_to);
> 
> This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
> I'd make it so only kvm.o could use it. It really sucks that kvm is a
> module.

Did that.

> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +	if (cfs_rq->tg != p_cfs_rq->tg)
> > +		return 0;
> > +
> > +	/* Preemption only allowed within the same task group. */
> > +	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
> > +		preempt = 0;
> > +#endif
> 
> I'd simply bail if its not the same cgroup, who cares about that case
> anyway, all KVM vcpu threads should be in the same cgroup I think.

See below.

> > +	/* Preemption only allowed within the same thread group. */
> > +	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
> > +		preempt = 0;
> 
> The calling site already checks for same_thread_group(), we never even
> get here if that's not the case.

Yeah, task group check was over there already as well.  Hasty hack.

> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this yield is important enough to want to preempt instead
> > +	 * of only dropping a ->next hint, we're alone, and the target
> > +	 * is not alone, pull the target to this cpu.
> > +	 */
> > +	if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
> > +			cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
> > +		pull_task(task_rq(p), p, this_rq(), smp_processor_id());
> 
> This only works by the grace that the caller checked p->se.on_rq. A
> comment might be in order.

I moved all of the entity decisions to the class method, and tried to
make it a bit less ugly.


sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task method() to the fair scheduling class. allowing the
caller of yield_to() to accelerate another thread in it's thread group,
task group, and sched class toward either it's cpu, or potentially the
caller's own cpu if the 'preempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 include/linux/sched.h |    1 
 kernel/sched.c        |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c   |   52 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 109 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1337,6 +1337,57 @@ static void yield_task_fair(struct rq *r
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int local = cfs_rq == p_cfs_rq;
+	int this_cpu = smp_processor_id();
+
+	if (!pse->on_rq)
+		return 0;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 *
+	 * NOTE: the target may be alone in it's cfs_rq if another class
+	 * task or another task group is currently executing on it's cpu.
+	 * In this case, we still pull, to accelerate it toward the cpu.
+	 */
+	if (!local && preempt && cfs_rq->nr_running == 1 &&
+			cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), this_cpu);
+		p_cfs_rq = cfs_rq_of(pse);
+		local = 1;
+	}
+#endif
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;
+
+	/* We know whether we want to preempt or not, but are we allowed? */
+	preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
+
+	if (local)
+		clear_buddies(cfs_rq, se);
+	else if (preempt) {
+		clear_buddies(p_cfs_rq, p_cfs_rq->curr);
+		resched_task(task_of(p_cfs_rq->curr));
+	}
+
+	return local;
+}
+
+#ifdef CONFIG_SMP
 
 static void task_waking_fair(struct rq *rq, struct task_struct *p)
 {
@@ -4143,6 +4194,7 @@ static const struct sched_class fair_sch
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,
 
 	.check_preempt_curr	= check_preempt_wakeup,
 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 16:57     ` Mike Galbraith
@ 2011-01-05 17:04       ` Peter Zijlstra
  2011-01-05 17:23         ` Mike Galbraith
  2011-01-07  5:29         ` Mike Galbraith
  2011-01-06 14:33       ` Hillf Danton
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-05 17:04 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:
> +               p_cfs_rq = cfs_rq_of(pse);
> +               local = 1;
> +       }
> +#endif
> +
> +       /* Tell the scheduler that we'd really like pse to run next. */
> +       p_cfs_rq->next = pse;
> +
> +       /* We know whether we want to preempt or not, but are we allowed? */
> +       preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
> +
> +       if (local)
> +               clear_buddies(cfs_rq, se); 

You might want to clear before setting next :-)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-04 18:04   ` Peter Zijlstra
  2011-01-04 18:53     ` Mike Galbraith
  2011-01-05 16:57     ` Mike Galbraith
@ 2011-01-05 17:10     ` Avi Kivity
  2011-01-05 17:15       ` Peter Zijlstra
  2 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-05 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
> This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
> I'd make it so only kvm.o could use it. It really sucks that kvm is a
> module.

Why does it suck?  I mean apart from the "virtualization is crap" song.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:10     ` Avi Kivity
@ 2011-01-05 17:15       ` Peter Zijlstra
  2011-01-05 17:19         ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-05 17:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
> On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
> > This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
> > I'd make it so only kvm.o could use it. It really sucks that kvm is a
> > module.
> 
> Why does it suck?  I mean apart from the "virtualization is crap" song.

Because it needs hooks all over the core kernel, like this yield_to()
stuff. Exporting this might lead to others wanting to use it too.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:15       ` Peter Zijlstra
@ 2011-01-05 17:19         ` Avi Kivity
  2011-01-05 17:28           ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-05 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On 01/05/2011 07:15 PM, Peter Zijlstra wrote:
> On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
> >  On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
> >  >  This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
> >  >  I'd make it so only kvm.o could use it. It really sucks that kvm is a
> >  >  module.
> >
> >  Why does it suck?  I mean apart from the "virtualization is crap" song.
>
> Because it needs hooks all over the core kernel, like this yield_to()
> stuff. Exporting this might lead to others wanting to use it too.

Well, it's very convenient for development (modprobe vs. reboot).  What 
hooks do you object to? mmu notifiers are useful for some drivers, sched 
notifiers are useful for cmwq and possibly perf?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:04       ` Peter Zijlstra
@ 2011-01-05 17:23         ` Mike Galbraith
  2011-01-07  5:29         ` Mike Galbraith
  1 sibling, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-01-05 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

On Wed, 2011-01-05 at 18:04 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:
> > +               p_cfs_rq = cfs_rq_of(pse);
> > +               local = 1;
> > +       }
> > +#endif
> > +
> > +       /* Tell the scheduler that we'd really like pse to run next. */
> > +       p_cfs_rq->next = pse;
> > +
> > +       /* We know whether we want to preempt or not, but are we allowed? */
> > +       preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
> > +
> > +       if (local)
> > +               clear_buddies(cfs_rq, se); 
> 
> You might want to clear before setting next :-)

Heh, you made me thwap forehead.. but pse != se or p_cfs_rq->curr.
Moved it back where it came from anyway ;-)


sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task method() to the fair scheduling class. allowing the
caller of yield_to() to accelerate another thread in it's thread group,
task group, and sched class toward either it's cpu, or potentially the
caller's own cpu if the 'prempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 include/linux/sched.h |    1 
 kernel/sched.c        |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c   |   52 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 109 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1337,6 +1337,57 @@ static void yield_task_fair(struct rq *r
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int local = cfs_rq == p_cfs_rq;
+	int this_cpu = smp_processor_id();
+
+	if (!pse->on_rq)
+		return 0;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 *
+	 * NOTE: the target may be alone in it's cfs_rq if another class
+	 * task or another task group is currently executing on it's cpu.
+	 * In this case, we still pull, to accelerate it toward the cpu.
+	 */
+	if (!local && preempt && cfs_rq->nr_running == 1 &&
+			cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), this_cpu);
+		p_cfs_rq = cfs_rq_of(pse);
+		local = 1;
+	}
+#endif
+
+	/* We know whether we want to preempt or not, but are we allowed? */
+	preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
+
+	if (local)
+		clear_buddies(cfs_rq, se);
+	else if (preempt) {
+		clear_buddies(p_cfs_rq, p_cfs_rq->curr);
+		resched_task(task_of(p_cfs_rq->curr));
+	}
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;
+
+	return local;
+}
+
+#ifdef CONFIG_SMP
 
 static void task_waking_fair(struct rq *rq, struct task_struct *p)
 {
@@ -4143,6 +4194,7 @@ static const struct sched_class fair_sch
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,
 
 	.check_preempt_curr	= check_preempt_wakeup,
 











^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:19         ` Avi Kivity
@ 2011-01-05 17:28           ` Peter Zijlstra
  2011-01-05 17:35             ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-05 17:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Wed, 2011-01-05 at 19:19 +0200, Avi Kivity wrote:
> On 01/05/2011 07:15 PM, Peter Zijlstra wrote:
> > On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
> > >  On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
> > >  >  This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
> > >  >  I'd make it so only kvm.o could use it. It really sucks that kvm is a
> > >  >  module.
> > >
> > >  Why does it suck?  I mean apart from the "virtualization is crap" song.
> >
> > Because it needs hooks all over the core kernel, like this yield_to()
> > stuff. Exporting this might lead to others wanting to use it too.
> 
> Well, it's very convenient for development (modprobe vs. reboot).

Get a machine that boots fast ;-), it might be because I've never worked
on anything modular, but typically stuff crashes when I got it wrong, in
which case you're forced to power-cycle anyway.

>   What hooks do you object to? 

Don't really keep a list, but every now and again see something that
makes me think that ought not be exported.

> mmu notifiers are useful for some drivers, sched 
> notifiers are useful for cmwq and possibly perf?

Both use explicit function calls because its easier to read the code and
faster.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:28           ` Peter Zijlstra
@ 2011-01-05 17:35             ` Avi Kivity
  2011-01-05 17:39               ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2011-01-05 17:35 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On 01/05/2011 07:28 PM, Peter Zijlstra wrote:
> On Wed, 2011-01-05 at 19:19 +0200, Avi Kivity wrote:
> >  On 01/05/2011 07:15 PM, Peter Zijlstra wrote:
> >  >  On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
> >  >  >   On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
> >  >  >   >   This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
> >  >  >   >   I'd make it so only kvm.o could use it. It really sucks that kvm is a
> >  >  >   >   module.
> >  >  >
> >  >  >   Why does it suck?  I mean apart from the "virtualization is crap" song.
> >  >
> >  >  Because it needs hooks all over the core kernel, like this yield_to()
> >  >  stuff. Exporting this might lead to others wanting to use it too.
> >
> >  Well, it's very convenient for development (modprobe vs. reboot).
>
> Get a machine that boots fast ;-), it might be because I've never worked
> on anything modular, but typically stuff crashes when I got it wrong, in
> which case you're forced to power-cycle anyway.

I often get by with just guest crashes.  Or maybe it oopses, I look 
around, then reboot at my leisure.

> >    What hooks do you object to?
>
> Don't really keep a list, but every now and again see something that
> makes me think that ought not be exported.
>
> >  mmu notifiers are useful for some drivers, sched
> >  notifiers are useful for cmwq and possibly perf?
>
> Both use explicit function calls because its easier to read the code and
> faster.

I doubt it's measurable, especially for perf with the msr tweaking it does.

Tejun, why did you end up not using preempt_notifiers in cmwq?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:35             ` Avi Kivity
@ 2011-01-05 17:39               ` Peter Zijlstra
  2011-01-06  3:49                 ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-01-05 17:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tejun Heo, Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

On Wed, 2011-01-05 at 19:35 +0200, Avi Kivity wrote:

> Tejun, why did you end up not using preempt_notifiers in cmwq?

Because I told him to use explicit function calls because that keeps the
code easier to read.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:39               ` Peter Zijlstra
@ 2011-01-06  3:49                 ` Tejun Heo
  0 siblings, 0 replies; 52+ messages in thread
From: Tejun Heo @ 2011-01-06  3:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Rik van Riel, kvm, linux-kernel, Srivatsa Vaddagiri,
	Mike Galbraith, Chris Wright

Hello,

On Wed, Jan 05, 2011 at 06:39:19PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-05 at 19:35 +0200, Avi Kivity wrote:
> 
> > Tejun, why did you end up not using preempt_notifiers in cmwq?
> 
> Because I told him to use explicit function calls because that keeps the
> code easier to read.

It went like the following.

* Beefing up preempt_notifier() results in too many different
  notifiers.  They better be unified.

* Unified notifiers are ugly (macro bonanza as expected) and cmwq
  hooks are always enabled unlike other notifiers.  Let's just keep it
  simple and specialized.

So, the cmwq stuff is now mostly hard coded into scheduler.  For cmwq,
everything is fine but it would still be great if the various
notifiers can be assimilated in prettier way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 16:57     ` Mike Galbraith
  2011-01-05 17:04       ` Peter Zijlstra
@ 2011-01-06 14:33       ` Hillf Danton
  1 sibling, 0 replies; 52+ messages in thread
From: Hillf Danton @ 2011-01-06 14:33 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Rik van Riel, kvm, linux-kernel, Avi Kiviti,
	Srivatsa Vaddagiri, Chris Wright

On Thu, Jan 6, 2011 at 12:57 AM, Mike Galbraith <efault@gmx.de> wrote:
> sched: Add yield_to(task, preempt) functionality.
>
> Currently only implemented for fair class tasks.
>
> Add a yield_to_task method() to the fair scheduling class. allowing the
> caller of yield_to() to accelerate another thread in it's thread group,
> task group, and sched class toward either it's cpu, or potentially the
> caller's own cpu if the 'preempt' argument is also passed.
>
> Implemented via a scheduler hint, using cfs_rq->next to encourage the
> target being selected.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
>
> ---
>  include/linux/sched.h |    1
>  kernel/sched.c        |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched_fair.c   |   52 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 109 insertions(+)
>
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -1056,6 +1056,7 @@ struct sched_class {
>        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
>        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
>        void (*yield_task) (struct rq *rq);
> +       int (*yield_to_task) (struct task_struct *p, int preempt);
>
>        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -5327,6 +5327,62 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>
> +/**
> + * yield_to - yield the current processor to another thread in
> + * your thread group, or accelerate that thread toward the
> + * processor it's on.
> + *
> + * It's the caller's job to ensure that the target task struct
> + * can't go away on us before we can do any checks.
> + */
> +void __sched yield_to(struct task_struct *p, int preempt)
> +{
> +       struct task_struct *curr = current;
> +       struct rq *rq, *p_rq;
> +       unsigned long flags;
> +       int yield = 0;
> +
> +       local_irq_save(flags);
> +       rq = this_rq();
> +
> +again:
> +       p_rq = task_rq(p);
> +       double_rq_lock(rq, p_rq);
> +       while (task_rq(p) != p_rq) {
> +               double_rq_unlock(rq, p_rq);
> +               goto again;
> +       }
> +
> +       if (!curr->sched_class->yield_to_task)
> +               goto out;
> +
> +       if (curr->sched_class != p->sched_class)
> +               goto out;
> +

to be clearer?
            if (task_running(p_rq, p) || p->state != TASK_RUNNING)

> +               goto out;
> +
> +       if (!same_thread_group(p, curr))
> +               goto out;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-05 17:04       ` Peter Zijlstra
  2011-01-05 17:23         ` Mike Galbraith
@ 2011-01-07  5:29         ` Mike Galbraith
  2011-01-13  3:02           ` Rik van Riel
  1 sibling, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-01-07  5:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

On Wed, 2011-01-05 at 18:04 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:
> > +               p_cfs_rq = cfs_rq_of(pse);
> > +               local = 1;
> > +       }
> > +#endif
> > +
> > +       /* Tell the scheduler that we'd really like pse to run next. */
> > +       p_cfs_rq->next = pse;
> > +
> > +       /* We know whether we want to preempt or not, but are we allowed? */
> > +       preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
> > +
> > +       if (local)
> > +               clear_buddies(cfs_rq, se); 
> 
> You might want to clear before setting next :-)

Or better, just remove dept. of redundancy dept. cruft.  We clear
buddies upon selection.  It's also pointless worrying whether to set
TIF_RESCHED or not, no cycle savings to be had there methinks.

While performing cruftectomy, also did cosmetic int ==> bool.



sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task method() to the fair scheduling class. allowing the
caller of yield_to() to accelerate another thread in it's thread group,
task group, and sched class toward either it's cpu, or potentially the
caller's own cpu if the 'preempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 include/linux/sched.h |    1 
 kernel/sched.c        |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c   |   44 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	bool (*yield_to_task) (struct task_struct *p, bool preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, bool preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	bool yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1337,6 +1337,49 @@ static void yield_task_fair(struct rq *r
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static bool yield_to_task_fair(struct task_struct *p, bool preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int this_cpu = smp_processor_id();
+
+	if (!pse->on_rq)
+		return false;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 *
+	 * NOTE: the target may be alone in it's cfs_rq if another class
+	 * task or another task group is currently executing on it's cpu.
+	 * In this case, we still pull, to accelerate it toward the cpu.
+	 */
+	if (cfs_rq != p_cfs_rq && preempt && cfs_rq->nr_running == 1 &&
+			cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), this_cpu);
+		p_cfs_rq = cfs_rq_of(pse);
+	}
+#endif
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;
+
+	/* We know whether we want to preempt or not, but are we allowed? */
+	if (preempt && same_thread_group(p, task_of(p_cfs_rq->curr)))
+		resched_task(task_of(p_cfs_rq->curr));
+
+	return cfs_rq == p_cfs_rq;
+}
+
+#ifdef CONFIG_SMP
 
 static void task_waking_fair(struct rq *rq, struct task_struct *p)
 {
@@ -4143,6 +4186,7 @@ static const struct sched_class fair_sch
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,
 
 	.check_preempt_curr	= check_preempt_wakeup,
 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-07  5:29         ` Mike Galbraith
@ 2011-01-13  3:02           ` Rik van Riel
  2011-01-13  3:26             ` Mike Galbraith
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-13  3:02 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

On 01/07/2011 12:29 AM, Mike Galbraith wrote:

> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this yield is important enough to want to preempt instead
> +	 * of only dropping a ->next hint, we're alone, and the target
> +	 * is not alone, pull the target to this cpu.
> +	 *
> +	 * NOTE: the target may be alone in it's cfs_rq if another class
> +	 * task or another task group is currently executing on it's cpu.
> +	 * In this case, we still pull, to accelerate it toward the cpu.
> +	 */
> +	if (cfs_rq != p_cfs_rq&&  preempt&&  cfs_rq->nr_running == 1&&
> +			cpumask_test_cpu(this_cpu,&p->cpus_allowed)) {
> +		pull_task(task_rq(p), p, this_rq(), this_cpu);
> +		p_cfs_rq = cfs_rq_of(pse);
> +	}
> +#endif

This causes some fun issues in a simple test case on
my system.  The test consists of 2 4-VCPU KVM guests,
bound to the same 4 CPUs on the host.

One guest is running the AMQP performance test, the
other guest is totally idle.  This means that besides
the 4 very busy VCPUs, there is only a few percent
CPU use in background tasks from the idle guest and
qemu-kvm userspace bits.

However, the busy guest is restricted to using just
3 out of the 4 CPUs, leaving one idle!

A simple explanation for this is that the above
pulling code will pull another VCPU onto the local
CPU whenever we run into contention inside the guest
and some random background task runs on the CPU where
that other VCPU was.

 From that point on, the 4 VCPUs will stay on 3
CPUs, leaving one idle.  Any time we have contention
inside the guest (pretty frequent), we move whoever
is not currently running to another CPU.

Cgroups only makes the matter worse - libvirt places
each KVM guest into its own cgroup, so a VCPU will
generally always be alone on its own per-cgroup, per-cpu
runqueue!  That can lead to pulling a VCPU onto our local
CPU because we think we are alone, when in reality we
share the CPU with others...

Removing the pulling code allows me to use all 4
CPUs with a 4-VCPU KVM guest in an uncontended situation.

> +	/* Tell the scheduler that we'd really like pse to run next. */
> +	p_cfs_rq->next = pse;

Using set_next_buddy propagates this up to the root,
allowing the scheduler to actually know who we want to
run next when cgroups is involved.

> +	/* We know whether we want to preempt or not, but are we allowed? */
> +	if (preempt&&  same_thread_group(p, task_of(p_cfs_rq->curr)))
> +		resched_task(task_of(p_cfs_rq->curr));

With this in place, we can get into the situation where
we will gladly give up CPU time, but not actually give
any to the other VCPUs in our guest.

I believe we can get rid of that test, because pick_next_entity
already makes sure it ignores ->next if picking ->next would
lead to unfairness.

Removing this test (and simplifying yield_to_task_fair) seems
to lead to more predictable test results.

I'll send the updated patch in another email, since this one is
already way too long for a changelog :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-13  3:02           ` Rik van Riel
@ 2011-01-13  3:26             ` Mike Galbraith
  2011-01-13  5:08               ` Rik van Riel
  0 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-01-13  3:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

On Wed, 2011-01-12 at 22:02 -0500, Rik van Riel wrote:

> Cgroups only makes the matter worse - libvirt places
> each KVM guest into its own cgroup, so a VCPU will
> generally always be alone on its own per-cgroup, per-cpu
> runqueue!  That can lead to pulling a VCPU onto our local
> CPU because we think we are alone, when in reality we
> share the CPU with others...

How can that happen?  If the task you're trying to accelerate isn't in
your task group, the whole attempt should be a noop.

> Removing the pulling code allows me to use all 4
> CPUs with a 4-VCPU KVM guest in an uncontended situation.
> 
> > +	/* Tell the scheduler that we'd really like pse to run next. */
> > +	p_cfs_rq->next = pse;
> 
> Using set_next_buddy propagates this up to the root,
> allowing the scheduler to actually know who we want to
> run next when cgroups is involved.
> 
> > +	/* We know whether we want to preempt or not, but are we allowed? */
> > +	if (preempt&&  same_thread_group(p, task_of(p_cfs_rq->curr)))
> > +		resched_task(task_of(p_cfs_rq->curr));
> 
> With this in place, we can get into the situation where
> we will gladly give up CPU time, but not actually give
> any to the other VCPUs in our guest.
> 
> I believe we can get rid of that test, because pick_next_entity
> already makes sure it ignores ->next if picking ->next would
> lead to unfairness.

Preempting everybody who is in your way isn't playing nice neighbor, so
I think at least the same_thread_group() test needs to stay.  But that's
Peter's call.  Starting a zillion threads to play wakeup preempt and
lets hog the cpu isn't nice either, but it's allowed.

> Removing this test (and simplifying yield_to_task_fair) seems
> to lead to more predictable test results.

Less is more :)

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  2011-01-13  3:26             ` Mike Galbraith
@ 2011-01-13  5:08               ` Rik van Riel
  0 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-13  5:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, kvm, linux-kernel, Avi Kiviti, Srivatsa Vaddagiri,
	Chris Wright

On 01/12/2011 10:26 PM, Mike Galbraith wrote:
> On Wed, 2011-01-12 at 22:02 -0500, Rik van Riel wrote:
>
>> Cgroups only makes the matter worse - libvirt places
>> each KVM guest into its own cgroup, so a VCPU will
>> generally always be alone on its own per-cgroup, per-cpu
>> runqueue!  That can lead to pulling a VCPU onto our local
>> CPU because we think we are alone, when in reality we
>> share the CPU with others...
>
> How can that happen?  If the task you're trying to accelerate isn't in
> your task group, the whole attempt should be a noop.

Nono, all the VCPUs from the same guest are in the same
cgroup.  However, with 4 VCPUs and 4 physical CPUs,
chances are that they're all alone on their own per-cpu,
per-cgroup cfs_rq.

However, each CPU might have other runnable processes in
other cfs_rq sched entities.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2011-01-13  5:09 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-01-03 21:26 [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Rik van Riel
2011-01-03 21:27 ` [RFC -v3 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu Rik van Riel
2011-01-03 21:29 ` [RFC -v3 PATCH 2/3] sched: add yield_to function Rik van Riel
2011-01-04  1:51   ` Mike Galbraith
2011-01-04  6:14   ` KOSAKI Motohiro
2011-01-04 12:03     ` Avi Kivity
2011-01-05  2:39       ` KOSAKI Motohiro
2011-01-05  8:35         ` Avi Kivity
2011-01-05  8:40           ` KOSAKI Motohiro
2011-01-05  9:08             ` Avi Kivity
2011-01-05  9:30               ` KOSAKI Motohiro
2011-01-05  9:34                 ` Avi Kivity
2011-01-05 10:24             ` Peter Zijlstra
2011-01-05 10:04         ` Peter Zijlstra
2011-01-04 17:16     ` Peter Zijlstra
2011-01-05  3:17       ` KOSAKI Motohiro
2011-01-04 14:28   ` Hillf Danton
2011-01-04 16:41   ` Hillf Danton
2011-01-04 16:44     ` Rik van Riel
2011-01-04 16:51       ` Hillf Danton
2011-01-04 16:54         ` Rik van Riel
2011-01-04 17:02           ` Hillf Danton
2011-01-04 17:08         ` Peter Zijlstra
2011-01-04 17:12           ` Hillf Danton
2011-01-04 17:22             ` Peter Zijlstra
2011-01-04 17:53           ` Rik van Riel
2011-01-04 18:05             ` Peter Zijlstra
2011-01-04 18:04   ` Peter Zijlstra
2011-01-04 18:53     ` Mike Galbraith
2011-01-05 16:57     ` Mike Galbraith
2011-01-05 17:04       ` Peter Zijlstra
2011-01-05 17:23         ` Mike Galbraith
2011-01-07  5:29         ` Mike Galbraith
2011-01-13  3:02           ` Rik van Riel
2011-01-13  3:26             ` Mike Galbraith
2011-01-13  5:08               ` Rik van Riel
2011-01-06 14:33       ` Hillf Danton
2011-01-05 17:10     ` Avi Kivity
2011-01-05 17:15       ` Peter Zijlstra
2011-01-05 17:19         ` Avi Kivity
2011-01-05 17:28           ` Peter Zijlstra
2011-01-05 17:35             ` Avi Kivity
2011-01-05 17:39               ` Peter Zijlstra
2011-01-06  3:49                 ` Tejun Heo
2011-01-03 21:30 ` [RFC -v3 PATCH 3/3] Subject: kvm: use yield_to instead of sleep in kvm_vcpu_on_spin Rik van Riel
2011-01-04  6:42 ` [RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting Mike Galbraith
2011-01-04  9:09   ` Avi Kivity
2011-01-04 10:32     ` Mike Galbraith
2011-01-04 10:35       ` Avi Kivity
2011-01-04  9:14 ` Avi Kivity
2011-01-04 10:26   ` Mike Galbraith
2011-01-04 17:20     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox