From: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@elte.hu>, Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Rik van Riel <riel@redhat.com>,
linux-kernel@vger.kernel.org, vatsa@linux.vnet.ibm.com,
bharata@linux.vnet.ibm.com
Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS
Date: Mon, 20 Feb 2012 13:38:28 +0530
Message-ID: <87fwe6ork3.fsf@abhimanyu.in.ibm.com>
In-Reply-To: <20120105091059.GA3249@elte.hu>
On Thu, 5 Jan 2012 10:10:59 +0100, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Avi Kivity <avi@redhat.com> wrote:
>
> > > So why wait for non-running vcpus at all? That is, why not
> > > paravirt the TLB flush such that the invalidate marks the
> > > non-running VCPU's state so that on resume it will first
> > > flush its TLBs. That way you don't have to wake it up and
> > > wait for it to invalidate its TLBs.
> >
> > That's what Xen does, but it's tricky. For example
> > get_user_pages_fast() depends on the IPI to hold off page
> > freeing, if we paravirt it we have to take that into
> > consideration.
> >
> > > Or am I like totally missing the point (I am after all
> > > reading the thread backwards and I haven't yet fully paged
> > > the kernel stuff back into my brain).
> >
> > You aren't, and I bet those kernel pages are unswappable
> > anyway.
> >
> > > I guess tagging remote VCPU state like that might be
> > > somewhat tricky.. but it seems worth considering, the whole
> > > wake and wait for flush thing seems daft.
> >
> > It's nasty, but then so is paravirt. It's hard to get right,
> > and it has a tendency to cause performance regressions as
> > hardware improves.
>
> Here it would massively improve performance - without regressing
> the scheduler code massively.
>
I ran an experiment with flush_tlb_others_ipi(). It depends on Raghu's
"kvm : Paravirt-spinlock support for KVM guests" series
(https://lkml.org/lkml/2012/1/14/66), which adds a new hypercall for
kicking another vcpu out of halt.
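To make the idea concrete, here is a rough user-space sketch of the wait protocol the patch below experiments with: the sender spins for a bounded number of iterations, then blocks, and whichever target clears the last pending bit wakes it. This is only an analogy under stated assumptions, not kernel code. Threads stand in for vcpus, a pthread condition variable stands in for halt() and the KVM_HC_KICK_CPU hypercall, and the names flush_state, target_thread, flush_others, SPIN_LIMIT and MAX_TARGETS are all made up for illustration. Unlike the patch, the sketch waits on an explicit "kicked" flag in a loop, so a kick that arrives before the sender blocks is not lost.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define SPIN_LIMIT  1024   /* same bound as "loop" in the patch */
#define MAX_TARGETS 8      /* hypothetical limit for this sketch */

/* Hypothetical user-space stand-in for the per-sender flush state. */
struct flush_state {
	atomic_uint pending_mask;   /* plays the role of f->flush_cpumask */
	atomic_int flushes_done;    /* counts simulated local flushes */
	pthread_mutex_t lock;
	pthread_cond_t kick;        /* plays the role of the kick hypercall */
	bool kicked;                /* set by the last target, under lock */
};

struct target_arg {
	struct flush_state *fs;
	unsigned int bit;
};

/* One "remote vcpu": flush, clear own bit, kick sender if last. */
static void *target_thread(void *p)
{
	struct target_arg *arg = p;
	struct flush_state *fs = arg->fs;
	unsigned int mine = 1u << arg->bit;
	unsigned int old;

	atomic_fetch_add(&fs->flushes_done, 1);   /* pretend TLB flush */
	old = atomic_fetch_and(&fs->pending_mask, ~mine);
	if ((old & ~mine) == 0) {
		/* We cleared the last pending bit: wake the sender. */
		pthread_mutex_lock(&fs->lock);
		fs->kicked = true;
		pthread_cond_signal(&fs->kick);
		pthread_mutex_unlock(&fs->lock);
	}
	return NULL;
}

/* The "sender": returns how many targets performed their flush. */
static int flush_others(unsigned int ntargets)
{
	struct flush_state fs;
	struct target_arg args[MAX_TARGETS];
	pthread_t tids[MAX_TARGETS];
	int loop = SPIN_LIMIT;
	unsigned int i;

	assert(ntargets > 0 && ntargets <= MAX_TARGETS);
	atomic_init(&fs.pending_mask, (1u << ntargets) - 1);
	atomic_init(&fs.flushes_done, 0);
	pthread_mutex_init(&fs.lock, NULL);
	pthread_cond_init(&fs.kick, NULL);
	fs.kicked = false;

	/* "Send the IPI": start one thread per target. */
	for (i = 0; i < ntargets; i++) {
		args[i].fs = &fs;
		args[i].bit = i;
		if (pthread_create(&tids[i], NULL, target_thread, &args[i]))
			abort();
	}

	/* Phase 1: bounded busy-wait, like the cpu_relax() loop. */
	while (atomic_load(&fs.pending_mask) != 0 && --loop)
		;

	/* Phase 2: still pending, so block until the last target kicks. */
	if (atomic_load(&fs.pending_mask) != 0) {
		pthread_mutex_lock(&fs.lock);
		while (!fs.kicked)
			pthread_cond_wait(&fs.kick, &fs.lock);
		pthread_mutex_unlock(&fs.lock);
	}

	for (i = 0; i < ntargets; i++)
		pthread_join(tids[i], NULL);
	pthread_cond_destroy(&fs.kick);
	pthread_mutex_destroy(&fs.lock);
	return atomic_load(&fs.flushes_done);
}
```

Calling flush_others(4) simulates one sender and four targets and should return 4 once every target has cleared its bit. The point of the two phases is that a short spin covers the common case where all targets are running, while blocking gives the physical CPU back when a target vcpu has been preempted.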
Here are the results from non-PLE hardware (no Pause-Loop Exiting),
running the ebizzy workload inside the VMs. The table shows the ebizzy
score in records/sec.
8-CPU Intel Xeon, HT disabled, 64-bit VM (8 vcpus, 1G RAM)
+--------+------------+------------+-------------+
|  VMs   |  baseline  |    gang    |  pv_flush   |
+--------+------------+------------+-------------+
|  2VM   |    3979.50 |    8818.00 |    11002.50 |
|  4VM   |    1817.50 |    6236.50 |     6196.75 |
|  8VM   |     922.12 |    4043.00 |     4001.38 |
+--------+------------+------------+-------------+
I will be posting the results for PLE hardware as well.
Here is the patch; it still needs to be hooked into pv_mmu_ops. So,
Not-yet-Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Index: linux-tip-f4ab688-pv/arch/x86/mm/tlb.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/mm/tlb.c 2012-02-14 18:26:21.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/mm/tlb.c 2012-02-20 15:23:10.242576314 +0800
@@ -43,6 +43,7 @@ union smp_flush_state {
struct mm_struct *flush_mm;
unsigned long flush_va;
raw_spinlock_t tlbstate_lock;
+ int sender_cpu;
DECLARE_BITMAP(flush_cpumask, NR_CPUS);
};
char pad[INTERNODE_CACHE_BYTES];
@@ -116,6 +117,9 @@ EXPORT_SYMBOL_GPL(leave_mm);
*
* Interrupts are disabled.
*/
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+extern void kvm_kick_cpu(int cpu);
+#endif
/*
* FIXME: use of asmlinkage is not consistent. On x86_64 it's noop
@@ -166,6 +170,10 @@ out:
smp_mb__before_clear_bit();
cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
smp_mb__after_clear_bit();
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+ if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+ kvm_kick_cpu(f->sender_cpu);
+#endif
inc_irq_stat(irq_tlb_count);
}
@@ -184,7 +192,10 @@ static void flush_tlb_others_ipi(const s
f->flush_mm = mm;
f->flush_va = va;
+ f->sender_cpu = smp_processor_id();
if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+ int loop = 1024;
+
/*
* We have to send the IPI only to
* CPUs affected.
@@ -192,8 +203,15 @@ static void flush_tlb_others_ipi(const s
apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
INVALIDATE_TLB_VECTOR_START + sender);
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+ while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+ cpu_relax();
+ if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
+ halt();
+#else
while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
cpu_relax();
+#endif
}
f->flush_mm = NULL;
Index: linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/kernel/kvm.c 2012-02-14 18:26:55.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c 2012-02-14 18:26:55.178450933 +0800
@@ -653,16 +653,17 @@ out:
PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
/* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int apicid)
+void kvm_kick_cpu(int cpu)
{
+ int apicid = per_cpu(x86_cpu_to_apicid, cpu);
kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
}
+EXPORT_SYMBOL_GPL(kvm_kick_cpu);
/* Kick vcpu waiting on @lock->head to reach value @ticket */
static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
{
int cpu;
- int apicid;
add_stats(RELEASED_SLOW, 1);
@@ -671,8 +672,7 @@ static void kvm_unlock_kick(struct arch_
if (ACCESS_ONCE(w->lock) == lock &&
ACCESS_ONCE(w->want) == ticket) {
add_stats(RELEASED_SLOW_KICKED, 1);
- apicid = per_cpu(x86_cpu_to_apicid, cpu);
- kvm_kick_cpu(apicid);
+ kvm_kick_cpu(cpu);
break;
}
}