[patch]x86: spread tlb flush vector between nodes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch]x86: spread tlb flush vector between nodes
@ 2010-10-13  7:41 Shaohua Li
  2010-10-13  8:16 ` Andi Kleen
  0 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2010-10-13  7:41 UTC (permalink / raw)
  To: lkml; +Cc: Ingo Molnar, hpa@zytor.com, Andi Kleen, Chen, Tim C

Currently flush tlb vector allocation is based on below equation:
	sender = smp_processor_id() % 8
This isn't optimal, CPUs from different node can have the same vector, this
causes a lot of lock contention. Instead, we can assign the same vectors to
CPUs from the same node, while different node has different vectors. This has
below advantages:
a. if there is lock contention, the lock contention is between CPUs from one
node. This should be much cheaper than the contention between nodes.
b. completely avoid lock contention between nodes. This especially benefits
kswapd, which is the biggest user of tlb flush, since kswapd sets its affinity
to specific node.

In my test, this could reduce > 20% CPU overhead in extreme case.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 arch/x86/mm/tlb.c |   47 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/tlb.c	2010-10-13 20:40:19.000000000 +0800
+++ linux-2.6/arch/x86/mm/tlb.c	2010-10-13 23:19:26.000000000 +0800
@@ -5,6 +5,7 @@
 #include <linux/smp.h>
 #include <linux/interrupt.h>
 #include <linux/module.h>
+#include <linux/cpu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -52,6 +53,8 @@
    want false sharing in the per cpu data segment. */
 static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
 
+static int tlb_vector_offset[NR_CPUS] __read_mostly;
+
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
@@ -173,7 +176,7 @@
 	union smp_flush_state *f;
 
 	/* Caller has disabled preemption */
-	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
+	sender = tlb_vector_offset[smp_processor_id()];
 	f = &flush_state[sender];
 
 	/*
@@ -218,6 +221,46 @@
 	flush_tlb_others_ipi(cpumask, mm, va);
 }
 
+static void __cpuinit calculate_tlb_offset(void)
+{
+	int cpu, node, nr_node_vecs;
+	/*
+	 * we are changing tlb_vector_offset[] for each CPU in runtime, but this
+	 * will not cause inconsistency, as the write is atomic under X86. we
+	 * might see more lock contentions in a short time, but after all CPU's
+	 * tlb_vector_offset[] are changed, everything should go normal
+	 *
+	 * Note: if NUM_INVALIDATE_TLB_VECTORS % nr_online_nodes !=0, we might
+	 * waste some vectors.
+	 **/
+	if (nr_online_nodes > NUM_INVALIDATE_TLB_VECTORS)
+		nr_node_vecs = 1;
+	else
+		nr_node_vecs = NUM_INVALIDATE_TLB_VECTORS/nr_online_nodes;
+
+	for_each_online_node(node) {
+		int node_offset = (node % NUM_INVALIDATE_TLB_VECTORS) *
+			nr_node_vecs;
+		int cpu_offset = 0;
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			tlb_vector_offset[cpu] = node_offset + cpu_offset;
+			cpu_offset++;
+			cpu_offset = cpu_offset % nr_node_vecs;
+		}
+	}
+}
+
+static int tlb_cpuhp_notify(struct notifier_block *n,
+		unsigned long action, void *hcpu)
+{
+	switch (action & 0xf) {
+	case CPU_ONLINE:
+	case CPU_DEAD:
+		calculate_tlb_offset();
+	}
+	return NOTIFY_OK;
+}
+
 static int __cpuinit init_smp_flush(void)
 {
 	int i;
@@ -225,6 +268,8 @@
 	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
 		raw_spin_lock_init(&flush_state[i].tlbstate_lock);
 
+	calculate_tlb_offset();
+	hotcpu_notifier(tlb_cpuhp_notify, 0);
 	return 0;
 }
 core_initcall(init_smp_flush);



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-13  7:41 [patch]x86: spread tlb flush vector between nodes Shaohua Li
@ 2010-10-13  8:16 ` Andi Kleen
  2010-10-13  8:39   ` Shaohua Li
  0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2010-10-13  8:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: lkml, Ingo Molnar, hpa@zytor.com, Andi Kleen, Chen, Tim C

On Wed, Oct 13, 2010 at 03:41:38PM +0800, Shaohua Li wrote:

Hi Shaohua,

> Currently flush tlb vector allocation is based on below equation:
> 	sender = smp_processor_id() % 8
> This isn't optimal, CPUs from different node can have the same vector, this
> causes a lot of lock contention. Instead, we can assign the same vectors to
> CPUs from the same node, while different node has different vectors. This has
> below advantages:
> a. if there is lock contention, the lock contention is between CPUs from one
> node. This should be much cheaper than the contention between nodes.
> b. completely avoid lock contention between nodes. This especially benefits
> kswapd, which is the biggest user of tlb flush, since kswapd sets its affinity
> to specific node.

The original scheme with 8 vectors was designed when Linux didn't have
per CPU interrupt numbers yet, and interrupts vectors were a scarce resource.

Now that we have per CPU interrupts and there is no immediate danger 
of running out I think it's better to use more than 8 vectors for the TLB 
flushes.

Perhaps could use 32 vectors or so and give each node on a 8S 
system 4 slots and on a 4 node system 8 slots?

> In my test, this could reduce > 20% CPU overhead in extreme case.

Nice result.

> +
> +static int tlb_cpuhp_notify(struct notifier_block *n,
> +		unsigned long action, void *hcpu)
> +{
> +	switch (action & 0xf) {
> +	case CPU_ONLINE:
> +	case CPU_DEAD:
> +		calculate_tlb_offset();
> +	}
> +	return NOTIFY_OK;

I don't think we really need the complexity of a notifier here.
In most x86 setups possible is very similar to online.

So I would suggest simply to compute a static mapping at boot
and simplify the code.

In theory there is a slight danger of node<->CPU numbers
changing with consecutive hot plug actions, but right now
this should not happen anyways and it would be unlikely
later.

-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-13  8:16 ` Andi Kleen
@ 2010-10-13  8:39   ` Shaohua Li
  2010-10-13 11:19     ` Ingo Molnar
  2010-10-19  5:39     ` Shaohua Li
  0 siblings, 2 replies; 13+ messages in thread
From: Shaohua Li @ 2010-10-13  8:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: lkml, Ingo Molnar, hpa@zytor.com, Chen, Tim C

On Wed, 2010-10-13 at 16:16 +0800, Andi Kleen wrote:
> On Wed, Oct 13, 2010 at 03:41:38PM +0800, Shaohua Li wrote:
> 
> Hi Shaohua,
> 
> > Currently flush tlb vector allocation is based on below equation:
> > 	sender = smp_processor_id() % 8
> > This isn't optimal, CPUs from different node can have the same vector, this
> > causes a lot of lock contention. Instead, we can assign the same vectors to
> > CPUs from the same node, while different node has different vectors. This has
> > below advantages:
> > a. if there is lock contention, the lock contention is between CPUs from one
> > node. This should be much cheaper than the contention between nodes.
> > b. completely avoid lock contention between nodes. This especially benefits
> > kswapd, which is the biggest user of tlb flush, since kswapd sets its affinity
> > to specific node.
> 
> The original scheme with 8 vectors was designed when Linux didn't have
> per CPU interrupt numbers yet, and interrupts vectors were a scarce resource.
> 
> Now that we have per CPU interrupts and there is no immediate danger 
> of running out I think it's better to use more than 8 vectors for the TLB 
> flushes.
> 
> Perhaps could use 32 vectors or so and give each node on a 8S 
> system 4 slots and on a 4 node system 8 slots?
Haven't too much idea. Before we have per CPU interrupts, muti vector
msi-x isn't widely deployed. Thought we need data if this is really
required.

> > +
> > +static int tlb_cpuhp_notify(struct notifier_block *n,
> > +		unsigned long action, void *hcpu)
> > +{
> > +	switch (action & 0xf) {
> > +	case CPU_ONLINE:
> > +	case CPU_DEAD:
> > +		calculate_tlb_offset();
> > +	}
> > +	return NOTIFY_OK;
> 
> I don't think we really need the complexity of a notifier here.
> In most x86 setups possible is very similar to online.
> 
> So I would suggest simply to compute a static mapping at boot
> and simplify the code.
> 
> In theory there is a slight danger of node<->CPU numbers
> changing with consecutive hot plug actions, but right now
> this should not happen anyways and it would be unlikely
> later.
yes, it's unlikely. could we get the node info for a CPU before it's
hotplugged? Anyway, this doesn't take overhead.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-13  8:39   ` Shaohua Li
@ 2010-10-13 11:19     ` Ingo Molnar
  2010-10-19  5:39     ` Shaohua Li
  1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2010-10-13 11:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andi Kleen, lkml, hpa@zytor.com, Chen, Tim C


* Shaohua Li <shaohua.li@intel.com> wrote:

> On Wed, 2010-10-13 at 16:16 +0800, Andi Kleen wrote:
> > On Wed, Oct 13, 2010 at 03:41:38PM +0800, Shaohua Li wrote:
> > > +
> > > +static int tlb_cpuhp_notify(struct notifier_block *n,
> > > +		unsigned long action, void *hcpu)
> > > +{
> > > +	switch (action & 0xf) {
> > > +	case CPU_ONLINE:
> > > +	case CPU_DEAD:
> > > +		calculate_tlb_offset();
> > > +	}
> > > +	return NOTIFY_OK;
> > 
> > I don't think we really need the complexity of a notifier here.
> > In most x86 setups possible is very similar to online.
> > 
> > So I would suggest simply to compute a static mapping at boot
> > and simplify the code.
> > 
> > In theory there is a slight danger of node<->CPU numbers
> > changing with consecutive hot plug actions, but right now
> > this should not happen anyways and it would be unlikely
> > later.
>
> yes, it's unlikely. could we get the node info for a CPU before it's 
> hotplugged? Anyway, this doesn't take overhead.

It would be rather stupid to throw away the fairly simple hotplug-aware 
code you wrote and replace it with some lame boot-time calculated 
mapping just because currently it's considered rare.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-13  8:39   ` Shaohua Li
  2010-10-13 11:19     ` Ingo Molnar
@ 2010-10-19  5:39     ` Shaohua Li
  2010-10-19  6:21       ` H. Peter Anvin
  1 sibling, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2010-10-19  5:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: lkml, Ingo Molnar, hpa@zytor.com, Chen, Tim C

On Wed, 2010-10-13 at 16:39 +0800, Shaohua Li wrote:
> On Wed, 2010-10-13 at 16:16 +0800, Andi Kleen wrote:
> > On Wed, Oct 13, 2010 at 03:41:38PM +0800, Shaohua Li wrote:
> > 
> > Hi Shaohua,
> > 
> > > Currently flush tlb vector allocation is based on below equation:
> > > 	sender = smp_processor_id() % 8
> > > This isn't optimal, CPUs from different node can have the same vector, this
> > > causes a lot of lock contention. Instead, we can assign the same vectors to
> > > CPUs from the same node, while different node has different vectors. This has
> > > below advantages:
> > > a. if there is lock contention, the lock contention is between CPUs from one
> > > node. This should be much cheaper than the contention between nodes.
> > > b. completely avoid lock contention between nodes. This especially benefits
> > > kswapd, which is the biggest user of tlb flush, since kswapd sets its affinity
> > > to specific node.
> > 
> > The original scheme with 8 vectors was designed when Linux didn't have
> > per CPU interrupt numbers yet, and interrupts vectors were a scarce resource.
> > 
> > Now that we have per CPU interrupts and there is no immediate danger 
> > of running out I think it's better to use more than 8 vectors for the TLB 
> > flushes.
> > 
> > Perhaps could use 32 vectors or so and give each node on a 8S 
> > system 4 slots and on a 4 node system 8 slots?
> Haven't too much idea. Before we have per CPU interrupts, muti vector
> msi-x isn't widely deployed. Thought we need data if this is really
> required.
looks there are still some overhead with total 8 vectors in a big
machine. I'll try the 32 vectors as you suggested. I'll send separate
patches out to address the 32 vectors issue. Can we merge this patch
first?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-19  5:39     ` Shaohua Li
@ 2010-10-19  6:21       ` H. Peter Anvin
  2010-10-19  8:44         ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2010-10-19  6:21 UTC (permalink / raw)
  To: Shaohua Li, Andi Kleen; +Cc: lkml, Ingo Molnar, Chen, Tim C

Technically, it is way too late for anything new in this merge window, but we can try to make a reasonable assessment of the risk since the merge window got delayed.  However, this close to the merge window you cannot just expect to be merged even if the patch itself is OK.

"Shaohua Li" <shaohua.li@intel.com> wrote:

>On Wed, 2010-10-13 at 16:39 +0800, Shaohua Li wrote:
>> On Wed, 2010-10-13 at 16:16 +0800, Andi Kleen wrote:
>> > On Wed, Oct 13, 2010 at 03:41:38PM +0800, Shaohua Li wrote:
>> > 
>> > Hi Shaohua,
>> > 
>> > > Currently flush tlb vector allocation is based on below equation:
>> > > 	sender = smp_processor_id() % 8
>> > > This isn't optimal, CPUs from different node can have the same vector, this
>> > > causes a lot of lock contention. Instead, we can assign the same vectors to
>> > > CPUs from the same node, while different node has different vectors. This has
>> > > below advantages:
>> > > a. if there is lock contention, the lock contention is between CPUs from one
>> > > node. This should be much cheaper than the contention between nodes.
>> > > b. completely avoid lock contention between nodes. This especially benefits
>> > > kswapd, which is the biggest user of tlb flush, since kswapd sets its affinity
>> > > to specific node.
>> > 
>> > The original scheme with 8 vectors was designed when Linux didn't have
>> > per CPU interrupt numbers yet, and interrupts vectors were a scarce resource.
>> > 
>> > Now that we have per CPU interrupts and there is no immediate danger 
>> > of running out I think it's better to use more than 8 vectors for the TLB 
>> > flushes.
>> > 
>> > Perhaps could use 32 vectors or so and give each node on a 8S 
>> > system 4 slots and on a 4 node system 8 slots?
>> Haven't too much idea. Before we have per CPU interrupts, muti vector
>> msi-x isn't widely deployed. Thought we need data if this is really
>> required.
>looks there are still some overhead with total 8 vectors in a big
>machine. I'll try the 32 vectors as you suggested. I'll send separate
>patches out to address the 32 vectors issue. Can we merge this patch
>first?
>
>Thanks,
>Shaohua
>

-- 
Sent from my mobile phone.  Please pardon any lack of formatting.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-19  6:21       ` H. Peter Anvin
@ 2010-10-19  8:44         ` Ingo Molnar
  2010-10-19  8:55           ` Shaohua Li
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2010-10-19  8:44 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Shaohua Li, Andi Kleen, lkml, Chen, Tim C


* H. Peter Anvin <hpa@zytor.com> wrote:

> Technically, it is way too late for anything new in this merge window, but we can 
> try to make a reasonable assessment of the risk since the merge window got 
> delayed.  However, this close to the merge window you cannot just expect to be 
> merged even if the patch itself is OK.

a prompt re-send of the patch today-ish, with proper changelog, etc. and with the 
new tuning in place is definitely a must.

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-19  8:44         ` Ingo Molnar
@ 2010-10-19  8:55           ` Shaohua Li
  2010-10-19 10:37             ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2010-10-19  8:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: H. Peter Anvin, Andi Kleen, lkml, Chen, Tim C

On Tue, 2010-10-19 at 16:44 +0800, Ingo Molnar wrote:
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
> > Technically, it is way too late for anything new in this merge window, but we can 
> > try to make a reasonable assessment of the risk since the merge window got 
> > delayed.  However, this close to the merge window you cannot just expect to be 
> > merged even if the patch itself is OK.
> 
> a prompt re-send of the patch today-ish, with proper changelog, etc. and with the 
> new tuning in place is definitely a must.
the previous patch has changelog. what did you mean a new tuning?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-19  8:55           ` Shaohua Li
@ 2010-10-19 10:37             ` Ingo Molnar
  2010-10-19 13:28               ` Shaohua Li
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2010-10-19 10:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: H. Peter Anvin, Andi Kleen, lkml, Chen, Tim C


* Shaohua Li <shaohua.li@intel.com> wrote:

> On Tue, 2010-10-19 at 16:44 +0800, Ingo Molnar wrote:
> > * H. Peter Anvin <hpa@zytor.com> wrote:
> > 
> > > Technically, it is way too late for anything new in this merge window, but we can 
> > > try to make a reasonable assessment of the risk since the merge window got 
> > > delayed.  However, this close to the merge window you cannot just expect to be 
> > > merged even if the patch itself is OK.
> > 
> > a prompt re-send of the patch today-ish, with proper changelog, etc. and with 
> > the new tuning in place is definitely a must.
>
> the previous patch has changelog. what did you mean a new tuning?

The new tuning would be the 8->32 patch - but that would be a more complex and 
separate (and definitely controversial) patch anyway.

So if hpa gives his ack we can try this current spread-tlb-vectors-better patch in 
-tip and see how it fares. Could you please update the changelog to specify the 20% 
improvement more precisely? What kind of workload was used and how was the 
improvement measured?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-19 10:37             ` Ingo Molnar
@ 2010-10-19 13:28               ` Shaohua Li
  2010-10-19 13:34                 ` Andi Kleen
  0 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2010-10-19 13:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: H. Peter Anvin, Andi Kleen, lkml, Chen, Tim C

On Tue, 2010-10-19 at 18:37 +0800, Ingo Molnar wrote:
> * Shaohua Li <shaohua.li@intel.com> wrote:
> 
> > On Tue, 2010-10-19 at 16:44 +0800, Ingo Molnar wrote:
> > > * H. Peter Anvin <hpa@zytor.com> wrote:
> > > 
> > > > Technically, it is way too late for anything new in this merge window, but we can 
> > > > try to make a reasonable assessment of the risk since the merge window got 
> > > > delayed.  However, this close to the merge window you cannot just expect to be 
> > > > merged even if the patch itself is OK.
> > > 
> > > a prompt re-send of the patch today-ish, with proper changelog, etc. and with 
> > > the new tuning in place is definitely a must.
> >
> > the previous patch has changelog. what did you mean a new tuning?
> 
> The new tuning would be the 8->32 patch - but that would be a more complex and 
> separate (and definitely controversial) patch anyway.
Yes, I'm working on it. And it hasn't too many relation with this patch.

> So if hpa gives his ack we can try this current spread-tlb-vectors-better patch in 
> -tip and see how it fares. Could you please update the changelog to specify the 20% 
> improvement more precisely? What kind of workload was used and how was the 
> improvement measured?
Below is the updated patch which describes the workload I use.


Currently flush tlb vector allocation is based on below equation:
	sender = smp_processor_id() % 8
This isn't optimal, CPUs from different node can have the same vector, this
causes a lot of lock contention. Instead, we can assign the same vectors to
CPUs from the same node, while different node has different vectors. This has
below advantages:
a. if there is lock contention, the lock contention is between CPUs from one
node. This should be much cheaper than the contention between nodes.
b. completely avoid lock contention between nodes. This especially benefits
kswapd, which is the biggest user of tlb flush, since kswapd sets its affinity
to specific node.

In my test, this could reduce > 20% CPU overhead in extreme case. The test
machine has 4 nodes and each node has 16 CPUs. I then bind each node's kswapd
to the first CPU of the node. I run a workload with 4 sequential mmap file
read thread. The files are empty sparse file. This workload will trigger a
lot of page reclaim and tlbflush. The kswapd bind is to easy trigger the
extreme tlb flush lock contention because otherwise kswapd keeps migrating
between CPUs of a node and I can't get stable result. Sure in real workload,
we can't always see so big tlb flush lock contention, but it's possible.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 arch/x86/mm/tlb.c |   47 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/tlb.c	2010-10-13 20:40:19.000000000 +0800
+++ linux-2.6/arch/x86/mm/tlb.c	2010-10-13 23:19:26.000000000 +0800
@@ -5,6 +5,7 @@
 #include <linux/smp.h>
 #include <linux/interrupt.h>
 #include <linux/module.h>
+#include <linux/cpu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -52,6 +53,8 @@
    want false sharing in the per cpu data segment. */
 static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
 
+static int tlb_vector_offset[NR_CPUS] __read_mostly;
+
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
@@ -173,7 +176,7 @@
 	union smp_flush_state *f;
 
 	/* Caller has disabled preemption */
-	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
+	sender = tlb_vector_offset[smp_processor_id()];
 	f = &flush_state[sender];
 
 	/*
@@ -218,6 +221,46 @@
 	flush_tlb_others_ipi(cpumask, mm, va);
 }
 
+static void __cpuinit calculate_tlb_offset(void)
+{
+	int cpu, node, nr_node_vecs;
+	/*
+	 * we are changing tlb_vector_offset[] for each CPU in runtime, but this
+	 * will not cause inconsistency, as the write is atomic under X86. we
+	 * might see more lock contentions in a short time, but after all CPU's
+	 * tlb_vector_offset[] are changed, everything should go normal
+	 *
+	 * Note: if NUM_INVALIDATE_TLB_VECTORS % nr_online_nodes !=0, we might
+	 * waste some vectors.
+	 **/
+	if (nr_online_nodes > NUM_INVALIDATE_TLB_VECTORS)
+		nr_node_vecs = 1;
+	else
+		nr_node_vecs = NUM_INVALIDATE_TLB_VECTORS/nr_online_nodes;
+
+	for_each_online_node(node) {
+		int node_offset = (node % NUM_INVALIDATE_TLB_VECTORS) *
+			nr_node_vecs;
+		int cpu_offset = 0;
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			tlb_vector_offset[cpu] = node_offset + cpu_offset;
+			cpu_offset++;
+			cpu_offset = cpu_offset % nr_node_vecs;
+		}
+	}
+}
+
+static int tlb_cpuhp_notify(struct notifier_block *n,
+		unsigned long action, void *hcpu)
+{
+	switch (action & 0xf) {
+	case CPU_ONLINE:
+	case CPU_DEAD:
+		calculate_tlb_offset();
+	}
+	return NOTIFY_OK;
+}
+
 static int __cpuinit init_smp_flush(void)
 {
 	int i;
@@ -225,6 +268,8 @@
 	for (i = 0; i < ARRAY_SIZE(flush_state); i++)
 		raw_spin_lock_init(&flush_state[i].tlbstate_lock);
 
+	calculate_tlb_offset();
+	hotcpu_notifier(tlb_cpuhp_notify, 0);
 	return 0;
 }
 core_initcall(init_smp_flush);



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-19 13:28               ` Shaohua Li
@ 2010-10-19 13:34                 ` Andi Kleen
  2010-10-20  1:13                   ` Shaohua Li
  0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2010-10-19 13:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, lkml, Chen, Tim C

>  
> +static int tlb_vector_offset[NR_CPUS] __read_mostly;

Never use NR_CPUS. Always use per cpu data.

Otherwise you waste a lot of space on a CONFIG_MAX_SMP
kernel running on a smaller box.

-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-19 13:34                 ` Andi Kleen
@ 2010-10-20  1:13                   ` Shaohua Li
  2010-10-20  2:39                     ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2010-10-20  1:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, H. Peter Anvin, lkml, Chen, Tim C

On Tue, 2010-10-19 at 21:34 +0800, Andi Kleen wrote:
> >  
> > +static int tlb_vector_offset[NR_CPUS] __read_mostly;
> 
> Never use NR_CPUS. Always use per cpu data.
> 
> Otherwise you waste a lot of space on a CONFIG_MAX_SMP
> kernel running on a smaller box.
ha, I want it to be __read_mostly to avoid cache pollution, apparently
we have no per cpu API to do this. Maybe I need add one.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch]x86: spread tlb flush vector between nodes
  2010-10-20  1:13                   ` Shaohua Li
@ 2010-10-20  2:39                     ` H. Peter Anvin
  0 siblings, 0 replies; 13+ messages in thread
From: H. Peter Anvin @ 2010-10-20  2:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andi Kleen, Ingo Molnar, lkml, Chen, Tim C

On 10/19/2010 06:13 PM, Shaohua Li wrote:
> On Tue, 2010-10-19 at 21:34 +0800, Andi Kleen wrote:
>>>  
>>> +static int tlb_vector_offset[NR_CPUS] __read_mostly;
>>
>> Never use NR_CPUS. Always use per cpu data.
>>
>> Otherwise you waste a lot of space on a CONFIG_MAX_SMP
>> kernel running on a smaller box.
> ha, I want it to be __read_mostly to avoid cache pollution, apparently
> we have no per cpu API to do this. Maybe I need add one.
> 

Quite possible, however, definitely percpu over a static array.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2010-10-20  2:40 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-13  7:41 [patch]x86: spread tlb flush vector between nodes Shaohua Li
2010-10-13  8:16 ` Andi Kleen
2010-10-13  8:39   ` Shaohua Li
2010-10-13 11:19     ` Ingo Molnar
2010-10-19  5:39     ` Shaohua Li
2010-10-19  6:21       ` H. Peter Anvin
2010-10-19  8:44         ` Ingo Molnar
2010-10-19  8:55           ` Shaohua Li
2010-10-19 10:37             ` Ingo Molnar
2010-10-19 13:28               ` Shaohua Li
2010-10-19 13:34                 ` Andi Kleen
2010-10-20  1:13                   ` Shaohua Li
2010-10-20  2:39                     ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox