linux-kernel.vger.kernel.org archive mirror
* [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive.
@ 2010-08-18 16:56 Robin Holt
  2010-08-18 18:30 ` [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2 Robin Holt
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Holt @ 2010-08-18 16:56 UTC (permalink / raw)
  To: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar
  Cc: H. Peter Anvin, x86, Yinghai Lu, Linus Torvalds, Joerg Roedel,
	Andi Kleen, Linux Kernel, Stable Maintainers


While testing on a 256-node, 4096-cpu system, Jack Steiner noticed
that we were spending between 0.08% (average) and 0.8% (max) of every
second in vmstat_update.  This could be tuned down using the
stat_interval sysctl, but that merely reduced the impact of the problem.

When I investigated, I noticed that all the node_data[] structures are
allocated precisely at the beginning of the individual node's physical
memory.  By simply staggering the allocations based upon nodeid, I
reduced the average down to 0.0006% of every second.
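
For illustration (assuming SMP_CACHE_BYTES is 64 on this machine), the
staggered allocations look like:

	node 0: node_data at node 0's base +   0 bytes (cacheline 0)
	node 1: node_data at node 1's base +  64 bytes (cacheline 1)
	node 2: node_data at node 2's base + 128 bytes (cacheline 2)
	...

so the structures no longer all start at the same cache set.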

With this patch, the max value did not change.  I believe that is a
combination of cacheline contention while updating the zone's vmstat
information and round_jiffies_common() scattering unrelated cpus onto
the same jiffy for their next update.  I will investigate those issues
separately.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@ppc970.osdl.org>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Stable Maintainers <stable@kernel.org>

---

This patch applies cleanly to v2.6.34 and later.  It manually applies
to previous kernels but the x86-bootmem fixes introduce differences in
the surrounding areas.

I had no idea whether to ask stable@kernel.org to pull this back to the
stable releases.  My reading of the stable_kernel_rules.txt criteria is
only fuzzy as to whether this meets the "oh, that's not good" standard.
I personally think it meets that criterion, but I am unwilling to defend
that position too stridently.  In the end, I punted and added them to
the Cc list.  We will be asking both SUSE and Red Hat to add this to
their upcoming update releases, as we expect it to affect their
customers.

 arch/x86/mm/numa_64.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: round_jiffies/arch/x86/mm/numa_64.c
===================================================================
--- round_jiffies.orig/arch/x86/mm/numa_64.c	2010-08-18 11:39:20.495141178 -0500
+++ round_jiffies/arch/x86/mm/numa_64.c	2010-08-18 11:47:18.391210989 -0500
@@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
 	unsigned long start_pfn, last_pfn, nodedata_phys;
 	const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
 	int nid;
+	int cache_alias_offset;
 #ifndef CONFIG_NO_BOOTMEM
 	unsigned long bootmap_start, bootmap_pages, bootmap_size;
 	void *bootmap;
@@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
 	start_pfn = start >> PAGE_SHIFT;
 	last_pfn = end >> PAGE_SHIFT;
 
-	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
+	/*
+	 * Allocate an extra cacheline per node to reduce cacheline
+	 * aliasing when scanning all nodes' node_data.
+	 */
+	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
+	node_data[nodeid] = cache_alias_offset +
+			    early_node_mem(nodeid, start, end,
+					   pgdat_size + cache_alias_offset,
 					   SMP_CACHE_BYTES);
-	if (node_data[nodeid] == NULL)
+	if (node_data[nodeid] == cache_alias_offset)
 		return;
 	nodedata_phys = __pa(node_data[nodeid]);
 	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");


* [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-18 16:56 [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive Robin Holt
@ 2010-08-18 18:30 ` Robin Holt
  2010-08-19 17:30   ` Roedel, Joerg
  2010-08-19 22:54   ` H. Peter Anvin
  0 siblings, 2 replies; 14+ messages in thread
From: Robin Holt @ 2010-08-18 18:30 UTC (permalink / raw)
  To: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar
  Cc: H. Peter Anvin, x86, Yinghai Lu, Linus Torvalds, Joerg Roedel,
	Linux Kernel, Stable Maintainers


While testing on a 256-node, 4096-cpu system, Jack Steiner noticed
that we were spending between 0.08% (average) and 0.8% (max) of every
second in vmstat_update.  This could be tuned down using the
stat_interval sysctl, but that merely reduced the impact of the problem.

When I investigated, I noticed that all the node_data[] structures are
allocated precisely at the beginning of the individual node's physical
memory.  By simply staggering the allocations based upon nodeid, I
reduced the average down to 0.0006% of every second.

With this patch, the max value did not change.  I believe that is a
combination of cacheline contention while updating the zone's vmstat
information and round_jiffies_common() scattering unrelated cpus onto
the same jiffy for their next update.  I will investigate those issues
separately.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Stable Maintainers <stable@kernel.org>

---

Changes from -V1: cleaned up a compile warning.

This patch applies cleanly to v2.6.34 and later.  It manually applies
to previous kernels but the x86-bootmem fixes introduce differences in
the surrounding areas.

I had no idea whether to ask stable@kernel.org to pull this back to the
stable releases.  My reading of the stable_kernel_rules.txt criteria is
only fuzzy as to whether this meets the "oh, that's not good" standard.
I personally think it meets that criterion, but I am unwilling to defend
that position too stridently.  In the end, I punted and added them to
the Cc list.  We will be asking both SUSE and Red Hat to add this to
their upcoming update releases, as we expect it to affect their
customers.

 arch/x86/mm/numa_64.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: round_jiffies/arch/x86/mm/numa_64.c
===================================================================
--- round_jiffies.orig/arch/x86/mm/numa_64.c	2010-08-18 11:39:20.495141178 -0500
+++ round_jiffies/arch/x86/mm/numa_64.c	2010-08-18 13:24:42.679080103 -0500
@@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
 	unsigned long start_pfn, last_pfn, nodedata_phys;
 	const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
 	int nid;
+	unsigned long cache_alias_offset;
 #ifndef CONFIG_NO_BOOTMEM
 	unsigned long bootmap_start, bootmap_pages, bootmap_size;
 	void *bootmap;
@@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
 	start_pfn = start >> PAGE_SHIFT;
 	last_pfn = end >> PAGE_SHIFT;
 
-	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
+	/*
+	 * Allocate an extra cacheline per node to reduce cacheline
+	 * aliasing when scanning all nodes' node_data.
+	 */
+	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
+	node_data[nodeid] = cache_alias_offset +
+			    early_node_mem(nodeid, start, end,
+					   pgdat_size + cache_alias_offset,
 					   SMP_CACHE_BYTES);
-	if (node_data[nodeid] == NULL)
+	if (node_data[nodeid] == (void *)cache_alias_offset)
 		return;
 	nodedata_phys = __pa(node_data[nodeid]);
 	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-18 18:30 ` [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2 Robin Holt
@ 2010-08-19 17:30   ` Roedel, Joerg
  2010-08-19 20:42     ` Robin Holt
  2010-08-19 22:54   ` H. Peter Anvin
  1 sibling, 1 reply; 14+ messages in thread
From: Roedel, Joerg @ 2010-08-19 17:30 UTC (permalink / raw)
  To: Robin Holt
  Cc: Jack Steiner, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, Yinghai Lu, Linus Torvalds, Linux Kernel,
	Stable Maintainers

On Wed, Aug 18, 2010 at 02:30:24PM -0400, Robin Holt wrote:
> When I investigated, I noticed that all the node_data[] structures are
> allocated precisely at the beginning of the individual node's physical
> memory.  By simply staggering the allocations based upon nodeid, I
> reduced the average down to 0.0006% of every second.

Interesting. Have you measured cache misses for both cases?

> With this patch, the max value did not change.  I believe that is a
> combination of cacheline contention while updating the zone's vmstat
> information and round_jiffies_common() scattering unrelated cpus onto
> the same jiffy for their next update.  I will investigate those issues
> separately.

The max value is probably the case where none of the data the code
accesses is actually in the cache.  Reducing cache aliasing does not
help then, so I would not expect a change in the maximum here.

> I had no idea whether to ask stable@kernel.org to pull this back to the
> stable releases.  My reading of the stable_kernel_rules.txt criteria is
> only fuzzy as to whether this meets the "oh, that's not good" standard.
> I personally think it meets that criterion, but I am unwilling to defend
> that position too stridently.  In the end, I punted and added them to
> the Cc list.  We will be asking both SUSE and Red Hat to add this to
> their upcoming update releases, as we expect it to affect their
> customers.

I don't think this is stable material. It improves performance and does
not fix a bug. But I am not the one to decide this :-)

> 
>  arch/x86/mm/numa_64.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> Index: round_jiffies/arch/x86/mm/numa_64.c
> ===================================================================
> --- round_jiffies.orig/arch/x86/mm/numa_64.c	2010-08-18 11:39:20.495141178 -0500
> +++ round_jiffies/arch/x86/mm/numa_64.c	2010-08-18 13:24:42.679080103 -0500
> @@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
>  	unsigned long start_pfn, last_pfn, nodedata_phys;
>  	const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
>  	int nid;
> +	unsigned long cache_alias_offset;
>  #ifndef CONFIG_NO_BOOTMEM
>  	unsigned long bootmap_start, bootmap_pages, bootmap_size;
>  	void *bootmap;
> @@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
>  	start_pfn = start >> PAGE_SHIFT;
>  	last_pfn = end >> PAGE_SHIFT;
>  
> -	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
> +	/*
> +	 * Allocate an extra cacheline per node to reduce cacheline
> +	 * aliasing when scanning all nodes' node_data.
> +	 */
> +	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
> +	node_data[nodeid] = cache_alias_offset +
> +			    early_node_mem(nodeid, start, end,
> +					   pgdat_size + cache_alias_offset,
>  					   SMP_CACHE_BYTES);
> -	if (node_data[nodeid] == NULL)
> +	if (node_data[nodeid] == (void *)cache_alias_offset)

It is cleaner to keep the NULL check here and add the cache_alias_offset
to the pointer after that check. 

>  		return;
>  	nodedata_phys = __pa(node_data[nodeid]);
>  	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");
> 

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-19 17:30   ` Roedel, Joerg
@ 2010-08-19 20:42     ` Robin Holt
  2010-08-19 22:02       ` Robin Holt
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Holt @ 2010-08-19 20:42 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86@kernel.org, Yinghai Lu, Linus Torvalds,
	Linux Kernel, Stable Maintainers


I hope my tone in this email comes across as measured as I feel.  I am
not completely confident that I have not missed something which may have
had an effect, but I cannot see what that would be.  I believe I am
correct, but am willing to be shown to be wrong.

On Thu, Aug 19, 2010 at 07:30:13PM +0200, Roedel, Joerg wrote:
> On Wed, Aug 18, 2010 at 02:30:24PM -0400, Robin Holt wrote:
> > When I investigated, I noticed that all the node_data[] structures are
> > allocated precisely at the beginning of the individual node's physical
> > memory.  By simply staggering the allocations based upon nodeid, I
> > reduced the average down to 0.0006% of every second.
> 
> Interesting. Have you measured cache misses for both cases?

Not directly.  The change made such a compelling difference, and in
such a manner, as to strongly suggest it was cache aliasing.

As a result of your question, I did some back-of-the-envelope
calculations.  The shared L3 cache has 12,288 sets and is 24-way
associative, for an 18MB cache.  The sizeof(struct zone) is 1792 bytes,
or 28 cache lines, so one zone spans 28 sets without overflowing the
set space.  Without the stride value, every node's zone maps onto the
same 28 sets, and with 24 ways we would exhaust the L3 at 24 nodes.
With the single-line stride, we increase that to 28 nodes per set, 24
ways, or 672 nodes before we exhaust the L3 cache.
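
Spelling out the arithmetic (assuming 64-byte cache lines, as in the
figures above):

	18MB / 64 bytes per line           = 294,912 lines
	294,912 lines / 24 ways            =  12,288 sets
	sizeof(struct zone) = 1792 bytes   =      28 lines per zone
	without stride:  24 ways           ->  24 nodes
	one-line stride: 28 sets * 24 ways -> 672 nodes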

This seems like a plausible explanation for the problem, and I do not
see any benefit to further testing unless you are concerned that I have
done something significantly wrong in the calculation above, or you can
see some assumption I am making which is wrong.

> > With this patch, the max value did not change.  I believe that is a
> > combination of cacheline contention while updating the zone's vmstat
> > information and round_jiffies_common() scattering unrelated cpus onto
> > the same jiffy for their next update.  I will investigate those issues
> > separately.
> 
> The max value is probably the case where none of the data the code
> accesses is actually in the cache.  Reducing cache aliasing does not
> help then, so I would not expect a change in the maximum here.

The only reason I look at the other possible causes is a very weak
data point I have.  The trace code I was using to narrow down the
slowdown area within refresh_cpu_vm_stats captured a couple of data
points where a single cpu had discovered a node with pages present, and
the scan of the per-cpu values plus the addition of those values back
into the node_data's zone took a very long time for the two values
being transferred in each instance.  That seems more likely to have
been separate sockets contending for exclusive access to the same
cacheline on the node_data's zone than a cache refill issue.

I do not disagree that fully refilling all node_data's zone information
would take time, but I would expect that to be a very infrequent
occurrence, since round_jiffies_common() ends up scheduling one thread
on each socket to do this calculation on each tick.

> > I had no idea whether to ask stable@kernel.org to pull this back to the
> > stable releases.  My reading of the stable_kernel_rules.txt criteria is
> > only fuzzy as to whether this meets the "oh, that's not good" standard.
> > I personally think it meets that criterion, but I am unwilling to defend
> > that position too stridently.  In the end, I punted and added them to
> > the Cc list.  We will be asking both SUSE and Red Hat to add this to
> > their upcoming update releases, as we expect it to affect their
> > customers.
> 
> I don't think this is stable material. It improves performance and does
> not fix a bug. But I am not the one to decide this :-)

The only reason I think it qualifies is that we are talking about 0.8%
of each cpu's time.  That means that on the 4096 cpu system, we are
dedicating the equivalent of 32 cpus to just vmstat_update.  That feels
like it meets the "oh, that's not good" standard to me.  Compare the
relatively minor chance of negative impact to the known positive impact
and it becomes slightly more compelling.

Thanks,
Robin


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-19 20:42     ` Robin Holt
@ 2010-08-19 22:02       ` Robin Holt
  0 siblings, 0 replies; 14+ messages in thread
From: Robin Holt @ 2010-08-19 22:02 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86@kernel.org, Yinghai Lu, Linus Torvalds,
	Linux Kernel, Stable Maintainers

> > I don't think this is stable material. It improves performance and does
> > not fix a bug. But I am not the one to decide this :-)
> 
> The only reason I think it qualifies is that we are talking about 0.8%
> of each cpu's time.  That means that on the 4096 cpu system, we are
> dedicating the equivalent of 32 cpus to just vmstat_update.  That feels
> like it

Wrong!  0.08%, not 0.8%, and therefore 3.2 cpus, not 32.

Sorry for the confusion,
Robin


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-18 18:30 ` [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2 Robin Holt
  2010-08-19 17:30   ` Roedel, Joerg
@ 2010-08-19 22:54   ` H. Peter Anvin
  2010-08-20 13:58     ` Robin Holt
  1 sibling, 1 reply; 14+ messages in thread
From: H. Peter Anvin @ 2010-08-19 22:54 UTC (permalink / raw)
  To: Robin Holt
  Cc: Jack Steiner, Thomas Gleixner, Ingo Molnar, x86, Yinghai Lu,
	Linus Torvalds, Joerg Roedel, Linux Kernel, Stable Maintainers

On 08/18/2010 11:30 AM, Robin Holt wrote:
>  
> -	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
> +	/*
> +	 * Allocate an extra cacheline per node to reduce cacheline
> +	 * aliasing when scanning all nodes' node_data.
> +	 */
> +	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
> +	node_data[nodeid] = cache_alias_offset +
> +			    early_node_mem(nodeid, start, end,
> +					   pgdat_size + cache_alias_offset,
>  					   SMP_CACHE_BYTES);
> -	if (node_data[nodeid] == NULL)
> +	if (node_data[nodeid] == (void *)cache_alias_offset)
>  		return;
>  	nodedata_phys = __pa(node_data[nodeid]);
>  	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");

I'm concerned about this, because it really seems to rely on subtleties
in the behavior of early_node_mem, as well as the direction of
find_e820_area -- which is pretty much intended to change anyway.  It's
the "action at a distance" effect.

What we really want, I think, is to push the offsetting into
find_early_area().  Right now we have an alignment parameter, but what
we need is an alignment and a color parameter (this is just a classic
case of cache coloring, after all) which indicates the desirable offset
from the alignment base.

	-hpa


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-19 22:54   ` H. Peter Anvin
@ 2010-08-20 13:58     ` Robin Holt
  2010-08-20 15:03       ` Robin Holt
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Holt @ 2010-08-20 13:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar, x86,
	Yinghai Lu, Linus Torvalds, Joerg Roedel, Linux Kernel,
	Stable Maintainers

On Thu, Aug 19, 2010 at 03:54:36PM -0700, H. Peter Anvin wrote:
> On 08/18/2010 11:30 AM, Robin Holt wrote:
> >  
> > -	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
> > +	/*
> > +	 * Allocate an extra cacheline per node to reduce cacheline
> > +	 * aliasing when scanning all nodes' node_data.
> > +	 */
> > +	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
> > +	node_data[nodeid] = cache_alias_offset +
> > +			    early_node_mem(nodeid, start, end,
> > +					   pgdat_size + cache_alias_offset,
> >  					   SMP_CACHE_BYTES);
> > -	if (node_data[nodeid] == NULL)
> > +	if (node_data[nodeid] == (void *)cache_alias_offset)
> >  		return;
> >  	nodedata_phys = __pa(node_data[nodeid]);
> >  	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");
> 
> I'm concerned about this, because it really seems to rely on subtleties
> in the behavior of early_node_mem, as well as the direction of
> find_e820_area -- which is pretty much intended to change anyway.  It's
> the "action at a distance" effect.
> 
> What we really want, I think, is to push the offsetting into
> find_early_area().  Right now we have an alignment parameter, but what
> we need is an alignment and a color parameter (this is just a classic
> case of cache coloring, after all) which indicates the desirable offset
> from the alignment base.

That sounds reasonable.  Are there other examples you can think of which
I can build upon?

Robin


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-20 13:58     ` Robin Holt
@ 2010-08-20 15:03       ` Robin Holt
  2010-08-20 16:16         ` H. Peter Anvin
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Holt @ 2010-08-20 15:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar, x86,
	Yinghai Lu, Linus Torvalds, Joerg Roedel, Linux Kernel,
	Stable Maintainers

On Fri, Aug 20, 2010 at 08:58:22AM -0500, Robin Holt wrote:
> On Thu, Aug 19, 2010 at 03:54:36PM -0700, H. Peter Anvin wrote:
> > On 08/18/2010 11:30 AM, Robin Holt wrote:
> > >  
> > > -	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
> > > +	/*
> > > +	 * Allocate an extra cacheline per node to reduce cacheline
> > > +	 * aliasing when scanning all nodes' node_data.
> > > +	 */
> > > +	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
> > > +	node_data[nodeid] = cache_alias_offset +
> > > +			    early_node_mem(nodeid, start, end,
> > > +					   pgdat_size + cache_alias_offset,
> > >  					   SMP_CACHE_BYTES);
> > > -	if (node_data[nodeid] == NULL)
> > > +	if (node_data[nodeid] == (void *)cache_alias_offset)
> > >  		return;
> > >  	nodedata_phys = __pa(node_data[nodeid]);
> > >  	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");
> > 
> > I'm concerned about this, because it really seems to rely on subtleties
> > in the behavior of early_node_mem, as well as the direction of
> > find_e820_area -- which is pretty much intended to change anyway.  It's
> > the "action at a distance" effect.
> > 
> > What we really want, I think, is to push the offsetting into
> > find_early_area().  Right now we have an alignment parameter, but what
> > we need is an alignment and a color parameter (this is just a classic
> > case of cache coloring, after all) which indicates the desirable offset
> > from the alignment base.
> 
> That sounds reasonable.  Are there other examples you can think of which
> I can build upon?

I think this is more difficult than I would like.  The difficulty comes
in determining the "alignment base".  I decided to look at a machine
and see how I would manually determine color.  The first question I
ran into was "What is the color with respect to?"  Is it an L3 color,
an L2, or an L1?  I then decided to punt and assume it was L3 (as that
is what I am personally concerned with here).  I looked at cpu0's L3
and noticed I had 12,288 possible colors for the first line of my
allocation.  I thought, splendid, I now have both my alignment base and
alignment offset.

Then, I looked around the machine.  This particular machine was not the
same as the one on which I had recorded the first statistics.  This one
has 1024 cpus (in its current configuration).  Some sockets have an 18MB
cache, others have a 24MB cache.  Likewise, some have 12,288 colors
while others have 16,384 colors.  In this case, which do I use for my
base and alignment calculation?

Lastly, if I were to use the L3 number_of_sets, I would need to hold
off on these allocations until after the init_intel() call is done,
which is well after the point where we currently do them.
Alternatively, I could use cpuid() to calculate the L3 size, but that
means putting cpu-specific knowledge into the e820 allocator.
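
For concreteness, this is roughly what that cpuid-based calculation
would look like, using CPUID leaf 4 (the deterministic cache parameters
leaf).  This is an illustrative sketch only, not proposed code:

	unsigned int eax, ebx, ecx, edx, i;

	/* Walk the cache levels reported by CPUID leaf 4. */
	for (i = 0; ; i++) {
		cpuid_count(4, i, &eax, &ebx, &ecx, &edx);
		if (!(eax & 0x1f))			/* no more caches */
			break;
		if (((eax >> 5) & 0x7) == 3) {		/* level 3 cache */
			unsigned int sets = ecx + 1;
			unsigned int ways = ((ebx >> 22) & 0x3ff) + 1;
			/* 'sets' is the number of available colors */
		}
	}

which is exactly the sort of cpu-specific knowledge I would rather keep
out of the allocator.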

In short, without the cpu information, I think we are heading back to
as much of a kludge as I had originally submitted.  We could assume the
sets will never span more than some large value like 16MB, but that
runs the risk of wasting a large amount of memory.

Alternatively, we could base the color value upon something very
concrete.  For this particular allocation, we have an array of
structures whose elements are 1792 bytes long (28 cache lines).  If I
specify an offset of 29 lines, it merely means the first element of my
newly allocated array is now going to collide with the first
allocation's second element.  I really see no advantage to allocating
further space.  The advantage of this method is that it entirely
removes the processor configuration from the question.  It also keeps
the offset calculation from polluting the e820 allocator.  Basically,
the change remains localized.
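
A sketch of that stride, reusing the 12,288-set figure from earlier:

	stride = 29 cache lines * 64 bytes = 1856 bytes
	node 0's array starts at set  0
	node 1's array starts at set 29
	node 2's array starts at set 58
	...

so node 1's first element lands among the sets used by node 0's second
element, and the starting positions do not wrap around the 12,288 sets
until roughly 12,288 / 29, or about 423, nodes.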

Thanks,
Robin


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-20 15:03       ` Robin Holt
@ 2010-08-20 16:16         ` H. Peter Anvin
  2010-08-21 13:07           ` Robin Holt
  0 siblings, 1 reply; 14+ messages in thread
From: H. Peter Anvin @ 2010-08-20 16:16 UTC (permalink / raw)
  To: Robin Holt
  Cc: Jack Steiner, Thomas Gleixner, Ingo Molnar, x86, Yinghai Lu,
	Linus Torvalds, Joerg Roedel, Linux Kernel, Stable Maintainers

On 08/20/2010 08:03 AM, Robin Holt wrote:
> 
> In short, without the cpu information, I think we are heading back to
> as much of a kludge as I had originally submitted.  We could assume
> the number of sets will always be less than some large value like 16MB,
> but that runs the risk of wasting a large amount of memory.
> 
> Alternatively, we could base the color value upon something very concrete.
> For this particular allocation, we have an array of structures whose
> elements are 1792 bytes long (28 cache lines).  If I specify an offset
> of 29, it merely means the first element of my newly allocated array
> is now going to collide with the first allocation's second element.
> I really see no advantage to further allocating space.  The advantage
> to this method is it entirely removes the processor configuration from
> the question.  It allows me to keep the offset calculation from polluting
> the e820 allocator as well.  Basically, the change remains localized.
> 

That's pretty much the idea, really.  However, I don't think you can
localize the change without making invalid assumptions about the
behavior of lower primitives, which, given that changes are in progress,
will cause serious problems.

The issue here is that the e820 allocator (which is about to be axed,
but its successor will need to support similar operations) has a few
parameters that it takes in: (start, end, size, alignment).  It will
return a block at address (addr) fulfilling the requirements:

	start <= addr
	addr+size <= end
	(addr % alignment) == 0

However, for coloring (which is what you're doing here, coloring doesn't
have to be precise) what you really want is for the last constraint to
read like:

	start <= addr
	addr+size <= end
	(addr % alignment) == offset

You can leave your alignment some arbitrarily large value (in the case
of your 1792-byte structure, you can make the observation that
1792*4096 < 8 MiB) and the offset is simply 1792*(node number).  This
will over-color massively, of course, *but you're not allocating memory
you don't need* and so it doesn't really matter.
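
To make the constraint concrete, here is a minimal sketch of a search
satisfying it -- find_early_area_color is a hypothetical name here, not
an existing interface:

	/*
	 * Find a block of 'size' bytes in [start, end) whose address
	 * is congruent to 'offset' modulo 'alignment'.  Sketch only:
	 * a real implementation must also verify the range is free.
	 */
	static unsigned long find_early_area_color(unsigned long start,
						   unsigned long end,
						   unsigned long size,
						   unsigned long alignment,
						   unsigned long offset)
	{
		/* first candidate at or above the next alignment boundary */
		unsigned long addr = roundup(start, alignment) + offset;

		if (addr + size > end)
			return -1UL;	/* nothing suitable */
		return addr;		/* (addr % alignment) == offset */
	}

For the node_data case, the caller would pass alignment = 8 MiB and
offset = 1792*(node number).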

However, this does mean that there is a need to be able to pass the
offset parameter down to the allocator.  Not doing that will either mean
wasting a huge amount of memory or relying on internal behavior of the
allocator which is already scheduled to change.

Does this make sense?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-20 16:16         ` H. Peter Anvin
@ 2010-08-21 13:07           ` Robin Holt
  2010-08-23 21:42             ` H. Peter Anvin
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Holt @ 2010-08-21 13:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar, x86,
	Yinghai Lu, Linus Torvalds, Joerg Roedel, Linux Kernel,
	Stable Maintainers

> The issue here is that the e820 allocator (which is about to be axed,
> but its successor will need to support similar operations) has a few
> parameters that it takes in: (start, end, size, alignment).  It will
> return a block at address (addr) fulfilling the requirements:
> 
> 	start <= addr
> 	addr+size <= end
> 	(addr % alignment) == 0
> 
> However, for coloring (which is what you're doing here, coloring doesn't
> have to be precise) what you really want is for the last constraint to
> read like:
> 
> 	start <= addr
> 	addr+size <= end
> 	(addr % alignment) == offset
> 
> You can leave your alignment some arbitrarily large value (in the case
> of your 1792-byte structure, you can make the observation that
> 1792*4096 < 8 MiB) and the offset is simply 1792*(node number).  This
> will over-color massively, of course, *but you're not allocating memory
> you don't need* and so it doesn't really matter.
> 
> However, this does mean that there is a need to be able to pass the
> offset parameter down to the allocator.  Not doing that will either mean
> wasting a huge amount of memory or relying on internal behavior of the
> allocator which is already scheduled to change.
> 
> Does this make sense?

It does make sense.

I am not sure how to proceed.  You are saying the e820 allocator is
being replaced, yet it is the allocator used for this section of code.
I feel sort of foolish tweaking the e820 allocator to handle color only
to have it replaced in the near future.

Add to that, this simple fix is enough to break up the most egregious
problem, which is the scanning of all zones in the system and checking
each zone's present_pages.  If this is adequate, would you accept this
simple patch for now, with the expectation that the e820 replacement
allocator will support color later, plus a simple follow-up patch to
fix up the node_data allocations?

Thanks,
Robin


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-21 13:07           ` Robin Holt
@ 2010-08-23 21:42             ` H. Peter Anvin
  2010-08-25 11:08               ` Robin Holt
  0 siblings, 1 reply; 14+ messages in thread
From: H. Peter Anvin @ 2010-08-23 21:42 UTC (permalink / raw)
  To: Robin Holt
  Cc: Jack Steiner, Thomas Gleixner, Ingo Molnar, x86, Yinghai Lu,
	Linus Torvalds, Joerg Roedel, Linux Kernel, Stable Maintainers

On 08/21/2010 06:07 AM, Robin Holt wrote:
> 
> It does make sense.
> 
> I am not sure how to proceed.  You are saying the e820 allocator is
> being replaced, yet it is the allocator used for this section of code.
> I feel sort of foolish tweaking the e820 allocator to handle color only
> to have it replaced in the near future.
> 
> Add to that, this simple fix is enough to break up the most egregious
> problem, which is the scanning of all zones in the system and checking
> each zone's present_pages.  If this is adequate, would you accept this
> simple patch for now, with the expectation that the e820 replacement
> allocator will support color later, plus a simple follow-up patch to
> fix up the node_data allocations?
> 

No, this isn't really the right way to go about it.

The whole point is to avoid reliance on undocumented side effects of the
specific implementations of an interface.  That means creating a new
interface with the desired semantics, instead of creating a "local"
patch which happens to work for the current implementation -- and which
will break silently when the new implementation of the interface is
created, and which means the new implementation will be seen as causing
a performance regression.

So yes, this means adding an interface to the e820 allocator, even
though it's scheduled to be replaced.  The authors of the new
implementation will see the new interface and know they have to
implement it, and the interface will make it clear exactly what is
expected of the implementation.

Therefore I will not accept your "simple" patch, exactly because it
really isn't simple at all.

	-hpa


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-23 21:42             ` H. Peter Anvin
@ 2010-08-25 11:08               ` Robin Holt
  2010-08-25 18:56                 ` H. Peter Anvin
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Holt @ 2010-08-25 11:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar, x86,
	Yinghai Lu, Linus Torvalds, Joerg Roedel, Linux Kernel,
	Stable Maintainers

>> So yes, this means adding an interface to the e820 allocator, even
>> though it's scheduled to be replaced.  The authors of the new
>> implementation will see the new interface and know they have to
>> implement it, and the interface will make it clear exactly what is
>> expected of the implementation.

What is the new allocator called?  Who is developing it?  When do you
expect it to be introduced into the kernel?

Thanks,
Robin


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-25 11:08               ` Robin Holt
@ 2010-08-25 18:56                 ` H. Peter Anvin
  2010-08-25 21:49                   ` Yinghai Lu
  0 siblings, 1 reply; 14+ messages in thread
From: H. Peter Anvin @ 2010-08-25 18:56 UTC (permalink / raw)
  To: Robin Holt
  Cc: Jack Steiner, Thomas Gleixner, Ingo Molnar, x86, Yinghai Lu,
	Linus Torvalds, Joerg Roedel, Linux Kernel, Stable Maintainers,
	Benjamin Herrenschmidt, Yinghai Lu

On 08/25/2010 04:08 AM, Robin Holt wrote:
>> So yes, this means adding an interface to the e820 allocator, even
>> though it's scheduled to be replaced.  The authors of the new
>> implementation will see the new interface and know they have to
>> implement it, and the interface will make it clear exactly what is
>> expected of the implementation.
> 
> What is the new allocator called?  Who is developing it?  When do you
> expect it to be introduced into the kernel?
> 

memblock -- the core is already in the kernel and used by several
non-x86 architectures, and so we want to switch x86 to use it, as well.

Ben Herrenschmidt and Yinghai Lu are driving this one.

As with anything in Linux, it will get integrated when it is ready.  It
would be nice if it were ready for .37, but it's not certain.

	-hpa


* Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.
  2010-08-25 18:56                 ` H. Peter Anvin
@ 2010-08-25 21:49                   ` Yinghai Lu
  0 siblings, 0 replies; 14+ messages in thread
From: Yinghai Lu @ 2010-08-25 21:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar, x86,
	Linus Torvalds, Joerg Roedel, Linux Kernel, Stable Maintainers,
	Benjamin Herrenschmidt

On 08/25/2010 11:56 AM, H. Peter Anvin wrote:
> On 08/25/2010 04:08 AM, Robin Holt wrote:
>>> So yes, this means adding an interface to the e820 allocator, even
>>> though it's scheduled to be replaced.  The authors of the new
>>> implementation will see the new interface and know they have to
>>> implement it, and the interface will make it clear exactly what is
>>> expected of the implementation.
>>
>> What is the new allocator called?  Who is developing it?  When do you
>> expect it to be introduced into the kernel?
>>
> 
> memblock -- the core is already in the kernel and used by several
> non-x86 architectures, and so we want to switch x86 to use it, as well.
> 
> Ben Herrenschmidt and Yinghai Lu are driving this one.
> 
> As with anything in Linux, it will get integrated when it is ready.  It
> would be nice if it were ready for .37, but it's not certain.

I put the latest version against current tip/master in:

git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-2.6-yinghai.git memblock

http://git.kernel.org/?p=linux/kernel/git/yinghai/linux-2.6-yinghai.git;a=shortlog;h=refs/heads/memblock

Thanks

Yinghai

