public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] faking and fixing the NUMA SLIT
@ 2007-07-18  9:30 Joachim Deguara
  2007-07-18  9:30 ` [PATCH 1/2] fake " Joachim Deguara
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Joachim Deguara @ 2007-07-18  9:30 UTC (permalink / raw)
  To: lkml List; +Cc: gregkh, Andreas Kleen, lenb, Christoph Lameter

The problem with NUMA distances in the SLIT is that they are often wrong. Oh
wait, they often aren't there at all, because the BIOS didn't create a SLIT
since Windows does not use it.  If Linux does not find a SLIT, it just sets
the distances to local=10 and remote=20, according to the ACPI spec.  The
problem is that on a 4P system (or larger) there is generally one node that
is two hops away, and its distance should be >20.

The following patches first fake the SLIT in the ACPI code and then add the
ability to write the distances through sysfs.

-Joachim
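
To make the 4P case concrete, here is a minimal sketch in user-space C. It
assumes four nodes wired in a square (links 0-1, 1-3, 3-2, 2-0), so each node
has exactly one peer that is two hops away. The names square_hops,
square_distance and default_distance are made up for illustration, and the
value 30 for a two-hop distance is an assumption, not a figure from any real
SLIT.

```c
#include <assert.h>

#define LOCAL_DISTANCE   10	/* distance of a node to itself */
#define REMOTE_DISTANCE  20	/* default for every remote node without a SLIT */

/* What the thread describes when no SLIT is found: flat 10/20 values. */
static int default_distance(int a, int b)
{
	return a == b ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}

/*
 * Hypothetical 4-node square: links 0-1, 1-3, 3-2, 2-0.
 * Two nodes are direct neighbours when their ids differ in exactly one
 * bit, so the hop count is the number of differing bits in (a ^ b).
 */
static int square_hops(int a, int b)
{
	int x = a ^ b;

	assert(a >= 0 && a < 4 && b >= 0 && b < 4);
	return (x & 1) + ((x >> 1) & 1);
}

/* Assumed scaling: 10 local, 20 for one hop, 30 for two hops. */
static int square_distance(int a, int b)
{
	return LOCAL_DISTANCE + 10 * square_hops(a, b);
}
```

For the pair (0, 3), default_distance() reports 20 while square_distance()
reports 30, which is the difference this series is trying to expose.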

* [PATCH 1/2] fake the NUMA SLIT
  2007-07-18  9:30 [PATCH 0/2] faking and fixing the NUMA SLIT Joachim Deguara
@ 2007-07-18  9:30 ` Joachim Deguara
  2007-07-18  9:31 ` [PATCH 2/2] make node distance writeable Joachim Deguara
  2007-07-18  9:42 ` [PATCH 0/2] faking and fixing the NUMA SLIT Andi Kleen
  2 siblings, 0 replies; 7+ messages in thread
From: Joachim Deguara @ 2007-07-18  9:30 UTC (permalink / raw)
  To: lkml List; +Cc: gregkh, Andreas Kleen, lenb, Christoph Lameter

Most x86 NUMA systems do not have a SLIT provided by the BIOS.  We want to
fake one, either by creating it from scratch or by copying the original.  The
reason for doing this is to be able to alter it later.

Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
--
Index: kernel/drivers/acpi/numa.c
===================================================================
--- kernel.orig/drivers/acpi/numa.c
+++ kernel/drivers/acpi/numa.c
@@ -228,6 +228,28 @@ int __init acpi_numa_init(void)
 	return 0;
 }
 
+int __init acpi_numa_slit_fixup(void)
+{
+	/* Create a SLIT if ACPI gave us none, otherwise copy the existing one. */
+	struct acpi_table_slit *fake_slit;
+	u32 localities = num_online_nodes();
+	int i, j, slitsize;
+
+	slitsize = sizeof(struct acpi_table_slit) + localities * localities - 1;
+	fake_slit = kmalloc(slitsize, GFP_KERNEL);
+	if (!fake_slit)
+		return -ENOMEM;
+
+	fake_slit->locality_count = localities;
+	for (i = 0; i < localities; i++)
+		for (j = 0; j < localities; j++)
+			fake_slit->entry[i*localities + j] = node_distance(i,j);
+
+	acpi_numa_slit_init(fake_slit);
+
+	return 0;
+}
+
 int acpi_get_pxm(acpi_handle h)
 {
 	unsigned long pxm;
Index: kernel/drivers/acpi/bus.c
===================================================================
--- kernel.orig/drivers/acpi/bus.c
+++ kernel/drivers/acpi/bus.c
@@ -650,6 +650,10 @@ void __init acpi_early_init(void)
 		goto error0;
 	}
 
+#ifdef CONFIG_ACPI_NUMA
+	acpi_numa_slit_fixup();
+#endif
+
 	return;
 
       error0:
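
The layout the fixup relies on is a flat, row-major array of
localities * localities one-byte distances. The following is a user-space
sketch of that layout only. struct demo_slit and its helpers are hypothetical
stand-ins: the real struct acpi_table_slit also carries an ACPI table header,
and the "- 1" in the patch's size computation presumably accounts for the
entry array already being declared with one element.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the table body (no ACPI header here). */
struct demo_slit {
	uint64_t locality_count;
	uint8_t entry[];	/* n * n distances, row-major */
};

/* Allocate an n x n table filled with the 10/20 defaults; NULL on failure. */
static struct demo_slit *demo_slit_alloc(unsigned int n)
{
	struct demo_slit *s;
	unsigned int i, j;

	assert(n > 0);
	s = malloc(sizeof(*s) + (size_t)n * n);
	if (!s)
		return NULL;

	s->locality_count = n;
	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			s->entry[i * n + j] = (i == j) ? 10 : 20;
	return s;
}

/* Distance from locality i to locality j. */
static uint8_t demo_slit_get(const struct demo_slit *s,
			     unsigned int i, unsigned int j)
{
	assert(i < s->locality_count && j < s->locality_count);
	return s->entry[i * s->locality_count + j];
}
```

The table is not required to be symmetric: entry[i * n + j] and
entry[j * n + i] are stored separately, so a writer has to update both if it
wants the distance changed in both directions.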

* [PATCH 2/2] make node distance writeable
  2007-07-18  9:30 [PATCH 0/2] faking and fixing the NUMA SLIT Joachim Deguara
  2007-07-18  9:30 ` [PATCH 1/2] fake " Joachim Deguara
@ 2007-07-18  9:31 ` Joachim Deguara
  2007-07-18  9:42 ` [PATCH 0/2] faking and fixing the NUMA SLIT Andi Kleen
  2 siblings, 0 replies; 7+ messages in thread
From: Joachim Deguara @ 2007-07-18  9:31 UTC (permalink / raw)
  To: lkml List; +Cc: gregkh, Andreas Kleen, lenb, Christoph Lameter

This adds the ability to write the node distances on NUMA systems.  The
distances normally come from the SLIT, but unfortunately the large majority
of systems do not have a SLIT, as Windows does not use it.  Until now, if no
SLIT was found, all remote nodes had a distance of 20, which is OK for 2P
systems but wrong for 4P and larger.

Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>


--
Index: kernel/drivers/base/node.c
===================================================================
--- kernel.orig/drivers/base/node.c
+++ kernel/drivers/base/node.c
@@ -129,7 +129,34 @@ static ssize_t node_read_distance(struct
 	len += sprintf(buf + len, "\n");
 	return len;
 }
-static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
+
+/* Takes a space-separated string of distances, one per online node. */
+static ssize_t node_write_distance(struct sys_device * dev, const char * buf,
+					size_t size){
+	int i, ret;
+	unsigned short dist;	/* must match the %hu conversion below */
+
+	for_each_online_node(i){
+		if (i){
+			buf = strchr(buf, ' ');
+			if (!buf)	/* fewer values than online nodes */
+				return -EINVAL;
+			buf++;
+		}
+		ret = sscanf(buf, "%hu", &dist);
+		if (ret != 1)
+			return -EINVAL;
+		if (dist < 10)
+			dist = 10;
+		if (dist > 255)	/* SLIT entries are a single byte */
+			dist = 255;
+		set_node_distance(dev->id, i, dist);
+	}
+	return size;
+}
+
+static SYSDEV_ATTR(distance, S_IRUGO | S_IWUSR, node_read_distance,
+		node_write_distance);
 
 
 /*
Index: kernel/arch/x86_64/mm/srat.c
===================================================================
--- kernel.orig/arch/x86_64/mm/srat.c
+++ kernel/arch/x86_64/mm/srat.c
@@ -471,6 +471,18 @@ int __node_distance(int a, int b)
 
 EXPORT_SYMBOL(__node_distance);
 
+void __set_node_distance(int a, int b, u8 dist)
+{
+	int index;
+
+	if (!acpi_slit)
+		return;
+	index = acpi_slit->locality_count * node_to_pxm(a);
+	acpi_slit->entry[index + node_to_pxm(b)] = dist;
+}
+
+EXPORT_SYMBOL(__set_node_distance);
+
 int memory_add_physaddr_to_nid(u64 start)
 {
 	int i, ret = 0;
Index: kernel/include/asm-x86_64/topology.h
===================================================================
--- kernel.orig/include/asm-x86_64/topology.h
+++ kernel/include/asm-x86_64/topology.h
@@ -14,7 +14,9 @@ extern cpumask_t     node_to_cpumask[];
 
 #ifdef CONFIG_ACPI_NUMA
 extern int __node_distance(int, int);
+extern void __set_node_distance(int, int, u8);
 #define node_distance(a,b) __node_distance(a,b)
+#define set_node_distance(a,b,dist) __set_node_distance(a,b,dist)
 /* #else fallback version */
 #endif
 

* Re: [PATCH 0/2] faking and fixing the NUMA SLIT
  2007-07-18  9:30 [PATCH 0/2] faking and fixing the NUMA SLIT Joachim Deguara
  2007-07-18  9:30 ` [PATCH 1/2] fake " Joachim Deguara
  2007-07-18  9:31 ` [PATCH 2/2] make node distance writeable Joachim Deguara
@ 2007-07-18  9:42 ` Andi Kleen
  2007-07-18  9:57   ` Joachim Deguara
  2007-07-23 20:25   ` Christoph Lameter
  2 siblings, 2 replies; 7+ messages in thread
From: Andi Kleen @ 2007-07-18  9:42 UTC (permalink / raw)
  To: Joachim Deguara; +Cc: lkml List, gregkh, lenb, Christoph Lameter

On Wednesday 18 July 2007 11:30:01 Joachim Deguara wrote:
> The problem with NUMA distances in the SLIT is that they are often wrong. Oh
> wait, they often aren't there at all, because the BIOS didn't create a SLIT
> since Windows does not use it.  If Linux does not find a SLIT, it just sets
> the distances to local=10 and remote=20, according to the ACPI spec.  The
> problem is that on a 4P system (or larger) there is generally one node that
> is two hops away, and its distance should be >20.
> 
> The following patches first fake the SLIT in the ACPI code and then add the
> ability to write the distances through sysfs.

The main use for the SLIT information is the zone fallback lists in
the VM. These are created at boot.  If you change the SLIT later these
won't be regenerated.

The scheduler also uses it for load balancing, but it is much less
important there than in the VM.

The only use would be for libnuma applications that read the SLIT later,
but I'm not aware of any.

Don't think that is really useful.

If anything you would probably need an early boot option for this, but that
would become so ugly that I would rather ask for fixing the BIOSes.
Or implement true node hotplug, but that would also be a lot of work.

On 4S it should not make that much difference anyways and 8S is hopefully
ok.

-Andi
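
Why the distances matter for the fallback lists can be illustrated with a
small user-space sketch. This is not the kernel's zonelist code;
fallback_order is a hypothetical helper that simply sorts node ids by their
distance from a starting node, using an n x n row-major matrix laid out like
the SLIT. The 4-node values mentioned in the note after the code are assumed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill order[0..n-1] with node ids sorted by increasing distance from
 * node `from`.  Nodes at equal distance keep the lower id first.
 * dist is an n x n row-major matrix: dist[a * n + b] is a -> b.
 */
static void fallback_order(const unsigned char *dist, int n, int from,
			   int *order)
{
	int i, j;

	assert(dist != NULL && order != NULL && from >= 0 && from < n);
	for (i = 0; i < n; i++) {
		int node = i;

		/* Insertion sort keyed on the distance from `from`. */
		for (j = i; j > 0; j--) {
			if (dist[from * n + order[j - 1]] <=
			    dist[from * n + node])
				break;
			order[j] = order[j - 1];
		}
		order[j] = node;
	}
}
```

For node 1, a flat matrix (every remote distance 20) gives the order
1, 0, 2, 3, while a matrix that marks node 2 as two hops away (distance 30)
gives 1, 0, 3, 2. An ordering like this that is computed once at boot would
not change if the matrix were edited afterwards.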

* Re: [PATCH 0/2] faking and fixing the NUMA SLIT
  2007-07-18  9:42 ` [PATCH 0/2] faking and fixing the NUMA SLIT Andi Kleen
@ 2007-07-18  9:57   ` Joachim Deguara
  2007-07-23 20:25   ` Christoph Lameter
  1 sibling, 0 replies; 7+ messages in thread
From: Joachim Deguara @ 2007-07-18  9:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: lkml List, gregkh, lenb, Christoph Lameter

On Wednesday 18 July 2007 11:42:20 Andi Kleen wrote:
> On Wednesday 18 July 2007 11:30:01 Joachim Deguara wrote:
> > The problem with NUMA distances in the SLIT is that they are often wrong.
> > Oh wait, they often aren't there at all, because the BIOS didn't create a
> > SLIT since Windows does not use it.  If Linux does not find a SLIT, it
> > just sets the distances to local=10 and remote=20, according to the ACPI
> > spec.  The problem is that on a 4P system (or larger) there is generally
> > one node that is two hops away, and its distance should be >20.
> >
> > The following patches first fake the SLIT in the ACPI code and then add
> > the ability to write the distances through sysfs.
>
> The main use for the SLIT information is the zone fallback lists in
> the VM. These are created at boot.  If you change the SLIT later these
> won't be regenerated.
>
> The scheduler also uses it for load balancing, but it is much less
> important there than in the VM.

I looked at how node_distance() is called from page_alloc.c and sched.c, but I
overlooked that those results are really only used at init time.  So the
backing mechanism of patching the SLIT is right, but sysfs is way too late for
those uses, and creating a boot option is, as you say, ugly.  It would be
great if the BIOS just did this for us, and correctly, but can you really
expect that ;)

I would be happy to add it as a boot option if there is any popular support
for the idea.

-Joachim

* Re: [PATCH 0/2] faking and fixing the NUMA SLIT
  2007-07-18  9:42 ` [PATCH 0/2] faking and fixing the NUMA SLIT Andi Kleen
  2007-07-18  9:57   ` Joachim Deguara
@ 2007-07-23 20:25   ` Christoph Lameter
  2007-07-23 22:10     ` Len Brown
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2007-07-23 20:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Joachim Deguara, lkml List, gregkh, lenb, travis, steiner

On Wed, 18 Jul 2007 11:42:20 +0200
Andi Kleen <ak@suse.de> wrote:

> Don't think that is really useful.

I think this is useful for NUMA debugging, since one may use it to
create various SLIT configurations that simulate the many fallback
scenarios that may require testing.

* Re: [PATCH 0/2] faking and fixing the NUMA SLIT
  2007-07-23 20:25   ` Christoph Lameter
@ 2007-07-23 22:10     ` Len Brown
  0 siblings, 0 replies; 7+ messages in thread
From: Len Brown @ 2007-07-23 22:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Joachim Deguara, lkml List, gregkh, travis, steiner

On Monday 23 July 2007 16:25, Christoph Lameter wrote:
> On Wed, 18 Jul 2007 11:42:20 +0200
> Andi Kleen <ak@suse.de> wrote:
> 
> > Don't think that is really useful.
> 
> I think this is useful for NUMA debugging, since one may use it to
> create various SLIT configurations that simulate the many fallback
> scenarios that may require testing.

Why not build a new SLIT into the kernel for the test
to override the BIOS -- like we do when we want a test version of the DSDT?

-Len