* [PATCH 1/2] fake the NUMA SLIT
2007-07-18 9:30 [PATCH 0/2] faking and fixing the NUMA SLIT Joachim Deguara
@ 2007-07-18 9:30 ` Joachim Deguara
2007-07-18 9:31 ` [PATCH 2/2] make node distance writeable Joachim Deguara
2007-07-18 9:42 ` [PATCH 0/2] faking and fixing the NUMA SLIT Andi Kleen
2 siblings, 0 replies; 7+ messages in thread
From: Joachim Deguara @ 2007-07-18 9:30 UTC
To: lkml List; +Cc: gregkh, Andreas Kleen, lenb, Christoph Lameter
Most x86 NUMA systems do not get a SLIT from the BIOS. We fake one by
either creating a new table or copying the original, so that the distances
in it can be altered later.
Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
--
Index: kernel/drivers/acpi/numa.c
===================================================================
--- kernel.orig/drivers/acpi/numa.c
+++ kernel/drivers/acpi/numa.c
@@ -228,6 +228,28 @@ int __init acpi_numa_init(void)
return 0;
}
+int __init acpi_numa_slit_fixup(void)
+{
+ /* ACPI gave us no SLIT, so we create one; otherwise we just copy it */
+ struct acpi_table_slit *fake_slit;
+ u32 localities = num_online_nodes();
+ int i, j, slitsize;
+
+ slitsize = sizeof(struct acpi_table_slit) + localities * localities - 1;
+ fake_slit = kmalloc(slitsize, GFP_KERNEL);
+ if (!fake_slit)
+ return -ENOMEM;
+
+ fake_slit->locality_count = localities;
+ for (i = 0; i < localities; i++)
+ for (j = 0; j < localities; j++)
+ fake_slit->entry[i*localities + j] = node_distance(i,j);
+
+ acpi_numa_slit_init(fake_slit);
+
+ return 0;
+}
+
int acpi_get_pxm(acpi_handle h)
{
unsigned long pxm;
Index: kernel/drivers/acpi/bus.c
===================================================================
--- kernel.orig/drivers/acpi/bus.c
+++ kernel/drivers/acpi/bus.c
@@ -650,6 +650,10 @@ void __init acpi_early_init(void)
goto error0;
}
+#ifdef CONFIG_ACPI_NUMA
+ acpi_numa_slit_fixup();
+#endif
+
return;
error0:
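
For readers without a kernel tree at hand, the table layout that the fixup
above fills in can be sketched in userspace C. This is only an illustration
under stated assumptions: struct fake_slit is a simplified, hypothetical
stand-in for struct acpi_table_slit (no ACPI header), and default_distance()
and build_fake_slit() are made-up names that mimic the "10 local, 20 remote"
fallback rather than calling node_distance().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the kernel's struct acpi_table_slit. */
struct fake_slit {
    uint64_t locality_count;
    uint8_t entry[];            /* locality_count^2 bytes, row-major */
};

/* Assumed fallback when the firmware supplies no SLIT: 10 local, 20 remote. */
static uint8_t default_distance(unsigned from, unsigned to)
{
    return from == to ? 10 : 20;
}

/* Allocate and fill a square distance table; the caller frees it. */
static struct fake_slit *build_fake_slit(unsigned localities)
{
    assert(localities > 0);

    size_t cells = (size_t)localities * localities;
    struct fake_slit *slit = malloc(sizeof(*slit) + cells);
    if (!slit)
        return NULL;

    slit->locality_count = localities;
    for (unsigned i = 0; i < localities; i++)
        for (unsigned j = 0; j < localities; j++)
            slit->entry[(size_t)i * localities + j] = default_distance(i, j);

    return slit;
}
```

The distance from node i to node j lives at entry[i * locality_count + j],
which is the same indexing the patch uses.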
* [PATCH 2/2] make node distance writeable
2007-07-18 9:30 [PATCH 0/2] faking and fixing the NUMA SLIT Joachim Deguara
2007-07-18 9:30 ` [PATCH 1/2] fake " Joachim Deguara
@ 2007-07-18 9:31 ` Joachim Deguara
2007-07-18 9:42 ` [PATCH 0/2] faking and fixing the NUMA SLIT Andi Kleen
2 siblings, 0 replies; 7+ messages in thread
From: Joachim Deguara @ 2007-07-18 9:31 UTC
To: lkml List; +Cc: gregkh, Andreas Kleen, lenb, Christoph Lameter
This adds the ability to write the node distances on NUMA systems. The
distances normally come from the SLIT, but unfortunately the large majority
of systems do not have one because Windows does not use it. Until now, if no
SLIT was found, every remote node got a distance of 20, which is fine for 2P
systems but wrong for 4P and larger.
Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
--
Index: kernel/drivers/base/node.c
===================================================================
--- kernel.orig/drivers/base/node.c
+++ kernel/drivers/base/node.c
@@ -129,7 +129,30 @@ static ssize_t node_read_distance(struct
len += sprintf(buf + len, "\n");
return len;
}
-static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
+
+//takes a space separated string as the distances of online nodes
+static ssize_t node_write_distance(struct sys_device * dev, const char * buf,
+ size_t size){
+ int i, ret;
+ u8 dist;
+
+ for_each_online_node(i){
+ if (i){
+ buf = strchr(buf, ' ');
+ buf++;
+ }
+ ret = sscanf(buf, "%hu", &dist);
+ if (!ret)
+ return -EINVAL;
+ if (dist < 10)
+ dist = 10;
+ set_node_distance(dev->id, i, dist);
+ }
+ return size;
+}
+
+static SYSDEV_ATTR(distance, S_IRUGO | S_IWUSR, node_read_distance,
+ node_write_distance);
/*
Index: kernel/arch/x86_64/mm/srat.c
===================================================================
--- kernel.orig/arch/x86_64/mm/srat.c
+++ kernel/arch/x86_64/mm/srat.c
@@ -471,6 +471,18 @@ int __node_distance(int a, int b)
EXPORT_SYMBOL(__node_distance);
+void __set_node_distance(int a, int b, u8 dist)
+{
+ int index;
+
+ if (!acpi_slit)
+ return;
+ index = acpi_slit->locality_count * node_to_pxm(a);
+ acpi_slit->entry[index + node_to_pxm(b)] = dist;
+}
+
+EXPORT_SYMBOL(__set_node_distance);
+
int memory_add_physaddr_to_nid(u64 start)
{
int i, ret = 0;
Index: kernel/include/asm-x86_64/topology.h
===================================================================
--- kernel.orig/include/asm-x86_64/topology.h
+++ kernel/include/asm-x86_64/topology.h
@@ -14,7 +14,9 @@ extern cpumask_t node_to_cpumask[];
#ifdef CONFIG_ACPI_NUMA
extern int __node_distance(int, int);
+extern void __set_node_distance(int, int, u8);
#define node_distance(a,b) __node_distance(a,b)
+#define set_node_distance(a,b,dist) __set_node_distance(a,b,dist)
/* #else fallback version */
#endif
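
A note on the parsing in node_write_distance() above: sscanf() with "%hu"
stores an unsigned short into a u8, and the result of strchr() is used
without a NULL check when the input has too few fields. A stricter parse
could look roughly like the userspace sketch below. parse_distances() and
LOCAL_DISTANCE are hypothetical names for illustration only, and strtoul()
stands in for whatever helper the kernel code would actually use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LOCAL_DISTANCE 10   /* smallest value the patch allows */

/*
 * Parse exactly `count` whitespace separated distances from buf into out[].
 * Returns count on success, -EINVAL on a missing, malformed or too large field.
 */
static int parse_distances(const char *buf, uint8_t *out, size_t count)
{
    assert(buf && out);

    for (size_t i = 0; i < count; i++) {
        char *end;

        errno = 0;
        unsigned long v = strtoul(buf, &end, 10);
        if (end == buf || errno == ERANGE || v > UINT8_MAX)
            return -EINVAL;

        /* Mirror the patch: anything below 10 is raised to 10. */
        out[i] = v < LOCAL_DISTANCE ? LOCAL_DISTANCE : (uint8_t)v;
        buf = end;
    }
    return (int)count;
}
```

With four online nodes, "10 20 20 30" parses into four values, while
"10 20" and "10 300 20 20" are both rejected with -EINVAL.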
* Re: [PATCH 0/2] faking and fixing the NUMA SLIT
2007-07-18 9:30 [PATCH 0/2] faking and fixing the NUMA SLIT Joachim Deguara
2007-07-18 9:30 ` [PATCH 1/2] fake " Joachim Deguara
2007-07-18 9:31 ` [PATCH 2/2] make node distance writeable Joachim Deguara
@ 2007-07-18 9:42 ` Andi Kleen
2007-07-18 9:57 ` Joachim Deguara
2007-07-23 20:25 ` Christoph Lameter
2 siblings, 2 replies; 7+ messages in thread
From: Andi Kleen @ 2007-07-18 9:42 UTC
To: Joachim Deguara; +Cc: lkml List, gregkh, lenb, Christoph Lameter
On Wednesday 18 July 2007 11:30:01 Joachim Deguara wrote:
> The problem with NUMA distances in the SLIT is that they are often wrong, oh
> wait they aren't there at all because the BIOS didn't create a SLIT since
> Windows does not use it. If Linux does not find a slit it just says the
> distance to local=10 and remote=20 according to ACPI spec. The problem is
> when we have a 4P system (or larger), there is generally one node where we
> have two hops and its distance should be >20.
>
> Following are patches to first fake the SLIT in the ACPI code and then add
> ability to write the distances from sysfs.
The main use of the SLIT information is the zone fallback lists in
the VM. These are created at boot. If you change the SLIT later these
won't be regenerated.
The scheduler also uses it for load balancing, but it is much less
important there than in the VM.
The only use would be for libnuma applications that read the SLIT later,
but I'm not aware of any.
Don't think that is really useful.
If anything you would probably need an early boot option for this, but that
would become so ugly that I would rather ask for fixing the BIOSes.
Or implement true node hotplug, but that would also be a lot of work.
On 4S it should not make that much difference anyways and 8S is hopefully
ok.
-Andi
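
To make the fallback-list point concrete, here is a small userspace sketch
of how a per-node fallback order depends on the distance table. It is not
the kernel's zonelist code: fallback_order() and NODES are hypothetical
names, and the value 30 for a two-hop node in the usage note is an assumed
figure chosen only to be larger than 20.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODES 4

/*
 * Fill order[] with node ids sorted by ascending distance from `from`.
 * dist is a row-major NODES x NODES table, as in a SLIT. Ties keep the
 * lower node id first (stable insertion sort).
 */
static void fallback_order(const uint8_t *dist, int from, int order[NODES])
{
    assert(dist && from >= 0 && from < NODES);

    const uint8_t *row = dist + (size_t)from * NODES;

    for (int n = 0; n < NODES; n++) {
        int pos = n;

        while (pos > 0 && row[order[pos - 1]] > row[n]) {
            order[pos] = order[pos - 1];
            pos--;
        }
        order[pos] = n;
    }
}
```

With the flat default (10 local, 20 for every remote node) node 0 falls back
in id order 0, 1, 2, 3. With a ring 0-1-2-3-0 where the two-hop node gets 30,
the order for node 0 becomes 0, 1, 3, 2. An order computed once at boot stays
as it is if the table is rewritten afterwards, unless it is recomputed.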