* Re: Externalize SLIT table [not found] <20041103205655.GA5084@sgi.com> @ 2004-11-04 1:59 ` Takayoshi Kochi 2004-11-04 4:07 ` Andi Kleen 2004-11-04 14:13 ` Jack Steiner 0 siblings, 2 replies; 30+ messages in thread From: Takayoshi Kochi @ 2004-11-04 1:59 UTC (permalink / raw) To: steiner; +Cc: linux-ia64, linux-kernel Hi, For wider audience, added LKML. From: Jack Steiner <steiner@sgi.com> Subject: Externalize SLIT table Date: Wed, 3 Nov 2004 14:56:56 -0600 > The SLIT table provides useful information on internode > distances. Has anyone considered externalizing this > table via /proc or some equivalent mechanism. > > For example, something like the following would be useful: > > # cat /proc/acpi/slit > 010 066 046 066 > 066 010 066 046 > 046 066 010 020 > 066 046 020 010 > > If this looks ok (or something equivalent), I'll generate a patch.... For user space to manipulate scheduling domains, pinning processes to some cpu groups etc, that kind of information is very useful! Without this, users have no notion about how far between two nodes. But ACPI SLIT table is too arch specific (ia64 and x86 only) and user-visible logical number and ACPI proximity domain number is not always identical. Why not export node_distance() under sysfs? I like (1). (1) obey one-value-per-file sysfs principle % cat /sys/devices/system/node/node0/distance0 10 % cat /sys/devices/system/node/node0/distance1 66 (2) one distance for each line % cat /sys/devices/system/node/node0/distance 0:10 1:66 2:46 3:66 (3) all distances in one line like /proc/<PID>/stat % cat /sys/devices/system/node/node0/distance 10 66 46 66 --- Takayoshi Kochi ^ permalink raw reply [flat|nested] 30+ messages in thread
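As a rough illustration of how format (3) above could be consumed from user space, here is a minimal C sketch. It is not part of any patch posted in this thread; the sysfs path follows proposal (3), and the array bound, helper name, and error handling are placeholders chosen for the example.

#include <stdio.h>

#define MAX_NODES 256	/* placeholder bound, not from the thread */

/* Read one node's distance row from the proposed one-line-per-node file.
   Returns the number of entries read, or -1 if the file cannot be opened. */
static int read_node_distances(int nid, int dist[], int max)
{
	char path[64];
	FILE *f;
	int n = 0;

	snprintf(path, sizeof(path),
	         "/sys/devices/system/node/node%d/distance", nid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (n < max && fscanf(f, "%d", &dist[n]) == 1)
		n++;
	fclose(f);
	return n;
}

int main(void)
{
	int dist[MAX_NODES];
	int n = read_node_distances(0, dist, MAX_NODES);
	int i;

	for (i = 0; i < n; i++)
		printf("node0 -> node%d: %d\n", i, dist[i]);
	return 0;
}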
* Re: Externalize SLIT table 2004-11-04 1:59 ` Externalize SLIT table Takayoshi Kochi @ 2004-11-04 4:07 ` Andi Kleen 2004-11-04 4:57 ` Takayoshi Kochi 2004-11-09 19:23 ` Matthew Dobson 2004-11-04 14:13 ` Jack Steiner 1 sibling, 2 replies; 30+ messages in thread From: Andi Kleen @ 2004-11-04 4:07 UTC (permalink / raw) To: Takayoshi Kochi; +Cc: steiner, linux-ia64, linux-kernel On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote: > Hi, > > For wider audience, added LKML. > > From: Jack Steiner <steiner@sgi.com> > Subject: Externalize SLIT table > Date: Wed, 3 Nov 2004 14:56:56 -0600 > > > The SLIT table provides useful information on internode > > distances. Has anyone considered externalizing this > > table via /proc or some equivalent mechanism. > > > > For example, something like the following would be useful: > > > > # cat /proc/acpi/slit > > 010 066 046 066 > > 066 010 066 046 > > 046 066 010 020 > > 066 046 020 010 > > > > If this looks ok (or something equivalent), I'll generate a patch.... This isn't very useful without information about proximity domains. e.g. on x86-64 the proximity domain number is not necessarily the same as the node number. > For user space to manipulate scheduling domains, pinning processes > to some cpu groups etc, that kind of information is very useful! > Without this, users have no notion about how far between two nodes. Also some reporting of _PXM for PCI devices is needed. I had a experimental patch for this on x86-64 (not ACPI based), that reported nearby nodes for PCI busses. > > But ACPI SLIT table is too arch specific (ia64 and x86 only) and > user-visible logical number and ACPI proximity domain number is > not always identical. Exactly. > > Why not export node_distance() under sysfs? > I like (1). > > (1) obey one-value-per-file sysfs principle > > % cat /sys/devices/system/node/node0/distance0 > 10 Surely distance from 0 to 0 is 0? > % cat /sys/devices/system/node/node0/distance1 > 66 > > (2) one distance for each line > > % cat /sys/devices/system/node/node0/distance > 0:10 > 1:66 > 2:46 > 3:66 > > (3) all distances in one line like /proc/<PID>/stat > > % cat /sys/devices/system/node/node0/distance > 10 66 46 66 I would prefer that. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 4:07 ` Andi Kleen @ 2004-11-04 4:57 ` Takayoshi Kochi 2004-11-04 6:37 ` Andi Kleen 2004-11-05 16:08 ` Jack Steiner 2004-11-09 19:23 ` Matthew Dobson 1 sibling, 2 replies; 30+ messages in thread From: Takayoshi Kochi @ 2004-11-04 4:57 UTC (permalink / raw) To: ak; +Cc: steiner, linux-ia64, linux-kernel Hi, From: Andi Kleen <ak@suse.de> Subject: Re: Externalize SLIT table Date: Thu, 4 Nov 2004 05:07:13 +0100 > > Why not export node_distance() under sysfs? > > I like (1). > > > > (1) obey one-value-per-file sysfs principle > > > > % cat /sys/devices/system/node/node0/distance0 > > 10 > > Surely distance from 0 to 0 is 0? According to the ACPI spec, 10 means local and other values mean ratio to 10. But what the distance number should mean is ambiguous from the spec (e.g. some vendors interpret it as memory access latency, others interpret it as memory throughput etc.) However, relative distance just works for most uses, I believe. Anyway, we should clarify how the numbers should be interpreted to avoid confusion. How about this? "The distance to itself means the base value. Distances to other nodes are relative to the base value. 0 means unreachable (hot-removed or disabled) to that node." (Just FYI, numbers 0-9 are reserved and 255 (unsigned char -1) means unreachable, according to the ACPI spec.) > > % cat /sys/devices/system/node/node0/distance1 > > 66 > > > > > (2) one distance for each line > > > > % cat /sys/devices/system/node/node0/distance > > 0:10 > > 1:66 > > 2:46 > > 3:66 > > > > (3) all distances in one line like /proc/<PID>/stat > > > > % cat /sys/devices/system/node/node0/distance > > 10 66 46 66 > > I would prefer that. Ah, I missed the following last sentence in Documentation/filesystems/sysfs.txt: |Attributes should be ASCII text files, preferably with only one value |per file. It is noted that it may not be efficient to contain only |one value per file, so it is socially acceptable to express an array of |values of the same type. If an array is acceptable, I would prefer (3), too. --- Takayoshi Kochi ^ permalink raw reply [flat|nested] 30+ messages in thread
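To make the convention discussed above concrete, the following small helpers (illustration only, not from any posted patch, with made-up names) interpret a SLIT-style distance value as described: 10 is the local base value, other values are ratios to 10, and 0 or 255 mark an unreachable node.

/* Illustration only -- interprets a SLIT-style distance value as
   described above.  All names here are invented for this example. */
#define SLIT_LOCAL       10
#define SLIT_UNREACHABLE 255	/* per the ACPI spec */

static inline int slit_is_local(unsigned char d)
{
	return d == SLIT_LOCAL;
}

static inline int slit_is_unreachable(unsigned char d)
{
	/* 0 is the "hot-removed or disabled" value proposed above */
	return d == 0 || d == SLIT_UNREACHABLE;
}

/* Cost of an access relative to a local one, e.g. 20 -> 2.0 */
static inline double slit_relative_cost(unsigned char d)
{
	return (double)d / SLIT_LOCAL;
}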
* Re: Externalize SLIT table 2004-11-04 4:57 ` Takayoshi Kochi @ 2004-11-04 6:37 ` Andi Kleen 2004-11-05 16:08 ` Jack Steiner 1 sibling, 0 replies; 30+ messages in thread From: Andi Kleen @ 2004-11-04 6:37 UTC (permalink / raw) To: Takayoshi Kochi; +Cc: ak, steiner, linux-ia64, linux-kernel On Thu, Nov 04, 2004 at 01:57:21PM +0900, Takayoshi Kochi wrote: > Hi, > > From: Andi Kleen <ak@suse.de> > Subject: Re: Externalize SLIT table > Date: Thu, 4 Nov 2004 05:07:13 +0100 > > > > Why not export node_distance() under sysfs? > > > I like (1). > > > > > > (1) obey one-value-per-file sysfs principle > > > > > > % cat /sys/devices/system/node/node0/distance0 > > > 10 > > > > Surely distance from 0 to 0 is 0? > > According to the ACPI spec, 10 means local and other values > mean ratio to 10. But what the distance number should mean Ah, missed that. ok I guess it makes sense to use the same encoding as ACPI, no need to be intentionally different. > mean is ambiguous from the spec (e.g. some veondors interpret as > memory access latency, others interpret as memory throughput > etc.) > However relative distance just works for most of uses, I believe. > > Anyway, we should clarify how the numbers should be interpreted > to avoid confusion. Defining it as "as defined in the ACPI spec" should be ok. I guess even non ACPI architectures will be able to live with that. Anyways, since we seem to agree and so far nobody has complained it's just that somebody needs to do a patch? If possible make it generic code in drivers/acpi/numa.c, there won't be anything architecture specific in this and it should work for x86-64 too. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 4:57 ` Takayoshi Kochi 2004-11-04 6:37 ` Andi Kleen @ 2004-11-05 16:08 ` Jack Steiner 2004-11-05 16:26 ` Andreas Schwab 2004-11-05 17:13 ` Erich Focht 1 sibling, 2 replies; 30+ messages in thread From: Jack Steiner @ 2004-11-05 16:08 UTC (permalink / raw) To: Takayoshi Kochi; +Cc: ak, linux-ia64, linux-kernel Based on the ideas from Andi & Takayoshi, I created a patch to add the SLIT distance information to the sysfs. I've tested this on Altix/IA64 & it appears to work ok. I have not tried it on other architectures. Andi also posted a related patch for adding similar information for PCI busses. Comments, suggestions, ..... # cd /sys/devices/system # find . ./node ./node/node5 ./node/node5/cpu11 ./node/node5/cpu10 ./node/node5/distance ./node/node5/numastat ./node/node5/meminfo ./node/node5/cpumap ./node/node4 ./node/node4/cpu9 ./node/node4/cpu8 ./node/node4/distance ./node/node4/numastat ./node/node4/meminfo ./node/node4/cpumap .... ./cpu ./cpu/cpu11 ./cpu/cpu11/distance ./cpu/cpu10 ./cpu/cpu10/distance ./cpu/cpu9 ./cpu/cpu9/distance ./cpu/cpu8 ... # cat ./node/node0/distance 10 20 64 42 42 22 # cat ./cpu/cpu8/distance 42 42 64 64 22 22 42 42 10 10 20 20 # cat node/*/distance 10 20 64 42 42 22 20 10 42 22 64 84 64 42 10 20 22 42 42 22 20 10 42 62 42 64 22 42 10 20 22 84 42 62 20 10 # cat cpu/*/distance 10 10 20 20 64 64 42 42 42 42 22 22 10 10 20 20 64 64 42 42 42 42 22 22 20 20 10 10 42 42 22 22 64 64 84 84 20 20 10 10 42 42 22 22 64 64 84 84 64 64 42 42 10 10 20 20 22 22 42 42 64 64 42 42 10 10 20 20 22 22 42 42 42 42 22 22 20 20 10 10 42 42 62 62 42 42 22 22 20 20 10 10 42 42 62 62 42 42 64 64 22 22 42 42 10 10 20 20 42 42 64 64 22 22 42 42 10 10 20 20 22 22 84 84 42 42 62 62 20 20 10 10 22 22 84 84 42 42 62 62 20 20 10 10 Index: linux/drivers/base/node.c =================================================================== --- linux.orig/drivers/base/node.c 2004-11-05 08:34:42.000000000 -0600 +++ linux/drivers/base/node.c 2004-11-05 09:00:01.000000000 -0600 @@ -111,6 +111,21 @@ static ssize_t node_read_numastat(struct } static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL); +static ssize_t node_read_distance(struct sys_device * dev, char * buf) +{ + int nid = dev->id; + int len = 0; + int i; + + for (i = 0; i < numnodes; i++) + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); + + len += sprintf(buf + len, "\n"); + return len; +} +static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL); + + /* * register_node - Setup a driverfs device for a node. * @num - Node number to use when creating the device. 
@@ -129,6 +144,7 @@ int __init register_node(struct node *no sysdev_create_file(&node->sysdev, &attr_cpumap); sysdev_create_file(&node->sysdev, &attr_meminfo); sysdev_create_file(&node->sysdev, &attr_numastat); + sysdev_create_file(&node->sysdev, &attr_distance); } return error; } Index: linux/drivers/base/cpu.c =================================================================== --- linux.orig/drivers/base/cpu.c 2004-11-05 08:58:09.000000000 -0600 +++ linux/drivers/base/cpu.c 2004-11-05 08:59:25.000000000 -0600 @@ -8,6 +8,7 @@ #include <linux/cpu.h> #include <linux/topology.h> #include <linux/device.h> +#include <linux/cpumask.h> struct sysdev_class cpu_sysdev_class = { @@ -58,6 +59,31 @@ static inline void register_cpu_control( } #endif /* CONFIG_HOTPLUG_CPU */ +#ifdef CONFIG_NUMA +static ssize_t cpu_read_distance(struct sys_device * dev, char * buf) +{ + int nid = cpu_to_node(dev->id); + int len = 0; + int i; + + for (i = 0; i < num_possible_cpus(); i++) + len += sprintf(buf + len, "%s%d", i ? " " : "", + node_distance(nid, cpu_to_node(i))); + len += sprintf(buf + len, "\n"); + return len; +} +static SYSDEV_ATTR(distance, S_IRUGO, cpu_read_distance, NULL); + +static inline void register_cpu_distance(struct cpu *cpu) +{ + sysdev_create_file(&cpu->sysdev, &attr_distance); +} +#else /* !CONFIG_NUMA */ +static inline void register_cpu_distance(struct cpu *cpu) +{ +} +#endif + /* * register_cpu - Setup a driverfs device for a CPU. * @cpu - Callers can set the cpu->no_control field to 1, to indicate not to @@ -81,6 +107,10 @@ int __init register_cpu(struct cpu *cpu, kobject_name(&cpu->sysdev.kobj)); if (!error && !cpu->no_control) register_cpu_control(cpu); + + if (!error) + register_cpu_distance(cpu); + return error; } On Thu, Nov 04, 2004 at 01:57:21PM +0900, Takayoshi Kochi wrote: > Hi, > > From: Andi Kleen <ak@suse.de> > Subject: Re: Externalize SLIT table > Date: Thu, 4 Nov 2004 05:07:13 +0100 > > > > Why not export node_distance() under sysfs? > > > I like (1). > > > > > > (1) obey one-value-per-file sysfs principle > > > > > > % cat /sys/devices/system/node/node0/distance0 > > > 10 > > > > Surely distance from 0 to 0 is 0? > > According to the ACPI spec, 10 means local and other values > mean ratio to 10. But what the distance number should mean > mean is ambiguous from the spec (e.g. some veondors interpret as > memory access latency, others interpret as memory throughput > etc.) > However relative distance just works for most of uses, I believe. > > Anyway, we should clarify how the numbers should be interpreted > to avoid confusion. > > How about this? > "The distance to itself means the base value. Distance to > other nodes are relative to the base value. > 0 means unreachable (hot-removed or disabled) to that node." > > (Just FYI, numbers 0-9 are reserved and 255 (unsigned char -1) means > unreachable, according to the ACPI spec.) > > > > % cat /sys/devices/system/node/node0/distance1 > > > 66 > > > > > > > > (2) one distance for each line > > > > > > % cat /sys/devices/system/node/node0/distance > > > 0:10 > > > 1:66 > > > 2:46 > > > 3:66 > > > > > > (3) all distances in one line like /proc/<PID>/stat > > > > > > % cat /sys/devices/system/node/node0/distance > > > 10 66 46 66 > > > > I would prefer that. > > Ah, I missed the following last sentence in > Documentation/filesystems/sysfs.txt: > > |Attributes should be ASCII text files, preferably with only one value > |per file. 
It is noted that it may not be efficient to contain only > |value per file, so it is socially acceptable to express an array of > |values of the same type. > > If an array is acceptable, I would prefer (3), too. > > --- > Takayoshi Kochi -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-05 16:08 ` Jack Steiner @ 2004-11-05 16:26 ` Andreas Schwab 2004-11-05 16:44 ` Jack Steiner 2004-11-05 17:13 ` Erich Focht 1 sibling, 1 reply; 30+ messages in thread From: Andreas Schwab @ 2004-11-05 16:26 UTC (permalink / raw) To: Jack Steiner; +Cc: Takayoshi Kochi, ak, linux-ia64, linux-kernel Jack Steiner <steiner@sgi.com> writes: > @@ -111,6 +111,21 @@ static ssize_t node_read_numastat(struct > } > static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL); > > +static ssize_t node_read_distance(struct sys_device * dev, char * buf) > +{ > + int nid = dev->id; > + int len = 0; > + int i; > + > + for (i = 0; i < numnodes; i++) > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); Can this overflow the space allocated for buf? > @@ -58,6 +59,31 @@ static inline void register_cpu_control( > } > #endif /* CONFIG_HOTPLUG_CPU */ > > +#ifdef CONFIG_NUMA > +static ssize_t cpu_read_distance(struct sys_device * dev, char * buf) > +{ > + int nid = cpu_to_node(dev->id); > + int len = 0; > + int i; > + > + for (i = 0; i < num_possible_cpus(); i++) > + len += sprintf(buf + len, "%s%d", i ? " " : "", > + node_distance(nid, cpu_to_node(i))); Or this? Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-05 16:26 ` Andreas Schwab @ 2004-11-05 16:44 ` Jack Steiner 2004-11-06 11:50 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Jack Steiner @ 2004-11-05 16:44 UTC (permalink / raw) To: Andreas Schwab; +Cc: Takayoshi Kochi, ak, linux-ia64, linux-kernel On Fri, Nov 05, 2004 at 05:26:10PM +0100, Andreas Schwab wrote: > Jack Steiner <steiner@sgi.com> writes: > > > @@ -111,6 +111,21 @@ static ssize_t node_read_numastat(struct > > } > > static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL); > > > > +static ssize_t node_read_distance(struct sys_device * dev, char * buf) > > +{ > > + int nid = dev->id; > > + int len = 0; > > + int i; > > + > > + for (i = 0; i < numnodes; i++) > > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); > > Can this overflow the space allocated for buf? Good point. I think we are ok for now. AFAIK, the largest cpu count currently supported is 512. That gives a max string of 2k (max of 3 digits + space per cpu). However, I should probably add a BUILD_BUG_ON to check for overflow. BUILD_BUG_ON(NR_NODES*4 > PAGE_SIZE/2); BUILD_BUG_ON(NR_CPUS*4 > PAGE_SIZE/2); > > > @@ -58,6 +59,31 @@ static inline void register_cpu_control( > > } > > #endif /* CONFIG_HOTPLUG_CPU */ > > > > +#ifdef CONFIG_NUMA > > +static ssize_t cpu_read_distance(struct sys_device * dev, char * buf) > > +{ > > + int nid = cpu_to_node(dev->id); > > + int len = 0; > > + int i; > > + > > + for (i = 0; i < num_possible_cpus(); i++) > > + len += sprintf(buf + len, "%s%d", i ? " " : "", > > + node_distance(nid, cpu_to_node(i))); > > Or this? > > Andreas. > > -- > Andreas Schwab, SuSE Labs, schwab@suse.de > SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany > Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 > "And now for something completely different." -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-05 16:44 ` Jack Steiner @ 2004-11-06 11:50 ` Christoph Hellwig 2004-11-06 12:48 ` Andi Kleen 0 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2004-11-06 11:50 UTC (permalink / raw) To: Jack Steiner Cc: Andreas Schwab, Takayoshi Kochi, ak, linux-ia64, linux-kernel On Fri, Nov 05, 2004 at 10:44:49AM -0600, Jack Steiner wrote: > > > + for (i = 0; i < numnodes; i++) > > > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); > > > > Can this overflow the space allocated for buf? > > > Good point. I think we are ok for now. AFAIK, the largest cpu count > currently supported is 512. That gives a max string of 2k (max of 3 > digits + space per cpu). I always wondered why sysfs doesn't use the seq_file interface that makes life easier in the rest of them kernel. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-06 11:50 ` Christoph Hellwig @ 2004-11-06 12:48 ` Andi Kleen 2004-11-06 13:07 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Andi Kleen @ 2004-11-06 12:48 UTC (permalink / raw) To: Christoph Hellwig, Jack Steiner, Andreas Schwab, Takayoshi Kochi, ak, linux-ia64, linux-kernel On Sat, Nov 06, 2004 at 11:50:29AM +0000, Christoph Hellwig wrote: > On Fri, Nov 05, 2004 at 10:44:49AM -0600, Jack Steiner wrote: > > > > + for (i = 0; i < numnodes; i++) > > > > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); > > > > > > Can this overflow the space allocated for buf? > > > > > > Good point. I think we are ok for now. AFAIK, the largest cpu count > > currently supported is 512. That gives a max string of 2k (max of 3 > > digits + space per cpu). > > I always wondered why sysfs doesn't use the seq_file interface that makes > life easier in the rest of them kernel. Most fields only output a single number, and seq_file would be extreme overkill for that. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-06 12:48 ` Andi Kleen @ 2004-11-06 13:07 ` Christoph Hellwig 0 siblings, 0 replies; 30+ messages in thread From: Christoph Hellwig @ 2004-11-06 13:07 UTC (permalink / raw) To: Andi Kleen Cc: Christoph Hellwig, Jack Steiner, Andreas Schwab, Takayoshi Kochi, linux-ia64, linux-kernel On Sat, Nov 06, 2004 at 01:48:38PM +0100, Andi Kleen wrote: > On Sat, Nov 06, 2004 at 11:50:29AM +0000, Christoph Hellwig wrote: > > On Fri, Nov 05, 2004 at 10:44:49AM -0600, Jack Steiner wrote: > > > > > + for (i = 0; i < numnodes; i++) > > > > > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); > > > > > > > > Can this overflow the space allocated for buf? > > > > > > > > > Good point. I think we are ok for now. AFAIK, the largest cpu count > > > currently supported is 512. That gives a max string of 2k (max of 3 > > > digits + space per cpu). > > > > I always wondered why sysfs doesn't use the seq_file interface that makes > > life easier in the rest of them kernel. > > Most fields only output a single number, and seq_file would be > extreme overkill for that. Personally I think even a: static void show_foo(struct device *dev, struct seq_file *s) { seq_printf(s, "blafcsvsdfg\n"); } static ssize_t show_foo(struct device *dev, char *buf) { return snprintf(buf, 20, "blafcsvsdfg\n"); } would be a definitive improvement. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-05 16:08 ` Jack Steiner 2004-11-05 16:26 ` Andreas Schwab @ 2004-11-05 17:13 ` Erich Focht 2004-11-05 19:13 ` Jack Steiner 1 sibling, 1 reply; 30+ messages in thread From: Erich Focht @ 2004-11-05 17:13 UTC (permalink / raw) To: Jack Steiner; +Cc: Takayoshi Kochi, ak, linux-ia64, linux-kernel Hi Jack, the patch looks fine, of course. > # cat ./node/node0/distance > 10 20 64 42 42 22 Great! But: > # cat ./cpu/cpu8/distance > 42 42 64 64 22 22 42 42 10 10 20 20 ... what exactly do you mean by cpu_to_cpu distance? In analogy with the node distance I'd say it is the time (latency) for moving data from the register of one CPU into the register of another CPU: cpu*/distance : cpu -> memory -> cpu node1 node? node2 On most architectures this means flushing a cacheline to memory on one side and reading it on another side. What you actually implement is the latency from memory (one node) to a particular cpu (on some node). memory -> cpu node1 node2 That's only half of the story and actually misleading. I don't think the complexity hiding is good in this place. Questions coming to my mind are: Where is the memory? Is the SLIT matrix really symmetric (cpu_to_cpu distance only makes sense for symmetric matrices)? I remember talking to IBM people about hardware where the node distance matrix was asymmetric. Why do you want this distance anyway? libnuma offers you _node_ masks for allocating memory from a particular node. And when you want to arrange a complex MPI process structure you'll have to think about latency for moving data from one processes buffer to the other processes buffer. The buffers live on nodes, not on cpus. Regards, Erich ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-05 17:13 ` Erich Focht @ 2004-11-05 19:13 ` Jack Steiner 0 siblings, 0 replies; 30+ messages in thread From: Jack Steiner @ 2004-11-05 19:13 UTC (permalink / raw) To: Erich Focht; +Cc: Takayoshi Kochi, ak, linux-ia64, linux-kernel On Fri, Nov 05, 2004 at 06:13:24PM +0100, Erich Focht wrote: > Hi Jack, > > the patch looks fine, of course. > > # cat ./node/node0/distance > > 10 20 64 42 42 22 > Great! > > But: > > # cat ./cpu/cpu8/distance > > 42 42 64 64 22 22 42 42 10 10 20 20 > ... > > what exactly do you mean by cpu_to_cpu distance? In analogy with the > node distance I'd say it is the time (latency) for moving data from > the register of one CPU into the register of another CPU: > cpu*/distance : cpu -> memory -> cpu > node1 node? node2 > I'm trying to create an easy-to-use metric for finding sets of cpus that are close to each other. By "close", I mean that the average offnode reference from a cpu to remote memory in the set is minimized. The numbers in cpuN/distance represent the distance from cpu N to the memory that is local to each of the other cpus. I agree that this can be derived from converting cpuN->node, finding internode distances, then finding the cpus on each remote node. The cpu metric is much easier to use. > On most architectures this means flushing a cacheline to memory on one > side and reading it on another side. What you actually implement is > the latency from memory (one node) to a particular cpu (on some > node). > memory -> cpu > node1 node2 I see how the term can be misleading. The metric is intended to represent ONLY the cost of remote access to another processor's local memory. Is there a better way to describe the cpu-to-remote-cpu's-memory metric OR should we let users construct their own matrix from the node data? > > That's only half of the story and actually misleading. I don't > think the complexity hiding is good in this place. Questions coming to > my mind are: Where is the memory? Is the SLIT matrix really symmetric > (cpu_to_cpu distance only makes sense for symmetric matrices)? I > remember talking to IBM people about hardware where the node distance > matrix was asymmetric. > > Why do you want this distance anyway? libnuma offers you _node_ masks > for allocating memory from a particular node. And when you want to > arrange a complex MPI process structure you'll have to think about > latency for moving data from one processes buffer to the other > processes buffer. The buffers live on nodes, not on cpus. One important use is in the creation of cpusets. The batch scheduler needs to pick a subset of cpus that are as close together as possible. -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 4:07 ` Andi Kleen 2004-11-04 4:57 ` Takayoshi Kochi @ 2004-11-09 19:23 ` Matthew Dobson 1 sibling, 0 replies; 30+ messages in thread From: Matthew Dobson @ 2004-11-09 19:23 UTC (permalink / raw) To: Andi Kleen; +Cc: Takayoshi Kochi, steiner, linux-ia64, LKML On Wed, 2004-11-03 at 20:07, Andi Kleen wrote: > On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote: > > (3) all distances in one line like /proc/<PID>/stat > > > > % cat /sys/devices/system/node/node0/distance > > 10 66 46 66 > > I would prefer that. > > -Andi That would be my vote as well. One line, space delimited. Easy to parse... Plus you could easily reproduce the entire SLIT matrix by: cd /sys/devices/system/node/ for i in `ls node*`; do cat $i/distance; done -Matt ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 1:59 ` Externalize SLIT table Takayoshi Kochi 2004-11-04 4:07 ` Andi Kleen @ 2004-11-04 14:13 ` Jack Steiner 2004-11-04 14:29 ` Andi Kleen 2004-11-04 15:31 ` Erich Focht 1 sibling, 2 replies; 30+ messages in thread From: Jack Steiner @ 2004-11-04 14:13 UTC (permalink / raw) To: Takayoshi Kochi; +Cc: linux-ia64, linux-kernel On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote: > Hi, > > For wider audience, added LKML. > > From: Jack Steiner <steiner@sgi.com> > Subject: Externalize SLIT table > Date: Wed, 3 Nov 2004 14:56:56 -0600 > > > The SLIT table provides useful information on internode > > distances. Has anyone considered externalizing this > > table via /proc or some equivalent mechanism. > > > > For example, something like the following would be useful: > > > > # cat /proc/acpi/slit > > 010 066 046 066 > > 066 010 066 046 > > 046 066 010 020 > > 066 046 020 010 > > > > If this looks ok (or something equivalent), I'll generate a patch.... > > For user space to manipulate scheduling domains, pinning processes > to some cpu groups etc, that kind of information is very useful! > Without this, users have no notion about how far between two nodes. > > But ACPI SLIT table is too arch specific (ia64 and x86 only) and > user-visible logical number and ACPI proximity domain number is > not always identical. > > Why not export node_distance() under sysfs? > I like (1). > > (1) obey one-value-per-file sysfs principle > > % cat /sys/devices/system/node/node0/distance0 > 10 > % cat /sys/devices/system/node/node0/distance1 > 66 I'm not familiar with the internals of sysfs. For example, on a 256 node system, there will be 65536 instances of /sys/devices/system/node/node<M>/distance<N> Does this require any significant amount of kernel resources to maintain this amount of information? > > (2) one distance for each line > > % cat /sys/devices/system/node/node0/distance > 0:10 > 1:66 > 2:46 > 3:66 > > (3) all distances in one line like /proc/<PID>/stat > > % cat /sys/devices/system/node/node0/distance > 10 66 46 66 > I like (3) the best. I think it would also be useful to have a similar cpu-to-cpu distance metric: % cat /sys/devices/system/cpu/cpu0/distance 10 20 40 60 This gives the same information but is cpu-centric rather than node centric. -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 14:13 ` Jack Steiner @ 2004-11-04 14:29 ` Andi Kleen 2004-11-04 15:31 ` Erich Focht 1 sibling, 0 replies; 30+ messages in thread From: Andi Kleen @ 2004-11-04 14:29 UTC (permalink / raw) To: Jack Steiner; +Cc: Takayoshi Kochi, linux-ia64, linux-kernel On Thu, Nov 04, 2004 at 08:13:37AM -0600, Jack Steiner wrote: > On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote: > > Hi, > > > > For wider audience, added LKML. > > > > From: Jack Steiner <steiner@sgi.com> > > Subject: Externalize SLIT table > > Date: Wed, 3 Nov 2004 14:56:56 -0600 > > > > > The SLIT table provides useful information on internode > > > distances. Has anyone considered externalizing this > > > table via /proc or some equivalent mechanism. > > > > > > For example, something like the following would be useful: > > > > > > # cat /proc/acpi/slit > > > 010 066 046 066 > > > 066 010 066 046 > > > 046 066 010 020 > > > 066 046 020 010 > > > > > > If this looks ok (or something equivalent), I'll generate a patch.... > > > > For user space to manipulate scheduling domains, pinning processes > > to some cpu groups etc, that kind of information is very useful! > > Without this, users have no notion about how far between two nodes. > > > > But ACPI SLIT table is too arch specific (ia64 and x86 only) and > > user-visible logical number and ACPI proximity domain number is > > not always identical. > > > > Why not export node_distance() under sysfs? > > I like (1). > > > > (1) obey one-value-per-file sysfs principle > > > > % cat /sys/devices/system/node/node0/distance0 > > 10 > > % cat /sys/devices/system/node/node0/distance1 > > 66 > > I'm not familar with the internals of sysfs. For example, on a 256 node > system, there will be 65536 instances of > /sys/devices/system/node/node<M>/distance<N> > > Does this require any significant amount of kernel resources to > maintain this amount of information. Yes it does, even with the new sysfs backing store. And reading it would create all the inodes and dentries, which are quite bloated. > > I think it would also be useful to have a similar cpu-to-cpu distance > metric: > % cat /sys/devices/system/cpu/cpu0/distance > 10 20 40 60 > > This gives the same information but is cpu-centric rather than > node centric. And the same thing for PCI busses, like in this patch. However for strict ACPI systems this information would need to be gotten from _PXM first. x86-64 on Opteron currently reads it directly from the hardware and uses it to allocate DMA memory near the device. 
-Andi diff -urpN -X ../KDIFX linux-2.6.8rc3/drivers/pci/pci-sysfs.c linux-2.6.8rc3-amd64/drivers/pci/pci-sysfs.c --- linux-2.6.8rc3/drivers/pci/pci-sysfs.c 2004-07-27 14:44:10.000000000 +0200 +++ linux-2.6.8rc3-amd64/drivers/pci/pci-sysfs.c 2004-08-04 02:42:11.000000000 +0200 @@ -17,6 +17,7 @@ #include <linux/kernel.h> #include <linux/pci.h> #include <linux/stat.h> +#include <linux/topology.h> #include "pci.h" @@ -38,6 +39,15 @@ pci_config_attr(subsystem_device, "0x%04 pci_config_attr(class, "0x%06x\n"); pci_config_attr(irq, "%u\n"); +static ssize_t local_cpus_show(struct device *dev, char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + cpumask_t mask = pcibus_to_cpumask(pdev->bus->number); + int len = cpumask_scnprintf(buf, PAGE_SIZE-1, mask); + strcat(buf,"\n"); + return 1+len; +} + /* show resources */ static ssize_t resource_show(struct device * dev, char * buf) @@ -67,6 +77,7 @@ struct device_attribute pci_dev_attrs[] __ATTR_RO(subsystem_device), __ATTR_RO(class), __ATTR_RO(irq), + __ATTR_RO(local_cpus), __ATTR_NULL, }; ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 14:13 ` Jack Steiner 2004-11-04 14:29 ` Andi Kleen @ 2004-11-04 15:31 ` Erich Focht 2004-11-04 17:04 ` Andi Kleen 2004-11-09 19:43 ` Matthew Dobson 1 sibling, 2 replies; 30+ messages in thread From: Erich Focht @ 2004-11-04 15:31 UTC (permalink / raw) To: Jack Steiner; +Cc: Takayoshi Kochi, linux-ia64, linux-kernel On Thursday 04 November 2004 15:13, Jack Steiner wrote: > I think it would also be useful to have a similar cpu-to-cpu distance > metric: > % cat /sys/devices/system/cpu/cpu0/distance > 10 20 40 60 > > This gives the same information but is cpu-centric rather than > node centric. I don't see the use of that once you have some way to find the logical CPU to node number mapping. The "node distances" are meant to be proportional to the memory access latency ratios (20 means 2 times larger than local (intra-node) access, which is by definition 10). If the cpu_to_cpu distance is necessary because there is a hierarchy in the memory blocks inside one node, then maybe the definition of a node should be changed... We currently have (at least in -mm kernels): % ls /sys/devices/system/node/node0/cpu* for finding out which CPUs belong to which nodes. Together with /sys/devices/system/node/node0/distances this should be enough for user space NUMA tools. Regards, Erich ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 15:31 ` Erich Focht @ 2004-11-04 17:04 ` Andi Kleen 2004-11-04 19:36 ` Jack Steiner 2004-11-09 19:45 ` Matthew Dobson 1 sibling, 2 replies; 30+ messages in thread From: Andi Kleen @ 2004-11-04 17:04 UTC (permalink / raw) To: Erich Focht; +Cc: Jack Steiner, Takayoshi Kochi, linux-ia64, linux-kernel On Thu, Nov 04, 2004 at 04:31:42PM +0100, Erich Focht wrote: > On Thursday 04 November 2004 15:13, Jack Steiner wrote: > > I think it would also be useful to have a similar cpu-to-cpu distance > > metric: > > % cat /sys/devices/system/cpu/cpu0/distance > > 10 20 40 60 > > > > This gives the same information but is cpu-centric rather than > > node centric. > > I don't see the use of that once you have some way to find the logical > CPU to node number mapping. The "node distances" are meant to be I think he wants it just to have a more convenient interface, which is not necessarily a bad thing. But then one could put the convenience into libnuma anyways. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 17:04 ` Andi Kleen @ 2004-11-04 19:36 ` Jack Steiner 0 siblings, 0 replies; 30+ messages in thread From: Jack Steiner @ 2004-11-04 19:36 UTC (permalink / raw) To: Andi Kleen; +Cc: Erich Focht, Takayoshi Kochi, linux-ia64, linux-kernel On Thu, Nov 04, 2004 at 06:04:35PM +0100, Andi Kleen wrote: > On Thu, Nov 04, 2004 at 04:31:42PM +0100, Erich Focht wrote: > > On Thursday 04 November 2004 15:13, Jack Steiner wrote: > > > I think it would also be useful to have a similar cpu-to-cpu distance > > > metric: > > > % cat /sys/devices/system/cpu/cpu0/distance > > > 10 20 40 60 > > > > > > This gives the same information but is cpu-centric rather than > > > node centric. > > > > I don't see the use of that once you have some way to find the logical > > CPU to node number mapping. The "node distances" are meant to be > > I think he wants it just to have a more convenient interface, > which is not necessarily a bad thing. But then one could put the > convenience into libnuma anyways. > > -Andi Yes, strictly convenience. Most of the cases that I have seen deal with cpu placement & cpu distances from each other. I agree that cpu-to-cpu distances can be determined by converting to nodes & finding the node-to-node distance. A second reason is symmetry. If there is a /sys/devices/system/node/node0/distance metric, it seems as though there should also be a /sys/devices/system/cpu/cpu0/distance metric. -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 17:04 ` Andi Kleen 2004-11-04 19:36 ` Jack Steiner @ 2004-11-09 19:45 ` Matthew Dobson 0 siblings, 0 replies; 30+ messages in thread From: Matthew Dobson @ 2004-11-09 19:45 UTC (permalink / raw) To: Andi Kleen; +Cc: Erich Focht, Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Thu, 2004-11-04 at 09:04, Andi Kleen wrote: > On Thu, Nov 04, 2004 at 04:31:42PM +0100, Erich Focht wrote: > > On Thursday 04 November 2004 15:13, Jack Steiner wrote: > > > I think it would also be useful to have a similar cpu-to-cpu distance > > > metric: > > > % cat /sys/devices/system/cpu/cpu0/distance > > > 10 20 40 60 > > > > > > This gives the same information but is cpu-centric rather than > > > node centric. > > > > I don't see the use of that once you have some way to find the logical > > CPU to node number mapping. The "node distances" are meant to be > > I think he wants it just to have a more convenient interface, > which is not necessarily a bad thing. But then one could put the > convenience into libnuma anyways. > > -Andi Using libnuma sounds fine to me. On a 512 CPU system, with 4 CPUs/node, we'd have 128 nodes. Re-exporting ALL the same data, those huge strings of node-to-node distances, 512 *additional* times in the per-CPU sysfs directories seems like a waste. -Matt ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-04 15:31 ` Erich Focht 2004-11-04 17:04 ` Andi Kleen @ 2004-11-09 19:43 ` Matthew Dobson 2004-11-09 20:34 ` Mark Goodwin 1 sibling, 1 reply; 30+ messages in thread From: Matthew Dobson @ 2004-11-09 19:43 UTC (permalink / raw) To: Erich Focht; +Cc: Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Thu, 2004-11-04 at 07:31, Erich Focht wrote: > On Thursday 04 November 2004 15:13, Jack Steiner wrote: > > I think it would also be useful to have a similar cpu-to-cpu distance > > metric: > > % cat /sys/devices/system/cpu/cpu0/distance > > 10 20 40 60 > > > > This gives the same information but is cpu-centric rather than > > node centric. > > I don't see the use of that once you have some way to find the logical > CPU to node number mapping. The "node distances" are meant to be > proportional to the memory access latency ratios (20 means 2 times > larger than local (intra-node) access, which is by definition 10). > If the cpu_to_cpu distance is necessary because there is a hierarchy > in the memory blocks inside one node, then maybe the definition of a > node should be changed... > > We currently have (at least in -mm kernels): > % ls /sys/devices/system/node/node0/cpu* > for finding out which CPUs belong to which nodes. Together with > /sys/devices/system/node/node0/distances > this should be enough for user space NUMA tools. > > Regards, > Erich I have to agree with Erich here. Node distances make sense, but adding 'cpu distances' which are just re-exporting the node distances in each cpu's directory in sysfs doesn't make much sense to me. Especially because it is so trivial to get a list of which CPUs are on which node. If you're looking for groups of CPUs which are close, simply look for groups of nodes that are close, then use the CPUs on those nodes. If we came up with some sort of different notion of 'distance' for CPUs and exported that, I'd be OK with it, because it'd be new information. I don't think we should export the *exact same* node distance information through the CPUs, though. -Matt ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-09 19:43 ` Matthew Dobson @ 2004-11-09 20:34 ` Mark Goodwin 2004-11-09 22:00 ` Jesse Barnes 2004-11-09 23:58 ` Matthew Dobson 0 siblings, 2 replies; 30+ messages in thread From: Mark Goodwin @ 2004-11-09 20:34 UTC (permalink / raw) To: Matthew Dobson Cc: Erich Focht, Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Tue, 9 Nov 2004, Matthew Dobson wrote: > ... > I don't think we should export the *exact same* node distance information > through the CPUs, though. We should still export cpu distances though because the distance between cpus on the same node may not be equal. e.g. consider a node with multiple cpu sockets, each socket with a hyperthreaded (or dual core) cpu. Once again however, it depends on the definition of distance. For nodes, we've established it's the ACPI SLIT (relative distance to memory). For cpus, should it be distance to memory? Distance to cache? Registers? Or what? -- Mark ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-09 20:34 ` Mark Goodwin @ 2004-11-09 22:00 ` Jesse Barnes 2004-11-09 23:58 ` Matthew Dobson 1 sibling, 0 replies; 30+ messages in thread From: Jesse Barnes @ 2004-11-09 22:00 UTC (permalink / raw) To: Mark Goodwin Cc: Matthew Dobson, Erich Focht, Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Tuesday, November 09, 2004 3:34 pm, Mark Goodwin wrote: > On Tue, 9 Nov 2004, Matthew Dobson wrote: > > ... > > I don't think we should export the *exact same* node distance information > > through the CPUs, though. > > We should still export cpu distances though because the distance between > cpus on the same node may not be equal. e.g. consider a node with multiple > cpu sockets, each socket with a hyperthreaded (or dual core) cpu. > > Once again however, it depends on the definition of distance. For nodes, > we've established it's the ACPI SLIT (relative distance to memory). For > cpus, should it be distance to memory? Distance to cache? Registers? Or > what? Yeah, that's a tough call. We should definitely get the node stuff in there now though, IMO. We can always add the CPU distances later if we figure out what they should mean. Jesse ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-09 20:34 ` Mark Goodwin 2004-11-09 22:00 ` Jesse Barnes @ 2004-11-09 23:58 ` Matthew Dobson 2004-11-10 5:05 ` Mark Goodwin 1 sibling, 1 reply; 30+ messages in thread From: Matthew Dobson @ 2004-11-09 23:58 UTC (permalink / raw) To: Mark Goodwin; +Cc: Erich Focht, Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Tue, 2004-11-09 at 12:34, Mark Goodwin wrote: > On Tue, 9 Nov 2004, Matthew Dobson wrote: > > ... > > I don't think we should export the *exact same* node distance information > > through the CPUs, though. > > We should still export cpu distances though because the distance between > cpus on the same node may not be equal. e.g. consider a node with multiple > cpu sockets, each socket with a hyperthreaded (or dual core) cpu. Well, I'm not sure that just because a CPU has two hyperthread units in the same core that those HT units have a different distance or latency to memory...? The fact that it is a HT unit and not a physical core has implications to the scheduler, but I thought that the 2 siblings looked identical to userspace, no? If 2 CPUs in the same node are on the same bus, then in all likelihood they have the same "distance". > Once again however, it depends on the definition of distance. For nodes, > we've established it's the ACPI SLIT (relative distance to memory). For > cpus, should it be distance to memory? Distance to cache? Registers? Or > what? > > -- Mark That's the real issue. We need to agree upon a meaningful definition of CPU-to-CPU "distance". As Jesse mentioned in a follow-up, we can all agree on what Node-to-Node "distance" means, but there doesn't appear to be much consensus on what CPU "distance" means. -Matt ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-09 23:58 ` Matthew Dobson @ 2004-11-10 5:05 ` Mark Goodwin 2004-11-10 18:45 ` Erich Focht 0 siblings, 1 reply; 30+ messages in thread From: Mark Goodwin @ 2004-11-10 5:05 UTC (permalink / raw) To: Matthew Dobson Cc: Erich Focht, Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Tue, 9 Nov 2004, Matthew Dobson wrote: > On Tue, 2004-11-09 at 12:34, Mark Goodwin wrote: >> Once again however, it depends on the definition of distance. For nodes, >> we've established it's the ACPI SLIT (relative distance to memory). For >> cpus, should it be distance to memory? Distance to cache? Registers? Or >> what? >> > That's the real issue. We need to agree upon a meaningful definition of > CPU-to-CPU "distance". As Jesse mentioned in a follow-up, we can all > agree on what Node-to-Node "distance" means, but there doesn't appear to > be much consensus on what CPU "distance" means. How about we define cpu-distance to be "relative distance to the lowest level cache on another CPU". On a system that has nodes with multiple sockets (each supporting multiple cores or HT "CPUs" sharing some level of cache), when the scheduler needs to migrate a task it would first choose a CPU sharing the same cache, then a CPU on the same node, then an off-node CPU (i.e. falling back to node distance). Of course, I have no idea if that's anything like an optimal or desirable task migration policy. Probably depends on cache-trashiness of the task being migrated. -- Mark ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-10 5:05 ` Mark Goodwin @ 2004-11-10 18:45 ` Erich Focht 2004-11-10 22:09 ` Matthew Dobson 0 siblings, 1 reply; 30+ messages in thread From: Erich Focht @ 2004-11-10 18:45 UTC (permalink / raw) To: Mark Goodwin Cc: Matthew Dobson, Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Wednesday 10 November 2004 06:05, Mark Goodwin wrote: > > On Tue, 9 Nov 2004, Matthew Dobson wrote: > > On Tue, 2004-11-09 at 12:34, Mark Goodwin wrote: > >> Once again however, it depends on the definition of distance. For nodes, > >> we've established it's the ACPI SLIT (relative distance to memory). For > >> cpus, should it be distance to memory? Distance to cache? Registers? Or > >> what? > >> > > That's the real issue. We need to agree upon a meaningful definition of > > CPU-to-CPU "distance". As Jesse mentioned in a follow-up, we can all > > agree on what Node-to-Node "distance" means, but there doesn't appear to > > be much consensus on what CPU "distance" means. > > How about we define cpu-distance to be "relative distance to the > lowest level cache on another CPU". Several definitions are possible, this is really a source of confusion. Any of these can be reconstructed if one has access to the constituents: node-to-node latency (SLIT), cache-to-cache latencies. The later ones aren't available and would anyhow be better placed in something like /proc/cpuinfo or similar. They are CPU or package specific and have nothing to do with NUMA. > On a system that has nodes with multiple sockets (each supporting > multiple cores or HT "CPUs" sharing some level of cache), when the > scheduler needs to migrate a task it would first choose a CPU > sharing the same cache, then a CPU on the same node, then an > off-node CPU (i.e. falling back to node distance). This should be done by correctly setting up the sched domains. It's not a question of exporting useless or redundant information to user space. The need for some (any) cpu-to-cpu metrics initially brought up by Jack seemed mainly motivated by existing user space tools for constructing cpusets (maybe in PBS). I think it is a tolerable effort to introduce in user space an inlined function or macro doing something like cpu_metric(i,j) := node_metric(cpu_node(i),cpu_node(j)) It keeps the kernel free of misleading information which might just slightly make cpusets construction more comfortable. In user space you have the full freedom to enhance your metrics when getting more details about the next generation cpus. Regards, Erich ^ permalink raw reply [flat|nested] 30+ messages in thread
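A sketch of the user-space helper suggested above could look like the following. It assumes the node-distance matrix and the cpu-to-node mapping have already been filled in from the per-node distance files and the cpuN entries under each node directory; the bounds and names are illustrative, not an existing API.

#define MAX_NODES 256	/* illustrative bounds only */
#define MAX_CPUS  512

/* Filled elsewhere from sysfs: node_dist from each node's distance file,
   cpu_node from the cpuN entries under each node directory. */
extern int node_dist[MAX_NODES][MAX_NODES];
extern int cpu_node[MAX_CPUS];

/* cpu_metric(i,j) := node_metric(cpu_node(i), cpu_node(j)), as suggested above */
static inline int cpu_metric(int i, int j)
{
	return node_dist[cpu_node[i]][cpu_node[j]];
}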
* Re: Externalize SLIT table 2004-11-10 18:45 ` Erich Focht @ 2004-11-10 22:09 ` Matthew Dobson 0 siblings, 0 replies; 30+ messages in thread From: Matthew Dobson @ 2004-11-10 22:09 UTC (permalink / raw) To: Erich Focht; +Cc: Mark Goodwin, Jack Steiner, Takayoshi Kochi, linux-ia64, LKML On Wed, 2004-11-10 at 10:45, Erich Focht wrote: > On Wednesday 10 November 2004 06:05, Mark Goodwin wrote: > > On a system that has nodes with multiple sockets (each supporting > > multiple cores or HT "CPUs" sharing some level of cache), when the > > scheduler needs to migrate a task it would first choose a CPU > > sharing the same cache, then a CPU on the same node, then an > > off-node CPU (i.e. falling back to node distance). > > This should be done by correctly setting up the sched domains. It's > not a question of exporting useless or redundant information to user > space. > > The need for some (any) cpu-to-cpu metrics initially brought up by > Jack seemed mainly motivated by existing user space tools for > constructing cpusets (maybe in PBS). I think it is a tolerable effort > to introduce in user space an inlined function or macro doing > something like > cpu_metric(i,j) := node_metric(cpu_node(i),cpu_node(j)) > > It keeps the kernel free of misleading information which might just > slightly make cpusets construction more comfortable. In user space you > have the full freedom to enhance your metrics when getting more > details about the next generation cpus. Good point, Erich. I don't think there is any desperate need for CPU-to-CPU distances to be exported to userspace right now. If that is incorrect and someone really needs a particular distance metric to be exported by the kernel, we can look into that and export the required info. For now I think the Node-to-Node distance information is enough. -Matt ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20041103205655.GA5084@sgi.com.suse.lists.linux.kernel>]
[parent not found: <20041104.105908.18574694.t-kochi@bq.jp.nec.com.suse.lists.linux.kernel>]
[parent not found: <20041104040713.GC21211@wotan.suse.de.suse.lists.linux.kernel>]
[parent not found: <20041104.135721.08317994.t-kochi@bq.jp.nec.com.suse.lists.linux.kernel>]
[parent not found: <20041105160808.GA26719@sgi.com.suse.lists.linux.kernel>]
* Re: Externalize SLIT table [not found] ` <20041105160808.GA26719@sgi.com.suse.lists.linux.kernel> @ 2004-11-06 6:30 ` Andi Kleen 2004-11-23 17:32 ` Jack Steiner 0 siblings, 1 reply; 30+ messages in thread From: Andi Kleen @ 2004-11-06 6:30 UTC (permalink / raw) To: Jack Steiner; +Cc: linux-kernel Jack Steiner <steiner@sgi.com> writes: > > +static ssize_t node_read_distance(struct sys_device * dev, char * buf) > +{ > + int nid = dev->id; > + int len = 0; > + int i; > + > + for (i = 0; i < numnodes; i++) > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); One problem is that most architectures define node_distance currently as nid != i. This would give 0 on them for the identity mapping and 10 on IA64 which uses the SLIT values. Not good for a portable interface. I would suggest to at least change them to return 10 for a zero node distance. Also in general I would prefer if you could move all the SLIT parsing into drivers/acpi/numa.c. Then the other ACPI architectures don't need to copy the basically identical code from ia64. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-06 6:30 ` Andi Kleen @ 2004-11-23 17:32 ` Jack Steiner 2004-11-23 19:06 ` Andi Kleen 0 siblings, 1 reply; 30+ messages in thread From: Jack Steiner @ 2004-11-23 17:32 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel (Sorry for the delay in posting this. Our mail server was dropping mail ....) Here is an update patch to externalize the SLIT information. I think I have encorporated all the comments that were posted previously) For example: # cd /sys/devices/system # find . ./node ./node/node5 ./node/node5/distance ./node/node5/numastat ./node/node5/meminfo ./node/node5/cpumap # cat ./node/node0/distance 10 20 64 42 42 22 # cat node/*/distance 10 20 64 42 42 22 20 10 42 22 64 84 64 42 10 20 22 42 42 22 20 10 42 62 42 64 22 42 10 20 22 84 42 62 20 10 Does this look ok??? Signed-off-by: Jack Steiner <steiner@sgi.com> Add SLIT (inter node distance) information to sysfs. Index: linux/drivers/base/node.c =================================================================== --- linux.orig/drivers/base/node.c 2004-11-05 08:34:42.461312000 -0600 +++ linux/drivers/base/node.c 2004-11-05 15:56:23.345662000 -0600 @@ -111,6 +111,24 @@ static ssize_t node_read_numastat(struct } static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL); +static ssize_t node_read_distance(struct sys_device * dev, char * buf) +{ + int nid = dev->id; + int len = 0; + int i; + + /* buf currently PAGE_SIZE, need ~4 chars per node */ + BUILD_BUG_ON(NR_NODES*4 > PAGE_SIZE/2); + + for (i = 0; i < numnodes; i++) + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i)); + + len += sprintf(buf + len, "\n"); + return len; +} +static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL); + + /* * register_node - Setup a driverfs device for a node. * @num - Node number to use when creating the device. @@ -129,6 +147,7 @@ int __init register_node(struct node *no sysdev_create_file(&node->sysdev, &attr_cpumap); sysdev_create_file(&node->sysdev, &attr_meminfo); sysdev_create_file(&node->sysdev, &attr_numastat); + sysdev_create_file(&node->sysdev, &attr_distance); } return error; } Index: linux/include/asm-i386/topology.h =================================================================== --- linux.orig/include/asm-i386/topology.h 2004-11-05 08:34:53.713053000 -0600 +++ linux/include/asm-i386/topology.h 2004-11-23 09:59:43.574062951 -0600 @@ -66,9 +66,6 @@ static inline cpumask_t pcibus_to_cpumas return node_to_cpumask(mp_bus_id_to_node[bus]); } -/* Node-to-Node distance */ -#define node_distance(from, to) ((from) != (to)) - /* sched_domains SD_NODE_INIT for NUMAQ machines */ #define SD_NODE_INIT (struct sched_domain) { \ .span = CPU_MASK_NONE, \ Index: linux/include/linux/topology.h =================================================================== --- linux.orig/include/linux/topology.h 2004-11-05 08:34:57.492932000 -0600 +++ linux/include/linux/topology.h 2004-11-23 10:03:26.700821978 -0600 @@ -55,7 +55,10 @@ static inline int __next_node_with_cpus( for (node = 0; node < numnodes; node = __next_node_with_cpus(node)) #ifndef node_distance -#define node_distance(from,to) ((from) != (to)) +/* Conform to ACPI 2.0 SLIT distance definitions */ +#define LOCAL_DISTANCE 10 +#define REMOTE_DISTANCE 20 +#define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE) #endif #ifndef PENALTY_FOR_NODE_WITH_CPUS #define PENALTY_FOR_NODE_WITH_CPUS (1) -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. 
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table 2004-11-23 17:32 ` Jack Steiner @ 2004-11-23 19:06 ` Andi Kleen 0 siblings, 0 replies; 30+ messages in thread From: Andi Kleen @ 2004-11-23 19:06 UTC (permalink / raw) To: Jack Steiner; +Cc: Andi Kleen, linux-kernel On Tue, Nov 23, 2004 at 11:32:09AM -0600, Jack Steiner wrote: > (Sorry for the delay in posting this. Our mail server was > dropping mail ....) Looks good. Thanks. I actually came up with my own patch now (which ended up quite similar), but yours looks slightly better. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Externalize SLIT table @ 2004-11-18 16:39 Jack Steiner 0 siblings, 0 replies; 30+ messages in thread From: Jack Steiner @ 2004-11-18 16:39 UTC (permalink / raw) To: linux-kernel; +Cc: linux-ia64 (Resend of mail sent Nov 10, 2004 - as far as I can tell, it went nowhere) On Wed, Nov 10, 2004 at 04:05:43PM +1100, Mark Goodwin wrote: > > On Tue, 9 Nov 2004, Matthew Dobson wrote: > >On Tue, 2004-11-09 at 12:34, Mark Goodwin wrote: > >>Once again however, it depends on the definition of distance. For nodes, > >>we've established it's the ACPI SLIT (relative distance to memory). For > >>cpus, should it be distance to memory? Distance to cache? Registers? Or > >>what? > >> > >That's the real issue. We need to agree upon a meaningful definition of > >CPU-to-CPU "distance". As Jesse mentioned in a follow-up, we can all > >agree on what Node-to-Node "distance" means, but there doesn't appear to > >be much consensus on what CPU "distance" means. > > How about we define cpu-distance to be "relative distance to the > lowest level cache on another CPU". On a system that has nodes with > multiple sockets (each supporting multiple cores or HT "CPUs" sharing > some level of cache), when the scheduler needs to migrate a task it would > first choose a CPU sharing the same cache, then a CPU on the same node, > then an off-node CPU (i.e. falling back to node distance). I think I like your definition better than the one I originally proposed (cpu distance was distance between the local memories of the cpus). But how do we determine the distance between the caches. > > Of course, I have no idea if that's anything like an optimal or desirable > task migration policy. Probably depends on cache-trashiness of the task > being migrated. > > -- Mark -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 30+ messages in thread