public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* Allow to change SD_NODES_PER_DOMAIN at configuration or boot time
@ 2005-02-16 17:17 Xavier Bru
  2005-02-17  0:08 ` Luck, Tony
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Xavier Bru @ 2005-02-16 17:17 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 3165 bytes --]

Hello Nick and all,
I remember this was discussed some months ago, but as of 2.6.10,
SD_NODES_PER_DOMAIN is still statically defined to 6.
This is not what is expected on Bull ia64 platforms, which are built
from modules of 4 bricks of 4 CPUs each.
Using the CPU hotplug mechanism to redefine the sched-domains
dynamically looks heavy (please correct me if I am wrong).
Here is a trivial patch that allows SD_NODES_PER_DOMAIN to be set at
configuration time or at boot time.
The boot-time parameter should be helpful: on a 32-way machine based on
2 modules of 4 bricks of 4 CPUs each, it allows building either 1
(SD_NODES_PER_DOMAIN=8) or 2 (SD_NODES_PER_DOMAIN=4) NUMA sched-domain
levels.
Thanks in advance for your comments.

diff --exclude-from /home15/xb/proc/patch.exclude -Nurp /tmp/linux-2.6.10/arch/ia64/Kconfig linux-2.6.10/arch/ia64/Kconfig
--- /tmp/linux-2.6.10/arch/ia64/Kconfig    2004-12-24 22:35:29.000000000 +0100
+++ linux-2.6.10/arch/ia64/Kconfig    2005-02-15 17:07:46.741070673 +0100
@@ -168,6 +168,17 @@ config NUMA
       Access).  This option is for configuring high-end multiprocessor
       server systems.  If in doubt, say N.
 
+config SD_NODES_PER_DOMAIN
+    int "Number of nodes per base sched_domains"
+    default "4" if IA64_DIG
+    default "6"
+    help
+      Number of nodes per base sched_domains.
+      Should be 6 for SGI platforms.
+      Should be 4 for DIG platforms.
+      This value can be provided at boot time using the sd_nodes_per_domain
+      boot parameter.
+       
 config VIRTUAL_MEM_MAP
     bool "Virtual mem map"
     default y if !IA64_HP_SIM
diff --exclude-from /home15/xb/proc/patch.exclude -Nurp /tmp/linux-2.6.10/arch/ia64/kernel/domain.c linux-2.6.10/arch/ia64/kernel/domain.c
--- /tmp/linux-2.6.10/arch/ia64/kernel/domain.c    2004-12-24 22:35:40.000000000 +0100
+++ linux-2.6.10/arch/ia64/kernel/domain.c    2005-02-15 15:04:08.964794354 +0100
@@ -13,7 +13,14 @@
 #include <linux/init.h>
 #include <linux/topology.h>
 
-#define SD_NODES_PER_DOMAIN 6
+int sd_nodes_per_domain = CONFIG_SD_NODES_PER_DOMAIN;
+
+static int __init set_sd_nodes_per_domain(char *str)
+{
+    get_option(&str, &sd_nodes_per_domain);
+    return 1;
+}
+__setup("sd_nodes_per_domain=", set_sd_nodes_per_domain);
 
 #ifdef CONFIG_NUMA
 /**
@@ -78,7 +85,7 @@ static cpumask_t __devinit sched_domain_
     cpus_or(span, span, nodemask);
     set_bit(node, used_nodes);
 
-    for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
+    for (i = 1; i < sd_nodes_per_domain; i++) {
         int next_node = find_next_best_node(node, used_nodes);
         nodemask = node_to_cpumask(next_node);
         cpus_or(span, span, nodemask);
@@ -159,7 +166,7 @@ void __devinit arch_init_sched_domains(v
 
 #ifdef CONFIG_NUMA
         if (num_online_cpus()
-                > SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
+                > sd_nodes_per_domain*cpus_weight(nodemask)) {
             sd = &per_cpu(allnodes_domains, i);
             *sd = SD_ALLNODES_INIT;
             sd->span = cpu_default_map;
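With the patch applied, the Kconfig default can be overridden from the kernel command line via the __setup() hook above. A hypothetical elilo stanza (paths and root device are placeholders) for forcing 4 nodes per domain:

```shell
# /boot/efi/elilo.conf -- hypothetical example
image=vmlinuz-2.6.10
    label=linux
    # sd_nodes_per_domain=4 groups nodes into base domains of 4,
    # giving two NUMA sched-domain levels on a 2x4x4 machine
    append="root=/dev/sda2 sd_nodes_per_domain=4"
```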

-- 

	Sincères salutations.


[-- Attachment #2: xavier.bru.vcf --]
[-- Type: text/x-vcard, Size: 306 bytes --]

begin:vcard
fn:Xavier Bru
n:Bru;Xavier
adr:;;1 rue de Provence, BP 208;Echirolles;;38432 Cedex;France
email;internet:Xavier.Bru@bull.net
title:BULL/DT/Open Software/linux/ia64
tel;work:+33 (0)4 76 29 77 45
tel;fax:+33 (0)4 76 29 77 70
x-mozilla-html:TRUE
url:http://www-frec.bull.fr
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Allow to change SD_NODES_PER_DOMAIN at configuration or boot time
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
@ 2005-02-17  0:08 ` Luck, Tony
  2005-02-17  0:13 ` Jesse Barnes
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Luck, Tony @ 2005-02-17  0:08 UTC (permalink / raw)
  To: linux-ia64

>I remember this was discussed some months ago, but as of 2.6.10,
>SD_NODES_PER_DOMAIN is still statically defined to 6.
>This is not what is expected on Bull ia64 platforms, which are
>built from modules of 4 bricks of 4 CPUs each.

I guess I still don't understand how defining the number of
nodes per domain gets the *right* nodes assigned to a domain.
Does this rely on node discovery code assigning logical node
numbers in such a way that nodes 0, 1, 2, 3 belong to one
domain, and nodes 4, 5, 6, 7 belong to the next domain (for
a system where SD_NODES_PER_DOMAIN=4)?  What if we have a
system where node numbers are effectively randomly assigned
by firmware at power-on? Then nodes 0, 3, 6, 7 might make up
a super-node, but we'll create a couple of domains that have
a jumbled mix of nodes from each super-node.

That's why I asked whether we need to parse the SLIT to determine
how many nodes belong to a domain ... but also to find out
which nodes are in which domain.

-Tony

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Allow to change SD_NODES_PER_DOMAIN at configuration or boot time
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
  2005-02-17  0:08 ` Luck, Tony
@ 2005-02-17  0:13 ` Jesse Barnes
  2005-02-17  0:28 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot Nick Piggin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jesse Barnes @ 2005-02-17  0:13 UTC (permalink / raw)
  To: linux-ia64

On Wednesday, February 16, 2005 4:08 pm, Luck, Tony wrote:
> I guess I still don't understand how defining the number of
> nodes per domain gets the *right* nodes assigned to a domain.
> Does this rely on node discovery code assigning logical node
> numbers in such a way that nodes 0, 1, 2, 3 belong to one
> domain, and nodes 4, 5, 6, 7 belong to the next domain (for
> a system where SD_NODES_PER_DOMAIN=4)?  What if we have a
> system where node numbers are effectively randomly assigned
> by firmware at power-on? Then nodes 0, 3, 6, 7 might make up
> a super-node, but we'll create a couple of domains that have
> a jumbled mix of nodes from each super-node.

It uses the SLIT table to put a cluster of SD_NODES_PER_DOMAIN nodes into a 
sched domain.  So if SD_NODES_PER_DOMAIN is 4, node 0 will be in a domain 
with the three nodes closest to node 0, which could be node 9, 15, and 2 for 
all we know...

> That's why I asked whether we need to parse the SLIT to determine
> how many nodes belong to a domain ... but also to find out
> which nodes are in which domain.

I don't think the SLIT gives us enough info to determine the best grouping of 
nodes, but it depends on how the various firmwares build it--some may.

Jesse

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Allow to change SD_NODES_PER_DOMAIN at configuration or boot
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
  2005-02-17  0:08 ` Luck, Tony
  2005-02-17  0:13 ` Jesse Barnes
@ 2005-02-17  0:28 ` Nick Piggin
  2005-02-17  1:07 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Luck, Tony
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2005-02-17  0:28 UTC (permalink / raw)
  To: linux-ia64

Luck, Tony wrote:
>>I remember this was discussed some months ago, but as of 2.6.10,
>>SD_NODES_PER_DOMAIN is still statically defined to 6.
>>This is not what is expected on Bull ia64 platforms, which are
>>built from modules of 4 bricks of 4 CPUs each.
> 

In that case, yes, you would be better off with a different value
for SD_NODES_PER_DOMAIN, maybe 16? It is really something you want
to be able to set in sub-architecture specific code. You'd really
have to test and find out.

Although, in general I don't think our multiprocessor scheduling
is very efficient at the moment, which is what I'm working on now
- so unfortunately any change I make might invalidate your testing.

> 
> I guess I still don't understand how defining the number of
> nodes per domain gets the *right* nodes assigned to a domain.
> Does this rely on node discovery code assigning logical node
> numbers in such a way that nodes 0, 1, 2, 3 belong to one
> domain, and nodes 4, 5, 6, 7 belong to the next domain (for

It uses node_distance, which IIRC is implemented to use SLIT
on ia64.

> a system where SD_NODES_PER_DOMAIN=4)?  What if we have a
> system where node numbers are effectively randomly assigned
> by firmware at power-on? Then nodes 0, 3, 6, 7 might make up
> a super-node, but we'll create a couple of domains that have
> a jumbled mix of nodes from each super-node.
> 

If node_distance is random then yeah that could happen.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Allow to change SD_NODES_PER_DOMAIN at configuration or boot time
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
                   ` (2 preceding siblings ...)
  2005-02-17  0:28 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot Nick Piggin
@ 2005-02-17  1:07 ` Luck, Tony
  2005-02-17  1:24 ` Jesse Barnes
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Luck, Tony @ 2005-02-17  1:07 UTC (permalink / raw)
  To: linux-ia64

>> I guess I still don't understand how defining the number of
>> nodes per domain gets the *right* nodes assigned to a domain.
>> Does this rely on node discovery code assigning logical node
>> numbers in such a way that nodes 0, 1, 2, 3 belong to one
>> domain, and nodes 4, 5, 6, 7 belong to the next domain (for
>
>It uses node_distance, which IIRC is implemented to use SLIT
>on ia64.

Ok ... this will give reasonable results if SD_NODES_PER_DOMAIN
is right, but results ranging from strange to downright weird if it
doesn't match the h/w topology.  E.g. with eight nodes that are all
equidistant and SD_NODES_PER_DOMAIN=4, node 0 will consider nodes
1, 2, 3 to be in the same domain.  Node 1 thinks that 2, 3, 4 are
in its domain; node 2 thinks 3, 4, 5 are in its domain ... and so
on.  Which means that node balancing will slowly skid processes
towards higher node numbers (and then wrap to zero).  But since all
the nodes are actually equidistant, this really isn't a tragedy.

Now that I understand what's happening, I'm a bit happier with the
patch.  The CONFIG options are odd, though.  Does "DIG" say anything
about NUMA topology?  I doubt it ... so it isn't quite the right
thing to use to pick the default value.  I suspect here it means
"not-SGI", but other platform owners can chime in with what they
need here.

In fact even for a specific platform, the right value may depend
on the configuration of a particular machine.  E.g. a machine that
can have up to 8 nodes, which are divided into two banks of four
nodes.  If only 6 nodes are installed (3 per bank) then we'd want
to use SD_NODES_PER_DOMAIN=3, but increase to 4 when we buy and
install the 2 extra nodes.  Yes?

Summary: I like the command-line settable value.  But the CONFIG
option still looks like it does the wrong thing as often as it helps.
The command line option needs some accompanying text in
kernel-parameters.txt to explain what it does and how to pick the
right value.

-Tony

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Allow to change SD_NODES_PER_DOMAIN at configuration or boot time
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
                   ` (3 preceding siblings ...)
  2005-02-17  1:07 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Luck, Tony
@ 2005-02-17  1:24 ` Jesse Barnes
  2005-02-17 11:05 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot Xavier Bru
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jesse Barnes @ 2005-02-17  1:24 UTC (permalink / raw)
  To: linux-ia64

On Wednesday, February 16, 2005 5:07 pm, Luck, Tony wrote:
> Summary: I like the command-line settable value.  But the CONFIG
> option still looks like it does the wrong thing as often as it helps.
> The command line option needs some accompanying text in
> kernel-parameters.txt to explain what it does and how to pick the
> right value.

I completely agree and like the idea of a tunable too.  I'm not sure what the 
default should be though--I don't think 6 makes much sense, but 4, as you 
said, is capable of causing harm...  I guess we should just pick something.

Jesse

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Allow to change SD_NODES_PER_DOMAIN at configuration or boot
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
                   ` (4 preceding siblings ...)
  2005-02-17  1:24 ` Jesse Barnes
@ 2005-02-17 11:05 ` Xavier Bru
  2005-02-17 16:33 ` Nick Piggin
  2005-02-24  7:39 ` Nick Piggin
  7 siblings, 0 replies; 9+ messages in thread
From: Xavier Bru @ 2005-02-17 11:05 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1055 bytes --]

Nick Piggin wrote:

>Right. It may make more sense to have the setup based on some
>maximum distance between nodes. Eg. all nodes less than distance
>10 away from node0 are to be in node0's first level NUMA domain
>(the next level is always global, IIRC).
>
>Then you would still need some configuration option, but it would
>appear to be a more useful metric to use.
>
>  
>
Hello Nick & all,
Do you mean that there should only ever be one NUMA sched-domain level?
On a 2x4x4-CPU machine, we could in theory have SD_NODES_PER_DOMAIN=4,
giving a 2-level NUMA sched-domain (domain 0 spans 4 CPUs,
domain 1 spans 16, domain 2 is global and spans 32).
But it is true that this configuration does not show clear performance
gains over SD_NODES_PER_DOMAIN=8 (domain 0 spans 4 CPUs, domain 1 is
global and spans 32), at least on parallel compilation of the kernel.
Providing SD_NODES_PER_DOMAIN as a boot parameter was also intended to
allow choosing between multilevel sched-domains or not.

-- 

	Sincères salutations.


[-- Attachment #2: xavier.bru.vcf --]
[-- Type: text/x-vcard, Size: 306 bytes --]

begin:vcard
fn:Xavier Bru
n:Bru;Xavier
adr:;;1 rue de Provence, BP 208;Echirolles;;38432 Cedex;France
email;internet:Xavier.Bru@bull.net
title:BULL/DT/Open Software/linux/ia64
tel;work:+33 (0)4 76 29 77 45
tel;fax:+33 (0)4 76 29 77 70
x-mozilla-html:TRUE
url:http://www-frec.bull.fr
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Allow to change SD_NODES_PER_DOMAIN at configuration or boot
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
                   ` (5 preceding siblings ...)
  2005-02-17 11:05 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot Xavier Bru
@ 2005-02-17 16:33 ` Nick Piggin
  2005-02-24  7:39 ` Nick Piggin
  7 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2005-02-17 16:33 UTC (permalink / raw)
  To: linux-ia64

Xavier Bru wrote:
> Nick Piggin wrote:
> 
>> Right. It may make more sense to have the setup based on some
>> maximum distance between nodes. Eg. all nodes less than distance
>> 10 away from node0 are to be in node0's first level NUMA domain
>> (the next level is always global, IIRC).
>>
>> Then you would still need some configuration option, but it would
>> appear to be a more useful metric to use.
>>
>>  
>>
> Hello Nick & all,
> Do you mean that there should only ever be one NUMA sched-domain level?

Hi,

No, I just mean that if the metric used to determine the nodes
in the lower level NUMA sched-domain is "node distance of no
greater than N", rather than "closest N nodes", you might have
a system that is easier to manage, and be less likely to have
the weird "artifacts" discussed.

> On a 2x4x4-CPU machine, we could in theory have SD_NODES_PER_DOMAIN=4,
> giving a 2-level NUMA sched-domain (domain 0 spans 4 CPUs,
> domain 1 spans 16, domain 2 is global and spans 32).
> But it is true that this configuration does not show clear performance
> gains over SD_NODES_PER_DOMAIN=8 (domain 0 spans 4 CPUs, domain 1 is
> global and spans 32), at least on parallel compilation of the kernel.

I'd say yeah, such a system may be too small for that to make much
difference. That said, a kernel compile probably isn't too sensitive
to scheduling placement, provided it is not completely broken.

> Providing SD_NODES_PER_DOMAIN as a boot parameter was also intended to
> allow choosing between multilevel sched-domains or not.
> 

Oh yes, that's better than nothing at all, definitely.

Nick


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Allow to change SD_NODES_PER_DOMAIN at configuration or boot
  2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
                   ` (6 preceding siblings ...)
  2005-02-17 16:33 ` Nick Piggin
@ 2005-02-24  7:39 ` Nick Piggin
  7 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2005-02-24  7:39 UTC (permalink / raw)
  To: linux-ia64

On Wed, 2005-02-16 at 17:24 -0800, Jesse Barnes wrote:
> On Wednesday, February 16, 2005 5:07 pm, Luck, Tony wrote:
> > Summary: I like the command-line settable value.  But the CONFIG
> > option still looks like it does the wrong thing as often as it helps.
> > The command line option needs some accompanying text in
> > kernel-parameters.txt to explain what it does and how to pick the
> > right value.
> 
> I completely agree and like the idea of a tunable too.  I'm not sure what the 
> default should be though--I don't think 6 makes much sense, but 4, as you 
> said, is capable of causing harm...  I guess we should just pick something.
> 

To all those looking at scheduler tuning - I have posted my
scheduler patchset to LKML, which should address various issues
with SMT, CMP, NUMA, excessive task movement between CPUs,
nodes, etc. However it is likely to be in a poor state of tune
for some workloads.

I hope Andrew will pick it up in -mm soon, but I can provide a
rollup against 2.6 if anyone is interested.  I don't expect it to
be merged until after 2.6.12 at the earliest, but I hope that
anyone doing tuning can target this version, because if I am able
to get it merged it will otherwise invalidate your results.

Please ask, and I will be more than willing to try to help tuning
or solving any problems and regressions.

Nick



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2005-02-24  7:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-02-16 17:17 Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Xavier Bru
2005-02-17  0:08 ` Luck, Tony
2005-02-17  0:13 ` Jesse Barnes
2005-02-17  0:28 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot Nick Piggin
2005-02-17  1:07 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot time Luck, Tony
2005-02-17  1:24 ` Jesse Barnes
2005-02-17 11:05 ` Allow to change SD_NODES_PER_DOMAIN at configuration or boot Xavier Bru
2005-02-17 16:33 ` Nick Piggin
2005-02-24  7:39 ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox