* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
@ 2006-04-14 16:18 ` Luck, Tony
2006-04-14 16:44 ` Christoph Lameter
` (11 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 16:18 UTC (permalink / raw)
To: linux-ia64
> We found that on motherboard latencies are typically 1 to 1.4 of local memory
> access speed whereas multinode systems which benefit from zone reclaim have
> usually more than 1.5 times the latency of a local access.
>
> Set the reclaim distance for IA64 to 1.5 times.
Does this really apply just to ia64 systems? Or is the "20" value in
topology.h just wrong?
Is the right value for this in any way related to the "migration_cost"
that we spend many seconds of boot time computing?
-Tony
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
2006-04-14 16:18 ` Luck, Tony
@ 2006-04-14 16:44 ` Christoph Lameter
2006-04-14 17:36 ` John Hawkes
` (10 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 16:44 UTC (permalink / raw)
To: linux-ia64
On Fri, 14 Apr 2006, Luck, Tony wrote:
> > We found that on motherboard latencies are typically 1 to 1.4 of local memory
> > access speed whereas multinode systems which benefit from zone reclaim have
> > usually more than 1.5 times the latency of a local access.
> >
> > Set the reclaim distance for IA64 to 1.5 times.
>
> Does this really apply just to ia64 systems? Or is the "20" value in
> topology.h just wrong?
The value in topology.h is the same as the REMOTE_DISTANCE. I think we
better leave these as fallbacks for systems that do not implement
node_distance().
> Is the right value for this in any way related to the "migration_cost"
> that we spend many seconds of boot time computing?
I am not familiar with that. Does the migration_cost take the
SLIT node distances into consideration?
* Re: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
2006-04-14 16:18 ` Luck, Tony
2006-04-14 16:44 ` Christoph Lameter
@ 2006-04-14 17:36 ` John Hawkes
2006-04-14 17:39 ` Luck, Tony
` (9 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: John Hawkes @ 2006-04-14 17:36 UTC (permalink / raw)
To: linux-ia64
From: "Christoph Lameter" <clameter@sgi.com>
> > Is the right value for this in any way related to the "migration_cost"
> > that we spend many seconds of boot time computing?
>
> I am not familiar with that. Does the migration_cost take the
> SLIT node distances into consideration?
No. The migration_cost is an empirical runtime calculation that estimates the
relative cost of migrating a task within each level of sched domain. And even
that is sometimes grossly inaccurate because only two arbitrary CPUs are
chosen for this calculation within one sched domain at each level, thus
assuming that the migration_cost is the same between any and all two CPUs in
that sched domain, and that all the sched domains for a given level exhibit
equivalent migration_cost behavior.
For example, one level of sched domain is the all-CPUs sched domain. For a
NUMA system it is unlikely that the migration_cost between cpu0 and cpu1 is
the same as between cpu0 and cpu511, and yet only cpu0 and cpu1 are chosen.
Another sched domain level on sn2 platforms, typically two CPUs per node, is
32p. Again, there the migration_cost between cpu0 and cpu1 is different than
cpu0 and cpu31.
John Hawkes
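
A minimal sketch of the point above, assuming two CPUs per node and CPUs
numbered consecutively within each node. The helper names
cpu_to_node_sketch() and cpus_share_node() are hypothetical and are not
kernel functions.

```c
#include <assert.h>

#define CPUS_PER_NODE 2   /* "typically two CPUs per node" on sn2 */

/* Assumption: CPUs are numbered consecutively within each node. */
static int cpu_to_node_sketch(int cpu)
{
    assert(cpu >= 0);
    return cpu / CPUS_PER_NODE;
}

/* Nonzero if both CPUs sit on the same node. */
static int cpus_share_node(int a, int b)
{
    return cpu_to_node_sketch(a) == cpu_to_node_sketch(b);
}
```

Under these assumptions cpus_share_node(0, 1) is nonzero while
cpus_share_node(0, 31) is zero, so a single cpu0/cpu1 measurement never
crosses a node boundary even though the 32p domain spans many nodes.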
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (2 preceding siblings ...)
2006-04-14 17:36 ` John Hawkes
@ 2006-04-14 17:39 ` Luck, Tony
2006-04-14 17:41 ` Christoph Lameter
` (8 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:39 UTC (permalink / raw)
To: linux-ia64
> I am not familiar with that. Does the migration_cost take the
> SLIT node distances into consideration?
migration_cost measures the actual node-node latency. It gets
printed early in console output. Since I only have one-node
boxes, I just see:
migration_cost=10002
but I think on a multi-node box you'll get a string of numbers.
Ah, but looking at the code, I think I was mistaken ... it seems
to only know about distances between scheduler domains ... which
may or may not match with all the node information. But it might
still be worth looking at to see whether it can be used, or can
be easily extended to be used. A measured value is often better
than a static define that doesn't apply to all systems.
-Tony
* Re: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (3 preceding siblings ...)
2006-04-14 17:39 ` Luck, Tony
@ 2006-04-14 17:41 ` Christoph Lameter
2006-04-14 17:43 ` Luck, Tony
` (7 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 17:41 UTC (permalink / raw)
To: linux-ia64
On Fri, 14 Apr 2006, John Hawkes wrote:
> For example, one level of sched domain is the all-CPUs sched domain. For a
> NUMA system it is unlikely that the migration_cost between cpu0 and cpu1 is
> the same as between cpu0 and cpu511, and yet only cpu0 and cpu1 are chosen.
> Another sched domain level on sn2 platforms, typically two CPUs per node, is
> 32p. Again, there the migration_cost between cpu0 and cpu1 is different than
> cpu0 and cpu31.
Would it not be much simpler to use the SLIT table to estimate the
migration costs?
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (4 preceding siblings ...)
2006-04-14 17:41 ` Christoph Lameter
@ 2006-04-14 17:43 ` Luck, Tony
2006-04-14 17:43 ` Christoph Lameter
` (6 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:43 UTC (permalink / raw)
To: linux-ia64
> Would it not be much simpler to use the SLIT table to estimate the
> migration costs?
SLIT table doesn't account for HT and multicore effects. Just node
to node distances.
-Tony
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (5 preceding siblings ...)
2006-04-14 17:43 ` Luck, Tony
@ 2006-04-14 17:43 ` Christoph Lameter
2006-04-14 17:44 ` Luck, Tony
` (5 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 17:43 UTC (permalink / raw)
To: linux-ia64
On Fri, 14 Apr 2006, Luck, Tony wrote:
> Ah, but looking at the code, I think I was mistaken ... it seems
> to only know about distances between scheduler domains ... which
> may or may not match with all the node information. But it might
> still be worth looking at to see whether it can be used, or can
> be easily extended to be used. A measured value is often better
> than a static define that doesn't apply to all systems.
RECLAIM_DISTANCE is compared against the largest node_distance() in the
system in order to make the decision if zone_reclaim should be enabled for
a system.
node_distance() uses the SLIT table and is a dynamic value provided by
each system.
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (6 preceding siblings ...)
2006-04-14 17:43 ` Christoph Lameter
@ 2006-04-14 17:44 ` Luck, Tony
2006-04-14 17:54 ` Luck, Tony
` (4 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:44 UTC (permalink / raw)
To: linux-ia64
> No. The migration_cost is an empirical runtime calculation that estimates the
> relative cost of migrating a task within each level of sched domain. And even
> that is sometimes grossly inaccurate because only two arbitrary CPUs are
> chosen for this calculation within one sched domain at each level, thus
> assuming that the migration_cost is the same between any and all two CPUs in
> that sched domain, and that all the sched domains for a given level exhibit
> equivalent migration_cost behavior.
Sounds like this code needs to see if there is a SLIT table, and if there is
use it to decide which cpus to use, rather than picking arbitrarily.
-Tony
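
One way to picture this suggestion: pick one node pair for every distinct
distance found in the SLIT, then measure a CPU from each node of each pair.
The sketch below is only an illustration. The 4-node example_slit, the
struct node_pair type and pick_representative_pairs() are hypothetical, the
table is assumed to be symmetric, and HT or multicore effects inside a node
are not covered.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES    4
#define MAX_CLASSES (NR_NODES * NR_NODES)

struct node_pair {
    int from;
    int to;
    int distance;
};

/* Hypothetical SLIT: two pairs of close nodes, with the pairs far apart. */
static const int example_slit[NR_NODES][NR_NODES] = {
    { 10, 20, 40, 40 },
    { 20, 10, 40, 40 },
    { 40, 40, 10, 20 },
    { 40, 40, 20, 10 },
};

/*
 * Record the first node pair seen for every distinct SLIT distance.
 * Only the upper triangle is scanned (the table is assumed symmetric).
 * Returns the number of pairs written to out[].
 */
static size_t pick_representative_pairs(const int slit[NR_NODES][NR_NODES],
                                        struct node_pair out[MAX_CLASSES])
{
    size_t count = 0;

    for (int from = 0; from < NR_NODES; from++) {
        for (int to = from; to < NR_NODES; to++) {
            int d = slit[from][to];
            size_t i;

            for (i = 0; i < count; i++)
                if (out[i].distance == d)
                    break;
            if (i == count) {          /* distance class not seen yet */
                assert(count < MAX_CLASSES);
                out[count++] = (struct node_pair){ from, to, d };
            }
        }
    }
    return count;
}
```

For example_slit this yields three pairs, (0,0) at distance 10, (0,1) at 20
and (0,2) at 40, so three measurements would cover every distance class
instead of one arbitrary pair per level.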
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (7 preceding siblings ...)
2006-04-14 17:44 ` Luck, Tony
@ 2006-04-14 17:54 ` Luck, Tony
2006-04-14 18:04 ` John Hawkes
` (3 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:54 UTC (permalink / raw)
To: linux-ia64
> RECLAIM_DISTANCE is compared against the largest node_distance() in the
> system in order to make the decision if zone_reclaim should be enabled for
> a system.
>
> node_distance() uses the SLIT table and is a dynamic value provided by
> each system.
So node_distance is a reasonably meaningful number (providing that the
SLIT table is an accurate representation of reality ... about which we
are at the mercy of the f/w writers).
But I'm still trying to see why RECLAIM_DISTANCE would be different for
ia64 systems from other architectures. If 15 is the right number for
ia64, why is 20 the right number for powerpc? What if I have a 4-node
NUMA system with a NUMA factor of 1.6? SLIT =
10 16 16 16
16 10 16 16
16 16 10 16
16 16 16 10
Should I not use zone reclaim on this system? But with factor=1.5
it would be OK?
-Tony
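
The question can be made concrete with a small user-space sketch. It
assumes, as described earlier in the thread, that zone reclaim is switched
on when the largest node_distance() in the system exceeds RECLAIM_DISTANCE.
The strict greater-than comparison, the fixed 4-node array, and the helper
names max_node_distance() and zone_reclaim_wanted() are assumptions made for
illustration and are not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4

/* The 4-node SLIT from the message above (NUMA factor 1.6). */
static const int slit[NR_NODES][NR_NODES] = {
    { 10, 16, 16, 16 },
    { 16, 10, 16, 16 },
    { 16, 16, 10, 16 },
    { 16, 16, 16, 10 },
};

/* Largest distance between any two nodes in the table. */
static int max_node_distance(const int table[NR_NODES][NR_NODES])
{
    int max = 0;

    for (size_t from = 0; from < NR_NODES; from++)
        for (size_t to = 0; to < NR_NODES; to++)
            if (table[from][to] > max)
                max = table[from][to];
    return max;
}

/* Assumption: reclaim is wanted only if some node is farther than the threshold. */
static int zone_reclaim_wanted(const int table[NR_NODES][NR_NODES],
                               int reclaim_distance)
{
    assert(reclaim_distance > 0);
    return max_node_distance(table) > reclaim_distance;
}
```

Under these assumptions zone_reclaim_wanted(slit, 15) is true (16 > 15) and
zone_reclaim_wanted(slit, 20) is false, so this system would get zone
reclaim by default with a threshold of 15 but not with the generic 20. A
system with a factor of exactly 1.5 (distance 15) would not exceed a
threshold of 15 either.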
* Re: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (8 preceding siblings ...)
2006-04-14 17:54 ` Luck, Tony
@ 2006-04-14 18:04 ` John Hawkes
2006-04-14 18:06 ` Christoph Lameter
` (2 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: John Hawkes @ 2006-04-14 18:04 UTC (permalink / raw)
To: linux-ia64
From: "Christoph Lameter" <clameter@sgi.com>
> Would it not be much simpler to use the SLIT table to estimate the
> migration costs?
If the architecture has a SLIT, perhaps yes.
The current algorithm is very empirical: for a given pair of CPUs, dirty the
L2 cache, migrate the task to the 2nd CPU, then measure how long it takes to
redirty the data. The SLIT can give you some metric of "distance" between two
CPUs or two nodes, but the scheduler is looking for something it deems
directly related to the effects of migrating a cache-hot task.
Again, my problem with the migration_cost is that it is expensive to
calculate, and that the calculation takes pains to be accurate to within
10-20%, and yet it makes assumptions (e.g., that the migration cost between
cpu0 and cpu1 is the same as between cpu0 and cpu31) that make its usefulness
questionable.
John Hawkes
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (9 preceding siblings ...)
2006-04-14 18:04 ` John Hawkes
@ 2006-04-14 18:06 ` Christoph Lameter
2006-04-14 18:08 ` Christoph Lameter
2006-04-14 18:22 ` Chen, Kenneth W
12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 18:06 UTC (permalink / raw)
To: linux-ia64
On Fri, 14 Apr 2006, Luck, Tony wrote:
> So node_distance is a reasonably meaningful number (providing that the
> SLIT table is an accurate representation of reality ... about which we
> are at the mercy of the f/w writers).
>
> But I'm still trying to see why RECLAIM_DISTANCE would be different for
> ia64 systems from other architectures. If 15 is the right number for
> ia64, why is 20 the right number for powerpc? What if I have a 4-node
> NUMA system with a NUMA factor of 1.6? SLIT =
20 is a sane fallback for an architecture that potentially does not
implement slit tables at all. REMOTE_DISTANCE is already defined to be 20
if no node_distance is defined. This means that any NUMA arch must at
least be able to do something meaningful with numa distances 10 and 20. So
20 is a safe value to assign to REMOTE_DISTANCE. The values in
include/linux/topology.h are just fallback definitions.
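
The fallback arrangement described here can be sketched as follows. This is
a stand-alone illustration of the #ifndef pattern, not a verbatim copy of
include/linux/topology.h. The value 20 for RECLAIM_DISTANCE is the generic
default mentioned earlier in the thread, and an architecture with a SLIT is
assumed to define its own node_distance() and RECLAIM_DISTANCE before this
point is reached.

```c
/* Generic fallbacks, used only when the architecture defines nothing better. */
#ifndef node_distance
#define LOCAL_DISTANCE   10
#define REMOTE_DISTANCE  20
/* Without a SLIT, every other node is simply "remote". */
#define node_distance(from, to) \
    ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
#endif

#ifndef RECLAIM_DISTANCE
#define RECLAIM_DISTANCE 20
#endif
```

With only these fallbacks in effect, node_distance() can return nothing
larger than 20, which is why an architecture that wants a lower threshold
has to supply its own RECLAIM_DISTANCE.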
> 10 16 16 16
> 16 10 16 16
> 16 16 10 16
> 16 16 16 10
>
> Should I not use zone reclaim on this system? But with factor=1.5
> it would be OK?
To some extent this is an arbitrary decision. But it was implemented in
order to allow an automatic determination to switch off zone reclaim for
low latency systems. This can be overridden by writing a value into
/proc/sys/vm/zone_reclaim_mode.
* Re: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (10 preceding siblings ...)
2006-04-14 18:06 ` Christoph Lameter
@ 2006-04-14 18:08 ` Christoph Lameter
2006-04-14 18:22 ` Chen, Kenneth W
12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 18:08 UTC (permalink / raw)
To: linux-ia64
On Fri, 14 Apr 2006, John Hawkes wrote:
> Again, my problem with the migration_cost is that it is expensive to
> calculate, and that the calculation takes pains to be accurate to within
> 10-20%, and yet it makes assumptions (e.g., that the migration cost between
> cpu0 and cpu1 is the same as between cpu0 and cpu31) that make its usefulness
> questionable.
Would not using the node_distance() allow us to reduce the expense?
Calculate the costs within one node (HT / multicore) and between
neighboring nodes with a known distance and extrapolate based on the slit
numbers from there?
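
A rough sketch of what such an extrapolation could look like. It assumes
that migration cost grows linearly with SLIT distance beyond the local
distance of 10, which is an assumption made purely for illustration and not
something measured in this thread. estimate_migration_cost() and its
parameters are hypothetical names.

```c
#include <assert.h>

#define LOCAL_DISTANCE 10   /* SLIT distance of a node to itself */

/*
 * Estimate the migration cost for a node pair at SLIT distance `distance`
 * from two measurements: the cost inside one node (local_cost) and the
 * cost to a neighbouring node at `neighbor_distance` (neighbor_cost).
 * Assumption: cost scales linearly with distance above LOCAL_DISTANCE.
 */
static long estimate_migration_cost(long local_cost, long neighbor_cost,
                                    int neighbor_distance, int distance)
{
    assert(neighbor_distance > LOCAL_DISTANCE);
    assert(distance >= LOCAL_DISTANCE);

    long cost_delta = neighbor_cost - local_cost;
    int span = neighbor_distance - LOCAL_DISTANCE;

    return local_cost + cost_delta * (distance - LOCAL_DISTANCE) / span;
}
```

With a measured local cost of 10000 and a neighbour cost of 20000 at
distance 20 (made-up units), a node pair at distance 40 would be estimated
at 40000 without running a further measurement.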
* RE: Setup an IA64 specific reclaim distance
2006-04-14 1:23 Setup an IA64 specific reclaim distance Christoph Lameter
` (11 preceding siblings ...)
2006-04-14 18:08 ` Christoph Lameter
@ 2006-04-14 18:22 ` Chen, Kenneth W
12 siblings, 0 replies; 14+ messages in thread
From: Chen, Kenneth W @ 2006-04-14 18:22 UTC (permalink / raw)
To: linux-ia64
John Hawkes wrote on Friday, April 14, 2006 11:04 AM
> From: "Christoph Lameter" <clameter@sgi.com>
> > Would it not be much simpler to use the SLIT table to estimate the
> > migration costs?
>
> The current algorithm is very empirical: for a given pair of CPUs, dirty the
> L2 cache, migrate the task to the 2nd CPU, then measure how long it takes to
> redirty the data. The SLIT can give you some metric of "distance" between two
> CPUs or two nodes, but the scheduler is looking for something it deems
> directly related to the effects of migrating a cache-hot task.
>
> Again, my problem with the migration_cost is that it is expensive to
> calculate, and that the calculation takes pains to be accurate to within
> 10-20%, and yet it makes assumptions (e.g., that the migration cost between
> cpu0 and cpu1 is the same as between cpu0 and cpu31) that make its usefulness
> questionable.
That looks like a design flaw in the migration_cost measurement. I remember
that it previously iterated over all CPUs, which made the boot-time
measurement unbearable (something like 30 minutes on a moderately sized NUMA
system). The solution was to cache the measurement for one pair and skip the
rest, assuming they are the same. That logic appears to be implemented
incorrectly if it behaves the way you describe above.
It is fixable, I suppose. I will look into it.
- Ken