public inbox for linux-ia64@vger.kernel.org
* Setup an IA64 specific reclaim distance
@ 2006-04-14  1:23 Christoph Lameter
  2006-04-14 16:18 ` Luck, Tony
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14  1:23 UTC (permalink / raw)
  To: linux-ia64

RECLAIM_DISTANCE is checked at bootup against the SLIT table distances.
Zone reclaim is important for systems that have higher latencies, but not for
systems that have multiple nodes on one motherboard and therefore low latencies.

We found that on-motherboard latencies are typically 1 to 1.4 times the latency
of a local memory access, whereas multinode systems that benefit from zone
reclaim usually have more than 1.5 times the latency of a local access.

Set the reclaim distance for IA64 to 15, i.e. 1.5 times the local distance of 10.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc1-mm2/include/asm-ia64/topology.h
===================================================================
--- linux-2.6.17-rc1-mm2.orig/include/asm-ia64/topology.h	2006-04-02 20:22:10.000000000 -0700
+++ linux-2.6.17-rc1-mm2/include/asm-ia64/topology.h	2006-04-13 17:49:18.000000000 -0700
@@ -23,6 +23,11 @@
 #define PENALTY_FOR_NODE_WITH_CPUS 255
 
 /*
+ * Distance above which we begin to use zone reclaim
+ */
+#define RECLAIM_DISTANCE 15
+
+/*
  * Returns the number of the node containing CPU 'cpu'
  */
 #define cpu_to_node(cpu) (int)(cpu_to_node_map[cpu])
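
For reference, the arithmetic behind the value 15 can be sketched as follows.
This assumes the usual SLIT normalization, in which a node's distance to
itself is 10, so a remote/local latency ratio of 1.5 becomes 15. The helper
name slit_from_ratio() is hypothetical; it is not part of the patch or of the
kernel.

```c
#include <assert.h>

#define LOCAL_DISTANCE 10 /* assumed SLIT value for a node's own memory */

/*
 * Convert a latency ratio (remote latency / local latency) into SLIT
 * units, rounding to the nearest integer. Illustration only.
 */
int slit_from_ratio(double ratio)
{
	assert(ratio >= 1.0);
	return (int)(ratio * LOCAL_DISTANCE + 0.5);
}
```

With this, slit_from_ratio(1.5) gives 15, and the generic fallback of 20
corresponds to a ratio of 2.0.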


* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
@ 2006-04-14 16:18 ` Luck, Tony
  2006-04-14 16:44 ` Christoph Lameter
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 16:18 UTC (permalink / raw)
  To: linux-ia64

> We found that on-motherboard latencies are typically 1 to 1.4 times the latency
> of a local memory access, whereas multinode systems that benefit from zone
> reclaim usually have more than 1.5 times the latency of a local access.
> 
> Set the reclaim distance for IA64 to 15, i.e. 1.5 times the local distance of 10.

Does this really apply just to ia64 systems?  Or is the "20" value in
topology.h just wrong?

Is the right value for this in any way related to the "migration_cost"
that we spend many seconds of boot time computing?

-Tony


* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
  2006-04-14 16:18 ` Luck, Tony
@ 2006-04-14 16:44 ` Christoph Lameter
  2006-04-14 17:36 ` John Hawkes
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 16:44 UTC (permalink / raw)
  To: linux-ia64

On Fri, 14 Apr 2006, Luck, Tony wrote:

> > We found that on-motherboard latencies are typically 1 to 1.4 times the latency
> > of a local memory access, whereas multinode systems that benefit from zone
> > reclaim usually have more than 1.5 times the latency of a local access.
> > 
> > Set the reclaim distance for IA64 to 15, i.e. 1.5 times the local distance of 10.
> 
> Does this really apply just to ia64 systems?  Or is the "20" value in
> topology.h just wrong?

The value in topology.h is the same as REMOTE_DISTANCE. I think we 
had better leave these as fallbacks for systems that do not implement 
node_distance().

> Is the right value for this in any way related to the "migration_cost"
> that we spend many seconds of boot time computing?

I am not familiar with that. Does the migration_cost take the 
SLIT node distances into consideration?



* Re: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
  2006-04-14 16:18 ` Luck, Tony
  2006-04-14 16:44 ` Christoph Lameter
@ 2006-04-14 17:36 ` John Hawkes
  2006-04-14 17:39 ` Luck, Tony
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: John Hawkes @ 2006-04-14 17:36 UTC (permalink / raw)
  To: linux-ia64

From: "Christoph Lameter" <clameter@sgi.com>
> > Is the right value for this in any way related to the "migration_cost"
> > that we spend many seconds of boot time computing?
>
> I am not familiar with that. Does the migration_cost take the
> SLIT node distances into consideration?

No.  The migration_cost is an empirical runtime calculation that estimates the
relative cost of migrating a task within each level of sched domain.  And even
that is sometimes grossly inaccurate because only two arbitrary CPUs are
chosen for this calculation within one sched domain at each level, thus
assuming that the migration_cost is the same between any and all two CPUs in
that sched domain, and that all the sched domains for a given level exhibit
equivalent migration_cost behavior.

For example, one level of sched domain is the all-CPUs sched domain.  For a
NUMA system it is unlikely that the migration_cost between cpu0 and cpu1 is
the same as between cpu0 and cpu511, and yet only cpu0 and cpu1 are chosen.
Another sched domain level on sn2 platforms, typically two CPUs per node, is
32p.  Again, there the migration_cost between cpu0 and cpu1 is different than
cpu0 and cpu31.

John Hawkes



* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (2 preceding siblings ...)
  2006-04-14 17:36 ` John Hawkes
@ 2006-04-14 17:39 ` Luck, Tony
  2006-04-14 17:41 ` Christoph Lameter
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:39 UTC (permalink / raw)
  To: linux-ia64

> I am not familiar with that. Does the migration_cost take the 
> SLIT node distances into consideration?

migration_cost measures the actual node-node latency.  It gets
printed early in console output.  Since I only have one-node
boxes, I just see:

 migration_cost=10002

but I think on a multi-node box you'll get a string of numbers.

Ah, but looking at the code, I think I was mistaken ... it seems
to only know about distances between scheduler domains ... which
may or may not match with all the node information.  But it might
still be worth looking at to see whether it can be used, or can
be easily extended to be used.  A measured value is often better
than a static define that doesn't apply to all systems.

-Tony


* Re: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (3 preceding siblings ...)
  2006-04-14 17:39 ` Luck, Tony
@ 2006-04-14 17:41 ` Christoph Lameter
  2006-04-14 17:43 ` Luck, Tony
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 17:41 UTC (permalink / raw)
  To: linux-ia64

On Fri, 14 Apr 2006, John Hawkes wrote:

> For example, one level of sched domain is the all-CPUs sched domain.  For a
> NUMA system it is unlikely that the migration_cost between cpu0 and cpu1 is
> the same as between cpu0 and cpu511, and yet only cpu0 and cpu1 are chosen.
> Another sched domain level on sn2 platforms, typically two CPUs per node, is
> 32p.  Again, there the migration_cost between cpu0 and cpu1 is different than
> cpu0 and cpu31.

Would it not be much simpler to use the SLIT table to estimate the 
migration costs?



* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (4 preceding siblings ...)
  2006-04-14 17:41 ` Christoph Lameter
@ 2006-04-14 17:43 ` Luck, Tony
  2006-04-14 17:43 ` Christoph Lameter
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:43 UTC (permalink / raw)
  To: linux-ia64

> Would it not be much simpler to use the SLIT table to estimate the 
> migration costs?

SLIT table doesn't account for HT and multicore effects.  Just node
to node distances.

-Tony


* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (5 preceding siblings ...)
  2006-04-14 17:43 ` Luck, Tony
@ 2006-04-14 17:43 ` Christoph Lameter
  2006-04-14 17:44 ` Luck, Tony
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 17:43 UTC (permalink / raw)
  To: linux-ia64

On Fri, 14 Apr 2006, Luck, Tony wrote:

> Ah, but looking at the code, I think I was mistaken ... it seems
> to only know about distances between scheduler domains ... which
> may or may not match with all the node information.  But it might
> still be worth looking at to see whether it can be used, or can
> be easily extended to be used.  A measured value is often better
> than a static define that doesn't apply to all systems.

RECLAIM_DISTANCE is compared against the largest node_distance() in the 
system to decide whether zone_reclaim should be enabled for that system.

node_distance() uses the SLIT table and is a dynamic value provided by 
each system.


* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (6 preceding siblings ...)
  2006-04-14 17:43 ` Christoph Lameter
@ 2006-04-14 17:44 ` Luck, Tony
  2006-04-14 17:54 ` Luck, Tony
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:44 UTC (permalink / raw)
  To: linux-ia64

> No.  The migration_cost is an empirical runtime calculation that estimates the
> relative cost of migrating a task within each level of sched domain.  And even
> that is sometimes grossly inaccurate because only two arbitrary CPUs are
> chosen for this calculation within one sched domain at each level, thus
> assuming that the migration_cost is the same between any and all two CPUs in
> that sched domain, and that all the sched domains for a given level exhibit
> equivalent migration_cost behavior.

Sounds like this code needs to see if there is a SLIT table, and if there is
use it to decide which cpus to use, rather than picking arbitrarily.

-Tony
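
One way this could look, as a rough sketch only: for a given starting node,
pick one representative node for each distinct SLIT distance, then measure
against a CPU on each representative instead of an arbitrary pair. The
function name, MAX_NODES, and the matrix layout are all hypothetical; this is
not the scheduler code.

```c
#include <assert.h>

#define MAX_NODES 8 /* hypothetical upper bound for this sketch */

/*
 * For each distinct distance in row `from` of the SLIT, record the first
 * node found at that distance. Returns the number of representatives
 * written to reps[], which must have room for n entries.
 */
int pick_representatives(int n, int slit[][MAX_NODES], int from, int reps[])
{
	int count = 0;

	assert(n > 0 && n <= MAX_NODES && from >= 0 && from < n);
	for (int node = 0; node < n; node++) {
		int seen = 0;

		/* Skip nodes whose distance already has a representative. */
		for (int i = 0; i < count; i++)
			if (slit[from][reps[i]] == slit[from][node])
				seen = 1;
		if (!seen)
			reps[count++] = node;
	}
	return count;
}
```

The number of measurements then grows with the number of distinct distances
rather than with the number of CPUs. As noted elsewhere in the thread, the
SLIT says nothing about HT or multicore effects inside a node, so those would
still need their own measurement.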


* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (7 preceding siblings ...)
  2006-04-14 17:44 ` Luck, Tony
@ 2006-04-14 17:54 ` Luck, Tony
  2006-04-14 18:04 ` John Hawkes
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2006-04-14 17:54 UTC (permalink / raw)
  To: linux-ia64

> RECLAIM_DISTANCE is compared against the largest node_distance() in the 
> system to decide whether zone_reclaim should be enabled for that system.
>
> node_distance() uses the SLIT table and is a dynamic value provided by 
> each system.

So node_distance is a reasonably meaningful number (providing that the
SLIT table is an accurate representation of reality ... about which we
are at the mercy of the f/w writers).

But I'm still trying to see why RECLAIM_DISTANCE would be different for
ia64 systems from other architectures.  If 15 is the right number for
ia64, why is 20 the right number for powerpc?  What if I have a 4-node
NUMA system with a NUMA factor of 1.6?  SLIT = 

10 16 16 16
16 10 16 16
16 16 10 16
16 16 16 10

Should I not use zone reclaim on this system?  But with factor=1.5
it would be OK?

-Tony
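
Under the rule quoted at the top of this message (the largest node distance
is compared against RECLAIM_DISTANCE), the matrix above can be worked through
with a small sketch. It assumes a strict "greater than" comparison, which
matches the patch comment "Distance above which we begin to use zone
reclaim". The function names are hypothetical and this is not the kernel
code.

```c
#include <assert.h>

#define NODES 4

/* Largest entry in the distance matrix, i.e. the worst node-to-node distance. */
int max_distance(int slit[NODES][NODES])
{
	int max = 0;

	for (int i = 0; i < NODES; i++)
		for (int j = 0; j < NODES; j++)
			if (slit[i][j] > max)
				max = slit[i][j];
	return max;
}

/* Returns 1 if zone reclaim would be switched on for this threshold. */
int zone_reclaim_enabled(int slit[NODES][NODES], int reclaim_distance)
{
	return max_distance(slit) > reclaim_distance;
}
```

For the 1.6 matrix the largest distance is 16, so a threshold of 15 turns
zone reclaim on and the fallback threshold of 20 leaves it off.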


* Re: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (8 preceding siblings ...)
  2006-04-14 17:54 ` Luck, Tony
@ 2006-04-14 18:04 ` John Hawkes
  2006-04-14 18:06 ` Christoph Lameter
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: John Hawkes @ 2006-04-14 18:04 UTC (permalink / raw)
  To: linux-ia64

From: "Christoph Lameter" <clameter@sgi.com>
> Would it not be much simpler to use the SLIT table to estimate the
> migration costs?

If the architecture has a SLIT, perhaps yes.

The current algorithm is very empirical:  for a given pair of CPUs, dirty the
L2 cache, migrate the task to the 2nd CPU, then measure how long it takes to
redirty the data.  The SLIT can give you some metric of "distance" between two
CPUs or two nodes, but the scheduler is looking for something it deems
directly related to the effects of migrating a cache-hot task.

Again, my problem with the migration_cost is that it is expensive to
calculate, and that the calculation takes pains to be accurate to within
10-20%, and yet it makes assumptions (e.g., that the migration cost between
cpu0 and cpu1 is the same as between cpu0 and cpu31) that make its usefulness
questionable.

John Hawkes
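
The measurement described above (dirty the cache, migrate, time the redirty)
can be sketched in user space roughly as follows. This is an illustration
under stated assumptions, not the scheduler's implementation: the migration
step is a caller-supplied hook (on Linux it could wrap sched_setaffinity()),
stay_put() is a placeholder hook that does nothing, and the cache line size
is passed in rather than detected.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Write one byte per cache line so each line is dirtied by the current CPU. */
size_t dirty_lines(volatile unsigned char *buf, size_t len, size_t line_size)
{
	size_t touched = 0;

	assert(line_size > 0);
	for (size_t i = 0; i < len; i += line_size) {
		buf[i]++;
		touched++;
	}
	return touched;
}

/* Placeholder migration hook that leaves the task where it is. */
int stay_put(int cpu)
{
	(void)cpu;
	return 0;
}

/*
 * Dirty the buffer, call migrate(dst_cpu), then time a second pass over
 * the same buffer. Returns elapsed nanoseconds, or -1 if the hook fails.
 */
long long redirty_cost_ns(unsigned char *buf, size_t len, size_t line_size,
			  int (*migrate)(int), int dst_cpu)
{
	struct timespec t0, t1;

	dirty_lines(buf, len, line_size);
	if (migrate(dst_cpu) != 0)
		return -1;
	timespec_get(&t0, TIME_UTC);
	dirty_lines(buf, len, line_size);
	timespec_get(&t1, TIME_UTC);
	return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
	       + (t1.tv_nsec - t0.tv_nsec);
}
```

A real measurement would use a buffer sized to the cache, repeat the run
several times, and use a monotonic clock. The point here is only the shape of
the procedure, and that its result depends entirely on which pair of CPUs is
chosen.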



* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (9 preceding siblings ...)
  2006-04-14 18:04 ` John Hawkes
@ 2006-04-14 18:06 ` Christoph Lameter
  2006-04-14 18:08 ` Christoph Lameter
  2006-04-14 18:22 ` Chen, Kenneth W
  12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 18:06 UTC (permalink / raw)
  To: linux-ia64

On Fri, 14 Apr 2006, Luck, Tony wrote:

> So node_distance is a reasonably meaningful number (providing that the
> SLIT table is an accurate representation of reality ... about which we
> are at the mercy of the f/w writers).
> 
> But I'm still trying to see why RECLAIM_DISTANCE would be different for
> ia64 systems from other architectures.  If 15 is the right number for
> ia64, why is 20 the right number for powerpc?  What if I have a 4-node
> NUMA system with a NUMA factor of 1.6?  SLIT = 

20 is a sane fallback for an architecture that potentially does not 
implement slit tables at all. REMOTE_DISTANCE is already defined to be 20
if no node_distance is defined. This means that any NUMA arch must at 
least be able to do something meaningful with numa distances 10 and 20. So 
20 is a safe value to assign to REMOTE_DISTANCE. The values in 
include/linux/topology.h are just fallback definitions.
 
> 10 16 16 16
> 16 10 16 16
> 16 16 10 16
> 16 16 16 10
> 
> Should I not use zone reclaim on this system?  But with factor=1.5
> it would be OK?

To some extent this is an arbitrary decision. But it was implemented so 
that zone reclaim is switched off automatically for low-latency 
systems. This can be overridden by writing a value into 
/proc/sys/vm/zone_reclaim_mode.
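
A minimal sketch of such an override from a program, assuming root privileges
and that writing 0 switches zone reclaim off while a nonzero value switches it
on. The helper names are hypothetical; only the /proc path comes from the text
above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write an integer followed by a newline to a sysctl-style file.
 * Returns 0 on success, -1 on failure.
 */
int write_int_file(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read a single integer back from the file. Returns 0 on success. */
int read_int_file(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling write_int_file("/proc/sys/vm/zone_reclaim_mode", 0) would then
override the boot-time decision on a system where the SLIT crossed the
threshold.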


* Re: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (10 preceding siblings ...)
  2006-04-14 18:06 ` Christoph Lameter
@ 2006-04-14 18:08 ` Christoph Lameter
  2006-04-14 18:22 ` Chen, Kenneth W
  12 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2006-04-14 18:08 UTC (permalink / raw)
  To: linux-ia64

On Fri, 14 Apr 2006, John Hawkes wrote:

> Again, my problem with the migration_cost is that it is expensive to
> calculate, and that the calculation takes pains to be accurate to within
> 10-20%, and yet it makes assumptions (e.g., that the migration cost between
> cpu0 and cpu1 is the same as between cpu0 and cpu31) that make its usefulness
> questionable.

Would not using the node_distance() allow us to reduce the expense?
Calculate the costs  within one node (HT / multicore) and between 
neighboring nodes with a known distance and extrapolate based on the slit 
numbers from there?
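
As a rough sketch of the extrapolation: measure once within a node and once
to a neighboring node, then scale by the SLIT distance. The linear model is
purely an assumption made for this example; nothing in the thread establishes
that migration cost scales linearly with SLIT distance. The function name and
LOCAL_DISTANCE are hypothetical here.

```c
#include <assert.h>

#define LOCAL_DISTANCE 10 /* assumed SLIT value for the local node */

/*
 * Linearly extrapolate a migration cost for SLIT distance `d` from two
 * measurements: cost_local at LOCAL_DISTANCE and cost_near at d_near.
 */
double extrapolate_cost(double cost_local, double cost_near, int d_near, int d)
{
	double slope;

	assert(d_near > LOCAL_DISTANCE && d >= LOCAL_DISTANCE);
	slope = (cost_near - cost_local) / (d_near - LOCAL_DISTANCE);
	return cost_local + slope * (d - LOCAL_DISTANCE);
}
```

For example, with a measured cost of 100 inside a node and 200 to a neighbor
at distance 20, a node at distance 40 would be estimated at 400. Only two
measurements are needed regardless of the number of nodes.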



* RE: Setup an IA64 specific reclaim distance
  2006-04-14  1:23 Setup an IA64 specific reclaim distance Christoph Lameter
                   ` (11 preceding siblings ...)
  2006-04-14 18:08 ` Christoph Lameter
@ 2006-04-14 18:22 ` Chen, Kenneth W
  12 siblings, 0 replies; 14+ messages in thread
From: Chen, Kenneth W @ 2006-04-14 18:22 UTC (permalink / raw)
  To: linux-ia64

John Hawkes wrote on Friday, April 14, 2006 11:04 AM
> From: "Christoph Lameter" <clameter@sgi.com>
> > Would it not be much simpler to use the SLIT table to estimate the
> > migration costs?
> 
> The current algorithm is very empirical:  for a given pair of CPUs, dirty the
> L2 cache, migrate the task to the 2nd CPU, then measure how long it takes to
> redirty the data.  The SLIT can give you some metric of "distance" between two
> CPUs or two nodes, but the scheduler is looking for something it deems
> directly related to the effects of migrating a cache-hot task.
> 
> Again, my problem with the migration_cost is that it is expensive to
> calculate, and that the calculation takes pains to be accurate to within
> 10-20%, and yet it makes assumptions (e.g., that the migration cost between
> cpu0 and cpu1 is the same as between cpu0 and cpu31) that make its usefulness
> questionable.

That looks like a design flaw in the migration_cost measurement.  I remember
that it previously iterated over all CPUs and made the boot-time measurement
unbearable (something like 30 minutes on a moderately sized NUMA system).  The
solution was to cache one pair of measurements and skip the rest, assuming they
are the same.  That logic appears to be implemented incorrectly if it behaves
the way you describe above.

It is fixable, I suppose.  I will look into it.

- Ken

