* [RFC] NUMA schedulers benchmark results
@ 2002-10-06 16:51 Erich Focht
2002-10-06 20:24 ` Erich Focht
` (3 more replies)
0 siblings, 4 replies; 24+ messages in thread
From: Erich Focht @ 2002-10-06 16:51 UTC (permalink / raw)
To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 3752 bytes --]
Hi,
here are some results from testing various versions and approaches
to a NUMA scheduler. I used the numa_test benchmark which I'll post
in a separate email. It runs in parallel N tasks doing the same job:
access randomly a large array. As the array is large enough not to
fit into cache, this is very memory latency sensitive. Also it is
memory bandwidth sensitive. To emulate a real multi-user environment, the
jobs are disturbed by a short load peak. This is simulated by a call
to "hackbench" 3 seconds after the tasks were started. The performance
of the user tasks is depending very much on where they are scheduled
and are CPU hoggers such that the user times are quite independent of
the non-scheduler part of the underlying kernel. The elapsed times
are depending on "hackbench" which actually blocks the machine for the
time it is running. Hackbench is depending on the underlying kernel
and one should compare "elapsed_time - hackbench_time".
The test machine is a 16 CPU NEC Azusa with Itanium processors and
four nodes. The tested schedulers are:
A: O(1) scheduler in 2.5.39
B: O(1) scheduler with task steal limited to only one task (node
affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18
C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39
D: pooling NUMA scheduler, no initial load balancing, idle pool_delay
set to 0, under 2.4.18
E: node affine scheduler with initial load balancing and static homenode
F: node affine scheduler without initial load balancing and dynamic
homenode selection (homenode selected where most of the memory is
allocated).
As I'm rewriting the node affine scheduler to be more modular, I'll
redo the tests for cases D, E, F on top of 2.5.X kernels soon.
The results are summarized in the tables below. A set of outputs (for
N=8, 16, 32) is attached. They show clearly why the node affine
scheduler beats them all: The initial load balancing is node-aware,
the task-stealing too. Sometimes the node affine force is not large
enough to bring the task back to the homenode, but it is almost always
good enough.
The (F) solution with dynamically determined homenode show that the
initial load balancing is crucial, as the equal node balance is not
strongly enforced dynamically. So the optimal solution is probably
(F) with initial load balancing.
Average user time (U) and total user time (TU). Minimum per row should
be considered.
----------------------------------------------------------------------
Scheduler: A B C D E F
N=4 U 28.12 30.77 33.00 - 27.20 30.29
TU 112.55 123.13 132.08 - 108.88 121.25
N=8 U 30.47 31.39 31.65 30.76 28.67 30.08
TU 243.86 251.27 253.30 246.23 229.51 240.75
N=16 U 36.42 33.64 32.18 32.27 31.50 32.83
TU 582.91 538.49 515.11 516.53 504.17 525.59
N=32 U 38.69 34.83 34.05 33.76 33.89 34.11
TU 1238.4 1114.9 1090.1 1080.8 1084.9 1091.9
N=64 U 39.73 34.73 34.23 - (33.32) 34.98
TU 2543.4 2223.4 2191.7 - (2133) 2239.5
Elapsed time (E), hackbench time (H). Diferences between 2.4.18 and
2.5.39 based kernels due to "hackbench", which performs differently.
Compare E-H within a row, but don't take it too seriously.
--------------------------------------------------------------------
Scheduler: A B C D E F
N=4 E 37.33 37.96 48.31 - 28.14 35.91
H 9.98 1.49 10.65 - 1.99 1.43
N=8 E 46.17 39.50 42.53 39.72 30.28 38.28
H 9.64 1.86 7.27 2.07 2.33 1.86
N=16 E 47.21 44.67 49.66 42.97 36.98 42.51
H 5.90 4.69 2.93 5.178 5.56 5.94
N=32 E 88.60 79.92 80.34 78.35 76.84 77.38
H 6.29 5.23 2.85 4.51 5.29 4.28
N=64 E 167.10 147.16 150.59 - (133.9) 148.94
H 5.96 4.67 3.10 - (-) 6.86
(The E:N=64 results are without hackbench disturbance.)
Best regards,
Erich
[-- Attachment #2: NUMA_SCHED_8_16_32.results --]
[-- Type: text/plain, Size: 24048 bytes --]
A: 2.5.39 O(1) scheduler
Running 'hackbench 20' in the background: Time: 9.639
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 7.5 92.5 0.0 0.0 | 0 1 *| 38.29
2 0.0 0.0 0.0 100.0 | 0 3 *| 30.19
3 0.0 100.0 0.0 0.0 | 0 1 *| 27.84
4 0.0 0.0 0.0 100.0 | 0 3 *| 30.38
5 100.0 0.0 0.0 0.0 | 0 0 | 29.96
6 100.0 0.0 0.0 0.0 | 0 0 | 29.63
7 0.0 0.0 100.0 0.0 | 0 2 *| 27.33
8 0.0 0.0 0.0 100.0 | 0 3 *| 30.14
AverageUserTime 30.47 seconds
ElapsedTime 46.17
TotalUserTime 243.86
TotalSysTime 0.41
Running 'hackbench 20' in the background: Time: 5.896
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 98.7 1.3 0.0 | 0 1 *| 41.07
2 89.2 10.8 0.0 0.0 | 0 0 | 39.16
3 9.4 0.0 90.6 0.0 | 0 2 *| 40.72
4 0.0 0.0 0.0 100.0 | 0 3 *| 31.96
5 0.0 0.1 99.0 0.9 | 0 2 *| 41.36
6 0.0 0.0 100.0 0.0 | 0 2 *| 30.62
7 0.6 0.0 86.2 13.2 | 0 2 *| 40.11
8 100.0 0.0 0.0 0.0 | 0 0 | 31.25
9 8.0 0.0 0.0 92.0 | 0 3 *| 40.63
10 0.0 100.0 0.0 0.0 | 0 1 *| 30.43
11 0.0 0.0 0.0 100.0 | 0 3 *| 31.64
12 91.6 0.0 8.4 0.0 | 0 0 | 39.92
13 0.0 100.0 0.0 0.0 | 0 1 *| 30.46
14 92.1 0.0 7.9 0.0 | 0 0 | 40.36
15 0.0 0.0 0.0 100.0 | 0 3 *| 32.29
16 4.8 95.2 0.0 0.0 | 0 1 *| 40.70
AverageUserTime 36.42 seconds
ElapsedTime 47.21
TotalUserTime 582.91
TotalSysTime 0.85
Running 'hackbench 20' in the background: Time: 6.293
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 100.0 0.0 | 0 2 *| 31.53
2 100.0 0.0 0.0 0.0 | 0 0 | 34.84
3 100.0 0.0 0.0 0.0 | 0 0 | 32.70
4 0.0 100.0 0.0 0.0 | 0 1 *| 32.25
5 0.0 0.0 0.0 100.0 | 0 3 *| 33.41
6 96.5 2.7 0.8 0.0 | 0 0 | 43.39
7 5.6 0.0 94.4 0.0 | 0 2 *| 44.42
8 73.4 22.9 3.7 0.0 | 0 0 | 42.14
9 0.9 0.0 99.1 0.0 | 0 2 *| 45.10
10 0.0 8.4 91.6 0.0 | 0 2 *| 42.85
11 0.0 0.0 0.0 100.0 | 0 3 *| 32.11
12 0.0 100.0 0.0 0.0 | 0 1 *| 32.16
13 0.0 0.0 0.0 100.0 | 0 3 *| 33.48
14 0.0 0.0 100.0 0.0 | 0 2 *| 31.35
15 0.0 0.3 3.2 96.4 | 0 3 *| 42.27
16 91.0 0.0 0.0 9.0 | 0 0 | 35.50
17 0.0 100.0 0.0 0.0 | 0 1 *| 32.45
18 0.0 97.6 2.4 0.0 | 0 1 *| 41.97
19 0.0 100.0 0.0 0.0 | 0 1 *| 32.46
20 1.3 0.0 74.2 24.5 | 0 2 *| 43.78
21 0.0 0.0 0.9 99.1 | 0 3 *| 32.74
22 0.8 0.0 99.2 0.0 | 0 2 *| 45.61
23 100.0 0.0 0.0 0.0 | 0 0 | 33.78
24 97.0 1.0 0.0 1.9 | 0 0 | 43.74
25 0.0 1.8 0.0 98.2 | 0 3 *| 43.31
26 0.0 96.1 0.0 3.9 | 0 1 *| 43.61
27 2.6 0.0 0.3 97.0 | 0 3 *| 46.54
28 0.0 0.0 0.0 100.0 | 0 3 *| 33.59
29 0.0 92.1 0.0 7.9 | 0 1 *| 43.43
30 2.2 0.0 97.8 0.0 | 0 2 *| 45.19
31 26.0 72.3 0.0 1.8 | 0 1 *| 43.29
32 98.5 1.3 0.2 0.0 | 0 0 | 42.97
AverageUserTime 38.69 seconds
ElapsedTime 88.60
TotalUserTime 1238.43
TotalSysTime 1.85
-------------------------------------------------------------
B: O(1) scheduler equivalent, 1 task steal per load balance step
(2.4.18 node affine scheduler with CONFIG_NUMA_SCHED=n)
Running 'hackbench 20' in the background: Time: 1.861
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 100.0 0.0 | 2 2 | 31.62
2 0.0 0.0 100.0 0.0 | 2 2 | 31.68
3 100.0 0.0 0.0 0.0 | 0 0 | 29.40
4 0.0 0.0 100.0 0.0 | 2 2 | 31.56
5 0.0 0.0 100.0 0.0 | 2 2 | 31.60
6 95.0 5.0 0.0 0.0 | 1 0 *| 38.14
7 100.0 0.0 0.0 0.0 | 0 0 | 29.27
8 0.0 100.0 0.0 0.0 | 1 1 | 27.87
AverageUserTime 31.39 seconds
ElapsedTime 39.50
TotalUserTime 251.27
TotalSysTime 0.48
Running 'hackbench 20' in the background: Time: 4.688
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 100.0 0.0 | 2 2 | 31.49
2 0.0 0.0 100.0 0.0 | 2 2 | 31.33
3 0.0 0.0 100.0 0.0 | 2 2 | 31.30
4 100.0 0.0 0.0 0.0 | 0 0 | 31.98
5 100.0 0.0 0.0 0.0 | 0 0 | 32.21
6 0.0 91.3 8.7 0.0 | 2 1 *| 39.88
7 0.0 100.0 0.0 0.0 | 1 1 | 30.72
8 0.0 0.0 0.0 100.0 | 3 3 | 31.56
9 1.2 98.8 0.0 0.0 | 0 1 *| 41.31
10 0.0 0.0 0.0 100.0 | 3 3 | 31.56
11 0.0 100.0 0.0 0.0 | 1 1 | 30.67
12 0.0 0.0 0.0 100.0 | 3 3 | 31.45
13 100.0 0.0 0.0 0.0 | 0 0 | 31.94
14 5.5 0.0 94.5 0.0 | 0 2 *| 40.83
15 0.0 0.0 0.0 100.0 | 3 3 | 31.53
16 83.0 17.0 0.0 0.0 | 1 0 *| 38.50
AverageUserTime 33.64 seconds
ElapsedTime 44.67
TotalUserTime 538.49
TotalSysTime 1.02
Running 'hackbench 20' in the background: Time: 5.229
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 67.6 0.0 17.4 15.0 | 2 0 *| 41.36
2 0.0 0.0 100.0 0.0 | 2 2 | 32.78
3 100.0 0.0 0.0 0.0 | 0 0 | 32.92
4 0.0 0.0 100.0 0.0 | 2 2 | 33.56
5 22.4 75.3 2.3 0.0 | 2 1 *| 43.26
6 0.0 0.0 100.0 0.0 | 2 2 | 33.38
7 100.0 0.0 0.0 0.0 | 0 0 | 33.16
8 0.0 0.0 0.0 100.0 | 3 3 | 32.03
9 0.0 100.0 0.0 0.0 | 1 1 | 32.64
10 0.0 0.0 0.0 100.0 | 3 3 | 32.00
11 4.5 0.0 0.0 95.5 | 0 3 *| 43.82
12 0.0 0.0 0.0 100.0 | 3 3 | 31.91
13 100.0 0.0 0.0 0.0 | 0 0 | 33.46
14 0.0 100.0 0.0 0.0 | 1 1 | 32.39
15 0.0 0.0 0.0 100.0 | 3 3 | 31.86
16 0.0 0.0 100.0 0.0 | 2 2 | 33.35
17 100.0 0.0 0.0 0.0 | 0 0 | 33.95
18 0.0 0.0 0.0 100.0 | 3 3 | 31.68
19 0.0 5.3 0.0 94.7 | 1 3 *| 42.97
20 2.0 98.0 0.0 0.0 | 0 1 *| 43.56
21 0.0 0.0 0.0 100.0 | 3 3 | 31.65
22 100.0 0.0 0.0 0.0 | 0 0 | 34.13
23 0.0 100.0 0.0 0.0 | 1 1 | 32.52
24 9.9 90.1 0.0 0.0 | 1 1 | 33.58
25 0.0 3.2 96.8 0.0 | 1 2 *| 42.51
26 89.8 0.0 0.0 10.2 | 0 0 | 34.18
27 0.0 100.0 0.0 0.0 | 1 1 | 32.44
28 0.0 2.4 97.6 0.0 | 2 2 | 33.59
29 8.4 0.0 91.6 0.0 | 2 2 | 33.78
30 100.0 0.0 0.0 0.0 | 0 0 | 33.52
31 0.0 6.3 93.7 0.0 | 2 2 | 33.37
32 0.0 100.0 0.0 0.0 | 1 1 | 33.20
AverageUserTime 34.83 seconds
ElapsedTime 79.92
TotalUserTime 1114.91
TotalSysTime 2.30
-----------------------------------------------------------
C: simple scheduler MH
Running 'hackbench 20' in the background: Time: 7.271
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 32.66
2 100.0 0.0 0.0 0.0 | 0 0 | 32.53
3 100.0 0.0 0.0 0.0 | 0 0 | 32.47
4 0.0 100.0 0.0 0.0 | 1 1 | 30.33
5 0.0 100.0 0.0 0.0 | 1 1 | 30.20
6 0.0 100.0 0.0 0.0 | 1 1 | 30.06
7 100.0 0.0 0.0 0.0 | 0 0 | 32.35
8 100.0 0.0 0.0 0.0 | 0 0 | 32.58
AverageUserTime 31.65 seconds
ElapsedTime 42.53
TotalUserTime 253.30
TotalSysTime 0.48
Running 'hackbench 20' in the background: Time: 2.929
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 31.07
2 100.0 0.0 0.0 0.0 | 0 0 | 30.97
3 100.0 0.0 0.0 0.0 | 0 0 | 30.86
4 0.0 100.0 0.0 0.0 | 1 1 | 31.91
5 0.0 100.0 0.0 0.0 | 1 1 | 32.04
6 0.0 100.0 0.0 0.0 | 1 1 | 31.98
7 0.0 100.0 0.0 0.0 | 1 1 | 32.01
8 0.0 0.0 100.0 0.0 | 2 2 | 31.83
9 0.0 0.0 100.0 0.0 | 2 2 | 31.66
10 0.0 0.0 100.0 0.0 | 2 2 | 31.47
11 0.0 0.0 100.0 0.0 | 2 2 | 31.74
12 0.0 0.0 0.0 100.0 | 3 3 | 31.54
13 0.0 0.0 0.0 100.0 | 3 3 | 31.48
14 0.0 0.0 0.0 100.0 | 3 3 | 31.49
15 0.0 0.0 0.0 100.0 | 3 3 | 31.97
16 5.4 0.0 0.0 94.6 | 0 3 *| 40.86
AverageUserTime 32.18 seconds
ElapsedTime 49.66
TotalUserTime 515.11
TotalSysTime 0.95
Running 'hackbench 20' in the background: Time: 2.849
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 33.75
2 100.0 0.0 0.0 0.0 | 0 0 | 33.89
3 100.0 0.0 0.0 0.0 | 0 0 | 33.87
4 0.0 100.0 0.0 0.0 | 1 1 | 33.34
5 0.0 100.0 0.0 0.0 | 1 1 | 33.41
6 0.0 100.0 0.0 0.0 | 1 1 | 33.16
7 0.0 100.0 0.0 0.0 | 1 1 | 33.25
8 0.0 0.0 100.0 0.0 | 2 2 | 32.90
9 0.0 0.0 100.0 0.0 | 2 2 | 33.08
10 0.0 0.0 100.0 0.0 | 2 2 | 33.07
11 0.0 0.0 100.0 0.0 | 2 2 | 33.11
12 0.0 0.0 0.0 100.0 | 3 3 | 31.46
13 0.0 0.0 0.0 100.0 | 3 3 | 32.04
14 0.0 0.0 0.0 100.0 | 3 3 | 31.59
15 0.0 0.0 0.0 100.0 | 3 3 | 31.83
16 100.0 0.0 0.0 0.0 | 0 0 | 33.95
17 1.2 0.0 0.0 98.8 | 0 3 *| 44.66
18 100.0 0.0 0.0 0.0 | 0 0 | 34.02
19 55.4 0.0 0.0 44.6 | 3 0 | 37.11
20 100.0 0.0 0.0 0.0 | 0 0 | 33.62
21 0.0 100.0 0.0 0.0 | 1 1 | 33.33
22 0.0 100.0 0.0 0.0 | 1 1 | 33.26
23 0.0 100.0 0.0 0.0 | 1 1 | 33.41
24 0.0 100.0 0.0 0.0 | 1 1 | 33.14
25 0.0 0.0 100.0 0.0 | 2 2 | 33.26
26 0.0 0.0 100.0 0.0 | 2 2 | 33.05
27 0.0 0.0 3.1 96.9 | 2 3 *| 43.50
28 100.0 0.0 0.0 0.0 | 0 0 | 34.02
29 60.4 0.0 39.6 0.0 | 2 0 *| 37.05
30 0.0 0.0 0.0 100.0 | 3 3 | 31.53
31 100.0 0.0 0.0 0.0 | 0 0 | 33.94
32 0.0 0.0 100.0 0.0 | 2 2 | 33.03
AverageUserTime 34.05 seconds
ElapsedTime 80.34
TotalUserTime 1090.10
TotalSysTime 2.15
---------------------------------------------------------------
D: numasched : pooling only (idle pool delay set to 0)
Running 'hackbench 20' in the background: Time: 2.066
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 0.0 100.0 | 3 3 | 28.50
2 100.0 0.0 0.0 0.0 | 0 0 | 30.97
3 0.0 0.0 0.0 100.0 | 3 3 | 28.49
4 0.0 0.0 100.0 0.0 | 2 2 | 28.65
5 100.0 0.0 0.0 0.0 | 0 0 | 31.18
6 100.0 0.0 0.0 0.0 | 0 0 | 30.96
7 5.7 94.3 0.0 0.0 | 0 1 *| 38.69
8 0.0 0.0 100.0 0.0 | 2 2 | 28.65
AverageUserTime 30.76 seconds
ElapsedTime 39.72
TotalUserTime 246.23
TotalSysTime 0.46
Running 'hackbench 20' in the background: Time: 5.178
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 32.01
2 100.0 0.0 0.0 0.0 | 0 0 | 31.82
3 0.0 3.4 0.0 96.6 | 1 3 *| 41.01
4 100.0 0.0 0.0 0.0 | 0 0 | 32.24
5 100.0 0.0 0.0 0.0 | 0 0 | 31.91
6 0.0 0.0 100.0 0.0 | 2 2 | 31.75
7 0.0 0.0 0.0 100.0 | 3 3 | 30.55
8 0.0 0.0 100.0 0.0 | 2 2 | 31.74
9 0.0 0.0 100.0 0.0 | 2 2 | 31.74
10 0.0 0.0 0.0 100.0 | 3 3 | 30.63
11 0.0 0.0 0.0 100.0 | 3 3 | 30.67
12 0.0 0.0 100.0 0.0 | 2 2 | 31.75
13 0.0 100.0 0.0 0.0 | 1 1 | 31.94
14 0.0 100.0 0.0 0.0 | 1 1 | 32.20
15 0.0 100.0 0.0 0.0 | 1 1 | 32.24
16 0.0 100.0 0.0 0.0 | 1 1 | 32.07
AverageUserTime 32.27 seconds
ElapsedTime 42.97
TotalUserTime 516.53
TotalSysTime 1.16
Running 'hackbench 20' in the background: Time: 4.505
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 49.4 0.0 50.6 | 3 3 | 36.48
2 0.3 99.7 0.0 0.0 | 0 1 *| 39.21
3 0.0 100.0 0.0 0.0 | 1 1 | 32.23
4 100.0 0.0 0.0 0.0 | 0 0 | 32.88
5 100.0 0.0 0.0 0.0 | 0 0 | 32.98
6 100.0 0.0 0.0 0.0 | 0 0 | 32.97
7 0.0 18.3 81.7 0.0 | 2 2 | 35.13
8 0.0 5.7 94.3 0.0 | 2 2 | 34.00
9 0.0 0.0 100.0 0.0 | 2 2 | 33.14
10 0.0 3.3 96.7 0.0 | 2 2 | 34.22
11 0.0 100.0 0.0 0.0 | 1 1 | 32.34
12 0.0 100.0 0.0 0.0 | 1 1 | 32.37
13 0.0 13.4 0.0 86.6 | 3 3 | 34.55
14 0.0 0.0 0.0 100.0 | 3 3 | 33.93
15 0.0 0.0 0.0 100.0 | 3 3 | 33.08
16 100.0 0.0 0.0 0.0 | 0 0 | 33.31
17 100.0 0.0 0.0 0.0 | 0 0 | 33.08
18 0.0 8.2 0.0 91.8 | 3 3 | 33.76
19 0.0 0.0 0.0 100.0 | 3 3 | 33.23
20 0.0 32.7 0.0 67.3 | 3 3 | 34.79
21 0.0 0.0 0.0 100.0 | 3 3 | 32.94
22 100.0 0.0 0.0 0.0 | 0 0 | 33.23
23 0.0 100.0 0.0 0.0 | 1 1 | 32.35
24 0.0 0.0 100.0 0.0 | 2 2 | 33.44
25 0.0 100.0 0.0 0.0 | 1 1 | 32.42
26 0.0 0.0 100.0 0.0 | 2 2 | 33.25
27 0.0 0.0 100.0 0.0 | 2 2 | 33.82
28 0.0 0.0 100.0 0.0 | 2 2 | 32.88
29 0.0 0.0 0.0 100.0 | 3 3 | 31.99
30 0.0 100.0 0.0 0.0 | 1 1 | 31.90
31 100.0 0.0 0.0 0.0 | 0 0 | 32.93
32 99.6 0.0 0.4 0.0 | 2 0 *| 41.48
AverageUserTime 33.76 seconds
ElapsedTime 78.35
TotalUserTime 1080.75
TotalSysTime 2.03
-----------------------------------------------------------
E: node affine scheduler with initial load balancing, statically
assigned homenode
Running 'hackbench 20' in the background: Time: 2.330
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 100.0 0.0 | 2 2 | 28.86
2 0.0 100.0 0.0 0.0 | 1 1 | 28.82
3 0.0 100.0 0.0 0.0 | 1 1 | 28.85
4 0.0 0.0 0.0 100.0 | 3 3 | 28.47
5 100.0 0.0 0.0 0.0 | 0 0 | 28.69
6 0.0 0.0 100.0 0.0 | 2 2 | 28.65
7 0.0 0.0 0.0 100.0 | 3 3 | 28.49
8 100.0 0.0 0.0 0.0 | 0 0 | 28.55
AverageUserTime 28.67 seconds
ElapsedTime 30.28
TotalUserTime 229.51
TotalSysTime 0.43
Running 'hackbench 20' in the background: Time: 5.562
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 100.0 0.0 0.0 | 1 1 | 31.47
2 0.0 100.0 0.0 0.0 | 1 1 | 31.38
3 0.0 0.0 0.0 100.0 | 3 3 | 31.53
4 0.0 0.0 0.0 100.0 | 3 3 | 31.46
5 100.0 0.0 0.0 0.0 | 0 0 | 31.64
6 100.0 0.0 0.0 0.0 | 0 0 | 31.57
7 100.0 0.0 0.0 0.0 | 0 0 | 31.57
8 0.0 0.0 0.0 100.0 | 3 3 | 31.51
9 0.0 0.0 100.0 0.0 | 2 2 | 31.63
10 0.0 0.0 100.0 0.0 | 2 2 | 31.56
11 0.0 0.0 100.0 0.0 | 2 2 | 31.42
12 0.0 100.0 0.0 0.0 | 1 1 | 31.44
13 0.0 0.0 0.0 100.0 | 3 3 | 31.43
14 0.0 0.0 100.0 0.0 | 2 2 | 31.61
15 100.0 0.0 0.0 0.0 | 0 0 | 31.34
16 0.0 100.0 0.0 0.0 | 1 1 | 31.36
AverageUserTime 31.50 seconds
ElapsedTime 36.98
TotalUserTime 504.17
TotalSysTime 1.03
Running 'hackbench 20' in the background: Time: 5.288
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 46.1 53.9 | 2 3 *| 37.11
2 0.0 0.0 0.0 100.0 | 3 3 | 31.32
3 0.0 100.0 0.0 0.0 | 1 1 | 33.15
4 0.0 4.9 95.1 0.0 | 2 2 | 33.55
5 0.0 0.0 100.0 0.0 | 2 2 | 32.81
6 0.0 100.0 0.0 0.0 | 1 1 | 32.88
7 0.0 100.0 0.0 0.0 | 1 1 | 33.31
8 82.9 17.1 0.0 0.0 | 1 0 *| 40.70
9 100.0 0.0 0.0 0.0 | 0 0 | 33.36
10 9.3 0.0 0.0 90.7 | 0 3 *| 42.12
11 100.0 0.0 0.0 0.0 | 0 0 | 33.38
12 0.0 0.0 0.0 100.0 | 3 3 | 31.28
13 0.0 0.0 0.0 100.0 | 3 3 | 31.43
14 0.0 100.0 0.0 0.0 | 1 1 | 33.33
15 0.0 100.0 0.0 0.0 | 1 1 | 33.04
16 100.0 0.0 0.0 0.0 | 0 0 | 33.02
17 95.2 4.8 0.0 0.0 | 0 0 | 33.82
18 3.8 0.0 0.0 96.2 | 0 3 *| 43.21
19 0.0 0.0 0.0 100.0 | 3 3 | 31.11
20 0.0 0.0 100.0 0.0 | 2 2 | 32.19
21 0.0 100.0 0.0 0.0 | 1 1 | 33.43
22 0.0 0.0 100.0 0.0 | 2 2 | 33.03
23 0.0 0.0 100.0 0.0 | 2 2 | 33.06
24 100.0 0.0 0.0 0.0 | 0 0 | 33.04
25 100.0 0.0 0.0 0.0 | 0 0 | 33.07
26 0.0 0.0 100.0 0.0 | 2 2 | 33.07
27 0.0 0.0 100.0 0.0 | 2 2 | 33.22
28 0.0 0.0 0.0 100.0 | 3 3 | 31.06
29 0.0 15.3 84.7 0.0 | 2 2 | 33.25
30 0.0 100.0 0.0 0.0 | 1 1 | 33.36
31 85.0 0.0 0.0 15.0 | 0 0 | 34.50
32 0.0 86.4 0.0 13.6 | 1 1 | 34.27
AverageUserTime 33.89 seconds
ElapsedTime 76.84
TotalUserTime 1084.86
TotalSysTime 2.12
--------------------------------------------------------
F: node affine scheduler, dynamically assigned homenode, no
initial load balancing
Running 'hackbench 20' in the background: Time: 1.861
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 0.0 100.0 | 3 3 | 30.75
2 100.0 0.0 0.0 0.0 | 0 0 | 27.12
3 0.0 0.0 0.0 100.0 | 3 3 | 30.60
4 0.0 0.0 0.0 100.0 | 3 3 | 30.65
5 0.0 100.0 0.0 0.0 | 1 1 | 28.65
6 0.0 0.0 78.4 21.6 | 3 2 *| 36.63
7 0.0 100.0 0.0 0.0 | 1 1 | 28.57
8 0.0 0.0 100.0 0.0 | 2 2 | 27.65
AverageUserTime 30.08 seconds
ElapsedTime 38.28
TotalUserTime 240.75
TotalSysTime 0.49
Running 'hackbench 20' in the background: Time: 5.936
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 0.0 100.0 | 3 3 | 31.47
2 100.0 0.0 0.0 0.0 | 0 0 | 29.90
3 88.2 0.0 4.6 7.2 | 2 0 *| 40.96
4 0.0 0.0 0.0 100.0 | 3 3 | 31.53
5 0.0 0.0 0.0 100.0 | 3 3 | 31.50
6 0.0 100.0 0.0 0.0 | 1 1 | 32.25
7 0.0 84.5 0.0 15.5 | 1 1 | 33.41
8 0.0 0.0 100.0 0.0 | 2 2 | 32.38
9 83.0 17.0 0.0 0.0 | 1 0 *| 39.29
10 0.0 0.0 0.0 100.0 | 3 3 | 31.51
11 0.0 100.0 0.0 0.0 | 1 1 | 32.38
12 0.0 0.0 100.0 0.0 | 2 2 | 32.30
13 0.0 100.0 0.0 0.0 | 1 1 | 32.06
14 100.0 0.0 0.0 0.0 | 0 0 | 29.85
15 0.0 0.0 100.0 0.0 | 2 2 | 32.18
16 0.0 0.0 100.0 0.0 | 2 2 | 32.39
AverageUserTime 32.83 seconds
ElapsedTime 42.51
TotalUserTime 525.59
TotalSysTime 1.14
Running 'hackbench 20' in the background: Time: 4.277
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 0.0 0.0 100.0 | 3 3 | 31.69
2 0.0 0.0 100.0 0.0 | 2 2 | 33.81
3 100.0 0.0 0.0 0.0 | 0 0 | 33.10
4 1.4 4.9 2.8 90.9 | 2 3 *| 44.26
5 0.0 0.0 0.0 100.0 | 3 3 | 31.76
6 0.0 0.0 0.0 100.0 | 3 3 | 31.69
7 0.0 0.0 3.6 96.4 | 2 3 *| 43.73
8 0.0 100.0 0.0 0.0 | 1 1 | 33.07
9 0.0 0.0 100.0 0.0 | 2 2 | 34.04
10 0.0 100.0 0.0 0.0 | 1 1 | 33.24
11 0.0 100.0 0.0 0.0 | 1 1 | 33.23
12 100.0 0.0 0.0 0.0 | 0 0 | 33.77
13 0.0 100.0 0.0 0.0 | 1 1 | 33.10
14 100.0 0.0 0.0 0.0 | 0 0 | 33.26
15 0.0 100.0 0.0 0.0 | 1 1 | 33.16
16 100.0 0.0 0.0 0.0 | 0 0 | 33.23
17 3.9 0.0 0.0 96.1 | 0 3 *| 42.74
18 0.0 100.0 0.0 0.0 | 1 1 | 32.97
19 100.0 0.0 0.0 0.0 | 0 0 | 33.21
20 0.0 0.0 92.1 7.9 | 2 2 | 34.30
21 0.0 0.0 95.9 4.1 | 2 2 | 34.31
22 0.0 0.0 100.0 0.0 | 2 2 | 33.88
23 0.0 0.0 0.0 100.0 | 3 3 | 30.85
24 0.0 100.0 0.0 0.0 | 1 1 | 32.99
25 0.0 0.0 100.0 0.0 | 2 2 | 33.84
26 0.0 100.0 0.0 0.0 | 1 1 | 33.35
27 100.0 0.0 0.0 0.0 | 0 0 | 33.29
28 0.0 0.0 100.0 0.0 | 2 2 | 33.79
29 100.0 0.0 0.0 0.0 | 0 0 | 33.25
30 100.0 0.0 0.0 0.0 | 0 0 | 33.41
31 0.0 0.0 100.0 0.0 | 2 2 | 33.92
32 0.0 0.0 0.0 100.0 | 3 3 | 31.16
AverageUserTime 34.11 seconds
ElapsedTime 77.38
TotalUserTime 1091.86
TotalSysTime 2.14
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [RFC] NUMA schedulers benchmark results 2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht @ 2002-10-06 20:24 ` Erich Focht 2002-10-07 0:00 ` Martin J. Bligh 2002-10-07 0:58 ` Martin J. Bligh ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Erich Focht @ 2002-10-06 20:24 UTC (permalink / raw) To: Martin J. Bligh, Michael Hohnbaum Cc: Anton Blanchard, Ingo Molnar, linux-kernel, LSE [-- Attachment #1: Type: text/plain, Size: 4283 bytes --] Hi, here comes the benchmark I used for the NUMA scheduler test. Would be interesting to know whether it is useful to any other NUMA developer... Regards, Erich PS: it uses a 'time' command that understands the --format option, e.g. GNU time 1.7. Change it in the main script, if it doesn't work for you. On Sunday 06 October 2002 18:51, Erich Focht wrote: > Hi, > > here are some results from testing various versions and approaches > to a NUMA scheduler. I used the numa_test benchmark which I'll post > in a separate email. It runs in parallel N tasks doing the same job: > access randomly a large array. As the array is large enough not to > fit into cache, this is very memory latency sensitive. Also it is > memory bandwidth sensitive. To emulate a real multi-user environment, the > jobs are disturbed by a short load peak. This is simulated by a call > to "hackbench" 3 seconds after the tasks were started. The performance > of the user tasks is depending very much on where they are scheduled > and are CPU hoggers such that the user times are quite independent of > the non-scheduler part of the underlying kernel. The elapsed times > are depending on "hackbench" which actually blocks the machine for the > time it is running. Hackbench is depending on the underlying kernel > and one should compare "elapsed_time - hackbench_time". > > The test machine is a 16 CPU NEC Azusa with Itanium processors and > four nodes. The tested schedulers are: > > A: O(1) scheduler in 2.5.39 > B: O(1) scheduler with task steal limited to only one task (node > affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18 > C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39 > D: pooling NUMA scheduler, no initial load balancing, idle pool_delay > set to 0, under 2.4.18 > E: node affine scheduler with initial load balancing and static homenode > F: node affine scheduler without initial load balancing and dynamic > homenode selection (homenode selected where most of the memory is > allocated). > > As I'm rewriting the node affine scheduler to be more modular, I'll > redo the tests for cases D, E, F on top of 2.5.X kernels soon. > > The results are summarized in the tables below. A set of outputs (for > N=8, 16, 32) is attached. They show clearly why the node affine > scheduler beats them all: The initial load balancing is node-aware, > the task-stealing too. Sometimes the node affine force is not large > enough to bring the task back to the homenode, but it is almost always > good enough. > > The (F) solution with dynamically determined homenode show that the > initial load balancing is crucial, as the equal node balance is not > strongly enforced dynamically. So the optimal solution is probably > (F) with initial load balancing. > > > Average user time (U) and total user time (TU). Minimum per row should > be considered. > ---------------------------------------------------------------------- > Scheduler: A B C D E F > N=4 U 28.12 30.77 33.00 - 27.20 30.29 > TU 112.55 123.13 132.08 - 108.88 121.25 > N=8 U 30.47 31.39 31.65 30.76 28.67 30.08 > TU 243.86 251.27 253.30 246.23 229.51 240.75 > N=16 U 36.42 33.64 32.18 32.27 31.50 32.83 > TU 582.91 538.49 515.11 516.53 504.17 525.59 > N=32 U 38.69 34.83 34.05 33.76 33.89 34.11 > TU 1238.4 1114.9 1090.1 1080.8 1084.9 1091.9 > N=64 U 39.73 34.73 34.23 - (33.32) 34.98 > TU 2543.4 2223.4 2191.7 - (2133) 2239.5 > > > Elapsed time (E), hackbench time (H). Diferences between 2.4.18 and > 2.5.39 based kernels due to "hackbench", which performs differently. > Compare E-H within a row, but don't take it too seriously. > -------------------------------------------------------------------- > Scheduler: A B C D E F > N=4 E 37.33 37.96 48.31 - 28.14 35.91 > H 9.98 1.49 10.65 - 1.99 1.43 > N=8 E 46.17 39.50 42.53 39.72 30.28 38.28 > H 9.64 1.86 7.27 2.07 2.33 1.86 > N=16 E 47.21 44.67 49.66 42.97 36.98 42.51 > H 5.90 4.69 2.93 5.178 5.56 5.94 > N=32 E 88.60 79.92 80.34 78.35 76.84 77.38 > H 6.29 5.23 2.85 4.51 5.29 4.28 > N=64 E 167.10 147.16 150.59 - (133.9) 148.94 > H 5.96 4.67 3.10 - (-) 6.86 > > (The E:N=64 results are without hackbench disturbance.) > > Best regards, > Erich [-- Attachment #2: numa_test --] [-- Type: application/x-shellscript, Size: 8322 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] NUMA schedulers benchmark results 2002-10-06 20:24 ` Erich Focht @ 2002-10-07 0:00 ` Martin J. Bligh 0 siblings, 0 replies; 24+ messages in thread From: Martin J. Bligh @ 2002-10-07 0:00 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum Cc: Anton Blanchard, Ingo Molnar, linux-kernel, LSE Errm, your magic shellscript is too wierd (aka doesn't work). Can you just send the files out as attatchments? M. --On Sunday, October 06, 2002 10:24 PM +0200 Erich Focht <efocht@ess.nec.de> wrote: > Hi, > > here comes the benchmark I used for the NUMA scheduler test. Would > be interesting to know whether it is useful to any other NUMA > developer... > > Regards, > Erich > > PS: it uses a 'time' command that understands the --format option, > e.g. GNU time 1.7. Change it in the main script, if it doesn't > work for you. > > > On Sunday 06 October 2002 18:51, Erich Focht wrote: >> Hi, >> >> here are some results from testing various versions and approaches >> to a NUMA scheduler. I used the numa_test benchmark which I'll post >> in a separate email. It runs in parallel N tasks doing the same job: >> access randomly a large array. As the array is large enough not to >> fit into cache, this is very memory latency sensitive. Also it is >> memory bandwidth sensitive. To emulate a real multi-user environment, the >> jobs are disturbed by a short load peak. This is simulated by a call >> to "hackbench" 3 seconds after the tasks were started. The performance >> of the user tasks is depending very much on where they are scheduled >> and are CPU hoggers such that the user times are quite independent of >> the non-scheduler part of the underlying kernel. The elapsed times >> are depending on "hackbench" which actually blocks the machine for the >> time it is running. Hackbench is depending on the underlying kernel >> and one should compare "elapsed_time - hackbench_time". >> >> The test machine is a 16 CPU NEC Azusa with Itanium processors and >> four nodes. The tested schedulers are: >> >> A: O(1) scheduler in 2.5.39 >> B: O(1) scheduler with task steal limited to only one task (node >> affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18 >> C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39 >> D: pooling NUMA scheduler, no initial load balancing, idle pool_delay >> set to 0, under 2.4.18 >> E: node affine scheduler with initial load balancing and static homenode >> F: node affine scheduler without initial load balancing and dynamic >> homenode selection (homenode selected where most of the memory is >> allocated). >> >> As I'm rewriting the node affine scheduler to be more modular, I'll >> redo the tests for cases D, E, F on top of 2.5.X kernels soon. >> >> The results are summarized in the tables below. A set of outputs (for >> N=8, 16, 32) is attached. They show clearly why the node affine >> scheduler beats them all: The initial load balancing is node-aware, >> the task-stealing too. Sometimes the node affine force is not large >> enough to bring the task back to the homenode, but it is almost always >> good enough. >> >> The (F) solution with dynamically determined homenode show that the >> initial load balancing is crucial, as the equal node balance is not >> strongly enforced dynamically. So the optimal solution is probably >> (F) with initial load balancing. >> >> >> Average user time (U) and total user time (TU). Minimum per row should >> be considered. >> ---------------------------------------------------------------------- >> Scheduler: A B C D E F >> N=4 U 28.12 30.77 33.00 - 27.20 30.29 >> TU 112.55 123.13 132.08 - 108.88 121.25 >> N=8 U 30.47 31.39 31.65 30.76 28.67 30.08 >> TU 243.86 251.27 253.30 246.23 229.51 240.75 >> N=16 U 36.42 33.64 32.18 32.27 31.50 32.83 >> TU 582.91 538.49 515.11 516.53 504.17 525.59 >> N=32 U 38.69 34.83 34.05 33.76 33.89 34.11 >> TU 1238.4 1114.9 1090.1 1080.8 1084.9 1091.9 >> N=64 U 39.73 34.73 34.23 - (33.32) 34.98 >> TU 2543.4 2223.4 2191.7 - (2133) 2239.5 >> >> >> Elapsed time (E), hackbench time (H). Diferences between 2.4.18 and >> 2.5.39 based kernels due to "hackbench", which performs differently. >> Compare E-H within a row, but don't take it too seriously. >> -------------------------------------------------------------------- >> Scheduler: A B C D E F >> N=4 E 37.33 37.96 48.31 - 28.14 35.91 >> H 9.98 1.49 10.65 - 1.99 1.43 >> N=8 E 46.17 39.50 42.53 39.72 30.28 38.28 >> H 9.64 1.86 7.27 2.07 2.33 1.86 >> N=16 E 47.21 44.67 49.66 42.97 36.98 42.51 >> H 5.90 4.69 2.93 5.178 5.56 5.94 >> N=32 E 88.60 79.92 80.34 78.35 76.84 77.38 >> H 6.29 5.23 2.85 4.51 5.29 4.28 >> N=64 E 167.10 147.16 150.59 - (133.9) 148.94 >> H 5.96 4.67 3.10 - (-) 6.86 >> >> (The E:N=64 results are without hackbench disturbance.) >> >> Best regards, >> Erich ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] NUMA schedulers benchmark results 2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht 2002-10-06 20:24 ` Erich Focht @ 2002-10-07 0:58 ` Martin J. Bligh 2002-10-07 16:52 ` Erich Focht 2002-10-07 7:25 ` Martin J. Bligh 2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum 3 siblings, 1 reply; 24+ messages in thread From: Martin J. Bligh @ 2002-10-07 0:58 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel > As I'm rewriting the node affine scheduler to be more modular, I'll > redo the tests for cases D, E, F on top of 2.5.X kernels soon. It's really hard to see anything comparing 2.4 vs 2.5 results, so when you have results for everything across the same kernel, it'll be a lot easier to decipher ... For the record, I really don't care whose code gets accepted, I'm just interested in having a scheduler that works well for me ;-) However, I think it would be a good idea to run some other benchmarks on these, that are a little more general ... dbench, kernel compile, sdet, whatever, as *well* as your own benchmark ... I think we're working on that. > Average user time (U) and total user time (TU). Minimum per row should > be considered. > ---------------------------------------------------------------------- > Scheduler: A B C D E F > N=4 U 28.12 30.77 33.00 - 27.20 30.29 > N=8 U 30.47 31.39 31.65 30.76 28.67 30.08 > N=16 U 36.42 33.64 32.18 32.27 31.50 32.83 > N=32 U 38.69 34.83 34.05 33.76 33.89 34.11 > N=64 U 39.73 34.73 34.23 - (33.32) 34.98 So lower numbers are better, right? So Michael's stuff seems to work fine at the higher end, but poorly at the lower end - I think this may be due to a bug he found on Friday, if that gets fixed, it might make a difference. > The results are summarized in the tables below. A set of outputs > (for N=8, 16, 32) is attached. They show clearly why the node affine > scheduler beats them all It would be a lot clearer if you baselines weren't different. Now maybe I'm just totally misreading your results (tell us if so) but I presume it's more valid to compare: 1. difference between A and C (2.5 virgin vs 2.5 + Michael's stuff) vs 2. difference between E and B (2.4 virgin vs 2.4 + your stuff) rather than across 2.4 vs 2.5, which seems meaningless. N=32 (E) .... C vs A 80.34 vs 88.60 (about 10% improvement) E vs B 76.84 vs 79.92 (about 5% improvement) N=32 (H) .... C vs A 2.85 vs 6.29 (over 50% improvement) E vs B 5.29 vs 5.23 (about 1% impromement) Which actually makes C better than E on both measures, surely? Obviously Michael's stuff is worse at the low end, but to say "They show clearly why the node affine scheduler beats them all" seems a little optimistic ;-) I'm not trying to say you're wrong, or one scheduler is better than the other ... I'm just saying I can't really draw any clear-cut conclusion from your results other than that Michael's stuff looks borked for low numbers of tasks ... I'll run your tests on the NUMA-Q comparing 2.5.40 vs Michael's stuff. If you send me a cleaned up version (ie without the ia64 stuff in generic code) of your stuff against 2.5.40 (or ideally 2.5.40-mm2), I'll run that too ... (if you're putting the latencies into arch code, you can set i386 (well NUMA-Q) for 20:1 if you think that'll help). M. PS. the wierd "this one was run without hackbench" thing makes the results even harder to read ... PPS. Does Kimio's discontigmem patch work for you on 2.5? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] NUMA schedulers benchmark results 2002-10-07 0:58 ` Martin J. Bligh @ 2002-10-07 16:52 ` Erich Focht 0 siblings, 0 replies; 24+ messages in thread From: Erich Focht @ 2002-10-07 16:52 UTC (permalink / raw) To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel On Monday 07 October 2002 02:58, Martin J. Bligh wrote: > > As I'm rewriting the node affine scheduler to be more modular, I'll > > redo the tests for cases D, E, F on top of 2.5.X kernels soon. > > It's really hard to see anything comparing 2.4 vs 2.5 results, > so when you have results for everything across the same kernel, > it'll be a lot easier to decipher ... The user times are really comparable. And total user time, too. I'm taking the node affine patches to 2.5.39 and will rerun them ASAP. > For the record, I really don't care whose code gets accepted, I'm > just interested in having a scheduler that works well for me ;-) I was running several of these to see which features help and to try to understand where the advantage comes from. In the output tables you can see where the job has spentits time and on which node it started. You can also recognize from the user time how well a job ran. Alone on it's node: 27s, away from it's memory: 36s (or so), sharing the node with others: 27-32s, running on wrong node with others: up to 43s. And seeing these you can understand how well a feature works or how much you miss another feature. It wasn't my plan to advertise the node affine scheduler, it just has most of the features. > However, I think it would be a good idea to run some other benchmarks > on these, that are a little more general ... dbench, kernel compile, > sdet, whatever, as *well* as your own benchmark ... I think we're > working on that. Yes, great! > So lower numbers are better, right? So Michael's stuff seems to > work fine at the higher end, but poorly at the lower end - I think > this may be due to a bug he found on Friday, if that gets fixed, it > might make a difference. OK, but how about the node-wise selection and balancing? Then we get the approaches closer... > I'll run your tests on the NUMA-Q comparing 2.5.40 vs Michael's stuff. > If you send me a cleaned up version (ie without the ia64 stuff in > generic code) of your stuff against 2.5.40 (or ideally 2.5.40-mm2), > I'll run that too ... (if you're putting the latencies into arch code, > you can set i386 (well NUMA-Q) for 20:1 if you think that'll help). Is in preparation. First step: pooling scheduler without multi-level support but node-wise initial balancing. > PS. the wierd "this one was run without hackbench" thing makes the > results even harder to read ... Sorry about that. Didn't have a newer measurement. > PPS. Does Kimio's discontigmem patch work for you on 2.5? Yes. We're trying to get that stuff in but David's away from his email for a while as you know. Regards, Erich ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] NUMA schedulers benchmark results 2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht 2002-10-06 20:24 ` Erich Focht 2002-10-07 0:58 ` Martin J. Bligh @ 2002-10-07 7:25 ` Martin J. Bligh 2002-10-07 7:40 ` Ingo Molnar 2002-10-07 20:09 ` [PATCH] pooling NUMA scheduler with initial load balancing Erich Focht 2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum 3 siblings, 2 replies; 24+ messages in thread From: Martin J. Bligh @ 2002-10-07 7:25 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel Just a data dump for you, I'm sleepy ... will look at it in the morning. 16-way NUMA-Q running your benchmark: 2.5.40-mm2 time.4:AverageUserTime 40.69 seconds time.4:ElapsedTime 52.14 time.4:TotalUserTime 162.85 time.4:TotalSysTime 0.71 time.8:AverageUserTime 36.90 seconds time.8:ElapsedTime 62.12 time.8:TotalUserTime 295.29 time.8:TotalSysTime 1.79 time.16:AverageUserTime 63.91 seconds time.16:ElapsedTime 88.82 time.16:TotalUserTime 1022.71 time.16:TotalSysTime 4.85 time.32:AverageUserTime 102.11 seconds time.32:ElapsedTime 224.24 time.32:TotalUserTime 3267.84 time.32:TotalSysTime 11.73 time.64:AverageUserTime 148.33 seconds time.64:ElapsedTime 611.71 time.64:TotalUserTime 9494.10 time.64:TotalSysTime 25.81 2.5.40-mm2 + michael's stuff (latest on tux1) time.4:AverageUserTime 31.55 seconds time.4:ElapsedTime 41.58 time.4:TotalUserTime 126.23 time.4:TotalSysTime 1.11 time.8:AverageUserTime 51.80 seconds time.8:ElapsedTime 80.38 time.8:TotalUserTime 414.50 time.8:TotalSysTime 2.67 time.16:AverageUserTime 50.75 seconds time.16:ElapsedTime 72.28 time.16:TotalUserTime 812.08 time.16:TotalSysTime 4.60 time.32:AverageUserTime 55.78 seconds time.32:ElapsedTime 126.45 time.32:TotalUserTime 1785.21 time.32:TotalSysTime 8.79 time.64:AverageUserTime 57.49 seconds time.64:ElapsedTime 243.27 time.64:TotalUserTime 3679.94 time.64:TotalSysTime 15.09 Looks pretty sweet to me, apart from 8, which is just wierd. Profile looks horrible (what the hell is fib_node_get_info? net/ipv4/fib_semantics.c): without: 241652 total 0.2455 120117 default_idle 19728 fib_node_get_info 17400 swapin_readahead 12805 kmalloc 11047 kfree 7918 sock_alloc_send_pskb 5956 alloc_skb 4690 __wake_up 4194 kmem_cache_free with: 213131 total 0.2165 138082 default_idle 14620 fib_node_get_info 9631 swapin_readahead 5316 kmalloc 5279 kfree 5251 sock_alloc_send_pskb 4154 __wake_up 4116 schedule 3002 alloc_skb ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] NUMA schedulers benchmark results 2002-10-07 7:25 ` Martin J. Bligh @ 2002-10-07 7:40 ` Ingo Molnar 2002-10-07 20:09 ` [PATCH] pooling NUMA scheduler with initial load balancing Erich Focht 1 sibling, 0 replies; 24+ messages in thread From: Ingo Molnar @ 2002-10-07 7:40 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Erich Focht, Michael Hohnbaum, linux-kernel On Mon, 7 Oct 2002, Martin J. Bligh wrote: > Profile looks horrible (what the hell is fib_node_get_info? > net/ipv4/fib_semantics.c): most likely the wrong System.map? Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-07 7:25 ` Martin J. Bligh 2002-10-07 7:40 ` Ingo Molnar @ 2002-10-07 20:09 ` Erich Focht [not found] ` <1420721189.1034032091@[10.10.2.3]> 1 sibling, 1 reply; 24+ messages in thread From: Erich Focht @ 2002-10-07 20:09 UTC (permalink / raw) To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1604 bytes --] Hi, here come two patches which together implement the pooling NUMA scheduler with initial load balancing stripped of the node affine scheduler. First patch does: - implement CPU pools structure - changes find_busiest_queue to first scan the own node, then (if no imbalance found) find the most loaded node and steal a task from it if the imbalance exceeds 25%. - The steal is delayed 100ms if we steal from a remote node and the own node is averagely (within 25%) loaded. - The steal is delayed 1ms if the own node is unloaded (this gives CPUs from the own node an advantage). I dropped the multilevel structure as it is only complicating things at this stage and can be introduced later. The patch is prepared for it. The second patch implements initial balancing by selecting the most unloaded node and its most unloaded CPU for an execed task. It does a round-robin selection of the next node if the loads are equal. The node affine patch will be on top of this. The pool setup is pretty messy, it needs some sort of RCU to become nicer. Also undefining CONFIG_NUMA_SCHED could lead to errors, I didn't clean this up. This is mainly for testing and as a base experimenting with the node affine extensions. Michael, you could just take the second patch for initial load balancing of your approach. Compared to your patch this one tries to achieve equal node load also in the find_busiest_queue step. With more effort and code, though... I hope it works for you, I removed the arch dependencies and just need __cpu_to_node(). Regards, Erich [-- Attachment #2: 01-numa_sched_core5-noML-2.5.39.patch --] [-- Type: text/x-diff, Size: 21171 bytes --] diff -urNp a/arch/i386/config.in b/arch/i386/config.in --- a/arch/i386/config.in Fri Sep 27 23:49:42 2002 +++ b/arch/i386/config.in Mon Oct 7 19:46:52 2002 @@ -177,6 +177,7 @@ else fi # Common NUMA Features if [ "$CONFIG_X86_NUMAQ" = "y" ]; then + bool 'Enable NUMA scheduler' CONFIG_NUMA_SCHED bool 'Numa Memory Allocation Support' CONFIG_NUMA if [ "$CONFIG_NUMA" = "y" ]; then define_bool CONFIG_DISCONTIGMEM y diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002 +++ b/arch/i386/kernel/smpboot.c Mon Oct 7 19:46:52 2002 @@ -765,6 +765,8 @@ static int __init wakeup_secondary_via_I extern unsigned long cpu_initialized; +static int __initdata nr_lnodes = 0; + static void __init do_boot_cpu (int apicid) /* * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad @@ -1194,6 +1196,11 @@ int __devinit __cpu_up(unsigned int cpu) void __init smp_cpus_done(unsigned int max_cpus) { zap_low_mappings(); +#ifdef CONFIG_NUMA_SCHED + pooldata_lock(); + bld_pools(); + pooldata_unlock(); +#endif } void __init smp_intr_init() diff -urNp a/arch/ia64/config.in b/arch/ia64/config.in --- a/arch/ia64/config.in Sat Oct 5 17:23:24 2002 +++ b/arch/ia64/config.in Mon Oct 7 19:47:37 2002 @@ -69,6 +69,7 @@ then bool ' Enable NUMA support' CONFIG_NUMA if [ "$CONFIG_NUMA" = "y" ]; then define_bool CONFIG_DISCONTIGMEM y + bool ' Enable NUMA scheduler' CONFIG_NUMA_SCHED fi bool ' Enable IA-64 Machine Check Abort' CONFIG_IA64_MCA define_bool CONFIG_PM y diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c --- a/arch/ia64/kernel/smpboot.c Mon Oct 7 17:42:02 2002 +++ b/arch/ia64/kernel/smpboot.c Mon Oct 7 19:46:52 2002 @@ -397,7 +397,7 @@ unsigned long cache_decay_ticks; /* # of static void smp_tune_scheduling (void) { - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */ printk("task migration cache decay timeout: %ld msecs.\n", (cache_decay_ticks + 1) * 1000 / HZ); @@ -508,6 +508,11 @@ smp_cpus_done (unsigned int dummy) printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n", num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100); +#ifdef CONFIG_NUMA_SCHED + pooldata_lock(); + bld_pools(); + pooldata_unlock(); +#endif } int __devinit diff -urNp a/include/asm-i386/atomic.h b/include/asm-i386/atomic.h --- a/include/asm-i386/atomic.h Fri Sep 27 23:49:16 2002 +++ b/include/asm-i386/atomic.h Mon Oct 7 19:46:52 2002 @@ -111,6 +111,18 @@ static __inline__ void atomic_inc(atomic } /** + * atomic_inc_return - increment atomic variable and return new value + * @v: pointer of type atomic_t + * + * Atomically increments @v by 1 and return it's new value. Note that + * the guaranteed useful range of an atomic_t is only 24 bits. + */ +static inline int atomic_inc_return(atomic_t *v){ + atomic_inc(v); + return v->counter; +} + +/** * atomic_dec - decrement atomic variable * @v: pointer of type atomic_t * diff -urNp a/include/asm-ia64/topology.h b/include/asm-ia64/topology.h --- a/include/asm-ia64/topology.h Mon Oct 7 17:42:02 2002 +++ b/include/asm-ia64/topology.h Mon Oct 7 19:46:52 2002 @@ -37,7 +37,7 @@ * don't use it too early. * Who needs this? */ -/* #define __node_to_first_cpu(node) pool_cpus[pool_ptr[node]] */ +#define __node_to_first_cpu(node) pool_cpus[pool_ptr[node]] /* * Returns a bitmask of CPUs on Node 'node'. diff -urNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h Sat Oct 5 17:22:45 2002 +++ b/include/linux/sched.h Mon Oct 7 19:46:52 2002 @@ -22,6 +22,8 @@ extern unsigned long event; #include <asm/mmu.h> #include <linux/smp.h> +#include <asm/topology.h> +#include <asm/numa.h> #include <linux/sem.h> #include <linux/signal.h> #include <linux/securebits.h> @@ -167,7 +169,6 @@ extern void update_one_process(struct ta extern void scheduler_tick(int user_tick, int system); extern unsigned long cache_decay_ticks; - #define MAX_SCHEDULE_TIMEOUT LONG_MAX extern signed long FASTCALL(schedule_timeout(signed long timeout)); asmlinkage void schedule(void); @@ -457,6 +458,15 @@ extern void set_cpus_allowed(task_t *p, # define set_cpus_allowed(p, new_mask) do { } while (0) #endif +#ifdef CONFIG_NUMA_SCHED +extern int node_levels[NR_NODES]; +extern int nr_node_levels; +/* extern void find_node_levels(int numpools); */ +extern void pooldata_lock(void); +extern void pooldata_unlock(void); +#endif +extern void sched_migrate_task(task_t *p, int cpu); + extern void set_user_nice(task_t *p, long nice); extern int task_prio(task_t *p); extern int task_nice(task_t *p); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Fri Sep 27 23:50:27 2002 +++ b/kernel/sched.c Mon Oct 7 19:46:52 2002 @@ -154,6 +154,10 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; + unsigned long wait_time; + int wait_node; + int load[2][NR_CPUS]; + } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; @@ -174,6 +178,95 @@ static struct runqueue runqueues[NR_CPUS #endif /* + * Variables for describing and accessing processor pools. Using a + * compressed row format like notation. Processor pools are treated + * like logical node numbers. + * + * numpools: number of CPU pools (nodes), + * pool_cpus[]: CPUs in pools sorted by their pool ID, + * pool_ptr[pool]: index of first element in pool_cpus[] belonging to pool. + * pool_mask[]: cpu mask of a pool. + * + * Example: loop over all CPUs in a pool p: + * loop_over_node(i,node) { + * cpu = pool_cpu(i); + * ... + * } + * <efocht@ess.nec.de> + */ +int numpools, pool_ptr[NR_NODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[NR_NODES]; +unsigned long pool_mask[NR_NODES]; +static atomic_t pool_lock = ATOMIC_INIT(0); +#define cpu_to_node(cpu) __cpu_to_node(cpu) + +/* Avoid zeroes in integer divides for load calculations */ +#define BALANCE_FACTOR 100 + +#ifdef CONFIG_NUMA_SCHED +#define POOL_DELAY (100) +#define loop_over_node(i,n) for(i=pool_ptr[n]; i<pool_ptr[n+1]; i++) +#define pool_cpu(i) pool_cpus[i] + +void pooldata_lock(void) +{ + int i; + retry: + while (atomic_read(&pool_lock)); + if (atomic_inc_return(&pool_lock) > 1) { + atomic_dec(&pool_lock); + goto retry; + } + /* + * Wait a while, any loops using pool data should finish + * in between. This is VERY ugly and should be replaced + * by some real RCU stuff. [EF] + */ + for (i=0; i<100; i++) + udelay(1000); +} + +void pooldata_unlock(void) +{ + atomic_dec(&pool_lock); +} + +/* + * Call pooldata_lock() before calling this function and + * pooldata_unlock() after! + */ +void bld_pools(void) +{ + int n, cpu, ptr, i; + unsigned long mask; + + ptr=0; + for (n=0; n<numnodes; n++) { + mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map; + pool_ptr[n] = ptr; + for (cpu=0; cpu<NR_CPUS; cpu++) + if (mask & (1UL << cpu)) + pool_cpu(ptr++) = cpu; + pool_nr_cpus[n] = ptr - pool_ptr[n]; + } + numpools=numnodes; + pool_ptr[numpools]=ptr; + printk("CPU pools : %d\n",numpools); + for (n=0;n<numpools;n++) { + printk("pool %d :",n); + loop_over_node(i,n) + printk("%d ",pool_cpu(i)); + printk("\n"); + } +} + +#else +#define POOL_DELAY (0) +#define loop_over_node(i,n) for(i=0; i<NR_CPUS; i++) +#define pool_cpu(i) (i) +#endif + + +/* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without * explicitly disabling preemption. @@ -632,121 +728,145 @@ static inline unsigned int double_lock_b } /* - * find_busiest_queue - find the busiest runqueue. + * Calculate load of a CPU pool, store results in data[][NR_CPUS]. + * Return the index of the most loaded runqueue. */ -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) +static int calc_pool_load(int data[][NR_CPUS], int this_cpu, int pool, int idle) { - int nr_running, load, max_load, i; - runqueue_t *busiest, *rq_src; - - /* - * We search all runqueues to find the most busy one. - * We do this lockless to reduce cache-bouncing overhead, - * we re-check the 'best' source CPU later on again, with - * the lock held. - * - * We fend off statistical fluctuations in runqueue lengths by - * saving the runqueue length during the previous load-balancing - * operation and using the smaller one the current and saved lengths. - * If a runqueue is long enough for a longer amount of time then - * we recognize it and pull tasks from it. - * - * The 'current runqueue length' is a statistical maximum variable, - * for that one we take the longer one - to avoid fluctuations in - * the other direction. So for a load-balance to happen it needs - * stable long runqueue on the target CPU and stable short runqueue - * on the local runqueue. - * - * We make an exception if this CPU is about to become idle - in - * that case we are less picky about moving a task across CPUs and - * take what can be taken. - */ - if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) - nr_running = this_rq->nr_running; - else - nr_running = this_rq->prev_nr_running[this_cpu]; + runqueue_t *rq_src, *this_rq = cpu_rq(this_cpu); + int this_pool = cpu_to_node(this_cpu); + int i, ii, idx=-1, refload, load; - busiest = NULL; - max_load = 1; - for (i = 0; i < NR_CPUS; i++) { - if (!cpu_online(i)) - continue; + data[1][pool] = 0; + refload = -1; + loop_over_node(ii, pool) { + i = pool_cpu(ii); rq_src = cpu_rq(i); if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) load = rq_src->nr_running; else load = this_rq->prev_nr_running[i]; this_rq->prev_nr_running[i] = rq_src->nr_running; - - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; + data[0][i] = load; + data[1][pool] += load; + if (load > refload) { + idx = i; + refload = load; } } + data[1][pool] = data[1][pool] * BALANCE_FACTOR / pool_nr_cpus[pool]; + return idx; +} - if (likely(!busiest)) - goto out; +/* + * Find a runqueue from which to steal a task. We try to do this as locally as + * possible because we don't want to let tasks get far from their home node. + * This is done in two steps: + * 1. First try to find a runqueue within the own CPU pool (AKA node) with + * imbalance larger than 25% (relative to the current runqueue). + * 2. If the local node is well balanced, locate the most loaded node and its + * most loaded CPU. Remote runqueues running tasks having their homenode on the + * current node are preferred (those tasks count twice in the load calculation). + * If the current load is far below the average try to steal a task from the + * most loaded node/cpu. Otherwise wait 100ms and give less loaded nodes the + * chance to approach the average load. + * + * This concept can be extended easilly to more than two levels (multi-level + * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode... + * <efocht@ess.nec.de> + */ +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, + int idle, int *nr_running) +{ + runqueue_t *busiest = NULL; + int imax, best_cpu, pool, max_pool_load, max_pool_idx; + int i, del_shift; + int avg_load=-1, this_pool = cpu_to_node(this_cpu); - *imbalance = (max_load - nr_running) / 2; + /* Need at least ~25% imbalance to trigger balancing. */ +#define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4)) - /* It needs an at least ~25% imbalance to trigger balancing. */ - if (!idle && (*imbalance < (max_load + 3)/4)) { - busiest = NULL; + if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) + *nr_running = this_rq->nr_running; + else + *nr_running = this_rq->prev_nr_running[this_cpu]; + + best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle); + + if (best_cpu != this_cpu) + goto check_out; + + scan_all: + best_cpu = -1; + max_pool_load = this_rq->load[1][this_pool]; + max_pool_idx = this_pool; + avg_load = max_pool_load * pool_nr_cpus[this_pool]; + for (i = 1; i < numpools; i++) { + pool = (i + this_pool) % numpools; + imax = calc_pool_load(this_rq->load, this_cpu, pool, idle); + avg_load += this_rq->load[1][pool]*pool_nr_cpus[pool]; + if (this_rq->load[1][pool] > max_pool_load) { + max_pool_load = this_rq->load[1][pool]; + max_pool_idx = pool; + best_cpu = imax; + } + } + /* Exit if not enough imbalance on any remote node. */ + if ((best_cpu < 0) || + BALANCED(max_pool_load,this_rq->load[1][this_pool])) { + this_rq->wait_node = -1; goto out; } + avg_load /= num_online_cpus(); + /* Wait longer before stealing if load is average. */ + if (BALANCED(avg_load,this_rq->load[1][this_pool])) + del_shift = 0; + else + del_shift = 6; - nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); - /* - * Make sure nothing changed since we checked the - * runqueue length. - */ - if (busiest->nr_running <= nr_running + 1) { - spin_unlock(&busiest->lock); - busiest = NULL; - } -out: - return busiest; -} + if (this_rq->wait_node != max_pool_idx) { + this_rq->wait_node = max_pool_idx; + this_rq->wait_time = jiffies; + goto out; + } else + if (jiffies - this_rq->wait_time < (POOL_DELAY >> del_shift)) + goto out; -/* - * pull_task - move a task from a remote runqueue to the local runqueue. - * Both runqueues must be locked. - */ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) -{ - dequeue_task(p, src_array); - src_rq->nr_running--; - set_task_cpu(p, this_cpu); - this_rq->nr_running++; - enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); + check_out: + /* Enough imbalance in the remote cpu loads? */ + if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) { + busiest = cpu_rq(best_cpu); + this_rq->wait_node = -1; + } else if (avg_load == -1) + /* only scanned local pool, so let's look at all of them */ + goto scan_all; + out: + return busiest; } /* - * Current runqueue is empty, or rebalance tick: if there is an - * inbalance (current runqueue is too short) then pull from - * busiest runqueue(s). - * - * We call this with the current runqueue locked, - * irqs disabled. + * Find a task to steal from the busiest RQ. The busiest->lock must be held + * while calling this routine. */ -static void load_balance(runqueue_t *this_rq, int idle) +static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu) { - int imbalance, idx, this_cpu = smp_processor_id(); - runqueue_t *busiest; + int idx; + task_t *next = NULL, *tmp; prio_array_t *array; struct list_head *head, *curr; - task_t *tmp; + int this_pool=cpu_to_node(this_cpu), weight, maxweight=0; - busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance); - if (!busiest) - goto out; + /* + * We do not migrate tasks that are: + * 1) running (obviously), or + * 2) cannot be migrated to this CPU due to cpus_allowed. + */ + +#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ + ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ + p != rq->curr && \ + ((p)->cpus_allowed & (1UL<<(this_cpu)))) /* * We first consider expired tasks. Those will likely not be @@ -772,7 +892,7 @@ skip_bitmap: array = busiest->active; goto new_array; } - goto out_unlock; + goto out; } head = array->queue + idx; @@ -780,33 +900,74 @@ skip_bitmap: skip_queue: tmp = list_entry(curr, task_t, run_list); + if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { + weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks; + if (weight > maxweight) { + maxweight = weight; + next = tmp; + } + } + curr = curr->next; + if (curr != head) + goto skip_queue; + idx++; + goto skip_bitmap; + + out: + return next; +} + +/* + * pull_task - move a task from a remote runqueue to the local runqueue. + * Both runqueues must be locked. + */ +static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +{ + dequeue_task(p, src_array); + src_rq->nr_running--; + set_task_cpu(p, this_cpu); + this_rq->nr_running++; + enqueue_task(p, this_rq->active); /* - * We do not migrate tasks that are: - * 1) running (obviously), or - * 2) cannot be migrated to this CPU due to cpus_allowed, or - * 3) are cache-hot on their current CPU. + * Note that idle threads have a prio of MAX_PRIO, for this test + * to be always true for them. */ + if (p->prio < this_rq->curr->prio) + set_need_resched(); +} -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) - - curr = curr->prev; - - if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } +/* + * Current runqueue is empty, or rebalance tick: if there is an + * inbalance (current runqueue is too short) then pull from + * busiest runqueue(s). + * + * We call this with the current runqueue locked, + * irqs disabled. + */ +static void load_balance(runqueue_t *this_rq, int idle) +{ + int nr_running, this_cpu = task_cpu(this_rq->curr); + task_t *tmp; + runqueue_t *busiest; + + /* avoid deadlock by timer interrupt on own cpu */ + if (atomic_read(&pool_lock)) return; + busiest = find_busiest_queue(this_rq, this_cpu, idle, &nr_running); + if (!busiest) + goto out; + + nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); + /* + * Make sure nothing changed since we checked the + * runqueue length. + */ + if (busiest->nr_running <= nr_running + 1) + goto out_unlock; + + tmp = task_to_steal(busiest, this_cpu); + if (!tmp) + goto out_unlock; + pull_task(busiest, tmp->array, tmp, this_rq, this_cpu); out_unlock: spin_unlock(&busiest->lock); out: @@ -819,10 +980,10 @@ out: * frequency and balancing agressivity depends on whether the CPU is * idle or not. * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on * systems with HZ=100, every 10 msecs.) */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1) #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) static inline void idle_tick(runqueue_t *rq) @@ -1920,6 +2081,8 @@ typedef struct { struct list_head list; task_t *task; struct completion done; + int cpu_dest; + int sync; } migration_req_t; /* @@ -1965,6 +2128,7 @@ void set_cpus_allowed(task_t *p, unsigne } init_completion(&req.done); req.task = p; + req.sync = 1; list_add(&req.list, &rq->migration_queue); task_rq_unlock(rq, &flags); wake_up_process(rq->migration_thread); @@ -1974,6 +2138,17 @@ out: preempt_enable(); } +void sched_migrate_task(task_t *p, int dest_cpu) +{ + unsigned long old_mask; + + old_mask = p->cpus_allowed; + if (!(old_mask & (1UL << dest_cpu))) + return; + set_cpus_allowed(p, 1UL << dest_cpu); + set_cpus_allowed(p, old_mask); +} + /* * migration_thread - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. @@ -2009,7 +2184,7 @@ static int migration_thread(void * data) for (;;) { runqueue_t *rq_src, *rq_dest; struct list_head *head; - int cpu_src, cpu_dest; + int cpu_src, cpu_dest, sync; migration_req_t *req; unsigned long flags; task_t *p; @@ -2024,10 +2199,17 @@ static int migration_thread(void * data) } req = list_entry(head->next, migration_req_t, list); list_del_init(head->next); - spin_unlock_irqrestore(&rq->lock, flags); p = req->task; - cpu_dest = __ffs(p->cpus_allowed); + sync = req->sync; + if (sync) + cpu_dest = __ffs(p->cpus_allowed & cpu_online_map); + else { + cpu_dest = req->cpu_dest; + req->task = NULL; + } + spin_unlock_irqrestore(&rq->lock, flags); + rq_dest = cpu_rq(cpu_dest); repeat: cpu_src = task_cpu(p); @@ -2050,7 +2232,8 @@ repeat: double_rq_unlock(rq_src, rq_dest); local_irq_restore(flags); - complete(&req->done); + if (sync) + complete(&req->done); } } @@ -2130,6 +2313,15 @@ void __init sched_init(void) __set_bit(MAX_PRIO, array->bitmap); } } +#ifdef CONFIG_NUMA_SCHED + pool_ptr[0] = 0; + pool_ptr[1] = NR_CPUS; + + numpools = 1; + pool_mask[0] = -1L; + pool_nr_cpus[0] = NR_CPUS; +#endif + /* * We have to do a little magic to get the first * thread right in SMP mode. [-- Attachment #3: 02-numa_sched_ilb-2.5.39.patch --] [-- Type: text/x-diff, Size: 2336 bytes --] diff -urNp a/fs/exec.c b/fs/exec.c --- a/fs/exec.c Sat Oct 5 17:22:45 2002 +++ b/fs/exec.c Mon Oct 7 19:49:08 2002 @@ -993,6 +993,7 @@ int do_execve(char * filename, char ** a int retval; int i; + sched_balance_exec(); file = open_exec(filename); retval = PTR_ERR(file); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Mon Oct 7 19:46:52 2002 +++ b/kernel/sched.c Mon Oct 7 20:01:56 2002 @@ -2138,6 +2135,78 @@ out: preempt_enable(); } +#ifdef CONFIG_NUMA_SCHED +/* used as counter for round-robin node-scheduling */ +static atomic_t sched_node=ATOMIC_INIT(0); + +/* + * Find the least loaded CPU on the current node of the task. + */ +static int sched_best_cpu(struct task_struct *p, int node) +{ + int n, cpu, load, best_cpu = task_cpu(p); + + load = 1000000; + loop_over_node(n,node) { + cpu = pool_cpu(n); + if (!(p->cpus_allowed & (1UL << cpu) & cpu_online_map)) + continue; + if (cpu_rq(cpu)->nr_running < load) { + best_cpu = cpu; + load = cpu_rq(cpu)->nr_running; + } + } + return best_cpu; +} + +/* + * Find the node with fewest number of tasks running on it. + */ +static int sched_best_node(struct task_struct *p) +{ + int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id(); + int pool, load; + unsigned long mask = p->cpus_allowed & cpu_online_map; + + do { + best_node = atomic_inc_return(&sched_node) % numpools; + } while (!(pool_mask[best_node] & mask)); + + min_load = 100000000; + for (n = 0; n < numpools; n++) { + pool = (best_node + n) % numpools; + load = 0; + loop_over_node(i, pool) { + cpu=pool_cpu(i); + if (!cpu_online(cpu)) continue; + load += cpu_rq(cpu)->nr_running; + } + if (pool == numa_node_id()) load--; + pool_load = 100*load/pool_nr_cpus[pool]; + if ((pool_load < min_load) && (pool_mask[pool] & mask)) { + min_load = pool_load; + min_pool = pool; + } + } + atomic_set(&sched_node, min_pool); + return min_pool; +} + +void sched_balance_exec(void) +{ + int new_cpu, new_node=0; + + while (atomic_read(&pool_lock)) + cpu_relax(); + if (numpools > 1) { + new_node = sched_best_node(current, 0); + } + new_cpu = sched_best_cpu(current, new_node); + if (new_cpu != smp_processor_id()) + sched_migrate_task(current, new_cpu); +} +#endif + void sched_migrate_task(task_t *p, int dest_cpu) { unsigned long old_mask; ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <1420721189.1034032091@[10.10.2.3]>]
* Re: [PATCH] pooling NUMA scheduler with initial load balancing [not found] ` <1420721189.1034032091@[10.10.2.3]> @ 2002-10-08 17:33 ` Erich Focht 2002-10-08 19:44 ` Martin J. Bligh 2002-10-09 1:15 ` Christoph Hellwig 0 siblings, 2 replies; 24+ messages in thread From: Erich Focht @ 2002-10-08 17:33 UTC (permalink / raw) To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 673 bytes --] On Tuesday 08 October 2002 08:08, Martin J. Bligh wrote: > I did: > > 1. remove asm/numa.h declaration (you need to work out something > generic here ... mmzone.h should include it, I think?) > 2. changed NR_NODES to MAX_NUMNODES everywhere. > 3. Put in an extern declartion into fs/exec.c for sched_balance_exec > > Then I got bored around the linking problem. Does this compile for > you somehow? Aaargh, you got the wrong second patch :-( Sorry for that... Thanks for the hints, I cleaned up the first patch, too. No CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including asm/numa.h from asm/topology.h, so no need for you to see it. Erich [-- Attachment #2: 01-numa_sched_core6-2.5.39.patch --] [-- Type: text/x-diff, Size: 19311 bytes --] diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002 +++ b/arch/i386/kernel/smpboot.c Tue Oct 8 11:37:56 2002 @@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu) void __init smp_cpus_done(unsigned int max_cpus) { zap_low_mappings(); +#ifdef CONFIG_NUMA + pooldata_lock(); + bld_pools(); + pooldata_unlock(); +#endif } void __init smp_intr_init() diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c --- a/arch/ia64/kernel/smpboot.c Tue Oct 8 10:57:10 2002 +++ b/arch/ia64/kernel/smpboot.c Tue Oct 8 11:38:17 2002 @@ -397,7 +397,7 @@ unsigned long cache_decay_ticks; /* # of static void smp_tune_scheduling (void) { - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */ printk("task migration cache decay timeout: %ld msecs.\n", (cache_decay_ticks + 1) * 1000 / HZ); @@ -508,6 +508,11 @@ smp_cpus_done (unsigned int dummy) printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n", num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100); +#ifdef CONFIG_NUMA + pooldata_lock(); + bld_pools(); + pooldata_unlock(); +#endif } int __devinit diff -urNp a/include/asm-i386/atomic.h b/include/asm-i386/atomic.h --- a/include/asm-i386/atomic.h Fri Sep 27 23:49:16 2002 +++ b/include/asm-i386/atomic.h Tue Oct 8 10:59:13 2002 @@ -111,6 +111,18 @@ static __inline__ void atomic_inc(atomic } /** + * atomic_inc_return - increment atomic variable and return new value + * @v: pointer of type atomic_t + * + * Atomically increments @v by 1 and return it's new value. Note that + * the guaranteed useful range of an atomic_t is only 24 bits. + */ +static inline int atomic_inc_return(atomic_t *v){ + atomic_inc(v); + return v->counter; +} + +/** * atomic_dec - decrement atomic variable * @v: pointer of type atomic_t * diff -urNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h Tue Oct 8 10:56:28 2002 +++ b/include/linux/sched.h Tue Oct 8 11:30:21 2002 @@ -22,6 +22,7 @@ extern unsigned long event; #include <asm/mmu.h> #include <linux/smp.h> +#include <asm/topology.h> #include <linux/sem.h> #include <linux/signal.h> #include <linux/securebits.h> @@ -167,7 +168,6 @@ extern void update_one_process(struct ta extern void scheduler_tick(int user_tick, int system); extern unsigned long cache_decay_ticks; - #define MAX_SCHEDULE_TIMEOUT LONG_MAX extern signed long FASTCALL(schedule_timeout(signed long timeout)); asmlinkage void schedule(void); @@ -457,6 +457,12 @@ extern void set_cpus_allowed(task_t *p, # define set_cpus_allowed(p, new_mask) do { } while (0) #endif +#ifdef CONFIG_NUMA +extern void pooldata_lock(void); +extern void pooldata_unlock(void); +#endif +extern void sched_migrate_task(task_t *p, int cpu); + extern void set_user_nice(task_t *p, long nice); extern int task_prio(task_t *p); extern int task_nice(task_t *p); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Fri Sep 27 23:50:27 2002 +++ b/kernel/sched.c Tue Oct 8 11:19:40 2002 @@ -154,6 +154,10 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; + unsigned long wait_time; + int wait_node; + int load[2][NR_CPUS]; + } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; @@ -174,6 +178,95 @@ static struct runqueue runqueues[NR_CPUS #endif /* + * Variables for describing and accessing processor pools. Using a + * compressed row format like notation. Processor pools are treated + * like logical node numbers. + * + * numpools: number of CPU pools (nodes), + * pool_cpus[]: CPUs in pools sorted by their pool ID, + * pool_ptr[pool]: index of first element in pool_cpus[] belonging to pool. + * pool_mask[]: cpu mask of a pool. + * + * Example: loop over all CPUs in a pool p: + * loop_over_node(i,node) { + * cpu = pool_cpu(i); + * ... + * } + * <efocht@ess.nec.de> + */ +int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES]; +unsigned long pool_mask[MAX_NUMNODES]; +static atomic_t pool_lock = ATOMIC_INIT(0); +#define cpu_to_node(cpu) __cpu_to_node(cpu) + +/* Avoid zeroes in integer divides for load calculations */ +#define BALANCE_FACTOR 100 + +#ifdef CONFIG_NUMA +#define POOL_DELAY (100) +#define loop_over_node(i,n) for(i=pool_ptr[n]; i<pool_ptr[n+1]; i++) +#define pool_cpu(i) pool_cpus[i] + +void pooldata_lock(void) +{ + int i; + retry: + while (atomic_read(&pool_lock)); + if (atomic_inc_return(&pool_lock) > 1) { + atomic_dec(&pool_lock); + goto retry; + } + /* + * Wait a while, any loops using pool data should finish + * in between. This is VERY ugly and should be replaced + * by some real RCU stuff. [EF] + */ + for (i=0; i<100; i++) + udelay(1000); +} + +void pooldata_unlock(void) +{ + atomic_dec(&pool_lock); +} + +/* + * Call pooldata_lock() before calling this function and + * pooldata_unlock() after! + */ +void bld_pools(void) +{ + int n, cpu, ptr, i; + unsigned long mask; + + ptr=0; + for (n=0; n<numnodes; n++) { + mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map; + pool_ptr[n] = ptr; + for (cpu=0; cpu<NR_CPUS; cpu++) + if (mask & (1UL << cpu)) + pool_cpu(ptr++) = cpu; + pool_nr_cpus[n] = ptr - pool_ptr[n]; + } + numpools=numnodes; + pool_ptr[numpools]=ptr; + printk("CPU pools : %d\n",numpools); + for (n=0;n<numpools;n++) { + printk("pool %d :",n); + loop_over_node(i,n) + printk("%d ",pool_cpu(i)); + printk("\n"); + } +} + +#else +#define POOL_DELAY (0) +#define loop_over_node(i,n) for(i=0; i<NR_CPUS; i++) +#define pool_cpu(i) (i) +#endif + + +/* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without * explicitly disabling preemption. @@ -632,121 +725,144 @@ static inline unsigned int double_lock_b } /* - * find_busiest_queue - find the busiest runqueue. + * Calculate load of a CPU pool, store results in data[][NR_CPUS]. + * Return the index of the most loaded runqueue. */ -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) +static int calc_pool_load(int data[][NR_CPUS], int this_cpu, int pool, int idle) { - int nr_running, load, max_load, i; - runqueue_t *busiest, *rq_src; - - /* - * We search all runqueues to find the most busy one. - * We do this lockless to reduce cache-bouncing overhead, - * we re-check the 'best' source CPU later on again, with - * the lock held. - * - * We fend off statistical fluctuations in runqueue lengths by - * saving the runqueue length during the previous load-balancing - * operation and using the smaller one the current and saved lengths. - * If a runqueue is long enough for a longer amount of time then - * we recognize it and pull tasks from it. - * - * The 'current runqueue length' is a statistical maximum variable, - * for that one we take the longer one - to avoid fluctuations in - * the other direction. So for a load-balance to happen it needs - * stable long runqueue on the target CPU and stable short runqueue - * on the local runqueue. - * - * We make an exception if this CPU is about to become idle - in - * that case we are less picky about moving a task across CPUs and - * take what can be taken. - */ - if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) - nr_running = this_rq->nr_running; - else - nr_running = this_rq->prev_nr_running[this_cpu]; + runqueue_t *rq_src, *this_rq = cpu_rq(this_cpu); + int i, ii, idx=-1, refload, load; - busiest = NULL; - max_load = 1; - for (i = 0; i < NR_CPUS; i++) { - if (!cpu_online(i)) - continue; + data[1][pool] = 0; + refload = -1; + loop_over_node(ii, pool) { + i = pool_cpu(ii); rq_src = cpu_rq(i); if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) load = rq_src->nr_running; else load = this_rq->prev_nr_running[i]; this_rq->prev_nr_running[i] = rq_src->nr_running; - - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; + data[0][i] = load; + data[1][pool] += load; + if (load > refload) { + idx = i; + refload = load; } } + data[1][pool] = data[1][pool] * BALANCE_FACTOR / pool_nr_cpus[pool]; + return idx; +} - if (likely(!busiest)) - goto out; +/* + * Find a runqueue from which to steal a task. We try to do this as locally as + * possible because we don't want to let tasks get far from their home node. + * This is done in two steps: + * 1. First try to find a runqueue within the own CPU pool (AKA node) with + * imbalance larger than 25% (relative to the current runqueue). + * 2. If the local node is well balanced, locate the most loaded node and its + * most loaded CPU. Remote runqueues running tasks having their homenode on the + * current node are preferred (those tasks count twice in the load calculation). + * If the current load is far below the average try to steal a task from the + * most loaded node/cpu. Otherwise wait 100ms and give less loaded nodes the + * chance to approach the average load. + * + * This concept can be extended easilly to more than two levels (multi-level + * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode... + * <efocht@ess.nec.de> + */ +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, + int idle, int *nr_running) +{ + runqueue_t *busiest = NULL; + int imax, best_cpu, pool, max_pool_load, max_pool_idx; + int i, del_shift; + int avg_load=-1, this_pool = cpu_to_node(this_cpu); + + /* Need at least ~25% imbalance to trigger balancing. */ +#define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4)) + + if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) + *nr_running = this_rq->nr_running; + else + *nr_running = this_rq->prev_nr_running[this_cpu]; - *imbalance = (max_load - nr_running) / 2; + best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle); + + if (best_cpu != this_cpu) + goto check_out; - /* It needs an at least ~25% imbalance to trigger balancing. */ - if (!idle && (*imbalance < (max_load + 3)/4)) { - busiest = NULL; + scan_all: + best_cpu = -1; + max_pool_load = this_rq->load[1][this_pool]; + max_pool_idx = this_pool; + avg_load = max_pool_load * pool_nr_cpus[this_pool]; + for (i = 1; i < numpools; i++) { + pool = (i + this_pool) % numpools; + imax = calc_pool_load(this_rq->load, this_cpu, pool, idle); + avg_load += this_rq->load[1][pool]*pool_nr_cpus[pool]; + if (this_rq->load[1][pool] > max_pool_load) { + max_pool_load = this_rq->load[1][pool]; + max_pool_idx = pool; + best_cpu = imax; + } + } + /* Exit if not enough imbalance on any remote node. */ + if ((best_cpu < 0) || + BALANCED(max_pool_load,this_rq->load[1][this_pool])) { + this_rq->wait_node = -1; goto out; } + avg_load /= num_online_cpus(); + /* Wait longer before stealing if load is average. */ + if (BALANCED(avg_load,this_rq->load[1][this_pool])) + del_shift = 0; + else + del_shift = 6; - nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); - /* - * Make sure nothing changed since we checked the - * runqueue length. - */ - if (busiest->nr_running <= nr_running + 1) { - spin_unlock(&busiest->lock); - busiest = NULL; - } -out: + if (this_rq->wait_node != max_pool_idx) { + this_rq->wait_node = max_pool_idx; + this_rq->wait_time = jiffies; + goto out; + } else + if (jiffies - this_rq->wait_time < (POOL_DELAY >> del_shift)) + goto out; + + check_out: + /* Enough imbalance in the remote cpu loads? */ + if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) { + busiest = cpu_rq(best_cpu); + this_rq->wait_node = -1; + } else if (avg_load == -1) + /* only scanned local pool, so let's look at all of them */ + goto scan_all; + out: return busiest; } /* - * pull_task - move a task from a remote runqueue to the local runqueue. - * Both runqueues must be locked. + * Find a task to steal from the busiest RQ. The busiest->lock must be held + * while calling this routine. */ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu) { - dequeue_task(p, src_array); - src_rq->nr_running--; - set_task_cpu(p, this_cpu); - this_rq->nr_running++; - enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); -} - -/* - * Current runqueue is empty, or rebalance tick: if there is an - * inbalance (current runqueue is too short) then pull from - * busiest runqueue(s). - * - * We call this with the current runqueue locked, - * irqs disabled. - */ -static void load_balance(runqueue_t *this_rq, int idle) -{ - int imbalance, idx, this_cpu = smp_processor_id(); - runqueue_t *busiest; + int idx; + task_t *next = NULL, *tmp; prio_array_t *array; struct list_head *head, *curr; - task_t *tmp; + int weight, maxweight=0; - busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance); - if (!busiest) - goto out; + /* + * We do not migrate tasks that are: + * 1) running (obviously), or + * 2) cannot be migrated to this CPU due to cpus_allowed. + */ + +#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ + ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ + p != rq->curr && \ + ((p)->cpus_allowed & (1UL<<(this_cpu)))) /* * We first consider expired tasks. Those will likely not be @@ -772,7 +888,7 @@ skip_bitmap: array = busiest->active; goto new_array; } - goto out_unlock; + goto out; } head = array->queue + idx; @@ -780,33 +896,74 @@ skip_bitmap: skip_queue: tmp = list_entry(curr, task_t, run_list); + if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { + weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks; + if (weight > maxweight) { + maxweight = weight; + next = tmp; + } + } + curr = curr->next; + if (curr != head) + goto skip_queue; + idx++; + goto skip_bitmap; + + out: + return next; +} + +/* + * pull_task - move a task from a remote runqueue to the local runqueue. + * Both runqueues must be locked. + */ +static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +{ + dequeue_task(p, src_array); + src_rq->nr_running--; + set_task_cpu(p, this_cpu); + this_rq->nr_running++; + enqueue_task(p, this_rq->active); /* - * We do not migrate tasks that are: - * 1) running (obviously), or - * 2) cannot be migrated to this CPU due to cpus_allowed, or - * 3) are cache-hot on their current CPU. + * Note that idle threads have a prio of MAX_PRIO, for this test + * to be always true for them. */ + if (p->prio < this_rq->curr->prio) + set_need_resched(); +} -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) - - curr = curr->prev; - - if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } +/* + * Current runqueue is empty, or rebalance tick: if there is an + * inbalance (current runqueue is too short) then pull from + * busiest runqueue(s). + * + * We call this with the current runqueue locked, + * irqs disabled. + */ +static void load_balance(runqueue_t *this_rq, int idle) +{ + int nr_running, this_cpu = task_cpu(this_rq->curr); + task_t *tmp; + runqueue_t *busiest; + + /* avoid deadlock by timer interrupt on own cpu */ + if (atomic_read(&pool_lock)) return; + busiest = find_busiest_queue(this_rq, this_cpu, idle, &nr_running); + if (!busiest) + goto out; + + nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); + /* + * Make sure nothing changed since we checked the + * runqueue length. + */ + if (busiest->nr_running <= nr_running + 1) + goto out_unlock; + + tmp = task_to_steal(busiest, this_cpu); + if (!tmp) + goto out_unlock; + pull_task(busiest, tmp->array, tmp, this_rq, this_cpu); out_unlock: spin_unlock(&busiest->lock); out: @@ -819,10 +976,10 @@ out: * frequency and balancing agressivity depends on whether the CPU is * idle or not. * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on * systems with HZ=100, every 10 msecs.) */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1) #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) static inline void idle_tick(runqueue_t *rq) @@ -1920,6 +2077,8 @@ typedef struct { struct list_head list; task_t *task; struct completion done; + int cpu_dest; + int sync; } migration_req_t; /* @@ -1965,6 +2124,7 @@ void set_cpus_allowed(task_t *p, unsigne } init_completion(&req.done); req.task = p; + req.sync = 1; list_add(&req.list, &rq->migration_queue); task_rq_unlock(rq, &flags); wake_up_process(rq->migration_thread); @@ -1974,6 +2134,17 @@ out: preempt_enable(); } +void sched_migrate_task(task_t *p, int dest_cpu) +{ + unsigned long old_mask; + + old_mask = p->cpus_allowed; + if (!(old_mask & (1UL << dest_cpu))) + return; + set_cpus_allowed(p, 1UL << dest_cpu); + set_cpus_allowed(p, old_mask); +} + /* * migration_thread - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. @@ -2009,7 +2180,7 @@ static int migration_thread(void * data) for (;;) { runqueue_t *rq_src, *rq_dest; struct list_head *head; - int cpu_src, cpu_dest; + int cpu_src, cpu_dest, sync; migration_req_t *req; unsigned long flags; task_t *p; @@ -2024,10 +2195,17 @@ static int migration_thread(void * data) } req = list_entry(head->next, migration_req_t, list); list_del_init(head->next); - spin_unlock_irqrestore(&rq->lock, flags); p = req->task; - cpu_dest = __ffs(p->cpus_allowed); + sync = req->sync; + if (sync) + cpu_dest = __ffs(p->cpus_allowed & cpu_online_map); + else { + cpu_dest = req->cpu_dest; + req->task = NULL; + } + spin_unlock_irqrestore(&rq->lock, flags); + rq_dest = cpu_rq(cpu_dest); repeat: cpu_src = task_cpu(p); @@ -2050,7 +2228,8 @@ repeat: double_rq_unlock(rq_src, rq_dest); local_irq_restore(flags); - complete(&req->done); + if (sync) + complete(&req->done); } } @@ -2130,6 +2309,15 @@ void __init sched_init(void) __set_bit(MAX_PRIO, array->bitmap); } } +#ifdef CONFIG_NUMA + pool_ptr[0] = 0; + pool_ptr[1] = NR_CPUS; + + numpools = 1; + pool_mask[0] = -1L; + pool_nr_cpus[0] = NR_CPUS; +#endif + /* * We have to do a little magic to get the first * thread right in SMP mode. [-- Attachment #3: 02-numa_sched_ilb-2.5.39.patch --] [-- Type: text/x-diff, Size: 2790 bytes --] diff -urNp a/fs/exec.c b/fs/exec.c --- a/fs/exec.c Tue Oct 8 15:03:54 2002 +++ b/fs/exec.c Tue Oct 8 15:08:22 2002 @@ -993,6 +993,7 @@ int do_execve(char * filename, char ** a int retval; int i; + sched_balance_exec(); file = open_exec(filename); retval = PTR_ERR(file); diff -urNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h Tue Oct 8 15:07:48 2002 +++ b/include/linux/sched.h Tue Oct 8 15:09:27 2002 @@ -460,6 +460,9 @@ extern void set_cpus_allowed(task_t *p, #ifdef CONFIG_NUMA extern void pooldata_lock(void); extern void pooldata_unlock(void); +extern void sched_balance_exec(void); +#else +#define sched_balance_exec() {} #endif extern void sched_migrate_task(task_t *p, int cpu); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Tue Oct 8 15:07:48 2002 +++ b/kernel/sched.c Tue Oct 8 15:08:22 2002 @@ -2134,6 +2134,78 @@ out: preempt_enable(); } +#ifdef CONFIG_NUMA +/* used as counter for round-robin node-scheduling */ +static atomic_t sched_node=ATOMIC_INIT(0); + +/* + * Find the least loaded CPU on the current node of the task. + */ +static int sched_best_cpu(struct task_struct *p, int node) +{ + int n, cpu, load, best_cpu = task_cpu(p); + + load = 1000000; + loop_over_node(n,node) { + cpu = pool_cpu(n); + if (!(p->cpus_allowed & (1UL << cpu) & cpu_online_map)) + continue; + if (cpu_rq(cpu)->nr_running < load) { + best_cpu = cpu; + load = cpu_rq(cpu)->nr_running; + } + } + return best_cpu; +} + +/* + * Find the node with fewest number of tasks running on it. + */ +static int sched_best_node(struct task_struct *p) +{ + int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id(); + int cpu, pool, load; + unsigned long mask = p->cpus_allowed & cpu_online_map; + + do { + best_node = atomic_inc_return(&sched_node) % numpools; + } while (!(pool_mask[best_node] & mask)); + + min_load = 100000000; + for (n = 0; n < numpools; n++) { + pool = (best_node + n) % numpools; + load = 0; + loop_over_node(i, pool) { + cpu=pool_cpu(i); + if (!cpu_online(cpu)) continue; + load += cpu_rq(cpu)->nr_running; + } + if (pool == numa_node_id()) load--; + pool_load = 100*load/pool_nr_cpus[pool]; + if ((pool_load < min_load) && (pool_mask[pool] & mask)) { + min_load = pool_load; + min_pool = pool; + } + } + atomic_set(&sched_node, min_pool); + return min_pool; +} + +void sched_balance_exec(void) +{ + int new_cpu, new_node=0; + + while (atomic_read(&pool_lock)) + cpu_relax(); + if (numpools > 1) { + new_node = sched_best_node(current); + } + new_cpu = sched_best_cpu(current, new_node); + if (new_cpu != smp_processor_id()) + sched_migrate_task(current, new_cpu); +} +#endif + void sched_migrate_task(task_t *p, int dest_cpu) { unsigned long old_mask; ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-08 17:33 ` Erich Focht @ 2002-10-08 19:44 ` Martin J. Bligh 2002-10-09 16:26 ` Erich Focht 2002-10-09 1:15 ` Christoph Hellwig 1 sibling, 1 reply; 24+ messages in thread From: Martin J. Bligh @ 2002-10-08 19:44 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel > Aaargh, you got the wrong second patch :-( Sorry for that... No problem > Thanks for the hints, I cleaned up the first patch, too. No > CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including > asm/numa.h from asm/topology.h, so no need for you to see it. Cool. Well it compiles now, but: CPU 3 IS NOW UP! Starting migration thread for cpu 3 Bringing up 4 CP4 4ivi e e UPr: 0000 ing igrU:i 4 eaIP: p060:[<c0114231>] Not tainted EFLAGS: 00010046 EIP is at calc_pool_load+0x109/0x120 eax: 00000000 ebx: 00000001 ecx: c034f380 edx: 00000000 esi: c028efa0 edi: 00000020 ebp: c3a37efc esp: c3a37ed4 ds: 0068 es: 0068 ss: 0068 Process swapper (pid: 0, threadinfo=c3a36000 task=f01bf060) Stack: c028f948 c028efa0 f01bf060 00002928 00002944 00002944 ffffffff ffffffff c028efa0 ffffffff c3a37f78 c01142d5 c028f948 00000004 00000001 00000001 c3a36000 c028efa0 f01bf060 00000010 00000008 00000000 00000001 00002a13 Call Trace: [<c01142d5>] load_balance+0x8d/0x5c4 [<c0118a8d>] call_console_drivers+0xdd/0xe4 [<c0118cc2>] release_console_sem+0x42/0xa4 [<c0114c21>] schedule+0xd5/0x3ac [<c0105300>] default_idle+0x0/0x34 [<c01053bf>] cpu_idle+0x43/0x48 [<c0118cc2>] release_console_sem+0x42/0xa4 [<c0118c1d>] printk+0x125/0x140 Code: f7 3c 99 8b 55 08 89 84 9a 80 00 00 00 8b 45 f4 5b 5e 5f 89 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-08 19:44 ` Martin J. Bligh @ 2002-10-09 16:26 ` Erich Focht 2002-10-09 17:33 ` Martin J. Bligh 0 siblings, 1 reply; 24+ messages in thread From: Erich Focht @ 2002-10-09 16:26 UTC (permalink / raw) To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1903 bytes --] On Tuesday 08 October 2002 21:44, Martin J. Bligh wrote: > > Thanks for the hints, I cleaned up the first patch, too. No > > CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including > > asm/numa.h from asm/topology.h, so no need for you to see it. > > Cool. Well it compiles now, but: > > CPU 3 IS NOW UP! > Starting migration thread for cpu 3 > Bringing up 4 > CP4 4ivi e e UPr: 0000 > ing > igrU:i 4 > eaIP: p060:[<c0114231>] Not tainted > EFLAGS: 00010046 > EIP is at calc_pool_load+0x109/0x120 This is strange. It works for me really reliably. I added a check for non-online CPUs in calc_pool_load and changed the pool_lock to be a spinlock. The patches are attached again. Results so far (all based on 2.5.39 + ia64 + discontig, measured on 16 CPU Azusa), averaged over 4 measurements. The statistics show that we need more measurements to get something meaningfull. 2.5.39 numa_sched+ilb numa_test 4 4 AverageUserTime 31.18+-0.67 31.25+- 2.51 ElapsedTime 41.59+-1.95 38.55+- 4.79 TotalUserTime 124.78+-2.71 125.08+-10.03 numa_test 8 4 AverageUserTime 30.84+-0.51 28.94+- 0.04 ElapsedTime 45.02+-1.54 30.00+- 0.24 TotalUserTime 246.82+-4.02 231.66+- 0.34 numa_test 16 4 AverageUserTime 33.61+- 0.72 32.15+- 0.33 ElapsedTime 47.16+- 0.95 37.45+- 3.20 TotalUserTime 538.05+-11.58 514.72+- 5.25 numa_test 32 4 AverageUserTime 37.94+- 0.74 33.36+- 0.02 ElapsedTime 84.83+- 1.28 68.78+- 0.18 TotalUserTime 1214.37+-23.64 1068.06+- 0.53 numa_test 64 4 AverageUserTime 39.58+- 1.34 33.64+- 0.09 ElapsedTime 168.04+- 4.77 142.84+- 5.44 TotalUserTime 2533.82+-85.96 2153.69+- 5.58 Regards, Erich [-- Attachment #2: 01-numa_sched_core7-2.5.39.patch --] [-- Type: text/x-diff, Size: 19262 bytes --] diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002 +++ b/arch/i386/kernel/smpboot.c Tue Oct 8 15:07:48 2002 @@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu) void __init smp_cpus_done(unsigned int max_cpus) { zap_low_mappings(); +#ifdef CONFIG_NUMA + pooldata_lock(); + bld_pools(); + pooldata_unlock(); +#endif } void __init smp_intr_init() diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c --- a/arch/ia64/kernel/smpboot.c Tue Oct 8 15:04:54 2002 +++ b/arch/ia64/kernel/smpboot.c Tue Oct 8 15:07:48 2002 @@ -397,7 +397,7 @@ unsigned long cache_decay_ticks; /* # of static void smp_tune_scheduling (void) { - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */ printk("task migration cache decay timeout: %ld msecs.\n", (cache_decay_ticks + 1) * 1000 / HZ); @@ -508,6 +508,11 @@ smp_cpus_done (unsigned int dummy) printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n", num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100); +#ifdef CONFIG_NUMA + pooldata_lock(); + bld_pools(); + pooldata_unlock(); +#endif } int __devinit diff -urNp a/include/asm-i386/atomic.h b/include/asm-i386/atomic.h --- a/include/asm-i386/atomic.h Fri Sep 27 23:49:16 2002 +++ b/include/asm-i386/atomic.h Tue Oct 8 15:07:48 2002 @@ -111,6 +111,18 @@ static __inline__ void atomic_inc(atomic } /** + * atomic_inc_return - increment atomic variable and return new value + * @v: pointer of type atomic_t + * + * Atomically increments @v by 1 and return it's new value. Note that + * the guaranteed useful range of an atomic_t is only 24 bits. + */ +static inline int atomic_inc_return(atomic_t *v){ + atomic_inc(v); + return v->counter; +} + +/** * atomic_dec - decrement atomic variable * @v: pointer of type atomic_t * diff -urNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h Tue Oct 8 15:03:54 2002 +++ b/include/linux/sched.h Tue Oct 8 15:07:48 2002 @@ -22,6 +22,7 @@ extern unsigned long event; #include <asm/mmu.h> #include <linux/smp.h> +#include <asm/topology.h> #include <linux/sem.h> #include <linux/signal.h> #include <linux/securebits.h> @@ -167,7 +168,6 @@ extern void update_one_process(struct ta extern void scheduler_tick(int user_tick, int system); extern unsigned long cache_decay_ticks; - #define MAX_SCHEDULE_TIMEOUT LONG_MAX extern signed long FASTCALL(schedule_timeout(signed long timeout)); asmlinkage void schedule(void); @@ -457,6 +457,12 @@ extern void set_cpus_allowed(task_t *p, # define set_cpus_allowed(p, new_mask) do { } while (0) #endif +#ifdef CONFIG_NUMA +extern void pooldata_lock(void); +extern void pooldata_unlock(void); +#endif +extern void sched_migrate_task(task_t *p, int cpu); + extern void set_user_nice(task_t *p, long nice); extern int task_prio(task_t *p); extern int task_nice(task_t *p); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Fri Sep 27 23:50:27 2002 +++ b/kernel/sched.c Wed Oct 9 17:06:04 2002 @@ -154,6 +154,10 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; + unsigned long wait_time; + int wait_node; + int load[2][NR_CPUS]; + } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; @@ -174,6 +178,90 @@ static struct runqueue runqueues[NR_CPUS #endif /* + * Variables for describing and accessing processor pools. Using a + * compressed row format like notation. Processor pools are treated + * like logical node numbers. + * + * numpools: number of CPU pools (nodes), + * pool_cpus[]: CPUs in pools sorted by their pool ID, + * pool_ptr[pool]: index of first element in pool_cpus[] belonging to pool. + * pool_mask[]: cpu mask of a pool. + * + * Example: loop over all CPUs in a pool p: + * loop_over_node(i,node) { + * cpu = pool_cpu(i); + * ... + * } + * <efocht@ess.nec.de> + */ +int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES]; +unsigned long pool_mask[MAX_NUMNODES]; +static spinlock_t pool_lock = SPIN_LOCK_UNLOCKED; +#define cpu_to_node(cpu) __cpu_to_node(cpu) + +/* Avoid zeroes in integer divides for load calculations */ +#define BALANCE_FACTOR 100 + +#ifdef CONFIG_NUMA +#define POOL_DELAY (100) +#define loop_over_node(i,n) for(i=pool_ptr[n]; i<pool_ptr[n+1]; i++) +#define pool_cpu(i) pool_cpus[i] + +void pooldata_lock(void) +{ + int i; + spin_lock(&pool_lock); + /* + * Wait a while, any loops using pool data should finish + * in between. This is VERY ugly and should be replaced + * by some real RCU stuff. [EF] + */ + for (i=0; i<100; i++) + udelay(1000); +} + +void pooldata_unlock(void) +{ + spin_unlock(&pool_lock); +} + +/* + * Call pooldata_lock() before calling this function and + * pooldata_unlock() after! + */ +void bld_pools(void) +{ + int n, cpu, ptr, i; + unsigned long mask; + + ptr=0; + for (n=0; n<numnodes; n++) { + mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map; + pool_ptr[n] = ptr; + for (cpu=0; cpu<NR_CPUS; cpu++) + if (mask & (1UL << cpu)) + pool_cpu(ptr++) = cpu; + pool_nr_cpus[n] = ptr - pool_ptr[n]; + } + numpools=numnodes; + pool_ptr[numpools]=ptr; + printk("CPU pools : %d\n",numpools); + for (n=0;n<numpools;n++) { + printk("pool %d :",n); + loop_over_node(i,n) + printk("%d ",pool_cpu(i)); + printk("\n"); + } +} + +#else +#define POOL_DELAY (0) +#define loop_over_node(i,n) for(i=0; i<NR_CPUS; i++) +#define pool_cpu(i) (i) +#endif + + +/* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without * explicitly disabling preemption. @@ -632,121 +720,145 @@ static inline unsigned int double_lock_b } /* - * find_busiest_queue - find the busiest runqueue. + * Calculate load of a CPU pool, store results in data[][NR_CPUS]. + * Return the index of the most loaded runqueue. */ -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) +static int calc_pool_load(int data[][NR_CPUS], int this_cpu, int pool, int idle) { - int nr_running, load, max_load, i; - runqueue_t *busiest, *rq_src; + runqueue_t *rq_src, *this_rq = cpu_rq(this_cpu); + int i, ii, idx=-1, refload, load; - /* - * We search all runqueues to find the most busy one. - * We do this lockless to reduce cache-bouncing overhead, - * we re-check the 'best' source CPU later on again, with - * the lock held. - * - * We fend off statistical fluctuations in runqueue lengths by - * saving the runqueue length during the previous load-balancing - * operation and using the smaller one the current and saved lengths. - * If a runqueue is long enough for a longer amount of time then - * we recognize it and pull tasks from it. - * - * The 'current runqueue length' is a statistical maximum variable, - * for that one we take the longer one - to avoid fluctuations in - * the other direction. So for a load-balance to happen it needs - * stable long runqueue on the target CPU and stable short runqueue - * on the local runqueue. - * - * We make an exception if this CPU is about to become idle - in - * that case we are less picky about moving a task across CPUs and - * take what can be taken. - */ - if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) - nr_running = this_rq->nr_running; - else - nr_running = this_rq->prev_nr_running[this_cpu]; - - busiest = NULL; - max_load = 1; - for (i = 0; i < NR_CPUS; i++) { - if (!cpu_online(i)) - continue; + data[1][pool] = 0; + refload = -1; + loop_over_node(ii, pool) { + i = pool_cpu(ii); + if (!cpu_online(i)) continue; rq_src = cpu_rq(i); if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) load = rq_src->nr_running; else load = this_rq->prev_nr_running[i]; this_rq->prev_nr_running[i] = rq_src->nr_running; - - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; + data[0][i] = load; + data[1][pool] += load; + if (load > refload) { + idx = i; + refload = load; } } + data[1][pool] = data[1][pool] * BALANCE_FACTOR / pool_nr_cpus[pool]; + return idx; +} - if (likely(!busiest)) - goto out; +/* + * Find a runqueue from which to steal a task. We try to do this as locally as + * possible because we don't want to let tasks get far from their home node. + * This is done in two steps: + * 1. First try to find a runqueue within the own CPU pool (AKA node) with + * imbalance larger than 25% (relative to the current runqueue). + * 2. If the local node is well balanced, locate the most loaded node and its + * most loaded CPU. Remote runqueues running tasks having their homenode on the + * current node are preferred (those tasks count twice in the load calculation). + * If the current load is far below the average try to steal a task from the + * most loaded node/cpu. Otherwise wait 100ms and give less loaded nodes the + * chance to approach the average load. + * + * This concept can be extended easilly to more than two levels (multi-level + * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode... + * <efocht@ess.nec.de> + */ +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, + int idle, int *nr_running) +{ + runqueue_t *busiest = NULL; + int imax, best_cpu, pool, max_pool_load, max_pool_idx; + int i, del_shift; + int avg_load=-1, this_pool = cpu_to_node(this_cpu); + + /* Need at least ~25% imbalance to trigger balancing. */ +#define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4)) + + if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) + *nr_running = this_rq->nr_running; + else + *nr_running = this_rq->prev_nr_running[this_cpu]; - *imbalance = (max_load - nr_running) / 2; + best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle); + + if (best_cpu != this_cpu) + goto check_out; - /* It needs an at least ~25% imbalance to trigger balancing. */ - if (!idle && (*imbalance < (max_load + 3)/4)) { - busiest = NULL; + scan_all: + best_cpu = -1; + max_pool_load = this_rq->load[1][this_pool]; + max_pool_idx = this_pool; + avg_load = max_pool_load * pool_nr_cpus[this_pool]; + for (i = 1; i < numpools; i++) { + pool = (i + this_pool) % numpools; + imax = calc_pool_load(this_rq->load, this_cpu, pool, idle); + avg_load += this_rq->load[1][pool]*pool_nr_cpus[pool]; + if (this_rq->load[1][pool] > max_pool_load) { + max_pool_load = this_rq->load[1][pool]; + max_pool_idx = pool; + best_cpu = imax; + } + } + /* Exit if not enough imbalance on any remote node. */ + if ((best_cpu < 0) || + BALANCED(max_pool_load,this_rq->load[1][this_pool])) { + this_rq->wait_node = -1; goto out; } + avg_load /= num_online_cpus(); + /* Wait longer before stealing if load is average. */ + if (BALANCED(avg_load,this_rq->load[1][this_pool])) + del_shift = 0; + else + del_shift = 6; - nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); - /* - * Make sure nothing changed since we checked the - * runqueue length. - */ - if (busiest->nr_running <= nr_running + 1) { - spin_unlock(&busiest->lock); - busiest = NULL; - } -out: - return busiest; -} + if (this_rq->wait_node != max_pool_idx) { + this_rq->wait_node = max_pool_idx; + this_rq->wait_time = jiffies; + goto out; + } else + if (jiffies - this_rq->wait_time < (POOL_DELAY >> del_shift)) + goto out; -/* - * pull_task - move a task from a remote runqueue to the local runqueue. - * Both runqueues must be locked. - */ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) -{ - dequeue_task(p, src_array); - src_rq->nr_running--; - set_task_cpu(p, this_cpu); - this_rq->nr_running++; - enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); + check_out: + /* Enough imbalance in the remote cpu loads? */ + if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) { + busiest = cpu_rq(best_cpu); + this_rq->wait_node = -1; + } else if (avg_load == -1) + /* only scanned local pool, so let's look at all of them */ + goto scan_all; + out: + return busiest; } /* - * Current runqueue is empty, or rebalance tick: if there is an - * inbalance (current runqueue is too short) then pull from - * busiest runqueue(s). - * - * We call this with the current runqueue locked, - * irqs disabled. + * Find a task to steal from the busiest RQ. The busiest->lock must be held + * while calling this routine. */ -static void load_balance(runqueue_t *this_rq, int idle) +static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu) { - int imbalance, idx, this_cpu = smp_processor_id(); - runqueue_t *busiest; + int idx; + task_t *next = NULL, *tmp; prio_array_t *array; struct list_head *head, *curr; - task_t *tmp; + int weight, maxweight=0; - busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance); - if (!busiest) - goto out; + /* + * We do not migrate tasks that are: + * 1) running (obviously), or + * 2) cannot be migrated to this CPU due to cpus_allowed. + */ + +#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ + ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ + p != rq->curr && \ + ((p)->cpus_allowed & (1UL<<(this_cpu)))) /* * We first consider expired tasks. Those will likely not be @@ -772,7 +884,7 @@ skip_bitmap: array = busiest->active; goto new_array; } - goto out_unlock; + goto out; } head = array->queue + idx; @@ -780,33 +892,74 @@ skip_bitmap: skip_queue: tmp = list_entry(curr, task_t, run_list); + if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { + weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks; + if (weight > maxweight) { + maxweight = weight; + next = tmp; + } + } + curr = curr->next; + if (curr != head) + goto skip_queue; + idx++; + goto skip_bitmap; + + out: + return next; +} + +/* + * pull_task - move a task from a remote runqueue to the local runqueue. + * Both runqueues must be locked. + */ +static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +{ + dequeue_task(p, src_array); + src_rq->nr_running--; + set_task_cpu(p, this_cpu); + this_rq->nr_running++; + enqueue_task(p, this_rq->active); /* - * We do not migrate tasks that are: - * 1) running (obviously), or - * 2) cannot be migrated to this CPU due to cpus_allowed, or - * 3) are cache-hot on their current CPU. + * Note that idle threads have a prio of MAX_PRIO, for this test + * to be always true for them. */ + if (p->prio < this_rq->curr->prio) + set_need_resched(); +} -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) - - curr = curr->prev; - - if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } +/* + * Current runqueue is empty, or rebalance tick: if there is an + * inbalance (current runqueue is too short) then pull from + * busiest runqueue(s). + * + * We call this with the current runqueue locked, + * irqs disabled. + */ +static void load_balance(runqueue_t *this_rq, int idle) +{ + int nr_running, this_cpu = task_cpu(this_rq->curr); + task_t *tmp; + runqueue_t *busiest; + + /* avoid deadlock by timer interrupt on own cpu */ + if (spin_is_locked(&pool_lock)) return; + busiest = find_busiest_queue(this_rq, this_cpu, idle, &nr_running); + if (!busiest) + goto out; + + nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); + /* + * Make sure nothing changed since we checked the + * runqueue length. + */ + if (busiest->nr_running <= nr_running + 1) + goto out_unlock; + + tmp = task_to_steal(busiest, this_cpu); + if (!tmp) + goto out_unlock; + pull_task(busiest, tmp->array, tmp, this_rq, this_cpu); out_unlock: spin_unlock(&busiest->lock); out: @@ -819,10 +972,10 @@ out: * frequency and balancing agressivity depends on whether the CPU is * idle or not. * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on * systems with HZ=100, every 10 msecs.) */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1) #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) static inline void idle_tick(runqueue_t *rq) @@ -1920,6 +2073,8 @@ typedef struct { struct list_head list; task_t *task; struct completion done; + int cpu_dest; + int sync; } migration_req_t; /* @@ -1965,6 +2120,7 @@ void set_cpus_allowed(task_t *p, unsigne } init_completion(&req.done); req.task = p; + req.sync = 1; list_add(&req.list, &rq->migration_queue); task_rq_unlock(rq, &flags); wake_up_process(rq->migration_thread); @@ -1974,6 +2130,17 @@ out: preempt_enable(); } +void sched_migrate_task(task_t *p, int dest_cpu) +{ + unsigned long old_mask; + + old_mask = p->cpus_allowed; + if (!(old_mask & (1UL << dest_cpu))) + return; + set_cpus_allowed(p, 1UL << dest_cpu); + set_cpus_allowed(p, old_mask); +} + /* * migration_thread - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. @@ -2009,7 +2176,7 @@ static int migration_thread(void * data) for (;;) { runqueue_t *rq_src, *rq_dest; struct list_head *head; - int cpu_src, cpu_dest; + int cpu_src, cpu_dest, sync; migration_req_t *req; unsigned long flags; task_t *p; @@ -2024,10 +2191,17 @@ static int migration_thread(void * data) } req = list_entry(head->next, migration_req_t, list); list_del_init(head->next); - spin_unlock_irqrestore(&rq->lock, flags); p = req->task; - cpu_dest = __ffs(p->cpus_allowed); + sync = req->sync; + if (sync) + cpu_dest = __ffs(p->cpus_allowed & cpu_online_map); + else { + cpu_dest = req->cpu_dest; + req->task = NULL; + } + spin_unlock_irqrestore(&rq->lock, flags); + rq_dest = cpu_rq(cpu_dest); repeat: cpu_src = task_cpu(p); @@ -2050,7 +2224,8 @@ repeat: double_rq_unlock(rq_src, rq_dest); local_irq_restore(flags); - complete(&req->done); + if (sync) + complete(&req->done); } } @@ -2130,6 +2305,15 @@ void __init sched_init(void) __set_bit(MAX_PRIO, array->bitmap); } } +#ifdef CONFIG_NUMA + pool_ptr[0] = 0; + pool_ptr[1] = NR_CPUS; + + numpools = 1; + pool_mask[0] = -1L; + pool_nr_cpus[0] = NR_CPUS; +#endif + /* * We have to do a little magic to get the first * thread right in SMP mode. [-- Attachment #3: 02-numa_sched_ilb-2.5.39.patch --] [-- Type: text/x-diff, Size: 2793 bytes --] diff -urNp a/fs/exec.c b/fs/exec.c --- a/fs/exec.c Tue Oct 8 15:03:54 2002 +++ b/fs/exec.c Tue Oct 8 15:08:22 2002 @@ -993,6 +993,7 @@ int do_execve(char * filename, char ** a int retval; int i; + sched_balance_exec(); file = open_exec(filename); retval = PTR_ERR(file); diff -urNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h Tue Oct 8 15:07:48 2002 +++ b/include/linux/sched.h Tue Oct 8 15:09:27 2002 @@ -460,6 +460,9 @@ extern void set_cpus_allowed(task_t *p, #ifdef CONFIG_NUMA extern void pooldata_lock(void); extern void pooldata_unlock(void); +extern void sched_balance_exec(void); +#else +#define sched_balance_exec() {} #endif extern void sched_migrate_task(task_t *p, int cpu); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Tue Oct 8 15:07:48 2002 +++ b/kernel/sched.c Tue Oct 8 15:08:22 2002 @@ -2134,6 +2134,78 @@ out: preempt_enable(); } +#ifdef CONFIG_NUMA +/* used as counter for round-robin node-scheduling */ +static atomic_t sched_node=ATOMIC_INIT(0); + +/* + * Find the least loaded CPU on the current node of the task. + */ +static int sched_best_cpu(struct task_struct *p, int node) +{ + int n, cpu, load, best_cpu = task_cpu(p); + + load = 1000000; + loop_over_node(n,node) { + cpu = pool_cpu(n); + if (!(p->cpus_allowed & (1UL << cpu) & cpu_online_map)) + continue; + if (cpu_rq(cpu)->nr_running < load) { + best_cpu = cpu; + load = cpu_rq(cpu)->nr_running; + } + } + return best_cpu; +} + +/* + * Find the node with fewest number of tasks running on it. + */ +static int sched_best_node(struct task_struct *p) +{ + int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id(); + int cpu, pool, load; + unsigned long mask = p->cpus_allowed & cpu_online_map; + + do { + best_node = atomic_inc_return(&sched_node) % numpools; + } while (!(pool_mask[best_node] & mask)); + + min_load = 100000000; + for (n = 0; n < numpools; n++) { + pool = (best_node + n) % numpools; + load = 0; + loop_over_node(i, pool) { + cpu=pool_cpu(i); + if (!cpu_online(cpu)) continue; + load += cpu_rq(cpu)->nr_running; + } + if (pool == numa_node_id()) load--; + pool_load = 100*load/pool_nr_cpus[pool]; + if ((pool_load < min_load) && (pool_mask[pool] & mask)) { + min_load = pool_load; + min_pool = pool; + } + } + atomic_set(&sched_node, min_pool); + return min_pool; +} + +void sched_balance_exec(void) +{ + int new_cpu, new_node=0; + + while (spin_is_locked(&pool_lock)) + cpu_relax(); + if (numpools > 1) { + new_node = sched_best_node(current); + } + new_cpu = sched_best_cpu(current, new_node); + if (new_cpu != smp_processor_id()) + sched_migrate_task(current, new_cpu); +} +#endif + void sched_migrate_task(task_t *p, int dest_cpu) { unsigned long old_mask; ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-09 16:26 ` Erich Focht @ 2002-10-09 17:33 ` Martin J. Bligh 2002-10-09 17:58 ` Andrew Theurer 0 siblings, 1 reply; 24+ messages in thread From: Martin J. Bligh @ 2002-10-09 17:33 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel > This is strange. It works for me really reliably. I added a check > for non-online CPUs in calc_pool_load and changed the pool_lock to > be a spinlock. The patches are attached again. I'm testing on 2.5.41-mm1 ... your patches still apply clean. That has a whole bunch of nice NUMA mods, you might want to try that instead? All the changes in there will end up in mainline real soon anyway ;-) One minor warning: arch/i386/kernel/smpboot.c: In function `smp_cpus_done': arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of function `bld_pools' And the same panic: Starting migration thread for cpu 3 Bringing up 4 CPU>dividNOWrro! 0St0 igrU: 4 ead :or 0060:[<c011422d>] Not tainted EFLAGS: 00010046 EIP is at calc_pool_load+0x115/0x12c eax: 00000000 ebx: 00000001 ecx: c028f948 edx: 00000000 esi: c034f380 edi: 00000020 ebp: c3a37efc esp: c3a37ed4 ds: 0068 es: 0068 ss: 0068 Process swapper (pid: 0, threadinfo=c3a36000 task=f01bf060) Stack: c028f948 c028efa0 f01bf060 00002928 00002944 00002944 ffffffff ffffffff c028efa0 ffffffff c3a37f78 c01142d1 c028f948 00000004 00000001 00000001 c3a36000 c028efa0 f01bf060 00000010 00000008 00000000 00000001 00002a13 Call Trace: [<c01142d1>] load_balance+0x8d/0x5c8 [<c0118a9d>] call_console_drivers+0xdd/0xe4 [<c0118cd2>] release_console_sem+0x42/0xa4 [<c0114c21>] schedule+0xd5/0x3ac [<c0105300>] default_idle+0x0/0x34 [<c01053bf>] cpu_idle+0x43/0x48 [<c0118cd2>] release_console_sem+0x42/0xa4 [<c0118c2d>] printk+0x125/0x140 Code: f7 3c 9e 89 84 99 80 00 00 00 8b 45 f4 5b 5e 5f 89 ec 5d c3 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-09 17:33 ` Martin J. Bligh @ 2002-10-09 17:58 ` Andrew Theurer 2002-10-09 18:13 ` Andrew Theurer 2002-10-09 23:02 ` Erich Focht 0 siblings, 2 replies; 24+ messages in thread From: Andrew Theurer @ 2002-10-09 17:58 UTC (permalink / raw) To: Martin J. Bligh, Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel > I'm testing on 2.5.41-mm1 ... your patches still apply clean. That > has a whole bunch of nice NUMA mods, you might want to try that > instead? All the changes in there will end up in mainline real soon > anyway ;-) > > One minor warning: > > arch/i386/kernel/smpboot.c: In function `smp_cpus_done': > arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of function > `bld_pools' > > And the same panic: > > Starting migration thread for cpu 3 > Bringing up 4 > CPU>dividNOWrro! I got the same thing on 2.5.40-mm1. It looks like it may be a a divide by zero in calc_pool_load. I am attempting to boot a band-aid version right now. OK, got a little further: PCI: Cannot allocate resource region 1 of device 03:0a.0 PCI: Cannot allocate resource region 4 of device 03:0e.1 PCI: Cannot allocate resource region 2 of device 05:0a.0 PCI: Cannot allocate resource region 0 of device 05:0b.0 PCI: Cannot allocate resource region 4 of device 05:0e.1 PCI: Cannot allocate resource region 1 of device 07:0a.0 PCI: Cannot allocate resource region 0 of device 07:0b.0 PCI: Cannot allocate resource region 4 of device 07:0e.1 Starting kswapd d<4>highmem bounce poo sCPU: 3 eEIP: 0060:[<c011a3d3>] Not tainted EFLAGS: 00010046 eax: 00000001 ebx: 00000000 ecx: 00000003 edx: 00000000 esi: c02c98c4 edi: c39c5880 ebp: f0199ee8 esp: f0199ec0 ds: 0068 es: 0068 ss: 0068 Process swapper (pid: 0, threadinfo=f0198000 task=f01c1740) Stack: c39c58a0 c02c94c8 c02c993c 00000000 c02c94c4 00000000 0000007d c02c94a0 00000001 00000003 f0199f1c c0117bec c02c94a0 00000003 00000003 00000001 00000000 c01135d1 f01a3f10 00000000 00000003 f01c1740 c02cb4e0 f0199f44 Call Trace: [<c0117bec>] [<c01135d1>] [<c0117eed>] [<c0122587>] [<c0105420>] [<c0125e60>] [<c0113ca7>] [<c0105420>] [<c010818e>] [<c0105420>] [<c0105420>] [<c010544a>] [<c01054ea>] [<c011cf50>] Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0 -Andrew Theurer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-09 17:58 ` Andrew Theurer @ 2002-10-09 18:13 ` Andrew Theurer 2002-10-09 23:02 ` Erich Focht 1 sibling, 0 replies; 24+ messages in thread From: Andrew Theurer @ 2002-10-09 18:13 UTC (permalink / raw) To: Martin J. Bligh, Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel On Wednesday 09 October 2002 12:58 pm, Andrew Theurer wrote: > > I'm testing on 2.5.41-mm1 ... your patches still apply clean. That > > has a whole bunch of nice NUMA mods, you might want to try that > > instead? All the changes in there will end up in mainline real soon > > anyway ;-) > > > > One minor warning: > > > > arch/i386/kernel/smpboot.c: In function `smp_cpus_done': > > arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of > > function `bld_pools' > > > > And the same panic: > > > > Starting migration thread for cpu 3 > > Bringing up 4 > > CPU>dividNOWrro! > > I got the same thing on 2.5.40-mm1. It looks like it may be a a divide by > zero in calc_pool_load. I am attempting to boot a band-aid version right > now. OK, got a little further: > A little more detail: Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0 EFLAGS: 00010046 eax: 00000001 ebx: 00000000 ecx: 00000003 edx: 00000000 esi: c02c98c4 edi: c39c5880 ebp: f0199ee8 esp: f0199ec0 ds: 0068 es: 0068 ss: 0068 Stack: c39c58a0 c02c94c8 c02c993c 00000000 c02c94c4 00000000 0000007d c02c94a0 00000001 00000003 f0199f1c c0117bec c02c94a0 00000003 00000003 00000001 00000000 c01135d1 f01a3f10 00000000 00000003 f01c1740 c02cb4e0 f0199f44 Call Trace: [<c0117bec>] [<c01135d1>] [<c0117eed>] [<c0122587>] [<c0105420>] [<c0125e60>] [<c0113ca7>] [<c0105420>] [<c010818e>] [<c 0105420>] [<c0105420>] [<c010544a>] [<c01054ea>] [<c011cf50>] Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0 Using defaults from ksymoops -t elf32-i386 -a i386 >>esi; c02c98c4 <runqueues+424/15800> >>edi; c39c5880 <END_OF_CODE+36419c4/????> >>ebp; f0199ee8 <END_OF_CODE+2fe1602c/????> >>esp; f0199ec0 <END_OF_CODE+2fe16004/????> Trace; c0117bec <load_balance+8c/140> Trace; c01135d1 <smp_call_function_interrupt+41/90> Trace; c0117eed <scheduler_tick+24d/390> Trace; c0122587 <tasklet_hi_action+77/c0> Trace; c0105420 <default_idle+0/50> Trace; c0125e60 <update_process_times+40/50> Trace; c0113ca7 <smp_apic_timer_interrupt+117/120> Trace; c0105420 <default_idle+0/50> Trace; c010818e <apic_timer_interrupt+1a/20> Trace; c0105420 <default_idle+0/50> Trace; c0105420 <default_idle+0/50> Trace; c010544a <default_idle+2a/50> Trace; c01054ea <cpu_idle+3a/50> Trace; c011cf50 <printk+140/180> Code; 00000000 Before first symbol 00000000 <_EIP>: Code; 00000000 Before first symbol 0: f7 f3 div %ebx Code; 00000002 Before first symbol 2: 3b 45 e4 cmp 0xffffffe4(%ebp),%eax Code; 00000005 Before first symbol 5: 7e 06 jle d <_EIP+0xd> 0000000d Before first symbol Code; 00000007 Before first symbol 7: 89 45 e4 mov %eax,0xffffffe4(%ebp) Code; 0000000a Before first symbol a: 89 7d ec mov %edi,0xffffffec(%ebp) Code; 0000000d Before first symbol d: 8b 45 d8 mov 0xffffffd8(%ebp),%eax Code; 00000010 Before first symbol 10: 8b 00 mov (%eax),%eax Code; 00000012 Before first symbol 12: 39 f0 cmp %esi,%eax ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-09 17:58 ` Andrew Theurer 2002-10-09 18:13 ` Andrew Theurer @ 2002-10-09 23:02 ` Erich Focht 2002-10-10 17:34 ` Andrew Theurer 1 sibling, 1 reply; 24+ messages in thread From: Erich Focht @ 2002-10-09 23:02 UTC (permalink / raw) To: Andrew Theurer, Martin J. Bligh, Michael Hohnbaum Cc: Ingo Molnar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1031 bytes --] > > Starting migration thread for cpu 3 > > Bringing up 4 > > CPU>dividNOWrro! > > I got the same thing on 2.5.40-mm1. It looks like it may be a a divide by > zero in calc_pool_load. I am attempting to boot a band-aid version right > now. OK, got a little further: This opened my eyes, thanks for all your help and patience!!! The problem is that the load balancer is called before the CPU pools were set up. That's fine, I thought, because I define in sched_init the default pool 0 to include all CPUs. But: in find_busiest_queue() the cpu_to_node(this_cpu) delivers a non-zero pool which is not set up yet, therefore pool_nr_cpus[pool]=0 and we get a zero divide. I'm still wondering why this doesn't happen on our architecture. Maybe the interrupts are disabled longer, I'll check. Anyway, a fix is to force this_pool to be 0 as long as numpools=1. The attached patch is a quick untested hack, maybe one can do it better. Has to be applied on top of the other 2. Going to sleep now... Bye, Erich [-- Attachment #2: 01a-numa_sched_fix-2.5.39.patch --] [-- Type: text/x-diff, Size: 1299 bytes --] diff -urNp 2.5.39-disc-ns/kernel/sched.c 2.5.39-disc-ns8/kernel/sched.c --- 2.5.39-disc-ns/kernel/sched.c Wed Oct 9 17:06:04 2002 +++ 2.5.39-disc-ns8/kernel/sched.c Thu Oct 10 00:51:20 2002 @@ -774,7 +774,7 @@ static inline runqueue_t *find_busiest_q runqueue_t *busiest = NULL; int imax, best_cpu, pool, max_pool_load, max_pool_idx; int i, del_shift; - int avg_load=-1, this_pool = cpu_to_node(this_cpu); + int avg_load=-1, this_pool; /* Need at least ~25% imbalance to trigger balancing. */ #define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4)) @@ -784,10 +784,13 @@ static inline runqueue_t *find_busiest_q else *nr_running = this_rq->prev_nr_running[this_cpu]; + this_pool = (numpools == 1 ? 0 : cpu_to_node(this_cpu)); best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle); if (best_cpu != this_cpu) goto check_out; + else if (numpools == 1) + goto out; scan_all: best_cpu = -1; @@ -830,7 +833,7 @@ static inline runqueue_t *find_busiest_q if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) { busiest = cpu_rq(best_cpu); this_rq->wait_node = -1; - } else if (avg_load == -1) + } else if (avg_load == -1 && numpools > 1) /* only scanned local pool, so let's look at all of them */ goto scan_all; out: ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-09 23:02 ` Erich Focht @ 2002-10-10 17:34 ` Andrew Theurer [not found] ` <200210110947.11714.efocht@ess.nec.de> 0 siblings, 1 reply; 24+ messages in thread From: Andrew Theurer @ 2002-10-10 17:34 UTC (permalink / raw) To: Erich Focht, Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel On Wednesday 09 October 2002 6:02 pm, Erich Focht wrote: > > > Starting migration thread for cpu 3 > > > Bringing up 4 > > > CPU>dividNOWrro! > > > > I got the same thing on 2.5.40-mm1. It looks like it may be a a divide > > by zero in calc_pool_load. I am attempting to boot a band-aid version > > right now. OK, got a little further: > > This opened my eyes, thanks for all your help and patience!!! > > The problem is that the load balancer is called before the CPU pools > were set up. That's fine, I thought, because I define in sched_init > the default pool 0 to include all CPUs. But: in find_busiest_queue() > the cpu_to_node(this_cpu) delivers a non-zero pool which is not set up > yet, therefore pool_nr_cpus[pool]=0 and we get a zero divide. > > I'm still wondering why this doesn't happen on our architecture. Maybe > the interrupts are disabled longer, I'll check. Anyway, a fix is to > force this_pool to be 0 as long as numpools=1. The attached patch is a > quick untested hack, maybe one can do it better. Has to be applied on top > of the other 2. Thanks very much Erich. I did come across another problem here on numa-q. In task_to_steal() there is a divide by cache_decay_ticks, which apparantly is 0 on my system. This may have to do with notsc, but I am not sure. I set cache_decay_ticks to 8, (I cannot boot without using notsc) which is probably not correct, but I can now boot 16 processor numa-q on 2.5.40-mm1 with your patches! I'll get some benchmark results soon. Andrew Theurer ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <200210110947.11714.efocht@ess.nec.de>]
* Re: [PATCH] pooling NUMA scheduler with initial load balancing [not found] ` <200210110947.11714.efocht@ess.nec.de> @ 2002-10-11 8:27 ` Erich Focht 2002-10-11 14:47 ` Martin J. Bligh 0 siblings, 1 reply; 24+ messages in thread From: Erich Focht @ 2002-10-11 8:27 UTC (permalink / raw) To: Andrew Theurer, Martin J. Bligh, Michael Hohnbaum Cc: Ingo Molnar, linux-kernel On Friday 11 October 2002 09:47, Erich Focht wrote: > Hi Andrew, > > On Thursday 10 October 2002 19:34, Andrew Theurer wrote: > > Thanks very much Erich. I did come across another problem here on > > numa-q. In task_to_steal() there is a divide by cache_decay_ticks, which > > apparantly is 0 on my system. This may have to do with notsc, but I am > > not sure. I set cache_decay_ticks to 8, (I cannot boot without using > > notsc) which is probably not correct, but I can now boot 16 processor > > numa-q on 2.5.40-mm1 with your patches! I'll get some benchmark results > > soon. > > oops... This is a bug in 2.5-i386. It means that the O(1) scheduler in > 2.5 doesn't work well either because it doesn't take into account cache > coolness. I'll post a fix to LKML in a few minutes. Sorry, I thought the smp_tune_scheduling() call went lost during the transition to the new cpu boot scheme. But it's there. And the problem is indeed "notsc". So you'll have to fix it, I can't. If you set the cache_decay_ticks to something non-zero, you should _really_ do this for all the scheduler tests, otherwise your measurements will not be comparable. Regards, Erich ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-11 8:27 ` Erich Focht @ 2002-10-11 14:47 ` Martin J. Bligh 2002-10-11 15:29 ` Erich Focht 0 siblings, 1 reply; 24+ messages in thread From: Martin J. Bligh @ 2002-10-11 14:47 UTC (permalink / raw) To: Erich Focht, Andrew Theurer, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel > Sorry, I thought the smp_tune_scheduling() call went lost during the > transition to the new cpu boot scheme. But it's there. And the problem > is indeed "notsc". So you'll have to fix it, I can't. Errrm ... not sure what you mean by this. notsc is a perfectly valid thing to do, so if your patch panics with that option, I suggest you make something conditional on notsc inside your patch? Works just fine without your patch, or with Michael's patch .... M. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-11 14:47 ` Martin J. Bligh @ 2002-10-11 15:29 ` Erich Focht 2002-10-11 15:34 ` Martin J. Bligh 0 siblings, 1 reply; 24+ messages in thread From: Erich Focht @ 2002-10-11 15:29 UTC (permalink / raw) To: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum Cc: Ingo Molnar, linux-kernel On Friday 11 October 2002 16:47, Martin J. Bligh wrote: > > Sorry, I thought the smp_tune_scheduling() call went lost during the > > transition to the new cpu boot scheme. But it's there. And the problem > > is indeed "notsc". So you'll have to fix it, I can't. > > Errrm ... not sure what you mean by this. notsc is a perfectly > valid thing to do, so if your patch panics with that option, I > suggest you make something conditional on notsc inside your patch? > Works just fine without your patch, or with Michael's patch .... Martin, arch/i386/kernel/smpboot.c:smp_tune_scheduling() says: if (!cpu_khz) { /* * this basically disables processor-affinity * scheduling on SMP without a TSC. */ cacheflush_time = 0; return; If you boot with notsc, you won't have cache affinity on your machine. Which means that the load_balancer eventually selects cache hot tasks for stealing. The O(1) scheduler doesn't do that under normal conditions! Of course I'll add something to my patch such that it doesn't crash if cache_decay_ticks is unset. But you might be measuring wrong things right now if you leave cache_decay_ticks=0 as then the cache-affinity on NUMAQ is switched off with the vanilla O(1) and with Michael's patch. I want to say: you cannot evaluate the impact of Michael's patches if you don't fix that. This issue is independent of my patches. Regards, Erich ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-11 15:29 ` Erich Focht @ 2002-10-11 15:34 ` Martin J. Bligh 0 siblings, 0 replies; 24+ messages in thread From: Martin J. Bligh @ 2002-10-11 15:34 UTC (permalink / raw) To: Erich Focht, Andrew Theurer, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel > arch/i386/kernel/smpboot.c:smp_tune_scheduling() says: > > if (!cpu_khz) { > /* > * this basically disables processor-affinity > * scheduling on SMP without a TSC. > */ > cacheflush_time = 0; > return; > > If you boot with notsc, you won't have cache affinity on your machine. > Which means that the load_balancer eventually selects cache hot tasks > for stealing. The O(1) scheduler doesn't do that under normal conditions! OK, that makes more sense ... I'll go stare at the code some more and see what can be done. > Of course I'll add something to my patch such that it doesn't crash > if cache_decay_ticks is unset. But you might be measuring wrong things > right now if you leave cache_decay_ticks=0 as then the cache-affinity > on NUMAQ is switched off with the vanilla O(1) and with Michael's patch. > I want to say: you cannot evaluate the impact of Michael's patches if > you don't fix that. This issue is independent of my patches. OK, I'll make sure to make the tests uniform somehow to get a fair comparison. Thanks! M. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-08 17:33 ` Erich Focht 2002-10-08 19:44 ` Martin J. Bligh @ 2002-10-09 1:15 ` Christoph Hellwig 2002-10-09 10:29 ` Erich Focht 1 sibling, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2002-10-09 1:15 UTC (permalink / raw) To: Erich Focht; +Cc: Martin J. Bligh, Michael Hohnbaum, Ingo Molnar, linux-kernel On Tue, Oct 08, 2002 at 07:33:06PM +0200, Erich Focht wrote: > Aaargh, you got the wrong second patch :-( Sorry for that... > > Thanks for the hints, I cleaned up the first patch, too. No > CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including > asm/numa.h from asm/topology.h, so no need for you to see it. diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002 +++ b/arch/i386/kernel/smpboot.c Tue Oct 8 11:37:56 2002 @@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu) void __init smp_cpus_done(unsigned int max_cpus) { zap_low_mappings(); +#ifdef CONFIG_NUMA + pooldata_lock(); + bld_pools(); + pooldata_unlock(); +#endif All callers of bld_pools() need the pooldata lock - taking it inside that function makes the code a little more readable.. Also I'd suggest to rename bld_pools() to build_pools() ;) - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */ Could you explain this change? And it's affect on non-NUMA IA64 machines? /** + * atomic_inc_return - increment atomic variable and return new value + * @v: pointer of type atomic_t + * + * Atomically increments @v by 1 and return it's new value. Note that + * the guaranteed useful range of an atomic_t is only 24 bits. + */ +static inline int atomic_inc_return(atomic_t *v){ + atomic_inc(v); + return v->counter; +} Who do you guarantee this is atomic? Please make it fit Documentation/CodyingStyle, btw.. +int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES]; +unsigned long pool_mask[MAX_NUMNODES]; Hmm, shouldn't those [MAX_NUMNODES] arrays be in some per-node array to avoid cacheline-bouncing? +void pooldata_lock(void) +{ + int i; + retry: + while (atomic_read(&pool_lock)); + if (atomic_inc_return(&pool_lock) > 1) { + atomic_dec(&pool_lock); + goto retry; + } Why not a simple spin_lock()? + /* + * Wait a while, any loops using pool data should finish + * in between. This is VERY ugly and should be replaced + * by some real RCU stuff. [EF] + */ + for (i=0; i<100; i++) + udelay(1000); Urgg. I'd suggest you switch to RCU now and make your patch apply ontop of it - another reason to apply the RCU core patch.. +void pooldata_unlock(void) +{ + atomic_dec(&pool_lock); +} Dito for spin_unlock. + /* avoid deadlock by timer interrupt on own cpu */ + if (atomic_read(&pool_lock)) return; spin_trylock.. All in all your code doesn't seem to be very cachelign-friendly, lots of global bouncing. Do you have any numbers on what your patch changes for normal SMP configurations? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] pooling NUMA scheduler with initial load balancing 2002-10-09 1:15 ` Christoph Hellwig @ 2002-10-09 10:29 ` Erich Focht 0 siblings, 0 replies; 24+ messages in thread From: Erich Focht @ 2002-10-09 10:29 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin J. Bligh, Michael Hohnbaum, Ingo Molnar, linux-kernel Hi Christoph, thanks for looking at this and the comments. Replies are below. On Wednesday 09 October 2002 03:15, Christoph Hellwig wrote: > On Tue, Oct 08, 2002 at 07:33:06PM +0200, Erich Focht wrote: > diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c > --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002 > +++ b/arch/i386/kernel/smpboot.c Tue Oct 8 11:37:56 2002 > @@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu) > void __init smp_cpus_done(unsigned int max_cpus) > { > zap_low_mappings(); > +#ifdef CONFIG_NUMA > + pooldata_lock(); > + bld_pools(); > + pooldata_unlock(); > +#endif > > All callers of bld_pools() need the pooldata lock - taking > it inside that function makes the code a little more readable.. > Also I'd suggest to rename bld_pools() to build_pools() ;) The separation is due to the needs of CPU hot-add. When integrating with Kimi's CPU hotplug patch within the Atlas project, we found that there was more to do in the scheduler blocked phase when adding one CPU to the running system (besides recomputing the pools). Looking too far into the future here... > - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth > estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and > cache-bandwidth estimate */ > > Could you explain this change? And it's affect on non-NUMA IA64 > machines? At some stage I looked up the minimal timeslice and found it to be 10. This means that when you have two tasks on a runqueue and zero on another, you might see both tasks as absolutely cache-hot and cannot steal any of them. The cache_decay_ticks is anyway a very rough approximation. With the task_to_steal() selection from the patch this change should have no significant impact because we make a ranking and select the cache coolest task which slept more than 8ms. The default O(1) behavior is to select the first task which slept more than 10ms, without the ranking. > /** > + * atomic_inc_return - increment atomic variable and return new value > + * @v: pointer of type atomic_t > + * > + * Atomically increments @v by 1 and return it's new value. Note that > + * the guaranteed useful range of an atomic_t is only 24 bits. > + */ > +static inline int atomic_inc_return(atomic_t *v){ > + atomic_inc(v); > + return v->counter; > +} > > Who do you guarantee this is atomic? Please make it fit > Documentation/CodyingStyle, btw.. This code comes from a port of the node affine scheduler done by somebody from IBM for the 2.4.18 version. On IA64 we have really atomic stuff for this operation. It's a quick hack for now. I not good in ia32 assembler and don't know how to replace it. > +int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], > pool_nr_cpus[MAX_NUMNODES]; +unsigned long pool_mask[MAX_NUMNODES]; > > Hmm, shouldn't those [MAX_NUMNODES] arrays be in some per-node array > to avoid cacheline-bouncing? They could be replicated but I don't think this is necessary. They populate cachelines on each CPU. As these arrays are never changed, there is no need for bouncing them. The cachelines on every CPU are only invalidated when some CPU changes these array, which just doesn't happen after boot (except you add a CPU, which you'll do rarely). > +void pooldata_lock(void) > +{ > + int i; > + retry: > + while (atomic_read(&pool_lock)); > + if (atomic_inc_return(&pool_lock) > 1) { > + atomic_dec(&pool_lock); > + goto retry; > + } > > Why not a simple spin_lock()? Actually we need something like write_lock and read_lock. But read_lock counts the readers, therefore changes the lock and we get cacheline-bouncing. A spinlock here would be ok. Then the readers would need to do something like "while (spin_is_locked(pool_lock));". This only reads the lock and (if nobody changes it) avoids cacheline bouncing. We still need to check that all readers went out of the section where the pool data is accessed (following ugly block)... > + /* > + * Wait a while, any loops using pool data should finish > + * in between. This is VERY ugly and should be replaced > + * by some real RCU stuff. [EF] > + */ > + for (i=0; i<100; i++) > + udelay(1000); > > Urgg. I'd suggest you switch to RCU now and make your patch apply > ontop of it - another reason to apply the RCU core patch.. I agree (see the comment). I didn't replace it by RCU because RCU is not in the mainline and it is easier to develop with an ugly but portable replacement. But add this to the list of potential users of RCU. > +void pooldata_unlock(void) > +{ > + atomic_dec(&pool_lock); > +} > > Dito for spin_unlock. > > + /* avoid deadlock by timer interrupt on own cpu */ > + if (atomic_read(&pool_lock)) return; > > spin_trylock.. No! No trylock here, that would change the lock and lead to cacheline bouncing! We don't need the lock, just make sure that the pooldata is not beeing changed right now. > All in all your code doesn't seem to be very cachelign-friendly, > lots of global bouncing. Do you have any numbers on what your > patch changes for normal SMP configurations? Once again: once the address of pool_lock is in the cache, I don't expect any bouncing. The value of that lock almost never changes (except in the boot phase and when you want to hot-add/remove a CPU), therefore there is no reason for cache-line bouncing. The same applies for the pool variables. As you see, I thought a lot about cacheline bouncing and tried to avoid it. I'll replace the pool_lock with a spin_lock, as the atomic_inc_return for ia32 isn't really atomic, you're absolutely right here. As the load balancing code is called pretty rarely (at most every 1ms, normally every 250ms per CPU), additional cycles invested here are ok if the benefit for NUMA is reasonable. If CONFIG_NUMA is undefined, the code should be optimized by the compiler such that it looks like normal SMP code. Still need some changes to get there, but that's not the big problem. Regards, Erich ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] NUMA schedulers benchmark results 2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht ` (2 preceding siblings ...) 2002-10-07 7:25 ` Martin J. Bligh @ 2002-10-07 16:37 ` Michael Hohnbaum 2002-10-07 20:35 ` Erich Focht 3 siblings, 1 reply; 24+ messages in thread From: Michael Hohnbaum @ 2002-10-07 16:37 UTC (permalink / raw) To: Erich Focht; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel Erich, These numbers look pretty good for your scheduler. It would be good if you could post numbers against 2.5.x where x is the same for all tests. I thought you had your scheduler working on 2.5.x recently. Running numa_test on my systems I find a lot of variation (as much as 10%) between identical runs. Are all of these results averaged over multiple runs? Any idea why the hackbench times vary so widely? Looking at the results for my scheduler, it is real obvious what the algorithm is that I use for selecting a cpu for a process. This has not been quite as apparent on my systems, probably due to other tasks running in the background. The code in your scheduler that causes a different node to start the search for a cpu really pays off here. I know that there are some balance issues, especially on light loads for my scheduler. I've got two tweaks that help that, but I think that adding something similar to what you have done to keep track of the last node picked and starting at the next node could be beneficial. Michael On Sun, 2002-10-06 at 09:51, Erich Focht wrote: > Hi, > > here are some results from testing various versions and approaches > to a NUMA scheduler. I used the numa_test benchmark which I'll post > in a separate email. It runs in parallel N tasks doing the same job: > access randomly a large array. As the array is large enough not to > fit into cache, this is very memory latency sensitive. Also it is > memory bandwidth sensitive. To emulate a real multi-user environment, the > jobs are disturbed by a short load peak. This is simulated by a call > to "hackbench" 3 seconds after the tasks were started. The performance > of the user tasks is depending very much on where they are scheduled > and are CPU hoggers such that the user times are quite independent of > the non-scheduler part of the underlying kernel. The elapsed times > are depending on "hackbench" which actually blocks the machine for the > time it is running. Hackbench is depending on the underlying kernel > and one should compare "elapsed_time - hackbench_time". > > The test machine is a 16 CPU NEC Azusa with Itanium processors and > four nodes. The tested schedulers are: > > A: O(1) scheduler in 2.5.39 > B: O(1) scheduler with task steal limited to only one task (node > affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18 > C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39 > D: pooling NUMA scheduler, no initial load balancing, idle pool_delay > set to 0, under 2.4.18 > E: node affine scheduler with initial load balancing and static homenode > F: node affine scheduler without initial load balancing and dynamic > homenode selection (homenode selected where most of the memory is > allocated). > > As I'm rewriting the node affine scheduler to be more modular, I'll > redo the tests for cases D, E, F on top of 2.5.X kernels soon. > > The results are summarized in the tables below. A set of outputs (for > N=8, 16, 32) is attached. They show clearly why the node affine > scheduler beats them all: The initial load balancing is node-aware, > the task-stealing too. Sometimes the node affine force is not large > enough to bring the task back to the homenode, but it is almost always > good enough. > > The (F) solution with dynamically determined homenode show that the > initial load balancing is crucial, as the equal node balance is not > strongly enforced dynamically. So the optimal solution is probably > (F) with initial load balancing. > > > Average user time (U) and total user time (TU). Minimum per row should > be considered. > ---------------------------------------------------------------------- > Scheduler: A B C D E F > N=4 U 28.12 30.77 33.00 - 27.20 30.29 > TU 112.55 123.13 132.08 - 108.88 121.25 > N=8 U 30.47 31.39 31.65 30.76 28.67 30.08 > TU 243.86 251.27 253.30 246.23 229.51 240.75 > N=16 U 36.42 33.64 32.18 32.27 31.50 32.83 > TU 582.91 538.49 515.11 516.53 504.17 525.59 > N=32 U 38.69 34.83 34.05 33.76 33.89 34.11 > TU 1238.4 1114.9 1090.1 1080.8 1084.9 1091.9 > N=64 U 39.73 34.73 34.23 - (33.32) 34.98 > TU 2543.4 2223.4 2191.7 - (2133) 2239.5 > > > Elapsed time (E), hackbench time (H). Diferences between 2.4.18 and > 2.5.39 based kernels due to "hackbench", which performs differently. > Compare E-H within a row, but don't take it too seriously. > -------------------------------------------------------------------- > Scheduler: A B C D E F > N=4 E 37.33 37.96 48.31 - 28.14 35.91 > H 9.98 1.49 10.65 - 1.99 1.43 > N=8 E 46.17 39.50 42.53 39.72 30.28 38.28 > H 9.64 1.86 7.27 2.07 2.33 1.86 > N=16 E 47.21 44.67 49.66 42.97 36.98 42.51 > H 5.90 4.69 2.93 5.178 5.56 5.94 > N=32 E 88.60 79.92 80.34 78.35 76.84 77.38 > H 6.29 5.23 2.85 4.51 5.29 4.28 > N=64 E 167.10 147.16 150.59 - (133.9) 148.94 > H 5.96 4.67 3.10 - (-) 6.86 > > (The E:N=64 results are without hackbench disturbance.) > > Best regards, > Erich > ---- > > A: 2.5.39 O(1) scheduler > > Running 'hackbench 20' in the background: Time: 9.639 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 7.5 92.5 0.0 0.0 | 0 1 *| 38.29 > 2 0.0 0.0 0.0 100.0 | 0 3 *| 30.19 > 3 0.0 100.0 0.0 0.0 | 0 1 *| 27.84 > 4 0.0 0.0 0.0 100.0 | 0 3 *| 30.38 > 5 100.0 0.0 0.0 0.0 | 0 0 | 29.96 > 6 100.0 0.0 0.0 0.0 | 0 0 | 29.63 > 7 0.0 0.0 100.0 0.0 | 0 2 *| 27.33 > 8 0.0 0.0 0.0 100.0 | 0 3 *| 30.14 > AverageUserTime 30.47 seconds > ElapsedTime 46.17 > TotalUserTime 243.86 > TotalSysTime 0.41 > > Running 'hackbench 20' in the background: Time: 5.896 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 98.7 1.3 0.0 | 0 1 *| 41.07 > 2 89.2 10.8 0.0 0.0 | 0 0 | 39.16 > 3 9.4 0.0 90.6 0.0 | 0 2 *| 40.72 > 4 0.0 0.0 0.0 100.0 | 0 3 *| 31.96 > 5 0.0 0.1 99.0 0.9 | 0 2 *| 41.36 > 6 0.0 0.0 100.0 0.0 | 0 2 *| 30.62 > 7 0.6 0.0 86.2 13.2 | 0 2 *| 40.11 > 8 100.0 0.0 0.0 0.0 | 0 0 | 31.25 > 9 8.0 0.0 0.0 92.0 | 0 3 *| 40.63 > 10 0.0 100.0 0.0 0.0 | 0 1 *| 30.43 > 11 0.0 0.0 0.0 100.0 | 0 3 *| 31.64 > 12 91.6 0.0 8.4 0.0 | 0 0 | 39.92 > 13 0.0 100.0 0.0 0.0 | 0 1 *| 30.46 > 14 92.1 0.0 7.9 0.0 | 0 0 | 40.36 > 15 0.0 0.0 0.0 100.0 | 0 3 *| 32.29 > 16 4.8 95.2 0.0 0.0 | 0 1 *| 40.70 > AverageUserTime 36.42 seconds > ElapsedTime 47.21 > TotalUserTime 582.91 > TotalSysTime 0.85 > > Running 'hackbench 20' in the background: Time: 6.293 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 100.0 0.0 | 0 2 *| 31.53 > 2 100.0 0.0 0.0 0.0 | 0 0 | 34.84 > 3 100.0 0.0 0.0 0.0 | 0 0 | 32.70 > 4 0.0 100.0 0.0 0.0 | 0 1 *| 32.25 > 5 0.0 0.0 0.0 100.0 | 0 3 *| 33.41 > 6 96.5 2.7 0.8 0.0 | 0 0 | 43.39 > 7 5.6 0.0 94.4 0.0 | 0 2 *| 44.42 > 8 73.4 22.9 3.7 0.0 | 0 0 | 42.14 > 9 0.9 0.0 99.1 0.0 | 0 2 *| 45.10 > 10 0.0 8.4 91.6 0.0 | 0 2 *| 42.85 > 11 0.0 0.0 0.0 100.0 | 0 3 *| 32.11 > 12 0.0 100.0 0.0 0.0 | 0 1 *| 32.16 > 13 0.0 0.0 0.0 100.0 | 0 3 *| 33.48 > 14 0.0 0.0 100.0 0.0 | 0 2 *| 31.35 > 15 0.0 0.3 3.2 96.4 | 0 3 *| 42.27 > 16 91.0 0.0 0.0 9.0 | 0 0 | 35.50 > 17 0.0 100.0 0.0 0.0 | 0 1 *| 32.45 > 18 0.0 97.6 2.4 0.0 | 0 1 *| 41.97 > 19 0.0 100.0 0.0 0.0 | 0 1 *| 32.46 > 20 1.3 0.0 74.2 24.5 | 0 2 *| 43.78 > 21 0.0 0.0 0.9 99.1 | 0 3 *| 32.74 > 22 0.8 0.0 99.2 0.0 | 0 2 *| 45.61 > 23 100.0 0.0 0.0 0.0 | 0 0 | 33.78 > 24 97.0 1.0 0.0 1.9 | 0 0 | 43.74 > 25 0.0 1.8 0.0 98.2 | 0 3 *| 43.31 > 26 0.0 96.1 0.0 3.9 | 0 1 *| 43.61 > 27 2.6 0.0 0.3 97.0 | 0 3 *| 46.54 > 28 0.0 0.0 0.0 100.0 | 0 3 *| 33.59 > 29 0.0 92.1 0.0 7.9 | 0 1 *| 43.43 > 30 2.2 0.0 97.8 0.0 | 0 2 *| 45.19 > 31 26.0 72.3 0.0 1.8 | 0 1 *| 43.29 > 32 98.5 1.3 0.2 0.0 | 0 0 | 42.97 > AverageUserTime 38.69 seconds > ElapsedTime 88.60 > TotalUserTime 1238.43 > TotalSysTime 1.85 > > ------------------------------------------------------------- > B: O(1) scheduler equivalent, 1 task steal per load balance step > (2.4.18 node affine scheduler with CONFIG_NUMA_SCHED=n) > > Running 'hackbench 20' in the background: Time: 1.861 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 100.0 0.0 | 2 2 | 31.62 > 2 0.0 0.0 100.0 0.0 | 2 2 | 31.68 > 3 100.0 0.0 0.0 0.0 | 0 0 | 29.40 > 4 0.0 0.0 100.0 0.0 | 2 2 | 31.56 > 5 0.0 0.0 100.0 0.0 | 2 2 | 31.60 > 6 95.0 5.0 0.0 0.0 | 1 0 *| 38.14 > 7 100.0 0.0 0.0 0.0 | 0 0 | 29.27 > 8 0.0 100.0 0.0 0.0 | 1 1 | 27.87 > AverageUserTime 31.39 seconds > ElapsedTime 39.50 > TotalUserTime 251.27 > TotalSysTime 0.48 > > Running 'hackbench 20' in the background: Time: 4.688 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 100.0 0.0 | 2 2 | 31.49 > 2 0.0 0.0 100.0 0.0 | 2 2 | 31.33 > 3 0.0 0.0 100.0 0.0 | 2 2 | 31.30 > 4 100.0 0.0 0.0 0.0 | 0 0 | 31.98 > 5 100.0 0.0 0.0 0.0 | 0 0 | 32.21 > 6 0.0 91.3 8.7 0.0 | 2 1 *| 39.88 > 7 0.0 100.0 0.0 0.0 | 1 1 | 30.72 > 8 0.0 0.0 0.0 100.0 | 3 3 | 31.56 > 9 1.2 98.8 0.0 0.0 | 0 1 *| 41.31 > 10 0.0 0.0 0.0 100.0 | 3 3 | 31.56 > 11 0.0 100.0 0.0 0.0 | 1 1 | 30.67 > 12 0.0 0.0 0.0 100.0 | 3 3 | 31.45 > 13 100.0 0.0 0.0 0.0 | 0 0 | 31.94 > 14 5.5 0.0 94.5 0.0 | 0 2 *| 40.83 > 15 0.0 0.0 0.0 100.0 | 3 3 | 31.53 > 16 83.0 17.0 0.0 0.0 | 1 0 *| 38.50 > AverageUserTime 33.64 seconds > ElapsedTime 44.67 > TotalUserTime 538.49 > TotalSysTime 1.02 > > Running 'hackbench 20' in the background: Time: 5.229 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 67.6 0.0 17.4 15.0 | 2 0 *| 41.36 > 2 0.0 0.0 100.0 0.0 | 2 2 | 32.78 > 3 100.0 0.0 0.0 0.0 | 0 0 | 32.92 > 4 0.0 0.0 100.0 0.0 | 2 2 | 33.56 > 5 22.4 75.3 2.3 0.0 | 2 1 *| 43.26 > 6 0.0 0.0 100.0 0.0 | 2 2 | 33.38 > 7 100.0 0.0 0.0 0.0 | 0 0 | 33.16 > 8 0.0 0.0 0.0 100.0 | 3 3 | 32.03 > 9 0.0 100.0 0.0 0.0 | 1 1 | 32.64 > 10 0.0 0.0 0.0 100.0 | 3 3 | 32.00 > 11 4.5 0.0 0.0 95.5 | 0 3 *| 43.82 > 12 0.0 0.0 0.0 100.0 | 3 3 | 31.91 > 13 100.0 0.0 0.0 0.0 | 0 0 | 33.46 > 14 0.0 100.0 0.0 0.0 | 1 1 | 32.39 > 15 0.0 0.0 0.0 100.0 | 3 3 | 31.86 > 16 0.0 0.0 100.0 0.0 | 2 2 | 33.35 > 17 100.0 0.0 0.0 0.0 | 0 0 | 33.95 > 18 0.0 0.0 0.0 100.0 | 3 3 | 31.68 > 19 0.0 5.3 0.0 94.7 | 1 3 *| 42.97 > 20 2.0 98.0 0.0 0.0 | 0 1 *| 43.56 > 21 0.0 0.0 0.0 100.0 | 3 3 | 31.65 > 22 100.0 0.0 0.0 0.0 | 0 0 | 34.13 > 23 0.0 100.0 0.0 0.0 | 1 1 | 32.52 > 24 9.9 90.1 0.0 0.0 | 1 1 | 33.58 > 25 0.0 3.2 96.8 0.0 | 1 2 *| 42.51 > 26 89.8 0.0 0.0 10.2 | 0 0 | 34.18 > 27 0.0 100.0 0.0 0.0 | 1 1 | 32.44 > 28 0.0 2.4 97.6 0.0 | 2 2 | 33.59 > 29 8.4 0.0 91.6 0.0 | 2 2 | 33.78 > 30 100.0 0.0 0.0 0.0 | 0 0 | 33.52 > 31 0.0 6.3 93.7 0.0 | 2 2 | 33.37 > 32 0.0 100.0 0.0 0.0 | 1 1 | 33.20 > AverageUserTime 34.83 seconds > ElapsedTime 79.92 > TotalUserTime 1114.91 > TotalSysTime 2.30 > > ----------------------------------------------------------- > C: simple scheduler MH > > Running 'hackbench 20' in the background: Time: 7.271 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 100.0 0.0 0.0 0.0 | 0 0 | 32.66 > 2 100.0 0.0 0.0 0.0 | 0 0 | 32.53 > 3 100.0 0.0 0.0 0.0 | 0 0 | 32.47 > 4 0.0 100.0 0.0 0.0 | 1 1 | 30.33 > 5 0.0 100.0 0.0 0.0 | 1 1 | 30.20 > 6 0.0 100.0 0.0 0.0 | 1 1 | 30.06 > 7 100.0 0.0 0.0 0.0 | 0 0 | 32.35 > 8 100.0 0.0 0.0 0.0 | 0 0 | 32.58 > AverageUserTime 31.65 seconds > ElapsedTime 42.53 > TotalUserTime 253.30 > TotalSysTime 0.48 > > Running 'hackbench 20' in the background: Time: 2.929 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 100.0 0.0 0.0 0.0 | 0 0 | 31.07 > 2 100.0 0.0 0.0 0.0 | 0 0 | 30.97 > 3 100.0 0.0 0.0 0.0 | 0 0 | 30.86 > 4 0.0 100.0 0.0 0.0 | 1 1 | 31.91 > 5 0.0 100.0 0.0 0.0 | 1 1 | 32.04 > 6 0.0 100.0 0.0 0.0 | 1 1 | 31.98 > 7 0.0 100.0 0.0 0.0 | 1 1 | 32.01 > 8 0.0 0.0 100.0 0.0 | 2 2 | 31.83 > 9 0.0 0.0 100.0 0.0 | 2 2 | 31.66 > 10 0.0 0.0 100.0 0.0 | 2 2 | 31.47 > 11 0.0 0.0 100.0 0.0 | 2 2 | 31.74 > 12 0.0 0.0 0.0 100.0 | 3 3 | 31.54 > 13 0.0 0.0 0.0 100.0 | 3 3 | 31.48 > 14 0.0 0.0 0.0 100.0 | 3 3 | 31.49 > 15 0.0 0.0 0.0 100.0 | 3 3 | 31.97 > 16 5.4 0.0 0.0 94.6 | 0 3 *| 40.86 > AverageUserTime 32.18 seconds > ElapsedTime 49.66 > TotalUserTime 515.11 > TotalSysTime 0.95 > > Running 'hackbench 20' in the background: Time: 2.849 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 100.0 0.0 0.0 0.0 | 0 0 | 33.75 > 2 100.0 0.0 0.0 0.0 | 0 0 | 33.89 > 3 100.0 0.0 0.0 0.0 | 0 0 | 33.87 > 4 0.0 100.0 0.0 0.0 | 1 1 | 33.34 > 5 0.0 100.0 0.0 0.0 | 1 1 | 33.41 > 6 0.0 100.0 0.0 0.0 | 1 1 | 33.16 > 7 0.0 100.0 0.0 0.0 | 1 1 | 33.25 > 8 0.0 0.0 100.0 0.0 | 2 2 | 32.90 > 9 0.0 0.0 100.0 0.0 | 2 2 | 33.08 > 10 0.0 0.0 100.0 0.0 | 2 2 | 33.07 > 11 0.0 0.0 100.0 0.0 | 2 2 | 33.11 > 12 0.0 0.0 0.0 100.0 | 3 3 | 31.46 > 13 0.0 0.0 0.0 100.0 | 3 3 | 32.04 > 14 0.0 0.0 0.0 100.0 | 3 3 | 31.59 > 15 0.0 0.0 0.0 100.0 | 3 3 | 31.83 > 16 100.0 0.0 0.0 0.0 | 0 0 | 33.95 > 17 1.2 0.0 0.0 98.8 | 0 3 *| 44.66 > 18 100.0 0.0 0.0 0.0 | 0 0 | 34.02 > 19 55.4 0.0 0.0 44.6 | 3 0 | 37.11 > 20 100.0 0.0 0.0 0.0 | 0 0 | 33.62 > 21 0.0 100.0 0.0 0.0 | 1 1 | 33.33 > 22 0.0 100.0 0.0 0.0 | 1 1 | 33.26 > 23 0.0 100.0 0.0 0.0 | 1 1 | 33.41 > 24 0.0 100.0 0.0 0.0 | 1 1 | 33.14 > 25 0.0 0.0 100.0 0.0 | 2 2 | 33.26 > 26 0.0 0.0 100.0 0.0 | 2 2 | 33.05 > 27 0.0 0.0 3.1 96.9 | 2 3 *| 43.50 > 28 100.0 0.0 0.0 0.0 | 0 0 | 34.02 > 29 60.4 0.0 39.6 0.0 | 2 0 *| 37.05 > 30 0.0 0.0 0.0 100.0 | 3 3 | 31.53 > 31 100.0 0.0 0.0 0.0 | 0 0 | 33.94 > 32 0.0 0.0 100.0 0.0 | 2 2 | 33.03 > AverageUserTime 34.05 seconds > ElapsedTime 80.34 > TotalUserTime 1090.10 > TotalSysTime 2.15 > > --------------------------------------------------------------- > D: numasched : pooling only (idle pool delay set to 0) > > Running 'hackbench 20' in the background: Time: 2.066 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 0.0 100.0 | 3 3 | 28.50 > 2 100.0 0.0 0.0 0.0 | 0 0 | 30.97 > 3 0.0 0.0 0.0 100.0 | 3 3 | 28.49 > 4 0.0 0.0 100.0 0.0 | 2 2 | 28.65 > 5 100.0 0.0 0.0 0.0 | 0 0 | 31.18 > 6 100.0 0.0 0.0 0.0 | 0 0 | 30.96 > 7 5.7 94.3 0.0 0.0 | 0 1 *| 38.69 > 8 0.0 0.0 100.0 0.0 | 2 2 | 28.65 > AverageUserTime 30.76 seconds > ElapsedTime 39.72 > TotalUserTime 246.23 > TotalSysTime 0.46 > > Running 'hackbench 20' in the background: Time: 5.178 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 100.0 0.0 0.0 0.0 | 0 0 | 32.01 > 2 100.0 0.0 0.0 0.0 | 0 0 | 31.82 > 3 0.0 3.4 0.0 96.6 | 1 3 *| 41.01 > 4 100.0 0.0 0.0 0.0 | 0 0 | 32.24 > 5 100.0 0.0 0.0 0.0 | 0 0 | 31.91 > 6 0.0 0.0 100.0 0.0 | 2 2 | 31.75 > 7 0.0 0.0 0.0 100.0 | 3 3 | 30.55 > 8 0.0 0.0 100.0 0.0 | 2 2 | 31.74 > 9 0.0 0.0 100.0 0.0 | 2 2 | 31.74 > 10 0.0 0.0 0.0 100.0 | 3 3 | 30.63 > 11 0.0 0.0 0.0 100.0 | 3 3 | 30.67 > 12 0.0 0.0 100.0 0.0 | 2 2 | 31.75 > 13 0.0 100.0 0.0 0.0 | 1 1 | 31.94 > 14 0.0 100.0 0.0 0.0 | 1 1 | 32.20 > 15 0.0 100.0 0.0 0.0 | 1 1 | 32.24 > 16 0.0 100.0 0.0 0.0 | 1 1 | 32.07 > AverageUserTime 32.27 seconds > ElapsedTime 42.97 > TotalUserTime 516.53 > TotalSysTime 1.16 > > Running 'hackbench 20' in the background: Time: 4.505 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 49.4 0.0 50.6 | 3 3 | 36.48 > 2 0.3 99.7 0.0 0.0 | 0 1 *| 39.21 > 3 0.0 100.0 0.0 0.0 | 1 1 | 32.23 > 4 100.0 0.0 0.0 0.0 | 0 0 | 32.88 > 5 100.0 0.0 0.0 0.0 | 0 0 | 32.98 > 6 100.0 0.0 0.0 0.0 | 0 0 | 32.97 > 7 0.0 18.3 81.7 0.0 | 2 2 | 35.13 > 8 0.0 5.7 94.3 0.0 | 2 2 | 34.00 > 9 0.0 0.0 100.0 0.0 | 2 2 | 33.14 > 10 0.0 3.3 96.7 0.0 | 2 2 | 34.22 > 11 0.0 100.0 0.0 0.0 | 1 1 | 32.34 > 12 0.0 100.0 0.0 0.0 | 1 1 | 32.37 > 13 0.0 13.4 0.0 86.6 | 3 3 | 34.55 > 14 0.0 0.0 0.0 100.0 | 3 3 | 33.93 > 15 0.0 0.0 0.0 100.0 | 3 3 | 33.08 > 16 100.0 0.0 0.0 0.0 | 0 0 | 33.31 > 17 100.0 0.0 0.0 0.0 | 0 0 | 33.08 > 18 0.0 8.2 0.0 91.8 | 3 3 | 33.76 > 19 0.0 0.0 0.0 100.0 | 3 3 | 33.23 > 20 0.0 32.7 0.0 67.3 | 3 3 | 34.79 > 21 0.0 0.0 0.0 100.0 | 3 3 | 32.94 > 22 100.0 0.0 0.0 0.0 | 0 0 | 33.23 > 23 0.0 100.0 0.0 0.0 | 1 1 | 32.35 > 24 0.0 0.0 100.0 0.0 | 2 2 | 33.44 > 25 0.0 100.0 0.0 0.0 | 1 1 | 32.42 > 26 0.0 0.0 100.0 0.0 | 2 2 | 33.25 > 27 0.0 0.0 100.0 0.0 | 2 2 | 33.82 > 28 0.0 0.0 100.0 0.0 | 2 2 | 32.88 > 29 0.0 0.0 0.0 100.0 | 3 3 | 31.99 > 30 0.0 100.0 0.0 0.0 | 1 1 | 31.90 > 31 100.0 0.0 0.0 0.0 | 0 0 | 32.93 > 32 99.6 0.0 0.4 0.0 | 2 0 *| 41.48 > AverageUserTime 33.76 seconds > ElapsedTime 78.35 > TotalUserTime 1080.75 > TotalSysTime 2.03 > > ----------------------------------------------------------- > > E: node affine scheduler with initial load balancing, statically > assigned homenode > > Running 'hackbench 20' in the background: Time: 2.330 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 100.0 0.0 | 2 2 | 28.86 > 2 0.0 100.0 0.0 0.0 | 1 1 | 28.82 > 3 0.0 100.0 0.0 0.0 | 1 1 | 28.85 > 4 0.0 0.0 0.0 100.0 | 3 3 | 28.47 > 5 100.0 0.0 0.0 0.0 | 0 0 | 28.69 > 6 0.0 0.0 100.0 0.0 | 2 2 | 28.65 > 7 0.0 0.0 0.0 100.0 | 3 3 | 28.49 > 8 100.0 0.0 0.0 0.0 | 0 0 | 28.55 > AverageUserTime 28.67 seconds > ElapsedTime 30.28 > TotalUserTime 229.51 > TotalSysTime 0.43 > > > Running 'hackbench 20' in the background: Time: 5.562 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 100.0 0.0 0.0 | 1 1 | 31.47 > 2 0.0 100.0 0.0 0.0 | 1 1 | 31.38 > 3 0.0 0.0 0.0 100.0 | 3 3 | 31.53 > 4 0.0 0.0 0.0 100.0 | 3 3 | 31.46 > 5 100.0 0.0 0.0 0.0 | 0 0 | 31.64 > 6 100.0 0.0 0.0 0.0 | 0 0 | 31.57 > 7 100.0 0.0 0.0 0.0 | 0 0 | 31.57 > 8 0.0 0.0 0.0 100.0 | 3 3 | 31.51 > 9 0.0 0.0 100.0 0.0 | 2 2 | 31.63 > 10 0.0 0.0 100.0 0.0 | 2 2 | 31.56 > 11 0.0 0.0 100.0 0.0 | 2 2 | 31.42 > 12 0.0 100.0 0.0 0.0 | 1 1 | 31.44 > 13 0.0 0.0 0.0 100.0 | 3 3 | 31.43 > 14 0.0 0.0 100.0 0.0 | 2 2 | 31.61 > 15 100.0 0.0 0.0 0.0 | 0 0 | 31.34 > 16 0.0 100.0 0.0 0.0 | 1 1 | 31.36 > AverageUserTime 31.50 seconds > ElapsedTime 36.98 > TotalUserTime 504.17 > TotalSysTime 1.03 > > Running 'hackbench 20' in the background: Time: 5.288 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 46.1 53.9 | 2 3 *| 37.11 > 2 0.0 0.0 0.0 100.0 | 3 3 | 31.32 > 3 0.0 100.0 0.0 0.0 | 1 1 | 33.15 > 4 0.0 4.9 95.1 0.0 | 2 2 | 33.55 > 5 0.0 0.0 100.0 0.0 | 2 2 | 32.81 > 6 0.0 100.0 0.0 0.0 | 1 1 | 32.88 > 7 0.0 100.0 0.0 0.0 | 1 1 | 33.31 > 8 82.9 17.1 0.0 0.0 | 1 0 *| 40.70 > 9 100.0 0.0 0.0 0.0 | 0 0 | 33.36 > 10 9.3 0.0 0.0 90.7 | 0 3 *| 42.12 > 11 100.0 0.0 0.0 0.0 | 0 0 | 33.38 > 12 0.0 0.0 0.0 100.0 | 3 3 | 31.28 > 13 0.0 0.0 0.0 100.0 | 3 3 | 31.43 > 14 0.0 100.0 0.0 0.0 | 1 1 | 33.33 > 15 0.0 100.0 0.0 0.0 | 1 1 | 33.04 > 16 100.0 0.0 0.0 0.0 | 0 0 | 33.02 > 17 95.2 4.8 0.0 0.0 | 0 0 | 33.82 > 18 3.8 0.0 0.0 96.2 | 0 3 *| 43.21 > 19 0.0 0.0 0.0 100.0 | 3 3 | 31.11 > 20 0.0 0.0 100.0 0.0 | 2 2 | 32.19 > 21 0.0 100.0 0.0 0.0 | 1 1 | 33.43 > 22 0.0 0.0 100.0 0.0 | 2 2 | 33.03 > 23 0.0 0.0 100.0 0.0 | 2 2 | 33.06 > 24 100.0 0.0 0.0 0.0 | 0 0 | 33.04 > 25 100.0 0.0 0.0 0.0 | 0 0 | 33.07 > 26 0.0 0.0 100.0 0.0 | 2 2 | 33.07 > 27 0.0 0.0 100.0 0.0 | 2 2 | 33.22 > 28 0.0 0.0 0.0 100.0 | 3 3 | 31.06 > 29 0.0 15.3 84.7 0.0 | 2 2 | 33.25 > 30 0.0 100.0 0.0 0.0 | 1 1 | 33.36 > 31 85.0 0.0 0.0 15.0 | 0 0 | 34.50 > 32 0.0 86.4 0.0 13.6 | 1 1 | 34.27 > AverageUserTime 33.89 seconds > ElapsedTime 76.84 > TotalUserTime 1084.86 > TotalSysTime 2.12 > > -------------------------------------------------------- > > F: node affine scheduler, dynamically assigned homenode, no > initial load balancing > > Running 'hackbench 20' in the background: Time: 1.861 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 0.0 100.0 | 3 3 | 30.75 > 2 100.0 0.0 0.0 0.0 | 0 0 | 27.12 > 3 0.0 0.0 0.0 100.0 | 3 3 | 30.60 > 4 0.0 0.0 0.0 100.0 | 3 3 | 30.65 > 5 0.0 100.0 0.0 0.0 | 1 1 | 28.65 > 6 0.0 0.0 78.4 21.6 | 3 2 *| 36.63 > 7 0.0 100.0 0.0 0.0 | 1 1 | 28.57 > 8 0.0 0.0 100.0 0.0 | 2 2 | 27.65 > AverageUserTime 30.08 seconds > ElapsedTime 38.28 > TotalUserTime 240.75 > TotalSysTime 0.49 > > Running 'hackbench 20' in the background: Time: 5.936 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 0.0 100.0 | 3 3 | 31.47 > 2 100.0 0.0 0.0 0.0 | 0 0 | 29.90 > 3 88.2 0.0 4.6 7.2 | 2 0 *| 40.96 > 4 0.0 0.0 0.0 100.0 | 3 3 | 31.53 > 5 0.0 0.0 0.0 100.0 | 3 3 | 31.50 > 6 0.0 100.0 0.0 0.0 | 1 1 | 32.25 > 7 0.0 84.5 0.0 15.5 | 1 1 | 33.41 > 8 0.0 0.0 100.0 0.0 | 2 2 | 32.38 > 9 83.0 17.0 0.0 0.0 | 1 0 *| 39.29 > 10 0.0 0.0 0.0 100.0 | 3 3 | 31.51 > 11 0.0 100.0 0.0 0.0 | 1 1 | 32.38 > 12 0.0 0.0 100.0 0.0 | 2 2 | 32.30 > 13 0.0 100.0 0.0 0.0 | 1 1 | 32.06 > 14 100.0 0.0 0.0 0.0 | 0 0 | 29.85 > 15 0.0 0.0 100.0 0.0 | 2 2 | 32.18 > 16 0.0 0.0 100.0 0.0 | 2 2 | 32.39 > AverageUserTime 32.83 seconds > ElapsedTime 42.51 > TotalUserTime 525.59 > TotalSysTime 1.14 > > Running 'hackbench 20' in the background: Time: 4.277 > Job node00 node01 node02 node03 | iSched MSched | UserTime(s) > 1 0.0 0.0 0.0 100.0 | 3 3 | 31.69 > 2 0.0 0.0 100.0 0.0 | 2 2 | 33.81 > 3 100.0 0.0 0.0 0.0 | 0 0 | 33.10 > 4 1.4 4.9 2.8 90.9 | 2 3 *| 44.26 > 5 0.0 0.0 0.0 100.0 | 3 3 | 31.76 > 6 0.0 0.0 0.0 100.0 | 3 3 | 31.69 > 7 0.0 0.0 3.6 96.4 | 2 3 *| 43.73 > 8 0.0 100.0 0.0 0.0 | 1 1 | 33.07 > 9 0.0 0.0 100.0 0.0 | 2 2 | 34.04 > 10 0.0 100.0 0.0 0.0 | 1 1 | 33.24 > 11 0.0 100.0 0.0 0.0 | 1 1 | 33.23 > 12 100.0 0.0 0.0 0.0 | 0 0 | 33.77 > 13 0.0 100.0 0.0 0.0 | 1 1 | 33.10 > 14 100.0 0.0 0.0 0.0 | 0 0 | 33.26 > 15 0.0 100.0 0.0 0.0 | 1 1 | 33.16 > 16 100.0 0.0 0.0 0.0 | 0 0 | 33.23 > 17 3.9 0.0 0.0 96.1 | 0 3 *| 42.74 > 18 0.0 100.0 0.0 0.0 | 1 1 | 32.97 > 19 100.0 0.0 0.0 0.0 | 0 0 | 33.21 > 20 0.0 0.0 92.1 7.9 | 2 2 | 34.30 > 21 0.0 0.0 95.9 4.1 | 2 2 | 34.31 > 22 0.0 0.0 100.0 0.0 | 2 2 | 33.88 > 23 0.0 0.0 0.0 100.0 | 3 3 | 30.85 > 24 0.0 100.0 0.0 0.0 | 1 1 | 32.99 > 25 0.0 0.0 100.0 0.0 | 2 2 | 33.84 > 26 0.0 100.0 0.0 0.0 | 1 1 | 33.35 > 27 100.0 0.0 0.0 0.0 | 0 0 | 33.29 > 28 0.0 0.0 100.0 0.0 | 2 2 | 33.79 > 29 100.0 0.0 0.0 0.0 | 0 0 | 33.25 > 30 100.0 0.0 0.0 0.0 | 0 0 | 33.41 > 31 0.0 0.0 100.0 0.0 | 2 2 | 33.92 > 32 0.0 0.0 0.0 100.0 | 3 3 | 31.16 > AverageUserTime 34.11 seconds > ElapsedTime 77.38 > TotalUserTime 1091.86 > TotalSysTime 2.14 > -- Michael Hohnbaum 503-578-5486 hohnbaum@us.ibm.com T/L 775-5486 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] NUMA schedulers benchmark results 2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum @ 2002-10-07 20:35 ` Erich Focht 0 siblings, 0 replies; 24+ messages in thread From: Erich Focht @ 2002-10-07 20:35 UTC (permalink / raw) To: Michael Hohnbaum; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel Hi Michael, On Monday 07 October 2002 18:37, Michael Hohnbaum wrote: > These numbers look pretty good for your scheduler. It would be good > if you could post numbers against 2.5.x where x is the same for all > tests. I thought you had your scheduler working on 2.5.x recently. I had it running under 2.5.35, but some of the important peripherals of the machine didn't work with that and I hate doing experiments when I only have the serial console. I'll get the node affine stuff finished tomorrow and will try to find a testing slot (that's another problem :-( ) > Running numa_test on my systems I find a lot of variation (as much > as 10%) between identical runs. Are all of these results averaged > over multiple runs? Any idea why the hackbench times vary so widely? No, they are not averaged. I often made 2-3 runs and they usually didn't differ significantly (within 2-3% of average user time), so I just took one. Would be cool to have a script for averaging several results, I'll make that if I get some time, but then we can't look at the distribution over the nodes. And that's what I find most useful for understanding what's going on. > Looking at the results for my scheduler, it is real obvious what the > algorithm is that I use for selecting a cpu for a process. This has > not been quite as apparent on my systems, probably due to other tasks > running in the background. The code in your scheduler that causes a > different node to start the search for a cpu really pays off here. Yes, but it's not perfect, as it cannot predict how long the processes will live. Also it takes into account every running task, no matter whether it's just a shortly running migration_thread or something else. A running average might be better, but I have no experience with that. Any ideas? Regarding the hackbench times: it forks 20 tasks which fork 20 tasks each, these are sending messages to each other. Depending on the load balancing (which is not really predictible) it could happen that you end up with bad distributions of senders/receivers sometimes. Best regards, Erich ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2002-10-11 15:36 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht
2002-10-06 20:24 ` Erich Focht
2002-10-07 0:00 ` Martin J. Bligh
2002-10-07 0:58 ` Martin J. Bligh
2002-10-07 16:52 ` Erich Focht
2002-10-07 7:25 ` Martin J. Bligh
2002-10-07 7:40 ` Ingo Molnar
2002-10-07 20:09 ` [PATCH] pooling NUMA scheduler with initial load balancing Erich Focht
[not found] ` <1420721189.1034032091@[10.10.2.3]>
2002-10-08 17:33 ` Erich Focht
2002-10-08 19:44 ` Martin J. Bligh
2002-10-09 16:26 ` Erich Focht
2002-10-09 17:33 ` Martin J. Bligh
2002-10-09 17:58 ` Andrew Theurer
2002-10-09 18:13 ` Andrew Theurer
2002-10-09 23:02 ` Erich Focht
2002-10-10 17:34 ` Andrew Theurer
[not found] ` <200210110947.11714.efocht@ess.nec.de>
2002-10-11 8:27 ` Erich Focht
2002-10-11 14:47 ` Martin J. Bligh
2002-10-11 15:29 ` Erich Focht
2002-10-11 15:34 ` Martin J. Bligh
2002-10-09 1:15 ` Christoph Hellwig
2002-10-09 10:29 ` Erich Focht
2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum
2002-10-07 20:35 ` Erich Focht
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.