* 2.6.8-rc2-mm2 performance improvements (scheduler?)
@ 2004-08-04 15:10 Martin J. Bligh
2004-08-04 15:12 ` Martin J. Bligh
0 siblings, 1 reply; 20+ messages in thread
From: Martin J. Bligh @ 2004-08-04 15:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Con Kolivas, linux-kernel
2.6.8-rc2-mm2 has some significant improvements over 2.6.8-rc2,
particularly at low to mid loads ... at the high loads, it's still
slightly improved, but less significant. Numbers from 16x NUMA-Q.
Kernbench sees most improvement in sys time, but also some elapsed
time ... SDET sees up to 18% more throughput.
I'm also amused to see that the process scalability is now pretty
damned good ... a full -j kernel compile (using up to about 1300
tasks) goes as fast as the -j 256 (the middle one) ... and elapsed
is faster than -j16, even if system is a little higher.
I *think* this is the scheduler changes ... fits in with profile diffs
I've seen before ... diffprofiles at the end. In my experience, higher
copy_to/from_user and finish_task_switch stuff tends to indicate task
thrashing. Note also .text.lock.semaphore numbers in kernbench profile.
The SDET one looks like it's load-balancing better (mainly less idle
time).
Great stuff.
M.
Kernbench: (make -j N vmlinux, where N = 2 x num_cpus)
                    Elapsed   System     User       CPU
     2.6.7            45.37    90.91   579.75   1479.33
     2.6.8-rc2        45.05    88.53   577.87   1485.67
     2.6.8-rc2-mm2    44.09    78.84   577.01   1486.33
Kernbench: (make -j N vmlinux, where N = 16 x num_cpus)
                    Elapsed   System     User       CPU
     2.6.7            44.77    97.96   576.59   1507.00
     2.6.8-rc2        44.83    96.00   575.50   1497.33
     2.6.8-rc2-mm2    43.43    86.04   576.26   1524.33
Kernbench: (make -j vmlinux, maximal tasks)
                    Elapsed   System     User       CPU
     2.6.7            44.25    88.95   575.63   1501.33
     2.6.8-rc2        44.03    87.74   573.82   1503.67
     2.6.8-rc2-mm2    43.75    86.68   576.98   1518.00
DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not comparable with any published results.
Results are shown as percentages of the first set displayed
SDET 1 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 1.0%
2.6.8-rc2 95.9% 2.3%
2.6.8-rc2-mm2 111.5% 3.3%
SDET 2 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.0%
2.6.8-rc2 100.5% 1.4%
2.6.8-rc2-mm2 115.1% 4.0%
SDET 4 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 1.0%
2.6.8-rc2 99.2% 1.1%
2.6.8-rc2-mm2 111.9% 0.5%
SDET 8 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 100.2% 1.0%
2.6.8-rc2-mm2 117.4% 0.9%
SDET 16 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.3%
2.6.8-rc2 99.5% 0.3%
2.6.8-rc2-mm2 118.5% 0.6%
SDET 32 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.3%
2.6.8-rc2 99.7% 0.4%
2.6.8-rc2-mm2 102.1% 0.8%
SDET 64 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 101.6% 0.4%
2.6.8-rc2-mm2 103.2% 0.0%
SDET 128 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 100.2% 0.1%
2.6.8-rc2-mm2 103.0% 0.3%
Diffprofile for kernbench -j32 (-ve numbers better with mm2)
2135 4.3% default_idle
233 44.2% pte_alloc_one
220 11.9% buffered_rmqueue
164 264.5% schedule
135 5.9% do_page_fault
84 10.7% clear_page_tables
62 62.6% __wake_up_sync
51 60.7% set_page_address
...
-50 -43.9% sys_close
-56 -10.7% __fput
-58 -13.7% set_page_dirty
-61 -10.0% copy_process
-70 -41.4% pipe_writev
-77 -8.1% file_move
-85 -100.0% wake_up_forked_thread
-87 -50.9% pipe_wait
-90 -5.7% path_lookup
-93 -21.2% page_add_anon_rmap
-105 -28.5% release_task
-113 -11.9% do_wp_page
-116 -7.9% link_path_walk
-116 -43.1% pipe_readv
-121 -7.5% atomic_dec_and_lock
-138 -15.4% strnlen_user
-159 -9.4% do_no_page
-167 -2.7% __d_lookup
-214 -59.4% find_idlest_cpu
-230 -6.2% find_trylock_page
-237 -1.6% do_anonymous_page
-255 -7.8% zap_pte_range
-444 -97.6% .text.lock.semaphore
-532 -43.4% Letext
-632 -54.2% __wake_up
-1086 -52.2% finish_task_switch
-1436 -24.6% __copy_to_user_ll
-3079 -46.3% __copy_from_user_ll
-7468 -5.4% total
sdetbench 8 (-ve numbers better with mm2)
...
-58 -33.5% clear_page_tables
-61 -19.3% __d_lookup
-70 -15.2% page_remove_rmap
-71 -71.0% finish_task_switch
-71 -46.4% fput
-72 -56.7% buffered_rmqueue
-73 -53.7% pte_alloc_one
-74 -22.9% __copy_to_user_ll
-75 -31.0% do_no_page
-85 -68.0% free_hot_cold_page
-95 -66.0% __copy_user_intel
-118 -21.1% find_trylock_page
-126 -43.8% do_anonymous_page
-171 -21.6% copy_page_range
-368 -38.8% zap_pte_range
-392 -62.1% do_wp_page
-6262 -11.9% default_idle
-9294 -14.4% total
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 15:10 Martin J. Bligh
@ 2004-08-04 15:12 ` Martin J. Bligh
2004-08-04 19:24 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: Martin J. Bligh @ 2004-08-04 15:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: Con Kolivas, linux-kernel
Doh. Clipped the paste button on the mouse against the PC just as I
hit send ... pasting a bunch of data in the wrong place ;-) Should've
looked like this:
----------------------------------------------
2.6.8-rc2-mm2 has some significant improvements over 2.6.8-rc2,
particularly at low to mid loads ... at the high loads, it's still
slightly improved, but less significant. Numbers from 16x NUMA-Q.
Kernbench sees most improvement in sys time, but also some elapsed
time ... SDET sees up to 18% more throughput.
I'm also amused to see that the process scalability is now pretty
damned good ... a full -j kernel compile (using up to about 1300
tasks) goes as fast as the -j 256 (the middle one) ... and elapsed
is faster than -j16, even if system is a little higher.
I *think* this is the scheduler changes ... fits in with profile diffs
I've seen before ... diffprofiles at the end. In my experience, higher
copy_to/from_user and finish_task_switch stuff tends to indicate task
thrashing. Note also .text.lock.semaphore numbers in kernbench profile.
The SDET one looks like it's load-balancing better (mainly less idle
time).
Great stuff.
M.
Kernbench: (make -j N vmlinux, where N = 2 x num_cpus)
                    Elapsed   System     User       CPU
     2.6.7            45.37    90.91   579.75   1479.33
     2.6.8-rc2        45.05    88.53   577.87   1485.67
     2.6.8-rc2-mm2    44.09    78.84   577.01   1486.33
Kernbench: (make -j N vmlinux, where N = 16 x num_cpus)
                    Elapsed   System     User       CPU
     2.6.7            44.77    97.96   576.59   1507.00
     2.6.8-rc2        44.83    96.00   575.50   1497.33
     2.6.8-rc2-mm2    43.43    86.04   576.26   1524.33
Kernbench: (make -j vmlinux, maximal tasks)
                    Elapsed   System     User       CPU
     2.6.7            44.25    88.95   575.63   1501.33
     2.6.8-rc2        44.03    87.74   573.82   1503.67
     2.6.8-rc2-mm2    43.75    86.68   576.98   1518.00
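For reference, the CPU column in these tables is approximately the utilization figure, (System + User) / Elapsed x 100. A minimal sketch of the per-kernel averaging a kernbench-style harness would do; the input format (one elapsed/system/user triple per run) is an assumed simplification, not kernbench's actual output:

```python
# Average repeated kernel-compile timings the way a kernbench-style
# harness would, deriving the CPU (utilization) column as
# (system + user) / elapsed * 100.
def average_runs(runs):
    """runs: list of (elapsed, system, user) tuples, one per run."""
    n = len(runs)
    elapsed = sum(r[0] for r in runs) / n
    system = sum(r[1] for r in runs) / n
    user = sum(r[2] for r in runs) / n
    cpu = (system + user) / elapsed * 100
    return (round(elapsed, 2), round(system, 2),
            round(user, 2), round(cpu, 2))
```

E.g. the 2.6.7 row at N = 2 x num_cpus works out to (90.91 + 579.75) / 45.37 * 100 ~= 1478, close to the reported 1479.33 (a real harness averages the per-run ratios rather than recomputing from the averages).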
DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not comparable with any published results.
Results are shown as percentages of the first set displayed
SDET 1 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 1.0%
2.6.8-rc2 95.9% 2.3%
2.6.8-rc2-mm2 111.5% 3.3%
SDET 2 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.0%
2.6.8-rc2 100.5% 1.4%
2.6.8-rc2-mm2 115.1% 4.0%
SDET 4 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 1.0%
2.6.8-rc2 99.2% 1.1%
2.6.8-rc2-mm2 111.9% 0.5%
SDET 8 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 100.2% 1.0%
2.6.8-rc2-mm2 117.4% 0.9%
SDET 16 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.3%
2.6.8-rc2 99.5% 0.3%
2.6.8-rc2-mm2 118.5% 0.6%
SDET 32 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.3%
2.6.8-rc2 99.7% 0.4%
2.6.8-rc2-mm2 102.1% 0.8%
SDET 64 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 101.6% 0.4%
2.6.8-rc2-mm2 103.2% 0.0%
SDET 128 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 100.2% 0.1%
2.6.8-rc2-mm2 103.0% 0.3%
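The normalization used above ("percentages of the first set displayed") is straightforward; a sketch, assuming raw throughput samples per kernel (the sample values in the usage check are illustrative, not the real measurements):

```python
# Express each kernel's mean throughput as a percentage of the first
# (baseline) kernel, with the run-to-run standard deviation expressed
# the same way -- the normalization used in the SDET tables.
from statistics import mean, pstdev

def normalize(results):
    """results: dict of kernel -> list of raw throughput samples;
    the first entry is the 100% baseline."""
    kernels = list(results)            # insertion order: first = baseline
    base = mean(results[kernels[0]])
    return {k: (round(mean(v) / base * 100, 1),
                round(pstdev(v) / base * 100, 1))
            for k, v in results.items()}
```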
Diffprofile for kernbench -j32 (-ve numbers better with mm2)
2135 4.3% default_idle
233 44.2% pte_alloc_one
220 11.9% buffered_rmqueue
164 264.5% schedule
135 5.9% do_page_fault
84 10.7% clear_page_tables
62 62.6% __wake_up_sync
51 60.7% set_page_address
...
-50 -43.9% sys_close
-56 -10.7% __fput
-58 -13.7% set_page_dirty
-61 -10.0% copy_process
-70 -41.4% pipe_writev
-77 -8.1% file_move
-85 -100.0% wake_up_forked_thread
-87 -50.9% pipe_wait
-90 -5.7% path_lookup
-93 -21.2% page_add_anon_rmap
-105 -28.5% release_task
-113 -11.9% do_wp_page
-116 -7.9% link_path_walk
-116 -43.1% pipe_readv
-121 -7.5% atomic_dec_and_lock
-138 -15.4% strnlen_user
-159 -9.4% do_no_page
-167 -2.7% __d_lookup
-214 -59.4% find_idlest_cpu
-230 -6.2% find_trylock_page
-237 -1.6% do_anonymous_page
-255 -7.8% zap_pte_range
-444 -97.6% .text.lock.semaphore
-532 -43.4% Letext
-632 -54.2% __wake_up
-1086 -52.2% finish_task_switch
-1436 -24.6% __copy_to_user_ll
-3079 -46.3% __copy_from_user_ll
-7468 -5.4% total
sdetbench 8 (-ve numbers better with mm2)
...
-58 -33.5% clear_page_tables
-61 -19.3% __d_lookup
-70 -15.2% page_remove_rmap
-71 -71.0% finish_task_switch
-71 -46.4% fput
-72 -56.7% buffered_rmqueue
-73 -53.7% pte_alloc_one
-74 -22.9% __copy_to_user_ll
-75 -31.0% do_no_page
-85 -68.0% free_hot_cold_page
-95 -66.0% __copy_user_intel
-118 -21.1% find_trylock_page
-126 -43.8% do_anonymous_page
-171 -21.6% copy_page_range
-368 -38.8% zap_pte_range
-392 -62.1% do_wp_page
-6262 -11.9% default_idle
-9294 -14.4% total
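For anyone unfamiliar with the diffprofile listings above: they are simply the per-symbol difference between two kernel profiles (sample counts per symbol), with the percentage change relative to the baseline, sorted so increases come first and the biggest decreases last. A rough sketch of the idea; Martin's actual script may differ in detail:

```python
# Minimal diffprofile: subtract two profiles (symbol -> sample count)
# and report absolute and percentage change per symbol.
def diffprofile(before, after):
    rows = []
    for sym in set(before) | set(after):
        b, a = before.get(sym, 0), after.get(sym, 0)
        if a == b:
            continue
        # Symbols absent from the baseline are arbitrarily called +100%.
        pct = (a - b) / b * 100 if b else 100.0
        rows.append((a - b, pct, sym))
    rows.sort(key=lambda r: -r[0])     # increases first, decreases last
    return ["%7d %6.1f%% %s" % (d, p, s) for d, p, s in rows]
```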
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 15:12 ` Martin J. Bligh
@ 2004-08-04 19:24 ` Andrew Morton
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 23:44 ` Peter Williams
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2004-08-04 19:24 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: kernel, linux-kernel, Ingo Molnar
"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> SDET 8 (see disclaimer)
> Throughput Std. Dev
> 2.6.7 100.0% 0.2%
> 2.6.8-rc2 100.2% 1.0%
> 2.6.8-rc2-mm2 117.4% 0.9%
>
> SDET 16 (see disclaimer)
> Throughput Std. Dev
> 2.6.7 100.0% 0.3%
> 2.6.8-rc2 99.5% 0.3%
> 2.6.8-rc2-mm2 118.5% 0.6%
hum, interesting. Can Con's changes affect the inter-node and inter-cpu
balancing decisions, or is this all due to caching effects, reduced context
switching etc?
I don't expect we'll be merging a new CPU scheduler into mainline any time
soon, but we should work to understand where this improvement came from,
and see if we can get the mainline scheduler to catch up.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:24 ` Andrew Morton
@ 2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
2004-08-04 20:10 ` Ingo Molnar
2004-08-04 23:44 ` Peter Williams
1 sibling, 2 replies; 20+ messages in thread
From: Martin J. Bligh @ 2004-08-04 19:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: kernel, linux-kernel, Ingo Molnar, Rick Lindsley
--On Wednesday, August 04, 2004 12:24:14 -0700 Andrew Morton <akpm@osdl.org> wrote:
> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>>
>> SDET 8 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.2%
>> 2.6.8-rc2 100.2% 1.0%
>> 2.6.8-rc2-mm2 117.4% 0.9%
>>
>> SDET 16 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.3%
>> 2.6.8-rc2 99.5% 0.3%
>> 2.6.8-rc2-mm2 118.5% 0.6%
>
> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> balancing decisions, or is this all due to caching effects, reduced context
> switching etc?
>
> I don't expect we'll be merging a new CPU scheduler into mainline any time
> soon, but we should work to understand where this improvement came from,
> and see if we can get the mainline scheduler to catch up.
Dunno ... really need to take schedstats profiles before and afterwards to
get a better picture what it's doing. Rick was working on a port.
M.
PS. schedstats is great for this kind of thing. Very useful, minimally
invasive, no impact unless configed in, and nothing measurable even then.
Hint. Hint ;-)
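For context, profiling with schedstats amounts to snapshotting /proc/schedstat before and after a run and diffing the counters. A rough sketch of that diffing step; the meaning of each column depends on the schedstats version, so the fields are treated as opaque counters here:

```python
# Diff two /proc/schedstat-style snapshots.  Each snapshot is the raw
# file text; "cpu..." lines carry per-CPU counters.  Only the deltas
# are reported, since column semantics vary by schedstats version.
def schedstat_delta(before, after):
    def parse(text):
        return {line.split()[0]: [int(f) for f in line.split()[1:]]
                for line in text.splitlines() if line.startswith("cpu")}
    a, b = parse(before), parse(after)
    return {cpu: [y - x for x, y in zip(a[cpu], b[cpu])]
            for cpu in a if cpu in b}
```

In practice you would read /proc/schedstat once before and once after the benchmark and feed both snapshots in.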
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:34 ` Martin J. Bligh
@ 2004-08-04 19:50 ` Andrew Morton
2004-08-04 20:07 ` Rick Lindsley
2004-08-04 20:10 ` Ingo Molnar
1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2004-08-04 19:50 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: kernel, linux-kernel, mingo, ricklind
"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> PS. schedstats is great for this kind of thing. Very useful, minimally
> invasive, no impact unless configed in, and nothing measurable even then.
> Hint. Hint ;-)
Ho hum. It's up to the hordes of scheduler hackers really. If they want,
and can agree upon a patch then go wild. It should be against -mm minus
staircase, as there's a fair amount of scheduler stuff banked up for
post-2.6.8.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:50 ` Andrew Morton
@ 2004-08-04 20:07 ` Rick Lindsley
0 siblings, 0 replies; 20+ messages in thread
From: Rick Lindsley @ 2004-08-04 20:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, kernel, linux-kernel, mingo
Ho hum. It's up to the hordes of scheduler hackers really. If they
want, and can agree upon a patch then go wild. It should be against
-mm minus staircase, as there's a fair amount of scheduler stuff
banked up for post-2.6.8.
The patch exists for both -mm2 and -mm1, but I've been holding off
posting it until I get a chance to do more than simply compile it.
Our lab machines are back up now so I'll test a (-mm2 - staircase)
patch this afternoon.
Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
@ 2004-08-04 20:10 ` Ingo Molnar
2004-08-04 20:36 ` Martin J. Bligh
1 sibling, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2004-08-04 20:10 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
* Martin J. Bligh <mbligh@aracnet.com> wrote:
> >> SDET 16 (see disclaimer)
> >> Throughput Std. Dev
> >> 2.6.7 100.0% 0.3%
> >> 2.6.8-rc2 99.5% 0.3%
> >> 2.6.8-rc2-mm2 118.5% 0.6%
> >
> > hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> > balancing decisions, or is this all due to caching effects, reduced context
> > switching etc?
Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
unapplied and re-run at least part of your tests?
there are a number of NUMA improvements queued up on -mm, and it would
be nice to know what effect these cause, and what effect the staircase
scheduler has.
Ingo
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 20:10 ` Ingo Molnar
@ 2004-08-04 20:36 ` Martin J. Bligh
2004-08-04 21:31 ` Ingo Molnar
0 siblings, 1 reply; 20+ messages in thread
From: Martin J. Bligh @ 2004-08-04 20:36 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
--On Wednesday, August 04, 2004 22:10:19 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Martin J. Bligh <mbligh@aracnet.com> wrote:
>
>> >> SDET 16 (see disclaimer)
>> >> Throughput Std. Dev
>> >> 2.6.7 100.0% 0.3%
>> >> 2.6.8-rc2 99.5% 0.3%
>> >> 2.6.8-rc2-mm2 118.5% 0.6%
>> >
>> > hum, interesting. Can Con's changes affect the inter-node and inter-cpu
>> > balancing decisions, or is this all due to caching effects, reduced context
>> > switching etc?
>
> Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
> unapplied a re-run at least part of your tests?
>
> there are a number of NUMA improvements queued up on -mm, and it would
> be nice to know what effect these cause, and what effect the staircase
> scheduler has.
Sure. I presume it's just the one patch:
staircase-cpu-scheduler-268-rc2-mm1.patch
which seemed to back out clean and is building now. Scream if that's not
all of it ...
M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 20:36 ` Martin J. Bligh
@ 2004-08-04 21:31 ` Ingo Molnar
2004-08-04 23:34 ` Martin J. Bligh
0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2004-08-04 21:31 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
* Martin J. Bligh <mbligh@aracnet.com> wrote:
> > Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
> > unapplied and re-run at least part of your tests?
> >
> > there are a number of NUMA improvements queued up on -mm, and it would
> > be nice to know what effect these cause, and what effect the staircase
> > scheduler has.
>
> Sure. I presume it's just the one patch:
>
> staircase-cpu-scheduler-268-rc2-mm1.patch
>
> which seemed to back out clean and is building now. Scream if that's
> not all of it ...
correct, that's the end of the scheduler patch-queue and it works fine
if unapplied. (The schedstats patch i just sent applies cleanly to that
base, in case you need one.)
Ingo
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 21:31 ` Ingo Molnar
@ 2004-08-04 23:34 ` Martin J. Bligh
0 siblings, 0 replies; 20+ messages in thread
From: Martin J. Bligh @ 2004-08-04 23:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
--On Wednesday, August 04, 2004 23:31:13 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Martin J. Bligh <mbligh@aracnet.com> wrote:
>
>> > Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
>> > unapplied and re-run at least part of your tests?
>> >
>> > there are a number of NUMA improvements queued up on -mm, and it would
>> > be nice to know what effect these cause, and what effect the staircase
>> > scheduler has.
>>
>> Sure. I presume it's just the one patch:
>>
>> staircase-cpu-scheduler-268-rc2-mm1.patch
>>
>> which seemed to back out clean and is building now. Scream if that's
>> not all of it ...
>
> correct, that's the end of the scheduler patch-queue and it works fine
> if unapplied. (The schedstats patch i just sent applies cleanly to that
> base, in case you need one.)
OK, the perf of 2.6.8-rc2-mm2 with the new sched code backed out is exactly
the same as 2.6.8-rc2 ... ie it's definitely the new sched code that makes
the improvement.
M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:24 ` Andrew Morton
2004-08-04 19:34 ` Martin J. Bligh
@ 2004-08-04 23:44 ` Peter Williams
2004-08-04 23:59 ` Martin J. Bligh
1 sibling, 1 reply; 20+ messages in thread
From: Peter Williams @ 2004-08-04 23:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, kernel, linux-kernel, Ingo Molnar
Andrew Morton wrote:
> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
>>SDET 8 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.2%
>> 2.6.8-rc2 100.2% 1.0%
>> 2.6.8-rc2-mm2 117.4% 0.9%
>>
>> SDET 16 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.3%
>> 2.6.8-rc2 99.5% 0.3%
>> 2.6.8-rc2-mm2 118.5% 0.6%
>
>
> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> balancing decisions, or is this all due to caching effects, reduced context
> switching etc?
One candidate for the cause of this improvement is the replacement of
the active/expired array mechanism with a single array. I believe that
one of the shortcomings of the active/expired array mechanism is that
it can lead to excessive queuing (possibly even starvation) of tasks
that aren't considered "interactive".
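Peter's starvation point can be illustrated with a toy model: if "interactive" tasks are re-inserted into the active array when their slice ends, the array swap that would let an expired task run again can be postponed indefinitely. This is only an illustrative simulation, not the real 2.6 scheduler code:

```python
# Toy model of active/expired starvation: count ticks until the active
# array empties (the array swap, when expired tasks finally run).
# "Interactive" tasks re-enter the active array after their slice,
# postponing the swap.
def ticks_until_swap(active, interactive, slice_ticks, max_ticks):
    t, queue = 0, list(active)
    while queue and t < max_ticks:
        task = queue.pop(0)
        t += slice_ticks          # task burns its full timeslice
        if task in interactive:
            queue.append(task)    # re-enters active: swap postponed
    return t
```

With no interactive tasks the swap happens after one round of timeslices; with even one, expired tasks wait until the max_ticks cutoff, i.e. starve. A single-queue design re-inserts every task by priority instead, so there is no swap to wait for.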
>
> I don't expect we'll be merging a new CPU scheduler into mainline any time
> soon, but we should work to understand where this improvement came from,
> and see if we can get the mainline scheduler to catch up.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 23:44 ` Peter Williams
@ 2004-08-04 23:59 ` Martin J. Bligh
2004-08-05 5:20 ` Rick Lindsley
0 siblings, 1 reply; 20+ messages in thread
From: Martin J. Bligh @ 2004-08-04 23:59 UTC (permalink / raw)
To: Peter Williams, Andrew Morton, Rick Lindsley
Cc: kernel, linux-kernel, Ingo Molnar
--On Thursday, August 05, 2004 09:44:06 +1000 Peter Williams <pwil3058@bigpond.net.au> wrote:
> Andrew Morton wrote:
>> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>>
>>> SDET 8 (see disclaimer)
>>> Throughput Std. Dev
>>> 2.6.7 100.0% 0.2%
>>> 2.6.8-rc2 100.2% 1.0%
>>> 2.6.8-rc2-mm2 117.4% 0.9%
>>>
>>> SDET 16 (see disclaimer)
>>> Throughput Std. Dev
>>> 2.6.7 100.0% 0.3%
>>> 2.6.8-rc2 99.5% 0.3%
>>> 2.6.8-rc2-mm2 118.5% 0.6%
>>
>>
>> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
>> balancing decisions, or is this all due to caching effects, reduced context
>> switching etc?
>
> One candidate for the cause of this improvement is the replacement of
> the active/expired array mechanism with a single array. I believe that
> one of the shortcomings of the active/expired array mechanism is that
> it can lead to excessive queuing (possibly even starvation) of tasks
> that aren't considered "interactive".
Rick showed me schedstats graphs of the two ... it seems to have lower
latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
looks better ... he'll send them out soon, I think (hint, hint).
M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 23:59 ` Martin J. Bligh
@ 2004-08-05 5:20 ` Rick Lindsley
2004-08-05 10:45 ` Ingo Molnar
0 siblings, 1 reply; 20+ messages in thread
From: Rick Lindsley @ 2004-08-05 5:20 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Peter Williams, Andrew Morton, kernel, linux-kernel, Ingo Molnar
Rick showed me schedstats graphs of the two ... it seems to have lower
latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
looks better ... he'll send them out soon, I think (hint, hint).
Okay, they're done. Here's the URL of the graphs:
http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
General summary: as Martin reported, we're seeing improvements in a number
of areas, at least with sdet. The graphs as listed there represent stats
from four separate sdet runs run sequentially with an increasing load.
(We're trying to see if we can get the information from each run separately,
rather than the aggregate -- one of the hazards of an automated test
harness :)
Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-05 5:20 ` Rick Lindsley
@ 2004-08-05 10:45 ` Ingo Molnar
0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2004-08-05 10:45 UTC (permalink / raw)
To: Rick Lindsley
Cc: Martin J. Bligh, Peter Williams, Andrew Morton, kernel,
linux-kernel
* Rick Lindsley <ricklind@us.ibm.com> wrote:
> Okay, they're done. Here's the URL of the graphs:
>
> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>
> General summary: as Martin reported, we're seeing improvements in a
> number of areas, at least with sdet. The graphs as listed there
> represent stats from four separate sdet runs run sequentially with an
> increasing load. (We're trying to see if we can get the information
> from each run separately, rather than the aggregate -- one of the
> hazards of an automated test harness :)
really nice results! Would be interesting to see the effect of Con's
patch on other SMP/NUMA workloads as well - i'd expect to see an
improvement there too. The test was done with the default interactive=1
compute=0 setting, right?
Ingo
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
[not found] <200408092240.05287.habanero@us.ibm.com>
@ 2004-08-10 4:08 ` Andrew Theurer
2004-08-10 4:37 ` Con Kolivas
2004-08-10 7:40 ` Rick Lindsley
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Theurer @ 2004-08-10 4:08 UTC (permalink / raw)
To: linux-kernel; +Cc: ricklind, mbligh, mingo, akpm
On Monday 09 August 2004 22:40, you wrote:
> Rick showed me schedstats graphs of the two ... it seems to have lower
> latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
> looks better ... he'll send them out soon, I think (hint, hint).
>
> Okay, they're done. Here's the URL of the graphs:
>
> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>
> General summary: as Martin reported, we're seeing improvements in a number
> of areas, at least with sdet. The graphs as listed there represent stats
> from four separate sdet runs run sequentially with an increasing load.
> (We're trying to see if we can get the information from each run
> separately, rather than the aggregate -- one of the hazards of an automated
> test harness :)
What's quite interesting is that there is a very noticeable surge in
load_balance with staircase in the early stage of the test, but there appear
to be -no- direct policy changes to load balancing at all in Con's patch (or
at least I didn't notice any - please tell me if you did!). You can see it in
busy load_balance, sched_balance_exec, and pull_task. The runslice and
latency stats confirm this: no-staircase does not balance early on, and the
tasks suffer, waiting on a cpu that's already loaded up. I do not have an
explanation for this; perhaps it has something to do with eliminating the
expired queue.
It would be nice to have per-cpu runqueue lengths logged to see how this
plays out - do the cpus on staircase obtain a runqueue length close to
nr_running()/nr_online_cpus sooner than no-staircase?
Also, one big change apparent to me, the elimination of TIMESLICE_GRANULARITY.
Do you have cswitch data? I would not be surprised if it's a lot higher on
-no-staircase, and cache is thrashed a lot more. This may be something you
can pull out of the -no-staircase kernel quite easily.
-Andrew Theurer
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 4:08 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Andrew Theurer
@ 2004-08-10 4:37 ` Con Kolivas
2004-08-10 15:05 ` Andrew Theurer
2004-08-10 7:40 ` Rick Lindsley
1 sibling, 1 reply; 20+ messages in thread
From: Con Kolivas @ 2004-08-10 4:37 UTC (permalink / raw)
To: Andrew Theurer; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm
Andrew Theurer writes:
> On Monday 09 August 2004 22:40, you wrote:
>> Rick showed me schedstats graphs of the two ... it seems to have lower
>> latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
>> looks better ... he'll send them out soon, I think (hint, hint).
>>
>> Okay, they're done. Here's the URL of the graphs:
>>
>> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>>
>> General summary: as Martin reported, we're seeing improvements in a number
>> of areas, at least with sdet. The graphs as listed there represent stats
>> from four separate sdet runs run sequentially with an increasing load.
>> (We're trying to see if we can get the information from each run
>> separately, rather than the aggregate -- one of the hazards of an automated
>> test harness :)
>
> What's quite interesting is that there is a very noticeable surge in
> load_balance with staircase in the early stage of the test, but there appears
> to be -no- direct policy changes to load-balance at all in Con's patch (or at
> least I didn't notice it -please tell me if you did!). You can see it in
> busy load_balance, sched_balance_exec, and pull_task. The runslice and
> latency stats confirm this -no-staircase does not balance early on, and the
> tasks suffer, waiting on a cpu already loaded up. I do not have an
> explanation for this; perhaps it has something to do with eliminating expired
> queue.
To be honest I have no idea why that's the case. One of the first things I
did was eliminate the expired array and in my testing (up to 8x at osdl) I
did not really notice this in and of itself made any big difference - of
course this could be because the removal of the expired array was not done
in a way which entitled starved tasks to run in reasonable timeframes.
> It would be nice to have per cpu runqueue lengths logged to see how this plays
> out -do the cpus on staircase obtain a runqueue length close to
> nr_running()/nr_online_cpus sooner than no-staircase?
/me looks in the schedstats people's way
> Also, one big change apparent to me, the elimination of TIMESLICE_GRANULARITY.
Ah well I tuned the timeslice granularity and I can tell you it isn't quite
what most people think. The granularity when you get to greater than 4 cpus
is effectively _disabled_. So in fact, the timeslices are shorter in
staircase (in normal interactive=1, compute=0 mode which is how martin
would have tested it), not longer. But this is not the reason either since
in "compute" mode they are ten times longer and this also improves
throughput further.
> Do you have cswitch data? I would not be surprised if it's a lot higher on
> -no-staircase, and cache is thrashed a lot more. This may be something you
> can pull out of the -no-staircase kernel quite easily.
Well, from what I got on 8x, the optimal load (-j 4 x cpus) and maximal load
(-j) on kernbench give surprisingly similar context switch rates. It's only
when I enable compute mode that the context switches drop compared to
default staircase mode and mainline. You'd have to ask Martin and Rick about
what they got.
> -Andrew Theurer
Cheers,
Con
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 4:08 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Andrew Theurer
2004-08-10 4:37 ` Con Kolivas
@ 2004-08-10 7:40 ` Rick Lindsley
2004-08-10 15:19 ` Andrew Theurer
1 sibling, 1 reply; 20+ messages in thread
From: Rick Lindsley @ 2004-08-10 7:40 UTC (permalink / raw)
To: Andrew Theurer; +Cc: linux-kernel, mbligh, mingo, akpm
    What's quite interesting is that there is a very noticeable surge in
    load_balance with staircase in the early stage of the test, but there
    appears to be -no- direct policy changes to load-balance at all in
    Con's patch (or at least I didn't notice it - please tell me if you
    did!). You can see it in busy load_balance, sched_balance_exec, and
    pull_task. The runslice and latency stats confirm this: -no-staircase
    does not balance early on, and the tasks suffer, waiting on a cpu
    already loaded up. I do not have an explanation for this; perhaps
    it has something to do with eliminating the expired queue.

Possibly. The other factor thrown in here is that this was on an SMT
machine, so it's possible that the balancing is no different but we are
seeing tasks initially assigned more poorly. Or, perhaps we're drawing
too much from one data point.

    It would be nice to have per cpu runqueue lengths logged to see how
    this plays out - do the cpus on staircase obtain a runqueue length
    close to nr_running()/nr_online_cpus sooner than no-staircase?

The only difficulty there is: do we know how long it normally takes for
this to balance out? We're taking samples every five seconds; might this
not work itself out between one snapshot and the next? Shrug. It would
be easy enough to add another field to report nr_running at the moment
the statistics snapshot was taken, but on anything but compute-intensive
benchmarks I'm afraid we might miss all the interesting data.

    Also, one big change apparent to me, the elimination of
    TIMESLICE_GRANULARITY. Do you have cswitch data? I would not
    be surprised if it's a lot higher on -no-staircase, and cache is
    thrashed a lot more. This may be something you can pull out of the
    -no-staircase kernel quite easily.

Yes, sar data was collected every five seconds so I do have context switch
data. The bad news is that it was collected for each of 10 runs times
four different loads, and I don't have any handy dandy scripts to pretty
it up :) (Pause.) A quick exercise with a calculator, though, suggests
you are right. cswitches were 10%-20% higher on the no staircase runs.
Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 4:37 ` Con Kolivas
@ 2004-08-10 15:05 ` Andrew Theurer
2004-08-10 20:57 ` Con Kolivas
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Theurer @ 2004-08-10 15:05 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm
> > Also, one big change apparent to me, the elimination of
> > TIMESLICE_GRANULARITY.
>
> Ah well I tuned the timeslice granularity and I can tell you it isn't quite
> what most people think. The granularity when you get to greater than 4 cpus
> is effectively _disabled_. So in fact, the timeslices are shorter in
> staircase (in normal interactive=1, compute=0 mode which is how martin
> would have tested it), not longer. But this is not the reason either since
> in "compute" mode they are ten times longer and this also improves
> throughput further.
Interesting, I forgot about the "* nr_cpus" that was in the granularity
calculation. That does make me wonder: maybe the timeslices you are
calculating could use something similar, but more appropriate.
Since the number of runnable tasks on a cpu should play a part in latency (the
more tasks, potentially the longer the latency), I wonder if the timeslice
would benefit from a modifier like " / task_cpu(p)->nr_running ". With this
the base timeslice could be quite a bit larger to start for better cache
warmth, and as we add more tasks to that cpu, the timeslices get smaller, so
an acceptable latency is preserved.
> > Do you have cswitch data? I would not be surprised if it's a lot higher
> > on -no-staircase, and cache is thrashed a lot more. This may be
> > something you can pull out of the -no-staircase kernel quite easily.
>
> Well from what I got on 8x the optimal load (-j x4cpus) and maximal load
> (-j) on kernbench gives surprisingly similar context switch rates. It's
> only when I enable compute mode that the context switches drop compared to
> default staircase mode and mainline. You'd have to ask Martin and Rick
> about what they got.
OK, thanks!
-Andrew Theurer
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 7:40 ` Rick Lindsley
@ 2004-08-10 15:19 ` Andrew Theurer
0 siblings, 0 replies; 20+ messages in thread
From: Andrew Theurer @ 2004-08-10 15:19 UTC (permalink / raw)
To: Rick Lindsley; +Cc: Con Kolivas, linux-kernel, mbligh, mingo, akpm
On Tuesday 10 August 2004 02:40, Rick Lindsley wrote:
> What's quite interesting is that there is a very noticeable surge in
> load_balance with staircase in the early stage of the test, but there
> appears to be -no- direct policy changes to load-balance at all in
> Con's patch (or at least I didn't notice it - please tell me if you
> did!). You can see it in busy load_balance, sched_balance_exec, and
> pull_task. The runslice and latency stats confirm this: -no-staircase
> does not balance early on, and the tasks suffer, waiting on a cpu
> already loaded up. I do not have an explanation for this; perhaps
> it has something to do with eliminating expired queue.
>
> Possibly. The other factor thrown in here is that this was on an SMT
> machine, so it's possible that the balancing is no different but we are
> seeing tasks initially assigned more poorly. Or, perhaps we're drawing
> too much from one data point.
Yes, my first guess was that sched_balance_exec was changed, and I guess it
was, but earlier than Con's patch. The first conditional there used to be:

	if (this_rq()->nr_running <= 2)
		goto out;

but the 2 is now a 1 for both -rc2 and -rc2-mm2, so we tend to find the best
cpu in the system more often now.
>
> It would be nice to have per cpu runqueue lengths logged to see how
> this plays out - do the cpus on staircase obtain a runqueue length
> close to nr_running()/nr_online_cpus sooner than no-staircase?
>
> The only difficulty there is: do we know how long it normally takes for
> this to balance out? We're taking samples every five seconds; might this
> not work itself out between one snapshot and the next? Shrug. It would
> be easy enough to add another field to report nr_running at the moment
> the statistics snapshot was taken, but on anything but compute-intensive
> benchmarks I'm afraid we might miss all the interesting data.
Actually if you have sar cpu util data, we might be able to extract this. For
example, if we have balance issues on 16 user sdet, we may see that very
early on the staircase cpu util was near 100%, where the no-staircase may
have been much lower for the first portion of the test (showing that some
cpus were idle while others may have had more than one task). If we can see
this in sar, IMO that would confirm some sort of indirect load balance
improvement in staircase.
> Also, one big change apparent to me, the elimination of
> TIMESLICE_GRANULARITY. Do you have cswitch data? I would not
> be surprised if it's a lot higher on -no-staircase, and cache is
> thrashed a lot more. This may be something you can pull out of the
> -no-staircase kernel quite easily.
>
> Yes, sar data was collected every five seconds so I do have context switch
> data. The bad news is that it was collected for each of 10 runs times
> four different loads, and I don't have any handy dandy scripts to pretty
> it up :) (Pause.) A quick exercise with a calculator, though, suggests
> you are right. cswitches were 10%-20% higher on the no staircase runs.
Interesting. I wouldn't expect it to account for up to 20% performance, but
maybe 1-2%.
>
> Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 15:05 ` Andrew Theurer
@ 2004-08-10 20:57 ` Con Kolivas
0 siblings, 0 replies; 20+ messages in thread
From: Con Kolivas @ 2004-08-10 20:57 UTC (permalink / raw)
To: Andrew Theurer; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm
Andrew Theurer wrote:
>>>Also, one big change apparent to me, the elimination of
>>>TIMESLICE_GRANULARITY.
>>
>>Ah well I tuned the timeslice granularity and I can tell you it isn't quite
>>what most people think. The granularity when you get to greater than 4 cpus
>>is effectively _disabled_. So in fact, the timeslices are shorter in
>>staircase (in normal interactive=1, compute=0 mode which is how martin
>>would have tested it), not longer. But this is not the reason either since
>>in "compute" mode they are ten times longer and this also improves
>>throughput further.
>
>
> Interesting, I forgot about the "* nr_cpus" that was in the granularity
> calculation. That does make me wonder, maybe the timeslices you are
> calculating could have something similar, but more appropriate.
>
> Since the number of runnable tasks on a cpu should play a part in latency (the
> more tasks, potentially the longer the latency), I wonder if the timeslice
> would benefit from a modifier like " / task_cpu(p)->nr_running ". With this
> the base timeslice could be quite a bit larger to start for better cache
> warmth, and as we add more tasks to that cpu, the timeslices get smaller, so
> an acceptable latency is preserved.
I had a problem with fairness once I made the timeslices too long, since
timeslice length also determines priority demotion in the staircase design.
That's why I keep "compute" mode as quite a separate entity: the longer
timeslices on their own weren't of any special benefit (in my up-to-8x
testing, at least; it could be different elsewhere) unless I also added the
delayed preemption, which is probably where the main extra cache warmth in
"compute" mode comes from. Of course this comes at a cost, which is higher
latencies... because normal priority preemption is delayed.
>>>Do you have cswitch data? I would not be surprised if it's a lot higher
>>>on -no-staircase, and cache is thrashed a lot more. This may be
>>>something you can pull out of the -no-staircase kernel quite easily.
>>
>>Well from what I got on 8x the optimal load (-j x4cpus) and maximal load
>>(-j) on kernbench gives surprisingly similar context switch rates. It's
>>only when I enable compute mode that the context switches drop compared to
>>default staircase mode and mainline. You'd have to ask Martin and Rick
>>about what they got.
>
>
> OK, thanks!
>
> -Andrew Theurer
Cheers,
Con
end of thread, other threads:[~2004-08-10 20:58 UTC | newest]
Thread overview: 20+ messages
-- links below jump to the message on this page --
[not found] <200408092240.05287.habanero@us.ibm.com>
2004-08-10 4:08 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Andrew Theurer
2004-08-10 4:37 ` Con Kolivas
2004-08-10 15:05 ` Andrew Theurer
2004-08-10 20:57 ` Con Kolivas
2004-08-10 7:40 ` Rick Lindsley
2004-08-10 15:19 ` Andrew Theurer
2004-08-04 15:10 Martin J. Bligh
2004-08-04 15:12 ` Martin J. Bligh
2004-08-04 19:24 ` Andrew Morton
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
2004-08-04 20:07 ` Rick Lindsley
2004-08-04 20:10 ` Ingo Molnar
2004-08-04 20:36 ` Martin J. Bligh
2004-08-04 21:31 ` Ingo Molnar
2004-08-04 23:34 ` Martin J. Bligh
2004-08-04 23:44 ` Peter Williams
2004-08-04 23:59 ` Martin J. Bligh
2004-08-05 5:20 ` Rick Lindsley
2004-08-05 10:45 ` Ingo Molnar