* 2.6.8-rc2-mm2 performance improvements (scheduler?)
@ 2004-08-04 15:10 Martin J. Bligh
2004-08-04 15:12 ` Martin J. Bligh
0 siblings, 1 reply; 27+ messages in thread
From: Martin J. Bligh @ 2004-08-04 15:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Con Kolivas, linux-kernel
2.6.8-rc2-mm2 has some significant improvements over 2.6.8-rc2,
particularly at low to mid loads ... at the high loads, it's still
slightly improved, but less significant. Numbers from 16x NUMA-Q.
Kernbench sees most improvement in sys time, but also some elapsed
time ... SDET sees up to 18% more throughput.
I'm also amused to see that the process scalability is now pretty
damned good ... a full -j kernel compile (using up to about 1300
tasks) goes as fast as the -j 256 (the middle one) ... and elapsed
is faster than -j16, even if system is a little higher.
I *think* this is the scheduler changes ... fits in with profile diffs
I've seen before ... diffprofiles at the end. In my experience, higher
copy_to/from_user and finish_task_switch stuff tends to indicate task
thrashing. Note also .text.lock.semaphore numbers in kernbench profile.
The SDET one looks like it's load-balancing better (mainly less idle
time).
Great stuff.
M.
Kernbench: (make -j N vmlinux, where N = 2 x num_cpus)
               Elapsed   System     User      CPU
2.6.7            45.37    90.91   579.75  1479.33
2.6.8-rc2        45.05    88.53   577.87  1485.67
2.6.8-rc2-mm2    44.09    78.84   577.01  1486.33

Kernbench: (make -j N vmlinux, where N = 16 x num_cpus)
               Elapsed   System     User      CPU
2.6.7            44.77    97.96   576.59  1507.00
2.6.8-rc2        44.83    96.00   575.50  1497.33
2.6.8-rc2-mm2    43.43    86.04   576.26  1524.33

Kernbench: (make -j vmlinux, maximal tasks)
               Elapsed   System     User      CPU
2.6.7            44.25    88.95   575.63  1501.33
2.6.8-rc2        44.03    87.74   573.82  1503.67
2.6.8-rc2-mm2    43.75    86.68   576.98  1518.00
DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not comparable with any published results.
Results are shown as percentages of the first set displayed
SDET 1 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       1.0%
2.6.8-rc2           95.9%       2.3%
2.6.8-rc2-mm2      111.5%       3.3%

SDET 2 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.0%
2.6.8-rc2          100.5%       1.4%
2.6.8-rc2-mm2      115.1%       4.0%

SDET 4 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       1.0%
2.6.8-rc2           99.2%       1.1%
2.6.8-rc2-mm2      111.9%       0.5%

SDET 8 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.2%
2.6.8-rc2          100.2%       1.0%
2.6.8-rc2-mm2      117.4%       0.9%

SDET 16 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.3%
2.6.8-rc2           99.5%       0.3%
2.6.8-rc2-mm2      118.5%       0.6%

SDET 32 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.3%
2.6.8-rc2           99.7%       0.4%
2.6.8-rc2-mm2      102.1%       0.8%

SDET 64 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.2%
2.6.8-rc2          101.6%       0.4%
2.6.8-rc2-mm2      103.2%       0.0%

SDET 128 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.2%
2.6.8-rc2          100.2%       0.1%
2.6.8-rc2-mm2      103.0%       0.3%
Diffprofile for kernbench -j32 (-ve numbers better with mm2)
2135 4.3% default_idle
233 44.2% pte_alloc_one
220 11.9% buffered_rmqueue
164 264.5% schedule
135 5.9% do_page_fault
84 10.7% clear_page_tables
62 62.6% __wake_up_sync
51 60.7% set_page_address
...
-50 -43.9% sys_close
-56 -10.7% __fput
-58 -13.7% set_page_dirty
-61 -10.0% copy_process
-70 -41.4% pipe_writev
-77 -8.1% file_move
-85 -100.0% wake_up_forked_thread
-87 -50.9% pipe_wait
-90 -5.7% path_lookup
-93 -21.2% page_add_anon_rmap
-105 -28.5% release_task
-113 -11.9% do_wp_page
-116 -7.9% link_path_walk
-116 -43.1% pipe_readv
-121 -7.5% atomic_dec_and_lock
-138 -15.4% strnlen_user
-159 -9.4% do_no_page
-167 -2.7% __d_lookup
-214 -59.4% find_idlest_cpu
-230 -6.2% find_trylock_page
-237 -1.6% do_anonymous_page
-255 -7.8% zap_pte_range
-444 -97.6% .text.lock.semaphore
-532 -43.4% Letext
-632 -54.2% __wake_up
-1086 -52.2% finish_task_switch
-1436 -24.6% __copy_to_user_ll
-3079 -46.3% __copy_from_user_ll
-7468 -5.4% total
sdetbench 8 (-ve numbers better with mm2)
...
-58 -33.5% clear_page_tables
-61 -19.3% __d_lookup
-70 -15.2% page_remove_rmap
-71 -71.0% finish_task_switch
-71 -46.4% fput
-72 -56.7% buffered_rmqueue
-73 -53.7% pte_alloc_one
-74 -22.9% __copy_to_user_ll
-75 -31.0% do_no_page
-85 -68.0% free_hot_cold_page
-95 -66.0% __copy_user_intel
-118 -21.1% find_trylock_page
-126 -43.8% do_anonymous_page
-171 -21.6% copy_page_range
-368 -38.8% zap_pte_range
-392 -62.1% do_wp_page
-6262 -11.9% default_idle
-9294 -14.4% total
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 15:10 2.6.8-rc2-mm2 performance improvements (scheduler?) Martin J. Bligh
@ 2004-08-04 15:12 ` Martin J. Bligh
2004-08-04 19:24 ` Andrew Morton
0 siblings, 1 reply; 27+ messages in thread
From: Martin J. Bligh @ 2004-08-04 15:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: Con Kolivas, linux-kernel
Doh. Clipped the paste button on the mouse against the PC just as I
hit send ... pasting a bunch of data in the wrong place ;-) Should've
looked like this:
----------------------------------------------
2.6.8-rc2-mm2 has some significant improvements over 2.6.8-rc2,
particularly at low to mid loads ... at the high loads, it's still
slightly improved, but less significant. Numbers from 16x NUMA-Q.
Kernbench sees most improvement in sys time, but also some elapsed
time ... SDET sees up to 18% more throughput.
I'm also amused to see that the process scalability is now pretty
damned good ... a full -j kernel compile (using up to about 1300
tasks) goes as fast as the -j 256 (the middle one) ... and elapsed
is faster than -j16, even if system is a little higher.
I *think* this is the scheduler changes ... fits in with profile diffs
I've seen before ... diffprofiles at the end. In my experience, higher
copy_to/from_user and finish_task_switch stuff tends to indicate task
thrashing. Note also .text.lock.semaphore numbers in kernbench profile.
The SDET one looks like it's load-balancing better (mainly less idle
time).
Great stuff.
M.
Kernbench: (make -j N vmlinux, where N = 2 x num_cpus)
               Elapsed   System     User      CPU
2.6.7            45.37    90.91   579.75  1479.33
2.6.8-rc2        45.05    88.53   577.87  1485.67
2.6.8-rc2-mm2    44.09    78.84   577.01  1486.33

Kernbench: (make -j N vmlinux, where N = 16 x num_cpus)
               Elapsed   System     User      CPU
2.6.7            44.77    97.96   576.59  1507.00
2.6.8-rc2        44.83    96.00   575.50  1497.33
2.6.8-rc2-mm2    43.43    86.04   576.26  1524.33

Kernbench: (make -j vmlinux, maximal tasks)
               Elapsed   System     User      CPU
2.6.7            44.25    88.95   575.63  1501.33
2.6.8-rc2        44.03    87.74   573.82  1503.67
2.6.8-rc2-mm2    43.75    86.68   576.98  1518.00
DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not comparable with any published results.
Results are shown as percentages of the first set displayed
SDET 1 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       1.0%
2.6.8-rc2           95.9%       2.3%
2.6.8-rc2-mm2      111.5%       3.3%

SDET 2 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.0%
2.6.8-rc2          100.5%       1.4%
2.6.8-rc2-mm2      115.1%       4.0%

SDET 4 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       1.0%
2.6.8-rc2           99.2%       1.1%
2.6.8-rc2-mm2      111.9%       0.5%

SDET 8 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.2%
2.6.8-rc2          100.2%       1.0%
2.6.8-rc2-mm2      117.4%       0.9%

SDET 16 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.3%
2.6.8-rc2           99.5%       0.3%
2.6.8-rc2-mm2      118.5%       0.6%

SDET 32 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.3%
2.6.8-rc2           99.7%       0.4%
2.6.8-rc2-mm2      102.1%       0.8%

SDET 64 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.2%
2.6.8-rc2          101.6%       0.4%
2.6.8-rc2-mm2      103.2%       0.0%

SDET 128 (see disclaimer)
               Throughput   Std. Dev
2.6.7              100.0%       0.2%
2.6.8-rc2          100.2%       0.1%
2.6.8-rc2-mm2      103.0%       0.3%
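The SDET rows above report each kernel's mean throughput as a percentage of the 2.6.7 baseline, per the note before the tables. A minimal sketch of that normalization with made-up numbers (treating the Std. Dev column as scaled by the same baseline mean is an assumption, not something the thread states):

```python
# Sketch of the normalization used in the SDET tables above: each
# kernel's mean throughput is shown as a percentage of the first
# (baseline) kernel.  Illustrative only -- not raw SDET data, and
# scaling the std dev by the baseline mean is an assumption.
from statistics import mean, pstdev

def normalize(runs):
    """runs: {kernel: [throughput, ...]}; the first entry is the baseline."""
    base = None
    table = []
    for kernel, samples in runs.items():
        m = mean(samples)
        if base is None:
            base = m
        table.append((kernel, 100.0 * m / base, 100.0 * pstdev(samples) / base))
    return table
```

With e.g. runs = {"2.6.7": [100.0, 100.0], "2.6.8-rc2-mm2": [115.0, 119.0]} this yields rows like ("2.6.7", 100.0, 0.0) and ("2.6.8-rc2-mm2", 117.0, 2.0).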
Diffprofile for kernbench -j32 (-ve numbers better with mm2)
2135 4.3% default_idle
233 44.2% pte_alloc_one
220 11.9% buffered_rmqueue
164 264.5% schedule
135 5.9% do_page_fault
84 10.7% clear_page_tables
62 62.6% __wake_up_sync
51 60.7% set_page_address
...
-50 -43.9% sys_close
-56 -10.7% __fput
-58 -13.7% set_page_dirty
-61 -10.0% copy_process
-70 -41.4% pipe_writev
-77 -8.1% file_move
-85 -100.0% wake_up_forked_thread
-87 -50.9% pipe_wait
-90 -5.7% path_lookup
-93 -21.2% page_add_anon_rmap
-105 -28.5% release_task
-113 -11.9% do_wp_page
-116 -7.9% link_path_walk
-116 -43.1% pipe_readv
-121 -7.5% atomic_dec_and_lock
-138 -15.4% strnlen_user
-159 -9.4% do_no_page
-167 -2.7% __d_lookup
-214 -59.4% find_idlest_cpu
-230 -6.2% find_trylock_page
-237 -1.6% do_anonymous_page
-255 -7.8% zap_pte_range
-444 -97.6% .text.lock.semaphore
-532 -43.4% Letext
-632 -54.2% __wake_up
-1086 -52.2% finish_task_switch
-1436 -24.6% __copy_to_user_ll
-3079 -46.3% __copy_from_user_ll
-7468 -5.4% total
sdetbench 8 (-ve numbers better with mm2)
...
-58 -33.5% clear_page_tables
-61 -19.3% __d_lookup
-70 -15.2% page_remove_rmap
-71 -71.0% finish_task_switch
-71 -46.4% fput
-72 -56.7% buffered_rmqueue
-73 -53.7% pte_alloc_one
-74 -22.9% __copy_to_user_ll
-75 -31.0% do_no_page
-85 -68.0% free_hot_cold_page
-95 -66.0% __copy_user_intel
-118 -21.1% find_trylock_page
-126 -43.8% do_anonymous_page
-171 -21.6% copy_page_range
-368 -38.8% zap_pte_range
-392 -62.1% do_wp_page
-6262 -11.9% default_idle
-9294 -14.4% total
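For readers unfamiliar with the tool: each diffprofile line above pairs the raw tick delta with the percentage change for one symbol between two kernel profiles. A minimal Python sketch of the idea, assuming plain "ticks symbol" readprofile-style input (Martin's actual diffprofile script is not shown in the thread and may differ):

```python
# Sketch of a diffprofile-style comparison: subtract per-symbol tick
# counts of two profiles and report (delta, percent change, symbol).
# Assumes "ticks symbol" input lines; the real tool may differ.

def read_profile(lines):
    prof = {}
    for line in lines:
        ticks, sym = line.split()
        prof[sym] = int(ticks)
    return prof

def diffprofile(before, after):
    out = []
    for sym in set(before) | set(after):
        old, new = before.get(sym, 0), after.get(sym, 0)
        delta = new - old
        pct = 100.0 * delta / old if old else float("inf")
        out.append((delta, pct, sym))
    # Largest regressions first, biggest wins (most negative deltas)
    # last, matching the listings above.
    return sorted(out, reverse=True)
```

So a symbol that went from 100 ticks to 90 shows up as (-10, -10.0, name), i.e. "-10 -10.0% name" in the listings.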
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 15:12 ` Martin J. Bligh
@ 2004-08-04 19:24 ` Andrew Morton
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 23:44 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Peter Williams
0 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2004-08-04 19:24 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: kernel, linux-kernel, Ingo Molnar
"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> SDET 8 (see disclaimer)
> Throughput Std. Dev
> 2.6.7 100.0% 0.2%
> 2.6.8-rc2 100.2% 1.0%
> 2.6.8-rc2-mm2 117.4% 0.9%
>
> SDET 16 (see disclaimer)
> Throughput Std. Dev
> 2.6.7 100.0% 0.3%
> 2.6.8-rc2 99.5% 0.3%
> 2.6.8-rc2-mm2 118.5% 0.6%
hum, interesting. Can Con's changes affect the inter-node and inter-cpu
balancing decisions, or is this all due to caching effects, reduced context
switching etc?
I don't expect we'll be merging a new CPU scheduler into mainline any time
soon, but we should work to understand where this improvement came from,
and see if we can get the mainline scheduler to catch up.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:24 ` Andrew Morton
@ 2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
` (2 more replies)
2004-08-04 23:44 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Peter Williams
1 sibling, 3 replies; 27+ messages in thread
From: Martin J. Bligh @ 2004-08-04 19:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: kernel, linux-kernel, Ingo Molnar, Rick Lindsley
--On Wednesday, August 04, 2004 12:24:14 -0700 Andrew Morton <akpm@osdl.org> wrote:
> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>>
>> SDET 8 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.2%
>> 2.6.8-rc2 100.2% 1.0%
>> 2.6.8-rc2-mm2 117.4% 0.9%
>>
>> SDET 16 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.3%
>> 2.6.8-rc2 99.5% 0.3%
>> 2.6.8-rc2-mm2 118.5% 0.6%
>
> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> balancing decisions, or is this all due to caching effects, reduced context
> switching etc?
>
> I don't expect we'll be merging a new CPU scheduler into mainline any time
> soon, but we should work to understand where this improvement came from,
> and see if we can get the mainline scheduler to catch up.
Dunno ... really need to take schedstats profiles before and afterwards to
get a better picture of what it's doing. Rick was working on a port.
M.
PS. schedstats is great for this kind of thing. Very useful, minimally
invasive, no impact unless configed in, and nothing measurable even then.
Hint. Hint ;-)
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:34 ` Martin J. Bligh
@ 2004-08-04 19:50 ` Andrew Morton
2004-08-04 20:07 ` Rick Lindsley
2004-08-04 20:10 ` Ingo Molnar
2004-08-04 21:26 ` 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch Ingo Molnar
2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-08-04 19:50 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: kernel, linux-kernel, mingo, ricklind
"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> PS. schedstats is great for this kind of thing. Very useful, minimally
> invasive, no impact unless configed in, and nothing measurable even then.
> Hint. Hint ;-)
Ho hum. It's up to the hordes of scheduler hackers really. If they want,
and can agree upon a patch then go wild. It should be against -mm minus
staircase, as there's a fair amount of scheduler stuff banked up for
post-2.6.8.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:50 ` Andrew Morton
@ 2004-08-04 20:07 ` Rick Lindsley
0 siblings, 0 replies; 27+ messages in thread
From: Rick Lindsley @ 2004-08-04 20:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, kernel, linux-kernel, mingo
> Ho hum. It's up to the hordes of scheduler hackers really. If they
> want, and can agree upon a patch then go wild. It should be against
> -mm minus staircase, as there's a fair amount of scheduler stuff
> banked up for post-2.6.8.
The patch exists for both -mm2 and -mm1, but I've been holding off
posting it until I get a chance to do more than simply compile it.
Our lab machines are back up now so I'll test a (-mm2 - staircase)
patch this afternoon.
Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
@ 2004-08-04 20:10 ` Ingo Molnar
2004-08-04 20:36 ` Martin J. Bligh
2004-08-04 21:26 ` 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch Ingo Molnar
2 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2004-08-04 20:10 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
* Martin J. Bligh <mbligh@aracnet.com> wrote:
> >> SDET 16 (see disclaimer)
> >> Throughput Std. Dev
> >> 2.6.7 100.0% 0.3%
> >> 2.6.8-rc2 99.5% 0.3%
> >> 2.6.8-rc2-mm2 118.5% 0.6%
> >
> > hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> > balancing decisions, or is this all due to caching effects, reduced context
> > switching etc?
Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
unapplied and re-run at least part of your tests?
there are a number of NUMA improvements queued up on -mm, and it would
be nice to know what effect these cause, and what effect the staircase
scheduler has.
Ingo
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 20:10 ` Ingo Molnar
@ 2004-08-04 20:36 ` Martin J. Bligh
2004-08-04 21:31 ` Ingo Molnar
0 siblings, 1 reply; 27+ messages in thread
From: Martin J. Bligh @ 2004-08-04 20:36 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
--On Wednesday, August 04, 2004 22:10:19 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Martin J. Bligh <mbligh@aracnet.com> wrote:
>
>> >> SDET 16 (see disclaimer)
>> >> Throughput Std. Dev
>> >> 2.6.7 100.0% 0.3%
>> >> 2.6.8-rc2 99.5% 0.3%
>> >> 2.6.8-rc2-mm2 118.5% 0.6%
>> >
>> > hum, interesting. Can Con's changes affect the inter-node and inter-cpu
>> > balancing decisions, or is this all due to caching effects, reduced context
>> > switching etc?
>
> Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
> unapplied and re-run at least part of your tests?
>
> there are a number of NUMA improvements queued up on -mm, and it would
> be nice to know what effect these cause, and what effect the staircase
> scheduler has.
Sure. I presume it's just the one patch:
staircase-cpu-scheduler-268-rc2-mm1.patch
which seemed to back out clean and is building now. Scream if that's not
all of it ...
M.
* Re: 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
2004-08-04 20:10 ` Ingo Molnar
@ 2004-08-04 21:26 ` Ingo Molnar
2004-08-04 21:34 ` Sam Ravnborg
2004-08-04 22:10 ` Rick Lindsley
2 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2004-08-04 21:26 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley, Nick Piggin
[-- Attachment #1: Type: text/plain, Size: 1136 bytes --]
* Martin J. Bligh <mbligh@aracnet.com> wrote:
> PS. schedstats is great for this kind of thing. Very useful, minimally
> invasive, no impact unless configed in, and nothing measurable even
> then. Hint. Hint ;-)
no objections from me. I like Nick's variant most:
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7/2.6.7-mm1/broken-out/schedstats.patch
i have merged this patch to 2.6.8-rc2-mm2 (to be applied before the
staircase scheduler, attached as schedstat-2.6.8-rc2-mm2-A4.patch).
I fixed a number of cleanliness items, readying this patch for a
mainline merge:
- added kernel/Kconfig.debug for generic debug options (such as
schedstat) and removed tons of debug options from various arch
Kconfigs, instead of adding a boatload of new SCHEDSTAT entries.
This felt good.
- removed sbc_pushed (there's no clone balancing anymore). This also
fixes the stats7.c parser at:
http://www.zip.com.au/~akpm/linux/patches/stuff/stats7.c
- moved some definitions from sched.h to sched.c
- style & whitespace police
tested it on x86.
is this patch fine with everyone? Rick, Nick?
Ingo
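As a rough illustration of consuming the format the attached patch emits, the "cpuN ..." lines written by show_schedstat() can be split mechanically. Field names below are copied from the patch's seq_printf() calls; this is a sketch, not Rick's actual stats7.c parser:

```python
# Parse the "cpuN ..." lines emitted by show_schedstat() in the patch
# below (SCHEDSTAT_VERSION 7).  Per cpu: 9 runqueue counters, then for
# each sched_domain: "domainN <cpumask>" plus 3x6 load-balance counters
# (one group per idle type) and 7 more (alb_cnt, alb_failed, alb_pushed,
# sched_wake_remote, plb_pulled, afw_pulled, sbe_pushed).  Sketch only.

CPU_FIELDS = [
    "yld_both_empty", "yld_act_empty", "yld_exp_empty", "yld_cnt",
    "sched_switch", "sched_cnt", "sched_idle",
    "sched_wake", "sched_wake_local",
]
DOMAIN_COUNTERS = 3 * 6 + 7

def parse_schedstat(text):
    stats = {}
    for line in text.splitlines():
        tok = line.split()
        if not tok or not tok[0].startswith("cpu"):
            continue  # skip the "version" and "timestamp" lines
        vals = dict(zip(CPU_FIELDS, map(int, tok[1:1 + len(CPU_FIELDS)])))
        vals["domains"] = []
        rest = tok[1 + len(CPU_FIELDS):]
        while rest:
            name, mask = rest[0], rest[1]
            counters = [int(x) for x in rest[2:2 + DOMAIN_COUNTERS]]
            vals["domains"].append((name, mask, counters))
            rest = rest[2 + DOMAIN_COUNTERS:]
        stats[tok[0]] = vals
    return stats
```

A schedstat diff over a benchmark run (read, run, read, subtract) is then enough to answer the load-balancing questions raised earlier in the thread.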
[-- Attachment #2: schedstat-2.6.8-rc2-mm2-A4.patch --]
[-- Type: text/plain, Size: 22978 bytes --]
--- linux/arch/i386/Kconfig.orig
+++ linux/arch/i386/Kconfig
@@ -1196,14 +1196,9 @@ source "fs/Kconfig"
source "arch/i386/oprofile/Kconfig"
-
menu "Kernel hacking"
-config DEBUG_KERNEL
- bool "Kernel debugging"
- help
- Say Y here if you are developing drivers or trying to debug and
- identify kernel problems.
+source "kernel/Kconfig.debug"
config EARLY_PRINTK
bool "Early printk" if EMBEDDED
@@ -1231,14 +1226,6 @@ config DEBUG_STACK_USAGE
This option will slow down process creation somewhat.
-config DEBUG_SLAB
- bool "Debug memory allocations"
- depends on DEBUG_KERNEL
- help
- Say Y here to have the kernel do limited verification on memory
- allocation as well as poisoning memory on free to catch use of freed
- memory.
-
config MAGIC_SYSRQ
bool "Magic SysRq key"
depends on DEBUG_KERNEL
@@ -1253,23 +1240,6 @@ config MAGIC_SYSRQ
keys are documented in <file:Documentation/sysrq.txt>. Don't say Y
unless you really know what this hack does.
-config DEBUG_SPINLOCK
- bool "Spinlock debugging"
- depends on DEBUG_KERNEL
- help
- Say Y here and build SMP to catch missing spinlock initialization
- and certain other kinds of spinlock errors commonly made. This is
- best used in conjunction with the NMI watchdog so that spinlock
- deadlocks are also debuggable.
-
-config DEBUG_PAGEALLOC
- bool "Page alloc debugging"
- depends on DEBUG_KERNEL
- help
- Unmap pages from the kernel linear mapping after free_pages().
- This results in a large slowdown, but helps to find certain types
- of memory corruptions.
-
config DEBUG_HIGHMEM
bool "Highmem debugging"
depends on DEBUG_KERNEL && HIGHMEM
@@ -1277,28 +1247,6 @@ config DEBUG_HIGHMEM
This options enables addition error checking for high memory systems.
Disable for production systems.
-config DEBUG_INFO
- bool "Compile the kernel with debug info"
- depends on DEBUG_KERNEL
- help
- If you say Y here the resulting kernel image will include
- debugging info resulting in a larger kernel image.
- Say Y here only if you plan to use gdb to debug the kernel.
- If you don't debug the kernel, you can say N.
-
-config LOCKMETER
- bool "Kernel lock metering"
- depends on SMP
- help
- Say Y to enable kernel lock metering, which adds overhead to SMP locks,
- but allows you to see various statistics using the lockstat command.
-
-config DEBUG_SPINLOCK_SLEEP
- bool "Sleep-inside-spinlock checking"
- help
- If you say Y here, various routines which may sleep will become very
- noisy if they are called with a spinlock held.
-
config KGDB
bool "Include kgdb kernel debugger"
depends on DEBUG_KERNEL
--- linux/arch/ppc/Kconfig.orig
+++ linux/arch/ppc/Kconfig
@@ -1265,12 +1265,7 @@ source "arch/ppc/oprofile/Kconfig"
menu "Kernel hacking"
-config DEBUG_KERNEL
- bool "Kernel debugging"
-
-config DEBUG_SLAB
- bool "Debug memory allocations"
- depends on DEBUG_KERNEL
+source "kernel/Kconfig.debug"
config MAGIC_SYSRQ
bool "Magic SysRq key"
@@ -1286,13 +1281,6 @@ config MAGIC_SYSRQ
keys are documented in <file:Documentation/sysrq.txt>. Don't say Y
unless you really know what this hack does.
-config DEBUG_SPINLOCK
- bool "Spinlock debugging"
- depends on DEBUG_KERNEL
- help
- Say Y here and to CONFIG_SMP to include code to check for missing
- spinlock initialization and some other common spinlock errors.
-
config DEBUG_HIGHMEM
bool "Highmem debugging"
depends on DEBUG_KERNEL && HIGHMEM
@@ -1300,13 +1288,6 @@ config DEBUG_HIGHMEM
This options enables additional error checking for high memory
systems. Disable for production systems.
-config DEBUG_SPINLOCK_SLEEP
- bool "Sleep-inside-spinlock checking"
- depends on DEBUG_KERNEL
- help
- If you say Y here, various routines which may sleep will become very
- noisy if they are called with a spinlock held.
-
config KGDB
bool "Include kgdb kernel debugger"
depends on DEBUG_KERNEL && (BROKEN || PPC_GEN550 || 4xx)
@@ -1358,16 +1339,6 @@ config BDI_SWITCH
Unless you are intending to debug the kernel with one of these
machines, say N here.
-config DEBUG_INFO
- bool "Compile the kernel with debug info"
- depends on DEBUG_KERNEL
- help
- If you say Y here the resulting kernel image will include
- debugging info resulting in a larger kernel image.
- Say Y here only if you plan to use some sort of debugger to
- debug the kernel.
- If you don't debug the kernel, you can say N.
-
config BOOTX_TEXT
bool "Support for early boot text console (BootX or OpenFirmware only)"
depends PPC_OF
--- linux/arch/ia64/Kconfig.orig
+++ linux/arch/ia64/Kconfig
@@ -596,11 +596,7 @@ config IA64_GRANULE_64MB
endchoice
-config DEBUG_KERNEL
- bool "Kernel debugging"
- help
- Say Y here if you are developing drivers or trying to debug and
- identify kernel problems.
+source "kernel/Kconfig.debug"
config IA64_PRINT_HAZARDS
bool "Print possible IA-64 dependency violations to console"
@@ -635,29 +631,6 @@ config MAGIC_SYSRQ
keys are documented in <file:Documentation/sysrq.txt>. Don't say Y
unless you really know what this hack does.
-config DEBUG_SLAB
- bool "Debug memory allocations"
- depends on DEBUG_KERNEL
- help
- Say Y here to have the kernel do limited verification on memory
- allocation as well as poisoning memory on free to catch use of freed
- memory.
-
-config DEBUG_SPINLOCK
- bool "Spinlock debugging"
- depends on DEBUG_KERNEL
- help
- Say Y here and build SMP to catch missing spinlock initialization
- and certain other kinds of spinlock errors commonly made. This is
- best used in conjunction with the NMI watchdog so that spinlock
- deadlocks are also debuggable.
-
-config DEBUG_SPINLOCK_SLEEP
- bool "Sleep-inside-spinlock checking"
- help
- If you say Y here, various routines which may sleep will become very
- noisy if they are called with a spinlock held.
-
config IA64_DEBUG_CMPXCHG
bool "Turn on compare-and-exchange bug checking (slow!)"
depends on DEBUG_KERNEL
@@ -675,22 +648,6 @@ config IA64_DEBUG_IRQ
and restore instructions. It's useful for tracking down spinlock
problems, but slow! If you're unsure, select N.
-config DEBUG_INFO
- bool "Compile the kernel with debug info"
- depends on DEBUG_KERNEL
- help
- If you say Y here the resulting kernel image will include
- debugging info resulting in a larger kernel image.
- Say Y here only if you plan to use gdb to debug the kernel.
- If you don't debug the kernel, you can say N.
-
-config LOCKMETER
- bool "Kernel lock metering"
- depends on SMP
- help
- Say Y to enable kernel lock metering, which adds overhead to SMP locks,
- but allows you to see various statistics using the lockstat command.
-
config SYSVIPC_COMPAT
bool
depends on COMPAT && SYSVIPC
--- linux/arch/ppc64/Kconfig.orig
+++ linux/arch/ppc64/Kconfig
@@ -345,11 +345,7 @@ source "arch/ppc64/oprofile/Kconfig"
menu "Kernel hacking"
-config DEBUG_KERNEL
- bool "Kernel debugging"
- help
- Say Y here if you are developing drivers or trying to debug and
- identify kernel problems.
+source "kernel/Kconfig.debug"
config DEBUG_STACKOVERFLOW
bool "Check for stack overflows"
@@ -364,14 +360,6 @@ config DEBUG_STACK_USAGE
This option will slow down process creation somewhat.
-config DEBUG_SLAB
- bool "Debug memory allocations"
- depends on DEBUG_KERNEL
- help
- Say Y here to have the kernel do limited verification on memory
- allocation as well as poisoning memory on free to catch use of freed
- memory.
-
config MAGIC_SYSRQ
bool "Magic SysRq key"
depends on DEBUG_KERNEL
@@ -408,15 +396,6 @@ config PPCDBG
bool "Include PPCDBG realtime debugging"
depends on DEBUG_KERNEL
-config DEBUG_INFO
- bool "Compile the kernel with debug info"
- depends on DEBUG_KERNEL
- help
- If you say Y here the resulting kernel image will include
- debugging info resulting in a larger kernel image.
- Say Y here only if you plan to use gdb to debug the kernel.
- If you don't debug the kernel, you can say N.
-
config IRQSTACKS
bool "Use separate kernel stacks when processing interrupts"
help
@@ -424,8 +403,6 @@ config IRQSTACKS
for handling hard and soft interrupts. This can help avoid
overflowing the process kernel stacks.
-endmenu
-
config SPINLINE
bool "Inline spinlock code at each call site"
depends on SMP && !PPC_SPLPAR && !PPC_ISERIES
@@ -436,6 +413,8 @@ config SPINLINE
If in doubt, say N.
+endmenu
+
source "security/Kconfig"
source "crypto/Kconfig"
--- linux/arch/x86_64/Kconfig.orig
+++ linux/arch/x86_64/Kconfig
@@ -402,19 +402,7 @@ source "arch/x86_64/oprofile/Kconfig"
menu "Kernel hacking"
-config DEBUG_KERNEL
- bool "Kernel debugging"
- help
- Say Y here if you are developing drivers or trying to debug and
- identify kernel problems.
-
-config DEBUG_SLAB
- bool "Debug memory allocations"
- depends on DEBUG_KERNEL
- help
- Say Y here to have the kernel do limited verification on memory
- allocation as well as poisoning memory on free to catch use of freed
- memory.
+source "kernel/Kconfig.debug"
config MAGIC_SYSRQ
bool "Magic SysRq key"
@@ -429,15 +417,6 @@ config MAGIC_SYSRQ
keys are documented in <file:Documentation/sysrq.txt>. Don't say Y
unless you really know what this hack does.
-config DEBUG_SPINLOCK
- bool "Spinlock debugging"
- depends on DEBUG_KERNEL
- help
- Say Y here and build SMP to catch missing spinlock initialization
- and certain other kinds of spinlock errors commonly made. This is
- best used in conjunction with the NMI watchdog so that spinlock
- deadlocks are also debuggable.
-
# !SMP for now because the context switch early causes GPF in segment reloading
# and the GS base checking does the wrong thing then, causing a hang.
config CHECKING
@@ -454,17 +433,6 @@ config INIT_DEBUG
Fill __init and __initdata at the end of boot. This helps debugging
illegal uses of __init and __initdata after initialization.
-config DEBUG_INFO
- bool "Compile the kernel with debug info"
- depends on DEBUG_KERNEL
- default n
- help
- If you say Y here the resulting kernel image will include
- debugging info resulting in a larger kernel image.
- Say Y here only if you plan to use gdb to debug the kernel.
- Please note that this option requires new binutils.
- If you don't debug the kernel, you can say N.
-
config FRAME_POINTER
bool "Compile the kernel with frame pointers"
help
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -632,6 +632,32 @@ struct sched_domain {
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
+
+#ifdef CONFIG_SCHEDSTATS
+ unsigned long lb_cnt[3];
+ unsigned long lb_balanced[3];
+ unsigned long lb_failed[3];
+ unsigned long lb_pulled[3];
+ unsigned long lb_hot_pulled[3];
+ unsigned long lb_imbalance[3];
+
+ /* Active load balancing */
+ unsigned long alb_cnt;
+ unsigned long alb_failed;
+ unsigned long alb_pushed;
+
+ /* Wakeups */
+ unsigned long sched_wake_remote;
+
+ /* Passive load balancing */
+ unsigned long plb_pulled;
+
+ /* Affine wakeups */
+ unsigned long afw_pulled;
+
+ /* SD_BALANCE_EXEC balances */
+ unsigned long sbe_pushed;
+#endif
};
#ifndef ARCH_HAS_SCHED_TUNE
--- linux/fs/proc/proc_misc.c.orig
+++ linux/fs/proc/proc_misc.c
@@ -282,6 +282,10 @@ static struct file_operations proc_vmsta
.release = seq_release,
};
+#ifdef CONFIG_SCHEDSTATS
+extern struct file_operations proc_schedstat_operations;
+#endif
+
#ifdef CONFIG_PROC_HARDWARE
static int hardware_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
@@ -711,6 +715,9 @@ void __init proc_misc_init(void)
#ifdef CONFIG_MODULES
create_seq_entry("modules", 0, &proc_modules_operations);
#endif
+#ifdef CONFIG_SCHEDSTATS
+ create_seq_entry("schedstat", 0, &proc_schedstat_operations);
+#endif
#ifdef CONFIG_PROC_KCORE
proc_root_kcore = create_proc_entry("kcore", S_IRUSR, NULL);
if (proc_root_kcore) {
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -41,6 +41,8 @@
#include <linux/percpu.h>
#include <linux/perfctr.h>
#include <linux/kthread.h>
+#include <linux/seq_file.h>
+#include <linux/times.h>
#include <asm/tlb.h>
#include <asm/unistd.h>
@@ -234,6 +236,22 @@ struct runqueue {
task_t *migration_thread;
struct list_head migration_queue;
#endif
+#ifdef CONFIG_SCHEDSTATS
+ /* sys_sched_yield stats */
+ unsigned long yld_exp_empty;
+ unsigned long yld_act_empty;
+ unsigned long yld_both_empty;
+ unsigned long yld_cnt;
+
+ /* schedule stats */
+ unsigned long sched_cnt;
+ unsigned long sched_switch;
+ unsigned long sched_idle;
+
+ /* wake stats */
+ unsigned long sched_wake;
+ unsigned long sched_wake_local;
+#endif
};
static DEFINE_PER_CPU(struct runqueue, runqueues);
@@ -280,6 +298,97 @@ static inline void task_rq_unlock(runque
spin_unlock_irqrestore(&rq->lock, *flags);
}
+
+#ifdef CONFIG_SCHEDSTATS
+
+/*
+ * bump this up when changing the output format or the meaning of an existing
+ * format, so that tools can adapt (or abort)
+ */
+#define SCHEDSTAT_VERSION 7
+
+static int show_schedstat(struct seq_file *seq, void *v)
+{
+ int i;
+
+ seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
+ seq_printf(seq, "timestamp %lu\n", jiffies);
+ for_each_cpu(i) {
+ /* Include offline CPUs */
+ runqueue_t *rq = cpu_rq(i);
+#ifdef CONFIG_SMP
+ struct sched_domain *sd;
+ int j = 0;
+#endif
+
+ seq_printf(seq,
+ "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+ i,
+ rq->yld_both_empty, rq->yld_act_empty,
+ rq->yld_exp_empty, rq->yld_cnt,
+ rq->sched_switch, rq->sched_cnt,
+ rq->sched_idle, rq->sched_wake, rq->sched_wake_local);
+#ifdef CONFIG_SMP
+ for_each_domain(i, sd) {
+ char str[NR_CPUS];
+ int k;
+ cpumask_scnprintf(str, NR_CPUS, sd->span);
+ seq_printf(seq, " domain%d %s", j++, str);
+
+ for (k = 0; k < 3; k++) {
+ seq_printf(seq, " %lu %lu %lu %lu %lu %lu",
+ sd->lb_cnt[k], sd->lb_balanced[k],
+ sd->lb_failed[k], sd->lb_pulled[k],
+ sd->lb_hot_pulled[k], sd->lb_imbalance[k]);
+ }
+
+ seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu",
+ sd->alb_cnt, sd->alb_failed,
+ sd->alb_pushed, sd->sched_wake_remote,
+ sd->plb_pulled, sd->afw_pulled,
+ sd->sbe_pushed);
+ }
+#endif
+
+ seq_printf(seq, "\n");
+ }
+
+ return 0;
+}
+
+static int schedstat_open(struct inode *inode, struct file *file)
+{
+ unsigned size = 4096 * (1 + num_online_cpus() / 32);
+ char *buf = kmalloc(size, GFP_KERNEL);
+ struct seq_file *m;
+ int res;
+
+ if (!buf)
+ return -ENOMEM;
+ res = single_open(file, show_schedstat, NULL);
+ if (!res) {
+ m = file->private_data;
+ m->buf = buf;
+ m->size = size;
+ } else
+ kfree(buf);
+ return res;
+}
+
+struct file_operations proc_schedstat_operations = {
+ .open = schedstat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+# define schedstat_inc(s, field) ((s)->field++)
+# define schedstat_add(s, field, amt) ((s)->field += (amt))
+#else
+# define schedstat_inc(s, field) do { } while (0)
+# define schedstat_add(d, field, amt) do { } while (0)
+#endif
+
/*
* rq_lock - lock a given runqueue and disable interrupts.
*/
@@ -751,7 +860,24 @@ static int try_to_wake_up(task_t * p, un
cpu = task_cpu(p);
this_cpu = smp_processor_id();
+ schedstat_inc(rq, sched_wake);
+#ifndef CONFIG_SMP
+ schedstat_inc(rq, sched_wake_local);
+#endif
+
#ifdef CONFIG_SMP
+#ifdef CONFIG_SCHEDSTATS
+ if (cpu == this_cpu)
+ schedstat_inc(rq, sched_wake_local);
+ else {
+ for_each_domain(this_cpu, sd)
+ if (cpu_isset(cpu, sd->span))
+ break;
+ if (sd)
+ schedstat_inc(sd, sched_wake_remote);
+ }
+#endif
+
if (unlikely(task_running(rq, p)))
goto out_activate;
@@ -796,8 +922,17 @@ static int try_to_wake_up(task_t * p, un
* Now sd has SD_WAKE_AFFINE and p is cache cold in sd
* or sd has SD_WAKE_BALANCE and there is an imbalance
*/
- if (cpu_isset(cpu, sd->span))
+ if (cpu_isset(cpu, sd->span)) {
+#ifdef CONFIG_SCHEDSTATS
+ if ((sd->flags & SD_WAKE_AFFINE) &&
+ !task_hot(p, rq->timestamp_last_tick, sd))
+ schedstat_inc(sd, afw_pulled);
+ else if ((sd->flags & SD_WAKE_BALANCE) &&
+ imbalance*this_load <= 100*load)
+ schedstat_inc(sd, plb_pulled);
+#endif
goto out_set_cpu;
+ }
}
}
@@ -1321,6 +1456,7 @@ void sched_exec(void)
if (sd) {
new_cpu = find_idlest_cpu(current, this_cpu, sd);
if (new_cpu != this_cpu) {
+ schedstat_inc(sd, sbe_pushed);
put_cpu();
sched_migrate_task(current, new_cpu);
return;
@@ -1378,6 +1514,13 @@ int can_migrate_task(task_t *p, runqueue
return 0;
}
+#ifdef CONFIG_SCHEDSTATS
+ if (!task_hot(p, rq->timestamp_last_tick, sd))
+ schedstat_inc(sd, lb_pulled[idle]);
+ else
+ schedstat_inc(sd, lb_hot_pulled[idle]);
+#endif
+
return 1;
}
@@ -1638,14 +1781,20 @@ static int load_balance(int this_cpu, ru
int nr_moved;
spin_lock(&this_rq->lock);
+ schedstat_inc(sd, lb_cnt[idle]);
group = find_busiest_group(sd, this_cpu, &imbalance, idle);
- if (!group)
+ if (!group) {
+ schedstat_inc(sd, lb_balanced[idle]);
goto out_balanced;
+ }
busiest = find_busiest_queue(group);
- if (!busiest)
+ if (!busiest) {
+ schedstat_inc(sd, lb_balanced[idle]);
goto out_balanced;
+ }
+
/*
* This should be "impossible", but since load
* balancing is inherently racy and statistical,
@@ -1655,6 +1804,7 @@ static int load_balance(int this_cpu, ru
WARN_ON(1);
goto out_balanced;
}
+ schedstat_add(sd, lb_imbalance[idle], imbalance);
nr_moved = 0;
if (busiest->nr_running > 1) {
@@ -1672,6 +1822,7 @@ static int load_balance(int this_cpu, ru
spin_unlock(&this_rq->lock);
if (!nr_moved) {
+ schedstat_inc(sd, lb_failed[idle]);
sd->nr_balance_failed++;
if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
@@ -1726,13 +1877,20 @@ static int load_balance_newidle(int this
unsigned long imbalance;
int nr_moved = 0;
+ schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
- if (!group)
+ if (!group) {
+ schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
goto out;
+ }
busiest = find_busiest_queue(group);
- if (!busiest || busiest == this_rq)
+ if (!busiest || busiest == this_rq) {
+ schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
goto out;
+ }
+
+ schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance);
/* Attempt to move tasks */
double_lock_balance(this_rq, busiest);
@@ -1777,6 +1935,7 @@ static void active_load_balance(runqueue
struct sched_domain *sd;
struct sched_group *group, *busy_group;
int i;
+ int moved = 0;
if (busiest->nr_running <= 1)
return;
@@ -1788,6 +1947,7 @@ static void active_load_balance(runqueue
WARN_ON(1);
return;
}
+ schedstat_inc(sd, alb_cnt);
group = sd->groups;
while (!cpu_isset(busiest_cpu, group->cpumask))
@@ -1824,11 +1984,16 @@ static void active_load_balance(runqueue
if (unlikely(busiest == rq))
goto next_group;
double_lock_balance(busiest, rq);
- move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
+ moved += move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
spin_unlock(&rq->lock);
next_group:
group = group->next;
} while (group != sd->groups);
+
+ if (moved)
+ schedstat_add(sd, alb_pushed, moved);
+ else
+ schedstat_inc(sd, alb_failed);
}
/*
@@ -2181,6 +2346,7 @@ need_resched:
}
release_kernel_lock(prev);
+ schedstat_inc(rq, sched_cnt);
now = sched_clock();
if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
@@ -2218,6 +2384,7 @@ need_resched:
next = rq->idle;
rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu, rq);
+ schedstat_inc(rq, sched_idle);
goto switch_tasks;
}
}
@@ -2227,6 +2394,7 @@ need_resched:
/*
* Switch the active and expired arrays.
*/
+ schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
@@ -2986,6 +3154,7 @@ asmlinkage long sys_sched_yield(void)
prio_array_t *array = current->array;
prio_array_t *target = rq->expired;
+ schedstat_inc(rq, yld_cnt);
/*
* We implement yielding by moving the task into the expired
* queue.
--- linux/kernel/Kconfig.debug.orig
+++ linux/kernel/Kconfig.debug
@@ -0,0 +1,71 @@
+#
+# Kernel debug configuration - the generic bits
+#
+
+config DEBUG_KERNEL
+ bool "Kernel debugging"
+ help
+ Say Y here if you are developing drivers or trying to debug and
+ identify kernel problems.
+
+config DEBUG_SLAB
+ bool "Debug memory allocations"
+ depends on DEBUG_KERNEL
+ help
+ Say Y here to have the kernel do limited verification on memory
+ allocation as well as poisoning memory on free to catch use of freed
+ memory.
+
+config DEBUG_SPINLOCK
+ bool "Spinlock debugging"
+ depends on DEBUG_KERNEL
+ help
+ Say Y here and build SMP to catch missing spinlock initialization
+ and certain other kinds of spinlock errors commonly made. This is
+ best used in conjunction with the NMI watchdog so that spinlock
+ deadlocks are also debuggable.
+
+config DEBUG_SPINLOCK_SLEEP
+ bool "Sleep-inside-spinlock checking"
+ depends on DEBUG_KERNEL && SMP
+ help
+ If you say Y here, various routines which may sleep will become very
+ noisy if they are called with a spinlock held.
+
+config DEBUG_PAGEALLOC
+ bool "Page alloc debugging"
+ depends on DEBUG_KERNEL
+ help
+ Unmap pages from the kernel linear mapping after free_pages().
+ This results in a large slowdown, but helps to find certain types
+ of memory corruptions.
+
+config DEBUG_INFO
+ bool "Compile the kernel with debug info"
+ depends on DEBUG_KERNEL
+ help
+ If you say Y here the resulting kernel image will include
+ debugging info resulting in a larger kernel image.
+ Say Y here only if you plan to use gdb to debug the kernel.
+ If you don't debug the kernel, you can say N.
+
+config LOCKMETER
+ bool "Kernel lock metering"
+ depends on DEBUG_KERNEL && SMP
+ help
+ Say Y to enable kernel lock metering, which adds overhead to SMP locks,
+ but allows you to see various statistics using the lockstat command.
+
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on DEBUG_KERNEL && PROC_FS
+ default n
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+	  stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 20:36 ` Martin J. Bligh
@ 2004-08-04 21:31 ` Ingo Molnar
2004-08-04 23:34 ` Martin J. Bligh
0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2004-08-04 21:31 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
* Martin J. Bligh <mbligh@aracnet.com> wrote:
> > Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
> > unapplied and re-run at least part of your tests?
> >
> > there are a number of NUMA improvements queued up on -mm, and it would
> > be nice to know what effect these cause, and what effect the staircase
> > scheduler has.
>
> Sure. I presume it's just the one patch:
>
> staircase-cpu-scheduler-268-rc2-mm1.patch
>
> which seemed to back out clean and is building now. Scream if that's
> not all of it ...
correct, that's the end of the scheduler patch-queue and it works fine
if unapplied. (The schedstats patch i just sent applies cleanly to that
base, in case you need one.)
Ingo
* Re: 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch
2004-08-04 21:26 ` 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch Ingo Molnar
@ 2004-08-04 21:34 ` Sam Ravnborg
2004-08-04 21:46 ` Randy.Dunlap
2004-08-04 22:13 ` Ingo Molnar
2004-08-04 22:10 ` Rick Lindsley
1 sibling, 2 replies; 27+ messages in thread
From: Sam Ravnborg @ 2004-08-04 21:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: Martin J. Bligh, Andrew Morton, kernel, linux-kernel,
Rick Lindsley, Nick Piggin
On Wed, Aug 04, 2004 at 11:26:58PM +0200, Ingo Molnar wrote:
> I fixed a number of cleanliness items, readying this patch for a
> mainline merge:
>
> - added kernel/Kconfig.debug for generic debug options (such as
> schedstat) and removed tons of debug options from various arch
> Kconfig's, instead of adding a boatload of new SCHEDSTAT entries.
> This felt good.
Randy Dunlap has posted a patch for this several times.
This has seen some review, so it is the preferred starting point.
Sam
* Re: 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch
2004-08-04 21:34 ` Sam Ravnborg
@ 2004-08-04 21:46 ` Randy.Dunlap
2004-08-04 22:13 ` Ingo Molnar
1 sibling, 0 replies; 27+ messages in thread
From: Randy.Dunlap @ 2004-08-04 21:46 UTC (permalink / raw)
To: Sam Ravnborg
Cc: mingo, mbligh, akpm, kernel, linux-kernel, ricklind, nickpiggin
On Wed, 4 Aug 2004 23:34:54 +0200 Sam Ravnborg wrote:
| On Wed, Aug 04, 2004 at 11:26:58PM +0200, Ingo Molnar wrote:
|
| > I fixed a number of cleanliness items, readying this patch for a
| > mainline merge:
| >
| > - added kernel/Kconfig.debug for generic debug options (such as
| > schedstat) and removed tons of debug options from various arch
| > Kconfig's, instead of adding a boatload of new SCHEDSTAT entries.
| > This felt good.
|
| Randy Dunlap has posted a patch for this several times.
| This has seen some review, so it is the preferred starting point.
Latest version is
http://developer.osdl.org/rddunlap/kconfig/kconfig-debug-268-rc2bk9.patch
It applies cleanly to 2.6.8-rc3 and might make it into
early 2.6.8++ patches.
--
~Randy
* Re: 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch
2004-08-04 21:26 ` 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch Ingo Molnar
2004-08-04 21:34 ` Sam Ravnborg
@ 2004-08-04 22:10 ` Rick Lindsley
[not found] ` <20040805143249.GA23967@elte.hu>
1 sibling, 1 reply; 27+ messages in thread
From: Rick Lindsley @ 2004-08-04 22:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Martin J. Bligh, Andrew Morton, kernel, linux-kernel, Nick Piggin
Nick removed a fair amount of statistics that he wasn't using. The
full patch gathers more information. In particular, his patch doesn't
include the code to measure the latency between the time a process is
made runnable and the time it hits a processor, which will be key to
measuring interactivity changes.
He passed his changes back to me, and I finished merging them
with the current statistics patches just before OLS. I believe this is
largely a superset of the patch you grabbed and should port relatively
easily too.
Versions also exist for
2.6.8-rc2
2.6.8-rc2-mm1
2.6.8-rc2-mm2
at
http://eaglet.rain.com/rick/linux/schedstat/patches/
and within 24 hours at
http://oss.software.ibm.com/linux/patches/?patch_id=730&show=all
The version below is for 2.6.8-rc2-mm2 without the staircase code and has
been compiled cleanly but not yet run.
Rick
diff -rupN linux-2.6.8-rc2-mm2b/Documentation/sched-stats.txt linux-2.6.8-rc2-mm2b-ss/Documentation/sched-stats.txt
--- linux-2.6.8-rc2-mm2b/Documentation/sched-stats.txt Wed Dec 31 16:00:00 1969
+++ linux-2.6.8-rc2-mm2b-ss/Documentation/sched-stats.txt Wed Aug 4 14:35:14 2004
@@ -0,0 +1,150 @@
+Version 9 of schedstats introduces support for sched_domains, which
+hit the mainline kernel in 2.6.7. Some counters make more sense to be
+per-runqueue; others to be per-domain.
+
+In version 9 of schedstat, there is at least one level of domain
+statistics for each cpu listed, and there may well be more than one
+domain. Domains have no particular names in this implementation, but
+the highest numbered one typically arbitrates balancing across all the
+cpus on the machine, while domain0 is the most tightly focused domain,
+sometimes balancing only between pairs of cpus. At this time, there
+are no architectures which need more than three domain levels. The first
+field in the domain stats is a bit map indicating which cpus are affected
+by that domain.
+
+These fields are counters, and only increment. Programs which make use
+of these will need to start with a baseline observation and then calculate
+the change in the counters at each subsequent observation. A perl script
+which does this for many of the fields is available at
+
+ http://eaglet.rain.com/rick/linux/schedstat/
+
+Note that any such script will necessarily be version-specific, as the main
+reason to change versions is changes in the output format. For those wishing
+to write their own scripts, the fields are described here.
+
+CPU statistics
+--------------
+cpu<N> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
+
+NOTE: In the sched_yield() statistics, the active queue is considered empty
+ if it has only one process in it, since obviously the process calling
+ sched_yield() is that process.
+
+First four fields are sched_yield() statistics:
+ 1) # of times both the active and the expired queue were empty
+ 2) # of times just the active queue was empty
+ 3) # of times just the expired queue was empty
+ 4) # of times sched_yield() was called
+
+Next four are schedule() statistics:
+ 5) # of times the active queue had at least one other process on it.
+ 6) # of times we switched to the expired queue and reused it
+ 7) # of times schedule() was called
+ 8) # of times schedule() left the processor idle
+
+Next four are active_load_balance() statistics:
+ 9) # of times active_load_balance() was called
+ 10) # of times active_load_balance() caused this cpu to gain a task
+ 11) # of times active_load_balance() caused this cpu to lose a task
+ 12) # of times active_load_balance() tried to move a task and failed
+
+Next three are try_to_wake_up() statistics:
+ 13) # of times try_to_wake_up() was called
+ 14) # of times try_to_wake_up() successfully moved the awakening task
+ 15) # of times try_to_wake_up() attempted to move the awakening task
+
+Next two are wake_up_forked_thread() statistics:
+ 16) # of times wake_up_forked_thread() was called
+ 17) # of times wake_up_forked_thread() successfully moved the forked task
+
+Next one is a sched_migrate_task() statistic:
+ 18) # of times sched_migrate_task() was called
+
+Next one is a sched_balance_exec() statistic:
+ 19) # of times sched_balance_exec() was called
+
+Next three are statistics describing scheduling latency:
+ 20) sum of all time spent running by tasks on this processor (in ms)
+ 21) sum of all time spent waiting to run by tasks on this processor (in ms)
+ 22) # of tasks (not necessarily unique) given to the processor
+
+The last six are statistics dealing with pull_task():
+ 23) # of times pull_task() moved a task to this cpu when newly idle
+ 24) # of times pull_task() stole a task from this cpu when another cpu
+ was newly idle
+ 25) # of times pull_task() moved a task to this cpu when idle
+ 26) # of times pull_task() stole a task from this cpu when another cpu
+ was idle
+ 27) # of times pull_task() moved a task to this cpu when busy
+ 28) # of times pull_task() stole a task from this cpu when another cpu
+ was busy
+
+
+Domain statistics
+-----------------
+One of these is produced per domain for each cpu described.
+
+domain<N> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
+
+The first field is a bit mask indicating what cpus this domain operates over.
+
+The next fifteen are a variety of load_balance() statistics:
+
+ 1) # of times in this domain load_balance() was called when the cpu
+ was idle
+ 2) # of times in this domain load_balance() was called when the cpu
+ was busy
+ 3) # of times in this domain load_balance() was called when the cpu
+ was just becoming idle
+ 4) # of times in this domain load_balance() tried to move one or more
+ tasks and failed, when the cpu was idle
+ 5) # of times in this domain load_balance() tried to move one or more
+ tasks and failed, when the cpu was busy
+ 6) # of times in this domain load_balance() tried to move one or more
+ tasks and failed, when the cpu was just becoming idle
+ 7) sum of imbalances discovered (if any) with each call to
+ load_balance() in this domain when the cpu was idle
+ 8) sum of imbalances discovered (if any) with each call to
+ load_balance() in this domain when the cpu was busy
+ 9) sum of imbalances discovered (if any) with each call to
+ load_balance() in this domain when the cpu was just becoming idle
+ 10) # of times in this domain load_balance() was called but did not find
+ a busier queue while the cpu was idle
+ 11) # of times in this domain load_balance() was called but did not find
+ a busier queue while the cpu was busy
+ 12) # of times in this domain load_balance() was called but did not find
+ a busier queue while the cpu was just becoming idle
+ 13) # of times in this domain a busier queue was found while the cpu was
+ idle but no busier group was found
+ 14) # of times in this domain a busier queue was found while the cpu was
+ busy but no busier group was found
+ 15) # of times in this domain a busier queue was found while the cpu was
+ just becoming idle but no busier group was found
+
+Next two are sched_balance_exec() statistics:
+    16) # of times in this domain sched_balance_exec() successfully pushed
+        a task to a new cpu
+    17) # of times in this domain sched_balance_exec() tried but failed to
+        push a task to a new cpu
+
+Next two are try_to_wake_up() statistics:
+    18) # of times in this domain try_to_wake_up() tried to move a task based
+        on affinity and cache warmth
+    19) # of times in this domain try_to_wake_up() tried to move a task based
+        on load balancing
+
+
+/proc/<pid>/stat
+----------------
+schedstats also changes the stat output of individual processes to
+include some of the same information on a per-process level, obtainable
+from /proc/<pid>/stat. Three new fields correlating to fields 20, 21,
+and 22 in the CPU fields, are tacked on to the end but apply only for
+that process.
+
+A program could be easily written to make use of these extra fields to
+report on how well a particular process or set of processes is faring
+under the scheduler's policies. A simple version of such a program is
+available at
+ http://eaglet.rain.com/rick/linux/schedstat/v9/latency.c
diff -rupN linux-2.6.8-rc2-mm2b/arch/i386/Kconfig linux-2.6.8-rc2-mm2b-ss/arch/i386/Kconfig
--- linux-2.6.8-rc2-mm2b/arch/i386/Kconfig Wed Aug 4 14:33:57 2004
+++ linux-2.6.8-rc2-mm2b-ss/arch/i386/Kconfig Wed Aug 4 14:35:14 2004
@@ -1491,6 +1491,19 @@ config 4KSTACKS
on the VM subsystem for higher order allocations. This option
will also use IRQ stacks to compensate for the reduced stackspace.
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default y
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+	  stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
config X86_FIND_SMP_CONFIG
bool
depends on X86_LOCAL_APIC || X86_VOYAGER
diff -rupN linux-2.6.8-rc2-mm2b/arch/ppc/Kconfig linux-2.6.8-rc2-mm2b-ss/arch/ppc/Kconfig
--- linux-2.6.8-rc2-mm2b/arch/ppc/Kconfig Wed Aug 4 14:33:51 2004
+++ linux-2.6.8-rc2-mm2b-ss/arch/ppc/Kconfig Wed Aug 4 14:35:14 2004
@@ -1368,6 +1368,19 @@ config DEBUG_INFO
debug the kernel.
If you don't debug the kernel, you can say N.
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default y
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+	  stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
config BOOTX_TEXT
bool "Support for early boot text console (BootX or OpenFirmware only)"
depends PPC_OF
diff -rupN linux-2.6.8-rc2-mm2b/arch/ppc64/Kconfig linux-2.6.8-rc2-mm2b-ss/arch/ppc64/Kconfig
--- linux-2.6.8-rc2-mm2b/arch/ppc64/Kconfig Sun Jul 11 10:34:39 2004
+++ linux-2.6.8-rc2-mm2b-ss/arch/ppc64/Kconfig Wed Aug 4 14:35:14 2004
@@ -424,6 +424,19 @@ config IRQSTACKS
for handling hard and soft interrupts. This can help avoid
overflowing the process kernel stacks.
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default y
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+	  stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
endmenu
config SPINLINE
diff -rupN linux-2.6.8-rc2-mm2b/arch/x86_64/Kconfig linux-2.6.8-rc2-mm2b-ss/arch/x86_64/Kconfig
--- linux-2.6.8-rc2-mm2b/arch/x86_64/Kconfig Wed Aug 4 14:33:24 2004
+++ linux-2.6.8-rc2-mm2b-ss/arch/x86_64/Kconfig Wed Aug 4 14:35:14 2004
@@ -464,6 +464,19 @@ config DEBUG_INFO
Say Y here only if you plan to use gdb to debug the kernel.
Please note that this option requires new binutils.
If you don't debug the kernel, you can say N.
+
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default y
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+	  stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
config FRAME_POINTER
bool "Compile the kernel with frame pointers"
diff -rupN linux-2.6.8-rc2-mm2b/fs/proc/array.c linux-2.6.8-rc2-mm2b-ss/fs/proc/array.c
--- linux-2.6.8-rc2-mm2b/fs/proc/array.c Tue Jun 15 22:19:36 2004
+++ linux-2.6.8-rc2-mm2b-ss/fs/proc/array.c Wed Aug 4 14:35:14 2004
@@ -357,9 +357,15 @@ int proc_pid_stat(struct task_struct *ta
/* Temporary variable needed for gcc-2.96 */
start_time = jiffies_64_to_clock_t(task->start_time - INITIAL_JIFFIES);
+#ifdef CONFIG_SCHEDSTATS
+ res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
+%lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %lu %lu %lu\n",
+#else
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+#endif
task->pid,
task->comm,
state,
@@ -404,7 +410,14 @@ int proc_pid_stat(struct task_struct *ta
task->exit_signal,
task_cpu(task),
task->rt_priority,
+#ifdef CONFIG_SCHEDSTATS
+ task->policy,
+ task->sched_info.cpu_time,
+ task->sched_info.run_delay,
+ task->sched_info.pcnt);
+#else
task->policy);
+#endif /* CONFIG_SCHEDSTATS */
if(mm)
mmput(mm);
return res;
diff -rupN linux-2.6.8-rc2-mm2b/fs/proc/proc_misc.c linux-2.6.8-rc2-mm2b-ss/fs/proc/proc_misc.c
--- linux-2.6.8-rc2-mm2b/fs/proc/proc_misc.c Wed Aug 4 14:33:19 2004
+++ linux-2.6.8-rc2-mm2b-ss/fs/proc/proc_misc.c Wed Aug 4 14:35:14 2004
@@ -282,6 +282,10 @@ static struct file_operations proc_vmsta
.release = seq_release,
};
+#ifdef CONFIG_SCHEDSTATS
+extern struct file_operations proc_schedstat_operations;
+#endif
+
#ifdef CONFIG_PROC_HARDWARE
static int hardware_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
@@ -711,6 +715,9 @@ void __init proc_misc_init(void)
#ifdef CONFIG_MODULES
create_seq_entry("modules", 0, &proc_modules_operations);
#endif
+#ifdef CONFIG_SCHEDSTATS
+ create_seq_entry("schedstat", 0, &proc_schedstat_operations);
+#endif
#ifdef CONFIG_PROC_KCORE
proc_root_kcore = create_proc_entry("kcore", S_IRUSR, NULL);
if (proc_root_kcore) {
diff -rupN linux-2.6.8-rc2-mm2b/include/linux/sched.h linux-2.6.8-rc2-mm2b-ss/include/linux/sched.h
--- linux-2.6.8-rc2-mm2b/include/linux/sched.h Wed Aug 4 14:33:57 2004
+++ linux-2.6.8-rc2-mm2b-ss/include/linux/sched.h Wed Aug 4 14:35:14 2004
@@ -96,6 +96,16 @@ extern unsigned long nr_running(void);
extern unsigned long nr_uninterruptible(void);
extern unsigned long nr_iowait(void);
+#ifdef CONFIG_SCHEDSTATS
+struct sched_info;
+extern void cpu_sched_info(struct sched_info *, int);
+#define schedstat_inc(rq, field) rq->field++;
+#define schedstat_add(rq, field, amt) rq->field += amt;
+#else
+#define schedstat_inc(rq, field) do { } while (0);
+#define schedstat_add(rq, field, amt) do { } while (0);
+#endif
+
#include <linux/time.h>
#include <linux/param.h>
#include <linux/resource.h>
@@ -367,6 +377,18 @@ struct k_itimer {
struct timespec wall_to_prev; /* wall_to_monotonic used when set */
};
+#ifdef CONFIG_SCHEDSTATS
+struct sched_info {
+ /* cumulative counters */
+ unsigned long cpu_time, /* time spent on the cpu */
+ run_delay, /* time spent waiting on a runqueue */
+ pcnt; /* # of timeslices run on this cpu */
+
+ /* timestamps */
+ unsigned long last_arrival, /* when we last ran on a cpu */
+ last_queued; /* when we were last queued to run */
+};
+#endif /* CONFIG_SCHEDSTATS */
struct io_context; /* See blkdev.h */
void exit_io_context(void);
@@ -429,6 +451,10 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;
+#ifdef CONFIG_SCHEDSTATS
+ struct sched_info sched_info;
+#endif /* CONFIG_SCHEDSTATS */
+
struct list_head tasks;
/*
* ptrace_list/ptrace_children forms the list of my children
@@ -601,6 +627,14 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define SD_WAKE_BALANCE 16 /* Perform balancing at task wakeup */
#define SD_SHARE_CPUPOWER 32 /* Domain members share cpu power */
+enum idle_type
+{
+ IDLE,
+ NOT_IDLE,
+ NEWLY_IDLE,
+ MAX_IDLE_TYPES
+};
+
struct sched_group {
struct sched_group *next; /* Must be a circular list */
cpumask_t cpumask;
@@ -632,6 +666,23 @@ struct sched_domain {
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
+
+#ifdef CONFIG_SCHEDSTATS
+ /* load_balance() stats */
+ unsigned long lb_cnt[MAX_IDLE_TYPES];
+ unsigned long lb_failed[MAX_IDLE_TYPES];
+ unsigned long lb_imbalance[MAX_IDLE_TYPES];
+ unsigned long lb_nobusyg[MAX_IDLE_TYPES];
+ unsigned long lb_nobusyq[MAX_IDLE_TYPES];
+
+ /* sched_balance_exec() stats */
+ unsigned long sbe_attempts;
+ unsigned long sbe_pushed;
+
+ /* try_to_wake_up() stats */
+ unsigned long ttwu_wake_affine;
+ unsigned long ttwu_wake_balance;
+#endif
};
#ifndef ARCH_HAS_SCHED_TUNE
diff -rupN linux-2.6.8-rc2-mm2b/kernel/fork.c linux-2.6.8-rc2-mm2b-ss/kernel/fork.c
--- linux-2.6.8-rc2-mm2b/kernel/fork.c Wed Aug 4 14:33:57 2004
+++ linux-2.6.8-rc2-mm2b-ss/kernel/fork.c Wed Aug 4 14:35:14 2004
@@ -972,6 +972,11 @@ struct task_struct *copy_process(unsigne
p->security = NULL;
p->io_context = NULL;
p->audit_context = NULL;
+
+#ifdef CONFIG_SCHEDSTATS
+ memset(&p->sched_info, 0, sizeof(p->sched_info));
+#endif /* CONFIG_SCHEDSTATS */
+
#ifdef CONFIG_NUMA
p->mempolicy = mpol_copy(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
diff -rupN linux-2.6.8-rc2-mm2b/kernel/sched.c linux-2.6.8-rc2-mm2b-ss/kernel/sched.c
--- linux-2.6.8-rc2-mm2b/kernel/sched.c Wed Aug 4 14:33:33 2004
+++ linux-2.6.8-rc2-mm2b-ss/kernel/sched.c Wed Aug 4 14:35:14 2004
@@ -41,6 +41,8 @@
#include <linux/percpu.h>
#include <linux/perfctr.h>
#include <linux/kthread.h>
+#include <linux/seq_file.h>
+#include <linux/times.h>
#include <asm/tlb.h>
#include <asm/unistd.h>
@@ -234,6 +236,48 @@ struct runqueue {
task_t *migration_thread;
struct list_head migration_queue;
#endif
+
+#ifdef CONFIG_SCHEDSTATS
+ /* latency stats */
+ struct sched_info rq_sched_info;
+
+ /* sys_sched_yield() stats */
+ unsigned long yld_exp_empty;
+ unsigned long yld_act_empty;
+ unsigned long yld_both_empty;
+ unsigned long yld_cnt;
+
+ /* schedule() stats */
+ unsigned long sched_noswitch;
+ unsigned long sched_switch;
+ unsigned long sched_cnt;
+ unsigned long sched_goidle;
+
+ /* pull_task() stats */
+ unsigned long pt_gained[MAX_IDLE_TYPES];
+ unsigned long pt_lost[MAX_IDLE_TYPES];
+
+ /* active_load_balance() stats */
+ unsigned long alb_cnt;
+ unsigned long alb_lost;
+ unsigned long alb_gained;
+ unsigned long alb_failed;
+
+ /* try_to_wake_up() stats */
+ unsigned long ttwu_cnt;
+ unsigned long ttwu_attempts;
+ unsigned long ttwu_moved;
+
+ /* wake_up_forked_thread() stats */
+ unsigned long wuft_cnt;
+ unsigned long wuft_moved;
+
+ /* sched_migrate_task() stats */
+ unsigned long smt_cnt;
+
+ /* sched_balance_exec() stats */
+ unsigned long sbe_cnt;
+#endif
};
static DEFINE_PER_CPU(struct runqueue, runqueues);
@@ -280,6 +324,97 @@ static inline void task_rq_unlock(runque
spin_unlock_irqrestore(&rq->lock, *flags);
}
+#ifdef CONFIG_SCHEDSTATS
+/*
+ * bump this up when changing the output format or the meaning of an existing
+ * format, so that tools can adapt (or abort)
+ */
+#define SCHEDSTAT_VERSION 9
+
+static int show_schedstat(struct seq_file *seq, void *v)
+{
+ int cpu;
+ enum idle_type itype;
+
+ seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
+ seq_printf(seq, "timestamp %lu\n", jiffies);
+ for_each_online_cpu (cpu) {
+
+ int dcnt = 0;
+
+ runqueue_t *rq = cpu_rq(cpu);
+ struct sched_domain *sd;
+
+ /* runqueue-specific stats */
+ seq_printf(seq,
+ "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu "
+ "%lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+ cpu, rq->yld_both_empty,
+ rq->yld_act_empty, rq->yld_exp_empty,
+ rq->yld_cnt, rq->sched_noswitch,
+ rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
+ rq->alb_cnt, rq->alb_gained, rq->alb_lost,
+ rq->alb_failed,
+ rq->ttwu_cnt, rq->ttwu_moved, rq->ttwu_attempts,
+ rq->wuft_cnt, rq->wuft_moved,
+ rq->smt_cnt, rq->sbe_cnt, rq->rq_sched_info.cpu_time,
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
+
+ for (itype = IDLE; itype < MAX_IDLE_TYPES; itype++)
+ seq_printf(seq, " %lu %lu", rq->pt_gained[itype],
+ rq->pt_lost[itype]);
+
+ seq_printf(seq, "\n");
+
+ /* domain-specific stats */
+ for_each_domain(cpu, sd) {
+ char mask_str[NR_CPUS];
+
+ cpumask_scnprintf(mask_str, NR_CPUS, sd->span);
+ seq_printf(seq, "domain%d %s", dcnt++, mask_str);
+ for (itype = IDLE; itype < MAX_IDLE_TYPES; itype++) {
+ seq_printf(seq, " %lu %lu %lu %lu %lu",
+ sd->lb_cnt[itype],
+ sd->lb_failed[itype],
+ sd->lb_imbalance[itype],
+ sd->lb_nobusyq[itype],
+ sd->lb_nobusyg[itype]);
+ }
+ seq_printf(seq, " %lu %lu %lu %lu\n",
+ sd->sbe_pushed, sd->sbe_attempts,
+ sd->ttwu_wake_affine, sd->ttwu_wake_balance);
+ }
+ }
+ return 0;
+}
+
+static int schedstat_open(struct inode *inode, struct file *file)
+{
+ unsigned size = PAGE_SIZE * (1 + num_online_cpus() / 32);
+ char *buf = kmalloc(size, GFP_KERNEL);
+ struct seq_file *m;
+ int res;
+
+ if (!buf)
+ return -ENOMEM;
+ res = single_open(file, show_schedstat, NULL);
+ if (!res) {
+ m = file->private_data;
+ m->buf = buf;
+ m->size = size;
+ } else
+ kfree(buf);
+ return res;
+}
+
+struct file_operations proc_schedstat_operations = {
+ .open = schedstat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+#endif
+
/*
* rq_lock - lock a given runqueue and disable interrupts.
*/
@@ -299,6 +434,113 @@ static inline void rq_unlock(runqueue_t
spin_unlock_irq(&rq->lock);
}
+#ifdef CONFIG_SCHEDSTATS
+/*
+ * Called when a process is dequeued from the active array and given
+ * the cpu. We should note that with the exception of interactive
+ * tasks, the expired queue will become the active queue after the active
+ * queue is empty, without explicitly dequeuing and requeuing tasks in the
+ * expired queue. (Interactive tasks may be requeued directly to the
+ * active queue, thus delaying tasks in the expired queue from running;
+ * see scheduler_tick()).
+ *
+ * This function is only called from sched_info_arrive(), rather than
+ * dequeue_task(). Even though a task may be queued and dequeued multiple
+ * times as it is shuffled about, we're really interested in knowing how
+ * long it was from the *first* time it was queued to the time that it
+ * finally hit a cpu.
+ */
+static inline void sched_info_dequeued(task_t *t)
+{
+ t->sched_info.last_queued = 0;
+}
+
+/*
+ * Called when a task finally hits the cpu. We can now calculate how
+ * long it was waiting to run. We also note when it began so that we
+ * can keep stats on how long its timeslice is.
+ */
+static inline void sched_info_arrive(task_t *t)
+{
+ unsigned long now = jiffies;
+ unsigned long diff = 0;
+ struct runqueue *rq = task_rq(t);
+
+ if (t->sched_info.last_queued)
+ diff = now - t->sched_info.last_queued;
+ sched_info_dequeued(t);
+ t->sched_info.run_delay += diff;
+ t->sched_info.last_arrival = now;
+ t->sched_info.pcnt++;
+
+ if (!rq)
+ return;
+
+ rq->rq_sched_info.run_delay += diff;
+ rq->rq_sched_info.pcnt++;
+}
+
+/*
+ * Called when a process is queued into either the active or expired
+ * array. The time is noted and later used to determine how long we
+ * had to wait for us to reach the cpu. Since the expired queue will
+ * become the active queue after active queue is empty, without dequeuing
+ * and requeuing any tasks, we are interested in queuing to either. It
+ * is unusual but not impossible for tasks to be dequeued and immediately
+ * requeued in the same or another array: this can happen in sched_yield(),
+ * set_user_nice(), and even load_balance() as it moves tasks from runqueue
+ * to runqueue.
+ *
+ * This function is only called from enqueue_task(), but also only updates
+ * the timestamp if it is already not set. It's assumed that
+ * sched_info_dequeued() will clear that stamp when appropriate.
+ */
+static inline void sched_info_queued(task_t *t)
+{
+ if (!t->sched_info.last_queued)
+ t->sched_info.last_queued = jiffies;
+}
+
+/*
+ * Called when a process ceases being the active-running process, either
+ * voluntarily or involuntarily. Now we can calculate how long we ran.
+ */
+static inline void sched_info_depart(task_t *t)
+{
+ struct runqueue *rq = task_rq(t);
+ unsigned long diff = jiffies - t->sched_info.last_arrival;
+
+ t->sched_info.cpu_time += diff;
+
+ if (rq)
+ rq->rq_sched_info.cpu_time += diff;
+}
+
+/*
+ * Called when tasks are switched involuntarily due, typically, to expiring
+ * their time slice. (This may also be called when switching to or from
+ * the idle task.) We are only called when prev != next.
+ */
+static inline void sched_info_switch(task_t *prev, task_t *next)
+{
+ struct runqueue *rq = task_rq(prev);
+
+ /*
+ * prev now departs the cpu. It's not interesting to record
+ * stats about how efficient we were at scheduling the idle
+ * process, however.
+ */
+ if (prev != rq->idle)
+ sched_info_depart(prev);
+
+ if (next != rq->idle)
+ sched_info_arrive(next);
+}
+#else
+#define sched_info_queued(t) {}
+#define sched_info_switch(t, next) {}
+#endif /* CONFIG_SCHEDSTATS */
+
/*
* Adding/removing a task to/from a priority array:
*/
@@ -312,6 +554,7 @@ static void dequeue_task(struct task_str
static void enqueue_task(struct task_struct *p, prio_array_t *array)
{
+ sched_info_queued(p);
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
array->nr_active++;
@@ -736,11 +979,12 @@ static int try_to_wake_up(task_t * p, un
runqueue_t *rq;
#ifdef CONFIG_SMP
unsigned long load, this_load;
- struct sched_domain *sd;
+ struct sched_domain *sd = NULL;
int new_cpu;
#endif
rq = task_rq_lock(p, &flags);
+ schedstat_inc(rq, ttwu_cnt);
old_state = p->state;
if (!(old_state & state))
goto out;
@@ -788,23 +1032,35 @@ static int try_to_wake_up(task_t * p, un
*/
imbalance = sd->imbalance_pct + (sd->imbalance_pct - 100) / 2;
- if ( ((sd->flags & SD_WAKE_AFFINE) &&
- !task_hot(p, rq->timestamp_last_tick, sd))
- || ((sd->flags & SD_WAKE_BALANCE) &&
- imbalance*this_load <= 100*load) ) {
+ if ((sd->flags & SD_WAKE_AFFINE) &&
+ !task_hot(p, rq->timestamp_last_tick, sd)) {
/*
- * Now sd has SD_WAKE_AFFINE and p is cache cold in sd
- * or sd has SD_WAKE_BALANCE and there is an imbalance
+ * This domain has SD_WAKE_AFFINE and p is cache cold
+ * in this domain.
*/
- if (cpu_isset(cpu, sd->span))
+ if (cpu_isset(cpu, sd->span)) {
+ schedstat_inc(sd, ttwu_wake_affine);
goto out_set_cpu;
+ }
+ } else if ((sd->flags & SD_WAKE_BALANCE) &&
+ imbalance*this_load <= 100*load) {
+ /*
+ * This domain has SD_WAKE_BALANCE and there is
+ * an imbalance.
+ */
+ if (cpu_isset(cpu, sd->span)) {
+ schedstat_inc(sd, ttwu_wake_balance);
+ goto out_set_cpu;
+ }
}
}
new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
out_set_cpu:
+ schedstat_inc(rq, ttwu_attempts);
new_cpu = wake_idle(new_cpu, p);
if (new_cpu != cpu && cpu_isset(new_cpu, p->cpus_allowed)) {
+ schedstat_inc(rq, ttwu_moved);
set_task_cpu(p, new_cpu);
task_rq_unlock(rq, &flags);
/* might preempt at this point */
@@ -856,7 +1112,7 @@ out:
int fastcall wake_up_process(task_t * p)
{
return try_to_wake_up(p, TASK_STOPPED |
- TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
+ TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
}
EXPORT_SYMBOL(wake_up_process);
@@ -1162,13 +1418,6 @@ unsigned long nr_iowait(void)
return sum;
}
-enum idle_type
-{
- IDLE,
- NOT_IDLE,
- NEWLY_IDLE,
-};
-
#ifdef CONFIG_SMP
/*
@@ -1283,6 +1532,7 @@ static void sched_migrate_task(task_t *p
|| unlikely(cpu_is_offline(dest_cpu)))
goto out;
+ schedstat_inc(rq, smt_cnt);
/* force the process onto the specified CPU */
if (migrate_task(p, dest_cpu, &req)) {
/* Need to wait for migration thread (might exit: take ref). */
@@ -1310,6 +1560,7 @@ void sched_exec(void)
struct sched_domain *tmp, *sd = NULL;
int new_cpu, this_cpu = get_cpu();
+ schedstat_inc(this_rq(), sbe_cnt);
/* Prefer the current CPU if there's only this task running */
if (this_rq()->nr_running <= 1)
goto out;
@@ -1318,9 +1569,11 @@ void sched_exec(void)
if (tmp->flags & SD_BALANCE_EXEC)
sd = tmp;
+ schedstat_inc(sd, sbe_attempts);
if (sd) {
new_cpu = find_idlest_cpu(current, this_cpu, sd);
if (new_cpu != this_cpu) {
+ schedstat_inc(sd, sbe_pushed);
put_cpu();
sched_migrate_task(current, new_cpu);
return;
@@ -1444,6 +1697,15 @@ skip_queue:
idx++;
goto skip_bitmap;
}
+
+ /*
+ * Right now, this is the only place pull_task() is called,
+ * so we can safely collect pull_task() stats here rather than
+ * inside pull_task().
+ */
+ schedstat_inc(this_rq, pt_gained[idle]);
+ schedstat_inc(busiest, pt_lost[idle]);
+
pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
pulled++;
@@ -1638,14 +1900,20 @@ static int load_balance(int this_cpu, ru
int nr_moved;
spin_lock(&this_rq->lock);
-
- group = find_busiest_group(sd, this_cpu, &imbalance, idle);
- if (!group)
- goto out_balanced;
+ schedstat_inc(sd, lb_cnt[idle]);
+
+ group = find_busiest_group(sd, this_cpu, &imbalance, idle);
+ if (!group) {
+ schedstat_inc(sd, lb_nobusyg[idle]);
+ goto out_balanced;
+ }
busiest = find_busiest_queue(group);
- if (!busiest)
- goto out_balanced;
+ if (!busiest) {
+ schedstat_inc(sd, lb_nobusyq[idle]);
+ goto out_balanced;
+ }
+
/*
* This should be "impossible", but since load
* balancing is inherently racy and statistical,
@@ -1656,6 +1924,8 @@ static int load_balance(int this_cpu, ru
goto out_balanced;
}
+ schedstat_add(sd, lb_imbalance[idle], imbalance);
+
nr_moved = 0;
if (busiest->nr_running > 1) {
/*
@@ -1672,6 +1942,7 @@ static int load_balance(int this_cpu, ru
spin_unlock(&this_rq->lock);
if (!nr_moved) {
+ schedstat_inc(sd, lb_failed[idle]);
sd->nr_balance_failed++;
if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
@@ -1726,20 +1997,29 @@ static int load_balance_newidle(int this
unsigned long imbalance;
int nr_moved = 0;
+ schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
- if (!group)
+ if (!group) {
+ schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
goto out;
+ }
busiest = find_busiest_queue(group);
- if (!busiest || busiest == this_rq)
+ if (!busiest || busiest == this_rq) {
+ schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
goto out;
+ }
/* Attempt to move tasks */
double_lock_balance(this_rq, busiest);
+ schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance);
nr_moved = move_tasks(this_rq, this_cpu, busiest,
imbalance, sd, NEWLY_IDLE);
+ if (!nr_moved)
+ schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
+
spin_unlock(&busiest->lock);
out:
@@ -1778,6 +2058,7 @@ static void active_load_balance(runqueue
struct sched_group *group, *busy_group;
int i;
+ schedstat_inc(busiest, alb_cnt);
if (busiest->nr_running <= 1)
return;
@@ -1824,7 +2105,12 @@ static void active_load_balance(runqueue
if (unlikely(busiest == rq))
goto next_group;
double_lock_balance(busiest, rq);
- move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
+ if (move_tasks(rq, push_cpu, busiest, 1, sd, IDLE)) {
+ schedstat_inc(busiest, alb_lost);
+ schedstat_inc(rq, alb_gained);
+ } else {
+ schedstat_inc(busiest, alb_failed);
+ }
spin_unlock(&rq->lock);
next_group:
group = group->next;
@@ -2181,6 +2467,7 @@ need_resched:
}
release_kernel_lock(prev);
+ schedstat_inc(rq, sched_cnt);
now = sched_clock();
if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
@@ -2227,18 +2514,21 @@ need_resched:
/*
* Switch the active and expired arrays.
*/
+ schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
- }
+ } else
+ schedstat_inc(rq, sched_noswitch);
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
if (dependent_sleeper(cpu, rq, next)) {
+ schedstat_inc(rq, sched_goidle);
next = rq->idle;
goto switch_tasks;
}
@@ -2268,6 +2558,7 @@ switch_tasks:
}
prev->timestamp = now;
+ sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;
rq->nr_switches++;
@@ -2986,6 +3277,7 @@ asmlinkage long sys_sched_yield(void)
prio_array_t *array = current->array;
prio_array_t *target = rq->expired;
+ schedstat_inc(rq, yld_cnt);
/*
* We implement yielding by moving the task into the expired
* queue.
@@ -2996,6 +3288,15 @@ asmlinkage long sys_sched_yield(void)
if (rt_task(current))
target = rq->active;
+ if (current->array->nr_active == 1) {
+ schedstat_inc(rq, yld_act_empty);
+ if (!rq->expired->nr_active) {
+ schedstat_inc(rq, yld_both_empty);
+ }
+ } else if (!rq->expired->nr_active) {
+ schedstat_inc(rq, yld_exp_empty);
+ }
+
dequeue_task(current, array);
enqueue_task(current, target);
@@ -3582,7 +3883,7 @@ static int migration_call(struct notifie
rq->idle->static_prio = MAX_PRIO;
__setscheduler(rq->idle, SCHED_NORMAL, 0);
task_rq_unlock(rq, &flags);
- BUG_ON(rq->nr_running != 0);
+ BUG_ON(rq->nr_running != 0);
/* No need to migrate the tasks: it was best-effort if
* they didn't do lock_cpu_hotplug(). Just wake up
@@ -3597,7 +3898,7 @@ static int migration_call(struct notifie
complete(&req->done);
}
spin_unlock_irq(&rq->lock);
- break;
+ break;
#endif
}
return NOTIFY_OK;
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch
2004-08-04 21:34 ` Sam Ravnborg
2004-08-04 21:46 ` Randy.Dunlap
@ 2004-08-04 22:13 ` Ingo Molnar
1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2004-08-04 22:13 UTC (permalink / raw)
To: Martin J. Bligh, Andrew Morton, kernel, linux-kernel,
Rick Lindsley, Nick Piggin
[-- Attachment #1: Type: text/plain, Size: 637 bytes --]
* Sam Ravnborg <sam@ravnborg.org> wrote:
> > I fixed a number of cleanliness items, readying this patch for a
> > mainline merge:
> >
> > - added kernel/Kconfig.debug for generic debug options (such as
> > schedstat) and removed tons of debug options from various arch
> > Kconfig's, instead of adding a boatload of new SCHEDSTAT entries.
> > This felt good.
>
> Randy Dunlap has posted a patch for this several times. It has seen
> some review, so it is the preferred starting point.
sure - Randy's is much more complete.
i've attached schedstat-2.6.8-rc2-mm2-A5.patch which doesn't try to be
too smart wrt. Kconfig.
Ingo
[-- Attachment #2: schedstat-2.6.8-rc2-mm2-A5.patch --]
[-- Type: text/plain, Size: 14001 bytes --]
--- linux/arch/i386/Kconfig.orig
+++ linux/arch/i386/Kconfig
@@ -1491,6 +1491,19 @@ config 4KSTACKS
on the VM subsystem for higher order allocations. This option
will also use IRQ stacks to compensate for the reduced stackspace.
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default n
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+ stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
config X86_FIND_SMP_CONFIG
bool
depends on X86_LOCAL_APIC || X86_VOYAGER
--- linux/arch/ppc/Kconfig.orig
+++ linux/arch/ppc/Kconfig
@@ -1368,6 +1368,19 @@ config DEBUG_INFO
debug the kernel.
If you don't debug the kernel, you can say N.
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default n
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+ stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
config BOOTX_TEXT
bool "Support for early boot text console (BootX or OpenFirmware only)"
depends PPC_OF
--- linux/arch/ia64/Kconfig.orig
+++ linux/arch/ia64/Kconfig
@@ -635,6 +635,19 @@ config MAGIC_SYSRQ
keys are documented in <file:Documentation/sysrq.txt>. Don't say Y
unless you really know what this hack does.
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default n
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+ stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
config DEBUG_SLAB
bool "Debug memory allocations"
depends on DEBUG_KERNEL
--- linux/arch/ppc64/Kconfig.orig
+++ linux/arch/ppc64/Kconfig
@@ -426,6 +426,19 @@ config IRQSTACKS
endmenu
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default n
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+ stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
+
config SPINLINE
bool "Inline spinlock code at each call site"
depends on SMP && !PPC_SPLPAR && !PPC_ISERIES
--- linux/arch/x86_64/Kconfig.orig
+++ linux/arch/x86_64/Kconfig
@@ -464,6 +464,19 @@ config DEBUG_INFO
Say Y here only if you plan to use gdb to debug the kernel.
Please note that this option requires new binutils.
If you don't debug the kernel, you can say N.
+
+config SCHEDSTATS
+ bool "Collect scheduler statistics"
+ depends on PROC_FS
+ default n
+ help
+ If you say Y here, additional code will be inserted into the
+ scheduler and related routines to collect statistics about
+ scheduler behavior and provide them in /proc/schedstat. These
+ stats may be useful for both tuning and debugging the scheduler.
+ If you aren't debugging the scheduler or trying to tune a specific
+ application, you can say N to avoid the very slight overhead
+ this adds.
config FRAME_POINTER
bool "Compile the kernel with frame pointers"
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -632,6 +632,32 @@ struct sched_domain {
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
+
+#ifdef CONFIG_SCHEDSTATS
+ unsigned long lb_cnt[3];
+ unsigned long lb_balanced[3];
+ unsigned long lb_failed[3];
+ unsigned long lb_pulled[3];
+ unsigned long lb_hot_pulled[3];
+ unsigned long lb_imbalance[3];
+
+ /* Active load balancing */
+ unsigned long alb_cnt;
+ unsigned long alb_failed;
+ unsigned long alb_pushed;
+
+ /* Wakeups */
+ unsigned long sched_wake_remote;
+
+ /* Passive load balancing */
+ unsigned long plb_pulled;
+
+ /* Affine wakeups */
+ unsigned long afw_pulled;
+
+ /* SD_BALANCE_EXEC balances */
+ unsigned long sbe_pushed;
+#endif
};
#ifndef ARCH_HAS_SCHED_TUNE
--- linux/fs/proc/proc_misc.c.orig
+++ linux/fs/proc/proc_misc.c
@@ -282,6 +282,10 @@ static struct file_operations proc_vmsta
.release = seq_release,
};
+#ifdef CONFIG_SCHEDSTATS
+extern struct file_operations proc_schedstat_operations;
+#endif
+
#ifdef CONFIG_PROC_HARDWARE
static int hardware_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
@@ -711,6 +715,9 @@ void __init proc_misc_init(void)
#ifdef CONFIG_MODULES
create_seq_entry("modules", 0, &proc_modules_operations);
#endif
+#ifdef CONFIG_SCHEDSTATS
+ create_seq_entry("schedstat", 0, &proc_schedstat_operations);
+#endif
#ifdef CONFIG_PROC_KCORE
proc_root_kcore = create_proc_entry("kcore", S_IRUSR, NULL);
if (proc_root_kcore) {
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -41,6 +41,8 @@
#include <linux/percpu.h>
#include <linux/perfctr.h>
#include <linux/kthread.h>
+#include <linux/seq_file.h>
+#include <linux/times.h>
#include <asm/tlb.h>
#include <asm/unistd.h>
@@ -234,6 +236,22 @@ struct runqueue {
task_t *migration_thread;
struct list_head migration_queue;
#endif
+#ifdef CONFIG_SCHEDSTATS
+ /* sys_sched_yield stats */
+ unsigned long yld_exp_empty;
+ unsigned long yld_act_empty;
+ unsigned long yld_both_empty;
+ unsigned long yld_cnt;
+
+ /* schedule stats */
+ unsigned long sched_cnt;
+ unsigned long sched_switch;
+ unsigned long sched_idle;
+
+ /* wake stats */
+ unsigned long sched_wake;
+ unsigned long sched_wake_local;
+#endif
};
static DEFINE_PER_CPU(struct runqueue, runqueues);
@@ -280,6 +298,97 @@ static inline void task_rq_unlock(runque
spin_unlock_irqrestore(&rq->lock, *flags);
}
+
+#ifdef CONFIG_SCHEDSTATS
+
+/*
+ * bump this up when changing the output format or the meaning of an existing
+ * format, so that tools can adapt (or abort)
+ */
+#define SCHEDSTAT_VERSION 7
+
+static int show_schedstat(struct seq_file *seq, void *v)
+{
+ int i;
+
+ seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
+ seq_printf(seq, "timestamp %lu\n", jiffies);
+ for_each_cpu(i) {
+ /* Include offline CPUs */
+ runqueue_t *rq = cpu_rq(i);
+#ifdef CONFIG_SMP
+ struct sched_domain *sd;
+ int j = 0;
+#endif
+
+ seq_printf(seq,
+ "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+ i,
+ rq->yld_both_empty, rq->yld_act_empty,
+ rq->yld_exp_empty, rq->yld_cnt,
+ rq->sched_switch, rq->sched_cnt,
+ rq->sched_idle, rq->sched_wake, rq->sched_wake_local);
+#ifdef CONFIG_SMP
+ for_each_domain(i, sd) {
+ char str[NR_CPUS];
+ int k;
+ cpumask_scnprintf(str, NR_CPUS, sd->span);
+ seq_printf(seq, " domain%d %s", j++, str);
+
+ for (k = 0; k < 3; k++) {
+ seq_printf(seq, " %lu %lu %lu %lu %lu %lu",
+ sd->lb_cnt[k], sd->lb_balanced[k],
+ sd->lb_failed[k], sd->lb_pulled[k],
+ sd->lb_hot_pulled[k], sd->lb_imbalance[k]);
+ }
+
+ seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu",
+ sd->alb_cnt, sd->alb_failed,
+ sd->alb_pushed, sd->sched_wake_remote,
+ sd->plb_pulled, sd->afw_pulled,
+ sd->sbe_pushed);
+ }
+#endif
+
+ seq_printf(seq, "\n");
+ }
+
+ return 0;
+}
+
+static int schedstat_open(struct inode *inode, struct file *file)
+{
+ unsigned size = 4096 * (1 + num_online_cpus() / 32);
+ char *buf = kmalloc(size, GFP_KERNEL);
+ struct seq_file *m;
+ int res;
+
+ if (!buf)
+ return -ENOMEM;
+ res = single_open(file, show_schedstat, NULL);
+ if (!res) {
+ m = file->private_data;
+ m->buf = buf;
+ m->size = size;
+ } else
+ kfree(buf);
+ return res;
+}
+
+struct file_operations proc_schedstat_operations = {
+ .open = schedstat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+# define schedstat_inc(s, field) ((s)->field++)
+# define schedstat_add(s, field, amt) ((s)->field += (amt))
+#else
+# define schedstat_inc(s, field) do { } while (0)
+# define schedstat_add(d, field, amt) do { } while (0)
+#endif
+
/*
* rq_lock - lock a given runqueue and disable interrupts.
*/
@@ -751,7 +860,24 @@ static int try_to_wake_up(task_t * p, un
cpu = task_cpu(p);
this_cpu = smp_processor_id();
+ schedstat_inc(rq, sched_wake);
+#ifndef CONFIG_SMP
+ schedstat_inc(rq, sched_wake_local);
+#endif
+
#ifdef CONFIG_SMP
+#ifdef CONFIG_SCHEDSTATS
+ if (cpu == this_cpu)
+ schedstat_inc(rq, sched_wake_local);
+ else {
+ for_each_domain(this_cpu, sd)
+ if (cpu_isset(cpu, sd->span))
+ break;
+ if (sd)
+ schedstat_inc(sd, sched_wake_remote);
+ }
+#endif
+
if (unlikely(task_running(rq, p)))
goto out_activate;
@@ -796,8 +922,17 @@ static int try_to_wake_up(task_t * p, un
* Now sd has SD_WAKE_AFFINE and p is cache cold in sd
* or sd has SD_WAKE_BALANCE and there is an imbalance
*/
- if (cpu_isset(cpu, sd->span))
+ if (cpu_isset(cpu, sd->span)) {
+#ifdef CONFIG_SCHEDSTATS
+ if ((sd->flags & SD_WAKE_AFFINE) &&
+ !task_hot(p, rq->timestamp_last_tick, sd))
+ schedstat_inc(sd, afw_pulled);
+ else if ((sd->flags & SD_WAKE_BALANCE) &&
+ imbalance*this_load <= 100*load)
+ schedstat_inc(sd, plb_pulled);
+#endif
goto out_set_cpu;
+ }
}
}
@@ -1321,6 +1456,7 @@ void sched_exec(void)
if (sd) {
new_cpu = find_idlest_cpu(current, this_cpu, sd);
if (new_cpu != this_cpu) {
+ schedstat_inc(sd, sbe_pushed);
put_cpu();
sched_migrate_task(current, new_cpu);
return;
@@ -1378,6 +1514,13 @@ int can_migrate_task(task_t *p, runqueue
return 0;
}
+#ifdef CONFIG_SCHEDSTATS
+ if (!task_hot(p, rq->timestamp_last_tick, sd))
+ schedstat_inc(sd, lb_pulled[idle]);
+ else
+ schedstat_inc(sd, lb_hot_pulled[idle]);
+#endif
+
return 1;
}
@@ -1638,14 +1781,20 @@ static int load_balance(int this_cpu, ru
int nr_moved;
spin_lock(&this_rq->lock);
+ schedstat_inc(sd, lb_cnt[idle]);
group = find_busiest_group(sd, this_cpu, &imbalance, idle);
- if (!group)
+ if (!group) {
+ schedstat_inc(sd, lb_balanced[idle]);
goto out_balanced;
+ }
busiest = find_busiest_queue(group);
- if (!busiest)
+ if (!busiest) {
+ schedstat_inc(sd, lb_balanced[idle]);
goto out_balanced;
+ }
+
/*
* This should be "impossible", but since load
* balancing is inherently racy and statistical,
@@ -1655,6 +1804,7 @@ static int load_balance(int this_cpu, ru
WARN_ON(1);
goto out_balanced;
}
+ schedstat_add(sd, lb_imbalance[idle], imbalance);
nr_moved = 0;
if (busiest->nr_running > 1) {
@@ -1672,6 +1822,7 @@ static int load_balance(int this_cpu, ru
spin_unlock(&this_rq->lock);
if (!nr_moved) {
+ schedstat_inc(sd, lb_failed[idle]);
sd->nr_balance_failed++;
if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
@@ -1726,13 +1877,20 @@ static int load_balance_newidle(int this
unsigned long imbalance;
int nr_moved = 0;
+ schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
- if (!group)
+ if (!group) {
+ schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
goto out;
+ }
busiest = find_busiest_queue(group);
- if (!busiest || busiest == this_rq)
+ if (!busiest || busiest == this_rq) {
+ schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
goto out;
+ }
+
+ schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance);
/* Attempt to move tasks */
double_lock_balance(this_rq, busiest);
@@ -1777,6 +1935,7 @@ static void active_load_balance(runqueue
struct sched_domain *sd;
struct sched_group *group, *busy_group;
int i;
+ int moved = 0;
if (busiest->nr_running <= 1)
return;
@@ -1788,6 +1947,7 @@ static void active_load_balance(runqueue
WARN_ON(1);
return;
}
+ schedstat_inc(sd, alb_cnt);
group = sd->groups;
while (!cpu_isset(busiest_cpu, group->cpumask))
@@ -1824,11 +1984,16 @@ static void active_load_balance(runqueue
if (unlikely(busiest == rq))
goto next_group;
double_lock_balance(busiest, rq);
- move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
+ moved += move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
spin_unlock(&rq->lock);
next_group:
group = group->next;
} while (group != sd->groups);
+
+ if (moved)
+ schedstat_add(sd, alb_pushed, moved);
+ else
+ schedstat_inc(sd, alb_failed);
}
/*
@@ -2181,6 +2346,7 @@ need_resched:
}
release_kernel_lock(prev);
+ schedstat_inc(rq, sched_cnt);
now = sched_clock();
if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
@@ -2218,6 +2384,7 @@ need_resched:
next = rq->idle;
rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu, rq);
+ schedstat_inc(rq, sched_idle);
goto switch_tasks;
}
}
@@ -2227,6 +2394,7 @@ need_resched:
/*
* Switch the active and expired arrays.
*/
+ schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
@@ -2986,6 +3154,7 @@ asmlinkage long sys_sched_yield(void)
prio_array_t *array = current->array;
prio_array_t *target = rq->expired;
+ schedstat_inc(rq, yld_cnt);
/*
* We implement yielding by moving the task into the expired
* queue.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 21:31 ` Ingo Molnar
@ 2004-08-04 23:34 ` Martin J. Bligh
0 siblings, 0 replies; 27+ messages in thread
From: Martin J. Bligh @ 2004-08-04 23:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley
--On Wednesday, August 04, 2004 23:31:13 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Martin J. Bligh <mbligh@aracnet.com> wrote:
>
>> > Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
>> > unapplied a re-run at least part of your tests?
>> >
>> > there are a number of NUMA improvements queued up on -mm, and it would
>> > be nice to know what effect these cause, and what effect the staircase
>> > scheduler has.
>>
>> Sure. I presume it's just the one patch:
>>
>> staircase-cpu-scheduler-268-rc2-mm1.patch
>>
>> which seemed to back out clean and is building now. Scream if that's
>> not all of it ...
>
> correct, that's the end of the scheduler patch-queue and it works fine
> if unapplied. (The schedstats patch i just sent applies cleanly to that
> base, in case you need one.)
OK, the perf of 2.6.8-rc2-mm2 with the new sched code backed out is exactly
the same as 2.6.8-rc2 ... ie it's definitely the new sched code that makes
the improvement.
M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 19:24 ` Andrew Morton
2004-08-04 19:34 ` Martin J. Bligh
@ 2004-08-04 23:44 ` Peter Williams
2004-08-04 23:59 ` Martin J. Bligh
1 sibling, 1 reply; 27+ messages in thread
From: Peter Williams @ 2004-08-04 23:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, kernel, linux-kernel, Ingo Molnar
Andrew Morton wrote:
> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
>>SDET 8 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.2%
>> 2.6.8-rc2 100.2% 1.0%
>> 2.6.8-rc2-mm2 117.4% 0.9%
>>
>> SDET 16 (see disclaimer)
>> Throughput Std. Dev
>> 2.6.7 100.0% 0.3%
>> 2.6.8-rc2 99.5% 0.3%
>> 2.6.8-rc2-mm2 118.5% 0.6%
>
>
> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> balancing decisions, or is this all due to caching effects, reduced context
> switching etc?
One candidate for the cause of this improvement is the replacement of
the active/expired array mechanism with a single array. I believe that
one of the shortcomings of the active/expired array mechanism is that
it can lead to excessive queuing (possibly even starvation) of tasks
that aren't considered "interactive".
>
> I don't expect we'll be merging a new CPU scheduler into mainline any time
> soon, but we should work to understand where this improvement came from,
> and see if we can get the mainline scheduler to catch up.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 23:44 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Peter Williams
@ 2004-08-04 23:59 ` Martin J. Bligh
2004-08-05 5:20 ` Rick Lindsley
0 siblings, 1 reply; 27+ messages in thread
From: Martin J. Bligh @ 2004-08-04 23:59 UTC (permalink / raw)
To: Peter Williams, Andrew Morton, Rick Lindsley
Cc: kernel, linux-kernel, Ingo Molnar
--On Thursday, August 05, 2004 09:44:06 +1000 Peter Williams <pwil3058@bigpond.net.au> wrote:
> Andrew Morton wrote:
>> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>>
>>> SDET 8 (see disclaimer)
>>> Throughput Std. Dev
>>> 2.6.7 100.0% 0.2%
>>> 2.6.8-rc2 100.2% 1.0%
>>> 2.6.8-rc2-mm2 117.4% 0.9%
>>>
>>> SDET 16 (see disclaimer)
>>> Throughput Std. Dev
>>> 2.6.7 100.0% 0.3%
>>> 2.6.8-rc2 99.5% 0.3%
>>> 2.6.8-rc2-mm2 118.5% 0.6%
>>
>>
>> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
>> balancing decisions, or is this all due to caching effects, reduced context
>> switching etc?
>
> One candidate for the cause of this improvement is the replacement of the active/expired array mechanism with a single array. I believe that one of the shortcomings of the active/expired array mechanism is that it can lead to excessive queuing (possibly even starvation) of tasks that aren't considered "interactive".
Rick showed me schedstats graphs of the two ... it seems to have lower
latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
looks better ... he'll send them out soon, I think (hint, hint).
M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-04 23:59 ` Martin J. Bligh
@ 2004-08-05 5:20 ` Rick Lindsley
2004-08-05 10:45 ` Ingo Molnar
0 siblings, 1 reply; 27+ messages in thread
From: Rick Lindsley @ 2004-08-05 5:20 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Peter Williams, Andrew Morton, kernel, linux-kernel, Ingo Molnar
Rick showed me schedstats graphs of the two ... it seems to have lower
latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
looks better ... he'll send them out soon, I think (hint, hint).
Okay, they're done. Here's the URL of the graphs:
http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
General summary: as Martin reported, we're seeing improvements in a number
of areas, at least with sdet. The graphs as listed there represent stats
from four separate sdet runs run sequentially with an increasing load.
(We're trying to see if we can get the information from each run separately,
rather than the aggregate -- one of the hazards of an automated test
harness :)
Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-05 5:20 ` Rick Lindsley
@ 2004-08-05 10:45 ` Ingo Molnar
0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2004-08-05 10:45 UTC (permalink / raw)
To: Rick Lindsley
Cc: Martin J. Bligh, Peter Williams, Andrew Morton, kernel,
linux-kernel
* Rick Lindsley <ricklind@us.ibm.com> wrote:
> Okay, they're done. Here's the URL of the graphs:
>
> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>
> General summary: as Martin reported, we're seeing improvements in a
> number of areas, at least with sdet. The graphs as listed there
> represent stats from four separate sdet runs run sequentially with an
> increasing load. (We're trying to see if we can get the information
> from each run separately, rather than the aggregate -- one of the
> hazards of an automated test harness :)
really nice results! Would be interesting to see the effect of Con's
patch on other SMP/NUMA workloads as well - i'd expect to see an
improvement there too. The test was done with the default interactive=1
compute=0 setting, right?
Ingo
* Re: 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch
[not found] ` <20040805143249.GA23967@elte.hu>
@ 2004-08-05 18:36 ` Andrew Morton
2004-08-05 18:59 ` Rick Lindsley
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-08-05 18:36 UTC (permalink / raw)
To: Ingo Molnar; +Cc: ricklind, mbligh, kernel, linux-kernel, nickpiggin
Ingo Molnar <mingo@elte.hu> wrote:
>
> * Rick Lindsley <ricklind@us.ibm.com> wrote:
>
> > The version below is for 2.6.8-rc2-mm2 without the staircase code and
> > has been compiled cleanly but not yet run.
>
> it looks good in principle, but this code needs a couple of cleanups
> before it can go into mainline. I've attached 3 patches, your original,
> the fixed up version and a delta that does the fixups relative to your
> patch.
OK, thanks. I dropped the three schedstats patches, added schedstat-v10.
My current rollup (which is pretty much rc3-mm1 with only that change) is
at http://www.zip.com.au/~akpm/linux/patches/stuff/x.bz2. Additional
scheduler work should be against that tree, please.
* Re: 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch
2004-08-05 18:36 ` Andrew Morton
@ 2004-08-05 18:59 ` Rick Lindsley
0 siblings, 0 replies; 27+ messages in thread
From: Rick Lindsley @ 2004-08-05 18:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ingo Molnar, mbligh, kernel, linux-kernel, nickpiggin
Adrian's problem was due to using schedstats on a non-CONFIG_SMP system,
which I hadn't tried since sched-domains was added. I'll send a patch
to you separately, against this tree
http://www.zip.com.au/~akpm/linux/patches/stuff/x.bz2
to address that.
Since schedstats digs deep into the internals of the scheduler, major
scheduler changes will probably always require some major revamping
of schedstats, at least to deprecate old counters and possibly to add
new ones. I think, though, that after this initial settling-in period
minor scheduler changes should result in just tweaks to schedstat.
Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
[not found] <200408092240.05287.habanero@us.ibm.com>
@ 2004-08-10 4:08 ` Andrew Theurer
2004-08-10 4:37 ` Con Kolivas
2004-08-10 7:40 ` Rick Lindsley
0 siblings, 2 replies; 27+ messages in thread
From: Andrew Theurer @ 2004-08-10 4:08 UTC (permalink / raw)
To: linux-kernel; +Cc: ricklind, mbligh, mingo, akpm
On Monday 09 August 2004 22:40, you wrote:
> Rick showed me schedstats graphs of the two ... it seems to have lower
> latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
> looks better ... he'll send them out soon, I think (hint, hint).
>
> Okay, they're done. Here's the URL of the graphs:
>
> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>
> General summary: as Martin reported, we're seeing improvements in a number
> of areas, at least with sdet. The graphs as listed there represent stats
> from four separate sdet runs run sequentially with an increasing load.
> (We're trying to see if we can get the information from each run
> separately, rather than the aggregate -- one of the hazards of an automated
> test harness :)
What's quite interesting is that there is a very noticeable surge in
load_balance with staircase in the early stage of the test, but there appear
to be -no- direct policy changes to load-balance at all in Con's patch (or at
least I didn't notice it -please tell me if you did!). You can see it in
busy load_balance, sched_balance_exec, and pull_task. The runslice and
latency stats confirm this -no-staircase does not balance early on, and the
tasks suffer, waiting on a cpu already loaded up. I do not have an
explanation for this; perhaps it has something to do with eliminating the
expired queue.
It would be nice to have per-cpu runqueue lengths logged to see how this plays
out -do the cpus on staircase obtain a runqueue length close to
nr_running()/nr_online_cpus sooner than no-staircase?
Also, one big change apparent to me is the elimination of TIMESLICE_GRANULARITY.
Do you have cswitch data? I would not be surprised if it's a lot higher on
-no-staircase, and cache is thrashed a lot more. This may be something you
can pull out of the -no-staircase kernel quite easily.
-Andrew Theurer
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 4:08 ` Andrew Theurer
@ 2004-08-10 4:37 ` Con Kolivas
2004-08-10 15:05 ` Andrew Theurer
2004-08-10 7:40 ` Rick Lindsley
1 sibling, 1 reply; 27+ messages in thread
From: Con Kolivas @ 2004-08-10 4:37 UTC (permalink / raw)
To: Andrew Theurer; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm
Andrew Theurer writes:
> On Monday 09 August 2004 22:40, you wrote:
>> Rick showed me schedstats graphs of the two ... it seems to have lower
>> latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
>> looks better ... he'll send them out soon, I think (hint, hint).
>>
>> Okay, they're done. Here's the URL of the graphs:
>>
>> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>>
>> General summary: as Martin reported, we're seeing improvements in a number
>> of areas, at least with sdet. The graphs as listed there represent stats
>> from four separate sdet runs run sequentially with an increasing load.
>> (We're trying to see if we can get the information from each run
>> separately, rather than the aggregate -- one of the hazards of an automated
>> test harness :)
>
> What's quite interesting is that there is a very noticeable surge in
> load_balance with staircase in the early stage of the test, but there appears
> to be -no- direct policy changes to load-balance at all in Con's patch (or at
> least I didn't notice it -please tell me if you did!). You can see it in
> busy load_balance, sched_balance_exec, and pull_task. The runslice and
> latency stats confirm this -no-staircase does not balance early on, and the
> tasks suffer, waiting on a cpu already loaded up. I do not have an
> explanation for this; perhaps it has something to do with eliminating expired
> queue.
To be honest I have no idea why that's the case. One of the first things I
did was eliminate the expired array and in my testing (up to 8x at osdl) I
did not really notice this in and of itself made any big difference - of
course this could be because the removal of the expired array was not done
in a way which entitled starved tasks to run in reasonable timeframes.
> I would be nice to have per cpu runqueue lengths logged to see how this plays
> out -do the cpus on staircase obtain a runqueue length close to
> nr_running()/nr_online_cpus sooner than no-staircase?
/me looks in the schedstats people's way
> Also, one big change apparent to me, the elimination of TIMESLICE_GRANULARITY.
Ah well I tuned the timeslice granularity and I can tell you it isn't quite
what most people think. The granularity when you get to greater than 4 cpus
is effectively _disabled_. So in fact, the timeslices are shorter in
staircase (in normal interactive=1, compute=0 mode which is how martin
would have tested it), not longer. But this is not the reason either since
in "compute" mode they are ten times longer and this also improves
throughput further.
> Do you have cswitch data? I would not be surprised if it's a lot higher on
> -no-staircase, and cache is thrashed a lot more. This may be something you
> can pull out of the -no-staircase kernel quite easily.
Well from what I got on 8x, the optimal load (-j x4cpus) and maximal load
(-j) on kernbench give surprisingly similar context switch rates. It's only
when I enable compute mode that the context switches drop compared to
default staircase mode and mainline. You'd have to ask Martin and Rick about
what they got.
> -Andrew Theurer
Cheers,
Con
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 4:08 ` Andrew Theurer
2004-08-10 4:37 ` Con Kolivas
@ 2004-08-10 7:40 ` Rick Lindsley
2004-08-10 15:19 ` Andrew Theurer
1 sibling, 1 reply; 27+ messages in thread
From: Rick Lindsley @ 2004-08-10 7:40 UTC (permalink / raw)
To: Andrew Theurer; +Cc: linux-kernel, mbligh, mingo, akpm
What's quite interesting is that there is a very noticeable surge in
load_balance with staircase in the early stage of the test, but there
appears to be -no- direct policy changes to load-balance at all in
Con's patch (or at least I didn't notice it -please tell me if you
did!). You can see it in busy load_balance, sched_balance_exec, and
pull_task. The runslice and latency stats confirm this -no-staircase
does not balance early on, and the tasks suffer, waiting on a cpu
already loaded up. I do not have an explanation for this; perhaps
it has something to do with eliminating expired queue.
Possibly. The other factor thrown in here is that this was on an SMT
machine, so it's possible that the balancing is no different but we are
seeing tasks initially assigned more poorly. Or, perhaps we're drawing
too much from one data point.
It would be nice to have per cpu runqueue lengths logged to see how
this plays out -do the cpus on staircase obtain a runqueue length
close to nr_running()/nr_online_cpus sooner than no-staircase?
The only difficulty there is do we know how long it normally takes for
this to balance out? We're taking samples every five seconds; might this
not work itself out between one snapshot and the next? Shrug. It would
be easy enough to add another field to report nr_running at the moment
the statistics snapshot was taken, but on anything but compute-intensive
benchmarks I'm afraid we might miss all the interesting data.
Also, one big change apparent to me, the elimination of
TIMESLICE_GRANULARITY. Do you have cswitch data? I would not
be surprised if it's a lot higher on -no-staircase, and cache is
thrashed a lot more. This may be something you can pull out of the
-no-staircase kernel quite easily.
Yes, sar data was collected every five seconds so I do have context switch
data. The bad news is that it was collected for each of 10 runs times
four different loads, and I don't have any handy dandy scripts to pretty
it up :) (Pause.) A quick exercise with a calculator, though, suggests
you are right. cswitches were 10%-20% higher on the no staircase runs.
Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 4:37 ` Con Kolivas
@ 2004-08-10 15:05 ` Andrew Theurer
2004-08-10 20:57 ` Con Kolivas
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Theurer @ 2004-08-10 15:05 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm
> > Also, one big change apparent to me, the elimination of
> > TIMESLICE_GRANULARITY.
>
> Ah well I tuned the timeslice granularity and I can tell you it isn't quite
> what most people think. The granularity when you get to greater than 4 cpus
> is effectively _disabled_. So in fact, the timeslices are shorter in
> staircase (in normal interactive=1, compute=0 mode which is how martin
> would have tested it), not longer. But this is not the reason either since
> in "compute" mode they are ten times longer and this also improves
> throughput further.
Interesting, I forgot about the "* nr_cpus" that was in the granularity
calculation. That does make me wonder, maybe the timeslices you are
calculating could have something similar, but more appropriate.
Since the number of runnable tasks on a cpu should play a part in latency (the
more tasks, potentially the longer the latency), I wonder if the timeslice
would benefit from a modifier like " / task_cpu(p)->nr_running ". With this
the base timeslice could be quite a bit larger to start for better cache
warmth, and as we add more tasks to that cpu, the timeslices get smaller, so
an acceptable latency is preserved.
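A minimal sketch of that modifier follows. The function name and the base/floor constants are invented for illustration; they are not from the kernel or from Con's patch.

```c
#include <assert.h>

/* Hypothetical timeslice calculation following the suggestion above:
 * start from a generous base slice (good cache warmth when the CPU is
 * lightly loaded) and divide by the local runqueue length, clamped to
 * a floor so slices never degenerate.  BASE_TIMESLICE_MS and
 * MIN_TIMESLICE_MS are made-up numbers for illustration. */

#define BASE_TIMESLICE_MS 200
#define MIN_TIMESLICE_MS   10

static int task_timeslice_ms(int rq_nr_running)
{
	int ts;

	if (rq_nr_running < 1)
		rq_nr_running = 1;
	ts = BASE_TIMESLICE_MS / rq_nr_running;  /* the "/ nr_running" modifier */
	return ts > MIN_TIMESLICE_MS ? ts : MIN_TIMESLICE_MS;
}
```

A lone task gets the full 200ms slice; four runnable tasks get 50ms each, so worst-case dispatch latency stays roughly constant as load grows.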
> > Do you have cswitch data? I would not be surprised if it's a lot higher
> > on -no-staircase, and cache is thrashed a lot more. This may be
> > something you can pull out of the -no-staircase kernel quite easily.
>
> Well from what I got on 8x the optimal load (-j x4cpus) and maximal load
> (-j) on kernbench gives surprisingly similar context switch rates. It's
> only when I enable compute mode that the context switches drop compared to
> default staircase mode and mainline. You'd have to ask Martin and Rick
> about what they got.
OK, thanks!
-Andrew Theurer
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 7:40 ` Rick Lindsley
@ 2004-08-10 15:19 ` Andrew Theurer
0 siblings, 0 replies; 27+ messages in thread
From: Andrew Theurer @ 2004-08-10 15:19 UTC (permalink / raw)
To: Rick Lindsley; +Cc: Con Kolivas, linux-kernel, mbligh, mingo, akpm
On Tuesday 10 August 2004 02:40, Rick Lindsley wrote:
> What's quite interesting is that there is a very noticeable surge in
> load_balance with staircase in the early stage of the test, but there
> appears to be -no- direct policy changes to load-balance at all in
> Con's patch (or at least I didn't notice it -please tell me if you
> did!). You can see it in busy load_balance, sched_balance_exec, and
> pull_task. The runslice and latency stats confirm this -no-staircase
> does not balance early on, and the tasks suffer, waiting on a cpu
> already loaded up. I do not have an explanation for this; perhaps
> it has something to do with eliminating expired queue.
>
> Possibly. The other factor thrown in here is that this was on an SMT
> machine, so it's possible that the balancing is no different but we are
> seeing tasks initially assigned more poorly. Or, perhaps we're drawing
> too much from one data point.
Yes, my first guess was that sched_balance_exec was changed, and I guess it
was, but earlier than Con's patch. The first conditional we had used to
have:
	if (this_rq()->nr_running <= 2)
		goto out;
but the 2 is now a 1 for both -rc2 and -rc2-mm2, so we tend to find the best
cpu in the system more often now.
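The effect of that threshold change can be sketched with a toy helper; this is a simplification of the guard in sched_balance_exec(), not the actual code.

```c
#include <assert.h>

/* Simplified model of the exec-time balance guard.  With the old
 * threshold of 2, a CPU whose runqueue held the exec'ing task plus
 * one other would skip searching for a less loaded CPU; with the new
 * threshold of 1 it only skips when the exec'ing task is alone, so
 * exec-time balancing kicks in more often. */

static int balances_at_exec(int nr_running, int threshold)
{
	return nr_running > threshold;
}
```

So a runqueue of length 2 now triggers a search for a better CPU (threshold 1) where it previously fell through to the local CPU (threshold 2).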
>
> It would be nice to have per cpu runqueue lengths logged to see how
> this plays out -do the cpus on staircase obtain a runqueue length
> close to nr_running()/nr_online_cpus sooner than no-staircase?
>
> The only difficulty there is do we know how long it normally takes for
> this to balance out? We're taking samples every five seconds; might this
> not work itself out between one snapshot and the next? Shrug. It would
> be easy enough to add another field to report nr_running at the moment
> the statistics snapshot was taken, but on anything but compute-intensive
> benchmarks I'm afraid we might miss all the interesting data.
Actually if you have sar cpu util data, we might be able to extract this. For
example, if we have balance issues on 16 user sdet, we may see that very
early on the staircase cpu util was near 100%, where the no-staircase may
have been much lower for the first portion of the test (showing that some
cpus were idle while others may have had more than one task). If we can see
this in sar, IMO that would confirm some sort of indirect load balance
improvement in staircase.
> Also, one big change apparent to me, the elimination of
> TIMESLICE_GRANULARITY. Do you have cswitch data? I would not
> be surprised if it's a lot higher on -no-staircase, and cache is
> thrashed a lot more. This may be something you can pull out of the
> -no-staircase kernel quite easily.
>
> Yes, sar data was collected every five seconds so I do have context switch
> data. The bad news is that it was collected for each of 10 runs times
> four different loads, and I don't have any handy dandy scripts to pretty
> it up :) (Pause.) A quick exercise with a calculator, though, suggests
> you are right. cswitches were 10%-20% higher on the no staircase runs.
Interesting. I wouldn't expect it to account for up to 20% performance, but
maybe 1-2%.
>
> Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
2004-08-10 15:05 ` Andrew Theurer
@ 2004-08-10 20:57 ` Con Kolivas
0 siblings, 0 replies; 27+ messages in thread
From: Con Kolivas @ 2004-08-10 20:57 UTC (permalink / raw)
To: Andrew Theurer; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm
Andrew Theurer wrote:
>>>Also, one big change apparent to me, the elimination of
>>>TIMESLICE_GRANULARITY.
>>
>>Ah well I tuned the timeslice granularity and I can tell you it isn't quite
>>what most people think. The granularity when you get to greater than 4 cpus
>>is effectively _disabled_. So in fact, the timeslices are shorter in
>>staircase (in normal interactive=1, compute=0 mode which is how martin
>>would have tested it), not longer. But this is not the reason either since
>>in "compute" mode they are ten times longer and this also improves
>>throughput further.
>
>
> Interesting, I forgot about the "* nr_cpus" that was in the granularity
> calculation. That does make me wonder, maybe the timeslices you are
> calculating could have something similar, but more appropriate.
>
> Since the number of runnable tasks on a cpu should play a part in latency (the
> more tasks, potentially the longer the latency), I wonder if the timeslice
> would benefit from a modifier like " / task_cpu(p)->nr_running ". With this
> the base timeslice could be quite a bit larger to start for better cache
> warmth, and as we add more tasks to that cpu, the timeslices get smaller, so
> an acceptable latency is preserved.
I had a problem with fairness once I made the timeslices too long since
that also determines priority demotion in the staircase design. That's
why I have the "compute" mode as quite a separate entity because the
longer timeslices on their own weren't of any special benefit (in my up
to 8x testing but could be elsewhere) unless I added the delayed
preemption which is probably where the main extra cache warmth comes
from in "compute" design. Of course this comes at a cost which is higher
latencies... because normal priority preemption is delayed.
>>>Do you have cswitch data? I would not be surprised if it's a lot higher
>>>on -no-staircase, and cache is thrashed a lot more. This may be
>>>something you can pull out of the -no-staircase kernel quite easily.
>>
>>Well from what I got on 8x the optimal load (-j x4cpus) and maximal load
>>(-j) on kernbench gives surprisingly similar context switch rates. It's
>>only when I enable compute mode that the context switches drop compared to
>>default staircase mode and mainline. You'd have to ask Martin and Rick
>>about what they got.
>
>
> OK, thanks!
>
> -Andrew Theurer
Cheers,
Con
end of thread, other threads:[~2004-08-10 20:58 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-04 15:10 2.6.8-rc2-mm2 performance improvements (scheduler?) Martin J. Bligh
2004-08-04 15:12 ` Martin J. Bligh
2004-08-04 19:24 ` Andrew Morton
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
2004-08-04 20:07 ` Rick Lindsley
2004-08-04 20:10 ` Ingo Molnar
2004-08-04 20:36 ` Martin J. Bligh
2004-08-04 21:31 ` Ingo Molnar
2004-08-04 23:34 ` Martin J. Bligh
2004-08-04 21:26 ` 2.6.8-rc2-mm2, schedstat-2.6.8-rc2-mm2-A4.patch Ingo Molnar
2004-08-04 21:34 ` Sam Ravnborg
2004-08-04 21:46 ` Randy.Dunlap
2004-08-04 22:13 ` Ingo Molnar
2004-08-04 22:10 ` Rick Lindsley
[not found] ` <20040805143249.GA23967@elte.hu>
2004-08-05 18:36 ` Andrew Morton
2004-08-05 18:59 ` Rick Lindsley
2004-08-04 23:44 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Peter Williams
2004-08-04 23:59 ` Martin J. Bligh
2004-08-05 5:20 ` Rick Lindsley
2004-08-05 10:45 ` Ingo Molnar
[not found] <200408092240.05287.habanero@us.ibm.com>
2004-08-10 4:08 ` Andrew Theurer
2004-08-10 4:37 ` Con Kolivas
2004-08-10 15:05 ` Andrew Theurer
2004-08-10 20:57 ` Con Kolivas
2004-08-10 7:40 ` Rick Lindsley
2004-08-10 15:19 ` Andrew Theurer