* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Andrew Theurer @ 2004-08-10 4:08 UTC
To: linux-kernel; +Cc: ricklind, mbligh, mingo, akpm

On Monday 09 August 2004 22:40, you wrote:
> Rick showed me schedstats graphs of the two ... it seems to have lower
> latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
> looks better ... he'll send them out soon, I think (hint, hint).
>
> Okay, they're done. Here's the URL of the graphs:
>
> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>
> General summary: as Martin reported, we're seeing improvements in a number
> of areas, at least with sdet. The graphs as listed there represent stats
> from four separate sdet runs run sequentially with an increasing load.
> (We're trying to see if we can get the information from each run
> separately, rather than the aggregate -- one of the hazards of an
> automated test harness :)

What's quite interesting is that there is a very noticeable surge in
load_balance with staircase in the early stage of the test, yet there
appear to be -no- direct policy changes to load balancing at all in Con's
patch (or at least I didn't notice any; please tell me if you did!). You
can see it in busy load_balance, sched_balance_exec, and pull_task. The
runslice and latency stats confirm this: no-staircase does not balance
early on, and the tasks suffer, waiting on a cpu that is already loaded
up. I do not have an explanation for this; perhaps it has something to do
with eliminating the expired queue. It would be nice to have per-cpu
runqueue lengths logged to see how this plays out: do the cpus on
staircase reach a runqueue length close to nr_running()/nr_online_cpus
sooner than no-staircase?

Also, one big change apparent to me is the elimination of
TIMESLICE_GRANULARITY. Do you have cswitch data? I would not be surprised
if it's a lot higher on -no-staircase, and cache is thrashed a lot more.
This may be something you can pull out of the -no-staircase kernel quite
easily.

-Andrew Theurer
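[Editorial note: the balance target Andrew mentions, nr_running()/nr_online_cpus, could be checked against logged per-cpu runqueue snapshots with a helper along these lines. This is a minimal post-processing sketch, not kernel code; the function name and the ceiling-based target are assumptions.]

```c
/* Hypothetical post-processing helper (not kernel code): given a
 * snapshot of per-cpu runqueue lengths, report how far the busiest
 * cpu sits above the balanced target of nr_running / nr_online_cpus
 * (rounded up). Zero means the snapshot is as balanced as it can be. */
static unsigned int rq_imbalance(const unsigned int *rq_len, unsigned int cpus)
{
    unsigned int total = 0, busiest = 0, target, i;

    for (i = 0; i < cpus; i++) {
        total += rq_len[i];
        if (rq_len[i] > busiest)
            busiest = rq_len[i];
    }
    target = (total + cpus - 1) / cpus;  /* ceil(nr_running / cpus) */
    return busiest > target ? busiest - target : 0;
}
```

Logging this value per schedstats snapshot would show directly whether staircase converges to the target sooner.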
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Con Kolivas @ 2004-08-10 4:37 UTC
To: Andrew Theurer; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm

Andrew Theurer writes:
> What's quite interesting is that there is a very noticeable surge in
> load_balance with staircase in the early stage of the test, but there
> appears to be -no- direct policy changes to load-balance at all in Con's
> patch (or at least I didn't notice it; please tell me if you did!). You
> can see it in busy load_balance, sched_balance_exec, and pull_task. The
> runslice and latency stats confirm this: no-staircase does not balance
> early on, and the tasks suffer, waiting on a cpu already loaded up. I do
> not have an explanation for this; perhaps it has something to do with
> eliminating the expired queue.

To be honest, I have no idea why that's the case. One of the first things
I did was eliminate the expired array, and in my testing (up to 8x at
osdl) I did not really notice that this, in and of itself, made any big
difference; of course, this could be because the removal of the expired
array was not done in a way which entitled starved tasks to run in
reasonable timeframes.

> It would be nice to have per-cpu runqueue lengths logged to see how this
> plays out: do the cpus on staircase reach a runqueue length close to
> nr_running()/nr_online_cpus sooner than no-staircase?

/me looks the schedstats people's way

> Also, one big change apparent to me is the elimination of
> TIMESLICE_GRANULARITY.

Ah well, I tuned the timeslice granularity and I can tell you it isn't
quite what most people think. The granularity, once you get to more than
4 cpus, is effectively _disabled_. So in fact the timeslices are shorter
in staircase (in the normal interactive=1, compute=0 mode, which is how
Martin would have tested it), not longer. But this is not the reason
either, since in "compute" mode they are ten times longer and this also
improves throughput further.

> Do you have cswitch data? I would not be surprised if it's a lot higher
> on -no-staircase, and cache is thrashed a lot more. This may be
> something you can pull out of the -no-staircase kernel quite easily.

From what I got on 8x, the optimal load (-j 4*cpus) and maximal load (-j)
on kernbench give surprisingly similar context switch rates. It's only
when I enable compute mode that the context switches drop compared to
default staircase mode and mainline. You'd have to ask Martin and Rick
about what they got.

Cheers,
Con
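[Editorial note: the "effectively _disabled_" point can be illustrated with a simplified model. In the 2.6 mainline scheme Con refers to, a task's timeslice was carved into granularity-sized chunks for round-robin among equal-priority tasks, and the granularity calculation included a cpu-count factor; once the chunk grows past the timeslice itself, no intra-slice preemption happens. This sketch is illustrative only and is not the exact 2.6.8 macro; the numbers in the test are made up.]

```c
/* Simplified model of how a cpu-count factor in the granularity
 * calculation disables timeslice splitting: a slice is only split
 * into round-robin chunks while the granularity is smaller than the
 * slice itself. Returns the number of chunks a slice is cut into. */
static unsigned int slice_chunks(unsigned int timeslice_ms,
                                 unsigned int base_granularity_ms,
                                 unsigned int nr_cpus)
{
    unsigned int gran = base_granularity_ms * nr_cpus;

    if (gran == 0 || gran >= timeslice_ms)
        return 1;  /* no intra-slice round-robin: splitting disabled */
    return timeslice_ms / gran;
}
```

With a cpu-count multiplier, the chunk count collapses to 1 well before large machines are reached, matching the "disabled above 4 cpus" observation.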
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Andrew Theurer @ 2004-08-10 15:05 UTC
To: Con Kolivas; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm

> > Also, one big change apparent to me, the elimination of
> > TIMESLICE_GRANULARITY.
>
> Ah well, I tuned the timeslice granularity and I can tell you it isn't
> quite what most people think. The granularity, once you get to more than
> 4 cpus, is effectively _disabled_. So in fact the timeslices are shorter
> in staircase (in the normal interactive=1, compute=0 mode, which is how
> Martin would have tested it), not longer. But this is not the reason
> either, since in "compute" mode they are ten times longer and this also
> improves throughput further.

Interesting; I forgot about the "* nr_cpus" that was in the granularity
calculation. That does make me wonder whether the timeslices you are
calculating could have something similar, but more appropriate.

Since the number of runnable tasks on a cpu should play a part in latency
(the more tasks, potentially the longer the latency), I wonder if the
timeslice would benefit from a modifier like "/ task_cpu(p)->nr_running".
With this, the base timeslice could be quite a bit larger to start, for
better cache warmth, and as we add more tasks to that cpu the timeslices
get smaller, so an acceptable latency is preserved.

> > Do you have cswitch data? I would not be surprised if it's a lot
> > higher on -no-staircase, and cache is thrashed a lot more. This may
> > be something you can pull out of the -no-staircase kernel quite
> > easily.
>
> From what I got on 8x, the optimal load (-j 4*cpus) and maximal load
> (-j) on kernbench give surprisingly similar context switch rates. It's
> only when I enable compute mode that the context switches drop compared
> to default staircase mode and mainline. You'd have to ask Martin and
> Rick about what they got.

OK, thanks!

-Andrew Theurer
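[Editorial note: the modifier Andrew proposes can be sketched as a standalone function. The idea is that worst-case latency is roughly nr_running * timeslice, so dividing the slice by the local runqueue length keeps that product roughly constant. The base and minimum values here are invented for illustration; this is not code from either scheduler.]

```c
/* Sketch of a runqueue-length-scaled timeslice: start from a large
 * base slice for cache warmth and shrink it as the local runqueue
 * grows, clamped to a floor so a slice never vanishes entirely. */
#define BASE_TS_MS 100  /* assumed base timeslice */
#define MIN_TS_MS    5  /* assumed floor */

static unsigned int scaled_timeslice(unsigned int nr_running)
{
    unsigned int ts = BASE_TS_MS / (nr_running ? nr_running : 1);

    return ts < MIN_TS_MS ? MIN_TS_MS : ts;
}
```

A lone task would get the full 100 ms; with four runnable tasks each gets 25 ms, so the worst-case wait stays near 100 ms either way.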
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Con Kolivas @ 2004-08-10 20:57 UTC
To: Andrew Theurer; +Cc: linux-kernel, ricklind, mbligh, mingo, akpm

Andrew Theurer wrote:
> Interesting; I forgot about the "* nr_cpus" that was in the granularity
> calculation. That does make me wonder whether the timeslices you are
> calculating could have something similar, but more appropriate.
>
> Since the number of runnable tasks on a cpu should play a part in
> latency (the more tasks, potentially the longer the latency), I wonder
> if the timeslice would benefit from a modifier like
> "/ task_cpu(p)->nr_running". With this, the base timeslice could be
> quite a bit larger to start, for better cache warmth, and as we add more
> tasks to that cpu the timeslices get smaller, so an acceptable latency
> is preserved.

I had a problem with fairness once I made the timeslices too long, since
timeslice also determines priority demotion in the staircase design.
That's why I have the "compute" mode as quite a separate entity: the
longer timeslices on their own weren't of any special benefit (in my
up-to-8x testing, though it could be different elsewhere) unless I added
the delayed preemption, which is probably where the main extra cache
warmth comes from in the "compute" design. Of course, this comes at a
cost, which is higher latencies, because normal priority preemption is
delayed.

Cheers,
Con
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Rick Lindsley @ 2004-08-10 7:40 UTC
To: Andrew Theurer; +Cc: linux-kernel, mbligh, mingo, akpm

    What's quite interesting is that there is a very noticeable surge in
    load_balance with staircase in the early stage of the test, but there
    appears to be -no- direct policy changes to load-balance at all in
    Con's patch (or at least I didn't notice it; please tell me if you
    did!). You can see it in busy load_balance, sched_balance_exec, and
    pull_task. The runslice and latency stats confirm this: no-staircase
    does not balance early on, and the tasks suffer, waiting on a cpu
    already loaded up. I do not have an explanation for this; perhaps it
    has something to do with eliminating the expired queue.

Possibly. The other factor thrown in here is that this was on an SMT
machine, so it's possible that the balancing is no different but we are
seeing tasks initially assigned more poorly. Or perhaps we're drawing too
much from one data point.

    It would be nice to have per-cpu runqueue lengths logged to see how
    this plays out: do the cpus on staircase reach a runqueue length close
    to nr_running()/nr_online_cpus sooner than no-staircase?

The only difficulty there is: do we know how long it normally takes for
this to balance out? We're taking samples every five seconds; might this
not work itself out between one snapshot and the next? Shrug. It would be
easy enough to add another field to report nr_running at the moment the
statistics snapshot was taken, but on anything but compute-intensive
benchmarks I'm afraid we might miss all the interesting data.

    Also, one big change apparent to me is the elimination of
    TIMESLICE_GRANULARITY. Do you have cswitch data? I would not be
    surprised if it's a lot higher on -no-staircase, and cache is
    thrashed a lot more. This may be something you can pull out of the
    -no-staircase kernel quite easily.

Yes, sar data was collected every five seconds, so I do have context
switch data. The bad news is that it was collected for each of 10 runs
times four different loads, and I don't have any handy dandy scripts to
pretty it up :) (Pause.) A quick exercise with a calculator, though,
suggests you are right: cswitches were 10%-20% higher on the no-staircase
runs.

Rick
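[Editorial note: the arithmetic behind Rick's calculator exercise is simple enough to sketch. Assuming cumulative context-switch counters (as in /proc/stat's ctxt line), a run's average rate is the counter delta over elapsed time, and the kernels are then compared as a percentage increase. The sample numbers below are made up.]

```c
/* Average context-switch rate from a cumulative counter sampled at
 * the start and end of a run. */
static double cswitch_rate(unsigned long first, unsigned long last,
                           double seconds)
{
    return (double)(last - first) / seconds;
}

/* Percentage by which `other` exceeds `baseline`. */
static double pct_increase(double baseline, double other)
{
    return (other - baseline) * 100.0 / baseline;
}
```

A 10%-20% result means the no-staircase rate landed between 1.1x and 1.2x the staircase rate across the runs compared.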
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Andrew Theurer @ 2004-08-10 15:19 UTC
To: Rick Lindsley; +Cc: Con Kolivas, linux-kernel, mbligh, mingo, akpm

On Tuesday 10 August 2004 02:40, Rick Lindsley wrote:
> Possibly. The other factor thrown in here is that this was on an SMT
> machine, so it's possible that the balancing is no different but we are
> seeing tasks initially assigned more poorly. Or perhaps we're drawing
> too much from one data point.

Yes, my first guess was that sched_balance_exec was changed, and I guess
it was, but earlier than Con's patch. The first conditional in it used to
be:

	if (this_rq()->nr_running <= 2)
		goto out;

but the 2 is now a 1 for both -rc2 and -rc2-mm2, so we tend to find the
best cpu in the system more often now.

> The only difficulty there is: do we know how long it normally takes for
> this to balance out? We're taking samples every five seconds; might
> this not work itself out between one snapshot and the next? Shrug. It
> would be easy enough to add another field to report nr_running at the
> moment the statistics snapshot was taken, but on anything but
> compute-intensive benchmarks I'm afraid we might miss all the
> interesting data.

Actually, if you have sar cpu utilization data, we might be able to
extract this. For example, if we have balance issues on 16-user sdet, we
may see that very early on the staircase cpu utilization was near 100%,
where no-staircase may have been much lower for the first portion of the
test (showing that some cpus were idle while others may have had more
than one task). If we can see this in sar, IMO that would confirm some
sort of indirect load-balance improvement in staircase.

> Yes, sar data was collected every five seconds, so I do have context
> switch data. ... A quick exercise with a calculator, though, suggests
> you are right: cswitches were 10%-20% higher on the no-staircase runs.

Interesting. I wouldn't expect it to account for up to 20% performance,
but maybe 1-2%.

-Andrew Theurer
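[Editorial note: the threshold change Andrew describes can be modeled as a tiny standalone decision function. This is a paraphrase of the effect, not the kernel's sched_balance_exec(); the function name is invented.]

```c
/* Model of the exec-balance gate: an exec'ing task only searches for
 * the idlest cpu when the local runqueue length exceeds the
 * threshold. Old code used threshold 2; -rc2 and -rc2-mm2 use 1, so
 * nearly every exec on a busy cpu now looks system-wide. */
static int exec_should_balance(unsigned int local_nr_running,
                               unsigned int threshold)
{
    return local_nr_running > threshold;
}
```

With two tasks on the local runqueue, the old threshold keeps the exec local while the new one triggers a search, which fits the earlier and more aggressive balancing seen in the schedstats graphs.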
* 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Martin J. Bligh @ 2004-08-04 15:10 UTC
To: Andrew Morton; +Cc: Con Kolivas, linux-kernel
2.6.8-rc2-mm2 has some significant improvements over 2.6.8-rc2,
particularly at low to mid loads ... at the high loads, it's still
slightly improved, but less significant. Numbers from 16x NUMA-Q.
Kernbench sees most improvement in sys time, but also some elapsed
time ... SDET sees up to 18% more throughput.
I'm also amused to see that the process scalability is now pretty
damned good ... a full -j kernel compile (using up to about 1300
tasks) goes as fast as the -j 256 (the middle one) ... and elapsed
is faster than -j16, even if system is a little higher.
I *think* this is the scheduler changes ... fits in with profile diffs
I've seen before ... diffprofiles at the end. In my experience, higher
copy_to/from_user and finish_task_switch stuff tends to indicate task
thrashing. Note also the .text.lock.semaphore numbers in the kernbench profile.
The SDET one looks like it's load-balancing better (mainly less idle
time).
Great stuff.
M.
Kernbench: (make -j N vmlinux, where N = 2 x num_cpus)
Elapsed System User CPU
2.6.7 45.37 90.91 579.75 1479.33
2.6.8-rc2 45.05 88.53 577.87 1485.67
2.6.8-rc2-mm2 44.09 78.84 577.01 1486.33
Kernbench: (make -j N vmlinux, where N = 16 x num_cpus)
Elapsed System User CPU
2.6.7 44.77 97.96 576.59 1507.00
2.6.8-rc2 44.83 96.00 575.50 1497.33
2.6.8-rc2-mm2 43.43 86.04 576.26 1524.33
Kernbench: (make -j vmlinux, maximal tasks)
Elapsed System User CPU
2.6.7 44.25 88.95 575.63 1501.33
2.6.8-rc2 44.03 87.74 573.82 1503.67
2.6.8-rc2-mm2 43.75 86.68 576.98 1518.00
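[Editorial note: the CPU column in the kernbench tables above is, to rounding and run averaging, total cpu time over wall time, i.e. (system + user) * 100 / elapsed. A quick sketch of that check against the 2.6.7 row of the first table:]

```c
/* Overall cpu utilization percentage for a kernbench run: total cpu
 * seconds (system + user) as a percentage of elapsed wall seconds.
 * On a 16-way box the ideal is 1600%. */
static double cpu_pct(double elapsed, double system, double user)
{
    return (system + user) * 100.0 / elapsed;
}
```

For 2.6.7 at -j32 this gives about 1478%, against the reported 1479.33; the small gap comes from averaging the ratio across runs rather than recomputing it from averaged columns.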
DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.
Results are shown as percentages of the first set displayed
SDET 1 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 1.0%
2.6.8-rc2 95.9% 2.3%
2.6.8-rc2-mm2 111.5% 3.3%
SDET 2 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.0%
2.6.8-rc2 100.5% 1.4%
2.6.8-rc2-mm2 115.1% 4.0%
SDET 4 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 1.0%
2.6.8-rc2 99.2% 1.1%
2.6.8-rc2-mm2 111.9% 0.5%
SDET 8 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 100.2% 1.0%
2.6.8-rc2-mm2 117.4% 0.9%
SDET 16 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.3%
2.6.8-rc2 99.5% 0.3%
2.6.8-rc2-mm2 118.5% 0.6%
SDET 32 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.3%
2.6.8-rc2 99.7% 0.4%
2.6.8-rc2-mm2 102.1% 0.8%
SDET 64 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 101.6% 0.4%
2.6.8-rc2-mm2 103.2% 0.0%
SDET 128 (see disclaimer)
Throughput Std. Dev
2.6.7 100.0% 0.2%
2.6.8-rc2 100.2% 0.1%
2.6.8-rc2-mm2 103.0% 0.3%
Diffprofile for kernbench -j32 (-ve numbers better with mm2)
2135 4.3% default_idle
233 44.2% pte_alloc_one
220 11.9% buffered_rmqueue
164 264.5% schedule
135 5.9% do_page_fault
84 10.7% clear_page_tables
62 62.6% __wake_up_sync
51 60.7% set_page_address
...
-50 -43.9% sys_close
-56 -10.7% __fput
-58 -13.7% set_page_dirty
-61 -10.0% copy_process
-70 -41.4% pipe_writev
-77 -8.1% file_move
-85 -100.0% wake_up_forked_thread
-87 -50.9% pipe_wait
-90 -5.7% path_lookup
-93 -21.2% page_add_anon_rmap
-105 -28.5% release_task
-113 -11.9% do_wp_page
-116 -7.9% link_path_walk
-116 -43.1% pipe_readv
-121 -7.5% atomic_dec_and_lock
-138 -15.4% strnlen_user
-159 -9.4% do_no_page
-167 -2.7% __d_lookup
-214 -59.4% find_idlest_cpu
-230 -6.2% find_trylock_page
-237 -1.6% do_anonymous_page
-255 -7.8% zap_pte_range
-444 -97.6% .text.lock.semaphore
-532 -43.4% Letext
-632 -54.2% __wake_up
-1086 -52.2% finish_task_switch
-1436 -24.6% __copy_to_user_ll
-3079 -46.3% __copy_from_user_ll
-7468 -5.4% total
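[Editorial note: diffprofile rows appear to print the absolute tick delta and that delta as a percentage of the first profile, so the original per-symbol counts can be roughly recovered from each row. A sketch of that back-calculation, using the "-444 -97.6% .text.lock.semaphore" row above as the example:]

```c
/* Recover the approximate "before" tick count from a diffprofile row:
 * if pct = delta / before * 100, then before = delta * 100 / pct, and
 * after = before + delta. For (-444, -97.6%) this gives roughly 455
 * ticks before and 11 after, i.e. the semaphore contention almost
 * entirely disappeared. */
static double ticks_before(double delta, double pct)
{
    return delta * 100.0 / pct;
}
```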
sdetbench 8 (-ve numbers better with mm2)
...
-58 -33.5% clear_page_tables
-61 -19.3% __d_lookup
-70 -15.2% page_remove_rmap
-71 -71.0% finish_task_switch
-71 -46.4% fput
-72 -56.7% buffered_rmqueue
-73 -53.7% pte_alloc_one
-74 -22.9% __copy_to_user_ll
-75 -31.0% do_no_page
-85 -68.0% free_hot_cold_page
-95 -66.0% __copy_user_intel
-118 -21.1% find_trylock_page
-126 -43.8% do_anonymous_page
-171 -21.6% copy_page_range
-368 -38.8% zap_pte_range
-392 -62.1% do_wp_page
-6262 -11.9% default_idle
-9294 -14.4% total
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Martin J. Bligh @ 2004-08-04 15:12 UTC
To: Andrew Morton; +Cc: Con Kolivas, linux-kernel

Doh. Clipped the paste button on the mouse against the PC just as I hit
send ... pasting a bunch of data in the wrong place ;-) Should've looked
like this:

----------------------------------------------

2.6.8-rc2-mm2 has some significant improvements over 2.6.8-rc2,
particularly at low to mid loads ... at the high loads, it's still
slightly improved, but less significant. Numbers from 16x NUMA-Q.

Kernbench sees most improvement in sys time, but also some elapsed
time ... SDET sees up to 18% more throughput.

I'm also amused to see that the process scalability is now pretty
damned good ... a full -j kernel compile (using up to about 1300
tasks) goes as fast as the -j 256 (the middle one) ... and elapsed
is faster than -j16, even if system is a little higher.

I *think* this is the scheduler changes ... fits in with profile diffs
I've seen before ... diffprofiles at the end. In my experience, higher
copy_to/from_user and finish_task_switch stuff tends to indicate task
thrashing. Note also the .text.lock.semaphore numbers in the kernbench
profile. The SDET one looks like it's load-balancing better (mainly
less idle time).

Great stuff.

M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Andrew Morton @ 2004-08-04 19:24 UTC
To: Martin J. Bligh; +Cc: kernel, linux-kernel, Ingo Molnar

"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> SDET 8 (see disclaimer)
>                 Throughput    Std. Dev
> 2.6.7              100.0%        0.2%
> 2.6.8-rc2          100.2%        1.0%
> 2.6.8-rc2-mm2      117.4%        0.9%
>
> SDET 16 (see disclaimer)
>                 Throughput    Std. Dev
> 2.6.7              100.0%        0.3%
> 2.6.8-rc2           99.5%        0.3%
> 2.6.8-rc2-mm2      118.5%        0.6%

hum, interesting. Can Con's changes affect the inter-node and inter-cpu
balancing decisions, or is this all due to caching effects, reduced
context switching, etc?

I don't expect we'll be merging a new CPU scheduler into mainline any
time soon, but we should work to understand where this improvement came
from, and see if we can get the mainline scheduler to catch up.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Martin J. Bligh @ 2004-08-04 19:34 UTC
To: Andrew Morton; +Cc: kernel, linux-kernel, Ingo Molnar, Rick Lindsley

--On Wednesday, August 04, 2004 12:24:14 -0700 Andrew Morton
<akpm@osdl.org> wrote:

> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> balancing decisions, or is this all due to caching effects, reduced
> context switching, etc?
>
> I don't expect we'll be merging a new CPU scheduler into mainline any
> time soon, but we should work to understand where this improvement came
> from, and see if we can get the mainline scheduler to catch up.

Dunno ... we really need to take schedstats profiles before and
afterwards to get a better picture of what it's doing. Rick was working
on a port.

M.

PS. schedstats is great for this kind of thing. Very useful, minimally
invasive, no impact unless configured in, and nothing measurable even
then. Hint. Hint ;-)
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Andrew Morton @ 2004-08-04 19:50 UTC
To: Martin J. Bligh; +Cc: kernel, linux-kernel, mingo, ricklind

"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> PS. schedstats is great for this kind of thing. Very useful, minimally
> invasive, no impact unless configured in, and nothing measurable even
> then. Hint. Hint ;-)

Ho hum. It's up to the hordes of scheduler hackers, really. If they want
it, and can agree upon a patch, then go wild. It should be against -mm
minus staircase, as there's a fair amount of scheduler stuff banked up
for post-2.6.8.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Rick Lindsley @ 2004-08-04 20:07 UTC
To: Andrew Morton; +Cc: Martin J. Bligh, kernel, linux-kernel, mingo

    Ho hum. It's up to the hordes of scheduler hackers, really. If they
    want it, and can agree upon a patch, then go wild. It should be
    against -mm minus staircase, as there's a fair amount of scheduler
    stuff banked up for post-2.6.8.

The patch exists for both -mm2 and -mm1, but I've been holding off
posting it until I get a chance to do more than simply compile it. Our
lab machines are back up now, so I'll test a (-mm2 - staircase) patch
this afternoon.

Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Ingo Molnar @ 2004-08-04 20:10 UTC
To: Martin J. Bligh; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley

* Martin J. Bligh <mbligh@aracnet.com> wrote:

> > hum, interesting. Can Con's changes affect the inter-node and
> > inter-cpu balancing decisions, or is this all due to caching effects,
> > reduced context switching, etc?

Martin, could you try 2.6.8-rc2-mm2 with the staircase cpu scheduler
unapplied and re-run at least part of your tests?

there are a number of NUMA improvements queued up in -mm, and it would be
nice to know what effect those cause, and what effect the staircase
scheduler has.

	Ingo
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Martin J. Bligh @ 2004-08-04 20:36 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley

--On Wednesday, August 04, 2004 22:10:19 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Martin J. Bligh <mbligh@aracnet.com> wrote:
>
>> >> SDET 16 (see disclaimer)
>> >>                 Throughput  Std. Dev
>> >> 2.6.7           100.0%      0.3%
>> >> 2.6.8-rc2        99.5%      0.3%
>> >> 2.6.8-rc2-mm2   118.5%      0.6%
>> >
>> > hum, interesting. Can Con's changes affect the inter-node and
>> > inter-cpu balancing decisions, or is this all due to caching effects,
>> > reduced context switching etc?
>
> Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
> unapplied and re-run at least part of your tests?
>
> there are a number of NUMA improvements queued up on -mm, and it would
> be nice to know what effect these cause, and what effect the staircase
> scheduler has.

Sure. I presume it's just the one patch:

    staircase-cpu-scheduler-268-rc2-mm1.patch

which seemed to back out clean and is building now. Scream if that's
not all of it ...

M.
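Backing a single patch out of a tree, as Martin describes above, amounts to applying it in reverse with GNU `patch -R`. A minimal sketch of the mechanics, using a scratch file in place of the kernel tree and the staircase patch:

```shell
# Illustrative only: a scratch file stands in for the kernel tree,
# and a generated one-line diff stands in for the staircase patch.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir a b
printf 'old line\n' > a/file.txt
printf 'new line\n' > b/file.txt
diff -u a/file.txt b/file.txt > change.patch || true  # diff exits 1 when files differ
cp a/file.txt file.txt        # the "tree" we will patch
patch -p1 < change.patch      # apply: file.txt now reads "new line"
patch -R -p1 < change.patch   # back it out, the way the staircase patch was reversed
cmp -s file.txt a/file.txt && echo "backed out clean"
```

A patch that applies and reverses without fuzz or rejects is what "backed out clean" means above; offsets or rejects would indicate other patches in the series touch the same code.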
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Ingo Molnar @ 2004-08-04 21:31 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley

* Martin J. Bligh <mbligh@aracnet.com> wrote:

> > Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
> > unapplied and re-run at least part of your tests?
> >
> > there are a number of NUMA improvements queued up on -mm, and it
> > would be nice to know what effect these cause, and what effect the
> > staircase scheduler has.
>
> Sure. I presume it's just the one patch:
>
>     staircase-cpu-scheduler-268-rc2-mm1.patch
>
> which seemed to back out clean and is building now. Scream if that's
> not all of it ...

correct, that's the end of the scheduler patch-queue and it works fine
if unapplied. (The schedstats patch i just sent applies cleanly to that
base, in case you need one.)

	Ingo
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Martin J. Bligh @ 2004-08-04 23:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, kernel, linux-kernel, Rick Lindsley

--On Wednesday, August 04, 2004 23:31:13 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Martin J. Bligh <mbligh@aracnet.com> wrote:
>
>> > Martin, could you try 2.6.8-rc2-mm2 with staircase-cpu-scheduler
>> > unapplied and re-run at least part of your tests?
>> >
>> > there are a number of NUMA improvements queued up on -mm, and it
>> > would be nice to know what effect these cause, and what effect the
>> > staircase scheduler has.
>>
>> Sure. I presume it's just the one patch:
>>
>>     staircase-cpu-scheduler-268-rc2-mm1.patch
>>
>> which seemed to back out clean and is building now. Scream if that's
>> not all of it ...
>
> correct, that's the end of the scheduler patch-queue and it works fine
> if unapplied. (The schedstats patch i just sent applies cleanly to that
> base, in case you need one.)

OK, the perf of 2.6.8-rc2-mm2 with the new sched code backed out is
exactly the same as 2.6.8-rc2 ... ie it's definitely the new sched code
that makes the improvement.

M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Peter Williams @ 2004-08-04 23:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, kernel, linux-kernel, Ingo Molnar

Andrew Morton wrote:
> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
>> SDET 8 (see disclaimer)
>>                 Throughput  Std. Dev
>> 2.6.7           100.0%      0.2%
>> 2.6.8-rc2       100.2%      1.0%
>> 2.6.8-rc2-mm2   117.4%      0.9%
>>
>> SDET 16 (see disclaimer)
>>                 Throughput  Std. Dev
>> 2.6.7           100.0%      0.3%
>> 2.6.8-rc2        99.5%      0.3%
>> 2.6.8-rc2-mm2   118.5%      0.6%
>
> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
> balancing decisions, or is this all due to caching effects, reduced
> context switching etc?

One candidate for the cause of this improvement is the replacement of
the active/expired array mechanism with a single array. I believe that
one of the shortcomings of the active/expired array mechanism is that it
can lead to excessive queuing (possibly even starvation) of tasks that
aren't considered "interactive".

> I don't expect we'll be merging a new CPU scheduler into mainline any
> time soon, but we should work to understand where this improvement came
> from, and see if we can get the mainline scheduler to catch up.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
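The queuing effect Peter describes can be sketched with a toy model (this is an illustrative simulation, not the kernel's actual scheduler code; the function name, the `bonus` parameter, and the fixed queue size are invented for the sketch). Tasks classed "interactive" are requeued straight back onto the active array after their timeslice, while everyone else moves to the expired array and runs again only once the active array drains and the arrays are swapped:

```c
#include <assert.h>

#define MAXQ 64

struct queue { int task[MAXQ]; int head, tail; };

static void push(struct queue *q, int t)    { q->task[q->tail++ % MAXQ] = t; }
static int  pop(struct queue *q)            { return q->task[q->head++ % MAXQ]; }
static int  is_empty(const struct queue *q) { return q->head == q->tail; }

/* Toy model of the dual-array design: n_inter "interactive" tasks
 * (ids 0..n_inter-1) each get requeued onto the active array up to
 * `bonus` times before finally expiring; one batch task (id n_inter)
 * always expires.  Returns how many timeslices elapse before the
 * batch task runs for the second time. */
int slices_until_second_run(int n_inter, int bonus)
{
    struct queue q1 = {{0}, 0, 0}, q2 = {{0}, 0, 0};
    struct queue *active = &q1, *expired = &q2;
    int reruns[MAXQ] = {0};
    int slices = 0, batch_runs = 0;

    for (int t = 0; t <= n_inter; t++)      /* n_inter interactive + 1 batch */
        push(active, t);

    for (;;) {
        if (is_empty(active)) {             /* active drained: swap arrays */
            struct queue *tmp = active;
            active = expired;
            expired = tmp;
        }
        int t = pop(active);
        slices++;
        if (t == n_inter && ++batch_runs == 2)
            return slices;
        if (t < n_inter && reruns[t] < bonus) {
            reruns[t]++;
            push(active, t);                /* "interactive": back onto active */
        } else {
            reruns[t] = 0;
            push(expired, t);               /* batch: waits for the swap */
        }
    }
}
```

Increasing either the number of interactive tasks or their requeue allowance stretches the gap between the batch task's runs, which is the "excessive queuing" above; a single-array design has no swap to wait for, so the worst-case wait is bounded by the queue length alone.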
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Martin J. Bligh @ 2004-08-04 23:59 UTC (permalink / raw)
To: Peter Williams, Andrew Morton, Rick Lindsley
Cc: kernel, linux-kernel, Ingo Molnar

--On Thursday, August 05, 2004 09:44:06 +1000 Peter Williams <pwil3058@bigpond.net.au> wrote:
> Andrew Morton wrote:
>> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
>>
>>> SDET 8 (see disclaimer)
>>>                 Throughput  Std. Dev
>>> 2.6.7           100.0%      0.2%
>>> 2.6.8-rc2       100.2%      1.0%
>>> 2.6.8-rc2-mm2   117.4%      0.9%
>>>
>>> SDET 16 (see disclaimer)
>>>                 Throughput  Std. Dev
>>> 2.6.7           100.0%      0.3%
>>> 2.6.8-rc2        99.5%      0.3%
>>> 2.6.8-rc2-mm2   118.5%      0.6%
>>
>> hum, interesting. Can Con's changes affect the inter-node and inter-cpu
>> balancing decisions, or is this all due to caching effects, reduced
>> context switching etc?
>
> One candidate for the cause of this improvement is the replacement of
> the active/expired array mechanism with a single array. I believe that
> one of the shortcomings of the active/expired array mechanism is that
> it can lead to excessive queuing (possibly even starvation) of tasks
> that aren't considered "interactive".

Rick showed me schedstats graphs of the two ... it seems to have lower
latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
looks better ... he'll send them out soon, I think (hint, hint).

M.
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Rick Lindsley @ 2004-08-05 5:20 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Peter Williams, Andrew Morton, kernel, linux-kernel, Ingo Molnar

    Rick showed me schedstats graphs of the two ... it seems to have
    lower latency, does less rebalancing, fewer pull_tasks, etc, etc.
    Everything looks better ... he'll send them out soon, I think
    (hint, hint).

Okay, they're done. Here's the URL of the graphs:

    http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html

General summary: as Martin reported, we're seeing improvements in a
number of areas, at least with sdet. The graphs as listed there
represent stats from four separate sdet runs run sequentially with an
increasing load. (We're trying to see if we can get the information
from each run separately, rather than the aggregate -- one of the
hazards of an automated test harness :)

Rick
* Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)
From: Ingo Molnar @ 2004-08-05 10:45 UTC (permalink / raw)
To: Rick Lindsley
Cc: Martin J. Bligh, Peter Williams, Andrew Morton, kernel, linux-kernel

* Rick Lindsley <ricklind@us.ibm.com> wrote:

> Okay, they're done. Here's the URL of the graphs:
>
> http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
>
> General summary: as Martin reported, we're seeing improvements in a
> number of areas, at least with sdet. The graphs as listed there
> represent stats from four separate sdet runs run sequentially with an
> increasing load. (We're trying to see if we can get the information
> from each run separately, rather than the aggregate -- one of the
> hazards of an automated test harness :)

really nice results! Would be interesting to see the effect of Con's
patch on other SMP/NUMA workloads as well - i'd expect to see an
improvement there too. The test was done with the default interactive=1
compute=0 setting, right?

	Ingo
Thread overview: 20+ messages
[not found] <200408092240.05287.habanero@us.ibm.com>
2004-08-10 4:08 ` 2.6.8-rc2-mm2 performance improvements (scheduler?) Andrew Theurer
2004-08-10 4:37 ` Con Kolivas
2004-08-10 15:05 ` Andrew Theurer
2004-08-10 20:57 ` Con Kolivas
2004-08-10 7:40 ` Rick Lindsley
2004-08-10 15:19 ` Andrew Theurer
2004-08-04 15:10 Martin J. Bligh
2004-08-04 15:12 ` Martin J. Bligh
2004-08-04 19:24 ` Andrew Morton
2004-08-04 19:34 ` Martin J. Bligh
2004-08-04 19:50 ` Andrew Morton
2004-08-04 20:07 ` Rick Lindsley
2004-08-04 20:10 ` Ingo Molnar
2004-08-04 20:36 ` Martin J. Bligh
2004-08-04 21:31 ` Ingo Molnar
2004-08-04 23:34 ` Martin J. Bligh
2004-08-04 23:44 ` Peter Williams
2004-08-04 23:59 ` Martin J. Bligh
2004-08-05 5:20 ` Rick Lindsley
2004-08-05 10:45 ` Ingo Molnar