* [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
@ 2016-02-18 11:11 Mel Gorman
  2016-02-18 19:43 ` Rafael J. Wysocki
  2016-02-19 11:11 ` Stephane Gasparini
  0 siblings, 2 replies; 11+ messages in thread

From: Mel Gorman @ 2016-02-18 11:11 UTC (permalink / raw)
To: Rafael Wysocki
Cc: Dirk Brandewie, Ingo Molnar, Peter Zijlstra, Matt Fleming,
	Mike Galbraith, Linux-PM, LKML, Mel Gorman

(cc'ing pm and scheduler people as the problem could be blamed on either
subsystem depending on your point of view)

The PID relies on samples of equal time, but this does not hold for
deferrable timers when the CPU is idle. intel_pstate checks whether the
actual duration between samples is large and, if so, scales the
"busyness" of the CPU.

This assumes the delay was a deferred timer, but a workload may simply
have been idle for a short time if it is context switching between a
server and client or waiting very briefly on IO. The problem is
compounded by server/clients migrating between CPUs due to wake-affine
trying to maximise hot cache usage. In such cases, the cores are not
considered busy and the frequency is dropped prematurely.

This patch increases the hold-off value before the busyness is scaled.
It was selected simply by testing until the desired result was found.
Tests were conducted with workloads that are either client/server based
or short-lived IO.
dbench4
                    4.5.0-rc2             4.5.0-rc2
                      vanilla           sample-v1r1
Hmean    mb/sec-1     309.82 (  0.00%)    327.01 (  5.55%)
Hmean    mb/sec-2     594.92 (  0.00%)    613.02 (  3.04%)
Hmean    mb/sec-4     669.17 (  0.00%)    712.27 (  6.44%)
Hmean    mb/sec-8     700.82 (  0.00%)    724.04 (  3.31%)
Hmean    mb/sec-64    425.38 (  0.00%)    448.02 (  5.32%)

              4.5.0-rc2   4.5.0-rc2
                vanilla sample-v1r1
Mean %Busy        27.28       26.81
Mean CPU%c1       42.50       44.29
Mean CPU%c3        7.16        7.14
Mean CPU%c6       23.05       21.76
Mean CPU%c7        0.00        0.00
Mean CorWatt       4.60        5.08
Mean PkgWatt       6.83        7.32

There is a fairly sizable performance boost from the modification and,
while the percentage of time spent in C1 is increased, it is not by a
substantial amount and the power usage increase is tiny.

iozone for small files and varying block sizes. Format is
IOOperation-filesize-recordsize

                               4.5.0-rc2             4.5.0-rc2
                                 vanilla           sample-v1r1
Hmean    SeqWrite-200704-1     740152.30 (  0.00%)   748432.35 (  1.12%)
Hmean    SeqWrite-200704-2    1052506.25 (  0.00%)  1169065.30 ( 11.07%)
Hmean    SeqWrite-200704-4    1450716.41 (  0.00%)  1725335.69 ( 18.93%)
Hmean    SeqWrite-200704-8    1523917.72 (  0.00%)  1881610.25 ( 23.47%)
Hmean    SeqWrite-200704-16   1572519.89 (  0.00%)  1750277.07 ( 11.30%)
Hmean    SeqWrite-200704-32   1611078.69 (  0.00%)  1923796.62 ( 19.41%)
Hmean    SeqWrite-200704-64   1656755.37 (  0.00%)  1892766.99 ( 14.25%)
Hmean    SeqWrite-200704-128  1641739.24 (  0.00%)  1952081.27 ( 18.90%)
Hmean    SeqWrite-200704-256  1660046.05 (  0.00%)  1931237.50 ( 16.34%)
Hmean    SeqWrite-200704-512  1634394.86 (  0.00%)  1860369.95 ( 13.83%)
Hmean    SeqWrite-200704-1024 1629526.38 (  0.00%)  1810320.92 ( 11.09%)
Hmean    SeqWrite-401408-1     828943.43 (  0.00%)   876152.50 (  5.70%)
Hmean    SeqWrite-401408-2    1231519.20 (  0.00%)  1368986.18 ( 11.16%)
Hmean    SeqWrite-401408-4    1724109.56 (  0.00%)  1838265.22 (  6.62%)
Hmean    SeqWrite-401408-8    1806615.84 (  0.00%)  1969611.74 (  9.02%)
Hmean    SeqWrite-401408-16   1859268.96 (  0.00%)  2003005.51 (  7.73%)
Hmean    SeqWrite-401408-32   1887759.67 (  0.00%)  2415913.37 ( 27.98%)
Hmean    SeqWrite-401408-64   1941717.11 (  0.00%)  1971929.24 (  1.56%)
Hmean    SeqWrite-401408-128  1919515.58 (  0.00%)  2127647.53 ( 10.84%)
Hmean    SeqWrite-401408-256  1908766.57 (  0.00%)  2067473.02 (  8.31%)
Hmean    SeqWrite-401408-512  1908999.37 (  0.00%)  2195587.56 ( 15.01%)
Hmean    SeqWrite-401408-1024 1912232.98 (  0.00%)  2150068.56 ( 12.44%)
Hmean    Rewrite-200704-1     1151067.57 (  0.00%)  1155309.64 (  0.37%)
Hmean    Rewrite-200704-2     1786824.53 (  0.00%)  1837093.18 (  2.81%)
Hmean    Rewrite-200704-4     2539338.19 (  0.00%)  2649019.78 (  4.32%)
Hmean    Rewrite-200704-8     2687411.53 (  0.00%)  2785202.26 (  3.64%)
Hmean    Rewrite-200704-16    2709445.97 (  0.00%)  2805580.76 (  3.55%)
Hmean    Rewrite-200704-32    2735718.43 (  0.00%)  2807532.87 (  2.63%)
Hmean    Rewrite-200704-64    2782754.97 (  0.00%)  2952024.38 (  6.08%)
Hmean    Rewrite-200704-128   2791889.73 (  0.00%)  2805048.02 (  0.47%)
Hmean    Rewrite-200704-256   2711596.34 (  0.00%)  2828896.54 (  4.33%)
Hmean    Rewrite-200704-512   2665066.25 (  0.00%)  2868058.05 (  7.62%)
Hmean    Rewrite-200704-1024  2675375.89 (  0.00%)  2685664.19 (  0.38%)
Hmean    Rewrite-401408-1     1350713.78 (  0.00%)  1358762.21 (  0.60%)
Hmean    Rewrite-401408-2     2079420.61 (  0.00%)  2097399.02 (  0.86%)
Hmean    Rewrite-401408-4     2889535.90 (  0.00%)  2912795.03 (  0.80%)
Hmean    Rewrite-401408-8     3068155.32 (  0.00%)  3090915.84 (  0.74%)
Hmean    Rewrite-401408-16    3103789.43 (  0.00%)  3162486.65 (  1.89%)
Hmean    Rewrite-401408-32    3112447.72 (  0.00%)  3243067.63 (  4.20%)
Hmean    Rewrite-401408-64    3232651.39 (  0.00%)  3227701.02 ( -0.15%)
Hmean    Rewrite-401408-128   3149556.47 (  0.00%)  3165694.24 (  0.51%)
Hmean    Rewrite-401408-256   3093348.93 (  0.00%)  3104229.97 (  0.35%)
Hmean    Rewrite-401408-512   3026305.45 (  0.00%)  3121151.02 (  3.13%)
Hmean    Rewrite-401408-1024  3005431.18 (  0.00%)  3046910.32 (  1.38%)

              4.5.0-rc2   4.5.0-rc2
                vanilla sample-v1r1
Mean %Busy         3.10        3.09
Mean CPU%c1        6.16        5.55
Mean CPU%c3        0.08        0.10
Mean CPU%c6       90.65       91.26
Mean CPU%c7        0.00        0.00
Mean CorWatt       1.71        1.74
Mean PkgWatt       3.88        3.91
Max  %Busy        16.51       16.22
Max  CPU%c1       17.03       21.99
Max  CPU%c3        2.57        2.15
Max  CPU%c6       96.39       96.31
Max  CPU%c7        0.00        0.00
Max  CorWatt       5.40        5.42
Max  PkgWatt       7.53        7.56

The other operations are omitted as they showed no performance
difference. For sequential writes and rewrites there is a massive gain
in throughput for very small files. The increase in power consumption
is negligible. It is known that the increase is not universal. Larger
core machines see a much smaller benefit, so the rate of CPU migrations
is a factor.

netperf-UDP_STREAM
                     4.5.0-rc2             4.5.0-rc2
                       vanilla           sample-v1r1
Hmean    send-64       233.96 (  0.00%)     244.76 (  4.61%)
Hmean    send-128      466.74 (  0.00%)     479.16 (  2.66%)
Hmean    send-256      929.12 (  0.00%)     964.00 (  3.75%)
Hmean    send-1024    3631.36 (  0.00%)    3781.89 (  4.15%)
Hmean    send-2048    6984.60 (  0.00%)    7169.60 (  2.65%)
Hmean    send-3312   10792.94 (  0.00%)   11103.42 (  2.88%)
Hmean    send-4096   12895.57 (  0.00%)   13112.58 (  1.68%)
Hmean    send-8192   23057.34 (  0.00%)   23443.80 (  1.68%)
Hmean    send-16384  37871.11 (  0.00%)   38292.60 (  1.11%)
Hmean    recv-64       233.89 (  0.00%)     244.71 (  4.63%)
Hmean    recv-128      466.63 (  0.00%)     479.09 (  2.67%)
Hmean    recv-256      928.88 (  0.00%)     963.74 (  3.75%)
Hmean    recv-1024    3630.54 (  0.00%)    3780.96 (  4.14%)
Hmean    recv-2048    6983.20 (  0.00%)    7167.55 (  2.64%)
Hmean    recv-3312   10790.92 (  0.00%)   11100.63 (  2.87%)
Hmean    recv-4096   12891.37 (  0.00%)   13110.35 (  1.70%)
Hmean    recv-8192   23054.79 (  0.00%)   23438.27 (  1.66%)
Hmean    recv-16384  37866.79 (  0.00%)   38283.73 (  1.10%)

              4.5.0-rc2   4.5.0-rc2
                vanilla sample-v1r1
Mean %Busy        37.30       37.10
Mean CPU%c1       37.52       37.30
Mean CPU%c3        0.10        0.10
Mean CPU%c6       25.08       25.49
Mean CPU%c7        0.00        0.00
Mean CorWatt      11.20       11.18
Mean PkgWatt      13.30       13.28
Max  %Busy        50.64       51.73
Max  CPU%c1       49.80       50.53
Max  CPU%c3        9.14        8.95
Max  CPU%c6       62.46       63.48
Max  CPU%c7        0.00        0.00
Max  CorWatt      16.46       16.44
Max  PkgWatt      18.58       18.55

In this test, the client/server are pinned to cores so scheduler
decisions are not a factor. There is still a mild performance boost
with no impact on power consumption.
cyclictest-pinned
                  4.5.0-rc2             4.5.0-rc2
                    vanilla           sample-v1r1
Amean    LatAvg     3.00 (  0.00%)      2.64 ( 11.94%)
Amean    LatMax   156.93 (  0.00%)    106.89 ( 31.89%)

              4.5.0-rc2   4.5.0-rc2
                vanilla sample-v1r1
Mean %Busy        99.74       99.73
Mean CPU%c1        0.02        0.02
Mean CPU%c3        0.00        0.01
Mean CPU%c6        0.23        0.24
Mean CPU%c7        0.00        0.00
Mean CorWatt       5.06        5.92
Mean PkgWatt       7.12        7.99
Max  %Busy       100.00      100.00
Max  CPU%c1        3.88        3.50
Max  CPU%c3        0.71        0.99
Max  CPU%c6       41.79       43.17
Max  CPU%c7        0.00        0.00
Max  CorWatt       6.80        8.66
Max  PkgWatt       8.85       10.71

This test measures how quickly a task wakes up after a timeout. The
test could be defeated by selecting a different timeout value that
falls outside the new hold-off value. Furthermore, a workload that is
very sensitive to wakeup latencies should use the performance governor.
Nevertheless, it is interesting to note the impact of increasing the
hold-off value. There is an increase in power usage because the CPU
remains active during sleep times.

In all cases, there are some CPU migrations because wakers pull wakees
to nearby CPUs. It could be argued that the workload should be pinned,
but this puts a burden on the user that may not even be possible in all
cases. The scheduler could try keeping processes on the same CPUs, but
that would impact cache hotness and cause a different class of issues.
It is inevitable that there will be some conflict between power
management and scheduling decisions, but there are some gains from
delaying idling slightly without a severe impact on power consumption.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/cpufreq/intel_pstate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cd83d477e32d..54250084174a 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 	sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
 	duration_us = ktime_us_delta(cpu->sample.time,
 				     cpu->last_sample_time);
-	if (duration_us > sample_time * 3) {
+	if (duration_us > sample_time * 12) {
 		sample_ratio = div_fp(int_tofp(sample_time),
 				      int_tofp(duration_us));
 		core_busy = mul_fp(core_busy, sample_ratio);
--
2.6.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
@ 2016-02-18 19:43 ` Rafael J. Wysocki
  2016-02-18 21:09   ` Doug Smythies
  2016-02-18 23:29   ` Pandruvada, Srinivas
  2016-02-19 11:11 ` Stephane Gasparini
  1 sibling, 2 replies; 11+ messages in thread

From: Rafael J. Wysocki @ 2016-02-18 19:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Rafael Wysocki, Dirk Brandewie, Ingo Molnar, Peter Zijlstra,
	Matt Fleming, Mike Galbraith, Linux-PM, LKML

Hi Mel,

On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
<mgorman@techsingularity.net> wrote:

[cut]

>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  drivers/cpufreq/intel_pstate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cd83d477e32d..54250084174a 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>         sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
>         duration_us = ktime_us_delta(cpu->sample.time,
>                                      cpu->last_sample_time);
> -       if (duration_us > sample_time * 3) {
> +       if (duration_us > sample_time * 12) {
>                 sample_ratio = div_fp(int_tofp(sample_time),
>                                       int_tofp(duration_us));
>                 core_busy = mul_fp(core_busy, sample_ratio);
> --

I've been considering making a change like this, but I wasn't quite
sure how much greater the multiplier should be, so I've queued this one
up for 4.6.

That said, please note that we're planning to make one significant
change to intel_pstate in the 4.6 cycle that's very likely to affect
your results.

It is currently present in linux-next (commit 402c43ed2d74 "cpufreq:
intel_pstate: Replace timers with utilization update callbacks" in the
linux-next branch of the linux-pm.git tree, which depends on commit
fe7034338ba0 "cpufreq: Add mechanism for registering utilization update
callbacks" in the same branch). Alternatively, you can just pull from
the pm-cpufreq-test branch in linux-pm.git, but that contains much more
material.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 11+ messages in thread
* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 19:43 ` Rafael J. Wysocki
@ 2016-02-18 21:09   ` Doug Smythies
  2016-02-19 10:49     ` Mel Gorman
  2016-02-23 14:04     ` Mel Gorman
  2016-02-18 23:29   ` Pandruvada, Srinivas
  1 sibling, 2 replies; 11+ messages in thread

From: Doug Smythies @ 2016-02-18 21:09 UTC (permalink / raw)
To: 'Rafael J. Wysocki', 'Mel Gorman'
Cc: 'Rafael Wysocki', 'Ingo Molnar', 'Peter Zijlstra', 'Matt Fleming',
	'Mike Galbraith', 'Linux-PM', 'LKML', 'Srinivas Pandruvada'

On 2016.02.18 Rafael J. Wysocki wrote:
> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman wrote:
>>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> ---
>>  drivers/cpufreq/intel_pstate.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> index cd83d477e32d..54250084174a 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>>         sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
>>         duration_us = ktime_us_delta(cpu->sample.time,
>>                                      cpu->last_sample_time);
>> -       if (duration_us > sample_time * 3) {
>> +       if (duration_us > sample_time * 12) {
>>                 sample_ratio = div_fp(int_tofp(sample_time),
>>                                       int_tofp(duration_us));
>>                 core_busy = mul_fp(core_busy, sample_ratio);
>> --

The immediately preceding comment needs to be changed also.

Note that with duration-related scaling only coming in at such a high
ratio, it might be worth saving the divide and just setting it to 0.

> I've been considering making a change like this, but I wasn't quite
> sure how much greater the multiplier should be, so I've queued this
> one up for 4.6.

> That said please note that we're planning to make one significant
> change to intel_pstate in the 4.6 cycle that's very likely to affect
> your results.
Rafael:

I started to test Mel's change added to your 3-patch set version 10.

I only have one data point so far. I selected the test from one of
Mel's better results (although there is no reason to expect my computer
to have best results for the same operating conditions):

Stock kernel 4.5-rc4 just for reference:
Linux s15 4.5.0-040500rc4-generic #201602141731 SMP Sun Feb 14 22:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
Output is in Kbytes/sec

    KB  reclen    write  rewrite
401408      32  1895293  3035291
_________________________________________________________________

Kernel 4.5-rc4 + jrw 3-patch set version 10 (nominal 3X duration holdoff)
Linux s15 4.5.0-rc4-rjwv10 #167 SMP Mon Feb 15 14:23:10 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
Output is in Kbytes/sec

    KB  reclen    write  rewrite
401408      32  2010558  3086354
401408      32  1945126  3127472
401408      32  1944807  3110387
401408      32  1948620  3110002
   AVE          1962278  3108554

Performance mode, for comparison:

    KB  reclen    write  rewrite
401408      32  2870111  5023311
401408      32  2869642  5149213
401408      32  2792053  5100280
401408      32  2863887  5149229
_________________________________________________________________

Kernel 4.5-rc4 + jrw 3-patch set version 10 + mg 12X duration hold-off
Linux s15 4.5.0-rc4-rjwv10-12 #169 SMP Thu Feb 18 08:15:33 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
Output is in Kbytes/sec

    KB  reclen    write  rewrite
401408      32  1989670  3100580
401408      32  2062291  3112463
401408      32  2107637  3233567
401408      32  2111772  3340610
   AVE          2067843  3196805
Gain versus 3X      5.4%     2.8%
_________________________________________________________________

Mel: Did you observe any downside conditions?
For example, here are some trace samples taken from my computer:

Duration kick-in = 3X:
	Core busy = 101
	Current pstate = 16
	Load = 2.2%
	Duration = 43.815 mSec
	Scaled busy = 48
	Next pstate = 16 (= minimum for my computer)

If duration kick-in = 12X, then:
	Scaled busy = 214
	Next pstate = 38 (= max turbo for my computer)

Note: I do NOT have an operational example where it matters in terms of
energy use or whatever. I am just suggesting that we look.

... Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 21:09 ` Doug Smythies
@ 2016-02-19 10:49   ` Mel Gorman
  2016-02-23 14:04   ` Mel Gorman
  1 sibling, 0 replies; 11+ messages in thread

From: Mel Gorman @ 2016-02-19 10:49 UTC (permalink / raw)
To: Doug Smythies
Cc: 'Rafael J. Wysocki', 'Rafael Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Matt Fleming', 'Mike Galbraith', 'Linux-PM',
	'LKML', 'Srinivas Pandruvada'

On Thu, Feb 18, 2016 at 01:09:26PM -0800, Doug Smythies wrote:
> On 2016.02.18 Rafael J. Wysocki wrote:
> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman wrote:
> >>
> >> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >> ---
> >>  drivers/cpufreq/intel_pstate.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> >> index cd83d477e32d..54250084174a 100644
> >> --- a/drivers/cpufreq/intel_pstate.c
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> >>         sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> >>         duration_us = ktime_us_delta(cpu->sample.time,
> >>                                      cpu->last_sample_time);
> >> -       if (duration_us > sample_time * 3) {
> >> +       if (duration_us > sample_time * 12) {
> >>                 sample_ratio = div_fp(int_tofp(sample_time),
> >>                                       int_tofp(duration_us));
> >>                 core_busy = mul_fp(core_busy, sample_ratio);
> >> --
>
> The immediately preceding comment needs to be changed also.

Yes, it does. Thanks.

> Note that with duration related scaling only coming in at such a high
> ratio it might be worth saving the divide and just setting it to 0.
>

That sounds reasonable. I've queued up a test based on this as well as
tests with the linux-next branch from linux-pm to see what falls out.

> > I've been considering making a change like this, but I wasn't quite
> > sure how much greater the multiplier should be, so I've queued this
> > one up for 4.6.
> >
> > That said please note that we're planning to make one significant
> > change to intel_pstate in the 4.6 cycle that's very likely to affect
> > your results.
>
> Rafael:
>
> I started to test Mel's change added to your 3 patch set version 10.
>
> I only have one data point so far, I selected the test I did from one
> of Mel's better results (although there is no reason to expect my
> computer to have best results for the same operating conditions):

It's a reasonable expectation.

> [cut]
> _________________________________________________________________
>
> Mel: Did you observe any downside conditions?

Not so far, but my expectation is that any downside would be power
consumption related. At worst, I expect the patch to have little or no
performance impact in cases where there are a lot of cores, a lot of
migration, and the CPU core is idle longer than the new hold-off period.
For power consumption, I'm relying entirely on the output of turbostat
to tell me if there are problems, which may or may not be sufficient.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 21:09 ` Doug Smythies
  2016-02-19 10:49   ` Mel Gorman
@ 2016-02-23 14:04   ` Mel Gorman
  1 sibling, 0 replies; 11+ messages in thread

From: Mel Gorman @ 2016-02-23 14:04 UTC (permalink / raw)
To: Doug Smythies
Cc: 'Rafael J. Wysocki', 'Rafael Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Matt Fleming', 'Mike Galbraith', 'Linux-PM',
	'LKML', 'Srinivas Pandruvada'

On Thu, Feb 18, 2016 at 01:09:26PM -0800, Doug Smythies wrote:
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> >>         sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> >>         duration_us = ktime_us_delta(cpu->sample.time,
> >>                                      cpu->last_sample_time);
> >> -       if (duration_us > sample_time * 3) {
> >> +       if (duration_us > sample_time * 12) {
> >>                 sample_ratio = div_fp(int_tofp(sample_time),
> >>                                       int_tofp(duration_us));
> >>                 core_busy = mul_fp(core_busy, sample_ratio);
> >> --
>
> The immediately preceding comment needs to be changed also.
> Note that with duration related scaling only coming in at such a high
> ratio it might be worth saving the divide and just setting it to 0.

I tried this and, FWIW, the performance is generally comparable, as is
the power usage as reported by turbostat. On occasion, depending on the
machine, the system CPU usage is noticeably lower.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 19:43 ` Rafael J. Wysocki
  2016-02-18 21:09   ` Doug Smythies
@ 2016-02-18 23:29   ` Pandruvada, Srinivas
  2016-02-18 23:33     ` Rafael J. Wysocki
  1 sibling, 1 reply; 11+ messages in thread

From: Pandruvada, Srinivas @ 2016-02-18 23:29 UTC (permalink / raw)
To: mgorman@techsingularity.net, rafael@kernel.org
Cc: matt@codeblueprint.co.uk, mingo@kernel.org, peterz@infradead.org,
	Brandewie, Dirk J, linux-kernel@vger.kernel.org,
	linux-pm@vger.kernel.org, rjw@rjwysocki.net,
	umgwanakikbuti@gmail.com

On Thu, 2016-02-18 at 20:43 +0100, Rafael J. Wysocki wrote:
> Hi Mel,
>
> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
> <mgorman@techsingularity.net> wrote:
>
> [cut]
>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  drivers/cpufreq/intel_pstate.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index cd83d477e32d..54250084174a 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -999,7 +999,7 @@ static inline int32_t
> > get_target_pstate_use_performance(struct cpudata *cpu)
> >         sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> >         duration_us = ktime_us_delta(cpu->sample.time,
> >                                      cpu->last_sample_time);
> > -       if (duration_us > sample_time * 3) {
> > +       if (duration_us > sample_time * 12) {
> >                 sample_ratio = div_fp(int_tofp(sample_time),
> >                                       int_tofp(duration_us));
> >                 core_busy = mul_fp(core_busy, sample_ratio);
> > --
>
> I've been considering making a change like this, but I wasn't quite
> sure how much greater the multiplier should be, so I've queued this
> one up for 4.6.
>
We need to test the power impact on different server workloads, so
please hold on. We have server folks complaining that we already
consume too much power.

Thanks,
Srinivas

> That said please note that we're planning to make one significant
> change to intel_pstate in the 4.6 cycle that's very likely to affect
> your results.
>
> It is currently present in linux-next (commit 402c43ed2d74 "cpufreq:
> intel_pstate: Replace timers with utilization update callbacks" in
> the linux-next branch of the linux-pm.git tree, that depends on commit
> fe7034338ba0 "cpufreq: Add mechanism for registering utilization
> update callbacks" in the same branch). Also you can just pull from
> the pm-cpufreq-test branch in linux-pm.git, but that contains much
> more material.
>
> Thanks,
> Rafael
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 23:29 ` Pandruvada, Srinivas
@ 2016-02-18 23:33   ` Rafael J. Wysocki
  0 siblings, 0 replies; 11+ messages in thread

From: Rafael J. Wysocki @ 2016-02-18 23:33 UTC (permalink / raw)
To: Pandruvada, Srinivas
Cc: mgorman@techsingularity.net, rafael@kernel.org,
	matt@codeblueprint.co.uk, mingo@kernel.org, peterz@infradead.org,
	Brandewie, Dirk J, linux-kernel@vger.kernel.org,
	linux-pm@vger.kernel.org, rjw@rjwysocki.net,
	umgwanakikbuti@gmail.com

On Fri, Feb 19, 2016 at 12:29 AM, Pandruvada, Srinivas
<srinivas.pandruvada@intel.com> wrote:
> On Thu, 2016-02-18 at 20:43 +0100, Rafael J. Wysocki wrote:
>> Hi Mel,
>>
>> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
>> <mgorman@techsingularity.net> wrote:
>>
>> [cut]
>>
>> I've been considering making a change like this, but I wasn't quite
>> sure how much greater the multiplier should be, so I've queued this
>> one up for 4.6.
>>
> We need to test power impact on different server workloads. So please
> hold on.
> We have server folks complaining that we already consume too much
> power.

I'll drop the commit if it turns out to cause too much energy to be
consumed.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
  2016-02-18 19:43 ` Rafael J. Wysocki
@ 2016-02-19 11:11 ` Stephane Gasparini
  2016-02-19 16:38   ` Doug Smythies
  1 sibling, 1 reply; 11+ messages in thread

From: Stephane Gasparini @ 2016-02-19 11:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Rafael Wysocki, Dirk Brandewie, Ingo Molnar, Peter Zijlstra,
	Matt Fleming, Mike Galbraith, Linux-PM, LKML

[-- Attachment #1: Type: text/plain, Size: 14996 bytes --]

The issue you are reporting looks like one we improved on Android by
using the average pstate instead of the last requested pstate. We know
that this improves ffmpeg encoding performance when using the load
algorithm; see the patch attached.

This patch is only applied to get_target_pstate_use_cpu_load; however,
you can give it a try on get_target_pstate_use_performance.

IPLoad+Avg-Pstate vs IPLoad:

Benchmark                 ∆Perf    ∆Power
SmartBench-Gaming         -0.1%    -10.4%
SmartBench-Productivity   -0.8%    -10.4%
CandyCrush                  n/a    -17.4%
AngryBirds                  n/a     -5.9%
videoPlayback               n/a    -13.9%
audioPlayback               n/a     -4.9%
IcyRocks-0-0               0.0%     -4.0%
IcyRocks-20-50             0.0%    -38.4%
IcyRocks-40-100            0.1%     -2.8%
IcyRocks-60-150            1.4%     -0.6%
IcyRocks-80-200            2.9%      0.7%
IcyRocks-100-250           1.1%      0.4%
iozone RR                 -2.7%     -4.2%
iozone RW                 -8.8%     -4.2%
Drystone                  -0.2%     -0.8%
Coremark                   0.5%      0.2%

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cd83d47..6ba8cab 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -908,8 +908,6 @@ static inline void intel_pstate_sample(struct cpudata *cpu)
 	cpu->sample.mperf -= cpu->prev_mperf;
 	cpu->sample.tsc -= cpu->prev_tsc;
 
-	intel_pstate_calc_busy(cpu);
-
 	cpu->prev_aperf = aperf;
 	cpu->prev_mperf = mperf;
 	cpu->prev_tsc = tsc;
@@ -931,6 +929,12 @@ static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
 	mod_timer_pinned(&cpu->timer, jiffies + delay);
 }
 
+static inline int32_t get_avg_pstate(struct cpudata *cpu)
+{
+	return div64_u64(cpu->pstate.max_pstate * cpu->sample.aperf,
+			 cpu->sample.mperf);
+}
+
 static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
@@ -964,7 +968,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 	cpu_load = div64_u64(int_tofp(100) * mperf, sample->tsc);
 	cpu->sample.busy_scaled = cpu_load;
 
-	return cpu->pstate.current_pstate - pid_calc(&cpu->pid, cpu_load);
+	return get_avg_pstate(cpu) - pid_calc(&cpu->pid, cpu_load);
 }
 
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
@@ -973,6 +977,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 	s64 duration_us;
 	u32 sample_time;
 
+	intel_pstate_calc_busy(cpu);
 	/*
 	 * core_busy is the ratio of actual performance to max
 	 * max_pstate is the max non turbo pstate available

—
Steph

> On Feb 18, 2016, at 12:11 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> (cc'ing pm and scheduler people as the problem could be blamed on either
> subsystem depending on your point of view)
>
> The PID relies on samples of equal time but this does not apply for
> deferrable timers when the CPU is idle. intel_pstate checks if the actual
> duration between samples is large and if so, the "busyness" of the CPU
> is scaled.
>
> This assumes the delay was a deferred timer but a workload may simply have
> been idle for a short time if it's context switching between a server and
> client or waiting very briefly on IO. It's compounded by the problem that
> server/clients migrate between CPUs due to wake-affine trying to maximise
> hot cache usage. In such cases, the cores are not considered busy and the
> frequency is dropped prematurely.
> > This patch increases the hold-off value before the busyness is scaled. It > was selected based simply on testing until the desired result was found. > Tests were conducted with workloads that are either client/server based > or short-lived IO. > > dbench4 > 4.5.0-rc2 4.5.0-rc2 > vanilla sample-v1r1 > Hmean mb/sec-1 309.82 ( 0.00%) 327.01 ( 5.55%) > Hmean mb/sec-2 594.92 ( 0.00%) 613.02 ( 3.04%) > Hmean mb/sec-4 669.17 ( 0.00%) 712.27 ( 6.44%) > Hmean mb/sec-8 700.82 ( 0.00%) 724.04 ( 3.31%) > Hmean mb/sec-64 425.38 ( 0.00%) 448.02 ( 5.32%) > > 4.5.0-rc2 4.5.0-rc2 > vanilla sample-v1r1 > Mean %Busy 27.28 26.81 > Mean CPU%c1 42.50 44.29 > Mean CPU%c3 7.16 7.14 > Mean CPU%c6 23.05 21.76 > Mean CPU%c7 0.00 0.00 > Mean CorWatt 4.60 5.08 > Mean PkgWatt 6.83 7.32 > > There is fairly sizable performance boost from the modification and while > the percentage of time spent in C1 is increased, it is not by a substantial > amount and the power usage increase is tiny. > > iozone for small files and varying block sizes. 
> Format is IOOperation-filessize-recordsize
> 
>                                      4.5.0-rc2             4.5.0-rc2
>                                        vanilla           sample-v1r1
> Hmean    SeqWrite-200704-1       740152.30 (  0.00%)   748432.35 (  1.12%)
> Hmean    SeqWrite-200704-2      1052506.25 (  0.00%)  1169065.30 ( 11.07%)
> Hmean    SeqWrite-200704-4      1450716.41 (  0.00%)  1725335.69 ( 18.93%)
> Hmean    SeqWrite-200704-8      1523917.72 (  0.00%)  1881610.25 ( 23.47%)
> Hmean    SeqWrite-200704-16     1572519.89 (  0.00%)  1750277.07 ( 11.30%)
> Hmean    SeqWrite-200704-32     1611078.69 (  0.00%)  1923796.62 ( 19.41%)
> Hmean    SeqWrite-200704-64     1656755.37 (  0.00%)  1892766.99 ( 14.25%)
> Hmean    SeqWrite-200704-128    1641739.24 (  0.00%)  1952081.27 ( 18.90%)
> Hmean    SeqWrite-200704-256    1660046.05 (  0.00%)  1931237.50 ( 16.34%)
> Hmean    SeqWrite-200704-512    1634394.86 (  0.00%)  1860369.95 ( 13.83%)
> Hmean    SeqWrite-200704-1024   1629526.38 (  0.00%)  1810320.92 ( 11.09%)
> Hmean    SeqWrite-401408-1       828943.43 (  0.00%)   876152.50 (  5.70%)
> Hmean    SeqWrite-401408-2      1231519.20 (  0.00%)  1368986.18 ( 11.16%)
> Hmean    SeqWrite-401408-4      1724109.56 (  0.00%)  1838265.22 (  6.62%)
> Hmean    SeqWrite-401408-8      1806615.84 (  0.00%)  1969611.74 (  9.02%)
> Hmean    SeqWrite-401408-16     1859268.96 (  0.00%)  2003005.51 (  7.73%)
> Hmean    SeqWrite-401408-32     1887759.67 (  0.00%)  2415913.37 ( 27.98%)
> Hmean    SeqWrite-401408-64     1941717.11 (  0.00%)  1971929.24 (  1.56%)
> Hmean    SeqWrite-401408-128    1919515.58 (  0.00%)  2127647.53 ( 10.84%)
> Hmean    SeqWrite-401408-256    1908766.57 (  0.00%)  2067473.02 (  8.31%)
> Hmean    SeqWrite-401408-512    1908999.37 (  0.00%)  2195587.56 ( 15.01%)
> Hmean    SeqWrite-401408-1024   1912232.98 (  0.00%)  2150068.56 ( 12.44%)
> Hmean    Rewrite-200704-1       1151067.57 (  0.00%)  1155309.64 (  0.37%)
> Hmean    Rewrite-200704-2       1786824.53 (  0.00%)  1837093.18 (  2.81%)
> Hmean    Rewrite-200704-4       2539338.19 (  0.00%)  2649019.78 (  4.32%)
> Hmean    Rewrite-200704-8       2687411.53 (  0.00%)  2785202.26 (  3.64%)
> Hmean    Rewrite-200704-16      2709445.97 (  0.00%)  2805580.76 (  3.55%)
> Hmean    Rewrite-200704-32      2735718.43 (  0.00%)  2807532.87 (  2.63%)
> Hmean    Rewrite-200704-64      2782754.97 (  0.00%)  2952024.38 (  6.08%)
> Hmean    Rewrite-200704-128     2791889.73 (  0.00%)  2805048.02 (  0.47%)
> Hmean    Rewrite-200704-256     2711596.34 (  0.00%)  2828896.54 (  4.33%)
> Hmean    Rewrite-200704-512     2665066.25 (  0.00%)  2868058.05 (  7.62%)
> Hmean    Rewrite-200704-1024    2675375.89 (  0.00%)  2685664.19 (  0.38%)
> Hmean    Rewrite-401408-1       1350713.78 (  0.00%)  1358762.21 (  0.60%)
> Hmean    Rewrite-401408-2       2079420.61 (  0.00%)  2097399.02 (  0.86%)
> Hmean    Rewrite-401408-4       2889535.90 (  0.00%)  2912795.03 (  0.80%)
> Hmean    Rewrite-401408-8       3068155.32 (  0.00%)  3090915.84 (  0.74%)
> Hmean    Rewrite-401408-16      3103789.43 (  0.00%)  3162486.65 (  1.89%)
> Hmean    Rewrite-401408-32      3112447.72 (  0.00%)  3243067.63 (  4.20%)
> Hmean    Rewrite-401408-64      3232651.39 (  0.00%)  3227701.02 ( -0.15%)
> Hmean    Rewrite-401408-128     3149556.47 (  0.00%)  3165694.24 (  0.51%)
> Hmean    Rewrite-401408-256     3093348.93 (  0.00%)  3104229.97 (  0.35%)
> Hmean    Rewrite-401408-512     3026305.45 (  0.00%)  3121151.02 (  3.13%)
> Hmean    Rewrite-401408-1024    3005431.18 (  0.00%)  3046910.32 (  1.38%)
> 
>                 4.5.0-rc2   4.5.0-rc2
>                   vanilla sample-v1r1
> Mean %Busy           3.10        3.09
> Mean CPU%c1          6.16        5.55
> Mean CPU%c3          0.08        0.10
> Mean CPU%c6         90.65       91.26
> Mean CPU%c7          0.00        0.00
> Mean CorWatt         1.71        1.74
> Mean PkgWatt         3.88        3.91
> Max  %Busy          16.51       16.22
> Max  CPU%c1         17.03       21.99
> Max  CPU%c3          2.57        2.15
> Max  CPU%c6         96.39       96.31
> Max  CPU%c7          0.00        0.00
> Max  CorWatt         5.40        5.42
> Max  PkgWatt         7.53        7.56
> 
> The other operations are omitted as they showed no performance difference.
> For sequential writes and rewrites there is a massive gain in throughput
> for very small files. The increase in power consumption is negligible.
> It is known that the increase is not universal. Larger core machines see
> a much smaller benefit so the rate of CPU migrations is a factor.
> 
> netperf-UDP_STREAM
> 
>                       4.5.0-rc2             4.5.0-rc2
>                         vanilla           sample-v1r1
> Hmean    send-64       233.96 (  0.00%)     244.76 (  4.61%)
> Hmean    send-128      466.74 (  0.00%)     479.16 (  2.66%)
> Hmean    send-256      929.12 (  0.00%)     964.00 (  3.75%)
> Hmean    send-1024    3631.36 (  0.00%)    3781.89 (  4.15%)
> Hmean    send-2048    6984.60 (  0.00%)    7169.60 (  2.65%)
> Hmean    send-3312   10792.94 (  0.00%)   11103.42 (  2.88%)
> Hmean    send-4096   12895.57 (  0.00%)   13112.58 (  1.68%)
> Hmean    send-8192   23057.34 (  0.00%)   23443.80 (  1.68%)
> Hmean    send-16384  37871.11 (  0.00%)   38292.60 (  1.11%)
> Hmean    recv-64       233.89 (  0.00%)     244.71 (  4.63%)
> Hmean    recv-128      466.63 (  0.00%)     479.09 (  2.67%)
> Hmean    recv-256      928.88 (  0.00%)     963.74 (  3.75%)
> Hmean    recv-1024    3630.54 (  0.00%)    3780.96 (  4.14%)
> Hmean    recv-2048    6983.20 (  0.00%)    7167.55 (  2.64%)
> Hmean    recv-3312   10790.92 (  0.00%)   11100.63 (  2.87%)
> Hmean    recv-4096   12891.37 (  0.00%)   13110.35 (  1.70%)
> Hmean    recv-8192   23054.79 (  0.00%)   23438.27 (  1.66%)
> Hmean    recv-16384  37866.79 (  0.00%)   38283.73 (  1.10%)
> 
>                 4.5.0-rc2   4.5.0-rc2
>                   vanilla sample-v1r1
> Mean %Busy          37.30       37.10
> Mean CPU%c1         37.52       37.30
> Mean CPU%c3          0.10        0.10
> Mean CPU%c6         25.08       25.49
> Mean CPU%c7          0.00        0.00
> Mean CorWatt        11.20       11.18
> Mean PkgWatt        13.30       13.28
> Max  %Busy          50.64       51.73
> Max  CPU%c1         49.80       50.53
> Max  CPU%c3          9.14        8.95
> Max  CPU%c6         62.46       63.48
> Max  CPU%c7          0.00        0.00
> Max  CorWatt        16.46       16.44
> Max  PkgWatt        18.58       18.55
> 
> In this test, the client/server are pinned to cores so the scheduler
> decisions are not a factor. There is still a mild performance boost
> with no impact on power consumption.
> 
> cyclictest-pinned
>                     4.5.0-rc2             4.5.0-rc2
>                       vanilla           sample-v1r1
> Amean    LatAvg       3.00 (  0.00%)       2.64 ( 11.94%)
> Amean    LatMax     156.93 (  0.00%)     106.89 ( 31.89%)
> 
>                 4.5.0-rc2   4.5.0-rc2
>                   vanilla sample-v1r1
> Mean %Busy          99.74       99.73
> Mean CPU%c1          0.02        0.02
> Mean CPU%c3          0.00        0.01
> Mean CPU%c6          0.23        0.24
> Mean CPU%c7          0.00        0.00
> Mean CorWatt         5.06        5.92
> Mean PkgWatt         7.12        7.99
> Max  %Busy         100.00      100.00
> Max  CPU%c1          3.88        3.50
> Max  CPU%c3          0.71        0.99
> Max  CPU%c6         41.79       43.17
> Max  CPU%c7          0.00        0.00
> Max  CorWatt         6.80        8.66
> Max  PkgWatt         8.85       10.71
> 
> This test measures how quickly a task wakes up after a timeout. The test
> could be defeated by selecting a different timeout value that is outside
> the new hold-off value. Furthermore, a workload that is very sensitive to
> wakeup latencies should use the performance governor. Nevertheless it's
> interesting to note the impact of increasing the hold-off value. There is
> an increase in power usage because the CPU remains active during sleep times.
> 
> In all cases, there are some CPU migrations because wakers pull wakees to
> nearby CPUs. It could be argued that the workload should be pinned but this
> puts a burden on the user that may not even be possible in all cases. The
> scheduler could try keeping processes on the same CPUs but that would impact
> cache hotness and cause a different class of issues. It is inevitable that
> there will be some conflict between power management and scheduling decisions
> but there are gains from delaying idling slightly without a severe impact
> on power consumption.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  drivers/cpufreq/intel_pstate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cd83d477e32d..54250084174a 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>  	sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
>  	duration_us = ktime_us_delta(cpu->sample.time,
>  				     cpu->last_sample_time);
> -	if (duration_us > sample_time * 3) {
> +	if (duration_us > sample_time * 12) {
>  		sample_ratio = div_fp(int_tofp(sample_time),
>  				      int_tofp(duration_us));
>  		core_busy = mul_fp(core_busy, sample_ratio);
> -- 
> 2.6.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: [linux-power-mgmt] [PATCH 1_3] cpufreq: intel_pstate: Use avg_pstate instead of current_pstate --]
[-- Type: application/octet-stream, Size: 6382 bytes --]
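The effect of Mel's one-line change can be sketched in plain Python. This is a simplified floating-point model of the kernel's fixed-point hold-off logic, not kernel code; the busyness and duration values are illustrative, taken from the 98.293ms sample discussed later in the thread.

```python
def scaled_core_busy(core_busy: float, duration_us: float,
                     sample_time_us: float, holdoff: int) -> float:
    """Scale busyness down only when the gap between samples exceeds
    holdoff * sample_time, i.e. the timer was presumably deferred."""
    if duration_us > sample_time_us * holdoff:
        return core_busy * (sample_time_us / duration_us)
    return core_busy

SAMPLE_TIME_US = 10 * 1000  # pid_params.sample_rate_ms * USEC_PER_MSEC

# A 98.293ms gap (e.g. a briefly idle client/server workload): the old
# 3x hold-off scales busyness down by roughly a factor of ten...
assert round(scaled_core_busy(105, 98293, SAMPLE_TIME_US, 3), 1) == 10.7
# ...while the new 12x hold-off (120ms threshold) leaves it untouched.
assert scaled_core_busy(105, 98293, SAMPLE_TIME_US, 12) == 105
```

The sketch shows why the patch changes behaviour only for gaps between 30ms and 120ms: shorter gaps were never scaled, and longer ones still are.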
* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-19 11:11 ` Stephane Gasparini
@ 2016-02-19 16:38   ` Doug Smythies
  2016-02-24 16:19   ` Stephane Gasparini
  0 siblings, 1 reply; 11+ messages in thread

From: Doug Smythies @ 2016-02-19 16:38 UTC (permalink / raw)
To: 'Stephane Gasparini', 'Mel Gorman'
Cc: 'Rafael Wysocki', 'Ingo Molnar', 'Peter Zijlstra', 'Matt Fleming',
	'Mike Galbraith', 'Linux-PM', 'LKML', 'Srinivas Pandruvada'

Hi Steph,

On 2016.02.19 03:12 Stephane Gasparini wrote:
>
> The issue you are reporting looks like one we improved on android by using
> the average pstate instead of using the last requested pstate
>
> We know that this is improving the ffmpeg encoding performance when using the
> load algorithm.
>
> see patch attached
>
> This patch is only applied on get_target_pstate_use_cpu_load however you can give
> it a try on get_target_pstate_use_performance

Yes, that type of patch works on the load based approach.

However, I do not think it works on the performance based approach. Why not?
Well, and if I understand correctly, follow the math and you end up with:

scaled_busy = 100%

scaled_busy = (aperf * 100% / mperf) * (max_pstate / ((aperf * max_pstate) / mperf))

... Doug
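Doug's algebra can be checked numerically: whatever the aperf/mperf ratio, substituting the average P-state into the performance algorithm's scaling makes scaled_busy collapse to a constant 100%. A quick illustrative check, not kernel code (the kernel uses fixed-point arithmetic; the aperf/mperf pairs here are arbitrary):

```python
def scaled_busy_with_avg_pstate(max_pstate, aperf, mperf):
    core_busy = aperf * 100.0 / mperf         # actual vs reference performance
    avg_pstate = aperf * max_pstate / mperf   # average P-state (exact division)
    return core_busy * (max_pstate / avg_pstate)

# Independent of load, the result is always 100%, so the PID would see
# a constant input and could never react to changing utilisation.
for aperf, mperf in [(55097, 52039), (1000, 90000), (123456, 123456)]:
    assert abs(scaled_busy_with_avg_pstate(34, aperf, mperf) - 100.0) < 1e-9
```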
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-19 16:38 ` Doug Smythies
@ 2016-02-24 16:19   ` Stephane Gasparini
  2016-02-25 19:51   ` Doug Smythies
  0 siblings, 1 reply; 11+ messages in thread

From: Stephane Gasparini @ 2016-02-24 16:19 UTC (permalink / raw)
To: Doug Smythies
Cc: Mel Gorman, Rafael Wysocki, Ingo Molnar, Peter Zijlstra, Matt Fleming,
	Mike Galbraith, Linux-PM, LKML, Srinivas Pandruvada

Hi Doug

> On Feb 19, 2016, at 5:38 PM, Doug Smythies <dsmythies@telus.net> wrote:
> 
> Hi Steph,
> 
> On 2016.02.19 03:12 Stephane Gasparini wrote:
>> 
>> The issue you are reporting looks like one we improved on android by using
>> the average pstate instead of using the last requested pstate
>> 
>> We know that this is improving the ffmpeg encoding performance when using the
>> load algorithm.
>> 
>> see patch attached
>> 
>> This patch is only applied on get_target_pstate_use_cpu_load however you can give
>> it a try on get_target_pstate_use_performance
> 
> Yes, that type of patch works on the load based approach.
> 

I'm not talking about using the average p-state in the scaled_busy computation.
I'm talking about adding the output of the PID (the number of p-states to add
or subtract) to the average p-state rather than adding it to the current
p-state.

The current p-state does not, in some situations, reflect reality, as the
current p-state can be imposed by a "linked CPU". This is the case when there
is a thread migration onto a "linked CPU" that was not loaded. Its current
p-state will be low while its average p-state will reflect the activity of
the "linked CPU".

I will not claim this is a perfect solution, but this, combined with the
topology awareness of the scheduler, helps to take better decisions.

> However, I do not think it works on the performance based approach. Why not?
> Well, and if I understand correctly, follow the math and you end up with:
> 
> scaled_busy = 100%
> 
> scaled_busy = (aperf * 100% / mperf) * (max_pstate / ((aperf * max_pstate) / mperf))
> 
> ... Doug
> 

— Steph
* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-24 16:19 ` Stephane Gasparini
@ 2016-02-25 19:51   ` Doug Smythies
  0 siblings, 0 replies; 11+ messages in thread

From: Doug Smythies @ 2016-02-25 19:51 UTC (permalink / raw)
To: 'Stephane Gasparini'
Cc: 'Mel Gorman', 'Rafael Wysocki', 'Ingo Molnar', 'Peter Zijlstra',
	'Matt Fleming', 'Mike Galbraith', 'Linux-PM', 'LKML',
	'Srinivas Pandruvada'

Hi Steph,

On 2016.02.24 08:20 Stephane Gasparini wrote:
>> On Feb 19, 2016, at 5:38 PM, Doug Smythies <dsmythies@telus.net> wrote:
>>> On 2016.02.19 03:12 Stephane Gasparini wrote:
>>> 
>>> The issue you are reporting looks like one we improved on android by using
>>> the average pstate instead of using the last requested pstate
>>> 
>>> We know that this is improving the ffmpeg encoding performance when using the
>>> load algorithm.
>>> 
>>> see patch attached
>>> 
>>> This patch is only applied on get_target_pstate_use_cpu_load however you can give
>>> it a try on get_target_pstate_use_performance
>> 
>> Yes, that type of patch works on the load based approach.
> 
> I'm not talking about using the average p-state in the scaled_busy computation.
> I'm talking about adding the output of the PID (the number of p-states to add
> or subtract) to the average p-state rather than adding it to the current p-state.

For the situation we are dealing with here, that would actually make it
worse, wouldn't it?
Let's work through a real very low load example from the Mel V2 patch where
the target pstate is increased whereas it should have been decreased:

Mel patch version 2 (12X hold off added to rjw 3 patch v10 set added to
kernel 4.5-rc4):

CPU: 3  Core busy: 105  Scaled busy: 143  Old pstate: 25  New pstate: 34
mperf: 52039  aperf: 55097  tsc: 335265689  freq: 3599750 KHz
Load: 0.02%  Duration (mS): 98.293

New pstate = old pstate + (scaled_busy - setpoint) * p_gain
           = 25 + (143 - 97) * 0.2
           = 34 (as above)

Ave pstate = max_pstate * aperf / mperf
           = 34 * 55097 / 52039
           = 36

Steph average pstate method added to the above:

New pstate = ave pstate + (scaled_busy - setpoint) * p_gain
           = 36 + (143 - 97) * 0.2
           = 45 (before clamping)

Now, just for completeness, show the no-Mel-patch math:

Scaled busy = Core busy * max_pstate / old pstate * sample time / duration
            = 105 * 34 / 25 * 10 / 98.293
            = 14.53

New pstate = old pstate + (scaled_busy - setpoint) * p_gain
           = 25 + (14.53 - 97) * 0.2
           = 8.5
           = 16 clamped minimum

Regardless, I coded the average pstate method and observe little difference
between it and the Mel V2 patch with limited testing.

... Doug
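Doug's arithmetic above can be reproduced in a few lines of Python. This is a sanity check of the worked example, not kernel fixed-point code; it uses round() where Doug rounds, and the setpoint (97) and proportional gain (0.2) are the values quoted in his calculation.

```python
SETPOINT, P_GAIN = 97, 0.2
core_busy, scaled_busy, old_pstate, max_pstate = 105, 143, 25, 34
aperf, mperf = 55097, 52039
duration_ms, sample_ms = 98.293, 10

# Mel V2 (12x hold-off): a 98.293ms gap is under 120ms, so busyness
# is NOT rescaled and the PID raises the P-state despite 0.02% load.
new_pstate = int(old_pstate + (scaled_busy - SETPOINT) * P_GAIN)
assert new_pstate == 34

# Stephane's method on top: start from the average P-state instead,
# which pushes the target even higher in this low-load case.
avg_pstate = round(max_pstate * aperf / mperf)
assert avg_pstate == 36
assert int(avg_pstate + (scaled_busy - SETPOINT) * P_GAIN) == 45

# Without Mel's patch: the 3x hold-off rescales busyness by 10/98.293,
# and the PID output falls below the minimum P-state (clamped to 16).
rescaled = core_busy * max_pstate / old_pstate * sample_ms / duration_ms
assert round(rescaled, 2) == 14.53
assert old_pstate + (rescaled - SETPOINT) * P_GAIN < 16
```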
end of thread, other threads:[~2016-02-25 19:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
2016-02-18 19:43 ` Rafael J. Wysocki
2016-02-18 21:09   ` Doug Smythies
2016-02-19 10:49     ` Mel Gorman
2016-02-23 14:04       ` Mel Gorman
2016-02-18 23:29 ` Pandruvada, Srinivas
2016-02-18 23:33   ` Rafael J. Wysocki
2016-02-19 11:11 ` Stephane Gasparini
2016-02-19 16:38   ` Doug Smythies
2016-02-24 16:19     ` Stephane Gasparini
2016-02-25 19:51       ` Doug Smythies