public inbox for linux-kernel@vger.kernel.org
From: Peter Williams <pwil3058@bigpond.net.au>
To: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
	Kirill Korotaev <dev@openvz.org>, Srivatsa <vatsa@in.ibm.com>,
	CKRM <ckrm-tech@lists.sourceforge.net>,
	Balbir Singh <bsingharora@gmail.com>,
	Mike Galbraith <efault@gmx.de>, Con Kolivas <kernel@kolivas.org>,
	Sam Vilain <sam@vilain.net>,
	Kingsley Cheung <kingsley@aurema.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Ingo Molnar <mingo@elte.hu>,
	Rene Herman <rene.herman@keyaccess.nl>
Subject: Re: [ckrm-tech] [RFC 0/4] sched: Add CPU rate caps (improved)
Date: Thu, 08 Jun 2006 17:50:08 +1000	[thread overview]
Message-ID: <4487D6B0.3080502@bigpond.net.au> (raw)
In-Reply-To: <448688B2.2030206@jp.fujitsu.com>

[-- Attachment #1: Type: text/plain, Size: 5670 bytes --]

MAEDA Naoaki wrote:
> 
> I tried to run kernbench with a hard cap, and it spent a very
> long time in the "Cleaning source tree..." phase. Because this phase
> is not a CPU hog, my expectation was that it would behave as if uncapped.
> 
> That can be reproduced by just running "make clean" on top of a
> kernel source tree with hard cap.
> 
> % /usr/bin/time make clean
> 1.62user 0.29system 0:01.90elapsed 101%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (0major+68539minor)pagefaults 0swaps
> 
>   # Without cap, it returns almost immediately
> 
> % ~/withcap.sh  -C 900 /usr/bin/time make clean
> 1.61user 0.29system 1:26.17elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+68537minor)pagefaults 0swaps
> 
>   # With 90% hard cap, it takes about 1.5 minutes.
> 
> % ~/withcap.sh  -C 100 /usr/bin/time make clean
> 1.64user 0.34system 3:31.48elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+68538minor)pagefaults 0swaps
> 
>   # It became worse with 10% hard cap.
> 
> % ~/withcap.sh  -c 900 /usr/bin/time make clean
> 1.63user 0.28system 0:01.89elapsed 100%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (0major+68537minor)pagefaults 0swaps
> 
>   # It doesn't happen with soft cap.

This behaviour is caused by "make clean" being a short-lived CPU 
intensive task.  It was made worse by two aspects of my simplifications: 
the sinbin duration calculation assumed a constant CPU burst size based 
on the time slice, and exiting tasks could still have caps enforced.  
(The simplification was done to avoid 64 bit divides.)

I've put in a more complex sinbin calculation (I don't think the 64 
bit divides will matter too much as they're on an infrequently travelled 
path).  Exiting tasks are now excluded from having caps enforced, on the 
grounds that it's best for system performance to let them get out of the 
way as soon as possible.  A patch is attached and I would appreciate it 
if you could see whether it improves the situation you are observing.

These changes don't completely get rid of the phenomenon but I think 
that it's less severe.  I've written a couple of scripts to test this 
behaviour using the wload program from:

<http://prdownloads.sourceforge.net/cpuse/simloads-0.1.1.tar.gz?download>

You run loops.sh with a single argument and it uses asps.sh.  What the 
test does is run a number (specified by the argument to loops.sh) of 
instances of wload in series and use time to get the stats for the 
series to complete.  It does this for a number of different wload 
durations between 0.001 and 10.0 seconds.  Here's an example of the 
output from an uncapped run:

[peterw@heathwren ~]$ ./loops.sh 1
-d=0.001: user = 0.01 system = 0.00 elapsed = 0.00 rate = 133%
-d=0.005: user = 0.01 system = 0.00 elapsed = 0.01 rate = 84%
-d=0.01: user = 0.02 system = 0.00 elapsed = 0.01 rate = 105%
-d=0.05: user = 0.06 system = 0.00 elapsed = 0.05 rate = 103%
-d=0.1: user = 0.10 system = 0.00 elapsed = 0.11 rate = 98%
-d=0.5: user = 0.50 system = 0.00 elapsed = 0.50 rate = 100%
-d=1.0: user = 1.00 system = 0.00 elapsed = 1.01 rate = 99%
-d=5.0: user = 5.00 system = 0.00 elapsed = 5.01 rate = 99%
-d=10.0: user = 10.00 system = 0.00 elapsed = 10.01 rate = 99%

and with a cap of 90%:

[peterw@heathwren ~]$ withcap -C 900 ./loops.sh 1
-d=0.001: user = 0.00 system = 0.00 elapsed = 0.01 rate = 53%
-d=0.005: user = 0.01 system = 0.00 elapsed = 0.02 rate = 61%
-d=0.01: user = 0.01 system = 0.00 elapsed = 0.03 rate = 66%
-d=0.05: user = 0.06 system = 0.00 elapsed = 0.07 rate = 85%
-d=0.1: user = 0.10 system = 0.00 elapsed = 0.11 rate = 91%
-d=0.5: user = 0.50 system = 0.00 elapsed = 0.56 rate = 90%
-d=1.0: user = 1.00 system = 0.00 elapsed = 1.11 rate = 90%
-d=5.0: user = 5.00 system = 0.00 elapsed = 5.54 rate = 90%
-d=10.0: user = 10.00 system = 0.00 elapsed = 11.14 rate = 89%

Notice how the task's usage rate gets closer to the cap the longer the 
task runs, and never exceeds the cap.  With smaller caps the effect is 
different, e.g. for a 9% cap we get:

[peterw@heathwren ~]$ withcap -C 90 ./loops.sh 1
-d=0.001: user = 0.00 system = 0.00 elapsed = 0.01 rate = 109%
-d=0.005: user = 0.01 system = 0.00 elapsed = 0.02 rate = 59%
-d=0.01: user = 0.02 system = 0.00 elapsed = 0.05 rate = 35%
-d=0.05: user = 0.05 system = 0.00 elapsed = 0.14 rate = 42%
-d=0.1: user = 0.10 system = 0.00 elapsed = 0.25 rate = 43%
-d=0.5: user = 0.50 system = 0.00 elapsed = 1.87 rate = 27%
-d=1.0: user = 1.00 system = 0.00 elapsed = 5.37 rate = 18%
-d=5.0: user = 5.00 system = 0.00 elapsed = 48.61 rate = 10%
-d=10.0: user = 10.00 system = 0.00 elapsed = 102.22 rate = 9%

and short lived tasks are being capped too leniently, with usage rates 
well above the 9% cap.

Bearing in mind that -d=0.01 is the equivalent of a task running for 
just a single tick, and that that's about the shortest cycle length 
we're likely to see for CPU intensive tasks (and then only when capping 
enforcement kicks in), I think it is unrealistic to expect much better 
for tasks with a life shorter than that.  Further, it takes several 
cycles to gather statistics reasonable enough to base capping 
enforcement on, so doing much better than this for short lived tasks is 
unlikely.

You could also try using a smaller value for CAP_STATS_OFFSET, as this 
will shorten the half life of the Kalman filters and make the capping 
react more quickly to changes in usage rates (a task starting up being 
one such change).  The downside is that the capping will be less smooth.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

[-- Attachment #2: loops.sh --]
[-- Type: application/x-shellscript, Size: 191 bytes --]

[-- Attachment #3: short-lived-tasks-hard-cap-fix --]
[-- Type: text/plain, Size: 1502 bytes --]

---
 kernel/sched.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

Index: MM-2.6.17-rc5-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc5-mm3.orig/kernel/sched.c	2006-06-06 11:29:51.000000000 +1000
+++ MM-2.6.17-rc5-mm3/kernel/sched.c	2006-06-08 14:28:10.000000000 +1000
@@ -216,7 +216,8 @@ static void sinbin_release_fn(unsigned l
 #define cap_load_weight(p) \
 	(max((int)((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / CPU_CAP_ONE), 1))
 #define safe_to_enforce_cap(p) \
-	(!((p)->mutexes_held || (p)->flags & (PF_FREEZE | PF_UIWAKE)))
+	(!((p)->mutexes_held || \
+	   (p)->flags & (PF_FREEZE | PF_UIWAKE | PF_EXITING)))
 #define safe_to_sinbin(p) (safe_to_enforce_cap(p) && !signal_pending(p))
 
 static void init_cpu_rate_caps(task_t *p)
@@ -1235,13 +1236,16 @@ static unsigned long reqd_sinbin_ticks(c
 	unsigned long long rhs = p->avg_cycle_length * p->cpu_rate_hard_cap;
 
 	if (lhs > rhs) {
-		unsigned long res;
-
-		res = static_prio_timeslice(p->static_prio);
-		res *= (CPU_CAP_ONE - p->cpu_rate_hard_cap);
-		res /= CPU_CAP_ONE;
+		lhs -= p->avg_cpu_per_cycle;
+		lhs >>= CAP_STATS_OFFSET;
+	/* have to do two divisions because there's no guarantee
+		 * that p->cpu_rate_hard_cap * (1000000000 / HZ) would
+		 * not overflow a 32 bit unsigned integer
+		 */
+		(void)do_div(lhs, p->cpu_rate_hard_cap);
+		(void)do_div(lhs, (1000000000 / HZ));
 
-		return res ? : 1;
+		return lhs ? : 1;
 	}
 
 	return 0;

[-- Attachment #4: asps.sh --]
[-- Type: application/x-shellscript, Size: 92 bytes --]


Thread overview: 8+ messages
     [not found] <20060606023708.2801.24804.sendpatchset@heathwren.pw.nest>
2006-06-07  8:05 ` [ckrm-tech] [RFC 0/4] sched: Add CPU rate caps (improved) MAEDA Naoaki
2006-06-07 12:44   ` Peter Williams
2006-06-08  7:50   ` Peter Williams [this message]
2006-06-09  0:57     ` Peter Williams
2006-06-09  5:50       ` MAEDA Naoaki
2006-06-09  6:05         ` Peter Williams
2006-06-09  5:41     ` MAEDA Naoaki
2006-06-09  6:38       ` Peter Williams
