public inbox for linux-kernel@vger.kernel.org
From: luca abeni <luca.abeni@unitn.it>
To: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@arm.com>,
	Claudio Scordino <claudio@evidence.eu.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Subject: Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
Date: Tue, 3 Jan 2017 22:33:01 +0100	[thread overview]
Message-ID: <20170103223301.0c6af9d2@sweethome> (raw)
In-Reply-To: <6c4ab4ec-7164-fafe-5efc-990f3cf31269@redhat.com>

Hi Daniel,
(sorry for the previous html email; I replied from my phone and I did
not realise how the email client was configured)

On Tue, 3 Jan 2017 19:58:38 +0100
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

[...]
> > The implemented CPU reclaiming algorithm is based on tracking the
> > utilization U_act of active tasks (first 2 patches) and modifying
> > the runtime accounting rule (see patch 0004). The original GRUB
> > algorithm is modified as described in [2] to support multiple CPUs
> > (the original algorithm considered a single CPU; this one tracks
> > U_act per runqueue) and to leave an "unreclaimable" fraction of CPU
> > time to non-SCHED_DEADLINE tasks (see patch 0005: the original
> > algorithm can consume 100% of the CPU time, starving all the other
> > tasks). Patch 0003 uses the "inactive timer" introduced in patch
> > 0002 to fix dl_overflow() and __setparam_dl(). Patch 0006 allows
> > enabling CPU reclaiming only on selected tasks.  
> 
> Hi,
> 
> Today I did some tests on this patch set. Unfortunately, it seems that
> there is a problem :-(.
[...]
I reproduced this issue; thanks for the report. It seems to be due to
the fact that there are more reclaiming tasks than CPU cores and the
load is very high (close to the utilisation limit).

I am investigating it, and will hopefully post an update in the next
few days.



			Thanks,
				Luca


> 
> On a four-core box, if I dispatch 11 tasks [1] with this setup:
> 
>   period = 30 ms
>   runtime = 10 ms
>   flags = 0 (GRUB disabled)
> 
> I see this:
> ------------------------------- HTOP ------------------------------------
> 1 [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
> 2 [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
> 3 [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
> 4 [|||||||||||||||||||||92.5%]   Mem[|||||||||||||||1.13G/3.78G]
>                                  Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
> 16249 root	-101   0  4204   624   556 R 32.4  0.0  2:09.80 d
> 16250 root	-101   0  4204   728   660 R 32.4  0.0  2:09.58 d
> 16252 root	-101   0  4204   676   608 R 32.4  0.0  2:09.08 d
> 16253 root	-101   0  4204   636   568 R 32.4  0.0  2:08.85 d
> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
> 16255 root	-101   0  4204   620   556 R 32.4  0.0  2:08.40 d
> 16257 root	-101   0  4204   708   640 R 32.4  0.0  2:07.98 d
> 16256 root	-101   0  4204   624   560 R 32.4  0.0  2:08.18 d
> 16248 root	-101   0  4204   680   612 R 33.0  0.0  2:10.15 d
> 16251 root	-101   0  4204   676   608 R 33.0  0.0  2:09.34 d
> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
> ------------------------------- HTOP ------------------------------------
> 
> All tasks are using roughly the same amount of CPU time, a little more
> than 30%, as expected. However, if I enable GRUB on the same task set
> I get this:
> 
> ------------------------------- HTOP ------------------------------------
> 1 [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
> 2 [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
> 3 [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
> 4 [|||||||||||||||||||||96.4%]   Mem[|||||||||||||||1.13G/3.78G]
>                                  Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
> 14962 root	-101   0  4204   728   660 R 45.8  0.0  1h05:06 g
> 14959 root	-101   0  4204   680   612 R 45.2  0.0  1h07:29 g
> 14927 root	-101   0  4204   624   556 R 44.6  0.0  1h04:30 g
> 14928 root	-101   0  4204   656   588 R 31.1  0.0 47:37.21 g
> 14961 root	-101   0  4204   684   616 R 31.1  0.0 47:19.75 g
> 14968 root	-101   0  4204   636   568 R 31.1  0.0 46:27.36 g
> 14960 root	-101   0  4204   684   616 R 23.8  0.0 37:31.06 g
> 14969 root	-101   0  4204   684   616 R 23.8  0.0 38:11.50 g
> 14925 root	-101   0  4204   636   568 R 23.8  0.0 37:34.88 g
> 14926 root	-101   0  4204   684   616 R 23.8  0.0 38:27.37 g
> 16182 root	 20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>   588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
> ------------------------------- HTOP ------------------------------------
> 
> Some tasks start to use more CPU time, while others seem to use less
> CPU than was reserved for them. See task 14926: it is using only
> 23.8% of the CPU, which is less than its 10/30 reservation.
> 
> I traced this task's activations and noticed this:
> 
>          swapper     0 [003] 14968.332244: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.339294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7050 us (14968.339294 - 14968.332244)
> 
> period:  29997 us (14968.362241 - 14968.332244)
>          swapper     0 [003] 14968.362241: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.369294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7053 us (14968.369294 - 14968.362241)
> 
> period:  29994 us (14968.392235 - 14968.362241)
>          swapper     0 [003] 14968.392235: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.399301: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7066 us (14968.399301 - 14968.392235)
> 
> period:  30008 us (14968.422243 - 14968.392235)
>          swapper     0 [003] 14968.422243: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.429294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7051 us (14968.429294 - 14968.422243)
> 
> period:  29995 us (14968.452238 - 14968.422243)
>          swapper     0 [003] 14968.452238: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.459293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7055 us (14968.459293 - 14968.452238)
> 
> period:  30055 us (14968.482293 - 14968.452238)
>                g 14925 [003] 14968.482293: sched:sched_switch: g:14925 [-1] R ==> g:14926 [-1]
>                g 14926 [003] 14968.490293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 8000 us (14968.490293 - 14968.482293)
> 
> The task is using less CPU than it was reserved/guaranteed.
> 
> After some debugging, it seems that in this case GRUB is also
> _reducing_ the runtime of the task, by making the accounted runtime
> greater than the actually consumed runtime.
> 
> You can see this with this code snippet:
> 
> ------------------- %<-------------------
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 93ff400..1abb594 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
>  
>  	sched_rt_avg_update(rq, delta_exec);
>  
> -	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> -		delta_exec = grub_reclaim(delta_exec, rq);
> -	dl_se->runtime -= delta_exec;
> +	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> +		u64 new_delta_exec;
> +		new_delta_exec = grub_reclaim(delta_exec, rq);
> +		if (new_delta_exec > delta_exec)
> +			trace_printk("new delta exec (%llu) is greater than delta exec (%llu) by %llu\n",
> +					new_delta_exec,
> +					delta_exec,
> +					(new_delta_exec - delta_exec));
> +		dl_se->runtime -= new_delta_exec;
> +	}
> +	else {
> +		dl_se->runtime -= delta_exec;
> +	}
> +
> +
>  
>  throttle:
>  	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> --------------------------- >% -------------  
> 
> It seems to be related to "sched/deadline: do not reclaim the whole
> CPU bandwidth" (patch 0005), because the trace_printk message I added
> starts to appear when we start to touch this limit, and
> (new_delta_exec - delta_exec) seems to be limited to the
> non_deadline_bw.
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=950000
> g-1984  [001] d.h1  1108.783349: update_curr_dl: new delta exec (1050043) is greater than delta exec (1000042) by 50001
> g-1983  [002] d.h1  1108.783349: update_curr_dl: new delta exec (1049974) is greater than delta exec (999976) by 49998
> g-1981  [003] d.h1  1108.783350: update_curr_dl: new delta exec (1050054) is greater than delta exec (1000053) by 50001
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=900000
> g-1748  [001] d.h1   418.879815: update_curr_dl: new delta exec (1099995) is greater than delta exec (999996) by 99999
> g-1749  [002] d.h1   418.880815: update_curr_dl: new delta exec (1099986) is greater than delta exec (999988) by 99998
> g-1748  [001] d.h1   418.880815: update_curr_dl: new delta exec (1099962) is greater than delta exec (999966) by 99996
> 
> With fewer tasks, this error appears only at the dispatch of a new
> task and stabilizes after a few milliseconds. But it does not
> stabilize when we are close to the limit of the rt runtime.
> 
> That is all I could find today. Am I missing something?
> 
> [1] http://bristot.me/lkml/d.c
> 
> -- Daniel

Thread overview: 20+ messages
2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
2016-12-30 11:33 ` [RFC v4 1/6] sched/deadline: track the active utilization Luca Abeni
2016-12-30 11:33 ` [RFC v4 2/6] sched/deadline: improve the tracking of " Luca Abeni
2017-01-11 17:05   ` Juri Lelli
2017-01-11 21:22     ` luca abeni
2016-12-30 11:33 ` [RFC v4 3/6] sched/deadline: fix the update of the total -deadline utilization Luca Abeni
2016-12-30 11:33 ` [RFC v4 4/6] sched/deadline: implement GRUB accounting Luca Abeni
2016-12-30 11:33 ` [RFC v4 5/6] sched/deadline: do not reclaim the whole CPU bandwidth Luca Abeni
2016-12-30 11:33 ` [RFC v4 6/6] sched/deadline: make GRUB a task's flag Luca Abeni
2017-01-03 18:58 ` [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Daniel Bristot de Oliveira
2017-01-03 21:33   ` luca abeni [this message]
2017-01-04 12:17   ` luca abeni
2017-01-04 15:14     ` Daniel Bristot de Oliveira
2017-01-04 16:42       ` Luca Abeni
2017-01-04 18:00         ` Daniel Bristot de Oliveira
2017-01-04 18:30           ` Luca Abeni
2017-01-11 12:19             ` Juri Lelli
2017-01-11 12:39               ` Luca Abeni
2017-01-11 15:06                 ` Juri Lelli
2017-01-11 21:16                   ` luca abeni
