* [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Steven Rostedt @ 2016-02-03 18:55 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Clark Williams, John Kacur, Daniel Bristot de Oliveira,
Juri Lelli
There's an accounting issue with SCHED_DEADLINE and the creation of
cpusets. If a SCHED_DEADLINE task already exists and a new root domain
is created, the calculation of the bandwidth among the root domains
gets corrupted.
For the reproducer, I downloaded Juri's tests:
https://github.com/jlelli/tests.git
For his burn.c file.
https://github.com/jlelli/schedtool-dl.git
For his modified schedtool utility.
I have a kernel with my debugging patches (appended below) that show the bandwidth:
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
Note: the sched_rt runtime and period are 950000 and 1000000
respectively, and the bw ratio is (95/100) << 20 == 996147.
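For reference, this conversion is done by to_ratio(); a sketch of the
computation, paraphrased from my reading of kernel/sched/core.c:

unsigned long to_ratio(u64 period, u64 runtime)
{
	if (runtime == RUNTIME_INF)
		return 1ULL << 20;
	/* returning zero here saves checks in the callers */
	if (period == 0)
		return 0;
	/* fixed point with 20 fractional bits: (runtime / period) << 20 */
	return div64_u64(runtime << 20, period);
}

So to_ratio(1000000, 950000) == (950000 << 20) / 1000000 == 996147.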
This isn't the way I first discovered the issue, but it appears to be
the quickest way to reproduce it.
Make sure there are no other cpusets. As libvirt had created some, I had
to remove them first:
# rmdir /sys/fs/cgroup/cpuset/libvirt/{qemu,}
# burn&
# schedtool -E -t 2000000:20000000 $!
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
Note: (2/20) << 20 == 104857
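For reference, schedtool -E -t <runtime:period> boils down to a
sched_setattr() syscall. A minimal standalone sketch of the equivalent
call (the struct layout mirrors the uapi header; error handling is
omitted, and I'm assuming your libc headers define SYS_sched_setattr,
as there is no glibc wrapper):

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 2000000,	/*  2ms, in ns */
		.sched_deadline	= 20000000,	/* 20ms, in ns */
		.sched_period	= 20000000,
	};

	/* pid 0 means the calling task */
	return syscall(SYS_sched_setattr, 0, &attr, 0) ? 1 : 0;
}

Continuing with the reproducer: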
# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
Notice that after disabling load balancing in the main cpuset, all the
totals went to zero.
Let's see what happens when we kill the task.
# killall burn
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
They all went negative!
Not good, but we can recover...
# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
Playing with this a bit more, I found that setting sched_load_balance
to 1 in the toplevel cpuset always resets the deadline bandwidth,
whether or not it should. At least that's a way to recover when things
stop working, but I still believe this is a bug.
Things can get even worse when adding a bunch of cpusets. But the bad
accounting always appears to be cleared if you kill all SCHED_DEADLINE
tasks and set the toplevel cpuset's sched_load_balance to 1, as
summarized below.
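In other words, the recovery sequence that seems to work here is:

# killall burn		# kill every SCHED_DEADLINE task first
# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
# grep dl /proc/sched_debug	# total_bw should read 0 everywhere

(toggling sched_load_balance through 0 first if it is already 1).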
I traced this with the patch appended below and got this:
schedtool-2279 [004] d... 124.195359: __sched_setscheduler: dl_b=ffff88011a01d040 new=104857 old=0
schedtool-2279 [004] d... 124.195361: __dl_add: (ffff88011a01d040) total=0 add tsk=104857
schedtool-2279 [004] d... 124.195362: __dl_add: new total=104857
schedtool-2279 [004] d... 124.195364: <stack trace>
=> __dl_add
=> __sched_setscheduler
=> SyS_sched_setattr
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917342: rq_attach_root: old_rd refcount=8
bash-2213 [003] d... 142.917344: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917351: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917387: rq_attach_root: old_rd refcount=7
bash-2213 [003] d... 142.917388: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917393: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917420: rq_attach_root: old_rd refcount=6
bash-2213 [003] d... 142.917420: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917425: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917452: rq_attach_root: old_rd refcount=5
bash-2213 [003] d... 142.917452: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917457: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917485: rq_attach_root: old_rd refcount=4
bash-2213 [003] d... 142.917485: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917490: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917518: rq_attach_root: old_rd refcount=3
bash-2213 [003] d... 142.917519: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917524: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917550: rq_attach_root: old_rd refcount=2
bash-2213 [003] d... 142.917550: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917556: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
bash-2213 [003] d... 142.917582: rq_attach_root: old_rd refcount=1
bash-2213 [003] d... 142.917582: rq_attach_root: old rd ffff88011a01d000 (ffff88011a01d040) total=104857
bash-2213 [003] d... 142.917582: rq_attach_root: add new rd ffffffff81fcd9c0 (ffffffff81fcda00) total=0
bash-2213 [003] d... 142.917588: <stack trace>
=> rq_attach_root
=> cpu_attach_domain
=> partition_sched_domains
=> rebuild_sched_domains_locked
=> update_flag
=> cpuset_write_u64
=> cgroup_file_write
=> kernfs_fop_write
=> __vfs_write
=> vfs_write
=> SyS_write
=> entry_SYSCALL_64_fastpath
kworker/0:2-268 [000] d... 153.335522: task_dead_dl: [0:ffffffff81fcda00] total=0 tsk=104857
kworker/0:2-268 [000] d... 153.335523: task_dead_dl: new total=-104857
kworker/0:2-268 [000] d... 153.335528: <stack trace>
=> task_dead_dl
=> finish_task_switch
=> __schedule
=> schedule
=> worker_thread
=> kthread
=> ret_from_fork
There seems to be some disconnect between cgroups and the SCHED_DEADLINE
root domains.
-- Steve
---
kernel/sched/core.c | 9 +++++++++
kernel/sched/deadline.c | 5 +++++
kernel/sched/sched.h | 6 ++++++
3 files changed, 20 insertions(+)
Index: linux-trace.git/kernel/sched/core.c
===================================================================
--- linux-trace.git.orig/kernel/sched/core.c 2016-02-03 10:54:55.158659968 -0500
+++ linux-trace.git/kernel/sched/core.c 2016-02-03 13:52:56.213696814 -0500
@@ -2301,6 +2301,7 @@ static int dl_overflow(struct task_struc
if (new_bw == p->dl.dl_bw)
return 0;
+ trace_printk("dl_b=%p new=%lld old=%lld\n", dl_b, new_bw, p->dl.dl_bw);
/*
* Either if a task, enters, leave, or stays -deadline but changes
* its parameters, we may need to update accordingly the total
@@ -5068,6 +5069,7 @@ int task_can_attach(struct task_struct *
if (overflow)
ret = -EBUSY;
else {
+ struct dl_bw *src_dl_b;
/*
* We reserve space for this task in the destination
* root_domain, as we can't fail after this point.
@@ -5075,6 +5077,8 @@ int task_can_attach(struct task_struct *
* later on (see set_cpus_allowed_dl()).
*/
__dl_add(dl_b, p->dl.dl_bw);
+ src_dl_b = dl_bw_of(task_cpu(p));
+ trace_printk("source %p total=%lld\n", src_dl_b, src_dl_b->total_bw);
}
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();
@@ -5636,6 +5640,7 @@ static void rq_attach_root(struct rq *rq
cpumask_clear_cpu(rq->cpu, old_rd->span);
+ trace_printk("old_rd refcount=%d\n", atomic_read(&old_rd->refcount));
/*
* If we dont want to free the old_rd yet then
* set old_rd to NULL to skip the freeing later
@@ -5646,7 +5651,11 @@ static void rq_attach_root(struct rq *rq
}
atomic_inc(&rd->refcount);
+ if (old_rd)
+ trace_printk("old rd %p (%p) total=%lld\n", rq->rd, &rq->rd->dl_bw, rq->rd->dl_bw.total_bw);
+ trace_printk("add new rd %p (%p) total=%lld\n", rd, &rd->dl_bw, rd->dl_bw.total_bw);
rq->rd = rd;
+ trace_dump_stack(0);
cpumask_set_cpu(rq->cpu, rd->span);
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
Index: linux-trace.git/kernel/sched/deadline.c
===================================================================
--- linux-trace.git.orig/kernel/sched/deadline.c 2016-02-03 10:21:27.140280992 -0500
+++ linux-trace.git/kernel/sched/deadline.c 2016-02-03 13:44:41.321432980 -0500
@@ -66,6 +66,8 @@ void init_dl_bw(struct dl_bw *dl_b)
else
dl_b->bw = to_ratio(global_rt_period(), global_rt_runtime());
raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
+ trace_printk("clear total_bw %p (was:%lld)\n", dl_b, dl_b->total_bw);
+ trace_dump_stack(0);
dl_b->total_bw = 0;
}
@@ -1224,7 +1226,10 @@ static void task_dead_dl(struct task_str
*/
raw_spin_lock_irq(&dl_b->lock);
/* XXX we should retain the bw until 0-lag */
+ trace_printk("[%d:%p] total=%lld tsk=%lld\n", task_cpu(p), dl_b, dl_b->total_bw, p->dl.dl_bw);
dl_b->total_bw -= p->dl.dl_bw;
+ trace_printk("new total=%lld\n", dl_b->total_bw);
+ trace_dump_stack(0);
raw_spin_unlock_irq(&dl_b->lock);
}
Index: linux-trace.git/kernel/sched/sched.h
===================================================================
--- linux-trace.git.orig/kernel/sched/sched.h 2016-02-03 10:54:55.182659590 -0500
+++ linux-trace.git/kernel/sched/sched.h 2016-02-03 13:44:41.321432980 -0500
@@ -192,13 +192,19 @@ struct dl_bw {
static inline
void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
{
+ trace_printk("(%p) total=%lld sub tsk=%lld\n", dl_b, dl_b->total_bw, tsk_bw);
dl_b->total_bw -= tsk_bw;
+ trace_printk("new total=%lld\n", dl_b->total_bw);
+ trace_dump_stack(0);
}
static inline
void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
{
+ trace_printk("(%p) total=%lld add tsk=%lld\n", dl_b, dl_b->total_bw, tsk_bw);
dl_b->total_bw += tsk_bw;
+ trace_printk("new total=%lld\n", dl_b->total_bw);
+ trace_dump_stack(0);
}
static inline
* Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Steven Rostedt @ 2016-02-03 18:57 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Clark Williams, John Kacur, Daniel Bristot de Oliveira,
Juri Lelli
On Wed, 3 Feb 2016 13:55:50 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> # grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
>
> They all went negative!
>
> Not good, but we can recover...
Even though we can recover, I'm betting we can easily overcommit
deadline tasks here. It probably wouldn't be too hard to come up with a
test case that does so.
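Something like this, perhaps -- a completely untested sketch of the
shape such a test might take. The admission test caps total_bw at
nr_cpus * 996147, so the idea is to repeatedly admit a 50% task and
wipe the accounting in between:

# Hypothetical overcommit test (untested). Each iteration admits a 50%
# task, then the sched_load_balance toggle resets total_bw to 0, so the
# next 50% task passes admission control again. After more than
# 2 * nproc iterations, the real committed bandwidth exceeds the
# 95%-per-CPU cap while the accounting still looks fine.
for i in $(seq $((2 * $(nproc) + 1))); do
	burn &
	schedtool -E -t 10000000:20000000 $!
	echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
	echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
done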
-- Steve
* Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Juri Lelli @ 2016-02-04 9:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
Daniel Bristot de Oliveira, Juri Lelli
Hi Steve,
first of all thanks a lot for your detailed report, if only all bug
reports were like this.. :)
On 03/02/16 13:55, Steven Rostedt wrote:
> There's an accounting issue with the SCHED_DEADLINE and the creation of
> cpusets. If a SCHED_DEADLINE task already exists and a new root domain
> is created, the calculation of the bandwidth among the root domains
> gets corrupted.
>
> For the reproducer, I downloaded Juri's tests:
>
> https://github.com/jlelli/tests.git
>
> For his burn.c file.
>
> https://github.com/jlelli/schedtool-dl.git
>
> For his modified schedtool utility.
>
>
> I have a kernel with my patches that show the bandwidth:
>
> # grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
>
>
> Note: the sched_rt runtime and period are 950000 and 1000000
> respectively, and the bw ratio is (95/100) << 20 == 996147.
>
> This isn't the way I first discovered the issue, but it appears to be
> the quickest way to reproduce it.
>
> Make sure there's no other cpusets. As libvirt created some, I had to
> remove them first:
>
> # rmdir /sys/fs/cgroup/cpuset/libvirt/{qemu,}
>
>
> # burn&
> # schedtool -E -t 2000000:20000000 $!
>
> # grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
>
> Note: (2/20) << 20 == 104857
>
> # echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
>
> # grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
>
> Notice that after removing load_balancing from the main cpuset, all the
> totals went to zero.
>
Right. I think this is the same thing that happens after hotplug. IIRC
the code paths are actually the same. The problem is that hotplug and
cpuset reconfiguration operations are destructive w.r.t. root_domains,
so we lose bandwidth information when that happens. Underlying this is
the fact that we only store cumulative bandwidth information in the
root_domain, while the information about which task belongs to which
cpuset is stored in the cpuset data structures.
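For reference, the root_domain side of the accounting is just this
cumulative pair (kernel/sched/sched.h, comments mine); note there is no
back-link to the tasks that contributed to total_bw:

struct dl_bw {
	raw_spinlock_t	lock;
	u64		bw;		/* max usable bandwidth, e.g. 996147 */
	u64		total_bw;	/* sum of admitted tasks' bandwidths */
};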
I tried to fix this a while back, but my attempt was broken: I failed
to get the locking right and, even though it seemed to fix the issue
for me, it was prone to race conditions. You might still want to have a
look at it for reference: https://lkml.org/lkml/2015/9/2/162
> Let's see what happens when we kill the task.
>
> # killall burn
>
> # grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : -104857
>
> They all went negative!
>
Yes, that's because we remove the task's bw from the root_domain
unconditionally in task_dead_dl(), as you also found out below.
> Not good, but we can recover...
>
> # echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
>
> # grep dl /proc/sched_debugdl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0
>
> Playing with this a bit more, I found that it appears that setting
> load_balance to 1 in the toplevel cpuset always resets the deadline
> bandwidth weather or not it should be. At least that's a way to recover
> from things not working anymore, but I still believe this is a bug.
>
It's good that we can recover, but that's still a bug yes :/.
I'll try to see if my broken patch makes what you are seeing disappear,
so that we can at least confirm that we are seeing the same problem;
you could do the same if you want, I pushed it here:
git://linux-arm.org/linux-jl.git upstream/fixes/dl-hotplug
I'm not sure anyway whether my approach can be fixed or whether we have
to solve this some other way. I'll have to get back to looking at this.
Best,
- Juri
* Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Juri Lelli @ 2016-02-04 12:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
Daniel Bristot de Oliveira, Juri Lelli
On 04/02/16 09:54, Juri Lelli wrote:
> Hi Steve,
>
> first of all thanks a lot for your detailed report, if only all bug
> reports were like this.. :)
>
> On 03/02/16 13:55, Steven Rostedt wrote:
[...]
>
> Right. I think this is the same thing that happens after hotplug. IIRC
> the code paths are actually the same. The problem is that hotplug and
> cpuset reconfiguration operations are destructive w.r.t. root_domains,
> so we lose bandwidth information when that happens. Underlying this is
> the fact that we only store cumulative bandwidth information in the
> root_domain, while the information about which task belongs to which
> cpuset is stored in the cpuset data structures.
>
> I tried to fix this a while back, but my attempt was broken: I failed
> to get the locking right and, even though it seemed to fix the issue
> for me, it was prone to race conditions. You might still want to have a
> look at it for reference: https://lkml.org/lkml/2015/9/2/162
>
[...]
>
> It's good that we can recover, but that's still a bug yes :/.
>
> I'll try to see if my broken patch makes what you are seeing disappear,
> so that we can at least confirm that we are seeing the same problem;
> you could do the same if you want, I pushed it here:
>
No, it doesn't solve this :/. I placed the restoring code in the hotplug
workfn, so updates generated by toggling sched_load_balance aren't
caught, of course. But this at least tells us that we should solve this
someplace else.
Best,
- Juri
* Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Juri Lelli @ 2016-02-04 12:27 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
Daniel Bristot de Oliveira, Juri Lelli
On 04/02/16 12:04, Juri Lelli wrote:
> On 04/02/16 09:54, Juri Lelli wrote:
> > Hi Steve,
> >
> > first of all thanks a lot for your detailed report, if only all bug
> > reports were like this.. :)
> >
> > On 03/02/16 13:55, Steven Rostedt wrote:
>
> [...]
>
> >
> > Right. I think this is the same thing that happens after hotplug. IIRC
> > the code paths are actually the same. The problem is that hotplug and
> > cpuset reconfiguration operations are destructive w.r.t. root_domains,
> > so we lose bandwidth information when that happens. Underlying this is
> > the fact that we only store cumulative bandwidth information in the
> > root_domain, while the information about which task belongs to which
> > cpuset is stored in the cpuset data structures.
> >
> > I tried to fix this a while back, but my attempt was broken: I failed
> > to get the locking right and, even though it seemed to fix the issue
> > for me, it was prone to race conditions. You might still want to have a
> > look at it for reference: https://lkml.org/lkml/2015/9/2/162
> >
>
> [...]
>
> >
> > It's good that we can recover, but that's still a bug yes :/.
> >
> > I'll try to see if my broken patch makes what you are seeing disappear,
> > so that we can at least confirm that we are seeing the same problem;
> > you could do the same if you want, I pushed it here:
> >
>
> No, it doesn't solve this :/. I placed the restoring code in the hotplug
> workfn, so updates generated by toggling sched_load_balance aren't
> caught, of course. But this at least tells us that we should solve this
> someplace else.
>
Well, if I call an unlocked version of my cpuset_hotplug_update_rd()
from kernel/cpuset.c:update_flag() the issue seems to go away. But we
end up overcommitting the default null domain (try toggling
sched_load_balance multiple times). I updated the branch, but I still
think we should solve this differently.
Best,
- Juri
* Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Juri Lelli @ 2016-02-04 16:30 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
Daniel Bristot de Oliveira, Juri Lelli
Hi Steve,
On 04/02/16 12:27, Juri Lelli wrote:
> On 04/02/16 12:04, Juri Lelli wrote:
> > On 04/02/16 09:54, Juri Lelli wrote:
> > > Hi Steve,
> > >
> > > first of all thanks a lot for your detailed report, if only all bug
> > > reports were like this.. :)
> > >
> > > On 03/02/16 13:55, Steven Rostedt wrote:
> >
> > [...]
> >
> > >
> > > Right. I think this is the same thing that happens after hotplug. IIRC
> > > the code paths are actually the same. The problem is that hotplug and
> > > cpuset reconfiguration operations are destructive w.r.t. root_domains,
> > > so we lose bandwidth information when that happens. Underlying this is
> > > the fact that we only store cumulative bandwidth information in the
> > > root_domain, while the information about which task belongs to which
> > > cpuset is stored in the cpuset data structures.
> > >
> > > I tried to fix this a while back, but my attempt was broken: I failed
> > > to get the locking right and, even though it seemed to fix the issue
> > > for me, it was prone to race conditions. You might still want to have a
> > > look at it for reference: https://lkml.org/lkml/2015/9/2/162
> > >
> >
> > [...]
> >
> > >
> > > It's good that we can recover, but that's still a bug yes :/.
> > >
> > > I'll try to see if my broken patch makes what you are seeing disappear,
> > > so that we can at least confirm that we are seeing the same problem;
> > > you could do the same if you want, I pushed it here:
> > >
> >
> > No, it doesn't solve this :/. I placed the restoring code in the hotplug
> > workfn, so updates generated by toggling sched_load_balance aren't
> > caught, of course. But this at least tells us that we should solve this
> > someplace else.
> >
>
> Well, if I call an unlocked version of my cpuset_hotplug_update_rd()
> from kernel/cpuset.c:update_flag() the issue seems to go away. But we
> end up overcommitting the default null domain (try toggling
> sched_load_balance multiple times). I updated the branch, but I still
> think we should solve this differently.
>
I've actually changed this approach a bit, and things seem better here.
Could you please give this a try? (You can also fetch the same branch.)
Thanks,
- Juri
--->8---
From c45d255859a2978a350bae39ead52f4dd11ab767 Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Tue, 28 Jul 2015 11:55:51 +0100
Subject: [PATCH] sched/{cpuset,core}: restore root_domain status across
destructive ops
Hotplug and sched_domains update operations are destructive w.r.t. the
data associated with cpusets; in this case we care about root_domains.
SCHED_DEADLINE puts bandwidth information regarding admitted tasks on
root_domains, and that information is gone when a hotplug or update
operation happens. Also, it is not currently possible to tell which
task(s) the allocated bandwidth belongs to, as this link is lost after
sched_setscheduler() succeeds.
This patch forces rebuilding of the allocated bandwidth information at
the root_domain level after partition_sched_domains() is done. It also
ensures that we don't leave stale information in def_root_domain when
that becomes empty (since it is never freed).
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
include/linux/sched.h | 2 ++
kernel/cpuset.c | 39 +++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 28 ++++++++++++++++++++++++++++
3 files changed, 69 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..5f9eeb4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2241,6 +2241,8 @@ extern int cpuset_cpumask_can_shrink(const struct cpumask *cur,
const struct cpumask *trial);
extern int task_can_attach(struct task_struct *p,
const struct cpumask *cs_cpus_allowed);
+void sched_restore_dl_bw(struct task_struct *task,
+ const struct cpumask *new_mask);
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p,
const struct cpumask *new_mask);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e945fc..57078f0 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -785,6 +785,44 @@ done:
return ndoms;
}
+/**
+ * update_tasks_rd - Update tasks' root_domains status.
+ * @cs: the cpuset to which each task's root_domain belongs
+ *
+ * Iterate through each task of @cs updating state of its related
+ * root_domain.
+ */
+static void update_tasks_rd(struct cpuset *cs)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ css_task_iter_start(&cs->css, &it);
+ while ((task = css_task_iter_next(&it)))
+ sched_restore_dl_bw(task, cs->effective_cpus);
+ css_task_iter_end(&it);
+}
+
+static void cpuset_update_rd(void)
+{
+ struct cpuset *cs;
+ struct cgroup_subsys_state *pos_css;
+
+ lockdep_assert_held(&cpuset_mutex);
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+ if (!css_tryget_online(&cs->css))
+ continue;
+ rcu_read_unlock();
+
+ update_tasks_rd(cs);
+
+ rcu_read_lock();
+ css_put(&cs->css);
+ }
+ rcu_read_unlock();
+}
+
/*
* Rebuild scheduler domains.
*
@@ -818,6 +856,7 @@ static void rebuild_sched_domains_locked(void)
/* Have scheduler rebuild the domains */
partition_sched_domains(ndoms, doms, attr);
+ cpuset_update_rd();
out:
put_online_cpus();
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ce7a8..f9558f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2277,6 +2277,23 @@ static inline int dl_bw_cpus(int i)
}
#endif
+void sched_restore_dl_bw(struct task_struct *task,
+ const struct cpumask *new_mask)
+{
+ struct dl_bw *dl_b;
+ unsigned long flags;
+
+ if (!task_has_dl_policy(task))
+ return;
+
+ rcu_read_lock_sched();
+ dl_b = dl_bw_of(cpumask_any(new_mask));
+ raw_spin_lock_irqsave(&dl_b->lock, flags);
+ dl_b->total_bw += task->dl.dl_bw;
+ raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+ rcu_read_unlock_sched();
+}
+
/*
* We must be sure that accepting a new task (or allowing changing the
* parameters of an existing one) is consistent with the bandwidth
@@ -5636,6 +5653,17 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
cpumask_clear_cpu(rq->cpu, old_rd->span);
+ if (old_rd == &def_root_domain &&
+ cpumask_empty(old_rd->span)) {
+ /*
+ * def_root_domain is never freed, so we have to clean
+ * it when it becomes empty.
+ */
+ raw_spin_lock(&old_rd->dl_bw.lock);
+ old_rd->dl_bw.total_bw = 0;
+ raw_spin_unlock(&old_rd->dl_bw.lock);
+ }
+
/*
* If we dont want to free the old_rd yet then
* set old_rd to NULL to skip the freeing later
--
2.7.0
* Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Steven Rostedt @ 2016-02-04 17:31 UTC (permalink / raw)
To: Juri Lelli
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
Daniel Bristot de Oliveira, Juri Lelli
On Thu, 4 Feb 2016 16:30:49 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
> I've actually changed this approach a bit, and things seem better here.
> Could you please give this a try? (You can also fetch the same branch.)
It appears to fix the one issue I pointed out, but it doesn't fix the
issue with cpusets.
# burn&
# TASK=$!
# schedtool -E -t 2000000:20000000 $TASK
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
# mkdir /sys/fs/cgroup/cpuset/my_cpuset
# echo 1 > /sys/fs/cgroup/cpuset/my_cpuset/cpuset.cpus
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 209714
It appears to add double the bandwidth (2 * 104857 == 209714).
# kill $TASK
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
Now we have used bandwidth with nothing running.
# rmdir /sys/fs/cgroup/cpuset/my_cpuset
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
And now that bandwidth is leaked, but it seems we can get it back with
the old sched_load_balance trick.
# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
-- Steve
* Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
From: Juri Lelli @ 2016-02-04 18:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
Daniel Bristot de Oliveira, Juri Lelli
On 04/02/16 12:31, Steven Rostedt wrote:
> On Thu, 4 Feb 2016 16:30:49 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
>
> > I've actually changed this approach a bit, and things seem better here.
> > Could you please give this a try? (You can also fetch the same branch.)
>
> It appears to fix the one issue I pointed out, but it doesn't fix the
> issue with cpusets.
>
> # burn&
> # TASK=$!
> # schedtool -E -t 2000000:20000000 $TASK
> # grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 104857
>
> # mkdir /sys/fs/cgroup/cpuset/my_cpuset
> # echo 1 > /sys/fs/cgroup/cpuset/my_cpuset/cpuset.cpus
> # grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
> dl_rq[1]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
> dl_rq[2]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
> dl_rq[3]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
> dl_rq[4]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
> dl_rq[5]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
> dl_rq[6]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
> dl_rq[7]:
> .dl_nr_running : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 209714
>
> It appears to add double the bandwidth (2 * 104857 == 209714).
>
Mmm.. IIUC that's because we don't destroy any root_domain in this
case, as sched_load_balance of the parent is still set, so we add the
task's bandwidth to the existing one again. I could fix that with a
flag indicating when we actually destroy root_domain(s), but I fear it
would make this solution even uglier than it already is :/. More
thinking required.
Thanks for testing.
Best,
- Juri