Re: [BUG] sched: leaf_cfs_rq_list use after free

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [BUG] sched: leaf_cfs_rq_list use after free
@ 2016-03-12  9:42 Kazuki Yamaguchi
  2016-03-12 13:59 ` Peter Zijlstra
  2016-03-14 11:20 ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Kazuki Yamaguchi @ 2016-03-12  9:42 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Niklas Cassel, Peter Zijlstra, linux-kernel

Hello,

I got similar kernel crashes after the patch, which went to 4.4:

2e91fa7 cgroup: keep zombies associated with their original cgroups

I was just about to report, but maybe this is related?

^^^^^^^[    0.761718] BUG: unable to handle kernel NULL pointer 
dereference at 00000000000008b0
[    0.762860] IP: [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020] PGD 3fc067 PUD 3a9067 PMD 0
[    0.764020] Oops: 0000 [#1] SMP
[    0.764020] CPU: 0 PID: 56 Comm: test Not tainted 4.5.0-rc7 #25
[    0.764020] task: ffff8800003d2700 ti: ffff8800003e8000 task.ti: 
ffff8800003e8000
[    0.764020] RIP: 0010:[<ffffffff81052630>]  [<ffffffff81052630>] 
update_blocked_averages+0x80/0x600
[    0.764020] RSP: 0000:ffff880007c03e50  EFLAGS: 00000016
[    0.764020] RAX: 0000000000000000 RBX: 00000000ffff165e RCX: 
000000002d5096e1
[    0.764020] RDX: 00000000000d281c RSI: ffff880000138200 RDI: 
00000000000d281c
[    0.764020] RBP: ffff880007c03eb0 R08: ffffffff811567e0 R09: 
0000000000000100
[    0.764020] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffff880007c11920
[    0.764020] R13: 00000000000110c0 R14: afb504000afb5041 R15: 
ffff880007c110c0
[    0.764020] FS:  0000000001b69880(0063) GS:ffff880007c00000(0000) 
knlGS:0000000000000000
[    0.764020] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.764020] CR2: 00000000000008b0 CR3: 00000000003a4000 CR4: 
00000000000006b0
[    0.764020] Stack:
[    0.764020]  0000000080000100 0000000000000286 ffff880007c0c7f8 
0000000000000006
[    0.764020]  0000000007c0c5c0 ffff880000138200 ffffffff8104ce00 
00000000ffff165e
[    0.764020]  ffff880007c110c0 00000000000110c0 0000000000000007 
0000000000000000
[    0.764020] Call Trace:
[    0.764020]  <IRQ>
[    0.764020]  [<ffffffff8104ce00>] ? wake_up_process+0x10/0x20
[    0.764020]  [<ffffffff8105978d>] run_rebalance_domains+0x6d/0x290
[    0.764020]  [<ffffffff81072cab>] ? run_timer_softirq+0x19b/0x220
[    0.764020]  [<ffffffff810318ee>] __do_softirq+0xde/0x1e0
[    0.764020]  [<ffffffff81031aef>] irq_exit+0x5f/0x70
[    0.764020]  [<ffffffff81020238>] 
smp_trace_apic_timer_interrupt+0x68/0x90
[    0.764020]  [<ffffffff81020269>] smp_apic_timer_interrupt+0x9/0x10
[    0.764020]  [<ffffffff8114dd4c>] apic_timer_interrupt+0x7c/0x90
[    0.764020]  <EOI>
[    0.764020]  [<ffffffff810b76f6>] ? find_vma+0x16/0x70
[    0.764020]  [<ffffffff81026d18>] __do_page_fault+0xe8/0x360
[    0.764020]  [<ffffffff81026fcc>] do_page_fault+0xc/0x10
[    0.764020]  [<ffffffff8114e5cf>] page_fault+0x1f/0x30
[    0.764020] Code: 00 48 8d b0 28 ff ff ff 49 be 41 50 fb 0a 00 04 b5 
af 48 89 74 24 28 48 8b 74 24 28 c7 44 24 24 00 00 00 00 48 8b 86 c8 00 
00 00 <48> 8b 90 b0 08 00 00 48 8b 86 a0 00 00 00 48 85 c0 74 46 31 c0
[    0.764020] RIP  [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020]  RSP <ffff880007c03e50>
[    0.764020] CR2: 00000000000008b0
[    0.764020] ---[ end trace 754fbc727003a126 ]---
[    0.764020] Kernel panic - not syncing: Fatal exception in interrupt
[    0.764020] Shutting down cpus with NMI
[    0.764020] Kernel Offset: disabled
[    0.764020] ---[ end Kernel panic - not syncing: Fatal exception in 
interrupt


I can reproduce it on QEMU (qemu-system-x86_64 -smp 2).

enabled config:
CONFIG_PID_NS=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_SMP=y


init.sh:
#!/bin/sh
mkdir /testg
mount -t cgroup -o cpu cgroup /testg
echo /agent.sh > /testg/release_agent
echo 1 > /testg/notify_on_release

mkdir /temp-mnt
while :; do
    echo -n ^
    ./test
done


agent.sh:
#!/bin/sh
rmdir /testg$1


test.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sched.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/ptrace.h>

int
main(void)
{
     mount("none", "/temp-mnt", "tmpfs", 0, "");
     unshare(CLONE_NEWPID);
     pid_t pid = fork();
     if (pid == 0) {
         fork();
     } else {
         ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACEFORK);
         char template[128] = "/testg/XXXXXX";
         if (!mkdtemp(template)) abort();
         FILE *f = fopen(strcat(template, "/cgroup.procs"), "w");
         fprintf(f, "%d\n", pid);
         fclose(f);
         wait(NULL); // stopped at fork()
         kill(pid, SIGKILL);
         umount("/temp-mnt");
     }
     return 0;
}

-- 
Kazuki Yamaguchi <k@rhe.jp>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
@ 2016-03-12 13:59 ` Peter Zijlstra
  2016-03-14 11:20 ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2016-03-12 13:59 UTC (permalink / raw)
  To: Kazuki Yamaguchi; +Cc: Tejun Heo, Niklas Cassel, linux-kernel

On Sat, Mar 12, 2016 at 06:42:57PM +0900, Kazuki Yamaguchi wrote:
> I can reproduce it on QEMU (qemu-system-x86_64 -smp 2).

Most excellent! I'll go have a play.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
  2016-03-12 13:59 ` Peter Zijlstra
@ 2016-03-14 11:20 ` Peter Zijlstra
  2016-03-14 12:09   ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-03-14 11:20 UTC (permalink / raw)
  To: Kazuki Yamaguchi; +Cc: Tejun Heo, Niklas Cassel, linux-kernel

On Sat, Mar 12, 2016 at 06:42:57PM +0900, Kazuki Yamaguchi wrote:

> 2e91fa7 cgroup: keep zombies associated with their original cgroups

So the below hackery yields:

[  192.814857] ------------[ cut here ]------------
[  192.820025] WARNING: CPU: 38 PID: 3539 at ../kernel/sched/fair.c:288 enqueue_entity+0x90d/0xa10()
[  192.829930] Modules linked in:
[  192.833346] CPU: 38 PID: 3539 Comm: test Not tainted 4.5.0-rc7-01136-g89456cf-dirty #346
[  192.842377] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[  192.853832]  0000000000000000 ffff880423d87b08 ffffffff81684a55 0000000000000000
[  192.862136]  ffffffff81f36e63 ffff880423d87b40 ffffffff810d7366 ffff880827432c00
[  192.870437]  ffff880827505b80 000000000000024a 0000000000000001 0000000000000000
[  192.878744] Call Trace:
[  192.881480]  [<ffffffff81684a55>] dump_stack+0x67/0x92
[  192.887224]  [<ffffffff810d7366>] warn_slowpath_common+0x86/0xc0
[  192.893930]  [<ffffffff810d745a>] warn_slowpath_null+0x1a/0x20
[  192.900431]  [<ffffffff8111bd6d>] enqueue_entity+0x90d/0xa10
[  192.906751]  [<ffffffff811168cd>] ? select_task_rq_fair+0x48d/0x790
[  192.913748]  [<ffffffff8111bec9>] enqueue_task_fair+0x59/0x8c0
[  192.920254]  [<ffffffff8112d83d>] ? __lock_is_held+0x4d/0x70
[  192.926572]  [<ffffffff8112d83d>] ? __lock_is_held+0x4d/0x70
[  192.932895]  [<ffffffff811081bc>] activate_task+0x5c/0xb0
[  192.938923]  [<ffffffff8110874e>] ttwu_do_activate.constprop.104+0x3e/0x90
[  192.946601]  [<ffffffff81109669>] try_to_wake_up+0x1f9/0x620
[  192.952919]  [<ffffffff81109b32>] default_wake_function+0x12/0x20
[  192.959717]  [<ffffffff810da021>] child_wait_callback+0x51/0x60
[  192.966326]  [<ffffffff81125c02>] __wake_up_common+0x52/0x90
[  192.972634]  [<ffffffff811261f5>] __wake_up_sync_key+0x45/0x60
[  192.979146]  [<ffffffff810dccd6>] __wake_up_parent+0x26/0x30
[  192.985468]  [<ffffffff810ea09b>] do_notify_parent+0x30b/0x550
[  192.991980]  [<ffffffff810e9edd>] ? do_notify_parent+0x14d/0x550
[  192.998684]  [<ffffffff810e476a>] ? __ptrace_unlink+0xba/0x110
[  193.005196]  [<ffffffff810e3a18>] __ptrace_detach.part.12+0x88/0xd0
[  193.012183]  [<ffffffff810e4897>] exit_ptrace+0x87/0xd0
[  193.018015]  [<ffffffff810dc9ab>] do_exit+0xabb/0xca0
[  193.023663]  [<ffffffff810dcc1e>] do_group_exit+0x4e/0xc0
[  193.029680]  [<ffffffff810dcca4>] SyS_exit_group+0x14/0x20
[  193.035823]  [<ffffffff81b1d965>] entry_SYSCALL_64_fastpath+0x18/0xa8
[  193.043013] ---[ end trace 8c92cd8599d0fd71 ]---


Which yields the patch in question is indeed full of fail. The
additional hackery below (new cpu_cgroup_exit) does indeed fix the fail.

But given that this is true for cpu, it is also very much true for perf.

So I would suggest TJ to revert that patch and queue it for stable.

It it clearly borken, because cgroup_exit() is called from preemptible
context, so _obviously_ we can (and clearly will) schedule after that,
which is somewhat hard if the cgroup we're supposedly belonging to has
been torn to shreds in the meantime.

---
 kernel/sched/core.c | 17 +++++++++++++++++
 kernel/sched/fair.c | 13 +++++++++----
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1c82ded..3892a48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7640,6 +7640,9 @@ void sched_move_task(struct task_struct *tsk)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+	if (tsk->flags & PF_EXITING) /* we're dying, tg could be about to vanish */
+		tg = &root_task_group;
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8112,6 +8115,19 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+static void cpu_cgroup_exit(struct task_struct *task)
+{
+	/*
+	 * cgroup_exit() is called in the copy_process() failure path.
+	 * Ignore this case since the task hasn't ran yet, this avoids
+	 * trying to poke a half freed task state from generic code.
+	 */
+	if (!(task->flags & PF_EXITING))
+		return;
+
+	sched_move_task(task);
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -8443,6 +8459,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
+	.exit           = cpu_cgroup_exit,
 	.legacy_cftypes	= cpu_files,
 	.early_init	= 1,
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3313052..163b829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -285,7 +285,9 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->on_list) {
+	WARN_ON(cfs_rq->on_list == -1);
+
+	if (cfs_rq->on_list == 0) {
 		/*
 		 * Ensure we either appear before our parent (if already
 		 * enqueued) or force our parent to appear after us when it is
@@ -302,15 +304,17 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		}
 
 		cfs_rq->on_list = 1;
+		trace_printk("%p\n", cfs_rq);
 	}
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	if (cfs_rq->on_list) {
+	if (cfs_rq->on_list == 1) {
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
-		cfs_rq->on_list = 0;
+		trace_printk("%p\n", cfs_rq);
 	}
+	cfs_rq->on_list = -1;
 }
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
@@ -8356,9 +8360,10 @@ void unregister_fair_sched_group(struct task_group *tg)
 		/*
 		 * Only empty task groups can be destroyed; so we can speculatively
 		 * check on_list without danger of it being re-added.
-		 */
+		 *
 		if (!tg->cfs_rq[cpu]->on_list)
 			continue;
+		 */
 
 		rq = cpu_rq(cpu);
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-14 11:20 ` Peter Zijlstra
@ 2016-03-14 12:09   ` Peter Zijlstra
  2016-03-16 14:24     ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-03-14 12:09 UTC (permalink / raw)
  To: Kazuki Yamaguchi; +Cc: Tejun Heo, Niklas Cassel, linux-kernel

On Mon, Mar 14, 2016 at 12:20:57PM +0100, Peter Zijlstra wrote:
> So I would suggest TJ to revert that patch and queue it for stable.
> 
> It it clearly borken, because cgroup_exit() is called from preemptible
> context, so _obviously_ we can (and clearly will) schedule after that,
> which is somewhat hard if the cgroup we're supposedly belonging to has
> been torn to shreds in the meantime.

And I think it might be fundamentally broken, because it leaves ->css
set to whatever cgroup we had, while simultaneously allowing that css to
go away.

This means that anything trying to use this pointer; and there's quite a
lot of that; is in for a nasty surprise.

So you really need to change the ->css, either when the task starts
dying (like it used to), or otherwise right before the cgroup goes
offline.

The argument used was that people want to see resources consumed by
Zombies, which is all fine and dandy, but when you destroy the cgroup
you cannot see that anyway.

So something needs to fundamentally ensure that ->css changes before we
go offline the thing.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-14 12:09   ` Peter Zijlstra
@ 2016-03-16 14:24     ` Tejun Heo
  2016-03-16 14:44       ` Tejun Heo
  2016-03-16 15:22       ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Tejun Heo @ 2016-03-16 14:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

Hello, Peter.

Sorry about the delay.

On Mon, Mar 14, 2016 at 01:09:03PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 14, 2016 at 12:20:57PM +0100, Peter Zijlstra wrote:
> > So I would suggest TJ to revert that patch and queue it for stable.
> > 
> > It it clearly borken, because cgroup_exit() is called from preemptible
> > context, so _obviously_ we can (and clearly will) schedule after that,
> > which is somewhat hard if the cgroup we're supposedly belonging to has
> > been torn to shreds in the meantime.
> 
> And I think it might be fundamentally broken, because it leaves ->css
> set to whatever cgroup we had, while simultaneously allowing that css to
> go away.

So, the problem here is that cpu is using css_offline to tear down the
css.  perf shouldn't have the same problem as it destroys its css from
css_free.  The distinctions among different callbacks evolved over
time and we need to update the documentation but here are the rules.

css_alloc()

	This one is obvious.  Alloc and init.  The css will become
	visible during css iteration sometime between alloc and
	online.

css_online()

	The css is now guaranteed to be visible for css_for_each*()
	iterations.  This distinction exists because some controllers
	need to propagate state changes downwards requiring a new css
	to become visible before it inherits the current state from
	the parent.  Conversely, there's no reason to use this
	callback if there's no such requirement.

	Ex: Freezer which propagates the target state downwards and
	needs a new child to inherit the current state while
	iteratable.

css_offline()

	The css is going down and draining usages by refusing new
	css_tryget_online()'s but still guaranteed to be visible
	during css iterations.  Controllers which explicitly needs to
	put, say, cache references or need to perform cleanup
	operations while the css is iteratable need to use this
	method; otherwise, no reason to bother with it.

	Ex: blkcg pins csses as part of lookup cache which can prevent
	a css from being drained and released, so blkcg uses this call
	back to disable caching for the css.

css_released()

	The css is drained and not visible during iteration and will
	be freed after a RCU grace period.  Used by controllers which
	cache pointers to csses being drained.

	Ex: memcg needs to iterate csses which are being drained and
	its custom iterator implementation caches RCU protected
	pointers to csses.  memcg uses this callback to shoot down
	those cached pointers.

css_free()

	The css is drained, not visible to iterations, not used by
	anyone, and will be freed immediately.

So, controllers which don't have persistent states or synchronization
requirements around state propagation have no reason to bother with
all the rest.  css_alloc() and css_free() are the right callbacks to
init and tear down csses.  If a controller has specific needs to
propagate states, only the part of operations which are affected
should be in the respective specialized init/exit methods.

> This means that anything trying to use this pointer; and there's quite a
> lot of that; is in for a nasty surprise.
> 
> So you really need to change the ->css, either when the task starts
> dying (like it used to), or otherwise right before the cgroup goes
> offline.

So, the rule is that the css should stay serviceable until everyone
depending on it is gone.

> The argument used was that people want to see resources consumed by
> Zombies, which is all fine and dandy, but when you destroy the cgroup
> you cannot see that anyway.
> 
> So something needs to fundamentally ensure that ->css changes before we
> go offline the thing.

I could be mistaken but AFAICS there doesn't seem to be anything
requiring bothering with the more specialized exit methods.  Given
that no css iteration is used and everything is lock protected, the
patch at the end of this messages should do and seems to work fine
here.  Am I missing something?

Thanks.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f5abc6..98d019d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8332,14 +8332,8 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
 	sched_offline_group(tg);
+	sched_destroy_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8701,7 +8695,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 14:24     ` Tejun Heo
@ 2016-03-16 14:44       ` Tejun Heo
  2016-03-16 15:22       ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2016-03-16 14:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

Hello, again.

On Wed, Mar 16, 2016 at 07:24:14AM -0700, Tejun Heo wrote:
> I could be mistaken but AFAICS there doesn't seem to be anything
> requiring bothering with the more specialized exit methods.  Given
> that no css iteration is used and everything is lock protected, the

Ooh, missed the rcu protected tg list, so the right shutdown sequence
would be the following instead where css_released() takes the tg off
the internal RCU protected lists after all usages are drained and
css_free() frees it.

Thanks.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f5abc6..6d58e8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8335,7 +8335,7 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 	sched_destroy_group(tg);
 }
 
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
@@ -8701,7 +8701,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
+	.css_released	= cpu_cgroup_css_released,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 14:24     ` Tejun Heo
  2016-03-16 14:44       ` Tejun Heo
@ 2016-03-16 15:22       ` Peter Zijlstra
  2016-03-16 16:50         ` Tejun Heo
                           ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Peter Zijlstra @ 2016-03-16 15:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

On Wed, Mar 16, 2016 at 07:24:14AM -0700, Tejun Heo wrote:
> So, the problem here is that cpu is using css_offline to tear down the
> css.  perf shouldn't have the same problem as it destroys its css from
> css_free.  The distinctions among different callbacks evolved over
> time and we need to update the documentation but here are the rules.
> 
> css_alloc()
> 
> 	This one is obvious.  Alloc and init.  The css will become
> 	visible during css iteration sometime between alloc and
> 	online.
> 
> css_online()
> 
> 	The css is now guaranteed to be visible for css_for_each*()
> 	iterations.  This distinction exists because some controllers
> 	need to propagate state changes downwards requiring a new css
> 	to become visible before it inherits the current state from
> 	the parent.  Conversely, there's no reason to use this
> 	callback if there's no such requirement.
> 
> 	Ex: Freezer which propagates the target state downwards and
> 	needs a new child to inherit the current state while
> 	iteratable.

So it looks like sched uses css_online() for no particular reason
either, I've moved all that into css_alloc().

> 
> css_offline()
> 
> 	The css is going down and draining usages by refusing new
> 	css_tryget_online()'s but still guaranteed to be visible
> 	during css iterations.  Controllers which explicitly needs to
> 	put, say, cache references or need to perform cleanup
> 	operations while the css is iteratable need to use this
> 	method; otherwise, no reason to bother with it.
> 
> 	Ex: blkcg pins csses as part of lookup cache which can prevent
> 	a css from being drained and released, so blkcg uses this call
> 	back to disable caching for the css.
> 
> css_released()
> 
> 	The css is drained and not visible during iteration and will
> 	be freed after a RCU grace period.  Used by controllers which
> 	cache pointers to csses being drained.
> 
> 	Ex: memcg needs to iterate csses which are being drained and
> 	its custom iterator implementation caches RCU protected
> 	pointers to csses.  memcg uses this callback to shoot down
> 	those cached pointers.
> 
> css_free()
> 
> 	The css is drained, not visible to iterations, not used by
> 	anyone, and will be freed immediately.

None of that speaks of where Zombies live, am I to infer that Zombies
pass css_offline() but stall css_released() ?

I don't particularly care about iterating css bits, but I do need my
parent group to still exist, this is now also guaranteed for
css_release(), right? The above documentation also doesn't mention this;
in particular I require that css_release() for any group is not called
before the css_release() of any child group.

The reason I'm interested in css_release() is that there is an RCU grace
period between that and css_free(), so I can kill my call_rcu() from
css_free().

> > So something needs to fundamentally ensure that ->css changes before we
> > go offline the thing.
> 
> I could be mistaken but AFAICS there doesn't seem to be anything
> requiring bothering with the more specialized exit methods.  Given
> that no css iteration is used and everything is lock protected, the
> patch at the end of this messages should do and seems to work fine
> here.  Am I missing something?

Some additional cleanups maybe.. but yes, that seems to work.

Thanks!

---
Subject: sched: Fix/cleanup cgroup teardown/init

The cpu controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit 2e91fa7f6d45
("cgroup: keep zombies associated with their original cgroups") caused
it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Cc: stable@vger.kernel.org
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04a04e0378fe..bc5076fd0167 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7538,7 +7538,7 @@ void set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
-static void free_sched_group(struct task_group *tg)
+static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
@@ -7564,7 +7564,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	return tg;
 
 err:
-	free_sched_group(tg);
+	sched_free_group(tg);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -7584,17 +7584,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void free_sched_group_rcu(struct rcu_head *rhp)
+static void sched_free_group_rcu(struct rcu_head *rhp)
 {
 	/* now it should be safe to free those cfs_rqs */
-	free_sched_group(container_of(rhp, struct task_group, rcu));
+	sched_free_group(container_of(rhp, struct task_group, rcu));
 }
 
-/* Destroy runqueue etc associated with a task group */
 void sched_destroy_group(struct task_group *tg)
 {
 	/* wait for possible concurrent references to cfs_rqs complete */
-	call_rcu(&tg->rcu, free_sched_group_rcu);
+	call_rcu(&tg->rcu, sched_free_group_rcu);
 }
 
 void sched_offline_group(struct task_group *tg)
@@ -8053,31 +8052,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
+	sched_online_group(tg, parent);
+
 	return &tg->css;
 }
 
-static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
-	struct task_group *parent = css_tg(css->parent);
 
-	if (parent)
-		sched_online_group(tg, parent);
-	return 0;
+	sched_offline_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
-	sched_offline_group(tg);
+	/*
+	 * Relies on the RCU grace period between css_released() and this.
+	 */
+	sched_free_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8437,9 +8431,8 @@ static struct cftype cpu_files[] = {
 
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
+	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
-	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 15:22       ` Peter Zijlstra
@ 2016-03-16 16:50         ` Tejun Heo
  2016-03-16 17:04           ` Peter Zijlstra
  2016-03-17  8:29         ` Niklas Cassel
  2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2016-03-16 16:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

Hello, Peter.

On Wed, Mar 16, 2016 at 04:22:45PM +0100, Peter Zijlstra wrote:
> > css_online()
> > 
> > 	The css is now guaranteed to be visible for css_for_each*()
> > 	iterations.  This distinction exists because some controllers
> > 	need to propagate state changes downwards requiring a new css
> > 	to become visible before it inherits the current state from
> > 	the parent.  Conversely, there's no reason to use this
> > 	callback if there's no such requirement.
> > 
> > 	Ex: Freezer which propagates the target state downwards and
> > 	needs a new child to inherit the current state while
> > 	iteratable.
> 
> So it looks like sched uses css_online() for no particular reason
> either, I've moved all that into css_alloc().

The parings are alloc <-> free, and online <-> offline,released, so if
you do some part of shutdown in either offline or released, it
probably makes sense to the counterpart of init in online.

> None of that speaks of where Zombies live, am I to infer that Zombies
> pass css_offline() but stall css_released() ?

Yeap, zombies may remain attached to the css before css_released().

> I don't particularly care about iterating css bits, but I do need my
> parent group to still exist, this is now also guaranteed for
> css_release(), right? The above documentation also doesn't mention this;

Yeah, if you do your custom rcu protected data structures which needs
to be accessible after offline, the rules would be the same as
requiring css iteration in the same way, so css_released() would be
the right callback to use.

> in particular I require that css_release() for any group is not called
> before the css_release() of any child group.

That is guaranteed now.

>  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct task_group *tg = css_tg(css);
>  
> -	sched_destroy_group(tg);
> -}
> -
> -static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
> -{
> -	struct task_group *tg = css_tg(css);
> -
> -	sched_offline_group(tg);
> +	/*
> +	 * Relies on the RCU grace period between css_released() and this.
> +	 */
> +	sched_free_group(tg);
>  }

Hmmm... I don't think it'd be safe to merge the two ops.  Nothing
guarantees that the RCU callback of cpu controller is called after the
cgroup core one and cgroup core one would do use-after-free.  Just
changing offline to released should do.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 16:50         ` Tejun Heo
@ 2016-03-16 17:04           ` Peter Zijlstra
  2016-03-16 17:49             ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-03-16 17:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

On Wed, Mar 16, 2016 at 09:50:06AM -0700, Tejun Heo wrote:
> >  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> >  {
> >  	struct task_group *tg = css_tg(css);
> >  
> > +	/*
> > +	 * Relies on the RCU grace period between css_released() and this.
> > +	 */
> > +	sched_free_group(tg);
> >  }
> 
> Hmmm... I don't think it'd be safe to merge the two ops.  Nothing
> guarantees that the RCU callback of cpu controller is called after the
> cgroup core one and cgroup core one would do use-after-free.  Just
> changing offline to released should do.

I'm confused, the code looks like:

static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	sched_offline_group(tg);
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	/*
	 * Relies on the RCU grace period between css_release() and this.
	 */
	sched_free_group(tg);
}


css_released(): sched_offline_group() takes everything down and does
                list_del_rcu() etc..

css_free(): does just a kfree() of bits, no RCU no nothing, relying
	    instead on the fact that there is an RCU GP between
	    css_released() and css_free().


This is not correct?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 17:04           ` Peter Zijlstra
@ 2016-03-16 17:49             ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2016-03-16 17:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

On Wed, Mar 16, 2016 at 06:04:56PM +0100, Peter Zijlstra wrote:
> > Hmmm... I don't think it'd be safe to merge the two ops.  Nothing
> > guarantees that the RCU callback of cpu controller is called after the
> > cgroup core one and cgroup core one would do use-after-free.  Just
> > changing offline to released should do.
> 
> I'm confused, the code looks like:
> 
> static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
> {
> 	struct task_group *tg = css_tg(css);
> 
> 	sched_offline_group(tg);
> }
> 
> static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> {
> 	struct task_group *tg = css_tg(css);
> 
> 	/*
> 	 * Relies on the RCU grace period between css_release() and this.
> 	 */
> 	sched_free_group(tg);
> }

Oops, misread the two functions swapping positions as getting merged.
Yes, that is correct.  Sorry about the confusion.  Please feel free to
add

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 15:22       ` Peter Zijlstra
  2016-03-16 16:50         ` Tejun Heo
@ 2016-03-17  8:29         ` Niklas Cassel
  2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Niklas Cassel @ 2016-03-17  8:29 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo; +Cc: Kazuki Yamaguchi, linux-kernel

On 03/16/2016 04:22 PM, Peter Zijlstra wrote:
> Subject: sched: Fix/cleanup cgroup teardown/init
> 
> The cpu controller hasn't kept up with the various changes in the whole
> cgroup initialization / destruction sequence, and commit 2e91fa7f6d45
> ("cgroup: keep zombies associated with their original cgroups") caused
> it to explode.
> 
> The reason for this is that zombies do not inhibit css_offline() from
> being called, but do stall css_released(). Now we tear down the cfs_rq
> structures on css_offline() but zombies can run after that, leading to
> use-after-free issues.
> 
> The solution is to move the tear-down to css_released(), which
> guarantees nobody (including no zombies) is still using our cgroup.
> 
> Furthermore, a few simple cleanups are possible too. There doesn't
> appear to be any point to us using css_online() (anymore?) so fold that
> in css_alloc().
> 
> And since cgroup code guarantees an RCU grace period between
> css_released() and css_free() we can forgo using call_rcu() and free the
> stuff immediately.
> 
> Cc: stable@vger.kernel.org
> Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Survived 500 reboots. Without the patch, I've never gone past 84 reboots.

Tested-by: Niklas Cassel <niklas.cassel@axis.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-03-16 15:22       ` Peter Zijlstra
  2016-03-16 16:50         ` Tejun Heo
  2016-03-17  8:29         ` Niklas Cassel
@ 2016-03-21 11:15         ` tip-bot for Peter Zijlstra
  2016-04-28 18:40           ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-03-21 11:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, niklas.cassel, peterz, k, hpa, tj, tglx, mingo,
	linux-kernel

Commit-ID:  2f5177f0fd7e531b26d54633be62d1d4cb94621c
Gitweb:     http://git.kernel.org/tip/2f5177f0fd7e531b26d54633be62d1d4cb94621c
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Mar 2016 16:22:45 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 21 Mar 2016 10:49:23 +0100

sched/cgroup: Fix/cleanup cgroup teardown/init

The CPU controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit:

  2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")

caused it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Suggested-by: Tejun Heo <tj@kernel.org>
Reported-by: Kazuki Yamaguchi <k@rhe.jp>
Reported-by: Niklas Cassel <niklas.cassel@axis.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d872e1..2a87bdd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7536,7 +7536,7 @@ void set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
-static void free_sched_group(struct task_group *tg)
+static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
@@ -7562,7 +7562,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	return tg;
 
 err:
-	free_sched_group(tg);
+	sched_free_group(tg);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -7582,17 +7582,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void free_sched_group_rcu(struct rcu_head *rhp)
+static void sched_free_group_rcu(struct rcu_head *rhp)
 {
 	/* now it should be safe to free those cfs_rqs */
-	free_sched_group(container_of(rhp, struct task_group, rcu));
+	sched_free_group(container_of(rhp, struct task_group, rcu));
 }
 
-/* Destroy runqueue etc associated with a task group */
 void sched_destroy_group(struct task_group *tg)
 {
 	/* wait for possible concurrent references to cfs_rqs complete */
-	call_rcu(&tg->rcu, free_sched_group_rcu);
+	call_rcu(&tg->rcu, sched_free_group_rcu);
 }
 
 void sched_offline_group(struct task_group *tg)
@@ -8051,31 +8050,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
+	sched_online_group(tg, parent);
+
 	return &tg->css;
 }
 
-static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
-	struct task_group *parent = css_tg(css->parent);
 
-	if (parent)
-		sched_online_group(tg, parent);
-	return 0;
+	sched_offline_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
-	sched_offline_group(tg);
+	/*
+	 * Relies on the RCU grace period between css_released() and this.
+	 */
+	sched_free_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8435,9 +8429,8 @@ static struct cftype cpu_files[] = {
 
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
+	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
-	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
@ 2016-04-28 18:40           ` Peter Zijlstra
  2016-04-28 18:51             ` Greg Kroah-Hartman
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-04-28 18:40 UTC (permalink / raw)
  To: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	Greg Kroah-Hartman
  Cc: linux-tip-commits, mjg59


Greg,

It looks like the below patch missed 4.5 and I'm starting to get bug
reports that look very much like this issue, could we get this patch
lifted into 4.5-stable?

On Mon, Mar 21, 2016 at 04:15:41AM -0700, tip-bot for Peter Zijlstra wrote:
> Commit-ID:  2f5177f0fd7e531b26d54633be62d1d4cb94621c
> Gitweb:     http://git.kernel.org/tip/2f5177f0fd7e531b26d54633be62d1d4cb94621c
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Wed, 16 Mar 2016 16:22:45 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Mon, 21 Mar 2016 10:49:23 +0100
> 
> sched/cgroup: Fix/cleanup cgroup teardown/init
> 
> The CPU controller hasn't kept up with the various changes in the whole
> cgroup initialization / destruction sequence, and commit:
> 
>   2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
> 
> caused it to explode.
> 
> The reason for this is that zombies do not inhibit css_offline() from
> being called, but do stall css_released(). Now we tear down the cfs_rq
> structures on css_offline() but zombies can run after that, leading to
> use-after-free issues.
> 
> The solution is to move the tear-down to css_released(), which
> guarantees nobody (including no zombies) is still using our cgroup.
> 
> Furthermore, a few simple cleanups are possible too. There doesn't
> appear to be any point to us using css_online() (anymore?) so fold that
> in css_alloc().
> 
> And since cgroup code guarantees an RCU grace period between
> css_released() and css_free() we can forgo using call_rcu() and free the
> stuff immediately.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Reported-by: Kazuki Yamaguchi <k@rhe.jp>
> Reported-by: Niklas Cassel <niklas.cassel@axis.com>
> Tested-by: Niklas Cassel <niklas.cassel@axis.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Acked-by: Tejun Heo <tj@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
> Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/core.c | 35 ++++++++++++++---------------------
>  1 file changed, 14 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d872e1..2a87bdd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7536,7 +7536,7 @@ void set_curr_task(int cpu, struct task_struct *p)
>  /* task_group_lock serializes the addition/removal of task groups */
>  static DEFINE_SPINLOCK(task_group_lock);
>  
> -static void free_sched_group(struct task_group *tg)
> +static void sched_free_group(struct task_group *tg)
>  {
>  	free_fair_sched_group(tg);
>  	free_rt_sched_group(tg);
> @@ -7562,7 +7562,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>  	return tg;
>  
>  err:
> -	free_sched_group(tg);
> +	sched_free_group(tg);
>  	return ERR_PTR(-ENOMEM);
>  }
>  
> @@ -7582,17 +7582,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
>  }
>  
>  /* rcu callback to free various structures associated with a task group */
> -static void free_sched_group_rcu(struct rcu_head *rhp)
> +static void sched_free_group_rcu(struct rcu_head *rhp)
>  {
>  	/* now it should be safe to free those cfs_rqs */
> -	free_sched_group(container_of(rhp, struct task_group, rcu));
> +	sched_free_group(container_of(rhp, struct task_group, rcu));
>  }
>  
> -/* Destroy runqueue etc associated with a task group */
>  void sched_destroy_group(struct task_group *tg)
>  {
>  	/* wait for possible concurrent references to cfs_rqs complete */
> -	call_rcu(&tg->rcu, free_sched_group_rcu);
> +	call_rcu(&tg->rcu, sched_free_group_rcu);
>  }
>  
>  void sched_offline_group(struct task_group *tg)
> @@ -8051,31 +8050,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	if (IS_ERR(tg))
>  		return ERR_PTR(-ENOMEM);
>  
> +	sched_online_group(tg, parent);
> +
>  	return &tg->css;
>  }
>  
> -static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
> +static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
>  {
>  	struct task_group *tg = css_tg(css);
> -	struct task_group *parent = css_tg(css->parent);
>  
> -	if (parent)
> -		sched_online_group(tg, parent);
> -	return 0;
> +	sched_offline_group(tg);
>  }
>  
>  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct task_group *tg = css_tg(css);
>  
> -	sched_destroy_group(tg);
> -}
> -
> -static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
> -{
> -	struct task_group *tg = css_tg(css);
> -
> -	sched_offline_group(tg);
> +	/*
> +	 * Relies on the RCU grace period between css_released() and this.
> +	 */
> +	sched_free_group(tg);
>  }
>  
>  static void cpu_cgroup_fork(struct task_struct *task)
> @@ -8435,9 +8429,8 @@ static struct cftype cpu_files[] = {
>  
>  struct cgroup_subsys cpu_cgrp_subsys = {
>  	.css_alloc	= cpu_cgroup_css_alloc,
> +	.css_released	= cpu_cgroup_css_released,
>  	.css_free	= cpu_cgroup_css_free,
> -	.css_online	= cpu_cgroup_css_online,
> -	.css_offline	= cpu_cgroup_css_offline,
>  	.fork		= cpu_cgroup_fork,
>  	.can_attach	= cpu_cgroup_can_attach,
>  	.attach		= cpu_cgroup_attach,

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-04-28 18:40           ` Peter Zijlstra
@ 2016-04-28 18:51             ` Greg Kroah-Hartman
  2016-04-28 21:36               ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Greg Kroah-Hartman @ 2016-04-28 18:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	linux-tip-commits, mjg59

On Thu, Apr 28, 2016 at 08:40:32PM +0200, Peter Zijlstra wrote:
> 
> Greg,
> 
> It looks like the below patch missed 4.5 and I'm starting to get bug
> reports that look very much like this issue, could we get this patch
> lifted into 4.5-stable?

Sure, also added to 4.4-stable as that is where the bug for this was
fixed, according to the commit message.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-04-28 18:51             ` Greg Kroah-Hartman
@ 2016-04-28 21:36               ` Peter Zijlstra
  2016-05-02  3:06                 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-04-28 21:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	linux-tip-commits, mjg59

On Thu, Apr 28, 2016 at 11:51:37AM -0700, Greg Kroah-Hartman wrote:
> On Thu, Apr 28, 2016 at 08:40:32PM +0200, Peter Zijlstra wrote:
> > 
> > Greg,
> > 
> > It looks like the below patch missed 4.5 and I'm starting to get bug
> > reports that look very much like this issue, could we get this patch
> > lifted into 4.5-stable?
> 
> Sure, also added to 4.4-stable as that is where the bug for this was
> fixed, according to the commit message.

I think this patch relies on the following two cgroup patches:

8bb5ef79bc0f ("cgroup: make sure a parent css isn't freed before its children")
aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")

which are in 4.5, but I've not checked 4.4-stable

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-04-28 21:36               ` Peter Zijlstra
@ 2016-05-02  3:06                 ` Greg Kroah-Hartman
  0 siblings, 0 replies; 21+ messages in thread
From: Greg Kroah-Hartman @ 2016-05-02  3:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	linux-tip-commits, mjg59

On Thu, Apr 28, 2016 at 11:36:09PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 28, 2016 at 11:51:37AM -0700, Greg Kroah-Hartman wrote:
> > On Thu, Apr 28, 2016 at 08:40:32PM +0200, Peter Zijlstra wrote:
> > > 
> > > Greg,
> > > 
> > > It looks like the below patch missed 4.5 and I'm starting to get bug
> > > reports that look very much like this issue, could we get this patch
> > > lifted into 4.5-stable?
> > 
> > Sure, also added to 4.4-stable as that is where the bug for this was
> > fixed, according to the commit message.
> 
> I think this patch relies on the following two cgroup patches:
> 
> 8bb5ef79bc0f ("cgroup: make sure a parent css isn't freed before its children")

This was missing.

> aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")

This was already in 4.4-stable, thanks for letting me know.

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [BUG] sched: leaf_cfs_rq_list use after free
@ 2016-03-04 10:41 Niklas Cassel
  2016-03-10 12:54 ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Niklas Cassel @ 2016-03-04 10:41 UTC (permalink / raw)
  To: tj, peterz, linux-kernel@vger.kernel.org

Hello

I've stumbled upon a use after free bug related to
CONFIG_FAIR_GROUP_SCHED / rq->cfs_rq->leaf_cfs_rq_list in v4.4.


Normally, a cfs_rq is immediately removed from the leaf_cfs_rq_list
and cfs_rq->onlist is set to 0, then the cfs_rq is freed at a later
time by call_rcu(&tg->rcu, free_sched_group_rcu).


What happens when we crash is that a cfs_rq is immediately removed
from the leaf_cfs_rq_list and cfs_rq->onlist is set to 0, however
then the cfs_rq is readded to the list, cfs_rq->onlist gets set to 1,
then comes the call to call_rcu(&tg->rcu, free_sched_group_rcu).

Now the cfs_rq is freed, filled with 0x6b6b6b6b by SLUB_DEBUG,
and still on the leaf_cfs_rq_list. Since the cfs_rq is still on
the list, the next call to update_blocked_averages will iterate
the list and will try to access members of the cfs_rq object,
an object which has already been freed.



[   27.531374] Unable to handle kernel paging request at virtual address 6b6b706b
[   27.538596] pgd = 8cea8000
[   27.541295] [6b6b706b] *pgd=00000000
[   27.544870] Internal error: Oops: 1 [#1] PREEMPT SMP ARM
[   27.564025] CPU: 1 PID: 1252 Comm: logger Tainted: G           O    4.4.0 #2
[   27.571064] Hardware name: Axis ARTPEC-6 Platform
[   27.575759] task: b9586540 ti: 8c84c000 task.ti: 8c84c000
[   27.581155] PC is at update_blocked_averages+0xcc/0x748
[   27.586372] LR is at update_blocked_averages+0xbc/0x748
[   27.591589] pc : [<80051d78>]    lr : [<80051d68>]    psr: 200c0193
               sp : 8c84dce8  ip : 00000500  fp : 8efb1680
[   27.603056] r10: 00000006  r9 : 80847788  r8 : 6b6b6b6b
[   27.608271] r7 : 00000007  r6 : ffff958a  r5 : 00000007  r4 : ffff958a
[   27.614789] r3 : 6b6b6b6b  r2 : 00000101  r1 : 00000000  r0 : 00000003
[   27.621308] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   27.628521] Control: 10c5387d  Table: 0cea804a  DAC: 00000055
[   27.634257] Process logger (pid: 1252, stack limit = 0x8c84c210)
[   27.640254] Stack: (0x8c84dce8 to 0x8c84e000)
[   27.644604] dce0:                   6b6b6b6b 00000103 bad39440 80048250 00000000 bad398d0
[   27.652774] dd00: bf6cf0d0 00000001 807e2c48 bad398d0 00000000 8054e7c8 ffff4582 bf6cec00
[   27.660944] dd20: 00000001 8004825c 00000100 807dc400 8c84de40 bf6cb340 bad87ebc 00000100
[   27.669114] dd40: afb50401 200c0113 00000200 807dc400 807e2100 ffff958a 00000007 8083916c
[   27.677283] dd60: 00000100 00000006 0000001c 80058748 bf6cb340 8054e810 00000000 00000001
[   27.685452] dd80: 807dc400 bf6cec00 00000001 bf6cec00 8083916c 00000001 c0803100 807dc400
[   27.693622] dda0: 807e209c 000000a0 00000007 8083916c 00000100 00000006 0000001c 800282a0
[   27.701791] ddc0: 00000001 bf6d2a80 b95fac00 0000000a ffff958b 00400000 bacf7000 807dc400
[   27.709961] dde0: 00000000 00000000 0000001b bf0188c0 00000001 c0803100 b95fac00 80028830
[   27.718130] de00: 807dc400 8006ca14 c0802100 c080210c 807e2db0 8081a140 8c84de40 80009420
[   27.726300] de20: 8054e780 80122048 800c0013 ffffffff 8c84de74 00000001 00100073 800142c0
[   27.734469] de40: b95ace70 b9586540 00000000 00000000 600c0013 00000000 024080c0 8010e8e0
[   27.742639] de60: 00000001 00000001 00100073 b95fac00 00000000 8c84de90 8054e780 80122048
[   27.750808] de80: 800c0013 ffffffff b95fac00 80122044 bad00640 8011c418 000001f6 b95acb70
[   27.758978] dea0: 76f42000 b95acb70 76f42000 b95acb68 76f43000 8ce48780 00100073 8010e8e0
[   27.767148] dec0: 00100073 00000000 b95fac00 00000000 00000000 00000001 b9421000 00000001
[   27.775317] dee0: 00000000 76f46000 00000000 00000000 8001e8b8 76f42000 00000003 00000003
[   27.783486] df00: b95fac00 8ce48780 00000001 00001000 807e2c64 8010efb4 00000000 00000000
[   27.791656] df20: 0000004d 00000073 8c84df50 8ce487c4 b95fac00 00000003 00000013 00000000
[   27.799825] df40: 8c84c000 b95fac00 7ece0b44 800faf84 00000002 00000000 00000000 8c84df64
[   27.807995] df60: b95fac00 00000000 00000002 00000003 00000013 00000000 00000000 8010d4e8
[   27.816163] df80: 00000002 00000000 00000003 00000003 00000000 00000003 000000c0 800104e4
[   27.824333] dfa0: 00000020 800104b0 00000003 00000000 00000000 00000013 00000003 00000002
[   27.832502] dfc0: 00000003 00000000 00000003 000000c0 0007ecd0 76f45958 76f45574 7ece0b44
[   27.840671] dfe0: 00000000 7ece09fc 76f2e814 76f368d8 400c0010 00000000 00000000 00000000
[   27.848847] [<80051d78>] (update_blocked_averages) from [<80058748>] (rebalance_domains+0x38/0x2cc)
[   27.857889] [<80058748>] (rebalance_domains) from [<800282a0>] (__do_softirq+0x98/0x354)
[   27.865975] [<800282a0>] (__do_softirq) from [<80028830>] (irq_exit+0xb0/0x11c)
[   27.873281] [<80028830>] (irq_exit) from [<8006ca14>] (__handle_domain_irq+0x60/0xb8)
[   27.881106] [<8006ca14>] (__handle_domain_irq) from [<80009420>] (gic_handle_irq+0x48/0x94)
[   27.889452] [<80009420>] (gic_handle_irq) from [<800142c0>] (__irq_svc+0x40/0x74)
[   27.896924] Exception stack(0x8c84de40 to 0x8c84de88)
[   27.901969] de40: b95ace70 b9586540 00000000 00000000 600c0013 00000000 024080c0 8010e8e0
[   27.910139] de60: 00000001 00000001 00100073 b95fac00 00000000 8c84de90 8054e780 80122048
[   27.918306] de80: 800c0013 ffffffff
[   27.921793] [<800142c0>] (__irq_svc) from [<80122048>] (__slab_alloc.constprop.9+0x28/0x2c)
[   27.930139] [<80122048>] (__slab_alloc.constprop.9) from [<8011c418>] (kmem_cache_alloc+0x14c/0x204)
[   27.939265] [<8011c418>] (kmem_cache_alloc) from [<8010e8e0>] (mmap_region+0x29c/0x680)
[   27.947262] [<8010e8e0>] (mmap_region) from [<8010efb4>] (do_mmap+0x2f0/0x378)
[   27.954481] [<8010efb4>] (do_mmap) from [<800faf84>] (vm_mmap_pgoff+0x74/0xa4)
[   27.961699] [<800faf84>] (vm_mmap_pgoff) from [<8010d4e8>] (SyS_mmap_pgoff+0x94/0xf0)
[   27.969524] [<8010d4e8>] (SyS_mmap_pgoff) from [<800104b0>] (__sys_trace_return+0x0/0x10)
[   27.977694] Code: e59b8078 e59b309c e3a0cc05 e3580000 (e18300dc) 

A snippet of the trace_printks I've added when analyzing the problem.
The prints show that a certain cfs_rq gets readded after it has been removed,
and that update_blocked_averages uses the cfs_rq which has already been freed:

         systemd-1     [000]    22.664453: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 0
         systemd-1     [000]    22.664479: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 0
         systemd-1     [000]    22.664481: bprint:               cpu_cgroup_css_alloc: tg 0x8efb1800 tg->css.id 0
         systemd-1     [000]    22.664547: bprint:               cpu_cgroup_css_online: tg 0x8efb1800 tg->css.id 80
         systemd-874   [001]    27.389000: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
    migrate_cert-820   [001]    27.421337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
     kworker/0:1-24    [000]    27.421356: bprint:               cpu_cgroup_css_offline: tg 0x8efb1800 tg->css.id 80
     kworker/0:1-24    [000]    27.421445: bprint:               list_del_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
    migrate_cert-820   [001]    27.421506: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
   system-status-815   [001]    27.491358: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
     kworker/0:1-24    [000]    27.501561: bprint:               cpu_cgroup_css_free: tg 0x8efb1800 tg->css.id 80
    migrate_cert-820   [001]    27.511337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
     ksoftirqd/0-3     [000]    27.521830: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 80
     ksoftirqd/0-3     [000]    27.521857: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 80
          logger-1252  [001]    27.531355: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x6b6b6b6b


I've reproduced this on v4.4, but I've also managed to reproduce the bug
after cherry-picking the following patches
(all but one were marked for v4.4 stable):

6fe1f34 sched/cgroup: Fix cgroup entity load tracking tear-down
d6e022f workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
041bd12 Revert "workqueue: make sure delayed work run in local cpu"
8bb5ef7 cgroup: make sure a parent css isn't freed before its children
aa226ff cgroup: make sure a parent css isn't offlined before its children
e93ad19 cpuset: make mm migration asynchronous

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-04 10:41 [BUG] sched: leaf_cfs_rq_list use after free Niklas Cassel
@ 2016-03-10 12:54 ` Peter Zijlstra
  2016-03-11 17:02   ` Niklas Cassel
  2016-03-11 18:20   ` Tejun Heo
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2016-03-10 12:54 UTC (permalink / raw)
  To: Niklas Cassel; +Cc: tj, linux-kernel@vger.kernel.org

On Fri, Mar 04, 2016 at 11:41:17AM +0100, Niklas Cassel wrote:

> A snippet of the trace_printks I've added when analyzing the problem.
> The prints show that a certain cfs_rq gets readded after it has been removed,
> and that update_blocked_averages uses the cfs_rq which has already been freed:
> 
>          systemd-1     [000]    22.664453: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 0
>          systemd-1     [000]    22.664479: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 0
>          systemd-1     [000]    22.664481: bprint:               cpu_cgroup_css_alloc: tg 0x8efb1800 tg->css.id 0
>          systemd-1     [000]    22.664547: bprint:               cpu_cgroup_css_online: tg 0x8efb1800 tg->css.id 80
>          systemd-874   [001]    27.389000: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
>     migrate_cert-820   [001]    27.421337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1

>      kworker/0:1-24    [000]    27.421356: bprint:               cpu_cgroup_css_offline: tg 0x8efb1800 tg->css.id 80

So we take the cgroup offline

>      kworker/0:1-24    [000]    27.421445: bprint:               list_del_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x1

Remove our cfs_rq from the list

>     migrate_cert-820   [001]    27.421506: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0

And stuff it back on again -> *FAIL*

>    system-status-815   [001]    27.491358: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
>      kworker/0:1-24    [000]    27.501561: bprint:               cpu_cgroup_css_free: tg 0x8efb1800 tg->css.id 80
>     migrate_cert-820   [001]    27.511337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
>      ksoftirqd/0-3     [000]    27.521830: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 80
>      ksoftirqd/0-3     [000]    27.521857: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 80
>           logger-1252  [001]    27.531355: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x6b6b6b6b
> 
> 
> I've reproduced this on v4.4, but I've also managed to reproduce the bug
> after cherry-picking the following patches
> (all but one were marked for v4.4 stable):
> 
> 6fe1f34 sched/cgroup: Fix cgroup entity load tracking tear-down
> d6e022f workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
> 041bd12 Revert "workqueue: make sure delayed work run in local cpu"
> 8bb5ef7 cgroup: make sure a parent css isn't freed before its children
> aa226ff cgroup: make sure a parent css isn't offlined before its children
> e93ad19 cpuset: make mm migration asynchronous

Hmm, that is most unfortunate indeed.

Can you describe a reliable reproducer?

So we only call list_add_leaf_cfs_rq() through enqueue_task_fair(),
which means someone is still running inside that cgroup.

TJ, I thought we only call offline when the cgroup is empty, don't we?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-10 12:54 ` Peter Zijlstra
@ 2016-03-11 17:02   ` Niklas Cassel
  2016-03-11 17:28     ` Peter Zijlstra
  2016-03-11 18:20   ` Tejun Heo
  1 sibling, 1 reply; 21+ messages in thread
From: Niklas Cassel @ 2016-03-11 17:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tj, linux-kernel@vger.kernel.org

On 03/10/2016 01:54 PM, Peter Zijlstra wrote:
> On Fri, Mar 04, 2016 at 11:41:17AM +0100, Niklas Cassel wrote:
> 
>> A snippet of the trace_printks I've added when analyzing the problem.
>> The prints show that a certain cfs_rq gets readded after it has been removed,
>> and that update_blocked_averages uses the cfs_rq which has already been freed:
>>
>>          systemd-1     [000]    22.664453: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 0
>>          systemd-1     [000]    22.664479: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 0
>>          systemd-1     [000]    22.664481: bprint:               cpu_cgroup_css_alloc: tg 0x8efb1800 tg->css.id 0
>>          systemd-1     [000]    22.664547: bprint:               cpu_cgroup_css_online: tg 0x8efb1800 tg->css.id 80
>>          systemd-874   [001]    27.389000: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
>>     migrate_cert-820   [001]    27.421337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
> 
>>      kworker/0:1-24    [000]    27.421356: bprint:               cpu_cgroup_css_offline: tg 0x8efb1800 tg->css.id 80
> 
> So we take the cgroup offline
> 
>>      kworker/0:1-24    [000]    27.421445: bprint:               list_del_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
> 
> Remove our cfs_rq from the list
> 
>>     migrate_cert-820   [001]    27.421506: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
> 
> And stuff it back on again -> *FAIL*
> 
>>    system-status-815   [001]    27.491358: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
>>      kworker/0:1-24    [000]    27.501561: bprint:               cpu_cgroup_css_free: tg 0x8efb1800 tg->css.id 80
>>     migrate_cert-820   [001]    27.511337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
>>      ksoftirqd/0-3     [000]    27.521830: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 80
>>      ksoftirqd/0-3     [000]    27.521857: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 80
>>           logger-1252  [001]    27.531355: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x6b6b6b6b
>>
>>
>> I've reproduced this on v4.4, but I've also managed to reproduce the bug
>> after cherry-picking the following patches
>> (all but one were marked for v4.4 stable):
>>
>> 6fe1f34 sched/cgroup: Fix cgroup entity load tracking tear-down
>> d6e022f workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
>> 041bd12 Revert "workqueue: make sure delayed work run in local cpu"
>> 8bb5ef7 cgroup: make sure a parent css isn't freed before its children
>> aa226ff cgroup: make sure a parent css isn't offlined before its children
>> e93ad19 cpuset: make mm migration asynchronous
> 
> Hmm, that is most unfortunate indeed.
> 
> Can you describe a reliable reproducer?

Right now I just do a reboot loop, systemd creates cgroups during boot
this dump took 84 reboots. Harder to reproduce with more trace_printks.

CONFIG_SCHED_AUTOGROUP looks interesting though, might try that next week.

Appears that the call comes from idle_balance or rebalance_domains,
since they are the only ones that call load_balance.

New more detailed trace:


systemd-1 [000] 13.761629: sched_wakeup: ksoftirqd/0:3 [120] success=1 CPU:000
systemd-1 [000] 13.761640: sched_stat_wait: comm=rcu_preempt pid=7 delay=204399 [ns]
systemd-1 [000] 13.761646: sched_switch: systemd:1 [120] R ==> rcu_preempt:7 [120]
rcu_preempt-7 [000] 13.761664: sched_stat_runtime: comm=rcu_preempt pid=7 runtime=39633 [ns] vruntime=20015190792 [ns]
rcu_preempt-7 [000] 13.761669: sched_stat_wait: comm=rcu_sched pid=8 delay=222979 [ns]
rcu_preempt-7 [000] 13.761673: sched_switch: rcu_preempt:7 [120] S ==> rcu_sched:8 [120]
rcu_sched-8 [000] 13.761683: sched_stat_runtime: comm=rcu_sched pid=8 runtime=19129 [ns] vruntime=20015170288 [ns]
rcu_sched-8 [000] 13.761686: sched_stat_wait: comm=kworker/0:3 pid=275 delay=83217 [ns]
rcu_sched-8 [000] 13.761689: sched_switch: rcu_sched:8 [120] S ==> kworker/0:3:275 [120]
kworker/0:3-275 [000] 13.761700: function: cpu_cgroup_css_offline
kworker/0:3-275 [000] 13.761703: bprint: cpu_cgroup_css_offline: tg 0x8f5bb380 tg->css.id 34
kworker/0:3-275 [000] 13.761705: function: sched_offline_group
kworker/0:3-275 [000] 13.761712: bprint: list_del_leaf_cfs_rq: cfs_rq 0xba25e900 tg 0x8f5bb380 cpu 0 on_list 0x1
kworker/0:3-275 [000] 13.761716: bprint: list_del_leaf_cfs_rq: cfs_rq 0xba25ea80 tg 0x8f5bb380 cpu 1 on_list 0x1
kworker/0:3-275 [000] 13.761726: sched_stat_runtime: comm=kworker/0:3 pid=275 runtime=43970 [ns] vruntime=20015195129 [ns]
kworker/0:3-275 [000] 13.761731: sched_stat_wait: comm=ksoftirqd/0 pid=3 delay=102732 [ns]
kworker/0:3-275 [000] 13.761734: sched_switch: kworker/0:3:275 [120] S ==> ksoftirqd/0:3 [120]
ksoftirqd/0-3 [000] 13.762007: sched_process_free: comm=find pid=595 prio=120
artpec6src0:src-590 [001] 13.763084: sched_stat_runtime: comm=artpec6src0:src pid=590 runtime=1698787 [ns] vruntime=428699560 [ns]
artpec6src0:src-590 [001] 13.763094: sched_switch: artpec6src0:src:590 [120] S ==> swapper/1:0 [120]
ksoftirqd/0-3 [000] 13.763220: sched_stat_runtime: comm=ksoftirqd/0 pid=3 runtime=1494541 [ns] vruntime=20016645700 [ns]
ksoftirqd/0-3 [000] 13.763226: sched_stat_wait: comm=systemd-journal pid=187 delay=28585309 [ns]
ksoftirqd/0-3 [000] 13.763232: sched_switch: ksoftirqd/0:3 [120] S ==> systemd-journal:187 [120]
systemd-journal-187 [000] 13.763654: sched_stat_runtime: comm=systemd-journal pid=187 runtime=432768 [ns] vruntime=148478617 [ns]
--
event_switch-581 [001] 13.767390: bprint: update_blocked_averages: cfs_rq 0xbf6cec50 cpu 1 on_list 0x1
event_switch-581 [001] 13.767392: bprint: update_blocked_averages: cfs_rq 0xba22b980 cpu 1 on_list 0x1
event_switch-581 [001] 13.767393: bprint: update_blocked_averages: cfs_rq 0x8f502a80 cpu 1 on_list 0x1
event_switch-581 [001] 13.767395: bprint: update_blocked_averages: cfs_rq 0x8f502480 cpu 1 on_list 0x1
event_switch-581 [001] 13.767405: sched_stat_wait: comm=net.final pid=593 delay=41272110 [ns]
event_switch-581 [001] 13.767413: sched_migrate_task: comm=net.final pid=593 prio=120 orig_cpu=0 dest_cpu=1
event_switch-581 [001] 13.767416: bprint: load_balance: load_balance
event_switch-581 [001] 13.767417: bprint: load_balance: attach_tasks
event_switch-581 [001] 13.767418: bprint: attach_task: attach_task
event_switch-581 [001] 13.767419: bprint: activate_task: activate_task
event_switch-581 [001] 13.767421: bprint: enqueue_task_fair: cfs_rq 0xba25ea80 tg 0x8f5bb380
event_switch-581 [001] 13.767425: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba25ea80 tg 0x8f5bb380 cpu 1 on_list 0x0
event_switch-581 [001] 13.767427: bprint: enqueue_task_fair: cfs_rq 0xba25fb00 tg 0xba25e480
event_switch-581 [001] 13.767430: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba25fb00 tg 0xba25e480 cpu 1 on_list 0x1
event_switch-581 [001] 13.767431: bprint: enqueue_task_fair: cfs_rq 0xba22a480 tg 0xba22a600
event_switch-581 [001] 13.767434: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba22a480 tg 0xba22a600 cpu 1 on_list 0x1
event_switch-581 [001] 13.767435: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
event_switch-581 [001] 13.767437: bprint: list_add_leaf_cfs_rq: cfs_rq 0xbf6cec50 tg 0x808489c8 cpu 1 on_list 0x1
event_switch-581 [001] 13.767442: sched_stat_wait: comm=net.final pid=593 delay=0 [ns]
event_switch-581 [001] 13.767445: sched_switch: event_switch:581 [120] D ==> net.final:593 [120]
systemd-593 [001] 13.767451: sched_stat_runtime: comm=net.final pid=593 runtime=29506 [ns] vruntime=611130957 [ns]
systemd-593 [001] 13.767458: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
systemd-593 [001] 13.767460: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
systemd-593 [001] 13.767461: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
systemd-593 [001] 13.767463: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
systemd-593 [001] 13.767464: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
systemd-593 [001] 13.767465: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
systemd-593 [001] 13.767467: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
systemd-593 [001] 13.767469: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
systemd-593 [001] 13.767470: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
systemd-593 [001] 13.767471: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
systemd-593 [001] 13.767473: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
systemd-593 [001] 13.767490: bprint: update_blocked_averages: cfs_rq 0xba22b980 cpu 1 on_list 0x1
systemd-593 [001] 13.767491: bprint: update_blocked_averages: cfs_rq 0x8f502a80 cpu 1 on_list 0x1
systemd-593 [001] 13.767493: bprint: update_blocked_averages: cfs_rq 0x8f502480 cpu 1 on_list 0x1
systemd-593 [001] 13.767499: sched_stat_wait: comm=find pid=595 delay=45445182 [ns]
systemd-593 [001] 13.767503: sched_stat_runtime: comm=systemd pid=1 runtime=239039 [ns] vruntime=20029165854 [ns]
systemd-593 [001] 13.767505: sched_migrate_task: comm=find pid=595 prio=120 orig_cpu=0 dest_cpu=1
systemd-593 [001] 13.767507: bprint: load_balance: load_balance
systemd-593 [001] 13.767509: bprint: load_balance: attach_tasks
systemd-593 [001] 13.767510: bprint: attach_task: attach_task
systemd-593 [001] 13.767511: bprint: activate_task: activate_task
systemd-593 [001] 13.767512: bprint: enqueue_task_fair: cfs_rq 0xba25ea80 tg 0x8f5bb380
systemd-593 [001] 13.767513: sched_stat_runtime: comm=net.final pid=593 runtime=61988 [ns] vruntime=611192945 [ns]
systemd-593 [001] 13.767517: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba25ea80 tg 0x8f5bb380 cpu 1 on_list 0x1
systemd-593 [001] 13.767518: bprint: enqueue_task_fair: cfs_rq 0xba25fb00 tg 0xba25e480
systemd-593 [001] 13.767521: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba25fb00 tg 0xba25e480 cpu 1 on_list 0x1
systemd-593 [001] 13.767522: bprint: enqueue_task_fair: cfs_rq 0xba22a480 tg 0xba22a600
systemd-593 [001] 13.767524: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba22a480 tg 0xba22a600 cpu 1 on_list 0x1
systemd-593 [001] 13.767525: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
systemd-593 [001] 13.767528: bprint: list_add_leaf_cfs_rq: cfs_rq 0xbf6cec50 tg 0x808489c8 cpu 1 on_list 0x1
systemd-593 [001] 13.767530: sched_stat_wait: comm=find pid=595 delay=0 [ns]
systemd-593 [001] 13.767534: sched_switch: net.final:593 [120] x ==> find:595 [120]
systemd-1 [000] 13.767535: sched_waking: comm=event_switch pid=581 prio=120 target_cpu=001
net.final-595 [001] 13.767540: sched_stat_runtime: comm=find pid=595 runtime=27142 [ns] vruntime=615270938 [ns]
systemd-1 [000] 13.767542: sched_migrate_task: comm=event_switch pid=581 prio=120 orig_cpu=1 dest_cpu=0
systemd-1 [000] 13.767545: bprint: ttwu_do_activate.constprop.27: ttwu_activate
net.final-595 [001] 13.767546: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
systemd-1 [000] 13.767546: bprint: activate_task: activate_task
net.final-595 [001] 13.767548: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
systemd-1 [000] 13.767548: bprint: enqueue_task_fair: cfs_rq 0x8f304180 tg 0x8f305e00
net.final-595 [001] 13.767549: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
systemd-1 [000] 13.767551: sched_stat_blocked: comm=event_switch pid=581 delay=196994 [ns]
net.final-595 [001] 13.767551: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
net.final-595 [001] 13.767552: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
systemd-1 [000] 13.767553: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f304180 tg 0x8f305e00 cpu 0 on_list 0x1
net.final-595 [001] 13.767553: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
systemd-1 [000] 13.767554: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
--
net.final-595 [001] 13.767595: bprint: activate_task: activate_task
net.final-595 [001] 13.767596: bprint: enqueue_task_fair: cfs_rq 0x8f305c80 tg 0x8f305e00
net.final-595 [001] 13.767599: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f305c80 tg 0x8f305e00 cpu 1 on_list 0x1
net.final-595 [001] 13.767600: bprint: enqueue_task_fair: cfs_rq 0xba22a480 tg 0xba22a600
net.final-595 [001] 13.767603: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba22a480 tg 0xba22a600 cpu 1 on_list 0x1
net.final-595 [001] 13.767604: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
net.final-595 [001] 13.767607: bprint: list_add_leaf_cfs_rq: cfs_rq 0xbf6cec50 tg 0x808489c8 cpu 1 on_list 0x1
net.final-595 [001] 13.767610: sched_stat_wait: comm=event_switch pid=581 delay=0 [ns]
net.final-595 [001] 13.767613: sched_switch: find:595 [120] x ==> event_switch:581 [120]
event_switch-581 [001] 13.767802: sched_stat_runtime: comm=event_switch pid=581 runtime=204921 [ns] vruntime=10385020 [ns]
event_switch-581 [001] 13.767812: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
event_switch-581 [001] 13.767815: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
event_switch-581 [001] 13.767816: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
event_switch-581 [001] 13.767818: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
event_switch-581 [001] 13.767819: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
event_switch-581 [001] 13.767821: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
event_switch-581 [001] 13.767822: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.767823: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.767825: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
event_switch-581 [001] 13.767826: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
event_switch-581 [001] 13.767828: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
event_switch-581 [001] 13.768788: bprint: activate_task: activate_task
event_switch-581 [001] 13.768790: bprint: enqueue_task_fair: cfs_rq 0x8f503680 tg 0x8f503500
event_switch-581 [001] 13.768795: sched_stat_sleep: comm=dbus-daemon pid=451 delay=99417790 [ns]
event_switch-581 [001] 13.768798: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f503680 tg 0x8f503500 cpu 0 on_list 0x1
event_switch-581 [001] 13.768799: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
event_switch-581 [001] 13.768802: bprint: list_add_leaf_cfs_rq: cfs_rq 0xba22be00 tg 0xba22a600 cpu 0 on_list 0x1
event_switch-581 [001] 13.768804: bprint: enqueue_task_fair: cfs_rq 0xbf6c2c50 tg 0x808489c8
event_switch-581 [001] 13.768805: sched_stat_runtime: comm=systemd pid=1 runtime=1204444 [ns] vruntime=20030457675 [ns]
event_switch-581 [001] 13.768810: sched_wakeup: dbus-daemon:451 [120] success=1 CPU:000
event_switch-581 [001] 13.769135: sched_stat_runtime: comm=event_switch pid=581 runtime=866705 [ns] vruntime=11335615 [ns]
event_switch-581 [001] 13.769148: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
event_switch-581 [001] 13.769151: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
event_switch-581 [001] 13.769152: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
event_switch-581 [001] 13.769154: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
event_switch-581 [001] 13.769156: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
event_switch-581 [001] 13.769157: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
event_switch-581 [001] 13.769159: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.769160: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.769162: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
event_switch-581 [001] 13.769164: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
event_switch-581 [001] 13.769165: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
dbus-daemon-451 [001] 13.771168: sched_stat_sleep: comm=monolith pid=375 delay=357880493 [ns]
dbus-daemon-451 [001] 13.771172: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f5bbb00 tg 0x8f5bb980 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.771173: bprint: enqueue_task_fair: cfs_rq 0xba22a480 tg 0xba22a600
dbus-daemon-451 [001] 13.771182: sched_wakeup: monolith:375 [120] success=1 CPU:001
dbus-daemon-451 [001] 13.771188: sched_stat_runtime: comm=dbus-daemon pid=451 runtime=947108 [ns] vruntime=61194189 [ns]
dbus-daemon-451 [001] 13.771193: sched_stat_wait: comm=monolith pid=375 delay=0 [ns]
dbus-daemon-451 [001] 13.771197: sched_switch: dbus-daemon:451 [120] R ==> monolith:375 [120]
monolith-375 [001] 13.771384: sched_stat_runtime: comm=monolith pid=375 runtime=219369 [ns] vruntime=422918929 [ns]
systemd-1 [000] 13.771388: sched_stat_runtime: comm=systemd pid=1 runtime=2191233 [ns] vruntime=20033055212 [ns]
systemd-1 [000] 13.771403: sched_waking: comm=rcu_sched pid=8 prio=120 target_cpu=000
monolith-375 [001] 13.771405: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
systemd-1 [000] 13.771407: bprint: ttwu_do_activate.constprop.27: ttwu_activate
monolith-375 [001] 13.771408: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
systemd-1 [000] 13.771408: bprint: activate_task: activate_task
monolith-375 [001] 13.771410: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
systemd-1 [000] 13.771410: bprint: enqueue_task_fair: cfs_rq 0xbf6c2c50 tg 0x808489c8
systemd-1 [000] 13.771411: sched_stat_runtime: comm=systemd pid=1 runtime=22660 [ns] vruntime=20033077872 [ns]
monolith-375 [001] 13.771412: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
monolith-375 [001] 13.771414: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
systemd-1 [000] 13.771414: sched_stat_sleep: comm=rcu_sched pid=8 delay=9728351 [ns]
monolith-375 [001] 13.771415: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
--
monolith-375 [000] 13.794888: sched_waking: comm=dbus-daemon pid=451 prio=120 target_cpu=001
monolith-375 [000] 13.794898: sched_migrate_task: comm=dbus-daemon pid=451 prio=120 orig_cpu=1 dest_cpu=0
monolith-375 [000] 13.794903: bprint: ttwu_do_activate.constprop.27: ttwu_activate
monolith-375 [000] 13.794904: bprint: activate_task: activate_task
monolith-375 [000] 13.794907: bprint: enqueue_task_fair: cfs_rq 0x8f503680 tg 0x8f503500
monolith-375 [000] 13.794911: sched_stat_sleep: comm=dbus-daemon pid=451 delay=299956 [ns]
monolith-375 [000] 13.794914: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f503680 tg 0x8f503500 cpu 0 on_list 0x1
monolith-375 [000] 13.794915: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
monolith-375 [000] 13.794924: sched_wakeup: dbus-daemon:451 [120] success=1 CPU:000
event_switch-581 [001] 13.795390: sched_stat_runtime: comm=event_switch pid=581 runtime=780901 [ns] vruntime=20958193 [ns]
event_switch-581 [001] 13.795406: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
event_switch-581 [001] 13.795410: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
event_switch-581 [001] 13.795412: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
event_switch-581 [001] 13.795414: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
event_switch-581 [001] 13.795415: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
event_switch-581 [001] 13.795417: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
event_switch-581 [001] 13.795419: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.795421: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.795423: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
event_switch-581 [001] 13.795425: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
event_switch-581 [001] 13.795426: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
monolith-375 [000] 13.795574: sched_waking: comm=event_switch pid=581 prio=120 target_cpu=001
monolith-375 [000] 13.795581: sched_migrate_task: comm=event_switch pid=581 prio=120 orig_cpu=1 dest_cpu=0
monolith-375 [000] 13.795584: bprint: ttwu_do_activate.constprop.27: ttwu_activate
monolith-375 [000] 13.795585: bprint: activate_task: activate_task
monolith-375 [000] 13.795587: bprint: enqueue_task_fair: cfs_rq 0x8f304180 tg 0x8f305e00
monolith-375 [000] 13.795591: sched_stat_blocked: comm=event_switch pid=581 delay=199612 [ns]
monolith-375 [000] 13.795593: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f304180 tg 0x8f305e00 cpu 0 on_list 0x1
monolith-375 [000] 13.795594: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
monolith-375 [000] 13.795601: sched_wakeup: event_switch:581 [120] success=1 CPU:000
dbus-daemon-451 [001] 13.796233: sched_stat_runtime: comm=dbus-daemon pid=451 runtime=751313 [ns] vruntime=70318742 [ns]
dbus-daemon-451 [001] 13.796248: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796251: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796253: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796255: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796256: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796258: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796259: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796261: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796263: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796264: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796266: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
dbus-daemon-451 [001] 13.796327: sched_stat_wait: comm=event_switch pid=581 delay=0 [ns]
dbus-daemon-451 [001] 13.796331: sched_switch: dbus-daemon:451 [120] S ==> event_switch:581 [120]
monolith-375 [000] 13.796338: sched_migrate_task: comm=dbus-daemon pid=451 prio=120 orig_cpu=1 dest_cpu=0
monolith-375 [000] 13.796341: bprint: ttwu_do_activate.constprop.27: ttwu_activate
monolith-375 [000] 13.796344: bprint: activate_task: activate_task
monolith-375 [000] 13.796346: bprint: enqueue_task_fair: cfs_rq 0x8f503680 tg 0x8f503500
event_switch-581 [001] 13.796348: sched_stat_runtime: comm=event_switch pid=581 runtime=37688 [ns] vruntime=20995881 [ns]
monolith-375 [000] 13.796349: sched_stat_sleep: comm=dbus-daemon pid=451 delay=114543 [ns]
monolith-375 [000] 13.796351: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f503680 tg 0x8f503500 cpu 0 on_list 0x1
monolith-375 [000] 13.796353: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
event_switch-581 [001] 13.796354: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
event_switch-581 [001] 13.796356: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
event_switch-581 [001] 13.796357: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
monolith-375 [000] 13.796358: sched_wakeup: dbus-daemon:451 [120] success=1 CPU:000
event_switch-581 [001] 13.796358: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
event_switch-581 [001] 13.796360: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
event_switch-581 [001] 13.796361: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
event_switch-581 [001] 13.796363: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.796364: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.796366: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
event_switch-581 [001] 13.796367: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
--
monolith-375 [000] 13.796533: sched_waking: comm=event_switch pid=581 prio=120 target_cpu=001
monolith-375 [000] 13.796539: sched_migrate_task: comm=event_switch pid=581 prio=120 orig_cpu=1 dest_cpu=0
monolith-375 [000] 13.796542: bprint: ttwu_do_activate.constprop.27: ttwu_activate
monolith-375 [000] 13.796544: bprint: activate_task: activate_task
monolith-375 [000] 13.796546: bprint: enqueue_task_fair: cfs_rq 0x8f304180 tg 0x8f305e00
monolith-375 [000] 13.796548: sched_stat_blocked: comm=event_switch pid=581 delay=197519 [ns]
monolith-375 [000] 13.796551: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f304180 tg 0x8f305e00 cpu 0 on_list 0x1
monolith-375 [000] 13.796552: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
monolith-375 [000] 13.796557: sched_wakeup: event_switch:581 [120] success=1 CPU:000
dbus-daemon-451 [001] 13.796588: sched_stat_runtime: comm=dbus-daemon pid=451 runtime=183212 [ns] vruntime=70501954 [ns]
dbus-daemon-451 [001] 13.796596: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796598: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796600: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796601: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796603: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796604: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796606: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796607: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796608: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796610: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
dbus-daemon-451 [001] 13.796611: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
monolith-375 [000] 13.796970: sched_waking: comm=dbus-daemon pid=451 prio=120 target_cpu=001
monolith-375 [000] 13.796977: sched_migrate_task: comm=dbus-daemon pid=451 prio=120 orig_cpu=1 dest_cpu=0
monolith-375 [000] 13.796981: bprint: ttwu_do_activate.constprop.27: ttwu_activate
monolith-375 [000] 13.796982: bprint: activate_task: activate_task
monolith-375 [000] 13.796984: bprint: enqueue_task_fair: cfs_rq 0x8f503680 tg 0x8f503500
monolith-375 [000] 13.796987: sched_stat_sleep: comm=dbus-daemon pid=451 delay=396035 [ns]
monolith-375 [000] 13.796990: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f503680 tg 0x8f503500 cpu 0 on_list 0x1
monolith-375 [000] 13.796991: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
monolith-375 [000] 13.796998: sched_wakeup: dbus-daemon:451 [120] success=1 CPU:000
event_switch-581 [001] 13.797334: sched_stat_runtime: comm=event_switch pid=581 runtime=676930 [ns] vruntime=21672811 [ns]
event_switch-581 [001] 13.797347: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
event_switch-581 [001] 13.797351: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
monolith-375 [000] 13.797351: sched_stat_runtime: comm=monolith pid=375 runtime=3220024 [ns] vruntime=316146973 [ns]
event_switch-581 [001] 13.797352: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
event_switch-581 [001] 13.797354: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
event_switch-581 [001] 13.797356: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
event_switch-581 [001] 13.797357: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
event_switch-581 [001] 13.797359: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
event_switch-581 [001] 13.797361: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
monolith-375 [000] 13.797362: sched_stat_wait: comm=dbus-daemon pid=451 delay=365973 [ns]
event_switch-581 [001] 13.797362: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
--
dbus-daemon-451 [001] 13.801318: bprint: ttwu_do_activate.constprop.27: ttwu_activate
dbus-daemon-451 [001] 13.801319: bprint: activate_task: activate_task
dbus-daemon-451 [001] 13.801321: bprint: enqueue_task_fair: cfs_rq 0x8f304180 tg 0x8f305e00
dbus-daemon-451 [001] 13.801324: sched_stat_sleep: comm=event_switch pid=581 delay=785510 [ns]
dbus-daemon-451 [001] 13.801326: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8f304180 tg 0x8f305e00 cpu 0 on_list 0x1
dbus-daemon-451 [001] 13.801327: bprint: enqueue_task_fair: cfs_rq 0xba22be00 tg 0xba22a600
dbus-daemon-451 [001] 13.801333: sched_wakeup: event_switch:581 [120] success=1 CPU:000
dbus-daemon-451 [001] 13.801357: sched_stat_runtime: comm=dbus-daemon pid=451 runtime=410404 [ns] vruntime=71341154 [ns]
dbus-daemon-451 [001] 13.801365: sched_switch: dbus-daemon:451 [120] S ==> swapper/1:0 [120]
monolith-375 [000] 13.801389: sched_stat_runtime: comm=monolith pid=375 runtime=180052 [ns] vruntime=316788230 [ns]
<idle>-0 [001] 13.801394: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
<idle>-0 [001] 13.801397: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
<idle>-0 [001] 13.801399: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
<idle>-0 [001] 13.801401: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
<idle>-0 [001] 13.801403: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
monolith-375 [000] 13.801404: sched_waking: comm=rcu_preempt pid=7 prio=120 target_cpu=000
<idle>-0 [001] 13.801405: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
<idle>-0 [001] 13.801406: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
monolith-375 [000] 13.801407: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1
<idle>-0 [001] 13.801408: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
<idle>-0 [001] 13.801410: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
--
<idle>-0 [001] 13.811388: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
<idle>-0 [000] 13.811389: bprint: update_blocked_averages: cfs_rq 0x8f131b00 cpu 0 on_list 0x1
<idle>-0 [001] 13.811391: sched_stat_sleep: comm=rcu_preempt pid=7 delay=9913746 [ns]
<idle>-0 [000] 13.811391: bprint: update_blocked_averages: cfs_rq 0x8f131c80 cpu 0 on_list 0x1
<idle>-0 [000] 13.811393: bprint: update_blocked_averages: cfs_rq 0x8f503680 cpu 0 on_list 0x1
<idle>-0 [001] 13.811393: bprint: list_add_leaf_cfs_rq: cfs_rq 0xbf6cec50 tg 0x808489c8 cpu 1 on_list 0x1
<idle>-0 [000] 13.811394: bprint: update_blocked_averages: cfs_rq 0x8f5bad80 cpu 0 on_list 0x1
<idle>-0 [001] 13.811396: sched_wakeup: rcu_preempt:7 [120] success=1 CPU:001
<idle>-0 [000] 13.811396: bprint: update_blocked_averages: cfs_rq 0xba2dd380 cpu 0 on_list 0x1
<idle>-0 [000] 13.811398: bprint: update_blocked_averages: cfs_rq 0x8f5bb500 cpu 0 on_list 0x1
<idle>-0 [001] 13.811399: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
<idle>-0 [000] 13.811399: bprint: update_blocked_averages: cfs_rq 0xba25fc80 cpu 0 on_list 0x1
<idle>-0 [001] 13.811401: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
<idle>-0 [000] 13.811401: bprint: update_blocked_averages: cfs_rq 0xba2dca80 cpu 0 on_list 0x1
<idle>-0 [001] 13.811402: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
<idle>-0 [000] 13.811402: bprint: update_blocked_averages: cfs_rq 0xba22b800 cpu 0 on_list 0x1
<idle>-0 [000] 13.811404: bprint: update_blocked_averages: cfs_rq 0x8f67ed80 cpu 0 on_list 0x1
<idle>-0 [001] 13.811404: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
<idle>-0 [001] 13.811405: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
<idle>-0 [000] 13.811406: bprint: update_blocked_averages: cfs_rq 0x8f58cf00 cpu 0 on_list 0x1
<idle>-0 [001] 13.811407: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
--
<idle>-0 [001] 13.811601: sched_stat_sleep: comm=ksoftirqd/1 pid=14 delay=29863797 [ns]
<idle>-0 [001] 13.811603: sched_wakeup: ksoftirqd/1:14 [120] success=1 CPU:001
<idle>-0 [001] 13.811609: sched_stat_wait: comm=ksoftirqd/1 pid=14 delay=0 [ns]
<idle>-0 [001] 13.811612: sched_switch: swapper/1:0 [120] R ==> ksoftirqd/1:14 [120]
<idle>-0 [000] 13.811622: sched_process_free: comm=systemd-cgroups pid=596 prio=120
ksoftirqd/1-14 [001] 13.811663: sched_process_free: comm=systemd-cgroups pid=597 prio=120
ksoftirqd/1-14 [001] 13.811690: sched_stat_runtime: comm=ksoftirqd/1 pid=14 runtime=90456 [ns] vruntime=31855210740 [ns]
ksoftirqd/1-14 [001] 13.811694: sched_stat_wait: comm=rcu_preempt pid=7 delay=302328 [ns]
ksoftirqd/1-14 [001] 13.811696: sched_switch: ksoftirqd/1:14 [120] S ==> rcu_preempt:7 [120]
rcu_preempt-7 [001] 13.811704: sched_stat_runtime: comm=rcu_preempt pid=7 runtime=14969 [ns] vruntime=31855165780 [ns]
rcu_preempt-7 [001] 13.811709: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811711: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811713: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811714: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811716: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811717: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811719: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811720: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811722: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811723: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811724: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
rcu_preempt-7 [001] 13.811741: bprint: update_blocked_averages: cfs_rq 0xbf6cec50 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811742: bprint: update_blocked_averages: cfs_rq 0xba22b980 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811744: bprint: update_blocked_averages: cfs_rq 0x8f502a80 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811745: bprint: update_blocked_averages: cfs_rq 0x8f502480 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.811750: sched_switch: rcu_preempt:7 [120] S ==> swapper/1:0 [120]
<idle>-0 [000] 13.811809: sched_process_free: comm=systemd-cgroups pid=598 prio=120
<idle>-0 [000] 13.811981: sched_waking: comm=kworker/0:3 pid=275 prio=120 target_cpu=000
<idle>-0 [001] 13.812051: sched_stat_wait: comm=rcu_preempt pid=7 delay=0 [ns]
<idle>-0 [001] 13.812053: sched_switch: swapper/1:0 [120] R ==> rcu_preempt:7 [120]
rcu_preempt-7 [001] 13.812062: sched_stat_runtime: comm=rcu_preempt pid=7 runtime=21636 [ns] vruntime=31855187416 [ns]
rcu_preempt-7 [001] 13.812065: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812067: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812069: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812071: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812072: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812074: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812075: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812076: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812077: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812079: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
rcu_preempt-7 [001] 13.812080: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
rcu_preempt-7 [001] 13.812114: sched_switch: rcu_preempt:7 [120] S ==> swapper/1:0 [120]
<idle>-0 [000] 13.812186: sys_exit_exit: 0x78
<...>-58792 [000] 13.812186: sys_enter_getrusage: who: 0x0000, ru: 0x80046f98
<idle>-0 [001] 13.821381: sched_waking: comm=rcu_preempt pid=7 prio=120 target_cpu=001
<idle>-0 [001] 13.821384: bprint: ttwu_do_activate.constprop.27: ttwu_activate
<idle>-0 [001] 13.821386: bprint: activate_task: activate_task
<idle>-0 [001] 13.821387: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
<idle>-0 [001] 13.821390: sched_stat_sleep: comm=rcu_preempt pid=7 delay=9325236 [ns]
<idle>-0 [001] 13.821392: bprint: list_add_leaf_cfs_rq: cfs_rq 0xbf6cec50 tg 0x808489c8 cpu 1 on_list 0x1
<idle>-0 [001] 13.821394: sched_wakeup: rcu_preempt:7 [120] success=1 CPU:001
<idle>-0 [001] 13.821397: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
<idle>-0 [001] 13.821399: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
<idle>-0 [001] 13.821400: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
<idle>-0 [001] 13.821402: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
<idle>-0 [001] 13.821403: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
<idle>-0 [001] 13.821405: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
<idle>-0 [001] 13.821406: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
<idle>-0 [001] 13.821408: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
<idle>-0 [001] 13.821409: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
<idle>-0 [001] 13.821411: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
<idle>-0 [001] 13.821412: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
<idle>-0 [001] 13.821444: sched_switch: swapper/1:0 [120] R ==> rcu_preempt:7 [120]
rcu_preempt-7 [001] 13.821451: sched_stat_runtime: comm=rcu_preempt pid=7 runtime=54530 [ns] vruntime=31855241946 [ns]
rcu_preempt-7 [001] 13.821454: sched_switch: rcu_preempt:7 [120] S ==> swapper/1:0 [120]
<idle>-0 [001] 13.831375: sched_waking: comm=rcu_preempt pid=7 prio=120 target_cpu=001
<idle>-0 [001] 13.831378: bprint: ttwu_do_activate.constprop.27: ttwu_activate
<idle>-0 [001] 13.831379: bprint: activate_task: activate_task
<idle>-0 [001] 13.831381: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
<idle>-0 [001] 13.831382: sched_stat_sleep: comm=rcu_preempt pid=7 delay=9929238 [ns]
<idle>-0 [001] 13.831385: bprint: list_add_leaf_cfs_rq: cfs_rq 0xbf6cec50 tg 0x808489c8 cpu 1 on_list 0x1
<idle>-0 [001] 13.831386: sched_wakeup: rcu_preempt:7 [120] success=1 CPU:001
<idle>-0 [001] 13.831389: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
<idle>-0 [001] 13.831391: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
<idle>-0 [001] 13.831392: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
<idle>-0 [001] 13.831394: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
<idle>-0 [001] 13.831395: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
<idle>-0 [001] 13.831397: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
<idle>-0 [001] 13.831398: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
<idle>-0 [001] 13.831400: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
<idle>-0 [001] 13.831401: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
<idle>-0 [001] 13.831402: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
<idle>-0 [001] 13.831404: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
<idle>-0 [001] 13.841380: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
<idle>-0 [001] 13.841382: sched_stat_sleep: comm=rcu_preempt pid=7 delay=9937904 [ns]
<idle>-0 [001] 13.841384: bprint: list_add_leaf_cfs_rq: cfs_rq 0xbf6cec50 tg 0x808489c8 cpu 1 on_list 0x1
<idle>-0 [001] 13.841386: sched_wakeup: rcu_preempt:7 [120] success=1 CPU:001
<idle>-0 [001] 13.841389: sched_waking: comm=kworker/1:1 pid=24 prio=120 target_cpu=001
<idle>-0 [001] 13.841391: bprint: ttwu_do_activate.constprop.27: ttwu_activate
<idle>-0 [001] 13.841392: bprint: activate_task: activate_task
<idle>-0 [001] 13.841393: bprint: enqueue_task_fair: cfs_rq 0xbf6cec50 tg 0x808489c8
<idle>-0 [001] 13.841395: sched_stat_sleep: comm=kworker/1:1 pid=24 delay=50114090 [ns]
<idle>-0 [001] 13.841397: sched_wakeup: kworker/1:1:24 [120] success=1 CPU:001
<idle>-0 [001] 13.841400: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x1
<idle>-0 [001] 13.841402: bprint: update_blocked_averages: cfs_rq 0x8f305c80 cpu 1 on_list 0x1
<idle>-0 [001] 13.841404: bprint: update_blocked_averages: cfs_rq 0x8f502600 cpu 1 on_list 0x1
<idle>-0 [001] 13.841405: bprint: update_blocked_averages: cfs_rq 0x8f130180 cpu 1 on_list 0x1
<idle>-0 [001] 13.841407: bprint: update_blocked_averages: cfs_rq 0x8f130900 cpu 1 on_list 0x1
<idle>-0 [001] 13.841408: bprint: update_blocked_averages: cfs_rq 0x8f503800 cpu 1 on_list 0x1
<idle>-0 [001] 13.841410: bprint: update_blocked_averages: cfs_rq 0x8f5bbb00 cpu 1 on_list 0x1
<idle>-0 [001] 13.841412: bprint: update_blocked_averages: cfs_rq 0xba25fb00 cpu 1 on_list 0x1
<idle>-0 [001] 13.841413: bprint: update_blocked_averages: cfs_rq 0x8f5bb800 cpu 1 on_list 0x1
<idle>-0 [001] 13.841415: bprint: update_blocked_averages: cfs_rq 0xba2dd200 cpu 1 on_list 0x1
<idle>-0 [001] 13.841416: bprint: update_blocked_averages: cfs_rq 0xba2dcc00 cpu 1 on_list 0x1
--
<idle>-0 [001] 13.841433: bprint: update_blocked_averages: cfs_rq 0xbf6cec50 cpu 1 on_list 0x1
<idle>-0 [001] 13.841435: bprint: update_blocked_averages: cfs_rq 0xba22b980 cpu 1 on_list 0x1
<idle>-0 [001] 13.841436: bprint: update_blocked_averages: cfs_rq 0x8f502a80 cpu 1 on_list 0x1
<idle>-0 [001] 13.841438: bprint: update_blocked_averages: cfs_rq 0x8f502480 cpu 1 on_list 0x1
<idle>-0 [001] 13.841483: sched_stat_wait: comm=kworker/1:1 pid=24 delay=6857 [ns]
<idle>-0 [001] 13.841486: sched_switch: swapper/1:0 [120] R ==> kworker/1:1:24 [120]
kworker/1:1-24 [001] 13.841518: sched_stat_runtime: comm=kworker/1:1 pid=24 runtime=117597 [ns] vruntime=31855237881 [ns]
kworker/1:1-24 [001] 13.841521: sched_stat_wait: comm=rcu_preempt pid=7 delay=138060 [ns]
kworker/1:1-24 [001] 13.841524: sched_switch: kworker/1:1:24 [120] S ==> rcu_preempt:7 [120]
rcu_preempt-7 [001] 13.841530: sched_stat_runtime: comm=rcu_preempt pid=7 runtime=12468 [ns] vruntime=31855307432 [ns]
rcu_preempt-7 [001] 13.841534: bprint: update_blocked_averages: cfs_rq 0xba25ea80 cpu 1 on_list 0x6b6b6b6b

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-11 17:02   ` Niklas Cassel
@ 2016-03-11 17:28     ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2016-03-11 17:28 UTC (permalink / raw)
  To: Niklas Cassel; +Cc: tj, linux-kernel@vger.kernel.org

On Fri, Mar 11, 2016 at 06:02:07PM +0100, Niklas Cassel wrote:
> > Can you describe a reliable reproducer?
> 
> Right now I just do a reboot loop, systemd creates cgroups during boot
> this dump took 84 reboots. Harder to reproduce with more trace_printks.

Bugger, I don't have a single system with systemd on, so that'll be hard
to reproduce.

> 
> CONFIG_SCHED_AUTOGROUP looks interesting though, might try that next week.
> 
> Appears that the call comes from idle_balance or rebalance_domains,
> since they are the only ones that call load_balance.

Weird, balancing should very much be inside the cgroup, so if balancing
moves anything, there must be tasks to begin with, which still should
not be the case if we've called offline.

FWIW:

echo 1 > /debug/tracing/options/stacktrace

will add a stack trace to each trace_printk().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-10 12:54 ` Peter Zijlstra
  2016-03-11 17:02   ` Niklas Cassel
@ 2016-03-11 18:20   ` Tejun Heo
  1 sibling, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2016-03-11 18:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Niklas Cassel, linux-kernel@vger.kernel.org

Hello, Peter.

On Thu, Mar 10, 2016 at 01:54:17PM +0100, Peter Zijlstra wrote:
> > I've reproduced this on v4.4, but I've also managed to reproduce the bug
> > after cherry-picking the following patches
> > (all but one were marked for v4.4 stable):
> > 
> > 6fe1f34 sched/cgroup: Fix cgroup entity load tracking tear-down
> > d6e022f workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
> > 041bd12 Revert "workqueue: make sure delayed work run in local cpu"
> > 8bb5ef7 cgroup: make sure a parent css isn't freed before its children
> > aa226ff cgroup: make sure a parent css isn't offlined before its children
> > e93ad19 cpuset: make mm migration asynchronous
> 
> Hmm, that is most unfortunate indeed.
> 
> Can you describe a reliable reproducer?
> 
> So we only call list_add_leaf_cfs_rq() through enqueue_task_fair(),
> which means someone is still running inside that cgroup.
> 
> TJ, I thought we only call offline when the cgroup is empty, don't we?

Yeap, populated csses shouldn't be being offlined.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-05-02  3:06 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
2016-03-12 13:59 ` Peter Zijlstra
2016-03-14 11:20 ` Peter Zijlstra
2016-03-14 12:09   ` Peter Zijlstra
2016-03-16 14:24     ` Tejun Heo
2016-03-16 14:44       ` Tejun Heo
2016-03-16 15:22       ` Peter Zijlstra
2016-03-16 16:50         ` Tejun Heo
2016-03-16 17:04           ` Peter Zijlstra
2016-03-16 17:49             ` Tejun Heo
2016-03-17  8:29         ` Niklas Cassel
2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
2016-04-28 18:40           ` Peter Zijlstra
2016-04-28 18:51             ` Greg Kroah-Hartman
2016-04-28 21:36               ` Peter Zijlstra
2016-05-02  3:06                 ` Greg Kroah-Hartman
  -- strict thread matches above, loose matches on Subject: below --
2016-03-04 10:41 [BUG] sched: leaf_cfs_rq_list use after free Niklas Cassel
2016-03-10 12:54 ` Peter Zijlstra
2016-03-11 17:02   ` Niklas Cassel
2016-03-11 17:28     ` Peter Zijlstra
2016-03-11 18:20   ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).