* Re: [BUG] sched: leaf_cfs_rq_list use after free
@ 2016-03-12  9:42 Kazuki Yamaguchi
From: Kazuki Yamaguchi @ 2016-03-12  9:42 UTC
  To: Tejun Heo; +Cc: Niklas Cassel, Peter Zijlstra, linux-kernel

Hello,

I have been getting similar kernel crashes since the following patch,
which went into 4.4:

2e91fa7 cgroup: keep zombies associated with their original cgroups

I was just about to report it separately, but maybe it is related?

^^^^^^^[    0.761718] BUG: unable to handle kernel NULL pointer dereference at 00000000000008b0
[    0.762860] IP: [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020] PGD 3fc067 PUD 3a9067 PMD 0
[    0.764020] Oops: 0000 [#1] SMP
[    0.764020] CPU: 0 PID: 56 Comm: test Not tainted 4.5.0-rc7 #25
[    0.764020] task: ffff8800003d2700 ti: ffff8800003e8000 task.ti: ffff8800003e8000
[    0.764020] RIP: 0010:[<ffffffff81052630>]  [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020] RSP: 0000:ffff880007c03e50  EFLAGS: 00000016
[    0.764020] RAX: 0000000000000000 RBX: 00000000ffff165e RCX: 000000002d5096e1
[    0.764020] RDX: 00000000000d281c RSI: ffff880000138200 RDI: 00000000000d281c
[    0.764020] RBP: ffff880007c03eb0 R08: ffffffff811567e0 R09: 0000000000000100
[    0.764020] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880007c11920
[    0.764020] R13: 00000000000110c0 R14: afb504000afb5041 R15: ffff880007c110c0
[    0.764020] FS:  0000000001b69880(0063) GS:ffff880007c00000(0000) knlGS:0000000000000000
[    0.764020] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.764020] CR2: 00000000000008b0 CR3: 00000000003a4000 CR4: 00000000000006b0
[    0.764020] Stack:
[    0.764020]  0000000080000100 0000000000000286 ffff880007c0c7f8 0000000000000006
[    0.764020]  0000000007c0c5c0 ffff880000138200 ffffffff8104ce00 00000000ffff165e
[    0.764020]  ffff880007c110c0 00000000000110c0 0000000000000007 0000000000000000
[    0.764020] Call Trace:
[    0.764020]  <IRQ>
[    0.764020]  [<ffffffff8104ce00>] ? wake_up_process+0x10/0x20
[    0.764020]  [<ffffffff8105978d>] run_rebalance_domains+0x6d/0x290
[    0.764020]  [<ffffffff81072cab>] ? run_timer_softirq+0x19b/0x220
[    0.764020]  [<ffffffff810318ee>] __do_softirq+0xde/0x1e0
[    0.764020]  [<ffffffff81031aef>] irq_exit+0x5f/0x70
[    0.764020]  [<ffffffff81020238>] smp_trace_apic_timer_interrupt+0x68/0x90
[    0.764020]  [<ffffffff81020269>] smp_apic_timer_interrupt+0x9/0x10
[    0.764020]  [<ffffffff8114dd4c>] apic_timer_interrupt+0x7c/0x90
[    0.764020]  <EOI>
[    0.764020]  [<ffffffff810b76f6>] ? find_vma+0x16/0x70
[    0.764020]  [<ffffffff81026d18>] __do_page_fault+0xe8/0x360
[    0.764020]  [<ffffffff81026fcc>] do_page_fault+0xc/0x10
[    0.764020]  [<ffffffff8114e5cf>] page_fault+0x1f/0x30
[    0.764020] Code: 00 48 8d b0 28 ff ff ff 49 be 41 50 fb 0a 00 04 b5 af 48 89 74 24 28 48 8b 74 24 28 c7 44 24 24 00 00 00 00 48 8b 86 c8 00 00 00 <48> 8b 90 b0 08 00 00 48 8b 86 a0 00 00 00 48 85 c0 74 46 31 c0
[    0.764020] RIP  [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020]  RSP <ffff880007c03e50>
[    0.764020] CR2: 00000000000008b0
[    0.764020] ---[ end trace 754fbc727003a126 ]---
[    0.764020] Kernel panic - not syncing: Fatal exception in interrupt
[    0.764020] Shutting down cpus with NMI
[    0.764020] Kernel Offset: disabled
[    0.764020] ---[ end Kernel panic - not syncing: Fatal exception in interrupt


I can reproduce it on QEMU (qemu-system-x86_64 -smp 2).

enabled config:
CONFIG_PID_NS=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_SMP=y


init.sh:
#!/bin/sh
mkdir /testg
mount -t cgroup -o cpu cgroup /testg
echo /agent.sh > /testg/release_agent
echo 1 > /testg/notify_on_release

mkdir /temp-mnt
while :; do
    echo -n ^
    ./test
done


agent.sh:
#!/bin/sh
rmdir /testg$1


test.c:
#define _GNU_SOURCE /* for unshare() and CLONE_NEWPID */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/ptrace.h>

int
main(void)
{
     mount("none", "/temp-mnt", "tmpfs", 0, "");
     unshare(CLONE_NEWPID);
     pid_t pid = fork(); // first child is PID 1 of the new PID namespace
     if (pid == 0) {
         fork();
     } else {
         ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACEFORK);
         char template[128] = "/testg/XXXXXX";
         if (!mkdtemp(template)) abort();
         FILE *f = fopen(strcat(template, "/cgroup.procs"), "w");
         if (!f) abort();
         fprintf(f, "%d\n", pid); // move the child into the new cgroup
         fclose(f);
         wait(NULL); // stopped at fork()
         kill(pid, SIGKILL);
         umount("/temp-mnt");
     }
     return 0;
}

-- 
Kazuki Yamaguchi <k@rhe.jp>

* [BUG] sched: leaf_cfs_rq_list use after free
@ 2016-03-04 10:41 Niklas Cassel
From: Niklas Cassel @ 2016-03-04 10:41 UTC
  To: tj, peterz, linux-kernel@vger.kernel.org

Hello

I've stumbled upon a use-after-free bug related to
CONFIG_FAIR_GROUP_SCHED / rq->cfs_rq->leaf_cfs_rq_list in v4.4.


Normally, a cfs_rq is immediately removed from the leaf_cfs_rq_list
and cfs_rq->on_list is set to 0; the cfs_rq is then freed at a later
time by call_rcu(&tg->rcu, free_sched_group_rcu).


In the crashing case, the cfs_rq is likewise removed from the
leaf_cfs_rq_list and cfs_rq->on_list is set to 0, but the cfs_rq is
then re-added to the list and cfs_rq->on_list is set back to 1 before
the call to call_rcu(&tg->rcu, free_sched_group_rcu).

Now the cfs_rq is freed, filled with 0x6b6b6b6b by SLUB_DEBUG,
and still on the leaf_cfs_rq_list. Since the cfs_rq is still on
the list, the next call to update_blocked_averages will iterate
the list and try to access members of the cfs_rq object,
which has already been freed.
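
The ordering described above can be modelled in user space. The following C sketch is not kernel code: the names model_cfs_rq, model_rq, model_list_add_leaf, model_list_del_leaf, model_free and model_teardown are hypothetical, the list is a plain singly linked list, and "freeing" only sets a flag so that the stale entry can be detected without touching freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct cfs_rq. */
struct model_cfs_rq {
    struct model_cfs_rq *next; /* link in the per-rq leaf list */
    int on_list;               /* mirrors cfs_rq->on_list */
    bool freed;                /* models the free without releasing memory */
};

/* Hypothetical stand-in for the per-CPU runqueue that owns the leaf list. */
struct model_rq {
    struct model_cfs_rq *leaf_list;
};

/* Add to the leaf list unless already present (guarded by on_list). */
void model_list_add_leaf(struct model_rq *rq, struct model_cfs_rq *c)
{
    if (c->on_list)
        return;
    c->next = rq->leaf_list;
    rq->leaf_list = c;
    c->on_list = 1;
}

/* Unlink from the leaf list and clear on_list. */
void model_list_del_leaf(struct model_rq *rq, struct model_cfs_rq *c)
{
    struct model_cfs_rq **pp;

    if (!c->on_list)
        return;
    for (pp = &rq->leaf_list; *pp; pp = &(*pp)->next) {
        if (*pp == c) {
            *pp = c->next;
            break;
        }
    }
    c->next = NULL;
    c->on_list = 0;
}

/* Mark the object as freed; a second free would be a bug in the model. */
void model_free(struct model_cfs_rq *c)
{
    assert(!c->freed);
    c->freed = true;
}

/* Walk the list the way a periodic update would, counting freed entries. */
int model_count_stale(const struct model_rq *rq)
{
    int stale = 0;

    for (const struct model_cfs_rq *c = rq->leaf_list; c; c = c->next)
        if (c->freed)
            stale++;
    return stale;
}

/* Run one teardown; returns how many freed entries remain on the list. */
int model_teardown(bool readd_before_free)
{
    struct model_rq rq = { NULL };
    struct model_cfs_rq c = { NULL, 0, false };

    model_list_add_leaf(&rq, &c);     /* group is in use */
    model_list_del_leaf(&rq, &c);     /* teardown removes it from the list */
    if (readd_before_free)
        model_list_add_leaf(&rq, &c); /* late re-add, as in the crash case */
    model_free(&c);                   /* deferred free eventually runs */
    return model_count_stale(&rq);
}
```

Under these assumptions, model_teardown(false) leaves no freed entry on the list, while model_teardown(true) leaves one: the entry that a later list walk would dereference.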



[   27.531374] Unable to handle kernel paging request at virtual address 6b6b706b
[   27.538596] pgd = 8cea8000
[   27.541295] [6b6b706b] *pgd=00000000
[   27.544870] Internal error: Oops: 1 [#1] PREEMPT SMP ARM
[   27.564025] CPU: 1 PID: 1252 Comm: logger Tainted: G           O    4.4.0 #2
[   27.571064] Hardware name: Axis ARTPEC-6 Platform
[   27.575759] task: b9586540 ti: 8c84c000 task.ti: 8c84c000
[   27.581155] PC is at update_blocked_averages+0xcc/0x748
[   27.586372] LR is at update_blocked_averages+0xbc/0x748
[   27.591589] pc : [<80051d78>]    lr : [<80051d68>]    psr: 200c0193
               sp : 8c84dce8  ip : 00000500  fp : 8efb1680
[   27.603056] r10: 00000006  r9 : 80847788  r8 : 6b6b6b6b
[   27.608271] r7 : 00000007  r6 : ffff958a  r5 : 00000007  r4 : ffff958a
[   27.614789] r3 : 6b6b6b6b  r2 : 00000101  r1 : 00000000  r0 : 00000003
[   27.621308] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   27.628521] Control: 10c5387d  Table: 0cea804a  DAC: 00000055
[   27.634257] Process logger (pid: 1252, stack limit = 0x8c84c210)
[   27.640254] Stack: (0x8c84dce8 to 0x8c84e000)
[   27.644604] dce0:                   6b6b6b6b 00000103 bad39440 80048250 00000000 bad398d0
[   27.652774] dd00: bf6cf0d0 00000001 807e2c48 bad398d0 00000000 8054e7c8 ffff4582 bf6cec00
[   27.660944] dd20: 00000001 8004825c 00000100 807dc400 8c84de40 bf6cb340 bad87ebc 00000100
[   27.669114] dd40: afb50401 200c0113 00000200 807dc400 807e2100 ffff958a 00000007 8083916c
[   27.677283] dd60: 00000100 00000006 0000001c 80058748 bf6cb340 8054e810 00000000 00000001
[   27.685452] dd80: 807dc400 bf6cec00 00000001 bf6cec00 8083916c 00000001 c0803100 807dc400
[   27.693622] dda0: 807e209c 000000a0 00000007 8083916c 00000100 00000006 0000001c 800282a0
[   27.701791] ddc0: 00000001 bf6d2a80 b95fac00 0000000a ffff958b 00400000 bacf7000 807dc400
[   27.709961] dde0: 00000000 00000000 0000001b bf0188c0 00000001 c0803100 b95fac00 80028830
[   27.718130] de00: 807dc400 8006ca14 c0802100 c080210c 807e2db0 8081a140 8c84de40 80009420
[   27.726300] de20: 8054e780 80122048 800c0013 ffffffff 8c84de74 00000001 00100073 800142c0
[   27.734469] de40: b95ace70 b9586540 00000000 00000000 600c0013 00000000 024080c0 8010e8e0
[   27.742639] de60: 00000001 00000001 00100073 b95fac00 00000000 8c84de90 8054e780 80122048
[   27.750808] de80: 800c0013 ffffffff b95fac00 80122044 bad00640 8011c418 000001f6 b95acb70
[   27.758978] dea0: 76f42000 b95acb70 76f42000 b95acb68 76f43000 8ce48780 00100073 8010e8e0
[   27.767148] dec0: 00100073 00000000 b95fac00 00000000 00000000 00000001 b9421000 00000001
[   27.775317] dee0: 00000000 76f46000 00000000 00000000 8001e8b8 76f42000 00000003 00000003
[   27.783486] df00: b95fac00 8ce48780 00000001 00001000 807e2c64 8010efb4 00000000 00000000
[   27.791656] df20: 0000004d 00000073 8c84df50 8ce487c4 b95fac00 00000003 00000013 00000000
[   27.799825] df40: 8c84c000 b95fac00 7ece0b44 800faf84 00000002 00000000 00000000 8c84df64
[   27.807995] df60: b95fac00 00000000 00000002 00000003 00000013 00000000 00000000 8010d4e8
[   27.816163] df80: 00000002 00000000 00000003 00000003 00000000 00000003 000000c0 800104e4
[   27.824333] dfa0: 00000020 800104b0 00000003 00000000 00000000 00000013 00000003 00000002
[   27.832502] dfc0: 00000003 00000000 00000003 000000c0 0007ecd0 76f45958 76f45574 7ece0b44
[   27.840671] dfe0: 00000000 7ece09fc 76f2e814 76f368d8 400c0010 00000000 00000000 00000000
[   27.848847] [<80051d78>] (update_blocked_averages) from [<80058748>] (rebalance_domains+0x38/0x2cc)
[   27.857889] [<80058748>] (rebalance_domains) from [<800282a0>] (__do_softirq+0x98/0x354)
[   27.865975] [<800282a0>] (__do_softirq) from [<80028830>] (irq_exit+0xb0/0x11c)
[   27.873281] [<80028830>] (irq_exit) from [<8006ca14>] (__handle_domain_irq+0x60/0xb8)
[   27.881106] [<8006ca14>] (__handle_domain_irq) from [<80009420>] (gic_handle_irq+0x48/0x94)
[   27.889452] [<80009420>] (gic_handle_irq) from [<800142c0>] (__irq_svc+0x40/0x74)
[   27.896924] Exception stack(0x8c84de40 to 0x8c84de88)
[   27.901969] de40: b95ace70 b9586540 00000000 00000000 600c0013 00000000 024080c0 8010e8e0
[   27.910139] de60: 00000001 00000001 00100073 b95fac00 00000000 8c84de90 8054e780 80122048
[   27.918306] de80: 800c0013 ffffffff
[   27.921793] [<800142c0>] (__irq_svc) from [<80122048>] (__slab_alloc.constprop.9+0x28/0x2c)
[   27.930139] [<80122048>] (__slab_alloc.constprop.9) from [<8011c418>] (kmem_cache_alloc+0x14c/0x204)
[   27.939265] [<8011c418>] (kmem_cache_alloc) from [<8010e8e0>] (mmap_region+0x29c/0x680)
[   27.947262] [<8010e8e0>] (mmap_region) from [<8010efb4>] (do_mmap+0x2f0/0x378)
[   27.954481] [<8010efb4>] (do_mmap) from [<800faf84>] (vm_mmap_pgoff+0x74/0xa4)
[   27.961699] [<800faf84>] (vm_mmap_pgoff) from [<8010d4e8>] (SyS_mmap_pgoff+0x94/0xf0)
[   27.969524] [<8010d4e8>] (SyS_mmap_pgoff) from [<800104b0>] (__sys_trace_return+0x0/0x10)
[   27.977694] Code: e59b8078 e59b309c e3a0cc05 e3580000 (e18300dc) 
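
The faulting address in the oops above is consistent with a load through a poisoned pointer. Reading the Code: line, the marked instruction e18300dc appears to decode as an ldrd from [r3, ip]; this decode is an assumption, not something stated in the report. The register dump shows r3 = 6b6b6b6b and ip = 00000500. A small sketch of that arithmetic, with poison_word and faulting_address as hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Poison byte as seen in the report (0x6b repeated in freed memory). */
#define MODEL_POISON_BYTE 0x6bu

/* Build the 32-bit word a freed, poisoned pointer field would contain. */
uint32_t poison_word(void)
{
    return 0x01010101u * MODEL_POISON_BYTE;
}

/* Address accessed by a register-offset load: base plus offset. */
uint32_t faulting_address(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```

With base 0x6b6b6b6b and offset 0x500 this gives 0x6b6b706b, the virtual address reported on the first line of the oops.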

Below is a snippet of output from the trace_printk() calls I added while
analyzing the problem. The prints show that a certain cfs_rq gets re-added
after it has been removed, and that update_blocked_averages uses the cfs_rq
after it has been freed:

         systemd-1     [000]    22.664453: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 0
         systemd-1     [000]    22.664479: bprint:               alloc_fair_sched_group: allocated cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 0
         systemd-1     [000]    22.664481: bprint:               cpu_cgroup_css_alloc: tg 0x8efb1800 tg->css.id 0
         systemd-1     [000]    22.664547: bprint:               cpu_cgroup_css_online: tg 0x8efb1800 tg->css.id 80
         systemd-874   [001]    27.389000: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
    migrate_cert-820   [001]    27.421337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
     kworker/0:1-24    [000]    27.421356: bprint:               cpu_cgroup_css_offline: tg 0x8efb1800 tg->css.id 80
     kworker/0:1-24    [000]    27.421445: bprint:               list_del_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
    migrate_cert-820   [001]    27.421506: bprint:               list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
   system-status-815   [001]    27.491358: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
     kworker/0:1-24    [000]    27.501561: bprint:               cpu_cgroup_css_free: tg 0x8efb1800 tg->css.id 80
    migrate_cert-820   [001]    27.511337: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
     ksoftirqd/0-3     [000]    27.521830: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 80
     ksoftirqd/0-3     [000]    27.521857: bprint:               free_fair_sched_group: freeing cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 80
          logger-1252  [001]    27.531355: bprint:               update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x6b6b6b6b


I've reproduced this on v4.4, but I've also managed to reproduce the bug
after cherry-picking the following patches
(all but one were marked for v4.4 stable):

6fe1f34 sched/cgroup: Fix cgroup entity load tracking tear-down
d6e022f workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
041bd12 Revert "workqueue: make sure delayed work run in local cpu"
8bb5ef7 cgroup: make sure a parent css isn't freed before its children
aa226ff cgroup: make sure a parent css isn't offlined before its children
e93ad19 cpuset: make mm migration asynchronous


Thread overview: 21+ messages
2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
2016-03-12 13:59 ` Peter Zijlstra
2016-03-14 11:20 ` Peter Zijlstra
2016-03-14 12:09   ` Peter Zijlstra
2016-03-16 14:24     ` Tejun Heo
2016-03-16 14:44       ` Tejun Heo
2016-03-16 15:22       ` Peter Zijlstra
2016-03-16 16:50         ` Tejun Heo
2016-03-16 17:04           ` Peter Zijlstra
2016-03-16 17:49             ` Tejun Heo
2016-03-17  8:29         ` Niklas Cassel
2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
2016-04-28 18:40           ` Peter Zijlstra
2016-04-28 18:51             ` Greg Kroah-Hartman
2016-04-28 21:36               ` Peter Zijlstra
2016-05-02  3:06                 ` Greg Kroah-Hartman
  -- strict thread matches above, loose matches on Subject: below --
2016-03-04 10:41 [BUG] sched: leaf_cfs_rq_list use after free Niklas Cassel
2016-03-10 12:54 ` Peter Zijlstra
2016-03-11 17:02   ` Niklas Cassel
2016-03-11 17:28     ` Peter Zijlstra
2016-03-11 18:20   ` Tejun Heo
