* Kernel crash in cgroup_pidlist_destroy_work_fn()
@ 2014-09-16 23:56 Cong Wang
From: Cong Wang @ 2014-09-16 23:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: LKML, cgroups
[-- Attachment #1: Type: text/plain, Size: 1425 bytes --]
Hi, Tejun
We saw a kernel NULL pointer dereference in
cgroup_pidlist_destroy_work_fn(), more precisely in
__mutex_lock_slowpath(), on 3.14. I can share the full stack trace
on request.

Looking at the code, it seems flush_workqueue() doesn't care about newly
incoming work items; it only processes the ones already pending. If that
is correct, then we could have the following race condition:
cgroup_pidlist_destroy_all():
	// ...
	mutex_lock(&cgrp->pidlist_mutex);
	list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
		mod_delayed_work(cgroup_pidlist_destroy_wq,
				 &l->destroy_dwork, 0);
	mutex_unlock(&cgrp->pidlist_mutex);

	// <--- another process calls cgroup_pidlist_start() here,
	//      since the mutex has been released

	flush_workqueue(cgroup_pidlist_destroy_wq);
	// <--- another process adds a new pidlist and queues work in parallel

	BUG_ON(!list_empty(&cgrp->pidlists));
	// <--- this check passes; list_add() could happen after it
Therefore, the newly added pidlist will point to a freed cgroup, and
we will crash when the delayed work later tries to destroy that pidlist.
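For reference, the destroy work reaches the cgroup through the pidlist's
owner back-pointer, so a freed cgroup turns the lock acquisition into a
use-after-free. A from-memory sketch of the relevant 3.14 code (details
such as field names may differ):

	static void cgroup_pidlist_destroy_work_fn(struct work_struct *work)
	{
		struct delayed_work *dwork = to_delayed_work(work);
		struct cgroup_pidlist *l = container_of(dwork,
					struct cgroup_pidlist, destroy_dwork);

		/* l->owner points back to the owning cgroup; if that
		 * cgroup has already been freed, this lock operation
		 * dereferences freed memory and can fault in
		 * __mutex_lock_slowpath(). */
		mutex_lock(&l->owner->pidlist_mutex);
		/* ... check the pending bit and free the pidlist ... */
		mutex_unlock(&l->owner->pidlist_mutex);
	}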
The attached patch (compile-tested ONLY) could be a possible fix, since
it checks and holds a refcount on the cgroup in cgroup_pidlist_start().
But I could very easily have missed something here, since there have
been many cgroup changes after 3.14 and I don't follow cgroup
development closely.
What do you think?
Thanks.
[-- Attachment #2: cgroup.diff --]
[-- Type: text/plain, Size: 938 bytes --]
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 940aced..2206151 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4084,6 +4084,9 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	int index = 0, pid = *pos;
 	int *iter, ret;
 
+	if (!cgroup_tryget(cgrp))
+		return NULL;
+
 	mutex_lock(&cgrp->pidlist_mutex);
 
 	/*
@@ -4132,13 +4135,15 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 
 static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
+	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct kernfs_open_file *of = s->private;
 	struct cgroup_pidlist *l = of->priv;
 
 	if (l)
 		mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
 				 CGROUP_PIDLIST_DESTROY_DELAY);
-	mutex_unlock(&seq_css(s)->cgroup->pidlist_mutex);
+	mutex_unlock(&cgrp->pidlist_mutex);
+	cgroup_put(cgrp);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
* Re: Kernel crash in cgroup_pidlist_destroy_work_fn()
@ 2014-09-17 5:29 ` Li Zefan
From: Li Zefan @ 2014-09-17 5:29 UTC (permalink / raw)
To: Cong Wang; +Cc: Tejun Heo, LKML, cgroups
On 2014/9/17 7:56, Cong Wang wrote:
> Hi, Tejun
>
>
> We saw a kernel NULL pointer dereference in
> cgroup_pidlist_destroy_work_fn(), more precisely in
> __mutex_lock_slowpath(), on 3.14. I can share the full stack trace
> on request.
>
Yes, please.
> Looking at the code, it seems flush_workqueue() doesn't care about newly
> incoming work items; it only processes the ones already pending. If that
> is correct, then we could have the following race condition:
>
> cgroup_pidlist_destroy_all():
> 	// ...
> 	mutex_lock(&cgrp->pidlist_mutex);
> 	list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
> 		mod_delayed_work(cgroup_pidlist_destroy_wq,
> 				 &l->destroy_dwork, 0);
> 	mutex_unlock(&cgrp->pidlist_mutex);
>
> 	// <--- another process calls cgroup_pidlist_start() here,
> 	//      since the mutex has been released
>
> 	flush_workqueue(cgroup_pidlist_destroy_wq);
> 	// <--- another process adds a new pidlist and queues work in parallel
>
> 	BUG_ON(!list_empty(&cgrp->pidlists));
> 	// <--- this check passes; list_add() could happen after it
>
Did you confirm this is what happened when the bug was triggered?
I don't think the race condition you described exists. In the 3.14
kernel, cgroup_diput() won't be called if there is any thread running
cgroup_pidlist_start(). This is guaranteed by the VFS.
But newer kernels are different. Looks like the bug exists in those
kernels.
>
> Therefore, the newly added pidlist will point to a freed cgroup, and
> when it is freed in the delayed work we will crash.
>
> The attached patch (compile test ONLY) could be a possible fix, since
> it will check and hold a refcount on this cgroup in
> cgroup_pidlist_start(). But I could very easily miss something here
> since there are many cgroup changes after 3.14 and I don't follow
> cgroup development.
>
> What do you think?
>
* Re: Kernel crash in cgroup_pidlist_destroy_work_fn()
@ 2014-09-17 9:26 ` Li Zefan
From: Li Zefan @ 2014-09-17 9:26 UTC (permalink / raw)
To: Cong Wang; +Cc: Tejun Heo, LKML, cgroups
On 2014/9/17 13:29, Li Zefan wrote:
> On 2014/9/17 7:56, Cong Wang wrote:
>> Hi, Tejun
>>
>>
>> We saw a kernel NULL pointer dereference in
>> cgroup_pidlist_destroy_work_fn(), more precisely in
>> __mutex_lock_slowpath(), on 3.14. I can share the full stack trace
>> on request.
>>
>
> Yes, please.
>
>> Looking at the code, it seems flush_workqueue() doesn't care about newly
>> incoming work items; it only processes the ones already pending. If that
>> is correct, then we could have the following race condition:
>>
>> cgroup_pidlist_destroy_all():
>> 	// ...
>> 	mutex_lock(&cgrp->pidlist_mutex);
>> 	list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
>> 		mod_delayed_work(cgroup_pidlist_destroy_wq,
>> 				 &l->destroy_dwork, 0);
>> 	mutex_unlock(&cgrp->pidlist_mutex);
>>
>> 	// <--- another process calls cgroup_pidlist_start() here,
>> 	//      since the mutex has been released
>>
>> 	flush_workqueue(cgroup_pidlist_destroy_wq);
>> 	// <--- another process adds a new pidlist and queues work in parallel
>>
>> 	BUG_ON(!list_empty(&cgrp->pidlists));
>> 	// <--- this check passes; list_add() could happen after it
>>
>
> Did you confirm this is what happened when the bug was triggered?
>
> I don't think the race condition you described exists. In the 3.14
> kernel, cgroup_diput() won't be called if there is any thread running
> cgroup_pidlist_start(). This is guaranteed by the VFS.
>
> But newer kernels are different. Looks like the bug exists in those
> kernels.
>
Newer kernels should also be fine.
If cgroup_pidlist_destroy_all() is called, it means kernfs has already
removed the tasks file, and even if you still have it open, when you
try to read it the read will immediately return an errno:
fd = open(cgrp/tasks)
			cgroup_rmdir(cgrp)
			  cgroup_destroy_locked(c)
			    kernfs_remove()
			  ...
			  css_free_work_fn()
			    cgroup_pidlist_destroy_all()
read(fd of cgrp/tasks)
  return -ENODEV
So cgroup_pidlist_destroy_all() won't race with cgroup_pidlist_start().
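For what it's worth, that sequence can be demonstrated from userspace
with something like the sketch below (the cgroup path is illustrative,
and the exact errno may vary by kernel version):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[256];

		/* assumes an empty cgroup at this illustrative path */
		int fd = open("/sys/fs/cgroup/cpu/test/tasks", O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* remove the cgroup while its tasks file is still open */
		if (rmdir("/sys/fs/cgroup/cpu/test") < 0) {
			perror("rmdir");
			return 1;
		}

		/* kernfs has removed the file, so the read fails
		 * immediately instead of reaching the pidlist code */
		if (read(fd, buf, sizeof(buf)) < 0)
			printf("read: %s\n", strerror(errno));

		close(fd);
		return 0;
	}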
* Re: Kernel crash in cgroup_pidlist_destroy_work_fn()
@ 2014-09-19 0:23 ` Cong Wang
From: Cong Wang @ 2014-09-19 0:23 UTC (permalink / raw)
To: Li Zefan; +Cc: Tejun Heo, LKML, Cgroups
Hi, Zefan
Thanks for your reply. You are right, the VFS refcount should guarantee
that there are no more file reads before we destroy that cgroup. I
thought something else could release that cgroup refcount.

Maybe I didn't make it clear: this bug is hardly reproducible; we only
saw it once (otherwise I would have more clues and could test my patch).
The full stack trace is at the end of this email, for your reference.
Reading the code again, I thought we could have some race between
in-flight work and pending work, but cgroup_pidlist_destroy_work_fn()
should get this right by checking the pending bit after taking the
mutex. What's more, we should only have one in-flight work item for
this workqueue, so the code seems pretty safe.
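For reference, this is roughly the check I mean in the 3.14 work
function (quoted from memory, so details may differ):

	mutex_lock(&l->owner->pidlist_mutex);

	/*
	 * Destroy iff we didn't get re-queued. The state won't change
	 * because destroy_dwork can only be queued while holding
	 * pidlist_mutex.
	 */
	if (!delayed_work_pending(dwork)) {
		list_del(&l->links);
		pidlist_free(l->list);
		put_pid_ns(l->key.ns);
		tofree = l;
	}

	mutex_unlock(&l->owner->pidlist_mutex);
	kfree(tofree);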
The unbalanced mutex_unlock() you fixed is actually hard to hit, since
memory must run out in that case, and I didn't get any OOM information
from the bug reporter. So I think I can rule that case out.
Thanks!
--------------------------------------->
[15955.313964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000027
[15955.314958] IP: [<ffffffff814ad8b5>] __mutex_lock_slowpath+0x2f/0x1a0
[15955.316088] PGD 0
[15955.316310] Oops: 0000 [#1] SMP
[15955.316810] Modules linked in: tun ipv6 bonding dm_multipath scsi_dh video sbs sbshc acpi_pad acpi_ipmi parport_pc lp parport tcp_diag inet_diag ipmi_devintf dell_rbu sg igb i2c_algo_bit ptp pps_core iTCO_wdt iTCO_vendor_support dcdbas hed ipmi_si ipmi_msghandler i7core_edac i2c_i801 shpchp i2c_core edac_core ioatdma dca lpc_ich mfd_core microcode ahci libahci libata sd_mod scsi_mod
[15955.321422] CPU: 5 PID: 7280 Comm: kworker/5:2 Not tainted 3.14.14 #1
[15955.322275] Hardware name: Dell C6100 /0D61XP, BIOS 1.66 12/23/2011
[15955.323326] Workqueue: cgroup_pidlist_destroy cgroup_pidlist_destroy_work_fn
[15955.324317] task: ffff8800ad899840 ti: ffff880323b50000 task.ti: ffff880323b50000
[15955.325236] RIP: 0010:[<ffffffff814ad8b5>] [<ffffffff814ad8b5>] __mutex_lock_slowpath+0x2f/0x1a0
[15955.326465] RSP: 0018:ffff880323b51dc0 EFLAGS: 00010286
[15955.327178] RAX: ffffffffffffffff RBX: ffff88032c7de0f8 RCX: ffff88032e188ac0
[15955.328061] RDX: 0000000080000000 RSI: 0000000000000140 RDI: ffff88032c7de0f8
[15955.328927] RBP: ffff880323b51e08 R08: ffff88032e188ac0 R09: 0000000000000000
[15955.329784] R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000000
[15955.330778] R13: ffff8800ad899840 R14: ffff880333cb7200 R15: 0000000000000000
[15955.331650] FS: 0000000000000000(0000) GS:ffff880333ca0000(0000) knlGS:0000000000000000
[15955.332616] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[15955.333329] CR2: 0000000000000027 CR3: 0000000001a0c000 CR4: 00000000000007e0
[15955.334584] Stack:
[15955.334798] ffff880323b51e08 ffffffff8100163f ffff880323b51de8 ffffffff8107d6e8
[15955.335751] ffff88032c7de0f8 0000000000000000 ffff880333cb2300 ffff880333cb7200
[15955.336831] 0000000000000000 ffff880323b51e20 ffffffff814ada45 ffff88032e188ab8
[15955.337779] Call Trace:
[15955.338053] [<ffffffff8100163f>] ? __switch_to+0x1ca/0x3d2
[15955.338739] [<ffffffff8107d6e8>] ? mmdrop+0x11/0x20
[15955.339386] [<ffffffff814ada45>] mutex_lock+0x1f/0x2f
[15955.340054] [<ffffffff810bd2b9>] cgroup_pidlist_destroy_work_fn+0x22/0x66
[15955.340898] [<ffffffff810710b9>] process_one_work+0x16d/0x289
[15955.341629] [<ffffffff81071877>] worker_thread+0x149/0x1f5
[15955.342327] [<ffffffff8107172e>] ? rescuer_thread+0x27f/0x27f
[15955.343049] [<ffffffff810764db>] kthread+0xae/0xb6
[15955.343685] [<ffffffff81070000>] ? idle_worker_timeout+0x4/0x5b
[15955.344499] [<ffffffff8107642d>] ? __kthread_parkme+0x61/0x61
[15955.345245] [<ffffffff814b532c>] ret_from_fork+0x7c/0xb0
[15955.345934] [<ffffffff8107642d>] ? __kthread_parkme+0x61/0x61
[15955.346663] Code: 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 83 e4 f0 48 83 ec 20 65 4c 8b 2c 25 80 b8 00 00 48 8b 47 18 48 85 c0 74 05 <8b> 40 28 eb 05 b8 01 00 00 00 85 c0 0f 84 9f 00 00 00 4c 8d 63
[15955.349439] RIP [<ffffffff814ad8b5>] __mutex_lock_slowpath+0x2f/0x1a0
[15955.350270] RSP <ffff880323b51dc0>
[15955.350635] CR2: 0000000000000027
[15955.351481] ---[ end trace 169754b1b86c507e ]---