* net/bluetooth: workqueue destruction WARNING in hci_unregister_dev @ 2016-01-26 11:53 Dmitry Vyukov 2016-01-26 12:29 ` Dmitry Vyukov 2016-02-18 14:00 ` Jiri Slaby 0 siblings, 2 replies; 22+ messages in thread From: Dmitry Vyukov @ 2016-01-26 11:53 UTC (permalink / raw) To: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML Cc: syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet Hello, I've hit the following warning while running syzkaller fuzzer: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 17409 at kernel/workqueue.c:3968 destroy_workqueue+0x172/0x550() Modules linked in: CPU: 2 PID: 17409 Comm: syz-executor Not tainted 4.5.0-rc1+ #283 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 00000000ffffffff ffff88003665f8a0 ffffffff8299a06d 0000000000000000 ffff88003599c740 ffffffff8643f0c0 ffff88003665f8e0 ffffffff8134fcf9 ffffffff8139d4c2 ffffffff8643f0c0 0000000000000f80 ffff8800630c5ae8 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff8299a06d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50 [<ffffffff8134fcf9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482 [<ffffffff8134ff29>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515 [<ffffffff8139d4c2>] destroy_workqueue+0x172/0x550 kernel/workqueue.c:3968 [<ffffffff85a714a4>] hci_unregister_dev+0x264/0x700 net/bluetooth/hci_core.c:3162 [<ffffffff84595ce6>] vhci_release+0x76/0xe0 drivers/bluetooth/hci_vhci.c:341 [<ffffffff817b2376>] __fput+0x236/0x780 fs/file_table.c:208 [<ffffffff817b2945>] ____fput+0x15/0x20 fs/file_table.c:244 [<ffffffff813ad760>] task_work_run+0x170/0x210 kernel/task_work.c:115 [< inline >] exit_task_work include/linux/task_work.h:21 [<ffffffff81358da5>] do_exit+0x8b5/0x2c60 kernel/exit.c:748 [<ffffffff8135b2c8>] do_group_exit+0x108/0x330 kernel/exit.c:878 [<ffffffff8137e434>] get_signal+0x5e4/0x14f0 kernel/signal.c:2307 [<ffffffff811a1db3>] do_signal+0x83/0x1c90 arch/x86/kernel/signal.c:712 [<ffffffff81006685>] exit_to_usermode_loop+0x1a5/0x210 arch/x86/entry/common.c:247 [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282 [<ffffffff810084ea>] syscall_return_slowpath+0x2ba/0x340 arch/x86/entry/common.c:344 [<ffffffff863597a2>] int_ret_from_sys_call+0x25/0x9f arch/x86/entry/entry_64.S:281 ---[ end trace f627386faee7426f ]--- Unfortunately I cannot reproduce it in a controlled environment, but I've hit it twice in different VMs. So maybe if you see something obvious there. Is it possible that something is submitted into the workqueue between it is drained and destroyed in hci_unregister_dev? On commit 92e963f50fc74041b5e9e744c330dca48e04f08d (Jan 24). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-01-26 11:53 net/bluetooth: workqueue destruction WARNING in hci_unregister_dev Dmitry Vyukov @ 2016-01-26 12:29 ` Dmitry Vyukov 2016-02-18 14:00 ` Jiri Slaby 1 sibling, 0 replies; 22+ messages in thread From: Dmitry Vyukov @ 2016-01-26 12:29 UTC (permalink / raw) To: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML Cc: syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet On Tue, Jan 26, 2016 at 12:53 PM, Dmitry Vyukov <dvyukov@google.com> wrote: > Hello, > > I've hit the following warning while running syzkaller fuzzer: > > ------------[ cut here ]------------ > WARNING: CPU: 2 PID: 17409 at kernel/workqueue.c:3968 > destroy_workqueue+0x172/0x550() > Modules linked in: > CPU: 2 PID: 17409 Comm: syz-executor Not tainted 4.5.0-rc1+ #283 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 00000000ffffffff ffff88003665f8a0 ffffffff8299a06d 0000000000000000 > ffff88003599c740 ffffffff8643f0c0 ffff88003665f8e0 ffffffff8134fcf9 > ffffffff8139d4c2 ffffffff8643f0c0 0000000000000f80 ffff8800630c5ae8 > Call Trace: > [< inline >] __dump_stack lib/dump_stack.c:15 > [<ffffffff8299a06d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50 > [<ffffffff8134fcf9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482 > [<ffffffff8134ff29>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515 > [<ffffffff8139d4c2>] destroy_workqueue+0x172/0x550 kernel/workqueue.c:3968 > [<ffffffff85a714a4>] hci_unregister_dev+0x264/0x700 > net/bluetooth/hci_core.c:3162 > [<ffffffff84595ce6>] vhci_release+0x76/0xe0 drivers/bluetooth/hci_vhci.c:341 > [<ffffffff817b2376>] __fput+0x236/0x780 fs/file_table.c:208 > [<ffffffff817b2945>] ____fput+0x15/0x20 fs/file_table.c:244 > [<ffffffff813ad760>] task_work_run+0x170/0x210 kernel/task_work.c:115 > [< inline >] exit_task_work include/linux/task_work.h:21 > [<ffffffff81358da5>] do_exit+0x8b5/0x2c60 kernel/exit.c:748 > [<ffffffff8135b2c8>] do_group_exit+0x108/0x330 kernel/exit.c:878 > [<ffffffff8137e434>] get_signal+0x5e4/0x14f0 kernel/signal.c:2307 > [<ffffffff811a1db3>] do_signal+0x83/0x1c90 arch/x86/kernel/signal.c:712 > [<ffffffff81006685>] exit_to_usermode_loop+0x1a5/0x210 > arch/x86/entry/common.c:247 > [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282 > [<ffffffff810084ea>] syscall_return_slowpath+0x2ba/0x340 > arch/x86/entry/common.c:344 > [<ffffffff863597a2>] int_ret_from_sys_call+0x25/0x9f > arch/x86/entry/entry_64.S:281 > ---[ end trace f627386faee7426f ]--- > > Unfortunately I cannot reproduce it in a controlled environment, but > I've hit it twice in different VMs. So maybe if you see something > obvious there. Is it possible that something is submitted into the > workqueue between it is drained and destroyed in hci_unregister_dev? > > On commit 92e963f50fc74041b5e9e744c330dca48e04f08d (Jan 24). Wait, I was able to reproduce it by running the following program in a parallel loop for several hours: https://gist.githubusercontent.com/dvyukov/c15675af95e599fe6631/raw/c34117b54d0352f5df4f572d4151a94557780a9b/gistfile1.txt I think that you just need to open /dev/vhci, because the program does not seem to do anything useful with the descriptor. Machine died, full log is here: https://gist.githubusercontent.com/dvyukov/4584d70c2b4bee7ca875/raw/3794fb9f4c8e0fb2f91261a66725b042b5de4e0f/gistfile1.txt ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-01-26 11:53 net/bluetooth: workqueue destruction WARNING in hci_unregister_dev Dmitry Vyukov 2016-01-26 12:29 ` Dmitry Vyukov @ 2016-02-18 14:00 ` Jiri Slaby 2016-02-18 14:22 ` Dmitry Vyukov 1 sibling, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-02-18 14:00 UTC (permalink / raw) To: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML Cc: syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Tejun Heo, Takashi Iwai [-- Attachment #1: Type: text/plain, Size: 3509 bytes --] Cc Tejun (workqueues), Takashi (debug patch) On 01/26/2016, 12:53 PM, Dmitry Vyukov wrote: > Hello, > > I've hit the following warning while running syzkaller fuzzer: Hi, I am hitting it over and over again using syzkaller. > WARNING: CPU: 2 PID: 17409 at kernel/workqueue.c:3968 > destroy_workqueue+0x172/0x550() Which of the warnings is it in your case? I hit "(pwq != wq->dfl_pwq) && (pwq->refcnt > 1)". So I have this in the code: if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1))) { pr_info("%s: pwq=%p wq->dfl_pwq=%p pwq->refcnt=%d\n", __func__, pwq, wq->dfl_pwq, pwq->refcnt); mutex_unlock(&wq->mutex); return; } if (WARN_ON(pwq->nr_active)) { ... And the values are: pwq=ffff88006e271500 wq->dfl_pwq=ffff88006e272400 pwq->refcnt=2 pwq=ffff88002ec48300 wq->dfl_pwq=ffff88002ec4bc00 pwq->refcnt=2 pwq=ffff880020f76400 wq->dfl_pwq=ffff880020f75500 pwq->refcnt=2 pwq=ffff88002a908f00 wq->dfl_pwq=ffff88002a90bc00 pwq->refcnt=1 pwq=ffff8800348e0300 wq->dfl_pwq=ffff8800348e0000 pwq->refcnt=2 pwq=ffff88006e276400 wq->dfl_pwq=ffff88006e275800 pwq->refcnt=2 Note the single "pwq->refcnt=1" in there. So this is perhaps a race? Takashi also provided me with a debug patch included in the attached patch. It did not trigger though. > CPU: 2 PID: 17409 Comm: syz-executor Not tainted 4.5.0-rc1+ #283 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 00000000ffffffff ffff88003665f8a0 ffffffff8299a06d 0000000000000000 > ffff88003599c740 ffffffff8643f0c0 ffff88003665f8e0 ffffffff8134fcf9 > ffffffff8139d4c2 ffffffff8643f0c0 0000000000000f80 ffff8800630c5ae8 > Call Trace: > [< inline >] __dump_stack lib/dump_stack.c:15 > [<ffffffff8299a06d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50 > [<ffffffff8134fcf9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482 > [<ffffffff8134ff29>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515 > [<ffffffff8139d4c2>] destroy_workqueue+0x172/0x550 kernel/workqueue.c:3968 > [<ffffffff85a714a4>] hci_unregister_dev+0x264/0x700 > net/bluetooth/hci_core.c:3162 > [<ffffffff84595ce6>] vhci_release+0x76/0xe0 drivers/bluetooth/hci_vhci.c:341 > [<ffffffff817b2376>] __fput+0x236/0x780 fs/file_table.c:208 > [<ffffffff817b2945>] ____fput+0x15/0x20 fs/file_table.c:244 > [<ffffffff813ad760>] task_work_run+0x170/0x210 kernel/task_work.c:115 > [< inline >] exit_task_work include/linux/task_work.h:21 > [<ffffffff81358da5>] do_exit+0x8b5/0x2c60 kernel/exit.c:748 > [<ffffffff8135b2c8>] do_group_exit+0x108/0x330 kernel/exit.c:878 > [<ffffffff8137e434>] get_signal+0x5e4/0x14f0 kernel/signal.c:2307 > [<ffffffff811a1db3>] do_signal+0x83/0x1c90 arch/x86/kernel/signal.c:712 > [<ffffffff81006685>] exit_to_usermode_loop+0x1a5/0x210 > arch/x86/entry/common.c:247 > [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282 > [<ffffffff810084ea>] syscall_return_slowpath+0x2ba/0x340 > arch/x86/entry/common.c:344 > [<ffffffff863597a2>] int_ret_from_sys_call+0x25/0x9f > arch/x86/entry/entry_64.S:281 > ---[ end trace f627386faee7426f ]--- > > Unfortunately I cannot reproduce it in a controlled environment, but > I've hit it twice in different VMs. So maybe if you see something > obvious there. Is it possible that something is submitted into the > workqueue between it is drained and destroyed in hci_unregister_dev? > > On commit 92e963f50fc74041b5e9e744c330dca48e04f08d (Jan 24). > thanks, -- js suse labs [-- Attachment #2: vhci_debug.patch --] [-- Type: text/x-patch, Size: 3862 bytes --] --- include/linux/workqueue.h | 1 + kernel/workqueue.c | 23 ++++++++++++++++++++--- net/bluetooth/hci_core.c | 19 +++++++++++++++++++ 3 files changed, 40 insertions(+), 3 deletions(-) --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -312,6 +312,7 @@ enum { __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */ __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */ __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */ + __WQ_DESTROYING = 1 << 19, WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */ WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */ --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1366,6 +1366,9 @@ static void __queue_work(int cpu, struct unsigned int work_flags; unsigned int req_cpu = cpu; + if (WARN_ON(wq->flags & __WQ_DESTROYING)) + return; + /* * While a work item is PENDING && off queue, a task trying to * steal the PENDING will busy-loop waiting for it to either get @@ -4010,6 +4013,7 @@ void destroy_workqueue(struct workqueue_ int node; /* drain it before proceeding with destruction */ + wq->flags |= __WQ_DESTROYING; drain_workqueue(wq); /* sanity checks */ @@ -4024,9 +4028,22 @@ void destroy_workqueue(struct workqueue_ } } - if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) || - WARN_ON(pwq->nr_active) || - WARN_ON(!list_empty(&pwq->delayed_works))) { + if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1))) { + pr_info("%s: pwq=%p wq->dfl_pwq=%p pwq->refcnt=%d\n", + __func__, pwq, wq->dfl_pwq, + pwq->refcnt); + mutex_unlock(&wq->mutex); + return; + } + + if (WARN_ON(pwq->nr_active)) { + pr_info("%s: %ps\n", __func__, wq); + mutex_unlock(&wq->mutex); + return; + } + + if (WARN_ON(!list_empty(&pwq->delayed_works))) { + pr_info("%s: %ps\n", __func__, wq); mutex_unlock(&wq->mutex); return; } --- a/net/bluetooth/hci_core.c +++ b/net/bluetooth/hci_core.c @@ -2467,6 +2467,8 @@ static void hci_cmd_timeout(struct work_ struct hci_dev *hdev = container_of(work, struct hci_dev, cmd_timer.work); + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return; if (hdev->sent_cmd) { struct hci_command_hdr *sent = (void *) hdev->sent_cmd->data; u16 opcode = __le16_to_cpu(sent->opcode); @@ -3225,6 +3227,11 @@ int hci_recv_frame(struct hci_dev *hdev, return -ENXIO; } + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) { + kfree_skb(skb); + return -ENXIO; + } + if (hci_skb_pkt_type(skb) != HCI_EVENT_PKT && hci_skb_pkt_type(skb) != HCI_ACLDATA_PKT && hci_skb_pkt_type(skb) != HCI_SCODATA_PKT) { @@ -3248,6 +3255,9 @@ EXPORT_SYMBOL(hci_recv_frame); /* Receive diagnostic message from HCI drivers */ int hci_recv_diag(struct hci_dev *hdev, struct sk_buff *skb) { + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return -ENXIO; + /* Mark as diagnostic packet */ hci_skb_pkt_type(skb) = HCI_DIAG_PKT; @@ -3326,6 +3336,9 @@ int hci_send_cmd(struct hci_dev *hdev, _ { struct sk_buff *skb; + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return -ENXIO; + BT_DBG("%s opcode 0x%4.4x plen %d", hdev->name, opcode, plen); skb = hci_prepare_cmd(hdev, opcode, plen, param); @@ -3461,6 +3474,9 @@ void hci_send_acl(struct hci_chan *chan, { struct hci_dev *hdev = chan->conn->hdev; + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return; + BT_DBG("%s chan %p flags 0x%4.4x", hdev->name, chan, flags); hci_queue_acl(chan, &chan->data_q, skb, flags); @@ -3474,6 +3490,9 @@ void hci_send_sco(struct hci_conn *conn, struct hci_dev *hdev = conn->hdev; struct hci_sco_hdr hdr; + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return; + BT_DBG("%s len %d", hdev->name, skb->len); hdr.handle = cpu_to_le16(conn->handle); ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-02-18 14:00 ` Jiri Slaby @ 2016-02-18 14:22 ` Dmitry Vyukov 2016-02-18 17:44 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Dmitry Vyukov @ 2016-02-18 14:22 UTC (permalink / raw) To: Jiri Slaby Cc: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Tejun Heo, Takashi Iwai On Thu, Feb 18, 2016 at 3:00 PM, Jiri Slaby <jslaby@suse.cz> wrote: > Cc Tejun (workqueues), Takashi (debug patch) > > On 01/26/2016, 12:53 PM, Dmitry Vyukov wrote: >> Hello, >> >> I've hit the following warning while running syzkaller fuzzer: > > Hi, > > I am hitting it over and over again using syzkaller. > >> WARNING: CPU: 2 PID: 17409 at kernel/workqueue.c:3968 >> destroy_workqueue+0x172/0x550() > > Which of the warnings is it in your case? The line number points to WARN_ON(pwq->nr_active), but there can well be a off-by-one error, so I am not sure. If you are hitting it all the time, so probably it is the same. Or at least if makes sense to fix yours first since it is more easily reproducible. > I hit "(pwq != wq->dfl_pwq) && (pwq->refcnt > 1)". So I have this in the > code: > if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1))) { > pr_info("%s: pwq=%p wq->dfl_pwq=%p pwq->refcnt=%d\n", > __func__, pwq, wq->dfl_pwq, > pwq->refcnt); > mutex_unlock(&wq->mutex); > return; > } > > if (WARN_ON(pwq->nr_active)) { > ... > > And the values are: > pwq=ffff88006e271500 wq->dfl_pwq=ffff88006e272400 pwq->refcnt=2 > pwq=ffff88002ec48300 wq->dfl_pwq=ffff88002ec4bc00 pwq->refcnt=2 > pwq=ffff880020f76400 wq->dfl_pwq=ffff880020f75500 pwq->refcnt=2 > pwq=ffff88002a908f00 wq->dfl_pwq=ffff88002a90bc00 pwq->refcnt=1 > pwq=ffff8800348e0300 wq->dfl_pwq=ffff8800348e0000 pwq->refcnt=2 > pwq=ffff88006e276400 wq->dfl_pwq=ffff88006e275800 pwq->refcnt=2 > > > Note the single "pwq->refcnt=1" in there. So this is perhaps a race? > > Takashi also provided me with a debug patch included in the attached > patch. It did not trigger though. > >> CPU: 2 PID: 17409 Comm: syz-executor Not tainted 4.5.0-rc1+ #283 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> 00000000ffffffff ffff88003665f8a0 ffffffff8299a06d 0000000000000000 >> ffff88003599c740 ffffffff8643f0c0 ffff88003665f8e0 ffffffff8134fcf9 >> ffffffff8139d4c2 ffffffff8643f0c0 0000000000000f80 ffff8800630c5ae8 >> Call Trace: >> [< inline >] __dump_stack lib/dump_stack.c:15 >> [<ffffffff8299a06d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50 >> [<ffffffff8134fcf9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482 >> [<ffffffff8134ff29>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515 >> [<ffffffff8139d4c2>] destroy_workqueue+0x172/0x550 kernel/workqueue.c:3968 >> [<ffffffff85a714a4>] hci_unregister_dev+0x264/0x700 >> net/bluetooth/hci_core.c:3162 >> [<ffffffff84595ce6>] vhci_release+0x76/0xe0 drivers/bluetooth/hci_vhci.c:341 >> [<ffffffff817b2376>] __fput+0x236/0x780 fs/file_table.c:208 >> [<ffffffff817b2945>] ____fput+0x15/0x20 fs/file_table.c:244 >> [<ffffffff813ad760>] task_work_run+0x170/0x210 kernel/task_work.c:115 >> [< inline >] exit_task_work include/linux/task_work.h:21 >> [<ffffffff81358da5>] do_exit+0x8b5/0x2c60 kernel/exit.c:748 >> [<ffffffff8135b2c8>] do_group_exit+0x108/0x330 kernel/exit.c:878 >> [<ffffffff8137e434>] get_signal+0x5e4/0x14f0 kernel/signal.c:2307 >> [<ffffffff811a1db3>] do_signal+0x83/0x1c90 arch/x86/kernel/signal.c:712 >> [<ffffffff81006685>] exit_to_usermode_loop+0x1a5/0x210 >> arch/x86/entry/common.c:247 >> [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282 >> [<ffffffff810084ea>] syscall_return_slowpath+0x2ba/0x340 >> arch/x86/entry/common.c:344 >> [<ffffffff863597a2>] int_ret_from_sys_call+0x25/0x9f >> arch/x86/entry/entry_64.S:281 >> ---[ end trace f627386faee7426f ]--- >> >> Unfortunately I cannot reproduce it in a controlled environment, but >> I've hit it twice in different VMs. So maybe if you see something >> obvious there. Is it possible that something is submitted into the >> workqueue between it is drained and destroyed in hci_unregister_dev? >> >> On commit 92e963f50fc74041b5e9e744c330dca48e04f08d (Jan 24). >> > > thanks, > -- > js > suse labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-02-18 14:22 ` Dmitry Vyukov @ 2016-02-18 17:44 ` Tejun Heo [not found] ` <20160218174427.GG13177-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2016-02-18 17:44 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai [-- Attachment #1: Type: text/plain, Size: 339 bytes --] Hello, Can you please do the followings? 1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether the problem is reproducible. WQ_MEM_RECLAIM on anything bluetooth doesn't make sense btw. Why is it there? 2. If WQ_MEM_RECLAIM makes the issue go away, see whether the attached patch works too. Thanks. -- tejun [-- Attachment #2: patch --] [-- Type: text/plain, Size: 1183 bytes --] diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 7ff5dc7..9824d4f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4012,7 +4012,7 @@ void destroy_workqueue(struct workqueue_struct *wq) /* drain it before proceeding with destruction */ drain_workqueue(wq); - /* sanity checks */ + /* nothing should be in flight */ mutex_lock(&wq->mutex); for_each_pwq(pwq, wq) { int i; @@ -4024,8 +4024,7 @@ void destroy_workqueue(struct workqueue_struct *wq) } } - if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) || - WARN_ON(pwq->nr_active) || + if (WARN_ON(pwq->nr_active) || WARN_ON(!list_empty(&pwq->delayed_works))) { mutex_unlock(&wq->mutex); return; @@ -4046,6 +4045,13 @@ void destroy_workqueue(struct workqueue_struct *wq) if (wq->rescuer) kthread_stop(wq->rescuer->task); + /* rescuer is gone, everything should be quiescent now */ + WARN_ON(!list_empty(&wq->maydays)); + mutex_lock(&wq->mutex); + for_each_pwq(pwq, wq) + WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)); + mutex_unlock(&wq->mutex); + if (!(wq->flags & WQ_UNBOUND)) { /* * The base ref is never dropped on per-cpu pwqs. Directly ^ permalink raw reply related [flat|nested] 22+ messages in thread
[parent not found: <20160218174427.GG13177-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev [not found] ` <20160218174427.GG13177-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2016-02-19 10:20 ` Jiri Slaby 2016-02-19 12:10 ` Jiri Slaby 0 siblings, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-02-19 10:20 UTC (permalink / raw) To: Tejun Heo, Dmitry Vyukov Cc: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth-u79uwXL29TY76Z2rM5mHXA, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai On 02/18/2016, 06:44 PM, Tejun Heo wrote: > Hello, > > Can you please do the followings? > > 1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether > the problem is reproducible. WQ_MEM_RECLAIM on anything bluetooth > doesn't make sense btw. Why is it there? > > 2. If WQ_MEM_RECLAIM makes the issue go away, see whether the attached > patch works too. Hello, 1. didn't help, the problem persists. So I haven't applied the patch from 2. thanks, -- js suse labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-02-19 10:20 ` Jiri Slaby @ 2016-02-19 12:10 ` Jiri Slaby 2016-03-02 15:45 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-02-19 12:10 UTC (permalink / raw) To: Tejun Heo, Dmitry Vyukov Cc: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai On 02/19/2016, 11:20 AM, Jiri Slaby wrote: > On 02/18/2016, 06:44 PM, Tejun Heo wrote: >> Hello, >> >> Can you please do the followings? >> >> 1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether >> the problem is reproducible. WQ_MEM_RECLAIM on anything bluetooth >> doesn't make sense btw. Why is it there? >> >> 2. If WQ_MEM_RECLAIM makes the issue go away, see whether the attached >> patch works too. > > Hello, > > 1. didn't help, the problem persists. So I haven't applied the patch from 2. FWIW I dumped more info about the wq: wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200 pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing> > thanks, -- js suse labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-02-19 12:10 ` Jiri Slaby @ 2016-03-02 15:45 ` Tejun Heo [not found] ` <20160302154507.GC4282-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2016-03-02 15:45 UTC (permalink / raw) To: Jiri Slaby Cc: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai Hello, Jiri. On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote: > > 1. didn't help, the problem persists. So I haven't applied the patch from 2. > > FWIW I dumped more info about the wq: > wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200 > pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing> Can you please print out the same info for all pwq's during shutdown? It looks like we're leaking pwq refcnt but I can't spot a place where that could happen on an empty pwq. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <20160302154507.GC4282-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev [not found] ` <20160302154507.GC4282-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2016-03-03 9:12 ` Jiri Slaby 2016-03-11 17:12 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-03-03 9:12 UTC (permalink / raw) To: Tejun Heo Cc: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth-u79uwXL29TY76Z2rM5mHXA, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai Hi, On 03/02/2016, 04:45 PM, Tejun Heo wrote: > On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote: >>> 1. didn't help, the problem persists. So I haven't applied the patch from 2. >> >> FWIW I dumped more info about the wq: >> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200 >> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing> > > Can you please print out the same info for all pwq's during shutdown? > It looks like we're leaking pwq refcnt but I can't spot a place where > that could happen on an empty pwq. I have not done that yet, but today, I see: destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 in-flight: 18568:wq_barrier_func thanks, -- js suse labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-03-03 9:12 ` Jiri Slaby @ 2016-03-11 17:12 ` Tejun Heo [not found] ` <20160311171205.GB24046-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2016-03-11 17:12 UTC (permalink / raw) To: Jiri Slaby Cc: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai Hello, Sorry about the delay. On Thu, Mar 03, 2016 at 10:12:01AM +0100, Jiri Slaby wrote: > On 03/02/2016, 04:45 PM, Tejun Heo wrote: > > On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote: > >>> 1. didn't help, the problem persists. So I haven't applied the patch from 2. > >> > >> FWIW I dumped more info about the wq: > >> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200 > >> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing> > > > > Can you please print out the same info for all pwq's during shutdown? > > It looks like we're leaking pwq refcnt but I can't spot a place where > > that could happen on an empty pwq. > > I have not done that yet, but today, I see: > destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 > wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: > pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 > in-flight: 18568:wq_barrier_func So, this means that there's flush_work() racing against workqueue destruction, which can't be safe. :( Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <20160311171205.GB24046-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev [not found] ` <20160311171205.GB24046-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-03-17 12:00 ` Jiri Slaby [not found] ` <56EA9C4D.2080803-AlSwsSmVLrQ@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-03-17 12:00 UTC (permalink / raw) To: Tejun Heo Cc: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth-u79uwXL29TY76Z2rM5mHXA, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai [-- Attachment #1: Type: text/plain, Size: 1327 bytes --] Hello, On 03/11/2016, 06:12 PM, Tejun Heo wrote: > On Thu, Mar 03, 2016 at 10:12:01AM +0100, Jiri Slaby wrote: >> On 03/02/2016, 04:45 PM, Tejun Heo wrote: >>> On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote: >>>>> 1. didn't help, the problem persists. So I haven't applied the patch from 2. >>>> >>>> FWIW I dumped more info about the wq: >>>> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200 >>>> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing> >>> >>> Can you please print out the same info for all pwq's during shutdown? >>> It looks like we're leaking pwq refcnt but I can't spot a place where >>> that could happen on an empty pwq. >> >> I have not done that yet, but today, I see: >> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 >> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: >> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 >> in-flight: 18568:wq_barrier_func > > So, this means that there's flush_work() racing against workqueue > destruction, which can't be safe. :( But I cannot trigger the WARN_ONs in the attached patch, so I am confused how this can happen :(. (While I am still seeing the destroy WARNINGs.) BTW. what did you mean by dumping the states at shutdown? Is it still relevant? thanks, -- js suse labs [-- Attachment #2: vhci_debug.patch --] [-- Type: text/x-patch, Size: 7549 bytes --] --- include/linux/workqueue.h | 1 + include/net/bluetooth/hci_core.h | 5 +++++ kernel/reboot.c | 1 + kernel/workqueue.c | 34 +++++++++++++++++++++++++++++++--- net/bluetooth/hci_core.c | 39 +++++++++++++++++++++++++++++++++++---- 5 files changed, 73 insertions(+), 7 deletions(-) --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -312,6 +312,7 @@ enum { __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */ __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */ __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */ + __WQ_DESTROYING = 1 << 19, WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */ WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */ --- a/include/net/bluetooth/hci_core.h +++ b/include/net/bluetooth/hci_core.h @@ -312,6 +312,11 @@ struct hci_dev { struct workqueue_struct *workqueue; struct workqueue_struct *req_workqueue; +#define HCI_WQ_A 0 +#define HCI_WQ_D 1 +#define HCI_WQR_A 8 +#define HCI_WQR_D 9 + unsigned long wq_status; struct work_struct power_on; struct delayed_work power_off; --- a/kernel/reboot.c +++ b/kernel/reboot.c @@ -231,6 +231,7 @@ static void kernel_shutdown_prepare(enum (state == SYSTEM_HALT) ? SYS_HALT : SYS_POWER_OFF, NULL); system_state = state; usermodehelper_disable(); + show_workqueue_state(); device_shutdown(); } /** --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1366,6 +1366,9 @@ static void __queue_work(int cpu, struct unsigned int work_flags; unsigned int req_cpu = cpu; + if (WARN_ON(wq->flags & __WQ_DESTROYING)) + return; + /* * While a work item is PENDING && off queue, a task trying to * steal the PENDING will busy-loop waiting for it to either get @@ -2804,6 +2807,9 @@ static bool start_flush_work(struct work pwq = worker->current_pwq; } + if (WARN_ON(pwq->wq->flags & __WQ_DESTROYING)) + return false; + check_flush_dependency(pwq->wq, work); insert_wq_barrier(pwq, barr, work, worker); @@ -2821,6 +2827,8 @@ static bool start_flush_work(struct work lock_map_acquire_read(&pwq->wq->lockdep_map); lock_map_release(&pwq->wq->lockdep_map); + WARN_ON(pwq->wq->flags & __WQ_DESTROYING); + return true; already_gone: spin_unlock_irq(&pool->lock); @@ -3998,6 +4006,8 @@ err_destroy: } EXPORT_SYMBOL_GPL(__alloc_workqueue_key); +static void show_pwq(struct pool_workqueue *pwq); + /** * destroy_workqueue - safely terminate a workqueue * @wq: target workqueue @@ -4010,6 +4020,7 @@ void destroy_workqueue(struct workqueue_ int node; /* drain it before proceeding with destruction */ + wq->flags |= __WQ_DESTROYING; drain_workqueue(wq); /* sanity checks */ @@ -4024,9 +4035,26 @@ void destroy_workqueue(struct workqueue_ } } - if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) || - WARN_ON(pwq->nr_active) || - WARN_ON(!list_empty(&pwq->delayed_works))) { + if ((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) { + pr_info("%s: name='%s' pwq=%p wq->dfl_pwq=%p pwq->refcnt=%d pwq->nr_active=%d delayed_works:", + __func__, wq->name, pwq, wq->dfl_pwq, + pwq->refcnt, pwq->nr_active); + + show_pwq(pwq); + + mutex_unlock(&wq->mutex); + WARN_ON(1); + return; + } + + if (WARN_ON(pwq->nr_active)) { + pr_info("%s: %ps\n", __func__, wq); + mutex_unlock(&wq->mutex); + return; + } + + if (WARN_ON(!list_empty(&pwq->delayed_works))) { + pr_info("%s: %ps\n", __func__, wq); mutex_unlock(&wq->mutex); return; } --- a/net/bluetooth/hci_core.c +++ b/net/bluetooth/hci_core.c @@ -1471,7 +1471,11 @@ int hci_dev_open(__u16 dev) * has finished. This means that error conditions like RFKILL * or no valid public or static random address apply. */ + WARN_ON(test_bit(HCI_WQR_D, &hdev->wq_status)); + WARN_ON(!test_bit(HCI_WQR_A, &hdev->wq_status)); flush_workqueue(hdev->req_workqueue); + WARN_ON(test_bit(HCI_WQR_D, &hdev->wq_status)); + WARN_ON(!test_bit(HCI_WQR_A, &hdev->wq_status)); /* For controllers not using the management interface and that * are brought up using legacy ioctl, set the HCI_BONDABLE bit @@ -2467,6 +2471,8 @@ static void hci_cmd_timeout(struct work_ struct hci_dev *hdev = container_of(work, struct hci_dev, cmd_timer.work); + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return; if (hdev->sent_cmd) { struct hci_command_hdr *sent = (void *) hdev->sent_cmd->data; u16 opcode = __le16_to_cpu(sent->opcode); @@ -3043,15 +3049,17 @@ int hci_register_dev(struct hci_dev *hde BT_DBG("%p name %s bus %d", hdev, hdev->name, hdev->bus); - hdev->workqueue = alloc_workqueue("%s", WQ_HIGHPRI | WQ_UNBOUND | - WQ_MEM_RECLAIM, 1, hdev->name); + hdev->workqueue = alloc_workqueue("%s", WQ_HIGHPRI | WQ_UNBOUND + , 1, hdev->name); + set_bit(HCI_WQ_A, &hdev->wq_status); if (!hdev->workqueue) { error = -ENOMEM; goto err; } - hdev->req_workqueue = alloc_workqueue("%s", WQ_HIGHPRI | WQ_UNBOUND | - WQ_MEM_RECLAIM, 1, hdev->name); + hdev->req_workqueue = alloc_workqueue("req_%s", WQ_HIGHPRI | WQ_UNBOUND + , 1, hdev->name); + set_bit(HCI_WQR_A, &hdev->wq_status); if (!hdev->req_workqueue) { destroy_workqueue(hdev->workqueue); error = -ENOMEM; @@ -3108,8 +3116,12 @@ int hci_register_dev(struct hci_dev *hde return id; err_wqueue: + set_bit(HCI_WQ_D, &hdev->wq_status); destroy_workqueue(hdev->workqueue); + clear_bit(HCI_WQ_A, &hdev->wq_status); + set_bit(HCI_WQR_D, &hdev->wq_status); destroy_workqueue(hdev->req_workqueue); + clear_bit(HCI_WQR_A, &hdev->wq_status); err: ida_simple_remove(&hci_index_ida, hdev->id); @@ -3160,7 +3172,9 @@ void hci_unregister_dev(struct hci_dev * debugfs_remove_recursive(hdev->debugfs); destroy_workqueue(hdev->workqueue); + set_bit(HCI_WQR_D, &hdev->wq_status); destroy_workqueue(hdev->req_workqueue); + clear_bit(HCI_WQR_A, &hdev->wq_status); hci_dev_lock(hdev); hci_bdaddr_list_clear(&hdev->blacklist); @@ -3225,6 +3239,11 @@ int hci_recv_frame(struct hci_dev *hdev, return -ENXIO; } + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) { + kfree_skb(skb); + return -ENXIO; + } + if (hci_skb_pkt_type(skb) != HCI_EVENT_PKT && hci_skb_pkt_type(skb) != HCI_ACLDATA_PKT && hci_skb_pkt_type(skb) != HCI_SCODATA_PKT) { @@ -3248,6 +3267,9 @@ EXPORT_SYMBOL(hci_recv_frame); /* Receive diagnostic message from HCI drivers */ int hci_recv_diag(struct hci_dev *hdev, struct sk_buff *skb) { + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return -ENXIO; + /* Mark as diagnostic packet */ hci_skb_pkt_type(skb) = HCI_DIAG_PKT; @@ -3326,6 +3348,9 @@ int hci_send_cmd(struct hci_dev *hdev, _ { struct sk_buff *skb; + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return -ENXIO; + BT_DBG("%s opcode 0x%4.4x plen %d", hdev->name, opcode, plen); skb = hci_prepare_cmd(hdev, opcode, plen, param); @@ -3461,6 +3486,9 @@ void hci_send_acl(struct hci_chan *chan, { struct hci_dev *hdev = chan->conn->hdev; + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return; + BT_DBG("%s chan %p flags 0x%4.4x", hdev->name, chan, flags); hci_queue_acl(chan, &chan->data_q, skb, flags); @@ -3474,6 +3502,9 @@ void hci_send_sco(struct hci_conn *conn, struct hci_dev *hdev = conn->hdev; struct hci_sco_hdr hdr; + if (WARN_ON(hci_dev_test_flag(hdev, HCI_UNREGISTER))) + return; + BT_DBG("%s len %d", hdev->name, skb->len); hdr.handle = cpu_to_le16(conn->handle); ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <56EA9C4D.2080803-AlSwsSmVLrQ@public.gmane.org>]
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev [not found] ` <56EA9C4D.2080803-AlSwsSmVLrQ@public.gmane.org> @ 2016-03-18 20:52 ` Tejun Heo 2016-03-21 15:58 ` Jiri Slaby 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2016-03-18 20:52 UTC (permalink / raw) To: Jiri Slaby Cc: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth-u79uwXL29TY76Z2rM5mHXA, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai Hello, Jiri. On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote: > >> I have not done that yet, but today, I see: > >> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 > >> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: > >> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 > >> in-flight: 18568:wq_barrier_func > > > > So, this means that there's flush_work() racing against workqueue > > destruction, which can't be safe. :( > > But I cannot trigger the WARN_ONs in the attached patch, so I am > confused how this can happen :(. (While I am still seeing the destroy > WARNINGs.) So, no operations should be in progress when destroy_workqueue() is called. If somebody was flushing a work item, the flush call must have returned before destroy_workqueue() was invoked, which doesn't seem to be the case here. Can you trigger BUG_ON() or sysrq-t when the above triggers? There must be a task which is flushing a work item there and it shouldn't be difficult to pinpoint what's going on from it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-03-18 20:52 ` Tejun Heo @ 2016-03-21 15:58 ` Jiri Slaby [not found] ` <56F01A1C.40208-AlSwsSmVLrQ@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-03-21 15:58 UTC (permalink / raw) To: Tejun Heo Cc: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai Hello, On 03/18/2016, 09:52 PM, Tejun Heo wrote: > On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote: >>>> I have not done that yet, but today, I see: >>>> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 >>>> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: >>>> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 >>>> in-flight: 18568:wq_barrier_func >>> >>> So, this means that there's flush_work() racing against workqueue >>> destruction, which can't be safe. :( >> >> But I cannot trigger the WARN_ONs in the attached patch, so I am >> confused how this can happen :(. (While I am still seeing the destroy >> WARNINGs.) > > So, no operations should be in progress when destroy_workqueue() is > called. If somebody was flushing a work item, the flush call must > have returned before destroy_workqueue() was invoked, which doesn't > seem to be the case here. Can you trigger BUG_ON() or sysrq-t when > the above triggers? There must be a task which is flushing a work > item there and it shouldn't be difficult to pinpoint what's going on > from it. The output of sysrq-t is here (> 200k), but I cannot see anything suspicious in it: http://www.fi.muni.cz/~xslaby/sklad/panics/jctl.txt This is what the code does now: + if ((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) { + pr_info("%s: name='%s' pwq=%p wq->dfl_pwq=%p pwq->refcnt=%d pwq->nr_active=%d delayed_works:", + __func__, wq->name, pwq, wq->dfl_pwq, + pwq->refcnt, pwq->nr_active); + + show_pwq(pwq); + + mutex_unlock(&wq->mutex); + show_state(); + show_workqueue_state(); + WARN_ON(1); + return; + } thanks, -- js suse labs ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <56F01A1C.40208-AlSwsSmVLrQ@public.gmane.org>]
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev [not found] ` <56F01A1C.40208-AlSwsSmVLrQ@public.gmane.org> @ 2016-03-22 8:09 ` Jiri Slaby [not found] ` <56F0FDCE.1040701-AlSwsSmVLrQ@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-03-22 8:09 UTC (permalink / raw) To: Tejun Heo, Dmitry Vyukov Cc: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth-u79uwXL29TY76Z2rM5mHXA, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai On 03/21/2016, 04:58 PM, Jiri Slaby wrote: > Hello, > > On 03/18/2016, 09:52 PM, Tejun Heo wrote: >> On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote: >>>>> I have not done that yet, but today, I see: >>>>> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 >>>>> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: >>>>> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 >>>>> in-flight: 18568:wq_barrier_func >>>> >>>> So, this means that there's flush_work() racing against workqueue >>>> destruction, which can't be safe. :( >>> >>> But I cannot trigger the WARN_ONs in the attached patch, so I am >>> confused how this can happen :(. (While I am still seeing the destroy >>> WARNINGs.) >> >> So, no operations should be in progress when destroy_workqueue() is >> called. If somebody was flushing a work item, the flush call must >> have returned before destroy_workqueue() was invoked, which doesn't >> seem to be the case here. Can you trigger BUG_ON() or sysrq-t when >> the above triggers? There must be a task which is flushing a work >> item there and it shouldn't be difficult to pinpoint what's going on >> from it. > > The output of sysrq-t is here (> 200k), but I cannot see anything > suspicious in it: > http://www.fi.muni.cz/~xslaby/sklad/panics/jctl.txt Hmm, so I seem I cannot reproduce with this hunk: --- a/net/bluetooth/hci_core.c +++ b/net/bluetooth/hci_core.c @@ -3139,10 +3139,10 @@ void hci_unregister_dev(struct hci_dev *hdev) list_del(&hdev->list); write_unlock(&hci_dev_list_lock); - hci_dev_do_close(hdev); - cancel_work_sync(&hdev->power_on); + hci_dev_do_close(hdev); + if (!test_bit(HCI_INIT, &hdev->flags) && !hci_dev_test_flag(hdev, HCI_SETUP) && !hci_dev_test_flag(hdev, HCI_CONFIG)) { I cannot explain why though. I do not see how it matters in this particular case... Dmitry, could you apply it too? But I don't know how often you see the warning. PS. next on the table is the gsm tty warning. thanks, -- js suse labs ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <56F0FDCE.1040701-AlSwsSmVLrQ@public.gmane.org>]
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev [not found] ` <56F0FDCE.1040701-AlSwsSmVLrQ@public.gmane.org> @ 2016-03-22 12:32 ` Dmitry Vyukov 2016-09-03 10:58 ` Dmitry Vyukov 0 siblings, 1 reply; 22+ messages in thread From: Dmitry Vyukov @ 2016-03-22 12:32 UTC (permalink / raw) To: Jiri Slaby Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai On Tue, Mar 22, 2016 at 9:09 AM, Jiri Slaby <jslaby-AlSwsSmVLrQ@public.gmane.org> wrote: > On 03/21/2016, 04:58 PM, Jiri Slaby wrote: >> Hello, >> >> On 03/18/2016, 09:52 PM, Tejun Heo wrote: >>> On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote: >>>>>> I have not done that yet, but today, I see: >>>>>> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 >>>>>> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: >>>>>> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 >>>>>> in-flight: 18568:wq_barrier_func >>>>> >>>>> So, this means that there's flush_work() racing against workqueue >>>>> destruction, which can't be safe. :( >>>> >>>> But I cannot trigger the WARN_ONs in the attached patch, so I am >>>> confused how this can happen :(. (While I am still seeing the destroy >>>> WARNINGs.) >>> >>> So, no operations should be in progress when destroy_workqueue() is >>> called. If somebody was flushing a work item, the flush call must >>> have returned before destroy_workqueue() was invoked, which doesn't >>> seem to be the case here. Can you trigger BUG_ON() or sysrq-t when >>> the above triggers? There must be a task which is flushing a work >>> item there and it shouldn't be difficult to pinpoint what's going on >>> from it. >> >> The output of sysrq-t is here (> 200k), but I cannot see anything >> suspicious in it: >> http://www.fi.muni.cz/~xslaby/sklad/panics/jctl.txt > > Hmm, so I seem I cannot reproduce with this hunk: > --- a/net/bluetooth/hci_core.c > +++ b/net/bluetooth/hci_core.c > @@ -3139,10 +3139,10 @@ void hci_unregister_dev(struct hci_dev *hdev) > list_del(&hdev->list); > write_unlock(&hci_dev_list_lock); > > - hci_dev_do_close(hdev); > - > cancel_work_sync(&hdev->power_on); > > + hci_dev_do_close(hdev); > + > if (!test_bit(HCI_INIT, &hdev->flags) && > !hci_dev_test_flag(hdev, HCI_SETUP) && > !hci_dev_test_flag(hdev, HCI_CONFIG)) { > > > > I cannot explain why though. I do not see how it matters in this > particular case... > > Dmitry, could you apply it too? But I don't know how often you see the > warning. I've seen it only several times in several months, so I don't it will be helpful. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-03-22 12:32 ` Dmitry Vyukov @ 2016-09-03 10:58 ` Dmitry Vyukov 2016-09-05 13:08 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Dmitry Vyukov @ 2016-09-03 10:58 UTC (permalink / raw) To: Jiri Slaby Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai On Tue, Mar 22, 2016 at 1:32 PM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Tue, Mar 22, 2016 at 9:09 AM, Jiri Slaby <jslaby@suse.cz> wrote: >> On 03/21/2016, 04:58 PM, Jiri Slaby wrote: >>> Hello, >>> >>> On 03/18/2016, 09:52 PM, Tejun Heo wrote: >>>> On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote: >>>>>>> I have not done that yet, but today, I see: >>>>>>> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300 >>>>>>> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: >>>>>>> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1 >>>>>>> in-flight: 18568:wq_barrier_func >>>>>> >>>>>> So, this means that there's flush_work() racing against workqueue >>>>>> destruction, which can't be safe. :( >>>>> >>>>> But I cannot trigger the WARN_ONs in the attached patch, so I am >>>>> confused how this can happen :(. (While I am still seeing the destroy >>>>> WARNINGs.) >>>> >>>> So, no operations should be in progress when destroy_workqueue() is >>>> called. If somebody was flushing a work item, the flush call must >>>> have returned before destroy_workqueue() was invoked, which doesn't >>>> seem to be the case here. Can you trigger BUG_ON() or sysrq-t when >>>> the above triggers? There must be a task which is flushing a work >>>> item there and it shouldn't be difficult to pinpoint what's going on >>>> from it. >>> >>> The output of sysrq-t is here (> 200k), but I cannot see anything >>> suspicious in it: >>> http://www.fi.muni.cz/~xslaby/sklad/panics/jctl.txt >> >> Hmm, so I seem I cannot reproduce with this hunk: >> --- a/net/bluetooth/hci_core.c >> +++ b/net/bluetooth/hci_core.c >> @@ -3139,10 +3139,10 @@ void hci_unregister_dev(struct hci_dev *hdev) >> list_del(&hdev->list); >> write_unlock(&hci_dev_list_lock); >> >> - hci_dev_do_close(hdev); >> - >> cancel_work_sync(&hdev->power_on); >> >> + hci_dev_do_close(hdev); >> + >> if (!test_bit(HCI_INIT, &hdev->flags) && >> !hci_dev_test_flag(hdev, HCI_SETUP) && >> !hci_dev_test_flag(hdev, HCI_CONFIG)) { >> >> >> >> I cannot explain why though. I do not see how it matters in this >> particular case... >> >> Dmitry, could you apply it too? But I don't know how often you see the >> warning. > > I've seen it only several times in several months, so I don't it will > be helpful. Bad news: I hit it again. On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have bf389cabb3b8079c23f9762e62b05f291e2d5e99. WARNING: CPU: 3 PID: 23168 at kernel/workqueue.c:4029 destroy_workqueue+0x172/0x620 CPU: 3 PID: 23168 Comm: syz-executor Not tainted 4.8.0-rc3-next-20160825+ #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffffffff886b6fe0 ffff880037be77a8 ffffffff82db38d9 ffffffff00000016 fffffbfff10d6dfc ffffffff8708f000 ffff880037be7880 ffffffff8709be00 dffffc0000000000 0000000000000009 ffff880037be7870 ffffffff816cf888 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff82db38d9>] dump_stack+0x12e/0x185 lib/dump_stack.c:51 [<ffffffff816cf888>] panic+0x1e4/0x3fb kernel/panic.c:179 [<ffffffff813863b4>] __warn+0x1c4/0x1e0 kernel/panic.c:542 [<ffffffff8138659c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585 [<ffffffff813dc6d2>] destroy_workqueue+0x172/0x620 kernel/workqueue.c:4029 [<ffffffff8645973f>] hci_unregister_dev+0x3df/0x8a0 net/bluetooth/hci_core.c:3170 [<ffffffff84cb6926>] vhci_release+0x76/0xf0 drivers/bluetooth/hci_vhci.c:354 [<ffffffff81868bbc>] __fput+0x28c/0x780 fs/file_table.c:208 [<ffffffff81869135>] ____fput+0x15/0x20 fs/file_table.c:244 [<ffffffff813ebf63>] task_work_run+0xf3/0x170 [< inline >] exit_task_work include/linux/task_work.h:21 [<ffffffff81394218>] do_exit+0x868/0x2e70 kernel/exit.c:828 [<ffffffff81396998>] do_group_exit+0x108/0x330 kernel/exit.c:958 [<ffffffff813ba4aa>] get_signal+0x62a/0x15d0 kernel/signal.c:2307 [<ffffffff81202333>] do_signal+0x83/0x1f60 arch/x86/kernel/signal.c:805 [<ffffffff81006345>] exit_to_usermode_loop+0x1a5/0x210 arch/x86/entry/common.c:163 [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:198 [<ffffffff8100869f>] syscall_return_slowpath+0x2bf/0x340 arch/x86/entry/common.c:267 [<ffffffff86e1079c>] entry_SYSCALL_64_fastpath+0xbf/0xc1 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-09-03 10:58 ` Dmitry Vyukov @ 2016-09-05 13:08 ` Tejun Heo 2016-09-05 13:14 ` Dmitry Vyukov 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2016-09-05 13:08 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai Hello, On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote: > > I've seen it only several times in several months, so I don't it will > > be helpful. > > > Bad news: I hit it again. > On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have > bf389cabb3b8079c23f9762e62b05f291e2d5e99. Hmmm... we're not getting anywhere. I've applied the following patch to wq/for-4.8-fixes so that when this happens the next time we can actually tell what the hell is going wrong. Thanks. ------ 8< ------ >From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Mon, 5 Sep 2016 08:54:06 -0400 Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in destroy_workqueue() destroy_workqueue() performs a number of sanity checks to ensure that the workqueue is empty before proceeding with destruction. However, it's not always easy to tell what's going on just from the warning message. Let's dump workqueue state after sanity check failures to help debugging. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz3S4+quvvVQ9VWvj2Mx6pETGb9Q@mail.gmail.com Cc: Dmitry Vyukov <dvyukov@google.com> --- kernel/workqueue.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index ef071ca..4eaec8b8 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq) for (i = 0; i < WORK_NR_COLORS; i++) { if (WARN_ON(pwq->nr_in_flight[i])) { mutex_unlock(&wq->mutex); + show_workqueue_state(); return; } } @@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq) WARN_ON(pwq->nr_active) || WARN_ON(!list_empty(&pwq->delayed_works))) { mutex_unlock(&wq->mutex); + show_workqueue_state(); return; } } -- 2.7.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-09-05 13:08 ` Tejun Heo @ 2016-09-05 13:14 ` Dmitry Vyukov 2016-09-10 9:33 ` Dmitry Vyukov 0 siblings, 1 reply; 22+ messages in thread From: Dmitry Vyukov @ 2016-09-05 13:14 UTC (permalink / raw) To: Tejun Heo Cc: Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai On Mon, Sep 5, 2016 at 3:08 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote: >> > I've seen it only several times in several months, so I don't it will >> > be helpful. >> >> >> Bad news: I hit it again. >> On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have >> bf389cabb3b8079c23f9762e62b05f291e2d5e99. > > Hmmm... we're not getting anywhere. I've applied the following patch > to wq/for-4.8-fixes so that when this happens the next time we can > actually tell what the hell is going wrong. > > Thanks. > > ------ 8< ------ > From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001 > From: Tejun Heo <tj@kernel.org> > Date: Mon, 5 Sep 2016 08:54:06 -0400 > Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in > destroy_workqueue() > > destroy_workqueue() performs a number of sanity checks to ensure that > the workqueue is empty before proceeding with destruction. However, > it's not always easy to tell what's going on just from the warning > message. Let's dump workqueue state after sanity check failures to > help debugging. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz3S4+quvvVQ9VWvj2Mx6pETGb9Q@mail.gmail.com > Cc: Dmitry Vyukov <dvyukov@google.com> > --- > kernel/workqueue.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index ef071ca..4eaec8b8 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq) > for (i = 0; i < WORK_NR_COLORS; i++) { > if (WARN_ON(pwq->nr_in_flight[i])) { > mutex_unlock(&wq->mutex); > + show_workqueue_state(); > return; > } > } > @@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq) > WARN_ON(pwq->nr_active) || > WARN_ON(!list_empty(&pwq->delayed_works))) { > mutex_unlock(&wq->mutex); > + show_workqueue_state(); > return; > } > } Applied this to my test tree. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-09-05 13:14 ` Dmitry Vyukov @ 2016-09-10 9:33 ` Dmitry Vyukov 2016-09-13 15:35 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Dmitry Vyukov @ 2016-09-10 9:33 UTC (permalink / raw) To: Tejun Heo Cc: Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai On Mon, Sep 5, 2016 at 3:14 PM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Mon, Sep 5, 2016 at 3:08 PM, Tejun Heo <tj@kernel.org> wrote: >> Hello, >> >> On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote: >>> > I've seen it only several times in several months, so I don't it will >>> > be helpful. >>> >>> >>> Bad news: I hit it again. >>> On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have >>> bf389cabb3b8079c23f9762e62b05f291e2d5e99. >> >> Hmmm... we're not getting anywhere. I've applied the following patch >> to wq/for-4.8-fixes so that when this happens the next time we can >> actually tell what the hell is going wrong. >> >> Thanks. >> >> ------ 8< ------ >> From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001 >> From: Tejun Heo <tj@kernel.org> >> Date: Mon, 5 Sep 2016 08:54:06 -0400 >> Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in >> destroy_workqueue() >> >> destroy_workqueue() performs a number of sanity checks to ensure that >> the workqueue is empty before proceeding with destruction. However, >> it's not always easy to tell what's going on just from the warning >> message. Let's dump workqueue state after sanity check failures to >> help debugging. >> >> Signed-off-by: Tejun Heo <tj@kernel.org> >> Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz3S4+quvvVQ9VWvj2Mx6pETGb9Q@mail.gmail.com >> Cc: Dmitry Vyukov <dvyukov@google.com> >> --- >> kernel/workqueue.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c >> index ef071ca..4eaec8b8 100644 >> --- a/kernel/workqueue.c >> +++ b/kernel/workqueue.c >> @@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq) >> for (i = 0; i < WORK_NR_COLORS; i++) { >> if (WARN_ON(pwq->nr_in_flight[i])) { >> mutex_unlock(&wq->mutex); >> + show_workqueue_state(); >> return; >> } >> } >> @@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq) >> WARN_ON(pwq->nr_active) || >> WARN_ON(!list_empty(&pwq->delayed_works))) { >> mutex_unlock(&wq->mutex); >> + show_workqueue_state(); >> return; >> } >> } > > > Applied this to my test tree. Hit the WARNING with the patch. It showed "Showing busy workqueues and worker pools:" after the WARNING, but then no queue info. Was it already destroyed and removed from the list?... [ 186.994195] blk_update_request: critical target error, dev sr0, sector 0 [ 187.031368] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 187.032172] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] [ 187.032816] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code [ 187.033432] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 02 00 [ 187.034124] blk_update_request: critical target error, dev sr0, sector 0 [ 187.069408] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 187.070127] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] [ 187.070744] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code [ 187.071368] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 04 00 [ 187.071982] blk_update_request: critical target error, dev sr0, sector 0 [ 187.098397] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 187.099194] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] [ 187.099863] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code [ 187.100527] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 04 00 [ 187.101206] blk_update_request: critical target error, dev sr0, sector 0 [ 187.104486] VFS: Dirty inode writeback failed for block device sr0 (err=-5). [ 198.113128] ------------[ cut here ]------------ [ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042 destroy_workqueue+0x17b/0x630 [ 198.114628] Modules linked in: [ 198.115009] CPU: 2 PID: 26691 Comm: syz-executor Not tainted 4.8.0-rc5-next-20160909+ #23 [ 198.115615] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 198.116032] ffff880060f4f880 ffffffff82d88e39 ffffffff87098a20 fffffbfff10d6ad0 [ 198.116032] 0000000000000000 0000000000000000 ffffffff87098a20 ffffffff813b5abb [ 198.116032] 0000000000000009 ffff880060f4f8c8 ffffffff81354c3f ffff880060f4f8e8 [ 198.116032] Call Trace: [ 198.116032] [<ffffffff82d88e39>] dump_stack+0x12e/0x185 [ 198.116032] [<ffffffff813b5abb>] ? destroy_workqueue+0x17b/0x630 [ 198.116032] [<ffffffff81354c3f>] __warn+0x19f/0x1e0 [ 198.116032] [<ffffffff81354e4c>] warn_slowpath_null+0x2c/0x40 [ 198.116032] [<ffffffff813b5abb>] destroy_workqueue+0x17b/0x630 [ 198.116032] [<ffffffff8642b43f>] hci_unregister_dev+0x3df/0x8a0 [ 198.116032] [<ffffffff8195bc70>] ? fcntl_setlk+0xc30/0xc30 [ 198.116032] [<ffffffff84c86b70>] ? vhci_flush+0x50/0x50 [ 198.116032] [<ffffffff84c86be6>] vhci_release+0x76/0xf0 [ 198.116032] [<ffffffff8183c71c>] __fput+0x28c/0x780 [ 198.116032] [<ffffffff8183cc95>] ____fput+0x15/0x20 [ 198.116032] [<ffffffff813bc6d3>] task_work_run+0xf3/0x170 [ 198.116032] [<ffffffff81364c58>] do_exit+0x868/0x2de0 [ 198.116032] [<ffffffff81934de0>] ? userfaultfd_ioctl+0x27b0/0x27b0 [ 198.116032] [<ffffffff8146c180>] ? debug_check_no_locks_freed+0x3c0/0x3c0 [ 198.116032] [<ffffffff813643f0>] ? mm_update_next_owner+0x640/0x640 [ 198.116032] [<ffffffff819323d0>] ? userfaultfd_release+0x620/0x620 [ 198.116032] [<ffffffff8137fd32>] ? __dequeue_signal+0x132/0x540 [ 198.116032] [<ffffffff8137f896>] ? recalc_sigpending_tsk+0xd6/0x180 [ 198.116032] [<ffffffff81367348>] do_group_exit+0x108/0x330 [ 198.116032] [<ffffffff8138adea>] get_signal+0x62a/0x15d0 [ 198.116032] [<ffffffff811d0333>] do_signal+0x83/0x1e20 [ 198.116032] [<ffffffff81761920>] ? __pmd_alloc+0x3f0/0x3f0 [ 198.116032] [<ffffffff811d02b0>] ? setup_sigcontext+0x7d0/0x7d0 [ 198.116032] [<ffffffff8125d3c5>] ? kvm_clock_get_cycles+0x25/0x30 [ 198.116032] [<ffffffff817720d0>] ? find_vma+0x30/0x150 [ 198.116032] [<ffffffff8126b128>] ? __do_page_fault+0x198/0xbb0 [ 198.116032] [<ffffffff81510500>] ? do_futex+0x17d0/0x17d0 [ 198.116032] [<ffffffff81006335>] exit_to_usermode_loop+0x1a5/0x210 [ 198.116032] [<ffffffff81008384>] prepare_exit_to_usermode+0x154/0x190 [ 198.116032] [<ffffffff86df20b4>] retint_user+0x8/0x23 [ 198.137668] ---[ end trace 1beeef55c15054c4 ]--- [ 198.138229] Showing busy workqueues and worker pools: [ 198.219611] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 198.220424] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] [ 198.221207] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code [ 198.222086] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 04 00 [ 198.222872] blk_update_request: critical target error, dev sr0, sector 0 [ 198.223637] buffer_io_error: 78 callbacks suppressed [ 198.224152] Buffer I/O error on dev sr0, logical block 0, lost async page write [ 198.225336] Buffer I/O error on dev sr0, logical block 1, lost async page write [ 198.226211] Buffer I/O error on dev sr0, logical block 2, lost async page write [ 198.227031] Buffer I/O error on dev sr0, logical block 3, lost async page write [ 198.227887] Buffer I/O error on dev sr0, logical block 4, lost async page write [ 198.228841] Buffer I/O error on dev sr0, logical block 5, lost async page write [ 198.229653] Buffer I/O error on dev sr0, logical block 6, lost async page write [ 198.230344] Buffer I/O error on dev sr0, logical block 7, lost async page write [ 198.231156] Buffer I/O error on dev sr0, logical block 8, lost async page write [ 198.231973] Buffer I/O error on dev sr0, logical block 9, lost async page write [ 198.244460] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 198.245272] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] [ 198.245939] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code [ 198.246554] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 02 00 [ 198.247265] blk_update_request: critical target error, dev sr0, sector 0 [ 198.275421] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 198.276249] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] [ 198.276865] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code [ 198.277466] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 04 00 [ 198.277998] blk_update_request: critical target error, dev sr0, sector 0 [ 198.333417] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 198.334161] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] [ 198.334734] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code [ 198.335433] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 04 00 [ 198.336095] blk_update_request: critical target error, dev sr0, sector 0 [ 198.337116] VFS: Dirty inode writeback failed for block device sr0 (err=-5). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-09-10 9:33 ` Dmitry Vyukov @ 2016-09-13 15:35 ` Tejun Heo [not found] ` <20160913153520.GC21123-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2016-09-13 15:35 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai Hello, On Sat, Sep 10, 2016 at 11:33:48AM +0200, Dmitry Vyukov wrote: > Hit the WARNING with the patch. It showed "Showing busy workqueues and > worker pools:" after the WARNING, but then no queue info. Was it > already destroyed and removed from the list?... Hmm... It either means that the work item which was in flight when WARN_ON() ran finished by the time the debug printout got to it or that it's something unrelated to busy work items. > [ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042 > destroy_workqueue+0x17b/0x630 I don't seem to have the same source code that you have. Which exact WARN_ON() is this? Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <20160913153520.GC21123-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev [not found] ` <20160913153520.GC21123-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-09-13 18:14 ` Jiri Slaby 2016-09-16 20:24 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Jiri Slaby @ 2016-09-13 18:14 UTC (permalink / raw) To: Tejun Heo, Dmitry Vyukov Cc: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai On 09/13/2016, 05:35 PM, Tejun Heo wrote: > Hello, > > On Sat, Sep 10, 2016 at 11:33:48AM +0200, Dmitry Vyukov wrote: >> Hit the WARNING with the patch. It showed "Showing busy workqueues and >> worker pools:" after the WARNING, but then no queue info. Was it >> already destroyed and removed from the list?... > > Hmm... It either means that the work item which was in flight when > WARN_ON() ran finished by the time the debug printout got to it or > that it's something unrelated to busy work items. > >> [ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042 >> destroy_workqueue+0x17b/0x630 > > I don't seem to have the same source code that you have. Which exact > WARN_ON() is this? I assume Dmitry sees the same what I am still seeing, so I reported this some time ago: https://lkml.org/lkml/2016/3/21/492 This warning is trigerred there and still occurs with "HEAD": (pwq != wq->dfl_pwq) && (pwq->refcnt > 1) and the state dump is in the log empty too: destroy_workqueue: name='hci0' pwq=ffff88006b5c8f00 wq->dfl_pwq=ffff88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: pwq 13: cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1 in-flight: 2669:wq_barrier_func thanks, -- js suse labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev 2016-09-13 18:14 ` Jiri Slaby @ 2016-09-16 20:24 ` Tejun Heo 0 siblings, 0 replies; 22+ messages in thread From: Tejun Heo @ 2016-09-16 20:24 UTC (permalink / raw) To: Jiri Slaby Cc: Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai Hello, On Tue, Sep 13, 2016 at 08:14:40PM +0200, Jiri Slaby wrote: > I assume Dmitry sees the same what I am still seeing, so I reported this > some time ago: > https://lkml.org/lkml/2016/3/21/492 > > This warning is trigerred there and still occurs with "HEAD": > (pwq != wq->dfl_pwq) && (pwq->refcnt > 1) > and the state dump is in the log empty too: > destroy_workqueue: name='hci0' pwq=ffff88006b5c8f00 > wq->dfl_pwq=ffff88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works: > pwq 13: > cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1 > in-flight: 2669:wq_barrier_func Hmmm... I think it could be from rescuer holding reference on the pwq. Both cases have WQ_MEM_RECLAIM and it could be that rescuer was still in flight (even without work items pending) when the sanity checks were done. The following patch moves the sanity checks after rescuer destruction. Dmitry, Jiri, can you please see whether the warning goes away with this patch? Thanks. diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 984f6ff..e8046a1 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4042,8 +4042,7 @@ void destroy_workqueue(struct workqueue_struct *wq) } } - if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) || - WARN_ON(pwq->nr_active) || + if (WARN_ON(pwq->nr_active) || WARN_ON(!list_empty(&pwq->delayed_works))) { mutex_unlock(&wq->mutex); show_workqueue_state(); @@ -4080,6 +4079,7 @@ void destroy_workqueue(struct workqueue_struct *wq) for_each_node(node) { pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]); RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL); + WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt != 1)); put_pwq_unlocked(pwq); } @@ -4089,6 +4089,7 @@ void destroy_workqueue(struct workqueue_struct *wq) */ pwq = wq->dfl_pwq; wq->dfl_pwq = NULL; + WARN_ON(pwq->refcnt != 1); put_pwq_unlocked(pwq); } } ^ permalink raw reply related [flat|nested] 22+ messages in thread
end of thread, other threads:[~2016-09-16 20:24 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-26 11:53 net/bluetooth: workqueue destruction WARNING in hci_unregister_dev Dmitry Vyukov
2016-01-26 12:29 ` Dmitry Vyukov
2016-02-18 14:00 ` Jiri Slaby
2016-02-18 14:22 ` Dmitry Vyukov
2016-02-18 17:44 ` Tejun Heo
[not found] ` <20160218174427.GG13177-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-02-19 10:20 ` Jiri Slaby
2016-02-19 12:10 ` Jiri Slaby
2016-03-02 15:45 ` Tejun Heo
[not found] ` <20160302154507.GC4282-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-03-03 9:12 ` Jiri Slaby
2016-03-11 17:12 ` Tejun Heo
[not found] ` <20160311171205.GB24046-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2016-03-17 12:00 ` Jiri Slaby
[not found] ` <56EA9C4D.2080803-AlSwsSmVLrQ@public.gmane.org>
2016-03-18 20:52 ` Tejun Heo
2016-03-21 15:58 ` Jiri Slaby
[not found] ` <56F01A1C.40208-AlSwsSmVLrQ@public.gmane.org>
2016-03-22 8:09 ` Jiri Slaby
[not found] ` <56F0FDCE.1040701-AlSwsSmVLrQ@public.gmane.org>
2016-03-22 12:32 ` Dmitry Vyukov
2016-09-03 10:58 ` Dmitry Vyukov
2016-09-05 13:08 ` Tejun Heo
2016-09-05 13:14 ` Dmitry Vyukov
2016-09-10 9:33 ` Dmitry Vyukov
2016-09-13 15:35 ` Tejun Heo
[not found] ` <20160913153520.GC21123-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2016-09-13 18:14 ` Jiri Slaby
2016-09-16 20:24 ` Tejun Heo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).