From: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
To: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org
Subject: Race between ipoib_cm_handle_tx_wc and ipoib_cm_tx_destroy crashing.
Date: Thu, 11 Aug 2016 17:11:37 +0300 [thread overview]
Message-ID: <57AC8799.8060003@kyup.com> (raw)
Hello list,
I've come across yet another ipoib issue. I got the following crash,
while executing "ifdown ib0". At this time the infiniband network was working
as expected (e.g. no issue like the ones I reported before with tx queue
time outs, etc) :
[721677.936044] ib0: timing out; 1 sends not completed
[721678.760114] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[721678.760337] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[721679.081771] BUG: unable to handle kernel paging request at ffffc900358e24d8
[721679.081776] IP: [<ffffffffa01edf20>] ipoib_dma_unmap_tx+0x20/0x170 [ib_ipoib]
[721679.081782] PGD 1fff432067 PUD 3ffec01067 PMD 1e8f064067 PTE 0
[721679.081785] Oops: 0000 [#1] SMP
[721679.081819] CPU: 0 PID: 3451 Comm: qib_cq0 Tainted: G O 4.4.14-clouder3 #26
[721679.081821] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.2 01/16/2015
[721679.081822] task: ffff881fee8bb700 ti: ffff881ff0144000 task.ti: ffff881ff0144000
[721679.081823] RIP: 0010:[<ffffffffa01edf20>] [<ffffffffa01edf20>] ipoib_dma_unmap_tx+0x20/0x170 [ib_ipoib]
[721679.081826] RSP: 0000:ffff881fff803e00 EFLAGS: 00010286
[721679.081827] RAX: ffffc900358e24e0 RBX: ffffc900358e24d8 RCX: 0000000000000100
[721679.081828] RDX: 000000000000069b RSI: ffffc900358e24d8 RDI: ffff881ff25c8700
[721679.081829] RBP: ffff881fff803e30 R08: 0000000000000002 R09: ffffc9001b7df000
[721679.081830] R10: 00000000000aebb9 R11: 0000000000000000 R12: 0000000000000000
[721679.081831] R13: 0000000000000000 R14: ffff881ff25c8700 R15: ffff881ff25c8000
[721679.081832] FS: 0000000000000000(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
[721679.081833] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[721679.081834] CR2: ffffc900358e24d8 CR3: 0000001fa3b4f000 CR4: 00000000001406f0
[721679.081834] Stack:
[721679.081835] 0000000000000000 ffffc900358e24d8 0000000000000000 ffff881fdf4b2c00
[721679.081837] 0000000000000000 ffff881ff25c8000 ffff881fff803e80 ffffffffa01f5660
[721679.081839] ffff881ff25c9068 0000000040000059 ffff881ff25c8700 ffff881ff25c8710
[721679.081840] Call Trace:
[721679.081842] <IRQ>
[721679.081845] [<ffffffffa01f5660>] ipoib_cm_handle_tx_wc+0x70/0x280 [ib_ipoib]
[721679.081848] [<ffffffffa01ee487>] ipoib_poll+0xd7/0x160 [ib_ipoib]
[721679.081852] [<ffffffff8154d36c>] net_rx_action+0x1ec/0x330
[721679.081855] [<ffffffff81057337>] __do_softirq+0x147/0x310
[721679.081866] [<ffffffffa00b601f>] ? send_complete+0x1f/0x60 [ib_qib]
[721679.081869] [<ffffffff8161618c>] do_softirq_own_stack+0x1c/0x30
[721679.081869] <EOI>
[721679.081871] [<ffffffff810567bb>] do_softirq.part.17+0x3b/0x40
[721679.081873] [<ffffffff81056876>] __local_bh_enable_ip+0xb6/0xc0
[721679.081877] [<ffffffffa00b604b>] send_complete+0x4b/0x60 [ib_qib]
[721679.081881] [<ffffffff81071bdb>] kthread_worker_fn+0xbb/0x1e0
[721679.081883] [<ffffffff81071b20>] ? kthread_create_on_node+0x180/0x180
[721679.081885] [<ffffffff8107161f>] kthread+0xef/0x110
[721679.081887] [<ffffffff81071530>] ? kthread_park+0x60/0x60
[721679.081889] [<ffffffff816149ff>] ret_from_fork+0x3f/0x70
[721679.081891] [<ffffffff81071530>] ? kthread_park+0x60/0x60
[721679.081892] Code: 48 8b 05 b4 63 a2 e1 eb a3 66 90 0f 1f 44 00 00 55 48 8d 46 08 48 89 e5 41 57 41 56 49 89 fe 41 55 45 31 ed 41 54 53 48 83 ec 08 <4c> 8b 3e 48 89 45 d0 41 8b 97 80 00 00 00 41 8b 87 84 00 00 00
[721679.081909] RIP [<ffffffffa01edf20>] ipoib_dma_unmap_tx+0x20/0x170 [ib_ipoib]
[721679.081911] RSP <ffff881fff803e00>
[721679.081912] CR2: ffffc900358e24d8
ipoib_dma_unmap_tx+0x20 is : struct sk_buff *skb = tx_req->skb;
The address of tx_req is ffffc900358e24d8, which apparently
is unmapped address.
The warning message "ib0: timing out; 1 sends not completed",
came from ipoib_cm_tx_destroy. This means the "goto timeout"
statement has executed, triggering the freeing of the ipoib_cm_tx
state, eventually unmapping the vmalloced. At the time of the crash ipoib_cm_tx looks like:
struct ipoib_cm_tx {
id = 0xffff881fdf4b2c60,
qp = 0xffff8801c0bc8000,
list = {
next = 0xdead000000000100,
prev = 0xdead000000000200
},
dev = 0xffff881ff25c8000,
neigh = 0x0,
path = 0xffff881f3f7a7500,
tx_ring = 0xffffc900358df000,
tx_head = 101722,
tx_tail = 101722,
flags = 1,
mtu = 65524
}
The list is dead which corresponds to the list_del(&p->list);
in cm_tx_reap and the tx_head/tx_tail are equal, corresponding
to the loop in ipoib_cm_tx_destroy.
Reading the code in ipoib_cm_handle_tx_wc I wasn't able to figure out
how is this function synchronized against parallel destruction of
underlying ipoib_cm_tx, which seems to be the case here.
Regards,
Nikolay
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next reply other threads:[~2016-08-11 14:11 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-08-11 14:11 Nikolay Borisov [this message]
[not found] ` <57AC8799.8060003-6AxghH7DbtA@public.gmane.org>
2016-08-11 15:10 ` Race between ipoib_cm_handle_tx_wc and ipoib_cm_tx_destroy crashing Erez Shitrit
[not found] ` <CAAk-MO8ftavVagQSXte_7yRf=iwHRXk-123wWRmmJELW+RVC7Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-11 15:39 ` Nikolay Borisov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=57AC8799.8060003@kyup.com \
--to=kernel-6axghh7dbta@public.gmane.org \
--cc=dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.