From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
To: Jack Wang <xjtuwjp-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
Jinpu Wang <jinpu.wang-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>,
"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Dongsu Park <dongsu.park-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>
Subject: Re: list corruption in IPOIB
Date: Mon, 20 May 2013 16:38:35 +0300
Message-ID: <519A275B.9070400@mellanox.com>
In-Reply-To: <CAD+HZHUKU3qq_WbaoW8NfwkoMQWQKeVS1GTGXxBRUEJOridEyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On 5/20/2013 3:58 PM, Jack Wang wrote:
> I haven't reproduced the original bug we saw in our production
> environment
> BUG: unable to handle kernel
> at 0000000000000008
> IP: [<ffffffffa0206c30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
> ...
> RIP: 0010:[<ffffffffa0206c30>] [<ffffffffa0206c30>]
> ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
> RSP: 0018:ffff8825fdcbddb0 EFLAGS: 00010086
> RAX: 0000000000000246 RBX: ffff8807b59c29c0 RCX: 0000000000000000
> RDX: 4400000006000002 RSI: 0000000000000246 RDI: ffff8810026527c0
> RBP: ffff881002652000 R08: 0000000000015360 R09: dead000000200200
> R10: dead000000100100 R11: 0000000000000001 R12: 0000000000000001
> R13: 0000000000000000 R14: ffff8810026523a0 R15: ffff8810026527c0
> FS: 00007f4c9a325700(0000) GS:ffff880807c00000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000008 CR3: 0000002605e3a000 CR4: 00000000000407f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/u:3 (pid: 61374, threadinfo ffff8825fdcbc000, task
> ffff8807fd0eafb0)
> Stack:
>  ffff8820043303c0 ffff880807d52700 ffff8807fd0eafb0 ffff8825fdcbdde0
>  ffff8810026533b8 ffffffffa0039868 ffff8825fdcbdde0 ffff8805fd549a00
>  ffffffff81b9d480 ffff8807fd2f4000 ffffffffa0206b50 0000000000000000
> Call Trace:
> [<ffffffffa0039868>] ? process_req+0xe8/0x1a0 [ib_addr]
> [<ffffffffa0206b50>] ? ipoib_cm_tx_handler+0x2d0/0x2d0 [ib_ipoib]
> [<ffffffff81052d64>] ? process_one_work+0x114/0x470
> [<ffffffff81055033>] ? worker_thread+0x163/0x3e0
> [<ffffffff81054ed0>] ? manage_workers+0x200/0x200
> [<ffffffff81054ed0>] ? manage_workers+0x200/0x200
> [<ffffffff8105963e>] ? kthread+0x9e/0xb0
> [<ffffffff8167e9e4>] ? kernel_thread_helper+0x4/0x10
> [<ffffffff810595a0>] ? kthread_freezable_should_stop+0x60/0x60
> [<ffffffff8167e9e0>] ? gs_change+0x13/0x13
> ...
> [<ffffffffa01fec30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
> RSP <ffff881d275f1db0>
> ---[ end trace 38ff082cbc03dd75 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
>
>
>
> , only the A variant of the crash in has been reproduced:
>
> WARNING: at lib/list_debug.c:49 __list_del_entry+0x63/0xd0()
> Hardware name: System Product Name
> list_del corruption, ffff88020dbd3080->next is LIST_POISON1
> (dead000000100100)
> Modules linked in: ...
> Pid: 16248, comm: iperf Tainted: G W 3.4.23-pserver+ #76
> Call Trace:
> <IRQ> [<ffffffff8103c21f>] warn_slowpath_common+0x7f/0xc0
> [<ffffffff8103c316>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff81428563>] ? do_raw_spin_lock+0xd3/0x140
> [<ffffffff81428883>] __list_del_entry+0x63/0xd0
> [<ffffffff81428901>] list_del+0x11/0x40
> [<ffffffffa02f64c5>] ipoib_cm_handle_tx_wc+0x225/0x380 [ib_ipoib]
> [<ffffffffa02eea44>] ipoib_poll+0x164/0x190 [ib_ipoib]
> [<ffffffff815d91fd>] net_rx_action+0x13d/0x320
> [<ffffffff81044f29>] ? __do_softirq+0x89/0x380
> [<ffffffff81044f98>] __do_softirq+0xf8/0x380
> [<ffffffff8174632c>] call_softirq+0x1c/0x30
> <EOI> [<ffffffff81004305>] do_softirq+0x95/0xd0
> [<ffffffff815daacc>] ? dev_queue_xmit+0x29c/0xbf0
> [<ffffffff8104461b>] local_bh_enable+0xeb/0xf0
> [<ffffffff815daacc>] dev_queue_xmit+0x29c/0xbf0
> [<ffffffff815da830>] ? ptype_seq_start+0xb0/0xb0
> [<ffffffff815e0d87>] neigh_connected_output+0xc7/0x110
> [<ffffffff8109f36d>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff81617386>] ip_finish_output2+0x1c6/0x460
> [<ffffffff8161723a>] ? ip_finish_output2+0x7a/0x460
> [<ffffffff81619033>] ip_finish_output+0xc3/0x230
> [<ffffffff81619510>] ip_output+0xa0/0x110
> [<ffffffff8161764d>] ip_local_out+0x2d/0x90
> [<ffffffff816176cb>] ip_send_skb+0x1b/0x60
> [<ffffffff8163f27b>] udp_send_skb+0x10b/0x380
> [<ffffffff815c3a70>] ? sock_def_wakeup+0x1b0/0x1b0
> [<ffffffff81616e90>] ? ip_append_page+0x530/0x530
> [<ffffffff81641462>] udp_sendmsg+0x3b2/0xb50
> [<ffffffff8173c530>] ? retint_restore_args+0x13/0x13
> [<ffffffff8164d9a0>] ? inet_create+0x5b0/0x5b0
> [<ffffffff815c2310>] ? sock_update_classid+0x150/0x2b0
> [<ffffffff8164dacb>] inet_sendmsg+0x12b/0x240
> [<ffffffff8164d9a0>] ? inet_create+0x5b0/0x5b0
> [<ffffffff815c2272>] ? sock_update_classid+0xb2/0x2b0
> [<ffffffff815c2310>] ? sock_update_classid+0x150/0x2b0
> [<ffffffff815bda40>] sock_aio_write+0x190/0x1b0
> [<ffffffff8142214e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff8116e82a>] do_sync_write+0xea/0x130
> [<ffffffff8109bfdd>] ? trace_hardirqs_off+0xd/0x10
> [<ffffffff811713d3>] ? fget_light+0x43/0x490
> [<ffffffff813b14f3>] ? security_file_permission+0x23/0x90
> [<ffffffff8116ee82>] vfs_write+0x172/0x190
> [<ffffffff8116ef91>] sys_write+0x51/0x90
> [<ffffffff81744de9>] system_call_fastpath+0x16/0x1b
> ---[ end trace 66110390802a41db ]---
>
> After applying
> commit fa16ebed31f336e41970f3f0ea9e8279f6be2d27
> Author: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Date: Mon Aug 13 14:39:49 2012 +0000
>
> IB/ipoib: Add missing locking when CM object is deleted
>
> the above warning is gone, but we still see the warning from the
> beginning of this thread.
>
>
>
> 2013/5/20 Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> On 20/05/2013 15:46, Jinpu Wang wrote:
>
> A quick test shows the list corruption warning is gone after I
> convert all list_del(&neigh->list) to list_del_init(&neigh->list).
>
>
> yes, but this wasn't your original problem or was it?
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
Hi Jack,
I don't understand what the current status is, that is, what do you see
now after applying the patches?
If you no longer hit the original bug, why did you give its trace? Or
is it a new trace? It is not clear from your mail.
Please include only the trace of the current issue.
Best regards,
S.P.
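
For readers following the quoted warning ("->next is LIST_POISON1"), here is a
minimal userspace sketch of why a second list_del() on the same entry trips
the debug check, and why a re-initialising delete such as the kernel's
list_del_init() does not. This is a simplified stand-in written for
illustration, not the kernel's or ib_ipoib's code: the bool return values and
the helper names list_del_entry_checked() and double_delete_demo() are
assumptions of this sketch, and the poison constants are simply the values
printed in the trace above (a 64-bit build is assumed).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Poison values as printed in the quoted oops (64-bit build assumed). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000100100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000200200ULL)

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Hypothetical helper modelled on the list debug check: refuse to unlink
 * an entry whose pointers are already poisoned. Returns false in the case
 * where the kernel would print the "list_del corruption" warning.
 */
static bool list_del_entry_checked(struct list_head *entry)
{
	if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2)
		return false;
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	return true;
}

/* Unlink and poison: a second call on the same entry is detected. */
static bool list_del(struct list_head *entry)
{
	if (!list_del_entry_checked(entry))
		return false;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
	return true;
}

/* Unlink and make the entry an empty list: a second call is a no-op. */
static bool list_del_init(struct list_head *entry)
{
	if (!list_del_entry_checked(entry))
		return false;
	init_list_head(entry);
	return true;
}

/* Deletes two entries twice each, once per variant. */
static void double_delete_demo(void)
{
	struct list_head head, a, b;

	init_list_head(&head);
	list_add(&a, &head);
	list_add(&b, &head);

	assert(list_del(&a));        /* first delete succeeds */
	assert(!list_del(&a));       /* second delete hits LIST_POISON1 */

	assert(list_del_init(&b));   /* first delete succeeds */
	assert(list_del_init(&b));   /* second delete is harmless */
	assert(head.next == &head);  /* list is empty and still consistent */
}
```

Note that in this sketch the re-initialising variant only makes the second
delete harmless; it does not explain why the same entry is deleted twice,
which is consistent with the question quoted above about whether this was the
original problem.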