* ocrdma failure in 4.4.0-rc5
@ 2015-12-19 20:11 Doug Ledford
[not found] ` <5675BA00.5060101-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Doug Ledford @ 2015-12-19 20:11 UTC
To: devesh.sharma-1wcpHE2jlwO1Z/+hSey0Gg,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
[-- Attachment #1.1: Type: text/plain, Size: 304 bytes --]
Hi Devesh,
Testing 4.4.0-rc5, the ocrdma driver is failing for me (100% reliably).
If you have vlans off of the main device, this is what I get from the
Fedora rawhide 4.4.0-rc5 kernel:
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG KeyID: 0E572FDD
[-- Attachment #1.2: dmesg.out --]
[-- Type: text/plain, Size: 8743 bytes --]
[ 26.692881] be2net 0000:85:00.0 ocrdma_roce: Link is Up
[ 26.693339] ======================================================
[ 26.693340] [ INFO: possible circular locking dependency detected ]
[ 26.693341] 4.4.0-0.rc5.git3.1.fc24.x86_64 #1 Tainted: G I
[ 26.693341] -------------------------------------------------------
[ 26.693342] NetworkManager/2867 is trying to acquire lock:
[ 26.693348] (be_adapter_list_lock){+.+.+.}, at: [<ffffffffa053d7f5>] be_roce_dev_open+0x35/0x70 [be2net]
[ 26.693349]
but task is already holding lock:
[ 26.693354] (rtnl_mutex){+.+.+.}, at: [<ffffffff8174961b>] rtnetlink_rcv+0x1b/0x40
[ 26.693355]
which lock already depends on the new lock.
[ 26.693355]
the existing dependency chain (in reverse order) is:
[ 26.693356]
-> #2 (rtnl_mutex){+.+.+.}:
[ 26.693361] [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[ 26.693366] [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[ 26.693368] [<ffffffff81747f27>] rtnl_lock+0x17/0x20
[ 26.693375] [<ffffffffa00770b5>] enum_all_gids_of_dev_cb+0x25/0xd0 [ib_core]
[ 26.693379] [<ffffffffa0072918>] ib_enum_roce_netdev+0x128/0x130 [ib_core]
[ 26.693382] [<ffffffffa00774e1>] roce_rescan_device+0x21/0x30 [ib_core]
[ 26.693385] [<ffffffffa007521c>] ib_cache_setup_one+0x2bc/0x3b0 [ib_core]
[ 26.693388] [<ffffffffa00725d3>] ib_register_device+0x2e3/0x420 [ib_core]
[ 26.693391] [<ffffffffa076c85a>] ocrdma_add+0x43a/0x710 [ocrdma]
[ 26.693393] [<ffffffffa053d58d>] _be_roce_dev_add+0x17d/0x1e0 [be2net]
[ 26.693396] [<ffffffffa053d65a>] be_roce_register_driver+0x6a/0xd0 [be2net]
[ 26.693402] [<ffffffffa0781015>] target_dev_control_store+0x15/0x20 [target_core_mod]
[ 26.693406] [<ffffffff81002123>] do_one_initcall+0xb3/0x200
[ 26.693408] [<ffffffff811e3298>] do_init_module+0x5f/0x1e7
[ 26.693410] [<ffffffff81153246>] load_module+0x2126/0x27d0
[ 26.693411] [<ffffffff81153a62>] SyS_init_module+0x172/0x1b0
[ 26.693412] [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 26.693414]
-> #1 (device_mutex){+.+.+.}:
[ 26.693415] [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[ 26.693417] [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[ 26.693420] [<ffffffffa007232f>] ib_register_device+0x3f/0x420 [ib_core]
[ 26.693422] [<ffffffffa076c85a>] ocrdma_add+0x43a/0x710 [ocrdma]
[ 26.693423] [<ffffffffa053d58d>] _be_roce_dev_add+0x17d/0x1e0 [be2net]
[ 26.693425] [<ffffffffa053d65a>] be_roce_register_driver+0x6a/0xd0 [be2net]
[ 26.693428] [<ffffffffa0781015>] target_dev_control_store+0x15/0x20 [target_core_mod]
[ 26.693430] [<ffffffff81002123>] do_one_initcall+0xb3/0x200
[ 26.693431] [<ffffffff811e3298>] do_init_module+0x5f/0x1e7
[ 26.693432] [<ffffffff81153246>] load_module+0x2126/0x27d0
[ 26.693433] [<ffffffff81153a62>] SyS_init_module+0x172/0x1b0
[ 26.693435] [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 26.693436]
-> #0 (be_adapter_list_lock){+.+.+.}:
[ 26.693437] [<ffffffff8110a969>] __lock_acquire+0x18f9/0x1b70
[ 26.693439] [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[ 26.693440] [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[ 26.693442] [<ffffffffa053d7f5>] be_roce_dev_open+0x35/0x70 [be2net]
[ 26.693444] [<ffffffffa0532500>] be_open+0x670/0x700 [be2net]
[ 26.693446] [<ffffffff81739df8>] __dev_open+0xc8/0x140
[ 26.693448] [<ffffffff8173a10d>] __dev_change_flags+0x9d/0x160
[ 26.693449] [<ffffffff8173a1f9>] dev_change_flags+0x29/0x70
[ 26.693451] [<ffffffff8174a486>] do_setlink+0x636/0xb80
[ 26.693452] [<ffffffff8174b0bc>] rtnl_newlink+0x5ac/0x8a0
[ 26.693454] [<ffffffff81749726>] rtnetlink_rcv_msg+0xe6/0x240
[ 26.693456] [<ffffffff81773a44>] netlink_rcv_skb+0xa4/0xc0
[ 26.693457] [<ffffffff8174962a>] rtnetlink_rcv+0x2a/0x40
[ 26.693459] [<ffffffff8177315a>] netlink_unicast+0x19a/0x290
[ 26.693460] [<ffffffff81773713>] netlink_sendmsg+0x4c3/0x620
[ 26.693462] [<ffffffff81715488>] sock_sendmsg+0x38/0x50
[ 26.693463] [<ffffffff81715fa9>] ___sys_sendmsg+0x2c9/0x2e0
[ 26.693465] [<ffffffff81716cf1>] __sys_sendmsg+0x51/0x90
[ 26.693466] [<ffffffff81716d42>] SyS_sendmsg+0x12/0x20
[ 26.693467] [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 26.693468]
other info that might help us debug this:
[ 26.693469] Chain exists of:
be_adapter_list_lock --> device_mutex --> rtnl_mutex
[ 26.693470] Possible unsafe locking scenario:
[ 26.693470] CPU0 CPU1
[ 26.693470] ---- ----
[ 26.693471] lock(rtnl_mutex);
[ 26.693472] lock(device_mutex);
[ 26.693472] lock(rtnl_mutex);
[ 26.693473] lock(be_adapter_list_lock);
[ 26.693473]
*** DEADLOCK ***
[ 26.693474] 1 lock held by NetworkManager/2867:
[ 26.693476] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8174961b>] rtnetlink_rcv+0x1b/0x40
[ 26.693476]
stack backtrace:
[ 26.693478] CPU: 14 PID: 2867 Comm: NetworkManager Tainted: G I 4.4.0-0.rc5.git3.1.fc24.x86_64 #1
[ 26.693479] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014
[ 26.693481] 0000000000000000 0000000022867838 ffff8820175d74a0 ffffffff81427df9
[ 26.693482] ffffffff82bd4410 ffff8820175d74e0 ffffffff81107653 ffff8820175d7550
[ 26.693483] ffff882017590cc8 ffff882017590000 ffff882017590c90 0000000000000000
[ 26.693484] Call Trace:
[ 26.693487] [<ffffffff81427df9>] dump_stack+0x4b/0x72
[ 26.693489] [<ffffffff81107653>] print_circular_bug+0x1e3/0x250
[ 26.693490] [<ffffffff8110a969>] __lock_acquire+0x18f9/0x1b70
[ 26.693492] [<ffffffff81880964>] ? retint_kernel+0x10/0x10
[ 26.693493] [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[ 26.693495] [<ffffffffa053d7f5>] ? be_roce_dev_open+0x35/0x70 [be2net]
[ 26.693497] [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[ 26.693499] [<ffffffffa053d7f5>] ? be_roce_dev_open+0x35/0x70 [be2net]
[ 26.693500] [<ffffffff81733c0c>] ? netdev_info+0x6c/0x90
[ 26.693502] [<ffffffffa053d7f5>] ? be_roce_dev_open+0x35/0x70 [be2net]
[ 26.693504] [<ffffffff8174ed17>] ? linkwatch_fire_event+0x57/0xa0
[ 26.693506] [<ffffffffa053d7f5>] be_roce_dev_open+0x35/0x70 [be2net]
[ 26.693507] [<ffffffffa0532500>] be_open+0x670/0x700 [be2net]
[ 26.693509] [<ffffffff81739df8>] __dev_open+0xc8/0x140
[ 26.693511] [<ffffffff8173a10d>] __dev_change_flags+0x9d/0x160
[ 26.693512] [<ffffffff8173a1f9>] dev_change_flags+0x29/0x70
[ 26.693513] [<ffffffff8174a486>] do_setlink+0x636/0xb80
[ 26.693515] [<ffffffff8110952a>] ? __lock_acquire+0x4ba/0x1b70
[ 26.693518] [<ffffffffa01168ed>] ? mga_dirty_update+0x21d/0x350 [mgag200]
[ 26.693520] [<ffffffff810268b9>] ? sched_clock+0x9/0x10
[ 26.693522] [<ffffffff81458622>] ? nla_parse+0x32/0x100
[ 26.693523] [<ffffffff8174b0bc>] rtnl_newlink+0x5ac/0x8a0
[ 26.693527] [<ffffffff810b8028>] ? ns_capable+0x38/0x70
[ 26.693528] [<ffffffff81749726>] rtnetlink_rcv_msg+0xe6/0x240
[ 26.693530] [<ffffffff8174961b>] ? rtnetlink_rcv+0x1b/0x40
[ 26.693533] [<ffffffff810e82dc>] ? local_clock+0x1c/0x20
[ 26.693534] [<ffffffff8174961b>] ? rtnetlink_rcv+0x1b/0x40
[ 26.693535] [<ffffffff81749640>] ? rtnetlink_rcv+0x40/0x40
[ 26.693537] [<ffffffff81773a44>] netlink_rcv_skb+0xa4/0xc0
[ 26.693538] [<ffffffff8174962a>] rtnetlink_rcv+0x2a/0x40
[ 26.693539] [<ffffffff8177315a>] netlink_unicast+0x19a/0x290
[ 26.693540] [<ffffffff817730d4>] ? netlink_unicast+0x114/0x290
[ 26.693541] [<ffffffff81773713>] netlink_sendmsg+0x4c3/0x620
[ 26.693543] [<ffffffff81715488>] sock_sendmsg+0x38/0x50
[ 26.693544] [<ffffffff81715fa9>] ___sys_sendmsg+0x2c9/0x2e0
[ 26.693546] [<ffffffff810268b9>] ? sched_clock+0x9/0x10
[ 26.693548] [<ffffffff810e82dc>] ? local_clock+0x1c/0x20
[ 26.693551] [<ffffffff812978f2>] ? __fget+0x122/0x200
[ 26.693553] [<ffffffff812977d5>] ? __fget+0x5/0x200
[ 26.693554] [<ffffffff81297a3a>] ? __fget_light+0x2a/0x90
[ 26.693556] [<ffffffff81716cf1>] __sys_sendmsg+0x51/0x90
[ 26.693558] [<ffffffff81716d42>] SyS_sendmsg+0x12/0x20
[ 26.693559] [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 26.706745] be2net 0000:85:00.0 ocrdma_roce: Link is Up
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]
* Re: ocrdma failure in 4.4.0-rc5
[not found] ` <5675BA00.5060101-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-12-21 4:29 ` Devesh Sharma
[not found] ` <CANjDDBiJAyfS7wFr9yYs7fAnn1-h9t1JH9s1LDf-Q8RYk=Ayyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Devesh Sharma @ 2015-12-21 4:29 UTC
To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi Doug,
Thanks for your note.
We will root cause the issue asap and get back to you with the fix.
-Regards
Devesh
On Sun, Dec 20, 2015 at 1:41 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Hi Devesh,
>
> Testing 4.4.0-rc5, the ocrdma driver is failing for me (100% reliably).
> If you have vlans off of the main device, this is what I get from the
> Fedora rawhide 4.4.0-rc5 kernel:
>
>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> GPG KeyID: 0E572FDD
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: ocrdma failure in 4.4.0-rc5
[not found] ` <CANjDDBiJAyfS7wFr9yYs7fAnn1-h9t1JH9s1LDf-Q8RYk=Ayyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-12-23 17:59 ` Doug Ledford
[not found] ` <567AE0E8.70107-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Doug Ledford @ 2015-12-23 17:59 UTC
To: Devesh Sharma; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
[-- Attachment #1: Type: text/plain, Size: 765 bytes --]
On 12/20/2015 11:29 PM, Devesh Sharma wrote:
> Hi Doug,
>
> Thanks for your note.
>
> We will root cause the issue asap and get back to you with the fix.
Ping. Any update?
> -Regards
> Devesh
>
> On Sun, Dec 20, 2015 at 1:41 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> Hi Devesh,
>>
>> Testing 4.4.0-rc5, the ocrdma driver is failing for me (100% reliably).
>> If you have vlans off of the main device, this is what I get from the
>> Fedora rawhide 4.4.0-rc5 kernel:
>>
>>
>> --
>> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> GPG KeyID: 0E572FDD
>>
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG KeyID: 0E572FDD
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]
* Re: ocrdma failure in 4.4.0-rc5
[not found] ` <567AE0E8.70107-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-12-23 18:53 ` Devesh Sharma
[not found] ` <CANjDDBibF9eR-uumY3rzckeDfDE6CfBQsAAdkZVWAitdviW2FQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Devesh Sharma @ 2015-12-23 18:53 UTC
To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi Doug,
I was all set to send you an update tomorrow morning my time. Let me
give you a brief update here.
The deadlock is caused by the following two facts:
A. be2net sends open/close events to ocrdma while holding
device_list_mutex. The NIC open/close hooks are called under the rtnl
lock from user space.
B. As per the ocrdma initialization logic, ib_register_device() is called
under device_list_mutex. Inside ib_register_device(), the GID table
initialization logic in turn tries to acquire the rtnl lock to fill in
some of the table attributes.
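To make the inversion in A and B concrete, here is a minimal sketch that
records the two acquisition orders as edges of a lock-order graph and
searches it for a cycle. This is an illustration only, not driver code:
find_lock_cycle and ORDERINGS are hypothetical names, and the lock names
are simply the ones used in the description above.

```python
from collections import defaultdict


def find_lock_cycle(orderings):
    """Return one cycle in the lock-order graph as a list of names, or None.

    Each (held, wanted) pair means: some code path acquires `wanted`
    while already holding `held`.
    """
    graph = defaultdict(list)
    for held, wanted in orderings:
        graph[held].append(wanted)

    visiting, done = [], set()

    def dfs(lock):
        if lock in visiting:
            # Back edge: everything from the first visit of `lock` is a cycle.
            return visiting[visiting.index(lock):] + [lock]
        if lock in done:
            return None
        visiting.append(lock)
        for nxt in graph[lock]:
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(lock)
        return None

    for lock in list(graph):
        cycle = dfs(lock)
        if cycle:
            return cycle
    return None


# Assumed model of the two facts above (names as used in this mail).
ORDERINGS = [
    ("rtnl", "device_list_mutex"),  # A: open/close hook runs under rtnl
    ("device_list_mutex", "rtnl"),  # B: ib_register_device() takes rtnl
]

print(find_lock_cycle(ORDERINGS))
```

With either edge removed the search finds no cycle, which is why breaking
one of the two acquisition orders is enough to remove the deadlock.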
My patch series to fix this issue is already under testing. After one
round of internal review I should be able to post the series in a day
or two.
-Regards
Devesh
On Wed, Dec 23, 2015 at 11:29 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 12/20/2015 11:29 PM, Devesh Sharma wrote:
>> Hi Doug,
>>
>> Thanks for your note.
>>
>> We will root cause the issue asap and get back to you with the fix.
>
> Ping. Any update?
>
>> -Regards
>> Devesh
>>
>> On Sun, Dec 20, 2015 at 1:41 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> Hi Devesh,
>>>
>>> Testing 4.4.0-rc5, the ocrdma driver is failing for me (100% reliably).
>>> If you have vlans off of the main device, this is what I get from the
>>> Fedora rawhide 4.4.0-rc5 kernel:
>>>
>>>
>>> --
>>> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> GPG KeyID: 0E572FDD
>>>
>
>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> GPG KeyID: 0E572FDD
>
>
* Re: ocrdma failure in 4.4.0-rc5
[not found] ` <CANjDDBibF9eR-uumY3rzckeDfDE6CfBQsAAdkZVWAitdviW2FQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-12-23 19:22 ` Doug Ledford
0 siblings, 0 replies; 5+ messages in thread
From: Doug Ledford @ 2015-12-23 19:22 UTC
To: Devesh Sharma; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
[-- Attachment #1: Type: text/plain, Size: 1997 bytes --]
On 12/23/2015 01:53 PM, Devesh Sharma wrote:
> Hi Doug,
>
> I was all set to send you an update tomorrow morning my time. Let me
> give you a brief update here.
>
> The deadlock is caused by the following two facts:
> A. be2net sends open/close events to ocrdma while holding
> device_list_mutex. The NIC open/close hooks are called under the rtnl
> lock from user space.
> B. As per the ocrdma initialization logic, ib_register_device() is called
> under device_list_mutex. Inside ib_register_device(), the GID table
> initialization logic in turn tries to acquire the rtnl lock to fill in
> some of the table attributes.
>
> My patch series to fix this issue is already under testing. After one
> round of internal review I should be able to post the series in a day
> or two.
Ok. This is a serious enough issue I really want it for 4.4-rc, so the
sooner the better ;-)
Thanks for the update.
> -Regards
> Devesh
>
>
> On Wed, Dec 23, 2015 at 11:29 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On 12/20/2015 11:29 PM, Devesh Sharma wrote:
>>> Hi Doug,
>>>
>>> Thanks for your note.
>>>
>>> We will root cause the issue asap and get back to you with the fix.
>>
>> Ping. Any update?
>>
>>> -Regards
>>> Devesh
>>>
>>> On Sun, Dec 20, 2015 at 1:41 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>>> Hi Devesh,
>>>>
>>>> Testing 4.4.0-rc5, the ocrdma driver is failing for me (100% reliably).
>>>> If you have vlans off of the main device, this is what I get from the
>>>> Fedora rawhide 4.4.0-rc5 kernel:
>>>>
>>>>
>>>> --
>>>> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>> GPG KeyID: 0E572FDD
>>>>
>>
>>
>> --
>> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> GPG KeyID: 0E572FDD
>>
>>
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG KeyID: 0E572FDD
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]