linux-rdma.vger.kernel.org archive mirror
* ocrdma failure in 4.4.0-rc5
@ 2015-12-19 20:11 Doug Ledford
From: Doug Ledford @ 2015-12-19 20:11 UTC (permalink / raw)
  To: devesh.sharma-1wcpHE2jlwO1Z/+hSey0Gg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


[-- Attachment #1.1: Type: text/plain, Size: 304 bytes --]

Hi Devesh,

Testing 4.4.0-rc5, the ocrdma driver is failing for me (100% reliably).
With VLANs configured on top of the main device, this is what I get from
the Fedora rawhide 4.4.0-rc5 kernel:


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #1.2: dmesg.out --]
[-- Type: text/plain, Size: 8743 bytes --]

[   26.692881] be2net 0000:85:00.0 ocrdma_roce: Link is Up

[   26.693339] ======================================================
[   26.693340] [ INFO: possible circular locking dependency detected ]
[   26.693341] 4.4.0-0.rc5.git3.1.fc24.x86_64 #1 Tainted: G          I    
[   26.693341] -------------------------------------------------------
[   26.693342] NetworkManager/2867 is trying to acquire lock:
[   26.693348]  (be_adapter_list_lock){+.+.+.}, at: [<ffffffffa053d7f5>] be_roce_dev_open+0x35/0x70 [be2net]
[   26.693349] 
               but task is already holding lock:
[   26.693354]  (rtnl_mutex){+.+.+.}, at: [<ffffffff8174961b>] rtnetlink_rcv+0x1b/0x40
[   26.693355] 
               which lock already depends on the new lock.

[   26.693355] 
               the existing dependency chain (in reverse order) is:
[   26.693356] 
               -> #2 (rtnl_mutex){+.+.+.}:
[   26.693361]        [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[   26.693366]        [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[   26.693368]        [<ffffffff81747f27>] rtnl_lock+0x17/0x20
[   26.693375]        [<ffffffffa00770b5>] enum_all_gids_of_dev_cb+0x25/0xd0 [ib_core]
[   26.693379]        [<ffffffffa0072918>] ib_enum_roce_netdev+0x128/0x130 [ib_core]
[   26.693382]        [<ffffffffa00774e1>] roce_rescan_device+0x21/0x30 [ib_core]
[   26.693385]        [<ffffffffa007521c>] ib_cache_setup_one+0x2bc/0x3b0 [ib_core]
[   26.693388]        [<ffffffffa00725d3>] ib_register_device+0x2e3/0x420 [ib_core]
[   26.693391]        [<ffffffffa076c85a>] ocrdma_add+0x43a/0x710 [ocrdma]
[   26.693393]        [<ffffffffa053d58d>] _be_roce_dev_add+0x17d/0x1e0 [be2net]
[   26.693396]        [<ffffffffa053d65a>] be_roce_register_driver+0x6a/0xd0 [be2net]
[   26.693402]        [<ffffffffa0781015>] target_dev_control_store+0x15/0x20 [target_core_mod]
[   26.693406]        [<ffffffff81002123>] do_one_initcall+0xb3/0x200
[   26.693408]        [<ffffffff811e3298>] do_init_module+0x5f/0x1e7
[   26.693410]        [<ffffffff81153246>] load_module+0x2126/0x27d0
[   26.693411]        [<ffffffff81153a62>] SyS_init_module+0x172/0x1b0
[   26.693412]        [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[   26.693414] 
               -> #1 (device_mutex){+.+.+.}:
[   26.693415]        [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[   26.693417]        [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[   26.693420]        [<ffffffffa007232f>] ib_register_device+0x3f/0x420 [ib_core]
[   26.693422]        [<ffffffffa076c85a>] ocrdma_add+0x43a/0x710 [ocrdma]
[   26.693423]        [<ffffffffa053d58d>] _be_roce_dev_add+0x17d/0x1e0 [be2net]
[   26.693425]        [<ffffffffa053d65a>] be_roce_register_driver+0x6a/0xd0 [be2net]
[   26.693428]        [<ffffffffa0781015>] target_dev_control_store+0x15/0x20 [target_core_mod]
[   26.693430]        [<ffffffff81002123>] do_one_initcall+0xb3/0x200
[   26.693431]        [<ffffffff811e3298>] do_init_module+0x5f/0x1e7
[   26.693432]        [<ffffffff81153246>] load_module+0x2126/0x27d0
[   26.693433]        [<ffffffff81153a62>] SyS_init_module+0x172/0x1b0
[   26.693435]        [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[   26.693436] 
               -> #0 (be_adapter_list_lock){+.+.+.}:
[   26.693437]        [<ffffffff8110a969>] __lock_acquire+0x18f9/0x1b70
[   26.693439]        [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[   26.693440]        [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[   26.693442]        [<ffffffffa053d7f5>] be_roce_dev_open+0x35/0x70 [be2net]
[   26.693444]        [<ffffffffa0532500>] be_open+0x670/0x700 [be2net]
[   26.693446]        [<ffffffff81739df8>] __dev_open+0xc8/0x140
[   26.693448]        [<ffffffff8173a10d>] __dev_change_flags+0x9d/0x160
[   26.693449]        [<ffffffff8173a1f9>] dev_change_flags+0x29/0x70
[   26.693451]        [<ffffffff8174a486>] do_setlink+0x636/0xb80
[   26.693452]        [<ffffffff8174b0bc>] rtnl_newlink+0x5ac/0x8a0
[   26.693454]        [<ffffffff81749726>] rtnetlink_rcv_msg+0xe6/0x240
[   26.693456]        [<ffffffff81773a44>] netlink_rcv_skb+0xa4/0xc0
[   26.693457]        [<ffffffff8174962a>] rtnetlink_rcv+0x2a/0x40
[   26.693459]        [<ffffffff8177315a>] netlink_unicast+0x19a/0x290
[   26.693460]        [<ffffffff81773713>] netlink_sendmsg+0x4c3/0x620
[   26.693462]        [<ffffffff81715488>] sock_sendmsg+0x38/0x50
[   26.693463]        [<ffffffff81715fa9>] ___sys_sendmsg+0x2c9/0x2e0
[   26.693465]        [<ffffffff81716cf1>] __sys_sendmsg+0x51/0x90
[   26.693466]        [<ffffffff81716d42>] SyS_sendmsg+0x12/0x20
[   26.693467]        [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[   26.693468] 
               other info that might help us debug this:

[   26.693469] Chain exists of:
                 be_adapter_list_lock --> device_mutex --> rtnl_mutex

[   26.693470]  Possible unsafe locking scenario:

[   26.693470]        CPU0                    CPU1
[   26.693470]        ----                    ----
[   26.693471]   lock(rtnl_mutex);
[   26.693472]                                lock(device_mutex);
[   26.693472]                                lock(rtnl_mutex);
[   26.693473]   lock(be_adapter_list_lock);
[   26.693473] 
                *** DEADLOCK ***

[   26.693474] 1 lock held by NetworkManager/2867:
[   26.693476]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8174961b>] rtnetlink_rcv+0x1b/0x40
[   26.693476] 
               stack backtrace:
[   26.693478] CPU: 14 PID: 2867 Comm: NetworkManager Tainted: G          I     4.4.0-0.rc5.git3.1.fc24.x86_64 #1
[   26.693479] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014
[   26.693481]  0000000000000000 0000000022867838 ffff8820175d74a0 ffffffff81427df9
[   26.693482]  ffffffff82bd4410 ffff8820175d74e0 ffffffff81107653 ffff8820175d7550
[   26.693483]  ffff882017590cc8 ffff882017590000 ffff882017590c90 0000000000000000
[   26.693484] Call Trace:
[   26.693487]  [<ffffffff81427df9>] dump_stack+0x4b/0x72
[   26.693489]  [<ffffffff81107653>] print_circular_bug+0x1e3/0x250
[   26.693490]  [<ffffffff8110a969>] __lock_acquire+0x18f9/0x1b70
[   26.693492]  [<ffffffff81880964>] ? retint_kernel+0x10/0x10
[   26.693493]  [<ffffffff8110b56e>] lock_acquire+0xce/0x1c0
[   26.693495]  [<ffffffffa053d7f5>] ? be_roce_dev_open+0x35/0x70 [be2net]
[   26.693497]  [<ffffffff8187c086>] mutex_lock_nested+0x86/0x400
[   26.693499]  [<ffffffffa053d7f5>] ? be_roce_dev_open+0x35/0x70 [be2net]
[   26.693500]  [<ffffffff81733c0c>] ? netdev_info+0x6c/0x90
[   26.693502]  [<ffffffffa053d7f5>] ? be_roce_dev_open+0x35/0x70 [be2net]
[   26.693504]  [<ffffffff8174ed17>] ? linkwatch_fire_event+0x57/0xa0
[   26.693506]  [<ffffffffa053d7f5>] be_roce_dev_open+0x35/0x70 [be2net]
[   26.693507]  [<ffffffffa0532500>] be_open+0x670/0x700 [be2net]
[   26.693509]  [<ffffffff81739df8>] __dev_open+0xc8/0x140
[   26.693511]  [<ffffffff8173a10d>] __dev_change_flags+0x9d/0x160
[   26.693512]  [<ffffffff8173a1f9>] dev_change_flags+0x29/0x70
[   26.693513]  [<ffffffff8174a486>] do_setlink+0x636/0xb80
[   26.693515]  [<ffffffff8110952a>] ? __lock_acquire+0x4ba/0x1b70
[   26.693518]  [<ffffffffa01168ed>] ? mga_dirty_update+0x21d/0x350 [mgag200]
[   26.693520]  [<ffffffff810268b9>] ? sched_clock+0x9/0x10
[   26.693522]  [<ffffffff81458622>] ? nla_parse+0x32/0x100
[   26.693523]  [<ffffffff8174b0bc>] rtnl_newlink+0x5ac/0x8a0
[   26.693527]  [<ffffffff810b8028>] ? ns_capable+0x38/0x70
[   26.693528]  [<ffffffff81749726>] rtnetlink_rcv_msg+0xe6/0x240
[   26.693530]  [<ffffffff8174961b>] ? rtnetlink_rcv+0x1b/0x40
[   26.693533]  [<ffffffff810e82dc>] ? local_clock+0x1c/0x20
[   26.693534]  [<ffffffff8174961b>] ? rtnetlink_rcv+0x1b/0x40
[   26.693535]  [<ffffffff81749640>] ? rtnetlink_rcv+0x40/0x40
[   26.693537]  [<ffffffff81773a44>] netlink_rcv_skb+0xa4/0xc0
[   26.693538]  [<ffffffff8174962a>] rtnetlink_rcv+0x2a/0x40
[   26.693539]  [<ffffffff8177315a>] netlink_unicast+0x19a/0x290
[   26.693540]  [<ffffffff817730d4>] ? netlink_unicast+0x114/0x290
[   26.693541]  [<ffffffff81773713>] netlink_sendmsg+0x4c3/0x620
[   26.693543]  [<ffffffff81715488>] sock_sendmsg+0x38/0x50
[   26.693544]  [<ffffffff81715fa9>] ___sys_sendmsg+0x2c9/0x2e0
[   26.693546]  [<ffffffff810268b9>] ? sched_clock+0x9/0x10
[   26.693548]  [<ffffffff810e82dc>] ? local_clock+0x1c/0x20
[   26.693551]  [<ffffffff812978f2>] ? __fget+0x122/0x200
[   26.693553]  [<ffffffff812977d5>] ? __fget+0x5/0x200
[   26.693554]  [<ffffffff81297a3a>] ? __fget_light+0x2a/0x90
[   26.693556]  [<ffffffff81716cf1>] __sys_sendmsg+0x51/0x90
[   26.693558]  [<ffffffff81716d42>] SyS_sendmsg+0x12/0x20
[   26.693559]  [<ffffffff8187fe32>] entry_SYSCALL_64_fastpath+0x12/0x76
[   26.706745] be2net 0000:85:00.0 ocrdma_roce: Link is Up



* Re: ocrdma failure in 4.4.0-rc5
@ 2015-12-21  4:29   ` Devesh Sharma
From: Devesh Sharma @ 2015-12-21  4:29 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Doug,

Thanks for your note.

We will root-cause the issue ASAP and get back to you with a fix.

-Regards
Devesh

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ocrdma failure in 4.4.0-rc5
@ 2015-12-23 17:59       ` Doug Ledford
From: Doug Ledford @ 2015-12-23 17:59 UTC (permalink / raw)
  To: Devesh Sharma; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 765 bytes --]

On 12/20/2015 11:29 PM, Devesh Sharma wrote:
> Hi Doug,
> 
> Thanks for your note.
> 
> We will root-cause the issue ASAP and get back to you with a fix.

Ping.  Any update?



-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD





* Re: ocrdma failure in 4.4.0-rc5
@ 2015-12-23 18:53           ` Devesh Sharma
From: Devesh Sharma @ 2015-12-23 18:53 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Doug,

I was all set to send you an update tomorrow morning, my time. Let me
give you a brief update here.

The deadlock is caused by the following two facts:
A. be2net sends open/close events to ocrdma while holding
device_list_mutex. The NIC open/close hooks are called under the rtnl
lock from user space.
B. Per the ocrdma initialization logic, ib_register_device() is called
under device_list_mutex. Inside ib_register_device(), however, the GID
table initialization logic tries to acquire the rtnl lock to fill in
some of the table attributes.
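
The cycle described in A and B can be checked mechanically against the
lock names in the lockdep report. What follows is only an illustrative
userspace sketch in Python, not kernel code and not the pending fix. It
records the order in which each of the two call paths takes its locks
and searches the resulting graph for a cycle. The helper names
record_path and find_cycle are made up for this example. The lock names
are copied from the report, where the be2net lock appears as
be_adapter_list_lock and the ib_core lock as device_mutex.

```python
from collections import defaultdict


def record_path(graph, acquired_in_order):
    """Add an edge outer -> inner for each lock taken while the previous one is held."""
    for outer, inner in zip(acquired_in_order, acquired_in_order[1:]):
        graph[outer].add(inner)


def find_cycle(graph):
    """Return one lock-order cycle as a list of lock names, or None if there is none."""
    visiting, done = [], set()

    def dfs(node):
        if node in visiting:
            # Reached a lock already on the current path: that closes a cycle.
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in sorted(graph.get(node, ())):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(graph):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


graph = defaultdict(set)

# Module-load path in the report:
# be_roce_register_driver -> ocrdma_add -> ib_register_device -> rtnl_lock
record_path(graph, ["be_adapter_list_lock", "device_mutex", "rtnl_mutex"])

# NetworkManager path in the report:
# rtnetlink_rcv -> ... -> be_open -> be_roce_dev_open
record_path(graph, ["rtnl_mutex", "be_adapter_list_lock"])

cycle = find_cycle(graph)
print(" -> ".join(cycle))
```

With only the module-load path recorded, find_cycle returns None. The
cycle be_adapter_list_lock -> device_mutex -> rtnl_mutex ->
be_adapter_list_lock appears once the rtnetlink path is added, which
matches the chain lockdep prints.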

My patch series to fix this issue is already under testing. After one
round of internal review I should be able to post the series in a day
or two.

-Regards
Devesh




* Re: ocrdma failure in 4.4.0-rc5
@ 2015-12-23 19:22               ` Doug Ledford
From: Doug Ledford @ 2015-12-23 19:22 UTC (permalink / raw)
  To: Devesh Sharma; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1997 bytes --]

On 12/23/2015 01:53 PM, Devesh Sharma wrote:
> Hi Doug,
> 
> I was all set to send you an update tomorrow morning, my time. Let me
> give you a brief update here.
> 
> The deadlock is caused by the following two facts:
> A. be2net sends open/close events to ocrdma while holding
> device_list_mutex. The NIC open/close hooks are called under the rtnl
> lock from user space.
> B. Per the ocrdma initialization logic, ib_register_device() is called
> under device_list_mutex. Inside ib_register_device(), however, the GID
> table initialization logic tries to acquire the rtnl lock to fill in
> some of the table attributes.
> 
> My patch series to fix this issue is already under testing. After one
> round of internal review I should be able to post the series in a day
> or two.

OK.  This is a serious enough issue that I really want the fix for
4.4-rc, so the sooner the better ;-)

Thanks for the update.



-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD




