public inbox for linux-rdma@vger.kernel.org
* Bug Report: possible circular locking issue
@ 2017-08-28 16:38 Doug Ledford
From: Doug Ledford @ 2017-08-28 16:38 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Resend from my work email address:


I ran across this while testing a 4.13-rc7 kernel + the rdma next code.
I don't have the time to track this down before going on PTO, so I'm
putting it out here for others to look at.

This machine has multiple RDMA devices installed:

ib0/ib1 -> dual port qib
roce -> ocrdma
iwarp -> cxgb4

During bootup I got this:

[   37.244753] iw_cxgb4: 0000:83:00.4: Up
[   37.250168] iw_cxgb4: 0000:83:00.4: On-Chip Queues not supported on this deve

[   37.263207] ======================================================
[   37.270656] WARNING: possible circular locking dependency detected
[   37.278101] 4.13.0-rc7+ #130 Not tainted
[   37.283019] ------------------------------------------------------
[   37.290470] NetworkManager/2196 is trying to acquire lock:
[   37.297143]  (device_mutex){+.+.+.}, at: [<ffffffffc08d2465>] ib_register_de]
[   37.308026] 
               but task is already holding lock:
[   37.315694]  (uld_mutex){+.+.+.}, at: [<ffffffffc0574fd4>] notify_ulds.isra.]
[   37.326108] 
               which lock already depends on the new lock.

[   37.337689] 
               the existing dependency chain (in reverse order) is:
[   37.347301] 
               -> #2 (uld_mutex){+.+.+.}:
[   37.354048]        lock_acquire+0xbd/0x200
[   37.359083]        __mutex_lock+0x88/0x950
[   37.364122]        mutex_lock_nested+0x1b/0x20
[   37.369690]        cxgb_up+0x27/0x840 [cxgb4]
[   37.375623]        cxgb_open+0x34/0x90 [cxgb4]
[   37.381168]        __dev_open+0xc9/0x140
[   37.386039]        __dev_change_flags+0x9d/0x160
[   37.391686]        dev_change_flags+0x29/0x60
[   37.397069]        do_setlink+0x4bf/0xc80
[   37.402024]        rtnl_newlink+0x512/0x8a0
[   37.407177]        rtnetlink_rcv_msg+0xac/0x240
[   37.412702]        netlink_rcv_skb+0xed/0x120
[   37.418023]        rtnetlink_rcv+0x2a/0x40
[   37.423060]        netlink_unicast+0x182/0x220
[   37.428482]        netlink_sendmsg+0x2e9/0x3e0
[   37.433868]        sock_sendmsg+0x38/0x50
[   37.438766]        ___sys_sendmsg+0x2b2/0x2d0
[   37.444052]        __sys_sendmsg+0x54/0x90
[   37.449047]        SyS_sendmsg+0x12/0x20
[   37.453848]        entry_SYSCALL_64_fastpath+0x1f/0xbe
[   37.460007] 
               -> #1 (rtnl_mutex){+.+.+.}:
[   37.466764]        lock_acquire+0xbd/0x200
[   37.471745]        __mutex_lock+0x88/0x950
[   37.476853]        mutex_lock_nested+0x1b/0x20
[   37.482336]        rtnl_lock+0x17/0x20
[   37.487038]        enum_all_gids_of_dev_cb+0x25/0xd0 [ib_core]
[   37.494509]        ib_enum_roce_netdev+0xe7/0x100 [ib_core]
[   37.501256]        roce_rescan_device+0x21/0x30 [ib_core]
[   37.507680]        ib_cache_setup_one+0x1f1/0x350 [ib_core]
[   37.514297]        ib_register_device+0x444/0x720 [ib_core]
[   37.520900]        ocrdma_add+0x46f/0x820 [ocrdma]
[   37.526622]        _be_roce_dev_add+0x17d/0x1e0 [be2net]
[   37.532929]        be_roce_register_driver+0x4a/0x90 [be2net]
[   37.539716]        ib_umad_poll+0x15/0x50 [ib_umad]
[   37.545527]        do_one_initcall+0x51/0x1a9
[   37.550881]        do_init_module+0x60/0x1ff
[   37.556129]        load_module+0x257e/0x2b10
[   37.561375]        SYSC_finit_module+0xa9/0x100
[   37.566880]        SyS_finit_module+0xe/0x10
[   37.572099]        do_syscall_64+0x6c/0x1d0
[   37.577178]        return_from_SYSCALL_64+0x0/0x7a
[   37.583232] 
               -> #0 (device_mutex){+.+.+.}:
[   37.590704]        __lock_acquire+0x153c/0x1550
[   37.596442]        lock_acquire+0xbd/0x200
[   37.601399]        __mutex_lock+0x88/0x950
[   37.606346]        mutex_lock_nested+0x1b/0x20
[   37.611669]        ib_register_device+0xb5/0x720 [ib_core]
[   37.618170]        c4iw_register_device+0x3a0/0x460 [iw_cxgb4]
[   37.625061]        c4iw_uld_state_change+0x7a4/0xcd0 [iw_cxgb4]
[   37.632108]        notify_ulds.isra.28+0x3f/0x60 [cxgb4]
[   37.638410]        cxgb_up+0x70b/0x840 [cxgb4]
[   37.643946]        cxgb_open+0x34/0x90 [cxgb4]
[   37.649265]        __dev_open+0xc9/0x140
[   37.653977]        __dev_change_flags+0x9d/0x160
[   37.659613]        dev_change_flags+0x29/0x60
[   37.665046]        do_setlink+0x4bf/0xc80
[   37.669851]        rtnl_newlink+0x512/0x8a0
[   37.675090]        rtnetlink_rcv_msg+0xac/0x240
[   37.680717]        netlink_rcv_skb+0xed/0x120
[   37.685937]        rtnetlink_rcv+0x2a/0x40
[   37.691081]        netlink_unicast+0x182/0x220
[   37.696607]        netlink_sendmsg+0x2e9/0x3e0
[   37.702136]        sock_sendmsg+0x38/0x50
[   37.707180]        ___sys_sendmsg+0x2b2/0x2d0
[   37.712639]        __sys_sendmsg+0x54/0x90
[   37.717542]        SyS_sendmsg+0x12/0x20
[   37.722249]        entry_SYSCALL_64_fastpath+0x1f/0xbe
[   37.728326] 
               other info that might help us debug this:

[   37.738479] Chain exists of:
                 device_mutex --> rtnl_mutex --> uld_mutex

[   37.750153]  Possible unsafe locking scenario:

[   37.757412]        CPU0                    CPU1
[   37.762894]        ----                    ----
[   37.768381]   lock(uld_mutex);
[   37.772149]                                lock(rtnl_mutex);
[   37.778830]                                lock(uld_mutex);
[   37.785413]   lock(device_mutex);
[   37.789462] 
                *** DEADLOCK ***

[   37.797070] 2 locks held by NetworkManager/2196:
[   37.802557]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff9e83457b>] rtnetlink_r0
[   37.812213]  #1:  (uld_mutex){+.+.+.}, at: [<ffffffffc0574fd4>] notify_ulds.]
[   37.822846] 
               stack backtrace:
[   37.828894] CPU: 17 PID: 2196 Comm: NetworkManager Not tainted 4.13.0-rc7+ #0
[   37.837655] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 2.0.2 03/6
[   37.846551] Call Trace:
[   37.849630]  dump_stack+0x85/0xcc
[   37.853679]  print_circular_bug+0x200/0x20e
[   37.858806]  __lock_acquire+0x153c/0x1550
[   37.863738]  lock_acquire+0xbd/0x200
[   37.868138]  ? ib_register_device+0xb5/0x720 [ib_core]
[   37.874275]  ? ib_register_device+0xb5/0x720 [ib_core]
[   37.880403]  __mutex_lock+0x88/0x950
[   37.884782]  ? ib_register_device+0xb5/0x720 [ib_core]
[   37.890914]  ? ib_register_device+0xb5/0x720 [ib_core]
[   37.897108]  ? find_held_lock+0x40/0xb0
[   37.901838]  mutex_lock_nested+0x1b/0x20
[   37.906669]  ib_register_device+0xb5/0x720 [ib_core]
[   37.912669]  ? c4iw_register_device+0x2f6/0x460 [iw_cxgb4]
[   37.919261]  ? rcu_read_lock_sched_held+0x98/0xa0
[   37.924973]  ? kmem_cache_alloc_trace+0x278/0x2e0
[   37.930691]  ? c4iw_register_device+0x2f6/0x460 [iw_cxgb4]
[   37.937293]  c4iw_register_device+0x3a0/0x460 [iw_cxgb4]
[   37.943702]  c4iw_uld_state_change+0x7a4/0xcd0 [iw_cxgb4]
[   37.950213]  ? notify_ulds.isra.28+0x24/0x60 [cxgb4]
[   37.956244]  notify_ulds.isra.28+0x3f/0x60 [cxgb4]
[   37.962083]  cxgb_up+0x70b/0x840 [cxgb4]
[   37.966951]  ? cxgb4_ofld_send+0x20/0x20 [cxgb4]
[   37.972594]  cxgb_open+0x34/0x90 [cxgb4]
[   37.977462]  __dev_open+0xc9/0x140
[   37.981741]  __dev_change_flags+0x9d/0x160
[   37.986794]  dev_change_flags+0x29/0x60
[   37.991557]  do_setlink+0x4bf/0xc80
[   37.995931]  rtnl_newlink+0x512/0x8a0
[   38.000500]  ? rtnl_newlink+0x104/0x8a0
[   38.005263]  ? check_usage+0xb5/0x490
[   38.009826]  ? ns_capable_common+0x7a/0x90
[   38.014876]  ? ns_capable+0x13/0x20
[   38.019253]  rtnetlink_rcv_msg+0xac/0x240
[   38.024215]  ? rtnetlink_rcv+0x1b/0x40
[   38.028879]  ? netlink_deliver_tap+0x7a/0x2c0
[   38.034232]  ? rtnl_newlink+0x8a0/0x8a0
[   38.038995]  netlink_rcv_skb+0xed/0x120
[   38.043760]  rtnetlink_rcv+0x2a/0x40
[   38.048244]  netlink_unicast+0x182/0x220
[   38.053119]  netlink_sendmsg+0x2e9/0x3e0
[   38.057985]  sock_sendmsg+0x38/0x50
[   38.062243]  ___sys_sendmsg+0x2b2/0x2d0
[   38.066877]  ? find_held_lock+0x40/0xb0
[   38.071499]  ? __fget+0x102/0x210
[   38.075647]  ? __fget+0x121/0x210
[   38.079780]  ? __fget+0x5/0x210
[   38.083706]  ? __fget_light+0x25/0x70
[   38.088208]  __sys_sendmsg+0x54/0x90
[   38.092606]  SyS_sendmsg+0x12/0x20
[   38.096810]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   38.102379] RIP: 0033:0x7f146e486974
[   38.106778] RSP: 002b:00007ffd0cd3ee00 EFLAGS: 00000293 ORIG_RAX: 0000000000e
[   38.115654] RAX: ffffffffffffffda RBX: 000055698f9641f9 RCX: 00007f146e486974
[   38.124058] RDX: 0000000000000000 RSI: 00007ffd0cd3ee50 RDI: 0000000000000007
[   38.132474] RBP: 00007ffd0cd3f2e0 R08: 0000000000000000 R09: 000055699118c300
[   38.140884] R10: 0000000000000001 R11: 0000000000000293 R12: 0000000000000001
[   38.149306] R13: 0000000000000001 R14: 00007ffd0cd3f010 R15: 000055698fbda5c0
[   38.160359] ib_srpt srpt_add_one(cxgb4_0) failed.
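
The splat reports a cycle through three locks rather than a plain two-lock inversion. As a rough illustration only, the toy Python model below records an edge A -> B whenever B is taken while A is held, and refuses a new edge that would close a cycle. It is not lockdep's implementation: the lock names come from the trace, but the `LockOrderGraph` class and its `acquire()` method are hypothetical. The three `acquire()` calls stand in for the chains labeled #2, #1 and #0 above.

```python
from collections import defaultdict


class LockOrderGraph:
    """Hypothetical toy model of lock-order tracking (not lockdep itself).

    An edge A -> B means "B was acquired while A was held".
    """

    def __init__(self):
        self.edges = defaultdict(set)

    def _path(self, src, dst, seen):
        """Depth-first search for src -> ... -> dst; returns the list of
        locks along the path, or None if dst is unreachable from src."""
        if src == dst:
            return [src]
        seen.add(src)
        for nxt in self.edges[src]:
            if nxt not in seen:
                rest = self._path(nxt, dst, seen)
                if rest:
                    return [src] + rest
        return None

    def acquire(self, held, new):
        """Record taking `new` while holding the locks in `held` (oldest
        first). If a held lock h is already reachable from `new`, the edge
        h -> new would close a cycle, so return the existing chain
        new -> ... -> h instead of recording the edge."""
        # Check the most recently acquired lock first; in this model that
        # reproduces the chain printed in the splat.
        for h in reversed(held):
            chain = self._path(new, h, set())
            if chain:
                return chain
            self.edges[h].add(new)
        return None


g = LockOrderGraph()
# Chain #2: rtnetlink -> cxgb_open -> cxgb_up takes uld_mutex under rtnl_mutex
g.acquire(["rtnl_mutex"], "uld_mutex")
# Chain #1: ocrdma's ib_register_device holds device_mutex; GID enumeration takes rtnl_lock()
g.acquire(["device_mutex"], "rtnl_mutex")
# Chain #0: iw_cxgb4's ib_register_device wants device_mutex under rtnl_mutex + uld_mutex
chain = g.acquire(["rtnl_mutex", "uld_mutex"], "device_mutex")
print("Chain exists of: " + " --> ".join(chain))
# prints: Chain exists of: device_mutex --> rtnl_mutex --> uld_mutex
```

In this toy model, checking `rtnl_mutex` first would instead report the shorter cycle device_mutex --> rtnl_mutex, because the NetworkManager task holds `rtnl_mutex` as well as `uld_mutex` when it asks for `device_mutex`.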

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

* Re: Bug Report: possible circular locking issue
@ 2017-08-28 19:12 Doug Ledford
From: Doug Ledford @ 2017-08-28 19:12 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; +Cc: Wise, Steve

On Mon, 2017-08-28 at 12:38 -0400, Doug Ledford wrote:
> Resend from my work email address:
> 
> 
> I ran across this while testing a 4.13-rc7 kernel + the rdma next
> code.

This reproduces on a stock 4.13-rc7 kernel.  But across all the
hardware I've booted it on so far, it only shows up on cxgb4 devices,
so I think this is a cxgb4-specific issue.  Steve, can you look into
this?

My basic config is a stock Fedora rawhide box: I copied the Fedora
kernel config into my git checkout of v4.13-rc7 and compiled using that
config.  If you need any more info, I can try to get it to you.

The machine environment that produces this includes:

- a base Ethernet device + 2 VLAN devices
- SRP target mode in use (kernel LIO support); the iWARP device isn't
  specifically configured for use, but srpt tries to set it up anyway
- iSER target mode in use (kernel LIO support again, single TPG with a
  wildcard address, so the iWARP devices are in use)
- NFSoRDMA in use and exporting several mount points, again with a
  wildcard address, so all RDMA devices are candidates

With this environment, I get the traceback on bootup every time.  The
machine then continues to run.  I haven't tested it under load to see
how it behaves, but it does come up.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Bug Report: possible circular locking issue
@ 2017-08-28 19:18 Steve Wise
From: Steve Wise @ 2017-08-28 19:18 UTC
  To: 'Doug Ledford', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> On Mon, 2017-08-28 at 12:38 -0400, Doug Ledford wrote:
> > Resend from my work email address:
> >
> >
> > I ran across this while testing a 4.13-rc7 kernel + the rdma next
> > code.
> 
> This reproduces on a stock 4.13-rc7 kernel.  But, across all the stuff
> I've booted it on so far, it only shows up on cxgb4 devices, so I think
> this is a cxgb4 specific issue.  Steve, can you look into this?

Hey Doug.  Yes.  Is this a regression that shows up now in 4.13-rc but worked in older kernels?

* Re: Bug Report: possible circular locking issue
@ 2017-08-28 19:28 Doug Ledford
From: Doug Ledford @ 2017-08-28 19:28 UTC
  To: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 2017-08-28 at 14:18 -0500, Steve Wise wrote:
> > On Mon, 2017-08-28 at 12:38 -0400, Doug Ledford wrote:
> > > Resend from my work email address:
> > > 
> > > 
> > > I ran across this while testing a 4.13-rc7 kernel + the rdma next
> > > code.
> > 
> > This reproduces on a stock 4.13-rc7 kernel.  But, across all the
> > stuff
> > I've booted it on so far, it only shows up on cxgb4 devices, so I
> > think
> > this is a cxgb4 specific issue.  Steve, can you look into this?
> 
> Hey Doug.  Yes.   Is this a regression that shows up now in 4.13-rc
> but worked in older kernels?

I think so.  But I don't update this particular machine every release.
It's a contested machine with a unique setup, and for the last couple
of kernel releases it's been tied up with RHEL testing instead of my
own.  Prior to that, I don't recall this issue existing.


* RE: Bug Report: possible circular locking issue
@ 2017-08-28 19:44 Steve Wise
From: Steve Wise @ 2017-08-28 19:44 UTC
  To: 'Doug Ledford', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> > > > I ran across this while testing a 4.13-rc7 kernel + the rdma next
> > > > code.
> > >
> > > This reproduces on a stock 4.13-rc7 kernel.  But, across all the
> > > stuff
> > > I've booted it on so far, it only shows up on cxgb4 devices, so I
> > > think
> > > this is a cxgb4 specific issue.  Steve, can you look into this?
> >
> > Hey Doug.  Yes.   Is this a regression that shows up now in 4.13-rc
> > but worked in older kernels?
> 
> I think so.  But I don't update this particular machine every release
> (mainly because it's a contested machine with a unique setup, and the
> last couple kernel releases it's been tied up with rhel testing instead
> of my own, but prior to the last couple kernels I don't recall this
> issue existing).

Do you have other systems that run cxgb4 and do not exhibit this lockdep issue? 


* Re: Bug Report: possible circular locking issue
@ 2017-08-28 20:03 Doug Ledford
From: Doug Ledford @ 2017-08-28 20:03 UTC
  To: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 2017-08-28 at 14:44 -0500, Steve Wise wrote:
> > > > > I ran across this while testing a 4.13-rc7 kernel + the rdma
> > > > > next
> > > > > code.
> > > > 
> > > > This reproduces on a stock 4.13-rc7 kernel.  But, across all
> > > > the
> > > > stuff
> > > > I've booted it on so far, it only shows up on cxgb4 devices, so
> > > > I
> > > > think
> > > > this is a cxgb4 specific issue.  Steve, can you look into this?
> > > 
> > > Hey Doug.  Yes.   Is this a regression that shows up now in 4.13-
> > > rc
> > > but worked in older kernels?
> > 
> > I think so.  But I don't update this particular machine every
> > release
> > (mainly because it's a contested machine with a unique setup, and
> > the
> > last couple kernel releases it's been tied up with rhel testing
> > instead
> > of my own, but prior to the last couple kernels I don't recall this
> > issue existing).
> 
> Do you have other systems that run cxgb4

Yes...

>  and do not exhibit this lockdep issue? 

and I don't know.  The problem is that we only have a few systems with
cxgb4 hardware.  We have two pairs of systems that are used for
automated testing, and we have a few servers.  Of the servers, one is
almost never rebooted because it's the master node.  The other is the
one I'm working with right now, and although it is not the head node
and does get test kernels and the like, it's also an NVMe server and
has spent much of the last 6 months doing that.  And, unfortunately,
the automated tests on the pairs of client machines don't flag this
particular problem because, as far as I can tell, it doesn't cause any
tests to fail.  It's a theoretical deadlock pointed out by lockdep, but
I don't know that we are actually hitting it.  I'm logged into the
server right now, and I can see the lockdep splat in the dmesg output,
but a ps axf | grep D shows no processes stuck in D state, so no
threads have actually hit the lock order issue and deadlocked.
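
A note on the `ps axf | grep D` check: `grep D` matches any line containing a capital D anywhere, and `ps axf` prints one line per process rather than per thread. Because `mutex_lock()` sleeps in `TASK_UNINTERRUPTIBLE`, a thread actually stuck on one of these mutexes would be in state D. A slightly stricter check is sketched below, assuming the standard Linux `/proc/<pid>/task/<tid>/stat` layout; the helper names `task_state()` and `d_state_tasks()` are hypothetical.

```python
import os


def task_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/task/<tid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces
    or ')', so parse from the last ')' instead of splitting the whole line.
    """
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def d_state_tasks(proc="/proc"):
    """Yield (pid, tid, comm) for every thread in uninterruptible sleep (D)."""
    for pid in sorted(filter(str.isdigit, os.listdir(proc)), key=int):
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    line = f.read()
            except OSError:
                continue  # thread exited while we were scanning
            if task_state(line) == "D":
                comm = line[line.index("(") + 1:line.rindex(")")]
                yield pid, tid, comm


if __name__ == "__main__":
    stuck = list(d_state_tasks())
    for pid, tid, comm in stuck:
        print(f"D state: pid={pid} tid={tid} comm={comm}")
    print(f"{len(stuck)} thread(s) in uninterruptible sleep")
```

An empty result while the splat is still in dmesg would be consistent with the ordering problem being latent rather than actually hit; note that D state is also entered briefly for ordinary I/O, so a single snapshot is only a hint.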


* Re: Bug Report: possible circular locking issue
@ 2017-08-31 15:24 Potnuri Bharat Teja
From: Potnuri Bharat Teja @ 2017-08-31 15:24 UTC
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, SWise OGC

Hi Doug,
Could you please share the kernel config you use on the Fedora box?
I tried enabling lock debugging on 4.13-rc7, but I don't see the warning.
Thanks,
Bharat.
On Tuesday, August 08/29/17, 2017 at 00:42:09 +0530, Doug Ledford wrote:
> On Mon, 2017-08-28 at 12:38 -0400, Doug Ledford wrote:
> > Resend from my work email address:
> > 
> > 
> > I ran across this while testing a 4.13-rc7 kernel + the rdma next
> > code.
> 
> This reproduces on a stock 4.13-rc7 kernel.  But, across all the stuff
> I've booted it on so far, it only shows up on cxgb4 devices, so I think
> this is a cxgb4 specific issue.  Steve, can you look into this? 
> 
> My basic config is a stock Fedora rawhide box and I took the Fedora
> kernel config and copied it into my git repo checkout of v4.13-rc7 and
> compiled using that config.  If you need any more info, I can try to
> get it to you.
> 
> The machine environment that produces this includes:
> 
> base Ethernet device + 2 vlan devices
> srp target mode is in use (kernel LIO support), the iwarp device isn't
> specifically configured for use, but srpt tries to set it up anyway
> iser target mode is in use (kernel LIO support again, single tpg with
> wildcard address so the iwarp devices are in use)
> nfsordma in use and exporting several mount points, again with wildcard
> address so all RDMA devices are candidates
> 
> With this environment, I get the trackback on bootup every time.  It
> then proceeds to run.  I haven't tested it under load to see how it
> does, but it's up anyway.
> 
> >  I don't have the time to track this down before going on PTO, so I'm
> > putting it out here for others to look at.
> > 
> > This machine holds multiple connections in it:
> > 
> > ib0/ib1 -> dual port qib
> > roce -> ocrdma
> > iwarp -> cxgb4
> > 
> > During bootup I got this:
> > 
> > [   37.244753] iw_cxgb4: 0000:83:00.4: Up
> > [   37.250168] iw_cxgb4: 0000:83:00.4: On-Chip Queues not supported
> > on
> > this deve
> > 
> > [   37.263207] ======================================================
> > [   37.270656] WARNING: possible circular locking dependency detected
> > [   37.278101] 4.13.0-rc7+ #130 Not tainted
> > [   37.283019] ------------------------------------------------------
> > [   37.290470] NetworkManager/2196 is trying to acquire lock:
> > [   37.297143]  (device_mutex){+.+.+.}, at: [<ffffffffc08d2465>]
> > ib_register_de]
> > [   37.308026] 
> >                but task is already holding lock:
> > [   37.315694]  (uld_mutex){+.+.+.}, at: [<ffffffffc0574fd4>]
> > notify_ulds.isra.]
> > [   37.326108] 
> >                which lock already depends on the new lock.
> > 
> > [   37.337689] 
> >                the existing dependency chain (in reverse order) is:
> > [   37.347301] 
> >                -> #2 (uld_mutex){+.+.+.}:
> > [   37.354048]        lock_acquire+0xbd/0x200
> > [   37.359083]        __mutex_lock+0x88/0x950
> > [   37.364122]        mutex_lock_nested+0x1b/0x20
> > [   37.369690]        cxgb_up+0x27/0x840 [cxgb4]
> > [   37.375623]        cxgb_open+0x34/0x90 [cxgb4]
> > [   37.381168]        __dev_open+0xc9/0x140
> > [   37.386039]        __dev_change_flags+0x9d/0x160
> > [   37.391686]        dev_change_flags+0x29/0x60
> > [   37.397069]        do_setlink+0x4bf/0xc80
> > [   37.402024]        rtnl_newlink+0x512/0x8a0
> > [   37.407177]        rtnetlink_rcv_msg+0xac/0x240
> > [   37.412702]        netlink_rcv_skb+0xed/0x120
> > [   37.418023]        rtnetlink_rcv+0x2a/0x40
> > [   37.423060]        netlink_unicast+0x182/0x220
> > [   37.428482]        netlink_sendmsg+0x2e9/0x3e0
> > [   37.433868]        sock_sendmsg+0x38/0x50
> > [   37.438766]        ___sys_sendmsg+0x2b2/0x2d0
> > [   37.444052]        __sys_sendmsg+0x54/0x90
> > [   37.449047]        SyS_sendmsg+0x12/0x20
> > [   37.453848]        entry_SYSCALL_64_fastpath+0x1f/0xbe
> > [   37.460007] 
> >                -> #1 (rtnl_mutex){+.+.+.}:
> > [   37.466764]        lock_acquire+0xbd/0x200
> > [   37.471745]        __mutex_lock+0x88/0x950
> > [   37.476853]        mutex_lock_nested+0x1b/0x20
> > [   37.482336]        rtnl_lock+0x17/0x20
> > [   37.487038]        enum_all_gids_of_dev_cb+0x25/0xd0 [ib_core]
> > [   37.494509]        ib_enum_roce_netdev+0xe7/0x100 [ib_core]
> > [   37.501256]        roce_rescan_device+0x21/0x30 [ib_core]
> > [   37.507680]        ib_cache_setup_one+0x1f1/0x350 [ib_core]
> > [   37.514297]        ib_register_device+0x444/0x720 [ib_core]
> > [   37.520900]        ocrdma_add+0x46f/0x820 [ocrdma]
> > [   37.526622]        _be_roce_dev_add+0x17d/0x1e0 [be2net]
> > [   37.532929]        be_roce_register_driver+0x4a/0x90 [be2net]
> > [   37.539716]        ib_umad_poll+0x15/0x50 [ib_umad]
> > [   37.545527]        do_one_initcall+0x51/0x1a9
> > [   37.550881]        do_init_module+0x60/0x1ff
> > [   37.556129]        load_module+0x257e/0x2b10
> > [   37.561375]        SYSC_finit_module+0xa9/0x100
> > [   37.566880]        SyS_finit_module+0xe/0x10
> > [   37.572099]        do_syscall_64+0x6c/0x1d0
> > [   37.577178]        return_from_SYSCALL_64+0x0/0x7a
> > [   37.583232] 
> >                -> #0 (device_mutex){+.+.+.}:
> > [   37.590704]        __lock_acquire+0x153c/0x1550
> > [   37.596442]        lock_acquire+0xbd/0x200
> > [   37.601399]        __mutex_lock+0x88/0x950
> > [   37.606346]        mutex_lock_nested+0x1b/0x20
> > [   37.611669]        ib_register_device+0xb5/0x720 [ib_core]
> > [   37.618170]        c4iw_register_device+0x3a0/0x460 [iw_cxgb4]
> > [   37.625061]        c4iw_uld_state_change+0x7a4/0xcd0 [iw_cxgb4]
> > [   37.632108]        notify_ulds.isra.28+0x3f/0x60 [cxgb4]
> > [   37.638410]        cxgb_up+0x70b/0x840 [cxgb4]
> > [   37.643946]        cxgb_open+0x34/0x90 [cxgb4]
> > [   37.649265]        __dev_open+0xc9/0x140
> > [   37.653977]        __dev_change_flags+0x9d/0x160
> > [   37.659613]        dev_change_flags+0x29/0x60
> > [   37.665046]        do_setlink+0x4bf/0xc80
> > [   37.669851]        rtnl_newlink+0x512/0x8a0
> > [   37.675090]        rtnetlink_rcv_msg+0xac/0x240
> > [   37.680717]        netlink_rcv_skb+0xed/0x120
> > [   37.685937]        rtnetlink_rcv+0x2a/0x40
> > [   37.691081]        netlink_unicast+0x182/0x220
> > [   37.696607]        netlink_sendmsg+0x2e9/0x3e0
> > [   37.702136]        sock_sendmsg+0x38/0x50
> > [   37.707180]        ___sys_sendmsg+0x2b2/0x2d0
> > [   37.712639]        __sys_sendmsg+0x54/0x90
> > [   37.717542]        SyS_sendmsg+0x12/0x20
> > [   37.722249]        entry_SYSCALL_64_fastpath+0x1f/0xbe
> > [   37.728326] 
> >                other info that might help us debug this:
> > 
> > [   37.738479] Chain exists of:
> >                  device_mutex --> rtnl_mutex --> uld_mutex
> > 
> > [   37.750153]  Possible unsafe locking scenario:
> > 
> > [   37.757412]        CPU0                    CPU1
> > [   37.762894]        ----                    ----
> > [   37.768381]   lock(uld_mutex);
> > [   37.772149]                                lock(rtnl_mutex);
> > [   37.778830]                                lock(uld_mutex);
> > [   37.785413]   lock(device_mutex);
> > [   37.789462] 
> >                 *** DEADLOCK ***
> > 
> > [   37.797070] 2 locks held by NetworkManager/2196:
> > [   37.802557]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff9e83457b>]
> > rtnetlink_r0
> > [   37.812213]  #1:  (uld_mutex){+.+.+.}, at: [<ffffffffc0574fd4>]
> > notify_ulds.]
> > [   37.822846] 
> >                stack backtrace:
> > [   37.828894] CPU: 17 PID: 2196 Comm: NetworkManager Not tainted
> > 4.13.0-rc7+ #0
> > [   37.837655] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS
> > 2.0.2 03/6
> > [   37.846551] Call Trace:
> > [   37.849630]  dump_stack+0x85/0xcc
> > [   37.853679]  print_circular_bug+0x200/0x20e
> > [   37.858806]  __lock_acquire+0x153c/0x1550
> > [   37.863738]  lock_acquire+0xbd/0x200
> > [   37.868138]  ? ib_register_device+0xb5/0x720 [ib_core]
> > [   37.874275]  ? ib_register_device+0xb5/0x720 [ib_core]
> > [   37.880403]  __mutex_lock+0x88/0x950
> > [   37.884782]  ? ib_register_device+0xb5/0x720 [ib_core]
> > [   37.890914]  ? ib_register_device+0xb5/0x720 [ib_core]
> > [   37.897108]  ? find_held_lock+0x40/0xb0
> > [   37.901838]  mutex_lock_nested+0x1b/0x20
> > [   37.906669]  ib_register_device+0xb5/0x720 [ib_core]
> > [   37.912669]  ? c4iw_register_device+0x2f6/0x460 [iw_cxgb4]
> > [   37.919261]  ? rcu_read_lock_sched_held+0x98/0xa0
> > [   37.924973]  ? kmem_cache_alloc_trace+0x278/0x2e0
> > [   37.930691]  ? c4iw_register_device+0x2f6/0x460 [iw_cxgb4]
> > [   37.937293]  c4iw_register_device+0x3a0/0x460 [iw_cxgb4]
> > [   37.943702]  c4iw_uld_state_change+0x7a4/0xcd0 [iw_cxgb4]
> > [   37.950213]  ? notify_ulds.isra.28+0x24/0x60 [cxgb4]
> > [   37.956244]  notify_ulds.isra.28+0x3f/0x60 [cxgb4]
> > [   37.962083]  cxgb_up+0x70b/0x840 [cxgb4]
> > [   37.966951]  ? cxgb4_ofld_send+0x20/0x20 [cxgb4]
> > [   37.972594]  cxgb_open+0x34/0x90 [cxgb4]
> > [   37.977462]  __dev_open+0xc9/0x140
> > [   37.981741]  __dev_change_flags+0x9d/0x160
> > [   37.986794]  dev_change_flags+0x29/0x60
> > [   37.991557]  do_setlink+0x4bf/0xc80
> > [   37.995931]  rtnl_newlink+0x512/0x8a0
> > [   38.000500]  ? rtnl_newlink+0x104/0x8a0
> > [   38.005263]  ? check_usage+0xb5/0x490
> > [   38.009826]  ? ns_capable_common+0x7a/0x90
> > [   38.014876]  ? ns_capable+0x13/0x20
> > [   38.019253]  rtnetlink_rcv_msg+0xac/0x240
> > [   38.024215]  ? rtnetlink_rcv+0x1b/0x40
> > [   38.028879]  ? netlink_deliver_tap+0x7a/0x2c0
> > [   38.034232]  ? rtnl_newlink+0x8a0/0x8a0
> > [   38.038995]  netlink_rcv_skb+0xed/0x120
> > [   38.043760]  rtnetlink_rcv+0x2a/0x40
> > [   38.048244]  netlink_unicast+0x182/0x220
> > [   38.053119]  netlink_sendmsg+0x2e9/0x3e0
> > [   38.057985]  sock_sendmsg+0x38/0x50
> > [   38.062243]  ___sys_sendmsg+0x2b2/0x2d0
> > [   38.066877]  ? find_held_lock+0x40/0xb0
> > [   38.071499]  ? __fget+0x102/0x210
> > [   38.075647]  ? __fget+0x121/0x210
> > [   38.079780]  ? __fget+0x5/0x210
> > [   38.083706]  ? __fget_light+0x25/0x70
> > [   38.088208]  __sys_sendmsg+0x54/0x90
> > [   38.092606]  SyS_sendmsg+0x12/0x20
> > [   38.096810]  entry_SYSCALL_64_fastpath+0x1f/0xbe
> > [   38.102379] RIP: 0033:0x7f146e486974
> > [   38.106778] RSP: 002b:00007ffd0cd3ee00 EFLAGS: 00000293 ORIG_RAX:
> > 0000000000e
> > [   38.115654] RAX: ffffffffffffffda RBX: 000055698f9641f9 RCX:
> > 00007f146e486974
> > [   38.124058] RDX: 0000000000000000 RSI: 00007ffd0cd3ee50 RDI:
> > 0000000000000007
> > [   38.132474] RBP: 00007ffd0cd3f2e0 R08: 0000000000000000 R09:
> > 000055699118c300
> > [   38.140884] R10: 0000000000000001 R11: 0000000000000293 R12:
> > 0000000000000001
> > [   38.149306] R13: 0000000000000001 R14: 00007ffd0cd3f010 R15:
> > 000055698fbda5c0
> > [   38.160359] ib_srpt srpt_add_one(cxgb4_0) failed.
> > 
> -- 
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>     GPG KeyID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
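The "Chain exists of: device_mutex --> rtnl_mutex --> uld_mutex" line and the "Possible unsafe locking scenario" in the trace above describe a cycle in lockdep's lock-dependency graph. Below is a minimal userspace sketch of that idea, assuming a simplified model rather than lockdep's real implementation. Lockdep works on lock classes and also tracks IRQ contexts and trylocks, and none of that is modeled here. Each simulated acquisition records a "held -> new" edge, and an acquisition that would close a cycle is reported instead. The three acquisition paths replayed come from the #2, #1 and #0 stacks in the trace. The helpers `acquire()`, `release_all()` and `replay_trace()`, and the "module load" task label, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* The three mutexes named in the lockdep report. */
enum lock_id { DEVICE_MUTEX, RTNL_MUTEX, ULD_MUTEX, NR_LOCKS };

static const char *const lock_name[NR_LOCKS] = {
	[DEVICE_MUTEX] = "device_mutex",
	[RTNL_MUTEX] = "rtnl_mutex",
	[ULD_MUTEX] = "uld_mutex",
};

/* dep[a][b] is true once some task has taken lock b while holding lock a. */
static bool dep[NR_LOCKS][NR_LOCKS];

#define MAX_HELD 8

/* Locks currently held by one simulated task, oldest first. */
struct task {
	const char *comm;
	enum lock_id held[MAX_HELD];
	int depth;
};

/* Depth-first search over recorded edges: is 'to' reachable from 'from'? */
static bool reachable(enum lock_id from, enum lock_id to, bool seen[NR_LOCKS])
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++)
		if (dep[from][next] && !seen[next] &&
		    reachable((enum lock_id)next, to, seen))
			return true;
	return false;
}

/*
 * Take 'lock' in task 't'.  Held locks are checked newest first: if 'lock'
 * already depends (transitively) on one of them, a "held -> lock" edge would
 * close a cycle, so that held lock is reported and returned and no edges are
 * recorded.  Otherwise all "held -> lock" edges are recorded and NR_LOCKS is
 * returned.
 */
static enum lock_id acquire(struct task *t, enum lock_id lock)
{
	assert(t->depth < MAX_HELD);
	for (int i = t->depth - 1; i >= 0; i--) {
		bool seen[NR_LOCKS] = { false };
		enum lock_id held = t->held[i];

		if (reachable(lock, held, seen)) {
			printf("%s: possible circular locking: taking %s while holding %s\n",
			       t->comm, lock_name[lock], lock_name[held]);
			t->held[t->depth++] = lock; /* a real mutex is still taken */
			return held;
		}
	}
	for (int i = 0; i < t->depth; i++)
		dep[t->held[i]][lock] = true;
	t->held[t->depth++] = lock;
	return NR_LOCKS;
}

static void release_all(struct task *t)
{
	t->depth = 0;
}

/* Replay the #2, #1 and #0 paths; return the held lock flagged at the end. */
static enum lock_id replay_trace(void)
{
	struct task nm = { .comm = "NetworkManager" };
	struct task loader = { .comm = "module load" };
	enum lock_id culprit;

	/* #2: rtnetlink holds rtnl_mutex, then cxgb_up() takes uld_mutex. */
	acquire(&nm, RTNL_MUTEX);
	acquire(&nm, ULD_MUTEX);
	release_all(&nm);

	/* #1: ib_register_device() for ocrdma holds device_mutex, then the
	 * RoCE GID rescan (enum_all_gids_of_dev_cb) calls rtnl_lock(). */
	acquire(&loader, DEVICE_MUTEX);
	acquire(&loader, RTNL_MUTEX);
	release_all(&loader);

	/* #0: rtnl_mutex and uld_mutex held, notify_ulds() reaches
	 * c4iw_register_device() -> ib_register_device() -> device_mutex. */
	acquire(&nm, RTNL_MUTEX);
	acquire(&nm, ULD_MUTEX);
	culprit = acquire(&nm, DEVICE_MUTEX);
	release_all(&nm);
	return culprit;
}
```

Calling `replay_trace()` from a `main()` prints one line, "NetworkManager: possible circular locking: taking device_mutex while holding uld_mutex", and returns `ULD_MUTEX`. That is the same pair of locks lockdep names. At that point NetworkManager also holds rtnl_mutex, and the ocrdma path has already recorded device_mutex -> rtnl_mutex. So the same acquisition is also a direct two-lock inversion against rtnl_mutex. The sketch reports only the newest held lock it finds in a cycle.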


* Re: Bug Report: possible circular locking issue
       [not found]         ` <20170831152430.GA15173-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
@ 2017-09-05 11:40           ` Potnuri Bharat Teja
  0 siblings, 0 replies; 8+ messages in thread
From: Potnuri Bharat Teja @ 2017-09-05 11:40 UTC
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, SWise OGC

On Thursday, August 08/31/17, 2017 at 20:54:31 +0530, Potnuri Bharat Teja wrote:
> Hi Doug,
> Could you please share the config you have on the Fedora box.
> I tried enabling lock debugging on 4.13-rc7, but I don't see the warning.
Never mind, I now see the issue on my machines.
Thanks,
Bharat.
> Thanks,
> Bharat.
> On Tuesday, August 08/29/17, 2017 at 00:42:09 +0530, Doug Ledford wrote:
> > On Mon, 2017-08-28 at 12:38 -0400, Doug Ledford wrote:
> > > Resend from my work email address:
> > > 
> > > 
> > > I ran across this while testing a 4.13-rc7 kernel + the rdma next
> > > code.
> > 
> > This reproduces on a stock 4.13-rc7 kernel.  But across all the hardware
> > I've booted it on so far, it only shows up on cxgb4 devices, so I think
> > this is a cxgb4-specific issue.  Steve, can you look into this?
> > 
> > My basic setup is a stock Fedora rawhide box.  I copied the Fedora
> > kernel config into my git checkout of v4.13-rc7 and compiled with that
> > config.  If you need any more info, I can try to get it to you.
> > 
> > The machine environment that produces this includes:
> > 
> > - a base Ethernet device plus 2 VLAN devices
> > - SRP target mode in use (kernel LIO support); the iWARP device isn't
> >   specifically configured for use, but srpt tries to set it up anyway
> > - iSER target mode in use (kernel LIO support again, a single TPG with
> >   a wildcard address, so the iWARP devices are in use)
> > - NFSoRDMA in use and exporting several mount points, again with a
> >   wildcard address, so all RDMA devices are candidates
> > 
> > With this environment, I get the traceback on bootup every time.  The
> > machine then continues to run.  I haven't tested it under load to see
> > how it does, but it's up anyway.
> > 
> > >  I don't have the time to track this down before going on PTO, so I'm
> > > putting it out here for others to look at.
> > > 
> > > This machine holds multiple connections in it:
> > > 
> > > ib0/ib1 -> dual port qib
> > > roce -> ocrdma
> > > iwarp -> cxgb4
> > > 
> > > During bootup I got this:
> > > 
> > > [   37.244753] iw_cxgb4: 0000:83:00.4: Up
> > > [   37.250168] iw_cxgb4: 0000:83:00.4: On-Chip Queues not supported on
> > > this deve
> > > 
> > > [...]
> > > 
> > -- 
> > Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >     GPG KeyID: B826A3330E572FDD
> >     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


end of thread, other threads:[~2017-09-05 11:40 UTC | newest]

Thread overview: 8+ messages
2017-08-28 16:38 Bug Report: possible circular locking issue Doug Ledford
     [not found] ` <1503938316.78641.98.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-28 19:12   ` Doug Ledford
     [not found]     ` <1503947529.78641.108.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-28 19:18       ` Steve Wise
2017-08-28 19:28         ` Doug Ledford
     [not found]           ` <1503948482.78641.110.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-28 19:44             ` Steve Wise
2017-08-28 20:03               ` Doug Ledford
2017-08-31 15:24       ` Potnuri Bharat Teja
     [not found]         ` <20170831152430.GA15173-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
2017-09-05 11:40           ` Potnuri Bharat Teja
