linux-nfs.vger.kernel.org archive mirror
* Soft lockup in unloading kernel modules
@ 2014-05-08 14:28 Senn Klemens
  2014-05-08 15:59 ` Anna Schumaker
  0 siblings, 1 reply; 5+ messages in thread
From: Senn Klemens @ 2014-05-08 14:28 UTC
  To: linux-nfs; +Cc: linux-rdma

Hi,

I get a soft lockup on the NFS server when it reboots while at least one
client mount is established. I am using OpenSUSE 12.3 with the nfs-rdma
kernel from Anna Schumaker
(git://git.linux-nfs.org/projects/anna/nfs-rdma.git).

The export on the server is configured with:
/data	*(fsid=0,crossmnt,rw,mp,no_root_squash,sync,no_subtree_check,insecure)

The following command is used to mount the NFSv4 share:
mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.19:/ /mnt

Both the server and the client use a Mellanox MT4099 HCA.

The soft lockup can be reproduced with the following steps:
  o server: Start the NFS server
  o client: Mount the share
  o client: Run "ls" in the mounted directory
  o server: Stop the NFS server
  o server: Unload the nfs and mlx4 modules or reboot the server (I used
    the openibd init script from the Mellanox driver without having the
    Mellanox stack installed)
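
As a rough sketch, the steps above could be written down as a command plan.
The mount command is the one quoted in this report. Everything else is an
assumption made for illustration: the service name "nfsserver", the use of
"service" and "modprobe -r", and the choice of mlx4_ib as the module to
remove (the oops below shows modprobe removing mlx4_ib, but the report does
not give the exact commands). The script only prints the plan and runs
nothing.

```python
import shlex

# Values taken from the report.
SERVER_IP = "172.16.100.19"
MOUNT_POINT = "/mnt"
MOUNT_OPTS = "port=20049,rdma,vers=4.0,timeo=900"

# ASSUMPTIONS (not stated in the report): service name and module list.
NFS_SERVICE = "nfsserver"
MODULES_TO_UNLOAD = ["mlx4_ib"]


def repro_plan():
    """Return the reproduction steps as (host, argv) pairs, in order."""
    plan = [
        ("server", ["service", NFS_SERVICE, "start"]),
        ("client", ["mount", "-t", "nfs", "-o", MOUNT_OPTS,
                    f"{SERVER_IP}:/", MOUNT_POINT]),
        ("client", ["ls", MOUNT_POINT]),
        ("server", ["service", NFS_SERVICE, "stop"]),
    ]
    # Unloading the HCA driver while the client is still mounted is the
    # step that triggers the lockup or oops according to the report.
    plan += [("server", ["modprobe", "-r", m]) for m in MODULES_TO_UNLOAD]
    return plan


def format_plan(plan):
    """Render each step as 'host: shell command' without executing it."""
    return [f"{host}: {shlex.join(argv)}" for host, argv in plan]


if __name__ == "__main__":
    print("\n".join(format_plan(repro_plan())))
```

Keeping the plan as data rather than executing it makes it safe to review on
a machine that is not the test server.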

Most of the time the server reports a soft lockup:
  BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:6146]

Sometimes I get the following kernel panic instead:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
IP: [<ffffffff815a5c35>] _raw_spin_lock_bh+0x15/0x40
PGD 82a820067 PUD 857832067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: nfsd nfs_acl auth_rpcgss oid_registry nfnetlink_log
nfnetlink bluetooth rfkill nfsv4 svcrdma dm_mod cpuid nfs fscache lockd
sunrpc af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm iw_cm
ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib(-) ib_sa ib_mad ib_core
ib_addr sr_mod cdrom usb_storage joydev mlx4_core usbhid
x86_pkg_temp_thermal coretemp kvm_intel kvm ghash_clmulni_intel
aesni_intel ablk_helper cryptd iTCO_wdt lrw igb gf128mul
iTCO_vendor_support ehci_pci glue_helper pcspkr i2c_algo_bit isci
ehci_hcd aes_x86_64 ptp libsas ioatdma lpc_ich microcode sb_edac sg
pps_core usbcore ipmi_si tpm_tis edac_core scsi_transport_sas i2c_i801
mfd_core dca usb_common tpm ipmi_msghandler wmi acpi_cpufreq button edd
autofs4 xfs libcrc32c crc32c_intel processor thermal_sys scsi_dh_rdac
scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh [last unloaded: oid_registry]
CPU: 0 PID: 6603 Comm: modprobe Not tainted 3.15.0-rc2-anna-nfs-rdma+ #3
Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
task: ffff88105b8c6050 ti: ffff88105d814000 task.ti: ffff88105d814000
RIP: 0010:[<ffffffff815a5c35>]  [<ffffffff815a5c35>] _raw_spin_lock_bh+0x15/0x40
RSP: 0018:ffff88105d815d18  EFLAGS: 00010286
RAX: 0000000000010000 RBX: ffffffffffffffff RCX: 0000000000000000
RDX: 000000000000000b RSI: 0000000000000000 RDI: 0000000000000003
RBP: ffff88105d815d18 R08: ffff88087c611f38 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88087c3c9800
R13: ffff88107b82ab00 R14: 0000000000000003 R15: 0000000000000007
FS:  00007fef64612700(0000) GS:ffff88087fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000003 CR3: 000000087c2c7000 CR4: 00000000000407f0
Stack:
 ffff88105d815d58 ffffffffa05199f0 ffff88105d815d88 ffff88087c3c9800
 ffff88087c3c9400 ffff88107b82ab00 ffff88087c3c9660 ffff88087c3c95c8
 ffff88105d815d78 ffffffffa0421ce9 ffff88087c3c9400 ffff88107b82aac0
Call Trace:
 [<ffffffffa05199f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
 [<ffffffffa0421ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
 [<ffffffffa039d086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
 [<ffffffffa01dca86>] ib_unregister_device+0x46/0x120 [ib_core]
 [<ffffffffa032ddc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
 [<ffffffffa02fb9d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
 [<ffffffffa02fba2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
 [<ffffffffa033f4cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
 [<ffffffff810bd6b2>] SyS_delete_module+0x152/0x220
 [<ffffffff811496e4>] ? vm_munmap+0x54/0x70
 [<ffffffff815adca6>] system_call_fastpath+0x1a/0x1f
Code: 5d c3 0f b7 17 66 39 ca 74 f6 f3 90 0f b7 17 66 39 d1 75 f6 5d c3
55 65 81 04 25 20 b9 00 00 00 02 00 00 48 89 e5 b8 00 00 01 00 <f0> 0f
c1 07 89 c2 c1 ea 10 66 39 c2 75 04 5d c3 f3 90 0f b7 07
RIP  [<ffffffff815a5c35>] _raw_spin_lock_bh+0x15/0x40
 RSP <ffff88105d815d18>
CR2: 0000000000000003
---[ end trace 18e02ff413ac4b9b ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt
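
To make the unload path easier to read, the call trace above can be split
into frames programmatically. This is a minimal sketch and not part of the
original report: the regular expression, the helper names parse_frames and
call_chain, and the label "vmlinux" for frames without a module tag are
illustrative choices.

```python
import re

# Frame lines copied from the call trace in the oops above.
TRACE = """\
 [<ffffffffa05199f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
 [<ffffffffa0421ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
 [<ffffffffa039d086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
 [<ffffffffa01dca86>] ib_unregister_device+0x46/0x120 [ib_core]
 [<ffffffffa032ddc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
 [<ffffffffa02fb9d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
 [<ffffffffa02fba2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
 [<ffffffffa033f4cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
 [<ffffffff810bd6b2>] SyS_delete_module+0x152/0x220
 [<ffffffff811496e4>] ? vm_munmap+0x54/0x70
 [<ffffffff815adca6>] system_call_fastpath+0x1a/0x1f
"""

# One frame: [<address>] [? ]function+0xoffset/0xsize [module]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Parse oops call-trace lines into dicts, innermost frame first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue
        frames.append({
            "func": m["func"],
            "offset": int(m["off"], 16),
            # Frames without a [module] tag belong to the core kernel;
            # "vmlinux" is just the label chosen here.
            "module": m["module"] or "vmlinux",
            # A leading "?" marks a frame the unwinder is not sure about.
            "reliable": m["unreliable"] is None,
        })
    return frames


def call_chain(frames):
    """Return 'module:function' strings, outermost caller first."""
    return [f"{f['module']}:{f['func']}"
            for f in reversed(frames) if f["reliable"]]


if __name__ == "__main__":
    print("\n".join(call_chain(parse_frames(TRACE))))
```

Read from the outside in, the chain shows delete_module tearing down
mlx4_ib, which unregisters the IB device; rdma_cm then calls the svcrdma
connection-manager handler, which calls svc_xprt_enqueue in sunrpc, where
the faulting _raw_spin_lock_bh call is made.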

Kind regards,
Klemens



* Re: Soft lockup in unloading kernel modules
  2014-05-08 14:28 Soft lockup in unloading kernel modules Senn Klemens
@ 2014-05-08 15:59 ` Anna Schumaker
  2014-05-13 16:48   ` Klemens Senn
  0 siblings, 1 reply; 5+ messages in thread
From: Anna Schumaker @ 2014-05-08 15:59 UTC
  To: Senn Klemens, linux-nfs; +Cc: linux-rdma

I haven't applied Chuck's recent (v3) patches to that kernel yet (I've been waiting to see if people have comments).  I'll try to push something out today.


* Re: Soft lockup in unloading kernel modules
  2014-05-08 15:59 ` Anna Schumaker
@ 2014-05-13 16:48   ` Klemens Senn
  2014-05-19 17:51     ` Chuck Lever
  0 siblings, 1 reply; 5+ messages in thread
From: Klemens Senn @ 2014-05-13 16:48 UTC
  To: linux-rdma; +Cc: linux-nfs

Hi Anna,

Today I retried unloading the kernel modules with your updated kernel.
I also tried the nfsd-next kernel from J. Bruce Fields and Chuck's
nfs-rdma-client kernel.

In short: none of these kernels could unload the modules while a
connection was active.

In detail:

With your kernel I got the following three faults:
  o BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:4615]
  o BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
  o BUG: unable to handle kernel paging request at 0000000000005b8c

With the nfsd-next kernel I got the following results:
  o BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:4452]
  o Module unloading blocks forever; dmesg shows:
    nfsd: last server has exited, flushing export cache
    waiting module removal not supported: please upgrade
  o The kernel keeps running but reports the following:
    nfsd: last server has exited, flushing export cache
    waiting module removal not supported: please upgrade
    svc_xprt_enqueue: threads and transports both waiting??
    INFO: task modprobe:4510 blocked for more than 480 seconds.
          Not tainted 3.15.0-rc1-bfields-master+ #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    modprobe        D ffff88087fc13440     0  4510   4458 0x00000000
     ffff88105bb23c58 0000000000000086 ffff88105c14e690 0000000000013440
     ffff88105bb23fd8 0000000000013440 ffffffff81a14480 ffff88105c14e690
     0000000000000037 ffff88085d7f74d8 ffff88085d7f74e0 7fffffffffffffff
    Call Trace:
     [<ffffffff815a2424>] schedule+0x24/0x70
     [<ffffffff815a18cc>] schedule_timeout+0x1ec/0x260
     [<ffffffff8159a504>] ? printk+0x5c/0x5e
     [<ffffffff815a3406>] wait_for_completion+0x96/0x100
     [<ffffffff81080c90>] ? try_to_wake_up+0x2b0/0x2b0
     [<ffffffffa0314039>] cma_remove_one+0x1a9/0x220 [rdma_cm]
     [<ffffffffa01fea86>] ib_unregister_device+0x46/0x120 [ib_core]
     [<ffffffffa02c5dc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
     [<ffffffffa04319d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
     [<ffffffffa0431a2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
     [<ffffffffa02d74cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
     [<ffffffff810bd612>] SyS_delete_module+0x152/0x220
     [<ffffffff81149684>] ? vm_munmap+0x54/0x70
     [<ffffffff815ad5a6>] system_call_fastpath+0x1a/0x1f

With the nfs-rdma-client kernel I got the following results:
  o Module unloading blocks forever; dmesg shows:
    nfsd: last server has exited, flushing export cache
    svc_xprt_enqueue: threads and transports both waiting??
  o BUG: unable to handle kernel paging request at 0000000000004dec
    IP: [<ffffffff815a63b5>] _raw_spin_lock_bh+0x15/0x40
    PGD 107ba9a067 PUD 105c093067 PMD 0
    Oops: 0002 [#1] SMP
    Modules linked in: nfsd nfs_acl auth_rpcgss oid_registry svcrdma
dm_mod cpuid nfs fscache lockd sunrpc af_packet 8021q garp stp llc
rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en
mlx4_ib(-) ib_sa ib_mad ib_core ib_addr sr_mod cdrom usb_storage joydev
mlx4_core usbhid x86_pkg_temp_thermal coretemp kvm_intel kvm
ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper ehci_pci aes_x86_64 ehci_hcd isci iTCO_wdt libsas pcspkr
iTCO_vendor_support igb i2c_algo_bit sb_edac lpc_ich edac_core ioatdma
usbcore tpm_tis ptp microcode i2c_i801 sg mfd_core scsi_transport_sas
ipmi_si usb_common tpm wmi pps_core dca ipmi_msghandler acpi_cpufreq
button edd autofs4 xfs libcrc32c crc32c_intel processor thermal_sys
scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh
    CPU: 14 PID: 4813 Comm: modprobe Not tainted 3.15.0-rc5-cel-nfs-rdma-client-unpatched+ #2
    Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
    task: ffff88085bf96190 ti: ffff88085d42a000 task.ti: ffff88085d42a000
    RIP: 0010:[<ffffffff815a63b5>]  [<ffffffff815a63b5>] _raw_spin_lock_bh+0x15/0x40
    RSP: 0018:ffff88085d42bd18  EFLAGS: 00010286
    RAX: 0000000000010000 RBX: 0000000000004de8 RCX: 0000000000000000
    RDX: 000000000000000b RSI: 000000000000000e RDI: 0000000000004dec
    RBP: ffff88085d42bd18 R08: ffff88087c611f38 R09: 000000000000a140
    R10: 000000000000002b R11: 0000000000000000 R12: ffff88085dcc3c00
    R13: ffff88105ca13280 R14: 0000000000004dec R15: 0000000000004df0
    FS:  00007f0e49fb5700(0000) GS:ffff88107fcc0000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000004dec CR3: 000000105b027000 CR4: 00000000000407e0
    Stack:
     ffff88085d42bd58 ffffffffa03bd9f0 0000000001328b88 ffff88085dcc3c00
     ffff88085dce8000 ffff88105ca13280 ffff88085dce8260 ffff88085dce81c8
     ffff88085d42bd78 ffffffffa0441ce9 ffff88085dce8000 ffff88105ca13240
    Call Trace:
     [<ffffffffa03bd9f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
     [<ffffffffa0441ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
     [<ffffffffa031a086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
     [<ffffffffa0261a86>] ib_unregister_device+0x46/0x120 [ib_core]
     [<ffffffffa02b9dc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
     [<ffffffffa02329d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
     [<ffffffffa0232a2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
     [<ffffffffa02cb4cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
     [<ffffffff810bd6f0>] SyS_delete_module+0x170/0x1f0
     [<ffffffff811497f4>] ? vm_munmap+0x54/0x70
     [<ffffffff815ae426>] system_call_fastpath+0x1a/0x1f
    Code: 5d c3 0f b7 17 66 39 ca 74 f6 f3 90 0f b7 17 66 39 d1 75 f6 5d
c3 55 65 81 04 25 20 b9 00 00 00 02 00 00 48 89 e5 b8 00 00 01 00 <f0>
0f c1 07 89 c2 c1 ea 10 66 39 c2 75 04 5d c3 f3 90 0f b7 07
    RIP  [<ffffffff815a63b5>] _raw_spin_lock_bh+0x15/0x40
     RSP <ffff88085d42bd18>
    CR2: 0000000000004dec
    ---[ end trace bf1fd548a33cbfc4 ]---
    Kernel panic - not syncing: Fatal exception in interrupt
    Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
    ---[ end Kernel panic - not syncing: Fatal exception in interrupt
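
The two oopses in this thread can be compared by their register dumps. The
sketch below only does arithmetic on values copied from the dumps; the
helper names and the interpretation are assumptions, not something the
thread confirms. On x86-64, RDI carries the first function argument, so
CR2 == RDI means the faulting address is the lock pointer handed to
_raw_spin_lock_bh. The kernel labels a fault below the page size (4096) as
a "NULL pointer dereference" and anything above it as a "paging request",
which is why the same crash site shows up under two different headings.

```python
MASK64 = (1 << 64) - 1
PAGE_SIZE = 4096  # x86-64 base page size

# Register values copied from the two oopses in this thread.
OOPSES = {
    "3.15.0-rc2-anna-nfs-rdma+": {
        "RBX": 0xffffffffffffffff, "RDI": 0x3,
        "R14": 0x3, "R15": 0x7, "CR2": 0x3,
    },
    "3.15.0-rc5-cel-nfs-rdma-client-unpatched+": {
        "RBX": 0x4de8, "RDI": 0x4dec,
        "R14": 0x4dec, "R15": 0x4df0, "CR2": 0x4dec,
    },
}


def offset_from(base, addr):
    """Distance from base to addr, with 64-bit wraparound."""
    return (addr - base) & MASK64


def summarize(regs):
    """Relate the fault address to the other registers of one oops."""
    return {
        "fault_is_first_arg": regs["CR2"] == regs["RDI"],
        "fault_minus_rbx": offset_from(regs["RBX"], regs["CR2"]),
        "r15_minus_rbx": offset_from(regs["RBX"], regs["R15"]),
        "below_page_size": regs["CR2"] < PAGE_SIZE,
    }


if __name__ == "__main__":
    for kernel, regs in OOPSES.items():
        print(kernel, summarize(regs))
```

In both dumps the fault address is RBX + 4 and R15 is RBX + 8, while RBX
itself is not a valid kernel pointer (all ones in one case, 0x4de8 in the
other). That pattern would fit a lock member at offset 4 of a structure
reached through a stale or uninitialized pointer, though which structure
that is cannot be determined from the dumps alone.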


Regards,
Klemens



* Re: Soft lockup in unloading kernel modules
  2014-05-13 16:48   ` Klemens Senn
@ 2014-05-19 17:51     ` Chuck Lever
  2014-05-19 21:02       ` Shirley Ma
  0 siblings, 1 reply; 5+ messages in thread
From: Chuck Lever @ 2014-05-19 17:51 UTC
  To: Klemens Senn; +Cc: linux-rdma, Linux NFS Mailing List

Hi Klemens-

On May 13, 2014, at 12:48 PM, Klemens Senn <klemens.senn@ims.co.at> wrote:

> Hi Anna,
> 
> today I retried unloading the kernel modules with your updated kernel
> and additionally I tried the nfsd-next kernel from J. Bruce Fields and
> Chuck's nfs-rdma-client kernel.

I filed

  https://bugzilla.linux-nfs.org/show_bug.cgi?id=252

to track this issue.


> In short: None of these was able to unload the kernel modules with an
> active connection.
> 
> In detail:
> 
> With your kernel I got following 3 faults:
>  o BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:4615]
>  o BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000003
>  o BUG: unable to handle kernel paging request at 0000000000005b8c
> 
> With the nfsd-next kernel I got following results:
>  o BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:4452]
>  o module unloading blocks forever, dmesg shows:
>    nfsd: last server has exited, flushing export cache
>    waiting module removal not supported: please upgrade
>  o Kernel keeps running but reports the following:
>    nfsd: last server has exited, flushing export cache
>    waiting module removal not supported: please upgrade
>    svc_xprt_enqueue: threads and transports both waiting??
>    INFO: task modprobe:4510 blocked for more than 480 seconds.
>          Not tainted 3.15.0-rc1-bfields-master+ #1
>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>    modprobe        D ffff88087fc13440     0  4510   4458 0x00000000
>     ffff88105bb23c58 0000000000000086 ffff88105c14e690 0000000000013440
>     ffff88105bb23fd8 0000000000013440 ffffffff81a14480 ffff88105c14e690
>     0000000000000037 ffff88085d7f74d8 ffff88085d7f74e0 7fffffffffffffff
>    Call Trace:
>     [<ffffffff815a2424>] schedule+0x24/0x70
>     [<ffffffff815a18cc>] schedule_timeout+0x1ec/0x260
>     [<ffffffff8159a504>] ? printk+0x5c/0x5e
>     [<ffffffff815a3406>] wait_for_completion+0x96/0x100
>     [<ffffffff81080c90>] ? try_to_wake_up+0x2b0/0x2b0
>     [<ffffffffa0314039>] cma_remove_one+0x1a9/0x220 [rdma_cm]
>     [<ffffffffa01fea86>] ib_unregister_device+0x46/0x120 [ib_core]
>     [<ffffffffa02c5dc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
>     [<ffffffffa04319d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
>     [<ffffffffa0431a2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
>     [<ffffffffa02d74cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
>     [<ffffffff810bd612>] SyS_delete_module+0x152/0x220
>     [<ffffffff81149684>] ? vm_munmap+0x54/0x70
>     [<ffffffff815ad5a6>] system_call_fastpath+0x1a/0x1f
> 
> With the nfs-rdma-client I got following results:
>  o module unloading blocks forever, dmesg shows:
>    nfsd: last server has exited, flushing export cache
>    svc_xprt_enqueue: threads and transports both waiting??
>  o BUG: unable to handle kernel paging request at 0000000000004dec
>    IP: [<ffffffff815a63b5>] _raw_spin_lock_bh+0x15/0x40
>    PGD 107ba9a067 PUD 105c093067 PMD 0
>    Oops: 0002 [#1] SMP
>    Modules linked in: nfsd nfs_acl auth_rpcgss oid_registry svcrdma
> dm_mod cpuid nfs fscache lockd sunrpc af_packet 8021q garp stp llc
> rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en
> mlx4_ib(-) ib_sa ib_mad ib_core ib_addr sr_mod cdrom usb_storage joydev
> mlx4_core usbhid x86_pkg_temp_thermal coretemp kvm_intel kvm
> ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
> glue_helper ehci_pci aes_x86_64 ehci_hcd isci iTCO_wdt libsas pcspkr
> iTCO_vendor_support igb i2c_algo_bit sb_edac lpc_ich edac_core ioatdma
> usbcore tpm_tis ptp microcode i2c_i801 sg mfd_core scsi_transport_sas
> ipmi_si usb_common tpm wmi pps_core dca ipmi_msghandler acpi_cpufreq
> button edd autofs4 xfs libcrc32c crc32c_intel processor thermal_sys
> scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh
>    CPU: 14 PID: 4813 Comm: modprobe Not tainted
> 3.15.0-rc5-cel-nfs-rdma-client-unpatched+ #2
>    Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
>    task: ffff88085bf96190 ti: ffff88085d42a000 task.ti: ffff88085d42a000
>    RIP: 0010:[<ffffffff815a63b5>]  [<ffffffff815a63b5>]
> _raw_spin_lock_bh+0x15/0x40
>    RSP: 0018:ffff88085d42bd18  EFLAGS: 00010286
>    RAX: 0000000000010000 RBX: 0000000000004de8 RCX: 0000000000000000
>    RDX: 000000000000000b RSI: 000000000000000e RDI: 0000000000004dec
>    RBP: ffff88085d42bd18 R08: ffff88087c611f38 R09: 000000000000a140
>    R10: 000000000000002b R11: 0000000000000000 R12: ffff88085dcc3c00
>    R13: ffff88105ca13280 R14: 0000000000004dec R15: 0000000000004df0
>    FS:  00007f0e49fb5700(0000) GS:ffff88107fcc0000(0000)
> knlGS:0000000000000000
>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    CR2: 0000000000004dec CR3: 000000105b027000 CR4: 00000000000407e0
>    Stack:
>     ffff88085d42bd58 ffffffffa03bd9f0 0000000001328b88 ffff88085dcc3c00
>     ffff88085dce8000 ffff88105ca13280 ffff88085dce8260 ffff88085dce81c8
>     ffff88085d42bd78 ffffffffa0441ce9 ffff88085dce8000 ffff88105ca13240
>    Call Trace:
>     [<ffffffffa03bd9f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
>     [<ffffffffa0441ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
>     [<ffffffffa031a086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
>     [<ffffffffa0261a86>] ib_unregister_device+0x46/0x120 [ib_core]
>     [<ffffffffa02b9dc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
>     [<ffffffffa02329d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
>     [<ffffffffa0232a2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
>     [<ffffffffa02cb4cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
>     [<ffffffff810bd6f0>] SyS_delete_module+0x170/0x1f0
>     [<ffffffff811497f4>] ? vm_munmap+0x54/0x70
>     [<ffffffff815ae426>] system_call_fastpath+0x1a/0x1f
>    Code: 5d c3 0f b7 17 66 39 ca 74 f6 f3 90 0f b7 17 66 39 d1 75 f6 5d
> c3 55 65 81 04 25 20 b9 00 00 00 02 00 00 48 89 e5 b8 00 00 01 00 <f0>
> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 04 5d c3 f3 90 0f b7 07
>    RIP  [<ffffffff815a63b5>] _raw_spin_lock_bh+0x15/0x40
>     RSP <ffff88085d42bd18>
>    CR2: 0000000000004dec
>    ---[ end trace bf1fd548a33cbfc4 ]---
>    Kernel panic - not syncing: Fatal exception in interrupt
>    Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range:
> 0xffffffff80000000-0xffffffff9fffffff)
>    ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> Regards,
> Klemens
> 
> 
> On 05/08/2014 05:59 PM, Anna Schumaker wrote:
>> I haven't applied Chuck's recent (v3) patches to that kernel yet (I've been waiting to see if people have comments).  I'll try to push something out today.
>> 
>> On 05/08/2014 10:28 AM, Senn Klemens wrote:
>>> Hi,
>>> 
>>> I am getting a soft lockup on the NFS server on its reboot if at least
>>> one client mount is established. I am using OpenSUSE 12.3 with the
>>> nfs-rdma kernel from Anna Schumaker
>>> (git://git.linux-nfs.org/projects/anna/nfs-rdma.git).
>>> 
>>> The export on the server side is done with
>>> /data	*(fsid=0,crossmnt,rw,mp,no_root_squash,sync,no_subtree_check,insecure)
>>> 
>>> Following command is used for mounting the NFSv4 share:
>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.19:/ /mnt
>>> 
>>> The HCA is a Mellanox MT4099 on the server and the client.
>>> 
>>> The soft lockup can be reproduced by following steps:
>>>  o server: Start the nfs server
>>>  o client: Mount the share
>>>  o client: Do a "ls" in the mounted directory
>>>  o server: Stop the nfs server
>>>  o server: Unload the nfs and mlx4 modules or reboot the server (I used
>>> the openibd init script from the Mellanox driver without having the
>>> Mellanox stack installed)
>>> 
>>> The server reports a soft lockup
>>>  BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:6146]
>>> most times.
>>> 
>>> Sometimes I get following kernel panic
>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
>>> IP: [<ffffffff815a5c35>] _raw_spin_lock_bh+0x15/0x40
>>> PGD 82a820067 PUD 857832067 PMD 0
>>> Oops: 0002 [#1] SMP
>>> Modules linked in: nfsd nfs_acl auth_rpcgss oid_registry nfnetlink_log
>>> nfnetlink bluetooth rfkill nfsv4 svcrdma dm_mod cpuid nfs fscache lockd
>>> sunrpc af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm iw_cm
>>> ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib(-) ib_sa ib_mad ib_core
>>> ib_addr sr_mod cdrom usb_storage joydev mlx4_core usbhid
>>> x86_pkg_temp_thermal coretemp kvm_intel kvm ghash_clmulni_intel
>>> aesni_intel ablk_helper cryptd iTCO_wdt lrw igb gf128mul
>>> iTCO_vendor_support ehci_pci glue_helper pcspkr i2c_algo_bit isci
>>> ehci_hcd aes_x86_64 ptp libsas ioatdma lpc_ich microcode sb_edac sg
>>> pps_core usbcore ipmi_si tpm_tis edac_core scsi_transport_sas i2c_i801
>>> mfd_core dca usb_common tpm ipmi_msghandler wmi acpi_cpufreq button edd
>>> autofs4 xfs libcrc32c crc32c_intel processor thermal_sys scsi_dh_rdac
>>> scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh [last unloaded: oid_registry]
>>> CPU: 0 PID: 6603 Comm: modprobe Not tainted 3.15.0-rc2-anna-nfs-rdma+ #3
>>> Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
>>> task: ffff88105b8c6050 ti: ffff88105d814000 task.ti: ffff88105d814000
>>> RIP: 0010:[<ffffffff815a5c35>]  [<ffffffff815a5c35>]
>>> _raw_spin_lock_bh+0x15/0x40
>>> RSP: 0018:ffff88105d815d18  EFLAGS: 00010286
>>> RAX: 0000000000010000 RBX: ffffffffffffffff RCX: 0000000000000000
>>> RDX: 000000000000000b RSI: 0000000000000000 RDI: 0000000000000003
>>> RBP: ffff88105d815d18 R08: ffff88087c611f38 R09: 0000000000000001
>>> R10: 0000000000000000 R11: 0000000000000000 R12: ffff88087c3c9800
>>> R13: ffff88107b82ab00 R14: 0000000000000003 R15: 0000000000000007
>>> FS:  00007fef64612700(0000) GS:ffff88087fc00000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000003 CR3: 000000087c2c7000 CR4: 00000000000407f0
>>> Stack:
>>> ffff88105d815d58 ffffffffa05199f0 ffff88105d815d88 ffff88087c3c9800
>>> ffff88087c3c9400 ffff88107b82ab00 ffff88087c3c9660 ffff88087c3c95c8
>>> ffff88105d815d78 ffffffffa0421ce9 ffff88087c3c9400 ffff88107b82aac0
>>> Call Trace:
>>> [<ffffffffa05199f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
>>> [<ffffffffa0421ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
>>> [<ffffffffa039d086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
>>> [<ffffffffa01dca86>] ib_unregister_device+0x46/0x120 [ib_core]
>>> [<ffffffffa032ddc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
>>> [<ffffffffa02fb9d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
>>> [<ffffffffa02fba2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
>>> [<ffffffffa033f4cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
>>> [<ffffffff810bd6b2>] SyS_delete_module+0x152/0x220
>>> [<ffffffff811496e4>] ? vm_munmap+0x54/0x70
>>> [<ffffffff815adca6>] system_call_fastpath+0x1a/0x1f
>>> Code: 5d c3 0f b7 17 66 39 ca 74 f6 f3 90 0f b7 17 66 39 d1 75 f6 5d c3
>>> 55 65 81 04 25 20 b9 00 00 00 02 00 00 48 89 e5 b8 00 00 01 00 <f0> 0f
>>> c1 07 89 c2 c1 ea 10 66 39 c2 75 04 5d c3 f3 90 0f b7 07
>>> RIP  [<ffffffff815a5c35>] _raw_spin_lock_bh+0x15/0x40
>>> RSP <ffff88105d815d18>
>>> CR2: 0000000000000003
>>> ---[ end trace 18e02ff413ac4b9b ]---
>>> Kernel panic - not syncing: Fatal exception in interrupt
>>> Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range:
>>> 0xffffffff80000000-0xffffffff9fffffff)
>>> ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>>> 
>>> Kind regards,
>>> Klemens
>>> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





* Re: Soft lockup in unloading kernel modules
  2014-05-19 17:51     ` Chuck Lever
@ 2014-05-19 21:02       ` Shirley Ma
  0 siblings, 0 replies; 5+ messages in thread
From: Shirley Ma @ 2014-05-19 21:02 UTC (permalink / raw)
  To: Chuck Lever, Klemens Senn; +Cc: linux-rdma, Linux NFS Mailing List

Klemens,

Can you add more details on how you unloaded the modules (step by step)
to the bug report?

Thanks
Shirley
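
As a starting point for such a step-by-step description, the reproduction
steps from the original report could be written down roughly as below. This
is only a sketch: the service name "nfsserver", the use of systemctl, and
removing mlx4_ib directly with "modprobe -r" are assumptions (the report used
the openibd init script, whose exact commands are not shown in this thread).
The mount command is the one given in the report. By default the script only
prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch of the reproduction steps from the original report.
# Assumptions (not from the thread): the service name "nfsserver",
# the use of systemctl, and unloading via "modprobe -r mlx4_ib".
# DRY_RUN=1 (the default) only prints each command.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

server_start() {
    run systemctl start nfsserver    # assumed service name
}

client_mount_and_list() {
    # Mount command as given in the report.
    run mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.19:/ /mnt
    run ls /mnt
}

server_stop_and_unload() {
    run systemctl stop nfsserver     # assumed service name
    # The traces show modprobe removing mlx4_ib ("mlx4_ib(-)").
    run modprobe -r mlx4_ib
}

# Select a step on the command line; with no argument nothing is run.
case "${1:-}" in
    server-start)  server_start ;;
    client)        client_mount_and_list ;;
    server-unload) server_stop_and_unload ;;
esac
```

Run it with "server-start" and "server-unload" on the server and "client" on
the client, in the order start, client, unload; set DRY_RUN=0 to actually
execute the commands.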

On 05/19/2014 10:51 AM, Chuck Lever wrote:
> Hi Klemens-
>
> On May 13, 2014, at 12:48 PM, Klemens Senn <klemens.senn@ims.co.at> wrote:
>
>> Hi Anna,
>>
>> today I retried unloading the kernel modules with your updated kernel
>> and additionally I tried the nfsd-next kernel from J. Bruce Fields and
>> Chuck's nfs-rdma-client kernel.
> I filed
>
>    https://bugzilla.linux-nfs.org/show_bug.cgi?id=252
>
> to track this issue.
>
>
>> In short: None of these was able to unload the kernel modules with an
>> active connection.
>>
>> In detail:
>>
>> With your kernel I got following 3 faults:
>>   o BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:4615]
>>   o BUG: unable to handle kernel NULL pointer dereference at
>> 0000000000000003
>>   o BUG: unable to handle kernel paging request at 0000000000005b8c
>>
>> With the nfsd-next kernel I got following results:
>>   o BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:4452]
>>   o module unloading blocks forever, dmesg shows:
>>     nfsd: last server has exited, flushing export cache
>>     waiting module removal not supported: please upgrade
>>   o Kernel keeps running but reports the following:
>>     nfsd: last server has exited, flushing export cache
>>     waiting module removal not supported: please upgrade
>>     svc_xprt_enqueue: threads and transports both waiting??
>>     INFO: task modprobe:4510 blocked for more than 480 seconds.
>>           Not tainted 3.15.0-rc1-bfields-master+ #1
>>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>     modprobe        D ffff88087fc13440     0  4510   4458 0x00000000
>>      ffff88105bb23c58 0000000000000086 ffff88105c14e690 0000000000013440
>>      ffff88105bb23fd8 0000000000013440 ffffffff81a14480 ffff88105c14e690
>>      0000000000000037 ffff88085d7f74d8 ffff88085d7f74e0 7fffffffffffffff
>>     Call Trace:
>>      [<ffffffff815a2424>] schedule+0x24/0x70
>>      [<ffffffff815a18cc>] schedule_timeout+0x1ec/0x260
>>      [<ffffffff8159a504>] ? printk+0x5c/0x5e
>>      [<ffffffff815a3406>] wait_for_completion+0x96/0x100
>>      [<ffffffff81080c90>] ? try_to_wake_up+0x2b0/0x2b0
>>      [<ffffffffa0314039>] cma_remove_one+0x1a9/0x220 [rdma_cm]
>>      [<ffffffffa01fea86>] ib_unregister_device+0x46/0x120 [ib_core]
>>      [<ffffffffa02c5dc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
>>      [<ffffffffa04319d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
>>      [<ffffffffa0431a2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
>>      [<ffffffffa02d74cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
>>      [<ffffffff810bd612>] SyS_delete_module+0x152/0x220
>>      [<ffffffff81149684>] ? vm_munmap+0x54/0x70
>>      [<ffffffff815ad5a6>] system_call_fastpath+0x1a/0x1f
>>
>> With the nfs-rdma-client I got following results:
>>   o module unloading blocks forever, dmesg shows:
>>     nfsd: last server has exited, flushing export cache
>>     svc_xprt_enqueue: threads and transports both waiting??
>>   o BUG: unable to handle kernel paging request at 0000000000004dec
>>     IP: [<ffffffff815a63b5>] _raw_spin_lock_bh+0x15/0x40
>>     PGD 107ba9a067 PUD 105c093067 PMD 0
>>     Oops: 0002 [#1] SMP
>>     Modules linked in: nfsd nfs_acl auth_rpcgss oid_registry svcrdma
>> dm_mod cpuid nfs fscache lockd sunrpc af_packet 8021q garp stp llc
>> rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en
>> mlx4_ib(-) ib_sa ib_mad ib_core ib_addr sr_mod cdrom usb_storage joydev
>> mlx4_core usbhid x86_pkg_temp_thermal coretemp kvm_intel kvm
>> ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
>> glue_helper ehci_pci aes_x86_64 ehci_hcd isci iTCO_wdt libsas pcspkr
>> iTCO_vendor_support igb i2c_algo_bit sb_edac lpc_ich edac_core ioatdma
>> usbcore tpm_tis ptp microcode i2c_i801 sg mfd_core scsi_transport_sas
>> ipmi_si usb_common tpm wmi pps_core dca ipmi_msghandler acpi_cpufreq
>> button edd autofs4 xfs libcrc32c crc32c_intel processor thermal_sys
>> scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh
>>     CPU: 14 PID: 4813 Comm: modprobe Not tainted
>> 3.15.0-rc5-cel-nfs-rdma-client-unpatched+ #2
>>     Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
>>     task: ffff88085bf96190 ti: ffff88085d42a000 task.ti: ffff88085d42a000
>>     RIP: 0010:[<ffffffff815a63b5>]  [<ffffffff815a63b5>]
>> _raw_spin_lock_bh+0x15/0x40
>>     RSP: 0018:ffff88085d42bd18  EFLAGS: 00010286
>>     RAX: 0000000000010000 RBX: 0000000000004de8 RCX: 0000000000000000
>>     RDX: 000000000000000b RSI: 000000000000000e RDI: 0000000000004dec
>>     RBP: ffff88085d42bd18 R08: ffff88087c611f38 R09: 000000000000a140
>>     R10: 000000000000002b R11: 0000000000000000 R12: ffff88085dcc3c00
>>     R13: ffff88105ca13280 R14: 0000000000004dec R15: 0000000000004df0
>>     FS:  00007f0e49fb5700(0000) GS:ffff88107fcc0000(0000)
>> knlGS:0000000000000000
>>     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>     CR2: 0000000000004dec CR3: 000000105b027000 CR4: 00000000000407e0
>>     Stack:
>>      ffff88085d42bd58 ffffffffa03bd9f0 0000000001328b88 ffff88085dcc3c00
>>      ffff88085dce8000 ffff88105ca13280 ffff88085dce8260 ffff88085dce81c8
>>      ffff88085d42bd78 ffffffffa0441ce9 ffff88085dce8000 ffff88105ca13240
>>     Call Trace:
>>      [<ffffffffa03bd9f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
>>      [<ffffffffa0441ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
>>      [<ffffffffa031a086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
>>      [<ffffffffa0261a86>] ib_unregister_device+0x46/0x120 [ib_core]
>>      [<ffffffffa02b9dc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
>>      [<ffffffffa02329d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
>>      [<ffffffffa0232a2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
>>      [<ffffffffa02cb4cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
>>      [<ffffffff810bd6f0>] SyS_delete_module+0x170/0x1f0
>>      [<ffffffff811497f4>] ? vm_munmap+0x54/0x70
>>      [<ffffffff815ae426>] system_call_fastpath+0x1a/0x1f
>>     Code: 5d c3 0f b7 17 66 39 ca 74 f6 f3 90 0f b7 17 66 39 d1 75 f6 5d
>> c3 55 65 81 04 25 20 b9 00 00 00 02 00 00 48 89 e5 b8 00 00 01 00 <f0>
>> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 04 5d c3 f3 90 0f b7 07
>>     RIP  [<ffffffff815a63b5>] _raw_spin_lock_bh+0x15/0x40
>>      RSP <ffff88085d42bd18>
>>     CR2: 0000000000004dec
>>     ---[ end trace bf1fd548a33cbfc4 ]---
>>     Kernel panic - not syncing: Fatal exception in interrupt
>>     Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range:
>> 0xffffffff80000000-0xffffffff9fffffff)
>>     ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>>
>>
>> Regards,
>> Klemens
>>
>>
>> On 05/08/2014 05:59 PM, Anna Schumaker wrote:
>>> I haven't applied Chuck's recent (v3) patches to that kernel yet (I've been waiting to see if people have comments).  I'll try to push something out today.
>>>
>>> On 05/08/2014 10:28 AM, Senn Klemens wrote:
>>>> Hi,
>>>>
>>>> I am getting a soft lockup on the NFS server on its reboot if at least
>>>> one client mount is established. I am using OpenSUSE 12.3 with the
>>>> nfs-rdma kernel from Anna Schumaker
>>>> (git://git.linux-nfs.org/projects/anna/nfs-rdma.git).
>>>>
>>>> The export on the server side is done with
>>>> /data	*(fsid=0,crossmnt,rw,mp,no_root_squash,sync,no_subtree_check,insecure)
>>>>
>>>> Following command is used for mounting the NFSv4 share:
>>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.19:/ /mnt
>>>>
>>>> The HCA is a Mellanox MT4099 on the server and the client.
>>>>
>>>> The soft lockup can be reproduced by following steps:
>>>>   o server: Start the nfs server
>>>>   o client: Mount the share
>>>>   o client: Do a "ls" in the mounted directory
>>>>   o server: Stop the nfs server
>>>>   o server: Unload the nfs and mlx4 modules or reboot the server (I used
>>>> the openibd init script from the Mellanox driver without having the
>>>> Mellanox stack installed)
>>>>
>>>> The server reports a soft lockup
>>>>   BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:6146]
>>>> most times.
>>>>
>>>> Sometimes I get following kernel panic
>>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
>>>> IP: [<ffffffff815a5c35>] _raw_spin_lock_bh+0x15/0x40
>>>> PGD 82a820067 PUD 857832067 PMD 0
>>>> Oops: 0002 [#1] SMP
>>>> Modules linked in: nfsd nfs_acl auth_rpcgss oid_registry nfnetlink_log
>>>> nfnetlink bluetooth rfkill nfsv4 svcrdma dm_mod cpuid nfs fscache lockd
>>>> sunrpc af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm iw_cm
>>>> ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib(-) ib_sa ib_mad ib_core
>>>> ib_addr sr_mod cdrom usb_storage joydev mlx4_core usbhid
>>>> x86_pkg_temp_thermal coretemp kvm_intel kvm ghash_clmulni_intel
>>>> aesni_intel ablk_helper cryptd iTCO_wdt lrw igb gf128mul
>>>> iTCO_vendor_support ehci_pci glue_helper pcspkr i2c_algo_bit isci
>>>> ehci_hcd aes_x86_64 ptp libsas ioatdma lpc_ich microcode sb_edac sg
>>>> pps_core usbcore ipmi_si tpm_tis edac_core scsi_transport_sas i2c_i801
>>>> mfd_core dca usb_common tpm ipmi_msghandler wmi acpi_cpufreq button edd
>>>> autofs4 xfs libcrc32c crc32c_intel processor thermal_sys scsi_dh_rdac
>>>> scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh [last unloaded: oid_registry]
>>>> CPU: 0 PID: 6603 Comm: modprobe Not tainted 3.15.0-rc2-anna-nfs-rdma+ #3
>>>> Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
>>>> task: ffff88105b8c6050 ti: ffff88105d814000 task.ti: ffff88105d814000
>>>> RIP: 0010:[<ffffffff815a5c35>]  [<ffffffff815a5c35>]
>>>> _raw_spin_lock_bh+0x15/0x40
>>>> RSP: 0018:ffff88105d815d18  EFLAGS: 00010286
>>>> RAX: 0000000000010000 RBX: ffffffffffffffff RCX: 0000000000000000
>>>> RDX: 000000000000000b RSI: 0000000000000000 RDI: 0000000000000003
>>>> RBP: ffff88105d815d18 R08: ffff88087c611f38 R09: 0000000000000001
>>>> R10: 0000000000000000 R11: 0000000000000000 R12: ffff88087c3c9800
>>>> R13: ffff88107b82ab00 R14: 0000000000000003 R15: 0000000000000007
>>>> FS:  00007fef64612700(0000) GS:ffff88087fc00000(0000) knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 0000000000000003 CR3: 000000087c2c7000 CR4: 00000000000407f0
>>>> Stack:
>>>> ffff88105d815d58 ffffffffa05199f0 ffff88105d815d88 ffff88087c3c9800
>>>> ffff88087c3c9400 ffff88107b82ab00 ffff88087c3c9660 ffff88087c3c95c8
>>>> ffff88105d815d78 ffffffffa0421ce9 ffff88087c3c9400 ffff88107b82aac0
>>>> Call Trace:
>>>> [<ffffffffa05199f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
>>>> [<ffffffffa0421ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
>>>> [<ffffffffa039d086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
>>>> [<ffffffffa01dca86>] ib_unregister_device+0x46/0x120 [ib_core]
>>>> [<ffffffffa032ddc9>] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
>>>> [<ffffffffa02fb9d0>] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
>>>> [<ffffffffa02fba2b>] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
>>>> [<ffffffffa033f4cc>] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
>>>> [<ffffffff810bd6b2>] SyS_delete_module+0x152/0x220
>>>> [<ffffffff811496e4>] ? vm_munmap+0x54/0x70
>>>> [<ffffffff815adca6>] system_call_fastpath+0x1a/0x1f
>>>> Code: 5d c3 0f b7 17 66 39 ca 74 f6 f3 90 0f b7 17 66 39 d1 75 f6 5d c3
>>>> 55 65 81 04 25 20 b9 00 00 00 02 00 00 48 89 e5 b8 00 00 01 00 <f0> 0f
>>>> c1 07 89 c2 c1 ea 10 66 39 c2 75 04 5d c3 f3 90 0f b7 07
>>>> RIP  [<ffffffff815a5c35>] _raw_spin_lock_bh+0x15/0x40
>>>> RSP <ffff88105d815d18>
>>>> CR2: 0000000000000003
>>>> ---[ end trace 18e02ff413ac4b9b ]---
>>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>> Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range:
>>>> 0xffffffff80000000-0xffffffff9fffffff)
>>>> ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>>>>
>>>> Kind regards,
>>>> Klemens
>>>>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com

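Both oops traces quoted above (the NULL pointer dereference and the paging
request at 0000000000004dec) fault in _raw_spin_lock_bh and pass through
svc_xprt_enqueue, rdma_cma_handler and cma_remove_one. To compare traces
from different kernels more easily, the function names can be pulled out of
a saved oops with a small helper like the one below. It is a sketch, not an
existing tool: the function name extract_frames is made up for illustration,
and it assumes the "[<address>] function+0xoff/0xsize [module]" frame layout
seen in this thread.

```bash
#!/usr/bin/env bash
# Print the function names of the reliable frames in an oops call trace.
# Reads the oops text on stdin; frames marked "?" are skipped.
# extract_frames is a hypothetical helper name, not an existing tool.

extract_frames() {
    awk '
        /Call Trace:/ { in_trace = 1; next }
        in_trace && match($0, /\[<[0-9a-f]+>\] /) {
            frame = substr($0, RSTART + RLENGTH)
            if (frame ~ /^\? /) next        # unreliable frame
            sub(/\+0x.*/, "", frame)        # drop offset/size and module
            print frame
            next
        }
        in_trace { in_trace = 0 }           # first non-frame line ends the trace
    '
}

# Example input: a few frames copied from the trace quoted above.
sample='Call Trace:
 [<ffffffffa03bd9f0>] svc_xprt_enqueue+0x50/0x220 [sunrpc]
 [<ffffffffa0441ce9>] rdma_cma_handler+0x69/0x180 [svcrdma]
 [<ffffffffa031a086>] cma_remove_one+0x1f6/0x220 [rdma_cm]
 [<ffffffff811497f4>] ? vm_munmap+0x54/0x70
Code: 5d c3 0f b7 17'

printf '%s\n' "$sample" | extract_frames
```

Saving each kernel's oops to a file and piping it through extract_frames
gives one function name per line, which can then be compared with diff.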


end of thread, other threads:[~2014-05-19 21:02 UTC | newest]

Thread overview: 5+ messages
2014-05-08 14:28 Soft lockup in unloading kernel modules Senn Klemens
2014-05-08 15:59 ` Anna Schumaker
2014-05-13 16:48   ` Klemens Senn
2014-05-19 17:51     ` Chuck Lever
2014-05-19 21:02       ` Shirley Ma
