From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
To: Spelic <spelic-9AbUPqfR1/2XDw4h08c5KA@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: NFS-RDMA hangs: connection closed (-103)
Date: Wed, 01 Dec 2010 17:25:43 -0600
Message-ID: <4CF6D977.6000402@opengridcomputing.com>
In-Reply-To: <4CF6D69B.4030501-9AbUPqfR1/2XDw4h08c5KA@public.gmane.org>
Hi Spelic,
Can you reproduce this with an nfsv3 mount?
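A minimal sketch of what the NFSv3 test could look like in fstab, assuming the ramdisk is exported by its real path /mnt/ram (that export path is a guess; your nfs4 line mounts the pseudo-root "/") and keeping the same server address, mount point, and RDMA port as in your entry:

```
# Hypothetical NFSv3-over-RDMA variant of the nfs4 fstab entry quoted below.
# The export path /mnt/ram is an assumption; adjust it to match /etc/exports.
10.100.0.220:/mnt/ram /mnt/nfsram nfs _netdev,auto,defaults,vers=3,rdma,port=20049 0 0
```

If the hang also shows up with vers=3, that would suggest the problem sits in the RDMA transport rather than in the NFSv4 code.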
On 12/1/10 5:13 PM, Spelic wrote:
> Hello all
>
> First of all: I have tried to send this message to the list at least 3
> times but it doesn't seem to get through (and I'm given no error back).
> It was very long with 2 attachments... is it because of that? What are
> the limits of this ML?
> This time I will shorten it a bit and remove the attachments.
>
> Here is my problem:
> I am trying to use NFS over RDMA. It doesn't work: it hangs very soon.
> I tried kernel 2.6.32 from Ubuntu 10.04, and then the most recent
> upstream 2.6.37-rc4 compiled from source. They behave basically the
> same regarding the NFS mount itself; the only difference is that 2.6.32
> hangs the complete operating system when NFS hangs, while 2.6.37-rc4
> (after NFS hangs) only hangs processes that run sync or list NFS
> directories. Either way the mount stays hung forever; it does not
> recover by itself.
> IPoIB NFS mounts appear to work flawlessly; the problem is with RDMA only.
>
> Hardware: (identical client and server machines)
> 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
> HCA] (rev 20)
> Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
> Flags: bus master, fast devsel, latency 0, IRQ 30
> Memory at d8800000 (64-bit, non-prefetchable) [size=1M]
> Memory at d8000000 (64-bit, prefetchable) [size=8M]
> Capabilities: [40] Power Management version 2
> Capabilities: [48] Vital Product Data <?>
> Capabilities: [90] Message Signalled Interrupts: Mask- 64bit+
> Queue=0/5 Enable-
> Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32
> Capabilities: [60] Express Endpoint, MSI 00
> Kernel driver in use: ib_mthca
> Kernel modules: ib_mthca
>
> Mainboard = Supermicro X7DWT with embedded infiniband.
>
> This is my test:
> on server I make a big 14GB ramdisk (exact boot option:
> ramdisk_size=14680064), format xfs and mount like this:
> mkfs.xfs -f -l size=128m -d agcount=16 /dev/ram0
> mount -o nobarrier,inode64,logbufs=8,logbsize=256k /dev/ram0 /mnt/ram/
> On the client I mount like this (fstab):
> 10.100.0.220:/ /mnt/nfsram nfs4 _netdev,auto,defaults,rdma,port=20049 0 0
>
> Then on the client I perform
> echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo syncing now ; sync ; echo finished
>
> It hangs as soon as it reaches the end of the 14GB of space, but never
> writes "syncing now". It seems like the "disk full" message triggers the
> hangup reliably on NFS over RDMA over XFS over ramdisk; other
> combinations are not so reliable for triggering the bug (e.g. ext4).
>
> However please note that this is not an XFS problem in itself: we had
> another hangup on an ext4 filesystem on NFS on RDMA on real disks for
> real work after a few hours (and it hadn't hit the "disk full"
> situation); this technique with XFS on ramdisk is just more reliably
> reproducible.
>
> Note that the hangup does not happen on NFS over IPoIB (no RDMA) over
> XFS over ramdisk. It's really an RDMA-only bug.
> On the other machine (2.6.32) that was doing real work on real disks I
> am now mounting over IPoIB without RDMA and in fact that one is still
> running reliably.
>
> The dd process hangs like this: (/proc/pid/stack)
> [<ffffffff810f8f75>] sync_page+0x45/0x60
> [<ffffffff810f9143>] wait_on_page_bit+0x73/0x80
> [<ffffffff810f9590>] filemap_fdatawait_range+0x110/0x1a0
> [<ffffffff810f9720>] filemap_write_and_wait_range+0x70/0x80
> [<ffffffff811766ba>] vfs_fsync_range+0x5a/0xa0
> [<ffffffff8117676c>] vfs_fsync+0x1c/0x20
> [<ffffffffa02bda1d>] nfs_file_write+0xdd/0x1f0 [nfs]
> [<ffffffff8114d4fa>] do_sync_write+0xda/0x120
> [<ffffffff8114d808>] vfs_write+0xc8/0x190
> [<ffffffff8114e061>] sys_write+0x51/0x90
> [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The dd process cannot be killed with -9; it stays alive and hung.
>
> In the dmesg (client) you can see this line immediately, as soon as
> transfer stops (iostat -n 1) and dd hangs up:
> [ 3072.884988] rpcrdma: connection to 10.100.0.220:20049 closed (-103)
>
> after a while you can see this in dmesg
> [ 3242.890030] INFO: task dd:2140 blocked for more than 120 seconds.
> [ 3242.890132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3242.890239] dd D ffff88040a8f0398 0 2140 2113 0x00000000
> [ 3242.890243] ffff88040891fb38 0000000000000082 ffff88040891fa98 ffff88040891fa98
> [ 3242.890248] 00000000000139c0 ffff88040a8f0000 ffff88040a8f0398 ffff88040891ffd8
> [ 3242.890251] ffff88040a8f03a0 00000000000139c0 ffff88040891e010 00000000000139c0
> [ 3242.890255] Call Trace:
> [ 3242.890264] [<ffffffff81035509>] ? default_spin_lock_flags+0x9/0x10
> [ 3242.890269] [<ffffffff810f8f30>] ? sync_page+0x0/0x60
> [ 3242.890273] [<ffffffff8157b824>] io_schedule+0x44/0x60
> [ 3242.890276] [<ffffffff810f8f75>] sync_page+0x45/0x60
> [ 3242.890279] [<ffffffff8157c0bf>] __wait_on_bit+0x5f/0x90
> [ 3242.890281] [<ffffffff810f9143>] wait_on_page_bit+0x73/0x80
> [ 3242.890286] [<ffffffff81081bf0>] ? wake_bit_function+0x0/0x40
> [ 3242.890290] [<ffffffff81103ce5>] ? pagevec_lookup_tag+0x25/0x40
> [ 3242.890293] [<ffffffff810f9590>] filemap_fdatawait_range+0x110/0x1a0
> [ 3242.890296] [<ffffffff810f9720>] filemap_write_and_wait_range+0x70/0x80
> [ 3242.890301] [<ffffffff811766ba>] vfs_fsync_range+0x5a/0xa0
> [ 3242.890303] [<ffffffff8117676c>] vfs_fsync+0x1c/0x20
> [ 3242.890319] [<ffffffffa02bda1d>] nfs_file_write+0xdd/0x1f0 [nfs]
> [ 3242.890323] [<ffffffff8114d4fa>] do_sync_write+0xda/0x120
> [ 3242.890328] [<ffffffff812967c8>] ? apparmor_file_permission+0x18/0x20
> [ 3242.890333] [<ffffffff81263323>] ? security_file_permission+0x23/0x90
> [ 3242.890335] [<ffffffff8114d808>] vfs_write+0xc8/0x190
> [ 3242.890338] [<ffffffff8114e061>] sys_write+0x51/0x90
> [ 3242.890343] [<ffffffff8100c001>] ? system_call_after_swapgs+0x41/0x6c
> [ 3242.890346] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
> [ 3253.280020] nfs: server 10.100.0.220 not responding, still trying
>
>
> If I try an umount -f this happens:
> umount2: Device or resource busy
> umount.nfs4: /mnt/nfsram: device is busy
> umount2: Device or resource busy
> umount.nfs4: /mnt/nfsram: device is busy
> It stays mounted and the processes are still hung.
> However, if I repeat the umount -f three times, the third attempt
> does unmount. At that point the dd process is killed. However, it is still
> impossible to remount with RDMA, and lsmod shows:
> xprtrdma 41048 1
> i.e. the module is in use with 1 reference.
>
> Only a reboot of the client makes it possible to mount again at this point.
> Rebooting the server is not needed.
>
>
> The dmesg at server side is this:
>
> for each hang at client side (with about 1 minute delay every time):
> BEGIN
> [ 464.780047] WARNING: at net/sunrpc/xprtrdma/svc_rdma_transport.c:1162 __svc_rdma_free+0x20d/0x230 [svcrdma]()
> [ 464.780050] Hardware name: X7DWT
> [ 464.780051] Modules linked in: xfs binfmt_misc ppdev fbcon
> tileblit font bitblit softcursor nfs radeon nfsd lockd svcrdma exportfs
> ib_srp scsi_transport_srp fscache scsi_tgt rdma_ucm crc32c ib_ipoib
> ib_iser nfs_acl auth_rpcgss rdma_cm ttm ib_cm iw_cm ib_sa ib_addr
> mlx4_ib drm_kms_helper iscsi_tcp libiscsi_tcp drm libiscsi mlx4_core
> scsi_transport_iscsi ib_mthca sunrpc ib_uverbs e1000e ib_umad ib_mad
> ioatdma ib_core joydev i5400_edac psmouse edac_core usbhid i2c_algo_bit
> lp shpchp dca hid parport i5k_amb serio_raw [last unloaded: xprtrdma]
> [ 464.780094] Pid: 11, comm: kworker/0:1 Not tainted 2.6.37-rc4-stserver-mykernel3+ #1
> [ 464.780096] Call Trace:
> [ 464.780106] [<ffffffff8105feef>] warn_slowpath_common+0x7f/0xc0
> [ 464.780110] [<ffffffffa0220ff0>] ? __svc_rdma_free+0x0/0x230 [svcrdma]
> [ 464.780112] [<ffffffff8105ff4a>] warn_slowpath_null+0x1a/0x20
> [ 464.780116] [<ffffffffa02211fd>] __svc_rdma_free+0x20d/0x230 [svcrdma]
> [ 464.780119] [<ffffffffa0220ff0>] ? __svc_rdma_free+0x0/0x230 [svcrdma]
> [ 464.780123] [<ffffffff8107aa05>] process_one_work+0x125/0x440
> [ 464.780126] [<ffffffff8107d1d0>] worker_thread+0x170/0x410
> [ 464.780129] [<ffffffff8107d060>] ? worker_thread+0x0/0x410
> [ 464.780132] [<ffffffff81081656>] kthread+0x96/0xa0
> [ 464.780137] [<ffffffff8100ce64>] kernel_thread_helper+0x4/0x10
> [ 464.780139] [<ffffffff810815c0>] ? kthread+0x0/0xa0
> [ 464.780142] [<ffffffff8100ce60>] ? kernel_thread_helper+0x0/0x10
> [ 464.780144] ---[ end trace e31416a7a1dc2103 ]---
> END
>
> If at this point I restart nfs-kernel-server, this is the dmesg at
> serverside:
> [ 1119.778579] nfsd: last server has exited, flushing export cache
> [ 1120.930444] svc: failed to register lockdv1 RPC service (errno 97).
> [ 1120.930490] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> [ 1120.930512] NFSD: starting 90-second grace period
> I'm not sure if the svc line is meaningful...
>
> I have the dmesgs of the client and server since boot if you are
> interested, lsmod, and other stuff. I have removed them to see if the
> message gets through to the list now...
>
> Thanks for your help
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html