All of lore.kernel.org
 help / color / mirror / Atom feed
* v3 client processes hung in rpc_wait_bit_killable
@ 2013-01-31 22:58 Nathan Dauchy - NOAA Affiliate
  2013-02-07 22:55 ` Nathan Dauchy
  0 siblings, 1 reply; 2+ messages in thread
From: Nathan Dauchy - NOAA Affiliate @ 2013-01-31 22:58 UTC (permalink / raw)
  To: linux-nfs

Greetings,

On our Linux HPC cluster, one or more of the client nodes occasionally
has a problem with all access to an NFS mount hanging.  Can anyone
here point out a possible cause of the symptoms below or offer
suggestions for other tools or debugging steps I should try?  Any new
ideas or hints are appreciated!

The basic symptom is that all processes which access the NFS file
system hang in "D" state.  Dumping the call trace shows that they are
stuck in rpc_wait_bit_killable().  The processes can indeed be
terminated with "kill -9".  If I kill all hung processes, I can umount
the file system and remount it and then access works again.

We can run under "normal" user load for up to 3 days before 1-5 client
nodes experience a similar problem.   When it happens, the 100's of
other clients show no sign of a problem with their NFS mounts.  There
is no pattern to exactly which client will hang, though one complex
user workflow seems to trigger it more than others.

Here is an example of the hung processes:

# ps -eo pid,stat,args,wchan | grep D
   PID STAT COMMAND                     WCHAN
  2192 Ds   -bash                       rpc_wait_bit_killable
  2408 Ds   bash -c /home/ndauchy/bin/f rpc_wait_bit_killable
 15926 D    df -P                       rpc_wait_bit_killable
 36857 S+   grep D                      pipe_wait

By running "df" on each NFS mount one by one I was able to narrow down
which mount was hung to this one:

# mount | grep home
10.178.17.21:/vol/home on /misc/whome type nfs
(rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.178.17.21,mountvers=3,mountport=4046,mountproto=udp,local_lock=all,addr=10.178.17.21)

There is no TCP connection to that server that I can see:

# netstat -atu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address    State
tcp        0      0 *:ssh                       *:*               LISTEN
tcp        0      0 s320-ib0:panasas            *:*               LISTEN
tcp        0      0 s320:panasas                *:*               LISTEN
tcp        0      0 *:15002                     *:*               LISTEN
tcp        0      0 *:15003                     *:*               LISTEN
tcp        0      0 localhost.localdomain:15005 *:*               LISTEN
tcp        0      0 s320-ib0:45637              panmgmt5:10622    ESTABLISHED
tcp        0      0 s320:ssh                    fe3:59631         ESTABLISHED
tcp        0      0 s320:849                    wms1314-nfs:nfs   ESTABLISHED
tcp        0      0 s320:ssh                    fe5:47363         ESTABLISHED
tcp        0      0 *:ssh                       *:*               LISTEN
udp        0      0 s320-ib0:ntp                *:*
udp        0      0 s320:ntp                    *:*
udp        0      0 localhost.localdomain:ntp   *:*
udp        0      0 *:ntp                       *:*
udp        0      0 *:52539                     *:*
udp        0      0 fe80::21e:6703:3c:f00d:ntp  *:*
udp        0      0 fe80::21e:67ff:fe3c:f008:ntp *:*
udp        0      0 ::1:ntp                     *:*
udp        0      0 *:ntp                       *:*


Here are the call traces:

# echo w > /proc/sysrq-trigger

SysRq : Show Blocked State
  task                        PC stack   pid father
bash          D 0000000000000000     0  2192      1 0x00000004
 ffff88042db47b28 0000000000000082 0000000000000000 ffff880400000000
 ffff88042db47ad8 ffff88042db47dd8 ffff88042db47b98 ffff88042db47b88
 ffff880121f7d098 ffff88042db47fd8 000000000000fb88 ffff880121f7d098
Call Trace:
 [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
 [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
 [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
 [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
 [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
 [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
 [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
 [<ffffffffa00c4707>] nfs3_proc_getattr+0x47/0x90 [nfs]
 [<ffffffffa00b23b3>] __nfs_revalidate_inode+0xe3/0x220 [nfs]
 [<ffffffffa00b2696>] nfs_revalidate_inode+0x36/0x60 [nfs]
 [<ffffffffa00b2776>] nfs_getattr+0x66/0x120 [nfs]
 [<ffffffff81180571>] vfs_getattr+0x51/0x80
 [<ffffffff81180600>] vfs_fstatat+0x60/0x80
 [<ffffffff8118074b>] vfs_stat+0x1b/0x20
 [<ffffffff81180774>] sys_newstat+0x24/0x50
 [<ffffffff8117cba5>] ? fput+0x25/0x30
 [<ffffffff811785cd>] ? filp_close+0x5d/0x90
 [<ffffffff811786a5>] ? sys_close+0xa5/0x100
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
bash          D 0000000000000000     0  2408      1 0x00000000
 ffff8802f1f4d898 0000000000000086 0000000000000082 0000000000000246
 ffff8802f1f4d838 ffff8802f1f4d898 ffff880028043718 0000000300000001
 ffff88043418b098 ffff8802f1f4dfd8 000000000000fb88 ffff88043418b098
Call Trace:
 [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
 [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
 [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
 [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
 [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
 [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
 [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
 [<ffffffffa00c3d4c>] nfs3_proc_access+0xbc/0x180 [nfs]
 [<ffffffffa00ab168>] nfs_do_access+0x198/0x420 [nfs]
 [<ffffffffa000e765>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
 [<ffffffffa000d770>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
 [<ffffffffa00ab498>] nfs_permission+0xa8/0x1e0 [nfs]
 [<ffffffff8118995d>] __link_path_walk+0xad/0x1030
 [<ffffffff8118ab6a>] path_walk+0x6a/0xe0
 [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0
 [<ffffffff8117c780>] ? get_empty_filp+0xa0/0x180
 [<ffffffff8118bc6b>] do_filp_open+0xfb/0xd60
 [<ffffffffa00af26a>] ? nfs_file_read+0xca/0x130 [nfs]
 [<ffffffff81213426>] ? security_prepare_creds+0x16/0x20
 [<ffffffff810996ef>] ? prepare_creds+0x8f/0xb0
 [<ffffffff81183943>] open_exec+0x33/0x100
 [<ffffffff81183aff>] do_execve+0xef/0x340
 [<ffffffff8127ea8a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b54a>] stub_execve+0x6a/0xc0
df            D 0000000000000000     0 15926  15915 0x00000000
 ffff8801898f5ae8 0000000000000082 ffff8801898f5a48 ffffffff81116967
 ffff8801898f5a68 ffffffff81116a25 ffff88032d29cdc0 ffffffffa00207a0
 ffff88042be765f8 ffff8801898f5fd8 000000000000fb88 ffff88042be765f8
Call Trace:
 [<ffffffff81116967>] ? mempool_free_slab+0x17/0x20
 [<ffffffff81116a25>] ? mempool_free+0x95/0xa0
 [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
 [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
 [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
 [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
 [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
 [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
 [<ffffffff8118a018>] ? __link_path_walk+0x768/0x1030
 [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
 [<ffffffffa00c4797>] nfs3_proc_statfs+0x47/0x90 [nfs]
 [<ffffffffa00b4cc6>] nfs_statfs+0x66/0x1a0 [nfs]
 [<ffffffff811abe64>] statfs_by_dentry+0x74/0xa0
 [<ffffffff811abf9b>] vfs_statfs+0x1b/0xb0
 [<ffffffff811ac260>] do_statfs_native+0x20/0xb0
 [<ffffffff811ac3f2>] sys_statfs+0x72/0xb0
 [<ffffffff81500a7e>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


When I tried "echo 1 > /proc/sys/sunrpc/rpc_debug" I got MANY messages
like the following, so quickly set it back to 0 again.  I don't know
how to tell whether the RPCs are for the hung mountpoint or a
different one on the client, but did notice that the "xid" numbers are
increasing, but the "req" number does not change.

Jan 31 01:13:11 s320 kernel: RPC: 55738 xprt_prepare_transmit
Jan 31 01:13:13 s320 kernel: RPC: 55743 reserved req ffff88042be25000
xid 4041e127
Jan 31 01:13:13 s320 kernel: RPC: 55743 xprt_prepare_transmit
Jan 31 01:13:13 s320 kernel: RPC: 55743 xprt_transmit(112)
Jan 31 01:13:13 s320 kernel: RPC: 55743 xmit complete
Jan 31 01:13:13 s320 kernel: RPC: 55743 xid 4041e127 complete (112
bytes received)
Jan 31 01:13:13 s320 kernel: RPC: 55743 release request ffff88042be25000
Jan 31 01:13:13 s320 kernel: RPC: 55744 reserved req ffff88042be25000
xid 4141e127
Jan 31 01:13:13 s320 kernel: RPC: 55744 xprt_prepare_transmit
Jan 31 01:13:13 s320 kernel: RPC: 55744 xprt_transmit(112)
Jan 31 01:13:13 s320 kernel: RPC: 55744 xmit complete
Jan 31 01:13:13 s320 kernel: RPC: 55744 xid 4141e127 complete (112
bytes received)
Jan 31 01:13:13 s320 kernel: RPC: 55744 release request ffff88042be25000
Jan 31 01:13:14 s320 kernel: RPC: 55745 reserved req ffff88042be25000
xid 4241e127
Jan 31 01:13:14 s320 kernel: RPC: 55745 xprt_prepare_transmit
Jan 31 01:13:14 s320 kernel: RPC: 55745 xprt_transmit(116)
Jan 31 01:13:14 s320 kernel: RPC: 55745 xmit complete
Jan 31 01:13:14 s320 kernel: RPC: 55745 xid 4241e127 complete (120
bytes received)
Jan 31 01:13:14 s320 kernel: RPC: 55745 release request ffff88042be25000
Jan 31 01:13:14 s320 kernel: RPC: 55746 reserved req ffff88042be25000
xid 4341e127
Jan 31 01:13:14 s320 kernel: RPC: 55746 xprt_prepare_transmit
Jan 31 01:13:14 s320 kernel: RPC: 55746 xprt_transmit(116)
Jan 31 01:13:14 s320 kernel: RPC: 55746 xmit complete
Jan 31 01:13:14 s320 kernel: RPC: 55746 xid 4341e127 complete (120
bytes received)
Jan 31 01:13:14 s320 kernel: RPC: 55746 release request ffff88042be25000
Jan 31 01:13:14 s320 kernel: RPC: 55747 reserved req ffff88042be25000
xid 4441e127


The clients are running CentOS-6.3, and we have tried both kernel
2.6.32-279.2.1.el6 and 2.6.32-279.19.1.el6, and are using
nfs-utils-1.2.3-26.el6.  We never saw the problem on the same hardware
when running CentOS-5.5 and kernel 2.6.18-194.11.4.el5.

The server in this example is a NetApp FAS3140, but mounts to other
servers (primarily running CentOS-5.7) have exhibited similar
problems.

Running "tcpdump" shows no NFS traffic to the server when running a
new command which should access the NFS mount.

We do not have a firewall in the way, and I don't believe that any
other routing problems are blocking network traffic since "ping" and
"showmount -e" from the client to the server respond normally.

Trying to create a new mount to any mountpoint on the same server
hangs as well.  The call trace shows it in __d_instantiate(), called
from rpc_wait_bit_killable().  Creating a new mount to a different
server works fine.

And finally, here are even more details that may be unique to our
configuration, and might confuse the issue, but I wanted to list them
here just in case they are relevant:

* These nodes are fully diskless, mounting their root file system via
NFS, and have no swap.

* We see similar hangs of processes accessing the root file system, as
indicated by the call trace generated by alt-sysrq-w from the console.
Unfortunately without a working root file system I can't login to the
nodes to get more debugging info, so have decided to concentrate on
the /home mount described above.

* Our site has 5 clusters all running the exact same (diskless) OS,
ranging from 250 to 850 nodes.  The problem is triggered most
frequently on a system with 8 cores/node + 24 GB memory, and very
rarely on one with 16 cores/node + 32 GB memory.  So it may have
something to do with memory pressure or with number of nodes used by a
user's application at a fixed core count.  Of course I can't rule out
differences in client NIC or cluster network capabilities impacting
frequency either.


Thanks for your time reading all this and any help you can offer!

-Nathan

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: v3 client processes hung in rpc_wait_bit_killable
  2013-01-31 22:58 v3 client processes hung in rpc_wait_bit_killable Nathan Dauchy - NOAA Affiliate
@ 2013-02-07 22:55 ` Nathan Dauchy
  0 siblings, 0 replies; 2+ messages in thread
From: Nathan Dauchy @ 2013-02-07 22:55 UTC (permalink / raw)
  To: linux-nfs

Update: we had another similar hang and this time I caught it fairly
quickly, so the "netstat -atu" output may be more useful.  See inline
below...

On 01/31/2013 03:58 PM, Nathan Dauchy - NOAA Affiliate wrote:
> Greetings,
> 
> On our Linux HPC cluster, one or more of the client nodes occasionally
> has a problem with all access to an NFS mount hanging.  Can anyone
> here point out a possible cause of the symptoms below or offer
> suggestions for other tools or debugging steps I should try?  Any new
> ideas or hints are appreciated!
> 
> The basic symptom is that all processes which access the NFS file
> system hang in "D" state.  Dumping the call trace shows that they are
> stuck in rpc_wait_bit_killable().  The processes can indeed be
> terminated with "kill -9".  If I kill all hung processes, I can umount
> the file system and remount it and then access works again.
> 
> We can run under "normal" user load for up to 3 days before 1-5 client
> nodes experience a similar problem.   When it happens, the 100's of
> other clients show no sign of a problem with their NFS mounts.  There
> is no pattern to exactly which client will hang, though one complex
> user workflow seems to trigger it more than others.
> 
> Here is an example of the hung processes:
> 
> # ps -eo pid,stat,args,wchan | grep D
>    PID STAT COMMAND                     WCHAN
>   2192 Ds   -bash                       rpc_wait_bit_killable
>   2408 Ds   bash -c /home/ndauchy/bin/f rpc_wait_bit_killable
>  15926 D    df -P                       rpc_wait_bit_killable
>  36857 S+   grep D                      pipe_wait
> 
> By running "df" on each NFS mount one by one I was able to narrow down
> which mount was hung to this one:
> 
> # mount | grep home
> 10.178.17.21:/vol/home on /misc/whome type nfs
> (rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.178.17.21,mountvers=3,mountport=4046,mountproto=udp,local_lock=all,addr=10.178.17.21)
> 
> There is no TCP connection to that server that I can see:
> 
> # netstat -atu
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address               Foreign Address    State
> tcp        0      0 *:ssh                       *:*               LISTEN
> tcp        0      0 s320-ib0:panasas            *:*               LISTEN
> tcp        0      0 s320:panasas                *:*               LISTEN
> tcp        0      0 *:15002                     *:*               LISTEN
> tcp        0      0 *:15003                     *:*               LISTEN
> tcp        0      0 localhost.localdomain:15005 *:*               LISTEN
> tcp        0      0 s320-ib0:45637              panmgmt5:10622    ESTABLISHED
> tcp        0      0 s320:ssh                    fe3:59631         ESTABLISHED
> tcp        0      0 s320:849                    wms1314-nfs:nfs   ESTABLISHED
> tcp        0      0 s320:ssh                    fe5:47363         ESTABLISHED
> tcp        0      0 *:ssh                       *:*               LISTEN
> udp        0      0 s320-ib0:ntp                *:*
> udp        0      0 s320:ntp                    *:*
> udp        0      0 localhost.localdomain:ntp   *:*
> udp        0      0 *:ntp                       *:*
> udp        0      0 *:52539                     *:*
> udp        0      0 fe80::21e:6703:3c:f00d:ntp  *:*
> udp        0      0 fe80::21e:67ff:fe3c:f008:ntp *:*
> udp        0      0 ::1:ntp                     *:*
> udp        0      0 *:ntp                       *:*

This time there is still no connection to the NFS server, but there are
many conections in CLOSE_WAIT and FIN_WAIT.:

# netstat -atu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address
    State
tcp        0      0 *:ssh                       *:*
    LISTEN
tcp        0      0 n109-ib0:panasas            *:*
    LISTEN
tcp        0      0 n109:panasas                *:*
    LISTEN
tcp        0      0 *:15002                     *:*
    LISTEN
tcp        0      0 *:15003                     *:*
    LISTEN
tcp        0      0 n109:15003                  n98:975
    SYN_RECV
tcp        0      0 n109:15003                  n12:nnsp
    SYN_RECV
tcp        0      0 n109:15003                  n7:timbuktu
    SYN_RECV
tcp        0      0 n109:15003                  n56:netsc-prod
    SYN_RECV
tcp        0      0 n109:15003                  n58:at-echo
    SYN_RECV
tcp        0      0 n109:15003                  n116:815
    SYN_RECV
tcp        0      0 n109:15003                  n73:1001
    SYN_RECV
tcp        0      0 n109:15003                  n5:768
    SYN_RECV
tcp        0      0 n109:15003                  n83:276
    SYN_RECV
tcp        0      0 n109:15003                  n10:867
    SYN_RECV
tcp        0      0 n109:15003                  n23:csi-sgwp
    SYN_RECV
tcp        0      0 n109:15003                  n48:ticf-1
    SYN_RECV
tcp        0      0 localhost.localdomain:15005 *:*
    LISTEN
tcp        0     86 n109:932                    n131:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n99:nest-protocol
    CLOSE_WAIT
tcp       79      0 n109:15003                  n9:macon-tcp
    CLOSE_WAIT
tcp       79      0 n109:15003                  n131:cvc_hostd
    CLOSE_WAIT
tcp        0     86 n109:331                    n11:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n75:963
    CLOSE_WAIT
tcp       79      0 n109:15003                  n98:openport
    CLOSE_WAIT
tcp       79      0 n109:15003                  n9:ipcserver
    CLOSE_WAIT
tcp        1      0 n109:15003                  n93:979
    CLOSE_WAIT
tcp       79      0 n109:15003                  n126:820
    CLOSE_WAIT
tcp       79      0 n109:15003                  n121:at-7
    CLOSE_WAIT
tcp       79      0 n109:15003                  n45:sift-uft
    CLOSE_WAIT
tcp       79      0 n109:15003                  n121:scc-security
    CLOSE_WAIT
tcp       81      0 n109:15003                  n30:229
    CLOSE_WAIT
tcp        0     86 n109:utmpcd                 n16:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n95:241
    CLOSE_WAIT
tcp        0     89 n109:zserv                  n145:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n116:878
    CLOSE_WAIT
tcp       79      0 n109:15003                  n93:vmnet
    CLOSE_WAIT
tcp       79      0 n109:15003                  n10:960
    CLOSE_WAIT
tcp       79      0 n109:15003                  n140:sntp-heartbeat
    CLOSE_WAIT
tcp       79      0 n109:15003                  n45:nas
    CLOSE_WAIT
tcp        0     86 n109:rmc                    n55:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n103:ddm-rdb
    CLOSE_WAIT
tcp       79      0 n109:15003                  n77:philips-vc
    CLOSE_WAIT
tcp       79      0 n109:15003                  n85:883
    CLOSE_WAIT
tcp       81      0 n109:15003                  n52:con
    CLOSE_WAIT
tcp        0     86 n109:vacdsm-sws             n36:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n73:iiop
    CLOSE_WAIT
tcp        0     88 n109:946                    n5:15003
    FIN_WAIT1
tcp        0     86 n109:sgcp                   n108:15003
    FIN_WAIT1
tcp       80      0 n109:15003                  n145:iris-lwz
    CLOSE_WAIT
tcp       79      0 n109:15003                  n66:885
    CLOSE_WAIT
tcp       81      0 n109:15003                  n7:pawserv
    CLOSE_WAIT
tcp       80      0 n109:15003                  n53:opalis-rdv
    CLOSE_WAIT
tcp       81      0 n109:15003                  n115:hap
    CLOSE_WAIT
tcp       79      0 n109:15003                  n65:remotefs
    CLOSE_WAIT
tcp       81      0 n109:15003                  n118:914c/g
    CLOSE_WAIT
tcp       79      0 n109:15003                  n11:sgcp
    CLOSE_WAIT
tcp       79      0 n109:15003                  n131:mftp
    CLOSE_WAIT
tcp       79      0 n109:15003                  n65:3com-amp3
    CLOSE_WAIT
tcp       79      0 n109:15003                  n101:vatp
    CLOSE_WAIT
tcp       79      0 n109:15003                  n66:mftp
    CLOSE_WAIT
tcp       81      0 n109:15003                  n7:708
    CLOSE_WAIT
tcp       81      0 n109:15003                  n115:iso-ill
    CLOSE_WAIT
tcp       80      0 n109:15003                  n47:864
    CLOSE_WAIT
tcp       79      0 n109:15003                  n41:uma
    CLOSE_WAIT
tcp       79      0 n109:15003                  n6:vnas
    CLOSE_WAIT
tcp       79      0 n109:15003                  n49:exp1
    CLOSE_WAIT
tcp        0      0 n109-ib0:56970              panmgmt5:panasas
    TIME_WAIT
tcp       79      0 n109:15003                  n36:732
    CLOSE_WAIT
tcp        0      0 n109:919                    n19:15003
    FIN_WAIT2
tcp       81      0 n109:15003                  n5:corerjd
    CLOSE_WAIT
tcp       80      0 n109:15003                  n53:at-5
    CLOSE_WAIT
tcp       79      0 n109:15003                  n83:namp
    CLOSE_WAIT
tcp        1      0 n109:ssh                    n145:36357
    CLOSE_WAIT
tcp       79      0 n109:15003                  n127:917
    CLOSE_WAIT
tcp       79      0 n109:15003                  n11:vpps-qua
    CLOSE_WAIT
tcp       79      0 n109:15003                  n8:rlzdbase
    CLOSE_WAIT
tcp       81      0 n109:15003                  n118:zserv
    CLOSE_WAIT
tcp       79      0 n109:15003                  n49:telnets
    CLOSE_WAIT
tcp       81      0 n109:15003                  n52:cadlock
    CLOSE_WAIT
tcp       79      0 n109:15003                  n41:texar
    CLOSE_WAIT
tcp       79      0 n109:15003                  n58:937
    CLOSE_WAIT
tcp        0     88 n109:hassle                 n43:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n47:entrust-kmsh
    CLOSE_WAIT
tcp        0      1 n109:prm-nm                 n93:15003
    SYN_SENT
tcp        0      0 n109:1017                   wms56-nfs:nfs
    ESTABLISHED
tcp       79      0 n109:15003                  n23:asip-webadmin
    CLOSE_WAIT
tcp       79      0 n109:15003                  n99:879
    CLOSE_WAIT
tcp       79      0 n109:15003                  n101:jargon
    CLOSE_WAIT
tcp       79      0 n109:15003                  n62:snare
    CLOSE_WAIT
tcp        0      0 n109:ssh                    fe5:42381
    ESTABLISHED
tcp       79      0 n109:15003                  n6:iris-xpc
    CLOSE_WAIT
tcp       79      0 n109:15003                  n16:931
    CLOSE_WAIT
tcp       79      0 n109:15003                  n103:urm
    CLOSE_WAIT
tcp       79      0 n109:15003                  n49:mftp
    CLOSE_WAIT
tcp       79      0 n109:15003                  n98:npmp-gui
    CLOSE_WAIT
tcp        0     86 n109:981                    n48:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n36:telnets
    CLOSE_WAIT
tcp       81      0 n109:15003                  n68:asa
    CLOSE_WAIT
tcp       79      0 n109:15003                  n11:npmp-gui
    CLOSE_WAIT
tcp       81      0 n109:15003                  n5:813
    CLOSE_WAIT
tcp       79      0 n109:15003                  n116:avian
    CLOSE_WAIT
tcp       79      0 n109:15003                  n127:new-rwho
    CLOSE_WAIT
tcp       79      0 n109:15003                  n9:ulistproc
    CLOSE_WAIT
tcp       79      0 n109:15003                  n75:k-block
    CLOSE_WAIT
tcp       79      0 n109:15003                  n26:819
    CLOSE_WAIT
tcp       79      0 n109:15003                  n37:vpps-qua
    CLOSE_WAIT
tcp       79      0 n109:15003                  n6:qrh
    CLOSE_WAIT
tcp        0     86 n109:bgmp                   n65:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n101:bnet
    CLOSE_WAIT
tcp       79      0 n109:15003                  n62:unify
    CLOSE_WAIT
tcp       79      0 n109:15003                  n140:microsoft-ds
    CLOSE_WAIT
tcp       80      0 n109:15003                  n145:ocs_cmu
    CLOSE_WAIT
tcp       79      0 n109:15003                  n81:decbsrv
    CLOSE_WAIT
tcp       81      0 n109:15003                  n5:ddm-dfm
    CLOSE_WAIT
tcp       79      0 n109:15003                  n81:decauth
    CLOSE_WAIT
tcp       79      0 n109:15003                  n26:1002
    CLOSE_WAIT
tcp       81      0 n109:15003                  n118:lanserver
    CLOSE_WAIT
tcp       81      0 n109:15003                  n118:mac-srvr-admin
    CLOSE_WAIT
tcp       79      0 n109:15003                  n8:852
    CLOSE_WAIT
tcp       79      0 n109:15003                  n41:hassle
    CLOSE_WAIT
tcp       79      0 n109:15003                  n37:743
    CLOSE_WAIT
tcp       80      0 n109:15003                  n145:olsr
    CLOSE_WAIT
tcp       81      0 n109:15003                  n120:bhfhs
    CLOSE_WAIT
tcp       79      0 n109:15003                  n101:pkix-timestamp
    CLOSE_WAIT
tcp       79      0 n109:15003                  n56:298
    CLOSE_WAIT
tcp       79      0 n109:15003                  n86:970
    CLOSE_WAIT
tcp       79      0 n109:15003                  n10:netconf-ssh
    CLOSE_WAIT
tcp       81      0 n109:15003                  n122:netrcs
    CLOSE_WAIT
tcp       79      0 n109:15003                  n95:34617
    CLOSE_WAIT
tcp       23      0 n109:15002                  jetbqs3:818
    CLOSE_WAIT
tcp       79      0 n109:15003                  n86:hp-collector
    CLOSE_WAIT
tcp       79      0 n109:15003                  n62:815
    CLOSE_WAIT
tcp       81      0 n109:15003                  n122:sanity
    CLOSE_WAIT
tcp       80      0 n109:15003                  n145:krb5_prop
    CLOSE_WAIT
tcp       81      0 n109:15003                  n7:passgo
    CLOSE_WAIT
tcp       79      0 n109:15003                  n81:839
    CLOSE_WAIT
tcp       79      0 n109:15003                  n16:939
    CLOSE_WAIT
tcp       81      0 n109:15003                  n5:338
    CLOSE_WAIT
tcp        0     86 n109:335                    n45:15003
    FIN_WAIT1
tcp       52      0 n109:15002                  jetbqs3:845
    ESTABLISHED
tcp       79      0 n109:15003                  n86:923
    CLOSE_WAIT
tcp        0      0 n109-ib0:58388              panmgmt4:panasas
    TIME_WAIT
tcp       81      0 n109:15003                  n7:urm
    CLOSE_WAIT
tcp       79      0 n109:15003                  n106:matip-type-a
    CLOSE_WAIT
tcp       79      0 n109:15003                  n48:928
    CLOSE_WAIT
tcp       79      0 n109:15003                  n56:utime
    CLOSE_WAIT
tcp       79      0 n109:15003                  n11:nest-protocol
    CLOSE_WAIT
tcp       79      0 n109:15003                  n65:imaps
    CLOSE_WAIT
tcp       79      0 n109:15003                  n23:725
    CLOSE_WAIT
tcp       79      0 n109:15003                  n127:datasurfsrvsec
    CLOSE_WAIT
tcp       79      0 n109:15003                  n37:quotad
    CLOSE_WAIT
tcp       79      0 n109:15003                  n55:hello-port
    CLOSE_WAIT
tcp       79      0 n109:15003                  n26:289
    CLOSE_WAIT
tcp       79      0 n109:15003                  n75:opalis-rdv
    CLOSE_WAIT
tcp        0      0 n109:45677                  10.178.123.66:10622
    ESTABLISHED
tcp       79      0 n109:15003                  n15:820
    CLOSE_WAIT
tcp       79      0 n109:15003                  n121:341
    CLOSE_WAIT
tcp       80      0 n109:15003                  n53:spsc
    CLOSE_WAIT
tcp        0     86 n109:hap                    n23:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n85:fln-spx
    CLOSE_WAIT
tcp       79      0 n109:15003                  n127:decvms-sysmgt
    CLOSE_WAIT
tcp       79      0 n109:15003                  n106:dn6-nlm-aud
    CLOSE_WAIT
tcp       79      0 n109:15003                  n10:sco-websrvrmg3
    CLOSE_WAIT
tcp       79      0 n109:15003                  n131:link
    CLOSE_WAIT
tcp       79      0 n109:15003                  n6:818
    CLOSE_WAIT
tcp       79      0 n109:15003                  n126:863
    CLOSE_WAIT
tcp       79      0 n109:15003                  n126:servstat
    CLOSE_WAIT
tcp        0     87 n109:commerce               n53:15003
    FIN_WAIT1
tcp       79      0 n109:15003                  n16:decauth
    CLOSE_WAIT
tcp       79      0 n109:15003                  n41:xact-backup
    CLOSE_WAIT
tcp        0      0 n109-ib0:60283              panmgmt6:panasas
    TIME_WAIT
tcp       79      0 n109:15003                  n85:spsc
    CLOSE_WAIT
tcp       79      0 n109:15003                  n75:banyan-rpc
    CLOSE_WAIT
tcp        0      0 *:ssh                       *:*
    LISTEN
udp        0      0 *:60495                     *:*

udp        0      0 *:863                       *:*

udp        0      0 n109-ib0:ntp                *:*

udp        0      0 n109:ntp                    *:*

udp        0      0 localhost.localdomain:ntp   *:*

udp        0      0 *:ntp                       *:*

udp        0      0 fe80::230:48ff:fec9:3b5e:ntp *:*

udp        0      0 fe80::230:48c9:78dc:1:ntp   *:*

udp        0      0 ::1:ntp                     *:*

udp        0      0 *:ntp                       *:*



> 
> Here are the call traces:
> 
> # echo w > /proc/sysrq-trigger
> 
> SysRq : Show Blocked State
>   task                        PC stack   pid father
> bash          D 0000000000000000     0  2192      1 0x00000004
>  ffff88042db47b28 0000000000000082 0000000000000000 ffff880400000000
>  ffff88042db47ad8 ffff88042db47dd8 ffff88042db47b98 ffff88042db47b88
>  ffff880121f7d098 ffff88042db47fd8 000000000000fb88 ffff880121f7d098
> Call Trace:
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
>  [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>  [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
>  [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
>  [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
>  [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
>  [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
>  [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
>  [<ffffffffa00c4707>] nfs3_proc_getattr+0x47/0x90 [nfs]
>  [<ffffffffa00b23b3>] __nfs_revalidate_inode+0xe3/0x220 [nfs]
>  [<ffffffffa00b2696>] nfs_revalidate_inode+0x36/0x60 [nfs]
>  [<ffffffffa00b2776>] nfs_getattr+0x66/0x120 [nfs]
>  [<ffffffff81180571>] vfs_getattr+0x51/0x80
>  [<ffffffff81180600>] vfs_fstatat+0x60/0x80
>  [<ffffffff8118074b>] vfs_stat+0x1b/0x20
>  [<ffffffff81180774>] sys_newstat+0x24/0x50
>  [<ffffffff8117cba5>] ? fput+0x25/0x30
>  [<ffffffff811785cd>] ? filp_close+0x5d/0x90
>  [<ffffffff811786a5>] ? sys_close+0xa5/0x100
>  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> bash          D 0000000000000000     0  2408      1 0x00000000
>  ffff8802f1f4d898 0000000000000086 0000000000000082 0000000000000246
>  ffff8802f1f4d838 ffff8802f1f4d898 ffff880028043718 0000000300000001
>  ffff88043418b098 ffff8802f1f4dfd8 000000000000fb88 ffff88043418b098
> Call Trace:
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
>  [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>  [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
>  [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
>  [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
>  [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
>  [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
>  [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
>  [<ffffffffa00c3d4c>] nfs3_proc_access+0xbc/0x180 [nfs]
>  [<ffffffffa00ab168>] nfs_do_access+0x198/0x420 [nfs]
>  [<ffffffffa000e765>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
>  [<ffffffffa000d770>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
>  [<ffffffffa00ab498>] nfs_permission+0xa8/0x1e0 [nfs]
>  [<ffffffff8118995d>] __link_path_walk+0xad/0x1030
>  [<ffffffff8118ab6a>] path_walk+0x6a/0xe0
>  [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0
>  [<ffffffff8117c780>] ? get_empty_filp+0xa0/0x180
>  [<ffffffff8118bc6b>] do_filp_open+0xfb/0xd60
>  [<ffffffffa00af26a>] ? nfs_file_read+0xca/0x130 [nfs]
>  [<ffffffff81213426>] ? security_prepare_creds+0x16/0x20
>  [<ffffffff810996ef>] ? prepare_creds+0x8f/0xb0
>  [<ffffffff81183943>] open_exec+0x33/0x100
>  [<ffffffff81183aff>] do_execve+0xef/0x340
>  [<ffffffff8127ea8a>] ? strncpy_from_user+0x4a/0x90
>  [<ffffffff810095ea>] sys_execve+0x4a/0x80
>  [<ffffffff8100b54a>] stub_execve+0x6a/0xc0
> df            D 0000000000000000     0 15926  15915 0x00000000
>  ffff8801898f5ae8 0000000000000082 ffff8801898f5a48 ffffffff81116967
>  ffff8801898f5a68 ffffffff81116a25 ffff88032d29cdc0 ffffffffa00207a0
>  ffff88042be765f8 ffff8801898f5fd8 000000000000fb88 ffff88042be765f8
> Call Trace:
>  [<ffffffff81116967>] ? mempool_free_slab+0x17/0x20
>  [<ffffffff81116a25>] ? mempool_free+0x95/0xa0
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
>  [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>  [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
>  [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
>  [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
>  [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
>  [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
>  [<ffffffff8118a018>] ? __link_path_walk+0x768/0x1030
>  [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
>  [<ffffffffa00c4797>] nfs3_proc_statfs+0x47/0x90 [nfs]
>  [<ffffffffa00b4cc6>] nfs_statfs+0x66/0x1a0 [nfs]
>  [<ffffffff811abe64>] statfs_by_dentry+0x74/0xa0
>  [<ffffffff811abf9b>] vfs_statfs+0x1b/0xb0
>  [<ffffffff811ac260>] do_statfs_native+0x20/0xb0
>  [<ffffffff811ac3f2>] sys_statfs+0x72/0xb0
>  [<ffffffff81500a7e>] ? do_device_not_available+0xe/0x10
>  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> 

New traces look fairly similar:

SysRq : Show Blocked State
  task                        PC stack   pid father
sshd          D 0000000000000000     0 23812   1786 0x00000004
 ffff88032c5af928 0000000000000082 0000000000000000 0000000000000003
 0000000000000001 0000000000000282 ffff88032c5af8d8 ffffffff81052223
 ffff88031b1cdab8 ffff88032c5affd8 000000000000fb88 ffff88031b1cdab8
Call Trace:
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffffa000b752>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
 [<ffffffff814ead9f>] __wait_on_bit+0x5f/0x90
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffff814eae48>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810909d0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
 [<ffffffffa000bc95>] __rpc_execute+0xf5/0x350 [sunrpc]
 [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa000bf51>] rpc_execute+0x61/0xa0 [sunrpc]
 [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
 [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
 [<ffffffffa00c129d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
 [<ffffffffa00c167c>] nfs3_proc_access+0xbc/0x180 [nfs]
 [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110
 [<ffffffffa00a91b9>] nfs_do_access+0x199/0x3c0 [nfs]
 [<ffffffffa000e575>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
 [<ffffffffa000d550>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
 [<ffffffffa00a9488>] nfs_permission+0xa8/0x1e0 [nfs]
 [<ffffffff8118497d>] __link_path_walk+0xad/0x1030
 [<ffffffff81185b8a>] path_walk+0x6a/0xe0
 [<ffffffff81185d5b>] do_path_lookup+0x5b/0xa0
 [<ffffffff81177a50>] ? get_empty_filp+0xa0/0x180
 [<ffffffff81186c8b>] do_filp_open+0xfb/0xd60
 [<ffffffff81193272>] ? alloc_fd+0x92/0x160
 [<ffffffff81173a39>] do_sys_open+0x69/0x140
 [<ffffffff81088a2a>] ? sys_setresuid+0x15a/0x180
 [<ffffffff81173b50>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
bash          D 0000000000000000     0 24454      1 0x00000000
 ffff8802e8163898 0000000000000082 0000000000000000 0000000000000282
 0000000000000082 0000000000000000 ffff8802e8163848 ffff8802e81638a8
 ffff8803306885f8 ffff8802e8163fd8 000000000000fb88 ffff8803306885f8
Call Trace:
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffffa000b752>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
 [<ffffffff814ead9f>] __wait_on_bit+0x5f/0x90
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffff814eae48>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810909d0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
 [<ffffffffa000bc95>] __rpc_execute+0xf5/0x350 [sunrpc]
 [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa000bf51>] rpc_execute+0x61/0xa0 [sunrpc]
 [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
 [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
 [<ffffffffa00c129d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
 [<ffffffffa00c167c>] nfs3_proc_access+0xbc/0x180 [nfs]
 [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110
 [<ffffffffa00a91b9>] nfs_do_access+0x199/0x3c0 [nfs]
 [<ffffffffa000e575>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
 [<ffffffffa000d550>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
 [<ffffffffa00a9488>] nfs_permission+0xa8/0x1e0 [nfs]
 [<ffffffff8118497d>] __link_path_walk+0xad/0x1030
 [<ffffffff81185b8a>] path_walk+0x6a/0xe0
 [<ffffffff81185d5b>] do_path_lookup+0x5b/0xa0
 [<ffffffff81177a50>] ? get_empty_filp+0xa0/0x180
 [<ffffffff81186c8b>] do_filp_open+0xfb/0xd60
 [<ffffffff8104338c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff8120cbe6>] ? security_prepare_creds+0x16/0x20
 [<ffffffff8117ea23>] open_exec+0x33/0x100
 [<ffffffff8117ebbf>] do_execve+0xcf/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0


Does any of that provide more clues?

Are there other diagnostic tools I should be using?

Anything else I should check before rebooting the client?


Thanks in advance,
Nathan


> 
> When I tried "echo 1 > /proc/sys/sunrpc/rpc_debug" I got MANY messages
> like the following, so quickly set it back to 0 again.  I don't know
> how to tell whether the RPCs are for the hung mountpoint or a
> different one on the client, but did notice that the "xid" numbers are
> increasing, but the "req" number does not change.
> 
> Jan 31 01:13:11 s320 kernel: RPC: 55738 xprt_prepare_transmit
> Jan 31 01:13:13 s320 kernel: RPC: 55743 reserved req ffff88042be25000
> xid 4041e127
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xprt_prepare_transmit
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xprt_transmit(112)
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xmit complete
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xid 4041e127 complete (112
> bytes received)
> Jan 31 01:13:13 s320 kernel: RPC: 55743 release request ffff88042be25000
> Jan 31 01:13:13 s320 kernel: RPC: 55744 reserved req ffff88042be25000
> xid 4141e127
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xprt_prepare_transmit
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xprt_transmit(112)
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xmit complete
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xid 4141e127 complete (112
> bytes received)
> Jan 31 01:13:13 s320 kernel: RPC: 55744 release request ffff88042be25000
> Jan 31 01:13:14 s320 kernel: RPC: 55745 reserved req ffff88042be25000
> xid 4241e127
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xprt_prepare_transmit
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xprt_transmit(116)
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xmit complete
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xid 4241e127 complete (120
> bytes received)
> Jan 31 01:13:14 s320 kernel: RPC: 55745 release request ffff88042be25000
> Jan 31 01:13:14 s320 kernel: RPC: 55746 reserved req ffff88042be25000
> xid 4341e127
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xprt_prepare_transmit
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xprt_transmit(116)
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xmit complete
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xid 4341e127 complete (120
> bytes received)
> Jan 31 01:13:14 s320 kernel: RPC: 55746 release request ffff88042be25000
> Jan 31 01:13:14 s320 kernel: RPC: 55747 reserved req ffff88042be25000
> xid 4441e127
> 
> 
> The clients are running CentOS-6.3, and we have tried both kernel
> 2.6.32-279.2.1.el6 and 2.6.32-279.19.1.el6, and are using
> nfs-utils-1.2.3-26.el6.  We never saw the problem on the same hardware
> when running CentOS-5.5 and kernel 2.6.18-194.11.4.el5.
> 
> The server in this example is a NetApp FAS3140, but mounts to other
> servers (primarily running CentOS-5.7) have exhibited similar
> problems.
> 
> Running "tcpdump" shows no NFS traffic to the server when running a
> new command which should access the NFS mount.
> 
> We do not have a firewall in the way, and I don't believe that any
> other routing problems are blocking network traffic since "ping" and
> "showmount -e" from the client to the server respond normally.
> 
> Trying to create a new mount to any mountpoint on the same server
> hangs as well.  The call trace shows it in __d_instantiate(), called
> from rpc_wait_bit_killable().  Creating a new mount to a different
> server works fine.
> 
> And finally, here are even more details that may be unique to our
> configuration, and might confuse the issue, but I wanted to list them
> here just in case they are relevant:
> 
> * These nodes are fully diskless, mounting their root file system via
> NFS, and have no swap.
> 
> * We see similar hangs of processes accessing the root file system, as
> indicated by the call trace generated by alt-sysrq-w from the console.
> Unfortunately without a working root file system I can't login to the
> nodes to get more debugging info, so have decided to concentrate on
> the /home mount described above.
> 
> * Our site has 5 clusters all running the exact same (diskless) OS,
> ranging from 250 to 850 nodes.  The problem is triggered most
> frequently on a system with 8 cores/node + 24 GB memory, and very
> rarely on one with 16 cores/node + 32 GB memory.  So it may have
> something to do with memory pressure or with number of nodes used by a
> user's application at a fixed core count.  Of course I can't rule out
> differences in client NIC or cluster network capabilities impacting
> frequency either.
> 
> 
> Thanks for your time reading all this and any help you can offer!
> 
> -Nathan

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2013-02-07 22:55 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-31 22:58 v3 client processes hung in rpc_wait_bit_killable Nathan Dauchy - NOAA Affiliate
2013-02-07 22:55 ` Nathan Dauchy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.