From: Jerome Walters
Subject: nfs: server not responding
Date: Sat, 16 May 2009 00:57:01 +0000 (UTC)
To: linux-nfs@vger.kernel.org

Description of problem:
Periodically, and with no obvious cause, all NFS connections between our
Debian Testing (_Squeeze_) x86 client (a diskless node which uses nfsroot
and boots from the server) and our Debian Testing (_Squeeze_) x86 server
hang, and dmesg on the client side reports that the server is
"not responding".
The server is responding to everyone else's requests.
Restarting the nfsd on the server doesn't appear to solve the problem.

At first I wasn't able to capture any debug information since /var/log was
mounted over NFS, so I have installed a hard drive where I mounted
only /var/log to be able to capture debug logs from the client as well.

Debug Logs:
http://fixity.net/tmp/client.log.gz - Kernel RPC Debug Log from the client
http://fixity.net/tmp/server.log.gz - Kernel RPC Debug Log from the server

How reproducible:
Happens from 10 to 90 minutes after booting the diskless node.

Actual results:
NFS connections stop responding, and the system hangs or becomes very slow
and unresponsive (it doesn't respond to Ctrl+Alt+Del either). 60 to 90
minutes after the first server timeout the client reports the server as OK,
but the client is still unresponsive. Immediately after that the client logs
a server connection loss again, which leads to a continuous loop. The client
is still unresponsive. Sometimes the client resumes normal operation for a
couple of hours, but then the problem repeats.

Connectivity info:
Both the client and the server are connected to a Gigabit Ethernet Cisco
Metro series manageable switch. Both of them use Intel Pro 82545GM Gigabit
Ethernet Server Controllers. Neither of them logs any Ethernet errors and
none are logged by the switch.

Expected results:
NFS connections continue to function and don't fail like clockwork when
every other client on the network has no issues.

Client & Server Load:
For the purposes of testing both machines were running only the needed
daemons and weren’t loaded at all.

Client & Server Kernel:
On both the client and the server a custom-compiled Linux 2.6.29.3 kernel
was used.
Configuration file @ http://fixity.net/tmp/config-2.6.29.3.gz

Client & Server Network interface fragmented packet queue length:
net.ipv4.ipfrag_high_thresh = 524288
net.ipv4.ipfrag_low_thresh = 393216

Client Versions:
libnfsidmap2/squeeze uptodate 0.21-2
nfs-common/squeeze uptodate 1:1.1.4-1

Client Mount (cat /proc/mounts | grep nfsroot):
10.11.11.1:/nfsroot / nfs rw,vers=3,rsize=524288,wsize=524288,namlen=255,hard,nointr,nolock,proto=tcp,timeo=7,retrans=10,sec=sys,addr=10.11.11.1 0 0

Client fstab:
proc /proc proc defaults 0 0
/dev/nfs / nfs defaults 1 1
none /tmp tmpfs defaults 0 0
none /var/run tmpfs defaults 0 0
none /var/lock tmpfs defaults 0 0
none /var/tmp tmpfs defaults 0 0

Client Daemons:
portmap, rpc.statd, rpc.idmapd

Server Daemons:
portmap, rpc.statd, rpc.idmapd, rpc.mountd --manage-gids

Server Versions:
libnfsidmap2/squeeze uptodate 0.21-2
nfs-common/squeeze uptodate 1:1.1.4-1
nfs-kernel-server/testing uptodate 1:1.1.4-1

Server Export:
/nfsroot 10.11.11.*(rw,no_root_squash,async,no_subtree_check)

Server Options:
RPCNFSDCOUNT=16
RPCNFSDPRIORITY=0
RPCMOUNTDOPTS=--manage-gids
NEED_SVCGSSD=no
RPCSVCGSSDOPTS=no

Additional Info:
Since I have read that tweaking the nfsroot mount options could improve the
situation, I have tested with different options as follows:
rsize/wsize=1024|2048|4096|8192|32768|524288
timeo=15|60|600
retrans=3|10|20
None of them solved the problem.

Any help or suggestions on fixing the problem would be highly appreciated. I
have been struggling with this problem for the last couple of weeks and have
run out of ideas.

Best Regards,
Jerome Walters
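
For anyone who wants to compare the effective mount options in this report
against other setups, here is a minimal Python sketch (not part of the
original report) that splits the option string from the /proc/mounts line
into key/value pairs. The helper name parse_mount_options is hypothetical,
and the conversion of timeo to seconds assumes the nfs(5) convention that
timeo is given in tenths of a second.

```python
def parse_mount_options(options: str) -> dict:
    """Split a comma-separated mount option string into a dict.

    Flags without a value (for example "hard") map to True.
    """
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


# Option string copied from the /proc/mounts line in this report.
NFSROOT_OPTIONS = (
    "rw,vers=3,rsize=524288,wsize=524288,namlen=255,hard,nointr,nolock,"
    "proto=tcp,timeo=7,retrans=10,sec=sys,addr=10.11.11.1"
)

opts = parse_mount_options(NFSROOT_OPTIONS)

# Assumption: timeo is expressed in tenths of a second, as documented in nfs(5).
timeo_seconds = int(opts["timeo"]) / 10

print(f"proto={opts['proto']} timeo={timeo_seconds}s retrans={opts['retrans']}")
```

On a live client the same helper could be fed the fourth field of the
matching /proc/mounts line instead of the hard-coded string.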