From: Stephan Koledin
Subject: NFS lockups with 2.4.18
Date: Wed, 24 Sep 2003 12:51:22 -0400
Message-ID: <3F71CB8A.3090208@neolinear.com>
To: nfs@lists.sourceforge.net
List-Id: Discussion of NFS under Linux development, interoperability, and testing.

Hello All-

We're running into some NFS lockups here and having a rough time debugging/solving the issue. I'm hoping that someone on the list may be able to provide some suggestions.

Background:

Unpredictably, but approximately every week or two, we run into a problem where our file server stops responding to NFS requests. The server is Debian 3.0 (woody), running the stock Debian 2.4.18-1-686 kernel. We have also seen the same problem with the Debian bf2.4 kernel. The server has a single 2.6GHz P4 processor, >1GB RAM, and several large (>500GB) SCSI-160 arrays using ext3 filesystems with quotas enabled.

The problem only seems to be resolvable with a reboot, and even then it will often recur within a few minutes, requiring another reboot (1-5 times) before finally settling down and being stable for 1-14+ days.
We have not noticed anything strange/interesting in any of the logs or on the console. All other services/processes on the machine continue to operate perfectly, including disk operations. The machine simply stops serving NFS. We serve NFS to Linux, Solaris, and HP clients, several versions of each OS.

Some Relevant Data (from when NFS was not working properly):

$ rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100029    1   udp    845  keyserv
    100029    2   udp    845  keyserv
    100011    1   udp    852  rquotad
    100011    2   udp    852  rquotad
    100011    1   tcp    855  rquotad
    100011    2   tcp    855  rquotad
    100024    1   udp  32772  status
    100024    1   tcp  32768  status
    100001    1   udp  32773  rstatd
    100001    2   udp  32773  rstatd
    100001    3   udp  32773  rstatd
    100001    4   udp  32773  rstatd
    100001    5   udp  32773  rstatd
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32774  nlockmgr
    100021    3   udp  32774  nlockmgr
    100021    4   udp  32774  nlockmgr
    100005    1   udp  32775  mountd
    100005    1   tcp  32769  mountd
    100005    2   udp  32775  mountd
    100005    2   tcp  32769  mountd
    100005    3   udp  32775  mountd
    100005    3   tcp  32769  mountd

$ rpcinfo [-u | -t]

portmapper udp
    program 100000 version 2 ready and waiting
portmapper tcp
    program 100000 version 2 ready and waiting
keyserv
    program 100029 version 1 ready and waiting
    program 100029 version 2 ready and waiting
rquotad udp
    program 100011 version 1 ready and waiting
    program 100011 version 2 ready and waiting
rquotad tcp
    program 100011 version 1 ready and waiting
    program 100011 version 2 ready and waiting
status udp
    program 100024 version 1 ready and waiting
status tcp
    program 100024 version 1 ready and waiting
rstatd udp
    program 100001 version 1 ready and waiting
    program 100001 version 2 ready and waiting
    program 100001 version 3 ready and waiting
    program 100001 version 4 is not available
    program 100001 version 5 ready and waiting
nfs udp
    program 100003 version 0 is not available
nlockmgr udp
    program 100021 version 0 is not available
mountd udp
    program 100005 version 0 is not available
mountd tcp
    program 100005 version 0 is not available

(normal output, as expected, is as follows)

nfs udp
    program 100003 version 2 ready and waiting
    program 100003 version 3 ready and waiting
nlockmgr udp
    program 100021 version 1 ready and waiting
    program 100021 version 2 is not available
    program 100021 version 3 ready and waiting
    program 100021 version 4 ready and waiting
mountd udp
    program 100005 version 1 ready and waiting
    program 100005 version 2 ready and waiting
    program 100005 version 3 ready and waiting
mountd tcp
    program 100005 version 1 ready and waiting
    program 100005 version 2 ready and waiting
    program 100005 version 3 ready and waiting

All nfsd processes (32) are still in process list, but the number of swapped processes reported by vmstat jumps from 2 to 33, right around the occurrence of the problem. The rpc.mountd, rpc.rquotad, and rpc.statd processes are also still in the process list along with the 32 nfsd instances.

Don't see anything unusual in any of the other system stats - memory, cpu, and disk usage all remain steady. Memory utilization graphs during the problem period appear flatter than is typical, but not much change in values from normal operation.

Does anyone have any ideas about the problem? I'll be sure to grab some nfsstat and better ps output next time this happens, but any other suggestions for better logging or any other relevant data collection? Have there been any fixes since 2.4.18 that address similar or related problems?

Thanks for any help with this elusive problem.

-Stephan

-- 
Stephan B Koledin
Network Systems Developer
http://neolinear.com/
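
The failure pattern reported above (services still registered with the portmapper but no longer answering a probe) could in principle be spotted by comparing the two views. The following is a minimal Python sketch under stated assumptions, not a tested monitoring tool: the names `parse_rpcinfo_p`, `missing_versions`, `SAMPLE`, and `answered` are hypothetical, the sample rows are copied from the `rpcinfo -p` listing above, and `answered` is filled in by hand to mirror the reported failure state (portmapper answers, nfs does not). Collecting `answered` from live probes is deliberately left out.

```python
from collections import defaultdict


def parse_rpcinfo_p(text):
    """Parse `rpcinfo -p` output into {(service, proto): sorted versions}.

    Hypothetical helper for illustration; it only understands the
    five-column layout shown in the listing above.
    """
    registered = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: 100003 2 udp 2049 nfs
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # skips the header row and anything malformed
        _program, version, proto, _port, service = fields
        registered[(service, proto)].add(int(version))
    return {key: sorted(versions) for key, versions in registered.items()}


def missing_versions(registered, answered):
    """Return the registered versions that did not answer a probe."""
    missing = {}
    for key, versions in registered.items():
        gone = [v for v in versions if v not in answered.get(key, [])]
        if gone:
            missing[key] = gone
    return missing


# Rows copied from the `rpcinfo -p` listing above.
SAMPLE = """\
   program vers proto   port
    100000    2   tcp    111  portmapper
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
"""

registered = parse_rpcinfo_p(SAMPLE)

# Assumption: entered by hand to mirror the failure state reported above,
# where portmapper still answered but nfs did not.
answered = {("portmapper", "tcp"): [2]}

print(missing_versions(registered, answered))
```

Anything this reports is registered with the portmapper but silent, which is the state described for nfs, nlockmgr, and mountd during the lockup. Logging that result with a timestamp alongside nfsstat and ps output would be one possible way to narrow down when the server stops answering.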