From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965111AbWEKBpR (ORCPT ); Wed, 10 May 2006 21:45:17 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S965114AbWEKBpR (ORCPT ); Wed, 10 May 2006 21:45:17 -0400 Received: from inglit.ubishops.ca ([206.167.194.132]:31951 "EHLO cs.ubishops.ca") by vger.kernel.org with ESMTP id S965111AbWEKBpQ (ORCPT ); Wed, 10 May 2006 21:45:16 -0400 Message-ID: <4462978C.6080809@cs.mcgill.ca> Date: Wed, 10 May 2006 21:46:52 -0400 From: Patrick McLean User-Agent: Thunderbird 1.5.0.2 (X11/20060430) MIME-Version: 1.0 To: Neil Brown CC: linux-kernel , Andrew Bogecho Subject: Re: NFS locking References: <446246D6.5010509@cs.mcgill.ca> <17506.33247.884320.387785@cse.unsw.edu.au> In-Reply-To: <17506.33247.884320.387785@cse.unsw.edu.au> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Neil Brown wrote: > On Wednesday May 10, chutz@cs.mcgill.ca wrote: >> We have a NFS server here with a fairly high load. The clients are >> Linux, FreeBSD and Solaris. The exported filesystem is XFS, which is onb >> a LVM drive. After between 3 and 30 days it seems that locking >> completely stops working, clients generally either error or simply lock >> up when they try to lock a file. The only way to fix it seems to be a >> reboot. > > Reboot the client or the server? > The server, rebooting the clients had no effect. >> Last time it happened was on 2.6.17-rc2, it started around 2.6.15. >> >> There is nothing in the dmesg on the server, the (Linux) clients are >> printing this in the dmesg when something tries to create a lock: >> >> lockd: server xxx.xxx.xxx.xxx not responding, still trying >> lockd: server xxx.xxx.xxx.xxx not responding, still trying > > Sounds like the server has locked up. > What does 'ps' on the server show for 'lockd'? Is it in 'D'? What is > the 'wchan'? Are any 'nfsd's permanently in 'D'? > > Try > echo t > /proc/sysrq-trigger > > and see what the stack trace for lockd is - probably only useful if it > is in 'D'. > > Maybe a 'tcpdump -s 1500' of traffic between client and server would > help. We have already rebooted the server this time around, we will do the stack trace and tcpdump from a client next time it happens. Though, I do seem to remember that lockd was in the "D" state on the server when it happened this afternoon. Restarting the nfs service on the server did spawn a new lockd process, but did not fix the problem.