* lockd not responding
@ 2007-03-06 20:17 Jan Rekorajski
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Rekorajski @ 2007-03-06 20:17 UTC (permalink / raw)
  To: nfs

Hi,
After applying Trond's patches the oops problem went away but now I'm
back to comatose lockd.

rainbow is an NFS server, sith is a random client:

[baggins@sith ~]$ rpcinfo -p rainbow | grep lock
    100021    1   udp  32774  nlockmgr
    100021    3   udp  32774  nlockmgr
    100021    4   udp  32774  nlockmgr
    100021    1   tcp  37150  nlockmgr
    100021    3   tcp  37150  nlockmgr
    100021    4   tcp  37150  nlockmgr

[baggins@sith ~]$ rpcinfo -u rainbow 100021
rpcinfo: RPC: Timed out
program 100021 version 0 is not available

[baggins@sith ~]$ rpcinfo -t rainbow 100021
rpcinfo: RPC: Timed out
program 100021 version 0 is not available

[baggins@sith ~]$ telnet rainbow 37150
Trying 10.1.1.4.37150...
Connected to rainbow.mimuw.edu.pl.
Escape character is '^]'.
^]
telnet>

[root@rainbow ~]# ps aux | grep "\[lockd\]"
root      3786  0.0  0.0      0     0 ?        S    01:55   0:00 [lockd]

So lockd is up and running and I can connect to it, but it is not
responding to RPC calls. Interestingly, it works right after a reboot
and only stops after some time.

I also see a lot of these in logs on server
(red13 is another NFS client):

portmap: server red13 not responding, timed out
lockd: server red13 not responding, timed out
lockd: couldn't create RPC handle for red13

It looks to me as if lockd is looping over some dead client and is so
tied up in doing so that it has no time to answer new calls.
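
For anyone who wants to repeat the check without rpcinfo, here is a
minimal sketch of the same idea: send an RPC NULL call (procedure 0)
straight to the nlockmgr UDP port and see whether anything comes back.
The function names, the fixed xid and the default version 4 are
assumptions made for illustration; only the program number 100021 and
the port come from the rpcinfo output above.

```python
import socket
import struct

NLM_PROGRAM = 100021  # nlockmgr, as listed by rpcinfo -p


def build_null_call(xid: int, program: int, version: int) -> bytes:
    """Pack an ONC RPC v2 CALL for procedure 0 with AUTH_NONE credentials."""
    return struct.pack(
        ">10I",
        xid,      # transaction id, echoed back in the reply
        0,        # msg_type: CALL
        2,        # RPC protocol version
        program,
        version,
        0,        # procedure 0 (NULL)
        0, 0,     # credentials: AUTH_NONE, zero-length body
        0, 0,     # verifier: AUTH_NONE, zero-length body
    )


def probe_udp(host: str, port: int, version: int = 4, timeout: float = 5.0) -> bool:
    """Return True if a reply with a matching xid arrives before the timeout."""
    xid = 0x6C6F636B  # arbitrary value chosen for this sketch
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_null_call(xid, NLM_PROGRAM, version), (host, port))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False
    # A reply starts with the same xid followed by msg_type 1 (REPLY).
    return len(reply) >= 8 and struct.unpack(">II", reply[:8]) == (xid, 1)
```

With the ports shown above, probe_udp("rainbow", 32774) would be the
rough equivalent of what rpcinfo -u does once it has looked up the
port. A False result only says that no reply arrived in time; it does
not say why.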

Jan
-- 
Jan Rekorajski            |  ALL SUSPECTS ARE GUILTY. PERIOD!
baggins<at>mimuw.edu.pl   |  OTHERWISE THEY WOULDN'T BE SUSPECTS, WOULD THEY?
BOFH, MANIAC              |                   -- TROOPS by Kevin Rubio

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* lockd not responding
@ 2007-09-12  7:26 kenneth johansson
  2007-09-13  0:27 ` Jeff Layton
  0 siblings, 1 reply; 6+ messages in thread
From: kenneth johansson @ 2007-09-12  7:26 UTC (permalink / raw)
  To: nfs

Got a warning from the lock validating check again and
later an unresponsive lockd with a backtrace this time
actually at the same place the lock warning was on.

[68078.860233]  =======================
[68078.860276] lockd         D E4433CD0  5432  2445      2 (L-TLB)
[68078.860419]        e4433cf0 00000046 e4433df4 e4433cd0 c0335db3 00000000 3b9ab342 8c0cfa96 
[68078.860701]        00002591 8c0cfa96 00002591 0000e375 00000073 ecdb0ba0 0000576f 00000000 
[68078.861019]        00000002 c051fc58 e31fea04 00000246 ecdb0a90 e4433d2c c03b7030 00000000 
[68078.861337] Call Trace:
[68078.861414]  [<c03b7030>] __mutex_lock_slowpath+0xa0/0x290
[68078.861490]  [<c03b723c>] mutex_lock+0x1c/0x20
[68078.861565]  [<c02411c9>] nlmsvc_traverse_blocks+0x29/0xa0
[68078.861647]  [<c02425fe>] nlm_traverse_files+0x6e/0x210
[68078.861723]  [<c024282b>] nlmsvc_mark_resources+0x1b/0x30
[68078.861799]  [<c023f02e>] nlm_gc_hosts+0x4e/0x1e0
[68078.861874]  [<c023f576>] nlm_lookup_host+0x46/0x310
[68078.861950]  [<c023f874>] nlmsvc_lookup_host+0x34/0x40
[68078.862026]  [<c02414b5>] nlmsvc_lock+0x125/0x360
[68078.862100]  [<c024562c>] nlm4svc_proc_lock+0x7c/0x110
[68078.862178]  [<c03a4740>] svc_process+0x680/0x730
[68078.862257]  [<c0240166>] lockd+0x106/0x240
[68078.862331]  [<c0104b43>] kernel_thread_helper+0x7/0x14
[68078.862407]  =======================

[21409.476505] =======================================================
[21409.476599] [ INFO: possible circular locking dependency detected ]
[21409.476646] 2.6.22.3 #7
[21409.476688] -------------------------------------------------------
[21409.476735] lockd/2445 is trying to acquire lock:
[21409.476781]  (&file->f_mutex){--..}, at: [<c03b723c>] mutex_lock+0x1c/0x20
[21409.476951] 
[21409.476952] but task is already holding lock:
[21409.477034]  (nlm_host_mutex){--..}, at: [<c03b723c>] mutex_lock+0x1c/0x20
[21409.477198] 
[21409.477199] which lock already depends on the new lock.
[21409.477201] 
[21409.477321] 
[21409.477322] the existing dependency chain (in reverse order) is:
[21409.477405] 
[21409.477406] -> #1 (nlm_host_mutex){--..}:
[21409.477574]        [<c0135d4d>] __lock_acquire+0xdad/0xf60
[21409.477853]        [<c0135f55>] lock_acquire+0x55/0x70
[21409.478128]        [<c03b6ff9>] __mutex_lock_slowpath+0x69/0x290
[21409.478405]        [<c03b723c>] mutex_lock+0x1c/0x20
[21409.478680]        [<c023f561>] nlm_lookup_host+0x31/0x310
[21409.478961]        [<c023f874>] nlmsvc_lookup_host+0x34/0x40
[21409.479238]        [<c02414b5>] nlmsvc_lock+0x125/0x360
[21409.479513]        [<c024562c>] nlm4svc_proc_lock+0x7c/0x110
[21409.479792]        [<c03a4740>] svc_process+0x680/0x730
[21409.480071]        [<c0240166>] lockd+0x106/0x240
[21409.480347]        [<c0104b43>] kernel_thread_helper+0x7/0x14
[21409.480625]        [<ffffffff>] 0xffffffff
[21409.480904] 
[21409.480905] -> #0 (&file->f_mutex){--..}:
[21409.481072]        [<c0135bc7>] __lock_acquire+0xc27/0xf60
[21409.481348]        [<c0135f55>] lock_acquire+0x55/0x70
[21409.481623]        [<c03b6ff9>] __mutex_lock_slowpath+0x69/0x290
[21409.481900]        [<c03b723c>] mutex_lock+0x1c/0x20
[21409.482175]        [<c02411c9>] nlmsvc_traverse_blocks+0x29/0xa0
[21409.482453]        [<c02425fe>] nlm_traverse_files+0x6e/0x210
[21409.482729]        [<c024282b>] nlmsvc_mark_resources+0x1b/0x30
[21409.483005]        [<c023f02e>] nlm_gc_hosts+0x4e/0x1e0
[21409.483281]        [<c023f576>] nlm_lookup_host+0x46/0x310
[21409.483558]        [<c023f874>] nlmsvc_lookup_host+0x34/0x40
[21409.483834]        [<c024506b>] nlm4svc_retrieve_args+0x3b/0xd0
[21409.484111]        [<c0245607>] nlm4svc_proc_lock+0x57/0x110
[21409.484387]        [<c03a4740>] svc_process+0x680/0x730
[21409.484663]        [<c0240166>] lockd+0x106/0x240
[21409.484938]        [<c0104b43>] kernel_thread_helper+0x7/0x14
[21409.485215]        [<ffffffff>] 0xffffffff
[21409.485488] 
[21409.485489] other info that might help us debug this:
[21409.485491] 
[21409.485611] 1 lock held by lockd/2445:
[21409.485654]  #0:  (nlm_host_mutex){--..}, at: [<c03b723c>] mutex_lock+0x1c/0x20
[21409.485855] 
[21409.485856] stack backtrace:
[21409.485937]  [<c0104eca>] show_trace_log_lvl+0x1a/0x30
[21409.486012]  [<c0105a02>] show_trace+0x12/0x20
[21409.486087]  [<c0105a75>] dump_stack+0x15/0x20
[21409.486161]  [<c0133d6c>] print_circular_bug_tail+0x6c/0x80
[21409.486237]  [<c0135bc7>] __lock_acquire+0xc27/0xf60
[21409.486312]  [<c0135f55>] lock_acquire+0x55/0x70
[21409.486386]  [<c03b6ff9>] __mutex_lock_slowpath+0x69/0x290
[21409.486462]  [<c03b723c>] mutex_lock+0x1c/0x20
[21409.487052]  [<c02411c9>] nlmsvc_traverse_blocks+0x29/0xa0
[21409.487129]  [<c02425fe>] nlm_traverse_files+0x6e/0x210
[21409.487204]  [<c024282b>] nlmsvc_mark_resources+0x1b/0x30
[21409.487279]  [<c023f02e>] nlm_gc_hosts+0x4e/0x1e0
[21409.487354]  [<c023f576>] nlm_lookup_host+0x46/0x310
[21409.487430]  [<c023f874>] nlmsvc_lookup_host+0x34/0x40
[21409.487505]  [<c024506b>] nlm4svc_retrieve_args+0x3b/0xd0
[21409.487581]  [<c0245607>] nlm4svc_proc_lock+0x57/0x110
[21409.487656]  [<c03a4740>] svc_process+0x680/0x730
[21409.487731]  [<c0240166>] lockd+0x106/0x240
[21409.487805]  [<c0104b43>] kernel_thread_helper+0x7/0x14
[21409.487880]  =======================
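
To make the report above easier to read, here is a small user-space
sketch of the kind of bookkeeping behind such a warning: remember which
lock was taken while which other lock was held, and complain when a new
acquisition would close a cycle. This is a toy model, not how the
kernel's lock validator is implemented. The class and function names
are made up, and chain #1 is replayed on the assumption that
nlmsvc_lock already holds file->f_mutex when it looks up the host.

```python
from collections import defaultdict


class LockOrderChecker:
    """Records 'held -> acquired' edges and reports an edge that closes a cycle."""

    def __init__(self):
        self.edges = defaultdict(set)  # lock name -> locks taken while it was held
        self.held = []                 # locks currently held, in acquisition order

    def _reachable(self, start, goal):
        """Depth-first search over the recorded edges."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.edges[node])
        return False

    def acquire(self, name):
        """Return a warning string if taking `name` now inverts a recorded order."""
        warning = None
        for holder in self.held:
            if self._reachable(name, holder):
                warning = f"possible circular locking dependency: {holder} -> {name}"
            self.edges[holder].add(name)
        self.held.append(name)
        return warning

    def release(self, name):
        self.held.remove(name)


def replay_report():
    """Replay the two chains from the report; return the warnings produced."""
    checker = LockOrderChecker()
    warnings = []
    # Chain #1: nlm_host_mutex taken from nlmsvc_lock, assumed to run under f_mutex.
    for name in ("file->f_mutex", "nlm_host_mutex"):
        warnings.append(checker.acquire(name))
    for name in ("nlm_host_mutex", "file->f_mutex"):
        checker.release(name)
    # Chain #0: nlm_gc_hosts holds nlm_host_mutex and walks down to f_mutex.
    for name in ("nlm_host_mutex", "file->f_mutex"):
        warnings.append(checker.acquire(name))
    return [w for w in warnings if w]
```

Calling replay_report() yields a single warning, raised at the moment
file->f_mutex is requested while nlm_host_mutex is held, which matches
the "trying to acquire" / "already holding" pair in the report.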




* Re: lockd not responding
  2007-09-12  7:26 lockd not responding kenneth johansson
@ 2007-09-13  0:27 ` Jeff Layton
  2007-09-21 20:52   ` Trond Myklebust
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Layton @ 2007-09-13  0:27 UTC (permalink / raw)
  To: kenneth johansson; +Cc: nfs

On Wed, 12 Sep 2007 07:26:02 +0000 (UTC)
kenneth johansson <ken@kenjo.org> wrote:

> Got a warning from the lock validating check again and
> later an unresponsive lockd with a backtrace this time
> actually at the same place the lock warning was on.

FWIW, I think I ran into the exact same problem recently. I opened a
BZ case for it so I wouldn't forget about it, but haven't had time to
track it down:

https://bugzilla.redhat.com/show_bug.cgi?id=280311

I'm not certain that lockd was stuck at the time, since I had some
other things going on with the box, but it may have been...

-- 
Jeff Layton <jlayton@redhat.com>


* Re: lockd not responding
  2007-09-13  0:27 ` Jeff Layton
@ 2007-09-21 20:52   ` Trond Myklebust
  2007-09-22 21:31     ` Jeff Layton
  2007-09-23 18:52     ` kenneth johansson
  0 siblings, 2 replies; 6+ messages in thread
From: Trond Myklebust @ 2007-09-21 20:52 UTC (permalink / raw)
  To: Jeff Layton; +Cc: kenneth johansson, nfs

[-- Attachment #1: Type: text/plain, Size: 716 bytes --]

On Wed, 2007-09-12 at 20:27 -0400, Jeff Layton wrote:
> On Wed, 12 Sep 2007 07:26:02 +0000 (UTC)
> kenneth johansson <ken@kenjo.org> wrote:
> 
> > Got a warning from the lock validating check again and
> > later an unresponsive lockd with a backtrace this time
> > actually at the same place the lock warning was on.
> 
> FWIW, I think I ran into the exact same problem recently. I opened a
> BZ case for it so I wouldn't forget about it, but haven't had time to
> track it down:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=280311
> 
> I'm not certain that lockd was stuck at the time, since I had some
> other things going on with the box, but it may have been...

Does the attached patch help?

Cheers
  Trond

[-- Attachment #2: linux-2.6.23-001-fix_lockd_circular_dependency.dif --]
[-- Type: message/rfc822, Size: 3589 bytes --]

From: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: No Subject
Date: Wed, 19 Sep 2007 11:54:54 -0400
Message-ID: <1190407932.6721.43.camel@heimdal.trondhjem.org>

The problem is that the garbage collector for the 'host' structures,
nlm_gc_hosts(), holds nlm_host_mutex while calling down to
nlmsvc_mark_resources(), which eventually takes file->f_mutex.

We cannot therefore call nlmsvc_lookup_host() from within
nlmsvc_create_block, since the caller will already hold file->f_mutex, so
the attempt to grab nlm_host_mutex may deadlock.

Fix the problem by calling nlmsvc_lookup_host() outside the file->f_mutex.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 fs/lockd/svclock.c |   29 ++++++++++++++++++-----------
 1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index a21e4bc..d098c7a 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -171,19 +171,14 @@ found:
  * GRANTED_RES message by cookie, without having to rely on the client's IP
  * address. --okir
  */
-static inline struct nlm_block *
-nlmsvc_create_block(struct svc_rqst *rqstp, struct nlm_file *file,
-		struct nlm_lock *lock, struct nlm_cookie *cookie)
+static struct nlm_block *
+nlmsvc_create_block(struct svc_rqst *rqstp, struct nlm_host *host,
+		    struct nlm_file *file, struct nlm_lock *lock,
+		    struct nlm_cookie *cookie)
 {
 	struct nlm_block	*block;
-	struct nlm_host		*host;
 	struct nlm_rqst		*call = NULL;
 
-	/* Create host handle for callback */
-	host = nlmsvc_lookup_host(rqstp, lock->caller, lock->len);
-	if (host == NULL)
-		return NULL;
-
 	call = nlm_alloc_call(host);
 	if (call == NULL)
 		return NULL;
@@ -366,6 +361,7 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
 			struct nlm_lock *lock, int wait, struct nlm_cookie *cookie)
 {
 	struct nlm_block	*block = NULL;
+	struct nlm_host		*host;
 	int			error;
 	__be32			ret;
 
@@ -377,6 +373,10 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
 				(long long)lock->fl.fl_end,
 				wait);
 
+	/* Create host handle for callback */
+	host = nlmsvc_lookup_host(rqstp, lock->caller, lock->len);
+	if (host == NULL)
+		return nlm_lck_denied_nolocks;
 
 	/* Lock file against concurrent access */
 	mutex_lock(&file->f_mutex);
@@ -385,7 +385,8 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
 	 */
 	block = nlmsvc_lookup_block(file, lock);
 	if (block == NULL) {
-		block = nlmsvc_create_block(rqstp, file, lock, cookie);
+		block = nlmsvc_create_block(rqstp, nlm_get_host(host), file,
+				lock, cookie);
 		ret = nlm_lck_denied_nolocks;
 		if (block == NULL)
 			goto out;
@@ -449,6 +450,7 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
 out:
 	mutex_unlock(&file->f_mutex);
 	nlmsvc_release_block(block);
+	nlm_release_host(host);
 	dprintk("lockd: nlmsvc_lock returned %u\n", ret);
 	return ret;
 }
@@ -477,10 +479,15 @@ nlmsvc_testlock(struct svc_rqst *rqstp, struct nlm_file *file,
 
 	if (block == NULL) {
 		struct file_lock *conf = kzalloc(sizeof(*conf), GFP_KERNEL);
+		struct nlm_host	*host;
 
 		if (conf == NULL)
 			return nlm_granted;
-		block = nlmsvc_create_block(rqstp, file, lock, cookie);
+		/* Create host handle for callback */
+		host = nlmsvc_lookup_host(rqstp, lock->caller, lock->len);
+		if (host == NULL)
+			return nlm_lck_denied_nolocks;
+		block = nlmsvc_create_block(rqstp, host, file, lock, cookie);
 		if (block == NULL) {
 			kfree(conf);
 			return nlm_granted;
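
For readers who want to see the effect of the reordering outside the
kernel, the following is a rough user-space model, not kernel code. It
assumes one reading of the hung-task backtrace earlier in the thread: a
single lockd thread holds file->f_mutex, calls the host lookup, and the
garbage collector then tries to take the same file mutex again. The
Python names are invented for the sketch, and the model runs the
garbage collector on every lookup, which is a simplification.

```python
import threading

file_mutex = threading.Lock()      # stands in for file->f_mutex
nlm_host_mutex = threading.Lock()  # stands in for nlm_host_mutex


def lookup_host(timeout=0.2):
    """Model nlmsvc_lookup_host(): take nlm_host_mutex, then let the garbage
    collector try to take the file mutex. Returns False if the GC would block."""
    with nlm_host_mutex:
        if not file_mutex.acquire(timeout=timeout):
            return False  # in the kernel this would be an indefinite sleep
        file_mutex.release()
        return True


def lock_before_patch():
    """Host lookup happens inside the file mutex (via nlmsvc_create_block)."""
    with file_mutex:
        return lookup_host()


def lock_after_patch():
    """Host lookup happens first; the file mutex is taken afterwards."""
    ok = lookup_host()
    with file_mutex:
        pass  # lock processing would go here
    return ok
```

In this model lock_before_patch() returns False, because the
non-recursive lock is requested a second time by the same thread, while
lock_after_patch() returns True. The timeout exists only so that the
sketch terminates.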


* Re: lockd not responding
  2007-09-21 20:52   ` Trond Myklebust
@ 2007-09-22 21:31     ` Jeff Layton
  2007-09-23 18:52     ` kenneth johansson
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff Layton @ 2007-09-22 21:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: kenneth johansson, nfs

On Fri, 21 Sep 2007 16:52:12 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Wed, 2007-09-12 at 20:27 -0400, Jeff Layton wrote:
> > On Wed, 12 Sep 2007 07:26:02 +0000 (UTC)
> > kenneth johansson <ken@kenjo.org> wrote:
> > 
> > > Got a warning from the lock validating check again and
> > > later an unresponsive lockd with a backtrace this time
> > > actually at the same place the lock warning was on.
> > 
> > FWIW, I think I ran into the exact same problem recently. I opened a
> > BZ case for it so I wouldn't forget about it, but haven't had time to
> > track it down:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=280311
> > 
> > I'm not certain that lockd was stuck at the time, since I had some
> > other things going on with the box, but it may have been...
> 
> Does the attached patch help?
> 
> Cheers
>   Trond
> 

Thanks, Trond -- nice work :-)

Yes. It does seem to. I have a reproducer of sorts -- run this in a
continuous loop:

lock file
sleep 2 seconds
unlock file
sleep 1 second

When run simultaneously on the server and client against the same
inode, the lockdep warning usually pops within a few minutes. With
the patch above, I never saw it, even after running for 20 mins or
so.
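
A minimal sketch of such a loop, for anyone who wants to try it. The
steps above do not say which locking call was used, so the use of
POSIX locks through fcntl.lockf() is an assumption, as are the function
name and its parameters.

```python
import fcntl
import time


def lock_unlock_loop(path, iterations, hold=2.0, pause=1.0):
    """Take and drop an exclusive POSIX lock on `path`; return completed cycles."""
    done = 0
    with open(path, "a+") as fh:  # opened for writing so LOCK_EX is allowed
        for _ in range(iterations):
            fcntl.lockf(fh, fcntl.LOCK_EX)  # blocks until the lock is granted
            time.sleep(hold)
            fcntl.lockf(fh, fcntl.LOCK_UN)
            time.sleep(pause)
            done += 1
    return done
```

Run it with a large iteration count on the server against the exported
file and on the client against the same file through the NFS mount, so
that both sides contend for the same inode.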

-- 
Jeff Layton <jlayton@redhat.com>


* Re: lockd not responding
  2007-09-21 20:52   ` Trond Myklebust
  2007-09-22 21:31     ` Jeff Layton
@ 2007-09-23 18:52     ` kenneth johansson
  1 sibling, 0 replies; 6+ messages in thread
From: kenneth johansson @ 2007-09-23 18:52 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs, Jeff Layton


On Fri, 2007-09-21 at 16:52 -0400, Trond Myklebust wrote:
> On Wed, 2007-09-12 at 20:27 -0400, Jeff Layton wrote:
> > On Wed, 12 Sep 2007 07:26:02 +0000 (UTC)
> > kenneth johansson <ken@kenjo.org> wrote:
> > 
> > > Got a warning from the lock validating check again and
> > > later an unresponsive lockd with a backtrace this time
> > > actually at the same place the lock warning was on.
> > 
> > FWIW, I think I ran into the exact same problem recently. I opened a
> > BZ case for it so I wouldn't forget about it, but haven't had time to
> > track it down:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=280311
> > 
> > I'm not certain that lockd was stuck at the time, since I had some
> > other things going on with the box, but it may have been...
> 
> Does the attached patch help?

Yes it does. 

I tested Jeff's lock/unlock loop. Without the patch it hit the lock
warning within a minute or two in each of my three runs; with the patch
it has now run for about an hour with no warning.

So there is an Ack from me on this fix.




end of thread, other threads:[~2007-09-23 18:52 UTC | newest]

Thread overview: 6+ messages
2007-09-12  7:26 lockd not responding kenneth johansson
2007-09-13  0:27 ` Jeff Layton
2007-09-21 20:52   ` Trond Myklebust
2007-09-22 21:31     ` Jeff Layton
2007-09-23 18:52     ` kenneth johansson
  -- strict thread matches above, loose matches on Subject: below --
2007-03-06 20:17 Jan Rekorajski
