* More about nfsd/lockd hang in 2.4.20+NFS_ALL
@ 2003-06-13 13:46 Matthew Mitchell
  2003-06-13 15:48 ` Trond Myklebust
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Mitchell @ 2003-06-13 13:46 UTC (permalink / raw)
  To: nfs

(see my earlier message from June 10 for more information)

So this morning it happened again.  Seems that when the operator tried 
to log in to the disk server as himself, it NFS-mounted his home 
directory for him.  This is a loopback mount.  The message printed out 
by lockd was
	lockd: rejected NSM callback from 7f000001:32769

(4 times).  This is in fs/lockd/svcproc.c and svc4proc.c, in the 
function nlmsvc_proc_sm_notify.  I am not sure what the difference 
between "nlm" and "nlm4" is, but I bet someone on this list knows...

In any event, I noticed this right as it was happening, so I was able to 
kill -9 the operator's login and the system recovered.  Symptoms of the 
hang were like I had seen before -- it looks like this is capable of 
hanging every NFS service running on the machine.

For now I just changed the automount map so this won't happen.  I can't 
imagine that this behavior is correct, so perhaps someone would be 
interested in helping me understand what is going on?  There should be 
no technical reason why a loopback NFS mount should fail, even though 
you might not really want to do it for performance reasons.

The code looks like this:

         if (saddr.sin_addr.s_addr != htonl(INADDR_LOOPBACK)
          || ntohs(saddr.sin_port) >= 1024) {
                 printk(KERN_WARNING
                         "lockd: rejected NSM callback from %08x:%d\n",
                         ntohl(rqstp->rq_addr.sin_addr.s_addr),
                         ntohs(rqstp->rq_addr.sin_port));
                 return rpc_system_err;
         }

In this case, though, the rq_addr.sin_addr.s_addr is that of loopback, 
as it says in the message (7f000001 => 127.0.0.1).  It would appear that 
this is a lock notify that's supposed to be called when a client 
reconnects to a server, but it thinks it's being called with some 
impossible values.

Am I on the mark here?  Something that might be relevant: this server 
was recently pressed into use as the server for these volumes. 
Previously, it was mounting them (home directories) from another server, 
which died.  Perhaps it has some old lock information lying around, and 
when it tries to connect to itself as a client, it tries to reacquire 
its locks?  Or perhaps it is something more innocuous.

In any case, comments or help appreciated.

-- 
Matthew Mitchell
Systems Programmer/Administrator            matthew@geodev.com
Geophysical Development Corporation         phone 713 782 1234
1 Riverway Suite 2100, Houston, TX  77056     fax 713 782 1829



-------------------------------------------------------
This SF.NET email is sponsored by: eBay
Great deals on office technology -- on eBay now! Click here:
http://adfarm.mediaplex.com/ad/ck/711-11697-6916-5
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

* Re: More about nfsd/lockd hang in 2.4.20+NFS_ALL
  2003-06-13 13:46 More about nfsd/lockd hang in 2.4.20+NFS_ALL Matthew Mitchell
@ 2003-06-13 15:48 ` Trond Myklebust
  2003-06-13 16:19   ` Matthew Mitchell
  0 siblings, 1 reply; 6+ messages in thread
From: Trond Myklebust @ 2003-06-13 15:48 UTC (permalink / raw)
  To: Matthew Mitchell; +Cc: nfs

>>>>> " " == Matthew Mitchell <matthew@geodev.com> writes:

     > The code looks like this:

     >          if (saddr.sin_addr.s_addr != htonl(INADDR_LOOPBACK)
     >           || ntohs(saddr.sin_port) >= 1024) {
     >                  printk(KERN_WARNING
     >                          "lockd: rejected NSM callback from %08x:%d\n",
     >                          ntohl(rqstp->rq_addr.sin_addr.s_addr),
     >                          ntohs(rqstp->rq_addr.sin_port));
     >                  return rpc_system_err;
     >          }

     > In this case, though, the rq_addr.sin_addr.s_addr is that of
     > loopback, as it says in the message (7f000001 => 127.0.0.1).
     > It would appear that this is a lock notify that's supposed to
     > be called when a client reconnects to a server, but it thinks
     > it's being called with some impossible values.

It's just saying that the kernel expects rpc.statd to contact it using
a reserved port when notifying it about a reboot of one of the remote
servers.
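
To make the port requirement concrete, here is a minimal userspace sketch
of the same test applied to the values in the log line. It is not the
kernel code: nsm_check(), from_logged() and the enum are hypothetical
names introduced for illustration. The only thing taken from the thread
is the condition itself (the source must be 127.0.0.1 and the source port
must be below 1024).

```c
#include <arpa/inet.h>   /* htonl, htons, ntohs */
#include <assert.h>
#include <netinet/in.h>  /* struct sockaddr_in, INADDR_LOOPBACK */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>  /* AF_INET */

/* Hypothetical result type: which half of the check failed, if any. */
enum nsm_verdict { NSM_OK = 0, NSM_BAD_ADDR = 1, NSM_BAD_PORT = 2 };

/* Same condition as the quoted lockd test, split into its two halves. */
static enum nsm_verdict nsm_check(const struct sockaddr_in *saddr)
{
    if (saddr->sin_addr.s_addr != htonl(INADDR_LOOPBACK))
        return NSM_BAD_ADDR;            /* not from 127.0.0.1 */
    if (ntohs(saddr->sin_port) >= 1024)
        return NSM_BAD_PORT;            /* not a reserved source port */
    return NSM_OK;
}

/* Build a sockaddr_in from the host-order values lockd prints (%08x:%d). */
static struct sockaddr_in from_logged(unsigned long addr, unsigned int port)
{
    struct sockaddr_in sa;

    assert(port <= 65535);
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl((uint32_t)addr);
    sa.sin_port = htons((uint16_t)port);
    return sa;
}
```

With the logged values, from_logged(0x7f000001, 32769), the address half
passes and the port half fails, so nsm_check() returns NSM_BAD_PORT. That
is consistent with a caller that did not bind a reserved port.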

Most rpc.statd daemons today run setuid some unprivileged
user. Perhaps this is causing bindresvport() to fail?

Cheers,
  Trond



* Re: More about nfsd/lockd hang in 2.4.20+NFS_ALL
  2003-06-13 15:48 ` Trond Myklebust
@ 2003-06-13 16:19   ` Matthew Mitchell
  2003-06-13 16:31     ` Trond Myklebust
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Mitchell @ 2003-06-13 16:19 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

Trond Myklebust wrote:
>>>>>>" " == Matthew Mitchell <matthew@geodev.com> writes:
> 
> 
>      > The code looks like this:
> 
>      >          if (saddr.sin_addr.s_addr != htonl(INADDR_LOOPBACK)
>      >           || ntohs(saddr.sin_port) >= 1024) {
>      >                  printk(KERN_WARNING
>      >                          "lockd: rejected NSM callback from %08x:%d\n",
>      >                          ntohl(rqstp->rq_addr.sin_addr.s_addr),
>      >                          ntohs(rqstp->rq_addr.sin_port));
>      >                  return rpc_system_err;
>      >          }
> 
>      > In this case, though, the rq_addr.sin_addr.s_addr is that of
>      > loopback, as it says in the message (7f000001 => 127.0.0.1).
>      > It would appear that this is a lock notify that's supposed to
>      > be called when a client reconnects to a server, but it thinks
>      > it's being called with some impossible values.
> 
> It's just saying that the kernel expects rpc.statd to contact it using
> a reserved port when notifying it about a reboot of one of the remote
> servers.
> 
> Most rpc.statd daemons today run setuid some unprivileged
> user. Perhaps this is causing bindresvport() to fail?

rpc.statd in this case is running as an unprivileged user, yes.  So 
lockd will not allow a local statd to talk to it unless it is running on 
a privileged port?  That seems to be what is going on in the 
conditional.  Next question is -- why?

Even assuming there is a good reason, why might it cause the whole nfs 
system to hang?  I'm guessing statd or somewhere in the rpc layer isn't 
expecting this to fail, but I have no idea where to even start looking.

-- 
Matthew Mitchell
Systems Programmer/Administrator            matthew@geodev.com
Geophysical Development Corporation         phone 713 782 1234
1 Riverway Suite 2100, Houston, TX  77056     fax 713 782 1829




* Re: More about nfsd/lockd hang in 2.4.20+NFS_ALL
  2003-06-13 16:19   ` Matthew Mitchell
@ 2003-06-13 16:31     ` Trond Myklebust
  2003-06-13 16:48       ` Matthew Mitchell
  0 siblings, 1 reply; 6+ messages in thread
From: Trond Myklebust @ 2003-06-13 16:31 UTC (permalink / raw)
  To: Matthew Mitchell; +Cc: Trond Myklebust, nfs

>>>>> " " == Matthew Mitchell <matthew@geodev.com> writes:

     > rpc.statd in this case is running as an unprivileged user, yes.
     > So lockd will not allow a local statd to talk to it unless it
     > is running on a privileged port?  That seems to be what is
     > going on in the conditional.  Next question is -- why?

For obvious reasons, you don't want any Tom, Dick or Harry to be able
to tell the kernel that it should try to recover locking state from a
given server.

     > Even assuming there is a good reason, why might it cause the
     > whole nfs system to hang?

My guess (since you are not supplying a tcpdump) is that the server is
down. That's when it is supposed to happen, anyway...

Cheers,
  Trond



* Re: More about nfsd/lockd hang in 2.4.20+NFS_ALL
  2003-06-13 16:31     ` Trond Myklebust
@ 2003-06-13 16:48       ` Matthew Mitchell
  2003-06-13 21:21         ` Trond Myklebust
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Mitchell @ 2003-06-13 16:48 UTC (permalink / raw)
  To: trond.myklebust; +Cc: nfs

Trond Myklebust wrote:
>>>>>>" " == Matthew Mitchell <matthew@geodev.com> writes:
> 
> 
>      > rpc.statd in this case is running as an unprivileged user, yes.
>      > So lockd will not allow a local statd to talk to it unless it
>      > is running on a privileged port?  That seems to be what is
>      > going on in the conditional.  Next question is -- why?
> 
> For obvious reasons, you don't want any Tom, Dick or Harry to be able
> to tell the kernel that it should try to recover locking state from a
> given server.

Right, it has to come from localhost on a privileged port.  I 
understand.  But how could it ever work if it's not working in this 
case?  Maybe this is a red herring.

According to rpcinfo on this server, which is also the client, port 
32769 is "sgi_fam".  What is that?  status is 32768.  Perhaps it's 
rejecting it with good cause.
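
One caveat when comparing rpcinfo output with that log line: the port
lockd prints is the source port of the incoming request, while
`rpcinfo -p` lists the ports that services have registered to listen on,
per protocol. A matching number alone does not show which program sent
the callback. A rough sketch of looking up a registration by protocol and
port follows, assuming the usual `rpcinfo -p` column layout (program,
version, protocol, port, optional service name). struct rpc_reg,
parse_rpcinfo_line() and reg_matches() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One registration as listed by `rpcinfo -p` (hypothetical struct). */
struct rpc_reg {
    unsigned long prog;
    unsigned int  vers;
    char          proto[8];
    unsigned int  port;
    char          name[32];
};

/* Parse one data line; returns 1 on success, 0 for headers or junk. */
static int parse_rpcinfo_line(const char *line, struct rpc_reg *out)
{
    int n;

    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    n = sscanf(line, "%lu %u %7s %u %31s",
               &out->prog, &out->vers, out->proto, &out->port, out->name);
    return n >= 4;   /* the service-name column may be absent */
}

/* True if the registration is for the given protocol and port. */
static int reg_matches(const struct rpc_reg *r, const char *proto,
                       unsigned int port)
{
    return strcmp(r->proto, proto) == 0 && r->port == port;
}
```

For example, a line such as "100024 1 udp 32768 status" (sample data, not
copied from this server) parses to port 32768 and matches a udp lookup
but not a tcp one, which is why the protocol column matters here.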

>      > Even assuming there is a good reason, why might it cause the
>      > whole nfs system to hang?
> 
> My guess (since you are not supplying a tcpdump) is that the server is
> down. That's when it is supposed to happen, anyway...

Hmm.  So the messages from lockd could just be a symptom of the problem 
(nfsd locking up), you think.  I first noticed the problem when a remote 
user logged into the server, and the home directory (exported by the 
server) got remounted by the automounter on a local path.

But just now I tried manually mounting the home directories on another 
local path, and it seems to work fine.

Perhaps it involves the automounter somehow?  I did notice that the 
output of mount looked funny when I was trying to see if the volume had 
been remounted.  It was something like

fenris:/export/users on /home/users type nfs (rw,bind)

instead of

fenris:/export/users on /home/users type nfs (rw,addr=127.0.0.1)

I can try to reproduce the problem with autofs, but these are users' home 
directories, and the users might get annoyed. :)

-- 
Matthew Mitchell
Systems Programmer/Administrator            matthew@geodev.com
Geophysical Development Corporation         phone 713 782 1234
1 Riverway Suite 2100, Houston, TX  77056     fax 713 782 1829




* Re: More about nfsd/lockd hang in 2.4.20+NFS_ALL
  2003-06-13 16:48       ` Matthew Mitchell
@ 2003-06-13 21:21         ` Trond Myklebust
  0 siblings, 0 replies; 6+ messages in thread
From: Trond Myklebust @ 2003-06-13 21:21 UTC (permalink / raw)
  To: Matthew Mitchell; +Cc: trond.myklebust, nfs

>>>>> " " == Matthew Mitchell <matthew@geodev.com> writes:

     > According to rpcinfo on this server, which is also the client,
     > port 32769 is "sgi_fam".  What is that?  status is 32768.
     > Perhaps it's rejecting it with good cause.

fam has nothing to do with NFS. It is a 'file alteration
monitor'. ('man fam')

     > Hmm.  So the messages from lockd could just be a symptom of the
     > problem (nfsd locking up), you think.  I first noticed the
     > problem when a remote user logged into the server, and the home
     > directory (exported by the server) got remounted by the
     > automounter on a local path.

If you think that fam is screwing with NFS then kill it. It is hardly
a critical service on most setups.

Cheers,
  Trond

