NFSv3/TCP client hangs under load (2.4.17+NFS

All of lore.kernel.org
 help / color / mirror / Atom feed

* NFSv3/TCP client hangs under load (2.4.17+NFS_ALL)
@ 2002-04-13  3:22 Andrew Ryan
  2002-04-13 18:44 ` Trond Myklebust
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Ryan @ 2002-04-13  3:22 UTC (permalink / raw)
  To: nfs

I am getting NFS client hangs under load with v3/TCP that I can almost 
reliably reproduce.

The setup:
Run a couple of processes that hit the NFS client hard. In my case, 4 
simultaneous bonnie++ runs, and a simple perl script that exercises file 
locking on NFS. It's not clear I need that many bonnie++'s, but that's what I 
was using. Running the file locking script seems essential to triggering the 
bug, but I can't be totally sure. It takes me anywhere from 15 minutes to 2 
hours of running this workload to trigger a hang.

The client host is running 2.4.17+NFS_ALL, smp ; otherwise more or less a 
stock RH7.2 system. The mount uses the following options:
v3,tcp,intr,hard,rsize=32768,wsize=32768
The server is a Netapp, fwiw.

The symptom:
The NFS mount on the client completely hangs, as do any processes that are 
using it. Other mount points on the same server can be accessed from the 
client -- in fact the same mount point on the server can be mounted and 
accessed fine, in a different place on the client while the hang is going on. 
The following error appears in the kernel at the beginning of the hang:
Apr 12 19:57:07 testhost kernel: nfs_statfs: statfs error = 512
After that nothing, until the mount comes back.

Fixing the problem:
The only way to get out of this situation is to kill the 'offending' process. 
In my case, either the perl script or the bonnie++'s. Then after sending kill 
signals to each process, a minute so or later, the mount will come back. 
Otherwise I have waited >20 minutes and the mount has still been hung.

While the system is hung, turning on /proc/sys/sunrpc/rpc_debug reveals the 
following to the kernel log. It looks like something is waking up every 5 
seconds and trying to process the same request, and failing, and going into 
some sort of infinite loop.

I'm wondering if anyone can help me debug this problem. At the very least I 
intend to try UDP to see if that avoids this problem. But my tests have shown 
that TCP is somewhat faster, so I'd like to use that if possible, although 
stability is more important right now.

Output to kernel log with rpc_debug on while mount is hung:
Apr 12 19:59:47 testhost kernel: RPC:      tcp_data_ready...
Apr 12 19:59:47 testhost kernel: RPC:     xprt queue f779c000
Apr 12 19:59:47 testhost kernel: RPC:      tcp_data_ready client f779c000
Apr 12 19:59:47 testhost kernel: RPC:      state 1 conn 1 dead 0 zapped 0
Apr 12 19:59:47 testhost kernel: rpciod_tcp_dispatcher: Queue Running
Apr 12 19:59:47 testhost kernel: rpciod_tcp_dispatcher: Processing f779c000
Apr 12 19:59:47 testhost kernel: RPC:      tcp_input_record
Apr 12 19:59:52 testhost kernel: RPC:      tcp_data_ready...
Apr 12 19:59:52 testhost kernel: RPC:     xprt queue f779c000
Apr 12 19:59:52 testhost kernel: RPC:      tcp_data_ready client f779c000
Apr 12 19:59:52 testhost kernel: RPC:      state 1 conn 1 dead 0 zapped 0
Apr 12 19:59:52 testhost kernel: rpciod_tcp_dispatcher: Queue Running
Apr 12 19:59:52 testhost kernel: rpciod_tcp_dispatcher: Processing f779c000
Apr 12 19:59:52 testhost kernel: RPC:      tcp_input_record
Apr 12 19:59:57 testhost kernel: RPC:      tcp_data_ready...
Apr 12 19:59:57 testhost kernel: RPC:     xprt queue f779c000
Apr 12 19:59:57 testhost kernel: RPC:      tcp_data_ready client f779c000
Apr 12 19:59:57 testhost kernel: RPC:      state 1 conn 1 dead 0 zapped 0
Apr 12 19:59:57 testhost kernel: rpciod_tcp_dispatcher: Queue Running
Apr 12 19:59:57 testhost kernel: rpciod_tcp_dispatcher: Processing f779c000
Apr 12 19:59:57 testhost kernel: RPC:      tcp_input_record
Apr 12 20:00:02 testhost kernel: RPC:      tcp_data_ready...
Apr 12 20:00:02 testhost kernel: RPC:     xprt queue f779c000
Apr 12 20:00:02 testhost kernel: RPC:      tcp_data_ready client f779c000
Apr 12 20:00:02 testhost kernel: rpciod_tcp_dispatcher: Queue Running
Apr 12 20:00:02 testhost kernel: rpciod_tcp_dispatcher: Processing f779c000
Apr 12 20:00:02 testhost kernel: RPC:      tcp_input_record
Apr 12 20:00:02 testhost kernel: RPC:      state 1 conn 1 dead 0 zapped 0
Apr 12 20:00:07 testhost kernel: RPC:      tcp_data_ready...
Apr 12 20:00:07 testhost kernel: RPC:     xprt queue f779c000
Apr 12 20:00:07 testhost kernel: RPC:      tcp_data_ready client f779c000
Apr 12 20:00:07 testhost kernel: rpciod_tcp_dispatcher: Queue Running
Apr 12 20:00:07 testhost kernel: rpciod_tcp_dispatcher: Processing f779c000
Apr 12 20:00:07 testhost kernel: RPC:      tcp_input_record
Apr 12 20:00:07 testhost kernel: RPC:      state 1 conn 1 dead 0 zapped 0

Perl script used to exercise file locking:
#!/usr/bin/perl

use strict;
use Fcntl qw(:DEFAULT :flock);;

my $dir=$ARGV[0] || "/shared/data/test";

my $filename = "$dir/$$.test";

while (1 == 1) {
    open(FILE, ">$filename") or die "Can't open file: $!\n";
    flock FILE, LOCK_EX or die "Can't lock file: $!\n";
    print FILE "Hello world\n";
    close FILE;
    lstat FILE;
    unlink($filename);
}

Any ideas much appreciated.

thanks,
andrew

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: NFSv3/TCP client hangs under load (2.4.17+NFS_ALL)
  2002-04-13  3:22 NFSv3/TCP client hangs under load (2.4.17+NFS_ALL) Andrew Ryan
@ 2002-04-13 18:44 ` Trond Myklebust
  2002-04-13 19:13   ` Andrew Ryan
  2002-04-21 19:01   ` Andrew Ryan
  0 siblings, 2 replies; 4+ messages in thread
From: Trond Myklebust @ 2002-04-13 18:44 UTC (permalink / raw)
  To: Andrew Ryan; +Cc: nfs

>>>>> " " == Andrew Ryan <andrewr@nam-shub.com> writes:

     > The client host is running 2.4.17+NFS_ALL, smp ; otherwise more
     > or less a stock RH7.2 system. The mount uses the following
     > options: v3,tcp,intr,hard,rsize=32768,wsize=32768 The server is
     > a Netapp, fwiw.

     > While the system is hung, turning on /proc/sys/sunrpc/rpc_debug
     > reveals the following to the kernel log. It looks like
     > something is waking up every 5 seconds and trying to process
     > the same request, and failing, and going into some sort of
     > infinite loop.

The rpciod dispatcher is what is supposed to copy data from the TCP
socket back to the NFS layer. If new data is arriving at the socket,
then it is perfectly normal that it should be running. If it never
wakes up any RPC requests, though, then that might indicate that the
client and server out of sync w.r.t. where they think they are in the
data stream.

I'm actually in the process of rewriting this whole part of the TCP
code. For 2.4.19-pre6, the code in the corresponding NFS_ALL patch
should be more robust against interference (and possibly a bit faster
than the old code). It would be nice if you could give it a go...

Cheers,
  Trond

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: NFSv3/TCP client hangs under load (2.4.17+NFS_ALL)
  2002-04-13 18:44 ` Trond Myklebust
@ 2002-04-13 19:13   ` Andrew Ryan
  2002-04-21 19:01   ` Andrew Ryan
  1 sibling, 0 replies; 4+ messages in thread
From: Andrew Ryan @ 2002-04-13 19:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

At 08:44 PM 4/13/02 +0200, Trond Myklebust wrote:
>I'm actually in the process of rewriting this whole part of the TCP
>code. For 2.4.19-pre6, the code in the corresponding NFS_ALL patch
>should be more robust against interference (and possibly a bit faster
>than the old code). It would be nice if you could give it a go...

I will try this -- but it will take me a longer while to put a new kernel 
through all the necessary paces so I can use it in production, and I don't 
like to use pre* kernels in production. In the meantime, as a workaround, 
should UDP on 2.4.17+NFS_ALL be more robust? I'll try it anyway, but I'm 
looking for short term workarounds that will alleviate my problem without 
changing kernels or significantly impacting NFS client performance.

In the meantime, I've found even stranger manifestations of the hang. I 
have been able to create a case (randomly it seems) where every directory 
on the hung mount is accessible, except for one. Any process accessing this 
directory hangs. But I can access other directories on the same mount, and 
stat the mount just fine. Go figure.

andrew

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: NFSv3/TCP client hangs under load (2.4.17+NFS_ALL)
  2002-04-13 18:44 ` Trond Myklebust
  2002-04-13 19:13   ` Andrew Ryan
@ 2002-04-21 19:01   ` Andrew Ryan
  1 sibling, 0 replies; 4+ messages in thread
From: Andrew Ryan @ 2002-04-21 19:01 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

At 08:44 PM 4/13/02 +0200, Trond Myklebust wrote:

>I'm actually in the process of rewriting this whole part of the TCP
>code. For 2.4.19-pre6, the code in the corresponding NFS_ALL patch
>should be more robust against interference (and possibly a bit faster
>than the old code). It would be nice if you could give it a go...


I tried the 2.4.19-pre7 kernel with your NFS patch with TCP , and it did 
not produce the hang under the same loading conditions. I didn't do 
performance testing between the two, if I do, I'll let you know if it's faster.

So, I guess the lesson here is to beware the NFSv3/TCP code in 2.4.17 under 
load, and use 2.4.19 when it is released, or use UDP.


thanks,
andrew


_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2002-04-21 19:01 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-04-13  3:22 NFSv3/TCP client hangs under load (2.4.17+NFS_ALL) Andrew Ryan
2002-04-13 18:44 ` Trond Myklebust
2002-04-13 19:13   ` Andrew Ryan
2002-04-21 19:01   ` Andrew Ryan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.