nfsd random drop - Olaf Kirch

From: Olaf Kirch <okir@suse.de>
To: Neil Brown <neilb@cse.unsw.edu.au>
Cc: nfs@lists.sourceforge.net
Subject: nfsd random drop
Date: Thu, 1 Apr 2004 12:23:34 +0200
Message-ID: <20040401102334.GC20772@suse.de>

[-- Attachment #1: Type: text/plain, Size: 3958 bytes --]

Hi,

I hate to bore you all with the same old stuff, but I'm still fighting
problems caused by nfsd's dropping active connections.

The most recent episode in this saga is a problem with the Linux
client.

Consider a network with a single Linux 2.4 based home server, a few
hundred clients, all using TCP. In Linux 2.4, nfsd starts dropping
connections when it reaches a limit of (nrthreads + 3) * 10 open
connections. With 4 threads, this means 70 connections, and with 8 threads
this means 110 connections max. Both are totally inadequate for
this network. To get out of the congestion zone, we would need to bump
the number of threads to about 20, which is just silly.

The very same network has been served well with just 4 threads all
the time while using UDP.

With the 2.6 kernel, things get even worse as the formula was changed to
(nrthreads + 3) * 5, so you'll max out at 35 (4 threads) and 55 (with
8 threads), respectively. To serve 200 mounts via TCP simultaneously,
you'd need close to 40 nfsd threads.
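
The arithmetic behind these numbers can be sketched in a few lines of C.
This is only an illustration of the formula quoted above, not kernel code;
the helper names conn_limit() and threads_needed() are made up for this
example, and "factor" stands for the multiplier (10 in 2.4, 5 in 2.6).

```c
#include <assert.h>

/* Connection limit as described in the text: (nrthreads + 3) * factor.
 * factor is 10 for Linux 2.4 and 5 for Linux 2.6. */
static unsigned int conn_limit(unsigned int nrthreads, unsigned int factor)
{
	return (nrthreads + 3) * factor;
}

/* Smallest thread count whose limit is at least `conns`.
 * Hypothetical helper, only meant to reproduce the estimates above. */
static unsigned int threads_needed(unsigned int conns, unsigned int factor)
{
	unsigned int units;

	assert(factor > 0);
	units = (conns + factor - 1) / factor;	/* ceil(conns / factor) */
	return units > 3 ? units - 3 : 0;
}
```

With these helpers, conn_limit(4, 10) is 70 and conn_limit(8, 5) is 55,
and threads_needed(200, 5) comes out at 37, which is where the "close to
40" estimate comes from.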

In theory, all clients should be able to cope gracefully with such drops,
but even the Linux client runs into a couple of SNAFUs with these.

One: with a 50% probability, nfsd decides to drop the _newest_ connection,
which is the one it just accepted.  When the Linux client sees a fresh
connection go down before it was able to send anything across, it
backs off for 15 to 60 seconds, hanging the NFS mount (with 2.6.5-pre,
it's always 60 seconds). This is rather annoying for the KDE users here,
because KDE applications like to scribble to the home directory all
the time, and their entire session freezes when NFS hangs.

Two: people have reported that files vanished and/or rename/remove
operations failed.

I also think this is due to the TCP disconnects. What I think
is happening here is this:

 - user X: unlink("blafoo")
 - client kernel: sends NFS call REMOVE "blafoo" to the server
 - nfsd thread A receives the request and removes file blafoo, then
   waits for some file system I/O to sync the change to disk
 - a new TCP connection comes in. Another nfsd thread B decides
   it needs to nuke some connections and selects user X's connection
 - nfsd thread A decides it should send the response now,
   but finds the socket is gone. Drops the reply.
 - client kernel: reconnects to the NFS server
 - server drops the connection
 - client waits for a while, reconnects again, and
   resends REMOVE "blafoo"
 - NFS server: sorry, ENOENT: there's no such file "blafoo"

Normally, the NFS server's replay cache should protect from this sort
of behavior, but the long timeouts before the client can reconnect
effectively mean the cached reply has been forgotten by the time the
retransmitted call arrives.
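
To make the failure mode concrete, here is a toy user-space model of the
sequence above. It is a sketch under loud assumptions, not the nfsd replay
cache: the struct, the function toy_remove(), the single-entry cache and
the CACHE_TTL value are all invented for illustration, and the real cache
may forget entries for reasons other than a fixed timer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_TTL 30	/* hypothetical lifetime of a cached reply, in seconds */

/* Toy server state: one file and a one-entry replay cache. */
struct toy_server {
	bool file_exists;
	bool cached;		/* is there a cached reply at all? */
	unsigned int cached_xid;	/* RPC transaction id of that reply */
	int cached_status;	/* status that was sent (or lost) */
	long cached_at;		/* time the request was executed */
};

/* Handle REMOVE for the single file. A retransmission with the same xid
 * is answered from the cache only while the entry is still fresh. */
static int toy_remove(struct toy_server *s, unsigned int xid, long now)
{
	int status;

	assert(s != NULL);
	if (s->cached && s->cached_xid == xid &&
	    now - s->cached_at <= CACHE_TTL)
		return s->cached_status;	/* replay the original reply */

	/* Cache miss: execute the operation again. */
	status = s->file_exists ? 0 : -ENOENT;
	s->file_exists = false;
	s->cached = true;
	s->cached_xid = xid;
	s->cached_status = status;
	s->cached_at = now;
	return status;
}
```

In this model, a retransmission of the same xid that arrives 10 seconds
after the original call still gets the cached success, while one that
arrives after the client's long back-off (say 100 seconds) misses the
cache, is executed again, and yields -ENOENT even though the first REMOVE
succeeded.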

This is not a theoretical case; users here have reported that
files vanish mysteriously several times a day.

Three: people reported lots of messages in their syslog saying
"nfs_rename: target foo/bar busy, d_count=2". This is a variation
of the above. nfs_rename finds that someone still has foo/bar
open and decides it needs to do a sillyrename. The rename
fails with the spurious ENOENT error described above, causing
the entire rename operation to fail.

Four: Some buggy clients can't deal with it, but I think I mentioned
that already.  Prime offender is zOS; when a fresh connection is killed,
it simply propagates the error to the application, hard mount or not. I
know it's broken, but that doesn't mean we can't be gentler and make
these clients work more smoothly with Linux.

I propose to add the following two patches to the server and client. They
increase the connection limit, stop dropping the newest socket, and
add some printk's to alert the admin of the contention.

As an alternative to hardcoding a formula based on the number of threads,
I could also make the max number of connections a sysctl.
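
A rough user-space sketch of what the sysctl alternative might look like
follows. Everything here is an assumption for illustration: the variable
max_connections and the helpers are hypothetical names, no such tunable
exists in the patches attached to this mail, and the fallback simply
reuses the (nrthreads + 3) * 20 formula from the first patch.

```c
#include <assert.h>

/* Hypothetical tunable; 0 means "not set, fall back to the formula". */
static unsigned int max_connections;

/* Limit actually applied: the admin's value if set, else the formula. */
static unsigned int effective_limit(unsigned int nrthreads)
{
	assert(nrthreads > 0);	/* a running service has at least one thread */
	if (max_connections != 0)
		return max_connections;
	return (nrthreads + 3) * 20;
}

/* Mirrors the patched check: drop only once the count exceeds the limit. */
static int must_drop(unsigned int tmpcnt, unsigned int nrthreads)
{
	return tmpcnt > effective_limit(nrthreads);
}
```

With the tunable left at 0 and 4 threads, the limit is 140; setting it to
500 would let 500 connections stay open regardless of the thread count.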

Comments?
Olaf
-- 
Olaf Kirch     |  The Hardware Gods hate me.
okir@suse.de   |
---------------+ 

[-- Attachment #2: sunrpc-svcsock-drop --]
[-- Type: text/plain, Size: 1848 bytes --]

diff -ur linux-2.6.4-nfsd/net/sunrpc/svcsock.c linux-2.6.4/net/sunrpc/svcsock.c
--- linux-2.6.4-nfsd/net/sunrpc/svcsock.c	2004-03-11 03:55:22.000000000 +0100
+++ linux-2.6.4/net/sunrpc/svcsock.c	2004-03-30 16:58:01.000000000 +0200
@@ -828,21 +828,33 @@

 	/* make sure that we don't have too many active connections.
 	 * If we have, something must be dropped.
-	 * We randomly choose between newest and oldest (in terms
-	 * of recent activity) and drop it.
+	 *
+	 * There's no point in trying to do random drop here for
+	 * DoS prevention. The NFS client does 1 reconnect in 15
+	 * seconds. An attacker can easily beat that.
+	 *
+	 * The only somewhat efficient mechanism would be to drop
+	 * old connections from the same IP first. But right now
+	 * we don't even record the client IP in svc_sock.
 	 */
-	if (serv->sv_tmpcnt > (serv->sv_nrthreads+3)*5) {
+	if (serv->sv_tmpcnt > (serv->sv_nrthreads+3)*20) {
 		struct svc_sock *svsk = NULL;
 		spin_lock_bh(&serv->sv_lock);
 		if (!list_empty(&serv->sv_tempsocks)) {
-			if (net_random()&1)
-				svsk = list_entry(serv->sv_tempsocks.prev,
-						  struct svc_sock,
-						  sk_list);
-			else
-				svsk = list_entry(serv->sv_tempsocks.next,
-						  struct svc_sock,
-						  sk_list);
+			if (net_ratelimit()) {
+				/* Try to help the admin */
+				printk(KERN_NOTICE "%s: too many open TCP sockets, consider "
+						   "increasing the number of threads\n",
+						   serv->sv_name);
+				printk(KERN_NOTICE "%s: last TCP connect from %u.%u.%u.%u:%d\n",
+							serv->sv_name, 
+							NIPQUAD(sin.sin_addr.s_addr),
+							ntohs(sin.sin_port));
+			}
+			/* Always select the oldest socket. It's not fair, but so is life */
+			svsk = list_entry(serv->sv_tempsocks.prev,
+					  struct svc_sock,
+					  sk_list);
 			set_bit(SK_CLOSE, &svsk->sk_flags);
 			svsk->sk_inuse ++;
 		}

[-- Attachment #3: sunrpc-verbose-disconnect --]
[-- Type: text/plain, Size: 445 bytes --]

--- linux-2.6.4/net/sunrpc/xprt.c.reconnect	2004-03-30 14:19:45.000000000 +0200
+++ linux-2.6.4/net/sunrpc/xprt.c	2004-03-30 15:42:04.000000000 +0200
@@ -1039,6 +1039,11 @@
 	case TCP_SYN_RECV:
 		break;
 	default:
+		if (net_ratelimit()) {
+			printk(KERN_NOTICE "NFS server %u.%u.%u.%u %s connection\n",
+					NIPQUAD(xprt->addr.sin_addr.s_addr),
+					xprt_connected(xprt)? "closed" : "refused");
+		}
 		xprt_disconnect(xprt);
 		break;
 	}
