From mboxrd@z Thu Jan 1 00:00:00 1970
From: Olaf Kirch
Subject: Re: 2.6.5-pre TCP connect problems
Date: Mon, 29 Mar 2004 17:28:01 +0200
Sender: nfs-admin@lists.sourceforge.net
Message-ID: <20040329152801.GB19311@suse.de>
References: <20040329135042.GG2992@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
To: nfs@lists.sourceforge.net
In-Reply-To: <20040329135042.GG2992@suse.de>
Errors-To: nfs-admin@lists.sourceforge.net
List-Id: Discussion of NFS under Linux development, interoperability, and testing.

On Mon, Mar 29, 2004 at 03:50:42PM +0200, Olaf Kirch wrote:
> I'm currently debugging a problem with TCP reconnects in 2.6.5-pre where
> the TCP reconnect code got rewritten to use worker queues. What happens
> is that the NFS server drops the connection immediately and that state
> change isn't propagated to the transport.

I debugged this a little more. The problem I've been seeing was caused
by too many TCP connections: the NFS server was dropping connections
randomly. "Randomly" here means that the newest connection is dropped
with a probability of 50%, so that the connection dies before the client
has sent the first packet.
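
To make the drop behaviour concrete, here is a rough simulation sketch. It
is not the nfsd code. The only part taken from the description above is
"drop the newest connection with probability 50%"; dropping the oldest
connection in the other case, the connection limit of 20, and the names
accept_with_limit and fraction_dropped_on_arrival are assumptions made
for illustration.

```python
import random
from collections import deque


def accept_with_limit(conns, new_conn, limit, rng):
    """Accept new_conn and return the connection that gets dropped, if any.

    conns is ordered oldest (left) to newest (right).
    """
    conns.append(new_conn)
    if len(conns) <= limit:
        return None
    # From the message: the newest connection is dropped with p = 0.5.
    if rng.random() < 0.5:
        return conns.pop()
    # Assumption for this sketch: otherwise the oldest one is dropped.
    return conns.popleft()


def fraction_dropped_on_arrival(attempts, limit, seed=0):
    """Fraction of new connections that are dropped right after accept."""
    rng = random.Random(seed)
    conns = deque(range(limit))  # server is already at its limit
    dropped_at_once = 0
    for i in range(attempts):
        conn_id = limit + i
        if accept_with_limit(conns, conn_id, limit, rng) == conn_id:
            dropped_at_once += 1
    return dropped_at_once / attempts


if __name__ == "__main__":
    print(fraction_dropped_on_arrival(attempts=10_000, limit=20))
```

Under these assumptions, roughly half of all new connections are dropped
before the client can send anything, for as long as the server stays at
its limit.
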
The immediate drop causes the client to back off for 60 seconds.
I'm not sure why this effect wasn't visible with 2.6.4, but it seems
2.6.4 used a lower timeout (REESTABLISH_TIMEOUT = 15sec) when the
connection was refused or dropped instantly, so the effect may have
been less noticeable.

I'm not sure if it's a good idea to be more aggressive about
reconnecting, but I think the client should at least log a message to
syslog that a connection attempt failed. Likewise, the server should
probably log a message when it finds it's dropping too many TCP
connections.

Finally, I think the way nfsd drops connections is bad. Dropping the
most recent connection doesn't prevent DoS, and as this example
demonstrates, it does unexpected things to your clients.

Olaf
-- 
Olaf Kirch     |  The Hardware Gods hate me.
okir@suse.de   |
---------------+
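
As a back-of-the-envelope check on the 60-second and 15-second figures in
the message, the sketch below estimates the mean wait before a connection
survives. It assumes that every attempt is dropped independently with the
same probability and that the back-off stays fixed after each failure.
Neither assumption is stated in the message, and the function name
expected_reconnect_delay is made up for this example.

```python
def expected_reconnect_delay(backoff_seconds, drop_probability):
    """Mean time spent backing off before the first surviving connection.

    Assumes independent attempts and a fixed back-off per failure.
    """
    if not 0 <= drop_probability < 1:
        raise ValueError("drop_probability must be in [0, 1)")
    # Mean number of failures before the first success (geometric).
    expected_failures = drop_probability / (1 - drop_probability)
    return backoff_seconds * expected_failures


if __name__ == "__main__":
    for backoff in (60, 15):
        delay = expected_reconnect_delay(backoff, 0.5)
        print(f"{backoff} s back-off: about {delay:.0f} s mean wait")
```

With a 50% drop rate the mean number of failed attempts is one, so the
mean wait equals one back-off interval: about 60 seconds with the new
code against about 15 seconds with the old timeout.
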