From mboxrd@z Thu Jan 1 00:00:00 1970
From: "James Nichols"
Subject: Re: After many hours all outbound connections get stuck in SYN_SENT
Date: Thu, 20 Dec 2007 11:37:39 -0500
Message-ID: <83a51e120712200837p9e3d1a4g15b5f4763597073e@mail.gmail.com>
References: <83a51e120712141239u52d2dd68p1b6ee7ed08f2cecf@mail.gmail.com>
 <83a51e120712181021p4c4c2a13g8820271f1e00361b@mail.gmail.com>
 <4768123A.7040603@cosmosbay.com>
 <83a51e120712181144l65633b32r72cc369f9d012f47@mail.gmail.com>
 <47682F8C.20205@cosmosbay.com>
 <83a51e120712190853q33d9c7c1t4a46380665b7538b@mail.gmail.com>
 <47694FCC.1020507@cosmosbay.com>
 <83a51e120712190943m3bf0e2e4v2ea6b660142e9a5a@mail.gmail.com>
 <1198161695.6154.47.camel@andromache>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: "Jan Engelhardt", "Eric Dumazet", linux-kernel@vger.kernel.org,
 "Linux Netdev List"
To: "Glen Turner"
Return-path:
Received: from fg-out-1718.google.com ([72.14.220.157]:27483 "EHLO
 fg-out-1718.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1761000AbXLTQhl (ORCPT ); Thu, 20 Dec 2007 11:37:41 -0500
Received: by fg-out-1718.google.com with SMTP id e21so608051fga.17 for ;
 Thu, 20 Dec 2007 08:37:40 -0800 (PST)
In-Reply-To: <1198161695.6154.47.camel@andromache>
Content-Disposition: inline
Sender: netdev-owner@vger.kernel.org
List-ID:

> But I'd be very surprised if the router is acting as anything more
> than a network-layer device. It might perhaps have some soft connection
> state being used for generating accounting records. Being Cisco
> it's probably a switch-router, so it might carry some per-port hard
> state for validating source IP addresses and ARPs on each port.
>
> The firewall is much more likely to be carrying per-flow Sack
> state. The Cisco PIX had a bug with SACK handling (CSCse14419,
> fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141) but perhaps it
> has regressed). A simple trace either side of the firewall will
> show the inconsistency between the TCP sequence number (which
> gets randomised) and the Sack sequence number (which didn't).
> You could disable the TCP Sequence Number Randomisation feature
> and see if the fault reoccurs.

I do have TCP Sequence # Randomization enabled on my router. However,
if this were causing an issue, wouldn't it always occur and cause
connection issues, not just after 38 hours of correct operation? I can
look into turning this off, but I'll likely have to jump through
several hoops, which will be challenging if I don't have a very clear,
definitive reason why this is causing the issue. Plus, I've had this
problem with at least 2 other sets of network switches over the past
4 years.

I'm actually running 7.0(6), which doesn't have the fix you mentioned.
If it really is possible that this issue wouldn't always cause
problems, but only after hours of successful operation, then I could
probably motivate the upgrade.

I can try to set up a trace, but this is a lot of work for other
people in my organization, so it will take quite some time.

> You probably should also investigate the Linux kernel,
> especially the size and locks of the components of the Sack data
> structures and what happens to those data structures after Sack is
> disabled (presumably the Sack data structure is in some unhappy
> circumstance, and disabling Sack allows the data to be discarded,
> magically unclogging the box).
>
> In the absence of the reporter wanting to dump the kernel's
> core, how about a patch to print the Sack datastructure when
> the command to disable Sack is received by the kernel?
> Maybe just print the last 16b of the IP address?

Given the fact that I've had this problem for so long, over a variety
of networking hardware vendors and colo facilities, this really sounds
good to me. It will be challenging for me to justify a kernel core
dump, but a simple patch to dump the Sack data would be doable.
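
For reference, the trace comparison suggested in the quoted text could
be sketched roughly as below. This is only an illustration, not a tool
anyone in the thread has used: the function name
sack_blocks_consistent and all the numbers are made up, and it assumes
you have already pulled the ACK number and SACK block edges for the
same segment out of a capture taken on each side of the firewall. The
idea is that a firewall which randomises sequence numbers shifts them
by a fixed per-connection offset (mod 2^32), so the SACK edges should
be shifted by that same offset; if they are not, the two number spaces
disagree.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def sack_blocks_consistent(ack_inside, ack_outside, sack_inside, sack_outside):
    """Return True if the SACK edges were shifted by the same offset as the ACK.

    ack_inside / ack_outside: ACK number of one segment as captured on the
    inside and outside of the firewall (hypothetical inputs from two traces).
    sack_inside / sack_outside: lists of (left_edge, right_edge) tuples taken
    from the SACK option of that same segment on each side.
    """
    # Offset the firewall applied to the sequence space for this flow.
    offset = (ack_outside - ack_inside) % MOD
    if len(sack_inside) != len(sack_outside):
        return False
    for (left_in, right_in), (left_out, right_out) in zip(sack_inside, sack_outside):
        # Every SACK edge must differ by exactly the same offset.
        if (left_out - left_in) % MOD != offset:
            return False
        if (right_out - right_in) % MOD != offset:
            return False
    return True


if __name__ == "__main__":
    # Made-up example values: the firewall shifts this flow by 0x12345678.
    ack_in = 1000
    ack_out = (ack_in + 0x12345678) % MOD
    sack_out = [((ack_out + 1460) % MOD, (ack_out + 2920) % MOD)]

    translated = [(2460, 3920)]   # SACK edges rewritten along with the ACK
    untranslated = list(sack_out)  # SACK edges passed through unchanged

    print(sack_blocks_consistent(ack_in, ack_out, translated, sack_out))    # True
    print(sack_blocks_consistent(ack_in, ack_out, untranslated, sack_out))  # False
```

A False result for a segment would point at the kind of mismatch
described above; a True result on every SACK-bearing segment would
suggest the firewall is rewriting the option correctly, at least for
the flows captured.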