From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ben Greear
Subject: Re: CIFS endless console spammage in 2.6.38.7
Date: Wed, 01 Jun 2011 12:17:21 -0700
Message-ID: <4DE69041.5070802@candelatech.com>
References: <4DE5385C.1030808@candelatech.com>
 <4DE54561.1090906@candelatech.com>
 <20110531164408.178eeebf@tlielax.poochiereds.net>
 <4DE55537.5040705@candelatech.com>
 <20110601140139.079287da@tlielax.poochiereds.net>
 <4DE67FFE.3040907@candelatech.com>
 <20110601150621.7b465941@tlielax.poochiereds.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Steve French , linux-cifs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Jeff Layton
Return-path:
In-Reply-To: <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
Sender: linux-cifs-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:

On 06/01/2011 12:06 PM, Jeff Layton wrote:
> On Wed, 01 Jun 2011 11:07:58 -0700
> Ben Greear wrote:
>
>> On 06/01/2011 11:01 AM, Jeff Layton wrote:
>>> On Tue, 31 May 2011 15:54:36 -0500
>>> Steve French wrote:
>>>
>>>> we will have more info when run with he quick and dirty modified logging
>>>>
>>>
>>> I'm not sure what that is, but what may be helpful is to launch a
>>> kernel debugger when this happens, track down the TCP_Server_Info and
>>> see what the state of the socket that hangs off of it is. If it's a
>>> NULL pointer or an already-closed socket, then that may help point the
>>> way to the root cause.
>>
>> We put in some WARN_ON calls to get stack traces, and some other
>> connection related logging. We should get a WARN_ON if the socket is NULL.
>>
>> We were not able to reproduce the problem last night..the file servers did
>> screw up, but the CIFS clients acted normally.
>>
>
> Based on no real evidence at all and just a gut-feeling, I suspect that:
>
> 1) this is a long-standing bug
>
> ...and...
>
> 2) it's a race condition
>
> ...though it may be that recent changes have changed the timing enough
> to make it more likely (hard to say until we understand the problem
> better).
>
> Have you seen this happen more than once?

I think so...but we are also testing iscsi and NFS failover concurrently,
and for a while other instability was making it difficult to determine
exactly what killed things (seems we had a bad HD that would often fail
about the time iscsi did...thought it was software bug for a while, but
after replacing the HD it's been running better.)

We're going to crank up another machine with 100+ cifs mounts and see
if that helps reproduce the bug faster. Current test is 20 IO threads,
but only a single mount.

Thanks,
Ben

-- 
Ben Greear
Candela Technologies Inc http://www.candelatech.com