CIFS endless console spammage in 2.6.38.7

All of lore.kernel.org
 help / color / mirror / Atom feed

* CIFS endless console spammage in 2.6.38.7
@ 2011-05-31 18:50 Ben Greear
       [not found] ` <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-05-31 18:50 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

Kernel is somewhat hacked, but no changes to CIFS.

While doing failover testing, we managed to get the cifs client
spewing endless serial console spammage.  We can ping the system, but
otherwise cannot seem to interact with it.  I tried serial-console sysrq
commands (blind, spewage makes it impossible to see any real results) to
turn logging to 0, but that didn't help (yet..going to let it run in case
there is just a huge backlog of messages).

The file-server cluster is in a bad state, but still not excuse
for the clients machine to become useless.

The spewage is at least primarily:

CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88

Seems -88 probably means -ENOTSOCK.

At the least, perhaps the cERROR() messages
should be rate limitted?

This one is hard and slow to reproduce, but we'll
keep testing..and will try pertinent patches if someone
has some suggestions.

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found] ` <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-05-31 19:36   ` Steve French
       [not found]     ` <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Steve French @ 2011-05-31 19:36 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

This is on setting up a session, so could be something like:
- mount
- do write
- server crash
- attempt to reconnect
- socket returns ENOSOCK
- attempt to reconnect ...
- repeat

Is this repeatable enough that we could modify the client to stop on
the reconnect to see what is causing the socket to go bad and which
operation we are repeating the reconnect on.



On Tue, May 31, 2011 at 1:50 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
> Kernel is somewhat hacked, but no changes to CIFS.
>
>
> While doing failover testing, we managed to get the cifs client
> spewing endless serial console spammage.  We can ping the system, but
> otherwise cannot seem to interact with it.  I tried serial-console sysrq
> commands (blind, spewage makes it impossible to see any real results) to
> turn logging to 0, but that didn't help (yet..going to let it run in case
> there is just a huge backlog of messages).
>
> The file-server cluster is in a bad state, but still not excuse
> for the clients machine to become useless.
>
> The spewage is at least primarily:
>
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
> CIFS VFS: Send error in SessSetup = -88
>
> Seems -88 probably means -ENOTSOCK.
>
> At the least, perhaps the cERROR() messages
> should be rate limitted?
>
> This one is hard and slow to reproduce, but we'll
> keep testing..and will try pertinent patches if someone
> has some suggestions.
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
> Candela Technologies Inc  http://www.candelatech.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]     ` <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-05-31 19:45       ` Ben Greear
       [not found]         ` <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-05-31 19:45 UTC (permalink / raw)
  To: Steve French; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 05/31/2011 12:36 PM, Steve French wrote:
> This is on setting up a session, so could be something like:
> - mount
> - do write
> - server crash
> - attempt to reconnect
> - socket returns ENOSOCK
> - attempt to reconnect ...
> - repeat
>
> Is this repeatable enough that we could modify the client to stop on
> the reconnect to see what is causing the socket to go bad and which
> operation we are repeating the reconnect on.

Well, ENOTSOCK sounds like a pretty serious coding problem.  Maybe
a use-after-close or something?

At the least, we could look for some particular errors (such as ENOTSOCK)
and print more info and do a more thorough job of cleaning up.

Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well?

Seems we can reproduce this only when our open-filer HA system
craps itself during failover, but we can get that to happen usually
within hours, sometimes maybe about a day.  And, CIFS errors don't always
happen when the HA cluster goes bad.

So, I'm happy to test patches, but since it's a bit tricky to
reproduce this...I'm hoping to get the best info possible with
each patch iteration!

Thanks,
Ben

>
>
>
> On Tue, May 31, 2011 at 1:50 PM, Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>> Kernel is somewhat hacked, but no changes to CIFS.
>>
>>
>> While doing failover testing, we managed to get the cifs client
>> spewing endless serial console spammage.  We can ping the system, but
>> otherwise cannot seem to interact with it.  I tried serial-console sysrq
>> commands (blind, spewage makes it impossible to see any real results) to
>> turn logging to 0, but that didn't help (yet..going to let it run in case
>> there is just a huge backlog of messages).
>>
>> The file-server cluster is in a bad state, but still not excuse
>> for the clients machine to become useless.
>>
>> The spewage is at least primarily:
>>
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>> CIFS VFS: Send error in SessSetup = -88
>>
>> Seems -88 probably means -ENOTSOCK.
>>
>> At the least, perhaps the cERROR() messages
>> should be rate limitted?
>>
>> This one is hard and slow to reproduce, but we'll
>> keep testing..and will try pertinent patches if someone
>> has some suggestions.
>>
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
>> Candela Technologies Inc  http://www.candelatech.com
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
>


-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]         ` <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-05-31 20:44           ` Jeff Layton
       [not found]             ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2011-05-31 20:44 UTC (permalink / raw)
  To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Tue, 31 May 2011 12:45:37 -0700
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:

> On 05/31/2011 12:36 PM, Steve French wrote:
> > This is on setting up a session, so could be something like:
> > - mount
> > - do write
> > - server crash
> > - attempt to reconnect
> > - socket returns ENOSOCK
> > - attempt to reconnect ...
> > - repeat
> >
> > Is this repeatable enough that we could modify the client to stop on
> > the reconnect to see what is causing the socket to go bad and which
> > operation we are repeating the reconnect on.
> 
> Well, ENOTSOCK sounds like a pretty serious coding problem.  Maybe
> a use-after-close or something?
> 
> At the least, we could look for some particular errors (such as ENOTSOCK)
> and print more info and do a more thorough job of cleaning up.
> 
> Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well?
> 
> Seems we can reproduce this only when our open-filer HA system
> craps itself during failover, but we can get that to happen usually
> within hours, sometimes maybe about a day.  And, CIFS errors don't always
> happen when the HA cluster goes bad.
> 
> So, I'm happy to test patches, but since it's a bit tricky to
> reproduce this...I'm hoping to get the best info possible with
> each patch iteration!
> 

I had a report of a similar problem on a RHEL5 (2.6.18) kernel:

    https://bugzilla.redhat.com/show_bug.cgi?id=704921

In this case, it caused an oops as well. Your problem may or may not be
the same, but if it is, I suspect that the root cause is a lack of
clear locking rules for the TCP_Server_Info->tcpStatus.

What I think happened in that case was that the client was in the
middle of a NEGOTIATE request and got a response, and another reconnect
occurred while it was processing it. While the client was tearing down
and creating a new socket, the thread that issued the NEGOTIATE on the
previous socket marked the tcpStatus as CifsGood.

Fixing it looks to be anything but trivial. I'm not even quite sure how
to approach it at this point. Suggestions welcome.

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]             ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-05-31 20:51               ` Steve French
       [not found]                 ` <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-05-31 20:51               ` Ben Greear
  1 sibling, 1 reply; 23+ messages in thread
From: Steve French @ 2011-05-31 20:51 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Ben Greear, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Tue, May 31, 2011 at 3:44 PM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, 31 May 2011 12:45:37 -0700
> Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
>
>> On 05/31/2011 12:36 PM, Steve French wrote:
>> > This is on setting up a session, so could be something like:
>> > - mount
>> > - do write
>> > - server crash
>> > - attempt to reconnect
>> > - socket returns ENOSOCK
>> > - attempt to reconnect ...
>> > - repeat
>> >
>> > Is this repeatable enough that we could modify the client to stop on
>> > the reconnect to see what is causing the socket to go bad and which
>> > operation we are repeating the reconnect on.
>>
>> Well, ENOTSOCK sounds like a pretty serious coding problem.  Maybe
>> a use-after-close or something?
>>
>> At the least, we could look for some particular errors (such as ENOTSOCK)
>> and print more info and do a more thorough job of cleaning up.
>>
>> Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well?
>>
>> Seems we can reproduce this only when our open-filer HA system
>> craps itself during failover, but we can get that to happen usually
>> within hours, sometimes maybe about a day.  And, CIFS errors don't always
>> happen when the HA cluster goes bad.
>>
>> So, I'm happy to test patches, but since it's a bit tricky to
>> reproduce this...I'm hoping to get the best info possible with
>> each patch iteration!
>>
>
> I had a report of a similar problem on a RHEL5 (2.6.18) kernel:
>
>    https://bugzilla.redhat.com/show_bug.cgi?id=704921
>
> In this case, it caused an oops as well. Your problem may or may not be
> the same, but if it is, I suspect that the root cause is a lack of
> clear locking rules for the TCP_Server_Info->tcpStatus.
>
> What I think happened in that case was that the client was in the
> middle of a NEGOTIATE request and got a response, and another reconnect
> occurred while it was processing it. While the client was tearing down
> and creating a new socket, the thread that issued the NEGOTIATE on the
> previous socket marked the tcpStatus as CifsGood.
>
> Fixing it looks to be anything but trivial. I'm not even quite sure how
> to approach it at this point. Suggestions welcome.

I thought the kernel was more recent than that - how recent is the kernel here?

I think something related to cifs_sendv returning ENOTSOCK immediately
when not reconnected could be related.



-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                 ` <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-05-31 20:53                   ` Ben Greear
       [not found]                     ` <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-05-31 20:53 UTC (permalink / raw)
  To: Steve French; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 05/31/2011 01:51 PM, Steve French wrote:
> On Tue, May 31, 2011 at 3:44 PM, Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>  wrote:

>> Fixing it looks to be anything but trivial. I'm not even quite sure how
>> to approach it at this point. Suggestions welcome.
>
> I thought the kernel was more recent than that - how recent is the kernel here?
>
> I think something related to cifs_sendv returning ENOTSOCK immediately
> when not reconnected could be related.

My kernel is 2.6.38.7, quite recent.  We're using the bind-to-local-IP
features too, but not sure that matters.

Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                     ` <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-05-31 20:54                       ` Steve French
       [not found]                         ` <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Steve French @ 2011-05-31 20:54 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA

we will have more info when run with he quick and dirty modified logging

On Tue, May 31, 2011 at 3:53 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
> On 05/31/2011 01:51 PM, Steve French wrote:
>>
>> On Tue, May 31, 2011 at 3:44 PM, Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>  wrote:
>
>>> Fixing it looks to be anything but trivial. I'm not even quite sure how
>>> to approach it at this point. Suggestions welcome.
>>
>> I thought the kernel was more recent than that - how recent is the kernel
>> here?
>>
>> I think something related to cifs_sendv returning ENOTSOCK immediately
>> when not reconnected could be related.
>
> My kernel is 2.6.38.7, quite recent.  We're using the bind-to-local-IP
> features too, but not sure that matters.
>
> Ben
>
> --
> Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
> Candela Technologies Inc  http://www.candelatech.com
>
>



-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                         ` <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-06-01 18:01                           ` Jeff Layton
       [not found]                             ` <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2011-06-01 18:01 UTC (permalink / raw)
  To: Steve French; +Cc: Ben Greear, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Tue, 31 May 2011 15:54:36 -0500
Steve French <smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> we will have more info when run with he quick and dirty modified logging
> 

I'm not sure what that is, but what may be helpful is to launch a
kernel debugger when this happens, track down the TCP_Server_Info and
see what the state of the socket that hangs off of it is. If it's a
NULL pointer or an already-closed socket, then that may help point the
way to the root cause.

> On Tue, May 31, 2011 at 3:53 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
> > On 05/31/2011 01:51 PM, Steve French wrote:
> >>
> >> On Tue, May 31, 2011 at 3:44 PM, Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>  wrote:
> >
> >>> Fixing it looks to be anything but trivial. I'm not even quite sure how
> >>> to approach it at this point. Suggestions welcome.
> >>
> >> I thought the kernel was more recent than that - how recent is the kernel
> >> here?
> >>
> >> I think something related to cifs_sendv returning ENOTSOCK immediately
> >> when not reconnected could be related.
> >
> > My kernel is 2.6.38.7, quite recent.  We're using the bind-to-local-IP
> > features too, but not sure that matters.
> >
> > Ben
> >
> > --
> > Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
> > Candela Technologies Inc  http://www.candelatech.com
> >
> >
> 
> 
> 


-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                             ` <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-06-01 18:07                               ` Ben Greear
       [not found]                                 ` <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-06-01 18:07 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 06/01/2011 11:01 AM, Jeff Layton wrote:
> On Tue, 31 May 2011 15:54:36 -0500
> Steve French<smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>  wrote:
>
>> we will have more info when run with he quick and dirty modified logging
>>
>
> I'm not sure what that is, but what may be helpful is to launch a
> kernel debugger when this happens, track down the TCP_Server_Info and
> see what the state of the socket that hangs off of it is. If it's a
> NULL pointer or an already-closed socket, then that may help point the
> way to the root cause.

We put in some WARN_ON calls to get stack traces, and some other
connection related logging.  We should get a WARN_ON if the socket is NULL.

We were not able to reproduce the problem last night..the file servers did
screw up, but the CIFS clients acted normally.

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                 ` <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-06-01 19:06                                   ` Jeff Layton
       [not found]                                     ` <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2011-06-01 19:06 UTC (permalink / raw)
  To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Wed, 01 Jun 2011 11:07:58 -0700
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:

> On 06/01/2011 11:01 AM, Jeff Layton wrote:
> > On Tue, 31 May 2011 15:54:36 -0500
> > Steve French<smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>  wrote:
> >
> >> we will have more info when run with he quick and dirty modified logging
> >>
> >
> > I'm not sure what that is, but what may be helpful is to launch a
> > kernel debugger when this happens, track down the TCP_Server_Info and
> > see what the state of the socket that hangs off of it is. If it's a
> > NULL pointer or an already-closed socket, then that may help point the
> > way to the root cause.
> 
> We put in some WARN_ON calls to get stack traces, and some other
> connection related logging.  We should get a WARN_ON if the socket is NULL.
> 
> We were not able to reproduce the problem last night..the file servers did
> screw up, but the CIFS clients acted normally.
> 

Based on no real evidence at all and just a gut-feeling, I suspect that: 

1) this is a long-standing bug

...and...

2) it's a race condition

...though it may be that recent changes have changed the timing enough
to make it more likely (hard to say until we understand the problem
better).

Have you seen this happen more than once?

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                     ` <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-06-01 19:17                                       ` Ben Greear
       [not found]                                         ` <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-06-01 19:17 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 06/01/2011 12:06 PM, Jeff Layton wrote:
> On Wed, 01 Jun 2011 11:07:58 -0700
> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>
>> On 06/01/2011 11:01 AM, Jeff Layton wrote:
>>> On Tue, 31 May 2011 15:54:36 -0500
>>> Steve French<smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>   wrote:
>>>
>>>> we will have more info when run with he quick and dirty modified logging
>>>>
>>>
>>> I'm not sure what that is, but what may be helpful is to launch a
>>> kernel debugger when this happens, track down the TCP_Server_Info and
>>> see what the state of the socket that hangs off of it is. If it's a
>>> NULL pointer or an already-closed socket, then that may help point the
>>> way to the root cause.
>>
>> We put in some WARN_ON calls to get stack traces, and some other
>> connection related logging.  We should get a WARN_ON if the socket is NULL.
>>
>> We were not able to reproduce the problem last night..the file servers did
>> screw up, but the CIFS clients acted normally.
>>
>
> Based on no real evidence at all and just a gut-feeling, I suspect that:
>
> 1) this is a long-standing bug
>
> ...and...
>
> 2) it's a race condition
>
> ...though it may be that recent changes have changed the timing enough
> to make it more likely (hard to say until we understand the problem
> better).
>
> Have you seen this happen more than once?

I think so...but we are also testing iscsi and NFS failover concurrently,
and for a while other instability was making it difficult to determine
exactly what killed things (seems we had a bad HD that would often
fail about the time iscsi did...thought it was software bug for a while,
but after replacing the HD it's been running better.)

We're going to crank up another machine with 100+ cifs mounts
and see if that helps reproduce the bug faster.  Current test
is 20 IO threads, but only a single mount.

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                         ` <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-06-03 21:01                                           ` Ben Greear
       [not found]                                             ` <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-06-03 21:01 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

Ok, we had some luck.  Here's the backtrace and attending dmesg
output.  The filer has been doing failover, but it has not gone
into a failed state...so, the system *should* be able to reconnect.

We have the system in the failed state now and will leave it that way
for a bit in case you have some commands you'd like me to run.

Aside from the hung cifs processes (anything accessing those mounts
gets into the D state), the system seems fine.


CIFS VFS: Unexpected lookup error -112
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Unexpected lookup error -11
CIFS VFS: Unexpected lookup error -112
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Unexpected lookup error -112
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Unexpected lookup error -11
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: need to reconnect in sendv here
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
BUG: unable to handle kernel
Hardware name: X8ST3
NULL pointer dereference
Modules linked in: at 0000000000000020
  be2iscsi
IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
  bnx2iPGD 0  cnic
  uio
Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
  mdio
last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
  ib_iserCPU 2  rdma_cm
Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY 
ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY 
nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc 
i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
  libiscsi
  scsi_transport_iscsi
Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
  auth_rpcgss
RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
  sunrpc
RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
  ipv6
RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
  uinput
RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
  i2c_i801
RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
  e1000e
R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
  i2c_core
FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
  igb
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  ioatdma
CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
  iTCO_wdt
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  i7core_edac
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
  pcspkr
Stack:
  dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8

  ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
  0000000000000004Call Trace:
  ffff8802e64e5c30 ffffffff8135792c
  0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
  ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17

Call Trace:
  [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
  [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
  [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
  [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
  [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
  [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
  [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
  [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
  [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
  [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
  [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
  [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
  [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
  [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
  [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
  [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
  [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
  [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
  [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
  [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
  [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
  [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
  [<ffffffff8103838e>] ? need_resched+0x1e/0x28
  [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
  [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
  [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
  [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
  [<ffffffff8105c3bf>] kthread+0x7d/0x85
  [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
  [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
  [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
  [<ffffffff8105c342>] ? kthread+0x0/0x85
  [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
  [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
28 5b
---[ end trace 3387e7bab0a9c645 ]---
c9 c3 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 20 4c 8b 67 60 <48> 8b 7e 20 48 89 55 e0 48 89 4d d8 48 89 75 e8 44 89 45 d0 e8
RIP  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
  RSP <ffff8802e64e5bc0>
CR2: 0000000000000020
CIFS VFS: need to reconnect in sendv here
CIFS VFS: need to reconnect in sendv here
CIFS VFS: need to reconnect in sendv here
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]()
Hardware name: X8ST3
Modules linked in: be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr md4 nls_utf8 
cifs xt_TPROXY nf_tproxy_core xt_socket ip6_tables
CIFS VFS: need to reconnect in sendv here
  nf_defrag_ipv6 xt_connlimit 8021q
CIFS VFS: need to reconnect in sendv here
------------[ cut here ]------------
  garp
------------[ cut here ]------------
  bridge
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]()
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]()
Hardware name: X8ST3
Hardware name: X8ST3
Modules linked in: stp
Modules linked in: be2iscsi
CIFS VFS: need to reconnect in sendv here
  llc iscsi_boot_sysfs fuse be2iscsi macvlan
------------[ cut here ]------------
  bnx2i wanlink(P) iscsi_boot_sysfs cnic pktgen
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]()
  bnx2i uio
Hardware name: X8ST3
  cxgb3i cnic libcxgbi iscsi_tcp uio cxgb3 cxgb3i
Modules linked in: libcxgbi mdio cxgb3 ib_iser be2iscsi mdio iscsi_boot_sysfs ib_iser rdma_cm rdma_cm bnx2i ib_cm cnic ib_cm iw_cm iw_cm ib_sa ib_sa ib_mad uio 
ib_mad cxgb3i ib_core ib_core ib_addr ib_addr libcxgbi md4 md4 nls_utf8 cifs cxgb3 xt_TPROXY mdio nf_tproxy_core nls_utf8 xt_socket ib_iser ip6_tables cifs 
rdma_cm xt_TPROXY nf_defrag_ipv6 ib_cm nf_tproxy_core libiscsi_tcp iw_cm libiscsi xt_connlimit xt_socket ib_sa scsi_transport_iscsi ip6_tables nfs 8021q ib_mad 
garp lockd bridge fscache ib_core nfs_acl nf_defrag_ipv6 auth_rpcgss xt_connlimit ib_addr stp md4 sunrpc llc nls_utf8 fuse ipv6 macvlan uinput wanlink(P) 
i2c_i801 cifs pktgen e1000e 8021q iscsi_tcp garp libiscsi_tcp xt_TPROXY libiscsi nf_tproxy_core bridge scsi_transport_iscsi xt_socket stp i2c_core llc 
ip6_tables igb fuse ioatdma nfs macvlan nf_defrag_ipv6 lockd iTCO_wdt fscache wanlink(P) nfs_acl pktgen xt_connlimit auth_rpcgss iscsi_tcp sunrpc 8021q 
libiscsi_tcp i7core_edac libiscsi ipv6 garp iTCO_vendor_support scsi_transport_iscsi uinput nfs i2c_i801 pcspkr e1000e lockd i2c_core fscache dca igb nfs_acl 
edac_core bridge ioatdma stp microcode iTCO_wdt [last unloaded: ipt_addrtype] llc
  auth_rpcgss i7core_edac iTCO_vendor_supportPid: 4754, comm: btserver Tainted: P        W   2.6.38.8+ #12
  pcspkr sunrpc dca edac_coreCall Trace:
  ipv6 fuse uinput microcode [last unloaded: ipt_addrtype] [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
  i2c_i801
  macvlan [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
Pid: 4734, comm: btserver Tainted: P        W   2.6.38.8+ #12
Call Trace:
  [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs]
  e1000e wanlink(P) [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
  pktgen [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
  i2c_core iscsi_tcp [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs]
  libiscsi_tcp [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs]
  igb ioatdma [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs]
  libiscsi iTCO_wdt scsi_transport_iscsi [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
  nfs [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
  i7core_edac lockd [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
  fscache [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
  iTCO_vendor_support nfs_acl auth_rpcgss sunrpc [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
  pcspkr [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
  [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
  ipv6 dca [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
  uinput [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
  i2c_i801 edac_core [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
  e1000e microcode [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
  [last unloaded: ipt_addrtype] i2c_core igb [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50

  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
Pid: 7943, comm: btserver Tainted: P        W   2.6.38.8+ #12
  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
Call Trace:
  ioatdma [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
  iTCO_wdt i7core_edac [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
  [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
---[ end trace 3387e7bab0a9c646 ]---
  iTCO_vendor_support [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
  pcspkr
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
  dca edac_core [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs]
  microcode [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
  [last unloaded: ipt_addrtype] [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs]

  [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
Pid: 4740, comm: btserver Tainted: P        W   2.6.38.8+ #12
  [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
Call Trace:
---[ end trace 3387e7bab0a9c647 ]---
  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
  [<ffffffffa031c2e6>] ? CIFSSMBRead+0x9a/0x277 [cifs]
  [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
---[ end trace 3387e7bab0a9c648 ]---
  [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
  [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
  [<ffffffffa0329271>] ? cifs_readpage_worker+0x1d6/0x319 [cifs]
  [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs]
CIFS VFS: need to reconnect in sendv here
  [<ffffffffa03295a6>] ? cifs_readpage+0xb3/0xfd [cifs]
  [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs]
  [<ffffffff810a5247>] ? generic_file_aio_read+0x468/0x5d1
  [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
  [<ffffffff810a4beb>] ? generic_file_aio_write+0x83/0xa1
  [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
  [<ffffffff810ea81e>] ? do_sync_read+0xc6/0x103
  [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
  [<ffffffff811a6871>] ? fsnotify_perm+0x61/0x6d
  [<ffffffff811a68d4>] ? security_file_permission+0x29/0x2e
  [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
  [<ffffffff810eb2b5>] ? vfs_read+0xa6/0x102
  [<ffffffff810a4b33>] ? __generic_file_aio_write+0x23d/0x272
  [<ffffffff810eb3ca>] ? sys_read+0x45/0x6c
  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
---[ end trace 3387e7bab0a9c649 ]---
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
  [<ffffffffa03150c5>] ? cifs_file_aio_write+0x2d/0x5c [cifs]
CIFS VFS: need to reconnect in sendv here
  [<ffffffff810ea71b>] ? do_sync_write+0xc6/0x103
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
  [<ffffffff811a68d4>] ? security_file_permission+0x29/0x2e
  [<ffffffff810eb0d3>] ? vfs_write+0xa9/0x105
  [<ffffffff810eb1e8>] ? sys_write+0x45/0x6c
  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
---[ end trace 3387e7bab0a9c64a ]---
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
cifs_setup_session: 127 callbacks suppressed
CIFS VFS: need to reconnect in sendv here
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: Unexpected lookup error -88
CIFS VFS: Unexpected lookup error -88
CIFS VFS: need to reconnect in sendv here
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: need to reconnect in sendv here
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: Unexpected lookup error -88
CIFS VFS: need to reconnect in sendv here
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: need to reconnect in sendv here
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.
CIFS VFS: need to reconnect in sendv here
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds.


-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                             ` <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-06-04  1:42                                               ` Jeff Layton
       [not found]                                                 ` <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2011-06-04  1:42 UTC (permalink / raw)
  To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, 03 Jun 2011 14:01:11 -0700
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:

> Ok, we had some luck.  Here's the backtrace and attending dmesg
> output.  The filer has been doing failover, but it has not gone
> into a failed state...so, the system *should* be able to reconnect.
> 
> We have the system in the failed state now and will leave it that way
> for a bit in case you have some commands you'd like me to run.
> 
> Aside from the hung cifs processes (anything accessing those mounts
> gets into the D state), the system seems fine.
> 
> 
> CIFS VFS: Unexpected lookup error -112
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Unexpected lookup error -11
> CIFS VFS: Unexpected lookup error -112
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Unexpected lookup error -112
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Unexpected lookup error -11
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: Reconnecting tcp session
> CIFS VFS: need to reconnect in sendv here
> ------------[ cut here ]------------
> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> BUG: unable to handle kernel
> Hardware name: X8ST3
> NULL pointer dereference
> Modules linked in: at 0000000000000020
>   be2iscsi
> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>   bnx2iPGD 0  cnic
>   uio
> Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
>   mdio
> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
>   ib_iserCPU 2  rdma_cm
> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY 
> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY 
> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi 
> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc 
> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
>   libiscsi
>   scsi_transport_iscsi
> Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
>   auth_rpcgss
> RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>   sunrpc
> RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
>   ipv6
> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
>   uinput
> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
>   i2c_i801
> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
>   e1000e
> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
>   i2c_core
> FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
>   igb
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>   ioatdma
> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
>   iTCO_wdt
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   i7core_edac
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>   iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
>   pcspkr
> Stack:
>   dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
> 
>   ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
>   0000000000000004Call Trace:
>   ffff8802e64e5c30 ffffffff8135792c
>   0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
>   ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
> 
> Call Trace:
>   [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
>   [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
>   [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
>   [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
>   [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>   [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
>   [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>   [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
>   [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
>   [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
>   [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>   [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
>   [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
>   [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
>   [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
>   [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
>   [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
>   [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
>   [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
>   [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
>   [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>   [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
>   [<ffffffff8103838e>] ? need_resched+0x1e/0x28
>   [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
>   [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>   [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
>   [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>   [<ffffffff8105c3bf>] kthread+0x7d/0x85
>   [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
>   [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
>   [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
>   [<ffffffff8105c342>] ? kthread+0x0/0x85
>   [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
>   [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
> Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
> 48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
> 55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
> 8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
> e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
> a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
> 00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
> 48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
> 28 5b
> ---[ end trace 3387e7bab0a9c645 ]---

Kaboom. So you're seeing oopses too. Could you get a listing of the
place where it oopsed by following the instructions here?

http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses

I suspect that "sock" is NULL in this case too and it blew up in
kernel_recvmsg.

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                 ` <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
@ 2011-06-04  5:03                                                   ` Ben Greear
       [not found]                                                     ` <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-06-04  5:03 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 06/03/2011 06:42 PM, Jeff Layton wrote:
> On Fri, 03 Jun 2011 14:01:11 -0700
> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>
>> Ok, we had some luck.  Here's the backtrace and attending dmesg
>> output.  The filer has been doing failover, but it has not gone
>> into a failed state...so, the system *should* be able to reconnect.
>>
>> We have the system in the failed state now and will leave it that way
>> for a bit in case you have some commands you'd like me to run.
>>
>> Aside from the hung cifs processes (anything accessing those mounts
>> gets into the D state), the system seems fine.
>>
>>
>> CIFS VFS: Unexpected lookup error -112
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Unexpected lookup error -11
>> CIFS VFS: Unexpected lookup error -112
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Unexpected lookup error -112
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Unexpected lookup error -11
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: Reconnecting tcp session
>> CIFS VFS: need to reconnect in sendv here
>> ------------[ cut here ]------------
>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
>> BUG: unable to handle kernel
>> Hardware name: X8ST3
>> NULL pointer dereference
>> Modules linked in: at 0000000000000020
>>    be2iscsi
>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>>    bnx2iPGD 0  cnic
>>    uio
>> Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
>>    mdio
>> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
>>    ib_iserCPU 2  rdma_cm
>> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY
>> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY
>> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi
>> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc
>> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
>>    libiscsi
>>    scsi_transport_iscsi
>> Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
>>    auth_rpcgss
>> RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>>    sunrpc
>> RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
>>    ipv6
>> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
>>    uinput
>> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
>>    i2c_i801
>> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
>> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
>>    e1000e
>> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
>>    i2c_core
>> FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
>>    igb
>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>    ioatdma
>> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
>>    iTCO_wdt
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>    i7core_edac
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>    iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
>>    pcspkr
>> Stack:
>>    dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
>>
>>    ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
>>    0000000000000004Call Trace:
>>    ffff8802e64e5c30 ffffffff8135792c
>>    0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
>>    ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
>>
>> Call Trace:
>>    [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
>>    [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
>>    [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
>>    [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
>>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>    [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
>>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>    [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
>>    [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
>>    [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
>>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>    [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
>>    [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
>>    [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
>>    [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
>>    [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
>>    [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
>>    [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
>>    [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
>>    [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
>>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>    [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
>>    [<ffffffff8103838e>] ? need_resched+0x1e/0x28
>>    [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
>>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>>    [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
>>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>>    [<ffffffff8105c3bf>] kthread+0x7d/0x85
>>    [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
>>    [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
>>    [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
>>    [<ffffffff8105c342>] ? kthread+0x0/0x85
>>    [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
>>    [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
>> Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
>> 48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
>> 55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
>> 8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
>> e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
>> a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
>> 00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
>> 48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
>> 28 5b
>> ---[ end trace 3387e7bab0a9c645 ]---
>
> Kaboom. So you're seeing oopses too. Could you get a listing of the
> place where it oopsed by following the instructions here?
>
> http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
>
> I suspect that "sock" is NULL in this case too and it blew up in
> kernel_recvmsg.

I added code to WARN_ON when ssocket was null.  This isn't a real panic,
just a WARN_ON:


static int
smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
{
	int rc = 0;
	int i = 0;
	struct msghdr smb_msg;
	struct smb_hdr *smb_buffer = iov[0].iov_base;
	unsigned int len = iov[0].iov_len;
	unsigned int total_len;
	int first_vec = 0;
	unsigned int smb_buf_length = smb_buffer->smb_buf_length;
	struct socket *ssocket = server->ssocket;

	if (ssocket == NULL) {
		cERROR(1, "need to reconnect in sendv here");
*** HERE ***	WARN_ON_ONCE(1);
  		return -ENOTSOCK; /* BB eventually add reconnect code here */
	}

A second warn-on when ENOTSOCK is perculated up to the calling stack
a bit causes the other stack dumpage.  I think the one above is root
cause...need to figure out how to have it gracefully bail out and re-connect
when it hits this state, as current code just calls this general loop over
and over again.

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                     ` <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-06-04 11:19                                                       ` Jeff Layton
       [not found]                                                         ` <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2011-06-04 11:19 UTC (permalink / raw)
  To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, 03 Jun 2011 22:03:43 -0700
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:

> On 06/03/2011 06:42 PM, Jeff Layton wrote:
> > On Fri, 03 Jun 2011 14:01:11 -0700
> > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
> >
> >> Ok, we had some luck.  Here's the backtrace and attending dmesg
> >> output.  The filer has been doing failover, but it has not gone
> >> into a failed state...so, the system *should* be able to reconnect.
> >>
> >> We have the system in the failed state now and will leave it that way
> >> for a bit in case you have some commands you'd like me to run.
> >>
> >> Aside from the hung cifs processes (anything accessing those mounts
> >> gets into the D state), the system seems fine.
> >>
> >>
> >> CIFS VFS: Unexpected lookup error -112
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Unexpected lookup error -11
> >> CIFS VFS: Unexpected lookup error -112
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Unexpected lookup error -112
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Unexpected lookup error -11
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: Reconnecting tcp session
> >> CIFS VFS: need to reconnect in sendv here
> >> ------------[ cut here ]------------
> >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> >> BUG: unable to handle kernel
> >> Hardware name: X8ST3
> >> NULL pointer dereference
> >> Modules linked in: at 0000000000000020
> >>    be2iscsi
> >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> >>    bnx2iPGD 0  cnic
> >>    uio
> >> Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
> >>    mdio
> >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
> >>    ib_iserCPU 2  rdma_cm
> >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY
> >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY
> >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi
> >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc
> >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
> >>    libiscsi
> >>    scsi_transport_iscsi
> >> Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
> >>    auth_rpcgss
> >> RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> >>    sunrpc
> >> RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
> >>    ipv6
> >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
> >>    uinput
> >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
> >>    i2c_i801
> >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
> >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
> >>    e1000e
> >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
> >>    i2c_core
> >> FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
> >>    igb
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>    ioatdma
> >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
> >>    iTCO_wdt
> >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>    i7core_edac
> >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >>    iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
> >>    pcspkr
> >> Stack:
> >>    dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
> >>
> >>    ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
> >>    0000000000000004Call Trace:
> >>    ffff8802e64e5c30 ffffffff8135792c
> >>    0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
> >>    ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
> >>
> >> Call Trace:
> >>    [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
> >>    [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
> >>    [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
> >>    [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
> >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>    [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
> >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>    [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
> >>    [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
> >>    [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
> >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>    [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
> >>    [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
> >>    [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
> >>    [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
> >>    [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
> >>    [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
> >>    [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
> >>    [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
> >>    [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
> >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>    [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
> >>    [<ffffffff8103838e>] ? need_resched+0x1e/0x28
> >>    [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
> >>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> >>    [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
> >>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> >>    [<ffffffff8105c3bf>] kthread+0x7d/0x85
> >>    [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
> >>    [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
> >>    [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
> >>    [<ffffffff8105c342>] ? kthread+0x0/0x85
> >>    [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
> >>    [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
> >> Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
> >> 48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
> >> 55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
> >> 8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
> >> e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
> >> a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
> >> 00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
> >> 48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
> >> 28 5b
> >> ---[ end trace 3387e7bab0a9c645 ]---
> >
> > Kaboom. So you're seeing oopses too. Could you get a listing of the
> > place where it oopsed by following the instructions here?
> >
> > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
> >
> > I suspect that "sock" is NULL in this case too and it blew up in
> > kernel_recvmsg.
> 
> I added code to WARN_ON when ssocket was null.  This isn't a real panic,
> just a WARN_ON:
> 
> 
> static int
> smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
> {
> 	int rc = 0;
> 	int i = 0;
> 	struct msghdr smb_msg;
> 	struct smb_hdr *smb_buffer = iov[0].iov_base;
> 	unsigned int len = iov[0].iov_len;
> 	unsigned int total_len;
> 	int first_vec = 0;
> 	unsigned int smb_buf_length = smb_buffer->smb_buf_length;
> 	struct socket *ssocket = server->ssocket;
> 
> 	if (ssocket == NULL) {
> 		cERROR(1, "need to reconnect in sendv here");
> *** HERE ***	WARN_ON_ONCE(1);
>   		return -ENOTSOCK; /* BB eventually add reconnect code here */
> 	}
> 
> A second warn-on when ENOTSOCK is perculated up to the calling stack
> a bit causes the other stack dumpage.  I think the one above is root
> cause...need to figure out how to have it gracefully bail out and re-connect
> when it hits this state, as current code just calls this general loop over
> and over again.
>

No, your warning is there, but it's Oopsing too:

> >> ------------[ cut here ]------------
> >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> >> BUG: unable to handle kernel
> >> Hardware name: X8ST3
> >> NULL pointer dereference
> >> Modules linked in: at 0000000000000020
> >>    be2iscsi
> >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e


...smb_sendv is called by the "send" side which is generally a
userspace process. The oops happened on the receive side. cifsd called
kernel_recvmsg, and it looks like it passed in a NULL sock pointer.

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                         ` <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
@ 2011-06-06 13:45                                                           ` Jeff Layton
       [not found]                                                             ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2011-06-06 13:45 UTC (permalink / raw)
  To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Sat, 4 Jun 2011 07:19:23 -0400
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Fri, 03 Jun 2011 22:03:43 -0700
> Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
> 
> > On 06/03/2011 06:42 PM, Jeff Layton wrote:
> > > On Fri, 03 Jun 2011 14:01:11 -0700
> > > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
> > >
> > >> Ok, we had some luck.  Here's the backtrace and attending dmesg
> > >> output.  The filer has been doing failover, but it has not gone
> > >> into a failed state...so, the system *should* be able to reconnect.
> > >>
> > >> We have the system in the failed state now and will leave it that way
> > >> for a bit in case you have some commands you'd like me to run.
> > >>
> > >> Aside from the hung cifs processes (anything accessing those mounts
> > >> gets into the D state), the system seems fine.
> > >>
> > >>
> > >> CIFS VFS: Unexpected lookup error -112
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Unexpected lookup error -11
> > >> CIFS VFS: Unexpected lookup error -112
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Unexpected lookup error -112
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Unexpected lookup error -11
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: Reconnecting tcp session
> > >> CIFS VFS: need to reconnect in sendv here
> > >> ------------[ cut here ]------------
> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> > >> BUG: unable to handle kernel
> > >> Hardware name: X8ST3
> > >> NULL pointer dereference
> > >> Modules linked in: at 0000000000000020
> > >>    be2iscsi
> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> > >>    bnx2iPGD 0  cnic
> > >>    uio
> > >> Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
> > >>    mdio
> > >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
> > >>    ib_iserCPU 2  rdma_cm
> > >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY
> > >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY
> > >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi
> > >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc
> > >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
> > >>    libiscsi
> > >>    scsi_transport_iscsi
> > >> Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
> > >>    auth_rpcgss
> > >> RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> > >>    sunrpc
> > >> RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
> > >>    ipv6
> > >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
> > >>    uinput
> > >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
> > >>    i2c_i801
> > >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
> > >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
> > >>    e1000e
> > >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
> > >>    i2c_core
> > >> FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
> > >>    igb
> > >> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > >>    ioatdma
> > >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
> > >>    iTCO_wdt
> > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >>    i7core_edac
> > >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > >>    iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
> > >>    pcspkr
> > >> Stack:
> > >>    dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
> > >>
> > >>    ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
> > >>    0000000000000004Call Trace:
> > >>    ffff8802e64e5c30 ffffffff8135792c
> > >>    0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
> > >>    ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
> > >>
> > >> Call Trace:
> > >>    [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
> > >>    [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
> > >>    [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
> > >>    [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> > >>    [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> > >>    [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
> > >>    [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
> > >>    [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> > >>    [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
> > >>    [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
> > >>    [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
> > >>    [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
> > >>    [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
> > >>    [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
> > >>    [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
> > >>    [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
> > >>    [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> > >>    [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
> > >>    [<ffffffff8103838e>] ? need_resched+0x1e/0x28
> > >>    [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
> > >>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> > >>    [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
> > >>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> > >>    [<ffffffff8105c3bf>] kthread+0x7d/0x85
> > >>    [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
> > >>    [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
> > >>    [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
> > >>    [<ffffffff8105c342>] ? kthread+0x0/0x85
> > >>    [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
> > >>    [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
> > >> Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
> > >> 48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
> > >> 55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
> > >> 8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
> > >> e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
> > >> a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
> > >> 00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
> > >> 48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
> > >> 28 5b
> > >> ---[ end trace 3387e7bab0a9c645 ]---
> > >
> > > Kaboom. So you're seeing oopses too. Could you get a listing of the
> > > place where it oopsed by following the instructions here?
> > >
> > > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
> > >
> > > I suspect that "sock" is NULL in this case too and it blew up in
> > > kernel_recvmsg.
> > 
> > I added code to WARN_ON when ssocket was null.  This isn't a real panic,
> > just a WARN_ON:
> > 
> > 
> > static int
> > smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
> > {
> > 	int rc = 0;
> > 	int i = 0;
> > 	struct msghdr smb_msg;
> > 	struct smb_hdr *smb_buffer = iov[0].iov_base;
> > 	unsigned int len = iov[0].iov_len;
> > 	unsigned int total_len;
> > 	int first_vec = 0;
> > 	unsigned int smb_buf_length = smb_buffer->smb_buf_length;
> > 	struct socket *ssocket = server->ssocket;
> > 
> > 	if (ssocket == NULL) {
> > 		cERROR(1, "need to reconnect in sendv here");
> > *** HERE ***	WARN_ON_ONCE(1);
> >   		return -ENOTSOCK; /* BB eventually add reconnect code here */
> > 	}
> > 
> > A second warn-on when ENOTSOCK is perculated up to the calling stack
> > a bit causes the other stack dumpage.  I think the one above is root
> > cause...need to figure out how to have it gracefully bail out and re-connect
> > when it hits this state, as current code just calls this general loop over
> > and over again.
> >
> 
> No, your warning is there, but it's Oopsing too:
> 
> > >> ------------[ cut here ]------------
> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> > >> BUG: unable to handle kernel
> > >> Hardware name: X8ST3
> > >> NULL pointer dereference
> > >> Modules linked in: at 0000000000000020
> > >>    be2iscsi
> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> 
> 
> ...smb_sendv is called by the "send" side which is generally a
> userspace process. The oops happened on the receive side. cifsd called
> kernel_recvmsg, and it looks like it passed in a NULL sock pointer.
> 

I suspect that the following (untested) patch will fix this. I think
the symptoms that you've seen are consistent with the patch
description. Ben, would you be able to test this in your setup? This
should at least prevent the oopses.

------------------[snip]--------------------

[PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket  pointer

It's possible for the following set of events to happen:

cifsd calls cifs_reconnect which reconnects the socket. A userspace
process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
gets a reply. But, while processing the reply, cifsd calls
cifs_reconnect again.  Eventually the GlobalMid_Lock is dropped and the
reply from the earlier NEGOTIATE completes and the tcpStatus is set to
CifsGood. cifs_reconnect then goes through and closes the socket and sets the
pointer to zero, but because the status is now CifsGood, the new socket
is not created and cifs_reconnect exits with the socket pointer set to
NULL.

Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
CifsNeedNegotiate, and by making sure that generic_ip_connect is always
called at least once in cifs_reconnect.

Note that this is not a perfect fix for this issue. It's still possible
that the NEGOTIATE reply is handled after the socket has been closed and
reconnected. In that case, the socket state will look correct but it no
NEGOTIATE was performed on it. In that situation though the server
should just shut down the socket on the next attempted send, rather
than causing the oops that occurs today.

Reported-by: Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Signed-off-by: Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 fs/cifs/connect.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 84c7307..8bb55bc 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
 		mid_entry->callback(mid_entry);
 	}
 
-	while (server->tcpStatus == CifsNeedReconnect) {
+	do {
 		try_to_freeze();
 
 		/* we should try only the port we connected to before */
@@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
 				server->tcpStatus = CifsNeedNegotiate;
 			spin_unlock(&GlobalMid_Lock);
 		}
-	}
+	} while (server->tcpStatus == CifsNeedReconnect);
 
 	return rc;
 }
@@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
 	}
 	if (rc == 0) {
 		spin_lock(&GlobalMid_Lock);
-		if (server->tcpStatus != CifsExiting)
+		if (server->tcpStatus == CifsNeedNegotiate)
 			server->tcpStatus = CifsGood;
 		else
 			rc = -EHOSTDOWN;
-- 
1.7.5.2


-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

[parent not found: <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                             ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-06-06 15:37                                                               ` Steve French
  2011-06-06 16:47                                                               ` Ben Greear
  1 sibling, 0 replies; 23+ messages in thread
From: Steve French @ 2011-06-06 15:37 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Ben Greear, linux-cifs-u79uwXL29TY76Z2rM5mHXA

Sounds promising.

Any others have thoughts about Jeff's proposed solution?

Ben,
If you get test data on this with and without patch - let us know.

On Mon, Jun 6, 2011 at 8:45 AM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Sat, 4 Jun 2011 07:19:23 -0400
> Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>> On Fri, 03 Jun 2011 22:03:43 -0700
>> Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
>>
>> > On 06/03/2011 06:42 PM, Jeff Layton wrote:
>> > > On Fri, 03 Jun 2011 14:01:11 -0700
>> > > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>> > >
>> > >> Ok, we had some luck.  Here's the backtrace and attending dmesg
>> > >> output.  The filer has been doing failover, but it has not gone
>> > >> into a failed state...so, the system *should* be able to reconnect.
>> > >>
>> > >> We have the system in the failed state now and will leave it that way
>> > >> for a bit in case you have some commands you'd like me to run.
>> > >>
>> > >> Aside from the hung cifs processes (anything accessing those mounts
>> > >> gets into the D state), the system seems fine.
>> > >>
>> > >>
>> > >> CIFS VFS: Unexpected lookup error -112
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Unexpected lookup error -11
>> > >> CIFS VFS: Unexpected lookup error -112
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Unexpected lookup error -112
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Unexpected lookup error -11
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: Reconnecting tcp session
>> > >> CIFS VFS: need to reconnect in sendv here
>> > >> ------------[ cut here ]------------
>> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
>> > >> BUG: unable to handle kernel
>> > >> Hardware name: X8ST3
>> > >> NULL pointer dereference
>> > >> Modules linked in: at 0000000000000020
>> > >>    be2iscsi
>> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>> > >>    bnx2iPGD 0  cnic
>> > >>    uio
>> > >> Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
>> > >>    mdio
>> > >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
>> > >>    ib_iserCPU 2  rdma_cm
>> > >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY
>> > >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY
>> > >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi
>> > >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc
>> > >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
>> > >>    libiscsi
>> > >>    scsi_transport_iscsi
>> > >> Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
>> > >>    auth_rpcgss
>> > >> RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>> > >>    sunrpc
>> > >> RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
>> > >>    ipv6
>> > >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
>> > >>    uinput
>> > >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
>> > >>    i2c_i801
>> > >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
>> > >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
>> > >>    e1000e
>> > >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
>> > >>    i2c_core
>> > >> FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
>> > >>    igb
>> > >> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> > >>    ioatdma
>> > >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
>> > >>    iTCO_wdt
>> > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > >>    i7core_edac
>> > >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > >>    iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
>> > >>    pcspkr
>> > >> Stack:
>> > >>    dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
>> > >>
>> > >>    ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
>> > >>    0000000000000004Call Trace:
>> > >>    ffff8802e64e5c30 ffffffff8135792c
>> > >>    0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
>> > >>    ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
>> > >>
>> > >> Call Trace:
>> > >>    [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
>> > >>    [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
>> > >>    [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
>> > >>    [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
>> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>> > >>    [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
>> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>> > >>    [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
>> > >>    [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
>> > >>    [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
>> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>> > >>    [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
>> > >>    [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
>> > >>    [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
>> > >>    [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
>> > >>    [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
>> > >>    [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
>> > >>    [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
>> > >>    [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
>> > >>    [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
>> > >>    [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>> > >>    [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
>> > >>    [<ffffffff8103838e>] ? need_resched+0x1e/0x28
>> > >>    [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
>> > >>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>> > >>    [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
>> > >>    [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>> > >>    [<ffffffff8105c3bf>] kthread+0x7d/0x85
>> > >>    [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
>> > >>    [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
>> > >>    [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
>> > >>    [<ffffffff8105c342>] ? kthread+0x0/0x85
>> > >>    [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
>> > >>    [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
>> > >> Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
>> > >> 48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
>> > >> 55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
>> > >> 8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
>> > >> e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
>> > >> a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
>> > >> 00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
>> > >> 48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
>> > >> 28 5b
>> > >> ---[ end trace 3387e7bab0a9c645 ]---
>> > >
>> > > Kaboom. So you're seeing oopses too. Could you get a listing of the
>> > > place where it oopsed by following the instructions here?
>> > >
>> > > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
>> > >
>> > > I suspect that "sock" is NULL in this case too and it blew up in
>> > > kernel_recvmsg.
>> >
>> > I added code to WARN_ON when ssocket was null.  This isn't a real panic,
>> > just a WARN_ON:
>> >
>> >
>> > static int
>> > smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
>> > {
>> >     int rc = 0;
>> >     int i = 0;
>> >     struct msghdr smb_msg;
>> >     struct smb_hdr *smb_buffer = iov[0].iov_base;
>> >     unsigned int len = iov[0].iov_len;
>> >     unsigned int total_len;
>> >     int first_vec = 0;
>> >     unsigned int smb_buf_length = smb_buffer->smb_buf_length;
>> >     struct socket *ssocket = server->ssocket;
>> >
>> >     if (ssocket == NULL) {
>> >             cERROR(1, "need to reconnect in sendv here");
>> > *** HERE ***        WARN_ON_ONCE(1);
>> >             return -ENOTSOCK; /* BB eventually add reconnect code here */
>> >     }
>> >
>> > A second warn-on when ENOTSOCK is perculated up to the calling stack
>> > a bit causes the other stack dumpage.  I think the one above is root
>> > cause...need to figure out how to have it gracefully bail out and re-connect
>> > when it hits this state, as current code just calls this general loop over
>> > and over again.
>> >
>>
>> No, your warning is there, but it's Oopsing too:
>>
>> > >> ------------[ cut here ]------------
>> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
>> > >> BUG: unable to handle kernel
>> > >> Hardware name: X8ST3
>> > >> NULL pointer dereference
>> > >> Modules linked in: at 0000000000000020
>> > >>    be2iscsi
>> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>>
>>
>> ...smb_sendv is called by the "send" side which is generally a
>> userspace process. The oops happened on the receive side. cifsd called
>> kernel_recvmsg, and it looks like it passed in a NULL sock pointer.
>>
>
> I suspect that the following (untested) patch will fix this. I think
> the symptoms that you've seen are consistent with the patch
> description. Ben, would you be able to test this in your setup? This
> should at least prevent the oopses.
>
> ------------------[snip]--------------------
>
> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket  pointer
>
> It's possible for the following set of events to happen:
>
> cifsd calls cifs_reconnect which reconnects the socket. A userspace
> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
> gets a reply. But, while processing the reply, cifsd calls
> cifs_reconnect again.  Eventually the GlobalMid_Lock is dropped and the
> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
> CifsGood. cifs_reconnect then goes through and closes the socket and sets the
> pointer to zero, but because the status is now CifsGood, the new socket
> is not created and cifs_reconnect exits with the socket pointer set to
> NULL.
>
> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
> called at least once in cifs_reconnect.
>
> Note that this is not a perfect fix for this issue. It's still possible
> that the NEGOTIATE reply is handled after the socket has been closed and
> reconnected. In that case, the socket state will look correct but it no
> NEGOTIATE was performed on it. In that situation though the server
> should just shut down the socket on the next attempted send, rather
> than causing the oops that occurs today.
>
> Reported-by: Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
> Signed-off-by: Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/cifs/connect.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> index 84c7307..8bb55bc 100644
> --- a/fs/cifs/connect.c
> +++ b/fs/cifs/connect.c
> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>                mid_entry->callback(mid_entry);
>        }
>
> -       while (server->tcpStatus == CifsNeedReconnect) {
> +       do {
>                try_to_freeze();
>
>                /* we should try only the port we connected to before */
> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>                                server->tcpStatus = CifsNeedNegotiate;
>                        spin_unlock(&GlobalMid_Lock);
>                }
> -       }
> +       } while (server->tcpStatus == CifsNeedReconnect);
>
>        return rc;
>  }
> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
>        }
>        if (rc == 0) {
>                spin_lock(&GlobalMid_Lock);
> -               if (server->tcpStatus != CifsExiting)
> +               if (server->tcpStatus == CifsNeedNegotiate)
>                        server->tcpStatus = CifsGood;
>                else
>                        rc = -EHOSTDOWN;
> --
> 1.7.5.2
>
>
> --
> Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>



-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                             ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2011-06-06 15:37                                                               ` Steve French
@ 2011-06-06 16:47                                                               ` Ben Greear
       [not found]                                                                 ` <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-06-06 16:47 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 06/06/2011 06:45 AM, Jeff Layton wrote:
> On Sat, 4 Jun 2011 07:19:23 -0400
> Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>  wrote:
>
>> On Fri, 03 Jun 2011 22:03:43 -0700
>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>>
>>> On 06/03/2011 06:42 PM, Jeff Layton wrote:
>>>> On Fri, 03 Jun 2011 14:01:11 -0700
>>>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>   wrote:
>>>>
>>>>> Ok, we had some luck.  Here's the backtrace and attending dmesg
>>>>> output.  The filer has been doing failover, but it has not gone
>>>>> into a failed state...so, the system *should* be able to reconnect.
>>>>>
>>>>> We have the system in the failed state now and will leave it that way
>>>>> for a bit in case you have some commands you'd like me to run.
>>>>>
>>>>> Aside from the hung cifs processes (anything accessing those mounts
>>>>> gets into the D state), the system seems fine.
>>>>>
>>>>>
>>>>> CIFS VFS: Unexpected lookup error -112
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Unexpected lookup error -11
>>>>> CIFS VFS: Unexpected lookup error -112
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Unexpected lookup error -112
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Unexpected lookup error -11
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: Reconnecting tcp session
>>>>> CIFS VFS: need to reconnect in sendv here
>>>>> ------------[ cut here ]------------
>>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
>>>>> BUG: unable to handle kernel
>>>>> Hardware name: X8ST3
>>>>> NULL pointer dereference
>>>>> Modules linked in: at 0000000000000020
>>>>>     be2iscsi
>>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>>>>>     bnx2iPGD 0  cnic
>>>>>     uio
>>>>> Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
>>>>>     mdio
>>>>> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
>>>>>     ib_iserCPU 2  rdma_cm
>>>>> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY
>>>>> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY
>>>>> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi
>>>>> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc
>>>>> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
>>>>>     libiscsi
>>>>>     scsi_transport_iscsi
>>>>> Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
>>>>>     auth_rpcgss
>>>>> RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>>>>>     sunrpc
>>>>> RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
>>>>>     ipv6
>>>>> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
>>>>>     uinput
>>>>> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
>>>>>     i2c_i801
>>>>> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
>>>>> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
>>>>>     e1000e
>>>>> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
>>>>>     i2c_core
>>>>> FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
>>>>>     igb
>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>     ioatdma
>>>>> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
>>>>>     iTCO_wdt
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>     i7core_edac
>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>>     iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
>>>>>     pcspkr
>>>>> Stack:
>>>>>     dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
>>>>>
>>>>>     ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
>>>>>     0000000000000004Call Trace:
>>>>>     ffff8802e64e5c30 ffffffff8135792c
>>>>>     0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
>>>>>     ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
>>>>>
>>>>> Call Trace:
>>>>>     [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
>>>>>     [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
>>>>>     [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
>>>>>     [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
>>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>>>>     [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
>>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>>>>     [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
>>>>>     [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
>>>>>     [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
>>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>>>>     [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
>>>>>     [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
>>>>>     [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
>>>>>     [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
>>>>>     [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
>>>>>     [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
>>>>>     [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
>>>>>     [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
>>>>>     [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
>>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
>>>>>     [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
>>>>>     [<ffffffff8103838e>] ? need_resched+0x1e/0x28
>>>>>     [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
>>>>>     [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>>>>>     [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
>>>>>     [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
>>>>>     [<ffffffff8105c3bf>] kthread+0x7d/0x85
>>>>>     [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
>>>>>     [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
>>>>>     [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
>>>>>     [<ffffffff8105c342>] ? kthread+0x0/0x85
>>>>>     [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
>>>>>     [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
>>>>> Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
>>>>> 48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
>>>>> 55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
>>>>> 8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
>>>>> e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
>>>>> a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
>>>>> 00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
>>>>> 48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
>>>>> 28 5b
>>>>> ---[ end trace 3387e7bab0a9c645 ]---
>>>>
>>>> Kaboom. So you're seeing oopses too. Could you get a listing of the
>>>> place where it oopsed by following the instructions here?
>>>>
>>>> http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
>>>>
>>>> I suspect that "sock" is NULL in this case too and it blew up in
>>>> kernel_recvmsg.
>>>
>>> I added code to WARN_ON when ssocket was null.  This isn't a real panic,
>>> just a WARN_ON:
>>>
>>>
>>> static int
>>> smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
>>> {
>>> 	int rc = 0;
>>> 	int i = 0;
>>> 	struct msghdr smb_msg;
>>> 	struct smb_hdr *smb_buffer = iov[0].iov_base;
>>> 	unsigned int len = iov[0].iov_len;
>>> 	unsigned int total_len;
>>> 	int first_vec = 0;
>>> 	unsigned int smb_buf_length = smb_buffer->smb_buf_length;
>>> 	struct socket *ssocket = server->ssocket;
>>>
>>> 	if (ssocket == NULL) {
>>> 		cERROR(1, "need to reconnect in sendv here");
>>> *** HERE ***	WARN_ON_ONCE(1);
>>>    		return -ENOTSOCK; /* BB eventually add reconnect code here */
>>> 	}
>>>
>>> A second warn-on when ENOTSOCK is perculated up to the calling stack
>>> a bit causes the other stack dumpage.  I think the one above is root
>>> cause...need to figure out how to have it gracefully bail out and re-connect
>>> when it hits this state, as current code just calls this general loop over
>>> and over again.
>>>
>>
>> No, your warning is there, but it's Oopsing too:
>>
>>>>> ------------[ cut here ]------------
>>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
>>>>> BUG: unable to handle kernel
>>>>> Hardware name: X8ST3
>>>>> NULL pointer dereference
>>>>> Modules linked in: at 0000000000000020
>>>>>     be2iscsi
>>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>>
>>
>> ...smb_sendv is called by the "send" side which is generally a
>> userspace process. The oops happened on the receive side. cifsd called
>> kernel_recvmsg, and it looks like it passed in a NULL sock pointer.
>>
>
> I suspect that the following (untested) patch will fix this. I think
> the symptoms that you've seen are consistent with the patch
> description. Ben, would you be able to test this in your setup? This
> should at least prevent the oopses.
>
> ------------------[snip]--------------------
>
> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket  pointer
>
> It's possible for the following set of events to happen:
>
> cifsd calls cifs_reconnect which reconnects the socket. A userspace
> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
> gets a reply. But, while processing the reply, cifsd calls
> cifs_reconnect again.  Eventually the GlobalMid_Lock is dropped and the
> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
> CifsGood. cifs_reconnect then goes through and closes the socket and sets the
> pointer to zero, but because the status is now CifsGood, the new socket
> is not created and cifs_reconnect exits with the socket pointer set to
> NULL.
>
> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
> called at least once in cifs_reconnect.
>
> Note that this is not a perfect fix for this issue. It's still possible
> that the NEGOTIATE reply is handled after the socket has been closed and
> reconnected. In that case, the socket state will look correct but it no
> NEGOTIATE was performed on it. In that situation though the server
> should just shut down the socket on the next attempted send, rather
> than causing the oops that occurs today.
>
> Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
> Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   fs/cifs/connect.c |    6 +++---
>   1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> index 84c7307..8bb55bc 100644
> --- a/fs/cifs/connect.c
> +++ b/fs/cifs/connect.c
> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>   		mid_entry->callback(mid_entry);
>   	}
>
> -	while (server->tcpStatus == CifsNeedReconnect) {
> +	do {
>   		try_to_freeze();
>
>   		/* we should try only the port we connected to before */
> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>   				server->tcpStatus = CifsNeedNegotiate;
>   			spin_unlock(&GlobalMid_Lock);
>   		}
> -	}
> +	} while (server->tcpStatus == CifsNeedReconnect);
>
>   	return rc;
>   }
> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
>   	}
>   	if (rc == 0) {
>   		spin_lock(&GlobalMid_Lock);
> -		if (server->tcpStatus != CifsExiting)
> +		if (server->tcpStatus == CifsNeedNegotiate)
>   			server->tcpStatus = CifsGood;
>   		else
>   			rc = -EHOSTDOWN;


This has some merge issues on 3.6.38.8:


<<<<<<<
	while ((server->tcpStatus != CifsExiting) &&
	       (server->tcpStatus != CifsGood)) {
=======
	do {
 >>>>>>>

Should I keep your comparison for tcpStatus == CifsNeedReconnect
instead of these != comparisons above?
	

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                                 ` <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-06-06 16:51                                                                   ` Jeff Layton
       [not found]                                                                     ` <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2011-06-06 16:51 UTC (permalink / raw)
  To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Mon, 06 Jun 2011 09:47:40 -0700
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:

> On 06/06/2011 06:45 AM, Jeff Layton wrote:
> > On Sat, 4 Jun 2011 07:19:23 -0400
> > Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>  wrote:
> >
> >> On Fri, 03 Jun 2011 22:03:43 -0700
> >> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
> >>
> >>> On 06/03/2011 06:42 PM, Jeff Layton wrote:
> >>>> On Fri, 03 Jun 2011 14:01:11 -0700
> >>>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>   wrote:
> >>>>
> >>>>> Ok, we had some luck.  Here's the backtrace and attending dmesg
> >>>>> output.  The filer has been doing failover, but it has not gone
> >>>>> into a failed state...so, the system *should* be able to reconnect.
> >>>>>
> >>>>> We have the system in the failed state now and will leave it that way
> >>>>> for a bit in case you have some commands you'd like me to run.
> >>>>>
> >>>>> Aside from the hung cifs processes (anything accessing those mounts
> >>>>> gets into the D state), the system seems fine.
> >>>>>
> >>>>>
> >>>>> CIFS VFS: Unexpected lookup error -112
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Unexpected lookup error -11
> >>>>> CIFS VFS: Unexpected lookup error -112
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Unexpected lookup error -112
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Unexpected lookup error -11
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: Reconnecting tcp session
> >>>>> CIFS VFS: need to reconnect in sendv here
> >>>>> ------------[ cut here ]------------
> >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> >>>>> BUG: unable to handle kernel
> >>>>> Hardware name: X8ST3
> >>>>> NULL pointer dereference
> >>>>> Modules linked in: at 0000000000000020
> >>>>>     be2iscsi
> >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> >>>>>     bnx2iPGD 0  cnic
> >>>>>     uio
> >>>>> Oops: 0000 [#1]  cxgb3iPREEMPT  libcxgbiSMP  cxgb3
> >>>>>     mdio
> >>>>> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
> >>>>>     ib_iserCPU 2  rdma_cm
> >>>>> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY
> >>>>> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY
> >>>>> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi
> >>>>> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc
> >>>>> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
> >>>>>     libiscsi
> >>>>>     scsi_transport_iscsi
> >>>>> Pid: 5047, comm: cifsd Tainted: P            2.6.38.8+ #12 nfs  lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
> >>>>>     auth_rpcgss
> >>>>> RIP: 0010:[<ffffffff81356230>]  [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> >>>>>     sunrpc
> >>>>> RSP: 0018:ffff8802e64e5bc0  EFLAGS: 00010286
> >>>>>     ipv6
> >>>>> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
> >>>>>     uinput
> >>>>> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
> >>>>>     i2c_i801
> >>>>> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
> >>>>> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
> >>>>>     e1000e
> >>>>> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
> >>>>>     i2c_core
> >>>>> FS:  0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
> >>>>>     igb
> >>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>>>>     ioatdma
> >>>>> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
> >>>>>     iTCO_wdt
> >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>     i7core_edac
> >>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >>>>>     iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
> >>>>>     pcspkr
> >>>>> Stack:
> >>>>>     dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
> >>>>>
> >>>>>     ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P            2.6.38.8+ #12
> >>>>>     0000000000000004Call Trace:
> >>>>>     ffff8802e64e5c30 ffffffff8135792c
> >>>>>     0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
> >>>>>     ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
> >>>>>
> >>>>> Call Trace:
> >>>>>     [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
> >>>>>     [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
> >>>>>     [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
> >>>>>     [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
> >>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>>>>     [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
> >>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>>>>     [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
> >>>>>     [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
> >>>>>     [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
> >>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>>>>     [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
> >>>>>     [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
> >>>>>     [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
> >>>>>     [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
> >>>>>     [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
> >>>>>     [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
> >>>>>     [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
> >>>>>     [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
> >>>>>     [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
> >>>>>     [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>>>>     [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
> >>>>>     [<ffffffff8103838e>] ? need_resched+0x1e/0x28
> >>>>>     [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
> >>>>>     [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> >>>>>     [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
> >>>>>     [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> >>>>>     [<ffffffff8105c3bf>] kthread+0x7d/0x85
> >>>>>     [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
> >>>>>     [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
> >>>>>     [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
> >>>>>     [<ffffffff8105c342>] ? kthread+0x0/0x85
> >>>>>     [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
> >>>>>     [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
> >>>>> Code: 48 8b 4d d8  [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
> >>>>> 48 8b  [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
> >>>>> 55 e0 48  [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
> >>>>> 8b 75  [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
> >>>>> e8 ff 90  [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
> >>>>> a8 00  [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
> >>>>> 00 00  [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
> >>>>> 48 83 c4  [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
> >>>>> 28 5b
> >>>>> ---[ end trace 3387e7bab0a9c645 ]---
> >>>>
> >>>> Kaboom. So you're seeing oopses too. Could you get a listing of the
> >>>> place where it oopsed by following the instructions here?
> >>>>
> >>>> http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
> >>>>
> >>>> I suspect that "sock" is NULL in this case too and it blew up in
> >>>> kernel_recvmsg.
> >>>
> >>> I added code to WARN_ON when ssocket was null.  This isn't a real panic,
> >>> just a WARN_ON:
> >>>
> >>>
> >>> static int
> >>> smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
> >>> {
> >>> 	int rc = 0;
> >>> 	int i = 0;
> >>> 	struct msghdr smb_msg;
> >>> 	struct smb_hdr *smb_buffer = iov[0].iov_base;
> >>> 	unsigned int len = iov[0].iov_len;
> >>> 	unsigned int total_len;
> >>> 	int first_vec = 0;
> >>> 	unsigned int smb_buf_length = smb_buffer->smb_buf_length;
> >>> 	struct socket *ssocket = server->ssocket;
> >>>
> >>> 	if (ssocket == NULL) {
> >>> 		cERROR(1, "need to reconnect in sendv here");
> >>> *** HERE ***	WARN_ON_ONCE(1);
> >>>    		return -ENOTSOCK; /* BB eventually add reconnect code here */
> >>> 	}
> >>>
> >>> A second warn-on when ENOTSOCK is perculated up to the calling stack
> >>> a bit causes the other stack dumpage.  I think the one above is root
> >>> cause...need to figure out how to have it gracefully bail out and re-connect
> >>> when it hits this state, as current code just calls this general loop over
> >>> and over again.
> >>>
> >>
> >> No, your warning is there, but it's Oopsing too:
> >>
> >>>>> ------------[ cut here ]------------
> >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> >>>>> BUG: unable to handle kernel
> >>>>> Hardware name: X8ST3
> >>>>> NULL pointer dereference
> >>>>> Modules linked in: at 0000000000000020
> >>>>>     be2iscsi
> >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> >>
> >>
> >> ...smb_sendv is called by the "send" side which is generally a
> >> userspace process. The oops happened on the receive side. cifsd called
> >> kernel_recvmsg, and it looks like it passed in a NULL sock pointer.
> >>
> >
> > I suspect that the following (untested) patch will fix this. I think
> > the symptoms that you've seen are consistent with the patch
> > description. Ben, would you be able to test this in your setup? This
> > should at least prevent the oopses.
> >
> > ------------------[snip]--------------------
> >
> > [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket  pointer
> >
> > It's possible for the following set of events to happen:
> >
> > cifsd calls cifs_reconnect which reconnects the socket. A userspace
> > process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
> > gets a reply. But, while processing the reply, cifsd calls
> > cifs_reconnect again.  Eventually the GlobalMid_Lock is dropped and the
> > reply from the earlier NEGOTIATE completes and the tcpStatus is set to
> > CifsGood. cifs_reconnect then goes through and closes the socket and sets the
> > pointer to zero, but because the status is now CifsGood, the new socket
> > is not created and cifs_reconnect exits with the socket pointer set to
> > NULL.
> >
> > Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
> > CifsNeedNegotiate, and by making sure that generic_ip_connect is always
> > called at least once in cifs_reconnect.
> >
> > Note that this is not a perfect fix for this issue. It's still possible
> > that the NEGOTIATE reply is handled after the socket has been closed and
> > reconnected. In that case, the socket state will look correct but it no
> > NEGOTIATE was performed on it. In that situation though the server
> > should just shut down the socket on the next attempted send, rather
> > than causing the oops that occurs today.
> >
> > Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
> > Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >   fs/cifs/connect.c |    6 +++---
> >   1 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> > index 84c7307..8bb55bc 100644
> > --- a/fs/cifs/connect.c
> > +++ b/fs/cifs/connect.c
> > @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
> >   		mid_entry->callback(mid_entry);
> >   	}
> >
> > -	while (server->tcpStatus == CifsNeedReconnect) {
> > +	do {
> >   		try_to_freeze();
> >
> >   		/* we should try only the port we connected to before */
> > @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
> >   				server->tcpStatus = CifsNeedNegotiate;
> >   			spin_unlock(&GlobalMid_Lock);
> >   		}
> > -	}
> > +	} while (server->tcpStatus == CifsNeedReconnect);
> >
> >   	return rc;
> >   }
> > @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
> >   	}
> >   	if (rc == 0) {
> >   		spin_lock(&GlobalMid_Lock);
> > -		if (server->tcpStatus != CifsExiting)
> > +		if (server->tcpStatus == CifsNeedNegotiate)
> >   			server->tcpStatus = CifsGood;
> >   		else
> >   			rc = -EHOSTDOWN;
> 
> 
> This has some merge issues on 3.6.38.8:
> 
> 
> <<<<<<<
> 	while ((server->tcpStatus != CifsExiting) &&
> 	       (server->tcpStatus != CifsGood)) {
> =======
> 	do {
>  >>>>>>>
> 
> Should I keep your comparison for tcpStatus == CifsNeedReconnect
> instead of these != comparisons above?
> 	
> 
> Thanks,
> Ben
> 

No, I think you probably just want to take patch fd88ce9313 too, which
should fix up the merge conflict.

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                                     ` <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-06-06 17:22                                                                       ` Ben Greear
       [not found]                                                                         ` <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-06-06 17:22 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 06/06/2011 09:51 AM, Jeff Layton wrote:
> On Mon, 06 Jun 2011 09:47:40 -0700
> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>
>> On 06/06/2011 06:45 AM, Jeff Layton wrote:
>>> On Sat, 4 Jun 2011 07:19:23 -0400

>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket  pointer
>>>
>>> It's possible for the following set of events to happen:
>>>
>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace
>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
>>> gets a reply. But, while processing the reply, cifsd calls
>>> cifs_reconnect again.  Eventually the GlobalMid_Lock is dropped and the
>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
>>> CifsGood. cifs_reconnect then goes through and closes the socket and sets the
>>> pointer to zero, but because the status is now CifsGood, the new socket
>>> is not created and cifs_reconnect exits with the socket pointer set to
>>> NULL.
>>>
>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
>>> called at least once in cifs_reconnect.
>>>
>>> Note that this is not a perfect fix for this issue. It's still possible
>>> that the NEGOTIATE reply is handled after the socket has been closed and
>>> reconnected. In that case, the socket state will look correct but it no
>>> NEGOTIATE was performed on it. In that situation though the server
>>> should just shut down the socket on the next attempted send, rather
>>> than causing the oops that occurs today.
>>>
>>> Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
>>> Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>    fs/cifs/connect.c |    6 +++---
>>>    1 files changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
>>> index 84c7307..8bb55bc 100644
>>> --- a/fs/cifs/connect.c
>>> +++ b/fs/cifs/connect.c
>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>    		mid_entry->callback(mid_entry);
>>>    	}
>>>
>>> -	while (server->tcpStatus == CifsNeedReconnect) {
>>> +	do {
>>>    		try_to_freeze();
>>>
>>>    		/* we should try only the port we connected to before */
>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>    				server->tcpStatus = CifsNeedNegotiate;
>>>    			spin_unlock(&GlobalMid_Lock);
>>>    		}
>>> -	}
>>> +	} while (server->tcpStatus == CifsNeedReconnect);
>>>
>>>    	return rc;
>>>    }
>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
>>>    	}
>>>    	if (rc == 0) {
>>>    		spin_lock(&GlobalMid_Lock);
>>> -		if (server->tcpStatus != CifsExiting)
>>> +		if (server->tcpStatus == CifsNeedNegotiate)
>>>    			server->tcpStatus = CifsGood;
>>>    		else
>>>    			rc = -EHOSTDOWN;
>>
>>
>> This has some merge issues on 3.6.38.8:
>>
>>
>> <<<<<<<
>> 	while ((server->tcpStatus != CifsExiting)&&
>> 	(server->tcpStatus != CifsGood)) {
>> =======
>> 	do {
>>   >>>>>>>
>>
>> Should I keep your comparison for tcpStatus == CifsNeedReconnect
>> instead of these != comparisons above?
>> 	
>>
>> Thanks,
>> Ben
>>
>
> No, I think you probably just want to take patch fd88ce9313 too, which
> should fix up the merge conflict.

Ok, I've applied those two..we'll start testing with these patches
today.  Might take a while before we are certain the fix works, as
this isn't usually easy or fast to reproduce.

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                                         ` <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2011-06-07  1:00                                                                           ` Steve French
       [not found]                                                                             ` <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Steve French @ 2011-06-07  1:00 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA

Ben,
Thanks - this may be a very rare case - hard to prove without your testing
but it looks like Jeff's patch makes sense.

On Mon, Jun 6, 2011 at 12:22 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
> On 06/06/2011 09:51 AM, Jeff Layton wrote:
>>
>> On Mon, 06 Jun 2011 09:47:40 -0700
>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>>
>>> On 06/06/2011 06:45 AM, Jeff Layton wrote:
>>>>
>>>> On Sat, 4 Jun 2011 07:19:23 -0400
>
>>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket
>>>>  pointer
>>>>
>>>> It's possible for the following set of events to happen:
>>>>
>>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace
>>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
>>>> gets a reply. But, while processing the reply, cifsd calls
>>>> cifs_reconnect again.  Eventually the GlobalMid_Lock is dropped and the
>>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
>>>> CifsGood. cifs_reconnect then goes through and closes the socket and
>>>> sets the
>>>> pointer to zero, but because the status is now CifsGood, the new socket
>>>> is not created and cifs_reconnect exits with the socket pointer set to
>>>> NULL.
>>>>
>>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
>>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
>>>> called at least once in cifs_reconnect.
>>>>
>>>> Note that this is not a perfect fix for this issue. It's still possible
>>>> that the NEGOTIATE reply is handled after the socket has been closed and
>>>> reconnected. In that case, the socket state will look correct but it no
>>>> NEGOTIATE was performed on it. In that situation though the server
>>>> should just shut down the socket on the next attempted send, rather
>>>> than causing the oops that occurs today.
>>>>
>>>> Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
>>>> Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>   fs/cifs/connect.c |    6 +++---
>>>>   1 files changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
>>>> index 84c7307..8bb55bc 100644
>>>> --- a/fs/cifs/connect.c
>>>> +++ b/fs/cifs/connect.c
>>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>>                mid_entry->callback(mid_entry);
>>>>        }
>>>>
>>>> -       while (server->tcpStatus == CifsNeedReconnect) {
>>>> +       do {
>>>>                try_to_freeze();
>>>>
>>>>                /* we should try only the port we connected to before */
>>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>>                                server->tcpStatus = CifsNeedNegotiate;
>>>>                        spin_unlock(&GlobalMid_Lock);
>>>>                }
>>>> -       }
>>>> +       } while (server->tcpStatus == CifsNeedReconnect);
>>>>
>>>>        return rc;
>>>>   }
>>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid,
>>>> struct cifs_ses *ses)
>>>>        }
>>>>        if (rc == 0) {
>>>>                spin_lock(&GlobalMid_Lock);
>>>> -               if (server->tcpStatus != CifsExiting)
>>>> +               if (server->tcpStatus == CifsNeedNegotiate)
>>>>                        server->tcpStatus = CifsGood;
>>>>                else
>>>>                        rc = -EHOSTDOWN;
>>>
>>>
>>> This has some merge issues on 3.6.38.8:
>>>
>>>
>>> <<<<<<<
>>>        while ((server->tcpStatus != CifsExiting)&&
>>>        (server->tcpStatus != CifsGood)) {
>>> =======
>>>        do {
>>>  >>>>>>>
>>>
>>> Should I keep your comparison for tcpStatus == CifsNeedReconnect
>>> instead of these != comparisons above?
>>>
>>>
>>> Thanks,
>>> Ben
>>>
>>
>> No, I think you probably just want to take patch fd88ce9313 too, which
>> should fix up the merge conflict.
>
> Ok, I've applied those two..we'll start testing with these patches
> today.  Might take a while before we are certain the fix works, as
> this isn't usually easy or fast to reproduce.
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
> Candela Technologies Inc  http://www.candelatech.com
>
>



-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]                                                                             ` <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-06-10 18:55                                                                               ` Ben Greear
  0 siblings, 0 replies; 23+ messages in thread
From: Ben Greear @ 2011-06-10 18:55 UTC (permalink / raw)
  To: Steve French; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 06/06/2011 06:00 PM, Steve French wrote:
> Ben,
> Thanks - this may be a very rare case - hard to prove without your testing
> but it looks like Jeff's patch makes sense.

We've had 3+ days of clean failover testing, so I think that
patch did solve the problem.

You are welcome to add my tested-by if you want.

Thanks,
Ben

>
> On Mon, Jun 6, 2011 at 12:22 PM, Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>> On 06/06/2011 09:51 AM, Jeff Layton wrote:
>>>
>>> On Mon, 06 Jun 2011 09:47:40 -0700
>>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>    wrote:
>>>
>>>> On 06/06/2011 06:45 AM, Jeff Layton wrote:
>>>>>
>>>>> On Sat, 4 Jun 2011 07:19:23 -0400
>>
>>>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket
>>>>>   pointer
>>>>>
>>>>> It's possible for the following set of events to happen:
>>>>>
>>>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace
>>>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
>>>>> gets a reply. But, while processing the reply, cifsd calls
>>>>> cifs_reconnect again.  Eventually the GlobalMid_Lock is dropped and the
>>>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
>>>>> CifsGood. cifs_reconnect then goes through and closes the socket and
>>>>> sets the
>>>>> pointer to zero, but because the status is now CifsGood, the new socket
>>>>> is not created and cifs_reconnect exits with the socket pointer set to
>>>>> NULL.
>>>>>
>>>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
>>>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
>>>>> called at least once in cifs_reconnect.
>>>>>
>>>>> Note that this is not a perfect fix for this issue. It's still possible
>>>>> that the NEGOTIATE reply is handled after the socket has been closed and
>>>>> reconnected. In that case, the socket state will look correct but it no
>>>>> NEGOTIATE was performed on it. In that situation though the server
>>>>> should just shut down the socket on the next attempted send, rather
>>>>> than causing the oops that occurs today.
>>>>>
>>>>> Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
>>>>> Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>>> ---
>>>>>    fs/cifs/connect.c |    6 +++---
>>>>>    1 files changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
>>>>> index 84c7307..8bb55bc 100644
>>>>> --- a/fs/cifs/connect.c
>>>>> +++ b/fs/cifs/connect.c
>>>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>>>                 mid_entry->callback(mid_entry);
>>>>>         }
>>>>>
>>>>> -       while (server->tcpStatus == CifsNeedReconnect) {
>>>>> +       do {
>>>>>                 try_to_freeze();
>>>>>
>>>>>                 /* we should try only the port we connected to before */
>>>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>>>                                 server->tcpStatus = CifsNeedNegotiate;
>>>>>                         spin_unlock(&GlobalMid_Lock);
>>>>>                 }
>>>>> -       }
>>>>> +       } while (server->tcpStatus == CifsNeedReconnect);
>>>>>
>>>>>         return rc;
>>>>>    }
>>>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid,
>>>>> struct cifs_ses *ses)
>>>>>         }
>>>>>         if (rc == 0) {
>>>>>                 spin_lock(&GlobalMid_Lock);
>>>>> -               if (server->tcpStatus != CifsExiting)
>>>>> +               if (server->tcpStatus == CifsNeedNegotiate)
>>>>>                         server->tcpStatus = CifsGood;
>>>>>                 else
>>>>>                         rc = -EHOSTDOWN;
>>>>
>>>>
>>>> This has some merge issues on 3.6.38.8:
>>>>
>>>>
>>>> <<<<<<<
>>>>         while ((server->tcpStatus != CifsExiting)&&
>>>>         (server->tcpStatus != CifsGood)) {
>>>> =======
>>>>         do {
>>>>   >>>>>>>
>>>>
>>>> Should I keep your comparison for tcpStatus == CifsNeedReconnect
>>>> instead of these != comparisons above?
>>>>
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>
>>> No, I think you probably just want to take patch fd88ce9313 too, which
>>> should fix up the merge conflict.
>>
>> Ok, I've applied those two..we'll start testing with these patches
>> today.  Might take a while before we are certain the fix works, as
>> this isn't usually easy or fast to reproduce.
>>
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
>> Candela Technologies Inc  http://www.candelatech.com
>>
>>
>
>
>


-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CIFS endless console spammage in 2.6.38.7
       [not found]             ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2011-05-31 20:51               ` Steve French
@ 2011-05-31 20:51               ` Ben Greear
  1 sibling, 0 replies; 23+ messages in thread
From: Ben Greear @ 2011-05-31 20:51 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On 05/31/2011 01:44 PM, Jeff Layton wrote:
> On Tue, 31 May 2011 12:45:37 -0700
> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>  wrote:
>
>> On 05/31/2011 12:36 PM, Steve French wrote:
>>> This is on setting up a session, so could be something like:
>>> - mount
>>> - do write
>>> - server crash
>>> - attempt to reconnect
>>> - socket returns ENOSOCK
>>> - attempt to reconnect ...
>>> - repeat
>>>
>>> Is this repeatable enough that we could modify the client to stop on
>>> the reconnect to see what is causing the socket to go bad and which
>>> operation we are repeating the reconnect on.
>>
>> Well, ENOTSOCK sounds like a pretty serious coding problem.  Maybe
>> a use-after-close or something?
>>
>> At the least, we could look for some particular errors (such as ENOTSOCK)
>> and print more info and do a more thorough job of cleaning up.
>>
>> Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well?
>>
>> Seems we can reproduce this only when our open-filer HA system
>> craps itself during failover, but we can get that to happen usually
>> within hours, sometimes maybe about a day.  And, CIFS errors don't always
>> happen when the HA cluster goes bad.
>>
>> So, I'm happy to test patches, but since it's a bit tricky to
>> reproduce this...I'm hoping to get the best info possible with
>> each patch iteration!
>>
>
> I had a report of a similar problem on a RHEL5 (2.6.18) kernel:
>
>      https://bugzilla.redhat.com/show_bug.cgi?id=704921
>
> In this case, it caused an oops as well. Your problem may or may not be
> the same, but if it is, I suspect that the root cause is a lack of
> clear locking rules for the TCP_Server_Info->tcpStatus.
>
> What I think happened in that case was that the client was in the
> middle of a NEGOTIATE request and got a response, and another reconnect
> occurred while it was processing it. While the client was tearing down
> and creating a new socket, the thread that issued the NEGOTIATE on the
> previous socket marked the tcpStatus as CifsGood.
>
> Fixing it looks to be anything but trivial. I'm not even quite sure how
> to approach it at this point. Suggestions welcome.

Well, I grepped through 2GB of console logs and found no oopses
in my case.

Seems to me that the retry logic either isn't being properly done,
or maybe it's just trying too often and stuck in basically a tight
loop writing logs to the console.  (My HA server cluster is still hosed,
left it busted while debugging this, so there is no way that CIFS can
actually recover the connection as of now.)

If it's just a log-spam tight loop, then rate-limitting the messages
should help, and some timeouts or backoffs should be added to CIFS.

Building new kernels now, and we'll try to reproduce with the
extra debugging code.

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2011-06-10 18:55 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-31 18:50 CIFS endless console spammage in 2.6.38.7 Ben Greear
     [not found] ` <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-05-31 19:36   ` Steve French
     [not found]     ` <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-05-31 19:45       ` Ben Greear
     [not found]         ` <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-05-31 20:44           ` Jeff Layton
     [not found]             ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-05-31 20:51               ` Steve French
     [not found]                 ` <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-05-31 20:53                   ` Ben Greear
     [not found]                     ` <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-05-31 20:54                       ` Steve French
     [not found]                         ` <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-06-01 18:01                           ` Jeff Layton
     [not found]                             ` <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-01 18:07                               ` Ben Greear
     [not found]                                 ` <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-01 19:06                                   ` Jeff Layton
     [not found]                                     ` <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-01 19:17                                       ` Ben Greear
     [not found]                                         ` <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-03 21:01                                           ` Ben Greear
     [not found]                                             ` <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-04  1:42                                               ` Jeff Layton
     [not found]                                                 ` <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
2011-06-04  5:03                                                   ` Ben Greear
     [not found]                                                     ` <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-04 11:19                                                       ` Jeff Layton
     [not found]                                                         ` <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
2011-06-06 13:45                                                           ` Jeff Layton
     [not found]                                                             ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-06 15:37                                                               ` Steve French
2011-06-06 16:47                                                               ` Ben Greear
     [not found]                                                                 ` <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-06 16:51                                                                   ` Jeff Layton
     [not found]                                                                     ` <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-06 17:22                                                                       ` Ben Greear
     [not found]                                                                         ` <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-07  1:00                                                                           ` Steve French
     [not found]                                                                             ` <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-06-10 18:55                                                                               ` Ben Greear
2011-05-31 20:51               ` Ben Greear

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.