* CIFS endless console spammage in 2.6.38.7
@ 2011-05-31 18:50 Ben Greear
[not found] ` <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org>
0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2011-05-31 18:50 UTC (permalink / raw)
To: linux-cifs-u79uwXL29TY76Z2rM5mHXA
Kernel is somewhat hacked, but no changes to CIFS.
While doing failover testing, we managed to get the cifs client
spewing endless serial console spammage. We can ping the system, but
otherwise cannot seem to interact with it. I tried serial-console sysrq
commands (blind, spewage makes it impossible to see any real results) to
turn logging to 0, but that didn't help (yet..going to let it run in case
there is just a huge backlog of messages).
The file-server cluster is in a bad state, but still not excuse
for the clients machine to become useless.
The spewage is at least primarily:
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
CIFS VFS: Send error in SessSetup = -88
Seems -88 probably means -ENOTSOCK.
At the least, perhaps the cERROR() messages
should be rate limitted?
This one is hard and slow to reproduce, but we'll
keep testing..and will try pertinent patches if someone
has some suggestions.
Thanks,
Ben
--
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 23+ messages in thread[parent not found: <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-05-31 19:36 ` Steve French [not found] ` <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Steve French @ 2011-05-31 19:36 UTC (permalink / raw) To: Ben Greear; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA This is on setting up a session, so could be something like: - mount - do write - server crash - attempt to reconnect - socket returns ENOSOCK - attempt to reconnect ... - repeat Is this repeatable enough that we could modify the client to stop on the reconnect to see what is causing the socket to go bad and which operation we are repeating the reconnect on. On Tue, May 31, 2011 at 1:50 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > Kernel is somewhat hacked, but no changes to CIFS. > > > While doing failover testing, we managed to get the cifs client > spewing endless serial console spammage. We can ping the system, but > otherwise cannot seem to interact with it. I tried serial-console sysrq > commands (blind, spewage makes it impossible to see any real results) to > turn logging to 0, but that didn't help (yet..going to let it run in case > there is just a huge backlog of messages). > > The file-server cluster is in a bad state, but still not excuse > for the clients machine to become useless. > > The spewage is at least primarily: > > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > CIFS VFS: Send error in SessSetup = -88 > > Seems -88 probably means -ENOTSOCK. > > At the least, perhaps the cERROR() messages > should be rate limitted? > > This one is hard and slow to reproduce, but we'll > keep testing..and will try pertinent patches if someone > has some suggestions. > > Thanks, > Ben > > -- > Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> > Candela Technologies Inc http://www.candelatech.com > > -- > To unsubscribe from this list: send the line "unsubscribe linux-cifs" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Thanks, Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-05-31 19:45 ` Ben Greear [not found] ` <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-05-31 19:45 UTC (permalink / raw) To: Steve French; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA On 05/31/2011 12:36 PM, Steve French wrote: > This is on setting up a session, so could be something like: > - mount > - do write > - server crash > - attempt to reconnect > - socket returns ENOSOCK > - attempt to reconnect ... > - repeat > > Is this repeatable enough that we could modify the client to stop on > the reconnect to see what is causing the socket to go bad and which > operation we are repeating the reconnect on. Well, ENOTSOCK sounds like a pretty serious coding problem. Maybe a use-after-close or something? At the least, we could look for some particular errors (such as ENOTSOCK) and print more info and do a more thorough job of cleaning up. Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well? Seems we can reproduce this only when our open-filer HA system craps itself during failover, but we can get that to happen usually within hours, sometimes maybe about a day. And, CIFS errors don't always happen when the HA cluster goes bad. So, I'm happy to test patches, but since it's a bit tricky to reproduce this...I'm hoping to get the best info possible with each patch iteration! Thanks, Ben > > > > On Tue, May 31, 2011 at 1:50 PM, Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >> Kernel is somewhat hacked, but no changes to CIFS. >> >> >> While doing failover testing, we managed to get the cifs client >> spewing endless serial console spammage. We can ping the system, but >> otherwise cannot seem to interact with it. I tried serial-console sysrq >> commands (blind, spewage makes it impossible to see any real results) to >> turn logging to 0, but that didn't help (yet..going to let it run in case >> there is just a huge backlog of messages). >> >> The file-server cluster is in a bad state, but still not excuse >> for the clients machine to become useless. >> >> The spewage is at least primarily: >> >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> CIFS VFS: Send error in SessSetup = -88 >> >> Seems -88 probably means -ENOTSOCK. >> >> At the least, perhaps the cERROR() messages >> should be rate limitted? >> >> This one is hard and slow to reproduce, but we'll >> keep testing..and will try pertinent patches if someone >> has some suggestions. >> >> Thanks, >> Ben >> >> -- >> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> >> Candela Technologies Inc http://www.candelatech.com >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-cifs" in >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > > -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-05-31 20:44 ` Jeff Layton [not found] ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jeff Layton @ 2011-05-31 20:44 UTC (permalink / raw) To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Tue, 31 May 2011 12:45:37 -0700 Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > On 05/31/2011 12:36 PM, Steve French wrote: > > This is on setting up a session, so could be something like: > > - mount > > - do write > > - server crash > > - attempt to reconnect > > - socket returns ENOSOCK > > - attempt to reconnect ... > > - repeat > > > > Is this repeatable enough that we could modify the client to stop on > > the reconnect to see what is causing the socket to go bad and which > > operation we are repeating the reconnect on. > > Well, ENOTSOCK sounds like a pretty serious coding problem. Maybe > a use-after-close or something? > > At the least, we could look for some particular errors (such as ENOTSOCK) > and print more info and do a more thorough job of cleaning up. > > Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well? > > Seems we can reproduce this only when our open-filer HA system > craps itself during failover, but we can get that to happen usually > within hours, sometimes maybe about a day. And, CIFS errors don't always > happen when the HA cluster goes bad. > > So, I'm happy to test patches, but since it's a bit tricky to > reproduce this...I'm hoping to get the best info possible with > each patch iteration! > I had a report of a similar problem on a RHEL5 (2.6.18) kernel: https://bugzilla.redhat.com/show_bug.cgi?id=704921 In this case, it caused an oops as well. Your problem may or may not be the same, but if it is, I suspect that the root cause is a lack of clear locking rules for the TCP_Server_Info->tcpStatus. What I think happened in that case was that the client was in the middle of a NEGOTIATE request and got a response, and another reconnect occurred while it was processing it. While the client was tearing down and creating a new socket, the thread that issued the NEGOTIATE on the previous socket marked the tcpStatus as CifsGood. Fixing it looks to be anything but trivial. I'm not even quite sure how to approach it at this point. Suggestions welcome. -- Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2011-05-31 20:51 ` Steve French [not found] ` <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2011-05-31 20:51 ` Ben Greear 1 sibling, 1 reply; 23+ messages in thread From: Steve French @ 2011-05-31 20:51 UTC (permalink / raw) To: Jeff Layton; +Cc: Ben Greear, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Tue, May 31, 2011 at 3:44 PM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, 31 May 2011 12:45:37 -0700 > Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > >> On 05/31/2011 12:36 PM, Steve French wrote: >> > This is on setting up a session, so could be something like: >> > - mount >> > - do write >> > - server crash >> > - attempt to reconnect >> > - socket returns ENOSOCK >> > - attempt to reconnect ... >> > - repeat >> > >> > Is this repeatable enough that we could modify the client to stop on >> > the reconnect to see what is causing the socket to go bad and which >> > operation we are repeating the reconnect on. >> >> Well, ENOTSOCK sounds like a pretty serious coding problem. Maybe >> a use-after-close or something? >> >> At the least, we could look for some particular errors (such as ENOTSOCK) >> and print more info and do a more thorough job of cleaning up. >> >> Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well? >> >> Seems we can reproduce this only when our open-filer HA system >> craps itself during failover, but we can get that to happen usually >> within hours, sometimes maybe about a day. And, CIFS errors don't always >> happen when the HA cluster goes bad. >> >> So, I'm happy to test patches, but since it's a bit tricky to >> reproduce this...I'm hoping to get the best info possible with >> each patch iteration! >> > > I had a report of a similar problem on a RHEL5 (2.6.18) kernel: > > https://bugzilla.redhat.com/show_bug.cgi?id=704921 > > In this case, it caused an oops as well. Your problem may or may not be > the same, but if it is, I suspect that the root cause is a lack of > clear locking rules for the TCP_Server_Info->tcpStatus. > > What I think happened in that case was that the client was in the > middle of a NEGOTIATE request and got a response, and another reconnect > occurred while it was processing it. While the client was tearing down > and creating a new socket, the thread that issued the NEGOTIATE on the > previous socket marked the tcpStatus as CifsGood. > > Fixing it looks to be anything but trivial. I'm not even quite sure how > to approach it at this point. Suggestions welcome. I thought the kernel was more recent than that - how recent is the kernel here? I think something related to cifs_sendv returning ENOTSOCK immediately when not reconnected could be related. -- Thanks, Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-05-31 20:53 ` Ben Greear [not found] ` <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-05-31 20:53 UTC (permalink / raw) To: Steve French; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 05/31/2011 01:51 PM, Steve French wrote: > On Tue, May 31, 2011 at 3:44 PM, Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> Fixing it looks to be anything but trivial. I'm not even quite sure how >> to approach it at this point. Suggestions welcome. > > I thought the kernel was more recent than that - how recent is the kernel here? > > I think something related to cifs_sendv returning ENOTSOCK immediately > when not reconnected could be related. My kernel is 2.6.38.7, quite recent. We're using the bind-to-local-IP features too, but not sure that matters. Ben -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-05-31 20:54 ` Steve French [not found] ` <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Steve French @ 2011-05-31 20:54 UTC (permalink / raw) To: Ben Greear; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA we will have more info when run with he quick and dirty modified logging On Tue, May 31, 2011 at 3:53 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > On 05/31/2011 01:51 PM, Steve French wrote: >> >> On Tue, May 31, 2011 at 3:44 PM, Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >>> Fixing it looks to be anything but trivial. I'm not even quite sure how >>> to approach it at this point. Suggestions welcome. >> >> I thought the kernel was more recent than that - how recent is the kernel >> here? >> >> I think something related to cifs_sendv returning ENOTSOCK immediately >> when not reconnected could be related. > > My kernel is 2.6.38.7, quite recent. We're using the bind-to-local-IP > features too, but not sure that matters. > > Ben > > -- > Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> > Candela Technologies Inc http://www.candelatech.com > > -- Thanks, Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-06-01 18:01 ` Jeff Layton [not found] ` <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jeff Layton @ 2011-06-01 18:01 UTC (permalink / raw) To: Steve French; +Cc: Ben Greear, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Tue, 31 May 2011 15:54:36 -0500 Steve French <smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > we will have more info when run with he quick and dirty modified logging > I'm not sure what that is, but what may be helpful is to launch a kernel debugger when this happens, track down the TCP_Server_Info and see what the state of the socket that hangs off of it is. If it's a NULL pointer or an already-closed socket, then that may help point the way to the root cause. > On Tue, May 31, 2011 at 3:53 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > > On 05/31/2011 01:51 PM, Steve French wrote: > >> > >> On Tue, May 31, 2011 at 3:44 PM, Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > >>> Fixing it looks to be anything but trivial. I'm not even quite sure how > >>> to approach it at this point. Suggestions welcome. > >> > >> I thought the kernel was more recent than that - how recent is the kernel > >> here? > >> > >> I think something related to cifs_sendv returning ENOTSOCK immediately > >> when not reconnected could be related. > > > > My kernel is 2.6.38.7, quite recent. We're using the bind-to-local-IP > > features too, but not sure that matters. > > > > Ben > > > > -- > > Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> > > Candela Technologies Inc http://www.candelatech.com > > > > > > > -- Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2011-06-01 18:07 ` Ben Greear [not found] ` <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-06-01 18:07 UTC (permalink / raw) To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 06/01/2011 11:01 AM, Jeff Layton wrote: > On Tue, 31 May 2011 15:54:36 -0500 > Steve French<smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> we will have more info when run with he quick and dirty modified logging >> > > I'm not sure what that is, but what may be helpful is to launch a > kernel debugger when this happens, track down the TCP_Server_Info and > see what the state of the socket that hangs off of it is. If it's a > NULL pointer or an already-closed socket, then that may help point the > way to the root cause. We put in some WARN_ON calls to get stack traces, and some other connection related logging. We should get a WARN_ON if the socket is NULL. We were not able to reproduce the problem last night..the file servers did screw up, but the CIFS clients acted normally. Thanks, Ben -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-06-01 19:06 ` Jeff Layton [not found] ` <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jeff Layton @ 2011-06-01 19:06 UTC (permalink / raw) To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Wed, 01 Jun 2011 11:07:58 -0700 Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > On 06/01/2011 11:01 AM, Jeff Layton wrote: > > On Tue, 31 May 2011 15:54:36 -0500 > > Steve French<smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > > >> we will have more info when run with he quick and dirty modified logging > >> > > > > I'm not sure what that is, but what may be helpful is to launch a > > kernel debugger when this happens, track down the TCP_Server_Info and > > see what the state of the socket that hangs off of it is. If it's a > > NULL pointer or an already-closed socket, then that may help point the > > way to the root cause. > > We put in some WARN_ON calls to get stack traces, and some other > connection related logging. We should get a WARN_ON if the socket is NULL. > > We were not able to reproduce the problem last night..the file servers did > screw up, but the CIFS clients acted normally. > Based on no real evidence at all and just a gut-feeling, I suspect that: 1) this is a long-standing bug ...and... 2) it's a race condition ...though it may be that recent changes have changed the timing enough to make it more likely (hard to say until we understand the problem better). Have you seen this happen more than once? -- Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2011-06-01 19:17 ` Ben Greear [not found] ` <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-06-01 19:17 UTC (permalink / raw) To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 06/01/2011 12:06 PM, Jeff Layton wrote: > On Wed, 01 Jun 2011 11:07:58 -0700 > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > >> On 06/01/2011 11:01 AM, Jeff Layton wrote: >>> On Tue, 31 May 2011 15:54:36 -0500 >>> Steve French<smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >>> >>>> we will have more info when run with he quick and dirty modified logging >>>> >>> >>> I'm not sure what that is, but what may be helpful is to launch a >>> kernel debugger when this happens, track down the TCP_Server_Info and >>> see what the state of the socket that hangs off of it is. If it's a >>> NULL pointer or an already-closed socket, then that may help point the >>> way to the root cause. >> >> We put in some WARN_ON calls to get stack traces, and some other >> connection related logging. We should get a WARN_ON if the socket is NULL. >> >> We were not able to reproduce the problem last night..the file servers did >> screw up, but the CIFS clients acted normally. >> > > Based on no real evidence at all and just a gut-feeling, I suspect that: > > 1) this is a long-standing bug > > ...and... > > 2) it's a race condition > > ...though it may be that recent changes have changed the timing enough > to make it more likely (hard to say until we understand the problem > better). > > Have you seen this happen more than once? I think so...but we are also testing iscsi and NFS failover concurrently, and for a while other instability was making it difficult to determine exactly what killed things (seems we had a bad HD that would often fail about the time iscsi did...thought it was software bug for a while, but after replacing the HD it's been running better.) We're going to crank up another machine with 100+ cifs mounts and see if that helps reproduce the bug faster. Current test is 20 IO threads, but only a single mount. Thanks, Ben -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-06-03 21:01 ` Ben Greear [not found] ` <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-06-03 21:01 UTC (permalink / raw) To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA Ok, we had some luck. Here's the backtrace and attending dmesg output. The filer has been doing failover, but it has not gone into a failed state...so, the system *should* be able to reconnect. We have the system in the failed state now and will leave it that way for a bit in case you have some commands you'd like me to run. Aside from the hung cifs processes (anything accessing those mounts gets into the D state), the system seems fine. CIFS VFS: Unexpected lookup error -112 CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Unexpected lookup error -11 CIFS VFS: Unexpected lookup error -112 CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Unexpected lookup error -112 CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Unexpected lookup error -11 CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: Reconnecting tcp session CIFS VFS: need to reconnect in sendv here ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() BUG: unable to handle kernel Hardware name: X8ST3 NULL pointer dereference Modules linked in: at 0000000000000020 be2iscsi IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e bnx2iPGD 0 cnic uio Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 mdio last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat ib_iserCPU 2 rdma_cm Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp libiscsi scsi_transport_iscsi Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl auth_rpcgss RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e sunrpc RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 ipv6 RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 uinput RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 i2c_i801 RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 e1000e R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 i2c_core FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 igb CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b ioatdma CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 iTCO_wdt DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 i7core_edac DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) pcspkr Stack: dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 0000000000000004Call Trace: ffff8802e64e5c30 ffffffff8135792c 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 Call Trace: [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 [<ffffffff8103838e>] ? need_resched+0x1e/0x28 [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] [<ffffffff8105c3bf>] kthread+0x7d/0x85 [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] [<ffffffff8105c342>] ? kthread+0x0/0x85 [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b 28 5b ---[ end trace 3387e7bab0a9c645 ]--- c9 c3 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 20 4c 8b 67 60 <48> 8b 7e 20 48 89 55 e0 48 89 4d d8 48 89 75 e8 44 89 45 d0 e8 RIP [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e RSP <ffff8802e64e5bc0> CR2: 0000000000000020 CIFS VFS: need to reconnect in sendv here CIFS VFS: need to reconnect in sendv here CIFS VFS: need to reconnect in sendv here ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]() Hardware name: X8ST3 Modules linked in: be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr md4 nls_utf8 cifs xt_TPROXY nf_tproxy_core xt_socket ip6_tables CIFS VFS: need to reconnect in sendv here nf_defrag_ipv6 xt_connlimit 8021q CIFS VFS: need to reconnect in sendv here ------------[ cut here ]------------ garp ------------[ cut here ]------------ bridge WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]() WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]() Hardware name: X8ST3 Hardware name: X8ST3 Modules linked in: stp Modules linked in: be2iscsi CIFS VFS: need to reconnect in sendv here llc iscsi_boot_sysfs fuse be2iscsi macvlan ------------[ cut here ]------------ bnx2i wanlink(P) iscsi_boot_sysfs cnic pktgen WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/connect.c:3230 cifs_setup_session+0xe5/0x1ff [cifs]() bnx2i uio Hardware name: X8ST3 cxgb3i cnic libcxgbi iscsi_tcp uio cxgb3 cxgb3i Modules linked in: libcxgbi mdio cxgb3 ib_iser be2iscsi mdio iscsi_boot_sysfs ib_iser rdma_cm rdma_cm bnx2i ib_cm cnic ib_cm iw_cm iw_cm ib_sa ib_sa ib_mad uio ib_mad cxgb3i ib_core ib_core ib_addr ib_addr libcxgbi md4 md4 nls_utf8 cifs cxgb3 xt_TPROXY mdio nf_tproxy_core nls_utf8 xt_socket ib_iser ip6_tables cifs rdma_cm xt_TPROXY nf_defrag_ipv6 ib_cm nf_tproxy_core libiscsi_tcp iw_cm libiscsi xt_connlimit xt_socket ib_sa scsi_transport_iscsi ip6_tables nfs 8021q ib_mad garp lockd bridge fscache ib_core nfs_acl nf_defrag_ipv6 auth_rpcgss xt_connlimit ib_addr stp md4 sunrpc llc nls_utf8 fuse ipv6 macvlan uinput wanlink(P) i2c_i801 cifs pktgen e1000e 8021q iscsi_tcp garp libiscsi_tcp xt_TPROXY libiscsi nf_tproxy_core bridge scsi_transport_iscsi xt_socket stp i2c_core llc ip6_tables igb fuse ioatdma nfs macvlan nf_defrag_ipv6 lockd iTCO_wdt fscache wanlink(P) nfs_acl pktgen xt_connlimit auth_rpcgss iscsi_tcp sunrpc 8021q libiscsi_tcp i7core_edac libiscsi ipv6 garp iTCO_vendor_support scsi_transport_iscsi uinput nfs i2c_i801 pcspkr e1000e lockd i2c_core fscache dca igb nfs_acl edac_core bridge ioatdma stp microcode iTCO_wdt [last unloaded: ipt_addrtype] llc auth_rpcgss i7core_edac iTCO_vendor_supportPid: 4754, comm: btserver Tainted: P W 2.6.38.8+ #12 pcspkr sunrpc dca edac_coreCall Trace: ipv6 fuse uinput microcode [last unloaded: ipt_addrtype] [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 i2c_i801 macvlan [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 Pid: 4734, comm: btserver Tainted: P W 2.6.38.8+ #12 Call Trace: [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs] e1000e wanlink(P) [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 pktgen [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 i2c_core iscsi_tcp [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs] libiscsi_tcp [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs] igb ioatdma [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs] libiscsi iTCO_wdt scsi_transport_iscsi [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 nfs [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 i7core_edac lockd [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] fscache [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] iTCO_vendor_support nfs_acl auth_rpcgss sunrpc [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] pcspkr [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 ipv6 dca [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] uinput [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 i2c_i801 edac_core [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 e1000e microcode [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c [last unloaded: ipt_addrtype] i2c_core igb [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] Pid: 7943, comm: btserver Tainted: P W 2.6.38.8+ #12 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c Call Trace: ioatdma [<ffffffff810e901f>] ? filp_close+0x3e/0x6d iTCO_wdt i7core_edac [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] ---[ end trace 3387e7bab0a9c646 ]--- iTCO_vendor_support [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 pcspkr CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. dca edac_core [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs] microcode [<ffffffff810e901f>] ? filp_close+0x3e/0x6d [last unloaded: ipt_addrtype] [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs] [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 Pid: 4740, comm: btserver Tainted: P W 2.6.38.8+ #12 [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] Call Trace: ---[ end trace 3387e7bab0a9c647 ]--- [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b [<ffffffffa031c2e6>] ? CIFSSMBRead+0x9a/0x277 [cifs] [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 ---[ end trace 3387e7bab0a9c648 ]--- [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. [<ffffffffa0329271>] ? cifs_readpage_worker+0x1d6/0x319 [cifs] [<ffffffffa031fc2a>] ? cifs_setup_session+0xe5/0x1ff [cifs] CIFS VFS: need to reconnect in sendv here [<ffffffffa03295a6>] ? cifs_readpage+0xb3/0xfd [cifs] [<ffffffffa0316012>] ? cifs_reconnect_tcon+0x1bf/0x2c9 [cifs] [<ffffffff810a5247>] ? generic_file_aio_read+0x468/0x5d1 [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 [<ffffffff810a4beb>] ? generic_file_aio_write+0x83/0xa1 [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] [<ffffffff810ea81e>] ? do_sync_read+0xc6/0x103 [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] [<ffffffff811a6871>] ? fsnotify_perm+0x61/0x6d [<ffffffff811a68d4>] ? security_file_permission+0x29/0x2e [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] [<ffffffff810eb2b5>] ? vfs_read+0xa6/0x102 [<ffffffff810a4b33>] ? __generic_file_aio_write+0x23d/0x272 [<ffffffff810eb3ca>] ? sys_read+0x45/0x6c [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 ---[ end trace 3387e7bab0a9c649 ]--- CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c [<ffffffffa03150c5>] ? cifs_file_aio_write+0x2d/0x5c [cifs] CIFS VFS: need to reconnect in sendv here [<ffffffff810ea71b>] ? do_sync_write+0xc6/0x103 CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. [<ffffffff811a68d4>] ? security_file_permission+0x29/0x2e [<ffffffff810eb0d3>] ? vfs_write+0xa9/0x105 [<ffffffff810eb1e8>] ? sys_write+0x45/0x6c [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b ---[ end trace 3387e7bab0a9c64a ]--- CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here cifs_setup_session: 127 callbacks suppressed CIFS VFS: need to reconnect in sendv here CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: Send error in SessSetup = -88 CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: Unexpected lookup error -88 CIFS VFS: Unexpected lookup error -88 CIFS VFS: need to reconnect in sendv here CIFS VFS: Send error in SessSetup = -88 CIFS VFS: need to reconnect in sendv here CIFS VFS: Send error in SessSetup = -88 CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: Unexpected lookup error -88 CIFS VFS: need to reconnect in sendv here CIFS VFS: Send error in SessSetup = -88 CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: Send error in SessSetup = -88 CIFS VFS: need to reconnect in sendv here CIFS VFS: Send error in SessSetup = -88 CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. CIFS VFS: need to reconnect in sendv here CIFS VFS: Send error in SessSetup = -88 CIFS VFS: SessSetup, ENOTSOCK, Sleep 15 seconds. -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-06-04 1:42 ` Jeff Layton [not found] ` <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jeff Layton @ 2011-06-04 1:42 UTC (permalink / raw) To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Fri, 03 Jun 2011 14:01:11 -0700 Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > Ok, we had some luck. Here's the backtrace and attending dmesg > output. The filer has been doing failover, but it has not gone > into a failed state...so, the system *should* be able to reconnect. > > We have the system in the failed state now and will leave it that way > for a bit in case you have some commands you'd like me to run. > > Aside from the hung cifs processes (anything accessing those mounts > gets into the D state), the system seems fine. > > > CIFS VFS: Unexpected lookup error -112 > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Unexpected lookup error -11 > CIFS VFS: Unexpected lookup error -112 > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Unexpected lookup error -112 > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Unexpected lookup error -11 > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: Reconnecting tcp session > CIFS VFS: need to reconnect in sendv here > ------------[ cut here ]------------ > WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > BUG: unable to handle kernel > Hardware name: X8ST3 > NULL pointer dereference > Modules linked in: at 0000000000000020 > be2iscsi > IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > bnx2iPGD 0 cnic > uio > Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 > mdio > last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat > ib_iserCPU 2 rdma_cm > Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY > ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY > nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi > scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc > i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp > libiscsi > scsi_transport_iscsi > Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl > auth_rpcgss > RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > sunrpc > RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 > ipv6 > RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 > uinput > RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 > i2c_i801 > RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 > R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 > e1000e > R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 > i2c_core > FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 > igb > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > ioatdma > CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 > iTCO_wdt > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > i7core_edac > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) > pcspkr > Stack: > dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 > > ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 > 0000000000000004Call Trace: > ffff8802e64e5c30 ffffffff8135792c > 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 > ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 > > Call Trace: > [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] > [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 > [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 > [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf > [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 > [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf > [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 > [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] > [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 > [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] > [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 > [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 > [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] > [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 > [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc > [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] > [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] > [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 > [<ffffffff8103838e>] ? need_resched+0x1e/0x28 > [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] > [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 > [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > [<ffffffff8105c3bf>] kthread+0x7d/0x85 > [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] > [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 > [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] > [<ffffffff8105c342>] ? kthread+0x0/0x85 > [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 > [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] > Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 > 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 > 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c > 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 > e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] > a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d > 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 > 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b > 28 5b > ---[ end trace 3387e7bab0a9c645 ]--- Kaboom. So you're seeing oopses too. Could you get a listing of the place where it oopsed by following the instructions here? http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses I suspect that "sock" is NULL in this case too and it blew up in kernel_recvmsg. -- Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org> @ 2011-06-04 5:03 ` Ben Greear [not found] ` <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-06-04 5:03 UTC (permalink / raw) To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 06/03/2011 06:42 PM, Jeff Layton wrote: > On Fri, 03 Jun 2011 14:01:11 -0700 > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > >> Ok, we had some luck. Here's the backtrace and attending dmesg >> output. The filer has been doing failover, but it has not gone >> into a failed state...so, the system *should* be able to reconnect. >> >> We have the system in the failed state now and will leave it that way >> for a bit in case you have some commands you'd like me to run. >> >> Aside from the hung cifs processes (anything accessing those mounts >> gets into the D state), the system seems fine. >> >> >> CIFS VFS: Unexpected lookup error -112 >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Unexpected lookup error -11 >> CIFS VFS: Unexpected lookup error -112 >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Unexpected lookup error -112 >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Unexpected lookup error -11 >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: Reconnecting tcp session >> CIFS VFS: need to reconnect in sendv here >> ------------[ cut here ]------------ >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() >> BUG: unable to handle kernel >> Hardware name: X8ST3 >> NULL pointer dereference >> Modules linked in: at 0000000000000020 >> be2iscsi >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> bnx2iPGD 0 cnic >> uio >> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 >> mdio >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat >> ib_iserCPU 2 rdma_cm >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp >> libiscsi >> scsi_transport_iscsi >> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl >> auth_rpcgss >> RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> sunrpc >> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 >> ipv6 >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 >> uinput >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 >> i2c_i801 >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 >> e1000e >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 >> i2c_core >> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 >> igb >> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> ioatdma >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 >> iTCO_wdt >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> i7core_edac >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) >> pcspkr >> Stack: >> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 >> >> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 >> 0000000000000004Call Trace: >> ffff8802e64e5c30 ffffffff8135792c >> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 >> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 >> >> Call Trace: >> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] >> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 >> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 >> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf >> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 >> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 >> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] >> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 >> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 >> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] >> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 >> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc >> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] >> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 >> [<ffffffff8103838e>] ? need_resched+0x1e/0x28 >> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >> [<ffffffff8105c3bf>] kthread+0x7d/0x85 >> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] >> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 >> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] >> [<ffffffff8105c342>] ? kthread+0x0/0x85 >> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 >> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] >> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 >> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 >> 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c >> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 >> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] >> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d >> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 >> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b >> 28 5b >> ---[ end trace 3387e7bab0a9c645 ]--- > > Kaboom. So you're seeing oopses too. Could you get a listing of the > place where it oopsed by following the instructions here? > > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses > > I suspect that "sock" is NULL in this case too and it blew up in > kernel_recvmsg. I added code to WARN_ON when ssocket was null. This isn't a real panic, just a WARN_ON: static int smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec) { int rc = 0; int i = 0; struct msghdr smb_msg; struct smb_hdr *smb_buffer = iov[0].iov_base; unsigned int len = iov[0].iov_len; unsigned int total_len; int first_vec = 0; unsigned int smb_buf_length = smb_buffer->smb_buf_length; struct socket *ssocket = server->ssocket; if (ssocket == NULL) { cERROR(1, "need to reconnect in sendv here"); *** HERE *** WARN_ON_ONCE(1); return -ENOTSOCK; /* BB eventually add reconnect code here */ } A second warn-on when ENOTSOCK is perculated up to the calling stack a bit causes the other stack dumpage. I think the one above is root cause...need to figure out how to have it gracefully bail out and re-connect when it hits this state, as current code just calls this general loop over and over again. Thanks, Ben -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-06-04 11:19 ` Jeff Layton [not found] ` <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jeff Layton @ 2011-06-04 11:19 UTC (permalink / raw) To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Fri, 03 Jun 2011 22:03:43 -0700 Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > On 06/03/2011 06:42 PM, Jeff Layton wrote: > > On Fri, 03 Jun 2011 14:01:11 -0700 > > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > > > >> Ok, we had some luck. Here's the backtrace and attending dmesg > >> output. The filer has been doing failover, but it has not gone > >> into a failed state...so, the system *should* be able to reconnect. > >> > >> We have the system in the failed state now and will leave it that way > >> for a bit in case you have some commands you'd like me to run. > >> > >> Aside from the hung cifs processes (anything accessing those mounts > >> gets into the D state), the system seems fine. > >> > >> > >> CIFS VFS: Unexpected lookup error -112 > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Unexpected lookup error -11 > >> CIFS VFS: Unexpected lookup error -112 > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Unexpected lookup error -112 > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Unexpected lookup error -11 > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: Reconnecting tcp session > >> CIFS VFS: need to reconnect in sendv here > >> ------------[ cut here ]------------ > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > >> BUG: unable to handle kernel > >> Hardware name: X8ST3 > >> NULL pointer dereference > >> Modules linked in: at 0000000000000020 > >> be2iscsi > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > >> bnx2iPGD 0 cnic > >> uio > >> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 > >> mdio > >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat > >> ib_iserCPU 2 rdma_cm > >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY > >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY > >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi > >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc > >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp > >> libiscsi > >> scsi_transport_iscsi > >> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl > >> auth_rpcgss > >> RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > >> sunrpc > >> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 > >> ipv6 > >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 > >> uinput > >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 > >> i2c_i801 > >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 > >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 > >> e1000e > >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 > >> i2c_core > >> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 > >> igb > >> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > >> ioatdma > >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 > >> iTCO_wdt > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >> i7core_edac > >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > >> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) > >> pcspkr > >> Stack: > >> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 > >> > >> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 > >> 0000000000000004Call Trace: > >> ffff8802e64e5c30 ffffffff8135792c > >> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 > >> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 > >> > >> Call Trace: > >> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] > >> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 > >> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 > >> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >> [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf > >> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 > >> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 > >> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] > >> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 > >> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 > >> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] > >> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 > >> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc > >> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] > >> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >> [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 > >> [<ffffffff8103838e>] ? need_resched+0x1e/0x28 > >> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > >> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > >> [<ffffffff8105c3bf>] kthread+0x7d/0x85 > >> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] > >> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 > >> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] > >> [<ffffffff8105c342>] ? kthread+0x0/0x85 > >> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 > >> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] > >> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 > >> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 > >> 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c > >> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 > >> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] > >> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d > >> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 > >> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b > >> 28 5b > >> ---[ end trace 3387e7bab0a9c645 ]--- > > > > Kaboom. So you're seeing oopses too. Could you get a listing of the > > place where it oopsed by following the instructions here? > > > > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses > > > > I suspect that "sock" is NULL in this case too and it blew up in > > kernel_recvmsg. > > I added code to WARN_ON when ssocket was null. This isn't a real panic, > just a WARN_ON: > > > static int > smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec) > { > int rc = 0; > int i = 0; > struct msghdr smb_msg; > struct smb_hdr *smb_buffer = iov[0].iov_base; > unsigned int len = iov[0].iov_len; > unsigned int total_len; > int first_vec = 0; > unsigned int smb_buf_length = smb_buffer->smb_buf_length; > struct socket *ssocket = server->ssocket; > > if (ssocket == NULL) { > cERROR(1, "need to reconnect in sendv here"); > *** HERE *** WARN_ON_ONCE(1); > return -ENOTSOCK; /* BB eventually add reconnect code here */ > } > > A second warn-on when ENOTSOCK is perculated up to the calling stack > a bit causes the other stack dumpage. I think the one above is root > cause...need to figure out how to have it gracefully bail out and re-connect > when it hits this state, as current code just calls this general loop over > and over again. > No, your warning is there, but it's Oopsing too: > >> ------------[ cut here ]------------ > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > >> BUG: unable to handle kernel > >> Hardware name: X8ST3 > >> NULL pointer dereference > >> Modules linked in: at 0000000000000020 > >> be2iscsi > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e ...smb_sendv is called by the "send" side which is generally a userspace process. The oops happened on the receive side. cifsd called kernel_recvmsg, and it looks like it passed in a NULL sock pointer. -- Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org> @ 2011-06-06 13:45 ` Jeff Layton [not found] ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jeff Layton @ 2011-06-06 13:45 UTC (permalink / raw) To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Sat, 4 Jun 2011 07:19:23 -0400 Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Fri, 03 Jun 2011 22:03:43 -0700 > Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > > > On 06/03/2011 06:42 PM, Jeff Layton wrote: > > > On Fri, 03 Jun 2011 14:01:11 -0700 > > > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > > > > > >> Ok, we had some luck. Here's the backtrace and attending dmesg > > >> output. The filer has been doing failover, but it has not gone > > >> into a failed state...so, the system *should* be able to reconnect. > > >> > > >> We have the system in the failed state now and will leave it that way > > >> for a bit in case you have some commands you'd like me to run. > > >> > > >> Aside from the hung cifs processes (anything accessing those mounts > > >> gets into the D state), the system seems fine. > > >> > > >> > > >> CIFS VFS: Unexpected lookup error -112 > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Unexpected lookup error -11 > > >> CIFS VFS: Unexpected lookup error -112 > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Unexpected lookup error -112 > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Unexpected lookup error -11 > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: Reconnecting tcp session > > >> CIFS VFS: need to reconnect in sendv here > > >> ------------[ cut here ]------------ > > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > > >> BUG: unable to handle kernel > > >> Hardware name: X8ST3 > > >> NULL pointer dereference > > >> Modules linked in: at 0000000000000020 > > >> be2iscsi > > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > > >> bnx2iPGD 0 cnic > > >> uio > > >> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 > > >> mdio > > >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat > > >> ib_iserCPU 2 rdma_cm > > >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY > > >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY > > >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi > > >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc > > >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp > > >> libiscsi > > >> scsi_transport_iscsi > > >> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl > > >> auth_rpcgss > > >> RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > > >> sunrpc > > >> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 > > >> ipv6 > > >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 > > >> uinput > > >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 > > >> i2c_i801 > > >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 > > >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 > > >> e1000e > > >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 > > >> i2c_core > > >> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 > > >> igb > > >> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > >> ioatdma > > >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 > > >> iTCO_wdt > > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > >> i7core_edac > > >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > >> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) > > >> pcspkr > > >> Stack: > > >> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 > > >> > > >> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 > > >> 0000000000000004Call Trace: > > >> ffff8802e64e5c30 ffffffff8135792c > > >> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 > > >> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 > > >> > > >> Call Trace: > > >> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] > > >> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 > > >> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 > > >> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf > > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > > >> [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 > > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > > >> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf > > >> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 > > >> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] > > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > > >> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 > > >> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] > > >> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 > > >> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 > > >> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] > > >> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 > > >> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc > > >> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] > > >> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] > > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > > >> [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 > > >> [<ffffffff8103838e>] ? need_resched+0x1e/0x28 > > >> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] > > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > > >> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 > > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > > >> [<ffffffff8105c3bf>] kthread+0x7d/0x85 > > >> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] > > >> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 > > >> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] > > >> [<ffffffff8105c342>] ? kthread+0x0/0x85 > > >> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 > > >> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] > > >> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 > > >> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 > > >> 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c > > >> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 > > >> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] > > >> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d > > >> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 > > >> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b > > >> 28 5b > > >> ---[ end trace 3387e7bab0a9c645 ]--- > > > > > > Kaboom. So you're seeing oopses too. Could you get a listing of the > > > place where it oopsed by following the instructions here? > > > > > > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses > > > > > > I suspect that "sock" is NULL in this case too and it blew up in > > > kernel_recvmsg. > > > > I added code to WARN_ON when ssocket was null. This isn't a real panic, > > just a WARN_ON: > > > > > > static int > > smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec) > > { > > int rc = 0; > > int i = 0; > > struct msghdr smb_msg; > > struct smb_hdr *smb_buffer = iov[0].iov_base; > > unsigned int len = iov[0].iov_len; > > unsigned int total_len; > > int first_vec = 0; > > unsigned int smb_buf_length = smb_buffer->smb_buf_length; > > struct socket *ssocket = server->ssocket; > > > > if (ssocket == NULL) { > > cERROR(1, "need to reconnect in sendv here"); > > *** HERE *** WARN_ON_ONCE(1); > > return -ENOTSOCK; /* BB eventually add reconnect code here */ > > } > > > > A second warn-on when ENOTSOCK is perculated up to the calling stack > > a bit causes the other stack dumpage. I think the one above is root > > cause...need to figure out how to have it gracefully bail out and re-connect > > when it hits this state, as current code just calls this general loop over > > and over again. > > > > No, your warning is there, but it's Oopsing too: > > > >> ------------[ cut here ]------------ > > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > > >> BUG: unable to handle kernel > > >> Hardware name: X8ST3 > > >> NULL pointer dereference > > >> Modules linked in: at 0000000000000020 > > >> be2iscsi > > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > > > ...smb_sendv is called by the "send" side which is generally a > userspace process. The oops happened on the receive side. cifsd called > kernel_recvmsg, and it looks like it passed in a NULL sock pointer. > I suspect that the following (untested) patch will fix this. I think the symptoms that you've seen are consistent with the patch description. Ben, would you be able to test this in your setup? This should at least prevent the oopses. ------------------[snip]-------------------- [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer It's possible for the following set of events to happen: cifsd calls cifs_reconnect which reconnects the socket. A userspace process then calls cifs_negotiate_protocol to handle the NEGOTIATE and gets a reply. But, while processing the reply, cifsd calls cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the reply from the earlier NEGOTIATE completes and the tcpStatus is set to CifsGood. cifs_reconnect then goes through and closes the socket and sets the pointer to zero, but because the status is now CifsGood, the new socket is not created and cifs_reconnect exits with the socket pointer set to NULL. Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is CifsNeedNegotiate, and by making sure that generic_ip_connect is always called at least once in cifs_reconnect. Note that this is not a perfect fix for this issue. It's still possible that the NEGOTIATE reply is handled after the socket has been closed and reconnected. In that case, the socket state will look correct but it no NEGOTIATE was performed on it. In that situation though the server should just shut down the socket on the next attempted send, rather than causing the oops that occurs today. Reported-by: Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Signed-off-by: Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- fs/cifs/connect.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index 84c7307..8bb55bc 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server) mid_entry->callback(mid_entry); } - while (server->tcpStatus == CifsNeedReconnect) { + do { try_to_freeze(); /* we should try only the port we connected to before */ @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server) server->tcpStatus = CifsNeedNegotiate; spin_unlock(&GlobalMid_Lock); } - } + } while (server->tcpStatus == CifsNeedReconnect); return rc; } @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses) } if (rc == 0) { spin_lock(&GlobalMid_Lock); - if (server->tcpStatus != CifsExiting) + if (server->tcpStatus == CifsNeedNegotiate) server->tcpStatus = CifsGood; else rc = -EHOSTDOWN; -- 1.7.5.2 -- Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply related [flat|nested] 23+ messages in thread
[parent not found: <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2011-06-06 15:37 ` Steve French 2011-06-06 16:47 ` Ben Greear 1 sibling, 0 replies; 23+ messages in thread From: Steve French @ 2011-06-06 15:37 UTC (permalink / raw) To: Jeff Layton; +Cc: Ben Greear, linux-cifs-u79uwXL29TY76Z2rM5mHXA Sounds promising. Any others have thoughts about Jeff's proposed solution? Ben, If you get test data on this with and without patch - let us know. On Mon, Jun 6, 2011 at 8:45 AM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Sat, 4 Jun 2011 07:19:23 -0400 > Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >> On Fri, 03 Jun 2011 22:03:43 -0700 >> Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >> >> > On 06/03/2011 06:42 PM, Jeff Layton wrote: >> > > On Fri, 03 Jun 2011 14:01:11 -0700 >> > > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >> > > >> > >> Ok, we had some luck. Here's the backtrace and attending dmesg >> > >> output. The filer has been doing failover, but it has not gone >> > >> into a failed state...so, the system *should* be able to reconnect. >> > >> >> > >> We have the system in the failed state now and will leave it that way >> > >> for a bit in case you have some commands you'd like me to run. >> > >> >> > >> Aside from the hung cifs processes (anything accessing those mounts >> > >> gets into the D state), the system seems fine. >> > >> >> > >> >> > >> CIFS VFS: Unexpected lookup error -112 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Unexpected lookup error -11 >> > >> CIFS VFS: Unexpected lookup error -112 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Unexpected lookup error -112 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Unexpected lookup error -11 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: need to reconnect in sendv here >> > >> ------------[ cut here ]------------ >> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() >> > >> BUG: unable to handle kernel >> > >> Hardware name: X8ST3 >> > >> NULL pointer dereference >> > >> Modules linked in: at 0000000000000020 >> > >> be2iscsi >> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> > >> bnx2iPGD 0 cnic >> > >> uio >> > >> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 >> > >> mdio >> > >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat >> > >> ib_iserCPU 2 rdma_cm >> > >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY >> > >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY >> > >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi >> > >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc >> > >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp >> > >> libiscsi >> > >> scsi_transport_iscsi >> > >> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl >> > >> auth_rpcgss >> > >> RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> > >> sunrpc >> > >> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 >> > >> ipv6 >> > >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 >> > >> uinput >> > >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 >> > >> i2c_i801 >> > >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 >> > >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 >> > >> e1000e >> > >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 >> > >> i2c_core >> > >> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 >> > >> igb >> > >> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> > >> ioatdma >> > >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 >> > >> iTCO_wdt >> > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> > >> i7core_edac >> > >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> > >> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) >> > >> pcspkr >> > >> Stack: >> > >> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 >> > >> >> > >> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 >> > >> 0000000000000004Call Trace: >> > >> ffff8802e64e5c30 ffffffff8135792c >> > >> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 >> > >> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 >> > >> >> > >> Call Trace: >> > >> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] >> > >> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 >> > >> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 >> > >> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf >> > >> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 >> > >> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 >> > >> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] >> > >> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 >> > >> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 >> > >> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] >> > >> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 >> > >> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc >> > >> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] >> > >> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 >> > >> [<ffffffff8103838e>] ? need_resched+0x1e/0x28 >> > >> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] >> > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >> > >> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 >> > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >> > >> [<ffffffff8105c3bf>] kthread+0x7d/0x85 >> > >> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] >> > >> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 >> > >> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] >> > >> [<ffffffff8105c342>] ? kthread+0x0/0x85 >> > >> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 >> > >> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] >> > >> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 >> > >> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 >> > >> 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c >> > >> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 >> > >> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] >> > >> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d >> > >> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 >> > >> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b >> > >> 28 5b >> > >> ---[ end trace 3387e7bab0a9c645 ]--- >> > > >> > > Kaboom. So you're seeing oopses too. Could you get a listing of the >> > > place where it oopsed by following the instructions here? >> > > >> > > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses >> > > >> > > I suspect that "sock" is NULL in this case too and it blew up in >> > > kernel_recvmsg. >> > >> > I added code to WARN_ON when ssocket was null. This isn't a real panic, >> > just a WARN_ON: >> > >> > >> > static int >> > smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec) >> > { >> > int rc = 0; >> > int i = 0; >> > struct msghdr smb_msg; >> > struct smb_hdr *smb_buffer = iov[0].iov_base; >> > unsigned int len = iov[0].iov_len; >> > unsigned int total_len; >> > int first_vec = 0; >> > unsigned int smb_buf_length = smb_buffer->smb_buf_length; >> > struct socket *ssocket = server->ssocket; >> > >> > if (ssocket == NULL) { >> > cERROR(1, "need to reconnect in sendv here"); >> > *** HERE *** WARN_ON_ONCE(1); >> > return -ENOTSOCK; /* BB eventually add reconnect code here */ >> > } >> > >> > A second warn-on when ENOTSOCK is perculated up to the calling stack >> > a bit causes the other stack dumpage. I think the one above is root >> > cause...need to figure out how to have it gracefully bail out and re-connect >> > when it hits this state, as current code just calls this general loop over >> > and over again. >> > >> >> No, your warning is there, but it's Oopsing too: >> >> > >> ------------[ cut here ]------------ >> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() >> > >> BUG: unable to handle kernel >> > >> Hardware name: X8ST3 >> > >> NULL pointer dereference >> > >> Modules linked in: at 0000000000000020 >> > >> be2iscsi >> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> >> >> ...smb_sendv is called by the "send" side which is generally a >> userspace process. The oops happened on the receive side. cifsd called >> kernel_recvmsg, and it looks like it passed in a NULL sock pointer. >> > > I suspect that the following (untested) patch will fix this. I think > the symptoms that you've seen are consistent with the patch > description. Ben, would you be able to test this in your setup? This > should at least prevent the oopses. > > ------------------[snip]-------------------- > > [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer > > It's possible for the following set of events to happen: > > cifsd calls cifs_reconnect which reconnects the socket. A userspace > process then calls cifs_negotiate_protocol to handle the NEGOTIATE and > gets a reply. But, while processing the reply, cifsd calls > cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the > reply from the earlier NEGOTIATE completes and the tcpStatus is set to > CifsGood. cifs_reconnect then goes through and closes the socket and sets the > pointer to zero, but because the status is now CifsGood, the new socket > is not created and cifs_reconnect exits with the socket pointer set to > NULL. > > Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is > CifsNeedNegotiate, and by making sure that generic_ip_connect is always > called at least once in cifs_reconnect. > > Note that this is not a perfect fix for this issue. It's still possible > that the NEGOTIATE reply is handled after the socket has been closed and > reconnected. In that case, the socket state will look correct but it no > NEGOTIATE was performed on it. In that situation though the server > should just shut down the socket on the next attempted send, rather > than causing the oops that occurs today. > > Reported-by: Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> > Signed-off-by: Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > fs/cifs/connect.c | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c > index 84c7307..8bb55bc 100644 > --- a/fs/cifs/connect.c > +++ b/fs/cifs/connect.c > @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server) > mid_entry->callback(mid_entry); > } > > - while (server->tcpStatus == CifsNeedReconnect) { > + do { > try_to_freeze(); > > /* we should try only the port we connected to before */ > @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server) > server->tcpStatus = CifsNeedNegotiate; > spin_unlock(&GlobalMid_Lock); > } > - } > + } while (server->tcpStatus == CifsNeedReconnect); > > return rc; > } > @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses) > } > if (rc == 0) { > spin_lock(&GlobalMid_Lock); > - if (server->tcpStatus != CifsExiting) > + if (server->tcpStatus == CifsNeedNegotiate) > server->tcpStatus = CifsGood; > else > rc = -EHOSTDOWN; > -- > 1.7.5.2 > > > -- > Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > -- Thanks, Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 2011-06-06 15:37 ` Steve French @ 2011-06-06 16:47 ` Ben Greear [not found] ` <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org> 1 sibling, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-06-06 16:47 UTC (permalink / raw) To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 06/06/2011 06:45 AM, Jeff Layton wrote: > On Sat, 4 Jun 2011 07:19:23 -0400 > Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >> On Fri, 03 Jun 2011 22:03:43 -0700 >> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >> >>> On 06/03/2011 06:42 PM, Jeff Layton wrote: >>>> On Fri, 03 Jun 2011 14:01:11 -0700 >>>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >>>> >>>>> Ok, we had some luck. Here's the backtrace and attending dmesg >>>>> output. The filer has been doing failover, but it has not gone >>>>> into a failed state...so, the system *should* be able to reconnect. >>>>> >>>>> We have the system in the failed state now and will leave it that way >>>>> for a bit in case you have some commands you'd like me to run. >>>>> >>>>> Aside from the hung cifs processes (anything accessing those mounts >>>>> gets into the D state), the system seems fine. >>>>> >>>>> >>>>> CIFS VFS: Unexpected lookup error -112 >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Unexpected lookup error -11 >>>>> CIFS VFS: Unexpected lookup error -112 >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Unexpected lookup error -112 >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Unexpected lookup error -11 >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: Reconnecting tcp session >>>>> CIFS VFS: need to reconnect in sendv here >>>>> ------------[ cut here ]------------ >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() >>>>> BUG: unable to handle kernel >>>>> Hardware name: X8ST3 >>>>> NULL pointer dereference >>>>> Modules linked in: at 0000000000000020 >>>>> be2iscsi >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >>>>> bnx2iPGD 0 cnic >>>>> uio >>>>> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 >>>>> mdio >>>>> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat >>>>> ib_iserCPU 2 rdma_cm >>>>> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY >>>>> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY >>>>> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi >>>>> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc >>>>> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp >>>>> libiscsi >>>>> scsi_transport_iscsi >>>>> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl >>>>> auth_rpcgss >>>>> RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >>>>> sunrpc >>>>> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 >>>>> ipv6 >>>>> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 >>>>> uinput >>>>> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 >>>>> i2c_i801 >>>>> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 >>>>> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 >>>>> e1000e >>>>> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 >>>>> i2c_core >>>>> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 >>>>> igb >>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >>>>> ioatdma >>>>> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 >>>>> iTCO_wdt >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>>>> i7core_edac >>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >>>>> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) >>>>> pcspkr >>>>> Stack: >>>>> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 >>>>> >>>>> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 >>>>> 0000000000000004Call Trace: >>>>> ffff8802e64e5c30 ffffffff8135792c >>>>> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 >>>>> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 >>>>> >>>>> Call Trace: >>>>> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] >>>>> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 >>>>> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 >>>>> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >>>>> [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >>>>> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf >>>>> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 >>>>> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >>>>> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 >>>>> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] >>>>> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 >>>>> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 >>>>> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] >>>>> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 >>>>> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc >>>>> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] >>>>> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >>>>> [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 >>>>> [<ffffffff8103838e>] ? need_resched+0x1e/0x28 >>>>> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] >>>>> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >>>>> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 >>>>> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >>>>> [<ffffffff8105c3bf>] kthread+0x7d/0x85 >>>>> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] >>>>> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 >>>>> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] >>>>> [<ffffffff8105c342>] ? kthread+0x0/0x85 >>>>> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 >>>>> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] >>>>> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 >>>>> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 >>>>> 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c >>>>> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 >>>>> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] >>>>> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d >>>>> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 >>>>> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b >>>>> 28 5b >>>>> ---[ end trace 3387e7bab0a9c645 ]--- >>>> >>>> Kaboom. So you're seeing oopses too. Could you get a listing of the >>>> place where it oopsed by following the instructions here? >>>> >>>> http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses >>>> >>>> I suspect that "sock" is NULL in this case too and it blew up in >>>> kernel_recvmsg. >>> >>> I added code to WARN_ON when ssocket was null. This isn't a real panic, >>> just a WARN_ON: >>> >>> >>> static int >>> smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec) >>> { >>> int rc = 0; >>> int i = 0; >>> struct msghdr smb_msg; >>> struct smb_hdr *smb_buffer = iov[0].iov_base; >>> unsigned int len = iov[0].iov_len; >>> unsigned int total_len; >>> int first_vec = 0; >>> unsigned int smb_buf_length = smb_buffer->smb_buf_length; >>> struct socket *ssocket = server->ssocket; >>> >>> if (ssocket == NULL) { >>> cERROR(1, "need to reconnect in sendv here"); >>> *** HERE *** WARN_ON_ONCE(1); >>> return -ENOTSOCK; /* BB eventually add reconnect code here */ >>> } >>> >>> A second warn-on when ENOTSOCK is perculated up to the calling stack >>> a bit causes the other stack dumpage. I think the one above is root >>> cause...need to figure out how to have it gracefully bail out and re-connect >>> when it hits this state, as current code just calls this general loop over >>> and over again. >>> >> >> No, your warning is there, but it's Oopsing too: >> >>>>> ------------[ cut here ]------------ >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() >>>>> BUG: unable to handle kernel >>>>> Hardware name: X8ST3 >>>>> NULL pointer dereference >>>>> Modules linked in: at 0000000000000020 >>>>> be2iscsi >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> >> >> ...smb_sendv is called by the "send" side which is generally a >> userspace process. The oops happened on the receive side. cifsd called >> kernel_recvmsg, and it looks like it passed in a NULL sock pointer. >> > > I suspect that the following (untested) patch will fix this. I think > the symptoms that you've seen are consistent with the patch > description. Ben, would you be able to test this in your setup? This > should at least prevent the oopses. > > ------------------[snip]-------------------- > > [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer > > It's possible for the following set of events to happen: > > cifsd calls cifs_reconnect which reconnects the socket. A userspace > process then calls cifs_negotiate_protocol to handle the NEGOTIATE and > gets a reply. But, while processing the reply, cifsd calls > cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the > reply from the earlier NEGOTIATE completes and the tcpStatus is set to > CifsGood. cifs_reconnect then goes through and closes the socket and sets the > pointer to zero, but because the status is now CifsGood, the new socket > is not created and cifs_reconnect exits with the socket pointer set to > NULL. > > Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is > CifsNeedNegotiate, and by making sure that generic_ip_connect is always > called at least once in cifs_reconnect. > > Note that this is not a perfect fix for this issue. It's still possible > that the NEGOTIATE reply is handled after the socket has been closed and > reconnected. In that case, the socket state will look correct but it no > NEGOTIATE was performed on it. In that situation though the server > should just shut down the socket on the next attempted send, rather > than causing the oops that occurs today. > > Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> > Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > fs/cifs/connect.c | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c > index 84c7307..8bb55bc 100644 > --- a/fs/cifs/connect.c > +++ b/fs/cifs/connect.c > @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server) > mid_entry->callback(mid_entry); > } > > - while (server->tcpStatus == CifsNeedReconnect) { > + do { > try_to_freeze(); > > /* we should try only the port we connected to before */ > @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server) > server->tcpStatus = CifsNeedNegotiate; > spin_unlock(&GlobalMid_Lock); > } > - } > + } while (server->tcpStatus == CifsNeedReconnect); > > return rc; > } > @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses) > } > if (rc == 0) { > spin_lock(&GlobalMid_Lock); > - if (server->tcpStatus != CifsExiting) > + if (server->tcpStatus == CifsNeedNegotiate) > server->tcpStatus = CifsGood; > else > rc = -EHOSTDOWN; This has some merge issues on 3.6.38.8: <<<<<<< while ((server->tcpStatus != CifsExiting) && (server->tcpStatus != CifsGood)) { ======= do { >>>>>>> Should I keep your comparison for tcpStatus == CifsNeedReconnect instead of these != comparisons above? Thanks, Ben -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-06-06 16:51 ` Jeff Layton [not found] ` <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jeff Layton @ 2011-06-06 16:51 UTC (permalink / raw) To: Ben Greear; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Mon, 06 Jun 2011 09:47:40 -0700 Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > On 06/06/2011 06:45 AM, Jeff Layton wrote: > > On Sat, 4 Jun 2011 07:19:23 -0400 > > Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > >> On Fri, 03 Jun 2011 22:03:43 -0700 > >> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > >> > >>> On 06/03/2011 06:42 PM, Jeff Layton wrote: > >>>> On Fri, 03 Jun 2011 14:01:11 -0700 > >>>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > >>>> > >>>>> Ok, we had some luck. Here's the backtrace and attending dmesg > >>>>> output. The filer has been doing failover, but it has not gone > >>>>> into a failed state...so, the system *should* be able to reconnect. > >>>>> > >>>>> We have the system in the failed state now and will leave it that way > >>>>> for a bit in case you have some commands you'd like me to run. > >>>>> > >>>>> Aside from the hung cifs processes (anything accessing those mounts > >>>>> gets into the D state), the system seems fine. > >>>>> > >>>>> > >>>>> CIFS VFS: Unexpected lookup error -112 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Unexpected lookup error -11 > >>>>> CIFS VFS: Unexpected lookup error -112 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Unexpected lookup error -112 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Unexpected lookup error -11 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: need to reconnect in sendv here > >>>>> ------------[ cut here ]------------ > >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > >>>>> BUG: unable to handle kernel > >>>>> Hardware name: X8ST3 > >>>>> NULL pointer dereference > >>>>> Modules linked in: at 0000000000000020 > >>>>> be2iscsi > >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > >>>>> bnx2iPGD 0 cnic > >>>>> uio > >>>>> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 > >>>>> mdio > >>>>> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat > >>>>> ib_iserCPU 2 rdma_cm > >>>>> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY > >>>>> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY > >>>>> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi > >>>>> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc > >>>>> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp > >>>>> libiscsi > >>>>> scsi_transport_iscsi > >>>>> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl > >>>>> auth_rpcgss > >>>>> RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > >>>>> sunrpc > >>>>> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 > >>>>> ipv6 > >>>>> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 > >>>>> uinput > >>>>> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 > >>>>> i2c_i801 > >>>>> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 > >>>>> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 > >>>>> e1000e > >>>>> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 > >>>>> i2c_core > >>>>> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 > >>>>> igb > >>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > >>>>> ioatdma > >>>>> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 > >>>>> iTCO_wdt > >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >>>>> i7core_edac > >>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > >>>>> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) > >>>>> pcspkr > >>>>> Stack: > >>>>> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 > >>>>> > >>>>> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 > >>>>> 0000000000000004Call Trace: > >>>>> ffff8802e64e5c30 ffffffff8135792c > >>>>> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 > >>>>> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 > >>>>> > >>>>> Call Trace: > >>>>> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] > >>>>> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 > >>>>> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 > >>>>> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf > >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >>>>> [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 > >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >>>>> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf > >>>>> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 > >>>>> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] > >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >>>>> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 > >>>>> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] > >>>>> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 > >>>>> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 > >>>>> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] > >>>>> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 > >>>>> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc > >>>>> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] > >>>>> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] > >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >>>>> [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16 > >>>>> [<ffffffff8103838e>] ? need_resched+0x1e/0x28 > >>>>> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] > >>>>> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > >>>>> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 > >>>>> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] > >>>>> [<ffffffff8105c3bf>] kthread+0x7d/0x85 > >>>>> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] > >>>>> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 > >>>>> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] > >>>>> [<ffffffff8105c342>] ? kthread+0x0/0x85 > >>>>> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 > >>>>> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] > >>>>> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 > >>>>> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 > >>>>> 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c > >>>>> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 > >>>>> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] > >>>>> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d > >>>>> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 > >>>>> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b > >>>>> 28 5b > >>>>> ---[ end trace 3387e7bab0a9c645 ]--- > >>>> > >>>> Kaboom. So you're seeing oopses too. Could you get a listing of the > >>>> place where it oopsed by following the instructions here? > >>>> > >>>> http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses > >>>> > >>>> I suspect that "sock" is NULL in this case too and it blew up in > >>>> kernel_recvmsg. > >>> > >>> I added code to WARN_ON when ssocket was null. This isn't a real panic, > >>> just a WARN_ON: > >>> > >>> > >>> static int > >>> smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec) > >>> { > >>> int rc = 0; > >>> int i = 0; > >>> struct msghdr smb_msg; > >>> struct smb_hdr *smb_buffer = iov[0].iov_base; > >>> unsigned int len = iov[0].iov_len; > >>> unsigned int total_len; > >>> int first_vec = 0; > >>> unsigned int smb_buf_length = smb_buffer->smb_buf_length; > >>> struct socket *ssocket = server->ssocket; > >>> > >>> if (ssocket == NULL) { > >>> cERROR(1, "need to reconnect in sendv here"); > >>> *** HERE *** WARN_ON_ONCE(1); > >>> return -ENOTSOCK; /* BB eventually add reconnect code here */ > >>> } > >>> > >>> A second warn-on when ENOTSOCK is perculated up to the calling stack > >>> a bit causes the other stack dumpage. I think the one above is root > >>> cause...need to figure out how to have it gracefully bail out and re-connect > >>> when it hits this state, as current code just calls this general loop over > >>> and over again. > >>> > >> > >> No, your warning is there, but it's Oopsing too: > >> > >>>>> ------------[ cut here ]------------ > >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > >>>>> BUG: unable to handle kernel > >>>>> Hardware name: X8ST3 > >>>>> NULL pointer dereference > >>>>> Modules linked in: at 0000000000000020 > >>>>> be2iscsi > >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > >> > >> > >> ...smb_sendv is called by the "send" side which is generally a > >> userspace process. The oops happened on the receive side. cifsd called > >> kernel_recvmsg, and it looks like it passed in a NULL sock pointer. > >> > > > > I suspect that the following (untested) patch will fix this. I think > > the symptoms that you've seen are consistent with the patch > > description. Ben, would you be able to test this in your setup? This > > should at least prevent the oopses. > > > > ------------------[snip]-------------------- > > > > [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer > > > > It's possible for the following set of events to happen: > > > > cifsd calls cifs_reconnect which reconnects the socket. A userspace > > process then calls cifs_negotiate_protocol to handle the NEGOTIATE and > > gets a reply. But, while processing the reply, cifsd calls > > cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the > > reply from the earlier NEGOTIATE completes and the tcpStatus is set to > > CifsGood. cifs_reconnect then goes through and closes the socket and sets the > > pointer to zero, but because the status is now CifsGood, the new socket > > is not created and cifs_reconnect exits with the socket pointer set to > > NULL. > > > > Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is > > CifsNeedNegotiate, and by making sure that generic_ip_connect is always > > called at least once in cifs_reconnect. > > > > Note that this is not a perfect fix for this issue. It's still possible > > that the NEGOTIATE reply is handled after the socket has been closed and > > reconnected. In that case, the socket state will look correct but it no > > NEGOTIATE was performed on it. In that situation though the server > > should just shut down the socket on the next attempted send, rather > > than causing the oops that occurs today. > > > > Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> > > Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > --- > > fs/cifs/connect.c | 6 +++--- > > 1 files changed, 3 insertions(+), 3 deletions(-) > > > > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c > > index 84c7307..8bb55bc 100644 > > --- a/fs/cifs/connect.c > > +++ b/fs/cifs/connect.c > > @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server) > > mid_entry->callback(mid_entry); > > } > > > > - while (server->tcpStatus == CifsNeedReconnect) { > > + do { > > try_to_freeze(); > > > > /* we should try only the port we connected to before */ > > @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server) > > server->tcpStatus = CifsNeedNegotiate; > > spin_unlock(&GlobalMid_Lock); > > } > > - } > > + } while (server->tcpStatus == CifsNeedReconnect); > > > > return rc; > > } > > @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses) > > } > > if (rc == 0) { > > spin_lock(&GlobalMid_Lock); > > - if (server->tcpStatus != CifsExiting) > > + if (server->tcpStatus == CifsNeedNegotiate) > > server->tcpStatus = CifsGood; > > else > > rc = -EHOSTDOWN; > > > This has some merge issues on 3.6.38.8: > > > <<<<<<< > while ((server->tcpStatus != CifsExiting) && > (server->tcpStatus != CifsGood)) { > ======= > do { > >>>>>>> > > Should I keep your comparison for tcpStatus == CifsNeedReconnect > instead of these != comparisons above? > > > Thanks, > Ben > No, I think you probably just want to take patch fd88ce9313 too, which should fix up the merge conflict. -- Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2011-06-06 17:22 ` Ben Greear [not found] ` <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Ben Greear @ 2011-06-06 17:22 UTC (permalink / raw) To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 06/06/2011 09:51 AM, Jeff Layton wrote: > On Mon, 06 Jun 2011 09:47:40 -0700 > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > >> On 06/06/2011 06:45 AM, Jeff Layton wrote: >>> On Sat, 4 Jun 2011 07:19:23 -0400 >>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer >>> >>> It's possible for the following set of events to happen: >>> >>> cifsd calls cifs_reconnect which reconnects the socket. A userspace >>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and >>> gets a reply. But, while processing the reply, cifsd calls >>> cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the >>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to >>> CifsGood. cifs_reconnect then goes through and closes the socket and sets the >>> pointer to zero, but because the status is now CifsGood, the new socket >>> is not created and cifs_reconnect exits with the socket pointer set to >>> NULL. >>> >>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is >>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always >>> called at least once in cifs_reconnect. >>> >>> Note that this is not a perfect fix for this issue. It's still possible >>> that the NEGOTIATE reply is handled after the socket has been closed and >>> reconnected. In that case, the socket state will look correct but it no >>> NEGOTIATE was performed on it. In that situation though the server >>> should just shut down the socket on the next attempted send, rather >>> than causing the oops that occurs today. >>> >>> Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> >>> Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >>> --- >>> fs/cifs/connect.c | 6 +++--- >>> 1 files changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c >>> index 84c7307..8bb55bc 100644 >>> --- a/fs/cifs/connect.c >>> +++ b/fs/cifs/connect.c >>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server) >>> mid_entry->callback(mid_entry); >>> } >>> >>> - while (server->tcpStatus == CifsNeedReconnect) { >>> + do { >>> try_to_freeze(); >>> >>> /* we should try only the port we connected to before */ >>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server) >>> server->tcpStatus = CifsNeedNegotiate; >>> spin_unlock(&GlobalMid_Lock); >>> } >>> - } >>> + } while (server->tcpStatus == CifsNeedReconnect); >>> >>> return rc; >>> } >>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses) >>> } >>> if (rc == 0) { >>> spin_lock(&GlobalMid_Lock); >>> - if (server->tcpStatus != CifsExiting) >>> + if (server->tcpStatus == CifsNeedNegotiate) >>> server->tcpStatus = CifsGood; >>> else >>> rc = -EHOSTDOWN; >> >> >> This has some merge issues on 3.6.38.8: >> >> >> <<<<<<< >> while ((server->tcpStatus != CifsExiting)&& >> (server->tcpStatus != CifsGood)) { >> ======= >> do { >> >>>>>>> >> >> Should I keep your comparison for tcpStatus == CifsNeedReconnect >> instead of these != comparisons above? >> >> >> Thanks, >> Ben >> > > No, I think you probably just want to take patch fd88ce9313 too, which > should fix up the merge conflict. Ok, I've applied those two..we'll start testing with these patches today. Might take a while before we are certain the fix works, as this isn't usually easy or fast to reproduce. Thanks, Ben -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org> @ 2011-06-07 1:00 ` Steve French [not found] ` <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Steve French @ 2011-06-07 1:00 UTC (permalink / raw) To: Ben Greear; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA Ben, Thanks - this may be a very rare case - hard to prove without your testing but it looks like Jeff's patch makes sense. On Mon, Jun 6, 2011 at 12:22 PM, Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > On 06/06/2011 09:51 AM, Jeff Layton wrote: >> >> On Mon, 06 Jun 2011 09:47:40 -0700 >> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >> >>> On 06/06/2011 06:45 AM, Jeff Layton wrote: >>>> >>>> On Sat, 4 Jun 2011 07:19:23 -0400 > >>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket >>>> pointer >>>> >>>> It's possible for the following set of events to happen: >>>> >>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace >>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and >>>> gets a reply. But, while processing the reply, cifsd calls >>>> cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the >>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to >>>> CifsGood. cifs_reconnect then goes through and closes the socket and >>>> sets the >>>> pointer to zero, but because the status is now CifsGood, the new socket >>>> is not created and cifs_reconnect exits with the socket pointer set to >>>> NULL. >>>> >>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is >>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always >>>> called at least once in cifs_reconnect. >>>> >>>> Note that this is not a perfect fix for this issue. It's still possible >>>> that the NEGOTIATE reply is handled after the socket has been closed and >>>> reconnected. In that case, the socket state will look correct but it no >>>> NEGOTIATE was performed on it. In that situation though the server >>>> should just shut down the socket on the next attempted send, rather >>>> than causing the oops that occurs today. >>>> >>>> Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> >>>> Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >>>> --- >>>> fs/cifs/connect.c | 6 +++--- >>>> 1 files changed, 3 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c >>>> index 84c7307..8bb55bc 100644 >>>> --- a/fs/cifs/connect.c >>>> +++ b/fs/cifs/connect.c >>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server) >>>> mid_entry->callback(mid_entry); >>>> } >>>> >>>> - while (server->tcpStatus == CifsNeedReconnect) { >>>> + do { >>>> try_to_freeze(); >>>> >>>> /* we should try only the port we connected to before */ >>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server) >>>> server->tcpStatus = CifsNeedNegotiate; >>>> spin_unlock(&GlobalMid_Lock); >>>> } >>>> - } >>>> + } while (server->tcpStatus == CifsNeedReconnect); >>>> >>>> return rc; >>>> } >>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, >>>> struct cifs_ses *ses) >>>> } >>>> if (rc == 0) { >>>> spin_lock(&GlobalMid_Lock); >>>> - if (server->tcpStatus != CifsExiting) >>>> + if (server->tcpStatus == CifsNeedNegotiate) >>>> server->tcpStatus = CifsGood; >>>> else >>>> rc = -EHOSTDOWN; >>> >>> >>> This has some merge issues on 3.6.38.8: >>> >>> >>> <<<<<<< >>> while ((server->tcpStatus != CifsExiting)&& >>> (server->tcpStatus != CifsGood)) { >>> ======= >>> do { >>> >>>>>>> >>> >>> Should I keep your comparison for tcpStatus == CifsNeedReconnect >>> instead of these != comparisons above? >>> >>> >>> Thanks, >>> Ben >>> >> >> No, I think you probably just want to take patch fd88ce9313 too, which >> should fix up the merge conflict. > > Ok, I've applied those two..we'll start testing with these patches > today. Might take a while before we are certain the fix works, as > this isn't usually easy or fast to reproduce. > > Thanks, > Ben > > -- > Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> > Candela Technologies Inc http://www.candelatech.com > > -- Thanks, Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-06-10 18:55 ` Ben Greear 0 siblings, 0 replies; 23+ messages in thread From: Ben Greear @ 2011-06-10 18:55 UTC (permalink / raw) To: Steve French; +Cc: Jeff Layton, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 06/06/2011 06:00 PM, Steve French wrote: > Ben, > Thanks - this may be a very rare case - hard to prove without your testing > but it looks like Jeff's patch makes sense. We've had 3+ days of clean failover testing, so I think that patch did solve the problem. You are welcome to add my tested-by if you want. Thanks, Ben > > On Mon, Jun 6, 2011 at 12:22 PM, Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >> On 06/06/2011 09:51 AM, Jeff Layton wrote: >>> >>> On Mon, 06 Jun 2011 09:47:40 -0700 >>> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: >>> >>>> On 06/06/2011 06:45 AM, Jeff Layton wrote: >>>>> >>>>> On Sat, 4 Jun 2011 07:19:23 -0400 >> >>>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket >>>>> pointer >>>>> >>>>> It's possible for the following set of events to happen: >>>>> >>>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace >>>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and >>>>> gets a reply. But, while processing the reply, cifsd calls >>>>> cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the >>>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to >>>>> CifsGood. cifs_reconnect then goes through and closes the socket and >>>>> sets the >>>>> pointer to zero, but because the status is now CifsGood, the new socket >>>>> is not created and cifs_reconnect exits with the socket pointer set to >>>>> NULL. >>>>> >>>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is >>>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always >>>>> called at least once in cifs_reconnect. >>>>> >>>>> Note that this is not a perfect fix for this issue. It's still possible >>>>> that the NEGOTIATE reply is handled after the socket has been closed and >>>>> reconnected. In that case, the socket state will look correct but it no >>>>> NEGOTIATE was performed on it. In that situation though the server >>>>> should just shut down the socket on the next attempted send, rather >>>>> than causing the oops that occurs today. >>>>> >>>>> Reported-by: Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> >>>>> Signed-off-by: Jeff Layton<jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >>>>> --- >>>>> fs/cifs/connect.c | 6 +++--- >>>>> 1 files changed, 3 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c >>>>> index 84c7307..8bb55bc 100644 >>>>> --- a/fs/cifs/connect.c >>>>> +++ b/fs/cifs/connect.c >>>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server) >>>>> mid_entry->callback(mid_entry); >>>>> } >>>>> >>>>> - while (server->tcpStatus == CifsNeedReconnect) { >>>>> + do { >>>>> try_to_freeze(); >>>>> >>>>> /* we should try only the port we connected to before */ >>>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server) >>>>> server->tcpStatus = CifsNeedNegotiate; >>>>> spin_unlock(&GlobalMid_Lock); >>>>> } >>>>> - } >>>>> + } while (server->tcpStatus == CifsNeedReconnect); >>>>> >>>>> return rc; >>>>> } >>>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, >>>>> struct cifs_ses *ses) >>>>> } >>>>> if (rc == 0) { >>>>> spin_lock(&GlobalMid_Lock); >>>>> - if (server->tcpStatus != CifsExiting) >>>>> + if (server->tcpStatus == CifsNeedNegotiate) >>>>> server->tcpStatus = CifsGood; >>>>> else >>>>> rc = -EHOSTDOWN; >>>> >>>> >>>> This has some merge issues on 3.6.38.8: >>>> >>>> >>>> <<<<<<< >>>> while ((server->tcpStatus != CifsExiting)&& >>>> (server->tcpStatus != CifsGood)) { >>>> ======= >>>> do { >>>> >>>>>>> >>>> >>>> Should I keep your comparison for tcpStatus == CifsNeedReconnect >>>> instead of these != comparisons above? >>>> >>>> >>>> Thanks, >>>> Ben >>>> >>> >>> No, I think you probably just want to take patch fd88ce9313 too, which >>> should fix up the merge conflict. >> >> Ok, I've applied those two..we'll start testing with these patches >> today. Might take a while before we are certain the fix works, as >> this isn't usually easy or fast to reproduce. >> >> Thanks, >> Ben >> >> -- >> Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> >> Candela Technologies Inc http://www.candelatech.com >> >> > > > -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: CIFS endless console spammage in 2.6.38.7 [not found] ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 2011-05-31 20:51 ` Steve French @ 2011-05-31 20:51 ` Ben Greear 1 sibling, 0 replies; 23+ messages in thread From: Ben Greear @ 2011-05-31 20:51 UTC (permalink / raw) To: Jeff Layton; +Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA On 05/31/2011 01:44 PM, Jeff Layton wrote: > On Tue, 31 May 2011 12:45:37 -0700 > Ben Greear<greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote: > >> On 05/31/2011 12:36 PM, Steve French wrote: >>> This is on setting up a session, so could be something like: >>> - mount >>> - do write >>> - server crash >>> - attempt to reconnect >>> - socket returns ENOSOCK >>> - attempt to reconnect ... >>> - repeat >>> >>> Is this repeatable enough that we could modify the client to stop on >>> the reconnect to see what is causing the socket to go bad and which >>> operation we are repeating the reconnect on. >> >> Well, ENOTSOCK sounds like a pretty serious coding problem. Maybe >> a use-after-close or something? >> >> At the least, we could look for some particular errors (such as ENOTSOCK) >> and print more info and do a more thorough job of cleaning up. >> >> Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well? >> >> Seems we can reproduce this only when our open-filer HA system >> craps itself during failover, but we can get that to happen usually >> within hours, sometimes maybe about a day. And, CIFS errors don't always >> happen when the HA cluster goes bad. >> >> So, I'm happy to test patches, but since it's a bit tricky to >> reproduce this...I'm hoping to get the best info possible with >> each patch iteration! >> > > I had a report of a similar problem on a RHEL5 (2.6.18) kernel: > > https://bugzilla.redhat.com/show_bug.cgi?id=704921 > > In this case, it caused an oops as well. Your problem may or may not be > the same, but if it is, I suspect that the root cause is a lack of > clear locking rules for the TCP_Server_Info->tcpStatus. > > What I think happened in that case was that the client was in the > middle of a NEGOTIATE request and got a response, and another reconnect > occurred while it was processing it. While the client was tearing down > and creating a new socket, the thread that issued the NEGOTIATE on the > previous socket marked the tcpStatus as CifsGood. > > Fixing it looks to be anything but trivial. I'm not even quite sure how > to approach it at this point. Suggestions welcome. Well, I grepped through 2GB of console logs and found no oopses in my case. Seems to me that the retry logic either isn't being properly done, or maybe it's just trying too often and stuck in basically a tight loop writing logs to the console. (My HA server cluster is still hosed, left it busted while debugging this, so there is no way that CIFS can actually recover the connection as of now.) If it's just a log-spam tight loop, then rate-limitting the messages should help, and some timeouts or backoffs should be added to CIFS. Building new kernels now, and we'll try to reproduce with the extra debugging code. Thanks, Ben -- Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2011-06-10 18:55 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-31 18:50 CIFS endless console spammage in 2.6.38.7 Ben Greear
[not found] ` <4DE5385C.1030808-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-05-31 19:36 ` Steve French
[not found] ` <BANLkTik+Z32vDVjB3_Rt7iPrqpJPJYnpwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-05-31 19:45 ` Ben Greear
[not found] ` <4DE54561.1090906-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-05-31 20:44 ` Jeff Layton
[not found] ` <20110531164408.178eeebf-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-05-31 20:51 ` Steve French
[not found] ` <BANLkTinyb=tekDwPLqxuSqyQfrgc8MykCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-05-31 20:53 ` Ben Greear
[not found] ` <4DE55537.5040705-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-05-31 20:54 ` Steve French
[not found] ` <BANLkTimNgW-Ff_50HeuFqmS7PXXjuLmYVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-06-01 18:01 ` Jeff Layton
[not found] ` <20110601140139.079287da-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-01 18:07 ` Ben Greear
[not found] ` <4DE67FFE.3040907-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-01 19:06 ` Jeff Layton
[not found] ` <20110601150621.7b465941-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-01 19:17 ` Ben Greear
[not found] ` <4DE69041.5070802-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-03 21:01 ` Ben Greear
[not found] ` <4DE94B97.8090302-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-04 1:42 ` Jeff Layton
[not found] ` <20110603214204.318602e8-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
2011-06-04 5:03 ` Ben Greear
[not found] ` <4DE9BCAF.10303-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-04 11:19 ` Jeff Layton
[not found] ` <20110604071923.777c666f-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
2011-06-06 13:45 ` Jeff Layton
[not found] ` <20110606094547.0c04d1c5-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-06 15:37 ` Steve French
2011-06-06 16:47 ` Ben Greear
[not found] ` <4DED04AC.7090508-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-06 16:51 ` Jeff Layton
[not found] ` <20110606125143.56da1fdb-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-06-06 17:22 ` Ben Greear
[not found] ` <4DED0CCA.6090305-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2011-06-07 1:00 ` Steve French
[not found] ` <BANLkTimkinsfojB0=Sf5=o5HBOfiTWTsAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-06-10 18:55 ` Ben Greear
2011-05-31 20:51 ` Ben Greear
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.