public inbox for linux-rdma@vger.kernel.org
* OpenSM Failover
@ 2009-10-10 23:38 Aaron Knister
From: Aaron Knister @ 2009-10-10 23:38 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

I'm not sure if this is the right place to post about this issue, but
here goes.

I'm having problems with OpenSM failover.

I have two nodes running opensmd version "3.2.6_20090317" from RHEL
5.4. Both use a configuration file generated with opensm -c.
When I start the subnet manager on node a, everything is fine. It
appears to reassign itself a LID of 1, which I think is expected. When
I start the subnet manager on node b, everything is still fine: if you
query its LID, it shows the subnet manager in the standby state.

Now for the fun. If I stop opensmd on node a (service opensmd stop),
all of the traffic on the fabric stops. It takes node b's OpenSM
instance about 30 seconds to realize that node a's subnet manager is
dead and come up in the master state. When node a's subnet manager
comes back (service opensmd start), all traffic on the fabric stops
again and node b's subnet manager goes into the standby state, but
node a's subnet manager doesn't take over the fabric and come up as
master for about another 40 seconds (during this time no traffic
passes over the fabric). The logs below should help illustrate what
I'm seeing.


Oct 10 19:14:14 node-a OpenSM[14132]: Entering DISCOVERING state
Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
Oct 10 19:14:14 node-a OpenSM[14132]: SUBNET UP

Oct 10 19:14:25 node-b OpenSM[11197]: /var/log/opensm.log log file opened
Oct 10 19:14:25 node-b OpenSM[11197]: OpenSM 3.2.6_20090317
Oct 10 19:14:25 node-b OpenSM[11197]: Entering DISCOVERING state
Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state

Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
Oct 10 19:16:16 node-b OpenSM[11197]: Entering DISCOVERING state
Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state

Oct 10 19:18:52 node-a OpenSM[14213]: /var/log/opensm.log log file opened
Oct 10 19:18:52 node-a OpenSM[14213]: OpenSM 3.2.6_20090317
Oct 10 19:18:52 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
Oct 10 19:19:33 node-a OpenSM[14213]: SUBNET UP
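
The two masterless windows can be read straight off the timestamps
above. The following is a minimal Python sketch of that arithmetic,
not an OpenSM tool: the names parse_events and masterless_gaps are
hypothetical, the year 2009 is assumed from the date of this message,
and only the state-change lines from the excerpt are included. It
treats the fabric as "masterless" from the moment the current master
logs "Exiting SM" or "Entering STANDBY state" until some node logs
"Entering MASTER state".

```python
import re
from datetime import datetime

# State-change lines copied from the log excerpt above.
LOG = """\
Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state
Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
"""

# Syslog-style line: "<Mon> <day> <HH:MM:SS> <host> OpenSM[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+OpenSM\[\d+\]:\s+(.*)$"
)


def parse_events(log, year=2009):
    """Return (timestamp, host, message) tuples; the year is an assumption,
    because syslog timestamps do not carry one."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, host, message = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, host, message))
    return events


def masterless_gaps(events):
    """Return (old_master, new_master, seconds) for each window in which
    no node was in the MASTER state. Events must be in time order."""
    gaps = []
    master = None      # host currently logged as MASTER
    lost_from = None   # host that last gave up MASTER
    lost_at = None     # when it gave it up
    for when, host, message in events:
        if message == "Entering MASTER state":
            if lost_at is not None:
                gaps.append((lost_from, host, (when - lost_at).total_seconds()))
                lost_at = None
            master = host
        elif host == master and message in ("Exiting SM", "Entering STANDBY state"):
            master, lost_from, lost_at = None, host, when
    return gaps


for old, new, seconds in masterless_gaps(parse_events(LOG)):
    print(f"{old} -> {new}: {seconds:.0f} s without a master")
```

On the excerpt above this reports 32 seconds for the failover from
node-a to node-b and 40 seconds for the failback from node-b to
node-a, matching the "about 30" and "about 40" seconds described
earlier.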

We have opensm 3.2.5_20081207 (OFED 1.4) on another cluster, and it
fails over and fails back almost instantly, with seemingly no traffic
interruption, when the active opensmd instance is stopped gracefully
(service opensmd stop). Is the behavior I'm seeing considered normal?
I can understand the 30 seconds for the initial failover, but why the
40-second failback when the original master comes back? Any help is
appreciated :)

BTW, my switch is a QLogic 12800-180 with the latest firmware, and the
HCAs are Mellanox MT26428 running firmware version 2.6.648.

Thanks!

-Aaron

Thread overview: 13+ messages (newest: 2009-10-14 13:13 UTC)
2009-10-10 23:38 OpenSM Failover Aaron Knister
2009-10-11  0:02   ` Aaron Knister
2009-10-12  7:14       ` Yevgeny Kliteynik
2009-10-12  8:14           ` Or Gerlitz
2009-10-12  8:22               ` Yevgeny Kliteynik
2009-10-12 14:03                   ` Aaron Knister
2009-10-13 15:32                       ` Yevgeny Kliteynik
2009-10-13 15:52                           ` Aaron Knister
2009-10-13 16:13                               ` Yevgeny Kliteynik
2009-10-13 21:26                                   ` Aaron Knister
2009-10-14  8:46                                       ` Yevgeny Kliteynik
2009-10-14 13:13                                           ` Aaron Knister
2009-10-12 13:41           ` Aaron Knister