From: Yevgeny Kliteynik <kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
To: Aaron Knister <aaron.knister-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: OpenSM Failover
Date: Mon, 12 Oct 2009 09:14:34 +0200
Message-ID: <4AD2D75A.2020403@dev.mellanox.co.il>
In-Reply-To: <CBC039F5-9019-436D-AF6D-F887E860D07B-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Aaron,
Aaron Knister wrote:
> I just stumbled across this in the release notes for opensm 3.2.6-
>
> "* SMs do not hand-over when running on ConnectX in a switch-based
> topology."
>
> So I guess that answers the question of whether or not what I'm seeing
> is "expected behavior". Out of curiosity what are the technical reasons
> for this? I just tried opensm 3.3.2 and I still experience the same
> behavior.
There was a hand-over problem in OFED 1.4, but it later turned
out to be a FW issue. The thing is, FW version 2.6.648 doesn't
have this bug any more...
The 30 seconds for the initial failover is expected, but the
40-second failback when the original master comes back is a problem.
Can you please double-check that FW version 2.6.648 is used on
both HCAs that run OSM?
And what is the FW version of the HCAs that don't have this problem?
Also, can you please reproduce the issue running OSM as follows:

    opensm -V -e -s 0

on both nodes and attach the /var/log/opensm.log files?
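
As an aside, the length of each window without a master SM can be read
off the syslog excerpt quoted below with a short script. This is only a
sketch: the function names (parse_events, masterless_gaps) are made up
for illustration, the year 2009 is assumed from the Date header because
syslog lines carry none, and treating "master exits or drops to STANDBY"
until "some SM enters MASTER" as the outage window is an assumption
about how to interpret the log, not something OpenSM reports itself.

```python
import re
from datetime import datetime

# Lines copied from the syslog excerpt quoted in this thread.
LOG = """\
Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
Oct 10 19:16:16 node-b OpenSM[11197]: Entering DISCOVERING state
Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state
Oct 10 19:18:52 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
"""

# "<Mon> <day> <HH:MM:SS> <host> OpenSM[<pid>]: <message>"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) OpenSM\[\d+\]: (.*)$")


def parse_events(text, year=2009):
    """Return a list of (timestamp, host, message) tuples.

    The year is an assumption; syslog timestamps do not include one.
    """
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or unrelated lines
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        events.append((stamp, m.group(2), m.group(3)))
    return events


def masterless_gaps(events):
    """Return (start, end, seconds) for each window with no master SM."""
    gaps = []
    master = None      # host currently believed to be master
    gap_start = None   # time the last master went away
    for stamp, host, msg in events:
        if msg == "Entering MASTER state":
            if gap_start is not None:
                gaps.append((gap_start, stamp, (stamp - gap_start).total_seconds()))
                gap_start = None
            master = host
        elif msg in ("Exiting SM", "Entering STANDBY state") and host == master:
            master = None
            gap_start = stamp
    return gaps


for start, end, secs in masterless_gaps(parse_events(LOG)):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {secs:.0f} s without a master")
```

For the excerpt above this reports a 32 s window after node a stops and
a 40 s window after node b hands over, which matches the "about 30" and
"about 40" seconds described in the original report.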
-- Yevgeny
> On Oct 10, 2009, at 7:38 PM, Aaron Knister wrote:
>
>> I'm not sure if this is the right place to post about this issue, but
>> here goes-
>>
>> I'm having problems with OpenSM failover.
>>
>> I have two nodes running opensmd version "3.2.6_20090317" from RHEL
>> 5.4. I'm using a configuration file on both generated using opensm -c.
>> When I start the subnet manager on node a, everything is fine. It
>> appears to reassign itself a LID of 1, which I think is expected. When
>> I start the subnet manager on node b, everything is fine too. If you
>> query its LID, it shows the subnet manager in a standby state. Now for
>> the fun. If I stop opensmd on node a (service opensmd stop) then all
>> of the traffic on the fabric stops. It takes node b's OpenSM instance
>> about 30 seconds to realize that node a's subnet manager is dead and
>> come up in the master state. Now when node a's subnet manager comes
>> back (service opensmd start), all traffic on the fabric stops and node
>> b's subnet manager goes into the standby state...but node a's subnet
>> manager doesn't take over the fabric and come up as master for about
>> another 40 seconds (during this time no traffic passes over the
>> fabric). The logs below should help illustrate what I'm seeing:
>>
>>
>> Oct 10 19:14:14 node-a OpenSM[14132]: Entering DISCOVERING state
>> Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
>> Oct 10 19:14:14 node-a OpenSM[14132]: SUBNET UP
>>
>> Oct 10 19:14:25 node-b OpenSM[11197]: /var/log/opensm.log log file opened
>> Oct 10 19:14:25 node-b OpenSM[11197]: OpenSM 3.2.6_20090317
>> Oct 10 19:14:25 node-b OpenSM[11197]: Entering DISCOVERING state
>> Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state
>>
>> Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
>> Oct 10 19:16:16 node-b OpenSM[11197]: Entering DISCOVERING state
>> Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state
>>
>> Oct 10 19:18:52 node-a OpenSM[14213]: /var/log/opensm.log log file opened
>> Oct 10 19:18:52 node-a OpenSM[14213]: OpenSM 3.2.6_20090317
>> Oct 10 19:18:52 node-a OpenSM[14213]: Entering DISCOVERING state
>> Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
>> Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
>> Oct 10 19:19:33 node-a OpenSM[14213]: Entering DISCOVERING state
>> Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
>> Oct 10 19:19:33 node-a OpenSM[14213]: SUBNET UP
>>
>> We have opensm 3.2.5_20081207 (ofed 1.4) on another cluster and it
>> fails over and fails back almost instantly with seemingly no traffic
>> interruption if you gracefully stop the active opensmd instance
>> (service opensmd stop). Is the behavior I'm seeing considered normal?
>> I can understand the 30 seconds for the initial failover but why the
>> 40 second failback when the original master comes back? Any help is
>> appreciated :)
>>
>> BTW my switch is a Qlogic 12800-180 with the latest firmware and the
>> HCAs are Mellanox MT26428 running firmware version 2.6.648.
>>
>> Thanks!
>>
>> -Aaron
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>