From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yevgeny Kliteynik
Subject: Re: OpenSM Failover
Date: Wed, 14 Oct 2009 10:46:37 +0200
Message-ID: <4AD58FED.1060800@dev.mellanox.co.il>
References: <4AD2D75A.2020403@dev.mellanox.co.il>
 <4AD2E582.8010202@voltaire.com>
 <4AD2E736.4050803@dev.mellanox.co.il>
 <4AD49DAB.4020206@dev.mellanox.co.il>
 <4AD4A72C.6000108@dev.mellanox.co.il>
Reply-To: kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Aaron Knister
Cc: Or Gerlitz , linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-rdma@vger.kernel.org

Aaron,

Aaron Knister wrote:
>>> As I said, the older opensms
>>> on the older mellanox model HCAs fail over and fail back instantly.
>> The instant failback is expected, and this is the bug that
>> we're discussing. As for the instant failover - I'll check
>> how things are supposed to work and get back to you.

After checking this, I don't understand how the instant
failover is possible. The only case where it can work is if
you don't have a switch in your subnet - just two HCAs
connected directly to each other. Is this the case?

If not, then I'd like to see the opensm logs.
Please run opensm as before (-V -s 0 -e).
Start OSM on node A with high priority.
Start OSM on node B with low priority.
Kill OSM on node A, and check that OSM on node B becomes master.
I need only the log of the opensm on node B.
It would be best if you could attach it to the bugzilla issue form,
but if you can't, you can mail it to me.

-- Yevgeny

>> -- Yevgeny
>>
>>> On Tue, Oct 13, 2009 at 11:32 AM, Yevgeny Kliteynik
>>> wrote:
>>>> Aaron,
>>>>
>>>> Thanks for the logs, this was really helpful.
>>>> Looks like there is a handover race in the OSM -
>>>> SM on node A misses the fact that SM on node B
>>>> has given up its mastership.
>>>>
>>>> There is a bugzilla issue that describes all the
>>>> details of this race:
>>>>
>>>> https://bugs.openfabrics.org/show_bug.cgi?id=1499
>>>>
>>>> I've updated the issue form with your case, and we will continue
>>>> following this bug there.
>>>>
>>>> -- Yevgeny
>>>>
>>>> Aaron Knister wrote:
>>>>> While the adapters have mellanox chipsets they're actually IBM OEM
>>>>> branded and IBM hasn't released the 2.7 fw yet. I'm a little hesitant
>>>>> to apply the generic Mellanox FW.
>>>>>
>>>>> On Mon, Oct 12, 2009 at 4:22 AM, Yevgeny Kliteynik
>>>>> wrote:
>>>>>> Or Gerlitz wrote:
>>>>>>> Yevgeny Kliteynik wrote:
>>>>>>>> There was a hand-over problem in OFED 1.4, but later it turned out
>>>>>>>> to be a FW issue. The thing is, FW version 2.6.648 doesn't have
>>>>>>>> this bug any more...
>>>>>>> so things should work fine with the newly released 2.7 firmware?
>>>>>> Yes
>>>>>>
>>>>>>> if this is still under question, Aaron, I suggest you open a bugzilla
>>>>>>> case @ https://bugs.openfabrics.org and we can track from there.
>>>>>> Good idea.
>>>>>>
>>>>>> -- Yevgeny
>>>>>>
>>>>>>> Or.
>>>>>>>
>>>>>>>
>>
>