* Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 5:14 Bharath Ramesh
From: Bharath Ramesh @ 2013-02-03 5:14 UTC
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Intermittently, a couple of nodes in our cluster throw the error "Failed
to obtain HW semaphore, aborting" on boot. When this error occurs we are
unable to use IB on those nodes; unloading and reloading the module
doesn't help. I was wondering what could be causing this error. Google
only brings up the source code and no discussion about it. We are using
OFED-1.5.4. Any help in debugging and resolving this issue is greatly
appreciated.
Note: I am not subscribed to the list, so I would appreciate being copied
on replies.
--
Bharath

* Re: Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 7:11 Or Gerlitz
From: Or Gerlitz @ 2013-02-03 7:11 UTC
To: Bharath Ramesh; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 03/02/2013 07:14, Bharath Ramesh wrote:
> Intermittently a couple of nodes in our cluster throw the error
> "Failed to obtain HW semaphore, aborting" on boot. When this error
> occurs we are unable to use IB on those nodes, unloading and reloading
> the module doesnt help.

Load mlx4_core with debug_level=1 and send the resulting dmesg along
with the lspci info of the card ("$ lspci | grep Mellanox").

> I was wondering what could be causing this error, google only brings
> up the source code and no discussion about this error. We are using
> OFED-1.5.4, any help in debugging and resolving this issue is greatly
> appreciated.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
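
The request above (dmesg plus the lspci line for the card) can be gathered in one step. The following is only a minimal sketch, assuming the module has already been loaded with the debug option (for example `modprobe mlx4_core debug_level=1` as root) and that `lspci` and `dmesg` are on the PATH. The helper names `run_cmd` and `collect_report` are hypothetical and not part of OFED or of any Mellanox tool.

```python
import subprocess


def run_cmd(argv):
    """Run a command and return its standard output as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def collect_report(run=run_cmd):
    """Build a small text report with the Mellanox lspci lines and the
    mlx4_core / NMI lines from dmesg.

    `run` is injectable so the filtering can be exercised without hardware.
    """
    # Equivalent of: lspci | grep Mellanox
    lspci_lines = [l for l in run(["lspci"]).splitlines() if "Mellanox" in l]
    # Keep only driver messages and the NMI notices seen next to them.
    dmesg_lines = [
        l for l in run(["dmesg"]).splitlines()
        if "mlx4_core" in l or "NMI" in l
    ]
    return "\n".join(["== lspci ==", *lspci_lines, "== dmesg ==", *dmesg_lines])


if __name__ == "__main__":
    # Reading the kernel log may need root on systems that restrict dmesg.
    print(collect_report())
```

Passing a fake `run` callable makes it possible to check the filtering on saved output from a failing node rather than on the live system.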

* Re: Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 15:18 Bharath Ramesh
From: Bharath Ramesh @ 2013-02-03 15:18 UTC
To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 2/3/2013 2:11 AM, Or Gerlitz wrote:
> On 03/02/2013 07:14, Bharath Ramesh wrote:
>> Intermittently a couple of nodes in our cluster throw the error
>> "Failed to obtain HW semaphore, aborting" on boot. When this error
>> occurs we are unable to use IB on those nodes, unloading and
>> reloading the module doesnt help.
>
> load mlx4_core with debug_level=1 and send the resulted dmesg along
> with the lspci info of the card ("$ lspci | grep Mellanox")

The same node will come up fine on some reboots, and on others I get
this error. Here is the output from lspci:

$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

dmesg output when trying to load mlx4_core with debug_level=1:

mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.4 (November 10, 2011)
mlx4_core: Initializing 0000:01:00.0
mlx4_core 0000:01:00.0: PCI INT A -> GSI 26 (level, low) -> IRQ 26
mlx4_core 0000:01:00.0: setting latency timer to 64
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
mlx4_core 0000:01:00.0: Failed to reset HCA, aborting.
mlx4_core 0000:01:00.0: PCI INT A disabled
mlx4_core: probe of 0000:01:00.0 failed with error -11

I am unable to run ibv_devinfo on the bad node; here is the output from a
good node:

$ ibv_devinfo
hca_id: mlx4_0
    transport:          InfiniBand (0)
    fw_ver:             2.10.2370
    node_guid:          001e:6703:003c:dff4
    sys_image_guid:     001e:6703:003c:dff7
    vendor_id:          0x02c9
    vendor_part_id:     4099
    hw_ver:             0x0
    board_id:           INCX-3I358C10501
    phys_port_cnt:      1
        port: 1
            state:      PORT_ACTIVE (4)
            max_mtu:    2048 (4)
            active_mtu: 2048 (4)
            sm_lid:     358
            port_lid:   331
            port_lmc:   0x00
            link_layer: IB

--
Bharath
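
Because the failure is intermittent across a cluster, it may help to scan saved boot logs for this signature automatically. The sketch below is illustrative only: the function name `summarize_mlx4_log` and the shape of the returned dictionary are assumptions, and the patterns are taken solely from the log lines quoted in this message. On Linux, error -11 corresponds to -EAGAIN.

```python
import re

# Patterns copied from the dmesg lines quoted in this thread.
PROBE_FAIL = re.compile(
    r"mlx4_core: probe of (?P<pci>[0-9a-f:.]+) failed with error (?P<err>-?\d+)"
)
SEMAPHORE_MSG = "Failed to obtain HW semaphore"
NMI_MSG = "NMI received for unknown reason"


def summarize_mlx4_log(log_text):
    """Summarize mlx4_core probe failures found in a dmesg capture.

    Returns a dict with:
      semaphore_failure: True if the HW semaphore message is present
      nmi_count:         number of 'unknown reason' NMI notices
      probe_failures:    list of (pci_address, error_code) tuples
    """
    failures = [
        (m.group("pci"), int(m.group("err")))
        for m in PROBE_FAIL.finditer(log_text)
    ]
    return {
        "semaphore_failure": SEMAPHORE_MSG in log_text,
        "nmi_count": log_text.count(NMI_MSG),
        "probe_failures": failures,
    }


if __name__ == "__main__":
    import sys

    # Usage sketch: dmesg | python3 this_script.py
    print(summarize_mlx4_log(sys.stdin.read()))
```

Running this over the log of each node after boot would show which nodes hit the semaphore failure and whether the NMI notices always accompany it.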

* Re: Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 20:27 Or Gerlitz
From: Or Gerlitz @ 2013-02-03 20:27 UTC
To: Bharath Ramesh; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sun, Feb 3, 2013 at 5:18 PM, Bharath Ramesh <bramesh-PjAqaU27lzQ@public.gmane.org> wrote:
[...]
> I am unable to run ibv_devinfo on the bad node, here is an output from a
> good node
> $ ibv_devinfo
> hca_id: mlx4_0
> transport: InfiniBand (0)
> fw_ver: 2.10.2370
[...]

Upgrading your firmware to the latest GA release, 2.11.0500, available on
the Mellanox site, should be a step in the right direction.
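
To find which nodes still run firmware older than the suggested release, the `fw_ver` field of `ibv_devinfo` can be compared numerically. This is a minimal sketch under stated assumptions: the helper names `parse_fw_ver` and `needs_upgrade` are hypothetical, and the version is assumed to be dot-separated integers, as in the output quoted in this thread.

```python
import re


def parse_fw_ver(devinfo_text):
    """Extract fw_ver from ibv_devinfo output as a tuple of ints, or None."""
    m = re.search(r"fw_ver:\s*(\d+(?:\.\d+)*)", devinfo_text)
    if m is None:
        return None
    return tuple(int(part) for part in m.group(1).split("."))


def needs_upgrade(devinfo_text, target="2.11.0500"):
    """Return True if the reported firmware is older than `target`."""
    current = parse_fw_ver(devinfo_text)
    if current is None:
        raise ValueError("no fw_ver field found in ibv_devinfo output")
    wanted = tuple(int(part) for part in target.split("."))
    # Tuple comparison is numeric per component, so 2.10.x sorts before 2.11.x.
    return current < wanted


if __name__ == "__main__":
    import sys

    # Usage sketch: ibv_devinfo | python3 this_script.py
    print(needs_upgrade(sys.stdin.read()))
```

Comparing integer tuples rather than strings avoids ordering mistakes when components have different digit counts, such as 2.9 versus 2.10.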