* Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 5:14 Bharath Ramesh
From: Bharath Ramesh @ 2013-02-03 5:14 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Intermittently, a couple of nodes in our cluster throw the error "Failed
to obtain HW semaphore, aborting" on boot. When this error occurs we are
unable to use IB on those nodes, and unloading and reloading the module
doesn't help. What could be causing this error? Google only brings up
the source code, with no discussion of the error. We are using
OFED-1.5.4. Any help in debugging and resolving this issue is greatly
appreciated.
Note: I am not subscribed to the list, so I would appreciate being copied
on replies.
--
Bharath
* Re: Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 7:11 ` Or Gerlitz
From: Or Gerlitz @ 2013-02-03 7:11 UTC (permalink / raw)
To: Bharath Ramesh; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 03/02/2013 07:14, Bharath Ramesh wrote:
> Intermittently a couple of nodes in our cluster throw the error
> "Failed to obtain HW semaphore, aborting" on boot. When this error
> occurs we are unable to use IB on those nodes, unloading and reloading
> the module doesnt help.
Load mlx4_core with debug_level=1 and send the resulting dmesg output,
along with the lspci info for the card ("$ lspci | grep Mellanox").
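A minimal sketch of how that information could be gathered in one pass. The function name, output directory and file names are hypothetical; it assumes root privileges and that nothing else (for example mlx4_ib or mlx4_en) still holds mlx4_core, in which case those modules would need to be unloaded first.

```shell
# Sketch only: reload mlx4_core with verbose logging and save the
# diagnostics requested above. Names and paths are illustrative.
collect_mlx4_debug() {
    out_dir=${1:-/tmp/mlx4-debug}
    mkdir -p "$out_dir" || return 1

    # Unload the driver; this fails if dependent modules are still loaded.
    modprobe -r mlx4_core || return 1

    # Reload with the debug_level module parameter mentioned in the thread.
    # A failed probe is expected on a bad node, so keep going either way.
    modprobe mlx4_core debug_level=1

    # Save the whole kernel log so unrelated-looking lines are kept too.
    dmesg > "$out_dir/dmesg.txt"

    # Save the PCI description of the card.
    lspci | grep Mellanox > "$out_dir/lspci-mellanox.txt"
}
```

Calling `collect_mlx4_debug /tmp/node42` would leave `dmesg.txt` and `lspci-mellanox.txt` in that directory, ready to attach to a reply.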
> I was wondering what could be causing this error, google only brings
> up the source code and no discussion about this error. We are using
> OFED-1.5.4, any help in debugging and resolving this issue is greatly
> appreciated.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 15:18 ` Bharath Ramesh
From: Bharath Ramesh @ 2013-02-03 15:18 UTC (permalink / raw)
To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 2/3/2013 2:11 AM, Or Gerlitz wrote:
> On 03/02/2013 07:14, Bharath Ramesh wrote:
>> Intermittently a couple of nodes in our cluster throw the error
>> "Failed to obtain HW semaphore, aborting" on boot. When this error
>> occurs we are unable to use IB on those nodes, unloading and
>> reloading the module doesnt help.
>
> load mlx4_core with debug_level=1 and send the resulted dmesg along
> with the lspci info of the card ("$ lspci | grep Mellanox")
The same node comes up fine on some reboots and hits this error on
others. Here is the output from lspci:
$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
dmesg output when trying to load mlx4_core with debug_level=1:
mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.4 (November 10, 2011)
mlx4_core: Initializing 0000:01:00.0
mlx4_core 0000:01:00.0: PCI INT A -> GSI 26 (level, low) -> IRQ 26
mlx4_core 0000:01:00.0: setting latency timer to 64
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
mlx4_core 0000:01:00.0: Failed to reset HCA, aborting.
mlx4_core 0000:01:00.0: PCI INT A disabled
mlx4_core: probe of 0000:01:00.0 failed with error -11
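On a cluster, a small filter over saved kernel logs can show which nodes hit this failure on a given boot. The sketch below is an illustration only: the function names are hypothetical, and the patterns are copied from the log lines above, so they may not match other driver versions. The error value -11 in the last log line corresponds to EAGAIN on Linux.

```shell
# Sketch only: read a kernel log on stdin and print "<pci-address> <errno>"
# for each failed mlx4_core probe. The pattern mirrors the log line above.
mlx4_failed_probes() {
    sed -n 's/.*mlx4_core: probe of \([0-9a-fA-F:.]*\) failed with error \(-[0-9]*\).*/\1 \2/p'
}

# Succeeds when the log on stdin contains the HW semaphore failure.
mlx4_semaphore_failure() {
    grep -q 'Failed to obtain HW semaphore'
}

# Example use on a live node:
#   dmesg | mlx4_failed_probes
#   dmesg | mlx4_semaphore_failure && echo "semaphore failure seen"
```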
I am unable to run ibv_devinfo on the bad node; here is the output from a
good node:
$ ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.10.2370
node_guid: 001e:6703:003c:dff4
sys_image_guid: 001e:6703:003c:dff7
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: INCX-3I358C10501
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 358
port_lid: 331
port_lmc: 0x00
link_layer: IB
--
Bharath
* Re: Intermitted "Failed to obtain HW semaphore, aborting" error
@ 2013-02-03 20:27 ` Or Gerlitz
From: Or Gerlitz @ 2013-02-03 20:27 UTC (permalink / raw)
To: Bharath Ramesh; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Sun, Feb 3, 2013 at 5:18 PM, Bharath Ramesh <bramesh-PjAqaU27lzQ@public.gmane.org> wrote:
[...]
> I am unable to run ibv_devinfo on the bad node, here is an output from a
> good node
> $ ibv_devinfo
> hca_id: mlx4_0
> transport: InfiniBand (0)
> fw_ver: 2.10.2370
[...]
Upgrading your firmware to the latest GA release, 2.11.0500, available on
the Mellanox site, should be a step in the right direction.
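To see which nodes still run older firmware, the fw_ver field of ibv_devinfo can be compared against 2.11.0500. This is a rough sketch: the helper names are hypothetical, and it assumes fw_ver always has three dot-separated numeric fields, as in the output quoted above.

```shell
# Sketch only: read ibv_devinfo output on stdin, print the first fw_ver value.
fw_ver_from_devinfo() {
    awk '$1 == "fw_ver:" { print $2; exit }'
}

# Succeeds when version $1 is strictly older than version $2.
# Compares the three dot-separated fields numerically.
fw_older_than() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)" = "$1" ]
}

# Example use on a node where the HCA came up:
#   cur=$(ibv_devinfo | fw_ver_from_devinfo)
#   fw_older_than "$cur" 2.11.0500 && echo "firmware $cur predates 2.11.0500"
```

This only works on nodes where the driver loaded, since ibv_devinfo cannot be run on a node that hit the failure.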