From: Bharath Ramesh <bramesh-PjAqaU27lzQ@public.gmane.org>
To: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: Intermitted "Failed to obtain HW semaphore, aborting" error
Date: Sun, 03 Feb 2013 10:18:14 -0500
Message-ID: <510E7FB6.8050603@vt.edu>
In-Reply-To: <510E0D9B.9020907-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
On 2/3/2013 2:11 AM, Or Gerlitz wrote:
> On 03/02/2013 07:14, Bharath Ramesh wrote:
>> Intermittently a couple of nodes in our cluster throw the error
>> "Failed to obtain HW semaphore, aborting" on boot. When this error
>> occurs we are unable to use IB on those nodes; unloading and
>> reloading the module doesn't help.
>
> Load mlx4_core with debug_level=1 and send the resulting dmesg along
> with the lspci info of the card ("$ lspci | grep Mellanox").
The same node comes up fine on some reboots, and on others I get this
error. Here is the output from lspci:
$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
dmesg output when trying to load mlx4_core with debug_level=1:
mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.4 (November 10, 2011)
mlx4_core: Initializing 0000:01:00.0
mlx4_core 0000:01:00.0: PCI INT A -> GSI 26 (level, low) -> IRQ 26
mlx4_core 0000:01:00.0: setting latency timer to 64
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
mlx4_core 0000:01:00.0: Failed to reset HCA, aborting.
mlx4_core 0000:01:00.0: PCI INT A disabled
mlx4_core: probe of 0000:01:00.0 failed with error -11
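Since the failure is intermittent and affects only some nodes, the signature in the log above can be matched mechanically when checking saved dmesg output from many machines. The following is a minimal Python sketch, not part of the driver or of OFED: the function name `find_mlx4_probe_failures`, the returned record layout, and the sample text are all hypothetical, and the patterns are taken only from the lines shown above. On Linux, error -11 corresponds to EAGAIN.

```python
import errno
import re

# Patterns are derived from the dmesg excerpt above (assumption: other
# driver versions may word these messages differently).
PROBE_FAIL = re.compile(r"mlx4_core: probe of (\S+) failed with error (-?\d+)")
DEVICE_MSG = re.compile(r"mlx4_core (\S+): (.+)")


def find_mlx4_probe_failures(dmesg_text):
    """Return one record per failed mlx4_core probe found in dmesg_text."""
    messages = {}  # PCI address -> driver messages seen before the failure
    failures = []
    for line in dmesg_text.splitlines():
        probe = PROBE_FAIL.search(line)
        if probe:
            device, code = probe.group(1), int(probe.group(2))
            seen = messages.get(device, [])
            failures.append({
                "device": device,
                "error": code,
                # Symbolic errno name on the machine running this script.
                "errno_name": errno.errorcode.get(-code, "unknown"),
                "hw_semaphore": any(
                    "Failed to obtain HW semaphore" in msg for msg in seen
                ),
            })
            continue
        per_device = DEVICE_MSG.search(line)
        if per_device:
            messages.setdefault(per_device.group(1), []).append(
                per_device.group(2)
            )
    return failures


# Hypothetical sample, copied from the failing lines in the log above.
SAMPLE_DMESG = """\
mlx4_core: Initializing 0000:01:00.0
mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
mlx4_core 0000:01:00.0: Failed to reset HCA, aborting.
mlx4_core: probe of 0000:01:00.0 failed with error -11
"""

if __name__ == "__main__":
    for failure in find_mlx4_probe_failures(SAMPLE_DMESG):
        print(failure)
```

Feeding it the saved dmesg text from each node gives a quick list of which nodes hit the semaphore failure on a given boot.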
I am unable to run ibv_devinfo on the bad node. Here is the output from
a good node:
$ ibv_devinfo
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.10.2370
        node_guid:              001e:6703:003c:dff4
        sys_image_guid:         001e:6703:003c:dff7
        vendor_id:              0x02c9
        vendor_part_id:         4099
        hw_ver:                 0x0
        board_id:               INCX-3I358C10501
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         358
                        port_lid:       331
                        port_lmc:       0x00
                        link_layer:     IB
--
Bharath