From: Bharath Ramesh <bramesh-PjAqaU27lzQ@public.gmane.org>
To: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: Intermitted "Failed to obtain HW semaphore, aborting" error
Date: Sun, 03 Feb 2013 10:18:14 -0500
Message-ID: <510E7FB6.8050603@vt.edu>
In-Reply-To: <510E0D9B.9020907-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

On 2/3/2013 2:11 AM, Or Gerlitz wrote:
> On 03/02/2013 07:14, Bharath Ramesh wrote:
>> Intermittently, a couple of nodes in our cluster throw the error
>> "Failed to obtain HW semaphore, aborting" on boot. When this error
>> occurs we are unable to use IB on those nodes; unloading and
>> reloading the module doesn't help.
>
> Load mlx4_core with debug_level=1 and send the resulting dmesg along
> with the lspci info of the card ("$ lspci | grep Mellanox").
The same node will come up fine on some reboots, and on others I get
this error. Here is the output from lspci:
$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

dmesg output from loading mlx4_core with debug_level=1:

mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.4 (November 10, 2011)
mlx4_core: Initializing 0000:01:00.0
mlx4_core 0000:01:00.0: PCI INT A -> GSI 26 (level, low) -> IRQ 26
mlx4_core 0000:01:00.0: setting latency timer to 64
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
mlx4_core 0000:01:00.0: Failed to reset HCA, aborting.
mlx4_core 0000:01:00.0: PCI INT A disabled
mlx4_core: probe of 0000:01:00.0 failed with error -11
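
Since the failure is intermittent and affects only some nodes, it can help to scan each node's kernel log for it automatically. The following is a minimal sketch, not part of the driver or of any Mellanox tooling: the function name find_mlx4_failures and both regular expressions are assumptions modeled only on the log lines shown above, and other driver versions may word these messages differently. On Linux, the probe error -11 reported here corresponds to -EAGAIN.

```python
import re

# Patterns modeled on the log lines in this message (assumption: other
# mlx4_core versions may phrase these messages differently).
SEMAPHORE_RE = re.compile(r"mlx4_core (\S+): Failed to obtain HW semaphore")
PROBE_RE = re.compile(r"mlx4_core: probe of (\S+) failed with error (-?\d+)")


def find_mlx4_failures(log_text):
    """Map each failing PCI address to what the log says about it.

    Returns a dict of the form
    {pci_address: {"semaphore": bool, "probe_error": int or None}}.
    """
    failures = {}
    for line in log_text.splitlines():
        sem = SEMAPHORE_RE.search(line)
        if sem:
            entry = failures.setdefault(
                sem.group(1), {"semaphore": False, "probe_error": None}
            )
            entry["semaphore"] = True
            continue
        probe = PROBE_RE.search(line)
        if probe:
            entry = failures.setdefault(
                probe.group(1), {"semaphore": False, "probe_error": None}
            )
            entry["probe_error"] = int(probe.group(2))
    return failures
```

To run it against a live node, the log text could be obtained with subprocess.run(["dmesg"], capture_output=True, text=True).stdout and passed to the function; an empty result means neither message was found.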

I am unable to run ibv_devinfo on the bad node; here is the output from a
good node:
$ ibv_devinfo
hca_id: mlx4_0
         transport:                      InfiniBand (0)
         fw_ver:                         2.10.2370
         node_guid:                      001e:6703:003c:dff4
         sys_image_guid:                 001e:6703:003c:dff7
         vendor_id:                      0x02c9
         vendor_part_id:                 4099
         hw_ver:                         0x0
         board_id:                       INCX-3I358C10501
         phys_port_cnt:                  1
                 port:   1
                         state:                  PORT_ACTIVE (4)
                         max_mtu:                2048 (4)
                         active_mtu:             2048 (4)
                         sm_lid:                 358
                         port_lid:               331
                         port_lmc:               0x00
                         link_layer:             IB

-- 
Bharath



