From: Jason Gunthorpe <jgg@ziepe.ca>
To: Benjamin Drung <benjamin.drung@cloud.ionos.com>
Cc: linux-rdma@vger.kernel.org
Subject: Re: Race condition between / wrong load order of ib_umad and ib_ipoib
Date: Tue, 2 Jun 2020 16:50:15 -0300
Message-ID: <20200602195015.GD6578@ziepe.ca>
In-Reply-To: <6c58097c2310a57a987959660a8612467d8bd96c.camel@cloud.ionos.com>

On Tue, Jun 02, 2020 at 05:11:31PM +0200, Benjamin Drung wrote:
> Hi,
> 
> after a kernel upgrade to version 4.19 (in-house built with Mellanox
> OFED drivers), some of our systems fail to bring up their IPoIB devices
> on boot. Different HCAs are affected (e.g. MT4099 and MT26428). We are
> using rdma-core on Debian and have IPoIB devices (like `ib0.dddd`)
> configured in `/etc/network/interfaces`. Big clusters seem to be more
> affected than smaller ones. In case of the failure, we see this kernel
> message:
> 
> ```
> ib0.dddd: P_Key 0xdddd is not found
> ```

I think this means you are missing some IPoIB bug fixes?

This warning means ipoib was started before the subnet manager (SM) had
programmed the pkey table (i.e. it is a race).

The way it is supposed to work is for IPoIB to create the interface
anyhow in the down state and wait for the SM to program the pkey, then
move to the up state.
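
To check from userspace whether the pkey has been programmed yet, the
port's pkey table can be read from sysfs. The following is only a
minimal sketch: it assumes the table is exposed as one file per index
under /sys/class/infiniband/<device>/ports/<port>/pkeys/, and it
compares only the low 15 bits of each entry (ignoring the membership
bit), which is an assumption about how the match should be done. The
function names are hypothetical.

```python
import os


def read_pkey_table(pkeys_dir):
    """Return {index: value} for each entry file in a sysfs pkeys directory."""
    table = {}
    for name in os.listdir(pkeys_dir):
        if not name.isdigit():
            continue  # skip anything that is not a table index
        with open(os.path.join(pkeys_dir, name)) as f:
            # Entries are hex strings such as "0xffff".
            table[int(name)] = int(f.read().strip(), 16)
    return table


def find_pkey(table, pkey):
    """Return the lowest index whose low 15 bits match pkey, or None.

    Assumption: the top bit is the membership bit and is ignored here;
    all-zero entries are treated as empty slots.
    """
    want = pkey & 0x7FFF
    for index in sorted(table):
        value = table[index]
        if value != 0 and (value & 0x7FFF) == want:
            return index
    return None
```

For example, find_pkey(read_pkey_table(path), 0xdddd) returning None
would match the "P_Key 0xdddd is not found" state, where path is the
pkeys directory of the port in question.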

> Pinging other hosts will fail then with:
> 
> ```
> ping: sendmsg: Network is unreachable
> ```

This suggests ipoib is stuck down, so it missed the pkey change
event.

> changing the order in this configuration file to load `ib_umad` before
> `ib_ipoib`, the servers come up correctly.

This is probably just adding enough delay that the SM has set up the
pkey table before ipoib starts...
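
If the delay is what makes the difference, a more explicit form of the
same workaround would be to wait for a condition with a timeout instead
of relying on module load order. The following is only a sketch of such
a polling helper; wait_until and its parameters are hypothetical names,
and it works around the race rather than fixing it in ipoib.

```python
import time


def wait_until(check, timeout_s=30.0, interval_s=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns a true value or timeout_s elapses.

    Returns True if check() succeeded, False on timeout. check() is
    always called at least once, and once more when the deadline hits.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Here check would be a callable that reports whether the expected pkey
is present in the port's pkey table; a boot script could then bring up
the IPoIB interface only after wait_until returns True.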

Jason 

