Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?

public inbox for linux-rdma@vger.kernel.org

* Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?
@ 2016-02-19 18:03 Roland Dreier
       [not found] ` <CAL1RGDXux9KFEUkBeegeiGGJdvKpGv_rRRs-cqNj1U6Nq0YSiw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Roland Dreier @ 2016-02-19 18:03 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Doug Ledford, Hal Rosenstock

Hello again everyone,

I'm assessing the state of the art in writing an application that can
recover from an HCA catastrophic error (aka IBV_EVENT_DEVICE_FATAL
async event), and it appears the pieces are not there yet.  What is
supposed to happen from the kernel side is that userspace closes all
of its contexts, then the kernel tears down and recreates the device,
and userspace reopens the device and starts over.

However it doesn't look like there is any way for librdmacm to call
ibv_close_device() without tearing down the whole library and closing
all devices (which is disruptive if my application is also using
another HCA that didn't hit a catastrophic error).  But even if we add
an interface to close a single cma_device, libibverbs doesn't really
have a way to wait for the device to be torn down and reinitialized.
(In the kernel, we have the ib_client.add and ib_client.remove
callbacks, but libibverbs just initializes a static array of devices
at library initialization)
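
For illustration, a minimal userspace sketch of such a wait. It assumes the
device is visible under /sys/class/infiniband and comes back with the same
name after the reset, which is not guaranteed. device_present() and
wait_for_device_state() are hypothetical helper names, not libibverbs or
librdmacm API, and polling sysfs is only a stand-in for a real notification
mechanism:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Return true if class_dir/name exists (e.g. a device directory in sysfs). */
static bool device_present(const char *class_dir, const char *name)
{
    char path[4096];

    assert(class_dir != NULL && name != NULL);
    if (snprintf(path, sizeof(path), "%s/%s", class_dir, name)
            >= (int)sizeof(path))
        return false;               /* path too long, treat as absent */
    return access(path, F_OK) == 0;
}

/*
 * Poll until the device's presence matches want_present.
 * Returns 0 once the state is reached, -1 if timeout_ms elapses first.
 */
static int wait_for_device_state(const char *class_dir, const char *name,
                                 bool want_present, int timeout_ms,
                                 int interval_ms)
{
    int waited = 0;

    assert(interval_ms > 0);
    for (;;) {
        struct timespec ts = { interval_ms / 1000,
                               (interval_ms % 1000) * 1000000L };

        if (device_present(class_dir, name) == want_present)
            return 0;
        if (waited >= timeout_ms)
            return -1;
        nanosleep(&ts, NULL);       /* sleep one polling interval */
        waited += interval_ms;
    }
}
```

An application would call it twice after closing its context: first with
want_present == false to see the old device go away, then with true to see
it return (for example with "/sys/class/infiniband" and a device name such
as "mlx4_0", used here purely as an example). Even then, the device list
held by libibverbs would not be refreshed, per the static array noted above.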

Is there any work on closing these gaps that has been done yet
(perhaps in OFED or in pending patches), or have I found a wide open
field to innovate in?

As a side note, how does opensm handle this?  I haven't tried it yet,
but from reading code I believe that libibumad will not correctly pass
the ib_umad failure back up to opensm, and so opensm will be stuck
with a dead /dev/infiniband/umadX file handle forever.  Is that
assessment correct?

Thanks!
  Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

[parent not found: <CAL1RGDXux9KFEUkBeegeiGGJdvKpGv_rRRs-cqNj1U6Nq0YSiw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* RE: Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?
       [not found] ` <CAL1RGDXux9KFEUkBeegeiGGJdvKpGv_rRRs-cqNj1U6Nq0YSiw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-02-19 18:10   ` Hefty, Sean
  2016-02-19 19:39   ` Hal Rosenstock
  2016-02-21 11:56   ` Liran Liss
  2 siblings, 0 replies; 6+ messages in thread
From: Hefty, Sean @ 2016-02-19 18:10 UTC (permalink / raw)
  To: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Doug Ledford, Hal Rosenstock


> Is there any work on closing these gaps that has been done yet
> (perhaps in OFED or in pending patches), or have I found a wide open
> field to innovate in?

I'm not aware of any work going on in this area.

- Sean

* Re: Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?
       [not found] ` <CAL1RGDXux9KFEUkBeegeiGGJdvKpGv_rRRs-cqNj1U6Nq0YSiw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-02-19 18:10   ` Hefty, Sean
@ 2016-02-19 19:39   ` Hal Rosenstock
  2016-02-21 11:56   ` Liran Liss
  2 siblings, 0 replies; 6+ messages in thread
From: Hal Rosenstock @ 2016-02-19 19:39 UTC (permalink / raw)
  To: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Sean Hefty, Doug Ledford

On 2/19/2016 1:03 PM, Roland Dreier wrote:
> As a side note, how does opensm handle this?  I haven't tried it yet,
> but from reading code I believe that libibumad will not correctly pass
> the ib_umad failure back up to opensm, and so opensm will be stuck
> with a dead /dev/infiniband/umadX file handle forever.  Is that
> assessment correct?

Yes, your assessment is correct.

-- Hal

* RE: Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?
       [not found] ` <CAL1RGDXux9KFEUkBeegeiGGJdvKpGv_rRRs-cqNj1U6Nq0YSiw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-02-19 18:10   ` Hefty, Sean
  2016-02-19 19:39   ` Hal Rosenstock
@ 2016-02-21 11:56   ` Liran Liss
       [not found]     ` <HE1PR05MB1418BB2F8E162955160E2D3EB1A20-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2 siblings, 1 reply; 6+ messages in thread
From: Liran Liss @ 2016-02-21 11:56 UTC (permalink / raw)
  To: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Sean Hefty, Doug Ledford, Hal Rosenstock

Hi Roland,

The kernel part is in place, but user-space support is not complete.

When a specific RDMA device receives a fatal event, the user is guaranteed to get this event.
What is missing is a way for rdmacm (maybe via a well-behaved app that provides the context for reading asynch errors) to be informed of this event and re-open the HCA context.

BTW, rdmacm also doesn't notice when new RDMA devices pop up...
--Liran



[parent not found: <HE1PR05MB1418BB2F8E162955160E2D3EB1A20-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]

* Re: Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?
       [not found]     ` <HE1PR05MB1418BB2F8E162955160E2D3EB1A20-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-02-21 16:51       ` Roland Dreier
       [not found]         ` <CAG4TOxPy1LQf4cfYm=_zSyr1RuEhLds+ZDiuYKcrD-cFmKUEnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Roland Dreier @ 2016-02-21 16:51 UTC (permalink / raw)
  To: Liran Liss
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Doug Ledford, Hal Rosenstock

Right, that's pretty much what I wrote.  However I think it's a bit
worse than "be informed of this event and re-open the HCA context."
Userspace needs to synchronize with the kernel to wait for the uverbs
device to be torn down and recreated, and there's no guarantee that
the device will come back with the same name.  (A perhaps contrived
example is a glitch of a PCI switch with multiple HCAs below it - we
might reset and re-enumerate the HCAs in a different order the second
time around)

 - R.

On Sun, Feb 21, 2016 at 3:56 AM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> Hi Roland,
>
> The kernel part is in place, but user-space support is not complete.
>
> When a specific RDMA device receives a fatal event, the user is guaranteed to get this event.
> What is missing is a way for rdmacm (maybe via a well-behaved app that provides the context for reading asynch errors) to be informed of this event and re-open the HCA context.
>
> BTW, rdmacm also doesn't notice when new RDMA devices pop up...
> --Liran

[parent not found: <CAG4TOxPy1LQf4cfYm=_zSyr1RuEhLds+ZDiuYKcrD-cFmKUEnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* RE: Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?
       [not found]         ` <CAG4TOxPy1LQf4cfYm=_zSyr1RuEhLds+ZDiuYKcrD-cFmKUEnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-02-21 17:19           ` Liran Liss
  0 siblings, 0 replies; 6+ messages in thread
From: Liran Liss @ 2016-02-21 17:19 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Doug Ledford, Hal Rosenstock

> From: roland.dreier@gmail.com [mailto:roland.dreier@gmail.com] On Behalf
> Of Roland Dreier
> 
> Right, that's pretty much what I wrote.  However I think it's a bit worse than "be
> informed of this event and re-open the HCA context."
> Userspace needs to synchronize with the kernel to wait for the uverbs device to
> be torn down and recreated, and there's no guarantee that the device will come
> back with the same name.  (A perhaps contrived example is a glitch of a PCI
> switch with multiple HCAs below it - we might reset and re-enumerate the HCAs
> in a different order the second time around)
> 
>  - R.

The "torn down" part happens immediately. AFAIK, you won't be able to rebind to a failed device, because it unregisters right away.
We are missing the plug-in event, but as for device names, an app can rely on the NodeGUID.
Do you think that we need something more powerful to maintain the actual names, as is done for netdevs (e.g., some udev rule that can modify the name assignment)?
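
A minimal sketch of the NodeGUID approach, assuming the node_guid sysfs
attribute under /sys/class/infiniband/<device>/ prints as four
colon-separated 16-bit hex groups. parse_node_guid() and
find_device_by_guid() are hypothetical helper names, not part of libibverbs:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse text such as "0002:c903:00a1:b2c3" (optionally newline-terminated).
 * Returns 0 on success, -1 on malformed input. */
static int parse_node_guid(const char *text, uint64_t *guid)
{
    unsigned int a, b, c, d;
    int used = 0;

    assert(text != NULL && guid != NULL);
    if (sscanf(text, "%4x:%4x:%4x:%4x%n", &a, &b, &c, &d, &used) != 4)
        return -1;
    if (text[used] != '\0' && text[used] != '\n')
        return -1;                  /* trailing junk after the GUID */
    *guid = ((uint64_t)a << 48) | ((uint64_t)b << 32) |
            ((uint64_t)c << 16) | (uint64_t)d;
    return 0;
}

/* Scan class_dir for a device whose node_guid equals want and copy its
 * current name into name. Returns 0 if found, -1 otherwise. */
static int find_device_by_guid(const char *class_dir, uint64_t want,
                               char *name, size_t name_len)
{
    DIR *dir = opendir(class_dir);
    struct dirent *ent;
    int found = -1;

    if (dir == NULL)
        return -1;
    while (found != 0 && (ent = readdir(dir)) != NULL) {
        char path[4096], line[64];
        uint64_t guid;
        FILE *f;

        if (ent->d_name[0] == '.')
            continue;               /* skip ".", ".." and hidden entries */
        if (snprintf(path, sizeof(path), "%s/%s/node_guid",
                     class_dir, ent->d_name) >= (int)sizeof(path))
            continue;
        f = fopen(path, "r");
        if (f == NULL)
            continue;
        if (fgets(line, sizeof(line), f) != NULL &&
            parse_node_guid(line, &guid) == 0 && guid == want &&
            strlen(ent->d_name) < name_len) {
            strcpy(name, ent->d_name);
            found = 0;
        }
        fclose(f);
    }
    closedir(dir);
    return found;
}
```

An app would record the GUID before the failure and, once the device is
back, call find_device_by_guid("/sys/class/infiniband", guid, name,
sizeof(name)) to learn its current name. Within libibverbs itself,
ibv_get_device_guid() returns the same GUID for an entry in the device list.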

--Liran
