public inbox for linux-rdma@vger.kernel.org
* Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application?
@ 2016-02-19 18:03 Roland Dreier
From: Roland Dreier @ 2016-02-19 18:03 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Doug Ledford, Hal Rosenstock

Hello again everyone,

I'm assessing the state of the art in writing an application that can
recover from an HCA catastrophic error (aka IBV_EVENT_DEVICE_FATAL
async event), and it appears the pieces are not there yet.  What is
supposed to happen from the kernel side is that userspace closes all
of its contexts, then the kernel tears down and recreates the device,
and userspace reopens the device and starts over.
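
The three-step sequence above can be sketched roughly as follows. This
is a model only, not libibverbs code: `struct hca`, `struct hca_ops`,
`recover_fatal()` and the stub hooks are hypothetical names made up for
illustration. In a real application the hooks would presumably wrap
ibv_close_device() (after destroying the QPs, CQs, MRs and PDs on that
context), some way of waiting for the kernel to re-register the device,
and ibv_open_device().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one opened HCA; not a libibverbs type. */
enum hca_state { HCA_ACTIVE, HCA_FATAL, HCA_CLOSED };

struct hca {
    const char *name;
    enum hca_state state;
};

/* Hypothetical hooks supplied by the application. */
struct hca_ops {
    void (*close_all)(struct hca *h);   /* release resources, close context */
    bool (*wait_reinit)(struct hca *h); /* true once the device is back */
    bool (*reopen)(struct hca *h);      /* true if the reopen succeeded */
};

/* Recover every device marked HCA_FATAL and leave the others alone.
 * Returns the number of devices brought back to HCA_ACTIVE. */
static size_t recover_fatal(struct hca *hcas, size_t n,
                            const struct hca_ops *ops)
{
    size_t recovered = 0;

    assert(ops && ops->close_all && ops->wait_reinit && ops->reopen);

    for (size_t i = 0; i < n; i++) {
        struct hca *h = &hcas[i];

        if (h->state != HCA_FATAL)
            continue;               /* a healthy HCA keeps running */

        ops->close_all(h);          /* 1: userspace closes its contexts */
        h->state = HCA_CLOSED;

        if (!ops->wait_reinit(h))   /* 2: kernel tears down and recreates */
            continue;
        if (!ops->reopen(h))        /* 3: userspace reopens, starts over */
            continue;

        h->state = HCA_ACTIVE;
        recovered++;
    }
    return recovered;
}

/* Stub hooks that always succeed, only so the sketch can be exercised. */
static void stub_close(struct hca *h) { (void)h; }
static bool stub_wait(struct hca *h) { (void)h; return true; }
static bool stub_reopen(struct hca *h) { (void)h; return true; }
static const struct hca_ops stub_ops = { stub_close, stub_wait, stub_reopen };
```

The point of the per-device loop is that only the HCA that reported the
fatal event goes through close and reopen; a second HCA in the same
process is never touched.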

However it doesn't look like there is any way for librdmacm to call
ibv_close_device() without tearing down the whole library and closing
all devices (which is disruptive if my application is also using
another HCA that didn't hit a catastrophic error).  But even if we add
an interface to close a single cma_device, libibverbs doesn't really
have a way to wait for the device to be torn down and reinitialized.
(In the kernel, we have the ib_client.add and ib_client.remove
callbacks, but libibverbs just initializes a static array of devices
at library initialization.)
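
For illustration, one conceivable way for userspace to notice that a
device has been recreated is to poll sysfs for its directory. This is
an assumption, not something libibverbs provides: the helper names
below are hypothetical, and the sketch relies on RDMA devices showing
up under /sys/class/infiniband/<name>. It only detects that the device
is back; it does not address the problem that the library's device
array is built once at initialization.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Return true if <class_dir>/<dev_name> exists as a directory.
 * On Linux class_dir would normally be "/sys/class/infiniband". */
static bool rdma_dev_present(const char *class_dir, const char *dev_name)
{
    char path[512];
    struct stat st;
    int len = snprintf(path, sizeof(path), "%s/%s", class_dir, dev_name);

    if (len < 0 || (size_t)len >= sizeof(path))
        return false;               /* path too long: treat as absent */
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Poll until the device shows up or the attempts run out.
 * poll_ms is the delay between two attempts, in milliseconds. */
static bool wait_for_rdma_dev(const char *class_dir, const char *dev_name,
                              int attempts, long poll_ms)
{
    struct timespec delay = { poll_ms / 1000, (poll_ms % 1000) * 1000000L };

    for (int i = 0; i < attempts; i++) {
        if (rdma_dev_present(class_dir, dev_name))
            return true;
        nanosleep(&delay, NULL);
    }
    return false;
}
```

A caller might use something like
wait_for_rdma_dev("/sys/class/infiniband", "mlx4_0", 50, 100), where the
device name is only an example. Polling is the crudest option; an
event-driven notification comparable to the kernel's ib_client.add
callback would be preferable if one existed.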

Has any work been done yet on closing these gaps (perhaps in OFED or
in pending patches), or have I found a wide open field to innovate in?


As a side note, how does opensm handle this?  I haven't tried it yet,
but from reading code I believe that libibumad will not correctly pass
the ib_umad failure back up to opensm, and so opensm will be stuck
with a dead /dev/infiniband/umadX file handle forever.  Is that
assessment correct?

Thanks!
  Roland

Thread overview: 6+ messages (newest: 2016-02-21 17:19 UTC)
2016-02-19 18:03 Recovering from IBV_EVENT_DEVICE_FATAL in librdmacm application? Roland Dreier
2016-02-19 18:10   ` Hefty, Sean
2016-02-19 19:39   ` Hal Rosenstock
2016-02-21 11:56   ` Liran Liss
2016-02-21 16:51       ` Roland Dreier
2016-02-21 17:19           ` Liran Liss
