From: Kleber Sacilotto de Souza <klebers-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
To: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
Hal Rosenstock
<hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
Jack Morgenstein
<jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Subject: Re: [PATCH] IB/mlx4: Fail post send command on error recovery
Date: Thu, 04 Apr 2013 10:01:39 -0300 [thread overview]
Message-ID: <515D79B3.4090808@linux.vnet.ibm.com> (raw)
In-Reply-To: <CAL1RGDW7wMVmyFhCv-Ei8Mbca-Y9yv+nygzfREU2_TozNSZ60A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On 04/02/2013 02:00 PM, Roland Dreier wrote:
>> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
>> index 35cced2..0fa4f72 100644
>> --- a/drivers/infiniband/hw/mlx4/qp.c
>> +++ b/drivers/infiniband/hw/mlx4/qp.c
>> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>> __be32 blh;
>> int i;
>>
>> + if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
>> + return -EIO;
>> +
>> spin_lock_irqsave(&qp->sq.lock, flags);
>>
>> ind = qp->sq_next_wqe;
>
> To pile on to what Or and Jack asked, why here? Why not in post_recv?
> Why not in mlx4_en? What about userspace consumers? What if the
> error condition triggers just after the pci_channel_offline() check?
> What if a command is queued but a PCI error occurs before the
> completion can be returned?
>
> Is there some practical scenario where this change makes a difference?
>
> I would assume that in case of a PCI error, the driver would notice a
> catastrophic error and send that asynchronous event to consumers, who
> would know that commands might have been lost.
>
The problem that I'm trying to solve is that some IB core modules are
hanging waiting on completion queues on their remove path during error
recovery. I've added the pci offline check in post_send, which seemed to
have to solved the problem, but while running other tests I was able to
hit the bug again. Adding the check in post_recv also only hid the
problem for a few testcases.
Adding any check in mlx4_en doesn't make sense in this case, because the
problem is only with IB adapters. The ethernet/RoCE adapters are
recovering fine, the check has been added already on the relevant places
in mlx4_core.
What async event should be sent to consumers before calling the remove
functions? IB_EVENT_DEVICE_FATAL, which is currently sent by mlx4_core
in case of catastrophic error (but not in PCI error recovery), doesn't
seem to be handled by most of the event handlers registered. Sending
IB_EVENT_PORT_ERR seems to solve the problem for most modules, but
rdma_cm, which doesn't have an event handler, is still hanging. Should
we implement an event handler for rdma_cm?
Thanks!
--
Kleber Sacilotto de Souza
IBM Linux Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2013-04-04 13:01 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-03-28 18:45 [PATCH] IB/mlx4: Fail post send command on error recovery Kleber Sacilotto de Souza
[not found] ` <1364496315-7588-1-git-send-email-klebers-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2013-04-02 9:15 ` Or Gerlitz
[not found] ` <515AA1C6.7070804-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-04-02 11:24 ` Jack Morgenstein
2013-04-02 17:00 ` Roland Dreier
[not found] ` <CAL1RGDW7wMVmyFhCv-Ei8Mbca-Y9yv+nygzfREU2_TozNSZ60A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-04-04 13:01 ` Kleber Sacilotto de Souza [this message]
[not found] ` <515D79B3.4090808-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2013-04-04 14:00 ` Jack Morgenstein
[not found] ` <201304041700.40349.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2013-04-04 21:45 ` Or Gerlitz
[not found] ` <CAJZOPZ+dgtQRX_sfcDc=aSOW553Twi0oqNjREeminQ2tnZeEmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-04-08 13:51 ` Kleber Sacilotto de Souza
[not found] ` <5162CB51.3080600-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2013-04-08 15:47 ` Or Gerlitz
2013-04-04 21:45 ` Or Gerlitz
[not found] ` <CAJZOPZLgCMDmTO-qqZXm9Y9xv+xCh5bezz3A_nBn6BEtB-G+0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-04-04 21:49 ` Roland Dreier
[not found] ` <CAL1RGDU=hn7hMy0ECQ7AOQqmuB8R6+BT6JUudNS_6rPBKr2UtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-04-08 14:07 ` Kleber Sacilotto de Souza
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=515D79B3.4090808@linux.vnet.ibm.com \
--to=klebers-23vcf4htsmix0ybbhkvfkdbpr1lh4cv8@public.gmane.org \
--cc=hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
--cc=jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org \
--cc=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
--cc=roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
--cc=sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox