From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bart Van Assche
Subject: Re: [PATCH 04/12] IB/srp: Fix connection state tracking
Date: Wed, 6 May 2015 11:29:16 +0200
Message-ID: <5549DEEC.9050501@sandisk.com>
References: <5541EE21.3050809@sandisk.com> <5541EE9F.8090605@sandisk.com>
 <1430410094.102408.71.camel@redhat.com> <55488BAE.7070006@sandisk.com>
 <1430835029.2407.187.camel@redhat.com> <5548D2FF.7030501@sandisk.com>
 <1430838637.2407.209.camel@redhat.com> <5548E155.70007@sandisk.com>
 <1430842201.2407.226.camel@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1430842201.2407.226.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Doug Ledford
Cc: James Bottomley, Sagi Grimberg, Sebastian Parschauer, linux-rdma,
 "linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
List-Id: linux-rdma@vger.kernel.org

Hello Doug,

On 05/05/15 18:10, Doug Ledford wrote:
> Be that as it may, that doesn't change what I said about posting a
> command to a known disconnected QP. You could just fail immediately.
> Something like:
>
> if (!ch->connected) {
> 	scmnd->result = DID_NO_CONNECT;
> 	goto err;
> }
>
> right after getting the channel in queuecommand would work. That would
> save a couple spinlocks, several DMA mappings, a call into the low level
> driver, and a few other things. (And I only left requeue on the table
> because I wasn't sure how the blk_mq dealt with just a single channel
> being down versus all of them being down)

What you wrote above looks correct to me. However, it is intentional
that such a check is not present in srp_queuecommand(). The intention
was to optimize the hot path of that driver as much as possible. Hence
the choice to post a work request on the QP even after it has been
disconnected and to let the HCA generate an error completion.
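
As a rough illustration of that trade-off, here is a small user-space
sketch. It is not the ib_srp code: every type and function name below
(sketch_channel, sketch_cmd, sketch_post_send, queue_with_check,
queue_without_check) is hypothetical, and the error completion is
delivered synchronously only to keep the example short, whereas a real
HCA reports it asynchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the ib_srp data structures. */
enum sketch_result { SKETCH_OK = 0, SKETCH_NO_CONNECT = 1 };

struct sketch_channel {
	bool connected;                 /* state an early check would test */
	unsigned int dma_maps;          /* setup work done before posting */
	unsigned int posted;            /* work requests handed to the "HCA" */
	unsigned int error_completions; /* requests flushed with an error */
};

struct sketch_cmd {
	enum sketch_result result;
	bool completed;
};

/*
 * Models the HCA side: a request posted on a disconnected QP is
 * completed with an error instead of being sent.
 */
static void sketch_post_send(struct sketch_channel *ch, struct sketch_cmd *cmd)
{
	assert(ch != NULL && cmd != NULL);

	ch->posted++;
	if (!ch->connected) {
		ch->error_completions++;
		cmd->result = SKETCH_NO_CONNECT;
	} else {
		cmd->result = SKETCH_OK;
	}
	cmd->completed = true;
}

/* Variant A: fail immediately, at the price of a test on every command. */
static int queue_with_check(struct sketch_channel *ch, struct sketch_cmd *cmd)
{
	if (!ch->connected) {
		cmd->result = SKETCH_NO_CONNECT;
		cmd->completed = true;
		return -1;
	}
	ch->dma_maps++;
	sketch_post_send(ch, cmd);
	return 0;
}

/* Variant B: no test in the hot path; the error completion reports it. */
static int queue_without_check(struct sketch_channel *ch, struct sketch_cmd *cmd)
{
	ch->dma_maps++;
	sketch_post_send(ch, cmd);
	return 0;
}
```

Both variants leave a command that was queued on a disconnected channel
with the same failed result. The first skips the mapping and posting
work in that case, the second keeps the extra branch out of the path
taken by every command while the connection is up.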

> But my point in all of this is that if you have a single qp between
> yourself and the target, then any error including a qp resource error ==
> path error since you only have one path. When you have a multi queue
> device, that's no longer true. A transient resource problem on one qp
> does not mean a path event (at least not necessarily, although your
> statement below converts a QP event into a path event by virtue
> disconnecting and reconnecting all of the QPs). My curiosity is now
> moot given what you wrote about tearing everything down and reconnecting
> (unless the error handling is modified to be more subtle in its
> workings), but the original question in my mind was what happens at the
> blk_mq level if you did have a single queue drop but not all of them and
> you weren't using multipath.

If we want to support this without adding similar code to handle this
in every SCSI LLD I think we need to change first how blk-mq and
dm-multipath interact. Today dm-multipath is a layer on top of blk-mq.
Supporting the above scenario properly is possible e.g. by integrating
multipath support in the blk-mq layer. I think Hannes and Christoph
have already started to work on this.

>> If only one channel fails all other channels are disconnected and the
>> transport layer error handling mechanism is started.
>
> I missed that. I assume it's done in srp_start_tl_fail_timers()?

Yes, that's correct. Both QP errors and reception of a DREQ trigger a
call of srp_tl_err_work(). That last function calls
srp_start_tl_fail_timers() which starts the reconnection mechanism, at
least if the reconnect_delay parameter has a positive value (> 0).

Bart.