From mboxrd@z Thu Jan 1 00:00:00 1970 From: Bart Van Assche Subject: Re: SCSI LLDs, the SCSI error handler and host resource lifetime Date: Wed, 21 Nov 2012 13:26:35 +0100 Message-ID: <50ACC87B.5010401@acm.org> References: <50AB9286.8040403@acm.org> <50AC808D.1060700@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from georges.telenet-ops.be ([195.130.137.68]:43232 "EHLO georges.telenet-ops.be" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750892Ab2KUM0i (ORCPT ); Wed, 21 Nov 2012 07:26:38 -0500 In-Reply-To: <50AC808D.1060700@suse.de> Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Hannes Reinecke Cc: linux-scsi On 11/21/12 08:19, Hannes Reinecke wrote: > On 11/20/2012 03:24 PM, Bart Van Assche wrote: >> If I interpret the SCSI error handler source code correctly then >> scsi_unjam_host() may proceed concurrently with scsi_remove_host(). >> This means that the LLD eh_abort_handler callback may get invoked after >> scsi_remove_host() finished. At least the SRP initiator (ib_srp) cleans >> up resources necessary for aborting commands as soon as >> scsi_remove_host() returns. That looks like a race condition to me. As >> far as I can see it is only safe to clean up such resources after the >> EH thread has been stopped. Any opinions about adding an additional >> callback for this purpose in struct scsi_host_template ? >> >> Note: it doesn't look like a good idea to me to let scsi_remove_host() >> wait until error recovery has finished since scsi_remove_host() may get >> invoked from the context of a workqueue. If any work gets queued on the >> same workqueue related to SCSI error handling letting scsi_remove_host() >> wait for the error handler to finish might result in a deadlock. >> >> The patch below is a request for comments patch that does not only add a >> callback to struct scsi_host_template but also fixes a (hard to trigger) >> race condition in ib_srp: avoid that ib_destroy_cm_id() frees the IB RC >> connection while srp_send_tsk_mgmt() is using it. >> > Hmm. > This would still mean that the eh thread will run until finished. > Which can take _A LOT_ of time (we're speaking hours here). > I would rather have an additional return code in the various > scsi_try_XXX functions to terminate the loop quickly. How about combining both approaches ? I think the additional callback is needed anyway to prevent the race condition explained above. Making the SCSI EH stop quicker after scsi_remove_host() has been invoked looks like a good idea to me but I'm not sure that change alone is sufficient. Bart.