From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bart Van Assche
Subject: Re: [PATCH 07/14] scsi_transport_srp: Add transport layer error handling
Date: Wed, 19 Jun 2013 17:27:04 +0200
Message-ID: <51C1CDC8.4070103@acm.org>
References: <51C1B5CA.2030302@profitbricks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from georges.telenet-ops.be ([195.130.137.68]:52156 "EHLO georges.telenet-ops.be" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934891Ab3FSP1I (ORCPT ); Wed, 19 Jun 2013 11:27:08 -0400
In-Reply-To: <51C1B5CA.2030302@profitbricks.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Jack Wang
Cc: David Dillow, Vu Pham, Sebastian Riemer, linux-rdma, linux-scsi, James Bottomley, Roland Dreier

On 06/19/13 15:44, Jack Wang wrote:
>> +       /*
>> +        * It can occur that after fast_io_fail_tmo expired and before
>> +        * dev_loss_tmo expired that the SCSI error handler has
>> +        * offlined one or more devices. scsi_target_unblock() doesn't
>> +        * change the state of these devices into running, so do that
>> +        * explicitly.
>> +        */
>> +       spin_lock_irq(shost->host_lock);
>> +       __shost_for_each_device(sdev, shost)
>> +               if (sdev->sdev_state == SDEV_OFFLINE)
>> +                       sdev->sdev_state = SDEV_RUNNING;
>> +       spin_unlock_irq(shost->host_lock);
>
> Do you have test case to verify this behaviour?

Hello Jack,

This is what I came up with after analyzing why a so-called "port
flapping" test failed. The concept of that test is simple: use
ibportstate to disable and reenable the proper IB port on the switch at
random intervals and check whether I/O starts running again if the path
remains operational long enough. When such a test runs for a few days,
with random intervals ranging from a few seconds to a few minutes,
sooner or later scsi_try_host_reset() succeeds and
scsi_eh_test_devices() fails.
That will cause the SCSI error handler to offline devices. Hence the
above code, which changes the state of offlined devices back to running
after a reconnect succeeds. I'm not proud of that code but I couldn't
find a better solution. Maybe the above code won't be necessary anymore
once we switch to Hannes' new SCSI error handler.

Bart.
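
For illustration only, the state reset discussed above can be modelled in
plain userspace C. This is a rough sketch, not the patch itself: the enum,
the struct, and the function name below are hypothetical stand-ins for the
kernel's scsi_device state handling, and the host_lock that the quoted
code takes around the loop is not modelled here.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's device states. */
enum model_sdev_state {
	MODEL_SDEV_RUNNING,
	MODEL_SDEV_BLOCKED,
	MODEL_SDEV_OFFLINE,
};

/* Hypothetical stand-in for struct scsi_device; only the state is kept. */
struct model_sdev {
	enum model_sdev_state state;
};

/*
 * After a successful reconnect, move every device that the error handler
 * had offlined back to the running state. Devices in any other state are
 * left untouched. Returns the number of devices whose state was changed.
 */
size_t model_reset_offline(struct model_sdev *devs, size_t n)
{
	size_t changed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (devs[i].state == MODEL_SDEV_OFFLINE) {
			devs[i].state = MODEL_SDEV_RUNNING;
			changed++;
		}
	}
	return changed;
}
```

Only offlined devices are touched; a device that is blocked or already
running keeps its state, which mirrors the condition in the quoted loop.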