From: Jack Wang <jinpu.wang@profitbricks.com>
To: Bart Van Assche <bvanassche@acm.org>
Cc: David Dillow <dillowda@ornl.gov>, Vu Pham <vuhuong@mellanox.com>,
Sebastian Riemer <sebastian.riemer@profitbricks.com>,
linux-rdma <linux-rdma@vger.kernel.org>,
linux-scsi <linux-scsi@vger.kernel.org>,
James Bottomley <jbottomley@parallels.com>,
Roland Dreier <roland@kernel.org>
Subject: Re: [PATCH 07/14] scsi_transport_srp: Add transport layer error handling
Date: Fri, 21 Jun 2013 14:17:41 +0200
Message-ID: <51C44465.3030506@profitbricks.com>
In-Reply-To: <51C1CDC8.4070103@acm.org>
On 06/19/2013 05:27 PM, Bart Van Assche wrote:
> On 06/19/13 15:44, Jack Wang wrote:
>>> + /*
>>> + * It can occur that after fast_io_fail_tmo expired and before
>>> + * dev_loss_tmo expired that the SCSI error handler has
>>> + * offlined one or more devices. scsi_target_unblock() doesn't
>>> + * change the state of these devices into running, so do that
>>> + * explicitly.
>>> + */
>>> + spin_lock_irq(shost->host_lock);
>>> + __shost_for_each_device(sdev, shost)
>>> + if (sdev->sdev_state == SDEV_OFFLINE)
>>> + sdev->sdev_state = SDEV_RUNNING;
>>> + spin_unlock_irq(shost->host_lock);
>>
>> Do you have test case to verify this behaviour?
>
> Hello Jack,
>
> This is what I came up with after analyzing why a so-called "port
> flapping" test failed. The concept of that test is simple: use
> ibportstate to disable and reenable the proper IB port on the switch
> with random intervals and check whether I/O starts running again if the
> path remains operational long enough. When running such a test for a few
> days with random intervals between a few seconds and a few minutes
> sooner or later it will occur that scsi_try_host_reset() succeeds and
> that scsi_eh_test_devices() fails. That will cause the SCSI error
> handler to offline devices. Hence the above code to change the offline
> state into running after a reconnect succeeds. I'm not proud of that
> code but I couldn't find a better solution. Maybe the above code won't
> be necessary anymore once we switch to Hannes' new SCSI error handler.
>
> Bart.
Thanks for the reply, Bart. In fact, we have seen the same problem you
describe here. It's reasonable to set the devices back to RUNNING after
the reconnect succeeds. I'm curious why scsi_target_unblock() doesn't
handle this case.

I'm not sure the new SCSI EH from Hannes will keep the SCSI EH from
setting devices offline in such a situation, but at least it will keep
one bad LUN from blocking the whole host.
Jack
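
To make the state transitions discussed above concrete, here is a minimal
userspace sketch. It is not kernel code: the model_* names, the three-state
enum and the array-based host are hypothetical stand-ins, and the unblock
behaviour is simplified to the one property this thread relies on, namely
that unblocking leaves OFFLINE devices untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's scsi_device states. */
enum model_sdev_state {
	MODEL_SDEV_RUNNING,
	MODEL_SDEV_BLOCK,
	MODEL_SDEV_OFFLINE,
};

struct model_sdev {
	enum model_sdev_state state;
};

/* Hypothetical host: a plain array instead of the kernel's device list. */
struct model_host {
	struct model_sdev *devs;
	size_t ndevs;
};

/*
 * Assumed, simplified unblock behaviour: only BLOCK devices go back to
 * RUNNING. OFFLINE devices are left as they are.
 */
static void model_unblock(struct model_host *host)
{
	size_t i;

	assert(host != NULL);
	for (i = 0; i < host->ndevs; i++)
		if (host->devs[i].state == MODEL_SDEV_BLOCK)
			host->devs[i].state = MODEL_SDEV_RUNNING;
}

/*
 * Mirrors the loop in the quoted patch: after a successful reconnect,
 * move every OFFLINE device back to RUNNING. Returns the number of
 * devices whose state was changed.
 */
static size_t model_revive_offline(struct model_host *host)
{
	size_t i, revived = 0;

	assert(host != NULL);
	for (i = 0; i < host->ndevs; i++) {
		if (host->devs[i].state == MODEL_SDEV_OFFLINE) {
			host->devs[i].state = MODEL_SDEV_RUNNING;
			revived++;
		}
	}
	return revived;
}
```

In this model, a device that the error handler offlined during the outage
stays OFFLINE after model_unblock() and only becomes usable again after
model_revive_offline(), which is the gap the explicit loop in the patch
appears to close.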