mpt2sas losing reset events with cable pulls?

From: Roland Dreier <roland@kernel.org>
To: Kashyap Desai <kashyap.desai@lsi.com>, Eric Moore <eric.moore@lsi.com>
Cc: eric@purestorage.com, linux-scsi@vger.kernel.org
Subject: mpt2sas losing reset events with cable pulls?
Date: Tue, 30 Aug 2011 17:51:08 -0700
Message-ID: <1314751868-1112-1-git-send-email-roland@kernel.org>

Hi!

We have a system running the mpt2sas driver from the upstream kernel --

    #define MPT2SAS_DRIVER_VERSION              "09.100.00.00"

and hardware:

    mpt2sas1: LSISAS2008: FWVersion(09.00.00.00), ChipRevision(0x03), BiosVersion(07.17.00.00)

We have a SAS JBOD with a bunch of SSDs in it, connected with two wide
SAS ports, running Linux multipathing.  If we pull one of the cables
with IO running, then occasionally (say, 1 in 100 cable pulls) some of
the IO gets "stuck" -- we continually hit

	else if (sas_device_priv_data->block || sas_target_priv_data->tm_busy)
		return SCSI_MLQUEUE_DEVICE_BUSY;

in the mpt2sas _scsih_qcmd() function, where tm_busy never gets
cleared.  We added some debugging to _scsih_sas_device_status_change_event()
and we found that when things go wrong, we get an event of type
MPI2_EVENT_SAS_DEV_STAT_RC_INTERNAL_DEVICE_RESET for each one of the
SSDs (which sets tm_busy for each target), but then
MPI2_EVENT_SAS_DEV_STAT_RC_CMP_INTERNAL_DEV_RESET never comes for one
of the targets (so tm_busy is never cleared).

In other words, we get the reset event for every handle in the range
0x24, 0x25, ..., 0x3a (and hence for every target); but in the broken
case, the reset complete event arrives for all of those handles
*except* one (for example 0x39).
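
To make the failure sequence concrete, here is a small userspace model of
it. This is only a sketch and not the driver code: every model_* name and
the MODEL_* constants are made up for illustration, the enum values stand
in for the two MPI2 reason codes named above without reproducing their real
values, and it assumes that the only thing setting and clearing tm_busy is
this pair of events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two reason codes discussed above.
 * The real definitions live in the MPI2 headers. */
enum model_reason {
	MODEL_RC_INTERNAL_DEVICE_RESET,
	MODEL_RC_CMP_INTERNAL_DEV_RESET,
};

/* Handle range taken from the example in this mail. */
#define MODEL_FIRST_HANDLE 0x24
#define MODEL_LAST_HANDLE  0x3a
#define MODEL_NUM_TARGETS  (MODEL_LAST_HANDLE - MODEL_FIRST_HANDLE + 1)

struct model_target {
	bool tm_busy;
};

static struct model_target model_targets[MODEL_NUM_TARGETS];

struct model_target *model_lookup(uint16_t handle)
{
	assert(handle >= MODEL_FIRST_HANDLE && handle <= MODEL_LAST_HANDLE);
	return &model_targets[handle - MODEL_FIRST_HANDLE];
}

/* Reset event sets tm_busy; reset-complete event clears it. */
void model_status_change_event(uint16_t handle, enum model_reason rc)
{
	struct model_target *t = model_lookup(handle);

	if (rc == MODEL_RC_INTERNAL_DEVICE_RESET)
		t->tm_busy = true;
	else if (rc == MODEL_RC_CMP_INTERNAL_DEV_RESET)
		t->tm_busy = false;
}

/* True if a queued command would be bounced back as "device busy". */
bool model_qcmd_would_block(uint16_t handle)
{
	return model_lookup(handle)->tm_busy;
}

/*
 * Replay one cable pull: every handle gets the reset event, and every
 * handle except lost_handle gets the completion.  Pass a handle outside
 * the range (e.g. 0) to model the good case.  Returns the number of
 * targets left busy and stores the first such handle in *stuck_handle.
 */
size_t model_replay_cable_pull(uint16_t lost_handle, uint16_t *stuck_handle)
{
	size_t stuck = 0;
	uint16_t h;

	for (h = MODEL_FIRST_HANDLE; h <= MODEL_LAST_HANDLE; h++)
		model_status_change_event(h, MODEL_RC_INTERNAL_DEVICE_RESET);

	for (h = MODEL_FIRST_HANDLE; h <= MODEL_LAST_HANDLE; h++)
		if (h != lost_handle)
			model_status_change_event(h, MODEL_RC_CMP_INTERNAL_DEV_RESET);

	for (h = MODEL_FIRST_HANDLE; h <= MODEL_LAST_HANDLE; h++) {
		if (model_qcmd_would_block(h)) {
			if (stuck == 0)
				*stuck_handle = h;
			stuck++;
		}
	}
	return stuck;
}
```

Calling model_replay_cable_pull(0x39, &h) leaves exactly one target busy
(handle 0x39), and commands to it would be bounced indefinitely, which
matches what we see. With no completion dropped, no target stays busy.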

This leads to the system getting wedged waiting for a SCSI command
that will never finish, which is not what we're after when one of our
paths to the JBOD goes down (given that we have a second path and are
aiming at fault tolerance here!).

This feels to me like it is probably a firmware race or some other
bug, but perhaps the driver is losing events somehow.  Anyway, how can
we fix this?  Please let me know if there's any further debugging
information I can collect to help make progress on this.

Thanks!
  Roland

Thread overview: 4+ messages
2011-08-31  0:51 Roland Dreier [this message]
2011-08-31  2:41 ` mpt2sas losing reset events with cable pulls? Douglas Gilbert
     [not found]   ` <CAL1RGDU9QtuXGhMQN_cT6+RPh0NTZexYwE8HK1fpw5vuX_FLOw@mail.gmail.com>
2011-08-31  3:16     ` Eric Seppanen
2011-08-31  5:41     ` Roland Dreier