public inbox for linux-scsi@vger.kernel.org
From: brem belguebli <brem.belguebli@gmail.com>
Cc: linux-scsi@vger.kernel.org
Subject: lpfc SAN/SCSI issue
Date: Thu, 22 Apr 2010 21:24:35 +0200
Message-ID: <1271964275.2480.1.camel@localhost>
In-Reply-To: <20100422164739.GA15813@lsil.com>

I have a server (RHEL 5.3) connected to 2 extended SAN fabrics (across 2
sites, distance 1 ms, links are ISLs with 100 km long-distance buffer
credits) via 2 lpfc HBAs (LPe1105-HP FC with the LPFC driver shipped
with RHEL 5.3, version 8.2.0.33.3p).
 
A SAN fabric reconfiguration (DWDM ring failover from worker to
protection) occurred yesterday after an intersite telco link switch
that lasted less than 0.3 ms.
 
Only one fabric was impacted, named FABRIC2.
 
Our server is connected to the fabrics thru 2 edge switches, so it is
not directly connected to the core switches on which the link failure
occurred.
 
From then on, our server (which accesses the LUNs from our 2 sites thru
the 2 fabrics) started to climb in terms of load average (up to 250 for
a dual-processor quad-core machine!) with a high percentage of iowait
(up to 50%).
 
We did some testing, bypassing DM-MP by issuing dd commands to the
physical /dev/sdX devices (more than 30 LUNs are presented to the
server, each seen thru 4 paths, making more than 120 /dev/sd devices),
and half of our dd processes went to D state, as did some individual
scsi_id commands that we ran manually on the same physical devices.
 
Multipathd itself was also in D state. 
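
For reference, the check for stuck processes described above can be
sketched as follows. This is only a minimal illustration, assuming the
standard Linux /proc/<pid>/stat layout (pid, command name in
parentheses, then the state letter); the function names and the
proc_root parameter are hypothetical and not part of any tool used
during the outage.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' in the line.
    left, right = line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Run during such an outage, a listing like this would be expected to
include the hung dd, scsi_id and multipathd processes.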
 
The only way to restore the whole thing, after 2 hours of investigation,
was to reset the server HBA connected to FABRIC2.
 
No SCSI log message of any kind appeared during the outage (~2 hours),
despite the fact that the SCSI timeout set on the physical devices is
60s and the HBA timeout is 14s.
 
The /sys/block/sdX/device/state files were showing "running" despite
the fact that the devices (well, half of them) were actually
inaccessible.
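
The sysfs check mentioned above can be sketched like this. It is a
minimal illustration only: it reads the state and timeout attributes
that the kernel exposes under /sys/block/sdX/device/, and the function
names and the sys_block parameter are hypothetical. It reports only
what the kernel believes, so in the situation described here it would
still show "running" for the inaccessible paths.

```python
import glob
import os


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None  # attribute missing or unreadable


def scsi_disk_status(sys_block="/sys/block"):
    """Map each sd* device name to its reported state and timeout."""
    status = {}
    for dev_dir in sorted(glob.glob(os.path.join(sys_block, "sd*"))):
        name = os.path.basename(dev_dir)
        status[name] = {
            "state": read_attr(os.path.join(dev_dir, "device", "state")),
            "timeout": read_attr(os.path.join(dev_dir, "device", "timeout")),
        }
    return status


if __name__ == "__main__":
    for name, attrs in scsi_disk_status().items():
        print(name, attrs["state"], attrs["timeout"])
```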
 
Which leads me to:
 
1) assumption: it looks like the lpfc driver, following this SAN event,
goes into a black hole mode, not returning any I/O error or anything
else to the SCSI upper layer.
 
2) question: how come the SCSI timers don't trigger and declare the
device faulty? (The answer may be in the above assumption.)
 
Any idea or tip on what could cause this? Some FC SCN message not
handled well, or something else?

Regards

Brem


