public inbox for linux-scsi@vger.kernel.org
* mpt2sas driver behaving strange with a failed SATA disk behind SAS expander.
@ 2011-08-17 14:25 Fredrik Lindgren
  2011-08-17 17:08 ` Peter Chang
  2011-08-17 18:35 ` Ravi Shankar
  0 siblings, 2 replies; 4+ messages in thread
From: Fredrik Lindgren @ 2011-08-17 14:25 UTC
  To: linux-scsi@vger.kernel.org

Hello,

I'm seeing something strange on a Supermicro 847E16-R1400. It has SAS
expanders with SATA disks behind them (Seagate Barracuda XT). The SAS
card is an LSI SAS9211-8i.

When doing disk IO on the disks (they are all configured in MD raids),
IO will suddenly stop and these messages are printed on the console
about once every second:

mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)

From what I understand this means:

PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)

So a disk is acting up, generating errors? What does the last "10" mean
in the sub_code? Is that an identifier for which disk it is?

After some time, the message changed:

mpt2sas0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)

Now the disk seems to have died completely?

PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_DSCVRY_SATA_INIT_TIMEOUT (0x00001000)
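
The way the driver splits these values can be sketched as below. This is only an illustration, assuming the layout the driver prints (bus type in the top 4 bits, then a 4-bit originator, an 8-bit code and a 16-bit sub code); the function name and the dictionary are made up for this example and are not part of mpt2sas.

```python
# Originator values as printed by the driver in the messages above.
# Only "PL" appears in this thread; "IOP" and "IR" are the other two
# originators the driver names.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Split a 32-bit log_info word into its fields (hypothetical helper)."""
    return {
        "bus_type": (value >> 28) & 0xF,   # 0x3 means SAS
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,      # 0x11 here (PL_LOGINFO_CODE_RESET)
        "sub_code": value & 0xFFFF,        # the full 16 bits, e.g. 0x0610
    }


if __name__ == "__main__":
    for word in (0x31110610, 0x31111000):
        fields = decode_log_info(word)
        print(f"0x{word:08x}: originator={fields['originator']} "
              f"code=0x{fields['code']:02x} sub_code=0x{fields['sub_code']:04x}")
```

Masking the first sub code with 0xFF00 gives the 0x0600 of PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET, which leaves the low byte 0x10 as the part being asked about.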

What bothers me is that the machine is just hanging there with IO
blocking for the disk in question (I guess; this was going on for
several hours). There were no SCSI errors and the drive in question was
not ejected from the MD array. After rebooting, it started to rebuild
the MD array, promptly got stuck again, and just sat there until the
disk was removed from the array and it was restarted again.

This was with a stock Debian Squeeze kernel 
(linux-image-2.6.32-5-amd64). I got the exact same
thing with a vanilla 3.0.1 from kernel.org.

Regards,
   Fredrik Lindgren

----

dmesg from 3.0.1:

mpt2sas version 08.100.00.02 loaded
mpt2sas 0000:06:00.0: PCI INT A -> GSI 26 (level, low) -> IRQ 26
mpt2sas 0000:06:00.0: setting latency timer to 64
mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (49559612 kB)
mpt2sas 0000:06:00.0: irq 72 for MSI/MSI-X
mpt2sas0: PCI-MSI-X enabled: IRQ 72
mpt2sas0: iomem(0x00000000fbc3c000), mapped(0xffffc90006068000), size(16384)
mpt2sas0: ioport(0x000000000000d000), size(256)
mpt2sas0: sending diag reset !!
mpt2sas0: diag reset: SUCCESS
mpt2sas0: Allocated physical memory: size(3971 kB)
mpt2sas0: Current Controller Queue Depth(1739), Max Controller Queue Depth(2000)
mpt2sas0: Scatter Gather Elements per IO(128)
mpt2sas0: LSISAS2008: FWVersion(09.00.00.00), ChipRevision(0x03), BiosVersion(07.17.00.00)
mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
mpt2sas0: sending port enable !!
mpt2sas0: host_add: handle(0x0001), sas_addr(0x500605b0034da7c0), phys(8)
mpt2sas0: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x5003048001016e7f), phys(38)
mpt2sas0: expander_add: handle(0x0023), parent(0x0002), sas_addr(0x5003048000f6b57f), phys(30)
mpt2sas0: port enable: SUCCESS

root@weathergirl:~# smp_rep_manufacturer /dev/bsg/expander-6\:0
Report manufacturer response:
   Expander change count: 85
   SAS-1.1 format: 1
   vendor identification: LSI CORP
   product identification: SAS2X36
   product revision level: 0717
   component vendor identification: LSI
   component id: 547
   component revision level: 5
root@weathergirl:~# smp_rep_manufacturer /dev/bsg/expander-6\:1
Report manufacturer response:
   Expander change count: 67
   SAS-1.1 format: 1
   vendor identification: LSI CORP
   product identification: SAS2X28
   product revision level: 0717
   component vendor identification: LSI
   component id: 545
   component revision level: 5



* Re: mpt2sas driver behaving strange with a failed SATA disk behind SAS expander.
  2011-08-17 14:25 mpt2sas driver behaving strange with a failed SATA disk behind SAS expander Fredrik Lindgren
@ 2011-08-17 17:08 ` Peter Chang
  2011-08-17 18:49   ` Peter Chang
  2011-08-17 18:35 ` Ravi Shankar
  1 sibling, 1 reply; 4+ messages in thread
From: Peter Chang @ 2011-08-17 17:08 UTC
  To: Fredrik Lindgren; +Cc: linux-scsi@vger.kernel.org

On 17 August 2011 07:25, Fredrik Lindgren <fli@swip.net> wrote:
> When doing disk IO on the disks (they are all configured in MD raids)
> suddenly IO will
> stop and these messages are printed on the console about once every second:
>
> mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
>
> From what I understand this means:
>
> PL_LOGINFO_CODE_RESET (0x00110000)
> PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)
>
> So a disk is acting up, generating errors? What does the last "10" mean in
> the sub_code,
> is that an identifier for which disk it is?

no, the bottom bits are still part of the error code.

i haven't run w/ your exact fw/driver setup, but i think you'll find
that you're in a 'loop' where the driver is returning DID_RESET and
the scsi layer is retrying w/o going through the retry counter logic
(the command that fails is one that the firmware issued).

\p
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: mpt2sas driver behaving strange with a failed SATA disk behind SAS expander.
  2011-08-17 14:25 mpt2sas driver behaving strange with a failed SATA disk behind SAS expander Fredrik Lindgren
  2011-08-17 17:08 ` Peter Chang
@ 2011-08-17 18:35 ` Ravi Shankar
  1 sibling, 0 replies; 4+ messages in thread
From: Ravi Shankar @ 2011-08-17 18:35 UTC
  Cc: linux-scsi@vger.kernel.org

On 08/17/11 07:25, Fredrik Lindgren wrote:
> Hello,
>
> I'm seeing something strange on a Supermicro 847E16-R1400. It has SAS 
> expanders
> with SATA disks behind them (Seagate Barracuda XT). The SAS card is a 
> LSI SAS9211-8i.
>
> When doing disk IO on the disks (they are all configured in MD raids) 
> suddenly IO will
> stop and these messages are printed on the console about once every 
> second:
>
> mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), 
> sub_code(0x0610)
>
> From what I understand this means:
>
> PL_LOGINFO_CODE_RESET (0x00110000)
> PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)
>
> So a disk is acting up, generating errors? What does the last "10" 
> mean in the sub_code,
> is that an identifier for which disk it is?
>
> After some time, the message changed:
>
> mpt2sas0: log_info(0x31111000): originator(PL), code(0x11),
> sub_code(0x1000)
>
> Now the disk seems to have died completely?
>
> PL_LOGINFO_CODE_RESET (0x00110000)
> PL_LOGINFO_SUB_CODE_DSCVRY_SATA_INIT_TIMEOUT (0x00001000)
>
I think sub code (0x610) indicates "Error in SATA ReadLogExt SATA
command", and subsequently the disk drive failed to initialize (SATA
initialization timeout). Since you've connected through an expander, the
link between the disk and the expander should be actively transmitting
FIS frames. You can verify whether the disk link is up by checking the
expander routing tables.

Try reducing the link speed (from 6 to 3 Gb/s) between HBA-Exp-Disk and
disabling Native Command Queuing, and see whether it helps.


* Re: mpt2sas driver behaving strange with a failed SATA disk behind SAS expander.
  2011-08-17 17:08 ` Peter Chang
@ 2011-08-17 18:49   ` Peter Chang
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Chang @ 2011-08-17 18:49 UTC
  To: Fredrik Lindgren; +Cc: linux-scsi@vger.kernel.org

On 17 August 2011 10:08, Peter Chang <dpf@google.com> wrote:
> On 17 August 2011 07:25, Fredrik Lindgren <fli@swip.net> wrote:
>> When doing disk IO on the disks (they are all configured in MD raids)
>> suddenly IO will
>> stop and these messages are printed on the console about once every second:
>>
>> mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
>>
>> From what I understand this means:
>>
>> PL_LOGINFO_CODE_RESET (0x00110000)
>> PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)
>>
>> So a disk is acting up, generating errors? What does the last "10" mean in
>> the sub_code,
>> is that an identifier for which disk it is?
>
> no, the bottom bits are still part of the error code.
>
> i haven't run w/ your exact fw/driver setup, but i think you'll find
> that you're in a 'loop' where the driver is returning DID_RESET and
> the scsi layer is retrying w/o going through the retry counter logic
> (the command that fails is one that the firmware issued).

since someone else gave the error code (i didn't check if i just had
some other magic header)...

the problem is probably a combination of the disk and controller
firmwares. when an NCQ request fails the firmware will do a READ LOG
EXT(10) to figure out why. some disks don't handle this sequence
the way the firmware expects, so it starts the COMRESET dance w/ the
disk and returns an event w/ the loginfo to the driver/kernel.

the 'fix' (really a workaround) is in
mpt2sas_scsih.c:_scsih_io_done(). in the case for
MPI2_IOCSTATUS_SCSI_TASK_TERMINATED change the DID_RESET to
DID_SOFT_ERROR and the rest of the scsi layer will go down the regular
retry handling and you'll get out of the 'loop'.
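
a toy model of why that change gets you out of the 'loop', as a rough sketch only: it is not the scsi layer's code, and the function name, the result strings and the cap on attempts are all made up for illustration. the only behaviour it assumes is what is described above: a DID_RESET completion is retried without touching the retry counter, while DID_SOFT_ERROR counts against the command's allowed retries.

```python
def simulate_failing_command(host_byte, allowed=5, max_attempts=1000):
    """Model a command that fails on every attempt (hypothetical helper).

    Returns a (state, attempts) tuple. max_attempts only exists so that
    the model terminates; the real loop has no such cap in this scenario.
    """
    retries = 0
    attempts = 0
    while attempts < max_attempts:
        attempts += 1                # command is issued and fails again
        if host_byte == "DID_RESET":
            continue                 # requeued, retry counter untouched
        retries += 1                 # regular retry accounting
        if retries > allowed:
            return ("failed", attempts)   # error finally goes up the stack
    return ("still_retrying", attempts)


if __name__ == "__main__":
    print(simulate_failing_command("DID_RESET"))
    print(simulate_failing_command("DID_SOFT_ERROR"))
```

with DID_RESET the model is still retrying when the cap is hit; with DID_SOFT_ERROR the command fails after the first attempt plus the allowed retries, which is the point where an upper layer such as MD would get to see an error.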

lsi is supposed to have this fix coming soon.

disabling NCQ will 'fix' this as well.
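
one possible way to try that from userspace is to drop the device queue depth to 1 through sysfs. this is a sketch under assumptions: the helper name and the sysfs_root parameter are invented for this example, and whether the firmware really stops issuing NCQ commands at a depth of 1 is an assumption that has not been checked on this setup. writing the real attribute needs root.

```python
from pathlib import Path


def set_queue_depth(disk, depth, sysfs_root="/sys"):
    """Write depth to a disk's queue_depth attribute and return the
    value read back (hypothetical helper; disk is a name such as "sdb")."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    attr = Path(sysfs_root) / "block" / disk / "device" / "queue_depth"
    attr.write_text(f"{depth}\n")
    # Read back, since the kernel may clamp the value it was given.
    return int(attr.read_text().strip())
```

the sysfs_root parameter is only there so the helper can be pointed at a scratch directory instead of the live /sys.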


\p
