* Re: MegaSAS Hang on Smart Query
2006-06-30 3:03 ` Douglas Gilbert
@ 2006-06-30 18:49 ` Keith Baker
From: Keith Baker @ 2006-06-30 18:49 UTC (permalink / raw)
To: dougg; +Cc: linux-scsi
OK, it turns out the exact command being run was smartctl -H, so I ran this:
localhost:~# smartctl -H -r ioctl,3 /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
[inquiry: 12 00 00 00 24 00 ]
scsi_status=0x0, host_status=0x0, driver_status=0x0
info=0x0 duration=0 milliseconds
Incoming data, len=36:
00 00 00 05 02 5b 00 00 02 44 45 4c 4c 20 20 20 20
10 50 45 52 43 20 35 2f 69 20 20 20 20 20 20 20 20
20 31 2e 30 30
status=0x0
[log sense: 4d 00 40 00 00 00 00 00 04 00 ]
scsi_status=0x2, host_status=0x0, driver_status=0x8
info=0x1 duration=0 milliseconds
Incoming data, len=4:
00 00 00 05 02
>>> Sense buffer, len=19:
00 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00
10 00 00 00
status=2: sense_key=5 asc=20 ascq=0
Log Sense for supported pages failed [unsupported scsi opcode]
[request sense: 03 00 00 00 12 00 ]
scsi_status=0x0, host_status=0x0, driver_status=0x0
info=0x0 duration=0 milliseconds
Incoming data, len=18:
00 70 00 00 00 00 00 00 0b 00 00 00 00 00 00 00 00
10 00 00
status=0x0
SMART Health Status: OK
localhost:~#
Note that this command returned fine!
Then I tried it again and it hung at the inquiry:
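For anyone reading the hex by hand, here is a rough Python sketch that decodes the two buffers from the successful run: the 36-byte standard INQUIRY response and the 19-byte sense buffer returned for LOG SENSE. The field offsets assume the standard INQUIRY layout and fixed-format (0x70) sense data; the function names are made up for illustration, and the bytes are copied from the trace with the leading offset column (00, 10, 20) dropped.

```python
# Bytes copied from the smartctl trace above; the first column of each
# dump line (00, 10, 20) is treated as an offset and left out.
INQUIRY = bytes.fromhex(
    "00 00 05 02 5b 00 00 02 44 45 4c 4c 20 20 20 20 "
    "50 45 52 43 20 35 2f 69 20 20 20 20 20 20 20 20 "
    "31 2e 30 30"
)
SENSE = bytes.fromhex(
    "70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 "
    "00 00 00"
)


def parse_inquiry(data):
    """Pull the identity fields out of a 36-byte standard INQUIRY response."""
    if len(data) < 36:
        raise ValueError("need at least 36 bytes of INQUIRY data")
    return {
        "device_type": data[0] & 0x1F,   # 0x00 = direct-access block device
        "version": data[2],              # claimed SPC version code
        "vendor": data[8:16].decode("ascii").strip(),
        "product": data[16:32].decode("ascii").strip(),
        "revision": data[32:36].decode("ascii").strip(),
    }


def parse_fixed_sense(data):
    """Extract sense key, ASC and ASCQ from fixed-format sense data."""
    if (data[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    if len(data) < 14:
        raise ValueError("sense buffer too short for ASC/ASCQ")
    return {
        "sense_key": data[2] & 0x0F,     # 0x5 = ILLEGAL REQUEST
        "asc": data[12],                 # 0x20 with ascq 0x00 =
        "ascq": data[13],                # INVALID COMMAND OPERATION CODE
    }


if __name__ == "__main__":
    print(parse_inquiry(INQUIRY))
    print(parse_fixed_sense(SENSE))
```

Decoded this way, the controller identifies itself as vendor DELL, product PERC 5/i, revision 1.00, and the LOG SENSE failure is sense key 5 with asc 0x20, which matches what smartctl printed.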
localhost:~# smartctl -H -r ioctl,3 /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
[inquiry: 12 00 00 00 24 00 ]
After a minute or so I then get this from dmesg:
sd 0:2:0:0: megasas: RESET -26412 cmd=12
megasas: [ 0]waiting for 7 commands to complete
megasas: [ 5]waiting for 7 commands to complete
megasas: [10]waiting for 7 commands to complete
MESSAGE REPEATED up to [175]
megasas: failed to do reset
sd 0:2:0:0: megasas: RESET -26412 cmd=12
megasas: cannot recover from previous reset failures
sd 0:2:0:0: megasas: RESET -26412 cmd=12
megasas: cannot recover from previous reset failures
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 32224045
Buffer I/O error on device sda3, logical block 3487820
lost page write due to I/O error on sda3
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 1063841686
Buffer I/O error on device sda7, logical block 76433411
lost page write due to I/O error on sda7
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 376122118
Buffer I/O error on device sda6, logical block 38470685
lost page write due to I/O error on sda6
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 376293934
Buffer I/O error on device sda6, logical block 38492162
lost page write due to I/O error on sda6
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 1063841694
Buffer I/O error on device sda7, logical block 76433412
lost page write due to I/O error on sda7
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 32420053
Buffer I/O error on device sda3, logical block 3512321
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 38487730
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 2950192
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 38487679
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 38487688
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda3.
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda7.
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda6.
sd 0:2:0:0: rejecting I/O to offline device
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda6): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
ext3_abort called.
EXT3-fs error (device sda3): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:2:0:0: rejecting I/O to offline device
printk: 11 messages suppressed.
Buffer I/O error on device sda3, logical block 0
lost page write due to I/O error on sda3
Buffer I/O error on device sda3, logical block 1
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 5
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 426021
lost page write due to I/O error on sda3
Buffer I/O error on device sda3, logical block 426022
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 426090
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
REPEATED a few hundred times
printk: 128 messages suppressed.
Buffer I/O error on device sda6, logical block 38469634
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
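As a reading aid for the "return code = 0x6000000" lines in the dmesg output above, here is a small Python sketch that splits a SCSI result word into its four bytes. It assumes the 2.6-era layout (status in bits 0-7, message byte in 8-15, host byte in 16-23, driver byte in 24-31) and only names the driver byte values that are relevant here; the function and table names are made up for illustration.

```python
# Partial table of driver byte values, as an assumption based on the
# 2.6-era kernel headers; only the ones relevant to this report are listed.
DRIVER_NAMES = {
    0x00: "DRIVER_OK",
    0x06: "DRIVER_TIMEOUT",
    0x08: "DRIVER_SENSE",
}


def split_scsi_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    return {
        "status": result & 0xFF,          # raw SCSI status byte
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": (result >> 16) & 0xFF,    # host adapter (DID_*) byte
        "driver": (result >> 24) & 0xFF,  # mid-level driver byte
    }


if __name__ == "__main__":
    parts = split_scsi_result(0x6000000)
    name = DRIVER_NAMES.get(parts["driver"], "unknown")
    print(parts, name)
```

Under that assumption 0x6000000 has a driver byte of 0x06 with everything else zero, i.e. the commands are being reported as timed out rather than failed with sense data.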
Then I get this from smartctl:
scsi_status=0x0, host_status=0x0, driver_status=0x6
info=0x1 duration=234328 milliseconds
Incoming data, len=36:
00 50 05 a5 f5 80 a1 42 c0 00 00 00 00 00 00 00 00
10 00 00 00 00 00 00 00 00 00 00 00 c0 0f a4 12 c0
20 00 00 00 00
[inquiry: 12 00 00 00 24 00 ]
SCSI_IOCTL_SEND_COMMAND ioctl failed, errno=19 [No such device]
Standard Inquiry (36 bytes) failed [No such device]
Retrying with a 64 byte Standard Inquiry
[inquiry: 12 00 00 00 40 00 ]
SCSI_IOCTL_SEND_COMMAND ioctl failed, errno=19 [No such device]
Standard Inquiry (64 bytes) failed [No such device]
A mandatory SMART command failed: exiting. To continue, add one or more
'-T permissive' options.
Then the kernel gets really unhappy and I get:
Message from syslogd@localhost at Fri Jun 30 14:37:31 2006 ...
localhost kernel: journal commit I/O error
> Keith Baker wrote:
>> I've been having a hang with 2.6.16.22 and the megasas driver. I'm pretty
>> sure it has to do with a smartctl -a <logical drive>. The SCSI layer gets
>> all sorts of in a twist.
>
> Keith,
> Could you add '-r ioctl,3' to the smartctl command line
> to get a full debug output. Then we can see which SCSI
> commands the megasas driver or hardware doesn't like.
>
>> megasas: waiting for 2 commands to complete
>> - repeats a bunch of times then -
>> sd 0:2:0:0: rejecting I/O to offline device
>>
>> Given a bit of wisdom in a driver distributed by Dell which mentioned the
>> controller not responding to a cache inquiry... isn't the correct thing to
>> do to respond with some sort of unsupported response? not just ignore the
>> query?
>
> Correct. I'm sure the vendor knows what should be done.
>
>> I've hunted around for patches around this problem but haven't found any,
>> of course "don't use smart against a logical drive" works, but I'm not the
>> only one using these boxes and it does cause the system to go down.
>
> Doug Gilbert
>
>
>
>
--
Keith Baker
Systems Administrator
MetaCarta, Inc
350 Massachusetts Ave, 4th Floor
Cambridge, MA 02139 USA
Office: (617) 661-6382, ext. 527
email: keith.baker@metacarta.com
PGP Key: 0190570B
www.metacarta.com <http://www.metacarta.com>