* MegaSAS Hang on Smart Query
From: Keith Baker @ 2006-06-30 0:29 UTC
To: linux-scsi
I've been having a hang with 2.6.16.22 and the megasas driver. I'm pretty
sure it has to do with running 'smartctl -a <logical drive>'. The SCSI layer
gets itself all in a twist.
megasas: waiting for 2 commands to complete
- repeats a bunch of times then -
sd 0:2:0:0: rejecting I/O to offline device
Given a bit of wisdom in a driver distributed by Dell, which mentioned the
controller not responding to a cache inquiry: isn't the correct thing to
do to respond with some sort of "unsupported" response, rather than just
ignoring the query?
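For reference, the standard way for a SCSI target to say "I don't support that" is to complete the command with CHECK CONDITION status and sense data carrying sense key ILLEGAL REQUEST (0x5), with ASC 0x20 (INVALID COMMAND OPERATION CODE) for an unknown opcode, or ASC 0x24 (INVALID FIELD IN CDB) when the command exists but the requested page or field does not. The Python sketch below only builds that 18-byte fixed-format sense buffer to show its layout; the helper and constant names are illustrative and are not taken from the megasas driver or any controller firmware.

```python
SAM_STAT_CHECK_CONDITION = 0x02
ILLEGAL_REQUEST = 0x05
ASC_INVALID_COMMAND_OPCODE = 0x20   # command not supported at all
ASC_INVALID_FIELD_IN_CDB = 0x24     # command supported, but not this page/field


def illegal_request_response(asc: int, ascq: int = 0x00) -> tuple[int, bytes]:
    """Return (SCSI status, fixed-format sense data) rejecting a command.

    Hypothetical helper: shows the bytes a target is expected to return
    instead of silently ignoring a command it does not implement.
    """
    sense = bytearray(18)
    sense[0] = 0x70               # response code: current error, fixed format
    sense[2] = ILLEGAL_REQUEST    # sense key (low nibble of byte 2)
    sense[7] = len(sense) - 8     # additional sense length: bytes after byte 7
    sense[12] = asc               # additional sense code
    sense[13] = ascq              # additional sense code qualifier
    return SAM_STAT_CHECK_CONDITION, bytes(sense)


status, sense = illegal_request_response(ASC_INVALID_COMMAND_OPCODE)
print(f"status=0x{status:02x} sense={sense.hex(' ')}")
```

A host that gets this answer can fail the one command cleanly; a command that is never answered instead has to time out, which is what drags the error handler into resets.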
I've hunted around for patches for this problem but haven't found any.
Of course, "don't use SMART against a logical drive" works, but I'm not the
only one using these boxes, and it does take the system down.
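Until the firmware or driver is fixed, the "don't use SMART against a logical drive" workaround could be enforced in whatever script runs smartctl. The sketch below is a hypothetical Python guard: it reads the vendor and model strings that sysfs exposes under /sys/block/<dev>/device/ and refuses to query devices that look like PERC logical drives. The DELL/PERC match comes from the INQUIRY data quoted later in this thread; the table and function names are assumptions and would need extending for other controllers.

```python
from pathlib import Path

# Hypothetical table of (vendor, model prefix) pairs that identify RAID
# logical drives which should not be sent SMART queries.
RAID_LOGICAL_DRIVES = [("DELL", "PERC")]


def is_raid_logical_drive(vendor: str, model: str) -> bool:
    """True if the INQUIRY vendor/model strings match a known RAID volume."""
    vendor, model = vendor.strip(), model.strip()
    return any(vendor == v and model.startswith(p) for v, p in RAID_LOGICAL_DRIVES)


def safe_to_query(dev: str, sysfs: Path = Path("/sys/block")) -> bool:
    """Decide whether running smartctl against /dev/<dev> looks safe."""
    base = sysfs / dev / "device"
    try:
        vendor = (base / "vendor").read_text()
        model = (base / "model").read_text()
    except OSError:
        return False  # cannot identify the device: err on the side of caution
    return not is_raid_logical_drive(vendor, model)
```

On the box in this thread, safe_to_query("sda") would be expected to return False, since the logical drive reports vendor "DELL" and model "PERC 5/i". Refusing when the device cannot be identified is a deliberate, conservative choice.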
--
Keith Baker
MetaCarta, Inc
350 Massachusetts Ave, 4th Floor
Cambridge, MA 02139 USA
www.metacarta.com
* Re: MegaSAS Hang on Smart Query
From: Douglas Gilbert @ 2006-06-30 3:03 UTC
To: Keith Baker; +Cc: linux-scsi
Keith Baker wrote:
> I've been having a hang with 2.6.16.22 and the megasas driver. I'm pretty
> sure it has to do with running 'smartctl -a <logical drive>'. The SCSI layer
> gets itself all in a twist.
Keith,
Could you add '-r ioctl,3' to the smartctl command line
to get full debug output? Then we can see which SCSI
commands the megasas driver or hardware doesn't like.
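As an aside on reading that debug output: smartctl echoes each command it sends as a bracketed label followed by the raw CDB bytes in hex, for example "[inquiry: 12 00 00 00 24 00 ]". A small Python sketch like the one below can pull those lines apart. It only knows the three opcodes that actually appear in this thread (REQUEST SENSE 0x03, INQUIRY 0x12, LOG SENSE 0x4d), and the regular expression is an assumption about smartctl 5.34's trace format, not a documented interface.

```python
import re

# Opcodes seen in this thread's traces; anything else is shown in hex.
OPCODES = {0x03: "REQUEST SENSE", 0x12: "INQUIRY", 0x4D: "LOG SENSE"}

# Assumed trace format: "[<label>: <hex byte> <hex byte> ... ]"
TRACE_RE = re.compile(r"\[(?P<label>[a-z ]+): (?P<cdb>(?:[0-9a-f]{2} )+)\]")


def parse_trace_line(line: str):
    """Return (label, cdb_bytes) for a CDB trace line, or None."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    return m.group("label"), bytes.fromhex(m.group("cdb"))


def describe_cdb(cdb: bytes) -> dict:
    """Decode the fields of the few CDBs smartctl -H sends."""
    opcode = cdb[0]
    info = {"opcode": opcode, "name": OPCODES.get(opcode, f"0x{opcode:02x}")}
    if opcode in (0x03, 0x12):
        # 6-byte CDBs: allocation length in bytes 3-4
        # (for REQUEST SENSE, byte 3 is reserved and zero).
        info["alloc_len"] = int.from_bytes(cdb[3:5], "big")
    elif opcode == 0x4D:
        # LOG SENSE: byte 2 = page control (bits 7-6) + page code (bits 5-0),
        # allocation length in bytes 7-8.
        info["page_control"] = cdb[2] >> 6
        info["page_code"] = cdb[2] & 0x3F
        info["alloc_len"] = int.from_bytes(cdb[7:9], "big")
    return info


for line in ("[inquiry: 12 00 00 00 24 00 ]",
             "[log sense: 4d 00 40 00 00 00 00 00 04 00 ]",
             "[request sense: 03 00 00 00 12 00 ]"):
    label, cdb = parse_trace_line(line)
    print(label, describe_cdb(cdb))
```

Read this way, the first command is a 36-byte standard INQUIRY, the LOG SENSE asks for page 0x00 (supported log pages) with a 4-byte allocation length, and "12 00 00 00 40 00" is the same INQUIRY retried with 64 bytes.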
> megasas: waiting for 2 commands to complete
> - repeats a bunch of times then -
> sd 0:2:0:0: rejecting I/O to offline device
>
> Given a bit of wisdom in a driver distributed by Dell, which mentioned the
> controller not responding to a cache inquiry: isn't the correct thing to
> do to respond with some sort of "unsupported" response, rather than just
> ignoring the query?
Correct. I'm sure the vendor knows what should be done.
> I've hunted around for patches for this problem but haven't found any.
> Of course, "don't use SMART against a logical drive" works, but I'm not the
> only one using these boxes, and it does take the system down.
Doug Gilbert
* Re: MegaSAS Hang on Smart Query
From: Keith Baker @ 2006-06-30 18:49 UTC
To: dougg; +Cc: linux-scsi
OK, it turns out the exact command being run was 'smartctl -H', so I ran this:
localhost:~# smartctl -H -r ioctl,3 /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
[inquiry: 12 00 00 00 24 00 ]
scsi_status=0x0, host_status=0x0, driver_status=0x0
info=0x0 duration=0 milliseconds
Incoming data, len=36:
00 00 00 05 02 5b 00 00 02 44 45 4c 4c 20 20 20 20
10 50 45 52 43 20 35 2f 69 20 20 20 20 20 20 20 20
20 31 2e 30 30
status=0x0
[log sense: 4d 00 40 00 00 00 00 00 04 00 ]
scsi_status=0x2, host_status=0x0, driver_status=0x8
info=0x1 duration=0 milliseconds
Incoming data, len=4:
00 00 00 05 02
>>> Sense buffer, len=19:
00 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00
10 00 00 00
status=2: sense_key=5 asc=20 ascq=0
Log Sense for supported pages failed [unsupported scsi opcode]
[request sense: 03 00 00 00 12 00 ]
scsi_status=0x0, host_status=0x0, driver_status=0x0
info=0x0 duration=0 milliseconds
Incoming data, len=18:
00 70 00 00 00 00 00 00 0b 00 00 00 00 00 00 00 00
10 00 00
status=0x0
SMART Health Status: OK
localhost:~#
Note that this command completed fine!
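The two hex dumps in that successful run can be decoded by hand: the first is 36 bytes of standard INQUIRY data, and the 19-byte sense buffer after the LOG SENSE is fixed-format sense data. The Python sketch below assumes smartctl's dump layout (a hex offset, then up to 16 data bytes per line) and decodes only the standard fields. It shows the logical drive identifying itself as a DELL "PERC 5/i", revision 1.00, and it shows that the controller did reject the LOG SENSE properly, with sense key 5 (ILLEGAL REQUEST) and ASC 0x20 (INVALID COMMAND OPERATION CODE).

```python
def parse_dump(lines: list[str]) -> bytes:
    """Join smartctl hex-dump lines; each line starts with a hex offset."""
    data = bytearray()
    for line in lines:
        offset, *octets = line.split()
        if int(offset, 16) != len(data):
            raise ValueError(f"unexpected offset {offset!r}")
        data += bytes(int(b, 16) for b in octets)
    return bytes(data)


def decode_inquiry(data: bytes) -> dict:
    """Decode the standard fields of the first 36 bytes of INQUIRY data."""
    return {
        "peripheral_type": data[0] & 0x1F,   # 0x00 = direct-access block device
        "removable": bool(data[1] & 0x80),   # RMB bit
        "version": data[2],                  # 0x05 = SPC-3
        "response_format": data[3] & 0x0F,
        "vendor": data[8:16].decode("ascii").rstrip(),
        "product": data[16:32].decode("ascii").rstrip(),
        "revision": data[32:36].decode("ascii").rstrip(),
    }


def decode_fixed_sense(data: bytes) -> tuple[int, int, int]:
    """Return (sense_key, asc, ascq) from fixed-format sense data."""
    if (data[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return data[2] & 0x0F, data[12], data[13]


# Hex dumps copied from the smartctl -H -r ioctl,3 output above.
inquiry = parse_dump([
    "00 00 00 05 02 5b 00 00 02 44 45 4c 4c 20 20 20 20",
    "10 50 45 52 43 20 35 2f 69 20 20 20 20 20 20 20 20",
    "20 31 2e 30 30",
])
sense = parse_dump([
    "00 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00",
    "10 00 00 00",
])
print(decode_inquiry(inquiry))
print("sense_key=%d asc=0x%02x ascq=0x%02x" % decode_fixed_sense(sense))
```

So on this run the firmware answered an unsupported command the way it should; the trouble below starts with an INQUIRY that never completes at all.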
Then I tried it again, and it hung at the INQUIRY:
localhost:~# smartctl -H -r ioctl,3 /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
[inquiry: 12 00 00 00 24 00 ]
After a minute or so, I got this in dmesg:
sd 0:2:0:0: megasas: RESET -26412 cmd=12
megasas: [ 0]waiting for 7 commands to complete
megasas: [ 5]waiting for 7 commands to complete
megasas: [10]waiting for 7 commands to complete
MESSAGE REPEATED up to [175]
megasas: failed to do reset
sd 0:2:0:0: megasas: RESET -26412 cmd=12
megasas: cannot recover from previous reset failures
sd 0:2:0:0: megasas: RESET -26412 cmd=12
megasas: cannot recover from previous reset failures
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 32224045
Buffer I/O error on device sda3, logical block 3487820
lost page write due to I/O error on sda3
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 1063841686
Buffer I/O error on device sda7, logical block 76433411
lost page write due to I/O error on sda7
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 376122118
Buffer I/O error on device sda6, logical block 38470685
lost page write due to I/O error on sda6
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 376293934
Buffer I/O error on device sda6, logical block 38492162
lost page write due to I/O error on sda6
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 1063841694
Buffer I/O error on device sda7, logical block 76433412
lost page write due to I/O error on sda7
sd 0:2:0:0: SCSI error: return code = 0x6000000
end_request: I/O error, dev sda, sector 32420053
Buffer I/O error on device sda3, logical block 3512321
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 38487730
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 2950192
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 38487679
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 38487688
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda3.
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda7.
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda6.
sd 0:2:0:0: rejecting I/O to offline device
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda6): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
ext3_abort called.
EXT3-fs error (device sda3): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:2:0:0: rejecting I/O to offline device
printk: 11 messages suppressed.
Buffer I/O error on device sda3, logical block 0
lost page write due to I/O error on sda3
Buffer I/O error on device sda3, logical block 1
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 5
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 426021
lost page write due to I/O error on sda3
Buffer I/O error on device sda3, logical block 426022
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 426090
lost page write due to I/O error on sda3
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
REPEATED a few hundred times
printk: 128 messages suppressed.
Buffer I/O error on device sda6, logical block 38469634
lost page write due to I/O error on sda6
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
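The "return code = 0x6000000" lines are the SCSI midlayer's 32-bit result word: SCSI status in bits 0-7, message byte in bits 8-15, host byte in bits 16-23 and driver byte in bits 24-31, where the low nibble of the driver byte is a DRIVER_* code and the high nibble a SUGGEST_* hint. The Python sketch below splits the word using the constant values from include/scsi/scsi.h of 2.6-era kernels, with the tables trimmed to the common entries. Read this way, 0x6000000 is driver byte 0x06 (DRIVER_TIMEOUT) with DID_OK and GOOD status, which fits commands that timed out and were failed once error recovery gave up; smartctl's driver_status=0x6 below is the same code, and the earlier driver_status=0x8 is DRIVER_SENSE. Separately, the "cmd=12" in the megasas RESET lines appears to be the hex opcode of the stuck command, 0x12 (INQUIRY); that is an inference from the trace, not something checked against the driver source.

```python
# Values as defined in include/scsi/scsi.h of 2.6-era kernels (trimmed).
HOST_CODES = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
}
DRIVER_CODES = {
    0x0: "DRIVER_OK", 0x1: "DRIVER_BUSY", 0x2: "DRIVER_SOFT",
    0x3: "DRIVER_MEDIA", 0x4: "DRIVER_ERROR", 0x5: "DRIVER_INVALID",
    0x6: "DRIVER_TIMEOUT", 0x7: "DRIVER_HARD", 0x8: "DRIVER_SENSE",
}


def decode_result(result: int) -> dict:
    """Split a midlayer result word into its four bytes and name them."""
    host = (result >> 16) & 0xFF
    driver = (result >> 24) & 0xFF
    return {
        "status": result & 0xFF,          # SCSI status from the target
        "msg": (result >> 8) & 0xFF,      # SCSI message byte
        "host": HOST_CODES.get(host, f"0x{host:02x}"),
        "driver": DRIVER_CODES.get(driver & 0x0F, f"0x{driver & 0x0F:x}"),
        "suggest": driver & 0xF0,         # SUGGEST_* bits, 0 if none
    }


print(decode_result(0x6000000))
```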
Then I got this from smartctl:
scsi_status=0x0, host_status=0x0, driver_status=0x6
info=0x1 duration=234328 milliseconds
Incoming data, len=36:
00 50 05 a5 f5 80 a1 42 c0 00 00 00 00 00 00 00 00
10 00 00 00 00 00 00 00 00 00 00 00 c0 0f a4 12 c0
20 00 00 00 00
[inquiry: 12 00 00 00 24 00 ]
SCSI_IOCTL_SEND_COMMAND ioctl failed, errno=19 [No such device]
Standard Inquiry (36 bytes) failed [No such device]
Retrying with a 64 byte Standard Inquiry
[inquiry: 12 00 00 00 40 00 ]
SCSI_IOCTL_SEND_COMMAND ioctl failed, errno=19 [No such device]
Standard Inquiry (64 bytes) failed [No such device]
A mandatory SMART command failed: exiting. To continue, add one or more
'-T permissive' options.
Then the kernel got really unhappy, and I got:
Message from syslogd@localhost at Fri Jun 30 14:37:31 2006 ...
localhost kernel: journal commit I/O error
--
Keith Baker
Systems Administrator
MetaCarta, Inc
350 Massachusetts Ave, 4th Floor
Cambridge, MA 02139 USA
Office: (617) 661-6382, ext. 527
email: keith.baker@metacarta.com
PGP Key: 0190570B
www.metacarta.com