* megaraid_sas waiting for command and then offline
@ 2006-10-25  8:46 David N. Welton
  2006-10-25 22:48 ` Brett G. Durrett
  0 siblings, 1 reply; 15+ messages in thread
From: David N. Welton @ 2006-10-25  8:46 UTC (permalink / raw)
  To: bdurrett; +Cc: linux-kernel

Hi,

I found someone corresponding to your name writing about a problem with
the megaraid sas driver/hardware on the LKML:

http://lkml.org/lkml/2006/9/6/12

We have a Dell (2950, running 2.6.18 #1 SMP) as well, and the way I
managed to kill the thing dead in its tracks (symptoms basically what
you describe) is with smartctl:

root@salgari:~# smartctl --all /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: DELL PERC 5/i Version: 1.00
Device type: disk
Local Time is: Wed Oct 25 10:14:40 2006 CEST
Device does not support SMART

Error Counter logging not supported

Device does not support Self Test logging

----

[61101.681857] sd 0:2:0:0: rejecting I/O to offline device
[61101.681944] EXT3-fs error (device sda1): ext3_readdir: directory #7553069 contains a hole at offset 0
[61103.944794] sd 0:2:0:0: rejecting I/O to offline device
[61103.944879] EXT3-fs error (device sda1): ext3_readdir: directory #7553069 contains a hole at offset 0
[61104.672212] sd 0:2:0:0: rejecting I/O to offline device
[61104.672295] EXT3-fs error (device sda1): ext3_readdir: directory #7553069 contains a hole at offset 0
[61105.255981] sd 0:2:0:0: rejecting I/O to offline device
[61105.256066] EXT3-fs error (device sda1): ext3_readdir: directory #7553069 contains a hole at offset 0

----

Dead in the water. We suspect that in any case there are some disk
problems, which is why we were trying to use smartctl in the first place.

I was just curious if you managed to figure anything out...

Thanks,
Dave Welton

--
Webster srl
Registered office: Via del Seminario, 3 35122 Padova
Operating office: Via S. Breda, 28 35010 Limena (PD)
Tel. +39 049 8842188 Email: d.welton@webster.it
Visit www.webster.it

^ permalink raw reply [flat|nested] 15+ messages in thread
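The kernel messages in this report follow a fixed pattern, so a monitoring script could notice the moment the logical drive goes offline. What follows is only a minimal sketch, not something posted in this thread: the function name `count_offline_rejections` is made up for illustration, and the regular expression is derived solely from the four "rejecting I/O to offline device" lines quoted in the message.

```python
import re
from collections import Counter

# Pattern derived from the log lines quoted in the report, e.g.
#   [61101.681857] sd 0:2:0:0: rejecting I/O to offline device
# The H:C:T:L group is the SCSI address printed by the sd driver.
OFFLINE_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): rejecting I/O to offline device"
)


def count_offline_rejections(lines):
    """Count 'rejecting I/O to offline device' messages per SCSI address.

    `lines` is any iterable of kernel log lines (dmesg or syslog text).
    Returns a Counter keyed by the H:C:T:L string.
    """
    counts = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group("hctl")] += 1
    return counts
```

Feeding it the captured output of dmesg line by line would give one counter entry per affected device. It only reports that the device went offline; it says nothing about why.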
* Re: megaraid_sas waiting for command and then offline
  2006-10-25  8:46 megaraid_sas waiting for command and then offline David N. Welton
@ 2006-10-25 22:48 ` Brett G. Durrett
  2006-10-25 23:03   ` Alan Cox
  2006-11-13 21:40   ` Brett G. Durrett
  0 siblings, 2 replies; 15+ messages in thread
From: Brett G. Durrett @ 2006-10-25 22:48 UTC (permalink / raw)
  To: David N. Welton; +Cc: linux-kernel

David,

We switched to 2.6.18 (SMP) and applied the latest patches from LSI
(got them directly from Sumant Patro). Also, he told me to make sure
"read ahead" was set to "off". This seems to have reduced the
frequency of the failures to about once per week (across 10+
machines), down from several times per week.

After I reported an additional failure, Sumant said they were able to
reproduce the problems with XFS but they have not seen it with EXT3.
I prefer XFS but I prefer to have reliable databases even more...

I now have a couple of systems running in the new configuration and I
am slowly migrating others to it as well. I have not seen a failure
with EXT3, but statistically it would have been unlikely... I won't
declare victory until I have more systems converted with a few weeks
of reliable use.

Hope this helps... if anybody solves the root cause I will happily
offer them a small gift to show my gratitude.

B-

David N.
Welton wrote: >Hi, > >I found someone corresponding to your name writing about a problem with >the megaraid sas driver/hardware on the LKML: > >http://lkml.org/lkml/2006/9/6/12 > >We have a Dell (2950, running 2.6.18 #1 SMP) as well, and the way I >managed to kill the thing dead in its tracks (symptoms basically what >you you describe) is with smartctl: > >root@salgari:~# smartctl --all /dev/sda >smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen >Home page is http://smartmontools.sourceforge.net/ > >Device: DELL PERC 5/i Version: 1.00 >Device type: disk >Local Time is: Wed Oct 25 10:14:40 2006 CEST >Device does not support SMART > >Error Counter logging not supported > > >Device does not support Self Test logging > >---- > >[61101.681857] sd 0:2:0:0: rejecting I/O to offline device >[61101.681944] EXT3-fs error (device sda1): ext3_readdir: directory >#7553069 contains a hole at offset 0 >[61103.944794] sd 0:2:0:0: rejecting I/O to offline device >[61103.944879] EXT3-fs error (device sda1): ext3_readdir: directory >#7553069 contains a hole at offset 0 >[61104.672212] sd 0:2:0:0: rejecting I/O to offline device >[61104.672295] EXT3-fs error (device sda1): ext3_readdir: directory >#7553069 contains a hole at offset 0 >[61105.255981] sd 0:2:0:0: rejecting I/O to offline device >[61105.256066] EXT3-fs error (device sda1): ext3_readdir: directory >#7553069 contains a hole at offset 0 > >---- > >Dead in the water. We suspect that in any case there are some disk >problems, which is why we were trying to use smartctl in the first place. > >I was just curious if you managed to figure anything out... > >Thanks, >Dave Welton > > ^ permalink raw reply [flat|nested] 15+ messages in thread
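Brett's remark that a clean run on EXT3 "would have been unlikely" to show a failure anyway can be put into rough numbers. The sketch below assumes failures arrive independently at a constant rate (a Poisson model), which nothing in this thread establishes. The fleet rate of about one failure per week across 10 machines is from the message above; the "2 machines for 2 weeks" in the usage note is an illustrative guess, since the message only says "a couple of systems".

```python
import math


def prob_no_failures(fleet_failures_per_week, fleet_size, test_machines, weeks):
    """Probability of seeing zero failures purely by chance.

    Assumes a Poisson process with the same per-machine rate as the
    existing fleet (an assumption, not a measured fact).
    """
    per_machine_rate = fleet_failures_per_week / fleet_size
    expected_failures = per_machine_rate * test_machines * weeks
    return math.exp(-expected_failures)


def machine_weeks_needed(fleet_failures_per_week, fleet_size, confidence):
    """Machine-weeks of failure-free running needed before a clean run
    is unlikely (at the given confidence) to be luck alone."""
    per_machine_rate = fleet_failures_per_week / fleet_size
    return -math.log(1.0 - confidence) / per_machine_rate
```

With these assumptions, `prob_no_failures(1.0, 10, 2, 2)` is exp(-0.4), roughly a two in three chance of seeing nothing even if the change fixed nothing, and `machine_weeks_needed(1.0, 10, 0.95)` comes out at just under 30 machine-weeks. That is consistent with waiting for more converted systems and a few weeks of use before declaring victory.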
* Re: megaraid_sas waiting for command and then offline
  2006-10-25 22:48 ` Brett G. Durrett
@ 2006-10-25 23:03   ` Alan Cox
  2006-11-13 21:40   ` Brett G. Durrett
  1 sibling, 0 replies; 15+ messages in thread
From: Alan Cox @ 2006-10-25 23:03 UTC (permalink / raw)
  To: Brett G. Durrett; +Cc: David N. Welton, linux-kernel

On Wed, 2006-10-25 at 15:48 -0700, Brett G. Durrett wrote:
> After I reported an additional failure, Sumant said they were able to
> reproduce the problems with XFS but they have not seen it with EXT3.

I've seen precisely that pattern with a couple of IDE controllers. In
both cases they had problems with very large I/O requests. XFS was
generating extremely long linear reads and writes while ext3 tended to
generate nice I/O patterns but never really huge ones.

(The IDE drivers in question have since been fixed except for IT821x
where some firmware versions in raid mode still barf)

Alan

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline 2006-10-25 22:48 ` Brett G. Durrett 2006-10-25 23:03 ` Alan Cox @ 2006-11-13 21:40 ` Brett G. Durrett 1 sibling, 0 replies; 15+ messages in thread From: Brett G. Durrett @ 2006-11-13 21:40 UTC (permalink / raw) To: linux-kernel; +Cc: Brett G. Durrett, David N. Welton Bad news - I just reproduced the failure using EXT3 on a system that had a complete install 4 days ago, so it looks like the megaraid_sas driver fails with both XFS and EXT3 (although EXT3 seems more reliable). I was running EXT with no read ahead: # ./MegaCli -LDGetProp -Cache -L0 -A0 Adapter 0-VD 0: Cache Policy:WriteBack, ReadAheadNone, Direct # mount /dev/sda1 on / type ext3 (rw,errors=remount-ro) # uname -a Linux AF001158 2.6.18-imvuamd64smpmsastest #1 SMP Mon Oct 9 21:26:46 PDT 2006 x86_64 GNU/Linux Here are the megaraid entries from syslog: FACILITY DATE TIME MESSAGE kern-warning 2006-11-13 12:56:25 kernel: megasas[0]: 64 bit SGLs were sent to FW kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Pending OS cmds in FW : kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x15351800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238b77, lba_hi : 0x0, sense_buf addr : 0x1534d900,sge count : 0x47 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x1535c800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23991f, lba_hi : 0x0, sense_buf addr : 0x15356d00,sge count : 0x50 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x15375000 : <3>megasas[0]: frame count : 0x6, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23aaaf, lba_hi : 0x0, sense_buf addr : 0x15371800,sge count : 0x1a kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x15377c00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xae0005f, lba_hi : 0x0, sense_buf addr : 0x15371d80,sge count : 0x2 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x1537b400 : <3>megasas[0]: 
frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe208367, lba_hi : 0x0, sense_buf addr : 0x1537a280,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x1537d400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239697, lba_hi : 0x0, sense_buf addr : 0x1537a680,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff00000 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238f17, lba_hi : 0x0, sense_buf addr : 0x1537ac00,sge count : 0x45 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff01400 : <3>megasas[0]: frame count : 0x7, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238df7, lba_hi : 0x0, sense_buf addr : 0x1537ae80,sge count : 0x22 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff06400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa68d66f, lba_hi : 0x0, sense_buf addr : 0xcff03680,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff18400 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239e27, lba_hi : 0x0, sense_buf addr : 0xcff15680,sge count : 0x50 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff1f000 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239b9f, lba_hi : 0x0, sense_buf addr : 0xcff1e200,sge count : 0x50 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff20000 : <3>megasas[0]: frame count : 0x4, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23c41f, lba_hi : 0x0, sense_buf addr : 0xcff1e400,sge count : 0xf kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff2b000 : <3>megasas[0]: frame count : 0x3, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23a377, lba_hi : 0x0, sense_buf addr : 0xcff27800,sge count : 0xa kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff35c00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa601697, lba_hi : 0x0, sense_buf addr : 
0xcff30b80,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff44400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238b6f, lba_hi : 0x0, sense_buf addr : 0xcff42480,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff4cc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe20a287, lba_hi : 0x0, sense_buf addr : 0xcff4b380,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff4f800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23a0f7, lba_hi : 0x0, sense_buf addr : 0xcff4b900,sge count : 0x38 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff52400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0x5f4009f, lba_hi : 0x0, sense_buf addr : 0xcff4be80,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff5fc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238f0f, lba_hi : 0x0, sense_buf addr : 0xcff5d580,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff60000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa6000df, lba_hi : 0x0, sense_buf addr : 0xcff5d600,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff6bc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239e1f, lba_hi : 0x0, sense_buf addr : 0xcff66b80,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff75800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239197, lba_hi : 0x0, sense_buf addr : 0xcff6fd00,sge count : 0x50 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff76400 : <3>megasas[0]: frame count : 0x3, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23a0a7, lba_hi : 0x0, sense_buf addr : 0xcff6fe80,sge count : 0xa kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff7b400 
: <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23969f, lba_hi : 0x0, sense_buf addr : 0xcff78680,sge count : 0x50 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0xcff7e400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23aaa7, lba_hi : 0x0, sense_buf addr : 0xcff78c80,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x15391400 : <3>megasas[0]: frame count : 0x2, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xd0c004f, lba_hi : 0x0, sense_buf addr : 0x1538ae80,sge count : 0x3 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x153a3000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0x5f40217, lba_hi : 0x0, sense_buf addr : 0x1539ce00,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x153adc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe2343e7, lba_hi : 0x0, sense_buf addr : 0x153ae180,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x153bdc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa601657, lba_hi : 0x0, sense_buf addr : 0x153b7d80,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x153c3000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xae00057, lba_hi : 0x0, sense_buf addr : 0x153c0600,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x153c4000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe2324af, lba_hi : 0x0, sense_buf addr : 0x153c0800,sge count : 0x1 kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Frame addr :0x153c7400 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239417, lba_hi : 0x0, sense_buf addr : 0x153c0e80,sge count : 0x50 kern-warning 2006-11-13 12:56:25 kernel: megasas[0]: Pending Internal cmds in FW : kern-err 2006-11-13 12:56:25 kernel: megasas[0]: Dumping Done. 
kern-err 2006-11-13 12:56:25 kernel: megasas: failed to do reset kern-notice 2006-11-13 12:56:25 kernel: sd 0:2:0:0: megasas: RESET -20487153 cmd=2a kern-err 2006-11-13 12:56:25 kernel: megasas: cannot recover from previous reset failures kern-notice 2006-11-13 12:56:25 kernel: sd 0:2:0:0: megasas: RESET -20487153 cmd=2a kern-err 2006-11-13 12:56:25 kernel: megasas: cannot recover from previous reset failures kern-notice 2006-11-13 12:56:24 kernel: megasas: [100]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [105]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [110]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [115]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [120]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [125]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [130]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [135]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [140]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [145]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [150]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [155]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [160]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [165]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [170]waiting for 32 commands to complete kern-notice 2006-11-13 12:56:24 kernel: megasas: [175]waiting for 32 commands to complete kern-warning 2006-11-13 12:56:24 kernel: megasas[0]: Dumping Frame Phys Address of all pending cmds in FW kern-err 2006-11-13 12:56:24 kernel: 
megasas[0]: Total OS Pending cmds : 32 kern-notice 2006-11-13 12:54:59 kernel: megasas: [95]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:54 kernel: megasas: [90]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:49 kernel: megasas: [85]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:44 kernel: megasas: [80]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:39 kernel: megasas: [75]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:34 kernel: megasas: [70]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:29 kernel: megasas: [65]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:24 kernel: megasas: [60]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:19 kernel: megasas: [55]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:14 kernel: megasas: [50]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:09 kernel: megasas: [45]waiting for 32 commands to complete kern-notice 2006-11-13 12:54:04 kernel: megasas: [40]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:59 kernel: megasas: [35]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:54 kernel: megasas: [30]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:49 kernel: megasas: [25]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:44 kernel: megasas: [20]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:39 kernel: megasas: [15]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:34 kernel: megasas: [10]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:29 kernel: megasas: [ 5]waiting for 32 commands to complete kern-notice 2006-11-13 12:53:24 kernel: sd 0:2:0:0: megasas: RESET -20487153 cmd=2a kern-notice 2006-11-13 12:53:24 kernel: megasas: [ 0]waiting for 32 commands to complete Brett G. 
Durrett wrote: > > David, > > We switched to 2.6.18 (SMP) and applied the latest patches from LSI > (got them directly from Sumant Patro). Also, he told me to make sure > "read ahead" was set to "off". This seems to have reduced the > frequency of the failures to about once per week (across 10+ > machines), down from several times per week. > > After I reported an additional failure, Sumant said they were able to > reproduce the problems with XFS but they have not seen it with EXT3. > I prefer XFS but I prefer to have reliable databases even more... > > I now have a couple of systems running in the new configuration and I > am slowly migrating others to it as well. I have not seen a failure > with EXT3 but I statistically it would have been unlikely... I won't > declare victory until I have more systems converted with a few weeks > of reliable use. > > Hope this helps... if anybody solves the root cause I will happily > offer them a small gift to show my gratitude. > > B- > > > > David N. 
Welton wrote: > >> Hi, >> >> I found someone corresponding to your name writing about a problem with >> the megaraid sas driver/hardware on the LKML: >> >> http://lkml.org/lkml/2006/9/6/12 >> >> We have a Dell (2950, running 2.6.18 #1 SMP) as well, and the way I >> managed to kill the thing dead in its tracks (symptoms basically what >> you you describe) is with smartctl: >> >> root@salgari:~# smartctl --all /dev/sda >> smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce >> Allen >> Home page is http://smartmontools.sourceforge.net/ >> >> Device: DELL PERC 5/i Version: 1.00 >> Device type: disk >> Local Time is: Wed Oct 25 10:14:40 2006 CEST >> Device does not support SMART >> >> Error Counter logging not supported >> >> >> Device does not support Self Test logging >> >> ---- >> >> [61101.681857] sd 0:2:0:0: rejecting I/O to offline device >> [61101.681944] EXT3-fs error (device sda1): ext3_readdir: directory >> #7553069 contains a hole at offset 0 >> [61103.944794] sd 0:2:0:0: rejecting I/O to offline device >> [61103.944879] EXT3-fs error (device sda1): ext3_readdir: directory >> #7553069 contains a hole at offset 0 >> [61104.672212] sd 0:2:0:0: rejecting I/O to offline device >> [61104.672295] EXT3-fs error (device sda1): ext3_readdir: directory >> #7553069 contains a hole at offset 0 >> [61105.255981] sd 0:2:0:0: rejecting I/O to offline device >> [61105.256066] EXT3-fs error (device sda1): ext3_readdir: directory >> #7553069 contains a hole at offset 0 >> >> ---- >> >> Dead in the water. We suspect that in any case there are some disk >> problems, which is why we were trying to use smartctl in the first >> place. >> >> I was just curious if you managed to figure anything out... 
>> >> Thanks, >> Dave Welton >> >> > ^ permalink raw reply [flat|nested] 15+ messages in thread
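The pending-command dump in the syslog excerpt of this message is regular enough to parse, which may help when comparing dumps from several failures (for example, to see whether large requests are always among the stuck commands, as Alan Cox's earlier reply would suggest). This is a minimal sketch based only on the line format visible in the excerpt; the function names are invented for illustration, and the "sge count" field is at best a rough proxy for request size, since the thread does not say how large each scatter-gather element is.

```python
import re

# One dump entry looks like this in the excerpt:
#   megasas[0]: Frame addr :0x15351800 : <3>megasas[0]: frame count : 0x8,
#   Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238b77, lba_hi : 0x0,
#   sense_buf addr : 0x1534d900,sge count : 0x47
# DOTALL lets an entry match even if it was wrapped across lines.
FRAME_RE = re.compile(
    r"Frame addr\s*:\s*(?P<addr>0x[0-9a-fA-F]+)"
    r".*?frame count\s*:\s*(?P<frames>0x[0-9a-fA-F]+)"
    r".*?Cmd\s*:\s*(?P<cmd>0x[0-9a-fA-F]+)"
    r".*?lba lo\s*:\s*(?P<lba_lo>0x[0-9a-fA-F]+)"
    r".*?sge count\s*:\s*(?P<sge>0x[0-9a-fA-F]+)",
    re.DOTALL,
)


def parse_pending_frames(text):
    """Return one dict of integer fields per pending-command entry."""
    return [
        {name: int(value, 16) for name, value in match.groupdict().items()}
        for match in FRAME_RE.finditer(text)
    ]


def summarize(frames):
    """Count the pending commands and report the largest sge count."""
    sge_counts = [frame["sge"] for frame in frames]
    return {
        "pending": len(frames),
        "max_sge": max(sge_counts, default=0),
    }
```

Running `summarize(parse_pending_frames(log_text))` over a saved syslog would give the number of stuck commands and the largest scatter-gather count among them, which can then be compared between XFS and EXT3 failures.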
* Re: megaraid_sas waiting for command and then offline
@ 2006-12-12  3:04 Joe Malicki
  2006-12-12  5:24 ` Brett G. Durrett
  0 siblings, 1 reply; 15+ messages in thread
From: Joe Malicki @ 2006-12-12  3:04 UTC (permalink / raw)
  To: Brett G. Durrett, "linux-scsi ", David N. Welton,
      linux-poweredge, Sumant Patro
  Cc: Marc A. Meadows, Keith R. Baker

> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> megaraid_sas driver hangs waiting for commands and then the filesystem
> unmounts, leaving the machine in an unusable state until there is a hard
> reboot (the machine is responsive but any access, shell or otherwise, is
> impossible without the filesystem). While I do not have much debugging
> information available, this happens to me about once every 6-7 days in
> my pool of seven machines, so I can probably get debugging info. Since
> the disk is offline and I can't get remote console, I don't have any
> details except something similar to Dave Lloyd's post, below.

Brett, is this still happening to you? We're seeing this very
sporadically, but it does concern us. We've seen driver updates in
2.6.19 (v00.00.03.05) and a new Dell PERC 5/i firmware:

Package Version - 5.0.2-0003
Firmware Version - 1.00.01-0157
SASBIOS Version - MT23
Ctrl-R Version - 1.02-007
MPT Version - 00.06.71.00-IT

and haven't been able to reproduce it, but we can't find a test case to
reliably reproduce the problem to know that anything was fixed (out of
31 identically configured Dell 2950's with the PERC 5/i RAID controller
(configured with 6 300MB SAS drives in a RAID 5, most (all?) of them
Maxtor Atlas 10k, not hot spare)). Our 2950s do have 16GB of RAM each,
so the firmware update (which mentions that it fixes DMA beyond 8GB)
sounds promising, but I would think that if that was the problem we were
experiencing, we would reproduce this much more often? We are certainly
using the RAM for cache and memory, it's not like we've never touched
beyond 8GB.

Does anyone have a test case to reproduce this problem reliably, or a
detailed description of what actually happens (on low levels) when this
problem occurs that can help to make a test? We are more interested in
making this reproducible now than in finding a workaround... if anyone
has any tips on how to make this *more* likely to happen we'd like to
know (so far, I know to try to use XFS and enable ReadAhead).

We have seen this correlated with Patrol Reads going on at the same
time, but aren't sure if this is a red herring, and haven't been able to
force the issue to happen by enabling Patrol Reads.

We've only ever seen these on two machines - one machine reproduces the
problem in a little over a week, and the other has reproduced it a small
number of times. The machines that reproduce it run an experimental
demo workload, but we have not found a test case so far to reproduce the
problem on demand to find or verify solutions. We're currently swapping
out machines to verify that there are no hardware problems, but the
machines diagnose themselves cleanly, and the workload they run is
different enough that something about the workload we can't yet
synthesize into a test case is the problem.

Thank you!
Joe Malicki
Software Engineer
Metacarta, Inc.
email: jmalicki@metacarta.com

> The only thing that the machines with these failures seem to have in
> common is the fact that they are almost exclusively writes - they are
> slave database machines with large memory and pretty much just
> replicate. The read/write machines seem to have less failures.
>
> I am happy to help provide debugging information in any reasonable way.
> In the mean time, if there are any known suggestions or workarounds for
> the problem, I would be grateful for the guidance.
>
> Here are what details on the controller. If you want additional info,
> let me know exactly what you need and I will do what I can to get it to
> you.:
>
> Product Name : PERC 5/i Integrated
> Serial No : 12345
> FW Package Build: 5.0.1-0030
> FW Version : 1.00.01-0088
> BIOS Version : MT23
> Ctrl-R Version :1.02-007
>
> B-

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline
  2006-12-12  3:04 Joe Malicki
@ 2006-12-12  5:24 ` Brett G. Durrett
  2006-12-12  5:53   ` Joseph Malicki
  0 siblings, 1 reply; 15+ messages in thread
From: Brett G. Durrett @ 2006-12-12  5:24 UTC (permalink / raw)
  To: Joe Malicki
  Cc: "linux-scsi ", David N. Welton, linux-poweredge, Sumant Patro,
      Marc A. Meadows, Keith R. Baker

I am still seeing this and we have between 2 and 5 failures per week
(across almost 20 machines). I am seeing it on ext3 (we migrated all
of the machines from XFS) and with ReadAhead disabled.

You mention a firmware update but I don't see any new PERC 5 firmware
packages on Dell's site... can you give me a pointer to the firmware
update?

Also, has anybody had this problem on RHEL? Dell does not support
Linux unless it is RHEL... I would be surprised if somehow RHEL did not
have this problem.

B-

Joe Malicki wrote:
>> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
>> megaraid_sas driver hangs waiting for commands and then the filesystem
>> unmounts, leaving the machine in an unusable state until there is a hard
>> reboot (the machine is responsive but any access, shell or otherwise, is
>> impossible without the filesystem). While I do not have much debugging
>> information available, this happens to me about once every 6-7 days in
>> my pool of seven machines, so I can probably get debugging info. Since
>> the disk is offline and I can't get remote console, I don't have any
>> details except something similar to Dave Lloyd's post, below.
>>
>
> Brett, is this still happening to you? We're seeing this very
> sporadically, but it does concern us.
We've seen driver updates in > 2.6.19 (v00.00.03.05) and a new Dell PERC 5/i firmware: > > Package Version - 5.0.2-0003 > Firmware Version - 1.00.01-0157 > SASBIOS Version - MT23 > Ctrl-R Version - 1.02-007 > MPT Version - 00.06.71.00-IT > > and haven't been able to reproduce it, but we can't find a test case to > reliably reproduce the problem to know that anything was fixed (out of > 31 identically configured Dell 2950's with the PERC 5/i RAID controller > (configured with 6 300MB SAS drives in a RAID 5, most (all?) of them > Maxtor Atlas 10k, not hot spare). Our 2950s do have 16GB of RAM each, > so the firmware update (which mentions that it fixes DMA beyond 8GB) > sounds promising, but I would think that if that was the problem we were > experiencing, we would reproduce this much more often? We are certainly > using the RAM for cache and memory, it's not like we've never touched > beyond 8GB. > > Does anyone have a test case to reproduce this problem reliably, or a > detailed description of what actually happens (on low levels) when this > problem occurs that can help to make a test? We are more interested in > making this reproducible now than in finding a workaround... if anyone > has any tips on how to make this *more* likely to happen we'd like to > know (so far, I know to try to use XFS and enable ReadAhead). > > We have seen this correlated with Patrol Reads going on at the same > time, but aren't sure if this is a red herring, and haven't been able to > force the issue to happen by enabling Patrol Reads. > > We've only ever seen these on two machines - one machine reproduces the > problem in a little over a week, and the other has reproduced it a small > number of times. The machines that reproduce it run an experimental > demo workload, but we have not found a test case so far to reproduce the > problem on demand to find or verify solutions. 
We're currently swapping > out machines to verify that there are no hardware problems, but the > machines diagnose themselves cleanly, and the workload they run is > different enough that something about the workload we can't yet > synthesize into a test case is the problem. > > Thank you! > Joe Malicki > Software Engineer > Metacarta, Inc. > email: jmalicki@metacarta.com > > >> The only thing that the machines with these failures seem to have in >> common is the fact that they are almost exclusively writes - they are >> slave database machines with large memory and pretty much just >> replicate. The read/write machines seem to have less failures. >> >> I am happy to help provide debugging information in any reasonable way. >> In the mean time, if there are any known suggestions or workarounds for >> the problem, I would be grateful for the guidance. >> >> Here are what details on the controller. If you want additional info, >> let me know exactly what you need and I will do what I can to get it to >> you.: >> >> Product Name : PERC 5/i Integrated >> Serial No : 12345 >> FW Package Build: 5.0.1-0030 >> FW Version : 1.00.01-0088 >> BIOS Version : MT23 >> Ctrl-R Version :1.02-007 >> >> B- >> > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline
  2006-12-12  5:24 ` Brett G. Durrett
@ 2006-12-12  5:53   ` Joseph Malicki
  2006-12-12 12:30     ` Greg Dickie
  2006-12-20  1:03     ` Brett G. Durrett
  0 siblings, 2 replies; 15+ messages in thread
From: Joseph Malicki @ 2006-12-12  5:53 UTC (permalink / raw)
  To: Brett G. Durrett
  Cc: "linux-scsi ", David N. Welton, linux-poweredge, Sumant Patro,
      Marc A. Meadows, Keith R. Baker

Hi Brett!

Thanks for the response, hopefully we can gather enough data points to
help solve the problem.

The new PERC 5/i integrated firmware dated 11/21/2006 is at:
http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9182&typecnt=2&libid=46&releaseid=R139225&vercnt=3
PERC 5/E adapter:
http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9181&typecnt=2&libid=46&releaseid=R139227&vercnt=2

The release notes describe very similar symptoms, but I am not ready to
believe it yet as I can't reliably reproduce the problem well enough to
be confident of a fix, though it sounds like you might be able to.
Unfortunately we're using Debian at the moment, but if I can reproduce I
can run on RHEL in a heartbeat to duplicate it for support (for now I'm
trying to minimize variables).

Also, which driver version are you running? I noticed you were using
some patches from Sumant Patro@LSI - is your driver identical to the one
in 2.6.19? If not, what does it look like?

Have you noticed any correlations with patrol reads at the times of the
failures? You can tell by running

MegaCli -FwTermLog -Dsply -aALL

What hardware are you running (CPUs, RAM, disk configuration)?

Have you noticed any correlation with heavy network I/O (as well as disk
I/O)? Some of our systems may have experienced this when running more
network load than typical.

Thanks!
Joe

Brett G.
Durrett wrote: > > I am still seeing this and we have between 2 and 5 failures per week > (across almost 20 machines). I am seeing it on ext3 (we migrated all > of the machines from XFS) and with ReadAhead disabled. > > You mention a firmware update but I don't see any new PERC 5 firmware > packages on Dell's site... can you give me a pointer to the firmware > update? > > Also, has anybody had this problem on RHE? Dell does not support > Linux unless it is RHE... I would be surprised is somehow RHE did not > have this problem. > > B- > > > > Joe Malicki wrote: >>> I have the same or a similar issue running 2.6.17 SMP x86_64 - the >>> megaraid_sas driver hangs waiting for commands and then the filesystem >>> unmounts, leaving the machine in an unusable state until there is a >>> hard >>> reboot (the machine is responsive but any access, shell or >>> otherwise, is >>> impossible without the filesystem). While I do not have much debugging >>> information available, this happens to me about once every 6-7 days in >>> my pool of seven machines, so I can probably get debugging info. Since >>> the disk is offline and I can't get remote console, I don't have any >>> details except something similar to Dave Lloyd's post, below. >>> >> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline
2006-12-12 5:53 ` Joseph Malicki
@ 2006-12-12 12:30 ` Greg Dickie
2006-12-12 18:40 ` Joe Malicki
2006-12-20 1:03 ` Brett G. Durrett
1 sibling, 1 reply; 15+ messages in thread
From: Greg Dickie @ 2006-12-12 12:30 UTC (permalink / raw)
To: Joseph Malicki
Cc: Brett G. Durrett, linux-scsi, linux-poweredge, David N. Welton, Sumant Patro, Marc A. Meadows, Keith R. Baker

We've never had lockups like this but we did notice that the
megaraid_sas module defaults to a much higher commands per LUN setting
than the hardware seems to be able to handle. IIRC the default is 128
and we lowered it to 16 for the 5i and 32 for the 5E.

HTH,
Greg

On Tue, 2006-12-12 at 00:53 -0500, Joseph Malicki wrote:
> Hi Brett!
>
> Thanks for the response, hopefully we can gather enough data points to
> help solve the problem.
>
> The new PERC 5/i integrated firmware dated 11/21/2006 is at:
> http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9182&typecnt=2&libid=46&releaseid=R139225&vercnt=3
> PERC 5/E adapter:
> http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9181&typecnt=2&libid=46&releaseid=R139227&vercnt=2
>
> The release notes describe very similar symptoms, but I am not ready to
> believe it yet as I can't reliably reproduce the problem well enough to
> be confident of a fix, though it sounds like you might be able to.
> Unfortunately we're using Debian at the moment, but if I can reproduce I
> can run on RHEL in a heartbeat to duplicate it for support (for now I'm
> trying to minimize variables).
>
> Also, which driver version are you running? I noticed you were using
> some patches from Sumant Patro@LSI - is your driver identical to the one
> in 2.6.19? If not, what does it look like?
>
> Have you noticed any correlations with patrol reads at the times of the
> failures? You can tell by running MegaCli -FwTermLog -Dsply -aALL
>
> What hardware are you running (CPUs, RAM, disk configuration)?
>
> Have you noticed any correlation with heavy network I/O (as well as disk
> I/O)? Some of our systems may have experienced this when running more
> network load than typical.
>
>
> Thanks!
> Joe
>
> Brett G. Durrett wrote:
> >
> > I am still seeing this and we have between 2 and 5 failures per week
> > (across almost 20 machines). I am seeing it on ext3 (we migrated all
> > of the machines from XFS) and with ReadAhead disabled.
> >
> > You mention a firmware update but I don't see any new PERC 5 firmware
> > packages on Dell's site... can you give me a pointer to the firmware
> > update?
> >
> > Also, has anybody had this problem on RHEL? Dell does not support
> > Linux unless it is RHEL... I would be surprised if somehow RHEL did not
> > have this problem.
> >
> > B-
> >
> >
> >
> > Joe Malicki wrote:
> >>> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> >>> megaraid_sas driver hangs waiting for commands and then the filesystem
> >>> unmounts, leaving the machine in an unusable state until there is a hard
> >>> reboot (the machine is responsive but any access, shell or otherwise, is
> >>> impossible without the filesystem). While I do not have much debugging
> >>> information available, this happens to me about once every 6-7 days in
> >>> my pool of seven machines, so I can probably get debugging info. Since
> >>> the disk is offline and I can't get remote console, I don't have any
> >>> details except something similar to Dave Lloyd's post, below.
> >>>
> >>
> >
>
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge@dell.com
> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq

--
Greg Dickie
just a guy
Maximum Throughput
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline
2006-12-12 12:30 ` Greg Dickie
@ 2006-12-12 18:40 ` Joe Malicki
0 siblings, 0 replies; 15+ messages in thread
From: Joe Malicki @ 2006-12-12 18:40 UTC (permalink / raw)
To: Greg Dickie
Cc: Brett G. Durrett, linux-scsi, linux-poweredge, David N. Welton, Sumant Patro, Marc A. Meadows, Keith R. Baker

Thanks Greg,

Is there any documentation or testing of the number of commands per LUN
that the hardware can handle? The driver is clearly reading the value
out of a register on the card.

thanks,
joe

Greg Dickie wrote:
> We've never had lockups like this but we did notice that the
> megaraid_sas module defaults to a much higher commands per LUN setting
> than the hardware seems to be able to handle. IIRC the default is 128
> and we lowered it to 16 for the 5i and 32 for the 5E.
>
> HTH,
> Greg
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline
2006-12-12 5:53 ` Joseph Malicki
2006-12-12 12:30 ` Greg Dickie
@ 2006-12-20 1:03 ` Brett G. Durrett
1 sibling, 0 replies; 15+ messages in thread
From: Brett G. Durrett @ 2006-12-20 1:03 UTC (permalink / raw)
To: Joseph Malicki
Cc: "linux-scsi ", David N. Welton, linux-poweredge, Sumant Patro, Marc A. Meadows, Keith R. Baker

Just replied to another poster but wanted to respond to this thread as
well...

Joe,

Huge thanks for the pointer to the new firmware... I had a page
bookmarked for the 2950 firmware but the bookmark went to an old page.

We are running 16 GB machines, dual core, dual CPU, RAID 5 on PERC 5/i.
The kernel is 2.6.18 and I think all of Sumant's changes are in 2.6.19.
The patrol reads did not seem to correlate to the failures.

Some possibly good news: It is probably too early to say for sure, but I
upgraded the firmware and have not had a failure on any of the machines
with the new firmware. I will not feel this is "fixed" until I go
another two weeks with no failures. The firmware release notes describe
a fix for a problem that is consistent with our failures:

4.0 Fixes

Addresses potential issue with PERC 5 controllers that may become
unresponsive on systems with 8GB of memory or more.

This fix corrects an issue on systems with 8+ GB of memory PERC 5
controllers may become unresponsive. If the affected controller is the
boot device this would cause an OS crash, hang, or bluescreen. If not
the boot controller the system would experience timeouts (event 129 and
9 in windows, IO aborts in Linux). Once the controller is in this state
it will not return to operation until the system has rebooted, any
storage connected to the controller will not be accessible until the
reboot. This has now been corrected.

B-

Joseph Malicki wrote:
> Hi Brett!
>
> Thanks for the response, hopefully we can gather enough data points to
> help solve the problem.
>
> The new PERC 5/i integrated firmware dated 11/21/2006 is at:
> http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9182&typecnt=2&libid=46&releaseid=R139225&vercnt=3
>
> PERC 5/E adapter:
> http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9181&typecnt=2&libid=46&releaseid=R139227&vercnt=2
>
>
> The release notes describe very similar symptoms, but I am not ready
> to believe it yet as I can't reliably reproduce the problem well
> enough to be confident of a fix, though it sounds like you might be
> able to. Unfortunately we're using Debian at the moment, but if I
> can reproduce I can run on RHEL in a heartbeat to duplicate it for
> support (for now I'm trying to minimize variables).
>
> Also, which driver version are you running? I noticed you were using
> some patches from Sumant Patro@LSI - is your driver identical to the
> one in 2.6.19? If not, what does it look like?
>
> Have you noticed any correlations with patrol reads at the times of
> the failures? You can tell by running MegaCli -FwTermLog -Dsply -aALL
>
> What hardware are you running (CPUs, RAM, disk configuration)?
>
> Have you noticed any correlation with heavy network I/O (as well as
> disk I/O)? Some of our systems may have experienced this when running
> more network load than typical.
>
>
> Thanks!
> Joe
>
> Brett G. Durrett wrote:
>>
>> I am still seeing this and we have between 2 and 5 failures per week
>> (across almost 20 machines). I am seeing it on ext3 (we migrated all
>> of the machines from XFS) and with ReadAhead disabled.
>>
>> You mention a firmware update but I don't see any new PERC 5 firmware
>> packages on Dell's site... can you give me a pointer to the firmware
>> update?
>>
>> Also, has anybody had this problem on RHEL? Dell does not support
>> Linux unless it is RHEL... I would be surprised if somehow RHEL did not
>> have this problem.
>>
>> B-
>>
>>
>>
>> Joe Malicki wrote:
>>>> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
>>>> megaraid_sas driver hangs waiting for commands and then the filesystem
>>>> unmounts, leaving the machine in an unusable state until there is a hard
>>>> reboot (the machine is responsive but any access, shell or otherwise, is
>>>> impossible without the filesystem). While I do not have much debugging
>>>> information available, this happens to me about once every 6-7 days in
>>>> my pool of seven machines, so I can probably get debugging info. Since
>>>> the disk is offline and I can't get remote console, I don't have any
>>>> details except something similar to Dave Lloyd's post, below.
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: megaraid_sas waiting for command and then offline
@ 2006-09-06 17:14 Patro, Sumant
2006-09-06 20:44 ` Brett G. Durrett
0 siblings, 1 reply; 15+ messages in thread
From: Patro, Sumant @ 2006-09-06 17:14 UTC (permalink / raw)
To: Brett G. Durrett, Dave Lloyd
Cc: Bagalkote, Sreenivas, lkml, Berkley Shands, Kolli, Neela, Yang, Bo

Hello Brett,

A DMA related bug was fixed in FW ver *.0095 that was causing
the FW to stop responding.

Please upgrade the FW version to >= *.0095 and let me know if
you still see the issue.

Regards,

Sumant

-----Original Message-----
From: Brett G. Durrett [mailto:brett@imvu.com]
Sent: Wednesday, September 06, 2006 9:04 AM
To: Dave Lloyd
Cc: Patro, Sumant; Bagalkote, Sreenivas; lkml; Berkley Shands
Subject: Re: megaraid_sas waiting for command and then offline

The machines are Dell 2900s, so the mobo is custom. From a Dell SE,
"Dell uses a custom mobo that is Dell branded with the Intel chipset
Greencreek."

B-

Dave Lloyd wrote:
> Brett G. Durrett wrote:
> >
> > I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> > megaraid_sas driver hangs waiting for commands and then the filesystem
> > unmounts, leaving the machine in an unusable state until there is a hard
> > reboot (the machine is responsive but any access, shell or otherwise, is
> > impossible without the filesystem). While I do not have much debugging
> > information available, this happens to me about once every 6-7 days in
> > my pool of seven machines, so I can probably get debugging info. Since
> > the disk is offline and I can't get remote console, I don't have any
> > details except something similar to Dave Lloyd's post, below.
> >
> > The only thing that the machines with these failures seem to have in
> > common is the fact that they are almost exclusively writes - they are
> > slave database machines with large memory and pretty much just
> > replicate. The read/write machines seem to have fewer failures.
> >
> > I am happy to help provide debugging information in any reasonable way.
> > In the meantime, if there are any known suggestions or workarounds for
> > the problem, I would be grateful for the guidance.
> >
> > Here are the details on the controller. If you want additional info,
> > let me know exactly what you need and I will do what I can to get it to
> > you:
> >
> > Product Name : PERC 5/i Integrated
> > Serial No : 12345
> > FW Package Build: 5.0.1-0030
> > FW Version : 1.00.01-0088
> > BIOS Version : MT23
> > Ctrl-R Version : 1.02-007
> >
> > B-
>
> Which motherboard are you using? We believe that this may be a
> motherboard-specific issue. It appears to happen on a SuperMicro
> motherboard but not a Tyan motherboard.
>
^ permalink raw reply [flat|nested] 15+ messages in thread
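The firmware comparison suggested in the message above (upgrade to >= *.0095, while the reported "FW Version" is 1.00.01-0088) can be sketched as a small check. This is only an illustration, not a vendor tool: it assumes the "*.0095" wildcard refers to the trailing numeric build field of the version string, and the function names are hypothetical.

```python
import re


def firmware_build(fw_version: str) -> int:
    """Return the trailing numeric build field of a firmware version string.

    Assumption: the build number is the last run of digits, e.g. the
    "0088" in "1.00.01-0088".
    """
    match = re.search(r"(\d+)\s*$", fw_version)
    if match is None:
        raise ValueError(f"no trailing build number in {fw_version!r}")
    return int(match.group(1))


def meets_minimum_build(fw_version: str, minimum_build: int = 95) -> bool:
    """True if the trailing build field is at least minimum_build."""
    return firmware_build(fw_version) >= minimum_build
```

With the version reported in this thread, `meets_minimum_build("1.00.01-0088")` is false, which matches the advice to upgrade.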
* Re: megaraid_sas waiting for command and then offline
2006-09-06 17:14 Patro, Sumant
@ 2006-09-06 20:44 ` Brett G. Durrett
0 siblings, 0 replies; 15+ messages in thread
From: Brett G. Durrett @ 2006-09-06 20:44 UTC (permalink / raw)
To: Patro, Sumant
Cc: Dave Lloyd, Bagalkote, Sreenivas, lkml, Berkley Shands, Kolli, Neela, Yang, Bo

Sumant,

Not sure if I am missing something - I appear to be running the latest
FW available:

Versions
================
Product Name : PERC 5/i Integrated
Serial No : 12345
FW Package Build: 5.0.1-0030
FW Version : 1.00.01-0088
BIOS Version : MT23
Ctrl-R Version : 1.02-007

Patro, Sumant wrote:

>Hello Brett,
>
>A DMA related bug was fixed in FW ver *.0095 that was causing
>the FW to stop responding.
>
>Please upgrade the FW version to >= *.0095 and let me know if
>you still see the issue.
>
>Regards,
>
>Sumant
>
>
>-----Original Message-----
>From: Brett G. Durrett [mailto:brett@imvu.com]
>Sent: Wednesday, September 06, 2006 9:04 AM
>To: Dave Lloyd
>Cc: Patro, Sumant; Bagalkote, Sreenivas; lkml; Berkley Shands
>Subject: Re: megaraid_sas waiting for command and then offline
>
>
>The machines are Dell 2900s, so the mobo is custom. From a Dell SE,
>"Dell uses a custom mobo that is Dell branded with the Intel chipset
>Greencreek."
>
>B-
>
>Dave Lloyd wrote:
>
>>Brett G. Durrett wrote:
>>
>>>I have the same or a similar issue running 2.6.17 SMP x86_64 - the
>>>megaraid_sas driver hangs waiting for commands and then the filesystem
>>>unmounts, leaving the machine in an unusable state until there is a hard
>>>reboot (the machine is responsive but any access, shell or otherwise, is
>>>impossible without the filesystem). While I do not have much debugging
>>>information available, this happens to me about once every 6-7 days in
>>>my pool of seven machines, so I can probably get debugging info. Since
>>>the disk is offline and I can't get remote console, I don't have any
>>>details except something similar to Dave Lloyd's post, below.
>>>
>>>The only thing that the machines with these failures seem to have in
>>>common is the fact that they are almost exclusively writes - they are
>>>slave database machines with large memory and pretty much just
>>>replicate. The read/write machines seem to have fewer failures.
>>>
>>>I am happy to help provide debugging information in any reasonable way.
>>>In the meantime, if there are any known suggestions or workarounds for
>>>the problem, I would be grateful for the guidance.
>>>
>>>Here are the details on the controller. If you want additional info,
>>>let me know exactly what you need and I will do what I can to get it to
>>>you:
>>>
>>>Product Name : PERC 5/i Integrated
>>>Serial No : 12345
>>>FW Package Build: 5.0.1-0030
>>>FW Version : 1.00.01-0088
>>>BIOS Version : MT23
>>>Ctrl-R Version : 1.02-007
>>>
>>>B-
>>>
>>Which motherboard are you using? We believe that this may be a
>>motherboard-specific issue. It appears to happen on a SuperMicro
>>motherboard but not a Tyan motherboard.
>>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* megaraid_sas waiting for command and then offline
@ 2006-09-06 4:49 Brett G. Durrett
2006-09-06 14:11 ` Dave Lloyd
0 siblings, 1 reply; 15+ messages in thread
From: Brett G. Durrett @ 2006-09-06 4:49 UTC (permalink / raw)
To: Sumant.Patro, Sreenivas.Bagalkote; +Cc: lkml, dlloyd
I have the same or a similar issue running 2.6.17 SMP x86_64 - the
megaraid_sas driver hangs waiting for commands and then the filesystem
unmounts, leaving the machine in an unusable state until there is a hard
reboot (the machine is responsive but any access, shell or otherwise, is
impossible without the filesystem). While I do not have much debugging
information available, this happens to me about once every 6-7 days in
my pool of seven machines, so I can probably get debugging info. Since
the disk is offline and I can't get remote console, I don't have any
details except something similar to Dave Lloyd's post, below.

The only thing that the machines with these failures seem to have in
common is the fact that they are almost exclusively writes - they are
slave database machines with large memory and pretty much just
replicate. The read/write machines seem to have fewer failures.

I am happy to help provide debugging information in any reasonable way.
In the meantime, if there are any known suggestions or workarounds for
the problem, I would be grateful for the guidance.

Here are the details on the controller. If you want additional info,
let me know exactly what you need and I will do what I can to get it to
you:

Product Name : PERC 5/i Integrated
Serial No : 12345
FW Package Build: 5.0.1-0030
FW Version : 1.00.01-0088
BIOS Version : MT23
Ctrl-R Version : 1.02-007

B-
Subject: RE: MegaRaid 8408E goes out to lunch with nr_requests > 8
Date: Thu, 13 Jul 2006 09:25:09 -0600
From: "Patro, Sumant" <>
Hello Dave,

I tried to duplicate the issue with 2.6.18-rc1 but could not
reproduce it. From the message it looks like the firmware has stopped
processing commands. Could you please let us know the firmware version
of the controller?

Thanks,
Sumant
-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Dave Lloyd
Sent: Wednesday, July 12, 2006 7:47 AM
To: linux-kernel@vger.kernel.org; Berkley Shands
Subject: MegaRaid 8408E goes out to lunch with nr_requests > 8
This happens on both 2.6.17 and 2.6.18-rc1 using the megaraid, mptsas and
mptscsih drivers supplied with the kernel.

While writing data to RAID 0 devices on an LSI MegaRaid 8408E controller,
the devices hang after somewhere between 4 and 7 GB of data has been
written. If I dial nr_requests back from the default down to 8, the hang
does not occur. The hang does occur at 16. I haven't tested values
between the two, but I'm not too optimistic. From what I can see, 8 looks
like a magic number that makes the queue look congested more often
than not.
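For anyone wanting to repeat the nr_requests experiment described above, here is a minimal sketch of reading and changing the value through sysfs. It assumes the usual /sys/block/<device>/queue/nr_requests attribute; the helper names and the device name "sdb" in the usage note are hypothetical, and writing the real attribute needs root.

```python
from pathlib import Path


def nr_requests_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Build the sysfs path of the block queue depth attribute for a device."""
    return Path(sysfs_root) / "block" / device / "queue" / "nr_requests"


def get_nr_requests(device: str, sysfs_root: str = "/sys") -> int:
    """Read the current nr_requests value for a block device."""
    return int(nr_requests_path(device, sysfs_root).read_text().strip())


def set_nr_requests(device: str, value: int, sysfs_root: str = "/sys") -> int:
    """Write a new nr_requests value and return what is read back afterwards."""
    if value < 1:
        raise ValueError("nr_requests must be a positive integer")
    nr_requests_path(device, sysfs_root).write_text(f"{value}\n")
    return get_nr_requests(device, sysfs_root)
```

For example, `set_nr_requests("sdb", 8)` would apply the workaround value from this report to a device named sdb. The `sysfs_root` parameter exists only so the helpers can be exercised against a scratch directory.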
Here are the messages I get when the devices go out to lunch:
Jul 11 14:13:34 systemname kernel: sd 4:2:0:0: megasas: RESET -40213 cmd=2a
Jul 11 14:13:34 systemname kernel: megasas: [ 0]waiting for 256 commands to complete
Jul 11 14:13:39 systemname kernel: megasas: [ 5]waiting for 256 commands to complete
Jul 11 14:13:44 systemname kernel: megasas: [10]waiting for 256 commands to complete
Jul 11 14:13:49 systemname kernel: megasas: [15]waiting for 256 commands to complete
[...]
Jul 11 14:16:35 systemname kernel: megasas: [175]waiting for 256 commands to complete
Jul 11 14:16:35 systemname kernel: megasas: failed to do reset
Jul 11 14:16:35 systemname kernel: sd 4:2:1:0: megasas: RESET -40216 cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: megasas: RESET -40213 cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: megasas: RESET -40213 cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: scsi: Device offlined - not ready after error recovery
Jul 11 14:16:36 systemname last message repeated 13 times
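When comparing failures across machines, it can help to pull the key facts out of a log like the one above. The sketch below is only an illustration with hypothetical names. It assumes the bracketed number in the "waiting for N commands" lines is an elapsed-time counter (the timestamps suggest seconds), and the patterns tolerate entries that were wrapped across lines by a mail client.

```python
import re

# "megasas: [ 5]waiting for 256 commands to complete", possibly line-wrapped.
WAIT_RE = re.compile(
    r"megasas:\s*\[\s*(\d+)\]\s*waiting\s+for\s+(\d+)\s+commands\s+to\s+complete"
)
RESET_FAILED_RE = re.compile(r"megasas:\s+failed\s+to\s+do\s+reset")
OFFLINED_RE = re.compile(r"Device\s+offlined")


def summarize_megasas_log(text: str) -> dict:
    """Summarize megasas reset/wait messages found in a syslog excerpt."""
    waits = [(int(counter), int(pending)) for counter, pending in WAIT_RE.findall(text)]
    return {
        # Largest bracketed counter seen (assumed to be elapsed seconds).
        "last_wait_counter": max((c for c, _ in waits), default=None),
        # Distinct outstanding-command counts reported while waiting.
        "pending_commands": sorted({p for _, p in waits}),
        "reset_failed": bool(RESET_FAILED_RE.search(text)),
        "device_offlined": bool(OFFLINED_RE.search(text)),
    }
```

Feeding it the excerpt from this report would show a wait counter reaching 175 with 256 commands still outstanding, followed by a failed reset and the device being offlined.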
Interestingly, the machine also hangs on shutdown and requires a hard
reset to reboot. Bummer!

My next step is to try to reproduce this and dig into it in KDB.
Has anyone else seen this, or does anyone have suggestions for
gathering further debugging info?
--
Dave Lloyd
Test Engineer, Exegy, Inc.
314.450.5342
dlloyd@exegy.com
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline
2006-09-06 4:49 Brett G. Durrett
@ 2006-09-06 14:11 ` Dave Lloyd
2006-09-06 16:04 ` Brett G. Durrett
0 siblings, 1 reply; 15+ messages in thread
From: Dave Lloyd @ 2006-09-06 14:11 UTC (permalink / raw)
To: Brett G. Durrett; +Cc: Sumant.Patro, Sreenivas.Bagalkote, lkml, Berkley Shands

Brett G. Durrett wrote:
>
> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> megaraid_sas driver hangs waiting for commands and then the filesystem
> unmounts, leaving the machine in an unusable state until there is a hard
> reboot (the machine is responsive but any access, shell or otherwise, is
> impossible without the filesystem). While I do not have much debugging
> information available, this happens to me about once every 6-7 days in
> my pool of seven machines, so I can probably get debugging info. Since
> the disk is offline and I can't get remote console, I don't have any
> details except something similar to Dave Lloyd's post, below.
>
> The only thing that the machines with these failures seem to have in
> common is the fact that they are almost exclusively writes - they are
> slave database machines with large memory and pretty much just
> replicate. The read/write machines seem to have fewer failures.
>
> I am happy to help provide debugging information in any reasonable way.
> In the meantime, if there are any known suggestions or workarounds for
> the problem, I would be grateful for the guidance.
>
> Here are the details on the controller. If you want additional info,
> let me know exactly what you need and I will do what I can to get it to
> you:
>
> Product Name : PERC 5/i Integrated
> Serial No : 12345
> FW Package Build: 5.0.1-0030
> FW Version : 1.00.01-0088
> BIOS Version : MT23
> Ctrl-R Version : 1.02-007
>
> B-

Which motherboard are you using? We believe that this may be a
motherboard-specific issue. It appears to happen on a SuperMicro
motherboard but not a Tyan motherboard.

--
Dave Lloyd
Test Engineer, Exegy, Inc.
314.450.5342
dlloyd@exegy.com
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: megaraid_sas waiting for command and then offline
2006-09-06 14:11 ` Dave Lloyd
@ 2006-09-06 16:04 ` Brett G. Durrett
0 siblings, 0 replies; 15+ messages in thread
From: Brett G. Durrett @ 2006-09-06 16:04 UTC (permalink / raw)
To: Dave Lloyd; +Cc: Sumant.Patro, Sreenivas.Bagalkote, lkml, Berkley Shands

The machines are Dell 2900s, so the mobo is custom. From a Dell SE,
"Dell uses a custom mobo that is Dell branded with the Intel chipset
Greencreek."

B-

Dave Lloyd wrote:
> Brett G. Durrett wrote:
> >
> > I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> > megaraid_sas driver hangs waiting for commands and then the filesystem
> > unmounts, leaving the machine in an unusable state until there is a hard
> > reboot (the machine is responsive but any access, shell or otherwise, is
> > impossible without the filesystem). While I do not have much debugging
> > information available, this happens to me about once every 6-7 days in
> > my pool of seven machines, so I can probably get debugging info. Since
> > the disk is offline and I can't get remote console, I don't have any
> > details except something similar to Dave Lloyd's post, below.
> >
> > The only thing that the machines with these failures seem to have in
> > common is the fact that they are almost exclusively writes - they are
> > slave database machines with large memory and pretty much just
> > replicate. The read/write machines seem to have fewer failures.
> >
> > I am happy to help provide debugging information in any reasonable way.
> > In the meantime, if there are any known suggestions or workarounds for
> > the problem, I would be grateful for the guidance.
> >
> > Here are the details on the controller. If you want additional info,
> > let me know exactly what you need and I will do what I can to get it to
> > you:
> >
> > Product Name : PERC 5/i Integrated
> > Serial No : 12345
> > FW Package Build: 5.0.1-0030
> > FW Version : 1.00.01-0088
> > BIOS Version : MT23
> > Ctrl-R Version : 1.02-007
> >
> > B-
>
> Which motherboard are you using? We believe that this may be a
> motherboard-specific issue. It appears to happen on a SuperMicro
> motherboard but not a Tyan motherboard.
>
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2006-12-20 1:27 UTC | newest]

Thread overview: 15+ messages
2006-10-25 8:46 megaraid_sas waiting for command and then offline David N. Welton
2006-10-25 22:48 ` Brett G. Durrett
2006-10-25 23:03 ` Alan Cox
2006-11-13 21:40 ` Brett G. Durrett
-- strict thread matches above, loose matches on Subject: below --
2006-12-12 3:04 Joe Malicki
2006-12-12 5:24 ` Brett G. Durrett
2006-12-12 5:53 ` Joseph Malicki
2006-12-12 12:30 ` Greg Dickie
2006-12-12 18:40 ` Joe Malicki
2006-12-20 1:03 ` Brett G. Durrett
2006-09-06 17:14 Patro, Sumant
2006-09-06 20:44 ` Brett G. Durrett
2006-09-06 4:49 Brett G. Durrett
2006-09-06 14:11 ` Dave Lloyd
2006-09-06 16:04 ` Brett G. Durrett