From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Janos Haar"
Subject: Re: Suggestion needed for fixing RAID6
Date: Thu, 29 Apr 2010 23:07:40 +0200
Message-ID: <0c1201cae7e0$01f9a930$0400a8c0@dcccs>
References: <626601cae203$dae35030$0400a8c0@dcccs>
 <20100423065143.GA17743@maude.comedia.it>
 <695a01cae2c1$a72907d0$0400a8c0@dcccs>
 <4BD193D0.5080003@shiftmail.org>
 <717901cae3e5$6a5fa730$0400a8c0@dcccs>
 <4BD3751A.5000403@shiftmail.org>
 <756601cae45e$213d6190$0400a8c0@dcccs>
 <4BD569E2.7010409@shiftmail.org>
 <7a3e01cae53f$684122c0$0400a8c0@dcccs>
 <4BD5C51E.9040207@shiftmail.org>
 <80a201cae621$684daa30$0400a8c0@dcccs>
 <4BD76CF6.5020804@shiftmail.org>
 <20100428113732.03486490@notabene.brown>
 <4BD830B0.1080406@shiftmail.org>
 <025e01cae6d7$30bb7870$0400a8c0@dcccs>
 <4BD843D4.7030700@shiftmail.org>
 <062001cae771$545e0910$0400a8c0@dcccs>
 <4BD9A41E.9050009@shiftmail.org>
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset="ISO-8859-1"; reply-type=response
Content-Transfer-Encoding: 7bit
Return-path:
Sender: linux-raid-owner@vger.kernel.org
To: MRK
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

----- Original Message -----
From: "MRK"
To: "Janos Haar"
Cc:
Sent: Thursday, April 29, 2010 5:22 PM
Subject: Re: Suggestion needed for fixing RAID6

> On 04/29/2010 09:55 AM, Janos Haar wrote:
>>
>> md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[13](F) sdg4[6] sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0]
>>       14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] [UUU_UUU_UUUU]
>>       [===========>.........]  recovery = 56.8% (831095108/1462653888) finish=5019.8min speed=2096K/sec
>>
>> Drive dropped again with this patch!
>> + the kernel freezed.
>> (I will try to get more info...)
>>
>> Janos
>
> Hmm too bad :-( it seems it still doesn't work, sorry for that
>
> I suppose the kernel didn't freeze immediately after disabling the drive
> or you wouldn't have had the chance to cat /proc/mdstat...

This was the command running in the putty.exe window:

watch "cat /proc/mdstat ; du -h /snap*"

I think it crashed soon after.
I had no time to see what happened and exit from the watch.

>
> Hence dmesg messages might have gone to /var/log/messages or something.
> Can you look there to see if there is any interesting message to post
> here?

Yes, I know that.
Unfortunately the crash itself was not logged.
But there is some info:

(some UNC reported from sdh)
....
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: res 51/40:00:27:c0:5e/40:00:63:00:00/e0 Emask 0x9 (media error)
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: status: { DRDY ERR }
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: error: { UNC }
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: configured for UDMA/133
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Sense Key : Medium Error [current] [descriptor]
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: Descriptor sense data with sense descriptors (in hex):
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 63 5e c0 27
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Add. Sense: Unrecovered read error - auto reallocate failed
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: end_request: I/O error, dev sdh, sector 1667153959
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189872 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189880 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189888 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189896 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189904 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189912 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189920 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189928 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189936 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189944 on dm-1).
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write Protect is off
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write Protect is off
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 29 13:07:39 Clarus-gl2k10-2 syslogd 1.4.1: restart.

> Did the COW device fill up at least a bit?

The initial size is 1.1MB, and what we want to see is only a few kbytes...
I don't know exactly.
Next time I will try to reduce the initial size to 16KByte.

>
> Also: you know that if you disable graphics on the server
> ("/etc/init.d/gdm stop" or something like that) you usually can see the
> stack trace of the kernel panic on screen when it hangs (unless terminal
> was blank for powersaving, which you can disable too). You can take a
> photo of that one (or write it down but it will be long) to so maybe
> somebody can understand why it hanged. You might be even obtain the stack
> trace through a serial port but that will take more effort.

This PC-based server has no graphics card at all.
:-) (this is one of my freak ideas)
And the terminal is redirected to COM1.
If I really want, I can capture this with a serial cable, but I think the
log from the messages file should be enough.

Thanks,
Janos
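
A side note on the log above: the 20 hex bytes printed under "Descriptor sense data" appear to encode the same failing sector that end_request reports for sdh. The sketch below decodes them, assuming the standard SCSI descriptor-format sense layout (8-byte header, then an Information descriptor of type 0x00 carrying an 8-byte big-endian LBA). The function name parse_descriptor_sense is made up for this illustration; only the hex bytes and the sector number come from the log.

```python
# Sense bytes copied from the two "in hex" kernel log lines above.
SENSE_HEX = "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 63 5e c0 27"


def parse_descriptor_sense(hex_dump):
    """Decode descriptor-format sense data (hypothetical helper).

    Returns (sense_key, asc, ascq, lba); lba is None when no valid
    Information descriptor is present.
    """
    data = bytes.fromhex(hex_dump)
    if data[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")

    sense_key = data[1] & 0x0F      # 0x3 = Medium Error
    asc, ascq = data[2], data[3]    # additional sense code and qualifier
    end = min(len(data), 8 + data[7])  # byte 7 = additional sense length

    lba = None
    pos = 8                         # descriptors start after the header
    while pos + 2 <= end:
        desc_type, desc_len = data[pos], data[pos + 1]
        # Information descriptor: type 0x00, length 0x0a, VALID bit set.
        if desc_type == 0x00 and desc_len == 0x0A and data[pos + 2] & 0x80:
            lba = int.from_bytes(data[pos + 4:pos + 12], "big")
        pos += 2 + desc_len
    return sense_key, asc, ascq, lba


sense_key, asc, ascq, lba = parse_descriptor_sense(SENSE_HEX)
print(sense_key, hex(asc), hex(ascq), lba)  # 3 0x11 0x4 1667153959
```

The decoded LBA, 1667153959, is the sector in the "end_request: I/O error, dev sdh" line, and ASC/ASCQ 0x11/0x04 corresponds to the "Unrecovered read error - auto reallocate failed" text.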