From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bill
Subject: Re: raid5 replace ignored error?
Date: Tue, 18 Feb 2014 13:16:14 -0600
Message-ID: <5303B17E.5010708@sbcglobal.net>
References: <52F0AC42.5080407@sbcglobal.net> <20140218144657.28b7601e@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20140218144657.28b7601e@notabene.brown>
Sender: linux-raid-owner@vger.kernel.org
To: NeilBrown
Cc: linux-raid
List-Id: linux-raid.ids

On 02/17/2014 09:46 PM, NeilBrown wrote:
> On Tue, 04 Feb 2014 03:00:50 -0600 Bill wrote:
>
>> Hi,
>>
>> I had something weird happen during a replace in a raid5 array on kernel
>> 3.10.28 -
>> it appears an error in writing to / communicating with the replacement
>> disk was ignored.
>>
>> I have this array:
>>
>> md3 : active raid5 sda1[0] sdd1[3] sdb1[1] sdf1[4] sdc1[2]
>> 3900742144 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
>> bitmap: 0/233 pages [0KB], 2048KB chunk
>>
>> I tried replacing sdf1 with sde1.
>>
>> [106666.129833] md: recovery of RAID array md3
>> [106666.129836] md: minimum _guaranteed_ speed: 20000 KB/sec/disk.
>> [106666.129837] md: using maximum available idle IO bandwidth (but
>> not more than 200000 KB/sec) for recovery.
>> [106666.129842] md: using 128k window, over a total of 975185536k.
>>
>> 1/2 hour later I got a flood of errors in dmesg:
>>
>> [108334.974861] ata5.00: exception Emask 0x10 SAct 0x7fffffff SErr
>> 0x480100 action 0x6 frozen
>> [108334.974864] ata5.00: irq_stat 0x08000000, interface fatal error
>> [108334.974866] ata5: SError: { UnrecovData 10B8B Handshk }
>> [108334.974868] ata5.00: failed command: WRITE FPDMA QUEUED
>> [108334.974872] ata5.00: cmd 61/00:00:10:97:9e/04:00:15:00:00/40
>> tag 0 ncq 524288 out
>> [108334.974872] res 40/00:b0:10:f7:9e/00:00:15:00:00/40
>> Emask 0x10 (ATA bus error)
>> [108334.974873] ata5.00: status: { DRDY }
>> .
>> .(29 more of the same message)
>> .
>> [108344.976877] ata5: softreset failed (1st FIS failed)
>> [108344.976883] ata5: hard resetting link
>> [108349.874854] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [108349.901025] ata5.00: configured for UDMA/133
>> [108349.901055] ata5: EH complete
>>
>> There were no md error messages, the recovery continued, and finished a
>> few hours later.
>>
>> [122443.805899] md: md3: recovery done.
>>
>>
>> Afterwards I did a QC check and found a mismatch in one file which I
>> mapped to the area
>> being updated when this error was logged.
>>
>> What should happen in this case?
>> Should the "replace" have failed or is there something else going on here?
> Hi Bill,
> sorry for the delay.
>
> Were there any message like:
> end_request: I/O error, dev sde, sector NNNNNNNN
>
> ??
> If not, then the error never got up to md - the driver thinks that it managed
> to recovery.
> If so, then md really should have marked the replacement as faulty - or
> possible recorded a bad-block if the device has a badblock log on it (mdadm
> -E would tell you).
>
> If the write actually failed, but md wasn't told, then that is a problem in
> the driver or device.
> If the md was told, then it certainly would be a bug in md.
>

Thanks for the help, Neil

There were no "I/O error" messages, and the drive is healthy according
to SMART data.

I later found that these errors came after I hot-plugged a drive into a
different SATA card. During the hotplug for the new disk, the controller
for sde hard-reset, then I got the flood of errors above, then it
hard-reset again 10 seconds later and things worked OK from there.

When I dug into the file which got corrupted, I found blocks of zero
bytes, which implies the write never happened, since I had zeroed the
drive before doing the replace.

So it seems like something failed and md didn't hear about it.

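For anyone wanting to repeat the zero-byte check on a suspect file, here is a minimal sketch in Python. It is not the tool used for the QC check above. The function name zero_runs and the 4096-byte block_size default are assumptions made for illustration, and the sketch only reports where all-zero blocks sit inside a file; it does not map those offsets back to array sectors.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def zero_runs(stream: BinaryIO, block_size: int = 4096) -> Iterator[Tuple[int, int]]:
    """Yield (byte_offset, byte_length) for each run of consecutive all-zero blocks.

    block_size is an assumed granularity, not something dictated by md.
    """
    offset = 0          # offset of the block about to be read
    run_start = None    # start offset of the current zero run, if any
    while True:
        block = stream.read(block_size)
        if not block:
            break
        if not any(block):              # every byte in this block is 0x00
            if run_start is None:
                run_start = offset
        elif run_start is not None:     # a zero run just ended
            yield run_start, offset - run_start
            run_start = None
        offset += len(block)
    if run_start is not None:           # file ended inside a zero run
        yield run_start, offset - run_start


# Small in-memory demonstration: one good block, two zeroed blocks, a short tail.
sample = b"A" * 4096 + b"\0" * 8192 + b"B" * 100
for start, length in zero_runs(io.BytesIO(sample)):
    print(f"zero run at offset {start}, {length} bytes")
```

To scan a real file, open it with open(path, "rb") and pass the file object in place of the BytesIO sample. Zero-filled blocks can be legitimate content, so a hit is only a hint; comparing against a known-good copy is more conclusive.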
This happened while I was breaking in a new system, and I've since
tightened up some loose screws and loose cables, and things are much
more stable now.

Thanks again,
Bill