From mboxrd@z Thu Jan 1 00:00:00 1970
From: Joe Peterson
Subject: Re: Raid1 with failing drive
Date: Wed, 29 Oct 2008 14:02:04 -0600
Message-ID: <4908C13C.7070607@gentoo.org>
References: <20081028164851.70f9d92e@extreme> <1225307907.6448.284.camel@think.oraclecorp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Cc: Stephen Hemminger , linux-btrfs@vger.kernel.org
To: Chris Mason
Return-path:
In-Reply-To: <1225307907.6448.284.camel@think.oraclecorp.com>
List-ID:

Chris Mason wrote:
> On Tue, 2008-10-28 at 16:48 -0700, Stephen Hemminger wrote:
>> I have a system with a pair of small/fast but unreliable scsi drives.
>> I tried setting up a raid1 configuration and using it for builds.
>> Using 2.6.26.7 and btrfs 0.16. When using ext3 (no raid) on same partition,
>> the driver would recalibrate and log something an keep going. But with
>> btrfs it doesn't recover and takes drive offline.
>>
>
> Btrfs doesn't really take drives offline. In the future we'll notice
> that a drive is returning all errors, but for now we'll probably just
> keep beating on it.

It can also detect when a bad checksum is returned or the drive returns
an i/o error, right? Would the "all-zero" test be a heuristic in case
neither of those happened (but I cannot imagine why the zeros would get
by the checksum check)?

> The IO error handling code in btrfs currently expects it'll be able to
> find at least one good mirror. You're probably hitting some bad
> conditions as it fails to clean up.

What happens (or rather, will happen) on a regular/non-mirrored btrfs?
Would it then return an i/o error to the user and/or mark a block as
bad? In ZFS, the state of the volume changes, noting an issue (also
happens on a scrub), and the user can check this. What I don't like
about ZFS is that the user can clear the condition, and then it appears
OK again until another scrub.

-Joe