On 2015-11-30 15:17, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
> wrote:
>
>> General thoughts on this:
>> 1. If there's a write error, we fail unconditionally right now. It would be
>> nice to have a configurable number of retries before failing.
>
> I'm unconvinced. I pretty much immediately do not trust a block device
> that fails even a single write, and I'd expect the file system to
> quickly get confused if it can't rely on flushing pending writes to
> that device. Unless Btrfs gets into the business of tracking bad
> sectors (failed writes), the block device is a goner upon a single
> write failure, although it could still be reliable for reads.

I've had multiple cases of disks that got one write error and then were
fine for more than a year before any further issues. My thought is to
add an option to retry that single write after a short delay (1-2s
maybe), and if it still fails, then mark the disk as failed. This would
provide an option for people like me who don't want to have to replace
a disk immediately when it hits a write error. (Possibly also add a
counter, and if we get another write error within a given period of
time, kick the disk instead of retrying.)

Transient errors do happen, and in some cases more often than people
would expect. We should reasonably account for this.

This discussion actually brings to mind the rather annoying behavior of
some of the proprietary NAS systems we have where I work. They check
SMART attributes on a regular basis, and if anything the disk firmware
marks as pre-failure changes at all, they kick the disk from the RAID
array. They only kick on a change, though, so you can just disconnect
and reconnect the disk itself, and it gets accepted as a new disk as
long as the attribute didn't cross the threshold the disk firmware
lists. (I discovered this rather short-sighted behavior by accident,
but I've since used the old disks in other systems for months with no
issues whatsoever.)

>
> Possibly reasonable is the user indicating a preference for what
> happens after the max number of write failures is exceeded:
>
> - Volume goes degraded: Faulty block device is ignored entirely,
> degraded writes permitted.
> - Volume goes ro: Faulty block device is still used for reads,
> degraded writes not permitted.
>
> As far as I know, md and lvm only do the former. And md/mdadm did
> recently get the ability to support bad block maps so it can continue
> using drives lacking reserve sectors (typically that's the reason for
> write failures on conventional rotational drives).
>
>
>
>> 2. Similar for read errors, possibly with the ability to ignore them below
>> some threshold.
>
> Agreed. Maybe it would be an error rate (set by ratio)?
>
I was thinking of either:
a. A running count, using the current error counting mechanisms, with
some maximum number allowed before the device gets kicked.
b. A count that decays over time; this would need two tunables (how
long an error is considered, and how many are allowed). A rough sketch
of this follows below.
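
To make option (b) a bit more concrete, here is a minimal userspace
sketch of a time-decaying error counter. This is not btrfs code: the
struct and names (dev_error_state, ERROR_WINDOW_SEC, MAX_LIVE_ERRORS,
MAX_TRACKED) are made up for illustration, and a real in-kernel version
would hang the state off the device structure and use jiffies/ktime
rather than time().

/*
 * Sketch of a per-device error count that decays over time.
 * Two tunables:
 *   ERROR_WINDOW_SEC - how long a single error stays "live"
 *   MAX_LIVE_ERRORS  - how many live errors are tolerated before the
 *                      device gets kicked
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define ERROR_WINDOW_SEC (24 * 60 * 60) /* forget errors older than a day */
#define MAX_LIVE_ERRORS  8              /* kick on the 9th error in-window */
#define MAX_TRACKED      64             /* ring buffer size */

struct dev_error_state {
    time_t stamps[MAX_TRACKED]; /* timestamps of recent errors */
    int head;                   /* next slot to overwrite */
    int count;                  /* number of valid entries */
};

/* Drop any recorded errors older than the decay window. */
static void expire_old_errors(struct dev_error_state *s, time_t now)
{
    time_t tmp[MAX_TRACKED];
    int kept = 0;

    for (int i = 0; i < s->count; i++) {
        int idx = (s->head - s->count + i + MAX_TRACKED) % MAX_TRACKED;
        if (now - s->stamps[idx] <= ERROR_WINDOW_SEC)
            tmp[kept++] = s->stamps[idx];
    }
    for (int i = 0; i < kept; i++)
        s->stamps[i] = tmp[i];
    s->head = kept % MAX_TRACKED;
    s->count = kept;
}

/*
 * Record one error; return true if the device should be kicked, i.e.
 * more than MAX_LIVE_ERRORS errors landed within ERROR_WINDOW_SEC.
 */
static bool record_error_and_check(struct dev_error_state *s, time_t now)
{
    expire_old_errors(s, now);
    s->stamps[s->head] = now;
    s->head = (s->head + 1) % MAX_TRACKED;
    if (s->count < MAX_TRACKED)
        s->count++;
    return s->count > MAX_LIVE_ERRORS;
}

int main(void)
{
    struct dev_error_state dev = { .head = 0, .count = 0 };
    time_t now = time(NULL);

    /* Nine errors spread over an hour: the ninth crosses the limit. */
    for (int i = 0; i < 9; i++) {
        bool kick = record_error_and_check(&dev, now + i * 400);
        printf("error %d -> %s\n", i + 1,
               kick ? "kick device" : "keep device");
    }
    return 0;
}

With MAX_LIVE_ERRORS set to 1 and a short window, the same mechanism
would also cover the write-error case above (retry once, kick on a
second error within the period).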