From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-io0-f171.google.com ([209.85.223.171]:34893 "EHLO mail-io0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758757AbcDHTYH (ORCPT ); Fri, 8 Apr 2016 15:24:07 -0400
Received: by mail-io0-f171.google.com with SMTP id g185so143424545ioa.2 for ; Fri, 08 Apr 2016 12:24:07 -0700 (PDT)
Subject: Missing device handling (was: 'unable to mount btrfs pool...')
To: Chris Murphy , Btrfs BTRFS
References: <57064231.2070201@gmail.com> <5707961D.6000803@gmail.com>
From: "Austin S. Hemmelgarn"
Message-ID: <57080530.7030805@gmail.com>
Date: Fri, 8 Apr 2016 15:23:28 -0400
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On 2016-04-08 12:17, Chris Murphy wrote:
> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
> wrote:
>>
>> I entirely agree. If the fix doesn't require any kind of decision to be
>> made other than whether to fix it or not, it should be trivially fixable
>> with the tools. TBH though, this particular issue with devices disappearing
>> and reappearing could be fixed easier in the block layer (at least, there
>> are things that need to be fixed WRT it in the block layer).
>
> Right. The block layer needs a way to communicate device missing to
> Btrfs and Btrfs needs to have some tolerance for transience.

Being notified when a device disappears _shouldn't_ be that hard. A uevent gets sent already, and we should be able to associate some kind of callback with that happening for devices we have mounted. The bigger issue is going to be handling the devices _reappearing_ (if we still hold a reference to the device, it appears under a different name/major/minor, and if it's more than one device and we have no references, they may appear in a different order than they were originally), and that is where we really need to fix things.
A device disappearing forever is bad and all, but a device losing connection and reconnecting, and completely ruining the FS in the process, is exponentially worse.

Overall, to provide true reliability here, we need:

1. Some way for userspace to disable writeback caching per-device (this is needed for other reasons as well, but those are orthogonal to this discussion). This then needs to be used on all removable devices by default (Windows and OS X do this, it's part of why small transfers appear to complete faster on Linux, and then the disk takes _forever_ to unmount). This would reduce the possibility of data loss when a device disappears.

2. A way for userspace to be notified (instead of having to poll) of state changes in BTRFS. Currently, the only ways for userspace to know something is wrong are either parsing dmesg or polling the filesystem flags (and based on both personal experience and statements I've seen here and elsewhere, polling the FS flags is not reliable for this). Most normal installations are going to want to trigger handlers for specific state changes (be it e-mail to an admin, or some other notification method, or even doing some kind of maintenance on the FS automatically), and we need some kind of notification if we want to give userspace the ability to properly manage things.

3. A way to tell that a device is gone _when it happens_: not when we next try to write to it, and not when a write fails. The moment the block layer knows it's not there, we need to know as well. This is a prerequisite for the next two items. Sadly, we're probably the only thing that would directly benefit from this (LVM uses uevents and monitoring daemons to handle this, we don't exactly have that luxury), which means it may be hard to get something like this merged.

4. Transparent handling of short, transient loss of a device. This goes together to a certain extent with 1: if something disappears for long enough that the kernel notices, but it reappears before we have any I/O to do on it again, we shouldn't lose our lunch unless userspace tells us to (because we told userspace that it's gone due to item 2). In theory, we should be able to cache a small number of internal pending writes for when it reappears (so for example, if a transaction is being committed, and the USB disk disappears for a second, we should be able to pick up where we left off, after verifying the last write we sent). We should also have an automatic re-sync if it's only gone for a short enough period. The max timeout here should probably be configurable, but could probably just be one tunable for the whole system.

5. Give userspace the option to handle degraded states how it wants to, and keep our default of remount RO when degraded when userspace doesn't want to handle it itself. This needs to be configured at run-time (not stored on the media), and it needs to be per-filesystem, otherwise we open up all kinds of other issues. This is a core concept in LVM and many other storage management systems; namely, userspace can choose to handle a degraded RAID array however the hell it wants, and we'll provide a couple of sane default handlers for the common cases.

I would personally suggest adding a per-filesystem node in sysfs to handle both 2 and 5. Having it open would tell BTRFS not to automatically attempt countermeasures when degraded, select/epoll on it would return when the state changes, and reads would return (at minimum): what devices comprise the FS, per-disk state (working, failed, missing, hot-spare, etc.), and what effective redundancy we have (how many devices we can lose and still be mountable, so 1 for raid1, raid10, and raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for n-way replication (n-1), n-order parity (n), or erasure coding).
This would make it trivial to write a daemon to monitor the filesystem, react when something happens, and handle all the policy decisions.
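
To make the idea concrete, here is a rough sketch of what such a daemon could look like. Everything about the node is an assumption for illustration only: no such sysfs file exists today, and the path passed to monitor(), the text format ("profile <name>" followed by "device <path> <state>" lines), and the policy names returned by decide() are all made up. The redundancy table is just the numbers from the list above. It uses poll() with POLLPRI/POLLERR and re-reads from offset 0, which is the usual convention for pollable sysfs attributes.

```python
import select

# Devices we can lose and still mount, per profile (numbers from the list above).
REDUNDANCY = {"raid0": 0, "single": 0, "dup": 0,
              "raid1": 1, "raid10": 1, "raid5": 1, "raid6": 2}


def parse_state(text):
    """Parse the HYPOTHETICAL node format:
    one 'profile <name>' line, then 'device <path> <state>' lines."""
    profile = None
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "profile":
            profile = fields[1]
        elif len(fields) == 3 and fields[0] == "device":
            devices[fields[1]] = fields[2]
    return profile, devices


def remaining_redundancy(profile, devices):
    """How many more devices can go away; negative means not mountable."""
    lost = sum(1 for state in devices.values() if state in ("failed", "missing"))
    return REDUNDANCY[profile] - lost


def decide(profile, devices):
    """Example policy only; a real daemon would make this configurable."""
    left = remaining_redundancy(profile, devices)
    if left < 0:
        return "remount-ro"      # more devices lost than the profile tolerates
    if left < REDUNDANCY[profile]:
        return "notify-admin"    # degraded, but still within tolerance
    return "ok"


def monitor(path, handler):
    """Block on the (hypothetical) node and call handler on every state change.
    Holding the file open is what would tell BTRFS that userspace is in charge."""
    with open(path) as node:
        poller = select.poll()
        poller.register(node, select.POLLPRI | select.POLLERR)
        while True:
            node.seek(0)                      # sysfs attributes are re-read from offset 0
            profile, devices = parse_state(node.read())
            handler(decide(profile, devices), devices)
            poller.poll()                     # sleeps until the kernel signals a change
```

A caller would run something like monitor("/sys/fs/btrfs/<UUID>/state", handler), where the "state" file name is again hypothetical and handler is whatever sends the e-mail or kicks off maintenance. The point of splitting out decide() is that all policy lives in userspace and can be swapped without touching the kernel side.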