Date: Wed, 6 Apr 2016 17:08:12 -0600
Subject: Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
From: Chris Murphy
To: Ank Ular
Cc: Btrfs BTRFS

On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular wrote:
>
> From the output of 'dmesg', the section:
> [ 20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
> [ 20.999984] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
> [ 21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
> [ 21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu
>
> bothers me because the transid value of these four devices doesn't
> match the other 16 devices in the pool {should be 625065}. In theory,
> I believe these should all have the same transid value. These four
> devices are all on a single USB 3.0 port and this is the link I
> believe went down and came back up.

This is effectively a 4 disk failure, and raid6 only allows for 2.

Now, a valid complaint is that as soon as Btrfs sees write failures on 3
devices, it needs to go read-only. Specifically, it should go read-only upon
3 or more write errors affecting a single full raid stripe (data and parity
strips combined), because such a write has completely failed. Maybe there's
a way to just retry that stripe? During heavy writing there are probably
multiple stripes in flight. But in real short order the file system needs to
face plant: going read-only (or even a graceful crash) is better than
continuing to write to n-4 drives, which in effect produces a bunch of bogus
data.

I'm gonna guess the superblocks on all the surviving drives are wrong,
because it sounds like the file system didn't immediately go read-only when
the four drives vanished? However, there is probably really valuable
information in the superblocks of the failed devices. The file system should
be consistent as of the generation recorded on those missing devices. If
there's a way to roll the file system back to those supers, including using
their trees, then it should be possible to get the file system back - while
accepting 100% data loss between generation 625039 and 625065. That's
already 100% data loss anyway: if writes were still going to n-4 devices,
those generations are bogus.

Since this is entirely COW, nothing should be lost. All the data necessary
to go back to generation 625039 is on all drives, and none of the data after
that is usable anyway. Possibly even 625038 is the last good generation on
every single drive.

So what you should try to do is dump the supers from every drive. There are
three super blocks per drive, and each super carries four backup roots, so
that's potentially 12 slots per drive times 20 drives. That's a lot of data
to look through, but that's what you have to do. The first task is to see
whether the three supers on each device are identical; if so, that cuts the
comparison down to one super per device. Then compare the supers across
devices. You can get all of this with btrfs-show-super -fa.
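Something like the following would collect them all in one place (untested
sketch, the device list and output directory are just placeholders for your
20 drives; -f prints the full super, -a prints all three copies; newer
btrfs-progs spells this 'btrfs inspect-internal dump-super'):

    # dump the full contents (-f) of all super copies (-a) of every member
    mkdir -p /tmp/supers
    for dev in /dev/sdm /dev/sdn /dev/sds /dev/sdu; do   # ...plus the other 16
        btrfs-show-super -fa "$dev" > /tmp/supers/$(basename "$dev").txt
    done

    # quick check whether the three copies on one device agree
    grep generation /tmp/supers/sdm.txt

    # then compare devices against each other; per-device fields
    # (dev_item.devid, dev_item.uuid, sizes) are expected to differ
    diff /tmp/supers/sdm.txt /tmp/supers/sdn.txt

If the four USB devices all show generation 625039 and the other 16 all show
625065, that confirms the picture above.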
You might look in another thread about how to set up an overlay for 16 of
the 20 drives, making certain you hide the original so its volume UUID is
only visible to the kernel via the overlay. Don't let two block devices with
the same volume+device UUID appear to the kernel at once - which is what
happens if you e.g. take LVM snapshots of either the thick or thin variety,
make both origin and snapshot visible, and then try to mount one of them.
Others have done this, I think remotely, to make sure the local system only
sees the overlay devices. Anyway, this allows you to make destructive
changes non-destructively. (There's a rough sketch of the overlay setup in
the P.S. below.)

What I can't tell you off hand is whether any of the tools will let you
specifically accept the superblocks from the four "good" devices that went
offline abruptly, and adapt them to the other 16, i.e. rolling back the 16
that went too far forward without the other 4. Make sense?

Note: you can't exactly copy the super block from one device to another,
because it contains a dev UUID. So first you need to look at the superblocks
of any two of the four "good" devices and compare them. Exactly how do they
differ? They should only differ in dev_item.devid, dev_item.uuid, maybe
dev_item.total_bytes, and hopefully not but maybe dev_item.bytes_used. And
then somehow adapt this for the other 16 drives.

I'd love it if there's a tool that does this, maybe 'btrfs rescue
super-recover', but there are no meaningful options for that command, so I'm
skeptical how it knows what's bad and what's good. You literally might have
to splice superblocks together and write them to the 16 drives in exactly 3
locations per drive (well, maybe just one of the locations, then delete the
magic from the other two, and 'btrfs rescue super-recover' should then use
the one good copy to fix the two bad copies). Sigh... maybe? In theory it's
possible, I just don't know the state of the tools.

But I'm fairly sure the best chance of recovery is going to be with the 4
drives that abruptly vanished. Their supers will be mostly correct, or close
to it, and the super is what holds all the roots: tree, fs, chunk, extent
and csum. All of those states are better the further back in the past they
are, rather than on the 16 drives that have much newer writes.

Of course it's possible there are corruption problems from those four drives
having vanished while writes were incomplete. But if you're lucky, data
writes happen first, metadata writes second, and only then is the super
updated. So the super should point to valid metadata, and that should point
to valid data. If that order was violated, it's bad news and you have to
look at the backup roots. But *if* you get all the supers correct and on the
same page, you can access the backup roots by using -o recovery when
corruption is found during a normal mount.

--
Chris Murphy
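P.S. The overlay sketch mentioned above: one rough, untested way to do it is
with device-mapper snapshot targets backed by sparse files (device names and
the 10G overlay size are placeholders; repeat per drive for whichever of the
20 you want writable-but-safe). The critical part is to only ever scan and
mount the /dev/mapper overlay nodes, so the kernel never sees the same
volume UUID on two visible devices at once.

    # per drive: a sparse file absorbs all writes; the real drive is never touched
    truncate -s 10G /tmp/cow-sdm
    cow=$(losetup -f --show /tmp/cow-sdm)
    size=$(blockdev --getsz /dev/sdm)          # device size in 512-byte sectors
    dmsetup create overlay-sdm --table "0 $size snapshot /dev/sdm $cow P 8"

    # once all the overlays exist, register and mount only the overlays
    btrfs device scan /dev/mapper/overlay-*
    mount -o ro,recovery,degraded /dev/mapper/overlay-sdm /mnt

Tearing it down is umount, 'dmsetup remove overlay-sdm', 'losetup -d $cow',
and deleting the sparse files; any experimental superblock surgery done
through the overlays is thrown away with them.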