Self-destruct of btrfs RAID6 array

Linux Btrfs filesystem development
* Self-destruct of btrfs RAID6 array
@ 2015-11-20  4:11 Paul Loewenstein
  2015-11-20  6:19 ` Duncan
  2015-11-20 13:29 ` Austin S Hemmelgarn
  0 siblings, 2 replies; 3+ messages in thread
From: Paul Loewenstein @ 2015-11-20  4:11 UTC (permalink / raw)
  To: linux-btrfs

I have just had an apparently catastrophic collapse of a large RAID6 
array.  I was hoping that the dual-redundancy of a RAID6 array would 
compensate for having no backup media large enough to back it up!

Any suggestions for repairing this array, at least to the point of 
mounting it read-only?  I am thinking of trying to mount it degraded 
with different devices missing, but I don't know if that will be an 
exercise in futility.

btrfs fi show still works!

Label: 'btrfsdata'  uuid: ccde0a00-e50b-4154-977f-ac591ab580a5
         Total devices 6 FS bytes used 9.62TiB
         devid   10 size 3.64TiB used 2.41TiB path /dev/sdg
         devid   11 size 3.64TiB used 2.41TiB path /dev/sda
         devid   12 size 3.64TiB used 2.41TiB path /dev/sdb
         devid   13 size 3.64TiB used 2.41TiB path /dev/sdc
         devid   14 size 3.64TiB used 2.41TiB path /dev/sdd
         devid   15 size 3.64TiB used 2.41TiB path /dev/sde

It spontaneously (I believe it was after it successfully mounted rw on 
boot, but I can't check for sure without looking at the last file 
creation time).  After another reboot it won't mount at all.

btrfs check /dev/sda gives:

parent transid verify failed on 73440384909312 wanted 491976 found 485531
parent transid verify failed on 73440384909312 wanted 491976 found 485531
checksum verify failed on 73440384909312 found 26943E11 wanted 0FCB3E97
checksum verify failed on 73440384909312 found AAD98681 wanted EA004FE8
checksum verify failed on 73440384909312 found AAD98681 wanted EA004FE8
bytenr mismatch, want=73440384909312, have=274180945215488
Couldn't read chunk root
Couldn't open file system

Looking back in the journal (I shall now be setting up journal 
monitoring), I found lots of errors, starting last September, only a few 
weeks after converting from RAID1 to RAID6.
Blank lines precede reboots and for the first log indicate the omission 
of over 30K entries!  The first log must represent some software bug, 
because /dev/sdh is NOT a btrfs device!

LOG EXTRACTS, while the filesystem was still mounted.  Journal grepped 
for btrfs, boot line added after.  Note different kernel version on 
reboot after upgrade.

Aug 26 20:12:24 cambridge kernel: Linux version 4.1.5-100.fc21.x86_64 
(mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 4.9.2 20150212 
(Red Hat 4.9.2-6) (GCC) ) #1 SMP Tue Aug 11 00:24:23 UTC 2015
Aug 26 20:12:52 cambridge kernel: Btrfs loaded
Aug 26 20:12:52 cambridge kernel: BTRFS: device label btrfsdata devid 11 
transid 484422 /dev/sda
Aug 26 20:12:52 cambridge kernel: BTRFS: device label btrfsdata devid 15 
transid 484422 /dev/sde
Aug 26 20:12:52 cambridge kernel: BTRFS: device label btrfsdata devid 13 
transid 484422 /dev/sdc
Aug 26 20:12:52 cambridge kernel: BTRFS: device label btrfsdata devid 14 
transid 484422 /dev/sdd
Aug 26 20:12:52 cambridge kernel: BTRFS: device label btrfsdata devid 12 
transid 484422 /dev/sdb
Aug 26 20:12:52 cambridge kernel: BTRFS: device label btrfsdata devid 10 
transid 484422 /dev/sdg
Sep 13 16:11:34 cambridge kernel: BTRFS: bdev /dev/sdh errs: wr 0, rd 0, 
flush 1, corrupt 0, gen 0
Sep 13 16:11:34 cambridge kernel: BTRFS: lost page write due to I/O 
error on /dev/sdh
Sep 13 16:11:34 cambridge kernel: BTRFS: bdev /dev/sdh errs: wr 1, rd 0, 
flush 1, corrupt 0, gen 0
Sep 13 16:11:34 cambridge kernel: BTRFS: lost page write due to I/O 
error on /dev/sdh
Sep 13 16:11:34 cambridge kernel: BTRFS: bdev /dev/sdh errs: wr 2, rd 0, 
flush 1, corrupt 0, gen 0
Sep 13 16:11:34 cambridge kernel: BTRFS: lost page write due to I/O 
error on /dev/sdh

Nov 15 15:21:51 cambridge kernel: BTRFS: lost page write due to I/O 
error on /dev/sdh
Nov 15 15:21:51 cambridge kernel: BTRFS: bdev /dev/sdh errs: wr 18713, 
rd 0, flush 6238, corrupt 0, gen 0
Nov 15 15:21:51 cambridge kernel: BTRFS: lost page write due to I/O 
error on /dev/sdh
Nov 15 15:21:51 cambridge kernel: BTRFS: bdev /dev/sdh errs: wr 18714, 
rd 0, flush 6238, corrupt 0, gen 0

Nov 15 15:23:00 cambridge kernel: Linux version 4.1.12-101.fc21.x86_64 
(mockbuild@bkernel01.phx2.fedoraproject.org) (gcc version 4.9.2 20150212 
(Red Hat 4.9.2-6) (GCC) ) #1 SMP Wed Oct 28 15:18:44 UTC 2015
Nov 15 15:23:33 cambridge kernel: Btrfs loaded
Nov 15 15:23:33 cambridge kernel: BTRFS: device label btrfsdata devid 14 
transid 492036 /dev/sdd
Nov 15 15:23:33 cambridge kernel: BTRFS: device label btrfsdata devid 15 
transid 485798 /dev/sde
Nov 15 15:23:33 cambridge kernel: BTRFS: device label btrfsdata devid 11 
transid 492036 /dev/sda
Nov 15 15:23:33 cambridge kernel: BTRFS: device label btrfsdata devid 13 
transid 492036 /dev/sdc
Nov 15 15:23:33 cambridge kernel: BTRFS: device label btrfsdata devid 10 
transid 492036 /dev/sdg
Nov 15 15:23:33 cambridge kernel: BTRFS: device label btrfsdata devid 12 
transid 492036 /dev/sdb
Nov 15 15:23:33 cambridge kernel: BTRFS (device sdb): parent transid 
verify failed on 73440384909312 wanted 491976 found 485531
Nov 15 15:23:33 cambridge kernel: BTRFS (device sdb): parent transid 
verify failed on 73440384913408 wanted 491976 found 485531
Nov 15 15:23:33 cambridge kernel: BTRFS (device sdb): parent transid 
verify failed on 73440384917504 wanted 491976 found 485696
Nov 15 15:23:33 cambridge kernel: BTRFS (device sdb): parent transid 
verify failed on 73440384921600 wanted 491976 found 485696
Nov 15 15:23:33 cambridge kernel: BTRFS: bdev /dev/sde errs: wr 18711, 
rd 0, flush 6237, corrupt 0, gen 0
Nov 15 15:23:33 cambridge kernel: BTRFS (device sdb): bad tree block 
start 1121375725894905312 74200909787136
Nov 15 15:23:33 cambridge kernel: BTRFS (device sdb): bad tree block 
start 7250342666203184288 74200909791232
Nov 15 15:23:33 cambridge kernel: BTRFS (device sdb): parent transid 
verify failed on 73417618042880 wanted 488487 found 485439

Nov 15 20:37:14 cambridge kernel: BTRFS (device sdb): parent transid 
verify failed on 73440384917504 wanted 491976 found 485696
Nov 15 20:37:14 cambridge kernel: BTRFS (device sdb): parent transid 
verify failed on 73440384921600 wanted 491976 found 485696
Nov 15 20:39:01 cambridge kernel: BTRFS (device sdb): bad tree block 
start 8747312261073978676 74201584123904
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733865472 csum 3128256294 expected csum 3176585556
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733869568 csum 3953187115 expected csum 2827150008
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733873664 csum 2011708136 expected csum 1514290758
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733877760 csum 4227108651 expected csum 3929632885
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733881856 csum 667263525 expected csum 2167952522
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733885952 csum 1421670165 expected csum 2602382287
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733890048 csum 2320260888 expected csum 606775819
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733865472 csum 3128256294 expected csum 3176585556
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733894144 csum 2140326945 expected csum 2209619790
Nov 15 20:39:02 cambridge kernel: BTRFS warning (device sdb): csum 
failed ino 1455165 off 1733898240 csum 372680472 expected csum 3888049973

Nov 15 20:42:45 cambridge kernel: Linux version 4.1.12-101.fc21.x86_64 
(mockbuild@bkernel01.phx2.fedoraproject.org) (gcc version 4.9.2 20150212 
(Red Hat 4.9.2-6) (GCC) ) #1 SMP Wed Oct 28 15:18:44 UTC 2015
Nov 15 20:43:16 cambridge kernel: Btrfs loaded
Nov 15 20:43:16 cambridge kernel: BTRFS: device label btrfsdata devid 15 
transid 492120 /dev/sde
Nov 15 20:43:16 cambridge kernel: BTRFS: device label btrfsdata devid 14 
transid 492120 /dev/sdd
Nov 15 20:43:16 cambridge kernel: BTRFS: device label btrfsdata devid 13 
transid 492120 /dev/sdc
Nov 15 20:43:16 cambridge kernel: BTRFS: device label btrfsdata devid 12 
transid 492120 /dev/sdb
Nov 15 20:43:16 cambridge kernel: BTRFS: device label btrfsdata devid 11 
transid 492120 /dev/sda
Nov 15 20:43:16 cambridge kernel: BTRFS: device label btrfsdata devid 10 
transid 492120 /dev/sdg
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384909312 wanted 491976 found 485531
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384913408 wanted 491976 found 485531
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384917504 wanted 491976 found 485696
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384921600 wanted 491976 found 485696
Nov 15 20:43:16 cambridge kernel: BTRFS: bdev /dev/sde errs: wr 18711, 
rd 0, flush 6237, corrupt 0, gen 0
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): bad tree block 
start 1121375725894905312 74200909787136
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): bad tree block 
start 7250342666203184288 74200909791232
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73417618042880 wanted 488487 found 485439
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73417618042880 wanted 488487 found 485439
Nov 15 20:43:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73417618042880 wanted 488487 found 485439
Nov 15 20:43:16 cambridge kernel: BTRFS: Failed to read block groups: -5
Nov 15 20:43:16 cambridge kernel: BTRFS: open_ctree failed
Nov 15 20:49:14 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384909312 wanted 491976 found 485531
Nov 15 20:49:15 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384913408 wanted 491976 found 485531
Nov 15 20:49:15 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384917504 wanted 491976 found 485696
Nov 15 20:49:15 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73440384921600 wanted 491976 found 485696
Nov 15 20:49:15 cambridge kernel: BTRFS: bdev /dev/sde errs: wr 18711, 
rd 0, flush 6237, corrupt 0, gen 0
Nov 15 20:49:16 cambridge kernel: BTRFS (device sdg): bad tree block 
start 1121375725894905312 74200909787136
Nov 15 20:49:16 cambridge kernel: BTRFS (device sdg): bad tree block 
start 7250342666203184288 74200909791232
Nov 15 20:49:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73417618042880 wanted 488487 found 485439
Nov 15 20:49:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73417618042880 wanted 488487 found 485439
Nov 15 20:49:16 cambridge kernel: BTRFS (device sdg): parent transid 
verify failed on 73417618042880 wanted 488487 found 485439
Nov 15 20:49:16 cambridge kernel: BTRFS: Failed to read block groups: -5
Nov 15 20:49:16 cambridge kernel: BTRFS: open_ctree failed


* Re: Self-destruct of btrfs RAID6 array
  2015-11-20  4:11 Self-destruct of btrfs RAID6 array Paul Loewenstein
@ 2015-11-20  6:19 ` Duncan
  2015-11-20 13:29 ` Austin S Hemmelgarn
  1 sibling, 0 replies; 3+ messages in thread
From: Duncan @ 2015-11-20  6:19 UTC (permalink / raw)
  To: linux-btrfs

Paul Loewenstein posted on Thu, 19 Nov 2015 20:11:14 -0800 as excerpted:

> I have just had an apparently catastrophic collapse of a large RAID6
> array.  I was hoping that the dual-redundancy of a RAID6 array would
> compensate for having no backup media large enough to back it up!

Well...

First, while btrfs in general is "stabilizing" and is noticeably better 
than it was a year ago, it remains "not yet fully stable or mature."

There's a sysadmin's rule of backups: if it's not backed up, then by your 
actions you value the data less than the time/trouble/resources of making 
a backup.  Thus, should it fail, regardless of any loss of data, you've 
saved what your actions defined as /really/ valuable, the 
time/trouble/resources you didn't spend on the backup, and you should be 
happy that you saved the really important stuff.

Because btrfs isn't yet fully stable, having backups is even more 
important than it would be on a fully stable filesystem like xfs, ext*, 
or reiserfs (my previous favorite and what I still use on spinning rust 
and for backups), so that sysadmin's rule of backups applies double.

Of course some distros are choosing to deploy and support btrfs as if 
it's already fully stable, and that's their risk and their business for 
doing so, but by the same token, for that you'd get support from them, 
not from the upstream list (here), where btrfs is still considered to be 
"stabilizing, not yet fully stable".

Second, btrfs raid56 mode is much newer than btrfs in general, and isn't 
yet close to even the "stabilizing, good enough provided you have good 
backups or are using throw-away data" general level of btrfs.  Nominal 
code-completion was only kernel 3.19, and there were very significant 
bugs with it and 4.0, into the early 4.1 cycle, tho by 4.1 release the 
worst and known bugs were fixed.  But as a btrfs user and list regular, I 
and others have repeatedly recommended that people not consider btrfs 
raid56 mode as "stabilizing-stable" as btrfs in general is, for at least 
a year (five kernel cycles) after nominal code completion in 3.19, and 
even then, people thinking about using btrfs raid56 should check the list 
for recent bugs and consider, before deploying in anything but throw-away-
data (which can be because it's backed up data) test mode.  Of course 
that would be kernel 4.4, which is currently in development.

And as it happens, kernel 4.4 has been announced as a long-term-stable 
series, so things look to be working out reasonably well for those 
interested in first-opportunity-stablish btrfs raid56 deployment on it. 
=:^)

Since we're obviously not at 4.4 release yet, and in fact you're 
apparently running 4.1 stable series, that means btrfs raid56 mode must 
still be considered less stable than btrfs as a whole, which as I said is 
itself "still stabilizing, not fully stable and mature", so now we're at 
double-the-already-doubled-strength, 4 times the normal strength, of the 
sysadmin's backup rule.

So it's four-times self-evident that if you didn't have backups for data 
on raid56 mode btrfs, by your actions you placed a *REALLY* low value on 
that data!  So losing it is /very/ trivial, at least compared to the time 
and resources you can be happy you saved by not having a backup. =:^)

That said, there's still hope...

First, because btrfs raid56 mode /is/ so new and not yet stable, you 
really need to be working with the absolute latest tools in order to 
have the best chance at recovery.  That means kernel 4.3 and btrfs-progs 
4.3.1, if at all possible.  You can use earlier, but it might mean losing 
what's actually recoverable using the latest tools.

> Any suggestions for repairing this array, at least to the point of
> mounting it read-only?  I am thinking of trying to mount it degraded
> with different devices missing, but I don't know if that will be an
> exercise in futility.
> 
> btrfs fi show still works!
> 
> Label: 'btrfsdata'  uuid: ccde0a00-e50b-4154-977f-ac591ab580a5
>          Total devices 6 FS bytes used 9.62TiB
>          devid   10 size 3.64TiB  used 2.41TiB path /dev/sdg
>          devid   11 size 3.64TiB used 2.41TiB path /dev/sda
>          devid   12 size 3.64TiB used 2.41TiB path /dev/sdb
>          devid   13 size 3.64TiB used 2.41TiB path /dev/sdc
>          devid   14 size 3.64TiB used 2.41TiB path /dev/sdd
>          devid   15 size 3.64TiB used 2.41TiB path /dev/sde
> 
> It spontaneously (I believe it was after it successfully mounted rw on
> boot, but I can't check for sure without looking at the last file
> creation time).  After another reboot it won't mount at all.

You say mount, but there's no hint of the options you've tried.

If you've not yet read up on the user documentation on the wiki,
https://btrfs.wiki.kernel.org , I suggest you do so.  There's a lot of 
useful background information there, including discussion of mount 
options and recovery.

What you will want to try here if you haven't already is a degraded,ro 
mount, possibly with the recovery option as well (try it without first, 
then with, if necessary).
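
The order of attempts described above could be scripted roughly as 
follows.  This is only a sketch: mount_attempts is a hypothetical helper 
name, /mnt/btrfsdata is an assumed mount point, and the function only 
builds the command lines so they can be reviewed before anyone runs them 
as root.

```python
def mount_attempts(device, mountpoint):
    """Return mount command lines to try in order, least invasive first."""
    option_sets = [
        "degraded,ro",           # read-only, tolerate a missing device
        "degraded,ro,recovery",  # also let the kernel try older tree roots
    ]
    return [["mount", "-t", "btrfs", "-o", opts, device, mountpoint]
            for opts in option_sets]


if __name__ == "__main__":
    # /mnt/btrfsdata is a hypothetical mount point used for illustration.
    for cmd in mount_attempts("/dev/sda", "/mnt/btrfsdata"):
        print(" ".join(cmd))
```

Stop at the first command that succeeds; there is no reason to add the 
recovery option if the plain degraded,ro mount already works.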

If you've not tried degraded writable yet, there's a possibility that 
mounting degraded, writable, will work.  If it does, you want to do device 
replaces/deletes to get undegraded as soon as possible, preferably with 
as little other writing to the filesystem as possible.  If new chunks 
need to be allocated for further writes, they may be allocated in single 
mode, and there's currently a bug which won't allow a degraded read-write 
mount after that: btrfs sees the single-mode chunks on a degraded 
filesystem and assumes there may be others on the missing devices, without 
actually checking.  As a result, you often get just one shot at a 
writable mount to undegrade, and if that doesn't work, the filesystem is 
often only read-only mountable after that.  (This bug applies to all 
redundant/parity raid modes, so to raid1 and raid10 as well, not just 
raid56.)

If you /had/ tried degraded mounting, that bug may be why you're now 
unable to mount again, writable, but degraded,ro, is likely to still 
work.  There's actually a patch for the bug, that makes btrfs check the 
actual chunk allocation to see if all are accounted for on the existing 
devices, allowing writable mounting if so, but it's definitely not in 4.1 
or 4.2, tho I think it might have made 4.3.  (If so it could possibly be 
backported to stable-series 4.1 at least, but it's unlikely to be there 
yet.)

If the various degraded,recovery,ro options don't work, the next thing to 
try is btrfs restore.  This works with an unmounted filesystem using the 
userspace code, so a current btrfs-progs, preferably 4.3.0 or 4.3.1, is 
recommended for the best chance at success.

What btrfs restore does is try to read the unmounted filesystem and 
retrieve files from it, writing them to some other mounted filesystem 
location.  Newer btrfs restore versions have options to save ownership/
permissions and timestamp data, and rewrite symlinks as well, otherwise 
the files are written as the executing user (root) using its umask.  
There are options to write only selected parts of the filesystem, and/or 
to restore specific snapshots (which are otherwise ignored), as well. 
Obviously you'll need space at wherever you point restore at to write 
whatever you intend to restore, but if you didn't have a current backup, 
as people considering this option obviously didn't, this is basically 
replacing the space you would have otherwise dedicated to backups, so 
it's not too horrible.

With a bit of luck, restore will work without further trouble.  If it 
doesn't, there's more damage, but btrfs does keep a history of main 
roots, and btrfs-find-root can be used to list them, with btrfs restore 
able to take a root by its bytenr, using the -t option.  Here's the wiki 
page link with further instructions, tho last I looked it was a bit dated.

https://btrfs.wiki.kernel.org/index.php/Restore

A hint, in case it's not obvious from the wiki page, generation, and 
transid/transaction-id, are the same thing. =:^)
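
Because generation and transid are the same number, the "wanted" and 
"found" values in the kernel messages quoted in this thread show how many 
transactions stale a block is.  A minimal sketch, assuming the journal 
lines have been unwrapped to one message per line; transid_lag is a 
hypothetical helper name, not part of btrfs-progs.

```python
import re

# Matches kernel messages such as:
#   parent transid verify failed on <bytenr> wanted <gen> found <gen>
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)")


def transid_lag(line):
    """Return (bytenr, wanted, found, lag) for a matching line, else None."""
    m = TRANSID_RE.search(line)
    if m is None:
        return None
    bytenr, wanted, found = (int(g) for g in m.groups())
    return bytenr, wanted, found, wanted - found


if __name__ == "__main__":
    # Line taken from the journal extract in this thread, unwrapped.
    line = ("BTRFS (device sdb): parent transid verify failed on "
            "73440384909312 wanted 491976 found 485531")
    print(transid_lag(line))
```

A large lag means the block on disk was last written many transactions 
before the one its parent expects.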

Of course, also see the btrfs-restore manpage, which now actually lists 
the wiki link for more info.  As I said the wiki page was a bit dated 
last I looked, so definitely check the manpage, and pay attention to the 
newer options such as -l (list roots, useful with -t to see if that root 
is a good restore candidate), -D (dry run), and -m and -S, metadata and 
symlinks, without which files will be restored as the writing user (root) 
using the present umask, with current timestamps, and no symlinks.
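
Putting those options together, a cautious sequence might be: list the 
roots, do a dry run, then do the real restore.  The sketch below only 
builds the command lines; restore_commands is a hypothetical helper, the 
destination path is an assumption, and the root bytenr (if any) would come 
from btrfs-find-root output.  Check the manpage of your btrfs-progs 
version before relying on any of the flags.

```python
def restore_commands(device, dest, root_bytenr=None):
    """Build btrfs restore command lines: list roots, dry run, real run."""
    # -t selects a specific tree root by bytenr (e.g. from btrfs-find-root).
    root = ["-t", str(root_bytenr)] if root_bytenr is not None else []
    return [
        # List tree roots, to judge whether this root is a good candidate.
        ["btrfs", "restore", *root, "-l", device],
        # Dry run: show what would be restored without writing anything.
        ["btrfs", "restore", "-D", *root, device, dest],
        # Real run, keeping ownership/permissions/timestamps and symlinks.
        ["btrfs", "restore", "-m", "-S", *root, device, dest],
    ]


if __name__ == "__main__":
    # /mnt/rescue is a hypothetical destination on another filesystem.
    for cmd in restore_commands("/dev/sda", "/mnt/rescue"):
        print(" ".join(cmd))
```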

If btrfs restore fails you, then getting a dev interested in the specific 
errors you have and patches to fix them, is your only hope.  But of 
course, since you already saved what was most important to you, the time 
and resources you would have otherwise spent to do the backup, and what 
might be lost here is as explained above at most valued at 4X-trivial, 
you can still be happy that you saved the really important stuff and any 
loss really /is/ trivial. 

(Seriously, when you compare the loss of a bit of data to what those 
folks in France lost recently, or what those Syrian refugees are risking 
and at times losing, their lives, or what the folks in 9/11 lost... in 
perspective, losing a bit of data here really *is* trivial.  The fact 
that we're both here at all, along with the others on the list, 
discussing this, makes us all pretty lucky, all things considered!  
Sometimes it does help to step back and get some /real/ perspective! =:^)

> Looking back in the journal (I shall now be setting up journal
> monitoring), I found lots of errors, starting last September, only a few
> weeks after converting from RAID1 to RAID6.
> Blank lines precede reboots and for the first log indicate the omission
> of over 30K entries!  The first log must represent some software bug,
> because /dev/sdh is NOT a btrfs device!

That very possibly indicates either a different device-detection order 
and thus device letter assignment on boot, such that one of the other 
devices appeared as /dev/sdh at that boot, or a device dropping out and 
reappearing as sdh, instead of whatever letter it had previously.  On 
today's hardware, such device reordering isn't uncommon, thus the switch 
to mounting by UUID or filesystem labels, for instance, as opposed to the 
now somewhat unpredictable /dev/sdX device names, since the X can change!

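One way to check the renaming idea is to compare the per-device error 
counters in the journal, since btrfs keeps them with the filesystem and 
reports them under whatever name the device has at the time.  In the 
extracts in this thread, the last counters logged for /dev/sdh before the 
reboot (wr 18714, flush 6238) are close to those logged for /dev/sde 
afterwards (wr 18711, flush 6237), which is at least consistent with the 
same disk under two names, though it is not proof.  A sketch, assuming 
unwrapped journal lines; last_error_counters is a hypothetical helper.

```python
import re

ERRS_RE = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), "
    r"corrupt (\d+), gen (\d+)")


def last_error_counters(lines):
    """Map each device path to the last error counters logged for it."""
    counters = {}
    for line in lines:
        m = ERRS_RE.search(line)
        if m:
            dev = m.group(1)
            wr, rd, flush, corrupt, gen = (int(g) for g in m.groups()[1:])
            counters[dev] = {"wr": wr, "rd": rd, "flush": flush,
                             "corrupt": corrupt, "gen": gen}
    return counters


if __name__ == "__main__":
    # Lines taken from the journal extracts in this thread, unwrapped.
    log = [
        "BTRFS: bdev /dev/sdh errs: wr 18714, rd 0, flush 6238, "
        "corrupt 0, gen 0",
        "BTRFS: bdev /dev/sde errs: wr 18711, rd 0, flush 6237, "
        "corrupt 0, gen 0",
    ]
    for dev, c in last_error_counters(log).items():
        print(dev, c["wr"], c["flush"])
```
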
-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Self-destruct of btrfs RAID6 array
  2015-11-20  4:11 Self-destruct of btrfs RAID6 array Paul Loewenstein
  2015-11-20  6:19 ` Duncan
@ 2015-11-20 13:29 ` Austin S Hemmelgarn
  1 sibling, 0 replies; 3+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-20 13:29 UTC (permalink / raw)
  To: Paul Loewenstein, linux-btrfs

On 2015-11-19 23:11, Paul Loewenstein wrote:
> I have just had an apparently catastrophic collapse of a large RAID6
> array.  I was hoping that the dual-redundancy of a RAID6 array would
> compensate for having no backup media large enough to back it up!
Duncan already did a really good job of explaining this (and from what I 
can tell, I'm pretty sure his analysis of what's going on is correct), 
but I would like to add a couple of things.

First, RAID is not a backup, it's a way to minimize the need to restore 
from backups in the event of hardware failure (or, in the case of BTRFS, 
also a way to minimize the effects of data corruption).

Second, have you considered doing encrypted backups to a cloud storage 
service?  This is what I personally do, and it works really well for me.  
Amazon S3 has pretty reasonable pricing, and there are multiple options 
on Linux to allow accessing it like a filesystem.  There are many other 
options as well (in my case, I back up to both S3 and Dropbox, but my 
backups are small enough that I don't need to worry about the 1T limit 
on Dropbox for non-business accounts).
