* Unrecoverable fs corruption? @ 2015-12-31 23:36 Alexander Duscheleit 2016-01-01 1:22 ` Chris Murphy 0 siblings, 1 reply; 12+ messages in thread
From: Alexander Duscheleit @ 2015-12-31 23:36 UTC (permalink / raw)
To: linux-btrfs

Hello,

I had a power failure today at my home server, and after the reboot the btrfs RAID1 won't come back up.

When trying to mount one of the 2 disks of the array I get the following error:

[ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled
[ 4126.316402] BTRFS: has skinny extents
[ 4126.337324] BTRFS: failed to read chunk tree on sdb2
[ 4126.353027] BTRFS: open_ctree failed

A btrfs check segfaults after a few seconds with the following message:

(0:29)[root@hera]~ # ❯❯❯ btrfs check /dev/sdb2
warning devid 1 not found already
bad key ordering 68 69
Checking filesystem on /dev/sdb2
UUID: d55fa866-3baa-4e73-bf3e-5fda29672df3
checking extents
bad key ordering 68 69
bad block 6513625202688
Errors found in extent allocation tree or chunk allocation
[1] 11164 segmentation fault btrfs check /dev/sdb2

I have 2 btrfs-images (one with -w, one without), but they are 6.1G and 1.1G respectively; I don't know if I can upload them at all, nor where to store such large files.

I did try a btrfs check --repair on one of the disks, which gave the following result:

enabling repair mode
warning devid 1 not found already
bad key ordering 68 69
repair mode will force to clear out log tree, Are you sure? [y/N]: y
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs[0x44161e]
btrfs(btrfs_reserve_extent+0xa7b)[0x4463db]
btrfs(btrfs_alloc_free_block+0x5f)[0x44649f]
btrfs(__btrfs_cow_block+0xc4)[0x437d64]
btrfs(btrfs_cow_block+0x35)[0x438365]
btrfs[0x43d3d6]
btrfs(btrfs_commit_transaction+0x95)[0x43f125]
btrfs(cmd_check+0x5ec)[0x429cdc]
btrfs(main+0x82)[0x40ef32]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f881f983610]
btrfs(_start+0x29)[0x40f039]

That's all I tried so far.
btrfs restore -viD seems to find most of the files accessible, but since I don't have a spare hdd of sufficient size, I would have to break the array, reformat, and use one of the disks as a restore target. I'm not prepared to do this before I know there is no other way to fix the drives, since I'd essentially be destroying one more chance at saving the data.

Is there anything I can do to get the fs out of this mess?

-- Alexander Duscheleit

^ permalink raw reply [flat|nested] 12+ messages in thread
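A note on the restore step mentioned above: btrfs restore's -D flag is a dry run, so listing what is recoverable writes nothing to the damaged filesystem. A sketch of the sequence follows; the device node and target path are assumptions, and each command is printed via an echo guard rather than executed:

```shell
# Sketch only: the device node and target path are assumptions.
# 'run' prints each command; drop the echo guard to execute for real.
run() { echo "would run: $*"; }

DEV=/dev/sdb2
TARGET=/mnt/restore-target   # hypothetical spare disk, mounted read-write

run btrfs restore -viD "$DEV" "$TARGET"  # -D: dry run, only list files
run btrfs restore -vi "$DEV" "$TARGET"   # actual restore, once a target exists
```

-v is verbose and -i ignores checksum errors while restoring, which is usually what you want on an already-damaged filesystem.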
* Re: Unrecoverable fs corruption? 2015-12-31 23:36 Unrecoverable fs corruption? Alexander Duscheleit @ 2016-01-01 1:22 ` Chris Murphy 2016-01-01 8:13 ` Duncan 0 siblings, 1 reply; 12+ messages in thread From: Chris Murphy @ 2016-01-01 1:22 UTC (permalink / raw) To: Alexander Duscheleit; +Cc: Btrfs BTRFS On Thu, Dec 31, 2015 at 4:36 PM, Alexander Duscheleit <alexander.duscheleit@gmail.com> wrote: > Hello, > > I had a power fail today at my home server and after the reboot the btrfs > RAID1 won't come back up. > > When trying to mount one of the 2 disks of the array I get the following > error: > [ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled > [ 4126.316402] BTRFS: has skinny extents > [ 4126.337324] BTRFS: failed to read chunk tree on sdb2 > [ 4126.353027] BTRFS: open_ctree failed Why are you trying to mount only one? What mount options did you use when you did this? > > a btrfs check segfaults after a few seconds with the following message: > (0:29)[root@hera]~ # ❯❯❯ btrfs check /dev/sdb2 > warning devid 1 not found already > bad key ordering 68 69 > Checking filesystem on /dev/sdb2 > UUID: d55fa866-3baa-4e73-bf3e-5fda29672df3 > checking extents > bad key ordering 68 69 > bad block 6513625202688 > Errors found in extent allocation tree or chunk allocation > [1] 11164 segmentation fault btrfs check /dev/sdb2 > > I have 2 btrfs-images (one with -w, one without) but they are 6.1G and 1.1G > repectively, I don't know > if I can upload them at all and also not where to store such large files. > > I did try a btrfs check --repair on one of the disks which gave the > following result: > enabling repair mode > warning devid 1 not found already > bad key ordering 68 69 > repair mode will force to clear out log tree, Are you sure? [y/N]: y > Unable to find block group for 0 > extent-tree.c:289: find_search_start: Assertion `1` failed. 
> btrfs[0x44161e]
> btrfs(btrfs_reserve_extent+0xa7b)[0x4463db]
> btrfs(btrfs_alloc_free_block+0x5f)[0x44649f]
> btrfs(__btrfs_cow_block+0xc4)[0x437d64]
> btrfs(btrfs_cow_block+0x35)[0x438365]
> btrfs[0x43d3d6]
> btrfs(btrfs_commit_transaction+0x95)[0x43f125]
> btrfs(cmd_check+0x5ec)[0x429cdc]
> btrfs(main+0x82)[0x40ef32]
> /usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f881f983610]
> btrfs(_start+0x29)[0x40f039]
>
>
> That's all I tried so far.
> btrfs restore -viD seems to find most of the files accessible but since I
> don't have a spare hdd of sufficient size I would have to break the array
> and reformat and use one of the disk as restore target. I'm not prepared
> to do this before I know there is no other way to fix the drives since I'm
> essentially destroying one more chance at saving the data.
>
> Is there anything I can do to get the fs out of this mess?

I'm skeptical about the logic of using --repair, which modifies the filesystem, on just one device of a two-device raid1, while saying you're reluctant to "break the array." It doesn't make sense to me to expect that such a modification on one of the drives keeps it at all consistent with the other. I hope a dev can say whether --repair with a missing device is a bad idea, because if so, maybe degraded repairs need a --force flag to keep users from making things worse.

Anyway, in the meantime, my advice is do not mount either device rw (together or separately). The fewer changes you make right now, the better.

What kernel and btrfs-progs version are you using?

Did you try to mount with -o recovery, or -o ro,recovery, before trying 'btrfs check --repair'? If so, post all relevant kernel messages. Don't try -o recovery now if you haven't previously tried it; it's probably safe to try -o ro,recovery if you haven't tried that yet. I would try -o ro,recovery three ways: both devs, and each dev separately (for which you'll use -o ro,recovery,degraded).
If that doesn't work, it sounds like it might be a task for 'btrfs rescue chunk-recover' which will take a long time. But I suggest waiting as long as possible for a reply, and in the meantime I suggest looking at getting another drive to use as spare so you can keep both of these drives. -- Chris Murphy ^ permalink raw reply [flat|nested] 12+ messages in thread
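The three read-only recovery mounts Chris suggests could be sketched as follows. Device nodes and the mountpoint are assumptions, and the echo guard prints each command rather than executing it, since mounts against a damaged filesystem should be run deliberately:

```shell
# Sketch of the three ro,recovery mount attempts; adjust device nodes and
# mountpoint for the actual system. 'run' prints instead of executing.
run() { echo "would run: $*"; }

MNT=/mnt/btrfs-recovery

run btrfs device scan                               # register all members first
run mount -o ro,recovery /dev/sdb2 "$MNT"           # 1: both devices present
run mount -o ro,recovery,degraded /dev/sdb2 "$MNT"  # 2: first device alone
run mount -o ro,recovery,degraded /dev/sdc2 "$MNT"  # 3: second device alone
```

(Later kernels renamed -o recovery to usebackuproot, but recovery is the correct spelling for the kernel era discussed in this thread.)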
* Re: Unrecoverable fs corruption? 2016-01-01 1:22 ` Chris Murphy @ 2016-01-01 8:13 ` Duncan 2016-01-02 4:32 ` Christoph Anton Mitterer 2016-01-02 10:53 ` Alexander Duscheleit 0 siblings, 2 replies; 12+ messages in thread
From: Duncan @ 2016-01-01 8:13 UTC (permalink / raw)
To: linux-btrfs

Chris Murphy posted on Thu, 31 Dec 2015 18:22:09 -0700 as excerpted:

> On Thu, Dec 31, 2015 at 4:36 PM, Alexander Duscheleit
> <alexander.duscheleit@gmail.com> wrote:
>> Hello,
>>
>> I had a power fail today at my home server and after the reboot the
>> btrfs RAID1 won't come back up.
>>
>> When trying to mount one of the 2 disks of the array I get the
>> following error:
>> [ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled
>> [ 4126.316402] BTRFS: has skinny extents
>> [ 4126.337324] BTRFS: failed to read chunk tree on sdb2
>> [ 4126.353027] BTRFS: open_ctree failed
>
> Why are you trying to mount only one? What mount options did you use
> when you did this?

Yes, please.

>> btrfs restore -viD seems to find most of the files accessible but since
>> I don't have a spare hdd of sufficient size I would have to break the
>> array and reformat and use one of the disk as restore target. I'm not
>> prepared to do this before I know there is no other way to fix the
>> drives since I'm essentially destroying one more chance at saving the
>> data.

> Anyway, in the meantime, my advice is do not mount either device rw
> (together or separately). The less changes you make right now the
> better.
>
> What kernel and btrfs-progs version are you using?

Unless you've already tried it (hard to say without the mount options you used above), I'd first try a different tack than C Murphy suggests, falling back to what he suggests if it doesn't work. I suppose he assumes you've already tried this...

But first things first, as C Murphy suggests, when you post problems like this, *PLEASE* post kernel and progs userspace versions.
Given the rate at which btrfs is still changing, that's pretty critical information. Also, if you're not running the latest or second latest kernel or LTS kernel series and a similar or newer userspace, be prepared to be asked to try a newer version. With the almost-released 4.4 set to be an LTS, that means 4.4 if you want to try it, or the LTS kernel series 4.1 and 3.18, or the current or previous current kernel series 4.3 or 4.2 (tho with 4.2 not being an LTS, updates are ended or close to it, so people on it should be either upgrading to 4.3 or downgrading to 4.1 LTS anyway). And for userspace, a good rule of thumb is: whatever the kernel series, a corresponding or newer userspace as well.

With that covered...

This is a good place to bring in something else CM recommended, but in a slightly different context. If you've read many of my previous posts you're likely to know what I'm about to say. The admin's first rule of backups says, in simplest form[1], that if you don't have a backup, by your actions you're defining the data that would be backed up as not worth the hassle and resources to do that backup. If in that case you lose the data, be happy, as you still saved what you defined by your actions as of /true/ value, regardless of any claims to the contrary: the hassle and resources you would have spent making that backup. =:^)

While the rule of backups applies in general, for btrfs it applies even more, because btrfs is still under heavy development; while btrfs is stabilizING, it's not yet fully stable and mature, so the risk of actually needing to use that backup remains correspondingly higher than it'd ordinarily be.

But, you didn't mention having backups, and did mention that you didn't have a spare hdd, so would have to break the array to have a place to do a btrfs restore to, which reads very much like you don't have ANY BACKUPS AT ALL!!
Of course, in the context of the above backups rule, I guess you understand the implications: that you consider the value of that data essentially throw-away, particularly since you still don't have a backup, despite running a not entirely stable filesystem that puts the data at greater risk than would a fully stable filesystem. Which means no big deal. You've obviously saved the time, hassle and resources necessary to make that backup, which is obviously of more value to you than the data that's not backed up, so the data is obviously of low enough value that you can simply blow away the filesystem with a fresh mkfs and start over. =:^)

Except... were that the case, you probably wouldn't be posting. Which brings entirely new urgency to what CM said about getting that spare hdd, so you can actually create that backup, and count yourself very lucky if you don't lose your data before you have it backed up, since your previous actions were unfortunately not in accordance with the value you seem to be claiming for the data.

OK, the rest of this post is written with the assumption that your claims and your actions regarding the value of the data in question agree, and that since you're still trying to recover the data, you don't consider it just throw-away, which means you now have someplace to put that backup, should you actually be lucky enough to get the chance to make it...

With your try to mount, did you try the degraded mount option? That's primarily what this post is about, as it's not clear you did, and it's what I'd try first: without that, btrfs will normally refuse to mount if a device is missing, failing with the rather generic ctree open failure error, as your attempt did. And as CM suggests, trying the degraded,ro mount options together is a wise idea, at least at first, in order to help prevent further damage.

If a degraded,ro mount fails, then it's time to try CM's suggestions.
If a degraded,ro mount succeeds, then do a btrfs device scan, and a btrfs filesystem show, and see if it shows both devices or just one. If you like you can also try a read-only scrub (a scrub without read-only will fail if the filesystem is read-only), to see if there's any corruption.

If after a device scan, a show still shows just one device, then the other device is truly damaged and your best bet is to try to recover from just the one device, see below.

If it shows both devices, then (after taking the opportunity while read-only mounted to do that backup to the other device we're assuming you now have) try unmounting and mounting again, normally. With luck it'll work, and the initial mount failure was due to btrfs only seeing the one device, as btrfs device scan hadn't been run to let it know of the other one yet. With the now normally mounted filesystem, I'd strongly suggest a btrfs scrub as first order of business, to try to get the two devices back in sync after the crash.

If on the degraded,ro mount, a btrfs device scan followed by btrfs fi show shows the filesystem still with only one device, the other device would appear to be dead as far as btrfs is concerned. In this case, you'll need to recover from the degraded-mount working device as if the second one had entirely failed.

What I'd do in this case, if you haven't done so already, is that read-only btrfs scrub, just to see where you are in terms of corruption on the remaining device. If it comes out clean, you will likely be able to recover with little if any data loss. If not, hopefully you can still recover most of it.

At this point, now that we're assuming that you have another device to make a backup to, if you haven't already, take the opportunity to do that backup to the other device.
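The triage Duncan describes after a successful degraded,ro mount, as a sketch; the mountpoint is an assumption, and the echo guard prints each command rather than executing it:

```shell
# Post-mount triage sketch: rescan devices, check membership, read-only scrub.
run() { echo "would run: $*"; }

MNT=/mnt/btrfs-recovery   # assumed mountpoint of the degraded,ro mount

run btrfs device scan              # re-register btrfs member devices
run btrfs filesystem show          # does the fs now list both devices?
run btrfs scrub start -Bdr "$MNT"  # -B foreground, -d per-device stats,
                                   # -r read-only (the fs is mounted ro)
```

The -r flag is what makes the scrub safe here: it checks checksums without attempting any repair writes.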
Be sure to unmount and remount that other device after the backup and test to be sure what's there is usable, because sysadmin's backups rule #2 is that a would-be backup that hasn't been tested isn't yet a backup, for the purposes of rule #1, because a backup isn't completed until it has been tested.

With the backup safely done and tested, you can now afford to attempt a bit riskier stuff on the existing btrfs. Even tho btrfs isn't recognizing that second device, let's be sure it doesn't suddenly decide to be recognized, complicating things. Either wipe the device (dd if=/dev/zero of=<unrecognized former btrfs device>; or better yet, run badblocks on it in destructive mode, to both wipe and test it at the same time), or if you're impatient, at least use wipefs on it, to wipe the superblock. Alternatively, do a temporary mkfs.btrfs on it, just to wipe the existing superblocks.

Now you can treat that device as a fresh device and replace the missing device on the degraded btrfs. First you need to remount the degraded filesystem rw, because you can't add/delete/replace devices on a read-only mounted filesystem.

How you do the replace depends on the kernel and userspace you're running, and newer versions make it far easier. With a reasonably current btrfs setup, you can use btrfs replace start, feeding it the ID number of the missing device and the device node (/dev/whatever) of the replacement device, plus the mountpoint path. See the btrfs-replace manpage.

But the ID parameter wasn't added until relatively recently. If you aren't running a recent enough btrfs, you can try missing in place of the missing device, but with some versions that didn't work either.

Older btrfs versions didn't have btrfs replace. If you're running something that old, you really should upgrade, but meanwhile you'll have to use a separate btrfs device add, followed by btrfs device delete (or remove; older versions only had delete, which remains an alias of remove in newer versions).
The add should be fast. The delete will take quite a long time, as it'll do a rebalance in the process.

Meanwhile, on some older versions, you often effectively got only one chance at the replace after mounting the filesystem writable: if you rebooted (or had a crash) with the filesystem still degraded, a bug would often prevent mounting degraded,rw again, only degraded,ro, and of course the replace couldn't continue, nor a new attempt be made, while the filesystem was mounted ro. In that case, the only option (if you didn't already have a current backup) was to use the read-only mount as a backup and copy the files elsewhere, because the existing filesystem was stuck in read-only mode. So keeping relatively current really does have its advantages. =:^)

Finally, repeating what I said above, this assumes you didn't try mounting with the degraded option, with or without ro, and that it works when you do, giving you a chance to at least copy the data off the read-only filesystem. If it doesn't, as CM evidently assumed, and if you don't have backups, then you have to fall back to CM's suggestions.

---
[1] Sysadmin's first rule of backups: The more complex form covers multiple backups and accounts for the risk factor of actually needing to use them. It says that for any level of backup, either you have it, or you consider the value of the data, multiplied by the risk factor of having to actually use that level of backup, to be less than the resource and hassle cost of making that backup. In this form, data such as your internet cache is probably not worth enough to justify even a single level of backup, while truly valuable data might be worth 101 levels of backup or more, some of them offsite and others onsite but not normally physically connected, because the data is truly valuable enough that, even multiplied by the extremely tiny chance of having 100 levels of backup fail and actually needing that 101st level, it justifies having it.
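Putting Duncan's wipe-then-replace sequence together as a sketch, assuming a btrfs-progs recent enough for replace-by-devid; the device nodes, devid and mountpoint are placeholders, and the echo guard prints each command rather than executing it:

```shell
# Replace-the-missing-device sketch; every path and the devid are assumptions.
run() { echo "would run: $*"; }

MNT=/mnt/btrfs-recovery
NEWDEV=/dev/sdc2   # the wiped, formerly unrecognized device
MISSING=1          # devid of the missing device, per 'btrfs filesystem show'

run wipefs -a "$NEWDEV"                    # clear stale superblocks first
run mount -o degraded,rw /dev/sdb2 "$MNT"  # replace needs a writable mount
run btrfs replace start -B "$MISSING" "$NEWDEV" "$MNT"

# Fallback for older progs without replace-by-devid:
run btrfs device add "$NEWDEV" "$MNT"
run btrfs device delete missing "$MNT"
```

The -B flag keeps btrfs replace in the foreground so you can watch it finish; without it, progress is checked via 'btrfs replace status'.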
-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-01 8:13 ` Duncan @ 2016-01-02 4:32 ` Christoph Anton Mitterer 2016-01-03 15:00 ` Duncan 2016-01-02 10:53 ` Alexander Duscheleit 1 sibling, 1 reply; 12+ messages in thread
From: Christoph Anton Mitterer @ 2016-01-02 4:32 UTC (permalink / raw)
To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1929 bytes --]

On Fri, 2016-01-01 at 08:13 +0000, Duncan wrote:
> you can also try a read-only scrub

OT: I just wondered, would a balance include everything a scrub includes (i.e. read+verify all data and rebuild any errors from the other devices / block copies)... of course in addition to also copying all "good" data... and perhaps with the difference that you don't get the detailed information as in scrub, but only the kernel log messages about errors?

> In this case,
> you'll need to recover from the degraded-mount working device as if
> the
> second one had entirely failed.
>
> What I'd do in this case, if you haven't done so already, is that
> read-
> only btrfs scrub, just to see where you are in terms of corruption on
> the
> remaining device.

I don't think that this is the best order of the steps - at least not when it's about precious data.

Doing a scrub at this phase would just read all data, telling you the status,... but first you should try to copy as much as possible (just in case the remaining good drive fails as well) and *then* do the scrub to see what's actually good or not.

Alternatively, the first step could be backing up to another drive in the sense of a dd-copy (beware of the problem of UUID collisions in btrfs: you MUST make sure here that the kernel doesn't see[0] devices with the same IDs, which is of course the case with dd, unless you write to e.g. an image file and not a device).

This has advantages and disadvantages:
- btrfs rebuild would only rebuild those blocks that are actually used...
so you need to do less reads from a possibly soon-to-be-dying device - OTOH, you only copy the blocks which btrfs thinks are actually used,... and if later it would turn out that there are filesystem corruptions in these, you don't have any other areas (with possibly older data) where you could try some last-resort-recoveries.. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
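Christoph's dd-to-an-image-file alternative, sketched below; writing to an image file rather than a raw device is what avoids the duplicate-UUID trap he warns about. To keep the sketch safe to run as-is it copies a scratch file; for the real recovery you'd substitute the device node (e.g. if=/dev/sdb2) and an image path on storage large enough to hold it:

```shell
# Demonstrated on a 1 MiB scratch file so this runs safely anywhere; the
# real command would read the device node instead (an assumption, adjust).
SRC=/tmp/dd-demo-source.bin   # stand-in for the surviving device
IMG=/tmp/dd-demo-copy.img     # image-file target sidesteps UUID collisions

head -c 1048576 /dev/urandom > "$SRC"
dd if="$SRC" of="$IMG" bs=65536 conv=sync,noerror 2>/dev/null
cmp -s "$SRC" "$IMG" && echo "image matches source"
```

conv=noerror keeps dd going past read errors on a failing device (sync pads the short blocks with zeros so offsets stay aligned), which is exactly the behavior wanted for a possibly-dying disk.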
* Re: Unrecoverable fs corruption? 2016-01-02 4:32 ` Christoph Anton Mitterer @ 2016-01-03 15:00 ` Duncan 2016-01-04 0:05 ` Christoph Anton Mitterer 0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2016-01-03 15:00 UTC (permalink / raw)
To: linux-btrfs

Christoph Anton Mitterer posted on Sat, 02 Jan 2016 05:32:21 +0100 as excerpted:

> On Fri, 2016-01-01 at 08:13 +0000, Duncan wrote:
>> you can also try a read-only scrub
> OT: I just wondered, would a balance include everything a scrub includes
> (i.e. read+verify all data and rebuild an errors on different devices /
> block copies)... of course in addition to also copying all "good"
> data... and perhaps with the difference, that you don't get that
> detailed information as in scrub but only the kernel log messages about
> errors?

AFAIK, no, at least not by design, as balance works at the chunk level, while scrub works inside chunks, verifying the checksums on each block.

But now that I think about it, balance does read the chunk in order to rewrite its contents, and that read, like all reads, should normally be checksum verified (except of course in the case of nodatasum, which nocow of course implies). So a balance completed without error /may/ effectively indicate a scrub would complete without error as well. But it wasn't specifically designed for that, and if it does so, it's only doing it because all reads are checksum verified, not because it's actually purposely doing a scrub.

And even if balance works to verify no checksum errors, I don't believe it would correct them or give you the detail on them that a scrub would. And if there is an error, it'd be a balance error, which might or might not actually be a scrub error.

>> In this case,
>> you'll need to recover from the degraded-mount working device as if the
>> second one had entirely failed.
>> >> What I'd do in this case, if you haven't done so already, is that read- >> only btrfs scrub, just to see where you are in terms of corruption on >> the remaining device. > I don't think that this is the best order of the steps - at least not > when it's about precious data. > > Doing a scrub at this phase, would just read all data, telling you the > status,... but first you should try to copy as much as possible (just in > case the remaining good drive fails as well) and *then* do the scrub to > see what's actually good or not. Good point. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-03 15:00 ` Duncan @ 2016-01-04 0:05 ` Christoph Anton Mitterer 2016-01-06 7:35 ` Duncan 0 siblings, 1 reply; 12+ messages in thread
From: Christoph Anton Mitterer @ 2016-01-04 0:05 UTC (permalink / raw)
To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2111 bytes --]

On Sun, 2016-01-03 at 15:00 +0000, Duncan wrote:
> But now that I think about it, balance does read the chunk in ordered
> to
> rewrite its contents, and that read, like all reads, should normally
> be
> checksum verified

That was my idea.... :)

> (except of course in the case of nodatasum, which nocow
> of course implies).

Though I haven't had the time so far to reply on the most recent posts in that thread,... I still haven't given up on the quest for checksumming of nodatacow'ed data ;-)

> So a balance completed without error /may/
> effectively indicate a scrub would complete without error as
> well. But
> it wasn't specifically designed for that, and if it does so, it's
> only
> doing it because all reads are checksum verified, not because it's
> actually purposely doing a scrub.

Well sure... this is however an interesting concept to think about for the long-term future. I'd expect that in some distant future, we'd have powerful userland tools that do maintenance and health monitoring of btrfs filesystems, including e.g. automated scrubs, defrags and so on.

Especially on large filesystems, all these operations tend to take large amounts of time and may even impact the lifetime of the storage device(s)... so it would be clever if certain such operations could be kinda "merged", at least for the purposes of getting the results. As in the above example, if one would anyway run a full balance, the next scrub could be skipped, because the balance has effectively just done one. Similar for defrag.

> And even if balance works to verify no checksum errors, I don't
> believe
> it would correct them or give you the detail on them that a scrub
> would.
I'd have expected that read errors are repaired as soon as they're encountered (where possible, thanks to the other block copies)... isn't that the case?

> And if there is an error, it'd be a balance error, which might or
> might
> not actually be a scrub error.

Sure, but it shouldn't be difficult to collect e.g. scrub stats during balance as well. :-)

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-04 0:05 ` Christoph Anton Mitterer @ 2016-01-06 7:35 ` Duncan 0 siblings, 0 replies; 12+ messages in thread
From: Duncan @ 2016-01-06 7:35 UTC (permalink / raw)
To: linux-btrfs

Christoph Anton Mitterer posted on Mon, 04 Jan 2016 01:05:02 +0100 as excerpted:

> On Sun, 2016-01-03 at 15:00 +0000, Duncan wrote:
>> But now that I think about it, balance does read the chunk in ordered
>> to rewrite its contents, and that read, like all reads, should normally
>> be checksum verified
> That was my idea.... :)
>
>> (except of course in the case of nodatasum, which nocow
>> of course implies).
> Though I haven't had the time so far to reply on the most recent posts
> in that thread,... I still haven't given up on the quest for
> checksumming of nodatacow'ed data ;-)

Following the lines of the btrfs-convert discussion elsewhere, I don't believe the current devs to be too interested in this at the current time, tho maybe in the "bluesky" timeframe, beyond five years out, likely more like ten, because most of them believe it to be cost/benefit impractical to work on.

However, much like btrfs-convert, if a (probably new) developer finds this his particular itch he wants to scratch, and puts in the seriously high level of effort to get it to work, and it's all up to code standard, perhaps. But it's going to have to pass a pretty high level of skepticism, and in general it's simply not considered worth the incredible level of effort that would be necessary, so it's going to take a developer with a pretty intense itch to scratch over a period, very likely, of some years, by the time the code can be both demonstrated theoretically correct and pass regression tests and skepticism, to get it to the level where it could be properly included. IOW, not impossible, but as close as it gets.
I'd say the chances of seeing this in mainline (not just a series of patches carried by someone else) in anything under say 7 years is well under 5%, probably under 2%. The chances at say 15 years... maybe 15%. (That said, if you look at ext4 as an example, it has grown a bunch of exotic options over time, that most people will never use but that scratched someone's itch. Btrfs could be getting similar, at 7+ years out, so it's possible, and at that viewpoint, some may even consider the chances near 50% at the 10 year out mark. I'm skeptical, but I wouldn't have considered all those weird things now possible in ext4 likely to ever reach mainline ext4, either, so...) But I honestly don't expect current devs to spend much time on the proposal, at least not in the 7- year timeframe. > Especially on large filesystems all these operations tend to take large > amounts of time and may even impact the lifetime of the storage > device(s)... so it would be clever if certain such operations could be > kinda "merged", at least for the purposes of getting the results. > As in the above example, if one would anyway run a full balance, the > next scrub may be skipped because one is just doing one. > Similar for defrag. Well, balance definitely doesn't do defrag. By analogy, balance is at the UN, nation to nation, level, while defrag is at the city precinct level. They're simply out of each other's scope. Which isn't to say that at some point in the future, there won't be some btrfs doitall command, that does scrub and balance and defrag and recompression and ... all in a single pass, taking parameters from all the individual functions. But as you say, that's likely to be at least intermediate future, 3-5 years out, maybe 5-7 years out or more. And like btrfs-convert, I'd consider it in the "not a core tool, but nice to have" category. 
>> And even if balance works to verify no checksum errors, I don't believe
>> it would correct them or give you the detail on them that a scrub
>> would.
> I'd have expected that that read errors are (if possible because of
> block copies) are repaired as soon as they're encountered... isn't that
> the case?

(My understanding is that...)

At the balance level, checksum corruption errors aren't going to be fixed from the other copy or from parity, because unlike normal file usage, the other copy isn't read -- balance isn't worried about file or extent level corruption, and any it would find would be simply a byproduct of the normal read-time checksum verification process; it's simply moving chunks around. Such errors would thus simply cause the balance to abort, with whatever balance-time error that wouldn't even necessarily reflect that it's a checksum error.

Assuming that's correct, a completed balance could be assumed to have in addition the meaning of a scrub completed without any errors, but a failed balance could have failed for one of any number of reasons and with one of various balance-level errors, with such a failure yielding little or no clue as to scrub status.

>> And if there is an error, it'd be a balance error, which might or might
>> not actually be a scrub error.
> Sure, but it shouldn't be difficult to collect e.g. scrub stats during
> balance as well.

Given that as of now they're still struggling to manage balance's memory requirements in order to let it scale more efficiently, and that scaling, particularly in the presence of large numbers of subvolumes and with quotas, remains the single biggest issue, the devs are extremely unlikely to want to add additional memory requirements in order to additionally track scrub stats. Even once the current scaling issues are resolved, I don't see it being a useful option for balance itself, precisely because of the scaling issues, then on potentially embedded systems running TB-scale storage.
But there might indeed be some place for it in the still very theoretical btrfs doitall command you proposed and I named doitall, above. Embedded- scale applications would simply not run that command, instead running the lower resource individual commands, while doitall could say check that it had a minimum of 16 GiB of memory or whatever to use, and exit with an error if not, so it could optionally be run on systems with the required resources. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-01 8:13 ` Duncan 2016-01-02 4:32 ` Christoph Anton Mitterer @ 2016-01-02 10:53 ` Alexander Duscheleit 2016-01-02 21:19 ` Henk Slager 2016-01-03 16:08 ` Duncan 1 sibling, 2 replies; 12+ messages in thread
From: Alexander Duscheleit @ 2016-01-02 10:53 UTC (permalink / raw)
To: linux-btrfs

On Fri, 01 Jan 2016 00:14:37 -0800, Duncan wrote:

> Chris Murphy posted on Thu, 31 Dec 2015 18:22:09 -0700 as excerpted:
>
>> On Thu, Dec 31, 2015 at 4:36 PM, Alexander Duscheleit
>> <alexander.duschel...@gmail.com> wrote:
>>> [...]
>>
>> Why are you trying to mount only one? What mount options did you use
>> when you did this?
>
> Yes, please.

I was under the impression that a mount (or indeed any) command issued against a member of a multi-device btrfs would affect the whole multi-device filesystem.

>>> btrfs restore -viD seems to find most of the files accessible but
>>> since I don't have a spare hdd of sufficient size I would have to
>>> break the array and reformat and use one of the disks as restore
>>> target. I'm not prepared to do this before I know there is no other
>>> way to fix the drives since I'm essentially destroying one more
>>> chance at saving the data.
>
>> Anyway, in the meantime, my advice is do not mount either device rw
>> (together or separately). The fewer changes you make right now the
>> better.
>>
>> What kernel and btrfs-progs version are you using?

Sorry, I had this included in a paragraph I later removed. Kernel 4.3.3, btrfs-progs v4.3.1.

> Unless you've already tried it (hard to say without the mount options
> you used above), I'd first try a different tack than C Murphy
> suggests, falling back to what he suggests if it doesn't work. I
> suppose he assumes you've already tried this...
>
> But first things first, as C Murphy suggests, when you post problems
> like this, *PLEASE* post kernel and progs userspace versions. Given
> the rate at which btrfs is still changing, that's pretty critical
> information.
> Also, if you're not running the latest or second-latest kernel or LTS
> kernel series and a similar or newer userspace, be prepared to be
> asked to try a newer version. With the almost-released 4.4 set to be
> an LTS, that means 4.4 if you want to try it, or the LTS kernel series
> 4.1 and 3.18, or the current or previous current kernel series 4.3 or
> 4.2 (tho with 4.2 not being an LTS, updates are ended or close to it,
> so people on it should be either upgrading to 4.3 or downgrading to
> 4.1 LTS anyway). And for userspace, a good rule of thumb is: whatever
> the kernel series, a corresponding or newer userspace as well.
>
> With that covered...
>
> This is a good place to bring in something else CM recommended, but
> in a slightly different context. If you've read many of my previous
> posts you're likely to know what I'm about to say. The admin's first
> rule of backups says, in simplest form[1], that if you don't have a
> backup, by your actions you're defining the data that would be backed
> up as not worth the hassle and resources of making that backup. If in
> that case you lose the data, be happy, as you still saved what you
> defined by your actions as of /true/ value regardless of any claims
> to the contrary: the hassle and resources you would have spent making
> that backup. =:^)
>
> While the rule of backups applies in general, for btrfs it applies
> even more, because btrfs is still under heavy development, and while
> btrfs is stabilizING, it's not yet fully stable and mature, so the
> risk of actually needing to use that backup remains correspondingly
> higher than it'd ordinarily be.
>
> But you didn't mention having backups, and did mention that you
> didn't have a spare hdd so would have to break the array to have a
> place to do a btrfs restore to, which reads very much like you don't
> have ANY BACKUPS AT ALL!!
> > Of course, in the context of the above backups rule, I guess you
> > understand the implications: that you consider the value of that
> > data essentially throw-away, particularly since you still don't
> > have a backup, despite running a not entirely stable filesystem
> > that puts the data at greater risk than would a fully stable
> > filesystem.
> >
> > Which means no big deal. You've obviously saved the time, hassle
> > and resources necessary to make that backup, which is obviously of
> > more value to you than the data that's not backed up, so the data
> > is obviously of low enough value that you can simply blow away the
> > filesystem with a fresh mkfs and start over. =:^)
> >
> > Except... were that the case, you probably wouldn't be posting.
> >
> > Which brings entirely new urgency to what CM said about getting
> > that spare hdd, so you can actually create that backup, and count
> > yourself very lucky if you don't lose your data before you have it
> > backed up, since your previous actions were unfortunately not in
> > accordance with the value you seem to be claiming for the data.

Yes, there are things that rank higher in priority than backups of the data in question -- namely food and shelter. The mirror drives are all I could scrounge together after several months. The previous setup was a JBOD of 9 disks, none younger than 7 years. At the point of replacement I was so wary of the hardware giving in that I didn't even think about potential software issues.

I chose btrfs as a means to "future-proof" the storage. For me it won out against zfs for its superior re-shaping capability in terms of RAID modes and adding disks to existing arrays.
> > OK, the rest of this post is written with the assumption that your
> > claims and your actions regarding the value of the data in question
> > agree, and that since you're still trying to recover the data, you
> > don't consider it just throw-away, which means you now have
> > someplace to put that backup, should you actually be lucky enough
> > to get the chance to make it...

An additional drive of matching capacity won't be within my financial means for several months, sadly.

I DO still have the old drives in storage. While they are of very questionable reliability, I'm confident I can get most of the data back from those. None of it is *essential* data. I can always re-rip my music, re-download most of the other media and re-create the rest from raw sources. But given the hassle in time and bandwidth, I can invest some hours on and off to try to pull it from the drives as well.

> With your try to mount, did you try the degraded mount option? That's
> primarily what this post is about, as it's not clear you did, and
> it's what I'd try first: without that option, btrfs will normally
> refuse to mount if a device is missing, failing with the rather
> generic ctree open failure error, as your attempt did.
>
> And as CM suggests, trying the degraded,ro mount options together is
> a wise idea, at least at first, in order to help prevent further
> damage.
>
> If a degraded,ro mount fails, then it's time to try CM's suggestions.

I had tried a degraded,ro mount early on. I don't know why I didn't include that in my first mail.
The result is as follows:

[13984.341838] BTRFS info (device sdc2): allowing degraded mounts
[13984.341844] BTRFS info (device sdc2): disk space caching is enabled
[13984.341846] BTRFS: has skinny extents
[13984.538637] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13984.546327] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13984.552233] BTRFS: Failed to read block groups: -5
[13984.585375] BTRFS: open_ctree failed
[13997.313514] BTRFS info (device sdb2): allowing degraded mounts
[13997.313520] BTRFS info (device sdb2): disk space caching is enabled
[13997.313522] BTRFS: has skinny extents
[13997.522838] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13997.530175] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13997.538289] BTRFS: Failed to read block groups: -5
[13997.582019] BTRFS: open_ctree failed

> [...]

So I can't mount either disk as ro, and I can't afford another drive to store the data.

I can confirm that I can get at least a subset of the data off the drives via btrfs-restore. (In fact I already restored the only chunk of data that's newer than the old disk set AND not easily recreated, which makes the whole endeavour a bit less nerve-wracking.)

As I see it, my best course of action right now is wiping one of the two disks and then using btrfs restore to copy the data off the other disk onto the now-blank one. I'd expect to get back a large percentage of the inaccessible data that way. That is, unless someone tells me there's an easy fix for the "corrupt leaf, bad key order" fault and I've been chasing ghosts the whole time.

> ---
> [1] Sysadmin's first rule of backups: The more complex form covers
> multiple backups and accounts for the risk factor of actually needing
> to use them.
> It says that, for any level of backup, either you have it, or you
> consider the value of the data, multiplied by the risk factor of
> having to actually use that level of backup, to be less than the
> resource and hassle cost of making that backup. In this form, data
> such as your internet cache is probably not worth enough to justify
> even a single level of backup, while truly valuable data might be
> worth 101 levels of backup or more, some of them offsite and others
> onsite but not normally physically connected, because the value of
> the data, even multiplied by the extremely tiny chance of having 100
> levels of backup fail and actually needing that 101st level, still
> justifies having it.

The data is certainly worth another level of security; the problem is I can't afford it. Basically, the amount I have accumulated has outstripped my means to properly store it. I'm trying my best with what's available. And no, I wouldn't trust data to this storage that could have a financial or personal impact if lost.

--
Alex
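Alexander's recovery sequence -- a read-only degraded mount attempt, falling back to offline extraction -- looks roughly like this in command form (a sketch; device names follow the thread, the restore target path is an assumption):

```shell
# Safe first step: a read-only degraded mount makes no on-disk changes.
mount -o degraded,ro /dev/sdb2 /mnt

# If that fails (here: "corrupt leaf, bad key order"), extract files
# offline with btrfs restore:
#   -v verbose, -i ignore errors, -D dry run (only list what would be restored)
btrfs restore -viD /dev/sdb2 /path/to/target

# Once the dry-run listing looks sane, drop -D to actually copy data out.
btrfs restore -vi /dev/sdb2 /path/to/target
```

The dry run matters here because restore writes its output to the target; with no spare disk, it is the only way to gauge what is recoverable before sacrificing one of the mirror members as the target.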
* Re: Unrecoverable fs corruption? 2016-01-02 10:53 ` Alexander Duscheleit @ 2016-01-02 21:19 ` Henk Slager 2016-01-03 15:53 ` Duncan 2016-01-03 16:08 ` Duncan 1 sibling, 1 reply; 12+ messages in thread
From: Henk Slager @ 2016-01-02 21:19 UTC (permalink / raw)
To: Alexander Duscheleit; +Cc: linux-btrfs

[...]

> [13984.341838] BTRFS info (device sdc2): allowing degraded mounts
> [13984.341844] BTRFS info (device sdc2): disk space caching is enabled
> [13984.341846] BTRFS: has skinny extents
> [13984.538637] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13984.546327] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13984.552233] BTRFS: Failed to read block groups: -5
> [13984.585375] BTRFS: open_ctree failed
> [13997.313514] BTRFS info (device sdb2): allowing degraded mounts
> [13997.313520] BTRFS info (device sdb2): disk space caching is enabled
> [13997.313522] BTRFS: has skinny extents
> [13997.522838] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13997.530175] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13997.538289] BTRFS: Failed to read block groups: -5
> [13997.582019] BTRFS: open_ctree failed
>
>> [...]
>
> So I can't mount either disk as ro and I can't afford another drive
> to store the data.
>
> I can confirm that I can get at least a subset of the data off the
> drives via btrfs-restore. (In fact I already restored the only chunk of
> data that's newer than the old disk set AND not easily recreated, which
> makes the whole endeavour a bit less nerve-wracking.)
>
> As I see it, my best course of action right now is wiping one of the
> two disks and then using btrfs restore to copy the data off the other
> disk onto the now blank one. I'd expect to get back a large percentage
> of the inaccessible data that way.
> That is, unless someone tells me there's an easy fix for the "corrupt
> leaf, bad key order" fault and I've been chasing ghosts the whole
> time.

I once had this error:

BTRFS critical (device sdf1): corrupt leaf, slot offset bad: block=77130973184,root=1, slot=150

Not the same, but the 'corrupt leaf' part in my case was due to memory module bit-failures I had some time ago. At least I haven't seen these kinds of errors in other btrfs fs failure cases. In my case there were no raid profiles, and I could fix it with --repair. It also might be that your 'corrupt leaf,...' error was caused by the earlier --repair action; otherwise I wouldn't know from experience how to fix it.

If you think btrfs raid (I/O) fault handling etc. is not good enough yet, then instead of raid1 you might consider 2x single (dup for metadata), with one as the main/master fs and the other as the slave fs, created by send | receive (incremental). If you scrub both on a regular basis, and email (or similar) any error reports, you can act if something is wrong. And every now and then do a brute-force diff to verify that the contents of both filesystems (snapshots) are still the same.
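The master/slave scheme suggested here might be sketched as follows (paths and snapshot names are invented for illustration; /master and /slave are two independent single-device btrfs filesystems):

```shell
# One-time seeding: send requires a read-only snapshot.
btrfs subvolume snapshot -r /master/data /master/data.snap0
btrfs send /master/data.snap0 | btrfs receive /slave

# Incremental updates: send only the delta against the parent snapshot.
btrfs subvolume snapshot -r /master/data /master/data.snap1
btrfs send -p /master/data.snap0 /master/data.snap1 | btrfs receive /slave

# Regular integrity checks on both filesystems (-B: run in foreground).
btrfs scrub start -B /master
btrfs scrub start -B /slave
```

With dup metadata on each side, a scrub can still repair metadata corruption locally, while file data corruption on one filesystem can be recovered by copying from the other.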
* Re: Unrecoverable fs corruption? 2016-01-02 21:19 ` Henk Slager @ 2016-01-03 15:53 ` Duncan 2016-01-03 16:24 ` Martin Steigerwald 0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2016-01-03 15:53 UTC (permalink / raw)
To: linux-btrfs

Henk Slager posted on Sat, 02 Jan 2016 22:19:18 +0100 as excerpted:

> If you think btrfs raid (I/O)fault handling etc is not good enough yet,
> instead of raid1, you might consider 2x single (dup for metadata), with
> 1 the main/master fs and the other one the slave fs, created by send |
> receive (incremental). If you scrub both on regular basis, email or so
> the error cases, you can act if something is wrong.
> And every now and then do a brute-force diff to verify that contents of
> both filesystems (snapshots) are still the same.

Given the OP's situation -- that he was running btrfs in raid1 mode, and that a third device of similar capacity is simply out of the question due to cost at this point -- this approach, possibly generalized, is what I'd recommend as well.

RAID-1 is not a backup. And I'd strongly recommend that a backup take priority over a raid1 if there's simply not enough money for more devices. There are simply too many ways a raid1 can go wrong when there's no actual backup, including fat-fingering a deletion[1].

Now if the device capacity is sufficiently large, I'd actually recommend partitioning both devices up with two identically sized partitions on each. Then the first partition on each can be made into a raid1 forming the working copy, while the second partition on each can be a separate raid1 that's the backup. That way, there's both a backup and raid1 protection. That's actually what I'm doing here, pretty much.[2]

Of course, the partitioned raid1 working-and-backup solution does require that the data actually fit in half the space of a single device, and it may not, in which case this isn't an option. Which would bring us back to a working copy on one device and its backup on the other.
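The partitioned dual-raid1 layout described above could be set up along these lines (a sketch only -- the device names sdb/sdc and the 50/50 split are assumptions, and both commands destroy existing data):

```shell
# Split each disk into two identically sized GPT partitions.
parted -s /dev/sdb mklabel gpt mkpart work 0% 50% mkpart back 50% 100%
parted -s /dev/sdc mklabel gpt mkpart work 0% 50% mkpart back 50% 100%

# Working copy: raid1 (data and metadata) across the first partitions.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdc1

# Backup: a second, fully independent raid1 across the second partitions.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb2 /dev/sdc2
```

Because the two filesystems are independent, losing or fat-fingering the working copy leaves the backup filesystem untouched, while each still has raid1 redundancy against single-disk failure.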
But I'd actually consider making the backup not btrfs at all. What I use here for my second backups is the old reiserfs I was using before btrfs. That way, if it's a btrfs bug that takes out the one copy, you don't have to worry about the same btrfs bug taking out the backup when you try to fall back to it. It may not be particularly likely, and it does kill the chance of using btrfs send/receive to update the backup, but it significantly eases my mind when I'm in recovery mode, knowing my backup isn't subject to whatever btrfs bug put me in recovery mode in the first place.

(In the partitioned raid case, I'd consider making the backup mdraid1, with whatever filesystem on top, since other than btrfs and zfs, filesystems basically don't do raid, so it must be implemented below them. Or don't raid the backup, and simply make a primary backup on one device and a secondary backup on the other.)

---
[1] Fat-fingering a deletion: My own brown-bag "I became an admin that day" case was running a script, unfortunately as root, that I was debugging, where I did an rm -rf $somevar/*, with $somevar assigned earlier, only either the somevar in the assignment or the somevar in the rm line was typoed, so the var ended up empty and the command ended up as rm -rf /*. ...

I was *SO* glad I had a backup, not just a raid1, that day!

Needless to say, I also learned the lesson, the hard way, that either you don't debug your scripts as root, or if you are going to do so, you comment out rm lines and replace them with ls the first time thru. Or do a confirm-prompt with the command line printed first, and then copy/paste the confirmed version to the operational line, so there's no chance of typoing something different than the confirmed version.

[2] Dual raid1 working and backup copies on a pair of partitioned devices: My setup is actually rather more complex than that, but the details are not apropos to this discussion.

--
Duncan - List replies preferred.
No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
* Re: Unrecoverable fs corruption? 2016-01-03 15:53 ` Duncan @ 2016-01-03 16:24 ` Martin Steigerwald 0 siblings, 0 replies; 12+ messages in thread
From: Martin Steigerwald @ 2016-01-03 16:24 UTC (permalink / raw)
To: Btrfs BTRFS

Am Sonntag, 3. Januar 2016, 15:53:56 CET schrieben Sie:

> [1] Fat-fingering a deletion: My own brown-bag "I became an admin that
> day" case was running a script, unfortunately as root, that I was
> debugging, where I did an rm -rf $somevar/*, with $somevar assigned
> earlier, only either the somevar in the assignment or the somevar in the
> rm line was typoed, so the var ended up empty and the command ended up as
> rm -rf /*. ...
>
> I was *SO* glad I had a backup, not just a raid1, that day!

Epic. That's the one case GNU rm doesn't cover yet. It refuses rm -rf ., rm -rf .. and rm -rf / (unless you give a special argument), but there is not much it can do about rm -r /*, as the shell expands this before handing it to the command.

Thanks,
--
Martin
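Since the shell is the one doing the expansion, the shell is also where the guard can live: bash's ${var:?} parameter expansion aborts the command when the variable is empty or unset, so the rm in the footnote above would never have run. A small sketch (somevar echoes the footnote's hypothetical variable name):

```shell
#!/usr/bin/env bash
somevar=""   # simulate the typo that left the variable empty

# Unguarded, "$somevar"/* would expand to /* -- the catastrophic case.
# Guarded, the ${somevar:?message} expansion fails instead, and the
# subshell exits before rm is ever invoked.
( rm -rf "${somevar:?refusing to rm with empty somevar}"/* ) 2>/dev/null \
    && echo "rm ran" || echo "rm blocked"
# prints "rm blocked"
```

The subshell matters: in a non-interactive shell a failed ${var:?} expansion exits the whole shell, so wrapping the rm in ( ... ) confines the abort and lets the script report the failure and continue.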
* Re: Unrecoverable fs corruption? 2016-01-02 10:53 ` Alexander Duscheleit 2016-01-02 21:19 ` Henk Slager @ 2016-01-03 16:08 ` Duncan 1 sibling, 0 replies; 12+ messages in thread
From: Duncan @ 2016-01-03 16:08 UTC (permalink / raw)
To: linux-btrfs

Alexander Duscheleit posted on Sat, 02 Jan 2016 11:53:18 +0100 as excerpted:

> I was under the impression that a mount (actually any) command issued
> against a member of a multi-device btrfs would affect the whole
> multi-device.

Well, yes and no. Yes, when it mounts correctly. But with a multi-device btrfs, it can happen that btrfs doesn't yet know about all the devices when a mount is attempted, in which case the mount may fail (particularly without the degraded option), simply because it doesn't know about the other devices.

A btrfs device scan after all devices are available, but before the mount attempt, should fix this problem and allow a mount using any of the component devices. These days udev normally triggers a scan when any new device appears, so it seldom needs to be done manually. However, in udev-free setups, or in early boot before udev is up, udev obviously won't handle it and the mount can still fail.

Additionally, if a device is missing or damaged to the point that btrfs can't see it, btrfs will normally refuse a mount unless degraded is one of the mount options. And depending on the situation, degraded,ro may be needed.

While you mentioned in your reply, below this point, that you had tried degraded,ro, that wasn't in your original post, so we wanted the mount options you had actually tried, to see whether you had tried degraded,ro or not.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
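The device-discovery behavior described above, in command form (a sketch, not a recovery recipe; device names follow the thread):

```shell
# Make the kernel aware of all btrfs member devices; udev normally
# triggers this automatically as block devices appear.
btrfs device scan

# After a successful scan, naming any one member mounts the whole array:
mount /dev/sdb2 /mnt

# With a member missing or unreadable, btrfs refuses a normal mount, so
# degraded (ideally combined with ro, to avoid further changes) is needed:
mount -o degraded,ro /dev/sdb2 /mnt
```

This is why "mounting one disk" of a healthy raid1 really does mount the whole filesystem, while the same command against a damaged array fails with the generic open_ctree error unless degraded is given.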
end of thread, other threads:[~2016-01-06 7:36 UTC | newest]

Thread overview: 12+ messages:
2015-12-31 23:36 Unrecoverable fs corruption? Alexander Duscheleit
2016-01-01  1:22 ` Chris Murphy
2016-01-01  8:13 ` Duncan
2016-01-02  4:32 ` Christoph Anton Mitterer
2016-01-03 15:00 ` Duncan
2016-01-04  0:05 ` Christoph Anton Mitterer
2016-01-06  7:35 ` Duncan
2016-01-02 10:53 ` Alexander Duscheleit
2016-01-02 21:19 ` Henk Slager
2016-01-03 15:53 ` Duncan
2016-01-03 16:24 ` Martin Steigerwald
2016-01-03 16:08 ` Duncan