* Unrecoverable fs corruption? @ 2015-12-31 23:36 Alexander Duscheleit 2016-01-01 1:22 ` Chris Murphy 0 siblings, 1 reply; 12+ messages in thread
From: Alexander Duscheleit @ 2015-12-31 23:36 UTC (permalink / raw)
To: linux-btrfs

Hello,

I had a power failure today at my home server, and after the reboot the btrfs RAID1 won't come back up.

When trying to mount one of the 2 disks of the array I get the following error:

[ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled
[ 4126.316402] BTRFS: has skinny extents
[ 4126.337324] BTRFS: failed to read chunk tree on sdb2
[ 4126.353027] BTRFS: open_ctree failed

A btrfs check segfaults after a few seconds with the following message:

(0:29)[root@hera]~ # ❯❯❯ btrfs check /dev/sdb2
warning devid 1 not found already
bad key ordering 68 69
Checking filesystem on /dev/sdb2
UUID: d55fa866-3baa-4e73-bf3e-5fda29672df3
checking extents
bad key ordering 68 69
bad block 6513625202688
Errors found in extent allocation tree or chunk allocation
[1] 11164 segmentation fault btrfs check /dev/sdb2

I have 2 btrfs-images (one with -w, one without), but they are 6.1G and 1.1G respectively; I don't know if I can upload them at all, nor where to store such large files.

I did try a btrfs check --repair on one of the disks, which gave the following result:

enabling repair mode
warning devid 1 not found already
bad key ordering 68 69
repair mode will force to clear out log tree, Are you sure? [y/N]: y
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs[0x44161e]
btrfs(btrfs_reserve_extent+0xa7b)[0x4463db]
btrfs(btrfs_alloc_free_block+0x5f)[0x44649f]
btrfs(__btrfs_cow_block+0xc4)[0x437d64]
btrfs(btrfs_cow_block+0x35)[0x438365]
btrfs[0x43d3d6]
btrfs(btrfs_commit_transaction+0x95)[0x43f125]
btrfs(cmd_check+0x5ec)[0x429cdc]
btrfs(main+0x82)[0x40ef32]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f881f983610]
btrfs(_start+0x29)[0x40f039]

That's all I tried so far.
btrfs restore -viD seems to find most of the files accessible, but since I don't have a spare hdd of sufficient size, I would have to break the array, reformat, and use one of the disks as a restore target. I'm not prepared to do this before I know there is no other way to fix the drives, since I'd essentially be destroying one more chance at saving the data.

Is there anything I can do to get the fs out of this mess?

-- Alexander Duscheleit

^ permalink raw reply [flat|nested] 12+ messages in thread
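A note on the restore step mentioned above: btrfs restore's -D flag is a dry run, so listing what is recoverable writes nothing to the damaged filesystem. A sketch of the sequence follows; the device node and target path are assumptions, and each command is printed via an echo guard rather than executed:

```shell
# Sketch only: the device node and target path are assumptions.
# 'run' prints each command; drop the echo guard to execute for real.
run() { echo "would run: $*"; }

DEV=/dev/sdb2
TARGET=/mnt/restore-target   # hypothetical spare disk, mounted read-write

run btrfs restore -viD "$DEV" "$TARGET"  # -D: dry run, only list files
run btrfs restore -vi "$DEV" "$TARGET"   # actual restore, once a target exists
```

-v is verbose and -i ignores checksum errors while restoring, which is usually what you want on an already-damaged filesystem.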
* Re: Unrecoverable fs corruption? 2015-12-31 23:36 Unrecoverable fs corruption? Alexander Duscheleit @ 2016-01-01 1:22 ` Chris Murphy 2016-01-01 8:13 ` Duncan 0 siblings, 1 reply; 12+ messages in thread From: Chris Murphy @ 2016-01-01 1:22 UTC (permalink / raw) To: Alexander Duscheleit; +Cc: Btrfs BTRFS On Thu, Dec 31, 2015 at 4:36 PM, Alexander Duscheleit <alexander.duscheleit@gmail.com> wrote: > Hello, > > I had a power fail today at my home server and after the reboot the btrfs > RAID1 won't come back up. > > When trying to mount one of the 2 disks of the array I get the following > error: > [ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled > [ 4126.316402] BTRFS: has skinny extents > [ 4126.337324] BTRFS: failed to read chunk tree on sdb2 > [ 4126.353027] BTRFS: open_ctree failed Why are you trying to mount only one? What mount options did you use when you did this? > > a btrfs check segfaults after a few seconds with the following message: > (0:29)[root@hera]~ # ❯❯❯ btrfs check /dev/sdb2 > warning devid 1 not found already > bad key ordering 68 69 > Checking filesystem on /dev/sdb2 > UUID: d55fa866-3baa-4e73-bf3e-5fda29672df3 > checking extents > bad key ordering 68 69 > bad block 6513625202688 > Errors found in extent allocation tree or chunk allocation > [1] 11164 segmentation fault btrfs check /dev/sdb2 > > I have 2 btrfs-images (one with -w, one without) but they are 6.1G and 1.1G > repectively, I don't know > if I can upload them at all and also not where to store such large files. > > I did try a btrfs check --repair on one of the disks which gave the > following result: > enabling repair mode > warning devid 1 not found already > bad key ordering 68 69 > repair mode will force to clear out log tree, Are you sure? [y/N]: y > Unable to find block group for 0 > extent-tree.c:289: find_search_start: Assertion `1` failed. 
> btrfs[0x44161e]
> btrfs(btrfs_reserve_extent+0xa7b)[0x4463db]
> btrfs(btrfs_alloc_free_block+0x5f)[0x44649f]
> btrfs(__btrfs_cow_block+0xc4)[0x437d64]
> btrfs(btrfs_cow_block+0x35)[0x438365]
> btrfs[0x43d3d6]
> btrfs(btrfs_commit_transaction+0x95)[0x43f125]
> btrfs(cmd_check+0x5ec)[0x429cdc]
> btrfs(main+0x82)[0x40ef32]
> /usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f881f983610]
> btrfs(_start+0x29)[0x40f039]
>
>
> That's all I tried so far.
> btrfs restore -viD seems to find most of the files accessible but since I
> don't have a spare hdd of sufficient size I would have to break the array
> and reformat and use one of the disk as restore target. I'm not prepared
> to do this before I know there is no other way to fix the drives since I'm
> essentially destroying one more chance at saving the data.
>
> Is there anything I can do to get the fs out of this mess?

I'm skeptical about the logic of using --repair, which modifies the filesystem, on just one device of a two-device raid1, while saying you're reluctant to "break the array." It doesn't make sense to me to expect that such a modification on one of the drives keeps it at all consistent with the other. I hope a dev can say whether --repair with a missing device is a bad idea, because if so, maybe degraded repairs need a --force flag to keep users from making things worse.

Anyway, in the meantime, my advice is do not mount either device rw (together or separately). The fewer changes you make right now, the better.

What kernel and btrfs-progs version are you using?

Did you try to mount with -o recovery, or -o ro,recovery, before trying 'btrfs check --repair'? If so, post all relevant kernel messages. Don't try -o recovery now if you haven't previously tried it; it's probably safe to try -o ro,recovery if you haven't tried that yet. I would try -o ro,recovery three ways: both devs, and each dev separately (for which you'll use -o ro,recovery,degraded).
If that doesn't work, it sounds like it might be a task for 'btrfs rescue chunk-recover' which will take a long time. But I suggest waiting as long as possible for a reply, and in the meantime I suggest looking at getting another drive to use as spare so you can keep both of these drives. -- Chris Murphy ^ permalink raw reply [flat|nested] 12+ messages in thread
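The three read-only recovery mounts Chris suggests could be sketched as follows. Device nodes and the mountpoint are assumptions, and the echo guard prints each command rather than executing it, since mounts against a damaged filesystem should be run deliberately:

```shell
# Sketch of the three ro,recovery mount attempts; adjust device nodes and
# mountpoint for the actual system. 'run' prints instead of executing.
run() { echo "would run: $*"; }

MNT=/mnt/btrfs-recovery

run btrfs device scan                               # register all members first
run mount -o ro,recovery /dev/sdb2 "$MNT"           # 1: both devices present
run mount -o ro,recovery,degraded /dev/sdb2 "$MNT"  # 2: first device alone
run mount -o ro,recovery,degraded /dev/sdc2 "$MNT"  # 3: second device alone
```

(Later kernels renamed -o recovery to usebackuproot, but recovery is the correct spelling for the kernel era discussed in this thread.)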
* Re: Unrecoverable fs corruption? 2016-01-01 1:22 ` Chris Murphy @ 2016-01-01 8:13 ` Duncan 2016-01-02 4:32 ` Christoph Anton Mitterer 2016-01-02 10:53 ` Alexander Duscheleit 0 siblings, 2 replies; 12+ messages in thread
From: Duncan @ 2016-01-01 8:13 UTC (permalink / raw)
To: linux-btrfs

Chris Murphy posted on Thu, 31 Dec 2015 18:22:09 -0700 as excerpted:

> On Thu, Dec 31, 2015 at 4:36 PM, Alexander Duscheleit
> <alexander.duscheleit@gmail.com> wrote:
>> Hello,
>>
>> I had a power fail today at my home server and after the reboot the
>> btrfs RAID1 won't come back up.
>>
>> When trying to mount one of the 2 disks of the array I get the
>> following error:
>> [ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled
>> [ 4126.316402] BTRFS: has skinny extents
>> [ 4126.337324] BTRFS: failed to read chunk tree on sdb2
>> [ 4126.353027] BTRFS: open_ctree failed
>
> Why are you trying to mount only one? What mount options did you use
> when you did this?

Yes, please.

>> btrfs restore -viD seems to find most of the files accessible but since
>> I don't have a spare hdd of sufficient size I would have to break the
>> array and reformat and use one of the disk as restore target. I'm not
>> prepared to do this before I know there is no other way to fix the
>> drives since I'm essentially destroying one more chance at saving the
>> data.

> Anyway, in the meantime, my advice is do not mount either device rw
> (together or separately). The less changes you make right now the
> better.
>
> What kernel and btrfs-progs version are you using?

Unless you've already tried it (hard to say without the mount options you used above), I'd first try a different tack than C Murphy suggests, falling back to what he suggests if it doesn't work. I suppose he assumes you've already tried this...

But first things first, as C Murphy suggests, when you post problems like this, *PLEASE* post kernel and progs userspace versions.
Given the rate at which btrfs is still changing, that's pretty critical information. Also, if you're not running the latest or second latest kernel or LTS kernel series and a similar or newer userspace, be prepared to be asked to try a newer version. With the almost-released 4.4 set to be an LTS, that means 4.4 if you want to try it, or the LTS kernel series 4.1 and 3.18, or the current or previous current kernel series 4.3 or 4.2 (tho with 4.2 not being an LTS, updates are ended or close to it, so people on it should be either upgrading to 4.3 or downgrading to 4.1 LTS anyway). And for userspace, a good rule of thumb is: whatever the kernel series, a corresponding or newer userspace as well.

With that covered...

This is a good place to bring in something else CM recommended, but in a slightly different context. If you've read many of my previous posts you're likely to know what I'm about to say. The admin's first rule of backups says, in simplest form[1], that if you don't have a backup, by your actions you're defining the data that would be backed up as not worth the hassle and resources to do that backup. If in that case you lose the data, be happy, as you still saved what you defined by your actions as of /true/ value, regardless of any claims to the contrary: the hassle and resources you would have spent making that backup. =:^)

While the rule of backups applies in general, for btrfs it applies even more, because btrfs is still under heavy development; while btrfs is stabilizING, it's not yet fully stable and mature, so the risk of actually needing to use that backup remains correspondingly higher than it'd ordinarily be.

But, you didn't mention having backups, and did mention that you didn't have a spare hdd, so would have to break the array to have a place to do a btrfs restore to, which reads very much like you don't have ANY BACKUPS AT ALL!!
Of course, in the context of the above backups rule, I guess you understand the implications: that you consider the value of that data essentially throw-away, particularly since you still don't have a backup, despite running a not entirely stable filesystem that puts the data at greater risk than would a fully stable filesystem. Which means no big deal. You've obviously saved the time, hassle and resources necessary to make that backup, which is obviously of more value to you than the data that's not backed up, so the data is obviously of low enough value that you can simply blow away the filesystem with a fresh mkfs and start over. =:^)

Except... were that the case, you probably wouldn't be posting. Which brings entirely new urgency to what CM said about getting that spare hdd, so you can actually create that backup, and count yourself very lucky if you don't lose your data before you have it backed up, since your previous actions were unfortunately not in accordance with the value you seem to be claiming for the data.

OK, the rest of this post is written with the assumption that your claims and your actions regarding the value of the data in question agree, and that since you're still trying to recover the data, you don't consider it just throw-away, which means you now have someplace to put that backup, should you actually be lucky enough to get the chance to make it...

With your try to mount, did you try the degraded mount option? That's primarily what this post is about, as it's not clear you did, and it's what I'd try first: without that, btrfs will normally refuse to mount if a device is missing, failing with the rather generic ctree open failure error, as your attempt did. And as CM suggests, trying the degraded,ro mount options together is a wise idea, at least at first, in order to help prevent further damage.

If a degraded,ro mount fails, then it's time to try CM's suggestions.
If a degraded,ro mount succeeds, then do a btrfs device scan, and a btrfs filesystem show, and see if it shows both devices or just one. If you like you can also try a read-only scrub (a scrub without read-only will fail if the filesystem is read-only), to see if there's any corruption.

If after a device scan, a show still shows just one device, then the other device is truly damaged and your best bet is to try to recover from just the one device, see below.

If it shows both devices, then (after taking the opportunity while read-only mounted to do that backup to the other device we're assuming you now have) try unmounting and mounting again, normally. With luck it'll work, and the initial mount failure was due to btrfs only seeing the one device, as btrfs device scan hadn't been run to let it know of the other one yet. With the now normally mounted filesystem, I'd strongly suggest a btrfs scrub as first order of business, to try to get the two devices back in sync after the crash.

If on the degraded,ro mount, a btrfs device scan followed by btrfs fi show shows the filesystem still with only one device, the other device would appear to be dead as far as btrfs is concerned. In this case, you'll need to recover from the degraded-mount working device as if the second one had entirely failed.

What I'd do in this case, if you haven't done so already, is that read-only btrfs scrub, just to see where you are in terms of corruption on the remaining device. If it comes out clean, you will likely be able to recover with little if any data loss. If not, hopefully you can still recover most of it.

At this point, now that we're assuming that you have another device to make a backup to, if you haven't already, take the opportunity to do that backup to the other device.
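The triage Duncan describes after a successful degraded,ro mount, as a sketch; the mountpoint is an assumption, and the echo guard prints each command rather than executing it:

```shell
# Post-mount triage sketch: rescan devices, check membership, read-only scrub.
run() { echo "would run: $*"; }

MNT=/mnt/btrfs-recovery   # assumed mountpoint of the degraded,ro mount

run btrfs device scan              # re-register btrfs member devices
run btrfs filesystem show          # does the fs now list both devices?
run btrfs scrub start -Bdr "$MNT"  # -B foreground, -d per-device stats,
                                   # -r read-only (the fs is mounted ro)
```

The -r flag is what makes the scrub safe here: it checks checksums without attempting any repair writes.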
Be sure to unmount and remount that other device after the backup and test to be sure what's there is usable, because sysadmin's backups rule #2 is that a would-be backup that hasn't been tested isn't yet a backup, for the purposes of rule #1, because a backup isn't completed until it has been tested.

With the backup safely done and tested, you can now afford to attempt a bit riskier stuff on the existing btrfs. Even tho btrfs isn't recognizing that second device, let's be sure it doesn't suddenly decide to be recognized, complicating things. Either wipe the device (dd if=/dev/zero of=<unrecognized former btrfs device>; or better yet, run badblocks on it in destructive mode, to both wipe and test it at the same time), or if you're impatient, at least use wipefs on it, to wipe the superblock. Alternatively, do a temporary mkfs.btrfs on it, just to wipe the existing superblocks.

Now you can treat that device as a fresh device and replace the missing device on the degraded btrfs. First you need to remount the degraded filesystem rw, because you can't add/delete/replace devices on a read-only mounted filesystem.

How you do the replace depends on the kernel and userspace you're running, and newer versions make it far easier. With a reasonably current btrfs setup, you can use btrfs replace start, feeding it the ID number of the missing device and the device node (/dev/whatever) of the replacement device, plus the mountpoint path. See the btrfs-replace manpage.

But the ID parameter wasn't added until relatively recently. If you aren't running a recent enough btrfs, you can try missing in place of the missing device, but with some versions that didn't work either.

Older btrfs versions didn't have btrfs replace. If you're running something that old, you really should upgrade, but meanwhile you'll have to use a separate btrfs device add, followed by btrfs device delete (or remove; older versions only had delete, which remains an alias of remove in newer versions).
The add should be fast. The delete will take quite a long time, as it'll do a rebalance in the process.

Meanwhile, on some older versions, you often effectively got only one chance at the replace after mounting the filesystem writable: if you rebooted (or had a crash) with the filesystem still degraded, a bug would often prevent mounting degraded,rw again, only degraded,ro, and of course the replace couldn't continue, nor a new attempt be made, while the filesystem was mounted ro. In that case, the only option (if you didn't already have a current backup) was to use the read-only mount as a backup and copy the files elsewhere, because the existing filesystem was stuck in read-only mode. So keeping relatively current really does have its advantages. =:^)

Finally, repeating what I said above, this assumes you didn't try mounting with the degraded option, with or without ro, and that it works when you do, giving you a chance to at least copy the data off the read-only filesystem. If it doesn't, as CM evidently assumed, and if you don't have backups, then you have to fall back to CM's suggestions.

---
[1] Sysadmin's first rule of backups: The more complex form covers multiple backups and accounts for the risk factor of actually needing to use them. It says that for any level of backup, either you have it, or you consider the value of the data, multiplied by the risk factor of having to actually use that level of backup, to be less than the resource and hassle cost of making that backup. In this form, data such as your internet cache is probably not worth enough to justify even a single level of backup, while truly valuable data might be worth 101 levels of backup or more, some of them offsite and others onsite but not normally physically connected, because the data is truly valuable enough that, even multiplied by the extremely tiny chance of having 100 levels of backup fail and actually needing that 101st level, it justifies having it.
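Putting Duncan's wipe-then-replace sequence together as a sketch, assuming a btrfs-progs recent enough for replace-by-devid; the device nodes, devid and mountpoint are placeholders, and the echo guard prints each command rather than executing it:

```shell
# Replace-the-missing-device sketch; every path and the devid are assumptions.
run() { echo "would run: $*"; }

MNT=/mnt/btrfs-recovery
NEWDEV=/dev/sdc2   # the wiped, formerly unrecognized device
MISSING=1          # devid of the missing device, per 'btrfs filesystem show'

run wipefs -a "$NEWDEV"                    # clear stale superblocks first
run mount -o degraded,rw /dev/sdb2 "$MNT"  # replace needs a writable mount
run btrfs replace start -B "$MISSING" "$NEWDEV" "$MNT"

# Fallback for older progs without replace-by-devid:
run btrfs device add "$NEWDEV" "$MNT"
run btrfs device delete missing "$MNT"
```

The -B flag keeps btrfs replace in the foreground so you can watch it finish; without it, progress is checked via 'btrfs replace status'.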
-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-01 8:13 ` Duncan @ 2016-01-02 4:32 ` Christoph Anton Mitterer 2016-01-03 15:00 ` Duncan 2016-01-02 10:53 ` Alexander Duscheleit 1 sibling, 1 reply; 12+ messages in thread
From: Christoph Anton Mitterer @ 2016-01-02 4:32 UTC (permalink / raw)
To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1929 bytes --]

On Fri, 2016-01-01 at 08:13 +0000, Duncan wrote:
> you can also try a read-only scrub

OT: I just wondered, would a balance include everything a scrub includes (i.e. read+verify all data and rebuild any errors from the other devices / block copies)... of course in addition to also copying all "good" data... and perhaps with the difference that you don't get the detailed information as in scrub, but only the kernel log messages about errors?

> In this case,
> you'll need to recover from the degraded-mount working device as if
> the
> second one had entirely failed.
>
> What I'd do in this case, if you haven't done so already, is that
> read-
> only btrfs scrub, just to see where you are in terms of corruption on
> the
> remaining device.

I don't think that this is the best order of the steps - at least not when it's about precious data.

Doing a scrub at this phase would just read all data, telling you the status,... but first you should try to copy as much as possible (just in case the remaining good drive fails as well) and *then* do the scrub to see what's actually good or not.

Alternatively, the first step could be backing up to another drive in the sense of a dd-copy (beware of the problem of UUID collisions in btrfs: you MUST make sure here that the kernel doesn't see[0] devices with the same IDs, which is of course the case with dd, unless you write to e.g. an image file and not a device).

This has advantages and disadvantages:
- btrfs rebuild would only rebuild those blocks that are actually used...
so you need to do less reads from a possibly soon-to-be-dying device - OTOH, you only copy the blocks which btrfs thinks are actually used,... and if later it would turn out that there are filesystem corruptions in these, you don't have any other areas (with possibly older data) where you could try some last-resort-recoveries.. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
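Christoph's dd-to-an-image-file alternative, sketched below; writing to an image file rather than a raw device is what avoids the duplicate-UUID trap he warns about. To keep the sketch safe to run as-is it copies a scratch file; for the real recovery you'd substitute the device node (e.g. if=/dev/sdb2) and an image path on storage large enough to hold it:

```shell
# Demonstrated on a 1 MiB scratch file so this runs safely anywhere; the
# real command would read the device node instead (an assumption, adjust).
SRC=/tmp/dd-demo-source.bin   # stand-in for the surviving device
IMG=/tmp/dd-demo-copy.img     # image-file target sidesteps UUID collisions

head -c 1048576 /dev/urandom > "$SRC"
dd if="$SRC" of="$IMG" bs=65536 conv=sync,noerror 2>/dev/null
cmp -s "$SRC" "$IMG" && echo "image matches source"
```

conv=noerror keeps dd going past read errors on a failing device (sync pads the short blocks with zeros so offsets stay aligned), which is exactly the behavior wanted for a possibly-dying disk.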
* Re: Unrecoverable fs corruption? 2016-01-02 4:32 ` Christoph Anton Mitterer @ 2016-01-03 15:00 ` Duncan 2016-01-04 0:05 ` Christoph Anton Mitterer 0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2016-01-03 15:00 UTC (permalink / raw)
To: linux-btrfs

Christoph Anton Mitterer posted on Sat, 02 Jan 2016 05:32:21 +0100 as excerpted:

> On Fri, 2016-01-01 at 08:13 +0000, Duncan wrote:
>> you can also try a read-only scrub
> OT: I just wondered, would a balance include everything a scrub includes
> (i.e. read+verify all data and rebuild an errors on different devices /
> block copies)... of course in addition to also copying all "good"
> data... and perhaps with the difference, that you don't get that
> detailed information as in scrub but only the kernel log messages about
> errors?

AFAIK, no, at least not by design, as balance works at the chunk level, while scrub works inside chunks, verifying the checksums on each block.

But now that I think about it, balance does read the chunk in order to rewrite its contents, and that read, like all reads, should normally be checksum verified (except of course in the case of nodatasum, which nocow of course implies). So a balance completed without error /may/ effectively indicate a scrub would complete without error as well. But it wasn't specifically designed for that, and if it does so, it's only doing it because all reads are checksum verified, not because it's actually purposely doing a scrub.

And even if balance works to verify no checksum errors, I don't believe it would correct them or give you the detail on them that a scrub would. And if there is an error, it'd be a balance error, which might or might not actually be a scrub error.

>> In this case,
>> you'll need to recover from the degraded-mount working device as if the
>> second one had entirely failed.
>> >> What I'd do in this case, if you haven't done so already, is that read- >> only btrfs scrub, just to see where you are in terms of corruption on >> the remaining device. > I don't think that this is the best order of the steps - at least not > when it's about precious data. > > Doing a scrub at this phase, would just read all data, telling you the > status,... but first you should try to copy as much as possible (just in > case the remaining good drive fails as well) and *then* do the scrub to > see what's actually good or not. Good point. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-03 15:00 ` Duncan @ 2016-01-04 0:05 ` Christoph Anton Mitterer 2016-01-06 7:35 ` Duncan 0 siblings, 1 reply; 12+ messages in thread
From: Christoph Anton Mitterer @ 2016-01-04 0:05 UTC (permalink / raw)
To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2111 bytes --]

On Sun, 2016-01-03 at 15:00 +0000, Duncan wrote:
> But now that I think about it, balance does read the chunk in ordered
> to
> rewrite its contents, and that read, like all reads, should normally
> be
> checksum verified

That was my idea.... :)

> (except of course in the case of nodatasum, which nocow
> of course implies).

Though I haven't had the time so far to reply on the most recent posts in that thread,... I still haven't given up on the quest for checksumming of nodatacow'ed data ;-)

> So a balance completed without error /may/
> effectively indicate a scrub would complete without error as
> well. But
> it wasn't specifically designed for that, and if it does so, it's
> only
> doing it because all reads are checksum verified, not because it's
> actually purposely doing a scrub.

Well sure... this is however an interesting concept to think about for the long-term future. I'd expect that in some distant future, we'd have powerful userland tools that do maintenance and health monitoring of btrfs filesystems, including e.g. automated scrubs, defrags and so on.

Especially on large filesystems, all these operations tend to take large amounts of time and may even impact the lifetime of the storage device(s)... so it would be clever if certain such operations could be kinda "merged", at least for the purposes of getting the results. As in the above example, if one would anyway run a full balance, the next scrub could be skipped, because the balance has effectively just done one. Similar for defrag.

> And even if balance works to verify no checksum errors, I don't
> believe
> it would correct them or give you the detail on them that a scrub
> would.
I'd have expected that read errors are repaired as soon as they're encountered (where possible, thanks to the other block copies)... isn't that the case?

> And if there is an error, it'd be a balance error, which might or
> might
> not actually be a scrub error.

Sure, but it shouldn't be difficult to collect e.g. scrub stats during balance as well. :-)

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-04 0:05 ` Christoph Anton Mitterer @ 2016-01-06 7:35 ` Duncan 0 siblings, 0 replies; 12+ messages in thread
From: Duncan @ 2016-01-06 7:35 UTC (permalink / raw)
To: linux-btrfs

Christoph Anton Mitterer posted on Mon, 04 Jan 2016 01:05:02 +0100 as excerpted:

> On Sun, 2016-01-03 at 15:00 +0000, Duncan wrote:
>> But now that I think about it, balance does read the chunk in ordered
>> to rewrite its contents, and that read, like all reads, should normally
>> be checksum verified
> That was my idea.... :)
>
>> (except of course in the case of nodatasum, which nocow
>> of course implies).
> Though I haven't had the time so far to reply on the most recent posts
> in that thread,... I still haven't given up on the quest for
> checksumming of nodatacow'ed data ;-)

Following the lines of the btrfs-convert discussion elsewhere, I don't believe the current devs to be too interested in this at the current time, tho maybe in the "bluesky" timeframe, beyond five years out, likely more like ten, because most of them believe it to be cost/benefit impractical to work on.

However, much like btrfs-convert, if a (probably new) developer finds this his particular itch he wants to scratch, and puts in the seriously high level of effort to get it to work, and it's all up to code standard, perhaps. But it's going to have to pass a pretty high level of skepticism, and in general it's simply not considered worth the incredible level of effort that would be necessary, so it's going to take a developer with a pretty intense itch to scratch over a period, very likely, of some years, by the time the code can be both demonstrated theoretically correct and pass regression tests and skepticism, to get it to the level where it could be properly included. IOW, not impossible, but as close as it gets.
I'd say the chances of seeing this in mainline (not just a series of patches carried by someone else) in anything under say 7 years is well under 5%, probably under 2%. The chances at say 15 years... maybe 15%. (That said, if you look at ext4 as an example, it has grown a bunch of exotic options over time, that most people will never use but that scratched someone's itch. Btrfs could be getting similar, at 7+ years out, so it's possible, and at that viewpoint, some may even consider the chances near 50% at the 10 year out mark. I'm skeptical, but I wouldn't have considered all those weird things now possible in ext4 likely to ever reach mainline ext4, either, so...) But I honestly don't expect current devs to spend much time on the proposal, at least not in the 7- year timeframe. > Especially on large filesystems all these operations tend to take large > amounts of time and may even impact the lifetime of the storage > device(s)... so it would be clever if certain such operations could be > kinda "merged", at least for the purposes of getting the results. > As in the above example, if one would anyway run a full balance, the > next scrub may be skipped because one is just doing one. > Similar for defrag. Well, balance definitely doesn't do defrag. By analogy, balance is at the UN, nation to nation, level, while defrag is at the city precinct level. They're simply out of each other's scope. Which isn't to say that at some point in the future, there won't be some btrfs doitall command, that does scrub and balance and defrag and recompression and ... all in a single pass, taking parameters from all the individual functions. But as you say, that's likely to be at least intermediate future, 3-5 years out, maybe 5-7 years out or more. And like btrfs-convert, I'd consider it in the "not a core tool, but nice to have" category. 
>> And even if balance works to verify no checksum errors, I don't believe
>> it would correct them or give you the detail on them that a scrub
>> would.
> I'd have expected that that read errors are (if possible because of
> block copies) are repaired as soon as they're encountered... isn't that
> the case?

(My understanding is that...)

At the balance level, checksum corruption errors aren't going to be fixed from the other copy or from parity, because unlike normal file usage, the other copy isn't read -- balance isn't worried about file or extent level corruption, and any it would find would be simply a byproduct of the normal read-time checksum verification process; it's simply moving chunks around. Such errors would thus simply cause the balance to abort, with whatever balance-time error that wouldn't even necessarily reflect that it's a checksum error.

Assuming that's correct, a completed balance could be assumed to have in addition the meaning of a scrub completed without any errors, but a failed balance could have failed for one of any number of reasons and with one of various balance-level errors, with such a failure yielding little or no clue as to scrub status.

>> And if there is an error, it'd be a balance error, which might or might
>> not actually be a scrub error.
> Sure, but it shouldn't be difficult to collect e.g. scrub stats during
> balance as well.

Given that as of now they're still struggling to manage balance's memory requirements in order to let it scale more efficiently, and that scaling, particularly in the presence of large numbers of subvolumes and with quotas, remains the single biggest issue, the devs are extremely unlikely to want to add additional memory requirements in order to additionally track scrub stats. Even once the current scaling issues are resolved, I don't see it being a useful option for balance itself, precisely because of the scaling issues, then on potentially embedded systems running TB-scale storage.
But there might indeed be some place for it in the still very theoretical btrfs doitall command you proposed and I named doitall, above. Embedded- scale applications would simply not run that command, instead running the lower resource individual commands, while doitall could say check that it had a minimum of 16 GiB of memory or whatever to use, and exit with an error if not, so it could optionally be run on systems with the required resources. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Unrecoverable fs corruption? 2016-01-01 8:13 ` Duncan 2016-01-02 4:32 ` Christoph Anton Mitterer @ 2016-01-02 10:53 ` Alexander Duscheleit 2016-01-02 21:19 ` Henk Slager 2016-01-03 16:08 ` Duncan 1 sibling, 2 replies; 12+ messages in thread
From: Alexander Duscheleit @ 2016-01-02 10:53 UTC (permalink / raw)
To: linux-btrfs

On Fri, 01 Jan 2016 00:14:37 -0800, Duncan wrote:

> Chris Murphy posted on Thu, 31 Dec 2015 18:22:09 -0700 as excerpted:
>
>> On Thu, Dec 31, 2015 at 4:36 PM, Alexander Duscheleit
>> <alexander.duschel...@gmail.com> wrote:
>>> [...]
>>
>> Why are you trying to mount only one? What mount options did you use
>> when you did this?
>
> Yes, please.

I was under the impression that a mount (or indeed any) command issued against a member of a multi-device btrfs would affect the whole multi-device filesystem.

>>> btrfs restore -viD seems to find most of the files accessible but
>>> since I don't have a spare hdd of sufficient size I would have to
>>> break the array and reformat and use one of the disks as restore
>>> target. I'm not prepared to do this before I know there is no other
>>> way to fix the drives since I'm essentially destroying one more
>>> chance at saving the data.
>
>> Anyway, in the meantime, my advice is do not mount either device rw
>> (together or separately). The fewer changes you make right now the
>> better.
>>
>> What kernel and btrfs-progs version are you using?

Sorry, I had this included in a paragraph I later removed. Kernel 4.3.3, btrfs-progs v4.3.1.

> Unless you've already tried it (hard to say without the mount options
> you used above), I'd first try a different tack than C Murphy
> suggests, falling back to what he suggests if it doesn't work. I
> suppose he assumes you've already tried this...
>
> But first things first, as C Murphy suggests, when you post problems
> like this, *PLEASE* post kernel and progs userspace versions. Given
> the rate at which btrfs is still changing, that's pretty critical
> information.
> Also, if you're not running the latest or second-latest kernel or LTS
> kernel series and a similar or newer userspace, be prepared to be
> asked to try a newer version. With the almost-released 4.4 set to be
> an LTS, that means 4.4 if you want to try it, or the LTS kernel series
> 4.1 and 3.18, or the current or previous current kernel series 4.3 or
> 4.2 (tho with 4.2 not being an LTS, updates are ended or close to it,
> so people on it should be either upgrading to 4.3 or downgrading to
> 4.1 LTS anyway). And for userspace, a good rule of thumb is: whatever
> the kernel series, a corresponding or newer userspace as well.
>
> With that covered...
>
> This is a good place to bring in something else CM recommended, but
> in a slightly different context. If you've read many of my previous
> posts you're likely to know what I'm about to say. The admin's first
> rule of backups says, in simplest form[1], that if you don't have a
> backup, by your actions you're defining the data that would be backed
> up as not worth the hassle and resources of making that backup. If in
> that case you lose the data, be happy, as you still saved what you
> defined by your actions as of /true/ value regardless of any claims
> to the contrary: the hassle and resources you would have spent making
> that backup. =:^)
>
> While the rule of backups applies in general, for btrfs it applies
> even more, because btrfs is still under heavy development, and while
> btrfs is stabilizING, it's not yet fully stable and mature, so the
> risk of actually needing to use that backup remains correspondingly
> higher than it'd ordinarily be.
>
> But you didn't mention having backups, and did mention that you
> didn't have a spare hdd so would have to break the array to have a
> place to do a btrfs restore to, which reads very much like you don't
> have ANY BACKUPS AT ALL!!
> > Of course, in the context of the above backups rule, I guess you
> > understand the implications: that you consider the value of that
> > data essentially throw-away, particularly since you still don't
> > have a backup, despite running a not entirely stable filesystem
> > that puts the data at greater risk than would a fully stable
> > filesystem.
> >
> > Which means no big deal. You've obviously saved the time, hassle
> > and resources necessary to make that backup, which is obviously of
> > more value to you than the data that's not backed up, so the data
> > is obviously of low enough value that you can simply blow away the
> > filesystem with a fresh mkfs and start over. =:^)
> >
> > Except... were that the case, you probably wouldn't be posting.
> >
> > Which brings entirely new urgency to what CM said about getting
> > that spare hdd, so you can actually create that backup, and count
> > yourself very lucky if you don't lose your data before you have it
> > backed up, since your previous actions were unfortunately not in
> > accordance with the value you seem to be claiming for the data.

Yes, there are things that rank higher in priority than backups of the data in question -- namely food and shelter. The mirror drives are all I could scrounge together after several months. The previous setup was a JBOD of 9 disks, none younger than 7 years. At the point of replacement I was so wary of the hardware giving in that I didn't even think about potential software issues.

I chose btrfs as a means to "future-proof" the storage. For me it won out against zfs for its superior re-shaping capability in terms of RAID modes and adding disks to existing arrays.
> > OK, the rest of this post is written with the assumption that your
> > claims and your actions regarding the value of the data in question
> > agree, and that since you're still trying to recover the data, you
> > don't consider it just throw-away, which means you now have
> > someplace to put that backup, should you actually be lucky enough
> > to get the chance to make it...

An additional drive of matching capacity won't be within my financial means for several months, sadly.

I DO still have the old drives in storage. While they are of very questionable reliability, I'm confident I can get most of the data back from those. None of it is *essential* data. I can always re-rip my music, re-download most of the other media and re-create the rest from raw sources. But given the hassle in time and bandwidth, I can invest some hours on and off to try to pull it from the drives as well.

> With your try to mount, did you try the degraded mount option? That's
> primarily what this post is about, as it's not clear you did, and
> it's what I'd try first: without that option, btrfs will normally
> refuse to mount if a device is missing, failing with the rather
> generic ctree open failure error, as your attempt did.
>
> And as CM suggests, trying the degraded,ro mount options together is
> a wise idea, at least at first, in order to help prevent further
> damage.
>
> If a degraded,ro mount fails, then it's time to try CM's suggestions.

I had tried a degraded,ro mount early on. I don't know why I didn't include that in my first mail.
The result is as follows:

[13984.341838] BTRFS info (device sdc2): allowing degraded mounts
[13984.341844] BTRFS info (device sdc2): disk space caching is enabled
[13984.341846] BTRFS: has skinny extents
[13984.538637] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13984.546327] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13984.552233] BTRFS: Failed to read block groups: -5
[13984.585375] BTRFS: open_ctree failed
[13997.313514] BTRFS info (device sdb2): allowing degraded mounts
[13997.313520] BTRFS info (device sdb2): disk space caching is enabled
[13997.313522] BTRFS: has skinny extents
[13997.522838] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13997.530175] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
[13997.538289] BTRFS: Failed to read block groups: -5
[13997.582019] BTRFS: open_ctree failed

> [...]

So I can't mount either disk as ro, and I can't afford another drive to store the data.

I can confirm that I can get at least a subset of the data off the drives via btrfs-restore. (In fact I already restored the only chunk of data that's newer than the old disk set AND not easily recreated, which makes the whole endeavour a bit less nerve-wracking.)

As I see it, my best course of action right now is wiping one of the two disks and then using btrfs restore to copy the data off the other disk onto the now-blank one. I'd expect to get back a large percentage of the inaccessible data that way. That is, unless someone tells me there's an easy fix for the "corrupt leaf, bad key order" fault and I've been chasing ghosts the whole time.

> ---
> [1] Sysadmin's first rule of backups: The more complex form covers
> multiple backups and accounts for the risk factor of actually needing
> to use them.
> It says that, for any level of backup, either you have it, or you
> consider the value of the data, multiplied by the risk factor of
> having to actually use that level of backup, to be less than the
> resource and hassle cost of making that backup. In this form, data
> such as your internet cache is probably not worth enough to justify
> even a single level of backup, while truly valuable data might be
> worth 101 levels of backup or more, some of them offsite and others
> onsite but not normally physically connected, because the value of
> the data, even multiplied by the extremely tiny chance of having 100
> levels of backup fail and actually needing that 101st level, still
> justifies having it.

The data is certainly worth another level of security; the problem is I can't afford it. Basically, the amount I have accumulated has outstripped my means to properly store it. I'm trying my best with what's available. And no, I wouldn't trust data to this storage that could have a financial or personal impact if lost.

--
Alex
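Alexander's recovery sequence -- a read-only degraded mount attempt, falling back to offline extraction -- looks roughly like this in command form (a sketch; device names follow the thread, the restore target path is an assumption):

```shell
# Safe first step: a read-only degraded mount makes no on-disk changes.
mount -o degraded,ro /dev/sdb2 /mnt

# If that fails (here: "corrupt leaf, bad key order"), extract files
# offline with btrfs restore:
#   -v verbose, -i ignore errors, -D dry run (only list what would be restored)
btrfs restore -viD /dev/sdb2 /path/to/target

# Once the dry-run listing looks sane, drop -D to actually copy data out.
btrfs restore -vi /dev/sdb2 /path/to/target
```

The dry run matters here because restore writes its output to the target; with no spare disk, it is the only way to gauge what is recoverable before sacrificing one of the mirror members as the target.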
* Re: Unrecoverable fs corruption? 2016-01-02 10:53 ` Alexander Duscheleit @ 2016-01-02 21:19 ` Henk Slager 2016-01-03 15:53 ` Duncan 2016-01-03 16:08 ` Duncan 1 sibling, 1 reply; 12+ messages in thread
From: Henk Slager @ 2016-01-02 21:19 UTC (permalink / raw)
To: Alexander Duscheleit; +Cc: linux-btrfs

[...]

> [13984.341838] BTRFS info (device sdc2): allowing degraded mounts
> [13984.341844] BTRFS info (device sdc2): disk space caching is enabled
> [13984.341846] BTRFS: has skinny extents
> [13984.538637] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13984.546327] BTRFS critical (device sdc2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13984.552233] BTRFS: Failed to read block groups: -5
> [13984.585375] BTRFS: open_ctree failed
> [13997.313514] BTRFS info (device sdb2): allowing degraded mounts
> [13997.313520] BTRFS info (device sdb2): disk space caching is enabled
> [13997.313522] BTRFS: has skinny extents
> [13997.522838] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13997.530175] BTRFS critical (device sdb2): corrupt leaf, bad key order: block=6513625202688,root=1, slot=68
> [13997.538289] BTRFS: Failed to read block groups: -5
> [13997.582019] BTRFS: open_ctree failed
>
>> [...]
>
> So I can't mount either disk as ro and I can't afford another drive
> to store the data.
>
> I can confirm that I can get at least a subset of the data off the
> drives via btrfs-restore. (In fact I already restored the only chunk of
> data that's newer than the old disk set AND not easily recreated, which
> makes the whole endeavour a bit less nerve-wracking.)
>
> As I see it, my best course of action right now is wiping one of the
> two disks and then using btrfs restore to copy the data off the other
> disk onto the now blank one. I'd expect to get back a large percentage
> of the inaccessible data that way.
> That is, unless someone tells me there's an easy fix for the "corrupt
> leaf, bad key order" fault and I've been chasing ghosts the whole
> time.

I once had this error:

BTRFS critical (device sdf1): corrupt leaf, slot offset bad: block=77130973184,root=1, slot=150

Not the same, but the 'corrupt leaf' part in my case was due to memory module bit-failures I had some time ago. At least I haven't seen these kinds of errors in other btrfs fs failure cases. In my case there were no raid profiles, and I could fix it with --repair. It also might be that your 'corrupt leaf,...' error was caused by the earlier --repair action; otherwise I wouldn't know from experience how to fix it.

If you think btrfs raid (I/O) fault handling etc. is not good enough yet, then instead of raid1 you might consider 2x single (dup for metadata), with one as the main/master fs and the other as the slave fs, created by send | receive (incremental). If you scrub both on a regular basis, and email (or similar) any error reports, you can act if something is wrong. And every now and then do a brute-force diff to verify that the contents of both filesystems (snapshots) are still the same.
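The master/slave scheme suggested here might be sketched as follows (paths and snapshot names are invented for illustration; /master and /slave are two independent single-device btrfs filesystems):

```shell
# One-time seeding: send requires a read-only snapshot.
btrfs subvolume snapshot -r /master/data /master/data.snap0
btrfs send /master/data.snap0 | btrfs receive /slave

# Incremental updates: send only the delta against the parent snapshot.
btrfs subvolume snapshot -r /master/data /master/data.snap1
btrfs send -p /master/data.snap0 /master/data.snap1 | btrfs receive /slave

# Regular integrity checks on both filesystems (-B: run in foreground).
btrfs scrub start -B /master
btrfs scrub start -B /slave
```

With dup metadata on each side, a scrub can still repair metadata corruption locally, while file data corruption on one filesystem can be recovered by copying from the other.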
* Re: Unrecoverable fs corruption? 2016-01-02 21:19 ` Henk Slager @ 2016-01-03 15:53 ` Duncan 2016-01-03 16:24 ` Martin Steigerwald 0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2016-01-03 15:53 UTC (permalink / raw)
To: linux-btrfs

Henk Slager posted on Sat, 02 Jan 2016 22:19:18 +0100 as excerpted:

> If you think btrfs raid (I/O)fault handling etc is not good enough yet,
> instead of raid1, you might consider 2x single (dup for metadata), with
> 1 the main/master fs and the other one the slave fs, created by send |
> receive (incremental). If you scrub both on regular basis, email or so
> the error cases, you can act if something is wrong.
> And every now and then do a brute-force diff to verify that contents of
> both filesystems (snapshots) are still the same.

Given the OP's situation -- that he was running btrfs in raid1 mode, and that a third device of similar capacity is simply out of the question due to cost at this point -- this approach, possibly generalized, is what I'd recommend as well.

RAID-1 is not a backup. And I'd strongly recommend that a backup take priority over a raid1 if there's simply not enough money for more devices. There are simply too many ways a raid1 can go wrong when there's no actual backup, including fat-fingering a deletion[1].

Now if the device capacity is sufficiently large, I'd actually recommend partitioning both devices up with two identically sized partitions on each. Then the first partition on each can be made into a raid1 forming the working copy, while the second partition on each can be a separate raid1 that's the backup. That way, there's both a backup and raid1 protection. That's actually what I'm doing here, pretty much.[2]

Of course, the partitioned raid1 working-and-backup solution does require that the data actually fit in half the space of a single device, and it may not, in which case this isn't an option. Which would bring us back to a working copy on one device and its backup on the other.
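The partitioned dual-raid1 layout described above could be set up along these lines (a sketch only -- the device names sdb/sdc and the 50/50 split are assumptions, and both commands destroy existing data):

```shell
# Split each disk into two identically sized GPT partitions.
parted -s /dev/sdb mklabel gpt mkpart work 0% 50% mkpart back 50% 100%
parted -s /dev/sdc mklabel gpt mkpart work 0% 50% mkpart back 50% 100%

# Working copy: raid1 (data and metadata) across the first partitions.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdc1

# Backup: a second, fully independent raid1 across the second partitions.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb2 /dev/sdc2
```

Because the two filesystems are independent, losing or fat-fingering the working copy leaves the backup filesystem untouched, while each still has raid1 redundancy against single-disk failure.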
But I'd actually consider making the backup not btrfs at all. What I use here for my second backups is the old reiserfs I was using before btrfs. That way, if it's a btrfs bug that takes out the one copy, you don't have to worry about the same btrfs bug taking out the backup when you try to fall back to it. It may not be particularly likely, and it does kill the chance of using btrfs send/receive to update the backup, but it significantly eases my mind when I'm in recovery mode, knowing my backup isn't subject to whatever btrfs bug put me in recovery mode in the first place.

(In the partitioned raid case, I'd consider making the backup mdraid1, with whatever filesystem on top, since other than btrfs and zfs, filesystems basically don't do raid, so it must be implemented below them. Or don't raid the backup, and simply make a primary backup on one device and a secondary backup on the other.)

---
[1] Fat-fingering a deletion: My own brown-bag "I became an admin that day" case was running a script, unfortunately as root, that I was debugging, where I did an rm -rf $somevar/*, with $somevar assigned earlier, only either the somevar in the assignment or the somevar in the rm line was typoed, so the var ended up empty and the command ended up as rm -rf /*. ...

I was *SO* glad I had a backup, not just a raid1, that day!

Needless to say, I also learned the lesson, the hard way, that either you don't debug your scripts as root, or if you are going to do so, you comment out rm lines and replace them with ls the first time thru. Or do a confirm-prompt with the command line printed first, and then copy/paste the confirmed version to the operational line, so there's no chance of typoing something different than the confirmed version.

[2] Dual raid1 working and backup copies on a pair of partitioned devices: My setup is actually rather more complex than that, but the details are not apropos to this discussion.

--
Duncan - List replies preferred.
No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
* Re: Unrecoverable fs corruption? 2016-01-03 15:53 ` Duncan @ 2016-01-03 16:24 ` Martin Steigerwald 0 siblings, 0 replies; 12+ messages in thread
From: Martin Steigerwald @ 2016-01-03 16:24 UTC (permalink / raw)
To: Btrfs BTRFS

Am Sonntag, 3. Januar 2016, 15:53:56 CET schrieben Sie:

> [1] Fat-fingering a deletion: My own brown-bag "I became an admin that
> day" case was running a script, unfortunately as root, that I was
> debugging, where I did an rm -rf $somevar/*, with $somevar assigned
> earlier, only either the somevar in the assignment or the somevar in the
> rm line was typoed, so the var ended up empty and the command ended up as
> rm -rf /*. ...
>
> I was *SO* glad I had a backup, not just a raid1, that day!

Epic. That's the one case GNU rm doesn't cover yet. It refuses rm -rf ., rm -rf .. and rm -rf / (unless you give a special argument), but there is not much it can do about rm -r /*, as the shell expands this before handing it to the command.

Thanks,
--
Martin
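Since the shell is the one doing the expansion, the shell is also where the guard can live: bash's ${var:?} parameter expansion aborts the command when the variable is empty or unset, so the rm in the footnote above would never have run. A small sketch (somevar echoes the footnote's hypothetical variable name):

```shell
#!/usr/bin/env bash
somevar=""   # simulate the typo that left the variable empty

# Unguarded, "$somevar"/* would expand to /* -- the catastrophic case.
# Guarded, the ${somevar:?message} expansion fails instead, and the
# subshell exits before rm is ever invoked.
( rm -rf "${somevar:?refusing to rm with empty somevar}"/* ) 2>/dev/null \
    && echo "rm ran" || echo "rm blocked"
# prints "rm blocked"
```

The subshell matters: in a non-interactive shell a failed ${var:?} expansion exits the whole shell, so wrapping the rm in ( ... ) confines the abort and lets the script report the failure and continue.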
* Re: Unrecoverable fs corruption? 2016-01-02 10:53 ` Alexander Duscheleit 2016-01-02 21:19 ` Henk Slager @ 2016-01-03 16:08 ` Duncan 1 sibling, 0 replies; 12+ messages in thread
From: Duncan @ 2016-01-03 16:08 UTC (permalink / raw)
To: linux-btrfs

Alexander Duscheleit posted on Sat, 02 Jan 2016 11:53:18 +0100 as excerpted:

> I was under the impression that a mount (actually any) command issued
> against a member of a multi-device btrfs would affect the whole
> multi-device.

Well, yes and no. Yes, when it mounts correctly. But with a multi-device btrfs, it can happen that btrfs doesn't yet know about all the devices when a mount is attempted, in which case the mount may fail (particularly without the degraded option), simply because it doesn't know about the other devices.

A btrfs device scan after all devices are available, but before the mount attempt, should fix this problem and allow a mount using any of the component devices. These days udev normally triggers a scan when any new device appears, so it seldom needs to be done manually. However, in udev-free setups, or in early boot before udev is up, udev obviously won't handle it and the mount can still fail.

Additionally, if a device is missing or damaged to the point that btrfs can't see it, btrfs will normally refuse a mount unless degraded is one of the mount options. And depending on the situation, degraded,ro may be needed.

While you mentioned in your reply, below this point, that you had tried degraded,ro, that wasn't in your original post, so we wanted the mount options you had actually tried, to see whether you had tried degraded,ro or not.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
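The device-discovery behavior described above, in command form (a sketch, not a recovery recipe; device names follow the thread):

```shell
# Make the kernel aware of all btrfs member devices; udev normally
# triggers this automatically as block devices appear.
btrfs device scan

# After a successful scan, naming any one member mounts the whole array:
mount /dev/sdb2 /mnt

# With a member missing or unreadable, btrfs refuses a normal mount, so
# degraded (ideally combined with ro, to avoid further changes) is needed:
mount -o degraded,ro /dev/sdb2 /mnt
```

This is why "mounting one disk" of a healthy raid1 really does mount the whole filesystem, while the same command against a damaged array fails with the generic open_ctree error unless degraded is given.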
end of thread, other threads:[~2016-01-06 7:36 UTC | newest]

Thread overview: 12+ messages:
2015-12-31 23:36 Unrecoverable fs corruption? Alexander Duscheleit
2016-01-01  1:22 ` Chris Murphy
2016-01-01  8:13 ` Duncan
2016-01-02  4:32 ` Christoph Anton Mitterer
2016-01-03 15:00 ` Duncan
2016-01-04  0:05 ` Christoph Anton Mitterer
2016-01-06  7:35 ` Duncan
2016-01-02 10:53 ` Alexander Duscheleit
2016-01-02 21:19 ` Henk Slager
2016-01-03 15:53 ` Duncan
2016-01-03 16:24 ` Martin Steigerwald
2016-01-03 16:08 ` Duncan