linux-btrfs.vger.kernel.org archive mirror
From: Matthew Dawson <matthew@mjdsystems.ca>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Kai Krakow <hurikhan77+btrfs@gmail.com>, linux-btrfs@vger.kernel.org
Subject: Re: Help recovering filesystem (if possible)
Date: Wed, 24 Nov 2021 00:11:46 -0500	[thread overview]
Message-ID: <4306866.vuYhMxLoTh@ring00> (raw)
In-Reply-To: <20211124044343.GF17148@hungrycats.org>

On Tuesday, November 23, 2021 11:43:43 P.M. EST Zygo Blaxell wrote:
> On Thu, Nov 18, 2021 at 11:42:05PM -0500, Matthew Dawson wrote:
> > On Thursday, November 18, 2021 4:09:15 P.M. EST Zygo Blaxell wrote:
> > > On Wed, Nov 17, 2021 at 09:57:40PM -0500, Matthew Dawson wrote:
> > > > On Monday, November 15, 2021 5:46:43 A.M. EST Kai Krakow wrote:
> > > > > On Mon, Nov 15, 2021 at 02:55, Matthew Dawson
> > > > > <matthew@mjdsystems.ca> wrote:
> > > > > > I recently upgraded one of my machines to the 5.15.2 kernel.
> > > > > > On the first reboot, I had a kernel fault during initialization
> > > > > > (I didn't get to capture the printed stack trace, but I'm 99%
> > > > > > sure it did not have BTRFS-related calls).  I then rebooted the
> > > > > > machine back to a 5.14 kernel, but the bcache (writeback) cache
> > > > > > was corrupted.  I then force-started the underlying backing
> > > > > > disks, but now my BTRFS filesystem will no longer mount.  I
> > > > > > realize there may be missing/corrupted data, but I would
> > > > > > ideally like to get any data I can off the disks.
> > > > > 
> > > > > I had a similar issue lately where the system didn't reboot cleanly
> > > > > (there's some issue in the BIOS or with the SSD firmware where it
> > > > > would disconnect the SSD from SATA a few seconds after boot, forcing
> > > > > bcache into detaching dirty caches).
> > > > > 
> > > > > Since you are seeing transaction IDs lagging behind expectations,
> > > > > I think you've lost dirty writeback data from bcache.  To avoid
> > > > > this in the future, you should use bcache only in writearound or
> > > > > writethrough mode.
> > > > 
> > > > Considering I started the bcache devices without the cache, I don't
> > > > doubt I've lost writeback data and I have no doubts there will be
> > > > issues.  At this point I'm just in data recovery, trying to get what
> > > > I can.
> > > 
> > > The word "issues" is not adequate to describe the catastrophic damage
> > > to metadata that occurs if the contents of a writeback cache are lost.
> > > 
> > > If writeback failure happens to only one btrfs device's cache, you
> > > can recover with btrfs raid1 self-healing using intact copies stored
> > > on working devices.  If it happens on multiple btrfs devices at once
> > > (e.g. due to misconfiguration of bcache with more than one btrfs device
> > > per pool or more than one bcache pool per SSD, or due to a kernel bug
> > > that affects all bcache instances at once, or a firmware bug that
> > > affects each SSD device the same way during a crash) then recovery
> > > isn't possible.
> > > 
> > > Writeback cache failures are _bad_, falling between "many thousands of
> > > bad sectors" and "total disk failure" in terms of difficulty of
> > > recovery.
> > > 
> > > > Hopefully someone has a different idea?  I am posting here because
> > > > I feel any further attempts are going to involve more dangerous
> > > > options, and those usually say to ask the mailing list first.
> > > 
> > > Your best option would be to get the caches running again, at least in
> > > read-only mode.  It's not a good option, but all your other options
> > > depend on having access to as many cached dirty pages as possible.
> > > If all you have is the backing devices, then now is the time to scrape
> > > what you can from the drives with 'btrfs restore' and then make use of
> > > your backups.
> > 
> > At this point I think I'm stuck with just the backing devices (with GBs
> > of lost dirty data on the cache).  And I'm primarily in data recovery,
> > trying to get whatever good data I can to help supplement the backed-up
> > data.
> 
> I don't use words like "catastrophic" casually.  Recovery typically
> isn't possible with the backing disks after a writeback cache failure.
> 
> The writeback cache algorithm will prefer to keep the most critical
> metadata in cache, while writing out-of-date metadata pages out to the
> backing devices.  This process effectively wipes btrfs metadata off
> the backing disks as the cache fills up, and puts it back as the cache
> flushes out.  If a large dirty cache dies, it can leave nothing behind.
> 
> > As mentioned in my first email though, btrfs restore fails with the
> > following error message:
> > # btrfs restore -l /dev/dm-2
> > parent transid verify failed on 132806584614912 wanted 3240123 found 3240119
> > parent transid verify failed on 132806584614912 wanted 3240123 found 3240119
> > parent transid verify failed on 132806584614912 wanted 3240123 found 3240119
> > parent transid verify failed on 132806584614912 wanted 3240123 found 3240119
> > Ignoring transid failure
> > Couldn't setup extent tree
> > Couldn't setup device tree
> > Could not open root, trying backup super
> > warning, device 6 is missing
> > warning, device 13 is missing
> > warning, device 12 is missing
> > warning, device 11 is missing
> > warning, device 7 is missing
> > warning, device 9 is missing
> > warning, device 14 is missing
> > bytenr mismatch, want=136920576753664, have=0
> > ERROR: cannot read chunk root
> > Could not open root, trying backup super
> > warning, device 6 is missing
> > warning, device 13 is missing
> > warning, device 12 is missing
> > warning, device 11 is missing
> > warning, device 7 is missing
> > warning, device 9 is missing
> > warning, device 14 is missing
> > bytenr mismatch, want=136920576753664, have=0
> > ERROR: cannot read chunk root
> > Could not open root, trying backup super
> >
> > That is with all devices up and reported to the kernel.  I was looking
> > for help to try and move beyond these errors and get whatever may still
> > be available.
> The general btrfs recovery process is:
> 
> 	1.  Restore device and chunk trees.  Without these, btrfs
> 	can't translate logical to physical block addresses, or even
> 	recognize its own devices, so you get "device is missing" errors.
> 	The above log shows that device and chunk tree data is now in the
> 	cache--or at least, not on the backing disks.	'btrfs rescue
> 	chunk-recover' may locate an older copy of this data by brute
> 	force search of the disk, if an older copy still exists.
> 
> 	2.  Find subvol roots to read data.  'btrfs-find-root' will
> 	do a brute-force search of the disks to locate subvol roots,
> 	which you can pass to 'btrfs restore -l' to try to read files.
> 	Normally this produces hundreds of candidates and you'll have
> 	to try each one.  If you have an old snapshot (one that predates
> 	the last full cache flush, and no balance, device shrink, device
> 	remove, defrag, or dedupe operation has occurred since) then you
> 	might be able to read its entire tree.	Subvols that are modified
> 	recently will be unusable as they will be missing many or all
> 	of their pages (they will be in the cache, not the backing disks).
> 
> 	3.  Verify the data you get back.  The csum tree is no longer
> 	usable, so you'll have no way to know if any data that you get
> 	from the filesystem is correct or garbage.  This is true even if
> 	you are reading from an old snapshot, as the csum tree is global
> 	to all subvols and will be modified (and moved into the cache)
> 	by any write to the filesystem.
> 
> In the logs above we see that you have missing pages in extent, chunk,
> and device trees.  In a writeback cache setup, new versions of these
> trees will be written to the cache, while the old versions are partially
> or completely erased on the backing devices in the process of flushing
> out previous dirty pages.  This pattern will repeat for subvol and csum
> trees, leaving you with severely damaged or unusable metadata on the
> backing disks as long as there are dirty pages in cache.
> 
> > If further recovery is impossible, that's fine; I'll wipe and start
> > over, but I'd rather try some risky things to get what I can before I
> > do so.
> 
> I wouldn't say it's impossible in theory, but in practice it is a level
> of effort comparable to unshredding a phone book--after someone has
> grabbed a handful of the shredded paper and burned it.
> 
> High-risk interventions like 'check --repair --init-extent-tree' are
> likely to have no effect in the best case (they'll give up due to lack
> of usable metadata), and will destroy even more data in the worst case
> (they'll try modifying the filesystem and overwrite some of the surviving
> data).  They depend on having intact device and subvol trees to work,
> so if you can't get those back, there's no need to try anything else.
> 
> In theory, if you can infer the file structure from the contents of the
> files, you might be able to guess some of the missing metadata.  e.g. the
> logical-to-physical translation in the device tree only provides about
> 16 bits of an extent byte address, so you could theoretically build
> a tool which tries all 65536 most likely disk locations for a block
> until it finds a plausible content match for a file, and use that tool
> to reconstruct the device tree.  It might even be possible to automate
> this using fragments of the csum tree (assuming the relevant parts of
> the csum tree exist on the backing devices and not only in the cache).
> This is only the theory--practical tools to do this kind of recovery
> don't yet exist.
Thanks for the suggestions!  I'll give them a try over the next bit (I'm
getting some extra storage, then I'll try using device mapper's snapshot
target to avoid destroying what's there; rough sketches of what I have in
mind are below).
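
Something like this is what I have in mind for the snapshot overlays (a
rough sketch only; the device names, COW device, and chunk size are
placeholders I'd adjust per disk):

  # Overlay one backing device with a transient DM snapshot so recovery
  # tools can only write to scratch space, never to the original disk.
  ORIGIN=/dev/sdb                     # one btrfs backing device
  COW=/dev/sdz1                       # scratch space for copy-on-write
  SIZE=$(blockdev --getsz "$ORIGIN")  # origin size in 512-byte sectors
  dmsetup create sdb-recover \
    --table "0 $SIZE snapshot $ORIGIN $COW N 128"
  # Repeat for each backing device, then point recovery tools at the
  # /dev/mapper/*-recover devices instead of the real disks.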

I also might try writing a recovery tool for the bcache cache, doing something 
similar to the dm snapshot system.
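
Once the snapshots are in place, the btrfs side I plan to try over them is
roughly the sequence you described (device paths, the root bytenr, and the
scratch/backup directories below are placeholders, and I'd start with dry
runs):

  # 1. Try to rebuild the chunk tree by brute-force scanning the disks.
  btrfs rescue chunk-recover /dev/mapper/sdb-recover

  # 2. Search for candidate tree roots to try with restore.
  btrfs-find-root /dev/mapper/sdb-recover

  # 3. For each candidate root bytenr, do a dry run first, then scrape
  #    whatever files are reachable into scratch storage.
  btrfs restore -t <bytenr> -iv -D /dev/mapper/sdb-recover /mnt/scratch
  btrfs restore -t <bytenr> -iv /dev/mapper/sdb-recover /mnt/scratch

  # 4. Since the csum tree can't be trusted, cross-check what comes back
  #    against the last backup instead of assuming it's good, e.g. list
  #    restored files whose contents differ from the backup copy:
  rsync -rcn /mnt/scratch/ /mnt/backup/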

Thanks for the pointers!
--
Matthew



Thread overview: 8+ messages
2021-11-15  1:52 Help recovering filesystem (if possible) Matthew Dawson
2021-11-15 10:46 ` Kai Krakow
2021-11-18  2:57   ` Matthew Dawson
2021-11-18 21:09     ` Zygo Blaxell
2021-11-19  4:42       ` Matthew Dawson
2021-11-24  4:43         ` Zygo Blaxell
2021-11-24  5:11           ` Matthew Dawson [this message]
