`btrfs check --repair` stuck in a loop // filesystem repair case study

linux-btrfs.vger.kernel.org archive mirror

From: Tomasz Melcer @ 2016-12-03  2:35 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I have a btrfs filesystem on a 4TB HDD connected via USB 2.0. Some time
ago I accidentally disconnected the drive during heavy writes.
After reconnecting, the filesystem still seemed to work (it mounted
fine and I could read some files chosen at random), but I ran `btrfs
scrub` to be sure.

`btrfs scrub` aborted itself after ~20 hours, after reading ~3.5TB of 
data. `dmesg` contained a single line:

#v+
BTRFS error (device dm-0): bad tree block start 0 3527021166592
#v-
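
For anyone scripting around such messages, here is a minimal sketch that
pulls the numbers out of the line above. The helper name and the field
labels are hypothetical. Reading the first number as the value found in
the block header and the second as the address where the block was
expected is an assumption about this message format, not something the
log itself states.

```python
import re

# Line copied verbatim from the dmesg output quoted above.
LINE = "BTRFS error (device dm-0): bad tree block start 0 3527021166592"

# Assumed layout: "<found> <expected>" after "bad tree block start".
PATTERN = re.compile(
    r"BTRFS error \(device (?P<dev>[^)]+)\): "
    r"bad tree block start (?P<found>\d+) (?P<expected>\d+)"
)


def parse_bad_tree_block(line):
    """Return device and the two numbers from the message, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "device": m["dev"],
        "found": int(m["found"]),
        "expected": int(m["expected"]),
    }


info = parse_bad_tree_block(LINE)
print(info)
# The second number expressed in decimal terabytes.
print(f"{info['expected'] / 10**12:.2f} TB")
```

The second number works out to about 3.53 TB, which is in the same
range as the ~3.5TB read before the abort. It is a btrfs logical
address, so it need not equal a physical offset on the disk.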

I couldn't find any further details anywhere in the logs. I assume this
means that some data has actually been lost from this filesystem. I
have backups of the data from this drive, so I decided to experiment a
little with btrfs recovery strategies.

I checked whether there were any bad blocks on the raw device; all
blocks were read successfully.
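
The message does not say which tool was used for this check. As an
illustration only, a full read pass can be sketched like this; the
function name and the 1 MiB chunk size are assumptions, and the sketch
goes through the page cache rather than using direct I/O.

```python
import os


def scan_readable(path, chunk_size=1024 * 1024):
    """Read `path` front to back; return (bytes_read, failed_offsets)."""
    failed = []
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file and, on
        # Linux, of a block device as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                total += len(os.pread(fd, want, offset))
            except OSError:
                # Record the start of the chunk that could not be read
                # and carry on with the next one.
                failed.append(offset)
            offset += want
    finally:
        os.close(fd)
    return total, failed
```

An empty list of failed offsets corresponds to the result reported
above: every block could be read. It says nothing about whether the
contents are correct.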

I created a devicemapper snapshot/overlay to keep the raw device data 
read only and track the changes made by any recovery procedures.
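
The exact setup is not given in this message. One common way to build
such an overlay is the device-mapper `snapshot` target, and the sketch
below only assembles the table line and the `dmsetup create` command
without running anything. The device paths, the mapping name and the
4096-byte chunk size are placeholders, not values from the actual
setup.

```python
import shlex

SECTOR = 512  # device-mapper tables are expressed in 512-byte sectors


def snapshot_table(size_bytes, origin, cow, chunk_bytes=4096, persistent=True):
    """Build a table line: 0 <sectors> snapshot <origin> <cow> <P|N> <chunk>."""
    if size_bytes % SECTOR or chunk_bytes % SECTOR:
        raise ValueError("sizes must be multiples of 512 bytes")
    mode = "P" if persistent else "N"
    return (
        f"0 {size_bytes // SECTOR} snapshot {origin} {cow} "
        f"{mode} {chunk_bytes // SECTOR}"
    )


def dmsetup_command(name, table):
    """Return the shell command as a string; nothing is executed here."""
    return f"dmsetup create {shlex.quote(name)} --table {shlex.quote(table)}"


# Placeholder values: a nominal 4 TB origin and a loop device backed by
# a sparse overlay file.
table = snapshot_table(4 * 10**12, "/dev/sdX", "/dev/loop0")
print(dmsetup_command("overlay", table))
```

In practice the size would come from the device itself (for example
via `blockdev --getsz`, which reports 512-byte sectors), and marking
the origin read-only beforehand adds a second layer of protection.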

I ran `btrfstune -u` on the overlay to avoid having two devices with the
same UUID. This was done using a dedicated VM which did not see the raw
device (suggested by `Ke` on IRC). BTW, this command resulted in the
overlay device growing by ~25GB, which IIUC means that around 6M
4096-byte blocks were changed in the process (is that expected?).
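
The block estimate above can be checked with a few lines of arithmetic.
This is a sketch that assumes, as the text does, that the overlay
tracks changes in 4096-byte units; whether "~25GB" is decimal or binary
is not stated, so both readings are shown.

```python
BLOCK_SIZE = 4096  # bytes per block, as assumed in the text


def blocks_changed(growth_bytes, block_size=BLOCK_SIZE):
    """Number of whole blocks that fit in the observed overlay growth."""
    return growth_bytes // block_size


decimal_gb = blocks_changed(25 * 10**9)  # 25 GB  (10**9 bytes each)
binary_gib = blocks_changed(25 * 2**30)  # 25 GiB (2**30 bytes each)

print(decimal_gb)  # 6103515
print(binary_gib)  # 6553600
```

Either way the result is a little over 6 million blocks, matching the
"around 6M" figure.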

I was advised to run `btrfs check`. The result is here: [1] (323
lines of output), and IIRC it finished in a few hours.

  [1] https://gist.github.com/liori/f8c5e69677e8c9d6038d2e3e4db9aa42

(The 5 data checksum errors are a preexisting condition; I knew about
them before the incident.)

I then started `btrfs check --repair`. This was about a week ago, and it 
is still going. The partial output is here: [2] (already almost 18k 
lines). The same problems are being found again and again in a loop, as 
if it was stuck.

  [2] https://gist.github.com/liori/01494afbe63cd19ba49be663be937d84
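
One way to confirm that the output really repeats, rather than merely
looking similar, is to count identical lines in the saved log. The
sketch below does that; the helper name is hypothetical and the sample
lines are made-up stand-ins, not real `btrfs check` output.

```python
from collections import Counter


def repeated_lines(lines, min_count=2):
    """Return (text, count) pairs for lines seen at least `min_count` times."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


# Made-up stand-in for a saved log; real use would read the file instead:
#     with open("repair.log") as f: repeated_lines(f)
sample = [
    "problem A",
    "problem B",
    "problem A",
    "problem B",
    "problem A",
    "unique line",
]

for text, n in repeated_lines(sample):
    print(n, text)
```

If the same handful of messages accounts for nearly all of the ~18k
lines, that supports the impression of a loop.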

I do observe that the ctime of the overlay file is updated every once in
a while, but the file itself has not grown any further after an initial
change of ~70k blocks. My interpretation is that even if the repair
process writes anything, it only keeps writing to the same places again
and again.
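
This observation can be made repeatable with a small watcher that
compares the overlay file's ctime and allocated size over time. It is a
sketch only: the function names, the polling interval and the number of
rounds are assumptions, and it relies on `st_blocks` being counted in
512-byte units, as it is on Linux.

```python
import os
import time


def overlay_state(path):
    """Snapshot the ctime, apparent size and allocated size of a file."""
    st = os.stat(path)
    return {
        "ctime": st.st_ctime,
        "apparent_bytes": st.st_size,
        # For a sparse file this is what actually grows as chunks are
        # written to the overlay.
        "allocated_bytes": st.st_blocks * 512,
    }


def watch(path, interval_s=60, rounds=3):
    """Print whether the file was touched and whether it grew, per round."""
    prev = overlay_state(path)
    for _ in range(rounds):
        time.sleep(interval_s)
        cur = overlay_state(path)
        touched = cur["ctime"] != prev["ctime"]
        grew = cur["allocated_bytes"] - prev["allocated_bytes"]
        print(f"touched={touched} grew_by={grew} bytes")
        prev = cur
```

A long run of `touched=True grew_by=0` would match the interpretation
above: writes keep landing on chunks that were already copied.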

I did not have any snapshots on this filesystem. I did have some 
deduplicated content, but no more than 4 copies of any data block, and 
deduplication resulted in saving ~1TB of space total. The device was 
never a part of a multi-device setup.

Is there anything more I can do with this filesystem to bring it to a
state where I can `btrfs scrub` it, find out what has been lost, etc.? Is
this behavior of `btrfs check --repair` expected, and will it ever finish?

Thank you,

-- 
Tomasz Melcer
