* `btrfs check --repair` stuck in a loop // filesystem repair case study
From: Tomasz Melcer @ 2016-12-03 2:35 UTC (permalink / raw)
To: linux-btrfs
Hi,
I have a btrfs filesystem on a 4TB HDD connected with USB 2.0. Some time
ago I accidentally disconnected the drive while doing heavy writes.
After reconnecting, the filesystem still seemed to work (it mounted
fine and I could read some files chosen at random), but I ran `btrfs
scrub` to be sure.
`btrfs scrub` aborted itself after ~20 hours, after reading ~3.5TB of
data. `dmesg` contained a single line:
#v+
BTRFS error (device dm-0): bad tree block start 0 3527021166592
#v-
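For anyone who wants to pick such lines out of a longer log, here is a
minimal Python sketch that extracts the two numbers. It rests on an
assumption: in kernels of that era the first number is, as far as I
understand, the start value read from the block header and the second
is the logical address that was expected. `parse_bad_tree_block` is a
hypothetical helper name, not part of any btrfs tool.

```python
import re

# Assumption: "<found> <expected>" ordering, i.e. the first number is what
# the on-disk header contained and the second is the expected logical address.
PATTERN = re.compile(
    r"BTRFS error \(device (?P<dev>[^)]+)\): "
    r"bad tree block start (?P<found>\d+) (?P<expected>\d+)"
)


def parse_bad_tree_block(line):
    """Return device, found and expected values, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "device": m.group("dev"),
        "found": int(m.group("found")),
        "expected": int(m.group("expected")),
    }


if __name__ == "__main__":
    msg = "BTRFS error (device dm-0): bad tree block start 0 3527021166592"
    print(parse_bad_tree_block(msg))
```

Under that assumption, a found value of 0 would be consistent with a
tree block that never reached the disk, and the expected address of
3527021166592 bytes is roughly 3.5 TB, which is at least in the same
range as the amount of data scrub had read when it aborted.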
I couldn't find any further details anywhere in logs. I assume this
means that some data have actually been lost from this filesystem. I
have backups of data from this drive, so I decided to play a little
trying out btrfs recovery strategies.
I checked whether there are any bad blocks on the raw device — all
blocks were read successfully.
I created a devicemapper snapshot/overlay to keep the raw device data
read only and track the changes made by any recovery procedures.
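The overlay setup can be sketched roughly as follows. This Python
snippet only builds the table line for the device-mapper `snapshot`
target (`<start> <length> snapshot <origin> <COW device> <P|N>
<chunksize>`); it does not run anything. The device paths and the
helper name `snapshot_table` are hypothetical, and the 4096-byte chunk
size is an assumption chosen to match the block size mentioned in this
mail.

```python
SECTOR_SIZE = 512  # device-mapper tables are expressed in 512-byte sectors


def snapshot_table(origin, cow, origin_bytes, chunk_bytes=4096, persistent=True):
    """Build a dm 'snapshot' table line covering the whole origin device."""
    if origin_bytes % SECTOR_SIZE or chunk_bytes % SECTOR_SIZE:
        raise ValueError("sizes must be multiples of 512 bytes")
    length = origin_bytes // SECTOR_SIZE   # device length in sectors
    chunk = chunk_bytes // SECTOR_SIZE     # copy-on-write granularity in sectors
    flag = "P" if persistent else "N"      # P: exception store survives reboot
    return f"0 {length} snapshot {origin} {cow} {flag} {chunk}"


if __name__ == "__main__":
    # Hypothetical paths: /dev/sdX is the raw disk, /dev/loop0 backs the overlay file.
    print(snapshot_table("/dev/sdX", "/dev/loop0", 1048576))
```

The resulting line would then be handed to something like `dmsetup
create <name> --table "<line>"`, with the origin device set read-only
beforehand.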
I ran `btrfstune -u` on the overlay to avoid having two devices with the
same uuid. This was done using a dedicated VM which did not see the raw
device (suggested by `Ke` on IRC). BTW, this command resulted in the
overlay device growing by ~25GB, which IIUC means that around 6M
4096-byte blocks were changed in the process (is that expected?).
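The block-count estimate can be checked with a few lines of Python.
This is just the arithmetic, assuming that the overlay grows by one
4096-byte chunk per changed block and ignoring any bookkeeping
overhead of the overlay itself; `blocks_changed` is a hypothetical
helper name.

```python
def blocks_changed(grown_bytes, block_size=4096):
    """Number of whole blocks that fit into the observed overlay growth."""
    return grown_bytes // block_size


if __name__ == "__main__":
    print(blocks_changed(25 * 10**9))  # 25 GB (decimal) -> 6103515
    print(blocks_changed(25 * 2**30))  # 25 GiB (binary) -> 6553600
```

Either way the result is a bit over 6 million blocks, which matches
the "around 6M" figure.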
I was advised to run `btrfs check`. The result is here: [1] (323
lines of output); IIRC it finished in a few hours.
[1] https://gist.github.com/liori/f8c5e69677e8c9d6038d2e3e4db9aa42
(5 data checksum errors are a preexisting condition, I knew about them
before the incident).
I then started `btrfs check --repair`. That was about a week ago, and it
is still running. The partial output is here: [2] (already almost 18k
lines). The same problems are reported again and again, as if the
process were stuck in a loop.
[2] https://gist.github.com/liori/01494afbe63cd19ba49be663be937d84
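One way to confirm the repetition without reading 18k lines by eye is
to count identical lines in the saved output. A minimal Python sketch,
assuming the output has been saved to a text file; the helper name
`repeated_lines` and the sample strings are placeholders, not real
`btrfs check` messages.

```python
from collections import Counter


def repeated_lines(lines, min_count=2):
    """Return (line, count) pairs for lines seen at least min_count times,
    most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Placeholder input; in practice: open("repair.log") (hypothetical file name).
    sample = ["problem A", "problem B", "problem A", "problem C", "problem A"]
    for text, n in repeated_lines(sample):
        print(n, text)
```

If a small set of messages accounts for nearly all of the output, that
supports the impression of a loop, although it does not by itself
prove that the repair makes no progress.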
I do observe that the ctime of the overlay file is updated every once in
a while, but the file itself has not grown since an initial change of
~70k blocks. My interpretation is that even if the repair process
writes anything, it only keeps writing to the same places again and
again.
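This kind of observation can be made repeatable with a small Python
sketch that records the allocated size and ctime of the overlay file
and compares two samples. It assumes a Unix-like system where
`st_blocks` is counted in 512-byte units; the function names are
hypothetical.

```python
import os


def overlay_usage(path):
    """Sample apparent size, allocated size and ctime of a (sparse) file."""
    st = os.stat(path)
    return {
        "apparent_bytes": st.st_size,
        "allocated_bytes": st.st_blocks * 512,  # assumption: 512-byte units
        "ctime": st.st_ctime,
    }


def same_places_rewritten(before, after):
    """True if the file was touched (ctime advanced) but did not grow."""
    return (after["ctime"] > before["ctime"]
            and after["allocated_bytes"] == before["allocated_bytes"])
```

Taking one sample, waiting a while, and taking another gives a rough
signal: an advancing ctime with unchanged allocation suggests writes
to already-allocated regions. It is only a hint, since ctime also
changes on pure metadata updates.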
I did not have any snapshots on this filesystem. I did have some
deduplicated content, but no more than 4 copies of any data block, and
deduplication resulted in saving ~1TB of space total. The device was
never a part of a multi-device setup.
Is there anything more I can do with this filesystem to bring it to a
state where I can `btrfs scrub` it, find out what has been lost, etc.? Is
this behavior of `btrfs check --repair` expected, and will it ever finish?
Thank you,
--
Tomasz Melcer