From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: A Big Thank You, and some Notes on Current Recovery Tools.
Date: Mon, 1 Jan 2018 05:21:19 +0000 (UTC)
Message-ID: <pan$61851$35f92f40$b7544793$147901a@cox.net>
In-Reply-To: CAJt7KB9UuW5Zg35VJ+fNV8RVZk_kAukmcgyK8GH4d1M286DkXA@mail.gmail.com
Stirling Westrup posted on Sun, 31 Dec 2017 19:48:15 -0500 as excerpted:
> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>
> Thanks to their tireless help in answering all my dumb questions I have
> managed to get my BTRFS working again! As I speak I have the full,
> non-degraded, quad of drives mounted and am updating my latest backup of
> their contents.
I'm glad you were able to fix it. Hopefully some of what was learned
from the experience can help the devs make btrfs better, as well as, for
you, reinforce the sysadmin's first rule of backups that I'm rather
known for quoting around here: The *real* value of data to an admin is
defined not by any flimsy claims as to its value, but rather by the
number of backups the admin considers that data worth having. If
there are no backups, or none beyond level N, that simply defines the
data as not worth the time/trouble/resources necessary to do those
backups (beyond level N), or, flipped around, defines the time/trouble/
resources saved in /not/ doing the backups as worth more than the data.
Thus it can *always* be said that whatever was defined to be of most
value was saved: either the data, if it was worth the trouble of making
the backup, or the time/trouble/resources necessary to make it, if there
was no backup.
Of course you had backups; they just weren't current. The same rule
then applies to the data in the delta between the backup and the current
state. If it wasn't worth freshening your backups to capture that delta
as well, then by definition that data was worth less than the
time/trouble/resources necessary to do the freshening.
... And FWIW, after finding myself in similar situations regarding backup
updates here, tho fortunately with the filesystem still readable by btrfs
restore... I recently decided it was worth the money to upgrade to ssd
backups as well as ssd working copies, precisely to lower the trouble
threshold to updating those backups... and I'm happy to report that it's
had exactly the effect I hoped for. I'm doing much more regular backups,
keeping the maximum delta between working copy and first-line backup much
smaller (days to weeks) than it was before (months to over a year!!),
so I'm walking the talk and holding myself to the same rules I preach!
=:^)
> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T drives
> failed, and with help I was able to make a 100% recovery of the lost
> data. I do have some observations on what I went through though. Take
> this as constructive criticism, or as a point for discussing additions
> to the recovery tools:
>
> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
> errors exactly coincided with the 3 super-blocks on the drive. The odds
> against this happening as random independent events are so long as to
> be mind-boggling. (Something like odds of 1 in 10^26) So, I'm going to
> guess this wasn't random chance. It's possible that something inside the
> drive's layers of firmware is to blame, but it seems more likely to me
> that there must be some BTRFS process that can, under some conditions,
> try to update all superblocks as quickly as possible. I think it must be
> that a drive failure during this window managed to corrupt all three
> superblocks. It may be better to perform an update-readback-compare on
> each superblock before moving on to the next, so as to avoid this
> particular failure in the future. I doubt this would slow things down
> much as the superblocks must be cached in memory anyway.
I'd actually suspect something in the drive firmware or hardware that
didn't like the fact that btrfs was *constantly* rewriting the *exact*
same places, the copies of the superblock.
Because otherwise, as you say, the odds against it being /exactly/ those
three blocks, and not, say, two superblocks and something else, are
simply too long.
"They say" it's SSDs that work that way, not spinning rust, which is
supposed to "not care" about how many times a particular block is
rewritten, but more about spinning hours, etc. However, I'd argue that
the same rules that have applied to "spinning rust" for decades... don't
necessarily hold any longer as the area of each bit or byte gets smaller
and smaller, and /particularly/ so with the new point-heat-recording and
shingled designs. Indeed, I had already wondered personally about media-
point longevity given repeated point-heat-recording cycles, and the fact
that btrfs superblocks are the /one/ thing that's not constantly COWed to
different locations at every write, but remain at the exact same media
address, rewritten for /every/ btrfs commit cycle, as they /must/ be,
given the way btrfs works.
Of course that's why ssds have the FTL (flash translation layer) between
the actual physical media and the filesystem layer, doing that COW at the
device level, so no single hot-spot address is rewritten many more times
than the cold-spot addresses.
And of course spinning rust has its firmware as well, tho at least as
publicly documented, it doesn't remap a sector until the sector actually
dies. But I actually suspect that some drives do SSD-like wear-leveling
anyway, because I just don't see how the smaller and smaller physical
bit-write areas can stand up to the repeated rewrite wear otherwise.
But either there was something buggy with yours, that btrfs triggered
with its superblock write pattern, or it simply didn't have the level of
protection it needed, or perhaps some of both.
Anyway, as I said, the odds against chance are simply too long. There's
no other explanation for it being the /exact/ three superblocks, spaced
as they are precisely to /avoid/ ending up in the same physical weak-spot
area by accident, that went out.
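For reference, those three mirrors live at fixed offsets on every device
(64 KiB, 64 MiB, and 256 GiB), and each 4096-byte superblock carries the
magic "_BHRfS_M" at byte 0x40. Here's a quick, purely hypothetical sketch
(this is NOT an existing btrfs-progs tool) of probing which mirrors on a
device or ddrescue image still look sane; the offsets and magic come from
the on-disk format, everything else is illustrative:

```python
# Hypothetical probe, not a btrfs-progs utility: check the three standard
# btrfs superblock mirror offsets and report which still carry the magic.
import os

BTRFS_MAGIC = b"_BHRfS_M"          # at byte 0x40 inside each superblock
MIRROR_OFFSETS = (64 << 10,        # primary:   64 KiB
                  64 << 20,        # mirror 1:  64 MiB
                  256 << 30)       # mirror 2: 256 GiB

def probe_superblocks(path):
    """Return {offset: True/False/None}; None if the mirror doesn't fit."""
    results = {}
    size = os.path.getsize(path)   # for a block device, lseek(SEEK_END) instead
    with open(path, "rb") as dev:
        for off in MIRROR_OFFSETS:
            if off + 4096 > size:
                results[off] = None        # device too small for this mirror
                continue
            dev.seek(off + 0x40)           # magic field within the superblock
            results[off] = dev.read(8) == BTRFS_MAGIC
    return results
```

Run against a ddrescue image, something like that would show at a glance
whether /any/ of the three mirrors survived, before reaching for a hex
editor.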
Which has significant implications for the below...
> 2) The recovery tools seem too dumb while thinking they are smarter than
> they are. There should be some way to tell the various tools to consider
> some subset of the drives in a system as worth considering. Not knowing
> that a superblock was a single 4096-byte sector, I had primed my
> recovery by copying a valid superblock from one drive to the clone of my
> broken drive before starting the ddrescue of the failing drive. I had
> hoped that I could piece together a valid superblock from a good drive,
> and whatever I could recover from the failing one. In the end this
> turned out to be a useful strategy, but meanwhile I had two drives that
> both claimed to be drive 2 of 4, and no drive claiming to be drive 1 of
> 4. The tools completely failed to deal with this case and were
> consistently preferring to read the bogus drive 2 instead of the real
> drive 2, and it wasn't until I deliberately patched over the magic in
> the cloned drive that I could use the various recovery tools without
> bizarre and spurious errors. I understand how this was never an
anticipated scenario for the recovery process, but if it's happened once,
> it could happen again. Just dealing with a failing drive and its clone
> both available in one system could cause this.
Of course btrfs has known problems with clones that duplicate the UUID,
as many cloning tools do, when both the clone and the working copy are
visible to btrfs at the same time. This is because btrfs, unlike most
filesystems, is multi-device, so it needed /some/ way to uniquely
identify each filesystem and, as here, each device of each filesystem,
and the "universally unique identifier", aka UUID (aka GUID, globally
unique ID), was taken as *exactly* what the name says: globally/
universally unique. That's one of the design assumptions of btrfs,
written into the code at a level that really can't be changed at this
late date, many years into the process. And btrfs really /does/ have
known data-corruption potential when those IDs turn out not to be
unique after all.
Which is why admins that have done their due diligence researching the
filesystems they're trusting with the integrity of their data, know that
if they're using replication methods that expose multiple devices with
the same GUIDs/UUIDs, they *MUST* take care to expose to btrfs only one
instance of those UUIDs/GUIDs at a time. Because there's a very real
danger of data corruption if btrfs sees two supposedly "unique" IDs, as
it can and sometimes does get /very/ confused by that.
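To make that "one instance at a time" rule easier to follow, one could
imagine a pre-flight check that reads each candidate device's fsid
straight off the primary superblock before anything is mounted. The
field offsets below match the on-disk format as I understand it (fsid is
the 16 bytes at offset 0x20 of the superblock at 64 KiB); the helper
itself is purely illustrative, not part of btrfs-progs:

```python
# Illustrative sketch only: flag devices that carry the same btrfs fsid,
# e.g. a clone and its original both plugged into the same system.
import collections

SUPER_OFFSET = 64 << 10   # primary superblock at 64 KiB
FSID_OFFSET = 0x20        # fsid field inside the superblock

def find_duplicate_fsids(devices):
    """Return {fsid-hex: [devices]} for every fsid seen more than once."""
    seen = collections.defaultdict(list)
    for dev in devices:
        with open(dev, "rb") as f:
            f.seek(SUPER_OFFSET + FSID_OFFSET)
            seen[f.read(16)].append(dev)
    return {fsid.hex(): devs for fsid, devs in seen.items() if len(devs) > 1}
```

Anything that shows up in the result is a device you'd want to keep
hidden from btrfs (or wipe the signature on) before mounting.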
Unfortunately, as btrfs becomes more widespread and commonplace,
reaching beyond the level of admin that really researches a filesystem
before putting their trust in it, a lot of btrfs-using admins are ending
up learning this the hard way.
Tho arguably, the good part is that just as admins coming from the
MS side of things had to learn all about mounting and unmounting, and
how to avoid the trap of data corruption from pulling a (removable)
device without cleanly unmounting it, as btrfs becomes more common,
people will eventually learn the btrfs rules of safe data behavior as
well.
Tho equally arguably, that, among several reasons, may be enough to keep
btrfs from ever becoming the mainstream replacement for and successor to
the ext* line that it was intended to be. Oh, well... Every filesystem
has its strengths and weaknesses, and a good admin will learn to
appreciate them and use a filesystem appropriate to the use-case, while
not-so-good admins... generally end up suffering more than necessary, as
they fight with filesystems in use-cases those filesystems simply aren't
the best choice for.
Of course the alternative would be a limited-choice ecosystem like MS,
where there's only basically two FS choices, some version of the
venerable FAT, or some version of NTFS, both choices among many others
available to Linux/*IX users, as well. Fine for some, but "No thanks,
I'll keep my broad array of choices, thank you very much!" for me. =:^)
> 3) There don't appear to be any tools designed for dumping a full
> superblock in hex notation, or for patching a superblock in place.
> Seeing as I was forced to use a hex editor to do exactly that, and then
> go through hoops to generate a correct CSUM for the patched block, I
> would certainly have preferred there to be some sort of utility to do
> the patching for me.
100% agreed, here. Of course that's one reason among many that btrfs
remains "still stabilizing, not yet fully stable and mature": precisely
because various holes like this one remain in the btrfs toolset.
It is said that the air force jocks of some nations semi-euphemistically
describe a situation in which they are vastly outnumbered as a "target-
rich environment." Whatever the truth of /that/, by analogy it's
definitely the case that btrfs remains a "development-opportunity-rich
environment" in terms of improvements remaining to be made. There are
certainly more ideas for improvement than there are time and devs to
implement, test, bugfix, and test some more, all those ideas, and this
is one more that it'd definitely be nice to have!
But given how closely you worked with the devs to get your situation
fixed, and thus the knowledge of your specific tool-case they now have,
the chances of actually getting this implemented in something approaching
reasonably useful time are better than most. =:^)
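For the record, the checksum your hex-editing had to reproduce is, for
the default csum type, CRC-32C (Castagnoli) computed over everything
after the superblock's 32-byte csum field, with the 4-byte result stored
little-endian at offset 0. Here's a sketch of that recompute step in
plain Python; the table-driven crc32c is the standard algorithm, but
treat the exact covered byte range as an assumption to double-check
against btrfs-progs before writing anything back to a real device:

```python
# Hedged sketch: recompute a btrfs superblock's CRC-32C after hand-patching.
# Assumption to verify: the csum covers bytes 32..4096 of the 4096-byte
# superblock and is stored little-endian in the first 4 bytes.
import struct

def _crc32c_table():
    # Build the reflected CRC-32C (poly 0x1EDC6F41, reflected 0x82F63B78) table.
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _crc32c_table()

def crc32c(data, crc=0):
    # Standard CRC-32C: init ~0, process reflected, final xor ~0.
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def patch_superblock_csum(sb):
    """Return a copy of the 4096-byte superblock with its csum recomputed."""
    assert len(sb) == 4096
    return struct.pack("<I", crc32c(sb[32:])) + sb[4:32] + sb[32:]
```

Not a replacement for a proper tool, obviously, but it beats computing
the CRC by hand and pasting it back in with a hex editor.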
> 4) Despite having lost all 3 superblocks on one drive in a 4-drive setup
> (RAID0 Data with RAID1 Metadata), it was possible to derive all missing
> information needed to rebuild the lost superblock from the existing good
> drives. I don't know how often it can be done, or if it was due to some
> peculiarity of the particular RAID configuration I was using, or what.
> But seeing as this IS possible at least under some circumstances, it
> would be useful to have some recovery tools that knew what those
> circumstances were, and could make use of them.
Of course raid0 in any form is considered among admins to be for the
"don't-care-if-we-lose-it, it's throw-away data" use-case, either
because it actually /is/ throw-away data, or because there's at least
one extra level of backups in case the raid0 /does/ die.
By that argument there's limited benefit to any investment in raid0-mode
recovery, because nobody sane uses it for anything of greater than
"throw-away" value anyway.
Tho OTOH, given that raid1-metadata/single-data (which roughly equates
to raid0 data) is the effective btrfs multi-device default... arguably
either that default should be changed to raid1/10 for data as well as
metadata, or there's at least /some/ case for prioritizing
implementation of tools such as those that would have helped automate
the process here.
Personally, I'd argue for changing the default to raid1 at 2-3 devices
and raid10 at 4+ devices, but maybe that's just me...
> 5) Finally, I want to comment on the fact that each drive only stored up
> to 3 superblocks. Knowing how important they are to system integrity, I
> would have been happy to have had 5 or 10 such blocks, or had each drive
> keep one copy of each superblock for each other drive. At 4K per
> superblock, this would seem a trivial amount to store even in a huge
> raid with 64 or 128 drives in it. Could there be some method introduced
> for keeping far more redundant metainformation around? I admit I'm
> unclear on what the optimal numbers of these things would be. Certainly
> if I hadn't lost all 3 superblocks at once, I might have thought that
> number adequate.
If indeed I'm correct that it being ALL three of the superblocks that
failed, and ONLY the superblocks, strongly indicates a mismatch between
the hardware/firmware and btrfs' constant rewriting of the /exact/ same
superblock addresses, then...
Making it 5 or 10 or 100 or 1000 such blocks won't help much.
OTOH, I'm rather intrigued by the idea of keeping one copy of each of
the /other/ devices' superblocks on all devices. I'd consider that idea
worth further discussion anyway, tho it's quite possible that performance
or other considerations make it simply impractical to implement, and even
if practical to implement in the general sense, it'd certainly require an
on-device format update, and those aren't done lightly or often, as all
formats from the original mainlined one must be supported going forward.
But it's definitely an idea I'd like to see further discussed, even if
it's simply to point out the holes in the idea I'm just not seeing, from
my viewpoint that's definitely much closer to admin than dev.
Tho while I do rather like the idea, given the above, even keeping
additional superblock copies on all the other devices isn't necessarily
going to help much, particularly when they're all similar devices,
presumably with similar firmware and media weak-points.
But other-device superblocks could very well have helped in a situation
like yours, where there were two different device sizes and potentially
brands...
> Anyway, I hope no one takes these criticisms the wrong way. I'm a huge
> fan of BTRFS and its potential, and I know it's still early days for the
> code base, and it's yet to fully mature in its recovery and diagnostic
> tools. I'm just hoping that these points can contribute in some small
> way and give back some of the help I got in fixing my system!
I believe you've very likely done just that. =:^)
And even if your case doesn't result in tools to automate superblock
restoration in cases such as yours in the immediate to near term (say,
out to three years), it has very definitely already produced list
regulars who now have experience with the problem and should find it
/much/ easier to tackle a similar one the next time it comes up! And as
you say, it almost certainly /will/ come up again, because it's not
/that/ unreasonable or uncommon a situation to find oneself in, after
all!
But definitely, the best case would be if it results in the tools
learning how to automate the process, so that people who have no clue
what a hex editor even is can still have at least /some/ chance of
recovering from it. We were just lucky here that the problem happened
to someone with the technical skill and, just as importantly, the
time/motivation/determination to either get a fix or know exactly why
it /could-not/ be fixed. It could instead have been someone more like
me, who /might/ have the technical skill, but would be far more likely
to just accept the damage as reality and fall back to the backups, such
as they are, than to actually invest the time in either getting that
fix or knowing for sure that it /can't/ be fixed.
The signature I've seen comes to mind: something about the unreasonable
man refusing to accept reality, thereby making his own, and /thereby/
changing it for the good, for everyone, thus progress depending on the
unreasonable man. =:^)
Yes, I suppose I /did/ just call you "unreasonable", but that's a rather
extreme compliment, in this case! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Thread overview: 12+ messages
2018-01-01 0:48 A Big Thank You, and some Notes on Current Recovery Tools Stirling Westrup
2018-01-01 5:21 ` Duncan [this message]
2018-01-01 10:13 ` Qu Wenruo
2018-01-01 12:15 ` Kai Krakow
2018-01-01 19:44 ` Stirling Westrup
2018-01-02 2:03 ` Duncan
2018-01-02 10:02 ` ein
2018-01-02 11:15 ` Paul Jones
2018-01-02 12:45 ` Marat Khalili
2018-01-02 14:45 ` ein
2018-01-01 22:50 ` waxhead
2018-01-02 0:57 ` Qu Wenruo