* A Big Thank You, and some Notes on Current Recovery Tools.
From: Stirling Westrup @ 2018-01-01 0:48 UTC
To: linux-btrfs; +Cc: Qu Wenruo, Nikolay Borisov

Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK YOU
to Nikolay Borisov and most especially to Qu Wenruo!

Thanks to their tireless help in answering all my dumb questions I have
managed to get my BTRFS working again! As I speak I have the full,
non-degraded, quad of drives mounted and am updating my latest backup of
their contents.

I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T drives
failed, and with help I was able to make a 100% recovery of the lost data.
I do have some observations on what I went through though. Take this as
constructive criticism, or as a point for discussing additions to the
recovery tools:

1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
errors exactly coincided with the 3 super-blocks on the drive. The odds
against this happening as random independent events are so long as to be
mind-boggling. (Something like odds of 1 in 10^26; see the rough check
after this message.) So, I'm going to guess this wasn't random chance.
It's possible that something inside the drive's layers of firmware is to
blame, but it seems more likely to me that there must be some BTRFS
process that can, under some conditions, try to update all superblocks as
quickly as possible. I think it must be that a drive failure during this
window managed to corrupt all three superblocks. It may be better to
perform an update-readback-compare on each superblock before moving on to
the next, so as to avoid this particular failure in the future. I doubt
this would slow things down much as the superblocks must be cached in
memory anyway.

2) The recovery tools seem too dumb while thinking they are smarter than
they are. There should be some way to tell the various tools which subset
of the drives in a system is worth considering. Not knowing that a
superblock was a single 4096-byte sector, I had primed my recovery by
copying a valid superblock from one drive to the clone of my broken drive
before starting the ddrescue of the failing drive. I had hoped that I
could piece together a valid superblock from a good drive, and whatever I
could recover from the failing one. In the end this turned out to be a
useful strategy, but meanwhile I had two drives that both claimed to be
drive 2 of 4, and no drive claiming to be drive 1 of 4. The tools
completely failed to deal with this case and were consistently preferring
to read the bogus drive 2 instead of the real drive 2, and it wasn't until
I deliberately patched over the magic in the cloned drive that I could use
the various recovery tools without bizarre and spurious errors. I
understand how this was never an anticipated scenario for the recovery
process, but if it's happened once, it could happen again. Just dealing
with a failing drive and its clone both available in one system could
cause this.

3) There don't appear to be any tools designed for dumping a full
superblock in hex notation, or for patching a superblock in place. Seeing
as I was forced to use a hex editor to do exactly that, and then go
through hoops to generate a correct CSUM for the patched block, I would
certainly have preferred there to be some sort of utility to do the
patching for me.
4) Despite having lost all 3 superblocks on one drive in a 4-drive setup
(RAID0 Data with RAID1 Metadata), it was possible to derive all missing
information needed to rebuild the lost superblock from the existing good
drives. I don't know how often it can be done, or if it was due to some
peculiarity of the particular RAID configuration I was using, or what. But
seeing as this IS possible at least under some circumstances, it would be
useful to have some recovery tools that knew what those circumstances
were, and could make use of them.

5) Finally, I want to comment on the fact that each drive only stored up
to 3 superblocks. Knowing how important they are to system integrity, I
would have been happy to have had 5 or 10 such blocks, or had each drive
keep one copy of each superblock for each other drive. At 4K per
superblock, this would seem a trivial amount to store even in a huge raid
with 64 or 128 drives in it. Could there be some method introduced for
keeping far more redundant metainformation around? I admit I'm unclear on
what the optimal numbers of these things would be. Certainly if I hadn't
lost all 3 superblocks at once, I might have thought that number adequate.

Anyway, I hope no one takes these criticisms the wrong way. I'm a huge fan
of BTRFS and its potential, and I know it's still early days for the code
base, and it's yet to fully mature in its recovery and diagnostic tools.
I'm just hoping that these points can contribute in some small way and
give back some of the help I got in fixing my system!

--
Stirling Westrup
Programmer, Entrepreneur.
https://www.linkedin.com/e/fpf/77228
http://www.linkedin.com/in/swestrup
http://technaut.livejournal.com
http://sourceforge.net/users/stirlingwestrup
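A quick sanity check on that "1 in 10^26" figure: if the failure is
modelled as three independently placed bad 4 KiB sectors on a nominal 2 TB
drive (an idealized assumption; real media failures are rarely
independent), the chance of them landing exactly on the three superblock
sectors comes out at the same order of magnitude. A minimal sketch in
Python:

    # Back-of-the-envelope check of the "1 in 10^26" claim, assuming three
    # independently placed bad 4 KiB sectors on a nominal 2 TB device.
    from math import comb

    DEV_BYTES = 2_000_000_000_000        # nominal 2 TB drive
    SECTOR = 4096                        # each superblock copy occupies one 4 KiB block
    n_sectors = DEV_BYTES // SECTOR      # ~4.9e8 candidate locations

    # Chance that 3 random bad sectors hit exactly the 3 superblock copies.
    p = 1 / comb(n_sectors, 3)
    print(f"{n_sectors} sectors, P = {p:.2e}")   # about 5e-26

Which supports the conclusion drawn above: a coincidence that unlikely is
far better explained by something systematically stressing the superblock
locations than by independent bad luck.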
* Re: A Big Thank You, and some Notes on Current Recovery Tools.
From: Duncan @ 2018-01-01 5:21 UTC
To: linux-btrfs

Stirling Westrup posted on Sun, 31 Dec 2017 19:48:15 -0500 as excerpted:

> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>
> Thanks to their tireless help in answering all my dumb questions I have
> managed to get my BTRFS working again! As I speak I have the full,
> non-degraded, quad of drives mounted and am updating my latest backup of
> their contents.

I'm glad you were able to fix it. Hopefully, some of what was learned from
the experience can help the devs make btrfs better, as well as, for you,
reinforcing the sysadmin's first rule of backups that I'm rather known for
quoting around here:

The *real* value of data to an admin is defined not by any flimsy claims
as to its value, but rather, by the number of backups an admin considers
it worth having of that data.

If there are no backups, or none beyond level N, that's simply defining
the data as not worth the time/trouble/resources necessary to do those
backups (beyond level N), or flipped around, defining the
time/trouble/resources saved in /not/ doing the backups to be worth more
than the data.

Thus, it can *always* be said that what was defined to be of most value
was saved, either the data, if it was worth the trouble making the backup,
or the time/trouble/resources necessary to make it if there was no backup.

Of course you had backups, but they weren't current. However, the same
rule applies then to the data in the delta between the backup and current
state. If it wasn't worth freshening your backups to capture backups of
that delta as well, then by definition the data was worth less than the
time/trouble/resources necessary to do that freshening.

... And FWIW, after finding myself in similar situations regarding backup
updates here, but fortunately with the btrfs still readable by btrfs
restore... I recently decided it was worth the money to upgrade to ssd
backups as well as ssd working copies... precisely to lower the trouble
threshold to updating those backups... and I'm happy to report that it's
had exactly the effect I had hoped... I'm doing much more regular backups,
keeping that maximum delta between working copy and first-line backup much
smaller (days to weeks) than it was before (months to over a year (!!)),
so I'm walking the talk and holding myself to the same rules I preach!
=:^)

> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T drives
> failed, and with help I was able to make a 100% recovery of the lost
> data. I do have some observations on what I went through though. Take
> this as constructive criticism, or as a point for discussing additions
> to the recovery tools:
>
> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
> errors exactly coincided with the 3 super-blocks on the drive. The odds
> against this happening as random independent events is so unlikely as to
> be mind-boggling. (Something like odds of 1 in 10^26) So, I'm going to
> guess this wasn't random chance.
Its possible that something inside the > drive's layers of firmware is to blame, but it seems more likely to me > that there must be some BTRFS process that can, under some conditions, > try to update all superblocks as quickly as possible. I think it must be > that a drive failure during this window managed to corrupt all three > superblocks. It may be better to perform an update-readback-compare on > each superblock before moving onto the next, so as to avoid this > particular failure in the future. I doubt this would slow things down > much as the superblocks must be cached in memory anyway. I'd actually suspect something in the drive firmware or hardware... didn't like the fact that btrfs was *constantly* rewriting the *exact* same place, the copies of the superblock. Because otherwise, as you say, the odds are simply too high that it would be /exactly/ those three blocks, not, say, two superblocks, and something else. "They say" it's SSDs that work that way, not spinning rust, which is supposed to "not care" about how many times a particular block is rewritten, but more about spinning hours, etc. However, I'd argue that the same rules that have applied to "spinning rust" for decades... don't necessarily hold any longer as the area of each bit or byte gets smaller and smaller, and /particularly/ so with the new point-heat-recording and shingled designs. Indeed, I had already wondered personally about media- point longevity given repeated point-heat-recording cycles, and the fact that btrfs superblocks are the /one/ thing that's not constantly COWed to different locations at every write, but remain at the exact same media address, rewritten for /every/ btrfs commit cycle, as they /must/ be, given the way btrfs works. Of course that's why ssds have the FTL/firmware-translation-layer between the actual physical media and the filesystem layer, doing that COW at the device level, so no single hotspot address is rewritten many more times than the coldspot addresses. And of course spinning rust has its firmware as well, tho at least in the public domain, they don't COW a sector until it actually dies. But I actually suspect that some of them do SSD-like wear-leveling anyway, because I just don't see how the smaller and smaller physical bit-write areas can stand up to the repeated rewrite wear, otherwise. But either there was something buggy with yours, that btrfs triggered with its superblock write pattern, or it simply didn't have the level of protection it needed, or perhaps some of both. Anyway, as I said, the odds are simply too great. There's simply no other explanation for it being the /exact/ three superblocks, spaced as they are precisely to /avoid/ ending up in the same physical weak-spot area by accident, that went out. Which has significant implications for the below... > 2) The recovery tools seem too dumb while thinking they are smarter than > they are. There should be some way to tell the various tools to consider > some subset of the drives in a system as worth considering. Not knowing > that a superblock was a single 4096-byte sector, I had primed my > recovery by copying a valid superblock from one drive to the clone of my > broken drive before starting the ddrescue of the failing drive. I had > hoped that I could piece together a valid superblock from a good drive, > and whatever I could recover from the failing one. 
In the end this > turned out to be a useful strategy, but meanwhile I had two drives that > both claimed to be drive 2 of 4, and no drive claiming to be drive 1 of > 4. The tools completely failed to deal with this case and were > consistently preferring to read the bogus drive 2 instead of the real > drive 2, and it wasn't until I deliberately patched over the magic in > the cloned drive that I could use the various recovery tools without > bizarre and spurious errors. I understand how this was never an > anticipated scenario for the recovery process, but if its happened once, > it could happen again. Just dealing with a failing drive and its clone > both available in one system could cause this. Of course btrfs has known problems with clones that duplicate the GUID, as many cloning tools do, where both the clone and the working copy are available to btrfs at the same time. This is because btrfs, unlike most filesystems being multi-device, needed /some/ way to uniquely identify each filesystem, and as here, each device of each filesystem, and the "Globally Unique Identification", aka GUID, aka UUID (universally unique ID), was taken as *exactly* what it says in the name, globally/ universally unique. That's one of the design assumptions of btrfs, written into the code at a level that really can't be changed at this late date, many years into the process. And btrfs really /does/ have known data corruption potential when those IDs don't turn out to be unique after all. Which is why admins that have done their due diligence researching the filesystems they're trusting with the integrity of their data, know that if they're using replication methods that expose multiple devices with the same GUIDs/UUIDs, they *MUST* take care to expose to btrfs only one instance of those UUIDs/GUIDs at a time. Because there's a very real danger of data corruption if btrfs sees two supposedly "unique" IDs, as it can and sometimes does get /very/ confused by that. Unfortunately, as btrfs becomes more widespread and common-place, beyond the level of admin that really researches a filesystem before they put their trust in it, a lot of btrfs-using admins are ending up learning this the hard way... unfortunately. Tho arguably, the good part of it is that just as admins coming from the MS side of things had to learn all about mounting and unmounting, and what to avoid to avoid the trap of data corruption due to pulling a (removable) device without cleanly unmounting it, as btrfs becomes more common, people will eventually learn the btrfs rules of safe data behavior as well. Tho equally arguably, that among several reasons may be enough to keep btrfs from ever becoming the mainstream replacement for and successor to the ext* line that it was intended to be. Oh, well... Every filesystem has its strengths and weaknesses, and a good admin will learn to appreciate them and use a filesystem appropriate to the use-case, while not so good admins... generally end up suffering more than necessary, as they fight with filesystems in use-cases that they are simply not the best choice out there at supporting. Of course the alternative would be a limited-choice ecosystem like MS, where there's only basically two FS choices, some version of the venerable FAT, or some version of NTFS, both choices among many others available to Linux/*IX users, as well. Fine for some, but "No thanks, I'll keep my broad array of choices, thank you very much!" for me. 
=:^) > 3) There don't appear to be any tools designed for dumping a full > superblock in hex notation, or for patching a superblock in place. > Seeing as I was forced to use a hex editor to do exactly that, and then > go through hoops to generate a correct CSUM for the patched block, I > would certainly have preferred there to be some sort of utility to do > the patching for me. 100% agreed, here. Of course that's one reason among many that btrfs remains "still stabilizing, not yet fully stable and mature", precisely because there's various holes like this one remaining in the btrfs toolset. It is said that the air force jocks of some nations semi-euphemistically describe a situation in which they are vastly outnumbered as a "target rich environment." Whatever the truth of /that/, by analogy it's definitely the case that btrfs remains a "development-opportunity rich environment" in terms of improvement possibilities remaining to be developed. There's certainly more ideas for improvement than there is time and devs to implement, test, bugfix, and test some more, all those ideas, and this is one more that it'd definitely be nice to have! But given how closely you worked with the devs to get your situation fixed, and thus the knowledge of your specific tool-case they now have, the chances of actually getting this implemented in something approaching reasonably useful time, is better than most. =:^) > 4) Despite having lost all 3 superblocks on one drive in a 4-drive setup > (RAID0 Data with RAID1 Metadata), it was possible to derive all missing > information needed to rebuild the lost superblock from the existing good > drives. I don't know how often it can be done, or if it was due to some > peculiarity of the particular RAID configuration I was using, or what. > But seeing as this IS possible at least under some circumstances, it > would be useful to have some recovery tools that knew what those > circumstances were, and could make use of them. Of course raid0 in any form is considered among admins to be for the "don't-care-if-we-lose-it, it's throw-away-data", either because it actually /is/ throw-away data, or because there's at least one extra level of backups in case the raid0 /does/ die, use-case. By that argument there's limited benefit to any investment in raid0 mode recovery, because nobody sane uses it for anything of greater than "throw- away" value anyway. Tho OTOH, given that raid1-metadata/single-data (which roughly equates to raid0-data) is the btrfs-multi-device effective default... arguably, either that default should be changed to raid1/10 for data as well as metadata, or at least there's /some/ support for prioritizing implementation of tools such as those that would have helped automate the process, here. Personally, I'd argue for changing the default to raid1 2-3 device, raid10 4+ device, but maybe that's just me... > 5) Finally, I want to comment on the fact that each drive only stored up > to 3 superblocks. Knowing how important they are to system integrity, I > would have been happy to have had 5 or 10 such blocks, or had each drive > keep one copy of each superblock for each other drive. At 4K per > superblock, this would seem a trivial amount to store even in a huge > raid with 64 or 128 drives in it. Could there be some method introduced > for keeping far more redundant metainformation around? I admit I'm > unclear on what the optimal numbers of these things would be. 
Certainly > if I hadn't lost all 3 superblocks at once, I might have thought that > number adequate. If indeed I'm correct that the odds of it being ALL three of the superblocks that failed, and ONLY the superblocks, strongly indicate a mismatch between hardware/firmware and the btrfs superblock constant rewrite to the /exact/ same address pattern, then... Making it 5 or 10 or 100 or 1000 such blocks won't help much. OTOH, I'm rather intrigued by the idea of keeping one copy of each of the /other/ devices' superblocks on all devices. I'd consider that idea worth further discussion anyway, tho it's quite possible that performance or other considerations make it simply impractical to implement, and even if practical to implement in the general sense, it'd certainly require an on-device format update, and those aren't done lightly or often, as all formats from the original mainlined one must be supported going forward. But it's definitely an idea I'd like to see further discussed, even if it's simply to point out the holes in the idea I'm just not seeing, from my viewpoint that's definitely much closer to admin than dev. Tho while I do rather like the idea, given the above, even keeping additional superblock copies on all the other devices isn't necessarily going to help much, particularly when it's all similar devices, presumably with similar firmware and media weak-points. But other-device superblocks very well could have helped in a situation like yours, where there were two different device sizes and potentially brands... > Anyway, I hope no one takes these criticisms the wrong way. I'm a huge > fan of BTRFS and its potential, and I know its still early days for the > code base, and it's yet to fully mature in its recovery and diagnostic > tools. I'm just hoping that these points can contribute in some small > way and give back some of the help I got in fixing my system! I believe you've very likely done just that. =:^) And even if your case doesn't result in tools to automate superblock restoration in cases such as yours in the immediate to near-term (say to three years out), it has very definitely already resulted in regulars that now have experience with the problem and should now find it /much/ easier to tackle a similar problem the next time it comes up! And as you say, it almost certainly /will/ come up again, because it's not /that/ unreasonable or uncommon a situation to find oneself in, after all! But definitely, the best-case would be if it results in the tools learning how to automate the process so people that have no clue what a hex editor even is can still have at least /some/ chance of recovering from it, where we're just lucky here that someone with the technical skill and just as importantly the time/motivation/determination to either get a fix or know exactly why it /could-not/ be fixed, happened to have the problem, not someone more like me that /might/ have the technical skill, but would be far more likely to just accept the damage as reality and fall back to the backups such as they are, than actually invest the time in either getting that fix or knowing for sure that it /can't/ be fixed. The signature I've seen, something about the unreasonable man refusing to accept reality, thereby making his own, and /thereby/, changing it for the good, for everyone, thus progress depending on the unreasonable man, comes to mind. =:^) Yes, I suppose I /did/ just call you "unreasonable", but that's a rather extreme compliment, in this case! =:^) -- Duncan - List replies preferred. 
No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
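Point 3 above (no tool to dump a superblock in hex or to patch one in
place and regenerate the CSUM) is concrete enough to sketch. The following
is only an illustration of what such a helper might look like, not an
existing btrfs-progs tool, and it bakes in my reading of the on-disk
format: a 4096-byte superblock copy at byte offset 65536, checksummed with
CRC-32C over everything after the 32-byte csum field, the result stored
little-endian at offset 0. Verify those assumptions against btrfs-progs
before use, and only ever point it at an image or clone, never a live
device:

    # Hypothetical helper: recompute and patch the CRC-32C of a btrfs
    # superblock copy in place. Offsets and checksum layout are assumptions.
    import sys

    def crc32c(data: bytes, crc: int = 0xFFFFFFFF) -> int:
        # Bit-by-bit CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        return crc ^ 0xFFFFFFFF

    SUPER_OFFSET = 64 * 1024    # primary superblock copy (assumed offset)
    SUPER_SIZE = 4096
    CSUM_SIZE = 32              # csum field; crc32c uses only its first 4 bytes

    def patch_super_csum(path: str, offset: int = SUPER_OFFSET) -> None:
        with open(path, "r+b") as f:
            f.seek(offset)
            sb = bytearray(f.read(SUPER_SIZE))
            old = bytes(sb[0:4])
            csum = crc32c(bytes(sb[CSUM_SIZE:]))   # checksum everything after the csum field
            sb[0:4] = csum.to_bytes(4, "little")   # store little-endian at offset 0
            f.seek(offset)
            f.write(sb)
            print(f"csum {old.hex()} -> {csum:#010x}")

    if __name__ == "__main__":
        patch_super_csum(sys.argv[1])   # argument: a ddrescue image or clone

Comparing the recomputed value against the four bytes already on disk,
before writing anything back, is also the quickest way to tell whether a
superblock the tools reject is wholesale garbage or merely carrying a
stale or damaged csum.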
* Re: A Big Thank You, and some Notes on Current Recovery Tools.
From: Qu Wenruo @ 2018-01-01 10:13 UTC
To: swestrup, linux-btrfs; +Cc: Nikolay Borisov

On 2018年01月01日 08:48, Stirling Westrup wrote:
> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>
> Thanks to their tireless help in answering all my dumb questions I
> have managed to get my BTRFS working again! As I speak I have the
> full, non-degraded, quad of drives mounted and am updating my latest
> backup of their contents.
>
> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
> drives failed, and with help I was able to make a 100% recovery of the
> lost data. I do have some observations on what I went through though.
> Take this as constructive criticism, or as a point for discussing
> additions to the recovery tools:
>
> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
> errors exactly coincided with the 3 super-blocks on the drive.

WTF, why does all this corruption happen at the btrfs super blocks?!

What a coincidence.

> The
> odds against this happening as random independent events is so
> unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)

Yep, that's also why I was thinking the corruption was much heavier than
we expected.

But if this turns out to be superblocks only, then as long as the
superblock can be recovered, you're good to go.

> So, I'm going to guess this wasn't random chance. Its possible that
> something inside the drive's layers of firmware is to blame, but it
> seems more likely to me that there must be some BTRFS process that
> can, under some conditions, try to update all superblocks as quickly
> as possible.

Btrfs only tries to update its superblock when committing a transaction.
And it's only done after all devices are flushed.

AFAIK there is nothing strange.

> I think it must be that a drive failure during this
> window managed to corrupt all three superblocks.

Maybe, but at least the first (primary) superblock is written with the FUA
flag. Unless you have enabled libata FUA support (which is disabled by
default) AND your drive supports native FUA (not all HDDs support it; I
only have one Seagate 3.5" HDD that does), the FUA write will be converted
to write & flush, which should be quite safe.

The only window I can think of is between submitting the superblock write
requests and waiting for them.

But anyway, btrfs superblocks are the ONLY metadata not protected by CoW,
so it is possible something may go wrong with unlucky timing.

> It may be better to
> perform an update-readback-compare on each superblock before moving
> onto the next, so as to avoid this particular failure in the future. I
> doubt this would slow things down much as the superblocks must be
> cached in memory anyway.

That should be done by the block layer, where things like dm-integrity
could help.

>
> 2) The recovery tools seem too dumb while thinking they are smarter
> than they are. There should be some way to tell the various tools to
> consider some subset of the drives in a system as worth considering.
My fault, in fact there is a -F option for dump-super, to force it to
recognize the bad superblock and output whatever it has.

In that case at least we would be able to see if it was really corrupted
or just some bitflip in magic numbers.

> Not knowing that a superblock was a single 4096-byte sector, I had
> primed my recovery by copying a valid superblock from one drive to the
> clone of my broken drive before starting the ddrescue of the failing
> drive. I had hoped that I could piece together a valid superblock from
> a good drive, and whatever I could recover from the failing one. In
> the end this turned out to be a useful strategy, but meanwhile I had
> two drives that both claimed to be drive 2 of 4, and no drive claiming
> to be drive 1 of 4. The tools completely failed to deal with this case
> and were consistently preferring to read the bogus drive 2 instead of
> the real drive 2, and it wasn't until I deliberately patched over the
> magic in the cloned drive that I could use the various recovery tools
> without bizarre and spurious errors. I understand how this was never
> an anticipated scenario for the recovery process, but if its happened
> once, it could happen again. Just dealing with a failing drive and its
> clone both available in one system could cause this.

Well, most tools put more focus on not screwing things up further, so it's
common that they're not as smart as users really want.

At least, super-recover could take more advantage of the chunk tree to
regenerate the super if the user really wants.
(Although so far only one case, and that's your case, could make use of
this possible new feature.)

>
> 3) There don't appear to be any tools designed for dumping a full
> superblock in hex notation, or for patching a superblock in place.
> Seeing as I was forced to use a hex editor to do exactly that, and
> then go through hoops to generate a correct CSUM for the patched
> block, I would certainly have preferred there to be some sort of
> utility to do the patching for me.

Mostly because we thought the current super-recovery was good enough,
until your case.

>
> 4) Despite having lost all 3 superblocks on one drive in a 4-drive
> setup (RAID0 Data with RAID1 Metadata), it was possible to derive all
> missing information needed to rebuild the lost superblock from the
> existing good drives. I don't know how often it can be done, or if it
> was due to some peculiarity of the particular RAID configuration I was
> using, or what. But seeing as this IS possible at least under some
> circumstances, it would be useful to have some recovery tools that
> knew what those circumstances were, and could make use of them.

In fact, you don't even need any special tool to do the recovery.

The basic ro+degraded mount should allow you to recover 75% of your data.
And btrfs-recovery should do pretty much the same.

The biggest advantage you had was your faith, and the knowledge that only
the superblocks were corrupted on the device, which turns out to be a
miracle.
(While at the point I learned your backup supers were also corrupted, I
lost faith.)

Thanks,
Qu

>
> 5) Finally, I want to comment on the fact that each drive only stored
> up to 3 superblocks. Knowing how important they are to system
> integrity, I would have been happy to have had 5 or 10 such blocks, or
> had each drive keep one copy of each superblock for each other drive.
> At 4K per superblock, this would seem a trivial amount to store even
> in a huge raid with 64 or 128 drives in it.
> Could there be some method
> introduced for keeping far more redundant metainformation around? I
> admit I'm unclear on what the optimal numbers of these things would
> be. Certainly if I hadn't lost all 3 superblocks at once, I might have
> thought that number adequate.
>
> Anyway, I hope no one takes these criticisms the wrong way. I'm a huge
> fan of BTRFS and its potential, and I know its still early days for
> the code base, and it's yet to fully mature in its recovery and
> diagnostic tools. I'm just hoping that these points can contribute in
> some small way and give back some of the help I got in fixing my
> system!
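On the "which drives should the tools even consider" problem, a read-only
triage pass over every candidate device or image can take much of the
confusion out of a situation like the one above (a real drive 2 and a
half-primed clone both claiming the same identity) before any recovery
tool is run. The sketch below is not an existing tool; it assumes the
superblock copies sit at 64 KiB, 64 MiB and 256 GiB, with the fsid at byte
32, the magic "_BHRfS_M" at byte 64 and the generation at byte 72, per the
wiki's on-disk format page, and those offsets should be double-checked
there:

    # Read-only triage of candidate devices/images: for every superblock
    # copy, report whether the magic is intact and what fsid/generation it
    # claims. Field offsets are assumptions taken from the on-disk format docs.
    import struct, sys

    SUPER_OFFSETS = (64 * 1024, 64 * 1024**2, 256 * 1024**3)
    MAGIC = b"_BHRfS_M"

    def triage(path):
        with open(path, "rb") as f:
            for off in SUPER_OFFSETS:
                try:
                    f.seek(off)
                    sb = f.read(4096)
                except OSError as err:
                    print(f"{path} @ {off}: read error ({err})")
                    continue
                if len(sb) < 4096:
                    break                  # device too small for this copy
                fsid = sb[32:48].hex()
                magic_ok = sb[64:72] == MAGIC
                (generation,) = struct.unpack_from("<Q", sb, 72)
                print(f"{path} @ {off}: magic={'ok' if magic_ok else 'BAD'} "
                      f"fsid={fsid} generation={generation}")

    for dev in sys.argv[1:]:               # e.g. the good drives, the clone, the original
        triage(dev)

Run across the real drive 2, its clone and the other devices, something
like this would have flagged the duplicated fsid and the dead superblock
copies up front, without any tool trying to assemble the filesystem.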
* Re: A Big Thank You, and some Notes on Current Recovery Tools.
From: Kai Krakow @ 2018-01-01 12:15 UTC
To: linux-btrfs

On Mon, 01 Jan 2018 18:13:10 +0800, Qu Wenruo wrote:

> On 2018年01月01日 08:48, Stirling Westrup wrote:
>> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
>> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>>
>> Thanks to their tireless help in answering all my dumb questions I have
>> managed to get my BTRFS working again! As I speak I have the full,
>> non-degraded, quad of drives mounted and am updating my latest backup
>> of their contents.
>>
>> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
>> drives failed, and with help I was able to make a 100% recovery of the
>> lost data. I do have some observations on what I went through though.
>> Take this as constructive criticism, or as a point for discussing
>> additions to the recovery tools:
>>
>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>> errors exactly coincided with the 3 super-blocks on the drive.
>
> WTF, why all these corruption all happens at btrfs super blocks?!
>
> What a coincident.

Maybe it's a hybrid drive with flash? Or something that went wrong in the
drive-internal cache memory the very time the superblocks were updated?

I bet that the sectors aren't really broken, just that the on-disk
checksum didn't match the sector. I remember such things happening to me
more than once back in the days when drives were still connected by molex
power connectors. Those connectors started to get loose over time, due to
thermals or repeated disconnect and connect. That is, drives sometimes
started to no longer have a reliable power source, which led to all sorts
of very strange problems, mostly resulting in pseudo-defective sectors.

That said, the OP might want to check the power supply after this
coincidence... Maybe it's aging and no longer able to support all four
drives, CPU, GPU and stuff with stable power.

--
Regards,
Kai

Replies to list-only preferred.
* Re: A Big Thank You, and some Notes on Current Recovery Tools.
From: Stirling Westrup @ 2018-01-01 19:44 UTC
To: Kai Krakow; +Cc: linux-btrfs

On Mon, Jan 1, 2018 at 7:15 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
> Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo:
>
>> On 2018年01月01日 08:48, Stirling Westrup wrote:
>>>
>>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>>> errors exactly coincided with the 3 super-blocks on the drive.
>>
>> WTF, why all these corruption all happens at btrfs super blocks?!
>>
>> What a coincident.
>
> Maybe it's a hybrid drive with flash? Or something that went wrong in the
> drive-internal cache memory the very time when superblocks where updated?
>
> I bet that the sectors aren't really broken, just the on-disk checksum
> didn't match the sector. I remember such things happening to me more than
> once back in the days when drives where still connected by molex power
> connectors. Those connectors started to get loose over time, due to
> thermals or repeated disconnect and connect. That is, drives sometimes
> started to no longer have a reliable power source which let to all sorts
> of very strange problems, mostly resulting in pseudo-defective sectors.
>
> That said, the OP would like to check the power supply after this
> coincidence... Maybe it's aging and no longer able to support all four
> drives, CPU, GPU and stuff with stable power.

You may be right about the cause of the error being a power-supply issue.
For those that are curious, the drive that failed was a Seagate Barracuda
LP 2000G drive (ST2000DL003).

I hadn't gone into the particulars of the failure, but the BTRFS in
question is my file server and it mostly holds ripped DVDs, so the storage
tends to grow in size but existing files seldom change, unless I
reorganize things. The intent is for it to be backed up to a proper RAIDed
BTRFS system weekly, but I have to admit that I've never gotten around to
automating the start of backups and have just been running it whenever I
make large changes to the file server, or reorganize things.

I was starting to run out of space on the file server, and I had noticed a
few transient drive errors in the logs (from the 2T device that failed)
and so had decided I'd add another 2T device to the array temporarily, and
then replace both the failing device and the temp device with a new 4T
drive once I'd had a chance to go buy a new one.

In hindsight (which is always 20/20), I should have updated the backups
before starting to make my changes, but as I'd just added a new 4T drive
to the BTRFS RAID6 in my backup system a week before, and it went as
smooth as butter, I guess I was feeling insufficiently paranoid.

I shut down the system, installed the 5th drive, rebooted... and nothing.
The system made some horrible sounds and refused to boot. It wouldn't even
get past POST. Not being a hardware guy I wasn't sure what killed my
server box, but I assume it was the power supply. Again, once I get the
chance I'll take it to my local computer shop and have someone look at it.

Luckily I had an exactly identical system lying idle, so I swapped all the
drives and the extra SATA controller to handle them, and booted it up,
only to find that the failing drive had now definitely failed.
Interestingly, the various tools I used kept reporting an 'unknown error'
for the 3 bad sectors. IIRC, one of the diagnostic tools reported it as
"Error 11 (Unknown)". In any case, there appeared to be many errors on the
disk, but when I used ddrescue to make a full copy of it, all of the
sectors were (eventually) fully recovered, except for the 3 superblocks.

After a few days of non-destructive tests and googling for information on
BTRFS multi-drive systems, I finally decided I had to contact this list
for advice, and the rest is well documented.
* Re: A Big Thank You, and some Notes on Current Recovery Tools.
From: Duncan @ 2018-01-02 2:03 UTC
To: linux-btrfs

Stirling Westrup posted on Mon, 01 Jan 2018 14:44:43 -0500 as excerpted:

> In hind sight (which is always 20/20), I should have updated the backups
> before starting to make my changes, but as I'd just added a new 4T drive
> to the BTRFS RAID6 in my backup system a week before, and it went as
> smooth as butter, I guess I was feeling insufficiently paranoid.

Are you aware of btrfs raid56-mode history?

If you're running a current enough kernel (wiki says 4.12 for raid56 mode,
but you might want 4.14 for other fixes and/or the fact that it's LTS) the
severest known raid56 issues that had it recommendation-blacklisted are
fixed, but raid56 mode still doesn't have fixes for the infamous
parity-raid write hole, and parities are not checksummed (in hindsight an
implementation mistake, as it breaks the integrity and checksumming
guarantees btrfs otherwise provides), and that's going to require an
on-disk format change and some major work to fix.

If you're running at least kernel 4.12 and are aware of and understand the
remaining raid56 caveats, raid56 mode can be a valid choice, but if not, I
strongly recommend doing more research to learn and understand those
caveats, before relying too heavily on that backup.

The most reliable and well tested btrfs multi-device mode remains raid1,
tho that's expensive in terms of space required since it duplicates
everything.

For many devices, the recommendation seems to remain btrfs raid1, either
straight, or on top of a pair of mdraid0s (or the like: dmraid0s, hardware
raid0s, etc), since that performs better than btrfs raid10, and removes a
confusing tho not harmful if properly understood layout ambiguity of btrfs
raid10 as well.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
* Re: A Big Thank You, and some Notes on Current Recovery Tools.
From: ein @ 2018-01-02 10:02 UTC
To: swestrup, Kai Krakow; +Cc: linux-btrfs

On 01/01/2018 08:44 PM, Stirling Westrup wrote:
> On Mon, Jan 1, 2018 at 7:15 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
>> Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo:
>>
>>> On 2018年01月01日 08:48, Stirling Westrup wrote:
>>>>
>>>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>>>> errors exactly coincided with the 3 super-blocks on the drive.
>>>
>>> WTF, why all these corruption all happens at btrfs super blocks?!
>>>
>>> What a coincident.
>>
>> Maybe it's a hybrid drive with flash? Or something that went wrong in the
>> drive-internal cache memory the very time when superblocks where updated?
>>
>> I bet that the sectors aren't really broken, just the on-disk checksum
>> didn't match the sector. I remember such things happening to me more than
>> once back in the days when drives where still connected by molex power
>> connectors. Those connectors started to get loose over time, due to
>> thermals or repeated disconnect and connect. That is, drives sometimes
>> started to no longer have a reliable power source which let to all sorts
>> of very strange problems, mostly resulting in pseudo-defective sectors.
>>
>> That said, the OP would like to check the power supply after this
>> coincidence... Maybe it's aging and no longer able to support all four
>> drives, CPU, GPU and stuff with stable power.
>
> You may be right about the cause of the error being a power-supply issue.
> For those that are curious, the drive that failed was a Seagate Barracuda
> LP 2000G drive (ST2000DL003).
>

Forgive me if it's not relevant, but I own quite a few disks from that
series, like:

root@iomega-ordo:~# hdparm -i /dev/sda
/dev/sda:
 Model=ST2000DM001-1CH164, FwRev=CC27, SerialNo=Z1E6EV85
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }

root@iomega-acm:~# smartctl -d sat -a /dev/sda
=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-9YN166
Serial Number:    S1F0PGQJ
LU WWN Device Id: 5 000c50 0516fce00
Firmware Version: CC4B

root@iomega-europol:~# smartctl -d sat -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [armv5tel-linux-2.6.31.8] (local build)
=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-9YN166
Serial Number:    Z1F1H5KA
LU WWN Device Id: 5 000c50 04ec18fda

Different locations, different environments, different boards, some with
more stable power than others.

I replaced at least three or four in the past 3 years. All of them died
because of heavy random write workload (rsnapshot, massive cp -al of
millions of files every day). In my case bad sectors occurred every time
too, but I didn't analyze where exactly; it was just a backup destination
drive. I'm pretty convinced it could have been ext2 supers too, though.

--
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10
* RE: A Big Thank You, and some Notes on Current Recovery Tools.
From: Paul Jones @ 2018-01-02 11:15 UTC
To: ein, swestrup@gmail.com, Kai Krakow; +Cc: linux-btrfs@vger.kernel.org

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org
> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of ein
> Sent: Tuesday, 2 January 2018 9:03 PM
> To: swestrup@gmail.com; Kai Krakow <hurikhan77@gmail.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: A Big Thank You, and some Notes on Current Recovery Tools.

> Forgive me if it's not relevant, but I own quite a few disks from that
> series, like:
>
> root@iomega-ordo:~# hdparm -i /dev/sda
> /dev/sda:
>  Model=ST2000DM001-1CH164, FwRev=CC27, SerialNo=Z1E6EV85
>  Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
>
> root@iomega-acm:~# smartctl -d sat -a /dev/sda
> === START OF INFORMATION SECTION ===
> Device Model:     ST3000DM001-9YN166
> Serial Number:    S1F0PGQJ
> LU WWN Device Id: 5 000c50 0516fce00
> Firmware Version: CC4B
>
> root@iomega-europol:~# smartctl -d sat -a /dev/sda
> smartctl 5.41 2011-06-09 r3365 [armv5tel-linux-2.6.31.8] (local build)
> === START OF INFORMATION SECTION ===
> Device Model:     ST3000DM001-9YN166
> Serial Number:    Z1F1H5KA
> LU WWN Device Id: 5 000c50 04ec18fda
>
> Different locations, different environments, different boards one more
> stable (the power) than others.
>
> I replaced at least three four in the past 3 years. All of them died
> because heavy random wirte workload. (rsnapshot, massive cp -al of
> millions of files every day). In my case every time bad sectors occurred
> too, but I didn't analyze where exactly, it was just a backup destination
> drive. I pretty convinced it could be ext2 supers too though.

I think the 1-3TB Seagate drives are garbage. Out of 6 drives I replaced
all of them under warranty due to bad sectors, and 2 of them were replaced
twice! As the replacements failed out of warranty they were replaced with
3-4TB HGST drives and I've had no problems ever since. My workload was
just a daily backup store, so they sat there idling about 22 hours a day.

I hear the 4+TB Seagate drives are much better quality but I have no
experience with them.

Paul.
* RE: A Big Thank You, and some Notes on Current Recovery Tools.
From: Marat Khalili @ 2018-01-02 12:45 UTC
To: Paul Jones, ein.net; +Cc: linux-btrfs@vger.kernel.org

> I think the 1-3TB Seagate drives are garbage.

There are known problems with ST3000DM001, but first of all you should not
put PC-oriented disks in RAID, they are not designed for it on multiple
levels (vibration tolerance, error reporting...) There are similar horror
stories about people filling whole cases with WD Greens and observing
their (non-BTRFS) RAID 6 fail.

(Sorry for OT.)

--
With Best Regards,
Marat Khalili
* Re: A Big Thank You, and some Notes on Current Recovery Tools.
From: ein @ 2018-01-02 14:45 UTC
Cc: linux-btrfs@vger.kernel.org

On 01/02/2018 01:45 PM, Marat Khalili wrote:
>> I think the 1-3TB Seagate drives are garbage.
>
> There are known problems with ST3000DM001, but first of all you should not
> put PC-oriented disks in RAID, they are not designed for it on multiple
> levels (vibration tolerance, error reporting...) There are similar horror
> stories about people filling whole cases with WD Greens and observing
> their (non-BTRFS) RAID 6 fail.

Same with the ST2000DM001; Lenovo and Seagate did that, for instance. How
the hell is a user supposed to know...

>
> (Sorry for OT.)

m2

--
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10
* Re: A Big Thank You, and some Notes on Current Recovery Tools. 2018-01-01 10:13 ` Qu Wenruo 2018-01-01 12:15 ` Kai Krakow @ 2018-01-01 22:50 ` waxhead 2018-01-02 0:57 ` Qu Wenruo 1 sibling, 1 reply; 12+ messages in thread From: waxhead @ 2018-01-01 22:50 UTC (permalink / raw) To: Qu Wenruo, swestrup, linux-btrfs; +Cc: Nikolay Borisov Qu Wenruo wrote: > > > On 2018年01月01日 08:48, Stirling Westrup wrote: >> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK >> YOU to Nikolay Borisov and most especially to Qu Wenruo! >> >> Thanks to their tireless help in answering all my dumb questions I >> have managed to get my BTRFS working again! As I speak I have the >> full, non-degraded, quad of drives mounted and am updating my latest >> backup of their contents. >> >> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T >> drives failed, and with help I was able to make a 100% recovery of the >> lost data. I do have some observations on what I went through though. >> Take this as constructive criticism, or as a point for discussing >> additions to the recovery tools: >> >> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3 >> errors exactly coincided with the 3 super-blocks on the drive. > > WTF, why all these corruption all happens at btrfs super blocks?! > > What a coincident. > >> The >> odds against this happening as random independent events is so >> unlikely as to be mind-boggling. (Something like odds of 1 in 10^26) > > Yep, that's also why I was thinking the corruption is much heavier than > our expectation. > > But if this turns out to be superblocks only, then as long as superblock > can be recovered, you're OK to go. > >> So, I'm going to guess this wasn't random chance. Its possible that >> something inside the drive's layers of firmware is to blame, but it >> seems more likely to me that there must be some BTRFS process that >> can, under some conditions, try to update all superblocks as quickly >> as possible. > > Btrfs only tries to update its superblock when committing transaction. > And it's only done after all devices are flushed. > > AFAIK there is nothing strange. > >> I think it must be that a drive failure during this >> window managed to corrupt all three superblocks. > > Maybe, but at least the first (primary) superblock is written with FUA > flag, unless you enabled libata FUA support (which is disabled by > default) AND your driver supports native FUA (not all HDD supports it, I > only have a seagate 3.5 HDD supports it), FUA write will be converted to > write & flush, which should be quite safe. > > The only timing I can think of is, between the superblock write request > submit and the wait for them. > > But anyway, btrfs superblocks are the ONLY metadata not protected by > CoW, so it is possible something may go wrong at certain timming. > So from what I can piece together SSD mode is safer even for regular harddisks correct? According to this... https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock - There is 3x superblocks for every device. - The superblocks are updated every 30 seconds if there is any changes... - SSD mode will not try to update all superblocks in one go, but update one by one every 30 seconds. So if SSD mode is enabled even for harddisks then only 60 seconds of filesystem history / activity will potentially be lost... 
this sounds like a reasonable trade-off compared to having your entire filesystem hampered if your hardware is not perhaps optimal (which is sort of the point with BTRFS' checksumming anyway) So would it make sense to enable SSD behavior by default for HDD's ?! >> It may be better to >> perform an update-readback-compare on each superblock before moving >> onto the next, so as to avoid this particular failure in the future. I >> doubt this would slow things down much as the superblocks must be >> cached in memory anyway. > > That should be done by block layer, where things like dm-integrity could > help. > >> >> 2) The recovery tools seem too dumb while thinking they are smarter >> than they are. There should be some way to tell the various tools to >> consider some subset of the drives in a system as worth considering. > > My fault, in fact there is a -F option for dump-super, to force it to > recognize the bad superblock and output whatever it has. > > In that case at least we could be able to see if it was really corrupted > or just some bitflip in magic numbers. > >> Not knowing that a superblock was a single 4096-byte sector, I had >> primed my recovery by copying a valid superblock from one drive to the >> clone of my broken drive before starting the ddrescue of the failing >> drive. I had hoped that I could piece together a valid superblock from >> a good drive, and whatever I could recover from the failing one. In >> the end this turned out to be a useful strategy, but meanwhile I had >> two drives that both claimed to be drive 2 of 4, and no drive claiming >> to be drive 1 of 4. The tools completely failed to deal with this case >> and were consistently preferring to read the bogus drive 2 instead of >> the real drive 2, and it wasn't until I deliberately patched over the >> magic in the cloned drive that I could use the various recovery tools >> without bizarre and spurious errors. I understand how this was never >> an anticipated scenario for the recovery process, but if its happened >> once, it could happen again. Just dealing with a failing drive and its >> clone both available in one system could cause this. > > Well, most tools put more focus on not screwing things further, so it's > common it's not as smart as user really want. > > At least, super-recover could take more advantage of using chunk tree to > regenerate the super if user really want. > (Although so far only one case, and that's your case, could take use of > this possible new feature though) > >> >> 3) There don't appear to be any tools designed for dumping a full >> superblock in hex notation, or for patching a superblock in place. >> Seeing as I was forced to use a hex editor to do exactly that, and >> then go through hoops to generate a correct CSUM for the patched >> block, I would certainly have preferred there to be some sort of >> utility to do the patching for me. > > Mostly because we think current super-recovery is good enough, until > your case. > >> >> 4) Despite having lost all 3 superblocks on one drive in a 4-drive >> setup (RAID0 Data with RAID1 Metadata), it was possible to derive all >> missing information needed to rebuild the lost superblock from the >> existing good drives. I don't know how often it can be done, or if it >> was due to some peculiarity of the particular RAID configuration I was >> using, or what. 
But seeing as this IS possible at least under some >> circumstances, it would be useful to have some recovery tools that >> knew what those circumstances were, and could make use of them. > > In fact, you don't even need any special tool to do the recovery. > > The basic ro+degraded mount should allow you to recover 75% of your data. > And btrfs-recovery should do pretty much the same. > > The biggest advantage you have is, your faith and knowledge about only > superblocks are corrupted in the device, which turns out to be a miracle. > (While at the point I know your backup supers are also corrupted, I lose > the faith) > > Thanks, > Qu > >> >> 5) Finally, I want to comment on the fact that each drive only stored >> up to 3 superblocks. Knowing how important they are to system >> integrity, I would have been happy to have had 5 or 10 such blocks, or >> had each drive keep one copy of each superblock for each other drive. >> At 4K per superblock, this would seem a trivial amount to store even >> in a huge raid with 64 or 128 drives in it. Could there be some method >> introduced for keeping far more redundant metainformation around? I >> admit I'm unclear on what the optimal numbers of these things would >> be. Certainly if I hadn't lost all 3 superblocks at once, I might have >> thought that number adequate. >> >> Anyway, I hope no one takes these criticisms the wrong way. I'm a huge >> fan of BTRFS and its potential, and I know its still early days for >> the code base, and it's yet to fully mature in its recovery and >> diagnostic tools. I'm just hoping that these points can contribute in >> some small way and give back some of the help I got in fixing my >> system! >> >> >> > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: A Big Thank You, and some Notes on Current Recovery Tools. 2018-01-01 22:50 ` waxhead @ 2018-01-02 0:57 ` Qu Wenruo 0 siblings, 0 replies; 12+ messages in thread From: Qu Wenruo @ 2018-01-02 0:57 UTC (permalink / raw) To: waxhead, swestrup, linux-btrfs; +Cc: Nikolay Borisov [-- Attachment #1.1: Type: text/plain, Size: 9197 bytes --] On 2018年01月02日 06:50, waxhead wrote: > Qu Wenruo wrote: >> >> >> On 2018年01月01日 08:48, Stirling Westrup wrote: >>> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK >>> YOU to Nikolay Borisov and most especially to Qu Wenruo! >>> >>> Thanks to their tireless help in answering all my dumb questions I >>> have managed to get my BTRFS working again! As I speak I have the >>> full, non-degraded, quad of drives mounted and am updating my latest >>> backup of their contents. >>> >>> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T >>> drives failed, and with help I was able to make a 100% recovery of the >>> lost data. I do have some observations on what I went through though. >>> Take this as constructive criticism, or as a point for discussing >>> additions to the recovery tools: >>> >>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3 >>> errors exactly coincided with the 3 super-blocks on the drive. >> >> WTF, why all these corruption all happens at btrfs super blocks?! >> >> What a coincident. >> >>> The >>> odds against this happening as random independent events is so >>> unlikely as to be mind-boggling. (Something like odds of 1 in 10^26) >> >> Yep, that's also why I was thinking the corruption is much heavier than >> our expectation. >> >> But if this turns out to be superblocks only, then as long as superblock >> can be recovered, you're OK to go. >> >>> So, I'm going to guess this wasn't random chance. Its possible that >>> something inside the drive's layers of firmware is to blame, but it >>> seems more likely to me that there must be some BTRFS process that >>> can, under some conditions, try to update all superblocks as quickly >>> as possible. >> >> Btrfs only tries to update its superblock when committing transaction. >> And it's only done after all devices are flushed. >> >> AFAIK there is nothing strange. >> >>> I think it must be that a drive failure during this >>> window managed to corrupt all three superblocks. >> >> Maybe, but at least the first (primary) superblock is written with FUA >> flag, unless you enabled libata FUA support (which is disabled by >> default) AND your driver supports native FUA (not all HDD supports it, I >> only have a seagate 3.5 HDD supports it), FUA write will be converted to >> write & flush, which should be quite safe. >> >> The only timing I can think of is, between the superblock write request >> submit and the wait for them. >> >> But anyway, btrfs superblocks are the ONLY metadata not protected by >> CoW, so it is possible something may go wrong at certain timming. >> > > So from what I can piece together SSD mode is safer even for regular > harddisks correct? > > According to this... > https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock > > - There is 3x superblocks for every device. At most 3x. The 3rd one is for device larger than 256G. > - The superblocks are updated every 30 seconds if there is any changes... The interval can be specified by commit= mount option. And 30 is the default. > - SSD mode will not try to update all superblocks in one go, but update > one by one every 30 seconds. 
If I didn't miss anything, from write_dev_supers() and wait_dev_supers(), nothing checks the SSD mount option flag to do anything different.

So, again if I didn't miss anything, the superblock write path is the same, unless you're using the nobarrier mount option.

Thanks,
Qu

>
> So if SSD mode is enabled even for hard disks, then only 60 seconds of filesystem history / activity will potentially be lost... this sounds like a reasonable trade-off compared to having your entire filesystem hampered if your hardware is perhaps not optimal (which is sort of the point with BTRFS' checksumming anyway).
>
> So would it make sense to enable SSD behavior by default for HDDs?!
>
>>> It may be better to perform an update-readback-compare on each superblock before moving onto the next, so as to avoid this particular failure in the future. I doubt this would slow things down much as the superblocks must be cached in memory anyway.
>>
>> That should be done by the block layer, where things like dm-integrity could help.
>>
>>> 2) The recovery tools seem too dumb while thinking they are smarter than they are. There should be some way to tell the various tools to consider some subset of the drives in a system as worth considering.
>>
>> My fault, in fact there is a -F option for dump-super, to force it to recognize the bad superblock and output whatever it has.
>>
>> In that case at least we would be able to see if it was really corrupted or just some bitflip in the magic numbers.
>>
>>> Not knowing that a superblock was a single 4096-byte sector, I had primed my recovery by copying a valid superblock from one drive to the clone of my broken drive before starting the ddrescue of the failing drive. I had hoped that I could piece together a valid superblock from a good drive, and whatever I could recover from the failing one. In the end this turned out to be a useful strategy, but meanwhile I had two drives that both claimed to be drive 2 of 4, and no drive claiming to be drive 1 of 4. The tools completely failed to deal with this case and were consistently preferring to read the bogus drive 2 instead of the real drive 2, and it wasn't until I deliberately patched over the magic in the cloned drive that I could use the various recovery tools without bizarre and spurious errors. I understand how this was never an anticipated scenario for the recovery process, but if its happened once, it could happen again. Just dealing with a failing drive and its clone both available in one system could cause this.
>>
>> Well, most tools put more focus on not screwing things up further, so it's common that they are not as smart as users really want.
>>
>> At least, super-recover could take more advantage of the chunk tree to regenerate the super if the user really wants. (Although so far only one case, and that's your case, could make use of this possible new feature.)
>>
>>> 3) There don't appear to be any tools designed for dumping a full superblock in hex notation, or for patching a superblock in place. Seeing as I was forced to use a hex editor to do exactly that, and then go through hoops to generate a correct CSUM for the patched block, I would certainly have preferred there to be some sort of utility to do the patching for me.
>>
>> Mostly because we think the current super-recovery is good enough, until your case.
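As a rough illustration of the kind of helper point 3 asks for: per the on-disk format wiki page, the superblock csum is a CRC-32C of bytes 32..4095 of the 4KiB block, stored little-endian at the start of the 32-byte csum field. The snippet below is a hypothetical user-space sketch based on that description, not an existing btrfs-progs utility, and fix_super_csum() is an invented name:

/* Hypothetical helper (not an existing btrfs-progs tool): recompute the
 * CRC-32C of a 4KiB superblock image, as one would need after patching
 * it with a hex editor.  Layout per the on-disk format wiki page: the
 * checksum covers bytes 32..4095 and its 4 CRC bytes are stored
 * little-endian at offset 0 of the 32-byte csum field. */
#include <stdint.h>
#include <stddef.h>

#define SUPER_SIZE 4096
#define CSUM_SIZE  32

/* Plain bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
	}
	return ~crc;
}

/* Recompute and store the csum of an in-memory superblock copy.  Only
 * the first 4 bytes of the csum field are used for crc32c; the rest of
 * the 32-byte field is left untouched. */
static void fix_super_csum(uint8_t sb[SUPER_SIZE])
{
	uint32_t crc = crc32c(sb + CSUM_SIZE, SUPER_SIZE - CSUM_SIZE);

	sb[0] = crc & 0xff;
	sb[1] = (crc >> 8) & 0xff;
	sb[2] = (crc >> 16) & 0xff;
	sb[3] = (crc >> 24) & 0xff;
}

Reading the 4KiB image from one of the mirror offsets, running it through something like fix_super_csum(), and writing it back would be the "patch in place" flow point 3 describes.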
>>
>>> 4) Despite having lost all 3 superblocks on one drive in a 4-drive setup (RAID0 Data with RAID1 Metadata), it was possible to derive all missing information needed to rebuild the lost superblock from the existing good drives. I don't know how often it can be done, or if it was due to some peculiarity of the particular RAID configuration I was using, or what. But seeing as this IS possible at least under some circumstances, it would be useful to have some recovery tools that knew what those circumstances were, and could make use of them.
>>
>> In fact, you don't even need any special tool to do the recovery.
>>
>> The basic ro+degraded mount should allow you to recover 75% of your data. And btrfs-recovery should do pretty much the same.
>>
>> The biggest advantage you had was your faith in, and knowledge of, the fact that only the superblocks were corrupted on the device, which turned out to be a miracle. (At the point where I knew your backup supers were also corrupted, I had lost that faith.)
>>
>> Thanks,
>> Qu
>>
>>> 5) Finally, I want to comment on the fact that each drive only stored up to 3 superblocks. Knowing how important they are to system integrity, I would have been happy to have had 5 or 10 such blocks, or had each drive keep one copy of each superblock for each other drive. At 4K per superblock, this would seem a trivial amount to store even in a huge raid with 64 or 128 drives in it. Could there be some method introduced for keeping far more redundant metainformation around? I admit I'm unclear on what the optimal numbers of these things would be. Certainly if I hadn't lost all 3 superblocks at once, I might have thought that number adequate.
>>>
>>> Anyway, I hope no one takes these criticisms the wrong way. I'm a huge fan of BTRFS and its potential, and I know its still early days for the code base, and it's yet to fully mature in its recovery and diagnostic tools. I'm just hoping that these points can contribute in some small way and give back some of the help I got in fixing my system!

^ permalink raw reply	[flat|nested] 12+ messages in thread
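Coming back to the update-readback-compare idea from point 1: the suggestion amounts to verifying each superblock copy before touching the next, instead of submitting all copies and then waiting, as the write_dev_supers()/wait_dev_supers() path discussed above does. Below is a minimal user-space sketch of that idea, not kernel code; write_supers_verified() is an invented name, and fsync() merely stands in for the FUA/flush a real implementation would issue:

/* Sketch of the per-copy "update, read back, compare" idea from point 1,
 * expressed in user-space terms; this is NOT how the kernel writes
 * superblocks today (it submits all copies, then waits for them). */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SUPER_SIZE 4096

static const uint64_t super_offsets[] = {
	64ULL << 10, 64ULL << 20, 256ULL << 30	/* 64KiB, 64MiB, 256GiB */
};

/* Write each applicable superblock copy and verify it before moving on.
 * Returns 0 on success, -1 as soon as any copy fails to verify. */
static int write_supers_verified(int fd, const uint8_t sb[SUPER_SIZE],
				 uint64_t dev_size)
{
	uint8_t readback[SUPER_SIZE];

	for (size_t i = 0; i < sizeof(super_offsets) / sizeof(super_offsets[0]); i++) {
		uint64_t off = super_offsets[i];

		if (off + SUPER_SIZE > dev_size)
			continue;	/* this mirror does not exist on smaller devices */

		if (pwrite(fd, sb, SUPER_SIZE, off) != SUPER_SIZE)
			return -1;
		if (fsync(fd))		/* flush before trusting the read back */
			return -1;
		/* Note: a real tool would need O_DIRECT (or similar) here so the
		 * read actually hits the media rather than the page cache. */
		if (pread(fd, readback, SUPER_SIZE, off) != SUPER_SIZE)
			return -1;
		if (memcmp(sb, readback, SUPER_SIZE))
			return -1;	/* stop before touching further copies */
	}
	return 0;
}

Stopping at the first copy that fails to read back would have left at least one intact superblock in the scenario described in this thread.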
end of thread, other threads:[~2018-01-02 14:45 UTC | newest]

Thread overview: 12+ messages
2018-01-01  0:48 A Big Thank You, and some Notes on Current Recovery Tools. Stirling Westrup
2018-01-01  5:21 ` Duncan
2018-01-01 10:13 ` Qu Wenruo
2018-01-01 12:15 ` Kai Krakow
2018-01-01 19:44 ` Stirling Westrup
2018-01-02  2:03 ` Duncan
2018-01-02 10:02 ` ein
2018-01-02 11:15 ` Paul Jones
2018-01-02 12:45 ` Marat Khalili
2018-01-02 14:45 ` ein
2018-01-01 22:50 ` waxhead
2018-01-02  0:57 ` Qu Wenruo