* Re: Can't add/replace a device on degraded filesystem
2016-12-30 0:27 Can't add/replace a device on degraded filesystem Rich Gannon
@ 2016-12-31 4:21 ` Duncan
2016-12-31 8:08 ` Roman Mamedov
1 sibling, 0 replies; 3+ messages in thread
From: Duncan @ 2016-12-31 4:21 UTC (permalink / raw)
To: linux-btrfs
Rich Gannon posted on Thu, 29 Dec 2016 19:27:30 -0500 as excerpted:
> Well I certainly got myself into a pickle. Been a Btrfs user since 2008
> and this is the first time I've had a serious problem....and I got two
> on the same day (I'm separating them in a different emails).
>
> I had 4x 4TB harddrives in a d=single m=raid1 array for about a year now
> containing many media files I really want to save.
It's a bit late now, but as I've posted previously, any admin worth the
name knows that failing to have a backup is, in an actions-speak-louder-
than-words way, defining the un-backed-up data as worth less than the
time, trouble and resources necessary to do that backup. And that's the
normal case, with fully stable and mature
filesystems. Since btrfs is still under heavy development, not as
unstable as it once was, but still very definitely in the process of
stabilizing and maturing, that rule of backups is even stronger than it'd
be otherwise -- if you don't have those backups, you're *seriously*
defining the data as not worth that hassle.
So it's pretty simple. Regardless of the outcome you can rest easy,
because either you had a backup, and by having it defined that data as
worth the bother to do that backup, or you don't have it, and by that
failure, defined that data as not worth the trouble. But because in
either case you saved what your actions defined as most important to
you, either way, you can be happy that you couldn't and didn't lose the
really important stuff, and can, worst-case, retrieve it from backup with
a bit more time and energy spent in restoring that backup. =:^)
That said, there remains a big difference between a theoretical/
calculated risk of loss, which you might have calculated to be worth that
risk, and actually being faced with it no longer being a relatively
low-chance theoretical risk, but a potential reality you're now trying to
avoid. Lower value than insuring against a relatively low-chance
theoretical risk is one thing; lower value than the time and hassle spent
avoiding otherwise certain loss is something else entirely, and even
given the truth of the above, once faced with that reality one can
legitimately find it worth it to spend some time and hassle avoiding it,
as the cost equations have now changed.
FWIW I've been there myself, even after knowingly taking that calculated
risk and losing the bet I made with fate, only to find the cost equations
now justify at least /some/ additional effort to get back, in my case,
the time-delta difference between the backup I had that I had let get
rather more stale than I arguably should have, and my previously working
copy.
Luckily in my case I was able to recover most of that delta, and chances
are very good that you can as well, particularly because you /can/ still
mount read-only, and as you say, should still have at least one copy of
that data safely available on the remaining devices.
> Yesterday I removed
> them from my desktop, installed them into a "new-to-me" Supermicro 2U
> server and even swapped over my HighPoint MegaRAID 2720 SAS HBA (yes,
> it's acting as a direct pass-thru HBA only). With the added space, I
> also installed an additional 4TB drive to the filesystem and was
> performing a rebalance with filters:
>
> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/bpool-btrfs
>
> I found that the new drive dropped off-line during the rebalance. I
> swapped the drive into a different bay to see if it was backplane, cord,
> or drive related. Upon remount, the same drive dropped offline. I had
> another new 4TB drive and swapped it in for the dead drive.
>
> I can mount my filesystem with -o degraded, but I can not do btrfs
> replace or btrfs device add as the filesystem is in read-only mode, and
> I can not mount read-write.
Yes. This is a known gotcha in the current btrfs implementation. It
remains rather easier than it should be to find oneself caught between a
rock and a hard place, with a read-only filesystem that can't be repaired
because that requires a writable mount, but without the ability to do a
writable mount, at least with current mainline code, until you can do
that repair. The problem is mentioned, without a real explanation, on
the wiki gotchas page (under "raid1 volumes only mountable once RW if
degraded"; your problem's slightly different, as it was an interrupted
conversion, but the same root problem applies).
The problem is this: btrfs with current code is smart enough to know
that single-mode chunks have only one copy, so you're likely in trouble
if a device is missing, and that a single missing device shouldn't be a
big deal if everything's raid1 or raid10. What it doesn't understand is
that the presence of some single-mode chunks on a partially raid1 or
raid10 filesystem with one device missing doesn't /necessarily/ mean that
any of those single-mode chunks actually happened to /be/ on the missing
device, and are thus themselves missing! It's entirely possible that all
the single-mode chunks are on devices that are still available, along
with at least one copy of all the raid1/10 stuff, so that no chunk is
actually entirely missing at all. In that case the filesystem isn't in
/immediate/ danger of further damage if mounted writable with the missing
device, and the mount should be allowed, tho ideally an admin will stay
in that state only long enough to do the actual repair.
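(If you want to verify that for yourself before risking anything, the
chunk tree can be dumped from one of the component devices and checked
for which devids each chunk's stripes reference. Only a hedged sketch,
using your first LUKS mapping purely as an example; the exact invocation
varies across btrfs-progs versions, with older versions shipping this as
btrfs-debug-tree rather than inspect-internal:

  # single-profile data chunks show up as plain "type DATA" with no
  # RAID flag; check whether any of their "stripe ... devid N" lines
  # point at the devid of the missing device
  btrfs inspect-internal dump-tree -t chunk /dev/mapper/bpool-1 | less

If none do, you're in exactly the "nothing actually missing" case
described above.)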
In the raid1 case, btrfs knows it needs two devices to write raid1, and
in the event of a degraded-writable mount with only a single device
remaining, it will write any new chunks as single mode because it can't
write raid1. That's why there's only one chance to fix it: if you don't
fix it during that one mount, the filesystem will then see single-mode
chunks on a filesystem with a missing device and will refuse to mount
writable the next time, even tho by definition those chunks couldn't have
been created on the missing device, since its being missing is what
triggered the single-mode writing in the first place. That's the rock-
and-hard-place scenario on the next mount.
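(For the record, the fix in that raid1 case has to happen inside that
single degraded-writable mount, something like the following -- only a
sketch, device names, mountpoint and the <missing-devid> placeholder all
hypothetical, with the soft filter just converting back any single-mode
chunks written while degraded:

  mount -o degraded /dev/mapper/good-dev /mnt/point
  btrfs replace start <missing-devid> /dev/mapper/new-dev /mnt/point
  # or: btrfs device add /dev/mapper/new-dev /mnt/point
  #     btrfs device delete missing /mnt/point
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/point

Miss that window and you're back to read-only.)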
You're seeing the same problem now, single mode chunks on a filesystem
with a device missing causing a failure to mount degraded-writable, but
in your case, it was because the single mode chunks were already there,
and the device went missing just as you were converting to redundant raid
after adding a new device. But because it /was/ the new device that went
missing, nothing, or at least very little, should have been written to
/it/ in single mode, as only the old chunks would be single mode; new
ones would be written as redundant raid, raid10 in your case.
> From my understanding, my data should all be safe as during the
> balance, no single-copy files should have made it onto the new drive
> (that subsequently failed). Is this a correct assumption?
Yes, altho there's a narrow chance of a few chunks being written to the
new and now missing device in single mode, between the time the
filesystem was mounted writable and the time you started the conversion.
After you started the conversion, any new chunks /should/ have been
written in the new raid10 mode.
But the good news is that only data chunks should be subject to that
narrow window, as all your metadata is either raid1 (the multi-device
default) or raid10. So the potential damage should be pretty limited.
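(A hedged way to quantify that window, with the filesystem mounted
degraded and read-only as you have it now -- the output obviously depends
on your filesystem and btrfs-progs version:

  btrfs device usage /mnt/bpool-btrfs

Any Data,single allocation the output attributes to the missing device is
the at-risk portion; Data,single sitting on devids 5-8 is still
readable.)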
> Here is some btrfs data:
> proton bpool-btrfs # btrfs fi df /mnt/bpool-btrfs/
> Data, RAID10: total=2.17TiB, used=1.04TiB
> Data, single: total=7.79TiB, used=7.59TiB
> System, RAID1: total=32.00MiB, used=1.08MiB
> Metadata, RAID10: total=1.00GiB, used=1023.88MiB
> Metadata, RAID1: total=10.00GiB, used=8.24GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> proton bpool-btrfs # btrfs fi sh /mnt/bpool-btrfs/
> Label: 'bigpool' uuid: 85e8b0dd-fbbd-48a2-abc4-ccaefa5e8d18
> Total devices 5 FS bytes used 8.64TiB
> devid 5 size 3.64TiB used 2.77TiB path /dev/mapper/bpool-3
> devid 6 size 3.64TiB used 2.77TiB path /dev/mapper/bpool-4
> devid 7 size 3.64TiB used 2.77TiB path /dev/mapper/bpool-1
> devid 8 size 3.64TiB used 2.77TiB path /dev/mapper/bpool-2
> *** Some devices missing
>
>
> NOTE: The drives are all fully encrypted with LUKS/dm_crypt.
>
> Please help me save the data :)
>
> Rich
Interesting. There's another rich that runs btrfs multi-device raid for
his media files and occasionally posts to the list. He's a gentoo dev
and I happen to run gentoo, the context in which we normally interact, so
it's interesting seeing him here, too. But he normally goes by rich0,
and AFAIK he isn't running btrfs on top of an encryption layer too, as
you are...
OK, so there's two ways you can go to save things, here. Your choice, as
both will be a hassle, but they're different hassles.
1) The less technical, throw-money-at-it-and-brute-force-it method is to
simply take advantage of the fact that you can still mount read-only and
should still have everything or nearly everything intact, file-wise, and
throw together some additional storage to copy everything off to. The
good thing about this is that when you're done you'll have a backup: the
current read-only filesystem serves as that backup for now, and when it
comes time to update the backup, you can blow away the current
filesystem, recreate it, and use it as the backup from then on.
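Purely as a sketch of the copy itself, with /mnt/rescue standing in for
wherever the new storage ends up mounted (a hypothetical path, and rsync
flags to taste):

  mount -o ro,degraded /dev/mapper/bpool-1 /mnt/bpool-btrfs
  rsync -aHAX --info=progress2 /mnt/bpool-btrfs/ /mnt/rescue/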
The problem, of course, is that at the multi-TB sizes you're talking,
it'll take at least 2-3 free devices, and that's going to be money,
potentially even more than just the devices if you don't have room to
connect them all in your current setup, since you'd then have to either
set up a second machine or install additional multi-device cards, if
there's even room for that, in the current one.
Of course a second problem is that given the near double-digit TB size
we're dealing with, doing /anything/ with that much data takes a decent
amount of time, during which the devices dealing with the copy, repair,
or whatever it is, are under significant stress and thus more likely to
fail than normal, so there's a definite non-zero chance of something else
failing before you get everything safely out of the danger zone and
squared away. Of course that's a risk that comes with the big-data
territory; you just gotta deal with it if you're dealing with TB-scale
data, but it /is/ worth mentioning one difference with btrfs raid10 as
compared to more common raid10s, that you may not have factored in for
the additional risk.
That being... btrfs raid10 is per-chunk, unlike standard raid10 which is
per-device, and btrfs can and sometimes does switch around which devices
get what part of the chunk. IOW, the device that gets A1 may well get B4
and D3 and C2, with the other pieces of the chunk stripe vs. mirror
similarly distributed amongst the other devices. The ultimate implication
is that you can count on surviving the loss of only a single device of
the raid10. Once you lose a second device, chances are pretty high that
those two devices held both mirrors of /some/ of the chunk stripes,
making the filesystem unrepairable and at least some of the data
unrecoverable.
There's talk of exposing some way to configure the chunk allocator so it
keeps the same mirror vs. stripe assignment for each chunk consistently,
and it should be possible and will very likely eventually happen, but it
could be three months from now, or five-plus years. So don't plan on it
arriving any time soon...
If you're fine with that and planned on only one device-loss safety
anyway, great. Otherwise...
2) Meanwhile, your second choice, the more technically skilled, less
brute force, method. As I said the problem is known, and there's
actually a patch floating around to fix it. But the patch, reasonably
simple on its own, got tangled up in a rather more complex and longer
term project, bringing hot-spare functionality to btrfs, and as such, it
has remained out of tree because it's now to be submitted with that
project, which is still cooking and isn't mature enough to be mainline-
merged just yet.
What the patch does is teach btrfs to actually check per-chunk whether it
has at least one copy of the chunk data available, and allow writable
mounts if so, thus eliminating the you-only-get-one-shot-at-a-fix
problem, because later degraded mounts should see all the chunks are
available and thus allow mounting writable, where current btrfs does not.
This more technically inclined method involves tracking down those
patches and applying them to your own kernel/btrfs build, after which you
should, with any decent luck, be able to mount the filesystem writable in
order to continue the conversion and ultimately eliminate the trigger
problem as all the single mode chunks are converted to raid10.
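If you go that route, the post-patch sequence would presumably look
something like the following -- only a sketch, with the replacement
device path and the missing device's devid as hypothetical placeholders,
and the soft filter just skipping chunks that are already raid10:

  mount -o degraded /dev/mapper/bpool-1 /mnt/bpool-btrfs
  btrfs replace start <missing-devid> /dev/mapper/bpool-5 /mnt/bpool-btrfs
  # or: btrfs device add /dev/mapper/bpool-5 /mnt/bpool-btrfs
  #     btrfs device delete missing /mnt/bpool-btrfs
  btrfs balance start -dconvert=raid10,soft -mconvert=raid10,soft \
      /mnt/bpool-btrfs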
Of course besides this requiring a bit more technical skill, the other
problem is that once the immediate issue is gone and you're all raid10,
you still have only that single working copy, with mirrors, yes, but no
real backup. The other way, you end up with a backup, and thus in a much
safer place in terms of that data than you apparently are now.
3) There's actually a third method too, involving using btrfs restore on
the unmounted filesystem to retrieve the files and copy them elsewhere.
However, this is really designed for when the filesystem can't be mounted
read-only either. Since you can mount it read-only and should be able to
recover most or all files that way, you don't need to try to resort to
this third method. But it's nice to know it's there, if you do end up
needing to use it, as I have a couple of times, tho AFAIK not since
before kernel 4.0, so it has been a while.
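(For completeness, a restore invocation would look roughly like this, run
against one of the unmounted LUKS mappings, with /mnt/rescue again a
hypothetical destination:

  btrfs restore -v -i /dev/mapper/bpool-1 /mnt/rescue/
  # -v lists files as they're recovered, -i ignores errors and keeps
  # going; see btrfs-restore(8) for the snapshot/xattr/metadata options

But again, with read-only mounts working for you, plain copying should do
the job.)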
Looking at the longer perspective, then, btrfs really does seem to be
gradually stabilizing/maturing, as before it seemed that every year or
so, something would happen and I'd need to think about actually using
those backups, tho unlike some, I've always been lucky enough to be able
to get fresher versions via restore, than I had in backup. But I imagine
the moment I begin actually counting on that, I'll have a problem that
restore can't fix, so I really do consider those backups a measure of the
value I place on the data, and would be stressed to a point, but only to
a point, if I ended up losing any changes between my sometimes too stale
backups, and my working copy. But if I do I'll only have myself to
blame, because after all, actions don't lie, and I obviously had things I
considered more important than freshening those backups, so if I lose
that delta, so be it, I still saved what I obviously thought was more
important.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman