From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Help with space
Date: Thu, 1 May 2014 05:33:45 +0000 (UTC) [thread overview]
Message-ID: <pan$e723c$9c5c686c$f2068dd4$dd71b519@cox.net> (raw)
In-Reply-To: 2809235.lZD2oazSeA@xev
Russell Coker posted on Thu, 01 May 2014 11:52:33 +1000 as excerpted:
> I've just been doing some experiments with a failing disk used for
> backups (so I'm not losing any real data here).
=:^)
> The "dup" option for metadata means that the entire filesystem
> structure is intact in spite of having lots of errors (in another
> thread I wrote about getting 50+ correctable errors on metadata while
> doing a backup).
TL;DR: Discussion of btrfs raid1 and N-way-mirroring. Bonus discussion
of spinning-rust heat-death and failure modes in general.
That's why I'm running raid1 for both data and metadata here. I love
btrfs' data/metadata checksumming and integrity mechanisms, and having
that second copy to scrub from in the event of an error on one of them is
just as important to me as the device-redundancy-and-failure-recovery bit.
I could get the latter on md/raid and did run it for some years, but md
gives you no way to do routine read-time parity cross-checks and scrubs
(or, in the raid1 case, N-way checking and vote, rewriting a bad copy on
failure), even tho all the redundant copies and parity are already there
and available; it only actually makes /use/ of them for recovery if a
device fails...
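(For the curious, setting that up is straightforward. A minimal example,
with the device names and mountpoint being placeholders only, not a
recommendation for any particular layout:

  mkfs.btrfs -d raid1 -m raid1 /dev/sda2 /dev/sdb2
  mount /dev/sda2 /mnt
  btrfs scrub start /mnt

The scrub reads everything, verifies checksums, and rewrites any bad
copy from its good mirror, which is exactly the bit md/raid won't do
for you.)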
My biggest frustration with btrfs ATM is the lack of "true" raid1, aka
N-way-mirroring. Btrfs presently only does pair-mirroring, no matter the
number of devices in the "raid1". Checksummed-3-way-redundancy really is
the sweet spot I'd like to hit, and yes it's on the roadmap, but this
thing seems to be taking about as long as Christmas does to a five- or
six-year-old... which is a pretty apt metaphor of my anticipation and the
eagerness with which I'll be unwrapping and playing with that present
once it comes! =:^)
> My experience is that in the vast majority of disk failures that don't
> involve dropping a disk the majority of disk data will still be
> readable. For example one time I had a workstation running RAID-1 get
> too hot in summer and both disks developed significant numbers of
> errors, enough that it couldn't maintain a Linux Software RAID-1 (disks
> got kicked out all the time). I wrote a program to read all the data
> from disk 0 and read from disk 1 any blocks that couldn't be read from
> disk 0, the result was that after running e2fsck on the result I didn't
> lose any data.
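(As an aside, the fallback-read approach Russell describes is simple
enough to sketch. Here's a minimal, illustrative Python version, NOT his
program: the device paths, block size, and zero-fill-when-both-fail
behavior are all my assumptions, and you'd want to run it against images
rather than live disks.

  #!/usr/bin/env python3
  # Read each block from the primary device; on a read error, fall
  # back to the mirror; if both fail, zero-fill and count the block.
  import os, sys

  BLOCK = 64 * 1024  # read granularity: smaller recovers more, slower

  def rescue_copy(primary, mirror, out_path):
      p = os.open(primary, os.O_RDONLY)
      m = os.open(mirror, os.O_RDONLY)
      o = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
      size = os.lseek(p, 0, os.SEEK_END)
      bad = 0
      for off in range(0, size, BLOCK):
          want = min(BLOCK, size - off)
          try:
              os.lseek(p, off, os.SEEK_SET)
              data = os.read(p, want)
          except OSError:              # error on disk 0: try disk 1
              try:
                  os.lseek(m, off, os.SEEK_SET)
                  data = os.read(m, want)
              except OSError:          # both copies unreadable
                  data = b'\0' * want
                  bad += 1
          os.lseek(o, off, os.SEEK_SET)
          os.write(o, data)
      for fd in (p, m, o):
          os.close(fd)
      return bad

  if __name__ == '__main__':
      print('blocks unreadable on both:', rescue_copy(*sys.argv[1:4]))

You'd then run e2fsck against the output image, as he did.)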
That's rather similar to an experience of mine. I'm in Phoenix, AZ, and
outdoor in-the-shade temps can reach near 50C. The air conditioning
failed with the system left running while I was elsewhere. I came home
to the "hot car effect", far hotter inside than out, so likely 55-60C
ambient air temp and very likely 70C+ device temps. The system was still
on but
"frozen" (broiled?) due to disk head crash and possibly CPU thermal
shutdown.
Surprisingly, after shutting everything down, getting a new AC, and
letting the system cool for a few hours, it pretty much all came back to
life, including the CPU(s), which I had feared would be dead. (That was
pre-multi-core, but I don't remember whether it was my dual-socket
original Opteron, or pre-dual-socket for me as well.)
The disk came back as well, minus the sections that were being accessed
at the time of the head crash, which I expect were physically grooved.
I only had the one main disk running at the time, but fortunately I had
partitioned it up and had working and backup partitions for everything
vital, and of course the backup partitions weren't mounted at the time,
and they came thru just fine (tho without checksumming, so I'll never
know if there were bit-flips). I could boot from the backup / and mount
the other backups, plus a working partition or two that weren't hurt,
just fine.
But I *DID* have quite a time recovering anyway, primarily because my
rootfs, /usr/ and /var (which had the system's installed package
database), were three different partitions that ended up being from three
different backup dates... on gentoo, with its rolling updates! IIRC I
had a current /var including the package database, but the package files
actually on the rootfs and on /usr were from different package versions
from what the db in /var was tracking, and were different from each other
as well. I was still finding stale package remnants nearly two years
later!
But I continued running that disk for several months until I had some
money to replace it, then copied the system, by then current again except
for the occasional stale file, to the new setup. I always wondered how
much longer I could have run the heat-tested one, but didn't want to
trust my luck any further, so retired it.
Which was when I got into md/raid, first mostly raid6, then later redone
to raid1, once I figured out that raid6's fancy dual parity wasn't doing
anything but slowing me down in normal operation anyway.
And on my new setup, I used a partitioning policy I continue to this day,
namely, everything that the package manager touches[1] including its
installed-pkg database on /var goes on rootfs. I keep a working rootfs
plus several backups of various ages on various physical devices (that
filesystem's only 8 gig or so, with only 4 gig or so of data, so I can
and do keep multiple alternate rootfs partition backups on multiple
devices). That means no matter what age the backup I might ultimately
end up booting to, the package database it contains will stay in sync
with the content of the packages it's tracking. No further possibility
of the database and /var from one backup, rootfs from another, and /usr
from a third!
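(Concretely, the scheme looks something like this. The sizes and names
are illustrative only; on gentoo the installed-package database lives
under /var/db/pkg:

  /        ~8 gig   rootfs, read-only by default; includes /usr and
                    /var with the package db (/var/db/pkg)
  /home    rest     user data, plus /home/var for writable state
  backups  ~8 gig   each: alternate rootfs copies of various ages,
                    spread over multiple physical devices

Restore any one rootfs backup and the package db, the installed files,
and /usr all come from the same moment in time.)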
Anyway, yes, my experience tracks yours. Both in that case and when I
simply run disks to wear-out (which I sometimes do, demoting a drive to
secondary/backup/low-priority-cache duty once it starts clicking or
developing bad sectors or whatever), the devices themselves continue to
work in general, long after I've begun to see intermittent issues with
them.
Tho my experience to date has been spinning rust. My current primary
workstation devices are a pair of SSDs (Corsair Neutron 256-gig, NOT
Neutron GTX), partitioned identically with multiple btrfs filesystems
in btrfs raid1 mode (except for the two separate individual /boots), and
I'm happy with them so far, but I must admit to being a bit worried
about their less familiar failure modes.
> So if you have BTRFS configured to "dup" metadata on a RAID-5 array
> (either hardware RAID or Linux Software RAID) then the probability of
> losing metadata would be a lot lower than for a filesystem which doesn't
> do checksums and doesn't duplicate metadata. To lose metadata you would
> need to have two errors that line up with both copies of the same
> metadata block.
Like I said, btrfs raid1 both data/metadata here, for exactly that
reason. But I'd sure like to make it triplet-mirror instead of being
limited to pair-mirror, again for exactly that reason. Currently, I
figure the chance of both copies independently going bad is lower than
the risk of a bug in still-under-development btrfs making BOTH copies
equally bad (even if they pass checksum), and I'm choosing to run btrfs
knowing that, tho I keep non-btrfs backups just in case. But as btrfs
matures and stabilizes, the chance of a btrfs bug making both copies bad
goes down, while the chance of the two copies independently going bad at
the same place remains the same, and as the two chances reverse in
likelihood, I'd sure like to have that triplet-mirroring available.
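(Should anyone want to follow suit on an existing multi-device btrfs,
the conversion is a single balance; /mnt here is just an example
mountpoint:

  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

That rewrites the existing chunks into the raid1 profile for data and
metadata both, and it can take quite a while on a full filesystem.)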
Oh well, the day will come, even if I'm a six-year-old waiting for
Christmas at this point. =:^\
> One problem with many RAID arrays is that it seems to only be possible
> to remove a disk and generate a replacement from parity. I'd like to be
> able to read all the data from the old disk which is readable and write
> it to the new disk. Then use the parity from other disks to recover the
> blocks which weren't readable. That way if you have errors on two disks
> it won't matter unless they both happen to be on the same stripe. Given
> that BTRFS RAID-5 isn't usable yet it seems that the only way to get
> this result is to use RAID-Z on ZFS.
=:^( But at least you're already in December, in terms of your btrfs
Christmas, while at best I'm still in November, for mine...
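(Meanwhile, at the whole-device level, GNU ddrescue already does the
first half of what you describe, copying everything readable from the
old disk to the new one and logging what it couldn't get; device names
here are examples only:

  ddrescue -f /dev/old /dev/new rescue.map

The map file lets it resume and retry just the bad spots on later
passes. The parity-assisted repair of the remaining holes is the part
that still wants raid/filesystem-layer support.)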
---
[1] Everything the package manager touches: Minus a few write-required
state files and the like in /var, which are now symlinked to parallels
in /home/var, since I keep the rootfs read-only mounted by default these
days, but by the same token, those operational-write-required files can
go missing or be out of sync without dramatically affecting operation.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman