linux-btrfs.vger.kernel.org archive mirror
* btrfs-raid questions I couldn't find an answer to on the wiki
@ 2012-01-26 15:41 Duncan
  2012-01-28 12:08 ` Martin Steigerwald
  2012-01-29 11:23 ` Goffredo Baroncelli
  0 siblings, 2 replies; 14+ messages in thread
From: Duncan @ 2012-01-26 15:41 UTC (permalink / raw)
  To: linux-btrfs

I'm currently researching an upgrade to (raid1-ed) btrfs from mostly 
reiserfs, which I've found quite reliable (even through a period of bad 
RAM and resulting system crashes) ever since data=ordered went in with 
2.6.16 or whatever it was (thanks, Chris! =:^), on multiple md/raid-1s.  
I have some questions that don't appear to be addressed well on the wiki 
yet, or where the wiki info might be dated.

Device hardware is four now-aging 300-gig disks with identical gpt-
partitioning on all four disks, using multiple 4-way md/raid-1s for most 
of the system.  I'm running gentoo/~amd64 with the Linus mainline kernel 
from git, generally updated 1-2X/wk except during the merge window, so I 
stay reasonably current.  I have btrfs-progs-9999, aka the live-git 
build, kernel.org mason tree, installed.

The current layout has a total of 16 physical disk partitions on each of 
the four drives, most of which are 4-disk md/raid1, but with a couple 
md/raid1s for local cache of redownloadables, etc, thrown in.  Some of 
the mds are further partitioned (mdp), some not.  A couple are only 
2-disk md/raid1 instead of the usual 4-disk.  Most mds have a working and 
backup copy of exactly the same partitioned size, thus explaining the 
multitude of partitions, since most of them come in pairs.  No LVM, as 
I'm not running an initrd (which meant it couldn't handle root), and I 
wasn't confident in my ability to recover the system in an emergency with 
LVM either, so I was better off without it.

Note that my current plan is to keep the backup sets as reiserfs on md/
raid1 for the time being, probably until btrfs comes out of experimental/
testing or at least until it further stabilizes, so I'm not too worried 
about btrfs as long as it's not going to go scribbling outside the 
partitions established for it.  For the worst case I have a boot-tested 
external-drive backup.

Three questions:

1) My /boot partition and its backup (which I do want to keep separate 
from root) are only 128 MB each.  The wiki recommends 1 gig sizes 
minimum, but there's some indication that's dated info due to mixed data/
metadata mode in recent kernels.

Is a 128 MB btrfs reasonable?  What's the mixed-mode minimum recommended 
and what is overhead going to look like?

2)  The wiki indicates that btrfs-raid1 and raid-10 only mirror data 2-
way, regardless of the number of devices.  On my now aging disks, I 
really do NOT like the idea of only 2-copy redundancy.  I'm far happier 
with the 4-way redundancy, twice for the important stuff since it's in 
both working and backup mds altho they're on the same 4-disk set (tho I 
do have an external drive backup as well, but it's not kept as current).

If true that's a real disappointment, as I was looking forward to btrfs-
raid1 with checksummed integrity management.

Is there really NO way to do more than 2-way btrfs-raid1?  If not, 
presumably layering it on md/raid1 is possible, but is two-way-btrfs-
raid1-on-2-way-md-raid1 or btrfs-on-single-4-way-md-raid1 (presumably 
still-duped btrfs metadata) recommended?  Or perhaps the recommendations 
for performance and reliability differ in that scenario?

3) How does btrfs space overhead (and ENOSPC issues) compare to reiserfs 
with its (default) journal and tail-packing?  My existing filesystems are 
128 MB and 4 GB at the low end, and 90 GB and 16 GB at the high end.  At 
the same size, can I expect to fit more or less data on them?  Do the 
compression options change that by much "IRL"?  Given that I'm using 
same-sized partitions for my raid-1s, I guess at least /that/ angle of it 
is covered.

Thanks. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-26 15:41 btrfs-raid questions I couldn't find an answer to on the wiki Duncan
@ 2012-01-28 12:08 ` Martin Steigerwald
  2012-01-29  5:40   ` Duncan
  2012-01-29 11:23 ` Goffredo Baroncelli
  1 sibling, 1 reply; 14+ messages in thread
From: Martin Steigerwald @ 2012-01-28 12:08 UTC (permalink / raw)
  To: linux-btrfs

On Thursday, 26 January 2012, Duncan wrote:
> I'm currently researching an upgrade to (raid1-ed) btrfs from mostly
> reiserfs, which I've found quite reliable (even through a period of
> bad RAM and resulting system crashes) ever since data=ordered went in
> with 2.6.16 or whatever it was (thanks, Chris! =:^), on multiple
> md/raid-1s.  I have some questions that don't appear to be addressed
> well on the wiki yet, or where the wiki info might be dated.
> 
> Device hardware is four now-aging 300-gig disks with identical gpt-
> partitioning on all four disks, using multiple 4-way md/raid-1s for
> most of the system.  I'm running gentoo/~amd64 with the Linus mainline
> kernel from git, generally updated 1-2X/wk except during the merge
> window, so I stay reasonably current.  I have btrfs-progs-9999, aka
> the live-git build, kernel.org mason tree, installed.
> 
> The current layout has a total of 16 physical disk partitions on each
> of the four drives, most of which are 4-disk md/raid1, but with a
> couple md/raid1s for local cache of redownloadables, etc, thrown in.
> Some of the mds are further partitioned (mdp), some not.  A couple are
> only 2-disk md/raid1 instead of the usual 4-disk.  Most mds have a
> working and backup copy of exactly the same partitioned size, thus
> explaining the multitude of partitions, since most of them come in
> pairs.  No LVM, as I'm not running an initrd (which meant it couldn't
> handle root), and I wasn't confident in my ability to recover the
> system in an emergency with LVM either, so I was better off without it.

Sounds like a quite complex setup.

> Three questions:
> 
> 1) My /boot partition and its backup (which I do want to keep separate
> from root) are only 128 MB each.  The wiki recommends 1 gig sizes
> minimum, but there's some indication that's dated info due to mixed
> data/metadata mode in recent kernels.
> 
> Is a 128 MB btrfs reasonable?  What's the mixed-mode minimum
> recommended and what is overhead going to look like?

I don't know.

You could try with a loop device.  Just create one and mkfs.btrfs on it, 
mount it, and copy your stuff from /boot over to see whether that works 
and how much space is left.

On BTRFS I recommend using btrfs filesystem df for more exact figures of 
space utilization than df would return.

Likewise, for RAID 1, just create 2 or 4 BTRFS image files.

You may try with:

       -M, --mixed
              Mix  data  and  metadata  chunks together for more
              efficient space utilization.  This feature  incurs
              a  performance  penalty in larger filesystems.  It
              is recommended for use with filesystems of  1  GiB
              or smaller.

for smaller partitions (see manpage of mkfs.btrfs).
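
A rough sketch of such a loop-device experiment (untested here; file 
names, sizes, and the loop device numbers are only examples):

    # create a sparse 128 MB image file and attach it to a loop device
    truncate -s 128M /tmp/btrfs-test.img
    losetup /dev/loop0 /tmp/btrfs-test.img

    # make a small filesystem with mixed data/metadata chunks
    mkfs.btrfs -M /dev/loop0

    # mount it, copy /boot over, and compare the space figures
    mkdir -p /mnt/btrfs-test
    mount /dev/loop0 /mnt/btrfs-test
    cp -a /boot/. /mnt/btrfs-test/
    btrfs filesystem df /mnt/btrfs-test
    df -h /mnt/btrfs-test

    # for the raid1 case, attach a second image file and give mkfs both
    # devices, e.g.:  mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1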

> 2)  The wiki indicates that btrfs-raid1 and raid-10 only mirror data
> 2-way, regardless of the number of devices.  On my now aging disks, I
> really do NOT like the idea of only 2-copy redundancy.  I'm far happier
> with the 4-way redundancy, twice for the important stuff since it's in
> both working and backup mds altho they're on the same 4-disk set (tho I
> do have an external drive backup as well, but it's not kept as
> current).
> 
> If true that's a real disappointment, as I was looking forward to
> btrfs-raid1 with checksummed integrity management.

I didn't see anything like this.

Would be nice to be able to adapt the redundancy degree where possible.

An idea might be splitting it into a delayed-synchronisation mirror:

Have two BTRFS RAID-1s - original and backup - and have a cronjob with 
rsync mirroring files every hour or so.  Later this might be replaced by 
btrfs send/receive - or by RAID-1 with higher redundancy.
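
As a sketch, assuming the working copy is mounted at /data and the backup 
copy at /backup/data (both just example paths), the cronjob could be as 
simple as:

    #!/bin/sh
    # e.g. /etc/cron.hourly/mirror-data
    # mirror the working filesystem to the backup one, deleting files
    # that no longer exist on the source
    rsync -aHAX --delete /data/ /backup/data/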

> 3) How does btrfs space overhead (and ENOSPC issues) compare to
> reiserfs with its (default) journal and tail-packing?  My existing
> filesystems are 128 MB and 4 GB at the low end, and 90 GB and 16 GB at
> the high end.  At the same size, can I expect to fit more or less data
> on them?  Do the compression options change that by much "IRL"?  Given
> that I'm using same-sized partitions for my raid-1s, I guess at least
> /that/ angle of it is covered.

The efficiency of the compression options depends highly on the kind of 
data you want to store.

I tried lzo on an external disk with movies, music files, images and 
software archives.  The effect was minimal, about 3% or so.  But for 
unpacked source trees, lots of clear-text files, and likely also virtual 
machine image files or other nicely compressible data, the effect should 
be better.

Although BTRFS has received a lot of fixes for ENOSPC issues, I would be 
a bit reluctant with very small filesystems.  But that is just a gut 
feeling, so I do not know whether the option -M from above is widely 
tested.  I doubt it.

Maybe someone with more in-depth knowledge can shed some light on this.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-28 12:08 ` Martin Steigerwald
@ 2012-01-29  5:40   ` Duncan
  2012-01-29  7:55     ` Martin Steigerwald
  0 siblings, 1 reply; 14+ messages in thread
From: Duncan @ 2012-01-29  5:40 UTC (permalink / raw)
  To: linux-btrfs

Martin Steigerwald posted on Sat, 28 Jan 2012 13:08:52 +0100 as excerpted:

> On Thursday, 26 January 2012, Duncan wrote:

>> The current layout has a total of 16 physical disk partitions on each
>> of the four drives, most of which are 4-disk md/raid1, but with a
>> couple md/raid1s for local cache of redownloadables, etc, thrown in.
>> Some of the mds are further partitioned (mdp), some not.  A couple are
>> only 2-disk md/raid1 instead of the usual 4-disk.  Most mds have a
>> working and backup copy of exactly the same partitioned size, thus
>> explaining the multitude of partitions, since most of them come in
>> pairs.  No LVM, as I'm not running an initrd (which meant it couldn't
>> handle root), and I wasn't confident in my ability to recover the
>> system in an emergency with LVM either, so I was better off without it.
> 
> Sounds like a quite complex setup.

It is.  I was actually writing a rather more detailed description, but 
decided few would care and it'd turn into a tl;dr.  It was I think the 
4th rewrite that finally got it down to something reasonable while still 
hopefully conveying any details that might be corner-cases someone knows 
something about.

>> Three questions:
>> 
>> 1) My /boot partition and its backup (which I do want to keep separate
>> from root) are only 128 MB each.  The wiki recommends 1 gig sizes
>> minimum, but there's some indication that's dated info due to mixed
>> data/metadata mode in recent kernels.
>> 
>> Is a 128 MB btrfs reasonable?  What's the mixed-mode minimum
>> recommended and what is overhead going to look like?
> 
> I don't know.
> 
> You could try with a loop device.  Just create one and mkfs.btrfs on it,
> mount it, and copy your stuff from /boot over to see whether that works
> and how much space is left.

The loop device is a really good idea that hadn't occurred to me.  Thanks!

> On BTRFS I recommend using btrfs filesystem df for more exact figures
> of space utilization than df would return.

Yes.  I've read about the various space reports on the wiki so have the 
general idea, but will of course need to review it again after I get 
something set up so I can actually type in the commands and see for 
myself.  Still, thanks for the reinforcement.  It certainly won't hurt, 
and of course it's quite possible that others will end up reading this 
too, so it could end up being a benefit to many people, not just me. =:^)

> You may try with:
> 
>        -M, --mixed
>               Mix  data  and  metadata  chunks together for more
>               efficient space utilization.  This feature  incurs a
>               performance  penalty in larger filesystems.  It is
>               recommended for use with filesystems of  1  GiB or
>               smaller.
> 
> for smaller partitions (see manpage of mkfs.btrfs).

I had actually seen that too, but as it's newer there are significantly 
fewer mentions of it out there, so the reinforcement is DEFINITELY 
valued!  I like to have a rather good general sysadmin's idea of what's 
going on and how everything fits together, as opposed to simply following 
instructions by rote, before I'm really comfortable with something as 
critical as filesystem maintenance (keeping in mind that when one really 
tends to need that knowledge is in an already stressful recovery 
situation, very possibly without all the usual documentation/net-
resources available).  Repetition of the basics helps in getting 
comfortable with it, so I'm very happy for it even if it isn't "new" to 
me. =:^)  (As mentioned, that was a big reason behind my ultimate 
rejection of LVM: I simply couldn't get comfortable enough with it to be 
confident of my ability to recover it in an emergency situation.)

>> 2)  The wiki indicates that btrfs-raid1 and raid-10 only mirror data
>> 2-way, regardless of the number of devices.  On my now aging disks, I
>> really do NOT like the idea of only 2-copy redundancy.  I'm far happier
>> with the 4-way redundancy, twice for the important stuff since it's in
>> both working and backup mds altho they're on the same 4-disk set (tho I
>> do have an external drive backup as well, but it's not kept as
>> current).
>> 
>> If true that's a real disappointment, as I was looking forward to
>> btrfs-raid1 with checksummed integrity management.
> 
> I didn't see anything like this.
> 
> Would be nice to be able to adapt the redundancy degree where possible.

I posted the wiki reference in reply to someone else recently.  Let's see 
if I can find it again...

Here it is.  This is from the bottom of the RAID and data replication 
section (immediately above "Balancing") on the SysadminGuide page:

>>>>>
With RAID-1 and RAID-10, only two copies of each byte of data are 
written, regardless of how many block devices are actually in use on the 
filesystem.
<<<<<

But that's one of the bits that I hoped was stale, and that it now 
allowed setting the number of copies for both data and metadata.  
However, I don't see any options along that line to feed to either 
mkfs.btrfs or btrfs *, so it would seem it's not there yet, at least not 
in btrfs-tools as built just a couple days ago from the official/mason 
tree on kernel.org.  I haven't tried the integration tree (aka Hugo 
Mills' aka darksatanic.net tree).  So I guess that wiki quote is still 
correct.  Oh, well... maybe later-this-year/in-a-few-kernel-cycles.

> An idea might be splitting it into a delayed-synchronisation mirror:
> 
> Have two BTRFS RAID-1s - original and backup - and have a cronjob with
> rsync mirroring files every hour or so.  Later this might be replaced
> by btrfs send/receive - or by RAID-1 with higher redundancy.

That's an interesting idea.  However, as I run git kernels and don't 
accumulate a lot of uptime in any case, what I'd probably do is set up 
the rsync to be run after a successful boot or mount of the filesystem in 
question.  That way, if it ever failed to boot/mount for whatever reason, 
I could be relatively confident that the backup version remained intact 
and usable.
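
A minimal sketch of that (assuming a Gentoo/OpenRC /etc/local.d hook and 
an fstab entry for the backup volume; the mountpoint is only a 
placeholder):

    #!/bin/sh
    # /etc/local.d/backup-root.start -- only reached after a successful boot
    # refresh the backup copy of the root filesystem
    mount /mnt/rootbak || exit 1
    rsync -aHAX --delete --one-file-system / /mnt/rootbak/
    umount /mnt/rootbak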

That's actually /quite/ an interesting idea.  While I have working and 
backup partitions for most stuff now, the process remains a manual one, 
done when I think the system is stable enough and enough time has passed 
since the last one, so the backup tends to be weeks or months old as 
opposed to days or hours.  This idea, modified to do it once per boot or 
mount or whatever, would keep the backups far more current and be much 
less hassle than the manual method I'm using now.  So even if I don't 
immediately switch to btrfs as I had thought I might, I can implement 
those scripts on the current system now, and then they'll be ready and 
tested, needing little modification when I switch to btrfs later.

Thanks for the ideas! =:^)

>> 3) How does btrfs space overhead (and ENOSPC issues) compare to
>> reiserfs with its (default) journal and tail-packing?  My existing
>> filesystems are 128 MB and 4 GB at the low end, and 90 GB and 16 GB at
>> the high end.  At the same size, can I expect to fit more or less data
>> on them?  Do the compression options change that by much "IRL"?  Given
>> that I'm using same-sized partitions for my raid-1s, I guess at least
>> /that/ angle of it is covered.
> 
> The efficiency of the compression options depends highly on the kind of
> data you want to store.
> 
> I tried lzo on an external disk with movies, music files, images and
> software archives.  The effect was minimal, about 3% or so.  But for
> unpacked source trees, lots of clear-text files, and likely also virtual
> machine image files or other nicely compressible data, the effect should
> be better.

Back in the day (MS-DOS 6.2 on a 130 MB hard drive), I used to run MS 
Drivespace (which I guess they partnered with Stacker to get the tech 
for, then dropped the Stacker partnership like a hot potato after they'd 
sucked out all the tech they wanted, killing Stacker in the process...), 
so I'm familiar with the idea of filesystem-or-lower integrated 
compression and realize that it's definitely variable.  I was just 
wondering what the real-life usage scenarios had come up with, realizing 
even as I wrote it that the question wasn't one that could be answered in 
anything but general terms.

But I run Gentoo and thus deal with a lot of build scripts, etc, plus the 
usual *ix-style plain text config files, etc, so I expect compression 
will be pretty good for that.  Rather less so on the media and bzip-
tarballed binpkgs partitions, certainly, with the home partition likely 
intermediate since it has a lot of plain text /and/ a lot of pre-
compressed data.

Meanwhile, even without a specific answer, just the discussion is helping 
to clarify my understanding and expectations regarding compression, so 
thanks.

> Although BTRFS has received a lot of fixes for ENOSPC issues, I would
> be a bit reluctant with very small filesystems.  But that is just a gut
> feeling, so I do not know whether the option -M from above is widely
> tested.  I doubt it.

The only real small filesystem/raid I have is /boot, the 128 MB 
mentioned.  But in thinking it over a bit more since I wrote the initial 
post, I realized that given the 9-ish gigs of unallocated freespace at 
the end of the drives, and the fact that most of the partitions are at a 
quarter-gig offset due to the 128 MB /boot and the combined 128 MB BIOS 
and UEFI reserved partitions, I have room to expand both by several 
times, and making the total of all 3 (plus the initial few sectors of 
unpartitioned boot area) at the beginning of the drive an even 1 gig 
would give me even gig offsets for all the other partitions/raids as well.

So I'll almost certainly expand /boot from 1/8 gig to 1/4 gig, and maybe 
to half or even 3/4 gig, just so the offsets for everything else end up 
at even half or full gig boundaries, instead of the quarter-gig I have 
now.  Between that and mixed-mode, I think the potential sizing issue of 
/boot pretty much disappears.  One less problem to worry about. =:^)


So the big sticking point now is two-copy-only data on btrfs-raid1, 
regardless of the number of drives, and sticking that on top of md/raid 
is a workaround, tho obviously I'd much rather have a btrfs that could 
mirror both data and metadata an arbitrary number of ways instead of just 
two.  (There are some hints that metadata at least gets mirrored to all 
drives in a btrfs-raid1, tho nothing clearly states it one way or 
another.  But without data mirrored to all drives as well, I'm just not 
comfortable.)

But while not ideal, the data integrity checking of two-way btrfs-raid1 
on two-way md/raid1 should at least be better than entirely unverified 
4-way md/raid1, and I expect the rest will come over time, so I could 
simply upgrade anyway.

OTOH, in general as I've looked closer, I've found btrfs to be rather 
farther away from exiting experimental than the prominent adoption by 
various distros had led me to believe, and without N-way mirroring raid, 
one of the two big features that I was looking forward to (the other 
being the data integrity checking) just vaporized in front of my eyes, so 
I may well hold off on upgrading until, potentially, late this year 
instead of early this year, even if there are workarounds.  I'm just not 
sure it's worth the cost of dealing with the still experimental aspects.


Either way, however, this little foray into previously unexplored 
territory leaves me with a MUCH firmer grasp of btrfs.  It's no longer 
simply a vague filesystem with some vague features out there.

And now that I'm here, I'll probably stay on the list as well, as I've 
already answered a number of questions posted by others, based on the 
material in the wiki and manpages, so I think I have something to 
contribute, and keeping up with developments will be far easier if I stay 
involved.


Meanwhile, again and overall, thanks for the answer.  I did have most of 
the bits of info I needed there floating around, but having someone to 
discuss my questions with has definitely helped solidify the concepts, 
and you've given me at least two very good suggestions that were entirely 
new to me and that would have certainly taken me quite some time to come 
up with on my own, if I'd been able to do so at all, so thanks, indeed! 
=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-29  5:40   ` Duncan
@ 2012-01-29  7:55     ` Martin Steigerwald
  0 siblings, 0 replies; 14+ messages in thread
From: Martin Steigerwald @ 2012-01-29  7:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Duncan

On Sunday, 29 January 2012, Duncan wrote:
> Martin Steigerwald posted on Sat, 28 Jan 2012 13:08:52 +0100 as
> excerpted:
> > On Thursday, 26 January 2012, Duncan wrote:
[…]
> >> 2)  The wiki indicates that btrfs-raid1 and raid-10 only mirror data
> >> 2-way, regardless of the number of devices.  On my now aging
> >> disks, I really do NOT like the idea of only 2-copy redundancy.
> >> I'm far happier with the 4-way redundancy, twice for the important
> >> stuff since it's in both working and backup mds altho they're on
> >> the same 4-disk set (tho I do have an external drive backup as
> >> well, but it's not kept as current).
> >> 
> >> If true that's a real disappointment, as I was looking forward to
> >> btrfs-raid1 with checksummed integrity management.
> > 
> > I didn't see anything like this.
> > 
> > Would be nice to be able to adapt the redundancy degree where
> > possible.
> 
> I posted the wiki reference in reply to someone else recently.  Let's
> see if I can find it again...
> 
> Here it is.  This is from the bottom of the RAID and data replication
> section (immediately above "Balancing") on the SysadminGuide page:
> 
> >>>>>
> With RAID-1 and RAID-10, only two copies of each byte of data are
> written, regardless of how many block devices are actually in use on
> the filesystem.
> <<<<<

Yes, I have seen that too some time ago.  What I meant by "I didn't see 
anything like this" is that I didn't see an option to set the number of 
copies anywhere yet - just like you.

> > An idea might be splitting it into a delayed-synchronisation mirror:
> > 
> > Have two BTRFS RAID-1s - original and backup - and have a cronjob with
> > rsync mirroring files every hour or so.  Later this might be replaced
> > by btrfs send/receive - or by RAID-1 with higher redundancy.
> 
> That's an interesting idea.  However, as I run git kernels and don't
> accumulate a lot of uptime in any case, what I'd probably do is set up
> the rsync to be run after a successful boot or mount of the filesystem
> in question.  That way, if it ever failed to boot/mount for whatever
> reason, I could be relatively confident that the backup version
> remained intact and usable.
> 
> That's actually /quite/ an interesting idea.  While I have working and
> backup partitions for most stuff now, the process remains a manual one,
> done when I think the system is stable enough and enough time has
> passed since the last one, so the backup tends to be weeks or months
> old as opposed to days or hours.  This idea, modified to do it once per
> boot or mount or whatever, would keep the backups far more current and
> be much less hassle than the manual method I'm using now.  So even if I
> don't immediately switch to btrfs as I had thought I might, I can
> implement those scripts on the current system now, and then they'll be
> ready and tested, needing little modification when I switch to btrfs
> later.
> 
> Thanks for the ideas! =:^)

Well, you may even throw in a snapshot in between.

During boot, before the backup - or just after mount, before 
applications / services are started - first snapshot the source device.  
That should give you a fairly consistent backup source.  Then do the 
rsync backup.  Then snapshot the backup drive.

This way you can access older backups in case the original has gone bad 
and has been backed up nonetheless.

I suggest a cronjob deleting old snapshots again after some time, in 
order to save space.
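
A rough sketch of that flow, assuming both sides are btrfs and using 
made-up mountpoints (/data as the working volume, /backup as the backup 
volume):

    #!/bin/sh
    # assumes /backup/current was created once with 'btrfs subvolume create'
    stamp=$(date +%Y%m%d-%H%M)

    # 1. snapshot the source so the rsync reads from a stable view
    btrfs subvolume snapshot /data /data/.snap-$stamp

    # 2. mirror the snapshot to the backup volume
    rsync -aHAX --delete /data/.snap-$stamp/ /backup/current/

    # 3. snapshot the backup so older states stay reachable
    btrfs subvolume snapshot /backup/current /backup/snap-$stamp

    # 4. a separate cronjob can then prune old snapshots, e.g.
    #    btrfs subvolume delete /backup/snap-20120101-0300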

I want to replace my backup with something like this.  There is also 
rsnapshot for this case, but I find its error reporting suboptimal (no 
rsync error messages included unless you run it on the command line with 
option -v), and it uses hardlinks.  Maybe it could be adapted to use 
snapshots?

> > Although BTRFS has received a lot of fixes for ENOSPC issues, I would
> > be a bit reluctant with very small filesystems.  But that is just a
> > gut feeling, so I do not know whether the option -M from above is
> > widely tested.  I doubt it.
> 
> The only real small filesystem/raid I have is /boot, the 128 MB
> mentioned.  But in thinking it over a bit more since I wrote the
> initial post, I realized that given the 9-ish gigs of unallocated
> freespace at the end of the drives and the fact that most of the
> partitions are at a quarter-gig offset due to the 128 MB /boot and the
> combined 128 MB BIOS and UEFI reserved partitions, I have room to
> expand both by several times, and making the total of all 3 (plus the
> initial few sectors of unpartitioned boot area) at the beginning of
> the drive an even 1 gig would give me even gig offsets for all the
> other partitions/raids as well.
> 
> So I'll almost certainly expand /boot from 1/8 gig to 1/4 gig, and
> maybe to half or even 3/4 gig, just so the offsets for everything else
> end up at even half or full gig boundaries, instead of the quarter-gig
> I have now.  Between that and mixed-mode, I think the potential sizing
> issue of /boot pretty much disappears.  One less problem to worry
> about. =:^)

About /boot: I do not see any specific need to convert /boot to BTRFS as 
well.  Since kernels have a version number attached to them and can be 
installed side by side, snapshotting /boot does not appear that important 
to me.

So you can just use Ext3 for /boot - or, with GRUB 2 or a patched GRUB 1 
(some distros do it), Ext4 - in case BTRFS would not work out.

> So the big sticking point now is two-copy-only data on btrfs-raid1,
> regardless of the number of drives, and sticking that on top of
> md/raid is a workaround, tho obviously I'd much rather have a btrfs
> that could mirror both data and metadata an arbitrary number of ways
> instead of just two.  (There are some hints that metadata at least gets
> mirrored to all drives in a btrfs-raid1, tho nothing clearly states it
> one way or another.  But without data mirrored to all drives as well,
> I'm just not comfortable.)

I am with you there. Would be a nice feature.

The distributed filesystem Ceph, which likes to be based on BTRFS 
volumes, has something like that, but Ceph might be overdoing it for your 
case ;).

> OTOH, in general as I've looked closer, I've found btrfs to be rather
> farther away from exiting experimental than the prominent adoption by
> various distros had led me to believe, and without N-way mirroring
> raid, one of the two big features that I was looking forward to (the
> other being the data integrity checking) just vaporized in front of my
> eyes, so I may well hold off on upgrading until, potentially, late
> this year instead of early this year, even if there are workarounds.
> I'm just not sure it's worth the cost of dealing with the still
> experimental aspects.

I decided on a partial approach.

My Amarok machine - an old ThinkPad T23 - is fully upgraded.  On my main 
laptop - a ThinkPad T520 with Intel SSD 320 - I have BTRFS as / and /home 
still sits on Ext4.

I like this approach, cause I can gain experience with BTRFS while not 
putting too important data at risk.  I can afford to lose /, since I have 
a backup.  But even with a backup of /home, I'd rather not lose it, since 
I only do it every 2-3 weeks, cause it's a manual thing for me at the 
moment.

At work I have a scratch data partition for Debian package development, 
compiling stuff, and other stuff I do not want to do within the NFS 
export, on BTRFS - that I back up to an Ext4 partition.

> And now that I'm here, I'll probably stay on the list as well, as I've
> already answered a number of questions posted by others, based on the
> material in the wiki and manpages, so I think I have something to
> contribute, and keeping up with developments will be far easier if I
> stay involved.

I encourage you to start by putting something you can afford to lose on 
BTRFS, to gather practical experience.

> Meanwhile, again and overall, thanks for the answer.  I did have most

You are welcome.

I do not know a definitive answer to the number-of-copies question, but I 
believe that it's not possible yet to set it.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-26 15:41 btrfs-raid questions I couldn't find an answer to on the wiki Duncan
  2012-01-28 12:08 ` Martin Steigerwald
@ 2012-01-29 11:23 ` Goffredo Baroncelli
  2012-01-30  5:49   ` Li Zefan
  2012-01-30 14:58   ` Kyle Gates
  1 sibling, 2 replies; 14+ messages in thread
From: Goffredo Baroncelli @ 2012-01-29 11:23 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Thursday, 26 January, 2012 16:41:32 Duncan wrote:
> 1) My /boot partition and its backup (which I do want to keep separate 
> from root) are only 128 MB each.  The wiki recommends 1 gig sizes 
> minimum, but there's some indication that's dated info due to mixed data/
> metadata mode in recent kernels.
> 
> Is a 128 MB btrfs reasonable?  What's the mixed-mode minimum recommended 
> and what is overhead going to look like?

IIRC, the minimum size should be 256MB.  Anyway, if you want/allow a separate 
partition for /boot, I suggest using a classic filesystem like ext3.

BR
G.Baroncelli

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-29 11:23 ` Goffredo Baroncelli
@ 2012-01-30  5:49   ` Li Zefan
  2012-01-30 14:58   ` Kyle Gates
  1 sibling, 0 replies; 14+ messages in thread
From: Li Zefan @ 2012-01-30  5:49 UTC (permalink / raw)
  To: kreijack; +Cc: Duncan, linux-btrfs

Goffredo Baroncelli wrote:
> On Thursday, 26 January, 2012 16:41:32 Duncan wrote:
>> 1) My /boot partition and its backup (which I do want to keep separate 
>> from root) are only 128 MB each.  The wiki recommends 1 gig sizes 
>> minimum, but there's some indication that's dated info due to mixed data/
>> metadata mode in recent kernels.
>>
>> Is a 128 MB btrfs reasonable?  What's the mixed-mode minimum recommended 
>> and what is overhead going to look like?
> 
> IIRC, the minimum size should be 256MB. Anyway, if you want/allow a separate 
> partition for  /boot I suggest to use a classic filesystem like ext3.
> 

The 256MB limitation has been removed.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-29 11:23 ` Goffredo Baroncelli
  2012-01-30  5:49   ` Li Zefan
@ 2012-01-30 14:58   ` Kyle Gates
  2012-01-31  5:55     ` Duncan
  1 sibling, 1 reply; 14+ messages in thread
From: Kyle Gates @ 2012-01-30 14:58 UTC (permalink / raw)
  To: kreijack, 1i5t5.duncan; +Cc: linux-btrfs


I've been having good luck with my /boot on a separate 1GB RAID1 btrfs 
filesystem using grub2 (2 disks only! I wouldn't try it with 3).  I 
should note, however, that I'm NOT using compression on this volume 
because if I remember correctly it may not play well with grub (maybe 
that was just lzo though), and I'm also not using subvolumes either, for 
the same reason.
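
(Presumably created with something along these lines; the device names 
and label are only examples:)

    # two-partition btrfs raid1 for /boot, both data and metadata mirrored
    mkfs.btrfs -L boot -m raid1 -d raid1 /dev/sda2 /dev/sdb2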

Kyle

----------------------------------------
> From: kreijack@inwind.it
> To: 1i5t5.duncan@cox.net
> Subject: Re: btrfs-raid questions I couldn't find an answer to on the wiki
> Date: Sun, 29 Jan 2012 12:23:39 +0100
> CC: linux-btrfs@vger.kernel.org
>
> On Thursday, 26 January, 2012 16:41:32 Duncan wrote:
> > 1) My /boot partition and its backup (which I do want to keep separate
> > from root) are only 128 MB each. The wiki recommends 1 gig sizes
> > minimum, but there's some indication that's dated info due to mixed data/
> > metadata mode in recent kernels.
> >
> > Is a 128 MB btrfs reasonable? What's the mixed-mode minimum recommended
> > and what is overhead going to look like?
>
> IIRC, the minimum size should be 256MB. Anyway, if you want/allow a separate
> partition for /boot I suggest to use a classic filesystem like ext3.
>
> BR
> G.Baroncelli

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-30 14:58   ` Kyle Gates
@ 2012-01-31  5:55     ` Duncan
  2012-02-01  0:22       ` Kyle Gates
  2012-02-10 19:45       ` Phillip Susi
  0 siblings, 2 replies; 14+ messages in thread
From: Duncan @ 2012-01-31  5:55 UTC (permalink / raw)
  To: linux-btrfs

Kyle Gates posted on Mon, 30 Jan 2012 08:58:41 -0600 as excerpted:

> I've been having good luck with my /boot on a separate 1GB RAID1 btrfs
> filesystem using grub2 (2 disks only! I wouldn't try it with 3). I
> should note, however, that I'm NOT using compression on this volume
> because if I remember correctly it may not play well with grub (maybe
> that was just lzo though) and I'm also not using subvolumes either for
> the same reason.

Thanks!  I'm on grub2 as well.  It's still masked on gentoo, but I 
recently unmasked and upgraded to it, taking advantage of the fact that I 
have two two-spindle md/raid-1s for /boot and its backup to test and 
upgrade one of them first, then the other only when I was satisfied with 
the results on the first set.  I'll be using a similar strategy for the 
btrfs upgrades, only most of my md/raid-1s are 4-spindle, with two sets, 
working and backup, and I'll upgrade one set first.

I'm going to keep /boot a pair of two-spindle raid-1s, but intend to make 
them btrfs-raid1s instead of md/raid-1s, and will upgrade one two-spindle 
set at a time.

More on the status of grub2 btrfs-compression support based on my 
research.  There is support for btrfs/gzip-compression in at least grub 
trunk.  AFAIK, it's gzip-compression in grub-1.99-release and
lzo-compression in trunk only, but I may be misremembering and it's gzip 
in trunk only and only uncompressed in grub-1.99-release.

In any event, since I'm running 128 MB /boot md/raid-1s without 
compression now, and intend to increase the size to at least a quarter 
gig to better align the following partitions, /boot is the one set of 
btrfs partitions I do NOT intend to enable compression on, so that won't 
be an issue for me here.  And since for /boot I'm running a pair of
two-spindle raid1s instead of my usual quad-spindle raid1s, you've 
confirmed that works as well. =:^)

As a side note, since I only recently did the grub2 upgrade, I've been 
enjoying its ability to load and read md/raid and my current reiserfs 
directly, thus giving me the ability to look up info in at least text-
based main system config and notes files directly from grub2, without 
booting into Linux, if for some reason the above-grub boot is hosed or 
inconvenient at that moment.  I just realized that if I want to maintain 
that direct-from-grub access, I'll need to ensure that the grub2 I'm 
running groks the btrfs compression scheme I'm using on any filesystem I 
want grub2 to be able to read.

Hmm... that brings up another question:  You mention a 1-gig btrfs-raid1 /
boot, but do NOT mention whether you installed it before or after mixed-
chunk (data/metadata) support made it into btrfs and became the default 
for <= 1 gig filesystems.

Can you confirm one way or the other whether you're running mixed-chunk 
on that 1-gig?  I'm not sure whether grub2's btrfs module groks mixed-
chunk or not, or whether that even matters to it.

Also, could you confirm mbr-bios vs gpt-bios vs uefi-gpt partitions?  I'm 
using gpt-bios partitioning here, with the special gpt-bios-reserved 
partition, so grub2-install can build the modules necessary for /boot 
access directly into its core-image and install that in the gpt-bios-
reserved partition.  It occurs to me that either uefi-gpt or gpt-bios 
with the appropriate reserved partition won't have quite the same issues 
with grub2 reading a btrfs /boot that either mbr-bios or gpt-bios without 
a reserved bios partition would.  If you're running gpt-bios with a 
reserved bios partition, that confirms yet another aspect of your setup, 
compared to mine.  If you're running uefi-gpt, not so much, as at least in 
theory that's the best case.  If you're running either mbr-bios or gpt-bios 
without a reserved bios partition, that's the worst case, so if it works, 
then the others should definitely work.

Meanwhile, you're right about subvolumes.  I'd not try them on a btrfs 
/boot, either.  (I don't really see the use case for it, for a separate
/boot, tho there's certainly a case for a /boot subvolume on a btrfs 
root, for people doing that.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-31  5:55     ` Duncan
@ 2012-02-01  0:22       ` Kyle Gates
  2012-02-01  6:59         ` Duncan
  2012-02-10 19:45       ` Phillip Susi
  1 sibling, 1 reply; 14+ messages in thread
From: Kyle Gates @ 2012-02-01  0:22 UTC (permalink / raw)
  To: 1i5t5.duncan, linux-btrfs

>> I've been having good luck with my /boot on a separate 1GB RAID1 btrfs
>> filesystem using grub2 (2 disks only! I wouldn't try it with 3). I
>> should note, however, that I'm NOT using compression on this volume
>> because if I remember correctly it may not play well with grub (maybe
>> that was just lzo though) and I'm also not using subvolumes either for
>> the same reason.
>
> Thanks! I'm on grub2 as well. It's still masked on gentoo, but I
> recently unmasked and upgraded to it, taking advantage of the fact that I
> have two two-spindle md/raid-1s for /boot and its backup to test and
> upgrade one of them first, then the other only when I was satisfied with
> the results on the first set. I'll be using a similar strategy for the
> btrfs upgrades, only most of my md/raid-1s are 4-spindle, with two sets,
> working and backup, and I'll upgrade one set first.
>
> I'm going to keep /boot a pair of two-spindle raid-1s, but intend to make
> them btrfs-raid1s instead of md/raid-1s, and will upgrade one two-spindle
> set at a time.
>
> More on the status of grub2 btrfs-compression support based on my
> research. There is support for btrfs/gzip-compression in at least grub
> trunk. AFAIK, it's gzip-compression in grub-1.99-release and
> lzo-compression in trunk only, but I may be misremembering and it's gzip
> in trunk only and only uncompressed in grub-1.99-release.

I believe you are correct that btrfs zlib support is included in grub2 
version 1.99 and lzo is in trunk.
I'll try compressing the files on /boot for one installed kernel with the 
defrag -czlib option and see how it goes.
Result: Seemed to work just fine.
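
(For reference, that was roughly the following; the kernel file names are 
just examples:)

    # compress the files for one installed kernel in place with zlib
    btrfs filesystem defragment -czlib \
        /boot/vmlinuz-3.2.0-2 /boot/initrd.img-3.2.0-2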

> In any event, since I'm running 128 MB /boot md/raid-1s without
> compression now, and intend to increase the size to at least a quarter
> gig to better align the following partitions, /boot is the one set of
> btrfs partitions I do NOT intend to enable compression on, so that won't
> be an issue for me here. And since for /boot I'm running a pair of
> two-spindle raid1s instead of my usual quad-spindle raid1s, you've
> confirmed that works as well. =:^)
>
> As a side note, since I only recently did the grub2 upgrade, I've been
> enjoying its ability to load and read md/raid and my current reiserfs
> directly, thus giving me the ability to look up info in at least text-
> based main system config and notes files directly from grub2, without
> booting into Linux, if for some reason the above-grub boot is hosed or
> inconvenient at that moment. I just realized that if I want to maintain
> that direct-from-grub access, I'll need to ensure that the grub2 I'm
> running groks the btrfs compression scheme I'm using on any filesystem I
> want grub2 to be able to read.
>
> Hmm... that brings up another question: You mention a 1-gig btrfs-raid1 /
> boot, but do NOT mention whether you installed it before or after mixed-
> chunk (data/metadata) support made it into btrfs and became the default
> for <= 1 gig filesystems.

I don't think I specifically enabled mixed chunk support when I created this 
filesystem. It was done on a 2.6 kernel sometime in the middle of 2011 iirc.

> Can you confirm one way or the other whether you're running mixed-chunk
> on that 1-gig? I'm not sure whether grub2's btrfs module groks mixed-
> chunk or not, or whether that even matters to it.
>
> Also, could you confirm mbr-bios vs gpt-bios vs uefi-gpt partitions? I'm
> using gpt-bios partitioning here, with the special gpt-bios-reserved
> partition, so grub2-install can build the modules necessary for /boot
> access directly into its core-image and install that in the gpt-bios-
> reserved partition. It occurs to me that either uefi-gpt or gpt-bios
> with the appropriate reserved partition won't have quite the same issues
> with grub2 reading a btrfs /boot that either mbr-bios or gpt-bios without
> a reserved bios partition would. If you're running gpt-bios with a
> reserved bios partition, that confirms yet another aspect of your setup,
> compared to mine. If you're running uefi-gpt, not so much as at least in
> theory, that's best-case. If you're running either mbr-bios or gpt-bios
> without a reserved bios partition, that's a worst-case, so if it works,
> then the others should definitely work.

Same here, gpt-bios, 1MB partition with bios_grub flag set (gdisk code EF02) 
for grub to reside on.
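
(For the record, roughly how that partition gets created and used; the 
disk name is only an example, and the installer binary may be called 
grub2-install on some distros:)

    # 1 MiB BIOS boot partition (gdisk type code EF02) at the start of the disk
    sgdisk -n 1:2048:4095 -t 1:EF02 /dev/sda
    # grub-install then embeds its core image into that partition
    grub-install /dev/sda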

> Meanwhile, you're right about subvolumes. I'd not try them on a btrfs
> /boot, either. (I don't really see the use case for it, for a separate
> /boot, tho there's certainly a case for a /boot subvolume on a btrfs
> root, for people doing that.)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-02-01  0:22       ` Kyle Gates
@ 2012-02-01  6:59         ` Duncan
  0 siblings, 0 replies; 14+ messages in thread
From: Duncan @ 2012-02-01  6:59 UTC (permalink / raw)
  To: linux-btrfs

Kyle Gates posted on Tue, 31 Jan 2012 18:22:51 -0600 as excerpted:

> I don't think I specifically enabled mixed chunk support when I created
> this filesystem. It was done on a 2.6 kernel sometime in the middle of
> 2011 iirc.

Yeah, I'd guess that was before mixed-chunk, or at least before it became 
the default for <=1GiB filesystems, so even if it was supported it 
wouldn't have been the default.

Meaning there's still an open question as to whether grub-1.99 supports 
mixed-chunk.

It looks like I might get more time to play with it this coming week than 
I had this past week.  I might try some of my own experiments... and 
whether grub groks mixed-chunk will certainly be among them if I do.

As for those recommending something other than btrfs for /boot, yes, 
that's a possibility, but I strongly prefer to standardize on a single 
filesystem type.  Right now, that's reiserfs for everything except flash-
based USB and legacy floppies (both of which I use ext4 without 
journaling for, except for the floppies I used to update my BIOS, before 
my 2003 era mainboard got EOLed; those were freedos images), and 
ultimately, I hope it'll be btrfs for everything including flash-based 
(tho perhaps not for legacy floppies, but it has been awhile since I used 
one of them for anything, after that last BIOS update...).

Of course I'm going to keep reiserfs on my backups, even if I use btrfs 
for my working system, for the time being since btrfs is still in heavy 
development, but ultimately, I want to go all btrfs just as I'm all 
reiserfs now, and that would include both /boot 2-spindle raid-1s.

Tho if btrfs doesn't work well for that ATM, I can keep /boot as reiserfs 
for the time being, since I'm already keeping it for the backups, for the 
time being.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-01-31  5:55     ` Duncan
  2012-02-01  0:22       ` Kyle Gates
@ 2012-02-10 19:45       ` Phillip Susi
  2012-02-11  5:48         ` Duncan
  1 sibling, 1 reply; 14+ messages in thread
From: Phillip Susi @ 2012-02-10 19:45 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 1/31/2012 12:55 AM, Duncan wrote:
> Thanks!  I'm on grub2 as well.  It's still masked on gentoo, but
> I recently unmasked and upgraded to it, taking advantage of the
> fact that I have two two-spindle md/raid-1s for /boot and its
> backup to test and upgrade one of them first, then the other only
> when I was satisfied with the results on the first set.  I'll be
> using a similar strategy for the btrfs upgrades, only most of my
> md/raid-1s are 4-spindle, with two sets, working and backup, and
> I'll upgrade one set first.

Why do you want to have a separate /boot partition?  Unless you can't
boot without it, having one just makes things more
complex/problematic.  If you do have one, I agree that it is best to
keep it ext4, not btrfs.

> Meanwhile, you're right about subvolumes.  I'd not try them on a
> btrfs /boot, either.  (I don't really see the use case for it, for
> a separate /boot, tho there's certainly a case for a /boot
> subvolume on a btrfs root, for people doing that.)

The Ubuntu installer creates two subvolumes by default when you
install on btrfs: one named @, mounted on /, and one named @home,
mounted on /home.  Grub2 handles this well since the subvols have
names in the default root, so grub just refers to /@/boot instead of
/boot, and so on.  The apt-btrfs-snapshot package makes apt
automatically snapshot the root subvol so you can revert after an
upgrade.  This seamlessly causes grub to go back to the old boot menu
without the new kernels too, since it goes back to reading the old
grub.cfg in the reverted root subvol.
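
(Roughly, the resulting fstab then looks like this; the UUID is only a 
placeholder:)

    # / and /home are subvolumes of the same btrfs filesystem
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /      btrfs  defaults,subvol=@      0  0
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  defaults,subvol=@home  0  0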

I have a radically different suggestion you might consider for rebuilding
your system.  Partition each disk into only two partitions: one
for bios_grub, and one for everything else (or just use MBR and skip
the bios_grub partition).  Give the second partitions to mdadm to
make a raid10 array out of.  If you use a 2x far and 2x offset instead
of the default near layout, you will have an array that can still
handle any 2 of the 4 drives failing, will have twice the capacity of
a 4 way mirror, almost the same sequential read throughput of a 4 way
raid0, and about twice the write throughput of a 4 way mirror.
Partition that array up and put your filesystems on it.
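
(A sketch of that, using the "far 2" layout as the example and 
placeholder device names:)

    # 4-device md raid10 using the far-2 layout instead of the default near
    mdadm --create /dev/md0 --level=10 --raid-devices=4 --layout=f2 \
        /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
    # then partition /dev/md0 and put the filesystems on those partitions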


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-02-10 19:45       ` Phillip Susi
@ 2012-02-11  5:48         ` Duncan
  2012-02-12  0:04           ` Phillip Susi
  0 siblings, 1 reply; 14+ messages in thread
From: Duncan @ 2012-02-11  5:48 UTC (permalink / raw)
  To: linux-btrfs

Phillip Susi posted on Fri, 10 Feb 2012 14:45:43 -0500 as excerpted:

> On 1/31/2012 12:55 AM, Duncan wrote:
>> Thanks!  I'm on grub2 as well.  It's still masked on gentoo, but I
>> recently unmasked and upgraded to it, taking advantage of the fact that
>> I have two two-spindle md/raid-1s for /boot and its backup to test and
>> upgrade one of them first, then the other only when I was satisfied
>> with the results on the first set.  I'll be using a similar strategy
>> for the btrfs upgrades, only most of my md/raid-1s are 4-spindle, with
>> two sets, working and backup, and I'll upgrade one set first.
> 
> Why do you want to have a separate /boot partition?  Unless you can't
> boot without it, having one just makes things more complex/problematic. 
> If you do have one, I agree that it is best to keep it ext4 not btrfs.

For a proper picture of the situation, understand that I don't have an 
initr*, I build everything I need into the kernel and have module loading 
disabled, and I keep /boot unmounted except when I'm actually installing 
an upgrade or reconfiguring.

Having a separate /boot means that I can keep it unmounted and thus free 
from possible random corruption or accidental partial /boot tree 
overwrite or deletion, most of the time.  It also means that I can emerge 
(build from sources using the gentoo ebuild script provided for the 
purpose, and install to the live system) a new grub without fear of 
corrupting what I actually boot from -- the grub system installation and 
boot installation remain separate.

A separate /boot is also more robust in terms of file system corruption 
-- if something goes wrong with my rootfs, I can simply boot its backup, 
from a separate /boot that will not have been corrupted.  Similarly, if 
something goes wrong with /boot (or the bios partition), I can switch 
drives in the BIOS and boot from the backup /boot, then load my usual 
rootfs.

Since I'm working with four drives, and both the working /boot and 
backup /boot are two-spindle md/raid1, one on one pair, one on the other, 
I have both hardware redundancy via the second spindle of the raid1, and 
admin-fatfinger redundancy via the backup.  However, the rootfs and its 
backup are both on quad-spindle md/raid1s, thus giving me four separate 
physical copies each of rootfs and its backup.  Because the disk points 
at a single bootloader, if /boot is on rootfs, all four would point to 
either the working rootfs or the backup rootfs, and would update 
together, so I'd lose the ability to fall back to the backup /boot.

(Note that I developed the backup /boot policy and solution back on 
legacy-grub.  Grub2 is rather more flexible, particularly with a 
reasonably roomy GPT BIOS partition, and since each BIOS partition is 
installed individually, in theory, if a grub2 update failed, I could 
point the BIOS at a disk I hadn't installed the BIOS partition update to 
yet, boot to the limited grub rescue-mode-shell, and point it at the 
/boot in the backup rootfs to load the normal-mode-shell, menu, and 
additional grub2 modules as necessary.  However, being able to access a 
full normal-mode-shell grub2 on the backup /boot instead of having to 
resort to the grub2 rescue-mode-shell to reach the backup rootfs, does 
have its benefits.)
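
(From the rescue shell, that looks roughly like the following; the disk, 
partition, and path are only examples:)

    grub rescue> ls                               # list detected disks/partitions
    grub rescue> set root=(hd1,gpt5)              # partition holding the backup rootfs
    grub rescue> set prefix=(hd1,gpt5)/boot/grub  # its grub modules and grub.cfg
    grub rescue> insmod normal
    grub rescue> normal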

One of the nice things about grub2 normal-mode is that it allows 
(directory and plain text file) browsing of pretty much anything it has a 
module for, anywhere on the system.  That's a nice thing to be able to 
do, but it too is much more robust if /boot isn't part of rootfs, and 
thus, isn't likely to be damaged if the rootfs is.  The ability to boot 
to grub2 and retrieve vital information (even if limited to plain-text 
file storage) from a system without a working rootfs is a very nice 
ability to have! 

So you see, a separate /boot really does have its uses. =:^)

>> Meanwhile, you're right about subvolumes.  I'd not try them on a btrfs
>> /boot, either.  (I don't really see the use case for it, for a separate
>> /boot, tho there's certainly a case for a /boot subvolume on a btrfs
>> root, for people doing that.)
> 
> The Ubuntu installer creates two subvolumes by default when you install
> on btrfs: one named @, mounted on /, and one named @home, mounted on
> /home.  Grub2 handles this well since the subvols have names in the
> default root, so grub just refers to /@/boot instead of /boot, and so
> on.  The apt-btrfs-snapshot package makes apt automatically snapshot the
> root subvol so you can revert after an upgrade.  This seamlessly causes
> grub to go back to the old boot menu without the new kernels too, since
> it goes back to reading the old grub.cfg in the reverted root subvol.

Thanks for that "real world" example.  Subvolumes and particularly 
snapshots can indeed be quite useful, but I'd be rather leery of having 
all that on the same master filesystem.  Lose it and you've lost 
everything, snapshots or no snapshots, if there are no bootable backups 
somewhere.

Two experiences inform my partitioning and layout judgment here.  The 
first one was back before the turn of the century when I still did MS.  
In fact, at the time I was running an MSIE public beta, for either MSIE 4 
or 5; I ran both, but I don't recall which one this happened with.  
MS made a change to the MSIE cache indexing, keeping the index file disk 
location in memory and direct-writing to it for performance reasons, 
rather than going the usual filesystem access route.  The only problem 
was, whoever made that change didn't think about MSIE and MS (filesystem) 
Explorer being effectively merged, and that it ran all the time as it was 
the shell.

So then it comes time for the regularly scheduled system defrag, and 
defrag moves the index files out from under MSIE.  Then MSIE updates the 
index, writing to the old location, in the process overwriting whatever's 
there, causing all sorts of crosslinked files and other destruction.

A number of folks running that beta had un-backed-up data destroyed by 
that bug (which MS fixed in the release by simply marking the MSIE index 
files with the system attribute, so defrag wouldn't move them), but all 
it did to me was screw up a few files on my separate TMP partition, 
because I HAD a separate TMP partition, and because that's where I had 
put the IE cache, reasoning that it was temporary data and thus belonged 
on the TMP partition.  That decision saved my bacon!

Both before and after that, I had a number of similar but rather more 
minor incidents where a strict partitioning policy saved me trouble, as 
well.  But that one was all it took to keep me using a strict separate 
partitioning system to this day.

The second experience was when the AC failed here, in the hot Phoenix 
summer (routinely 45-48C highs).  I had left the system on and gone 
somewhere.  When the AC failed, the outside-in-the-shade-temperature was 
45C+, inside room temperature was EASILY 60C+, and the drive temperature 
was very likely 90C+!

The drive of course failed due to physical head-crash on the still-
spinning platters (I could see the grooves when I took it apart, later).

When I came home of course the system was frozen, and I turned it off.  
The CPUs survived, and surprisingly, so did much of the disk.  It was 
only where the physical head crash grooves were that the data was gone.

I didn't have off-disk backups at that time (for sure I do now!), but I 
had duplicate backup partitions for anything valuable.  Since they 
weren't mounted, I was able to recover and even continue using the backup 
rootfs, /usr, etc, for a couple months, until I could buy a new disk and 
transfer everything over.

Again, what saved me was the fact that I had everything partitioned off.  
The partitions that weren't actually mounted were pretty much undamaged, 
save for a few single scratches due to head seeking from one mounted 
partition to another, before the system itself crashed, and unlike the 
grooves worn in the mounted partitions, the disk's own error correction 
caught most of that.  An fsck fixed things up pretty good, tho I lost a 
few files.

I hate to think about what would have happened if instead of separate 
partitions, each with its own intact metadata, etc, those "unmounted" 
partitions had been simply subvolumes on a single master filesystem!  
True, btrfs has double metadata and both data and metadata checksumming, 
and I'm *DEFINITELY* looking forward to the additional protection from 
that (tho only two-way even on a 4-spindle so-called raid1 btrfs was a 
big disappointment; an article I read somewhere says multi-redundancy 
is scheduled for kernel 3.4 or 3.5), but the plan at least here is for 
that to be ADDITIONAL protection, NOT AN EXCUSE TO BE SLOPPY!  It's for 
that reason that I intend to keep proper partitions and probably won't 
make a lot of use of the subvolume functionality, except as it's used by 
the snapshot functionality, which I expect I WILL use, for exactly the 
type of rollback functionality you describe above.
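
(The snapshot mechanics I have in mind are roughly the below, with the 
paths invented purely for illustration and assuming the snapshot target 
sits on the same btrfs:

  btrfs subvolume snapshot / /snapshots/root-pre-upgrade
  btrfs subvolume list /

Then if the upgrade goes bad, mount or boot the snapshot subvolume via 
-o subvol= and carry on from there.)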

> I have a radically different suggestion you might consider rebuilding
> your system using.  Partition each disk into only two partitions: one
> for bios_grub, and one for everything else ( or just use MBR and skip
> the bios_grub partition ).  Give the second partitions to mdadm to make
> a raid10 array out of.  If you use a 2x far and 2x offset instead of the
> default near layout, you will have an array that can still handle any 2
> of the 4 drives failing, will have twice the capacity of a 4 way mirror,
> almost the same sequential read throughput of a 4 way raid0, and about
> twice the write throughput of a 4 way mirror. Partition that array up
> and put your filesystems on it.

I like the raid-10 idea and will have to research it some more.  I 
understand the general idea behind "near" and "far" on raid10, but having 
never used raid-10, I don't "grok" it well enough to have appreciated the 
lose-any-two possibility before you suggested it.

And I'm only running 300 gig disks; given that I keep a working and a 
backup copy of most of those raids/partitions, it's more like 180 or 200 
gig of actual storage, with the free space fragmented due to the multiple 
partitions/raids, so I /am/ running a bit low on free space and could 
definitely use the doubled space at this point!


But I believe I'll keep multiple raids for much the same reason I keep 
multiple partitions, it's a FAR more robust solution than having all 
one's eggs in one RAID basket.

Besides, I actually did try a single partitioned RAID (well, two, one for 
all the working copies, one for the backups) when I first set up md/raid, 
and came to the conclusion that the recovery time on that big a raid is 
rather longer than I like to deal with.  Multiple raids, with 
the ones I'm not using ATM offline, means I don't have to worry about 
recovering the entire thing, only the raids that were online and actually 
dirty at the time of crash or whatever.  And of course write-intent 
bitmaps means even shorter recovery time in most cases, so between 
multiple raids and write-intent-bitmaps, a recovery that would take 2-3 
hours with my original all-in-one raid setup, now often takes < 5 
minutes! =:^)  Even with write-intent-bitmaps, I'd hate to go back to big 
all-in-one raids, for recovery reasons alone, and between that and the 
additional robustness of multiple raids, I just don't see myself doing 
that any time soon.
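
(For anyone not using them yet, adding an internal write-intent bitmap to 
an existing array is a one-liner; /dev/md3 is of course just a 
placeholder here:

  mdadm --grow --bitmap=internal /dev/md3

After that, a re-add only has to resync the chunks the bitmap marks 
dirty, hence the short recovery times.)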


But the 2x far, 2x offset raid10 idea, to let me lose any two of the 
four, is something I will very possibly use, especially now that I've 
seen that btrfs isn't as close to ready with multi-redundancy as I had 
hoped, so it'll probably be mid-year at the earliest before I can 
reasonably play with that.  Thanks again, as that's a very practical 
suggestion indeed! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-02-11  5:48         ` Duncan
@ 2012-02-12  0:04           ` Phillip Susi
  2012-02-12 22:31             ` Duncan
  0 siblings, 1 reply; 14+ messages in thread
From: Phillip Susi @ 2012-02-12  0:04 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 02/11/2012 12:48 AM, Duncan wrote:
> So you see, a separate /boot really does have its uses. =:^)

True, but booting from removable media is easy too, and a full livecd gives
many more recovery options than the grub shell.  It is a corrupted root
fs that is of much more concern than /boot.

> I like the raid-10 idea and will have to research it some more.  I 
> understand the general idea behind "near" and "far" on raid10, but 
> having never used raid-10, I don't "grok" it well enough to have 
> appreciated the lose-any-two possibility before you suggested it.

To grok the other layouts, it helps to think of the simple two-disk case.
A far layout is like having a raid0 across the first half of both disks, then
mirroring the first half of each disk onto the second half of the other
disk.  Offset puts the mirror on the next stripe, so each stripe is interleaved
with a mirror stripe, rather than having all the originals first, then all the
mirrors after.

It looks like mdadm won't let you use both at once, so you'd have to go with
a 3 way far or offset.  Also I was wrong about the additional space.  You
would only get about a third more space since you still have 3 copies of
all data, so you get 4/3 times the space, but you will get much better
throughput since
it is striped across all 4 disks.  Far gives better sequential read since it
reads just like a raid0, but writes have to seek all the way across the disk
to write the backup.  Offset requires seeks between each stripe on read, but
the writes don't have to seek to write the backup.
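
As a rough example (device names are just placeholders for whatever your 
second partitions end up being), the 3-copy layouts would be created with 
something like:

  mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        --layout=f3 /dev/sd[abcd]2

or --layout=o3 for the offset variant.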

You also could do a raid6 and get the double failure tolerance, and two disks
worth of capacity, but not as much read throughput as raid10.
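
For comparison, the raid6 version would be along the lines of:

  mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[abcd]2

again with placeholder device names.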

> But I believe I'll keep multiple raids for much the same reason I keep 
> multiple partitions, it's a FAR more robust solution than having all 
> one's eggs in one RAID basket.

True.

> Besides, I actually did try a single partitioned RAID (well, two, one for 
> all the working copies, one for the backups) when I first set up md/raid, 
> and came to the conclusion that the recovery time on that big a raid is 
> rather longer than I like to deal with.  Multiple raids, with 
> the ones I'm not using ATM offline, means I don't have to worry about 
> recovering the entire thing, only the raids that were online and actually 
> dirty at the time of crash or whatever.  And of course write-intent 
> bitmaps means even shorter recovery time in most cases, so between 
> multiple raids and write-intent-bitmaps, a recovery that would take 2-3 
> hours with my original all-in-one raid setup, now often takes < 5 
> minutes! =:^)  Even with write-intent-bitmaps, I'd hate to go back to big 
> all-in-one raids, for recovery reasons alone, and between that and the 
> additional robustness of multiple raids, I just don't see myself doing 
> that any time soon.

Depends on what you mean by recovery.  Re-adding a drive that you removed
will be faster with multiple raids ( though write-intent bitmaps also take
care of that ), but if you actually have a failed disk and have to replace
it with a new one, you still have to do a rebuild on all of the raids
so it ends up taking the same total time.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs-raid questions I couldn't find an answer to on the wiki
  2012-02-12  0:04           ` Phillip Susi
@ 2012-02-12 22:31             ` Duncan
  0 siblings, 0 replies; 14+ messages in thread
From: Duncan @ 2012-02-12 22:31 UTC (permalink / raw)
  To: linux-btrfs

Phillip Susi posted on Sat, 11 Feb 2012 19:04:41 -0500 as excerpted:

> On 02/11/2012 12:48 AM, Duncan wrote:
>> So you see, a separate /boot really does have its uses. =:^)
> 
> True, but booting from removable media is easy too, and a full livecd
> gives many more recovery options than the grub shell.

And a rootfs backup that's simply a copy of rootfs at the time it was 
taken is even MORE flexible, especially when rootfs is arranged to 
contain all packages installed by the package manager.  That's what I 
use.  If misfortune comes my way right in the middle of a critical 
project and rootfs dies, I simply set root= on the kernel command line at 
the grub prompt to the backup root, and assuming that critical project is on 
another filesystem (such as home), I can normally simply continue where I 
left off.  Full X and desktop, browser, movie players, document editors 
and viewers, presentation software, all the software I had on the system 
at the time I made the backup, directly bootable without futzing around 
with data restores, etc. =:^)
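
(Concretely, at the grub2 menu that's just: hit "e" on the entry, edit 
the linux line from, say,

  linux /vmlinuz root=/dev/md1 ro

to

  linux /vmlinuz root=/dev/md2 ro

and Ctrl-x to boot it.  The md numbers and kernel path are of course just 
placeholders here; use whatever the backup rootfs actually is.)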

> It is a corrupted root fs that is of much more concern than /boot.

Yes, but to the extent that /boot is the gateway to both the rootfs and 
its backup... and digging out the removable media is at least a /bit/ 
more hassle than simply altering the root= (and mdX=) on the kernel 
command line...

(Incidentally, I've thought for quite some time that I really should have 
had two such backups, such that if I'm just doing the backup when 
misfortune strikes and takes out both the working rootfs and its backup, 
the backup being mounted and actively written at the time of the 
misfortune, I could always boot to the second backup.  But I hadn't 
considered that when I did the current layout.  Given that rootfs with 
the full installed system's only 4.75 gigs (with a quarter gig /usr/local 
on the same 5 gig partitioned md/raid), it shouldn't be /too/ difficult 
to fit that in at my next rearrange, especially if I do the 4/3 raid10s 
as you suggested (for another ~100 gig since I'm running 300 gig disks).)

>> I don't "grok" [raid10]
> 
> To grok the other layouts, it helps to think of the simple two-disk
> case.
> A far layout is like having a raid0 across the first half of both disks,
> then mirroring the first half of each disk onto the second half of the
> other disk.  Offset puts the mirror on the next stripe, so each stripe
> is interleaved with a mirror stripe, rather than having all the
> originals first, then all the mirrors after.
> 
> It looks like mdadm won't let you use both at once, so you'd have to go
> with a 3 way far or offset.  Also I was wrong about the additional
> space.  You would only get about a third more space since you still have
> 3 copies of all data, so you get 4/3 times the space, but you will get
> much better
> throughput since it is striped across all 4 disks.  Far gives better
> sequential read since it reads just like a raid0, but writes have to
> seek all the way across the disk to write the backup.  Offset requires
> seeks between each stripe on read, but the writes don't have to seek to
> write the backup.

Thanks.  That's reasonably clear.  Beyond that, I just have to DO IT, to 
get comfortable enough with it to be confident in my restoration 
abilities under the stress of an emergency recovery.  (That's the reason 
I ditched the lvm2 layer I had tried, the additional complexity of that 
one more layer was simply too much for me to be confident in my ability 
to manage it without fat-fingering under the stress of an emergency 
recovery situation.)

> You also could do a raid6 and get the double failure tolerance, and two
> disks worth of capacity, but not as much read throughput as raid10.

Ugh!  That's what I tried as my first raid layout, when I was young and 
foolish, raid-wise!  Raid5/6's read-modify-write cycle in order to get 
the parity data written was simply too much!  Combine that with the 
parallel job read boost of raid1, and raid1 was a FAR better choice for 
me than raid6!

Actually, since much of my reading /is/ parallel jobs and the kernel i/o 
scheduler and md do such a good job of taking advantage of raid1's 
parallel-read characteristics, it has seemed I do better with that than 
with raid0!  I do still have one raid0, for gentoo's package tree, the 
kernel tree, etc, since redundancy doesn't matter for it and the 4X space 
it gives me for that is nice, but bigger storage, I'd have it all raid1 
(or now raid10) and not have to worry about other levels.

Counterintuitively, even write seems more responsive with raid1 than 
raid0, in actual use.  The only explanation I've come up with for that is 
that in practice, any large scale writes tend to be reads from elsewhere 
as well, and the md scheduler is evidently smart enough to read from one 
spindle and write to the others, then switch off to catch up writing on 
the formerly read-spindle, such that there's rather less head seeking 
between read and write than there'd be otherwise.  Since raid0 only has 
the single copy, the data MUST be read from whatever spindle it resides 
on, thus eliminating the kernel/md's ability to smart-schedule, favoring 
one spindle at a time for reads to eliminate seeks.

For that reason, I've always thought that if I went to raid10, I'd try to 
do it with at least triple spindle at the raid1 level, thus hoping to get 
both the additional redundancy and parallel scheduling of raid1, while 
also getting the thruput speed and size of the stripes.

Now you've pointed out that I can do essentially that with a triple 
mirror on quad spindle raid10, and I'm seeing new possibilities open up...

>> Multiple
>> raids, with the ones I'm not using ATM offline, means I don't have to
>> worry about recovering the entire thing, only the raids that were
>> online and actually dirty at the time of crash or whatever.
> 
> Depends on what you mean by recovery.  Re-adding a drive that you
> removed will be faster with multiple raids ( though write-intent bitmaps
> also take care of that ), but if you actually have a failed disk and
> have to replace it with a new one, you still have to do a rebuild on all
> of the raids so it ends up taking the same total time.

Very good point.  I was talking about re-adding.  For various reasons 
including hardware power-on stability latency (these particular disks 
apparently take a bit to stabilize after power on and suspend-to-disk 
often kicks a disk on resume due to ID-match-failure, which then appears 
as say sde instead of sdb; I've solved that problem by simply leaving on 
or shutting down the system instead of using suspend-to-disk), faulty 
memory at one point causing kernel panics, and the fact that I run live-
git kernels, I've had rather more experience with re-add than I would 
have liked.  But that has made me QUITE confident in my ability to 
recover from either that or a dead drive, since I've had rather more 
practice than I anticipated.
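
(In mdadm terms, and with the device names made up for illustration, the 
distinction is between

  mdadm /dev/md3 --re-add /dev/sdb5

for a member that merely dropped out, which the write-intent bitmap makes 
nearly instant, and

  mdadm /dev/md3 --add /dev/sde5

for a genuinely new disk, which triggers the full rebuild you describe.)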

But all my experience has been with re-add, so that's what I was thinking 
about when I said recovery.  Thanks for pointing out that I omitted to 
mention that as I was really quite oblivious. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-02-12 22:31 UTC | newest]

Thread overview: 14+ messages
2012-01-26 15:41 btrfs-raid questions I couldn't find an answer to on the wiki Duncan
2012-01-28 12:08 ` Martin Steigerwald
2012-01-29  5:40   ` Duncan
2012-01-29  7:55     ` Martin Steigerwald
2012-01-29 11:23 ` Goffredo Baroncelli
2012-01-30  5:49   ` Li Zefan
2012-01-30 14:58   ` Kyle Gates
2012-01-31  5:55     ` Duncan
2012-02-01  0:22       ` Kyle Gates
2012-02-01  6:59         ` Duncan
2012-02-10 19:45       ` Phillip Susi
2012-02-11  5:48         ` Duncan
2012-02-12  0:04           ` Phillip Susi
2012-02-12 22:31             ` Duncan
