* safe/necessary to balance system chunks?
From: Steve Leung @ 2014-04-25 14:57 UTC (permalink / raw)
To: linux-btrfs
Hi list,
I've got a 3-device RAID1 btrfs filesystem that started out life as
single-device.
btrfs fi df:
Data, RAID1: total=1.31TiB, used=1.07TiB
System, RAID1: total=32.00MiB, used=224.00KiB
System, DUP: total=32.00MiB, used=32.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=66.00GiB, used=2.97GiB
This still lists some system chunks as DUP, and not as RAID1. Does this
mean that if one device were to fail, some system chunks would be
unrecoverable? How bad would that be?
Assuming this is something that needs to be fixed, would I be able to fix
this by balancing the system chunks? Since the "force" flag is required,
does that mean that balancing system chunks is inherently risky or
unpleasant?
Thanks,
Steve
* Re: safe/necessary to balance system chunks?
From: Chris Murphy @ 2014-04-25 17:24 UTC (permalink / raw)
To: Steve Leung; +Cc: linux-btrfs
On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>
> Hi list,
>
> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>
> btrfs fi df:
>
> Data, RAID1: total=1.31TiB, used=1.07TiB
> System, RAID1: total=32.00MiB, used=224.00KiB
> System, DUP: total=32.00MiB, used=32.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>
> This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be?
Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
Anyway, it's probably a high penalty for losing only 32KB of data. I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
>
> Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks? Since the "force" flag is required, does that mean that balancing system chunks is inherently risky or unpleasant?
I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although it's probably a minor distinction for such a small amount of data.
The metadata looks like it could use a balance, 66GB of metadata chunks allocated but only 3GB used. So you could include something like -musage=50 at the same time and that will balance any chunks with 50% or less usage.
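For illustration, that might look something like the following (a sketch
only; /mnt stands in for the actual mount point, and as discussed further
down the thread the kernel may insist on -f for a system-only balance):

# convert any remaining system chunks to raid1; "soft" skips chunks
# that are already in the target profile
btrfs balance start -sconvert=raid1,soft /mnt

# optionally compact metadata chunks that are 50% full or less
btrfs balance start -musage=50 /mnt

# check the chunk profiles afterwards
btrfs fi df /mnt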
Chris Murphy
* Re: safe/necessary to balance system chunks?
From: Austin S Hemmelgarn @ 2014-04-25 18:12 UTC (permalink / raw)
To: Chris Murphy, Steve Leung; +Cc: linux-btrfs
On 2014-04-25 13:24, Chris Murphy wrote:
>
> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>
>>
>> Hi list,
>>
>> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>>
>> btrfs fi df:
>>
>> Data, RAID1: total=1.31TiB, used=1.07TiB
>> System, RAID1: total=32.00MiB, used=224.00KiB
>> System, DUP: total=32.00MiB, used=32.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>
>> This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be?
>
> Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
>
> Anyway, it's probably a high penalty for losing only 32KB of data. I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
>
As far as I understand it, the system chunks are THE root chunk tree for
the entire system, that is to say, it's the tree of tree roots that is
pointed to by the superblock. (I would love to know if this
understanding is wrong). Thus losing that data almost always means
losing the whole filesystem.
>>
>> Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks? Since the "force" flag is required, does that mean that balancing system chunks is inherently risky or unpleasant?
>
> I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although it's probably a minor distinction for such a small amount of data.
The kernel won't allow a balance involving system chunks unless you
specify force, as it considers any kind of balance using them to be
dangerous. Given your circumstances, I'd personally say that the safety
provided by RAID1 outweighs the risk of making the FS un-mountable.
>
> The metadata looks like it could use a balance, 66GB of metadata chunks allocated but only 3GB used. So you could include something like -musage=50 at the same time and that will balance any chunks with 50% or less usage.
>
>
> Chris Murphy
>
Personally, I would recommend making a full backup of all the data (tar
works wonderfully for this), and recreate the entire filesystem from
scratch, but passing all three devices to mkfs.btrfs. This should
result in all the chunks being RAID1, and will also allow you to benefit
from newer features.
* Re: safe/necessary to balance system chunks?
From: Steve Leung @ 2014-04-25 18:36 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs
On 04/25/2014 11:24 AM, Chris Murphy wrote:
>
> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>
>> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>>
>> btrfs fi df:
>>
>> Data, RAID1: total=1.31TiB, used=1.07TiB
>> System, RAID1: total=32.00MiB, used=224.00KiB
>> System, DUP: total=32.00MiB, used=32.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>
>> This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be?
>
> Anyway, it's probably a high penalty for losing only 32KB of data. I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
As for how it occurred, I'm not sure. I created this filesystem some
time ago (not sure exactly, but I'm guessing with a 3.4-era kernel?) so
it's quite possible it's not reproducible on newer kernels.
It's also nice to know I've been one failed device away from a dead
filesystem for a long time now, but better to notice it late than never. :)
Steve
* Re: safe/necessary to balance system chunks?
From: Steve Leung @ 2014-04-25 18:43 UTC (permalink / raw)
To: Austin S Hemmelgarn, Chris Murphy; +Cc: linux-btrfs
On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
> On 2014-04-25 13:24, Chris Murphy wrote:
>>
>> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>>
>>> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>>>
>>> btrfs fi df:
>>>
>>> Data, RAID1: total=1.31TiB, used=1.07TiB
>>> System, RAID1: total=32.00MiB, used=224.00KiB
>>> System, DUP: total=32.00MiB, used=32.00KiB
>>> System, single: total=4.00MiB, used=0.00
>>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>>
>>> This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be?
>>>
>>> Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks? Since the "force" flag is required, does that mean that balancing system chunks is inherently risky or unpleasant?
>>
>> I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although it's probably a minor distinction for such a small amount of data.
> The kernel won't allow a balance involving system chunks unless you
> specify force, as it considers any kind of balance using them to be
> dangerous. Given your circumstances, I'd personally say that the safety
> provided by RAID1 outweighs the risk of making the FS un-mountable.
Agreed, I'll attempt the system balance shortly.
> Personally, I would recommend making a full backup of all the data (tar
> works wonderfully for this), and recreate the entire filesystem from
> scratch, but passing all three devices to mkfs.btrfs. This should
> result in all the chunks being RAID1, and will also allow you to benefit
> from newer features.
I do have backups of the really important stuff from this filesystem,
but they're offsite. As this is just for a home system, I don't have
enough temporary space for a full backup handy (which is related to how
I ended up in this situation in the first place).
Once everything gets rebalanced though, I don't think I'd be missing out
on any features, would I?
Steve
* Re: safe/necessary to balance system chunks?
From: Austin S Hemmelgarn @ 2014-04-25 19:07 UTC (permalink / raw)
To: Steve Leung, Chris Murphy; +Cc: linux-btrfs
On 2014-04-25 14:43, Steve Leung wrote:
> On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
>> On 2014-04-25 13:24, Chris Murphy wrote:
>>>
>>> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>>>
>>>> I've got a 3-device RAID1 btrfs filesystem that started out life as
>>>> single-device.
>>>>
>>>> btrfs fi df:
>>>>
>>>> Data, RAID1: total=1.31TiB, used=1.07TiB
>>>> System, RAID1: total=32.00MiB, used=224.00KiB
>>>> System, DUP: total=32.00MiB, used=32.00KiB
>>>> System, single: total=4.00MiB, used=0.00
>>>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>>>
>>>> This still lists some system chunks as DUP, and not as RAID1. Does
>>>> this mean that if one device were to fail, some system chunks would
>>>> be unrecoverable? How bad would that be?
>>>>
>>>> Assuming this is something that needs to be fixed, would I be able
>>>> to fix this by balancing the system chunks? Since the "force" flag
>>>> is required, does that mean that balancing system chunks is
>>>> inherently risky or unpleasant?
>>>
>>> I don't think force is needed. You'd use btrfs balance start
>>> -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although
>>> it's probably a minor distinction for such a small amount of data.
>> The kernel won't allow a balance involving system chunks unless you
>> specify force, as it considers any kind of balance using them to be
>> dangerous. Given your circumstances, I'd personally say that the safety
>> provided by RAID1 outweighs the risk of making the FS un-mountable.
>
> Agreed, I'll attempt the system balance shortly.
>
>> Personally, I would recommend making a full backup of all the data (tar
>> works wonderfully for this), and recreate the entire filesystem from
>> scratch, but passing all three devices to mkfs.btrfs. This should
>> result in all the chunks being RAID1, and will also allow you to benefit
>> from newer features.
>
> I do have backups of the really important stuff from this filesystem,
> but they're offsite. As this is just for a home system, I don't have
> enough temporary space for a full backup handy (which is related to how
> I ended up in this situation in the first place).
>
> Once everything gets rebalanced though, I don't think I'd be missing out
> on any features, would I?
>
> Steve
In general, it shouldn't be an issue, but it might get you slightly
better performance to recreate it. I actually have a similar situation
with how my desktop system is set up. When I go about recreating the
filesystem (which I do every time I upgrade either the tools or the
kernel), I use the following approach:
1. Delete one of the devices from the filesystem
2. Create a new btrfs file system on the device just removed from the
filesystem
3. Copy the data from the old filesystem to the new one
4. One at a time, delete the remaining devices from the old filesystem
and add them to the new one, re-balancing the new filesystem after
adding each device.
This seems to work relatively well for me, and prevents the possibility
that there is ever just one copy of the data. It does, however, require
that the amount of data you are storing on the filesystem be less than
the size of one of the devices (although you can kind of work around
this limitation by setting compress-force=zlib on the new filesystem
when you mount it, then using defrag to decompress everything after the
conversion is done). It also means you have to drop to single-user mode
for the conversion (unless it's something that isn't needed all the
time, like the home directories or /usr/src, in which case you just log
everyone out and log in as root on the console to do it).
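To make the sequence concrete, here is a rough sketch of those steps as
commands (device names, mount points and the raid1 convert at the end
are illustrative placeholders, not something from the original post):

# 1. remove one device from the existing filesystem
btrfs device delete /dev/sdc /mnt/old
# 2. create a new btrfs on the freed device and mount it
mkfs.btrfs /dev/sdc
mount /dev/sdc /mnt/new
# 3. copy the data across
cp -a /mnt/old/. /mnt/new/
# 4. migrate the remaining devices one at a time, rebalancing each time
btrfs device delete /dev/sdb /mnt/old
btrfs device add /dev/sdb /mnt/new
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new
# ...repeat for each remaining device; the last one can simply be wiped
# and added once the old filesystem is no longer needed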
* Re: safe/necessary to balance system chunks?
From: Hugo Mills @ 2014-04-25 19:14 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: Chris Murphy, Steve Leung, linux-btrfs
On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
> On 2014-04-25 13:24, Chris Murphy wrote:
> >
> > On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
> >
> >>
> >> Hi list,
> >>
> >> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
> >>
> >> btrfs fi df:
> >>
> >> Data, RAID1: total=1.31TiB, used=1.07TiB
> >> System, RAID1: total=32.00MiB, used=224.00KiB
> >> System, DUP: total=32.00MiB, used=32.00KiB
> >> System, single: total=4.00MiB, used=0.00
> >> Metadata, RAID1: total=66.00GiB, used=2.97GiB
> >>
> >> This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be?
> >
> > Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
> >
> > Anyway, it's probably a high penalty for losing only 32KB of data. I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
> >
> As far as I understand it, the system chunks are THE root chunk tree for
> the entire system, that is to say, it's the tree of tree roots that is
> pointed to by the superblock. (I would love to know if this
> understanding is wrong). Thus losing that data almost always means
> losing the whole filesystem.
From a conversation I had with cmason a while ago, the System
chunks contain the chunk tree. They're special because *everything* in
the filesystem -- including the locations of all the trees, including
the chunk tree and the roots tree -- is positioned in terms of the
internal virtual address space. Therefore, when starting up the FS,
you can read the superblock (which is at a known position on each
device), which tells you the virtual address of the other trees... and
you still need to find out where that really is.
The superblock has (I think) a list of physical block addresses at
the end of it (sys_chunk_array), which allows you to find the blocks
for the chunk tree and work out this mapping, which allows you to find
everything else. I'm not 100% certain of the actual format of that
array -- it's declared as u8 [2048], so I'm guessing there's a load of
casting to something useful going on in the code somewhere.
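Purely as an illustration of that "known position": the primary
superblock lives at a 64 KiB offset on each device, so (with a
placeholder device name) something like

dd if=/dev/sdb bs=4096 skip=16 count=1 2>/dev/null | hexdump -C | head

should show the btrfs magic string "_BHRfS_M" at offset 0x40 of the
dumped block, followed by the rest of the superblock, sys_chunk_array
included.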
Hugo.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Is it still called an affair if I'm sleeping with my wife ---
behind her lover's back?
* Re: safe/necessary to balance system chunks?
From: Duncan @ 2014-04-25 23:03 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Fri, 25 Apr 2014 14:12:17 -0400 as
excerpted:
>
> On 2014-04-25 13:24, Chris Murphy wrote:
>>
>> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>>
>>> Assuming this is something that needs to be fixed, would I be able to
>>> fix this by balancing the system chunks? Since the "force" flag is
>>> required, does that mean that balancing system chunks is inherently
>>> risky or unpleasant?
>>
>> I don't think force is needed. You'd use btrfs balance start
>> -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although
>> it's probably a minor distinction for such a small amount of data.
>
> The kernel won't allow a balance involving system chunks unless you
> specify force, as it considers any kind of balance using them to be
> dangerous. Given your circumstances, I'd personally say that the safety
> provided by RAID1 outweighs the risk of making the FS un-mountable.
To clear this up, FWIW...
In a balance, metadata includes system by default.
If you go back and look at the committed balance filters patch, the
wording on the -s/system chunks option is that it requires -f/force
because one would normally handle system as part of metadata, not for any
other reason.
What it looks like to me is that the original patch in progress may not
have had -s/system as a separate filter at all, treating it as
-m/metadata, but perhaps someone suggested having -s/system as a separate
option too, and the author agreed. But since -m/metadata includes -s/
system by default, and that was the intended way of doing things,
-f/force was added as necessary when doing only -s/system, since
presumably that was considered an artificial distinction, and handling -s/
system as a part of -m/metadata was considered the more natural method.
Which begs the question[1], is there a safety or procedural reason one
should prefer handling metadata and system chunks at the same time,
perhaps because rewriting the one involves rewriting critical bits of the
other anyway, or is it simply that the author considered system a subset
of metadata, anyway? That I don't know.
But what I do know is that -f/force isn't required with -m/metadata,
which includes -s/system by default anyway, so unless there's reason to
treat the two differently, just use -m/metadata and let it handle -s/
system as well. =:^)
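Concretely (illustrative commands, with /mnt as a placeholder mountpoint):

# rewrites system chunks too, no -f needed
btrfs balance start -mconvert=raid1 /mnt

# system chunks only; the kernel insists on -f for this form
btrfs balance start -sconvert=raid1 -f /mnt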
---
[1] Begs the question: Modern more natural/literal majority usage
meaning: invites/forces the question, the question becomes so obvious
that it's "begging" to be asked, at least in the speaker/author's (my)
own head. Yes, I am aware of but generally prefer "assumes and thus
can't prove the postulate" or similar wording as an alternate to the
translation-accident meaning. If you have some time and are wondering
what I'm talking about and/or think I used the term incorrectly, google
it (using duck-duck-go or the like if you don't like google's profiling).
=:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: safe/necessary to balance system chunks?
From: Duncan @ 2014-04-26 1:11 UTC (permalink / raw)
To: linux-btrfs
Steve Leung posted on Fri, 25 Apr 2014 12:43:12 -0600 as excerpted:
> On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
>
>> Personally, I would recommend making a full backup of all the data (tar
>> works wonderfully for this), and recreate the entire filesystem from
>> scratch, but passing all three devices to mkfs.btrfs. This should
>> result in all the chunks being RAID1, and will also allow you to
>> benefit from newer features.
>
> I do have backups of the really important stuff from this filesystem,
> but they're offsite. As this is just for a home system, I don't have
> enough temporary space for a full backup handy (which is related to how
> I ended up in this situation in the first place).
>
> Once everything gets rebalanced though, I don't think I'd be missing out
> on any features, would I?
As ASH says, nothing critical.
But there are some relatively minor newer features available, and I
actually re-mkfs.btrfs most of my (several) btrfs every few kernel cycles
to take advantage of them, since btrfs is still under development and
these minor features do accumulate over time. The on-device format is
now guaranteed to be readable by newer kernels, but that doesn't mean a
newer kernel couldn't take advantage of minor features available to it in
a newer filesystem, if the filesystem was new enough to make them
available.
Of course the other reason is that doing a mkfs guarantees (especially
with ssds, where it by default does a trim/discard on the entire space
it's mkfsing, guaranteeing a zero-out that you'd otherwise have to do
manually for that level of zero-out guarantee) that I've eliminated any
cruft from now-fixed bugs that otherwise might come back to haunt me at
some point.
The other consideration is the range of kernels you plan on mounting/
accessing the filesystem with. If you're planning on accessing the
filesystem with an old kernel, mkfs.btrfs does have an option to toggle
these newer features (with one, extref, allowing more per-directory hard-
links, defaulting on, others generally defaulting off), and keeping them
off to work with older kernels is possible, but then of course eliminates
the newer features as a reason for doing the mkfs in the first place.
My local policy is up to four kernel stability levels: current/testing
development kernel; last tested working kernel as first-level fallback;
latest stable series as second-level fallback; and a reasonably recent,
but occasionally 2-3 stable series older, kernel (depending on when I
last updated my backup /boot) as backup-boot stable fallback. Even the
latter is reasonably new, and given that I tend to wait a couple of
kernel cycles to work out the bugs before activating a new minor feature
here anyway, I don't generally worry much about old kernels when
activating such features.
So what are these minor features? Using mkfs.btrfs -O list-all (as
suggested in the mkfs.btrfs manpage, for btrfs-progs v3.14 (slightly
reformatted to avoid wrap when posting):
$ mkfs.btrfs -O list-all
Filesystem features available at mkfs time:
mixed-bg - mixed data and metadata block groups (0x4)
extref - increased hardlink limit per file to 65536 (0x40, def)
raid56 - raid56 extended format (0x80)
skinny-metadata - reduced-size metadata extent refs (0x100)
no-holes - no explicit hole extents for files (0x200)
Mixed-bg: This one's reasonably old and is available with the -M option
as well. It has been the default for filesystems under 1 GiB for some
time. Some people recommend it for filesystems up to perhaps 32-64 GiB
as well, and it does lessen the hassle with data/metadata getting out of
balance since they're then combined, but there is a performance cost to
enabling it. Basically, I'd say don't bother with activating it via -O,
use -M instead if you want it, but do consider well if you really want
it above say 64 or 128 GiB, because there IS a performance cost, and as
filesystem sizes get bigger, the benefit of -M/mixed-bg on smaller
filesystems doesn't matter as much.
Tho mixed-bg DOES make possible dup data (and indeed, requires it if you
want dup metadata, since they're mixed together in this mode) on a
single-device btrfs, something that's not otherwise possible.
Extref: As mentioned, extref is now the default. The reason being it was
introduced a number of kernels ago and is reasonably important as some
people were running into hardlinking issues with the previous layout, so
activating it by default is the right choice.
Raid56: Unless you plan on doing raid56 in the near term (and that's not
recommended ATM as btrfs raid56 mode isn't yet complete in terms of
device loss recovery, etc, anyway), that one probably doesn't matter.
Recommend not using raid56 at this time and thus keeping the option off.
Skinny-metadata: This one's /relatively/ new, being introduced in kernel
3.10 according to the wiki. In the 3.10 and possibly 3.11 cycles I did
see a number of bugfixes going by for it, and wasn't using or
recommending it at that time. But I used it on one less critical btrfs
in the 3.12 timeframe and had no issues, and with my last mkfs.btrfs
round shortly after v3.14's release, I enabled it on everything I redid.
The benefit of skinny-metadata is simply less metadata to deal with.
It's not critical as a new kernel can write the "fat" metadata just fine,
and is not yet the default, but if you're recreating filesystems anyway
and don't plan on accessing them with anything older than 3.11, I suggest
enabling it.
No-holes: This one is still new, enabled in kernel (and btrfs-progs)
v3.14, and thus could have a few bugs to work out still. In theory, like
skinny-metadata it simply makes for more efficient metadata. However,
unlike skinny metadata I've yet to see any bugs at all related to it, and
in fact, tracking explicit-hole mapping has I believe caused a few bugs
of its own, so despite its newness, I enabled it for all new btrfs
in my last round of mkfs.btrfs filesystem redos shortly after v3.14
release.
So a cautious no-holes recommend once you are fairly sure you won't be
mounting with anything pre-3.14 series, tho be aware that since 3.14
itself is so new and because this isn't yet the default, it won't yet
have the testing that the other minor-features have, and thus could in
theory still have a few bugs. But as I said, I believe there were
actually bugs in the hole-extent processing before, so I think the risk
profile on this one is actually pretty favorable, and I'd consider the
accessing kernel age factor the major caveat, at this point.
So here I'm doing -O extref,skinny-metadata,no-holes .
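Spelled out as a full command line (a sketch only; the label and device
names are made up, and the raid1 data/metadata profiles just match the
scenario in this thread):

mkfs.btrfs -L myraid1 -d raid1 -m raid1 \
  -O extref,skinny-metadata,no-holes \
  /dev/sdb /dev/sdc /dev/sdd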
(Minor usage note: In btrfs-progs v3.14 itself, --features, the long
form of the -O option, was buggy and didn't work. That was actually a
bug I reported here after finding it when I was doing those redoes as I
use a script that was coded to use the long form, only to have it bug
out. -O worked tho, and after rewriting that bit of the script to use
that, it worked fine. I haven't actually updated btrfs-progs in 10 days
or so, but I've seen mention of a v3.14.1, which presumably fixes this
bug.)
Meanwhile, as I've observed before, I tend to be more comfortable on
newsgroups and mailing lists than editing the wiki, and I still haven't
gotten a wiki account setup. If someone with such an account wants to
put all that on the wiki somewhere I'm sure many will find it useful. =;^)
So back to the immediate situation at hand. Since you don't have all the
data at hand (it's partially remote) to do a mkfs and restore at this
time, you may or may not wish to do a full mkfs.btrfs and restore, and
indeed, the features and performance you'd gain in doing so are
relatively minor. But in general, you probably want to consider doing
such a mkfs.btrfs and restore at some point, even if it's only once,
perhaps a year or so from now as btrfs continues toward full
stabilization and the frequency of these individually relatively minor on-
device-format changes drops toward zero, the ultimate idea being to
rebuild your filesystem with a stable btrfs, doing away with all the
cruft that might have built up after years of running a not-entirely
stable development filesystem, as well as taking advantage of all the
individually incremental feature tweaks that were made available one at a
time as the filesystem stabilized.
Personally I've been routinely testing pre-stable releases of various
things for a couple decades now, including what I now consider MS
proprietary servantware (in the context of my sig) before the turn of the
century (I was active back on the IE/OE beta newsgroups back in the day
and at one point was considering becoming an MSMVP, before I discovered
freedomware), and a policy of cleaning out the beta cruft and making a
clean start once there's a proper stable release out, has never done me
wrong. I don't always do so, and in fact am using the same basic user-
level KDE config I used back with KDE 2 shortly after the turn of the
century, tho I've of course gone thru and manually cleaned out old config
files from time to time, but particularly for something as critical to
the safety of my data as a filesystem, I'd consider, and could certainly
recommend, nothing else.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: safe/necessary to balance system chunks?
From: Chris Murphy @ 2014-04-26 1:24 UTC (permalink / raw)
To: Steve Leung; +Cc: Btrfs BTRFS
On Apr 25, 2014, at 12:43 PM, Steve Leung <sjleung@shaw.ca> wrote:
> Once everything gets rebalanced though, I don't think I'd be missing out on any features, would I?
The default nodesize/leafsize is 16KB since btrfs-progs v3.12. This isn't changed with a balance. The difference between the previous default 4KB, and 16KB is performance and small file efficiency.
Also, I think extref support is newly enabled by default with v3.12 btrfs-progs, which permits significantly more hardlinks. But this can be turned on for an existing volume using btrfstune.
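If memory serves, enabling extref with btrfstune is a one-liner on an
unmounted filesystem, along these lines (the device name is a
placeholder; treat the exact flag as an assumption and check btrfstune's
help output first):

# enable extended inode refs (extref) on an existing filesystem
btrfstune -r /dev/sdb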
Any other efficiencies in how things are written to disk aren't retroactively applied by a balance; a balance just causes chunks to be rewritten.
Chris Murphy
* Re: safe/necessary to balance system chunks?
From: Chris Murphy @ 2014-04-26 1:41 UTC (permalink / raw)
To: Btrfs BTRFS
On Apr 25, 2014, at 5:03 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> But since -m/metadata includes -s/
> system by default, and that was the intended way of doing things,
> -f/force was added as necessary when doing only -s/system, since
> presumably that was considered an artificial distinction, and handling -s/
> system as a part of -m/metadata was considered the more natural method.
OK, so somehow in Steve's conversion, metadata was converted from DUP to RAID1 completely, but some portion of system was left as DUP, incompletely converted to RAID1. It doesn't seem obvious that -mconvert is what he'd use now, but maybe with newer btrfs-progs it would also convert any unconverted system chunks.
If not, then -sconvert=raid1 -f and optionally -v.
This isn't exactly risk free, given that it requires -f; and I'm not sure we can risk assess conversion failure vs the specific drive containing system DUP chunks dying. But for me a forced susage balance was fast:
[root@rawhide ~]# time btrfs balance start -susage=100 -f -v /
Dumping filters: flags 0xa, state 0x0, force is on
SYSTEM (flags 0x2): balancing, usage=100
Done, had to relocate 1 out of 8 chunks
real 0m0.095s
user 0m0.001s
sys 0m0.017s
Chris Murphy
* Re: safe/necessary to balance system chunks?
From: Steve Leung @ 2014-04-26 2:56 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS
On Fri, 25 Apr 2014, Chris Murphy wrote:
> On Apr 25, 2014, at 12:43 PM, Steve Leung <sjleung@shaw.ca> wrote:
>> Once everything gets rebalanced though, I don't think I'd be missing out on any features, would I?
> The default nodesize/leafsize is 16KB since btrfs-progs v3.12. This
> isn't changed with a balance. The difference between the previous
> default 4KB, and 16KB is performance and small file efficiency.
Ah, now it's coming back to me. The last major gyration I had on this
filesystem (and the ultimate trigger for my original issue) was juggling
everything around so that I could reformat for the 16kB node size.
Incidentally, is there a way for someone to tell what the node size
currently is for a btrfs filesystem? I never noticed that info printed
anywhere from any of the btrfs utilities.
In case anyone's wondering, I did balance the system chunks on my
filesystem and "btrfs fi df" now looks normal. So thanks to all for the
hints and advice.
Steve
* Re: safe/necessary to balance system chunks?
From: Duncan @ 2014-04-26 4:01 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Fri, 25 Apr 2014 15:07:40 -0400 as
excerpted:
> I actually have a similar situation with how I have my desktop system
> set up, when I go about recreating the filesystem (which I do every
> time I upgrade either the tools or the kernel),
Wow. Given that I run a git kernel and btrfs-tools, I'd be spending a
*LOT* of time on redoing my filesystems if I did that! Tho see my just-
previous reply for what I do (a fresh mkfs.btrfs every few kernel cycles,
to take advantage of new on-device-format feature options and to clean
out any possibly remaining cruft from bugs now fixed, given that btrfs
isn't fully stable yet).
Anyway, why I'm replying here:
[in the context of btrfs raid1 mode]
> I use the following approach:
>
> 1. Delete one of the devices from the filesystem
> 2. Create a new btrfs file system on the device just removed from the
> filesystem
> 3. Copy the data from the old filesystem to the new one
> 4. one at a time, delete the remaining devices from the old filesystem
> and add them to the new one, re-balancing the new filesystem after
> adding each device.
>
> This seems to work relatively well for me, and prevents the possibility
> that there is ever just one copy of the data. It does, however, require
> that the amount of data that you are storing on the filesystem is less
> than the size of one of the devices (although you can kind of work
> around this limitation by setting compress-force=zlib on the new file
> system when you mount it, then using defrag to decompress everything
> after the conversion is done), and that you have to drop to single user
> mode for the conversion (unless it's something that isn't needed all the
> time, like the home directories or /usr/src, in which case you just log
> everyone out and log in as root on the console to do it).
I believe you're laboring under an unfortunate but understandable
misconception of the nature of btrfs raid1. Since in the event of device-
loss it's a critical misconception, I decided to deal with it in a reply
separate from the other one (which I then made as a sibling post to yours
in reply to the same parent, instead of as a reply to you).
Unlike for instance mdraid raid1 mode, which is N mirror-copies of the
data across N devices (so 3 devices = 3 copies, 5 devices = 5 copies,
etc)...
**BTRFS RAID1 MODE IS CURRENTLY PAIR-MIRROR ONLY!**
No matter the number of devices in the btrfs so-called "raid1", btrfs
only pair-mirrors each chunk, so it's only two copies of the data per
filesystem. To have more than two-copy redundancy, you must use multiple
filesystems and make one a copy of the other using either conventional
backup methods or the btrfs-specific send/receive.
This is actually my biggest annoyance/feature-request with current btrfs,
as my own sweet-spot ideal is triplet-mirroring, and N-way-mirroring is
indeed on the roadmap and has been for years, but the devs plan to use
some of the code from btrfs raid5/6 to implement it, and of course while
incomplete raid5/6 mode was introduced in 3.9, as of 3.14 at least,
that's exactly what raid5/6 mode is, incomplete, and while I saw patches
to properly support raid5/6 scrub recently, I believe it's still
incomplete in 3.15 as well. And of course N-way-mirroring remains
roadmapped for after that... So not being a dev, I continue to wait, as
patiently as I can manage since I'd rather a good implementation later
than a buggy one now, for that still coming N-way-mirroring. Tho at this
point I admit to having some sympathy for the donkey forever following
that apple held on the end of a stick just out of reach... even if I
/would/ rather wait another five years for it and have it done /right/,
than be dealing with a bad implementation available right now.
Anyway, given that we /are/ dealing with pair-mirror-only raid1 mode
currently... as well as your pre-condition that for your method to work,
the data to store on the filesystem must fit on a single device...
If you have a 3-device-plus btrfs raid1 and you're using btrfs device
delete to remove the device you're going to create the new filesystem on,
you do still have two-way-redundancy at all times, since the btrfs
device delete will ensure the two copies are on the remaining devices,
but that's unnecessary work compared to simply leaving it a device down
in the first place, and starting with the last device of the previous
(grandparent generation) filesystem as the first of a new (child
generation) filesystem, leaving it unused between.
If OTOH you're hard-removing a device from the raid1, without a btrfs
device delete first, then at the moment you do so, you only have a single
copy of any chunk where one of the pair was on that device, and it
remains that way until you do the mkfs and finish populating the new
filesystem with the contents of the old one.
So you're either doing extra work (if you're using btrfs device delete),
or leaving yourself with a single copy of anything on the removed device,
until it is back up and running as the new filesystem! =:^(
I'd suggest not bothering with more than two (or possibly three) devices
per filesystem, since by btrfs raid1, you only get pair-mirroring, so
more devices is a waste for that, and by your own pre-condition, you
limit the amount of data to the capacity of one device, so you can't take
advantage of the extra storage capacity of more devices with >2 devices
on a two-way-mirroring-limited raid1 either, making it a waste for that
as well. Save the extra devices for when you do the transfer.
If you have only three devices, set up the btrfs raid1 with two, and leave
the third as a spare. Then for the transfer, create and populate the new
filesystem on the third, remove a device from the btrfs raid1 pair, add
it to the new btrfs and convert to raid1. At that point you can drop the
old filesystem and leave its remaining device as your first device when
you repeat the process later, making the last device of the grandparent
into the first device of the child.
This way you'll have two copies of the data at all times and/or will save
the work of the third device add and rebalance, and later the device
delete, bringing it to two devices again.
And as a bonus, except for the time you're actually doing the mkfs and
repopulating the new filesystem, you'll have a third copy, albeit a bit
outdated, as a backup, that being the spare that you're not including in
the current filesystem, since it still has a complete copy of the old
filesystem from before it was removed, and that old copy can still be
mounted using the degraded option (since it's the single device remaining
of what was previously a multi-device raid1).
Alternatively, do the three-device raid1 thing and btrfs device delete
when you're taking a device out and btrfs balance after adding the third
device. This will be more hassle, but dropping a device from a two-
device raid1 forces it read-only as writes can no longer be made in raid1
mode, while a three-device raid1 doesn't give you more redundancy since
btrfs raid1 remains pair-mirror-only, but DOES give you the ability to
continue writing in raid1 mode with a missing device, since you still
have two devices and can do raid1 pair-mirror writing.
So in view of the pair-mirror restriction, three devices won't give you
additional redundancy, but it WILL give you a continued writable raid1 if
a device drops out. Whether that's worth the hassle of the additional
steps needed to btrfs device delete to create the new filesystem and
btrfs balance on adding the third device, is up to you, but it does give
you that choice. =:^)
Similarly if you have four devices, only in that case you can actually do
two independent two-device btrfs raid1 filesystems, one working and one
backup, taking the backup down to recreate as the new primary/working
filesystem when necessary, thus avoiding the whole device-add and
rebalance thing entirely. And your backup is then a full pair-redundant
backup as well, tho of course you lose the backup for the period you're
doing the mkfs and repopulating the new version.
This is actually pretty much what I'm doing here, except that my physical
devices are more than twice the size of my data and I only have two
physical devices. But I use partitioning and create the dual-device
btrfs raid1 pair-mirror across two partitions, one on each physical
device, with the backup set being two different partitions, one each on
the same pair of physical devices.
If you have five devices, I'd recommend doing about the same thing, only
with the fifth device as a normally physically disconnected (and possibly
stored separately, perhaps even off-site) backup of the two separate
btrfs pair-mirror raid1s. Actually, you can remove a device from one of
the raid1s (presumably the backup/secondary) to create the new btrfs
raid1, still leaving the one (presumably the working/primary) as a
complete two-device raid1 pair, leaving the other device as a backup that
can still be mounted using degraded, should that be necessary.
Or simply use the fifth device for something else. =:^)
With six devices you have a multi-way choice:
1) Btrfs raid1 pairs as with four devices but with two levels of backup.
This would be the same as the 5-device scenario, but completing the pair
for the secondary backup.
2) Btrfs raid1 pairs with an addition device in primary and backup.
2a) This gives you a bit more flexibility in terms of size, since you now
get 1.5 times the capacity of a single device, for both primary/working
and secondary/backup.
2b) You also get the device-dropped write-flexibility described under the
three-device case, but now for both primary and backup. =:^)
3) Six-device raid10. In "simple" configuration, this would give you 3-
way-striping and 3X capacity of a single device, still pair mirroring,
but you'd lose the independent backups. However, if you used
partitioning to split each physical device in half and made each set of
six partitions an independent btrfs raid10, you'd still have half the 3X
capacity, so 1.5X the capacity of a single device, still have the three-
way-striping and 2-way-mirroring for 3X the speed with pair-mirroring
redundancy, *AND* have independent primary and backup sets, each its own
6-way set of partitions across the 6 devices, giving you simple tear-down
and recreate of the backup raid10 as the new working raid10.
That would be a very nice setup; something I'd like for myself. =:^)
Actually, once N-way-mirroring hits I'm going to want to setup pretty
close to just this, except using triplet mirroring and two-way-striping
instead of the reverse. Keeping the two-way-partitioning as well, that'd
give me 2X speed and 3X redundancy, at 1X capacity, with a primary and
backup raid10 on different 6-way partition sets of the same six physical
devices.
Ideally, the selectable-way mirroring/striping code will be flexible
enough by that time to let me temporarily reduce striping (and speed/
capacity) to 1-way while keeping 3-way-mirroring, should I lose a device
or two, thus avoiding the force-to-read-only that dropping below two-
devices in a raid1 or four devices in a raid10 currently does. Upon
replacing the bad devices, I could rebalance the 1-way-striped bits and
get full 2-way-striping once again, while the triplet mirroring would
have never been compromised.
That's my ideal. =:^)
But to do that I still need triplet-mirroring, and triplet-mirroring
isn't available yet. =:^(
But it'll sure be nice when I CAN do it! =:^)
4) Do something else with the last pair of devices. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: safe/necessary to balance system chunks?
From: Chris Murphy @ 2014-04-26 4:05 UTC (permalink / raw)
To: Btrfs BTRFS; +Cc: Steve Leung
On Apr 25, 2014, at 8:56 PM, Steve Leung <sjleung@shaw.ca> wrote:
>
> Incidentally, is there a way for someone to tell what the node size currently is for a btrfs filesystem? I never noticed that info printed anywhere from any of the btrfs utilities.
btrfs-show-super
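For example (the device name is a placeholder; if I recall the output
correctly, the relevant fields are nodesize, leafsize and sectorsize):

btrfs-show-super /dev/sdb | grep -E 'nodesize|leafsize|sectorsize'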
> In case anyone's wondering, I did balance the system chunks on my filesystem and "btrfs fi df" now looks normal. So thanks to all for the hints and advice.
Good news.
I kinda wonder if some of the degraded multiple device mount failures we've seen are the result of partially missing system chunks.
Chris Murphy
* Re: safe/necessary to balance system chunks?
From: Duncan @ 2014-04-26 4:23 UTC (permalink / raw)
To: linux-btrfs
Chris Murphy posted on Fri, 25 Apr 2014 19:41:43 -0600 as excerpted:
> OK so somehow in Steve's conversion, metadata was converted from DUP to
> RAID1 completely, but some portion of system was left as DUP,
> incompletely converted to RAID1. It doesn't seem obvious that -mconvert
> is what he'd use now, but maybe newer btrfs-progs it will also convert
> any unconverted system chunk.
>
> If not, then -sconvert=raid1 -f and optionally -v.
>
> This isn't exactly risk free, given that it requires -f; and I'm not
> sure we can risk assess conversion failure vs the specific drive
> containing system DUP chunks dying. But for me a forced susage balance
> was fast:
>
> [root@rawhide ~]# time btrfs balance start -susage=100 -f -v /
> Dumping filters: flags 0xa, state 0x0, force is on
> SYSTEM (flags 0x2): balancing, usage=100
> Done, had to relocate 1 out of 8 chunks
>
> real 0m0.095s
> user 0m0.001s
> sys 0m0.017s
Yes. The one thing that can be said about system chunks is that they're
small enough that processing just them should be quite fast, even on
spinning rust.
So regardless of whether there's a safety issue justifying the required
-f/force for -s/system-only or not, the risk window should be quite
small, time-wise: unlike the minutes to possibly many hours a full
balance can take on a large spinning-rust btrfs, whatever danger there
might be in a system-only rebalance is over almost immediately. =:^)
And correspondingly, safety issue or not, I've never seen a report here
of bugs or filesystem loss due to use of -s -f. That doesn't mean it
can't happen; that's under debate and I can't safely say. It does mean
you're pretty unlucky if you're the first to have a need to report such
a thing here. =:^\
But we all know that btrfs is still under heavy development, and thus
have those tested backups ready just in case, right? In which case, I
think whatever risk there might be relative to that of simply using btrfs
at all at this point in time, must be pretty negligible. =:^) Tho a few
people each year still do get struck by lightening... or win the lottery.
Just living is a risk. <shrug>
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: safe/necessary to balance system chunks?
From: Duncan @ 2014-04-26 4:55 UTC (permalink / raw)
To: linux-btrfs
Steve Leung posted on Fri, 25 Apr 2014 20:56:06 -0600 as excerpted:
> Incidentally, is there a way for someone to tell what the node size
> currently is for a btrfs filesystem? I never noticed that info printed
> anywhere from any of the btrfs utilities.
btrfs-show-super <device> displays that, among other relatively obscure
information. Look for node-size and leaf-size. (Today they are labeled
synonyms in the mkfs.btrfs manpage and should be set the same. But if
I'm remembering correctly, originally they could be set separately in
mkfs.btrfs, and apparently had slightly different technical meaning. Tho
I don't believe actually setting them to different sizes was ever
supported.)
Sectorsize is also printed. The only value actually supported for it,
however, has always been the architecture's kernel page size, 4096 bytes
for x86 in both 32- and 64-bit variants, and I'm told in arm as well.
But there are other archs (including sparc, mips and s390) where it's
different, and as the mkfs.btrfs manpage says, don't set it unless you
plan on actually using the filesystem on a different arch. There is,
however, work to allow btrfs to use different sector-sizes, 2048 bytes to
I believe 64 KiB, thus allowing a btrfs created on an arch with a
different page size to at least work on other archs, even if it's never
going to be horribly efficient.
The former default for all three settings was page size, 4096 bytes on
x86, but node/leafsize were apparently merged at the same time their
default was changed to 16 KiB, since that's more efficient for nearly all
users.
What I've wondered, however, is this: if a 16K nodesize is more
efficient than 4K for nearly everyone, under what conditions might the
even larger 32 KiB or even 64 KiB (the max) be more efficient still?
That I don't know, and anyway, I strongly suspect that being less tested,
it might trigger more bugs anyway, and while I'm testing a still not
entirely stable btrfs, I've not been /that/ interested in trying the more
unusual stuff or in triggering more bugs than I might normally come
across.
But someday curiosity might get the better of me and I might try it...
> In case anyone's wondering, I did balance the system chunks on my
> filesystem and "btrfs fi df" now looks normal. So thanks to all for the
> hints and advice.
Heh, good to read. =:^)
Anyway, you provoked quite a discussion, and I think most of us learned
something from it or at least thought about angles we'd not thought of
before, so I'm glad you posted the questions. Challenged me, anyway! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: safe/necessary to balance system chunks?
From: Alex Lyakas @ 2014-06-19 11:32 UTC (permalink / raw)
To: Hugo Mills, Austin S Hemmelgarn, Chris Murphy, Steve Leung,
linux-btrfs
On Fri, Apr 25, 2014 at 10:14 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
>> On 2014-04-25 13:24, Chris Murphy wrote:
>> >
>> > On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>> >
>> >>
>> >> Hi list,
>> >>
>> >> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>> >>
>> >> btrfs fi df:
>> >>
>> >> Data, RAID1: total=1.31TiB, used=1.07TiB
>> >> System, RAID1: total=32.00MiB, used=224.00KiB
>> >> System, DUP: total=32.00MiB, used=32.00KiB
>> >> System, single: total=4.00MiB, used=0.00
>> >> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>> >>
>> >> This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be?
>> >
>> > Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
>> >
>> > Anyway, it's probably a high penalty for losing only 32KB of data. I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
>> >
>> As far as I understand it, the system chunks are THE root chunk tree for
>> the entire system, that is to say, it's the tree of tree roots that is
>> pointed to by the superblock. (I would love to know if this
>> understanding is wrong). Thus losing that data almost always means
>> losing the whole filesystem.
>
> From a conversation I had with cmason a while ago, the System
> chunks contain the chunk tree. They're special because *everything* in
> the filesystem -- including the locations of all the trees, including
> the chunk tree and the roots tree -- is positioned in terms of the
> internal virtual address space. Therefore, when starting up the FS,
> you can read the superblock (which is at a known position on each
> device), which tells you the virtual address of the other trees... and
> you still need to find out where that really is.
>
> The superblock has (I think) a list of physical block addresses at
> the end of it (sys_chunk_array), which allows you to find the blocks
> for the chunk tree and work out this mapping, which allows you to find
> everything else. I'm not 100% certain of the actual format of that
> array -- it's declared as u8 [2048], so I'm guessing there's a load of
> casting to something useful going on in the code somewhere.
The format is just a list of pairs:
struct btrfs_disk_key, struct btrfs_chunk
struct btrfs_disk_key, struct btrfs_chunk
...
For each SYSTEM block group (btrfs_chunk), we need one entry in the
sys_chunk_array. During mkfs the first SYSTEM block group is created;
for me it's 4MB. So only if the whole chunk tree grows over 4MB do we
need to create an additional SYSTEM block group, and then we need to
have a second entry in the sys_chunk_array. And so on.
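As a quick illustration (placeholder device, and the exact field name in
the dump is from memory, so treat it as an assumption), the superblock
dump should show how much of that array is currently in use:

btrfs-show-super /dev/sdb | grep -i 'sys.*array'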
Alex.
>
> Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
> PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
> --- Is it still called an affair if I'm sleeping with my wife ---
> behind her lover's back?