linux-btrfs.vger.kernel.org archive mirror
* safe/necessary to balance system chunks?
@ 2014-04-25 14:57 Steve Leung
  2014-04-25 17:24 ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Steve Leung @ 2014-04-25 14:57 UTC (permalink / raw)
  To: linux-btrfs


Hi list,

I've got a 3-device RAID1 btrfs filesystem that started out life as 
single-device.

btrfs fi df:

  Data, RAID1: total=1.31TiB, used=1.07TiB
  System, RAID1: total=32.00MiB, used=224.00KiB
  System, DUP: total=32.00MiB, used=32.00KiB
  System, single: total=4.00MiB, used=0.00
  Metadata, RAID1: total=66.00GiB, used=2.97GiB

This still lists some system chunks as DUP, and not as RAID1.  Does this 
mean that if one device were to fail, some system chunks would be 
unrecoverable?  How bad would that be?

Assuming this is something that needs to be fixed, would I be able to fix 
this by balancing the system chunks?  Since the "force" flag is required, 
does that mean that balancing system chunks is inherently risky or 
unpleasant?

Thanks,

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 14:57 safe/necessary to balance system chunks? Steve Leung
@ 2014-04-25 17:24 ` Chris Murphy
  2014-04-25 18:12   ` Austin S Hemmelgarn
  2014-04-25 18:36   ` Steve Leung
  0 siblings, 2 replies; 17+ messages in thread
From: Chris Murphy @ 2014-04-25 17:24 UTC (permalink / raw)
  To: Steve Leung; +Cc: linux-btrfs


On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:

> 
> Hi list,
> 
> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
> 
> btrfs fi df:
> 
> Data, RAID1: total=1.31TiB, used=1.07TiB
> System, RAID1: total=32.00MiB, used=224.00KiB
> System, DUP: total=32.00MiB, used=32.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, RAID1: total=66.00GiB, used=2.97GiB
> 
> This still lists some system chunks as DUP, and not as RAID1.  Does this mean that if one device were to fail, some system chunks would be unrecoverable?  How bad would that be?

Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?

Anyway, it's probably a high penalty for losing only 32KB of data.  I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.

> 
> Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks?  Since the "force" flag is required, does that mean that balancing system chunks is inherently risky or unpleasant?

I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although it's probably a minor distinction for such a small amount of data.

The metadata looks like it could use a balance, 66GB of metadata chunks allocated but only 3GB used. So you could include something like -musage=50 at the same time and that will balance any chunks with 50% or less usage.
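For example, something along these lines (mountpoint hypothetical; check whether your btrfs-progs insists on -f when -s is given explicitly):

  btrfs balance start -sconvert=raid1,soft /mnt
  btrfs balance start -musage=50 /mnt

The first only rewrites system chunks that aren't already raid1; the second rewrites metadata chunks that are at most half used, returning the excess allocation to the free pool.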


Chris Murphy


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 17:24 ` Chris Murphy
@ 2014-04-25 18:12   ` Austin S Hemmelgarn
  2014-04-25 18:43     ` Steve Leung
                       ` (2 more replies)
  2014-04-25 18:36   ` Steve Leung
  1 sibling, 3 replies; 17+ messages in thread
From: Austin S Hemmelgarn @ 2014-04-25 18:12 UTC (permalink / raw)
  To: Chris Murphy, Steve Leung; +Cc: linux-btrfs

On 2014-04-25 13:24, Chris Murphy wrote:
> 
> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
> 
>>
>> Hi list,
>>
>> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>>
>> btrfs fi df:
>>
>> Data, RAID1: total=1.31TiB, used=1.07TiB
>> System, RAID1: total=32.00MiB, used=224.00KiB
>> System, DUP: total=32.00MiB, used=32.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>
>> This still lists some system chunks as DUP, and not as RAID1.  Does this mean that if one device were to fail, some system chunks would be unrecoverable?  How bad would that be?
> 
> Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
> 
> Anyway, it's probably a high penalty for losing only 32KB of data.  I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
>
As far as I understand it, the system chunks are THE root chunk tree for
the entire system, that is to say, it's the tree of tree roots that is
pointed to by the superblock. (I would love to know if this
understanding is wrong).  Thus losing that data almost always means
losing the whole filesystem.
>>
>> Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks?  Since the "force" flag is required, does that mean that balancing system chunks is inherently risky or unpleasant?
> 
> I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although it's probably a minor distinction for such a small amount of data.
The kernel won't allow a balance involving system chunks unless you
specify force, as it considers any kind of balance using them to be
dangerous.  Given your circumstances, I'd personally say that the safety
provided by RAID1 outweighs the risk of making the FS un-mountable.
> 
> The metadata looks like it could use a balance, 66GB of metadata chunks allocated but only 3GB used. So you could include something like -musage=50 at the same time and that will balance any chunks with 50% or less usage.
> 
> 
> Chris Murphy
> 

Personally, I would recommend making a full backup of all the data (tar
works wonderfully for this) and recreating the entire filesystem from
scratch, passing all three devices to mkfs.btrfs.  This should result
in all the chunks being RAID1, and will also let you benefit from newer
features.
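A rough sketch of that, with hypothetical devices and paths (note the
explicit -d raid1; a multi-device mkfs.btrfs may otherwise default data
to raid0):

  tar -cf /some/other/disk/fs-backup.tar -C /mnt/btrfs .
  umount /mnt/btrfs
  mkfs.btrfs -m raid1 -d raid1 /dev/sda /dev/sdb /dev/sdc
  mount /dev/sda /mnt/btrfs
  tar -xpf /some/other/disk/fs-backup.tar -C /mnt/btrfs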

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 17:24 ` Chris Murphy
  2014-04-25 18:12   ` Austin S Hemmelgarn
@ 2014-04-25 18:36   ` Steve Leung
  1 sibling, 0 replies; 17+ messages in thread
From: Steve Leung @ 2014-04-25 18:36 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On 04/25/2014 11:24 AM, Chris Murphy wrote:
>
> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>
>> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>>
>> btrfs fi df:
>>
>> Data, RAID1: total=1.31TiB, used=1.07TiB
>> System, RAID1: total=32.00MiB, used=224.00KiB
>> System, DUP: total=32.00MiB, used=32.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>
>> This still lists some system chunks as DUP, and not as RAID1.  Does this mean that if one device were to fail, some system chunks would be unrecoverable?  How bad would that be?
>
> Anyway, it's probably a high penalty for losing only 32KB of data.  I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.

As for how it occurred, I'm not sure.  I created this filesystem some 
time ago (not sure exactly, but I'm guessing with a 3.4-era kernel?) so 
it's quite possible it's not reproducible on newer kernels.

It's also nice to know I've been one failed device away from a dead 
filesystem for a long time now, but better to notice it late than never.  :)

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 18:12   ` Austin S Hemmelgarn
@ 2014-04-25 18:43     ` Steve Leung
  2014-04-25 19:07       ` Austin S Hemmelgarn
                         ` (2 more replies)
  2014-04-25 19:14     ` Hugo Mills
  2014-04-25 23:03     ` Duncan
  2 siblings, 3 replies; 17+ messages in thread
From: Steve Leung @ 2014-04-25 18:43 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Chris Murphy; +Cc: linux-btrfs

On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
> On 2014-04-25 13:24, Chris Murphy wrote:
>>
>> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>>
>>> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>>>
>>> btrfs fi df:
>>>
>>> Data, RAID1: total=1.31TiB, used=1.07TiB
>>> System, RAID1: total=32.00MiB, used=224.00KiB
>>> System, DUP: total=32.00MiB, used=32.00KiB
>>> System, single: total=4.00MiB, used=0.00
>>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>>
>>> This still lists some system chunks as DUP, and not as RAID1.  Does this mean that if one device were to fail, some system chunks would be unrecoverable?  How bad would that be?
>>>
>>> Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks?  Since the "force" flag is required, does that mean that balancing system chunks is inherently risky or unpleasant?
>>
>> I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although it's probably a minor distinction for such a small amount of data.
> The kernel won't allow a balance involving system chunks unless you
> specify force, as it considers any kind of balance using them to be
> dangerous.  Given your circumstances, I'd personally say that the safety
> provided by RAID1 outweighs the risk of making the FS un-mountable.

Agreed, I'll attempt the system balance shortly.

> Personally, I would recommend making a full backup of all the data (tar
> works wonderfully for this), and recreate the entire filesystem from
> scratch, but passing all three devices to mkfs.btrfs.  This should
> result in all the chunks being RAID1, and will also allow you to benefit
> from newer features.

I do have backups of the really important stuff from this filesystem, 
but they're offsite.  As this is just for a home system, I don't have 
enough temporary space for a full backup handy (which is related to how 
I ended up in this situation in the first place).

Once everything gets rebalanced though, I don't think I'd be missing out 
on any features, would I?

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 18:43     ` Steve Leung
@ 2014-04-25 19:07       ` Austin S Hemmelgarn
  2014-04-26  4:01         ` Duncan
  2014-04-26  1:11       ` Duncan
  2014-04-26  1:24       ` Chris Murphy
  2 siblings, 1 reply; 17+ messages in thread
From: Austin S Hemmelgarn @ 2014-04-25 19:07 UTC (permalink / raw)
  To: Steve Leung, Chris Murphy; +Cc: linux-btrfs

On 2014-04-25 14:43, Steve Leung wrote:
> On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
>> On 2014-04-25 13:24, Chris Murphy wrote:
>>>
>>> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>>>
>>>> I've got a 3-device RAID1 btrfs filesystem that started out life as
>>>> single-device.
>>>>
>>>> btrfs fi df:
>>>>
>>>> Data, RAID1: total=1.31TiB, used=1.07TiB
>>>> System, RAID1: total=32.00MiB, used=224.00KiB
>>>> System, DUP: total=32.00MiB, used=32.00KiB
>>>> System, single: total=4.00MiB, used=0.00
>>>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>>>
>>>> This still lists some system chunks as DUP, and not as RAID1.  Does
>>>> this mean that if one device were to fail, some system chunks would
>>>> be unrecoverable?  How bad would that be?
>>>>
>>>> Assuming this is something that needs to be fixed, would I be able
>>>> to fix this by balancing the system chunks?  Since the "force" flag
>>>> is required, does that mean that balancing system chunks is
>>>> inherently risky or unpleasant?
>>>
>>> I don't think force is needed. You'd use btrfs balance start
>>> -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although
>>> it's probably a minor distinction for such a small amount of data.
>> The kernel won't allow a balance involving system chunks unless you
>> specify force, as it considers any kind of balance using them to be
>> dangerous.  Given your circumstances, I'd personally say that the safety
>> provided by RAID1 outweighs the risk of making the FS un-mountable.
> 
> Agreed, I'll attempt the system balance shortly.
> 
>> Personally, I would recommend making a full backup of all the data (tar
>> works wonderfully for this), and recreate the entire filesystem from
>> scratch, but passing all three devices to mkfs.btrfs.  This should
>> result in all the chunks being RAID1, and will also allow you to benefit
>> from newer features.
> 
> I do have backups of the really important stuff from this filesystem,
> but they're offsite.  As this is just for a home system, I don't have
> enough temporary space for a full backup handy (which is related to how
> I ended up in this situation in the first place).
> 
> Once everything gets rebalanced though, I don't think I'd be missing out
> on any features, would I?
> 
> Steve
In general, it shouldn't be an issue, but recreating it might get you
slightly better performance.  I actually have a similar situation with
how my desktop system is set up.  When I go about recreating the
filesystem (which I do every time I upgrade either the tools or the
kernel), I use the following approach:

1. Delete one of the devices from the filesystem
2. Create a new btrfs file system on the device just removed from the
filesystem
3. Copy the data from the old filesystem to the new one
4. One at a time, delete the remaining devices from the old filesystem
and add them to the new one, re-balancing the new filesystem after
adding each device.

This seems to work relatively well for me, and prevents the possibility
that there is ever just one copy of the data.  It does, however, require
that the amount of data that you are storing on the filesystem is less
than the size of one of the devices (although you can kind of work
around this limitation by setting compress-force=zlib on the new file
system when you mount it, then using defrag to decompress everything
after the conversion is done), and that you have to drop to single user
mode for the conversion (unless it's something that isn't needed all the
time, like the home directories or /usr/src, in which case you just log
everyone out and log in as root on the console to do it).
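In command form, one round of that looks roughly like this
(hypothetical devices, old filesystem mounted at /mnt/old, new one at
/mnt/new; the device adds are collapsed here into a single step, since
a raid1 filesystem can't be shrunk below two devices anyway):

  btrfs device delete /dev/sdc /mnt/old        # step 1
  mkfs.btrfs /dev/sdc                          # step 2
  mount /dev/sdc /mnt/new
  cp -a /mnt/old/. /mnt/new/                   # step 3; rsync works too
  umount /mnt/old                              # step 4: retire the old fs
  wipefs -a /dev/sda /dev/sdb                  # and move its devices over
  btrfs device add /dev/sda /dev/sdb /mnt/new
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new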

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 18:12   ` Austin S Hemmelgarn
  2014-04-25 18:43     ` Steve Leung
@ 2014-04-25 19:14     ` Hugo Mills
  2014-06-19 11:32       ` Alex Lyakas
  2014-04-25 23:03     ` Duncan
  2 siblings, 1 reply; 17+ messages in thread
From: Hugo Mills @ 2014-04-25 19:14 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Chris Murphy, Steve Leung, linux-btrfs


On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
> On 2014-04-25 13:24, Chris Murphy wrote:
> > 
> > On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
> > 
> >>
> >> Hi list,
> >>
> >> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
> >>
> >> btrfs fi df:
> >>
> >> Data, RAID1: total=1.31TiB, used=1.07TiB
> >> System, RAID1: total=32.00MiB, used=224.00KiB
> >> System, DUP: total=32.00MiB, used=32.00KiB
> >> System, single: total=4.00MiB, used=0.00
> >> Metadata, RAID1: total=66.00GiB, used=2.97GiB
> >>
> >> This still lists some system chunks as DUP, and not as RAID1.  Does this mean that if one device were to fail, some system chunks would be unrecoverable?  How bad would that be?
> > 
> > Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
> > 
> > Anyway, it's probably a high penalty for losing only 32KB of data.  I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
> >
> As far as I understand it, the system chunks are THE root chunk tree for
> the entire system, that is to say, it's the tree of tree roots that is
> pointed to by the superblock. (I would love to know if this
> understanding is wrong).  Thus losing that data almost always means
> losing the whole filesystem.

   From a conversation I had with cmason a while ago, the System
chunks contain the chunk tree. They're special because *everything* in
the filesystem -- including the locations of all the trees, including
the chunk tree and the roots tree -- is positioned in terms of the
internal virtual address space. Therefore, when starting up the FS,
you can read the superblock (which is at a known position on each
device), which tells you the virtual address of the other trees... but
you still need to find out where those addresses really are on disk.

   The superblock has (I think) a list of physical block addresses at
the end of it (sys_chunk_array), which allows you to find the blocks
for the chunk tree and work out this mapping, which allows you to find
everything else. I'm not 100% certain of the actual format of that
array -- it's declared as u8 [2048], so I'm guessing there's a load of
casting to something useful going on in the code somewhere.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- Is it still called an affair if I'm sleeping with my wife ---    
                        behind her lover's back?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 18:12   ` Austin S Hemmelgarn
  2014-04-25 18:43     ` Steve Leung
  2014-04-25 19:14     ` Hugo Mills
@ 2014-04-25 23:03     ` Duncan
  2014-04-26  1:41       ` Chris Murphy
  2 siblings, 1 reply; 17+ messages in thread
From: Duncan @ 2014-04-25 23:03 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Fri, 25 Apr 2014 14:12:17 -0400 as
excerpted:
>
> On 2014-04-25 13:24, Chris Murphy wrote:
>> 
>> On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>> 
>>> Assuming this is something that needs to be fixed, would I be able to
>>> fix this by balancing the system chunks?  Since the "force" flag is
>>> required, does that mean that balancing system chunks is inherently
>>> risky or unpleasant?
>> 
>> I don't think force is needed. You'd use btrfs balance start
>> -sconvert=raid1 <mountpoint>; or with -sconvert=raid1,soft although
>> it's probably a minor distinction for such a small amount of data.
> 
> The kernel won't allow a balance involving system chunks unless you
> specify force, as it considers any kind of balance using them to be
> dangerous.  Given your circumstances, I'd personally say that the safety
> provided by RAID1 outweighs the risk of making the FS un-mountable.

To clear this up, FWIW...

In a balance, metadata includes system by default.

If you go back and look at the committed balance filters patch, the 
wording on the -s/system chunks option is that it requires -f/force 
because one would normally handle system as part of metadata, not for any 
other reason.

What it looks like to me is that the original patch in progress may not 
have had -s/system as a separate filter at all, treating it as
-m/metadata, but perhaps someone suggested having -s/system as a separate 
option too, and the author agreed.  But since -m/metadata includes -s/
system by default, and that was the intended way of doing things,
-f/force was added as necessary when doing only -s/system, since 
presumably that was considered an artificial distinction, and handling -s/
system as a part of -m/metadata was considered the more natural method.

Which begs the question[1], is there a safety or procedural reason one 
should prefer handling metadata and system chunks at the same time, 
perhaps because rewriting the one involves rewriting critical bits of the 
other anyway, or is it simply that the author considered system a subset 
of metadata, anyway?  That I don't know.

But what I do know is that -f/force isn't required with -m/metadata, 
which includes -s/system by default anyway, so unless there's reason to 
treat the two differently, just use -m/metadata and let it handle -s/
system as well. =:^)
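IOW, for the case at hand that should reduce to just (hypothetical
mountpoint, assuming the -m-implies-s behavior described above):

  btrfs balance start -mconvert=raid1,soft /mnt

with no -f needed, and with the soft filter skipping everything that's
already raid1, so only the leftover DUP system chunks actually get
rewritten.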

---
[1] Begs the question: Modern more natural/literal majority usage 
meaning: invites/forces the question, the question becomes so obvious 
that it's "begging" to be asked, at least in the speaker/author's (my) 
own head.  Yes, I am aware of but generally prefer "assumes and thus 
can't prove the postulate" or similar wording as an alternate to the 
translation-accident meaning.  If you have some time and are wondering 
what I'm talking about and/or think I used the term incorrectly, google 
it (using duck-duck-go or the like if you don't like google's profiling). 
=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 18:43     ` Steve Leung
  2014-04-25 19:07       ` Austin S Hemmelgarn
@ 2014-04-26  1:11       ` Duncan
  2014-04-26  1:24       ` Chris Murphy
  2 siblings, 0 replies; 17+ messages in thread
From: Duncan @ 2014-04-26  1:11 UTC (permalink / raw)
  To: linux-btrfs

Steve Leung posted on Fri, 25 Apr 2014 12:43:12 -0600 as excerpted:

> On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
> 
>> Personally, I would recommend making a full backup of all the data (tar
>> works wonderfully for this), and recreate the entire filesystem from
>> scratch, but passing all three devices to mkfs.btrfs.  This should
>> result in all the chunks being RAID1, and will also allow you to
>> benefit from newer features.
> 
> I do have backups of the really important stuff from this filesystem,
> but they're offsite.  As this is just for a home system, I don't have
> enough temporary space for a full backup handy (which is related to how
> I ended up in this situation in the first place).
> 
> Once everything gets rebalanced though, I don't think I'd be missing out
> on any features, would I?

As ASH says, nothing critical.

But there are some relatively minor newer features available, and I 
actually re-mkfs.btrfs most of my (several) btrfs every few kernel cycles 
to take advantage of them, since btrfs is still under development and 
these minor features do accumulate over time.  The on-device format is 
now guaranteed to be readable by newer kernels, but that doesn't mean a 
newer kernel couldn't take advantage of minor features available to it in 
a newer filesystem, if the filesystem was new enough to make them 
available.

Of course the other reason is that doing a mkfs guarantees I've
eliminated any cruft from now-fixed bugs that might otherwise come back
to haunt me at some point (especially with ssds, where mkfs.btrfs by
default does a trim/discard of the entire space, giving a zero-out that
you'd otherwise have to do manually for that level of guarantee).

The other consideration is the range of kernels you plan on mounting/
accessing the filesystem with.  If you're planning on accessing the 
filesystem with an old kernel, mkfs.btrfs does have an option to toggle 
these newer features (with one, extref, allowing more per-directory hard-
links, defaulting on, others generally defaulting off), and keeping them 
off to work with older kernels is possible, but then of course eliminates 
the newer features as a reason for doing the mkfs in the first place.

My local policy is up to four kernel stability levels: current/testing
development kernel; last tested working kernel as first-level fallback;
latest stable series as second-level fallback; and a reasonably recent,
tho occasionally 2-3 stable series old (depending on when I last
updated my backup /boot), stable kernel as the backup-boot fallback.
Even the latter is reasonably new, and given that I tend to wait a
couple of kernel cycles for the bugs to be worked out before activating
a new minor-feature here anyway, I don't generally worry much about old
kernels when activating such features.

So what are these minor features?  Using mkfs.btrfs -O list-all (as
suggested in the mkfs.btrfs manpage), for btrfs-progs v3.14, slightly
reformatted to avoid wrapping when posting:

$ mkfs.btrfs -O list-all
Filesystem features available at mkfs time:
mixed-bg        - mixed data and metadata block groups (0x4)
extref          - increased hardlink limit per file to 65536 (0x40, def)
raid56          - raid56 extended format (0x80)
skinny-metadata - reduced-size metadata extent refs (0x100)
no-holes        - no explicit hole extents for files (0x200)

Mixed-bg: This one's reasonably old and is available with the -M option
as well.  It has been the default for filesystems under 1 GiB for some
time.  Some people recommend it for filesystems up to perhaps 32-64 GiB
as well, and it does lessen the hassle with data/metadata getting out
of balance since they're then combined, but there is a performance cost
to enabling it.  Basically, I'd say don't bother with activating it via
-O; use -M instead if you want it, but do consider well whether you
really want it above, say, 64 or 128 GiB, because there IS a
performance cost, and as filesystem sizes get bigger, the benefit of
-M/mixed-bg on smaller filesystems doesn't matter as much.

Tho mixed-bg DOES make dup data possible (and indeed, requires it if
you want dup metadata, since they're mixed together in this mode) on a
single-device btrfs, something that's not otherwise possible.

Extref: As mentioned, extref is now the default.  The reason being it was 
introduced a number of kernels ago and is reasonably important as some 
people were running into hardlinking issues with the previous layout, so 
activating it by default is the right choice.

Raid56: Unless you plan on doing raid56 in the near term (and that's not 
recommended ATM as btrfs raid56 mode isn't yet complete in terms of 
device loss recovery, etc, anyway), that one probably doesn't matter.  
Recommend not using raid56 at this time and thus keeping the option off.

Skinny-metadata: This one's /relatively/ new, being introduced in kernel 
3.10 according to the wiki.  In the 3.10 and possibly 3.11 cycles I did 
see a number of bugfixes going by for it, and wasn't using or 
recommending it at that time.  But I used it on one less critical btrfs 
in the 3.12 timeframe and had no issues, and with my last mkfs.btrfs 
round shortly after v3.14's release, I enabled it on everything I redid.

The benefit of skinny-metadata is simply less metadata to deal with.  
It's not critical as a new kernel can write the "fat" metadata just fine, 
and is not yet the default, but if you're recreating filesystems anyway 
and don't plan on accessing them with anything older than 3.11, I suggest 
enabling it.

No-holes:  This one is still new, enabled in kernel (and btrfs-progs) 
v3.14, and thus could have a few bugs to work out still.  In theory, like 
skinny-metadata it simply makes for more efficient metadata.  However, 
unlike skinny metadata I've yet to see any bugs at all related to it, and 
in fact, tracking explicit-hole mapping has I believe caused a few bugs 
of its own, so despite its newness, I enabled it for all new btrfs
in my last round of mkfs.btrfs filesystem redos shortly after v3.14 
release.

So a cautious no-holes recommend once you are fairly sure you won't be 
mounting with anything pre-3.14 series, tho be aware that since 3.14 
itself is so new and because this isn't yet the default, it won't yet 
have the testing that the other minor-features have, and thus could in 
theory still have a few bugs.  But as I said, I believe there were 
actually bugs in the hole-extent processing before, so I think the risk 
profile on this one is actually pretty favorable, and I'd consider the 
accessing kernel age factor the major caveat, at this point.

So here I'm doing -O extref,skinny-metadata,no-holes .
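Put together, that's something like this, with hypothetical devices
(the raid-profile options are independent of -O):

  mkfs.btrfs -O extref,skinny-metadata,no-holes -m raid1 -d raid1 \
    /dev/sdX /dev/sdY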

(Minor usage note:  In btrfs-progs v3.14 itself, --features, the long 
form of the -O option, was buggy and didn't work.  That was actually a 
bug I reported here after finding it when I was doing those redoes as I 
use a script that was coded to use the long form, only to have it bug 
out.  -O worked tho, and after rewriting that bit of the script to use 
that, it worked fine.  I haven't actually updated btrfs-progs in 10 days 
or so, but I've seen mention of a v3.14.1, which presumably fixes this 
bug.)

Meanwhile, as I've observed before, I tend to be more comfortable on 
newsgroups and mailing lists than editing the wiki, and I still haven't 
gotten a wiki account setup.  If someone with such an account wants to 
put all that on the wiki somewhere I'm sure many will find it useful. =;^)


So back to the immediate situation.  Since you don't have all the data
at hand (it's partially remote) to do a mkfs and restore at this time,
you may or may not wish to do a full mkfs.btrfs and restore, and
indeed, the features and performance you'd gain in doing so are
relatively minor.  But in general, you probably want to consider doing
such a mkfs.btrfs and restore at some point, even if it's only once,
perhaps a year or so from now as btrfs continues toward full
stabilization and the frequency of these individually relatively minor
on-device-format changes drops toward zero.  The ultimate idea is to
rebuild your filesystem with a stable btrfs, doing away with all the
cruft that might have built up over years of running a not-entirely-
stable development filesystem, as well as taking advantage of all the
individually incremental feature tweaks that were made available one at
a time as the filesystem stabilized.

Personally I've been routinely testing pre-stable releases of various
things for a couple of decades now, including what I now consider MS
proprietary servantware (in the context of my sig) before the turn of
the century (I was active on the IE/OE beta newsgroups back in the day
and at one point was considering becoming an MSMVP, before I discovered
freedomware).  A policy of cleaning out the beta cruft and making a
clean start once there's a proper stable release out has never done me
wrong.  I don't always do so, and in fact am still using the same basic
user-level KDE config I used back with KDE 2 shortly after the turn of
the century, tho I've of course gone thru and manually cleaned out old
config files from time to time.  But particularly for something as
critical to the safety of my data as a filesystem, I'd consider, and
could certainly recommend, nothing else.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 18:43     ` Steve Leung
  2014-04-25 19:07       ` Austin S Hemmelgarn
  2014-04-26  1:11       ` Duncan
@ 2014-04-26  1:24       ` Chris Murphy
  2014-04-26  2:56         ` Steve Leung
  2 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2014-04-26  1:24 UTC (permalink / raw)
  To: Steve Leung; +Cc: Btrfs BTRFS


On Apr 25, 2014, at 12:43 PM, Steve Leung <sjleung@shaw.ca> wrote:

> Once everything gets rebalanced though, I don't think I'd be missing out on any features, would I?

The default nodesize/leafsize has been 16KB since btrfs-progs v3.12. This isn't changed by a balance. The difference between the previous default of 4KB and the new 16KB is performance and small-file efficiency.

Also, I think extref support is newly enabled by default with v3.12 btrfs-progs, which permits significantly more hardlinks. But this can be turned on for an existing volume using btrfstune.
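I believe that's the -r switch, run against the unmounted device (hypothetical name; check btrfstune(8) for your progs version):

  btrfstune -r /dev/sdX

Note it sets an incompat feature flag, so kernels without extref support won't mount the volume afterwards.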

Any other efficiencies in how things are written to disk aren't retroactively applied by a balance; a balance just causes the chunks to be rewritten.

Chris Murphy


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 23:03     ` Duncan
@ 2014-04-26  1:41       ` Chris Murphy
  2014-04-26  4:23         ` Duncan
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2014-04-26  1:41 UTC (permalink / raw)
  To: Btrfs BTRFS


On Apr 25, 2014, at 5:03 PM, Duncan <1i5t5.duncan@cox.net> wrote:

> But since -m/metadata includes -s/
> system by default, and that was the intended way of doing things,
> -f/force was added as necessary when doing only -s/system, since 
> presumably that was considered an artificial distinction, and handling -s/
> system as a part of -m/metadata was considered the more natural method.

OK, so somehow in Steve's conversion, metadata was converted from DUP to RAID1 completely, but some portion of system was left as DUP, incompletely converted to RAID1. It doesn't seem obvious that -mconvert is what he'd use now, but maybe with newer btrfs-progs it will also convert any unconverted system chunks.

If not, then -sconvert=raid1 -f and optionally -v.
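That is, spelled out with a hypothetical mountpoint:

  btrfs balance start -sconvert=raid1 -f -v /mnt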

This isn't exactly risk-free, given that it requires -f; and I'm not sure we can weigh the risk of a conversion failure against that of the specific drive containing the DUP system chunks dying. But for me a forced susage balance was fast:

[root@rawhide ~]# time btrfs balance start -susage=100 -f -v /
Dumping filters: flags 0xa, state 0x0, force is on
  SYSTEM (flags 0x2): balancing, usage=100
Done, had to relocate 1 out of 8 chunks

real	0m0.095s
user	0m0.001s
sys	0m0.017s


Chris Murphy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-26  1:24       ` Chris Murphy
@ 2014-04-26  2:56         ` Steve Leung
  2014-04-26  4:05           ` Chris Murphy
  2014-04-26  4:55           ` Duncan
  0 siblings, 2 replies; 17+ messages in thread
From: Steve Leung @ 2014-04-26  2:56 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Fri, 25 Apr 2014, Chris Murphy wrote:

> On Apr 25, 2014, at 12:43 PM, Steve Leung <sjleung@shaw.ca> wrote:
>> Once everything gets rebalanced though, I don't think I'd be missing out on any features, would I?

> The default nodesize/leafsize is 16KB since btrfs-progs v3.12. This 
> isn't changed with a balance. The difference between the previous 
> default 4KB, and 16KB is performance and small file efficiency.

Ah, now it's coming back to me.  The last major gyration I had on this 
filesystem (and the ultimate trigger for my original issue) was juggling 
everything around so that I could reformat for the 16kB node size.

Incidentally, is there a way for someone to tell what the node size 
currently is for a btrfs filesystem?  I never noticed that info printed 
anywhere from any of the btrfs utilities.

In case anyone's wondering, I did balance the system chunks on my 
filesystem and "btrfs fi df" now looks normal.  So thanks to all for the 
hints and advice.

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 19:07       ` Austin S Hemmelgarn
@ 2014-04-26  4:01         ` Duncan
  0 siblings, 0 replies; 17+ messages in thread
From: Duncan @ 2014-04-26  4:01 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Fri, 25 Apr 2014 15:07:40 -0400 as
excerpted:

> I actually have a similar situation with how I have my desktop system
> set up, when I go about recreating the filesystem (which I do every
> time I upgrade either the tools or the kernel),

Wow.  Given that I run a git kernel and btrfs-tools, I'd be spending a 
*LOT* of time on redoing my filesystems if I did that!  Tho see my just-
previous reply for what I do (a fresh mkfs.btrfs every few kernel cycles, 
to take advantage of new on-device-format feature options and to clean 
out any possibly remaining cruft from bugs now fixed, given that btrfs 
isn't fully stable yet).

Anyway, why I'm replying here:

[in the context of btrfs raid1 mode]

> I use the following approach:
> 
> 1. Delete one of the devices from the filesystem
> 2. Create a new btrfs file system on the device just removed from the
> filesystem
> 3. Copy the data from the old filesystem to the new one
> 4. one at a time, delete the remaining devices from the old filesystem
> and add them to the new one, re-balancing the new filesystem after
> adding each device.
> 
> This seems to work relatively well for me, and prevents the possibility
> that there is ever just one copy of the data.  It does, however, require
> that the amount of data that you are storing on the filesystem is less
> than the size of one of the devices (although you can kind of work
> around this limitation by setting compress-force=zlib on the new file
> system when you mount it, then using defrag to decompress everything
> after the conversion is done), and that you have to drop to single user
> mode for the conversion (unless it's something that isn't needed all the
> time, like the home directories or /usr/src, in which case you just log
> everyone out and log in as root on the console to do it).


I believe you're laboring under an unfortunate but understandable 
misconception of the nature of btrfs raid1.  Since in the event of device-
loss it's a critical misconception, I decided to deal with it in a reply 
separate from the other one (which I then made as a sibling post to yours 
in reply to the same parent, instead of as a reply to you).

Unlike for instance mdraid raid1 mode, which is N mirror-copies of the 
data across N devices (so 3 devices = 3 copies, 5 devices = 5 copies, 
etc)...

**BTRFS RAID1 MODE IS CURRENTLY PAIR-MIRROR ONLY!**

No matter the number of devices in the btrfs so-called "raid1", btrfs 
only pair-mirrors each chunk, so it's only two copies of the data per 
filesystem.  To have more than two-copy redundancy, you must use multiple 
filesystems and make one a copy of the other using either conventional 
backup methods or the btrfs-specific send/receive.
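A sketch of the send/receive variant, with hypothetical paths and a
second btrfs mounted at /mnt/backup:

  btrfs subvolume snapshot -r /mnt/data /mnt/data/snap.0
  btrfs send /mnt/data/snap.0 | btrfs receive /mnt/backup

Later runs can pass -p with the previous snapshot to btrfs send so only
the differences get shipped.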

This is actually my biggest annoyance/feature-request with current btrfs, 
as my own sweet-spot ideal is triplet-mirroring, and N-way-mirroring is 
indeed on the roadmap and has been for years, but the devs plan to use 
some of the code from btrfs raid5/6 to implement it, and of course while 
incomplete raid5/6 mode was introduced in 3.9, as of 3.14 at least, 
that's exactly what raid5/6 mode is, incomplete, and while I saw patches 
to properly support raid5/6 scrub recently, I believe it's still 
incomplete in 3.15 as well.  And of course N-way-mirroring remains 
roadmapped for after that... So not being a dev, I continue to wait, as 
patiently as I can manage since I'd rather a good implementation later 
than a buggy one now, for that still coming N-way-mirroring.  Tho at this 
point I admit to having some sympathy for the donkey forever following 
that apple held on the end of a stick just out of reach... even if I 
/would/ rather wait another five years for it and have it done /right/, 
than be dealing with a bad implementation available right now.

Anyway, given that we /are/ dealing with pair-mirror-only raid1 mode 
currently... as well as your pre-condition that for your method to work, 
the data to store on the filesystem must fit on a single device...

If you have a 3-device-plus btrfs raid1 and you're using btrfs device 
delete to remove the device you're going to create the new filesystem on, 
you do still have two-way-redundancy at all times, since the btrfs 
device delete will ensure the two copies are on the remaining devices, 
but that's unnecessary work compared to simply leaving it a device down 
in the first place, and starting with the last device of the previous 
(grandparent generation) filesystem as the first of a new (child 
generation) filesystem, leaving it unused between.

If OTOH you're hard-removing a device from the raid1, without a btrfs 
device delete first, then at the moment you do so, you only have a single 
copy of any chunk where one of the pair was on that device, and it 
remains that way until you do the mkfs and finish populating the new 
filesystem with the contents of the old one.

So you're either doing extra work (if you're using btrfs device delete), 
or leaving yourself with a single copy of anything on the removed device, 
until it is back up and running as the new filesystem! =:^(

I'd suggest not bothering with more than two (or possibly three) devices 
per filesystem, since by btrfs raid1, you only get pair-mirroring, so 
more devices is a waste for that, and by your own pre-condition, you 
limit the amount of data to the capacity of one device, so you can't take 
advantage of the extra storage capacity of more devices with >2 devices 
on a two-way-mirroring-limited raid1 either, making it a waste for that 
as well.  Save the extra devices for when you do the transfer.

If you have only three devices, set up the btrfs raid1 with two, and leave 
the third as a spare.  Then for the transfer, create and populate the new 
filesystem on the third, remove a device from the btrfs raid1 pair, add 
it to the new btrfs and convert to raid1.  At that point you can drop the 
old filesystem and leave its remaining device as your first device when 
you repeat the process later, making the last device of the grandparent 
into the first device of the child.

This way you'll have two copies of the data at all times and/or will save 
the work of the third device add and rebalance, and later the device 
delete, bringing it to two devices again.

And as a bonus, except for the time you're actually doing the mkfs and 
repopulating the new filesystem, you'll have a third copy, albeit a bit 
outdated, as a backup, that being the spare that you're not including in 
the current filesystem, since it still has a complete copy of the old 
filesystem from before it was removed, and that old copy can still be 
mounted using the degraded option (since it's the single device remaining 
of what was previously a multi-device raid1).

Alternatively, do the three-device raid1 thing and btrfs device delete 
when you're taking a device out and btrfs balance after adding the third 
device.  This will be more hassle, but dropping a device from a two-
device raid1 forces it read-only as writes can no longer be made in raid1 
mode, while a three-device raid1 doesn't give you more redundancy since 
btrfs raid1 remains pair-mirror-only, but DOES give you the ability to 
continue writing in raid1 mode with a missing device, since you still 
have two devices and can do raid1 pair-mirror writing.

So in view of the pair-mirror restriction, three devices won't give you 
additional redundancy, but it WILL give you a continued writable raid1 if 
a device drops out.  Whether that's worth the hassle of the additional 
steps needed to btrfs device delete to create the new filesystem and 
btrfs balance on adding the third device, is up to you, but it does give 
you that choice. =:^)

Similarly if you have four devices, only in that case you can actually do 
two independent two-device btrfs raid1 filesystems, one working and one 
backup, taking the backup down to recreate as the new primary/working 
filesystem when necessary, thus avoiding the whole device-add and 
rebalance thing entirely.  And your backup is then a full pair-redundant 
backup as well, tho of course you lose the backup for the period you're 
doing the mkfs and repopulating the new version.

This is actually pretty much what I'm doing here, except that my physical 
devices are more than twice the size of my data and I only have two 
physical devices.  But I use partitioning and create the dual-device 
btrfs raid1 pair-mirror across two partitions, one on each physical 
device, with the backup set being two different partitions, one each on 
the same pair of physical devices.

If you have five devices, I'd recommend doing about the same thing, only 
with the fifth device as a normally physically disconnected (and possibly 
stored separately, perhaps even off-site) backup of the two separate 
btrfs pair-mirror raid1s.  Actually, you can remove a device from one of 
the raid1s (presumably the backup/secondary) to create the new btrfs 
raid1, still leaving the one (presumably the working/primary) as a 
complete two-device raid1 pair, leaving the other device as a backup that 
can still be mounted using degraded, should that be necessary.

Or simply use the fifth device for something else. =:^)

With six devices you have a multi-way choice:

1) Btrfs raid1 pairs as with four devices but with two levels of backup.

This would be the same as the 5-device scenario, but completing the pair 
for the secondary backup.

2) Btrfs raid1 pairs with an addition device in primary and backup.

2a) This gives you a bit more flexibility in terms of size, since you now 
get 1.5 times the capacity of a single device, for both primary/working 
and secondary/backup.

2b) You also get the device-dropped write-flexibility described under the 
three-device case, but now for both primary and backup. =:^)

3) Six-device raid10.  In "simple" configuration, this would give you 3-
way-striping and 3X capacity of a single device, still pair mirroring, 
but you'd lose the independent backups.  However, if you used 
partitioning to split each physical device in half and made each set of 
six partitions an independent btrfs raid10, you'd still have half the 3X 
capacity, so 1.5X the capacity of a single device, still have the three-
way-striping and 2-way-mirroring for 3X the speed with pair-mirroring 
redundancy, *AND* have independent primary and backup sets, each its own 
6-way set of partitions across the 6 devices, giving you simple tear-down 
and recreate of the backup raid10 as the new working raid10.

That would be a very nice setup; something I'd like for myself. =:^)

Actually, once N-way-mirroring hits I'm going to want to setup pretty 
close to just this, except using triplet mirroring and two-way-striping 
instead of the reverse.  Keeping the two-way-partitioning as well, that'd 
give me 2X speed and 3X redundancy, at 1X capacity, with a primary and 
backup raid10 on different 6-way partition sets of the same six physical 
devices.

Ideally, the selectable-way mirroring/striping code will be flexible 
enough by that time to let me temporarily reduce striping (and speed/
capacity) to 1-way while keeping 3-way-mirroring, should I lose a device 
or two, thus avoiding the force-to-read-only that dropping below two-
devices in a raid1 or four devices in a raid10 currently does.  Upon 
replacing the bad devices, I could rebalance the 1-way-striped bits and 
get full 2-way-striping once again, while the triplet mirroring would 
have never been compromised.

That's my ideal. =:^)

But to do that I still need triplet-mirroring, and triplet-mirroring 
isn't available yet. =:^(

But it'll sure be nice when I CAN do it! =:^)

4) Do something else with the last pair of devices. =:^)


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-26  2:56         ` Steve Leung
@ 2014-04-26  4:05           ` Chris Murphy
  2014-04-26  4:55           ` Duncan
  1 sibling, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2014-04-26  4:05 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Steve Leung


On Apr 25, 2014, at 8:56 PM, Steve Leung <sjleung@shaw.ca> wrote:
> 
> Incidentally, is there a way for someone to tell what the node size currently is for a btrfs filesystem?  I never noticed that info printed anywhere from any of the btrfs utilities.

btrfs-show-super
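For example, against one of the member devices (hypothetical name):

  btrfs-show-super /dev/sdX | grep -i size

which picks out the nodesize/leafsize and sectorsize lines, among a few others.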

> In case anyone's wondering, I did balance the system chunks on my filesystem and "btrfs fi df" now looks normal.  So thanks to all for the hints and advice.

Good news.

I kinda wonder if some of the degraded multiple device mount failures we've seen are the result of partially missing system chunks.


Chris Murphy


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-26  1:41       ` Chris Murphy
@ 2014-04-26  4:23         ` Duncan
  0 siblings, 0 replies; 17+ messages in thread
From: Duncan @ 2014-04-26  4:23 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Fri, 25 Apr 2014 19:41:43 -0600 as excerpted:

> OK so somehow in Steve's conversion, metadata was converted from DUP to
> RAID1 completely, but some portion of system was left as DUP,
> incompletely converted to RAID1. It doesn't seem obvious that -mconvert
> is what he'd use now, but maybe newer btrfs-progs it will also convert
> any unconverted system chunk.
> 
> If not, then -sconvert=raid1 -f and optionally -v.
> 
> This isn't exactly risk free, given that it requires -f; and I'm not
> sure we can risk assess conversion failure vs the specific drive
> containing system DUP chunks dying. But for me a forced susage balance
> was fast:
> 
> [root@rawhide ~]# time btrfs balance start -susage=100 -f -v /
> Dumping filters: flags 0xa, state 0x0, force is on
>   SYSTEM (flags 0x2): balancing, usage=100
> Done, had to relocate 1 out of 8 chunks
> 
> real	0m0.095s user	0m0.001s sys	0m0.017s

Yes.  The one thing that can be said about system chunks is that they're 
small enough that processing just them should be quite fast, even on 
spinning rust.

So regardless of whether there's a safety issue justifying the required
-f/force for -s/system-only or not: unlike a full balance, which may
take anywhere from some minutes to an hour or so, up to possibly many
hours, on a large spinning-rust-based btrfs, a system-only rebalance
finishes almost immediately, so even if there is some possible danger
in it, the risk window should be quite small, time-wise. =:^)

And correspondingly, safety issue or not, I've never seen a report here
of bugs or filesystem loss due to use of -s -f.  That doesn't mean it
can't happen; that's under debate and I can't say for sure.  It does
mean you're pretty unlucky if you're the first to have a need to report
such a thing here. =:^\

But we all know that btrfs is still under heavy development, and thus
have those tested backups ready just in case, right?  In which case, I
think whatever risk there might be, relative to that of simply using
btrfs at all at this point in time, must be pretty negligible. =:^)
Tho a few people each year still do get struck by lightning... or win
the lottery.  Just living is a risk.  <shrug>

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-26  2:56         ` Steve Leung
  2014-04-26  4:05           ` Chris Murphy
@ 2014-04-26  4:55           ` Duncan
  1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2014-04-26  4:55 UTC (permalink / raw)
  To: linux-btrfs

Steve Leung posted on Fri, 25 Apr 2014 20:56:06 -0600 as excerpted:

> Incidentally, is there a way for someone to tell what the node size
> currently is for a btrfs filesystem?  I never noticed that info printed
> anywhere from any of the btrfs utilities.

btrfs-show-super <device> displays that, among other relatively obscure 
information.  Look for node-size and leaf-size.  (Today they are labeled 
synonyms in the mkfs.btrfs manpage and should be set the same.  But if 
I'm remembering correctly, originally they could be set separately in 
mkfs.btrfs, and apparently had slightly different technical meaning.  Tho 
I don't believe actually setting them to different sizes was ever 
supported.)  

Sectorsize is also printed.  The only value actually supported for it, 
however, has always been the architecture's kernel page size, 4096 bytes 
for x86 in both 32- and 64-bit variants, and I'm told in arm as well.  
But there are other archs (including sparc, mips and s390) where it's 
different, and as the mkfs.btrfs manpage says, don't set it unless you 
plan on actually using the filesystem on a different arch.  There is, 
however, work to allow btrfs to use different sector-sizes, 2048 bytes to 
I believe 64 KiB, thus allowing a btrfs created on an arch with a 
different page size to at least work on other archs, even if it's never 
going to be horribly efficient.

The former default for all three settings was page size, 4096 bytes on 
x86, but node/leafsize were apparently merged at the same time their 
default was changed to 16 KiB, since that's more efficient for nearly all 
users.

What I've wondered, however, is this: if a 16K nodesize is more
efficient than 4K for nearly everyone, under what conditions might the
even larger 32 KiB or 64 KiB (the max) be MORE efficient still?

That I don't know, and I strongly suspect that, being less tested, it
might trigger more bugs anyway; and while I'm testing a still not
entirely stable btrfs, I've not been /that/ interested in trying the
more unusual stuff or in triggering more bugs than I might normally
come across.

But someday curiosity might get the better of me and I might try it...

> In case anyone's wondering, I did balance the system chunks on my
> filesystem and "btrfs fi df" now looks normal.  So thanks to all for the
> hints and advice.

Heh, good to read. =:^)

Anyway, you provoked quite a discussion, and I think most of us learned
something from it, or at least thought about angles we'd not thought of
before, so I'm glad you posted the questions.  Challenged me, anyway! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: safe/necessary to balance system chunks?
  2014-04-25 19:14     ` Hugo Mills
@ 2014-06-19 11:32       ` Alex Lyakas
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Lyakas @ 2014-06-19 11:32 UTC (permalink / raw)
  To: Hugo Mills, Austin S Hemmelgarn, Chris Murphy, Steve Leung,
	linux-btrfs

On Fri, Apr 25, 2014 at 10:14 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
>> On 2014-04-25 13:24, Chris Murphy wrote:
>> >
>> > On Apr 25, 2014, at 8:57 AM, Steve Leung <sjleung@shaw.ca> wrote:
>> >
>> >>
>> >> Hi list,
>> >>
>> >> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>> >>
>> >> btrfs fi df:
>> >>
>> >> Data, RAID1: total=1.31TiB, used=1.07TiB
>> >> System, RAID1: total=32.00MiB, used=224.00KiB
>> >> System, DUP: total=32.00MiB, used=32.00KiB
>> >> System, single: total=4.00MiB, used=0.00
>> >> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>> >>
>> >> This still lists some system chunks as DUP, and not as RAID1.  Does this mean that if one device were to fail, some system chunks would be unrecoverable?  How bad would that be?
>> >
>> > Since it's "system" type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable that mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
>> >
>> > Anyway, it's probably a high penalty for losing only 32KB of data.  I think this could use some testing to try and reproduce conversions where some amount of "system" or "metadata" type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
>> >
>> As far as I understand it, the system chunks are THE root chunk tree for
>> the entire system, that is to say, it's the tree of tree roots that is
>> pointed to by the superblock. (I would love to know if this
>> understanding is wrong).  Thus losing that data almost always means
>> losing the whole filesystem.
>
>    From a conversation I had with cmason a while ago, the System
> chunks contain the chunk tree. They're special because *everything* in
> the filesystem -- including the locations of all the trees, including
> the chunk tree and the roots tree -- is positioned in terms of the
> internal virtual address space. Therefore, when starting up the FS,
> you can read the superblock (which is at a known position on each
> device), which tells you the virtual address of the other trees... and
> you still need to find out where that really is.
>
>    The superblock has (I think) a list of physical block addresses at
> the end of it (sys_chunk_array), which allows you to find the blocks
> for the chunk tree and work out this mapping, which allows you to find
> everything else. I'm not 100% certain of the actual format of that
> array -- it's declared as u8 [2048], so I'm guessing there's a load of
> casting to something useful going on in the code somewhere.
The format is just a list of pairs:
struct btrfs_disk_key,  struct btrfs_chunk
struct btrfs_disk_key,  struct btrfs_chunk
...

For each SYSTEM block-group (btrfs_chunk), we need one entry in the
sys_chunk_array. During mkfs the first SYSTEM block group is created;
for me it's 4MB. So only if the whole chunk tree grows over 4MB do we
need to create an additional SYSTEM block group, and then we need a
second entry in the sys_chunk_array. And so on.

Alex.


>
>    Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
>   PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
>     --- Is it still called an affair if I'm sleeping with my wife ---
>                         behind her lover's back?

^ permalink raw reply	[flat|nested] 17+ messages in thread
