Help interpreting RAID1 space allocation

* Help interpreting RAID1 space allocation
@ 2013-08-24  0:05 Joel Johnson
  2013-08-24  4:24 ` Chris Murphy
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Joel Johnson @ 2013-08-24  0:05 UTC (permalink / raw)
  To: linux-btrfs

I've created a test volume and copied a bulk of data to it, however the 
results of the space allocation are confusing at best. I've tried to 
capture the history of events leading up to the current state. This is 
all on a Debian Wheezy system using a 3.10.5 kernel package 
(linux-image-3.10-2-amd64) and btrfs tools v0.20-rc1 (Debian package 
0.19+20130315-5). The host uses an Intel Atom 330 processor, and runs 
the 64-bit kernel with a 32-bit userland.

I initially created the volume as RAID1 data, then removed (hotplugged 
out from under the system) one of the drives while empty as a test. I 
then unmounted and remounted it with the degraded option and copied a 
small amount of data. Once verifying that the space was used, I 
hotplugged the second original drive, which was detected and added back 
to the volume (showing up in a filesystem show instead of missing). I 
then tried to copy over more data than a RAID1 should be expected to 
hold (~650GB onto 2 x 500GB disks in RAID1), and got an out-of-space 
error as expected. I then deleted all data from the volume (did not recreate 
the filesystem), and copied just over 300GB of data onto the volume, 
which is the current state.

Only as I was typing this up did I notice that the mount options still 
show degraded from the original mount. I expected the second drive to be 
in use again once it was readded, since it showed up as part of the volume 
automatically (I assume because the UUIDs matched?). However, since all 
data appears to have been written to the first drive, I am led to 
believe that the second drive was present but not readded, even though 
it reappeared as devid 2 in the listing.

If the above is correct, then I have two questions that I haven't found 
any documentation on:

1. What is the expectation on hot-adding a failed drive? Is an explicit 
'device add' or 'replace' expected/required? In my case it appeared to 
be auto-added, but that may have been spurious or misleading. I'd 
argue that if an explicit readd is required, the device should not be 
listed at all; however, I would be much more interested to see hotplug of 
a previously missing device (with older modifications from the same 
volume) be readded and synced automatically.

2. If initially mounted as degraded, once a new drive is added, is a 
remount required? I'd hope not, but since the mount flag can't be 
changed later on, what is the best way to confirm health of the volume? 
Until this issue I'd assumed using 'filesystem show'. Since the mount 
flag is set at mount time only, degraded seems to mean "be degraded if 
needed" instead of being a positive indicator that the volume is indeed 
degraded.

$ mount | grep btrfs
/dev/sdc on /mnt/new-store type btrfs (rw,relatime,degraded,space_cache)

$ du -hsx /mnt/new-store
305G    /mnt/new-store

$ df -h | grep new-store
/dev/sdc                         932G  307G  160G  66% /mnt/new-store

$ btrfs fi show /dev/sdc
Label: 'new-store'  uuid: 14e6e9c7-b249-40ff-8be1-78fc8b26b53d
Total devices 2 FS bytes used 540.00KB
devid    2 size 465.76GB used 2.01GB path /dev/sdd
devid    1 size 465.76GB used 453.03GB path /dev/sdc

$ btrfs fi df /mnt/new-store
Data, RAID1: total=1.00GB, used=997.21MB
Data: total=450.01GB, used=303.18GB
System, RAID1: total=8.00MB, used=56.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=1.00GB, used=617.14MB
Metadata: total=1.01GB, used=0.00

I may be missing or mis-remembering some of the order of events leading 
to the current state, however the space usage numbers don't reflect 
anything close to what I would expect.
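
One way to reconcile the two listings above is to add up the chunk 
allocations from 'fi df', counting each RAID1 chunk twice because it is 
stored on both devices. The Python sketch below does this for the exact 
output quoted above. It assumes the tools print binary units (1GB = 
1024MB) and handles only the RAID1 and single profiles seen here; the 
names parse_fi_df, raw_allocated_gb and data_used_gb are made up for 
illustration.

```python
import re

# Sample copied from the `btrfs fi df` output quoted above.
FI_DF = """\
Data, RAID1: total=1.00GB, used=997.21MB
Data: total=450.01GB, used=303.18GB
System, RAID1: total=8.00MB, used=56.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=1.00GB, used=617.14MB
Metadata: total=1.01GB, used=0.00
"""

# Assumption: binary units. The empty unit only occurs for "used=0.00".
UNIT_TO_GB = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0, "": 1.0}
# Assumption: only these two profiles; RAID1 keeps two copies of each chunk.
COPIES = {"RAID1": 2, "single": 1}

LINE = re.compile(
    r"^(?P<kind>\w+)(?:, (?P<profile>\w+))?: "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]B)?, "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]B)?$"
)


def parse_fi_df(text):
    """Return (kind, profile, total_gb, used_gb) for each recognised line."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        profile = m["profile"] or "single"  # no profile printed means single
        total = float(m["total"]) * UNIT_TO_GB[m["tunit"] or ""]
        used = float(m["used"]) * UNIT_TO_GB[m["uunit"] or ""]
        rows.append((m["kind"], profile, total, used))
    return rows


def raw_allocated_gb(rows):
    """Space allocated on the devices, counting every copy of each chunk."""
    return sum(total * COPIES[profile] for _, profile, total, _ in rows)


def data_used_gb(rows):
    """Logical file data in use, across all data profiles."""
    return sum(used for kind, _, _, used in rows if kind == "Data")


rows = parse_fi_df(FI_DF)
print(f"raw allocated: {raw_allocated_gb(rows):.2f}GB")  # 455.04GB
print(f"data in use:   {data_used_gb(rows):.2f}GB")      # 304.15GB
```

The raw total of about 455.04GB equals the sum of the per-device 'used' 
figures from 'fi show' (453.03GB + 2.01GB), and the data in use is close 
to what du reports. Almost all of the allocation is single-profile data, 
which is consistent with nearly everything having been written to sdc 
only.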

On the data used on sdc I've assumed that's old from when I filled the 
volume and hasn't been reclaimed by a balance or other operation. 
However, the "used=997.21MB" from fi df, as well as the "FS bytes used 
540.00KB" from fi show seem suspect based on what

Thanks for help understanding the space allocation and usage patterns, 
I've tried to put pieces together based on man pages, wiki and other 
postings, but can't seem to reconcile what I think I should be seeing 
based on that reading with what I'm actually seeing.

Joel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Help interpreting RAID1 space allocation
  2013-08-24  0:05 Help interpreting RAID1 space allocation Joel Johnson
@ 2013-08-24  4:24 ` Chris Murphy
  2013-08-24  4:58   ` Chris Murphy
       [not found] ` < C0968F38-8432-41C3-B916-DC00C5C69B34@colorremedies.com>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Chris Murphy @ 2013-08-24  4:24 UTC (permalink / raw)
  To: Btrfs BTRFS

On Aug 23, 2013, at 6:05 PM, Joel Johnson <mrjoel@lixil.net> wrote:

> What is the expectation on hot-adding a failed drive, is an explicit 'device add' or 'replace' expected/required?

I'd expect to have to add a device and then remove missing. There isn't a readd option in btrfs, which in md parlance is used for readding a device previously part of an array. 

When replacing a failed disk, I'd like btrfs to compare states between the available drives and know that it needs to catch up the newly added device, but this doesn't yet happen. It's necessary to call btrfs balance.

However, after adding a device and deleting missing, I see what you see, the btrfs volume is still mounted degraded. 

Further, if I use:

# mount -o remount /dev/sdb /mnt
# mount | grep btrfs
/dev/sdc on /mnt type btrfs (rw,relatime,seclabel,degraded,space_cache)

So it's still degraded. I had to unmount and mount again to clear degraded. That seems to be a problem, to have to unmount the volume in order to remove the degraded flag, which is needed to begin the rebalance. And what if btrfs is the root file system? It needs to be rebooted to clear the degraded option.

And still further, somehow the data profile has reverted to single even though the mkfs.btrfs was raid1. So even though the volume now has two devices, has been balanced, is not degraded, the data profile is single and presumably I no longer actually have mirrored data at all on this volume. That is a huge bug. I'll try to come up with some steps to reproduce.

Chris Murphy

* Re: Help interpreting RAID1 space allocation
  2013-08-24  4:24 ` Chris Murphy
@ 2013-08-24  4:58   ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2013-08-24  4:58 UTC (permalink / raw)
  To: Btrfs BTRFS

On Aug 23, 2013, at 10:24 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> When replacing a failed disk, I'd like btrfs to compare states between the available drives and know that it needs to catch up the newly added device, but this doesn't yet happen. It's necessary to call btrfs balance.

I can only test device replacement, not a readd. Upon 'btrfs device delete missing /mnt' there's a delay, and it's doing a balance. I don't know what happens for a readd, if the whole volume needs balancing or if it's able to just write the changes.

> And still further, somehow the data profile has reverted to single even though the mkfs.btrfs was raid1. [SNIP] That is a huge bug. I'll try to come up with some steps to reproduce.

If I create the file system, mount it, but I do not copy any data, upon adding new and deleting missing, the data profile is changed from raid1 to single. If I've first copied data to the volume prior to device failure/missing, this doesn't happen, it remains raid1.

Also, again after the missing device is replaced, and the volume rebalanced, while the mount option is degraded, subsequent file copies end up on both virtual disks. So it says it's degraded but it really isn't? I'm not sure what's up here.

Chris Murphy

* Re: Help interpreting RAID1 space allocation
       [not found]   ` < 27E06913-DDBB-4B75-86D3-A8F6C5B09F99@colorremedies.com>
@ 2013-08-24 11:56     ` Duncan
  0 siblings, 0 replies; 11+ messages in thread
From: Duncan @ 2013-08-24 11:56 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Fri, 23 Aug 2013 22:58:14 -0600 as excerpted:

> So it says it's degraded but it really isn't?
> I'm not sure what's up here.

In general, this was my experience a couple months ago when I tried it 
then, as well.

And yes, IIRC the wiki actually describes the "degraded" mount option as 
allowing it to mount "degraded" IF NECESSARY, NOT saying that it'll force 
degraded.

I found one additional quirk, however, which was rather disturbing.  You 
may recall my thread about it then...

Background:  I was actually trying to mount a dual device raid1 root 
without an initr*, using rootflags=device=.  However, that didn't work.  
The only way I could mount from the kernel commandline was using 
degraded.  (I'm guessing the double-equals in the rootflags=device= got 
parsed incorrectly, probably trying to take rootflags=device as the name, 
which of course doesn't apply to anything, as it should be rootflags.  
This because it definitely took the rootflags=degraded parameter just 
fine.  That'd be a kernel bug, but I'm not sure that's it; I just know 
others have reported trouble with rootflags=device=whatever on this list, 
as well.)

So, while I was trying to get it to work (while I was still 
experimenting, and before I gave up and setup a simple initramfs using 
dracut), I booted with root=/dev/sdaX rootflags=degraded, and it worked.  
I then made a change to the filesystem, and rebooted again to the other 
one, using root=/dev/sdbX rootflags=degraded, and that worked too.  I 
then made a different change to the same file, thus diverging the two 
separately mounted devices with a different write to each one.

Then I rebooted back to my main system (not yet on btrfs at that point), 
and mounted the btrfs normally -- it mounted without the degraded option 
as both devices were found in the scan, DESPITE the fact that the two 
component devices now had diverged content.  I checked and the later 
change was shown, and there were no errors or warnings with the mount or 
with reading the file.

I then booted degraded again, to the one with the other change, AND IT 
STILL HAD THE CHANGE I HAD WRITTEN!!!

So the mount of both together had given me NO WARNING OR ERROR about 
diverged copies; it simply showed one of them.  BUT THE OTHER ONE WASN'T 
AUTO-DETECTED OR FIXED, EITHER.

With mdraid, attempting to mount (well, re-add, in the case of mdraid) 
with both devices would have triggered an automatic resync.  I expected 
at LEAST a warning with btrfs, but I didn't get it.  I guess I'd have had 
to manually initiate a scrub to detect and fix it, but was new enough to 
(multi-device) btrfs at the time that trying that didn't occur to me.  
I'm not sure what a full rebalance would have done.

That was rather disturbing to me, to say the least.

But for my usage, knowing and noting the problem to avoid in the future 
was enough.

I blew away my (then) test btrfs with a fresh mkfs.btrfs raid1 both data 
and metadata, copied root from my main system over to it once again, and 
set it up with an initr* mount to avoid kernel commandline degraded-
mounting.  Meanwhile, I don't do any more degraded tests.  I do scrubs 
every so often (after a bad shutdown the other day I actually had a scrub 
find and fix some bad checksums for the first time, and files that were 
definitely giving problems before the scrub worked just fine afterward, 
so I was glad I had btrfs checksums and raid1 copies to restore, the 
feature worked as advertised! =:^), and...

If I ever do lose a device and have to go degraded, I know *NOT* to try 
degraded-mounting both devices read/write separately, and then 
recombining them once again.  If I need to degraded-mount one, fine, but 
I'll make sure it's only the one, and if I do mount the other, I'll 
either mount it read-only, or if I do end up mounting them both 
read/write, I'll blow one away and then add it as a new device, in order to 
avoid having problems figuring out which divergent copy I'm going to be 
dealing with once I'm using both devices again.

With that sort of ground rule, I think I should be fine.  But it's 
certainly a lot different than mdraid works in the same setup.  Btrfs 
definitely has its own raid1 rules -- the mdraid raid1 rules do *NOT* 
apply.

Meanwhile, hopefully at some point as btrfs heads toward stable, it gets 
a write signature or whatever it is that mdraid has, so it can detect and 
warn when raid1 devices diverge, and ideally can auto-sync, at least if 
configured to do so.
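
The kind of check wished for above can be illustrated with a version 
vector. This is a generic technique, swapped in here purely for 
illustration, and not a description of what btrfs or mdraid actually 
store. In this sketch each device keeps one counter per mirror member 
and bumps its own counter whenever it is written; the device names and 
the function compare_vectors are hypothetical.

```python
def compare_vectors(first, second):
    """Compare two version vectors (dicts mapping device name -> counter).

    Returns "in sync", "first is stale", "second is stale" or "diverged".
    """
    keys = set(first) | set(second)
    first_covers = all(first.get(k, 0) >= second.get(k, 0) for k in keys)
    second_covers = all(second.get(k, 0) >= first.get(k, 0) for k in keys)
    if first_covers and second_covers:
        return "in sync"
    if first_covers:
        return "second is stale"   # safe to resync second from first
    if second_covers:
        return "first is stale"    # safe to resync first from second
    return "diverged"              # each side has writes the other lacks


# Both members start out identical.
sdb_copy = {"sda": 10, "sdb": 10}

# sda is mounted degraded on its own and written once.
sda_after = {"sda": 11, "sdb": 10}
print(compare_vectors(sda_after, sdb_copy))   # second is stale

# Then sdb is mounted degraded on its own and written once as well.
sdb_after = {"sda": 10, "sdb": 11}
print(compare_vectors(sda_after, sdb_after))  # diverged
```

A single shared counter would not be enough for the second case, since 
both devices would report the same number of writes; keeping one counter 
per member is what lets the comparison tell "stale" from "diverged".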

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Help interpreting RAID1 space allocation
  2013-08-24  0:05 Help interpreting RAID1 space allocation Joel Johnson
  2013-08-24  4:24 ` Chris Murphy
       [not found] ` < C0968F38-8432-41C3-B916-DC00C5C69B34@colorremedies.com>
@ 2013-08-24 17:24 ` Joel Johnson
  2013-08-24 18:02   ` Joel Johnson
  2013-08-25  5:18   ` Chris Murphy
       [not found] ` < 093e4c7d21b5e734c86fc4bb1703a69e@lixil.net>
  3 siblings, 2 replies; 11+ messages in thread
From: Joel Johnson @ 2013-08-24 17:24 UTC (permalink / raw)
  To: linux-btrfs

On Aug 23, 2013, Chris Murphy <li...@colorremedies.com> wrote:
> When replacing a failed disk, I'd like btrfs to compare states between the
> available drives and know that it needs to catch up the newly added device,
> but this doesn't yet happen. It's necessary to call btrfs balance.

> I can only test device replacement, not a readd. Upon 'btrfs device delete
> missing /mnt' there's a delay, and it's doing a balance. I don't know what
> happens for a readd, if the whole volume needs balancing or if it's able to
> just write the changes.

Similar to what Duncan described in his response, on a hot-remove 
(without doing the proper btrfs device delete), there is no opportunity 
for a rebalance or metadata change on the pulled drives, so I would 
expect there to be a signature of some sort for consistency checking 
before readding it. At least, btrfs shouldn't add the readded device 
back as an active device when it's really still inconsistent and not 
being used, even if it indicates the same UUID.

> And still further, somehow the data profile has reverted to single even
> though the mkfs.btrfs was raid1. [SNIP] That is a huge bug. I'll try to come
> up with some steps to reproduce.

> If I create the file system, mount it, but I do not copy any data, upon adding
> new and deleting missing, the data profile is changed from raid1 to single. If
> I've first copied data to the volume prior to device failure/missing, this
> doesn't happen, it remains raid1.

And yet, the tools indicate that it is still raid1, even if internally 
it reverts to single???

Based on my experience with this and Duncan's feedback, I'd like to see 
the wiki have some warnings about dealing with multidevice filesystems, 
especially surrounding the degraded mount option. Specifically, it 
sounds like a reasonable practice with the current state is that after a 
device is removed from a filesystem which receives any subsequent 
changes, the removed device should be cleared (dd if=/dev/zero, at least 
the superblocks) before being readded in order to remove any ambiguity 
about state. Adding such a note to the wiki now would communicate the 
potential pitfalls (especially given how difficult it is to determine the 
state, per the bugs below), and also allow updating once things are 
improved in this area.

Looking again at the wiki Gotchas page, it does say
On a multi device btrfs filesystem, mistakingly re-adding
a block device that is already part of the btrfs fs with
btrfs device add results in an error, and brings btrfs in
an inconsistent state. In striping mode, this causes data
loss and kernel oops. The btrfs userland tools need to do
more checking to prevent these easy mistakes.

That seems very related; however, the page could perhaps add the workaround 
and clarify that even hotplugged devices may be added and bitten by this.

Should I file a few bugs to capture the related issues? Here are the 
discrete issues that seem to be present from a user point of view:

1. btrfs filesystem show - shouldn't list devices as present unless 
they're in use and in a consistent state

If an explicit add is needed to add a new device, don't auto-add, even for 
devices previously part of the filesystem. Although I would claim that 
auto-adding and being consistent is most desired (when it makes sense 
with an existing signature, an empty drive has no indication on where to 
be added), it should be all or nothing instead of showing the device as 
being added (or at least how I interpret it being present in the 'fi 
show' listing) but internally being untracked or inconsistent.

As Chris said, "There isn't a readd option in btrfs, which in md 
parlance is used for readding a device previously part of an array." 
However, when I hotplugged the drive and it reappeared in the 'fi show' 
output, I assumed exactly the md semantics had occurred, with the drive 
having been readded and made consistent - it didn't take any time, but I 
hadn't copied data yet and knew btrfs may only sync the used data and 
metadata blocks.

In other words, I never ran a device add or remove, but still saw what 
appeared to be consistent behavior.

2. data profile shouldn't revert to single if adding/deleting before 
copying data

3. degraded mount option should be "allow degraded if needed", allowing 
non-degraded when it becomes available

Shouldn't force degraded, especially after adding (either manually or 
automatically) sufficient devices to operate in non-degraded mode. As 
soon as devices are added, a rebalance should be done to bring the new 
device(s) into consistent state.

This then raises the question: how does one check the degraded state of 
a filesystem, if not via the mount flag? I (quite likely with an md-raid 
bias) expected to use the 'filesystem show' output, listing the devices 
as well as a status flag of fully-consistent or rebalance in progress. 
If that's not the correct or intended location, then provide 
documentation on how to properly check the consistency state and 
degraded state of a filesystem.
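
Until such a status flag exists, a rough check can be scripted from the 
'fi show' output itself. The Python sketch below is only a heuristic 
built on the listing posted at the start of this thread: it compares 
"Total devices" with the number of devid lines, and flags a large gap in 
per-device allocation, which should not occur on a two-device RAID1 
where every chunk is mirrored. The function name summarize and the 
10GB threshold are arbitrary assumptions, and a clean result does not 
prove that the devices are consistent.

```python
import re

# Sample copied from the `btrfs fi show` output earlier in the thread.
FI_SHOW = """\
Label: 'new-store'  uuid: 14e6e9c7-b249-40ff-8be1-78fc8b26b53d
Total devices 2 FS bytes used 540.00KB
devid    2 size 465.76GB used 2.01GB path /dev/sdd
devid    1 size 465.76GB used 453.03GB path /dev/sdc
"""

# Assumption: binary units, as in the listing above.
UNIT_TO_GB = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}

TOTAL = re.compile(r"Total devices (\d+)")
DEVID = re.compile(
    r"devid\s+(\d+)\s+size\s+([\d.]+)([KMGT]B)"
    r"\s+used\s+([\d.]+)([KMGT]B)\s+path\s+(\S+)"
)


def summarize(text, skew_limit_gb=10.0):
    """Summarise device count and allocation skew from `fi show` text."""
    expected = int(TOTAL.search(text).group(1))
    used = {}
    for m in DEVID.finditer(text):
        used[m.group(6)] = float(m.group(4)) * UNIT_TO_GB[m.group(5)]
    skew = max(used.values()) - min(used.values()) if used else 0.0
    return {
        "expected": expected,           # devices the filesystem knows about
        "listed": len(used),            # devices actually listed with a path
        "missing": expected - len(used),
        "skew_gb": skew,                # gap between most and least allocated
        "suspicious": expected != len(used) or skew > skew_limit_gb,
    }


print(summarize(FI_SHOW))
```

For the listing above no device is reported missing, yet the allocation 
gap of roughly 451GB between sdc and sdd marks the volume as suspicious, 
which matches what was observed.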

Joel

* Re: Help interpreting RAID1 space allocation
  2013-08-24 17:24 ` Joel Johnson
@ 2013-08-24 18:02   ` Joel Johnson
  2013-08-24 20:30     ` Sandy McArthur
  2013-08-25  5:18   ` Chris Murphy
  1 sibling, 1 reply; 11+ messages in thread
From: Joel Johnson @ 2013-08-24 18:02 UTC (permalink / raw)
  To: linux-btrfs

On 2013-08-24 11:24, Joel Johnson wrote:
> Should I file a few bugs to capture the related issues? Here are the
> discrete issues that seem to be present from a user point of view:

After writing this, I figured I'd experiment with the current state and 
try to properly delete and add the sdd device. However, I was surprised 
to find that I'm not allowed to remove one of the two drives in a RAID1. 
Kernel message is "btrfs: unable to go below two devices on raid1". Not 
allowing it by default makes some sense, however a --force flag or 
something would be beneficial. I understand that the preferred method is 
to add the replacement device first and then delete the old one (or do a 
direct replace), however the system I'm using for testing only has three 
SATA ports, the first is used for the system drive, and the second and 
third have the two drives I'm using for my btrfs testing - I have no way 
to add a third drive for the filesystem before removing one first. This 
is still with the filesystem mounted with the degraded mount option set.

As it stands, I don't see how I can run btrfs reliably on this system 
since I now know I want to properly remove devices and add them, but I'm 
unable to do so with the limited number of drive interfaces, requiring 
hot removal of drive, and either readding it, or using another system to 
clear the drive before readding it.

Is there a flag I'm missing? If not, I'd add a fourth bug item:

4. allow (forced only if appropriate) removal of redundant drives (e.g. 
second drive in RAID1)

Joel

* Re: Help interpreting RAID1 space allocation
  2013-08-24 18:02   ` Joel Johnson
@ 2013-08-24 20:30     ` Sandy McArthur
  2013-08-25  2:42       ` Joel Johnson
  0 siblings, 1 reply; 11+ messages in thread
From: Sandy McArthur @ 2013-08-24 20:30 UTC (permalink / raw)
  To: Joel Johnson; +Cc: linux-btrfs

On Sat, Aug 24, 2013 at 2:02 PM, Joel Johnson <mrjoel@lixil.net> wrote:
> On 2013-08-24 11:24, Joel Johnson wrote:
>>
>> Should I file a few bugs to capture the related issues? Here are the
>> discrete issues that seem to be present from a user point of view:
>
>
> After writing this, I figured I'd experiment with the current state and try
> to properly delete and add the sdd device. However, I was surprised to find
> that I'm not allowed to remove one of the two drives in a RAID1. Kernel
> message is "btrfs: unable to go below two devices on raid1" Not allowing it
> by default makes some sense, however a --force flag or something would be
> beneficial. I understand that the preferred method is to add the replacement
> device first and then delete the old one (or do a direct replace), however
> the system I'm using for testing only has three SATA ports, the first is
> used for the system drive, and the second and third have the two drives I'm
> using for my btrfs testing - I have no way to add a third drive for the
> filesystem before removing one first. This is still with the filesystem
> mounted with the degraded mount option set.

If you have a USB enclosure you could connect another drive via USB
temporarily.
That said my biggest btrfs problem happened while I had some drives on
SATA and others on USB.

-- 
Sandy McArthur

* Re: Help interpreting RAID1 space allocation
  2013-08-24 20:30     ` Sandy McArthur
@ 2013-08-25  2:42       ` Joel Johnson
  0 siblings, 0 replies; 11+ messages in thread
From: Joel Johnson @ 2013-08-25  2:42 UTC (permalink / raw)
  To: Sandy McArthur; +Cc: linux-btrfs

On 2013-08-24 14:30, Sandy McArthur wrote:
> I was surprised to find
> that I'm not allowed to remove one of the two drives in a RAID1. Kernel
> message is "btrfs: unable to go below two devices on raid1" Not allowing it
> by default makes some sense, however a --force flag or something would be
> beneficial.
> 
> If you have a USB enclosure you could connect another drive via USB
> temporarily.
> That said my biggest btrfs problem happened while I had some drives on
> SATA and others on USB.

Ah, indeed, a close look at my device names reveals that I claim to only 
have three SATA ports, but sdc and sdd are used for my btrfs test. The 
sdb device is in fact an external USB enclosure that I'm using as my 
data source for the testing.

I'm really just wanting to test what I see as some common use cases for 
a RAID1 btrfs filesystem. A very common one that seems to fail currently 
is wishing to preemptively (SMART errors on drive, many other reasons) 
remove it from the mirror, in order to add a replacement device - and 
keep things online during the entire process. As I see it now, I'd need 
to umount, forcibly/uncleanly pull the drive, mount degraded, add the 
new device, umount, mount cleanly, rebalance and move forward.
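
The sequence just described can be written down as a dry-run plan. The 
Python sketch below only builds and prints the command lines; it does 
not execute anything. The 'device delete missing' step is taken from 
Chris's earlier reply rather than from the list above, /dev/sde is a 
hypothetical replacement drive, the function name replacement_plan is 
made up, and the exact subcommand spelling may differ between tool 
versions.

```python
def replacement_plan(mountpoint, surviving_dev, new_dev):
    """Return the commands, in order, as argv lists (nothing is executed)."""
    return [
        ["umount", mountpoint],
        # The failing drive is physically swapped for the new one here.
        ["mount", "-o", "degraded", surviving_dev, mountpoint],
        ["btrfs", "device", "add", new_dev, mountpoint],
        ["btrfs", "device", "delete", "missing", mountpoint],
        # Unmount and mount again so the degraded option is cleared.
        ["umount", mountpoint],
        ["mount", surviving_dev, mountpoint],
        ["btrfs", "balance", "start", mountpoint],
    ]


plan = replacement_plan("/mnt/new-store", "/dev/sdc", "/dev/sde")
for argv in plan:
    print(" ".join(argv))
```

Spelling the steps out this way makes the cost visible: two unmounts 
and a full balance for what is, conceptually, swapping one mirror 
member.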

I'd like to use btrfs for the data checksumming, however based on my 
investigations it doesn't appear to be ready for the set of RAID1 
support I'd expect. Based on that, I'm relooking at options.

Are there any known gotchas or limitations with a btrfs filesystem with 
single data and metadata, running on top of an md-raid RAID1 mirror?

Joel

* Re: Help interpreting RAID1 space allocation
  2013-08-24 17:24 ` Joel Johnson
  2013-08-24 18:02   ` Joel Johnson
@ 2013-08-25  5:18   ` Chris Murphy
  1 sibling, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2013-08-25  5:18 UTC (permalink / raw)
  To: Joel Johnson; +Cc: linux-btrfs

On Aug 24, 2013, at 11:24 AM, Joel Johnson <mrjoel@lixil.net> wrote:
> 
> Similar to what Duncan described in his response, on a hot-remove (without doing the proper btrfs device delete), there is no opportunity for a rebalance or metadata change on the pulled drives, so I would expect there to be a signature of some sort for consistency checking before readding it. At least, btrfs shouldn't add the readded device back as an active device when it's really still inconsistent and not being used, even if it indicates the same UUID.

Question: On hot-remove, does 'mount' show the volume as degraded?

I find the degraded mount option confusing. What does it mean to use -o degraded when mounting a volume for which all devices are present and functioning?

>> If I create the file system, mount it, but I do not copy any data, upon adding
>> new and deleting missing, the data profile is changed from raid1 to single. If
>> I've first copied data to the volume prior to device failure/missing, this
>> doesn't happen, it remains raid1.
> 
> And yet, the tools indicate that it is still raid1, even if internally it reverts to single???

No. btrfs fi df <mp> does reflect that the data profile has flipped from raid1 to single. As I mention later, this is reproducible only if the volume has had no data written to it. If I first write a file, then the reversion from raid1 to single doesn't happen upon 'btrfs device delete'.

> Based on my experience with this and Duncan's feedback, I'd like to see the wiki have some warnings about dealing with multidevice filesystems, especially surrounding the degraded mount option.

To me, degraded is an array or volume state, not up to the user to set as an option. So I'd like to know if the option is temporary, to more easily handle a particular problem for now, but the intention is to handle it better (differently) in the future.

> 
> Looking again at the wiki Gotchas page, it does say
> On a multi device btrfs filesystem, mistakingly re-adding
> a block device that is already part of the btrfs fs with
> btrfs device add results in an error, and brings btrfs in
> an inconsistent state.

For raid1 and raid10 this seems a problem for a file system that can become very large. The devices have enough information to determine exactly how far behind temporarily kicked devices are; it seems they effectively have an mdraid write-intent bitmap.

> 
> 1. btrfs filesystem show - shouldn't list devices as present unless they're in use and in a consistent state.

Or mark them as being inconsistent/unavailable.

> 
> As Chris said, "There isn't a readd option in btrfs, which in md parlance is used for readding a device previously part of an array." However, when I hotplugged the drive and it reappeared in the 'fi show' output, I assumed exactly the md semantics had occurred, with the drive having been readded and made consistent - it didn't take any time, but I hadn't copied data yet and knew btrfs may only sync the used data and metadata blocks.

The md semantics is that there is no auto add or readd. You must tell it to do this once the dropped device is made available again. If there's a write-intent bitmap, the readded device is caught up very quickly. 

I think it's a problem if there isn't a write-intent bitmap equivalent for btrfs raid1/raid10, and right now there doesn't seem to be one. A compulsory rebalance means hours or days of rebalance just because one drive was dropped for a short while.
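
To show why a write-intent bitmap makes the catch-up so much cheaper, here is a toy model in Python. It is a deliberately simplified sketch and not how mdraid implements its bitmap: it merely records which fixed-size regions were written while the peer was absent, so that only those regions need copying when the peer returns. The class name, sizes and offsets are all made up for illustration.

```python
class MirrorWithBitmap:
    """Toy two-way mirror that tracks regions written while the peer is gone."""

    def __init__(self, size, region_size):
        self.region_size = region_size
        self.total_regions = -(-size // region_size)  # ceiling division
        self.peer_present = True
        self.dirty = set()

    def drop_peer(self):
        self.peer_present = False

    def write(self, offset, length):
        # With the peer present both copies are updated, nothing to track.
        if self.peer_present or length <= 0:
            return
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        self.dirty.update(range(first, last + 1))

    def readd_peer(self):
        """Return the regions that must be copied to bring the peer up to date."""
        to_copy = sorted(self.dirty)
        self.dirty.clear()
        self.peer_present = True
        return to_copy


mirror = MirrorWithBitmap(size=1000, region_size=100)  # 10 regions
mirror.write(0, 500)        # peer present: nothing recorded
mirror.drop_peer()
mirror.write(250, 100)      # touches regions 2 and 3
mirror.write(990, 10)       # touches region 9
print(mirror.readd_peer())  # [2, 3, 9]
```

In this toy, three of ten regions are copied instead of everything; without such tracking the only safe option is to treat the whole device as out of date, which is the compulsory rebalance described above.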

> 
> In other words, I never ran a device add or remove, but still saw what appeared to be consistent behavior.
> 
> 2. data profile shouldn't revert to single if adding/deleting before copying data

Yes I think it's a bug too, but it's probably benign.

> 
> This then drives the question, how does one check the degraded state of a filesystem if not the mount flag. I (quite likely with an md-raid bias) expected to use the 'filesystem show' output, listed the devices as well as a status flag of fully-consistent or rebalance in progress. If that's not the correct or intended location, then provide documentation on how to properly check the consistency state and degraded state of a filesystem.

Yeah I think something functionally equivalent to a combination of mdadm -D and -E. mdadm distinguishes between array status/metadata vs member device status/metadata with those two commands.

Chris Murphy

* Re: Help interpreting RAID1 space allocation
       [not found]   ` < D8D710C1-2FEC-41E1-A139-522F351E0464@colorremedies.com>
@ 2013-08-25 12:12     ` Duncan
  2013-08-25 19:13       ` Chris Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Duncan @ 2013-08-25 12:12 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Sat, 24 Aug 2013 23:18:26 -0600 as excerpted:

> On Aug 24, 2013, at 11:24 AM, Joel Johnson <mrjoel@lixil.net> wrote:
>> 
>> Similar to what Duncan described in his response, on a hot-remove
>> (without doing the proper btrfs device delete), there is no opportunity
>> for a rebalance or metadata change on the pulled drives, so I would
>> expect there to be a signature of some sort for consistency checking
>> before readding it. At least, btrfs shouldn't add the readded device
>> back as an active device when it's really still inconsistent and not
>> being used, even if it indicates the same UUID.
> 
> Question: On hot-remove, does 'mount' show the volume as degraded?
> 
> I find the degraded mount option confusing. What does it mean to use -o
> degraded when mounting a volume for which all devices are present and
> functioning?

The degraded mount option does indeed simply ALLOW mounting without all 
devices.  If all devices can be found, btrfs will still integrate them 
all, regardless of the mount option.

Looked at in that way, therefore, having the degraded option remain when 
all devices were found and integrated makes sense.  It's simply denoting 
the historical fact at that point, that the degraded option was included 
when mounting, and thus that it WOULD have mounted without all devices, 
if it couldn't find them all, regardless of whether it found and 
integrated all devices or not.
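
Put as a truth table, the behaviour described above looks like the 
following Python sketch. It models only the semantics as explained in 
this message, not the kernel's actual code path, and the function name 
mount_outcome is made up for illustration.

```python
def mount_outcome(all_devices_found, degraded_option):
    """Model of the 'degraded' mount option as described in this thread.

    Returns (result, option_shown), where option_shown says whether
    'degraded' would appear in the mount options afterwards.
    """
    if all_devices_found:
        # The option has no effect, but it still shows up in the options.
        return ("normal mount", degraded_option)
    if degraded_option:
        return ("degraded mount", True)
    return ("mount fails", False)


for found in (True, False):
    for option in (True, False):
        print(found, option, mount_outcome(found, option))
```

The first case is the confusing one: all devices are in use, yet 
'degraded' is still listed because it only records what was requested 
at mount time.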

And hot-remove won't change the options used to mount, either, so 
degraded won't (or shouldn't, I don't think it does but didn't actually 
check that case personally) magically appear in the options due to the 
hot-remove.

However, I /believe/ btrfs filesystem show should display MISSING when a 
device has been hot-removed, until it's added again.  That's what I 
understand Joel to be saying, at least, and it's consistent with my 
understanding of the situation.

(I would have tested that when I did my original testing, except I didn't 
know my way around multi-device btrfs well enough to properly grok either 
the commands I really should be running or their output.  I did run the 
commands, but I had the other device still attached even tho I'd 
originally mounted degraded, so it didn't show up as missing, and I 
didn't understand the significance of what I was seeing.  I only knew 
that the results I got from the separate degraded writes followed by a 
non-degraded mount were NOT what I expected.  So I simply resolved to 
steer well clear of degraded mounting in the first place, if I could help 
it, and to take steps to wipe and clean-add in the event something 
happened and I really NEEDED that degraded.)

>> Based on my experience with this and Duncan's feedback, I'd like to see
>> the wiki have some warnings about dealing with multidevice filesystems,
>> especially surrounding the degraded mount option.

Well, as I got told at one point, it's a wiki, knock yourself out. =:^/

Tho... in fairness, while I intend to register and do some of these 
changes at some point, in practice, I'm far more comfortable on 
newsgroups and mailing lists than in web forums or editing wikis, so 
unfortunately I've not gotten "the properly rounded tuit" yet. =:^(

But seriously, Joel, I agree it needs doing, and if you get to it before I 
do... there'll be less I need to do.  So if you have the time and 
motivation to do it, please do so! =:^)  Plus you appear to be doing a 
bit more thorough testing with it than I did, so you're arguably better 
placed to do it anyway.

> To me, degraded is an array or volume state, not up to the user to set
> as an option. So I'd like to know if the option is temporary, to more
> easily handle a particular problem for now, but the intention is to
> handle it better (differently) in the future.

Hopefully the above helped with that.  AFAIK the degraded mount-option 
will remain more or less as it is -- simply allowing the filesystem to 
start instead of error-out if it can't find all devices, but effectively 
doing nothing if it does find all devices.

Meanwhile, I'm used to running beta and at times alpha software, and what 
we have now is clearly classic alpha, not all primary features 
implemented yet, let alone all the sharp edges removed and the chrome 
polished up.  Classic beta has all the baseline features, and we are 
getting close, but still has sharp edges/bugs that can hurt if one isn't 
careful around them.  I honestly expect btrfs should be hitting that by 
end of year or certainly early next, as it really is getting close now.

What that means in context is that a few primary features still need to 
be added first: finishing up raid5/6 mode, full N-way mirroring (not just 
the 2-way referred to as raid1 currently), possibly dedup, and finishing 
up send/receive (it's there but rather too buggy to be entirely practical 
at present).  AFAIK, that's about it on the primary features list.

Then it's beta, with the full focus turning to debugging and getting rid 
of those sharp corners, and I expect THAT is when we'll see some of these 
really bare and sharp-cornered features such as multi-device raidN get 
rounded out in userspace, with the tools actually turning into something 
reasonably usable, not the bare-bones alpha proof-of-concept userspace 
tools we have for the multi-device features at present.

> For raid1 and raid10 this seems a problem for a file system that can
> become very large. The devices have enough information to determine
> exactly how far behind temporarily kicked devices are; it seems they
> effectively have an mdraid write-intent bitmap.

With atomic tree updates not taking effect until the root node is finally 
written, and with btrfs keeping a list of the last several root nodes as 
it has actually been doing for several versions now (since 3.0 at least, 
I believe), I /believe/ it's even better than a 1-deep write-intent 
bitmap, as it's effectively an N-deep stack of such bitmaps. =:^)
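
To make the idea concrete, here is a toy model in Python.  It is NOT how 
btrfs stores anything on disk; the function name and the data are 
hypothetical.  It only shows the reasoning: if every block records the 
generation of the commit that last wrote it, and we know the last 
generation a dropped device saw, then the blocks needing resync are 
exactly those with a newer generation.

```python
def blocks_to_resync(block_generations, device_last_generation):
    """Return the block numbers written after the device's last seen generation."""
    return sorted(
        block
        for block, generation in block_generations.items()
        if generation > device_last_generation
    )


# Hypothetical data: block number -> generation of the commit that last wrote it.
written = {10: 5, 11: 7, 12: 9, 13: 9}

# A device that dropped out after generation 7 only needs blocks 12 and 13.
print(blocks_to_resync(written, 7))
```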

The problem is, as I explained above, btrfs is still effectively alpha, 
and the tools we are using to work with it are effectively bare-bones 
proof-of-concept alpha level tools, since not all features have yet been 
fully implemented, let alone having time to flesh anything out properly.

It'll take time...

> I think it's a problem if there isn't a write-intent bitmap equivalent
> for btrfs raid1/raid10, and right now there doesn't seem to be one.

As I explained I believe btrfs has even better.  It's simply that there's 
no proper tools available to use it yet...

> A compulsory rebalance means hours or days of rebalance just because one
> drive was dropped for a short while.

I think I consider myself lucky.  One thing I learned with my years of 
playing with mdraid is how to make proper use of partitions, with only 
the ones I actually needed active and the filesystems mounted, and 
activating/mounting read-only where possible, so if a device did drop out 
for whatever reason, between the split-up mounts meaning relatively 
little actual data was affected, and the write-intent bitmaps, I was back 
online right away.

While btrfs doesn't YET expose its root-node-stack as a stack of write-
intent-bitmaps as it COULD, and I believe eventually WILL, unlike back 
when I was running mdraid and dealing with the write-intent bitmaps 
there, I'm on SSD for my btrfs filesystems today, and they're MUCH faster.

*So* much so that between the multiple relatively small partitions (fully 
independent, I don't want all my eggs in one filesystem tree basket, so 
no subvolumes, they're fully independent filesystems/partitions) and the 
fact that they're on ssd...

Here, a full filesystem balance typically takes on the order of seconds 
to a minute, depending on the filesystem/partition.  That's rewriting ALL 
data and metadata on the filesystem!

So while I understand the concept of a full multi-terabyte filesystem 
rebalance taking on the order of days, the contrast between that concept, 
and the reality of a few gigabytes of data in its own dedicated 
filesystem on ssd rebalancing in a few tens of seconds here...

Makes a world of difference!

Let's just say I'm glad it isn't the other way around! =:^)

>> This then drives the question, how does one check the degraded state of
>> a filesystem if not the mount flag. I (quite likely with an md-raid
>> bias) expected to use the 'filesystem show' output, listed the devices
>> as well as a status flag of fully-consistent or rebalance in progress.
>> If that's not the correct or intended location, then provide
>> documentation on how to properly check the consistency state and
>> degraded state of a filesystem.
> 
> Yeah I think something functionally equivalent to a combination of mdadm
> -D and -E. mdadm distinguishes between array status/metadata vs member
> device status/metadata with those two commands.

While the bare-bones-alpha tools-state explains the current situation, 
nevertheless I believe these sorts of conversations are important, as 
they will very possibly help drive the shaping of the tools as they flesh 
out.

And yes, I do hope that the btrfs tools eventually get something 
comparable to mdadm -D and -E.  But I think it's equally important to 
realize that mdadm is actually a second-generation solution, that what 
we're looking at in mdadm is the end product of several years of it 
maturing plus the raid-tools solution before that, and that even then, 
those were already patterned after commercial and other raid product 
administration tools from before that.

Meanwhile, while there are some analogies between btrfs and md, and others 
between btrfs and zfs, really, this whole field of having filesystems do 
all that btrfs is attempting to do is relatively new ground, and we 
cannot and should not expect to directly compare the state of btrfs tools 
even after it's first declared stable, with the state of either mdadm or 
zfs tools, today.  It'll take some time to get there.

But get there I believe it will eventually do. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Help interpreting RAID1 space allocation
  2013-08-25 12:12     ` Duncan
@ 2013-08-25 19:13       ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2013-08-25 19:13 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Aug 25, 2013, at 6:12 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> 
> The degraded mount option does indeed simply ALLOW mounting without all 
> devices.  If all devices can be found, btrfs will still integrate them 
> all, regardless of the mount option.

I understand btrfs handling is necessarily different because array assembly and mounting aren't distinguished as they are with md and hardware raid. An md device won't come up on its own if not all members are available; you have to tell mdadm to assemble what it can. If that succeeds, the array is degraded but not mounted, and then you mount the degraded array. So what takes two steps for md raid is a single step with btrfs, and in that context it makes sense that intent is indicated with the degraded mount option.

Aside: I think a more conservative approach would be for -o degraded to also imply, by default, ro. Presently, when I specify -o degraded, I still get a rw mount.

> 
> Looked at in that way, therefore, having the degraded option remain when 
> all devices were found and integrated makes sense.  It's simply denoting 
> the historical fact at that point, that the degraded option was included 
> when mounting, and thus that it WOULD have mounted without all devices, 
> if it couldn't find them all, regardless of whether it found and 
> integrated all devices or not.

As far as I'm aware, nothing else in the mount line works based on history though. If I use -o rw, the line says rw. But if for some reason the kernel finds an inconsistency and drops the filesystem to ro, the mount line immediately says ro, not the rw used when mounting. If I -o remount,rw and the operation is successful, again the mount line reflects this.

But with a btrfs mount, even -o remount doesn't clear degraded once all devices are available. That's confusing.

> And hot-remove won't change the options used to mount, either, so 
> degraded won't (or shouldn't, I don't think it does but didn't actually 
> check that case personally) magically appear in the options due to the 
> hot-remove.

Even btrfs, when it detects certain problems, will change the mount state from rw to ro. There's every reason it could do the same thing when the volume becomes degraded during use. 

If the kernel doesn't export both volume and device state somehow, how does e.g. udisks know the volume is degraded, and which device is the source of the problem, so that the user can be informed in the desktop UI? And also, when the problem is rectified, that the volume is no longer degraded?

> 
>> To me, degraded is an array or volume state, not up to the user to set
>> as an option. So I'd like to know if the option is temporary, to more
>> easily handle a particular problem for now, but the intention is to
>> handle it better (differently) in the future.
> 
> Hopefully the above helped with that.

Yes, I agree it's both a state and a mount option.

>> I think it's a problem if there isn't a write-intent bitmap equivalent
>> for btrfs raid1/raid10, and right now there doesn't seem to be one.
> 
> As I explained I believe btrfs has even better.  It's simply that there's 
> no proper tools available to use it yet…

Understood.

Chris Murphy

Thread overview: 11+ messages
2013-08-24  0:05 Help interpreting RAID1 space allocation Joel Johnson
2013-08-24  4:24 ` Chris Murphy
2013-08-24  4:58   ` Chris Murphy
     [not found] ` <C0968F38-8432-41C3-B916-DC00C5C69B34@colorremedies.com>
     [not found]   ` <27E06913-DDBB-4B75-86D3-A8F6C5B09F99@colorremedies.com>
2013-08-24 11:56     ` Duncan
2013-08-24 17:24 ` Joel Johnson
2013-08-24 18:02   ` Joel Johnson
2013-08-24 20:30     ` Sandy McArthur
2013-08-25  2:42       ` Joel Johnson
2013-08-25  5:18   ` Chris Murphy
     [not found] ` <093e4c7d21b5e734c86fc4bb1703a69e@lixil.net>
     [not found]   ` <D8D710C1-2FEC-41E1-A139-522F351E0464@colorremedies.com>
2013-08-25 12:12     ` Duncan
2013-08-25 19:13       ` Chris Murphy
