linux-btrfs.vger.kernel.org archive mirror
* evidence of persistent state, despite device disconnects
@ 2016-01-02 19:22 Chris Murphy
  2016-01-03 13:48 ` Duncan
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2016-01-02 19:22 UTC (permalink / raw)
  To: Btrfs BTRFS

OK, I basically do not trust the f'n kernel anymore. I'm having to
reboot in order to get to a (reasonably) deterministic state. Merely
disconnecting devices doesn't make all aspects of that device and its
filesystem vanish.

I think this persistence might be causing some Btrfs corruptions that
don't seem to make any sense. Here is one example that I've kept track
of every step of the way:


I have a Btrfs raid1 that fails to mount rw,degraded:
[  174.520303] BTRFS info (device sdc): allowing degraded mounts
[  174.520421] BTRFS info (device sdc): disk space caching is enabled
[  174.520527] BTRFS: has skinny extents
[  174.528060] BTRFS warning (device sdc): devid 1 uuid
94c62352-2568-4abe-8a58-828d1766719c is missing
[  177.924127] BTRFS: missing devices(1) exceeds the limit(0),
writeable mount is not allowed
[  177.950761] BTRFS: open_ctree failed

When mounted -o ro,degraded

[root@f23s ~]# btrfs fi df /mnt/brick2
Data, RAID1: total=502.00GiB, used=499.69GiB
Data, single: total=1.00GiB, used=2.00MiB
System, RAID1: total=32.00MiB, used=80.00KiB
System, single: total=32.00MiB, used=32.00KiB
Metadata, RAID1: total=2.00GiB, used=1008.22MiB
Metadata, single: total=1.00GiB, used=0.00B
GlobalReserve, single: total=352.00MiB, used=0.00B

What the F?

Because the last time it was normal/non-degraded and mounted, the only
chunks were raid1 chunks. Somehow, single chunks have been added and
used without any kernel messages to warn the user they no longer have
a raid1, in effect.

What *exactly* happened since this was an intact raid1 only, 2 device volume?

1. umount /mnt/brick           ##cleanly umounted
2. ## USB cables from the drives disconnected
3. lsblk and blkid see neither of them
4. devid1 is reconnected
5. devid1 is issued ATA security-erase-enhanced command via hdparm
6. devid1 is physically disconnected
7. oldidevid1 is luksformatted and opened
8. devid2 is connected
9.
[root@f23s ~]# lsblk -f
NAME   FSTYPE      LABEL   UUID                                 MOUNTPOINT
sdb    crypto_LUKS         493a7656-8fe6-46e9-88af-a0ffe83ced7e
└─sdb
sdc    btrfs       second  197606b2-9f4a-4742-8824-7fc93285c29c /mnt/brick2

[root@f23s ~]# btrfs fi show /mnt/brick2
Label: 'second'  uuid: 197606b2-9f4a-4742-8824-7fc93285c29c
    Total devices 2 FS bytes used 500.68GiB
    devid    1 size 697.64GiB used 504.03GiB path /dev/sdb
    devid    2 size 697.64GiB used 504.03GiB path /dev/sdc


WTF?! This shouldn't be possible. devid1 is *completely* obliterated.
It was securely erased. It has been luks formatted. It has been
disconnected multiple times (as has devid2). And yet Btrfs sees this
as an intact pair? That's just complete crap. *AND*

It lets me mount it! Not degraded! No error messages!

11. umount /mnt/brick2
12. Reboot
13. btrfs fi show
warning, device 1 is missing
warning devid 1 not found already
Label: 'second'  uuid: 197606b2-9f4a-4742-8824-7fc93285c29c
    Total devices 2 FS bytes used 500.68GiB
    devid    2 size 697.64GiB used 506.06GiB path /dev/sdc
    *** Some devices missing


14. # mount -o degraded, /dev/sdc /mnt/brick2
mount: wrong fs type, bad option, bad superblock on /dev/sdc

and the trace at the very top with bogus missing devices(1) exceeds
the limit(0), writeable mount is not allowed.

So during that not degraded mount of the file system where it saw a
ghost of devid1, it wrote single chunks to devid2. And now devid2 can
only ever be mounted read only. It's impossible to fix it, because I
can't add devices when ro mounted.

Does anyone have any idea what tool to use to explain how the devid1
/dev/sdb, which has been securely erased, luks formatted,
disconnected, reconnected, *STILL* results in Btrfs thinking it's a
valid drive and allowing a non-degraded mount until there's a reboot?
That's really scary.

It's like the btrfs kernel code isn't refreshing its own fs or dev
states when other parts of the kernel know it's gone. Maybe a 'btrfs
dev scan' would have cleared this up, but I shouldn't have to do that
to refresh Btrfs's state anytime I disconnect and connect devices just
to make sure it doesn't sabotage the devices by surreptitiously adding
single chunks to one of the drives!
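
For reference, the closest thing I know of to a manual reset, short of a
reboot, is a module reload plus rescan, assuming btrfs is built as a
module and nothing btrfs is mounted at the time (untested sketch):

# umount /mnt/brick2        ## every btrfs must be unmounted first
# modprobe -r btrfs         ## drops the kernel's remembered device list
# modprobe btrfs
# btrfs device scan         ## re-registers only devices that exist right now

Having to do any of that just because a cable was unplugged is still not
reasonable.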


-- 
Chris Murphy


* Re: evidence of persistent state, despite device disconnects
  2016-01-02 19:22 evidence of persistent state, despite device disconnects Chris Murphy
@ 2016-01-03 13:48 ` Duncan
  2016-01-03 21:33   ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Duncan @ 2016-01-03 13:48 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Sat, 02 Jan 2016 12:22:07 -0700 as excerpted:

> OK, I basically do not trust the f'n kernel anymore. I'm having to
> reboot in order to get to a (reasonably) deterministic state. Merely
> disconnecting devices doesn't make all aspects of that device and its
> filesystem vanish.

We already knew that btrfs itself doesn't track device state very well, 
and that a reboot (or, for those with btrfs as a module, a module 
unload/reload) was needed to fully clear state.  Are you suggesting it's 
more than that?

> I think this persistence might be causing some Btrfs corruptions that
> don't seem to make any sense. Here is one example that I've kept track
> of every step of the way:
> 
> I have a Btrfs raid1 that fails to mount rw,degraded:

[Shortening the UUIDs for easier 80-column posting.  I deleted them in 
the first attempt, but decided they were useful here, as UUIDs are about 
the only way to track what's what as you will see, in the absence of 
btrfs fi show, with mountpoints jumping between brick and brick2, with 
references to devids that we don't know anything about due to that lack 
of fi show output, etc.]

> [  174.520303] BTRFS info (device sdc): allowing degraded mounts
> [  174.520421] BTRFS info (device sdc): disk space caching is enabled
> [  174.520527] BTRFS: has skinny extents
> [  174.528060] BTRFS warning (device sdc):
> devid 1 uuid [...]-828d1766719c is missing
> [  177.924127] BTRFS: missing devices(1) exceeds the limit(0),
> writeable mount is not allowed
> [  177.950761] BTRFS: open_ctree failed

That's the -828 UUID...

OK, looks like your "raid1" must have some single or raid0 chunks, which 
have a missing device limit of 0.

BTW, what kernel?  You don't say.

Meanwhile, I lost track of whether the patch set to do per-chunk 
evaluation of whether it's all there, thereby allowing degraded,rw 
mounting of multi-device filesystems with single chunks only on available 
devices, ever made it in, and if so, in which kernel.

I /think/ they were too late to make it into 4.3, but should have made it 
into 4.4.  But unfortunately, neither the 4.3 nor the 4.4 kernel btrfs changes 
are up on the wiki yet, and to confirm it in git I'd have to go back and 
figure out what those patches were named, which I'm too lazy to do ATM.
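
If someone less lazy than I am wants to check, something along these
lines against a kernel git tree should answer it (only a sketch; the
grep terms are guesses at what the patch subjects might contain):

git log --oneline v4.3.. -- fs/btrfs/ | grep -iE 'degrade|missing'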

But of course without a reported kernel here, knowing whether they made 
it in and for what kernel wouldn't help, despite that information 
apparently being apropos to the situation.

> When mounted -o ro,degraded
> 
> [root@f23s ~]# btrfs fi df /mnt/brick2
> Data, RAID1: total=502.00GiB, used=499.69GiB
> Data, single: total=1.00GiB, used=2.00MiB
> System, RAID1: total=32.00MiB, used=80.00KiB
> System, single: total=32.00MiB, used=32.00KiB
> Metadata, RAID1: total=2.00GiB, used=1008.22MiB
> Metadata, single: total=1.00GiB, used=0.00B
> GlobalReserve, single: total=352.00MiB, used=0.00B
> 
> What the F?

OK, there we have the btrfs fi df.  But there's no btrfs fi show.  And 
you posted the dmesg from the mount, but didn't give the commandline, so 
we have nothing connecting the btrfs fi df /mnt/brick2 (note the brick2), 
to the above dmesg output.  No mount commandline, no btrfs fi show, 
nothing else, at this point.

> Because the last time it was normal/non-degraded and mounted, the only
> chunks were raid1 chunks. Somehow, single chunks have been added and
> used without any kernel messages to warn the user they no longer have a
> raid1, in effect.
> 
> What *exactly* happened since this was an intact raid1 only, 2 device
> volume?
> 
> 1. umount /mnt/brick           ##cleanly umounted

OK, the above fi df was for /mnt/brick2.  Here you're umounting
/mnt/brick.  **NOT** the same mountpoint.  So **NOT** cleanly umounted, 
as that's an entirely different filesystem.  Unless you did a copy/pasto 
and you actually umounted brick2.

But that's not what it says...

> 2. ## USB cables from the drives disconnected
> 3. lsblk and blkid see neither of them
> 4. devid1 is reconnected

Wait... devid1?  For brick or brick2?  Either way, we have no idea what 
devid1 is, because we don't have a btrfs fi show.


Honestly, CMurphy, your posts are /normally/ much more coherent than 
this.  Joking, but serious, are you still recovering from your new year's 
partying?  There's too many missing pieces and inconsistencies here.  
It's not like your normal posts.

> 5. devid1 is issued ATA security-erase-enhanced command via hdparm
> 6. devid1 is physically disconnected
> 7. oldidevid1 is luksformatted and opened

Oldidevid1?  Is that old devid1?  You said it was physically 
disconnected.  Nothing about reconnection.  So was it reconnected and 
luksformatted, or is this a different device, presumably from some much 
older btrfs devid1?

> 8. devid2 is connected
> 9. [root@f23s ~]# lsblk -f
> NAME   FSTYPE      LABEL   UUID               MOUNTPOINT
> sdb    crypto_LUKS         [...]-a0ffe83ced7e
> └─sdb
> sdc    btrfs       second  [...]-7fc93285c29c /mnt/brick2
> 
> [root@f23s ~]# btrfs fi show /mnt/brick2
> Label: 'second'  uuid: [...]-7fc93285c29c
>     Total devices 2 FS bytes used 500.68GiB
>     devid    1 size 697.64GiB used 504.03GiB path /dev/sdb
>     devid    2 size 697.64GiB used 504.03GiB path /dev/sdc

UUIDs:  No -828 UUID to match the dmesg output above.  The -a0ff UUID is 
new, apparently from the luksformatting in #7, and the -7fc UUID matches 
between the lsblk and (NOW we get it!!) btrfs fi show, but isn't the -828 
UUID in the dmesg above, so that dmesg segment is presumably for some 
other btrfs.  Note that with all the device disconnection and reconnection 
going on, the /dev/sdc here wouldn't be expected to be the same device as 
the /dev/sdc in the dmesg above, so mismatching UUIDs despite matching 
/dev/sdc device-paths isn't at all unexpected.

Which would seem to imply that while we have a btrfs fi show now, it's 
not the btrfs in the dmesg above, because the UUIDs don't match.  Either 
that or the UUID in the dmesg isn't the filesystem UUID but rather the 
device UUID.  But I can't verify that right now as the dmesg output for a 
whole device doesn't list UUIDs, only the nominal device node (nominal 
being the one used to mount, on multi-device btrfs).  Either way, the UUID 
in the dmesg from the btrfs mount error doesn't match any other UUID 
we've seen, yet.

Meanwhile, both these show a mounted btrfs on /mnt/brick2, but there's no 
mount in the sequence above.  Based on the sequence above, nothing should 
be mounted at /mnt/brick2.

But at this point there's enough odd and nonsensical about what we know 
and don't know from the post so far that this really isn't surprising...

> WTF?! This shouldn't be possible. devid1 is *completely* obliterated.
> It was securely erased. It has been luks formatted. It has been
> disconnected multiple times (as has devid2). And yet Btrfs sees this as
> an intact pair? That's just complete crap. *AND*

Why would you expect it to make any sense?  The rest of the post doesn't.

> It lets me mount it! Not degraded! No error messages!

Oh, here we're talking about a mount!  But as I said, no mount in the 
sequence!  At this point it's just entertainment.  I'm not even trying to 
make sense of it any longer!

Meanwhile, we have #9 above, and #11, below, but no #10.  I guess the 
btrfs fi show is supposed to be #10.  Or maybe #9 was supposed to be #10 
and include both the lsblk and the btrfs fi show, and #9 was supposed to 
be the mount we're missing.  Either way, more to not make any sense in a 
post that already made no sense. <shrug>

> 11. umount /mnt/brick2
> 12. Reboot
> 13. btrfs fi show
> warning, device 1 is missing
> warning devid 1 not found already
> Label: 'second'  uuid: [...]-7fc93285c29c
>     Total devices 2 FS bytes used 500.68GiB
>     devid    2 size 697.64GiB used 506.06GiB path /dev/sdc
>     *** Some devices missing

OK, the -7fc UUID that was previously mounted on /mnt/brick2...

And this is a btrfs fi show, without path, so it should list all btrfs in 
the system, mounted or not.  No others shown.  Whatever happened to the 
/mnt/brick filesystem umounted in #1, or the -828 UUID the dmesg at the 
top complaining about a missing device was complaining about?  No clue.

But there was no btrfs device scan done before that btrfs fi show.  Maybe 
that's why.  Or maybe it's because the other btrfs entries were manually 
edited out here.

> 14. # mount -o degraded, /dev/sdc /mnt/brick2
> mount: wrong fs type, bad option, bad superblock on /dev/sdc
> 
> and the trace at the very top with bogus missing devices(1) exceeds the
> limit(0), writeable mount is not allowed.
> 
> So during that not degraded mount of the file system where it saw a
> ghost of devid1, it wrote single chunks to devid2. And now devid2 can
> only ever be mounted read only. It's impossible to fix it, because I
> can't add devices when ro mounted.

The sequence still doesn't show where you actually did that mount that 
actually worked, only the one in #14 that didn't work, or what command 
you might have used.

And the umount in #1 was apparently for an entirely different /mnt/brick, 
while the lsblk and btrfs fi show in #9 clearly show /mnt/brick2, which, 
if the sequence above is to be believed, remained mounted the entire 
time, including while you unplugged its devices, plugged them back in, 
ATA secure-erased one, and then luksformatted it (tho you don't record 
the actual commands used, so we don't know for sure you got the devices 
correct, particularly in light of your already mixing up brick and 
brick2), all on a btrfs that we already know doesn't track device 
disappearance particularly well.

In which case, I can see the still-mounted btrfs trying to write raid1 
and, failing that, creating single chunks on the devices it could still 
see and write to.

But that's very much not the only thing mixed up here!

Meanwhile, if your kernel is one without the per-chunk patches mentioned 
above, it could well be that the single chunks listed in that btrfs fi df 
are indeed there, intact, and that it didn't try to write to the other 
device at all.  In fact, the presence of those single-mode chunks 
indicates that it indeed *did* sense the missing other device at some 
point, and wrote single chunks instead of raid1 chunks as a result.  With 
a kernel with those per-chunk tracking patches, it might well mount 
degraded,rw, and you may well have everything there, despite the entirely 
mixed up series of events above that make absolutely no sense as reported.
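
If your kernel does turn out to have them, it may be worth simply trying
(sketch, adjust the device node to whatever it currently is):

# mount -o degraded,rw /dev/sdc /mnt/brick2
# btrfs fi df /mnt/brick2

and seeing whether it mounts writable and what the chunk listing says.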

> Does anyone have any idea what tool to use to explain how the devid1
> /dev/sdb, which has been securely erased, luks formatted,
> disconnected, reconnected, *STILL* results in Btrfs thinking it's a
> valid drive and allowing a non-degraded mount until there's a reboot?
> That's really scary.
> 
> It's like the btrfs kernel code isn't refreshing its own fs or dev
> states when other parts of the kernel know it's gone. Maybe a 'btrfs dev
> scan' would have cleared this up, but I shouldn't have to do that to
> refresh Btrfs's state anytime I disconnect and connect devices just to
> make sure it doesn't sabotage the devices by surreptitiously adding
> single chunks to one of the drives!

Based on the evidence, I'd guess that you actually mounted it degraded,rw, 
somewhere along the line, and it wrote those single-mode chunks at that 
point.  Further, whatever kernel you're running, I'd guess it doesn't 
have the fairly recent patches checking data/metadata availability per-
chunk, and thus is exhibiting the known pre-patch behavior of refusing a 
second degraded,rw mount when the first put some single chunks on the 
existing drive, despite the contents of those chunks and thus the entire 
filesystem, still being available.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: evidence of persistent state, despite device disconnects
  2016-01-03 13:48 ` Duncan
@ 2016-01-03 21:33   ` Chris Murphy
  2016-01-05 14:50     ` Duncan
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2016-01-03 21:33 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

kernel-4.4.0-0.rc6.git0.1.fc24.x86_64
btrfs-progs 4.3.1

There was some copy pasting, hence /mnt/brick vs /mnt/brick2
confusion, but the volume was always cleanly mounted and umounted.

The biggest problem I have with all of this is the completely silent
addition of single chunks. That made the volume, in effect, no longer
completely raid1. No other details matter, except to try and reproduce
the problem, and find its source so it can be fixed. It is a bug,
because it's definitely not sane or expected behavior at all.


Chris Murphy


* Re: evidence of persistent state, despite device disconnects
  2016-01-03 21:33   ` Chris Murphy
@ 2016-01-05 14:50     ` Duncan
  2016-01-05 21:47       ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Duncan @ 2016-01-05 14:50 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Sun, 03 Jan 2016 14:33:40 -0700 as excerpted:

> kernel-4.4.0-0.rc6.git0.1.fc24.x86_64 btrfs-progs 4.3.1
> 
> There was some copy pasting, hence /mnt/brick vs /mnt/brick2 confusion,
> but the volume was always cleanly mounted and umounted.
> 
> The biggest problem I have with all of this is the completely silent
> addition of single chunks. That made the volume, in effect, no longer
> completely raid1. No other details matter, except to try and reproduce
> the problem, and find its source so it can be fixed. It is a bug,
> because it's definitely not sane or expected behavior at all.

If there's no way you mounted it degraded,rw at any point, I agree, 
single mode chunks are unexpected on a raid1 for both data and metadata, 
and it's a bug -- possibly actually related to that new code that allows 
degraded,rw recovery via per-chunk checks.

If however you mounted it degraded,rw at some point, then I'd say the bug 
is in wetware, as in that case, based on my understanding, it's working 
as intended.  I was inclined to believe that was what happened based on 
the obviously partial sequence in the earlier post, but if you say you 
didn't... then it's all down to duplication and finding why it's suddenly 
reverting to single mode on non-degraded mounts, which indeed /is/ a bug.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: evidence of persistent state, despite device disconnects
  2016-01-05 14:50     ` Duncan
@ 2016-01-05 21:47       ` Chris Murphy
  2016-01-09 10:55         ` Duncan
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2016-01-05 21:47 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Tue, Jan 5, 2016 at 7:50 AM, Duncan <1i5t5.duncan@cox.net> wrote:

>
> If however you mounted it degraded,rw at some point, then I'd say the bug
> is in wetware, as in that case, based on my understanding, it's working
> as intended.  I was inclined to believe that was what happened based on
> the obviously partial sequence in the earlier post, but if you say you
> didn't... then it's all down to duplication and finding why it's suddenly
> reverting to single mode on non-degraded mounts, which indeed /is/ a bug.

Clearly I will have to retest.

But even as rw,degraded, it doesn't matter, that'd still be a huge
bug. There's no possible way you'll convince me this is a user
misunderstanding. Nowhere is this documented.

I made the fs using mkfs.btrfs -draid1 -mraid1. There is no way the
fs, under any circumstance, legitimately creates and uses any other
profile for any chunk type, ever. Let alone silently.

-- 
Chris Murphy


* Re: evidence of persistent state, despite device disconnects
  2016-01-05 21:47       ` Chris Murphy
@ 2016-01-09 10:55         ` Duncan
  2016-01-09 22:29           ` Chris Murphy
  2016-01-10 16:54           ` Goffredo Baroncelli
  0 siblings, 2 replies; 9+ messages in thread
From: Duncan @ 2016-01-09 10:55 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Tue, 05 Jan 2016 14:47:52 -0700 as excerpted:

> On Tue, Jan 5, 2016 at 7:50 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> 
> 
>> If however you mounted it degraded,rw at some point, then I'd say the
>> bug is in wetware, as in that case, based on my understanding, it's
>> working as intended.  I was inclined to believe that was what happened
>> based on the obviously partial sequence in the earlier post, but if you
>> say you didn't... then it's all down to duplication and finding why
>> it's suddenly reverting to single mode on non-degraded mounts, which
>> indeed /is/ a bug.
> 
> Clearly I will have to retest.
> 
> But even as rw,degraded, it doesn't matter, that'd still be a huge bug.
> There's no possible way you'll convince me this is a user
> misunderstanding. Nowhere is this documented.
> 
> I made the fs using mkfs.btrfs -draid1 -mraid1. There is no way the fs,
> under any circumstance, legitimately creates and uses any other profile
> for any chunk type, ever. Let alone silently.

If you're mounting degraded,rw, and you're down to a single device on a 
raid1, then once the existing chunks fill up, it /has/ to create single 
chunks, because it can't create them raid1 as there's not enough devices 
(a minimum of two devices are required to create raid1 chunks, since two 
copies are required and they can't be on the same device).

And by mounting degraded,rw you've given it permission to create those 
single mode chunks if it has to, so it's not "silent", as you've 
explicitly mounted it degraded,rw, and single is what raid1 degrades to 
when there's only one device.

And with automatic empty-chunk deletion, existing chunks can fill up 
pretty fast...

Further, NOT letting it write single chunks when an otherwise raid1 btrfs 
is mounted in degraded,rw mode would very possibly prevent you from 
repairing the filesystem with a btrfs replace or btrfs device add and 
delete.  And we've been there, done that, just slightly differently: you 
could only mount degraded,rw until a single-mode chunk was written, after 
which you could only mount degraded,ro, and then couldn't repair.  That 
is exactly the problem the per-chunk check patches, vs. the old 
filesystem-scope check, were designed to eliminate.
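
For the record, the usual repair sequence on a two-device raid1 that has
lost a device, assuming the kernel will give you a degraded,rw mount at
all, is roughly this (sketch only, device names are placeholders):

# mount -o degraded,rw /dev/sdc /mnt/brick2
# btrfs replace start 1 /dev/sdd /mnt/brick2      ## replace missing devid 1
## ... or: btrfs device add /dev/sdd /mnt/brick2
##         btrfs device delete missing /mnt/brick2
# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/brick2

The soft filter leaves chunks that are already raid1 alone, so only
whatever got written single while degraded gets rewritten as raid1.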

But as I said, if it's creating single chunks when you did /not/ have it 
mounted degraded, then you indeed have found a bug, and figuring out how 
to replicate it so it can be properly traced and fixed is where we're 
left, as I can't see how anyone would find that not a bug.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: evidence of persistent state, despite device disconnects
  2016-01-09 10:55         ` Duncan
@ 2016-01-09 22:29           ` Chris Murphy
  2016-01-10  5:34             ` Duncan
  2016-01-10 16:54           ` Goffredo Baroncelli
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2016-01-09 22:29 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Sat, Jan 9, 2016 at 3:55 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Chris Murphy posted on Tue, 05 Jan 2016 14:47:52 -0700 as excerpted:
>
>> On Tue, Jan 5, 2016 at 7:50 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>
>>
>>> If however you mounted it degraded,rw at some point, then I'd say the
>>> bug is in wetware, as in that case, based on my understanding, it's
>>> working as intended.  I was inclined to believe that was what happened
>>> based on the obviously partial sequence in the earlier post, but if you
>>> say you didn't... then it's all down to duplication and finding why
>>> it's suddenly reverting to single mode on non-degraded mounts, which
>>> indeed /is/ a bug.
>>
>> Clearly I will have to retest.
>>
>> But even as rw,degraded, it doesn't matter, that'd still be a huge bug.
>> There's no possible way you'll convince me this is a user
>> misunderstanding. Nowhere is this documented.
>>
>> I made the fs using mkfs.btrfs -draid1 -mraid1. There is no way the fs,
>> under any circumstance, legitimately creates and uses any other profile
>> for any chunk type, ever. Let alone silently.
>
> If you're mounting degraded,rw, and you're down to a single device on a
> raid1, then once the existing chunks fill up, it /has/ to create single
> chunks, because it can't create them raid1 as there's not enough devices
> (a minimum of two devices are required to create raid1 chunks, since two
> copies are required and they can't be on the same device).
>
> And by mounting degraded,rw you've given it permission to create those
> single mode chunks if it has to, so it's not "silent", as you've
> explicitly mounted it degraded,rw, and single is what raid1 degrades to
> when there's only one device.

It is esoteric for mortal users (let alone undocumented) that
degraded,rw means single chunks will be made, and that new data is then
no longer replicated once the bad device is replaced and the volume
scrubbed.

There's an incongruency between the promise of "fault tolerance,
repair, and easy administration" and the esoteric reality. This is not
easy, this is a gotcha. I'll bet almost no users have any idea this is
how rw,degraded behaves and the risk it entails.



-- 
Chris Murphy


* Re: evidence of persistent state, despite device disconnects
  2016-01-09 22:29           ` Chris Murphy
@ 2016-01-10  5:34             ` Duncan
  0 siblings, 0 replies; 9+ messages in thread
From: Duncan @ 2016-01-10  5:34 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Sat, 09 Jan 2016 15:29:31 -0700 as excerpted:

> On Sat, Jan 9, 2016 at 3:55 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>
>> If you're mounting degraded,rw, and you're down to a single device on a
>> raid1, then once the existing chunks fill up, it /has/ to create single
>> chunks, because it can't create them raid1 as there's not enough
>> devices (a minimum of two devices are required to create raid1 chunks,
>> since two copies are required and they can't be on the same device).
>>
>> And by mounting degraded,rw you've given it permission to create those
>> single mode chunks if it has to, so it's not "silent", as you've
>> explicitly mounted it degraded,rw, and single is what raid1 degrades to
>> when there's only one device.
> 
> It is esoteric for mortal users (let alone undocumented) that
> degraded,rw means single chunks will be made, and that new data is then
> no longer replicated once the bad device is replaced and the volume
> scrubbed.
> 
> There's an incongruency between the promise of "fault tolerance, repair,
> and easy administration" and the esoteric reality. This is not easy,
> this is a gotcha. I'll bet almost no users have any idea this is how
> rw,degraded behaves and the risk it entails.

Certainly, documentation is an issue.  But while the degraded option 
doesn't force degraded, only allows it if there are missing devices, it's 
not recommended, and this is one reason why.  Using the degraded option 
really /does/ give the filesystem permission to break the rules that 
would apply in normal operation, and adding it to your mount options 
shouldn't be done lightly or routinely.  Ideally, it's /only/ added after 
a device fails, in order to be able to mount the filesystem and replace 
the failing/failed device with a new one, or reshape the filesystem to 
one less device if a new one isn't to be added.

OTOH, if there are three devices in the raid1, and all three have 
unallocated space, then loss of a device shouldn't result in single-mode 
chunks even when mounting degraded, because it's still possible in that 
case to create raid1 chunks, as there are still two devices with free 
space available.  Again, creation of single chunks in that case would be 
available.  Again, creation of single chunks in that case would be a bug.
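
Checking whether that condition still holds is easy enough, since raid1
chunk allocation only needs unallocated space on at least two remaining
devices (sketch, mountpoint is a placeholder):

# btrfs fi usage /mnt/brick2      ## per-device Unallocated listed at the end
## or approximately: btrfs fi show /mnt/brick2  (size minus used, per devid)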

But I think we're past the effective argument point and pretty much just 
restating our positions at this point.  Given that I'm definitely not a 
btrfs coder, and that to my knowledge, while you may well read the code 
and do occasional trivial patches, you're not really a btrfs coder 
either, alleviating that documentation issue, which we both agree is 
there, is the best either of us can really do.  The rest remains with the 
real btrfs coders, and arguing further about it as non-btrfs-devs isn't 
going to help.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: evidence of persistent state, despite device disconnects
  2016-01-09 10:55         ` Duncan
  2016-01-09 22:29           ` Chris Murphy
@ 2016-01-10 16:54           ` Goffredo Baroncelli
  1 sibling, 0 replies; 9+ messages in thread
From: Goffredo Baroncelli @ 2016-01-10 16:54 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2016-01-09 11:55, Duncan wrote:
> (a minimum of two devices are required to create raid1 chunks, since two
> copies are required and they can't be on the same device).

I think that this is the problem: BTRFS should allocate a new chunk as RAID1 even if only one device is available. It is already capable of using a RAID1 chunk in degraded mode, so it shouldn't be so difficult to create a new RAID1 chunk when only one disk is available.

Anyway, I agree with Chris that btrfs sometimes gives incorrect information about the devices. In the past I proposed abandoning the current model where devices are "pre-registered" asynchronously, before the mount command.
I wrote a mount helper which does a scan at mount time [1]; this would reduce the time window in which a disappearing device could cause confusion.
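
The idea is nothing more complicated than wrapping the mount with a scan;
roughly like this (only a sketch of the approach, not the actual helper
from [1]):

#!/bin/sh
## /sbin/mount.btrfs (hypothetical): refresh the kernel's btrfs device
## registry immediately before mounting, so the mount decision is based
## on the devices that exist right now, not on an earlier scan.
btrfs device scan >/dev/null 2>&1
exec mount -i -t btrfs "$@"         ## -i: do not re-invoke this helper

mount(8) calls /sbin/mount.<fstype> automatically when it exists, so with
such a helper installed every btrfs mount would do a fresh scan first.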

BR
G.Baroncelli


[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg39429.html
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


end of thread

Thread overview: 9+ messages
2016-01-02 19:22 evidence of persistent state, despite device disconnects Chris Murphy
2016-01-03 13:48 ` Duncan
2016-01-03 21:33   ` Chris Murphy
2016-01-05 14:50     ` Duncan
2016-01-05 21:47       ` Chris Murphy
2016-01-09 10:55         ` Duncan
2016-01-09 22:29           ` Chris Murphy
2016-01-10  5:34             ` Duncan
2016-01-10 16:54           ` Goffredo Baroncelli
