linux-btrfs.vger.kernel.org archive mirror
* Allocator behaviour during device delete
@ 2016-06-09 12:34 Brendan Hide
From: Brendan Hide @ 2016-06-09 12:34 UTC (permalink / raw)
  To: linux-btrfs

Hey, all

I noticed this odd behaviour while migrating from a 1TB spindle to SSD 
(in this case on a LUKS-encrypted 200GB partition) - and am curious if 
this behaviour I've noted below is expected or known. I figure it is a 
bug. Depending on the situation, it *could* be severe. In my case it was 
simply annoying.

---
Steps

After adding the new device (btrfs dev add), I deleted the old 
device (btrfs dev del).

Then, whilst waiting for that to complete, I started a watch of "btrfs 
fi show /". Note that the below is very close to the output at the time 
- but is not actually copy/pasted from the output.

 > Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
 >         Total devices 2 FS bytes used 115.03GiB
 >         devid    1 size 0.00GiB used 298.06GiB path /dev/sda2
 >         devid    2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot


devid 1 is the old disk, while devid 2 is the new SSD.

After a few minutes, I saw that the numbers had changed - but that the 
SSD still had no data:

 > Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
 >         Total devices 2 FS bytes used 115.03GiB
 >         devid    1 size 0.00GiB used 284.06GiB path /dev/sda2
 >         devid    2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot

The "FS bytes used" amount was changing a lot - but mostly stayed near 
the original total, which is expected since there was very little 
happening other than the "migration".

I'm not certain of the exact point where it started using the new disk's 
space. I figure that would have been helpful to pinpoint. :-/

---
Educated guess as to what was happening:

Key: Though the available space on devid1 is displayed as 0 GiB, 
internally the allocator still sees most of the device's space as 
available. The allocator will continue writing to the old disk even 
though the intention is to remove it.

The dev delete operation goes through the chunks in sequence and does a 
"normal" balance operation on each, which the kernel simply sends to the 
"normal" single allocator. At the start of the operation, the allocator 
will see that the device of 1TB has more space available than the 200GB 
device, thus it writes the data to a new chunk on the 1TB spindle.

Only after the chunk is balanced away does the operation mark *only* 
that "source" chunk as being unavailable. As each chunk is subsequently 
balanced away, eventually the allocator will see that there is more 
space available on the new device than on the old device (devid 1: 
199GB free / devid 2: 200GB free), thus the next chunk gets allocated 
to the new device. The same occurs for the next chunk (devid 1: 198GB / 
devid 2: 199GB) and so on, until the old device finally has zero usage 
and is removed completely.
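
The guess above can be played through in a toy model. This is only a sketch 
of the hypothesised behaviour, not the kernel's actual allocator: the 
Device class, the simulate_delete function, the fixed chunk size and the 
tie-break in favour of the new device are all assumptions made for 
illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    size: int           # space the allocator still considers usable
    allocated: int = 0  # space currently held by chunks

    @property
    def free(self) -> int:
        return self.size - self.allocated


def simulate_delete(old: Device, new: Device, chunk: int = 1) -> int:
    """Relocate every chunk off `old`; return how many relocations wrote back to `old`."""
    writes_to_old = 0
    while old.allocated > 0:
        # Hypothesised "single" allocator: the device with the most free space wins.
        # max() returns the first maximum, so ties go to `new` (an assumption).
        target = max([new, old], key=lambda d: d.free)
        if target.free < chunk:
            raise RuntimeError("no room left to relocate a chunk")
        target.allocated += chunk
        # Only the source chunk is released and marked unavailable.
        old.allocated -= chunk
        old.size -= chunk
        if target is old:
            writes_to_old += 1
    return writes_to_old


old = Device("old-spindle", size=10, allocated=3)
new = Device("new-ssd", size=4)
print(simulate_delete(old, new), new.allocated)  # prints "5 3"
```

With these made-up numbers, three chunks need to move, yet five of the 
eight relocations land back on the device being removed before it is 
finally empty.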

---
Naive approach for a fix (assuming my assessment above is correct)

At the start:
1. "Balance away"/Mark-as-Unavailable empty space
2. Balance away the *current* chunks (data+metadata) that would 
otherwise be written to if the device were still available
3. As before, balance in whatever order is applicable.
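
As a sketch of step 1 only, under the same toy assumptions (a 
hypothetical Device class, fixed-size chunks, a "most free space wins" 
allocator), marking the old device's empty space as unavailable up front 
means no relocation can land on it. Step 2 is not modelled here, since 
this model has no notion of partially filled chunks.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    size: int           # space the allocator still considers usable
    allocated: int = 0  # space currently held by chunks

    @property
    def free(self) -> int:
        return self.size - self.allocated


def delete_with_free_space_masked(old: Device, new: Device, chunk: int = 1) -> int:
    """Relocate every chunk off `old`; return how many relocations wrote back to `old`."""
    # Step 1 of the proposed fix: empty space on `old` becomes unavailable first.
    old.size = old.allocated
    writes_to_old = 0
    while old.allocated > 0:
        # Same hypothetical allocator as before: most free space wins.
        target = max([new, old], key=lambda d: d.free)
        if target.free < chunk:
            raise RuntimeError("no room left to relocate a chunk")
        target.allocated += chunk
        # Release the source chunk and shrink `old` by the same amount.
        old.allocated -= chunk
        old.size -= chunk
        if target is old:
            writes_to_old += 1
    return writes_to_old


old = Device("old-spindle", size=10, allocated=3)
new = Device("new-ssd", size=4)
print(delete_with_free_space_masked(old, new), new.allocated)  # prints "0 3"
```

In this model the three chunks move in three relocations and nothing is 
written to the device being removed.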

---
Severity

I figure that, for my use-case, this isn't a severe issue. However, in 
the case where you want to quickly remove a potentially failing disk 
(a common use case for dev delete), I'd much rather that btrfs does *not* 
write data to the disk I'm trying to remove, making this a potentially 
severe bug.


-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

Thread overview: 7+ messages
2016-06-09 12:34 Allocator behaviour during device delete Brendan Hide
2016-06-09 13:07 ` Austin S. Hemmelgarn
2016-06-09 13:54   ` Brendan Hide
2016-06-10 19:26     ` Henk Slager
2016-06-10 19:58       ` Hans van Kranenburg
2016-06-10 22:31         ` Hans van Kranenburg
2016-06-13 11:09       ` Austin S. Hemmelgarn
