* Better distribution of RAID1 data?
From: Brian B @ 2019-02-15 15:40 UTC
To: linux-btrfs

It looks like the btrfs code currently uses the total space available on
a disk to determine where it should place the two copies of a file in
RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
space instead of the number of free bytes?

For example, I have two disks in my array that are 8 TB, plus an
assortment of 3, 4, and 1 TB disks. With the current allocation code,
btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
files, then it will start using the 4 TB disks, then eventually the 3,
and finally the 1 TB disks. If the code used a percentage figure
instead, it would spread the allocations much more evenly across the
drives, ideally spreading load and reducing drive wear.

Is there a reason this is done this way, or is it just something that
hasn't had time for development?
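To make the difference concrete, here is a rough model of the two allocation
policies under discussion, written as a small Python sketch. It is not the
actual btrfs allocator; picking by most free bytes is only a loose stand-in
for the current behaviour, and the device sizes mirror the 8/8/4/3/1 TB array
described above.

    # Minimal model of RAID1 chunk-device selection; a sketch, not btrfs kernel code.
    # Sizes are in GB (approximate); the device list follows Brian's example array.
    devices = [
        {"name": "8TB-a", "total": 8000, "free": 8000},
        {"name": "8TB-b", "total": 8000, "free": 8000},
        {"name": "4TB",   "total": 4000, "free": 4000},
        {"name": "3TB",   "total": 3000, "free": 3000},
        {"name": "1TB",   "total": 1000, "free": 1000},
    ]

    def pick_by_free_bytes(devs):
        # Loosely the current policy: the two devices with the most unallocated bytes.
        return sorted(devs, key=lambda d: d["free"], reverse=True)[:2]

    def pick_by_free_fraction(devs):
        # The proposed policy: the two devices with the highest free/total ratio.
        return sorted(devs, key=lambda d: d["free"] / d["total"], reverse=True)[:2]

    def simulate(pick, chunk=1):
        # Allocate 1 GB RAID1 chunks (two copies on two devices) until space runs
        # out, then report how much ended up allocated on each device.
        devs = [dict(d) for d in devices]
        while True:
            a, b = pick(devs)
            if a["free"] < chunk or b["free"] < chunk:
                break
            a["free"] -= chunk
            b["free"] -= chunk
        return {d["name"]: d["total"] - d["free"] for d in devs}

    print("by free bytes:   ", simulate(pick_by_free_bytes))
    print("by free fraction:", simulate(pick_by_free_fraction))

Running the sketch shows how the two policies distribute chunks across the
devices over time; it says nothing about the IO or wear consequences, which is
what the rest of the thread debates.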
* Re: Better distribution of RAID1 data?
From: Hugo Mills @ 2019-02-15 15:55 UTC
To: Brian B; +Cc: linux-btrfs

On Fri, Feb 15, 2019 at 10:40:56AM -0500, Brian B wrote:
> It looks like the btrfs code currently uses the total space available on
> a disk to determine where it should place the two copies of a file in
> RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> space instead of the number of free bytes?

I don't think it'll make much difference. I spent a long time a couple
of years ago trying to prove (mathematically) that the current strategy
always produces an optimal usage of the available space -- I wasn't able
to complete the theorem, but a lot of playing around with it convinced
me that if there are cases where it's non-optimal, at least they're
bizarre corner cases.

> For example, I have two disks in my array that are 8 TB, plus an
> assortment of 3, 4, and 1 TB disks. With the current allocation code,
> btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> files, then it will start using the 4 TB disks, then eventually the 3,
> and finally the 1 TB disks. If the code used a percentage figure
> instead, it would spread the allocations much more evenly across the
> drives, ideally spreading load and reducing drive wear.
>
> Is there a reason this is done this way, or is it just something that
> hasn't had time for development?

I'd guess it's the easiest algorithm to use, plus it seems to provide
optimal space usage (almost?) all of the time.

   Hugo.

-- 
Hugo Mills             | Be pure.
hugo@... carfax.org.uk | Be vigilant.
http://carfax.org.uk/  | Behave.
PGP: E2AB1DE4          | Torquemada, Nemesis
* Re: Better distribution of RAID1 data?
From: Austin S. Hemmelgarn @ 2019-02-15 16:54 UTC
To: Brian B, linux-btrfs

On 2019-02-15 10:40, Brian B wrote:
> It looks like the btrfs code currently uses the total space available on
> a disk to determine where it should place the two copies of a file in
> RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> space instead of the number of free bytes?
>
> For example, I have two disks in my array that are 8 TB, plus an
> assortment of 3, 4, and 1 TB disks. With the current allocation code,
> btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> files, then it will start using the 4 TB disks, then eventually the 3,
> and finally the 1 TB disks. If the code used a percentage figure
> instead, it would spread the allocations much more evenly across the
> drives, ideally spreading load and reducing drive wear.
>
> Is there a reason this is done this way, or is it just something that
> hasn't had time for development?

It's simple to implement, easy to verify, runs fast, produces optimal or
near optimal space usage in pretty much all cases, and is highly
deterministic.

Using percentages reduces the simplicity, ease of verification, and speed
(division is still slow on most CPUs, and you need division for
percentages), and is likely to not be as deterministic (both because the
choice of first devices is harder when all are 100% empty, and because of
potential rounding errors), and probably won't produce optimal layouts
quite as reliably (you either need to get into floating-point math, which
is to be avoided in the kernel whenever possible, or you end up with much
more quantized disk selection).

I could see an adapted percentage method that preferentially spreads
across disks whenever possible _possibly_ making sense once we can
properly parallelize disk access in BTRFS, but until then I see no reason
to change something that is already working reasonably well.

In your particular case, I'd actually suggest using something under BTRFS
to merge the smaller disks to get as many devices as close to 8TB as
possible. That should help spread load better.
* Re: Better distribution of RAID1 data?
From: Zygo Blaxell @ 2019-02-15 19:50 UTC
To: Austin S. Hemmelgarn; +Cc: Brian B, linux-btrfs

On Fri, Feb 15, 2019 at 11:54:57AM -0500, Austin S. Hemmelgarn wrote:
> On 2019-02-15 10:40, Brian B wrote:
> > It looks like the btrfs code currently uses the total space available on
> > a disk to determine where it should place the two copies of a file in
> > RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> > space instead of the number of free bytes?
> >
> > For example, I have two disks in my array that are 8 TB, plus an
> > assortment of 3, 4, and 1 TB disks. With the current allocation code,
> > btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> > files, then it will start using the 4 TB disks, then eventually the 3,
> > and finally the 1 TB disks. If the code used a percentage figure
> > instead, it would spread the allocations much more evenly across the
> > drives, ideally spreading load and reducing drive wear.

Spreading load should make all the drives wear at the same rate (or a
rate proportional to size). That would be a gain for the big disks but a
loss for the smaller ones.

> > Is there a reason this is done this way, or is it just something that
> > hasn't had time for development?
> It's simple to implement, easy to verify, runs fast, produces optimal or
> near optimal space usage in pretty much all cases, and is highly
> deterministic.
>
> Using percentages reduces the simplicity, ease of verification, and speed
> (division is still slow on most CPUs, and you need division for
> percentages), and is likely to not be as deterministic (both because the

A few integer divides _per GB of writes_ is not going to matter.
raid5 profile does a 64-bit modulus operation on every stripe to locate
parity blocks.

> choice of first devices is harder when all are 100% empty, and because of
> potential rounding errors), and probably won't produce optimal layouts quite
> as reliably (you either need to get into floating-point math (which is to be
> avoided in the kernel whenever possible), or you end up with much more
> quantized disk selection).
>
> I could see an adapted percentage method that preferentially spreads across
> disks whenever possible _possibly_ making sense once we can properly
> parallelize disk access in BTRFS, but until then I see no reason to change
> something that is already working reasonably well.
>
> In your particular case, I'd actually suggest using something under BTRFS to
> merge the smaller disks to get as many devices as close to 8TB as possible.
> That should help spread load better.

That would change the distribution of risk. The aggregated small disks
would have a higher failure rate (because they don't have redundancy and
there are more active components to fail) than the individual big drives.

Brian can go the other way: use 'btrfs fi resize' to limit the size of
the large disks to match the size of the smaller disks, then resize the
bigger disks to the next size tier as the smaller disks become full.
I'm not sure _why_ you'd want to do this, though. The practical gains on
any metric that I know of would be negligible. Once there's enough data
on the filesystem, IO would be roughly distributed according to device
size no matter how the data was arranged. In any case, you can easily
experiment with this now, and show us data to support your theory. ;)

There is a possible gain on metadata placement (i.e. try to put metadata
on small SSDs while the data goes on big HDDs), but once the disks reach
equilibrium, metadata placement would revert to random. If you want to
place metadata on SSD then it would be better to modify the code to do
that explicitly.
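As an illustration of the 'btrfs fi resize' approach described above, the
commands might look like the following. The device IDs, the 4 TB cap, and the
mount point are assumptions made up for the example, not details taken from
Brian's array.

    # Illustration only: the device IDs (1 and 2) and the mount point are
    # assumptions for the example, not details of Brian's actual setup.
    # Cap each 8 TB device at 4 TB so the allocator stops preferring them:
    btrfs filesystem resize 1:4T /mnt/array
    btrfs filesystem resize 2:4T /mnt/array

    # Later, once the smaller disks have filled up, grow them back out:
    btrfs filesystem resize 1:max /mnt/array
    btrfs filesystem resize 2:max /mnt/array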
* Re: Better distribution of RAID1 data?
From: Austin S. Hemmelgarn @ 2019-02-15 19:55 UTC
To: Zygo Blaxell; +Cc: Brian B, linux-btrfs

On 2019-02-15 14:50, Zygo Blaxell wrote:
> On Fri, Feb 15, 2019 at 11:54:57AM -0500, Austin S. Hemmelgarn wrote:
>> On 2019-02-15 10:40, Brian B wrote:
>>> It looks like the btrfs code currently uses the total space available on
>>> a disk to determine where it should place the two copies of a file in
>>> RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
>>> space instead of the number of free bytes?
>>>
>>> For example, I have two disks in my array that are 8 TB, plus an
>>> assortment of 3, 4, and 1 TB disks. With the current allocation code,
>>> btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
>>> files, then it will start using the 4 TB disks, then eventually the 3,
>>> and finally the 1 TB disks. If the code used a percentage figure
>>> instead, it would spread the allocations much more evenly across the
>>> drives, ideally spreading load and reducing drive wear.
>
> Spreading load should make all the drives wear at the same rate (or a rate
> proportional to size). That would be a gain for the big disks but a
> loss for the smaller ones.
>
>>> Is there a reason this is done this way, or is it just something that
>>> hasn't had time for development?
>> It's simple to implement, easy to verify, runs fast, produces optimal or
>> near optimal space usage in pretty much all cases, and is highly
>> deterministic.
>>
>> Using percentages reduces the simplicity, ease of verification, and speed
>> (division is still slow on most CPUs, and you need division for
>> percentages), and is likely to not be as deterministic (both because the
>
> A few integer divides _per GB of writes_ is not going to matter.
> raid5 profile does a 64-bit modulus operation on every stripe to locate
> parity blocks.

It really depends on the system in question, and division is just the
_easy_ bit to point at being slower. Doing this right will likely need
FP work, which would make chunk allocations rather painfully slow.
* Re: Better distribution of RAID1 data?
From: Zygo Blaxell @ 2019-02-15 23:11 UTC
To: Austin S. Hemmelgarn; +Cc: Brian B, linux-btrfs

On Fri, Feb 15, 2019 at 02:55:13PM -0500, Austin S. Hemmelgarn wrote:
> On 2019-02-15 14:50, Zygo Blaxell wrote:
> > On Fri, Feb 15, 2019 at 11:54:57AM -0500, Austin S. Hemmelgarn wrote:
> > > On 2019-02-15 10:40, Brian B wrote:
> > > > It looks like the btrfs code currently uses the total space available on
> > > > a disk to determine where it should place the two copies of a file in
> > > > RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> > > > space instead of the number of free bytes?
> > > >
> > > > For example, I have two disks in my array that are 8 TB, plus an
> > > > assortment of 3, 4, and 1 TB disks. With the current allocation code,
> > > > btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> > > > files, then it will start using the 4 TB disks, then eventually the 3,
> > > > and finally the 1 TB disks. If the code used a percentage figure
> > > > instead, it would spread the allocations much more evenly across the
> > > > drives, ideally spreading load and reducing drive wear.
> >
> > Spreading load should make all the drives wear at the same rate (or a rate
> > proportional to size). That would be a gain for the big disks but a
> > loss for the smaller ones.
> >
> > > > Is there a reason this is done this way, or is it just something that
> > > > hasn't had time for development?
> > > It's simple to implement, easy to verify, runs fast, produces optimal or
> > > near optimal space usage in pretty much all cases, and is highly
> > > deterministic.
> > >
> > > Using percentages reduces the simplicity, ease of verification, and speed
> > > (division is still slow on most CPUs, and you need division for
> > > percentages), and is likely to not be as deterministic (both because the
> >
> > A few integer divides _per GB of writes_ is not going to matter.
> > raid5 profile does a 64-bit modulus operation on every stripe to locate
> > parity blocks.
> It really depends on the system in question, and division is just the _easy_
> bit to point at being slower. Doing this right will likely need FP work,
> which would make chunk allocations rather painfully slow.

It still doesn't matter. Chunk allocations don't happen very often, so
anything faster than an Arduino should be able to keep up. You could
spend milliseconds on each one (and probably do, just for the IO
required to update the device/block group trees).
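As a side note on the arithmetic debated above: ranking two devices by
free-space fraction can also be expressed with integer cross-multiplication
rather than division or floating point. The sketch below is purely
illustrative and is not how btrfs implements anything; the wide intermediates
it needs are part of the complexity cost being weighed in this exchange.

    # Sketch only: compare the free-space fractions of two devices using integer
    # cross-multiplication, so neither division nor floating point is needed.
    # With 64-bit byte counts the products would need 128-bit intermediates in C;
    # Python integers are arbitrary precision, so that cost is hidden here.
    def freer_by_fraction(dev_a, dev_b):
        # free_a/total_a >= free_b/total_b  <=>  free_a*total_b >= free_b*total_a
        # (valid because both totals are positive).
        if dev_a["free"] * dev_b["total"] >= dev_b["free"] * dev_a["total"]:
            return dev_a
        return dev_b

    # A half-full 8 TB drive versus an empty 1 TB drive (sizes in GB):
    a = {"name": "8TB-a", "total": 8000, "free": 4000}   # 50% free
    b = {"name": "1TB",   "total": 1000, "free": 1000}   # 100% free
    print(freer_by_fraction(a, b)["name"])               # prints "1TB"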