* Better distribution of RAID1 data?
From: Brian B @ 2019-02-15 15:40 UTC
To: linux-btrfs

It looks like the btrfs code currently uses the total space available on
a disk to determine where it should place the two copies of a file in
RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
space instead of the number of free bytes?

For example, I have two disks in my array that are 8 TB, plus an
assortment of 3, 4, and 1 TB disks. With the current allocation code,
btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
files, then it will start using the 4 TB disks, then eventually the 3,
and finally the 1 TB disks. If the code used a percentage figure
instead, it would spread the allocations much more evenly across the
drives, ideally spreading load and reducing drive wear.

Is there a reason this is done this way, or is it just something that
hasn't had time for development?
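To make the difference concrete, here is a rough model of the two allocation
policies under discussion, written as a small Python sketch. It is not the
actual btrfs allocator; picking by most free bytes is only a loose stand-in
for the current behaviour, and the device sizes mirror the 8/8/4/3/1 TB array
described above.

    # Minimal model of RAID1 chunk-device selection; a sketch, not btrfs kernel code.
    # Sizes are in GB (approximate); the device list follows Brian's example array.
    devices = [
        {"name": "8TB-a", "total": 8000, "free": 8000},
        {"name": "8TB-b", "total": 8000, "free": 8000},
        {"name": "4TB",   "total": 4000, "free": 4000},
        {"name": "3TB",   "total": 3000, "free": 3000},
        {"name": "1TB",   "total": 1000, "free": 1000},
    ]

    def pick_by_free_bytes(devs):
        # Loosely the current policy: the two devices with the most unallocated bytes.
        return sorted(devs, key=lambda d: d["free"], reverse=True)[:2]

    def pick_by_free_fraction(devs):
        # The proposed policy: the two devices with the highest free/total ratio.
        return sorted(devs, key=lambda d: d["free"] / d["total"], reverse=True)[:2]

    def simulate(pick, chunk=1):
        # Allocate 1 GB RAID1 chunks (two copies on two devices) until space runs
        # out, then report how much ended up allocated on each device.
        devs = [dict(d) for d in devices]
        while True:
            a, b = pick(devs)
            if a["free"] < chunk or b["free"] < chunk:
                break
            a["free"] -= chunk
            b["free"] -= chunk
        return {d["name"]: d["total"] - d["free"] for d in devs}

    print("by free bytes:   ", simulate(pick_by_free_bytes))
    print("by free fraction:", simulate(pick_by_free_fraction))

Running the sketch shows how the two policies distribute chunks across the
devices over time; it says nothing about the IO or wear consequences, which is
what the rest of the thread debates.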
* Re: Better distribution of RAID1 data?
From: Hugo Mills @ 2019-02-15 15:55 UTC
To: Brian B; +Cc: linux-btrfs

On Fri, Feb 15, 2019 at 10:40:56AM -0500, Brian B wrote:
> It looks like the btrfs code currently uses the total space available on
> a disk to determine where it should place the two copies of a file in
> RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> space instead of the number of free bytes?

I don't think it'll make much difference. I spent a long time a couple
of years ago trying to prove (mathematically) that the current strategy
always produces an optimal usage of the available space -- I wasn't able
to complete the theorem, but a lot of playing around with it convinced
me that if there are cases where it's non-optimal, at least they're
bizarre corner cases.

> For example, I have two disks in my array that are 8 TB, plus an
> assortment of 3, 4, and 1 TB disks. With the current allocation code,
> btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> files, then it will start using the 4 TB disks, then eventually the 3,
> and finally the 1 TB disks. If the code used a percentage figure
> instead, it would spread the allocations much more evenly across the
> drives, ideally spreading load and reducing drive wear.
>
> Is there a reason this is done this way, or is it just something that
> hasn't had time for development?

I'd guess it's the easiest algorithm to use, plus it seems to provide
optimal space usage (almost?) all of the time.

   Hugo.

-- 
Hugo Mills             | Be pure.
hugo@... carfax.org.uk | Be vigilant.
http://carfax.org.uk/  | Behave.
PGP: E2AB1DE4          | Torquemada, Nemesis
* Re: Better distribution of RAID1 data?
From: Austin S. Hemmelgarn @ 2019-02-15 16:54 UTC
To: Brian B, linux-btrfs

On 2019-02-15 10:40, Brian B wrote:
> It looks like the btrfs code currently uses the total space available on
> a disk to determine where it should place the two copies of a file in
> RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> space instead of the number of free bytes?
>
> For example, I have two disks in my array that are 8 TB, plus an
> assortment of 3, 4, and 1 TB disks. With the current allocation code,
> btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> files, then it will start using the 4 TB disks, then eventually the 3,
> and finally the 1 TB disks. If the code used a percentage figure
> instead, it would spread the allocations much more evenly across the
> drives, ideally spreading load and reducing drive wear.
>
> Is there a reason this is done this way, or is it just something that
> hasn't had time for development?

It's simple to implement, easy to verify, runs fast, produces optimal or
near optimal space usage in pretty much all cases, and is highly
deterministic.

Using percentages reduces the simplicity, ease of verification, and speed
(division is still slow on most CPUs, and you need division for
percentages), and is likely to not be as deterministic (both because the
choice of first devices is harder when all are 100% empty, and because of
potential rounding errors), and probably won't produce optimal layouts
quite as reliably (you either need to get into floating-point math, which
is to be avoided in the kernel whenever possible, or you end up with much
more quantized disk selection).

I could see an adapted percentage method that preferentially spreads
across disks whenever possible _possibly_ making sense once we can
properly parallelize disk access in BTRFS, but until then I see no reason
to change something that is already working reasonably well.

In your particular case, I'd actually suggest using something under BTRFS
to merge the smaller disks to get as many devices as close to 8TB as
possible. That should help spread load better.
* Re: Better distribution of RAID1 data?
From: Zygo Blaxell @ 2019-02-15 19:50 UTC
To: Austin S. Hemmelgarn; +Cc: Brian B, linux-btrfs

On Fri, Feb 15, 2019 at 11:54:57AM -0500, Austin S. Hemmelgarn wrote:
> On 2019-02-15 10:40, Brian B wrote:
> > It looks like the btrfs code currently uses the total space available on
> > a disk to determine where it should place the two copies of a file in
> > RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> > space instead of the number of free bytes?
> >
> > For example, I have two disks in my array that are 8 TB, plus an
> > assortment of 3, 4, and 1 TB disks. With the current allocation code,
> > btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> > files, then it will start using the 4 TB disks, then eventually the 3,
> > and finally the 1 TB disks. If the code used a percentage figure
> > instead, it would spread the allocations much more evenly across the
> > drives, ideally spreading load and reducing drive wear.

Spreading load should make all the drives wear at the same rate (or a
rate proportional to size). That would be a gain for the big disks but a
loss for the smaller ones.

> > Is there a reason this is done this way, or is it just something that
> > hasn't had time for development?
> It's simple to implement, easy to verify, runs fast, produces optimal or
> near optimal space usage in pretty much all cases, and is highly
> deterministic.
>
> Using percentages reduces the simplicity, ease of verification, and speed
> (division is still slow on most CPUs, and you need division for
> percentages), and is likely to not be as deterministic (both because the

A few integer divides _per GB of writes_ is not going to matter.
raid5 profile does a 64-bit modulus operation on every stripe to locate
parity blocks.

> choice of first devices is harder when all are 100% empty, and because of
> potential rounding errors), and probably won't produce optimal layouts quite
> as reliably (you either need to get into floating-point math (which is to be
> avoided in the kernel whenever possible), or you end up with much more
> quantized disk selection).
>
> I could see an adapted percentage method that preferentially spreads across
> disks whenever possible _possibly_ making sense once we can properly
> parallelize disk access in BTRFS, but until then I see no reason to change
> something that is already working reasonably well.
>
> In your particular case, I'd actually suggest using something under BTRFS to
> merge the smaller disks to get as many devices as close to 8TB as possible.
> That should help spread load better.

That would change the distribution of risk. The aggregated small disks
would have a higher failure rate (because they don't have redundancy and
there are more active components to fail) than the individual big drives.

Brian can go the other way: use 'btrfs fi resize' to limit the size of
the large disks to match the size of the smaller disks, then resize the
bigger disks to the next size tier as the smaller disks become full.
I'm not sure _why_ you'd want to do this, though. The practical gains on
any metric that I know of would be negligible. Once there's enough data
on the filesystem, IO would be roughly distributed according to device
size no matter how the data was arranged. In any case, you can easily
experiment with this now, and show us data to support your theory. ;)

There is a possible gain on metadata placement (i.e. try to put metadata
on small SSDs while the data goes on big HDDs), but once the disks reach
equilibrium, metadata placement would revert to random. If you want to
place metadata on SSD then it would be better to modify the code to do
that explicitly.
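As an illustration of the 'btrfs fi resize' approach described above, the
commands might look like the following. The device IDs, the 4 TB cap, and the
mount point are assumptions made up for the example, not details taken from
Brian's array.

    # Illustration only: the device IDs (1 and 2) and the mount point are
    # assumptions for the example, not details of Brian's actual setup.
    # Cap each 8 TB device at 4 TB so the allocator stops preferring them:
    btrfs filesystem resize 1:4T /mnt/array
    btrfs filesystem resize 2:4T /mnt/array

    # Later, once the smaller disks have filled up, grow them back out:
    btrfs filesystem resize 1:max /mnt/array
    btrfs filesystem resize 2:max /mnt/array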
* Re: Better distribution of RAID1 data?
From: Austin S. Hemmelgarn @ 2019-02-15 19:55 UTC
To: Zygo Blaxell; +Cc: Brian B, linux-btrfs

On 2019-02-15 14:50, Zygo Blaxell wrote:
> On Fri, Feb 15, 2019 at 11:54:57AM -0500, Austin S. Hemmelgarn wrote:
>> On 2019-02-15 10:40, Brian B wrote:
>>> It looks like the btrfs code currently uses the total space available on
>>> a disk to determine where it should place the two copies of a file in
>>> RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
>>> space instead of the number of free bytes?
>>>
>>> For example, I have two disks in my array that are 8 TB, plus an
>>> assortment of 3, 4, and 1 TB disks. With the current allocation code,
>>> btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
>>> files, then it will start using the 4 TB disks, then eventually the 3,
>>> and finally the 1 TB disks. If the code used a percentage figure
>>> instead, it would spread the allocations much more evenly across the
>>> drives, ideally spreading load and reducing drive wear.
>
> Spreading load should make all the drives wear at the same rate (or a rate
> proportional to size). That would be a gain for the big disks but a
> loss for the smaller ones.
>
>>> Is there a reason this is done this way, or is it just something that
>>> hasn't had time for development?
>> It's simple to implement, easy to verify, runs fast, produces optimal or
>> near optimal space usage in pretty much all cases, and is highly
>> deterministic.
>>
>> Using percentages reduces the simplicity, ease of verification, and speed
>> (division is still slow on most CPUs, and you need division for
>> percentages), and is likely to not be as deterministic (both because the
>
> A few integer divides _per GB of writes_ is not going to matter.
> raid5 profile does a 64-bit modulus operation on every stripe to locate
> parity blocks.

It really depends on the system in question, and division is just the
_easy_ bit to point at being slower. Doing this right will likely need
FP work, which would make chunk allocations rather painfully slow.
* Re: Better distribution of RAID1 data?
From: Zygo Blaxell @ 2019-02-15 23:11 UTC
To: Austin S. Hemmelgarn; +Cc: Brian B, linux-btrfs

On Fri, Feb 15, 2019 at 02:55:13PM -0500, Austin S. Hemmelgarn wrote:
> On 2019-02-15 14:50, Zygo Blaxell wrote:
> > On Fri, Feb 15, 2019 at 11:54:57AM -0500, Austin S. Hemmelgarn wrote:
> > > On 2019-02-15 10:40, Brian B wrote:
> > > > It looks like the btrfs code currently uses the total space available on
> > > > a disk to determine where it should place the two copies of a file in
> > > > RAID1 mode. Wouldn't it make more sense to use the _percentage_ of free
> > > > space instead of the number of free bytes?
> > > >
> > > > For example, I have two disks in my array that are 8 TB, plus an
> > > > assortment of 3, 4, and 1 TB disks. With the current allocation code,
> > > > btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> > > > files, then it will start using the 4 TB disks, then eventually the 3,
> > > > and finally the 1 TB disks. If the code used a percentage figure
> > > > instead, it would spread the allocations much more evenly across the
> > > > drives, ideally spreading load and reducing drive wear.
> >
> > Spreading load should make all the drives wear at the same rate (or a rate
> > proportional to size). That would be a gain for the big disks but a
> > loss for the smaller ones.
> >
> > > > Is there a reason this is done this way, or is it just something that
> > > > hasn't had time for development?
> > > It's simple to implement, easy to verify, runs fast, produces optimal or
> > > near optimal space usage in pretty much all cases, and is highly
> > > deterministic.
> > >
> > > Using percentages reduces the simplicity, ease of verification, and speed
> > > (division is still slow on most CPUs, and you need division for
> > > percentages), and is likely to not be as deterministic (both because the
> >
> > A few integer divides _per GB of writes_ is not going to matter.
> > raid5 profile does a 64-bit modulus operation on every stripe to locate
> > parity blocks.
> It really depends on the system in question, and division is just the _easy_
> bit to point at being slower. Doing this right will likely need FP work,
> which would make chunk allocations rather painfully slow.

It still doesn't matter. Chunk allocations don't happen very often, so
anything faster than an Arduino should be able to keep up. You could
spend milliseconds on each one (and probably do, just for the IO
required to update the device/block group trees).
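As a side note on the arithmetic debated above: ranking two devices by
free-space fraction can also be expressed with integer cross-multiplication
rather than division or floating point. The sketch below is purely
illustrative and is not how btrfs implements anything; the wide intermediates
it needs are part of the complexity cost being weighed in this exchange.

    # Sketch only: compare the free-space fractions of two devices using integer
    # cross-multiplication, so neither division nor floating point is needed.
    # With 64-bit byte counts the products would need 128-bit intermediates in C;
    # Python integers are arbitrary precision, so that cost is hidden here.
    def freer_by_fraction(dev_a, dev_b):
        # free_a/total_a >= free_b/total_b  <=>  free_a*total_b >= free_b*total_a
        # (valid because both totals are positive).
        if dev_a["free"] * dev_b["total"] >= dev_b["free"] * dev_a["total"]:
            return dev_a
        return dev_b

    # A half-full 8 TB drive versus an empty 1 TB drive (sizes in GB):
    a = {"name": "8TB-a", "total": 8000, "free": 4000}   # 50% free
    b = {"name": "1TB",   "total": 1000, "free": 1000}   # 100% free
    print(freer_by_fraction(a, b)["name"])               # prints "1TB"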