From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Paul Jones <paul@pauljones.id.au>
Cc: Pedro Macedo <pmacedo@pmacedo.com>,
Anand Jain <anand.jain@oracle.com>,
Roman Mamedov <rm@romanrm.net>, Remi Gauvin <remi@georgianit.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Balance on 5-disk RAID1 put all data on 2 disks, leaving the rest empty
Date: Thu, 2 Nov 2023 09:50:22 -0400
Message-ID: <ZUOpHtII/SQZt1w7@hungrycats.org>
In-Reply-To: <SYCPR01MB4685B23E1FA74D65A3859BDE9EA6A@SYCPR01MB4685.ausprd01.prod.outlook.com>

On Thu, Nov 02, 2023 at 05:11:00AM +0000, Paul Jones wrote:
> > -----Original Message-----
> > From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> > Sent: Thursday, November 2, 2023 1:13 PM
> > To: Pedro Macedo <pmacedo@pmacedo.com>
> > Cc: Anand Jain <anand.jain@oracle.com>; Roman Mamedov
> > <rm@romanrm.net>; Remi Gauvin <remi@georgianit.com>;
> > linux-btrfs@vger.kernel.org
> > Subject: Re: Balance on 5-disk RAID1 put all data on 2 disks, leaving the rest
> > empty
> >
> > On Wed, Nov 01, 2023 at 08:20:56PM +0100, Pedro Macedo wrote:
> > >
> > > On 27.10.23 06:21, Anand Jain wrote:
> > > > On 10/26/23 05:15, Roman Mamedov wrote:
> > > > > On Wed, 25 Oct 2023 17:08:08 -0400 Remi Gauvin
> > > > > <remi@georgianit.com> wrote:
> > > > >
> > > > > > On 2023-10-25 4:29 p.m., Peter Wedder wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > I had a RAID1 array on top of 4x4TB drives. Recently I removed
> > > > > > > one 4TB drive and added two 16TB drives to it. After running a
> > > > > > > full, unfiltered balance on the array, I am left in a
> > > > > > > situation where all the 4TB drives are completely empty, and
> > > > > > > all the data and metadata is on the 16TB drives.
> > > > > > > Is this normal? I was expecting to have at least some data on
> > > > > > > the smaller drives.
> > > > > > >
> > > > > >
> > > > > > > Yes, this is normal. BTRFS allocates space on the drives with
> > > > > > > the most available free space. The idea is to balance the
> > > > > > > 'unallocated' space on each drive, so they can be filled evenly.
> > > > > > > The 4TB drives will be used when the 16TB drives have less than
> > > > > > > 4TB unallocated.
> > > > >
> > > >
> > > > Correct. That's the only allocation method we have at the moment. Do
> > > > you have any feedback on whether there are any other allocation
> > > > methods that make sense?
> > >
> > >
> > > IMHO, based on the frequency of this question appearing here/on
> > > reddit/other sites, perhaps allocation by absolute space used? It
> > > should fit the expectations of most folks that if you have free space
> > > on a disk it will be utilised, plus has potential performance
> > > implications by always using as many devices as possible to write to
> > > as long as they have any space left.
> >
> > That is how allocation works with striped profiles: chunks are allocated using
> > space from all non-full drives, in order to use space and iops optimally.
> >
> > For a non-striped profile like raid1, it's not possible to use all the space
> > without filling the larger devices first. As the large devices fill up, their free
> > space becomes equal in size to the smaller devices, and it's always possible to
> > completely fill a raid1 array of equal-sized devices. If raid1 distributed data
> > across the small devices at the same time as the large devices, it would run
> > out of space on small devices before running out of space on the large ones,
> > so significant space on some devices would be wasted.
>
> I was always under the impression that space was allocated from the
> emptiest drive(s) on a percentage basis. Was that ever the case and
> has since changed? That seems like the most optimal way to do it.

The current behavior was introduced in 2011, and hasn't changed since
except for regressions in 2015, 2022, and 2023 (now fixed). Support for
zoned devices was added in 2020, but it doesn't affect regular device
behavior.

btrfs finds the largest contiguous free space block >= 1 GiB on each
device (using the lowest offset to break ties), then creates a chunk
using up to 1 GiB from each of the top N devices with the largest free
byte count (using devid to break ties), where N is the maximum number
of devices supported by the profile.

You could replace "largest free byte count" with "largest proportion
of free space" in the above, but that would only make sense if the
filesystem had never had drives added or replaced. e.g. in cases where
you had already filled some devices, then replaced them with larger ones,
the space available on a device would not be correlated to its size
at all.
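
As a rough illustration of the selection rule described above (top N
devices by free byte count, devid as tie-breaker), here is a minimal
Python sketch. It is not the kernel code: each device is modeled as a
single free-space hole, every stripe is exactly 1 GiB, ties are assumed
to go to the lower devid, and the names pick_devices, allocate_chunk and
fill are made up for this example. Device sizes are scaled down to 4 GiB
and 16 GiB to keep the run short.

```python
GIB = 1 << 30

def pick_devices(free_bytes, ndevs, min_hole=GIB):
    """Return the ndevs devids with the most free bytes, or None.

    free_bytes maps devid -> unallocated bytes. Devices with less than
    min_hole free are skipped. Ties go to the lower devid (an assumption
    of this sketch).
    """
    eligible = [(devid, free) for devid, free in free_bytes.items()
                if free >= min_hole]
    # Most free bytes first, then lowest devid.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    if len(eligible) < ndevs:
        return None
    return [devid for devid, _ in eligible[:ndevs]]

def allocate_chunk(free_bytes, ndevs=2, stripe=GIB):
    """Take one stripe from each selected device; ndevs=2 models raid1."""
    devs = pick_devices(free_bytes, ndevs, stripe)
    if devs is None:
        return None
    for devid in devs:
        free_bytes[devid] -= stripe
    return devs

def fill(free_bytes, ndevs=2):
    """Allocate chunks until no more fit; return the devids used per chunk."""
    chunks = []
    while (devs := allocate_chunk(free_bytes, ndevs)) is not None:
        chunks.append(tuple(devs))
    return chunks

# Three small devices (devid 1-3) and two large ones (devid 4-5).
free = {1: 4 * GIB, 2: 4 * GIB, 3: 4 * GIB, 4: 16 * GIB, 5: 16 * GIB}
chunks = fill(free)
```

In this model the first 12 chunks land only on devids 4 and 5. The
smaller devices start being used once all five have the same amount of
free space, and the run ends after 22 chunks with every device full.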
>
> Paul.