Date: Sat, 16 Mar 2019 19:10:28 -0400
From: Zygo Blaxell
To: Andrei Borzenkov
Cc: Hans van Kranenburg, Jakub Husák, linux-btrfs@vger.kernel.org
Subject: Re: Balancing raid5 after adding another disk does not move/use any data on it
Message-ID: <20190316231028.GA16664@hungrycats.org>

On Sat, Mar 16, 2019 at 09:07:17AM +0300, Andrei Borzenkov wrote:
> 15.03.2019 23:31, Hans van Kranenburg wrote:
> ...
> >>
> >>>> If so, shouldn't it be really balancing (spreading) the data among all
> >>>> the drives to use all the IOPS capacity, even when the raid5 redundancy
> >>>> constraint is currently satisfied?
> >>
> >> btrfs divides the disks into chunks first, then spreads the data across
> >> the chunks.  The chunk allocation behavior spreads chunks across all the
> >> disks.  When you are adding a disk to raid5, you have to redistribute all
> >> the old data across all the disks to get balanced IOPS and space usage,
> >> hence the full balance requirement.
> >>
> >> If you don't do a full balance, it will eventually allocate data on
> >> all disks, but it will run out of space on sdb, sdc, and sde first,
> >> and then be unable to use the remaining 2TB+ on sdd.
> >
> > Also, if you have a lot of empty space in the current allocations, btrfs
> > balance has the tendency to first start packing everything together
> > before allocating new (4 disk wide) block groups.
> >
> > This is annoying, because it can result in moving the same data multiple
> > times during balance (into empty space of another existing block group,
> > and then when that one has its turn again etc).
> >
> > So you want to get rid of empty space in existing block groups as soon
> > as possible.
> > btrfs-balance-least-used can do this (also an example from
> > python-btrfs), by doing them in order of most empty one first.
>
> But if I understand the above correctly it will still attempt to move
> data in next most empty chunks first. Is there any way to force
> allocation of new chunks? Or, better, force usage of chunks with given
> stripe width as balance target?
>
> This thread actually made me wonder - is there any guarantee (or even
> tentative promise) about RAID stripe width from btrfs at all? Is it
> possible that RAID5 degrades to mirror by itself due to unfortunate
> space distribution?

Note that the data layout of RAID5 with 1 data disk, 1 parity disk, and
even parity is identical to RAID1 with 1 data disk and 1 mirror copy.
The two algorithms produce the same data layout with those parameters.
IIRC btrfs uses odd parity, so on btrfs the RAID5 parity stripes are the
complement of the data stripes, but they don't need to be: with even
parity on 2 disks, the data and parity blocks are identical and
interchangeable.

If you have RAID5 with non-equal device sizes, as long as the two largest
disks are the same size, btrfs will adjust the stripe width to match the
disks with free space available, subject to the constraint that the
resulting block group must have enough disks to survive one disk failure.
e.g. for RAID5 with 5 disks, 2x3TB, 2x2TB, 1x1TB, you get three zones:

-> raid5 fills smallest unallocated spaces first, all drives ->
3TB AAAAAAAAAABBBBBBBBBBCCCCCCCCCC
3TB AAAAAAAAAABBBBBBBBBBCCCCCCCCCC
2TB AAAAAAAAAABBBBBBBBBB
2TB AAAAAAAAAABBBBBBBBBB
1TB AAAAAAAAAA

Zone "A" is 5 disks wide, zone "B" is 4 disks wide, and zone "C" is
2 disks wide (each letter represents 100x1GB chunks).  This is not
necessarily how the data is laid out on disk--the btrfs allocator will
store data on disk in some permutation of this order; however, the total
number of chunks in each zone on each disk is as shown.

For -draid5 -mraid1, you can get patterns like this:

<- raid1 fills largest unallocated spaces first, 2 drives <-
3TB 5AAAAAAAA4BBBBBBBBB3CCCCCCCC21
3TB 5AAAAAAAA4BBBBBBBBB3CCCCCCCC21
2TB 6AAAAAAAADBBBBBBBBBC
2TB 6AAAAAAAADBBBBBBBBBC
1TB UAAAAAAAAD

where numbered zones are raid1 metadata chunks, zone "D" is raid5 3 disks
wide, and "U" is the worst-case one unusable 1GB chunk (not to scale) in
arrays with an odd number of disks.  The numbered zones occupy space that
would normally form a full-width raid5 stripe in the zone, so the last
raid5 block groups in each zone are less wide (i.e. the metadata chunks
in the "B" zone make some stripes in the "B" zone space behave like
stripes in "C" zone space).

If the allocations start from empty disks and there are no array
reshaping operations (convert profile, add/delete/resize devices) then
the allocator should allocate all the usable space as efficiently as
possible.  In the -draid5 -mraid1 case, it would be slightly more
efficient to allocate all the metadata in the "C" zone so it doesn't make
any narrower stripes in the "B" and "A" zones.  Typically this is exactly
what happens, since all the "A" and "B" space must be allocated before
raid5 can reach the "C" zone from the left, while all the "C" space must
be allocated before raid1 can reach the "B" zone from the right, and the
two allocators only interact when the filesystem is completely full.

<- raid1 fills from the right, raid5 from the left <-
3TB AAAAAAAAAABBBBBBBBBBCCCC654321
3TB AAAAAAAAAABBBBBBBBBBCCCC654321
2TB AAAAAAAAAABBBBBBBBBB
2TB AAAAAAAAAABBBBBBBBBB
1TB AAAAAAAAAA
-> they meet somewhere in the middle, no space wasted ->
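If it helps to see where those zone widths come from without staring at
the diagrams, here is a toy allocator model in Python.  It is only a
sketch of the rule described above, not the real btrfs allocator: it
assumes 1GB chunks, that raid5 stripes across every device that still
has unallocated space (and needs at least 2 devices to survive a disk
failure), and that raid1 uses the 2 devices with the most unallocated
space.  The function names and device list are made up for the example.

from collections import Counter

def allocate_chunk(free_gb, profile):
    # Devices that still have unallocated space, most free space first.
    devs = sorted((i for i, f in enumerate(free_gb) if f > 0),
                  key=lambda i: free_gb[i], reverse=True)
    if len(devs) < 2:
        return None                  # both profiles need >= 2 devices
    if profile == 'raid1':
        devs = devs[:2]              # raid1: 2 devices with most free space
    # raid5: stripe across every device with space (N-1 data + 1 parity)
    for i in devs:
        free_gb[i] -= 1              # each chunk uses 1GB per device
    return devs

def fill(free_gb, profile):
    # Allocate chunks until nothing fits; count how wide each one was.
    widths = Counter()
    while True:
        devs = allocate_chunk(free_gb, profile)
        if devs is None:
            return widths
        widths[len(devs)] += 1

free = [3000, 3000, 2000, 2000, 1000]    # 2x3TB + 2x2TB + 1x1TB, in GB
print(fill(free, 'raid5'))
# Counter({5: 1000, 4: 1000, 2: 1000}):
# 1000 chunks 5 disks wide ("A"), 1000 4 wide ("B"), 1000 2 wide ("C")

It models one profile at a time, so it won't reproduce the mixed
-draid5 -mraid1 interleaving above, but it does reproduce the A/B/C zone
widths from the first diagram.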
If all the drives are the same size, then raid5 and raid1 meet
immediately in zone "A":

<- raid1 fills from the right, raid5 from the left <-
3TB AAAAAAAAAAAAAAAAAAAAAAAAAAA421
3TB AAAAAAAAAAAAAAAAAAAAAAAAAAA431
3TB AAAAAAAAAAAAAAAAAAAAAAAAAAAU32
-> they meet somewhere in the middle, up to 1GB wasted ->

There used to be a bug (maybe there still is?) where the allocator would
randomly place about 0.1% of chunks on a non-optimal disk (due to a race
condition?).  That can theoretically lose a few GB of space per TB by
shrinking the stripe width on a few block groups, or stealing a mirror
chunk from the largest disk in a raid1 array with multiple disk sizes.
You can get rid of those using the 'stripes' filter for balance--though
only 0.1% of the space is gained or lost this way, so it may not be
worth the IO cost.

If you are converting or reshaping an array, the nice rules above don't
hold any more.  e.g. if we replace a 1TB drive with a 3TB drive, we get
2TB unallocated ("_"):

3TB AAAAAAAAAABBBBBBBBBBCCCC654321
3TB AAAAAAAAAABBBBBBBBBBCCCC654321
2TB AAAAAAAAAABBBBBBBBBB
2TB AAAAAAAAAABBBBBBBBBB
3TB AAAAAAAAAA____________________

Now we have no available space because there are no free chunks on two
or more drives (i.e. all the free space is on 1 drive and all the RAID
profiles we are using require 2).  Upgrade another disk, and...

3TB AAAAAAAAAABBBBBBBBBBCCCC654321
3TB AAAAAAAAAABBBBBBBBBBCCCC654321
2TB AAAAAAAAAABBBBBBBBBB
3TB AAAAAAAAAABBBBBBBBBB__________
3TB AAAAAAAAAA____________________

Now we have 1TB of free space, in stripes 2 disks wide.  Without a
balance, it would fill up like this:

3TB AAAAAAAAAABBBBBBBBBBCCCC654321
3TB AAAAAAAAAABBBBBBBBBBCCCC654321
2TB AAAAAAAAAABBBBBBBBBB
3TB AAAAAAAAAABBBBBBBBBBCCCCCCCCCC
3TB AAAAAAAAAACCCCCCCCCCXXXXXXXXXX
-> raid5 fills smallest unallocated spaces first on all drives ->

Note the "C" zone here is still stripes 2 disks wide, so a lot of space
is wasted by narrow stripes.  Even the diagram makes it look like we did
something wrong--we don't have the nice orderly fill pattern.  1TB is
unusable, and the free space estimated by 'df' was egregiously wrong the
whole time.

Full balance fixes that, and we get some unallocated space that is
usable:

-> raid5 from left to right ->
3TB AAAAAAAAAAAAAAAAAAA________531
3TB AAAAAAAAAAAAAAAAAAA________531
2TB AAAAAAAAAAAAAAAAAAA_
3TB AAAAAAAAAAAAAAAAAAA________642
3TB AAAAAAAAAAAAAAAAAAA________642
<- raid1 from right to left <-

which can then be filled up like this:

3TB AAAAAAAAAAAAAAAAAAAABBBBBB7531
3TB AAAAAAAAAAAAAAAAAAAABBBBBB7531
2TB AAAAAAAAAAAAAAAAAAAA
3TB AAAAAAAAAAAAAAAAAAAABBBBBBC642
3TB AAAAAAAAAAAAAAAAAAAABBBBBBC642

By the time our hypothetical filesystem was full, there was another
metadata chunk allocated, so we end up with one 1GB block group in zone
"C" with 2 disks--but at most one.
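As an aside, if you want to check an existing filesystem for block
groups that ended up with a narrower stripe than expected (the kind of
thing the 'stripes' balance filter cleans up), a rough sketch along
these lines should be possible with python-btrfs.  It is untested and
assumes FileSystem.chunks(), chunk.num_stripes and
FileSystem.block_group() behave the way the examples shipped with
python-btrfs (such as btrfs-balance-least-used) use them, so treat it as
a starting point rather than a tool:

import sys
import btrfs

# Assumed python-btrfs API: FileSystem.chunks() yields chunk items with
# vaddr, length and num_stripes; FileSystem.block_group() looks up usage.
fs = btrfs.FileSystem(sys.argv[1])     # mountpoint, e.g. /mnt
for chunk in fs.chunks():
    bg = fs.block_group(chunk.vaddr, chunk.length)
    used_pct = 100 * bg.used // bg.length
    print("vaddr {} stripes {} used {}%".format(
        chunk.vaddr, chunk.num_stripes, used_pct))

Block groups that show fewer stripes than their neighbours (or that are
mostly empty) are the ones worth feeding to balance, whether via the
stripes/vrange filters or by letting btrfs-balance-least-used work
through them emptiest-first.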