From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: Very slow balance / btrfs-transaction
Date: Mon, 6 Feb 2017 08:19:37 -0500	[thread overview]
Message-ID: <403247fe-376f-27d7-bbd5-d8acd260a8ad@gmail.com> (raw)
In-Reply-To: <20170204221051.664ada65@jupiter.sol.kaishome.de>

On 2017-02-04 16:10, Kai Krakow wrote:
> Am Sat, 04 Feb 2017 20:50:03 +0000
> schrieb "Jorg Bornschein" <jb@capsec.org>:
>
>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" <rgoldwyn@suse.de>
>> wrote:
>>
>>> Yes, please check if disabling quotas makes a difference in
>>> execution time of btrfs balance.
>>
>> Just FYI: With quotas disabled it took ~20h to finish the balance
>> instead of the projected >30 days. Therefore, in my case, there was a
>> speedup of factor ~35.
>>
>>
>> and thanks for the quick reply! (and for btrfs general!)
>>
>>
>> BTW: I'm wondering how much sense it makes to activate the underlying
>> bcache for my raid1 fs again. I guess btrfs chooses randomly (or
>> based on predicted disk latency?) which copy of a given extent to
>> load?
>
> As far as I know, it currently uses only PID modulo, no round-robin,
> no random value. There are no performance optimizations going into btrfs
> yet because there are still a lot of ongoing feature implementations.
>
> I think there were patches to include a rotator value in the stripe
> selection. They don't apply to the current kernel. I tried it once and
> didn't see any subjective difference for normal desktop workloads. But
> that's probably because I use RAID1 for metadata only.
I had tested similar patches myself using raid1 for everything, and saw 
near-zero improvement unless I explicitly tried to create a worst-case 
performance situation.  The reality is that the current algorithm is 
remarkably close to optimal for most use cases while using an insanely 
small amount of processing power and memory compared to an optimal 
algorithm (and a truly optimal algorithm is functionally impossible in 
almost all cases because it would require predicting the future).
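
For illustration, the current read-mirror selection can be sketched in a 
few lines (a hypothetical model of the behavior, not actual kernel code; 
`select_mirror` and its parameters are made up for this example):

```python
# Sketch of BTRFS's raid1 read-mirror selection: the reading process's
# PID modulo the number of copies picks the mirror.  Function and
# parameter names are illustrative, not kernel symbols.
def select_mirror(pid: int, num_copies: int = 2) -> int:
    """Index of the mirror a process with the given PID reads from."""
    return pid % num_copies

# Processes with consecutive PIDs naturally spread across both mirrors,
# which is part of why this is close to optimal for typical
# multi-process workloads despite being so cheap to compute.
```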
>
> MDRAID uses stripe selection based on latency and other measurements
> (like head position). It would be nice if btrfs implemented similar
> functionality. This would also be helpful for selecting a disk if
> there're more disks than stripesets (for example, I have 3 disks in my
> btrfs array). This could write new blocks to the most idle disk always.
> I think this wasn't covered by the above mentioned patch. Currently,
> selection is based only on the disk with most free space.
You're confusing read selection and write selection.  MDADM and DM-RAID 
both use a load-balancing read selection algorithm that takes latency 
and other factors into account.  However, they use a round-robin write 
selection algorithm that only cares about the position of the block in 
the virtual device modulo the number of physical devices.

As an example, say you have a 3 disk RAID10 array set up using MDADM 
(this is functionally the same as a 3-disk raid1 mode BTRFS filesystem). 
Every third block starting from block 0 will be on disks 1 and 2, 
every third block starting from block 1 will be on disks 3 and 1, and 
every third block starting from block 2 will be on disks 2 and 3.  No 
latency measurements are taken, literally nothing is factored in except 
the block's position in the virtual device.
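
That placement rule is simple enough to model directly (a sketch under 
the assumptions above; `raid10_disks` is a made-up name):

```python
# Model of md's deterministic write placement for a 3-disk raid10 with
# 2 copies: the disks holding a block depend only on its position.
def raid10_disks(block: int, num_disks: int = 3, copies: int = 2) -> list:
    """1-based disk numbers holding the copies of the given block."""
    start = (block * copies) % num_disks
    return [(start + i) % num_disks + 1 for i in range(copies)]
```

Blocks 0, 1, and 2 land on disks (1, 2), (3, 1), and (2, 3) 
respectively, and the pattern repeats every three blocks, matching the 
layout described above.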

Now, that said, BTRFS does behave differently under the same 
circumstances, but this is because the striping is different for BTRFS. 
It happens at the chunk level instead of the block level.  If we look at 
an example using the same 3 devices as the MDADM example, and then for 
simplicity assume that you end up allocating alternating data and 
metadata chunks, things might look a bit like this:
* System chunk: Device 1 and 2
* Metadata chunk 0: Device 3 and 1
* Data chunk 0: Device 2 and 3
* Metadata chunk 1: Device 1 and 2
* Data chunk 1: Device 1 and 2
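
The allocator behind that layout can be sketched as "put each chunk on 
the two devices with the most free space" (an illustrative model only; 
the real allocator is more involved, and `allocate_chunk` is a made-up 
name):

```python
def allocate_chunk(free_space: dict, chunk_size: int = 1) -> list:
    """Place one raid1 chunk on the two devices with the most free space."""
    # sorted() is stable, so ties fall back to device order, which is
    # roughly how equal-sized disks end up rotating as in the list above.
    devs = sorted(free_space, key=free_space.get, reverse=True)[:2]
    for d in devs:
        free_space[d] -= chunk_size
    return devs

free = {1: 100, 2: 100, 3: 100}   # three equal-sized devices
# Successive calls yield [1, 2], [3, 1], [2, 3], [1, 2], ...
```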
Overall, there is technically a pattern, but it's got a very long 
repetition period.  This is still however a near optimal allocation 
pattern given the constraints.  It also gives (just like the MDADM and 
DM-RAID method) 100% deterministic behavior; the only difference is 
that it depends on a slightly different factor.  Changing this to select the 
most idle disk as you suggest would remove that determinism, increase 
the likelihood of sub-optimal layouts in terms of space usage, increase 
the number of cases where you could get ENOSPC, and provide near zero 
net performance benefit except under heavy load.  IOW, it would provide 
a pretty negative net benefit.

What actually needs to happen to improve write performance is that BTRFS 
needs to quit serializing writes when writing chunks across multiple 
devices.  In the case of a raid1 setup, it writes first to one device, 
then the other, alternating back and forth as it updates each extent. 
This combined with the COW behavior causing write amplification is what 
makes write performance so horrible for BTRFS compared to MDADM or 
DM-RAID.  It's not that we have bad device selection for writes, it's 
that we don't even try to do any kind of practical parallelization 
despite it being an embarrassingly parallel task (and yes, that 
seriously is what something that's trivial to parallelize is called in 
scientific papers...).
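
The difference is easy to demonstrate with a toy model (the 
`write_to_device` function here is a hypothetical stand-in for issuing 
I/O to one disk, with a sleep simulating device latency):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def write_to_device(dev: int, data: bytes) -> int:
    """Stand-in for one device write; the sleep models device latency."""
    time.sleep(0.05)
    return len(data)

def write_serialized(data: bytes) -> None:
    # What BTRFS effectively does today: one mirror at a time.
    for dev in (0, 1):
        write_to_device(dev, data)

def write_parallel(data: bytes) -> None:
    # The embarrassingly parallel version: submit both writes at once.
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(lambda d: write_to_device(d, data), (0, 1)))
```

With two mirrors, the serialized version pays the device latency twice 
per extent while the parallel version pays it roughly once, which is 
where most of the gap to MDADM and DM-RAID comes from.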
