linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Anand Jain <anand.jain@oracle.com>
To: dsterba@suse.cz
Cc: linux-btrfs@vger.kernel.org, dsterba@suse.com, wqu@suse.com,
	hrx@bupt.moe, waxhead@dirtcellar.net
Subject: Re: [PATCH v2 0/3] raid1 balancing methods
Date: Mon, 21 Oct 2024 23:36:10 +0800	[thread overview]
Message-ID: <788e9d8f-df2e-4d54-b60c-dddbc47fd701@oracle.com> (raw)
In-Reply-To: <20241021140518.GD17835@twin.jikos.cz>


Thanks for commenting.

On 21/10/24 22:05, David Sterba wrote:
> On Fri, Oct 11, 2024 at 10:49:15AM +0800, Anand Jain wrote:
>> v2:
>> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
>> 2. Correct the typo from %est_wait to %best_wait.
>> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
>> 4. Implement rotation with a minimum contiguous read threshold before
>>     switching to the next stripe. Configure this, using:
>>
>>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>>
>>     The default value is the sector size, and the min_contiguous_read
>>     value must be a multiple of the sector size.
> 
> I think it's safe to start with the round-round robin policy, but the
> syntax is strange, why the [ ] are mandatory? Also please call it
> round-robin, or 'rr' for short.
> 

I'm fine with round-robin. The [ ] part is not mandatory; if the
min_contiguous_read value is not specified, it will default to a
predefined value.

> The default of sector size is IMHO a wrong value, switching devices so
> often will drop the performance just because of the io request overhead.

>  From what I rememer values around 200k were reasonable, so either 192k
> or 256k should be the default. We may also drop the configurable value
> at all and provide a few hard coded sizes like rr-256k, rr-512k, rr-1m,
> if not only to drop parsing of user strings.

I'm okay with a default value of 256k. For the experimental feature,
we can keep it configurable, allowing the opportunity to experiment
with other values as well

> 
>> 5. Tested FIO random read/write and defrag compression workloads with
>>     min_contiguous_read set to sector size, 192k, and 256k.
>>
>>     RAID1 balancing method rotation is better for multi-process workloads
>>     such as fio and also single-process workload such as defragmentation.
>>
>>       $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>>          --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>>          --time_based --group_reporting --name=iops-test-job --eta-newline=1
>>
>>
>> |         |            |            | Read I/O count  |
>> |         | Read       | Write      | devid1 | devid2 |
>> |---------|------------|------------|--------|--------|
>> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>> | rotation|            |            |        |        |
>> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
>> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
>> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
>> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
>>
>>     rotation RAID1 balancing technique performs more than 2x better for
>>     single-process defrag.
>>
>>        $ time -p btrfs filesystem defrag -r -f -c /btrfs
>>
>>
>> |         | Time  | Read I/O Count  |
>> |         | Real  | devid1 | devid2 |
>> |---------|-------|--------|--------|
>> | pid     | 18.00s| 3800   | 0      |
>> | rotation|       |        |        |
>> |     4096|  8.95s| 1900   | 1901   |
>> |   196608|  8.50s| 1881   | 1919   |
>> |   262144|  8.80s| 1881   | 1919   |
>> | latency | 17.18s| 3800   | 0      |
>> | devid:2 | 17.48s| 0      | 3800   |
>>
>> Rotation keeps all devices active, and for now, the Rotation RAID1
>> balancing method is preferable as default. More workload testing is
>> needed while the code is EXPERIMENTAL.
> 
> Yeah round-robin will be a good defalt, we only need to verify the chunk
> size and then do the switch in the next release.
> 

Yes..

>> While Latency is better during the failing/unstable block layer transport.
>> As of now these two techniques, are needed to be further independently
>> tested with different worloads, and in the long term we should be merge
>> these technique to a unified heuristic.
> 
> This sounds like he latency is good for a specific case and maybe a
> fallback if the device becomes faulty, but once the layer below becomes
> unstable we may need to skip reading from the device. This is also a
> different mode of operation than balancing reads.
> 

If the latency on the faulty path is so high that it shouldn't pick that
path at all, so it works. However, the round-robin balancing is unaware
of dynamic faults on the device path. IMO, a round-robin method that is
latency aware (with ~20% variance) would be better.

>> Rotation keeps all devices active, and for now, the Rotation RAID1
>> balancing method should be the default. More workload testing is needed
>> while the code is EXPERIMENTAL.
>>
>> Latency is smarter with unstable block layer transport.
>>
>> Both techniques need independent testing across workloads, with the goal of
>> eventually merging them into a unified approach? for the long term.
>>
>> Devid is a hands-on approach, provides manual or user-space script control.
>>
>> These RAID1 balancing methods are tunable via the sysfs knob.
>> The mount -o option and btrfs properties are under consideration.
> 
> To move forward with the feature I think the round robin and preferred
> device id can be merged. I'm not sure about the latency but if it's
> under experimental we can take it as is and tune later.

I hope the experimental feature also means we can change the name of the
balancing method at any time. Once we have tested a fair combination of
block device types, we'll definitely need a method that can
automatically tune based on the device type, which will require adding
or dropping balancing methods accordingly.

> I'll check my notes from the last time Michal attempted to implement the
> policies if we haven't missed something.

Thanks, Anand


  reply	other threads:[~2024-10-21 15:36 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
2024-10-11  2:49 ` [PATCH v2 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
2024-10-11  2:49 ` [PATCH v2 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
2024-10-11  2:49 ` [PATCH v2 3/3] btrfs: add RAID1 preferred read device Anand Jain
2024-10-11  3:35 ` [PATCH v2 0/3] raid1 balancing methods Anand Jain
2024-10-11  4:59 ` Qu Wenruo
2024-10-11  6:04   ` Anand Jain
2024-10-21 14:05 ` David Sterba
2024-10-21 15:36   ` Anand Jain [this message]
2024-10-21 18:42     ` David Sterba
2024-10-22  0:31       ` Anand Jain
2024-10-21 14:32 ` waxhead
2024-10-21 15:44   ` Anand Jain
2024-10-22  7:07   ` Johannes Thumshirn
2024-10-24  4:39 ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=788e9d8f-df2e-4d54-b60c-dddbc47fd701@oracle.com \
    --to=anand.jain@oracle.com \
    --cc=dsterba@suse.com \
    --cc=dsterba@suse.cz \
    --cc=hrx@bupt.moe \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=waxhead@dirtcellar.net \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).