public inbox for linux-btrfs@vger.kernel.org
From: Anand Jain <anand.jain@oracle.com>
To: Forza <forza@tnonline.net>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
Date: Wed, 21 May 2025 16:37:18 +0800
Message-ID: <5e79efb5-94f8-43e9-8ff0-c7ffdc027c8a@oracle.com>
In-Reply-To: <c49ee3e6-b0f8-4886-a364-e745d0e5d091@tnonline.net>



On 20/5/25 17:19, Forza wrote:
> Hi,
> 
> On 2025-05-12 20:07, Anand Jain wrote:
>> In host hardware, devices can have different speeds. Generally, faster
>> devices come with less capacity while slower devices come with larger
>> capacity. A typical configuration would expect that:
>>
>>   - A filesystem's read/write performance is evenly distributed on average
>>   across the entire filesystem. This is not achievable with the current
>>   allocation method because chunks are allocated based only on device free
>>   space.
>>
>>   - Typically, faster devices are assigned to metadata chunk allocations
>>   while slower devices are assigned to data chunk allocations.
>>
>> Introducing Device Roles:
>>
>>   Here I define 5 device roles in a specific order for metadata and in the
>>   reverse order for data: metadata_only, metadata, none, data, data_only.
>>   One or more devices may have the same role.
>>
>>   The metadata and data roles indicate preference but not exclusivity for
>>   that role, whereas data_only and metadata_only are exclusive roles.
> 
> This sounds like the old preferred_metadata (Allocator Hints) patch
> series from Goffredo Baroncelli[1] back in 2020, now being maintained
> and improved by Kai Krakow[2] and others. Is this an updated/enhanced
> version of those patches?
> 

Thanks for the comments.

I haven't reviewed the implementation details of [1], so I can't make a
direct comparison. The goal here is to define a generic device priority
range from 1 to 255, which can be externally assigned and stored.

In one of the current modes under development, ROLE_THEN_SPACE, devices
are first grouped by three priority levels, then sorted by available
free space at the time of allocation.

I’m calling them generic device priorities because even when all devices
have similar performance, as is common in most general-purpose setups, we
can still use priorities to enable simple, linear allocation for the
single profile.
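
To make the ordering concrete, here is a minimal userspace sketch of a
role-then-space sort using the five roles from the cover letter. It is
not the code from the patches: the enum values, struct dev_info,
order_devices() and the "more free space first" and devid tie-breaks
are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical role values, in the cover letter's metadata order. */
enum dev_role {
	ROLE_METADATA_ONLY,
	ROLE_METADATA,
	ROLE_NONE,
	ROLE_DATA,
	ROLE_DATA_ONLY,
};

/* Hypothetical per-device record; not a btrfs structure. */
struct dev_info {
	int devid;
	enum dev_role role;
	uint64_t free_bytes;
};

/* Assumed tie-break: more free space first, then lower devid. */
static int cmp_space(const struct dev_info *x, const struct dev_info *y)
{
	if (x->free_bytes != y->free_bytes)
		return x->free_bytes > y->free_bytes ? -1 : 1;
	return (x->devid > y->devid) - (x->devid < y->devid);
}

/* Metadata: metadata_only, metadata, none, data; then by space. */
static int cmp_metadata(const void *a, const void *b)
{
	const struct dev_info *x = a, *y = b;

	if (x->role != y->role)
		return x->role < y->role ? -1 : 1;
	return cmp_space(x, y);
}

/* Data: data_only, data, none, metadata; then by space. */
static int cmp_data(const void *a, const void *b)
{
	const struct dev_info *x = a, *y = b;

	if (x->role != y->role)
		return x->role > y->role ? -1 : 1;
	return cmp_space(x, y);
}

/*
 * Copy the devices eligible for this chunk type into out[] (which must
 * hold at least n entries) and sort them by role, then by free space.
 * Devices with the opposite exclusive role are left out entirely.
 * Returns the number of candidates.
 */
static size_t order_devices(const struct dev_info *devs, size_t n,
			    int for_metadata, struct dev_info *out)
{
	enum dev_role excluded = for_metadata ? ROLE_DATA_ONLY
					      : ROLE_METADATA_ONLY;
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (devs[i].role != excluded)
			out[count++] = devs[i];

	qsort(out, count, sizeof(*out),
	      for_metadata ? cmp_metadata : cmp_data);
	return count;
}
```

In this sketch a data_only device never appears in the metadata
candidate list, and a metadata_only device never appears in the data
list, which is one way to read "exclusive" in the role description.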

>>
>> Introducing Role-then-Space allocation method:
>>
>>   Metadata allocation can happen on devices with the roles metadata_only,
>>   metadata, none, and data in that order. If multiple devices share a role,
>>   they are arranged based on device free space.
>>
>>   Similarly, data allocation can happen on devices with the roles data_only,
>>   data, none, and metadata in that order. If multiple devices share a role,
>>   they are arranged based on device free space.
> 
> The Allocator Hints patch series shows that this is a good method.
> Several of us use those patches, including in production environments,
> to good effect. Some argue that having more tiers would be beneficial,
> as it could be combined with a defrag or balance operation to place
> data on slow or fast storage.
> 
>>
>> Finding device speed automatically:
>>
>>   Measuring device read/write latency for the allocation is not a good idea,
>>   as the historical readings may be misleading: they could include
>>   iostat data from periods with issues that have since been fixed. Testing
>>   to determine relative latency and arranging in ascending order for metadata
>>   and descending for data is possible, but is better handled by an external
>>   tool that can still set device roles.
> 
> Benchmarks using round-robin, latency, latency-round-robin and
> queue-based scheduling show that latency-based allocation can be
> particularly useful for some workloads and device types. It is difficult
> to generalise, but based on benchmarks we see that a good all-rounder is
> a queue-based approach. See [3] for a complete set of raw data from
> these benchmarks.
> 

I'm not commenting on the implementation details. My point is that
dynamic latency-based allocation was previously rejected because
temporary latency spikes can mislead the allocator and cause data to
land on the wrong device.
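
As a rough illustration of that failure mode, the sketch below tracks a
smoothed latency per device and picks the device with the lowest value.
The exponentially weighted average, the 0.25 weight and every name in
it are assumptions chosen for the example; the thread does not say how
the rejected approach averaged its readings.

```c
#include <stddef.h>

/* Hypothetical per-device latency tracker; not a btrfs structure. */
struct dev_latency {
	double avg_ms;	/* smoothed historical latency */
};

/* Fold one new sample into the running average (assumed EWMA). */
static void latency_record(struct dev_latency *dev, double sample_ms,
			   double weight)
{
	dev->avg_ms = weight * sample_ms + (1.0 - weight) * dev->avg_ms;
}

/* Return the index of the device with the lowest smoothed latency. */
static size_t pick_lowest_latency(const struct dev_latency *devs, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (devs[i].avg_ms < devs[best].avg_ms)
			best = i;
	return best;
}
```

With a fast device averaging 0.2 ms and a slow one averaging 8 ms, a
single 500 ms sample on the fast device lifts its average to about
125 ms at a weight of 0.25. It then takes roughly ten normal samples
before the fast device ranks first again, and any chunk allocated in
that window goes to the slow device and stays there.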

That said, for reads, there are indeed patches that support a latency-
based read_policy.

> 
> |  # | Storage    | Jobs | Test                | Policy      |   IOPS  |
> | -: | :--------- | ---: | :------------------ | :---------- | ------: |
> |  1 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | pid         |      81 |
> |  2 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | round-robin |      93 |
> |  3 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | latency     |      89 |
> |  4 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | latency-rr  |      87 |
> |  5 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | queue       |     102 |
> |  6 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | pid         |  68 800 |
> |  7 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | round-robin | 143 000 |
> |  8 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | latency     | 142 000 |
> |  9 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | latency-rr  | 137 000 |
> | 10 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | queue       | 143 000 |
> 
> (table wraps)
> 
> |  # | Policy      | BW (KiB/s) | Avg Lat (ms) | 99 % Lat | 99.9 % Lat |
> | -: | :---------- | ---------: | -----------: | -------: | ---------: |
> |  1 | pid         |        328 |        0.310 |   30.016 |    242.222 |
> |  2 | round-robin |        374 |        0.091 |   26.084 |     60.031 |
> |  3 | latency     |        358 |        0.041 |   26.608 |     32.900 |
> |  4 | latency-rr  |        348 |        0.041 |   28.181 |     33.817 |
> |  5 | queue       |        409 |        0.050 |   24.511 |     35.390 |
> |  6 | pid         |    275 456 |        0.458 |    8.029 |     10.290 |
> |  7 | round-robin |    572 416 |        0.217 |    0.338 |      0.627 |
> |  8 | latency     |    569 344 |        0.219 |    0.306 |      0.400 |
> |  9 | latency-rr  |    547 840 |        0.227 |    0.326 |      0.449 |
> | 10 | queue       |    571 392 |        0.218 |    0.457 |      0.594 |
> 
> I think md uses a mix of queue-based and sector-distance-based
> approaches, depending on device type[4].
> 
>>
>> On-Disk Format changes:
>>
>>   The following items are defined in the on-disk format but are unused:
>>
>>     btrfs_dev_item::
>>      __le64 type; // unused
>>      __le64 start_offset; // unused
>>      __le32 dev_group; // unused
>>      __u8 seek_speed; // unused
>>      __u8 bandwidth; // unused
>>
>>   Device roles use the dev_item::type 8-bit field to store each
>>   device's role.
>>
>> Anand Jain (10):
>>   btrfs: fix thresh scope in should_alloc_chunk()
>>   btrfs: refactor should_alloc_chunk() arg type
>>   btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>>   btrfs: introduce device allocation method
>>   btrfs: sysfs: show device allocation method
>>   btrfs: skip device sorting when only one device is present
>>   btrfs: refactor chunk allocation device handling to use list_head
>>   btrfs: introduce explicit device roles for block groups
>>   btrfs: introduce ROLE_THEN_SPACE device allocation method
>>   btrfs: pass device roles through device add ioctl
> 
> 
> 
> Have you considered how to deal with `df` and disk free calculation? Are 
> device roles preserved during `btrfs device replace`?
>

This is the foundational framework; the remaining features will be added
progressively.

Thanks!
Anand

> Thank you!
> 
> [1] https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/T/
> [2] https://github.com/kakra/linux/pull/36
> [3] https://gist.github.com/kakra/ce99896e5915f9b26d13c5637f56ff37
> [4] https://github.com/torvalds/linux/blob/a5806cd506af5a7c19bcd596e4708b5c464bfd21/drivers/md/raid1.c#L832-L843
> 
> 
> 
> 

