public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Goffredo Baroncelli <kreijack@libero.it>
Cc: linux-btrfs@vger.kernel.org,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	David Sterba <dsterba@suse.cz>,
	Sinnamohideen Shafeeq <shafeeqs@panasas.com>,
	Paul Jones <paul@pauljones.id.au>, Boris Burkov <boris@bur.io>,
	Goffredo Baroncelli <kreijack@inwind.it>
Subject: Re: [PATCH 4/5] btrfs: add allocation_hint mode
Date: Tue, 22 Mar 2022 15:47:56 -0400	[thread overview]
Message-ID: <Yjon7DClcBkw2V9i@localhost.localdomain> (raw)
In-Reply-To: <2291ba747c6c9701952fa75140684535cfe4ab3e.1646589622.git.kreijack@inwind.it>

On Sun, Mar 06, 2022 at 07:14:42PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> The chunk allocation policy is modified as follow.
> 
> Each disk may have one of the following tags:
> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> 
> During a *mixed data/metadata* chunk allocation, BTRFS works as
> usual.
> 
> During a *data* chunk allocation, the space are searched first in
> BTRFS_DEV_ALLOCATION_DATA_ONLY. If the space found is not enough (eg.
> in raid5, only two disks are available), then the disks tagged
> BTRFS_DEV_ALLOCATION_DATA_PREFERRED are considered. If the space is not
> enough again, the disks tagged BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> are also considered. If even in this case the space is not
> sufficient, -ENOSPC is raised.
> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
> for a data BG allocation.
> 
> During a *metadata* chunk allocation, the same algorithm applies swapping
> _DATA_ and _METADATA_.
> 
> By default the disks are tagged as BTRFS_DEV_ALLOCATION_DATA_PREFERRED,
> so BTRFS behaves as usual.
> 
> If the user prefers to store the metadata in the faster disks (e.g. SSD),
> he can tag these with BTRFS_DEV_ALLOCATION_METADATA_PREFERRED: in this
> case the metadata BG go in the BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> disks and the data BG in the others ones. When a disks set is filled, the
> other is considered.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h |   1 +
>  2 files changed, 111 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d4ac90f5c949..7b37db9bb887 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -184,6 +184,27 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
>  	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
>  }
>  
> +#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
> +		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
> +
> +/*
> + *	The order of BTRFS_DEV_ALLOCATION_HINT_* values are not
> + *	good, because BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED is 0
> + *	(for backward compatibility reason), and the other
> + *	values are greater (because the field is unsigned). So we
> + *	need a map that rearranges the order giving to _DATA_PREFERRED
> + *	an intermediate priority.
> + *	These values give to METADATA_ONLY the highest priority, and are
> + *	valid for metadata BG allocation. When a data
> + *	BG is allocated we negate these values to reverse the priority.
> + */
> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED] = 0,
> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED] = 1,
> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
> +};

This should be const int, not const char.  Also the formatting for the comment
is awkward, it's 1 space between the * and the word, so

/*
 * The order of ....
 *
 * These values give METADATA_ONLY the highest priority...
 */

Also the -1 thing is weird and unclear.  In fact I think it's problematic, I'll
explain below.

> +
>  const char *btrfs_bg_type_to_raid_name(u64 flags)
>  {
>  	const int index = btrfs_bg_flags_to_raid_index(flags);
> @@ -5030,13 +5051,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  }
>  
>  /*
> - * sort the devices in descending order by max_avail, total_avail
> + * sort the devices in descending order by alloc_hint,
> + * max_avail, total_avail
>   */
>  static int btrfs_cmp_device_info(const void *a, const void *b)
>  {
>  	const struct btrfs_device_info *di_a = a;
>  	const struct btrfs_device_info *di_b = b;
>  
> +	if (di_a->alloc_hint > di_b->alloc_hint)
> +		return -1;
> +	if (di_a->alloc_hint < di_b->alloc_hint)
> +		return 1;

This is making things awkward, instead I think we change this to sort_r, which
uses cmp(a, b, priv).  You pass in priv which is the type we want, DATA or
METADATA or whatever.  Then you can do

if (priv == data) {
	/* do the sorting so DATA_ONLY is on top, then DATA_PREFERRED, etc. */
} else {
	/* do the METADATA thing instead. */
}

>  	if (di_a->max_avail > di_b->max_avail)
>  		return -1;
>  	if (di_a->max_avail < di_b->max_avail)
> @@ -5199,6 +5225,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	int ndevs = 0;
>  	u64 max_avail;
>  	u64 dev_offset;
> +	int hint;
>  
>  	/*
>  	 * in the first pass through the devices list, we gather information
> @@ -5251,17 +5278,95 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +
> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
> +			/*
> +			 * if mixed bg set all the alloc_hint
> +			 * fields to the same value, so the sorting
> +			 * is not affected
> +			 */
> +			devices_info[ndevs].alloc_hint = 0;
> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_METADATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
> +				continue;
> +			/*
> +			 * if a data chunk must be allocated,
> +			 * sort also by hint (data disk
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_DATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
> +				continue;
> +			/*
> +			 * if a metadata chunk must be allocated,
> +			 * sort also by hint (metadata hint
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
> +		}
> +

Shouldn't we be doing this before adding the device to devices_info?  That way
for _ONLY we just don't even add the disk to the devices_info.

>  		++ndevs;
>  	}
>  	ctl->ndevs = ndevs;
>  
> +	return 0;
> +}
> +
> +static void sort_and_reduce_device_info(struct alloc_chunk_ctl *ctl,
> +					struct btrfs_device_info *devices_info)
> +{
> +	int ndevs, hint, i;
> +
> +	ndevs = ctl->ndevs;
>  	/*
> -	 * now sort the devices by hole size / available space
> +	 * now sort the devices by hint / hole size / available space
>  	 */
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> -	return 0;
> +	/*
> +	 * select the minimum set of disks grouped by hint that
> +	 * can host the chunk
> +	 */
> +	ndevs = 0;
> +	while (ndevs < ctl->ndevs) {
> +		hint = devices_info[ndevs++].alloc_hint;
> +		while (ndevs < ctl->ndevs) {
> +			if (devices_info[ndevs].alloc_hint != hint)
> +				break;
> +			ndevs++;
> +		}
> +		if (ndevs >= ctl->devs_min)
> +			break;
> +	}
> +
> +	ctl->ndevs = ndevs;
> +
> +	/*
> +	 * the next layers require the devices_info ordered by
> +	 * max_avail. If we are returning two (or more) different
> +	 * group of alloc_hint, this is not always true. So sort
> +	 * these again.
> +	 */
> +
> +	for (i = 0 ; i < ndevs ; i++)
> +		devices_info[i].alloc_hint = 0;
> +
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);

With my sort_r suggestion I think we no longer need the second sort.  It'll get
you the devices you want in order of most preferred alloc hint, and with the
max_avail.  I'd double check with printk's, but you should be able to drop all
this.  Thanks,

Josef

  reply	other threads:[~2022-03-22 19:48 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2022-03-22 17:51   ` Josef Bacik
2022-03-22 17:56   ` Josef Bacik
2022-03-22 18:49     ` Goffredo Baroncelli
2022-04-06 19:32       ` Goffredo Baroncelli
2022-03-23  3:26     ` Zygo Blaxell
2022-03-22 19:52   ` Josef Bacik
2022-03-22 20:25     ` Goffredo Baroncelli
2022-03-23  2:56       ` Zygo Blaxell
2022-03-24 19:05         ` Goffredo Baroncelli
2022-03-25 14:59       ` Josef Bacik
2022-03-25 18:55         ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
2022-03-08 13:25   ` kernel test robot
2022-03-22 18:05   ` Josef Bacik
2022-03-22 19:02     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
2022-03-22 19:19   ` Josef Bacik
2022-03-22 19:52     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 4/5] btrfs: add allocation_hint mode Goffredo Baroncelli
2022-03-22 19:47   ` Josef Bacik [this message]
2022-03-22 20:56     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 5/5] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
2022-03-21 20:29 ` [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yjon7DClcBkw2V9i@localhost.localdomain \
    --to=josef@toxicpanda.com \
    --cc=boris@bur.io \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=dsterba@suse.cz \
    --cc=kreijack@inwind.it \
    --cc=kreijack@libero.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=paul@pauljones.id.au \
    --cc=shafeeqs@panasas.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox