All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Goffredo Baroncelli <kreijack@libero.it>
Cc: linux-btrfs@vger.kernel.org,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	David Sterba <dsterba@suse.cz>,
	Sinnamohideen Shafeeq <shafeeqs@panasas.com>,
	Paul Jones <paul@pauljones.id.au>, Boris Burkov <boris@bur.io>,
	Goffredo Baroncelli <kreijack@inwind.it>
Subject: Re: [PATCH 4/5] btrfs: add allocation_hint mode
Date: Tue, 22 Mar 2022 15:47:56 -0400	[thread overview]
Message-ID: <Yjon7DClcBkw2V9i@localhost.localdomain> (raw)
In-Reply-To: <2291ba747c6c9701952fa75140684535cfe4ab3e.1646589622.git.kreijack@inwind.it>

On Sun, Mar 06, 2022 at 07:14:42PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> The chunk allocation policy is modified as follow.
> 
> Each disk may have one of the following tags:
> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> 
> During a *mixed data/metadata* chunk allocation, BTRFS works as
> usual.
> 
> During a *data* chunk allocation, the space are searched first in
> BTRFS_DEV_ALLOCATION_DATA_ONLY. If the space found is not enough (eg.
> in raid5, only two disks are available), then the disks tagged
> BTRFS_DEV_ALLOCATION_DATA_PREFERRED are considered. If the space is not
> enough again, the disks tagged BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> are also considered. If even in this case the space is not
> sufficient, -ENOSPC is raised.
> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
> for a data BG allocation.
> 
> During a *metadata* chunk allocation, the same algorithm applies swapping
> _DATA_ and _METADATA_.
> 
> By default the disks are tagged as BTRFS_DEV_ALLOCATION_DATA_PREFERRED,
> so BTRFS behaves as usual.
> 
> If the user prefers to store the metadata in the faster disks (e.g. SSD),
> he can tag these with BTRFS_DEV_ALLOCATION_METADATA_PREFERRED: in this
> case the metadata BG go in the BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> disks and the data BG in the others ones. When a disks set is filled, the
> other is considered.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h |   1 +
>  2 files changed, 111 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d4ac90f5c949..7b37db9bb887 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -184,6 +184,27 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
>  	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
>  }
>  
> +#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
> +		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
> +
> +/*
> + *	The order of BTRFS_DEV_ALLOCATION_HINT_* values are not
> + *	good, because BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED is 0
> + *	(for backward compatibility reason), and the other
> + *	values are greater (because the field is unsigned). So we
> + *	need a map that rearranges the order giving to _DATA_PREFERRED
> + *	an intermediate priority.
> + *	These values give to METADATA_ONLY the highest priority, and are
> + *	valid for metadata BG allocation. When a data
> + *	BG is allocated we negate these values to reverse the priority.
> + */
> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED] = 0,
> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED] = 1,
> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
> +};

This should be const int, not const char.  Also the formatting for the comment
is awkward, it's 1 space between the * and the word, so

/*
 * The order of ....
 *
 * These values give METADATA_ONLY the highest priority...
 */

Also the -1 thing is weird and unclear.  In fact I think it's problematic, I'll
explain below.

> +
>  const char *btrfs_bg_type_to_raid_name(u64 flags)
>  {
>  	const int index = btrfs_bg_flags_to_raid_index(flags);
> @@ -5030,13 +5051,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  }
>  
>  /*
> - * sort the devices in descending order by max_avail, total_avail
> + * sort the devices in descending order by alloc_hint,
> + * max_avail, total_avail
>   */
>  static int btrfs_cmp_device_info(const void *a, const void *b)
>  {
>  	const struct btrfs_device_info *di_a = a;
>  	const struct btrfs_device_info *di_b = b;
>  
> +	if (di_a->alloc_hint > di_b->alloc_hint)
> +		return -1;
> +	if (di_a->alloc_hint < di_b->alloc_hint)
> +		return 1;

This is making things awkward, instead I think we change this to sort_r, which
uses cmp(a, b, priv).  You pass in priv which is the type we want, DATA or
METADATA or whatever.  Then you can do

if (priv == data) {
	/* do the sorting so DATA_ONLY is on top, then DATA_PREFERRED, etc. */
} else {
	/* do the METADATA thing instead. */
}

>  	if (di_a->max_avail > di_b->max_avail)
>  		return -1;
>  	if (di_a->max_avail < di_b->max_avail)
> @@ -5199,6 +5225,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	int ndevs = 0;
>  	u64 max_avail;
>  	u64 dev_offset;
> +	int hint;
>  
>  	/*
>  	 * in the first pass through the devices list, we gather information
> @@ -5251,17 +5278,95 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +
> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
> +			/*
> +			 * if mixed bg set all the alloc_hint
> +			 * fields to the same value, so the sorting
> +			 * is not affected
> +			 */
> +			devices_info[ndevs].alloc_hint = 0;
> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_METADATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
> +				continue;
> +			/*
> +			 * if a data chunk must be allocated,
> +			 * sort also by hint (data disk
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_DATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
> +				continue;
> +			/*
> +			 * if a metadata chunk must be allocated,
> +			 * sort also by hint (metadata hint
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
> +		}
> +

Shouldn't we be doing this before adding the device to devices_info?  That way
for _ONLY we just don't even add the disk to the devices_info.

>  		++ndevs;
>  	}
>  	ctl->ndevs = ndevs;
>  
> +	return 0;
> +}
> +
> +static void sort_and_reduce_device_info(struct alloc_chunk_ctl *ctl,
> +					struct btrfs_device_info *devices_info)
> +{
> +	int ndevs, hint, i;
> +
> +	ndevs = ctl->ndevs;
>  	/*
> -	 * now sort the devices by hole size / available space
> +	 * now sort the devices by hint / hole size / available space
>  	 */
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> -	return 0;
> +	/*
> +	 * select the minimum set of disks grouped by hint that
> +	 * can host the chunk
> +	 */
> +	ndevs = 0;
> +	while (ndevs < ctl->ndevs) {
> +		hint = devices_info[ndevs++].alloc_hint;
> +		while (ndevs < ctl->ndevs) {
> +			if (devices_info[ndevs].alloc_hint != hint)
> +				break;
> +			ndevs++;
> +		}
> +		if (ndevs >= ctl->devs_min)
> +			break;
> +	}
> +
> +	ctl->ndevs = ndevs;
> +
> +	/*
> +	 * the next layers require the devices_info ordered by
> +	 * max_avail. If we are returning two (or more) different
> +	 * group of alloc_hint, this is not always true. So sort
> +	 * these again.
> +	 */
> +
> +	for (i = 0 ; i < ndevs ; i++)
> +		devices_info[i].alloc_hint = 0;
> +
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);

With my sort_r suggestion I think we no longer need the second sort.  It'll get
you the devices you want in order of most preferred alloc hint, and with the
max_avail.  I'd double check with printk's, but you should be able to drop all
this.  Thanks,

Josef

  reply	other threads:[~2022-03-22 19:48 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2022-03-22 17:51   ` Josef Bacik
2022-03-22 17:56   ` Josef Bacik
2022-03-22 18:49     ` Goffredo Baroncelli
2022-04-06 19:32       ` Goffredo Baroncelli
2022-03-23  3:26     ` Zygo Blaxell
2022-03-22 19:52   ` Josef Bacik
2022-03-22 20:25     ` Goffredo Baroncelli
2022-03-23  2:56       ` Zygo Blaxell
2022-03-24 19:05         ` Goffredo Baroncelli
2022-03-25 14:59       ` Josef Bacik
2022-03-25 18:55         ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
2022-03-08 13:25   ` kernel test robot
2022-03-22 18:05   ` Josef Bacik
2022-03-22 19:02     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
2022-03-22 19:19   ` Josef Bacik
2022-03-22 19:52     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 4/5] btrfs: add allocation_hint mode Goffredo Baroncelli
2022-03-22 19:47   ` Josef Bacik [this message]
2022-03-22 20:56     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 5/5] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
2022-03-21 20:29 ` [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yjon7DClcBkw2V9i@localhost.localdomain \
    --to=josef@toxicpanda.com \
    --cc=boris@bur.io \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=dsterba@suse.cz \
    --cc=kreijack@inwind.it \
    --cc=kreijack@libero.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=paul@pauljones.id.au \
    --cc=shafeeqs@panasas.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.