linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] btrfs: quasi-round-robin for chunk allocation
@ 2011-05-02  8:47 Arne Jansen
  2011-05-02  8:47 ` [PATCH v3 1/3] btrfs: move btrfs_cmp_device_free_bytes to super.c Arne Jansen
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Arne Jansen @ 2011-05-02  8:47 UTC (permalink / raw)
  To: chris.mason, linux-btrfs

In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.

Changes from v1:
 - split into multiple parts
 - added some comments
 - generated with --patience for better readability

Changes from v2:
 - rebased to current master
 - bugfix for 'single' raid type; the initial parameter initialization lacked
   a case for the 'single' type, thus leaving devs_max at the wrong value

Arne Jansen (3):
  btrfs: move btrfs_cmp_device_free_bytes to super.c
  btrfs: heed alloc_start
  btrfs: quasi-round-robin for chunk allocation

 fs/btrfs/super.c   |   25 +++
 fs/btrfs/volumes.c |  494 ++++++++++++++++++----------------------------------
 fs/btrfs/volumes.h |   16 +--
 3 files changed, 197 insertions(+), 338 deletions(-)

-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v3 1/3] btrfs: move btrfs_cmp_device_free_bytes to super.c
  2011-05-02  8:47 [PATCH v3 0/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
@ 2011-05-02  8:47 ` Arne Jansen
  2011-05-02  9:16   ` Daniel J Blueman
  2011-05-02  8:47 ` [PATCH v3 2/3] btrfs: heed alloc_start Arne Jansen
  2011-05-02  8:47 ` [PATCH v3 3/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
  2 siblings, 1 reply; 6+ messages in thread
From: Arne Jansen @ 2011-05-02  8:47 UTC (permalink / raw)
  To: chris.mason, linux-btrfs

this function won't be used here anymore, so move it super.c where it is
used for df-calculation

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/super.c   |   25 +++++++++++++++++++++++++
 fs/btrfs/volumes.c |   13 -------------
 fs/btrfs/volumes.h |   15 ---------------
 3 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0ac712e..d8c9a49 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -913,6 +913,31 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 	return 0;
 }
 
+/* Used to sort the devices by max_avail(descending sort) */
+int btrfs_cmp_device_free_bytes(const void *dev_info1, const void *dev_info2)
+{
+	if (((struct btrfs_device_info *)dev_info1)->max_avail >
+	    ((struct btrfs_device_info *)dev_info2)->max_avail)
+		return -1;
+	else if (((struct btrfs_device_info *)dev_info1)->max_avail <
+	         ((struct btrfs_device_info *)dev_info2)->max_avail)
+		return 1;
+	else
+	return 0;
+}
+
+/*
+ * sort the devices by max_avail, in which max free extent size of each device
+ * is stored.(Descending Sort)
+ */
+static inline void btrfs_descending_sort_devices(
+					struct btrfs_device_info *devices,
+					size_t nr_devices)
+{
+	sort(devices, nr_devices, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_free_bytes, NULL);
+}
+
 /*
  * The helper to calc the free space on the devices that can be used to store
  * file data.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8b9fb8c..a9f1fc2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2282,19 +2282,6 @@ static noinline u64 chunk_bytes_by_type(u64 type, u64 calc_size,
 		return calc_size * num_stripes;
 }
 
-/* Used to sort the devices by max_avail(descending sort) */
-int btrfs_cmp_device_free_bytes(const void *dev_info1, const void *dev_info2)
-{
-	if (((struct btrfs_device_info *)dev_info1)->max_avail >
-	    ((struct btrfs_device_info *)dev_info2)->max_avail)
-		return -1;
-	else if (((struct btrfs_device_info *)dev_info1)->max_avail <
-		 ((struct btrfs_device_info *)dev_info2)->max_avail)
-		return 1;
-	else
-		return 0;
-}
-
 static int __btrfs_calc_nstripes(struct btrfs_fs_devices *fs_devices, u64 type,
 				 int *num_stripes, int *min_stripes,
 				 int *sub_stripes)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index cc2eada..b502f01 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -157,21 +157,6 @@ struct map_lookup {
 	struct btrfs_bio_stripe stripes[];
 };
 
-/* Used to sort the devices by max_avail(descending sort) */
-int btrfs_cmp_device_free_bytes(const void *dev_info1, const void *dev_info2);
-
-/*
- * sort the devices by max_avail, in which max free extent size of each device
- * is stored.(Descending Sort)
- */
-static inline void btrfs_descending_sort_devices(
-					struct btrfs_device_info *devices,
-					size_t nr_devices)
-{
-	sort(devices, nr_devices, sizeof(struct btrfs_device_info),
-	     btrfs_cmp_device_free_bytes, NULL);
-}
-
 int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 				   u64 end, u64 *length);
 
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v3 2/3] btrfs: heed alloc_start
  2011-05-02  8:47 [PATCH v3 0/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
  2011-05-02  8:47 ` [PATCH v3 1/3] btrfs: move btrfs_cmp_device_free_bytes to super.c Arne Jansen
@ 2011-05-02  8:47 ` Arne Jansen
  2011-05-02  8:47 ` [PATCH v3 3/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
  2 siblings, 0 replies; 6+ messages in thread
From: Arne Jansen @ 2011-05-02  8:47 UTC (permalink / raw)
  To: chris.mason, linux-btrfs

currently alloc_start is disregarded if the requested
chunk size is bigger than (device size - alloc_start),
but smaller than the device size.
The only situation where I see this could have made sense
was when a chunk equal the size of the device has been
requested. This was possible as the allocator failed to
take alloc_start into account when calculating the request
chunk size. As this gets fixed by this patch, the workaround
is not necessary anymore.

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/volumes.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a9f1fc2..45c592a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -849,10 +849,7 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans,
 	/* we don't want to overwrite the superblock on the drive,
 	 * so we make sure to start at an offset of at least 1MB
 	 */
-	search_start = 1024 * 1024;
-
-	if (root->fs_info->alloc_start + num_bytes <= search_end)
-		search_start = max(root->fs_info->alloc_start, search_start);
+	search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
 
 	max_hole_start = search_start;
 	max_hole_size = 0;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v3 3/3] btrfs: quasi-round-robin for chunk allocation
  2011-05-02  8:47 [PATCH v3 0/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
  2011-05-02  8:47 ` [PATCH v3 1/3] btrfs: move btrfs_cmp_device_free_bytes to super.c Arne Jansen
  2011-05-02  8:47 ` [PATCH v3 2/3] btrfs: heed alloc_start Arne Jansen
@ 2011-05-02  8:47 ` Arne Jansen
  2011-05-13 11:43   ` David Sterba
  2 siblings, 1 reply; 6+ messages in thread
From: Arne Jansen @ 2011-05-02  8:47 UTC (permalink / raw)
  To: chris.mason, linux-btrfs

In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.

Changes from v2:
 - bugfix for 'single' raid type; the initial parameter initialization lacked
   a case for the 'single' type, thus leaving devs_max at the wrong value

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/volumes.c |  476 +++++++++++++++++++---------------------------------
 fs/btrfs/volumes.h |    1 +
 2 files changed, 171 insertions(+), 306 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 45c592a..77eb763 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2268,349 +2268,213 @@ static int btrfs_add_system_chunk(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-static noinline u64 chunk_bytes_by_type(u64 type, u64 calc_size,
-					int num_stripes, int sub_stripes)
-{
-	if (type & (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_DUP))
-		return calc_size;
-	else if (type & BTRFS_BLOCK_GROUP_RAID10)
-		return calc_size * (num_stripes / sub_stripes);
-	else
-		return calc_size * num_stripes;
-}
-
-static int __btrfs_calc_nstripes(struct btrfs_fs_devices *fs_devices, u64 type,
-				 int *num_stripes, int *min_stripes,
-				 int *sub_stripes)
-{
-	*num_stripes = 1;
-	*min_stripes = 1;
-	*sub_stripes = 0;
-
-	if (type & (BTRFS_BLOCK_GROUP_RAID0)) {
-		*num_stripes = fs_devices->rw_devices;
-		*min_stripes = 2;
-	}
-	if (type & (BTRFS_BLOCK_GROUP_DUP)) {
-		*num_stripes = 2;
-		*min_stripes = 2;
-	}
-	if (type & (BTRFS_BLOCK_GROUP_RAID1)) {
-		if (fs_devices->rw_devices < 2)
-			return -ENOSPC;
-		*num_stripes = 2;
-		*min_stripes = 2;
-	}
-	if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
-		*num_stripes = fs_devices->rw_devices;
-		if (*num_stripes < 4)
-			return -ENOSPC;
-		*num_stripes &= ~(u32)1;
-		*sub_stripes = 2;
-		*min_stripes = 4;
-	}
-
-	return 0;
-}
-
-static u64 __btrfs_calc_stripe_size(struct btrfs_fs_devices *fs_devices,
-				    u64 proposed_size, u64 type,
-				    int num_stripes, int small_stripe)
-{
-	int min_stripe_size = 1 * 1024 * 1024;
-	u64 calc_size = proposed_size;
-	u64 max_chunk_size = calc_size;
-	int ncopies = 1;
-
-	if (type & (BTRFS_BLOCK_GROUP_RAID1 |
-		    BTRFS_BLOCK_GROUP_DUP |
-		    BTRFS_BLOCK_GROUP_RAID10))
-		ncopies = 2;
-
-	if (type & BTRFS_BLOCK_GROUP_DATA) {
-		max_chunk_size = 10 * calc_size;
-		min_stripe_size = 64 * 1024 * 1024;
-	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
-		max_chunk_size = 256 * 1024 * 1024;
-		min_stripe_size = 32 * 1024 * 1024;
-	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
-		calc_size = 8 * 1024 * 1024;
-		max_chunk_size = calc_size * 2;
-		min_stripe_size = 1 * 1024 * 1024;
-	}
-
-	/* we don't want a chunk larger than 10% of writeable space */
-	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
-			     max_chunk_size);
-
-	if (calc_size * num_stripes > max_chunk_size * ncopies) {
-		calc_size = max_chunk_size * ncopies;
-		do_div(calc_size, num_stripes);
-		do_div(calc_size, BTRFS_STRIPE_LEN);
-		calc_size *= BTRFS_STRIPE_LEN;
-	}
-
-	/* we don't want tiny stripes */
-	if (!small_stripe)
-		calc_size = max_t(u64, min_stripe_size, calc_size);
-
-	/*
-	 * we're about to do_div by the BTRFS_STRIPE_LEN so lets make sure
-	 * we end up with something bigger than a stripe
-	 */
-	calc_size = max_t(u64, calc_size, BTRFS_STRIPE_LEN);
-
-	do_div(calc_size, BTRFS_STRIPE_LEN);
-	calc_size *= BTRFS_STRIPE_LEN;
-
-	return calc_size;
-}
-
-static struct map_lookup *__shrink_map_lookup_stripes(struct map_lookup *map,
-						      int num_stripes)
-{
-	struct map_lookup *new;
-	size_t len = map_lookup_size(num_stripes);
-
-	BUG_ON(map->num_stripes < num_stripes);
-
-	if (map->num_stripes == num_stripes)
-		return map;
-
-	new = kmalloc(len, GFP_NOFS);
-	if (!new) {
-		/* just change map->num_stripes */
-		map->num_stripes = num_stripes;
-		return map;
-	}
-
-	memcpy(new, map, len);
-	new->num_stripes = num_stripes;
-	kfree(map);
-	return new;
-}
-
 /*
- * helper to allocate device space from btrfs_device_info, in which we stored
- * max free space information of every device. It is used when we can not
- * allocate chunks by default size.
- *
- * By this helper, we can allocate a new chunk as larger as possible.
+ * sort the devices in descending order by max_avail, total_avail
  */
-static int __btrfs_alloc_tiny_space(struct btrfs_trans_handle *trans,
-				    struct btrfs_fs_devices *fs_devices,
-				    struct btrfs_device_info *devices,
-				    int nr_device, u64 type,
-				    struct map_lookup **map_lookup,
-				    int min_stripes, u64 *stripe_size)
+static int btrfs_cmp_device_info(const void *a, const void *b)
 {
-	int i, index, sort_again = 0;
-	int min_devices = min_stripes;
-	u64 max_avail, min_free;
-	struct map_lookup *map = *map_lookup;
-	int ret;
-
-	if (nr_device < min_stripes)
-		return -ENOSPC;
-
-	btrfs_descending_sort_devices(devices, nr_device);
-
-	max_avail = devices[0].max_avail;
-	if (!max_avail)
-		return -ENOSPC;
-
-	for (i = 0; i < nr_device; i++) {
-		/*
-		 * if dev_offset = 0, it means the free space of this device
-		 * is less than what we need, and we didn't search max avail
-		 * extent on this device, so do it now.
-		 */
-		if (!devices[i].dev_offset) {
-			ret = find_free_dev_extent(trans, devices[i].dev,
-						   max_avail,
-						   &devices[i].dev_offset,
-						   &devices[i].max_avail);
-			if (ret != 0 && ret != -ENOSPC)
-				return ret;
-			sort_again = 1;
-		}
-	}
-
-	/* we update the max avail free extent of each devices, sort again */
-	if (sort_again)
-		btrfs_descending_sort_devices(devices, nr_device);
-
-	if (type & BTRFS_BLOCK_GROUP_DUP)
-		min_devices = 1;
-
-	if (!devices[min_devices - 1].max_avail)
-		return -ENOSPC;
-
-	max_avail = devices[min_devices - 1].max_avail;
-	if (type & BTRFS_BLOCK_GROUP_DUP)
-		do_div(max_avail, 2);
-
-	max_avail = __btrfs_calc_stripe_size(fs_devices, max_avail, type,
-					     min_stripes, 1);
-	if (type & BTRFS_BLOCK_GROUP_DUP)
-		min_free = max_avail * 2;
-	else
-		min_free = max_avail;
-
-	if (min_free > devices[min_devices - 1].max_avail)
-		return -ENOSPC;
-
-	map = __shrink_map_lookup_stripes(map, min_stripes);
-	*stripe_size = max_avail;
-
-	index = 0;
-	for (i = 0; i < min_stripes; i++) {
-		map->stripes[i].dev = devices[index].dev;
-		map->stripes[i].physical = devices[index].dev_offset;
-		if (type & BTRFS_BLOCK_GROUP_DUP) {
-			i++;
-			map->stripes[i].dev = devices[index].dev;
-			map->stripes[i].physical = devices[index].dev_offset +
-						   max_avail;
-		}
-		index++;
-	}
-	*map_lookup = map;
-
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
 	return 0;
 }
 
 static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *extent_root,
 			       struct map_lookup **map_ret,
-			       u64 *num_bytes, u64 *stripe_size,
+			       u64 *num_bytes, u64 *stripe_size_out,
 			       u64 start, u64 type)
 {
 	struct btrfs_fs_info *info = extent_root->fs_info;
 	struct btrfs_device *device = NULL;
 	struct btrfs_fs_devices *fs_devices = info->fs_devices;
 	struct list_head *cur;
-	struct map_lookup *map;
+	struct map_lookup *map = NULL;
 	struct extent_map_tree *em_tree;
 	struct extent_map *em;
-	struct btrfs_device_info *devices_info;
-	struct list_head private_devs;
-	u64 calc_size = 1024 * 1024 * 1024;
-	u64 min_free;
-	u64 avail;
+	struct btrfs_device_info *devices_info = NULL;
+	u64 total_avail;
+	u64 max_avail;
 	u64 dev_offset;
-	int num_stripes;
-	int min_stripes;
-	int sub_stripes;
-	int min_devices;	/* the min number of devices we need */
-	int i;
+	int num_stripes;	/* total number of stripes to allocate */
+	int sub_stripes;	/* sub_stripes info for map */
+	int dev_stripes;	/* stripes per dev */
+	int devs_max;		/* max devs to use */
+	int devs_min;		/* min devs needed */
+	int devs_increment;	/* ndevs has to be a multiple of this */
+	int ncopies;		/* how many copies to data has */
 	int ret;
+	u64 max_stripe_size;
+	u64 max_chunk_size;
+	u64 stripe_size;
+	int ndevs;
 	int index;
+	int i;
+	int j;
 
 	if ((type & BTRFS_BLOCK_GROUP_RAID1) &&
 	    (type & BTRFS_BLOCK_GROUP_DUP)) {
 		WARN_ON(1);
 		type &= ~BTRFS_BLOCK_GROUP_DUP;
 	}
+
 	if (list_empty(&fs_devices->alloc_list))
 		return -ENOSPC;
 
-	ret = __btrfs_calc_nstripes(fs_devices, type, &num_stripes,
-				    &min_stripes, &sub_stripes);
-	if (ret)
-		return ret;
+
+	sub_stripes = 1;
+	dev_stripes = 1;
+	devs_increment = 1;
+	ncopies = 1;
+	devs_max = 0;	/* 0 == as many as possible */
+	devs_min = 1;
+
+	/*
+	 * define the properties of each RAID type.
+	 * FIXME: move this to a global table and use it in all RAID
+	 * calculation code
+	 */
+	if (type & (BTRFS_BLOCK_GROUP_DUP)) {
+		dev_stripes = 2;
+		ncopies = 2;
+		devs_max = 1;
+	} else if (type & (BTRFS_BLOCK_GROUP_RAID0)) {
+		devs_min = 2;
+	} else if (type & (BTRFS_BLOCK_GROUP_RAID1)) {
+		devs_increment = 2;
+		ncopies = 2;
+		devs_max = 2;
+		devs_min = 2;
+	} else if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
+		sub_stripes = 2;
+		devs_increment = 2;
+		ncopies = 2;
+		devs_min = 4;
+	} else {
+		devs_max = 1;
+	}
+
+	if (type & BTRFS_BLOCK_GROUP_DATA) {
+		max_stripe_size = 1024 * 1024 * 1024;
+		max_chunk_size = 10 * max_stripe_size;
+	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
+		max_stripe_size = 256 * 1024 * 1024;
+		max_chunk_size = max_stripe_size;
+	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
+		max_stripe_size = 8 * 1024 * 1024;
+		max_chunk_size = 2 * max_stripe_size;
+	} else {
+		BUG_ON(1);
+	}
+
+	/* we don't want a chunk larger than 10% of writeable space */
+	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
+	                     max_chunk_size);
 
 	devices_info = kzalloc(sizeof(*devices_info) * fs_devices->rw_devices,
 			       GFP_NOFS);
 	if (!devices_info)
 		return -ENOMEM;
 
-	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
-	if (!map) {
-		ret = -ENOMEM;
-		goto error;
-	}
-	map->num_stripes = num_stripes;
-
 	cur = fs_devices->alloc_list.next;
-	index = 0;
-	i = 0;
 
-	calc_size = __btrfs_calc_stripe_size(fs_devices, calc_size, type,
-					     num_stripes, 0);
-
-	if (type & BTRFS_BLOCK_GROUP_DUP) {
-		min_free = calc_size * 2;
-		min_devices = 1;
-	} else {
-		min_free = calc_size;
-		min_devices = min_stripes;
-	}
-
-	INIT_LIST_HEAD(&private_devs);
-	while (index < num_stripes) {
+	/*
+	 * in the first pass through the devices list, we gather information
+	 * about the available holes on each device.
+	 */
+	ndevs = 0;
+	while (cur != &fs_devices->alloc_list) {
 		device = list_entry(cur, struct btrfs_device, dev_alloc_list);
 		BUG_ON(!device->writeable);
+
+		cur = cur->next;
+
+		if (!device->in_fs_metadata)
+			continue;
+
 		if (device->total_bytes > device->bytes_used)
-			avail = device->total_bytes - device->bytes_used;
+			total_avail = device->total_bytes - device->bytes_used;
 		else
-			avail = 0;
-		cur = cur->next;
-
-		if (device->in_fs_metadata && avail >= min_free) {
-			ret = find_free_dev_extent(trans, device, min_free,
-						   &devices_info[i].dev_offset,
-						   &devices_info[i].max_avail);
-			if (ret == 0) {
-				list_move_tail(&device->dev_alloc_list,
-					       &private_devs);
-				map->stripes[index].dev = device;
-				map->stripes[index].physical =
-						devices_info[i].dev_offset;
-				index++;
-				if (type & BTRFS_BLOCK_GROUP_DUP) {
-					map->stripes[index].dev = device;
-					map->stripes[index].physical =
-						devices_info[i].dev_offset +
-						calc_size;
-					index++;
-				}
-			} else if (ret != -ENOSPC)
-				goto error;
-
-			devices_info[i].dev = device;
-			i++;
-		} else if (device->in_fs_metadata &&
-			   avail >= BTRFS_STRIPE_LEN) {
-			devices_info[i].dev = device;
-			devices_info[i].max_avail = avail;
-			i++;
-		}
-
-		if (cur == &fs_devices->alloc_list)
-			break;
-	}
-
-	list_splice(&private_devs, &fs_devices->alloc_list);
-	if (index < num_stripes) {
-		if (index >= min_stripes) {
-			num_stripes = index;
-			if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
-				num_stripes /= sub_stripes;
-				num_stripes *= sub_stripes;
-			}
-
-			map = __shrink_map_lookup_stripes(map, num_stripes);
-		} else if (i >= min_devices) {
-			ret = __btrfs_alloc_tiny_space(trans, fs_devices,
-						       devices_info, i, type,
-						       &map, min_stripes,
-						       &calc_size);
-			if (ret)
-				goto error;
-		} else {
-			ret = -ENOSPC;
+			total_avail = 0;
+		/* avail is off by max(alloc_start, 1MB), but that is the same
+		 * for all devices, so it doesn't hurt the sorting later on
+		 */
+
+		ret = find_free_dev_extent(trans, device,
+		                           max_stripe_size * dev_stripes,
+					   &dev_offset, &max_avail);
+		if (ret && ret != -ENOSPC)
 			goto error;
+
+		if (ret == 0)
+			max_avail = max_stripe_size * dev_stripes;
+
+		if (max_avail < BTRFS_STRIPE_LEN * dev_stripes)
+			continue;
+
+		devices_info[ndevs].dev_offset = dev_offset;
+		devices_info[ndevs].max_avail = max_avail;
+		devices_info[ndevs].total_avail = total_avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+
+	/*
+	 * now sort the devices by hole size / available space
+	 */
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+
+	/* round down to number of usable stripes */
+	ndevs -= ndevs % devs_increment;
+
+	if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
+		ret = -ENOSPC;
+		goto error;
+	}
+
+	if (devs_max && ndevs > devs_max)
+		ndevs = devs_max;
+	/*
+	 * the primary goal is to maximize the number of stripes, so use as many
+	 * devices as possible, even if the stripes are not maximum sized.
+	 */
+	stripe_size = devices_info[ndevs-1].max_avail;
+	num_stripes = ndevs * dev_stripes;
+
+	if (stripe_size * num_stripes > max_chunk_size * ncopies) {
+		stripe_size = max_chunk_size * ncopies;
+		do_div(stripe_size, num_stripes);
+	}
+
+	do_div(stripe_size, dev_stripes);
+	do_div(stripe_size, BTRFS_STRIPE_LEN);
+	stripe_size *= BTRFS_STRIPE_LEN;
+
+	BUG_ON(!stripe_size);
+
+	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
+	if (!map) {
+		ret = -ENOMEM;
+		goto error;
+	}
+	map->num_stripes = num_stripes;
+
+	index = 0;
+	for (i = 0; i < ndevs; ++i) {
+		for (j = 0; j < dev_stripes; ++j) {
+			int s = i * dev_stripes + j;
+			map->stripes[s].dev = devices_info[i].dev;
+			map->stripes[s].physical = devices_info[i].dev_offset +
+                                                   j * stripe_size;
 		}
 	}
 	map->sector_size = extent_root->sectorsize;
@@ -2621,9 +2485,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	map->sub_stripes = sub_stripes;
 
 	*map_ret = map;
-	*stripe_size = calc_size;
-	*num_bytes = chunk_bytes_by_type(type, calc_size,
-					 map->num_stripes, sub_stripes);
+	*stripe_size_out = stripe_size;
+	*num_bytes = stripe_size * (num_stripes / ncopies);
 
 	trace_btrfs_chunk_alloc(info->chunk_root, map, start, *num_bytes);
 
@@ -2658,7 +2521,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		ret = btrfs_alloc_dev_extent(trans, device,
 				info->chunk_root->root_key.objectid,
 				BTRFS_FIRST_CHUNK_TREE_OBJECTID,
-				start, dev_offset, calc_size);
+				start, dev_offset, stripe_size);
 		BUG_ON(ret);
 		index++;
 	}
@@ -2667,8 +2530,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	return 0;
 
 error:
+	kfree(devices_info);
 	kfree(map);
-	kfree(devices_info);
+
 	return ret;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b502f01..37ae6e2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -144,6 +144,7 @@ struct btrfs_device_info {
 	struct btrfs_device *dev;
 	u64 dev_offset;
 	u64 max_avail;
+	u64 total_avail;
 };
 
 struct map_lookup {
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 1/3] btrfs: move btrfs_cmp_device_free_bytes to super.c
  2011-05-02  8:47 ` [PATCH v3 1/3] btrfs: move btrfs_cmp_device_free_bytes to super.c Arne Jansen
@ 2011-05-02  9:16   ` Daniel J Blueman
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel J Blueman @ 2011-05-02  9:16 UTC (permalink / raw)
  To: Arne Jansen; +Cc: chris.mason, linux-btrfs

On 2 May 2011 16:47, Arne Jansen <sensille@gmx.net> wrote:
> this function won't be used here anymore, so move it super.c where it=
 is
> used for df-calculation
>
> Signed-off-by: Arne Jansen <sensille@gmx.net>
> ---
> =A0fs/btrfs/super.c =A0 | =A0 25 +++++++++++++++++++++++++
> =A0fs/btrfs/volumes.c | =A0 13 -------------
> =A0fs/btrfs/volumes.h | =A0 15 ---------------
> =A03 files changed, 25 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 0ac712e..d8c9a49 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -913,6 +913,31 @@ static int btrfs_remount(struct super_block *sb,=
 int *flags, char *data)
> =A0 =A0 =A0 =A0return 0;
> =A0}
>
> +/* Used to sort the devices by max_avail(descending sort) */
> +int btrfs_cmp_device_free_bytes(const void *dev_info1, const void *d=
ev_info2)
> +{
> + =A0 =A0 =A0 if (((struct btrfs_device_info *)dev_info1)->max_avail =
>
> + =A0 =A0 =A0 =A0 =A0 ((struct btrfs_device_info *)dev_info2)->max_av=
ail)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return -1;
> + =A0 =A0 =A0 else if (((struct btrfs_device_info *)dev_info1)->max_a=
vail <
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0((struct btrfs_device_info *)dev_inf=
o2)->max_avail)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return 1;
> + =A0 =A0 =A0 else
> + =A0 =A0 =A0 return 0;
> +}
> +
> +/*
> + * sort the devices by max_avail, in which max free extent size of e=
ach device
> + * is stored.(Descending Sort)
> + */
> +static inline void btrfs_descending_sort_devices(
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 struct btrfs_device_info *devices,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 size_t nr_devices)
> +{
> + =A0 =A0 =A0 sort(devices, nr_devices, sizeof(struct btrfs_device_in=
fo),
> + =A0 =A0 =A0 =A0 =A0 =A0btrfs_cmp_device_free_bytes, NULL);
> +}
> +
> =A0/*
> =A0* The helper to calc the free space on the devices that can be use=
d to store
> =A0* file data.
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8b9fb8c..a9f1fc2 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2282,19 +2282,6 @@ static noinline u64 chunk_bytes_by_type(u64 ty=
pe, u64 calc_size,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return calc_size * num_stripes;
> =A0}
>
> -/* Used to sort the devices by max_avail(descending sort) */
> -int btrfs_cmp_device_free_bytes(const void *dev_info1, const void *d=
ev_info2)
> -{
> - =A0 =A0 =A0 if (((struct btrfs_device_info *)dev_info1)->max_avail =
>
> - =A0 =A0 =A0 =A0 =A0 ((struct btrfs_device_info *)dev_info2)->max_av=
ail)
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 return -1;
> - =A0 =A0 =A0 else if (((struct btrfs_device_info *)dev_info1)->max_a=
vail <
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0((struct btrfs_device_info *)dev_inf=
o2)->max_avail)
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 return 1;
> - =A0 =A0 =A0 else
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 return 0;
> -}
> -
> =A0static int __btrfs_calc_nstripes(struct btrfs_fs_devices *fs_devic=
es, u64 type,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 int *=
num_stripes, int *min_stripes,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 int *=
sub_stripes)
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index cc2eada..b502f01 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -157,21 +157,6 @@ struct map_lookup {
> =A0 =A0 =A0 =A0struct btrfs_bio_stripe stripes[];
> =A0};
>
> -/* Used to sort the devices by max_avail(descending sort) */
> -int btrfs_cmp_device_free_bytes(const void *dev_info1, const void *d=
ev_info2);
> -
> -/*
> - * sort the devices by max_avail, in which max free extent size of e=
ach device
> - * is stored.(Descending Sort)
> - */
> -static inline void btrfs_descending_sort_devices(
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 struct btrfs_device_info *devices,
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 size_t nr_devices)
> -{
> - =A0 =A0 =A0 sort(devices, nr_devices, sizeof(struct btrfs_device_in=
fo),
> - =A0 =A0 =A0 =A0 =A0 =A0btrfs_cmp_device_free_bytes, NULL);
> -}
> -
> =A0int btrfs_account_dev_extents_size(struct btrfs_device *device, u6=
4 start,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 u=
64 end, u64 *length);
>

btrfs_cmp_device_free_bytes() can be marked static, since there are no
users outside the compilation unit.

Daniel
--=20
Daniel J Blueman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 3/3] btrfs: quasi-round-robin for chunk allocation
  2011-05-02  8:47 ` [PATCH v3 3/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
@ 2011-05-13 11:43   ` David Sterba
  0 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2011-05-13 11:43 UTC (permalink / raw)
  To: Arne Jansen; +Cc: chris.mason, linux-btrfs

Hi,

here are my comments, after first review round, mostly on code itself,
not the functionality/quality of the allocator (although after this
patch the allocator code looks quite straightforward a better to read
and understand).

Reviewing the patch directly does not work, Arne's advice was to look at
the new code only, which did work, at the function __btrfs_alloc_chunk.

I found several categories of similar issues:
* move declarations from function-level to local scopes; makes the func
  declaration block less scary :)
* BUG_ONs -- a number of new ones -- I'd rather get rid of them now
* a few more comments would help
* misc code transformations or identifier renames

On Mon, May 02, 2011 at 10:47:42AM +0200, Arne Jansen wrote:
> In a multi device setup, the chunk allocator currently always allocates
> chunks on the devices in the same order. This leads to a very uneven
> distribution, especially with RAID1 or RAID10 and an uneven number of
> devices.
> This patch always sorts the devices before allocating, and allocates the
> stripes on the devices with the most available space, as long as there
> is enough space available. In a low space situation, it first tries to
> maximize striping.
> The patch also simplifies the allocator and reduces the checks for
> corner cases.
> The simplification is done by several means. First, it defines the
> properties of each RAID type upfront. These properties are used afterwards
> instead of differentiating cases in several places.
> Second, the old allocator defined a minimum stripe size for each block
> group type, tried to find a large enough chunk, and if this fails just
> allocates a smaller one. This is now done in one step. The largest possible
> chunk (up to max_chunk_size) is searched and allocated.
> Because we now have only one pass, the allocation of the map (struct
> map_lookup) is moved down to the point where the number of stripes is
> already known. This way we avoid reallocation of the map.
> We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
> 
> Changes from v2:
>  - bugfix for 'single' raid type; the initial parameter initialization lacked
>    a case for the 'single' type, thus leaving devs_max at the wrong value
> 
> Signed-off-by: Arne Jansen <sensille@gmx.net>
> ---
>  fs/btrfs/volumes.c |  476 +++++++++++++++++++---------------------------------
>  fs/btrfs/volumes.h |    1 +
>  2 files changed, 171 insertions(+), 306 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 45c592a..77eb763 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2268,349 +2268,213 @@ static int btrfs_add_system_chunk(struct btrfs_trans_handle *trans,
>  	return 0;
>  }
>  
> -static noinline u64 chunk_bytes_by_type(u64 type, u64 calc_size,
> -					int num_stripes, int sub_stripes)
> -{
> -	if (type & (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_DUP))
> -		return calc_size;
> -	else if (type & BTRFS_BLOCK_GROUP_RAID10)
> -		return calc_size * (num_stripes / sub_stripes);
> -	else
> -		return calc_size * num_stripes;
> -}
> -
> -static int __btrfs_calc_nstripes(struct btrfs_fs_devices *fs_devices, u64 type,
> -				 int *num_stripes, int *min_stripes,
> -				 int *sub_stripes)
> -{
> -	*num_stripes = 1;
> -	*min_stripes = 1;
> -	*sub_stripes = 0;
> -
> -	if (type & (BTRFS_BLOCK_GROUP_RAID0)) {
> -		*num_stripes = fs_devices->rw_devices;
> -		*min_stripes = 2;
> -	}
> -	if (type & (BTRFS_BLOCK_GROUP_DUP)) {
> -		*num_stripes = 2;
> -		*min_stripes = 2;
> -	}
> -	if (type & (BTRFS_BLOCK_GROUP_RAID1)) {
> -		if (fs_devices->rw_devices < 2)
> -			return -ENOSPC;
> -		*num_stripes = 2;
> -		*min_stripes = 2;
> -	}
> -	if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
> -		*num_stripes = fs_devices->rw_devices;
> -		if (*num_stripes < 4)
> -			return -ENOSPC;
> -		*num_stripes &= ~(u32)1;
> -		*sub_stripes = 2;
> -		*min_stripes = 4;
> -	}
> -
> -	return 0;
> -}
> -
> -static u64 __btrfs_calc_stripe_size(struct btrfs_fs_devices *fs_devices,
> -				    u64 proposed_size, u64 type,
> -				    int num_stripes, int small_stripe)
> -{
> -	int min_stripe_size = 1 * 1024 * 1024;
> -	u64 calc_size = proposed_size;
> -	u64 max_chunk_size = calc_size;
> -	int ncopies = 1;
> -
> -	if (type & (BTRFS_BLOCK_GROUP_RAID1 |
> -		    BTRFS_BLOCK_GROUP_DUP |
> -		    BTRFS_BLOCK_GROUP_RAID10))
> -		ncopies = 2;
> -
> -	if (type & BTRFS_BLOCK_GROUP_DATA) {
> -		max_chunk_size = 10 * calc_size;
> -		min_stripe_size = 64 * 1024 * 1024;
> -	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
> -		max_chunk_size = 256 * 1024 * 1024;
> -		min_stripe_size = 32 * 1024 * 1024;
> -	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
> -		calc_size = 8 * 1024 * 1024;
> -		max_chunk_size = calc_size * 2;
> -		min_stripe_size = 1 * 1024 * 1024;
> -	}
> -
> -	/* we don't want a chunk larger than 10% of writeable space */
> -	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
> -			     max_chunk_size);
> -
> -	if (calc_size * num_stripes > max_chunk_size * ncopies) {
> -		calc_size = max_chunk_size * ncopies;
> -		do_div(calc_size, num_stripes);
> -		do_div(calc_size, BTRFS_STRIPE_LEN);
> -		calc_size *= BTRFS_STRIPE_LEN;
> -	}
> -
> -	/* we don't want tiny stripes */
> -	if (!small_stripe)
> -		calc_size = max_t(u64, min_stripe_size, calc_size);
> -
> -	/*
> -	 * we're about to do_div by the BTRFS_STRIPE_LEN so lets make sure
> -	 * we end up with something bigger than a stripe
> -	 */
> -	calc_size = max_t(u64, calc_size, BTRFS_STRIPE_LEN);
> -
> -	do_div(calc_size, BTRFS_STRIPE_LEN);
> -	calc_size *= BTRFS_STRIPE_LEN;
> -
> -	return calc_size;
> -}
> -
> -static struct map_lookup *__shrink_map_lookup_stripes(struct map_lookup *map,
> -						      int num_stripes)
> -{
> -	struct map_lookup *new;
> -	size_t len = map_lookup_size(num_stripes);
> -
> -	BUG_ON(map->num_stripes < num_stripes);
> -
> -	if (map->num_stripes == num_stripes)
> -		return map;
> -
> -	new = kmalloc(len, GFP_NOFS);
> -	if (!new) {
> -		/* just change map->num_stripes */
> -		map->num_stripes = num_stripes;
> -		return map;
> -	}
> -
> -	memcpy(new, map, len);
> -	new->num_stripes = num_stripes;
> -	kfree(map);
> -	return new;
> -}
> -
>  /*
> - * helper to allocate device space from btrfs_device_info, in which we stored
> - * max free space information of every device. It is used when we can not
> - * allocate chunks by default size.
> - *
> - * By this helper, we can allocate a new chunk as larger as possible.
> + * sort the devices in descending order by max_avail, total_avail
>   */
> -static int __btrfs_alloc_tiny_space(struct btrfs_trans_handle *trans,
> -				    struct btrfs_fs_devices *fs_devices,
> -				    struct btrfs_device_info *devices,
> -				    int nr_device, u64 type,
> -				    struct map_lookup **map_lookup,
> -				    int min_stripes, u64 *stripe_size)
> +static int btrfs_cmp_device_info(const void *a, const void *b)
>  {
> -	int i, index, sort_again = 0;
> -	int min_devices = min_stripes;
> -	u64 max_avail, min_free;
> -	struct map_lookup *map = *map_lookup;
> -	int ret;
> -
> -	if (nr_device < min_stripes)
> -		return -ENOSPC;
> -
> -	btrfs_descending_sort_devices(devices, nr_device);
> -
> -	max_avail = devices[0].max_avail;
> -	if (!max_avail)
> -		return -ENOSPC;
> -
> -	for (i = 0; i < nr_device; i++) {
> -		/*
> -		 * if dev_offset = 0, it means the free space of this device
> -		 * is less than what we need, and we didn't search max avail
> -		 * extent on this device, so do it now.
> -		 */
> -		if (!devices[i].dev_offset) {
> -			ret = find_free_dev_extent(trans, devices[i].dev,
> -						   max_avail,
> -						   &devices[i].dev_offset,
> -						   &devices[i].max_avail);
> -			if (ret != 0 && ret != -ENOSPC)
> -				return ret;
> -			sort_again = 1;
> -		}
> -	}
> -
> -	/* we update the max avail free extent of each devices, sort again */
> -	if (sort_again)
> -		btrfs_descending_sort_devices(devices, nr_device);
> -
> -	if (type & BTRFS_BLOCK_GROUP_DUP)
> -		min_devices = 1;
> -
> -	if (!devices[min_devices - 1].max_avail)
> -		return -ENOSPC;
> -
> -	max_avail = devices[min_devices - 1].max_avail;
> -	if (type & BTRFS_BLOCK_GROUP_DUP)
> -		do_div(max_avail, 2);
> -
> -	max_avail = __btrfs_calc_stripe_size(fs_devices, max_avail, type,
> -					     min_stripes, 1);
> -	if (type & BTRFS_BLOCK_GROUP_DUP)
> -		min_free = max_avail * 2;
> -	else
> -		min_free = max_avail;
> -
> -	if (min_free > devices[min_devices - 1].max_avail)
> -		return -ENOSPC;
> -
> -	map = __shrink_map_lookup_stripes(map, min_stripes);
> -	*stripe_size = max_avail;
> -
> -	index = 0;
> -	for (i = 0; i < min_stripes; i++) {
> -		map->stripes[i].dev = devices[index].dev;
> -		map->stripes[i].physical = devices[index].dev_offset;
> -		if (type & BTRFS_BLOCK_GROUP_DUP) {
> -			i++;
> -			map->stripes[i].dev = devices[index].dev;
> -			map->stripes[i].physical = devices[index].dev_offset +
> -						   max_avail;
> -		}
> -		index++;
> -	}
> -	*map_lookup = map;
> -
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +
> +	if (di_a->max_avail > di_b->max_avail)
> +		return -1;
> +	if (di_a->max_avail < di_b->max_avail)
> +		return 1;
> +	if (di_a->total_avail > di_b->total_avail)
> +		return -1;
> +	if (di_a->total_avail < di_b->total_avail)
> +		return 1;
>  	return 0;
>  }
>  
>  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  			       struct btrfs_root *extent_root,
>  			       struct map_lookup **map_ret,
                                                   map_out

> -			       u64 *num_bytes, u64 *stripe_size,
> +			       u64 *num_bytes, u64 *stripe_size_out,
                                    num_bytes_out

there seems to be no formal convention to name outgoing parameters, but
at least this makes things more consistent

>  			       u64 start, u64 type)
>  {
>  	struct btrfs_fs_info *info = extent_root->fs_info;
>  	struct btrfs_device *device = NULL;
this variable can be declared in local scope block ...

>  	struct btrfs_fs_devices *fs_devices = info->fs_devices;
>  	struct list_head *cur;
> -	struct map_lookup *map;
> +	struct map_lookup *map = NULL;
>  	struct extent_map_tree *em_tree;
>  	struct extent_map *em;
> -	struct btrfs_device_info *devices_info;
> -	struct list_head private_devs;
> -	u64 calc_size = 1024 * 1024 * 1024;
> -	u64 min_free;
> -	u64 avail;
> +	struct btrfs_device_info *devices_info = NULL;
> +	u64 total_avail;
this

> +	u64 max_avail;
this

>  	u64 dev_offset;
and this should go to local scope

> -	int num_stripes;
> -	int min_stripes;
> -	int sub_stripes;
> -	int min_devices;	/* the min number of devices we need */
> -	int i;
> +	int num_stripes;	/* total number of stripes to allocate */
> +	int sub_stripes;	/* sub_stripes info for map */
> +	int dev_stripes;	/* stripes per dev */
> +	int devs_max;		/* max devs to use */
> +	int devs_min;		/* min devs needed */
> +	int devs_increment;	/* ndevs has to be a multiple of this */
> +	int ncopies;		/* how many copies to data has */
>  	int ret;
> +	u64 max_stripe_size;
> +	u64 max_chunk_size;
> +	u64 stripe_size;
> +	int ndevs;
>  	int index;
            ^^^^^
remove, unused or can be substituted with existing var (i)

> +	int i;
> +	int j;
>  
>  	if ((type & BTRFS_BLOCK_GROUP_RAID1) &&
>  	    (type & BTRFS_BLOCK_GROUP_DUP)) {
>  		WARN_ON(1);
>  		type &= ~BTRFS_BLOCK_GROUP_DUP;
>  	}

comment please, why this combination is bad

> +
(extra newline)

>  	if (list_empty(&fs_devices->alloc_list))
>  		return -ENOSPC;
>  
> -	ret = __btrfs_calc_nstripes(fs_devices, type, &num_stripes,
> -				    &min_stripes, &sub_stripes);
> -	if (ret)
> -		return ret;
> +
> +	sub_stripes = 1;
> +	dev_stripes = 1;
> +	devs_increment = 1;
> +	ncopies = 1;
> +	devs_max = 0;	/* 0 == as many as possible */
> +	devs_min = 1;
> +
> +	/*
> +	 * define the properties of each RAID type.
> +	 * FIXME: move this to a global table and use it in all RAID
> +	 * calculation code
> +	 */
agreed with the FIXME, Hugo Mills' balance patches have the refactoring
code, so let's keep this as-is for now

> +	if (type & (BTRFS_BLOCK_GROUP_DUP)) {
> +		dev_stripes = 2;
> +		ncopies = 2;
> +		devs_max = 1;
> +	} else if (type & (BTRFS_BLOCK_GROUP_RAID0)) {
> +		devs_min = 2;
> +	} else if (type & (BTRFS_BLOCK_GROUP_RAID1)) {
> +		devs_increment = 2;
> +		ncopies = 2;
> +		devs_max = 2;
> +		devs_min = 2;
> +	} else if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
> +		sub_stripes = 2;
> +		devs_increment = 2;
> +		ncopies = 2;
> +		devs_min = 4;
> +	} else {
> +		devs_max = 1;
> +	}
> +
> +	if (type & BTRFS_BLOCK_GROUP_DATA) {
> +		max_stripe_size = 1024 * 1024 * 1024;
> +		max_chunk_size = 10 * max_stripe_size;
> +	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
> +		max_stripe_size = 256 * 1024 * 1024;
> +		max_chunk_size = max_stripe_size;
> +	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
> +		max_stripe_size = 8 * 1024 * 1024;
> +		max_chunk_size = 2 * max_stripe_size;
> +	} else {
> +		BUG_ON(1);
a printk would be good, this may catch a bug eg. in alloc_profile bit
handling

> +	}
> +
> +	/* we don't want a chunk larger than 10% of writeable space */
> +	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
> +	                     max_chunk_size);
>  
>  	devices_info = kzalloc(sizeof(*devices_info) * fs_devices->rw_devices,
>  			       GFP_NOFS);
>  	if (!devices_info)
>  		return -ENOMEM;
>  
> -	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
> -	if (!map) {
> -		ret = -ENOMEM;
> -		goto error;
> -	}
> -	map->num_stripes = num_stripes;
> -
>  	cur = fs_devices->alloc_list.next;
> -	index = 0;
> -	i = 0;
>  
> -	calc_size = __btrfs_calc_stripe_size(fs_devices, calc_size, type,
> -					     num_stripes, 0);
> -
> -	if (type & BTRFS_BLOCK_GROUP_DUP) {
> -		min_free = calc_size * 2;
> -		min_devices = 1;
> -	} else {
> -		min_free = calc_size;
> -		min_devices = min_stripes;
> -	}
> -
> -	INIT_LIST_HEAD(&private_devs);
> -	while (index < num_stripes) {
> +	/*
> +	 * in the first pass through the devices list, we gather information
> +	 * about the available holes on each device.
> +	 */
> +	ndevs = 0;
> +	while (cur != &fs_devices->alloc_list) {
>  		device = list_entry(cur, struct btrfs_device, dev_alloc_list);
                ^^^^^^
declare locally, and add max_avail and total_avail

>  		BUG_ON(!device->writeable);

not a newly added BUG_ON, but I wonder if this is the best what can be
done here, a printk would be helpful too.
I do not see any locking of the device list here, and did not have a
deeper look, but I think this could race with device removal, when the
device is still in the list, but the racing thread managed to set
device->writable to zero. In this case, a WARN would do, and skipping
the device will not break things (too much).

> +
> +		cur = cur->next;
> +
> +		if (!device->in_fs_metadata)
> +			continue;
> +
>  		if (device->total_bytes > device->bytes_used)
> -			avail = device->total_bytes - device->bytes_used;
> +			total_avail = device->total_bytes - device->bytes_used;
>  		else
> -			avail = 0;
> -		cur = cur->next;
> -
> -		if (device->in_fs_metadata && avail >= min_free) {
> -			ret = find_free_dev_extent(trans, device, min_free,
> -						   &devices_info[i].dev_offset,
> -						   &devices_info[i].max_avail);
> -			if (ret == 0) {
> -				list_move_tail(&device->dev_alloc_list,
> -					       &private_devs);
> -				map->stripes[index].dev = device;
> -				map->stripes[index].physical =
> -						devices_info[i].dev_offset;
> -				index++;
> -				if (type & BTRFS_BLOCK_GROUP_DUP) {
> -					map->stripes[index].dev = device;
> -					map->stripes[index].physical =
> -						devices_info[i].dev_offset +
> -						calc_size;
> -					index++;
> -				}
> -			} else if (ret != -ENOSPC)
> -				goto error;
> -
> -			devices_info[i].dev = device;
> -			i++;
> -		} else if (device->in_fs_metadata &&
> -			   avail >= BTRFS_STRIPE_LEN) {
> -			devices_info[i].dev = device;
> -			devices_info[i].max_avail = avail;
> -			i++;
> -		}
> -
> -		if (cur == &fs_devices->alloc_list)
> -			break;
> -	}
> -
> -	list_splice(&private_devs, &fs_devices->alloc_list);
> -	if (index < num_stripes) {
> -		if (index >= min_stripes) {
> -			num_stripes = index;
> -			if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
> -				num_stripes /= sub_stripes;
> -				num_stripes *= sub_stripes;
> -			}
> -
> -			map = __shrink_map_lookup_stripes(map, num_stripes);
> -		} else if (i >= min_devices) {
> -			ret = __btrfs_alloc_tiny_space(trans, fs_devices,
> -						       devices_info, i, type,
> -						       &map, min_stripes,
> -						       &calc_size);
> -			if (ret)
> -				goto error;
> -		} else {
> -			ret = -ENOSPC;
> +			total_avail = 0;
> +		/* avail is off by max(alloc_start, 1MB), but that is the same
> +		 * for all devices, so it doesn't hurt the sorting later on
> +		 */
> +
> +		ret = find_free_dev_extent(trans, device,
> +		                           max_stripe_size * dev_stripes,
> +					   &dev_offset, &max_avail);
> +		if (ret && ret != -ENOSPC)
>  			goto error;
> +
> +		if (ret == 0)
> +			max_avail = max_stripe_size * dev_stripes;
> +
> +		if (max_avail < BTRFS_STRIPE_LEN * dev_stripes)
> +			continue;
> +
> +		devices_info[ndevs].dev_offset = dev_offset;
> +		devices_info[ndevs].max_avail = max_avail;
> +		devices_info[ndevs].total_avail = total_avail;
> +		devices_info[ndevs].dev = device;
> +		++ndevs;
> +	}
> +
> +	/*
> +	 * now sort the devices by hole size / available space
> +	 */
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);
> +
> +	/* round down to number of usable stripes */
> +	ndevs -= ndevs % devs_increment;
> +
> +	if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
> +		ret = -ENOSPC;
> +		goto error;
> +	}
> +
> +	if (devs_max && ndevs > devs_max)
> +		ndevs = devs_max;
> +	/*
> +	 * the primary goal is to maximize the number of stripes, so use as many
> +	 * devices as possible, even if the stripes are not maximum sized.
> +	 */
> +	stripe_size = devices_info[ndevs-1].max_avail;
> +	num_stripes = ndevs * dev_stripes;
> +
> +	if (stripe_size * num_stripes > max_chunk_size * ncopies) {
> +		stripe_size = max_chunk_size * ncopies;
> +		do_div(stripe_size, num_stripes);
> +	}
> +
> +	do_div(stripe_size, dev_stripes);
> +	do_div(stripe_size, BTRFS_STRIPE_LEN);
> +	stripe_size *= BTRFS_STRIPE_LEN;
> +
> +	BUG_ON(!stripe_size);

what does this catch, in words? stripe_size was smaller than
BTRFS_STRIPE_LEN. Old code made sure that this does not happen, doing

  calc_size = max_t(u64, calc_size, BTRFS_STRIPE_LEN)

before the final do_div. Is there really a reason to crash?

> +
> +	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
> +	if (!map) {
> +		ret = -ENOMEM;
> +		goto error;
> +	}
> +	map->num_stripes = num_stripes;
> +
> +	index = 0;

not used, probably intended as a linear counter of the 'int s = ... '
expression below. the compiler 'strength-reduce' pass will probably
deduce this too.

> +	for (i = 0; i < ndevs; ++i) {
> +		for (j = 0; j < dev_stripes; ++j) {
> +			int s = i * dev_stripes + j;
> +			map->stripes[s].dev = devices_info[i].dev;
> +			map->stripes[s].physical = devices_info[i].dev_offset +
> +                                                   j * stripe_size;
>  		}
>  	}

not really an issue here, but the devices_info[i] could be moved out of
the for( j ) loop (assigned to a tmp var), but an optimizing compiler
will do this too :)

>  	map->sector_size = extent_root->sectorsize;
> @@ -2621,9 +2485,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  	map->sub_stripes = sub_stripes;
>  
>  	*map_ret = map;
> -	*stripe_size = calc_size;
> -	*num_bytes = chunk_bytes_by_type(type, calc_size,
> -					 map->num_stripes, sub_stripes);
> +	*stripe_size_out = stripe_size;
> +	*num_bytes = stripe_size * (num_stripes / ncopies);
>  
>  	trace_btrfs_chunk_alloc(info->chunk_root, map, start, *num_bytes);

...

[inserted from patched file]

2482         index = 0;
2483         while (index < map->num_stripes) {
2484                 device = map->stripes[index].dev;
2485                 dev_offset = map->stripes[index].physical;
2486
2487                 ret = btrfs_alloc_dev_extent(trans, device,
2488                                 info->chunk_root->root_key.objectid,
2489                                 BTRFS_FIRST_CHUNK_TREE_OBJECTID,
2490                                 start, dev_offset, stripe_size);
2491                 BUG_ON(ret);
2492                 index++;
2493         }

can you please transform this to a for loop for me? I know this is not
your original code, but this cleanup seems worth while changing the
whole function.

(and make those device and def_offset local declarations)

>  
> @@ -2658,7 +2521,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  		ret = btrfs_alloc_dev_extent(trans, device,
>  				info->chunk_root->root_key.objectid,
>  				BTRFS_FIRST_CHUNK_TREE_OBJECTID,
> -				start, dev_offset, calc_size);
> +				start, dev_offset, stripe_size);
>  		BUG_ON(ret);
>  		index++;
>  	}
> @@ -2667,8 +2530,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  	return 0;
>  
>  error:
> +	kfree(devices_info);
>  	kfree(map);
> -	kfree(devices_info);
> +

unnecessary hunk

>  	return ret;
>  }
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b502f01..37ae6e2 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -144,6 +144,7 @@ struct btrfs_device_info {
>  	struct btrfs_device *dev;
>  	u64 dev_offset;
>  	u64 max_avail;
> +	u64 total_avail;
>  };
>  
>  struct map_lookup {
> -- 

and I'm going to test this a bit too :)

david

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-05-13 11:43 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-02  8:47 [PATCH v3 0/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
2011-05-02  8:47 ` [PATCH v3 1/3] btrfs: move btrfs_cmp_device_free_bytes to super.c Arne Jansen
2011-05-02  9:16   ` Daniel J Blueman
2011-05-02  8:47 ` [PATCH v3 2/3] btrfs: heed alloc_start Arne Jansen
2011-05-02  8:47 ` [PATCH v3 3/3] btrfs: quasi-round-robin for chunk allocation Arne Jansen
2011-05-13 11:43   ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).