* [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot()
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 8:09 ` Nikolay Borisov
2018-02-09 7:44 ` [PATCH v2 01/10] btrfs-progs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info Qu Wenruo
` (9 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
Strangely, we have a level check in btrfs_print_tree() but no such check
in read_node_slot().
That is to say, for the following corruption, btrfs_search_slot() or
btrfs_next_leaf() can return an invalid leaf:
Parent eb:
node XXXXXX level 1
^^^^^^^
Child should be leaf (level 0)
...
key (XXX XXX XXX) block YYYYYY
Child eb:
leaf YYYYYY level 1
^^^^^^^
Something went wrong now
With such a corrupted leaf returned, later callers can easily be
misled.
Although the root cause (a power loss was involved, but something must
still have broken btrfs metadata CoW) is still unknown, at least enhance
btrfs-progs to avoid a SIGSEGV.
Reported-by: Ralph Gauges <ralphgauges@googlemail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
Check if the extent buffer is up-to-date before checking its level to
avoid possible NULL pointer access.
---
ctree.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/ctree.c b/ctree.c
index 4fc33b14000a..430805e3043f 100644
--- a/ctree.c
+++ b/ctree.c
@@ -22,6 +22,7 @@
#include "repair.h"
#include "internal.h"
#include "sizes.h"
+#include "messages.h"
static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
*root, struct btrfs_path *path, int level);
@@ -640,7 +641,9 @@ static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
struct extent_buffer *parent, int slot)
{
+ struct extent_buffer *ret;
int level = btrfs_header_level(parent);
+
if (slot < 0)
return NULL;
if (slot >= btrfs_header_nritems(parent))
@@ -649,8 +652,19 @@ struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
if (level == 0)
return NULL;
- return read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
+ ret = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
btrfs_node_ptr_generation(parent, slot));
+ if (!extent_buffer_uptodate(ret))
+ return ERR_PTR(-EIO);
+
+ if (btrfs_header_level(ret) != level - 1) {
+ error("child eb corrupted: parent bytenr=%llu item=%d parent level=%d child level=%d",
+ btrfs_header_bytenr(parent), slot,
+ btrfs_header_level(parent), btrfs_header_level(ret));
+ free_extent_buffer(ret);
+ return ERR_PTR(-EIO);
+ }
+ return ret;
}
static int balance_level(struct btrfs_trans_handle *trans,
--
2.16.1
* Re: [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot()
2018-02-09 7:44 ` [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot() Qu Wenruo
@ 2018-02-09 8:09 ` Nikolay Borisov
2018-02-09 10:02 ` Qu Wenruo
0 siblings, 1 reply; 24+ messages in thread
From: Nikolay Borisov @ 2018-02-09 8:09 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs, dsterba
On 9.02.2018 09:44, Qu Wenruo wrote:
> Strangely, we have a level check in btrfs_print_tree() but no such check
> in read_node_slot().
>
> That is to say, for the following corruption, btrfs_search_slot() or
> btrfs_next_leaf() can return an invalid leaf:
>
> Parent eb:
> node XXXXXX level 1
> ^^^^^^^
> Child should be leaf (level 0)
> ...
> key (XXX XXX XXX) block YYYYYY
>
> Child eb:
> leaf YYYYYY level 1
> ^^^^^^^
> Something went wrong now
>
> With such a corrupted leaf returned, later callers can easily be
> misled.
>
> Although the root cause (a power loss was involved, but something must
> still have broken btrfs metadata CoW) is still unknown, at least enhance
> btrfs-progs to avoid a SIGSEGV.
>
> Reported-by: Ralph Gauges <ralphgauges@googlemail.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> Check if the extent buffer is up-to-date before checking its level to
> avoid possible NULL pointer access.
> ---
> ctree.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
That was sent separately so I'd assume it was in the wrong dir ;)
>
> diff --git a/ctree.c b/ctree.c
> index 4fc33b14000a..430805e3043f 100644
> --- a/ctree.c
> +++ b/ctree.c
> @@ -22,6 +22,7 @@
> #include "repair.h"
> #include "internal.h"
> #include "sizes.h"
> +#include "messages.h"
>
> static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
> *root, struct btrfs_path *path, int level);
> @@ -640,7 +641,9 @@ static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
> struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
> struct extent_buffer *parent, int slot)
> {
> + struct extent_buffer *ret;
> int level = btrfs_header_level(parent);
> +
> if (slot < 0)
> return NULL;
> if (slot >= btrfs_header_nritems(parent))
> @@ -649,8 +652,19 @@ struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
> if (level == 0)
> return NULL;
>
> - return read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
> + ret = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
> btrfs_node_ptr_generation(parent, slot));
> + if (!extent_buffer_uptodate(ret))
> + return ERR_PTR(-EIO);
> +
> + if (btrfs_header_level(ret) != level - 1) {
> + error("child eb corrupted: parent bytenr=%llu item=%d parent level=%d child level=%d",
> + btrfs_header_bytenr(parent), slot,
> + btrfs_header_level(parent), btrfs_header_level(ret));
> + free_extent_buffer(ret);
> + return ERR_PTR(-EIO);
> + }
> + return ret;
> }
>
> static int balance_level(struct btrfs_trans_handle *trans,
>
* Re: [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot()
2018-02-09 8:09 ` Nikolay Borisov
@ 2018-02-09 10:02 ` Qu Wenruo
0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 10:02 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs, dsterba
On 2018-02-09 16:09, Nikolay Borisov wrote:
>
>
> On 9.02.2018 09:44, Qu Wenruo wrote:
>> Strangely, we have a level check in btrfs_print_tree() but no such check
>> in read_node_slot().
>>
>> That is to say, for the following corruption, btrfs_search_slot() or
>> btrfs_next_leaf() can return an invalid leaf:
>>
>> Parent eb:
>> node XXXXXX level 1
>> ^^^^^^^
>> Child should be leaf (level 0)
>> ...
>> key (XXX XXX XXX) block YYYYYY
>>
>> Child eb:
>> leaf YYYYYY level 1
>> ^^^^^^^
>> Something went wrong now
>>
>> With such a corrupted leaf returned, later callers can easily be
>> misled.
>>
>> Although the root cause (a power loss was involved, but something must
>> still have broken btrfs metadata CoW) is still unknown, at least enhance
>> btrfs-progs to avoid a SIGSEGV.
>>
>> Reported-by: Ralph Gauges <ralphgauges@googlemail.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> Check if the extent buffer is up-to-date before checking its level to
>> avoid possible NULL pointer access.
>> ---
>> ctree.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> That was sent separately so I'd assume it was in the wrong dir ;)
Yep, my bad habit of putting all patches in the project dir.
Thanks,
Qu
>
>>
>> diff --git a/ctree.c b/ctree.c
>> index 4fc33b14000a..430805e3043f 100644
>> --- a/ctree.c
>> +++ b/ctree.c
>> @@ -22,6 +22,7 @@
>> #include "repair.h"
>> #include "internal.h"
>> #include "sizes.h"
>> +#include "messages.h"
>>
>> static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
>> *root, struct btrfs_path *path, int level);
>> @@ -640,7 +641,9 @@ static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
>> struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
>> struct extent_buffer *parent, int slot)
>> {
>> + struct extent_buffer *ret;
>> int level = btrfs_header_level(parent);
>> +
>> if (slot < 0)
>> return NULL;
>> if (slot >= btrfs_header_nritems(parent))
>> @@ -649,8 +652,19 @@ struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
>> if (level == 0)
>> return NULL;
>>
>> - return read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
>> + ret = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
>> btrfs_node_ptr_generation(parent, slot));
>> + if (!extent_buffer_uptodate(ret))
>> + return ERR_PTR(-EIO);
>> +
>> + if (btrfs_header_level(ret) != level - 1) {
>> + error("child eb corrupted: parent bytenr=%llu item=%d parent level=%d child level=%d",
>> + btrfs_header_bytenr(parent), slot,
>> + btrfs_header_level(parent), btrfs_header_level(ret));
>> + free_extent_buffer(ret);
>> + return ERR_PTR(-EIO);
>> + }
>> + return ret;
>> }
>>
>> static int balance_level(struct btrfs_trans_handle *trans,
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* [PATCH v2 01/10] btrfs-progs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
2018-02-09 7:44 ` [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot() Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 02/10] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk Qu Wenruo
` (8 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
volumes.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/volumes.c b/volumes.c
index edad367b593c..677d085de96c 100644
--- a/volumes.c
+++ b/volumes.c
@@ -826,7 +826,7 @@ error:
return ret;
}
-#define BTRFS_MAX_DEVS(r) ((BTRFS_LEAF_DATA_SIZE(r->fs_info) \
+#define BTRFS_MAX_DEVS(info) ((BTRFS_LEAF_DATA_SIZE(info) \
- sizeof(struct btrfs_item) \
- sizeof(struct btrfs_chunk)) \
/ sizeof(struct btrfs_stripe) + 1)
@@ -882,12 +882,12 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
calc_size = SZ_1G;
max_chunk_size = 10 * calc_size;
min_stripe_size = SZ_64M;
- max_stripes = BTRFS_MAX_DEVS(chunk_root);
+ max_stripes = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
calc_size = SZ_1G;
max_chunk_size = 4 * calc_size;
min_stripe_size = SZ_32M;
- max_stripes = BTRFS_MAX_DEVS(chunk_root);
+ max_stripes = BTRFS_MAX_DEVS(info);
}
}
if (type & BTRFS_BLOCK_GROUP_RAID1) {
--
2.16.1
* [PATCH v2 02/10] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
2018-02-09 7:44 ` [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot() Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 01/10] btrfs-progs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 03/10] btrfs-progs: Make btrfs_alloc_chunk to handle block group creation Qu Wenruo
` (7 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
We used to have two chunk allocators, btrfs_alloc_chunk() and
btrfs_alloc_data_chunk(). The former is the more generic one, while the
latter is only used in mkfs and convert, to allocate a SINGLE-profile
data chunk.
Although btrfs_alloc_data_chunk() has some special hacks to cooperate
with convert, it's quite simple to integrate it into the generic chunk
allocator.
So merge them into one btrfs_alloc_chunk(), with an extra @convert
parameter and the necessary comments, to make the code less duplicated
and leave fewer things to maintain.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
convert/main.c | 6 +-
extent-tree.c | 2 +-
mkfs/main.c | 8 +--
volumes.c | 219 ++++++++++++++++++++++-----------------------------------
volumes.h | 5 +-
5 files changed, 94 insertions(+), 146 deletions(-)
diff --git a/convert/main.c b/convert/main.c
index b3ea31d7af43..b2444bb2ff21 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -910,9 +910,9 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
len = min(max_chunk_size,
cache->start + cache->size - cur);
- ret = btrfs_alloc_data_chunk(trans, fs_info,
- &cur_backup, len,
- BTRFS_BLOCK_GROUP_DATA, 1);
+ ret = btrfs_alloc_chunk(trans, fs_info,
+ &cur_backup, &len,
+ BTRFS_BLOCK_GROUP_DATA, true);
if (ret < 0)
break;
ret = btrfs_make_block_group(trans, fs_info, 0,
diff --git a/extent-tree.c b/extent-tree.c
index e2ae74a7fe66..b085ab0352b3 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1906,7 +1906,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
return 0;
ret = btrfs_alloc_chunk(trans, fs_info, &start, &num_bytes,
- space_info->flags);
+ space_info->flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
return 0;
diff --git a/mkfs/main.c b/mkfs/main.c
index ea65c6d897b2..358395ca0250 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -82,7 +82,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
ret = btrfs_alloc_chunk(trans, fs_info,
&chunk_start, &chunk_size,
BTRFS_BLOCK_GROUP_METADATA |
- BTRFS_BLOCK_GROUP_DATA);
+ BTRFS_BLOCK_GROUP_DATA, false);
if (ret == -ENOSPC) {
error("no space to allocate data/metadata chunk");
goto err;
@@ -99,7 +99,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
} else {
ret = btrfs_alloc_chunk(trans, fs_info,
&chunk_start, &chunk_size,
- BTRFS_BLOCK_GROUP_METADATA);
+ BTRFS_BLOCK_GROUP_METADATA, false);
if (ret == -ENOSPC) {
error("no space to allocate metadata chunk");
goto err;
@@ -133,7 +133,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
if (!mixed) {
ret = btrfs_alloc_chunk(trans, fs_info,
&chunk_start, &chunk_size,
- BTRFS_BLOCK_GROUP_DATA);
+ BTRFS_BLOCK_GROUP_DATA, false);
if (ret == -ENOSPC) {
error("no space to allocate data chunk");
goto err;
@@ -241,7 +241,7 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
int ret;
ret = btrfs_alloc_chunk(trans, fs_info,
- &chunk_start, &chunk_size, type);
+ &chunk_start, &chunk_size, type, false);
if (ret == -ENOSPC) {
error("not enough free space to allocate chunk");
exit(1);
diff --git a/volumes.c b/volumes.c
index 677d085de96c..9ee4650351c3 100644
--- a/volumes.c
+++ b/volumes.c
@@ -836,9 +836,23 @@ error:
- 2 * sizeof(struct btrfs_chunk)) \
/ sizeof(struct btrfs_stripe) + 1)
+/*
+ * Alloc a chunk, will insert dev extents, chunk item.
+ * NOTE: This function will not insert block group item nor mark newly
+ * allocated chunk available for later allocation.
+ * Block group item and free space update is handled by btrfs_make_block_group()
+ *
+ * @start: return value of allocated chunk start bytenr.
+ * @num_bytes: return value of allocated chunk size
+ * @type: chunk type (including both profile and type)
+ * @convert: if the chunk is allocated for convert case.
+ * If @convert is true, chunk allocator will skip device extent
+ * search, but use *start and *num_bytes as chunk start/num_bytes
+ * and device offset, to build a 1:1 chunk mapping for convert.
+ */
int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *info, u64 *start,
- u64 *num_bytes, u64 type)
+ u64 *num_bytes, u64 type, bool convert)
{
u64 dev_offset;
struct btrfs_root *extent_root = info->extent_root;
@@ -868,10 +882,38 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
struct btrfs_key key;
u64 offset;
- if (list_empty(dev_list)) {
+ if (list_empty(dev_list))
return -ENOSPC;
+
+ if (convert) {
+ /* For convert, profile must be SINGLE */
+ if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+ error("convert only supports SINGLE profile");
+ return -EINVAL;
+ }
+ if (!IS_ALIGNED(*start, info->sectorsize)) {
+ error("chunk start not aligned, start=%llu sectorsize=%u",
+ *start, info->sectorsize);
+ return -EINVAL;
+ }
+ if (!IS_ALIGNED(*num_bytes, info->sectorsize)) {
+ error("chunk size not aligned, size=%llu sectorsize=%u",
+ *num_bytes, info->sectorsize);
+ return -EINVAL;
+ }
+ calc_size = *num_bytes;
+ offset = *start;
+ /*
+ * For convert, we use 1:1 chunk mapping specified by @start and
+ * @num_bytes, so there is no need to go through dev_extent
+ * searching.
+ */
+ goto alloc_chunk;
}
+ /*
+ * Chunk size calculation part.
+ */
if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
calc_size = SZ_8M;
@@ -942,6 +984,9 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
percent_max = div_factor(btrfs_super_total_bytes(info->super_copy), 1);
max_chunk_size = min(percent_max, max_chunk_size);
+ /*
+ * Reserve space from each device.
+ */
again:
if (chunk_bytes_by_type(type, calc_size, num_stripes, sub_stripes) >
max_chunk_size) {
@@ -972,7 +1017,8 @@ again:
return ret;
cur = cur->next;
if (avail >= min_free) {
- list_move_tail(&device->dev_list, &private_devs);
+ list_move_tail(&device->dev_list,
+ &private_devs);
index++;
if (type & BTRFS_BLOCK_GROUP_DUP)
index++;
@@ -999,9 +1045,16 @@ again:
}
return -ENOSPC;
}
- ret = find_next_chunk(info, &offset);
- if (ret)
- return ret;
+
+ /*
+ * Fill chunk mapping and chunk stripes
+ */
+alloc_chunk:
+ if (!convert) {
+ ret = find_next_chunk(info, &offset);
+ if (ret)
+ return ret;
+ }
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
key.type = BTRFS_CHUNK_ITEM_KEY;
key.offset = offset;
@@ -1022,17 +1075,31 @@ again:
index = 0;
while(index < num_stripes) {
struct btrfs_stripe *stripe;
- BUG_ON(list_empty(&private_devs));
- cur = private_devs.next;
- device = list_entry(cur, struct btrfs_device, dev_list);
- /* loop over this device again if we're doing a dup group */
- if (!(type & BTRFS_BLOCK_GROUP_DUP) ||
- (index == num_stripes - 1))
- list_move_tail(&device->dev_list, dev_list);
+ if (!convert) {
+ if (list_empty(&private_devs))
+ return -ENODEV;
+ cur = private_devs.next;
+ device = list_entry(cur, struct btrfs_device, dev_list);
+ if (!(type & BTRFS_BLOCK_GROUP_DUP) ||
+ (index == num_stripes - 1)) {
+ /*
+ * loop over this device again if we're doing a
+ * dup group
+ */
+ list_move_tail(&device->dev_list, dev_list);
+ }
+ } else {
+ /* Only SINGLE is accepted in convert case */
+ BUG_ON(num_stripes > 1);
+ device = list_entry(dev_list->next, struct btrfs_device,
+ dev_list);
+ key.offset = *start;
+ dev_offset = *start;
+ }
ret = btrfs_alloc_dev_extent(trans, device, key.offset,
- calc_size, &dev_offset, 0);
+ calc_size, &dev_offset, convert);
if (ret < 0)
goto out_chunk_map;
@@ -1049,7 +1116,7 @@ again:
memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
index++;
}
- BUG_ON(!list_empty(&private_devs));
+ BUG_ON(!convert && !list_empty(&private_devs));
/* key was set above */
btrfs_set_stack_chunk_length(chunk, *num_bytes);
@@ -1069,6 +1136,9 @@ again:
map->num_stripes = num_stripes;
map->sub_stripes = sub_stripes;
+ /*
+ * Insert chunk item and chunk mapping.
+ */
ret = btrfs_insert_item(trans, chunk_root, &key, chunk,
btrfs_chunk_item_size(num_stripes));
BUG_ON(ret);
@@ -1098,125 +1168,6 @@ out_chunk:
return ret;
}
-/*
- * Alloc a DATA chunk with SINGLE profile.
- *
- * If 'convert' is set, it will alloc a chunk with 1:1 mapping
- * (btrfs logical bytenr == on-disk bytenr)
- * For that case, caller must make sure the chunk and dev_extent are not
- * occupied.
- */
-int btrfs_alloc_data_chunk(struct btrfs_trans_handle *trans,
- struct btrfs_fs_info *info, u64 *start,
- u64 num_bytes, u64 type, int convert)
-{
- u64 dev_offset;
- struct btrfs_root *extent_root = info->extent_root;
- struct btrfs_root *chunk_root = info->chunk_root;
- struct btrfs_stripe *stripes;
- struct btrfs_device *device = NULL;
- struct btrfs_chunk *chunk;
- struct list_head *dev_list = &info->fs_devices->devices;
- struct list_head *cur;
- struct map_lookup *map;
- u64 calc_size = SZ_8M;
- int num_stripes = 1;
- int sub_stripes = 0;
- int ret;
- int index;
- int stripe_len = BTRFS_STRIPE_LEN;
- struct btrfs_key key;
-
- key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
- key.type = BTRFS_CHUNK_ITEM_KEY;
- if (convert) {
- if (*start != round_down(*start, info->sectorsize)) {
- error("DATA chunk start not sectorsize aligned: %llu",
- (unsigned long long)*start);
- return -EINVAL;
- }
- key.offset = *start;
- dev_offset = *start;
- } else {
- u64 tmp;
-
- ret = find_next_chunk(info, &tmp);
- key.offset = tmp;
- if (ret)
- return ret;
- }
-
- chunk = kmalloc(btrfs_chunk_item_size(num_stripes), GFP_NOFS);
- if (!chunk)
- return -ENOMEM;
-
- map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
- if (!map) {
- kfree(chunk);
- return -ENOMEM;
- }
-
- stripes = &chunk->stripe;
- calc_size = num_bytes;
-
- index = 0;
- cur = dev_list->next;
- device = list_entry(cur, struct btrfs_device, dev_list);
-
- while (index < num_stripes) {
- struct btrfs_stripe *stripe;
-
- ret = btrfs_alloc_dev_extent(trans, device, key.offset,
- calc_size, &dev_offset, convert);
- BUG_ON(ret);
-
- device->bytes_used += calc_size;
- ret = btrfs_update_device(trans, device);
- BUG_ON(ret);
-
- map->stripes[index].dev = device;
- map->stripes[index].physical = dev_offset;
- stripe = stripes + index;
- btrfs_set_stack_stripe_devid(stripe, device->devid);
- btrfs_set_stack_stripe_offset(stripe, dev_offset);
- memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
- index++;
- }
-
- /* key was set above */
- btrfs_set_stack_chunk_length(chunk, num_bytes);
- btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
- btrfs_set_stack_chunk_stripe_len(chunk, stripe_len);
- btrfs_set_stack_chunk_type(chunk, type);
- btrfs_set_stack_chunk_num_stripes(chunk, num_stripes);
- btrfs_set_stack_chunk_io_align(chunk, stripe_len);
- btrfs_set_stack_chunk_io_width(chunk, stripe_len);
- btrfs_set_stack_chunk_sector_size(chunk, info->sectorsize);
- btrfs_set_stack_chunk_sub_stripes(chunk, sub_stripes);
- map->sector_size = info->sectorsize;
- map->stripe_len = stripe_len;
- map->io_align = stripe_len;
- map->io_width = stripe_len;
- map->type = type;
- map->num_stripes = num_stripes;
- map->sub_stripes = sub_stripes;
-
- ret = btrfs_insert_item(trans, chunk_root, &key, chunk,
- btrfs_chunk_item_size(num_stripes));
- BUG_ON(ret);
- if (!convert)
- *start = key.offset;
-
- map->ce.start = key.offset;
- map->ce.size = num_bytes;
-
- ret = insert_cache_extent(&info->mapping_tree.cache_tree, &map->ce);
- BUG_ON(ret);
-
- kfree(chunk);
- return ret;
-}
-
int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
{
struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
diff --git a/volumes.h b/volumes.h
index 11572e78c04f..7bbdf615d31a 100644
--- a/volumes.h
+++ b/volumes.h
@@ -208,10 +208,7 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info);
int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info, u64 *start,
- u64 *num_bytes, u64 type);
-int btrfs_alloc_data_chunk(struct btrfs_trans_handle *trans,
- struct btrfs_fs_info *fs_info, u64 *start,
- u64 num_bytes, u64 type, int convert);
+ u64 *num_bytes, u64 type, bool convert);
int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
int flags);
int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
--
2.16.1
* [PATCH v2 03/10] btrfs-progs: Make btrfs_alloc_chunk to handle block group creation
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (2 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 02/10] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 04/10] btrfs-progs: Introduce btrfs_raid_array and related infrastructures Qu Wenruo
` (6 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
Before this patch, chunk allocation is split into 2 parts:
1) Chunk allocation
Handled by btrfs_alloc_chunk(), which will insert chunk and device
extent items.
2) Block group allocation
Handled by btrfs_make_block_group(), which will insert block group
item and update space info.
However, we don't really need to split these operations, as every
btrfs_alloc_chunk() call is followed by btrfs_make_block_group().
So it's reasonable to merge the btrfs_make_block_group() call into
btrfs_alloc_chunk(): this saves several lines and provides the basis for
the later btrfs_alloc_chunk() rework.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
convert/main.c | 4 ----
extent-tree.c | 10 ++--------
mkfs/main.c | 19 -------------------
volumes.c | 10 ++++++----
4 files changed, 8 insertions(+), 35 deletions(-)
diff --git a/convert/main.c b/convert/main.c
index b2444bb2ff21..240d3aa46db9 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -915,10 +915,6 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
BTRFS_BLOCK_GROUP_DATA, true);
if (ret < 0)
break;
- ret = btrfs_make_block_group(trans, fs_info, 0,
- BTRFS_BLOCK_GROUP_DATA, cur, len);
- if (ret < 0)
- break;
cur += len;
}
}
diff --git a/extent-tree.c b/extent-tree.c
index b085ab0352b3..bccd83d1bae6 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1909,15 +1909,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
space_info->flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
- return 0;
+ return ret;
}
-
- BUG_ON(ret);
-
- ret = btrfs_make_block_group(trans, fs_info, 0, space_info->flags,
- start, num_bytes);
- BUG_ON(ret);
- return 0;
+ return ret;
}
static int update_block_group(struct btrfs_root *root,
diff --git a/mkfs/main.c b/mkfs/main.c
index 358395ca0250..49159ea533b9 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -87,12 +87,6 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
error("no space to allocate data/metadata chunk");
goto err;
}
- if (ret)
- return ret;
- ret = btrfs_make_block_group(trans, fs_info, 0,
- BTRFS_BLOCK_GROUP_METADATA |
- BTRFS_BLOCK_GROUP_DATA,
- chunk_start, chunk_size);
if (ret)
return ret;
allocation->mixed += chunk_size;
@@ -106,12 +100,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
}
if (ret)
return ret;
- ret = btrfs_make_block_group(trans, fs_info, 0,
- BTRFS_BLOCK_GROUP_METADATA,
- chunk_start, chunk_size);
allocation->metadata += chunk_size;
- if (ret)
- return ret;
}
root->fs_info->system_allocs = 0;
@@ -140,12 +129,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
}
if (ret)
return ret;
- ret = btrfs_make_block_group(trans, fs_info, 0,
- BTRFS_BLOCK_GROUP_DATA,
- chunk_start, chunk_size);
allocation->data += chunk_size;
- if (ret)
- return ret;
}
err:
@@ -249,9 +233,6 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
if (ret)
return ret;
- ret = btrfs_make_block_group(trans, fs_info, 0,
- type, chunk_start, chunk_size);
-
type &= BTRFS_BLOCK_GROUP_TYPE_MASK;
if (type == BTRFS_BLOCK_GROUP_DATA) {
allocation->data += chunk_size;
diff --git a/volumes.c b/volumes.c
index 9ee4650351c3..a9dc8c939dc5 100644
--- a/volumes.c
+++ b/volumes.c
@@ -837,10 +837,9 @@ error:
/ sizeof(struct btrfs_stripe) + 1)
/*
- * Alloc a chunk, will insert dev extents, chunk item.
- * NOTE: This function will not insert block group item nor mark newly
- * allocated chunk available for later allocation.
- * Block group item and free space update is handled by btrfs_make_block_group()
+ * Alloc a chunk, will insert dev extents, chunk item, and insert new
+ * block group and update space info (so that extent allocator can use
+ * newly allocated chunk).
*
* @start: return value of allocated chunk start bytenr.
* @num_bytes: return value of allocated chunk size
@@ -1159,6 +1158,9 @@ alloc_chunk:
}
kfree(chunk);
+
+ ret = btrfs_make_block_group(trans, info, 0, type, map->ce.start,
+ map->ce.size);
return ret;
out_chunk_map:
--
2.16.1
* [PATCH v2 04/10] btrfs-progs: Introduce btrfs_raid_array and related infrastructures
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (3 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 03/10] btrfs-progs: Make btrfs_alloc_chunk to handle block group creation Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 05/10] btrfs-progs: volumes: Allow find_free_dev_extent() to return maximum hole size Qu Wenruo
` (5 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
As part of the effort to unify code and behavior between btrfs-progs and
kernel, copy the btrfs_raid_array from kernel to btrfs-progs.
So later we can use btrfs_raid_array[] to get the needed RAID info
instead of manually doing if-else branches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
ctree.h | 12 +++++++++++-
volumes.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
volumes.h | 30 +++++++++++++++++++++++++++++
3 files changed, 107 insertions(+), 1 deletion(-)
diff --git a/ctree.h b/ctree.h
index 17cdac76c58c..c76849d8deb7 100644
--- a/ctree.h
+++ b/ctree.h
@@ -958,7 +958,17 @@ struct btrfs_csum_item {
#define BTRFS_BLOCK_GROUP_RAID5 (1ULL << 7)
#define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
#define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE
-#define BTRFS_NR_RAID_TYPES 7
+
+enum btrfs_raid_types {
+ BTRFS_RAID_RAID10,
+ BTRFS_RAID_RAID1,
+ BTRFS_RAID_DUP,
+ BTRFS_RAID_RAID0,
+ BTRFS_RAID_SINGLE,
+ BTRFS_RAID_RAID5,
+ BTRFS_RAID_RAID6,
+ BTRFS_NR_RAID_TYPES
+};
#define BTRFS_BLOCK_GROUP_TYPE_MASK (BTRFS_BLOCK_GROUP_DATA | \
BTRFS_BLOCK_GROUP_SYSTEM | \
diff --git a/volumes.c b/volumes.c
index a9dc8c939dc5..b47ff1f392b5 100644
--- a/volumes.c
+++ b/volumes.c
@@ -30,6 +30,72 @@
#include "utils.h"
#include "kernel-lib/raid56.h"
+const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
+ [BTRFS_RAID_RAID10] = {
+ .sub_stripes = 2,
+ .dev_stripes = 1,
+ .devs_max = 0, /* 0 == as many as possible */
+ .devs_min = 4,
+ .tolerated_failures = 1,
+ .devs_increment = 2,
+ .ncopies = 2,
+ },
+ [BTRFS_RAID_RAID1] = {
+ .sub_stripes = 1,
+ .dev_stripes = 1,
+ .devs_max = 2,
+ .devs_min = 2,
+ .tolerated_failures = 1,
+ .devs_increment = 2,
+ .ncopies = 2,
+ },
+ [BTRFS_RAID_DUP] = {
+ .sub_stripes = 1,
+ .dev_stripes = 2,
+ .devs_max = 1,
+ .devs_min = 1,
+ .tolerated_failures = 0,
+ .devs_increment = 1,
+ .ncopies = 2,
+ },
+ [BTRFS_RAID_RAID0] = {
+ .sub_stripes = 1,
+ .dev_stripes = 1,
+ .devs_max = 0,
+ .devs_min = 2,
+ .tolerated_failures = 0,
+ .devs_increment = 1,
+ .ncopies = 1,
+ },
+ [BTRFS_RAID_SINGLE] = {
+ .sub_stripes = 1,
+ .dev_stripes = 1,
+ .devs_max = 1,
+ .devs_min = 1,
+ .tolerated_failures = 0,
+ .devs_increment = 1,
+ .ncopies = 1,
+ },
+ [BTRFS_RAID_RAID5] = {
+ .sub_stripes = 1,
+ .dev_stripes = 1,
+ .devs_max = 0,
+ .devs_min = 2,
+ .tolerated_failures = 1,
+ .devs_increment = 1,
+ .ncopies = 2,
+ },
+ [BTRFS_RAID_RAID6] = {
+ .sub_stripes = 1,
+ .dev_stripes = 1,
+ .devs_max = 0,
+ .devs_min = 3,
+ .tolerated_failures = 2,
+ .devs_increment = 1,
+ .ncopies = 3,
+ },
+};
+
struct stripe {
struct btrfs_device *dev;
u64 physical;
diff --git a/volumes.h b/volumes.h
index 7bbdf615d31a..612a0a7586f4 100644
--- a/volumes.h
+++ b/volumes.h
@@ -108,6 +108,36 @@ struct map_lookup {
struct btrfs_bio_stripe stripes[];
};
+struct btrfs_raid_attr {
+ int sub_stripes; /* sub_stripes info for map */
+ int dev_stripes; /* stripes per dev */
+ int devs_max; /* max devs to use */
+ int devs_min; /* min devs needed */
+ int tolerated_failures; /* max tolerated fail devs */
+ int devs_increment; /* ndevs has to be a multiple of this */
+ int ncopies; /* how many copies to data has */
+};
+
+extern const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES];
+
+static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
+{
+ if (flags & BTRFS_BLOCK_GROUP_RAID10)
+ return BTRFS_RAID_RAID10;
+ else if (flags & BTRFS_BLOCK_GROUP_RAID1)
+ return BTRFS_RAID_RAID1;
+ else if (flags & BTRFS_BLOCK_GROUP_DUP)
+ return BTRFS_RAID_DUP;
+ else if (flags & BTRFS_BLOCK_GROUP_RAID0)
+ return BTRFS_RAID_RAID0;
+ else if (flags & BTRFS_BLOCK_GROUP_RAID5)
+ return BTRFS_RAID_RAID5;
+ else if (flags & BTRFS_BLOCK_GROUP_RAID6)
+ return BTRFS_RAID_RAID6;
+
+ return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
+}
+
#define btrfs_multi_bio_size(n) (sizeof(struct btrfs_multi_bio) + \
(sizeof(struct btrfs_bio_stripe) * (n)))
#define btrfs_map_lookup_size(n) (sizeof(struct map_lookup) + \
--
2.16.1
* [PATCH v2 05/10] btrfs-progs: volumes: Allow find_free_dev_extent() to return maximum hole size
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (4 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 04/10] btrfs-progs: Introduce btrfs_raid_array and related infrastructures Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs Qu Wenruo
` (4 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
Just as the kernel find_free_dev_extent(), allow it to return the maximum
hole size, so we can build the device list for the later chunk allocator
rework.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
volumes.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/volumes.c b/volumes.c
index b47ff1f392b5..f4009ffa7c9e 100644
--- a/volumes.c
+++ b/volumes.c
@@ -516,10 +516,10 @@ out:
}
static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
- u64 *start)
+ u64 *start, u64 *len)
{
/* FIXME use last free of some kind */
- return find_free_dev_extent_start(device, num_bytes, 0, start, NULL);
+ return find_free_dev_extent_start(device, num_bytes, 0, start, len);
}
static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
@@ -543,7 +543,7 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
* is responsible to make sure it's free.
*/
if (!convert) {
- ret = find_free_dev_extent(device, num_bytes, start);
+ ret = find_free_dev_extent(device, num_bytes, start, NULL);
if (ret)
goto err;
}
--
2.16.1
* [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (5 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 05/10] btrfs-progs: volumes: Allow find_free_dev_extent() to return maximum hole size Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-21 15:40 ` David Sterba
2018-02-09 7:44 ` [PATCH v2 07/10] btrfs-progs: volumes: Unify free dev extent search behavior between kernel and btrfs-progs Qu Wenruo
` (3 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
Used by later btrfs_alloc_chunk() rework.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Makefile | 3 +-
kernel-lib/sort.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
kernel-lib/sort.h | 16 +++++++++
3 files changed, 122 insertions(+), 1 deletion(-)
create mode 100644 kernel-lib/sort.c
create mode 100644 kernel-lib/sort.h
diff --git a/Makefile b/Makefile
index 327cdfa08eba..3db7bbe04662 100644
--- a/Makefile
+++ b/Makefile
@@ -106,7 +106,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o \
+ kernel-lib/sort.o
cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
cmds-quota.o cmds-qgroup.o cmds-replace.o check/main.o \
diff --git a/kernel-lib/sort.c b/kernel-lib/sort.c
new file mode 100644
index 000000000000..70ae3dbe2852
--- /dev/null
+++ b/kernel-lib/sort.c
@@ -0,0 +1,104 @@
+/*
+ * taken from linux kernel lib/sort.c, removed kernel config code and adapted
+ * for btrfsprogs
+ */
+
+#include "sort.h"
+
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
+ *
+ * Jan 23 2005 Matt Mackall <mpm@selenic.com>
+ */
+
+static int alignment_ok(const void *base, int align)
+{
+ return ((unsigned long)base & (align - 1)) == 0;
+}
+
+static void u32_swap(void *a, void *b, int size)
+{
+ u32 t = *(u32 *)a;
+ *(u32 *)a = *(u32 *)b;
+ *(u32 *)b = t;
+}
+
+static void u64_swap(void *a, void *b, int size)
+{
+ u64 t = *(u64 *)a;
+ *(u64 *)a = *(u64 *)b;
+ *(u64 *)b = t;
+}
+
+static void generic_swap(void *a, void *b, int size)
+{
+ char t;
+
+ do {
+ t = *(char *)a;
+ *(char *)a++ = *(char *)b;
+ *(char *)b++ = t;
+ } while (--size > 0);
+}
+
+/**
+ * sort - sort an array of elements
+ * @base: pointer to data to sort
+ * @num: number of elements
+ * @size: size of each element
+ * @cmp_func: pointer to comparison function
+ * @swap_func: pointer to swap function or NULL
+ *
+ * This function does a heapsort on the given array. You may provide a
+ * swap_func function optimized to your element type.
+ *
+ * Sorting time is O(n log n) both on average and worst-case. While
+ * qsort is about 20% faster on average, it suffers from exploitable
+ * O(n*n) worst-case behavior and extra memory requirements that make
+ * it less suitable for kernel use.
+ */
+
+void sort(void *base, size_t num, size_t size,
+ int (*cmp_func)(const void *, const void *),
+ void (*swap_func)(void *, void *, int size))
+{
+ /* pre-scale counters for performance */
+ int i = (num/2 - 1) * size, n = num * size, c, r;
+
+ if (!swap_func) {
+ if (size == 4 && alignment_ok(base, 4))
+ swap_func = u32_swap;
+ else if (size == 8 && alignment_ok(base, 8))
+ swap_func = u64_swap;
+ else
+ swap_func = generic_swap;
+ }
+
+ /* heapify */
+ for ( ; i >= 0; i -= size) {
+ for (r = i; r * 2 + size < n; r = c) {
+ c = r * 2 + size;
+ if (c < n - size &&
+ cmp_func(base + c, base + c + size) < 0)
+ c += size;
+ if (cmp_func(base + r, base + c) >= 0)
+ break;
+ swap_func(base + r, base + c, size);
+ }
+ }
+
+ /* sort */
+ for (i = n - size; i > 0; i -= size) {
+ swap_func(base, base + i, size);
+ for (r = 0; r * 2 + size < i; r = c) {
+ c = r * 2 + size;
+ if (c < i - size &&
+ cmp_func(base + c, base + c + size) < 0)
+ c += size;
+ if (cmp_func(base + r, base + c) >= 0)
+ break;
+ swap_func(base + r, base + c, size);
+ }
+ }
+}
diff --git a/kernel-lib/sort.h b/kernel-lib/sort.h
new file mode 100644
index 000000000000..9355e01248f2
--- /dev/null
+++ b/kernel-lib/sort.h
@@ -0,0 +1,16 @@
+/*
+ * taken from linux kernel include/list_sort.h
+ * changed include to kerncompat.h
+ */
+
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SORT_H
+#define _LINUX_SORT_H
+
+#include "kerncompat.h"
+
+void sort(void *base, size_t num, size_t size,
+ int (*cmp)(const void *, const void *),
+ void (*swap)(void *, void *, int));
+
+#endif
--
2.16.1
* Re: [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs
2018-02-09 7:44 ` [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs Qu Wenruo
@ 2018-02-21 15:40 ` David Sterba
2018-02-22 1:43 ` Qu Wenruo
0 siblings, 1 reply; 24+ messages in thread
From: David Sterba @ 2018-02-21 15:40 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, dsterba
On Fri, Feb 09, 2018 at 03:44:24PM +0800, Qu Wenruo wrote:
> Used by later btrfs_alloc_chunk() rework.
We have the libc provided qsort, so please don't pull another
implementation but rather add a wrapper.
> --- /dev/null
> +++ b/kernel-lib/sort.c
> @@ -0,0 +1,104 @@
> +/*
> + * taken from linux kernel lib/sort.c, removed kernel config code and adapted
> + * for btrfsprogs
> + */
> +
> +#include "sort.h"
> +
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
> + *
> + * Jan 23 2005 Matt Mackall <mpm@selenic.com>
> + */
When taking files from kernel: take the file as is and insert notes for
btrfs-progs below the comments. The SPDX line must be first.
* Re: [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs
2018-02-21 15:40 ` David Sterba
@ 2018-02-22 1:43 ` Qu Wenruo
2018-03-09 18:14 ` David Sterba
0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2018-02-22 1:43 UTC (permalink / raw)
To: dsterba, Qu Wenruo, linux-btrfs
On 2018-02-21 23:40, David Sterba wrote:
> On Fri, Feb 09, 2018 at 03:44:24PM +0800, Qu Wenruo wrote:
>> Used by later btrfs_alloc_chunk() rework.
>
> We have the libc provided qsort, so please don't pull another
> implementation but rather add a wrapper.
Thanks for pointing this out.
Another 100 lines saved.
>
>> --- /dev/null
>> +++ b/kernel-lib/sort.c
>> @@ -0,0 +1,104 @@
>> +/*
>> + * taken from linux kernel lib/sort.c, removed kernel config code and adapted
>> + * for btrfsprogs
>> + */
>> +
>> +#include "sort.h"
>> +
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
>> + *
>> + * Jan 23 2005 Matt Mackall <mpm@selenic.com>
>> + */
>
> When taking files from kernel: take the file as is and insert notes for
> btrfs-progs below the comments. The SPDX line must be first.
Will keep that in mind.
Although quite a lot of files in kernel-lib/ don't follow this convention.
Maybe we should put a README into kernel-lib/?
Thanks,
Qu
* Re: [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs
2018-02-22 1:43 ` Qu Wenruo
@ 2018-03-09 18:14 ` David Sterba
0 siblings, 0 replies; 24+ messages in thread
From: David Sterba @ 2018-03-09 18:14 UTC (permalink / raw)
To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs
On Thu, Feb 22, 2018 at 09:43:37AM +0800, Qu Wenruo wrote:
>
>
> On 2018-02-21 23:40, David Sterba wrote:
> > On Fri, Feb 09, 2018 at 03:44:24PM +0800, Qu Wenruo wrote:
> >> Used by later btrfs_alloc_chunk() rework.
> >
> > We have the libc provided qsort, so please don't pull another
> > implementation but rather add a wrapper.
>
> Thanks for pointing this out.
>
> Another 100 lines saved.
>
> >
> >> --- /dev/null
> >> +++ b/kernel-lib/sort.c
> >> @@ -0,0 +1,104 @@
> >> +/*
> >> + * taken from linux kernel lib/sort.c, removed kernel config code and adapted
> >> + * for btrfsprogs
> >> + */
> >> +
> >> +#include "sort.h"
> >> +
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
> >> + *
> >> + * Jan 23 2005 Matt Mackall <mpm@selenic.com>
> >> + */
> >
> > When taking files from kernel: take the file as is and insert notes for
> > btrfs-progs below the comments. The SPDX line must be first.
>
> Will keep that in mind.
>
> Although quite a lot of files in kernel-lib/ don't follow this convention.
>
> Maybe we should put a README into kernel-lib/?
This should be a part of development documentation, it's in
Documentation, so far incomplete but adding the 'how to sync shared
kernel code' would be welcome.
* [PATCH v2 07/10] btrfs-progs: volumes: Unify free dev extent search behavior between kernel and btrfs-progs.
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (6 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 08/10] btrfs-progs: Move chunk stripe size calculation function to volumes.h Qu Wenruo
` (2 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
As preparation for creating libbtrfs, which will share code between the
kernel and btrfs-progs, this patch mainly unifies the search for free
device extents.
The main modifications are:
1) Search for free device extent
Use the kernel method: sort the devices by their maximum hole size,
and use that sorted result to determine the stripe size and chunk size.
The previous method, which searched by the total available bytes of
each device, can't handle small holes scattered across a device.
2) Chunk/stripe minimal size
Remove the minimal size requirement.
Now the real minimal stripe size is 64K (BTRFS_STRIPE_LEN), the same
as the kernel's.
The upper limit is kept as is, and the minimal device size is kept
for now.
But since the odd minimal stripe size limit is gone, the existing
minimal device size calculation won't cause any problem.
3) How to insert device extents
Although not identical to the kernel, we follow the kernel behavior of
delaying dev extent insertion.
There is a plan to follow the kernel __btrfs_alloc_chunk() and make it
handle only chunk mapping allocation, doing nothing with tree
operations.
4) Usage of btrfs_raid_array[]
This makes a lot of the old if-else branches disappear.
There is still a lot of work to do (in both kernel and btrfs-progs)
before we can start extracting code into libbtrfs, but this should
bring libbtrfs within our reach.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
kerncompat.h | 5 +
volumes.c | 621 ++++++++++++++++++++++++++---------------------------------
volumes.h | 7 +
3 files changed, 285 insertions(+), 348 deletions(-)
diff --git a/kerncompat.h b/kerncompat.h
index fa96715fb70c..658d28ed0792 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -285,6 +285,7 @@ static inline int IS_ERR_OR_NULL(const void *ptr)
*/
#define kmalloc(x, y) malloc(x)
#define kzalloc(x, y) calloc(1, x)
+#define kcalloc(x, y) calloc(x, y)
#define kstrdup(x, y) strdup(x)
#define kfree(x) free(x)
#define vmalloc(x) malloc(x)
@@ -394,4 +395,8 @@ struct __una_u64 { __le64 x; } __attribute__((__packed__));
#define noinline
#endif
+static inline u64 div_u64(u64 dividend, u32 divisor)
+{
+ return dividend / ((u64) divisor);
+}
#endif
diff --git a/volumes.c b/volumes.c
index f4009ffa7c9e..c2efb3c674dc 100644
--- a/volumes.c
+++ b/volumes.c
@@ -29,6 +29,7 @@
#include "volumes.h"
#include "utils.h"
#include "kernel-lib/raid56.h"
+#include "kernel-lib/sort.h"
const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
[BTRFS_RAID_RAID10] = {
@@ -522,55 +523,55 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
return find_free_dev_extent_start(device, num_bytes, 0, start, len);
}
-static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
- struct btrfs_device *device,
- u64 chunk_offset, u64 num_bytes, u64 *start,
- int convert)
+static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info,
+ struct map_lookup *map, u64 stripe_size)
{
- int ret;
- struct btrfs_path *path;
- struct btrfs_root *root = device->dev_root;
+ int ret = 0;
+ struct btrfs_path path;
+ struct btrfs_root *root = fs_info->dev_root;
struct btrfs_dev_extent *extent;
struct extent_buffer *leaf;
struct btrfs_key key;
+ int i;
- path = btrfs_alloc_path();
- if (!path)
- return -ENOMEM;
+ btrfs_init_path(&path);
- /*
- * For convert case, just skip search free dev_extent, as caller
- * is responsible to make sure it's free.
- */
- if (!convert) {
- ret = find_free_dev_extent(device, num_bytes, start, NULL);
- if (ret)
- goto err;
- }
+ for (i = 0; i < map->num_stripes; i++) {
+ struct btrfs_device *device = map->stripes[i].dev;
- key.objectid = device->devid;
- key.offset = *start;
- key.type = BTRFS_DEV_EXTENT_KEY;
- ret = btrfs_insert_empty_item(trans, root, path, &key,
- sizeof(*extent));
- BUG_ON(ret);
+ key.objectid = device->devid;
+ key.offset = map->stripes[i].physical;
+ key.type = BTRFS_DEV_EXTENT_KEY;
- leaf = path->nodes[0];
- extent = btrfs_item_ptr(leaf, path->slots[0],
- struct btrfs_dev_extent);
- btrfs_set_dev_extent_chunk_tree(leaf, extent, BTRFS_CHUNK_TREE_OBJECTID);
- btrfs_set_dev_extent_chunk_objectid(leaf, extent,
- BTRFS_FIRST_CHUNK_TREE_OBJECTID);
- btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
-
- write_extent_buffer(leaf, root->fs_info->chunk_tree_uuid,
- (unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
- BTRFS_UUID_SIZE);
-
- btrfs_set_dev_extent_length(leaf, extent, num_bytes);
- btrfs_mark_buffer_dirty(leaf);
-err:
- btrfs_free_path(path);
+ ret = btrfs_insert_empty_item(trans, root, &path, &key,
+ sizeof(*extent));
+ if (ret < 0)
+ goto out;
+ leaf = path.nodes[0];
+ extent = btrfs_item_ptr(leaf, path.slots[0],
+ struct btrfs_dev_extent);
+ btrfs_set_dev_extent_chunk_tree(leaf, extent,
+ BTRFS_CHUNK_TREE_OBJECTID);
+ btrfs_set_dev_extent_chunk_objectid(leaf, extent,
+ BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+ btrfs_set_dev_extent_chunk_offset(leaf, extent, map->ce.start);
+
+ write_extent_buffer(leaf, fs_info->chunk_tree_uuid,
+ (unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
+ BTRFS_UUID_SIZE);
+
+ btrfs_set_dev_extent_length(leaf, extent, stripe_size);
+ btrfs_mark_buffer_dirty(leaf);
+ btrfs_release_path(&path);
+
+ device->bytes_used += stripe_size;
+ ret = btrfs_update_device(trans, device);
+ if (ret < 0)
+ goto out;
+ }
+out:
+ btrfs_release_path(&path);
return ret;
}
@@ -782,114 +783,23 @@ int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
return 0;
}
-static u64 chunk_bytes_by_type(u64 type, u64 calc_size, int num_stripes,
- int sub_stripes)
-{
- if (type & (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_DUP))
- return calc_size;
- else if (type & BTRFS_BLOCK_GROUP_RAID10)
- return calc_size * (num_stripes / sub_stripes);
- else if (type & BTRFS_BLOCK_GROUP_RAID5)
- return calc_size * (num_stripes - 1);
- else if (type & BTRFS_BLOCK_GROUP_RAID6)
- return calc_size * (num_stripes - 2);
- else
- return calc_size * num_stripes;
-}
-
-
-static u32 find_raid56_stripe_len(u32 data_devices, u32 dev_stripe_target)
-{
- /* TODO, add a way to store the preferred stripe size */
- return BTRFS_STRIPE_LEN;
-}
-
/*
- * btrfs_device_avail_bytes - count bytes available for alloc_chunk
- *
- * It is not equal to "device->total_bytes - device->bytes_used".
- * We do not allocate any chunk in 1M at beginning of device, and not
- * allowed to allocate any chunk before alloc_start if it is specified.
- * So search holes from max(1M, alloc_start) to device->total_bytes.
+ * sort the devices in descending order by max_avail, total_avail
*/
-static int btrfs_device_avail_bytes(struct btrfs_trans_handle *trans,
- struct btrfs_device *device,
- u64 *avail_bytes)
+static int btrfs_cmp_device_info(const void *a, const void *b)
{
- struct btrfs_path *path;
- struct btrfs_root *root = device->dev_root;
- struct btrfs_key key;
- struct btrfs_dev_extent *dev_extent = NULL;
- struct extent_buffer *l;
- u64 search_start = root->fs_info->alloc_start;
- u64 search_end = device->total_bytes;
- u64 extent_end = 0;
- u64 free_bytes = 0;
- int ret;
- int slot = 0;
+ const struct btrfs_device_info *di_a = a;
+ const struct btrfs_device_info *di_b = b;
- search_start = max(BTRFS_BLOCK_RESERVED_1M_FOR_SUPER, search_start);
-
- path = btrfs_alloc_path();
- if (!path)
- return -ENOMEM;
-
- key.objectid = device->devid;
- key.offset = root->fs_info->alloc_start;
- key.type = BTRFS_DEV_EXTENT_KEY;
-
- path->reada = 2;
- ret = btrfs_search_slot(trans, root, &key, path, 0, 0);
- if (ret < 0)
- goto error;
- ret = btrfs_previous_item(root, path, 0, key.type);
- if (ret < 0)
- goto error;
-
- while (1) {
- l = path->nodes[0];
- slot = path->slots[0];
- if (slot >= btrfs_header_nritems(l)) {
- ret = btrfs_next_leaf(root, path);
- if (ret == 0)
- continue;
- if (ret < 0)
- goto error;
- break;
- }
- btrfs_item_key_to_cpu(l, &key, slot);
-
- if (key.objectid < device->devid)
- goto next;
- if (key.objectid > device->devid)
- break;
- if (key.type != BTRFS_DEV_EXTENT_KEY)
- goto next;
- if (key.offset > search_end)
- break;
- if (key.offset > search_start)
- free_bytes += key.offset - search_start;
-
- dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
- extent_end = key.offset + btrfs_dev_extent_length(l,
- dev_extent);
- if (extent_end > search_start)
- search_start = extent_end;
- if (search_start > search_end)
- break;
-next:
- path->slots[0]++;
- cond_resched();
- }
-
- if (search_start < search_end)
- free_bytes += search_end - search_start;
-
- *avail_bytes = free_bytes;
- ret = 0;
-error:
- btrfs_free_path(path);
- return ret;
+ if (di_a->max_avail > di_b->max_avail)
+ return -1;
+ if (di_a->max_avail < di_b->max_avail)
+ return 1;
+ if (di_a->total_avail > di_b->total_avail)
+ return -1;
+ if (di_a->total_avail < di_b->total_avail)
+ return 1;
+ return 0;
}
#define BTRFS_MAX_DEVS(info) ((BTRFS_LEAF_DATA_SIZE(info) \
@@ -910,46 +820,60 @@ error:
* @start: return value of allocated chunk start bytenr.
* @num_bytes: return value of allocated chunk size
* @type: chunk type (including both profile and type)
- * @convert: if the chunk is allocated for convert case.
- * If @convert is true, chunk allocator will skip device extent
- * search, but use *start and *num_bytes as chunk start/num_bytes
- * and devive offset, to build a 1:1 chunk mapping for convert.
+ * @convert: if the chunk is allocated for convert case.
+ * If @convert is true, chunk allocator will skip device extent
+ * search, but use *start and *num_bytes as chunk start/num_bytes
+ * and devive offset, to build a 1:1 chunk mapping for convert.
*/
int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *info, u64 *start,
u64 *num_bytes, u64 type, bool convert)
{
- u64 dev_offset;
struct btrfs_root *extent_root = info->extent_root;
struct btrfs_root *chunk_root = info->chunk_root;
- struct btrfs_stripe *stripes;
struct btrfs_device *device = NULL;
struct btrfs_chunk *chunk;
- struct list_head private_devs;
struct list_head *dev_list = &info->fs_devices->devices;
- struct list_head *cur;
+ struct btrfs_stripe *stripe;
struct map_lookup *map;
- int min_stripe_size = SZ_1M;
- u64 calc_size = SZ_8M;
- u64 min_free;
- u64 max_chunk_size = 4 * calc_size;
- u64 avail = 0;
- u64 max_avail = 0;
+ struct btrfs_device_info *devices_info = NULL;
u64 percent_max;
- int num_stripes = 1;
- int max_stripes = 0;
- int min_stripes = 1;
- int sub_stripes = 0;
- int looped = 0;
+ int num_stripes;
int ret;
+ int i;
+ int j;
int index;
+ int ndevs = 0;
+ int rw_devs = 0;
int stripe_len = BTRFS_STRIPE_LEN;
struct btrfs_key key;
u64 offset;
+ int data_stripes; /* number of stripes that counts for bg size */
+ int sub_stripes; /* sub_stripes info for map */
+ int dev_stripes; /* stripes per dev */
+ int devs_max; /* max devs to use */
+ int devs_min; /* min devs needed */
+ int devs_increment; /* ndevs has to be a multiple of this */
+ int ncopies; /* how many copies to data has */
+ u64 max_stripe_size; /* max physical size of each stripe */
+ u64 max_chunk_size; /* max logical size of the block group */
+ u64 stripe_size;
if (list_empty(dev_list))
return -ENOSPC;
+ list_for_each_entry(device, dev_list, dev_list)
+ rw_devs++;
+
+ index = btrfs_bg_flags_to_raid_index(type);
+
+ sub_stripes = btrfs_raid_array[index].sub_stripes;
+ dev_stripes = btrfs_raid_array[index].dev_stripes;
+ devs_max = btrfs_raid_array[index].devs_max;
+ devs_min = btrfs_raid_array[index].devs_min;
+ devs_increment = btrfs_raid_array[index].devs_increment;
+ ncopies = btrfs_raid_array[index].ncopies;
+
if (convert) {
/* For convert, profile must be SINGLE */
if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
@@ -966,8 +890,10 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
*num_bytes, info->sectorsize);
return -EINVAL;
}
- calc_size = *num_bytes;
+ num_stripes = 1;
+ data_stripes = 1;
offset = *start;
+ stripe_size = *num_bytes;
/*
* For convert, we use 1:1 chunk mapping specified by @start and
* @num_bytes, so there is no need to go through dev_extent
@@ -979,70 +905,25 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
/*
* Chunk size calculation part.
*/
- if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
- if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
- calc_size = SZ_8M;
- max_chunk_size = calc_size * 2;
- min_stripe_size = SZ_1M;
- max_stripes = BTRFS_MAX_DEVS_SYS_CHUNK;
- } else if (type & BTRFS_BLOCK_GROUP_DATA) {
- calc_size = SZ_1G;
- max_chunk_size = 10 * calc_size;
- min_stripe_size = SZ_64M;
- max_stripes = BTRFS_MAX_DEVS(info);
- } else if (type & BTRFS_BLOCK_GROUP_METADATA) {
- calc_size = SZ_1G;
- max_chunk_size = 4 * calc_size;
- min_stripe_size = SZ_32M;
- max_stripes = BTRFS_MAX_DEVS(info);
- }
- }
- if (type & BTRFS_BLOCK_GROUP_RAID1) {
- num_stripes = min_t(u64, 2,
- btrfs_super_num_devices(info->super_copy));
- if (num_stripes < 2)
- return -ENOSPC;
- min_stripes = 2;
- }
- if (type & BTRFS_BLOCK_GROUP_DUP) {
- num_stripes = 2;
- min_stripes = 2;
- }
- if (type & (BTRFS_BLOCK_GROUP_RAID0)) {
- num_stripes = btrfs_super_num_devices(info->super_copy);
- if (num_stripes > max_stripes)
- num_stripes = max_stripes;
- min_stripes = 2;
- }
- if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
- num_stripes = btrfs_super_num_devices(info->super_copy);
- if (num_stripes > max_stripes)
- num_stripes = max_stripes;
- if (num_stripes < 4)
- return -ENOSPC;
- num_stripes &= ~(u32)1;
- sub_stripes = 2;
- min_stripes = 4;
- }
- if (type & (BTRFS_BLOCK_GROUP_RAID5)) {
- num_stripes = btrfs_super_num_devices(info->super_copy);
- if (num_stripes > max_stripes)
- num_stripes = max_stripes;
- if (num_stripes < 2)
- return -ENOSPC;
- min_stripes = 2;
- stripe_len = find_raid56_stripe_len(num_stripes - 1,
- btrfs_super_stripesize(info->super_copy));
- }
- if (type & (BTRFS_BLOCK_GROUP_RAID6)) {
- num_stripes = btrfs_super_num_devices(info->super_copy);
- if (num_stripes > max_stripes)
- num_stripes = max_stripes;
- if (num_stripes < 3)
- return -ENOSPC;
- min_stripes = 3;
- stripe_len = find_raid56_stripe_len(num_stripes - 2,
- btrfs_super_stripesize(info->super_copy));
+ if (type & BTRFS_BLOCK_GROUP_DATA) {
+ max_stripe_size = SZ_1G;
+ max_chunk_size = 10 * max_stripe_size;
+ if (!devs_max)
+ devs_max = BTRFS_MAX_DEVS(info);
+ } else if (type & BTRFS_BLOCK_GROUP_METADATA) {
+ /* TODO: Support total_rw_bytes in fs_devices */
+ max_stripe_size = SZ_256M;
+ max_chunk_size = max_stripe_size;
+ if (!devs_max)
+ devs_max = BTRFS_MAX_DEVS(info);
+ } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
+ max_stripe_size = SZ_32M;
+ max_chunk_size = 2 * max_stripe_size;
+ if (!devs_max)
+ devs_max = BTRFS_MAX_DEVS_SYS_CHUNK;
+ } else {
+ error("invalid chunk type 0x%llx requested", type);
+ return -EINVAL;
}
/* we don't want a chunk larger than 10% of the FS */
@@ -1052,64 +933,116 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
/*
* Reserve space from each device.
*/
-again:
- if (chunk_bytes_by_type(type, calc_size, num_stripes, sub_stripes) >
- max_chunk_size) {
- calc_size = max_chunk_size;
- calc_size /= num_stripes;
- calc_size /= stripe_len;
- calc_size *= stripe_len;
- }
- /* we don't want tiny stripes */
- calc_size = max_t(u64, calc_size, min_stripe_size);
-
- calc_size /= stripe_len;
- calc_size *= stripe_len;
- INIT_LIST_HEAD(&private_devs);
- cur = dev_list->next;
- index = 0;
-
- if (type & BTRFS_BLOCK_GROUP_DUP)
- min_free = calc_size * 2;
- else
- min_free = calc_size;
- /* build a private list of devices we will allocate from */
- while(index < num_stripes) {
- device = list_entry(cur, struct btrfs_device, dev_list);
- ret = btrfs_device_avail_bytes(trans, device, &avail);
- if (ret)
- return ret;
- cur = cur->next;
- if (avail >= min_free) {
- list_move_tail(&device->dev_list,
- &private_devs);
- index++;
- if (type & BTRFS_BLOCK_GROUP_DUP)
- index++;
- } else if (avail > max_avail)
- max_avail = avail;
- if (cur == dev_list)
- break;
- }
- if (index < num_stripes) {
- list_splice(&private_devs, dev_list);
- if (index >= min_stripes) {
- num_stripes = index;
- if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
- num_stripes /= sub_stripes;
- num_stripes *= sub_stripes;
- }
- looped = 1;
- goto again;
+ devices_info = kcalloc(rw_devs, sizeof(*devices_info));
+ if (!devices_info)
+ return -ENOMEM;
+ /*
+ * in the first pass through the devices list, we gather information
+ * about the available holes on each device.
+ */
+ ndevs = 0;
+ list_for_each_entry(device, dev_list, dev_list) {
+ u64 max_avail;
+ u64 dev_offset;
+ u64 total_avail;
+
+ if (!device->writeable) {
+ warning("read-only device in alloc_list, devid=%llu",
+ device->devid);
+ continue;
}
- if (!looped && max_avail > 0) {
- looped = 1;
- calc_size = max_avail;
- goto again;
+
+ if (device->total_bytes > device->bytes_used)
+ total_avail = device->total_bytes - device->bytes_used;
+ else
+ total_avail = 0;
+
+ /* If there is no space on this device, skip it. */
+ if (total_avail == 0)
+ continue;
+
+ ret = find_free_dev_extent(device,
+ max_stripe_size * dev_stripes,
+ &dev_offset, &max_avail);
+ if (ret && ret != -ENOSPC)
+ goto out_devices_info;
+
+ if (ret == 0)
+ max_avail = max_stripe_size * dev_stripes;
+
+ if (max_avail < BTRFS_STRIPE_LEN * dev_stripes)
+ continue;
+
+ if (ndevs == rw_devs) {
+ warning("%s: found more than %u devices", __func__,
+ rw_devs);
+ break;
}
- return -ENOSPC;
+ devices_info[ndevs].dev_offset = dev_offset;
+ devices_info[ndevs].max_avail = max_avail;
+ devices_info[ndevs].total_avail = total_avail;
+ devices_info[ndevs].dev = device;
+ ++ndevs;
+ }
+
+ /*
+ * now sort the devices by hole size / available space
+ */
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+
+ /* round down to number of usable stripes */
+ ndevs = round_down(ndevs, devs_increment);
+
+ if (ndevs < devs_min) {
+ ret = -ENOSPC;
+ goto out_devices_info;
+ }
+
+ ndevs = min(ndevs, devs_max);
+
+ /*
+ * the primary goal is to maximize the number of stripes, so use as many
+ * devices as possible, even if the stripes are not maximum sized.
+ */
+ stripe_size = devices_info[ndevs-1].max_avail;
+ num_stripes = ndevs * dev_stripes;
+
+ /*
+ * this will have to be fixed for RAID1 and RAID10 over
+ * more drives
+ */
+ data_stripes = num_stripes / ncopies;
+
+ if (type & BTRFS_BLOCK_GROUP_RAID5)
+ data_stripes = num_stripes - 1;
+
+ if (type & BTRFS_BLOCK_GROUP_RAID6)
+ data_stripes = num_stripes - 2;
+
+ /*
+ * Use the number of data stripes to figure out how big this chunk
+ * is really going to be in terms of logical address space,
+ * and compare that answer with the max chunk size
+ */
+ if (stripe_size * data_stripes > max_chunk_size) {
+ stripe_size = div_u64(max_chunk_size, data_stripes);
+
+ /* bump the answer up to a 16MB boundary */
+ stripe_size = round_up(stripe_size, SZ_16M);
+
+ /*
+ * but don't go higher than the limits we found
+ * while searching for free extents
+ */
+ stripe_size = min(devices_info[ndevs - 1].max_avail,
+ stripe_size);
}
+ stripe_size = div_u64(stripe_size, dev_stripes);
+
+ /* align to BTRFS_STRIPE_LEN */
+ stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
/*
* Fill chunk mapping and chunk stripes
@@ -1118,70 +1051,66 @@ alloc_chunk:
if (!convert) {
ret = find_next_chunk(info, &offset);
if (ret)
- return ret;
+ goto out_devices_info;
}
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
key.type = BTRFS_CHUNK_ITEM_KEY;
key.offset = offset;
+ *num_bytes = stripe_size * data_stripes;
chunk = kmalloc(btrfs_chunk_item_size(num_stripes), GFP_NOFS);
if (!chunk)
- return -ENOMEM;
+ goto out_devices_info;
map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
- if (!map) {
- kfree(chunk);
- return -ENOMEM;
- }
-
- stripes = &chunk->stripe;
- *num_bytes = chunk_bytes_by_type(type, calc_size,
- num_stripes, sub_stripes);
- index = 0;
- while(index < num_stripes) {
- struct btrfs_stripe *stripe;
-
- if (!convert) {
- if (list_empty(&private_devs))
- return -ENODEV;
- cur = private_devs.next;
- device = list_entry(cur, struct btrfs_device, dev_list);
- if (!(type & BTRFS_BLOCK_GROUP_DUP) ||
- (index == num_stripes - 1)) {
- /*
- * loop over this device again if we're doing a
- * dup group
- */
- list_move_tail(&device->dev_list, dev_list);
- }
- } else {
- /* Only SINGLE is accepted in convert case */
- BUG_ON(num_stripes > 1);
- device = list_entry(dev_list->next, struct btrfs_device,
- dev_list);
- key.offset = *start;
- dev_offset = *start;
- }
-
- ret = btrfs_alloc_dev_extent(trans, device, key.offset,
- calc_size, &dev_offset, convert);
- if (ret < 0)
- goto out_chunk_map;
+ if (!map)
+ goto out_chunk_map;
+ map->num_stripes = num_stripes;
- device->bytes_used += calc_size;
- ret = btrfs_update_device(trans, device);
- if (ret < 0)
- goto out_chunk_map;
+ if (convert) {
+ device = list_entry(dev_list->next, struct btrfs_device,
+ dev_list);
- map->stripes[index].dev = device;
- map->stripes[index].physical = dev_offset;
- stripe = stripes + index;
+ map->stripes[0].dev = device;
+ map->stripes[0].physical = *start;
+ stripe = &chunk->stripe;
btrfs_set_stack_stripe_devid(stripe, device->devid);
- btrfs_set_stack_stripe_offset(stripe, dev_offset);
+ btrfs_set_stack_stripe_offset(stripe, *start);
memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
- index++;
+ } else {
+ for (i = 0; i < ndevs; i++) {
+ for (j = 0; j < dev_stripes; ++j) {
+ int s = i * dev_stripes + j;
+
+ device = devices_info[i].dev;
+ map->stripes[s].dev = device;
+ map->stripes[s].physical =
+ devices_info[i].dev_offset +
+ j * stripe_size;
+ stripe = &chunk->stripe + s;
+ btrfs_set_stack_stripe_devid(stripe,
+ device->devid);
+ btrfs_set_stack_stripe_offset(stripe,
+ devices_info[i].dev_offset +
+ j * stripe_size);
+ memcpy(stripe->dev_uuid, device->uuid,
+ BTRFS_UUID_SIZE);
+ }
+ }
}
- BUG_ON(!convert && !list_empty(&private_devs));
+ map->stripe_len = BTRFS_STRIPE_LEN;
+ map->io_align = BTRFS_STRIPE_LEN;
+ map->io_width = BTRFS_STRIPE_LEN;
+ map->type = type;
+ map->sub_stripes = sub_stripes;
+ map->sector_size = info->sectorsize;
+ map->ce.start = key.offset;
+ map->ce.size = *num_bytes;
+
+
+ ret = insert_cache_extent(&info->mapping_tree.cache_tree, &map->ce);
+ if (ret < 0)
+ goto out_chunk_map;
/* key was set above */
btrfs_set_stack_chunk_length(chunk, *num_bytes);
@@ -1193,13 +1122,6 @@ alloc_chunk:
btrfs_set_stack_chunk_io_width(chunk, stripe_len);
btrfs_set_stack_chunk_sector_size(chunk, info->sectorsize);
btrfs_set_stack_chunk_sub_stripes(chunk, sub_stripes);
- map->sector_size = info->sectorsize;
- map->stripe_len = stripe_len;
- map->io_align = stripe_len;
- map->io_width = stripe_len;
- map->type = type;
- map->num_stripes = num_stripes;
- map->sub_stripes = sub_stripes;
/*
* Insert chunk item and chunk mapping.
@@ -1207,14 +1129,7 @@ alloc_chunk:
ret = btrfs_insert_item(trans, chunk_root, &key, chunk,
btrfs_chunk_item_size(num_stripes));
BUG_ON(ret);
- *start = key.offset;;
-
- map->ce.start = key.offset;
- map->ce.size = *num_bytes;
-
- ret = insert_cache_extent(&info->mapping_tree.cache_tree, &map->ce);
- if (ret < 0)
- goto out_chunk_map;
+ *start = key.offset;
if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
ret = btrfs_add_system_chunk(info, &key,
@@ -1225,14 +1140,24 @@ alloc_chunk:
kfree(chunk);
+ /*
+ * Insert device extents
+ */
+ ret = btrfs_insert_dev_extents(trans, info, map, stripe_size);
+ if (ret < 0)
+ goto out_devices_info;
+
ret = btrfs_make_block_group(trans, info, 0, type, map->ce.start,
map->ce.size);
+ kfree(devices_info);
return ret;
out_chunk_map:
kfree(map);
out_chunk:
kfree(chunk);
+out_devices_info:
+ kfree(devices_info);
return ret;
}
diff --git a/volumes.h b/volumes.h
index 612a0a7586f4..3741d45cae80 100644
--- a/volumes.h
+++ b/volumes.h
@@ -108,6 +108,13 @@ struct map_lookup {
struct btrfs_bio_stripe stripes[];
};
+struct btrfs_device_info {
+ struct btrfs_device *dev;
+ u64 dev_offset;
+ u64 max_avail;
+ u64 total_avail;
+};
+
struct btrfs_raid_attr {
int sub_stripes; /* sub_stripes info for map */
int dev_stripes; /* stripes per dev */
--
2.16.1
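[Editorial note] The stripe-size selection in the hunk above (sort devices by hole size, round ndevs down to devs_increment, take the smallest hole among the chosen devices, cap the chunk at max_chunk_size with a 16MiB round-up, then align down to the stripe length) is compact enough to model in isolation. The sketch below is illustrative only: it assumes dev_stripes == 1 and uses stand-in constants and a plain array of hole sizes instead of the real btrfs_device_info structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define STRIPE_LEN (64 * 1024ULL)        /* stand-in for BTRFS_STRIPE_LEN */
#define SZ_16M     (16 * 1024 * 1024ULL)

static uint64_t round_down_u64(uint64_t x, uint64_t align)
{
	return x - (x % align);
}

static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return round_down_u64(x + align - 1, align);
}

/* Descending sort by per-device available hole size, as in the patch. */
static int cmp_avail(const void *a, const void *b)
{
	uint64_t av_a = *(const uint64_t *)a;
	uint64_t av_b = *(const uint64_t *)b;

	if (av_a < av_b)
		return 1;
	if (av_a > av_b)
		return -1;
	return 0;
}

/*
 * Pick the per-device stripe size (dev_stripes == 1 assumed):
 * - sort holes largest-first and drop devices beyond a devs_increment multiple
 * - use the smallest hole among the devices actually used
 * - if the resulting chunk exceeds max_chunk_size, shrink the stripe,
 *   round it up to a 16MiB boundary, but never above the smallest hole
 * - finally align down to the stripe length
 * Returns 0 on an ENOSPC-like failure (too few devices).
 */
uint64_t pick_stripe_size(uint64_t *max_avail, int ndevs,
			  int devs_increment, int devs_min,
			  int data_stripes, uint64_t max_chunk_size)
{
	uint64_t stripe_size;

	qsort(max_avail, ndevs, sizeof(*max_avail), cmp_avail);

	ndevs -= ndevs % devs_increment;
	if (ndevs < devs_min)
		return 0;

	stripe_size = max_avail[ndevs - 1];

	if (stripe_size * data_stripes > max_chunk_size) {
		stripe_size = max_chunk_size / data_stripes;
		stripe_size = round_up_u64(stripe_size, SZ_16M);
		if (stripe_size > max_avail[ndevs - 1])
			stripe_size = max_avail[ndevs - 1];
	}
	return round_down_u64(stripe_size, STRIPE_LEN);
}
```

For example, two 1GiB holes with data_stripes == 2 and a 256MiB chunk-size limit yield a 128MiB stripe: the cap divides 256MiB by the two data stripes, and 128MiB is already 16MiB- and 64KiB-aligned.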
^ permalink raw reply related [flat|nested] 24+ messages in thread

* [PATCH v2 08/10] btrfs-progs: Move chunk stripe size calculation function to volumes.h
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (7 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 07/10] btrfs-progs: volumes: Unify free dev extent search behavior between kernel and btrfs-progs Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 09/10] btrfs-progs: Use btrfs_device->fs_info to replace btrfs_device->dev_root Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior Qu Wenruo
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
check/main.c | 22 ----------------------
volumes.h | 22 ++++++++++++++++++++++
2 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/check/main.c b/check/main.c
index c051a862eb35..96607f6817af 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7638,28 +7638,6 @@ repair_abort:
return err;
}
-u64 calc_stripe_length(u64 type, u64 length, int num_stripes)
-{
- u64 stripe_size;
-
- if (type & BTRFS_BLOCK_GROUP_RAID0) {
- stripe_size = length;
- stripe_size /= num_stripes;
- } else if (type & BTRFS_BLOCK_GROUP_RAID10) {
- stripe_size = length * 2;
- stripe_size /= num_stripes;
- } else if (type & BTRFS_BLOCK_GROUP_RAID5) {
- stripe_size = length;
- stripe_size /= (num_stripes - 1);
- } else if (type & BTRFS_BLOCK_GROUP_RAID6) {
- stripe_size = length;
- stripe_size /= (num_stripes - 2);
- } else {
- stripe_size = length;
- }
- return stripe_size;
-}
-
/*
* Check the chunk with its block group/dev list ref:
* Return 0 if all refs seems valid.
diff --git a/volumes.h b/volumes.h
index 3741d45cae80..950de5a9f910 100644
--- a/volumes.h
+++ b/volumes.h
@@ -216,6 +216,28 @@ static inline int check_crossing_stripes(struct btrfs_fs_info *fs_info,
(bg_offset + len - 1) / BTRFS_STRIPE_LEN);
}
+static inline u64 calc_stripe_length(u64 type, u64 length, int num_stripes)
+{
+ u64 stripe_size;
+
+ if (type & BTRFS_BLOCK_GROUP_RAID0) {
+ stripe_size = length;
+ stripe_size /= num_stripes;
+ } else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+ stripe_size = length * 2;
+ stripe_size /= num_stripes;
+ } else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+ stripe_size = length;
+ stripe_size /= (num_stripes - 1);
+ } else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+ stripe_size = length;
+ stripe_size /= (num_stripes - 2);
+ } else {
+ stripe_size = length;
+ }
+ return stripe_size;
+}
+
int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
u64 logical, u64 *length, u64 *type,
struct btrfs_multi_bio **multi_ret, int mirror_num,
--
2.16.1
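[Editorial note] The calc_stripe_length() helper moved above is pure arithmetic: it maps a chunk's logical length to the size of each per-device extent for the given profile. A standalone restatement follows; the profile bit positions used here are illustrative stand-ins for the BTRFS_BLOCK_GROUP_* flags, not authoritative on-disk constants.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the BTRFS_BLOCK_GROUP_* profile bits. */
#define BG_RAID0  (1ULL << 3)
#define BG_RAID10 (1ULL << 6)
#define BG_RAID5  (1ULL << 7)
#define BG_RAID6  (1ULL << 8)

/* Mirrors the helper moved into volumes.h: chunk length -> per-device extent. */
uint64_t calc_stripe_length(uint64_t type, uint64_t length, int num_stripes)
{
	if (type & BG_RAID0)
		return length / num_stripes;       /* pure striping */
	if (type & BG_RAID10)
		return length * 2 / num_stripes;   /* 2 copies spread over stripes */
	if (type & BG_RAID5)
		return length / (num_stripes - 1); /* 1 stripe's worth of parity */
	if (type & BG_RAID6)
		return length / (num_stripes - 2); /* 2 stripes' worth of parity */
	return length;                             /* SINGLE/DUP/RAID1: 1:1 */
}
```

The RAID10 case doubles the logical length because each byte is stored twice before being divided across the stripes; RAID5/6 subtract the parity stripes from the divisor instead.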
^ permalink raw reply related [flat|nested] 24+ messages in thread

* [PATCH v2 09/10] btrfs-progs: Use btrfs_device->fs_info to replace btrfs_device->dev_root
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (8 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 08/10] btrfs-progs: Move chunk stripe size calculation function to volumes.h Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 7:44 ` [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior Qu Wenruo
10 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
Same as kernel declaration.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
utils.c | 2 +-
volumes.c | 6 +++---
volumes.h | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/utils.c b/utils.c
index e9cb3a82fda6..eff5fb64cfd5 100644
--- a/utils.c
+++ b/utils.c
@@ -216,7 +216,7 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
device->total_bytes = device_total_bytes;
device->bytes_used = 0;
device->total_ios = 0;
- device->dev_root = fs_info->dev_root;
+ device->fs_info = fs_info;
device->name = strdup(path);
if (!device->name) {
ret = -ENOMEM;
diff --git a/volumes.c b/volumes.c
index c2efb3c674dc..cff54c612872 100644
--- a/volumes.c
+++ b/volumes.c
@@ -380,7 +380,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
u64 *start, u64 *len)
{
struct btrfs_key key;
- struct btrfs_root *root = device->dev_root;
+ struct btrfs_root *root = device->fs_info->dev_root;
struct btrfs_dev_extent *dev_extent;
struct btrfs_path *path;
u64 hole_size;
@@ -724,7 +724,7 @@ int btrfs_update_device(struct btrfs_trans_handle *trans,
struct extent_buffer *leaf;
struct btrfs_key key;
- root = device->dev_root->fs_info->chunk_root;
+ root = device->fs_info->chunk_root;
path = btrfs_alloc_path();
if (!path)
@@ -1895,7 +1895,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
}
fill_device_from_item(leaf, dev_item, device);
- device->dev_root = fs_info->dev_root;
+ device->fs_info = fs_info;
return ret;
}
diff --git a/volumes.h b/volumes.h
index 950de5a9f910..84deafc98b0d 100644
--- a/volumes.h
+++ b/volumes.h
@@ -26,7 +26,7 @@
struct btrfs_device {
struct list_head dev_list;
- struct btrfs_root *dev_root;
+ struct btrfs_fs_info *fs_info;
struct btrfs_fs_devices *fs_devices;
u64 total_ios;
--
2.16.1
^ permalink raw reply related [flat|nested] 24+ messages in thread

* [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior
2018-02-09 7:44 [PATCH v2 00/10] Chunk allocator unification Qu Wenruo
` (9 preceding siblings ...)
2018-02-09 7:44 ` [PATCH v2 09/10] btrfs-progs: Use btrfs_device->fs_info to replace btrfs_device->dev_root Qu Wenruo
@ 2018-02-09 7:44 ` Qu Wenruo
2018-02-09 8:17 ` Nikolay Borisov
10 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 7:44 UTC (permalink / raw)
To: linux-btrfs, dsterba
The kernel uses a delayed chunk allocation scheme for metadata chunks:
KERNEL:
btrfs_alloc_chunk()
|- __btrfs_alloc_chunk(): Only allocate chunk mapping
|- btrfs_make_block_group(): Add corresponding bg to fs_info->new_bgs
Then at transaction commit time, it finishes the remaining work:
btrfs_start_dirty_block_groups():
|- btrfs_create_pending_block_groups()
|- btrfs_insert_item(): To insert block group item
|- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
Although the btrfs-progs btrfs_alloc_chunk() does all the work in one
function, it can still benefit from a similar refactoring:
btrfs-progs:
btrfs_alloc_chunk(): Wrapper for both normal and convert chunks
|- __btrfs_alloc_chunk(): Only alloc chunk mapping
| |- btrfs_make_block_group(): <<Insert bg items into extent tree>>
|- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
With such a refactor, the following functions now share most of their code
with the kernel:
__btrfs_alloc_chunk()
btrfs_finish_chunk_alloc()
btrfs_alloc_dev_extent()
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
volumes.c | 421 ++++++++++++++++++++++++++++++++++++++------------------------
1 file changed, 260 insertions(+), 161 deletions(-)
diff --git a/volumes.c b/volumes.c
index cff54c612872..e89520326314 100644
--- a/volumes.c
+++ b/volumes.c
@@ -523,55 +523,40 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
return find_free_dev_extent_start(device, num_bytes, 0, start, len);
}
-static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
- struct btrfs_fs_info *fs_info,
- struct map_lookup *map, u64 stripe_size)
+static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device,
+ u64 chunk_offset, u64 physical,
+ u64 stripe_size)
{
- int ret = 0;
- struct btrfs_path path;
+ int ret;
+ struct btrfs_path *path;
+ struct btrfs_fs_info *fs_info = device->fs_info;
struct btrfs_root *root = fs_info->dev_root;
struct btrfs_dev_extent *extent;
struct extent_buffer *leaf;
struct btrfs_key key;
- int i;
- btrfs_init_path(&path);
-
- for (i = 0; i < map->num_stripes; i++) {
- struct btrfs_device *device = map->stripes[i].dev;
-
- key.objectid = device->devid;
- key.offset = map->stripes[i].physical;
- key.type = BTRFS_DEV_EXTENT_KEY;
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
- ret = btrfs_insert_empty_item(trans, root, &path, &key,
- sizeof(*extent));
- if (ret < 0)
- goto out;
- leaf = path.nodes[0];
- extent = btrfs_item_ptr(leaf, path.slots[0],
- struct btrfs_dev_extent);
- btrfs_set_dev_extent_chunk_tree(leaf, extent,
+ key.objectid = device->devid;
+ key.offset = physical;
+ key.type = BTRFS_DEV_EXTENT_KEY;
+ ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*extent));
+ if (ret)
+ goto out;
+ leaf = path->nodes[0];
+ extent = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
+ btrfs_set_dev_extent_chunk_tree(leaf, extent,
BTRFS_CHUNK_TREE_OBJECTID);
- btrfs_set_dev_extent_chunk_objectid(leaf, extent,
- BTRFS_FIRST_CHUNK_TREE_OBJECTID);
- btrfs_set_dev_extent_chunk_offset(leaf, extent, map->ce.start);
-
- write_extent_buffer(leaf, fs_info->chunk_tree_uuid,
- (unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
- BTRFS_UUID_SIZE);
-
- btrfs_set_dev_extent_length(leaf, extent, stripe_size);
- btrfs_mark_buffer_dirty(leaf);
- btrfs_release_path(&path);
-
- device->bytes_used += stripe_size;
- ret = btrfs_update_device(trans, device);
- if (ret < 0)
- goto out;
- }
+ btrfs_set_dev_extent_chunk_objectid(leaf, extent,
+ BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+ btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
+ btrfs_set_dev_extent_length(leaf, extent, stripe_size);
+ btrfs_mark_buffer_dirty(leaf);
out:
- btrfs_release_path(&path);
+ btrfs_free_path(path);
return ret;
}
@@ -813,28 +798,28 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
/ sizeof(struct btrfs_stripe) + 1)
/*
- * Alloc a chunk, will insert dev extents, chunk item, and insert new
- * block group and update space info (so that extent allocator can use
- * newly allocated chunk).
+ * Alloc a chunk mapping.
+ * Will do the chunk size calculation and the free dev extent search, and
+ * insert the chunk mapping into the chunk mapping tree.
+ *
+ * NOTE: This function doesn't handle any chunk item/dev extent insertion;
+ * that is handled later by btrfs_finish_chunk_alloc().
+ * And for convert chunks (1:1 mapped, more flexible chunk location), it's
+ * handled by __btrfs_alloc_convert_chunk().
+ *
+ * Qu: The block group item is still inserted in this function by
+ * btrfs_make_block_group(); this still differs from the kernel.
*
* @start: return value of allocated chunk start bytenr.
* @num_bytes: return value of allocated chunk size
* @type: chunk type (including both profile and type)
- * @convert: if the chunk is allocated for convert case.
- * If @convert is true, chunk allocator will skip device extent
- * search, but use *start and *num_bytes as chunk start/num_bytes
- * and devive offset, to build a 1:1 chunk mapping for convert.
*/
-int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
- struct btrfs_fs_info *info, u64 *start,
- u64 *num_bytes, u64 type, bool convert)
+static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *info, u64 *start,
+ u64 *num_bytes, u64 type)
{
- struct btrfs_root *extent_root = info->extent_root;
- struct btrfs_root *chunk_root = info->chunk_root;
struct btrfs_device *device = NULL;
- struct btrfs_chunk *chunk;
struct list_head *dev_list = &info->fs_devices->devices;
- struct btrfs_stripe *stripe;
struct map_lookup *map;
struct btrfs_device_info *devices_info = NULL;
u64 percent_max;
@@ -845,8 +830,6 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
int index;
int ndevs = 0;
int rw_devs = 0;
- int stripe_len = BTRFS_STRIPE_LEN;
- struct btrfs_key key;
u64 offset;
int data_stripes; /* number of stripes that counts for bg size */
int sub_stripes; /* sub_stripes info for map */
@@ -874,34 +857,6 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
devs_increment = btrfs_raid_array[index].devs_increment;
ncopies = btrfs_raid_array[index].ncopies;
- if (convert) {
- /* For convert, profile must be SINGLE */
- if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
- error("convert only suports SINGLE profile");
- return -EINVAL;
- }
- if (!IS_ALIGNED(*start, info->sectorsize)) {
- error("chunk start not aligned, start=%llu sectorsize=%u",
- *start, info->sectorsize);
- return -EINVAL;
- }
- if (!IS_ALIGNED(*num_bytes, info->sectorsize)) {
- error("chunk size not aligned, size=%llu sectorsize=%u",
- *num_bytes, info->sectorsize);
- return -EINVAL;
- }
- num_stripes = 1;
- data_stripes = 1;
- offset = *start;
- stripe_size = *num_bytes;
- /*
- * For convert, we use 1:1 chunk mapping specified by @start and
- * @num_bytes, so there is no need to go through dev_extent
- * searching.
- */
- goto alloc_chunk;
- }
-
/*
* Chunk size calculation part.
*/
@@ -1047,55 +1002,23 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
/*
* Fill chunk mapping and chunk stripes
*/
-alloc_chunk:
- if (!convert) {
- ret = find_next_chunk(info, &offset);
- if (ret)
- goto out_devices_info;
- }
- key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
- key.type = BTRFS_CHUNK_ITEM_KEY;
- key.offset = offset;
- *num_bytes = stripe_size * data_stripes;
-
- chunk = kmalloc(btrfs_chunk_item_size(num_stripes), GFP_NOFS);
- if (!chunk)
+ ret = find_next_chunk(info, &offset);
+ if (ret)
goto out_devices_info;
+ *num_bytes = stripe_size * data_stripes;
map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
if (!map)
goto out_chunk_map;
map->num_stripes = num_stripes;
- if (convert) {
- device = list_entry(dev_list->next, struct btrfs_device,
- dev_list);
+ for (i = 0; i < ndevs; i++) {
+ for (j = 0; j < dev_stripes; ++j) {
+ int s = i * dev_stripes + j;
- map->stripes[0].dev = device;
- map->stripes[0].physical = *start;
- stripe = &chunk->stripe;
- btrfs_set_stack_stripe_devid(stripe, device->devid);
- btrfs_set_stack_stripe_offset(stripe, *start);
- memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
- } else {
- for (i = 0; i < ndevs; i++) {
- for (j = 0; j < dev_stripes; ++j) {
- int s = i * dev_stripes + j;
-
- device = devices_info[i].dev;
- map->stripes[s].dev = device;
- map->stripes[s].physical =
- devices_info[i].dev_offset +
- j * stripe_size;
- stripe = &chunk->stripe + s;
- btrfs_set_stack_stripe_devid(stripe,
- device->devid);
- btrfs_set_stack_stripe_offset(stripe,
- devices_info[i].dev_offset +
- j * stripe_size);
- memcpy(stripe->dev_uuid, device->uuid,
- BTRFS_UUID_SIZE);
- }
+ map->stripes[s].dev = devices_info[i].dev;
+ map->stripes[s].physical = devices_info[i].dev_offset +
+ j * stripe_size;
}
}
map->stripe_len = BTRFS_STRIPE_LEN;
@@ -1104,60 +1027,236 @@ alloc_chunk:
map->type = type;
map->sub_stripes = sub_stripes;
map->sector_size = info->sectorsize;
- map->ce.start = key.offset;
+ map->ce.start = offset;
map->ce.size = *num_bytes;
+ kfree(devices_info);
+ /*
+ * Insert chunk mapping and block group
+ */
ret = insert_cache_extent(&info->mapping_tree.cache_tree, &map->ce);
if (ret < 0)
goto out_chunk_map;
- /* key was set above */
- btrfs_set_stack_chunk_length(chunk, *num_bytes);
- btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
- btrfs_set_stack_chunk_stripe_len(chunk, stripe_len);
- btrfs_set_stack_chunk_type(chunk, type);
- btrfs_set_stack_chunk_num_stripes(chunk, num_stripes);
- btrfs_set_stack_chunk_io_align(chunk, stripe_len);
- btrfs_set_stack_chunk_io_width(chunk, stripe_len);
- btrfs_set_stack_chunk_sector_size(chunk, info->sectorsize);
- btrfs_set_stack_chunk_sub_stripes(chunk, sub_stripes);
+ ret = btrfs_make_block_group(trans, info, 0, type, offset,
+ *num_bytes);
+ *start = offset;
+ return ret;
+
+out_chunk_map:
+ kfree(map);
+out_devices_info:
+ kfree(devices_info);
+ return ret;
+}
+
+/*
+ * Alloc a chunk mapping for convert.
+ * Convert needs a special SINGLE chunk whose logical bytenr is the same as
+ * its physical bytenr.
+ * Chunk start and size are specified by @start and @num_bytes.
+ *
+ * And just like __btrfs_alloc_chunk() this only handles chunk mapping and
+ * block group item.
+ */
+static int __btrfs_alloc_convert_chunk(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info, u64 start,
+ u64 num_bytes, u64 type)
+{
+ struct list_head *dev_list = &fs_info->fs_devices->devices;
+ struct map_lookup *map;
+ struct btrfs_device *device;
+ int ret;
+
+ /* For convert, profile must be SINGLE */
+ if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+ error("convert only supports SINGLE profile");
+ return -EINVAL;
+ }
+ if (!IS_ALIGNED(start, fs_info->sectorsize)) {
+ error("chunk start not aligned, start=%llu sectorsize=%u",
+ start, fs_info->sectorsize);
+ return -EINVAL;
+ }
+ if (!IS_ALIGNED(num_bytes, fs_info->sectorsize)) {
+ error("chunk size not aligned, size=%llu sectorsize=%u",
+ num_bytes, fs_info->sectorsize);
+ return -EINVAL;
+ }
+ if (list_empty(dev_list)) {
+ error("no writable device");
+ return -ENODEV;
+ }
+
+ device = list_entry(dev_list->next, struct btrfs_device, dev_list);
+ map = malloc(btrfs_map_lookup_size(1));
+ if (!map)
+ return -ENOMEM;
+ map->num_stripes = 1;
+ map->stripes[0].dev = device;
+ map->stripes[0].physical = start;
+ map->stripe_len = BTRFS_STRIPE_LEN;
+ map->io_align = BTRFS_STRIPE_LEN;
+ map->io_width = BTRFS_STRIPE_LEN;
+ map->type = type;
+ map->sub_stripes = 1;
+ map->sector_size = fs_info->sectorsize;
+ map->ce.start = start;
+ map->ce.size = num_bytes;
+
+ ret = insert_cache_extent(&fs_info->mapping_tree.cache_tree, &map->ce);
+ if (ret < 0)
+ goto error;
+ ret = btrfs_make_block_group(trans, fs_info, 0, type, start, num_bytes);
+ return ret;
+error:
+ free(map);
+ return ret;
+}
+
+/*
+ * Finish the chunk allocation by inserting the needed chunk item and device
+ * extents, and updating the devices' used bytes.
+ */
+static int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info,
+ u64 chunk_start, u64 chunk_size)
+{
+ struct btrfs_root *extent_root = fs_info->extent_root;
+ struct btrfs_root *chunk_root = fs_info->chunk_root;
+ struct btrfs_key key;
+ struct btrfs_device *device;
+ struct btrfs_chunk *chunk;
+ struct btrfs_stripe *stripe;
+ struct cache_extent *ce;
+ struct map_lookup *map;
+ size_t item_size;
+ u64 dev_offset;
+ u64 stripe_size;
+ int i = 0;
+ int ret = 0;
+
+ ce = search_cache_extent(&fs_info->mapping_tree.cache_tree, chunk_start);
+ if (!ce)
+ return -ENOENT;
+
+ map = container_of(ce, struct map_lookup, ce);
+ item_size = btrfs_chunk_item_size(map->num_stripes);
+ stripe_size = calc_stripe_length(map->type, map->ce.size,
+ map->num_stripes);
+
+ chunk = kzalloc(item_size, GFP_NOFS);
+ if (!chunk) {
+ ret = -ENOMEM;
+ goto out;
+ }
/*
- * Insert chunk item and chunk mapping.
+ * Take the device list mutex to prevent races with the final phase of
+ * a device replace operation that replaces the device object associated
+ * with the map's stripes, because the device object's id can change
+ * at any time during that final phase of the device replace operation
+ * (dev-replace.c:btrfs_dev_replace_finishing()).
*/
- ret = btrfs_insert_item(trans, chunk_root, &key, chunk,
- btrfs_chunk_item_size(num_stripes));
- BUG_ON(ret);
- *start = key.offset;
+ /* mutex_lock(&fs_info->fs_devices->device_list_mutex); */
+ for (i = 0; i < map->num_stripes; i++) {
+ device = map->stripes[i].dev;
+ dev_offset = map->stripes[i].physical;
- if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
- ret = btrfs_add_system_chunk(info, &key,
- chunk, btrfs_chunk_item_size(num_stripes));
- if (ret < 0)
- goto out_chunk;
+ device->bytes_used += stripe_size;
+ ret = btrfs_update_device(trans, device);
+ if (ret)
+ break;
+ ret = btrfs_alloc_dev_extent(trans, device, chunk_start,
+ dev_offset, stripe_size);
+ if (ret)
+ break;
+ }
+ if (ret) {
+ /* mutex_unlock(&fs_info->fs_devices->device_list_mutex); */
+ goto out;
}
+ stripe = &chunk->stripe;
+ for (i = 0; i < map->num_stripes; i++) {
+ device = map->stripes[i].dev;
+ dev_offset = map->stripes[i].physical;
+
+ btrfs_set_stack_stripe_devid(stripe, device->devid);
+ btrfs_set_stack_stripe_offset(stripe, dev_offset);
+ memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
+ stripe++;
+ }
+ /* mutex_unlock(&fs_info->fs_devices->device_list_mutex); */
+
+ btrfs_set_stack_chunk_length(chunk, chunk_size);
+ btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
+ btrfs_set_stack_chunk_stripe_len(chunk, map->stripe_len);
+ btrfs_set_stack_chunk_type(chunk, map->type);
+ btrfs_set_stack_chunk_num_stripes(chunk, map->num_stripes);
+ btrfs_set_stack_chunk_io_align(chunk, map->stripe_len);
+ btrfs_set_stack_chunk_io_width(chunk, map->stripe_len);
+ btrfs_set_stack_chunk_sector_size(chunk, fs_info->sectorsize);
+ btrfs_set_stack_chunk_sub_stripes(chunk, map->sub_stripes);
+
+ key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+ key.type = BTRFS_CHUNK_ITEM_KEY;
+ key.offset = chunk_start;
+
+ ret = btrfs_insert_item(trans, chunk_root, &key, chunk, item_size);
+ if (ret == 0 && map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
+ /*
+ * TODO: Cleanup of inserted chunk root in case of
+ * failure.
+ */
+ ret = btrfs_add_system_chunk(fs_info, &key, chunk, item_size);
+ }
+
+out:
kfree(chunk);
+ return ret;
+}
+
+/*
+ * Alloc a chunk.
+ * Will do all the needed work, including searching for free device extents and
+ * inserting the chunk mapping, chunk item, block group item and device extents.
+ *
+ * @start: return value of allocated chunk start bytenr.
+ * @num_bytes: return value of allocated chunk size
+ * @type: chunk type (including both profile and type)
+ * @convert: if the chunk is allocated for convert case.
+ * If @convert is true, chunk allocator will skip device extent
+ * search, but use *start and *num_bytes as chunk start/num_bytes
+ * and device offset, to build a 1:1 chunk mapping for convert.
+ */
+int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info, u64 *start, u64 *num_bytes,
+ u64 type, bool convert)
+{
+ int ret;
/*
- * Insert device extents
+ * Allocate chunk mapping
*/
- ret = btrfs_insert_dev_extents(trans, info, map, stripe_size);
- if (ret < 0)
- goto out_devices_info;
-
- ret = btrfs_make_block_group(trans, info, 0, type, map->ce.start,
- map->ce.size);
- kfree(devices_info);
- return ret;
+ if (convert)
+ ret = __btrfs_alloc_convert_chunk(trans, fs_info, *start,
+ *num_bytes, type);
+ else
+ ret = __btrfs_alloc_chunk(trans, fs_info, start, num_bytes,
+ type);
+ if (ret < 0) {
+ error("failed to allocate chunk mapping: %s", strerror(-ret));
+ return ret;
+ }
-out_chunk_map:
- kfree(map);
-out_chunk:
- kfree(chunk);
-out_devices_info:
- kfree(devices_info);
+ /*
+ * Insert the remaining part (insert the various on-disk items)
+ */
+ ret = btrfs_finish_chunk_alloc(trans, fs_info, *start, *num_bytes);
+ if (ret < 0)
+ error("failed to finish chunk allocation: %s", strerror(-ret));
return ret;
}
--
2.16.1
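[Editorial note] In the refactored __btrfs_alloc_chunk() above, stripe s = i * dev_stripes + j lands on device i at devices_info[i].dev_offset + j * stripe_size. A minimal sketch of that placement loop, using plain arrays in place of the real map_lookup and btrfs_device_info structures:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the physical offset of every logical stripe, following the loop
 * in __btrfs_alloc_chunk(): stripes are laid out device-major, with
 * dev_stripes per device (dev_stripes > 1 only for DUP-style profiles).
 * physical[] must hold ndevs * dev_stripes entries.
 */
void place_stripes(const uint64_t *dev_offset, int ndevs, int dev_stripes,
		   uint64_t stripe_size, uint64_t *physical)
{
	int i, j;

	for (i = 0; i < ndevs; i++)
		for (j = 0; j < dev_stripes; j++)
			physical[i * dev_stripes + j] =
				dev_offset[i] + j * stripe_size;
}
```

With dev_stripes == 2 (the DUP-like case) the two stripes of each device sit back to back, stripe_size apart, which matches the `devices_info[i].dev_offset + j * stripe_size` expression in the patch.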
^ permalink raw reply related [flat|nested] 24+ messages in thread

* Re: [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior
2018-02-09 7:44 ` [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior Qu Wenruo
@ 2018-02-09 8:17 ` Nikolay Borisov
2018-02-09 10:01 ` Qu Wenruo
0 siblings, 1 reply; 24+ messages in thread
From: Nikolay Borisov @ 2018-02-09 8:17 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs, dsterba
On 9.02.2018 09:44, Qu Wenruo wrote:
> Kernel uses a delayed chunk allocation behavior for metadata chunks
>
> KERNEL:
> btrfs_alloc_chunk()
> |- __btrfs_alloc_chunk(): Only allocate chunk mapping
> |- btrfs_make_block_group(): Add corresponding bg to fs_info->new_bgs
>
> Then at transaction commit time, it finishes the remaining work:
> btrfs_start_dirty_block_groups():
> |- btrfs_create_pending_block_groups()
> |- btrfs_insert_item(): To insert block group item
> |- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
>
> Although btrfs-progs btrfs_alloc_chunk() does all the work in one
> function, it can still benefit from function refactor like:
> btrfs-progs:
> btrfs_alloc_chunk(): Wrapper for both normal and convert chunks
> |- __btrfs_alloc_chunk(): Only alloc chunk mapping
> | |- btrfs_make_block_group(): <<Insert bg items into extent tree>>
> |- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
>
> With such refactor, the following functions can share most of its code
> with kernel now:
> __btrfs_alloc_chunk()
> btrfs_finish_chunk_alloc()
> btrfs_alloc_dev_extent()
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> volumes.c | 421 ++++++++++++++++++++++++++++++++++++++------------------------
> 1 file changed, 260 insertions(+), 161 deletions(-)
>
> diff --git a/volumes.c b/volumes.c
> index cff54c612872..e89520326314 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -523,55 +523,40 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
> return find_free_dev_extent_start(device, num_bytes, 0, start, len);
> }
>
> -static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
> - struct btrfs_fs_info *fs_info,
> - struct map_lookup *map, u64 stripe_size)
> +static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_device *device,
> + u64 chunk_offset, u64 physical,
> + u64 stripe_size)
> {
> - int ret = 0;
> - struct btrfs_path path;
> + int ret;
> + struct btrfs_path *path;
> + struct btrfs_fs_info *fs_info = device->fs_info;
> struct btrfs_root *root = fs_info->dev_root;
> struct btrfs_dev_extent *extent;
> struct extent_buffer *leaf;
> struct btrfs_key key;
> - int i;
>
> - btrfs_init_path(&path);
> -
> - for (i = 0; i < map->num_stripes; i++) {
> - struct btrfs_device *device = map->stripes[i].dev;
> -
> - key.objectid = device->devid;
> - key.offset = map->stripes[i].physical;
> - key.type = BTRFS_DEV_EXTENT_KEY;
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
>
> - ret = btrfs_insert_empty_item(trans, root, &path, &key,
> - sizeof(*extent));
> - if (ret < 0)
> - goto out;
> - leaf = path.nodes[0];
> - extent = btrfs_item_ptr(leaf, path.slots[0],
> - struct btrfs_dev_extent);
> - btrfs_set_dev_extent_chunk_tree(leaf, extent,
> + key.objectid = device->devid;
> + key.offset = physical;
> + key.type = BTRFS_DEV_EXTENT_KEY;
> + ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*extent));
> + if (ret)
> + goto out;
> + leaf = path->nodes[0];
> + extent = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
> + btrfs_set_dev_extent_chunk_tree(leaf, extent,
> BTRFS_CHUNK_TREE_OBJECTID);
> - btrfs_set_dev_extent_chunk_objectid(leaf, extent,
> - BTRFS_FIRST_CHUNK_TREE_OBJECTID);
> - btrfs_set_dev_extent_chunk_offset(leaf, extent, map->ce.start);
> -
> - write_extent_buffer(leaf, fs_info->chunk_tree_uuid,
> - (unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
> - BTRFS_UUID_SIZE);
> -
> - btrfs_set_dev_extent_length(leaf, extent, stripe_size);
> - btrfs_mark_buffer_dirty(leaf);
> - btrfs_release_path(&path);
> -
> - device->bytes_used += stripe_size;
> - ret = btrfs_update_device(trans, device);
> - if (ret < 0)
> - goto out;
> - }
> + btrfs_set_dev_extent_chunk_objectid(leaf, extent,
> + BTRFS_FIRST_CHUNK_TREE_OBJECTID);
> + btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
> + btrfs_set_dev_extent_length(leaf, extent, stripe_size);
> + btrfs_mark_buffer_dirty(leaf);
> out:
> - btrfs_release_path(&path);
> + btrfs_free_path(path);
> return ret;
> }
>
> @@ -813,28 +798,28 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
> / sizeof(struct btrfs_stripe) + 1)
>
> /*
> - * Alloc a chunk, will insert dev extents, chunk item, and insert new
> - * block group and update space info (so that extent allocator can use
> - * newly allocated chunk).
> + * Alloc a chunk mapping.
> + * Will do chunk size calculation and free dev extent search, and insert
> + * chunk mapping into chunk mapping tree.
> + *
> + * NOTE: This function doesn't handle any chunk item/dev extent insert.
> + * chunk item/dev extent insert is handled by later btrfs_finish_chunk_alloc().
> + * And for convert chunk (1:1 mapped, more flex chunk location), it's
> + * handled by __btrfs_alloc_convert_chunk().
> + *
> + * Qu: Block group item is still inserted in this function by
> + * btrfs_make_block_group(), this still differs from kernel.
> *
If it still differs, does that mean this function should undergo further
refactoring to unify it with the kernel, or that it will never be the
same as the kernel's code? I thought the idea of unifying the
userspace/kernel code was to have only 1 set of functions which work
both in kernel and userspace?
> * @start: return value of allocated chunk start bytenr.
> * @num_bytes: return value of allocated chunk size
> * @type: chunk type (including both profile and type)
> - * @convert: if the chunk is allocated for convert case.
> - * If @convert is true, chunk allocator will skip device extent
> - * search, but use *start and *num_bytes as chunk start/num_bytes
> - * and devive offset, to build a 1:1 chunk mapping for convert.
> */
> -int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> - struct btrfs_fs_info *info, u64 *start,
> - u64 *num_bytes, u64 type, bool convert)
> +static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> + struct btrfs_fs_info *info, u64 *start,
> + u64 *num_bytes, u64 type)
> {
> - struct btrfs_root *extent_root = info->extent_root;
> - struct btrfs_root *chunk_root = info->chunk_root;
> struct btrfs_device *device = NULL;
> - struct btrfs_chunk *chunk;
> struct list_head *dev_list = &info->fs_devices->devices;
> - struct btrfs_stripe *stripe;
> struct map_lookup *map;
> struct btrfs_device_info *devices_info = NULL;
> u64 percent_max;
> @@ -845,8 +830,6 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> int index;
> int ndevs = 0;
> int rw_devs = 0;
> - int stripe_len = BTRFS_STRIPE_LEN;
> - struct btrfs_key key;
> u64 offset;
> int data_stripes; /* number of stripes that counts for bg size */
> int sub_stripes; /* sub_stripes info for map */
> @@ -874,34 +857,6 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> devs_increment = btrfs_raid_array[index].devs_increment;
> ncopies = btrfs_raid_array[index].ncopies;
>
> - if (convert) {
> - /* For convert, profile must be SINGLE */
> - if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> - error("convert only suports SINGLE profile");
> - return -EINVAL;
> - }
> - if (!IS_ALIGNED(*start, info->sectorsize)) {
> - error("chunk start not aligned, start=%llu sectorsize=%u",
> - *start, info->sectorsize);
> - return -EINVAL;
> - }
> - if (!IS_ALIGNED(*num_bytes, info->sectorsize)) {
> - error("chunk size not aligned, size=%llu sectorsize=%u",
> - *num_bytes, info->sectorsize);
> - return -EINVAL;
> - }
> - num_stripes = 1;
> - data_stripes = 1;
> - offset = *start;
> - stripe_size = *num_bytes;
> - /*
> - * For convert, we use 1:1 chunk mapping specified by @start and
> - * @num_bytes, so there is no need to go through dev_extent
> - * searching.
> - */
> - goto alloc_chunk;
> - }
> -
> /*
> * Chunk size calculation part.
> */
> @@ -1047,55 +1002,23 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> /*
> * Fill chunk mapping and chunk stripes
> */
> -alloc_chunk:
> - if (!convert) {
> - ret = find_next_chunk(info, &offset);
> - if (ret)
> - goto out_devices_info;
> - }
> - key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> - key.type = BTRFS_CHUNK_ITEM_KEY;
> - key.offset = offset;
> - *num_bytes = stripe_size * data_stripes;
> -
> - chunk = kmalloc(btrfs_chunk_item_size(num_stripes), GFP_NOFS);
> - if (!chunk)
> + ret = find_next_chunk(info, &offset);
> + if (ret)
> goto out_devices_info;
> + *num_bytes = stripe_size * data_stripes;
>
> map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
> if (!map)
> goto out_chunk_map;
> map->num_stripes = num_stripes;
>
> - if (convert) {
> - device = list_entry(dev_list->next, struct btrfs_device,
> - dev_list);
> + for (i = 0; i < ndevs; i++) {
> + for (j = 0; j < dev_stripes; ++j) {
> + int s = i * dev_stripes + j;
>
> - map->stripes[0].dev = device;
> - map->stripes[0].physical = *start;
> - stripe = &chunk->stripe;
> - btrfs_set_stack_stripe_devid(stripe, device->devid);
> - btrfs_set_stack_stripe_offset(stripe, *start);
> - memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
> - } else {
> - for (i = 0; i < ndevs; i++) {
> - for (j = 0; j < dev_stripes; ++j) {
> - int s = i * dev_stripes + j;
> -
> - device = devices_info[i].dev;
> - map->stripes[s].dev = device;
> - map->stripes[s].physical =
> - devices_info[i].dev_offset +
> - j * stripe_size;
> - stripe = &chunk->stripe + s;
> - btrfs_set_stack_stripe_devid(stripe,
> - device->devid);
> - btrfs_set_stack_stripe_offset(stripe,
> - devices_info[i].dev_offset +
> - j * stripe_size);
> - memcpy(stripe->dev_uuid, device->uuid,
> - BTRFS_UUID_SIZE);
> - }
> + map->stripes[s].dev = devices_info[i].dev;
> + map->stripes[s].physical = devices_info[i].dev_offset +
> + j * stripe_size;
> }
> }
> map->stripe_len = BTRFS_STRIPE_LEN;
> @@ -1104,60 +1027,236 @@ alloc_chunk:
> map->type = type;
> map->sub_stripes = sub_stripes;
> map->sector_size = info->sectorsize;
> - map->ce.start = key.offset;
> + map->ce.start = offset;
> map->ce.size = *num_bytes;
>
> + kfree(devices_info);
>
> + /*
> + * Insert chunk mapping and block group
> + */
> ret = insert_cache_extent(&info->mapping_tree.cache_tree, &map->ce);
> if (ret < 0)
> goto out_chunk_map;
>
> - /* key was set above */
> - btrfs_set_stack_chunk_length(chunk, *num_bytes);
> - btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
> - btrfs_set_stack_chunk_stripe_len(chunk, stripe_len);
> - btrfs_set_stack_chunk_type(chunk, type);
> - btrfs_set_stack_chunk_num_stripes(chunk, num_stripes);
> - btrfs_set_stack_chunk_io_align(chunk, stripe_len);
> - btrfs_set_stack_chunk_io_width(chunk, stripe_len);
> - btrfs_set_stack_chunk_sector_size(chunk, info->sectorsize);
> - btrfs_set_stack_chunk_sub_stripes(chunk, sub_stripes);
> + ret = btrfs_make_block_group(trans, info, 0, type, offset,
> + *num_bytes);
> + *start = offset;
> + return ret;
> +
> +out_chunk_map:
> + kfree(map);
> +out_devices_info:
> + kfree(devices_info);
> + return ret;
> +}
> +
> +/*
> + * Alloc a chunk mapping for convert.
> + * Convert needs special SINGLE chunk whose logical bytenr is the same as its
> + * physical bytenr.
> + * Chunk size and start bytenr are all specified by @start and @num_bytes
> + *
> + * And just like __btrfs_alloc_chunk() this only handles chunk mapping and
> + * block group item.
> + */
> +static int __btrfs_alloc_convert_chunk(struct btrfs_trans_handle *trans,
> + struct btrfs_fs_info *fs_info, u64 start,
> + u64 num_bytes, u64 type)
> +{
> + struct list_head *dev_list = &fs_info->fs_devices->devices;
> + struct map_lookup *map;
> + struct btrfs_device *device;
> + int ret;
> +
> + /* For convert, profile must be SINGLE */
> + if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> + error("convert only supports SINGLE profile");
> + return -EINVAL;
> + }
> + if (!IS_ALIGNED(start, fs_info->sectorsize)) {
> + error("chunk start not aligned, start=%llu sectorsize=%u",
> + start, fs_info->sectorsize);
> + return -EINVAL;
> + }
> + if (!IS_ALIGNED(num_bytes, fs_info->sectorsize)) {
> + error("chunk size not aligned, size=%llu sectorsize=%u",
> + num_bytes, fs_info->sectorsize);
> + return -EINVAL;
> + }
> + if (list_empty(dev_list)) {
> + error("no writable device");
> + return -ENODEV;
> + }
> +
> + device = list_entry(dev_list->next, struct btrfs_device, dev_list);
> + map = malloc(btrfs_map_lookup_size(1));
> + if (!map)
> + return -ENOMEM;
> + map->num_stripes = 1;
> + map->stripes[0].dev = device;
> + map->stripes[0].physical = start;
> + map->stripe_len = BTRFS_STRIPE_LEN;
> + map->io_align = BTRFS_STRIPE_LEN;
> + map->io_width = BTRFS_STRIPE_LEN;
> + map->type = type;
> + map->sub_stripes = 1;
> + map->sector_size = fs_info->sectorsize;
> + map->ce.start = start;
> + map->ce.size = num_bytes;
> +
> + ret = insert_cache_extent(&fs_info->mapping_tree.cache_tree, &map->ce);
> + if (ret < 0)
> + goto error;
> + ret = btrfs_make_block_group(trans, fs_info, 0, type, start, num_bytes);
> + return ret;
> +error:
> + free(map);
> + return ret;
> +}
> +
> +/*
> + * Finish the chunk allocation by inserting needed chunk item and device
> + * extents, and update device used bytes
> + */
> +static int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
> + struct btrfs_fs_info *fs_info,
> + u64 chunk_start, u64 chunk_size)
> +{
> + struct btrfs_root *extent_root = fs_info->extent_root;
> + struct btrfs_root *chunk_root = fs_info->chunk_root;
> + struct btrfs_key key;
> + struct btrfs_device *device;
> + struct btrfs_chunk *chunk;
> + struct btrfs_stripe *stripe;
> + struct cache_extent *ce;
> + struct map_lookup *map;
> + size_t item_size;
> + u64 dev_offset;
> + u64 stripe_size;
> + int i = 0;
> + int ret = 0;
> +
> + ce = search_cache_extent(&fs_info->mapping_tree.cache_tree, chunk_start);
> + if (!ce)
> + return -ENOENT;
> +
> + map = container_of(ce, struct map_lookup, ce);
> + item_size = btrfs_chunk_item_size(map->num_stripes);
> + stripe_size = calc_stripe_length(map->type, map->ce.size,
> + map->num_stripes);
> +
> + chunk = kzalloc(item_size, GFP_NOFS);
> + if (!chunk) {
> + ret = -ENOMEM;
> + goto out;
> + }
>
> /*
> - * Insert chunk item and chunk mapping.
> + * Take the device list mutex to prevent races with the final phase of
> + * a device replace operation that replaces the device object associated
> + * with the map's stripes, because the device object's id can change
> + * at any time during that final phase of the device replace operation
> + * (dev-replace.c:btrfs_dev_replace_finishing()).
> */
> - ret = btrfs_insert_item(trans, chunk_root, &key, chunk,
> - btrfs_chunk_item_size(num_stripes));
> - BUG_ON(ret);
> - *start = key.offset;
> + /* mutex_lock(&fs_info->fs_devices->device_list_mutex); */
> + for (i = 0; i < map->num_stripes; i++) {
> + device = map->stripes[i].dev;
> + dev_offset = map->stripes[i].physical;
>
> - if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
> - ret = btrfs_add_system_chunk(info, &key,
> - chunk, btrfs_chunk_item_size(num_stripes));
> - if (ret < 0)
> - goto out_chunk;
> + device->bytes_used += stripe_size;
> + ret = btrfs_update_device(trans, device);
> + if (ret)
> + break;
> + ret = btrfs_alloc_dev_extent(trans, device, chunk_start,
> + dev_offset, stripe_size);
> + if (ret)
> + break;
> + }
> + if (ret) {
> + /* mutex_unlock(&fs_info->fs_devices->device_list_mutex); */
> + goto out;
> }
>
> + stripe = &chunk->stripe;
> + for (i = 0; i < map->num_stripes; i++) {
> + device = map->stripes[i].dev;
> + dev_offset = map->stripes[i].physical;
> +
> + btrfs_set_stack_stripe_devid(stripe, device->devid);
> + btrfs_set_stack_stripe_offset(stripe, dev_offset);
> + memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
> + stripe++;
> + }
> + /* mutex_unlock(&fs_info->fs_devices->device_list_mutex); */
> +
> + btrfs_set_stack_chunk_length(chunk, chunk_size);
> + btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
> + btrfs_set_stack_chunk_stripe_len(chunk, map->stripe_len);
> + btrfs_set_stack_chunk_type(chunk, map->type);
> + btrfs_set_stack_chunk_num_stripes(chunk, map->num_stripes);
> + btrfs_set_stack_chunk_io_align(chunk, map->stripe_len);
> + btrfs_set_stack_chunk_io_width(chunk, map->stripe_len);
> + btrfs_set_stack_chunk_sector_size(chunk, fs_info->sectorsize);
> + btrfs_set_stack_chunk_sub_stripes(chunk, map->sub_stripes);
> +
> + key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> + key.type = BTRFS_CHUNK_ITEM_KEY;
> + key.offset = chunk_start;
> +
> + ret = btrfs_insert_item(trans, chunk_root, &key, chunk, item_size);
> + if (ret == 0 && map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
> + /*
> + * TODO: Cleanup of inserted chunk root in case of
> + * failure.
> + */
> + ret = btrfs_add_system_chunk(fs_info, &key, chunk, item_size);
> + }
> +
> +out:
> kfree(chunk);
> + return ret;
> +}
> +
> +/*
> + * Alloc a chunk.
> + * Will do all the needed work including searching free device extent, insert
> + * chunk mapping, chunk item, block group item and device extents.
> + *
> + * @start: return value of allocated chunk start bytenr.
> + * @num_bytes: return value of allocated chunk size
> + * @type: chunk type (including both profile and type)
> + * @convert: if the chunk is allocated for convert case.
> + * If @convert is true, chunk allocator will skip device extent
> + * search, but use *start and *num_bytes as chunk start/num_bytes
> + * and device offset, to build a 1:1 chunk mapping for convert.
> + */
> +int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> + struct btrfs_fs_info *fs_info, u64 *start, u64 *num_bytes,
> + u64 type, bool convert)
> +{
> + int ret;
>
> /*
> - * Insert device extents
> + * Allocate chunk mapping
> */
> - ret = btrfs_insert_dev_extents(trans, info, map, stripe_size);
> - if (ret < 0)
> - goto out_devices_info;
> -
> - ret = btrfs_make_block_group(trans, info, 0, type, map->ce.start,
> - map->ce.size);
> - kfree(devices_info);
> - return ret;
> + if (convert)
> + ret = __btrfs_alloc_convert_chunk(trans, fs_info, *start,
> + *num_bytes, type);
> + else
> + ret = __btrfs_alloc_chunk(trans, fs_info, start, num_bytes,
> + type);
> + if (ret < 0) {
> + error("failed to allocate chunk mapping: %s", strerror(-ret));
> + return ret;
> + }
>
> -out_chunk_map:
> - kfree(map);
> -out_chunk:
> - kfree(chunk);
> -out_devices_info:
> - kfree(devices_info);
> + /*
> + * Insert the remaining part (insert various items)
> + */
> + ret = btrfs_finish_chunk_alloc(trans, fs_info, *start, *num_bytes);
> + if (ret < 0)
> + error("failed to finish chunk allocation: %s", strerror(-ret));
> return ret;
> }
>
>
^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior
2018-02-09 8:17 ` Nikolay Borisov
@ 2018-02-09 10:01 ` Qu Wenruo
0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2018-02-09 10:01 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs, dsterba
On 2018-02-09 16:17, Nikolay Borisov wrote:
>
>
> On 9.02.2018 09:44, Qu Wenruo wrote:
>> Kernel uses a delayed chunk allocation behavior for metadata chunks
>>
>> KERNEL:
>> btrfs_alloc_chunk()
>> |- __btrfs_alloc_chunk(): Only allocate chunk mapping
>> |- btrfs_make_block_group(): Add corresponding bg to fs_info->new_bgs
>>
>> Then at transaction commit time, it finishes the remaining work:
>> btrfs_start_dirty_block_groups():
>> |- btrfs_create_pending_block_groups()
>> |- btrfs_insert_item(): To insert block group item
>> |- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
>>
>> Although btrfs-progs btrfs_alloc_chunk() does all the work in one
>> function, it can still benefit from function refactor like:
>> btrfs-progs:
>> btrfs_alloc_chunk(): Wrapper for both normal and convert chunks
>> |- __btrfs_alloc_chunk(): Only alloc chunk mapping
>> | |- btrfs_make_block_group(): <<Insert bg items into extent tree>>
>> |- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
>>
>> With such refactor, the following functions can share most of its code
>> with kernel now:
>> __btrfs_alloc_chunk()
>> btrfs_finish_chunk_alloc()
>> btrfs_alloc_dev_extent()
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> volumes.c | 421 ++++++++++++++++++++++++++++++++++++++------------------------
>> 1 file changed, 260 insertions(+), 161 deletions(-)
>>
>> @@ -813,28 +798,28 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
>> / sizeof(struct btrfs_stripe) + 1)
>>
>> /*
>> - * Alloc a chunk, will insert dev extents, chunk item, and insert new
>> - * block group and update space info (so that extent allocator can use
>> - * newly allocated chunk).
>> + * Alloc a chunk mapping.
>> + * Will do chunk size calculation and free dev extent search, and insert
>> + * chunk mapping into chunk mapping tree.
>> + *
>> + * NOTE: This function doesn't handle any chunk item/dev extent insert.
>> + * chunk item/dev extent insert is handled by later btrfs_finish_chunk_alloc().
>> + * And for convert chunk (1:1 mapped, more flex chunk location), it's
>> + * handled by __btrfs_alloc_convert_chunk().
>> + *
>> + * Qu: Block group item is still inserted in this function by
>> + * btrfs_make_block_group(), this still differs from kernel.
>> *
>
> If it still differs, does that mean this function should undergo further
> refactoring to unify it with the kernel, or that it will never be the
> same as the kernel's code? I thought the idea of unifying the
> userspace/kernel code was to have only 1 set of functions which work
> both in kernel and userspace?
One (temporary) solution here is to make btrfs_make_block_group()
virtual, so btrfs-progs and the kernel each have their own version of
btrfs_make_block_group().
The unification is progressive work; leaving btrfs_make_block_group()
different is acceptable at this point, since the objective of this
patchset is just to unify the chunk allocator.
BTW, the real unification needs both the kernel and progs sides to be
modified, so this patchset is still "preparation".
Thanks,
Qu
^ permalink raw reply [flat|nested] 24+ messages in thread