From: Miao Xie <miaox@cn.fujitsu.com>
To: Alex Lyakas <alex.btrfs@zadarastorage.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
Date: Tue, 04 Mar 2014 14:04:30 +0800 [thread overview]
Message-ID: <53156CEE.5060308@cn.fujitsu.com> (raw)
In-Reply-To: <CAOcd+r0UPMbxMzfa=Mpj96zYo1pO2C6NB3HrR6bipYbFmLVFdw@mail.gmail.com>
On sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote:
> Hi Miao,
>
> On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> When we mounted the filesystem after the crash, we got the following
>> message:
>> BTRFS error (device xxx): block group 4315938816 has wrong amount of free space
>> BTRFS error (device xxx): failed to load free space cache for block group 4315938816
>>
>> It is because we didn't update the metadata of the allocated space until
>> the file data was written into the disk. During this time, there was no
>> information about the allocated spaces in either the extent tree nor the
>> free space cache. when we wrote out the free space cache at this time, those
>> spaces were lost.
> Can you please clarify more about the problem?
> So I understand that we allocate something from the free space cache.
> So after that, the free space cache does not account for this extent
> anymore. On the other hand the extent tree also does not account for
> it (yet). We need to add a delayed reference, which will be run at
> transaction commit and update the extent tree. But free-space cache is
> also written out during transaction commit. So how the issue happens?
> Can you perhaps post a flow of events?
Task1 Task2
btrfs_writepages()
alloc data space
(The allocated space was
removed from the free
space cache)
submit_bio()
btrfs_commit_transaction()
write out space cache
...
commit transaction completed
system crash
(end_io() wasn't executed)
The system crashed before end_io was executed, so no file references to the
allocated space, and the extent tree also does not account for it. That space
was lost.
Thanks
Miao
>
> Thanks.
> Alex.
>
>
>>
>> In ordered to fix this problem, I use a state tree for every block group
>> to record those allocated spaces. We record the information when they are
>> allocated, and clean up the information after the metadata update. Besides
>> that, we also introduce a read-write semaphore to avoid the race between
>> the allocation and the free space cache write out.
>>
>> Only data block groups had this problem, so the above change is just
>> for data space allocation.
>>
>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>> ---
>> fs/btrfs/ctree.h | 15 ++++++++++++++-
>> fs/btrfs/disk-io.c | 2 +-
>> fs/btrfs/extent-tree.c | 24 ++++++++++++++++++++----
>> fs/btrfs/free-space-cache.c | 42 ++++++++++++++++++++++++++++++++++++++----
>> fs/btrfs/inode.c | 42 +++++++++++++++++++++++++++++++++++-------
>> 5 files changed, 108 insertions(+), 17 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 1667c9a..f58e1f7 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
>> /* free space cache stuff */
>> struct btrfs_free_space_ctl *free_space_ctl;
>>
>> + /*
>> + * It is used to record the extents that are allocated for
>> + * the data, but don/t update its metadata.
>> + */
>> + struct extent_io_tree pinned_extents;
>> +
>> /* block group cache stuff */
>> struct rb_node cache_node;
>>
>> @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
>> */
>> struct list_head space_info;
>>
>> + /*
>> + * It is just used for the delayed data space allocation
>> + * because only the data space allocation can be done during
>> + * we write out the free space cache.
>> + */
>> + struct rw_semaphore data_rwsem;
>> +
>> struct btrfs_space_info *data_sinfo;
>>
>> struct reloc_control *reloc_ctl;
>> @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
>> struct btrfs_key *ins);
>> int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
>> u64 min_alloc_size, u64 empty_size, u64 hint_byte,
>> - struct btrfs_key *ins, int is_data);
>> + struct btrfs_key *ins, int is_data, bool need_pin);
>> int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>> struct extent_buffer *buf, int full_backref, int for_cow);
>> int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 8072cfa..426b558 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
>> fs_info->pinned_extents = &fs_info->freed_extents[0];
>> fs_info->do_barriers = 1;
>>
>> -
>> mutex_init(&fs_info->ordered_operations_mutex);
>> mutex_init(&fs_info->ordered_extent_flush_mutex);
>> mutex_init(&fs_info->tree_log_mutex);
>> @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb,
>> init_rwsem(&fs_info->extent_commit_sem);
>> init_rwsem(&fs_info->cleanup_work_sem);
>> init_rwsem(&fs_info->subvol_sem);
>> + init_rwsem(&fs_info->data_rwsem);
>> sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>> fs_info->dev_replace.lock_owner = 0;
>> atomic_set(&fs_info->dev_replace.nesting_level, 0);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 3664cfb..7b07876 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -6173,7 +6173,7 @@ enum btrfs_loop_type {
>> static noinline int find_free_extent(struct btrfs_root *orig_root,
>> u64 num_bytes, u64 empty_size,
>> u64 hint_byte, struct btrfs_key *ins,
>> - u64 flags)
>> + u64 flags, bool need_pin)
>> {
>> int ret = 0;
>> struct btrfs_root *root = orig_root->fs_info->extent_root;
>> @@ -6502,6 +6502,16 @@ checks:
>> ins->objectid = search_start;
>> ins->offset = num_bytes;
>>
>> + if (need_pin) {
>> + ASSERT(search_start >= block_group->key.objectid &&
>> + search_start < block_group->key.objectid +
>> + block_group->key.offset);
>> + set_extent_dirty(&block_group->pinned_extents,
>> + search_start,
>> + search_start + num_bytes - 1,
>> + GFP_NOFS);
>> + }
>> +
>> trace_btrfs_reserve_extent(orig_root, block_group,
>> search_start, num_bytes);
>> btrfs_put_block_group(block_group);
>> @@ -6614,17 +6624,20 @@ again:
>> int btrfs_reserve_extent(struct btrfs_root *root,
>> u64 num_bytes, u64 min_alloc_size,
>> u64 empty_size, u64 hint_byte,
>> - struct btrfs_key *ins, int is_data)
>> + struct btrfs_key *ins, int is_data, bool need_pin)
>> {
>> bool final_tried = false;
>> u64 flags;
>> int ret;
>>
>> flags = btrfs_get_alloc_profile(root, is_data);
>> +
>> + if (need_pin)
>> + down_read(&root->fs_info->data_rwsem);
>> again:
>> WARN_ON(num_bytes < root->sectorsize);
>> ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins,
>> - flags);
>> + flags, need_pin);
>>
>> if (ret == -ENOSPC) {
>> if (!final_tried && ins->offset) {
>> @@ -6645,6 +6658,8 @@ again:
>> }
>> }
>>
>> + if (need_pin)
>> + up_read(&root->fs_info->data_rwsem);
>> return ret;
>> }
>>
>> @@ -7016,7 +7031,7 @@ struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
>> return ERR_CAST(block_rsv);
>>
>> ret = btrfs_reserve_extent(root, blocksize, blocksize,
>> - empty_size, hint, &ins, 0);
>> + empty_size, hint, &ins, 0, false);
>> if (ret) {
>> unuse_block_rsv(root->fs_info, block_rsv, blocksize);
>> return ERR_PTR(ret);
>> @@ -8387,6 +8402,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
>> INIT_LIST_HEAD(&cache->cluster_list);
>> INIT_LIST_HEAD(&cache->new_bg_list);
>> btrfs_init_free_space_ctl(cache);
>> + extent_io_tree_init(&cache->pinned_extents, NULL);
>>
>> return cache;
>> }
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index 057be95..486e12a3 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -875,6 +875,7 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
>> struct rb_node *node;
>> struct list_head *pos, *n;
>> struct extent_state *cached_state = NULL;
>> + struct extent_state *pinned_extent = NULL;
>> struct btrfs_free_cluster *cluster = NULL;
>> struct extent_io_tree *unpin = NULL;
>> struct io_ctl io_ctl;
>> @@ -948,17 +949,17 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
>> * so we don't leak the space
>> */
>>
>> + if (!block_group)
>> + goto bitmap;
>> /*
>> * We shouldn't have switched the pinned extents yet so this is the
>> * right one
>> */
>> unpin = root->fs_info->pinned_extents;
>>
>> - if (block_group)
>> - start = block_group->key.objectid;
>> + start = block_group->key.objectid;
>>
>> - while (block_group && (start < block_group->key.objectid +
>> - block_group->key.offset)) {
>> + while (start < block_group->key.objectid + block_group->key.offset) {
>> ret = find_first_extent_bit(unpin, start,
>> &extent_start, &extent_end,
>> EXTENT_DIRTY, NULL);
>> @@ -985,6 +986,33 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
>> start = extent_end;
>> }
>>
>> + if (!(block_group->flags & BTRFS_BLOCK_GROUP_DATA))
>> + goto bitmap;
>> +
>> + start = block_group->key.objectid;
>> + unpin = &block_group->pinned_extents;
>> + while (1) {
>> + ret = find_first_extent_bit(unpin, start,
>> + &extent_start, &extent_end,
>> + EXTENT_DIRTY, &pinned_extent);
>> + if (ret) {
>> + ret = 0;
>> + break;
>> + }
>> +
>> + len = extent_end - extent_start + 1;
>> +
>> + entries++;
>> + ret = io_ctl_add_entry(&io_ctl, extent_start, len, NULL);
>> + if (ret) {
>> + free_extent_state(pinned_extent);
>> + goto out_nospc;
>> + }
>> +
>> + start = extent_end + 1;
>> + }
>> + free_extent_state(pinned_extent);
>> +bitmap:
>> /* Write out the bitmaps */
>> list_for_each_safe(pos, n, &bitmap_list) {
>> struct btrfs_free_space *entry =
>> @@ -1097,6 +1125,9 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>> if (IS_ERR(inode))
>> return 0;
>>
>> + if (block_group->flags & BTRFS_BLOCK_GROUP_DATA)
>> + down_write(&root->fs_info->data_rwsem);
>> +
>> ret = __btrfs_write_out_cache(root, inode, ctl, block_group, trans,
>> path, block_group->key.objectid);
>> if (ret) {
>> @@ -1111,6 +1142,9 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>> #endif
>> }
>>
>> + if (block_group->flags & BTRFS_BLOCK_GROUP_DATA)
>> + up_write(&root->fs_info->data_rwsem);
>> +
>> iput(inode);
>> return ret;
>> }
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index f1a7744..8172ca6 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -592,6 +592,28 @@ free_pages_out:
>> goto out;
>> }
>>
>> +static void btrfs_unpin_data_extent(struct btrfs_root *root, u64 start,
>> + u64 len)
>> +{
>> + struct btrfs_block_group_cache *cache;
>> +
>> + cache = btrfs_lookup_block_group(root->fs_info, start);
>> + BUG_ON(!cache);
>> + clear_extent_dirty(&cache->pinned_extents, start, start + len - 1,
>> + GFP_NOFS);
>> + btrfs_put_block_group(cache);
>> +}
>> +
>> +/* it is not used for free space cache file */
>> +static void btrfs_free_reserved_data_extent(struct btrfs_root *root, u64 start,
>> + u64 len)
>> +{
>> + down_read(&root->fs_info->data_rwsem);
>> + btrfs_unpin_data_extent(root, start, len);
>> + btrfs_free_reserved_extent(root, start, len);
>> + up_read(&root->fs_info->data_rwsem);
>> +}
>> +
>> /*
>> * phase two of compressed writeback. This is the ordered portion
>> * of the code, which only gets called in the order the work was
>> @@ -666,7 +688,7 @@ retry:
>> ret = btrfs_reserve_extent(root,
>> async_extent->compressed_size,
>> async_extent->compressed_size,
>> - 0, alloc_hint, &ins, 1);
>> + 0, alloc_hint, &ins, 1, true);
>> if (ret) {
>> int i;
>>
>> @@ -767,7 +789,7 @@ retry:
>> out:
>> return ret;
>> out_free_reserve:
>> - btrfs_free_reserved_extent(root, ins.objectid, ins.offset);
>> + btrfs_free_reserved_data_extent(root, ins.objectid, ins.offset);
>> out_free:
>> extent_clear_unlock_delalloc(inode, async_extent->start,
>> async_extent->start +
>> @@ -889,7 +911,7 @@ static noinline int cow_file_range(struct inode *inode,
>> cur_alloc_size = disk_num_bytes;
>> ret = btrfs_reserve_extent(root, cur_alloc_size,
>> root->sectorsize, 0, alloc_hint,
>> - &ins, 1);
>> + &ins, 1, true);
>> if (ret < 0)
>> goto out_unlock;
>>
>> @@ -967,6 +989,7 @@ out:
>> return ret;
>>
>> out_reserve:
>> + btrfs_unpin_data_extent(root, ins.objectid, ins.offset);
>> btrfs_free_reserved_extent(root, ins.objectid, ins.offset);
>> out_unlock:
>> extent_clear_unlock_delalloc(inode, start, end, locked_page,
>> @@ -2647,6 +2670,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>> logical_len, logical_len,
>> compress_type, 0, 0,
>> BTRFS_FILE_EXTENT_REG);
>> + BUG_ON(nolock);
>> + btrfs_unpin_data_extent(root, ordered_extent->start,
>> + ordered_extent->disk_len);
>> }
>> unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
>> ordered_extent->file_offset, ordered_extent->len,
>> @@ -2698,8 +2724,9 @@ out:
>> if ((ret || !logical_len) &&
>> !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
>> !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
>> - btrfs_free_reserved_extent(root, ordered_extent->start,
>> - ordered_extent->disk_len);
>> + btrfs_free_reserved_data_extent(root,
>> + ordered_extent->start,
>> + ordered_extent->disk_len);
>> }
>>
>>
>> @@ -6342,7 +6369,7 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode,
>>
>> alloc_hint = get_extent_allocation_hint(inode, start, len);
>> ret = btrfs_reserve_extent(root, len, root->sectorsize, 0,
>> - alloc_hint, &ins, 1);
>> + alloc_hint, &ins, 1, true);
>> if (ret)
>> return ERR_PTR(ret);
>>
>> @@ -6356,6 +6383,7 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode,
>> ret = btrfs_add_ordered_extent_dio(inode, start, ins.objectid,
>> ins.offset, ins.offset, 0);
>> if (ret) {
>> + btrfs_unpin_data_extent(root, ins.objectid, ins.offset);
>> btrfs_free_reserved_extent(root, ins.objectid, ins.offset);
>> free_extent_map(em);
>> return ERR_PTR(ret);
>> @@ -8507,7 +8535,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>> cur_bytes = min(num_bytes, 256ULL * 1024 * 1024);
>> cur_bytes = max(cur_bytes, min_size);
>> ret = btrfs_reserve_extent(root, cur_bytes, min_size, 0,
>> - *alloc_hint, &ins, 1);
>> + *alloc_hint, &ins, 1, false);
>> if (ret) {
>> if (own_trans)
>> btrfs_end_transaction(trans, root);
>> --
>> 1.8.3.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
next prev parent reply other threads:[~2014-03-04 6:02 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-01-15 12:00 [PATCH 1/5] Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss Miao Xie
2014-01-15 12:00 ` [PATCH 2/5] Btrfs: cleanup the redundant code for the block group allocation and init Miao Xie
2014-01-15 12:00 ` [PATCH 3/5] Btrfs: cleanup the code of used_block_group in find_free_extent() Miao Xie
2014-01-15 12:00 ` [PATCH 4/5] Btrfs: fix wrong block group in trace during the free space allocation Miao Xie
2014-01-15 12:00 ` [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed Miao Xie
2014-01-16 5:54 ` Liu Bo
2014-03-01 18:05 ` Alex Lyakas
2014-03-04 6:04 ` Miao Xie [this message]
2014-03-08 16:48 ` Alex Lyakas
2014-03-04 15:19 ` Josef Bacik
2014-03-05 7:02 ` Miao Xie
2014-03-05 15:02 ` Josef Bacik
2014-03-26 6:36 ` miaox
2014-04-24 5:31 ` [PATCH V2 1/2] Btrfs: output warning instead of error when loading free space cache failed Miao Xie
2014-04-24 5:31 ` [PATCH V2 2/2] Btrfs: fix broken free space cache after the system crashed Miao Xie
2014-05-19 1:33 ` [RFC PATCH 5/5] " Chris Mason
2014-06-10 8:15 ` Alin Dobre
2014-01-15 15:56 ` [PATCH 1/5] Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53156CEE.5060308@cn.fujitsu.com \
--to=miaox@cn.fujitsu.com \
--cc=alex.btrfs@zadarastorage.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).