From: Josef Bacik <jbacik@fb.com>
To: <fdmanana@gmail.com>
Cc: Filipe Manana <fdmanana@suse.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
Date: Wed, 26 Nov 2014 11:30:59 -0500 [thread overview]
Message-ID: <54760043.6070505@fb.com> (raw)
In-Reply-To: <CAL3q7H4rB80XONBNKFZ-TEKaVproQbeyKOfaJG=sAg4k=Pnwww@mail.gmail.com>
On 11/26/2014 11:25 AM, Filipe David Manana wrote:
> On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>
>>> Our fs trim operation, which is completely transactionless (doesn't start
>>> or joins an existing transaction) consists of visiting all block groups
>>> and then for each one to iterate its free space entries and perform a
>>> discard operation against the space range represented by the free space
>>> entries. However before performing a discard, the corresponding free space
>>> entry is removed from the free space rbtree, and when the discard
>>> completes
>>> it is added back to the free space rbtree.
>>>
>>> If a block group remove operation happens while the discard is ongoing (or
>>> before it starts and after a free space entry is hidden), we end up not
>>> waiting for the discard to complete, remove the extent map that maps
>>> logical address to physical addresses and the corresponding chunk metadata
>>> from the the chunk and device trees. After that and before the discard
>>> completes, the current running transaction can finish and a new one start,
>>> allowing for new block groups that map to the same physical addresses to
>>> be allocated and written to.
>>>
>>> So fix this by keeping the extent map in memory until the discard
>>> completes
>>> so that the same physical addresses aren't reused before it completes.
>>>
>>> If the physical locations that are under a discard operation end up being
>>> used for a new metadata block group for example, and dirty metadata
>>> extents
>>> are written before the discard finishes (the VM might call writepages() of
>>> our btree inode's i_mapping for example, or an fsync log commit happens)
>>> we
>>> end up overwriting metadata with zeroes, which leads to errors from fsck
>>> like the following:
>>>
>>> checking extents
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> read block failed check_tree_block
>>> owner ref check failed [833912832 16384]
>>> Errors found in extent allocation tree or chunk allocation
>>> checking free space cache
>>> checking fs roots
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> read block failed check_tree_block
>>> root 5 root dir 256 error
>>> root 5 inode 260 errors 2001, no inode item, link count wrong
>>> unresolved ref dir 256 index 0 namelen 8 name foobar_3
>>> filetype 1 errors 6, no dir index, no inode ref
>>> root 5 inode 262 errors 2001, no inode item, link count wrong
>>> unresolved ref dir 256 index 0 namelen 8 name foobar_5
>>> filetype 1 errors 6, no dir index, no inode ref
>>> root 5 inode 263 errors 2001, no inode item, link count wrong
>>> (...)
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>> fs/btrfs/ctree.h | 13 ++++++++++++-
>>> fs/btrfs/disk-io.c | 14 ++++++++++++++
>>> fs/btrfs/extent-tree.c | 24 +++++++++++++++++++++++-
>>> fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++-
>>> fs/btrfs/volumes.c | 33 ++++++++++++++++++++++++++-------
>>> 5 files changed, 100 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 7f40a65..51056c7 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
>>> unsigned int dirty:1;
>>> unsigned int iref:1;
>>> unsigned int has_caching_ctl:1;
>>> + unsigned int removed:1;
>>>
>>> int disk_cache_state;
>>>
>>> @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
>>>
>>> /* For delayed block group creation or deletion of empty block
>>> groups */
>>> struct list_head bg_list;
>>> +
>>> + atomic_t trimming;
>>> };
>>>
>>> /* delayed seq elem */
>>> @@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
>>>
>>> /* For btrfs to record security options */
>>> struct security_mnt_opts security_opts;
>>> +
>>> + /*
>>> + * Chunks that can't be freed yet (under a trim/discard operation)
>>> + * and will be latter freed.
>>> + */
>>> + rwlock_t pinned_chunks_lock;
>>> + struct list_head pinned_chunks;
>>> };
>>>
>>> struct btrfs_subvolume_writers {
>>> @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle
>>> *trans,
>>> u64 type, u64 chunk_objectid, u64 chunk_offset,
>>> u64 size);
>>> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>> - struct btrfs_root *root, u64 group_start);
>>> + struct btrfs_root *root, u64 group_start,
>>> + bool *remove_em);
>>> void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>>> void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>>> struct btrfs_root *root);
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 9d4fb0a..76012d0 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
>>> init_waitqueue_head(&fs_info->transaction_blocked_wait);
>>> init_waitqueue_head(&fs_info->async_submit_wait);
>>>
>>> + rwlock_init(&fs_info->pinned_chunks_lock);
>>> + INIT_LIST_HEAD(&fs_info->pinned_chunks);
>>> +
>>> ret = btrfs_alloc_stripe_hash_table(fs_info);
>>> if (ret) {
>>> err = ret;
>>> @@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
>>>
>>> btrfs_free_block_rsv(root, root->orphan_block_rsv);
>>> root->orphan_block_rsv = NULL;
>>> +
>>> + write_lock(&fs_info->pinned_chunks_lock);
>>> + while (!list_empty(&fs_info->pinned_chunks)) {
>>> + struct extent_map *em;
>>> +
>>> + em = list_first_entry(&fs_info->pinned_chunks,
>>> + struct extent_map, list);
>>> + list_del_init(&em->list);
>>> + free_extent_map(em);
>>> + }
>>> + write_unlock(&fs_info->pinned_chunks_lock);
>>> }
>>>
>>> int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 92f61f2..4bf8f02 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root
>>> *root, u64 start, u64 size)
>>> INIT_LIST_HEAD(&cache->cluster_list);
>>> INIT_LIST_HEAD(&cache->bg_list);
>>> btrfs_init_free_space_ctl(cache);
>>> + atomic_set(&cache->trimming, 0);
>>>
>>> return cache;
>>> }
>>> @@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct
>>> btrfs_fs_info *fs_info, u64 flags)
>>> }
>>>
>>> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>> - struct btrfs_root *root, u64 group_start)
>>> + struct btrfs_root *root, u64 group_start,
>>> + bool *remove_em)
>>> {
>>> struct btrfs_path *path;
>>> struct btrfs_block_group_cache *block_group;
>>> @@ -9474,6 +9476,26 @@ int btrfs_remove_block_group(struct
>>> btrfs_trans_handle *trans,
>>>
>>> memcpy(&key, &block_group->key, sizeof(key));
>>>
>>> + spin_lock(&block_group->lock);
>>> + block_group->removed = 1;
>>
>>
>> Ok you set block_group->removed here, but you don't check it in
>> btrfs_trim_block_group, so we can easily race in afterwards and start
>> trimming this block group.
>>
>>
>>> + /*
>>> + * At this point trimming can't start on this block group, because
>>> we
>>> + * removed the block group from the tree
>>> fs_info->block_group_cache_tree
>>> + * so no one can't find it anymore.
>>> + *
>>> + * And we must tell our caller to not remove the extent map from
>>> the
>>> + * fs_info->mapping_tree to prevent the same logical address range
>>> and
>>> + * physical device space ranges from being reused for a new block
>>> group.
>>> + * This is because our fs trim operation (btrfs_trim_fs(),
>>> + * btrfs_ioctl_fitrim()) is completely transactionless, so while
>>> its
>>> + * trimming a range the currently running transaction might finish
>>> and
>>> + * a new one start, allowing for new block groups to be created
>>> that can
>>> + * reuse the same physical device locations unless we take this
>>> special
>>> + * care.
>>> + */
>>> + *remove_em = (atomic_read(&block_group->trimming) == 0);
>>> + spin_unlock(&block_group->lock);
>>> +
>>> btrfs_put_block_group(block_group);
>>> btrfs_put_block_group(block_group);
>>>
>>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>>> index 3384819..16c2d39 100644
>>> --- a/fs/btrfs/free-space-cache.c
>>> +++ b/fs/btrfs/free-space-cache.c
>>> @@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct
>>> btrfs_block_group_cache *block_group,
>>>
>>> *trimmed = 0;
>>>
>>> + atomic_inc(&block_group->trimming);
>>> +
>>
>>
>> This needs to be something like this
>>
>>
>> spin_lock(&block_group->lock);
>> if (block_group->removed == 1) {
>> spin_unlock(&block_group->lock);
>> return 0;
>> }
>> atomic_inc(&block_group->trimming);
>> spin_unlock(&block_group->lock);
>>
>> To be properly safe. Thanks,
>
> Hi Josef, it is safe.
> btrfs_trim_block_group() checks for ->removed after decrementing the
> atomic. Before it starts trimming, it increments the counter - if the
> block group removal already started it isn't a problem, because it
> removed all the free space entries from the rbtree while holding the
> tree's lock.
Ah ok that makes sense. Will you add that to the comment cause it was
not clear and still took me 5 minutes to figure out how it was safe.
Thanks,
Josef
next prev parent reply other threads:[~2014-11-26 16:31 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
2014-11-26 15:28 ` [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed Filipe Manana
2014-11-26 15:58 ` Josef Bacik
2014-11-26 15:28 ` [PATCH 2/6] Btrfs: fix crash caused by block group removal Filipe Manana
2014-11-26 15:57 ` Josef Bacik
2014-11-26 16:09 ` Filipe David Manana
2014-11-26 16:24 ` Josef Bacik
2014-11-26 16:34 ` Filipe David Manana
2014-11-26 16:41 ` Josef Bacik
2014-11-26 15:28 ` [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group Filipe Manana
2014-11-26 16:02 ` Josef Bacik
2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
2014-11-26 16:15 ` Josef Bacik
2014-11-26 16:25 ` Filipe David Manana
2014-11-26 16:30 ` Josef Bacik [this message]
2014-11-26 17:19 ` [PATCH v2 " Filipe Manana
2014-11-27 1:16 ` [PATCH v3 " Filipe Manana
2014-11-27 21:14 ` [PATCH v4 " Filipe Manana
2014-11-26 15:28 ` [PATCH 5/6] Btrfs: fix race between writing free space cache and trimming Filipe Manana
2014-12-01 17:04 ` [PATCH v2 " Filipe Manana
2014-11-26 15:28 ` [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups Filipe Manana
2014-11-26 16:07 ` Josef Bacik
2014-11-26 16:15 ` Filipe David Manana
2014-11-26 16:19 ` Josef Bacik
2014-11-26 16:29 ` Filipe David Manana
2014-11-26 16:32 ` Josef Bacik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=54760043.6070505@fb.com \
--to=jbacik@fb.com \
--cc=fdmanana@gmail.com \
--cc=fdmanana@suse.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.