From: Josef Bacik <jbacik@fb.com>
To: <fdmanana@gmail.com>
Cc: Filipe Manana <fdmanana@suse.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
Date: Wed, 26 Nov 2014 11:30:59 -0500 [thread overview]
Message-ID: <54760043.6070505@fb.com> (raw)
In-Reply-To: <CAL3q7H4rB80XONBNKFZ-TEKaVproQbeyKOfaJG=sAg4k=Pnwww@mail.gmail.com>
On 11/26/2014 11:25 AM, Filipe David Manana wrote:
> On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>
>>> Our fs trim operation, which is completely transactionless (it doesn't
>>> start or join a transaction), consists of visiting all block groups and,
>>> for each one, iterating its free space entries and performing a discard
>>> operation against the space range represented by each entry. However,
>>> before performing a discard, the corresponding free space entry is
>>> removed from the free space rbtree, and when the discard completes it
>>> is added back to the free space rbtree.
>>>
>>> If a block group remove operation happens while the discard is ongoing
>>> (or before it starts and after a free space entry is hidden), we end up
>>> not waiting for the discard to complete, and we remove the extent map
>>> that maps the logical address to physical addresses, as well as the
>>> corresponding chunk metadata, from the chunk and device trees. After
>>> that, and before the discard completes, the current running transaction
>>> can finish and a new one start, allowing new block groups that map to
>>> the same physical addresses to be allocated and written to.
>>>
>>> So fix this by keeping the extent map in memory until the discard
>>> completes, so that the same physical addresses aren't reused in the
>>> meantime.
>>>
>>> If the physical locations that are under a discard operation end up
>>> being used for a new metadata block group, for example, and dirty
>>> metadata extents are written before the discard finishes (the VM might
>>> call writepages() on our btree inode's i_mapping, for example, or an
>>> fsync log commit happens), we end up overwriting metadata with zeroes,
>>> which leads to errors from fsck like the following:
>>>
>>> checking extents
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> read block failed check_tree_block
>>> owner ref check failed [833912832 16384]
>>> Errors found in extent allocation tree or chunk allocation
>>> checking free space cache
>>> checking fs roots
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> Check tree block failed, want=833912832, have=0
>>> read block failed check_tree_block
>>> root 5 root dir 256 error
>>> root 5 inode 260 errors 2001, no inode item, link count wrong
>>> unresolved ref dir 256 index 0 namelen 8 name foobar_3
>>> filetype 1 errors 6, no dir index, no inode ref
>>> root 5 inode 262 errors 2001, no inode item, link count wrong
>>> unresolved ref dir 256 index 0 namelen 8 name foobar_5
>>> filetype 1 errors 6, no dir index, no inode ref
>>> root 5 inode 263 errors 2001, no inode item, link count wrong
>>> (...)
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>> fs/btrfs/ctree.h | 13 ++++++++++++-
>>> fs/btrfs/disk-io.c | 14 ++++++++++++++
>>> fs/btrfs/extent-tree.c | 24 +++++++++++++++++++++++-
>>> fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++-
>>> fs/btrfs/volumes.c | 33 ++++++++++++++++++++++++++-------
>>> 5 files changed, 100 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 7f40a65..51056c7 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
>>> unsigned int dirty:1;
>>> unsigned int iref:1;
>>> unsigned int has_caching_ctl:1;
>>> + unsigned int removed:1;
>>>
>>> int disk_cache_state;
>>>
>>> @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
>>>
>>> /* For delayed block group creation or deletion of empty block groups */
>>> struct list_head bg_list;
>>> +
>>> + atomic_t trimming;
>>> };
>>>
>>> /* delayed seq elem */
>>> @@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
>>>
>>> /* For btrfs to record security options */
>>> struct security_mnt_opts security_opts;
>>> +
>>> + /*
>>> + * Chunks that can't be freed yet (under a trim/discard operation)
>>> + * and will be freed later.
>>> + */
>>> + rwlock_t pinned_chunks_lock;
>>> + struct list_head pinned_chunks;
>>> };
>>>
>>> struct btrfs_subvolume_writers {
>>> @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
>>> u64 type, u64 chunk_objectid, u64 chunk_offset,
>>> u64 size);
>>> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>> - struct btrfs_root *root, u64 group_start);
>>> + struct btrfs_root *root, u64 group_start,
>>> + bool *remove_em);
>>> void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>>> void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>>> struct btrfs_root *root);
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 9d4fb0a..76012d0 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
>>> init_waitqueue_head(&fs_info->transaction_blocked_wait);
>>> init_waitqueue_head(&fs_info->async_submit_wait);
>>>
>>> + rwlock_init(&fs_info->pinned_chunks_lock);
>>> + INIT_LIST_HEAD(&fs_info->pinned_chunks);
>>> +
>>> ret = btrfs_alloc_stripe_hash_table(fs_info);
>>> if (ret) {
>>> err = ret;
>>> @@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
>>>
>>> btrfs_free_block_rsv(root, root->orphan_block_rsv);
>>> root->orphan_block_rsv = NULL;
>>> +
>>> + write_lock(&fs_info->pinned_chunks_lock);
>>> + while (!list_empty(&fs_info->pinned_chunks)) {
>>> + struct extent_map *em;
>>> +
>>> + em = list_first_entry(&fs_info->pinned_chunks,
>>> + struct extent_map, list);
>>> + list_del_init(&em->list);
>>> + free_extent_map(em);
>>> + }
>>> + write_unlock(&fs_info->pinned_chunks_lock);
>>> }
>>>
>>> int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 92f61f2..4bf8f02 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
>>> INIT_LIST_HEAD(&cache->cluster_list);
>>> INIT_LIST_HEAD(&cache->bg_list);
>>> btrfs_init_free_space_ctl(cache);
>>> + atomic_set(&cache->trimming, 0);
>>>
>>> return cache;
>>> }
>>> @@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
>>> }
>>>
>>> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>> - struct btrfs_root *root, u64 group_start)
>>> + struct btrfs_root *root, u64 group_start,
>>> + bool *remove_em)
>>> {
>>> struct btrfs_path *path;
>>> struct btrfs_block_group_cache *block_group;
>>> @@ -9474,6 +9476,26 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>>
>>> memcpy(&key, &block_group->key, sizeof(key));
>>>
>>> + spin_lock(&block_group->lock);
>>> + block_group->removed = 1;
>>
>>
>> Ok you set block_group->removed here, but you don't check it in
>> btrfs_trim_block_group, so we can easily race in afterwards and start
>> trimming this block group.
>>
>>
>>> + /*
>>> + * At this point trimming can't start on this block group, because we
>>> + * removed the block group from the tree fs_info->block_group_cache_tree,
>>> + * so no one can find it anymore.
>>> + *
>>> + * And we must tell our caller to not remove the extent map from the
>>> + * fs_info->mapping_tree, to prevent the same logical address range and
>>> + * physical device space ranges from being reused for a new block group.
>>> + * This is because our fs trim operation (btrfs_trim_fs() /
>>> + * btrfs_ioctl_fitrim()) is completely transactionless, so while it's
>>> + * trimming a range the currently running transaction might finish and
>>> + * a new one start, allowing for new block groups to be created that
>>> + * can reuse the same physical device locations unless we take this
>>> + * special care.
>>> + */
>>> + *remove_em = (atomic_read(&block_group->trimming) == 0);
>>> + spin_unlock(&block_group->lock);
>>> +
>>> btrfs_put_block_group(block_group);
>>> btrfs_put_block_group(block_group);
>>>
>>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>>> index 3384819..16c2d39 100644
>>> --- a/fs/btrfs/free-space-cache.c
>>> +++ b/fs/btrfs/free-space-cache.c
>>> @@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
>>>
>>> *trimmed = 0;
>>>
>>> + atomic_inc(&block_group->trimming);
>>> +
>>
>>
>> This needs to be something like this
>>
>>
>> spin_lock(&block_group->lock);
>> if (block_group->removed == 1) {
>> spin_unlock(&block_group->lock);
>> return 0;
>> }
>> atomic_inc(&block_group->trimming);
>> spin_unlock(&block_group->lock);
>>
>> To be properly safe. Thanks,
>
> Hi Josef, it is safe.
> btrfs_trim_block_group() checks for ->removed after decrementing the
> atomic. Before it starts trimming, it increments the counter - if the
> block group removal already started it isn't a problem, because it
> removed all the free space entries from the rbtree while holding the
> tree's lock.
Ah ok, that makes sense. Will you add that to the comment, because it was
not clear and it still took me 5 minutes to figure out how it was safe.
Thanks,
Josef
Thread overview: 26+ messages
2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
2014-11-26 15:28 ` [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed Filipe Manana
2014-11-26 15:58 ` Josef Bacik
2014-11-26 15:28 ` [PATCH 2/6] Btrfs: fix crash caused by block group removal Filipe Manana
2014-11-26 15:57 ` Josef Bacik
2014-11-26 16:09 ` Filipe David Manana
2014-11-26 16:24 ` Josef Bacik
2014-11-26 16:34 ` Filipe David Manana
2014-11-26 16:41 ` Josef Bacik
2014-11-26 15:28 ` [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group Filipe Manana
2014-11-26 16:02 ` Josef Bacik
2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
2014-11-26 16:15 ` Josef Bacik
2014-11-26 16:25 ` Filipe David Manana
2014-11-26 16:30 ` Josef Bacik [this message]
2014-11-26 17:19 ` [PATCH v2 " Filipe Manana
2014-11-27 1:16 ` [PATCH v3 " Filipe Manana
2014-11-27 21:14 ` [PATCH v4 " Filipe Manana
2014-11-26 15:28 ` [PATCH 5/6] Btrfs: fix race between writing free space cache and trimming Filipe Manana
2014-12-01 17:04 ` [PATCH v2 " Filipe Manana
2014-11-26 15:28 ` [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups Filipe Manana
2014-11-26 16:07 ` Josef Bacik
2014-11-26 16:15 ` Filipe David Manana
2014-11-26 16:19 ` Josef Bacik
2014-11-26 16:29 ` Filipe David Manana
2014-11-26 16:32 ` Josef Bacik