Re: [PATCH v2] btrfs: add missing discards when unpinning extents with -o discard

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Filipe David Manana <fdmanana@gmail.com>
To: Jeff Mahoney <jeffm@suse.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH v2] btrfs: add missing discards when unpinning extents with -o discard
Date: Thu, 14 May 2015 10:42:46 +0100	[thread overview]
Message-ID: <CAL3q7H4YWu7mx4tteoBH-mHaUWnuKvV52rnFAh9fYk-3dwzpBA@mail.gmail.com> (raw)
In-Reply-To: <CAL3q7H5OFEybC+3HDvxFXQZyeE5edZ4oOhrcCQCqnv=75agDZA@mail.gmail.com>

On Fri, May 8, 2015 at 10:10 PM, Filipe David Manana <fdmanana@gmail.com> wrote:
> On Tue, May 5, 2015 at 7:19 PM, Jeff Mahoney <jeffm@suse.com> wrote:
>> When we clear the dirty bits in btrfs_delete_unused_bgs for extents
>> in the empty block group, it results in btrfs_finish_extent_commit being
>> unable to discard the freed extents.
>>
>> The block group removal patch added an alternate path to forget extents
>> other than btrfs_finish_extent_commit. As a result, any extents that would
>> be freed when the block group is removed aren't discarded. In my
>> test run, with a large copy of mixed sized files followed by removal, it
>> left nearly 2/3 of extents undiscarded.
>>
>> To clean up the block groups, we add the removed block group onto a list
>> that will be discarded after transaction commit.
>>
>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>

Hi Jeff,
Revising this after testing. Some more comments inlined below.

>> ---
>>  fs/btrfs/ctree.h            |  2 ++
>>  fs/btrfs/extent-tree.c      | 58 +++++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/free-space-cache.c | 45 +++++++++++++++++++----------------
>>  fs/btrfs/super.c            |  2 +-
>>  fs/btrfs/transaction.c      |  2 ++
>>  fs/btrfs/transaction.h      |  2 ++
>>  6 files changed, 90 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 6f364e1..3448a54 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3438,6 +3438,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>                              struct btrfs_root *root, u64 group_start,
>>                              struct extent_map *em);
>>  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>> +void btrfs_cleanup_block_group_mapping(struct btrfs_block_group_cache *cache);
>>  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>>                                        struct btrfs_root *root);
>>  u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
>> @@ -4068,6 +4069,7 @@ __printf(5, 6)
>>  void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
>>                      unsigned int line, int errno, const char *fmt, ...);
>>
>> +const char *btrfs_decode_error(int errno);
>>
>>  void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
>>                                struct btrfs_root *root, const char *function,
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 6d1d74d..521a294 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -6011,6 +6011,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>>                                struct btrfs_root *root)
>>  {
>>         struct btrfs_fs_info *fs_info = root->fs_info;
>> +       struct btrfs_block_group_cache *block_group, *tmp;
>> +       struct list_head *deleted_bgs;
>>         struct extent_io_tree *unpin;
>>         u64 start;
>>         u64 end;
>> @@ -6043,6 +6045,29 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>>                 cond_resched();
>>         }
>>
>> +       /* Transaction is finished. We don't need the lock anymore. */
>> +       deleted_bgs = &trans->transaction->deleted_bgs;
>> +       list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
>> +               u64 start, end, trimmed = 0;
>> +               int ret;
>> +               list_del_init(&block_group->bg_list);
>> +
>> +               start = block_group->key.objectid;
>> +               end = start + block_group->key.offset - 1;
>> +
>> +               ret = btrfs_discard_extent(root, start, end, &trimmed);

The third argument for btrfs_discard_extent() is supposed to be the
number of bytes and not the end offset, that is, it should be:

ret = btrfs_discard_extent(root, block_group->key.objectid,
block_group->key.offset, &trimmed);

Fortunately this doesn't seem to cause problems as __btrfs_map_block()
will make sure we don't operate beyond the chunk's end offset.
Nevertheless it should be fixed to avoid confusion and breakage if
map_block() changes its behaviour in the future.

>> +
>> +               btrfs_cleanup_block_group_mapping(block_group);
>
> We can't do this unconditionally. Only if
> "atomic_dec_and_test(&block_group->trimming)" - we can have a
> concurrent trimmer that started before the bg deletion started (via
> the fitrim ioctl) - we don't want  the space to be possible to
> allocate while any other trimmers are still active.
>
>> +               btrfs_put_block_group(block_group);
>> +
>> +               if (ret) {
>> +                       const char *errstr = btrfs_decode_error(ret);
>> +                       btrfs_warn(fs_info,
>> +                                  "Discard failed while removing blockgroup: errno=%d %s\n",
>> +                                  ret, errstr);
>> +               }
>> +       }
>> +
>>         return 0;
>>  }
>>
>> @@ -9802,6 +9827,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>          * currently running transaction might finish and a new one start,
>>          * allowing for new block groups to be created that can reuse the same
>>          * physical device locations unless we take this special care.
>> +        *
>> +        * There may also be an implicit trim operation if the file system
>> +        * is mounted with -odiscard. The same protections must remain
>> +        * in place until the extents have been discarded completely when
>> +        * the transaction commit has completed.
>>          */
>>         remove_em = (atomic_read(&block_group->trimming) == 0);
>>         /*
>> @@ -9876,6 +9906,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>         spin_lock(&fs_info->unused_bgs_lock);
>>         while (!list_empty(&fs_info->unused_bgs)) {
>>                 u64 start, end;
>> +               int trimming;
>>
>>                 block_group = list_first_entry(&fs_info->unused_bgs,
>>                                                struct btrfs_block_group_cache,
>> @@ -9973,12 +10004,39 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>                 spin_unlock(&block_group->lock);
>>                 spin_unlock(&space_info->lock);
>>
>> +               /* DISCARD can flip during remount */
>> +               trimming = btrfs_test_opt(root, DISCARD);
>> +
>> +               /* Implicit trim during transaction commit. */
>> +               if (trimming)
>> +                       atomic_inc(&block_group->trimming);
>> +
>>                 /*
>>                  * Btrfs_remove_chunk will abort the transaction if things go
>>                  * horribly wrong.
>>                  */
>>                 ret = btrfs_remove_chunk(trans, root,
>>                                          block_group->key.objectid);
>> +
>> +               if (ret) {
>> +                       if (trimming)
>> +                               atomic_dec(&block_group->trimming);
>
> And if the new value becomes 0 after decrementing we need to call
> btrfs_cleanup_block_group_mapping(), otherwise we leak the
> space/pinned chunk forever, extent maps, etc.

Well if the new value becomes 0 and block_group->removed == 1 too of
course. Even though failure in btrfs_remove_chunk() results in
transaction abortion, it's always better to not leak memory
(btrfs_free_space entries, see below) and do proper cleanup.

>
>> +                       goto end_trans;
>> +               }
>> +
>> +               /*
>> +                * If we're not mounted with -odiscard, we can just forget
>> +                * about this block group. Otherwise we'll need to wait
>> +                * until transaction commit to do the actual discard.
>> +                */
>> +               if (trimming) {
>> +                       WARN_ON(!list_empty(&block_group->bg_list));
>> +                       spin_lock(&trans->transaction->deleted_bgs_lock);
>> +                       list_add(&block_group->bg_list,
>> +                                &trans->transaction->deleted_bgs);
>
> I'd rather do this only if list_empty(&block_group->bg_list).
> Otherwise we risk getting the bg referenced multiple times in a list
> (or in different lists), resulting in dangling pointers even after the
> list_del() call.
>
>> +                       spin_unlock(&trans->transaction->deleted_bgs_lock);
>> +                       btrfs_get_block_group(block_group);
>> +               }
>>  end_trans:
>>                 btrfs_end_transaction(trans, root);
>>  next:
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index 41c510b..d33b146 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -3274,6 +3274,30 @@ next:
>>         return ret;
>>  }
>>
>> +void btrfs_cleanup_block_group_mapping(struct btrfs_block_group_cache *cache)
>> +{
>> +       struct extent_map_tree *em_tree;
>> +       struct extent_map *em;
>> +
>> +       lock_chunks(cache->fs_info->chunk_root);
>> +       em_tree = &cache->fs_info->mapping_tree.map_tree;
>> +       write_lock(&em_tree->lock);
>> +       em = lookup_extent_mapping(em_tree, cache->key.objectid, 1);
>> +       BUG_ON(!em); /* logic error, can't happen */
>> +
>> +       /*
>> +        * remove_extent_mapping() will delete us from the pinned_chunks
>> +        * list, which is protected by the chunk mutex.
>> +        */
>> +       remove_extent_mapping(em_tree, em);
>> +       write_unlock(&em_tree->lock);
>> +       unlock_chunks(cache->fs_info->chunk_root);
>> +
>> +       /* once for us and once for the tree */
>> +       free_extent_map(em);
>> +       free_extent_map(em);
>> +}
>> +
>>  int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
>>                            u64 *trimmed, u64 start, u64 end, u64 minlen)
>>  {
>> @@ -3298,28 +3322,9 @@ out:
>>         spin_lock(&block_group->lock);
>>         if (atomic_dec_and_test(&block_group->trimming) &&
>>             block_group->removed) {
>> -               struct extent_map_tree *em_tree;
>> -               struct extent_map *em;
>> -
>>                 spin_unlock(&block_group->lock);
>>
>> -               lock_chunks(block_group->fs_info->chunk_root);
>> -               em_tree = &block_group->fs_info->mapping_tree.map_tree;
>> -               write_lock(&em_tree->lock);
>> -               em = lookup_extent_mapping(em_tree, block_group->key.objectid,
>> -                                          1);
>> -               BUG_ON(!em); /* logic error, can't happen */
>> -               /*
>> -                * remove_extent_mapping() will delete us from the pinned_chunks
>> -                * list, which is protected by the chunk mutex.
>> -                */
>> -               remove_extent_mapping(em_tree, em);
>> -               write_unlock(&em_tree->lock);
>> -               unlock_chunks(block_group->fs_info->chunk_root);
>> -
>> -               /* once for us and once for the tree */
>> -               free_extent_map(em);
>> -               free_extent_map(em);
>> +               btrfs_cleanup_block_group_mapping(block_group);
>>
>>                 /*
>>                  * We've left one free space entry and other tasks trimming

So this part here is important. We need either to move the call to
__btrfs_remove_free_space_cache(block_group->free_space_ctl) to the
new function btrfs_cleanup_block_group_mapping() or make sure
btrfs_finish_extent_commit() calls it when it calls
btrfs_cleanup_block_group_mapping(). Otherwise we have a leak of
btrfs_free_space entries, 1 for each trimmer that came from the fitrim
ioctl right before the block group was removed from the rbtree. This
is visible at rmmod time if you keep running generic/038:

[63529.346047] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
(...)
[63529.362309] Call Trace:
[63529.362835]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[63529.363782]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[63529.364878]  [<ffffffff8111e73d>] kmem_cache_destroy+0x6b/0xe9
[63529.366011]  [<ffffffffa0579ceb>] btrfs_destroy_cachep+0x63/0x76 [btrfs]
[63529.367230]  [<ffffffffa05d80c4>] exit_btrfs_fs+0x9/0xf45 [btrfs]
[63529.368367]  [<ffffffff810af90a>] SyS_delete_module+0x144/0x1d2
[63529.369451]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[63529.370610]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17



>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 9e66f5e..016e65a 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
>>
>>  static int btrfs_remount(struct super_block *sb, int *flags, char *data);
>>
>> -static const char *btrfs_decode_error(int errno)
>> +const char *btrfs_decode_error(int errno)
>>  {
>>         char *errstr = "unknown";
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 5628e25..2005262 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -256,6 +256,8 @@ loop:
>>         mutex_init(&cur_trans->cache_write_mutex);
>>         cur_trans->num_dirty_bgs = 0;
>>         spin_lock_init(&cur_trans->dirty_bgs_lock);
>> +       INIT_LIST_HEAD(&cur_trans->deleted_bgs);
>> +       spin_lock_init(&cur_trans->deleted_bgs_lock);
>>         list_add_tail(&cur_trans->list, &fs_info->trans_list);
>>         extent_io_tree_init(&cur_trans->dirty_pages,
>>                              fs_info->btree_inode->i_mapping);
>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>> index 0b24755..14325f2 100644
>> --- a/fs/btrfs/transaction.h
>> +++ b/fs/btrfs/transaction.h
>> @@ -74,6 +74,8 @@ struct btrfs_transaction {
>>          */
>>         struct mutex cache_write_mutex;
>>         spinlock_t dirty_bgs_lock;
>> +       struct list_head deleted_bgs;
>> +       spinlock_t deleted_bgs_lock;
>>         struct btrfs_delayed_ref_root delayed_refs;
>>         int aborted;
>>         int dirty_bg_run;
>> --
>> 1.8.5.6

While testing this I ran often into transaction abortions, with
-EEXISTS, due to getting the same physical device offsets assigned to
multiple new block groups:

[194737.659017] ------------[ cut here ]------------
[194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260
__btrfs_abort_transaction+0x52/0x106 [btrfs]()
[194737.662209] BTRFS: Transaction aborted (error -17)
[194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio
dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
acpi_cpufreq i2c_piix4 parport_pc processor psmouse i2c_core
thermal_sys pcspkr evdev parport button serio_raw microcode ext4 crc16
jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix
floppy virtio_pci libata virtio_ring e1000 virtio scsi_mod [last
unloaded: btrfs]
[194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W
   4.0.0-rc5-btrfs-next-9+ #2
[194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
04/01/2014
[194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46
ffffffff8108b6a2
[194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5
ffff8800564c7b78
[194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000
ffff8801a1f66f40
[194737.687509] Call Trace:
[194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[194737.691198]  [<ffffffffa0383aa7>] ?
__btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[194737.695065]  [<ffffffffa0383aa7>]
__btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.696806]  [<ffffffffa039a3bd>]
btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
[194737.701924]  [<ffffffffa0394b51>]
btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
[194737.705417]  [<ffffffffa03bb502>] ?
btrfs_file_write_iter+0x19a/0x431 [btrfs]
[194737.707058]  [<ffffffffa03bb511>] ?
btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
[194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
[194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
[194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
[194737.718170] BTRFS: error (device sdc) in
btrfs_create_pending_block_groups:9524: errno=-17 Object already
exists


But this turned out to not be a problem in this nor your other
trim/discard patch. It's actually a regression from a change
introduced in 4.1 and this patch only makes that issue trigger more
often. I'll send a fix for it soon.

Everything else looks good.
Thanks.

>>
>>
>> --
>> Jeff Mahoney
>> SUSE Labs
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

next prev parent reply	other threads:[~2015-05-14  9:42 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-05 18:19 [PATCH v2] btrfs: add missing discards when unpinning extents with -o discard Jeff Mahoney
2015-05-08 21:10 ` Filipe David Manana
2015-05-14  9:42   ` Filipe David Manana [this message]
2015-05-14 13:06     ` Jeff Mahoney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAL3q7H4YWu7mx4tteoBH-mHaUWnuKvV52rnFAh9fYk-3dwzpBA@mail.gmail.com \
    --to=fdmanana@gmail.com \
    --cc=jeffm@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).