linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alex Lyakas <alex@zadarastorage.com>
To: Omar Sandoval <osandov@fb.com>
Cc: fdmanana@kernel.org, linux-btrfs <linux-btrfs@vger.kernel.org>,
	Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
Date: Sun, 13 Dec 2015 12:29:24 +0200	[thread overview]
Message-ID: <CAOcd+r3vHCntp5k5KAP6pKQ4X6zot_7whLd0mS9oOvroAgATgQ@mail.gmail.com> (raw)
In-Reply-To: <20150721235334.GA8113@huxley.DHCP.TheFacebook.com>

Hi Filipe Manana,

Can't the call to btrfs_create_pending_block_groups() cause a
deadlock, like in
http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
call updates the device tree, and we may be calling do_chunk_alloc()
from find_free_extent() when holding a lock on the device tree root
(because we want to COW a block of the device tree).

My understanding from Josef's chunk allocator rework
(http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
when allocating a new chunk we do not immediately update the
device/chunk tree. We keep the new chunk in "pending_chunks" and in
"new_bgs" on a transaction handle, and we actually update the
chunk/device tree only when we are done with a particular transaction
handle. This way we avoid that sort of deadlocks.

But this patch breaks this rule, as it may make us update the
device/chunk tree in the context of chunk allocation, which is the
scenario that the rework was meant to avoid.

Can you please point me at what I am missing?

Thanks,
Alex.


On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval <osandov@fb.com> wrote:
> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>> finishing block group creation"), introduced in 4.2-rc1, the following
>> test was failing due to exhaustion of the system array in the superblock:
>>
>>   #!/bin/bash
>>
>>   truncate -s 100T big.img
>>   mkfs.btrfs big.img
>>   mount -o loop big.img /mnt/loop
>>
>>   num=5
>>   sz=10T
>>   for ((i = 0; i < $num; i++)); do
>>       echo fallocate $i $sz
>>       fallocate -l $sz /mnt/loop/testfile$i
>>   done
>>   btrfs filesystem sync /mnt/loop
>>
>>   for ((i = 0; i < $num; i++)); do
>>         echo rm $i
>>         rm /mnt/loop/testfile$i
>>         btrfs filesystem sync /mnt/loop
>>   done
>>   umount /mnt/loop
>>
>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>> allocation of system block groups. This happened because the test creates
>> a large number of data block groups per transaction and when committing
>> the transaction we start the writeout of the block group caches for all
>> the new new (dirty) block groups, which results in pre-allocating space
>> for each block group's free space cache using the same transaction handle.
>> That in turn often leads to creation of more block groups, and all get
>> attached to the new_bgs list of the same transaction handle to the point
>> of getting a list with over 1500 elements, and creation of new block groups
>> leads to the need of reserving space in the chunk block reserve and often
>> creating a new system block group too.
>>
>> So that made us quickly exhaust the chunk block reserve/system space info,
>> because as of the commit mentioned before, we do reserve space for each
>> new block group in the chunk block reserve, unlike before where we would
>> not and would at most allocate one new system block group and therefore
>> would only ensure that there was enough space in the system space info to
>> allocate 1 new block group even if we ended up allocating thousands of
>> new block groups using the same transaction handle. That worked most of
>> the time because the computed required space at check_system_chunk() is
>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>> that all nodes/leafs in a path will be COWed and split) and since the
>> updates to the chunk tree all happen at btrfs_create_pending_block_groups
>> it is unlikely that a path needs to be COWed more than once (unless
>> writepages() for the btree inode is called by mm in between) and that
>> compensated for the need of creating any new nodes/leads in the chunk
>> tree.
>>
>> So fix this by ensuring we don't accumulate a too large list of new block
>> groups in a transaction's handles new_bgs list, inserting/updating the
>> chunk tree for all accumulated new block groups and releasing the unused
>> space from the chunk block reserve whenever the list becomes sufficiently
>> large. This is a generic solution even though the problem currently can
>> only happen when starting the writeout of the free space caches for all
>> dirty block groups (btrfs_start_dirty_block_groups()).
>>
>> Reported-by: Omar Sandoval <osandov@fb.com>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks a lot for taking a look.
>
> Tested-by: Omar Sandoval <osandov@fb.com>
>
>> ---
>>  fs/btrfs/extent-tree.c | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 171312d..07204bf 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4227,6 +4227,24 @@ out:
>>       space_info->chunk_alloc = 0;
>>       spin_unlock(&space_info->lock);
>>       mutex_unlock(&fs_info->chunk_mutex);
>> +     /*
>> +      * When we allocate a new chunk we reserve space in the chunk block
>> +      * reserve to make sure we can COW nodes/leafs in the chunk tree or
>> +      * add new nodes/leafs to it if we end up needing to do it when
>> +      * inserting the chunk item and updating device items as part of the
>> +      * second phase of chunk allocation, performed by
>> +      * btrfs_finish_chunk_alloc(). So make sure we don't accumulate a
>> +      * large number of new block groups to create in our transaction
>> +      * handle's new_bgs list to avoid exhausting the chunk block reserve
>> +      * in extreme cases - like having a single transaction create many new
>> +      * block groups when starting to write out the free space caches of all
>> +      * the block groups that were made dirty during the lifetime of the
>> +      * transaction.
>> +      */
>> +     if (trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) {
>> +             btrfs_create_pending_block_groups(trans, trans->root);
>> +             btrfs_trans_release_chunk_metadata(trans);
>> +     }
>>       return ret;
>>  }
>>
>> --
>> 2.1.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Omar
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2015-12-13 10:29 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-20 13:56 [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock fdmanana
2015-07-21 23:53 ` Omar Sandoval
2015-12-13 10:29   ` Alex Lyakas [this message]
2015-12-13 15:45     ` Filipe Manana
2015-12-13 16:15       ` Alex Lyakas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAOcd+r3vHCntp5k5KAP6pKQ4X6zot_7whLd0mS9oOvroAgATgQ@mail.gmail.com \
    --to=alex@zadarastorage.com \
    --cc=fdmanana@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=osandov@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).