From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:39518 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933284AbbGUXxk (ORCPT ); Tue, 21 Jul 2015 19:53:40 -0400 Date: Tue, 21 Jul 2015 16:53:34 -0700 From: Omar Sandoval To: CC: , Filipe Manana Subject: Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock Message-ID: <20150721235334.GA8113@huxley.DHCP.TheFacebook.com> References: <1437400580-30525-1-git-send-email-fdmanana@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" In-Reply-To: <1437400580-30525-1-git-send-email-fdmanana@kernel.org> Sender: linux-btrfs-owner@vger.kernel.org List-ID: On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana > > Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when > finishing block group creation"), introduced in 4.2-rc1, the following > test was failing due to exhaustion of the system array in the superblock: > > #!/bin/bash > > truncate -s 100T big.img > mkfs.btrfs big.img > mount -o loop big.img /mnt/loop > > num=5 > sz=10T > for ((i = 0; i < $num; i++)); do > echo fallocate $i $sz > fallocate -l $sz /mnt/loop/testfile$i > done > btrfs filesystem sync /mnt/loop > > for ((i = 0; i < $num; i++)); do > echo rm $i > rm /mnt/loop/testfile$i > btrfs filesystem sync /mnt/loop > done > umount /mnt/loop > > This made btrfs_add_system_chunk() fail with -EFBIG due to excessive > allocation of system block groups. This happened because the test creates > a large number of data block groups per transaction and when committing > the transaction we start the writeout of the block group caches for all > the new new (dirty) block groups, which results in pre-allocating space > for each block group's free space cache using the same transaction handle. > That in turn often leads to creation of more block groups, and all get > attached to the new_bgs list of the same transaction handle to the point > of getting a list with over 1500 elements, and creation of new block groups > leads to the need of reserving space in the chunk block reserve and often > creating a new system block group too. > > So that made us quickly exhaust the chunk block reserve/system space info, > because as of the commit mentioned before, we do reserve space for each > new block group in the chunk block reserve, unlike before where we would > not and would at most allocate one new system block group and therefore > would only ensure that there was enough space in the system space info to > allocate 1 new block group even if we ended up allocating thousands of > new block groups using the same transaction handle. That worked most of > the time because the computed required space at check_system_chunk() is > very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and > that all nodes/leafs in a path will be COWed and split) and since the > updates to the chunk tree all happen at btrfs_create_pending_block_groups > it is unlikely that a path needs to be COWed more than once (unless > writepages() for the btree inode is called by mm in between) and that > compensated for the need of creating any new nodes/leads in the chunk > tree. > > So fix this by ensuring we don't accumulate a too large list of new block > groups in a transaction's handles new_bgs list, inserting/updating the > chunk tree for all accumulated new block groups and releasing the unused > space from the chunk block reserve whenever the list becomes sufficiently > large. This is a generic solution even though the problem currently can > only happen when starting the writeout of the free space caches for all > dirty block groups (btrfs_start_dirty_block_groups()). > > Reported-by: Omar Sandoval > Signed-off-by: Filipe Manana Thanks a lot for taking a look. Tested-by: Omar Sandoval > --- > fs/btrfs/extent-tree.c | 18 ++++++++++++++++++ > 1 file changed, 18 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 171312d..07204bf 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4227,6 +4227,24 @@ out: > space_info->chunk_alloc = 0; > spin_unlock(&space_info->lock); > mutex_unlock(&fs_info->chunk_mutex); > + /* > + * When we allocate a new chunk we reserve space in the chunk block > + * reserve to make sure we can COW nodes/leafs in the chunk tree or > + * add new nodes/leafs to it if we end up needing to do it when > + * inserting the chunk item and updating device items as part of the > + * second phase of chunk allocation, performed by > + * btrfs_finish_chunk_alloc(). So make sure we don't accumulate a > + * large number of new block groups to create in our transaction > + * handle's new_bgs list to avoid exhausting the chunk block reserve > + * in extreme cases - like having a single transaction create many new > + * block groups when starting to write out the free space caches of all > + * the block groups that were made dirty during the lifetime of the > + * transaction. > + */ > + if (trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) { > + btrfs_create_pending_block_groups(trans, trans->root); > + btrfs_trans_release_chunk_metadata(trans); > + } > return ret; > } > > -- > 2.1.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar