From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:36014 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755525AbbBLOxz (ORCPT ); Thu, 12 Feb 2015 09:53:55 -0500 Message-ID: <54DCBE7D.5010906@fb.com> Date: Thu, 12 Feb 2015 09:53:49 -0500 From: Josef Bacik MIME-Version: 1.0 To: CC: Subject: Re: [PATCH 3/3] Btrfs: account for large extents with enospc References: <1423685339-8278-1-git-send-email-jbacik@fb.com> <1423685339-8278-3-git-send-email-jbacik@fb.com> <20150212043653.GB2416@localhost.localdomain> In-Reply-To: <20150212043653.GB2416@localhost.localdomain> Content-Type: text/plain; charset="windows-1252"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: On 02/11/2015 11:36 PM, Liu Bo wrote: > On Wed, Feb 11, 2015 at 03:08:59PM -0500, Josef Bacik wrote: >> On our gluster boxes we stream large tar balls of backups onto our fses. With >> 160gb of ram this means we get really large contiguous ranges of dirty data, but >> the way our ENOSPC stuff works is that as long as it's contiguous we only hold >> metadata reservation for one extent. The problem is we limit our extents to >> 128mb, so we'll end up with at least 800 extents so our enospc accounting is >> quite a bit lower than what we need. To keep track of this make sure we >> increase outstanding_extents for every multiple of the max extent size so we can >> be sure to have enough reserved metadata space. Thanks, >> >> Signed-off-by: Josef Bacik >> --- >> fs/btrfs/ctree.h | 2 ++ >> fs/btrfs/extent-tree.c | 16 +++++++++---- >> fs/btrfs/extent_io.c | 2 +- >> fs/btrfs/inode.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++- >> 4 files changed, 76 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 0b4683f..1675602 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -198,6 +198,8 @@ static int btrfs_csum_sizes[] = { 4, 0 }; >> >> #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) >> >> +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) >> + >> /* >> * The key defines the order in the tree, and so it also defines (optimal) >> * block layout. >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index a346487..eb30b90 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, >> /** >> * drop_outstanding_extent - drop an outstanding extent >> * @inode: the inode we're dropping the extent for >> + * @num_bytes: the number of bytes we're relaseing. >> * >> * This is called when we are freeing up an outstanding extent, either called >> * after an error or after an extent is written. This will return the number of >> * reserved extents that need to be freed. This must be called with >> * BTRFS_I(inode)->lock held. >> */ >> -static unsigned drop_outstanding_extent(struct inode *inode) >> +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) >> { >> unsigned drop_inode_space = 0; >> unsigned dropped_extents = 0; >> + unsigned num_extents = 0; >> >> - BUG_ON(!BTRFS_I(inode)->outstanding_extents); >> - BTRFS_I(inode)->outstanding_extents--; >> + num_extents = (unsigned)div64_u64(num_bytes + >> + BTRFS_MAX_EXTENT_SIZE - 1, >> + BTRFS_MAX_EXTENT_SIZE); > > A fastpath is better, like btrfs_merge_extent_hook(). > > (num_extents < BTRFS_MAX_EXTENT_SIZE) ? num_extents = 1 : (div64_u64(...)) Agreed. > >> + ASSERT(num_extents); >> + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); >> + BTRFS_I(inode)->outstanding_extents -= num_extents; >> >> if (BTRFS_I(inode)->outstanding_extents == 0 && >> test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, >> @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) >> >> out_fail: >> spin_lock(&BTRFS_I(inode)->lock); >> - dropped = drop_outstanding_extent(inode); >> + dropped = drop_outstanding_extent(inode, num_bytes); >> /* >> * If the inodes csum_bytes is the same as the original >> * csum_bytes then we know we haven't raced with any free()ers >> @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) >> >> num_bytes = ALIGN(num_bytes, root->sectorsize); >> spin_lock(&BTRFS_I(inode)->lock); >> - dropped = drop_outstanding_extent(inode); >> + dropped = drop_outstanding_extent(inode, num_bytes); >> >> if (num_bytes) >> to_free = calc_csum_metadata_size(inode, num_bytes, 0); >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> index 4ebabd2..3fbc177 100644 >> --- a/fs/btrfs/extent_io.c >> +++ b/fs/btrfs/extent_io.c >> @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, >> page, >> &delalloc_start, >> &delalloc_end, >> - 128 * 1024 * 1024); >> + BTRFS_MAX_EXTENT_SIZE); >> if (nr_delalloc == 0) { >> delalloc_start = delalloc_end + 1; >> continue; >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index fb16fd3..4564975 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -1530,10 +1530,45 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, >> static void btrfs_split_extent_hook(struct inode *inode, >> struct extent_state *orig, u64 split) >> { >> + u64 size; >> + >> /* not delalloc, ignore it */ >> if (!(orig->state & EXTENT_DELALLOC)) >> return; >> >> + size = orig->end - orig->start + 1; >> + if (size > BTRFS_MAX_EXTENT_SIZE) { >> + u64 num_extents; >> + u64 new_size; >> + >> + /* >> + * We need the largest size of the remaining extent to see if we >> + * need to add a new outstanding extent. Think of the following >> + * case >> + * >> + * [MEAX_EXTENT_SIZEx2 - 4k][4k] > > s/MEAX/MAX > >> + * >> + * The new_size would just be 4k and we'd think we had enough >> + * outstanding extents for this if we only took one side of the >> + * split, same goes for the other direction. We need to see if >> + * the larger size still is the same amount of extents as the >> + * original size, because if it is we need to add a new >> + * outstanding extent. But if we split up and the larger size >> + * is less than the original then we are good to go since we've >> + * already accounted for the extra extent in our original >> + * accounting. >> + */ >> + new_size = orig->end - split + 1; >> + if ((split - orig->start) > new_size) >> + new_size = split - orig->start; >> + >> + num_extents = div64_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, >> + BTRFS_MAX_EXTENT_SIZE); >> + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, >> + BTRFS_MAX_EXTENT_SIZE) < num_extents) > > BTRFS_MAX_EXTENT_SIZE (128mb) means 1 << BTRFS_MAX_EXTENT_BIT (27), can we use '>>' instead of div64_u64? > Yeah that's not a bad idea, I'll do that. Thanks, Josef