* [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write
@ 2015-02-11 20:08 Josef Bacik
2015-02-11 20:08 ` [PATCH 2/3] Btrfs: don't set and clear delalloc for O_DIRECT writes Josef Bacik
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Josef Bacik @ 2015-02-11 20:08 UTC (permalink / raw)
To: linux-btrfs
We have this weird dance where we always inc outstanding_extents when we do a
O_DIRECT write, even if we allocate the entire range. To get around this we
also drop the metadata space if we successfully write. This is an unnecessary
dance, we only need to jack up outstanding_extents if we don't satisfy the
entire range request in get_blocks_direct, otherwise we are good using our
original reservation. So drop the unconditional inc and the drop of the
metadata space that we have for the unconditional inc. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
fs/btrfs/inode.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8a036ed..e78a2fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7137,6 +7137,7 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
u64 start = iblock << inode->i_blkbits;
u64 lockstart, lockend;
u64 len = bh_result->b_size;
+ u64 orig_len = len;
int unlock_bits = EXTENT_LOCKED;
int ret = 0;
@@ -7272,9 +7273,11 @@ unlock:
if (start + len > i_size_read(inode))
i_size_write(inode, start + len);
- spin_lock(&BTRFS_I(inode)->lock);
- BTRFS_I(inode)->outstanding_extents++;
- spin_unlock(&BTRFS_I(inode)->lock);
+ if (len < orig_len) {
+ spin_lock(&BTRFS_I(inode)->lock);
+ BTRFS_I(inode)->outstanding_extents++;
+ spin_unlock(&BTRFS_I(inode)->lock);
+ }
ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
lockstart + len - 1, EXTENT_DELALLOC, NULL,
@@ -8056,8 +8059,6 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
else if (ret >= 0 && (size_t)ret < count)
btrfs_delalloc_release_space(inode,
count - (size_t)ret);
- else
- btrfs_delalloc_release_metadata(inode, 0);
}
out:
if (wakeup)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH 2/3] Btrfs: don't set and clear delalloc for O_DIRECT writes 2015-02-11 20:08 [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write Josef Bacik @ 2015-02-11 20:08 ` Josef Bacik 2015-02-12 4:45 ` Liu Bo 2015-02-11 20:08 ` [PATCH 3/3] Btrfs: account for large extents with enospc Josef Bacik 2015-02-12 3:52 ` [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write Liu Bo 2 siblings, 1 reply; 11+ messages in thread From: Josef Bacik @ 2015-02-11 20:08 UTC (permalink / raw) To: linux-btrfs We do this to get the space accounting, but this is just needless churn on the io_tree, so just drop setting/clearing delalloc and just drop the reserved data space when we have a successfull allocation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> --- fs/btrfs/inode.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e78a2fd..fb16fd3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7142,7 +7142,7 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock, int ret = 0; if (create) - unlock_bits |= EXTENT_DELALLOC | EXTENT_DIRTY; + unlock_bits |= EXTENT_DIRTY; else len = min_t(u64, len, root->sectorsize); @@ -7278,11 +7278,7 @@ unlock: BTRFS_I(inode)->outstanding_extents++; spin_unlock(&BTRFS_I(inode)->lock); } - - ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, - lockstart + len - 1, EXTENT_DELALLOC, NULL, - &cached_state, GFP_NOFS); - BUG_ON(ret); + btrfs_free_reserved_data_space(inode, len); } /* -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Btrfs: don't set and clear delalloc for O_DIRECT writes 2015-02-11 20:08 ` [PATCH 2/3] Btrfs: don't set and clear delalloc for O_DIRECT writes Josef Bacik @ 2015-02-12 4:45 ` Liu Bo 0 siblings, 0 replies; 11+ messages in thread From: Liu Bo @ 2015-02-12 4:45 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Wed, Feb 11, 2015 at 03:08:58PM -0500, Josef Bacik wrote: > We do this to get the space accounting, but this is just needless churn on the > io_tree, so just drop setting/clearing delalloc and just drop the reserved data > space when we have a successfull allocation. Thanks, > Looks good. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> > Signed-off-by: Josef Bacik <jbacik@fb.com> > --- > fs/btrfs/inode.c | 8 ++------ > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index e78a2fd..fb16fd3 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7142,7 +7142,7 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock, > int ret = 0; > > if (create) > - unlock_bits |= EXTENT_DELALLOC | EXTENT_DIRTY; > + unlock_bits |= EXTENT_DIRTY; > else > len = min_t(u64, len, root->sectorsize); > > @@ -7278,11 +7278,7 @@ unlock: > BTRFS_I(inode)->outstanding_extents++; > spin_unlock(&BTRFS_I(inode)->lock); > } > - > - ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, > - lockstart + len - 1, EXTENT_DELALLOC, NULL, > - &cached_state, GFP_NOFS); > - BUG_ON(ret); > + btrfs_free_reserved_data_space(inode, len); > } > > /* > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 3/3] Btrfs: account for large extents with enospc 2015-02-11 20:08 [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write Josef Bacik 2015-02-11 20:08 ` [PATCH 2/3] Btrfs: don't set and clear delalloc for O_DIRECT writes Josef Bacik @ 2015-02-11 20:08 ` Josef Bacik 2015-02-12 4:36 ` Liu Bo 2015-03-06 20:01 ` [PATCH 3/3] " Filipe David Manana 2015-02-12 3:52 ` [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write Liu Bo 2 siblings, 2 replies; 11+ messages in thread From: Josef Bacik @ 2015-02-11 20:08 UTC (permalink / raw) To: linux-btrfs On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/extent-tree.c | 16 +++++++++---- fs/btrfs/extent_io.c | 2 +- fs/btrfs/inode.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 76 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0b4683f..1675602 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -198,6 +198,8 @@ static int btrfs_csum_sizes[] = { 4, 0 }; #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) + /* * The key defines the order in the tree, and so it also defines (optimal) * block layout. diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a346487..eb30b90 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, /** * drop_outstanding_extent - drop an outstanding extent * @inode: the inode we're dropping the extent for + * @num_bytes: the number of bytes we're relaseing. * * This is called when we are freeing up an outstanding extent, either called * after an error or after an extent is written. This will return the number of * reserved extents that need to be freed. This must be called with * BTRFS_I(inode)->lock held. */ -static unsigned drop_outstanding_extent(struct inode *inode) +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) { unsigned drop_inode_space = 0; unsigned dropped_extents = 0; + unsigned num_extents = 0; - BUG_ON(!BTRFS_I(inode)->outstanding_extents); - BTRFS_I(inode)->outstanding_extents--; + num_extents = (unsigned)div64_u64(num_bytes + + BTRFS_MAX_EXTENT_SIZE - 1, + BTRFS_MAX_EXTENT_SIZE); + ASSERT(num_extents); + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); + BTRFS_I(inode)->outstanding_extents -= num_extents; if (BTRFS_I(inode)->outstanding_extents == 0 && test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) out_fail: spin_lock(&BTRFS_I(inode)->lock); - dropped = drop_outstanding_extent(inode); + dropped = drop_outstanding_extent(inode, num_bytes); /* * If the inodes csum_bytes is the same as the original * csum_bytes then we know we haven't raced with any free()ers @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) num_bytes = ALIGN(num_bytes, root->sectorsize); spin_lock(&BTRFS_I(inode)->lock); - dropped = drop_outstanding_extent(inode); + dropped = drop_outstanding_extent(inode, num_bytes); if (num_bytes) to_free = calc_csum_metadata_size(inode, num_bytes, 0); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 4ebabd2..3fbc177 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, page, &delalloc_start, &delalloc_end, - 128 * 1024 * 1024); + BTRFS_MAX_EXTENT_SIZE); if (nr_delalloc == 0) { delalloc_start = delalloc_end + 1; continue; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index fb16fd3..4564975 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1530,10 +1530,45 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, static void btrfs_split_extent_hook(struct inode *inode, struct extent_state *orig, u64 split) { + u64 size; + /* not delalloc, ignore it */ if (!(orig->state & EXTENT_DELALLOC)) return; + size = orig->end - orig->start + 1; + if (size > BTRFS_MAX_EXTENT_SIZE) { + u64 num_extents; + u64 new_size; + + /* + * We need the largest size of the remaining extent to see if we + * need to add a new outstanding extent. Think of the following + * case + * + * [MEAX_EXTENT_SIZEx2 - 4k][4k] + * + * The new_size would just be 4k and we'd think we had enough + * outstanding extents for this if we only took one side of the + * split, same goes for the other direction. We need to see if + * the larger size still is the same amount of extents as the + * original size, because if it is we need to add a new + * outstanding extent. But if we split up and the larger size + * is less than the original then we are good to go since we've + * already accounted for the extra extent in our original + * accounting. + */ + new_size = orig->end - split + 1; + if ((split - orig->start) > new_size) + new_size = split - orig->start; + + num_extents = div64_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, + BTRFS_MAX_EXTENT_SIZE); + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, + BTRFS_MAX_EXTENT_SIZE) < num_extents) + return; + } + spin_lock(&BTRFS_I(inode)->lock); BTRFS_I(inode)->outstanding_extents++; spin_unlock(&BTRFS_I(inode)->lock); @@ -1549,10 +1584,34 @@ static void btrfs_merge_extent_hook(struct inode *inode, struct extent_state *new, struct extent_state *other) { + u64 new_size, old_size; + u64 num_extents; + /* not delalloc, ignore it */ if (!(other->state & EXTENT_DELALLOC)) return; + old_size = other->end - other->start + 1; + new_size = old_size + (new->end - new->start + 1); + + /* we're not bigger than the max, unreserve the space and go */ + if (new_size <= BTRFS_MAX_EXTENT_SIZE) { + spin_lock(&BTRFS_I(inode)->lock); + BTRFS_I(inode)->outstanding_extents--; + spin_unlock(&BTRFS_I(inode)->lock); + return; + } + + /* + * If we grew by another max_extent, just return, we want to keep that + * reserved amount. + */ + num_extents = div64_u64(old_size + BTRFS_MAX_EXTENT_SIZE - 1, + BTRFS_MAX_EXTENT_SIZE); + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, + BTRFS_MAX_EXTENT_SIZE) > num_extents) + return; + spin_lock(&BTRFS_I(inode)->lock); BTRFS_I(inode)->outstanding_extents--; spin_unlock(&BTRFS_I(inode)->lock); @@ -1648,6 +1707,8 @@ static void btrfs_clear_bit_hook(struct inode *inode, unsigned long *bits) { u64 len = state->end + 1 - state->start; + u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE -1, + BTRFS_MAX_EXTENT_SIZE); spin_lock(&BTRFS_I(inode)->lock); if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) @@ -1667,7 +1728,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, *bits &= ~EXTENT_FIRST_DELALLOC; } else if (!(*bits & EXTENT_DO_ACCOUNTING)) { spin_lock(&BTRFS_I(inode)->lock); - BTRFS_I(inode)->outstanding_extents--; + BTRFS_I(inode)->outstanding_extents -= num_extents; spin_unlock(&BTRFS_I(inode)->lock); } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] Btrfs: account for large extents with enospc 2015-02-11 20:08 ` [PATCH 3/3] Btrfs: account for large extents with enospc Josef Bacik @ 2015-02-12 4:36 ` Liu Bo 2015-02-12 14:53 ` Josef Bacik 2015-02-12 16:16 ` [PATCH 3/3 v2] " Josef Bacik 2015-03-06 20:01 ` [PATCH 3/3] " Filipe David Manana 1 sibling, 2 replies; 11+ messages in thread From: Liu Bo @ 2015-02-12 4:36 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Wed, Feb 11, 2015 at 03:08:59PM -0500, Josef Bacik wrote: > On our gluster boxes we stream large tar balls of backups onto our fses. With > 160gb of ram this means we get really large contiguous ranges of dirty data, but > the way our ENOSPC stuff works is that as long as it's contiguous we only hold > metadata reservation for one extent. The problem is we limit our extents to > 128mb, so we'll end up with at least 800 extents so our enospc accounting is > quite a bit lower than what we need. To keep track of this make sure we > increase outstanding_extents for every multiple of the max extent size so we can > be sure to have enough reserved metadata space. Thanks, > > Signed-off-by: Josef Bacik <jbacik@fb.com> > --- > fs/btrfs/ctree.h | 2 ++ > fs/btrfs/extent-tree.c | 16 +++++++++---- > fs/btrfs/extent_io.c | 2 +- > fs/btrfs/inode.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++- > 4 files changed, 76 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 0b4683f..1675602 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -198,6 +198,8 @@ static int btrfs_csum_sizes[] = { 4, 0 }; > > #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) > > +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) > + > /* > * The key defines the order in the tree, and so it also defines (optimal) > * block layout. > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a346487..eb30b90 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, > /** > * drop_outstanding_extent - drop an outstanding extent > * @inode: the inode we're dropping the extent for > + * @num_bytes: the number of bytes we're relaseing. > * > * This is called when we are freeing up an outstanding extent, either called > * after an error or after an extent is written. This will return the number of > * reserved extents that need to be freed. This must be called with > * BTRFS_I(inode)->lock held. > */ > -static unsigned drop_outstanding_extent(struct inode *inode) > +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) > { > unsigned drop_inode_space = 0; > unsigned dropped_extents = 0; > + unsigned num_extents = 0; > > - BUG_ON(!BTRFS_I(inode)->outstanding_extents); > - BTRFS_I(inode)->outstanding_extents--; > + num_extents = (unsigned)div64_u64(num_bytes + > + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE); A fastpath is better, like btrfs_merge_extent_hook(). (num_extents < BTRFS_MAX_EXTENT_SIZE) ? num_extents = 1 : (div64_u64(...)) > + ASSERT(num_extents); > + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); > + BTRFS_I(inode)->outstanding_extents -= num_extents; > > if (BTRFS_I(inode)->outstanding_extents == 0 && > test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, > @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) > > out_fail: > spin_lock(&BTRFS_I(inode)->lock); > - dropped = drop_outstanding_extent(inode); > + dropped = drop_outstanding_extent(inode, num_bytes); > /* > * If the inodes csum_bytes is the same as the original > * csum_bytes then we know we haven't raced with any free()ers > @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) > > num_bytes = ALIGN(num_bytes, root->sectorsize); > spin_lock(&BTRFS_I(inode)->lock); > - dropped = drop_outstanding_extent(inode); > + dropped = drop_outstanding_extent(inode, num_bytes); > > if (num_bytes) > to_free = calc_csum_metadata_size(inode, num_bytes, 0); > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 4ebabd2..3fbc177 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, > page, > &delalloc_start, > &delalloc_end, > - 128 * 1024 * 1024); > + BTRFS_MAX_EXTENT_SIZE); > if (nr_delalloc == 0) { > delalloc_start = delalloc_end + 1; > continue; > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index fb16fd3..4564975 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -1530,10 +1530,45 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, > static void btrfs_split_extent_hook(struct inode *inode, > struct extent_state *orig, u64 split) > { > + u64 size; > + > /* not delalloc, ignore it */ > if (!(orig->state & EXTENT_DELALLOC)) > return; > > + size = orig->end - orig->start + 1; > + if (size > BTRFS_MAX_EXTENT_SIZE) { > + u64 num_extents; > + u64 new_size; > + > + /* > + * We need the largest size of the remaining extent to see if we > + * need to add a new outstanding extent. Think of the following > + * case > + * > + * [MEAX_EXTENT_SIZEx2 - 4k][4k] s/MEAX/MAX > + * > + * The new_size would just be 4k and we'd think we had enough > + * outstanding extents for this if we only took one side of the > + * split, same goes for the other direction. We need to see if > + * the larger size still is the same amount of extents as the > + * original size, because if it is we need to add a new > + * outstanding extent. But if we split up and the larger size > + * is less than the original then we are good to go since we've > + * already accounted for the extra extent in our original > + * accounting. > + */ > + new_size = orig->end - split + 1; > + if ((split - orig->start) > new_size) > + new_size = split - orig->start; > + > + num_extents = div64_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE); > + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE) < num_extents) BTRFS_MAX_EXTENT_SIZE (128mb) means 1 << BTRFS_MAX_EXTENT_BIT (27), can we use '>>' instead of div64_u64? Thanks, -liubo > + return; > + } > + > spin_lock(&BTRFS_I(inode)->lock); > BTRFS_I(inode)->outstanding_extents++; > spin_unlock(&BTRFS_I(inode)->lock); > @@ -1549,10 +1584,34 @@ static void btrfs_merge_extent_hook(struct inode *inode, > struct extent_state *new, > struct extent_state *other) > { > + u64 new_size, old_size; > + u64 num_extents; > + > /* not delalloc, ignore it */ > if (!(other->state & EXTENT_DELALLOC)) > return; > > + old_size = other->end - other->start + 1; > + new_size = old_size + (new->end - new->start + 1); > + > + /* we're not bigger than the max, unreserve the space and go */ > + if (new_size <= BTRFS_MAX_EXTENT_SIZE) { > + spin_lock(&BTRFS_I(inode)->lock); > + BTRFS_I(inode)->outstanding_extents--; > + spin_unlock(&BTRFS_I(inode)->lock); > + return; > + } > + > + /* > + * If we grew by another max_extent, just return, we want to keep that > + * reserved amount. > + */ > + num_extents = div64_u64(old_size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE); > + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE) > num_extents) > + return; > + > spin_lock(&BTRFS_I(inode)->lock); > BTRFS_I(inode)->outstanding_extents--; > spin_unlock(&BTRFS_I(inode)->lock); > @@ -1648,6 +1707,8 @@ static void btrfs_clear_bit_hook(struct inode *inode, > unsigned long *bits) > { > u64 len = state->end + 1 - state->start; > + u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE -1, > + BTRFS_MAX_EXTENT_SIZE); > > spin_lock(&BTRFS_I(inode)->lock); > if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) > @@ -1667,7 +1728,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, > *bits &= ~EXTENT_FIRST_DELALLOC; > } else if (!(*bits & EXTENT_DO_ACCOUNTING)) { > spin_lock(&BTRFS_I(inode)->lock); > - BTRFS_I(inode)->outstanding_extents--; > + BTRFS_I(inode)->outstanding_extents -= num_extents; > spin_unlock(&BTRFS_I(inode)->lock); > } > > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] Btrfs: account for large extents with enospc 2015-02-12 4:36 ` Liu Bo @ 2015-02-12 14:53 ` Josef Bacik 2015-02-12 16:16 ` [PATCH 3/3 v2] " Josef Bacik 1 sibling, 0 replies; 11+ messages in thread From: Josef Bacik @ 2015-02-12 14:53 UTC (permalink / raw) To: bo.li.liu; +Cc: linux-btrfs On 02/11/2015 11:36 PM, Liu Bo wrote: > On Wed, Feb 11, 2015 at 03:08:59PM -0500, Josef Bacik wrote: >> On our gluster boxes we stream large tar balls of backups onto our fses. With >> 160gb of ram this means we get really large contiguous ranges of dirty data, but >> the way our ENOSPC stuff works is that as long as it's contiguous we only hold >> metadata reservation for one extent. The problem is we limit our extents to >> 128mb, so we'll end up with at least 800 extents so our enospc accounting is >> quite a bit lower than what we need. To keep track of this make sure we >> increase outstanding_extents for every multiple of the max extent size so we can >> be sure to have enough reserved metadata space. Thanks, >> >> Signed-off-by: Josef Bacik <jbacik@fb.com> >> --- >> fs/btrfs/ctree.h | 2 ++ >> fs/btrfs/extent-tree.c | 16 +++++++++---- >> fs/btrfs/extent_io.c | 2 +- >> fs/btrfs/inode.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++- >> 4 files changed, 76 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 0b4683f..1675602 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -198,6 +198,8 @@ static int btrfs_csum_sizes[] = { 4, 0 }; >> >> #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) >> >> +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) >> + >> /* >> * The key defines the order in the tree, and so it also defines (optimal) >> * block layout. >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index a346487..eb30b90 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, >> /** >> * drop_outstanding_extent - drop an outstanding extent >> * @inode: the inode we're dropping the extent for >> + * @num_bytes: the number of bytes we're relaseing. >> * >> * This is called when we are freeing up an outstanding extent, either called >> * after an error or after an extent is written. This will return the number of >> * reserved extents that need to be freed. This must be called with >> * BTRFS_I(inode)->lock held. >> */ >> -static unsigned drop_outstanding_extent(struct inode *inode) >> +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) >> { >> unsigned drop_inode_space = 0; >> unsigned dropped_extents = 0; >> + unsigned num_extents = 0; >> >> - BUG_ON(!BTRFS_I(inode)->outstanding_extents); >> - BTRFS_I(inode)->outstanding_extents--; >> + num_extents = (unsigned)div64_u64(num_bytes + >> + BTRFS_MAX_EXTENT_SIZE - 1, >> + BTRFS_MAX_EXTENT_SIZE); > > A fastpath is better, like btrfs_merge_extent_hook(). > > (num_extents < BTRFS_MAX_EXTENT_SIZE) ? num_extents = 1 : (div64_u64(...)) Agreed. > >> + ASSERT(num_extents); >> + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); >> + BTRFS_I(inode)->outstanding_extents -= num_extents; >> >> if (BTRFS_I(inode)->outstanding_extents == 0 && >> test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, >> @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) >> >> out_fail: >> spin_lock(&BTRFS_I(inode)->lock); >> - dropped = drop_outstanding_extent(inode); >> + dropped = drop_outstanding_extent(inode, num_bytes); >> /* >> * If the inodes csum_bytes is the same as the original >> * csum_bytes then we know we haven't raced with any free()ers >> @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) >> >> num_bytes = ALIGN(num_bytes, root->sectorsize); >> spin_lock(&BTRFS_I(inode)->lock); >> - dropped = drop_outstanding_extent(inode); >> + dropped = drop_outstanding_extent(inode, num_bytes); >> >> if (num_bytes) >> to_free = calc_csum_metadata_size(inode, num_bytes, 0); >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> index 4ebabd2..3fbc177 100644 >> --- a/fs/btrfs/extent_io.c >> +++ b/fs/btrfs/extent_io.c >> @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, >> page, >> &delalloc_start, >> &delalloc_end, >> - 128 * 1024 * 1024); >> + BTRFS_MAX_EXTENT_SIZE); >> if (nr_delalloc == 0) { >> delalloc_start = delalloc_end + 1; >> continue; >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index fb16fd3..4564975 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -1530,10 +1530,45 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, >> static void btrfs_split_extent_hook(struct inode *inode, >> struct extent_state *orig, u64 split) >> { >> + u64 size; >> + >> /* not delalloc, ignore it */ >> if (!(orig->state & EXTENT_DELALLOC)) >> return; >> >> + size = orig->end - orig->start + 1; >> + if (size > BTRFS_MAX_EXTENT_SIZE) { >> + u64 num_extents; >> + u64 new_size; >> + >> + /* >> + * We need the largest size of the remaining extent to see if we >> + * need to add a new outstanding extent. Think of the following >> + * case >> + * >> + * [MEAX_EXTENT_SIZEx2 - 4k][4k] > > s/MEAX/MAX > >> + * >> + * The new_size would just be 4k and we'd think we had enough >> + * outstanding extents for this if we only took one side of the >> + * split, same goes for the other direction. We need to see if >> + * the larger size still is the same amount of extents as the >> + * original size, because if it is we need to add a new >> + * outstanding extent. But if we split up and the larger size >> + * is less than the original then we are good to go since we've >> + * already accounted for the extra extent in our original >> + * accounting. >> + */ >> + new_size = orig->end - split + 1; >> + if ((split - orig->start) > new_size) >> + new_size = split - orig->start; >> + >> + num_extents = div64_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, >> + BTRFS_MAX_EXTENT_SIZE); >> + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, >> + BTRFS_MAX_EXTENT_SIZE) < num_extents) > > BTRFS_MAX_EXTENT_SIZE (128mb) means 1 << BTRFS_MAX_EXTENT_BIT (27), can we use '>>' instead of div64_u64? > Yeah that's not a bad idea, I'll do that. Thanks, Josef ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 3/3 v2] Btrfs: account for large extents with enospc 2015-02-12 4:36 ` Liu Bo 2015-02-12 14:53 ` Josef Bacik @ 2015-02-12 16:16 ` Josef Bacik 2015-02-13 2:06 ` Liu Bo 2015-02-13 15:44 ` David Sterba 1 sibling, 2 replies; 11+ messages in thread From: Josef Bacik @ 2015-02-12 16:16 UTC (permalink / raw) To: linux-btrfs On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> --- V1->V2: -use bit shift instead of div64_u64. -skip math if num_bytes < max_size in drop_outstanding_extents fs/btrfs/ctree.h | 8 ++++++++ fs/btrfs/extent-tree.c | 16 ++++++++++----- fs/btrfs/extent_io.c | 2 +- fs/btrfs/inode.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 73 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0b4683f..afdae42 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -198,6 +198,14 @@ static int btrfs_csum_sizes[] = { 4, 0 }; #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) +#define BTRFS_MAX_EXTENT_BIT 27 + +static inline u64 btrfs_num_extents(u64 num_bytes) +{ + return (num_bytes + BTRFS_MAX_EXTENT_SIZE - 1) >> BTRFS_MAX_EXTENT_BIT; +} + /* * The key defines the order in the tree, and so it also defines (optimal) * block layout. diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a346487..18e8a5a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, /** * drop_outstanding_extent - drop an outstanding extent * @inode: the inode we're dropping the extent for + * @num_bytes: the number of bytes we're relaseing. * * This is called when we are freeing up an outstanding extent, either called * after an error or after an extent is written. This will return the number of * reserved extents that need to be freed. This must be called with * BTRFS_I(inode)->lock held. */ -static unsigned drop_outstanding_extent(struct inode *inode) +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) { unsigned drop_inode_space = 0; unsigned dropped_extents = 0; + unsigned num_extents = 0; - BUG_ON(!BTRFS_I(inode)->outstanding_extents); - BTRFS_I(inode)->outstanding_extents--; + num_extents = (num_bytes > BTRFS_MAX_EXTENT_SIZE) ? + (unsigned)btrfs_num_extents(num_bytes) : 1; + + ASSERT(num_extents); + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); + BTRFS_I(inode)->outstanding_extents -= num_extents; if (BTRFS_I(inode)->outstanding_extents == 0 && test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) out_fail: spin_lock(&BTRFS_I(inode)->lock); - dropped = drop_outstanding_extent(inode); + dropped = drop_outstanding_extent(inode, num_bytes); /* * If the inodes csum_bytes is the same as the original * csum_bytes then we know we haven't raced with any free()ers @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) num_bytes = ALIGN(num_bytes, root->sectorsize); spin_lock(&BTRFS_I(inode)->lock); - dropped = drop_outstanding_extent(inode); + dropped = drop_outstanding_extent(inode, num_bytes); if (num_bytes) to_free = calc_csum_metadata_size(inode, num_bytes, 0); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 4ebabd2..3fbc177 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, page, &delalloc_start, &delalloc_end, - 128 * 1024 * 1024); + BTRFS_MAX_EXTENT_SIZE); if (nr_delalloc == 0) { delalloc_start = delalloc_end + 1; continue; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index fb16fd3..324cb6d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1530,10 +1530,41 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, static void btrfs_split_extent_hook(struct inode *inode, struct extent_state *orig, u64 split) { + u64 size; + /* not delalloc, ignore it */ if (!(orig->state & EXTENT_DELALLOC)) return; + size = orig->end - orig->start + 1; + if (size > BTRFS_MAX_EXTENT_SIZE) { + u64 new_size; + + /* + * We need the largest size of the remaining extent to see if we + * need to add a new outstanding extent. Think of the following + * case + * + * [MEAX_EXTENT_SIZEx2 - 4k][4k] + * + * The new_size would just be 4k and we'd think we had enough + * outstanding extents for this if we only took one side of the + * split, same goes for the other direction. We need to see if + * the larger size still is the same amount of extents as the + * original size, because if it is we need to add a new + * outstanding extent. But if we split up and the larger size + * is less than the original then we are good to go since we've + * already accounted for the extra extent in our original + * accounting. + */ + new_size = orig->end - split + 1; + if ((split - orig->start) > new_size) + new_size = split - orig->start; + + if (btrfs_num_extents(size) > btrfs_num_extents(new_size)) + return; + } + spin_lock(&BTRFS_I(inode)->lock); BTRFS_I(inode)->outstanding_extents++; spin_unlock(&BTRFS_I(inode)->lock); @@ -1549,10 +1580,30 @@ static void btrfs_merge_extent_hook(struct inode *inode, struct extent_state *new, struct extent_state *other) { + u64 new_size, old_size; + /* not delalloc, ignore it */ if (!(other->state & EXTENT_DELALLOC)) return; + old_size = other->end - other->start + 1; + new_size = old_size + (new->end - new->start + 1); + + /* we're not bigger than the max, unreserve the space and go */ + if (new_size <= BTRFS_MAX_EXTENT_SIZE) { + spin_lock(&BTRFS_I(inode)->lock); + BTRFS_I(inode)->outstanding_extents--; + spin_unlock(&BTRFS_I(inode)->lock); + return; + } + + /* + * If we grew by another max_extent, just return, we want to keep that + * reserved amount. + */ + if (btrfs_num_extents(new_size) > btrfs_num_extents(old_size)) + return; + spin_lock(&BTRFS_I(inode)->lock); BTRFS_I(inode)->outstanding_extents--; spin_unlock(&BTRFS_I(inode)->lock); @@ -1648,6 +1699,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, unsigned long *bits) { u64 len = state->end + 1 - state->start; + u64 num_extents = btrfs_num_extents(len); spin_lock(&BTRFS_I(inode)->lock); if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) @@ -1667,7 +1719,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, *bits &= ~EXTENT_FIRST_DELALLOC; } else if (!(*bits & EXTENT_DO_ACCOUNTING)) { spin_lock(&BTRFS_I(inode)->lock); - BTRFS_I(inode)->outstanding_extents--; + BTRFS_I(inode)->outstanding_extents -= num_extents; spin_unlock(&BTRFS_I(inode)->lock); } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3 v2] Btrfs: account for large extents with enospc 2015-02-12 16:16 ` [PATCH 3/3 v2] " Josef Bacik @ 2015-02-13 2:06 ` Liu Bo 2015-02-13 15:44 ` David Sterba 1 sibling, 0 replies; 11+ messages in thread From: Liu Bo @ 2015-02-13 2:06 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Thu, Feb 12, 2015 at 11:16:51AM -0500, Josef Bacik wrote: > On our gluster boxes we stream large tar balls of backups onto our fses. With > 160gb of ram this means we get really large contiguous ranges of dirty data, but > the way our ENOSPC stuff works is that as long as it's contiguous we only hold > metadata reservation for one extent. The problem is we limit our extents to > 128mb, so we'll end up with at least 800 extents so our enospc accounting is > quite a bit lower than what we need. To keep track of this make sure we > increase outstanding_extents for every multiple of the max extent size so we can > be sure to have enough reserved metadata space. Thanks, Looks good. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Thanks, -liubo > > Signed-off-by: Josef Bacik <jbacik@fb.com> > --- > V1->V2: > -use bit shift instead of div64_u64. > -skip math if num_bytes < max_size in drop_outstanding_extents > > fs/btrfs/ctree.h | 8 ++++++++ > fs/btrfs/extent-tree.c | 16 ++++++++++----- > fs/btrfs/extent_io.c | 2 +- > fs/btrfs/inode.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++- > 4 files changed, 73 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 0b4683f..afdae42 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -198,6 +198,14 @@ static int btrfs_csum_sizes[] = { 4, 0 }; > > #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) > > +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) > +#define BTRFS_MAX_EXTENT_BIT 27 > + > +static inline u64 btrfs_num_extents(u64 num_bytes) > +{ > + return (num_bytes + BTRFS_MAX_EXTENT_SIZE - 1) >> BTRFS_MAX_EXTENT_BIT; > +} > + > /* > * The key defines the order in the tree, and so it also defines (optimal) > * block layout. > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a346487..18e8a5a 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, > /** > * drop_outstanding_extent - drop an outstanding extent > * @inode: the inode we're dropping the extent for > + * @num_bytes: the number of bytes we're relaseing. > * > * This is called when we are freeing up an outstanding extent, either called > * after an error or after an extent is written. This will return the number of > * reserved extents that need to be freed. This must be called with > * BTRFS_I(inode)->lock held. > */ > -static unsigned drop_outstanding_extent(struct inode *inode) > +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) > { > unsigned drop_inode_space = 0; > unsigned dropped_extents = 0; > + unsigned num_extents = 0; > > - BUG_ON(!BTRFS_I(inode)->outstanding_extents); > - BTRFS_I(inode)->outstanding_extents--; > + num_extents = (num_bytes > BTRFS_MAX_EXTENT_SIZE) ? > + (unsigned)btrfs_num_extents(num_bytes) : 1; > + > + ASSERT(num_extents); > + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); > + BTRFS_I(inode)->outstanding_extents -= num_extents; > > if (BTRFS_I(inode)->outstanding_extents == 0 && > test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, > @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) > > out_fail: > spin_lock(&BTRFS_I(inode)->lock); > - dropped = drop_outstanding_extent(inode); > + dropped = drop_outstanding_extent(inode, num_bytes); > /* > * If the inodes csum_bytes is the same as the original > * csum_bytes then we know we haven't raced with any free()ers > @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) > > num_bytes = ALIGN(num_bytes, root->sectorsize); > spin_lock(&BTRFS_I(inode)->lock); > - dropped = drop_outstanding_extent(inode); > + dropped = drop_outstanding_extent(inode, num_bytes); > > if (num_bytes) > to_free = calc_csum_metadata_size(inode, num_bytes, 0); > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 4ebabd2..3fbc177 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, > page, > &delalloc_start, > &delalloc_end, > - 128 * 1024 * 1024); > + BTRFS_MAX_EXTENT_SIZE); > if (nr_delalloc == 0) { > delalloc_start = delalloc_end + 1; > continue; > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index fb16fd3..324cb6d 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -1530,10 +1530,41 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, > static void btrfs_split_extent_hook(struct inode *inode, > struct extent_state *orig, u64 split) > { > + u64 size; > + > /* not delalloc, ignore it */ > if (!(orig->state & EXTENT_DELALLOC)) > return; > > + size = orig->end - orig->start + 1; > + if (size > BTRFS_MAX_EXTENT_SIZE) { > + u64 new_size; > + > + /* > + * We need the largest size of the remaining extent to see if we > + * need to add a new outstanding extent. Think of the following > + * case > + * > + * [MEAX_EXTENT_SIZEx2 - 4k][4k] > + * > + * The new_size would just be 4k and we'd think we had enough > + * outstanding extents for this if we only took one side of the > + * split, same goes for the other direction. We need to see if > + * the larger size still is the same amount of extents as the > + * original size, because if it is we need to add a new > + * outstanding extent. But if we split up and the larger size > + * is less than the original then we are good to go since we've > + * already accounted for the extra extent in our original > + * accounting. > + */ > + new_size = orig->end - split + 1; > + if ((split - orig->start) > new_size) > + new_size = split - orig->start; > + > + if (btrfs_num_extents(size) > btrfs_num_extents(new_size)) > + return; > + } > + > spin_lock(&BTRFS_I(inode)->lock); > BTRFS_I(inode)->outstanding_extents++; > spin_unlock(&BTRFS_I(inode)->lock); > @@ -1549,10 +1580,30 @@ static void btrfs_merge_extent_hook(struct inode *inode, > struct extent_state *new, > struct extent_state *other) > { > + u64 new_size, old_size; > + > /* not delalloc, ignore it */ > if (!(other->state & EXTENT_DELALLOC)) > return; > > + old_size = other->end - other->start + 1; > + new_size = old_size + (new->end - new->start + 1); > + > + /* we're not bigger than the max, unreserve the space and go */ > + if (new_size <= BTRFS_MAX_EXTENT_SIZE) { > + spin_lock(&BTRFS_I(inode)->lock); > + BTRFS_I(inode)->outstanding_extents--; > + spin_unlock(&BTRFS_I(inode)->lock); > + return; > + } > + > + /* > + * If we grew by another max_extent, just return, we want to keep that > + * reserved amount. > + */ > + if (btrfs_num_extents(new_size) > btrfs_num_extents(old_size)) > + return; > + > spin_lock(&BTRFS_I(inode)->lock); > BTRFS_I(inode)->outstanding_extents--; > spin_unlock(&BTRFS_I(inode)->lock); > @@ -1648,6 +1699,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, > unsigned long *bits) > { > u64 len = state->end + 1 - state->start; > + u64 num_extents = btrfs_num_extents(len); > > spin_lock(&BTRFS_I(inode)->lock); > if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) > @@ -1667,7 +1719,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, > *bits &= ~EXTENT_FIRST_DELALLOC; > } else if (!(*bits & EXTENT_DO_ACCOUNTING)) { > spin_lock(&BTRFS_I(inode)->lock); > - BTRFS_I(inode)->outstanding_extents--; > + BTRFS_I(inode)->outstanding_extents -= num_extents; > spin_unlock(&BTRFS_I(inode)->lock); > } > > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3 v2] Btrfs: account for large extents with enospc 2015-02-12 16:16 ` [PATCH 3/3 v2] " Josef Bacik 2015-02-13 2:06 ` Liu Bo @ 2015-02-13 15:44 ` David Sterba 1 sibling, 0 replies; 11+ messages in thread From: David Sterba @ 2015-02-13 15:44 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Thu, Feb 12, 2015 at 11:16:51AM -0500, Josef Bacik wrote: > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -198,6 +198,14 @@ static int btrfs_csum_sizes[] = { 4, 0 }; > > #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) > > +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) > +#define BTRFS_MAX_EXTENT_BIT 27 #include <linux/ilog2.h> #define BTRFS_MAX_EXTENT_SHIFT ilog2(BTRFS_MAX_EXTENT_SIZE) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] Btrfs: account for large extents with enospc 2015-02-11 20:08 ` [PATCH 3/3] Btrfs: account for large extents with enospc Josef Bacik 2015-02-12 4:36 ` Liu Bo @ 2015-03-06 20:01 ` Filipe David Manana 1 sibling, 0 replies; 11+ messages in thread From: Filipe David Manana @ 2015-03-06 20:01 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs@vger.kernel.org On Wed, Feb 11, 2015 at 8:08 PM, Josef Bacik <jbacik@fb.com> wrote: > On our gluster boxes we stream large tar balls of backups onto our fses. With > 160gb of ram this means we get really large contiguous ranges of dirty data, but > the way our ENOSPC stuff works is that as long as it's contiguous we only hold > metadata reservation for one extent. The problem is we limit our extents to > 128mb, so we'll end up with at least 800 extents so our enospc accounting is > quite a bit lower than what we need. To keep track of this make sure we > increase outstanding_extents for every multiple of the max extent size so we can > be sure to have enough reserved metadata space. Thanks, > > Signed-off-by: Josef Bacik <jbacik@fb.com> Hi Josef, So this change causes the outstanding_extents to never decrease to zero or stay with very large values due to underflow, causing warnings such as the ones reported at https://bugzilla.kernel.org/show_bug.cgi?id=94401 or the ASSERT() you added to drop_outstanding_extent() when assertions are enabled. It's very easy to reproduce this, like with the following fio job config: [global] ioengine=sync direct=1 bssplit=130M/100 fallocate=none fadvise_hint=0 filename=foobar [job1] numjobs=1 size=600M rw=randwrite So the following happens: 1) btrfs_delalloc_reserve_metadata, num_bytes 136314880 (num_extents 2), outstanding extents == 0, incremented outstanding extents by 1 (and not num_extents) 2) btrfs_drop_outstanding_extent num_bytes 136314880 (num_extents 2), outstanding extents == 1, decrement outstanding extents by 2 (num_extents) giving 4294967295 For buffered writes (by setting direct=0 in that fio job config), outstanding_extents rarely gets decremented to 0, but I haven't checked why exactly (extent state merges need to happen at least for all the special cases in btrfs_merge_extent_hook). thanks > --- > fs/btrfs/ctree.h | 2 ++ > fs/btrfs/extent-tree.c | 16 +++++++++---- > fs/btrfs/extent_io.c | 2 +- > fs/btrfs/inode.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++- > 4 files changed, 76 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 0b4683f..1675602 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -198,6 +198,8 @@ static int btrfs_csum_sizes[] = { 4, 0 }; > > #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) > > +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) > + > /* > * The key defines the order in the tree, and so it also defines (optimal) > * block layout. > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a346487..eb30b90 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, > /** > * drop_outstanding_extent - drop an outstanding extent > * @inode: the inode we're dropping the extent for > + * @num_bytes: the number of bytes we're relaseing. > * > * This is called when we are freeing up an outstanding extent, either called > * after an error or after an extent is written. This will return the number of > * reserved extents that need to be freed. This must be called with > * BTRFS_I(inode)->lock held. > */ > -static unsigned drop_outstanding_extent(struct inode *inode) > +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) > { > unsigned drop_inode_space = 0; > unsigned dropped_extents = 0; > + unsigned num_extents = 0; > > - BUG_ON(!BTRFS_I(inode)->outstanding_extents); > - BTRFS_I(inode)->outstanding_extents--; > + num_extents = (unsigned)div64_u64(num_bytes + > + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE); > + ASSERT(num_extents); > + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); > + BTRFS_I(inode)->outstanding_extents -= num_extents; > > if (BTRFS_I(inode)->outstanding_extents == 0 && > test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, > @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) > > out_fail: > spin_lock(&BTRFS_I(inode)->lock); > - dropped = drop_outstanding_extent(inode); > + dropped = drop_outstanding_extent(inode, num_bytes); > /* > * If the inodes csum_bytes is the same as the original > * csum_bytes then we know we haven't raced with any free()ers > @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) > > num_bytes = ALIGN(num_bytes, root->sectorsize); > spin_lock(&BTRFS_I(inode)->lock); > - dropped = drop_outstanding_extent(inode); > + dropped = drop_outstanding_extent(inode, num_bytes); > > if (num_bytes) > to_free = calc_csum_metadata_size(inode, num_bytes, 0); > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 4ebabd2..3fbc177 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, > page, > &delalloc_start, > &delalloc_end, > - 128 * 1024 * 1024); > + BTRFS_MAX_EXTENT_SIZE); > if (nr_delalloc == 0) { > delalloc_start = delalloc_end + 1; > continue; > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index fb16fd3..4564975 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -1530,10 +1530,45 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, > static void btrfs_split_extent_hook(struct inode *inode, > struct extent_state *orig, u64 split) > { > + u64 size; > + > /* not delalloc, ignore it */ > if (!(orig->state & EXTENT_DELALLOC)) > return; > > + size = orig->end - orig->start + 1; > + if (size > BTRFS_MAX_EXTENT_SIZE) { > + u64 num_extents; > + u64 new_size; > + > + /* > + * We need the largest size of the remaining extent to see if we > + * need to add a new outstanding extent. Think of the following > + * case > + * > + * [MEAX_EXTENT_SIZEx2 - 4k][4k] > + * > + * The new_size would just be 4k and we'd think we had enough > + * outstanding extents for this if we only took one side of the > + * split, same goes for the other direction. We need to see if > + * the larger size still is the same amount of extents as the > + * original size, because if it is we need to add a new > + * outstanding extent. But if we split up and the larger size > + * is less than the original then we are good to go since we've > + * already accounted for the extra extent in our original > + * accounting. > + */ > + new_size = orig->end - split + 1; > + if ((split - orig->start) > new_size) > + new_size = split - orig->start; > + > + num_extents = div64_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE); > + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE) < num_extents) > + return; > + } > + > spin_lock(&BTRFS_I(inode)->lock); > BTRFS_I(inode)->outstanding_extents++; > spin_unlock(&BTRFS_I(inode)->lock); > @@ -1549,10 +1584,34 @@ static void btrfs_merge_extent_hook(struct inode *inode, > struct extent_state *new, > struct extent_state *other) > { > + u64 new_size, old_size; > + u64 num_extents; > + > /* not delalloc, ignore it */ > if (!(other->state & EXTENT_DELALLOC)) > return; > > + old_size = other->end - other->start + 1; > + new_size = old_size + (new->end - new->start + 1); > + > + /* we're not bigger than the max, unreserve the space and go */ > + if (new_size <= BTRFS_MAX_EXTENT_SIZE) { > + spin_lock(&BTRFS_I(inode)->lock); > + BTRFS_I(inode)->outstanding_extents--; > + spin_unlock(&BTRFS_I(inode)->lock); > + return; > + } > + > + /* > + * If we grew by another max_extent, just return, we want to keep that > + * reserved amount. > + */ > + num_extents = div64_u64(old_size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE); > + if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1, > + BTRFS_MAX_EXTENT_SIZE) > num_extents) > + return; > + > spin_lock(&BTRFS_I(inode)->lock); > BTRFS_I(inode)->outstanding_extents--; > spin_unlock(&BTRFS_I(inode)->lock); > @@ -1648,6 +1707,8 @@ static void btrfs_clear_bit_hook(struct inode *inode, > unsigned long *bits) > { > u64 len = state->end + 1 - state->start; > + u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE -1, > + BTRFS_MAX_EXTENT_SIZE); > > spin_lock(&BTRFS_I(inode)->lock); > if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) > @@ -1667,7 +1728,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, > *bits &= ~EXTENT_FIRST_DELALLOC; > } else if (!(*bits & EXTENT_DO_ACCOUNTING)) { > spin_lock(&BTRFS_I(inode)->lock); > - BTRFS_I(inode)->outstanding_extents--; > + BTRFS_I(inode)->outstanding_extents -= num_extents; > spin_unlock(&BTRFS_I(inode)->lock); > } > > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Filipe David Manana, "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men." ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write 2015-02-11 20:08 [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write Josef Bacik 2015-02-11 20:08 ` [PATCH 2/3] Btrfs: don't set and clear delalloc for O_DIRECT writes Josef Bacik 2015-02-11 20:08 ` [PATCH 3/3] Btrfs: account for large extents with enospc Josef Bacik @ 2015-02-12 3:52 ` Liu Bo 2 siblings, 0 replies; 11+ messages in thread From: Liu Bo @ 2015-02-12 3:52 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Wed, Feb 11, 2015 at 03:08:57PM -0500, Josef Bacik wrote: > We have this weird dance where we always inc outstanding_extents when we do a > O_DIRECT write, even if we allocate the entire range. To get around this we > also drop the metadata space if we successfully write. This is an unnecessary > dance, we only need to jack up outstanding_extents if we don't satisfy the > entire range request in get_blocks_direct, otherwise we are good using our > original reservation. So drop the unconditional inc and the drop of the > metadata space that we have for the unconditional inc. Thanks, Looks good. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Thanks, -liubo > > Signed-off-by: Josef Bacik <jbacik@fb.com> > --- > fs/btrfs/inode.c | 11 ++++++----- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 8a036ed..e78a2fd 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7137,6 +7137,7 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock, > u64 start = iblock << inode->i_blkbits; > u64 lockstart, lockend; > u64 len = bh_result->b_size; > + u64 orig_len = len; > int unlock_bits = EXTENT_LOCKED; > int ret = 0; > > @@ -7272,9 +7273,11 @@ unlock: > if (start + len > i_size_read(inode)) > i_size_write(inode, start + len); > > - spin_lock(&BTRFS_I(inode)->lock); > - BTRFS_I(inode)->outstanding_extents++; > - spin_unlock(&BTRFS_I(inode)->lock); > + if (len < orig_len) { > + spin_lock(&BTRFS_I(inode)->lock); > + BTRFS_I(inode)->outstanding_extents++; > + spin_unlock(&BTRFS_I(inode)->lock); > + } > > ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, > lockstart + len - 1, EXTENT_DELALLOC, NULL, > @@ -8056,8 +8059,6 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, > else if (ret >= 0 && (size_t)ret < count) > btrfs_delalloc_release_space(inode, > count - (size_t)ret); > - else > - btrfs_delalloc_release_metadata(inode, 0); > } > out: > if (wakeup) > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2015-03-06 20:01 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-02-11 20:08 [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write Josef Bacik 2015-02-11 20:08 ` [PATCH 2/3] Btrfs: don't set and clear delalloc for O_DIRECT writes Josef Bacik 2015-02-12 4:45 ` Liu Bo 2015-02-11 20:08 ` [PATCH 3/3] Btrfs: account for large extents with enospc Josef Bacik 2015-02-12 4:36 ` Liu Bo 2015-02-12 14:53 ` Josef Bacik 2015-02-12 16:16 ` [PATCH 3/3 v2] " Josef Bacik 2015-02-13 2:06 ` Liu Bo 2015-02-13 15:44 ` David Sterba 2015-03-06 20:01 ` [PATCH 3/3] " Filipe David Manana 2015-02-12 3:52 ` [PATCH 1/3] Btrfs: only adjust outstanding_extents when we do a short write Liu Bo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).