* [PATCH V2 02/10] Btrfs: wake up the tasks that wait for the io earlier
From: Miao Xie @ 2014-03-06 5:54 UTC
To: linux-btrfs
The tasks that wait for the IO_DONE flag only care about the IO of the
dirty pages, so it is better to wake them up immediately after all the
pages are written, instead of waiting until the whole ordered-extent
completion process finishes.
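For context, the waiters block on exactly this bit; the waiter side
looks roughly like this (a sketch of btrfs_start_ordered_extent(), not
new code in this patch):

	/*
	 * Waiter side (sketch): tasks sleep until IO_DONE is raised, so
	 * setting the bit as soon as bytes_left reaches zero lets them
	 * run without waiting for the rest of ordered-extent completion.
	 */
	wait_event(entry->wait,
		   test_bit(BTRFS_ORDERED_IO_DONE, &entry->flags));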
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- None.
---
fs/btrfs/ordered-data.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a6ba75e..067e129 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -349,10 +349,13 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode,
if (!uptodate)
set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
- if (entry->bytes_left == 0)
+ if (entry->bytes_left == 0) {
ret = test_and_set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
- else
+ if (waitqueue_active(&entry->wait))
+ wake_up(&entry->wait);
+ } else {
ret = 1;
+ }
out:
if (!ret && cached && entry) {
*cached = entry;
@@ -410,10 +413,13 @@ have_entry:
if (!uptodate)
set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
- if (entry->bytes_left == 0)
+ if (entry->bytes_left == 0) {
ret = test_and_set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
- else
+ if (waitqueue_active(&entry->wait))
+ wake_up(&entry->wait);
+ } else {
ret = 1;
+ }
out:
if (!ret && cached && entry) {
*cached = entry;
--
1.8.1.4
* Re: [PATCH V2 02/10] Btrfs: wake up the tasks that wait for the io earlier
From: David Sterba @ 2014-03-27 16:57 UTC
To: Miao Xie; +Cc: linux-btrfs
On Thu, Mar 06, 2014 at 01:54:56PM +0800, Miao Xie wrote:
> @@ -349,10 +349,13 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode,
> if (!uptodate)
> set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
>
> - if (entry->bytes_left == 0)
> + if (entry->bytes_left == 0) {
> ret = test_and_set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
> - else
waitqueue_active() should be preceded by a barrier (either implicit or
explicit), which is missing here and below. Though this could lead to a
missed wakeup, I don't think it's required here, but for consistency I
suggest adding it or putting a comment on why it's not needed.
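Something like this, as a minimal sketch of the usual pattern (note
that the value-returning test_and_set_bit() just above is documented to
imply a full barrier, so it may already be the implicit one; if we rely
on that, a one-line comment would do):

	set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
	/*
	 * Order the flag update against the waitqueue check; pairs with
	 * the barrier implied by prepare_to_wait() on the waiter side.
	 */
	smp_mb();
	if (waitqueue_active(&entry->wait))
		wake_up(&entry->wait);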
> + if (waitqueue_active(&entry->wait))
> + wake_up(&entry->wait);
> + } else {
> ret = 1;
> + }
> out:
> if (!ret && cached && entry) {
> *cached = entry;
> @@ -410,10 +413,13 @@ have_entry:
> if (!uptodate)
> set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
>
> - if (entry->bytes_left == 0)
> + if (entry->bytes_left == 0) {
> ret = test_and_set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
> - else
> + if (waitqueue_active(&entry->wait))
^^^
> + wake_up(&entry->wait);
> + } else {
> ret = 1;
> + }
> out:
> if (!ret && cached && entry) {
> *cached = entry;
* [PATCH V2 03/10] Btrfs: just do dirty page flush for the inode with compression before direct IO
From: Miao Xie @ 2014-03-06 5:54 UTC
To: linux-btrfs
As the comment in btrfs_direct_IO says, only the compressed pages need
to be flushed again to make sure they are on disk; the ordinary pages
don't. So we add an if statement to check whether the inode has
compressed pages, and if not, we skip the flush.

And in order to prevent the write ranges from intersecting, we need to
wait for the running ordered extents. But the current code waits for
them twice: once before the direct IO starts (in
btrfs_wait_ordered_range()), and once more before we get the blocks.
The first wait is unnecessary, because we can do the direct IO without
holding i_mutex, which means intersecting ordered extents may still
appear during the direct IO, so the first wait cannot avoid this
problem. Therefore we use filemap_fdatawrite_range() instead of
btrfs_wait_ordered_range() to remove the first wait.
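A rough contrast of the two helpers (a sketch assuming their usual
semantics; the real change is in the diff below):

	/*
	 * Old: write the range out and block until the ordered extents
	 * complete.
	 */
	ret = btrfs_wait_ordered_range(inode, offset, count);

	/*
	 * New: only start writeback, and only when compressed (async)
	 * extents may exist on this inode; nothing is waited for here.
	 */
	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
		     &BTRFS_I(inode)->runtime_flags))
		filemap_fdatawrite_range(inode->i_mapping, offset, count);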
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- None.
---
fs/btrfs/inode.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1af34d0..8c9522b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7401,15 +7401,15 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
smp_mb__after_atomic_inc();
/*
- * The generic stuff only does filemap_write_and_wait_range, which isn't
- * enough if we've written compressed pages to this area, so we need to
- * call btrfs_wait_ordered_range to make absolutely sure that any
- * outstanding dirty pages are on disk.
+ * The generic stuff only does filemap_write_and_wait_range, which
+ * isn't enough if we've written compressed pages to this area, so
+ * we need to flush the dirty pages again to make absolutely sure
+ * that any outstanding dirty pages are on disk.
*/
count = iov_length(iov, nr_segs);
- ret = btrfs_wait_ordered_range(inode, offset, count);
- if (ret)
- return ret;
+ if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+ &BTRFS_I(inode)->runtime_flags))
+ filemap_fdatawrite_range(inode->i_mapping, offset, count);
if (rw & WRITE) {
/*
--
1.8.1.4
* [PATCH V2 04/10] Btrfs: remove the unnecessary flush when preparing the pages
From: Miao Xie @ 2014-03-06 5:54 UTC
To: linux-btrfs
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- None.
---
fs/btrfs/file.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b86..76b725e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1346,11 +1346,11 @@ lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
struct btrfs_ordered_extent *ordered;
lock_extent_bits(&BTRFS_I(inode)->io_tree,
start_pos, last_pos, 0, cached_state);
- ordered = btrfs_lookup_first_ordered_extent(inode, last_pos);
+ ordered = btrfs_lookup_ordered_range(inode, start_pos,
+ last_pos - start_pos + 1);
if (ordered &&
ordered->file_offset + ordered->len > start_pos &&
ordered->file_offset <= last_pos) {
- btrfs_put_ordered_extent(ordered);
unlock_extent_cached(&BTRFS_I(inode)->io_tree,
start_pos, last_pos,
cached_state, GFP_NOFS);
@@ -1358,12 +1358,9 @@ lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
unlock_page(pages[i]);
page_cache_release(pages[i]);
}
- ret = btrfs_wait_ordered_range(inode, start_pos,
- last_pos - start_pos + 1);
- if (ret)
- return ret;
- else
- return -EAGAIN;
+ btrfs_start_ordered_extent(inode, ordered, 1);
+ btrfs_put_ordered_extent(ordered);
+ return -EAGAIN;
}
if (ordered)
btrfs_put_ordered_extent(ordered);
--
1.8.1.4
* [PATCH V2 05/10] Btrfs: remove unnecessary lock in may_commit_transaction()
From: Miao Xie @ 2014-03-06 5:54 UTC
To: linux-btrfs
The reason is:
- The per-cpu counter has its own lock to protect itself.
- We needn't get an exact value here.
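As a sketch of the point (assuming the usual percpu_counter
semantics), percpu_counter_compare() takes the counter's internal lock
itself when the per-cpu error margin forces an exact sum, so a rough
comparison needs no outer lock:

	/* No space_info->lock needed; the counter serializes itself. */
	if (percpu_counter_compare(&space_info->total_bytes_pinned,
				   bytes) >= 0)
		goto commit;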
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- None.
---
fs/btrfs/extent-tree.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 32312e0..ddb5c84 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4112,13 +4112,9 @@ static int may_commit_transaction(struct btrfs_root *root,
goto commit;
/* See if there is enough pinned space to make this reservation */
- spin_lock(&space_info->lock);
if (percpu_counter_compare(&space_info->total_bytes_pinned,
- bytes) >= 0) {
- spin_unlock(&space_info->lock);
+ bytes) >= 0)
goto commit;
- }
- spin_unlock(&space_info->lock);
/*
* See if there is some space in the delayed insertion reservation for
@@ -4127,16 +4123,13 @@ static int may_commit_transaction(struct btrfs_root *root,
if (space_info != delayed_rsv->space_info)
return -ENOSPC;
- spin_lock(&space_info->lock);
spin_lock(&delayed_rsv->lock);
if (percpu_counter_compare(&space_info->total_bytes_pinned,
bytes - delayed_rsv->size) >= 0) {
spin_unlock(&delayed_rsv->lock);
- spin_unlock(&space_info->lock);
return -ENOSPC;
}
spin_unlock(&delayed_rsv->lock);
- spin_unlock(&space_info->lock);
commit:
trans = btrfs_join_transaction(root);
--
1.8.1.4
* [PATCH V2 06/10] Btrfs: reclaim delalloc metadata more aggressively
From: Miao Xie @ 2014-03-06 5:55 UTC
To: linux-btrfs
generic/074 in xfstests sometimes failed with an enospc error. The
reason is that we reclaimed just the space we needed from the space
reserved for delalloc, and then tried to reserve it, but if some task
did a no-flush reservation between the above reclamation and
reservation:
Task1					Task2
shrink_delalloc()
  reclaim 1 block
  (the space that can be
   reserved now is 1 block)
					do no-flush reservation
					  reserve 1 block
					  (the space that can be
					   reserved now is 0 blocks)
reserving 1 block failed
the reservation of Task1 failed, but in fact there was enough space to
satisfy it if we could have reclaimed more space beforehand.

Fix this problem by reclaiming the reserved delalloc metadata space
more aggressively.
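Concretely, the fix is the doubled reclaim target in the one-line diff
below: reclaiming twice the shortfall leaves headroom even if a racing
no-flush reservation takes part of the reclaimed space:

	/*
	 * Ask for twice what we need so a concurrent no-flush
	 * reservation cannot consume the whole reclaimed amount.
	 */
	shrink_delalloc(root, num_bytes * 2, orig_bytes,
			state == FLUSH_DELALLOC_WAIT);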
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- New patch.
---
fs/btrfs/extent-tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ddb5c84..07e3d14 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4174,7 +4174,7 @@ static int flush_space(struct btrfs_root *root,
break;
case FLUSH_DELALLOC:
case FLUSH_DELALLOC_WAIT:
- shrink_delalloc(root, num_bytes, orig_bytes,
+ shrink_delalloc(root, num_bytes * 2, orig_bytes,
state == FLUSH_DELALLOC_WAIT);
break;
case ALLOC_CHUNK:
--
1.8.1.4
* [PATCH V2 07/10] Btrfs: don't flush all delalloc inodes when we don't get the s_umount lock
From: Miao Xie @ 2014-03-06 5:55 UTC
To: linux-btrfs
We needn't flush all delalloc inodes when we don't hold the s_umount
lock, or we would make the tasks wait for a long time.
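The new parameter follows the convention visible in the diff below (a
sketch of the two call styles):

	/*
	 * nr == -1 keeps the old flush-everything behaviour (sync,
	 * device replace, relocation, ...).
	 */
	btrfs_start_delalloc_roots(fs_info, 0, -1);

	/*
	 * A positive nr bounds the number of inodes flushed, which is
	 * what the space-reclaim path wants.
	 */
	btrfs_start_delalloc_roots(fs_info, 0, nr_items);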
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- New patch.
---
fs/btrfs/ctree.h | 3 ++-
fs/btrfs/dev-replace.c | 2 +-
fs/btrfs/extent-tree.c | 8 ++++----
fs/btrfs/inode.c | 34 +++++++++++++++++++---------------
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/relocation.c | 2 +-
fs/btrfs/transaction.c | 2 +-
7 files changed, 29 insertions(+), 24 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 277f627..2f3949b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3727,7 +3727,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
u32 min_type);
int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
-int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput);
+int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
+ int nr);
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
struct extent_state **cached_state);
int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index ec1c3f3..9f22905 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -491,7 +491,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
* flush all outstanding I/O and inode extent mappings before the
* copy operation is declared as being finished
*/
- ret = btrfs_start_delalloc_roots(root->fs_info, 0);
+ ret = btrfs_start_delalloc_roots(root->fs_info, 0, -1);
if (ret) {
mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
return ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07e3d14..da43003 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3971,7 +3971,7 @@ static int can_overcommit(struct btrfs_root *root,
}
static void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
- unsigned long nr_pages)
+ unsigned long nr_pages, int nr_items)
{
struct super_block *sb = root->fs_info->sb;
@@ -3986,9 +3986,9 @@ static void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
* the filesystem is readonly(all dirty pages are written to
* the disk).
*/
- btrfs_start_delalloc_roots(root->fs_info, 0);
+ btrfs_start_delalloc_roots(root->fs_info, 0, nr_items);
if (!current->journal_info)
- btrfs_wait_ordered_roots(root->fs_info, -1);
+ btrfs_wait_ordered_roots(root->fs_info, nr_items);
}
}
@@ -4045,7 +4045,7 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
while (delalloc_bytes && loops < 3) {
max_reclaim = min(delalloc_bytes, to_reclaim);
nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;
- btrfs_writeback_inodes_sb_nr(root, nr_pages);
+ btrfs_writeback_inodes_sb_nr(root, nr_pages, items);
/*
* We need to wait for the async pages to actually start before
* we do anything.
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8c9522b..4f64216 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8416,7 +8416,8 @@ void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work)
* some fairly slow code that needs optimization. This walks the list
* of all the inodes with pending delalloc and forces them to disk.
*/
-static int __start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
+static int __start_delalloc_inodes(struct btrfs_root *root, int delay_iput,
+ int nr)
{
struct btrfs_inode *binode;
struct inode *inode;
@@ -8450,12 +8451,14 @@ static int __start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
else
iput(inode);
ret = -ENOMEM;
- goto out;
+ break;
}
list_add_tail(&work->list, &works);
btrfs_queue_worker(&root->fs_info->flush_workers,
&work->work);
-
+ ret++;
+ if (nr != -1 && ret >= nr)
+ break;
cond_resched();
spin_lock(&root->delalloc_lock);
}
@@ -8465,12 +8468,6 @@ static int __start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
list_del_init(&work->list);
btrfs_wait_and_free_delalloc_work(work);
}
- return 0;
-out:
- list_for_each_entry_safe(work, next, &works, list) {
- list_del_init(&work->list);
- btrfs_wait_and_free_delalloc_work(work);
- }
if (!list_empty_careful(&splice)) {
spin_lock(&root->delalloc_lock);
@@ -8487,7 +8484,9 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state))
return -EROFS;
- ret = __start_delalloc_inodes(root, delay_iput);
+ ret = __start_delalloc_inodes(root, delay_iput, -1);
+ if (ret > 0)
+ ret = 0;
/*
* the filemap_flush will queue IO into the worker threads, but
* we have to make sure the IO is actually started and that
@@ -8504,7 +8503,8 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
return ret;
}
-int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput)
+int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
+ int nr)
{
struct btrfs_root *root;
struct list_head splice;
@@ -8517,7 +8517,7 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput)
spin_lock(&fs_info->delalloc_root_lock);
list_splice_init(&fs_info->delalloc_roots, &splice);
- while (!list_empty(&splice)) {
+ while (!list_empty(&splice) && nr) {
root = list_first_entry(&splice, struct btrfs_root,
delalloc_root);
root = btrfs_grab_fs_root(root);
@@ -8526,15 +8526,20 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput)
&fs_info->delalloc_roots);
spin_unlock(&fs_info->delalloc_root_lock);
- ret = __start_delalloc_inodes(root, delay_iput);
+ ret = __start_delalloc_inodes(root, delay_iput, nr);
btrfs_put_fs_root(root);
- if (ret)
+ if (ret < 0)
goto out;
+ if (nr != -1) {
+ nr -= ret;
+ WARN_ON(nr < 0);
+ }
spin_lock(&fs_info->delalloc_root_lock);
}
spin_unlock(&fs_info->delalloc_root_lock);
+ ret = 0;
atomic_inc(&fs_info->async_submit_draining);
while (atomic_read(&fs_info->nr_async_submits) ||
atomic_read(&fs_info->async_delalloc_pages)) {
@@ -8543,7 +8548,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput)
atomic_read(&fs_info->async_delalloc_pages) == 0));
}
atomic_dec(&fs_info->async_submit_draining);
- return 0;
out:
if (!list_empty_careful(&splice)) {
spin_lock(&fs_info->delalloc_root_lock);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b7dc0f7..e80b2bd 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4879,7 +4879,7 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_SYNC: {
int ret;
- ret = btrfs_start_delalloc_roots(root->fs_info, 0);
+ ret = btrfs_start_delalloc_roots(root->fs_info, 0, -1);
if (ret)
return ret;
ret = btrfs_sync_fs(file->f_dentry->d_sb, 1);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 07b3b36..def428a 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4248,7 +4248,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
btrfs_info(extent_root->fs_info, "relocating block group %llu flags %llu",
rc->block_group->key.objectid, rc->block_group->flags);
- ret = btrfs_start_delalloc_roots(fs_info, 0);
+ ret = btrfs_start_delalloc_roots(fs_info, 0, -1);
if (ret < 0) {
err = ret;
goto out;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d888f26..cd4deb3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1621,7 +1621,7 @@ static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
{
if (btrfs_test_opt(fs_info->tree_root, FLUSHONCOMMIT))
- return btrfs_start_delalloc_roots(fs_info, 1);
+ return btrfs_start_delalloc_roots(fs_info, 1, -1);
return 0;
}
--
1.8.1.4
* [PATCH V2 08/10] Btrfs: split the global ordered extents mutex
From: Miao Xie @ 2014-03-06 5:55 UTC
To: linux-btrfs
When we create a snapshot, we just need to wait for the ordered extents
in the source fs/file root. But because we used a global mutex to
protect the ordered extent list of each fs/file root (to avoid
accessing an empty list), if someone held the mutex while accessing the
ordered extent list of some other fs/file root, we had to wait.

This patch splits the above global mutex; now every fs/file root has
its own mutex to protect its own list.
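In effect (a sketch of the new locking), waiting out one root's
ordered extents takes only that root's mutex, so snapshot creation on
one subvolume no longer blocks behind waiters on unrelated roots:

	mutex_lock(&root->ordered_extent_mutex);  /* per-root, not global */
	/* ... splice and drain root->ordered_extents ... */
	mutex_unlock(&root->ordered_extent_mutex);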
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- New patch.
---
fs/btrfs/ctree.h | 2 ++
fs/btrfs/disk-io.c | 1 +
fs/btrfs/ordered-data.c | 17 ++++-------------
3 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2f3949b..7bae97e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1806,6 +1806,8 @@ struct btrfs_root {
struct list_head delalloc_inodes;
struct list_head delalloc_root;
u64 nr_delalloc_inodes;
+
+ struct mutex ordered_extent_mutex;
/*
* this is used by the balancing code to wait for all the pending
* ordered extents
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index de6a48f..65fe26e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1197,6 +1197,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
spin_lock_init(&root->log_extents_lock[1]);
mutex_init(&root->objectid_mutex);
mutex_init(&root->log_mutex);
+ mutex_init(&root->ordered_extent_mutex);
init_waitqueue_head(&root->log_writer_wait);
init_waitqueue_head(&root->log_commit_wait[0]);
init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 067e129..2849485 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -595,7 +595,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
* wait for all the ordered extents in a root. This is done when balancing
* space between drives.
*/
-static int __btrfs_wait_ordered_extents(struct btrfs_root *root, int nr)
+int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr)
{
struct list_head splice, works;
struct btrfs_ordered_extent *ordered, *next;
@@ -604,6 +604,7 @@ static int __btrfs_wait_ordered_extents(struct btrfs_root *root, int nr)
INIT_LIST_HEAD(&splice);
INIT_LIST_HEAD(&works);
+ mutex_lock(&root->ordered_extent_mutex);
spin_lock(&root->ordered_extent_lock);
list_splice_init(&root->ordered_extents, &splice);
while (!list_empty(&splice) && nr) {
@@ -634,17 +635,7 @@ static int __btrfs_wait_ordered_extents(struct btrfs_root *root, int nr)
btrfs_put_ordered_extent(ordered);
cond_resched();
}
-
- return count;
-}
-
-int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr)
-{
- int count;
-
- mutex_lock(&root->fs_info->ordered_operations_mutex);
- count = __btrfs_wait_ordered_extents(root, nr);
- mutex_unlock(&root->fs_info->ordered_operations_mutex);
+ mutex_unlock(&root->ordered_extent_mutex);
return count;
}
@@ -669,7 +660,7 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr)
&fs_info->ordered_roots);
spin_unlock(&fs_info->ordered_root_lock);
- done = __btrfs_wait_ordered_extents(root, nr);
+ done = btrfs_wait_ordered_extents(root, nr);
btrfs_put_fs_root(root);
spin_lock(&fs_info->ordered_root_lock);
--
1.8.1.4
* [PATCH V2 09/10] Btrfs: fix possible empty list access when flushing the delalloc inodes
From: Miao Xie @ 2014-03-06 5:55 UTC
To: linux-btrfs
We didn't have a lock to protect access to the delalloc inodes list, so
we might access an empty delalloc inodes list if someone started
flushing delalloc inodes, because the delalloc inodes are moved onto
another list temporarily. Fix it by wrapping the access with a lock.
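The window being closed looks roughly like this (a sketch based on the
splice in __start_delalloc_inodes(): the list is moved onto a private
head while it is worked on, so an unserialized second flusher could see
it empty and return without flushing anything):

	mutex_lock(&root->delalloc_mutex);	/* serialize flushers */
	spin_lock(&root->delalloc_lock);
	list_splice_init(&root->delalloc_inodes, &splice);
	spin_unlock(&root->delalloc_lock);
	/* ... flush the inodes on the private 'splice' list ... */
	mutex_unlock(&root->delalloc_mutex);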
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- New patch.
---
fs/btrfs/ctree.h | 2 ++
fs/btrfs/disk-io.c | 2 ++
fs/btrfs/inode.c | 4 ++++
3 files changed, 8 insertions(+)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7bae97e..ec47aa9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1490,6 +1490,7 @@ struct btrfs_fs_info {
*/
struct list_head ordered_roots;
+ struct mutex delalloc_root_mutex;
spinlock_t delalloc_root_lock;
/* all fs/file tree roots that have delalloc inodes. */
struct list_head delalloc_roots;
@@ -1797,6 +1798,7 @@ struct btrfs_root {
spinlock_t root_item_lock;
atomic_t refs;
+ struct mutex delalloc_mutex;
spinlock_t delalloc_lock;
/*
* all of the inodes that have delalloc bytes. It is possible for
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 65fe26e..2bb0bbd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1198,6 +1198,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
mutex_init(&root->objectid_mutex);
mutex_init(&root->log_mutex);
mutex_init(&root->ordered_extent_mutex);
+ mutex_init(&root->delalloc_mutex);
init_waitqueue_head(&root->log_writer_wait);
init_waitqueue_head(&root->log_commit_wait[0]);
init_waitqueue_head(&root->log_commit_wait[1]);
@@ -2169,6 +2170,7 @@ int open_ctree(struct super_block *sb,
spin_lock_init(&fs_info->buffer_lock);
rwlock_init(&fs_info->tree_mod_log_lock);
mutex_init(&fs_info->reloc_mutex);
+ mutex_init(&fs_info->delalloc_root_mutex);
seqlock_init(&fs_info->profiles_lock);
init_completion(&fs_info->kobj_unregister);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f64216..34c484c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8429,6 +8429,7 @@ static int __start_delalloc_inodes(struct btrfs_root *root, int delay_iput,
INIT_LIST_HEAD(&works);
INIT_LIST_HEAD(&splice);
+ mutex_lock(&root->delalloc_mutex);
spin_lock(&root->delalloc_lock);
list_splice_init(&root->delalloc_inodes, &splice);
while (!list_empty(&splice)) {
@@ -8474,6 +8475,7 @@ static int __start_delalloc_inodes(struct btrfs_root *root, int delay_iput,
list_splice_tail(&splice, &root->delalloc_inodes);
spin_unlock(&root->delalloc_lock);
}
+ mutex_unlock(&root->delalloc_mutex);
return ret;
}
@@ -8515,6 +8517,7 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
INIT_LIST_HEAD(&splice);
+ mutex_lock(&fs_info->delalloc_root_mutex);
spin_lock(&fs_info->delalloc_root_lock);
list_splice_init(&fs_info->delalloc_roots, &splice);
while (!list_empty(&splice) && nr) {
@@ -8554,6 +8557,7 @@ out:
list_splice_tail(&splice, &fs_info->delalloc_roots);
spin_unlock(&fs_info->delalloc_root_lock);
}
+ mutex_unlock(&fs_info->delalloc_root_mutex);
return ret;
}
--
1.8.1.4
* [PATCH V2 10/10] Btrfs: reclaim the reserved metadata space in the background
From: Miao Xie @ 2014-03-06 5:55 UTC
To: linux-btrfs
Before applying this patch, a task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task
started the space reclamation, all the other tasks which wanted to
reserve metadata space were blocked. In some cases they would be
blocked for a long time, which made the performance fluctuate wildly.

So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue,
and the workqueue's worker reclaims the reserved space in the
background. This way the tasks needn't reclaim the space by themselves
in most cases, and even if they have to reclaim the space or are
blocked waiting for the reclamation, they will get enough space more
quickly.

We needn't worry about the early-enospc problem because all the
reclaim work is serialized by the lock.
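The trigger is visible in the diff below: a successful reservation
that pushes usage over the threshold kicks the worker, unless one is
already running:

	if (need_do_async_reclaim(space_info, root->fs_info, used) &&
	    !work_busy(&root->fs_info->async_reclaim_work))
		queue_work(system_unbound_wq,
			   &root->fs_info->async_reclaim_work);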
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
I have only done some simple tests so far; I'll do more performance
tests and send out the results.
Changelog v1 -> v2:
- change the reclaim size.
---
fs/btrfs/ctree.h | 6 +++
fs/btrfs/disk-io.c | 3 ++
fs/btrfs/extent-tree.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/super.c | 1 +
4 files changed, 108 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ec47aa9..21f156b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
#include <asm/kmap_types.h>
#include <linux/pagemap.h>
#include <linux/btrfs.h>
+#include <linux/workqueue.h>
#include "extent_io.h"
#include "extent_map.h"
#include "async-thread.h"
@@ -1305,6 +1306,8 @@ struct btrfs_stripe_hash_table {
#define BTRFS_STRIPE_HASH_TABLE_BITS 11
+void btrfs_init_async_reclaim_work(struct work_struct *work);
+
/* fs_info */
struct reloc_control;
struct btrfs_device;
@@ -1681,6 +1684,9 @@ struct btrfs_fs_info {
struct semaphore uuid_tree_rescan_sem;
unsigned int update_uuid_tree_gen:1;
+
+ /* Used to reclaim the metadata space in the background. */
+ struct work_struct async_reclaim_work;
};
/*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2bb0bbd..d77516e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2237,6 +2237,7 @@ int open_ctree(struct super_block *sb,
atomic_set(&fs_info->balance_cancel_req, 0);
fs_info->balance_ctl = NULL;
init_waitqueue_head(&fs_info->balance_wait_q);
+ btrfs_init_async_reclaim_work(&fs_info->async_reclaim_work);
sb->s_blocksize = 4096;
sb->s_blocksize_bits = blksize_bits(4096);
@@ -3580,6 +3581,8 @@ int close_ctree(struct btrfs_root *root)
/* clear out the rbtree of defraggable inodes */
btrfs_cleanup_defrag_inodes(fs_info);
+ cancel_work_sync(&fs_info->async_reclaim_work);
+
if (!(fs_info->sb->s_flags & MS_RDONLY)) {
ret = btrfs_commit_super(root);
if (ret)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index da43003..6640d28 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4200,6 +4200,98 @@ static int flush_space(struct btrfs_root *root,
return ret;
}
+
+static inline u64
+btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
+ struct btrfs_space_info *space_info)
+{
+ u64 used;
+ u64 expected;
+ u64 to_reclaim;
+
+ to_reclaim = min_t(u64, num_online_cpus() * 1024 * 1024,
+ 16 * 1024 * 1024);
+ spin_lock(&space_info->lock);
+ if (can_overcommit(root, space_info, to_reclaim,
+ BTRFS_RESERVE_FLUSH_ALL)) {
+ to_reclaim = 0;
+ goto out;
+ }
+
+ used = space_info->bytes_used + space_info->bytes_reserved +
+ space_info->bytes_pinned + space_info->bytes_readonly +
+ space_info->bytes_may_use;
+ if (can_overcommit(root, space_info, 1024 * 1024,
+ BTRFS_RESERVE_FLUSH_ALL))
+ expected = div_factor_fine(space_info->total_bytes, 95);
+ else
+ expected = div_factor_fine(space_info->total_bytes, 90);
+ to_reclaim = used - expected;
+out:
+ spin_unlock(&space_info->lock);
+
+ return to_reclaim;
+}
+
+static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
+ struct btrfs_fs_info *fs_info, u64 used)
+{
+ return (used >= div_factor_fine(space_info->total_bytes, 95) &&
+ !btrfs_fs_closing(fs_info) &&
+ !test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
+}
+
+static int btrfs_need_do_async_reclaim(struct btrfs_space_info *space_info,
+ struct btrfs_fs_info *fs_info)
+{
+ u64 used;
+
+ spin_lock(&space_info->lock);
+ used = space_info->bytes_used + space_info->bytes_reserved +
+ space_info->bytes_pinned + space_info->bytes_readonly +
+ space_info->bytes_may_use;
+ if (need_do_async_reclaim(space_info, fs_info, used)) {
+ spin_unlock(&space_info->lock);
+ return 1;
+ }
+ spin_unlock(&space_info->lock);
+
+ return 0;
+}
+
+static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_space_info *space_info;
+ u64 to_reclaim;
+ int flush_state;
+
+ fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
+ space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+
+ to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
+ space_info);
+ if (!to_reclaim)
+ return;
+
+ flush_state = FLUSH_DELAYED_ITEMS_NR;
+ do {
+ flush_space(fs_info->fs_root, space_info, to_reclaim,
+ to_reclaim, flush_state);
+ flush_state++;
+ if (!btrfs_need_do_async_reclaim(space_info, fs_info))
+ return;
+ } while (flush_state <= COMMIT_TRANS);
+
+ if (btrfs_need_do_async_reclaim(space_info, fs_info))
+ queue_work(system_unbound_wq, work);
+}
+
+void btrfs_init_async_reclaim_work(struct work_struct *work)
+{
+ INIT_WORK(work, btrfs_async_reclaim_metadata_space);
+}
+
/**
* reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
* @root - the root we're allocating for
@@ -4307,8 +4399,13 @@ again:
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
flushing = true;
space_info->flush = 1;
+ } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
+ used += orig_bytes;
+ if (need_do_async_reclaim(space_info, root->fs_info, used) &&
+ !work_busy(&root->fs_info->async_reclaim_work))
+ queue_work(system_unbound_wq,
+ &root->fs_info->async_reclaim_work);
}
-
spin_unlock(&space_info->lock);
if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97cc241..7cc7423 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1401,6 +1401,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
* this also happens on 'umount -rf' or on shutdown, when
* the filesystem is busy.
*/
+ cancel_work_sync(&fs_info->async_reclaim_work);
/* wait for the uuid_scan task to finish */
down(&fs_info->uuid_tree_rescan_sem);
--
1.8.1.4
* Re: [PATCH V2 10/10] Btrfs: reclaim the reserved metadata space in the background
From: Josef Bacik @ 2014-03-10 13:35 UTC
To: Miao Xie, linux-btrfs
On 03/06/2014 12:55 AM, Miao Xie wrote:
> Before applying this patch, a task had to reclaim the metadata
> space by itself if the metadata space was not enough. And when the
> task started the space reclamation, all the other tasks which
> wanted to reserve metadata space were blocked. In some cases
> they would be blocked for a long time, which made the performance
> fluctuate wildly.
>
> So we introduce background metadata space reclamation: when the
> space is about to be exhausted, we insert a reclaim work into the
> workqueue, and the workqueue's worker reclaims the reserved space
> in the background. This way the tasks needn't reclaim the space
> by themselves in most cases, and even if they have to reclaim the
> space or are blocked waiting for the reclamation, they will get
> enough space more quickly.
>
> We needn't worry about the early-enospc problem because all the
> reclaim work is serialized by the lock.
>
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This causes generic/015 to fail with an early enospc; I'm kicking this
patch out and will take the rest. Thanks,
Josef
* Re: [PATCH V2 10/10] Btrfs: reclaim the reserved metadata space in the background
From: Miao Xie @ 2014-05-08 2:06 UTC
To: Josef Bacik, linux-btrfs
On Mon, 10 Mar 2014 09:35:13 -0400, Josef Bacik wrote:
> On 03/06/2014 12:55 AM, Miao Xie wrote:
>> Before applying this patch, a task had to reclaim the metadata
>> space by itself if the metadata space was not enough. And when the
>> task started the space reclamation, all the other tasks which
>> wanted to reserve metadata space were blocked. In some cases
>> they would be blocked for a long time, which made the performance
>> fluctuate wildly.
>>
>> So we introduce background metadata space reclamation: when the
>> space is about to be exhausted, we insert a reclaim work into the
>> workqueue, and the workqueue's worker reclaims the reserved space
>> in the background. This way the tasks needn't reclaim the space
>> by themselves in most cases, and even if they have to reclaim the
>> space or are blocked waiting for the reclamation, they will get
>> enough space more quickly.
>>
>> We needn't worry about the early-enospc problem because all the
>> reclaim work is serialized by the lock.
>>
>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>
> This causes generic/015 to fail with an early enospc; I'm kicking this
> patch out and will take the rest. Thanks,
It is not an early-enospc problem.

This test checks whether the space of a file is released immediately
after the file is deleted. In fact, the result of the test is unstable,
because the kernel may be syncing the file's data when we delete it;
if so, the space of the file is not released immediately.

But the case described above is rare, because the size of the fs in
this test is just 50MB while the memory of most machines is very large
(maybe > 1GB). That means there are not that many dirty pages, the
background flusher may not be woken up immediately, so no one holds the
inode of the test file after we delete it, and then its space can be
released immediately.

After applying this patch, we flush the dirty pages because our
background metadata space reclaimer finds that the metadata space is
about to be used up (< 5% of the total metadata size left) and needs to
flush dirty pages to reclaim some delalloc metadata space. That is,
this patch makes the above case happen more easily.

Anyway, we need to improve this patch even though it is not a bug.
I will send out a new one.
Thanks
Miao
* [PATCH V3] Btrfs: reclaim the reserved metadata space in the background
From: Miao Xie @ 2014-05-08 2:21 UTC
To: linux-btrfs
Before applying this patch, a task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task
started the space reclamation, all the other tasks which wanted to
reserve metadata space were blocked. In some cases they would be
blocked for a long time, which made the performance fluctuate wildly.

So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue,
and the workqueue's worker reclaims the reserved space in the
background. This way the tasks needn't reclaim the space by themselves
in most cases, and even if they have to reclaim the space or are
blocked waiting for the reclamation, they will get enough space more
quickly.
Here are my test results (tested with compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
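(That is roughly a 10% gain on initial create (54.36 -> 59.80 MB/s)
and a 22% gain on compile (123.72 -> 151.44 MB/s), with the read and
delete phases essentially unchanged.)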
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v2 -> v3:
- change the condition under which the background reclamation starts.
---
fs/btrfs/ctree.h | 6 +++
fs/btrfs/disk-io.c | 3 ++
fs/btrfs/extent-tree.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/super.c | 1 +
4 files changed, 114 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c48df5..f264edf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
#include <asm/kmap_types.h>
#include <linux/pagemap.h>
#include <linux/btrfs.h>
+#include <linux/workqueue.h>
#include "extent_io.h"
#include "extent_map.h"
#include "async-thread.h"
@@ -1313,6 +1314,8 @@ struct btrfs_stripe_hash_table {
#define BTRFS_STRIPE_HASH_TABLE_BITS 11
+void btrfs_init_async_reclaim_work(struct work_struct *work);
+
/* fs_info */
struct reloc_control;
struct btrfs_device;
@@ -1688,6 +1691,9 @@ struct btrfs_fs_info {
struct semaphore uuid_tree_rescan_sem;
unsigned int update_uuid_tree_gen:1;
+
+ /* Used to reclaim the metadata space in the background. */
+ struct work_struct async_reclaim_work;
};
struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 029d46c..475889a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2291,6 +2291,7 @@ int open_ctree(struct super_block *sb,
atomic_set(&fs_info->balance_cancel_req, 0);
fs_info->balance_ctl = NULL;
init_waitqueue_head(&fs_info->balance_wait_q);
+ btrfs_init_async_reclaim_work(&fs_info->async_reclaim_work);
sb->s_blocksize = 4096;
sb->s_blocksize_bits = blksize_bits(4096);
@@ -3603,6 +3604,8 @@ int close_ctree(struct btrfs_root *root)
/* clear out the rbtree of defraggable inodes */
btrfs_cleanup_defrag_inodes(fs_info);
+ cancel_work_sync(&fs_info->async_reclaim_work);
+
if (!(fs_info->sb->s_flags & MS_RDONLY)) {
ret = btrfs_commit_super(root);
if (ret)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1306487..5a5e156 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4201,6 +4201,104 @@ static int flush_space(struct btrfs_root *root,
return ret;
}
+
+static inline u64
+btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
+ struct btrfs_space_info *space_info)
+{
+ u64 used;
+ u64 expected;
+ u64 to_reclaim;
+
+ to_reclaim = min_t(u64, num_online_cpus() * 1024 * 1024,
+ 16 * 1024 * 1024);
+ spin_lock(&space_info->lock);
+ if (can_overcommit(root, space_info, to_reclaim,
+ BTRFS_RESERVE_FLUSH_ALL)) {
+ to_reclaim = 0;
+ goto out;
+ }
+
+ used = space_info->bytes_used + space_info->bytes_reserved +
+ space_info->bytes_pinned + space_info->bytes_readonly +
+ space_info->bytes_may_use;
+ if (can_overcommit(root, space_info, 1024 * 1024,
+ BTRFS_RESERVE_FLUSH_ALL))
+ expected = div_factor_fine(space_info->total_bytes, 95);
+ else
+ expected = div_factor_fine(space_info->total_bytes, 90);
+
+ if (used > expected)
+ to_reclaim = used - expected;
+ else
+ to_reclaim = 0;
+ to_reclaim = min(to_reclaim, space_info->bytes_may_use +
+ space_info->bytes_reserved);
+out:
+ spin_unlock(&space_info->lock);
+
+ return to_reclaim;
+}
+
+static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
+ struct btrfs_fs_info *fs_info, u64 used)
+{
+ return (used >= div_factor_fine(space_info->total_bytes, 98) &&
+ !btrfs_fs_closing(fs_info) &&
+ !test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
+}
+
+static int btrfs_need_do_async_reclaim(struct btrfs_space_info *space_info,
+ struct btrfs_fs_info *fs_info)
+{
+ u64 used;
+
+ spin_lock(&space_info->lock);
+ used = space_info->bytes_used + space_info->bytes_reserved +
+ space_info->bytes_pinned + space_info->bytes_readonly +
+ space_info->bytes_may_use;
+ if (need_do_async_reclaim(space_info, fs_info, used)) {
+ spin_unlock(&space_info->lock);
+ return 1;
+ }
+ spin_unlock(&space_info->lock);
+
+ return 0;
+}
+
+static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_space_info *space_info;
+ u64 to_reclaim;
+ int flush_state;
+
+ fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
+ space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+
+ to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
+ space_info);
+ if (!to_reclaim)
+ return;
+
+ flush_state = FLUSH_DELAYED_ITEMS_NR;
+ do {
+ flush_space(fs_info->fs_root, space_info, to_reclaim,
+ to_reclaim, flush_state);
+ flush_state++;
+ if (!btrfs_need_do_async_reclaim(space_info, fs_info))
+ return;
+ } while (flush_state <= COMMIT_TRANS);
+
+ if (btrfs_need_do_async_reclaim(space_info, fs_info))
+ queue_work(system_unbound_wq, work);
+}
+
+void btrfs_init_async_reclaim_work(struct work_struct *work)
+{
+ INIT_WORK(work, btrfs_async_reclaim_metadata_space);
+}
+
/**
* reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
* @root - the root we're allocating for
@@ -4308,8 +4406,13 @@ again:
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
flushing = true;
space_info->flush = 1;
+ } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
+ used += orig_bytes;
+ if (need_do_async_reclaim(space_info, root->fs_info, used) &&
+ !work_busy(&root->fs_info->async_reclaim_work))
+ queue_work(system_unbound_wq,
+ &root->fs_info->async_reclaim_work);
}
-
spin_unlock(&space_info->lock);
if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5011aad..99607b6 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1423,6 +1423,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
* this also happens on 'umount -rf' or on shutdown, when
* the filesystem is busy.
*/
+ cancel_work_sync(&fs_info->async_reclaim_work);
/* wait for the uuid_scan task to finish */
down(&fs_info->uuid_tree_rescan_sem);
--
1.8.1.4