[PATCH 1/4] btrfs: always wait on ordered extents at fsync time

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/4] btrfs: always wait on ordered extents at fsync time
@ 2018-05-23 15:58 Josef Bacik
  2018-05-23 15:58 ` [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Josef Bacik @ 2018-05-23 15:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

There's a priority inversion that exists currently with btrfs fsync.  In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second.  However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/file.c | 56 ++++----------------------------------------------------
 1 file changed, 4 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5772f0cbedef..2b1c36612384 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	atomic_inc(&root->log_batch);
 	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 			     &BTRFS_I(inode)->runtime_flags);
+
 	/*
-	 * We might have have had more pages made dirty after calling
-	 * start_ordered_ops and before acquiring the inode's i_mutex.
+	 * We have to do this here to avoid the priority inversion of waiting on
+	 * IO of a lower priority task while holding a transaciton open.
 	 */
-	if (full_sync) {
-		/*
-		 * For a full sync, we need to make sure any ordered operations
-		 * start and finish before we start logging the inode, so that
-		 * all extents are persisted and the respective file extent
-		 * items are in the fs/subvol btree.
-		 */
-		ret = btrfs_wait_ordered_range(inode, start, len);
-	} else {
-		/*
-		 * Start any new ordered operations before starting to log the
-		 * inode. We will wait for them to finish in btrfs_sync_log().
-		 *
-		 * Right before acquiring the inode's mutex, we might have new
-		 * writes dirtying pages, which won't immediately start the
-		 * respective ordered operations - that is done through the
-		 * fill_delalloc callbacks invoked from the writepage and
-		 * writepages address space operations. So make sure we start
-		 * all ordered operations before starting to log our inode. Not
-		 * doing this means that while logging the inode, writeback
-		 * could start and invoke writepage/writepages, which would call
-		 * the fill_delalloc callbacks (cow_file_range,
-		 * submit_compressed_extents). These callbacks add first an
-		 * extent map to the modified list of extents and then create
-		 * the respective ordered operation, which means in
-		 * tree-log.c:btrfs_log_inode() we might capture all existing
-		 * ordered operations (with btrfs_get_logged_extents()) before
-		 * the fill_delalloc callback adds its ordered operation, and by
-		 * the time we visit the modified list of extent maps (with
-		 * btrfs_log_changed_extents()), we see and process the extent
-		 * map they created. We then use the extent map to construct a
-		 * file extent item for logging without waiting for the
-		 * respective ordered operation to finish - this file extent
-		 * item points to a disk location that might not have yet been
-		 * written to, containing random data - so after a crash a log
-		 * replay will make our inode have file extent items that point
-		 * to disk locations containing invalid data, as we returned
-		 * success to userspace without waiting for the respective
-		 * ordered operation to finish, because it wasn't captured by
-		 * btrfs_get_logged_extents().
-		 */
-		ret = start_ordered_ops(inode, start, end);
-	}
+	ret = btrfs_wait_ordered_range(inode, start, len);
 	if (ret) {
 		inode_unlock(inode);
 		goto out;
@@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 				goto out;
 			}
 		}
-		if (!full_sync) {
-			ret = btrfs_wait_ordered_range(inode, start, len);
-			if (ret) {
-				btrfs_end_transaction(trans);
-				goto out;
-			}
-		}
 		ret = btrfs_commit_transaction(trans);
 	} else {
 		ret = btrfs_end_transaction(trans);
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path
  2018-05-23 15:58 [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Josef Bacik
@ 2018-05-23 15:58 ` Josef Bacik
  2018-05-26 12:48   ` Nikolay Borisov
  2018-05-23 15:58 ` [PATCH 3/4] btrfs: clean up the left over logged_list usage Josef Bacik
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2018-05-23 15:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

Since we are waiting on all ordered extents at the start of the fsync()
path we don't need to wait on any logged ordered extents, and we don't
need to look up the checksums on the ordered extents as they will
already be on disk prior to getting here.  Rework this so we're only
looking up and copying the on-disk checksums for the extent range we
care about.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/tree-log.c | 118 +++++-----------------------------------------------
 1 file changed, 10 insertions(+), 108 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 43758e30aa7a..791b5a731456 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4082,127 +4082,30 @@ static int extent_cmp(void *priv, struct list_head *a, struct list_head *b)
 	return 0;
 }
 
-static int wait_ordered_extents(struct btrfs_trans_handle *trans,
-				struct inode *inode,
-				struct btrfs_root *root,
-				const struct extent_map *em,
-				const struct list_head *logged_list,
-				bool *ordered_io_error)
+static int log_extent_csums(struct btrfs_trans_handle *trans,
+			    struct btrfs_inode *inode,
+			    struct btrfs_root *root,
+			    const struct extent_map *em)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct btrfs_ordered_extent *ordered;
 	struct btrfs_root *log = root->log_root;
-	u64 mod_start = em->mod_start;
-	u64 mod_len = em->mod_len;
-	const bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 	u64 csum_offset;
 	u64 csum_len;
 	LIST_HEAD(ordered_sums);
 	int ret = 0;
 
-	*ordered_io_error = false;
-
-	if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
+	if (inode->flags & BTRFS_INODE_NODATASUM ||
+	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
 	    em->block_start == EXTENT_MAP_HOLE)
 		return 0;
 
-	/*
-	 * Wait far any ordered extent that covers our extent map. If it
-	 * finishes without an error, first check and see if our csums are on
-	 * our outstanding ordered extents.
-	 */
-	list_for_each_entry(ordered, logged_list, log_list) {
-		struct btrfs_ordered_sum *sum;
-
-		if (!mod_len)
-			break;
-
-		if (ordered->file_offset + ordered->len <= mod_start ||
-		    mod_start + mod_len <= ordered->file_offset)
-			continue;
-
-		if (!test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) &&
-		    !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
-		    !test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
-			const u64 start = ordered->file_offset;
-			const u64 end = ordered->file_offset + ordered->len - 1;
-
-			WARN_ON(ordered->inode != inode);
-			filemap_fdatawrite_range(inode->i_mapping, start, end);
-		}
-
-		wait_event(ordered->wait,
-			   (test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) ||
-			    test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)));
-
-		if (test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) {
-			/*
-			 * Clear the AS_EIO/AS_ENOSPC flags from the inode's
-			 * i_mapping flags, so that the next fsync won't get
-			 * an outdated io error too.
-			 */
-			filemap_check_errors(inode->i_mapping);
-			*ordered_io_error = true;
-			break;
-		}
-		/*
-		 * We are going to copy all the csums on this ordered extent, so
-		 * go ahead and adjust mod_start and mod_len in case this
-		 * ordered extent has already been logged.
-		 */
-		if (ordered->file_offset > mod_start) {
-			if (ordered->file_offset + ordered->len >=
-			    mod_start + mod_len)
-				mod_len = ordered->file_offset - mod_start;
-			/*
-			 * If we have this case
-			 *
-			 * |--------- logged extent ---------|
-			 *       |----- ordered extent ----|
-			 *
-			 * Just don't mess with mod_start and mod_len, we'll
-			 * just end up logging more csums than we need and it
-			 * will be ok.
-			 */
-		} else {
-			if (ordered->file_offset + ordered->len <
-			    mod_start + mod_len) {
-				mod_len = (mod_start + mod_len) -
-					(ordered->file_offset + ordered->len);
-				mod_start = ordered->file_offset +
-					ordered->len;
-			} else {
-				mod_len = 0;
-			}
-		}
-
-		if (skip_csum)
-			continue;
-
-		/*
-		 * To keep us from looping for the above case of an ordered
-		 * extent that falls inside of the logged extent.
-		 */
-		if (test_and_set_bit(BTRFS_ORDERED_LOGGED_CSUM,
-				     &ordered->flags))
-			continue;
-
-		list_for_each_entry(sum, &ordered->list, list) {
-			ret = btrfs_csum_file_blocks(trans, log, sum);
-			if (ret)
-				break;
-		}
-	}
-
-	if (*ordered_io_error || !mod_len || ret || skip_csum)
-		return ret;
-
+	/* If we're compressed we have to save the entire range of csums. */
 	if (em->compress_type) {
 		csum_offset = 0;
 		csum_len = max(em->block_len, em->orig_block_len);
 	} else {
-		csum_offset = mod_start - em->start;
-		csum_len = mod_len;
+		csum_offset = em->mod_start - em->start;
+		csum_len = em->mod_len;
 	}
 
 	/* block start is already adjusted for the file extent offset. */
@@ -4244,8 +4147,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	int extent_inserted = 0;
 	bool ordered_io_err = false;
 
-	ret = wait_ordered_extents(trans, &inode->vfs_inode, root, em,
-			logged_list, &ordered_io_err);
+	ret = log_extent_csums(trans, inode, root, em);
 	if (ret)
 		return ret;
 
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path
  2018-05-23 15:58 ` [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path Josef Bacik
@ 2018-05-26 12:48   ` Nikolay Borisov
  2018-06-20 13:43     ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2018-05-26 12:48 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team; +Cc: Josef Bacik



On 23.05.2018 18:58, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> Since we are waiting on all ordered extents at the start of the fsync()
> path we don't need to wait on any logged ordered extents, and we don't
> need to look up the checksums on the ordered extents as they will
> already be on disk prior to getting here.  Rework this so we're only
> looking up and copying the on-disk checksums for the extent range we
> care about.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/tree-log.c | 118 +++++-----------------------------------------------
>  1 file changed, 10 insertions(+), 108 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 43758e30aa7a..791b5a731456 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4082,127 +4082,30 @@ static int extent_cmp(void *priv, struct list_head *a, struct list_head *b)
>  	return 0;
>  }
>  
> -static int wait_ordered_extents(struct btrfs_trans_handle *trans,
> -				struct inode *inode,
> -				struct btrfs_root *root,
> -				const struct extent_map *em,
> -				const struct list_head *logged_list,
> -				bool *ordered_io_error)
> +static int log_extent_csums(struct btrfs_trans_handle *trans,
> +			    struct btrfs_inode *inode,
> +			    struct btrfs_root *root,
> +			    const struct extent_map *em)
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
> -	struct btrfs_ordered_extent *ordered;
>  	struct btrfs_root *log = root->log_root;
> -	u64 mod_start = em->mod_start;
> -	u64 mod_len = em->mod_len;
> -	const bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>  	u64 csum_offset;
>  	u64 csum_len;
>  	LIST_HEAD(ordered_sums);
>  	int ret = 0;
>  
> -	*ordered_io_error = false;
> -
> -	if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
> +	if (inode->flags & BTRFS_INODE_NODATASUM ||
> +	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
>  	    em->block_start == EXTENT_MAP_HOLE)
>  		return 0;
>  
> -	/*
> -	 * Wait far any ordered extent that covers our extent map. If it
> -	 * finishes without an error, first check and see if our csums are on
> -	 * our outstanding ordered extents.
> -	 */
> -	list_for_each_entry(ordered, logged_list, log_list) {
> -		struct btrfs_ordered_sum *sum;
> -
> -		if (!mod_len)
> -			break;
> -
> -		if (ordered->file_offset + ordered->len <= mod_start ||
> -		    mod_start + mod_len <= ordered->file_offset)
> -			continue;
> -
> -		if (!test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) &&
> -		    !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
> -		    !test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
> -			const u64 start = ordered->file_offset;
> -			const u64 end = ordered->file_offset + ordered->len - 1;
> -
> -			WARN_ON(ordered->inode != inode);
> -			filemap_fdatawrite_range(inode->i_mapping, start, end);
> -		}
> -
> -		wait_event(ordered->wait,
> -			   (test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) ||
> -			    test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)));
> -
> -		if (test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) {
> -			/*
> -			 * Clear the AS_EIO/AS_ENOSPC flags from the inode's
> -			 * i_mapping flags, so that the next fsync won't get
> -			 * an outdated io error too.
> -			 */
> -			filemap_check_errors(inode->i_mapping);
> -			*ordered_io_error = true;
> -			break;
> -		}
> -		/*
> -		 * We are going to copy all the csums on this ordered extent, so
> -		 * go ahead and adjust mod_start and mod_len in case this
> -		 * ordered extent has already been logged.
> -		 */
> -		if (ordered->file_offset > mod_start) {
> -			if (ordered->file_offset + ordered->len >=
> -			    mod_start + mod_len)
> -				mod_len = ordered->file_offset - mod_start;
> -			/*
> -			 * If we have this case
> -			 *
> -			 * |--------- logged extent ---------|
> -			 *       |----- ordered extent ----|
> -			 *
> -			 * Just don't mess with mod_start and mod_len, we'll
> -			 * just end up logging more csums than we need and it
> -			 * will be ok.
> -			 */
> -		} else {
> -			if (ordered->file_offset + ordered->len <
> -			    mod_start + mod_len) {
> -				mod_len = (mod_start + mod_len) -
> -					(ordered->file_offset + ordered->len);
> -				mod_start = ordered->file_offset +
> -					ordered->len;
> -			} else {
> -				mod_len = 0;
> -			}
> -		}
> -
> -		if (skip_csum)
> -			continue;
> -
> -		/*
> -		 * To keep us from looping for the above case of an ordered
> -		 * extent that falls inside of the logged extent.
> -		 */
> -		if (test_and_set_bit(BTRFS_ORDERED_LOGGED_CSUM,
> -				     &ordered->flags))
> -			continue;
> -
> -		list_for_each_entry(sum, &ordered->list, list) {
> -			ret = btrfs_csum_file_blocks(trans, log, sum);
> -			if (ret)
> -				break;
> -		}
> -	}
> -
> -	if (*ordered_io_error || !mod_len || ret || skip_csum)
> -		return ret;
> -
> +	/* If we're compressed we have to save the entire range of csums. */
>  	if (em->compress_type) {
>  		csum_offset = 0;
>  		csum_len = max(em->block_len, em->orig_block_len);
>  	} else {
> -		csum_offset = mod_start - em->start;
> -		csum_len = mod_len;
> +		csum_offset = em->mod_start - em->start;
> +		csum_len = em->mod_len;
>  	}
>  
>  	/* block start is already adjusted for the file extent offset. */
> @@ -4244,8 +4147,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>  	int extent_inserted = 0;
>  	bool ordered_io_err = false;
>  
> -	ret = wait_ordered_extents(trans, &inode->vfs_inode, root, em,
> -			logged_list, &ordered_io_err);
> +	ret = log_extent_csums(trans, inode, root, em);

With the following minor diff: 

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index daf32dc94dc3..34d5b0630824 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c

@@ -4064,11 +4064,9 @@ static int extent_cmp(void *priv, struct list_head *a, struct list_head *b)
 
 static int log_extent_csums(struct btrfs_trans_handle *trans,
                            struct btrfs_inode *inode,
-                           struct btrfs_root *root,
+                           struct btrfs_root *log,
                            const struct extent_map *em)
 {
-       struct btrfs_fs_info *fs_info = root->fs_info;
-       struct btrfs_root *log = root->log_root;
        u64 csum_offset;
        u64 csum_len;
        LIST_HEAD(ordered_sums);
@@ -4089,7 +4087,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
        }
 
        /* block start is already adjusted for the file extent offset. */
-       ret = btrfs_lookup_csums_range(fs_info->csum_root,
+       ret = btrfs_lookup_csums_range(trans->fs_info->csum_root,
                                       em->block_start + csum_offset,
                                       em->block_start + csum_offset +
                                       csum_len - 1, &ordered_sums, 0);
@@ -4125,7 +4123,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
        int ret;
        int extent_inserted = 0;
 
-       ret = log_extent_csums(trans, inode, root, em);
+       ret = log_extent_csums(trans, inode, log, em);
        if (ret)
                return ret;

Bloat-o-meter reports: 

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-55 (-55)
Function                                     old     new   delta
btrfs_log_changed_extents.isra              4298    4243     -55
Total: Before=64999, After=64944, chg -0.08%

I suggest you incorporate it in the patch


>  	if (ret)
>  		return ret;
>  
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path
  2018-05-26 12:48   ` Nikolay Borisov
@ 2018-06-20 13:43     ` David Sterba
  2018-06-20 14:26       ` [PATCH] btrfs: Streamline log_extent_csums a bit Nikolay Borisov
  0 siblings, 1 reply; 12+ messages in thread
From: David Sterba @ 2018-06-20 13:43 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs, kernel-team, Josef Bacik

On Sat, May 26, 2018 at 03:48:31PM +0300, Nikolay Borisov wrote:
> > +	ret = log_extent_csums(trans, inode, root, em);
> 
> With the following minor diff: 
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index daf32dc94dc3..34d5b0630824 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> 
> @@ -4064,11 +4064,9 @@ static int extent_cmp(void *priv, struct list_head *a, struct list_head *b)
>  
>  static int log_extent_csums(struct btrfs_trans_handle *trans,
>                             struct btrfs_inode *inode,
> -                           struct btrfs_root *root,
> +                           struct btrfs_root *log,
>                             const struct extent_map *em)
>  {
> -       struct btrfs_fs_info *fs_info = root->fs_info;
> -       struct btrfs_root *log = root->log_root;
>         u64 csum_offset;
>         u64 csum_len;
>         LIST_HEAD(ordered_sums);
> @@ -4089,7 +4087,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
>         }
>  
>         /* block start is already adjusted for the file extent offset. */
> -       ret = btrfs_lookup_csums_range(fs_info->csum_root,
> +       ret = btrfs_lookup_csums_range(trans->fs_info->csum_root,
>                                        em->block_start + csum_offset,
>                                        em->block_start + csum_offset +
>                                        csum_len - 1, &ordered_sums, 0);
> @@ -4125,7 +4123,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>         int ret;
>         int extent_inserted = 0;
>  
> -       ret = log_extent_csums(trans, inode, root, em);
> +       ret = log_extent_csums(trans, inode, log, em);
>         if (ret)
>                 return ret;
> 
> Bloat-o-meter reports: 
> 
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-55 (-55)
> Function                                     old     new   delta
> btrfs_log_changed_extents.isra              4298    4243     -55
> Total: Before=64999, After=64944, chg -0.08%
> 
> I suggest you incorporate it in the patch

The patches are in misc-next now, please send that as a separate patch.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] btrfs: Streamline log_extent_csums a bit
  2018-06-20 13:43     ` David Sterba
@ 2018-06-20 14:26       ` Nikolay Borisov
  2018-06-21 11:11         ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2018-06-20 14:26 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, Nikolay Borisov

Currently this function takes the root as an argument only to get the
log_root from it. Simplify this by directly passing the log root from
the caller. Also eliminate the fs_info local var, since it's used only
once, so directly reference it from the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/tree-log.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index daf32dc94dc3..b52ca6b8503e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4064,11 +4064,9 @@ static int extent_cmp(void *priv, struct list_head *a, struct list_head *b)
 
 static int log_extent_csums(struct btrfs_trans_handle *trans,
 			    struct btrfs_inode *inode,
-			    struct btrfs_root *root,
+			    struct btrfs_root *log,
 			    const struct extent_map *em)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct btrfs_root *log = root->log_root;
 	u64 csum_offset;
 	u64 csum_len;
 	LIST_HEAD(ordered_sums);
@@ -4089,7 +4087,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 	}
 
 	/* block start is already adjusted for the file extent offset. */
-	ret = btrfs_lookup_csums_range(fs_info->csum_root,
+	ret = btrfs_lookup_csums_range(trans->fs_info->csum_root,
 				       em->block_start + csum_offset,
 				       em->block_start + csum_offset +
 				       csum_len - 1, &ordered_sums, 0);
@@ -4125,7 +4123,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	int ret;
 	int extent_inserted = 0;
 
-	ret = log_extent_csums(trans, inode, root, em);
+	ret = log_extent_csums(trans, inode, log, em);
 	if (ret)
 		return ret;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] btrfs: Streamline log_extent_csums a bit
  2018-06-20 14:26       ` [PATCH] btrfs: Streamline log_extent_csums a bit Nikolay Borisov
@ 2018-06-21 11:11         ` David Sterba
  0 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2018-06-21 11:11 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Jun 20, 2018 at 05:26:42PM +0300, Nikolay Borisov wrote:
> Currently this function takes the root as an argument only to get the
> log_root from it. Simplify this by directly passing the log root from
> the caller. Also eliminate the fs_info local var, since it's used only
> once, so directly reference it from the transaction handle.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/4] btrfs: clean up the left over logged_list usage
  2018-05-23 15:58 [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Josef Bacik
  2018-05-23 15:58 ` [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path Josef Bacik
@ 2018-05-23 15:58 ` Josef Bacik
  2018-05-23 15:58 ` [PATCH 4/4] btrfs: remove the logged extents infrastructure Josef Bacik
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2018-05-23 15:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We no longer use this list we've passed around so remove it everywhere.
Also remove the extra checks for ordered/filemap errors as this is
handled higher up now that we're waiting on ordered_extents before
getting to the tree log code.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/tree-log.c | 32 ++------------------------------
 1 file changed, 2 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 791b5a731456..ef1b1a071bba 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4133,7 +4133,6 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 			  struct btrfs_inode *inode, struct btrfs_root *root,
 			  const struct extent_map *em,
 			  struct btrfs_path *path,
-			  const struct list_head *logged_list,
 			  struct btrfs_log_ctx *ctx)
 {
 	struct btrfs_root *log = root->log_root;
@@ -4145,17 +4144,11 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	u64 block_len;
 	int ret;
 	int extent_inserted = 0;
-	bool ordered_io_err = false;
 
 	ret = log_extent_csums(trans, inode, root, em);
 	if (ret)
 		return ret;
 
-	if (ordered_io_err) {
-		ctx->io_err = -EIO;
-		return ctx->io_err;
-	}
-
 	btrfs_init_map_token(&token);
 
 	ret = __btrfs_drop_extents(trans, log, &inode->vfs_inode, path, em->start,
@@ -4226,7 +4219,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root,
 				     struct btrfs_inode *inode,
 				     struct btrfs_path *path,
-				     struct list_head *logged_list,
 				     struct btrfs_log_ctx *ctx,
 				     const u64 start,
 				     const u64 end)
@@ -4302,20 +4294,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 	}
 
 	list_sort(NULL, &extents, extent_cmp);
-	btrfs_get_logged_extents(inode, logged_list, logged_start, logged_end);
-	/*
-	 * Some ordered extents started by fsync might have completed
-	 * before we could collect them into the list logged_list, which
-	 * means they're gone, not in our logged_list nor in the inode's
-	 * ordered tree. We want the application/user space to know an
-	 * error happened while attempting to persist file data so that
-	 * it can take proper action. If such error happened, we leave
-	 * without writing to the log tree and the fsync must report the
-	 * file data write error and not commit the current transaction.
-	 */
-	ret = filemap_check_errors(inode->vfs_inode.i_mapping);
-	if (ret)
-		ctx->io_err = ret;
 process:
 	while (!list_empty(&extents)) {
 		em = list_entry(extents.next, struct extent_map, list);
@@ -4334,8 +4312,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 
 		write_unlock(&tree->lock);
 
-		ret = log_one_extent(trans, inode, root, em, path, logged_list,
-				     ctx);
+		ret = log_one_extent(trans, inode, root, em, path, ctx);
 		write_lock(&tree->lock);
 		clear_em_logging(tree, em);
 		free_extent_map(em);
@@ -4717,7 +4694,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_key min_key;
 	struct btrfs_key max_key;
 	struct btrfs_root *log = root->log_root;
-	LIST_HEAD(logged_list);
 	u64 last_extent = 0;
 	int err = 0;
 	int ret;
@@ -5047,7 +5023,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	}
 	if (fast_search) {
 		ret = btrfs_log_changed_extents(trans, root, inode, dst_path,
-						&logged_list, ctx, start, end);
+						ctx, start, end);
 		if (ret) {
 			err = ret;
 			goto out_unlock;
@@ -5098,10 +5074,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	inode->last_log_commit = inode->last_sub_trans;
 	spin_unlock(&inode->lock);
 out_unlock:
-	if (unlikely(err))
-		btrfs_put_logged_extents(&logged_list);
-	else
-		btrfs_submit_logged_extents(&logged_list, log);
 	mutex_unlock(&inode->log_mutex);
 
 	btrfs_free_path(path);
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/4] btrfs: remove the logged extents infrastructure
  2018-05-23 15:58 [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Josef Bacik
  2018-05-23 15:58 ` [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path Josef Bacik
  2018-05-23 15:58 ` [PATCH 3/4] btrfs: clean up the left over logged_list usage Josef Bacik
@ 2018-05-23 15:58 ` Josef Bacik
  2018-05-24 10:49 ` [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Filipe Manana
  2018-06-20 13:42 ` David Sterba
  4 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2018-05-23 15:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

This is no longer used anywhere, remove all of it.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ordered-data.c      | 123 -------------------------------------------
 fs/btrfs/ordered-data.h      |  20 ++-----
 fs/btrfs/tree-log.c          |  16 ------
 include/trace/events/btrfs.h |   1 -
 4 files changed, 3 insertions(+), 157 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6db8bb2f2c28..88f858baf87d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -427,129 +427,6 @@ int btrfs_dec_test_ordered_pending(struct inode *inode,
 	return ret == 0;
 }
 
-/* Needs to either be called under a log transaction or the log_mutex */
-void btrfs_get_logged_extents(struct btrfs_inode *inode,
-			      struct list_head *logged_list,
-			      const loff_t start,
-			      const loff_t end)
-{
-	struct btrfs_ordered_inode_tree *tree;
-	struct btrfs_ordered_extent *ordered;
-	struct rb_node *n;
-	struct rb_node *prev;
-
-	tree = &inode->ordered_tree;
-	spin_lock_irq(&tree->lock);
-	n = __tree_search(&tree->tree, end, &prev);
-	if (!n)
-		n = prev;
-	for (; n; n = rb_prev(n)) {
-		ordered = rb_entry(n, struct btrfs_ordered_extent, rb_node);
-		if (ordered->file_offset > end)
-			continue;
-		if (entry_end(ordered) <= start)
-			break;
-		if (test_and_set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags))
-			continue;
-		list_add(&ordered->log_list, logged_list);
-		refcount_inc(&ordered->refs);
-	}
-	spin_unlock_irq(&tree->lock);
-}
-
-void btrfs_put_logged_extents(struct list_head *logged_list)
-{
-	struct btrfs_ordered_extent *ordered;
-
-	while (!list_empty(logged_list)) {
-		ordered = list_first_entry(logged_list,
-					   struct btrfs_ordered_extent,
-					   log_list);
-		list_del_init(&ordered->log_list);
-		btrfs_put_ordered_extent(ordered);
-	}
-}
-
-void btrfs_submit_logged_extents(struct list_head *logged_list,
-				 struct btrfs_root *log)
-{
-	int index = log->log_transid % 2;
-
-	spin_lock_irq(&log->log_extents_lock[index]);
-	list_splice_tail(logged_list, &log->logged_list[index]);
-	spin_unlock_irq(&log->log_extents_lock[index]);
-}
-
-void btrfs_wait_logged_extents(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *log, u64 transid)
-{
-	struct btrfs_ordered_extent *ordered;
-	int index = transid % 2;
-
-	spin_lock_irq(&log->log_extents_lock[index]);
-	while (!list_empty(&log->logged_list[index])) {
-		struct inode *inode;
-		ordered = list_first_entry(&log->logged_list[index],
-					   struct btrfs_ordered_extent,
-					   log_list);
-		list_del_init(&ordered->log_list);
-		inode = ordered->inode;
-		spin_unlock_irq(&log->log_extents_lock[index]);
-
-		if (!test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) &&
-		    !test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
-			u64 start = ordered->file_offset;
-			u64 end = ordered->file_offset + ordered->len - 1;
-
-			WARN_ON(!inode);
-			filemap_fdatawrite_range(inode->i_mapping, start, end);
-		}
-		wait_event(ordered->wait, test_bit(BTRFS_ORDERED_IO_DONE,
-						   &ordered->flags));
-
-		/*
-		 * In order to keep us from losing our ordered extent
-		 * information when committing the transaction we have to make
-		 * sure that any logged extents are completed when we go to
-		 * commit the transaction.  To do this we simply increase the
-		 * current transactions pending_ordered counter and decrement it
-		 * when the ordered extent completes.
-		 */
-		if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
-			struct btrfs_ordered_inode_tree *tree;
-
-			tree = &BTRFS_I(inode)->ordered_tree;
-			spin_lock_irq(&tree->lock);
-			if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
-				set_bit(BTRFS_ORDERED_PENDING, &ordered->flags);
-				atomic_inc(&trans->transaction->pending_ordered);
-			}
-			spin_unlock_irq(&tree->lock);
-		}
-		btrfs_put_ordered_extent(ordered);
-		spin_lock_irq(&log->log_extents_lock[index]);
-	}
-	spin_unlock_irq(&log->log_extents_lock[index]);
-}
-
-void btrfs_free_logged_extents(struct btrfs_root *log, u64 transid)
-{
-	struct btrfs_ordered_extent *ordered;
-	int index = transid % 2;
-
-	spin_lock_irq(&log->log_extents_lock[index]);
-	while (!list_empty(&log->logged_list[index])) {
-		ordered = list_first_entry(&log->logged_list[index],
-					   struct btrfs_ordered_extent,
-					   log_list);
-		list_del_init(&ordered->log_list);
-		spin_unlock_irq(&log->log_extents_lock[index]);
-		btrfs_put_ordered_extent(ordered);
-		spin_lock_irq(&log->log_extents_lock[index]);
-	}
-	spin_unlock_irq(&log->log_extents_lock[index]);
-}
-
 /*
  * used to drop a reference on an ordered extent.  This will free
  * the extent if the last reference is dropped
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 3be443fb3001..b2d3f6a091f7 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -54,15 +54,11 @@ struct btrfs_ordered_sum {
 #define BTRFS_ORDERED_UPDATED_ISIZE 7 /* indicates whether this ordered extent
 				       * has done its due diligence in updating
 				       * the isize. */
-#define BTRFS_ORDERED_LOGGED_CSUM 8 /* We've logged the csums on this ordered
-				       ordered extent */
-#define BTRFS_ORDERED_TRUNCATED 9 /* Set when we have to truncate an extent */
+#define BTRFS_ORDERED_TRUNCATED 8 /* Set when we have to truncate an extent */
 
-#define BTRFS_ORDERED_LOGGED 10 /* Set when we've waited on this ordered extent
-				 * in the logging code. */
-#define BTRFS_ORDERED_PENDING 11 /* We are waiting for this ordered extent to
+#define BTRFS_ORDERED_PENDING 9 /* We are waiting for this ordered extent to
 				  * complete in the current transaction. */
-#define BTRFS_ORDERED_REGULAR 12 /* Regular IO for COW */
+#define BTRFS_ORDERED_REGULAR 10 /* Regular IO for COW */
 
 struct btrfs_ordered_extent {
 	/* logical offset in the file */
@@ -193,16 +189,6 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const u64 range_start, const u64 range_len);
 u64 btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 			      const u64 range_start, const u64 range_len);
-void btrfs_get_logged_extents(struct btrfs_inode *inode,
-			      struct list_head *logged_list,
-			      const loff_t start,
-			      const loff_t end);
-void btrfs_put_logged_extents(struct list_head *logged_list);
-void btrfs_submit_logged_extents(struct list_head *logged_list,
-				 struct btrfs_root *log);
-void btrfs_wait_logged_extents(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *log, u64 transid);
-void btrfs_free_logged_extents(struct btrfs_root *log, u64 transid);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index ef1b1a071bba..ab9461a07fc6 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2936,7 +2936,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	/* bail out if we need to do a full commit */
 	if (btrfs_need_log_full_commit(fs_info, trans)) {
 		ret = -EAGAIN;
-		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&root->log_mutex);
 		goto out;
 	}
@@ -2954,7 +2953,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	if (ret) {
 		blk_finish_plug(&plug);
 		btrfs_abort_transaction(trans, ret);
-		btrfs_free_logged_extents(log, log_transid);
 		btrfs_set_log_full_commit(fs_info, trans);
 		mutex_unlock(&root->log_mutex);
 		goto out;
@@ -3008,7 +3006,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 		btrfs_wait_tree_log_extents(log, mark);
-		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
 		goto out;
@@ -3026,7 +3023,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
 		blk_finish_plug(&plug);
 		ret = btrfs_wait_tree_log_extents(log, mark);
-		btrfs_wait_logged_extents(trans, log, log_transid);
 		wait_log_commit(log_root_tree,
 				root_log_ctx.log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -3051,7 +3047,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	if (btrfs_need_log_full_commit(fs_info, trans)) {
 		blk_finish_plug(&plug);
 		btrfs_wait_tree_log_extents(log, mark);
-		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
 		goto out_wake_log_root;
@@ -3064,7 +3059,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	if (ret) {
 		btrfs_set_log_full_commit(fs_info, trans);
 		btrfs_abort_transaction(trans, ret);
-		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
@@ -3074,11 +3068,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 						  EXTENT_NEW | EXTENT_DIRTY);
 	if (ret) {
 		btrfs_set_log_full_commit(fs_info, trans);
-		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
-	btrfs_wait_logged_extents(trans, log, log_transid);
 
 	btrfs_set_super_log_root(fs_info->super_for_commit,
 				 log_root_tree->node->start);
@@ -3163,14 +3155,6 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 				  EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
 	}
 
-	/*
-	 * We may have short-circuited the log tree with the full commit logic
-	 * and left ordered extents on our list, so clear these out to keep us
-	 * from leaking inodes and memory.
-	 */
-	btrfs_free_logged_extents(log, 0);
-	btrfs_free_logged_extents(log, 1);
-
 	free_extent_buffer(log->node);
 	kfree(log);
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 29f9b14412ad..b3c246353f8e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -433,7 +433,6 @@ DEFINE_EVENT(
 		{ (1 << BTRFS_ORDERED_DIRECT),	 	"DIRECT" 	}, \
 		{ (1 << BTRFS_ORDERED_IOERR), 		"IOERR" 	}, \
 		{ (1 << BTRFS_ORDERED_UPDATED_ISIZE), 	"UPDATED_ISIZE"	}, \
-		{ (1 << BTRFS_ORDERED_LOGGED_CSUM), 	"LOGGED_CSUM"	}, \
 		{ (1 << BTRFS_ORDERED_TRUNCATED), 	"TRUNCATED"	})
 
 
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] btrfs: always wait on ordered extents at fsync time
  2018-05-23 15:58 [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Josef Bacik
                   ` (2 preceding siblings ...)
  2018-05-23 15:58 ` [PATCH 4/4] btrfs: remove the logged extents infrastructure Josef Bacik
@ 2018-05-24 10:49 ` Filipe Manana
  2018-05-24 12:40   ` Nikolay Borisov
  2018-06-20 13:44   ` David Sterba
  2018-06-20 13:42 ` David Sterba
  4 siblings, 2 replies; 12+ messages in thread
From: Filipe Manana @ 2018-05-24 10:49 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, Josef Bacik

On Wed, May 23, 2018 at 4:58 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> From: Josef Bacik <jbacik@fb.com>
>
> There's a priority inversion that exists currently with btrfs fsync.  In
> some cases we will collect outstanding ordered extents onto a list and
> only wait on them at the very last second.  However this "very last
> second" falls inside of a transaction handle, so if we are in a lower
> priority cgroup we can end up holding the transaction open for longer
> than needed, so if a high priority cgroup is also trying to fsync()
> it'll see latency.
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/file.c | 56 ++++----------------------------------------------------
>  1 file changed, 4 insertions(+), 52 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 5772f0cbedef..2b1c36612384 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>         atomic_inc(&root->log_batch);
>         full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>                              &BTRFS_I(inode)->runtime_flags);
> +
>         /*
> -        * We might have have had more pages made dirty after calling
> -        * start_ordered_ops and before acquiring the inode's i_mutex.
> +        * We have to do this here to avoid the priority inversion of waiting on
> +        * IO of a lower priority task while holding a transaciton open.
>          */
> -       if (full_sync) {
> -               /*
> -                * For a full sync, we need to make sure any ordered operations
> -                * start and finish before we start logging the inode, so that
> -                * all extents are persisted and the respective file extent
> -                * items are in the fs/subvol btree.
> -                */
> -               ret = btrfs_wait_ordered_range(inode, start, len);
> -       } else {
> -               /*
> -                * Start any new ordered operations before starting to log the
> -                * inode. We will wait for them to finish in btrfs_sync_log().
> -                *
> -                * Right before acquiring the inode's mutex, we might have new
> -                * writes dirtying pages, which won't immediately start the
> -                * respective ordered operations - that is done through the
> -                * fill_delalloc callbacks invoked from the writepage and
> -                * writepages address space operations. So make sure we start
> -                * all ordered operations before starting to log our inode. Not
> -                * doing this means that while logging the inode, writeback
> -                * could start and invoke writepage/writepages, which would call
> -                * the fill_delalloc callbacks (cow_file_range,
> -                * submit_compressed_extents). These callbacks add first an
> -                * extent map to the modified list of extents and then create
> -                * the respective ordered operation, which means in
> -                * tree-log.c:btrfs_log_inode() we might capture all existing
> -                * ordered operations (with btrfs_get_logged_extents()) before
> -                * the fill_delalloc callback adds its ordered operation, and by
> -                * the time we visit the modified list of extent maps (with
> -                * btrfs_log_changed_extents()), we see and process the extent
> -                * map they created. We then use the extent map to construct a
> -                * file extent item for logging without waiting for the
> -                * respective ordered operation to finish - this file extent
> -                * item points to a disk location that might not have yet been
> -                * written to, containing random data - so after a crash a log
> -                * replay will make our inode have file extent items that point
> -                * to disk locations containing invalid data, as we returned
> -                * success to userspace without waiting for the respective
> -                * ordered operation to finish, because it wasn't captured by
> -                * btrfs_get_logged_extents().
> -                */
> -               ret = start_ordered_ops(inode, start, end);
> -       }
> +       ret = btrfs_wait_ordered_range(inode, start, len);
>         if (ret) {
>                 inode_unlock(inode);
>                 goto out;
> @@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>                                 goto out;
>                         }
>                 }
> -               if (!full_sync) {
> -                       ret = btrfs_wait_ordered_range(inode, start, len);
> -                       if (ret) {
> -                               btrfs_end_transaction(trans);
> -                               goto out;
> -                       }
> -               }
>                 ret = btrfs_commit_transaction(trans);
>         } else {
>                 ret = btrfs_end_transaction(trans);

There's more code in this function that can go away after this.
The logic to check if the inode is already in the log can now be
simplified since we always for the ordered extents to complete before
deciding whether the inode needs to be blogged. The big commet about
it can go away too:

https://friendpaste.com/5MHqvkBmdIQgrySryhhjMy

Will you integrate this?

Thanks! This difference in handling ordered extents brought many nasty
bugs in the past.


> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] btrfs: always wait on ordered extents at fsync time
  2018-05-24 10:49 ` [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Filipe Manana
@ 2018-05-24 12:40   ` Nikolay Borisov
  2018-06-20 13:44   ` David Sterba
  1 sibling, 0 replies; 12+ messages in thread
From: Nikolay Borisov @ 2018-05-24 12:40 UTC (permalink / raw)
  To: fdmanana, Josef Bacik; +Cc: linux-btrfs, kernel-team, Josef Bacik



On 24.05.2018 13:49, Filipe Manana wrote:
> On Wed, May 23, 2018 at 4:58 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>> From: Josef Bacik <jbacik@fb.com>
>>
>> There's a priority inversion that exists currently with btrfs fsync.  In
>> some cases we will collect outstanding ordered extents onto a list and
>> only wait on them at the very last second.  However this "very last
>> second" falls inside of a transaction handle, so if we are in a lower
>> priority cgroup we can end up holding the transaction open for longer
>> than needed, so if a high priority cgroup is also trying to fsync()
>> it'll see latency.
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>> ---
>>  fs/btrfs/file.c | 56 ++++----------------------------------------------------
>>  1 file changed, 4 insertions(+), 52 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 5772f0cbedef..2b1c36612384 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>>         atomic_inc(&root->log_batch);
>>         full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>>                              &BTRFS_I(inode)->runtime_flags);
>> +
>>         /*
>> -        * We might have have had more pages made dirty after calling
>> -        * start_ordered_ops and before acquiring the inode's i_mutex.
>> +        * We have to do this here to avoid the priority inversion of waiting on
>> +        * IO of a lower priority task while holding a transaciton open.
>>          */
>> -       if (full_sync) {
>> -               /*
>> -                * For a full sync, we need to make sure any ordered operations
>> -                * start and finish before we start logging the inode, so that
>> -                * all extents are persisted and the respective file extent
>> -                * items are in the fs/subvol btree.
>> -                */
>> -               ret = btrfs_wait_ordered_range(inode, start, len);
>> -       } else {
>> -               /*
>> -                * Start any new ordered operations before starting to log the
>> -                * inode. We will wait for them to finish in btrfs_sync_log().
>> -                *
>> -                * Right before acquiring the inode's mutex, we might have new
>> -                * writes dirtying pages, which won't immediately start the
>> -                * respective ordered operations - that is done through the
>> -                * fill_delalloc callbacks invoked from the writepage and
>> -                * writepages address space operations. So make sure we start
>> -                * all ordered operations before starting to log our inode. Not
>> -                * doing this means that while logging the inode, writeback
>> -                * could start and invoke writepage/writepages, which would call
>> -                * the fill_delalloc callbacks (cow_file_range,
>> -                * submit_compressed_extents). These callbacks add first an
>> -                * extent map to the modified list of extents and then create
>> -                * the respective ordered operation, which means in
>> -                * tree-log.c:btrfs_log_inode() we might capture all existing
>> -                * ordered operations (with btrfs_get_logged_extents()) before
>> -                * the fill_delalloc callback adds its ordered operation, and by
>> -                * the time we visit the modified list of extent maps (with
>> -                * btrfs_log_changed_extents()), we see and process the extent
>> -                * map they created. We then use the extent map to construct a
>> -                * file extent item for logging without waiting for the
>> -                * respective ordered operation to finish - this file extent
>> -                * item points to a disk location that might not have yet been
>> -                * written to, containing random data - so after a crash a log
>> -                * replay will make our inode have file extent items that point
>> -                * to disk locations containing invalid data, as we returned
>> -                * success to userspace without waiting for the respective
>> -                * ordered operation to finish, because it wasn't captured by
>> -                * btrfs_get_logged_extents().
>> -                */
>> -               ret = start_ordered_ops(inode, start, end);
>> -       }
>> +       ret = btrfs_wait_ordered_range(inode, start, len);
>>         if (ret) {
>>                 inode_unlock(inode);
>>                 goto out;
>> @@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>>                                 goto out;
>>                         }
>>                 }
>> -               if (!full_sync) {
>> -                       ret = btrfs_wait_ordered_range(inode, start, len);
>> -                       if (ret) {
>> -                               btrfs_end_transaction(trans);
>> -                               goto out;
>> -                       }
>> -               }
>>                 ret = btrfs_commit_transaction(trans);
>>         } else {
>>                 ret = btrfs_end_transaction(trans);
> 
> There's more code in this function that can go away after this.
> The logic to check if the inode is already in the log can now be
> simplified since we always for the ordered extents to complete before
> deciding whether the inode needs to be blogged. The big commet about
> it can go away too:
> 
> https://friendpaste.com/5MHqvkBmdIQgrySryhhjMy
> 
> Will you integrate this?
> 
> Thanks! This difference in handling ordered extents brought many nasty
> bugs in the past.

While at it, I think the smp_mb before the if() deserves a comment about
the pairing logic as well.


> 
> 
>> --
>> 2.14.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] btrfs: always wait on ordered extents at fsync time
  2018-05-24 10:49 ` [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Filipe Manana
  2018-05-24 12:40   ` Nikolay Borisov
@ 2018-06-20 13:44   ` David Sterba
  1 sibling, 0 replies; 12+ messages in thread
From: David Sterba @ 2018-06-20 13:44 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Josef Bacik, linux-btrfs, kernel-team, Josef Bacik

On Thu, May 24, 2018 at 11:49:04AM +0100, Filipe Manana wrote:
> On Wed, May 23, 2018 at 4:58 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> > From: Josef Bacik <jbacik@fb.com>
> >
> > There's a priority inversion that exists currently with btrfs fsync.  In
> > some cases we will collect outstanding ordered extents onto a list and
> > only wait on them at the very last second.  However this "very last
> > second" falls inside of a transaction handle, so if we are in a lower
> > priority cgroup we can end up holding the transaction open for longer
> > than needed, so if a high priority cgroup is also trying to fsync()
> > it'll see latency.
> >
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/btrfs/file.c | 56 ++++----------------------------------------------------
> >  1 file changed, 4 insertions(+), 52 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 5772f0cbedef..2b1c36612384 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >         atomic_inc(&root->log_batch);
> >         full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> >                              &BTRFS_I(inode)->runtime_flags);
> > +
> >         /*
> > -        * We might have have had more pages made dirty after calling
> > -        * start_ordered_ops and before acquiring the inode's i_mutex.
> > +        * We have to do this here to avoid the priority inversion of waiting on
> > +        * IO of a lower priority task while holding a transaciton open.
> >          */
> > -       if (full_sync) {
> > -               /*
> > -                * For a full sync, we need to make sure any ordered operations
> > -                * start and finish before we start logging the inode, so that
> > -                * all extents are persisted and the respective file extent
> > -                * items are in the fs/subvol btree.
> > -                */
> > -               ret = btrfs_wait_ordered_range(inode, start, len);
> > -       } else {
> > -               /*
> > -                * Start any new ordered operations before starting to log the
> > -                * inode. We will wait for them to finish in btrfs_sync_log().
> > -                *
> > -                * Right before acquiring the inode's mutex, we might have new
> > -                * writes dirtying pages, which won't immediately start the
> > -                * respective ordered operations - that is done through the
> > -                * fill_delalloc callbacks invoked from the writepage and
> > -                * writepages address space operations. So make sure we start
> > -                * all ordered operations before starting to log our inode. Not
> > -                * doing this means that while logging the inode, writeback
> > -                * could start and invoke writepage/writepages, which would call
> > -                * the fill_delalloc callbacks (cow_file_range,
> > -                * submit_compressed_extents). These callbacks add first an
> > -                * extent map to the modified list of extents and then create
> > -                * the respective ordered operation, which means in
> > -                * tree-log.c:btrfs_log_inode() we might capture all existing
> > -                * ordered operations (with btrfs_get_logged_extents()) before
> > -                * the fill_delalloc callback adds its ordered operation, and by
> > -                * the time we visit the modified list of extent maps (with
> > -                * btrfs_log_changed_extents()), we see and process the extent
> > -                * map they created. We then use the extent map to construct a
> > -                * file extent item for logging without waiting for the
> > -                * respective ordered operation to finish - this file extent
> > -                * item points to a disk location that might not have yet been
> > -                * written to, containing random data - so after a crash a log
> > -                * replay will make our inode have file extent items that point
> > -                * to disk locations containing invalid data, as we returned
> > -                * success to userspace without waiting for the respective
> > -                * ordered operation to finish, because it wasn't captured by
> > -                * btrfs_get_logged_extents().
> > -                */
> > -               ret = start_ordered_ops(inode, start, end);
> > -       }
> > +       ret = btrfs_wait_ordered_range(inode, start, len);
> >         if (ret) {
> >                 inode_unlock(inode);
> >                 goto out;
> > @@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >                                 goto out;
> >                         }
> >                 }
> > -               if (!full_sync) {
> > -                       ret = btrfs_wait_ordered_range(inode, start, len);
> > -                       if (ret) {
> > -                               btrfs_end_transaction(trans);
> > -                               goto out;
> > -                       }
> > -               }
> >                 ret = btrfs_commit_transaction(trans);
> >         } else {
> >                 ret = btrfs_end_transaction(trans);
> 
> There's more code in this function that can go away after this.
> The logic to check if the inode is already in the log can now be
> simplified since we always for the ordered extents to complete before
> deciding whether the inode needs to be blogged. The big commet about
> it can go away too:
> 
> https://friendpaste.com/5MHqvkBmdIQgrySryhhjMy

The paste is still alive, the change looks good, can you please send is
as a proper patch? The other patches are in misc-next now. Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] btrfs: always wait on ordered extents at fsync time
  2018-05-23 15:58 [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Josef Bacik
                   ` (3 preceding siblings ...)
  2018-05-24 10:49 ` [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Filipe Manana
@ 2018-06-20 13:42 ` David Sterba
  4 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2018-06-20 13:42 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, Josef Bacik, fdmanana

On Wed, May 23, 2018 at 11:58:33AM -0400, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> There's a priority inversion that exists currently with btrfs fsync.  In
> some cases we will collect outstanding ordered extents onto a list and
> only wait on them at the very last second.  However this "very last
> second" falls inside of a transaction handle, so if we are in a lower
> priority cgroup we can end up holding the transaction open for longer
> than needed, so if a high priority cgroup is also trying to fsync()
> it'll see latency.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

1-4 added to misc-next, with Filipe's reviewed-by from the first
iteration.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-06-21 11:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-05-23 15:58 [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Josef Bacik
2018-05-23 15:58 ` [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path Josef Bacik
2018-05-26 12:48   ` Nikolay Borisov
2018-06-20 13:43     ` David Sterba
2018-06-20 14:26       ` [PATCH] btrfs: Streamline log_extent_csums a bit Nikolay Borisov
2018-06-21 11:11         ` David Sterba
2018-05-23 15:58 ` [PATCH 3/4] btrfs: clean up the left over logged_list usage Josef Bacik
2018-05-23 15:58 ` [PATCH 4/4] btrfs: remove the logged extents infrastructure Josef Bacik
2018-05-24 10:49 ` [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Filipe Manana
2018-05-24 12:40   ` Nikolay Borisov
2018-06-20 13:44   ` David Sterba
2018-06-20 13:42 ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.