Re: [PATCH 1/4] btrfs: always wait on ordered extents at fsync time

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Nikolay Borisov <nborisov@suse.com>
To: fdmanana@gmail.com, Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	kernel-team@fb.com, Josef Bacik <jbacik@fb.com>
Subject: Re: [PATCH 1/4] btrfs: always wait on ordered extents at fsync time
Date: Thu, 24 May 2018 15:40:57 +0300	[thread overview]
Message-ID: <a87a3a7e-d55c-d763-844c-15ec2c48c171@suse.com> (raw)
In-Reply-To: <CAL3q7H4efWBiovBP3XQtTQVP1tSL+a3hBtaOm36H=ss_F0mCFw@mail.gmail.com>



On 24.05.2018 13:49, Filipe Manana wrote:
> On Wed, May 23, 2018 at 4:58 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>> From: Josef Bacik <jbacik@fb.com>
>>
>> There's a priority inversion that exists currently with btrfs fsync.  In
>> some cases we will collect outstanding ordered extents onto a list and
>> only wait on them at the very last second.  However this "very last
>> second" falls inside of a transaction handle, so if we are in a lower
>> priority cgroup we can end up holding the transaction open for longer
>> than needed, so if a high priority cgroup is also trying to fsync()
>> it'll see latency.
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>> ---
>>  fs/btrfs/file.c | 56 ++++----------------------------------------------------
>>  1 file changed, 4 insertions(+), 52 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 5772f0cbedef..2b1c36612384 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>>         atomic_inc(&root->log_batch);
>>         full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>>                              &BTRFS_I(inode)->runtime_flags);
>> +
>>         /*
>> -        * We might have have had more pages made dirty after calling
>> -        * start_ordered_ops and before acquiring the inode's i_mutex.
>> +        * We have to do this here to avoid the priority inversion of waiting on
>> +        * IO of a lower priority task while holding a transaciton open.
>>          */
>> -       if (full_sync) {
>> -               /*
>> -                * For a full sync, we need to make sure any ordered operations
>> -                * start and finish before we start logging the inode, so that
>> -                * all extents are persisted and the respective file extent
>> -                * items are in the fs/subvol btree.
>> -                */
>> -               ret = btrfs_wait_ordered_range(inode, start, len);
>> -       } else {
>> -               /*
>> -                * Start any new ordered operations before starting to log the
>> -                * inode. We will wait for them to finish in btrfs_sync_log().
>> -                *
>> -                * Right before acquiring the inode's mutex, we might have new
>> -                * writes dirtying pages, which won't immediately start the
>> -                * respective ordered operations - that is done through the
>> -                * fill_delalloc callbacks invoked from the writepage and
>> -                * writepages address space operations. So make sure we start
>> -                * all ordered operations before starting to log our inode. Not
>> -                * doing this means that while logging the inode, writeback
>> -                * could start and invoke writepage/writepages, which would call
>> -                * the fill_delalloc callbacks (cow_file_range,
>> -                * submit_compressed_extents). These callbacks add first an
>> -                * extent map to the modified list of extents and then create
>> -                * the respective ordered operation, which means in
>> -                * tree-log.c:btrfs_log_inode() we might capture all existing
>> -                * ordered operations (with btrfs_get_logged_extents()) before
>> -                * the fill_delalloc callback adds its ordered operation, and by
>> -                * the time we visit the modified list of extent maps (with
>> -                * btrfs_log_changed_extents()), we see and process the extent
>> -                * map they created. We then use the extent map to construct a
>> -                * file extent item for logging without waiting for the
>> -                * respective ordered operation to finish - this file extent
>> -                * item points to a disk location that might not have yet been
>> -                * written to, containing random data - so after a crash a log
>> -                * replay will make our inode have file extent items that point
>> -                * to disk locations containing invalid data, as we returned
>> -                * success to userspace without waiting for the respective
>> -                * ordered operation to finish, because it wasn't captured by
>> -                * btrfs_get_logged_extents().
>> -                */
>> -               ret = start_ordered_ops(inode, start, end);
>> -       }
>> +       ret = btrfs_wait_ordered_range(inode, start, len);
>>         if (ret) {
>>                 inode_unlock(inode);
>>                 goto out;
>> @@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>>                                 goto out;
>>                         }
>>                 }
>> -               if (!full_sync) {
>> -                       ret = btrfs_wait_ordered_range(inode, start, len);
>> -                       if (ret) {
>> -                               btrfs_end_transaction(trans);
>> -                               goto out;
>> -                       }
>> -               }
>>                 ret = btrfs_commit_transaction(trans);
>>         } else {
>>                 ret = btrfs_end_transaction(trans);
> 
> There's more code in this function that can go away after this.
> The logic to check if the inode is already in the log can now be
> simplified since we always for the ordered extents to complete before
> deciding whether the inode needs to be blogged. The big commet about
> it can go away too:
> 
> https://friendpaste.com/5MHqvkBmdIQgrySryhhjMy
> 
> Will you integrate this?
> 
> Thanks! This difference in handling ordered extents brought many nasty
> bugs in the past.

While at it, I think the smp_mb before the if() deserves a comment about
the pairing logic as well.


> 
> 
>> --
>> 2.14.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
>

next prev parent reply	other threads:[~2018-05-24 12:41 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-23 15:58 [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Josef Bacik
2018-05-23 15:58 ` [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path Josef Bacik
2018-05-26 12:48   ` Nikolay Borisov
2018-06-20 13:43     ` David Sterba
2018-06-20 14:26       ` [PATCH] btrfs: Streamline log_extent_csums a bit Nikolay Borisov
2018-06-21 11:11         ` David Sterba
2018-05-23 15:58 ` [PATCH 3/4] btrfs: clean up the left over logged_list usage Josef Bacik
2018-05-23 15:58 ` [PATCH 4/4] btrfs: remove the logged extents infrastructure Josef Bacik
2018-05-24 10:49 ` [PATCH 1/4] btrfs: always wait on ordered extents at fsync time Filipe Manana
2018-05-24 12:40   ` Nikolay Borisov [this message]
2018-06-20 13:44   ` David Sterba
2018-06-20 13:42 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a87a3a7e-d55c-d763-844c-15ec2c48c171@suse.com \
    --to=nborisov@suse.com \
    --cc=fdmanana@gmail.com \
    --cc=jbacik@fb.com \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).