Re: [PATCH] btrfs: Fix locking during DIO read

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Filipe Manana <fdmanana@gmail.com>
To: Nikolay Borisov <nborisov@suse.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	Liu Bo <bo.li.liu@oracle.com>, Josef Bacik <josef@toxicpanda.com>
Subject: Re: [PATCH] btrfs: Fix locking during DIO read
Date: Wed, 21 Feb 2018 13:06:07 +0000	[thread overview]
Message-ID: <CAL3q7H7ccJybVDcr9=zT4hGOR0q0f2ukFabFyYRRc_Kn=X1BQw@mail.gmail.com> (raw)
In-Reply-To: <1519213303-2802-1-git-send-email-nborisov@suse.com>

On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> Currently the DIO read cases uses a botched idea from ext4 to ensure
> that DIO reads don't race with truncate. The idea is that if we have a
> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> forces the dio read case to fallback to inode_locking to prevent
> read/truncate races. Unfortunately this is subtly broken for at least
> 2 reasons:
>
> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> (for the read case). This means that there is no ordering guarantee
> between the invocation of inode_dio_wait and the increment of
> i_dio_count in btrfs_direct_IO in the tread case.
>
> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
> are not really paired with the reader side - the test_bit in
> btrfs_direct_IO, since the latter is missing a memory barrier. Furthermore,
> the actual sleeping condition that needs ordering to prevent live-locks/
> missed wakeups is the modification/read of i_dio_count. So in this case
> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>
> The interraction between the two threads roughly looks like:
>
> T1(truncate):                                    T2(btrfs_direct_IO):
> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
> if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
>   schedule()                                            wake_up_bit
> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>
> Without the ordering between the test_bit in T2 and setting the bit in
> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
> to sleep in schedule and T2 misses the bit set, resulting in missing the
> wake up.
>
> In any case all of this is VERY subtle. So fix it by simply making
> the DIO READ case take inode_lock_shared. This ensure that we can have
> DIO reads in parallel at the same time we are protected against
> concurrent modification of the target file.

And that prevents writes and reads against different (i.e. not
overlapping) ranges from happening in parallel.
That has a big impact on applications (databases for e.g.) that
operate on large files serving multiple requests.
Now all reads are serialized against all writes and vice versa.

Unless I missed something, a big NAK to this change as it is.


> This way we closely mimic
> what ext4 codes does and simplify this mess.
>
> Multiple xfstest runs didn't show any regressions.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/btrfs_inode.h | 17 -----------------
>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>  2 files changed, 20 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index f527e99c9f8d..3519e49d4ef0 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>                         blk_status_t);
>  };
>
> -/*
> - * Disable DIO read nolock optimization, so new dio readers will be forced
> - * to grab i_mutex. It is used to avoid the endless truncate due to
> - * nonlocked dio read.
> - */
> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
> -{
> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
> -       smp_mb();
> -}
> -
> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
> -{
> -       smp_mb__before_atomic();
> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
> -}
> -
>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>  {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 491a7397f6fa..9c43257e6e11 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>                 truncate_setsize(inode, newsize);
>
> -               /* Disable nonlocked read DIO to avoid the end less truncate */
> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
> +               /*
> +                * Truncate after all in-flight dios are finished, new ones
> +                * will block on inode_lock. This only matters for AIO requests
> +                * since DIO READ is performed under inode_shared_lock and
> +                * write under exclusive lock.
> +                */
>                 inode_dio_wait(inode);
> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>
>                 ret = btrfs_truncate(inode);
>                 if (ret && inode->i_nlink) {
> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>         loff_t offset = iocb->ki_pos;
>         size_t count = 0;
>         int flags = 0;
> -       bool wakeup = true;
>         bool relock = false;
>         ssize_t ret;
>
>         if (check_direct_IO(fs_info, iter, offset))
>                 return 0;
>
> -       inode_dio_begin(inode);
> -
>         /*
>          * The generic stuff only does filemap_write_and_wait_range, which
>          * isn't enough if we've written compressed pages to this area, so
> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                                          offset + count - 1);
>
>         if (iov_iter_rw(iter) == WRITE) {
> +
> +               inode_dio_begin(inode);
> +
>                 /*
>                  * If the write DIO is beyond the EOF, we need update
>                  * the isize, but it is protected by i_mutex. So we can
> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>                 current->journal_info = &dio_data;
>                 down_read(&BTRFS_I(inode)->dio_sem);
> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
> -                                    &BTRFS_I(inode)->runtime_flags)) {
> -               inode_dio_end(inode);
> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
> -               wakeup = false;
> +       } else {
> +               /*
> +                * In DIO READ case locking the inode in shared mode ensures
> +                * we are protected against parallel writes/truncates
> +                */
> +               inode_lock_shared(inode);
> +               inode_dio_begin(inode);
>         }
>
>         ret = __blockdev_direct_IO(iocb, inode,
> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                         btrfs_delalloc_release_space(inode, data_reserved,
>                                         offset, count - (size_t)ret);
>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
> -       }
> +       } else
> +               inode_unlock_shared(inode);
>  out:
> -       if (wakeup)
> -               inode_dio_end(inode);
> +       inode_dio_end(inode);
> +
>         if (relock)
>                 inode_lock(inode);
>
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

next prev parent reply	other threads:[~2018-02-21 13:06 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-21 11:41 [PATCH] btrfs: Fix locking during DIO read Nikolay Borisov
2018-02-21 13:06 ` Filipe Manana [this message]
2018-02-21 13:10   ` Nikolay Borisov
2018-02-21 13:27     ` Filipe Manana
2018-02-21 13:51 ` Filipe Manana
2018-02-21 14:15   ` Nikolay Borisov
2018-02-21 14:42     ` Filipe Manana
2018-02-21 18:28       ` Liu Bo
2018-02-21 18:38         ` Nikolay Borisov
2018-02-21 19:05         ` Filipe Manana
2018-02-21 22:38           ` Liu Bo
2018-02-22  6:49             ` Nikolay Borisov
2018-02-22 19:09               ` Liu Bo
2018-02-22 19:24                 ` Liu Bo
2018-02-22 23:39                   ` David Sterba
2018-02-23  6:36                     ` Nikolay Borisov
2018-02-22 10:05             ` Filipe Manana
2018-02-21 18:14     ` Liu Bo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAL3q7H7ccJybVDcr9=zT4hGOR0q0f2ukFabFyYRRc_Kn=X1BQw@mail.gmail.com' \
    --to=fdmanana@gmail.com \
    --cc=bo.li.liu@oracle.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=nborisov@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).