[PATCH for 5.15 stable] btrfs: fix corruption after buffer fault in during direct IO append write

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH for 5.15 stable] btrfs: fix corruption after buffer fault in during direct IO append write
       [not found] <cover.1723046461.git.fdmanana@suse.com>
@ 2024-08-07 16:06 ` fdmanana
  2024-08-12 14:19   ` Greg KH
  2024-08-12 15:27 ` [PATCH for 5.15 stable] btrfs: fix double inode unlock for direct IO sync writes fdmanana
  1 sibling, 1 reply; 4+ messages in thread
From: fdmanana @ 2024-08-07 16:06 UTC (permalink / raw)
  To: stable; +Cc: linux-btrfs, gregkh, Filipe Manana, Hanna Czenczek

From: Filipe Manana <fdmanana@suse.com>

commit 939b656bc8ab203fdbde26ccac22bcb7f0985be5 upstream.

During an append (O_APPEND write flag) direct IO write if the input buffer
was not previously faulted in, we can corrupt the file in a way that the
final size is unexpected and it includes an unexpected hole.

The problem happens like this:

1) We have an empty file, with size 0, for example;

2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
   buffer is not currently faulted in;

3) We enter btrfs_direct_write(), lock the inode and call
   generic_write_checks(), which calls generic_write_checks_count(), and
   that function sets the iocb position to 0 with the following code:

        if (iocb->ki_flags & IOCB_APPEND)
                iocb->ki_pos = i_size_read(inode);

4) We call btrfs_dio_write() and enter into iomap, which will end up
   calling btrfs_dio_iomap_begin() and that calls
   btrfs_get_blocks_direct_write(), where we update the i_size of the
   inode to 4096 bytes;

5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
   the page of the write input buffer (at iomap_dio_bio_iter(), with a
   call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
   returned to btrfs at btrfs_direct_write() via btrfs_dio_write();

6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
   fault in the write buffer and then goto to the label 'relock';

7) We lock again the inode, do all the necessary checks again and call
   again generic_write_checks(), which calls generic_write_checks_count()
   again, and there we set the iocb's position to 4K, which is the current
   i_size of the inode, with the following code pointed above:

        if (iocb->ki_flags & IOCB_APPEND)
                iocb->ki_pos = i_size_read(inode);

8) Then we go again to btrfs_dio_write() and enter iomap and the write
   succeeds, but it wrote to the file range [4K, 8K[, leaving a hole in
   the [0, 4K[ range and an i_size of 8K, which goes against the
   expections of having the data written to the range [0, 4K[ and get an
   i_size of 4K.

Fix this by not unlocking the inode before faulting in the input buffer,
in case we get -EFAULT or an incomplete write, and not jumping to the
'relock' label after faulting in the buffer - instead jump to a location
immediately before calling iomap, skipping all the write checks and
relocking. This solves this problem and it's fine even in case the input
buffer is memory mapped to the same file range, since only holding the
range locked in the inode's io tree can cause a deadlock, it's safe to
keep the inode lock (VFS lock), as was fixed and described in commit
51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO
reads and writes").

A sample reproducer provided by a reporter is the following:

   $ cat test.c
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/mman.h>
   #include <sys/stat.h>
   #include <unistd.h>

   int main(int argc, char *argv[])
   {
       if (argc < 2) {
           fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
           return 1;
       }

       int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
                     O_APPEND, 0644);
       if (fd < 0) {
           perror("creating test file");
           return 1;
       }

       char *buf = mmap(NULL, 4096, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       ssize_t ret = write(fd, buf, 4096);
       if (ret < 0) {
           perror("pwritev2");
           return 1;
       }

       struct stat stbuf;
       ret = fstat(fd, &stbuf);
       if (ret < 0) {
           perror("stat");
           return 1;
       }

       printf("size: %llu\n", (unsigned long long)stbuf.st_size);
       return stbuf.st_size == 4096 ? 0 : 1;
   }

A test case for fstests will be sent soon.

Reported-by: Hanna Czenczek <hreitz@redhat.com>
Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
Fixes: 8184620ae212 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/file.c  | 55 ++++++++++++++++++++++++++++++++++++------------
 2 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 17ebcf19b444..f19c6aa3ea4b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1383,6 +1383,7 @@ struct btrfs_drop_extents_args {
 struct btrfs_file_private {
 	void *filldir_buf;
 	u64 last_index;
+	bool fsync_skip_inode_lock;
 };
 
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index eae622ef4c6d..7ca49c02e8f8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1983,22 +1983,38 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * So here we disable page faults in the iov_iter and then retry if we
 	 * got -EFAULT, faulting in the pages before the retry.
 	 */
+again:
 	from->nofault = true;
 	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 			     IOMAP_DIO_PARTIAL, written);
 	from->nofault = false;
 
-	/*
-	 * iomap_dio_complete() will call btrfs_sync_file() if we have a dsync
-	 * iocb, and that needs to lock the inode. So unlock it before calling
-	 * iomap_dio_complete() to avoid a deadlock.
-	 */
-	btrfs_inode_unlock(inode, ilock_flags);
-
-	if (IS_ERR_OR_NULL(dio))
+	if (IS_ERR_OR_NULL(dio)) {
 		err = PTR_ERR_OR_ZERO(dio);
-	else
+	} else {
+		struct btrfs_file_private stack_private = { 0 };
+		struct btrfs_file_private *private;
+		const bool have_private = (file->private_data != NULL);
+
+		if (!have_private)
+			file->private_data = &stack_private;
+
+		/*
+		 * If we have a synchoronous write, we must make sure the fsync
+		 * triggered by the iomap_dio_complete() call below doesn't
+		 * deadlock on the inode lock - we are already holding it and we
+		 * can't call it after unlocking because we may need to complete
+		 * partial writes due to the input buffer (or parts of it) not
+		 * being already faulted in.
+		 */
+		private = file->private_data;
+		private->fsync_skip_inode_lock = true;
 		err = iomap_dio_complete(dio);
+		private->fsync_skip_inode_lock = false;
+
+		if (!have_private)
+			file->private_data = NULL;
+	}
 
 	/* No increment (+=) because iomap returns a cumulative value. */
 	if (err > 0)
@@ -2025,10 +2041,12 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		} else {
 			fault_in_iov_iter_readable(from, left);
 			prev_left = left;
-			goto relock;
+			goto again;
 		}
 	}
 
+	btrfs_inode_unlock(inode, ilock_flags);
+
 	/* If 'err' is -ENOTBLK then it means we must fallback to buffered IO. */
 	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
@@ -2177,6 +2195,7 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
  */
 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
+	struct btrfs_file_private *private = file->private_data;
 	struct dentry *dentry = file_dentry(file);
 	struct inode *inode = d_inode(dentry);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2186,6 +2205,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	int ret = 0, err;
 	u64 len;
 	bool full_sync;
+	const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false);
 
 	trace_btrfs_sync_file(file, datasync);
 
@@ -2213,7 +2233,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (ret)
 		goto out;
 
-	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
+	if (skip_ilock)
+		down_write(&BTRFS_I(inode)->i_mmap_lock);
+	else
+		btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
 
 	atomic_inc(&root->log_batch);
 
@@ -2245,7 +2268,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 */
 	ret = start_ordered_ops(inode, start, end);
 	if (ret) {
-		btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
+		if (skip_ilock)
+			up_write(&BTRFS_I(inode)->i_mmap_lock);
+		else
+			btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 		goto out;
 	}
 
@@ -2337,7 +2363,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * file again, but that will end up using the synchronization
 	 * inside btrfs_sync_log to keep things safe.
 	 */
-	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
+	if (skip_ilock)
+		up_write(&BTRFS_I(inode)->i_mmap_lock);
+	else
+		btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 
 	if (ret == BTRFS_NO_LOG_SYNC) {
 		ret = btrfs_end_transaction(trans);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH for 5.15 stable] btrfs: fix corruption after buffer fault in during direct IO append write
  2024-08-07 16:06 ` [PATCH for 5.15 stable] btrfs: fix corruption after buffer fault in during direct IO append write fdmanana
@ 2024-08-12 14:19   ` Greg KH
  0 siblings, 0 replies; 4+ messages in thread
From: Greg KH @ 2024-08-12 14:19 UTC (permalink / raw)
  To: fdmanana; +Cc: stable, linux-btrfs, Filipe Manana, Hanna Czenczek

On Wed, Aug 07, 2024 at 05:06:17PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> commit 939b656bc8ab203fdbde26ccac22bcb7f0985be5 upstream.
> 
> During an append (O_APPEND write flag) direct IO write if the input buffer
> was not previously faulted in, we can corrupt the file in a way that the
> final size is unexpected and it includes an unexpected hole.
> 
> The problem happens like this:
> 
> 1) We have an empty file, with size 0, for example;
> 
> 2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
>    buffer is not currently faulted in;
> 
> 3) We enter btrfs_direct_write(), lock the inode and call
>    generic_write_checks(), which calls generic_write_checks_count(), and
>    that function sets the iocb position to 0 with the following code:
> 
>         if (iocb->ki_flags & IOCB_APPEND)
>                 iocb->ki_pos = i_size_read(inode);
> 
> 4) We call btrfs_dio_write() and enter into iomap, which will end up
>    calling btrfs_dio_iomap_begin() and that calls
>    btrfs_get_blocks_direct_write(), where we update the i_size of the
>    inode to 4096 bytes;
> 
> 5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
>    the page of the write input buffer (at iomap_dio_bio_iter(), with a
>    call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
>    returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
> 
> 6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
>    fault in the write buffer and then goto to the label 'relock';
> 
> 7) We lock again the inode, do all the necessary checks again and call
>    again generic_write_checks(), which calls generic_write_checks_count()
>    again, and there we set the iocb's position to 4K, which is the current
>    i_size of the inode, with the following code pointed above:
> 
>         if (iocb->ki_flags & IOCB_APPEND)
>                 iocb->ki_pos = i_size_read(inode);
> 
> 8) Then we go again to btrfs_dio_write() and enter iomap and the write
>    succeeds, but it wrote to the file range [4K, 8K[, leaving a hole in
>    the [0, 4K[ range and an i_size of 8K, which goes against the
>    expections of having the data written to the range [0, 4K[ and get an
>    i_size of 4K.
> 
> Fix this by not unlocking the inode before faulting in the input buffer,
> in case we get -EFAULT or an incomplete write, and not jumping to the
> 'relock' label after faulting in the buffer - instead jump to a location
> immediately before calling iomap, skipping all the write checks and
> relocking. This solves this problem and it's fine even in case the input
> buffer is memory mapped to the same file range, since only holding the
> range locked in the inode's io tree can cause a deadlock, it's safe to
> keep the inode lock (VFS lock), as was fixed and described in commit
> 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO
> reads and writes").
> 
> A sample reproducer provided by a reporter is the following:
> 
>    $ cat test.c
>    #ifndef _GNU_SOURCE
>    #define _GNU_SOURCE
>    #endif
> 
>    #include <fcntl.h>
>    #include <stdio.h>
>    #include <sys/mman.h>
>    #include <sys/stat.h>
>    #include <unistd.h>
> 
>    int main(int argc, char *argv[])
>    {
>        if (argc < 2) {
>            fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
>            return 1;
>        }
> 
>        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
>                      O_APPEND, 0644);
>        if (fd < 0) {
>            perror("creating test file");
>            return 1;
>        }
> 
>        char *buf = mmap(NULL, 4096, PROT_READ,
>                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>        ssize_t ret = write(fd, buf, 4096);
>        if (ret < 0) {
>            perror("pwritev2");
>            return 1;
>        }
> 
>        struct stat stbuf;
>        ret = fstat(fd, &stbuf);
>        if (ret < 0) {
>            perror("stat");
>            return 1;
>        }
> 
>        printf("size: %llu\n", (unsigned long long)stbuf.st_size);
>        return stbuf.st_size == 4096 ? 0 : 1;
>    }
> 
> A test case for fstests will be sent soon.
> 
> Reported-by: Hanna Czenczek <hreitz@redhat.com>
> Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
> Fixes: 8184620ae212 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

All now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH for 5.15 stable] btrfs: fix double inode unlock for direct IO sync writes
       [not found] <cover.1723046461.git.fdmanana@suse.com>
  2024-08-07 16:06 ` [PATCH for 5.15 stable] btrfs: fix corruption after buffer fault in during direct IO append write fdmanana
@ 2024-08-12 15:27 ` fdmanana
  2024-08-12 15:49   ` Greg KH
  1 sibling, 1 reply; 4+ messages in thread
From: fdmanana @ 2024-08-12 15:27 UTC (permalink / raw)
  To: stable
  Cc: linux-btrfs, gregkh, Filipe Manana, syzbot+7dbbb74af6291b5a5a8b,
	Josef Bacik, David Sterba

From: Filipe Manana <fdmanana@suse.com>

commit e0391e92f9ab4fb3dbdeb139c967dcfa7ac4b115 upstream.

If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
inode logging or we get an error starting a transaction or an error when
flushing delalloc, we end up unlocking the inode when we shouldn't under
the 'out_release_extents' label, and then unlock it again at
btrfs_direct_write().

Fix that by checking if we have to skip inode unlocking under that label.

Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/file.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7ca49c02e8f8..c44dfb4370d7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2433,7 +2433,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 out_release_extents:
 	btrfs_release_log_ctx_extents(&ctx);
-	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
+	if (skip_ilock)
+		up_write(&BTRFS_I(inode)->i_mmap_lock);
+	else
+		btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 	goto out;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH for 5.15 stable] btrfs: fix double inode unlock for direct IO sync writes
  2024-08-12 15:27 ` [PATCH for 5.15 stable] btrfs: fix double inode unlock for direct IO sync writes fdmanana
@ 2024-08-12 15:49   ` Greg KH
  0 siblings, 0 replies; 4+ messages in thread
From: Greg KH @ 2024-08-12 15:49 UTC (permalink / raw)
  To: fdmanana
  Cc: stable, linux-btrfs, Filipe Manana, syzbot+7dbbb74af6291b5a5a8b,
	Josef Bacik, David Sterba

On Mon, Aug 12, 2024 at 04:27:33PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> commit e0391e92f9ab4fb3dbdeb139c967dcfa7ac4b115 upstream.
> 
> If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
> inode logging or we get an error starting a transaction or an error when
> flushing delalloc, we end up unlocking the inode when we shouldn't under
> the 'out_release_extents' label, and then unlock it again at
> btrfs_direct_write().
> 
> Fix that by checking if we have to skip inode unlocking under that label.
> 
> Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/
> Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/file.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)

All now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-08-12 15:49 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <cover.1723046461.git.fdmanana@suse.com>
2024-08-07 16:06 ` [PATCH for 5.15 stable] btrfs: fix corruption after buffer fault in during direct IO append write fdmanana
2024-08-12 14:19   ` Greg KH
2024-08-12 15:27 ` [PATCH for 5.15 stable] btrfs: fix double inode unlock for direct IO sync writes fdmanana
2024-08-12 15:49   ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox