From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: fix corruption after buffer fault in during direct IO append write
Date: Fri, 26 Jul 2024 12:21:43 -0400 [thread overview]
Message-ID: <20240726162143.GA3435578@perftesting> (raw)
In-Reply-To: <a7cdb10155e5e823ce82edfc8eed99d1b0ef4eeb.1722005943.git.fdmanana@suse.com>
On Fri, Jul 26, 2024 at 04:55:40PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> During an append (O_APPEND write flag) direct IO write if the input buffer
> was not previously faulted in, we can corrupt the file in a way that the
> final size is unexpected and it includes an unexpected hole.
>
> The problem happens like this:
>
> 1) We have an empty file, with size 0, for example;
>
> 2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
> buffer is not currently faulted in;
>
> 3) We enter btrfs_direct_write(), lock the inode and call
> generic_write_checks(), which calls generic_write_checks_count(), and
> that function sets the iocb position to 0 with the following code:
>
> if (iocb->ki_flags & IOCB_APPEND)
> iocb->ki_pos = i_size_read(inode);
>
> 4) We call btrfs_dio_write() and enter into iomap, which will end up
> calling btrfs_dio_iomap_begin() and that calls
> btrfs_get_blocks_direct_write(), where we update the i_size of the
> inode to 4096 bytes;
>
> 5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
> the page of the write input buffer (at iomap_dio_bio_iter(), with a
> call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
> returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
>
> 6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
> fault in the write buffer and then goto to the label 'relock';
>
> 7) We lock again the inode, do all the necessary checks again and call
> again generic_write_checks(), which calls generic_write_checks_count()
> again, and there we set the iocb's position to 4K, which is the current
> i_size of the inode, with the following code pointed above:
>
> if (iocb->ki_flags & IOCB_APPEND)
> iocb->ki_pos = i_size_read(inode);
>
> 8) Then we go again to btrfs_dio_write() and enter iomap and the write
> succeeds, but it wrote to the file range [4K, 8K[, leaving a hole in
> the [0, 4K[ range and an i_size of 8K, which goes against the
> expections of having the data written to the range [0, 4K[ and get an
> i_size of 4K.
>
> Fix this by not unlocking the inode before faulting in the input buffer,
> in case we get -EFAULT or an incomplete write, and not jumping to the
> 'relock' label after faulting in the buffer - instead jump to a location
> immediately before calling iomap, skipping all the write checks and
> relocking. This solves this problem and it's fine even in case the input
> buffer is memory mapped to the same file range, since only holding the
> range locked in the inode's io tree can cause a deadlock, it's safe to
> keep the inode lock (VFS lock), as was fixed and described in commit
> 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO
> reads and writes").
>
> A sample reproducer provided by a reporter is the following:
>
> $ cat test.c
> #ifndef _GNU_SOURCE
> #define _GNU_SOURCE
> #endif
>
> #include <fcntl.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
> if (argc < 2) {
> fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
> return 1;
> }
>
> int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
> O_APPEND, 0644);
> if (fd < 0) {
> perror("creating test file");
> return 1;
> }
>
> char *buf = mmap(NULL, 4096, PROT_READ,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> ssize_t ret = write(fd, buf, 4096);
> if (ret < 0) {
> perror("pwritev2");
> return 1;
> }
>
> struct stat stbuf;
> ret = fstat(fd, &stbuf);
> if (ret < 0) {
> perror("stat");
> return 1;
> }
>
> printf("size: %llu\n", (unsigned long long)stbuf.st_size);
> return stbuf.st_size == 4096 ? 0 : 1;
> }
>
> A test case for fstests will be sent soon.
>
> Reported-by: Hanna Czenczek <hreitz@redhat.com>
> Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
> Fixes: 8184620ae212 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks for this Filipe, this stuff is so messy,
Josef
next prev parent reply other threads:[~2024-07-26 16:21 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-26 15:55 [PATCH] btrfs: fix corruption after buffer fault in during direct IO append write fdmanana
2024-07-26 16:21 ` Josef Bacik [this message]
2024-07-26 16:26 ` [PATCH v2] " fdmanana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240726162143.GA3435578@perftesting \
--to=josef@toxicpanda.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.