From: "Darrick J. Wong" <djwong@kernel.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, fstests@vger.kernel.org,
Christoph Hellwig <hch@infradead.org>,
David Disseldorp <ddiss@suse.de>, Zorro Lang <zlang@redhat.com>
Subject: Re: [PATCH v3] fstests: add a generic test to verify direct IO writes with buffer contents change
Date: Tue, 11 Feb 2025 22:16:57 -0800 [thread overview]
Message-ID: <20250212061657.GX21799@frogsfrogsfrogs> (raw)
In-Reply-To: <93410edfde1cb0405c133cca49a8291dcdb90e2e.1739329404.git.wqu@suse.com>
On Wed, Feb 12, 2025 at 01:34:06PM +1030, Qu Wenruo wrote:
> There is a long existing btrfs problem that if some one is modifying the
> buffer of an on-going direct IO write, it has a very high chance causing
> permanent data checksum mismatch.
>
> This is caused by the following factors:
>
> - Direct IO buffer is out of the control of filesystem
> Thus user space can modify the contents during writeback.
>
> - Btrfs generate its data checksum just before submitting the bio
> This means if the contents happens after the checksum is generated,
> the data written to disk will no longer match the checksum.
>
> Btrfs later fixes the problem by forcing the direct IO to fallback to
> buffered IO (if the inode requires data checksum), so that btrfs can
> have a consistent view of the buffer.
>
> This test case will verify the behavior by:
>
> - Create a helper program 'dio-writeback-race'
> Which does direct IO writes block-by-block, and the buffer is always
> initialized to all 0xff before write,
> Then starting two threads:
> - One to submit the direct IO
> - One to modify the buffer to 0x00
>
> The program uses 4K as default block size, and 64MiB as the default
> file size.
> Which is more than enough to trigger tons of btrfs checksum errors
> on unpatched kernels.
>
> - New test case generic/761
> Which will:
>
> * Use above program to create a 64MiB file
>
> * Do buffered read on that file
> Since above program is doing direct IO, there is no page cache
> populated.
> And the buffered read will need to read out all data from the disk,
> and if the filesystem supports data checksum, then the data checksum
> will also be verified against the data.
>
> The test case passes on the following fses:
> - ext4
> - xfs
> - btrfs with "nodatasum" mount option
> No data checksum to bother.
>
> - btrfs with default "datasum" mount option and the fix "btrfs: always
> fallback to buffered write if the inode requires checksum"
> This makes btrfs to fallback on buffered IO so the contents won't
> change during writeback of page cache.
>
> And fails on the following fses:
>
> - btrfs with default "datasum" mount option and without the fix
> Expected.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: David Disseldorp <ddiss@suse.de>
> Reviewed-by: Zorro Lang <zlang@redhat.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Looks good to me,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> Changelog:
> v2:
> - Fix the comment on the default block size of dio-writeback-race
> - Use proper type for pthread_exit() of do_io() function
> - Fix the error message when filesize is invalid
> - Fix the error message when unknown option is parsed
> - Catch the thread return value correctly for pthread_join() on IO thread
> - Always update @ret
> - Return EXIT_SUCCESS/FAILURE based on @ret at error: tag
> - Check the return value of pthread_join() correctly
> - Remove unused cleanup override/include comments from the test case
> - Add the missing fixed-by tag
>
> v3:
> - Use hyphens for the program's name in the comments of dio-writeback-race.c
> - Fix a missing argv[2] usage
> - Use _get_file_block_size() to benefit from ocfs2/xfs special handling
> ---
> .gitignore | 1 +
> src/Makefile | 3 +-
> src/dio-writeback-race.c | 148 +++++++++++++++++++++++++++++++++++++++
> tests/generic/761 | 41 +++++++++++
> tests/generic/761.out | 2 +
> 5 files changed, 194 insertions(+), 1 deletion(-)
> create mode 100644 src/dio-writeback-race.c
> create mode 100755 tests/generic/761
> create mode 100644 tests/generic/761.out
>
> diff --git a/.gitignore b/.gitignore
> index efd477738e1e..7060f52cf6b8 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -210,6 +210,7 @@ tags
> /src/perf/*.pyc
> /src/fiemap-fault
> /src/min_dio_alignment
> +/src/dio-writeback-race
>
> # Symlinked files
> /tests/generic/035.out
> diff --git a/src/Makefile b/src/Makefile
> index 1417c383863e..6ac72b366257 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -20,7 +20,8 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
> t_get_file_time t_create_short_dirs t_create_long_dirs t_enospc \
> t_mmap_writev_overlap checkpoint_journal mmap-rw-fault allocstale \
> t_mmap_cow_memory_failure fake-dump-rootino dio-buf-fault rewinddir-test \
> - readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd
> + readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd \
> + dio-writeback-race
>
> LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
> preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
> diff --git a/src/dio-writeback-race.c b/src/dio-writeback-race.c
> new file mode 100644
> index 000000000000..963ed207fc1b
> --- /dev/null
> +++ b/src/dio-writeback-race.c
> @@ -0,0 +1,148 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * dio-writeback-race -- test direct IO writes with the contents of
> + * the buffer changed during writeback.
> + *
> + * Copyright (C) 2025 SUSE Linux Products GmbH.
> + */
> +
> +/*
> + * dio-writeback-race
> + *
> + * Compile:
> + *
> + * gcc -Wall -pthread -o dio-writeback-race dio-writeback-race.c
> + *
> + * Make sure filesystems with explicit data checksum can handle direct IO
> + * writes correctly, so that even the contents of the direct IO buffer changes
> + * during writeback, the fs should still work fine without data checksum mismatch.
> + * (Normally by falling back to buffer IO to keep the checksum and contents
> + * consistent)
> + *
> + * Usage:
> + *
> + * dio-writeback-race [-b <buffersize>] [-s <filesize>] <filename>
> + *
> + * Where:
> + *
> + * <filename> is the name of the test file to create, if the file exists
> + * it will be overwritten.
> + *
> + * <buffersize> is the buffer size for direct IO. Users are responsible to
> + * probe the correct direct IO size and alignment requirement.
> + * If not specified, the default value will be 4096.
> + *
> + * <filesize> is the total size of the test file. If not aligned to <blocksize>
> + * the result file size will be rounded up to <blocksize>.
> + * If not specified, the default value will be 64MiB.
> + */
> +
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <pthread.h>
> +#include <unistd.h>
> +#include <getopt.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/stat.h>
> +
> +static int fd = -1;
> +static void *buf = NULL;
> +static unsigned long long filesize = 64 * 1024 * 1024;
> +static int blocksize = 4096;
> +
> +const int orig_pattern = 0xff;
> +const int modify_pattern = 0x00;
> +
> +static void *do_io(void *arg)
> +{
> + ssize_t ret;
> +
> + ret = write(fd, buf, blocksize);
> + pthread_exit((void *)ret);
> +}
> +
> +static void *do_modify(void *arg)
> +{
> + memset(buf, modify_pattern, blocksize);
> + pthread_exit(NULL);
> +}
> +
> +int main (int argc, char *argv[])
> +{
> + pthread_t io_thread;
> + pthread_t modify_thread;
> + unsigned long long filepos;
> + int opt;
> + int ret = -EINVAL;
> +
> + while ((opt = getopt(argc, argv, "b:s:")) > 0) {
> + switch (opt) {
> + case 'b':
> + blocksize = atoi(optarg);
> + if (blocksize == 0) {
> + fprintf(stderr, "invalid blocksize '%s'\n", optarg);
> + goto error;
> + }
> + break;
> + case 's':
> + filesize = atoll(optarg);
> + if (filesize == 0) {
> + fprintf(stderr, "invalid filesize '%s'\n", optarg);
> + goto error;
> + }
> + break;
> + default:
> + fprintf(stderr, "unknown option '%c'\n", opt);
> + goto error;
> + }
> + }
> + if (optind >= argc) {
> + fprintf(stderr, "missing argument\n");
> + goto error;
> + }
> + ret = posix_memalign(&buf, blocksize, blocksize);
> + if (!buf) {
> + fprintf(stderr, "failed to allocate aligned memory\n");
> + exit(EXIT_FAILURE);
> + }
> + fd = open(argv[optind], O_DIRECT | O_WRONLY | O_CREAT);
> + if (fd < 0) {
> + fprintf(stderr, "failed to open file '%s': %m\n", argv[optind]);
> + goto error;
> + }
> +
> + for (filepos = 0; filepos < filesize; filepos += blocksize) {
> + void *retval = NULL;
> +
> + memset(buf, orig_pattern, blocksize);
> + pthread_create(&io_thread, NULL, do_io, NULL);
> + pthread_create(&modify_thread, NULL, do_modify, NULL);
> + ret = pthread_join(io_thread, &retval);
> + if (ret) {
> + errno = ret;
> + ret = -ret;
> + fprintf(stderr, "failed to join io thread: %m\n");
> + goto error;
> + }
> + if ((ssize_t )retval < blocksize) {
> + ret = -EIO;
> + fprintf(stderr, "io thread failed\n");
> + goto error;
> + }
> + ret = pthread_join(modify_thread, NULL);
> + if (ret) {
> + errno = ret;
> + ret = -ret;
> + fprintf(stderr, "failed to join modify thread: %m\n");
> + goto error;
> + }
> + }
> +error:
> + close(fd);
> + free(buf);
> + if (ret < 0)
> + return EXIT_FAILURE;
> + return EXIT_SUCCESS;
> +}
> diff --git a/tests/generic/761 b/tests/generic/761
> new file mode 100755
> index 000000000000..9406a4b86f2e
> --- /dev/null
> +++ b/tests/generic/761
> @@ -0,0 +1,41 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2025 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 761
> +#
> +# Making sure direct IO (O_DIRECT) writes won't cause any data checksum mismatch,
> +# even if the contents of the buffer changes during writeback.
> +#
> +# This is mostly for filesystems with data checksum support, which should fallback
> +# to buffer IO to avoid inconsistency.
> +# For filesystems without data checksum support, nothing needs to be bothered.
> +#
> +
> +. ./common/preamble
> +_begin_fstest auto quick
> +
> +_require_scratch
> +_require_odirect
> +_require_test_program dio-writeback-race
> +_fixed_by_kernel_commit XXXXXXXX \
> + "btrfs: always fallback to buffered write if the inode requires checksum"
> +
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount
> +
> +blocksize=$(_get_file_block_size $SCRATCH_MNT)
> +filesize=$(( 64 * 1024 * 1024))
> +
> +echo "blocksize=$blocksize filesize=$filesize" >> $seqres.full
> +
> +$here/src/dio-writeback-race -b $blocksize -s $filesize $SCRATCH_MNT/foobar
> +
> +# Read out the file, which should trigger checksum verification
> +cat $SCRATCH_MNT/foobar > /dev/null
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/761.out b/tests/generic/761.out
> new file mode 100644
> index 000000000000..72ebba4cb426
> --- /dev/null
> +++ b/tests/generic/761.out
> @@ -0,0 +1,2 @@
> +QA output created by 761
> +Silence is golden
> --
> 2.48.1
>
>
next prev parent reply other threads:[~2025-02-12 6:16 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-12 3:04 [PATCH v3] fstests: add a generic test to verify direct IO writes with buffer contents change Qu Wenruo
2025-02-12 6:16 ` Darrick J. Wong [this message]
2025-02-13 5:49 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250212061657.GX21799@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=ddiss@suse.de \
--cc=fstests@vger.kernel.org \
--cc=hch@infradead.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
--cc=zlang@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox