From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically
Date: Thu, 5 Nov 2020 14:24:19 -0500 [thread overview]
Message-ID: <df80c13e-f42d-4074-cca7-c71ced6a789a@toxicpanda.com> (raw)
In-Reply-To: <aeebaf45f19779b8f869cc16db0bcfe8ba4dcf2d.1604486892.git.fdmanana@suse.com>
On 11/4/20 6:07 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> There are several occasions where we do not update the inode's number of
> used bytes atomically, resulting in a concurrent stat(2) syscall to report
> a value of used blocks that does not correspond to a valid value, that is,
> a value that does not match neither what we had before the operation nor
> what we get after the operation completes.
>
> In extreme cases it can result in stat(2) reporting zero used blocks, which
> can cause problems for some userspace tools where they can consider a file
> with a non-zero size and zero used blocks as completely sparse and skip
> reading data, as reported/discussed a long time ago in some threads like
> the following:
>
> https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
>
> The cases where this can happen are the following:
>
> -> Case 1
>
> If we do a write (buffered or direct IO) against a file region for which
> there is already an allocated extent (or multiple extents), then we have a
> short time window where we can report a number of used blocks to stat(2)
> that does not take into account the file region being overwritten. This
> short time window happens when completing the ordered extent(s).
>
> This happens because when we drop the extents in the write range we
> decrement the inode's number of bytes and later on when we insert the new
> extent(s) we increment the number of bytes in the inode, resulting in a
> short time window where a stat(2) syscall can get an incorrect number of
> used blocks.
>
> If we do writes that overwrite an entire file, then we have a short time
> window where we report 0 used blocks to stat(2).
>
> Example reproducer:
>
> $ cat reproducer-1.sh
> #!/bin/bash
>
> MNT=/mnt/sdi
> DEV=/dev/sdi
>
> stat_loop()
> {
> trap "wait; exit" SIGTERM
> local filepath=$1
> local expected=$2
> local got
>
> while :; do
> got=$(stat -c %b $filepath)
> if [ $got -ne $expected ]; then
> echo -n "ERROR: unexpected used blocks"
> echo " (got: $got expected: $expected)"
> fi
> done
> }
>
> mkfs.btrfs -f $DEV > /dev/null
> # mkfs.xfs -f $DEV > /dev/null
> # mkfs.ext4 -F $DEV > /dev/null
> # mkfs.f2fs -f $DEV > /dev/null
> # mkfs.reiserfs -f $DEV > /dev/null
> mount $DEV $MNT
>
> xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
> expected=$(stat -c %b $MNT/foobar)
>
> # Create a process to keep calling stat(2) on the file and see if the
> # reported number of blocks used (disk space used) changes, it should
> # not because we are not increasing the file size nor punching holes.
> stat_loop $MNT/foobar $expected &
> loop_pid=$!
>
> for ((i = 0; i < 50000; i++)); do
> xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
> done
>
> kill $loop_pid &> /dev/null
> wait
>
> umount $DEV
>
> $ ./reproducer-1.sh
> ERROR: unexpected used blocks (got: 0 expected: 128)
> ERROR: unexpected used blocks (got: 0 expected: 128)
> (...)
>
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
>
> -> Case 2
>
> If we do a buffered write against a file region that does not have any
> allocated extents, like a hole or beyond eof, then during ordered extent
> completion we have a short time window where a concurrent stat(2) syscall
> can report a number of used blocks that does not correspond to the value
> before or after the write operation, a value that is actually larger than
> the value after the write completes.
>
> This happens because once we start a buffered write into an unallocated
> file range we increment the inode's 'new_delalloc_bytes', to make sure
> any stat(2) call gets a correct used blocks value before delalloc is
> flushed and completes. However at ordered extent completion, after we
> inserted the new extent, we increment the inode's number of bytes used
> with the size of the new extent, and only later, when clearing the range
> in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
> counter with the size of the extent. So this results in a short time
> window where a concurrent stat(2) syscall can report a number of used
> blocks that accounts for the new extent twice.
>
> Example reproducer:
>
> $ cat reproducer-2.sh
> #!/bin/bash
>
> MNT=/mnt/sdi
> DEV=/dev/sdi
>
> stat_loop()
> {
> trap "wait; exit" SIGTERM
> local filepath=$1
> local expected=$2
> local got
>
> while :; do
> got=$(stat -c %b $filepath)
> if [ $got -ne $expected ]; then
> echo -n "ERROR: unexpected used blocks"
> echo " (got: $got expected: $expected)"
> fi
> done
> }
>
> mkfs.btrfs -f $DEV > /dev/null
> # mkfs.xfs -f $DEV > /dev/null
> # mkfs.ext4 -F $DEV > /dev/null
> # mkfs.f2fs -f $DEV > /dev/null
> # mkfs.reiserfs -f $DEV > /dev/null
> mount $DEV $MNT
>
> touch $MNT/foobar
> write_size=$((64 * 1024))
> for ((i = 0; i < 16384; i++)); do
> offset=$(($i * $write_size))
> xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
> blocks_used=$(stat -c %b $MNT/foobar)
>
> # Fsync the file to trigger writeback and keep calling stat(2) on it
> # to see if the number of blocks used changes.
> stat_loop $MNT/foobar $blocks_used &
> loop_pid=$!
> xfs_io -c "fsync" $MNT/foobar
>
> kill $loop_pid &> /dev/null
> wait $loop_pid
> done
>
> umount $DEV
>
> $ ./reproducer-2.sh
> ERROR: unexpected used blocks (got: 265472 expected: 265344)
> ERROR: unexpected used blocks (got: 284032 expected: 283904)
> (...)
>
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
>
> -> Case 3
>
> Another case where such problems happen is during other operations that
> replace extents in a file range with other extents. Those operations are
> extent cloning, deduplication and fallocate's zero range operation.
>
> The cause of the problem is similar to the first case. When we drop the
> extents from a range, we decrement the inode's number of bytes, and later
> on, after inserting the new extents we increment it. Since this is not
> done atomically, a concurrent stat(2) call can see and return a number of
> used blocks that is smaller than it should be, does not match the number
> of used blocks before or after the clone/deduplication/zero operation.
>
> Like for the first case, when doing a clone, deduplication or zero range
> operation against an entire file, we end up having a time window where we
> can report 0 used blocks to a stat(2) call.
>
> Example reproducer:
>
> $ cat reproducer-3.sh
> #!/bin/bash
>
> MNT=/mnt/sdi
> DEV=/dev/sdi
>
> mkfs.btrfs -f $DEV > /dev/null
> # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
> mount $DEV $MNT
>
> extent_size=$((64 * 1024))
> num_extents=16384
> file_size=$(($extent_size * $num_extents))
>
> # File foo has many small extents.
> xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> > /dev/null
> # File bar has much less extents and has exactly the same data as foo.
> xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
>
> expected=$(stat -c %b $MNT/foo)
>
> # Now deduplicate bar into foo. While the deduplication is in progres,
> # the number of used blocks/file size reported by stat should not change
> xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
> dedupe_pid=$!
> while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
> used=$(stat -c %b $MNT/foo)
> if [ $used -ne $expected ]; then
> echo "Unexpected blocks used: $used (expected: $expected)"
> fi
> done
>
> umount $DEV
>
> $ ./reproducer-3.sh
> Unexpected blocks used: 2076800 (expected: 2097152)
> Unexpected blocks used: 2097024 (expected: 2097152)
> Unexpected blocks used: 2079872 (expected: 2097152)
> (...)
>
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
>
> So fix this by:
>
> 1) Making btrfs_drop_extents() not decrement the vfs inode's number of
> bytes, and instead return the number of bytes;
>
> 2) Making any code that drops extents and adds new extents update the
> inode's number of bytes atomically, while holding the btrfs inode's
> spinlock, which is also used by the stat(2) callback to get the inode's
> number of bytes;
>
> 3) For ranges in the inode's iotree that are marked as 'delalloc new',
> corresponding to previously unallocated ranges, increment the inode's
> number of bytes when clearing the 'delalloc new' bit from the range,
> in the same critical section that decrements the inode's
> 'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
>
> An alternative would be to have btrfs_getattr() wait for any IO (ordered
> extents in progress) and locking the whole range (0 to (u64)-1) while it
> it computes the number of blocks used. But that would mean blocking
> stat(2), which is a very used syscall and expected to be fast, waiting
> for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
next prev parent reply other threads:[~2020-11-05 19:24 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
2020-11-04 11:07 ` [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges fdmanana
2020-11-05 18:29 ` Josef Bacik
2020-11-04 11:07 ` [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend fdmanana
2020-11-05 18:39 ` Josef Bacik
2020-11-04 11:07 ` [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO fdmanana
2020-11-05 18:44 ` Josef Bacik
2020-11-04 11:07 ` [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically fdmanana
2020-11-05 19:24 ` Josef Bacik [this message]
2020-11-09 10:34 ` Nikolay Borisov
2020-11-09 11:10 ` Filipe Manana
2020-11-10 21:46 ` [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=df80c13e-f42d-4074-cca7-c71ced6a789a@toxicpanda.com \
--to=josef@toxicpanda.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox