Re: [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically
Date: Thu, 5 Nov 2020 14:24:19 -0500	[thread overview]
Message-ID: <df80c13e-f42d-4074-cca7-c71ced6a789a@toxicpanda.com> (raw)
In-Reply-To: <aeebaf45f19779b8f869cc16db0bcfe8ba4dcf2d.1604486892.git.fdmanana@suse.com>

On 11/4/20 6:07 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> There are several occasions where we do not update the inode's number of
> used bytes atomically, resulting in a concurrent stat(2) syscall to report
> a value of used blocks that does not correspond to a valid value, that is,
> a value that does not match neither what we had before the operation nor
> what we get after the operation completes.
> 
> In extreme cases it can result in stat(2) reporting zero used blocks, which
> can cause problems for some userspace tools where they can consider a file
> with a non-zero size and zero used blocks as completely sparse and skip
> reading data, as reported/discussed a long time ago in some threads like
> the following:
> 
>    https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
> 
> The cases where this can happen are the following:
> 
> -> Case 1
> 
> If we do a write (buffered or direct IO) against a file region for which
> there is already an allocated extent (or multiple extents), then we have a
> short time window where we can report a number of used blocks to stat(2)
> that does not take into account the file region being overwritten. This
> short time window happens when completing the ordered extent(s).
> 
> This happens because when we drop the extents in the write range we
> decrement the inode's number of bytes and later on when we insert the new
> extent(s) we increment the number of bytes in the inode, resulting in a
> short time window where a stat(2) syscall can get an incorrect number of
> used blocks.
> 
> If we do writes that overwrite an entire file, then we have a short time
> window where we report 0 used blocks to stat(2).
> 
> Example reproducer:
> 
>    $ cat reproducer-1.sh
>    #!/bin/bash
> 
>    MNT=/mnt/sdi
>    DEV=/dev/sdi
> 
>    stat_loop()
>    {
>        trap "wait; exit" SIGTERM
>        local filepath=$1
>        local expected=$2
>        local got
> 
>        while :; do
>            got=$(stat -c %b $filepath)
>            if [ $got -ne $expected ]; then
>               echo -n "ERROR: unexpected used blocks"
>               echo " (got: $got expected: $expected)"
>            fi
>        done
>    }
> 
>    mkfs.btrfs -f $DEV > /dev/null
>    # mkfs.xfs -f $DEV > /dev/null
>    # mkfs.ext4 -F $DEV > /dev/null
>    # mkfs.f2fs -f $DEV > /dev/null
>    # mkfs.reiserfs -f $DEV > /dev/null
>    mount $DEV $MNT
> 
>    xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
>    expected=$(stat -c %b $MNT/foobar)
> 
>    # Create a process to keep calling stat(2) on the file and see if the
>    # reported number of blocks used (disk space used) changes, it should
>    # not because we are not increasing the file size nor punching holes.
>    stat_loop $MNT/foobar $expected &
>    loop_pid=$!
> 
>    for ((i = 0; i < 50000; i++)); do
>        xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
>    done
> 
>    kill $loop_pid &> /dev/null
>    wait
> 
>    umount $DEV
> 
>    $ ./reproducer-1.sh
>    ERROR: unexpected used blocks (got: 0 expected: 128)
>    ERROR: unexpected used blocks (got: 0 expected: 128)
>    (...)
> 
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
> 
> -> Case 2
> 
> If we do a buffered write against a file region that does not have any
> allocated extents, like a hole or beyond eof, then during ordered extent
> completion we have a short time window where a concurrent stat(2) syscall
> can report a number of used blocks that does not correspond to the value
> before or after the write operation, a value that is actually larger than
> the value after the write completes.
> 
> This happens because once we start a buffered write into an unallocated
> file range we increment the inode's 'new_delalloc_bytes', to make sure
> any stat(2) call gets a correct used blocks value before delalloc is
> flushed and completes. However at ordered extent completion, after we
> inserted the new extent, we increment the inode's number of bytes used
> with the size of the new extent, and only later, when clearing the range
> in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
> counter with the size of the extent. So this results in a short time
> window where a concurrent stat(2) syscall can report a number of used
> blocks that accounts for the new extent twice.
> 
> Example reproducer:
> 
>    $ cat reproducer-2.sh
>    #!/bin/bash
> 
>    MNT=/mnt/sdi
>    DEV=/dev/sdi
> 
>    stat_loop()
>    {
>        trap "wait; exit" SIGTERM
>        local filepath=$1
>        local expected=$2
>        local got
> 
>        while :; do
>            got=$(stat -c %b $filepath)
>            if [ $got -ne $expected ]; then
>                echo -n "ERROR: unexpected used blocks"
>                echo " (got: $got expected: $expected)"
>            fi
>        done
>    }
> 
>    mkfs.btrfs -f $DEV > /dev/null
>    # mkfs.xfs -f $DEV > /dev/null
>    # mkfs.ext4 -F $DEV > /dev/null
>    # mkfs.f2fs -f $DEV > /dev/null
>    # mkfs.reiserfs -f $DEV > /dev/null
>    mount $DEV $MNT
> 
>    touch $MNT/foobar
>    write_size=$((64 * 1024))
>    for ((i = 0; i < 16384; i++)); do
>       offset=$(($i * $write_size))
>       xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
>       blocks_used=$(stat -c %b $MNT/foobar)
> 
>       # Fsync the file to trigger writeback and keep calling stat(2) on it
>       # to see if the number of blocks used changes.
>       stat_loop $MNT/foobar $blocks_used &
>       loop_pid=$!
>       xfs_io -c "fsync" $MNT/foobar
> 
>       kill $loop_pid &> /dev/null
>       wait $loop_pid
>    done
> 
>    umount $DEV
> 
>    $ ./reproducer-2.sh
>    ERROR: unexpected used blocks (got: 265472 expected: 265344)
>    ERROR: unexpected used blocks (got: 284032 expected: 283904)
>    (...)
> 
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
> 
> -> Case 3
> 
> Another case where such problems happen is during other operations that
> replace extents in a file range with other extents. Those operations are
> extent cloning, deduplication and fallocate's zero range operation.
> 
> The cause of the problem is similar to the first case. When we drop the
> extents from a range, we decrement the inode's number of bytes, and later
> on, after inserting the new extents we increment it. Since this is not
> done atomically, a concurrent stat(2) call can see and return a number of
> used blocks that is smaller than it should be, does not match the number
> of used blocks before or after the clone/deduplication/zero operation.
> 
> Like for the first case, when doing a clone, deduplication or zero range
> operation against an entire file, we end up having a time window where we
> can report 0 used blocks to a stat(2) call.
> 
> Example reproducer:
> 
>    $ cat reproducer-3.sh
>    #!/bin/bash
> 
>    MNT=/mnt/sdi
>    DEV=/dev/sdi
> 
>    mkfs.btrfs -f $DEV > /dev/null
>    # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
>    mount $DEV $MNT
> 
>    extent_size=$((64 * 1024))
>    num_extents=16384
>    file_size=$(($extent_size * $num_extents))
> 
>    # File foo has many small extents.
>    xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
>        > /dev/null
>    # File bar has much less extents and has exactly the same data as foo.
>    xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
> 
>    expected=$(stat -c %b $MNT/foo)
> 
>    # Now deduplicate bar into foo. While the deduplication is in progres,
>    # the number of used blocks/file size reported by stat should not change
>    xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
>    dedupe_pid=$!
>    while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
>        used=$(stat -c %b $MNT/foo)
>        if [ $used -ne $expected ]; then
>            echo "Unexpected blocks used: $used (expected: $expected)"
>        fi
>    done
> 
>    umount $DEV
> 
>    $ ./reproducer-3.sh
>    Unexpected blocks used: 2076800 (expected: 2097152)
>    Unexpected blocks used: 2097024 (expected: 2097152)
>    Unexpected blocks used: 2079872 (expected: 2097152)
>    (...)
> 
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
> 
> So fix this by:
> 
> 1) Making btrfs_drop_extents() not decrement the vfs inode's number of
>     bytes, and instead return the number of bytes;
> 
> 2) Making any code that drops extents and adds new extents update the
>     inode's number of bytes atomically, while holding the btrfs inode's
>     spinlock, which is also used by the stat(2) callback to get the inode's
>     number of bytes;
> 
> 3) For ranges in the inode's iotree that are marked as 'delalloc new',
>     corresponding to previously unallocated ranges, increment the inode's
>     number of bytes when clearing the 'delalloc new' bit from the range,
>     in the same critical section that decrements the inode's
>     'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
> 
> An alternative would be to have btrfs_getattr() wait for any IO (ordered
> extents in progress) and locking the whole range (0 to (u64)-1) while it
> it computes the number of blocks used. But that would mean blocking
> stat(2), which is a very used syscall and expected to be fast, waiting
> for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

next prev parent reply	other threads:[~2020-11-05 19:24 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
2020-11-04 11:07 ` [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges fdmanana
2020-11-05 18:29   ` Josef Bacik
2020-11-04 11:07 ` [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend fdmanana
2020-11-05 18:39   ` Josef Bacik
2020-11-04 11:07 ` [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO fdmanana
2020-11-05 18:44   ` Josef Bacik
2020-11-04 11:07 ` [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically fdmanana
2020-11-05 19:24   ` Josef Bacik [this message]
2020-11-09 10:34   ` Nikolay Borisov
2020-11-09 11:10     ` Filipe Manana
2020-11-10 21:46 ` [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=df80c13e-f42d-4074-cca7-c71ced6a789a@toxicpanda.com \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox