From: robbieko <robbieko@synology.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: fdmanana@gmail.com, Nikolay Borisov <nborisov@suse.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>,
Filipe Manana <fdmanana@suse.com>,
linux-btrfs-owner@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
Date: Fri, 26 Apr 2019 11:32:46 +0800 [thread overview]
Message-ID: <50d35084420d7fed439d74fd29b12d7d@synology.com> (raw)
In-Reply-To: <1fcaaa0c-cf7b-38dc-0af9-c90773c9af07@gmx.com>
Qu Wenruo 於 2019-04-26 10:53 寫到:
> On 2019/4/26 上午10:47, robbieko wrote:
>> Qu Wenruo 於 2019-04-26 07:20 寫到:
>>> [snip]
>>>>>>
>>>>>> 1. Serializing snapshots (acquiring write-side of the semaphore)
>>>>>
>>>>> With the new btrfs_start_delalloc_snapshot() in
>>>>> btrfs_commit_transaction(), do we really need to do anything
>>>>> special
>>>>> now?
>>>>>
>>>>> Snapshot creation will only happen in btrfs_commit_transaction(),
>>>>> as
>>>>> long as all dirty inodes/pages are written before pending
>>>>> snapshots,
>>>>> we're completely fine.
>>>>>
>>>>> Or did I miss something?
>>>>
>>>> You missed the whole point of both patches.
>>>>
>>>> The one I authored recently, is about ensuring we see ordered update
>>>> of an inode's disk_i_size / i_size.
>>>> That flushes delalloc a second time, during the transaction commit,
>>>> to
>>>> ensure an ordered update of disk_i_size in case direct IO writes
>>>> happened during snapshotting.
>>>> btrfs/078 could detect this when not running with the no-holes
>>>> feature
>>>> enabled, since fsck will report missing file extent items.
>>>
>>> But that flush at commit time ensures all pages of the source
>>> snapshot
>>> to disk, killing the following old case:
>>> - preallocate
>>> - buffered write
>>> at this timing, the write will be nodatacow
>>> - snapshot creation
>>> now the write needs to be cowed
>>> - dirty page writeback happens
>>>
>>> With your flush, snapshot creation will flush all its dirty pages
>>> before
>>> the new snapshot is created.
>>>
>>>>
>>>> Robbie's patch is about making sure that buffered nodatacow writes
>>>> that happened before snapshotting (and success was returned to user
>>>> space), will not fail silently during the writeback triggered by
>>>> snapshot creation.
>>>> There's even a test case for this, btrfs/170, which I submitted.
>>>>
>>>> So no, you can't simply revert Robbie's change, that will
>>>> re-introduce
>>>> the bug it fixed.
>>>
>>> With your patch, I see nothing need to be handled specially in a
>>> snapshot source. Thus we don't need Robbie patch after your more
>>> comprehensive fix.
>>>
>>> Or did I miss something again?
>>
>> If you revert my patch directly, when volume is full, when writing
>> data,
>> it will check and try to use nocow to write, but if the data is still
>> in the page cache, it has not been written back to disk.
>> If snapshot is triggered,
>
> But now, before creating new snapshot, we will flush all data.
>
> This means at this time, the data will be written back *NOCOW*, just as
> expected.
>
>> will_be_snapshotted will be set, and then flush
>
> will_be_snapshotted will also be gone.
>
> Thanks,
> Qu
>
Why will will_be_snapshotted disappear?
If will_be_snapshotted is removed, when volume is full,
how do we ensure that snapshot and write are both
simultaneous data not loss ?
How do you avoid the write data between btrfs_start_delalloc_snapshot
and create_pending_snapshot during this time ?
E.g. chattr +C /mnt
Fallocate -l 4096 /mnt/small_file
Fallocate -l $((64*1024*1024)) /mnt/large_file
First process:
While true; do
dd if=/dev/urandom of=/mnt/small_file bs=4k count=1 conv=notrunc
Done
Second process:
While true; do
dd if=/dev/urandom of=/mnt/large_file bs=64k count=1024 conv=notrunc
Done
Third process: create snapshot.
>> data, causing data to be unable to do nocow in run_delalloc_nocow,
>> can only be written back in cow way, but there is no extra space to
>> use,
>> so that data loss.
>> Thanks.
>> Robbieko
>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> 2. Nocow writers simply acquire the readsize of the semaphore.
>>>>>>
>>>>>> the will_be_snapshoted thing is very convoluted relying on a
>>>>>> percpu
>>>>>> counter/waitqueue to exclude snapshots from pending nocow
>>>>>> writers.
>>>>>> OTOH
>>>>>> it depends on atomic_t and an implicit wait queue thanks to
>>>>>> wait_var
>>>>>> infrastructure to exclude nocow writers from pending snapshots.
>>>>>> Filipe
>>>>>> had some concerns regarding performance but if the patch you
>>>>>> mentioned
>>>>>> fixed all issues I'm all in favor of removing the code!
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Qu
>>>>>>>
>>>>>>>>
>>>>>>>> 1. It is incremented when we start to create a snapshot after
>>>>>>>> triggering writeback and before waiting for writeback to finish.
>>>>>>>>
>>>>>>>> 2. This new atomic is now what is used by writeback (running
>>>>>>>> delalloc)
>>>>>>>> to decide whether we need to fallback to COW or not. Because we
>>>>>>>> incremented this new atomic after triggering writeback in the
>>>>>>>> snapshot
>>>>>>>> creation ioctl, we ensure that all buffered writes that happened
>>>>>>>> before snapshot creation will succeed and not fallback to COW
>>>>>>>> (which would make them fail with ENOSPC).
>>>>>>>>
>>>>>>>> 3. The existing atomic, will_be_snapshotted, is kept because it
>>>>>>>> is
>>>>>>>> used to force new buffered writes, that start after we started
>>>>>>>> snapshotting, to reserve data space even when NOCOW is possible.
>>>>>>>> This makes these writes fail early with ENOSPC when there's no
>>>>>>>> available space to allocate, preventing the unexpected behaviour
>>>>>>>> of writeback later failing with ENOSPC due to a fallback to COW
>>>>>>>> mode.
>>>>>>>>
>>>>>>>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>>>>>>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>>>>>>>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>>>>>>>> ---
>>>>>>>> fs/btrfs/ctree.h | 1 +
>>>>>>>> fs/btrfs/disk-io.c | 1 +
>>>>>>>> fs/btrfs/inode.c | 26 +++++---------------------
>>>>>>>> fs/btrfs/ioctl.c | 16 ++++++++++++++++
>>>>>>>> 4 files changed, 23 insertions(+), 21 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>>>>> index 118346a..663ce05 100644
>>>>>>>> --- a/fs/btrfs/ctree.h
>>>>>>>> +++ b/fs/btrfs/ctree.h
>>>>>>>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
>>>>>>>> int send_in_progress;
>>>>>>>> struct btrfs_subvolume_writers *subv_writers;
>>>>>>>> atomic_t will_be_snapshotted;
>>>>>>>> + atomic_t snapshot_force_cow;
>>>>>>>>
>>>>>>>> /* For qgroup metadata reserved space */
>>>>>>>> spinlock_t qgroup_meta_rsv_lock;
>>>>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>>>>>> index 205092d..5573916 100644
>>>>>>>> --- a/fs/btrfs/disk-io.c
>>>>>>>> +++ b/fs/btrfs/disk-io.c
>>>>>>>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root
>>>>>>>> *root, struct btrfs_fs_info *fs_info,
>>>>>>>> atomic_set(&root->log_batch, 0);
>>>>>>>> refcount_set(&root->refs, 1);
>>>>>>>> atomic_set(&root->will_be_snapshotted, 0);
>>>>>>>> + atomic_set(&root->snapshot_force_cow, 0);
>>>>>>>> root->log_transid = 0;
>>>>>>>> root->log_transid_committed = -1;
>>>>>>>> root->last_log_commit = 0;
>>>>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>>>>> index eba61bc..263b852 100644
>>>>>>>> --- a/fs/btrfs/inode.c
>>>>>>>> +++ b/fs/btrfs/inode.c
>>>>>>>> @@ -1275,7 +1275,7 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> u64 disk_num_bytes;
>>>>>>>> u64 ram_bytes;
>>>>>>>> int extent_type;
>>>>>>>> - int ret, err;
>>>>>>>> + int ret;
>>>>>>>> int type;
>>>>>>>> int nocow;
>>>>>>>> int check_prev = 1;
>>>>>>>> @@ -1407,11 +1407,9 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> * if there are pending snapshots for this
>>>>>>>> root,
>>>>>>>> * we fall into common COW way.
>>>>>>>> */
>>>>>>>> - if (!nolock) {
>>>>>>>> - err =
>>>>>>>> btrfs_start_write_no_snapshotting(root);
>>>>>>>> - if (!err)
>>>>>>>> - goto out_check;
>>>>>>>> - }
>>>>>>>> + if (!nolock &&
>>>>>>>> +
>>>>>>>> unlikely(atomic_read(&root->snapshot_force_cow)))
>>>>>>>> + goto out_check;
>>>>>>>> /*
>>>>>>>> * force cow if csum exists in the range.
>>>>>>>> * this ensure that csum for a given extent
>>>>>>>> are
>>>>>>>> @@ -1420,9 +1418,6 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> ret = csum_exist_in_range(fs_info,
>>>>>>>> disk_bytenr,
>>>>>>>> num_bytes);
>>>>>>>> if (ret) {
>>>>>>>> - if (!nolock)
>>>>>>>> -
>>>>>>>> btrfs_end_write_no_snapshotting(root);
>>>>>>>> -
>>>>>>>> /*
>>>>>>>> * ret could be -EIO if the above
>>>>>>>> fails to read
>>>>>>>> * metadata.
>>>>>>>> @@ -1435,11 +1430,8 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> WARN_ON_ONCE(nolock);
>>>>>>>> goto out_check;
>>>>>>>> }
>>>>>>>> - if (!btrfs_inc_nocow_writers(fs_info,
>>>>>>>> disk_bytenr)) {
>>>>>>>> - if (!nolock)
>>>>>>>> -
>>>>>>>> btrfs_end_write_no_snapshotting(root);
>>>>>>>> + if (!btrfs_inc_nocow_writers(fs_info,
>>>>>>>> disk_bytenr))
>>>>>>>> goto out_check;
>>>>>>>> - }
>>>>>>>> nocow = 1;
>>>>>>>> } else if (extent_type == BTRFS_FILE_EXTENT_INLINE)
>>>>>>>> {
>>>>>>>> extent_end = found_key.offset +
>>>>>>>> @@ -1453,8 +1445,6 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> out_check:
>>>>>>>> if (extent_end <= start) {
>>>>>>>> path->slots[0]++;
>>>>>>>> - if (!nolock && nocow)
>>>>>>>> -
>>>>>>>> btrfs_end_write_no_snapshotting(root);
>>>>>>>> if (nocow)
>>>>>>>> btrfs_dec_nocow_writers(fs_info,
>>>>>>>> disk_bytenr);
>>>>>>>> goto next_slot;
>>>>>>>> @@ -1476,8 +1466,6 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> end, page_started,
>>>>>>>> nr_written, 1,
>>>>>>>> NULL);
>>>>>>>> if (ret) {
>>>>>>>> - if (!nolock && nocow)
>>>>>>>> -
>>>>>>>> btrfs_end_write_no_snapshotting(root);
>>>>>>>> if (nocow)
>>>>>>>>
>>>>>>>> btrfs_dec_nocow_writers(fs_info,
>>>>>>>>
>>>>>>>> disk_bytenr);
>>>>>>>> @@ -1497,8 +1485,6 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> ram_bytes,
>>>>>>>> BTRFS_COMPRESS_NONE,
>>>>>>>> BTRFS_ORDERED_PREALLOC);
>>>>>>>> if (IS_ERR(em)) {
>>>>>>>> - if (!nolock && nocow)
>>>>>>>> -
>>>>>>>> btrfs_end_write_no_snapshotting(root);
>>>>>>>> if (nocow)
>>>>>>>>
>>>>>>>> btrfs_dec_nocow_writers(fs_info,
>>>>>>>>
>>>>>>>> disk_bytenr);
>>>>>>>> @@ -1537,8 +1523,6 @@ static noinline int
>>>>>>>> run_delalloc_nocow(struct inode *inode,
>>>>>>>> EXTENT_CLEAR_DATA_RESV,
>>>>>>>> PAGE_UNLOCK |
>>>>>>>> PAGE_SET_PRIVATE2);
>>>>>>>>
>>>>>>>> - if (!nolock && nocow)
>>>>>>>> - btrfs_end_write_no_snapshotting(root);
>>>>>>>> cur_offset = extent_end;
>>>>>>>>
>>>>>>>> /*
>>>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>>>> index b077544..331b495 100644
>>>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>>>> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root
>>>>>>>> *root, struct inode *dir,
>>>>>>>> struct btrfs_pending_snapshot *pending_snapshot;
>>>>>>>> struct btrfs_trans_handle *trans;
>>>>>>>> int ret;
>>>>>>>> + bool snapshot_force_cow = false;
>>>>>>>>
>>>>>>>> if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>>>>>>>> return -EINVAL;
>>>>>>>> @@ -777,6 +778,11 @@ static int create_snapshot(struct
>>>>>>>> btrfs_root
>>>>>>>> *root, struct inode *dir,
>>>>>>>> goto free_pending;
>>>>>>>> }
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * Force new buffered writes to reserve space even when
>>>>>>>> NOCOW is
>>>>>>>> + * possible. This is to avoid later writeback (running
>>>>>>>> dealloc)
>>>>>>>> + * to fallback to COW mode and unexpectedly fail with
>>>>>>>> ENOSPC.
>>>>>>>> + */
>>>>>>>> atomic_inc(&root->will_be_snapshotted);
>>>>>>>> smp_mb__after_atomic();
>>>>>>>> /* wait for no snapshot writes */
>>>>>>>> @@ -787,6 +793,14 @@ static int create_snapshot(struct
>>>>>>>> btrfs_root
>>>>>>>> *root, struct inode *dir,
>>>>>>>> if (ret)
>>>>>>>> goto dec_and_free;
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * All previous writes have started writeback in NOCOW mode,
>>>>>>>> so now
>>>>>>>> + * we force future writes to fallback to COW mode during
>>>>>>>> snapshot
>>>>>>>> + * creation.
>>>>>>>> + */
>>>>>>>> + atomic_inc(&root->snapshot_force_cow);
>>>>>>>> + snapshot_force_cow = true;
>>>>>>>> +
>>>>>>>> btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>>>>>>>>
>>>>>>>> btrfs_init_block_rsv(&pending_snapshot->block_rsv,
>>>>>>>> @@ -851,6 +865,8 @@ static int create_snapshot(struct btrfs_root
>>>>>>>> *root, struct inode *dir,
>>>>>>>> fail:
>>>>>>>> btrfs_subvolume_release_metadata(fs_info,
>>>>>>>> &pending_snapshot->block_rsv);
>>>>>>>> dec_and_free:
>>>>>>>> + if (snapshot_force_cow)
>>>>>>>> + atomic_dec(&root->snapshot_force_cow);
>>>>>>>> if (atomic_dec_and_test(&root->will_be_snapshotted))
>>>>>>>> wake_up_var(&root->will_be_snapshotted);
>>>>>>>> free_pending:
>>>>>>>>
>>>>>>>
>>>>
>>>>
>>>>
>>
next prev parent reply other threads:[~2019-04-26 3:33 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-08-06 2:30 [PATCH] Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space robbieko
2018-08-07 15:52 ` David Sterba
2019-04-25 9:29 ` Qu Wenruo
2019-04-25 9:37 ` Nikolay Borisov
2019-04-25 10:07 ` Qu Wenruo
2019-04-25 14:51 ` Filipe Manana
2019-04-25 23:20 ` Qu Wenruo
2019-04-26 2:47 ` robbieko
2019-04-26 2:53 ` Qu Wenruo
2019-04-26 3:32 ` robbieko [this message]
2019-04-26 6:32 ` Qu Wenruo
2019-04-26 6:59 ` Qu Wenruo
2019-04-26 9:09 ` Filipe Manana
2019-04-26 9:03 ` Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50d35084420d7fed439d74fd29b12d7d@synology.com \
--to=robbieko@synology.com \
--cc=fdmanana@gmail.com \
--cc=fdmanana@suse.com \
--cc=linux-btrfs-owner@vger.kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=nborisov@suse.com \
--cc=quwenruo.btrfs@gmx.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).