From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Goldwyn Rodrigues <rgoldwyn@suse.com>,
Goldwyn Rodrigues <rgoldwyn@suse.de>,
<linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero
Date: Mon, 26 Sep 2016 10:33:55 +0800
Message-ID: <65e6515c-a9f3-d5f9-db6b-3ffd8d97f90e@cn.fujitsu.com>
In-Reply-To: <628e1dcd-ddc3-3f37-8a47-c01c912db970@suse.com>
At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:
>
>
> On 09/22/2016 08:06 PM, Qu Wenruo wrote:
>>
>>
>> At 09/23/2016 02:47 AM, Goldwyn Rodrigues wrote:
>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>
>>> While free'ing qgroup->reserved resources, we must check
>>> if the page is already committed to disk or still in memory.
>>> If not, the reserve free is doubly accounted, once while
>>> invalidating the page, and the next time while free'ing
>>> delalloc. This results in qgroup->reserved (u64) going subzero
>>> and wrapping to a very large value, so no further I/O can be performed.
>>>
>>> This is also expressed in the comments, but not performed.
>>>
>>> Testcase:
>>> SCRATCH_DEV=/dev/vdb
>>> SCRATCH_MNT=/mnt
>>> mkfs.btrfs -f $SCRATCH_DEV
>>> mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
>>> cd $SCRATCH_MNT
>>> btrfs quota enable $SCRATCH_MNT
>>> btrfs subvolume create a
>>> btrfs qgroup limit 50m a $SCRATCH_MNT
>>> sync
>>> for c in {1..15}; do
>>> dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file;
>>> done
>>>
>>> sleep 10
>>> sync
>>> sleep 5
>>>
>>> touch $SCRATCH_MNT/a/newfile
>>>
>>> echo "Removing file"
>>> rm $SCRATCH_MNT/a/file
>>>
>>> Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
>>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>> ---
>>> fs/btrfs/inode.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index e6811c4..2e2a026 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -8917,7 +8917,8 @@ again:
>>> * 2) Not written to disk
>>> * This means the reserved space should be freed here.
>>> */
>>> - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
>>> + if (PageDirty(page))
>>> + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
>>> if (!inode_evicting) {
>>> clear_extent_bit(tree, page_start, page_end,
>>> EXTENT_LOCKED | EXTENT_DIRTY |
>>>
>> Thanks for the test case.
>>
>> However for the fix, I'm afraid it may not be the root cause.
>>
>> Here, if the pages are dirty, then corresponding range is marked
>> EXTENT_QGROUP_RESERVED.
>> Then btrfs_qgroup_free_data() will clear that bit and reduce the number.
>>
>> If the pages are already committed, then corresponding range won't be
>> marked EXTENT_QGROUP_RESERVED.
>> Later btrfs_qgroup_free_data() won't reduce any bytes, since it will
>> only reduce the bytes if it cleared EXTENT_QGROUP_RESERVED bit.
>>
>> If everything goes well there is no need to check PageDirty() here, as
>> we have EXTENT_QGROUP_RESERVED bit for that accounting.
>>
>> So there is some other thing causing EXTENT_QGROUP_RESERVED bit out of
>> sync with dirty pages.
>> Considering you did it 15 times to reproduce the problem, maybe there is
>> some race causing the problem?
>>
>
> You can have pages marked as not dirty with EXTENT_QGROUP_RESERVED set
> for a truncate operation. Performing dd on the same file, truncates the
> file before overwriting, while the pages of the previous writes are
> still in memory and not committed to disk.
>
> truncate_inode_page() -> truncate_complete_page() clears the dirty flag.
> So, you can have a case where the EXTENT_QGROUP_RESERVED bit is set
> while the page is not listed as dirty because the truncate "cleared" all
> the dirty pages.
>
Sorry, I still don't get the point.
Could you please give a call flow showing when the page is dirtied
relative to the calls to btrfs_qgroup_reserve/free/release_data()?
Like:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data()     <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()                <- Mark page dirty
[[Timing of btrfs_invalidatepage()]]
About your commit message "once while invalidating the page, and the
next time while free'ing delalloc":
By "free'ing delalloc", did you mean btrfs_qgroup_free_delayed_ref()?
If so, it means the extent went through full write-back, and long before
btrfs_qgroup_free_delayed_ref() is called, btrfs_qgroup_release_data()
will have been called to clear QGROUP_RESERVED.
So the call flow will be:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data()     <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()                <- Mark page dirty

<data write back happens>
run_delalloc_range()
|- cow_file_range()
|- extent_clear_unlock_delalloc()     <- Clear page dirty

<modifying metadata>
btrfs_finish_ordered_io()
|- insert_reserved_file_extent()
|- btrfs_qgroup_release_data()        <- Clear QGROUP_RESERVED bit
                                         but not decrease reserved space

<run delayed refs, normally happens in commit_trans>
run_one_delayed_ref()
|- btrfs_qgroup_free_delayed_ref()    <- Directly decrease reserved space
So the problem seems to be that btrfs_invalidatepage() is called after
run_delalloc_range() but before btrfs_finish_ordered_io().
Is that what you meant?
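To make the suspected sequence concrete, here is a user-space toy model of
the accounting described above. It is only a sketch, not kernel code: every
toy_* name is made up for illustration, one bool stands in for the
QGROUP_RESERVED bit of one page, and it assumes that reserving adds one page
to the counter. If the invalidate really lands between write-back start and
btrfs_finish_ordered_io(), the model subtracts the same page twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NPAGES    4
#define TOY_PAGE_SIZE 4096ULL

/* Toy stand-in for a qgroup plus the per-page QGROUP_RESERVED bit. */
struct toy_qgroup {
	uint64_t reserved;               /* models qgroup->reserved (u64) */
	bool reserved_bit[TOY_NPAGES];   /* models QGROUP_RESERVED per page */
};

/* Models btrfs_qgroup_reserve_data(): mark the bit, grow the counter
 * (the counter update is an assumption of this sketch). */
static void toy_reserve(struct toy_qgroup *q, unsigned int page)
{
	assert(page < TOY_NPAGES);
	q->reserved_bit[page] = true;
	q->reserved += TOY_PAGE_SIZE;
}

/* Models btrfs_qgroup_free_data(): subtract only if the bit was cleared here. */
static void toy_free(struct toy_qgroup *q, unsigned int page)
{
	assert(page < TOY_NPAGES);
	if (q->reserved_bit[page]) {
		q->reserved_bit[page] = false;
		q->reserved -= TOY_PAGE_SIZE;
	}
}

/* Models btrfs_qgroup_release_data(): clear the bit, keep the counter. */
static void toy_release(struct toy_qgroup *q, unsigned int page)
{
	assert(page < TOY_NPAGES);
	q->reserved_bit[page] = false;
}

/* Models btrfs_qgroup_free_delayed_ref(): subtract unconditionally. */
static void toy_free_delayed_ref(struct toy_qgroup *q, uint64_t bytes)
{
	q->reserved -= bytes;   /* unsigned, so an underflow wraps around */
}

/* Normal write-back of one page: the counter returns to zero. */
static uint64_t toy_normal_flow(void)
{
	struct toy_qgroup q = { 0 };

	toy_reserve(&q, 0);                       /* buffered write */
	toy_release(&q, 0);                       /* finish ordered io */
	toy_free_delayed_ref(&q, TOY_PAGE_SIZE);  /* run delayed refs */
	return q.reserved;
}

/* Suspected flow: invalidate arrives while the bit is still set. */
static uint64_t toy_racy_flow(void)
{
	struct toy_qgroup q = { 0 };

	toy_reserve(&q, 0);                       /* buffered write */
	toy_free(&q, 0);                          /* invalidatepage: 1st subtract */
	toy_release(&q, 0);                       /* bit already clear, no-op */
	toy_free_delayed_ref(&q, TOY_PAGE_SIZE);  /* 2nd subtract: wraps */
	return q.reserved;
}
```

In this model toy_normal_flow() ends at 0, while toy_racy_flow() ends at
2^64 - 4096, which is the "very large value" symptom from the commit message.
Whether the real code can actually reach that ordering is exactly the open
question.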
[[About test case]]
And for the test case, I can't reproduce the problem, whether or not I
apply the fix.
Either way, dd starts failing after 3 loops, and every later dd fails too.
But I can still remove the file and write new data into the fs.
[[Extra protection for qgroup->reserved]]
And for the underflowed qgroup reserved space, would you mind adding a
warning for that case?
Just like what we did for the qgroup excl/rfer values, so that at least it
won't make the qgroup block every write.
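Roughly what I have in mind, as a user-space sketch only. The struct and
function names here are hypothetical, fprintf() stands in for whatever
kernel warning helper would be used, and clamping to zero on underflow is
my assumption about the desired behavior, not existing code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the reserved counter of one qgroup. */
struct toy_counter {
	uint64_t qgroupid;
	uint64_t reserved;
};

/*
 * Subtract bytes from the reserved counter. If that would underflow,
 * warn and clamp to zero instead of wrapping to a huge u64 value.
 * Returns 1 if an underflow was detected, 0 otherwise.
 */
static int toy_sub_reserved(struct toy_counter *q, uint64_t bytes)
{
	assert(q != NULL);

	if (q->reserved < bytes) {
		fprintf(stderr,
			"qgroup %llu: reserved space underflow, have %llu, to free %llu\n",
			(unsigned long long)q->qgroupid,
			(unsigned long long)q->reserved,
			(unsigned long long)bytes);
		q->reserved = 0;   /* assumption: clamp so writes are not blocked */
		return 1;
	}

	q->reserved -= bytes;
	return 0;
}
```

With such a check, a double free of the reservation would show up as a
warning in the log instead of silently turning the counter into a value that
exceeds every limit.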
Thanks,
Qu