From: Goldwyn Rodrigues <rgoldwyn@suse.de>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>,
Goldwyn Rodrigues <rgoldwyn@suse.com>,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero
Date: Tue, 27 Sep 2016 09:04:42 -0500 [thread overview]
Message-ID: <2a9a3890-3f32-f175-b4aa-becf7caabb7f@suse.de> (raw)
In-Reply-To: <1274b3b2-d196-ee9e-2261-d2f5cb906477@cn.fujitsu.com>
On 09/26/2016 10:10 PM, Qu Wenruo wrote:
>
>
> At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote:
>>
>>
>> On 09/25/2016 09:33 PM, Qu Wenruo wrote:
>>>
>>>
>>> At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:
>>>>
>>>>
> [snipped]
>>>
>>> Sorry I still don't get the point.
>>> Would you please give a call flow of the timing dirtying page and
>>> calling btrfs_qgroup_reserve/free/release_data()?
>>>
>>> Like:
>>> __btrfs_buffered_write()
>>> |- btrfs_check_data_free_space()
>>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>>> |- btrfs_dirty_pages() <- Mark page dirty
>>>
>>>
>>> [[Timing of btrfs_invalidatepage()]]
>>> About your commit message "once while invalidating the page, and the
>>> next time while free'ing delalloc."
>>> "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref().
>>>
>>> If so, it means one extent goes through full write back, and long before
>>> calling btrfs_qgroup_free_delayed_ref(), it will call
>>> btrfs_qgroup_release_data() to clear QGROUP_RESERVED.
>>>
>>> So the call will be:
>>> __btrfs_buffered_write()
>>> |- btrfs_check_data_free_space()
>>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>>> |- btrfs_dirty_pages() <- Mark page dirty
>>>
>>> <data write back happens>
>>> run_delalloc_range()
>>> |- cow_file_range()
>>> |- extent_clear_unlock_delalloc() <- Clear page dirty
>>>
>>> <modifying metadata>
>>>
>>> btrfs_finish_ordered_io()
>>> |- insert_reserved_file_extent()
>>> |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit
>>> but not decrease reserved space
>>>
>>> <run delayed refs, normally happens in commit_trans>
>>> run_one_delyaed_refs()
>>> |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space
>>>
>>>
>>> So the problem seems to be, btrfs_invalidatepage() is called after
>>> run_delalloc_range() but before btrfs_finish_ordered_io().
>>>
>>> Did you mean that?
>>
>> This happens event before a writeback happens. So, here is what is
>> happening. This is in reference, and specific to the test case in the
>> description.
>
> The flowchart helps a lot, thanks.
>
> But still some flow is still strange.
>>
>> Process: dd - first time
>> __btrfs_buffered_write()
>> |- btrfs_check_data_free_space()
>> | |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>> |- btrfs_dirty_pages() <- Mark page dirty
>>
>> Please note data writeback does _not_ happen/complete.
>
> This part is completely fine.
>
>>
>> Process: dd - next time, new process
>> sys_open O_TRUNC
>> .
>> |-btrfs_setattr()
>> |-truncate_pagecache()
>> |-truncate_inode_pages_range()
>> |-truncate_inode_page() - Page is cleared of Dirty flag.
>> |-btrfs_invalidatepage(page)
>> |-__btrfs_qgroup_release_data()
>> |-btrfs_qgroup_free_refroot() - qgroup->reserved is
>> reduced by PAGESIZE.
>
> That's OK too. No problem.
> Although the __btrfs_qgroup_release_data() is called by
> btrfs_qgroup_free_data().
> Which cleared QGROUP_RESERVED flag.
>
>
> So the call flow should be
>
> |-btrfs_setattr()
> |-truncate_pagecache()
> |-truncate_inode_pages_range()
> |-truncate_inode_page() <- Clear page dirty
> |-btrfs_invalidatepage(page)
> |-btrfs_qgroup_free_data() <- reduce qgroup->reserved
> and clear QGROUP_RESERVED
>
> After btrfs_setattr(), the new dd write data.
> So __btrfs_buffered_write() happens again,
> dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved.
>
> So here I don't see any problem
> buffered_write:
> Mark dirty
> Mark QGROUP_RESERVED
> Increase qgroup->reserved
>
> btrfs_setattr:
> Clear dirty
> Clear QGROUP_RESERVED
> Decrease qgroup->reserved
>
> All pairs with each other, no leak no underflow.
Yes, but the problem is
run_one_delayed_ref()
|-btrfs_qgroup_free_delayed_ref
uses delayed_ref_head->qgroup_reserved to reduce the count. Which
function is responsible for decreasing delayed_ref_head->qgroup_reserved
when btrfs_invalidatepage() is reducing the count?
>
> Until the last buffered_write(), which doesn't got truncated.
>
>>
>>
>> Process: sync
>> btrfs_sync_fs()
>> |-btrfs_commit_transaction()
>> |-btrfs_run_delayed_refs()
>> |- qgroup_free_refroot() - Reduces reserved by the size of the
>> extent (in this case, the filesize of dd (first time)), even though some
>> of the PAGESIZEs have been reduced while performing the truncate on the
>> file.
>
> The delayed_ref belongs to the last extent, which doesn't got truncated.
>
> And in that case, that last extent got into disk.
> run_dealloc_range() <- Write data into disk
> finish_ordered_io() <- Update metadata
> |- insert_reserved_file_extents()
> |- btrfs_alloc_reserved_file_extent()
> | |- btrfs_add_delayed_data_ref
> | <This will cause a new delayed_data_ref>
> |- btrfs_qgroup_release_data()
> <This will clear QGROURP_RESERVED>
>
> Then goes to your btrfs_sync_fs() => qgroup_free_refroot()
>
>
>
> [[I think I find the problem now]]
>
> Between btrfs_alloc_reserved_file_extent() and
> btrfs_qgroup_release_data(), there is a window that delayed_refs can be
> executed.
> Since that d*mn delayed_refs can be run at any time asynchronously.
>
> In that case, it will call qgroup_free_refroot() first at another
> process context, and later btrfs_qgroup_release_data() get schedualed,
> which will clear QGROUP_RESERVED again, causing the underflow.
>
> Although it's not the original cause you reported, would you mind to try
> the following quick-dirty fix without your fix, to see if it fixes the
> problem?
>
> -------
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e6811c4..70efa14 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2168,15 +2168,15 @@ static int insert_reserved_file_extent(struct
> btrfs_trans_handle *trans,
> ins.objectid = disk_bytenr;
> ins.offset = disk_num_bytes;
> ins.type = BTRFS_EXTENT_ITEM_KEY;
> - ret = btrfs_alloc_reserved_file_extent(trans, root,
> - root->root_key.objectid,
> - btrfs_ino(inode), file_pos,
> - ram_bytes, &ins);
> /*
> * Release the reserved range from inode dirty range map, as it is
> * already moved into delayed_ref_head
> */
> btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
> + ret = btrfs_alloc_reserved_file_extent(trans, root,
> + root->root_key.objectid,
> + btrfs_ino(inode), file_pos,
> + ram_bytes, &ins);
> out:
> btrfs_free_path(path);
>
> -------
No, it does not work.
>
>>
>> I hope that makes it clear.
>>
>>>
>>> [[About test case]]
>>> And for the test case, I can't reproduce the problem no matter if I
>>> apply the fix or not.
>>>
>>> Either way it just fails after 3 loops of dd, and later dd will all
>>> fail.
>>> But I can still remove the file and write new data into the fs.
>>>
>>
>> Strange? I can reproduce at every instance of running it. Even on
>> 4.8.0-rc7
>
> Maybe related to the memory size.
>
> Since you're also using VM, maybe your VM RAM is too small, so dirty
> pages pressure triggers a commit halfway and cause the race?
Not quite, this problem happens on a physical machine as well. Besides,
this VM is not in memory pressure. I tried it again on 4.8.0-rc8.
--
Goldwyn
next prev parent reply other threads:[~2016-09-27 14:04 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-09-22 18:47 [PATCH] qgroup: Prevent qgroup->reserved from going subzero Goldwyn Rodrigues
2016-09-23 1:06 ` Qu Wenruo
2016-09-23 13:43 ` Goldwyn Rodrigues
2016-09-26 2:33 ` Qu Wenruo
2016-09-26 14:31 ` Goldwyn Rodrigues
2016-09-27 3:10 ` Qu Wenruo
2016-09-27 14:04 ` Goldwyn Rodrigues [this message]
2016-09-28 1:44 ` Qu Wenruo
2016-09-28 2:19 ` Goldwyn Rodrigues
2016-09-29 8:57 ` Qu Wenruo
2016-09-29 11:13 ` Goldwyn Rodrigues
2016-09-30 1:21 ` Qu Wenruo
2016-09-30 2:18 ` Qu Wenruo
2016-09-30 2:24 ` Qu Wenruo
-- strict thread matches above, loose matches on Subject: below --
2016-09-30 15:40 Goldwyn Rodrigues
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2a9a3890-3f32-f175-b4aa-becf7caabb7f@suse.de \
--to=rgoldwyn@suse.de \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo@cn.fujitsu.com \
--cc=rgoldwyn@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).