linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Goldwyn Rodrigues <rgoldwyn@suse.de>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>,
	Goldwyn Rodrigues <rgoldwyn@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero
Date: Tue, 27 Sep 2016 09:04:42 -0500	[thread overview]
Message-ID: <2a9a3890-3f32-f175-b4aa-becf7caabb7f@suse.de> (raw)
In-Reply-To: <1274b3b2-d196-ee9e-2261-d2f5cb906477@cn.fujitsu.com>



On 09/26/2016 10:10 PM, Qu Wenruo wrote:
> 
> 
> At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote:
>>
>>
>> On 09/25/2016 09:33 PM, Qu Wenruo wrote:
>>>
>>>
>>> At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:
>>>>
>>>>
> [snipped]
>>>
>>> Sorry I still don't get the point.
>>> Would you please give a call flow of the timing dirtying page and
>>> calling btrfs_qgroup_reserve/free/release_data()?
>>>
>>> Like:
>>> __btrfs_buffered_write()
>>> |- btrfs_check_data_free_space()
>>> |  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>>> |- btrfs_dirty_pages()            <- Mark page dirty
>>>
>>>
>>> [[Timing of btrfs_invalidatepage()]]
>>> About your commit message "once while invalidating the page, and the
>>> next time while free'ing delalloc."
>>> "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref().
>>>
>>> If so, it means one extent goes through full write back, and long before
>>> calling btrfs_qgroup_free_delayed_ref(), it will call
>>> btrfs_qgroup_release_data() to clear QGROUP_RESERVED.
>>>
>>> So the call will be:
>>> __btrfs_buffered_write()
>>> |- btrfs_check_data_free_space()
>>> |  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>>> |- btrfs_dirty_pages()            <- Mark page dirty
>>>
>>> <data write back happens>
>>> run_delalloc_range()
>>> |- cow_file_range()
>>>    |- extent_clear_unlock_delalloc() <- Clear page dirty
>>>
>>> <modifying metadata>
>>>
>>> btrfs_finish_ordered_io()
>>> |- insert_reserved_file_extent()
>>>    |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit
>>>                                      but not decrease reserved space
>>>
>>> <run delayed refs, normally happens in commit_trans>
>>> run_one_delyaed_refs()
>>> |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space
>>>
>>>
>>> So the problem seems to be, btrfs_invalidatepage() is called after
>>> run_delalloc_range() but before btrfs_finish_ordered_io().
>>>
>>> Did you mean that?
>>
>> This happens event before a writeback happens. So, here is what is
>> happening. This is in reference, and specific to the test case in the
>> description.
> 
> The flowchart helps a lot, thanks.
> 
> But still some flow is still strange.
>>
>> Process: dd - first time
>> __btrfs_buffered_write()
>> |- btrfs_check_data_free_space()
>> |  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>> |- btrfs_dirty_pages()            <- Mark page dirty
>>
>> Please note data writeback does _not_ happen/complete.
> 
> This part is completely fine.
> 
>>
>> Process: dd - next time, new process
>> sys_open O_TRUNC
>> .
>>  |-btrfs_setattr()
>>    |-truncate_pagecache()
>>      |-truncate_inode_pages_range()
>>         |-truncate_inode_page() - Page is cleared of Dirty flag.
>>           |-btrfs_invalidatepage(page)
>>             |-__btrfs_qgroup_release_data()
>>               |-btrfs_qgroup_free_refroot() - qgroup->reserved is
>> reduced by PAGESIZE.
> 
> That's OK too. No problem.
> Although the __btrfs_qgroup_release_data() is called by
> btrfs_qgroup_free_data().
> Which cleared QGROUP_RESERVED flag.
> 
> 
> So the call flow should be
> 
>   |-btrfs_setattr()
>     |-truncate_pagecache()
>       |-truncate_inode_pages_range()
>          |-truncate_inode_page()        <- Clear page dirty
>            |-btrfs_invalidatepage(page)
>              |-btrfs_qgroup_free_data() <- reduce qgroup->reserved
>                                            and clear QGROUP_RESERVED
> 
> After btrfs_setattr(), the new dd write data.
> So __btrfs_buffered_write() happens again,
> dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved.
> 
> So here I don't see any problem
> buffered_write:
>   Mark dirty
>   Mark QGROUP_RESERVED
>   Increase qgroup->reserved
> 
> btrfs_setattr:
>   Clear dirty
>   Clear QGROUP_RESERVED
>   Decrease qgroup->reserved
> 
> All pairs with each other, no leak no underflow.

Yes, but the problem is
run_one_delayed_ref()
 |-btrfs_qgroup_free_delayed_ref
uses delayed_ref_head->qgroup_reserved to reduce the count. Which
function is responsible for decreasing delayed_ref_head->qgroup_reserved
when btrfs_invalidatepage() is reducing the count?


> 
> Until the last buffered_write(), which doesn't got truncated.
> 
>>
>>
>> Process: sync
>> btrfs_sync_fs()
>> |-btrfs_commit_transaction()
>>   |-btrfs_run_delayed_refs()
>>     |- qgroup_free_refroot() - Reduces reserved by the size of the
>> extent (in this case, the filesize of dd (first time)), even though some
>> of the PAGESIZEs have been reduced while performing the truncate on the
>> file.
> 
> The delayed_ref belongs to the last extent, which doesn't got truncated.
> 
> And in that case, that last extent got into disk.
> run_dealloc_range() <- Write data into disk
> finish_ordered_io() <- Update metadata
>   |- insert_reserved_file_extents()
>      |- btrfs_alloc_reserved_file_extent()
>      |  |- btrfs_add_delayed_data_ref
>      |     <This will cause a new delayed_data_ref>
>      |- btrfs_qgroup_release_data()
>         <This will clear QGROURP_RESERVED>
> 
> Then goes to your btrfs_sync_fs() => qgroup_free_refroot()
> 
> 
> 
> [[I think I find the problem now]]
> 
> Between btrfs_alloc_reserved_file_extent() and
> btrfs_qgroup_release_data(), there is a window that delayed_refs can be
> executed.
> Since that d*mn delayed_refs can be run at any time asynchronously.
> 
> In that case, it will call qgroup_free_refroot() first at another
> process context, and later btrfs_qgroup_release_data() get schedualed,
> which will clear QGROUP_RESERVED again, causing the underflow.
> 
> Although it's not the original cause you reported, would you mind to try
> the following quick-dirty fix without your fix, to see if it fixes the
> problem?
> 
> -------
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e6811c4..70efa14 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2168,15 +2168,15 @@ static int insert_reserved_file_extent(struct
> btrfs_trans_handle *trans,
>         ins.objectid = disk_bytenr;
>         ins.offset = disk_num_bytes;
>         ins.type = BTRFS_EXTENT_ITEM_KEY;
> -       ret = btrfs_alloc_reserved_file_extent(trans, root,
> -                                       root->root_key.objectid,
> -                                       btrfs_ino(inode), file_pos,
> -                                       ram_bytes, &ins);
>         /*
>          * Release the reserved range from inode dirty range map, as it is
>          * already moved into delayed_ref_head
>          */
>         btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
> +       ret = btrfs_alloc_reserved_file_extent(trans, root,
> +                                       root->root_key.objectid,
> +                                       btrfs_ino(inode), file_pos,
> +                                       ram_bytes, &ins);
>  out:
>         btrfs_free_path(path);
> 
> -------

No, it does not work.

> 
>>
>> I hope that makes it clear.
>>
>>>
>>> [[About test case]]
>>> And for the test case, I can't reproduce the problem no matter if I
>>> apply the fix or not.
>>>
>>> Either way it just fails after 3 loops of dd, and later dd will all
>>> fail.
>>> But I can still remove the file and write new data into the fs.
>>>
>>
>> Strange? I can reproduce at every instance of running it. Even on
>> 4.8.0-rc7
> 
> Maybe related to the memory size.
> 
> Since you're also using VM, maybe your VM RAM is too small, so dirty
> pages pressure triggers a commit halfway and cause the race?

Not quite, this problem happens on a physical machine as well. Besides,
this VM is not in memory pressure. I tried it again on 4.8.0-rc8.

-- 
Goldwyn

  reply	other threads:[~2016-09-27 14:04 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-22 18:47 [PATCH] qgroup: Prevent qgroup->reserved from going subzero Goldwyn Rodrigues
2016-09-23  1:06 ` Qu Wenruo
2016-09-23 13:43   ` Goldwyn Rodrigues
2016-09-26  2:33     ` Qu Wenruo
2016-09-26 14:31       ` Goldwyn Rodrigues
2016-09-27  3:10         ` Qu Wenruo
2016-09-27 14:04           ` Goldwyn Rodrigues [this message]
2016-09-28  1:44             ` Qu Wenruo
2016-09-28  2:19               ` Goldwyn Rodrigues
2016-09-29  8:57                 ` Qu Wenruo
2016-09-29 11:13                   ` Goldwyn Rodrigues
2016-09-30  1:21                     ` Qu Wenruo
2016-09-30  2:18                     ` Qu Wenruo
2016-09-30  2:24                       ` Qu Wenruo
  -- strict thread matches above, loose matches on Subject: below --
2016-09-30 15:40 Goldwyn Rodrigues

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2a9a3890-3f32-f175-b4aa-becf7caabb7f@suse.de \
    --to=rgoldwyn@suse.de \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo@cn.fujitsu.com \
    --cc=rgoldwyn@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).