Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: fdmanana@gmail.com, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC 4/4] btrfs: inode: make btrfs_invalidatepage() to be subpage compatible
Date: Tue, 22 Dec 2020 12:38:38 +0800	[thread overview]
Message-ID: <7013d0b6-0ce1-9ee6-7cbf-955126053e10@gmx.com> (raw)
In-Reply-To: <CAL3q7H7WTScxZmAwaNCztp-Q=zP6NSkRzh==quQS5mqxxbto2Q@mail.gmail.com>



On 2020/12/17 下午7:20, Filipe Manana wrote:
> On Thu, Dec 17, 2020 at 5:03 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> With current subpage RW patchset, the following script can lead to
>> filesystem hang:
>>    # mkfs.btrfs -f -s 4k $dev
>>    # mount $dev -o nospace_cache $mnt
>>    # fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt
>>
>> The file system will hang at wait_event() of
>> btrfs_start_ordered_extent().
>>
>> [CAUSE]
>> The root cause is, btrfs_invalidatepage() is freeing page::private which
>> still has subpage dirty bit set.
>>
>> The offending situation happens like this:
>> btrfs_fllocate()
>> |- btrfs_zero_range()
>>     |- btrfs_punch_hole_lock_range()
>>        |- truncate_pagecache_range()
>>           |- btrfs_invalidatepage()
>>
>> The involved range looks like:
>>
>> 0       32K     64K     96K     128K
>>          |///////||//////|
>>          | Range to drop |
>>
>> For the [32K, 64K) range, since the offset is 32K, the page won't be
>> invalidated.
>>
>> But for the [64K, 96K) range, the offset is 0, current
>> btrfs_invalidatepage() will call clear_page_extent_mapped() which will
>> detach page::private, making the subpage dirty bitmap being cleared.
>>
>> This prevents later __extent_writepage_io() to locate any range to
>> write, thus no way to wake up the ordered extents.
>>
>> [FIX]
>> To fix the problem this patch will:
>> - Only clear page status and detach page private when the full page
>>    is invalidated
>>
>> - Change how we handle unfinished ordered extent
>>    If there is any ordered extent unfinished in the page range, we can't
>>    call clear_extent_bit() with delete == true.
>>
>> [REASON FOR RFC]
>> There is still uncertainty around the btrfs_releasepage() call.
>>
>> 1. Why we need btrfs_releasepage() call for non-full-page condition?
>>     Other fs (aka. xfs) just exit without doing special handling if
>>     invalidatepage() is called with part of the page.
>>
>>     Thus I didn't completely understand why btrfs_releasepage() here is
>>     needed for non-full page call.
>>
>> 2. Why "if (offset)" is not causing problem for current code?
>>     This existing if (offset) call can be skipped for cases like
>>     offset == 0 length == 2K.
>>     As MM layer can call invalidatepage() with unaligned offset/length,
>>     for cases like truncate_inode_pages_range().
>>     This will make btrfs_invalidatepage() to truncate the whole page when
>>     we only need to zero part of the page.
>
> I don't think you can ever get offset == 0 and length < PAGE_SIZE
> unless this is the last page in the file, the one containing eof, in
> which it's perfectly valid to invalidate the whole page.

You're right.

After more testing, it indeed shows except setsize call which could pass
unaligned range, all other call sites are using sector aligned ranges.

Thus the existing code won't cause any problem for current code base.

Either we're invalidating the last page of the inode, or we're
invalidating sector (PAGE) aligned range.


But for subpage support, we can have cases like
btrfs_punch_hole_lock_range() which only passes sector aligned range in,
and since sector size is smaller than page size, we can have offset == 0
while length < PAGE_SIZE and it's not the last page.

Further more, the page can have dirty range not covered by the
invalidatepage range, causing the problem.

I'll update the commit message to explain the case more, and only put
the fix into the subpage series, other than sending it out without
subpage context.

>
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c | 23 ++++++++++++++++-------
>>   1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index eb493fbb65f9..872c5309b4ca 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -8180,7 +8180,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>>          int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
>>          bool cleared_private2;
>>          bool found_ordered = false;
>> -       bool completed_ordered = false;
>> +       bool incompleted_ordered = false;
>>
>>          /*
>>           * we have the page locked, so new writeback can't start,
>> @@ -8191,7 +8191,13 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>>           */
>>          wait_on_page_writeback(page);
>>
>> -       if (offset) {
>> +       /*
>> +        * The range doesn't cover the full page, just let btrfs_releasepage() to
>> +        * check if we can release the extent mapping.
>> +        * Any locked/pinned/logged extent map would prevent us freeing the
>> +        * extent mapping.
>> +        */
>> +       if (!(offset == 0 && length == PAGE_SIZE)) {
>>                  btrfs_releasepage(page, GFP_NOFS);
>>                  return;
>>          }
>> @@ -8208,9 +8214,10 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>>                  end = min(page_end,
>>                            ordered->file_offset + ordered->num_bytes - 1);
>>                  /*
>> -                * IO on this page will never be started, so we need to account
>> -                * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
>> -                * here, must leave that up for the ordered extent completion.
>> +                * IO on this ordered extent will never be started, so we need
>> +                * to account for any ordered extents now. Don't clear
>
> So this comment update states that "IO on this ordered extent will
> never be started".
> That is not true, unless some other patch not in misc-next changed
> something and I missed it (like splitting ordered extents).
>
> If you have a 1M ordered extent, for file range [0, 1M[ for example,
> and then truncate the file to 512K or punch a hole for the range
> [512K, 1M[, then IO for the first 512K of the ordered extent is still
> done.
>
> So I think what you wanted to say is more like "IO for this subpage
> will never be started ...".

This is indeed much better.

>
>> +                * EXTENT_DELALLOC_NEW here, must leave that up for the
>> +                * ordered extent completion.
>>                   */
>>                  if (!inode_evicting)
>>                          clear_extent_bit(tree, start, end,
>> @@ -8234,7 +8241,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>>                                                             start,
>>                                                             end - start + 1, 1)) {
>>                                  btrfs_finish_ordered_io(ordered);
>> -                               completed_ordered = true;
>> +                       } else {
>> +                               incompleted_ordered = true;
>>                          }
>>                  }
>>
>> @@ -8276,7 +8284,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>>                   * is cleared if we don't delete, otherwise it can lead to
>>                   * corruptions if the i_size is extented later.
>>                   */
>> -               if (found_ordered && !completed_ordered)
>> +               if (found_ordered && incompleted_ordered)
>
> I find this naming, "incompleted_ordered" confusing, I think
> "incompleted" is not even a valid english word.
>
> What you mean is that if there is some ordered extent for the page
> range that we could not complete ourselves.
> I would suggest naming it to "completed_all_ordered", initialize it to
> true and then set it to false when we can't complete an ordered extent
> ourselves.
>
> Then it would just be "if (found_ordered && !completed_all_ordered)
> delete = false;".
>
> Also, I haven't checked the other patchsets for subpage support, but
> from looking only at this patchset, I'm assuming we can't set ranges
> in the io tree smaller than page size, is that correct?
> Otherwise this would be calling clear_extent_bit for each subpage range.

For current subpage implementation, we have to support sector aligned
writeback.

The requirement comes from data balance, we have to be able to write new
data extents exactly the same size as the originals.

Thus here we support range smaller than page size for subpage.
The extent io tree itself can support it without problem.

Thanks,
Qu
>
> Thanks.
>
>>                          delete = false;
>>                  clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED |
>>                                   EXTENT_DELALLOC | EXTENT_UPTODATE |
>> @@ -8286,6 +8294,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>>                  __btrfs_releasepage(page, GFP_NOFS);
>>          }
>>
>> +       ClearPagePrivate2(page);
>>          ClearPageChecked(page);
>>          clear_page_extent_mapped(page);
>>   }
>> --
>> 2.29.2
>>
>
>

  reply	other threads:[~2020-12-22  4:40 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-17  4:57 [PATCH 0/4] btrfs: inode: btrfs_invalidatepage() related refactor and fix for subpage Qu Wenruo
2020-12-17  4:57 ` [PATCH 1/4] btrfs: inode: use min() to replace open-code in btrfs_invalidatepage() Qu Wenruo
2020-12-17  4:57 ` [PATCH 2/4] btrfs: inode: remove variable shadowing " Qu Wenruo
2020-12-17  5:38   ` Su Yue
2020-12-17  5:42     ` Qu Wenruo
2020-12-17  6:08       ` Su Yue
2020-12-17  5:55   ` Nikolay Borisov
2020-12-17  5:59     ` Nikolay Borisov
2020-12-17  6:13       ` Qu Wenruo
2020-12-17 12:29         ` David Sterba
2020-12-17  4:57 ` [PATCH 3/4] btrfs: inode: move the timing of TestClearPagePrivate() " Qu Wenruo
2020-12-17  4:57 ` [PATCH RFC 4/4] btrfs: inode: make btrfs_invalidatepage() to be subpage compatible Qu Wenruo
2020-12-17 11:20   ` Filipe Manana
2020-12-22  4:38     ` Qu Wenruo [this message]
2020-12-17 14:51   ` Josef Bacik
2020-12-18  0:42     ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7013d0b6-0ce1-9ee6-7cbf-955126053e10@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=fdmanana@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox