Re: [PATCH v2] btrfs: fix a possible race window when allocating new extent buffers

From: David Sterba <dsterba@suse.cz>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>,
	Chris Mason <clm@fb.com>
Subject: Re: [PATCH v2] btrfs: fix a possible race window when allocating new extent buffers
Date: Thu, 6 Jun 2024 21:27:22 +0200
Message-ID: <20240606192722.GF18508@twin.jikos.cz>
In-Reply-To: <0f003d96bcb54c9c1afd5512739645bbdddb701b.1717637062.git.wqu@suse.com>

On Thu, Jun 06, 2024 at 11:01:51AM +0930, Qu Wenruo wrote:
> [BUG]
> Since v6.8 there have been rare kernel crashes hit by different reporters,
> and most of them share the same bad page state error messages like
> this:
> 
>  BUG: Bad page state in process kswapd0  pfn:d6e840
>  page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
>  pfn:0xd6e840
>  aops:btree_aops ino:1
>  flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
>  page_type: 0xffffffff()
>  raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
>  raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
>  page dumped because: non-NULL mapping
> 
> [CAUSE]
> Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
> allocate-then-attach method") changes the sequence when allocating a new
> extent buffer.
> 
> Previously we always called grab_extent_buffer() under
> mapping->i_private_lock, to ensure safe modification of
> folio::private (which is a pointer to the extent buffer for regular
> sectorsize). After that commit, grab_extent_buffer() is no longer
> called under that lock.
> 
> This can lead to the following race:
> 
> Thread A is trying to allocate an extent buffer at bytenr X, with four
> 4K pages, while thread B is trying to release the page at X + 4K
> (the second page of the extent buffer at X).
> 
>            Thread A                |                 Thread B
> -----------------------------------+-------------------------------------
>                                    | btree_release_folio()
>                                    | | This is for the page at X + 4K,
>                                    | | Not page X.
>                                    | |
> alloc_extent_buffer()              | |- release_extent_buffer()
> |- filemap_add_folio() for the     | |  |- atomic_dec_and_test(eb->refs)
> |  page at bytenr X (the first     | |  |
> |  page).                          | |  |
> |  Which returned -EEXIST.         | |  |
> |                                  | |  |
> |- filemap_lock_folio()            | |  |
> |  Returned the first page locked. | |  |
> |                                  | |  |
> |- grab_extent_buffer()            | |  |
> |  |- atomic_inc_not_zero()        | |  |
> |  |  Returned false               | |  |
> |  |- folio_detach_private()       | |  |- folio_detach_private() for X
> |     |- folio_test_private()      | |     |- folio_test_private()
>       |  Returned true             | |     |  Returned true
>       |- folio_put()               |       |- folio_put()
> 
> Now there is a double put on the same folio X, which leads to a
> refcount underflow of folio X and eventually triggers the BUG_ON()
> on page->mapping.
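
The interleaving in the diagram can be replayed deterministically in user
space. The following is a minimal sketch, not kernel code: struct race_folio
and every function in it are hypothetical stand-ins, and the starting
refcount of 2 (one reference for the page cache, one for the attached
private data) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct folio: only the fields the race touches. */
struct race_folio {
	int refcount;      /* models the folio reference count */
	bool has_private;  /* models the private flag */
	void *private;     /* models folio->private (the eb pointer) */
};

/* Step 1 of the detach: the "is private set?" check. */
static bool detach_test(const struct race_folio *f)
{
	assert(f != NULL);
	return f->has_private;
}

/* Step 2 of the detach: clear the flag and pointer, then drop one reference. */
static void detach_finish(struct race_folio *f)
{
	assert(f != NULL);
	f->has_private = false;
	f->private = NULL;
	f->refcount--;
}

/* Unlocked interleaving: both callers pass the check before either clears. */
int racy_interleaving(struct race_folio *f)
{
	bool a_saw = detach_test(f);   /* thread A */
	bool b_saw = detach_test(f);   /* thread B */

	if (a_saw)
		detach_finish(f);
	if (b_saw)
		detach_finish(f);          /* second put on the same folio */
	return f->refcount;
}

/* Check and clear run back to back per caller, as they would under one lock. */
int serialized_detach(struct race_folio *f)
{
	if (detach_test(f))            /* thread A */
		detach_finish(f);
	if (detach_test(f))            /* thread B sees the flag already cleared */
		detach_finish(f);
	return f->refcount;
}
```

In this model the racy order drops two references for one attached private
pointer, so the reference that belonged to the page cache is gone while the
mapping is still set. The serialized order drops exactly one.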
> 
> The condition is not that easy to hit:
> 
> - The release must be triggered for a middle page of an eb
>   If the release is on the first page of an eb, the page lock would kick
>   in and prevent the race.
> 
> - folio_detach_private() has a very small race window
>   It's only between folio_test_private() and folio_clear_private().
> 
> Preventing exactly this kind of race is what mapping->i_private_lock is
> for, and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
> allocate-then-attach method") totally screwed this up.
> 
> At that time, I thought the page lock would kick in, as
> filemap_release_folio() also requires the page to be locked, but I forgot
> that filemap_release_folio() only locks one page, not all pages of an
> extent buffer.
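
Why a per-page lock cannot serialize the two threads can be shown with a
small user-space model. This is only a sketch: struct model_mapping,
model_trylock() and the two check functions are hypothetical names, and the
locks are plain C11 atomic flags rather than kernel page locks or spinlocks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_EB_PAGES 4   /* four 4K pages per extent buffer, as in the example */

/* Hypothetical model: one lock per page plus one lock shared by the mapping. */
struct model_mapping {
	atomic_flag page_lock[MODEL_EB_PAGES];
	atomic_flag private_lock;   /* stands in for mapping->i_private_lock */
};

void model_mapping_init(struct model_mapping *m)
{
	assert(m != NULL);
	for (int i = 0; i < MODEL_EB_PAGES; i++)
		atomic_flag_clear(&m->page_lock[i]);
	atomic_flag_clear(&m->private_lock);
}

/* Returns true when the caller acquired the lock, false if it was already held. */
static bool model_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set(lock);
}

/* Allocator locks page 0, releaser locks page 1: do the page locks collide? */
bool page_locks_conflict(struct model_mapping *m)
{
	assert(m != NULL);
	bool a = model_trylock(&m->page_lock[0]);   /* thread A, first page */
	bool b = model_trylock(&m->page_lock[1]);   /* thread B, second page */
	return !(a && b);
}

/* Both sides take the one shared lock: the second attempt must fail. */
bool shared_lock_conflicts(struct model_mapping *m)
{
	assert(m != NULL);
	bool a = model_trylock(&m->private_lock);
	bool b = model_trylock(&m->private_lock);
	return a && !b;
}
```

In the model both page locks are acquired at the same time, so neither thread
waits for the other, while the single shared lock admits only one of them.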
> 
> [FIX]
> Move all the code requiring i_private_lock into
> attach_eb_folio_to_filemap(), so that everything is done with proper
> lock protection.
> 
> Furthermore, to prevent future problems, add an extra
> lockdep_assert_held() to ensure we're holding the proper lock.
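
The shape of the fix, doing the check and the clear entirely inside one lock
and asserting in the helper that the lock is held, can be sketched in user
space as follows. This is not the patch itself: struct model_lock,
struct locked_folio and all functions are hypothetical, the lock is a
single-threaded stand-in that only records whether it is held, and a plain
assert() plays the role of the lockdep check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space lock that remembers whether it is held. */
struct model_lock {
	bool held;
};

/* Hypothetical stand-in for a folio together with its mapping's lock. */
struct locked_folio {
	struct model_lock *private_lock;  /* stands in for mapping->i_private_lock */
	int refcount;
	bool has_private;
};

static void model_lock_acquire(struct model_lock *l)
{
	assert(!l->held);   /* single-threaded model: nothing to wait on */
	l->held = true;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Must be called with the lock held; the assert models a lockdep-style check. */
static bool detach_private_locked(struct locked_folio *f)
{
	assert(f->private_lock->held);
	if (!f->has_private)
		return false;
	f->has_private = false;
	f->refcount--;
	return true;
}

/* Caller-side pattern: the check and the clear both happen inside the lock. */
bool detach_private(struct locked_folio *f)
{
	bool detached;

	model_lock_acquire(f->private_lock);
	detached = detach_private_locked(f);
	model_lock_release(f->private_lock);
	return detached;
}
```

Calling detach_private() twice on the same folio detaches once and drops one
reference; the second call finds the flag cleared and leaves the refcount
alone.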
> 
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
> Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
> Cc: Chris Mason <clm@fb.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Thanks. I'll pick the patch into the branch for the next pull request; the fix
has survived enough testing and we should get it to stable without further
delay. I've edited the subject and changelog a bit: the problem is really the
folio private protection. It is a race window fix, but that does not say much
about the cause. I've also added the reproducer script from Chris.

Thread overview: 7+ messages
2024-06-06  1:31 [PATCH v2] btrfs: fix a possible race window when allocating new extent buffers Qu Wenruo
2024-06-06 16:22 ` Filipe Manana
2024-06-06 16:52 ` Josef Bacik
2024-06-06 19:27 ` David Sterba [this message]
2024-06-07  4:27   ` Qu Wenruo
2024-06-07 13:56     ` David Sterba
2024-06-07 15:37       ` Chris Mason
