From: Byungchul Park <byungchul@sk.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-kernel@vger.kernel.org, clm@fb.com, josef@toxicpanda.com,
dsterba@suse.com, linux-btrfs@vger.kernel.org,
kernel_team@skhynix.com, torvalds@linux-foundation.org,
akpm@linux-foundation.org, yeoreum.yun@arm.com,
yunseong.kim@ericsson.com, gwan-gyeong.mun@intel.com,
harry.yoo@oracle.com, ysk@kzalloc.com
Subject: Re: [RFC] DEPT report on around btrfs, unlink, and truncate
Date: Tue, 24 Jun 2025 10:59:24 +0900 [thread overview]
Message-ID: <20250624015924.GE5820@system.software.com> (raw)
In-Reply-To: <20250624014426.GC5820@system.software.com>
On Tue, Jun 24, 2025 at 10:44:26AM +0900, Byungchul Park wrote:
> On Mon, Jun 23, 2025 at 07:28:38PM +0930, Qu Wenruo wrote:
> > 在 2025/6/23 19:22, Byungchul Park 写道:
> > > On Mon, Jun 23, 2025 at 06:22:44PM +0930, Qu Wenruo wrote:
> > > > 在 2025/6/23 17:49, Byungchul Park 写道:
> > > > > On Mon, Jun 23, 2025 at 03:20:43PM +0930, Qu Wenruo wrote:
> > > > > > 在 2025/6/23 12:51, Byungchul Park 写道:
> > > > > > > Hi folks,
> > > > > > >
> > > > > > > Thanks to Yunseong, we got two DEPT reports in btrfs. It doesn't mean
> > > > > > > it's obvious deadlocks, but after digging into the reports, I'm
> > > > > > > wondering if it could happen by any chance.
> > > > > > >
> > > > > > > 1) The first scenario that I'm concerning is:
> > > > > > >
> > > > > > > context A context B
> > > > > > >
> > > > > > > do_truncate()
> > > > > > > ...
> > > > > > > btrfs_do_readpage() // with folio lock held
> > > > > >
> > > > > > This one is for data.
> > > > >
> > > > > Do you mean this folio is for data? Thanks for the confirmation.
> > > >
> > > > Yes, only data folios will go through btrfs_do_readpage().
> > > >
> > > > For metadata, we never go through btrfs_do_readpage(), but
> > > > read_extent_buffer_pages_nowait().
> > > >
> > > >
> > > > >
> > > > > > > do_unlinkat()
> > > > > > > ...
> > > > > > > push_leaf_right()
> > > > > > > btrfs_tree_lock_nested()
> > > > > > > down_write_nested(&eb->lock) // hold
> > > > >
> > > > > This is struct extent_buffer's rw_sem. Right?
> > > > >
> > > > > > > btrfs_get_extent()
> > > > > > > btrfs_lookup_file_extent()
> > > > > > > btrfs_search_slot()
> > > > > > > down_read_nested(&eb->lock) // stuck
> > > > > >
> > > > > > This one is for metadata.
> > > > > ^
> > > > > I don't get this actually.
> > > > >
> > > > > This is struct extent_buffer's rw_sem, too. Cannot this rw_sem be the
> > > > > same as the rw_sem above in context A?
> > > >
> > > > My bad, I thought you're talking about that down_read_nested()
> > > > conflicting with folio lock.
> > > >
> > > > But if you're talking about extent_buffer::lock, then the one in context
> > > > B will wait for the one in context A, and that's expected.
> > >
> > > Sounds good.
> > >
> > > > > > Data and metadata page cache will never cross into each other.
> > > > > >
> > > > > > Thanks,
> > > > > > Qu
> > > > > >
> > > > > > > __push_leaf_right()
> > > > > > > ...
> > > > > > > folio_lock() // stuck
> > > > >
> > > > > Did you mean this folio is always for metadata?
> > > >
> > > > Can you explain more on where this folio_lock() comes from?
> > >
> > > I should also rely on the following stacktrace in the dept report. I
> > > asked Yunseong who reported this issue, for the decoded stacktrace, so
> > > that I can interpret that better. I will get back once I figure out
> > > where the wait on PG_locked comes from.
> > >
> > > [ 304.344198][ T7488] [W] dept_page_wait_on_bit(pg_locked_map:0):
> > > [ 304.344211][ T7488] [<ffff8000823b1d20>] __push_leaf_right+0x8f0/0xc70
> >
> > I believe it's from btrfs_clear_buffer_dirty():
> >
> > As we have a for() loop iterating all the folios of a an extent buffer
> > (aka, metadata structure), then clear the dirty flags.
> >
> > The same applies to btrfs_mark_buffer_dirty() -> set_extent_buffer_dirty().
>
> Thanks to Yunseong, I figured out this is the case.
>
> > In that case, the folio is 100% belonging to btree inode thus metadata.
>
> Good to know.
>
> Lastly, is it still good with directly manipulating block devs or
^
I meant block devs files.
Byungchul
> stacked file system using loopback devices, from the confliction of
> folios and extent_buffers?
>
> If you confirm it, this issue can be closed :-) Thanks in advance.
>
> Byungchul
>
> > Thus the folio lock can not conflict with a data folio, thus there
> > should be no deadlock.
> >
> > Thanks,
> > Qu
> >
> >
> > > [ 304.344232][ T7488] stacktrace:
> > > [ 304.344241][ T7488] __push_leaf_right+0x8f0/0xc70
> > > [ 304.344260][ T7488] push_leaf_right+0x408/0x628
> > > [ 304.344278][ T7488] btrfs_del_items+0x974/0xaec
> > > [ 304.344297][ T7488] btrfs_truncate_inode_items+0x1c5c/0x2b00
> > > [ 304.344314][ T7488] btrfs_evict_inode+0xa4c/0xd38
> > > [ 304.344335][ T7488] evict+0x340/0x7b0
> > > [ 304.344352][ T7488] iput+0x4ec/0x840
> > > [ 304.344369][ T7488] do_unlinkat+0x444/0x59c
> > > [ 304.344388][ T7488] __arm64_sys_unlinkat+0x11c/0x260
> > > [ 304.344407][ T7488] invoke_syscall+0x88/0x2e0
> > > [ 304.344425][ T7488] el0_svc_common.constprop.0+0xe8/0x2e0
> > > [ 304.344445][ T7488] do_el0_svc+0x44/0x60
> > > [ 304.344463][ T7488] el0_svc+0x50/0x188
> > > [ 304.344482][ T7488] el0t_64_sync_handler+0x10c/0x140
> > > [ 304.344503][ T7488] el0t_64_sync+0x198/0x19c
> > >
> > > Thanks.
> > >
> > > Byungchul
> > >
> > > > I didn't see any location where __push_leaf_right() is locking a folio
> > > > nor the original do_unlinkat().
> > > >
> > > > So here I can only guess the folio is from __push_leaf_right() context,
> > > > that means it can only be a metadata folio.
> > > >
> > > > >
> > > > > If no, it could lead a deadlock in my opinion. If yes, dept should
> > > > > assign different classes to folios between data data and metadata.
> > > >
> > > > So far I believe the folio belongs to metadata.
> > > >
> > > > And since btrfs has very different handling of metadata folios, and it's
> > > > a little confusing that, we also have a btree_inode to handle the
> > > > metadata page cache, but do not have read_folio() callbacks, it can be a
> > > > little confusing to some automatic tools.
> > > >
> > > > Thanks,
> > > > Qu
> >
> >
next prev parent reply other threads:[~2025-06-24 1:59 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-23 3:21 [RFC] DEPT report on around btrfs, unlink, and truncate Byungchul Park
2025-06-23 5:50 ` Qu Wenruo
2025-06-23 8:19 ` Byungchul Park
2025-06-23 8:52 ` Qu Wenruo
2025-06-23 9:52 ` Byungchul Park
2025-06-23 9:58 ` Qu Wenruo
2025-06-24 1:44 ` Byungchul Park
2025-06-24 1:59 ` Byungchul Park [this message]
2025-06-24 3:07 ` Qu Wenruo
2025-06-23 12:22 ` Yunseong Kim
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250624015924.GE5820@system.software.com \
--to=byungchul@sk.com \
--cc=akpm@linux-foundation.org \
--cc=clm@fb.com \
--cc=dsterba@suse.com \
--cc=gwan-gyeong.mun@intel.com \
--cc=harry.yoo@oracle.com \
--cc=josef@toxicpanda.com \
--cc=kernel_team@skhynix.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
--cc=torvalds@linux-foundation.org \
--cc=yeoreum.yun@arm.com \
--cc=ysk@kzalloc.com \
--cc=yunseong.kim@ericsson.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.