* [PATCH 0/3] btrfs: some fixes related to the extent map shrinker
@ 2025-02-16 13:16 fdmanana
2025-02-16 13:16 ` [PATCH 1/3] btrfs: fix use-after-free on inode when scanning root during em shrinking fdmanana
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: fdmanana @ 2025-02-16 13:16 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
A few fixes and improvements for the extent map shrinker. The first is a
use-after-free that is likely hard to hit and the other two fixes make
things more efficient by avoiding grabbing and dropping (iput) inodes
which don't have extent maps loaded as well as eleminating the need to
do delayed iputs, as that's not needed anymore.
These are related to a recent report from Ivan Shapovalov where the
cleaner kthread was using over 50% of CPU doing a lot of delayed iputs:
https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
More details in the change logs.
Filipe Manana (3):
btrfs: fix use-after-free on inode when scanning root during em shrinking
btrfs: skip inodes without loaded extent maps when shrinking extent maps
btrfs: do regular iput instead of delayed iput during extent map shrinking
fs/btrfs/extent_map.c | 82 ++++++++++++++++++++++++++++++-------------
1 file changed, 58 insertions(+), 24 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 9+ messages in thread* [PATCH 1/3] btrfs: fix use-after-free on inode when scanning root during em shrinking 2025-02-16 13:16 [PATCH 0/3] btrfs: some fixes related to the extent map shrinker fdmanana @ 2025-02-16 13:16 ` fdmanana 2025-02-17 17:47 ` Johannes Thumshirn 2025-02-16 13:16 ` [PATCH 2/3] btrfs: skip inodes without loaded extent maps when shrinking extent maps fdmanana ` (2 subsequent siblings) 3 siblings, 1 reply; 9+ messages in thread From: fdmanana @ 2025-02-16 13:16 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> At btrfs_scan_root() we are accessing the inode's root (and fs_info) in a call to btrfs_fs_closing() after we have scheduled the inode for a delayed iput, and that can result in a use-after-free on the inode in case the cleaner kthread does the iput before we dereference the inode in the call to btrfs_fs_closing(). Fix this by using the fs_info stored already in a local variable instead of doing inode->root->fs_info. Fixes: 102044384056 ("btrfs: make the extent map shrinker run asynchronously as a work queue job") CC: stable@vger.kernel.org # 6.13+ Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/extent_map.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 67ce85ff0ae2..bee1b94a1049 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -1222,8 +1222,7 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx fs_info->em_shrinker_last_ino = btrfs_ino(inode); btrfs_add_delayed_iput(inode); - if (ctx->scanned >= ctx->nr_to_scan || - btrfs_fs_closing(inode->root->fs_info)) + if (ctx->scanned >= ctx->nr_to_scan || btrfs_fs_closing(fs_info)) break; cond_resched(); -- 2.45.2 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] btrfs: fix use-after-free on inode when scanning root during em shrinking 2025-02-16 13:16 ` [PATCH 1/3] btrfs: fix use-after-free on inode when scanning root during em shrinking fdmanana @ 2025-02-17 17:47 ` Johannes Thumshirn 0 siblings, 0 replies; 9+ messages in thread From: Johannes Thumshirn @ 2025-02-17 17:47 UTC (permalink / raw) To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org Looks good, Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/3] btrfs: skip inodes without loaded extent maps when shrinking extent maps 2025-02-16 13:16 [PATCH 0/3] btrfs: some fixes related to the extent map shrinker fdmanana 2025-02-16 13:16 ` [PATCH 1/3] btrfs: fix use-after-free on inode when scanning root during em shrinking fdmanana @ 2025-02-16 13:16 ` fdmanana 2025-02-18 6:18 ` Johannes Thumshirn 2025-02-16 13:16 ` [PATCH 3/3] btrfs: do regular iput instead of delayed iput during extent map shrinking fdmanana 2025-02-17 21:56 ` [PATCH 0/3] btrfs: some fixes related to the extent map shrinker Qu Wenruo 3 siblings, 1 reply; 9+ messages in thread From: fdmanana @ 2025-02-16 13:16 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> If there are inodes that don't have any loaded extent maps, we end up grabbing a reference on them and later adding a delayed iput, which wakes up the cleaner and makes it do unnecessary work. This is common when for example the inodes were open only to run stat(2) or all their extent maps were already released through the folio release callback (btrfs_release_folio()) or released by a previous run of the shrinker, or directories which never have extent maps. Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.13+ Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/extent_map.c | 77 +++++++++++++++++++++++++++++++------------ 1 file changed, 56 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index bee1b94a1049..820aef514238 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -1128,6 +1128,8 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c long nr_dropped = 0; struct rb_node *node; + lockdep_assert_held_write(&tree->lock); + /* * Take the mmap lock so that we serialize with the inode logging phase * of fsync because we may need to set the full sync flag on the inode, @@ -1139,28 +1141,12 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c * to find new extents, which may not be there yet because ordered * extents haven't completed yet. * - * We also do a try lock because otherwise we could deadlock. This is - * because the shrinker for this filesystem may be invoked while we are - * in a path that is holding the mmap lock in write mode. For example in - * a reflink operation while COWing an extent buffer, when allocating - * pages for a new extent buffer and under memory pressure, the shrinker - * may be invoked, and therefore we would deadlock by attempting to read - * lock the mmap lock while we are holding already a write lock on it. + * We also do a try lock because we don't want to block for too long and + * we are holding the extent map tree's lock in write mode. */ if (!down_read_trylock(&inode->i_mmap_lock)) return 0; - /* - * We want to be fast so if the lock is busy we don't want to spend time - * waiting for it - either some task is about to do IO for the inode or - * we may have another task shrinking extent maps, here in this code, so - * skip this inode. - */ - if (!write_trylock(&tree->lock)) { - up_read(&inode->i_mmap_lock); - return 0; - } - node = rb_first(&tree->root); while (node) { struct rb_node *next = rb_next(node); @@ -1201,12 +1187,60 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c break; node = next; } - write_unlock(&tree->lock); up_read(&inode->i_mmap_lock); return nr_dropped; } +static struct btrfs_inode *find_first_inode(struct btrfs_root *root, u64 min_ino) +{ + struct btrfs_inode *inode; + unsigned long from = min_ino; + + xa_lock(&root->inodes); + while (true) { + struct extent_map_tree *tree; + + inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT); + if (!inode) + break; + + tree = &inode->extent_tree; + + /* + * We want to be fast so if the lock is busy we don't want to + * spend time waiting for it (some task is about to do IO for + * the inode). + */ + if (!write_trylock(&tree->lock)) + goto next; + + /* + * Skip inode if it doesn't have loaded extent maps, so we avoid + * getting a reference and doing an iput later. This includes + * cases like files that were opened for things like stat(2), or + * files with all extent maps previously released through the + * release folio callback (btrfs_release_folio()) or released in + * a previous run, or directories which never have extent maps. + */ + if (RB_EMPTY_ROOT(&tree->root)) { + write_unlock(&tree->lock); + goto next; + } + + if (igrab(&inode->vfs_inode)) + break; + + write_unlock(&tree->lock); +next: + from = btrfs_ino(inode) + 1; + cond_resched_lock(&root->inodes.xa_lock); + } + xa_unlock(&root->inodes); + + return inode; +} + static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx *ctx) { struct btrfs_fs_info *fs_info = root->fs_info; @@ -1214,9 +1248,10 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx long nr_dropped = 0; u64 min_ino = fs_info->em_shrinker_last_ino + 1; - inode = btrfs_find_first_inode(root, min_ino); + inode = find_first_inode(root, min_ino); while (inode) { nr_dropped += btrfs_scan_inode(inode, ctx); + write_unlock(&inode->extent_tree.lock); min_ino = btrfs_ino(inode) + 1; fs_info->em_shrinker_last_ino = btrfs_ino(inode); @@ -1227,7 +1262,7 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx cond_resched(); - inode = btrfs_find_first_inode(root, min_ino); + inode = find_first_inode(root, min_ino); } if (inode) { -- 2.45.2 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/3] btrfs: skip inodes without loaded extent maps when shrinking extent maps 2025-02-16 13:16 ` [PATCH 2/3] btrfs: skip inodes without loaded extent maps when shrinking extent maps fdmanana @ 2025-02-18 6:18 ` Johannes Thumshirn 2025-02-18 10:19 ` Filipe Manana 0 siblings, 1 reply; 9+ messages in thread From: Johannes Thumshirn @ 2025-02-18 6:18 UTC (permalink / raw) To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org On 16.02.25 14:16, fdmanana@kernel.org wrote: > +static struct btrfs_inode *find_first_inode(struct btrfs_root *root, u64 min_ino) Can we call it something like find_first_inode_for_shrinker() or sth like that? It is very similar to btrfs_find_first_inode() but not the similar enough to be the subset of it and I find that quite confusing. Otherwise Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/3] btrfs: skip inodes without loaded extent maps when shrinking extent maps 2025-02-18 6:18 ` Johannes Thumshirn @ 2025-02-18 10:19 ` Filipe Manana 0 siblings, 0 replies; 9+ messages in thread From: Filipe Manana @ 2025-02-18 10:19 UTC (permalink / raw) To: Johannes Thumshirn; +Cc: linux-btrfs@vger.kernel.org On Tue, Feb 18, 2025 at 6:18 AM Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote: > > On 16.02.25 14:16, fdmanana@kernel.org wrote: > > +static struct btrfs_inode *find_first_inode(struct btrfs_root *root, u64 min_ino) > > Can we call it something like find_first_inode_for_shrinker() or sth > like that? Sure, I'll rename it to find_first_inode_to_shrink() at commit time then. Thanks. > > It is very similar to btrfs_find_first_inode() but not the similar > enough to be the subset of it and I find that quite confusing. > > Otherwise > Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 3/3] btrfs: do regular iput instead of delayed iput during extent map shrinking 2025-02-16 13:16 [PATCH 0/3] btrfs: some fixes related to the extent map shrinker fdmanana 2025-02-16 13:16 ` [PATCH 1/3] btrfs: fix use-after-free on inode when scanning root during em shrinking fdmanana 2025-02-16 13:16 ` [PATCH 2/3] btrfs: skip inodes without loaded extent maps when shrinking extent maps fdmanana @ 2025-02-16 13:16 ` fdmanana 2025-02-18 6:18 ` Johannes Thumshirn 2025-02-17 21:56 ` [PATCH 0/3] btrfs: some fixes related to the extent map shrinker Qu Wenruo 3 siblings, 1 reply; 9+ messages in thread From: fdmanana @ 2025-02-16 13:16 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> The extent map shrinker now runs in the system unbound workqueue and no longer in kswapd context so it can directly do an iput() on inodes even if that blocks or needs to acquire any lock (we aren't holding any locks when requesting the delayed iput from the shrinker). So we don't need to add a delayed iput, wake up the cleaner and delegate the iput() to the cleaner, which also adds extra contention on the spinlock that protects the delayed iputs list. Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.12+ Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/extent_map.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 820aef514238..3b05a2d27704 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -1255,7 +1255,7 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx min_ino = btrfs_ino(inode) + 1; fs_info->em_shrinker_last_ino = btrfs_ino(inode); - btrfs_add_delayed_iput(inode); + iput(&inode->vfs_inode); if (ctx->scanned >= ctx->nr_to_scan || btrfs_fs_closing(fs_info)) break; -- 2.45.2 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] btrfs: do regular iput instead of delayed iput during extent map shrinking 2025-02-16 13:16 ` [PATCH 3/3] btrfs: do regular iput instead of delayed iput during extent map shrinking fdmanana @ 2025-02-18 6:18 ` Johannes Thumshirn 0 siblings, 0 replies; 9+ messages in thread From: Johannes Thumshirn @ 2025-02-18 6:18 UTC (permalink / raw) To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org Looks good, Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/3] btrfs: some fixes related to the extent map shrinker 2025-02-16 13:16 [PATCH 0/3] btrfs: some fixes related to the extent map shrinker fdmanana ` (2 preceding siblings ...) 2025-02-16 13:16 ` [PATCH 3/3] btrfs: do regular iput instead of delayed iput during extent map shrinking fdmanana @ 2025-02-17 21:56 ` Qu Wenruo 3 siblings, 0 replies; 9+ messages in thread From: Qu Wenruo @ 2025-02-17 21:56 UTC (permalink / raw) To: fdmanana, linux-btrfs 在 2025/2/16 23:46, fdmanana@kernel.org 写道: > From: Filipe Manana <fdmanana@suse.com> > > A few fixes and improvements for the extent map shrinker. The first is a > use-after-free that is likely hard to hit and the other two fixes make > things more efficient by avoiding grabbing and dropping (iput) inodes > which don't have extent maps loaded as well as eleminating the need to > do delayed iputs, as that's not needed anymore. > > These are related to a recent report from Ivan Shapovalov where the > cleaner kthread was using over 50% of CPU doing a lot of delayed iputs: > > https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ > > More details in the change logs. Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > > Filipe Manana (3): > btrfs: fix use-after-free on inode when scanning root during em shrinking > btrfs: skip inodes without loaded extent maps when shrinking extent maps > btrfs: do regular iput instead of delayed iput during extent map shrinking > > fs/btrfs/extent_map.c | 82 ++++++++++++++++++++++++++++++------------- > 1 file changed, 58 insertions(+), 24 deletions(-) > ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-02-18 10:20 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-02-16 13:16 [PATCH 0/3] btrfs: some fixes related to the extent map shrinker fdmanana 2025-02-16 13:16 ` [PATCH 1/3] btrfs: fix use-after-free on inode when scanning root during em shrinking fdmanana 2025-02-17 17:47 ` Johannes Thumshirn 2025-02-16 13:16 ` [PATCH 2/3] btrfs: skip inodes without loaded extent maps when shrinking extent maps fdmanana 2025-02-18 6:18 ` Johannes Thumshirn 2025-02-18 10:19 ` Filipe Manana 2025-02-16 13:16 ` [PATCH 3/3] btrfs: do regular iput instead of delayed iput during extent map shrinking fdmanana 2025-02-18 6:18 ` Johannes Thumshirn 2025-02-17 21:56 ` [PATCH 0/3] btrfs: some fixes related to the extent map shrinker Qu Wenruo
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.