* [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
@ 2024-07-12 6:27 Zhihao Cheng
2024-07-12 14:37 ` Theodore Ts'o
2024-08-05 1:29 ` Zhihao Cheng
0 siblings, 2 replies; 9+ messages in thread
From: Zhihao Cheng @ 2024-07-12 6:27 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, LKML, Ted Tso, Jan Kara,
Christoph Hellwig, linux-mtd, Richard Weinberger, zhangyi (F),
yangerkun, wangzhaolong (A)
Hi. Recently, we found a deadlock in the inode reclaiming process caused by
a circular dependency between a file inode and the file's xattr inode.
Problem description
===================
The inode reclaiming process (see prune_icache_sb()) first collects all
reclaimable inodes and marks them with the I_FREEING flag; at that
point, other processes will be stuck if they try to get these inodes
(see find_inode_fast()), until the reclaiming process destroys the
inodes via dispose_list().
Some filesystems (e.g. ext4 with the ea_inode feature, ubifs with
xattrs) may do an inode lookup in the inode-evicting callback function;
if the lookup is performed under the inode lru traversing context,
deadlock problems may happen.
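For reference, here is a simplified sketch of the lookup side
(paraphrased from fs/inode.c from memory, not verbatim; details differ
across kernel versions), showing why a lookup of an inode that is
already marked I_FREEING blocks until dispose_list() evicts it:

	/* Paraphrased sketch of fs/inode.c internals, not verbatim mainline code. */
	static struct inode *find_inode_fast(struct super_block *sb,
					     struct hlist_head *head, unsigned long ino)
	{
		struct inode *inode;
	repeat:
		hlist_for_each_entry(inode, head, i_hash) {
			if (inode->i_ino != ino || inode->i_sb != sb)
				continue;
			spin_lock(&inode->i_lock);
			if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
				/*
				 * Sleeps until evict() (called from dispose_list())
				 * wakes the waiters on i_state, then retries.
				 */
				__wait_on_freeing_inode(inode);
				goto repeat;
			}
			__iget(inode);
			spin_unlock(&inode->i_lock);
			return inode;
		}
		return NULL;	/* not cached */
	}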
Case 1: In ext4_evict_inode(), an ea_inode lookup can happen if the
ea_inode feature is enabled; the lookup process will get stuck under
the evicting context like this:
1. File A has inode i_reg and an ea inode i_ea.
2. getfattr(A, xattr_buf) // i_ea is added into the lru  // lru->i_ea
3. Then, the following two processes run like this:
      PA                                           PB
echo 2 > /proc/sys/vm/drop_caches
 shrink_slab
  prune_dcache_sb
  // i_reg is added into lru, lru->i_ea->i_reg
  prune_icache_sb
   list_lru_walk_one
    inode_lru_isolate
     i_ea->i_state |= I_FREEING // set inode state
    inode_lru_isolate
     __iget(i_reg)
     spin_unlock(&i_reg->i_lock)
     spin_unlock(lru_lock)
                                              rm file A
                                               i_reg->nlink = 0
     iput(i_reg) // i_reg->nlink is 0, do evict
      ext4_evict_inode
       ext4_xattr_delete_inode
        ext4_xattr_inode_dec_ref_all
         ext4_xattr_inode_iget
          ext4_iget(i_ea->i_ino)
           iget_locked
            find_inode_fast
             __wait_on_freeing_inode(i_ea) ----→ AA deadlock
   dispose_list // cannot be executed by prune_icache_sb
    wake_up_bit(&i_ea->i_state)
Case 2: In the deleted-inode writing function ubifs_jnl_write_inode(),
the file-deleting process holds BASEHD's wbuf->io_mutex while getting
the xattr inode, which can race with the inode reclaiming process (the
reclaiming process may try to lock BASEHD's wbuf->io_mutex in the
inode-evicting function), so an ABBA deadlock can happen as follows:
1. File A has inode ia and an xattr (with inode ixa); regular file B
   has inode ib and an xattr.
2. getfattr(A, xattr_buf) // ixa is added into the lru  // lru->ixa
3. Then, the following three processes run like this:
      PA                       PB                        PC
echo 2 > /proc/sys/vm/drop_caches
 shrink_slab
  prune_dcache_sb
  // ib and ia are added into lru, lru->ixa->ib->ia
  prune_icache_sb
   list_lru_walk_one
    inode_lru_isolate
     ixa->i_state |= I_FREEING // set inode state
    inode_lru_isolate
     __iget(ib)
     spin_unlock(&ib->i_lock)
     spin_unlock(lru_lock)
                           rm file B
                            ib->nlink = 0
                                              rm file A
                                               iput(ia)
                                                ubifs_evict_inode(ia)
                                                 ubifs_jnl_delete_inode(ia)
                                                  ubifs_jnl_write_inode(ia)
                                                   make_reservation(BASEHD) // Lock wbuf->io_mutex
                                                    ubifs_iget(ixa->i_ino)
                                                     iget_locked
                                                      find_inode_fast
                                                       __wait_on_freeing_inode(ixa)
   |  iput(ib) // ib->nlink is 0, do evict
   |   ubifs_evict_inode
   |    ubifs_jnl_delete_inode(ib)
   ↓     ubifs_jnl_write_inode
ABBA deadlock ←----- make_reservation(BASEHD)
   dispose_list // cannot be executed by prune_icache_sb
    wake_up_bit(&ixa->i_state)
Reproducer:
===========
https://bugzilla.kernel.org/show_bug.cgi?id=219022
About solutions
===============
We have considered some solutions, but each of them introduces new side effects.
Solution 1: Don't cache the xattr inode; make the drop_inode callback
return true for xattr inodes. This makes xattr reads slower.
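As an illustration, a rough sketch of what this could look like on the
ext4 side (a sketch only, assuming the EA-inode flag check is enough to
identify xattr inodes; not necessarily the exact change we measured, and
ubifs would need an analogous drop_inode callback):

	/* Sketch only; the patch we actually measured may differ. */
	static int ext4_drop_inode(struct inode *inode)
	{
		int drop = generic_drop_inode(inode);

		/* Evict EA inodes on final iput() instead of caching them on the LRU. */
		if (!drop && ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
			drop = 1;

		if (!drop)
			drop = fscrypt_drop_inode(inode);

		trace_ext4_drop_inode(inode, drop);
		return drop;
	}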
Test code:
	gettimeofday(&s, NULL);
	for (i = 0; i < 10000; ++i)
		if (getxattr("/root/temp/file_a", "user.a", buf, 4098) < 0)
			perror("getxattr");
	gettimeofday(&e, NULL);
	printf("cost %ld us\n",
	       1000000 * (e.tv_sec - s.tv_sec) + e.tv_usec - s.tv_usec);
Result:
ext4:
cost 151068 us // without fix
cost 321594 us // with fix
ubifs:
134125 us // without fix
371023 us // with fix
Solution 2: Don't put the xattr inode onto the lru. This is implemented
by holding a reference on the xattr inode until the file inode is
evicted and, in addition, making the drop_inode callback return true for
xattr inodes. This solution pins the xattr inode in memory until the
file inode is evicted, but the file inode won't be evicted while it
still has pagecache, specifically:
inode_lru_isolate
	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
		/*
		 * The file inode won't be evicted here, so its xattr inode
		 * cannot be reclaimed either, which increases memory noise.
		 */
		__iget(inode);
		if (remove_inode_buffers(inode))
			...
		iput(inode);
	}
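For illustration, a very rough sketch of the pinning part of this
solution (all names here are hypothetical, not existing ext4/ubifs code,
and a single pinned xattr inode per file is assumed to keep it short):

	/* Hypothetical sketch, not existing filesystem code. */
	struct my_inode_info {
		struct inode	vfs_inode;
		struct inode	*pinned_ea_inode;	/* holds one i_count reference */
	};

	/* Called right after looking up the xattr inode on behalf of the file inode. */
	static void my_pin_ea_inode(struct my_inode_info *fi, struct inode *ea_inode)
	{
		if (!fi->pinned_ea_inode)
			fi->pinned_ea_inode = igrab(ea_inode); /* keeps it off the lru */
	}

	/* Called from the filesystem's ->evict_inode() for the file inode. */
	static void my_unpin_ea_inode(struct my_inode_info *fi)
	{
		if (fi->pinned_ea_inode) {
			iput(fi->pinned_ea_inode);	/* drop_inode then evicts it */
			fi->pinned_ea_inode = NULL;
		}
	}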
Solution 3: Forbid inode evicting under the inode lru traversing
context. Specifically, mark the inode with 'I_WILL_FREE' instead of
taking a reference, to eliminate the chance of the inode being grabbed
in inode_lru_isolate(). This solution is wrong, because the pagecache
may still be alive after invalidate_mapping_pages(), so we cannot
destroy the file inode after clearing I_WILL_FREE.
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..c649be22f841 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -843,7 +844,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	 * be under pressure before the cache inside the highmem zone.
 	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-		__iget(inode);
+		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);
 		if (remove_inode_buffers(inode)) {
@@ -855,9 +856,9 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 			__count_vm_events(PGINODESTEAL, reap);
 			mm_account_reclaimed_pages(reap);
 		}
-		iput(inode);
+		spin_lock(&inode->i_lock);
+		inode->i_state &= ~I_WILL_FREE;
 		spin_lock(lru_lock);
-		return LRU_RETRY;
 	}
Solution 4: Break the circular dependency between the file inode and
the file's xattr inode. For example, after comparing
inode_lru_isolate() (prune_icache_sb) with invalidate_inodes(), we found
that invalidate_inodes() does not invoke invalidate_mapping_pages().
Can we directly remove the branch 'if (inode_has_buffers(inode) ||
!mapping_empty(&inode->i_data))' from inode_lru_isolate()?
Any better solutions?
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-12 6:27 [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs Zhihao Cheng
@ 2024-07-12 14:37 ` Theodore Ts'o
2024-07-13 2:29 ` Zhihao Cheng
` (2 more replies)
2024-08-05 1:29 ` Zhihao Cheng
1 sibling, 3 replies; 9+ messages in thread
From: Theodore Ts'o @ 2024-07-12 14:37 UTC (permalink / raw)
To: Zhihao Cheng
Cc: linux-fsdevel, linux-ext4, LKML, Jan Kara, Christoph Hellwig,
linux-mtd, Richard Weinberger, zhangyi (F), yangerkun,
wangzhaolong (A)
On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
> Problem description
> ===================
>
> The inode reclaiming process (see prune_icache_sb()) first collects all
> reclaimable inodes and marks them with the I_FREEING flag; at that
> point, other processes will be stuck if they try to get these inodes
> (see find_inode_fast()), until the reclaiming process destroys the
> inodes via dispose_list().
> Some filesystems (e.g. ext4 with the ea_inode feature, ubifs with
> xattrs) may do an inode lookup in the inode-evicting callback function;
> if the lookup is performed under the inode lru traversing context,
> deadlock problems may happen.
>
> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
> if ea_inode feature is enabled, the lookup process will be stuck under
> the evicting context like this:
>
> 1. File A has inode i_reg and an ea inode i_ea
> 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
> 3. Then, the following two processes run like this:
>
> PA PB
> echo 2 > /proc/sys/vm/drop_caches
> shrink_slab
> prune_dcache_sb
> // i_reg is added into lru, lru->i_ea->i_reg
> prune_icache_sb
> list_lru_walk_one
> inode_lru_isolate
> i_ea->i_state |= I_FREEING // set inode state
Um, I don't see how this can happen. If the ea_inode is in use,
i_count will be greater than zero, and hence the inode will never
go down the rest of the path in inode_lru_isolate():
	if (atomic_read(&inode->i_count) ||
	    ...) {
		list_lru_isolate(lru, &inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}
Do you have an actual reproducer which triggers this? Or would this
happen to be, by any chance, something that was dreamed up with DEPT?
- Ted
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-12 14:37 ` Theodore Ts'o
@ 2024-07-13 2:29 ` Zhihao Cheng
2024-07-18 3:04 ` Ryder Wang
2024-07-18 13:40 ` Jan Kara
2 siblings, 0 replies; 9+ messages in thread
From: Zhihao Cheng @ 2024-07-13 2:29 UTC (permalink / raw)
To: Theodore Ts'o
Cc: linux-fsdevel, linux-ext4, LKML, Jan Kara, Christoph Hellwig,
linux-mtd, Richard Weinberger, zhangyi (F), yangerkun,
wangzhaolong (A)
On 2024/7/12 22:37, Theodore Ts'o wrote:
>> Problem description
>> ===================
>>
>> The inode reclaiming process (see prune_icache_sb()) first collects all
>> reclaimable inodes and marks them with the I_FREEING flag; at that
>> point, other processes will be stuck if they try to get these inodes
>> (see find_inode_fast()), until the reclaiming process destroys the
>> inodes via dispose_list().
>> Some filesystems (e.g. ext4 with the ea_inode feature, ubifs with
>> xattrs) may do an inode lookup in the inode-evicting callback function;
>> if the lookup is performed under the inode lru traversing context,
>> deadlock problems may happen.
>>
>> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
>> if ea_inode feature is enabled, the lookup process will be stuck under
>> the evicting context like this:
>>
>> 1. File A has inode i_reg and an ea inode i_ea
>> 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>> 3. Then, the following two processes run like this:
>>
>> PA PB
>> echo 2 > /proc/sys/vm/drop_caches
>> shrink_slab
>> prune_dcache_sb
>> // i_reg is added into lru, lru->i_ea->i_reg
>> prune_icache_sb
>> list_lru_walk_one
>> inode_lru_isolate
>> i_ea->i_state |= I_FREEING // set inode state
> Um, I don't see how this can happen. If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():
The refcount of the ea_inode can drop to zero before the file inode's,
according to the following process:

path_getxattr
 user_path_at(&path)   // get file dentry and file inode
 getxattr
  ext4_xattr_get
   ext4_xattr_ibody_get
    ext4_xattr_inode_get
     ext4_xattr_inode_iget(&ea_inode) // ea_inode->i_count = 1
     iput(ea_inode)                   // ea_inode->i_count = 0, put it into lru
 path_put(&path)       // put file dentry and file inode
>
> if (atomic_read(&inode->i_count) ||
> ...) {
> list_lru_isolate(lru, &inode->i_lru);
> spin_unlock(&inode->i_lock);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> }
>
> Do you have an actual reproducer which triggers this? Or would this
> happen to be, by any chance, something that was dreamed up with DEPT?
The reproducer is in the second half of the original mail, along with
some of the solutions we tried.
Reproducer:
===========
https://bugzilla.kernel.org/show_bug.cgi?id=219022
About solutions
===============
[...]
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-12 14:37 ` Theodore Ts'o
2024-07-13 2:29 ` Zhihao Cheng
@ 2024-07-18 3:04 ` Ryder Wang
2024-07-18 7:30 ` Zhihao Cheng
2024-07-18 13:40 ` Jan Kara
2 siblings, 1 reply; 9+ messages in thread
From: Ryder Wang @ 2024-07-18 3:04 UTC (permalink / raw)
To: Theodore Ts'o, Zhihao Cheng
Cc: linux-fsdevel, linux-ext4@vger.kernel.org, LKML, Jan Kara,
Christoph Hellwig, linux-mtd, Richard Weinberger, zhangyi (F),
yangerkun, wangzhaolong (A)
> Um, I don't see how this can happen. If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():
>
> if (atomic_read(&inode->i_count) ||
> ...) {
> list_lru_isolate(lru, &inode->i_lru);
> spin_unlock(&inode->i_lock);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> }
Yes, in inode_lru_isolate() (the cache-clearing case) there is such an inode->i_state check to avoid double-removing an inode that is already being removed by another process. Unluckily, there is no similar inode->i_state check in iput_final() (the file-removal case), so double-removing an inode can still happen.
It looks like we need to add an inode->i_state check in iput_final() if we want to fix this race condition bug.
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-18 3:04 ` Ryder Wang
@ 2024-07-18 7:30 ` Zhihao Cheng
0 siblings, 0 replies; 9+ messages in thread
From: Zhihao Cheng @ 2024-07-18 7:30 UTC (permalink / raw)
To: Ryder Wang, Theodore Ts'o, Zhihao Cheng
Cc: linux-fsdevel, linux-ext4@vger.kernel.org, LKML, Jan Kara,
Christoph Hellwig, linux-mtd, Richard Weinberger, zhangyi (F),
yangerkun, wangzhaolong (A)
On 2024/7/18 11:04, Ryder Wang wrote:
Hi, Ryder
>> Um, I don't see how this can happen. If the ea_inode is in use,
>> i_count will be greater than zero, and hence the inode will never
>> go down the rest of the path in inode_lru_isolate():
>>
>> if (atomic_read(&inode->i_count) ||
>> ...) {
>> list_lru_isolate(lru, &inode->i_lru);
>> spin_unlock(&inode->i_lock);
>> this_cpu_dec(nr_unused);
>> return LRU_REMOVED;
>> }
>
> Yes, in the function inode_lru_inode (in case of clearing cache), there has been such inode->i_state check mechanism to avoid double-removing the inode which is being removed by another process. Unluckily, no such similar inode->i_state check mechanism in the function iput_final (in case of removing file), so double-removing inode can still appear.
I'm a little confused about the inode double-removing you describe. Can
you provide a detailed case showing how the double-removing happens? I
can't see the relationship between inode double-removing and the problem
I described.
>
> It looks like we need to add an inode->i_state check in iput_final() if we want to fix this race condition bug.
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-12 14:37 ` Theodore Ts'o
2024-07-13 2:29 ` Zhihao Cheng
2024-07-18 3:04 ` Ryder Wang
@ 2024-07-18 13:40 ` Jan Kara
2024-07-19 3:21 ` Zhihao Cheng
2 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2024-07-18 13:40 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Zhihao Cheng, linux-fsdevel, linux-ext4, LKML, Jan Kara,
Christoph Hellwig, linux-mtd, Richard Weinberger, zhangyi (F),
yangerkun, wangzhaolong (A)
On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
> On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
> > Problem description
> > ===================
> >
> > The inode reclaiming process (see prune_icache_sb()) first collects all
> > reclaimable inodes and marks them with the I_FREEING flag; at that
> > point, other processes will be stuck if they try to get these inodes
> > (see find_inode_fast()), until the reclaiming process destroys the
> > inodes via dispose_list().
> > Some filesystems (e.g. ext4 with the ea_inode feature, ubifs with
> > xattrs) may do an inode lookup in the inode-evicting callback function;
> > if the lookup is performed under the inode lru traversing context,
> > deadlock problems may happen.
> >
> > Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
> > if ea_inode feature is enabled, the lookup process will be stuck under
> > the evicting context like this:
> >
> > 1. File A has inode i_reg and an ea inode i_ea
> > 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
> > 3. Then, the following two processes run like this:
> >
> > PA PB
> > echo 2 > /proc/sys/vm/drop_caches
> > shrink_slab
> > prune_dcache_sb
> > // i_reg is added into lru, lru->i_ea->i_reg
> > prune_icache_sb
> > list_lru_walk_one
> > inode_lru_isolate
> > i_ea->i_state |= I_FREEING // set inode state
>
> Um, I don't see how this can happen. If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():
>
> if (atomic_read(&inode->i_count) ||
> ...) {
> list_lru_isolate(lru, &inode->i_lru);
> spin_unlock(&inode->i_lock);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> }
>
> Do you have an actual reproducer which triggers this? Or would this
> happen to be, by any chance, something that was dreamed up with DEPT?
No, it looks like a real problem and I agree with the analysis. We don't
hold an ea_inode reference (i.e., ea_inode->i_count) from a normal inode.
The normal inode just owns the special on-disk xattr reference. Standard
inode references are acquired and dropped as needed.
And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
evict() needs to lookup the ea_inode and iget() it. So if we are processing
a list of inodes to dispose, all inodes have I_FREEING bit already set and
if ea_inode and its parent normal inode are both in the list, then the
evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.
Normally we don't hit this path because LRU list walk is not handling
inodes with 0 link count. But a race with unlink can make that happen with
iput() from inode_lru_isolate().
I'm pondering about the best way to fix this. Maybe we could handle the
need for inode pinning in inode_lru_isolate() in a similar way as in
writeback code so that last iput() cannot happen from inode_lru_isolate().
In writeback we use I_SYNC flag to pin the inode and evict() waits for this
flag to clear. I'll probably sleep on it and if I don't find it too
disgusting tomorrow, I can code it.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-18 13:40 ` Jan Kara
@ 2024-07-19 3:21 ` Zhihao Cheng
2024-07-20 10:42 ` Mateusz Guzik
0 siblings, 1 reply; 9+ messages in thread
From: Zhihao Cheng @ 2024-07-19 3:21 UTC (permalink / raw)
To: Jan Kara, Theodore Ts'o
Cc: linux-fsdevel, linux-ext4, LKML, Christoph Hellwig, linux-mtd,
Richard Weinberger, zhangyi (F), yangerkun, wangzhaolong (A)
Hi, Jan
On 2024/7/18 21:40, Jan Kara wrote:
> On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
>> On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
>>> Problem description
>>> ===================
>>>
>>> The inode reclaiming process (see prune_icache_sb()) first collects all
>>> reclaimable inodes and marks them with the I_FREEING flag; at that
>>> point, other processes will be stuck if they try to get these inodes
>>> (see find_inode_fast()), until the reclaiming process destroys the
>>> inodes via dispose_list().
>>> Some filesystems (e.g. ext4 with the ea_inode feature, ubifs with
>>> xattrs) may do an inode lookup in the inode-evicting callback function;
>>> if the lookup is performed under the inode lru traversing context,
>>> deadlock problems may happen.
>>>
>>> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
>>> if ea_inode feature is enabled, the lookup process will be stuck under
>>> the evicting context like this:
>>>
>>> 1. File A has inode i_reg and an ea inode i_ea
>>> 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>>> 3. Then, the following two processes run like this:
>>>
>>> PA PB
>>> echo 2 > /proc/sys/vm/drop_caches
>>> shrink_slab
>>> prune_dcache_sb
>>> // i_reg is added into lru, lru->i_ea->i_reg
>>> prune_icache_sb
>>> list_lru_walk_one
>>> inode_lru_isolate
>>> i_ea->i_state |= I_FREEING // set inode state
>>
>> Um, I don't see how this can happen. If the ea_inode is in use,
>> i_count will be greater than zero, and hence the inode will never
>> go down the rest of the path in inode_lru_isolate():
>>
>> if (atomic_read(&inode->i_count) ||
>> ...) {
>> list_lru_isolate(lru, &inode->i_lru);
>> spin_unlock(&inode->i_lock);
>> this_cpu_dec(nr_unused);
>> return LRU_REMOVED;
>> }
>>
>> Do you have an actual reproducer which triggers this? Or would this
>> happen to be, by any chance, something that was dreamed up with DEPT?
>
> No, it looks like a real problem and I agree with the analysis. We don't
> hold ea_inode reference (i.e., ea_inode->i_count) from a normal inode. The
> normal inode just owns the special on-disk xattr reference. Standard
> inode references are acquired and dropped as needed.
>
> And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
> evict() needs to lookup the ea_inode and iget() it. So if we are processing
> a list of inodes to dispose, all inodes have I_FREEING bit already set and
> if ea_inode and its parent normal inode are both in the list, then the
> evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.
Yes, absolutely right.
>
> Normally we don't hit this path because LRU list walk is not handling
> inodes with 0 link count. But a race with unlink can make that happen with
> iput() from inode_lru_isolate().
Another reason is that mapping_empty(&inode->i_data) is consistent with
mapping_shrinkable(&inode->i_data) in most cases (CONFIG_HIGHMEM is
disabled by default on 64-bit platforms, so mapping_shrinkable() hardly
ever returns true when a file inode's mapping has page cache), while the
problem path requires mapping_shrinkable() to return true and
mapping_empty() to return false.
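For reference, a paraphrase of mapping_shrinkable() (from
include/linux/pagemap.h, quoted from memory; exact code may differ by
kernel version):

	static inline bool mapping_shrinkable(struct address_space *mapping)
	{
		void *head;

		if (IS_ENABLED(CONFIG_HIGHMEM))
			return true;	/* always worth reclaiming the inode */

		head = rcu_access_pointer(mapping->i_pages.xa_head);
		if (!head)
			return true;	/* nothing cached */

		if (!xa_is_node(head) && xa_is_value(head))
			return true;	/* only a single shadow entry */

		return false;		/* real page cache present: keep the inode */
	}

So without CONFIG_HIGHMEM, an inode with real page cache is normally
skipped by the earlier !mapping_shrinkable() check and rarely reaches
the __iget()/iput() branch, which is why the race window is so narrow.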
Do we have any other method to replace the following if-branch without
invoking __iget()?
	/*
	 * On highmem systems, mapping_shrinkable() permits dropping
	 * page cache in order to free up struct inodes: lowmem might
	 * be under pressure before the cache inside the highmem zone.
	 */
	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
		__iget(inode);
		...
		iput(inode);
		spin_lock(lru_lock);
		return LRU_RETRY;
	}
>
> I'm pondering about the best way to fix this. Maybe we could handle the
> need for inode pinning in inode_lru_isolate() in a similar way as in
> writeback code so that last iput() cannot happen from inode_lru_isolate().
> In writeback we use I_SYNC flag to pin the inode and evict() waits for this
> flag to clear. I'll probably sleep on it and if I don't find it too
> disgusting tomorrow, I can code it.
>
I guess that you may modify like this:
diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..5b1a9b23f53f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);

 static void __inode_add_lru(struct inode *inode, bool rotate)
 {
-	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
+	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PINING))
 		return;
 	if (atomic_read(&inode->i_count))
 		return;
@@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	 * be under pressure before the cache inside the highmem zone.
 	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-		__iget(inode);
+		inode->i_state |= I_PINING;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);
 		if (remove_inode_buffers(inode)) {
@@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 			__count_vm_events(PGINODESTEAL, reap);
 			mm_account_reclaimed_pages(reap);
 		}
-		iput(inode);
+		spin_lock(&inode->i_lock);
+		inode->i_state &= ~I_PINING;
+		wake_up_bit(&inode->i_state, __I_PINING);
+		spin_unlock(&inode->i_lock);
 		spin_lock(lru_lock);
 		return LRU_RETRY;
 	}
@@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
 		return;
 	}

+	inode_wait_for_pining(inode);
 	state = inode->i_state;
 	if (!drop) {
 		WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..daf094fff5fe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DONTCACHE		(1 << 16)
 #define I_SYNC_QUEUED		(1 << 17)
 #define I_PINNING_NETFS_WB	(1 << 18)
+#define __I_PINING		19
+#define I_PINING		(1 << __I_PINING)

 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
, which means that we would introduce a new inode state to solve the problem.
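The diff above assumes a helper inode_wait_for_pining() that is not
shown. A possible sketch, modeled on how __inode_wait_for_writeback()
waits for I_SYNC (hypothetical, not a tested patch); like the call site
in iput_final(), it assumes inode->i_lock is held:

	/* Hypothetical helper for the diff above; called with inode->i_lock held. */
	static void inode_wait_for_pining(struct inode *inode)
	{
		DEFINE_WAIT_BIT(wq, &inode->i_state, __I_PINING);
		wait_queue_head_t *wqh;

		wqh = bit_waitqueue(&inode->i_state, __I_PINING);
		while (inode->i_state & I_PINING) {
			spin_unlock(&inode->i_lock);
			__wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
			spin_lock(&inode->i_lock);
		}
	}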
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-19 3:21 ` Zhihao Cheng
@ 2024-07-20 10:42 ` Mateusz Guzik
0 siblings, 0 replies; 9+ messages in thread
From: Mateusz Guzik @ 2024-07-20 10:42 UTC (permalink / raw)
To: Zhihao Cheng
Cc: Jan Kara, Theodore Ts'o, linux-fsdevel, linux-ext4, LKML,
Christoph Hellwig, linux-mtd, Richard Weinberger, zhangyi (F),
yangerkun, wangzhaolong (A)
On Fri, Jul 19, 2024 at 11:21:51AM +0800, Zhihao Cheng wrote:
> On 2024/7/18 21:40, Jan Kara wrote:
> > I'm pondering about the best way to fix this. Maybe we could handle the
> > need for inode pinning in inode_lru_isolate() in a similar way as in
> > writeback code so that last iput() cannot happen from inode_lru_isolate().
> > In writeback we use I_SYNC flag to pin the inode and evict() waits for this
> > flag to clear. I'll probably sleep on it and if I don't find it too
> > disgusting tomorrow, I can code it.
> >
>
> I guess that you may modify like this:
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..5b1a9b23f53f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);
>
> static void __inode_add_lru(struct inode *inode, bool rotate)
> {
> - if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING |
> I_WILL_FREE))
> + if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE
> | I_PINING))
> return;
> if (atomic_read(&inode->i_count))
> return;
> @@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct
> list_head *item,
> * be under pressure before the cache inside the highmem zone.
> */
> if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> - __iget(inode);
> + inode->i_state |= I_PINING;
> spin_unlock(&inode->i_lock);
> spin_unlock(lru_lock);
> if (remove_inode_buffers(inode)) {
> @@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct
> list_head *item,
> __count_vm_events(PGINODESTEAL, reap);
> mm_account_reclaimed_pages(reap);
> }
> - iput(inode);
> + spin_lock(&inode->i_lock);
> + inode->i_state &= ~I_PINING;
> + wake_up_bit(&inode->i_state, __I_PINING);
> + spin_unlock(&inode->i_lock);
> spin_lock(lru_lock);
> return LRU_RETRY;
> }
> @@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
> return;
> }
>
> + inode_wait_for_pining(inode);
> state = inode->i_state;
> if (!drop) {
> WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..daf094fff5fe 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb,
> struct kiocb *kiocb_src,
> #define I_DONTCACHE (1 << 16)
> #define I_SYNC_QUEUED (1 << 17)
> #define I_PINNING_NETFS_WB (1 << 18)
> +#define __I_PINING 19
> +#define I_PINING (1 << __I_PINING)
>
> #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
>
> , which means that we will import a new inode state to solve the problem.
>
My non-maintainer $0,03 is as follows:
1. I_PINING is too generic of a name. I_LRU_PINNED or something else
indicating what this is for would be prudent.
2. While not specific to this patch, the handling of i_state is too
accidental-breakage friendly. A full-blown solution is way out of scope
here, but something can be done to future-proof this work anyway.
To that end I would suggest:
1. inode_lru_pin(), which apart from setting the flag includes:
   BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE))
2. inode_lru_unpin(), which apart from unsetting the flag + wakeup includes:
   BUG_ON(!(inode->i_state & I_LRU_PINNED))
3. inode_lru_wait_for_pinned()
However, a non-cosmetic remark is that at the spot where
inode_wait_for_pining() gets invoked, none of the pinning-blocking flags
may be set (to my reading anyway). This is not the end of the world, but
it does mean the waiting routine will have to check stuff in a loop.
Names are not that important, the key is to keep the logic and
dependencies close by code-wise.
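A rough sketch of what the first two helpers could look like
(hypothetical code following the naming above; I_LRU_PINNED and
__I_LRU_PINNED would just be the renamed bit from the earlier diff):

	/* Hypothetical helpers, not merged code; names follow the suggestion above. */
	static void inode_lru_pin(struct inode *inode)
	{
		lockdep_assert_held(&inode->i_lock);
		/* None of the freeing/pinned states may be set when we pin. */
		BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE));
		inode->i_state |= I_LRU_PINNED;
	}

	static void inode_lru_unpin(struct inode *inode)
	{
		lockdep_assert_held(&inode->i_lock);
		BUG_ON(!(inode->i_state & I_LRU_PINNED));
		inode->i_state &= ~I_LRU_PINNED;
		/* Wake anyone sleeping in inode_lru_wait_for_pinned(). */
		wake_up_bit(&inode->i_state, __I_LRU_PINNED);
	}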
* Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs
2024-07-12 6:27 [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs Zhihao Cheng
2024-07-12 14:37 ` Theodore Ts'o
@ 2024-08-05 1:29 ` Zhihao Cheng
1 sibling, 0 replies; 9+ messages in thread
From: Zhihao Cheng @ 2024-08-05 1:29 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, LKML, Ted Tso, Jan Kara,
Christoph Hellwig, linux-mtd, Richard Weinberger, zhangyi (F),
yangerkun, wangzhaolong (A), mjguzik,
Ryder Wang <rydercoding@hotmail.com>
Hi, based on the ideas from Jan and Mateusz, I sent a fix patch, see
https://lore.kernel.org/linux-fsdevel/20240805013446.814357-1-chengzhihao@huaweicloud.com/T/#u
On 2024/7/12 14:27, Zhihao Cheng wrote:
> Hi. Recently, we found a deadlock in the inode reclaiming process caused
> by a circular dependency between a file inode and the file's xattr inode.
>
> Problem description
> ===================
>
> The inode reclaiming process (see prune_icache_sb()) first collects all
> reclaimable inodes and marks them with the I_FREEING flag; at that
> point, other processes will be stuck if they try to get these inodes
> (see find_inode_fast()), until the reclaiming process destroys the
> inodes via dispose_list().
> Some filesystems (e.g. ext4 with the ea_inode feature, ubifs with
> xattrs) may do an inode lookup in the inode-evicting callback function;
> if the lookup is performed under the inode lru traversing context,
> deadlock problems may happen.
>