* About the conflict between XFS inode recycle and VFS rcu-walk
@ 2023-12-05 11:38 alexjlzheng
2023-12-08 0:14 ` Dave Chinner
0 siblings, 1 reply; 19+ messages in thread
From: alexjlzheng @ 2023-12-05 11:38 UTC (permalink / raw)
To: djwong, bfoster, david, linux-xfs, raven, rcu, linux-fsdevel
Hi, all
I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
which can lead to null pointer references has been resolved?
I browsed through emails about the following patches and their discussions:
- https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
- https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
- https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
And then came to the conclusion that this problem has not been solved, am I
right? Did I miss some patch that could solve this problem?
According to my understanding, the essence of this problem is that XFS reuses
the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
Are there any recommended workarounds until an elegant and efficient solution
can be proposed? After all, causing a crash is extremely unacceptable in a
production environment.
Thank you very much for your advice :)
Jinliang Zheng
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2023-12-05 11:38 About the conflict between XFS inode recycle and VFS rcu-walk alexjlzheng
@ 2023-12-08 0:14 ` Dave Chinner
2024-01-31 6:35 ` Jinliang Zheng
0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2023-12-08 0:14 UTC (permalink / raw)
To: alexjlzheng; +Cc: djwong, bfoster, linux-xfs, raven, rcu, linux-fsdevel
On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
> Hi, all
>
> I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
> which can lead to null pointer references has been resolved?
>
> I browsed through emails about the following patches and their discussions:
> - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
>
> And then came to the conclusion that this problem has not been solved, am I
> right? Did I miss some patch that could solve this problem?
We fixed the known problems this caused by turning off the VFS
functionality that the rcu pathwalks kept tripping over. See commit
7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
the vfs").
Apart from that issue, I'm not aware of any other issues that the
XFS inode recycling directly exposes.
> According to my understanding, the essence of this problem is that XFS reuses
> the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
It assumes that the inode will not change identity during the RCU
grace period after the inode has been evicted from cache. We can
safely reinstantiate an evicted inode without waiting for an RCU
grace period as long as it is the same inode with the same content
and same state.
Problems *may* arise when we unlink the inode, then evict it, then a
new file is created and the old slab cache memory address is used
for the new inode. I describe the issue here:
https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
That said, we have exactly zero evidence that this is actually a
problem in production systems. We did get systems tripping over the
symlink issue, but there's no evidence that the
unlink->close->open(O_CREAT) issues are manifesting in the wild and
hence there hasn't been any particular urgency to address it.
> Are there any recommended workarounds until an elegant and efficient solution
> can be proposed? After all, causing a crash is extremely unacceptable in a
> production environment.
What crashes are you seeing in your production environment?
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2023-12-08 0:14 ` Dave Chinner
@ 2024-01-31 6:35 ` Jinliang Zheng
2024-01-31 19:30 ` Darrick J. Wong
0 siblings, 1 reply; 19+ messages in thread
From: Jinliang Zheng @ 2024-01-31 6:35 UTC (permalink / raw)
To: david; +Cc: alexjlzheng, bfoster, djwong, linux-fsdevel, linux-xfs, raven,
rcu
On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
> > Hi, all
> >
> > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
> > which can lead to null pointer references has been resolved?
> >
> > I browsed through emails about the following patches and their discussions:
> > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> >
> > And then came to the conclusion that this problem has not been solved, am I
> > right? Did I miss some patch that could solve this problem?
>
> We fixed the known problems this caused by turning off the VFS
> functionality that the rcu pathwalks kept tripping over. See commit
> 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
> the vfs").
Sorry for the delay.
The problem I encountered in the production environment was that during the
rcu walk process the ->get_link() pointer was NULL, which caused a crash.
As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink
metadata buffers to the vfs") first appeared in:
- https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
Does this commit solve the problem of NULL ->get_link()? And how?
>
> Apart from that issue, I'm not aware of any other issues that the
> XFS inode recycling directly exposes.
>
> > According to my understanding, the essence of this problem is that XFS reuses
> > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
>
> It assumes that the inode will not change identity during the RCU
> grace period after the inode has been evicted from cache. We can
> safely reinstantiate an evicted inode without waiting for an RCU
> grace period as long as it is the same inode with the same content
> and same state.
>
> Problems *may* arise when we unlink the inode, then evict it, then a
> new file is created and the old slab cache memory address is used
> for the new inode. I describe the issue here:
>
> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
And judging from the relevant emails, the main reason why ->get_link() is set
to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when
the inode is chosen to be reused.
However, perhaps due to performance reasons, this solution has not been merged
for a long time. How is it now?
Maybe I am missing something in the threads of mail?
Thank you very much. :)
Jinliang Zheng
>
> That said, we have exactly zero evidence that this is actually a
> problem in production systems. We did get systems tripping over the
> symlink issue, but there's no evidence that the
> unlink->close->open(O_CREAT) issues are manifesting in the wild and
> hence there hasn't been any particular urgency to address it.
>
> > Are there any recommended workarounds until an elegant and efficient solution
> > can be proposed? After all, causing a crash is extremely unacceptable in a
> > production environment.
>
> What crashes are you seeing in your production environment?
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-01-31 6:35 ` Jinliang Zheng
@ 2024-01-31 19:30 ` Darrick J. Wong
2024-05-15 15:54 ` alexjlzheng
0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2024-01-31 19:30 UTC (permalink / raw)
To: Jinliang Zheng; +Cc: david, bfoster, linux-fsdevel, linux-xfs, raven, rcu
On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
> > > Hi, all
> > >
> > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
> > > which can lead to null pointer references has been resolved?
> > >
> > > I browsed through emails about the following patches and their discussions:
> > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> > >
> > > And then came to the conclusion that this problem has not been solved, am I
> > > right? Did I miss some patch that could solve this problem?
> >
> > We fixed the known problems this caused by turning off the VFS
> > functionality that the rcu pathwalks kept tripping over. See commit
> > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
> > the vfs").
>
> Sorry for the delay.
>
> The problem I encountered in the production environment was that during the
> rcu walk process the ->get_link() pointer was NULL, which caused a crash.
>
> As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink
> metadata buffers to the vfs") first appeared in:
> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
>
> Does this commit solve the problem of NULL ->get_link()? And how?
I suggest reading the call stack from wherever the VFS enters the XFS
readlink code. If you have a reliable reproducer, then apply this patch
to your kernel (you haven't mentioned which one it is) and see if the
bad dereference goes away.
--D
> >
> > Apart from that issue, I'm not aware of any other issues that the
> > XFS inode recycling directly exposes.
> >
> > > According to my understanding, the essence of this problem is that XFS reuses
> > > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
> >
> > It assumes that the inode will not change identity during the RCU
> > grace period after the inode has been evicted from cache. We can
> > safely reinstantiate an evicted inode without waiting for an RCU
> > grace period as long as it is the same inode with the same content
> > and same state.
> >
> > Problems *may* arise when we unlink the inode, then evict it, then a
> > new file is created and the old slab cache memory address is used
> > for the new inode. I describe the issue here:
> >
> > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
>
> And judging from the relevant emails, the main reason why ->get_link() is set
> to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when
> the inode is chosen to be reused.
>
> However, perhaps due to performance reasons, this solution has not been merged
> for a long time. How is it now?
>
> Maybe I am missing something in the threads of mail?
>
> Thank you very much. :)
> Jinliang Zheng
>
> >
> > That said, we have exactly zero evidence that this is actually a
> > problem in production systems. We did get systems tripping over the
> > symlink issue, but there's no evidence that the
> > unlink->close->open(O_CREAT) issues are manifesting in the wild and
> > hence there hasn't been any particular urgency to address it.
> >
> > > Are there any recommended workarounds until an elegant and efficient solution
> > > can be proposed? After all, causing a crash is extremely unacceptable in a
> > > production environment.
> >
> > What crashes are you seeing in your production environment?
> >
> > -Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-01-31 19:30 ` Darrick J. Wong
@ 2024-05-15 15:54 ` alexjlzheng
2024-05-16 4:56 ` Jinliang Zheng
0 siblings, 1 reply; 19+ messages in thread
From: alexjlzheng @ 2024-05-15 15:54 UTC (permalink / raw)
To: djwong
Cc: alexjlzheng, bfoster, david, linux-fsdevel, linux-xfs, raven, rcu,
alexjlzheng
On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
> > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> > > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
> > > > Hi, all
> > > >
> > > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
> > > > which can lead to null pointer references has been resolved?
> > > >
> > > > I browsed through emails about the following patches and their discussions:
> > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> > > >
> > > > And then came to the conclusion that this problem has not been solved, am I
> > > > right? Did I miss some patch that could solve this problem?
> > >
> > > We fixed the known problems this caused by turning off the VFS
> > > functionality that the rcu pathwalks kept tripping over. See commit
> > > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
> > > the vfs").
> >
> > Sorry for the delay.
> >
> > The problem I encountered in the production environment was that during the
> > rcu walk process the ->get_link() pointer was NULL, which caused a crash.
> >
> > As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink
> > metadata buffers to the vfs") first appeared in:
> > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
> >
> > Does this commit solve the problem of NULL ->get_link()? And how?
>
> I suggest reading the call stack from wherever the VFS enters the XFS
> readlink code. If you have a reliable reproducer, then apply this patch
> to your kernel (you haven't mentioned which one it is) and see if the
> bad dereference goes away.
>
> --D
Sorry for the delay.
I encountered the following calltrace:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000
[20213.578785] #PF: supervisor instruction fetch in kernel mode
[20213.578799] #PF: error_code(0x0010) - not-present page
[20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
[20213.578828] Oops: 0010 [#1] SMP NOPTI
[20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
[20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
[20213.578884] RIP: 0010:0x0
[20213.578894] Code: Bad RIP value.
[20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
[20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000
[20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000
[20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380
[20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000
[20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48
[20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
[20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0
[20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[20213.579079] PKRU: 55555554
[20213.579087] Call Trace:
[20213.579099] trailing_symlink+0x1da/0x260
[20213.579112] path_lookupat.isra.53+0x79/0x220
[20213.579125] filename_lookup.part.69+0xa0/0x170
[20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
[20213.579151] ? getname_flags+0x4f/0x1e0
[20213.579161] user_path_at_empty+0x3e/0x50
[20213.579172] vfs_statx+0x76/0xe0
[20213.579182] __do_sys_newstat+0x3d/0x70
[20213.579194] ? fput+0x13/0x20
[20213.579203] ? ksys_ioctl+0xb0/0x300
[20213.579213] ? generic_file_llseek+0x24/0x30
[20213.579225] ? fput+0x13/0x20
[20213.579233] ? ksys_lseek+0x8d/0xb0
[20213.579243] __x64_sys_newstat+0x16/0x20
[20213.579256] do_syscall_64+0x4d/0x140
[20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
And I analyzed the disassembly of trailing_symlink() and confirmed that a NULL
->get_link() happened here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
0xffffffff812e485b <trailing_symlink+0xb>: push %r14
0xffffffff812e485d <trailing_symlink+0xd>: push %r13
0xffffffff812e485f <trailing_symlink+0xf>: push %r12
0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx # rbx = &nameidate
0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
0xffffffff812e4869 <trailing_symlink+0x19>: mov 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 <sysctl_protected_symlinks>
0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
0xffffffff812e487f <trailing_symlink+0x2f>: mov 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
0xffffffff812e488d <trailing_symlink+0x3d>: mov 0x4(%rcx),%ecx # ecx = link_inode->uid
0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f <trailing_symlink+0x10f>
0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
0xffffffff812e48af <trailing_symlink+0x5f>: mov %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
0xffffffff812e48b2 <trailing_symlink+0x62>: mov 0x50(%rbx),%rax # rax = nd->stack
0xffffffff812e48b6 <trailing_symlink+0x66>: movq $0x0,0x20(%rax) # stack[0].name = NULL
0xffffffff812e48be <trailing_symlink+0x6e>: mov 0x48(%rbx),%eax # nd->depth
0xffffffff812e48c1 <trailing_symlink+0x71>: mov 0x50(%rbx),%rdx # nd->stack
0xffffffff812e48c5 <trailing_symlink+0x75>: mov 0xc8(%rbx),%r13 # nd->link_inode
0xffffffff812e48cc <trailing_symlink+0x7c>: lea (%rax,%rax,2),%rax # rax = depth * 3
0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
0xffffffff812e48d4 <trailing_symlink+0x84>: lea -0x30(%rdx,%rax,1),%r15 # r15 = last
0xffffffff812e48d9 <trailing_symlink+0x89>: mov 0x8(%r15),%r14 # r14 = last->link.dentry
0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950 <trailing_symlink+0x100>
0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
0xffffffff812e48e9 <trailing_symlink+0x99>: callq 0xffffffff812f8a00 <atime_needs_update>
0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
0xffffffff812e48f0 <trailing_symlink+0xa0>: jne 0xffffffff812e4a56 <trailing_symlink+0x206>
0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
0xffffffff812e4905 <trailing_symlink+0xb5>: callq 0xffffffff81424310 <security_inode_follow_link>
0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
0xffffffff812e490f <trailing_symlink+0xbf>: jne 0xffffffff812e4939 <trailing_symlink+0xe9>
0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5 <trailing_symlink+0x195>
0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7 <trailing_symlink+0x167>
0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
0xffffffff812e494f <trailing_symlink+0xff>: retq
0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
0xffffffff812e4953 <trailing_symlink+0x103>: callq 0xffffffff812f8ae0 <touch_atime>
0xffffffff812e4958 <trailing_symlink+0x108>: callq 0xffffffff81a26410 <_cond_resched>
0xffffffff812e495d <trailing_symlink+0x10d>: jmp 0xffffffff812e48f6 <trailing_symlink+0xa6>
0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
0xffffffff812e4965 <trailing_symlink+0x115>: je 0xffffffff812e496f <trailing_symlink+0x11f>
0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
0xffffffff812e4969 <trailing_symlink+0x119>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
0xffffffff812e496f <trailing_symlink+0x11f>: mov $0xfffffffffffffff6,%r12
0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
0xffffffff812e4978 <trailing_symlink+0x128>: jne 0xffffffff812e493e <trailing_symlink+0xee>
0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
0xffffffff812e498d <trailing_symlink+0x13d>: je 0xffffffff812e4999 <trailing_symlink+0x149>
0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
0xffffffff812e4993 <trailing_symlink+0x143>: je 0xffffffff812e4a6f <trailing_symlink+0x21f>
0xffffffff812e4999 <trailing_symlink+0x149>: mov $0xffffffff82319b4f,%rdi
0xffffffff812e49a0 <trailing_symlink+0x150>: mov $0xfffffffffffffff3,%r12
0xffffffff812e49a7 <trailing_symlink+0x157>: callq 0xffffffff81161310 <audit_log_link_denied>
0xffffffff812e49ac <trailing_symlink+0x15c>: jmp 0xffffffff812e493e <trailing_symlink+0xee>
0xffffffff812e49ae <trailing_symlink+0x15e>: mov $0xffffffff8230164d,%r12
0xffffffff812e49b5 <trailing_symlink+0x165>: jmp 0xffffffff812e493e <trailing_symlink+0xee>
0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
0xffffffff812e49bc <trailing_symlink+0x16c>: je 0xffffffff812e4a8a <trailing_symlink+0x23a>
0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
0xffffffff812e49c5 <trailing_symlink+0x175>: callq 0xffffffff812e2da0 <nd_jump_root>
0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
0xffffffff812e49cc <trailing_symlink+0x17c>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
0xffffffff812e49dd <trailing_symlink+0x18d>: jne 0xffffffff812e4935 <trailing_symlink+0xe5>
0xffffffff812e49e3 <trailing_symlink+0x193>: jmp 0xffffffff812e49d2 <trailing_symlink+0x182>
0xffffffff812e49e5 <trailing_symlink+0x195>: mov 0x20(%r13),%rax # inode->i_op
0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov 0x8(%rax),%rcx # inode_operations->get_link
0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
0xffffffff812e49fb <trailing_symlink+0x1ab>: jne 0xffffffff812e4a1f <trailing_symlink+0x1cf>
0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi # nd->flags & LOOKUP_RCU == 0
0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
0xffffffff812e4a0b <trailing_symlink+0x1bb>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp $0xfffffffffffff000,%r12
0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe 0xffffffff812e4928 <trailing_symlink+0xd8>
0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq 0xffffffff812e493e <trailing_symlink+0xee>
0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi # nd->flags & LOOKUP_RCU != 0
0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp $0xfffffffffffffff6,%rax
0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne 0xffffffff812e4a08 <trailing_symlink+0x1b8>
0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq 0xffffffff812e3840 <unlazy_walk>
0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
0xffffffff812e4a54 <trailing_symlink+0x204>: jmp 0xffffffff812e4a08 <trailing_symlink+0x1b8>
0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
0xffffffff812e4a59 <trailing_symlink+0x209>: callq 0xffffffff812e3840 <unlazy_walk>
0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
0xffffffff812e4a60 <trailing_symlink+0x210>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
0xffffffff812e4a65 <trailing_symlink+0x215>: callq 0xffffffff812f8ae0 <touch_atime>
0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq 0xffffffff812e48f6 <trailing_symlink+0xa6>
0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
0xffffffff812e4a80 <trailing_symlink+0x230>: callq 0xffffffff811673f0 <__audit_inode>
0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq 0xffffffff812e4999 <trailing_symlink+0x149>
0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
0xffffffff812e4a8d <trailing_symlink+0x23d>: callq 0xffffffff812e4790 <set_root>
0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq 0xffffffff812e49c2 <trailing_symlink+0x172>
0xffffffff812e4a97 <trailing_symlink+0x247>: mov $0xfffffffffffffff6,%r12
0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq 0xffffffff812e493e <trailing_symlink+0xee>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
According to my understanding, the problem solved by commit 7b7820b83f23 ("xfs:
don't expose internal symlink metadata buffers to the vfs") is a data NULL
pointer dereference, but the problem here is an instruction NULL pointer
dereference.
Further, I analyzed the possible triggering process as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
rcu_walk do_unlinkat ~~> prune_dcache_sb create
rcu_read_lock
read_seqcount_retry
(the last check) iput_final
evict
destroy_inode
xfs_fs_destroy_inode
xfs_inode_set_reclaim_tag xfs_ialloc
spin_lock(ip->i_flags_lock) xfs_dialloc
set(ip, XFS_IRECLAIMABLE) xfs_iget
wakeup(xfs_reclaim_worker) rcu_read_lock
spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
spin_lock(ip->i_flags_lock)
if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
set(ip, XFS_IRECLAIM)
spin_unlock(ip->i_flags_lock)
rcu_read_unlock
< ------------ >
// miss synchronize_rcu()
xfs_reinit_inode
->get_link = NULL
get_link() // NULL
rcu_read_unlock
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose internal
symlink metadata buffers to the vfs"), we should start processing this NULL
->get_link pointer dereference.
Or, am I thinking wrong somewhere?
Thanks,
Jinliang Zheng
>
> > >
> > > Apart from that issue, I'm not aware of any other issues that the
> > > XFS inode recycling directly exposes.
> > >
> > > > According to my understanding, the essence of this problem is that XFS reuses
> > > > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
> > >
> > > It assumes that the inode will not change identity during the RCU
> > > grace period after the inode has been evicted from cache. We can
> > > safely reinstantiate an evicted inode without waiting for an RCU
> > > grace period as long as it is the same inode with the same content
> > > and same state.
> > >
> > > Problems *may* arise when we unlink the inode, then evict it, then a
> > > new file is created and the old slab cache memory address is used
> > > for the new inode. I describe the issue here:
> > >
> > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
> >
> > And judging from the relevant emails, the main reason why ->get_link() is set
> > to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when
> > the inode is chosen to be reused.
> >
> > However, perhaps due to performance reasons, this solution has not been merged
> > for a long time. How is it now?
> >
> > Maybe I am missing something in the threads of mail?
> >
> > Thank you very much. :)
> > Jinliang Zheng
> >
> > >
> > > That said, we have exactly zero evidence that this is actually a
> > > problem in production systems. We did get systems tripping over the
> > > symlink issue, but there's no evidence that the
> > > unlink->close->open(O_CREAT) issues are manifesting in the wild and
> > > hence there hasn't been any particular urgency to address it.
> > >
> > > > Are there any recommended workarounds until an elegant and efficient solution
> > > > can be proposed? After all, causing a crash is extremely unacceptable in a
> > > > production environment.
> > >
> > > What crashes are you seeing in your production environment?
> > >
> > > -Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> >
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-15 15:54 ` alexjlzheng
@ 2024-05-16 4:56 ` Jinliang Zheng
2024-05-16 7:08 ` Ian Kent
0 siblings, 1 reply; 19+ messages in thread
From: Jinliang Zheng @ 2024-05-16 4:56 UTC (permalink / raw)
To: alexjlzheng; +Cc: bfoster, david, djwong, linux-fsdevel, linux-xfs, raven, rcu
On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
> > On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
> > > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> > > > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
> > > > > Hi, all
> > > > >
> > > > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
> > > > > which can lead to null pointer references has been resolved?
> > > > >
> > > > > I browsed through emails about the following patches and their discussions:
> > > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> > > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> > > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> > > > >
> > > > > And then came to the conclusion that this problem has not been solved, am I
> > > > > right? Did I miss some patch that could solve this problem?
> > > >
> > > > We fixed the known problems this caused by turning off the VFS
> > > > functionality that the rcu pathwalks kept tripping over. See commit
> > > > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
> > > > the vfs").
> > >
> > > Sorry for the delay.
> > >
> > > The problem I encountered in the production environment was that during the
> > > rcu walk process the ->get_link() pointer was NULL, which caused a crash.
> > >
> > > As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink
> > > metadata buffers to the vfs") first appeared in:
> > > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
> > >
> > > Does this commit solve the problem of NULL ->get_link()? And how?
> >
> > I suggest reading the call stack from wherever the VFS enters the XFS
> > readlink code. If you have a reliable reproducer, then apply this patch
> > to your kernel (you haven't mentioned which one it is) and see if the
> > bad dereference goes away.
> >
> > --D
>
> Sorry for the delay.
>
> I encountered the following calltrace:
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> [20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [20213.578785] #PF: supervisor instruction fetch in kernel mode
> [20213.578799] #PF: error_code(0x0010) - not-present page
> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> [20213.578828] Oops: 0010 [#1] SMP NOPTI
> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> [20213.578884] RIP: 0010:0x0
> [20213.578894] Code: Bad RIP value.
> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000
> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000
> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380
> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000
> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48
> [20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0
> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [20213.579079] PKRU: 55555554
> [20213.579087] Call Trace:
> [20213.579099] trailing_symlink+0x1da/0x260
> [20213.579112] path_lookupat.isra.53+0x79/0x220
> [20213.579125] filename_lookup.part.69+0xa0/0x170
> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
> [20213.579151] ? getname_flags+0x4f/0x1e0
> [20213.579161] user_path_at_empty+0x3e/0x50
> [20213.579172] vfs_statx+0x76/0xe0
> [20213.579182] __do_sys_newstat+0x3d/0x70
> [20213.579194] ? fput+0x13/0x20
> [20213.579203] ? ksys_ioctl+0xb0/0x300
> [20213.579213] ? generic_file_llseek+0x24/0x30
> [20213.579225] ? fput+0x13/0x20
> [20213.579233] ? ksys_lseek+0x8d/0xb0
> [20213.579243] __x64_sys_newstat+0x16/0x20
> [20213.579256] do_syscall_64+0x4d/0x140
> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Please note that the kernel version I use is the one maintained by Tencent.Inc,
and the baseline is v5.4. But in fact, in the latest upstream source tree,
although the trailing_symlink() function has been removed, its logic has been
moved to pick_link(), so the problem still exists.
Ian Kent pointed out that try_to_unlazy() was introduced in pick_link() in the
latest upstream source tree, but I don't understand why this can solve the NULL
->get_link pointer dereference problem, because ->get_link pointer will be
dereferenced before try_to_unlazy().
(I don't understand why Ian Kent's email didn't appear on the mailing list.)
Thanks,
Jinliang Zheng
>
> And I analyzed the disassembly of trailing_symlink() and confirmed that a NULL
> ->get_link() happened here:
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
> 0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx # rbx = &nameidate
> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
> 0xffffffff812e4869 <trailing_symlink+0x19>: mov 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 <sysctl_protected_symlinks>
> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
> 0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
> 0xffffffff812e487f <trailing_symlink+0x2f>: mov 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
> 0xffffffff812e488d <trailing_symlink+0x3d>: mov 0x4(%rcx),%ecx # ecx = link_inode->uid
> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
> 0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
> 0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f <trailing_symlink+0x10f>
> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
> 0xffffffff812e48af <trailing_symlink+0x5f>: mov %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov 0x50(%rbx),%rax # rax = nd->stack
> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq $0x0,0x20(%rax) # stack[0].name = NULL
> 0xffffffff812e48be <trailing_symlink+0x6e>: mov 0x48(%rbx),%eax # nd->depth
> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov 0x50(%rbx),%rdx # nd->stack
> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov 0xc8(%rbx),%r13 # nd->link_inode
> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea (%rax,%rax,2),%rax # rax = depth * 3
> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea -0x30(%rdx,%rax,1),%r15 # r15 = last
> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov 0x8(%r15),%r14 # r14 = last->link.dentry
> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
> 0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950 <trailing_symlink+0x100>
> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq 0xffffffff812f8a00 <atime_needs_update>
> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne 0xffffffff812e4a56 <trailing_symlink+0x206>
> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq 0xffffffff81424310 <security_inode_follow_link>
> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
> 0xffffffff812e490f <trailing_symlink+0xbf>: jne 0xffffffff812e4939 <trailing_symlink+0xe9>
> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
> 0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5 <trailing_symlink+0x195>
> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
> 0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7 <trailing_symlink+0x167>
> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
> 0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
> 0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
> 0xffffffff812e494f <trailing_symlink+0xff>: retq
> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
> 0xffffffff812e4953 <trailing_symlink+0x103>: callq 0xffffffff812f8ae0 <touch_atime>
> 0xffffffff812e4958 <trailing_symlink+0x108>: callq 0xffffffff81a26410 <_cond_resched>
> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp 0xffffffff812e48f6 <trailing_symlink+0xa6>
> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
> 0xffffffff812e4965 <trailing_symlink+0x115>: je 0xffffffff812e496f <trailing_symlink+0x11f>
> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
> 0xffffffff812e4969 <trailing_symlink+0x119>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
> 0xffffffff812e496f <trailing_symlink+0x11f>: mov $0xfffffffffffffff6,%r12
> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
> 0xffffffff812e4978 <trailing_symlink+0x128>: jne 0xffffffff812e493e <trailing_symlink+0xee>
> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
> 0xffffffff812e498d <trailing_symlink+0x13d>: je 0xffffffff812e4999 <trailing_symlink+0x149>
> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
> 0xffffffff812e4993 <trailing_symlink+0x143>: je 0xffffffff812e4a6f <trailing_symlink+0x21f>
> 0xffffffff812e4999 <trailing_symlink+0x149>: mov $0xffffffff82319b4f,%rdi
> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov $0xfffffffffffffff3,%r12
> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq 0xffffffff81161310 <audit_log_link_denied>
> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp 0xffffffff812e493e <trailing_symlink+0xee>
> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov $0xffffffff8230164d,%r12
> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp 0xffffffff812e493e <trailing_symlink+0xee>
> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
> 0xffffffff812e49bc <trailing_symlink+0x16c>: je 0xffffffff812e4a8a <trailing_symlink+0x23a>
> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq 0xffffffff812e2da0 <nd_jump_root>
> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne 0xffffffff812e4935 <trailing_symlink+0xe5>
> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp 0xffffffff812e49d2 <trailing_symlink+0x182>
> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov 0x20(%r13),%rax # inode->i_op
> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov 0x8(%rax),%rcx # inode_operations->get_link
> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne 0xffffffff812e4a1f <trailing_symlink+0x1cf>
> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi # nd->flags & LOOKUP_RCU == 0
> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp $0xfffffffffffff000,%r12
> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe 0xffffffff812e4928 <trailing_symlink+0xd8>
> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq 0xffffffff812e493e <trailing_symlink+0xee>
> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi # nd->flags & LOOKUP_RCU != 0
> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp $0xfffffffffffffff6,%rax
> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq 0xffffffff812e3840 <unlazy_walk>
> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq 0xffffffff812e3840 <unlazy_walk>
> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq 0xffffffff812f8ae0 <touch_atime>
> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq 0xffffffff812e48f6 <trailing_symlink+0xa6>
> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq 0xffffffff811673f0 <__audit_inode>
> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq 0xffffffff812e4999 <trailing_symlink+0x149>
> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq 0xffffffff812e4790 <set_root>
> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq 0xffffffff812e49c2 <trailing_symlink+0x172>
> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov $0xfffffffffffffff6,%r12
> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq 0xffffffff812e493e <trailing_symlink+0xee>
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
> According to my understanding, the problem solved by commit 7b7820b83f23 ("xfs:
> don't expose internal symlink metadata buffers to the vfs") is a data NULL
> pointer dereference, but the problem here is an instruction NULL pointer
> dereference.
>
> Further, I analyzed the possible triggering process as follows:
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> rcu_walk do_unlinkat ~~> prune_dcache_sb create
>
>
> rcu_read_lock
>
> read_seqcount_retry
> (the last check) iput_final
> evict
> destroy_inode
> xfs_fs_destroy_inode
> xfs_inode_set_reclaim_tag xfs_ialloc
> spin_lock(ip->i_flags_lock) xfs_dialloc
> set(ip, XFS_IRECLAIMABLE) xfs_iget
> wakeup(xfs_reclaim_worker) rcu_read_lock
> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
> spin_lock(ip->i_flags_lock)
> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
> set(ip, XFS_IRECLAIM)
> spin_unlock(ip->i_flags_lock)
> rcu_read_unlock
> < ------------ >
> // miss synchronize_rcu()
> xfs_reinit_inode
> ->get_link = NULL
> get_link() // NULL
>
> rcu_read_unlock
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose internal
> symlink metadata buffers to the vfs"), we should start processing this NULL
> ->get_link pointer dereference.
>
> Or, am I thinking wrong somewhere?
>
> Thanks,
> Jinliang Zheng
>
> >
> > > >
> > > > Apart from that issue, I'm not aware of any other issues that the
> > > > XFS inode recycling directly exposes.
> > > >
> > > > > According to my understanding, the essence of this problem is that XFS reuses
> > > > > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
> > > >
> > > > It assumes that the inode will not change identity during the RCU
> > > > grace period after the inode has been evicted from cache. We can
> > > > safely reinstantiate an evicted inode without waiting for an RCU
> > > > grace period as long as it is the same inode with the same content
> > > > and same state.
> > > >
> > > > Problems *may* arise when we unlink the inode, then evict it, then a
> > > > new file is created and the old slab cache memory address is used
> > > > for the new inode. I describe the issue here:
> > > >
> > > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
> > >
> > > And judging from the relevant emails, the main reason why ->get_link() is set
> > > to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when
> > > the inode is chosen to be reused.
> > >
> > > However, perhaps due to performance reasons, this solution has not been merged
> > > for a long time. How is it now?
> > >
> > > Maybe I am missing something in the threads of mail?
> > >
> > > Thank you very much. :)
> > > Jinliang Zheng
> > >
> > > >
> > > > That said, we have exactly zero evidence that this is actually a
> > > > problem in production systems. We did get systems tripping over the
> > > > symlink issue, but there's no evidence that the
> > > > unlink->close->open(O_CREAT) issues are manifesting in the wild and
> > > > hence there hasn't been any particular urgency to address it.
> > > >
> > > > > Are there any recommended workarounds until an elegant and efficient solution
> > > > > can be proposed? After all, causing a crash is extremely unacceptable in a
> > > > > production environment.
> > > >
> > > > What crashes are you seeing in your production environment?
> > > >
> > > > -Dave.
> > > > --
> > > > Dave Chinner
> > > > david@fromorbit.com
> > >
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-16 4:56 ` Jinliang Zheng
@ 2024-05-16 7:08 ` Ian Kent
2024-05-16 7:23 ` Ian Kent
0 siblings, 1 reply; 19+ messages in thread
From: Ian Kent @ 2024-05-16 7:08 UTC (permalink / raw)
To: Jinliang Zheng, alexjlzheng
Cc: bfoster, david, djwong, linux-fsdevel, linux-xfs, rcu
On 16/5/24 12:56, Jinliang Zheng wrote:
> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
>>>>>> Hi, all
>>>>>>
>>>>>> I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
>>>>>> which can lead to null pointer references has been resolved?
>>>>>>
>>>>>> I browsed through emails about the following patches and their discussions:
>>>>>> - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
>>>>>> - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
>>>>>> - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
>>>>>>
>>>>>> And then came to the conclusion that this problem has not been solved, am I
>>>>>> right? Did I miss some patch that could solve this problem?
>>>>> We fixed the known problems this caused by turning off the VFS
>>>>> functionality that the rcu pathwalks kept tripping over. See commit
>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
>>>>> the vfs").
>>>> Sorry for the delay.
>>>>
>>>> The problem I encountered in the production environment was that during the
>>>> rcu walk process the ->get_link() pointer was NULL, which caused a crash.
>>>>
>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink
>>>> metadata buffers to the vfs") first appeared in:
>>>> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
>>>>
>>>> Does this commit solve the problem of NULL ->get_link()? And how?
>>> I suggest reading the call stack from wherever the VFS enters the XFS
>>> readlink code. If you have a reliable reproducer, then apply this patch
>>> to your kernel (you haven't mentioned which one it is) and see if the
>>> bad dereference goes away.
>>>
>>> --D
>> Sorry for the delay.
>>
>> I encountered the following calltrace:
>>
>> [20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
>> [20213.578799] #PF: error_code(0x0010) - not-present page
>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
>> [20213.578884] RIP: 0010:0x0
>> [20213.578894] Code: Bad RIP value.
>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000
>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000
>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380
>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000
>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48
>> [20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0
>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [20213.579079] PKRU: 55555554
>> [20213.579087] Call Trace:
>> [20213.579099] trailing_symlink+0x1da/0x260
>> [20213.579112] path_lookupat.isra.53+0x79/0x220
>> [20213.579125] filename_lookup.part.69+0xa0/0x170
>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
>> [20213.579151] ? getname_flags+0x4f/0x1e0
>> [20213.579161] user_path_at_empty+0x3e/0x50
>> [20213.579172] vfs_statx+0x76/0xe0
>> [20213.579182] __do_sys_newstat+0x3d/0x70
>> [20213.579194] ? fput+0x13/0x20
>> [20213.579203] ? ksys_ioctl+0xb0/0x300
>> [20213.579213] ? generic_file_llseek+0x24/0x30
>> [20213.579225] ? fput+0x13/0x20
>> [20213.579233] ? ksys_lseek+0x8d/0xb0
>> [20213.579243] __x64_sys_newstat+0x16/0x20
>> [20213.579256] do_syscall_64+0x4d/0x140
>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>>
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> Please note that the kernel version I use is the one maintained by Tencent.Inc,
> and the baseline is v5.4. But in fact, in the latest upstream source tree,
> although the trailing_symlink() function has been removed, its logic has been
> moved to pick_link(), so the problem still exists.
>
> Ian Kent pointed out that try_to_unlazy() was introduced in pick_link() in the
> latest upstream source tree, but I don't understand why this can solve the NULL
> ->get_link pointer dereference problem, because ->get_link pointer will be
> dereferenced before try_to_unlazy().
>
> (I don't understand why Ian Kent's email didn't appear on the mailing list.)
It was something about html mail and I think my mail client was at fault.
In any case what you say is indeed correct, so the comment isn't important.
Fact is it is still a race between the lockless path walk and inode eviction
and xfs recycling. I believe that the xfs recycling code is very hard to
fix.
IIRC correctly putting a NULL check in pick_link() was not considered
acceptable
but there must be a way that is acceptable to check this and restart the
walk.
Maybe there was a reluctance to suffer the overhead of restarting the
walk when
it shouldn't be needed.
The alternative would be to find some way to identify when it's unsafe
to reuse
an inode marked for re-cycle before dropping rcu read, perhaps with the
reference
count plus the seqlock. Basically, to reuse inodes xfs will need to
identify when
the race occurs and let the inode go away under rcu and create a new one
if a race
is detected. But possibly that isn't nearly as simple as it sounds?
>
> Thanks,
> Jinliang Zheng
>
>> And I analyzed the disassembly of trailing_symlink() and confirmed that a NULL
>> ->get_link() happened here:
>>
>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx # rbx = &nameidate
>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 <sysctl_protected_symlinks>
>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
>> 0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov 0x4(%rcx),%ecx # ecx = link_inode->uid
>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
>> 0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f <trailing_symlink+0x10f>
>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov 0x50(%rbx),%rax # rax = nd->stack
>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq $0x0,0x20(%rax) # stack[0].name = NULL
>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov 0x48(%rbx),%eax # nd->depth
>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov 0x50(%rbx),%rdx # nd->stack
>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov 0xc8(%rbx),%r13 # nd->link_inode
>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea (%rax,%rax,2),%rax # rax = depth * 3
>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea -0x30(%rdx,%rax,1),%r15 # r15 = last
>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov 0x8(%r15),%r14 # r14 = last->link.dentry
>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950 <trailing_symlink+0x100>
>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq 0xffffffff812f8a00 <atime_needs_update>
>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne 0xffffffff812e4a56 <trailing_symlink+0x206>
>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq 0xffffffff81424310 <security_inode_follow_link>
>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne 0xffffffff812e4939 <trailing_symlink+0xe9>
>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5 <trailing_symlink+0x195>
>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
>> 0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7 <trailing_symlink+0x167>
>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
>> 0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
>> 0xffffffff812e494f <trailing_symlink+0xff>: retq
>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq 0xffffffff812f8ae0 <touch_atime>
>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq 0xffffffff81a26410 <_cond_resched>
>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp 0xffffffff812e48f6 <trailing_symlink+0xa6>
>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
>> 0xffffffff812e4965 <trailing_symlink+0x115>: je 0xffffffff812e496f <trailing_symlink+0x11f>
>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
>> 0xffffffff812e4969 <trailing_symlink+0x119>: je 0xffffffff812e48ac <trailing_symlink+0x5c>
>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov $0xfffffffffffffff6,%r12
>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne 0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
>> 0xffffffff812e498d <trailing_symlink+0x13d>: je 0xffffffff812e4999 <trailing_symlink+0x149>
>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
>> 0xffffffff812e4993 <trailing_symlink+0x143>: je 0xffffffff812e4a6f <trailing_symlink+0x21f>
>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov $0xffffffff82319b4f,%rdi
>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov $0xfffffffffffffff3,%r12
>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq 0xffffffff81161310 <audit_log_link_denied>
>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp 0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov $0xffffffff8230164d,%r12
>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp 0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je 0xffffffff812e4a8a <trailing_symlink+0x23a>
>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq 0xffffffff812e2da0 <nd_jump_root>
>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne 0xffffffff812e4935 <trailing_symlink+0xe5>
>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp 0xffffffff812e49d2 <trailing_symlink+0x182>
>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov 0x20(%r13),%rax # inode->i_op
>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov 0x8(%rax),%rcx # inode_operations->get_link
>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne 0xffffffff812e4a1f <trailing_symlink+0x1cf>
>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi # nd->flags & LOOKUP_RCU == 0
>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je 0xffffffff812e49ae <trailing_symlink+0x15e>
>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp $0xfffffffffffff000,%r12
>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe 0xffffffff812e4928 <trailing_symlink+0xd8>
>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq 0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi # nd->flags & LOOKUP_RCU != 0
>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp $0xfffffffffffffff6,%rax
>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq 0xffffffff812e3840 <unlazy_walk>
>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq 0xffffffff812e3840 <unlazy_walk>
>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne 0xffffffff812e4a97 <trailing_symlink+0x247>
>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq 0xffffffff812f8ae0 <touch_atime>
>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq 0xffffffff812e48f6 <trailing_symlink+0xa6>
>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq 0xffffffff811673f0 <__audit_inode>
>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq 0xffffffff812e4999 <trailing_symlink+0x149>
>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq 0xffffffff812e4790 <set_root>
>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq 0xffffffff812e49c2 <trailing_symlink+0x172>
>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov $0xfffffffffffffff6,%r12
>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq 0xffffffff812e493e <trailing_symlink+0xee>
>>
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>
>> According to my understanding, the problem solved by commit 7b7820b83f23 ("xfs:
>> don't expose internal symlink metadata buffers to the vfs") is a data NULL
>> pointer dereference, but the problem here is an instruction NULL pointer
>> dereference.
>>
>> Further, I analyzed the possible triggering process as follows:
>>
>> rcu_walk do_unlinkat ~~> prune_dcache_sb create
>>
>>
>> rcu_read_lock
>>
>> read_seqcount_retry
>> (the last check) iput_final
>> evict
>> destroy_inode
>> xfs_fs_destroy_inode
>> xfs_inode_set_reclaim_tag xfs_ialloc
>> spin_lock(ip->i_flags_lock) xfs_dialloc
>> set(ip, XFS_IRECLAIMABLE) xfs_iget
>> wakeup(xfs_reclaim_worker) rcu_read_lock
>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
>> spin_lock(ip->i_flags_lock)
>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
>> set(ip, XFS_IRECLAIM)
>> spin_unlock(ip->i_flags_lock)
>> rcu_read_unlock
>> < ------------ >
>> // miss synchronize_rcu()
>> xfs_reinit_inode
>> ->get_link = NULL
>> get_link() // NULL
>>
>> rcu_read_unlock
>>
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>
>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose internal
>> symlink metadata buffers to the vfs"), we should start processing this NULL
>> ->get_link pointer dereference.
>>
>> Or, am I thinking wrong somewhere?
>>
>> Thanks,
>> Jinliang Zheng
>>
>>>>> Apart from that issue, I'm not aware of any other issues that the
>>>>> XFS inode recycling directly exposes.
>>>>>
>>>>>> According to my understanding, the essence of this problem is that XFS reuses
>>>>>> the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
>>>>> It assumes that the inode will not change identity during the RCU
>>>>> grace period after the inode has been evicted from cache. We can
>>>>> safely reinstantiate an evicted inode without waiting for an RCU
>>>>> grace period as long as it is the same inode with the same content
>>>>> and same state.
>>>>>
>>>>> Problems *may* arise when we unlink the inode, then evict it, then a
>>>>> new file is created and the old slab cache memory address is used
>>>>> for the new inode. I describe the issue here:
>>>>>
>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
>>>> And judging from the relevant emails, the main reason why ->get_link() is set
>>>> to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when
>>>> the inode is chosen to be reused.
>>>>
>>>> However, perhaps due to performance reasons, this solution has not been merged
>>>> for a long time. How is it now?
>>>>
>>>> Maybe I am missing something in the threads of mail?
>>>>
>>>> Thank you very much. :)
>>>> Jinliang Zheng
>>>>
>>>>> That said, we have exactly zero evidence that this is actually a
>>>>> problem in production systems. We did get systems tripping over the
>>>>> symlink issue, but there's no evidence that the
>>>>> unlink->close->open(O_CREAT) issues are manifesting in the wild and
>>>>> hence there hasn't been any particular urgency to address it.
>>>>>
>>>>>> Are there any recommended workarounds until an elegant and efficient solution
>>>>>> can be proposed? After all, causing a crash is extremely unacceptable in a
>>>>>> production environment.
>>>>> What crashes are you seeing in your production environment?
>>>>>
>>>>> -Dave.
>>>>> --
>>>>> Dave Chinner
>>>>> david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-16 7:08 ` Ian Kent
@ 2024-05-16 7:23 ` Ian Kent
2024-05-20 17:36 ` Darrick J. Wong
2024-05-27 9:41 ` Dave Chinner
0 siblings, 2 replies; 19+ messages in thread
From: Ian Kent @ 2024-05-16 7:23 UTC (permalink / raw)
To: Jinliang Zheng, alexjlzheng
Cc: bfoster, david, djwong, linux-fsdevel, linux-xfs, rcu
On 16/5/24 15:08, Ian Kent wrote:
> On 16/5/24 12:56, Jinliang Zheng wrote:
>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com
>>>>>> wrote:
>>>>>>> Hi, all
>>>>>>>
>>>>>>> I would like to ask if the conflict between xfs inode recycle
>>>>>>> and vfs rcu-walk
>>>>>>> which can lead to null pointer references has been resolved?
>>>>>>>
>>>>>>> I browsed through emails about the following patches and their
>>>>>>> discussions:
>>>>>>> -
>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
>>>>>>> -
>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
>>>>>>> -
>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
>>>>>>>
>>>>>>> And then came to the conclusion that this problem has not been
>>>>>>> solved, am I
>>>>>>> right? Did I miss some patch that could solve this problem?
>>>>>> We fixed the known problems this caused by turning off the VFS
>>>>>> functionality that the rcu pathwalks kept tripping over. See commit
>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink metadata
>>>>>> buffers to
>>>>>> the vfs").
>>>>> Sorry for the delay.
>>>>>
>>>>> The problem I encountered in the production environment was that
>>>>> during the
>>>>> rcu walk process the ->get_link() pointer was NULL, which caused a
>>>>> crash.
>>>>>
>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal
>>>>> symlink
>>>>> metadata buffers to the vfs") first appeared in:
>>>>> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
>>>>>
>>>>> Does this commit solve the problem of NULL ->get_link()? And how?
>>>> I suggest reading the call stack from wherever the VFS enters the XFS
>>>> readlink code. If you have a reliable reproducer, then apply this
>>>> patch
>>>> to your kernel (you haven't mentioned which one it is) and see if the
>>>> bad dereference goes away.
>>>>
>>>> --D
>>> Sorry for the delay.
>>>
>>> I encountered the following calltrace:
>>>
>>> [20213.578756] BUG: kernel NULL pointer dereference, address:
>>> 0000000000000000
>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
>>> [20213.578799] #PF: error_code(0x0010) - not-present page
>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded
>>> Not tainted 5.4.241-1-tlinux4-0017.3 #1
>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
>>> [20213.578884] RIP: 0010:0x0
>>> [20213.578894] Code: Bad RIP value.
>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
>>> 0000000000000000
>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
>>> 0000000000000000
>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
>>> ffff889b9eeae380
>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
>>> 0000000000000000
>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
>>> ffffc90021ebfd48
>>> [20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000)
>>> knlGS:0000000000000000
>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
>>> 00000000007706e0
>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>> 0000000000000400
>>> [20213.579079] PKRU: 55555554
>>> [20213.579087] Call Trace:
>>> [20213.579099] trailing_symlink+0x1da/0x260
>>> [20213.579112] path_lookupat.isra.53+0x79/0x220
>>> [20213.579125] filename_lookup.part.69+0xa0/0x170
>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
>>> [20213.579151] ? getname_flags+0x4f/0x1e0
>>> [20213.579161] user_path_at_empty+0x3e/0x50
>>> [20213.579172] vfs_statx+0x76/0xe0
>>> [20213.579182] __do_sys_newstat+0x3d/0x70
>>> [20213.579194] ? fput+0x13/0x20
>>> [20213.579203] ? ksys_ioctl+0xb0/0x300
>>> [20213.579213] ? generic_file_llseek+0x24/0x30
>>> [20213.579225] ? fput+0x13/0x20
>>> [20213.579233] ? ksys_lseek+0x8d/0xb0
>>> [20213.579243] __x64_sys_newstat+0x16/0x20
>>> [20213.579256] do_syscall_64+0x4d/0x140
>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>>>
>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>
>> Please note that the kernel version I use is the one maintained by
>> Tencent.Inc,
>> and the baseline is v5.4. But in fact, in the latest upstream source
>> tree,
>> although the trailing_symlink() function has been removed, its logic
>> has been
>> moved to pick_link(), so the problem still exists.
>>
>> Ian Kent pointed out that try_to_unlazy() was introduced in
>> pick_link() in the
>> latest upstream source tree, but I don't understand why this can
>> solve the NULL
>> ->get_link pointer dereference problem, because ->get_link pointer
>> will be
>> dereferenced before try_to_unlazy().
>>
>> (I don't understand why Ian Kent's email didn't appear on the mailing
>> list.)
>
> It was something about html mail and I think my mail client was at fault.
>
> In any case what you say is indeed correct, so the comment isn't
> important.
>
>
> Fact is it is still a race between the lockless path walk and inode
> eviction
>
> and xfs recycling. I believe that the xfs recycling code is very hard
> to fix.
>
>
> IIRC correctly putting a NULL check in pick_link() was not considered
> acceptable
>
> but there must be a way that is acceptable to check this and restart
> the walk.
>
> Maybe there was a reluctance to suffer the overhead of restarting the
> walk when
>
> it shouldn't be needed.
Or perhaps the worry was that if it can become NULL it could also become
a pointer to a
different (incorrect) link altogether which could have really
odd/unpleasant outcomes.
>
>
> The alternative would be to find some way to identify when it's unsafe
> to reuse
>
> an inode marked for re-cycle before dropping rcu read, perhaps with
> the reference
>
> count plus the seqlock. Basically, to reuse inodes xfs will need to
> identify when
>
> the race occurs and let the inode go away under rcu and create a new
> one if a race
>
> is detected. But possibly that isn't nearly as simple as it sounds?
>
>
>>
>> Thanks,
>> Jinliang Zheng
>>
>>> And I analyzed the disassembly of trailing_symlink() and confirmed
>>> that a NULL
>>> ->get_link() happened here:
>>>
>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1)
>>> [FTRACE NOP]
>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx
>>> # rbx = &nameidate
>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov
>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4
>>> <sysctl_protected_symlinks>
>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac
>>> <trailing_symlink+0x5c>
>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov
>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov
>>> 0x4(%rcx),%ecx # ecx = link_inode->uid
>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac
>>> <trailing_symlink+0x5c>
>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f
>>> <trailing_symlink+0x10f>
>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov
>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov
>>> 0x50(%rbx),%rax # rax = nd->stack
>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq
>>> $0x0,0x20(%rax) # stack[0].name = NULL
>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov
>>> 0x48(%rbx),%eax # nd->depth
>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov
>>> 0x50(%rbx),%rdx # nd->stack
>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov
>>> 0xc8(%rbx),%r13 # nd->link_inode
>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea
>>> (%rax,%rax,2),%rax # rax = depth * 3
>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax
>>> # rax = rax << 4, sizeof(saved):0x30
>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea
>>> -0x30(%rdx,%rax,1),%r15 # r15 = last
>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov
>>> 0x8(%r15),%r14 # r14 = last->link.dentry
>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950
>>> <trailing_symlink+0x100>
>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq
>>> 0xffffffff812f8a00 <atime_needs_update>
>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne
>>> 0xffffffff812e4a56 <trailing_symlink+0x206>
>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq
>>> 0xffffffff81424310 <security_inode_follow_link>
>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne
>>> 0xffffffff812e4939 <trailing_symlink+0xe9>
>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5
>>> <trailing_symlink+0x195>
>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7
>>> <trailing_symlink+0x167>
>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae
>>> <trailing_symlink+0x15e>
>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
>>> 0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae
>>> <trailing_symlink+0x15e>
>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq
>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq
>>> 0xffffffff812f8ae0 <touch_atime>
>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq
>>> 0xffffffff81a26410 <_cond_resched>
>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp
>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je
>>> 0xffffffff812e496f <trailing_symlink+0x11f>
>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je
>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov
>>> $0xfffffffffffffff6,%r12
>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne
>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je
>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je
>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>
>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov
>>> $0xffffffff82319b4f,%rdi
>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov
>>> $0xfffffffffffffff3,%r12
>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq
>>> 0xffffffff81161310 <audit_log_link_denied>
>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp
>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov
>>> $0xffffffff8230164d,%r12
>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp
>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je
>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>
>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq
>>> 0xffffffff812e2da0 <nd_jump_root>
>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne
>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne
>>> 0xffffffff812e4935 <trailing_symlink+0xe5>
>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp
>>> 0xffffffff812e49d2 <trailing_symlink+0x182>
>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov
>>> 0x20(%r13),%rax # inode->i_op
>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov
>>> 0x8(%rax),%rcx # inode_operations->get_link
>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne
>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>
>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi
>>> # nd->flags & LOOKUP_RCU == 0
>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq
>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je
>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp
>>> $0xfffffffffffff000,%r12
>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe
>>> 0xffffffff812e4928 <trailing_symlink+0xd8>
>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq
>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi
>>> # nd->flags & LOOKUP_RCU != 0
>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq
>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp
>>> $0xfffffffffffffff6,%rax
>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne
>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq
>>> 0xffffffff812e3840 <unlazy_walk>
>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne
>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq
>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp
>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq
>>> 0xffffffff812e3840 <unlazy_walk>
>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne
>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq
>>> 0xffffffff812f8ae0 <touch_atime>
>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq
>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq
>>> 0xffffffff811673f0 <__audit_inode>
>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq
>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq
>>> 0xffffffff812e4790 <set_root>
>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq
>>> 0xffffffff812e49c2 <trailing_symlink+0x172>
>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov
>>> $0xfffffffffffffff6,%r12
>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq
>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>
>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>
>>>
>>> According to my understanding, the problem solved by commit
>>> 7b7820b83f23 ("xfs:
>>> don't expose internal symlink metadata buffers to the vfs") is a
>>> data NULL
>>> pointer dereference, but the problem here is an instruction NULL
>>> pointer
>>> dereference.
>>>
>>> Further, I analyzed the possible triggering process as follows:
>>>
>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create
>>> rcu_read_lock
>>> read_seqcount_retry
>>> (the last check) iput_final
>>> evict
>>> destroy_inode
>>> xfs_fs_destroy_inode
>>> xfs_inode_set_reclaim_tag xfs_ialloc
>>> spin_lock(ip->i_flags_lock) xfs_dialloc
>>> set(ip, XFS_IRECLAIMABLE)
>>> xfs_iget
>>> wakeup(xfs_reclaim_worker) rcu_read_lock
>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
>>> spin_lock(ip->i_flags_lock)
>>>
>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
>>> set(ip, XFS_IRECLAIM)
>>> spin_unlock(ip->i_flags_lock)
>>> rcu_read_unlock
>>> < ------------ >
>>>
>>> // miss synchronize_rcu()
>>> xfs_reinit_inode
>>> ->get_link = NULL
>>> get_link() // NULL
>>>
>>> rcu_read_unlock
>>>
>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>
>>>
>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't
>>> expose internal
>>> symlink metadata buffers to the vfs"), we should start processing
>>> this NULL
>>> ->get_link pointer dereference.
>>>
>>> Or, am I thinking wrong somewhere?
>>>
>>> Thanks,
>>> Jinliang Zheng
>>>
>>>>>> Apart from that issue, I'm not aware of any other issues that the
>>>>>> XFS inode recycling directly exposes.
>>>>>>
>>>>>>> According to my understanding, the essence of this problem is
>>>>>>> that XFS reuses
>>>>>>> the inode evicted by VFS, but VFS rcu-walk assumes that this
>>>>>>> will not happen.
>>>>>> It assumes that the inode will not change identity during the RCU
>>>>>> grace period after the inode has been evicted from cache. We can
>>>>>> safely reinstantiate an evicted inode without waiting for an RCU
>>>>>> grace period as long as it is the same inode with the same content
>>>>>> and same state.
>>>>>>
>>>>>> Problems *may* arise when we unlink the inode, then evict it, then a
>>>>>> new file is created and the old slab cache memory address is used
>>>>>> for the new inode. I describe the issue here:
>>>>>>
>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
>>>>>>
>>>>> And judging from the relevant emails, the main reason why
>>>>> ->get_link() is set
>>>>> to NULL should be the lack of synchronize_rcu() before
>>>>> xfs_reinit_inode() when
>>>>> the inode is chosen to be reused.
>>>>>
>>>>> However, perhaps due to performance reasons, this solution has not
>>>>> been merged
>>>>> for a long time. How is it now?
>>>>>
>>>>> Maybe I am missing something in the threads of mail?
>>>>>
>>>>> Thank you very much. :)
>>>>> Jinliang Zheng
>>>>>
>>>>>> That said, we have exactly zero evidence that this is actually a
>>>>>> problem in production systems. We did get systems tripping over the
>>>>>> symlink issue, but there's no evidence that the
>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the wild and
>>>>>> hence there hasn't been any particular urgency to address it.
>>>>>>
>>>>>>> Are there any recommended workarounds until an elegant and
>>>>>>> efficient solution
>>>>>>> can be proposed? After all, causing a crash is extremely
>>>>>>> unacceptable in a
>>>>>>> production environment.
>>>>>> What crashes are you seeing in your production environment?
>>>>>>
>>>>>> -Dave.
>>>>>> --
>>>>>> Dave Chinner
>>>>>> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-16 7:23 ` Ian Kent
@ 2024-05-20 17:36 ` Darrick J. Wong
2024-05-21 1:35 ` Ian Kent
2024-05-27 9:41 ` Dave Chinner
1 sibling, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2024-05-20 17:36 UTC (permalink / raw)
To: Ian Kent
Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel,
linux-xfs, rcu
On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
>
> On 16/5/24 15:08, Ian Kent wrote:
> > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
> > > > On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
> > > > > On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
> > > > > > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> > > > > > > On Tue, Dec 05, 2023 at 07:38:33PM +0800,
> > > > > > > alexjlzheng@gmail.com wrote:
> > > > > > > > Hi, all
> > > > > > > >
> > > > > > > > I would like to ask if the conflict between xfs
> > > > > > > > inode recycle and vfs rcu-walk
> > > > > > > > which can lead to null pointer references has been resolved?
> > > > > > > >
> > > > > > > > I browsed through emails about the following
> > > > > > > > patches and their discussions:
> > > > > > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> > > > > > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> > > > > > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> > > > > > > >
> > > > > > > > And then came to the conclusion that this
> > > > > > > > problem has not been solved, am I
> > > > > > > > right? Did I miss some patch that could solve this problem?
> > > > > > > We fixed the known problems this caused by turning off the VFS
> > > > > > > functionality that the rcu pathwalks kept tripping over. See commit
> > > > > > > 7b7820b83f23 ("xfs: don't expose internal symlink
> > > > > > > metadata buffers to
> > > > > > > the vfs").
> > > > > > Sorry for the delay.
> > > > > >
> > > > > > The problem I encountered in the production environment
> > > > > > was that during the
> > > > > > rcu walk process the ->get_link() pointer was NULL,
> > > > > > which caused a crash.
> > > > > >
> > > > > > As far as I know, commit 7b7820b83f23 ("xfs: don't
> > > > > > expose internal symlink
> > > > > > metadata buffers to the vfs") first appeared in:
> > > > > > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
> > > > > >
> > > > > > Does this commit solve the problem of NULL ->get_link()? And how?
> > > > > I suggest reading the call stack from wherever the VFS enters the XFS
> > > > > readlink code. If you have a reliable reproducer, then
> > > > > apply this patch
> > > > > to your kernel (you haven't mentioned which one it is) and see if the
> > > > > bad dereference goes away.
> > > > >
> > > > > --D
> > > > Sorry for the delay.
> > > >
> > > > I encountered the following calltrace:
> > > >
> > > > [20213.578756] BUG: kernel NULL pointer dereference, address:
> > > > 0000000000000000
> > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > [20213.578884] RIP: 0010:0x0
> > > > [20213.578894] Code: Bad RIP value.
> > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > > > 0000000000000000
> > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > > > 0000000000000000
> > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > > > ffff889b9eeae380
> > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > > > 0000000000000000
> > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > > > ffffc90021ebfd48
> > > > [20213.578998] FS: 00007f89c534e740(0000)
> > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > > > 00000000007706e0
> > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [20213.579079] PKRU: 55555554
> > > > [20213.579087] Call Trace:
> > > > [20213.579099] trailing_symlink+0x1da/0x260
> > > > [20213.579112] path_lookupat.isra.53+0x79/0x220
> > > > [20213.579125] filename_lookup.part.69+0xa0/0x170
> > > > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
> > > > [20213.579151] ? getname_flags+0x4f/0x1e0
> > > > [20213.579161] user_path_at_empty+0x3e/0x50
> > > > [20213.579172] vfs_statx+0x76/0xe0
> > > > [20213.579182] __do_sys_newstat+0x3d/0x70
> > > > [20213.579194] ? fput+0x13/0x20
> > > > [20213.579203] ? ksys_ioctl+0xb0/0x300
> > > > [20213.579213] ? generic_file_llseek+0x24/0x30
> > > > [20213.579225] ? fput+0x13/0x20
> > > > [20213.579233] ? ksys_lseek+0x8d/0xb0
> > > > [20213.579243] __x64_sys_newstat+0x16/0x20
> > > > [20213.579256] do_syscall_64+0x4d/0x140
> > > > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > >
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > >
> > > Please note that the kernel version I use is the one maintained by
> > > Tencent.Inc,
> > > and the baseline is v5.4. But in fact, in the latest upstream source
> > > tree,
> > > although the trailing_symlink() function has been removed, its logic
> > > has been
> > > moved to pick_link(), so the problem still exists.
> > >
> > > Ian Kent pointed out that try_to_unlazy() was introduced in
> > > pick_link() in the
> > > latest upstream source tree, but I don't understand why this can
> > > solve the NULL
> > > ->get_link pointer dereference problem, because ->get_link pointer
> > > will be
> > > dereferenced before try_to_unlazy().
> > >
> > > (I don't understand why Ian Kent's email didn't appear on the
> > > mailing list.)
> >
> > It was something about html mail and I think my mail client was at fault.
> >
> > In any case what you say is indeed correct, so the comment isn't
> > important.
> >
> >
> > Fact is it is still a race between the lockless path walk and inode
> > eviction
> >
> > and xfs recycling. I believe that the xfs recycling code is very hard to
> > fix.
> >
> >
> > IIRC correctly putting a NULL check in pick_link() was not considered
> > acceptable
> >
> > but there must be a way that is acceptable to check this and restart the
> > walk.
> >
> > Maybe there was a reluctance to suffer the overhead of restarting the
> > walk when
> >
> > it shouldn't be needed.
>
> Or perhaps the worry was that if it can become NULL it could also become a
> pointer to a
>
> different (incorrect) link altogether which could have really odd/unpleasant
> outcomes.
Yuck. I think that means that we can't reallocate freed inodes until
the rcu grace period expires. For inodes that haven't been evicted, I
think that also means we cannot recycle cached inodes until after an rcu
grace period expires; or maybe that we cannot reset i_op/i_fop and must
not leave the incore state in an inconsistent format?
--D
>
> >
> >
> > The alternative would be to find some way to identify when it's unsafe
> > to reuse
> >
> > an inode marked for re-cycle before dropping rcu read, perhaps with the
> > reference
> >
> > count plus the seqlock. Basically, to reuse inodes xfs will need to
> > identify when
> >
> > the race occurs and let the inode go away under rcu and create a new one
> > if a race
> >
> > is detected. But possibly that isn't nearly as simple as it sounds?
> >
> >
> > >
> > > Thanks,
> > > Jinliang Zheng
> > >
> > > > And I analyzed the disassembly of trailing_symlink() and
> > > > confirmed that a NULL
> > > > ->get_link() happened here:
> > > >
> > > > 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1)
> > > > [FTRACE NOP]
> > > > 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
> > > > 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
> > > > 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
> > > > 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
> > > > 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
> > > > 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
> > > > 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
> > > > 0xffffffff812e4862 <trailing_symlink+0x12>: mov
> > > > %rdi,%rbx # rbx = &nameidate
> > > > 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
> > > > 0xffffffff812e4869 <trailing_symlink+0x19>: mov
> > > > 0x1765845(%rip),%edx # 0xffffffff82a4a0b4
> > > > <sysctl_protected_symlinks>
> > > > 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
> > > > 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
> > > > 0xffffffff812e4874 <trailing_symlink+0x24>: je
> > > > 0xffffffff812e48ac <trailing_symlink+0x5c>
> > > > 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
> > > > 0xffffffff812e487f <trailing_symlink+0x2f>: mov
> > > > 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
> > > > 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
> > > > 0xffffffff812e488d <trailing_symlink+0x3d>: mov
> > > > 0x4(%rcx),%ecx # ecx = link_inode->uid
> > > > 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
> > > > 0xffffffff812e4893 <trailing_symlink+0x43>: je
> > > > 0xffffffff812e48ac <trailing_symlink+0x5c>
> > > > 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
> > > > 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
> > > > 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
> > > > 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
> > > > 0xffffffff812e48a6 <trailing_symlink+0x56>: je
> > > > 0xffffffff812e495f <trailing_symlink+0x10f>
> > > > 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
> > > > 0xffffffff812e48af <trailing_symlink+0x5f>: mov
> > > > %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
> > > > 0xffffffff812e48b2 <trailing_symlink+0x62>: mov
> > > > 0x50(%rbx),%rax # rax = nd->stack
> > > > 0xffffffff812e48b6 <trailing_symlink+0x66>: movq
> > > > $0x0,0x20(%rax) # stack[0].name = NULL
> > > > 0xffffffff812e48be <trailing_symlink+0x6e>: mov
> > > > 0x48(%rbx),%eax # nd->depth
> > > > 0xffffffff812e48c1 <trailing_symlink+0x71>: mov
> > > > 0x50(%rbx),%rdx # nd->stack
> > > > 0xffffffff812e48c5 <trailing_symlink+0x75>: mov
> > > > 0xc8(%rbx),%r13 # nd->link_inode
> > > > 0xffffffff812e48cc <trailing_symlink+0x7c>: lea
> > > > (%rax,%rax,2),%rax # rax = depth * 3
> > > > 0xffffffff812e48d0 <trailing_symlink+0x80>: shl
> > > > $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
> > > > 0xffffffff812e48d4 <trailing_symlink+0x84>: lea
> > > > -0x30(%rdx,%rax,1),%r15 # r15 = last
> > > > 0xffffffff812e48d9 <trailing_symlink+0x89>: mov
> > > > 0x8(%r15),%r14 # r14 = last->link.dentry
> > > > 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
> > > > 0xffffffff812e48e1 <trailing_symlink+0x91>: je
> > > > 0xffffffff812e4950 <trailing_symlink+0x100>
> > > > 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
> > > > 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
> > > > 0xffffffff812e48e9 <trailing_symlink+0x99>: callq
> > > > 0xffffffff812f8a00 <atime_needs_update>
> > > > 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
> > > > 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne
> > > > 0xffffffff812e4a56 <trailing_symlink+0x206>
> > > > 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
> > > > 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
> > > > 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
> > > > 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
> > > > 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
> > > > 0xffffffff812e4905 <trailing_symlink+0xb5>: callq
> > > > 0xffffffff81424310 <security_inode_follow_link>
> > > > 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
> > > > 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
> > > > 0xffffffff812e490f <trailing_symlink+0xbf>: jne
> > > > 0xffffffff812e4939 <trailing_symlink+0xe9>
> > > > 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
> > > > 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
> > > > 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
> > > > 0xffffffff812e4922 <trailing_symlink+0xd2>: je
> > > > 0xffffffff812e49e5 <trailing_symlink+0x195>
> > > > 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
> > > > 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
> > > > 0xffffffff812e492f <trailing_symlink+0xdf>: je
> > > > 0xffffffff812e49b7 <trailing_symlink+0x167>
> > > > 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
> > > > 0xffffffff812e4937 <trailing_symlink+0xe7>: je
> > > > 0xffffffff812e49ae <trailing_symlink+0x15e>
> > > > 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
> > > > 0xffffffff812e493c <trailing_symlink+0xec>: je
> > > > 0xffffffff812e49ae <trailing_symlink+0x15e>
> > > > 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
> > > > 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
> > > > 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
> > > > 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
> > > > 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
> > > > 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
> > > > 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
> > > > 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
> > > > 0xffffffff812e494f <trailing_symlink+0xff>: retq
> > > > 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
> > > > 0xffffffff812e4953 <trailing_symlink+0x103>: callq
> > > > 0xffffffff812f8ae0 <touch_atime>
> > > > 0xffffffff812e4958 <trailing_symlink+0x108>: callq
> > > > 0xffffffff81a26410 <_cond_resched>
> > > > 0xffffffff812e495d <trailing_symlink+0x10d>: jmp
> > > > 0xffffffff812e48f6 <trailing_symlink+0xa6>
> > > > 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
> > > > 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
> > > > 0xffffffff812e4965 <trailing_symlink+0x115>: je
> > > > 0xffffffff812e496f <trailing_symlink+0x11f>
> > > > 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
> > > > 0xffffffff812e4969 <trailing_symlink+0x119>: je
> > > > 0xffffffff812e48ac <trailing_symlink+0x5c>
> > > > 0xffffffff812e496f <trailing_symlink+0x11f>: mov
> > > > $0xfffffffffffffff6,%r12
> > > > 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
> > > > 0xffffffff812e4978 <trailing_symlink+0x128>: jne
> > > > 0xffffffff812e493e <trailing_symlink+0xee>
> > > > 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
> > > > 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
> > > > 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
> > > > 0xffffffff812e498d <trailing_symlink+0x13d>: je
> > > > 0xffffffff812e4999 <trailing_symlink+0x149>
> > > > 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
> > > > 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
> > > > 0xffffffff812e4993 <trailing_symlink+0x143>: je
> > > > 0xffffffff812e4a6f <trailing_symlink+0x21f>
> > > > 0xffffffff812e4999 <trailing_symlink+0x149>: mov
> > > > $0xffffffff82319b4f,%rdi
> > > > 0xffffffff812e49a0 <trailing_symlink+0x150>: mov
> > > > $0xfffffffffffffff3,%r12
> > > > 0xffffffff812e49a7 <trailing_symlink+0x157>: callq
> > > > 0xffffffff81161310 <audit_log_link_denied>
> > > > 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp
> > > > 0xffffffff812e493e <trailing_symlink+0xee>
> > > > 0xffffffff812e49ae <trailing_symlink+0x15e>: mov
> > > > $0xffffffff8230164d,%r12
> > > > 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp
> > > > 0xffffffff812e493e <trailing_symlink+0xee>
> > > > 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
> > > > 0xffffffff812e49bc <trailing_symlink+0x16c>: je
> > > > 0xffffffff812e4a8a <trailing_symlink+0x23a>
> > > > 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
> > > > 0xffffffff812e49c5 <trailing_symlink+0x175>: callq
> > > > 0xffffffff812e2da0 <nd_jump_root>
> > > > 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
> > > > 0xffffffff812e49cc <trailing_symlink+0x17c>: jne
> > > > 0xffffffff812e4a97 <trailing_symlink+0x247>
> > > > 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
> > > > 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
> > > > 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
> > > > 0xffffffff812e49dd <trailing_symlink+0x18d>: jne
> > > > 0xffffffff812e4935 <trailing_symlink+0xe5>
> > > > 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp
> > > > 0xffffffff812e49d2 <trailing_symlink+0x182>
> > > > 0xffffffff812e49e5 <trailing_symlink+0x195>: mov
> > > > 0x20(%r13),%rax # inode->i_op
> > > > 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
> > > > 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
> > > > 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
> > > > 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov
> > > > 0x8(%rax),%rcx # inode_operations->get_link
> > > > 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
> > > > 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne
> > > > 0xffffffff812e4a1f <trailing_symlink+0x1cf>
> > > > 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov
> > > > %r14,%rdi # nd->flags & LOOKUP_RCU == 0
> > > > 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq
> > > > 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> > > > 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
> > > > 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
> > > > 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je
> > > > 0xffffffff812e49ae <trailing_symlink+0x15e>
> > > > 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp
> > > > $0xfffffffffffff000,%r12
> > > > 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe
> > > > 0xffffffff812e4928 <trailing_symlink+0xd8>
> > > > 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq
> > > > 0xffffffff812e493e <trailing_symlink+0xee>
> > > > 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor
> > > > %edi,%edi # nd->flags & LOOKUP_RCU != 0
> > > > 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
> > > > 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq
> > > > 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> > > > 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
> > > > 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp
> > > > $0xfffffffffffffff6,%rax
> > > > 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne
> > > > 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> > > > 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
> > > > 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq
> > > > 0xffffffff812e3840 <unlazy_walk>
> > > > 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
> > > > 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne
> > > > 0xffffffff812e4a97 <trailing_symlink+0x247>
> > > > 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
> > > > 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
> > > > 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
> > > > 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
> > > > 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq
> > > > 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
> > > > 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
> > > > 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp
> > > > 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> > > > 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
> > > > 0xffffffff812e4a59 <trailing_symlink+0x209>: callq
> > > > 0xffffffff812e3840 <unlazy_walk>
> > > > 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
> > > > 0xffffffff812e4a60 <trailing_symlink+0x210>: jne
> > > > 0xffffffff812e4a97 <trailing_symlink+0x247>
> > > > 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
> > > > 0xffffffff812e4a65 <trailing_symlink+0x215>: callq
> > > > 0xffffffff812f8ae0 <touch_atime>
> > > > 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq
> > > > 0xffffffff812e48f6 <trailing_symlink+0xa6>
> > > > 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
> > > > 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
> > > > 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
> > > > 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
> > > > 0xffffffff812e4a80 <trailing_symlink+0x230>: callq
> > > > 0xffffffff811673f0 <__audit_inode>
> > > > 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq
> > > > 0xffffffff812e4999 <trailing_symlink+0x149>
> > > > 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
> > > > 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq
> > > > 0xffffffff812e4790 <set_root>
> > > > 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq
> > > > 0xffffffff812e49c2 <trailing_symlink+0x172>
> > > > 0xffffffff812e4a97 <trailing_symlink+0x247>: mov
> > > > $0xfffffffffffffff6,%r12
> > > > 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq
> > > > 0xffffffff812e493e <trailing_symlink+0xee>
> > > >
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > >
> > > >
> > > > According to my understanding, the problem solved by commit
> > > > 7b7820b83f23 ("xfs:
> > > > don't expose internal symlink metadata buffers to the vfs") is a
> > > > data NULL
> > > > pointer dereference, but the problem here is an instruction NULL
> > > > pointer
> > > > dereference.
> > > >
> > > > Further, I analyzed the possible triggering process as follows:
> > > >
> > > > rcu_walk do_unlinkat ~~> prune_dcache_sb create
> > > > rcu_read_lock
> > > > read_seqcount_retry
> > > > (the last check) iput_final
> > > > evict
> > > > destroy_inode
> > > > xfs_fs_destroy_inode
> > > > xfs_inode_set_reclaim_tag xfs_ialloc
> > > > spin_lock(ip->i_flags_lock) xfs_dialloc
> > > > set(ip, XFS_IRECLAIMABLE)
> > > > xfs_iget
> > > > wakeup(xfs_reclaim_worker) rcu_read_lock
> > > > spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
> > > > spin_lock(ip->i_flags_lock)
> > > >
> > > > if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
> > > > set(ip, XFS_IRECLAIM)
> > > > spin_unlock(ip->i_flags_lock)
> > > > rcu_read_unlock
> > > > < ------------ >
> > > >
> > > > // miss synchronize_rcu()
> > > > xfs_reinit_inode
> > > > ->get_link = NULL
> > > > get_link() // NULL
> > > >
> > > > rcu_read_unlock
> > > >
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > >
> > > >
> > > > Therefore, I think that after commit 7b7820b83f23 ("xfs: don't
> > > > expose internal
> > > > symlink metadata buffers to the vfs"), we should start
> > > > processing this NULL
> > > > ->get_link pointer dereference.
> > > >
> > > > Or, am I thinking wrong somewhere?
> > > >
> > > > Thanks,
> > > > Jinliang Zheng
> > > >
> > > > > > > Apart from that issue, I'm not aware of any other issues that the
> > > > > > > XFS inode recycling directly exposes.
> > > > > > >
> > > > > > > > According to my understanding, the essence of
> > > > > > > > this problem is that XFS reuses
> > > > > > > > the inode evicted by VFS, but VFS rcu-walk
> > > > > > > > assumes that this will not happen.
> > > > > > > It assumes that the inode will not change identity during the RCU
> > > > > > > grace period after the inode has been evicted from cache. We can
> > > > > > > safely reinstantiate an evicted inode without waiting for an RCU
> > > > > > > grace period as long as it is the same inode with the same content
> > > > > > > and same state.
> > > > > > >
> > > > > > > Problems *may* arise when we unlink the inode, then evict it, then a
> > > > > > > new file is created and the old slab cache memory address is used
> > > > > > > for the new inode. I describe the issue here:
> > > > > > >
> > > > > > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
> > > > > > >
> > > > > > And judging from the relevant emails, the main reason
> > > > > > why ->get_link() is set
> > > > > > to NULL should be the lack of synchronize_rcu() before
> > > > > > xfs_reinit_inode() when
> > > > > > the inode is chosen to be reused.
> > > > > >
> > > > > > However, perhaps due to performance reasons, this
> > > > > > solution has not been merged
> > > > > > for a long time. How is it now?
> > > > > >
> > > > > > Maybe I am missing something in the threads of mail?
> > > > > >
> > > > > > Thank you very much. :)
> > > > > > Jinliang Zheng
> > > > > >
> > > > > > > That said, we have exactly zero evidence that this is actually a
> > > > > > > problem in production systems. We did get systems tripping over the
> > > > > > > symlink issue, but there's no evidence that the
> > > > > > > unlink->close->open(O_CREAT) issues are manifesting in the wild and
> > > > > > > hence there hasn't been any particular urgency to address it.
> > > > > > >
> > > > > > > > Are there any recommended workarounds until an
> > > > > > > > elegant and efficient solution
> > > > > > > > can be proposed? After all, causing a crash is
> > > > > > > > extremely unacceptable in a
> > > > > > > > production environment.
> > > > > > > What crashes are you seeing in your production environment?
> > > > > > >
> > > > > > > -Dave.
> > > > > > > --
> > > > > > > Dave Chinner
> > > > > > > david@fromorbit.com
> >
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-20 17:36 ` Darrick J. Wong
@ 2024-05-21 1:35 ` Ian Kent
2024-05-21 2:13 ` Ian Kent
0 siblings, 1 reply; 19+ messages in thread
From: Ian Kent @ 2024-05-21 1:35 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel,
linux-xfs, rcu
On 21/5/24 01:36, Darrick J. Wong wrote:
> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
>> On 16/5/24 15:08, Ian Kent wrote:
>>> On 16/5/24 12:56, Jinliang Zheng wrote:
>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800,
>>>>>>>> alexjlzheng@gmail.com wrote:
>>>>>>>>> Hi, all
>>>>>>>>>
>>>>>>>>> I would like to ask if the conflict between xfs
>>>>>>>>> inode recycle and vfs rcu-walk
>>>>>>>>> which can lead to null pointer references has been resolved?
>>>>>>>>>
>>>>>>>>> I browsed through emails about the following
>>>>>>>>> patches and their discussions:
>>>>>>>>> - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
>>>>>>>>> - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
>>>>>>>>> - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
>>>>>>>>>
>>>>>>>>> And then came to the conclusion that this
>>>>>>>>> problem has not been solved, am I
>>>>>>>>> right? Did I miss some patch that could solve this problem?
>>>>>>>> We fixed the known problems this caused by turning off the VFS
>>>>>>>> functionality that the rcu pathwalks kept tripping over. See commit
>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink
>>>>>>>> metadata buffers to
>>>>>>>> the vfs").
>>>>>>> Sorry for the delay.
>>>>>>>
>>>>>>> The problem I encountered in the production environment
>>>>>>> was that during the
>>>>>>> rcu walk process the ->get_link() pointer was NULL,
>>>>>>> which caused a crash.
>>>>>>>
>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't
>>>>>>> expose internal symlink
>>>>>>> metadata buffers to the vfs") first appeared in:
>>>>>>> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
>>>>>>>
>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how?
>>>>>> I suggest reading the call stack from wherever the VFS enters the XFS
>>>>>> readlink code. If you have a reliable reproducer, then
>>>>>> apply this patch
>>>>>> to your kernel (you haven't mentioned which one it is) and see if the
>>>>>> bad dereference goes away.
>>>>>>
>>>>>> --D
>>>>> Sorry for the delay.
>>>>>
>>>>> I encountered the following calltrace:
>>>>>
>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address:
>>>>> 0000000000000000
>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page
>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
>>>>> [20213.578884] RIP: 0010:0x0
>>>>> [20213.578894] Code: Bad RIP value.
>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
>>>>> 0000000000000000
>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
>>>>> 0000000000000000
>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
>>>>> ffff889b9eeae380
>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
>>>>> 0000000000000000
>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
>>>>> ffffc90021ebfd48
>>>>> [20213.578998] FS: 00007f89c534e740(0000)
>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000
>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
>>>>> 00000000007706e0
>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>> 0000000000000000
>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>> 0000000000000400
>>>>> [20213.579079] PKRU: 55555554
>>>>> [20213.579087] Call Trace:
>>>>> [20213.579099] trailing_symlink+0x1da/0x260
>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220
>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170
>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0
>>>>> [20213.579161] user_path_at_empty+0x3e/0x50
>>>>> [20213.579172] vfs_statx+0x76/0xe0
>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70
>>>>> [20213.579194] ? fput+0x13/0x20
>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300
>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30
>>>>> [20213.579225] ? fput+0x13/0x20
>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0
>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20
>>>>> [20213.579256] do_syscall_64+0x4d/0x140
>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>>>>>
>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>
>>>> Please note that the kernel version I use is the one maintained by
>>>> Tencent.Inc,
>>>> and the baseline is v5.4. But in fact, in the latest upstream source
>>>> tree,
>>>> although the trailing_symlink() function has been removed, its logic
>>>> has been
>>>> moved to pick_link(), so the problem still exists.
>>>>
>>>> Ian Kent pointed out that try_to_unlazy() was introduced in
>>>> pick_link() in the
>>>> latest upstream source tree, but I don't understand why this can
>>>> solve the NULL
>>>> ->get_link pointer dereference problem, because ->get_link pointer
>>>> will be
>>>> dereferenced before try_to_unlazy().
>>>>
>>>> (I don't understand why Ian Kent's email didn't appear on the
>>>> mailing list.)
>>> It was something about html mail and I think my mail client was at fault.
>>>
>>> In any case what you say is indeed correct, so the comment isn't
>>> important.
>>>
>>>
>>> Fact is it is still a race between the lockless path walk and inode
>>> eviction
>>>
>>> and xfs recycling. I believe that the xfs recycling code is very hard to
>>> fix.
>>>
>>>
>>> IIRC correctly putting a NULL check in pick_link() was not considered
>>> acceptable
>>>
>>> but there must be a way that is acceptable to check this and restart the
>>> walk.
>>>
>>> Maybe there was a reluctance to suffer the overhead of restarting the
>>> walk when
>>>
>>> it shouldn't be needed.
>> Or perhaps the worry was that if it can become NULL it could also become a
>> pointer to a
>>
>> different (incorrect) link altogether which could have really odd/unpleasant
>> outcomes.
> Yuck. I think that means that we can't reallocate freed inodes until
> the rcu grace period expires. For inodes that haven't been evicted, I
> think that also means we cannot recycle cached inodes until after an rcu
> grace period expires; or maybe that we cannot reset i_op/i_fop and must
> not leave the incore state in an inconsistent format?
Yeah, not pretty!
But shouldn't this case occur only occasionally?
So issuing a cache miss shouldn't impact performance too much that was,
I believe, the concern with waiting for the rcu grace period.
Identifying it's happening should be possible, the vfs legitimize_*()
has this job for various objects but maybe it's using vfs private info.
(certainly it uses nameidata struct with a seq lock sequence number in
it) but I assume it can be done somehow.
My question then becomes is it viable/straight forward to not recycle such
an inode and discard it instead so it gets re-created, I guess it's
essentially
a cache miss?
Ian
>
> --D
>
>>>
>>> The alternative would be to find some way to identify when it's unsafe
>>> to reuse
>>>
>>> an inode marked for re-cycle before dropping rcu read, perhaps with the
>>> reference
>>>
>>> count plus the seqlock. Basically, to reuse inodes xfs will need to
>>> identify when
>>>
>>> the race occurs and let the inode go away under rcu and create a new one
>>> if a race
>>>
>>> is detected. But possibly that isn't nearly as simple as it sounds?
>>>
>>>
>>>> Thanks,
>>>> Jinliang Zheng
>>>>
>>>>> And I analyzed the disassembly of trailing_symlink() and
>>>>> confirmed that a NULL
>>>>> ->get_link() happened here:
>>>>>
>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1)
>>>>> [FTRACE NOP]
>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov
>>>>> %rdi,%rbx # rbx = &nameidate
>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov
>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4
>>>>> <sysctl_protected_symlinks>
>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je
>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov
>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov
>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid
>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je
>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je
>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>
>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov
>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov
>>>>> 0x50(%rbx),%rax # rax = nd->stack
>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq
>>>>> $0x0,0x20(%rax) # stack[0].name = NULL
>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov
>>>>> 0x48(%rbx),%eax # nd->depth
>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov
>>>>> 0x50(%rbx),%rdx # nd->stack
>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov
>>>>> 0xc8(%rbx),%r13 # nd->link_inode
>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea
>>>>> (%rax,%rax,2),%rax # rax = depth * 3
>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl
>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea
>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last
>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov
>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry
>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je
>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>
>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq
>>>>> 0xffffffff812f8a00 <atime_needs_update>
>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne
>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>
>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq
>>>>> 0xffffffff81424310 <security_inode_follow_link>
>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne
>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>
>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je
>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>
>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je
>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>
>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je
>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je
>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq
>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq
>>>>> 0xffffffff812f8ae0 <touch_atime>
>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq
>>>>> 0xffffffff81a26410 <_cond_resched>
>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp
>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je
>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>
>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je
>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov
>>>>> $0xfffffffffffffff6,%r12
>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne
>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je
>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je
>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>
>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov
>>>>> $0xffffffff82319b4f,%rdi
>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov
>>>>> $0xfffffffffffffff3,%r12
>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq
>>>>> 0xffffffff81161310 <audit_log_link_denied>
>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp
>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov
>>>>> $0xffffffff8230164d,%r12
>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp
>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je
>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>
>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq
>>>>> 0xffffffff812e2da0 <nd_jump_root>
>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne
>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne
>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>
>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp
>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>
>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov
>>>>> 0x20(%r13),%rax # inode->i_op
>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov
>>>>> 0x8(%rax),%rcx # inode_operations->get_link
>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne
>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>
>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov
>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0
>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq
>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je
>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp
>>>>> $0xfffffffffffff000,%r12
>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe
>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>
>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq
>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor
>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0
>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq
>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp
>>>>> $0xfffffffffffffff6,%rax
>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne
>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq
>>>>> 0xffffffff812e3840 <unlazy_walk>
>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne
>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq
>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp
>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq
>>>>> 0xffffffff812e3840 <unlazy_walk>
>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne
>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq
>>>>> 0xffffffff812f8ae0 <touch_atime>
>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq
>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq
>>>>> 0xffffffff811673f0 <__audit_inode>
>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq
>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq
>>>>> 0xffffffff812e4790 <set_root>
>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq
>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>
>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov
>>>>> $0xfffffffffffffff6,%r12
>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq
>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>
>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>
>>>>>
>>>>> According to my understanding, the problem solved by commit
>>>>> 7b7820b83f23 ("xfs:
>>>>> don't expose internal symlink metadata buffers to the vfs") is a
>>>>> data NULL
>>>>> pointer dereference, but the problem here is an instruction NULL
>>>>> pointer
>>>>> dereference.
>>>>>
>>>>> Further, I analyzed the possible triggering process as follows:
>>>>>
>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create
>>>>> rcu_read_lock
>>>>> read_seqcount_retry
>>>>> (the last check) iput_final
>>>>> evict
>>>>> destroy_inode
>>>>> xfs_fs_destroy_inode
>>>>> xfs_inode_set_reclaim_tag xfs_ialloc
>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc
>>>>> set(ip, XFS_IRECLAIMABLE)
>>>>> xfs_iget
>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock
>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
>>>>> spin_lock(ip->i_flags_lock)
>>>>>
>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
>>>>> set(ip, XFS_IRECLAIM)
>>>>> spin_unlock(ip->i_flags_lock)
>>>>> rcu_read_unlock
>>>>> < ------------ >
>>>>>
>>>>> // miss synchronize_rcu()
>>>>> xfs_reinit_inode
>>>>> ->get_link = NULL
>>>>> get_link() // NULL
>>>>>
>>>>> rcu_read_unlock
>>>>>
>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>
>>>>>
>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't
>>>>> expose internal
>>>>> symlink metadata buffers to the vfs"), we should start
>>>>> processing this NULL
>>>>> ->get_link pointer dereference.
>>>>>
>>>>> Or, am I thinking wrong somewhere?
>>>>>
>>>>> Thanks,
>>>>> Jinliang Zheng
>>>>>
>>>>>>>> Apart from that issue, I'm not aware of any other issues that the
>>>>>>>> XFS inode recycling directly exposes.
>>>>>>>>
>>>>>>>>> According to my understanding, the essence of
>>>>>>>>> this problem is that XFS reuses
>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk
>>>>>>>>> assumes that this will not happen.
>>>>>>>> It assumes that the inode will not change identity during the RCU
>>>>>>>> grace period after the inode has been evicted from cache. We can
>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU
>>>>>>>> grace period as long as it is the same inode with the same content
>>>>>>>> and same state.
>>>>>>>>
>>>>>>>> Problems *may* arise when we unlink the inode, then evict it, then a
>>>>>>>> new file is created and the old slab cache memory address is used
>>>>>>>> for the new inode. I describe the issue here:
>>>>>>>>
>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
>>>>>>>>
>>>>>>> And judging from the relevant emails, the main reason
>>>>>>> why ->get_link() is set
>>>>>>> to NULL should be the lack of synchronize_rcu() before
>>>>>>> xfs_reinit_inode() when
>>>>>>> the inode is chosen to be reused.
>>>>>>>
>>>>>>> However, perhaps due to performance reasons, this
>>>>>>> solution has not been merged
>>>>>>> for a long time. How is it now?
>>>>>>>
>>>>>>> Maybe I am missing something in the threads of mail?
>>>>>>>
>>>>>>> Thank you very much. :)
>>>>>>> Jinliang Zheng
>>>>>>>
>>>>>>>> That said, we have exactly zero evidence that this is actually a
>>>>>>>> problem in production systems. We did get systems tripping over the
>>>>>>>> symlink issue, but there's no evidence that the
>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the wild and
>>>>>>>> hence there hasn't been any particular urgency to address it.
>>>>>>>>
>>>>>>>>> Are there any recommended workarounds until an
>>>>>>>>> elegant and efficient solution
>>>>>>>>> can be proposed? After all, causing a crash is
>>>>>>>>> extremely unacceptable in a
>>>>>>>>> production environment.
>>>>>>>> What crashes are you seeing in your production environment?
>>>>>>>>
>>>>>>>> -Dave.
>>>>>>>> --
>>>>>>>> Dave Chinner
>>>>>>>> david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-21 1:35 ` Ian Kent
@ 2024-05-21 2:13 ` Ian Kent
2024-05-26 15:04 ` Jinliang Zheng
2024-05-26 23:51 ` Ian Kent
0 siblings, 2 replies; 19+ messages in thread
From: Ian Kent @ 2024-05-21 2:13 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel,
linux-xfs, rcu
On 21/5/24 09:35, Ian Kent wrote:
> On 21/5/24 01:36, Darrick J. Wong wrote:
>> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
>>> On 16/5/24 15:08, Ian Kent wrote:
>>>> On 16/5/24 12:56, Jinliang Zheng wrote:
>>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
>>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
>>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
>>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
>>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800,
>>>>>>>>> alexjlzheng@gmail.com wrote:
>>>>>>>>>> Hi, all
>>>>>>>>>>
>>>>>>>>>> I would like to ask if the conflict between xfs
>>>>>>>>>> inode recycle and vfs rcu-walk
>>>>>>>>>> which can lead to null pointer references has been resolved?
>>>>>>>>>>
>>>>>>>>>> I browsed through emails about the following
>>>>>>>>>> patches and their discussions:
>>>>>>>>>> -
>>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
>>>>>>>>>> -
>>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
>>>>>>>>>> -
>>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
>>>>>>>>>>
>>>>>>>>>> And then came to the conclusion that this
>>>>>>>>>> problem has not been solved, am I
>>>>>>>>>> right? Did I miss some patch that could solve this problem?
>>>>>>>>> We fixed the known problems this caused by turning off the VFS
>>>>>>>>> functionality that the rcu pathwalks kept tripping over. See
>>>>>>>>> commit
>>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink
>>>>>>>>> metadata buffers to
>>>>>>>>> the vfs").
>>>>>>>> Sorry for the delay.
>>>>>>>>
>>>>>>>> The problem I encountered in the production environment
>>>>>>>> was that during the
>>>>>>>> rcu walk process the ->get_link() pointer was NULL,
>>>>>>>> which caused a crash.
>>>>>>>>
>>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't
>>>>>>>> expose internal symlink
>>>>>>>> metadata buffers to the vfs") first appeared in:
>>>>>>>> -
>>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
>>>>>>>>
>>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how?
>>>>>>> I suggest reading the call stack from wherever the VFS enters
>>>>>>> the XFS
>>>>>>> readlink code. If you have a reliable reproducer, then
>>>>>>> apply this patch
>>>>>>> to your kernel (you haven't mentioned which one it is) and see
>>>>>>> if the
>>>>>>> bad dereference goes away.
>>>>>>>
>>>>>>> --D
>>>>>> Sorry for the delay.
>>>>>>
>>>>>> I encountered the following calltrace:
>>>>>>
>>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address:
>>>>>> 0000000000000000
>>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
>>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page
>>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
>>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
>>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
>>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
>>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
>>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
>>>>>> [20213.578884] RIP: 0010:0x0
>>>>>> [20213.578894] Code: Bad RIP value.
>>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
>>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
>>>>>> 0000000000000000
>>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
>>>>>> 0000000000000000
>>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
>>>>>> ffff889b9eeae380
>>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
>>>>>> 0000000000000000
>>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
>>>>>> ffffc90021ebfd48
>>>>>> [20213.578998] FS: 00007f89c534e740(0000)
>>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000
>>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
>>>>>> 00000000007706e0
>>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>> 0000000000000000
>>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>> 0000000000000400
>>>>>> [20213.579079] PKRU: 55555554
>>>>>> [20213.579087] Call Trace:
>>>>>> [20213.579099] trailing_symlink+0x1da/0x260
>>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220
>>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170
>>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
>>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0
>>>>>> [20213.579161] user_path_at_empty+0x3e/0x50
>>>>>> [20213.579172] vfs_statx+0x76/0xe0
>>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70
>>>>>> [20213.579194] ? fput+0x13/0x20
>>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300
>>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30
>>>>>> [20213.579225] ? fput+0x13/0x20
>>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0
>>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20
>>>>>> [20213.579256] do_syscall_64+0x4d/0x140
>>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>>>>>>
>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>
>>>>>>
>>>>> Please note that the kernel version I use is the one maintained by
>>>>> Tencent.Inc,
>>>>> and the baseline is v5.4. But in fact, in the latest upstream source
>>>>> tree,
>>>>> although the trailing_symlink() function has been removed, its logic
>>>>> has been
>>>>> moved to pick_link(), so the problem still exists.
>>>>>
>>>>> Ian Kent pointed out that try_to_unlazy() was introduced in
>>>>> pick_link() in the
>>>>> latest upstream source tree, but I don't understand why this can
>>>>> solve the NULL
>>>>> ->get_link pointer dereference problem, because ->get_link pointer
>>>>> will be
>>>>> dereferenced before try_to_unlazy().
>>>>>
>>>>> (I don't understand why Ian Kent's email didn't appear on the
>>>>> mailing list.)
>>>> It was something about html mail and I think my mail client was at
>>>> fault.
>>>>
>>>> In any case what you say is indeed correct, so the comment isn't
>>>> important.
>>>>
>>>>
>>>> Fact is it is still a race between the lockless path walk and inode
>>>> eviction
>>>>
>>>> and xfs recycling. I believe that the xfs recycling code is very
>>>> hard to
>>>> fix.
>>>>
>>>>
>>>> IIRC correctly putting a NULL check in pick_link() was not considered
>>>> acceptable
>>>>
>>>> but there must be a way that is acceptable to check this and
>>>> restart the
>>>> walk.
>>>>
>>>> Maybe there was a reluctance to suffer the overhead of restarting the
>>>> walk when
>>>>
>>>> it shouldn't be needed.
>>> Or perhaps the worry was that if it can become NULL it could also
>>> become a
>>> pointer to a
>>>
>>> different (incorrect) link altogether which could have really
>>> odd/unpleasant
>>> outcomes.
>> Yuck. I think that means that we can't reallocate freed inodes until
>> the rcu grace period expires. For inodes that haven't been evicted, I
>> think that also means we cannot recycle cached inodes until after an rcu
>> grace period expires; or maybe that we cannot reset i_op/i_fop and must
>> not leave the incore state in an inconsistent format?
>
> Yeah, not pretty!
>
> But shouldn't this case occur only occasionally?
>
>
> So issuing a cache miss shouldn't impact performance too much that was,
>
> I believe, the concern with waiting for the rcu grace period.
>
>
> Identifying it's happening should be possible, the vfs legitimize_*()
>
> has this job for various objects but maybe it's using vfs private info.
>
> (certainly it uses nameidata struct with a seq lock sequence number in
>
> it) but I assume it can be done somehow.
Unfortunately, when you start trying to work out how to do this, it
isn't at all
obvious how to do it ...
>
>
> My question then becomes is it viable/straight forward to not recycle
> such
>
> an inode and discard it instead so it gets re-created, I guess it's
> essentially
>
> a cache miss?
>
>
> Ian
>
>>
>> --D
>>
>>>>
>>>> The alternative would be to find some way to identify when it's unsafe
>>>> to reuse
>>>>
>>>> an inode marked for re-cycle before dropping rcu read, perhaps with
>>>> the
>>>> reference
>>>>
>>>> count plus the seqlock. Basically, to reuse inodes xfs will need to
>>>> identify when
>>>>
>>>> the race occurs and let the inode go away under rcu and create a
>>>> new one
>>>> if a race
>>>>
>>>> is detected. But possibly that isn't nearly as simple as it sounds?
>>>>
>>>>
>>>>> Thanks,
>>>>> Jinliang Zheng
>>>>>
>>>>>> And I analyzed the disassembly of trailing_symlink() and
>>>>>> confirmed that a NULL
>>>>>> ->get_link() happened here:
>>>>>>
>>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1)
>>>>>> [FTRACE NOP]
>>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
>>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
>>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
>>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
>>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
>>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
>>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
>>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov
>>>>>> %rdi,%rbx # rbx = &nameidate
>>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
>>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov
>>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4
>>>>>> <sysctl_protected_symlinks>
>>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
>>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
>>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je
>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
>>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov
>>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
>>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
>>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov
>>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid
>>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
>>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je
>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
>>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
>>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
>>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
>>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je
>>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>
>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
>>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov
>>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
>>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov
>>>>>> 0x50(%rbx),%rax # rax = nd->stack
>>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq
>>>>>> $0x0,0x20(%rax) # stack[0].name = NULL
>>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov
>>>>>> 0x48(%rbx),%eax # nd->depth
>>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov
>>>>>> 0x50(%rbx),%rdx # nd->stack
>>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov
>>>>>> 0xc8(%rbx),%r13 # nd->link_inode
>>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea
>>>>>> (%rax,%rax,2),%rax # rax = depth * 3
>>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl
>>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
>>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea
>>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last
>>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov
>>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry
>>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
>>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je
>>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>
>>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
>>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
>>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq
>>>>>> 0xffffffff812f8a00 <atime_needs_update>
>>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
>>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne
>>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>
>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
>>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
>>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
>>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
>>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
>>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq
>>>>>> 0xffffffff81424310 <security_inode_follow_link>
>>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
>>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
>>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne
>>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>
>>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
>>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
>>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
>>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je
>>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>
>>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
>>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
>>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je
>>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>
>>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
>>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je
>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
>>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je
>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
>>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
>>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
>>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
>>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
>>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
>>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
>>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
>>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq
>>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
>>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq
>>>>>> 0xffffffff812f8ae0 <touch_atime>
>>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq
>>>>>> 0xffffffff81a26410 <_cond_resched>
>>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp
>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
>>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
>>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je
>>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>
>>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
>>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je
>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov
>>>>>> $0xfffffffffffffff6,%r12
>>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
>>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne
>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
>>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
>>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
>>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je
>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
>>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
>>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je
>>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>
>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov
>>>>>> $0xffffffff82319b4f,%rdi
>>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov
>>>>>> $0xfffffffffffffff3,%r12
>>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq
>>>>>> 0xffffffff81161310 <audit_log_link_denied>
>>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp
>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov
>>>>>> $0xffffffff8230164d,%r12
>>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp
>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
>>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je
>>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>
>>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
>>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq
>>>>>> 0xffffffff812e2da0 <nd_jump_root>
>>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
>>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne
>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
>>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
>>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
>>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne
>>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>
>>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp
>>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>
>>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov
>>>>>> 0x20(%r13),%rax # inode->i_op
>>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
>>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
>>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
>>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov
>>>>>> 0x8(%rax),%rcx # inode_operations->get_link
>>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
>>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne
>>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>
>>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov
>>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0
>>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq
>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
>>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je
>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp
>>>>>> $0xfffffffffffff000,%r12
>>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe
>>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>
>>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq
>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor
>>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0
>>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
>>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq
>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
>>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp
>>>>>> $0xfffffffffffffff6,%rax
>>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne
>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
>>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq
>>>>>> 0xffffffff812e3840 <unlazy_walk>
>>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
>>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne
>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
>>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
>>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
>>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
>>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq
>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
>>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
>>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp
>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
>>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq
>>>>>> 0xffffffff812e3840 <unlazy_walk>
>>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
>>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne
>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
>>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq
>>>>>> 0xffffffff812f8ae0 <touch_atime>
>>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq
>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
>>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
>>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
>>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
>>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq
>>>>>> 0xffffffff811673f0 <__audit_inode>
>>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq
>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
>>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq
>>>>>> 0xffffffff812e4790 <set_root>
>>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq
>>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>
>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov
>>>>>> $0xfffffffffffffff6,%r12
>>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq
>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>>
>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>
>>>>>>
>>>>>>
>>>>>> According to my understanding, the problem solved by commit
>>>>>> 7b7820b83f23 ("xfs:
>>>>>> don't expose internal symlink metadata buffers to the vfs") is a
>>>>>> data NULL
>>>>>> pointer dereference, but the problem here is an instruction NULL
>>>>>> pointer
>>>>>> dereference.
>>>>>>
>>>>>> Further, I analyzed the possible triggering process as follows:
>>>>>>
>>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create
>>>>>> rcu_read_lock
>>>>>> read_seqcount_retry
>>>>>> (the last check) iput_final
>>>>>> evict
>>>>>> destroy_inode
>>>>>> xfs_fs_destroy_inode
>>>>>> xfs_inode_set_reclaim_tag xfs_ialloc
>>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc
>>>>>> set(ip, XFS_IRECLAIMABLE)
>>>>>> xfs_iget
>>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock
>>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
>>>>>> spin_lock(ip->i_flags_lock)
>>>>>>
>>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
>>>>>> set(ip, XFS_IRECLAIM)
>>>>>> spin_unlock(ip->i_flags_lock)
>>>>>> rcu_read_unlock
>>>>>> < ------------ >
>>>>>>
>>>>>> // miss synchronize_rcu()
>>>>>> xfs_reinit_inode
>>>>>> ->get_link = NULL
>>>>>> get_link() // NULL
>>>>>>
>>>>>> rcu_read_unlock
>>>>>>
>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>
>>>>>>
>>>>>>
>>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't
>>>>>> expose internal
>>>>>> symlink metadata buffers to the vfs"), we should start
>>>>>> processing this NULL
>>>>>> ->get_link pointer dereference.
>>>>>>
>>>>>> Or, am I thinking wrong somewhere?
>>>>>>
>>>>>> Thanks,
>>>>>> Jinliang Zheng
>>>>>>
>>>>>>>>> Apart from that issue, I'm not aware of any other issues that the
>>>>>>>>> XFS inode recycling directly exposes.
>>>>>>>>>
>>>>>>>>>> According to my understanding, the essence of
>>>>>>>>>> this problem is that XFS reuses
>>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk
>>>>>>>>>> assumes that this will not happen.
>>>>>>>>> It assumes that the inode will not change identity during the RCU
>>>>>>>>> grace period after the inode has been evicted from cache. We can
>>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU
>>>>>>>>> grace period as long as it is the same inode with the same
>>>>>>>>> content
>>>>>>>>> and same state.
>>>>>>>>>
>>>>>>>>> Problems *may* arise when we unlink the inode, then evict it,
>>>>>>>>> then a
>>>>>>>>> new file is created and the old slab cache memory address is used
>>>>>>>>> for the new inode. I describe the issue here:
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
>>>>>>>>>
>>>>>>>>>
>>>>>>>> And judging from the relevant emails, the main reason
>>>>>>>> why ->get_link() is set
>>>>>>>> to NULL should be the lack of synchronize_rcu() before
>>>>>>>> xfs_reinit_inode() when
>>>>>>>> the inode is chosen to be reused.
>>>>>>>>
>>>>>>>> However, perhaps due to performance reasons, this
>>>>>>>> solution has not been merged
>>>>>>>> for a long time. How is it now?
>>>>>>>>
>>>>>>>> Maybe I am missing something in the threads of mail?
>>>>>>>>
>>>>>>>> Thank you very much. :)
>>>>>>>> Jinliang Zheng
>>>>>>>>
>>>>>>>>> That said, we have exactly zero evidence that this is actually a
>>>>>>>>> problem in production systems. We did get systems tripping
>>>>>>>>> over the
>>>>>>>>> symlink issue, but there's no evidence that the
>>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the
>>>>>>>>> wild and
>>>>>>>>> hence there hasn't been any particular urgency to address it.
>>>>>>>>>
>>>>>>>>>> Are there any recommended workarounds until an
>>>>>>>>>> elegant and efficient solution
>>>>>>>>>> can be proposed? After all, causing a crash is
>>>>>>>>>> extremely unacceptable in a
>>>>>>>>>> production environment.
>>>>>>>>> What crashes are you seeing in your production environment?
>>>>>>>>>
>>>>>>>>> -Dave.
>>>>>>>>> --
>>>>>>>>> Dave Chinner
>>>>>>>>> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-21 2:13 ` Ian Kent
@ 2024-05-26 15:04 ` Jinliang Zheng
2024-05-26 17:21 ` Paul E. McKenney
2024-05-26 23:51 ` Ian Kent
1 sibling, 1 reply; 19+ messages in thread
From: Jinliang Zheng @ 2024-05-26 15:04 UTC (permalink / raw)
To: raven
Cc: alexjlzheng, alexjlzheng, bfoster, david, djwong, linux-fsdevel,
linux-xfs, rcu
On Tue, 21 May 2024 at 10:13:38 +0800, Ian Kent wrote:
> On 21/5/24 09:35, Ian Kent wrote:
> > On 21/5/24 01:36, Darrick J. Wong wrote:
> >> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> >>> On 16/5/24 15:08, Ian Kent wrote:
> >>>> On 16/5/24 12:56, Jinliang Zheng wrote:
> >>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
> >>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
> >>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
> >>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> >>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800,
> >>>>>>>>> alexjlzheng@gmail.com wrote:
> >>>>>>>>>> Hi, all
> >>>>>>>>>>
> >>>>>>>>>> I would like to ask if the conflict between xfs
> >>>>>>>>>> inode recycle and vfs rcu-walk
> >>>>>>>>>> which can lead to null pointer references has been resolved?
> >>>>>>>>>>
> >>>>>>>>>> I browsed through emails about the following
> >>>>>>>>>> patches and their discussions:
> >>>>>>>>>> -
> >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> >>>>>>>>>> -
> >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> >>>>>>>>>> -
> >>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> >>>>>>>>>>
> >>>>>>>>>> And then came to the conclusion that this
> >>>>>>>>>> problem has not been solved, am I
> >>>>>>>>>> right? Did I miss some patch that could solve this problem?
> >>>>>>>>> We fixed the known problems this caused by turning off the VFS
> >>>>>>>>> functionality that the rcu pathwalks kept tripping over. See
> >>>>>>>>> commit
> >>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink
> >>>>>>>>> metadata buffers to
> >>>>>>>>> the vfs").
> >>>>>>>> Sorry for the delay.
> >>>>>>>>
> >>>>>>>> The problem I encountered in the production environment
> >>>>>>>> was that during the
> >>>>>>>> rcu walk process the ->get_link() pointer was NULL,
> >>>>>>>> which caused a crash.
> >>>>>>>>
> >>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't
> >>>>>>>> expose internal symlink
> >>>>>>>> metadata buffers to the vfs") first appeared in:
> >>>>>>>> -
> >>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
> >>>>>>>>
> >>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how?
> >>>>>>> I suggest reading the call stack from wherever the VFS enters
> >>>>>>> the XFS
> >>>>>>> readlink code. If you have a reliable reproducer, then
> >>>>>>> apply this patch
> >>>>>>> to your kernel (you haven't mentioned which one it is) and see
> >>>>>>> if the
> >>>>>>> bad dereference goes away.
> >>>>>>>
> >>>>>>> --D
> >>>>>> Sorry for the delay.
> >>>>>>
> >>>>>> I encountered the following calltrace:
> >>>>>>
> >>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address:
> >>>>>> 0000000000000000
> >>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
> >>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page
> >>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> >>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
> >>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> >>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> >>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> >>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> >>>>>> [20213.578884] RIP: 0010:0x0
> >>>>>> [20213.578894] Code: Bad RIP value.
> >>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> >>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> >>>>>> 0000000000000000
> >>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> >>>>>> 0000000000000000
> >>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> >>>>>> ffff889b9eeae380
> >>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> >>>>>> 0000000000000000
> >>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> >>>>>> ffffc90021ebfd48
> >>>>>> [20213.578998] FS: 00007f89c534e740(0000)
> >>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> >>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> >>>>>> 00000000007706e0
> >>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>>>> 0000000000000000
> >>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>>>>> 0000000000000400
> >>>>>> [20213.579079] PKRU: 55555554
> >>>>>> [20213.579087] Call Trace:
> >>>>>> [20213.579099] trailing_symlink+0x1da/0x260
> >>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220
> >>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170
> >>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
> >>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0
> >>>>>> [20213.579161] user_path_at_empty+0x3e/0x50
> >>>>>> [20213.579172] vfs_statx+0x76/0xe0
> >>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70
> >>>>>> [20213.579194] ? fput+0x13/0x20
> >>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300
> >>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30
> >>>>>> [20213.579225] ? fput+0x13/0x20
> >>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0
> >>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20
> >>>>>> [20213.579256] do_syscall_64+0x4d/0x140
> >>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> >>>>>>
> >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >>>>>>
> >>>>>>
> >>>>> Please note that the kernel version I use is the one maintained by
> >>>>> Tencent.Inc,
> >>>>> and the baseline is v5.4. But in fact, in the latest upstream source
> >>>>> tree,
> >>>>> although the trailing_symlink() function has been removed, its logic
> >>>>> has been
> >>>>> moved to pick_link(), so the problem still exists.
> >>>>>
> >>>>> Ian Kent pointed out that try_to_unlazy() was introduced in
> >>>>> pick_link() in the
> >>>>> latest upstream source tree, but I don't understand why this can
> >>>>> solve the NULL
> >>>>> ->get_link pointer dereference problem, because ->get_link pointer
> >>>>> will be
> >>>>> dereferenced before try_to_unlazy().
> >>>>>
> >>>>> (I don't understand why Ian Kent's email didn't appear on the
> >>>>> mailing list.)
> >>>> It was something about html mail and I think my mail client was at
> >>>> fault.
> >>>>
> >>>> In any case what you say is indeed correct, so the comment isn't
> >>>> important.
> >>>>
> >>>>
> >>>> Fact is it is still a race between the lockless path walk and inode
> >>>> eviction
> >>>>
> >>>> and xfs recycling. I believe that the xfs recycling code is very
> >>>> hard to
> >>>> fix.
> >>>>
> >>>>
> >>>> IIRC correctly putting a NULL check in pick_link() was not considered
> >>>> acceptable
> >>>>
> >>>> but there must be a way that is acceptable to check this and
> >>>> restart the
> >>>> walk.
> >>>>
> >>>> Maybe there was a reluctance to suffer the overhead of restarting the
> >>>> walk when
> >>>>
> >>>> it shouldn't be needed.
> >>> Or perhaps the worry was that if it can become NULL it could also
> >>> become a
> >>> pointer to a
> >>>
> >>> different (incorrect) link altogether which could have really
> >>> odd/unpleasant
> >>> outcomes.
> >> Yuck. I think that means that we can't reallocate freed inodes until
> >> the rcu grace period expires. For inodes that haven't been evicted, I
> >> think that also means we cannot recycle cached inodes until after an rcu
> >> grace period expires; or maybe that we cannot reset i_op/i_fop and must
> >> not leave the incore state in an inconsistent format?
> >
> > Yeah, not pretty!
> >
> > But shouldn't this case occur only occasionally?
> >
> >
> > So issuing a cache miss shouldn't impact performance too much that was,
> >
> > I believe, the concern with waiting for the rcu grace period.
> >
> >
> > Identifying it's happening should be possible, the vfs legitimize_*()
> >
> > has this job for various objects but maybe it's using vfs private info.
> >
> > (certainly it uses nameidata struct with a seq lock sequence number in
> >
> > it) but I assume it can be done somehow.
>
> Unfortunately, when you start trying to work out how to do this, it
> isn't at all
>
> obvious how to do it ...
How about adding a synchronize_rcu() in front of xfs_reinit_inode()?
Maybe this will affect performance, but compared to crashing the kernel, this
performance penalty is completely worth it.
And, perhaps we can gradually take some optimization measures, such as:
https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
Best Regards,
Jinliang Zheng
>
>
> >
> >
> > My question then becomes is it viable/straight forward to not recycle
> > such
> >
> > an inode and discard it instead so it gets re-created, I guess it's
> > essentially
> >
> > a cache miss?
> >
> >
> > Ian
> >
> >>
> >> --D
> >>
> >>>>
> >>>> The alternative would be to find some way to identify when it's unsafe
> >>>> to reuse
> >>>>
> >>>> an inode marked for re-cycle before dropping rcu read, perhaps with
> >>>> the
> >>>> reference
> >>>>
> >>>> count plus the seqlock. Basically, to reuse inodes xfs will need to
> >>>> identify when
> >>>>
> >>>> the race occurs and let the inode go away under rcu and create a
> >>>> new one
> >>>> if a race
> >>>>
> >>>> is detected. But possibly that isn't nearly as simple as it sounds?
> >>>>
> >>>>
> >>>>> Thanks,
> >>>>> Jinliang Zheng
> >>>>>
> >>>>>> And I analyzed the disassembly of trailing_symlink() and
> >>>>>> confirmed that a NULL
> >>>>>> ->get_link() happened here:
> >>>>>>
> >>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1)
> >>>>>> [FTRACE NOP]
> >>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
> >>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
> >>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
> >>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
> >>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
> >>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
> >>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
> >>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov
> >>>>>> %rdi,%rbx # rbx = &nameidate
> >>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
> >>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov
> >>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4
> >>>>>> <sysctl_protected_symlinks>
> >>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
> >>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
> >>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je
> >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
> >>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
> >>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov
> >>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
> >>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
> >>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov
> >>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid
> >>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
> >>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je
> >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
> >>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
> >>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
> >>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
> >>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
> >>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je
> >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>
> >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
> >>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov
> >>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
> >>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov
> >>>>>> 0x50(%rbx),%rax # rax = nd->stack
> >>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq
> >>>>>> $0x0,0x20(%rax) # stack[0].name = NULL
> >>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov
> >>>>>> 0x48(%rbx),%eax # nd->depth
> >>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov
> >>>>>> 0x50(%rbx),%rdx # nd->stack
> >>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov
> >>>>>> 0xc8(%rbx),%r13 # nd->link_inode
> >>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea
> >>>>>> (%rax,%rax,2),%rax # rax = depth * 3
> >>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl
> >>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
> >>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea
> >>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last
> >>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov
> >>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry
> >>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
> >>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je
> >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>
> >>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
> >>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
> >>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq
> >>>>>> 0xffffffff812f8a00 <atime_needs_update>
> >>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
> >>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne
> >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>
> >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
> >>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
> >>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
> >>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
> >>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
> >>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq
> >>>>>> 0xffffffff81424310 <security_inode_follow_link>
> >>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
> >>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
> >>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne
> >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>
> >>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
> >>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
> >>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
> >>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je
> >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>
> >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
> >>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
> >>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je
> >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>
> >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
> >>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je
> >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
> >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
> >>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je
> >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
> >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
> >>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
> >>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
> >>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
> >>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
> >>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
> >>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
> >>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
> >>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq
> >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
> >>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq
> >>>>>> 0xffffffff812f8ae0 <touch_atime>
> >>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq
> >>>>>> 0xffffffff81a26410 <_cond_resched>
> >>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp
> >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
> >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
> >>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
> >>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je
> >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>
> >>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
> >>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je
> >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
> >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov
> >>>>>> $0xfffffffffffffff6,%r12
> >>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
> >>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne
> >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> >>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
> >>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
> >>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
> >>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je
> >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
> >>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
> >>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
> >>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je
> >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>
> >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov
> >>>>>> $0xffffffff82319b4f,%rdi
> >>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov
> >>>>>> $0xfffffffffffffff3,%r12
> >>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq
> >>>>>> 0xffffffff81161310 <audit_log_link_denied>
> >>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp
> >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov
> >>>>>> $0xffffffff8230164d,%r12
> >>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp
> >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
> >>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je
> >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>
> >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
> >>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq
> >>>>>> 0xffffffff812e2da0 <nd_jump_root>
> >>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
> >>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne
> >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
> >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
> >>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
> >>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
> >>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne
> >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>
> >>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp
> >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>
> >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov
> >>>>>> 0x20(%r13),%rax # inode->i_op
> >>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
> >>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
> >>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
> >>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov
> >>>>>> 0x8(%rax),%rcx # inode_operations->get_link
> >>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
> >>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne
> >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>
> >>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov
> >>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0
> >>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq
> >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> >>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
> >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
> >>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je
> >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
> >>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp
> >>>>>> $0xfffffffffffff000,%r12
> >>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe
> >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>
> >>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq
> >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor
> >>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0
> >>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
> >>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq
> >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> >>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
> >>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp
> >>>>>> $0xfffffffffffffff6,%rax
> >>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne
> >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> >>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
> >>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq
> >>>>>> 0xffffffff812e3840 <unlazy_walk>
> >>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
> >>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne
> >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
> >>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
> >>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
> >>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
> >>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
> >>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq
> >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
> >>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
> >>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp
> >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
> >>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq
> >>>>>> 0xffffffff812e3840 <unlazy_walk>
> >>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
> >>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne
> >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
> >>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
> >>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq
> >>>>>> 0xffffffff812f8ae0 <touch_atime>
> >>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq
> >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
> >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
> >>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
> >>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
> >>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
> >>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq
> >>>>>> 0xffffffff811673f0 <__audit_inode>
> >>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq
> >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
> >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
> >>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq
> >>>>>> 0xffffffff812e4790 <set_root>
> >>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq
> >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>
> >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov
> >>>>>> $0xfffffffffffffff6,%r12
> >>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq
> >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> >>>>>>
> >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> According to my understanding, the problem solved by commit
> >>>>>> 7b7820b83f23 ("xfs:
> >>>>>> don't expose internal symlink metadata buffers to the vfs") is a
> >>>>>> data NULL
> >>>>>> pointer dereference, but the problem here is an instruction NULL
> >>>>>> pointer
> >>>>>> dereference.
> >>>>>>
> >>>>>> Further, I analyzed the possible triggering process as follows:
> >>>>>>
> >>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create
> >>>>>> rcu_read_lock
> >>>>>> read_seqcount_retry
> >>>>>> (the last check) iput_final
> >>>>>> evict
> >>>>>> destroy_inode
> >>>>>> xfs_fs_destroy_inode
> >>>>>> xfs_inode_set_reclaim_tag xfs_ialloc
> >>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc
> >>>>>> set(ip, XFS_IRECLAIMABLE)
> >>>>>> xfs_iget
> >>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock
> >>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
> >>>>>> spin_lock(ip->i_flags_lock)
> >>>>>>
> >>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
> >>>>>> set(ip, XFS_IRECLAIM)
> >>>>>> spin_unlock(ip->i_flags_lock)
> >>>>>> rcu_read_unlock
> >>>>>> < ------------ >
> >>>>>>
> >>>>>> // miss synchronize_rcu()
> >>>>>> xfs_reinit_inode
> >>>>>> ->get_link = NULL
> >>>>>> get_link() // NULL
> >>>>>>
> >>>>>> rcu_read_unlock
> >>>>>>
> >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't
> >>>>>> expose internal
> >>>>>> symlink metadata buffers to the vfs"), we should start
> >>>>>> processing this NULL
> >>>>>> ->get_link pointer dereference.
> >>>>>>
> >>>>>> Or, am I thinking wrong somewhere?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jinliang Zheng
> >>>>>>
> >>>>>>>>> Apart from that issue, I'm not aware of any other issues that the
> >>>>>>>>> XFS inode recycling directly exposes.
> >>>>>>>>>
> >>>>>>>>>> According to my understanding, the essence of
> >>>>>>>>>> this problem is that XFS reuses
> >>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk
> >>>>>>>>>> assumes that this will not happen.
> >>>>>>>>> It assumes that the inode will not change identity during the RCU
> >>>>>>>>> grace period after the inode has been evicted from cache. We can
> >>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU
> >>>>>>>>> grace period as long as it is the same inode with the same
> >>>>>>>>> content
> >>>>>>>>> and same state.
> >>>>>>>>>
> >>>>>>>>> Problems *may* arise when we unlink the inode, then evict it,
> >>>>>>>>> then a
> >>>>>>>>> new file is created and the old slab cache memory address is used
> >>>>>>>>> for the new inode. I describe the issue here:
> >>>>>>>>>
> >>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> And judging from the relevant emails, the main reason
> >>>>>>>> why ->get_link() is set
> >>>>>>>> to NULL should be the lack of synchronize_rcu() before
> >>>>>>>> xfs_reinit_inode() when
> >>>>>>>> the inode is chosen to be reused.
> >>>>>>>>
> >>>>>>>> However, perhaps due to performance reasons, this
> >>>>>>>> solution has not been merged
> >>>>>>>> for a long time. How is it now?
> >>>>>>>>
> >>>>>>>> Maybe I am missing something in the threads of mail?
> >>>>>>>>
> >>>>>>>> Thank you very much. :)
> >>>>>>>> Jinliang Zheng
> >>>>>>>>
> >>>>>>>>> That said, we have exactly zero evidence that this is actually a
> >>>>>>>>> problem in production systems. We did get systems tripping
> >>>>>>>>> over the
> >>>>>>>>> symlink issue, but there's no evidence that the
> >>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the
> >>>>>>>>> wild and
> >>>>>>>>> hence there hasn't been any particular urgency to address it.
> >>>>>>>>>
> >>>>>>>>>> Are there any recommended workarounds until an
> >>>>>>>>>> elegant and efficient solution
> >>>>>>>>>> can be proposed? After all, causing a crash is
> >>>>>>>>>> extremely unacceptable in a
> >>>>>>>>>> production environment.
> >>>>>>>>> What crashes are you seeing in your production environment?
> >>>>>>>>>
> >>>>>>>>> -Dave.
> >>>>>>>>> --
> >>>>>>>>> Dave Chinner
> >>>>>>>>> david@fromorbit.com
> >
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-26 15:04 ` Jinliang Zheng
@ 2024-05-26 17:21 ` Paul E. McKenney
0 siblings, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2024-05-26 17:21 UTC (permalink / raw)
To: Jinliang Zheng
Cc: raven, alexjlzheng, bfoster, david, djwong, linux-fsdevel,
linux-xfs, rcu
On Sun, May 26, 2024 at 11:04:14PM +0800, Jinliang Zheng wrote:
> On Tue, 21 May 2024 at 10:13:38 +0800, Ian Kent wrote:
> > On 21/5/24 09:35, Ian Kent wrote:
> > > On 21/5/24 01:36, Darrick J. Wong wrote:
> > >> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> > >>> On 16/5/24 15:08, Ian Kent wrote:
> > >>>> On 16/5/24 12:56, Jinliang Zheng wrote:
> > >>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
> > >>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
> > >>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
> > >>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> > >>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800,
> > >>>>>>>>> alexjlzheng@gmail.com wrote:
> > >>>>>>>>>> Hi, all
> > >>>>>>>>>>
> > >>>>>>>>>> I would like to ask if the conflict between xfs
> > >>>>>>>>>> inode recycle and vfs rcu-walk
> > >>>>>>>>>> which can lead to null pointer references has been resolved?
> > >>>>>>>>>>
> > >>>>>>>>>> I browsed through emails about the following
> > >>>>>>>>>> patches and their discussions:
> > >>>>>>>>>> -
> > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> > >>>>>>>>>> -
> > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> > >>>>>>>>>> -
> > >>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> > >>>>>>>>>>
> > >>>>>>>>>> And then came to the conclusion that this
> > >>>>>>>>>> problem has not been solved, am I
> > >>>>>>>>>> right? Did I miss some patch that could solve this problem?
> > >>>>>>>>> We fixed the known problems this caused by turning off the VFS
> > >>>>>>>>> functionality that the rcu pathwalks kept tripping over. See
> > >>>>>>>>> commit
> > >>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink
> > >>>>>>>>> metadata buffers to
> > >>>>>>>>> the vfs").
> > >>>>>>>> Sorry for the delay.
> > >>>>>>>>
> > >>>>>>>> The problem I encountered in the production environment
> > >>>>>>>> was that during the
> > >>>>>>>> rcu walk process the ->get_link() pointer was NULL,
> > >>>>>>>> which caused a crash.
> > >>>>>>>>
> > >>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't
> > >>>>>>>> expose internal symlink
> > >>>>>>>> metadata buffers to the vfs") first appeared in:
> > >>>>>>>> -
> > >>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
> > >>>>>>>>
> > >>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how?
> > >>>>>>> I suggest reading the call stack from wherever the VFS enters
> > >>>>>>> the XFS
> > >>>>>>> readlink code. If you have a reliable reproducer, then
> > >>>>>>> apply this patch
> > >>>>>>> to your kernel (you haven't mentioned which one it is) and see
> > >>>>>>> if the
> > >>>>>>> bad dereference goes away.
> > >>>>>>>
> > >>>>>>> --D
> > >>>>>> Sorry for the delay.
> > >>>>>>
> > >>>>>> I encountered the following calltrace:
> > >>>>>>
> > >>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address:
> > >>>>>> 0000000000000000
> > >>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > >>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page
> > >>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > >>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > >>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > >>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > >>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > >>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > >>>>>> [20213.578884] RIP: 0010:0x0
> > >>>>>> [20213.578894] Code: Bad RIP value.
> > >>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > >>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > >>>>>> 0000000000000000
> > >>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > >>>>>> 0000000000000000
> > >>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > >>>>>> ffff889b9eeae380
> > >>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > >>>>>> 0000000000000000
> > >>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > >>>>>> ffffc90021ebfd48
> > >>>>>> [20213.578998] FS: 00007f89c534e740(0000)
> > >>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > >>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > >>>>>> 00000000007706e0
> > >>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > >>>>>> 0000000000000000
> > >>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > >>>>>> 0000000000000400
> > >>>>>> [20213.579079] PKRU: 55555554
> > >>>>>> [20213.579087] Call Trace:
> > >>>>>> [20213.579099] trailing_symlink+0x1da/0x260
> > >>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220
> > >>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170
> > >>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
> > >>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0
> > >>>>>> [20213.579161] user_path_at_empty+0x3e/0x50
> > >>>>>> [20213.579172] vfs_statx+0x76/0xe0
> > >>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70
> > >>>>>> [20213.579194] ? fput+0x13/0x20
> > >>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300
> > >>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30
> > >>>>>> [20213.579225] ? fput+0x13/0x20
> > >>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0
> > >>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20
> > >>>>>> [20213.579256] do_syscall_64+0x4d/0x140
> > >>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > >>>>>>
> > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > >>>>>>
> > >>>>>>
> > >>>>> Please note that the kernel version I use is the one maintained by
> > >>>>> Tencent.Inc,
> > >>>>> and the baseline is v5.4. But in fact, in the latest upstream source
> > >>>>> tree,
> > >>>>> although the trailing_symlink() function has been removed, its logic
> > >>>>> has been
> > >>>>> moved to pick_link(), so the problem still exists.
> > >>>>>
> > >>>>> Ian Kent pointed out that try_to_unlazy() was introduced in
> > >>>>> pick_link() in the
> > >>>>> latest upstream source tree, but I don't understand why this can
> > >>>>> solve the NULL
> > >>>>> ->get_link pointer dereference problem, because ->get_link pointer
> > >>>>> will be
> > >>>>> dereferenced before try_to_unlazy().
> > >>>>>
> > >>>>> (I don't understand why Ian Kent's email didn't appear on the
> > >>>>> mailing list.)
> > >>>> It was something about html mail and I think my mail client was at
> > >>>> fault.
> > >>>>
> > >>>> In any case what you say is indeed correct, so the comment isn't
> > >>>> important.
> > >>>>
> > >>>>
> > >>>> Fact is it is still a race between the lockless path walk and inode
> > >>>> eviction
> > >>>>
> > >>>> and xfs recycling. I believe that the xfs recycling code is very
> > >>>> hard to
> > >>>> fix.
> > >>>>
> > >>>>
> > >>>> IIRC correctly putting a NULL check in pick_link() was not considered
> > >>>> acceptable
> > >>>>
> > >>>> but there must be a way that is acceptable to check this and
> > >>>> restart the
> > >>>> walk.
> > >>>>
> > >>>> Maybe there was a reluctance to suffer the overhead of restarting the
> > >>>> walk when
> > >>>>
> > >>>> it shouldn't be needed.
> > >>> Or perhaps the worry was that if it can become NULL it could also
> > >>> become a
> > >>> pointer to a
> > >>>
> > >>> different (incorrect) link altogether which could have really
> > >>> odd/unpleasant
> > >>> outcomes.
> > >> Yuck. I think that means that we can't reallocate freed inodes until
> > >> the rcu grace period expires. For inodes that haven't been evicted, I
> > >> think that also means we cannot recycle cached inodes until after an rcu
> > >> grace period expires; or maybe that we cannot reset i_op/i_fop and must
> > >> not leave the incore state in an inconsistent format?
> > >
> > > Yeah, not pretty!
> > >
> > > But shouldn't this case occur only occasionally?
> > >
> > >
> > > So issuing a cache miss shouldn't impact performance too much that was,
> > >
> > > I believe, the concern with waiting for the rcu grace period.
> > >
> > >
> > > Identifying it's happening should be possible, the vfs legitimize_*()
> > >
> > > has this job for various objects but maybe it's using vfs private info.
> > >
> > > (certainly it uses nameidata struct with a seq lock sequence number in
> > >
> > > it) but I assume it can be done somehow.
> >
> > Unfortunately, when you start trying to work out how to do this, it
> > isn't at all
> >
> > obvious how to do it ...
>
> How about adding a synchronize_rcu() in front of xfs_reinit_inode()?
>
> Maybe this will affect performance, but compared to crashing the kernel, this
> performance penalty is completely worth it.
There is always synchronize_rcu_expedited(), especially if this is a
relatively rare operation. The typical synchronize_rcu() delay is tens
of milliseconds, while the typical synchronize_rcu_expedited() delay is
tens to hundreds of microseconds.
The downside of synchronize_rcu_expedited() is higher per-RCU-update
CPU utilization. Plus added IPIs.
> And, perhaps we can gradually take some optimization measures, such as:
> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
I cannot claim to fully understand this code, but I do agree that the use
of things like poll_state_synchronize_rcu() and cond_synchronize_rcu()
could greatly reduce the number of grace-period waits. In recent kernels,
there is cond_synchronize_rcu_expedited() as well.
Thanx, Paul
> Best Regards,
> Jinliang Zheng
>
> >
> >
> > >
> > >
> > > My question then becomes is it viable/straight forward to not recycle
> > > such
> > >
> > > an inode and discard it instead so it gets re-created, I guess it's
> > > essentially
> > >
> > > a cache miss?
> > >
> > >
> > > Ian
> > >
> > >>
> > >> --D
> > >>
> > >>>>
> > >>>> The alternative would be to find some way to identify when it's unsafe
> > >>>> to reuse
> > >>>>
> > >>>> an inode marked for re-cycle before dropping rcu read, perhaps with
> > >>>> the
> > >>>> reference
> > >>>>
> > >>>> count plus the seqlock. Basically, to reuse inodes xfs will need to
> > >>>> identify when
> > >>>>
> > >>>> the race occurs and let the inode go away under rcu and create a
> > >>>> new one
> > >>>> if a race
> > >>>>
> > >>>> is detected. But possibly that isn't nearly as simple as it sounds?
> > >>>>
> > >>>>
> > >>>>> Thanks,
> > >>>>> Jinliang Zheng
> > >>>>>
> > >>>>>> And I analyzed the disassembly of trailing_symlink() and
> > >>>>>> confirmed that a NULL
> > >>>>>> ->get_link() happened here:
> > >>>>>>
> > >>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1)
> > >>>>>> [FTRACE NOP]
> > >>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
> > >>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
> > >>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
> > >>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
> > >>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
> > >>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
> > >>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
> > >>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov
> > >>>>>> %rdi,%rbx # rbx = &nameidate
> > >>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
> > >>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov
> > >>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4
> > >>>>>> <sysctl_protected_symlinks>
> > >>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
> > >>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
> > >>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je
> > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
> > >>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
> > >>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov
> > >>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
> > >>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
> > >>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov
> > >>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid
> > >>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
> > >>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je
> > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
> > >>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
> > >>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
> > >>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
> > >>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
> > >>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je
> > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>
> > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
> > >>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov
> > >>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
> > >>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov
> > >>>>>> 0x50(%rbx),%rax # rax = nd->stack
> > >>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq
> > >>>>>> $0x0,0x20(%rax) # stack[0].name = NULL
> > >>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov
> > >>>>>> 0x48(%rbx),%eax # nd->depth
> > >>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov
> > >>>>>> 0x50(%rbx),%rdx # nd->stack
> > >>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov
> > >>>>>> 0xc8(%rbx),%r13 # nd->link_inode
> > >>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea
> > >>>>>> (%rax,%rax,2),%rax # rax = depth * 3
> > >>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl
> > >>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
> > >>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea
> > >>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last
> > >>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov
> > >>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry
> > >>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
> > >>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je
> > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>
> > >>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
> > >>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
> > >>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq
> > >>>>>> 0xffffffff812f8a00 <atime_needs_update>
> > >>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
> > >>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne
> > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>
> > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
> > >>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
> > >>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
> > >>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
> > >>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
> > >>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq
> > >>>>>> 0xffffffff81424310 <security_inode_follow_link>
> > >>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
> > >>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
> > >>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne
> > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>
> > >>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
> > >>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
> > >>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
> > >>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je
> > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>
> > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
> > >>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
> > >>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je
> > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>
> > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
> > >>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je
> > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
> > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
> > >>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je
> > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
> > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
> > >>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
> > >>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
> > >>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
> > >>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
> > >>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
> > >>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
> > >>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
> > >>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq
> > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
> > >>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq
> > >>>>>> 0xffffffff812f8ae0 <touch_atime>
> > >>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq
> > >>>>>> 0xffffffff81a26410 <_cond_resched>
> > >>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp
> > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
> > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
> > >>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
> > >>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je
> > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>
> > >>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
> > >>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je
> > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
> > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov
> > >>>>>> $0xfffffffffffffff6,%r12
> > >>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
> > >>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne
> > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> > >>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
> > >>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
> > >>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
> > >>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je
> > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
> > >>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
> > >>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
> > >>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je
> > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>
> > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov
> > >>>>>> $0xffffffff82319b4f,%rdi
> > >>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov
> > >>>>>> $0xfffffffffffffff3,%r12
> > >>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq
> > >>>>>> 0xffffffff81161310 <audit_log_link_denied>
> > >>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp
> > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov
> > >>>>>> $0xffffffff8230164d,%r12
> > >>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp
> > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
> > >>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je
> > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>
> > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
> > >>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq
> > >>>>>> 0xffffffff812e2da0 <nd_jump_root>
> > >>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
> > >>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne
> > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
> > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
> > >>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
> > >>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
> > >>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne
> > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>
> > >>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp
> > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>
> > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov
> > >>>>>> 0x20(%r13),%rax # inode->i_op
> > >>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
> > >>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
> > >>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
> > >>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov
> > >>>>>> 0x8(%rax),%rcx # inode_operations->get_link
> > >>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
> > >>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne
> > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>
> > >>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov
> > >>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0
> > >>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq
> > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> > >>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
> > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
> > >>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je
> > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
> > >>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp
> > >>>>>> $0xfffffffffffff000,%r12
> > >>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe
> > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>
> > >>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq
> > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor
> > >>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0
> > >>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
> > >>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq
> > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
> > >>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
> > >>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp
> > >>>>>> $0xfffffffffffffff6,%rax
> > >>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne
> > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> > >>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
> > >>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq
> > >>>>>> 0xffffffff812e3840 <unlazy_walk>
> > >>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
> > >>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne
> > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
> > >>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
> > >>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
> > >>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
> > >>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
> > >>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq
> > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
> > >>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
> > >>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp
> > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
> > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
> > >>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq
> > >>>>>> 0xffffffff812e3840 <unlazy_walk>
> > >>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
> > >>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne
> > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
> > >>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
> > >>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq
> > >>>>>> 0xffffffff812f8ae0 <touch_atime>
> > >>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq
> > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
> > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
> > >>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
> > >>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
> > >>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
> > >>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq
> > >>>>>> 0xffffffff811673f0 <__audit_inode>
> > >>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq
> > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
> > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
> > >>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq
> > >>>>>> 0xffffffff812e4790 <set_root>
> > >>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq
> > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>
> > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov
> > >>>>>> $0xfffffffffffffff6,%r12
> > >>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq
> > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
> > >>>>>>
> > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> According to my understanding, the problem solved by commit
> > >>>>>> 7b7820b83f23 ("xfs:
> > >>>>>> don't expose internal symlink metadata buffers to the vfs") is a
> > >>>>>> data NULL
> > >>>>>> pointer dereference, but the problem here is an instruction NULL
> > >>>>>> pointer
> > >>>>>> dereference.
> > >>>>>>
> > >>>>>> Further, I analyzed the possible triggering process as follows:
> > >>>>>>
> > >>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create
> > >>>>>> rcu_read_lock
> > >>>>>> read_seqcount_retry
> > >>>>>> (the last check) iput_final
> > >>>>>> evict
> > >>>>>> destroy_inode
> > >>>>>> xfs_fs_destroy_inode
> > >>>>>> xfs_inode_set_reclaim_tag xfs_ialloc
> > >>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc
> > >>>>>> set(ip, XFS_IRECLAIMABLE)
> > >>>>>> xfs_iget
> > >>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock
> > >>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
> > >>>>>> spin_lock(ip->i_flags_lock)
> > >>>>>>
> > >>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
> > >>>>>> set(ip, XFS_IRECLAIM)
> > >>>>>> spin_unlock(ip->i_flags_lock)
> > >>>>>> rcu_read_unlock
> > >>>>>> < ------------ >
> > >>>>>>
> > >>>>>> // miss synchronize_rcu()
> > >>>>>> xfs_reinit_inode
> > >>>>>> ->get_link = NULL
> > >>>>>> get_link() // NULL
> > >>>>>>
> > >>>>>> rcu_read_unlock
> > >>>>>>
> > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't
> > >>>>>> expose internal
> > >>>>>> symlink metadata buffers to the vfs"), we should start
> > >>>>>> processing this NULL
> > >>>>>> ->get_link pointer dereference.
> > >>>>>>
> > >>>>>> Or, am I thinking wrong somewhere?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Jinliang Zheng
> > >>>>>>
> > >>>>>>>>> Apart from that issue, I'm not aware of any other issues that the
> > >>>>>>>>> XFS inode recycling directly exposes.
> > >>>>>>>>>
> > >>>>>>>>>> According to my understanding, the essence of
> > >>>>>>>>>> this problem is that XFS reuses
> > >>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk
> > >>>>>>>>>> assumes that this will not happen.
> > >>>>>>>>> It assumes that the inode will not change identity during the RCU
> > >>>>>>>>> grace period after the inode has been evicted from cache. We can
> > >>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU
> > >>>>>>>>> grace period as long as it is the same inode with the same
> > >>>>>>>>> content
> > >>>>>>>>> and same state.
> > >>>>>>>>>
> > >>>>>>>>> Problems *may* arise when we unlink the inode, then evict it,
> > >>>>>>>>> then a
> > >>>>>>>>> new file is created and the old slab cache memory address is used
> > >>>>>>>>> for the new inode. I describe the issue here:
> > >>>>>>>>>
> > >>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>> And judging from the relevant emails, the main reason
> > >>>>>>>> why ->get_link() is set
> > >>>>>>>> to NULL should be the lack of synchronize_rcu() before
> > >>>>>>>> xfs_reinit_inode() when
> > >>>>>>>> the inode is chosen to be reused.
> > >>>>>>>>
> > >>>>>>>> However, perhaps due to performance reasons, this
> > >>>>>>>> solution has not been merged
> > >>>>>>>> for a long time. How is it now?
> > >>>>>>>>
> > >>>>>>>> Maybe I am missing something in the threads of mail?
> > >>>>>>>>
> > >>>>>>>> Thank you very much. :)
> > >>>>>>>> Jinliang Zheng
> > >>>>>>>>
> > >>>>>>>>> That said, we have exactly zero evidence that this is actually a
> > >>>>>>>>> problem in production systems. We did get systems tripping
> > >>>>>>>>> over the
> > >>>>>>>>> symlink issue, but there's no evidence that the
> > >>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the
> > >>>>>>>>> wild and
> > >>>>>>>>> hence there hasn't been any particular urgency to address it.
> > >>>>>>>>>
> > >>>>>>>>>> Are there any recommended workarounds until an
> > >>>>>>>>>> elegant and efficient solution
> > >>>>>>>>>> can be proposed? After all, causing a crash is
> > >>>>>>>>>> extremely unacceptable in a
> > >>>>>>>>>> production environment.
> > >>>>>>>>> What crashes are you seeing in your production environment?
> > >>>>>>>>>
> > >>>>>>>>> -Dave.
> > >>>>>>>>> --
> > >>>>>>>>> Dave Chinner
> > >>>>>>>>> david@fromorbit.com
> > >
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-21 2:13 ` Ian Kent
2024-05-26 15:04 ` Jinliang Zheng
@ 2024-05-26 23:51 ` Ian Kent
2024-05-27 0:18 ` Al Viro
1 sibling, 1 reply; 19+ messages in thread
From: Ian Kent @ 2024-05-26 23:51 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel,
linux-xfs, rcu
On 21/5/24 10:13, Ian Kent wrote:
> On 21/5/24 09:35, Ian Kent wrote:
>> On 21/5/24 01:36, Darrick J. Wong wrote:
>>> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
>>>> On 16/5/24 15:08, Ian Kent wrote:
>>>>> On 16/5/24 12:56, Jinliang Zheng wrote:
>>>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
>>>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
>>>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
>>>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
>>>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800,
>>>>>>>>>> alexjlzheng@gmail.com wrote:
>>>>>>>>>>> Hi, all
>>>>>>>>>>>
>>>>>>>>>>> I would like to ask if the conflict between xfs
>>>>>>>>>>> inode recycle and vfs rcu-walk
>>>>>>>>>>> which can lead to null pointer references has been resolved?
>>>>>>>>>>>
>>>>>>>>>>> I browsed through emails about the following
>>>>>>>>>>> patches and their discussions:
>>>>>>>>>>> -
>>>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
>>>>>>>>>>> -
>>>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
>>>>>>>>>>> -
>>>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
>>>>>>>>>>>
>>>>>>>>>>> And then came to the conclusion that this
>>>>>>>>>>> problem has not been solved, am I
>>>>>>>>>>> right? Did I miss some patch that could solve this problem?
>>>>>>>>>> We fixed the known problems this caused by turning off the VFS
>>>>>>>>>> functionality that the rcu pathwalks kept tripping over. See
>>>>>>>>>> commit
>>>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink
>>>>>>>>>> metadata buffers to
>>>>>>>>>> the vfs").
>>>>>>>>> Sorry for the delay.
>>>>>>>>>
>>>>>>>>> The problem I encountered in the production environment
>>>>>>>>> was that during the
>>>>>>>>> rcu walk process the ->get_link() pointer was NULL,
>>>>>>>>> which caused a crash.
>>>>>>>>>
>>>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't
>>>>>>>>> expose internal symlink
>>>>>>>>> metadata buffers to the vfs") first appeared in:
>>>>>>>>> -
>>>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
>>>>>>>>>
>>>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how?
>>>>>>>> I suggest reading the call stack from wherever the VFS enters
>>>>>>>> the XFS
>>>>>>>> readlink code. If you have a reliable reproducer, then
>>>>>>>> apply this patch
>>>>>>>> to your kernel (you haven't mentioned which one it is) and see
>>>>>>>> if the
>>>>>>>> bad dereference goes away.
>>>>>>>>
>>>>>>>> --D
>>>>>>> Sorry for the delay.
>>>>>>>
>>>>>>> I encountered the following calltrace:
>>>>>>>
>>>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address:
>>>>>>> 0000000000000000
>>>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
>>>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page
>>>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
>>>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
>>>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
>>>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
>>>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
>>>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
>>>>>>> [20213.578884] RIP: 0010:0x0
>>>>>>> [20213.578894] Code: Bad RIP value.
>>>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
>>>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
>>>>>>> 0000000000000000
>>>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
>>>>>>> 0000000000000000
>>>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
>>>>>>> ffff889b9eeae380
>>>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
>>>>>>> 0000000000000000
>>>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
>>>>>>> ffffc90021ebfd48
>>>>>>> [20213.578998] FS: 00007f89c534e740(0000)
>>>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000
>>>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
>>>>>>> 00000000007706e0
>>>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>>> 0000000000000000
>>>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>>> 0000000000000400
>>>>>>> [20213.579079] PKRU: 55555554
>>>>>>> [20213.579087] Call Trace:
>>>>>>> [20213.579099] trailing_symlink+0x1da/0x260
>>>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220
>>>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170
>>>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
>>>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0
>>>>>>> [20213.579161] user_path_at_empty+0x3e/0x50
>>>>>>> [20213.579172] vfs_statx+0x76/0xe0
>>>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70
>>>>>>> [20213.579194] ? fput+0x13/0x20
>>>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300
>>>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30
>>>>>>> [20213.579225] ? fput+0x13/0x20
>>>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0
>>>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20
>>>>>>> [20213.579256] do_syscall_64+0x4d/0x140
>>>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>>>>>>>
>>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>
>>>>>>>
>>>>>> Please note that the kernel version I use is the one maintained by
>>>>>> Tencent.Inc,
>>>>>> and the baseline is v5.4. But in fact, in the latest upstream source
>>>>>> tree,
>>>>>> although the trailing_symlink() function has been removed, its logic
>>>>>> has been
>>>>>> moved to pick_link(), so the problem still exists.
>>>>>>
>>>>>> Ian Kent pointed out that try_to_unlazy() was introduced in
>>>>>> pick_link() in the
>>>>>> latest upstream source tree, but I don't understand why this can
>>>>>> solve the NULL
>>>>>> ->get_link pointer dereference problem, because ->get_link pointer
>>>>>> will be
>>>>>> dereferenced before try_to_unlazy().
>>>>>>
>>>>>> (I don't understand why Ian Kent's email didn't appear on the
>>>>>> mailing list.)
>>>>> It was something about html mail and I think my mail client was at
>>>>> fault.
>>>>>
>>>>> In any case what you say is indeed correct, so the comment isn't
>>>>> important.
>>>>>
>>>>>
>>>>> Fact is it is still a race between the lockless path walk and inode
>>>>> eviction
>>>>>
>>>>> and xfs recycling. I believe that the xfs recycling code is very
>>>>> hard to
>>>>> fix.
>>>>>
>>>>>
>>>>> IIRC correctly putting a NULL check in pick_link() was not considered
>>>>> acceptable
>>>>>
>>>>> but there must be a way that is acceptable to check this and
>>>>> restart the
>>>>> walk.
>>>>>
>>>>> Maybe there was a reluctance to suffer the overhead of restarting the
>>>>> walk when
>>>>>
>>>>> it shouldn't be needed.
>>>> Or perhaps the worry was that if it can become NULL it could also
>>>> become a
>>>> pointer to a
>>>>
>>>> different (incorrect) link altogether which could have really
>>>> odd/unpleasant
>>>> outcomes.
>>> Yuck. I think that means that we can't reallocate freed inodes until
>>> the rcu grace period expires. For inodes that haven't been evicted, I
>>> think that also means we cannot recycle cached inodes until after an
>>> rcu
>>> grace period expires; or maybe that we cannot reset i_op/i_fop and must
>>> not leave the incore state in an inconsistent format?
>>
>> Yeah, not pretty!
>>
>> But shouldn't this case occur only occasionally?
>>
>>
>> So issuing a cache miss shouldn't impact performance too much that was,
>>
>> I believe, the concern with waiting for the rcu grace period.
>>
>>
>> Identifying it's happening should be possible, the vfs legitimize_*()
>>
>> has this job for various objects but maybe it's using vfs private info.
>>
>> (certainly it uses nameidata struct with a seq lock sequence number in
>>
>> it) but I assume it can be done somehow.
>
> Unfortunately, when you start trying to work out how to do this, it
> isn't at all
>
> obvious how to do it ...
Indeed, that's what I found when I had a quick look.
Maybe a dentry (since that's part of the subject of the path walk and
inode is readily
accessible) flag could be used since there's opportunity to set it in
vfs callbacks that
are done as a matter of course.
Ian
>
>
>>
>>
>> My question then becomes is it viable/straight forward to not recycle
>> such
>>
>> an inode and discard it instead so it gets re-created, I guess it's
>> essentially
>>
>> a cache miss?
>>
>>
>> Ian
>>
>>>
>>> --D
>>>
>>>>>
>>>>> The alternative would be to find some way to identify when it's
>>>>> unsafe
>>>>> to reuse
>>>>>
>>>>> an inode marked for re-cycle before dropping rcu read, perhaps
>>>>> with the
>>>>> reference
>>>>>
>>>>> count plus the seqlock. Basically, to reuse inodes xfs will need to
>>>>> identify when
>>>>>
>>>>> the race occurs and let the inode go away under rcu and create a
>>>>> new one
>>>>> if a race
>>>>>
>>>>> is detected. But possibly that isn't nearly as simple as it sounds?
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Jinliang Zheng
>>>>>>
>>>>>>> And I analyzed the disassembly of trailing_symlink() and
>>>>>>> confirmed that a NULL
>>>>>>> ->get_link() happened here:
>>>>>>>
>>>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1)
>>>>>>> [FTRACE NOP]
>>>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp
>>>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp
>>>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15
>>>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14
>>>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13
>>>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12
>>>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx
>>>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov
>>>>>>> %rdi,%rbx # rbx = &nameidate
>>>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp
>>>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov
>>>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4
>>>>>>> <sysctl_protected_symlinks>
>>>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax
>>>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx
>>>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je
>>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx
>>>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov
>>>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode
>>>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx
>>>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov
>>>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid
>>>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx)
>>>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je
>>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi
>>>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx
>>>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx
>>>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx
>>>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je
>>>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>
>>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax
>>>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov
>>>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT
>>>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov
>>>>>>> 0x50(%rbx),%rax # rax = nd->stack
>>>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq
>>>>>>> $0x0,0x20(%rax) # stack[0].name = NULL
>>>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov
>>>>>>> 0x48(%rbx),%eax # nd->depth
>>>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov
>>>>>>> 0x50(%rbx),%rdx # nd->stack
>>>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov
>>>>>>> 0xc8(%rbx),%r13 # nd->link_inode
>>>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea
>>>>>>> (%rax,%rax,2),%rax # rax = depth * 3
>>>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl
>>>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30
>>>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea
>>>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last
>>>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov
>>>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry
>>>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx)
>>>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je
>>>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>
>>>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi
>>>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi
>>>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq
>>>>>>> 0xffffffff812f8a00 <atime_needs_update>
>>>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al
>>>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne
>>>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>
>>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx
>>>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi
>>>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi
>>>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx
>>>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx
>>>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq
>>>>>>> 0xffffffff81424310 <security_inode_follow_link>
>>>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12
>>>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax
>>>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne
>>>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>
>>>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx)
>>>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12
>>>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12
>>>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je
>>>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>
>>>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax
>>>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al
>>>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je
>>>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>
>>>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al
>>>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je
>>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12
>>>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je
>>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp
>>>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax
>>>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx
>>>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12
>>>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13
>>>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14
>>>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15
>>>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp
>>>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq
>>>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi
>>>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq
>>>>>>> 0xffffffff812f8ae0 <touch_atime>
>>>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq
>>>>>>> 0xffffffff81a26410 <_cond_resched>
>>>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp
>>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx
>>>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx
>>>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je
>>>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>
>>>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx
>>>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je
>>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>
>>>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov
>>>>>>> $0xfffffffffffffff6,%r12
>>>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al
>>>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne
>>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax
>>>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax
>>>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax
>>>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je
>>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax
>>>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax
>>>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je
>>>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>
>>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov
>>>>>>> $0xffffffff82319b4f,%rdi
>>>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov
>>>>>>> $0xfffffffffffffff3,%r12
>>>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq
>>>>>>> 0xffffffff81161310 <audit_log_link_denied>
>>>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp
>>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov
>>>>>>> $0xffffffff8230164d,%r12
>>>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp
>>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx)
>>>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je
>>>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>
>>>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi
>>>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq
>>>>>>> 0xffffffff812e2da0 <nd_jump_root>
>>>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax
>>>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne
>>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12
>>>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax
>>>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al
>>>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne
>>>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>
>>>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp
>>>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>
>>>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov
>>>>>>> 0x20(%r13),%rax # inode->i_op
>>>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15
>>>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi
>>>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx
>>>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov
>>>>>>> 0x8(%rax),%rcx # inode_operations->get_link
>>>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx)
>>>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne
>>>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>
>>>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov
>>>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0
>>>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq
>>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12
>>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12
>>>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je
>>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>
>>>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp
>>>>>>> $0xfffffffffffff000,%r12
>>>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe
>>>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>
>>>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq
>>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor
>>>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0
>>>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp)
>>>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq
>>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>>>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12
>>>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp
>>>>>>> $0xfffffffffffffff6,%rax
>>>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne
>>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi
>>>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq
>>>>>>> 0xffffffff812e3840 <unlazy_walk>
>>>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax
>>>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne
>>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx
>>>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi
>>>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi
>>>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx
>>>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq
>>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
>>>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12
>>>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp
>>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>
>>>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi
>>>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq
>>>>>>> 0xffffffff812e3840 <unlazy_walk>
>>>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax
>>>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne
>>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>
>>>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi
>>>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq
>>>>>>> 0xffffffff812f8ae0 <touch_atime>
>>>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq
>>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>
>>>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax
>>>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi
>>>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx
>>>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi
>>>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq
>>>>>>> 0xffffffff811673f0 <__audit_inode>
>>>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq
>>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>
>>>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi
>>>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq
>>>>>>> 0xffffffff812e4790 <set_root>
>>>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq
>>>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>
>>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov
>>>>>>> $0xfffffffffffffff6,%r12
>>>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq
>>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>
>>>>>>>
>>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> According to my understanding, the problem solved by commit
>>>>>>> 7b7820b83f23 ("xfs:
>>>>>>> don't expose internal symlink metadata buffers to the vfs") is a
>>>>>>> data NULL
>>>>>>> pointer dereference, but the problem here is an instruction NULL
>>>>>>> pointer
>>>>>>> dereference.
>>>>>>>
>>>>>>> Further, I analyzed the possible triggering process as follows:
>>>>>>>
>>>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create
>>>>>>> rcu_read_lock
>>>>>>> read_seqcount_retry
>>>>>>> (the last check) iput_final
>>>>>>> evict
>>>>>>> destroy_inode
>>>>>>> xfs_fs_destroy_inode
>>>>>>> xfs_inode_set_reclaim_tag xfs_ialloc
>>>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc
>>>>>>> set(ip, XFS_IRECLAIMABLE)
>>>>>>> xfs_iget
>>>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock
>>>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit
>>>>>>> spin_lock(ip->i_flags_lock)
>>>>>>>
>>>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
>>>>>>> set(ip, XFS_IRECLAIM)
>>>>>>> spin_unlock(ip->i_flags_lock)
>>>>>>> rcu_read_unlock
>>>>>>> < ------------ >
>>>>>>>
>>>>>>> // miss synchronize_rcu()
>>>>>>> xfs_reinit_inode
>>>>>>> ->get_link = NULL
>>>>>>> get_link() // NULL
>>>>>>>
>>>>>>> rcu_read_unlock
>>>>>>>
>>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't
>>>>>>> expose internal
>>>>>>> symlink metadata buffers to the vfs"), we should start
>>>>>>> processing this NULL
>>>>>>> ->get_link pointer dereference.
>>>>>>>
>>>>>>> Or, am I thinking wrong somewhere?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jinliang Zheng
>>>>>>>
>>>>>>>>>> Apart from that issue, I'm not aware of any other issues that
>>>>>>>>>> the
>>>>>>>>>> XFS inode recycling directly exposes.
>>>>>>>>>>
>>>>>>>>>>> According to my understanding, the essence of
>>>>>>>>>>> this problem is that XFS reuses
>>>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk
>>>>>>>>>>> assumes that this will not happen.
>>>>>>>>>> It assumes that the inode will not change identity during the
>>>>>>>>>> RCU
>>>>>>>>>> grace period after the inode has been evicted from cache. We can
>>>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU
>>>>>>>>>> grace period as long as it is the same inode with the same
>>>>>>>>>> content
>>>>>>>>>> and same state.
>>>>>>>>>>
>>>>>>>>>> Problems *may* arise when we unlink the inode, then evict it,
>>>>>>>>>> then a
>>>>>>>>>> new file is created and the old slab cache memory address is
>>>>>>>>>> used
>>>>>>>>>> for the new inode. I describe the issue here:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> And judging from the relevant emails, the main reason
>>>>>>>>> why ->get_link() is set
>>>>>>>>> to NULL should be the lack of synchronize_rcu() before
>>>>>>>>> xfs_reinit_inode() when
>>>>>>>>> the inode is chosen to be reused.
>>>>>>>>>
>>>>>>>>> However, perhaps due to performance reasons, this
>>>>>>>>> solution has not been merged
>>>>>>>>> for a long time. How is it now?
>>>>>>>>>
>>>>>>>>> Maybe I am missing something in the threads of mail?
>>>>>>>>>
>>>>>>>>> Thank you very much. :)
>>>>>>>>> Jinliang Zheng
>>>>>>>>>
>>>>>>>>>> That said, we have exactly zero evidence that this is actually a
>>>>>>>>>> problem in production systems. We did get systems tripping
>>>>>>>>>> over the
>>>>>>>>>> symlink issue, but there's no evidence that the
>>>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the
>>>>>>>>>> wild and
>>>>>>>>>> hence there hasn't been any particular urgency to address it.
>>>>>>>>>>
>>>>>>>>>>> Are there any recommended workarounds until an
>>>>>>>>>>> elegant and efficient solution
>>>>>>>>>>> can be proposed? After all, causing a crash is
>>>>>>>>>>> extremely unacceptable in a
>>>>>>>>>>> production environment.
>>>>>>>>>> What crashes are you seeing in your production environment?
>>>>>>>>>>
>>>>>>>>>> -Dave.
>>>>>>>>>> --
>>>>>>>>>> Dave Chinner
>>>>>>>>>> david@fromorbit.com
>>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-26 23:51 ` Ian Kent
@ 2024-05-27 0:18 ` Al Viro
2024-05-28 15:51 ` Brian Foster
0 siblings, 1 reply; 19+ messages in thread
From: Al Viro @ 2024-05-27 0:18 UTC (permalink / raw)
To: Ian Kent
Cc: Darrick J. Wong, Jinliang Zheng, alexjlzheng, bfoster, david,
linux-fsdevel, linux-xfs, rcu
On Mon, May 27, 2024 at 07:51:39AM +0800, Ian Kent wrote:
> Indeed, that's what I found when I had a quick look.
>
>
> Maybe a dentry (since that's part of the subject of the path walk and inode
> is readily
>
> accessible) flag could be used since there's opportunity to set it in vfs
> callbacks that
>
> are done as a matter of course.
You might recheck ->d_seq after fetching ->get_link there; with XFS
->get_link() unconditionlly failing in RCU mode that would prevent
this particular problem. But it would obviously have to be done
in pick_link() itself (and I refuse to touch that area in 5.4 -
carrying those changes across the e.g. 5.6 changes in pathwalk
machinery is too much).
And it's really just the tip of the iceberg - e.g. I'd expect a massive
headache in ACL-related part of permission checks, etc.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-16 7:23 ` Ian Kent
2024-05-20 17:36 ` Darrick J. Wong
@ 2024-05-27 9:41 ` Dave Chinner
2024-05-27 13:56 ` Jinliang Zheng
1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2024-05-27 9:41 UTC (permalink / raw)
To: Ian Kent
Cc: Jinliang Zheng, alexjlzheng, bfoster, djwong, linux-fsdevel,
linux-xfs, rcu
On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> On 16/5/24 15:08, Ian Kent wrote:
> > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > I encountered the following calltrace:
> > > >
> > > > [20213.578756] BUG: kernel NULL pointer dereference, address:
> > > > 0000000000000000
> > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > [20213.578884] RIP: 0010:0x0
> > > > [20213.578894] Code: Bad RIP value.
> > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > > > 0000000000000000
> > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > > > 0000000000000000
> > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > > > ffff889b9eeae380
> > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > > > 0000000000000000
> > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > > > ffffc90021ebfd48
> > > > [20213.578998] FS: 00007f89c534e740(0000)
> > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > > > 00000000007706e0
> > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [20213.579079] PKRU: 55555554
> > > > [20213.579087] Call Trace:
> > > > [20213.579099] trailing_symlink+0x1da/0x260
> > > > [20213.579112] path_lookupat.isra.53+0x79/0x220
> > > > [20213.579125] filename_lookup.part.69+0xa0/0x170
> > > > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
> > > > [20213.579151] ? getname_flags+0x4f/0x1e0
> > > > [20213.579161] user_path_at_empty+0x3e/0x50
> > > > [20213.579172] vfs_statx+0x76/0xe0
> > > > [20213.579182] __do_sys_newstat+0x3d/0x70
> > > > [20213.579194] ? fput+0x13/0x20
> > > > [20213.579203] ? ksys_ioctl+0xb0/0x300
> > > > [20213.579213] ? generic_file_llseek+0x24/0x30
> > > > [20213.579225] ? fput+0x13/0x20
> > > > [20213.579233] ? ksys_lseek+0x8d/0xb0
> > > > [20213.579243] __x64_sys_newstat+0x16/0x20
> > > > [20213.579256] do_syscall_64+0x4d/0x140
> > > > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > >
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > >
> > > Please note that the kernel version I use is the one maintained by
> > > Tencent.Inc,
> > > and the baseline is v5.4. But in fact, in the latest upstream source
> > > tree,
> > > although the trailing_symlink() function has been removed, its logic
> > > has been
> > > moved to pick_link(), so the problem still exists.
> > >
> > > Ian Kent pointed out that try_to_unlazy() was introduced in
> > > pick_link() in the
> > > latest upstream source tree, but I don't understand why this can
> > > solve the NULL
> > > ->get_link pointer dereference problem, because ->get_link pointer
> > > will be
> > > dereferenced before try_to_unlazy().
> > >
> > > (I don't understand why Ian Kent's email didn't appear on the
> > > mailing list.)
> >
> > It was something about html mail and I think my mail client was at fault.
> >
> > In any case what you say is indeed correct, so the comment isn't
> > important.
> >
> >
> > Fact is it is still a race between the lockless path walk and inode
> > eviction
> >
> > and xfs recycling. I believe that the xfs recycling code is very hard to
> > fix.
Not really for this case. This is simply concurrent pathwalk lookups
occurring just after the inode has been evicted from the VFS inode
cache. The first lookup hits the XFS inode cache, sees
XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
reinstantiate the VFS inode to an initial state. The Xfs inode
itself is still valid as it hasn't reached the XFS_IRECLAIM state
where it will be torn down and freed.
Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has
been run and that it is trying to call ->get_link on that same
inode. Unfortunately, the first lookup has just set inode->f_ops =
&empty_fops as part of the VFS inode reinit, and that then triggers
the null pointer deref.
Once the first lookup has finished the inode_init_always(),
xfs_reinit_inode() resets inode->f_ops back to
xfs_symlink_file_ops and get_link calls work again.
Fundamentally, the problem is that we are completely reinitialising
the VFS inode within the RCU grace period. i.e. while concurrent RCU
pathwalks can still be in progress and find the VFS inode whilst the
XFS inode cache is manipulating it.
What we should be doing here is a subset of inode_init_always(),
which only reinitialises the bits of the VFS inode we need to
initialise rather than the entire inode. The identity of the inode
is not changing and so we don't need to go through a transient state
where the VFS inode goes xfs symlink -> empty initialised inode ->
xfs symlink.
i.e. We need to re-initialise the non-identity related parts of the
VFS inode so the identity parts that the RCU pathwalks rely on never
change within the RCU grace period where lookups can find the VFS
inode after it has been evicted.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-27 9:41 ` Dave Chinner
@ 2024-05-27 13:56 ` Jinliang Zheng
2024-05-28 2:10 ` Dave Chinner
0 siblings, 1 reply; 19+ messages in thread
From: Jinliang Zheng @ 2024-05-27 13:56 UTC (permalink / raw)
To: david
Cc: alexjlzheng, alexjlzheng, bfoster, djwong, linux-fsdevel,
linux-xfs, raven, rcu
On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote:
> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> > On 16/5/24 15:08, Ian Kent wrote:
> > > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > > I encountered the following calltrace:
> > > > >
> > > > > [20213.578756] BUG: kernel NULL pointer dereference, address:
> > > > > 0000000000000000
> > > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > > [20213.578884] RIP: 0010:0x0
> > > > > [20213.578894] Code: Bad RIP value.
> > > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > > > > 0000000000000000
> > > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > > > > 0000000000000000
> > > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > > > > ffff889b9eeae380
> > > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > > > > 0000000000000000
> > > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > > > > ffffc90021ebfd48
> > > > > [20213.578998] FS: 00007f89c534e740(0000)
> > > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > > > > 00000000007706e0
> > > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > > 0000000000000000
> > > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > > 0000000000000400
> > > > > [20213.579079] PKRU: 55555554
> > > > > [20213.579087] Call Trace:
> > > > > [20213.579099] trailing_symlink+0x1da/0x260
> > > > > [20213.579112] path_lookupat.isra.53+0x79/0x220
> > > > > [20213.579125] filename_lookup.part.69+0xa0/0x170
> > > > > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0
> > > > > [20213.579151] ? getname_flags+0x4f/0x1e0
> > > > > [20213.579161] user_path_at_empty+0x3e/0x50
> > > > > [20213.579172] vfs_statx+0x76/0xe0
> > > > > [20213.579182] __do_sys_newstat+0x3d/0x70
> > > > > [20213.579194] ? fput+0x13/0x20
> > > > > [20213.579203] ? ksys_ioctl+0xb0/0x300
> > > > > [20213.579213] ? generic_file_llseek+0x24/0x30
> > > > > [20213.579225] ? fput+0x13/0x20
> > > > > [20213.579233] ? ksys_lseek+0x8d/0xb0
> > > > > [20213.579243] __x64_sys_newstat+0x16/0x20
> > > > > [20213.579256] do_syscall_64+0x4d/0x140
> > > > > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > > >
> > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > > >
> > > > Please note that the kernel version I use is the one maintained by
> > > > Tencent.Inc,
> > > > and the baseline is v5.4. But in fact, in the latest upstream source
> > > > tree,
> > > > although the trailing_symlink() function has been removed, its logic
> > > > has been
> > > > moved to pick_link(), so the problem still exists.
> > > >
> > > > Ian Kent pointed out that try_to_unlazy() was introduced in
> > > > pick_link() in the
> > > > latest upstream source tree, but I don't understand why this can
> > > > solve the NULL
> > > > ->get_link pointer dereference problem, because ->get_link pointer
> > > > will be
> > > > dereferenced before try_to_unlazy().
> > > >
> > > > (I don't understand why Ian Kent's email didn't appear on the
> > > > mailing list.)
> > >
> > > It was something about html mail and I think my mail client was at fault.
> > >
> > > In any case what you say is indeed correct, so the comment isn't
> > > important.
> > >
> > >
> > > Fact is it is still a race between the lockless path walk and inode
> > > eviction
> > >
> > > and xfs recycling. I believe that the xfs recycling code is very hard to
> > > fix.
>
> Not really for this case. This is simply concurrent pathwalk lookups
> occurring just after the inode has been evicted from the VFS inode
> cache. The first lookup hits the XFS inode cache, sees
> XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
> reinstantiate the VFS inode to an initial state. The Xfs inode
> itself is still valid as it hasn't reached the XFS_IRECLAIM state
> where it will be torn down and freed.
>
> Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has
> been run and that it is trying to call ->get_link on that same
> inode. Unfortunately, the first lookup has just set inode->f_ops =
> &empty_fops as part of the VFS inode reinit, and that then triggers
> the null pointer deref.
The RCU pathwalk must occur before xfs_reinit_inode(), and must obtain the
target inode before xfs_reinit_inode(). Because the target inode of
xfs_reinit_inode() must NOT be associated with any dentry, which is necessary
conditions for iput() -> iput_final() -> evict(), and the RCU pathwalk cannot
obtain any inode without a dentry.
>
> Once the first lookup has finished the inode_init_always(),
> xfs_reinit_inode() resets inode->f_ops back to
> xfs_symlink_file_ops and get_link calls work again.
>
> Fundamentally, the problem is that we are completely reinitialising
> the VFS inode within the RCU grace period. i.e. while concurrent RCU
> pathwalks can still be in progress and find the VFS inode whilst the
> XFS inode cache is manipulating it.
>
> What we should be doing here is a subset of inode_init_always(),
> which only reinitialises the bits of the VFS inode we need to
> initialise rather than the entire inode. The identity of the inode
> is not changing and so we don't need to go through a transient state
> where the VFS inode goes xfs symlink -> empty initialised inode ->
> xfs symlink.
Sorry, I think this question is wrong in more ways than just inode_operations.
After the target inode has been reinited by xfs_reinit_inode(), its semantics
is no longer the inode required by RCU walkpath. The meanings of many fields
have changed, such as mode, i_mtime, i_atime, i_ctime and so on.
>
> i.e. We need to re-initialise the non-identity related parts of the
> VFS inode so the identity parts that the RCU pathwalks rely on never
> change within the RCU grace period where lookups can find the VFS
> inode after it has been evicted.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-27 13:56 ` Jinliang Zheng
@ 2024-05-28 2:10 ` Dave Chinner
0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2024-05-28 2:10 UTC (permalink / raw)
To: Jinliang Zheng
Cc: alexjlzheng, bfoster, djwong, linux-fsdevel, linux-xfs, raven,
rcu
On Mon, May 27, 2024 at 09:56:15PM +0800, Jinliang Zheng wrote:
> On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote:
> > On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> > > On 16/5/24 15:08, Ian Kent wrote:
> > > > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > In any case what you say is indeed correct, so the comment isn't
> > > > important.
> > > >
> > > >
> > > > Fact is it is still a race between the lockless path walk and inode
> > > > eviction
> > > >
> > > > and xfs recycling. I believe that the xfs recycling code is very hard to
> > > > fix.
> >
> > Not really for this case. This is simply concurrent pathwalk lookups
> > occurring just after the inode has been evicted from the VFS inode
> > cache. The first lookup hits the XFS inode cache, sees
> > XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
> > reinstantiate the VFS inode to an initial state. The Xfs inode
> > itself is still valid as it hasn't reached the XFS_IRECLAIM state
> > where it will be torn down and freed.
> >
> > Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has
> > been run and that it is trying to call ->get_link on that same
> > inode. Unfortunately, the first lookup has just set inode->f_ops =
> > &empty_fops as part of the VFS inode reinit, and that then triggers
> > the null pointer deref.
>
> The RCU pathwalk must occur before xfs_reinit_inode(), and must obtain the
> target inode before xfs_reinit_inode().
I'm not sure I follow - xfs_reinit_inode() typically occurs during a
pathwalk when no dentry for the given path component is found in the
dcache. Hence it has to create the dentry and look up the inode.
i.e.
walk_component()
lookup_fast() -> doesn't find a valid cached dentry
lookup_slow()
inode_lock_shared(parent)
parent->i_op->lookup(child)
xfs_vn_lookup()
xfs_lookup()
xfs_iget(child) <<<< inode may not exist until here
xfs_iget_recycle(child)
xfs_reinit_inode(child)
inode_unlock_shared(parent)
The path you are indicating is going wrong is:
link_path_walk()
walk_component()
<find child dentry>
step_into(child)
if (!d_is_symlink(child dentry)) {
....
return
}
pick_link(child)
if (!inode->i_link)
inode->i_op->get_link() <<<< doesn't exist, not a symlink inode
This implies that lookup_fast() found a symlink dentry with a
d_inode pointer to something that wasn't a symlink. That doesn't
mean that anything has gone wrong with xfs inode recycling within an
RCU grace period.
For example, d_is_symlink() looks purely at the dentry state and
assumes that it matches the dentry->d_inode attached to it:
#define DCACHE_ENTRY_TYPE (7 << 20) /* bits 20..22 are for storing type: */
#define DCACHE_MISS_TYPE (0 << 20) /* Negative dentry */
#define DCACHE_WHITEOUT_TYPE (1 << 20) /* Whiteout dentry (stop pathwalk) */
#define DCACHE_DIRECTORY_TYPE (2 << 20) /* Normal directory */
#define DCACHE_AUTODIR_TYPE (3 << 20) /* Lookupless directory (presumed automount) */
#define DCACHE_REGULAR_TYPE (4 << 20) /* Regular file type */
#define DCACHE_SPECIAL_TYPE (5 << 20) /* Other file type */
#define DCACHE_SYMLINK_TYPE (6 << 20) /* Symlink */
static inline unsigned __d_entry_type(const struct dentry *dentry)
{
return dentry->d_flags & DCACHE_ENTRY_TYPE;
}
static inline bool d_is_symlink(const struct dentry *dentry)
{
return __d_entry_type(dentry) == DCACHE_SYMLINK_TYPE;
}
This is a valid optimisation and good for performance, but it does
make it susceptible to memory corruption based failues. i.e. a
single bit memory corruption can change a DCACHE_DIRECTORY_TYPE
dentry to look like a DCACHE_SYMLINK_TYPE dentry, and then the code
calls pick_link() on a dentry that points to a directory inode and
not a symlink inode.
Such a memory corruption would have an identical crash signature
to the stack trace you posted, hence I'd really like to have solid
confirmation that the crash you are seeing is actually a result of
inode recycling and not something else....
> > Once the first lookup has finished the inode_init_always(),
> > xfs_reinit_inode() resets inode->f_ops back to
> > xfs_symlink_file_ops and get_link calls work again.
> >
> > Fundamentally, the problem is that we are completely reinitialising
> > the VFS inode within the RCU grace period. i.e. while concurrent RCU
> > pathwalks can still be in progress and find the VFS inode whilst the
> > XFS inode cache is manipulating it.
> >
> > What we should be doing here is a subset of inode_init_always(),
> > which only reinitialises the bits of the VFS inode we need to
> > initialise rather than the entire inode. The identity of the inode
> > is not changing and so we don't need to go through a transient state
> > where the VFS inode goes xfs symlink -> empty initialised inode ->
> > xfs symlink.
>
> Sorry, I think this question is wrong in more ways than just inode_operations.
> After the target inode has been reinited by xfs_reinit_inode(), its semantics
> is no longer the inode required by RCU walkpath. The meanings of many fields
> have changed, such as mode, i_mtime, i_atime, i_ctime and so on.
That's only the case in the the unlink->inode free->create-> inode
allocation path, assuming that is what the system actually tripped
over.
However, we can hit the reinit code from a simple path lookup
immediately after memory reclaim freed the dentry and inode and it
is still in the XFS inode cache. i.e. ->destroy_inode() ->
XFS_IRECLAIMABLE -> ->lookup() -> xfs_iget() -> xfs_iget_recycle().
i.e. the inode reinit doesn't only get triggered by unlink/alloc
cycles, so we often reinit to the exact same inode state as before
the inode was evicted from memory.
Essentially, it is not clear to me how your system tripped over this
issue; it *may* be an inode cache recycling issue, but I can also
point to other situations that could result in a very similar crash
signature. What I'm looking for is real evidence that it was a
recycling issue that lead to this problem, and evidence that it can
still occur on a current TOT kernel. A method for reproducing the
issue your kernels are seeing would be nice.
FWIW, reproducing on a current TOT kernel is important - even if you're seeing the
unlink/alloc/reinit case on your 5.4 kernel, this path had a major
architectural change in 5.14 and AFAICT that largely invalidates all
the previous analysis of this inode reinit behaviour.
In 5.14 we moved the inode freeing code we used to do in evict()
into a background thread, hence the "evict, unlink, create, reinit"
process now has an enforced context switch and delay between
->destroy_inode() and the internal inode unlink/freeing code.
By decoupling the unlink processing from the calling task context,
the task context can no longer immediately reallocate the same
physical inode, and so the mechanism that lead to applications being
able to directly trigger the xfs_inode_reinit() code for inodes that
are changing identity repeatedly in certain situations no longer
exists. The delay in unlink processing also affects how RCU grace
periods expire between unlink and allocation/reinit, so assumptions
made on that side of the analysis are also suspect and need to be
re-examined.
Hence before we spend any more time chasing ghosts, I'd really like
to see hard evidence for what caused the crash you reported and a
demonstration of it occuring on current TOT kernels, too.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk
2024-05-27 0:18 ` Al Viro
@ 2024-05-28 15:51 ` Brian Foster
0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2024-05-28 15:51 UTC (permalink / raw)
To: Al Viro
Cc: Ian Kent, Darrick J. Wong, Jinliang Zheng, alexjlzheng, david,
linux-fsdevel, linux-xfs, rcu
On Mon, May 27, 2024 at 01:18:23AM +0100, Al Viro wrote:
> On Mon, May 27, 2024 at 07:51:39AM +0800, Ian Kent wrote:
>
> > Indeed, that's what I found when I had a quick look.
> >
> >
> > Maybe a dentry (since that's part of the subject of the path walk and inode
> > is readily
> >
> > accessible) flag could be used since there's opportunity to set it in vfs
> > callbacks that
> >
> > are done as a matter of course.
>
> You might recheck ->d_seq after fetching ->get_link there; with XFS
> ->get_link() unconditionlly failing in RCU mode that would prevent
> this particular problem. But it would obviously have to be done
> in pick_link() itself (and I refuse to touch that area in 5.4 -
> carrying those changes across the e.g. 5.6 changes in pathwalk
> machinery is too much).
>
Ian sent a patch along those lines a couple years or so ago:
https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
I'm still not quite sure why we didn't merge this, at least as a bandaid
fix for the symlink variant of this particular problem..?
Brian
> And it's really just the tip of the iceberg - e.g. I'd expect a massive
> headache in ACL-related part of permission checks, etc.
>
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-05-28 15:51 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-05 11:38 About the conflict between XFS inode recycle and VFS rcu-walk alexjlzheng
2023-12-08 0:14 ` Dave Chinner
2024-01-31 6:35 ` Jinliang Zheng
2024-01-31 19:30 ` Darrick J. Wong
2024-05-15 15:54 ` alexjlzheng
2024-05-16 4:56 ` Jinliang Zheng
2024-05-16 7:08 ` Ian Kent
2024-05-16 7:23 ` Ian Kent
2024-05-20 17:36 ` Darrick J. Wong
2024-05-21 1:35 ` Ian Kent
2024-05-21 2:13 ` Ian Kent
2024-05-26 15:04 ` Jinliang Zheng
2024-05-26 17:21 ` Paul E. McKenney
2024-05-26 23:51 ` Ian Kent
2024-05-27 0:18 ` Al Viro
2024-05-28 15:51 ` Brian Foster
2024-05-27 9:41 ` Dave Chinner
2024-05-27 13:56 ` Jinliang Zheng
2024-05-28 2:10 ` Dave Chinner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).