* About the conflict between XFS inode recycle and VFS rcu-walk @ 2023-12-05 11:38 alexjlzheng 2023-12-08 0:14 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: alexjlzheng @ 2023-12-05 11:38 UTC (permalink / raw) To: djwong, bfoster, david, linux-xfs, raven, rcu, linux-fsdevel Hi, all I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk which can lead to null pointer references has been resolved? I browsed through emails about the following patches and their discussions: - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ And then came to the conclusion that this problem has not been solved, am I right? Did I miss some patch that could solve this problem? According to my understanding, the essence of this problem is that XFS reuses the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen. Are there any recommended workarounds until an elegant and efficient solution can be proposed? After all, causing a crash is extremely unacceptable in a production environment. Thank you very much for your advice :) Jinliang Zheng ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2023-12-05 11:38 About the conflict between XFS inode recycle and VFS rcu-walk alexjlzheng @ 2023-12-08 0:14 ` Dave Chinner 2024-01-31 6:35 ` Jinliang Zheng 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2023-12-08 0:14 UTC (permalink / raw) To: alexjlzheng; +Cc: djwong, bfoster, linux-xfs, raven, rcu, linux-fsdevel On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote: > Hi, all > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk > which can lead to null pointer references has been resolved? > > I browsed through emails about the following patches and their discussions: > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > And then came to the conclusion that this problem has not been solved, am I > right? Did I miss some patch that could solve this problem? We fixed the known problems this caused by turning off the VFS functionality that the rcu pathwalks kept tripping over. See commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to the vfs"). Apart from that issue, I'm not aware of any other issues that the XFS inode recycling directly exposes. > According to my understanding, the essence of this problem is that XFS reuses > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen. It assumes that the inode will not change identity during the RCU grace period after the inode has been evicted from cache. We can safely reinstantiate an evicted inode without waiting for an RCU grace period as long as it is the same inode with the same content and same state. Problems *may* arise when we unlink the inode, then evict it, then a new file is created and the old slab cache memory address is used for the new inode. I describe the issue here: https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ That said, we have exactly zero evidence that this is actually a problem in production systems. We did get systems tripping over the symlink issue, but there's no evidence that the unlink->close->open(O_CREAT) issues are manifesting in the wild and hence there hasn't been any particular urgency to address it. > Are there any recommended workarounds until an elegant and efficient solution > can be proposed? After all, causing a crash is extremely unacceptable in a > production environment. What crashes are you seeing in your production environment? -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2023-12-08 0:14 ` Dave Chinner @ 2024-01-31 6:35 ` Jinliang Zheng 2024-01-31 19:30 ` Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Jinliang Zheng @ 2024-01-31 6:35 UTC (permalink / raw) To: david; +Cc: alexjlzheng, bfoster, djwong, linux-fsdevel, linux-xfs, raven, rcu On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote: > > Hi, all > > > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk > > which can lead to null pointer references has been resolved? > > > > I browsed through emails about the following patches and their discussions: > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > And then came to the conclusion that this problem has not been solved, am I > > right? Did I miss some patch that could solve this problem? > > We fixed the known problems this caused by turning off the VFS > functionality that the rcu pathwalks kept tripping over. See commit > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to > the vfs"). Sorry for the delay. The problem I encountered in the production environment was that during the rcu walk process the ->get_link() pointer was NULL, which caused a crash. As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to the vfs") first appeared in: - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ Does this commit solve the problem of NULL ->get_link()? And how? > > Apart from that issue, I'm not aware of any other issues that the > XFS inode recycling directly exposes. > > > According to my understanding, the essence of this problem is that XFS reuses > > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen. > > It assumes that the inode will not change identity during the RCU > grace period after the inode has been evicted from cache. We can > safely reinstantiate an evicted inode without waiting for an RCU > grace period as long as it is the same inode with the same content > and same state. > > Problems *may* arise when we unlink the inode, then evict it, then a > new file is created and the old slab cache memory address is used > for the new inode. I describe the issue here: > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ And judging from the relevant emails, the main reason why ->get_link() is set to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when the inode is chosen to be reused. However, perhaps due to performance reasons, this solution has not been merged for a long time. How is it now? Maybe I am missing something in the threads of mail? Thank you very much. :) Jinliang Zheng > > That said, we have exactly zero evidence that this is actually a > problem in production systems. We did get systems tripping over the > symlink issue, but there's no evidence that the > unlink->close->open(O_CREAT) issues are manifesting in the wild and > hence there hasn't been any particular urgency to address it. > > > Are there any recommended workarounds until an elegant and efficient solution > > can be proposed? After all, causing a crash is extremely unacceptable in a > > production environment. > > What crashes are you seeing in your production environment? > > -Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-01-31 6:35 ` Jinliang Zheng @ 2024-01-31 19:30 ` Darrick J. Wong 2024-05-15 15:54 ` alexjlzheng 0 siblings, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2024-01-31 19:30 UTC (permalink / raw) To: Jinliang Zheng; +Cc: david, bfoster, linux-fsdevel, linux-xfs, raven, rcu On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: > > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote: > > > Hi, all > > > > > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk > > > which can lead to null pointer references has been resolved? > > > > > > I browsed through emails about the following patches and their discussions: > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > And then came to the conclusion that this problem has not been solved, am I > > > right? Did I miss some patch that could solve this problem? > > > > We fixed the known problems this caused by turning off the VFS > > functionality that the rcu pathwalks kept tripping over. See commit > > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to > > the vfs"). > > Sorry for the delay. > > The problem I encountered in the production environment was that during the > rcu walk process the ->get_link() pointer was NULL, which caused a crash. > > As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink > metadata buffers to the vfs") first appeared in: > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ > > Does this commit solve the problem of NULL ->get_link()? And how? I suggest reading the call stack from wherever the VFS enters the XFS readlink code. If you have a reliable reproducer, then apply this patch to your kernel (you haven't mentioned which one it is) and see if the bad dereference goes away. --D > > > > Apart from that issue, I'm not aware of any other issues that the > > XFS inode recycling directly exposes. > > > > > According to my understanding, the essence of this problem is that XFS reuses > > > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen. > > > > It assumes that the inode will not change identity during the RCU > > grace period after the inode has been evicted from cache. We can > > safely reinstantiate an evicted inode without waiting for an RCU > > grace period as long as it is the same inode with the same content > > and same state. > > > > Problems *may* arise when we unlink the inode, then evict it, then a > > new file is created and the old slab cache memory address is used > > for the new inode. I describe the issue here: > > > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ > > And judging from the relevant emails, the main reason why ->get_link() is set > to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when > the inode is chosen to be reused. > > However, perhaps due to performance reasons, this solution has not been merged > for a long time. How is it now? > > Maybe I am missing something in the threads of mail? > > Thank you very much. :) > Jinliang Zheng > > > > > That said, we have exactly zero evidence that this is actually a > > problem in production systems. We did get systems tripping over the > > symlink issue, but there's no evidence that the > > unlink->close->open(O_CREAT) issues are manifesting in the wild and > > hence there hasn't been any particular urgency to address it. > > > > > Are there any recommended workarounds until an elegant and efficient solution > > > can be proposed? After all, causing a crash is extremely unacceptable in a > > > production environment. > > > > What crashes are you seeing in your production environment? > > > > -Dave. > > -- > > Dave Chinner > > david@fromorbit.com > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-01-31 19:30 ` Darrick J. Wong @ 2024-05-15 15:54 ` alexjlzheng 2024-05-16 4:56 ` Jinliang Zheng 0 siblings, 1 reply; 19+ messages in thread From: alexjlzheng @ 2024-05-15 15:54 UTC (permalink / raw) To: djwong Cc: alexjlzheng, bfoster, david, linux-fsdevel, linux-xfs, raven, rcu, alexjlzheng On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: > On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: > > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: > > > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote: > > > > Hi, all > > > > > > > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk > > > > which can lead to null pointer references has been resolved? > > > > > > > > I browsed through emails about the following patches and their discussions: > > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > > > And then came to the conclusion that this problem has not been solved, am I > > > > right? Did I miss some patch that could solve this problem? > > > > > > We fixed the known problems this caused by turning off the VFS > > > functionality that the rcu pathwalks kept tripping over. See commit > > > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to > > > the vfs"). > > > > Sorry for the delay. > > > > The problem I encountered in the production environment was that during the > > rcu walk process the ->get_link() pointer was NULL, which caused a crash. > > > > As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink > > metadata buffers to the vfs") first appeared in: > > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ > > > > Does this commit solve the problem of NULL ->get_link()? And how? > > I suggest reading the call stack from wherever the VFS enters the XFS > readlink code. If you have a reliable reproducer, then apply this patch > to your kernel (you haven't mentioned which one it is) and see if the > bad dereference goes away. > > --D Sorry for the delay. I encountered the following calltrace: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000 [20213.578785] #PF: supervisor instruction fetch in kernel mode [20213.578799] #PF: error_code(0x0010) - not-present page [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 [20213.578828] Oops: 0010 [#1] SMP NOPTI [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 [20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 [20213.578884] RIP: 0010:0x0 [20213.578894] Code: Bad RIP value. [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000 [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000 [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380 [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000 [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48 [20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0 [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [20213.579079] PKRU: 55555554 [20213.579087] Call Trace: [20213.579099] trailing_symlink+0x1da/0x260 [20213.579112] path_lookupat.isra.53+0x79/0x220 [20213.579125] filename_lookup.part.69+0xa0/0x170 [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 [20213.579151] ? getname_flags+0x4f/0x1e0 [20213.579161] user_path_at_empty+0x3e/0x50 [20213.579172] vfs_statx+0x76/0xe0 [20213.579182] __do_sys_newstat+0x3d/0x70 [20213.579194] ? fput+0x13/0x20 [20213.579203] ? ksys_ioctl+0xb0/0x300 [20213.579213] ? generic_file_llseek+0x24/0x30 [20213.579225] ? fput+0x13/0x20 [20213.579233] ? ksys_lseek+0x8d/0xb0 [20213.579243] __x64_sys_newstat+0x16/0x20 [20213.579256] do_syscall_64+0x4d/0x140 [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< And I analyzed the disassembly of trailing_symlink() and confirmed that a NULL ->get_link() happened here: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx 0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx # rbx = &nameidate 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp 0xffffffff812e4869 <trailing_symlink+0x19>: mov 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 <sysctl_protected_symlinks> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx 0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac <trailing_symlink+0x5c> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx 0xffffffff812e487f <trailing_symlink+0x2f>: mov 0xc8(%rdi),%rcx # rcx = nameidata->link_inode 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx 0xffffffff812e488d <trailing_symlink+0x3d>: mov 0x4(%rcx),%ecx # ecx = link_inode->uid 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) 0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac <trailing_symlink+0x5c> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx 0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f <trailing_symlink+0x10f> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax 0xffffffff812e48af <trailing_symlink+0x5f>: mov %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT 0xffffffff812e48b2 <trailing_symlink+0x62>: mov 0x50(%rbx),%rax # rax = nd->stack 0xffffffff812e48b6 <trailing_symlink+0x66>: movq $0x0,0x20(%rax) # stack[0].name = NULL 0xffffffff812e48be <trailing_symlink+0x6e>: mov 0x48(%rbx),%eax # nd->depth 0xffffffff812e48c1 <trailing_symlink+0x71>: mov 0x50(%rbx),%rdx # nd->stack 0xffffffff812e48c5 <trailing_symlink+0x75>: mov 0xc8(%rbx),%r13 # nd->link_inode 0xffffffff812e48cc <trailing_symlink+0x7c>: lea (%rax,%rax,2),%rax # rax = depth * 3 0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 0xffffffff812e48d4 <trailing_symlink+0x84>: lea -0x30(%rdx,%rax,1),%r15 # r15 = last 0xffffffff812e48d9 <trailing_symlink+0x89>: mov 0x8(%r15),%r14 # r14 = last->link.dentry 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) 0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950 <trailing_symlink+0x100> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi 0xffffffff812e48e9 <trailing_symlink+0x99>: callq 0xffffffff812f8a00 <atime_needs_update> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne 0xffffffff812e4a56 <trailing_symlink+0x206> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx 0xffffffff812e4905 <trailing_symlink+0xb5>: callq 0xffffffff81424310 <security_inode_follow_link> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax 0xffffffff812e490f <trailing_symlink+0xbf>: jne 0xffffffff812e4939 <trailing_symlink+0xe9> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5 <trailing_symlink+0x195> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al 0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7 <trailing_symlink+0x167> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al 0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae <trailing_symlink+0x15e> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae <trailing_symlink+0x15e> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp 0xffffffff812e494f <trailing_symlink+0xff>: retq 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi 0xffffffff812e4953 <trailing_symlink+0x103>: callq 0xffffffff812f8ae0 <touch_atime> 0xffffffff812e4958 <trailing_symlink+0x108>: callq 0xffffffff81a26410 <_cond_resched> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp 0xffffffff812e48f6 <trailing_symlink+0xa6> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx 0xffffffff812e4965 <trailing_symlink+0x115>: je 0xffffffff812e496f <trailing_symlink+0x11f> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx 0xffffffff812e4969 <trailing_symlink+0x119>: je 0xffffffff812e48ac <trailing_symlink+0x5c> 0xffffffff812e496f <trailing_symlink+0x11f>: mov $0xfffffffffffffff6,%r12 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al 0xffffffff812e4978 <trailing_symlink+0x128>: jne 0xffffffff812e493e <trailing_symlink+0xee> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax 0xffffffff812e498d <trailing_symlink+0x13d>: je 0xffffffff812e4999 <trailing_symlink+0x149> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax 0xffffffff812e4993 <trailing_symlink+0x143>: je 0xffffffff812e4a6f <trailing_symlink+0x21f> 0xffffffff812e4999 <trailing_symlink+0x149>: mov $0xffffffff82319b4f,%rdi 0xffffffff812e49a0 <trailing_symlink+0x150>: mov $0xfffffffffffffff3,%r12 0xffffffff812e49a7 <trailing_symlink+0x157>: callq 0xffffffff81161310 <audit_log_link_denied> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp 0xffffffff812e493e <trailing_symlink+0xee> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov $0xffffffff8230164d,%r12 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp 0xffffffff812e493e <trailing_symlink+0xee> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) 0xffffffff812e49bc <trailing_symlink+0x16c>: je 0xffffffff812e4a8a <trailing_symlink+0x23a> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi 0xffffffff812e49c5 <trailing_symlink+0x175>: callq 0xffffffff812e2da0 <nd_jump_root> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax 0xffffffff812e49cc <trailing_symlink+0x17c>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al 0xffffffff812e49dd <trailing_symlink+0x18d>: jne 0xffffffff812e4935 <trailing_symlink+0xe5> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp 0xffffffff812e49d2 <trailing_symlink+0x182> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov 0x20(%r13),%rax # inode->i_op 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov 0x8(%rax),%rcx # inode_operations->get_link 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne 0xffffffff812e4a1f <trailing_symlink+0x1cf> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi # nd->flags & LOOKUP_RCU == 0 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je 0xffffffff812e49ae <trailing_symlink+0x15e> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp $0xfffffffffffff000,%r12 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe 0xffffffff812e4928 <trailing_symlink+0xd8> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq 0xffffffff812e493e <trailing_symlink+0xee> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi # nd->flags & LOOKUP_RCU != 0 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp $0xfffffffffffffff6,%rax 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne 0xffffffff812e4a08 <trailing_symlink+0x1b8> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq 0xffffffff812e3840 <unlazy_walk> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp 0xffffffff812e4a08 <trailing_symlink+0x1b8> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi 0xffffffff812e4a59 <trailing_symlink+0x209>: callq 0xffffffff812e3840 <unlazy_walk> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax 0xffffffff812e4a60 <trailing_symlink+0x210>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi 0xffffffff812e4a65 <trailing_symlink+0x215>: callq 0xffffffff812f8ae0 <touch_atime> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq 0xffffffff812e48f6 <trailing_symlink+0xa6> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi 0xffffffff812e4a80 <trailing_symlink+0x230>: callq 0xffffffff811673f0 <__audit_inode> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq 0xffffffff812e4999 <trailing_symlink+0x149> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq 0xffffffff812e4790 <set_root> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq 0xffffffff812e49c2 <trailing_symlink+0x172> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov $0xfffffffffffffff6,%r12 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq 0xffffffff812e493e <trailing_symlink+0xee> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< According to my understanding, the problem solved by commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to the vfs") is a data NULL pointer dereference, but the problem here is an instruction NULL pointer dereference. Further, I analyzed the possible triggering process as follows: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create rcu_read_lock read_seqcount_retry (the last check) iput_final evict destroy_inode xfs_fs_destroy_inode xfs_inode_set_reclaim_tag xfs_ialloc spin_lock(ip->i_flags_lock) xfs_dialloc set(ip, XFS_IRECLAIMABLE) xfs_iget wakeup(xfs_reclaim_worker) rcu_read_lock spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit spin_lock(ip->i_flags_lock) if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) set(ip, XFS_IRECLAIM) spin_unlock(ip->i_flags_lock) rcu_read_unlock < ------------ > // miss synchronize_rcu() xfs_reinit_inode ->get_link = NULL get_link() // NULL rcu_read_unlock <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to the vfs"), we should start processing this NULL ->get_link pointer dereference. Or, am I thinking wrong somewhere? Thanks, Jinliang Zheng > > > > > > > Apart from that issue, I'm not aware of any other issues that the > > > XFS inode recycling directly exposes. > > > > > > > According to my understanding, the essence of this problem is that XFS reuses > > > > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen. > > > > > > It assumes that the inode will not change identity during the RCU > > > grace period after the inode has been evicted from cache. We can > > > safely reinstantiate an evicted inode without waiting for an RCU > > > grace period as long as it is the same inode with the same content > > > and same state. > > > > > > Problems *may* arise when we unlink the inode, then evict it, then a > > > new file is created and the old slab cache memory address is used > > > for the new inode. I describe the issue here: > > > > > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ > > > > And judging from the relevant emails, the main reason why ->get_link() is set > > to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when > > the inode is chosen to be reused. > > > > However, perhaps due to performance reasons, this solution has not been merged > > for a long time. How is it now? > > > > Maybe I am missing something in the threads of mail? > > > > Thank you very much. :) > > Jinliang Zheng > > > > > > > > That said, we have exactly zero evidence that this is actually a > > > problem in production systems. We did get systems tripping over the > > > symlink issue, but there's no evidence that the > > > unlink->close->open(O_CREAT) issues are manifesting in the wild and > > > hence there hasn't been any particular urgency to address it. > > > > > > > Are there any recommended workarounds until an elegant and efficient solution > > > > can be proposed? After all, causing a crash is extremely unacceptable in a > > > > production environment. > > > > > > What crashes are you seeing in your production environment? > > > > > > -Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-15 15:54 ` alexjlzheng @ 2024-05-16 4:56 ` Jinliang Zheng 2024-05-16 7:08 ` Ian Kent 0 siblings, 1 reply; 19+ messages in thread From: Jinliang Zheng @ 2024-05-16 4:56 UTC (permalink / raw) To: alexjlzheng; +Cc: bfoster, david, djwong, linux-fsdevel, linux-xfs, raven, rcu On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: > On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: > > On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: > > > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: > > > > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote: > > > > > Hi, all > > > > > > > > > > I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk > > > > > which can lead to null pointer references has been resolved? > > > > > > > > > > I browsed through emails about the following patches and their discussions: > > > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > > > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > > > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > > > > > And then came to the conclusion that this problem has not been solved, am I > > > > > right? Did I miss some patch that could solve this problem? > > > > > > > > We fixed the known problems this caused by turning off the VFS > > > > functionality that the rcu pathwalks kept tripping over. See commit > > > > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to > > > > the vfs"). > > > > > > Sorry for the delay. > > > > > > The problem I encountered in the production environment was that during the > > > rcu walk process the ->get_link() pointer was NULL, which caused a crash. > > > > > > As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink > > > metadata buffers to the vfs") first appeared in: > > > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ > > > > > > Does this commit solve the problem of NULL ->get_link()? And how? > > > > I suggest reading the call stack from wherever the VFS enters the XFS > > readlink code. If you have a reliable reproducer, then apply this patch > > to your kernel (you haven't mentioned which one it is) and see if the > > bad dereference goes away. > > > > --D > > Sorry for the delay. > > I encountered the following calltrace: > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > [20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [20213.578785] #PF: supervisor instruction fetch in kernel mode > [20213.578799] #PF: error_code(0x0010) - not-present page > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > [20213.578828] Oops: 0010 [#1] SMP NOPTI > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > [20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > [20213.578884] RIP: 0010:0x0 > [20213.578894] Code: Bad RIP value. > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000 > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000 > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380 > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000 > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48 > [20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0 > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [20213.579079] PKRU: 55555554 > [20213.579087] Call Trace: > [20213.579099] trailing_symlink+0x1da/0x260 > [20213.579112] path_lookupat.isra.53+0x79/0x220 > [20213.579125] filename_lookup.part.69+0xa0/0x170 > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > [20213.579151] ? getname_flags+0x4f/0x1e0 > [20213.579161] user_path_at_empty+0x3e/0x50 > [20213.579172] vfs_statx+0x76/0xe0 > [20213.579182] __do_sys_newstat+0x3d/0x70 > [20213.579194] ? fput+0x13/0x20 > [20213.579203] ? ksys_ioctl+0xb0/0x300 > [20213.579213] ? generic_file_llseek+0x24/0x30 > [20213.579225] ? fput+0x13/0x20 > [20213.579233] ? ksys_lseek+0x8d/0xb0 > [20213.579243] __x64_sys_newstat+0x16/0x20 > [20213.579256] do_syscall_64+0x4d/0x140 > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Please note that the kernel version I use is the one maintained by Tencent.Inc, and the baseline is v5.4. But in fact, in the latest upstream source tree, although the trailing_symlink() function has been removed, its logic has been moved to pick_link(), so the problem still exists. Ian Kent pointed out that try_to_unlazy() was introduced in pick_link() in the latest upstream source tree, but I don't understand why this can solve the NULL ->get_link pointer dereference problem, because ->get_link pointer will be dereferenced before try_to_unlazy(). (I don't understand why Ian Kent's email didn't appear on the mailing list.) Thanks, Jinliang Zheng > > And I analyzed the disassembly of trailing_symlink() and confirmed that a NULL > ->get_link() happened here: > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] > 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp > 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp > 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 > 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 > 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 > 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 > 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx > 0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx # rbx = &nameidate > 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp > 0xffffffff812e4869 <trailing_symlink+0x19>: mov 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 <sysctl_protected_symlinks> > 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax > 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx > 0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac <trailing_symlink+0x5c> > 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx > 0xffffffff812e487f <trailing_symlink+0x2f>: mov 0xc8(%rdi),%rcx # rcx = nameidata->link_inode > 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx > 0xffffffff812e488d <trailing_symlink+0x3d>: mov 0x4(%rcx),%ecx # ecx = link_inode->uid > 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) > 0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac <trailing_symlink+0x5c> > 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi > 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx > 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx > 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx > 0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f <trailing_symlink+0x10f> > 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax > 0xffffffff812e48af <trailing_symlink+0x5f>: mov %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT > 0xffffffff812e48b2 <trailing_symlink+0x62>: mov 0x50(%rbx),%rax # rax = nd->stack > 0xffffffff812e48b6 <trailing_symlink+0x66>: movq $0x0,0x20(%rax) # stack[0].name = NULL > 0xffffffff812e48be <trailing_symlink+0x6e>: mov 0x48(%rbx),%eax # nd->depth > 0xffffffff812e48c1 <trailing_symlink+0x71>: mov 0x50(%rbx),%rdx # nd->stack > 0xffffffff812e48c5 <trailing_symlink+0x75>: mov 0xc8(%rbx),%r13 # nd->link_inode > 0xffffffff812e48cc <trailing_symlink+0x7c>: lea (%rax,%rax,2),%rax # rax = depth * 3 > 0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 > 0xffffffff812e48d4 <trailing_symlink+0x84>: lea -0x30(%rdx,%rax,1),%r15 # r15 = last > 0xffffffff812e48d9 <trailing_symlink+0x89>: mov 0x8(%r15),%r14 # r14 = last->link.dentry > 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) > 0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950 <trailing_symlink+0x100> > 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi > 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi > 0xffffffff812e48e9 <trailing_symlink+0x99>: callq 0xffffffff812f8a00 <atime_needs_update> > 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al > 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne 0xffffffff812e4a56 <trailing_symlink+0x206> > 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx > 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi > 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi > 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx > 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx > 0xffffffff812e4905 <trailing_symlink+0xb5>: callq 0xffffffff81424310 <security_inode_follow_link> > 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 > 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax > 0xffffffff812e490f <trailing_symlink+0xbf>: jne 0xffffffff812e4939 <trailing_symlink+0xe9> > 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) > 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 > 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 > 0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5 <trailing_symlink+0x195> > 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax > 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al > 0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7 <trailing_symlink+0x167> > 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al > 0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae <trailing_symlink+0x15e> > 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 > 0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae <trailing_symlink+0x15e> > 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp > 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax > 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx > 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 > 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 > 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 > 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 > 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp > 0xffffffff812e494f <trailing_symlink+0xff>: retq > 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi > 0xffffffff812e4953 <trailing_symlink+0x103>: callq 0xffffffff812f8ae0 <touch_atime> > 0xffffffff812e4958 <trailing_symlink+0x108>: callq 0xffffffff81a26410 <_cond_resched> > 0xffffffff812e495d <trailing_symlink+0x10d>: jmp 0xffffffff812e48f6 <trailing_symlink+0xa6> > 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx > 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx > 0xffffffff812e4965 <trailing_symlink+0x115>: je 0xffffffff812e496f <trailing_symlink+0x11f> > 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx > 0xffffffff812e4969 <trailing_symlink+0x119>: je 0xffffffff812e48ac <trailing_symlink+0x5c> > 0xffffffff812e496f <trailing_symlink+0x11f>: mov $0xfffffffffffffff6,%r12 > 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al > 0xffffffff812e4978 <trailing_symlink+0x128>: jne 0xffffffff812e493e <trailing_symlink+0xee> > 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax > 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax > 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax > 0xffffffff812e498d <trailing_symlink+0x13d>: je 0xffffffff812e4999 <trailing_symlink+0x149> > 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax > 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax > 0xffffffff812e4993 <trailing_symlink+0x143>: je 0xffffffff812e4a6f <trailing_symlink+0x21f> > 0xffffffff812e4999 <trailing_symlink+0x149>: mov $0xffffffff82319b4f,%rdi > 0xffffffff812e49a0 <trailing_symlink+0x150>: mov $0xfffffffffffffff3,%r12 > 0xffffffff812e49a7 <trailing_symlink+0x157>: callq 0xffffffff81161310 <audit_log_link_denied> > 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp 0xffffffff812e493e <trailing_symlink+0xee> > 0xffffffff812e49ae <trailing_symlink+0x15e>: mov $0xffffffff8230164d,%r12 > 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp 0xffffffff812e493e <trailing_symlink+0xee> > 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) > 0xffffffff812e49bc <trailing_symlink+0x16c>: je 0xffffffff812e4a8a <trailing_symlink+0x23a> > 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi > 0xffffffff812e49c5 <trailing_symlink+0x175>: callq 0xffffffff812e2da0 <nd_jump_root> > 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax > 0xffffffff812e49cc <trailing_symlink+0x17c>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> > 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 > 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax > 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al > 0xffffffff812e49dd <trailing_symlink+0x18d>: jne 0xffffffff812e4935 <trailing_symlink+0xe5> > 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp 0xffffffff812e49d2 <trailing_symlink+0x182> > 0xffffffff812e49e5 <trailing_symlink+0x195>: mov 0x20(%r13),%rax # inode->i_op > 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 > 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi > 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx > 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov 0x8(%rax),%rcx # inode_operations->get_link > 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) > 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne 0xffffffff812e4a1f <trailing_symlink+0x1cf> > 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi # nd->flags & LOOKUP_RCU == 0 > 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 > 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 > 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je 0xffffffff812e49ae <trailing_symlink+0x15e> > 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp $0xfffffffffffff000,%r12 > 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe 0xffffffff812e4928 <trailing_symlink+0xd8> > 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq 0xffffffff812e493e <trailing_symlink+0xee> > 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi # nd->flags & LOOKUP_RCU != 0 > 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) > 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 > 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp $0xfffffffffffffff6,%rax > 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne 0xffffffff812e4a08 <trailing_symlink+0x1b8> > 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi > 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq 0xffffffff812e3840 <unlazy_walk> > 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax > 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> > 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx > 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi > 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi > 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx > 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> > 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 > 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp 0xffffffff812e4a08 <trailing_symlink+0x1b8> > 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi > 0xffffffff812e4a59 <trailing_symlink+0x209>: callq 0xffffffff812e3840 <unlazy_walk> > 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax > 0xffffffff812e4a60 <trailing_symlink+0x210>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> > 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi > 0xffffffff812e4a65 <trailing_symlink+0x215>: callq 0xffffffff812f8ae0 <touch_atime> > 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq 0xffffffff812e48f6 <trailing_symlink+0xa6> > 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax > 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi > 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx > 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi > 0xffffffff812e4a80 <trailing_symlink+0x230>: callq 0xffffffff811673f0 <__audit_inode> > 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq 0xffffffff812e4999 <trailing_symlink+0x149> > 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi > 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq 0xffffffff812e4790 <set_root> > 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq 0xffffffff812e49c2 <trailing_symlink+0x172> > 0xffffffff812e4a97 <trailing_symlink+0x247>: mov $0xfffffffffffffff6,%r12 > 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq 0xffffffff812e493e <trailing_symlink+0xee> > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > According to my understanding, the problem solved by commit 7b7820b83f23 ("xfs: > don't expose internal symlink metadata buffers to the vfs") is a data NULL > pointer dereference, but the problem here is an instruction NULL pointer > dereference. > > Further, I analyzed the possible triggering process as follows: > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > rcu_walk do_unlinkat ~~> prune_dcache_sb create > > > rcu_read_lock > > read_seqcount_retry > (the last check) iput_final > evict > destroy_inode > xfs_fs_destroy_inode > xfs_inode_set_reclaim_tag xfs_ialloc > spin_lock(ip->i_flags_lock) xfs_dialloc > set(ip, XFS_IRECLAIMABLE) xfs_iget > wakeup(xfs_reclaim_worker) rcu_read_lock > spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit > spin_lock(ip->i_flags_lock) > if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) > set(ip, XFS_IRECLAIM) > spin_unlock(ip->i_flags_lock) > rcu_read_unlock > < ------------ > > // miss synchronize_rcu() > xfs_reinit_inode > ->get_link = NULL > get_link() // NULL > > rcu_read_unlock > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose internal > symlink metadata buffers to the vfs"), we should start processing this NULL > ->get_link pointer dereference. > > Or, am I thinking wrong somewhere? > > Thanks, > Jinliang Zheng > > > > > > > > > > > Apart from that issue, I'm not aware of any other issues that the > > > > XFS inode recycling directly exposes. > > > > > > > > > According to my understanding, the essence of this problem is that XFS reuses > > > > > the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen. > > > > > > > > It assumes that the inode will not change identity during the RCU > > > > grace period after the inode has been evicted from cache. We can > > > > safely reinstantiate an evicted inode without waiting for an RCU > > > > grace period as long as it is the same inode with the same content > > > > and same state. > > > > > > > > Problems *may* arise when we unlink the inode, then evict it, then a > > > > new file is created and the old slab cache memory address is used > > > > for the new inode. I describe the issue here: > > > > > > > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ > > > > > > And judging from the relevant emails, the main reason why ->get_link() is set > > > to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when > > > the inode is chosen to be reused. > > > > > > However, perhaps due to performance reasons, this solution has not been merged > > > for a long time. How is it now? > > > > > > Maybe I am missing something in the threads of mail? > > > > > > Thank you very much. :) > > > Jinliang Zheng > > > > > > > > > > > That said, we have exactly zero evidence that this is actually a > > > > problem in production systems. We did get systems tripping over the > > > > symlink issue, but there's no evidence that the > > > > unlink->close->open(O_CREAT) issues are manifesting in the wild and > > > > hence there hasn't been any particular urgency to address it. > > > > > > > > > Are there any recommended workarounds until an elegant and efficient solution > > > > > can be proposed? After all, causing a crash is extremely unacceptable in a > > > > > production environment. > > > > > > > > What crashes are you seeing in your production environment? > > > > > > > > -Dave. > > > > -- > > > > Dave Chinner > > > > david@fromorbit.com > > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-16 4:56 ` Jinliang Zheng @ 2024-05-16 7:08 ` Ian Kent 2024-05-16 7:23 ` Ian Kent 0 siblings, 1 reply; 19+ messages in thread From: Ian Kent @ 2024-05-16 7:08 UTC (permalink / raw) To: Jinliang Zheng, alexjlzheng Cc: bfoster, david, djwong, linux-fsdevel, linux-xfs, rcu On 16/5/24 12:56, Jinliang Zheng wrote: > On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: >> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: >>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: >>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: >>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote: >>>>>> Hi, all >>>>>> >>>>>> I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk >>>>>> which can lead to null pointer references has been resolved? >>>>>> >>>>>> I browsed through emails about the following patches and their discussions: >>>>>> - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ >>>>>> - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ >>>>>> - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ >>>>>> >>>>>> And then came to the conclusion that this problem has not been solved, am I >>>>>> right? Did I miss some patch that could solve this problem? >>>>> We fixed the known problems this caused by turning off the VFS >>>>> functionality that the rcu pathwalks kept tripping over. See commit >>>>> 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to >>>>> the vfs"). >>>> Sorry for the delay. >>>> >>>> The problem I encountered in the production environment was that during the >>>> rcu walk process the ->get_link() pointer was NULL, which caused a crash. >>>> >>>> As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink >>>> metadata buffers to the vfs") first appeared in: >>>> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ >>>> >>>> Does this commit solve the problem of NULL ->get_link()? And how? >>> I suggest reading the call stack from wherever the VFS enters the XFS >>> readlink code. If you have a reliable reproducer, then apply this patch >>> to your kernel (you haven't mentioned which one it is) and see if the >>> bad dereference goes away. >>> >>> --D >> Sorry for the delay. >> >> I encountered the following calltrace: >> >> [20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000 >> [20213.578785] #PF: supervisor instruction fetch in kernel mode >> [20213.578799] #PF: error_code(0x0010) - not-present page >> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 >> [20213.578828] Oops: 0010 [#1] SMP NOPTI >> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 >> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 >> [20213.578884] RIP: 0010:0x0 >> [20213.578894] Code: Bad RIP value. >> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 >> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000 >> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000 >> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380 >> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000 >> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48 >> [20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 >> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0 >> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> [20213.579079] PKRU: 55555554 >> [20213.579087] Call Trace: >> [20213.579099] trailing_symlink+0x1da/0x260 >> [20213.579112] path_lookupat.isra.53+0x79/0x220 >> [20213.579125] filename_lookup.part.69+0xa0/0x170 >> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 >> [20213.579151] ? getname_flags+0x4f/0x1e0 >> [20213.579161] user_path_at_empty+0x3e/0x50 >> [20213.579172] vfs_statx+0x76/0xe0 >> [20213.579182] __do_sys_newstat+0x3d/0x70 >> [20213.579194] ? fput+0x13/0x20 >> [20213.579203] ? ksys_ioctl+0xb0/0x300 >> [20213.579213] ? generic_file_llseek+0x24/0x30 >> [20213.579225] ? fput+0x13/0x20 >> [20213.579233] ? ksys_lseek+0x8d/0xb0 >> [20213.579243] __x64_sys_newstat+0x16/0x20 >> [20213.579256] do_syscall_64+0x4d/0x140 >> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 >> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > Please note that the kernel version I use is the one maintained by Tencent.Inc, > and the baseline is v5.4. But in fact, in the latest upstream source tree, > although the trailing_symlink() function has been removed, its logic has been > moved to pick_link(), so the problem still exists. > > Ian Kent pointed out that try_to_unlazy() was introduced in pick_link() in the > latest upstream source tree, but I don't understand why this can solve the NULL > ->get_link pointer dereference problem, because ->get_link pointer will be > dereferenced before try_to_unlazy(). > > (I don't understand why Ian Kent's email didn't appear on the mailing list.) It was something about html mail and I think my mail client was at fault. In any case what you say is indeed correct, so the comment isn't important. Fact is it is still a race between the lockless path walk and inode eviction and xfs recycling. I believe that the xfs recycling code is very hard to fix. IIRC correctly putting a NULL check in pick_link() was not considered acceptable but there must be a way that is acceptable to check this and restart the walk. Maybe there was a reluctance to suffer the overhead of restarting the walk when it shouldn't be needed. The alternative would be to find some way to identify when it's unsafe to reuse an inode marked for re-cycle before dropping rcu read, perhaps with the reference count plus the seqlock. Basically, to reuse inodes xfs will need to identify when the race occurs and let the inode go away under rcu and create a new one if a race is detected. But possibly that isn't nearly as simple as it sounds? > > Thanks, > Jinliang Zheng > >> And I analyzed the disassembly of trailing_symlink() and confirmed that a NULL >> ->get_link() happened here: >> >> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] >> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp >> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp >> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 >> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 >> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 >> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 >> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx >> 0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx # rbx = &nameidate >> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp >> 0xffffffff812e4869 <trailing_symlink+0x19>: mov 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 <sysctl_protected_symlinks> >> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax >> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx >> 0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac <trailing_symlink+0x5c> >> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx >> 0xffffffff812e487f <trailing_symlink+0x2f>: mov 0xc8(%rdi),%rcx # rcx = nameidata->link_inode >> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx >> 0xffffffff812e488d <trailing_symlink+0x3d>: mov 0x4(%rcx),%ecx # ecx = link_inode->uid >> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) >> 0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac <trailing_symlink+0x5c> >> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi >> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx >> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx >> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx >> 0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f <trailing_symlink+0x10f> >> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax >> 0xffffffff812e48af <trailing_symlink+0x5f>: mov %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT >> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov 0x50(%rbx),%rax # rax = nd->stack >> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq $0x0,0x20(%rax) # stack[0].name = NULL >> 0xffffffff812e48be <trailing_symlink+0x6e>: mov 0x48(%rbx),%eax # nd->depth >> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov 0x50(%rbx),%rdx # nd->stack >> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov 0xc8(%rbx),%r13 # nd->link_inode >> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea (%rax,%rax,2),%rax # rax = depth * 3 >> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 >> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea -0x30(%rdx,%rax,1),%r15 # r15 = last >> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov 0x8(%r15),%r14 # r14 = last->link.dentry >> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) >> 0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950 <trailing_symlink+0x100> >> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi >> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi >> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq 0xffffffff812f8a00 <atime_needs_update> >> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al >> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne 0xffffffff812e4a56 <trailing_symlink+0x206> >> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx >> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi >> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi >> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx >> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx >> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq 0xffffffff81424310 <security_inode_follow_link> >> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 >> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax >> 0xffffffff812e490f <trailing_symlink+0xbf>: jne 0xffffffff812e4939 <trailing_symlink+0xe9> >> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) >> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 >> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 >> 0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5 <trailing_symlink+0x195> >> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax >> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al >> 0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7 <trailing_symlink+0x167> >> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al >> 0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae <trailing_symlink+0x15e> >> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 >> 0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae <trailing_symlink+0x15e> >> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp >> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax >> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx >> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 >> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 >> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 >> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 >> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp >> 0xffffffff812e494f <trailing_symlink+0xff>: retq >> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi >> 0xffffffff812e4953 <trailing_symlink+0x103>: callq 0xffffffff812f8ae0 <touch_atime> >> 0xffffffff812e4958 <trailing_symlink+0x108>: callq 0xffffffff81a26410 <_cond_resched> >> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp 0xffffffff812e48f6 <trailing_symlink+0xa6> >> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx >> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx >> 0xffffffff812e4965 <trailing_symlink+0x115>: je 0xffffffff812e496f <trailing_symlink+0x11f> >> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx >> 0xffffffff812e4969 <trailing_symlink+0x119>: je 0xffffffff812e48ac <trailing_symlink+0x5c> >> 0xffffffff812e496f <trailing_symlink+0x11f>: mov $0xfffffffffffffff6,%r12 >> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al >> 0xffffffff812e4978 <trailing_symlink+0x128>: jne 0xffffffff812e493e <trailing_symlink+0xee> >> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax >> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax >> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax >> 0xffffffff812e498d <trailing_symlink+0x13d>: je 0xffffffff812e4999 <trailing_symlink+0x149> >> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax >> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax >> 0xffffffff812e4993 <trailing_symlink+0x143>: je 0xffffffff812e4a6f <trailing_symlink+0x21f> >> 0xffffffff812e4999 <trailing_symlink+0x149>: mov $0xffffffff82319b4f,%rdi >> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov $0xfffffffffffffff3,%r12 >> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq 0xffffffff81161310 <audit_log_link_denied> >> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp 0xffffffff812e493e <trailing_symlink+0xee> >> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov $0xffffffff8230164d,%r12 >> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp 0xffffffff812e493e <trailing_symlink+0xee> >> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) >> 0xffffffff812e49bc <trailing_symlink+0x16c>: je 0xffffffff812e4a8a <trailing_symlink+0x23a> >> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi >> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq 0xffffffff812e2da0 <nd_jump_root> >> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax >> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> >> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 >> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax >> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al >> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne 0xffffffff812e4935 <trailing_symlink+0xe5> >> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp 0xffffffff812e49d2 <trailing_symlink+0x182> >> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov 0x20(%r13),%rax # inode->i_op >> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 >> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi >> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx >> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov 0x8(%rax),%rcx # inode_operations->get_link >> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) >> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne 0xffffffff812e4a1f <trailing_symlink+0x1cf> >> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi # nd->flags & LOOKUP_RCU == 0 >> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 >> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 >> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je 0xffffffff812e49ae <trailing_symlink+0x15e> >> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp $0xfffffffffffff000,%r12 >> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe 0xffffffff812e4928 <trailing_symlink+0xd8> >> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq 0xffffffff812e493e <trailing_symlink+0xee> >> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi # nd->flags & LOOKUP_RCU != 0 >> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) >> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 >> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp $0xfffffffffffffff6,%rax >> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne 0xffffffff812e4a08 <trailing_symlink+0x1b8> >> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi >> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq 0xffffffff812e3840 <unlazy_walk> >> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax >> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> >> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx >> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi >> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi >> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx >> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> >> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 >> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp 0xffffffff812e4a08 <trailing_symlink+0x1b8> >> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi >> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq 0xffffffff812e3840 <unlazy_walk> >> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax >> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne 0xffffffff812e4a97 <trailing_symlink+0x247> >> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi >> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq 0xffffffff812f8ae0 <touch_atime> >> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq 0xffffffff812e48f6 <trailing_symlink+0xa6> >> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax >> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi >> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx >> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi >> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq 0xffffffff811673f0 <__audit_inode> >> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq 0xffffffff812e4999 <trailing_symlink+0x149> >> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi >> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq 0xffffffff812e4790 <set_root> >> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq 0xffffffff812e49c2 <trailing_symlink+0x172> >> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov $0xfffffffffffffff6,%r12 >> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq 0xffffffff812e493e <trailing_symlink+0xee> >> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >> >> According to my understanding, the problem solved by commit 7b7820b83f23 ("xfs: >> don't expose internal symlink metadata buffers to the vfs") is a data NULL >> pointer dereference, but the problem here is an instruction NULL pointer >> dereference. >> >> Further, I analyzed the possible triggering process as follows: >> >> rcu_walk do_unlinkat ~~> prune_dcache_sb create >> >> >> rcu_read_lock >> >> read_seqcount_retry >> (the last check) iput_final >> evict >> destroy_inode >> xfs_fs_destroy_inode >> xfs_inode_set_reclaim_tag xfs_ialloc >> spin_lock(ip->i_flags_lock) xfs_dialloc >> set(ip, XFS_IRECLAIMABLE) xfs_iget >> wakeup(xfs_reclaim_worker) rcu_read_lock >> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit >> spin_lock(ip->i_flags_lock) >> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) >> set(ip, XFS_IRECLAIM) >> spin_unlock(ip->i_flags_lock) >> rcu_read_unlock >> < ------------ > >> // miss synchronize_rcu() >> xfs_reinit_inode >> ->get_link = NULL >> get_link() // NULL >> >> rcu_read_unlock >> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >> >> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose internal >> symlink metadata buffers to the vfs"), we should start processing this NULL >> ->get_link pointer dereference. >> >> Or, am I thinking wrong somewhere? >> >> Thanks, >> Jinliang Zheng >> >>>>> Apart from that issue, I'm not aware of any other issues that the >>>>> XFS inode recycling directly exposes. >>>>> >>>>>> According to my understanding, the essence of this problem is that XFS reuses >>>>>> the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen. >>>>> It assumes that the inode will not change identity during the RCU >>>>> grace period after the inode has been evicted from cache. We can >>>>> safely reinstantiate an evicted inode without waiting for an RCU >>>>> grace period as long as it is the same inode with the same content >>>>> and same state. >>>>> >>>>> Problems *may* arise when we unlink the inode, then evict it, then a >>>>> new file is created and the old slab cache memory address is used >>>>> for the new inode. I describe the issue here: >>>>> >>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ >>>> And judging from the relevant emails, the main reason why ->get_link() is set >>>> to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when >>>> the inode is chosen to be reused. >>>> >>>> However, perhaps due to performance reasons, this solution has not been merged >>>> for a long time. How is it now? >>>> >>>> Maybe I am missing something in the threads of mail? >>>> >>>> Thank you very much. :) >>>> Jinliang Zheng >>>> >>>>> That said, we have exactly zero evidence that this is actually a >>>>> problem in production systems. We did get systems tripping over the >>>>> symlink issue, but there's no evidence that the >>>>> unlink->close->open(O_CREAT) issues are manifesting in the wild and >>>>> hence there hasn't been any particular urgency to address it. >>>>> >>>>>> Are there any recommended workarounds until an elegant and efficient solution >>>>>> can be proposed? After all, causing a crash is extremely unacceptable in a >>>>>> production environment. >>>>> What crashes are you seeing in your production environment? >>>>> >>>>> -Dave. >>>>> -- >>>>> Dave Chinner >>>>> david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-16 7:08 ` Ian Kent @ 2024-05-16 7:23 ` Ian Kent 2024-05-20 17:36 ` Darrick J. Wong 2024-05-27 9:41 ` Dave Chinner 0 siblings, 2 replies; 19+ messages in thread From: Ian Kent @ 2024-05-16 7:23 UTC (permalink / raw) To: Jinliang Zheng, alexjlzheng Cc: bfoster, david, djwong, linux-fsdevel, linux-xfs, rcu On 16/5/24 15:08, Ian Kent wrote: > On 16/5/24 12:56, Jinliang Zheng wrote: >> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: >>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: >>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: >>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: >>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com >>>>>> wrote: >>>>>>> Hi, all >>>>>>> >>>>>>> I would like to ask if the conflict between xfs inode recycle >>>>>>> and vfs rcu-walk >>>>>>> which can lead to null pointer references has been resolved? >>>>>>> >>>>>>> I browsed through emails about the following patches and their >>>>>>> discussions: >>>>>>> - >>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ >>>>>>> - >>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ >>>>>>> - >>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ >>>>>>> >>>>>>> And then came to the conclusion that this problem has not been >>>>>>> solved, am I >>>>>>> right? Did I miss some patch that could solve this problem? >>>>>> We fixed the known problems this caused by turning off the VFS >>>>>> functionality that the rcu pathwalks kept tripping over. See commit >>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink metadata >>>>>> buffers to >>>>>> the vfs"). >>>>> Sorry for the delay. >>>>> >>>>> The problem I encountered in the production environment was that >>>>> during the >>>>> rcu walk process the ->get_link() pointer was NULL, which caused a >>>>> crash. >>>>> >>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal >>>>> symlink >>>>> metadata buffers to the vfs") first appeared in: >>>>> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ >>>>> >>>>> Does this commit solve the problem of NULL ->get_link()? And how? >>>> I suggest reading the call stack from wherever the VFS enters the XFS >>>> readlink code. If you have a reliable reproducer, then apply this >>>> patch >>>> to your kernel (you haven't mentioned which one it is) and see if the >>>> bad dereference goes away. >>>> >>>> --D >>> Sorry for the delay. >>> >>> I encountered the following calltrace: >>> >>> [20213.578756] BUG: kernel NULL pointer dereference, address: >>> 0000000000000000 >>> [20213.578785] #PF: supervisor instruction fetch in kernel mode >>> [20213.578799] #PF: error_code(0x0010) - not-present page >>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 >>> [20213.578828] Oops: 0010 [#1] SMP NOPTI >>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded >>> Not tainted 5.4.241-1-tlinux4-0017.3 #1 >>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. >>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 >>> [20213.578884] RIP: 0010:0x0 >>> [20213.578894] Code: Bad RIP value. >>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 >>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: >>> 0000000000000000 >>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: >>> 0000000000000000 >>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: >>> ffff889b9eeae380 >>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: >>> 0000000000000000 >>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: >>> ffffc90021ebfd48 >>> [20213.578998] FS: 00007f89c534e740(0000) GS:ffff88c07fd00000(0000) >>> knlGS:0000000000000000 >>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: >>> 00000000007706e0 >>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>> 0000000000000000 >>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: >>> 0000000000000400 >>> [20213.579079] PKRU: 55555554 >>> [20213.579087] Call Trace: >>> [20213.579099] trailing_symlink+0x1da/0x260 >>> [20213.579112] path_lookupat.isra.53+0x79/0x220 >>> [20213.579125] filename_lookup.part.69+0xa0/0x170 >>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 >>> [20213.579151] ? getname_flags+0x4f/0x1e0 >>> [20213.579161] user_path_at_empty+0x3e/0x50 >>> [20213.579172] vfs_statx+0x76/0xe0 >>> [20213.579182] __do_sys_newstat+0x3d/0x70 >>> [20213.579194] ? fput+0x13/0x20 >>> [20213.579203] ? ksys_ioctl+0xb0/0x300 >>> [20213.579213] ? generic_file_llseek+0x24/0x30 >>> [20213.579225] ? fput+0x13/0x20 >>> [20213.579233] ? ksys_lseek+0x8d/0xb0 >>> [20213.579243] __x64_sys_newstat+0x16/0x20 >>> [20213.579256] do_syscall_64+0x4d/0x140 >>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 >>> >>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>> >> Please note that the kernel version I use is the one maintained by >> Tencent.Inc, >> and the baseline is v5.4. But in fact, in the latest upstream source >> tree, >> although the trailing_symlink() function has been removed, its logic >> has been >> moved to pick_link(), so the problem still exists. >> >> Ian Kent pointed out that try_to_unlazy() was introduced in >> pick_link() in the >> latest upstream source tree, but I don't understand why this can >> solve the NULL >> ->get_link pointer dereference problem, because ->get_link pointer >> will be >> dereferenced before try_to_unlazy(). >> >> (I don't understand why Ian Kent's email didn't appear on the mailing >> list.) > > It was something about html mail and I think my mail client was at fault. > > In any case what you say is indeed correct, so the comment isn't > important. > > > Fact is it is still a race between the lockless path walk and inode > eviction > > and xfs recycling. I believe that the xfs recycling code is very hard > to fix. > > > IIRC correctly putting a NULL check in pick_link() was not considered > acceptable > > but there must be a way that is acceptable to check this and restart > the walk. > > Maybe there was a reluctance to suffer the overhead of restarting the > walk when > > it shouldn't be needed. Or perhaps the worry was that if it can become NULL it could also become a pointer to a different (incorrect) link altogether which could have really odd/unpleasant outcomes. > > > The alternative would be to find some way to identify when it's unsafe > to reuse > > an inode marked for re-cycle before dropping rcu read, perhaps with > the reference > > count plus the seqlock. Basically, to reuse inodes xfs will need to > identify when > > the race occurs and let the inode go away under rcu and create a new > one if a race > > is detected. But possibly that isn't nearly as simple as it sounds? > > >> >> Thanks, >> Jinliang Zheng >> >>> And I analyzed the disassembly of trailing_symlink() and confirmed >>> that a NULL >>> ->get_link() happened here: >>> >>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) >>> [FTRACE NOP] >>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp >>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp >>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 >>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 >>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 >>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 >>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx >>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov %rdi,%rbx >>> # rbx = &nameidate >>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp >>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov >>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 >>> <sysctl_protected_symlinks> >>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax >>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx >>> 0xffffffff812e4874 <trailing_symlink+0x24>: je 0xffffffff812e48ac >>> <trailing_symlink+0x5c> >>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx >>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov >>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode >>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx >>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov >>> 0x4(%rcx),%ecx # ecx = link_inode->uid >>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) >>> 0xffffffff812e4893 <trailing_symlink+0x43>: je 0xffffffff812e48ac >>> <trailing_symlink+0x5c> >>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi >>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx >>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx >>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx >>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je 0xffffffff812e495f >>> <trailing_symlink+0x10f> >>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax >>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov >>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT >>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov >>> 0x50(%rbx),%rax # rax = nd->stack >>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq >>> $0x0,0x20(%rax) # stack[0].name = NULL >>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov >>> 0x48(%rbx),%eax # nd->depth >>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov >>> 0x50(%rbx),%rdx # nd->stack >>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov >>> 0xc8(%rbx),%r13 # nd->link_inode >>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea >>> (%rax,%rax,2),%rax # rax = depth * 3 >>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl $0x4,%rax >>> # rax = rax << 4, sizeof(saved):0x30 >>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea >>> -0x30(%rdx,%rax,1),%r15 # r15 = last >>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov >>> 0x8(%r15),%r14 # r14 = last->link.dentry >>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) >>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je 0xffffffff812e4950 >>> <trailing_symlink+0x100> >>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi >>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi >>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq >>> 0xffffffff812f8a00 <atime_needs_update> >>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al >>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne >>> 0xffffffff812e4a56 <trailing_symlink+0x206> >>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx >>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi >>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi >>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx >>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx >>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq >>> 0xffffffff81424310 <security_inode_follow_link> >>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 >>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax >>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne >>> 0xffffffff812e4939 <trailing_symlink+0xe9> >>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) >>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 >>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 >>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je 0xffffffff812e49e5 >>> <trailing_symlink+0x195> >>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax >>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al >>> 0xffffffff812e492f <trailing_symlink+0xdf>: je 0xffffffff812e49b7 >>> <trailing_symlink+0x167> >>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al >>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je 0xffffffff812e49ae >>> <trailing_symlink+0x15e> >>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 >>> 0xffffffff812e493c <trailing_symlink+0xec>: je 0xffffffff812e49ae >>> <trailing_symlink+0x15e> >>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp >>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax >>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx >>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 >>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 >>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 >>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 >>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp >>> 0xffffffff812e494f <trailing_symlink+0xff>: retq >>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi >>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq >>> 0xffffffff812f8ae0 <touch_atime> >>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq >>> 0xffffffff81a26410 <_cond_resched> >>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp >>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx >>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx >>> 0xffffffff812e4965 <trailing_symlink+0x115>: je >>> 0xffffffff812e496f <trailing_symlink+0x11f> >>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx >>> 0xffffffff812e4969 <trailing_symlink+0x119>: je >>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov >>> $0xfffffffffffffff6,%r12 >>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al >>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne >>> 0xffffffff812e493e <trailing_symlink+0xee> >>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax >>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax >>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax >>> 0xffffffff812e498d <trailing_symlink+0x13d>: je >>> 0xffffffff812e4999 <trailing_symlink+0x149> >>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax >>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax >>> 0xffffffff812e4993 <trailing_symlink+0x143>: je >>> 0xffffffff812e4a6f <trailing_symlink+0x21f> >>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov >>> $0xffffffff82319b4f,%rdi >>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov >>> $0xfffffffffffffff3,%r12 >>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq >>> 0xffffffff81161310 <audit_log_link_denied> >>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp >>> 0xffffffff812e493e <trailing_symlink+0xee> >>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov >>> $0xffffffff8230164d,%r12 >>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp >>> 0xffffffff812e493e <trailing_symlink+0xee> >>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) >>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je >>> 0xffffffff812e4a8a <trailing_symlink+0x23a> >>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi >>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq >>> 0xffffffff812e2da0 <nd_jump_root> >>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax >>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne >>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 >>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax >>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al >>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne >>> 0xffffffff812e4935 <trailing_symlink+0xe5> >>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp >>> 0xffffffff812e49d2 <trailing_symlink+0x182> >>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov >>> 0x20(%r13),%rax # inode->i_op >>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 >>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi >>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx >>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov >>> 0x8(%rax),%rcx # inode_operations->get_link >>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) >>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne >>> 0xffffffff812e4a1f <trailing_symlink+0x1cf> >>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov %r14,%rdi >>> # nd->flags & LOOKUP_RCU == 0 >>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq >>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 >>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 >>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je >>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp >>> $0xfffffffffffff000,%r12 >>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe >>> 0xffffffff812e4928 <trailing_symlink+0xd8> >>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq >>> 0xffffffff812e493e <trailing_symlink+0xee> >>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor %edi,%edi >>> # nd->flags & LOOKUP_RCU != 0 >>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) >>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq >>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 >>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp >>> $0xfffffffffffffff6,%rax >>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne >>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi >>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq >>> 0xffffffff812e3840 <unlazy_walk> >>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax >>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne >>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx >>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi >>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi >>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx >>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq >>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> >>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 >>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp >>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi >>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq >>> 0xffffffff812e3840 <unlazy_walk> >>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax >>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne >>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi >>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq >>> 0xffffffff812f8ae0 <touch_atime> >>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq >>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax >>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi >>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx >>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi >>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq >>> 0xffffffff811673f0 <__audit_inode> >>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq >>> 0xffffffff812e4999 <trailing_symlink+0x149> >>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi >>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq >>> 0xffffffff812e4790 <set_root> >>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq >>> 0xffffffff812e49c2 <trailing_symlink+0x172> >>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov >>> $0xfffffffffffffff6,%r12 >>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq >>> 0xffffffff812e493e <trailing_symlink+0xee> >>> >>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>> >>> >>> According to my understanding, the problem solved by commit >>> 7b7820b83f23 ("xfs: >>> don't expose internal symlink metadata buffers to the vfs") is a >>> data NULL >>> pointer dereference, but the problem here is an instruction NULL >>> pointer >>> dereference. >>> >>> Further, I analyzed the possible triggering process as follows: >>> >>> rcu_walk do_unlinkat ~~> prune_dcache_sb create >>> rcu_read_lock >>> read_seqcount_retry >>> (the last check) iput_final >>> evict >>> destroy_inode >>> xfs_fs_destroy_inode >>> xfs_inode_set_reclaim_tag xfs_ialloc >>> spin_lock(ip->i_flags_lock) xfs_dialloc >>> set(ip, XFS_IRECLAIMABLE) >>> xfs_iget >>> wakeup(xfs_reclaim_worker) rcu_read_lock >>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit >>> spin_lock(ip->i_flags_lock) >>> >>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) >>> set(ip, XFS_IRECLAIM) >>> spin_unlock(ip->i_flags_lock) >>> rcu_read_unlock >>> < ------------ > >>> >>> // miss synchronize_rcu() >>> xfs_reinit_inode >>> ->get_link = NULL >>> get_link() // NULL >>> >>> rcu_read_unlock >>> >>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>> >>> >>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't >>> expose internal >>> symlink metadata buffers to the vfs"), we should start processing >>> this NULL >>> ->get_link pointer dereference. >>> >>> Or, am I thinking wrong somewhere? >>> >>> Thanks, >>> Jinliang Zheng >>> >>>>>> Apart from that issue, I'm not aware of any other issues that the >>>>>> XFS inode recycling directly exposes. >>>>>> >>>>>>> According to my understanding, the essence of this problem is >>>>>>> that XFS reuses >>>>>>> the inode evicted by VFS, but VFS rcu-walk assumes that this >>>>>>> will not happen. >>>>>> It assumes that the inode will not change identity during the RCU >>>>>> grace period after the inode has been evicted from cache. We can >>>>>> safely reinstantiate an evicted inode without waiting for an RCU >>>>>> grace period as long as it is the same inode with the same content >>>>>> and same state. >>>>>> >>>>>> Problems *may* arise when we unlink the inode, then evict it, then a >>>>>> new file is created and the old slab cache memory address is used >>>>>> for the new inode. I describe the issue here: >>>>>> >>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ >>>>>> >>>>> And judging from the relevant emails, the main reason why >>>>> ->get_link() is set >>>>> to NULL should be the lack of synchronize_rcu() before >>>>> xfs_reinit_inode() when >>>>> the inode is chosen to be reused. >>>>> >>>>> However, perhaps due to performance reasons, this solution has not >>>>> been merged >>>>> for a long time. How is it now? >>>>> >>>>> Maybe I am missing something in the threads of mail? >>>>> >>>>> Thank you very much. :) >>>>> Jinliang Zheng >>>>> >>>>>> That said, we have exactly zero evidence that this is actually a >>>>>> problem in production systems. We did get systems tripping over the >>>>>> symlink issue, but there's no evidence that the >>>>>> unlink->close->open(O_CREAT) issues are manifesting in the wild and >>>>>> hence there hasn't been any particular urgency to address it. >>>>>> >>>>>>> Are there any recommended workarounds until an elegant and >>>>>>> efficient solution >>>>>>> can be proposed? After all, causing a crash is extremely >>>>>>> unacceptable in a >>>>>>> production environment. >>>>>> What crashes are you seeing in your production environment? >>>>>> >>>>>> -Dave. >>>>>> -- >>>>>> Dave Chinner >>>>>> david@fromorbit.com > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-16 7:23 ` Ian Kent @ 2024-05-20 17:36 ` Darrick J. Wong 2024-05-21 1:35 ` Ian Kent 2024-05-27 9:41 ` Dave Chinner 1 sibling, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2024-05-20 17:36 UTC (permalink / raw) To: Ian Kent Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel, linux-xfs, rcu On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > > On 16/5/24 15:08, Ian Kent wrote: > > On 16/5/24 12:56, Jinliang Zheng wrote: > > > On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: > > > > On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: > > > > > On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: > > > > > > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: > > > > > > > On Tue, Dec 05, 2023 at 07:38:33PM +0800, > > > > > > > alexjlzheng@gmail.com wrote: > > > > > > > > Hi, all > > > > > > > > > > > > > > > > I would like to ask if the conflict between xfs > > > > > > > > inode recycle and vfs rcu-walk > > > > > > > > which can lead to null pointer references has been resolved? > > > > > > > > > > > > > > > > I browsed through emails about the following > > > > > > > > patches and their discussions: > > > > > > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > > > > > > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > > > > > > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > > > > > > > > > > > And then came to the conclusion that this > > > > > > > > problem has not been solved, am I > > > > > > > > right? Did I miss some patch that could solve this problem? > > > > > > > We fixed the known problems this caused by turning off the VFS > > > > > > > functionality that the rcu pathwalks kept tripping over. See commit > > > > > > > 7b7820b83f23 ("xfs: don't expose internal symlink > > > > > > > metadata buffers to > > > > > > > the vfs"). > > > > > > Sorry for the delay. > > > > > > > > > > > > The problem I encountered in the production environment > > > > > > was that during the > > > > > > rcu walk process the ->get_link() pointer was NULL, > > > > > > which caused a crash. > > > > > > > > > > > > As far as I know, commit 7b7820b83f23 ("xfs: don't > > > > > > expose internal symlink > > > > > > metadata buffers to the vfs") first appeared in: > > > > > > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ > > > > > > > > > > > > Does this commit solve the problem of NULL ->get_link()? And how? > > > > > I suggest reading the call stack from wherever the VFS enters the XFS > > > > > readlink code. If you have a reliable reproducer, then > > > > > apply this patch > > > > > to your kernel (you haven't mentioned which one it is) and see if the > > > > > bad dereference goes away. > > > > > > > > > > --D > > > > Sorry for the delay. > > > > > > > > I encountered the following calltrace: > > > > > > > > [20213.578756] BUG: kernel NULL pointer dereference, address: > > > > 0000000000000000 > > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode > > > > [20213.578799] #PF: error_code(0x0010) - not-present page > > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI > > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: > > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd. > > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > > > > [20213.578884] RIP: 0010:0x0 > > > > [20213.578894] Code: Bad RIP value. > > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: > > > > 0000000000000000 > > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: > > > > 0000000000000000 > > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: > > > > ffff889b9eeae380 > > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: > > > > 0000000000000000 > > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: > > > > ffffc90021ebfd48 > > > > [20213.578998] FS: 00007f89c534e740(0000) > > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > > > > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: > > > > 00000000007706e0 > > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [20213.579079] PKRU: 55555554 > > > > [20213.579087] Call Trace: > > > > [20213.579099] trailing_symlink+0x1da/0x260 > > > > [20213.579112] path_lookupat.isra.53+0x79/0x220 > > > > [20213.579125] filename_lookup.part.69+0xa0/0x170 > > > > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > > > > [20213.579151] ? getname_flags+0x4f/0x1e0 > > > > [20213.579161] user_path_at_empty+0x3e/0x50 > > > > [20213.579172] vfs_statx+0x76/0xe0 > > > > [20213.579182] __do_sys_newstat+0x3d/0x70 > > > > [20213.579194] ? fput+0x13/0x20 > > > > [20213.579203] ? ksys_ioctl+0xb0/0x300 > > > > [20213.579213] ? generic_file_llseek+0x24/0x30 > > > > [20213.579225] ? fput+0x13/0x20 > > > > [20213.579233] ? ksys_lseek+0x8d/0xb0 > > > > [20213.579243] __x64_sys_newstat+0x16/0x20 > > > > [20213.579256] do_syscall_64+0x4d/0x140 > > > > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > > > > > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > > > > > > Please note that the kernel version I use is the one maintained by > > > Tencent.Inc, > > > and the baseline is v5.4. But in fact, in the latest upstream source > > > tree, > > > although the trailing_symlink() function has been removed, its logic > > > has been > > > moved to pick_link(), so the problem still exists. > > > > > > Ian Kent pointed out that try_to_unlazy() was introduced in > > > pick_link() in the > > > latest upstream source tree, but I don't understand why this can > > > solve the NULL > > > ->get_link pointer dereference problem, because ->get_link pointer > > > will be > > > dereferenced before try_to_unlazy(). > > > > > > (I don't understand why Ian Kent's email didn't appear on the > > > mailing list.) > > > > It was something about html mail and I think my mail client was at fault. > > > > In any case what you say is indeed correct, so the comment isn't > > important. > > > > > > Fact is it is still a race between the lockless path walk and inode > > eviction > > > > and xfs recycling. I believe that the xfs recycling code is very hard to > > fix. > > > > > > IIRC correctly putting a NULL check in pick_link() was not considered > > acceptable > > > > but there must be a way that is acceptable to check this and restart the > > walk. > > > > Maybe there was a reluctance to suffer the overhead of restarting the > > walk when > > > > it shouldn't be needed. > > Or perhaps the worry was that if it can become NULL it could also become a > pointer to a > > different (incorrect) link altogether which could have really odd/unpleasant > outcomes. Yuck. I think that means that we can't reallocate freed inodes until the rcu grace period expires. For inodes that haven't been evicted, I think that also means we cannot recycle cached inodes until after an rcu grace period expires; or maybe that we cannot reset i_op/i_fop and must not leave the incore state in an inconsistent format? --D > > > > > > > The alternative would be to find some way to identify when it's unsafe > > to reuse > > > > an inode marked for re-cycle before dropping rcu read, perhaps with the > > reference > > > > count plus the seqlock. Basically, to reuse inodes xfs will need to > > identify when > > > > the race occurs and let the inode go away under rcu and create a new one > > if a race > > > > is detected. But possibly that isn't nearly as simple as it sounds? > > > > > > > > > > Thanks, > > > Jinliang Zheng > > > > > > > And I analyzed the disassembly of trailing_symlink() and > > > > confirmed that a NULL > > > > ->get_link() happened here: > > > > > > > > 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) > > > > [FTRACE NOP] > > > > 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp > > > > 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp > > > > 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 > > > > 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 > > > > 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 > > > > 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 > > > > 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx > > > > 0xffffffff812e4862 <trailing_symlink+0x12>: mov > > > > %rdi,%rbx # rbx = &nameidate > > > > 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp > > > > 0xffffffff812e4869 <trailing_symlink+0x19>: mov > > > > 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 > > > > <sysctl_protected_symlinks> > > > > 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax > > > > 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx > > > > 0xffffffff812e4874 <trailing_symlink+0x24>: je > > > > 0xffffffff812e48ac <trailing_symlink+0x5c> > > > > 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx > > > > 0xffffffff812e487f <trailing_symlink+0x2f>: mov > > > > 0xc8(%rdi),%rcx # rcx = nameidata->link_inode > > > > 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx > > > > 0xffffffff812e488d <trailing_symlink+0x3d>: mov > > > > 0x4(%rcx),%ecx # ecx = link_inode->uid > > > > 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) > > > > 0xffffffff812e4893 <trailing_symlink+0x43>: je > > > > 0xffffffff812e48ac <trailing_symlink+0x5c> > > > > 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi > > > > 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx > > > > 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx > > > > 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx > > > > 0xffffffff812e48a6 <trailing_symlink+0x56>: je > > > > 0xffffffff812e495f <trailing_symlink+0x10f> > > > > 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax > > > > 0xffffffff812e48af <trailing_symlink+0x5f>: mov > > > > %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT > > > > 0xffffffff812e48b2 <trailing_symlink+0x62>: mov > > > > 0x50(%rbx),%rax # rax = nd->stack > > > > 0xffffffff812e48b6 <trailing_symlink+0x66>: movq > > > > $0x0,0x20(%rax) # stack[0].name = NULL > > > > 0xffffffff812e48be <trailing_symlink+0x6e>: mov > > > > 0x48(%rbx),%eax # nd->depth > > > > 0xffffffff812e48c1 <trailing_symlink+0x71>: mov > > > > 0x50(%rbx),%rdx # nd->stack > > > > 0xffffffff812e48c5 <trailing_symlink+0x75>: mov > > > > 0xc8(%rbx),%r13 # nd->link_inode > > > > 0xffffffff812e48cc <trailing_symlink+0x7c>: lea > > > > (%rax,%rax,2),%rax # rax = depth * 3 > > > > 0xffffffff812e48d0 <trailing_symlink+0x80>: shl > > > > $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 > > > > 0xffffffff812e48d4 <trailing_symlink+0x84>: lea > > > > -0x30(%rdx,%rax,1),%r15 # r15 = last > > > > 0xffffffff812e48d9 <trailing_symlink+0x89>: mov > > > > 0x8(%r15),%r14 # r14 = last->link.dentry > > > > 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) > > > > 0xffffffff812e48e1 <trailing_symlink+0x91>: je > > > > 0xffffffff812e4950 <trailing_symlink+0x100> > > > > 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi > > > > 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi > > > > 0xffffffff812e48e9 <trailing_symlink+0x99>: callq > > > > 0xffffffff812f8a00 <atime_needs_update> > > > > 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al > > > > 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne > > > > 0xffffffff812e4a56 <trailing_symlink+0x206> > > > > 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx > > > > 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi > > > > 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi > > > > 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx > > > > 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx > > > > 0xffffffff812e4905 <trailing_symlink+0xb5>: callq > > > > 0xffffffff81424310 <security_inode_follow_link> > > > > 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 > > > > 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax > > > > 0xffffffff812e490f <trailing_symlink+0xbf>: jne > > > > 0xffffffff812e4939 <trailing_symlink+0xe9> > > > > 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) > > > > 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 > > > > 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 > > > > 0xffffffff812e4922 <trailing_symlink+0xd2>: je > > > > 0xffffffff812e49e5 <trailing_symlink+0x195> > > > > 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax > > > > 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al > > > > 0xffffffff812e492f <trailing_symlink+0xdf>: je > > > > 0xffffffff812e49b7 <trailing_symlink+0x167> > > > > 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al > > > > 0xffffffff812e4937 <trailing_symlink+0xe7>: je > > > > 0xffffffff812e49ae <trailing_symlink+0x15e> > > > > 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 > > > > 0xffffffff812e493c <trailing_symlink+0xec>: je > > > > 0xffffffff812e49ae <trailing_symlink+0x15e> > > > > 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp > > > > 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax > > > > 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx > > > > 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 > > > > 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 > > > > 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 > > > > 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 > > > > 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp > > > > 0xffffffff812e494f <trailing_symlink+0xff>: retq > > > > 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi > > > > 0xffffffff812e4953 <trailing_symlink+0x103>: callq > > > > 0xffffffff812f8ae0 <touch_atime> > > > > 0xffffffff812e4958 <trailing_symlink+0x108>: callq > > > > 0xffffffff81a26410 <_cond_resched> > > > > 0xffffffff812e495d <trailing_symlink+0x10d>: jmp > > > > 0xffffffff812e48f6 <trailing_symlink+0xa6> > > > > 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx > > > > 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx > > > > 0xffffffff812e4965 <trailing_symlink+0x115>: je > > > > 0xffffffff812e496f <trailing_symlink+0x11f> > > > > 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx > > > > 0xffffffff812e4969 <trailing_symlink+0x119>: je > > > > 0xffffffff812e48ac <trailing_symlink+0x5c> > > > > 0xffffffff812e496f <trailing_symlink+0x11f>: mov > > > > $0xfffffffffffffff6,%r12 > > > > 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al > > > > 0xffffffff812e4978 <trailing_symlink+0x128>: jne > > > > 0xffffffff812e493e <trailing_symlink+0xee> > > > > 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax > > > > 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax > > > > 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax > > > > 0xffffffff812e498d <trailing_symlink+0x13d>: je > > > > 0xffffffff812e4999 <trailing_symlink+0x149> > > > > 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax > > > > 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax > > > > 0xffffffff812e4993 <trailing_symlink+0x143>: je > > > > 0xffffffff812e4a6f <trailing_symlink+0x21f> > > > > 0xffffffff812e4999 <trailing_symlink+0x149>: mov > > > > $0xffffffff82319b4f,%rdi > > > > 0xffffffff812e49a0 <trailing_symlink+0x150>: mov > > > > $0xfffffffffffffff3,%r12 > > > > 0xffffffff812e49a7 <trailing_symlink+0x157>: callq > > > > 0xffffffff81161310 <audit_log_link_denied> > > > > 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp > > > > 0xffffffff812e493e <trailing_symlink+0xee> > > > > 0xffffffff812e49ae <trailing_symlink+0x15e>: mov > > > > $0xffffffff8230164d,%r12 > > > > 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp > > > > 0xffffffff812e493e <trailing_symlink+0xee> > > > > 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) > > > > 0xffffffff812e49bc <trailing_symlink+0x16c>: je > > > > 0xffffffff812e4a8a <trailing_symlink+0x23a> > > > > 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi > > > > 0xffffffff812e49c5 <trailing_symlink+0x175>: callq > > > > 0xffffffff812e2da0 <nd_jump_root> > > > > 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax > > > > 0xffffffff812e49cc <trailing_symlink+0x17c>: jne > > > > 0xffffffff812e4a97 <trailing_symlink+0x247> > > > > 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 > > > > 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax > > > > 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al > > > > 0xffffffff812e49dd <trailing_symlink+0x18d>: jne > > > > 0xffffffff812e4935 <trailing_symlink+0xe5> > > > > 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp > > > > 0xffffffff812e49d2 <trailing_symlink+0x182> > > > > 0xffffffff812e49e5 <trailing_symlink+0x195>: mov > > > > 0x20(%r13),%rax # inode->i_op > > > > 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 > > > > 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi > > > > 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx > > > > 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov > > > > 0x8(%rax),%rcx # inode_operations->get_link > > > > 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) > > > > 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne > > > > 0xffffffff812e4a1f <trailing_symlink+0x1cf> > > > > 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov > > > > %r14,%rdi # nd->flags & LOOKUP_RCU == 0 > > > > 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq > > > > 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > > > > 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 > > > > 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 > > > > 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je > > > > 0xffffffff812e49ae <trailing_symlink+0x15e> > > > > 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp > > > > $0xfffffffffffff000,%r12 > > > > 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe > > > > 0xffffffff812e4928 <trailing_symlink+0xd8> > > > > 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq > > > > 0xffffffff812e493e <trailing_symlink+0xee> > > > > 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor > > > > %edi,%edi # nd->flags & LOOKUP_RCU != 0 > > > > 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) > > > > 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq > > > > 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > > > > 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 > > > > 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp > > > > $0xfffffffffffffff6,%rax > > > > 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne > > > > 0xffffffff812e4a08 <trailing_symlink+0x1b8> > > > > 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi > > > > 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq > > > > 0xffffffff812e3840 <unlazy_walk> > > > > 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax > > > > 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne > > > > 0xffffffff812e4a97 <trailing_symlink+0x247> > > > > 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx > > > > 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi > > > > 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi > > > > 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx > > > > 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq > > > > 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> > > > > 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 > > > > 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp > > > > 0xffffffff812e4a08 <trailing_symlink+0x1b8> > > > > 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi > > > > 0xffffffff812e4a59 <trailing_symlink+0x209>: callq > > > > 0xffffffff812e3840 <unlazy_walk> > > > > 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax > > > > 0xffffffff812e4a60 <trailing_symlink+0x210>: jne > > > > 0xffffffff812e4a97 <trailing_symlink+0x247> > > > > 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi > > > > 0xffffffff812e4a65 <trailing_symlink+0x215>: callq > > > > 0xffffffff812f8ae0 <touch_atime> > > > > 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq > > > > 0xffffffff812e48f6 <trailing_symlink+0xa6> > > > > 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax > > > > 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi > > > > 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx > > > > 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi > > > > 0xffffffff812e4a80 <trailing_symlink+0x230>: callq > > > > 0xffffffff811673f0 <__audit_inode> > > > > 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq > > > > 0xffffffff812e4999 <trailing_symlink+0x149> > > > > 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi > > > > 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq > > > > 0xffffffff812e4790 <set_root> > > > > 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq > > > > 0xffffffff812e49c2 <trailing_symlink+0x172> > > > > 0xffffffff812e4a97 <trailing_symlink+0x247>: mov > > > > $0xfffffffffffffff6,%r12 > > > > 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq > > > > 0xffffffff812e493e <trailing_symlink+0xee> > > > > > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > > > > > > > > > > > According to my understanding, the problem solved by commit > > > > 7b7820b83f23 ("xfs: > > > > don't expose internal symlink metadata buffers to the vfs") is a > > > > data NULL > > > > pointer dereference, but the problem here is an instruction NULL > > > > pointer > > > > dereference. > > > > > > > > Further, I analyzed the possible triggering process as follows: > > > > > > > > rcu_walk do_unlinkat ~~> prune_dcache_sb create > > > > rcu_read_lock > > > > read_seqcount_retry > > > > (the last check) iput_final > > > > evict > > > > destroy_inode > > > > xfs_fs_destroy_inode > > > > xfs_inode_set_reclaim_tag xfs_ialloc > > > > spin_lock(ip->i_flags_lock) xfs_dialloc > > > > set(ip, XFS_IRECLAIMABLE) > > > > xfs_iget > > > > wakeup(xfs_reclaim_worker) rcu_read_lock > > > > spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit > > > > spin_lock(ip->i_flags_lock) > > > > > > > > if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) > > > > set(ip, XFS_IRECLAIM) > > > > spin_unlock(ip->i_flags_lock) > > > > rcu_read_unlock > > > > < ------------ > > > > > > > > > // miss synchronize_rcu() > > > > xfs_reinit_inode > > > > ->get_link = NULL > > > > get_link() // NULL > > > > > > > > rcu_read_unlock > > > > > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > > > > > > > > > > > Therefore, I think that after commit 7b7820b83f23 ("xfs: don't > > > > expose internal > > > > symlink metadata buffers to the vfs"), we should start > > > > processing this NULL > > > > ->get_link pointer dereference. > > > > > > > > Or, am I thinking wrong somewhere? > > > > > > > > Thanks, > > > > Jinliang Zheng > > > > > > > > > > > Apart from that issue, I'm not aware of any other issues that the > > > > > > > XFS inode recycling directly exposes. > > > > > > > > > > > > > > > According to my understanding, the essence of > > > > > > > > this problem is that XFS reuses > > > > > > > > the inode evicted by VFS, but VFS rcu-walk > > > > > > > > assumes that this will not happen. > > > > > > > It assumes that the inode will not change identity during the RCU > > > > > > > grace period after the inode has been evicted from cache. We can > > > > > > > safely reinstantiate an evicted inode without waiting for an RCU > > > > > > > grace period as long as it is the same inode with the same content > > > > > > > and same state. > > > > > > > > > > > > > > Problems *may* arise when we unlink the inode, then evict it, then a > > > > > > > new file is created and the old slab cache memory address is used > > > > > > > for the new inode. I describe the issue here: > > > > > > > > > > > > > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ > > > > > > > > > > > > > And judging from the relevant emails, the main reason > > > > > > why ->get_link() is set > > > > > > to NULL should be the lack of synchronize_rcu() before > > > > > > xfs_reinit_inode() when > > > > > > the inode is chosen to be reused. > > > > > > > > > > > > However, perhaps due to performance reasons, this > > > > > > solution has not been merged > > > > > > for a long time. How is it now? > > > > > > > > > > > > Maybe I am missing something in the threads of mail? > > > > > > > > > > > > Thank you very much. :) > > > > > > Jinliang Zheng > > > > > > > > > > > > > That said, we have exactly zero evidence that this is actually a > > > > > > > problem in production systems. We did get systems tripping over the > > > > > > > symlink issue, but there's no evidence that the > > > > > > > unlink->close->open(O_CREAT) issues are manifesting in the wild and > > > > > > > hence there hasn't been any particular urgency to address it. > > > > > > > > > > > > > > > Are there any recommended workarounds until an > > > > > > > > elegant and efficient solution > > > > > > > > can be proposed? After all, causing a crash is > > > > > > > > extremely unacceptable in a > > > > > > > > production environment. > > > > > > > What crashes are you seeing in your production environment? > > > > > > > > > > > > > > -Dave. > > > > > > > -- > > > > > > > Dave Chinner > > > > > > > david@fromorbit.com > > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-20 17:36 ` Darrick J. Wong @ 2024-05-21 1:35 ` Ian Kent 2024-05-21 2:13 ` Ian Kent 0 siblings, 1 reply; 19+ messages in thread From: Ian Kent @ 2024-05-21 1:35 UTC (permalink / raw) To: Darrick J. Wong Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel, linux-xfs, rcu On 21/5/24 01:36, Darrick J. Wong wrote: > On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: >> On 16/5/24 15:08, Ian Kent wrote: >>> On 16/5/24 12:56, Jinliang Zheng wrote: >>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: >>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: >>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: >>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: >>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, >>>>>>>> alexjlzheng@gmail.com wrote: >>>>>>>>> Hi, all >>>>>>>>> >>>>>>>>> I would like to ask if the conflict between xfs >>>>>>>>> inode recycle and vfs rcu-walk >>>>>>>>> which can lead to null pointer references has been resolved? >>>>>>>>> >>>>>>>>> I browsed through emails about the following >>>>>>>>> patches and their discussions: >>>>>>>>> - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ >>>>>>>>> - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ >>>>>>>>> - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ >>>>>>>>> >>>>>>>>> And then came to the conclusion that this >>>>>>>>> problem has not been solved, am I >>>>>>>>> right? Did I miss some patch that could solve this problem? >>>>>>>> We fixed the known problems this caused by turning off the VFS >>>>>>>> functionality that the rcu pathwalks kept tripping over. See commit >>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink >>>>>>>> metadata buffers to >>>>>>>> the vfs"). >>>>>>> Sorry for the delay. >>>>>>> >>>>>>> The problem I encountered in the production environment >>>>>>> was that during the >>>>>>> rcu walk process the ->get_link() pointer was NULL, >>>>>>> which caused a crash. >>>>>>> >>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't >>>>>>> expose internal symlink >>>>>>> metadata buffers to the vfs") first appeared in: >>>>>>> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ >>>>>>> >>>>>>> Does this commit solve the problem of NULL ->get_link()? And how? >>>>>> I suggest reading the call stack from wherever the VFS enters the XFS >>>>>> readlink code. If you have a reliable reproducer, then >>>>>> apply this patch >>>>>> to your kernel (you haven't mentioned which one it is) and see if the >>>>>> bad dereference goes away. >>>>>> >>>>>> --D >>>>> Sorry for the delay. >>>>> >>>>> I encountered the following calltrace: >>>>> >>>>> [20213.578756] BUG: kernel NULL pointer dereference, address: >>>>> 0000000000000000 >>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode >>>>> [20213.578799] #PF: error_code(0x0010) - not-present page >>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 >>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI >>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: >>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 >>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. >>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 >>>>> [20213.578884] RIP: 0010:0x0 >>>>> [20213.578894] Code: Bad RIP value. >>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 >>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: >>>>> 0000000000000000 >>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: >>>>> 0000000000000000 >>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: >>>>> ffff889b9eeae380 >>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: >>>>> 0000000000000000 >>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: >>>>> ffffc90021ebfd48 >>>>> [20213.578998] FS: 00007f89c534e740(0000) >>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000 >>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: >>>>> 00000000007706e0 >>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>>>> 0000000000000000 >>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: >>>>> 0000000000000400 >>>>> [20213.579079] PKRU: 55555554 >>>>> [20213.579087] Call Trace: >>>>> [20213.579099] trailing_symlink+0x1da/0x260 >>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220 >>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170 >>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 >>>>> [20213.579151] ? getname_flags+0x4f/0x1e0 >>>>> [20213.579161] user_path_at_empty+0x3e/0x50 >>>>> [20213.579172] vfs_statx+0x76/0xe0 >>>>> [20213.579182] __do_sys_newstat+0x3d/0x70 >>>>> [20213.579194] ? fput+0x13/0x20 >>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300 >>>>> [20213.579213] ? generic_file_llseek+0x24/0x30 >>>>> [20213.579225] ? fput+0x13/0x20 >>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0 >>>>> [20213.579243] __x64_sys_newstat+0x16/0x20 >>>>> [20213.579256] do_syscall_64+0x4d/0x140 >>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 >>>>> >>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>> >>>> Please note that the kernel version I use is the one maintained by >>>> Tencent.Inc, >>>> and the baseline is v5.4. But in fact, in the latest upstream source >>>> tree, >>>> although the trailing_symlink() function has been removed, its logic >>>> has been >>>> moved to pick_link(), so the problem still exists. >>>> >>>> Ian Kent pointed out that try_to_unlazy() was introduced in >>>> pick_link() in the >>>> latest upstream source tree, but I don't understand why this can >>>> solve the NULL >>>> ->get_link pointer dereference problem, because ->get_link pointer >>>> will be >>>> dereferenced before try_to_unlazy(). >>>> >>>> (I don't understand why Ian Kent's email didn't appear on the >>>> mailing list.) >>> It was something about html mail and I think my mail client was at fault. >>> >>> In any case what you say is indeed correct, so the comment isn't >>> important. >>> >>> >>> Fact is it is still a race between the lockless path walk and inode >>> eviction >>> >>> and xfs recycling. I believe that the xfs recycling code is very hard to >>> fix. >>> >>> >>> IIRC correctly putting a NULL check in pick_link() was not considered >>> acceptable >>> >>> but there must be a way that is acceptable to check this and restart the >>> walk. >>> >>> Maybe there was a reluctance to suffer the overhead of restarting the >>> walk when >>> >>> it shouldn't be needed. >> Or perhaps the worry was that if it can become NULL it could also become a >> pointer to a >> >> different (incorrect) link altogether which could have really odd/unpleasant >> outcomes. > Yuck. I think that means that we can't reallocate freed inodes until > the rcu grace period expires. For inodes that haven't been evicted, I > think that also means we cannot recycle cached inodes until after an rcu > grace period expires; or maybe that we cannot reset i_op/i_fop and must > not leave the incore state in an inconsistent format? Yeah, not pretty! But shouldn't this case occur only occasionally? So issuing a cache miss shouldn't impact performance too much that was, I believe, the concern with waiting for the rcu grace period. Identifying it's happening should be possible, the vfs legitimize_*() has this job for various objects but maybe it's using vfs private info. (certainly it uses nameidata struct with a seq lock sequence number in it) but I assume it can be done somehow. My question then becomes is it viable/straight forward to not recycle such an inode and discard it instead so it gets re-created, I guess it's essentially a cache miss? Ian > > --D > >>> >>> The alternative would be to find some way to identify when it's unsafe >>> to reuse >>> >>> an inode marked for re-cycle before dropping rcu read, perhaps with the >>> reference >>> >>> count plus the seqlock. Basically, to reuse inodes xfs will need to >>> identify when >>> >>> the race occurs and let the inode go away under rcu and create a new one >>> if a race >>> >>> is detected. But possibly that isn't nearly as simple as it sounds? >>> >>> >>>> Thanks, >>>> Jinliang Zheng >>>> >>>>> And I analyzed the disassembly of trailing_symlink() and >>>>> confirmed that a NULL >>>>> ->get_link() happened here: >>>>> >>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) >>>>> [FTRACE NOP] >>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp >>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp >>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 >>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 >>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 >>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 >>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx >>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov >>>>> %rdi,%rbx # rbx = &nameidate >>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp >>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov >>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 >>>>> <sysctl_protected_symlinks> >>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax >>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx >>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je >>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx >>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov >>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode >>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx >>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov >>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid >>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) >>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je >>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi >>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx >>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx >>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx >>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je >>>>> 0xffffffff812e495f <trailing_symlink+0x10f> >>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax >>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov >>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT >>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov >>>>> 0x50(%rbx),%rax # rax = nd->stack >>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq >>>>> $0x0,0x20(%rax) # stack[0].name = NULL >>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov >>>>> 0x48(%rbx),%eax # nd->depth >>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov >>>>> 0x50(%rbx),%rdx # nd->stack >>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov >>>>> 0xc8(%rbx),%r13 # nd->link_inode >>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea >>>>> (%rax,%rax,2),%rax # rax = depth * 3 >>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl >>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 >>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea >>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last >>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov >>>>> 0x8(%r15),%r14 # r14 = last->link.dentry >>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) >>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je >>>>> 0xffffffff812e4950 <trailing_symlink+0x100> >>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi >>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi >>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq >>>>> 0xffffffff812f8a00 <atime_needs_update> >>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al >>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne >>>>> 0xffffffff812e4a56 <trailing_symlink+0x206> >>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx >>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi >>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi >>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx >>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx >>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq >>>>> 0xffffffff81424310 <security_inode_follow_link> >>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 >>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax >>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne >>>>> 0xffffffff812e4939 <trailing_symlink+0xe9> >>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) >>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 >>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 >>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je >>>>> 0xffffffff812e49e5 <trailing_symlink+0x195> >>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax >>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al >>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je >>>>> 0xffffffff812e49b7 <trailing_symlink+0x167> >>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al >>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je >>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 >>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je >>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp >>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax >>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx >>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 >>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 >>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 >>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 >>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp >>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq >>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi >>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq >>>>> 0xffffffff812f8ae0 <touch_atime> >>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq >>>>> 0xffffffff81a26410 <_cond_resched> >>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp >>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx >>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx >>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je >>>>> 0xffffffff812e496f <trailing_symlink+0x11f> >>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx >>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je >>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov >>>>> $0xfffffffffffffff6,%r12 >>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al >>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne >>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax >>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax >>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax >>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je >>>>> 0xffffffff812e4999 <trailing_symlink+0x149> >>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax >>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax >>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je >>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f> >>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov >>>>> $0xffffffff82319b4f,%rdi >>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov >>>>> $0xfffffffffffffff3,%r12 >>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq >>>>> 0xffffffff81161310 <audit_log_link_denied> >>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp >>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov >>>>> $0xffffffff8230164d,%r12 >>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp >>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) >>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je >>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a> >>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi >>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq >>>>> 0xffffffff812e2da0 <nd_jump_root> >>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax >>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne >>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 >>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax >>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al >>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne >>>>> 0xffffffff812e4935 <trailing_symlink+0xe5> >>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp >>>>> 0xffffffff812e49d2 <trailing_symlink+0x182> >>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov >>>>> 0x20(%r13),%rax # inode->i_op >>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 >>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi >>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx >>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov >>>>> 0x8(%rax),%rcx # inode_operations->get_link >>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) >>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne >>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf> >>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov >>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0 >>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq >>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 >>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 >>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je >>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp >>>>> $0xfffffffffffff000,%r12 >>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe >>>>> 0xffffffff812e4928 <trailing_symlink+0xd8> >>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq >>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor >>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0 >>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) >>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq >>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 >>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp >>>>> $0xfffffffffffffff6,%rax >>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne >>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi >>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq >>>>> 0xffffffff812e3840 <unlazy_walk> >>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax >>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne >>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx >>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi >>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi >>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx >>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq >>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> >>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 >>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp >>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi >>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq >>>>> 0xffffffff812e3840 <unlazy_walk> >>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax >>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne >>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi >>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq >>>>> 0xffffffff812f8ae0 <touch_atime> >>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq >>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax >>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi >>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx >>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi >>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq >>>>> 0xffffffff811673f0 <__audit_inode> >>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq >>>>> 0xffffffff812e4999 <trailing_symlink+0x149> >>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi >>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq >>>>> 0xffffffff812e4790 <set_root> >>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq >>>>> 0xffffffff812e49c2 <trailing_symlink+0x172> >>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov >>>>> $0xfffffffffffffff6,%r12 >>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq >>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>> >>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>> >>>>> >>>>> According to my understanding, the problem solved by commit >>>>> 7b7820b83f23 ("xfs: >>>>> don't expose internal symlink metadata buffers to the vfs") is a >>>>> data NULL >>>>> pointer dereference, but the problem here is an instruction NULL >>>>> pointer >>>>> dereference. >>>>> >>>>> Further, I analyzed the possible triggering process as follows: >>>>> >>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create >>>>> rcu_read_lock >>>>> read_seqcount_retry >>>>> (the last check) iput_final >>>>> evict >>>>> destroy_inode >>>>> xfs_fs_destroy_inode >>>>> xfs_inode_set_reclaim_tag xfs_ialloc >>>>> spin_lock(ip->i_flags_lock) xfs_dialloc >>>>> set(ip, XFS_IRECLAIMABLE) >>>>> xfs_iget >>>>> wakeup(xfs_reclaim_worker) rcu_read_lock >>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit >>>>> spin_lock(ip->i_flags_lock) >>>>> >>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) >>>>> set(ip, XFS_IRECLAIM) >>>>> spin_unlock(ip->i_flags_lock) >>>>> rcu_read_unlock >>>>> < ------------ > >>>>> >>>>> // miss synchronize_rcu() >>>>> xfs_reinit_inode >>>>> ->get_link = NULL >>>>> get_link() // NULL >>>>> >>>>> rcu_read_unlock >>>>> >>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>> >>>>> >>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't >>>>> expose internal >>>>> symlink metadata buffers to the vfs"), we should start >>>>> processing this NULL >>>>> ->get_link pointer dereference. >>>>> >>>>> Or, am I thinking wrong somewhere? >>>>> >>>>> Thanks, >>>>> Jinliang Zheng >>>>> >>>>>>>> Apart from that issue, I'm not aware of any other issues that the >>>>>>>> XFS inode recycling directly exposes. >>>>>>>> >>>>>>>>> According to my understanding, the essence of >>>>>>>>> this problem is that XFS reuses >>>>>>>>> the inode evicted by VFS, but VFS rcu-walk >>>>>>>>> assumes that this will not happen. >>>>>>>> It assumes that the inode will not change identity during the RCU >>>>>>>> grace period after the inode has been evicted from cache. We can >>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU >>>>>>>> grace period as long as it is the same inode with the same content >>>>>>>> and same state. >>>>>>>> >>>>>>>> Problems *may* arise when we unlink the inode, then evict it, then a >>>>>>>> new file is created and the old slab cache memory address is used >>>>>>>> for the new inode. I describe the issue here: >>>>>>>> >>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ >>>>>>>> >>>>>>> And judging from the relevant emails, the main reason >>>>>>> why ->get_link() is set >>>>>>> to NULL should be the lack of synchronize_rcu() before >>>>>>> xfs_reinit_inode() when >>>>>>> the inode is chosen to be reused. >>>>>>> >>>>>>> However, perhaps due to performance reasons, this >>>>>>> solution has not been merged >>>>>>> for a long time. How is it now? >>>>>>> >>>>>>> Maybe I am missing something in the threads of mail? >>>>>>> >>>>>>> Thank you very much. :) >>>>>>> Jinliang Zheng >>>>>>> >>>>>>>> That said, we have exactly zero evidence that this is actually a >>>>>>>> problem in production systems. We did get systems tripping over the >>>>>>>> symlink issue, but there's no evidence that the >>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the wild and >>>>>>>> hence there hasn't been any particular urgency to address it. >>>>>>>> >>>>>>>>> Are there any recommended workarounds until an >>>>>>>>> elegant and efficient solution >>>>>>>>> can be proposed? After all, causing a crash is >>>>>>>>> extremely unacceptable in a >>>>>>>>> production environment. >>>>>>>> What crashes are you seeing in your production environment? >>>>>>>> >>>>>>>> -Dave. >>>>>>>> -- >>>>>>>> Dave Chinner >>>>>>>> david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-21 1:35 ` Ian Kent @ 2024-05-21 2:13 ` Ian Kent 2024-05-26 15:04 ` Jinliang Zheng 2024-05-26 23:51 ` Ian Kent 0 siblings, 2 replies; 19+ messages in thread From: Ian Kent @ 2024-05-21 2:13 UTC (permalink / raw) To: Darrick J. Wong Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel, linux-xfs, rcu On 21/5/24 09:35, Ian Kent wrote: > On 21/5/24 01:36, Darrick J. Wong wrote: >> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: >>> On 16/5/24 15:08, Ian Kent wrote: >>>> On 16/5/24 12:56, Jinliang Zheng wrote: >>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: >>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: >>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: >>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: >>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, >>>>>>>>> alexjlzheng@gmail.com wrote: >>>>>>>>>> Hi, all >>>>>>>>>> >>>>>>>>>> I would like to ask if the conflict between xfs >>>>>>>>>> inode recycle and vfs rcu-walk >>>>>>>>>> which can lead to null pointer references has been resolved? >>>>>>>>>> >>>>>>>>>> I browsed through emails about the following >>>>>>>>>> patches and their discussions: >>>>>>>>>> - >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ >>>>>>>>>> - >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ >>>>>>>>>> - >>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ >>>>>>>>>> >>>>>>>>>> And then came to the conclusion that this >>>>>>>>>> problem has not been solved, am I >>>>>>>>>> right? Did I miss some patch that could solve this problem? >>>>>>>>> We fixed the known problems this caused by turning off the VFS >>>>>>>>> functionality that the rcu pathwalks kept tripping over. See >>>>>>>>> commit >>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink >>>>>>>>> metadata buffers to >>>>>>>>> the vfs"). >>>>>>>> Sorry for the delay. >>>>>>>> >>>>>>>> The problem I encountered in the production environment >>>>>>>> was that during the >>>>>>>> rcu walk process the ->get_link() pointer was NULL, >>>>>>>> which caused a crash. >>>>>>>> >>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't >>>>>>>> expose internal symlink >>>>>>>> metadata buffers to the vfs") first appeared in: >>>>>>>> - >>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ >>>>>>>> >>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how? >>>>>>> I suggest reading the call stack from wherever the VFS enters >>>>>>> the XFS >>>>>>> readlink code. If you have a reliable reproducer, then >>>>>>> apply this patch >>>>>>> to your kernel (you haven't mentioned which one it is) and see >>>>>>> if the >>>>>>> bad dereference goes away. >>>>>>> >>>>>>> --D >>>>>> Sorry for the delay. >>>>>> >>>>>> I encountered the following calltrace: >>>>>> >>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address: >>>>>> 0000000000000000 >>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode >>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page >>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 >>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI >>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: >>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 >>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. >>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 >>>>>> [20213.578884] RIP: 0010:0x0 >>>>>> [20213.578894] Code: Bad RIP value. >>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 >>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: >>>>>> 0000000000000000 >>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: >>>>>> 0000000000000000 >>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: >>>>>> ffff889b9eeae380 >>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: >>>>>> 0000000000000000 >>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: >>>>>> ffffc90021ebfd48 >>>>>> [20213.578998] FS: 00007f89c534e740(0000) >>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000 >>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: >>>>>> 00000000007706e0 >>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>>>>> 0000000000000000 >>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: >>>>>> 0000000000000400 >>>>>> [20213.579079] PKRU: 55555554 >>>>>> [20213.579087] Call Trace: >>>>>> [20213.579099] trailing_symlink+0x1da/0x260 >>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220 >>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170 >>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 >>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0 >>>>>> [20213.579161] user_path_at_empty+0x3e/0x50 >>>>>> [20213.579172] vfs_statx+0x76/0xe0 >>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70 >>>>>> [20213.579194] ? fput+0x13/0x20 >>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300 >>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30 >>>>>> [20213.579225] ? fput+0x13/0x20 >>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0 >>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20 >>>>>> [20213.579256] do_syscall_64+0x4d/0x140 >>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 >>>>>> >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>>> >>>>>> >>>>> Please note that the kernel version I use is the one maintained by >>>>> Tencent.Inc, >>>>> and the baseline is v5.4. But in fact, in the latest upstream source >>>>> tree, >>>>> although the trailing_symlink() function has been removed, its logic >>>>> has been >>>>> moved to pick_link(), so the problem still exists. >>>>> >>>>> Ian Kent pointed out that try_to_unlazy() was introduced in >>>>> pick_link() in the >>>>> latest upstream source tree, but I don't understand why this can >>>>> solve the NULL >>>>> ->get_link pointer dereference problem, because ->get_link pointer >>>>> will be >>>>> dereferenced before try_to_unlazy(). >>>>> >>>>> (I don't understand why Ian Kent's email didn't appear on the >>>>> mailing list.) >>>> It was something about html mail and I think my mail client was at >>>> fault. >>>> >>>> In any case what you say is indeed correct, so the comment isn't >>>> important. >>>> >>>> >>>> Fact is it is still a race between the lockless path walk and inode >>>> eviction >>>> >>>> and xfs recycling. I believe that the xfs recycling code is very >>>> hard to >>>> fix. >>>> >>>> >>>> IIRC correctly putting a NULL check in pick_link() was not considered >>>> acceptable >>>> >>>> but there must be a way that is acceptable to check this and >>>> restart the >>>> walk. >>>> >>>> Maybe there was a reluctance to suffer the overhead of restarting the >>>> walk when >>>> >>>> it shouldn't be needed. >>> Or perhaps the worry was that if it can become NULL it could also >>> become a >>> pointer to a >>> >>> different (incorrect) link altogether which could have really >>> odd/unpleasant >>> outcomes. >> Yuck. I think that means that we can't reallocate freed inodes until >> the rcu grace period expires. For inodes that haven't been evicted, I >> think that also means we cannot recycle cached inodes until after an rcu >> grace period expires; or maybe that we cannot reset i_op/i_fop and must >> not leave the incore state in an inconsistent format? > > Yeah, not pretty! > > But shouldn't this case occur only occasionally? > > > So issuing a cache miss shouldn't impact performance too much that was, > > I believe, the concern with waiting for the rcu grace period. > > > Identifying it's happening should be possible, the vfs legitimize_*() > > has this job for various objects but maybe it's using vfs private info. > > (certainly it uses nameidata struct with a seq lock sequence number in > > it) but I assume it can be done somehow. Unfortunately, when you start trying to work out how to do this, it isn't at all obvious how to do it ... > > > My question then becomes is it viable/straight forward to not recycle > such > > an inode and discard it instead so it gets re-created, I guess it's > essentially > > a cache miss? > > > Ian > >> >> --D >> >>>> >>>> The alternative would be to find some way to identify when it's unsafe >>>> to reuse >>>> >>>> an inode marked for re-cycle before dropping rcu read, perhaps with >>>> the >>>> reference >>>> >>>> count plus the seqlock. Basically, to reuse inodes xfs will need to >>>> identify when >>>> >>>> the race occurs and let the inode go away under rcu and create a >>>> new one >>>> if a race >>>> >>>> is detected. But possibly that isn't nearly as simple as it sounds? >>>> >>>> >>>>> Thanks, >>>>> Jinliang Zheng >>>>> >>>>>> And I analyzed the disassembly of trailing_symlink() and >>>>>> confirmed that a NULL >>>>>> ->get_link() happened here: >>>>>> >>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) >>>>>> [FTRACE NOP] >>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp >>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp >>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 >>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 >>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 >>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 >>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx >>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov >>>>>> %rdi,%rbx # rbx = &nameidate >>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp >>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov >>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 >>>>>> <sysctl_protected_symlinks> >>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax >>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx >>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx >>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov >>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode >>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx >>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov >>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid >>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) >>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi >>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx >>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx >>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx >>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f> >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax >>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov >>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT >>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov >>>>>> 0x50(%rbx),%rax # rax = nd->stack >>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq >>>>>> $0x0,0x20(%rax) # stack[0].name = NULL >>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov >>>>>> 0x48(%rbx),%eax # nd->depth >>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov >>>>>> 0x50(%rbx),%rdx # nd->stack >>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov >>>>>> 0xc8(%rbx),%r13 # nd->link_inode >>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea >>>>>> (%rax,%rax,2),%rax # rax = depth * 3 >>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl >>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 >>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea >>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last >>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov >>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry >>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) >>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100> >>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi >>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi >>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq >>>>>> 0xffffffff812f8a00 <atime_needs_update> >>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al >>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206> >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx >>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi >>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi >>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx >>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx >>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq >>>>>> 0xffffffff81424310 <security_inode_follow_link> >>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 >>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax >>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9> >>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) >>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 >>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 >>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195> >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax >>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al >>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167> >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al >>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 >>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp >>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax >>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx >>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 >>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 >>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 >>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 >>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp >>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi >>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq >>>>>> 0xffffffff812f8ae0 <touch_atime> >>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq >>>>>> 0xffffffff81a26410 <_cond_resched> >>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx >>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx >>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f> >>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx >>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov >>>>>> $0xfffffffffffffff6,%r12 >>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al >>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax >>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax >>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax >>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> >>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax >>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax >>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f> >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov >>>>>> $0xffffffff82319b4f,%rdi >>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov >>>>>> $0xfffffffffffffff3,%r12 >>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq >>>>>> 0xffffffff81161310 <audit_log_link_denied> >>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov >>>>>> $0xffffffff8230164d,%r12 >>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) >>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a> >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi >>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq >>>>>> 0xffffffff812e2da0 <nd_jump_root> >>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax >>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 >>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax >>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al >>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5> >>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182> >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov >>>>>> 0x20(%r13),%rax # inode->i_op >>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 >>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi >>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx >>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov >>>>>> 0x8(%rax),%rcx # inode_operations->get_link >>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) >>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf> >>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov >>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0 >>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 >>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp >>>>>> $0xfffffffffffff000,%r12 >>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8> >>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor >>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0 >>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) >>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 >>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp >>>>>> $0xfffffffffffffff6,%rax >>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi >>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq >>>>>> 0xffffffff812e3840 <unlazy_walk> >>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax >>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx >>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi >>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi >>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx >>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> >>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 >>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi >>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq >>>>>> 0xffffffff812e3840 <unlazy_walk> >>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax >>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi >>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq >>>>>> 0xffffffff812f8ae0 <touch_atime> >>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax >>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi >>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx >>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi >>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq >>>>>> 0xffffffff811673f0 <__audit_inode> >>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi >>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq >>>>>> 0xffffffff812e4790 <set_root> >>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172> >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov >>>>>> $0xfffffffffffffff6,%r12 >>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>> >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>>> >>>>>> >>>>>> >>>>>> According to my understanding, the problem solved by commit >>>>>> 7b7820b83f23 ("xfs: >>>>>> don't expose internal symlink metadata buffers to the vfs") is a >>>>>> data NULL >>>>>> pointer dereference, but the problem here is an instruction NULL >>>>>> pointer >>>>>> dereference. >>>>>> >>>>>> Further, I analyzed the possible triggering process as follows: >>>>>> >>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create >>>>>> rcu_read_lock >>>>>> read_seqcount_retry >>>>>> (the last check) iput_final >>>>>> evict >>>>>> destroy_inode >>>>>> xfs_fs_destroy_inode >>>>>> xfs_inode_set_reclaim_tag xfs_ialloc >>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc >>>>>> set(ip, XFS_IRECLAIMABLE) >>>>>> xfs_iget >>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock >>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit >>>>>> spin_lock(ip->i_flags_lock) >>>>>> >>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) >>>>>> set(ip, XFS_IRECLAIM) >>>>>> spin_unlock(ip->i_flags_lock) >>>>>> rcu_read_unlock >>>>>> < ------------ > >>>>>> >>>>>> // miss synchronize_rcu() >>>>>> xfs_reinit_inode >>>>>> ->get_link = NULL >>>>>> get_link() // NULL >>>>>> >>>>>> rcu_read_unlock >>>>>> >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>>> >>>>>> >>>>>> >>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't >>>>>> expose internal >>>>>> symlink metadata buffers to the vfs"), we should start >>>>>> processing this NULL >>>>>> ->get_link pointer dereference. >>>>>> >>>>>> Or, am I thinking wrong somewhere? >>>>>> >>>>>> Thanks, >>>>>> Jinliang Zheng >>>>>> >>>>>>>>> Apart from that issue, I'm not aware of any other issues that the >>>>>>>>> XFS inode recycling directly exposes. >>>>>>>>> >>>>>>>>>> According to my understanding, the essence of >>>>>>>>>> this problem is that XFS reuses >>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk >>>>>>>>>> assumes that this will not happen. >>>>>>>>> It assumes that the inode will not change identity during the RCU >>>>>>>>> grace period after the inode has been evicted from cache. We can >>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU >>>>>>>>> grace period as long as it is the same inode with the same >>>>>>>>> content >>>>>>>>> and same state. >>>>>>>>> >>>>>>>>> Problems *may* arise when we unlink the inode, then evict it, >>>>>>>>> then a >>>>>>>>> new file is created and the old slab cache memory address is used >>>>>>>>> for the new inode. I describe the issue here: >>>>>>>>> >>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ >>>>>>>>> >>>>>>>>> >>>>>>>> And judging from the relevant emails, the main reason >>>>>>>> why ->get_link() is set >>>>>>>> to NULL should be the lack of synchronize_rcu() before >>>>>>>> xfs_reinit_inode() when >>>>>>>> the inode is chosen to be reused. >>>>>>>> >>>>>>>> However, perhaps due to performance reasons, this >>>>>>>> solution has not been merged >>>>>>>> for a long time. How is it now? >>>>>>>> >>>>>>>> Maybe I am missing something in the threads of mail? >>>>>>>> >>>>>>>> Thank you very much. :) >>>>>>>> Jinliang Zheng >>>>>>>> >>>>>>>>> That said, we have exactly zero evidence that this is actually a >>>>>>>>> problem in production systems. We did get systems tripping >>>>>>>>> over the >>>>>>>>> symlink issue, but there's no evidence that the >>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the >>>>>>>>> wild and >>>>>>>>> hence there hasn't been any particular urgency to address it. >>>>>>>>> >>>>>>>>>> Are there any recommended workarounds until an >>>>>>>>>> elegant and efficient solution >>>>>>>>>> can be proposed? After all, causing a crash is >>>>>>>>>> extremely unacceptable in a >>>>>>>>>> production environment. >>>>>>>>> What crashes are you seeing in your production environment? >>>>>>>>> >>>>>>>>> -Dave. >>>>>>>>> -- >>>>>>>>> Dave Chinner >>>>>>>>> david@fromorbit.com > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-21 2:13 ` Ian Kent @ 2024-05-26 15:04 ` Jinliang Zheng 2024-05-26 17:21 ` Paul E. McKenney 2024-05-26 23:51 ` Ian Kent 1 sibling, 1 reply; 19+ messages in thread From: Jinliang Zheng @ 2024-05-26 15:04 UTC (permalink / raw) To: raven Cc: alexjlzheng, alexjlzheng, bfoster, david, djwong, linux-fsdevel, linux-xfs, rcu On Tue, 21 May 2024 at 10:13:38 +0800, Ian Kent wrote: > On 21/5/24 09:35, Ian Kent wrote: > > On 21/5/24 01:36, Darrick J. Wong wrote: > >> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > >>> On 16/5/24 15:08, Ian Kent wrote: > >>>> On 16/5/24 12:56, Jinliang Zheng wrote: > >>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: > >>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: > >>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: > >>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: > >>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, > >>>>>>>>> alexjlzheng@gmail.com wrote: > >>>>>>>>>> Hi, all > >>>>>>>>>> > >>>>>>>>>> I would like to ask if the conflict between xfs > >>>>>>>>>> inode recycle and vfs rcu-walk > >>>>>>>>>> which can lead to null pointer references has been resolved? > >>>>>>>>>> > >>>>>>>>>> I browsed through emails about the following > >>>>>>>>>> patches and their discussions: > >>>>>>>>>> - > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > >>>>>>>>>> - > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > >>>>>>>>>> - > >>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > >>>>>>>>>> > >>>>>>>>>> And then came to the conclusion that this > >>>>>>>>>> problem has not been solved, am I > >>>>>>>>>> right? Did I miss some patch that could solve this problem? > >>>>>>>>> We fixed the known problems this caused by turning off the VFS > >>>>>>>>> functionality that the rcu pathwalks kept tripping over. See > >>>>>>>>> commit > >>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink > >>>>>>>>> metadata buffers to > >>>>>>>>> the vfs"). > >>>>>>>> Sorry for the delay. > >>>>>>>> > >>>>>>>> The problem I encountered in the production environment > >>>>>>>> was that during the > >>>>>>>> rcu walk process the ->get_link() pointer was NULL, > >>>>>>>> which caused a crash. > >>>>>>>> > >>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't > >>>>>>>> expose internal symlink > >>>>>>>> metadata buffers to the vfs") first appeared in: > >>>>>>>> - > >>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ > >>>>>>>> > >>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how? > >>>>>>> I suggest reading the call stack from wherever the VFS enters > >>>>>>> the XFS > >>>>>>> readlink code. If you have a reliable reproducer, then > >>>>>>> apply this patch > >>>>>>> to your kernel (you haven't mentioned which one it is) and see > >>>>>>> if the > >>>>>>> bad dereference goes away. > >>>>>>> > >>>>>>> --D > >>>>>> Sorry for the delay. > >>>>>> > >>>>>> I encountered the following calltrace: > >>>>>> > >>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address: > >>>>>> 0000000000000000 > >>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode > >>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page > >>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > >>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI > >>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: > >>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > >>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. > >>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > >>>>>> [20213.578884] RIP: 0010:0x0 > >>>>>> [20213.578894] Code: Bad RIP value. > >>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > >>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: > >>>>>> 0000000000000000 > >>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: > >>>>>> 0000000000000000 > >>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: > >>>>>> ffff889b9eeae380 > >>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: > >>>>>> 0000000000000000 > >>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: > >>>>>> ffffc90021ebfd48 > >>>>>> [20213.578998] FS: 00007f89c534e740(0000) > >>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > >>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: > >>>>>> 00000000007706e0 > >>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > >>>>>> 0000000000000000 > >>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > >>>>>> 0000000000000400 > >>>>>> [20213.579079] PKRU: 55555554 > >>>>>> [20213.579087] Call Trace: > >>>>>> [20213.579099] trailing_symlink+0x1da/0x260 > >>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220 > >>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170 > >>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > >>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0 > >>>>>> [20213.579161] user_path_at_empty+0x3e/0x50 > >>>>>> [20213.579172] vfs_statx+0x76/0xe0 > >>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70 > >>>>>> [20213.579194] ? fput+0x13/0x20 > >>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300 > >>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30 > >>>>>> [20213.579225] ? fput+0x13/0x20 > >>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0 > >>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20 > >>>>>> [20213.579256] do_syscall_64+0x4d/0x140 > >>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > >>>>>> > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > >>>>>> > >>>>>> > >>>>> Please note that the kernel version I use is the one maintained by > >>>>> Tencent.Inc, > >>>>> and the baseline is v5.4. But in fact, in the latest upstream source > >>>>> tree, > >>>>> although the trailing_symlink() function has been removed, its logic > >>>>> has been > >>>>> moved to pick_link(), so the problem still exists. > >>>>> > >>>>> Ian Kent pointed out that try_to_unlazy() was introduced in > >>>>> pick_link() in the > >>>>> latest upstream source tree, but I don't understand why this can > >>>>> solve the NULL > >>>>> ->get_link pointer dereference problem, because ->get_link pointer > >>>>> will be > >>>>> dereferenced before try_to_unlazy(). > >>>>> > >>>>> (I don't understand why Ian Kent's email didn't appear on the > >>>>> mailing list.) > >>>> It was something about html mail and I think my mail client was at > >>>> fault. > >>>> > >>>> In any case what you say is indeed correct, so the comment isn't > >>>> important. > >>>> > >>>> > >>>> Fact is it is still a race between the lockless path walk and inode > >>>> eviction > >>>> > >>>> and xfs recycling. I believe that the xfs recycling code is very > >>>> hard to > >>>> fix. > >>>> > >>>> > >>>> IIRC correctly putting a NULL check in pick_link() was not considered > >>>> acceptable > >>>> > >>>> but there must be a way that is acceptable to check this and > >>>> restart the > >>>> walk. > >>>> > >>>> Maybe there was a reluctance to suffer the overhead of restarting the > >>>> walk when > >>>> > >>>> it shouldn't be needed. > >>> Or perhaps the worry was that if it can become NULL it could also > >>> become a > >>> pointer to a > >>> > >>> different (incorrect) link altogether which could have really > >>> odd/unpleasant > >>> outcomes. > >> Yuck. I think that means that we can't reallocate freed inodes until > >> the rcu grace period expires. For inodes that haven't been evicted, I > >> think that also means we cannot recycle cached inodes until after an rcu > >> grace period expires; or maybe that we cannot reset i_op/i_fop and must > >> not leave the incore state in an inconsistent format? > > > > Yeah, not pretty! > > > > But shouldn't this case occur only occasionally? > > > > > > So issuing a cache miss shouldn't impact performance too much that was, > > > > I believe, the concern with waiting for the rcu grace period. > > > > > > Identifying it's happening should be possible, the vfs legitimize_*() > > > > has this job for various objects but maybe it's using vfs private info. > > > > (certainly it uses nameidata struct with a seq lock sequence number in > > > > it) but I assume it can be done somehow. > > Unfortunately, when you start trying to work out how to do this, it > isn't at all > > obvious how to do it ... How about adding a synchronize_rcu() in front of xfs_reinit_inode()? Maybe this will affect performance, but compared to crashing the kernel, this performance penalty is completely worth it. And, perhaps we can gradually take some optimization measures, such as: https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ Best Regards, Jinliang Zheng > > > > > > > > My question then becomes is it viable/straight forward to not recycle > > such > > > > an inode and discard it instead so it gets re-created, I guess it's > > essentially > > > > a cache miss? > > > > > > Ian > > > >> > >> --D > >> > >>>> > >>>> The alternative would be to find some way to identify when it's unsafe > >>>> to reuse > >>>> > >>>> an inode marked for re-cycle before dropping rcu read, perhaps with > >>>> the > >>>> reference > >>>> > >>>> count plus the seqlock. Basically, to reuse inodes xfs will need to > >>>> identify when > >>>> > >>>> the race occurs and let the inode go away under rcu and create a > >>>> new one > >>>> if a race > >>>> > >>>> is detected. But possibly that isn't nearly as simple as it sounds? > >>>> > >>>> > >>>>> Thanks, > >>>>> Jinliang Zheng > >>>>> > >>>>>> And I analyzed the disassembly of trailing_symlink() and > >>>>>> confirmed that a NULL > >>>>>> ->get_link() happened here: > >>>>>> > >>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) > >>>>>> [FTRACE NOP] > >>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp > >>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp > >>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 > >>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 > >>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 > >>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 > >>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx > >>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov > >>>>>> %rdi,%rbx # rbx = &nameidate > >>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp > >>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov > >>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 > >>>>>> <sysctl_protected_symlinks> > >>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax > >>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx > >>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > >>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx > >>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov > >>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode > >>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx > >>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov > >>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid > >>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) > >>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > >>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi > >>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx > >>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx > >>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx > >>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f> > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax > >>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov > >>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT > >>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov > >>>>>> 0x50(%rbx),%rax # rax = nd->stack > >>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq > >>>>>> $0x0,0x20(%rax) # stack[0].name = NULL > >>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov > >>>>>> 0x48(%rbx),%eax # nd->depth > >>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov > >>>>>> 0x50(%rbx),%rdx # nd->stack > >>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov > >>>>>> 0xc8(%rbx),%r13 # nd->link_inode > >>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea > >>>>>> (%rax,%rax,2),%rax # rax = depth * 3 > >>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl > >>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 > >>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea > >>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last > >>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov > >>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry > >>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) > >>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100> > >>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi > >>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi > >>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq > >>>>>> 0xffffffff812f8a00 <atime_needs_update> > >>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al > >>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206> > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx > >>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi > >>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi > >>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx > >>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx > >>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq > >>>>>> 0xffffffff81424310 <security_inode_follow_link> > >>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 > >>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax > >>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9> > >>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) > >>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 > >>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 > >>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195> > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax > >>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al > >>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167> > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al > >>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 > >>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp > >>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax > >>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx > >>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 > >>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 > >>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 > >>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 > >>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp > >>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi > >>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq > >>>>>> 0xffffffff812f8ae0 <touch_atime> > >>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq > >>>>>> 0xffffffff81a26410 <_cond_resched> > >>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx > >>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx > >>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f> > >>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx > >>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov > >>>>>> $0xfffffffffffffff6,%r12 > >>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al > >>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > >>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax > >>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax > >>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax > >>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> > >>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax > >>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax > >>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f> > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov > >>>>>> $0xffffffff82319b4f,%rdi > >>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov > >>>>>> $0xfffffffffffffff3,%r12 > >>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq > >>>>>> 0xffffffff81161310 <audit_log_link_denied> > >>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov > >>>>>> $0xffffffff8230164d,%r12 > >>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) > >>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a> > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi > >>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq > >>>>>> 0xffffffff812e2da0 <nd_jump_root> > >>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax > >>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 > >>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax > >>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al > >>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5> > >>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182> > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov > >>>>>> 0x20(%r13),%rax # inode->i_op > >>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 > >>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi > >>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx > >>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov > >>>>>> 0x8(%rax),%rcx # inode_operations->get_link > >>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) > >>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf> > >>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov > >>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0 > >>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > >>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 > >>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > >>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp > >>>>>> $0xfffffffffffff000,%r12 > >>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8> > >>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor > >>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0 > >>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) > >>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > >>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 > >>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp > >>>>>> $0xfffffffffffffff6,%rax > >>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> > >>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi > >>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq > >>>>>> 0xffffffff812e3840 <unlazy_walk> > >>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax > >>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > >>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx > >>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi > >>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi > >>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx > >>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> > >>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 > >>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi > >>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq > >>>>>> 0xffffffff812e3840 <unlazy_walk> > >>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax > >>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > >>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi > >>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq > >>>>>> 0xffffffff812f8ae0 <touch_atime> > >>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax > >>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi > >>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx > >>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi > >>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq > >>>>>> 0xffffffff811673f0 <__audit_inode> > >>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi > >>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq > >>>>>> 0xffffffff812e4790 <set_root> > >>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172> > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov > >>>>>> $0xfffffffffffffff6,%r12 > >>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > >>>>>> > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > >>>>>> > >>>>>> > >>>>>> > >>>>>> According to my understanding, the problem solved by commit > >>>>>> 7b7820b83f23 ("xfs: > >>>>>> don't expose internal symlink metadata buffers to the vfs") is a > >>>>>> data NULL > >>>>>> pointer dereference, but the problem here is an instruction NULL > >>>>>> pointer > >>>>>> dereference. > >>>>>> > >>>>>> Further, I analyzed the possible triggering process as follows: > >>>>>> > >>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create > >>>>>> rcu_read_lock > >>>>>> read_seqcount_retry > >>>>>> (the last check) iput_final > >>>>>> evict > >>>>>> destroy_inode > >>>>>> xfs_fs_destroy_inode > >>>>>> xfs_inode_set_reclaim_tag xfs_ialloc > >>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc > >>>>>> set(ip, XFS_IRECLAIMABLE) > >>>>>> xfs_iget > >>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock > >>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit > >>>>>> spin_lock(ip->i_flags_lock) > >>>>>> > >>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) > >>>>>> set(ip, XFS_IRECLAIM) > >>>>>> spin_unlock(ip->i_flags_lock) > >>>>>> rcu_read_unlock > >>>>>> < ------------ > > >>>>>> > >>>>>> // miss synchronize_rcu() > >>>>>> xfs_reinit_inode > >>>>>> ->get_link = NULL > >>>>>> get_link() // NULL > >>>>>> > >>>>>> rcu_read_unlock > >>>>>> > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > >>>>>> > >>>>>> > >>>>>> > >>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't > >>>>>> expose internal > >>>>>> symlink metadata buffers to the vfs"), we should start > >>>>>> processing this NULL > >>>>>> ->get_link pointer dereference. > >>>>>> > >>>>>> Or, am I thinking wrong somewhere? > >>>>>> > >>>>>> Thanks, > >>>>>> Jinliang Zheng > >>>>>> > >>>>>>>>> Apart from that issue, I'm not aware of any other issues that the > >>>>>>>>> XFS inode recycling directly exposes. > >>>>>>>>> > >>>>>>>>>> According to my understanding, the essence of > >>>>>>>>>> this problem is that XFS reuses > >>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk > >>>>>>>>>> assumes that this will not happen. > >>>>>>>>> It assumes that the inode will not change identity during the RCU > >>>>>>>>> grace period after the inode has been evicted from cache. We can > >>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU > >>>>>>>>> grace period as long as it is the same inode with the same > >>>>>>>>> content > >>>>>>>>> and same state. > >>>>>>>>> > >>>>>>>>> Problems *may* arise when we unlink the inode, then evict it, > >>>>>>>>> then a > >>>>>>>>> new file is created and the old slab cache memory address is used > >>>>>>>>> for the new inode. I describe the issue here: > >>>>>>>>> > >>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ > >>>>>>>>> > >>>>>>>>> > >>>>>>>> And judging from the relevant emails, the main reason > >>>>>>>> why ->get_link() is set > >>>>>>>> to NULL should be the lack of synchronize_rcu() before > >>>>>>>> xfs_reinit_inode() when > >>>>>>>> the inode is chosen to be reused. > >>>>>>>> > >>>>>>>> However, perhaps due to performance reasons, this > >>>>>>>> solution has not been merged > >>>>>>>> for a long time. How is it now? > >>>>>>>> > >>>>>>>> Maybe I am missing something in the threads of mail? > >>>>>>>> > >>>>>>>> Thank you very much. :) > >>>>>>>> Jinliang Zheng > >>>>>>>> > >>>>>>>>> That said, we have exactly zero evidence that this is actually a > >>>>>>>>> problem in production systems. We did get systems tripping > >>>>>>>>> over the > >>>>>>>>> symlink issue, but there's no evidence that the > >>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the > >>>>>>>>> wild and > >>>>>>>>> hence there hasn't been any particular urgency to address it. > >>>>>>>>> > >>>>>>>>>> Are there any recommended workarounds until an > >>>>>>>>>> elegant and efficient solution > >>>>>>>>>> can be proposed? After all, causing a crash is > >>>>>>>>>> extremely unacceptable in a > >>>>>>>>>> production environment. > >>>>>>>>> What crashes are you seeing in your production environment? > >>>>>>>>> > >>>>>>>>> -Dave. > >>>>>>>>> -- > >>>>>>>>> Dave Chinner > >>>>>>>>> david@fromorbit.com > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-26 15:04 ` Jinliang Zheng @ 2024-05-26 17:21 ` Paul E. McKenney 0 siblings, 0 replies; 19+ messages in thread From: Paul E. McKenney @ 2024-05-26 17:21 UTC (permalink / raw) To: Jinliang Zheng Cc: raven, alexjlzheng, bfoster, david, djwong, linux-fsdevel, linux-xfs, rcu On Sun, May 26, 2024 at 11:04:14PM +0800, Jinliang Zheng wrote: > On Tue, 21 May 2024 at 10:13:38 +0800, Ian Kent wrote: > > On 21/5/24 09:35, Ian Kent wrote: > > > On 21/5/24 01:36, Darrick J. Wong wrote: > > >> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > > >>> On 16/5/24 15:08, Ian Kent wrote: > > >>>> On 16/5/24 12:56, Jinliang Zheng wrote: > > >>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: > > >>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: > > >>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: > > >>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: > > >>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, > > >>>>>>>>> alexjlzheng@gmail.com wrote: > > >>>>>>>>>> Hi, all > > >>>>>>>>>> > > >>>>>>>>>> I would like to ask if the conflict between xfs > > >>>>>>>>>> inode recycle and vfs rcu-walk > > >>>>>>>>>> which can lead to null pointer references has been resolved? > > >>>>>>>>>> > > >>>>>>>>>> I browsed through emails about the following > > >>>>>>>>>> patches and their discussions: > > >>>>>>>>>> - > > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ > > >>>>>>>>>> - > > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ > > >>>>>>>>>> - > > >>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > >>>>>>>>>> > > >>>>>>>>>> And then came to the conclusion that this > > >>>>>>>>>> problem has not been solved, am I > > >>>>>>>>>> right? Did I miss some patch that could solve this problem? > > >>>>>>>>> We fixed the known problems this caused by turning off the VFS > > >>>>>>>>> functionality that the rcu pathwalks kept tripping over. See > > >>>>>>>>> commit > > >>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink > > >>>>>>>>> metadata buffers to > > >>>>>>>>> the vfs"). > > >>>>>>>> Sorry for the delay. > > >>>>>>>> > > >>>>>>>> The problem I encountered in the production environment > > >>>>>>>> was that during the > > >>>>>>>> rcu walk process the ->get_link() pointer was NULL, > > >>>>>>>> which caused a crash. > > >>>>>>>> > > >>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't > > >>>>>>>> expose internal symlink > > >>>>>>>> metadata buffers to the vfs") first appeared in: > > >>>>>>>> - > > >>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ > > >>>>>>>> > > >>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how? > > >>>>>>> I suggest reading the call stack from wherever the VFS enters > > >>>>>>> the XFS > > >>>>>>> readlink code. If you have a reliable reproducer, then > > >>>>>>> apply this patch > > >>>>>>> to your kernel (you haven't mentioned which one it is) and see > > >>>>>>> if the > > >>>>>>> bad dereference goes away. > > >>>>>>> > > >>>>>>> --D > > >>>>>> Sorry for the delay. > > >>>>>> > > >>>>>> I encountered the following calltrace: > > >>>>>> > > >>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode > > >>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page > > >>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > > >>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI > > >>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: > > >>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > > >>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. > > >>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > > >>>>>> [20213.578884] RIP: 0010:0x0 > > >>>>>> [20213.578894] Code: Bad RIP value. > > >>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > > >>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: > > >>>>>> ffff889b9eeae380 > > >>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: > > >>>>>> ffffc90021ebfd48 > > >>>>>> [20213.578998] FS: 00007f89c534e740(0000) > > >>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > > >>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > >>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: > > >>>>>> 00000000007706e0 > > >>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > >>>>>> 0000000000000000 > > >>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > >>>>>> 0000000000000400 > > >>>>>> [20213.579079] PKRU: 55555554 > > >>>>>> [20213.579087] Call Trace: > > >>>>>> [20213.579099] trailing_symlink+0x1da/0x260 > > >>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220 > > >>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170 > > >>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > > >>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0 > > >>>>>> [20213.579161] user_path_at_empty+0x3e/0x50 > > >>>>>> [20213.579172] vfs_statx+0x76/0xe0 > > >>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70 > > >>>>>> [20213.579194] ? fput+0x13/0x20 > > >>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300 > > >>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30 > > >>>>>> [20213.579225] ? fput+0x13/0x20 > > >>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0 > > >>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20 > > >>>>>> [20213.579256] do_syscall_64+0x4d/0x140 > > >>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > > >>>>>> > > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > >>>>>> > > >>>>>> > > >>>>> Please note that the kernel version I use is the one maintained by > > >>>>> Tencent.Inc, > > >>>>> and the baseline is v5.4. But in fact, in the latest upstream source > > >>>>> tree, > > >>>>> although the trailing_symlink() function has been removed, its logic > > >>>>> has been > > >>>>> moved to pick_link(), so the problem still exists. > > >>>>> > > >>>>> Ian Kent pointed out that try_to_unlazy() was introduced in > > >>>>> pick_link() in the > > >>>>> latest upstream source tree, but I don't understand why this can > > >>>>> solve the NULL > > >>>>> ->get_link pointer dereference problem, because ->get_link pointer > > >>>>> will be > > >>>>> dereferenced before try_to_unlazy(). > > >>>>> > > >>>>> (I don't understand why Ian Kent's email didn't appear on the > > >>>>> mailing list.) > > >>>> It was something about html mail and I think my mail client was at > > >>>> fault. > > >>>> > > >>>> In any case what you say is indeed correct, so the comment isn't > > >>>> important. > > >>>> > > >>>> > > >>>> Fact is it is still a race between the lockless path walk and inode > > >>>> eviction > > >>>> > > >>>> and xfs recycling. I believe that the xfs recycling code is very > > >>>> hard to > > >>>> fix. > > >>>> > > >>>> > > >>>> IIRC correctly putting a NULL check in pick_link() was not considered > > >>>> acceptable > > >>>> > > >>>> but there must be a way that is acceptable to check this and > > >>>> restart the > > >>>> walk. > > >>>> > > >>>> Maybe there was a reluctance to suffer the overhead of restarting the > > >>>> walk when > > >>>> > > >>>> it shouldn't be needed. > > >>> Or perhaps the worry was that if it can become NULL it could also > > >>> become a > > >>> pointer to a > > >>> > > >>> different (incorrect) link altogether which could have really > > >>> odd/unpleasant > > >>> outcomes. > > >> Yuck. I think that means that we can't reallocate freed inodes until > > >> the rcu grace period expires. For inodes that haven't been evicted, I > > >> think that also means we cannot recycle cached inodes until after an rcu > > >> grace period expires; or maybe that we cannot reset i_op/i_fop and must > > >> not leave the incore state in an inconsistent format? > > > > > > Yeah, not pretty! > > > > > > But shouldn't this case occur only occasionally? > > > > > > > > > So issuing a cache miss shouldn't impact performance too much that was, > > > > > > I believe, the concern with waiting for the rcu grace period. > > > > > > > > > Identifying it's happening should be possible, the vfs legitimize_*() > > > > > > has this job for various objects but maybe it's using vfs private info. > > > > > > (certainly it uses nameidata struct with a seq lock sequence number in > > > > > > it) but I assume it can be done somehow. > > > > Unfortunately, when you start trying to work out how to do this, it > > isn't at all > > > > obvious how to do it ... > > How about adding a synchronize_rcu() in front of xfs_reinit_inode()? > > Maybe this will affect performance, but compared to crashing the kernel, this > performance penalty is completely worth it. There is always synchronize_rcu_expedited(), especially if this is a relatively rare operation. The typical synchronize_rcu() delay is tens of milliseconds, while the typical synchronize_rcu_expedited() delay is tens to hundreds of microseconds. The downside of synchronize_rcu_expedited() is higher per-RCU-update CPU utilization. Plus added IPIs. > And, perhaps we can gradually take some optimization measures, such as: > https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ I cannot claim to fully understand this code, but I do agree that the use of things like poll_state_synchronize_rcu() and cond_synchronize_rcu() could greatly reduce the number of grace-period waits. In recent kernels, there is cond_synchronize_rcu_expedited() as well. Thanx, Paul > Best Regards, > Jinliang Zheng > > > > > > > > > > > > > > My question then becomes is it viable/straight forward to not recycle > > > such > > > > > > an inode and discard it instead so it gets re-created, I guess it's > > > essentially > > > > > > a cache miss? > > > > > > > > > Ian > > > > > >> > > >> --D > > >> > > >>>> > > >>>> The alternative would be to find some way to identify when it's unsafe > > >>>> to reuse > > >>>> > > >>>> an inode marked for re-cycle before dropping rcu read, perhaps with > > >>>> the > > >>>> reference > > >>>> > > >>>> count plus the seqlock. Basically, to reuse inodes xfs will need to > > >>>> identify when > > >>>> > > >>>> the race occurs and let the inode go away under rcu and create a > > >>>> new one > > >>>> if a race > > >>>> > > >>>> is detected. But possibly that isn't nearly as simple as it sounds? > > >>>> > > >>>> > > >>>>> Thanks, > > >>>>> Jinliang Zheng > > >>>>> > > >>>>>> And I analyzed the disassembly of trailing_symlink() and > > >>>>>> confirmed that a NULL > > >>>>>> ->get_link() happened here: > > >>>>>> > > >>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) > > >>>>>> [FTRACE NOP] > > >>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp > > >>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp > > >>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 > > >>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 > > >>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 > > >>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 > > >>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx > > >>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov > > >>>>>> %rdi,%rbx # rbx = &nameidate > > >>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp > > >>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov > > >>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 > > >>>>>> <sysctl_protected_symlinks> > > >>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax > > >>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx > > >>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > > >>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx > > >>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov > > >>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode > > >>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx > > >>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov > > >>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid > > >>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) > > >>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > > >>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi > > >>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx > > >>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx > > >>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx > > >>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je > > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f> > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax > > >>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov > > >>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT > > >>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov > > >>>>>> 0x50(%rbx),%rax # rax = nd->stack > > >>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq > > >>>>>> $0x0,0x20(%rax) # stack[0].name = NULL > > >>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov > > >>>>>> 0x48(%rbx),%eax # nd->depth > > >>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov > > >>>>>> 0x50(%rbx),%rdx # nd->stack > > >>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov > > >>>>>> 0xc8(%rbx),%r13 # nd->link_inode > > >>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea > > >>>>>> (%rax,%rax,2),%rax # rax = depth * 3 > > >>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl > > >>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 > > >>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea > > >>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last > > >>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov > > >>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry > > >>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) > > >>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je > > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100> > > >>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi > > >>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi > > >>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq > > >>>>>> 0xffffffff812f8a00 <atime_needs_update> > > >>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al > > >>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne > > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206> > > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx > > >>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi > > >>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi > > >>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx > > >>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx > > >>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq > > >>>>>> 0xffffffff81424310 <security_inode_follow_link> > > >>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 > > >>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax > > >>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne > > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9> > > >>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) > > >>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 > > >>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 > > >>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je > > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195> > > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax > > >>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al > > >>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je > > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167> > > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al > > >>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 > > >>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp > > >>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax > > >>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx > > >>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 > > >>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 > > >>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 > > >>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 > > >>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp > > >>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq > > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi > > >>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq > > >>>>>> 0xffffffff812f8ae0 <touch_atime> > > >>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq > > >>>>>> 0xffffffff81a26410 <_cond_resched> > > >>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp > > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> > > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx > > >>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx > > >>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je > > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f> > > >>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx > > >>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov > > >>>>>> $0xfffffffffffffff6,%r12 > > >>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al > > >>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax > > >>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax > > >>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax > > >>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je > > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> > > >>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax > > >>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax > > >>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je > > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f> > > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov > > >>>>>> $0xffffffff82319b4f,%rdi > > >>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov > > >>>>>> $0xfffffffffffffff3,%r12 > > >>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq > > >>>>>> 0xffffffff81161310 <audit_log_link_denied> > > >>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov > > >>>>>> $0xffffffff8230164d,%r12 > > >>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) > > >>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je > > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a> > > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq > > >>>>>> 0xffffffff812e2da0 <nd_jump_root> > > >>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax > > >>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 > > >>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax > > >>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al > > >>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne > > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5> > > >>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp > > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182> > > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov > > >>>>>> 0x20(%r13),%rax # inode->i_op > > >>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 > > >>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi > > >>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx > > >>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov > > >>>>>> 0x8(%rax),%rcx # inode_operations->get_link > > >>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) > > >>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne > > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf> > > >>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov > > >>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0 > > >>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq > > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > > >>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 > > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 > > >>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > > >>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp > > >>>>>> $0xfffffffffffff000,%r12 > > >>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe > > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8> > > >>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor > > >>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0 > > >>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) > > >>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq > > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > > >>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 > > >>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp > > >>>>>> $0xfffffffffffffff6,%rax > > >>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne > > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> > > >>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq > > >>>>>> 0xffffffff812e3840 <unlazy_walk> > > >>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax > > >>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > > >>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx > > >>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi > > >>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi > > >>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx > > >>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq > > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> > > >>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 > > >>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp > > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> > > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq > > >>>>>> 0xffffffff812e3840 <unlazy_walk> > > >>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax > > >>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > > >>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi > > >>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq > > >>>>>> 0xffffffff812f8ae0 <touch_atime> > > >>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq > > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> > > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax > > >>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi > > >>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx > > >>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi > > >>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq > > >>>>>> 0xffffffff811673f0 <__audit_inode> > > >>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq > > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> > > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq > > >>>>>> 0xffffffff812e4790 <set_root> > > >>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq > > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172> > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov > > >>>>>> $0xfffffffffffffff6,%r12 > > >>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> > > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> According to my understanding, the problem solved by commit > > >>>>>> 7b7820b83f23 ("xfs: > > >>>>>> don't expose internal symlink metadata buffers to the vfs") is a > > >>>>>> data NULL > > >>>>>> pointer dereference, but the problem here is an instruction NULL > > >>>>>> pointer > > >>>>>> dereference. > > >>>>>> > > >>>>>> Further, I analyzed the possible triggering process as follows: > > >>>>>> > > >>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create > > >>>>>> rcu_read_lock > > >>>>>> read_seqcount_retry > > >>>>>> (the last check) iput_final > > >>>>>> evict > > >>>>>> destroy_inode > > >>>>>> xfs_fs_destroy_inode > > >>>>>> xfs_inode_set_reclaim_tag xfs_ialloc > > >>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc > > >>>>>> set(ip, XFS_IRECLAIMABLE) > > >>>>>> xfs_iget > > >>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock > > >>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit > > >>>>>> spin_lock(ip->i_flags_lock) > > >>>>>> > > >>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) > > >>>>>> set(ip, XFS_IRECLAIM) > > >>>>>> spin_unlock(ip->i_flags_lock) > > >>>>>> rcu_read_unlock > > >>>>>> < ------------ > > > >>>>>> > > >>>>>> // miss synchronize_rcu() > > >>>>>> xfs_reinit_inode > > >>>>>> ->get_link = NULL > > >>>>>> get_link() // NULL > > >>>>>> > > >>>>>> rcu_read_unlock > > >>>>>> > > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't > > >>>>>> expose internal > > >>>>>> symlink metadata buffers to the vfs"), we should start > > >>>>>> processing this NULL > > >>>>>> ->get_link pointer dereference. > > >>>>>> > > >>>>>> Or, am I thinking wrong somewhere? > > >>>>>> > > >>>>>> Thanks, > > >>>>>> Jinliang Zheng > > >>>>>> > > >>>>>>>>> Apart from that issue, I'm not aware of any other issues that the > > >>>>>>>>> XFS inode recycling directly exposes. > > >>>>>>>>> > > >>>>>>>>>> According to my understanding, the essence of > > >>>>>>>>>> this problem is that XFS reuses > > >>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk > > >>>>>>>>>> assumes that this will not happen. > > >>>>>>>>> It assumes that the inode will not change identity during the RCU > > >>>>>>>>> grace period after the inode has been evicted from cache. We can > > >>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU > > >>>>>>>>> grace period as long as it is the same inode with the same > > >>>>>>>>> content > > >>>>>>>>> and same state. > > >>>>>>>>> > > >>>>>>>>> Problems *may* arise when we unlink the inode, then evict it, > > >>>>>>>>> then a > > >>>>>>>>> new file is created and the old slab cache memory address is used > > >>>>>>>>> for the new inode. I describe the issue here: > > >>>>>>>>> > > >>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>> And judging from the relevant emails, the main reason > > >>>>>>>> why ->get_link() is set > > >>>>>>>> to NULL should be the lack of synchronize_rcu() before > > >>>>>>>> xfs_reinit_inode() when > > >>>>>>>> the inode is chosen to be reused. > > >>>>>>>> > > >>>>>>>> However, perhaps due to performance reasons, this > > >>>>>>>> solution has not been merged > > >>>>>>>> for a long time. How is it now? > > >>>>>>>> > > >>>>>>>> Maybe I am missing something in the threads of mail? > > >>>>>>>> > > >>>>>>>> Thank you very much. :) > > >>>>>>>> Jinliang Zheng > > >>>>>>>> > > >>>>>>>>> That said, we have exactly zero evidence that this is actually a > > >>>>>>>>> problem in production systems. We did get systems tripping > > >>>>>>>>> over the > > >>>>>>>>> symlink issue, but there's no evidence that the > > >>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the > > >>>>>>>>> wild and > > >>>>>>>>> hence there hasn't been any particular urgency to address it. > > >>>>>>>>> > > >>>>>>>>>> Are there any recommended workarounds until an > > >>>>>>>>>> elegant and efficient solution > > >>>>>>>>>> can be proposed? After all, causing a crash is > > >>>>>>>>>> extremely unacceptable in a > > >>>>>>>>>> production environment. > > >>>>>>>>> What crashes are you seeing in your production environment? > > >>>>>>>>> > > >>>>>>>>> -Dave. > > >>>>>>>>> -- > > >>>>>>>>> Dave Chinner > > >>>>>>>>> david@fromorbit.com > > > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-21 2:13 ` Ian Kent 2024-05-26 15:04 ` Jinliang Zheng @ 2024-05-26 23:51 ` Ian Kent 2024-05-27 0:18 ` Al Viro 1 sibling, 1 reply; 19+ messages in thread From: Ian Kent @ 2024-05-26 23:51 UTC (permalink / raw) To: Darrick J. Wong Cc: Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel, linux-xfs, rcu On 21/5/24 10:13, Ian Kent wrote: > On 21/5/24 09:35, Ian Kent wrote: >> On 21/5/24 01:36, Darrick J. Wong wrote: >>> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: >>>> On 16/5/24 15:08, Ian Kent wrote: >>>>> On 16/5/24 12:56, Jinliang Zheng wrote: >>>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: >>>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote: >>>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: >>>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote: >>>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, >>>>>>>>>> alexjlzheng@gmail.com wrote: >>>>>>>>>>> Hi, all >>>>>>>>>>> >>>>>>>>>>> I would like to ask if the conflict between xfs >>>>>>>>>>> inode recycle and vfs rcu-walk >>>>>>>>>>> which can lead to null pointer references has been resolved? >>>>>>>>>>> >>>>>>>>>>> I browsed through emails about the following >>>>>>>>>>> patches and their discussions: >>>>>>>>>>> - >>>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/ >>>>>>>>>>> - >>>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/ >>>>>>>>>>> - >>>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ >>>>>>>>>>> >>>>>>>>>>> And then came to the conclusion that this >>>>>>>>>>> problem has not been solved, am I >>>>>>>>>>> right? Did I miss some patch that could solve this problem? >>>>>>>>>> We fixed the known problems this caused by turning off the VFS >>>>>>>>>> functionality that the rcu pathwalks kept tripping over. See >>>>>>>>>> commit >>>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink >>>>>>>>>> metadata buffers to >>>>>>>>>> the vfs"). >>>>>>>>> Sorry for the delay. >>>>>>>>> >>>>>>>>> The problem I encountered in the production environment >>>>>>>>> was that during the >>>>>>>>> rcu walk process the ->get_link() pointer was NULL, >>>>>>>>> which caused a crash. >>>>>>>>> >>>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't >>>>>>>>> expose internal symlink >>>>>>>>> metadata buffers to the vfs") first appeared in: >>>>>>>>> - >>>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ >>>>>>>>> >>>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how? >>>>>>>> I suggest reading the call stack from wherever the VFS enters >>>>>>>> the XFS >>>>>>>> readlink code. If you have a reliable reproducer, then >>>>>>>> apply this patch >>>>>>>> to your kernel (you haven't mentioned which one it is) and see >>>>>>>> if the >>>>>>>> bad dereference goes away. >>>>>>>> >>>>>>>> --D >>>>>>> Sorry for the delay. >>>>>>> >>>>>>> I encountered the following calltrace: >>>>>>> >>>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address: >>>>>>> 0000000000000000 >>>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode >>>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page >>>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 >>>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI >>>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: >>>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 >>>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. >>>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 >>>>>>> [20213.578884] RIP: 0010:0x0 >>>>>>> [20213.578894] Code: Bad RIP value. >>>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 >>>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: >>>>>>> 0000000000000000 >>>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: >>>>>>> 0000000000000000 >>>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: >>>>>>> ffff889b9eeae380 >>>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: >>>>>>> 0000000000000000 >>>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: >>>>>>> ffffc90021ebfd48 >>>>>>> [20213.578998] FS: 00007f89c534e740(0000) >>>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000 >>>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: >>>>>>> 00000000007706e0 >>>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>>>>>> 0000000000000000 >>>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: >>>>>>> 0000000000000400 >>>>>>> [20213.579079] PKRU: 55555554 >>>>>>> [20213.579087] Call Trace: >>>>>>> [20213.579099] trailing_symlink+0x1da/0x260 >>>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220 >>>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170 >>>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 >>>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0 >>>>>>> [20213.579161] user_path_at_empty+0x3e/0x50 >>>>>>> [20213.579172] vfs_statx+0x76/0xe0 >>>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70 >>>>>>> [20213.579194] ? fput+0x13/0x20 >>>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300 >>>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30 >>>>>>> [20213.579225] ? fput+0x13/0x20 >>>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0 >>>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20 >>>>>>> [20213.579256] do_syscall_64+0x4d/0x140 >>>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 >>>>>>> >>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>>>> >>>>>>> >>>>>> Please note that the kernel version I use is the one maintained by >>>>>> Tencent.Inc, >>>>>> and the baseline is v5.4. But in fact, in the latest upstream source >>>>>> tree, >>>>>> although the trailing_symlink() function has been removed, its logic >>>>>> has been >>>>>> moved to pick_link(), so the problem still exists. >>>>>> >>>>>> Ian Kent pointed out that try_to_unlazy() was introduced in >>>>>> pick_link() in the >>>>>> latest upstream source tree, but I don't understand why this can >>>>>> solve the NULL >>>>>> ->get_link pointer dereference problem, because ->get_link pointer >>>>>> will be >>>>>> dereferenced before try_to_unlazy(). >>>>>> >>>>>> (I don't understand why Ian Kent's email didn't appear on the >>>>>> mailing list.) >>>>> It was something about html mail and I think my mail client was at >>>>> fault. >>>>> >>>>> In any case what you say is indeed correct, so the comment isn't >>>>> important. >>>>> >>>>> >>>>> Fact is it is still a race between the lockless path walk and inode >>>>> eviction >>>>> >>>>> and xfs recycling. I believe that the xfs recycling code is very >>>>> hard to >>>>> fix. >>>>> >>>>> >>>>> IIRC correctly putting a NULL check in pick_link() was not considered >>>>> acceptable >>>>> >>>>> but there must be a way that is acceptable to check this and >>>>> restart the >>>>> walk. >>>>> >>>>> Maybe there was a reluctance to suffer the overhead of restarting the >>>>> walk when >>>>> >>>>> it shouldn't be needed. >>>> Or perhaps the worry was that if it can become NULL it could also >>>> become a >>>> pointer to a >>>> >>>> different (incorrect) link altogether which could have really >>>> odd/unpleasant >>>> outcomes. >>> Yuck. I think that means that we can't reallocate freed inodes until >>> the rcu grace period expires. For inodes that haven't been evicted, I >>> think that also means we cannot recycle cached inodes until after an >>> rcu >>> grace period expires; or maybe that we cannot reset i_op/i_fop and must >>> not leave the incore state in an inconsistent format? >> >> Yeah, not pretty! >> >> But shouldn't this case occur only occasionally? >> >> >> So issuing a cache miss shouldn't impact performance too much that was, >> >> I believe, the concern with waiting for the rcu grace period. >> >> >> Identifying it's happening should be possible, the vfs legitimize_*() >> >> has this job for various objects but maybe it's using vfs private info. >> >> (certainly it uses nameidata struct with a seq lock sequence number in >> >> it) but I assume it can be done somehow. > > Unfortunately, when you start trying to work out how to do this, it > isn't at all > > obvious how to do it ... Indeed, that's what I found when I had a quick look. Maybe a dentry (since that's part of the subject of the path walk and inode is readily accessible) flag could be used since there's opportunity to set it in vfs callbacks that are done as a matter of course. Ian > > >> >> >> My question then becomes is it viable/straight forward to not recycle >> such >> >> an inode and discard it instead so it gets re-created, I guess it's >> essentially >> >> a cache miss? >> >> >> Ian >> >>> >>> --D >>> >>>>> >>>>> The alternative would be to find some way to identify when it's >>>>> unsafe >>>>> to reuse >>>>> >>>>> an inode marked for re-cycle before dropping rcu read, perhaps >>>>> with the >>>>> reference >>>>> >>>>> count plus the seqlock. Basically, to reuse inodes xfs will need to >>>>> identify when >>>>> >>>>> the race occurs and let the inode go away under rcu and create a >>>>> new one >>>>> if a race >>>>> >>>>> is detected. But possibly that isn't nearly as simple as it sounds? >>>>> >>>>> >>>>>> Thanks, >>>>>> Jinliang Zheng >>>>>> >>>>>>> And I analyzed the disassembly of trailing_symlink() and >>>>>>> confirmed that a NULL >>>>>>> ->get_link() happened here: >>>>>>> >>>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) >>>>>>> [FTRACE NOP] >>>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp >>>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp >>>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 >>>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 >>>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 >>>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 >>>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx >>>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov >>>>>>> %rdi,%rbx # rbx = &nameidate >>>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp >>>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov >>>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 >>>>>>> <sysctl_protected_symlinks> >>>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax >>>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx >>>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je >>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx >>>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov >>>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode >>>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx >>>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov >>>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid >>>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) >>>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je >>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi >>>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx >>>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx >>>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx >>>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je >>>>>>> 0xffffffff812e495f <trailing_symlink+0x10f> >>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax >>>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov >>>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT >>>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov >>>>>>> 0x50(%rbx),%rax # rax = nd->stack >>>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq >>>>>>> $0x0,0x20(%rax) # stack[0].name = NULL >>>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov >>>>>>> 0x48(%rbx),%eax # nd->depth >>>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov >>>>>>> 0x50(%rbx),%rdx # nd->stack >>>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov >>>>>>> 0xc8(%rbx),%r13 # nd->link_inode >>>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea >>>>>>> (%rax,%rax,2),%rax # rax = depth * 3 >>>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl >>>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 >>>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea >>>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last >>>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov >>>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry >>>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) >>>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je >>>>>>> 0xffffffff812e4950 <trailing_symlink+0x100> >>>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi >>>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi >>>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq >>>>>>> 0xffffffff812f8a00 <atime_needs_update> >>>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al >>>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne >>>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206> >>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx >>>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi >>>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi >>>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx >>>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx >>>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq >>>>>>> 0xffffffff81424310 <security_inode_follow_link> >>>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 >>>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax >>>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne >>>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9> >>>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) >>>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 >>>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 >>>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je >>>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195> >>>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax >>>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al >>>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je >>>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167> >>>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al >>>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je >>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 >>>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je >>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp >>>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax >>>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx >>>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 >>>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 >>>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 >>>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 >>>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp >>>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq >>>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi >>>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq >>>>>>> 0xffffffff812f8ae0 <touch_atime> >>>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq >>>>>>> 0xffffffff81a26410 <_cond_resched> >>>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp >>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx >>>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx >>>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je >>>>>>> 0xffffffff812e496f <trailing_symlink+0x11f> >>>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx >>>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je >>>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> >>>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov >>>>>>> $0xfffffffffffffff6,%r12 >>>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al >>>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne >>>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax >>>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax >>>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax >>>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je >>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> >>>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax >>>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax >>>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je >>>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f> >>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov >>>>>>> $0xffffffff82319b4f,%rdi >>>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov >>>>>>> $0xfffffffffffffff3,%r12 >>>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq >>>>>>> 0xffffffff81161310 <audit_log_link_denied> >>>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp >>>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov >>>>>>> $0xffffffff8230164d,%r12 >>>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp >>>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) >>>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je >>>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a> >>>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi >>>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq >>>>>>> 0xffffffff812e2da0 <nd_jump_root> >>>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax >>>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne >>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 >>>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax >>>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al >>>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne >>>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5> >>>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp >>>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182> >>>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov >>>>>>> 0x20(%r13),%rax # inode->i_op >>>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 >>>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi >>>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx >>>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov >>>>>>> 0x8(%rax),%rcx # inode_operations->get_link >>>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) >>>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne >>>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf> >>>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov >>>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0 >>>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq >>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 >>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 >>>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je >>>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> >>>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp >>>>>>> $0xfffffffffffff000,%r12 >>>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe >>>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8> >>>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq >>>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor >>>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0 >>>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) >>>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq >>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx >>>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 >>>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp >>>>>>> $0xfffffffffffffff6,%rax >>>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne >>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi >>>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq >>>>>>> 0xffffffff812e3840 <unlazy_walk> >>>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax >>>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne >>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx >>>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi >>>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi >>>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx >>>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq >>>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> >>>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 >>>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp >>>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> >>>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi >>>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq >>>>>>> 0xffffffff812e3840 <unlazy_walk> >>>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax >>>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne >>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> >>>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi >>>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq >>>>>>> 0xffffffff812f8ae0 <touch_atime> >>>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq >>>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> >>>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax >>>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi >>>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx >>>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi >>>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq >>>>>>> 0xffffffff811673f0 <__audit_inode> >>>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq >>>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> >>>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi >>>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq >>>>>>> 0xffffffff812e4790 <set_root> >>>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq >>>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172> >>>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov >>>>>>> $0xfffffffffffffff6,%r12 >>>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq >>>>>>> 0xffffffff812e493e <trailing_symlink+0xee> >>>>>>> >>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>>>> >>>>>>> >>>>>>> >>>>>>> According to my understanding, the problem solved by commit >>>>>>> 7b7820b83f23 ("xfs: >>>>>>> don't expose internal symlink metadata buffers to the vfs") is a >>>>>>> data NULL >>>>>>> pointer dereference, but the problem here is an instruction NULL >>>>>>> pointer >>>>>>> dereference. >>>>>>> >>>>>>> Further, I analyzed the possible triggering process as follows: >>>>>>> >>>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create >>>>>>> rcu_read_lock >>>>>>> read_seqcount_retry >>>>>>> (the last check) iput_final >>>>>>> evict >>>>>>> destroy_inode >>>>>>> xfs_fs_destroy_inode >>>>>>> xfs_inode_set_reclaim_tag xfs_ialloc >>>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc >>>>>>> set(ip, XFS_IRECLAIMABLE) >>>>>>> xfs_iget >>>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock >>>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit >>>>>>> spin_lock(ip->i_flags_lock) >>>>>>> >>>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) >>>>>>> set(ip, XFS_IRECLAIM) >>>>>>> spin_unlock(ip->i_flags_lock) >>>>>>> rcu_read_unlock >>>>>>> < ------------ > >>>>>>> >>>>>>> // miss synchronize_rcu() >>>>>>> xfs_reinit_inode >>>>>>> ->get_link = NULL >>>>>>> get_link() // NULL >>>>>>> >>>>>>> rcu_read_unlock >>>>>>> >>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>>>> >>>>>>> >>>>>>> >>>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't >>>>>>> expose internal >>>>>>> symlink metadata buffers to the vfs"), we should start >>>>>>> processing this NULL >>>>>>> ->get_link pointer dereference. >>>>>>> >>>>>>> Or, am I thinking wrong somewhere? >>>>>>> >>>>>>> Thanks, >>>>>>> Jinliang Zheng >>>>>>> >>>>>>>>>> Apart from that issue, I'm not aware of any other issues that >>>>>>>>>> the >>>>>>>>>> XFS inode recycling directly exposes. >>>>>>>>>> >>>>>>>>>>> According to my understanding, the essence of >>>>>>>>>>> this problem is that XFS reuses >>>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk >>>>>>>>>>> assumes that this will not happen. >>>>>>>>>> It assumes that the inode will not change identity during the >>>>>>>>>> RCU >>>>>>>>>> grace period after the inode has been evicted from cache. We can >>>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU >>>>>>>>>> grace period as long as it is the same inode with the same >>>>>>>>>> content >>>>>>>>>> and same state. >>>>>>>>>> >>>>>>>>>> Problems *may* arise when we unlink the inode, then evict it, >>>>>>>>>> then a >>>>>>>>>> new file is created and the old slab cache memory address is >>>>>>>>>> used >>>>>>>>>> for the new inode. I describe the issue here: >>>>>>>>>> >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/ >>>>>>>>>> >>>>>>>>>> >>>>>>>>> And judging from the relevant emails, the main reason >>>>>>>>> why ->get_link() is set >>>>>>>>> to NULL should be the lack of synchronize_rcu() before >>>>>>>>> xfs_reinit_inode() when >>>>>>>>> the inode is chosen to be reused. >>>>>>>>> >>>>>>>>> However, perhaps due to performance reasons, this >>>>>>>>> solution has not been merged >>>>>>>>> for a long time. How is it now? >>>>>>>>> >>>>>>>>> Maybe I am missing something in the threads of mail? >>>>>>>>> >>>>>>>>> Thank you very much. :) >>>>>>>>> Jinliang Zheng >>>>>>>>> >>>>>>>>>> That said, we have exactly zero evidence that this is actually a >>>>>>>>>> problem in production systems. We did get systems tripping >>>>>>>>>> over the >>>>>>>>>> symlink issue, but there's no evidence that the >>>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the >>>>>>>>>> wild and >>>>>>>>>> hence there hasn't been any particular urgency to address it. >>>>>>>>>> >>>>>>>>>>> Are there any recommended workarounds until an >>>>>>>>>>> elegant and efficient solution >>>>>>>>>>> can be proposed? After all, causing a crash is >>>>>>>>>>> extremely unacceptable in a >>>>>>>>>>> production environment. >>>>>>>>>> What crashes are you seeing in your production environment? >>>>>>>>>> >>>>>>>>>> -Dave. >>>>>>>>>> -- >>>>>>>>>> Dave Chinner >>>>>>>>>> david@fromorbit.com >> > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-26 23:51 ` Ian Kent @ 2024-05-27 0:18 ` Al Viro 2024-05-28 15:51 ` Brian Foster 0 siblings, 1 reply; 19+ messages in thread From: Al Viro @ 2024-05-27 0:18 UTC (permalink / raw) To: Ian Kent Cc: Darrick J. Wong, Jinliang Zheng, alexjlzheng, bfoster, david, linux-fsdevel, linux-xfs, rcu On Mon, May 27, 2024 at 07:51:39AM +0800, Ian Kent wrote: > Indeed, that's what I found when I had a quick look. > > > Maybe a dentry (since that's part of the subject of the path walk and inode > is readily > > accessible) flag could be used since there's opportunity to set it in vfs > callbacks that > > are done as a matter of course. You might recheck ->d_seq after fetching ->get_link there; with XFS ->get_link() unconditionlly failing in RCU mode that would prevent this particular problem. But it would obviously have to be done in pick_link() itself (and I refuse to touch that area in 5.4 - carrying those changes across the e.g. 5.6 changes in pathwalk machinery is too much). And it's really just the tip of the iceberg - e.g. I'd expect a massive headache in ACL-related part of permission checks, etc. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-27 0:18 ` Al Viro @ 2024-05-28 15:51 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2024-05-28 15:51 UTC (permalink / raw) To: Al Viro Cc: Ian Kent, Darrick J. Wong, Jinliang Zheng, alexjlzheng, david, linux-fsdevel, linux-xfs, rcu On Mon, May 27, 2024 at 01:18:23AM +0100, Al Viro wrote: > On Mon, May 27, 2024 at 07:51:39AM +0800, Ian Kent wrote: > > > Indeed, that's what I found when I had a quick look. > > > > > > Maybe a dentry (since that's part of the subject of the path walk and inode > > is readily > > > > accessible) flag could be used since there's opportunity to set it in vfs > > callbacks that > > > > are done as a matter of course. > > You might recheck ->d_seq after fetching ->get_link there; with XFS > ->get_link() unconditionlly failing in RCU mode that would prevent > this particular problem. But it would obviously have to be done > in pick_link() itself (and I refuse to touch that area in 5.4 - > carrying those changes across the e.g. 5.6 changes in pathwalk > machinery is too much). > Ian sent a patch along those lines a couple years or so ago: https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ I'm still not quite sure why we didn't merge this, at least as a bandaid fix for the symlink variant of this particular problem..? Brian > And it's really just the tip of the iceberg - e.g. I'd expect a massive > headache in ACL-related part of permission checks, etc. > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-16 7:23 ` Ian Kent 2024-05-20 17:36 ` Darrick J. Wong @ 2024-05-27 9:41 ` Dave Chinner 2024-05-27 13:56 ` Jinliang Zheng 1 sibling, 1 reply; 19+ messages in thread From: Dave Chinner @ 2024-05-27 9:41 UTC (permalink / raw) To: Ian Kent Cc: Jinliang Zheng, alexjlzheng, bfoster, djwong, linux-fsdevel, linux-xfs, rcu On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > On 16/5/24 15:08, Ian Kent wrote: > > On 16/5/24 12:56, Jinliang Zheng wrote: > > > > I encountered the following calltrace: > > > > > > > > [20213.578756] BUG: kernel NULL pointer dereference, address: > > > > 0000000000000000 > > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode > > > > [20213.578799] #PF: error_code(0x0010) - not-present page > > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI > > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: > > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd. > > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > > > > [20213.578884] RIP: 0010:0x0 > > > > [20213.578894] Code: Bad RIP value. > > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: > > > > 0000000000000000 > > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: > > > > 0000000000000000 > > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: > > > > ffff889b9eeae380 > > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: > > > > 0000000000000000 > > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: > > > > ffffc90021ebfd48 > > > > [20213.578998] FS: 00007f89c534e740(0000) > > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > > > > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: > > > > 00000000007706e0 > > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [20213.579079] PKRU: 55555554 > > > > [20213.579087] Call Trace: > > > > [20213.579099] trailing_symlink+0x1da/0x260 > > > > [20213.579112] path_lookupat.isra.53+0x79/0x220 > > > > [20213.579125] filename_lookup.part.69+0xa0/0x170 > > > > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > > > > [20213.579151] ? getname_flags+0x4f/0x1e0 > > > > [20213.579161] user_path_at_empty+0x3e/0x50 > > > > [20213.579172] vfs_statx+0x76/0xe0 > > > > [20213.579182] __do_sys_newstat+0x3d/0x70 > > > > [20213.579194] ? fput+0x13/0x20 > > > > [20213.579203] ? ksys_ioctl+0xb0/0x300 > > > > [20213.579213] ? generic_file_llseek+0x24/0x30 > > > > [20213.579225] ? fput+0x13/0x20 > > > > [20213.579233] ? ksys_lseek+0x8d/0xb0 > > > > [20213.579243] __x64_sys_newstat+0x16/0x20 > > > > [20213.579256] do_syscall_64+0x4d/0x140 > > > > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > > > > > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > > > > > > Please note that the kernel version I use is the one maintained by > > > Tencent.Inc, > > > and the baseline is v5.4. But in fact, in the latest upstream source > > > tree, > > > although the trailing_symlink() function has been removed, its logic > > > has been > > > moved to pick_link(), so the problem still exists. > > > > > > Ian Kent pointed out that try_to_unlazy() was introduced in > > > pick_link() in the > > > latest upstream source tree, but I don't understand why this can > > > solve the NULL > > > ->get_link pointer dereference problem, because ->get_link pointer > > > will be > > > dereferenced before try_to_unlazy(). > > > > > > (I don't understand why Ian Kent's email didn't appear on the > > > mailing list.) > > > > It was something about html mail and I think my mail client was at fault. > > > > In any case what you say is indeed correct, so the comment isn't > > important. > > > > > > Fact is it is still a race between the lockless path walk and inode > > eviction > > > > and xfs recycling. I believe that the xfs recycling code is very hard to > > fix. Not really for this case. This is simply concurrent pathwalk lookups occurring just after the inode has been evicted from the VFS inode cache. The first lookup hits the XFS inode cache, sees XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to reinstantiate the VFS inode to an initial state. The Xfs inode itself is still valid as it hasn't reached the XFS_IRECLAIM state where it will be torn down and freed. Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has been run and that it is trying to call ->get_link on that same inode. Unfortunately, the first lookup has just set inode->f_ops = &empty_fops as part of the VFS inode reinit, and that then triggers the null pointer deref. Once the first lookup has finished the inode_init_always(), xfs_reinit_inode() resets inode->f_ops back to xfs_symlink_file_ops and get_link calls work again. Fundamentally, the problem is that we are completely reinitialising the VFS inode within the RCU grace period. i.e. while concurrent RCU pathwalks can still be in progress and find the VFS inode whilst the XFS inode cache is manipulating it. What we should be doing here is a subset of inode_init_always(), which only reinitialises the bits of the VFS inode we need to initialise rather than the entire inode. The identity of the inode is not changing and so we don't need to go through a transient state where the VFS inode goes xfs symlink -> empty initialised inode -> xfs symlink. i.e. We need to re-initialise the non-identity related parts of the VFS inode so the identity parts that the RCU pathwalks rely on never change within the RCU grace period where lookups can find the VFS inode after it has been evicted. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-27 9:41 ` Dave Chinner @ 2024-05-27 13:56 ` Jinliang Zheng 2024-05-28 2:10 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Jinliang Zheng @ 2024-05-27 13:56 UTC (permalink / raw) To: david Cc: alexjlzheng, alexjlzheng, bfoster, djwong, linux-fsdevel, linux-xfs, raven, rcu On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote: > On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > > On 16/5/24 15:08, Ian Kent wrote: > > > On 16/5/24 12:56, Jinliang Zheng wrote: > > > > > I encountered the following calltrace: > > > > > > > > > > [20213.578756] BUG: kernel NULL pointer dereference, address: > > > > > 0000000000000000 > > > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode > > > > > [20213.578799] #PF: error_code(0x0010) - not-present page > > > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > > > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI > > > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: > > > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > > > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd. > > > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > > > > > [20213.578884] RIP: 0010:0x0 > > > > > [20213.578894] Code: Bad RIP value. > > > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > > > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: > > > > > 0000000000000000 > > > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: > > > > > 0000000000000000 > > > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: > > > > > ffff889b9eeae380 > > > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: > > > > > 0000000000000000 > > > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: > > > > > ffffc90021ebfd48 > > > > > [20213.578998] FS: 00007f89c534e740(0000) > > > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > > > > > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: > > > > > 00000000007706e0 > > > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > > 0000000000000000 > > > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > > 0000000000000400 > > > > > [20213.579079] PKRU: 55555554 > > > > > [20213.579087] Call Trace: > > > > > [20213.579099] trailing_symlink+0x1da/0x260 > > > > > [20213.579112] path_lookupat.isra.53+0x79/0x220 > > > > > [20213.579125] filename_lookup.part.69+0xa0/0x170 > > > > > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > > > > > [20213.579151] ? getname_flags+0x4f/0x1e0 > > > > > [20213.579161] user_path_at_empty+0x3e/0x50 > > > > > [20213.579172] vfs_statx+0x76/0xe0 > > > > > [20213.579182] __do_sys_newstat+0x3d/0x70 > > > > > [20213.579194] ? fput+0x13/0x20 > > > > > [20213.579203] ? ksys_ioctl+0xb0/0x300 > > > > > [20213.579213] ? generic_file_llseek+0x24/0x30 > > > > > [20213.579225] ? fput+0x13/0x20 > > > > > [20213.579233] ? ksys_lseek+0x8d/0xb0 > > > > > [20213.579243] __x64_sys_newstat+0x16/0x20 > > > > > [20213.579256] do_syscall_64+0x4d/0x140 > > > > > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > > > > > > > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > > > > > > > > Please note that the kernel version I use is the one maintained by > > > > Tencent.Inc, > > > > and the baseline is v5.4. But in fact, in the latest upstream source > > > > tree, > > > > although the trailing_symlink() function has been removed, its logic > > > > has been > > > > moved to pick_link(), so the problem still exists. > > > > > > > > Ian Kent pointed out that try_to_unlazy() was introduced in > > > > pick_link() in the > > > > latest upstream source tree, but I don't understand why this can > > > > solve the NULL > > > > ->get_link pointer dereference problem, because ->get_link pointer > > > > will be > > > > dereferenced before try_to_unlazy(). > > > > > > > > (I don't understand why Ian Kent's email didn't appear on the > > > > mailing list.) > > > > > > It was something about html mail and I think my mail client was at fault. > > > > > > In any case what you say is indeed correct, so the comment isn't > > > important. > > > > > > > > > Fact is it is still a race between the lockless path walk and inode > > > eviction > > > > > > and xfs recycling. I believe that the xfs recycling code is very hard to > > > fix. > > Not really for this case. This is simply concurrent pathwalk lookups > occurring just after the inode has been evicted from the VFS inode > cache. The first lookup hits the XFS inode cache, sees > XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to > reinstantiate the VFS inode to an initial state. The Xfs inode > itself is still valid as it hasn't reached the XFS_IRECLAIM state > where it will be torn down and freed. > > Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has > been run and that it is trying to call ->get_link on that same > inode. Unfortunately, the first lookup has just set inode->f_ops = > &empty_fops as part of the VFS inode reinit, and that then triggers > the null pointer deref. The RCU pathwalk must occur before xfs_reinit_inode(), and must obtain the target inode before xfs_reinit_inode(). Because the target inode of xfs_reinit_inode() must NOT be associated with any dentry, which is necessary conditions for iput() -> iput_final() -> evict(), and the RCU pathwalk cannot obtain any inode without a dentry. > > Once the first lookup has finished the inode_init_always(), > xfs_reinit_inode() resets inode->f_ops back to > xfs_symlink_file_ops and get_link calls work again. > > Fundamentally, the problem is that we are completely reinitialising > the VFS inode within the RCU grace period. i.e. while concurrent RCU > pathwalks can still be in progress and find the VFS inode whilst the > XFS inode cache is manipulating it. > > What we should be doing here is a subset of inode_init_always(), > which only reinitialises the bits of the VFS inode we need to > initialise rather than the entire inode. The identity of the inode > is not changing and so we don't need to go through a transient state > where the VFS inode goes xfs symlink -> empty initialised inode -> > xfs symlink. Sorry, I think this question is wrong in more ways than just inode_operations. After the target inode has been reinited by xfs_reinit_inode(), its semantics is no longer the inode required by RCU walkpath. The meanings of many fields have changed, such as mode, i_mtime, i_atime, i_ctime and so on. > > i.e. We need to re-initialise the non-identity related parts of the > VFS inode so the identity parts that the RCU pathwalks rely on never > change within the RCU grace period where lookups can find the VFS > inode after it has been evicted. > > -Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: About the conflict between XFS inode recycle and VFS rcu-walk 2024-05-27 13:56 ` Jinliang Zheng @ 2024-05-28 2:10 ` Dave Chinner 0 siblings, 0 replies; 19+ messages in thread From: Dave Chinner @ 2024-05-28 2:10 UTC (permalink / raw) To: Jinliang Zheng Cc: alexjlzheng, bfoster, djwong, linux-fsdevel, linux-xfs, raven, rcu On Mon, May 27, 2024 at 09:56:15PM +0800, Jinliang Zheng wrote: > On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote: > > On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > > > On 16/5/24 15:08, Ian Kent wrote: > > > > On 16/5/24 12:56, Jinliang Zheng wrote: > > > > In any case what you say is indeed correct, so the comment isn't > > > > important. > > > > > > > > > > > > Fact is it is still a race between the lockless path walk and inode > > > > eviction > > > > > > > > and xfs recycling. I believe that the xfs recycling code is very hard to > > > > fix. > > > > Not really for this case. This is simply concurrent pathwalk lookups > > occurring just after the inode has been evicted from the VFS inode > > cache. The first lookup hits the XFS inode cache, sees > > XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to > > reinstantiate the VFS inode to an initial state. The Xfs inode > > itself is still valid as it hasn't reached the XFS_IRECLAIM state > > where it will be torn down and freed. > > > > Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has > > been run and that it is trying to call ->get_link on that same > > inode. Unfortunately, the first lookup has just set inode->f_ops = > > &empty_fops as part of the VFS inode reinit, and that then triggers > > the null pointer deref. > > The RCU pathwalk must occur before xfs_reinit_inode(), and must obtain the > target inode before xfs_reinit_inode(). I'm not sure I follow - xfs_reinit_inode() typically occurs during a pathwalk when no dentry for the given path component is found in the dcache. Hence it has to create the dentry and look up the inode. i.e. walk_component() lookup_fast() -> doesn't find a valid cached dentry lookup_slow() inode_lock_shared(parent) parent->i_op->lookup(child) xfs_vn_lookup() xfs_lookup() xfs_iget(child) <<<< inode may not exist until here xfs_iget_recycle(child) xfs_reinit_inode(child) inode_unlock_shared(parent) The path you are indicating is going wrong is: link_path_walk() walk_component() <find child dentry> step_into(child) if (!d_is_symlink(child dentry)) { .... return } pick_link(child) if (!inode->i_link) inode->i_op->get_link() <<<< doesn't exist, not a symlink inode This implies that lookup_fast() found a symlink dentry with a d_inode pointer to something that wasn't a symlink. That doesn't mean that anything has gone wrong with xfs inode recycling within an RCU grace period. For example, d_is_symlink() looks purely at the dentry state and assumes that it matches the dentry->d_inode attached to it: #define DCACHE_ENTRY_TYPE (7 << 20) /* bits 20..22 are for storing type: */ #define DCACHE_MISS_TYPE (0 << 20) /* Negative dentry */ #define DCACHE_WHITEOUT_TYPE (1 << 20) /* Whiteout dentry (stop pathwalk) */ #define DCACHE_DIRECTORY_TYPE (2 << 20) /* Normal directory */ #define DCACHE_AUTODIR_TYPE (3 << 20) /* Lookupless directory (presumed automount) */ #define DCACHE_REGULAR_TYPE (4 << 20) /* Regular file type */ #define DCACHE_SPECIAL_TYPE (5 << 20) /* Other file type */ #define DCACHE_SYMLINK_TYPE (6 << 20) /* Symlink */ static inline unsigned __d_entry_type(const struct dentry *dentry) { return dentry->d_flags & DCACHE_ENTRY_TYPE; } static inline bool d_is_symlink(const struct dentry *dentry) { return __d_entry_type(dentry) == DCACHE_SYMLINK_TYPE; } This is a valid optimisation and good for performance, but it does make it susceptible to memory corruption based failues. i.e. a single bit memory corruption can change a DCACHE_DIRECTORY_TYPE dentry to look like a DCACHE_SYMLINK_TYPE dentry, and then the code calls pick_link() on a dentry that points to a directory inode and not a symlink inode. Such a memory corruption would have an identical crash signature to the stack trace you posted, hence I'd really like to have solid confirmation that the crash you are seeing is actually a result of inode recycling and not something else.... > > Once the first lookup has finished the inode_init_always(), > > xfs_reinit_inode() resets inode->f_ops back to > > xfs_symlink_file_ops and get_link calls work again. > > > > Fundamentally, the problem is that we are completely reinitialising > > the VFS inode within the RCU grace period. i.e. while concurrent RCU > > pathwalks can still be in progress and find the VFS inode whilst the > > XFS inode cache is manipulating it. > > > > What we should be doing here is a subset of inode_init_always(), > > which only reinitialises the bits of the VFS inode we need to > > initialise rather than the entire inode. The identity of the inode > > is not changing and so we don't need to go through a transient state > > where the VFS inode goes xfs symlink -> empty initialised inode -> > > xfs symlink. > > Sorry, I think this question is wrong in more ways than just inode_operations. > After the target inode has been reinited by xfs_reinit_inode(), its semantics > is no longer the inode required by RCU walkpath. The meanings of many fields > have changed, such as mode, i_mtime, i_atime, i_ctime and so on. That's only the case in the the unlink->inode free->create-> inode allocation path, assuming that is what the system actually tripped over. However, we can hit the reinit code from a simple path lookup immediately after memory reclaim freed the dentry and inode and it is still in the XFS inode cache. i.e. ->destroy_inode() -> XFS_IRECLAIMABLE -> ->lookup() -> xfs_iget() -> xfs_iget_recycle(). i.e. the inode reinit doesn't only get triggered by unlink/alloc cycles, so we often reinit to the exact same inode state as before the inode was evicted from memory. Essentially, it is not clear to me how your system tripped over this issue; it *may* be an inode cache recycling issue, but I can also point to other situations that could result in a very similar crash signature. What I'm looking for is real evidence that it was a recycling issue that lead to this problem, and evidence that it can still occur on a current TOT kernel. A method for reproducing the issue your kernels are seeing would be nice. FWIW, reproducing on a current TOT kernel is important - even if you're seeing the unlink/alloc/reinit case on your 5.4 kernel, this path had a major architectural change in 5.14 and AFAICT that largely invalidates all the previous analysis of this inode reinit behaviour. In 5.14 we moved the inode freeing code we used to do in evict() into a background thread, hence the "evict, unlink, create, reinit" process now has an enforced context switch and delay between ->destroy_inode() and the internal inode unlink/freeing code. By decoupling the unlink processing from the calling task context, the task context can no longer immediately reallocate the same physical inode, and so the mechanism that lead to applications being able to directly trigger the xfs_inode_reinit() code for inodes that are changing identity repeatedly in certain situations no longer exists. The delay in unlink processing also affects how RCU grace periods expire between unlink and allocation/reinit, so assumptions made on that side of the analysis are also suspect and need to be re-examined. Hence before we spend any more time chasing ghosts, I'd really like to see hard evidence for what caused the crash you reported and a demonstration of it occuring on current TOT kernels, too. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-05-28 15:51 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-12-05 11:38 About the conflict between XFS inode recycle and VFS rcu-walk alexjlzheng 2023-12-08 0:14 ` Dave Chinner 2024-01-31 6:35 ` Jinliang Zheng 2024-01-31 19:30 ` Darrick J. Wong 2024-05-15 15:54 ` alexjlzheng 2024-05-16 4:56 ` Jinliang Zheng 2024-05-16 7:08 ` Ian Kent 2024-05-16 7:23 ` Ian Kent 2024-05-20 17:36 ` Darrick J. Wong 2024-05-21 1:35 ` Ian Kent 2024-05-21 2:13 ` Ian Kent 2024-05-26 15:04 ` Jinliang Zheng 2024-05-26 17:21 ` Paul E. McKenney 2024-05-26 23:51 ` Ian Kent 2024-05-27 0:18 ` Al Viro 2024-05-28 15:51 ` Brian Foster 2024-05-27 9:41 ` Dave Chinner 2024-05-27 13:56 ` Jinliang Zheng 2024-05-28 2:10 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).