From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 20 May 2024 10:36:38 -0700
From: "Darrick J. Wong" <djwong@kernel.org>
To: Ian Kent
Cc: Jinliang Zheng, alexjlzheng@tencent.com, bfoster@redhat.com,
	david@fromorbit.com, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org, rcu@vger.kernel.org
Subject: Re: About the conflict between XFS inode recycle and VFS rcu-walk
Message-ID: <20240520173638.GB25518@frogsfrogsfrogs>
References: <20240515155441.2788093-1-alexjlzheng@tencent.com>
 <20240516045655.40122-1-alexjlzheng@tencent.com>
 <7f744bf5-5f6d-4031-8a4f-91be2cd45147@themaw.net>
 <3545f78c-5e1c-4328-8ab0-19227005f4b7@themaw.net>
X-Mailing-List: linux-xfs@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
In-Reply-To: <3545f78c-5e1c-4328-8ab0-19227005f4b7@themaw.net>

On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> 
> On 16/5/24 15:08, Ian Kent wrote:
> > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
> > > > On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
> > > > > On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
> > > > > > On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
> > > > > > > On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
> > > > > > > > Hi, all
> > > > > > > >
> > > > > > > > I would like to ask if the conflict between xfs inode recycle and
> > > > > > > > vfs rcu-walk which can lead to null pointer references has been
> > > > > > > > resolved?
> > > > > > > >
> > > > > > > > I browsed through emails about the following patches and their
> > > > > > > > discussions:
> > > > > > > > - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
> > > > > > > > - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
> > > > > > > > - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
> > > > > > > >
> > > > > > > > And then came to the conclusion that this problem has not been
> > > > > > > > solved, am I right? Did I miss some patch that could solve this
> > > > > > > > problem?
> > > > > > >
> > > > > > > We fixed the known problems this caused by turning off the VFS
> > > > > > > functionality that the rcu pathwalks kept tripping over. See commit
> > > > > > > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers
> > > > > > > to the vfs").
> > > > > >
> > > > > > Sorry for the delay.
> > > > > >
> > > > > > The problem I encountered in the production environment was that
> > > > > > during the rcu walk process the ->get_link() pointer was NULL, which
> > > > > > caused a crash.
> > > > > >
> > > > > > As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal
> > > > > > symlink metadata buffers to the vfs") first appeared in:
> > > > > > - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
> > > > > >
> > > > > > Does this commit solve the problem of NULL ->get_link()? And how?
> > > > >
> > > > > I suggest reading the call stack from wherever the VFS enters the XFS
> > > > > readlink code.  If you have a reliable reproducer, then apply this patch
> > > > > to your kernel (you haven't mentioned which one it is) and see if the
> > > > > bad dereference goes away.
> > > > >
> > > > > --D
> > > >
> > > > Sorry for the delay.
> > > >
> > > > I encountered the following calltrace:
> > > >
> > > > [20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > [20213.578884] RIP: 0010:0x0
> > > > [20213.578894] Code: Bad RIP value.
> > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000
> > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000
> > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380
> > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000
> > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48
> > > > [20213.578998] FS:  00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > [20213.579016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0
> > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > [20213.579079] PKRU: 55555554
> > > > [20213.579087] Call Trace:
> > > > [20213.579099]  trailing_symlink+0x1da/0x260
> > > > [20213.579112]  path_lookupat.isra.53+0x79/0x220
> > > > [20213.579125]  filename_lookup.part.69+0xa0/0x170
> > > > [20213.579138]  ? kmem_cache_alloc+0x3f/0x3f0
> > > > [20213.579151]  ? getname_flags+0x4f/0x1e0
> > > > [20213.579161]  user_path_at_empty+0x3e/0x50
> > > > [20213.579172]  vfs_statx+0x76/0xe0
> > > > [20213.579182]  __do_sys_newstat+0x3d/0x70
> > > > [20213.579194]  ? fput+0x13/0x20
> > > > [20213.579203]  ? ksys_ioctl+0xb0/0x300
> > > > [20213.579213]  ? generic_file_llseek+0x24/0x30
> > > > [20213.579225]  ? fput+0x13/0x20
> > > > [20213.579233]  ? ksys_lseek+0x8d/0xb0
> > > > [20213.579243]  __x64_sys_newstat+0x16/0x20
> > > > [20213.579256]  do_syscall_64+0x4d/0x140
> > > > [20213.579268]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > >
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > >
> > > Please note that the kernel version I use is the one maintained by
> > > Tencent Inc., and the baseline is v5.4. But in fact, in the latest
> > > upstream source tree, although the trailing_symlink() function has been
> > > removed, its logic has been moved to pick_link(), so the problem still
> > > exists.
> > >
> > > Ian Kent pointed out that try_to_unlazy() was introduced in pick_link()
> > > in the latest upstream source tree, but I don't understand why this can
> > > solve the NULL ->get_link pointer dereference problem, because the
> > > ->get_link pointer will be dereferenced before try_to_unlazy().
> > >
> > > (I don't understand why Ian Kent's email didn't appear on the mailing
> > > list.)
> >
> > It was something about html mail and I think my mail client was at fault.
> >
> > In any case what you say is indeed correct, so the comment isn't
> > important.
> >
> > Fact is it is still a race between the lockless path walk and inode
> > eviction and xfs recycling. I believe that the xfs recycling code is
> > very hard to fix.
> >
> > IIRC correctly putting a NULL check in pick_link() was not considered
> > acceptable but there must be a way that is acceptable to check this and
> > restart the walk.
> >
> > Maybe there was a reluctance to suffer the overhead of restarting the
> > walk when it shouldn't be needed.
> 
> Or perhaps the worry was that if it can become NULL it could also become
> a pointer to a different (incorrect) link altogether which could have
> really odd/unpleasant outcomes.

Yuck.  I think that means that we can't reallocate freed inodes until
the rcu grace period expires.  For inodes that haven't been evicted, I
think that also means we cannot recycle cached inodes until after an rcu
grace period expires; or maybe that we cannot reset i_op/i_fop and must
not leave the incore state in an inconsistent format?

--D

> > The alternative would be to find some way to identify when it's unsafe
> > to reuse an inode marked for re-cycle before dropping rcu read, perhaps
> > with the reference count plus the seqlock. Basically, to reuse inodes
> > xfs will need to identify when the race occurs and let the inode go
> > away under rcu and create a new one if a race is detected. But possibly
> > that isn't nearly as simple as it sounds?
> > >
> > > Thanks,
> > > Jinliang Zheng
> > >
> > > > And I analyzed the disassembly of trailing_symlink() and confirmed
> > > > that a NULL ->get_link() happened here:
> > > >
> > > > 0xffffffff812e4850:	nopl 0x0(%rax,%rax,1)	[FTRACE NOP]
> > > > 0xffffffff812e4855:	push %rbp
> > > > 0xffffffff812e4856:	mov %rsp,%rbp
> > > > 0xffffffff812e4859:	push %r15
> > > > 0xffffffff812e485b:	push %r14
> > > > 0xffffffff812e485d:	push %r13
> > > > 0xffffffff812e485f:	push %r12
> > > > 0xffffffff812e4861:	push %rbx
> > > > 0xffffffff812e4862:	mov %rdi,%rbx		# rbx = &nameidata
> > > > 0xffffffff812e4865:	sub $0x8,%rsp
> > > > 0xffffffff812e4869:	mov 0x1765845(%rip),%edx	# 0xffffffff82a4a0b4
> > > > 0xffffffff812e486f:	mov 0x38(%rdi),%eax
> > > > 0xffffffff812e4872:	test %edx,%edx
> > > > 0xffffffff812e4874:	je 0xffffffff812e48ac
> > > > 0xffffffff812e4876:	mov %gs:0x1ad00,%rdx
> > > > 0xffffffff812e487f:	mov 0xc8(%rdi),%rcx	# rcx = nameidata->link_inode
> > > > 0xffffffff812e4886:	mov 0xc18(%rdx),%rdx
> > > > 0xffffffff812e488d:	mov 0x4(%rcx),%ecx	# ecx = link_inode->uid
> > > > 0xffffffff812e4890:	cmp %ecx,0x1c(%rdx)
> > > > 0xffffffff812e4893:	je 0xffffffff812e48ac
> > > > 0xffffffff812e4895:	mov 0x30(%rdi),%rsi
> > > > 0xffffffff812e4899:	movzwl (%rsi),%edx
> > > > 0xffffffff812e489c:	and $0x202,%dx
> > > > 0xffffffff812e48a1:	cmp $0x202,%dx
> > > > 0xffffffff812e48a6:	je 0xffffffff812e495f
> > > > 0xffffffff812e48ac:	or $0x10,%eax
> > > > 0xffffffff812e48af:	mov %eax,0x38(%rbx)	# nd->flags |= LOOKUP_PARENT
> > > > 0xffffffff812e48b2:	mov 0x50(%rbx),%rax	# rax = nd->stack
> > > > 0xffffffff812e48b6:	movq $0x0,0x20(%rax)	# stack[0].name = NULL
> > > > 0xffffffff812e48be:	mov 0x48(%rbx),%eax	# nd->depth
> > > > 0xffffffff812e48c1:	mov 0x50(%rbx),%rdx	# nd->stack
> > > > 0xffffffff812e48c5:	mov 0xc8(%rbx),%r13	# nd->link_inode
> > > > 0xffffffff812e48cc:	lea (%rax,%rax,2),%rax	# rax = depth * 3
> > > > 0xffffffff812e48d0:	shl $0x4,%rax		# rax = rax << 4, sizeof(saved):0x30
> > > > 0xffffffff812e48d4:	lea -0x30(%rdx,%rax,1),%r15	# r15 = last
> > > > 0xffffffff812e48d9:	mov 0x8(%r15),%r14	# r14 = last->link.dentry
> > > > 0xffffffff812e48dd:	testb $0x40,0x38(%rbx)
> > > > 0xffffffff812e48e1:	je 0xffffffff812e4950
> > > > 0xffffffff812e48e3:	mov %r13,%rsi
> > > > 0xffffffff812e48e6:	mov %r15,%rdi
> > > > 0xffffffff812e48e9:	callq 0xffffffff812f8a00
> > > > 0xffffffff812e48ee:	test %al,%al
> > > > 0xffffffff812e48f0:	jne 0xffffffff812e4a56
> > > > 0xffffffff812e48f6:	mov 0x38(%rbx),%edx
> > > > 0xffffffff812e48f9:	mov %r13,%rsi
> > > > 0xffffffff812e48fc:	mov %r14,%rdi
> > > > 0xffffffff812e48ff:	shr $0x6,%edx
> > > > 0xffffffff812e4902:	and $0x1,%edx
> > > > 0xffffffff812e4905:	callq 0xffffffff81424310
> > > > 0xffffffff812e490a:	movslq %eax,%r12
> > > > 0xffffffff812e490d:	test %eax,%eax
> > > > 0xffffffff812e490f:	jne 0xffffffff812e4939
> > > > 0xffffffff812e4911:	movl $0x4,0x44(%rbx)
> > > > 0xffffffff812e4918:	mov 0x248(%r13),%r12
> > > > 0xffffffff812e491f:	test %r12,%r12
> > > > 0xffffffff812e4922:	je 0xffffffff812e49e5
> > > > 0xffffffff812e4928:	movzbl (%r12),%eax
> > > > 0xffffffff812e492d:	cmp $0x2f,%al
> > > > 0xffffffff812e492f:	je 0xffffffff812e49b7
> > > > 0xffffffff812e4935:	test %al,%al
> > > > 0xffffffff812e4937:	je 0xffffffff812e49ae
> > > > 0xffffffff812e4939:	test %r12,%r12
> > > > 0xffffffff812e493c:	je 0xffffffff812e49ae
> > > > 0xffffffff812e493e:	add $0x8,%rsp
> > > > 0xffffffff812e4942:	mov %r12,%rax
> > > > 0xffffffff812e4945:	pop %rbx
> > > > 0xffffffff812e4946:	pop %r12
> > > > 0xffffffff812e4948:	pop %r13
> > > > 0xffffffff812e494a:	pop %r14
> > > > 0xffffffff812e494c:	pop %r15
> > > > 0xffffffff812e494e:	pop %rbp
> > > > 0xffffffff812e494f:	retq
> > > > 0xffffffff812e4950:	mov %r15,%rdi
> > > > 0xffffffff812e4953:	callq 0xffffffff812f8ae0
> > > > 0xffffffff812e4958:	callq 0xffffffff81a26410 <_cond_resched>
> > > > 0xffffffff812e495d:	jmp 0xffffffff812e48f6
> > > > 0xffffffff812e495f:	mov 0x4(%rsi),%edx
> > > > 0xffffffff812e4962:	cmp $0xffffffff,%edx
> > > > 0xffffffff812e4965:	je 0xffffffff812e496f
> > > > 0xffffffff812e4967:	cmp %edx,%ecx
> > > > 0xffffffff812e4969:	je 0xffffffff812e48ac
> > > > 0xffffffff812e496f:	mov $0xfffffffffffffff6,%r12
> > > > 0xffffffff812e4976:	test $0x40,%al
> > > > 0xffffffff812e4978:	jne 0xffffffff812e493e
> > > > 0xffffffff812e497a:	mov %gs:0x1ad00,%rax
> > > > 0xffffffff812e4983:	mov 0xce0(%rax),%rax
> > > > 0xffffffff812e498a:	test %rax,%rax
> > > > 0xffffffff812e498d:	je 0xffffffff812e4999
> > > > 0xffffffff812e498f:	mov (%rax),%eax
> > > > 0xffffffff812e4991:	test %eax,%eax
> > > > 0xffffffff812e4993:	je 0xffffffff812e4a6f
> > > > 0xffffffff812e4999:	mov $0xffffffff82319b4f,%rdi
> > > > 0xffffffff812e49a0:	mov $0xfffffffffffffff3,%r12
> > > > 0xffffffff812e49a7:	callq 0xffffffff81161310
> > > > 0xffffffff812e49ac:	jmp 0xffffffff812e493e
> > > > 0xffffffff812e49ae:	mov $0xffffffff8230164d,%r12
> > > > 0xffffffff812e49b5:	jmp 0xffffffff812e493e
> > > > 0xffffffff812e49b7:	cmpq $0x0,0x20(%rbx)
> > > > 0xffffffff812e49bc:	je 0xffffffff812e4a8a
> > > > 0xffffffff812e49c2:	mov %rbx,%rdi
> > > > 0xffffffff812e49c5:	callq 0xffffffff812e2da0
> > > > 0xffffffff812e49ca:	test %eax,%eax
> > > > 0xffffffff812e49cc:	jne 0xffffffff812e4a97
> > > > 0xffffffff812e49d2:	add $0x1,%r12
> > > > 0xffffffff812e49d6:	movzbl (%r12),%eax
> > > > 0xffffffff812e49db:	cmp $0x2f,%al
> > > > 0xffffffff812e49dd:	jne 0xffffffff812e4935
> > > > 0xffffffff812e49e3:	jmp 0xffffffff812e49d2
> > > > 0xffffffff812e49e5:	mov 0x20(%r13),%rax	# inode->i_op
> > > > 0xffffffff812e49e9:	add $0x10,%r15
> > > > 0xffffffff812e49ed:	mov %r13,%rsi
> > > > 0xffffffff812e49f0:	mov %r15,%rdx
> > > > 0xffffffff812e49f3:	mov 0x8(%rax),%rcx	# inode_operations->get_link
> > > > 0xffffffff812e49f7:	testb $0x40,0x38(%rbx)
> > > > 0xffffffff812e49fb:	jne 0xffffffff812e4a1f
> > > > 0xffffffff812e49fd:	mov %r14,%rdi		# nd->flags & LOOKUP_RCU == 0
> > > > 0xffffffff812e4a00:	callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>	# jmpq *%rcx
> > > > 0xffffffff812e4a05:	mov %rax,%r12
> > > > 0xffffffff812e4a08:	test %r12,%r12
> > > > 0xffffffff812e4a0b:	je 0xffffffff812e49ae
> > > > 0xffffffff812e4a0d:	cmp $0xfffffffffffff000,%r12
> > > > 0xffffffff812e4a14:	jbe 0xffffffff812e4928
> > > > 0xffffffff812e4a1a:	jmpq 0xffffffff812e493e
> > > > 0xffffffff812e4a1f:	xor %edi,%edi		# nd->flags & LOOKUP_RCU != 0
> > > > 0xffffffff812e4a21:	mov %rcx,-0x30(%rbp)
> > > > 0xffffffff812e4a25:	callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>	# jmpq *%rcx
> > > > 0xffffffff812e4a2a:	mov %rax,%r12
> > > > 0xffffffff812e4a2d:	cmp $0xfffffffffffffff6,%rax
> > > > 0xffffffff812e4a31:	jne 0xffffffff812e4a08
> > > > 0xffffffff812e4a33:	mov %rbx,%rdi
> > > > 0xffffffff812e4a36:	callq 0xffffffff812e3840
> > > > 0xffffffff812e4a3b:	test %eax,%eax
> > > > 0xffffffff812e4a3d:	jne 0xffffffff812e4a97
> > > > 0xffffffff812e4a3f:	mov %r15,%rdx
> > > > 0xffffffff812e4a42:	mov %r13,%rsi
> > > > 0xffffffff812e4a45:	mov %r14,%rdi
> > > > 0xffffffff812e4a48:	mov -0x30(%rbp),%rcx
> > > > 0xffffffff812e4a4c:	callq 0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
> > > > 0xffffffff812e4a51:	mov %rax,%r12
> > > > 0xffffffff812e4a54:	jmp 0xffffffff812e4a08
> > > > 0xffffffff812e4a56:	mov %rbx,%rdi
> > > > 0xffffffff812e4a59:	callq 0xffffffff812e3840
> > > > 0xffffffff812e4a5e:	test %eax,%eax
> > > > 0xffffffff812e4a60:	jne 0xffffffff812e4a97
> > > > 0xffffffff812e4a62:	mov %r15,%rdi
> > > > 0xffffffff812e4a65:	callq 0xffffffff812f8ae0
> > > > 0xffffffff812e4a6a:	jmpq 0xffffffff812e48f6
> > > > 0xffffffff812e4a6f:	mov 0x50(%rbx),%rax
> > > > 0xffffffff812e4a73:	mov 0xb8(%rbx),%rdi
> > > > 0xffffffff812e4a7a:	xor %edx,%edx
> > > > 0xffffffff812e4a7c:	mov 0x8(%rax),%rsi
> > > > 0xffffffff812e4a80:	callq 0xffffffff811673f0 <__audit_inode>
> > > > 0xffffffff812e4a85:	jmpq 0xffffffff812e4999
> > > > 0xffffffff812e4a8a:	mov %rbx,%rdi
> > > > 0xffffffff812e4a8d:	callq 0xffffffff812e4790
> > > > 0xffffffff812e4a92:	jmpq 0xffffffff812e49c2
> > > > 0xffffffff812e4a97:	mov $0xfffffffffffffff6,%r12
> > > > 0xffffffff812e4a9e:	jmpq 0xffffffff812e493e
> > > >
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > >
> > > > According to my understanding, the problem solved by commit
> > > > 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
> > > > the vfs") is a data NULL pointer dereference, but the problem here is
> > > > an instruction NULL pointer dereference.
> > > >
> > > > Further, I analyzed the possible triggering process as follows:
> > > >
> > > > rcu_walk                  do_unlinkat ~~> prune_dcache_sb           create
> > > > rcu_read_lock
> > > > read_seqcount_retry
> > > > (the last check)          iput_final
> > > >                             evict
> > > >                               destroy_inode
> > > >                                 xfs_fs_destroy_inode
> > > >                                   xfs_inode_set_reclaim_tag         xfs_ialloc
> > > >                                     spin_lock(ip->i_flags_lock)       xfs_dialloc
> > > >                                     set(ip, XFS_IRECLAIMABLE)         xfs_iget
> > > >                                     wakeup(xfs_reclaim_worker)          rcu_read_lock
> > > >                                     spin_unlock(ip->i_flags_lock)       xfs_iget_cache_hit
> > > >                                                                           spin_lock(ip->i_flags_lock)
> > > >                                                                           if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
> > > >                                                                             set(ip, XFS_IRECLAIM)
> > > >                                                                           spin_unlock(ip->i_flags_lock)
> > > >                                                                           rcu_read_unlock
> > > > <------------
> > > >                                                                           // miss synchronize_rcu()
> > > >                                                                           xfs_reinit_inode
> > > >                                                                             ->get_link = NULL
> > > > get_link() // NULL
> > > >
> > > > rcu_read_unlock
> > > >
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > >
> > > > Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose
> > > > internal symlink metadata buffers to the vfs"), we should start
> > > > processing this NULL ->get_link pointer dereference.
> > > >
> > > > Or, am I thinking wrong somewhere?
> > > >
> > > > Thanks,
> > > > Jinliang Zheng
> > > >
> > > > > > > Apart from that issue, I'm not aware of any other issues that the
> > > > > > > XFS inode recycling directly exposes.
> > > > > > > >
> > > > > > > > According to my understanding, the essence of this problem is that
> > > > > > > > XFS reuses the inode evicted by VFS, but VFS rcu-walk assumes that
> > > > > > > > this will not happen.
> > > > > > >
> > > > > > > It assumes that the inode will not change identity during the RCU
> > > > > > > grace period after the inode has been evicted from cache. We can
> > > > > > > safely reinstantiate an evicted inode without waiting for an RCU
> > > > > > > grace period as long as it is the same inode with the same content
> > > > > > > and same state.
> > > > > > >
> > > > > > > Problems *may* arise when we unlink the inode, then evict it, then a
> > > > > > > new file is created and the old slab cache memory address is used
> > > > > > > for the new inode. I describe the issue here:
> > > > > > >
> > > > > > > https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
> > > > > >
> > > > > > And judging from the relevant emails, the main reason why ->get_link()
> > > > > > is set to NULL should be the lack of synchronize_rcu() before
> > > > > > xfs_reinit_inode() when the inode is chosen to be reused.
> > > > > >
> > > > > > However, perhaps due to performance reasons, this solution has not
> > > > > > been merged for a long time. How is it now?
> > > > > >
> > > > > > Maybe I am missing something in the threads of mail?
> > > > > >
> > > > > > Thank you very much. :)
> > > > > > Jinliang Zheng
> > > > > >
> > > > > > > That said, we have exactly zero evidence that this is actually a
> > > > > > > problem in production systems. We did get systems tripping over the
> > > > > > > symlink issue, but there's no evidence that the
> > > > > > > unlink->close->open(O_CREAT) issues are manifesting in the wild and
> > > > > > > hence there hasn't been any particular urgency to address it.
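For what it's worth, the unmerged synchronize_rcu() approach described
above would amount to something like the following in the recycle path.
This is a pseudocode sketch, not a patch: xfs_iget_cache_hit() and
xfs_reinit_inode() are the function names from the discussion, but the
surrounding code is schematic, and a full grace-period wait on every
recycle is presumably exactly the performance cost that has kept it out:

```c
	/* pseudocode sketch of recycling in xfs_iget_cache_hit() */
	if (ip->i_flags & XFS_IRECLAIMABLE) {
		ip->i_flags |= XFS_IRECLAIM;
		spin_unlock(&ip->i_flags_lock);
		rcu_read_unlock();

		/*
		 * Wait out any rcu-walk that may still hold a pointer to
		 * the old VFS inode before i_op/i_fop are reset, so a
		 * walker can never load a half-reinitialized op table.
		 */
		synchronize_rcu();

		error = xfs_reinit_inode(mp, VFS_I(ip));
		/* ... */
	}
```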
> > > > > > > >
> > > > > > > > Are there any recommended workarounds until an elegant and
> > > > > > > > efficient solution can be proposed? After all, causing a crash is
> > > > > > > > extremely unacceptable in a production environment.
> > > > > > >
> > > > > > > What crashes are you seeing in your production environment?
> > > > > > >
> > > > > > > -Dave.
> > > > > > > --
> > > > > > > Dave Chinner
> > > > > > > david@fromorbit.com