linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jinliang Zheng <alexjlzheng@gmail.com>
To: david@fromorbit.com
Cc: alexjlzheng@gmail.com, alexjlzheng@tencent.com,
	bfoster@redhat.com, djwong@kernel.org,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	raven@themaw.net, rcu@vger.kernel.org
Subject: Re: About the conflict between XFS inode recycle and VFS rcu-walk
Date: Mon, 27 May 2024 21:56:15 +0800	[thread overview]
Message-ID: <20240527135615.2633248-1-alexjlzheng@tencent.com> (raw)
In-Reply-To: <ZlRVPv0EGIu5q7l9@dread.disaster.area>

On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote:
> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> > On 16/5/24 15:08, Ian Kent wrote:
> > > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > > I encountered the following calltrace:
> > > > > 
> > > > > [20213.578756] BUG: kernel NULL pointer dereference, address:
> > > > > 0000000000000000
> > > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > > [20213.578884] RIP: 0010:0x0
> > > > > [20213.578894] Code: Bad RIP value.
> > > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > > > > 0000000000000000
> > > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > > > > 0000000000000000
> > > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > > > > ffff889b9eeae380
> > > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > > > > 0000000000000000
> > > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > > > > ffffc90021ebfd48
> > > > > [20213.578998] FS:  00007f89c534e740(0000)
> > > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > > [20213.579016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > > > > 00000000007706e0
> > > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > > 0000000000000000
> > > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > > 0000000000000400
> > > > > [20213.579079] PKRU: 55555554
> > > > > [20213.579087] Call Trace:
> > > > > [20213.579099]  trailing_symlink+0x1da/0x260
> > > > > [20213.579112]  path_lookupat.isra.53+0x79/0x220
> > > > > [20213.579125]  filename_lookup.part.69+0xa0/0x170
> > > > > [20213.579138]  ? kmem_cache_alloc+0x3f/0x3f0
> > > > > [20213.579151]  ? getname_flags+0x4f/0x1e0
> > > > > [20213.579161]  user_path_at_empty+0x3e/0x50
> > > > > [20213.579172]  vfs_statx+0x76/0xe0
> > > > > [20213.579182]  __do_sys_newstat+0x3d/0x70
> > > > > [20213.579194]  ? fput+0x13/0x20
> > > > > [20213.579203]  ? ksys_ioctl+0xb0/0x300
> > > > > [20213.579213]  ? generic_file_llseek+0x24/0x30
> > > > > [20213.579225]  ? fput+0x13/0x20
> > > > > [20213.579233]  ? ksys_lseek+0x8d/0xb0
> > > > > [20213.579243]  __x64_sys_newstat+0x16/0x20
> > > > > [20213.579256]  do_syscall_64+0x4d/0x140
> > > > > [20213.579268]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > > > 
> > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > > > 
> > > > Please note that the kernel version I use is the one maintained by
> > > > Tencent.Inc,
> > > > and the baseline is v5.4. But in fact, in the latest upstream source
> > > > tree,
> > > > although the trailing_symlink() function has been removed, its logic
> > > > has been
> > > > moved to pick_link(), so the problem still exists.
> > > > 
> > > > Ian Kent pointed out that try_to_unlazy() was introduced in
> > > > pick_link() in the
> > > > latest upstream source tree, but I don't understand why this can
> > > > solve the NULL
> > > > ->get_link pointer dereference problem, because ->get_link pointer
> > > > will be
> > > > dereferenced before try_to_unlazy().
> > > > 
> > > > (I don't understand why Ian Kent's email didn't appear on the
> > > > mailing list.)
> > > 
> > > It was something about html mail and I think my mail client was at fault.
> > > 
> > > In any case what you say is indeed correct, so the comment isn't
> > > important.
> > > 
> > > 
> > > Fact is it is still a race between the lockless path walk and inode
> > > eviction
> > > 
> > > and xfs recycling. I believe that the xfs recycling code is very hard to
> > > fix.
> 
> Not really for this case. This is simply concurrent pathwalk lookups
> occurring just after the inode has been evicted from the VFS inode
> cache. The first lookup hits the XFS inode cache, sees
> XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
> reinstantiate the VFS inode to an initial state. The Xfs inode
> itself is still valid as it hasn't reached the XFS_IRECLAIM state
> where it will be torn down and freed.
> 
> Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has
> been run and that it is trying to call ->get_link on that same
> inode. Unfortunately, the first lookup has just set inode->f_ops =
> &empty_fops as part of the VFS inode reinit, and that then triggers
> the null pointer deref.

The RCU pathwalk must occur before xfs_reinit_inode(), and must obtain the
target inode before xfs_reinit_inode(). Because the target inode of
xfs_reinit_inode() must NOT be associated with any dentry, which is necessary
conditions for iput() -> iput_final() -> evict(), and the RCU pathwalk cannot
obtain any inode without a dentry.

> 
> Once the first lookup has finished the inode_init_always(),
> xfs_reinit_inode() resets inode->f_ops back to 
> xfs_symlink_file_ops and get_link calls work again.
> 
> Fundamentally, the problem is that we are completely reinitialising
> the VFS inode within the RCU grace period. i.e. while concurrent RCU
> pathwalks can still be in progress and find the VFS inode whilst the
> XFS inode cache is manipulating it.
> 
> What we should be doing here is a subset of inode_init_always(),
> which only reinitialises the bits of the VFS inode we need to
> initialise rather than the entire inode. The identity of the inode
> is not changing and so we don't need to go through a transient state
> where the VFS inode goes xfs symlink -> empty initialised inode ->
> xfs symlink.

Sorry, I think this question is wrong in more ways than just inode_operations.
After the target inode has been reinited by xfs_reinit_inode(), its semantics
is no longer the inode required by RCU walkpath. The meanings of many fields
have changed, such as mode, i_mtime, i_atime, i_ctime and so on.

> 
> i.e. We need to re-initialise the non-identity related parts of the
> VFS inode so the identity parts that the RCU pathwalks rely on never
> change within the RCU grace period where lookups can find the VFS
> inode after it has been evicted.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2024-05-27 13:56 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-05 11:38 About the conflict between XFS inode recycle and VFS rcu-walk alexjlzheng
2023-12-08  0:14 ` Dave Chinner
2024-01-31  6:35   ` Jinliang Zheng
2024-01-31 19:30     ` Darrick J. Wong
2024-05-15 15:54       ` alexjlzheng
2024-05-16  4:56         ` Jinliang Zheng
2024-05-16  7:08           ` Ian Kent
2024-05-16  7:23             ` Ian Kent
2024-05-20 17:36               ` Darrick J. Wong
2024-05-21  1:35                 ` Ian Kent
2024-05-21  2:13                   ` Ian Kent
2024-05-26 15:04                     ` Jinliang Zheng
2024-05-26 17:21                       ` Paul E. McKenney
2024-05-26 23:51                     ` Ian Kent
2024-05-27  0:18                       ` Al Viro
2024-05-28 15:51                         ` Brian Foster
2024-05-27  9:41               ` Dave Chinner
2024-05-27 13:56                 ` Jinliang Zheng [this message]
2024-05-28  2:10                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240527135615.2633248-1-alexjlzheng@tencent.com \
    --to=alexjlzheng@gmail.com \
    --cc=alexjlzheng@tencent.com \
    --cc=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=raven@themaw.net \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).