Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, Chandan Babu R <chandan.babu@oracle.com>
Subject: Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
Date: Mon, 3 Jan 2022 17:19:00 -0800	[thread overview]
Message-ID: <20220104011900.GB31583@magnolia> (raw)
In-Reply-To: <20211221051004.GC945095@dread.disaster.area>

On Tue, Dec 21, 2021 at 04:10:04PM +1100, Dave Chinner wrote:
> On Mon, Dec 20, 2021 at 05:08:55PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 17, 2021 at 10:59:33AM -0800, Darrick J. Wong wrote:
> > > On Thu, Dec 16, 2021 at 03:56:09PM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > I was poking around in the directory code while diagnosing online fsck
> > > > > bugs, and noticed that xfs_readdir doesn't actually take the directory
> > > > > ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> > > > > the data fork mappings
> > > > 
> > > > Yup, that is pretty much guaranteed. If the inode is extent or btree form as the
> > > > extent count will be non-zero, hence we can only get to the
> > > > xfs_dir2_isblock() check if the inode has moved from local to block
> > > > form between the open and xfs_dir2_isblock() get in the getdents
> > > > code.
> > > > 
> > > > > and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> > > > > we're protected against writer threads, but we really need to follow the
> > > > > locking model like we do in other places.  The same applies to the
> > > > > shortform getdents function.
> > > > 
> > > > Locking rules should be the same as xfs_dir_lookup().....
> ....
> > > > Yup, I know, VFS holds i_rwsem, so directory can't be modified while
> > > > xfs_readdir() is running, but if you are going to make one of these
> > > > functions have to take the ILOCK, then they all need to. See
> > > > xfs_dir_lookup()....
> > > 
> > > Hmm.  I thought (and Chandan asked in passing) that the reason that we
> > > keep cycling the directory ILOCK in the block/leaf getdents functions is
> > > because the VFS ->actor functions (aka filldir) directly copy dirents to
> > > userspace and we could trigger a page fault.  The page fault could
> > > trigger memory reclaim, which could in turn route us to writeback with
> > > that ILOCK still held.
> > > 
> > > Though, thinking about this further, the directory we have ILOCKed
> > > doesn't itself use the page cache, so writeback will never touch it.
> > > So I /think/ it's ok to grab the xfs_ilock_data_map_shared once in
> > > xfs_readdir and hold it all the way to the end of the function?
> > > 
> > > Or at least I tried it and lockdep didn't complain immediately... :P
> > 
> > But lockdep does complain now:
> > 
> >  ======================================================
> >  WARNING: possible circular locking dependency detected
> >  5.16.0-rc6-xfsx #rc6 Not tainted
> >  ------------------------------------------------------
> >  xfs_scrub/8151 is trying to acquire lock:
> >  ffff888040abcbe8 (&mm->mmap_lock#2){++++}-{4:4}, at: do_user_addr_fault+0x386/0x600
> >  
> >  but task is already holding lock:
> >  ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
> >  
> >  which lock already depends on the new lock.
> >  
> >  
> >  the existing dependency chain (in reverse order) is:
> >  
> >  -> #2 (&xfs_dir_ilock_class){++++}-{4:4}:
> >         down_write_nested+0x41/0x80
> >         xfs_ilock+0xc9/0x270 [xfs]
> >         xfs_rename+0x559/0xb80 [xfs]
> >         xfs_vn_rename+0xdb/0x150 [xfs]
> >         vfs_rename+0x775/0xa70
> >         do_renameat2+0x355/0x510
> >         __x64_sys_renameat2+0x4b/0x60
> >         do_syscall_64+0x35/0x80
> >         entry_SYSCALL_64_after_hwframe+0x44/0xae
> >  
> >  -> #1 (sb_internal){.+.+}-{0:0}:
> >         xfs_trans_alloc+0x1a8/0x3e0 [xfs]
> >         xfs_vn_update_time+0xca/0x2a0 [xfs]
> >         touch_atime+0x17d/0x2b0
> >         xfs_file_mmap+0xa7/0xb0 [xfs]
> >         mmap_region+0x3d8/0x600
> >         do_mmap+0x337/0x4f0
> >         vm_mmap_pgoff+0xa6/0x150
> >         ksys_mmap_pgoff+0x16f/0x1c0
> >         do_syscall_64+0x35/0x80
> >         entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> IDGI. That's a mmap() syscall, not a page fault. You can't mmap() a
> directory inode, so this has to be a regular file inode...
> 
> >  -> #0 (&mm->mmap_lock#2){++++}-{4:4}:
> >         __lock_acquire+0x116a/0x1eb0
> >         lock_acquire+0xc9/0x2f0
> >         down_read+0x3e/0x50
> >         do_user_addr_fault+0x386/0x600
> >         exc_page_fault+0x65/0x250
> >         asm_exc_page_fault+0x1b/0x20
> >         filldir64+0xb5/0x1b0
> >         xfs_dir2_sf_getdents+0x14e/0x370 [xfs]
> >         xfs_readdir+0x1fd/0x2b0 [xfs]
> >         iterate_dir+0x142/0x190
> >         __x64_sys_getdents64+0x7a/0x130
> >         do_syscall_64+0x35/0x80
> >         entry_SYSCALL_64_after_hwframe+0x44/0xae
> >  
> >  other info that might help us debug this:
> >  
> >  Chain exists of:
> >    &mm->mmap_lock#2 --> sb_internal --> &xfs_dir_ilock_class
> >  
> >   Possible unsafe locking scenario:
> >  
> >         CPU0                    CPU1
> >         ----                    ----
> >    lock(&xfs_dir_ilock_class);
> >                                 lock(sb_internal);
> >                                 lock(&xfs_dir_ilock_class);
> >    lock(&mm->mmap_lock#2);
> >  
> >   *** DEADLOCK ***
> >  
> >  3 locks held by xfs_scrub/8151:
> >   #0: ffff88800a8aeaf0 (&f->f_pos_lock){+.+.}-{4:4}, at: __fdget_pos+0x4a/0x60
> >   #1: ffff8880270b8a08 (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{4:4}, at: iterate_dir+0x3d/0x190
> >   #2: ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
> >  
> >  stack backtrace:
> >  CPU: 0 PID: 8151 Comm: xfs_scrub Not tainted 5.16.0-rc6-xfsx #rc6 574205e0343df89e2059bf7ee73cf2f2ec847f12
> >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> >  Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x45/0x59
> >   check_noncircular+0xf2/0x110
> >   __lock_acquire+0x116a/0x1eb0
> >   lock_acquire+0xc9/0x2f0
> >   ? do_user_addr_fault+0x386/0x600
> >   down_read+0x3e/0x50
> >   ? do_user_addr_fault+0x386/0x600
> >   do_user_addr_fault+0x386/0x600
> >   exc_page_fault+0x65/0x250
> >   asm_exc_page_fault+0x1b/0x20
> >  RIP: 0010:filldir64+0xb5/0x1b0
> >  Code: 01 c0 48 29 ca 48 98 48 01 d0 0f 82 9f 00 00 00 48 b9 00 f0 ff ff ff 7f 00 00 48 39 c8 0f 87 8c 00 00 00 0f ae e8 4c 89 6a 08 <4c> 89 36 66 44 89 46 10 44 88 7e 12 48 8d 46 13 48 63 d5 c6 44 16
> >  RSP: 0018:ffffc900041ebd38 EFLAGS: 00010283
> >  RAX: 00007f729c004020 RBX: ffff88804616a96b RCX: 00007ffffffff000
> >  RDX: 00007f729c003fe0 RSI: 00007f729c004000 RDI: ffff88804616a970
> >  RBP: 0000000000000005 R08: 0000000000000020 R09: 0000000000000000
> >  R10: 0000000000000002 R11: ffff88804616a96b R12: ffffc900041ebee0
> >  R13: 0000000000000022 R14: 0000000003009b6b R15: 0000000000000002
> >   ? filldir64+0x3b/0x1b0
> >   xfs_dir2_sf_getdents+0x14e/0x370 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
> >   xfs_readdir+0x1fd/0x2b0 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
> >   iterate_dir+0x142/0x190
> >   __x64_sys_getdents64+0x7a/0x130
> >   ? fillonedir+0x160/0x160
> >   do_syscall_64+0x35/0x80
> >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> >  RIP: 0033:0x7f72ab7d543b
> >  Code: 0f 1e fa 48 8b 47 20 c3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 81 fa ff ff ff 7f b8 ff ff ff 7f 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 21 9a 10 00 f7 d8
> >  RSP: 002b:00007f72a88d6a58 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
> >  RAX: ffffffffffffffda RBX: 00007f729c003f00 RCX: 00007f72ab7d543b
> >  RDX: 0000000000008000 RSI: 00007f729c003f00 RDI: 0000000000000006
> >  RBP: fffffffffffffe00 R08: 0000000000000030 R09: 00007f729c000780
> >  R10: 00007f729c003c40 R11: 0000000000000293 R12: 00007f729c003ed4
> >  R13: 0000000000000000 R14: 00007f729c003ed0 R15: 00007f72a0003e10
> >   </TASK>
> > 
> > IOWs, we have to drop the ILOCK when calling dir_emit because:
> > 
> > 1. Rename takes sb_internal (xfs_trans_alloc) and then a directory ILOCK;
> > 2. A pagefault can take the MMAPLOCK and then sb_internal to update the
> >    file mtime;
> 
> Ok, let's assume that the lockdep report is actually a page fault
> rather than a completely independent mmap() syscall.
> 
> A page fault on a mmap()d data buffer that won't get this far - if
> there is a freeze in progress it will get stuck on on SB_PAGEFAULT
> in __xfs_filemap_fault() before it updates the mtime.
> 
> Hence, AFAICT, if we have an inode stuck there waiting for a freeze
> to make progress in a page fault, it means the readdir holds the
> directory i_rwsem (IOLOCK) in read mode,
> 
> If this is the case, then the rename() syscall cannot get past
> vfs_rename->lock_rename() as that will block trying to get the
> directory i_rwsem in write mode that the readdir already holds.
> 
> ANd if we have the opposite, where we are in xfs_rename() waiting
> for XFS_ILOCK_EXCL on the directory inodes, it means that the VFS is
> holding the i_rwsem in write mode on the directory and hence readdir
> gets locked out.
> 
> i.e. the vfs level i_rwsem locking prevents xfs_readdir() and
> xfs_rename() being called on the same directory are the same time
> and so the nested page fault recursion scenario indicated here does
> not seem possible.

Hmm, I think I agree that we can't have sb_internal->dir_ilock at the
same time as dir_ilock->mmap_lock->sb_internal because the first will
want IOLOCK_EXCL and the second will want IOLOCK_SHARED.  IOWs, I agree
that it's a false positive, and I couldn't find anywhere in the VFS that
allows mixed read and write access to a directory.  The directory mod
operations (link/unlink/create/rename) all seem to inode_lock(), as does
setattr and the xattr functions.  The only code paths that I could find
that use inode_lock_shared are readdir and certain lookups.

(Why didn't lockdep show i_rwsem as part of this chain?)

However, I also think this breaks my mental model of lock acquisition
order in XFS.

The chain, as presented, is this:

> >    &mm->mmap_lock#2 --> sb_internal --> &xfs_dir_ilock_class

But let me rearrange that to cover what I think can happen in the
readdir case with this patch applied:

IOLOCK_SHARED         // directory
ILOCK_SHARED          // also directory

mmap_lock             // file
sb_pagefault          // __xfs_filemap_fault
MMAPLOCK_SHARED       // file
sb_intwrite           // xfs_iomap_write_direct
ILOCK_EXCL            // file

Up until now, I've always written code to acquire resources in the order
IOLOCK -> MMAPLOCK -> transaction -> ILOCK, and I think everyone else
has done that too.  Allowing this new readdir behavior is awkward
because now there /is/ a place where we can try to allocate a
transaction while holding an ILOCK.

Call me a bureaucrat, but I'd rather keep the bright line rule that we
don't take sb_internal with an ILOCK held than deal with more locking
model nuance. :)

--D

> > 3. Now we've made readdir take the directory ILOCK and do something that
> >    can cause a userspace pagefault.
> 
> Yup, and while that fault on the regular file is being handled,
> the rename cannot get past the VFS because of the the directory
> IOLOCK/i_rwsem is held...
> 
> > So with that in mind, can I get a re-review of the original patch?  I'll
> > add the above to the commit message as a justification for why we can't
> > just move the ilock/iunlock calls.
> 
> On the above, I think it's a false positive. Can you go through the
> analysis again and check that I haven't missed a case where the VFS
> allows concurrent read and write access to a single directory?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

next prev parent reply	other threads:[~2022-01-04  1:19 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
2021-12-16  1:09 ` [PATCH 1/7] xfs: take the ILOCK when accessing the inode core Darrick J. Wong
2021-12-16  4:56   ` Dave Chinner
2021-12-17 18:59     ` Darrick J. Wong
2021-12-21  1:08       ` Darrick J. Wong
2021-12-21  5:10         ` Dave Chinner
2022-01-04  1:19           ` Darrick J. Wong [this message]
2022-01-05  0:09     ` Dave Chinner
2022-01-05  1:38       ` Darrick J. Wong
2021-12-16  1:09 ` [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
2021-12-16  4:57   ` Dave Chinner
2021-12-24  7:16   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check Darrick J. Wong
2021-12-16  5:05   ` Dave Chinner
2021-12-16 19:25     ` Darrick J. Wong
2021-12-16 21:17       ` Dave Chinner
2021-12-16 21:40         ` Darrick J. Wong
2021-12-16 22:04   ` Dave Chinner
2021-12-24  7:17   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt Darrick J. Wong
2021-12-16  4:36   ` Dave Chinner
2021-12-16 16:35     ` Darrick J. Wong
2021-12-16  1:09 ` [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it Darrick J. Wong
2021-12-16  5:07   ` Dave Chinner
2021-12-24  7:17   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs Darrick J. Wong
2021-12-16  5:11   ` Dave Chinner
2021-12-17  2:58     ` Ian Kent
2021-12-24  7:22   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
2021-12-16  4:41   ` Dave Chinner
2021-12-24  7:18   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220104011900.GB31583@magnolia \
    --to=djwong@kernel.org \
    --cc=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.