Date: Tue, 5 Jul 2022 11:37:54 -0700
From: "Darrick J. Wong"
To: Wengang Wang
Cc: linux-xfs@vger.kernel.org, david@fromorbit.com
Subject: Re: [PATCH V2] xfs: make src file readable during reflink
References: <20220629060755.25537-1-wen.gang.wang@oracle.com>
In-Reply-To: <20220629060755.25537-1-wen.gang.wang@oracle.com>
X-Mailing-List: linux-xfs@vger.kernel.org

On Tue, Jun 28, 2022 at 11:07:55PM -0700, Wengang Wang wrote:
> During a reflink operation, the IOLOCK and MMAPLOCK of the source file
> are held in exclusive mode for the duration. This prevents reads on the
> source file, which could be a very long time if the source file has
> millions of extents.
>
> As the source of copy, besides some necessary modification (say dirty page
> flushing), it plays readonly role. Locking source file exclusively through
> out the full reflink copy is unreasonable.
>
> This patch downgrades exclusive locks on source file to shared modes after
> page cache flushing and before cloning the extents. To avoid source file
> change after lock downgradation, direct write paths take IOLOCK_EXCL on
> seeing reflink copy happening to the files.

This is going to complicate the synchronization logic between reflink and
everything else quite a bit -- right now we generally allow multiple
concurrent direct writers (IOLOCK) and write faults (MMAPLOCK) per file,
so space mapping operations (fallocate/reflink) can lock out those
writers in a simple manner.
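[Editorial illustration, not part of the original message or the patch.]
The "shared writers, exclusive space-mapping operation" scheme described
above can be pictured with a toy, single-threaded model of a
reader/writer lock. Everything named model_* below is hypothetical and
exists only for this sketch; it does not block or sleep, and it is not
how the kernel's rw_semaphore is implemented. It only shows which
acquisitions are compatible, and what demoting an exclusive hold to a
shared one changes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a reader/writer lock; NOT kernel code. */
struct model_rwlock {
	int	shared_holders;	/* current number of shared holders */
	bool	excl_held;	/* true while one exclusive holder exists */
};

/* Shared acquisition succeeds unless an exclusive holder exists. */
bool model_trylock_shared(struct model_rwlock *l)
{
	if (l->excl_held)
		return false;
	l->shared_holders++;
	return true;
}

/* Exclusive acquisition succeeds only when nobody holds the lock. */
bool model_trylock_excl(struct model_rwlock *l)
{
	if (l->excl_held || l->shared_holders > 0)
		return false;
	l->excl_held = true;
	return true;
}

/* Convert an exclusive hold into a shared hold without dropping it. */
void model_demote(struct model_rwlock *l)
{
	assert(l->excl_held && l->shared_holders == 0);
	l->excl_held = false;
	l->shared_holders = 1;
}

void model_unlock_shared(struct model_rwlock *l)
{
	assert(l->shared_holders > 0);
	l->shared_holders--;
}

void model_unlock_excl(struct model_rwlock *l)
{
	assert(l->excl_held);
	l->excl_held = false;
}
```

In this model, two direct writers can both take the lock shared, and an
exclusive request fails while either of them holds it. Once an exclusive
holder demotes, new shared holders are admitted again, which is the
window the patch opens on the source file.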
> Signed-off-by: Wengang Wang
> ---
> V2 changes:
> Commit message
> Make direct write paths take IOLOCK_EXCL when reflink copy is happening
> Tiny changes
> ---
> fs/xfs/xfs_file.c | 33 ++++++++++++++++++++++++++++++---
> fs/xfs/xfs_inode.c | 31 +++++++++++++++++++++++++++++++
> fs/xfs/xfs_inode.h | 11 +++++++++++
> 3 files changed, 72 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5a171c0b244b..6ca7118ee274 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -514,8 +514,10 @@ xfs_file_dio_write_aligned(
> 	struct iov_iter *from)
> {
> 	unsigned int iolock = XFS_IOLOCK_SHARED;
> +	int remapping;

bool?

> 	ssize_t ret;
>
> +relock:
> 	ret = xfs_ilock_iocb(iocb, iolock);
> 	if (ret)
> 		return ret;
> @@ -523,14 +525,25 @@ xfs_file_dio_write_aligned(
> 	if (ret)
> 		goto out_unlock;
>
> +	remapping = xfs_iflags_test(ip, XFS_IREMAPPING);

	remapping = xfs_has_reflink(mp) && xfs_iflags_test(ip, XFS_IREMAPPING);

so that you can skip the locked test on filesystems where remapping
isn't possible.

> +
> 	/*
> 	 * We don't need to hold the IOLOCK exclusively across the IO, so demote
> 	 * the iolock back to shared if we had to take the exclusive lock in
> 	 * xfs_file_write_checks() for other reasons.
> +	 * But take IOLOCK_EXCL when reflink copy is going on
> 	 */
> 	if (iolock == XFS_IOLOCK_EXCL) {
> -		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
> -		iolock = XFS_IOLOCK_SHARED;
> +		if (!remapping) {
> +			xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
> +			iolock = XFS_IOLOCK_SHARED;
> +		}

Hm. So the logic in the IOLOCK_EXCL case is that if a directio write
takes IOLOCK_EXCL, there can't possibly be a remap operation running.
Remap operations themselves always start by taking IOLOCK_EXCL before
setting IREMAPPING, so a remap operation cannot set IREMAPPING until
after this directio completes.

In the IOLOCK_SHARED case below, the directio upgrades to IOLOCK_EXCL if
a remap operation is detected.
We're protected against IREMAPPING getting set while we hold
IOLOCK_SHARED (because remap operations start by taking IOLOCK_EXCL),
though in theory we could race with the end of a remapping operation,
which at worst will result in an unnecessary IOLOCK_EXCL acquisition,
right?

There can only be one remapping operation in progress at a time because
they will take IOLOCK_EXCL initially and demote to _SHARED, so there
shouldn't be any races to setting and clearing IREMAPPING.

So I /think/ this works, but concurrency is hard to think about. :/

> +	} else { /* iolock == XFS_ILOCK_SHARED */

IOLOCK_SHARED, not ILOCK_SHARED?

> +		if (remapping) {
> +			xfs_iunlock(ip, iolock);
> +			iolock = XFS_IOLOCK_EXCL;
> +			goto relock;
> +		}
> 	}
> 	trace_xfs_file_direct_write(iocb, from);
> 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> @@ -1125,6 +1138,19 @@ xfs_file_remap_range(

Aren't changes necessary for xfs_file_dio_write_unaligned too?

> 	if (ret || len == 0)
> 		return ret;
>
> +	/*
> +	 * Set XFS_IREMAPPING flag to source file before we downgrade
> +	 * the locks, so that all direct writes know they have to take
> +	 * IOLOCK_EXCL.
> +	 */
> +	xfs_iflags_set(src, XFS_IREMAPPING);
> +
> +	/*
> +	 * From now on, we read only from src, so downgrade locks to allow
> +	 * read operations go.
> +	 */
> +	xfs_ilock_io_mmap_downgrade_src(src, dest);
> +
> 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
>
> 	ret = xfs_reflink_remap_blocks(src, pos_in, dest, pos_out, len,
> @@ -1152,7 +1178,8 @@ xfs_file_remap_range(
> 	if (xfs_file_sync_writes(file_in) || xfs_file_sync_writes(file_out))
> 		xfs_log_force_inode(dest);
> out_unlock:
> -	xfs_iunlock2_io_mmap(src, dest);
> +	xfs_iflags_clear(src, XFS_IREMAPPING);
> +	xfs_iunlock2_io_mmap_src_shared(src, dest);
> 	if (ret)
> 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
> 	return remapped > 0 ? remapped : ret;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 52d6f2c7d58b..1cbd4a594f28 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3786,6 +3786,16 @@ xfs_ilock2_io_mmap(
> 	return 0;
> }
>
> +/* Downgrade the locks on src file if src and dest are not the same one. */
> +void
> +xfs_ilock_io_mmap_downgrade_src(
> +	struct xfs_inode *src,
> +	struct xfs_inode *dest)
> +{
> +	if (src != dest)
> +		xfs_ilock_demote(src, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);

Oh, you're downgrading MMAPLOCK_EXCL (aka invalidate_lock) too? That's
going to be tricky to figure out -- write page faults (and apparently
all DAX faults) take MMAPLOCK_SHARED (invalidate_lock_shared) so I think
you'd have to add similar "upgrade/downgrade lock" logic to
__xfs_filemap_fault?

I'm not 100% sure I'm correct about that statement, my head is starting
to spin and I'm not sure it's worth the complexity.

I /do/ wonder if range locking would be a better solution here, since we
can safely unlock file ranges that we've already remapped?

--D

> +}
> +
> /* Unlock both inodes to allow IO and mmap activity. */
> void
> xfs_iunlock2_io_mmap(
> @@ -3798,3 +3808,24 @@ xfs_iunlock2_io_mmap(
> 	if (ip1 != ip2)
> 		inode_unlock(VFS_I(ip1));
> }
> +
> +/*
> + * Unlock the exclusive locks on dest file.
> + * Also unlock the shared locks on src if src and dest are not the same one
> + */
> +void
> +xfs_iunlock2_io_mmap_src_shared(
> +	struct xfs_inode *src,
> +	struct xfs_inode *dest)
> +{
> +	struct inode *src_inode = VFS_I(src);
> +	struct inode *dest_inode = VFS_I(dest);
> +
> +	inode_unlock(dest_inode);
> +	filemap_invalidate_unlock(dest_inode->i_mapping);
> +	if (src == dest)
> +		return;
> +
> +	inode_unlock_shared(src_inode);
> +	filemap_invalidate_unlock_shared(src_inode->i_mapping);
> +}
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 7be6f8e705ab..c07d4b42cf9d 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -262,6 +262,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
> 	*/
> #define XFS_INACTIVATING (1 << 13)
>
> +/*
> + * A flag indicating reflink copy / remapping is happening to the file as
> + * source. When set, all direct IOs should take IOLOCK_EXCL to avoid
> + * interphering the remapping.
> + */
> +#define XFS_IREMAPPING (1 << 14)
> +
> /* All inode state flags related to inode reclaim. */
> #define XFS_ALL_IRECLAIM_FLAGS (XFS_IRECLAIMABLE | \
> 	XFS_IRECLAIM | \
> @@ -512,5 +519,9 @@ void xfs_end_io(struct work_struct *work);
>
> int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> +void xfs_ilock_io_mmap_downgrade_src(struct xfs_inode *src,
> +	struct xfs_inode *dest);
> +void xfs_iunlock2_io_mmap_src_shared(struct xfs_inode *src,
> +	struct xfs_inode *dest);
>
> #endif /* __XFS_INODE_H__ */
> --
> 2.21.0 (Apple Git-122.2)
>
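
[Editorial illustration, not part of the original message or the patch.]
The direct-write locking decision discussed in the review of
xfs_file_dio_write_aligned() can be summarised as a small pure function.
This is only a sketch of the reasoning, under the assumption that the
xfs_has_reflink() short-circuit suggested in the review is applied; all
model_* and MODEL_* names are hypothetical and do not exist in the
kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two IOLOCK modes; NOT kernel symbols. */
enum model_iolock {
	MODEL_IOLOCK_SHARED,
	MODEL_IOLOCK_EXCL,
};

/* What the aligned direct-write path does next in this model. */
enum model_dio_step {
	MODEL_DIO_ISSUE_SHARED,	/* hold (or demote to) shared, issue the IO */
	MODEL_DIO_ISSUE_EXCL,	/* keep the exclusive lock, issue the IO */
	MODEL_DIO_RELOCK_EXCL,	/* drop shared, retake exclusive, recheck */
};

enum model_dio_step
model_dio_next_step(enum model_iolock held, bool has_reflink, bool iremapping)
{
	/* Skip the flag test on filesystems that cannot remap at all. */
	bool remapping = has_reflink && iremapping;

	if (held == MODEL_IOLOCK_EXCL)
		return remapping ? MODEL_DIO_ISSUE_EXCL : MODEL_DIO_ISSUE_SHARED;
	return remapping ? MODEL_DIO_RELOCK_EXCL : MODEL_DIO_ISSUE_SHARED;
}

/* Walks the cases covered in the review. */
void model_dio_selfcheck(void)
{
	assert(model_dio_next_step(MODEL_IOLOCK_SHARED, true, false) ==
	       MODEL_DIO_ISSUE_SHARED);
	assert(model_dio_next_step(MODEL_IOLOCK_SHARED, true, true) ==
	       MODEL_DIO_RELOCK_EXCL);
	assert(model_dio_next_step(MODEL_IOLOCK_EXCL, true, false) ==
	       MODEL_DIO_ISSUE_SHARED);
	assert(model_dio_next_step(MODEL_IOLOCK_EXCL, true, true) ==
	       MODEL_DIO_ISSUE_EXCL);
}
```

The review argues that the (exclusive, remapping) combination should
never be observed in practice, because a remap has to take IOLOCK_EXCL
before it can set the flag. The model still returns a defined answer for
that case, mirroring the defensive branch in the patch.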