From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Chandan Babu R <chandan.babu@oracle.com>,
Christian Brauner <brauner@kernel.org>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
Date: Mon, 23 Sep 2024 09:18:25 -0700 [thread overview]
Message-ID: <20240923161825.GE21877@frogsfrogsfrogs> (raw)
In-Reply-To: <20240923152904.1747117-3-hch@lst.de>
On Mon, Sep 23, 2024 at 05:28:16PM +0200, Christoph Hellwig wrote:
> Currently iomap_file_buffered_write_punch_delalloc can be called from
> XFS either with the invalidate lock held or not. To fix this while
> keeping the locking in the file system and not the iomap library
> code we'll need to life the locking up into the file system.
>
> To prepare for that, open code iomap_file_buffered_write_punch_delalloc
> in the only caller, and instead export iomap_write_delalloc_release.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> .../filesystems/iomap/operations.rst | 2 +-
> fs/iomap/buffered-io.c | 85 ++++++-------------
> fs/xfs/xfs_iomap.c | 16 +++-
> include/linux/iomap.h | 6 +-
> 4 files changed, 46 insertions(+), 63 deletions(-)
>
> diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> index 8e6c721d233010..b93115ab8748ae 100644
> --- a/Documentation/filesystems/iomap/operations.rst
> +++ b/Documentation/filesystems/iomap/operations.rst
> @@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
> such `reservations
> <https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
> because writeback will not consume the reservation.
> -The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
> +The ``iomap_write_delalloc_release`` can be called from a
> ``->iomap_end`` function to find all the clean areas of the folios
> caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
> It takes the ``invalidate_lock``.
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 884891ac7a226c..237aeb883166df 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1149,6 +1149,32 @@ static void iomap_write_delalloc_scan(struct inode *inode,
> * have dirty data still pending in the page cache - those are going to be
> * written and so must still retain the delalloc backing for writeback.
> *
> + * When a short write occurs, the filesystem may need to remove reserved space
> + * that was allocated in ->iomap_begin from it's ->iomap_end method. For
"When a short write occurs, the filesystem may need to remove space
reservations created in ->iomap_begin.
> + * filesystems that use delayed allocation, we need to punch out delalloc
> + * extents from the range that are not dirty in the page cache. As the write can
> + * race with page faults, there can be dirty pages over the delalloc extent
> + * outside the range of a short write but still within the delalloc extent
> + * allocated for this iomap.
> + *
> + * The punch() callback *must* only punch delalloc extents in the range passed
> + * to it. It must skip over all other types of extents in the range and leave
> + * them completely unchanged. It must do this punch atomically with respect to
> + * other extent modifications.
Can a failing buffered write race with a write fault to the same file
range?
write() thread: page_mkwrite thread:
--------------- --------------------
take i_rwsem
->iomap_begin
create da reservation
lock folio
fail to write
unlock folio
take invalidation lock
lock folio
->iomap_begin
sees da reservation
mark folio dirty
unlock folio
drop invalidation lock
->iomap_end
take invalidation lock
iomap_write_delalloc_release
drop invalidation lock
Can we end up in this situation, where the write fault thinks it has a
dirty page backed by a delalloc reservation, yet the delalloc
reservation gets removed by the delalloc punch logic? I think the
answer to my question is that this sequence is impossible because the
write fault dirties the folio so the iomap_write_delalloc_release does
nothing, correct?
Unrelated question about iomap_write_begin: Can we get rid of the
!mapping_large_folio_support if-body just prior to __iomap_get_folio?
filemap_get_folio won't return large folios if
!mapping_large_folio_support, so I think the separate check in iomap
isn't needed anymore?
This push-down looks fine though, so
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> + *
> + * The punch() callback may be called with a folio locked to prevent writeback
> + * extent allocation racing at the edge of the range we are currently punching.
> + * The locked folio may or may not cover the range being punched, so it is not
> + * safe for the punch() callback to lock folios itself.
> + *
> + * Lock order is:
> + *
> + * inode->i_rwsem (shared or exclusive)
> + * inode->i_mapping->invalidate_lock (exclusive)
> + * folio_lock()
> + * ->punch
> + * internal filesystem allocation lock
> + *
> * As we are scanning the page cache for data, we don't need to reimplement the
> * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> * start and end of data ranges correctly even for sub-folio block sizes. This
> @@ -1177,7 +1203,7 @@ static void iomap_write_delalloc_scan(struct inode *inode,
> * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
> * the code to subtle off-by-one bugs....
> */
> -static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> +void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> loff_t end_byte, unsigned flags, struct iomap *iomap,
> iomap_punch_t punch)
> {
> @@ -1243,62 +1269,7 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> out_unlock:
> filemap_invalidate_unlock(inode->i_mapping);
> }
> -
> -/*
> - * When a short write occurs, the filesystem may need to remove reserved space
> - * that was allocated in ->iomap_begin from it's ->iomap_end method. For
> - * filesystems that use delayed allocation, we need to punch out delalloc
> - * extents from the range that are not dirty in the page cache. As the write can
> - * race with page faults, there can be dirty pages over the delalloc extent
> - * outside the range of a short write but still within the delalloc extent
> - * allocated for this iomap.
> - *
> - * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
> - * simplify range iterations.
> - *
> - * The punch() callback *must* only punch delalloc extents in the range passed
> - * to it. It must skip over all other types of extents in the range and leave
> - * them completely unchanged. It must do this punch atomically with respect to
> - * other extent modifications.
> - *
> - * The punch() callback may be called with a folio locked to prevent writeback
> - * extent allocation racing at the edge of the range we are currently punching.
> - * The locked folio may or may not cover the range being punched, so it is not
> - * safe for the punch() callback to lock folios itself.
> - *
> - * Lock order is:
> - *
> - * inode->i_rwsem (shared or exclusive)
> - * inode->i_mapping->invalidate_lock (exclusive)
> - * folio_lock()
> - * ->punch
> - * internal filesystem allocation lock
> - */
> -void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
> - loff_t pos, loff_t length, ssize_t written, unsigned flags,
> - struct iomap *iomap, iomap_punch_t punch)
> -{
> - loff_t start_byte;
> - loff_t end_byte;
> -
> - if (iomap->type != IOMAP_DELALLOC)
> - return;
> -
> - /* If we didn't reserve the blocks, we're not allowed to punch them. */
> - if (!(iomap->flags & IOMAP_F_NEW))
> - return;
> -
> - start_byte = iomap_last_written_block(inode, pos, written);
> - end_byte = round_up(pos + length, i_blocksize(inode));
> -
> - /* Nothing to do if we've written the entire delalloc extent */
> - if (start_byte >= end_byte)
> - return;
> -
> - iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> - punch);
> -}
> -EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
> +EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
>
> static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> {
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1e11f48814c0d0..30f2530b6d5461 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1227,8 +1227,20 @@ xfs_buffered_write_iomap_end(
> unsigned flags,
> struct iomap *iomap)
> {
> - iomap_file_buffered_write_punch_delalloc(inode, offset, length, written,
> - flags, iomap, &xfs_buffered_write_delalloc_punch);
> + loff_t start_byte, end_byte;
> +
> + /* If we didn't reserve the blocks, we're not allowed to punch them. */
> + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> + return 0;
> +
> + /* Nothing to do if we've written the entire delalloc extent */
> + start_byte = iomap_last_written_block(inode, offset, written);
> + end_byte = round_up(offset + length, i_blocksize(inode));
> + if (start_byte >= end_byte)
> + return 0;
> +
> + iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> + xfs_buffered_write_delalloc_punch);
> return 0;
> }
>
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e62df5d93f04de..137e0783faa224 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -290,9 +290,9 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
>
> typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
> struct iomap *iomap);
> -void iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
> - loff_t length, ssize_t written, unsigned flag,
> - struct iomap *iomap, iomap_punch_t punch);
> +void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> + loff_t end_byte, unsigned flags, struct iomap *iomap,
> + iomap_punch_t punch);
>
> int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> u64 start, u64 len, const struct iomap_ops *ops);
> --
> 2.45.2
>
next prev parent reply other threads:[~2024-09-23 16:18 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
2024-09-23 15:28 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
2024-09-23 15:53 ` Darrick J. Wong
2024-09-24 5:45 ` Christoph Hellwig
2024-09-23 15:28 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
2024-09-23 16:18 ` Darrick J. Wong [this message]
2024-09-23 22:43 ` Dave Chinner
2024-09-24 5:55 ` Christoph Hellwig
2024-09-24 6:05 ` Darrick J. Wong
2024-09-24 6:10 ` Christoph Hellwig
2024-09-23 15:28 ` [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release Christoph Hellwig
2024-09-23 16:19 ` Darrick J. Wong
2024-09-23 15:28 ` [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
2024-09-23 16:20 ` Darrick J. Wong
2024-09-23 15:28 ` [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
2024-09-23 15:28 ` [PATCH 06/10] xfs: zeroing already holds invalidate_lock Christoph Hellwig
2024-09-23 16:22 ` Darrick J. Wong
2024-09-24 5:44 ` Christoph Hellwig
2024-09-23 15:28 ` [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range Christoph Hellwig
2024-09-23 15:28 ` [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin Christoph Hellwig
2024-09-23 15:28 ` [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations Christoph Hellwig
2024-09-23 15:28 ` [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes Christoph Hellwig
2024-09-25 9:19 ` fix stale delalloc punching for COW I/O v3 Christian Brauner
2024-09-25 9:24 ` fix stale delalloc punching for COW I/O v4 Christian Brauner
-- strict thread matches above, loose matches on Subject: below --
2024-09-24 7:40 Christoph Hellwig
2024-09-24 7:40 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
2024-09-24 14:58 ` Darrick J. Wong
2024-09-24 7:40 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
2024-09-24 7:40 ` [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release Christoph Hellwig
2024-09-24 7:40 ` [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
2024-09-24 7:40 ` [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
2024-09-24 7:40 ` [PATCH 06/10] xfs: zeroing already holds invalidate_lock Christoph Hellwig
2024-09-24 7:40 ` [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range Christoph Hellwig
2024-09-24 7:40 ` [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin Christoph Hellwig
2024-09-24 7:40 ` [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations Christoph Hellwig
2024-09-24 7:40 ` [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes Christoph Hellwig
2024-10-05 15:53 ` fix stale delalloc punching for COW I/O v4 Darrick J. Wong
2024-10-07 5:41 ` Christoph Hellwig
2024-10-07 6:28 ` Darrick J. Wong
2024-10-07 6:46 ` Christoph Hellwig
2024-10-07 15:20 ` Darrick J. Wong
2024-10-08 8:59 fix stale delalloc punching for COW I/O v5 Christoph Hellwig
2024-10-08 8:59 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240923161825.GE21877@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=brauner@kernel.org \
--cc=chandan.babu@oracle.com \
--cc=hch@lst.de \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).