* [PATCH 01/10] iomap: factor out a iomap_last_written_block helper
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 14:58 ` Darrick J. Wong
2024-09-24 7:40 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
` (9 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Split out a piece of logic from iomap_file_buffered_write_punch_delalloc
that is useful for all iomap_end implementations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 13 ++-----------
include/linux/iomap.h | 14 ++++++++++++++
2 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 11ea747228aeec..884891ac7a226c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1280,7 +1280,6 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
{
loff_t start_byte;
loff_t end_byte;
- unsigned int blocksize = i_blocksize(inode);
if (iomap->type != IOMAP_DELALLOC)
return;
@@ -1289,16 +1288,8 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
if (!(iomap->flags & IOMAP_F_NEW))
return;
- /*
- * start_byte refers to the first unused block after a short write. If
- * nothing was written, round offset down to point at the first block in
- * the range.
- */
- if (unlikely(!written))
- start_byte = round_down(pos, blocksize);
- else
- start_byte = round_up(pos + written, blocksize);
- end_byte = round_up(pos + length, blocksize);
+ start_byte = iomap_last_written_block(inode, pos, written);
+ end_byte = round_up(pos + length, i_blocksize(inode));
/* Nothing to do if we've written the entire delalloc extent */
if (start_byte >= end_byte)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4ad12a3c8bae22..62253739dedcbe 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -256,6 +256,20 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
return &i->iomap;
}
+/*
+ * Return the file offset for the first unchanged block after a short write.
+ *
+ * If nothing was written, round @pos down to point at the first block in
+ * the range, else round up to include the partially written block.
+ */
+static inline loff_t iomap_last_written_block(struct inode *inode, loff_t pos,
+ ssize_t written)
+{
+ if (unlikely(!written))
+ return round_down(pos, i_blocksize(inode));
+ return round_up(pos + written, i_blocksize(inode));
+}
+
ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops, void *private);
int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] iomap: factor out a iomap_last_written_block helper
2024-09-24 7:40 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
@ 2024-09-24 14:58 ` Darrick J. Wong
0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2024-09-24 14:58 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Carlos Maiolino, Christian Brauner, linux-xfs,
linux-fsdevel
On Tue, Sep 24, 2024 at 09:40:43AM +0200, Christoph Hellwig wrote:
> Split out a piece of logic from iomap_file_buffered_write_punch_delalloc
> that is useful for all iomap_end implementations.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Looks good to me now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/iomap/buffered-io.c | 13 ++-----------
> include/linux/iomap.h | 14 ++++++++++++++
> 2 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 11ea747228aeec..884891ac7a226c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1280,7 +1280,6 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
> {
> loff_t start_byte;
> loff_t end_byte;
> - unsigned int blocksize = i_blocksize(inode);
>
> if (iomap->type != IOMAP_DELALLOC)
> return;
> @@ -1289,16 +1288,8 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
> if (!(iomap->flags & IOMAP_F_NEW))
> return;
>
> - /*
> - * start_byte refers to the first unused block after a short write. If
> - * nothing was written, round offset down to point at the first block in
> - * the range.
> - */
> - if (unlikely(!written))
> - start_byte = round_down(pos, blocksize);
> - else
> - start_byte = round_up(pos + written, blocksize);
> - end_byte = round_up(pos + length, blocksize);
> + start_byte = iomap_last_written_block(inode, pos, written);
> + end_byte = round_up(pos + length, i_blocksize(inode));
>
> /* Nothing to do if we've written the entire delalloc extent */
> if (start_byte >= end_byte)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4ad12a3c8bae22..62253739dedcbe 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -256,6 +256,20 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
> return &i->iomap;
> }
>
> +/*
> + * Return the file offset for the first unchanged block after a short write.
> + *
> + * If nothing was written, round @pos down to point at the first block in
> + * the range, else round up to include the partially written block.
> + */
> +static inline loff_t iomap_last_written_block(struct inode *inode, loff_t pos,
> + ssize_t written)
> +{
> + if (unlikely(!written))
> + return round_down(pos, i_blocksize(inode));
> + return round_up(pos + written, i_blocksize(inode));
> +}
> +
> ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
> const struct iomap_ops *ops, void *private);
> int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
> --
> 2.45.2
>
>
* [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
2024-09-24 7:40 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release Christoph Hellwig
` (8 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Currently iomap_file_buffered_write_punch_delalloc can be called from
XFS either with the invalidate lock held or not. To fix this while
keeping the locking in the file system and not the iomap library
code, we'll need to lift the locking up into the file system.
To prepare for that, open code iomap_file_buffered_write_punch_delalloc
in the only caller, and instead export iomap_write_delalloc_release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
.../filesystems/iomap/operations.rst | 2 +-
fs/iomap/buffered-io.c | 85 ++++++-------------
fs/xfs/xfs_iomap.c | 16 +++-
include/linux/iomap.h | 6 +-
4 files changed, 46 insertions(+), 63 deletions(-)
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 8e6c721d233010..b93115ab8748ae 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
such `reservations
<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
because writeback will not consume the reservation.
-The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
+The ``iomap_write_delalloc_release`` can be called from a
``->iomap_end`` function to find all the clean areas of the folios
caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
It takes the ``invalidate_lock``.
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 884891ac7a226c..a350dac1bf5136 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1145,10 +1145,36 @@ static void iomap_write_delalloc_scan(struct inode *inode,
}
/*
+ * When a short write occurs, the filesystem might need to use ->iomap_end
+ * to remove space reservations created in ->iomap_begin.
+ *
+ * For filesystems that use delayed allocation, there can be dirty pages over
+ * the delalloc extent outside the range of a short write but still within the
+ * delalloc extent allocated for this iomap if the write raced with page
+ * faults.
+ *
* Punch out all the delalloc blocks in the range given except for those that
* have dirty data still pending in the page cache - those are going to be
* written and so must still retain the delalloc backing for writeback.
*
+ * The punch() callback *must* only punch delalloc extents in the range passed
+ * to it. It must skip over all other types of extents in the range and leave
+ * them completely unchanged. It must do this punch atomically with respect to
+ * other extent modifications.
+ *
+ * The punch() callback may be called with a folio locked to prevent writeback
+ * extent allocation racing at the edge of the range we are currently punching.
+ * The locked folio may or may not cover the range being punched, so it is not
+ * safe for the punch() callback to lock folios itself.
+ *
+ * Lock order is:
+ *
+ * inode->i_rwsem (shared or exclusive)
+ * inode->i_mapping->invalidate_lock (exclusive)
+ * folio_lock()
+ * ->punch
+ * internal filesystem allocation lock
+ *
* As we are scanning the page cache for data, we don't need to reimplement the
* wheel - mapping_seek_hole_data() does exactly what we need to identify the
* start and end of data ranges correctly even for sub-folio block sizes. This
@@ -1177,7 +1203,7 @@ static void iomap_write_delalloc_scan(struct inode *inode,
* require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
* the code to subtle off-by-one bugs....
*/
-static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
+void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
loff_t end_byte, unsigned flags, struct iomap *iomap,
iomap_punch_t punch)
{
@@ -1243,62 +1269,7 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
out_unlock:
filemap_invalidate_unlock(inode->i_mapping);
}
-
-/*
- * When a short write occurs, the filesystem may need to remove reserved space
- * that was allocated in ->iomap_begin from it's ->iomap_end method. For
- * filesystems that use delayed allocation, we need to punch out delalloc
- * extents from the range that are not dirty in the page cache. As the write can
- * race with page faults, there can be dirty pages over the delalloc extent
- * outside the range of a short write but still within the delalloc extent
- * allocated for this iomap.
- *
- * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
- * simplify range iterations.
- *
- * The punch() callback *must* only punch delalloc extents in the range passed
- * to it. It must skip over all other types of extents in the range and leave
- * them completely unchanged. It must do this punch atomically with respect to
- * other extent modifications.
- *
- * The punch() callback may be called with a folio locked to prevent writeback
- * extent allocation racing at the edge of the range we are currently punching.
- * The locked folio may or may not cover the range being punched, so it is not
- * safe for the punch() callback to lock folios itself.
- *
- * Lock order is:
- *
- * inode->i_rwsem (shared or exclusive)
- * inode->i_mapping->invalidate_lock (exclusive)
- * folio_lock()
- * ->punch
- * internal filesystem allocation lock
- */
-void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
- loff_t pos, loff_t length, ssize_t written, unsigned flags,
- struct iomap *iomap, iomap_punch_t punch)
-{
- loff_t start_byte;
- loff_t end_byte;
-
- if (iomap->type != IOMAP_DELALLOC)
- return;
-
- /* If we didn't reserve the blocks, we're not allowed to punch them. */
- if (!(iomap->flags & IOMAP_F_NEW))
- return;
-
- start_byte = iomap_last_written_block(inode, pos, written);
- end_byte = round_up(pos + length, i_blocksize(inode));
-
- /* Nothing to do if we've written the entire delalloc extent */
- if (start_byte >= end_byte)
- return;
-
- iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
- punch);
-}
-EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
+EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
static loff_t iomap_unshare_iter(struct iomap_iter *iter)
{
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1e11f48814c0d0..30f2530b6d5461 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1227,8 +1227,20 @@ xfs_buffered_write_iomap_end(
unsigned flags,
struct iomap *iomap)
{
- iomap_file_buffered_write_punch_delalloc(inode, offset, length, written,
- flags, iomap, &xfs_buffered_write_delalloc_punch);
+ loff_t start_byte, end_byte;
+
+ /* If we didn't reserve the blocks, we're not allowed to punch them. */
+ if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
+ return 0;
+
+ /* Nothing to do if we've written the entire delalloc extent */
+ start_byte = iomap_last_written_block(inode, offset, written);
+ end_byte = round_up(offset + length, i_blocksize(inode));
+ if (start_byte >= end_byte)
+ return 0;
+
+ iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
+ xfs_buffered_write_delalloc_punch);
return 0;
}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 62253739dedcbe..d0420e962ffdc2 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -290,9 +290,9 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
struct iomap *iomap);
-void iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
- loff_t length, ssize_t written, unsigned flag,
- struct iomap *iomap, iomap_punch_t punch);
+void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
+ loff_t end_byte, unsigned flags, struct iomap *iomap,
+ iomap_punch_t punch);
int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len, const struct iomap_ops *ops);
--
2.45.2
* [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
2024-09-24 7:40 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
2024-09-24 7:40 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
` (7 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
XFS (which is currently the only user of iomap_write_delalloc_release)
already holds the invalidate_lock for most zeroing operations. To be able
to avoid a deadlock it needs to stop taking the lock, but doing so
in iomap would leak XFS locking details into iomap.
To avoid this require the caller to hold invalidate_lock when calling
iomap_write_delalloc_release instead of taking it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 17 ++++++++---------
fs/xfs/xfs_iomap.c | 2 ++
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a350dac1bf5136..b9e70e30a1b254 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1211,12 +1211,13 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
loff_t scan_end_byte = min(i_size_read(inode), end_byte);
/*
- * Lock the mapping to avoid races with page faults re-instantiating
- * folios and dirtying them via ->page_mkwrite whilst we walk the
- * cache and perform delalloc extent removal. Failing to do this can
- * leave dirty pages with no space reservation in the cache.
+ * The caller must hold invalidate_lock to avoid races with page faults
+ * re-instantiating folios and dirtying them via ->page_mkwrite whilst
+ * we walk the cache and perform delalloc extent removal. Failing to do
+ * this can leave dirty pages with no space reservation in the cache.
*/
- filemap_invalidate_lock(inode->i_mapping);
+ lockdep_assert_held_write(&inode->i_mapping->invalidate_lock);
+
while (start_byte < scan_end_byte) {
loff_t data_end;
@@ -1233,7 +1234,7 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
if (start_byte == -ENXIO || start_byte == scan_end_byte)
break;
if (WARN_ON_ONCE(start_byte < 0))
- goto out_unlock;
+ return;
WARN_ON_ONCE(start_byte < punch_start_byte);
WARN_ON_ONCE(start_byte > scan_end_byte);
@@ -1244,7 +1245,7 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
scan_end_byte, SEEK_HOLE);
if (WARN_ON_ONCE(data_end < 0))
- goto out_unlock;
+ return;
/*
* If we race with post-direct I/O invalidation of the page cache,
@@ -1266,8 +1267,6 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
if (punch_start_byte < end_byte)
punch(inode, punch_start_byte, end_byte - punch_start_byte,
iomap);
-out_unlock:
- filemap_invalidate_unlock(inode->i_mapping);
}
EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 30f2530b6d5461..01324da63fcfc7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1239,8 +1239,10 @@ xfs_buffered_write_iomap_end(
if (start_byte >= end_byte)
return 0;
+ filemap_invalidate_lock(inode->i_mapping);
iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
xfs_buffered_write_delalloc_punch);
+ filemap_invalidate_unlock(inode->i_mapping);
return 0;
}
--
2.45.2
* [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (2 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
` (6 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Split a helper out of xfs_file_write_checks that just deals with the
post-EOF zeroing to keep the code readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_file.c | 140 +++++++++++++++++++++++++++-------------------
1 file changed, 82 insertions(+), 58 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 412b1d71b52b7d..3efb0da2a910d6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,10 +347,77 @@ xfs_file_splice_read(
return ret;
}
+/*
+ * Take care of zeroing post-EOF blocks when they might exist.
+ *
+ * Returns 0 if successfully, a negative error for a failure, or 1 if this
+ * function dropped the iolock and reacquired it exclusively and the caller
+ * needs to restart the write sanity checks.
+ */
+static ssize_t
+xfs_file_write_zero_eof(
+ struct kiocb *iocb,
+ struct iov_iter *from,
+ unsigned int *iolock,
+ size_t count,
+ bool *drained_dio)
+{
+ struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
+ loff_t isize;
+
+ /*
+ * We need to serialise against EOF updates that occur in IO completions
+ * here. We want to make sure that nobody is changing the size while
+ * we do this check until we have placed an IO barrier (i.e. hold
+ * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
+ * spinlock effectively forms a memory barrier once we have
+ * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
+ * hence be able to correctly determine if we need to run zeroing.
+ */
+ spin_lock(&ip->i_flags_lock);
+ isize = i_size_read(VFS_I(ip));
+ if (iocb->ki_pos <= isize) {
+ spin_unlock(&ip->i_flags_lock);
+ return 0;
+ }
+ spin_unlock(&ip->i_flags_lock);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+
+ if (!*drained_dio) {
+ /*
+ * If zeroing is needed and we are currently holding the iolock
+ * shared, we need to update it to exclusive which implies
+ * having to redo all checks before.
+ */
+ if (*iolock == XFS_IOLOCK_SHARED) {
+ xfs_iunlock(ip, *iolock);
+ *iolock = XFS_IOLOCK_EXCL;
+ xfs_ilock(ip, *iolock);
+ iov_iter_reexpand(from, count);
+ }
+
+ /*
+ * We now have an IO submission barrier in place, but AIO can do
+ * EOF updates during IO completion and hence we now need to
+ * wait for all of them to drain. Non-AIO DIO will have drained
+ * before we are given the XFS_IOLOCK_EXCL, and so for most
+ * cases this wait is a no-op.
+ */
+ inode_dio_wait(VFS_I(ip));
+ *drained_dio = true;
+ return 1;
+ }
+
+ trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
+ return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+}
+
/*
* Common pre-write limit and setup checks.
*
- * Called with the iolocked held either shared and exclusive according to
+ * Called with the iolock held either shared and exclusive according to
* @iolock, and returns with it held. Might upgrade the iolock to exclusive
* if called for a direct write beyond i_size.
*/
@@ -360,13 +427,10 @@ xfs_file_write_checks(
struct iov_iter *from,
unsigned int *iolock)
{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct xfs_inode *ip = XFS_I(inode);
- ssize_t error = 0;
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
size_t count = iov_iter_count(from);
bool drained_dio = false;
- loff_t isize;
+ ssize_t error;
restart:
error = generic_write_checks(iocb, from);
@@ -389,7 +453,7 @@ xfs_file_write_checks(
* exclusively.
*/
if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
- xfs_iunlock(ip, *iolock);
+ xfs_iunlock(XFS_I(inode), *iolock);
*iolock = XFS_IOLOCK_EXCL;
error = xfs_ilock_iocb(iocb, *iolock);
if (error) {
@@ -400,64 +464,24 @@ xfs_file_write_checks(
}
/*
- * If the offset is beyond the size of the file, we need to zero any
+ * If the offset is beyond the size of the file, we need to zero all
* blocks that fall between the existing EOF and the start of this
- * write. If zeroing is needed and we are currently holding the iolock
- * shared, we need to update it to exclusive which implies having to
- * redo all checks before.
- *
- * We need to serialise against EOF updates that occur in IO completions
- * here. We want to make sure that nobody is changing the size while we
- * do this check until we have placed an IO barrier (i.e. hold the
- * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
- * spinlock effectively forms a memory barrier once we have the
- * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
- * hence be able to correctly determine if we need to run zeroing.
+ * write.
*
- * We can do an unlocked check here safely as IO completion can only
- * extend EOF. Truncate is locked out at this point, so the EOF can
- * not move backwards, only forwards. Hence we only need to take the
- * slow path and spin locks when we are at or beyond the current EOF.
+ * We can do an unlocked check for i_size here safely as I/O completion
+ * can only extend EOF. Truncate is locked out at this point, so the
+ * EOF can not move backwards, only forwards. Hence we only need to take
+ * the slow path when we are at or beyond the current EOF.
*/
- if (iocb->ki_pos <= i_size_read(inode))
- goto out;
-
- spin_lock(&ip->i_flags_lock);
- isize = i_size_read(inode);
- if (iocb->ki_pos > isize) {
- spin_unlock(&ip->i_flags_lock);
-
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
-
- if (!drained_dio) {
- if (*iolock == XFS_IOLOCK_SHARED) {
- xfs_iunlock(ip, *iolock);
- *iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, *iolock);
- iov_iter_reexpand(from, count);
- }
- /*
- * We now have an IO submission barrier in place, but
- * AIO can do EOF updates during IO completion and hence
- * we now need to wait for all of them to drain. Non-AIO
- * DIO will have drained before we are given the
- * XFS_IOLOCK_EXCL, and so for most cases this wait is a
- * no-op.
- */
- inode_dio_wait(inode);
- drained_dio = true;
+ if (iocb->ki_pos > i_size_read(inode)) {
+ error = xfs_file_write_zero_eof(iocb, from, iolock, count,
+ &drained_dio);
+ if (error == 1)
goto restart;
- }
-
- trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
- error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
if (error)
return error;
- } else
- spin_unlock(&ip->i_flags_lock);
+ }
-out:
return kiocb_modified(iocb);
}
--
2.45.2
* [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (3 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 06/10] xfs: zeroing already holds invalidate_lock Christoph Hellwig
` (5 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that
is actually the right thing, as the error handling in the iomap zeroing
code will also take the invalidate_lock to clean up, but to fix that
deadlock we need a consistent locking pattern first.
The only extra thing that XFS_MMAPLOCK_EXCL will lock out is read
pagefaults, which isn't really needed here, but also not actively
harmful.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_file.c | 8 +++++++-
fs/xfs/xfs_iomap.c | 2 ++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3efb0da2a910d6..b19916b11fd563 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -364,6 +364,7 @@ xfs_file_write_zero_eof(
{
struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
loff_t isize;
+ int error;
/*
* We need to serialise against EOF updates that occur in IO completions
@@ -411,7 +412,12 @@ xfs_file_write_zero_eof(
}
trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
- return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+
+ xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+ error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+ xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+
+ return error;
}
/*
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 01324da63fcfc7..4fa4d66dc37761 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1449,6 +1449,8 @@ xfs_zero_range(
{
struct inode *inode = VFS_I(ip);
+ xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
+
if (IS_DAX(inode))
return dax_zero_range(inode, pos, len, did_zero,
&xfs_dax_write_iomap_ops);
--
2.45.2
* [PATCH 06/10] xfs: zeroing already holds invalidate_lock
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (4 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range Christoph Hellwig
` (4 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
All XFS callers of iomap_zero_range already hold the invalidate_lock, so
we can't take it again in iomap_file_buffered_write_punch_delalloc.
Use the passed-in flags argument to detect if we're called from a zeroing
operation, and don't take the lock again in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 4fa4d66dc37761..0f5fa3de6d3ecc 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1239,10 +1239,17 @@ xfs_buffered_write_iomap_end(
if (start_byte >= end_byte)
return 0;
- filemap_invalidate_lock(inode->i_mapping);
+ /* For zeroing operations the callers already hold invalidate_lock. */
+ if (flags & IOMAP_ZERO)
+ rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
+ else
+ filemap_invalidate_lock(inode->i_mapping);
+
iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
xfs_buffered_write_delalloc_punch);
- filemap_invalidate_unlock(inode->i_mapping);
+
+ if (!(flags & IOMAP_ZERO))
+ filemap_invalidate_unlock(inode->i_mapping);
return 0;
}
--
2.45.2
* [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (5 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 06/10] xfs: zeroing already holds invalidate_lock Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin Christoph Hellwig
` (3 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
xfs_buffered_write_iomap_begin can also create delalloc reservations
that need cleaning up. Prepare for that by adding support for the COW
fork in xfs_bmap_punch_delalloc_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_aops.c | 4 ++--
fs/xfs/xfs_bmap_util.c | 10 +++++++---
fs/xfs/xfs_bmap_util.h | 2 +-
fs/xfs/xfs_iomap.c | 3 ++-
4 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6dead20338e24c..559a3a57709748 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -116,7 +116,7 @@ xfs_end_ioend(
if (unlikely(error)) {
if (ioend->io_flags & IOMAP_F_SHARED) {
xfs_reflink_cancel_cow_range(ip, offset, size, true);
- xfs_bmap_punch_delalloc_range(ip, offset,
+ xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, offset,
offset + size);
}
goto done;
@@ -456,7 +456,7 @@ xfs_discard_folio(
* byte of the next folio. Hence the end offset is only dependent on the
* folio itself and not the start offset that is passed in.
*/
- xfs_bmap_punch_delalloc_range(ip, pos,
+ xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, pos,
folio_pos(folio) + folio_size(folio));
}
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 053d567c910840..4719ec90029cb7 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -442,11 +442,12 @@ xfs_getbmap(
void
xfs_bmap_punch_delalloc_range(
struct xfs_inode *ip,
+ int whichfork,
xfs_off_t start_byte,
xfs_off_t end_byte)
{
struct xfs_mount *mp = ip->i_mount;
- struct xfs_ifork *ifp = &ip->i_df;
+ struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork);
xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, start_byte);
xfs_fileoff_t end_fsb = XFS_B_TO_FSB(mp, end_byte);
struct xfs_bmbt_irec got, del;
@@ -474,11 +475,14 @@ xfs_bmap_punch_delalloc_range(
continue;
}
- xfs_bmap_del_extent_delay(ip, XFS_DATA_FORK, &icur, &got, &del);
+ xfs_bmap_del_extent_delay(ip, whichfork, &icur, &got, &del);
if (!xfs_iext_get_extent(ifp, &icur, &got))
break;
}
+ if (whichfork == XFS_COW_FORK && !ifp->if_bytes)
+ xfs_inode_clear_cowblocks_tag(ip);
+
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_EXCL);
}
@@ -580,7 +584,7 @@ xfs_free_eofblocks(
*/
if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)) {
if (ip->i_delayed_blks) {
- xfs_bmap_punch_delalloc_range(ip,
+ xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK,
round_up(XFS_ISIZE(ip), mp->m_sb.sb_blocksize),
LLONG_MAX);
}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index eb0895bfb9dae4..b29760d36e1ab1 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -30,7 +30,7 @@ xfs_bmap_rtalloc(struct xfs_bmalloca *ap)
}
#endif /* CONFIG_XFS_RT */
-void xfs_bmap_punch_delalloc_range(struct xfs_inode *ip,
+void xfs_bmap_punch_delalloc_range(struct xfs_inode *ip, int whichfork,
xfs_off_t start_byte, xfs_off_t end_byte);
struct kgetbmap {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 0f5fa3de6d3ecc..abc3b9f1115ce7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1215,7 +1215,8 @@ xfs_buffered_write_delalloc_punch(
loff_t length,
struct iomap *iomap)
{
- xfs_bmap_punch_delalloc_range(XFS_I(inode), offset, offset + length);
+ xfs_bmap_punch_delalloc_range(XFS_I(inode), XFS_DATA_FORK, offset,
+ offset + length);
}
static int
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (6 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations Christoph Hellwig
` (2 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index abc3b9f1115ce7..aff9e8305399ee 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -975,6 +975,7 @@ xfs_buffered_write_iomap_begin(
int allocfork = XFS_DATA_FORK;
int error = 0;
unsigned int lockmode = XFS_ILOCK_EXCL;
+ unsigned int iomap_flags = 0;
u64 seq;
if (xfs_is_shutdown(mp))
@@ -1145,6 +1146,11 @@ xfs_buffered_write_iomap_begin(
}
}
+ /*
+ * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
+ * them out if the write happens to fail.
+ */
+ iomap_flags |= IOMAP_F_NEW;
if (allocfork == XFS_COW_FORK) {
error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
end_fsb - offset_fsb, prealloc_blocks, &cmap,
@@ -1162,19 +1168,11 @@ xfs_buffered_write_iomap_begin(
if (error)
goto out_unlock;
- /*
- * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
- * them out if the write happens to fail.
- */
- seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
- xfs_iunlock(ip, lockmode);
trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
-
found_imap:
- seq = xfs_iomap_inode_sequence(ip, 0);
+ seq = xfs_iomap_inode_sequence(ip, iomap_flags);
xfs_iunlock(ip, lockmode);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
+ return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
convert_delay:
xfs_iunlock(ip, lockmode);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (7 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-09-24 7:40 ` [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes Christoph Hellwig
2024-10-05 15:53 ` fix stale delalloc punching for COW I/O v4 Darrick J. Wong
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for
COW fork allocations, even if they don't overlap existing data fork
extents, which will allow the iomap_end callback to detect if it has to
punch stale delalloc blocks from the COW fork instead of the data fork.
It also means we sample the sequence counter for both the data and the
COW fork when writing to the COW fork, which ensures we properly
revalidate when only COW fork changes happen.
This is essentially a revert of commit 72a048c1056a ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.
Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index aff9e8305399ee..cc768f0139d365 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1186,20 +1186,19 @@ xfs_buffered_write_iomap_begin(
return 0;
found_cow:
- seq = xfs_iomap_inode_sequence(ip, 0);
if (imap.br_startoff <= offset_fsb) {
- error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
+ error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0,
+ xfs_iomap_inode_sequence(ip, 0));
if (error)
goto out_unlock;
- seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
- xfs_iunlock(ip, lockmode);
- return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
- IOMAP_F_SHARED, seq);
+ } else {
+ xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
}
- xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
+ iomap_flags = IOMAP_F_SHARED;
+ seq = xfs_iomap_inode_sequence(ip, iomap_flags);
xfs_iunlock(ip, lockmode);
- return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
+ return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, iomap_flags, seq);
out_unlock:
xfs_iunlock(ip, lockmode);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (8 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations Christoph Hellwig
@ 2024-09-24 7:40 ` Christoph Hellwig
2024-10-05 15:53 ` fix stale delalloc punching for COW I/O v4 Darrick J. Wong
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 7:40 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Christian Brauner,
Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.
Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index cc768f0139d365..8a4e5f14ec0c77 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1195,7 +1195,7 @@ xfs_buffered_write_iomap_begin(
xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
}
- iomap_flags = IOMAP_F_SHARED;
+ iomap_flags |= IOMAP_F_SHARED;
seq = xfs_iomap_inode_sequence(ip, iomap_flags);
xfs_iunlock(ip, lockmode);
return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, iomap_flags, seq);
@@ -1212,8 +1212,10 @@ xfs_buffered_write_delalloc_punch(
loff_t length,
struct iomap *iomap)
{
- xfs_bmap_punch_delalloc_range(XFS_I(inode), XFS_DATA_FORK, offset,
- offset + length);
+ xfs_bmap_punch_delalloc_range(XFS_I(inode),
+ (iomap->flags & IOMAP_F_SHARED) ?
+ XFS_COW_FORK : XFS_DATA_FORK,
+ offset, offset + length);
}
static int
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: fix stale delalloc punching for COW I/O v4
2024-09-24 7:40 fix stale delalloc punching for COW I/O v4 Christoph Hellwig
` (9 preceding siblings ...)
2024-09-24 7:40 ` [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes Christoph Hellwig
@ 2024-10-05 15:53 ` Darrick J. Wong
2024-10-07 5:41 ` Christoph Hellwig
10 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2024-10-05 15:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Carlos Maiolino, Christian Brauner, linux-xfs,
linux-fsdevel
On Tue, Sep 24, 2024 at 09:40:42AM +0200, Christoph Hellwig wrote:
> Hi all,
>
> this is another fallout from the zoned XFS work, which stresses the XFS
> COW I/O path very heavily. It affects normal I/O to reflinked files as
> well, but is very hard to hit there.
>
> The main problem here is that we only punch out delalloc reservations
> from the data fork, but COW I/O places delalloc extents into the COW
> fork, which means that it won't get punched out for short writes.
>
> [Sorry for the rapid fire repost, but as we're down to comment changes
> and the series has been fully reviewed except for the trivial
> refactoring patch at the beginning I'd like to get it out before being
> semi-offline for a few days]
Hmmm so I tried applying this series, but now I get this splat:
[ 217.170122] run fstests xfs/574 at 2024-10-04 16:36:30
[ 218.248617] XFS (sda3): EXPERIMENTAL online scrub feature in use. Use at your own risk!
[ 219.030068] XFS (sda4): EXPERIMENTAL exchange-range feature enabled. Use at your own risk!
[ 219.032253] XFS (sda4): EXPERIMENTAL parent pointer feature enabled. Use at your own risk!
[ 219.034749] XFS (sda4): Mounting V5 Filesystem 9de8b32b-dada-468a-b83b-f2c702031b67
[ 219.073377] XFS (sda4): Ending clean mount
[ 219.076260] XFS (sda4): Quotacheck needed: Please wait.
[ 219.083160] XFS (sda4): Quotacheck: Done.
[ 219.116532] XFS (sda4): EXPERIMENTAL online scrub feature in use. Use at your own risk!
[ 306.312461] INFO: task fsstress:27661 blocked for more than 61 seconds.
[ 306.342904] Not tainted 6.12.0-rc1-djwx #rc1
[ 306.344066] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 306.345961] task:fsstress state:D stack:11464 pid:27661 tgid:27661 ppid:1 flags:0x00004004
[ 306.348070] Call Trace:
[ 306.348722] <TASK>
[ 306.349177] __schedule+0x419/0x14c0
[ 306.350024] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 306.351305] schedule+0x2c/0x110
[ 306.352082] schedule_preempt_disabled+0x18/0x30
[ 306.353183] rwsem_down_write_slowpath+0x2a8/0x6a0
[ 306.354229] ? filemap_add_folio+0xc5/0xf0
[ 306.355160] down_write+0x6e/0x70
[ 306.355953] xfs_buffered_write_iomap_end+0xcf/0x130 [xfs 05abef6fe7019e4129a0560dad36f86c40f3b541]
[ 306.358115] iomap_iter+0x78/0x2f0
[ 306.358974] iomap_file_unshare+0x82/0x290
[ 306.359935] xfs_reflink_unshare+0xf4/0x170 [xfs 05abef6fe7019e4129a0560dad36f86c40f3b541]
[ 306.361787] xfs_file_fallocate+0x13f/0x430 [xfs 05abef6fe7019e4129a0560dad36f86c40f3b541]
[ 306.364005] vfs_fallocate+0x110/0x320
[ 306.364984] __x64_sys_fallocate+0x42/0x70
[ 306.365999] do_syscall_64+0x47/0x100
[ 306.366974] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 306.368228] RIP: 0033:0x7f893d3b2cb3
[ 306.369064] RSP: 002b:00007ffd32c796d8 EFLAGS: 00000202 ORIG_RAX: 000000000000011d
[ 306.370650] RAX: ffffffffffffffda RBX: 000000000000131e RCX: 00007f893d3b2cb3
[ 306.372183] RDX: 000000000002eb27 RSI: 0000000000000041 RDI: 0000000000000005
[ 306.373940] RBP: 0000000000000041 R08: 000000000000005f R09: 0000000000000078
[ 306.375365] R10: 00000000000b0483 R11: 0000000000000202 R12: 0000000000000005
[ 306.376763] R13: 000000000002eb27 R14: 0000000000000000 R15: 00000000000b0483
[ 306.378174] </TASK>
I think this series needs to assert that the invalidate_lock is held
(instead of taking it) for the IOMAP_UNSHARE case too, since UNSHARE is
called from fallocate, which has already taken the MMAPLOCK.
--D
> Changes since v3:
> - improve two comments
>
> Changes since v2:
> - drop the patches already merged and rebased to latest Linus' tree
> - moved taking invalidate_lock from iomap to the caller to avoid a
> too complicated locking protocol
> - better document the xfs_file_write_zero_eof return value
> - fix a commit log typo
>
> Changes since v1:
> - move the already reviewed iomap prep changes to the beginning in case
> Christian wants to take them ASAP
> - take the invalidate_lock for post-EOF zeroing so that we have a
> consistent locking pattern for zeroing.
>
> Diffstat:
> Documentation/filesystems/iomap/operations.rst | 2
> fs/iomap/buffered-io.c | 111 ++++++-------------
> fs/xfs/xfs_aops.c | 4
> fs/xfs/xfs_bmap_util.c | 10 +
> fs/xfs/xfs_bmap_util.h | 2
> fs/xfs/xfs_file.c | 146 +++++++++++++++----------
> fs/xfs/xfs_iomap.c | 65 +++++++----
> include/linux/iomap.h | 20 ++-
> 8 files changed, 196 insertions(+), 164 deletions(-)
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: fix stale delalloc punching for COW I/O v4
2024-10-05 15:53 ` fix stale delalloc punching for COW I/O v4 Darrick J. Wong
@ 2024-10-07 5:41 ` Christoph Hellwig
2024-10-07 6:28 ` Darrick J. Wong
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-10-07 5:41 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Carlos Maiolino,
Christian Brauner, linux-xfs, linux-fsdevel
On Sat, Oct 05, 2024 at 08:53:12AM -0700, Darrick J. Wong wrote:
> Hmmm so I tried applying this series, but now I get this splat:
>
> [ 217.170122] run fstests xfs/574 at 2024-10-04 16:36:30
I don't. What xfstests tree is this with?
> I think this series needs to assert that the invalidate_lock is held
> (instead of taking it) for the IOMAP_UNSHARE case too, since UNSHARE is
> called from fallocate, which has already taken the MMAPLOCK.
Yes. I'll look into fixing it, but so far I haven't managed to
reproduce the actual splat.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: fix stale delalloc punching for COW I/O v4
2024-10-07 5:41 ` Christoph Hellwig
@ 2024-10-07 6:28 ` Darrick J. Wong
2024-10-07 6:46 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2024-10-07 6:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Carlos Maiolino, Christian Brauner, linux-xfs,
linux-fsdevel
On Mon, Oct 07, 2024 at 07:41:01AM +0200, Christoph Hellwig wrote:
> On Sat, Oct 05, 2024 at 08:53:12AM -0700, Darrick J. Wong wrote:
> > Hmmm so I tried applying this series, but now I get this splat:
> >
> > [ 217.170122] run fstests xfs/574 at 2024-10-04 16:36:30
>
> I don't. What xfstests tree is this with?
Hum. My latest djwong-wtf xfstests tree. You might have to have the
new funshare patch I sent for fsstress, though iirc that's already in my
-wtf branch.
> > I think this series needs to assert that the invalidate_lock is held
> > (instead of taking it) for the IOMAP_UNSHARE case too, since UNSHARE is
> > called from fallocate, which has already taken the MMAPLOCK.
>
> Yes. I'll look into fixing it, but so far I haven't managed to
> reproduce the actual splat.
<nod>
--D
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: fix stale delalloc punching for COW I/O v4
2024-10-07 6:28 ` Darrick J. Wong
@ 2024-10-07 6:46 ` Christoph Hellwig
2024-10-07 15:20 ` Darrick J. Wong
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-10-07 6:46 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Carlos Maiolino,
Christian Brauner, linux-xfs, linux-fsdevel
On Sun, Oct 06, 2024 at 11:28:41PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 07, 2024 at 07:41:01AM +0200, Christoph Hellwig wrote:
> > On Sat, Oct 05, 2024 at 08:53:12AM -0700, Darrick J. Wong wrote:
> > > Hmmm so I tried applying this series, but now I get this splat:
> > >
> > > [ 217.170122] run fstests xfs/574 at 2024-10-04 16:36:30
> >
> > I don't. What xfstests tree is this with?
>
> Hum. My latest djwong-wtf xfstests tree. You might have to have the
> new funshare patch I sent for fsstress, though iirc that's already in my
> -wtf branch.
No recent fsstress.c changes in your tree, but then again the last
update is from Oct 1st, so you might have just not pushed it out.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: fix stale delalloc punching for COW I/O v4
2024-10-07 6:46 ` Christoph Hellwig
@ 2024-10-07 15:20 ` Darrick J. Wong
0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2024-10-07 15:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Carlos Maiolino, Christian Brauner, linux-xfs,
linux-fsdevel
On Mon, Oct 07, 2024 at 08:46:50AM +0200, Christoph Hellwig wrote:
> On Sun, Oct 06, 2024 at 11:28:41PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 07, 2024 at 07:41:01AM +0200, Christoph Hellwig wrote:
> > > On Sat, Oct 05, 2024 at 08:53:12AM -0700, Darrick J. Wong wrote:
> > > > Hmmm so I tried applying this series, but now I get this splat:
> > > >
> > > > [ 217.170122] run fstests xfs/574 at 2024-10-04 16:36:30
> > >
> > > I don't. What xfstests tree is this with?
> >
> > Hum. My latest djwong-wtf xfstests tree. You might have to have the
> > new funshare patch I sent for fsstress, though iirc that's already in my
> > -wtf branch.
>
> No recent fsstress.c changes in your tree, but then again the last
> update is from Oct 1st, so you might have just not pushed it out.
Ah, right, because the kernel code was obviously busted Friday night, so
I didn't bother pushing anything. Will push everything this afternoon,
esp. now that bfoster rvb'd it...
--D
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 01/10] iomap: factor out a iomap_last_written_block helper
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 15:53 ` Darrick J. Wong
2024-09-23 15:28 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
` (9 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Split out a piece of logic from iomap_file_buffered_write_punch_delalloc
that is useful for all iomap_end implementations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 13 ++-----------
include/linux/iomap.h | 14 ++++++++++++++
2 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 11ea747228aeec..884891ac7a226c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1280,7 +1280,6 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
{
loff_t start_byte;
loff_t end_byte;
- unsigned int blocksize = i_blocksize(inode);
if (iomap->type != IOMAP_DELALLOC)
return;
@@ -1289,16 +1288,8 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
if (!(iomap->flags & IOMAP_F_NEW))
return;
- /*
- * start_byte refers to the first unused block after a short write. If
- * nothing was written, round offset down to point at the first block in
- * the range.
- */
- if (unlikely(!written))
- start_byte = round_down(pos, blocksize);
- else
- start_byte = round_up(pos + written, blocksize);
- end_byte = round_up(pos + length, blocksize);
+ start_byte = iomap_last_written_block(inode, pos, written);
+ end_byte = round_up(pos + length, i_blocksize(inode));
/* Nothing to do if we've written the entire delalloc extent */
if (start_byte >= end_byte)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4ad12a3c8bae22..e62df5d93f04de 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -256,6 +256,20 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
return &i->iomap;
}
+/*
+ * Return the file offset for the first unused block after a short write.
+ *
+ * If nothing was written, round offset down to point at the first block in
+ * the range, else round up to include the partially written block.
+ */
+static inline loff_t iomap_last_written_block(struct inode *inode, loff_t pos,
+ ssize_t written)
+{
+ if (unlikely(!written))
+ return round_down(pos, i_blocksize(inode));
+ return round_up(pos + written, i_blocksize(inode));
+}
+
ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops, void *private);
int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] iomap: factor out a iomap_last_written_block helper
2024-09-23 15:28 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
@ 2024-09-23 15:53 ` Darrick J. Wong
2024-09-24 5:45 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2024-09-23 15:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel
On Mon, Sep 23, 2024 at 05:28:15PM +0200, Christoph Hellwig wrote:
> Split out a piece of logic from iomap_file_buffered_write_punch_delalloc
> that is useful for all iomap_end implementations.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/iomap/buffered-io.c | 13 ++-----------
> include/linux/iomap.h | 14 ++++++++++++++
> 2 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 11ea747228aeec..884891ac7a226c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1280,7 +1280,6 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
> {
> loff_t start_byte;
> loff_t end_byte;
> - unsigned int blocksize = i_blocksize(inode);
>
> if (iomap->type != IOMAP_DELALLOC)
> return;
> @@ -1289,16 +1288,8 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
> if (!(iomap->flags & IOMAP_F_NEW))
> return;
>
> - /*
> - * start_byte refers to the first unused block after a short write. If
> - * nothing was written, round offset down to point at the first block in
> - * the range.
> - */
> - if (unlikely(!written))
> - start_byte = round_down(pos, blocksize);
> - else
> - start_byte = round_up(pos + written, blocksize);
> - end_byte = round_up(pos + length, blocksize);
> + start_byte = iomap_last_written_block(inode, pos, written);
> + end_byte = round_up(pos + length, i_blocksize(inode));
>
> /* Nothing to do if we've written the entire delalloc extent */
> if (start_byte >= end_byte)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4ad12a3c8bae22..e62df5d93f04de 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -256,6 +256,20 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
> return &i->iomap;
> }
>
> +/*
> + * Return the file offset for the first unused block after a short write.
the first *unchanged* block after a short write?
> + *
> + * If nothing was written, round offset down to point at the first block in
Might as well make explicit which variable we're operating on:
"...round @pos down..."
--D
> + * the range, else round up to include the partially written block.
> + */
> +static inline loff_t iomap_last_written_block(struct inode *inode, loff_t pos,
> + ssize_t written)
> +{
> + if (unlikely(!written))
> + return round_down(pos, i_blocksize(inode));
> + return round_up(pos + written, i_blocksize(inode));
> +}
> +
> ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
> const struct iomap_ops *ops, void *private);
> int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] iomap: factor out a iomap_last_written_block helper
2024-09-23 15:53 ` Darrick J. Wong
@ 2024-09-24 5:45 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 5:45 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
linux-fsdevel
On Mon, Sep 23, 2024 at 08:53:19AM -0700, Darrick J. Wong wrote:
> > +/*
> > + * Return the file offset for the first unused block after a short write.
>
> the first *unchanged* block after a short write?
>
> > + *
> > + * If nothing was written, round offset down to point at the first block in
>
> Might as well make explicit which variable we're operating on:
> "...round @pos down..."
I've updated this for the next version, thanks.
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
2024-09-23 15:28 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 16:18 ` Darrick J. Wong
2024-09-23 15:28 ` [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release Christoph Hellwig
` (8 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Currently iomap_file_buffered_write_punch_delalloc can be called from
XFS either with the invalidate lock held or not. To fix this while
keeping the locking in the file system and not the iomap library
code we'll need to lift the locking up into the file system.
To prepare for that, open code iomap_file_buffered_write_punch_delalloc
in the only caller, and instead export iomap_write_delalloc_release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
.../filesystems/iomap/operations.rst | 2 +-
fs/iomap/buffered-io.c | 85 ++++++-------------
fs/xfs/xfs_iomap.c | 16 +++-
include/linux/iomap.h | 6 +-
4 files changed, 46 insertions(+), 63 deletions(-)
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 8e6c721d233010..b93115ab8748ae 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
such `reservations
<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
because writeback will not consume the reservation.
-The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
+The ``iomap_write_delalloc_release`` can be called from a
``->iomap_end`` function to find all the clean areas of the folios
caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
It takes the ``invalidate_lock``.
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 884891ac7a226c..237aeb883166df 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1149,6 +1149,32 @@ static void iomap_write_delalloc_scan(struct inode *inode,
* have dirty data still pending in the page cache - those are going to be
* written and so must still retain the delalloc backing for writeback.
*
+ * When a short write occurs, the filesystem may need to remove reserved space
+ * that was allocated in ->iomap_begin from it's ->iomap_end method. For
+ * filesystems that use delayed allocation, we need to punch out delalloc
+ * extents from the range that are not dirty in the page cache. As the write can
+ * race with page faults, there can be dirty pages over the delalloc extent
+ * outside the range of a short write but still within the delalloc extent
+ * allocated for this iomap.
+ *
+ * The punch() callback *must* only punch delalloc extents in the range passed
+ * to it. It must skip over all other types of extents in the range and leave
+ * them completely unchanged. It must do this punch atomically with respect to
+ * other extent modifications.
+ *
+ * The punch() callback may be called with a folio locked to prevent writeback
+ * extent allocation racing at the edge of the range we are currently punching.
+ * The locked folio may or may not cover the range being punched, so it is not
+ * safe for the punch() callback to lock folios itself.
+ *
+ * Lock order is:
+ *
+ * inode->i_rwsem (shared or exclusive)
+ * inode->i_mapping->invalidate_lock (exclusive)
+ * folio_lock()
+ * ->punch
+ * internal filesystem allocation lock
+ *
* As we are scanning the page cache for data, we don't need to reimplement the
* wheel - mapping_seek_hole_data() does exactly what we need to identify the
* start and end of data ranges correctly even for sub-folio block sizes. This
@@ -1177,7 +1203,7 @@ static void iomap_write_delalloc_scan(struct inode *inode,
* require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
* the code to subtle off-by-one bugs....
*/
-static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
+void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
loff_t end_byte, unsigned flags, struct iomap *iomap,
iomap_punch_t punch)
{
@@ -1243,62 +1269,7 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
out_unlock:
filemap_invalidate_unlock(inode->i_mapping);
}
-
-/*
- * When a short write occurs, the filesystem may need to remove reserved space
- * that was allocated in ->iomap_begin from it's ->iomap_end method. For
- * filesystems that use delayed allocation, we need to punch out delalloc
- * extents from the range that are not dirty in the page cache. As the write can
- * race with page faults, there can be dirty pages over the delalloc extent
- * outside the range of a short write but still within the delalloc extent
- * allocated for this iomap.
- *
- * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
- * simplify range iterations.
- *
- * The punch() callback *must* only punch delalloc extents in the range passed
- * to it. It must skip over all other types of extents in the range and leave
- * them completely unchanged. It must do this punch atomically with respect to
- * other extent modifications.
- *
- * The punch() callback may be called with a folio locked to prevent writeback
- * extent allocation racing at the edge of the range we are currently punching.
- * The locked folio may or may not cover the range being punched, so it is not
- * safe for the punch() callback to lock folios itself.
- *
- * Lock order is:
- *
- * inode->i_rwsem (shared or exclusive)
- * inode->i_mapping->invalidate_lock (exclusive)
- * folio_lock()
- * ->punch
- * internal filesystem allocation lock
- */
-void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
- loff_t pos, loff_t length, ssize_t written, unsigned flags,
- struct iomap *iomap, iomap_punch_t punch)
-{
- loff_t start_byte;
- loff_t end_byte;
-
- if (iomap->type != IOMAP_DELALLOC)
- return;
-
- /* If we didn't reserve the blocks, we're not allowed to punch them. */
- if (!(iomap->flags & IOMAP_F_NEW))
- return;
-
- start_byte = iomap_last_written_block(inode, pos, written);
- end_byte = round_up(pos + length, i_blocksize(inode));
-
- /* Nothing to do if we've written the entire delalloc extent */
- if (start_byte >= end_byte)
- return;
-
- iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
- punch);
-}
-EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
+EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
static loff_t iomap_unshare_iter(struct iomap_iter *iter)
{
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1e11f48814c0d0..30f2530b6d5461 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1227,8 +1227,20 @@ xfs_buffered_write_iomap_end(
unsigned flags,
struct iomap *iomap)
{
- iomap_file_buffered_write_punch_delalloc(inode, offset, length, written,
- flags, iomap, &xfs_buffered_write_delalloc_punch);
+ loff_t start_byte, end_byte;
+
+ /* If we didn't reserve the blocks, we're not allowed to punch them. */
+ if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
+ return 0;
+
+ /* Nothing to do if we've written the entire delalloc extent */
+ start_byte = iomap_last_written_block(inode, offset, written);
+ end_byte = round_up(offset + length, i_blocksize(inode));
+ if (start_byte >= end_byte)
+ return 0;
+
+ iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
+ xfs_buffered_write_delalloc_punch);
return 0;
}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e62df5d93f04de..137e0783faa224 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -290,9 +290,9 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
struct iomap *iomap);
-void iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
- loff_t length, ssize_t written, unsigned flag,
- struct iomap *iomap, iomap_punch_t punch);
+void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
+ loff_t end_byte, unsigned flags, struct iomap *iomap,
+ iomap_punch_t punch);
int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len, const struct iomap_ops *ops);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
2024-09-23 15:28 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
@ 2024-09-23 16:18 ` Darrick J. Wong
2024-09-23 22:43 ` Dave Chinner
2024-09-24 5:55 ` Christoph Hellwig
0 siblings, 2 replies; 41+ messages in thread
From: Darrick J. Wong @ 2024-09-23 16:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel
On Mon, Sep 23, 2024 at 05:28:16PM +0200, Christoph Hellwig wrote:
> Currently iomap_file_buffered_write_punch_delalloc can be called from
> XFS either with the invalidate lock held or not. To fix this while
> keeping the locking in the file system and not the iomap library
> code we'll need to lift the locking up into the file system.
>
> To prepare for that, open code iomap_file_buffered_write_punch_delalloc
> in the only caller, and instead export iomap_write_delalloc_release.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> .../filesystems/iomap/operations.rst | 2 +-
> fs/iomap/buffered-io.c | 85 ++++++-------------
> fs/xfs/xfs_iomap.c | 16 +++-
> include/linux/iomap.h | 6 +-
> 4 files changed, 46 insertions(+), 63 deletions(-)
>
> diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> index 8e6c721d233010..b93115ab8748ae 100644
> --- a/Documentation/filesystems/iomap/operations.rst
> +++ b/Documentation/filesystems/iomap/operations.rst
> @@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
> such `reservations
> <https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
> because writeback will not consume the reservation.
> -The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
> +The ``iomap_write_delalloc_release`` can be called from a
> ``->iomap_end`` function to find all the clean areas of the folios
> caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
> It takes the ``invalidate_lock``.
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 884891ac7a226c..237aeb883166df 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1149,6 +1149,32 @@ static void iomap_write_delalloc_scan(struct inode *inode,
> * have dirty data still pending in the page cache - those are going to be
> * written and so must still retain the delalloc backing for writeback.
> *
> + * When a short write occurs, the filesystem may need to remove reserved space
> + * that was allocated in ->iomap_begin from it's ->iomap_end method. For
"When a short write occurs, the filesystem may need to remove space
reservations created in ->iomap_begin."
> + * filesystems that use delayed allocation, we need to punch out delalloc
> + * extents from the range that are not dirty in the page cache. As the write can
> + * race with page faults, there can be dirty pages over the delalloc extent
> + * outside the range of a short write but still within the delalloc extent
> + * allocated for this iomap.
> + *
> + * The punch() callback *must* only punch delalloc extents in the range passed
> + * to it. It must skip over all other types of extents in the range and leave
> + * them completely unchanged. It must do this punch atomically with respect to
> + * other extent modifications.
Can a failing buffered write race with a write fault to the same file
range?
write() thread: page_mkwrite thread:
--------------- --------------------
take i_rwsem
->iomap_begin
create da reservation
lock folio
fail to write
unlock folio
take invalidation lock
lock folio
->iomap_begin
sees da reservation
mark folio dirty
unlock folio
drop invalidation lock
->iomap_end
take invalidation lock
iomap_write_delalloc_release
drop invalidation lock
Can we end up in this situation, where the write fault thinks it has a
dirty page backed by a delalloc reservation, yet the delalloc
reservation gets removed by the delalloc punch logic? I think the
answer to my question is that this sequence is impossible because the
write fault dirties the folio so the iomap_write_delalloc_release does
nothing, correct?
Unrelated question about iomap_write_begin: Can we get rid of the
!mapping_large_folio_support if-body just prior to __iomap_get_folio?
filemap_get_folio won't return large folios if
!mapping_large_folio_support, so I think the separate check in iomap
isn't needed anymore?
This push-down looks fine though, so
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> + *
> + * The punch() callback may be called with a folio locked to prevent writeback
> + * extent allocation racing at the edge of the range we are currently punching.
> + * The locked folio may or may not cover the range being punched, so it is not
> + * safe for the punch() callback to lock folios itself.
> + *
> + * Lock order is:
> + *
> + * inode->i_rwsem (shared or exclusive)
> + * inode->i_mapping->invalidate_lock (exclusive)
> + * folio_lock()
> + * ->punch
> + * internal filesystem allocation lock
> + *
> * As we are scanning the page cache for data, we don't need to reimplement the
> * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> * start and end of data ranges correctly even for sub-folio block sizes. This
> @@ -1177,7 +1203,7 @@ static void iomap_write_delalloc_scan(struct inode *inode,
> * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
> * the code to subtle off-by-one bugs....
> */
> -static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> +void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> loff_t end_byte, unsigned flags, struct iomap *iomap,
> iomap_punch_t punch)
> {
> @@ -1243,62 +1269,7 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> out_unlock:
> filemap_invalidate_unlock(inode->i_mapping);
> }
> -
> -/*
> - * When a short write occurs, the filesystem may need to remove reserved space
> - * that was allocated in ->iomap_begin from it's ->iomap_end method. For
> - * filesystems that use delayed allocation, we need to punch out delalloc
> - * extents from the range that are not dirty in the page cache. As the write can
> - * race with page faults, there can be dirty pages over the delalloc extent
> - * outside the range of a short write but still within the delalloc extent
> - * allocated for this iomap.
> - *
> - * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
> - * simplify range iterations.
> - *
> - * The punch() callback *must* only punch delalloc extents in the range passed
> - * to it. It must skip over all other types of extents in the range and leave
> - * them completely unchanged. It must do this punch atomically with respect to
> - * other extent modifications.
> - *
> - * The punch() callback may be called with a folio locked to prevent writeback
> - * extent allocation racing at the edge of the range we are currently punching.
> - * The locked folio may or may not cover the range being punched, so it is not
> - * safe for the punch() callback to lock folios itself.
> - *
> - * Lock order is:
> - *
> - * inode->i_rwsem (shared or exclusive)
> - * inode->i_mapping->invalidate_lock (exclusive)
> - * folio_lock()
> - * ->punch
> - * internal filesystem allocation lock
> - */
> -void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
> - loff_t pos, loff_t length, ssize_t written, unsigned flags,
> - struct iomap *iomap, iomap_punch_t punch)
> -{
> - loff_t start_byte;
> - loff_t end_byte;
> -
> - if (iomap->type != IOMAP_DELALLOC)
> - return;
> -
> - /* If we didn't reserve the blocks, we're not allowed to punch them. */
> - if (!(iomap->flags & IOMAP_F_NEW))
> - return;
> -
> - start_byte = iomap_last_written_block(inode, pos, written);
> - end_byte = round_up(pos + length, i_blocksize(inode));
> -
> - /* Nothing to do if we've written the entire delalloc extent */
> - if (start_byte >= end_byte)
> - return;
> -
> - iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> - punch);
> -}
> -EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
> +EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
>
> static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> {
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1e11f48814c0d0..30f2530b6d5461 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1227,8 +1227,20 @@ xfs_buffered_write_iomap_end(
> unsigned flags,
> struct iomap *iomap)
> {
> - iomap_file_buffered_write_punch_delalloc(inode, offset, length, written,
> - flags, iomap, &xfs_buffered_write_delalloc_punch);
> + loff_t start_byte, end_byte;
> +
> + /* If we didn't reserve the blocks, we're not allowed to punch them. */
> + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> + return 0;
> +
> + /* Nothing to do if we've written the entire delalloc extent */
> + start_byte = iomap_last_written_block(inode, offset, written);
> + end_byte = round_up(offset + length, i_blocksize(inode));
> + if (start_byte >= end_byte)
> + return 0;
> +
> + iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> + xfs_buffered_write_delalloc_punch);
> return 0;
> }
>
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e62df5d93f04de..137e0783faa224 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -290,9 +290,9 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
>
> typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
> struct iomap *iomap);
> -void iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
> - loff_t length, ssize_t written, unsigned flag,
> - struct iomap *iomap, iomap_punch_t punch);
> +void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> + loff_t end_byte, unsigned flags, struct iomap *iomap,
> + iomap_punch_t punch);
>
> int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> u64 start, u64 len, const struct iomap_ops *ops);
> --
> 2.45.2
>
* Re: [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
2024-09-23 16:18 ` Darrick J. Wong
@ 2024-09-23 22:43 ` Dave Chinner
2024-09-24 5:55 ` Christoph Hellwig
1 sibling, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2024-09-23 22:43 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
linux-fsdevel
On Mon, Sep 23, 2024 at 09:18:25AM -0700, Darrick J. Wong wrote:
> On Mon, Sep 23, 2024 at 05:28:16PM +0200, Christoph Hellwig wrote:
> > Currently iomap_file_buffered_write_punch_delalloc can be called from
> > XFS either with the invalidate lock held or not. To fix this while
> > keeping the locking in the file system and not the iomap library
> > code we'll need to lift the locking up into the file system.
> >
> > To prepare for that, open code iomap_file_buffered_write_punch_delalloc
> > in the only caller, and instead export iomap_write_delalloc_release.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > .../filesystems/iomap/operations.rst | 2 +-
> > fs/iomap/buffered-io.c | 85 ++++++-------------
> > fs/xfs/xfs_iomap.c | 16 +++-
> > include/linux/iomap.h | 6 +-
> > 4 files changed, 46 insertions(+), 63 deletions(-)
> >
> > diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> > index 8e6c721d233010..b93115ab8748ae 100644
> > --- a/Documentation/filesystems/iomap/operations.rst
> > +++ b/Documentation/filesystems/iomap/operations.rst
> > @@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
> > such `reservations
> > <https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
> > because writeback will not consume the reservation.
> > -The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
> > +The ``iomap_write_delalloc_release`` can be called from a
> > ``->iomap_end`` function to find all the clean areas of the folios
> > caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
> > It takes the ``invalidate_lock``.
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 884891ac7a226c..237aeb883166df 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1149,6 +1149,32 @@ static void iomap_write_delalloc_scan(struct inode *inode,
> > * have dirty data still pending in the page cache - those are going to be
> > * written and so must still retain the delalloc backing for writeback.
> > *
> > + * When a short write occurs, the filesystem may need to remove reserved space
> > + * that was allocated in ->iomap_begin from it's ->iomap_end method. For
>
> "When a short write occurs, the filesystem may need to remove space
> reservations created in ->iomap_begin.
>
> > + * filesystems that use delayed allocation, we need to punch out delalloc
> > + * extents from the range that are not dirty in the page cache. As the write can
> > + * race with page faults, there can be dirty pages over the delalloc extent
> > + * outside the range of a short write but still within the delalloc extent
> > + * allocated for this iomap.
> > + *
> > + * The punch() callback *must* only punch delalloc extents in the range passed
> > + * to it. It must skip over all other types of extents in the range and leave
> > + * them completely unchanged. It must do this punch atomically with respect to
> > + * other extent modifications.
>
> Can a failing buffered write race with a write fault to the same file
> range?
Yes.
> write() thread: page_mkwrite thread:
> --------------- --------------------
> take i_rwsem
> ->iomap_begin
> create da reservation
> lock folio
> fail to write
> unlock folio
> take invalidation lock
> lock folio
> ->iomap_begin
> sees da reservation
> mark folio dirty
> unlock folio
> drop invalidation lock
> ->iomap_end
> take invalidation lock
> iomap_write_delalloc_release
> drop invalidation lock
>
> Can we end up in this situation, where the write fault thinks it has a
> dirty page backed by a delalloc reservation, yet the delalloc
> reservation gets removed by the delalloc punch logic?
No.
> I think the
> answer to my question is that this sequence is impossible because the
> write fault dirties the folio so the iomap_write_delalloc_release does
> nothing, correct?
Yes.
The above situation is the race condition that the delalloc punching
code is taking into account when it checks for dirty data over the
range being punched. As the comment above
iomap_write_delalloc_release() says:
/*
* Punch out all the delalloc blocks in the range given except for those that
* have dirty data still pending in the page cache - those are going to be
* written and so must still retain the delalloc backing for writeback.
*
....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
2024-09-23 16:18 ` Darrick J. Wong
2024-09-23 22:43 ` Dave Chinner
@ 2024-09-24 5:55 ` Christoph Hellwig
2024-09-24 6:05 ` Darrick J. Wong
1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 5:55 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
linux-fsdevel
On Mon, Sep 23, 2024 at 09:18:25AM -0700, Darrick J. Wong wrote:
> > + * When a short write occurs, the filesystem may need to remove reserved space
> > + * that was allocated in ->iomap_begin from it's ->iomap_end method. For
>
> "When a short write occurs, the filesystem may need to remove space
> reservations created in ->iomap_begin.
This just moved the text from the existing comment. I agree that your
wording is better, but I'd keep the "from it's ->iomap_end".
> Unrelated question about iomap_write_begin: Can we get rid of the
> !mapping_large_folio_support if-body just prior to __iomap_get_folio?
> filemap_get_folio won't return large folios if
> !mapping_large_folio_support, so I think the separate check in iomap
> isn't needed anymore?
From the iomap POV it seems like we could (after checking no one
is doing something weird with len in ->get_folio).
* Re: [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
2024-09-24 5:55 ` Christoph Hellwig
@ 2024-09-24 6:05 ` Darrick J. Wong
2024-09-24 6:10 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2024-09-24 6:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel
On Tue, Sep 24, 2024 at 07:55:33AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 23, 2024 at 09:18:25AM -0700, Darrick J. Wong wrote:
> > > + * When a short write occurs, the filesystem may need to remove reserved space
> > > + * that was allocated in ->iomap_begin from it's ->iomap_end method. For
> >
> > "When a short write occurs, the filesystem may need to remove space
> > reservations created in ->iomap_begin.
>
> This just moved the text from the existing comment. I agree that your
> wording is better, but I'd keep the "from it's ->iomap_end".
Yeah, please do. What do you think of:
"When a short write occurs, the filesystem might need to use ->iomap_end
to remove space reservations created in ->iomap_begin." ?
> > Unrelated question about iomap_write_begin: Can we get rid of the
> > !mapping_large_folio_support if-body just prior to __iomap_get_folio?
> > filemap_get_folio won't return large folios if
> > !mapping_large_folio_support, so I think the separate check in iomap
> > isn't needed anymore?
>
> From the iomap POV it seems like we could (after checking no one
> is doing something weird with len in ->get_folio).
The only user I know of is gfs2, which allocates a transaction and then
calls iomap_get_folio with pos/len unchanged.
--D
* Re: [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
2024-09-24 6:05 ` Darrick J. Wong
@ 2024-09-24 6:10 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 6:10 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
linux-fsdevel
On Mon, Sep 23, 2024 at 11:05:16PM -0700, Darrick J. Wong wrote:
> Yeah, please do. What do you think of:
>
> "When a short write occurs, the filesystem might need to use ->iomap_end
> to remove space reservations created in ->iomap_begin." ?
Sounds good, I'll update it again.
> > > Unrelated question about iomap_write_begin: Can we get rid of the
> > > !mapping_large_folio_support if-body just prior to __iomap_get_folio?
> > > filemap_get_folio won't return large folios if
> > > !mapping_large_folio_support, so I think the separate check in iomap
> > > isn't needed anymore?
> >
> > From the iomap POV it seems like we could (after checking no one
> > is doing something weird with len in ->get_folio).
>
> The only user I know of is gfs2, which allocates a transaction and then
> calls iomap_get_folio with pos/len unchanged.
Yeah, so it _should_ be fine. Not really feeling like changing it now
with all the other stuff in flight, though. And eventually I really
want to sort out a few things in the area, like confusing
__iomap_get_folio naming for the wrapper with iomap_get_folio as
the default instance, and the fact that the get_folio and invalidation
are indirect calls right next to each other.
>
> --D
---end quoted text---
* [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
2024-09-23 15:28 ` [PATCH 01/10] iomap: factor out a iomap_last_written_block helper Christoph Hellwig
2024-09-23 15:28 ` [PATCH 02/10] iomap: remove iomap_file_buffered_write_punch_delalloc Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 16:19 ` Darrick J. Wong
2024-09-23 15:28 ` [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
` (7 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
XFS (which currently is the only user of iomap_write_delalloc_release)
already holds invalidate_lock for most zeroing operations. To be able
to avoid a deadlock it needs to stop taking the lock, but doing so
in iomap would leak XFS locking details into iomap.
To avoid this require the caller to hold invalidate_lock when calling
iomap_write_delalloc_release instead of taking it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 17 ++++++++---------
fs/xfs/xfs_iomap.c | 2 ++
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 237aeb883166df..232aaa1e86451a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1211,12 +1211,13 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
loff_t scan_end_byte = min(i_size_read(inode), end_byte);
/*
- * Lock the mapping to avoid races with page faults re-instantiating
- * folios and dirtying them via ->page_mkwrite whilst we walk the
- * cache and perform delalloc extent removal. Failing to do this can
- * leave dirty pages with no space reservation in the cache.
+ * The caller must hold invalidate_lock to avoid races with page faults
+ * re-instantiating folios and dirtying them via ->page_mkwrite whilst
+ * we walk the cache and perform delalloc extent removal. Failing to do
+ * this can leave dirty pages with no space reservation in the cache.
*/
- filemap_invalidate_lock(inode->i_mapping);
+ lockdep_assert_held_write(&inode->i_mapping->invalidate_lock);
+
while (start_byte < scan_end_byte) {
loff_t data_end;
@@ -1233,7 +1234,7 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
if (start_byte == -ENXIO || start_byte == scan_end_byte)
break;
if (WARN_ON_ONCE(start_byte < 0))
- goto out_unlock;
+ return;
WARN_ON_ONCE(start_byte < punch_start_byte);
WARN_ON_ONCE(start_byte > scan_end_byte);
@@ -1244,7 +1245,7 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
scan_end_byte, SEEK_HOLE);
if (WARN_ON_ONCE(data_end < 0))
- goto out_unlock;
+ return;
/*
* If we race with post-direct I/O invalidation of the page cache,
@@ -1266,8 +1267,6 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
if (punch_start_byte < end_byte)
punch(inode, punch_start_byte, end_byte - punch_start_byte,
iomap);
-out_unlock:
- filemap_invalidate_unlock(inode->i_mapping);
}
EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 30f2530b6d5461..01324da63fcfc7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1239,8 +1239,10 @@ xfs_buffered_write_iomap_end(
if (start_byte >= end_byte)
return 0;
+ filemap_invalidate_lock(inode->i_mapping);
iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
xfs_buffered_write_delalloc_punch);
+ filemap_invalidate_unlock(inode->i_mapping);
return 0;
}
--
2.45.2
* Re: [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release
2024-09-23 15:28 ` [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release Christoph Hellwig
@ 2024-09-23 16:19 ` Darrick J. Wong
0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2024-09-23 16:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel
On Mon, Sep 23, 2024 at 05:28:17PM +0200, Christoph Hellwig wrote:
> XFS (which currently is the only user of iomap_write_delalloc_release)
> already holds invalidate_lock for most zeroing operations. To be able
> to avoid a deadlock it needs to stop taking the lock, but doing so
> in iomap would leak XFS locking details into iomap.
>
> To avoid this require the caller to hold invalidate_lock when calling
> iomap_write_delalloc_release instead of taking it there.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Yes, I like this much better. I'm glad that Dave pointed out the
inconsistency of the locking (iomap doesn't take locks, filesystems take
locks) model in this one odd case.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/iomap/buffered-io.c | 17 ++++++++---------
> fs/xfs/xfs_iomap.c | 2 ++
> 2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 237aeb883166df..232aaa1e86451a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1211,12 +1211,13 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> loff_t scan_end_byte = min(i_size_read(inode), end_byte);
>
> /*
> - * Lock the mapping to avoid races with page faults re-instantiating
> - * folios and dirtying them via ->page_mkwrite whilst we walk the
> - * cache and perform delalloc extent removal. Failing to do this can
> - * leave dirty pages with no space reservation in the cache.
> + * The caller must hold invalidate_lock to avoid races with page faults
> + * re-instantiating folios and dirtying them via ->page_mkwrite whilst
> + * we walk the cache and perform delalloc extent removal. Failing to do
> + * this can leave dirty pages with no space reservation in the cache.
> */
> - filemap_invalidate_lock(inode->i_mapping);
> + lockdep_assert_held_write(&inode->i_mapping->invalidate_lock);
> +
> while (start_byte < scan_end_byte) {
> loff_t data_end;
>
> @@ -1233,7 +1234,7 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> if (start_byte == -ENXIO || start_byte == scan_end_byte)
> break;
> if (WARN_ON_ONCE(start_byte < 0))
> - goto out_unlock;
> + return;
> WARN_ON_ONCE(start_byte < punch_start_byte);
> WARN_ON_ONCE(start_byte > scan_end_byte);
>
> @@ -1244,7 +1245,7 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> scan_end_byte, SEEK_HOLE);
> if (WARN_ON_ONCE(data_end < 0))
> - goto out_unlock;
> + return;
>
> /*
> * If we race with post-direct I/O invalidation of the page cache,
> @@ -1266,8 +1267,6 @@ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> if (punch_start_byte < end_byte)
> punch(inode, punch_start_byte, end_byte - punch_start_byte,
> iomap);
> -out_unlock:
> - filemap_invalidate_unlock(inode->i_mapping);
> }
> EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 30f2530b6d5461..01324da63fcfc7 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1239,8 +1239,10 @@ xfs_buffered_write_iomap_end(
> if (start_byte >= end_byte)
> return 0;
>
> + filemap_invalidate_lock(inode->i_mapping);
> iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> xfs_buffered_write_delalloc_punch);
> + filemap_invalidate_unlock(inode->i_mapping);
> return 0;
> }
>
> --
> 2.45.2
>
* [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (2 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 03/10] iomap: move locking out of iomap_write_delalloc_release Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 16:20 ` Darrick J. Wong
2024-09-23 15:28 ` [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
` (6 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Split a helper out of xfs_file_write_checks that just deals with the
post-EOF zeroing to keep the code readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_file.c | 140 +++++++++++++++++++++++++++-------------------
1 file changed, 82 insertions(+), 58 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 412b1d71b52b7d..3efb0da2a910d6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,10 +347,77 @@ xfs_file_splice_read(
return ret;
}
+/*
+ * Take care of zeroing post-EOF blocks when they might exist.
+ *
+ * Returns 0 if successfully, a negative error for a failure, or 1 if this
+ * function dropped the iolock and reacquired it exclusively and the caller
+ * needs to restart the write sanity checks.
+ */
+static ssize_t
+xfs_file_write_zero_eof(
+ struct kiocb *iocb,
+ struct iov_iter *from,
+ unsigned int *iolock,
+ size_t count,
+ bool *drained_dio)
+{
+ struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
+ loff_t isize;
+
+ /*
+ * We need to serialise against EOF updates that occur in IO completions
+ * here. We want to make sure that nobody is changing the size while
+ * we do this check until we have placed an IO barrier (i.e. hold
+ * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
+ * spinlock effectively forms a memory barrier once we have
+ * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
+ * hence be able to correctly determine if we need to run zeroing.
+ */
+ spin_lock(&ip->i_flags_lock);
+ isize = i_size_read(VFS_I(ip));
+ if (iocb->ki_pos <= isize) {
+ spin_unlock(&ip->i_flags_lock);
+ return 0;
+ }
+ spin_unlock(&ip->i_flags_lock);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+
+ if (!*drained_dio) {
+ /*
+ * If zeroing is needed and we are currently holding the iolock
+ * shared, we need to update it to exclusive which implies
+ * having to redo all checks before.
+ */
+ if (*iolock == XFS_IOLOCK_SHARED) {
+ xfs_iunlock(ip, *iolock);
+ *iolock = XFS_IOLOCK_EXCL;
+ xfs_ilock(ip, *iolock);
+ iov_iter_reexpand(from, count);
+ }
+
+ /*
+ * We now have an IO submission barrier in place, but AIO can do
+ * EOF updates during IO completion and hence we now need to
+ * wait for all of them to drain. Non-AIO DIO will have drained
+ * before we are given the XFS_IOLOCK_EXCL, and so for most
+ * cases this wait is a no-op.
+ */
+ inode_dio_wait(VFS_I(ip));
+ *drained_dio = true;
+ return 1;
+ }
+
+ trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
+ return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+}
+
/*
* Common pre-write limit and setup checks.
*
- * Called with the iolocked held either shared and exclusive according to
+ * Called with the iolock held either shared and exclusive according to
* @iolock, and returns with it held. Might upgrade the iolock to exclusive
* if called for a direct write beyond i_size.
*/
@@ -360,13 +427,10 @@ xfs_file_write_checks(
struct iov_iter *from,
unsigned int *iolock)
{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct xfs_inode *ip = XFS_I(inode);
- ssize_t error = 0;
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
size_t count = iov_iter_count(from);
bool drained_dio = false;
- loff_t isize;
+ ssize_t error;
restart:
error = generic_write_checks(iocb, from);
@@ -389,7 +453,7 @@ xfs_file_write_checks(
* exclusively.
*/
if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
- xfs_iunlock(ip, *iolock);
+ xfs_iunlock(XFS_I(inode), *iolock);
*iolock = XFS_IOLOCK_EXCL;
error = xfs_ilock_iocb(iocb, *iolock);
if (error) {
@@ -400,64 +464,24 @@ xfs_file_write_checks(
}
/*
- * If the offset is beyond the size of the file, we need to zero any
+ * If the offset is beyond the size of the file, we need to zero all
* blocks that fall between the existing EOF and the start of this
- * write. If zeroing is needed and we are currently holding the iolock
- * shared, we need to update it to exclusive which implies having to
- * redo all checks before.
- *
- * We need to serialise against EOF updates that occur in IO completions
- * here. We want to make sure that nobody is changing the size while we
- * do this check until we have placed an IO barrier (i.e. hold the
- * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
- * spinlock effectively forms a memory barrier once we have the
- * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
- * hence be able to correctly determine if we need to run zeroing.
+ * write.
*
- * We can do an unlocked check here safely as IO completion can only
- * extend EOF. Truncate is locked out at this point, so the EOF can
- * not move backwards, only forwards. Hence we only need to take the
- * slow path and spin locks when we are at or beyond the current EOF.
+ * We can do an unlocked check for i_size here safely as I/O completion
+ * can only extend EOF. Truncate is locked out at this point, so the
+ * EOF can not move backwards, only forwards. Hence we only need to take
+ * the slow path when we are at or beyond the current EOF.
*/
- if (iocb->ki_pos <= i_size_read(inode))
- goto out;
-
- spin_lock(&ip->i_flags_lock);
- isize = i_size_read(inode);
- if (iocb->ki_pos > isize) {
- spin_unlock(&ip->i_flags_lock);
-
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
-
- if (!drained_dio) {
- if (*iolock == XFS_IOLOCK_SHARED) {
- xfs_iunlock(ip, *iolock);
- *iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, *iolock);
- iov_iter_reexpand(from, count);
- }
- /*
- * We now have an IO submission barrier in place, but
- * AIO can do EOF updates during IO completion and hence
- * we now need to wait for all of them to drain. Non-AIO
- * DIO will have drained before we are given the
- * XFS_IOLOCK_EXCL, and so for most cases this wait is a
- * no-op.
- */
- inode_dio_wait(inode);
- drained_dio = true;
+ if (iocb->ki_pos > i_size_read(inode)) {
+ error = xfs_file_write_zero_eof(iocb, from, iolock, count,
+ &drained_dio);
+ if (error == 1)
goto restart;
- }
-
- trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
- error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
if (error)
return error;
- } else
- spin_unlock(&ip->i_flags_lock);
+ }
-out:
return kiocb_modified(iocb);
}
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
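The 0 / negative / 1 calling convention the new helper documents can be modeled in a stand-alone sketch; all `mock_*` names below are hypothetical stand-ins, not real XFS functions:

```c
#include <assert.h>

enum { LOCK_SHARED, LOCK_EXCL };

/* Hypothetical stand-in for the state xfs_file_write_zero_eof() works on. */
struct mock_write {
	long long pos;       /* write offset (think iocb->ki_pos) */
	long long isize;     /* current EOF */
	int lock;            /* LOCK_SHARED or LOCK_EXCL */
	int drained_dio;     /* waited for in-flight DIO already? */
	int zeroed;          /* set once the post-EOF range was "zeroed" */
};

/* Returns 0 on success, a negative error, or 1 if the lock was upgraded
 * and the caller has to restart its write sanity checks. */
static int mock_zero_eof(struct mock_write *w)
{
	if (w->pos <= w->isize)
		return 0;                 /* write starts at or before EOF */

	if (!w->drained_dio) {
		if (w->lock == LOCK_SHARED)
			w->lock = LOCK_EXCL; /* upgrade invalidates prior checks */
		w->drained_dio = 1;       /* models inode_dio_wait() */
		return 1;                 /* tell the caller to restart */
	}

	w->zeroed = 1;                    /* zero [isize, pos) */
	return 0;
}

/* Caller-side restart loop, mirroring xfs_file_write_checks(). */
static int mock_write_checks(struct mock_write *w)
{
	int error;
restart:
	/* generic write checks would be redone here */
	error = mock_zero_eof(w);
	if (error == 1)
		goto restart;
	return error;
}
```

A write beyond EOF under the shared lock thus always takes one restart: the first pass upgrades the lock and drains DIO, the second performs the zeroing.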
* Re: [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper
2024-09-23 15:28 ` [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
@ 2024-09-23 16:20 ` Darrick J. Wong
0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2024-09-23 16:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel
On Mon, Sep 23, 2024 at 05:28:18PM +0200, Christoph Hellwig wrote:
> Split out a helper from xfs_file_write_checks that just deals with the
> post-EOF zeroing to keep the code readable.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_file.c | 140 +++++++++++++++++++++++++++-------------------
> 1 file changed, 82 insertions(+), 58 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 412b1d71b52b7d..3efb0da2a910d6 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -347,10 +347,77 @@ xfs_file_splice_read(
> return ret;
> }
>
> +/*
> + * Take care of zeroing post-EOF blocks when they might exist.
> + *
> + * Returns 0 if successful, a negative error on failure, or 1 if this
> + * function dropped the iolock and reacquired it exclusively and the caller
> + * needs to restart the write sanity checks.
Thanks for documenting the calling conventions,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> + */
> +static ssize_t
> +xfs_file_write_zero_eof(
> + struct kiocb *iocb,
> + struct iov_iter *from,
> + unsigned int *iolock,
> + size_t count,
> + bool *drained_dio)
> +{
> + struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
> + loff_t isize;
> +
> + /*
> + * We need to serialise against EOF updates that occur in IO completions
> + * here. We want to make sure that nobody is changing the size while
> + * we do this check until we have placed an IO barrier (i.e. hold
> + * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
> + * spinlock effectively forms a memory barrier once we have
> + * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> + * hence be able to correctly determine if we need to run zeroing.
> + */
> + spin_lock(&ip->i_flags_lock);
> + isize = i_size_read(VFS_I(ip));
> + if (iocb->ki_pos <= isize) {
> + spin_unlock(&ip->i_flags_lock);
> + return 0;
> + }
> + spin_unlock(&ip->i_flags_lock);
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> +
> + if (!*drained_dio) {
> + /*
> + * If zeroing is needed and we are currently holding the iolock
> + * shared, we need to update it to exclusive which implies
> + * having to redo all checks before.
> + */
> + if (*iolock == XFS_IOLOCK_SHARED) {
> + xfs_iunlock(ip, *iolock);
> + *iolock = XFS_IOLOCK_EXCL;
> + xfs_ilock(ip, *iolock);
> + iov_iter_reexpand(from, count);
> + }
> +
> + /*
> + * We now have an IO submission barrier in place, but AIO can do
> + * EOF updates during IO completion and hence we now need to
> + * wait for all of them to drain. Non-AIO DIO will have drained
> + * before we are given the XFS_IOLOCK_EXCL, and so for most
> + * cases this wait is a no-op.
> + */
> + inode_dio_wait(VFS_I(ip));
> + *drained_dio = true;
> + return 1;
> + }
> +
> + trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
> + return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
> +}
> +
> /*
> * Common pre-write limit and setup checks.
> *
> - * Called with the iolocked held either shared and exclusive according to
> + * Called with the iolock held either shared and exclusive according to
> * @iolock, and returns with it held. Might upgrade the iolock to exclusive
> * if called for a direct write beyond i_size.
> */
> @@ -360,13 +427,10 @@ xfs_file_write_checks(
> struct iov_iter *from,
> unsigned int *iolock)
> {
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> - struct xfs_inode *ip = XFS_I(inode);
> - ssize_t error = 0;
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> size_t count = iov_iter_count(from);
> bool drained_dio = false;
> - loff_t isize;
> + ssize_t error;
>
> restart:
> error = generic_write_checks(iocb, from);
> @@ -389,7 +453,7 @@ xfs_file_write_checks(
> * exclusively.
> */
> if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
> - xfs_iunlock(ip, *iolock);
> + xfs_iunlock(XFS_I(inode), *iolock);
> *iolock = XFS_IOLOCK_EXCL;
> error = xfs_ilock_iocb(iocb, *iolock);
> if (error) {
> @@ -400,64 +464,24 @@ xfs_file_write_checks(
> }
>
> /*
> - * If the offset is beyond the size of the file, we need to zero any
> + * If the offset is beyond the size of the file, we need to zero all
> * blocks that fall between the existing EOF and the start of this
> - * write. If zeroing is needed and we are currently holding the iolock
> - * shared, we need to update it to exclusive which implies having to
> - * redo all checks before.
> - *
> - * We need to serialise against EOF updates that occur in IO completions
> - * here. We want to make sure that nobody is changing the size while we
> - * do this check until we have placed an IO barrier (i.e. hold the
> - * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
> - * spinlock effectively forms a memory barrier once we have the
> - * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> - * hence be able to correctly determine if we need to run zeroing.
> + * write.
> *
> - * We can do an unlocked check here safely as IO completion can only
> - * extend EOF. Truncate is locked out at this point, so the EOF can
> - * not move backwards, only forwards. Hence we only need to take the
> - * slow path and spin locks when we are at or beyond the current EOF.
> + * We can do an unlocked check for i_size here safely as I/O completion
> + * can only extend EOF. Truncate is locked out at this point, so the
> + * EOF can not move backwards, only forwards. Hence we only need to take
> + * the slow path when we are at or beyond the current EOF.
> */
> - if (iocb->ki_pos <= i_size_read(inode))
> - goto out;
> -
> - spin_lock(&ip->i_flags_lock);
> - isize = i_size_read(inode);
> - if (iocb->ki_pos > isize) {
> - spin_unlock(&ip->i_flags_lock);
> -
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - return -EAGAIN;
> -
> - if (!drained_dio) {
> - if (*iolock == XFS_IOLOCK_SHARED) {
> - xfs_iunlock(ip, *iolock);
> - *iolock = XFS_IOLOCK_EXCL;
> - xfs_ilock(ip, *iolock);
> - iov_iter_reexpand(from, count);
> - }
> - /*
> - * We now have an IO submission barrier in place, but
> - * AIO can do EOF updates during IO completion and hence
> - * we now need to wait for all of them to drain. Non-AIO
> - * DIO will have drained before we are given the
> - * XFS_IOLOCK_EXCL, and so for most cases this wait is a
> - * no-op.
> - */
> - inode_dio_wait(inode);
> - drained_dio = true;
> + if (iocb->ki_pos > i_size_read(inode)) {
> + error = xfs_file_write_zero_eof(iocb, from, iolock, count,
> + &drained_dio);
> + if (error == 1)
> goto restart;
> - }
> -
> - trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
> - error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
> if (error)
> return error;
> - } else
> - spin_unlock(&ip->i_flags_lock);
> + }
>
> -out:
> return kiocb_modified(iocb);
> }
>
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (3 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 04/10] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 15:28 ` [PATCH 06/10] xfs: zeroing already holds invalidate_lock Christoph Hellwig
` (5 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that
is actually the right thing, as an error in the iomap zeroing code will
also take the invalidate_lock to clean up, but to fix that deadlock we
need a consistent locking pattern first.
The only extra thing that XFS_MMAPLOCK_EXCL will lock out is read
pagefaults, which isn't really needed here, but also not actively
harmful.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_file.c | 8 +++++++-
fs/xfs/xfs_iomap.c | 2 ++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3efb0da2a910d6..b19916b11fd563 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -364,6 +364,7 @@ xfs_file_write_zero_eof(
{
struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
loff_t isize;
+ int error;
/*
* We need to serialise against EOF updates that occur in IO completions
@@ -411,7 +412,12 @@ xfs_file_write_zero_eof(
}
trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
- return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+
+ xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+ error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+ xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+
+ return error;
}
/*
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 01324da63fcfc7..4fa4d66dc37761 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1449,6 +1449,8 @@ xfs_zero_range(
{
struct inode *inode = VFS_I(ip);
+ xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
+
if (IS_DAX(inode))
return dax_zero_range(inode, pos, len, did_zero,
&xfs_dax_write_iomap_ops);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
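The locking contract this patch establishes — the caller wraps the zeroing call in XFS_MMAPLOCK_EXCL, and xfs_zero_range asserts it is held — can be sketched with hypothetical mock names as follows:

```c
#include <assert.h>

#define IOLOCK_EXCL   0x1
#define MMAPLOCK_EXCL 0x2

/* Hypothetical stand-in for the inode's lock state. */
struct mock_ip {
	unsigned locks;
};

/* Models xfs_zero_range(): asserts both locks are held on entry. */
static int mock_zero_range(struct mock_ip *ip)
{
	/* xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL) */
	assert((ip->locks & (IOLOCK_EXCL | MMAPLOCK_EXCL)) ==
	       (IOLOCK_EXCL | MMAPLOCK_EXCL));
	return 0;
}

/* Models the tail of xfs_file_write_zero_eof() after this patch:
 * take the mmaplock around the zeroing call and drop it afterwards. */
static int mock_write_zero_eof(struct mock_ip *ip)
{
	int error;

	ip->locks |= MMAPLOCK_EXCL;   /* xfs_ilock(ip, XFS_MMAPLOCK_EXCL) */
	error = mock_zero_range(ip);
	ip->locks &= ~MMAPLOCK_EXCL;  /* xfs_iunlock(ip, XFS_MMAPLOCK_EXCL) */

	return error;
}
```

The assertion in the callee is what keeps the pattern consistent once all zeroing paths are expected to hold the lock.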
* [PATCH 06/10] xfs: zeroing already holds invalidate_lock
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (4 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 16:22 ` Darrick J. Wong
2024-09-23 15:28 ` [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range Christoph Hellwig
` (4 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
All XFS callers of iomap_zero_range already hold invalidate_lock, so we can't
take it again in iomap_file_buffered_write_punch_delalloc.
Use the passed in flags argument to detect if we're called from a zeroing
operation and don't take the lock again in this case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_iomap.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 4fa4d66dc37761..0f5fa3de6d3ecc 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1239,10 +1239,17 @@ xfs_buffered_write_iomap_end(
if (start_byte >= end_byte)
return 0;
- filemap_invalidate_lock(inode->i_mapping);
+ /* For zeroing operations the callers already hold invalidate_lock. */
+ if (flags & IOMAP_ZERO)
+ rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
+ else
+ filemap_invalidate_lock(inode->i_mapping);
+
iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
xfs_buffered_write_delalloc_punch);
- filemap_invalidate_unlock(inode->i_mapping);
+
+ if (!(flags & IOMAP_ZERO))
+ filemap_invalidate_unlock(inode->i_mapping);
return 0;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
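The asymmetric lock handling in the hunk above — assert-only for zeroing callers that already hold invalidate_lock, take/drop for everyone else — can be modeled in a small sketch (mock names are hypothetical):

```c
#include <assert.h>

#define F_ZERO 0x1  /* stands in for IOMAP_ZERO */

/* Hypothetical stand-in for the mapping and its invalidate_lock. */
struct mock_mapping {
	int lock_held;  /* invalidate_lock held for write */
	int punched;    /* delalloc release ran under the lock */
};

static int mock_iomap_end(struct mock_mapping *m, unsigned flags)
{
	/* For zeroing operations the caller already holds invalidate_lock. */
	if (flags & F_ZERO)
		assert(m->lock_held);   /* rwsem_assert_held_write() */
	else
		m->lock_held = 1;       /* filemap_invalidate_lock() */

	m->punched = 1;                 /* iomap_write_delalloc_release() */

	if (!(flags & F_ZERO))
		m->lock_held = 0;       /* filemap_invalidate_unlock() */
	return 0;
}
```

Either way the release runs with the lock held; the flag only determines whose responsibility acquiring it was.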
* Re: [PATCH 06/10] xfs: zeroing already holds invalidate_lock
2024-09-23 15:28 ` [PATCH 06/10] xfs: zeroing already holds invalidate_lock Christoph Hellwig
@ 2024-09-23 16:22 ` Darrick J. Wong
2024-09-24 5:44 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2024-09-23 16:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel
On Mon, Sep 23, 2024 at 05:28:20PM +0200, Christoph Hellwig wrote:
> All XFS callers of iomap_zero_range already hold invalidate_lock, so we can't
> take it again in iomap_file_buffered_write_punch_delalloc.
>
> Use the passed in flags argument to detect if we're called from a zeroing
> operation and don't take the lock again in this case.
Shouldn't this be a part of the previous patch? AFAICT taking the
invalidation lock in xfs_file_write_zero_eof is why we need the change
to rwsem_assert_held_write here, right?
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_iomap.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 4fa4d66dc37761..0f5fa3de6d3ecc 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1239,10 +1239,17 @@ xfs_buffered_write_iomap_end(
> if (start_byte >= end_byte)
> return 0;
>
> - filemap_invalidate_lock(inode->i_mapping);
> + /* For zeroing operations the callers already hold invalidate_lock. */
> + if (flags & IOMAP_ZERO)
> + rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
> + else
> + filemap_invalidate_lock(inode->i_mapping);
> +
> iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> xfs_buffered_write_delalloc_punch);
> - filemap_invalidate_unlock(inode->i_mapping);
> +
> + if (!(flags & IOMAP_ZERO))
> + filemap_invalidate_unlock(inode->i_mapping);
> return 0;
> }
>
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 06/10] xfs: zeroing already holds invalidate_lock
2024-09-23 16:22 ` Darrick J. Wong
@ 2024-09-24 5:44 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-24 5:44 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
linux-fsdevel
On Mon, Sep 23, 2024 at 09:22:49AM -0700, Darrick J. Wong wrote:
> On Mon, Sep 23, 2024 at 05:28:20PM +0200, Christoph Hellwig wrote:
> > All XFS callers of iomap_zero_range already hold invalidate_lock, so we can't
> > take it again in iomap_file_buffered_write_punch_delalloc.
> >
> > Use the passed in flags argument to detect if we're called from a zeroing
> > operation and don't take the lock again in this case.
>
> Shouldn't this be a part of the previous patch? AFAICT taking the
> invalidation lock in xfs_file_write_zero_eof is why we need the change
> to rwsem_assert_held_write here, right?
Most callers of zeroing already hold the lock. So I can see arguments
for merging the patches now (don't make one case even worse before
fixing) or not (this is really two unrelated changes and easier to
understand).
Or now that the locking is inside XFS we could even add a private iomap
flag to not do the locking for the EOF zeroing, but that would create
even more special cases.
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (5 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 06/10] xfs: zeroing already holds invalidate_lock Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 15:28 ` [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin Christoph Hellwig
` (3 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
xfs_buffered_write_iomap_begin can also create delalloc reservations
that need cleaning up. Prepare for that by adding support for the COW
fork in xfs_bmap_punch_delalloc_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_aops.c | 4 ++--
fs/xfs/xfs_bmap_util.c | 10 +++++++---
fs/xfs/xfs_bmap_util.h | 2 +-
fs/xfs/xfs_iomap.c | 3 ++-
4 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6dead20338e24c..559a3a57709748 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -116,7 +116,7 @@ xfs_end_ioend(
if (unlikely(error)) {
if (ioend->io_flags & IOMAP_F_SHARED) {
xfs_reflink_cancel_cow_range(ip, offset, size, true);
- xfs_bmap_punch_delalloc_range(ip, offset,
+ xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, offset,
offset + size);
}
goto done;
@@ -456,7 +456,7 @@ xfs_discard_folio(
* byte of the next folio. Hence the end offset is only dependent on the
* folio itself and not the start offset that is passed in.
*/
- xfs_bmap_punch_delalloc_range(ip, pos,
+ xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, pos,
folio_pos(folio) + folio_size(folio));
}
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 053d567c910840..4719ec90029cb7 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -442,11 +442,12 @@ xfs_getbmap(
void
xfs_bmap_punch_delalloc_range(
struct xfs_inode *ip,
+ int whichfork,
xfs_off_t start_byte,
xfs_off_t end_byte)
{
struct xfs_mount *mp = ip->i_mount;
- struct xfs_ifork *ifp = &ip->i_df;
+ struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork);
xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, start_byte);
xfs_fileoff_t end_fsb = XFS_B_TO_FSB(mp, end_byte);
struct xfs_bmbt_irec got, del;
@@ -474,11 +475,14 @@ xfs_bmap_punch_delalloc_range(
continue;
}
- xfs_bmap_del_extent_delay(ip, XFS_DATA_FORK, &icur, &got, &del);
+ xfs_bmap_del_extent_delay(ip, whichfork, &icur, &got, &del);
if (!xfs_iext_get_extent(ifp, &icur, &got))
break;
}
+ if (whichfork == XFS_COW_FORK && !ifp->if_bytes)
+ xfs_inode_clear_cowblocks_tag(ip);
+
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_EXCL);
}
@@ -580,7 +584,7 @@ xfs_free_eofblocks(
*/
if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)) {
if (ip->i_delayed_blks) {
- xfs_bmap_punch_delalloc_range(ip,
+ xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK,
round_up(XFS_ISIZE(ip), mp->m_sb.sb_blocksize),
LLONG_MAX);
}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index eb0895bfb9dae4..b29760d36e1ab1 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -30,7 +30,7 @@ xfs_bmap_rtalloc(struct xfs_bmalloca *ap)
}
#endif /* CONFIG_XFS_RT */
-void xfs_bmap_punch_delalloc_range(struct xfs_inode *ip,
+void xfs_bmap_punch_delalloc_range(struct xfs_inode *ip, int whichfork,
xfs_off_t start_byte, xfs_off_t end_byte);
struct kgetbmap {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 0f5fa3de6d3ecc..abc3b9f1115ce7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1215,7 +1215,8 @@ xfs_buffered_write_delalloc_punch(
loff_t length,
struct iomap *iomap)
{
- xfs_bmap_punch_delalloc_range(XFS_I(inode), offset, offset + length);
+ xfs_bmap_punch_delalloc_range(XFS_I(inode), XFS_DATA_FORK, offset,
+ offset + length);
}
static int
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
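A stand-alone model of the generalized punch helper — the same walk for either fork, plus clearing the cowblocks tag once the COW fork is empty — might look like this (all `mock_*` names are hypothetical):

```c
#include <assert.h>

enum fork { DATA_FORK, COW_FORK };

/* Hypothetical stand-in for the per-inode fork state. */
struct mock_inode {
	int data_delalloc;   /* delalloc bytes in the data fork */
	int cow_delalloc;    /* delalloc bytes in the COW fork */
	int cowblocks_tag;   /* set while the COW fork has blocks */
};

/* Models xfs_bmap_punch_delalloc_range() with the new whichfork arg. */
static void mock_punch(struct mock_inode *ip, enum fork whichfork)
{
	if (whichfork == DATA_FORK) {
		ip->data_delalloc = 0;
		return;
	}

	ip->cow_delalloc = 0;
	if (ip->cow_delalloc == 0)    /* COW fork now empty: drop the tag */
		ip->cowblocks_tag = 0;    /* xfs_inode_clear_cowblocks_tag() */
}
```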
* [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (6 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 15:28 ` [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations Christoph Hellwig
` (2 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index abc3b9f1115ce7..aff9e8305399ee 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -975,6 +975,7 @@ xfs_buffered_write_iomap_begin(
int allocfork = XFS_DATA_FORK;
int error = 0;
unsigned int lockmode = XFS_ILOCK_EXCL;
+ unsigned int iomap_flags = 0;
u64 seq;
if (xfs_is_shutdown(mp))
@@ -1145,6 +1146,11 @@ xfs_buffered_write_iomap_begin(
}
}
+ /*
+ * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
+ * them out if the write happens to fail.
+ */
+ iomap_flags |= IOMAP_F_NEW;
if (allocfork == XFS_COW_FORK) {
error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
end_fsb - offset_fsb, prealloc_blocks, &cmap,
@@ -1162,19 +1168,11 @@ xfs_buffered_write_iomap_begin(
if (error)
goto out_unlock;
- /*
- * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
- * them out if the write happens to fail.
- */
- seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
- xfs_iunlock(ip, lockmode);
trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
-
found_imap:
- seq = xfs_iomap_inode_sequence(ip, 0);
+ seq = xfs_iomap_inode_sequence(ip, iomap_flags);
xfs_iunlock(ip, lockmode);
- return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
+ return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
convert_delay:
xfs_iunlock(ip, lockmode);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (7 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 08/10] xfs: share more code in xfs_buffered_write_iomap_begin Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-23 15:28 ` [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes Christoph Hellwig
2024-09-25 9:19 ` fix stale delalloc punching for COW I/O v3 Christian Brauner
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for
COW fork allocations even if they don't overlap existing data fork
extents, which will allow the iomap_end callback to detect if it has to
punch stale delalloc blocks from the COW fork instead of the data fork.
It also means we sample the sequence counter for both the data and the
COW fork when writing to the COW fork, which ensures we properly
revalidate when only COW fork changes happen.
This is essentially a revert of commit 72a048c1056a ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.
Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index aff9e8305399ee..cc768f0139d365 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1186,20 +1186,19 @@ xfs_buffered_write_iomap_begin(
return 0;
found_cow:
- seq = xfs_iomap_inode_sequence(ip, 0);
if (imap.br_startoff <= offset_fsb) {
- error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
+ error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0,
+ xfs_iomap_inode_sequence(ip, 0));
if (error)
goto out_unlock;
- seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
- xfs_iunlock(ip, lockmode);
- return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
- IOMAP_F_SHARED, seq);
+ } else {
+ xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
}
- xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
+ iomap_flags = IOMAP_F_SHARED;
+ seq = xfs_iomap_inode_sequence(ip, iomap_flags);
xfs_iunlock(ip, lockmode);
- return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
+ return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, iomap_flags, seq);
out_unlock:
xfs_iunlock(ip, lockmode);
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (8 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations Christoph Hellwig
@ 2024-09-23 15:28 ` Christoph Hellwig
2024-09-25 9:19 ` fix stale delalloc punching for COW I/O v3 Christian Brauner
10 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-09-23 15:28 UTC (permalink / raw)
To: Chandan Babu R, Christian Brauner, Darrick J. Wong
Cc: linux-xfs, linux-fsdevel
When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.
Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index cc768f0139d365..8a4e5f14ec0c77 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1195,7 +1195,7 @@ xfs_buffered_write_iomap_begin(
xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
}
- iomap_flags = IOMAP_F_SHARED;
+ iomap_flags |= IOMAP_F_SHARED;
seq = xfs_iomap_inode_sequence(ip, iomap_flags);
xfs_iunlock(ip, lockmode);
return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, iomap_flags, seq);
@@ -1212,8 +1212,10 @@ xfs_buffered_write_delalloc_punch(
loff_t length,
struct iomap *iomap)
{
- xfs_bmap_punch_delalloc_range(XFS_I(inode), XFS_DATA_FORK, offset,
- offset + length);
+ xfs_bmap_punch_delalloc_range(XFS_I(inode),
+ (iomap->flags & IOMAP_F_SHARED) ?
+ XFS_COW_FORK : XFS_DATA_FORK,
+ offset, offset + length);
}
static int
--
2.45.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
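The fork selection the punch callback now performs boils down to a single flag test; a minimal sketch, with hypothetical names standing in for the XFS constants:

```c
#include <assert.h>

#define F_SHARED 0x4  /* stands in for IOMAP_F_SHARED */

enum fork { DATA_FORK, COW_FORK };

/* COW fork mappings carry IOMAP_F_SHARED (patch 09), so the flag tells
 * the punch helper which fork the stale delalloc reservation lives in. */
static enum fork punch_target(unsigned iomap_flags)
{
	return (iomap_flags & F_SHARED) ? COW_FORK : DATA_FORK;
}
```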
* Re: fix stale delalloc punching for COW I/O v3
2024-09-23 15:28 fix stale delalloc punching for COW I/O v3 Christoph Hellwig
` (9 preceding siblings ...)
2024-09-23 15:28 ` [PATCH 10/10] xfs: punch delalloc extents from the COW fork for COW writes Christoph Hellwig
@ 2024-09-25 9:19 ` Christian Brauner
2024-09-25 9:24 ` fix stale delalloc punching for COW I/O v4 Christian Brauner
10 siblings, 1 reply; 41+ messages in thread
From: Christian Brauner @ 2024-09-25 9:19 UTC (permalink / raw)
To: Chandan Babu R, Darrick J. Wong, Christoph Hellwig
Cc: Christian Brauner, linux-xfs, linux-fsdevel
On Mon, 23 Sep 2024 17:28:14 +0200, Christoph Hellwig wrote:
> this is another fallout from the zoned XFS work, which stresses the XFS
> COW I/O path very heavily. It affects normal I/O to reflinked files as
> well, but is very hard to hit there.
>
> The main problem here is that we only punch out delalloc reservations
> from the data fork, but COW I/O places delalloc extents into the COW
> fork, which means that it won't get punched out for short writes.
>
> [...]
If that should take another route, let me know.
---
Applied to the vfs.iomap.v6.13 branch of the vfs/vfs.git tree.
Patches in the vfs.iomap.v6.13 branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.iomap.v6.13
[01/10] iomap: factor out a iomap_last_written_block helper
https://git.kernel.org/vfs/vfs/c/91c5635e1fdc
[02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
https://git.kernel.org/vfs/vfs/c/7e8b9d3d403b
[03/10] iomap: move locking out of iomap_write_delalloc_release
https://git.kernel.org/vfs/vfs/c/749a89eb81a3
[04/10] xfs: factor out a xfs_file_write_zero_eof helper
https://git.kernel.org/vfs/vfs/c/86e3e09d26c4
[05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
https://git.kernel.org/vfs/vfs/c/fbe5a71ceb96
[06/10] xfs: zeroing already holds invalidate_lock
https://git.kernel.org/vfs/vfs/c/b9db9ae79580
[07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range
https://git.kernel.org/vfs/vfs/c/73ca182d0a9f
[08/10] xfs: share more code in xfs_buffered_write_iomap_begin
https://git.kernel.org/vfs/vfs/c/621aa9522fcb
[09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations
https://git.kernel.org/vfs/vfs/c/4bb081d6d8e0
[10/10] xfs: punch delalloc extents from the COW fork for COW writes
https://git.kernel.org/vfs/vfs/c/48e0d3ca3198
* Re: fix stale delalloc punching for COW I/O v4
2024-09-25 9:19 ` fix stale delalloc punching for COW I/O v3 Christian Brauner
@ 2024-09-25 9:24 ` Christian Brauner
0 siblings, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2024-09-25 9:24 UTC (permalink / raw)
To: Chandan Babu R, Carlos Maiolino, Darrick J. Wong,
Christoph Hellwig
Cc: Christian Brauner, linux-xfs, linux-fsdevel
On Tue, 24 Sep 2024 09:40:42 +0200, Christoph Hellwig wrote:
> this is another fallout from the zoned XFS work, which stresses the XFS
> COW I/O path very heavily. It affects normal I/O to reflinked files as
> well, but is very hard to hit there.
>
> The main problem here is that we only punch out delalloc reservations
> from the data fork, but COW I/O places delalloc extents into the COW
> fork, which means that it won't get punched out for short writes.
>
> [...]
Sorry, pulled v3 by accident. Now actually pulled v4.
---
Applied to the vfs.iomap.v6.13 branch of the vfs/vfs.git tree.
Patches in the vfs.iomap.v6.13 branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.iomap.v6.13
[01/10] iomap: factor out a iomap_last_written_block helper
https://git.kernel.org/vfs/vfs/c/5d33a6215cf4
[02/10] iomap: remove iomap_file_buffered_write_punch_delalloc
https://git.kernel.org/vfs/vfs/c/45f14ab7316c
[03/10] iomap: move locking out of iomap_write_delalloc_release
https://git.kernel.org/vfs/vfs/c/7f700d206e70
[04/10] xfs: factor out a xfs_file_write_zero_eof helper
https://git.kernel.org/vfs/vfs/c/83b6773ce54a
[05/10] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
https://git.kernel.org/vfs/vfs/c/63744a511a6a
[06/10] xfs: zeroing already holds invalidate_lock
https://git.kernel.org/vfs/vfs/c/7156dde42b72
[07/10] xfs: support the COW fork in xfs_bmap_punch_delalloc_range
https://git.kernel.org/vfs/vfs/c/3bc21e71b820
[08/10] xfs: share more code in xfs_buffered_write_iomap_begin
https://git.kernel.org/vfs/vfs/c/367332e4a22a
[09/10] xfs: set IOMAP_F_SHARED for all COW fork allocations
https://git.kernel.org/vfs/vfs/c/a032a49ce426
[10/10] xfs: punch delalloc extents from the COW fork for COW writes
https://git.kernel.org/vfs/vfs/c/23bdeeb38d2e