* [RFC PATCH 0/2] Implement concurrent buffered write with folio lock
@ 2025-04-25 10:38 Chi Zhiling
2025-04-25 10:38 ` [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode Chi Zhiling
From: Chi Zhiling @ 2025-04-25 10:38 UTC (permalink / raw)
To: cem; +Cc: linux-xfs, linux-kernel, Chi Zhiling
From: Chi Zhiling <chizhiling@kylinos.cn>
This is a patch attempting to implement concurrent buffered writes.
The main idea is to use the folio lock to ensure the atomicity of the
write when writing to a single folio, instead of using the i_rwsem.
I tried the "folio batch" solution, which is a great idea, but during
testing, I encountered an OOM issue because the locked folios couldn't
be reclaimed.
So for now, I can only allow concurrent writes within a single block.
The good news is that since we already support BS > PS, we can use a
larger block size to enable higher granularity concurrency.
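The single-block admission check described above can be sketched as a small standalone predicate (a hedged sketch with simplified names; the real check in this series is the `xfs_allow_concurrent()` helper in fs/xfs/xfs_file.c):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the "may this write proceed under a shared lock?" check:
 * the write must be contained in one filesystem block (and therefore
 * one folio when block size >= page size). Power-of-two block size
 * assumed, as in XFS. Illustrative only.
 */
static bool write_within_single_block(int64_t pos, size_t count,
                                      uint32_t blocksize)
{
        uint32_t off = (uint32_t)pos & (blocksize - 1);

        return count <= blocksize && off + count <= blocksize;
}
```

A larger block size directly widens what this predicate admits, which is why BS > PS raises the achievable concurrency granularity.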
These ideas come from previous discussions:
https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@163.com/
Chi Zhiling (2):
xfs: Add i_direct_mode to indicate the IO mode of inode
xfs: Enable concurrency when writing within single block
fs/xfs/xfs_file.c | 71 ++++++++++++++++++++++++++++++++++++++++++----
fs/xfs/xfs_inode.h | 6 ++++
2 files changed, 72 insertions(+), 5 deletions(-)
--
2.43.0
^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode
  2025-04-25 10:38 [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Chi Zhiling
@ 2025-04-25 10:38 ` Chi Zhiling
  2025-04-25 15:12   ` Darrick J. Wong
  2025-04-25 10:38 ` [RFC PATCH 2/2] xfs: Enable concurrency when writing within single block Chi Zhiling
  2025-04-30  2:05 ` [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Dave Chinner

From: Chi Zhiling @ 2025-04-25 10:38 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, linux-kernel, Chi Zhiling

From: Chi Zhiling <chizhiling@kylinos.cn>

Direct IO already uses a shared lock. If buffered writes also use a
shared lock, we need to ensure mutual exclusion between DIO and
buffered IO. Therefore, introduce a flag `i_direct_mode` to indicate
the IO mode currently used by the file. In practical scenarios, DIO
and buffered IO are typically not mixed on the same file, so this
flag is rarely modified.

This flag is protected by the i_rwsem lock, which avoids the need to
introduce a new lock: reading the flag requires holding the read
lock, and modifying it requires the write lock.

When a file that has been doing buffered IO starts doing DIO, the DIO
path must take the write lock to flip i_direct_mode, which forces it
to wait for all previous IO to complete before starting. Likewise,
once the flag has been flipped, subsequent buffered IO must take the
write lock to flip it back, forcing those IOs to wait for the
in-flight DIO to complete.
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
---
 fs/xfs/xfs_file.c  | 37 +++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_inode.h |  6 ++++++
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 84f08c976ac4..a6f214f57238 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -206,7 +206,8 @@ xfs_ilock_iocb(
 static int
 xfs_ilock_iocb_for_write(
 	struct kiocb		*iocb,
-	unsigned int		*lock_mode)
+	unsigned int		*lock_mode,
+	bool			is_dio)
 {
 	ssize_t			ret;
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
@@ -226,6 +227,21 @@ xfs_ilock_iocb_for_write(
 		return xfs_ilock_iocb(iocb, *lock_mode);
 	}
 
+	/*
+	 * If the i_direct_mode need update, take the iolock exclusively to write
+	 * it.
+	 */
+	if (ip->i_direct_mode != is_dio) {
+		if (*lock_mode == XFS_IOLOCK_SHARED) {
+			xfs_iunlock(ip, *lock_mode);
+			*lock_mode = XFS_IOLOCK_EXCL;
+			ret = xfs_ilock_iocb(iocb, *lock_mode);
+			if (ret)
+				return ret;
+		}
+		ip->i_direct_mode = is_dio;
+	}
+
 	return 0;
 }
 
@@ -247,6 +263,19 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
+
+	if (!ip->i_direct_mode) {
+		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+		ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_EXCL);
+		if (ret)
+			return ret;
+
+		ip->i_direct_mode = 1;
+
+		/* Update finished, now downgrade to shared lock */
+		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
+	}
+
 	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
@@ -680,7 +709,7 @@ xfs_file_dio_write_aligned(
 	unsigned int		iolock = XFS_IOLOCK_SHARED;
 	ssize_t			ret;
 
-	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock, true);
 	if (ret)
 		return ret;
 	ret = xfs_file_write_checks(iocb, from, &iolock, ac);
@@ -767,7 +796,7 @@ xfs_file_dio_write_unaligned(
 		flags = IOMAP_DIO_FORCE_WAIT;
 	}
 
-	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock, true);
 	if (ret)
 		return ret;
 
@@ -898,7 +927,7 @@ xfs_file_buffered_write(
 
 write_retry:
 	iolock = XFS_IOLOCK_EXCL;
-	ret = xfs_ilock_iocb(iocb, iolock);
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock, false);
 	if (ret)
 		return ret;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index eae0159983ca..04f6c4174fab 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -51,6 +51,12 @@ typedef struct xfs_inode {
 	uint16_t		i_checked;
 	uint16_t		i_sick;
 
+	/*
+	 * Indicates the current IO mode of this inode, (DIO/buffered IO)
+	 * protected by i_rwsem lock.
+	 */
+	uint32_t		i_direct_mode;
+
 	spinlock_t		i_flags_lock;	/* inode i_flags lock */
 	/* Miscellaneous state. */
 	unsigned long		i_flags;	/* see defined flags below */
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode 2025-04-25 10:38 ` [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode Chi Zhiling @ 2025-04-25 15:12 ` Darrick J. Wong 2025-04-26 1:28 ` Chi Zhiling 0 siblings, 1 reply; 11+ messages in thread From: Darrick J. Wong @ 2025-04-25 15:12 UTC (permalink / raw) To: Chi Zhiling; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling On Fri, Apr 25, 2025 at 06:38:40PM +0800, Chi Zhiling wrote: > From: Chi Zhiling <chizhiling@kylinos.cn> > > Direct IO already uses shared lock. If buffered write also uses > shared lock, we need to ensure mutual exclusion between DIO and > buffered IO. Therefore, Now introduce a flag `i_direct_mode` to > indicate the IO mode currently used by the file. In practical > scenarios, DIO and buffered IO are typically not used together, > so this flag is usually not modified. > > Additionally, this flag is protected by the i_rwsem lock, > which avoids the need to introduce new lock. When reading this > flag, we need to hold a read lock, and when writing, a write lock > is required. > > When a file that uses buffered IO starts using DIO, it needs to > acquire a write lock to modify i_direct_mode, which will force DIO > to wait for the previous IO to complete before starting. After > acquiring the write lock to modify `i_direct_mode`, subsequent > buffered IO will need to acquire the write lock again to modify > i_direct_mode, which will force those IOs to wait for the current > IO to complete. 
> > Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> > --- > fs/xfs/xfs_file.c | 37 +++++++++++++++++++++++++++++++++---- > fs/xfs/xfs_inode.h | 6 ++++++ > 2 files changed, 39 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 84f08c976ac4..a6f214f57238 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -206,7 +206,8 @@ xfs_ilock_iocb( > static int > xfs_ilock_iocb_for_write( > struct kiocb *iocb, > - unsigned int *lock_mode) > + unsigned int *lock_mode, > + bool is_dio) Is an explicit flag required here, or can you determine directness from IS_DAX() || (iocb->ki_flags & IOCB_DIRECT) ? Hmm, I guess not, since a directio falling back to the pagecache for an unaligned out of place write doesn't clear IOCB_DIRECT? How does this new flag intersect with XFS_IREMAPPING? Are we actually modelling three states here: bufferedio <-> directio <-> remapping? > { > ssize_t ret; > struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); > @@ -226,6 +227,21 @@ xfs_ilock_iocb_for_write( > return xfs_ilock_iocb(iocb, *lock_mode); > } > > + /* > + * If the i_direct_mode need update, take the iolock exclusively to write > + * it. 
> + */ > + if (ip->i_direct_mode != is_dio) { > + if (*lock_mode == XFS_IOLOCK_SHARED) { > + xfs_iunlock(ip, *lock_mode); > + *lock_mode = XFS_IOLOCK_EXCL; > + ret = xfs_ilock_iocb(iocb, *lock_mode); > + if (ret) > + return ret; > + } > + ip->i_direct_mode = is_dio; > + } > + > return 0; > } > > @@ -247,6 +263,19 @@ xfs_file_dio_read( > ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED); > if (ret) > return ret; > + > + if (!ip->i_direct_mode) { > + xfs_iunlock(ip, XFS_IOLOCK_SHARED); > + ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_EXCL); > + if (ret) > + return ret; > + > + ip->i_direct_mode = 1; > + > + /* Update finished, now downgrade to shared lock */ > + xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); > + } > + > ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0); > xfs_iunlock(ip, XFS_IOLOCK_SHARED); > > @@ -680,7 +709,7 @@ xfs_file_dio_write_aligned( > unsigned int iolock = XFS_IOLOCK_SHARED; > ssize_t ret; > > - ret = xfs_ilock_iocb_for_write(iocb, &iolock); > + ret = xfs_ilock_iocb_for_write(iocb, &iolock, true); > if (ret) > return ret; > ret = xfs_file_write_checks(iocb, from, &iolock, ac); > @@ -767,7 +796,7 @@ xfs_file_dio_write_unaligned( > flags = IOMAP_DIO_FORCE_WAIT; > } > > - ret = xfs_ilock_iocb_for_write(iocb, &iolock); > + ret = xfs_ilock_iocb_for_write(iocb, &iolock, true); > if (ret) > return ret; > > @@ -898,7 +927,7 @@ xfs_file_buffered_write( > > write_retry: > iolock = XFS_IOLOCK_EXCL; > - ret = xfs_ilock_iocb(iocb, iolock); > + ret = xfs_ilock_iocb_for_write(iocb, &iolock, false); > if (ret) > return ret; > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index eae0159983ca..04f6c4174fab 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -51,6 +51,12 @@ typedef struct xfs_inode { > uint16_t i_checked; > uint16_t i_sick; > > + /* > + * Indicates the current IO mode of this inode, (DIO/buffered IO) > + * protected by i_rwsem lock. > + */ > + uint32_t i_direct_mode; Yeesh, a whole u32 to encode a single bit. 
Can you use i_flags instead? --D > + > spinlock_t i_flags_lock; /* inode i_flags lock */ > /* Miscellaneous state. */ > unsigned long i_flags; /* see defined flags below */ > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode 2025-04-25 15:12 ` Darrick J. Wong @ 2025-04-26 1:28 ` Chi Zhiling 0 siblings, 0 replies; 11+ messages in thread From: Chi Zhiling @ 2025-04-26 1:28 UTC (permalink / raw) To: Darrick J. Wong; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling On 2025/4/25 23:12, Darrick J. Wong wrote: > On Fri, Apr 25, 2025 at 06:38:40PM +0800, Chi Zhiling wrote: >> From: Chi Zhiling <chizhiling@kylinos.cn> >> >> Direct IO already uses shared lock. If buffered write also uses >> shared lock, we need to ensure mutual exclusion between DIO and >> buffered IO. Therefore, Now introduce a flag `i_direct_mode` to >> indicate the IO mode currently used by the file. In practical >> scenarios, DIO and buffered IO are typically not used together, >> so this flag is usually not modified. >> >> Additionally, this flag is protected by the i_rwsem lock, >> which avoids the need to introduce new lock. When reading this >> flag, we need to hold a read lock, and when writing, a write lock >> is required. >> >> When a file that uses buffered IO starts using DIO, it needs to >> acquire a write lock to modify i_direct_mode, which will force DIO >> to wait for the previous IO to complete before starting. After >> acquiring the write lock to modify `i_direct_mode`, subsequent >> buffered IO will need to acquire the write lock again to modify >> i_direct_mode, which will force those IOs to wait for the current >> IO to complete. 
>> >> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> >> --- >> fs/xfs/xfs_file.c | 37 +++++++++++++++++++++++++++++++++---- >> fs/xfs/xfs_inode.h | 6 ++++++ >> 2 files changed, 39 insertions(+), 4 deletions(-) >> >> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c >> index 84f08c976ac4..a6f214f57238 100644 >> --- a/fs/xfs/xfs_file.c >> +++ b/fs/xfs/xfs_file.c >> @@ -206,7 +206,8 @@ xfs_ilock_iocb( >> static int >> xfs_ilock_iocb_for_write( >> struct kiocb *iocb, >> - unsigned int *lock_mode) >> + unsigned int *lock_mode, >> + bool is_dio) > > Is an explicit flag required here, or can you determine directness from > IS_DAX() || (iocb->ki_flags & IOCB_DIRECT) ? > > Hmm, I guess not, since a directio falling back to the pagecache for an > unaligned out of place write doesn't clear IOCB_DIRECT? Because DIO can fallback to buffered IO, I think checking (iocb->ki_flags & IOCB_DIRECT) is not accurate. That's why we need to add an additional argument. > > How does this new flag intersect with XFS_IREMAPPING? Are we actually > modelling three states here: bufferedio <-> directio <-> remapping? Yes, and these three states are mutually exclusive. That's a good suggestion. I think we can include XFS_IREMAPPING in the new flag as well. > >> { >> ssize_t ret; >> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); >> @@ -226,6 +227,21 @@ xfs_ilock_iocb_for_write( >> return xfs_ilock_iocb(iocb, *lock_mode); >> } >> >> + /* >> + * If the i_direct_mode need update, take the iolock exclusively to write >> + * it. 
>> + */ >> + if (ip->i_direct_mode != is_dio) { >> + if (*lock_mode == XFS_IOLOCK_SHARED) { >> + xfs_iunlock(ip, *lock_mode); >> + *lock_mode = XFS_IOLOCK_EXCL; >> + ret = xfs_ilock_iocb(iocb, *lock_mode); >> + if (ret) >> + return ret; >> + } >> + ip->i_direct_mode = is_dio; >> + } >> + >> return 0; >> } >> >> @@ -247,6 +263,19 @@ xfs_file_dio_read( >> ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED); >> if (ret) >> return ret; >> + >> + if (!ip->i_direct_mode) { >> + xfs_iunlock(ip, XFS_IOLOCK_SHARED); >> + ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_EXCL); >> + if (ret) >> + return ret; >> + >> + ip->i_direct_mode = 1; >> + >> + /* Update finished, now downgrade to shared lock */ >> + xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); >> + } >> + >> ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0); >> xfs_iunlock(ip, XFS_IOLOCK_SHARED); >> >> @@ -680,7 +709,7 @@ xfs_file_dio_write_aligned( >> unsigned int iolock = XFS_IOLOCK_SHARED; >> ssize_t ret; >> >> - ret = xfs_ilock_iocb_for_write(iocb, &iolock); >> + ret = xfs_ilock_iocb_for_write(iocb, &iolock, true); >> if (ret) >> return ret; >> ret = xfs_file_write_checks(iocb, from, &iolock, ac); >> @@ -767,7 +796,7 @@ xfs_file_dio_write_unaligned( >> flags = IOMAP_DIO_FORCE_WAIT; >> } >> >> - ret = xfs_ilock_iocb_for_write(iocb, &iolock); >> + ret = xfs_ilock_iocb_for_write(iocb, &iolock, true); >> if (ret) >> return ret; >> >> @@ -898,7 +927,7 @@ xfs_file_buffered_write( >> >> write_retry: >> iolock = XFS_IOLOCK_EXCL; >> - ret = xfs_ilock_iocb(iocb, iolock); >> + ret = xfs_ilock_iocb_for_write(iocb, &iolock, false); >> if (ret) >> return ret; >> >> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h >> index eae0159983ca..04f6c4174fab 100644 >> --- a/fs/xfs/xfs_inode.h >> +++ b/fs/xfs/xfs_inode.h >> @@ -51,6 +51,12 @@ typedef struct xfs_inode { >> uint16_t i_checked; >> uint16_t i_sick; >> >> + /* >> + * Indicates the current IO mode of this inode, (DIO/buffered IO) >> + * protected by i_rwsem lock. 
>> + */ >> + uint32_t i_direct_mode; I think we can add i_remapping to this new flag, and rename it to i_current_state, so it can be remapping, DIO, or BIO. The rule remains the same: it should be protected by i_rwsem, with the write lock held to change it. > > Yeesh, a whole u32 to encode a single bit. Can you use i_flags instead? Sorry, It's a mistake, But I don't think we can use i_flags instead. I tried using i_flags for this, but i_flags is protected by i_flags_lock, which means that for every IO operation, it always requires an additional acquisition of i_flags_lock to check this flag. Thanks > > --D > >> + >> spinlock_t i_flags_lock; /* inode i_flags lock */ >> /* Miscellaneous state. */ >> unsigned long i_flags; /* see defined flags below */ >> -- >> 2.43.0 >> >> ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH 2/2] xfs: Enable concurrency when writing within single block
  2025-04-25 10:38 [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Chi Zhiling
  2025-04-25 10:38 ` [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode Chi Zhiling
@ 2025-04-25 10:38 ` Chi Zhiling
  2025-04-25 15:15   ` Darrick J. Wong
  2025-04-30  2:05 ` [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Dave Chinner

From: Chi Zhiling @ 2025-04-25 10:38 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, linux-kernel, Chi Zhiling

From: Chi Zhiling <chizhiling@kylinos.cn>

For non-extending writes, we only update the page cache and the extent
map. In this case, if the write falls within a single block, that is,
within a single folio, we don't need the exclusive lock to guarantee
the atomicity of the write, because we already hold the folio lock.

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
---
 fs/xfs/xfs_file.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a6f214f57238..8eaa98464328 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -914,6 +914,27 @@ xfs_file_dax_write(
 	return ret;
 }
 
+#define offset_in_block(inode, p) ((unsigned long)(p) & (i_blocksize(inode) - 1))
+
+static inline bool xfs_allow_concurrent(
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+
+	/* Extending write? */
+	if (iocb->ki_flags & IOCB_APPEND ||
+	    iocb->ki_pos >= i_size_read(inode))
+		return false;
+
+	/* Exceeds a block range? */
+	if (iov_iter_count(from) > i_blocksize(inode) ||
+	    offset_in_block(inode, iocb->ki_pos) + iov_iter_count(from) > i_blocksize(inode))
+		return false;
+
+	return true;
+}
+
 STATIC ssize_t
 xfs_file_buffered_write(
 	struct kiocb		*iocb,
@@ -925,8 +946,12 @@ xfs_file_buffered_write(
 	bool			cleared_space = false;
 	unsigned int		iolock;
 
+	if (xfs_allow_concurrent(iocb, from))
+		iolock = XFS_IOLOCK_SHARED;
+	else
+		iolock = XFS_IOLOCK_EXCL;
+
 write_retry:
-	iolock = XFS_IOLOCK_EXCL;
 	ret = xfs_ilock_iocb_for_write(iocb, &iolock, false);
 	if (ret)
 		return ret;
@@ -935,6 +960,13 @@ xfs_file_buffered_write(
 	if (ret)
 		goto out;
 
+	if (iolock == XFS_IOLOCK_SHARED &&
+	    iocb->ki_pos + iov_iter_count(from) > i_size_read(inode)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_IOLOCK_EXCL;
+		goto write_retry;
+	}
+
 	trace_xfs_file_buffered_write(iocb, from);
 	ret = iomap_file_buffered_write(iocb, from,
 			&xfs_buffered_write_iomap_ops, NULL);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/2] xfs: Enable concurrency when writing within single block 2025-04-25 10:38 ` [RFC PATCH 2/2] xfs: Enable concurrency when writing within single block Chi Zhiling @ 2025-04-25 15:15 ` Darrick J. Wong 2025-04-26 1:34 ` Chi Zhiling 0 siblings, 1 reply; 11+ messages in thread From: Darrick J. Wong @ 2025-04-25 15:15 UTC (permalink / raw) To: Chi Zhiling; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling On Fri, Apr 25, 2025 at 06:38:41PM +0800, Chi Zhiling wrote: > From: Chi Zhiling <chizhiling@kylinos.cn> > > For unextending writes, we will only update the pagecache and extent. > In this case, if our write occurs within a single block, that is, > within a single folio, we don't need an exclusive lock to ensure the > atomicity of the write, because we already have the folio lock. > > Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> > --- > fs/xfs/xfs_file.c | 34 +++++++++++++++++++++++++++++++++- > 1 file changed, 33 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index a6f214f57238..8eaa98464328 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -914,6 +914,27 @@ xfs_file_dax_write( > return ret; > } > > +#define offset_in_block(inode, p) ((unsigned long)(p) & (i_blocksize(inode) - 1)) Is it correct to cast an loff_t (s64) to unsigned long (u32 on i386) here? > + > +static inline bool xfs_allow_concurrent( static inline bool xfs_allow_concurrent( (separate lines style nit) > + struct kiocb *iocb, > + struct iov_iter *from) > +{ > + struct inode *inode = iocb->ki_filp->f_mapping->host; > + > + /* Extending write? */ > + if (iocb->ki_flags & IOCB_APPEND || > + iocb->ki_pos >= i_size_read(inode)) > + return false; > + > + /* Exceeds a block range? 
*/ > + if (iov_iter_count(from) > i_blocksize(inode) || > + offset_in_block(inode, iocb->ki_pos) + iov_iter_count(from) > i_blocksize(inode)) > + return false; > + > + return true; > +} ...and since this helper only has one caller, maybe it should be named xfs_buffered_write_iolock_mode and return the lock mode directly? > + > STATIC ssize_t > xfs_file_buffered_write( > struct kiocb *iocb, > @@ -925,8 +946,12 @@ xfs_file_buffered_write( > bool cleared_space = false; > unsigned int iolock; > > + if (xfs_allow_concurrent(iocb, from)) > + iolock = XFS_IOLOCK_SHARED; > + else > + iolock = XFS_IOLOCK_EXCL; > + > write_retry: > - iolock = XFS_IOLOCK_EXCL; > ret = xfs_ilock_iocb_for_write(iocb, &iolock, false); > if (ret) > return ret; > @@ -935,6 +960,13 @@ xfs_file_buffered_write( > if (ret) > goto out; > > + if (iolock == XFS_IOLOCK_SHARED && > + iocb->ki_pos + iov_iter_count(from) > i_size_read(inode)) { > + xfs_iunlock(ip, iolock); > + iolock = XFS_IOLOCK_EXCL; > + goto write_retry; > + } > + > trace_xfs_file_buffered_write(iocb, from); > ret = iomap_file_buffered_write(iocb, from, > &xfs_buffered_write_iomap_ops, NULL); > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/2] xfs: Enable concurrency when writing within single block 2025-04-25 15:15 ` Darrick J. Wong @ 2025-04-26 1:34 ` Chi Zhiling 0 siblings, 0 replies; 11+ messages in thread From: Chi Zhiling @ 2025-04-26 1:34 UTC (permalink / raw) To: Darrick J. Wong; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling On 2025/4/25 23:15, Darrick J. Wong wrote: > On Fri, Apr 25, 2025 at 06:38:41PM +0800, Chi Zhiling wrote: >> From: Chi Zhiling <chizhiling@kylinos.cn> >> >> For unextending writes, we will only update the pagecache and extent. >> In this case, if our write occurs within a single block, that is, >> within a single folio, we don't need an exclusive lock to ensure the >> atomicity of the write, because we already have the folio lock. >> >> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> >> --- >> fs/xfs/xfs_file.c | 34 +++++++++++++++++++++++++++++++++- >> 1 file changed, 33 insertions(+), 1 deletion(-) >> >> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c >> index a6f214f57238..8eaa98464328 100644 >> --- a/fs/xfs/xfs_file.c >> +++ b/fs/xfs/xfs_file.c >> @@ -914,6 +914,27 @@ xfs_file_dax_write( >> return ret; >> } >> >> +#define offset_in_block(inode, p) ((unsigned long)(p) & (i_blocksize(inode) - 1)) > > Is it correct to cast an loff_t (s64) to unsigned long (u32 on i386) > here? I'm not sure if there is an issue here, although there is a type cast, it shouldn't affect the final result of offset_in_block. > >> + >> +static inline bool xfs_allow_concurrent( > > static inline bool > xfs_allow_concurrent( > > (separate lines style nit) Okay > >> + struct kiocb *iocb, >> + struct iov_iter *from) >> +{ >> + struct inode *inode = iocb->ki_filp->f_mapping->host; >> + >> + /* Extending write? */ >> + if (iocb->ki_flags & IOCB_APPEND || >> + iocb->ki_pos >= i_size_read(inode)) >> + return false; >> + >> + /* Exceeds a block range? 
*/ >> + if (iov_iter_count(from) > i_blocksize(inode) || >> + offset_in_block(inode, iocb->ki_pos) + iov_iter_count(from) > i_blocksize(inode)) >> + return false; >> + >> + return true; >> +} > > ...and since this helper only has one caller, maybe it should be named > xfs_buffered_write_iolock_mode and return the lock mode directly? Yes, this is better. I will update it in the next patch. Thanks > >> + >> STATIC ssize_t >> xfs_file_buffered_write( >> struct kiocb *iocb, >> @@ -925,8 +946,12 @@ xfs_file_buffered_write( >> bool cleared_space = false; >> unsigned int iolock; >> >> + if (xfs_allow_concurrent(iocb, from)) >> + iolock = XFS_IOLOCK_SHARED; >> + else >> + iolock = XFS_IOLOCK_EXCL; >> + >> write_retry: >> - iolock = XFS_IOLOCK_EXCL; >> ret = xfs_ilock_iocb_for_write(iocb, &iolock, false); >> if (ret) >> return ret; >> @@ -935,6 +960,13 @@ xfs_file_buffered_write( >> if (ret) >> goto out; >> >> + if (iolock == XFS_IOLOCK_SHARED && >> + iocb->ki_pos + iov_iter_count(from) > i_size_read(inode)) { >> + xfs_iunlock(ip, iolock); >> + iolock = XFS_IOLOCK_EXCL; >> + goto write_retry; >> + } >> + >> trace_xfs_file_buffered_write(iocb, from); >> ret = iomap_file_buffered_write(iocb, from, >> &xfs_buffered_write_iomap_ops, NULL); >> -- >> 2.43.0 >> >> ^ permalink raw reply [flat|nested] 11+ messages in thread
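Regarding the `(unsigned long)` cast question above, the claim that truncation does not change the result can be checked directly: for a power-of-two block size the mask only examines low-order bits, so masking a value truncated to 32 bits equals masking the full 64-bit offset (illustrative helpers, not the kernel macro):

```c
#include <stdint.h>

/* 64-bit offset masked directly */
static uint32_t off_in_block64(int64_t pos, uint32_t bs)
{
        return (uint32_t)(pos & (int64_t)(bs - 1));
}

/* Model of the patch's macro on a 32-bit machine: the offset is
 * truncated to 32 bits (unsigned long on i386) before masking. */
static uint32_t off_in_block32(int64_t pos, uint32_t bs)
{
        return (uint32_t)pos & (bs - 1);
}
```

Since the block size never exceeds 32 bits, the discarded high bits of the offset can never survive the mask, so both forms agree.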
* Re: [RFC PATCH 0/2] Implement concurrent buffered write with folio lock
  2025-04-25 10:38 [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Chi Zhiling
  2025-04-25 10:38 ` [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode Chi Zhiling
  2025-04-25 10:38 ` [RFC PATCH 2/2] xfs: Enable concurrency when writing within single block Chi Zhiling
@ 2025-04-30  2:05 ` Dave Chinner
  2025-04-30  9:03   ` Chi Zhiling

From: Dave Chinner @ 2025-04-30  2:05 UTC (permalink / raw)
  To: Chi Zhiling; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling

On Fri, Apr 25, 2025 at 06:38:39PM +0800, Chi Zhiling wrote:
> From: Chi Zhiling <chizhiling@kylinos.cn>
>
> This is a patch attempting to implement concurrent buffered writes.
> The main idea is to use the folio lock to ensure the atomicity of the
> write when writing to a single folio, instead of using the i_rwsem.
>
> I tried the "folio batch" solution, which is a great idea, but during
> testing, I encountered an OOM issue because the locked folios couldn't
> be reclaimed.
>
> So for now, I can only allow concurrent writes within a single block.
> The good news is that since we already support BS > PS, we can use a
> larger block size to enable higher granularity concurrency.

I'm not going to say no to this, but I think it's a short term and
niche solution to the general problem of enabling shared buffered
writes. i.e. I expect that it will not exist for long, whilst
experience tells me that adding special cases to the IO path locking
has a fairly high risk of unexpected regressions and/or data
corruption....

> These ideas come from previous discussions:
> https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@163.com/

In my spare time I've been looking at using the two state lock from
bcachefs for this because it looks to provide a general solution to
the issue of concurrent buffered writes.
The two valid IO exclusion states are:

+enum {
+	XFS_IOTYPE_BUFFERED	= 0,
+	XFS_IOTYPE_DIRECT	= 1,
+};

Importantly, this gives us three states, not two:

1. Buffered IO in progress,
2. Direct IO in progress, and
3. No IO in progress. (i.e. not held at all)

When we do operations like truncate or hole punch, we need the state
to be #3 - no IO in progress. Hence we can use this like we currently
use i_dio_count for truncate with the correct lock ordering. That is,
we order the IOLOCK before the IOTYPE lock:

Buffered IO:
	IOLOCK_SHARED, IOLOCK_EXCL if IREMAPPING
	<IREMAPPING excluded>
	IOTYPE_BUFFERED
	<block waiting for in progress DIO>
	<do buffered IO>
	unlock IOTYPE_BUFFERED
	unlock IOLOCK

IREMAPPING IO:
	IOLOCK_EXCL
	set IREMAPPING
	demote to IOLOCK_SHARED
	IOTYPE_BUFFERED
	<block waiting for in progress DIO>
	<do reflink operation>
	unlock IOTYPE_BUFFERED
	clear IREMAPPING
	unlock IOLOCK

Direct IO:
	IOLOCK_SHARED
	IOTYPE_DIRECT
	<block waiting for in progress buffered, IREMAPPING>
	<do direct IO>
	<submission>
	unlock IOLOCK_SHARED
	<completion>
	unlock IOTYPE_DIRECT

Notes on DIO write file extension w.r.t. xfs_file_write_zero_eof():

- xfs_file_write_zero_eof() does buffered IO.
- needs to switch from XFS_IOTYPE_DIRECT to XFS_IOTYPE_BUFFERED
- this locks out all other DIO, as the current switch to IOLOCK_EXCL
  will do.
- DIO write path no longer needs IOLOCK_EXCL to serialise post-EOF
  block zeroing against other concurrent DIO writes.
- future optimisation target so that it doesn't serialise against
  other DIO (reads or writes) within EOF.

This path looks like:

Direct IO extension:
	IOLOCK_EXCL
	IOTYPE_BUFFERED
	<block waiting for in progress DIO>
	xfs_file_write_zero_eof();
	demote to IOLOCK_SHARED
	IOTYPE_DIRECT
	<block waiting for buffered, IREMAPPING>
	<do direct IO>
	<submission>
	unlock IOLOCK_SHARED
	<completion>
	unlock IOTYPE_DIRECT

Notes on xfs_file_dio_write_unaligned():

- this drains all DIO in flight so it has exclusive access to the
  given block being written to. This prevents races doing IO (read or
  write, buffered or direct) to that specific block.
- essentially does an exclusive, synchronous DIO write after draining
  all DIO in flight. Very slow, reliant on inode_dio_wait() existing.
- make the slow path after failing the unaligned overwrite a buffered
  write.
- switching modes to buffered drains all the DIO in flight, the
  buffered write does all the necessary sub-block zeroing in memory,
  and the next overlapping DIO or fdatasync() will flush it to disk.

This slow path looks like:

	IOLOCK_EXCL
	IOTYPE_BUFFERED
	<excludes all concurrent DIO>
	set IOCB_DONTCACHE
	iomap_file_buffered_write()

Truncate and other IO exclusion code such as fallocate() need to do
this:

	IOLOCK_EXCL
	<wait for IO state to become unlocked>

The IOLOCK_EXCL creates a submission barrier, and the "wait for IO
state to become unlocked" ensures that all buffered and direct IO
have been drained and there is no IO in flight at all.

The upside of this is that we get rid of the dependency on
inode->i_dio_count and we ensure that we don't potentially need a
similar counter for buffered writes in future. e.g. buffered
AIO+RWF_DONTCACHE+RWF_DSYNC could be optimised to use FUA and/or IO
completion side DSYNC operations like AIO+DIO+RWF_DSYNC currently
does, and that would currently need in-flight IO tracking for
truncate synchronisation. The two-state lock solution avoids that
completely.

Some work needs to be done to enable sane IO completion unlocking
(i.e. from dio->end_io). My current notes on this say:

- ->end_io only gets called once when all bios submitted for the dio
  are complete. Hence only one completion, so unlock is balanced.
- caller has no idea on error if IO was submitted and completed; if
  dio->end_io unlocks on IO error, the waiting submitter has no clue
  whether it has to unlock or not.
- need a clean submitter unlock model. Alternatives?
  - dio->end_io only unlocks on IO error when dio->wait_for_completion
    is not set (i.e. completing an AIO, submitter was given
    -EIOCBQUEUED). iomap_dio_rw() caller can then do:

	if (ret < 0 && ret != -EIOCBQUEUED) {
		/* unlock inode */
	}

  - if end_io is checking ->wait_for_completion, only ever unlock if
    it isn't set? i.e. if there is a waiter, we leave it to them to
    unlock? Simpler rule for ->end_io, cleaner for the submitter to
    handle:

	if (ret != -EIOCBQUEUED) {
		/* unlock inode */
	}

- need to move DIO write page cache invalidation and inode_dio_end()
  into ->end_io for implementations
- if no ->end_io provided, do what the current code does.

There are also a few changes needed to avoid inode->i_dio_count in
iomap:

- need a flag to tell iomap_dio_rw() not to account the DIO
- inode_dio_end() may need to be moved to ->dio_end, or we could use
  the "do not account" flag to avoid it.
- However, page cache invalidation and dsync work needs to be done
  before in-flight dio release, so we likely need to move this stuff
  to ->end_io before we drop the IOTYPE lock...
- probably can be handled with appropriate helpers...

I've implemented some of this already; I'm currently in the process
of making truncate exclusion work correctly. Once that works, I'll
post the code....

-Dave
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
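The two-state lock idea described above can be modelled with a signed holder count: positive for buffered holders, negative for direct holders, and zero for the "no IO in progress" state that truncate waits on (a hedged userspace sketch of the bcachefs-style lock, not the actual kernel implementation, which uses atomics and a waitqueue rather than a mutex/condvar pair):

```c
#include <pthread.h>

enum io_type { IOTYPE_BUFFERED, IOTYPE_DIRECT };

/*
 * Two-state shared lock: any number of holders of one type run
 * concurrently; the opposite type, and "no IO at all" waiters such as
 * truncate, block until the current holders drain.
 * v > 0: buffered holders; v < 0: direct holders; v == 0: idle.
 */
struct two_state_lock {
        pthread_mutex_t m;
        pthread_cond_t c;
        long v;
};

static void ts_lock(struct two_state_lock *l, enum io_type t)
{
        pthread_mutex_lock(&l->m);
        while (t == IOTYPE_BUFFERED ? l->v < 0 : l->v > 0)
                pthread_cond_wait(&l->c, &l->m);
        l->v += (t == IOTYPE_BUFFERED) ? 1 : -1;
        pthread_mutex_unlock(&l->m);
}

static void ts_unlock(struct two_state_lock *l, enum io_type t)
{
        pthread_mutex_lock(&l->m);
        l->v -= (t == IOTYPE_BUFFERED) ? 1 : -1;
        pthread_cond_broadcast(&l->c);
        pthread_mutex_unlock(&l->m);
}

/* Truncate-style "wait for no IO in progress" (state #3); would be
 * called with the IOLOCK held exclusively so no new IO is submitted. */
static void ts_wait_idle(struct two_state_lock *l)
{
        pthread_mutex_lock(&l->m);
        while (l->v != 0)
                pthread_cond_wait(&l->c, &l->m);
        pthread_mutex_unlock(&l->m);
}
```

Ordering the IOLOCK before this lock, as described above, is what makes ts_wait_idle() safe: the exclusive IOLOCK stops new submissions, so the holder count can only drain to zero.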
* Re: [RFC PATCH 0/2] Implement concurrent buffered write with folio lock
  2025-04-30  2:05 ` [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Dave Chinner
@ 2025-04-30  9:03 ` Chi Zhiling
  2025-04-30 23:45   ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Chi Zhiling @ 2025-04-30 9:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling

On 2025/4/30 10:05, Dave Chinner wrote:
> On Fri, Apr 25, 2025 at 06:38:39PM +0800, Chi Zhiling wrote:
>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>
>> This is a patch attempting to implement concurrent buffered writes.
>> The main idea is to use the folio lock to ensure the atomicity of the
>> write when writing to a single folio, instead of using the i_rwsem.
>>
>> I tried the "folio batch" solution, which is a great idea, but during
>> testing, I encountered an OOM issue because the locked folios couldn't
>> be reclaimed.
>>
>> So for now, I can only allow concurrent writes within a single block.
>> The good news is that since we already support BS > PS, we can use a
>> larger block size to enable higher granularity concurrency.
>
> I'm not going to say no to this, but I think it's a short term and
> niche solution to the general problem of enabling shared buffered
> writes. i.e. I expect that it will not exist for long, whilst

Hi, Dave,

Yes, it's a short-term solution, but it's enough for some scenarios.
I would also like to see a better idea.

> experience tells me that adding special cases to the IO path locking
> has a fairly high risk of unexpected regressions and/or data
> corruption....

I can't say there is definitely no data corruption, but I haven't seen
any new errors in xfstests.

We might need to add some assertions in the code to check for the risk
of data corruption, not specifically for this patch, but for the current
XFS system in general. This would help developers avoid introducing new
bugs, similar to the lockdep tool.
>
>> These ideas come from previous discussions:
>> https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@163.com/
>
> In my spare time I've been looking at using the two state lock from
> bcachefs for this because it looks to provide a general solution to
> the issue of concurrent buffered writes.

In fact, I have tried the two state lock, and it does work quite well.
However, I noticed some performance degradation in single-threaded
scenarios in UnixBench (I'm not sure if it's caused by the memory
barrier).

Since single-threaded buffered IO is still the primary read-write mode,
I don't want to introduce too much impact in single-threaded scenarios.
That's why I introduced the i_direct_mode flag, which is protected by
i_rwsem. This approach only adds a boolean operation in the fast path.

>
> The two valid IO exclusion states are:
>
> +enum {
> +	XFS_IOTYPE_BUFFERED	= 0,
> +	XFS_IOTYPE_DIRECT	= 1,
> +};
>
> Importantly, this gives us three states, not two:
>
> 1. Buffered IO in progress,
> 2. Direct IO in progress, and
> 3. No IO in progress. (i.e. not held at all)
>
> When we do operations like truncate or hole punch, we need the state
> to be #3 - no IO in progress.
>
> Hence we can use this like we currently use i_dio_count for
> truncate with the correct lock ordering. That is, we order the
> IOLOCK before the IOTYPE lock:
>
> Buffered IO:
>
> IOLOCK_SHARED, IOLOCK_EXCL if IREMAPPING
>	<IREMAPPING excluded>
> IOTYPE_BUFFERED
>	<block waiting for in progress DIO>
> <do buffered IO>
> unlock IOTYPE_BUFFERED
> unlock IOLOCK
>
> IREMAPPING IO:
>
> IOLOCK_EXCL
> set IREMAPPING
> demote to IOLOCK_SHARED
> IOTYPE_BUFFERED
>	<block waiting for in progress DIO>
> <do reflink operation>
> unlock IOTYPE_BUFFERED
> clear IREMAPPING
> unlock IOLOCK
>
> Direct IO:
>
> IOLOCK_SHARED
> IOTYPE_DIRECT
>	<block waiting for in progress buffered, IREMAPPING>
> <do direct IO>
> <submission>
> unlock IOLOCK_SHARED
> <completion>
> unlock IOTYPE_DIRECT
>
> Notes on DIO write file extension w.r.t. xfs_file_write_zero_eof():
> - xfs_file_write_zero_eof() does buffered IO.
> - needs to switch from XFS_IOTYPE_DIRECT to XFS_IOTYPE_BUFFERED
> - this locks out all other DIO, as the current switch to
>   IOLOCK_EXCL will do.
> - DIO write path no longer needs IOLOCK_EXCL to serialise post-EOF
>   block zeroing against other concurrent DIO writes.
> - future optimisation target so that it doesn't serialise against
>   other DIO (reads or writes) within EOF.
>
> This path looks like:
>
> Direct IO extension:
>
> IOLOCK_EXCL
> IOTYPE_BUFFERED
>	<block waiting for in progress DIO>
> xfs_file_write_zero_eof();
> demote to IOLOCK_SHARED
> IOTYPE_DIRECT
>	<block waiting for buffered, IREMAPPING>
> <do direct IO>
> <submission>
> unlock IOLOCK_SHARED
> <completion>
> unlock IOTYPE_DIRECT
>
> Notes on xfs_file_dio_write_unaligned()
> - this drains all DIO in flight so it has exclusive access to the
>   given block being written to. This prevents races doing IO (read
>   or write, buffered or direct) to that specific block.
> - essentially does an exclusive, synchronous DIO write after
>   draining all DIO in flight. Very slow, reliant on inode_dio_wait()
>   existing.
> - make the slow path after failing the unaligned overwrite a
>   buffered write.
> - switching modes to buffered drains all the DIO in flight, the
>   buffered write does all the necessary sub-block zeroing in memory,
>   and the next overlapping DIO or fdatasync() will flush it to disk.
>
> This slow path looks like:
>
> IOLOCK_EXCL
> IOTYPE_BUFFERED
>	<excludes all concurrent DIO>
> set IOCB_DONTCACHE
> iomap_file_buffered_write()
>
> Truncate and other IO exclusion code such as fallocate() needs to do
> this:
>
> IOLOCK_EXCL
> <wait for IO state to become unlocked>
>
> The IOLOCK_EXCL creates a submission barrier, and the "wait for IO
> state to become unlocked" ensures that all buffered and direct IO
> have been drained and there is no IO in flight at all.
>
> The upside of this is that we get rid of the dependency on
> inode->i_dio_count and we ensure that we don't potentially need a
> similar counter for buffered writes in future. e.g. buffered
> AIO+RWF_DONTCACHE+RWF_DSYNC could be optimised to use FUA and/or IO
> completion side DSYNC operations like AIO+DIO+RWF_DSYNC currently
> does and that would currently need in-flight IO tracking for truncate
> synchronisation. The two-state lock solution avoids that completely.

Okay, sounds great.

>
> Some work needs to be done to enable sane IO completion unlocking
> (i.e. from dio->end_io). My current notes on this say:
>
> - ->end_io only gets called once when all bios submitted for the dio
>   are complete. hence only one completion, so unlock is balanced
> - caller has no idea on error if IO was submitted and completed;
>   if dio->end_io unlocks on IO error, the waiting submitter has no
>   clue whether it has to unlock or not.
> - need a clean submitter unlock model. Alternatives?
>   - dio->end_io only unlocks on IO error when
>     dio->wait_for_completion is not set (i.e. completing an AIO,
>     submitter was given -EIOCBQUEUED). iomap_dio_rw() caller can
>     then do:
>
>	if (ret < 0 && ret != -EIOCBQUEUED) {
>		/* unlock inode */
>	}
>
>   - if end_io is checking ->wait_for_completion, only ever unlock
>     if it isn't set? i.e. if there is a waiter, we leave it to them
>     to unlock? Simpler rule for ->end_io, cleaner for the submitter
>     to handle:
>
>	if (ret != -EIOCBQUEUED) {
>		/* unlock inode */
>	}
>
> - need to move DIO write page cache invalidation and inode_dio_end()
>   into ->end_io for implementations
> - if no ->end_io provided, do what the current code does.
>
> There are also a few changes needed to avoid inode->i_dio_count in
> iomap:
> - need a flag to tell iomap_dio_rw() not to account the DIO
> - inode_dio_end() may need to be moved to ->dio_end, or we could
>   use the "do not account" flag to avoid it.
> - However, page cache invalidation and dsync work needs to be done
>   before the in-flight dio release, so we likely need to move this
>   stuff to ->end_io before we drop the IOTYPE lock...
> - probably can be handled with appropriate helpers...
>
> I've implemented some of this already; I'm currently in the process
> of making truncate exclusion work correctly. Once that works, I'll
> post the code....

Thank you for sharing your thoughts; I will be waiting for that.

>
> -Dave.
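The "simpler rule" at the end of Dave's notes — the submitter unlocks unless it was handed -EIOCBQUEUED, and ->end_io unlocks only in that async case — can be checked for balance with a trivial model. The function names here are made up for the sketch; only the unlock rule and the -EIOCBQUEUED value (529, the Linux-internal errno) come from the discussion:

```c
#include <assert.h>

#define EIOCBQUEUED	529	/* Linux-internal errno: IO queued for async completion */

static int unlock_count;	/* how many times the IOTYPE lock was dropped */

static void iotype_unlock(void)
{
	unlock_count++;
}

/*
 * ->end_io side: called exactly once per dio.  Here the submitter's
 * return value stands in for dio->wait_for_completion: ->end_io
 * unlocks only when the submitter went async (was handed
 * -EIOCBQUEUED) and so will not unlock itself.
 */
static void end_io(int submit_ret)
{
	if (submit_ret == -EIOCBQUEUED)
		iotype_unlock();
}

/* Submitter side: the "simpler rule for ->end_io" quoted above. */
static void submit_done(int ret)
{
	if (ret != -EIOCBQUEUED)
		iotype_unlock();
}
```

Pairing the two halves for a sync success, an async submission, and a sync IO error yields exactly one unlock per dio in every case, which is the balance property the notes are after.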
* Re: [RFC PATCH 0/2] Implement concurrent buffered write with folio lock
  2025-04-30  9:03 ` Chi Zhiling
@ 2025-04-30 23:45   ` Dave Chinner
  2025-05-01 23:57     ` Chi Zhiling
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2025-04-30 23:45 UTC (permalink / raw)
To: Chi Zhiling; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling

On Wed, Apr 30, 2025 at 05:03:51PM +0800, Chi Zhiling wrote:
> On 2025/4/30 10:05, Dave Chinner wrote:
> > On Fri, Apr 25, 2025 at 06:38:39PM +0800, Chi Zhiling wrote:
> > > From: Chi Zhiling <chizhiling@kylinos.cn>
> > >
> > > This is a patch attempting to implement concurrent buffered writes.
> > > The main idea is to use the folio lock to ensure the atomicity of the
> > > write when writing to a single folio, instead of using the i_rwsem.
> > >
> > > I tried the "folio batch" solution, which is a great idea, but during
> > > testing, I encountered an OOM issue because the locked folios couldn't
> > > be reclaimed.
> > >
> > > So for now, I can only allow concurrent writes within a single block.
> > > The good news is that since we already support BS > PS, we can use a
> > > larger block size to enable higher granularity concurrency.
> >
> > I'm not going to say no to this, but I think it's a short term and
> > niche solution to the general problem of enabling shared buffered
> > writes. i.e. I expect that it will not exist for long, whilst
>
> Hi, Dave,
>
> Yes, it's a short-term solution, but it's enough for some scenarios.
> I would also like to see a better idea.
>
> > experience tells me that adding special cases to the IO path locking
> > has a fairly high risk of unexpected regressions and/or data
> > corruption....
>
> I can't say there is definitely no data corruption, but I haven't seen
> any new errors in xfstests.

Yeah, that's why they are "unexpected regressions" - testing looks
fine, but once it gets out into complex production workloads....
> We might need to add some assertions in the code to check for the risk
> of data corruption, not specifically for this patch, but for the current
> XFS system in general. This would help developers avoid introducing new
> bugs, similar to the lockdep tool.

I'm not sure what you envisage here or what problems you think we
might be able to catch - can you describe what you are thinking
about here?

> > > These ideas come from previous discussions:
> > > https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@163.com/
> >
> > In my spare time I've been looking at using the two state lock from
> > bcachefs for this because it looks to provide a general solution to
> > the issue of concurrent buffered writes.
>
> In fact, I have tried the two state lock, and it does work quite well.
> However, I noticed some performance degradation in single-threaded
> scenarios in UnixBench (I'm not sure if it's caused by the memory
> barrier).

Please share the patch - I'd like to see how you implemented it and
how you solved the various lock ordering and wider IO serialisation
issues. It may be that I've overlooked something and your
implementation makes me aware of it. OTOH, I might see something in
your patch that could be improved that mitigates the regression.

> Since single-threaded buffered IO is still the primary read-write mode,
> I don't want to introduce too much impact in single-threaded scenarios.

I mostly don't care that much about small single threaded
performance regressions anywhere in XFS if there is some upside for
scalability or performance. We've always traded off single threaded
performance for better concurrency and/or scalability in XFS (right
from the initial design choices way back in the early 1990s), so I
don't see why we'd treat a significant improvement in buffered IO
concurrency any differently.

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 0/2] Implement concurrent buffered write with folio lock
  2025-04-30 23:45 ` Dave Chinner
@ 2025-05-01 23:57   ` Chi Zhiling
  0 siblings, 0 replies; 11+ messages in thread
From: Chi Zhiling @ 2025-05-01 23:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: cem, linux-xfs, linux-kernel, Chi Zhiling

On 2025/5/1 07:45, Dave Chinner wrote:
> On Wed, Apr 30, 2025 at 05:03:51PM +0800, Chi Zhiling wrote:
>> On 2025/4/30 10:05, Dave Chinner wrote:
>>> On Fri, Apr 25, 2025 at 06:38:39PM +0800, Chi Zhiling wrote:
>>>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>>>
>>>> This is a patch attempting to implement concurrent buffered writes.
>>>> The main idea is to use the folio lock to ensure the atomicity of the
>>>> write when writing to a single folio, instead of using the i_rwsem.
>>>>
>>>> I tried the "folio batch" solution, which is a great idea, but during
>>>> testing, I encountered an OOM issue because the locked folios couldn't
>>>> be reclaimed.
>>>>
>>>> So for now, I can only allow concurrent writes within a single block.
>>>> The good news is that since we already support BS > PS, we can use a
>>>> larger block size to enable higher granularity concurrency.
>>>
>>> I'm not going to say no to this, but I think it's a short term and
>>> niche solution to the general problem of enabling shared buffered
>>> writes. i.e. I expect that it will not exist for long, whilst
>>
>> Hi, Dave,
>>
>> Yes, it's a short-term solution, but it's enough for some scenarios.
>> I would also like to see a better idea.
>>
>>> experience tells me that adding special cases to the IO path locking
>>> has a fairly high risk of unexpected regressions and/or data
>>> corruption....
>>
>> I can't say there is definitely no data corruption, but I haven't seen
>> any new errors in xfstests.
>
> Yeah, that's why they are "unexpected regressions" - testing looks
> fine, but once it gets out into complex production workloads....
>
>> We might need to add some assertions in the code to check for the risk
>> of data corruption, not specifically for this patch, but for the current
>> XFS system in general. This would help developers avoid introducing new
>> bugs, similar to the lockdep tool.
>
> I'm not sure what you envisage here or what problems you think we
> might be able to catch - can you describe what you are thinking
> about here?

I was just saying it casually. I mean, is there a way to check for data
corruption risks, rather than waiting for it to happen and then
reporting an error? Just like how lockdep detects deadlock risks in
advance. I guess not.

>
>>>> These ideas come from previous discussions:
>>>> https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@163.com/
>>>
>>> In my spare time I've been looking at using the two state lock from
>>> bcachefs for this because it looks to provide a general solution to
>>> the issue of concurrent buffered writes.
>>
>> In fact, I have tried the two state lock, and it does work quite well.
>> However, I noticed some performance degradation in single-threaded
>> scenarios in UnixBench (I'm not sure if it's caused by the memory
>> barrier).
>
> Please share the patch - I'd like to see how you implemented it and
> how you solved the various lock ordering and wider IO serialisation
> issues. It may be that I've overlooked something and your
> implementation makes me aware of it. OTOH, I might see something in
> your patch that could be improved that mitigates the regression.

I think I haven't solved these problems. The lock order I envisioned is
IOLOCK -> TWO_STATE_LOCK -> MMAP_LOCK -> ILOCK, which means that before
releasing the IOLOCK, the TWO_STATE_LOCK should be released first,
including when upgrading IOLOCK_SHARED to IOLOCK_EXCL.

However, I didn't do this. I missed this part, and although I didn't
encounter any issues in xfstests, this could indeed lead to a deadlock.

Besides this, is there anything else I have missed?
The patch is as follows, though it's not very helpful:

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3e7448c2a969..573e31bfef3f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -36,6 +36,17 @@
 static const struct vm_operations_struct xfs_file_vm_ops;
 
+#define TWO_STATE_LOCK(ip, state) \
+	xfs_two_state_lock(&ip->i_write_lock, state)
+
+#define TWO_STATE_UNLOCK(ip, state) \
+	xfs_two_state_unlock(&ip->i_write_lock, state)
+
+#define buffered_lock(inode)	TWO_STATE_LOCK(inode, 0)
+#define buffered_unlock(inode)	TWO_STATE_UNLOCK(inode, 0)
+#define direct_lock(inode)	TWO_STATE_LOCK(inode, 1)
+#define direct_unlock(inode)	TWO_STATE_UNLOCK(inode, 1)
+
 /*
  * Decide if the given file range is aligned to the size of the fundamental
  * allocation unit for the file.
@@ -263,7 +274,10 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
+	direct_lock(ip);
 	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+	direct_unlock(ip);
+
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -598,9 +612,13 @@ xfs_file_dio_write_aligned(
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
 		iolock = XFS_IOLOCK_SHARED;
 	}
+
+	direct_lock(ip);
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
 			&xfs_dio_write_ops, 0, NULL, 0);
+	direct_unlock(ip);
+
 out_unlock:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -676,9 +694,11 @@ xfs_file_dio_write_unaligned(
 	if (flags & IOMAP_DIO_FORCE_WAIT)
 		inode_dio_wait(VFS_I(ip));
 
+	direct_lock(ip);
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
 			&xfs_dio_write_ops, flags, NULL, 0);
+	direct_unlock(ip);
 
 	/*
 	 * Retry unaligned I/O with exclusive blocking semantics if the DIO
@@ -776,9 +796,11 @@ xfs_file_buffered_write(
 	if (ret)
 		goto out;
 
+	buffered_lock(ip);
 	trace_xfs_file_buffered_write(iocb, from);
 	ret = iomap_file_buffered_write(iocb, from,
 			&xfs_buffered_write_iomap_ops);
+	buffered_unlock(ip);
 
 	/*
 	 * If we hit a space limit, try to free up some lingering preallocated
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 52210a54fe7e..a8bc8d9737c4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -114,6 +114,7 @@ xfs_inode_alloc(
 	spin_lock_init(&ip->i_ioend_lock);
 	ip->i_next_unlinked = NULLAGINO;
 	ip->i_prev_unlinked = 0;
+	two_state_lock_init(&ip->i_write_lock);
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b91aaa23ea1e..9a8c75feda16 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -8,6 +8,7 @@
 
 #include "xfs_inode_buf.h"
 #include "xfs_inode_fork.h"
+#include "xfs_lock.h"
 
 /*
  * Kernel only inode definitions
@@ -92,6 +93,8 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	two_state_lock_t	i_write_lock;
 } xfs_inode_t;
 
 static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)

Thanks

>
>> Since single-threaded buffered IO is still the primary read-write mode,
>> I don't want to introduce too much impact in single-threaded scenarios.
>
> I mostly don't care that much about small single threaded
> performance regressions anywhere in XFS if there is some upside for
> scalability or performance. We've always traded off single threaded
> performance for better concurrency and/or scalability in XFS (right
> from the initial design choices way back in the early 1990s), so I
> don't see why we'd treat a significant improvement in buffered IO
> concurrency any differently.
>
> -Dave.
end of thread, other threads: [~2025-05-01 23:58 UTC | newest]

Thread overview: 11+ messages
2025-04-25 10:38 [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Chi Zhiling
2025-04-25 10:38 ` [RFC PATCH 1/2] xfs: Add i_direct_mode to indicate the IO mode of inode Chi Zhiling
2025-04-25 15:12   ` Darrick J. Wong
2025-04-26  1:28     ` Chi Zhiling
2025-04-25 10:38 ` [RFC PATCH 2/2] xfs: Enable concurrency when writing within single block Chi Zhiling
2025-04-25 15:15   ` Darrick J. Wong
2025-04-26  1:34     ` Chi Zhiling
2025-04-30  2:05 ` [RFC PATCH 0/2] Implement concurrent buffered write with folio lock Dave Chinner
2025-04-30  9:03   ` Chi Zhiling
2025-04-30 23:45     ` Dave Chinner
2025-05-01 23:57       ` Chi Zhiling