* [PATCH 0/3] nfs: Improve throughput for random buffered writes
@ 2024-06-13 8:28 Jan Kara
2024-06-13 8:28 ` [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages() Jan Kara
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Jan Kara @ 2024-06-13 8:28 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, Neil Brown, Jan Kara
Hello,
I was thinking about how best to address the performance regression caused by
NFS write congestion handling. After considering various options and the concerns
raised in the previous discussion, I've got an idea for a simple approach that
could help keep the server busier - just mimic what block devices do and
block the flush worker until congestion resolves instead of aborting
the writeback. And it actually helps enough that I don't think more complex
solutions are warranted at this point.
This patch series has two preparatory cleanups and then a patch implementing
this idea.
Honza
* [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages()
2024-06-13 8:28 [PATCH 0/3] nfs: Improve throughput for random buffered writes Jan Kara
@ 2024-06-13 8:28 ` Jan Kara
2024-06-13 9:17 ` Sagi Grimberg
2024-06-13 19:55 ` Jeff Layton
2024-06-13 8:28 ` [PATCH 2/3] nfs: Properly initialize server->writeback Jan Kara
2024-06-13 8:28 ` [PATCH 3/3] nfs: Block on write congestion Jan Kara
2 siblings, 2 replies; 14+ messages in thread
From: Jan Kara @ 2024-06-13 8:28 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, Neil Brown, Jan Kara
nfss->writeback is updated only when we are ending page writeback, and at
that moment we also clear nfss->write_congested. So there's no point in
rechecking the congestion state in nfs_commit_release_pages(). Drop the
pointless check.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/nfs/write.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5fc12a721ec3..c6255d7edd3c 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1837,7 +1837,6 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
struct nfs_page *req;
int status = data->task.tk_status;
struct nfs_commit_info cinfo;
- struct nfs_server *nfss;
struct folio *folio;
while (!list_empty(&data->pages)) {
@@ -1880,9 +1879,6 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
/* Latency breaker */
cond_resched();
}
- nfss = NFS_SERVER(data->inode);
- if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- nfss->write_congested = 0;
nfs_init_cinfo(&cinfo, data->inode, data->dreq);
nfs_commit_end(cinfo.mds);
--
2.35.3
* [PATCH 2/3] nfs: Properly initialize server->writeback
2024-06-13 8:28 [PATCH 0/3] nfs: Improve throughput for random buffered writes Jan Kara
2024-06-13 8:28 ` [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages() Jan Kara
@ 2024-06-13 8:28 ` Jan Kara
2024-06-13 9:17 ` Sagi Grimberg
2024-06-13 19:56 ` Jeff Layton
2024-06-13 8:28 ` [PATCH 3/3] nfs: Block on write congestion Jan Kara
2 siblings, 2 replies; 14+ messages in thread
From: Jan Kara @ 2024-06-13 8:28 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, Neil Brown, Jan Kara
Atomic types are better initialized with atomic_long_set() than by
relying on the zeroing done by kzalloc(). Clean this up.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/nfs/client.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index de77848ae654..3b252dceebf5 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -994,6 +994,8 @@ struct nfs_server *nfs_alloc_server(void)
server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED;
+ atomic_long_set(&server->writeback, 0);
+
ida_init(&server->openowner_id);
ida_init(&server->lockowner_id);
pnfs_init_server(server);
--
2.35.3
* [PATCH 3/3] nfs: Block on write congestion
2024-06-13 8:28 [PATCH 0/3] nfs: Improve throughput for random buffered writes Jan Kara
2024-06-13 8:28 ` [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages() Jan Kara
2024-06-13 8:28 ` [PATCH 2/3] nfs: Properly initialize server->writeback Jan Kara
@ 2024-06-13 8:28 ` Jan Kara
2024-06-13 9:18 ` Sagi Grimberg
` (2 more replies)
2 siblings, 3 replies; 14+ messages in thread
From: Jan Kara @ 2024-06-13 8:28 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, Neil Brown, Jan Kara
Commit 6df25e58532b ("nfs: remove reliance on bdi congestion")
introduced an NFS-private mechanism for limiting the number of writes
outstanding against a particular server. Unlike the previous bdi
congestion scheme, this algorithm actually works and limits the number
of outstanding writeback pages to nfs_congestion_kb, which scales with
the amount of the client's memory and is capped at 256 MB. As a result,
some workloads such as random buffered writes over NFS got slower (from
~170 MB/s to ~126 MB/s). The fio command to reproduce is:
fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1
--runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite
--size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1
This happens because the client sends ~256 MB worth of dirty pages to
the server and any further background writeback request is ignored until
the number of writeback pages drops below the 192 MB threshold. By the
time that happens and the client decides to trigger another round of
writeback, the server often has no pages to write and its disk is idle.
Fix this problem and make the client react faster to eased server
congestion by blocking until congestion resolves instead of aborting
writeback. This improves the random 4k buffered write throughput to
184 MB/s.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/nfs/client.c | 1 +
fs/nfs/write.c | 12 ++++++++----
include/linux/nfs_fs_sb.h | 1 +
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 3b252dceebf5..8286edd6062d 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -994,6 +994,7 @@ struct nfs_server *nfs_alloc_server(void)
server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED;
+ init_waitqueue_head(&server->write_congestion_wait);
atomic_long_set(&server->writeback, 0);
ida_init(&server->openowner_id);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index c6255d7edd3c..21a5f1e90859 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -423,8 +423,10 @@ static void nfs_folio_end_writeback(struct folio *folio)
folio_end_writeback(folio);
if (atomic_long_dec_return(&nfss->writeback) <
- NFS_CONGESTION_OFF_THRESH)
+ NFS_CONGESTION_OFF_THRESH) {
nfss->write_congested = 0;
+ wake_up_all(&nfss->write_congestion_wait);
+ }
}
static void nfs_page_end_writeback(struct nfs_page *req)
@@ -698,12 +700,14 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
struct nfs_pageio_descriptor pgio;
struct nfs_io_completion *ioc = NULL;
unsigned int mntflags = NFS_SERVER(inode)->flags;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int priority = 0;
int err;
- if (wbc->sync_mode == WB_SYNC_NONE &&
- NFS_SERVER(inode)->write_congested)
- return 0;
+ /* Wait with writeback until write congestion eases */
+ if (wbc->sync_mode == WB_SYNC_NONE && nfss->write_congested)
+ wait_event(nfss->write_congestion_wait,
+ nfss->write_congested == 0);
nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 92de074e63b9..38a128796a76 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -140,6 +140,7 @@ struct nfs_server {
struct rpc_clnt * client_acl; /* ACL RPC client handle */
struct nlm_host *nlm_host; /* NLM client handle */
struct nfs_iostats __percpu *io_stats; /* I/O statistics */
+ wait_queue_head_t write_congestion_wait; /* wait until write congestion eases */
atomic_long_t writeback; /* number of writeback pages */
unsigned int write_congested;/* flag set when writeback gets too high */
unsigned int flags; /* various flags */
--
2.35.3
* Re: [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages()
2024-06-13 8:28 ` [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages() Jan Kara
@ 2024-06-13 9:17 ` Sagi Grimberg
2024-06-13 19:55 ` Jeff Layton
1 sibling, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2024-06-13 9:17 UTC (permalink / raw)
To: Jan Kara, Trond Myklebust; +Cc: linux-nfs, Neil Brown
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH 2/3] nfs: Properly initialize server->writeback
2024-06-13 8:28 ` [PATCH 2/3] nfs: Properly initialize server->writeback Jan Kara
@ 2024-06-13 9:17 ` Sagi Grimberg
2024-06-13 19:56 ` Jeff Layton
1 sibling, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2024-06-13 9:17 UTC (permalink / raw)
To: Jan Kara, Trond Myklebust; +Cc: linux-nfs, Neil Brown
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH 3/3] nfs: Block on write congestion
2024-06-13 8:28 ` [PATCH 3/3] nfs: Block on write congestion Jan Kara
@ 2024-06-13 9:18 ` Sagi Grimberg
2024-06-13 20:04 ` Jeff Layton
2024-06-13 22:56 ` NeilBrown
2 siblings, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2024-06-13 9:18 UTC (permalink / raw)
To: Jan Kara, Trond Myklebust; +Cc: linux-nfs, Neil Brown
On 13/06/2024 11:28, Jan Kara wrote:
> Commit 6df25e58532b ("nfs: remove reliance on bdi congestion")
> introduced NFS-private solution for limiting number of writes
> outstanding against a particular server. Unlike previous bdi congestion
> this algorithm actually works and limits number of outstanding writeback
> pages to nfs_congestion_kb which scales with amount of client's memory
> and is capped at 256 MB. As a result some workloads such as random
> buffered writes over NFS got slower (from ~170 MB/s to ~126 MB/s). The
> fio command to reproduce is:
>
> fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1
> --runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite
> --size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1
>
> This happens because the client sends ~256 MB worth of dirty pages to
> the server and any further background writeback request is ignored until
> the number of writeback pages gets below the threshold of 192 MB. By the
> time this happens and clients decides to trigger another round of
> writeback, the server often has no pages to write and the disk is idle.
>
> To fix this problem and make the client react faster to eased congestion
> of the server by blocking waiting for congestion to resolve instead of
> aborting writeback. This improves the random 4k buffered write
> throughput to 184 MB/s.
Nice,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages()
2024-06-13 8:28 ` [PATCH 1/3] nfs: Drop pointless check from nfs_commit_release_pages() Jan Kara
2024-06-13 9:17 ` Sagi Grimberg
@ 2024-06-13 19:55 ` Jeff Layton
1 sibling, 0 replies; 14+ messages in thread
From: Jeff Layton @ 2024-06-13 19:55 UTC (permalink / raw)
To: Jan Kara, Trond Myklebust; +Cc: linux-nfs, Neil Brown
On Thu, 2024-06-13 at 10:28 +0200, Jan Kara wrote:
> nfss->writeback is updated only when we are ending page writeback and at
> that moment we also clear nfss->write_congested. So there's no point in
> rechecking congestion state in nfs_commit_release_pages(). Drop the
> pointless check.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/nfs/write.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 5fc12a721ec3..c6255d7edd3c 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1837,7 +1837,6 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> struct nfs_page *req;
> int status = data->task.tk_status;
> struct nfs_commit_info cinfo;
> - struct nfs_server *nfss;
> struct folio *folio;
>
> while (!list_empty(&data->pages)) {
> @@ -1880,9 +1879,6 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> /* Latency breaker */
> cond_resched();
> }
> - nfss = NFS_SERVER(data->inode);
> - if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> - nfss->write_congested = 0;
>
> nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> nfs_commit_end(cinfo.mds);
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 2/3] nfs: Properly initialize server->writeback
2024-06-13 8:28 ` [PATCH 2/3] nfs: Properly initialize server->writeback Jan Kara
2024-06-13 9:17 ` Sagi Grimberg
@ 2024-06-13 19:56 ` Jeff Layton
2024-06-14 11:33 ` Jan Kara
1 sibling, 1 reply; 14+ messages in thread
From: Jeff Layton @ 2024-06-13 19:56 UTC (permalink / raw)
To: Jan Kara, Trond Myklebust; +Cc: linux-nfs, Neil Brown
On Thu, 2024-06-13 at 10:28 +0200, Jan Kara wrote:
> Atomic types should better be initialized with atomic_long_set() instead
> of relying on zeroing done by kzalloc(). Clean this up.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/nfs/client.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index de77848ae654..3b252dceebf5 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -994,6 +994,8 @@ struct nfs_server *nfs_alloc_server(void)
>
> server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED;
>
> + atomic_long_set(&server->writeback, 0);
> +
> ida_init(&server->openowner_id);
> ida_init(&server->lockowner_id);
> pnfs_init_server(server);
I'm guilty of doing this, well, all over the place. Is there any
plausible scenario where another task could see this value set to
non-zero after kzalloc()? One would hope not...
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 3/3] nfs: Block on write congestion
2024-06-13 8:28 ` [PATCH 3/3] nfs: Block on write congestion Jan Kara
2024-06-13 9:18 ` Sagi Grimberg
@ 2024-06-13 20:04 ` Jeff Layton
2024-06-14 11:44 ` Jan Kara
2024-06-13 22:56 ` NeilBrown
2 siblings, 1 reply; 14+ messages in thread
From: Jeff Layton @ 2024-06-13 20:04 UTC (permalink / raw)
To: Jan Kara, Trond Myklebust; +Cc: linux-nfs, Neil Brown
On Thu, 2024-06-13 at 10:28 +0200, Jan Kara wrote:
> Commit 6df25e58532b ("nfs: remove reliance on bdi congestion")
> introduced NFS-private solution for limiting number of writes
> outstanding against a particular server. Unlike previous bdi congestion
> this algorithm actually works and limits number of outstanding writeback
> pages to nfs_congestion_kb which scales with amount of client's memory
> and is capped at 256 MB. As a result some workloads such as random
> buffered writes over NFS got slower (from ~170 MB/s to ~126 MB/s). The
> fio command to reproduce is:
>
That 256M cap was set back in 2007. I wonder if we ought to reconsider
that. It might be worth experimenting these days with a higher cap on
larger machines. Might that improve throughput even more?
> fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1
> --runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite
> --size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1
>
> This happens because the client sends ~256 MB worth of dirty pages to
> the server and any further background writeback request is ignored until
> the number of writeback pages gets below the threshold of 192 MB. By the
> time this happens and clients decides to trigger another round of
> writeback, the server often has no pages to write and the disk is idle.
>
> To fix this problem and make the client react faster to eased congestion
> of the server by blocking waiting for congestion to resolve instead of
> aborting writeback. This improves the random 4k buffered write
> throughput to 184 MB/s.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/nfs/client.c | 1 +
> fs/nfs/write.c | 12 ++++++++----
> include/linux/nfs_fs_sb.h | 1 +
> 3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index 3b252dceebf5..8286edd6062d 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -994,6 +994,7 @@ struct nfs_server *nfs_alloc_server(void)
>
> server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED;
>
> + init_waitqueue_head(&server->write_congestion_wait);
> atomic_long_set(&server->writeback, 0);
>
> ida_init(&server->openowner_id);
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c6255d7edd3c..21a5f1e90859 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -423,8 +423,10 @@ static void nfs_folio_end_writeback(struct folio *folio)
>
> folio_end_writeback(folio);
> if (atomic_long_dec_return(&nfss->writeback) <
> - NFS_CONGESTION_OFF_THRESH)
> + NFS_CONGESTION_OFF_THRESH) {
> nfss->write_congested = 0;
> + wake_up_all(&nfss->write_congestion_wait);
> + }
> }
>
> static void nfs_page_end_writeback(struct nfs_page *req)
> @@ -698,12 +700,14 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
> struct nfs_pageio_descriptor pgio;
> struct nfs_io_completion *ioc = NULL;
> unsigned int mntflags = NFS_SERVER(inode)->flags;
> + struct nfs_server *nfss = NFS_SERVER(inode);
> int priority = 0;
> int err;
>
> - if (wbc->sync_mode == WB_SYNC_NONE &&
> - NFS_SERVER(inode)->write_congested)
> - return 0;
> + /* Wait with writeback until write congestion eases */
> + if (wbc->sync_mode == WB_SYNC_NONE && nfss->write_congested)
> + wait_event(nfss->write_congestion_wait,
> + nfss->write_congested == 0);
>
It seems odd to block here, but if that helps throughput then so be it.
> nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 92de074e63b9..38a128796a76 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -140,6 +140,7 @@ struct nfs_server {
> struct rpc_clnt * client_acl; /* ACL RPC client handle */
> struct nlm_host *nlm_host; /* NLM client handle */
> struct nfs_iostats __percpu *io_stats; /* I/O statistics */
> + wait_queue_head_t write_congestion_wait; /* wait until write congestion eases */
> atomic_long_t writeback; /* number of writeback pages */
> unsigned int write_congested;/* flag set when writeback gets too high */
> unsigned int flags; /* various flags */
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 3/3] nfs: Block on write congestion
2024-06-13 8:28 ` [PATCH 3/3] nfs: Block on write congestion Jan Kara
2024-06-13 9:18 ` Sagi Grimberg
2024-06-13 20:04 ` Jeff Layton
@ 2024-06-13 22:56 ` NeilBrown
2024-06-14 11:57 ` Jan Kara
2 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2024-06-13 22:56 UTC (permalink / raw)
To: Jan Kara; +Cc: Trond Myklebust, linux-nfs, Jan Kara
On Thu, 13 Jun 2024, Jan Kara wrote:
> Commit 6df25e58532b ("nfs: remove reliance on bdi congestion")
> introduced NFS-private solution for limiting number of writes
> outstanding against a particular server. Unlike previous bdi congestion
> this algorithm actually works and limits number of outstanding writeback
> pages to nfs_congestion_kb which scales with amount of client's memory
> and is capped at 256 MB. As a result some workloads such as random
> buffered writes over NFS got slower (from ~170 MB/s to ~126 MB/s). The
> fio command to reproduce is:
>
> fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1
> --runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite
> --size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1
>
> This happens because the client sends ~256 MB worth of dirty pages to
> the server and any further background writeback request is ignored until
> the number of writeback pages gets below the threshold of 192 MB. By the
> time this happens and clients decides to trigger another round of
> writeback, the server often has no pages to write and the disk is idle.
>
> To fix this problem and make the client react faster to eased congestion
> of the server by blocking waiting for congestion to resolve instead of
> aborting writeback. This improves the random 4k buffered write
> throughput to 184 MB/s.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/nfs/client.c | 1 +
> fs/nfs/write.c | 12 ++++++++----
> include/linux/nfs_fs_sb.h | 1 +
> 3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index 3b252dceebf5..8286edd6062d 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -994,6 +994,7 @@ struct nfs_server *nfs_alloc_server(void)
>
> server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED;
>
> + init_waitqueue_head(&server->write_congestion_wait);
> atomic_long_set(&server->writeback, 0);
>
> ida_init(&server->openowner_id);
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c6255d7edd3c..21a5f1e90859 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -423,8 +423,10 @@ static void nfs_folio_end_writeback(struct folio *folio)
>
> folio_end_writeback(folio);
> if (atomic_long_dec_return(&nfss->writeback) <
> - NFS_CONGESTION_OFF_THRESH)
> + NFS_CONGESTION_OFF_THRESH) {
> nfss->write_congested = 0;
> + wake_up_all(&nfss->write_congestion_wait);
> + }
> }
>
> static void nfs_page_end_writeback(struct nfs_page *req)
> @@ -698,12 +700,14 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
> struct nfs_pageio_descriptor pgio;
> struct nfs_io_completion *ioc = NULL;
> unsigned int mntflags = NFS_SERVER(inode)->flags;
> + struct nfs_server *nfss = NFS_SERVER(inode);
> int priority = 0;
> int err;
>
> - if (wbc->sync_mode == WB_SYNC_NONE &&
> - NFS_SERVER(inode)->write_congested)
> - return 0;
> + /* Wait with writeback until write congestion eases */
> + if (wbc->sync_mode == WB_SYNC_NONE && nfss->write_congested)
> + wait_event(nfss->write_congestion_wait,
> + nfss->write_congested == 0);
If there is a network problem or the server reboots, this could wait
indefinitely. Maybe wait_event_idle() ??
Is this only called from the writeback thread that is dedicated to this
fs? If so, that should be fine. If it can be reached from a user
process, then wait_event_killable() might be needed.
Thanks,
NeilBrown
>
> nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 92de074e63b9..38a128796a76 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -140,6 +140,7 @@ struct nfs_server {
> struct rpc_clnt * client_acl; /* ACL RPC client handle */
> struct nlm_host *nlm_host; /* NLM client handle */
> struct nfs_iostats __percpu *io_stats; /* I/O statistics */
> + wait_queue_head_t write_congestion_wait; /* wait until write congestion eases */
> atomic_long_t writeback; /* number of writeback pages */
> unsigned int write_congested;/* flag set when writeback gets too high */
> unsigned int flags; /* various flags */
> --
> 2.35.3
>
>
* Re: [PATCH 2/3] nfs: Properly initialize server->writeback
2024-06-13 19:56 ` Jeff Layton
@ 2024-06-14 11:33 ` Jan Kara
0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2024-06-14 11:33 UTC (permalink / raw)
To: Jeff Layton; +Cc: Jan Kara, Trond Myklebust, linux-nfs, Neil Brown
On Thu 13-06-24 15:56:36, Jeff Layton wrote:
> On Thu, 2024-06-13 at 10:28 +0200, Jan Kara wrote:
> > Atomic types should better be initialized with atomic_long_set()
> > instead
> > of relying on zeroing done by kzalloc(). Clean this up.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/nfs/client.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> > index de77848ae654..3b252dceebf5 100644
> > --- a/fs/nfs/client.c
> > +++ b/fs/nfs/client.c
> > @@ -994,6 +994,8 @@ struct nfs_server *nfs_alloc_server(void)
> >
> > server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED;
> >
> > + atomic_long_set(&server->writeback, 0);
> > +
> > ida_init(&server->openowner_id);
> > ida_init(&server->lockowner_id);
> > pnfs_init_server(server);
>
> I'm guilty of doing this, well, all over the place. Is there any
> plausible scenario where another task could see this value set to non-
> zero after kzalloc()? One would hope not...
No, it is not a practical problem these days. It is more a theoretical
correctness thing that atomic_t should be initialized through atomic_set()
and not memset() because in theory you don't know how some weird
architecture decides to implement it.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 3/3] nfs: Block on write congestion
2024-06-13 20:04 ` Jeff Layton
@ 2024-06-14 11:44 ` Jan Kara
0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2024-06-14 11:44 UTC (permalink / raw)
To: Jeff Layton; +Cc: Jan Kara, Trond Myklebust, linux-nfs, Neil Brown
On Thu 13-06-24 16:04:12, Jeff Layton wrote:
> On Thu, 2024-06-13 at 10:28 +0200, Jan Kara wrote:
> > Commit 6df25e58532b ("nfs: remove reliance on bdi congestion")
> > introduced NFS-private solution for limiting number of writes
> > outstanding against a particular server. Unlike previous bdi congestion
> > this algorithm actually works and limits number of outstanding writeback
> > pages to nfs_congestion_kb which scales with amount of client's memory
> > and is capped at 256 MB. As a result some workloads such as random
> > buffered writes over NFS got slower (from ~170 MB/s to ~126 MB/s). The
> > fio command to reproduce is:
> >
>
> That 256M cap was set back in 2007. I wonder if we ought to reconsider
> that. It might be worth experimenting these days with a higher cap on
> larger machines. Might that improve throughput even more?
In theory it may help, but in my setup I didn't observe it, so I've decided
to leave the limit alone for now. I certainly would not object if
someone decided to increase it, as 256 MB seems a rather small buffer for
writeback given contemporary IO speeds; I just didn't have the numbers to
back that decision.
> > fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1
> > --runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite
> > --size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1
> >
> > This happens because the client sends ~256 MB worth of dirty pages to
> > the server and any further background writeback request is ignored until
> > the number of writeback pages gets below the threshold of 192 MB. By the
> > time this happens and clients decides to trigger another round of
> > writeback, the server often has no pages to write and the disk is idle.
> >
> > To fix this problem and make the client react faster to eased congestion
> > of the server by blocking waiting for congestion to resolve instead of
> > aborting writeback. This improves the random 4k buffered write
> > throughput to 184 MB/s.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/nfs/client.c | 1 +
> > fs/nfs/write.c | 12 ++++++++----
> > include/linux/nfs_fs_sb.h | 1 +
> > 3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> > index 3b252dceebf5..8286edd6062d 100644
> > --- a/fs/nfs/client.c
> > +++ b/fs/nfs/client.c
> > @@ -994,6 +994,7 @@ struct nfs_server *nfs_alloc_server(void)
> >
> > server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED;
> >
> > + init_waitqueue_head(&server->write_congestion_wait);
> > atomic_long_set(&server->writeback, 0);
> >
> > ida_init(&server->openowner_id);
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index c6255d7edd3c..21a5f1e90859 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -423,8 +423,10 @@ static void nfs_folio_end_writeback(struct folio *folio)
> >
> > folio_end_writeback(folio);
> > if (atomic_long_dec_return(&nfss->writeback) <
> > - NFS_CONGESTION_OFF_THRESH)
> > + NFS_CONGESTION_OFF_THRESH) {
> > nfss->write_congested = 0;
> > + wake_up_all(&nfss->write_congestion_wait);
> > + }
> > }
> >
> > static void nfs_page_end_writeback(struct nfs_page *req)
> > @@ -698,12 +700,14 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
> > struct nfs_pageio_descriptor pgio;
> > struct nfs_io_completion *ioc = NULL;
> > unsigned int mntflags = NFS_SERVER(inode)->flags;
> > + struct nfs_server *nfss = NFS_SERVER(inode);
> > int priority = 0;
> > int err;
> >
> > - if (wbc->sync_mode == WB_SYNC_NONE &&
> > - NFS_SERVER(inode)->write_congested)
> > - return 0;
> > + /* Wait with writeback until write congestion eases */
> > + if (wbc->sync_mode == WB_SYNC_NONE && nfss->write_congested)
> > + wait_event(nfss->write_congestion_wait,
> > + nfss->write_congested == 0);
> >
>
> It seems odd to block here, but if that helps throughput then so be it.
Well, I've kept the place where writeback throttling is decided. Another
logical possibility would be to handle the blocking in
nfs_writepages_callback() more like it happens for standard block device
filesystems which block in submit_bio(). Just there it would potentially
require more work as AFAIU we can have some unfinished RPC requests which
we may need to flush before blocking or something similar (similarly to how
we unplug IO on schedule). So I wasn't really comfortable moving the
blocking there.
> > nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
> >
> > diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> > index 92de074e63b9..38a128796a76 100644
> > --- a/include/linux/nfs_fs_sb.h
> > +++ b/include/linux/nfs_fs_sb.h
> > @@ -140,6 +140,7 @@ struct nfs_server {
> > struct rpc_clnt * client_acl; /* ACL RPC client handle */
> > struct nlm_host *nlm_host; /* NLM client handle */
> > struct nfs_iostats __percpu *io_stats; /* I/O statistics */
> > + wait_queue_head_t write_congestion_wait; /* wait until write congestion eases */
> > atomic_long_t writeback; /* number of writeback pages */
> > unsigned int write_congested;/* flag set when writeback gets too high */
> > unsigned int flags; /* various flags */
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
Thanks!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 3/3] nfs: Block on write congestion
2024-06-13 22:56 ` NeilBrown
@ 2024-06-14 11:57 ` Jan Kara
0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2024-06-14 11:57 UTC (permalink / raw)
To: NeilBrown; +Cc: Jan Kara, Trond Myklebust, linux-nfs
On Fri 14-06-24 08:56:10, NeilBrown wrote:
> On Thu, 13 Jun 2024, Jan Kara wrote:
> > @@ -698,12 +700,14 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
> > struct nfs_pageio_descriptor pgio;
> > struct nfs_io_completion *ioc = NULL;
> > unsigned int mntflags = NFS_SERVER(inode)->flags;
> > + struct nfs_server *nfss = NFS_SERVER(inode);
> > int priority = 0;
> > int err;
> >
> > - if (wbc->sync_mode == WB_SYNC_NONE &&
> > - NFS_SERVER(inode)->write_congested)
> > - return 0;
> > + /* Wait with writeback until write congestion eases */
> > + if (wbc->sync_mode == WB_SYNC_NONE && nfss->write_congested)
> > + wait_event(nfss->write_congestion_wait,
> > + nfss->write_congested == 0);
>
> If there is a network problem or the server reboot, this could wait
> indefinitely. Maybe wait_event_idle() ??
>
> Is this only called from the writeback thread that is dedicated to this
> fs? If so that should be fine. If it can be reached from a user
> processes, then wait_event_killable() might be needed.
Usually yes, but e.g. sync_file_range() can issue writeback that will block
here as well. So I guess wait_event_killable() is a good idea. I'll update
it.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR