* support large folios for NFS
@ 2024-05-27 16:36 Christoph Hellwig
2024-05-27 16:36 ` [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios Christoph Hellwig
` (5 more replies)
0 siblings, 6 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-05-27 16:36 UTC (permalink / raw)
To: Trond Myklebust, Anna Schumaker, Matthew Wilcox
Cc: linux-nfs, linux-fsdevel, linux-mm
Hi all,
this series adds large folio support to NFS, and almost doubles the
buffered write throughput from the previous bottleneck of ~2.5GB/s
(just like for other file systems).
The first patch is an old one from willy that I've updated very slightly.
Note that this update now requires the mapping_max_folio_size helper
merged into Linus' tree only a few minutes ago.
Diffstat:
fs/nfs/file.c | 4 +++-
fs/nfs/inode.c | 1 +
mm/filemap.c | 40 +++++++++++++++++++++++++---------------
3 files changed, 29 insertions(+), 16 deletions(-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
2024-05-27 16:36 support large folios for NFS Christoph Hellwig
@ 2024-05-27 16:36 ` Christoph Hellwig
2024-05-27 18:17 ` Matthew Wilcox
` (2 more replies)
2024-05-27 16:36 ` [PATCH 2/2] nfs: add support for " Christoph Hellwig
` (4 subsequent siblings)
5 siblings, 3 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-05-27 16:36 UTC (permalink / raw)
To: Trond Myklebust, Anna Schumaker, Matthew Wilcox
Cc: linux-nfs, linux-fsdevel, linux-mm
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Modelled after the loop in iomap_write_iter(), copy larger chunks from
userspace if the filesystem has created large folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[hch: use mapping_max_folio_size to keep supporting file systems that do
not support large folios]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
mm/filemap.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 382c3d06bfb10c..860728e26ccf32 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3981,21 +3981,24 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
loff_t pos = iocb->ki_pos;
struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
+ size_t chunk = mapping_max_folio_size(mapping);
long status = 0;
ssize_t written = 0;
do {
struct page *page;
- unsigned long offset; /* Offset into pagecache page */
- unsigned long bytes; /* Bytes to write to page */
+ struct folio *folio;
+ size_t offset; /* Offset into folio */
+ size_t bytes; /* Bytes to write to folio */
size_t copied; /* Bytes copied from user */
void *fsdata = NULL;
- offset = (pos & (PAGE_SIZE - 1));
- bytes = min_t(unsigned long, PAGE_SIZE - offset,
- iov_iter_count(i));
+ bytes = iov_iter_count(i);
+retry:
+ offset = pos & (chunk - 1);
+ bytes = min(chunk - offset, bytes);
+ balance_dirty_pages_ratelimited(mapping);
-again:
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -4017,11 +4020,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
if (unlikely(status < 0))
break;
+ folio = page_folio(page);
+ offset = offset_in_folio(folio, pos);
+ if (bytes > folio_size(folio) - offset)
+ bytes = folio_size(folio) - offset;
+
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ flush_dcache_folio(folio);
- copied = copy_page_from_iter_atomic(page, offset, bytes, i);
- flush_dcache_page(page);
+ copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
+ flush_dcache_folio(folio);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
@@ -4039,14 +4047,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
- if (copied)
+ if (chunk > PAGE_SIZE)
+ chunk /= 2;
+ if (copied) {
bytes = copied;
- goto again;
+ goto retry;
+ }
+ } else {
+ pos += status;
+ written += status;
}
- pos += status;
- written += status;
-
- balance_dirty_pages_ratelimited(mapping);
} while (iov_iter_count(i));
if (!written)
--
2.43.0

* [PATCH 2/2] nfs: add support for large folios
2024-05-27 16:36 support large folios for NFS Christoph Hellwig
2024-05-27 16:36 ` [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios Christoph Hellwig
@ 2024-05-27 16:36 ` Christoph Hellwig
2024-05-27 19:43 ` support large folios for NFS Sagi Grimberg
` (3 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-05-27 16:36 UTC (permalink / raw)
To: Trond Myklebust, Anna Schumaker, Matthew Wilcox
Cc: linux-nfs, linux-fsdevel, linux-mm
NFS is already free of folio size assumptions, so just pass the chunk size
to __filemap_get_folio and set the large folio address_space flag for all
regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/nfs/file.c | 4 +++-
fs/nfs/inode.c | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 6bd127e6683dce..7f1295475a90fd 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -339,6 +339,7 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, struct page **pagep,
void **fsdata)
{
+ fgf_t fgp = FGP_WRITEBEGIN;
struct folio *folio;
int once_thru = 0;
int ret;
@@ -346,8 +347,9 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
dfprintk(PAGECACHE, "NFS: write_begin(%pD2(%lu), %u@%lld)\n",
file, mapping->host->i_ino, len, (long long) pos);
+ fgp |= fgf_set_order(len);
start:
- folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, FGP_WRITEBEGIN,
+ folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
return PTR_ERR(folio);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index acef52ecb1bb7e..6d185af4cb29d4 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -491,6 +491,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
inode->i_fop = NFS_SB(sb)->nfs_client->rpc_ops->file_ops;
inode->i_data.a_ops = &nfs_file_aops;
nfs_inode_init_regular(nfsi);
+ mapping_set_large_folios(inode->i_mapping);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
inode->i_fop = &nfs_dir_operations;
--
2.43.0
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
2024-05-27 16:36 ` [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios Christoph Hellwig
@ 2024-05-27 18:17 ` Matthew Wilcox
2024-05-28 8:12 ` Christoph Hellwig
[not found] ` <CGME20240528152340eucas1p17ba2ad78d8ea869ef44cdeedb2601f80@eucas1p1.samsung.com>
2024-06-11 10:47 ` Shaun Tancheff
2 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-05-27 18:17 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Trond Myklebust, Anna Schumaker, linux-nfs, linux-fsdevel,
linux-mm
On Mon, May 27, 2024 at 06:36:08PM +0200, Christoph Hellwig wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> Modelled after the loop in iomap_write_iter(), copy larger chunks from
> userspace if the filesystem has created large folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> [hch: use mapping_max_folio_size to keep supporting file systems that do
> not support large folios]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Yup, this still makes sense to me.
Could you remind me why we need to call flush_dcache_folio() in
generic_perform_write() while we don't in iomap_write_iter()?
> if (mapping_writably_mapped(mapping))
> - flush_dcache_page(page);
> + flush_dcache_folio(folio);
(i'm not talking about this one)
> - copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> - flush_dcache_page(page);
> + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> + flush_dcache_folio(folio);
(this one has no equivalent in iomap)
> status = a_ops->write_end(file, mapping, pos, bytes, copied,
> page, fsdata);
* Re: support large folios for NFS
2024-05-27 16:36 support large folios for NFS Christoph Hellwig
2024-05-27 16:36 ` [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios Christoph Hellwig
2024-05-27 16:36 ` [PATCH 2/2] nfs: add support for " Christoph Hellwig
@ 2024-05-27 19:43 ` Sagi Grimberg
2024-05-28 21:03 ` [PATCH] nfs: Fix misuses of folio_shift() and folio_order() Matthew Wilcox (Oracle)
` (2 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2024-05-27 19:43 UTC (permalink / raw)
To: Christoph Hellwig, Trond Myklebust, Anna Schumaker,
Matthew Wilcox
Cc: linux-nfs, linux-fsdevel, linux-mm
On 27/05/2024 19:36, Christoph Hellwig wrote:
> Hi all,
>
> this series adds large folio support to NFS, and almost doubles the
> buffered write throughput from the previous bottleneck of ~2.5GB/s
> (just like for other file systems).
>
> The first patch is an old one from willy that I've updated very slightly.
> Note that this update now requires the mapping_max_folio_size helper
> merged into Linus' tree only a few minutes ago.
I can confirm that NFS buffered writes saw a dramatic >2x improvement
on my test system.
For the series:
Tested-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
2024-05-27 18:17 ` Matthew Wilcox
@ 2024-05-28 8:12 ` Christoph Hellwig
0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-05-28 8:12 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, linux-nfs,
linux-fsdevel, linux-mm
On Mon, May 27, 2024 at 07:17:18PM +0100, Matthew Wilcox wrote:
> Could you remind me why we need to call flush_dcache_folio() in
> generic_perform_write() while we don't in iomap_write_iter()?
> > - copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> > - flush_dcache_page(page);
> > + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> > + flush_dcache_folio(folio);
>
> (this one has no equivalent in iomap)
The iomap equivalent is in __iomap_write_end, iomap_write_end_inline
and block_write_end.
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
[not found] ` <CGME20240528152340eucas1p17ba2ad78d8ea869ef44cdeedb2601f80@eucas1p1.samsung.com>
@ 2024-05-28 15:23 ` Daniel Gomez
2024-05-28 16:50 ` Matthew Wilcox
0 siblings, 1 reply; 24+ messages in thread
From: Daniel Gomez @ 2024-05-28 15:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Trond Myklebust, Anna Schumaker, Matthew Wilcox,
linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org
Hi Christoph, Matthew,
On Mon, May 27, 2024 at 06:36:08PM +0200, Christoph Hellwig wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> Modelled after the loop in iomap_write_iter(), copy larger chunks from
> userspace if the filesystem has created large folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> [hch: use mapping_max_folio_size to keep supporting file systems that do
> not support large folios]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> mm/filemap.c | 40 +++++++++++++++++++++++++---------------
> 1 file changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 382c3d06bfb10c..860728e26ccf32 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3981,21 +3981,24 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> loff_t pos = iocb->ki_pos;
> struct address_space *mapping = file->f_mapping;
> const struct address_space_operations *a_ops = mapping->a_ops;
> + size_t chunk = mapping_max_folio_size(mapping);
> long status = 0;
> ssize_t written = 0;
>
> do {
> struct page *page;
> - unsigned long offset; /* Offset into pagecache page */
> - unsigned long bytes; /* Bytes to write to page */
> + struct folio *folio;
> + size_t offset; /* Offset into folio */
> + size_t bytes; /* Bytes to write to folio */
> size_t copied; /* Bytes copied from user */
> void *fsdata = NULL;
>
> - offset = (pos & (PAGE_SIZE - 1));
> - bytes = min_t(unsigned long, PAGE_SIZE - offset,
> - iov_iter_count(i));
> + bytes = iov_iter_count(i);
> +retry:
> + offset = pos & (chunk - 1);
> + bytes = min(chunk - offset, bytes);
> + balance_dirty_pages_ratelimited(mapping);
>
> -again:
> /*
> * Bring in the user page that we will copy from _first_.
> * Otherwise there's a nasty deadlock on copying from the
> @@ -4017,11 +4020,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> if (unlikely(status < 0))
> break;
>
> + folio = page_folio(page);
> + offset = offset_in_folio(folio, pos);
> + if (bytes > folio_size(folio) - offset)
> + bytes = folio_size(folio) - offset;
> +
> if (mapping_writably_mapped(mapping))
> - flush_dcache_page(page);
> + flush_dcache_folio(folio);
>
> - copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> - flush_dcache_page(page);
> + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> + flush_dcache_folio(folio);
>
> status = a_ops->write_end(file, mapping, pos, bytes, copied,
> page, fsdata);
I have the same patch in my shmem large folios tree. That was the last piece
needed for getting better performance results. However, folio support is also
needed in the write_begin() and write_end() callbacks. In order to avoid
making them local to shmem, how should we do the transition to folios in these
two callbacks? I was looking into the aops->read_folio approach, but what do you think?
> @@ -4039,14 +4047,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> * halfway through, might be a race with munmap,
> * might be severe memory pressure.
> */
> - if (copied)
> + if (chunk > PAGE_SIZE)
> + chunk /= 2;
> + if (copied) {
> bytes = copied;
> - goto again;
> + goto retry;
> + }
> + } else {
> + pos += status;
> + written += status;
> }
> - pos += status;
> - written += status;
> -
> - balance_dirty_pages_ratelimited(mapping);
> } while (iov_iter_count(i));
>
> if (!written)
> --
> 2.43.0
>
Daniel
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
2024-05-28 15:23 ` Daniel Gomez
@ 2024-05-28 16:50 ` Matthew Wilcox
2024-05-28 19:01 ` Daniel Gomez
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-05-28 16:50 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker,
linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org
On Tue, May 28, 2024 at 03:23:39PM +0000, Daniel Gomez wrote:
> I have the same patch for shmem and large folios tree. That was the last piece
> needed for getting better performance results. However, it is also needed to
> support folios in the write_begin() and write_end() callbacks.
I don't think it's *needed*. It's nice! But clearly not necessary
since Christoph made nfs work without doing that.
> In order to avoid
> making them local to shmem, how should we do the transition to folios in these
> 2 callbacks? I was looking into aops->read_folio approach but what do you think?
See the v2 of buffer_write_operations that I just posted. I was waiting
for feedback from Christoph on the revised method for passing fsdata
around, but I may as well just post a v2 and see what happens.
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
2024-05-28 16:50 ` Matthew Wilcox
@ 2024-05-28 19:01 ` Daniel Gomez
0 siblings, 0 replies; 24+ messages in thread
From: Daniel Gomez @ 2024-05-28 19:01 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker,
linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org
On Tue, May 28, 2024 at 05:50:44PM +0100, Matthew Wilcox wrote:
> On Tue, May 28, 2024 at 03:23:39PM +0000, Daniel Gomez wrote:
> > I have the same patch for shmem and large folios tree. That was the last piece
> > needed for getting better performance results. However, it is also needed to
> > support folios in the write_begin() and write_end() callbacks.
>
> I don't think it's *needed*. It's nice! But clearly not necessary
> since Christoph made nfs work without doing that.
I see. We pass the length via bytes anyway, and the folio allocated inside
write_begin() is retrieved with page_folio().
I did test this patch (plus the mapping_max_folio_size() patch) for shmem and it works fine for me.
>
> > In order to avoid
> > making them local to shmem, how should we do the transition to folios in these
> > 2 callbacks? I was looking into aops->read_folio approach but what do you think?
>
> See the v2 of buffer_write_operations that I just posted. I was waiting
> for feedback from Christoph on the revised method for passing fsdata
> around, but I may as well just post a v2 and see what happens.
Interesting. I think it makes sense to convert tmpfs to
buffered_write_operations as well. Can you Cc me on the v2 so I can review it
and add tmpfs support?
Thanks
* [PATCH] nfs: Fix misuses of folio_shift() and folio_order()
2024-05-27 16:36 support large folios for NFS Christoph Hellwig
` (2 preceding siblings ...)
2024-05-27 19:43 ` support large folios for NFS Sagi Grimberg
@ 2024-05-28 21:03 ` Matthew Wilcox (Oracle)
2024-05-29 5:15 ` Christoph Hellwig
` (2 more replies)
2024-05-28 21:05 ` support large folios for NFS Matthew Wilcox
2024-05-29 21:59 ` Trond Myklebust
5 siblings, 3 replies; 24+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-05-28 21:03 UTC (permalink / raw)
To: linux-fsdevel, linux-nfs
Cc: Matthew Wilcox (Oracle), Trond Myklebust, Anna Schumaker,
Christoph Hellwig
Page cache indices are in units of PAGE_SIZE, not in units of
the folio size. Revert the change in nfs_grow_file(), and
pass the inode to nfs_folio_length() so it can be reimplemented
in terms of folio_mkwrite_check_truncate() which handles this
correctly.
Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
---
fs/nfs/file.c | 6 +++---
fs/nfs/internal.h | 16 +++++-----------
fs/nfs/read.c | 2 +-
fs/nfs/write.c | 9 +++++----
include/linux/pagemap.h | 4 ++--
5 files changed, 16 insertions(+), 21 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 6bd127e6683d..723d78bbfe3f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -301,7 +301,7 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
static bool nfs_folio_is_full_write(struct folio *folio, loff_t pos,
unsigned int len)
{
- unsigned int pglen = nfs_folio_length(folio);
+ unsigned int pglen = nfs_folio_length(folio, folio->mapping->host);
unsigned int offset = offset_in_folio(folio, pos);
unsigned int end = offset + len;
@@ -386,7 +386,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
*/
if (!folio_test_uptodate(folio)) {
size_t fsize = folio_size(folio);
- unsigned pglen = nfs_folio_length(folio);
+ unsigned pglen = nfs_folio_length(folio, mapping->host);
unsigned end = offset + copied;
if (pglen == 0) {
@@ -610,7 +610,7 @@ static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf)
folio_wait_writeback(folio);
- pagelen = nfs_folio_length(folio);
+ pagelen = nfs_folio_length(folio, inode);
if (pagelen == 0)
goto out_unlock;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9f0f4534744b..3b0236e67257 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -819,19 +819,13 @@ unsigned int nfs_page_length(struct page *page)
/*
* Determine the number of bytes of data the page contains
*/
-static inline size_t nfs_folio_length(struct folio *folio)
+static inline size_t nfs_folio_length(struct folio *folio, struct inode *inode)
{
- loff_t i_size = i_size_read(folio_file_mapping(folio)->host);
+ ssize_t ret = folio_mkwrite_check_truncate(folio, inode);
- if (i_size > 0) {
- pgoff_t index = folio_index(folio) >> folio_order(folio);
- pgoff_t end_index = (i_size - 1) >> folio_shift(folio);
- if (index < end_index)
- return folio_size(folio);
- if (index == end_index)
- return offset_in_folio(folio, i_size - 1) + 1;
- }
- return 0;
+ if (ret < 0)
+ ret = 0;
+ return ret;
}
/*
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index a142287d86f6..ba3bb496f832 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -296,7 +296,7 @@ int nfs_read_add_folio(struct nfs_pageio_descriptor *pgio,
unsigned int len, aligned_len;
int error;
- len = nfs_folio_length(folio);
+ len = nfs_folio_length(folio, inode);
if (len == 0)
return nfs_return_empty_folio(folio);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 2329cbb0e446..7713ce7c5b3a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -278,8 +278,8 @@ static void nfs_grow_file(struct folio *folio, unsigned int offset,
spin_lock(&inode->i_lock);
i_size = i_size_read(inode);
- end_index = ((i_size - 1) >> folio_shift(folio)) << folio_order(folio);
- if (i_size > 0 && folio_index(folio) < end_index)
+ end_index = (i_size - 1) >> PAGE_SHIFT;
+ if (i_size > 0 && folio->index < end_index)
goto out;
end = folio_file_pos(folio) + (loff_t)offset + (loff_t)count;
if (i_size >= end)
@@ -358,7 +358,8 @@ nfs_page_group_search_locked(struct nfs_page *head, unsigned int page_offset)
*/
static bool nfs_page_group_covers_page(struct nfs_page *req)
{
- unsigned int len = nfs_folio_length(nfs_page_to_folio(req));
+ struct folio *folio = nfs_page_to_folio(req);
+ unsigned int len = nfs_folio_length(folio, folio->mapping->host);
struct nfs_page *tmp;
unsigned int pos = 0;
@@ -1356,7 +1357,7 @@ int nfs_update_folio(struct file *file, struct folio *folio,
struct nfs_open_context *ctx = nfs_file_open_context(file);
struct address_space *mapping = folio_file_mapping(folio);
struct inode *inode = mapping->host;
- unsigned int pagelen = nfs_folio_length(folio);
+ unsigned int pagelen = nfs_folio_length(folio, inode);
int status = 0;
nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c6aaceed0de6..df57d7361a9a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -212,8 +212,8 @@ enum mapping_flags {
AS_FOLIO_ORDER_MAX = 21, /* Bits 16-25 are used for FOLIO_ORDER */
};
-#define AS_FOLIO_ORDER_MIN_MASK 0x001f0000
-#define AS_FOLIO_ORDER_MAX_MASK 0x03e00000
+#define AS_FOLIO_ORDER_MIN_MASK (31 << AS_FOLIO_ORDER_MIN)
+#define AS_FOLIO_ORDER_MAX_MASK (31 << AS_FOLIO_ORDER_MAX)
#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
/**
--
2.43.0
* Re: support large folios for NFS
2024-05-27 16:36 support large folios for NFS Christoph Hellwig
` (3 preceding siblings ...)
2024-05-28 21:03 ` [PATCH] nfs: Fix misuses of folio_shift() and folio_order() Matthew Wilcox (Oracle)
@ 2024-05-28 21:05 ` Matthew Wilcox
2024-05-29 5:14 ` Christoph Hellwig
2024-05-29 13:35 ` Trond Myklebust
2024-05-29 21:59 ` Trond Myklebust
5 siblings, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-05-28 21:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Trond Myklebust, Anna Schumaker, linux-nfs, linux-fsdevel,
linux-mm
On Mon, May 27, 2024 at 06:36:07PM +0200, Christoph Hellwig wrote:
> Hi all,
>
> this series adds large folio support to NFS, and almost doubles the
> buffered write throughput from the previous bottleneck of ~2.5GB/s
> (just like for other file systems).
>
> The first patch is an old one from willy that I've updated very slightly.
> Note that this update now requires the mapping_max_folio_size helper
> merged into Linus' tree only a few minutes ago.
Kind of surprised this didn't fall over given the bugs I just sent a
patch for ... misinterpreting the folio indices seems like it should
have caused a failure in _some_ fstest.
* Re: support large folios for NFS
2024-05-28 21:05 ` support large folios for NFS Matthew Wilcox
@ 2024-05-29 5:14 ` Christoph Hellwig
2024-05-29 13:35 ` Trond Myklebust
1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-05-29 5:14 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, linux-nfs,
linux-fsdevel, linux-mm
On Tue, May 28, 2024 at 10:05:58PM +0100, Matthew Wilcox wrote:
> On Mon, May 27, 2024 at 06:36:07PM +0200, Christoph Hellwig wrote:
> > Hi all,
> >
> > this series adds large folio support to NFS, and almost doubles the
> > buffered write throughput from the previous bottleneck of ~2.5GB/s
> > (just like for other file systems).
> >
> > The first patch is an old one from willy that I've updated very slightly.
> > Note that this update now requires the mapping_max_folio_size helper
> > merged into Linus' tree only a few minutes ago.
>
> Kind of surprised this didn't fall over given the bugs I just sent a
> patch for ... misinterpreting the folio indices seems like it should
> have caused a failure in _some_ fstest.
I've run quite a few tests with different NFS protocol versions, and there
were no new failures; the existing failure is an MM one that is also
reproducible with local XFS. That's indeed a bit odd.
* Re: [PATCH] nfs: Fix misuses of folio_shift() and folio_order()
2024-05-28 21:03 ` [PATCH] nfs: Fix misuses of folio_shift() and folio_order() Matthew Wilcox (Oracle)
@ 2024-05-29 5:15 ` Christoph Hellwig
2024-05-29 6:30 ` Christoph Hellwig
2024-05-29 14:31 ` Trond Myklebust
2 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-05-29 5:15 UTC (permalink / raw)
To: Matthew Wilcox (Oracle)
Cc: linux-fsdevel, linux-nfs, Trond Myklebust, Anna Schumaker,
Christoph Hellwig
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c6aaceed0de6..df57d7361a9a 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -212,8 +212,8 @@ enum mapping_flags {
> AS_FOLIO_ORDER_MAX = 21, /* Bits 16-25 are used for FOLIO_ORDER */
> };
>
> -#define AS_FOLIO_ORDER_MIN_MASK 0x001f0000
> -#define AS_FOLIO_ORDER_MAX_MASK 0x03e00000
> +#define AS_FOLIO_ORDER_MIN_MASK (31 << AS_FOLIO_ORDER_MIN)
> +#define AS_FOLIO_ORDER_MAX_MASK (31 << AS_FOLIO_ORDER_MAX)
This looks unrelated.
The NFS changes do look good to me, I'll kick off a testing run on
them in a bit.
* Re: [PATCH] nfs: Fix misuses of folio_shift() and folio_order()
2024-05-28 21:03 ` [PATCH] nfs: Fix misuses of folio_shift() and folio_order() Matthew Wilcox (Oracle)
2024-05-29 5:15 ` Christoph Hellwig
@ 2024-05-29 6:30 ` Christoph Hellwig
2024-05-29 14:31 ` Trond Myklebust
2 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-05-29 6:30 UTC (permalink / raw)
To: Matthew Wilcox (Oracle)
Cc: linux-fsdevel, linux-nfs, Trond Myklebust, Anna Schumaker,
Christoph Hellwig
On Tue, May 28, 2024 at 10:03:15PM +0100, Matthew Wilcox (Oracle) wrote:
> Page cache indices are in units of PAGE_SIZE, not in units of
> the folio size. Revert the change in nfs_grow_file(), and
> pass the inode to nfs_folio_length() so it can be reimplemented
> in terms of folio_mkwrite_check_truncate() which handles this
> correctly.
I had to apply the incremental patch below to make the change compile.
With that it causes a new xfstests failure in generic/127 that I haven't
looked into yet. The mm-level bugs I've seen even with baseline
Linus' tree also happened more often than in my previous tests, but
that might just be coincidence.
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 1e710654af1173..0a5d5fa9513735 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -938,7 +938,7 @@ TRACE_EVENT(nfs_sillyrename_unlink,
DECLARE_EVENT_CLASS(nfs_folio_event,
TP_PROTO(
- const struct inode *inode,
+ struct inode *inode,
struct folio *folio
),
@@ -954,14 +954,14 @@ DECLARE_EVENT_CLASS(nfs_folio_event,
),
TP_fast_assign(
- const struct nfs_inode *nfsi = NFS_I(inode);
+ struct nfs_inode *nfsi = NFS_I(inode);
__entry->dev = inode->i_sb->s_dev;
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
__entry->version = inode_peek_iversion_raw(inode);
__entry->offset = folio_file_pos(folio);
- __entry->count = nfs_folio_length(folio);
+ __entry->count = nfs_folio_length(folio, inode);
),
TP_printk(
@@ -977,14 +977,14 @@ DECLARE_EVENT_CLASS(nfs_folio_event,
#define DEFINE_NFS_FOLIO_EVENT(name) \
DEFINE_EVENT(nfs_folio_event, name, \
TP_PROTO( \
- const struct inode *inode, \
+ struct inode *inode, \
struct folio *folio \
), \
TP_ARGS(inode, folio))
DECLARE_EVENT_CLASS(nfs_folio_event_done,
TP_PROTO(
- const struct inode *inode,
+ struct inode *inode,
struct folio *folio,
int ret
),
@@ -1002,14 +1002,14 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
),
TP_fast_assign(
- const struct nfs_inode *nfsi = NFS_I(inode);
+ struct nfs_inode *nfsi = NFS_I(inode);
__entry->dev = inode->i_sb->s_dev;
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
__entry->version = inode_peek_iversion_raw(inode);
__entry->offset = folio_file_pos(folio);
- __entry->count = nfs_folio_length(folio);
+ __entry->count = nfs_folio_length(folio, inode);
__entry->ret = ret;
),
@@ -1026,7 +1026,7 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
#define DEFINE_NFS_FOLIO_EVENT_DONE(name) \
DEFINE_EVENT(nfs_folio_event_done, name, \
TP_PROTO( \
- const struct inode *inode, \
+ struct inode *inode, \
struct folio *folio, \
int ret \
), \
* Re: support large folios for NFS
2024-05-28 21:05 ` support large folios for NFS Matthew Wilcox
2024-05-29 5:14 ` Christoph Hellwig
@ 2024-05-29 13:35 ` Trond Myklebust
1 sibling, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2024-05-29 13:35 UTC (permalink / raw)
To: hch@lst.de, willy@infradead.org
Cc: anna@kernel.org, linux-mm@kvack.org, linux-nfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
On Tue, 2024-05-28 at 22:05 +0100, Matthew Wilcox wrote:
> On Mon, May 27, 2024 at 06:36:07PM +0200, Christoph Hellwig wrote:
> > Hi all,
> >
> > this series adds large folio support to NFS, and almost doubles the
> > buffered write throughput from the previous bottleneck of ~2.5GB/s
> > (just like for other file systems).
> >
> > The first patch is an old one from willy that I've updated very
> > slightly.
> > Note that this update now requires the mapping_max_folio_size
> > helper
> > merged into Linus' tree only a few minutes ago.
>
> Kind of surprised this didn't fall over given the bugs I just sent a
> patch for ... misinterpreting the folio indices seems like it should
> have caused a failure in _some_ fstest.
Why wouldn't it work? The code you're replacing isn't assuming that
page cache indices are in units of the folio size. It is just assuming
that folio boundaries are multiples of the folio size.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
* Re: [PATCH] nfs: Fix misuses of folio_shift() and folio_order()
2024-05-28 21:03 ` [PATCH] nfs: Fix misuses of folio_shift() and folio_order() Matthew Wilcox (Oracle)
2024-05-29 5:15 ` Christoph Hellwig
2024-05-29 6:30 ` Christoph Hellwig
@ 2024-05-29 14:31 ` Trond Myklebust
2 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2024-05-29 14:31 UTC (permalink / raw)
To: linux-nfs@vger.kernel.org, willy@infradead.org,
linux-fsdevel@vger.kernel.org
Cc: Anna.Schumaker@Netapp.com, hch@infradead.org
On Tue, 2024-05-28 at 22:03 +0100, Matthew Wilcox (Oracle) wrote:
> Page cache indices are in units of PAGE_SIZE, not in units of
> the folio size. Revert the change in nfs_grow_file(), and
> pass the inode to nfs_folio_length() so it can be reimplemented
> in terms of folio_mkwrite_check_truncate() which handles this
> correctly.
For the record, the code being replaced here is not assuming that page
cache indices are in units of the folio size. It is assuming that folio
boundaries will lie on offsets that are multiples of the folio size and
that the current page attributes (page lock, uptodate, etc) are
expected to apply to the data that lies within those folio boundaries.
The way the folio code is written today, that assumption appears to be
correct.
I'm fine with replacing NFS-specific code with generic code when
obviously correct, but AFAICS this would be a cleanup, and not a bug
fix.
>
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> ---
> fs/nfs/file.c | 6 +++---
> fs/nfs/internal.h | 16 +++++-----------
> fs/nfs/read.c | 2 +-
> fs/nfs/write.c | 9 +++++----
> include/linux/pagemap.h | 4 ++--
> 5 files changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 6bd127e6683d..723d78bbfe3f 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -301,7 +301,7 @@ EXPORT_SYMBOL_GPL(nfs_file_fsync);
> static bool nfs_folio_is_full_write(struct folio *folio, loff_t pos,
> unsigned int len)
> {
> - unsigned int pglen = nfs_folio_length(folio);
> + unsigned int pglen = nfs_folio_length(folio, folio->mapping->host);
> unsigned int offset = offset_in_folio(folio, pos);
> unsigned int end = offset + len;
>
> @@ -386,7 +386,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
> */
> if (!folio_test_uptodate(folio)) {
> size_t fsize = folio_size(folio);
> - unsigned pglen = nfs_folio_length(folio);
> + unsigned pglen = nfs_folio_length(folio, mapping->host);
> unsigned end = offset + copied;
>
> if (pglen == 0) {
> @@ -610,7 +610,7 @@ static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf)
>
> folio_wait_writeback(folio);
>
> - pagelen = nfs_folio_length(folio);
> + pagelen = nfs_folio_length(folio, inode);
> if (pagelen == 0)
> goto out_unlock;
>
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index 9f0f4534744b..3b0236e67257 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -819,19 +819,13 @@ unsigned int nfs_page_length(struct page *page)
> /*
> * Determine the number of bytes of data the page contains
> */
> -static inline size_t nfs_folio_length(struct folio *folio)
> +static inline size_t nfs_folio_length(struct folio *folio, struct inode *inode)
> {
> - loff_t i_size = i_size_read(folio_file_mapping(folio)->host);
> + ssize_t ret = folio_mkwrite_check_truncate(folio, inode);
>
> - if (i_size > 0) {
> - pgoff_t index = folio_index(folio) >> folio_order(folio);
> - pgoff_t end_index = (i_size - 1) >> folio_shift(folio);
> - if (index < end_index)
> - return folio_size(folio);
> - if (index == end_index)
> - return offset_in_folio(folio, i_size - 1) + 1;
> - }
> - return 0;
> + if (ret < 0)
> + ret = 0;
> + return ret;
> }
>
> /*
> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
> index a142287d86f6..ba3bb496f832 100644
> --- a/fs/nfs/read.c
> +++ b/fs/nfs/read.c
> @@ -296,7 +296,7 @@ int nfs_read_add_folio(struct nfs_pageio_descriptor *pgio,
> unsigned int len, aligned_len;
> int error;
>
> - len = nfs_folio_length(folio);
> + len = nfs_folio_length(folio, inode);
> if (len == 0)
> return nfs_return_empty_folio(folio);
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 2329cbb0e446..7713ce7c5b3a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -278,8 +278,8 @@ static void nfs_grow_file(struct folio *folio, unsigned int offset,
>
> spin_lock(&inode->i_lock);
> i_size = i_size_read(inode);
> - end_index = ((i_size - 1) >> folio_shift(folio)) << folio_order(folio);
> - if (i_size > 0 && folio_index(folio) < end_index)
> + end_index = (i_size - 1) >> PAGE_SHIFT;
> + if (i_size > 0 && folio->index < end_index)
> goto out;
> end = folio_file_pos(folio) + (loff_t)offset + (loff_t)count;
> if (i_size >= end)
> @@ -358,7 +358,8 @@ nfs_page_group_search_locked(struct nfs_page *head, unsigned int page_offset)
> */
> static bool nfs_page_group_covers_page(struct nfs_page *req)
> {
> - unsigned int len = nfs_folio_length(nfs_page_to_folio(req));
> + struct folio *folio = nfs_page_to_folio(req);
> + unsigned int len = nfs_folio_length(folio, folio->mapping->host);
> struct nfs_page *tmp;
> unsigned int pos = 0;
>
> @@ -1356,7 +1357,7 @@ int nfs_update_folio(struct file *file, struct folio *folio,
> struct nfs_open_context *ctx = nfs_file_open_context(file);
> struct address_space *mapping = folio_file_mapping(folio);
> struct inode *inode = mapping->host;
> - unsigned int pagelen = nfs_folio_length(folio);
> + unsigned int pagelen = nfs_folio_length(folio, inode);
> int status = 0;
>
> nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c6aaceed0de6..df57d7361a9a 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -212,8 +212,8 @@ enum mapping_flags {
> AS_FOLIO_ORDER_MAX = 21, /* Bits 16-25 are used for FOLIO_ORDER */
> };
>
> -#define AS_FOLIO_ORDER_MIN_MASK 0x001f0000
> -#define AS_FOLIO_ORDER_MAX_MASK 0x03e00000
> +#define AS_FOLIO_ORDER_MIN_MASK (31 << AS_FOLIO_ORDER_MIN)
> +#define AS_FOLIO_ORDER_MAX_MASK (31 << AS_FOLIO_ORDER_MAX)
> #define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
>
> /**
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
* Re: support large folios for NFS
2024-05-27 16:36 support large folios for NFS Christoph Hellwig
` (4 preceding siblings ...)
2024-05-28 21:05 ` support large folios for NFS Matthew Wilcox
@ 2024-05-29 21:59 ` Trond Myklebust
2024-05-31 6:14 ` hch
5 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2024-05-29 21:59 UTC (permalink / raw)
To: hch@lst.de, anna@kernel.org, willy@infradead.org
Cc: linux-mm@kvack.org, linux-nfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
On Mon, 2024-05-27 at 18:36 +0200, Christoph Hellwig wrote:
> Hi all,
>
> this series adds large folio support to NFS, and almost doubles the
> buffered write throughput from the previous bottleneck of ~2.5GB/s
> (just like for other file systems).
>
> The first patch is an old one from willy that I've updated very
> slightly.
> Note that this update now requires the mapping_max_folio_size helper
> merged into Linus' tree only a few minutes ago.
>
> Diffstat:
> fs/nfs/file.c | 4 +++-
> fs/nfs/inode.c | 1 +
> mm/filemap.c | 40 +++++++++++++++++++++++++---------------
> 3 files changed, 29 insertions(+), 16 deletions(-)
>
Which tree did you intend to merge this through? Willy's or Anna and
mine? I'm OK either way. I just want to make sure we're on the same
page.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
* Re: support large folios for NFS
2024-05-29 21:59 ` Trond Myklebust
@ 2024-05-31 6:14 ` hch
2024-06-07 5:29 ` hch
0 siblings, 1 reply; 24+ messages in thread
From: hch @ 2024-05-31 6:14 UTC (permalink / raw)
To: Trond Myklebust
Cc: hch@lst.de, anna@kernel.org, willy@infradead.org,
linux-mm@kvack.org, linux-nfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
On Wed, May 29, 2024 at 09:59:44PM +0000, Trond Myklebust wrote:
> Which tree did you intend to merge this through? Willy's or Anna and
> mine? I'm OK either way. I just want to make sure we're on the same
> page.
I'm perfectly fine either way too. If willy wants to get any other
work for generic_perform_write in as per his RFC patches the pagecache
tree might be a better place, if not maybe the nfs tree.
* Re: support large folios for NFS
2024-05-31 6:14 ` hch
@ 2024-06-07 5:29 ` hch
2024-06-07 7:57 ` Cedric Blancher
2024-06-07 15:32 ` Trond Myklebust
0 siblings, 2 replies; 24+ messages in thread
From: hch @ 2024-06-07 5:29 UTC (permalink / raw)
To: Trond Myklebust
Cc: hch@lst.de, anna@kernel.org, willy@infradead.org,
linux-mm@kvack.org, linux-nfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
On Fri, May 31, 2024 at 08:14:43AM +0200, hch@lst.de wrote:
> On Wed, May 29, 2024 at 09:59:44PM +0000, Trond Myklebust wrote:
> > Which tree did you intend to merge this through? Willy's or Anna and
> > mine? I'm OK either way. I just want to make sure we're on the same
> > page.
>
> I'm perfectly fine either way too. If willy wants to get any other
> work for generic_perform_write in as per his RFC patches the pagecache
> tree might be a better place, if not maybe the nfs tree.
That maintainer celebrity death match was a bit boring :) Any takers?
* Re: support large folios for NFS
2024-06-07 5:29 ` hch
@ 2024-06-07 7:57 ` Cedric Blancher
2024-06-07 15:32 ` Trond Myklebust
1 sibling, 0 replies; 24+ messages in thread
From: Cedric Blancher @ 2024-06-07 7:57 UTC (permalink / raw)
To: hch@lst.de
Cc: Trond Myklebust, anna@kernel.org, willy@infradead.org,
linux-mm@kvack.org, linux-nfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
On Fri, 7 Jun 2024 at 07:29, hch@lst.de <hch@lst.de> wrote:
>
> On Fri, May 31, 2024 at 08:14:43AM +0200, hch@lst.de wrote:
> > On Wed, May 29, 2024 at 09:59:44PM +0000, Trond Myklebust wrote:
> > > Which tree did you intend to merge this through? Willy's or Anna and
> > > mine? I'm OK either way. I just want to make sure we're on the same
> > > page.
> >
> > I'm perfectly fine either way too. If willy wants to get any other
> > work for generic_perform_write in as per his RFC patches the pagecache
> > tree might be a better place, if not maybe the nfs tree.
>
> That maintainer celebrity death match was a bit boring :) Any takers?
>
As much as we like to see blood, gore, WH40K chainswords, ripped-off
brains and ethernet cables, it would be easier (and less expensive) to
just apply the patch :)
Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur
* Re: support large folios for NFS
2024-06-07 5:29 ` hch
2024-06-07 7:57 ` Cedric Blancher
@ 2024-06-07 15:32 ` Trond Myklebust
1 sibling, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2024-06-07 15:32 UTC (permalink / raw)
To: hch@lst.de
Cc: anna@kernel.org, linux-mm@kvack.org, linux-nfs@vger.kernel.org,
willy@infradead.org, linux-fsdevel@vger.kernel.org
On Fri, 2024-06-07 at 07:29 +0200, hch@lst.de wrote:
> On Fri, May 31, 2024 at 08:14:43AM +0200, hch@lst.de wrote:
> > On Wed, May 29, 2024 at 09:59:44PM +0000, Trond Myklebust wrote:
> > > Which tree did you intend to merge this through? Willy's or Anna
> > > and
> > > mine? I'm OK either way. I just want to make sure we're on the
> > > same
> > > page.
> >
> > I'm perfectly fine either way too. If willy wants to get any other
> > work for generic_perform_write in as per his RFC patches the
> > pagecache
> > tree might be a better place, if not maybe the nfs tree.
>
> That maintainer celebrity death match was a bit boring :) Any
> takers?
>
🙂 We'll push them through the NFS tree.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
2024-05-27 16:36 ` [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios Christoph Hellwig
2024-05-27 18:17 ` Matthew Wilcox
[not found] ` <CGME20240528152340eucas1p17ba2ad78d8ea869ef44cdeedb2601f80@eucas1p1.samsung.com>
@ 2024-06-11 10:47 ` Shaun Tancheff
2024-06-11 16:13 ` Christoph Hellwig
2 siblings, 1 reply; 24+ messages in thread
From: Shaun Tancheff @ 2024-06-11 10:47 UTC (permalink / raw)
To: Christoph Hellwig, Trond Myklebust, Anna Schumaker,
Matthew Wilcox
Cc: linux-nfs, linux-fsdevel, linux-mm
On 5/27/24 23:36, Christoph Hellwig wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> Modelled after the loop in iomap_write_iter(), copy larger chunks from
> userspace if the filesystem has created large folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> [hch: use mapping_max_folio_size to keep supporting file systems that do
> not support large folios]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> mm/filemap.c | 40 +++++++++++++++++++++++++---------------
> 1 file changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 382c3d06bfb10c..860728e26ccf32 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3981,21 +3981,24 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> loff_t pos = iocb->ki_pos;
> struct address_space *mapping = file->f_mapping;
> const struct address_space_operations *a_ops = mapping->a_ops;
> + size_t chunk = mapping_max_folio_size(mapping);
Better to default chunk to PAGE_SIZE for backward compat
+ size_t chunk = PAGE_SIZE;
> long status = 0;
> ssize_t written = 0;
>
Have fs opt in to large folio support:
+ if (mapping_large_folio_support(mapping))
+ chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
> do {
> struct page *page;
> - unsigned long offset; /* Offset into pagecache page */
> - unsigned long bytes; /* Bytes to write to page */
> + struct folio *folio;
> + size_t offset; /* Offset into folio */
> + size_t bytes; /* Bytes to write to folio */
> size_t copied; /* Bytes copied from user */
> void *fsdata = NULL;
>
> - offset = (pos & (PAGE_SIZE - 1));
> - bytes = min_t(unsigned long, PAGE_SIZE - offset,
> - iov_iter_count(i));
> + bytes = iov_iter_count(i);
> +retry:
> + offset = pos & (chunk - 1);
> + bytes = min(chunk - offset, bytes);
> + balance_dirty_pages_ratelimited(mapping);
>
> -again:
> /*
> * Bring in the user page that we will copy from _first_.
> * Otherwise there's a nasty deadlock on copying from the
> @@ -4017,11 +4020,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> if (unlikely(status < 0))
> break;
>
> + folio = page_folio(page);
> + offset = offset_in_folio(folio, pos);
> + if (bytes > folio_size(folio) - offset)
> + bytes = folio_size(folio) - offset;
> +
> if (mapping_writably_mapped(mapping))
> - flush_dcache_page(page);
> + flush_dcache_folio(folio);
>
> - copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> - flush_dcache_page(page);
> + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> + flush_dcache_folio(folio);
>
> status = a_ops->write_end(file, mapping, pos, bytes, copied,
> page, fsdata);
> @@ -4039,14 +4047,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> * halfway through, might be a race with munmap,
> * might be severe memory pressure.
> */
> - if (copied)
> + if (chunk > PAGE_SIZE)
> + chunk /= 2;
> + if (copied) {
> bytes = copied;
> - goto again;
> + goto retry;
> + }
> + } else {
> + pos += status;
> + written += status;
> }
> - pos += status;
> - written += status;
> -
> - balance_dirty_pages_ratelimited(mapping);
> } while (iov_iter_count(i));
>
> if (!written)
Tested with Lustre using large folios on kernel 6.6, with this patch (and the suggested changes) applied.
Tested-by: Shaun Tancheff <shaun.tancheff@hpe.com>
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
2024-06-11 10:47 ` Shaun Tancheff
@ 2024-06-11 16:13 ` Christoph Hellwig
[not found] ` <1d87741b-7178-4791-aca2-da3ac3033552@gmail.com>
0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-11 16:13 UTC (permalink / raw)
To: Shaun Tancheff
Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker,
Matthew Wilcox, linux-nfs, linux-fsdevel, linux-mm
On Tue, Jun 11, 2024 at 05:47:12PM +0700, Shaun Tancheff wrote:
>> const struct address_space_operations *a_ops = mapping->a_ops;
>> + size_t chunk = mapping_max_folio_size(mapping);
>
> Better to default chunk to PAGE_SIZE for backward compat
> + size_t chunk = PAGE_SIZE;
>
>> long status = 0;
>> ssize_t written = 0;
>>
>
> Have fs opt in to large folio support:
>
> + if (mapping_large_folio_support(mapping))
> + chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
I don't think you've actually read the code, have you?
* Re: [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios
[not found] ` <1d87741b-7178-4791-aca2-da3ac3033552@gmail.com>
@ 2024-06-12 4:02 ` Christoph Hellwig
0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-12 4:02 UTC (permalink / raw)
To: Shaun Tancheff
Cc: Christoph Hellwig, Shaun Tancheff, Trond Myklebust,
Anna Schumaker, Matthew Wilcox, linux-nfs, linux-fsdevel,
linux-mm
On Wed, Jun 12, 2024 at 08:41:01AM +0700, Shaun Tancheff wrote:
> On 6/11/24 23:13, Christoph Hellwig wrote:
>
>> On Tue, Jun 11, 2024 at 05:47:12PM +0700, Shaun Tancheff wrote:
>>>> const struct address_space_operations *a_ops = mapping->a_ops;
>>>> + size_t chunk = mapping_max_folio_size(mapping);
>>> Better to default chunk to PAGE_SIZE for backward compat
>>> + size_t chunk = PAGE_SIZE;
>>>
>>>> long status = 0;
>>>> ssize_t written = 0;
>>>>
>>> Have fs opt in to large folio support:
>>>
>>> + if (mapping_large_folio_support(mapping))
>>> + chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
>> I don't think you've actually read the code, have you?
>
> I checked from 6.6 to linux-next with this patch and my ext4 VM does not boot without the opt-in.
Please take a look at the definition of mapping_max_folio_size,
which is called above in the quoted patch.
end of thread
Thread overview: 24+ messages
2024-05-27 16:36 support large folios for NFS Christoph Hellwig
2024-05-27 16:36 ` [PATCH 1/2] filemap: Convert generic_perform_write() to support large folios Christoph Hellwig
2024-05-27 18:17 ` Matthew Wilcox
2024-05-28 8:12 ` Christoph Hellwig
[not found] ` <CGME20240528152340eucas1p17ba2ad78d8ea869ef44cdeedb2601f80@eucas1p1.samsung.com>
2024-05-28 15:23 ` Daniel Gomez
2024-05-28 16:50 ` Matthew Wilcox
2024-05-28 19:01 ` Daniel Gomez
2024-06-11 10:47 ` Shaun Tancheff
2024-06-11 16:13 ` Christoph Hellwig
[not found] ` <1d87741b-7178-4791-aca2-da3ac3033552@gmail.com>
2024-06-12 4:02 ` Christoph Hellwig
2024-05-27 16:36 ` [PATCH 2/2] nfs: add support for " Christoph Hellwig
2024-05-27 19:43 ` support large folios for NFS Sagi Grimberg
2024-05-28 21:03 ` [PATCH] nfs: Fix misuses of folio_shift() and folio_order() Matthew Wilcox (Oracle)
2024-05-29 5:15 ` Christoph Hellwig
2024-05-29 6:30 ` Christoph Hellwig
2024-05-29 14:31 ` Trond Myklebust
2024-05-28 21:05 ` support large folios for NFS Matthew Wilcox
2024-05-29 5:14 ` Christoph Hellwig
2024-05-29 13:35 ` Trond Myklebust
2024-05-29 21:59 ` Trond Myklebust
2024-05-31 6:14 ` hch
2024-06-07 5:29 ` hch
2024-06-07 7:57 ` Cedric Blancher
2024-06-07 15:32 ` Trond Myklebust