public inbox for stable@vger.kernel.org
* [PATCH v2 1/6] ceph: Do not propagate page array emplacement errors as batch errors
       [not found] <20260107210139.40554-1-CFSworks@gmail.com>
@ 2026-01-07 21:01 ` Sam Edwards
  2026-01-08 20:05   ` Viacheslav Dubeyko
  2026-01-07 21:01 ` [PATCH v2 3/6] ceph: Free page array when ceph_submit_write fails Sam Edwards
  2026-01-07 21:01 ` [PATCH v2 6/6] ceph: Fix write storm on fscrypted files Sam Edwards
  2 siblings, 1 reply; 4+ messages in thread
From: Sam Edwards @ 2026-01-07 21:01 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable

When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
because it needs to allocate bounce buffers to store the encrypted
versions of each folio. Each folio beyond the first allocates its bounce
buffer with GFP_NOWAIT. Failures are common (and expected) under this
allocation mode; they should flush (not abort) the batch.

However, ceph_process_folio_batch() uses the same `rc` variable for its
own return code and for capturing the return codes of its routine calls;
failing to reset `rc` back to 0 results in the error being propagated
out to the main writeback loop, which cannot actually tolerate any
errors here: once `ceph_wbc.pages` is allocated, it must be passed to
ceph_submit_write() to be freed. If it survives until the next iteration
(e.g. due to the goto being followed), ceph_allocate_page_array()'s
BUG_ON() will oops the worker. (Subsequent patches in this series make
the loop more robust.)

Note that this failure mode is currently masked due to another bug
(addressed later in this series) that prevents multiple encrypted folios
from being selected for the same write.

For now, just reset `rc` when redirtying the folio to prevent errors in
move_dirty_folio_in_page_array() from propagating. (Note that
move_dirty_folio_in_page_array() is careful never to return errors on
the first folio, so there is no need to check for that.) After this
change, ceph_process_folio_batch() no longer returns errors; its only
remaining failure indicator is `locked_pages == 0`, which the caller
already handles correctly. The next patch in this series addresses this.

Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
 				folio);
 		if (rc) {
+			rc = 0;
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;
-- 
2.51.2



* [PATCH v2 3/6] ceph: Free page array when ceph_submit_write fails
       [not found] <20260107210139.40554-1-CFSworks@gmail.com>
  2026-01-07 21:01 ` [PATCH v2 1/6] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
@ 2026-01-07 21:01 ` Sam Edwards
  2026-01-07 21:01 ` [PATCH v2 6/6] ceph: Fix write storm on fscrypted files Sam Edwards
  2 siblings, 0 replies; 4+ messages in thread
From: Sam Edwards @ 2026-01-07 21:01 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable

If `locked_pages` is zero, the page array must not be allocated:
ceph_process_folio_batch() uses `locked_pages` to decide when to
allocate `pages`, and redundant allocations trigger
ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
writeback stall) or even a kernel panic. Consequently, the main loop in
ceph_writepages_start() assumes that the lifetime of `pages` is confined
to a single iteration.

The ceph_submit_write() function claims ownership of the page array on
success. But failures only redirty/unlock the pages and fail to free the
array, making the failure case in ceph_submit_write() fatal.

Free the page array (and reset locked_pages) in ceph_submit_write()'s
error-handling 'if' block so that the caller's invariant (that the array
does not outlive the iteration) is maintained unconditionally, making
failures in ceph_submit_write() recoverable as originally intended.

Fixes: 1551ec61dc55 ("ceph: introduce ceph_submit_write() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2b722916fb9b..467aa7242b49 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1466,6 +1466,14 @@ int ceph_submit_write(struct address_space *mapping,
 			unlock_page(page);
 		}
 
+		if (ceph_wbc->from_pool) {
+			mempool_free(ceph_wbc->pages, ceph_wb_pagevec_pool);
+			ceph_wbc->from_pool = false;
+		} else
+			kfree(ceph_wbc->pages);
+		ceph_wbc->pages = NULL;
+		ceph_wbc->locked_pages = 0;
+
 		ceph_osdc_put_request(req);
 		return -EIO;
 	}
-- 
2.51.2



* [PATCH v2 6/6] ceph: Fix write storm on fscrypted files
       [not found] <20260107210139.40554-1-CFSworks@gmail.com>
  2026-01-07 21:01 ` [PATCH v2 1/6] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
  2026-01-07 21:01 ` [PATCH v2 3/6] ceph: Free page array when ceph_submit_write fails Sam Edwards
@ 2026-01-07 21:01 ` Sam Edwards
  2 siblings, 0 replies; 4+ messages in thread
From: Sam Edwards @ 2026-01-07 21:01 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable

CephFS stores file data across multiple RADOS objects. An object is the
atomic unit of storage, so the writeback code must clean only folios
that belong to the same object with each OSD request.

CephFS also supports RAID0-style striping of file contents: if enabled,
each object stores multiple unbroken "stripe units" covering different
portions of the file; if disabled, a "stripe unit" is simply the whole
object. The stripe unit is (usually) reported as the inode's block size.

Though the writeback logic could, in principle, lock all dirty folios
belonging to the same object, its current design is to lock only a
single stripe unit at a time. Ever since this code was first written,
it has determined this size by checking the inode's block size.
However, the relatively new fscrypt support needed to reduce the block
size for encrypted inodes to the crypto block size (see 'fixes' commit),
which causes an unnecessarily high number of write operations (~1024x as
many, with 4MiB objects) and correspondingly degraded performance.

Fix this (and clarify intent) by using i_layout.stripe_unit directly in
ceph_define_write_size() so that encrypted inodes are written back with
the same number of operations as if they were unencrypted.

Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index f2db05b51a3b..b97a6120d4b9 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;
 
 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
-- 
2.51.2



* Re: [PATCH v2 1/6] ceph: Do not propagate page array emplacement errors as batch errors
  2026-01-07 21:01 ` [PATCH v2 1/6] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
@ 2026-01-08 20:05   ` Viacheslav Dubeyko
  0 siblings, 0 replies; 4+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-08 20:05 UTC (permalink / raw)
  To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
  Cc: Milind Changire, stable@vger.kernel.org,
	ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Wed, 2026-01-07 at 13:01 -0800, Sam Edwards wrote:
> When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
> because it needs to allocate bounce buffers to store the encrypted
> versions of each folio. Each folio beyond the first allocates its bounce
> buffer with GFP_NOWAIT. Failures are common (and expected) under this
> allocation mode; they should flush (not abort) the batch.
> 
> [...]

I've already shared my opinion about this patch: it should be the last one
in the series, because another patch fixes the issue that hides this one.
It makes sense to uncover this bug first and then fix it. My opinion still
stands.

Thanks,
Slava.

