* [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes
@ 2026-01-26 2:30 Sam Edwards
2026-01-26 2:30 ` [PATCH v3 1/4] ceph: do not propagate page array emplacement errors as batch errors Sam Edwards
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Sam Edwards @ 2026-01-26 2:30 UTC (permalink / raw)
To: Xiubo Li, Ilya Dryomov
Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
Jeff Layton, ceph-devel, linux-kernel, Sam Edwards
Hello list,
This is v3 of my series that addresses interrelated issues in CephFS writeback,
fixing crashes, improving robustness, and correcting performance behavior,
particularly for fscrypted files. [1]
Changes v2->v3:
- Split out two patches ("ceph: free page array when ceph_submit_write() fails"
and "ceph: split out page-array discarding to a function") to a new series
[2] since they are independent and had no outstanding review comments.
- Lowercase the subject lines of commit messages, per subsystem-local style.
- Update the commit message of ("ceph: fix write storm on fscrypted files") to
mention the explicit dependency on ("ceph: do not propagate page array
emplacement errors as batch errors") for correctness, to prevent the former
from being accidentally backported without the latter.
- Reorder the series to make the aforementioned patches consecutive. The series
cadence is now: bugfix, bugfix, cleanup, cleanup
- Add a clarification to ("ceph: remove error return from
ceph_process_folio_batch()") that "abort" logic is still possible, just that
it is responsible for cleaning up after itself.
Changes v1->v2:
- Clarify patch #1's commit message to establish that failures on the first
folio are not possible.
- Add another patch to move the "clean up page array on abort" logic to a new
ceph_discard_page_array() function. (Thanks Slava!)
- Change the wording "grossly degraded performance" to instead read
"correspondingly degraded performance." This makes the causal relationship
clearer (that write throughput is limited much more significantly by write
op/s due to the bug) without making any claims (qualitative or otherwise)
about significance. (Thanks Slava!)
- Reset locked_pages = 0 immediately when the page array is discarded,
simplifying patch #5 ("ceph: Assert writeback loop invariants")
- Reword "as evidenced by the previous two patches which fix oopses" to
"as evidenced by two recent patches which fix oopses" and refer to the
patches by subject (being in the same series, I cannot refer to them by hash)
Warm regards,
Sam
[1] https://lore.kernel.org/all/20260107210139.40554-1-CFSworks@gmail.com/
[2] https://lore.kernel.org/all/20260126022715.404984-1-CFSworks@gmail.com/
Sam Edwards (4):
ceph: do not propagate page array emplacement errors as batch errors
ceph: fix write storm on fscrypted files
ceph: remove error return from ceph_process_folio_batch()
ceph: assert writeback loop invariants
fs/ceph/addr.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 1/4] ceph: do not propagate page array emplacement errors as batch errors
2026-01-26 2:30 [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Sam Edwards
@ 2026-01-26 2:30 ` Sam Edwards
2026-01-26 2:30 ` [PATCH v3 2/4] ceph: fix write storm on fscrypted files Sam Edwards
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Sam Edwards @ 2026-01-26 2:30 UTC (permalink / raw)
To: Xiubo Li, Ilya Dryomov
Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable
When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
because it needs to allocate bounce buffers to store the encrypted
versions of each folio. Each folio beyond the first allocates its bounce
buffer with GFP_NOWAIT. Failures are common (and expected) under this
allocation mode; they should flush (not abort) the batch.
However, ceph_process_folio_batch() uses the same `rc` variable for its
own return code and for capturing the return codes of its routine calls;
failing to reset `rc` back to 0 results in the error being propagated
out to the main writeback loop, which cannot actually tolerate any
errors here: once `ceph_wbc.pages` is allocated, it must be passed to
ceph_submit_write() to be freed. If it survives until the next iteration
(e.g. due to the goto being followed), ceph_allocate_page_array()'s
BUG_ON() will oops the worker.
Note that this failure mode is currently masked due to another bug
(addressed next in this series) that prevents multiple encrypted folios
from being selected for the same write.
For now, just reset `rc` when redirtying the folio to prevent errors in
move_dirty_folio_in_page_array() from propagating. Note that
move_dirty_folio_in_page_array() is careful never to return errors on
the first folio, so there is no need to check for that. After this
change, ceph_process_folio_batch() no longer returns errors; its only
remaining failure indicator is `locked_pages == 0`, which the caller
already handles correctly.
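The hazard being fixed — a transient per-item failure leaking out through a shared `rc` variable — can be sketched in miniature. This is an illustrative model only (hypothetical helper names, not the actual ceph code):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for move_dirty_folio_in_page_array(): the
 * third item fails transiently, as a GFP_NOWAIT bounce-buffer
 * allocation might under memory pressure. */
static int emplace_item(int i)
{
	return (i == 2) ? -ENOMEM : 0;
}

/* Batch processor following the pattern of the fix: a per-item
 * failure ends (flushes) the batch via break, but rc is reset first
 * so the caller does not mistake it for a batch-level error. */
static int process_batch(int nitems, int *emplaced)
{
	int rc = 0;
	int i;

	for (i = 0; i < nitems; i++) {
		rc = emplace_item(i);
		if (rc) {
			rc = 0;	/* transient: flush, do not abort */
			break;
		}
		(*emplaced)++;
	}
	return rc;	/* always 0: caller submits what was locked */
}
```

Without the `rc = 0` reset, `process_batch()` would return -ENOMEM even though two items were successfully emplaced and must still be submitted — the same shape as the bug above.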
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
fs/ceph/addr.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
folio);
if (rc) {
+ rc = 0;
folio_redirty_for_writepage(wbc, folio);
folio_unlock(folio);
break;
--
2.52.0
* [PATCH v3 2/4] ceph: fix write storm on fscrypted files
2026-01-26 2:30 [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Sam Edwards
2026-01-26 2:30 ` [PATCH v3 1/4] ceph: do not propagate page array emplacement errors as batch errors Sam Edwards
@ 2026-01-26 2:30 ` Sam Edwards
2026-01-26 2:30 ` [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch() Sam Edwards
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Sam Edwards @ 2026-01-26 2:30 UTC (permalink / raw)
To: Xiubo Li, Ilya Dryomov
Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable
CephFS stores file data across multiple RADOS objects. An object is the
atomic unit of storage, so the writeback code must clean only folios
that belong to the same object with each OSD request.
CephFS also supports RAID0-style striping of file contents: if enabled,
each object stores multiple unbroken "stripe units" covering different
portions of the file; if disabled, a "stripe unit" is simply the whole
object. The stripe unit is (usually) reported as the inode's block size.
Though the writeback logic could, in principle, lock all dirty folios
belonging to the same object, its current design is to lock only a
single stripe unit at a time. Ever since this code was first written,
it has determined this size by checking the inode's block size.
However, the relatively-new fscrypt support needed to reduce the block
size for encrypted inodes to the crypto block size (see 'fixes' commit),
which causes an unnecessarily high number of write operations (~1024x as
many, with 4MiB objects) and correspondingly degraded performance.
Fix this (and clarify intent) by using i_layout.stripe_unit directly in
ceph_define_write_size() so that encrypted inodes are written back with
the same number of operations as if they were unencrypted.
This patch depends on the preceding commit ("ceph: do not propagate page
array emplacement errors as batch errors") for correctness. While it
applies cleanly on its own, applying it alone will introduce a
regression. This dependency is only relevant for kernels where
ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method") has
been applied; stable kernels without that commit are unaffected.
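The ~1024x figure follows directly from the sizes involved; a back-of-the-envelope check, assuming 4 MiB objects/stripe units and 4 KiB fscrypt crypto blocks as in the commit message:

```c
#include <assert.h>

#define OBJECT_SIZE	(4u << 20)	/* 4 MiB stripe unit == object */
#define CRYPTO_BLOCK	(4u << 10)	/* 4 KiB fscrypt crypto block */

/* With wsize capped at the crypto block size, each 4 MiB object is
 * written back in 4 KiB requests; with wsize at the stripe unit,
 * one request per object suffices. */
static unsigned int ops_per_object(unsigned int wsize)
{
	return OBJECT_SIZE / wsize;
}
```

Evaluating `ops_per_object()` at each write size gives 1024 requests per object before the fix versus 1 after, matching the "~1024x as many" claim.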
Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
fs/ceph/addr.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3462df35d245..39064893f35b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
{
struct inode *inode = mapping->host;
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- unsigned int wsize = i_blocksize(inode);
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ unsigned int wsize = ci->i_layout.stripe_unit;
if (fsc->mount_options->wsize < wsize)
wsize = fsc->mount_options->wsize;
--
2.52.0
* [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch()
2026-01-26 2:30 [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Sam Edwards
2026-01-26 2:30 ` [PATCH v3 1/4] ceph: do not propagate page array emplacement errors as batch errors Sam Edwards
2026-01-26 2:30 ` [PATCH v3 2/4] ceph: fix write storm on fscrypted files Sam Edwards
@ 2026-01-26 2:30 ` Sam Edwards
2026-01-26 22:55 ` Viacheslav Dubeyko
2026-01-26 2:30 ` [PATCH v3 4/4] ceph: assert writeback loop invariants Sam Edwards
2026-02-11 18:11 ` [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Ilya Dryomov
4 siblings, 1 reply; 12+ messages in thread
From: Sam Edwards @ 2026-01-26 2:30 UTC (permalink / raw)
To: Xiubo Li, Ilya Dryomov
Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
Jeff Layton, ceph-devel, linux-kernel, Sam Edwards
Following an earlier commit, ceph_process_folio_batch() no longer
returns errors because the writeback loop cannot handle them.
Since this function already indicates failure to lock any pages by
leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
to handle abandonment of a locked batch, change the return type of
ceph_process_folio_batch() to `void` and remove the pathological goto in
the writeback loop. The lack of a return code emphasizes that
ceph_process_folio_batch() is designed to be abort-free: that is, once
it commits a folio for writeback, it will not later abandon it or
propagate an error for that folio. Any future changes requiring "abort"
logic should uphold this invariant by cleaning up the page array and
resetting ceph_wbc.locked_pages appropriately.
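The resulting calling convention — no return code, with failure indicated solely by an empty batch — can be sketched as follows (simplified, hypothetical types standing in for ceph_writeback_ctl):

```c
#include <assert.h>

struct batch_ctl {
	int locked_pages;
};

/* Abort-free batch processing: the function either locks some pages
 * or leaves locked_pages == 0; it never signals an error while
 * holding a partially built batch. */
static void process_batch(struct batch_ctl *ctl, int available)
{
	ctl->locked_pages = (available > 0) ? available : 0;
}

/* Caller-side check, mirroring the "did we get anything?" test in
 * ceph_writepages_start(). */
static int batch_is_empty(const struct batch_ctl *ctl)
{
	return ctl->locked_pages == 0;
}
```

The caller proceeds to submission only when `batch_is_empty()` is false; there is no separate error path to reconcile with a partially locked batch.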
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
fs/ceph/addr.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 39064893f35b..cdf11288d6b7 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1284,16 +1284,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
}
static
-int ceph_process_folio_batch(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc)
+void ceph_process_folio_batch(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct ceph_writeback_ctl *ceph_wbc)
{
struct inode *inode = mapping->host;
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
struct ceph_client *cl = fsc->client;
struct folio *folio = NULL;
unsigned i;
- int rc = 0;
+ int rc;
for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
folio = ceph_wbc->fbatch.folios[i];
@@ -1323,12 +1323,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
rc = ceph_check_page_before_write(mapping, wbc,
ceph_wbc, folio);
if (rc == -ENODATA) {
- rc = 0;
folio_unlock(folio);
ceph_wbc->fbatch.folios[i] = NULL;
continue;
} else if (rc == -E2BIG) {
- rc = 0;
folio_unlock(folio);
ceph_wbc->fbatch.folios[i] = NULL;
break;
@@ -1370,7 +1368,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
folio);
if (rc) {
- rc = 0;
folio_redirty_for_writepage(wbc, folio);
folio_unlock(folio);
break;
@@ -1381,8 +1378,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
}
ceph_wbc->processed_in_fbatch = i;
-
- return rc;
}
static inline
@@ -1686,10 +1681,8 @@ static int ceph_writepages_start(struct address_space *mapping,
break;
process_folio_batch:
- rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
+ ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
- if (rc)
- goto release_folios;
/* did we get anything? */
if (!ceph_wbc.locked_pages)
--
2.52.0
* [PATCH v3 4/4] ceph: assert writeback loop invariants
2026-01-26 2:30 [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Sam Edwards
` (2 preceding siblings ...)
2026-01-26 2:30 ` [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch() Sam Edwards
@ 2026-01-26 2:30 ` Sam Edwards
2026-01-26 22:54 ` Viacheslav Dubeyko
2026-02-11 18:11 ` [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Ilya Dryomov
4 siblings, 1 reply; 12+ messages in thread
From: Sam Edwards @ 2026-01-26 2:30 UTC (permalink / raw)
To: Xiubo Li, Ilya Dryomov
Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
Jeff Layton, ceph-devel, linux-kernel, Sam Edwards
If `locked_pages` is zero, the page array must not be allocated:
ceph_process_folio_batch() uses `locked_pages` to decide when to
allocate `pages`, and redundant allocations trigger
ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
writeback stall) or even a kernel panic. Consequently, the main loop in
ceph_writepages_start() assumes that the lifetime of `pages` is confined
to a single iteration.
This expectation is currently not clear enough, as evidenced by two
recent patches which fix oopses caused by `pages` persisting into
the next loop iteration:
- "ceph: do not propagate page array emplacement errors as batch errors"
- "ceph: free page array when ceph_submit_write() fails"
Use an explicit BUG_ON() at the top of the loop to assert the loop's
preexisting expectation that `pages` is cleaned up by the previous
iteration. Because this is closely tied to `locked_pages`, also make it
the previous iteration's responsibility to guarantee its reset, and
verify with a second new BUG_ON() instead of handling (and masking)
failures to do so.
This patch does not change invariants, behavior, or failure modes.
The added BUG_ON() lines catch conditions that would already trigger an
oops, but do so earlier, for easier debugging and programmer clarity.
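The loop-top assertion pattern — each iteration must restore the invariant before the next iteration checks it — looks roughly like this, using assert() in place of BUG_ON() (names are illustrative, not the actual ceph code):

```c
#include <assert.h>
#include <stddef.h>

struct wb_state {
	int locked_pages;
	int *pages;
};

/* One loop body: build a batch, submit it, then clean up, restoring
 * the invariant (locked_pages == 0, pages == NULL) before returning. */
static void writeback_iteration(struct wb_state *st)
{
	static int page_array[4];

	st->pages = page_array;
	st->locked_pages = 4;
	/* ... submit the batch ... */
	st->pages = NULL;	/* the iteration's own responsibility */
	st->locked_pages = 0;
}

static void writeback_loop(struct wb_state *st, int iterations)
{
	while (iterations-- > 0) {
		/* BUG_ON() in the patch: assert, rather than re-zero
		 * and thereby mask, a violated invariant. */
		assert(!st->locked_pages);
		assert(!st->pages);
		writeback_iteration(st);
	}
}
```

If a future change leaked `pages` past an iteration, the assertion would fire at the top of the next pass, close to the offending code, instead of deep inside the allocation helper.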
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
fs/ceph/addr.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index cdf11288d6b7..4e392fc70d33 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1663,7 +1663,9 @@ static int ceph_writepages_start(struct address_space *mapping,
tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
while (!has_writeback_done(&ceph_wbc)) {
- ceph_wbc.locked_pages = 0;
+ BUG_ON(ceph_wbc.locked_pages);
+ BUG_ON(ceph_wbc.pages);
+
ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
get_more_pages:
--
2.52.0
* Re: [PATCH v3 4/4] ceph: assert writeback loop invariants
2026-01-26 2:30 ` [PATCH v3 4/4] ceph: assert writeback loop invariants Sam Edwards
@ 2026-01-26 22:54 ` Viacheslav Dubeyko
2026-01-29 0:14 ` Sam Edwards
0 siblings, 1 reply; 12+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-26 22:54 UTC (permalink / raw)
To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
Cc: Milind Changire, ceph-devel@vger.kernel.org, brauner@kernel.org,
jlayton@kernel.org, linux-kernel@vger.kernel.org
On Sun, 2026-01-25 at 18:30 -0800, Sam Edwards wrote:
> If `locked_pages` is zero, the page array must not be allocated:
> ceph_process_folio_batch() uses `locked_pages` to decide when to
> allocate `pages`, and redundant allocations trigger
> ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> writeback stall) or even a kernel panic. Consequently, the main loop in
> ceph_writepages_start() assumes that the lifetime of `pages` is confined
> to a single iteration.
>
> This expectation is currently not clear enough, as evidenced by two
> recent patches which fix oopses caused by `pages` persisting into
> the next loop iteration:
> - "ceph: do not propagate page array emplacement errors as batch errors"
> - "ceph: free page array when ceph_submit_write() fails"
>
> Use an explicit BUG_ON() at the top of the loop to assert the loop's
> preexisting expectation that `pages` is cleaned up by the previous
> iteration. Because this is closely tied to `locked_pages`, also make it
> the previous iteration's responsibility to guarantee its reset, and
> verify with a second new BUG_ON() instead of handling (and masking)
> failures to do so.
>
> This patch does not change invariants, behavior, or failure modes.
> The added BUG_ON() lines catch conditions that would already trigger oops,
> but do so earlier for easier debugging and programmer clarity.
>
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
> fs/ceph/addr.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index cdf11288d6b7..4e392fc70d33 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1663,7 +1663,9 @@ static int ceph_writepages_start(struct address_space *mapping,
> tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
>
> while (!has_writeback_done(&ceph_wbc)) {
> - ceph_wbc.locked_pages = 0;
> + BUG_ON(ceph_wbc.locked_pages);
> + BUG_ON(ceph_wbc.pages);
My complaints are still the same. I would prefer not to have BUG_ON() here.
Thanks,
Slava.
> +
> ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
>
> get_more_pages:
* Re: [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch()
2026-01-26 2:30 ` [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch() Sam Edwards
@ 2026-01-26 22:55 ` Viacheslav Dubeyko
2026-01-29 0:29 ` Sam Edwards
0 siblings, 1 reply; 12+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-26 22:55 UTC (permalink / raw)
To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
Cc: Milind Changire, ceph-devel@vger.kernel.org, brauner@kernel.org,
jlayton@kernel.org, linux-kernel@vger.kernel.org
On Sun, 2026-01-25 at 18:30 -0800, Sam Edwards wrote:
> Following an earlier commit, ceph_process_folio_batch() no longer
> returns errors because the writeback loop cannot handle them.
>
> Since this function already indicates failure to lock any pages by
> leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
> to handle abandonment of a locked batch, change the return type of
> ceph_process_folio_batch() to `void` and remove the pathological goto in
> the writeback loop. The lack of a return code emphasizes that
> ceph_process_folio_batch() is designed to be abort-free: that is, once
> it commits a folio for writeback, it will not later abandon it or
> propagate an error for that folio. Any future changes requiring "abort"
> logic should follow this invariant by cleaning up its array and
> resetting ceph_wbc.locked_pages appropriately.
>
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
> fs/ceph/addr.c | 17 +++++------------
> 1 file changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 39064893f35b..cdf11288d6b7 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1284,16 +1284,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
> }
>
> static
> -int ceph_process_folio_batch(struct address_space *mapping,
> - struct writeback_control *wbc,
> - struct ceph_writeback_ctl *ceph_wbc)
I still prefer to return the error code from the function.
Thanks,
Slava.
> +void ceph_process_folio_batch(struct address_space *mapping,
> + struct writeback_control *wbc,
> + struct ceph_writeback_ctl *ceph_wbc)
> {
> struct inode *inode = mapping->host;
> struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> struct ceph_client *cl = fsc->client;
> struct folio *folio = NULL;
> unsigned i;
> - int rc = 0;
> + int rc;
>
> for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
> folio = ceph_wbc->fbatch.folios[i];
> @@ -1323,12 +1323,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
> rc = ceph_check_page_before_write(mapping, wbc,
> ceph_wbc, folio);
> if (rc == -ENODATA) {
> - rc = 0;
> folio_unlock(folio);
> ceph_wbc->fbatch.folios[i] = NULL;
> continue;
> } else if (rc == -E2BIG) {
> - rc = 0;
> folio_unlock(folio);
> ceph_wbc->fbatch.folios[i] = NULL;
> break;
> @@ -1370,7 +1368,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> folio);
> if (rc) {
> - rc = 0;
> folio_redirty_for_writepage(wbc, folio);
> folio_unlock(folio);
> break;
> @@ -1381,8 +1378,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> }
>
> ceph_wbc->processed_in_fbatch = i;
> -
> - return rc;
> }
>
> static inline
> @@ -1686,10 +1681,8 @@ static int ceph_writepages_start(struct address_space *mapping,
> break;
>
> process_folio_batch:
> - rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> + ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
> - if (rc)
> - goto release_folios;
>
> /* did we get anything? */
> if (!ceph_wbc.locked_pages)
* Re: [PATCH v3 4/4] ceph: assert writeback loop invariants
2026-01-26 22:54 ` Viacheslav Dubeyko
@ 2026-01-29 0:14 ` Sam Edwards
2026-02-11 18:04 ` Ilya Dryomov
0 siblings, 1 reply; 12+ messages in thread
From: Sam Edwards @ 2026-01-29 0:14 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: Xiubo Li, idryomov@gmail.com, Milind Changire,
ceph-devel@vger.kernel.org, brauner@kernel.org,
jlayton@kernel.org, linux-kernel@vger.kernel.org
On Mon, Jan 26, 2026 at 2:54 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Sun, 2026-01-25 at 18:30 -0800, Sam Edwards wrote:
> > If `locked_pages` is zero, the page array must not be allocated:
> > ceph_process_folio_batch() uses `locked_pages` to decide when to
> > allocate `pages`, and redundant allocations trigger
> > ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> > writeback stall) or even a kernel panic. Consequently, the main loop in
> > ceph_writepages_start() assumes that the lifetime of `pages` is confined
> > to a single iteration.
> >
> > This expectation is currently not clear enough, as evidenced by two
> > recent patches which fix oopses caused by `pages` persisting into
> > the next loop iteration:
> > - "ceph: do not propagate page array emplacement errors as batch errors"
> > - "ceph: free page array when ceph_submit_write() fails"
> >
> > Use an explicit BUG_ON() at the top of the loop to assert the loop's
> > preexisting expectation that `pages` is cleaned up by the previous
> > iteration. Because this is closely tied to `locked_pages`, also make it
> > the previous iteration's responsibility to guarantee its reset, and
> > verify with a second new BUG_ON() instead of handling (and masking)
> > failures to do so.
> >
> > This patch does not change invariants, behavior, or failure modes.
> > The added BUG_ON() lines catch conditions that would already trigger oops,
> > but do so earlier for easier debugging and programmer clarity.
> >
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> > fs/ceph/addr.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index cdf11288d6b7..4e392fc70d33 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1663,7 +1663,9 @@ static int ceph_writepages_start(struct address_space *mapping,
> > tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
> >
> > while (!has_writeback_done(&ceph_wbc)) {
> > - ceph_wbc.locked_pages = 0;
> > + BUG_ON(ceph_wbc.locked_pages);
> > + BUG_ON(ceph_wbc.pages);
>
> My complaints are still the same. I would prefer not to have BUG_ON() here.
Hey Slava,
I understand your preference, but as discussed before, the BUG_ON() is
defensive programming to catch an invariant violation earlier than the
existing BUG_ON() in ceph_allocate_page_array(). If the invariant is
broken, we'll oops anyway; this just makes the oops happen sooner and
easier to debug, and reduces cognitive load for future programmers
trying to understand the loop invariants.
Regards,
Sam
>
> Thanks,
> Slava.
>
> > +
> > ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
> >
> > get_more_pages:
* Re: [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch()
2026-01-26 22:55 ` Viacheslav Dubeyko
@ 2026-01-29 0:29 ` Sam Edwards
2026-02-11 17:55 ` Ilya Dryomov
0 siblings, 1 reply; 12+ messages in thread
From: Sam Edwards @ 2026-01-29 0:29 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: Xiubo Li, idryomov@gmail.com, Milind Changire,
ceph-devel@vger.kernel.org, brauner@kernel.org,
jlayton@kernel.org, linux-kernel@vger.kernel.org
On Mon, Jan 26, 2026 at 2:55 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Sun, 2026-01-25 at 18:30 -0800, Sam Edwards wrote:
> > Following an earlier commit, ceph_process_folio_batch() no longer
> > returns errors because the writeback loop cannot handle them.
> >
> > Since this function already indicates failure to lock any pages by
> > leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
> > to handle abandonment of a locked batch, change the return type of
> > ceph_process_folio_batch() to `void` and remove the pathological goto in
> > the writeback loop. The lack of a return code emphasizes that
> > ceph_process_folio_batch() is designed to be abort-free: that is, once
> > it commits a folio for writeback, it will not later abandon it or
> > propagate an error for that folio. Any future changes requiring "abort"
> > logic should follow this invariant by cleaning up its array and
> > resetting ceph_wbc.locked_pages appropriately.
> >
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> > fs/ceph/addr.c | 17 +++++------------
> > 1 file changed, 5 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 39064893f35b..cdf11288d6b7 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1284,16 +1284,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
> > }
> >
> > static
> > -int ceph_process_folio_batch(struct address_space *mapping,
> > - struct writeback_control *wbc,
> > - struct ceph_writeback_ctl *ceph_wbc)
>
> I still prefer to return the error code from the function.
Hi Slava,
Since the objection to this patch is mere preference at this point, we
need to seek broader consensus before either of us unilaterally
decides whether this style is appropriate for the CephFS subsystem.
I'm willing to follow the preferences of the group, we just need a
"group" first. :)
Christian: Since you committed the patch introducing the
ceph_process_folio_batch() function, would you like to weigh in?
Thanks,
Sam
>
> Thanks,
> Slava.
>
> > +void ceph_process_folio_batch(struct address_space *mapping,
> > + struct writeback_control *wbc,
> > + struct ceph_writeback_ctl *ceph_wbc)
> > {
> > struct inode *inode = mapping->host;
> > struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > struct ceph_client *cl = fsc->client;
> > struct folio *folio = NULL;
> > unsigned i;
> > - int rc = 0;
> > + int rc;
> >
> > for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
> > folio = ceph_wbc->fbatch.folios[i];
> > @@ -1323,12 +1323,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > rc = ceph_check_page_before_write(mapping, wbc,
> > ceph_wbc, folio);
> > if (rc == -ENODATA) {
> > - rc = 0;
> > folio_unlock(folio);
> > ceph_wbc->fbatch.folios[i] = NULL;
> > continue;
> > } else if (rc == -E2BIG) {
> > - rc = 0;
> > folio_unlock(folio);
> > ceph_wbc->fbatch.folios[i] = NULL;
> > break;
> > @@ -1370,7 +1368,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> > folio);
> > if (rc) {
> > - rc = 0;
> > folio_redirty_for_writepage(wbc, folio);
> > folio_unlock(folio);
> > break;
> > @@ -1381,8 +1378,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > }
> >
> > ceph_wbc->processed_in_fbatch = i;
> > -
> > - return rc;
> > }
> >
> > static inline
> > @@ -1686,10 +1681,8 @@ static int ceph_writepages_start(struct address_space *mapping,
> > break;
> >
> > process_folio_batch:
> > - rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > + ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
> > - if (rc)
> > - goto release_folios;
> >
> > /* did we get anything? */
> > if (!ceph_wbc.locked_pages)
* Re: [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch()
2026-01-29 0:29 ` Sam Edwards
@ 2026-02-11 17:55 ` Ilya Dryomov
0 siblings, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2026-02-11 17:55 UTC (permalink / raw)
To: Sam Edwards
Cc: Viacheslav Dubeyko, Xiubo Li, Milind Changire,
ceph-devel@vger.kernel.org, brauner@kernel.org,
jlayton@kernel.org, linux-kernel@vger.kernel.org
On Thu, Jan 29, 2026 at 1:30 AM Sam Edwards <cfsworks@gmail.com> wrote:
>
> On Mon, Jan 26, 2026 at 2:55 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> >
> > On Sun, 2026-01-25 at 18:30 -0800, Sam Edwards wrote:
> > > Following an earlier commit, ceph_process_folio_batch() no longer
> > > returns errors because the writeback loop cannot handle them.
> > >
> > > Since this function already indicates failure to lock any pages by
> > > leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
> > > to handle abandonment of a locked batch, change the return type of
> > > ceph_process_folio_batch() to `void` and remove the pathological goto in
> > > the writeback loop. The lack of a return code emphasizes that
> > > ceph_process_folio_batch() is designed to be abort-free: that is, once
> > > it commits a folio for writeback, it will not later abandon it or
> > > propagate an error for that folio. Any future changes requiring "abort"
> > > logic should follow this invariant by cleaning up its array and
> > > resetting ceph_wbc.locked_pages appropriately.
> > >
> > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > ---
> > > fs/ceph/addr.c | 17 +++++------------
> > > 1 file changed, 5 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 39064893f35b..cdf11288d6b7 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1284,16 +1284,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
> > > }
> > >
> > > static
> > > -int ceph_process_folio_batch(struct address_space *mapping,
> > > - struct writeback_control *wbc,
> > > - struct ceph_writeback_ctl *ceph_wbc)
> >
> > I still prefer to return the error code from the function.
>
> Hi Slava,
>
> Since the objection to this patch is mere preference at this point, we
> need to seek broader consensus before either of us unilaterally
> decides whether this style is appropriate for the CephFS subsystem.
> I'm willing to follow the preferences of the group, we just need a
> "group" first. :)
Hi Sam,
The rationale that you provided on v2 [1] and in other responses makes
sense to me. If the need for signaling an error that would be handled
by the (currently only) caller in a meaningful way arises, the signature
can be rolled back, but until then I don't see any point in continuing
to always return 0. Having this rc variable that one needs to
remember to reset on every error path only brings confusion; the bug
fixed in the first patch of this series is clear proof of that.
[1] https://lore.kernel.org/ceph-devel/CAH5Ym4jn8wg+mYGqKGb17OZGBkyDeX-Vx3wgfVT0cqPtn36QFQ@mail.gmail.com/
Thanks,
Ilya
* Re: [PATCH v3 4/4] ceph: assert writeback loop invariants
2026-01-29 0:14 ` Sam Edwards
@ 2026-02-11 18:04 ` Ilya Dryomov
0 siblings, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2026-02-11 18:04 UTC (permalink / raw)
To: Sam Edwards
Cc: Viacheslav Dubeyko, Xiubo Li, Milind Changire,
ceph-devel@vger.kernel.org, brauner@kernel.org,
jlayton@kernel.org, linux-kernel@vger.kernel.org
On Thu, Jan 29, 2026 at 1:14 AM Sam Edwards <cfsworks@gmail.com> wrote:
>
> On Mon, Jan 26, 2026 at 2:54 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> >
> > On Sun, 2026-01-25 at 18:30 -0800, Sam Edwards wrote:
> > > If `locked_pages` is zero, the page array must not be allocated:
> > > ceph_process_folio_batch() uses `locked_pages` to decide when to
> > > allocate `pages`, and redundant allocations trigger
> > > ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> > > writeback stall) or even a kernel panic. Consequently, the main loop in
> > > ceph_writepages_start() assumes that the lifetime of `pages` is confined
> > > to a single iteration.
> > >
> > > This expectation is currently not clear enough, as evidenced by two
> > > recent patches which fix oopses caused by `pages` persisting into
> > > the next loop iteration:
> > > - "ceph: do not propagate page array emplacement errors as batch errors"
> > > - "ceph: free page array when ceph_submit_write() fails"
> > >
> > > Use an explicit BUG_ON() at the top of the loop to assert the loop's
> > > preexisting expectation that `pages` is cleaned up by the previous
> > > iteration. Because this is closely tied to `locked_pages`, also make it
> > > the previous iteration's responsibility to guarantee its reset, and
> > > verify with a second new BUG_ON() instead of handling (and masking)
> > > failures to do so.
> > >
> > > This patch does not change invariants, behavior, or failure modes.
> > > The added BUG_ON() lines catch conditions that would already trigger an
> > > oops, but do so earlier for easier debugging and programmer clarity.
> > >
> > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > ---
> > > fs/ceph/addr.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index cdf11288d6b7..4e392fc70d33 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1663,7 +1663,9 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
> > >
> > > while (!has_writeback_done(&ceph_wbc)) {
> > > - ceph_wbc.locked_pages = 0;
> > > + BUG_ON(ceph_wbc.locked_pages);
> > > + BUG_ON(ceph_wbc.pages);
> >
> > My complaints are still the same. I would prefer not to have BUG_ON() here.
>
> Hey Slava,
>
> I understand your preference, but as discussed before, the BUG_ON() is
> defensive programming to catch an invariant violation earlier than the
> existing BUG_ON() in ceph_allocate_page_array(). If the invariant is
> broken, we'll oops anyway; this just makes the oops happen sooner and
> easier to debug, and reduces cognitive load for future programmers
> trying to understand the loop invariants.
Hi Sam,
I agree with your argument, these BUG_ONs seem on point to me.
Thanks,
Ilya
* Re: [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes
2026-01-26 2:30 [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Sam Edwards
` (3 preceding siblings ...)
2026-01-26 2:30 ` [PATCH v3 4/4] ceph: assert writeback loop invariants Sam Edwards
@ 2026-02-11 18:11 ` Ilya Dryomov
4 siblings, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2026-02-11 18:11 UTC (permalink / raw)
To: Sam Edwards
Cc: Xiubo Li, Viacheslav Dubeyko, Christian Brauner, Milind Changire,
Jeff Layton, ceph-devel, linux-kernel
On Mon, Jan 26, 2026 at 3:31 AM Sam Edwards <cfsworks@gmail.com> wrote:
>
> Hello list,
>
> This is v3 of my series that addresses interrelated issues in CephFS writeback,
> fixing crashes, improving robustness, and correcting performance behavior,
> particularly for fscrypted files. [1]
>
> Changes v2->v3:
> - Split out two patches ("ceph: free page array when ceph_submit_write() fails"
> and "ceph: split out page-array discarding to a function") to a new series
> [2] since they are independent and had no outstanding review comments.
> - Lowercase the subject lines of commit messages, per subsystem-local style.
> - Update the commit message of ("ceph: fix write storm on fscrypted files") to
> mention the explicit dependency on ("ceph: do not propagate page array
> emplacement errors as batch errors") for correctness, to prevent the former
> from being accidentally backported without the latter.
> - Reorder the series to make the aforementioned patches consecutive. The series
> cadence is now: bugfix, bugfix, cleanup, cleanup
> - Add a clarification to ("ceph: remove error return from
> ceph_process_folio_batch()") that "abort" logic is still possible, just that
> it is responsible for cleaning up after itself.
>
> Changes v1->v2:
> - Clarify patch #1's commit message to establish that failures on the first
> folio are not possible.
> - Add another patch to move the "clean up page array on abort" logic to a new
> ceph_discard_page_array() function. (Thanks Slava!)
> - Change the wording "grossly degraded performance" to instead read
> "correspondingly degraded performance." This makes the causal relationship
> clearer (that write throughput is limited much more significantly by write
> op/s due to the bug) without making any claims (qualitative or otherwise)
> about significance. (Thanks Slava!)
> - Reset locked_pages = 0 immediately when the page array is discarded,
> simplifying patch #5 ("ceph: Assert writeback loop invariants")
> - Reword "as evidenced by the previous two patches which fix oopses" to
> "as evidenced by two recent patches which fix oopses" and refer to the
> patches by subject (being in the same series, I cannot refer to them by hash)
>
> Warm regards,
> Sam
>
> [1] https://lore.kernel.org/all/20260107210139.40554-1-CFSworks@gmail.com/
> [2] https://lore.kernel.org/all/20260126022715.404984-1-CFSworks@gmail.com/
>
> Sam Edwards (4):
> ceph: do not propagate page array emplacement errors as batch errors
> ceph: fix write storm on fscrypted files
> ceph: remove error return from ceph_process_folio_batch()
> ceph: assert writeback loop invariants
>
> fs/ceph/addr.c | 23 ++++++++++-------------
> 1 file changed, 10 insertions(+), 13 deletions(-)
>
> --
> 2.52.0
>
Applied.
Thanks,
Ilya
Thread overview: 12+ messages
2026-01-26 2:30 [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Sam Edwards
2026-01-26 2:30 ` [PATCH v3 1/4] ceph: do not propagate page array emplacement errors as batch errors Sam Edwards
2026-01-26 2:30 ` [PATCH v3 2/4] ceph: fix write storm on fscrypted files Sam Edwards
2026-01-26 2:30 ` [PATCH v3 3/4] ceph: remove error return from ceph_process_folio_batch() Sam Edwards
2026-01-26 22:55 ` Viacheslav Dubeyko
2026-01-29 0:29 ` Sam Edwards
2026-02-11 17:55 ` Ilya Dryomov
2026-01-26 2:30 ` [PATCH v3 4/4] ceph: assert writeback loop invariants Sam Edwards
2026-01-26 22:54 ` Viacheslav Dubeyko
2026-01-29 0:14 ` Sam Edwards
2026-02-11 18:04 ` Ilya Dryomov
2026-02-11 18:11 ` [PATCH v3 0/4] ceph: CephFS writeback correctness and performance fixes Ilya Dryomov