[PATCH 0/5] ceph: CephFS writeback correctness and performance fixes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] ceph: CephFS writeback correctness and performance fixes
@ 2025-12-31  2:43 Sam Edwards
  2025-12-31  2:43 ` [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Sam Edwards @ 2025-12-31  2:43 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards

Hello list,

This series addresses several interrelated CephFS writeback issues,
particularly for fscrypted files. My work began with a performance problem:
encrypted files caused a write storm during writeback because the writeback
code was inadvertently selecting the crypto block instead of the stripe unit as
the maximum write unit size.

While testing that fix, I encountered a correctness bug: failures to allocate
bounce pages during writeback were incorrectly propagated as batch errors,
which trigger kernel oopses/panics due to poor handling in the writeback loop.
While investigating that, I discovered that the same oopses could be triggered
by a failure in ceph_submit_write() as well.

The patches in this series:

1. Prevent bounce page allocation failures from aborting the writeback batch
   and causing a kernel oops/panic due to the page array not being freed.
2. Remove the now-redundant error return from ceph_process_folio_batch().
3. Free page arrays during failure in ceph_submit_write(), preventing another
   path to the same kernel oops/panic. This was not an issue I encountered in
   testing, and it is tricky to trigger organically. I used the fault injection
   framework to confirm it and verify the fix.
4. Assert writeback loop invariants explicitly to help prevent regressions and
   aid debugging should the problem reappear.
5. Fix the write storm on fscrypted files by using the correct stripe unit.

Note that this series follows a "fix-then-refactor" cadence: patches 1, 3, and
5 fix bugs and are intended for stable, while patches 2 and 4 represent code
cleanup and are intended only for next.

Wishing you all a prosperous 2026 ahead,
Sam

Sam Edwards (5):
  ceph: Do not propagate page array emplacement errors as batch errors
  ceph: Remove error return from ceph_process_folio_batch()
  ceph: Free page array when ceph_submit_write fails
  ceph: Assert writeback loop invariants
  ceph: Fix write storm on fscrypted files

 fs/ceph/addr.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

-- 
2.51.2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors
  2025-12-31  2:43 [PATCH 0/5] ceph: CephFS writeback correctness and performance fixes Sam Edwards
@ 2025-12-31  2:43 ` Sam Edwards
  2026-01-05 20:23   ` Viacheslav Dubeyko
  2025-12-31  2:43 ` [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch() Sam Edwards
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2025-12-31  2:43 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable

When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
because it needs to allocate bounce buffers to store the encrypted
versions of each folio. Each folio beyond the first allocates its bounce
buffer with GFP_NOWAIT. Failures are common (and expected) under this
allocation mode; they should flush (not abort) the batch.

However, ceph_process_folio_batch() uses the same `rc` variable for its
own return code and for capturing the return codes of its routine calls;
failing to reset `rc` back to 0 results in the error being propagated
out to the main writeback loop, which cannot actually tolerate any
errors here: once `ceph_wbc.pages` is allocated, it must be passed to
ceph_submit_write() to be freed. If it survives until the next iteration
(e.g. due to the goto being followed), ceph_allocate_page_array()'s
BUG_ON() will oops the worker. (Subsequent patches in this series make
the loop more robust.)

Note that this failure mode is currently masked due to another bug
(addressed later in this series) that prevents multiple encrypted folios
from being selected for the same write.

For now, just reset `rc` when redirtying the folio and prevent the
error from propagating. After this change, ceph_process_folio_batch() no
longer returns errors; its only remaining failure indicator is
`locked_pages == 0`, which the caller already handles correctly. The
next patch in this series addresses this.

Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
 				folio);
 		if (rc) {
+			rc = 0;
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;
-- 
2.51.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re:  [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors
  2025-12-31  2:43 ` [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
@ 2026-01-05 20:23   ` Viacheslav Dubeyko
  2026-01-06  6:52     ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-05 20:23 UTC (permalink / raw)
  To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
  Cc: Milind Changire, stable@vger.kernel.org,
	ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
> because it needs to allocate bounce buffers to store the encrypted
> versions of each folio. Each folio beyond the first allocates its bounce
> buffer with GFP_NOWAIT. Failures are common (and expected) under this
> allocation mode; they should flush (not abort) the batch.
> 
> However, ceph_process_folio_batch() uses the same `rc` variable for its
> own return code and for capturing the return codes of its routine calls;
> failing to reset `rc` back to 0 results in the error being propagated
> out to the main writeback loop, which cannot actually tolerate any
> errors here: once `ceph_wbc.pages` is allocated, it must be passed to
> ceph_submit_write() to be freed. If it survives until the next iteration
> (e.g. due to the goto being followed), ceph_allocate_page_array()'s
> BUG_ON() will oops the worker. (Subsequent patches in this series make
> the loop more robust.)

I think you are right with the fix. We have the loop here and if we already
moved some dirty folios, then we should flush it. But what if we failed on the
first one folio, then should we return no error code in this case?

> 
> Note that this failure mode is currently masked due to another bug
> (addressed later in this series) that prevents multiple encrypted folios
> from being selected for the same write.

So, maybe, this patch has been not correctly placed in the order? It will be
good to see the reproduction of the issue and which symptoms we have for this
issue. Do you have the reproduction script and call trace of the issue?

> 
> For now, just reset `rc` when redirtying the folio and prevent the
> error from propagating. After this change, ceph_process_folio_batch() no
> longer returns errors; its only remaining failure indicator is
> `locked_pages == 0`, which the caller already handles correctly. The
> next patch in this series addresses this.
> 
> Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
>  fs/ceph/addr.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 63b75d214210..3462df35d245 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
>  		rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
>  				folio);
>  		if (rc) {
> +			rc = 0;

I like the fix but I would like to clarify the above questions at first.

Thanks,
Slava. 

>  			folio_redirty_for_writepage(wbc, folio);
>  			folio_unlock(folio);
>  			break;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors
  2026-01-05 20:23   ` Viacheslav Dubeyko
@ 2026-01-06  6:52     ` Sam Edwards
  2026-01-06 21:08       ` Viacheslav Dubeyko
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2026-01-06  6:52 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Xiubo Li, idryomov@gmail.com, Milind Changire,
	stable@vger.kernel.org, ceph-devel@vger.kernel.org,
	brauner@kernel.org, jlayton@kernel.org,
	linux-kernel@vger.kernel.org

On Mon, Jan 5, 2026 at 12:24 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
> > because it needs to allocate bounce buffers to store the encrypted
> > versions of each folio. Each folio beyond the first allocates its bounce
> > buffer with GFP_NOWAIT. Failures are common (and expected) under this
> > allocation mode; they should flush (not abort) the batch.
> >
> > However, ceph_process_folio_batch() uses the same `rc` variable for its
> > own return code and for capturing the return codes of its routine calls;
> > failing to reset `rc` back to 0 results in the error being propagated
> > out to the main writeback loop, which cannot actually tolerate any
> > errors here: once `ceph_wbc.pages` is allocated, it must be passed to
> > ceph_submit_write() to be freed. If it survives until the next iteration
> > (e.g. due to the goto being followed), ceph_allocate_page_array()'s
> > BUG_ON() will oops the worker. (Subsequent patches in this series make
> > the loop more robust.)
>

Hi Slava,

> I think you are right with the fix. We have the loop here and if we already
> moved some dirty folios, then we should flush it. But what if we failed on the
> first one folio, then should we return no error code in this case?

The case you ask about, where move_dirty_folio_in_page_array() returns
an error for the first folio, is currently not possible:
1) The only error code that move_dirty_folio_in_page_array() can
propagate is from fscrypt_encrypt_pagecache_blocks(), which it calls
with GFP_NOFS for the first folio. The latter function's doc comment
outright states:
 * The bounce page allocation is mempool-backed, so it will always succeed when
 * @gfp_flags includes __GFP_DIRECT_RECLAIM, e.g. when it's GFP_NOFS.
2) The error return isn't even reachable for the first folio because
of the BUG_ON(ceph_wbc->locked_pages == 0); line.

>
> >
> > Note that this failure mode is currently masked due to another bug
> > (addressed later in this series) that prevents multiple encrypted folios
> > from being selected for the same write.
>
> So, maybe, this patch has been not correctly placed in the order?

This crash is unmasked by patch 5 of this series. (It allows multiple
folios to be batched when fscrypt is enabled.) Patch 5 has no hard
dependency on anything else in this series, so it could -- in
principle -- be ordered first as you suggest. However, that ordering
would deliberately cause a regression in kernel stability, even if
only briefly. That's not considered good practice in my view, as it
may affect people who are trying to bisect and regression test. So the
ordering of this series is: fix the crash in the unused code first,
then fix the bug that makes it unused.

> It will be
> good to see the reproduction of the issue and which symptoms we have for this
> issue. Do you have the reproduction script and call trace of the issue?

Fair point!

Function inlining makes the call trace not very interesting:
Call trace:
 ceph_writepages_start+0x16ec/0x18e0 [ceph] ()
 do_writepages+0xb0/0x1c0
 __writeback_single_inode+0x4c/0x4d8
 writeback_sb_inodes+0x238/0x4c8
 __writeback_inodes_wb+0x64/0x120
 wb_writeback+0x320/0x3e8
 wb_workfn+0x42c/0x518
 process_one_work+0x17c/0x428
 worker_thread+0x260/0x390
 kthread+0x148/0x240
 ret_from_fork+0x10/0x20
Code: 34ffdee0 52800020 3903e7e0 17fffef4 (d4210000)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception

ceph_writepages_start+0x16ec corresponds to linux-6.18.2/fs/ceph/addr.c:1222

However, these repro steps should work:
1) Apply patch 5 from this series (and no other patches)
2) Mount CephFS and activate fscrypt
3) Copy a large directory into the CephFS mount
4) After dozens of GBs transferred, you should observe the above kernel oops

Warm regards,
Sam




>
> >
> > For now, just reset `rc` when redirtying the folio and prevent the
> > error from propagating. After this change, ceph_process_folio_batch() no
> > longer returns errors; its only remaining failure indicator is
> > `locked_pages == 0`, which the caller already handles correctly. The
> > next patch in this series addresses this.
> >
> > Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> >  fs/ceph/addr.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 63b75d214210..3462df35d245 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
> >               rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> >                               folio);
> >               if (rc) {
> > +                     rc = 0;
>
> I like the fix but I would like to clarify the above questions at first.
>
> Thanks,
> Slava.
>
> >                       folio_redirty_for_writepage(wbc, folio);
> >                       folio_unlock(folio);
> >                       break;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors
  2026-01-06  6:52     ` Sam Edwards
@ 2026-01-06 21:08       ` Viacheslav Dubeyko
  2026-01-06 23:50         ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-06 21:08 UTC (permalink / raw)
  To: cfsworks@gmail.com
  Cc: Xiubo Li, brauner@kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, jlayton@kernel.org, Milind Changire,
	idryomov@gmail.com, stable@vger.kernel.org

On Mon, 2026-01-05 at 22:52 -0800, Sam Edwards wrote:
> On Mon, Jan 5, 2026 at 12:24 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> > 
> > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
> > > because it needs to allocate bounce buffers to store the encrypted
> > > versions of each folio. Each folio beyond the first allocates its bounce
> > > buffer with GFP_NOWAIT. Failures are common (and expected) under this
> > > allocation mode; they should flush (not abort) the batch.
> > > 
> > > However, ceph_process_folio_batch() uses the same `rc` variable for its
> > > own return code and for capturing the return codes of its routine calls;
> > > failing to reset `rc` back to 0 results in the error being propagated
> > > out to the main writeback loop, which cannot actually tolerate any
> > > errors here: once `ceph_wbc.pages` is allocated, it must be passed to
> > > ceph_submit_write() to be freed. If it survives until the next iteration
> > > (e.g. due to the goto being followed), ceph_allocate_page_array()'s
> > > BUG_ON() will oops the worker. (Subsequent patches in this series make
> > > the loop more robust.)
> > 
> 
> Hi Slava,
> 
> > I think you are right with the fix. We have the loop here and if we already
> > moved some dirty folios, then we should flush it. But what if we failed on the
> > first one folio, then should we return no error code in this case?
> 
> The case you ask about, where move_dirty_folio_in_page_array() returns
> an error for the first folio, is currently not possible:
> 1) The only error code that move_dirty_folio_in_page_array() can
> propagate is from fscrypt_encrypt_pagecache_blocks(), which it calls
> with GFP_NOFS for the first folio. The latter function's doc comment
> outright states:
>  * The bounce page allocation is mempool-backed, so it will always succeed when
>  * @gfp_flags includes __GFP_DIRECT_RECLAIM, e.g. when it's GFP_NOFS.
> 2) The error return isn't even reachable for the first folio because
> of the BUG_ON(ceph_wbc->locked_pages == 0); line.
> 

Unfortunately, the kernel code is not something completely stable. We cannot
rely on particular state of the code. The code should be stable, robust enough,
and ready for different situations. The mentioned BUG_ON() could be removed
somehow during refactoring because we already have comment there "better not
fail on first page!". Also, the behavior of fscrypt_encrypt_pagecache_blocks()
could be changed too. So, we need to expect any bad situation and this is why I
prefer to manage such potential (and maybe not so potential) erroneous
situation(s).

> > 
> > > 
> > > Note that this failure mode is currently masked due to another bug
> > > (addressed later in this series) that prevents multiple encrypted folios
> > > from being selected for the same write.
> > 
> > So, maybe, this patch has been not correctly placed in the order?
> 
> This crash is unmasked by patch 5 of this series. (It allows multiple
> folios to be batched when fscrypt is enabled.) Patch 5 has no hard
> dependency on anything else in this series, so it could -- in
> principle -- be ordered first as you suggest. However, that ordering
> would deliberately cause a regression in kernel stability, even if
> only briefly. That's not considered good practice in my view, as it
> may affect people who are trying to bisect and regression test. So the
> ordering of this series is: fix the crash in the unused code first,
> then fix the bug that makes it unused.
> 

OK, your point sounds confusing, frankly speaking. If we cannot reproduce the
issue because another bug hides the issue, then no such issue exists. And we
don't need to fix something. So, from the logical point of view, we need to fix
the first bug, then we can reproduce the hidden issue, and, finally, the fix
makes sense.

I didn't suggest too make the patch 5th as the first one. But I believe that
this patch should follow to the patch 5th. 

> > It will be
> > good to see the reproduction of the issue and which symptoms we have for this
> > issue. Do you have the reproduction script and call trace of the issue?
> 
> Fair point!
> 
> Function inlining makes the call trace not very interesting:
> Call trace:
>  ceph_writepages_start+0x16ec/0x18e0 [ceph] ()
>  do_writepages+0xb0/0x1c0
>  __writeback_single_inode+0x4c/0x4d8
>  writeback_sb_inodes+0x238/0x4c8
>  __writeback_inodes_wb+0x64/0x120
>  wb_writeback+0x320/0x3e8
>  wb_workfn+0x42c/0x518
>  process_one_work+0x17c/0x428
>  worker_thread+0x260/0x390
>  kthread+0x148/0x240
>  ret_from_fork+0x10/0x20
> Code: 34ffdee0 52800020 3903e7e0 17fffef4 (d4210000)
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Oops - BUG: Fatal exception
> 
> ceph_writepages_start+0x16ec corresponds to linux-6.18.2/fs/ceph/addr.c:1222
> 
> However, these repro steps should work:
> 1) Apply patch 5 from this series (and no other patches)
> 2) Mount CephFS and activate fscrypt
> 3) Copy a large directory into the CephFS mount
> 4) After dozens of GBs transferred, you should observe the above kernel oops

Could we have all of these details in the commit message?

Thanks,
Slava.

> 
> Warm regards,
> Sam
> 
> 
> 
> 
> > 
> > > 
> > > For now, just reset `rc` when redirtying the folio and prevent the
> > > error from propagating. After this change, ceph_process_folio_batch() no
> > > longer returns errors; its only remaining failure indicator is
> > > `locked_pages == 0`, which the caller already handles correctly. The
> > > next patch in this series addresses this.
> > > 
> > > Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > ---
> > >  fs/ceph/addr.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 63b75d214210..3462df35d245 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > >               rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> > >                               folio);
> > >               if (rc) {
> > > +                     rc = 0;
> > 
> > I like the fix but I would like to clarify the above questions at first.
> > 
> > Thanks,
> > Slava.
> > 
> > >                       folio_redirty_for_writepage(wbc, folio);
> > >                       folio_unlock(folio);
> > >                       break;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors
  2026-01-06 21:08       ` Viacheslav Dubeyko
@ 2026-01-06 23:50         ` Sam Edwards
  0 siblings, 0 replies; 24+ messages in thread
From: Sam Edwards @ 2026-01-06 23:50 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Xiubo Li, brauner@kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, jlayton@kernel.org, Milind Changire,
	idryomov@gmail.com, stable@vger.kernel.org

On Tue, Jan 6, 2026 at 1:08 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Mon, 2026-01-05 at 22:52 -0800, Sam Edwards wrote:
> > On Mon, Jan 5, 2026 at 12:24 PM Viacheslav Dubeyko
> > <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > > When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
> > > > because it needs to allocate bounce buffers to store the encrypted
> > > > versions of each folio. Each folio beyond the first allocates its bounce
> > > > buffer with GFP_NOWAIT. Failures are common (and expected) under this
> > > > allocation mode; they should flush (not abort) the batch.
> > > >
> > > > However, ceph_process_folio_batch() uses the same `rc` variable for its
> > > > own return code and for capturing the return codes of its routine calls;
> > > > failing to reset `rc` back to 0 results in the error being propagated
> > > > out to the main writeback loop, which cannot actually tolerate any
> > > > errors here: once `ceph_wbc.pages` is allocated, it must be passed to
> > > > ceph_submit_write() to be freed. If it survives until the next iteration
> > > > (e.g. due to the goto being followed), ceph_allocate_page_array()'s
> > > > BUG_ON() will oops the worker. (Subsequent patches in this series make
> > > > the loop more robust.)
> > >
> >
> > Hi Slava,
> >
> > > I think you are right with the fix. We have the loop here and if we already
> > > moved some dirty folios, then we should flush it. But what if we failed on the
> > > first one folio, then should we return no error code in this case?
> >
> > The case you ask about, where move_dirty_folio_in_page_array() returns
> > an error for the first folio, is currently not possible:
> > 1) The only error code that move_dirty_folio_in_page_array() can
> > propagate is from fscrypt_encrypt_pagecache_blocks(), which it calls
> > with GFP_NOFS for the first folio. The latter function's doc comment
> > outright states:
> >  * The bounce page allocation is mempool-backed, so it will always succeed when
> >  * @gfp_flags includes __GFP_DIRECT_RECLAIM, e.g. when it's GFP_NOFS.
> > 2) The error return isn't even reachable for the first folio because
> > of the BUG_ON(ceph_wbc->locked_pages == 0); line.
> >
>

Good day Slava,

> Unfortunately, the kernel code is not something completely stable. We cannot
> rely on particular state of the code. The code should be stable, robust enough,
> and ready for different situations.

You describe "defensive programming." I fully agree and am a strong
advocate for it, but each defensive measure comes with a complexity
cost. A skilled defensive programmer evaluates the likelihood of each
failure and invests that cost only where it's most warranted.

> The mentioned BUG_ON() could be removed
> somehow during refactoring because we already have comment there "better not
> fail on first page!".

If the question is "What happens if the first folio fails when the
BUG_ON is removed?" then my answer is: that is the responsibility of
the person removing it. I am leaving the BUG_ON() in place.

> Also, the behavior of fscrypt_encrypt_pagecache_blocks()
> could be changed too.

Changing that would alter the contract of
fscrypt_encrypt_pagecache_blocks(). Contracts can evolve, but anyone
making such a change must audit all call sites to ensure nothing
breaks. Today, this is purely hypothetical; the function is not being
changed. Speculating about behavior under a different, unimplemented
contract is not a basis for complicating the current code.

> So, we need to expect any bad situation and this is why I
> prefer to manage such potential (and maybe not so potential) erroneous
> situation(s).

This point is moot. Even if move_dirty_folio_in_page_array() somehow
returned nonzero on the first folio, ceph_process_folio_batch() would
simply lock zero folios, which ceph_writepages_start() already handles
gracefully.

>
> > >
> > > >
> > > > Note that this failure mode is currently masked due to another bug
> > > > (addressed later in this series) that prevents multiple encrypted folios
> > > > from being selected for the same write.
> > >
> > > So, maybe, this patch has been not correctly placed in the order?
> >
> > This crash is unmasked by patch 5 of this series. (It allows multiple
> > folios to be batched when fscrypt is enabled.) Patch 5 has no hard
> > dependency on anything else in this series, so it could -- in
> > principle -- be ordered first as you suggest. However, that ordering
> > would deliberately cause a regression in kernel stability, even if
> > only briefly. That's not considered good practice in my view, as it
> > may affect people who are trying to bisect and regression test. So the
> > ordering of this series is: fix the crash in the unused code first,
> > then fix the bug that makes it unused.
> >
>
> OK, your point sounds confusing, frankly speaking. If we cannot reproduce the
> issue because another bug hides the issue, then no such issue exists. And we
> don't need to fix something. So, from the logical point of view, we need to fix
> the first bug, then we can reproduce the hidden issue, and, finally, the fix
> makes sense.

With respect, that reasoning is flawed and not appropriate for a
technical discussion. The crash in question cannot currently occur,
but that does *not* make the fix unnecessary. Patch #5 in this series
will re-enable the code path, at which point the crash becomes
possible. Addressing it now ensures correctness and avoids introducing
a regression. Attempting to "see it happen in the wild" before fixing
it is neither required nor acceptable practice.

We are not uncertain about the crash: I have provided steps to
reproduce it. You can apply patch #5 before #1 *in your own tree* to
observe the crash if that helps you evaluate the patches. *But under
no circumstances should this be done in mainline!* Introducing a crash
upstream, even transiently, is strictly disallowed, and suggesting
otherwise is not appropriate behavior for a Linux kernel developer.

>
> I didn't suggest too make the patch 5th as the first one. But I believe that
> this patch should follow to the patch 5th.

As I explained, putting patch #5 before this one would deliberately
introduce a regression -- a crash. Triggering this in mainline is not
allowed by kernel development policy [1]; there is no exception for
"transient regressions that are fixed immediately afterward." A
regression is a regression.

>
> > > It will be
> > > good to see the reproduction of the issue and which symptoms we have for this
> > > issue. Do you have the reproduction script and call trace of the issue?
> >
> > Fair point!
> >
> > Function inlining makes the call trace not very interesting:
> > Call trace:
> >  ceph_writepages_start+0x16ec/0x18e0 [ceph] ()
> >  do_writepages+0xb0/0x1c0
> >  __writeback_single_inode+0x4c/0x4d8
> >  writeback_sb_inodes+0x238/0x4c8
> >  __writeback_inodes_wb+0x64/0x120
> >  wb_writeback+0x320/0x3e8
> >  wb_workfn+0x42c/0x518
> >  process_one_work+0x17c/0x428
> >  worker_thread+0x260/0x390
> >  kthread+0x148/0x240
> >  ret_from_fork+0x10/0x20
> > Code: 34ffdee0 52800020 3903e7e0 17fffef4 (d4210000)
> > ---[ end trace 0000000000000000 ]---
> > Kernel panic - not syncing: Oops - BUG: Fatal exception
> >
> > ceph_writepages_start+0x16ec corresponds to linux-6.18.2/fs/ceph/addr.c:1222
> >
> > However, these repro steps should work:
> > 1) Apply patch 5 from this series (and no other patches)
> > 2) Mount CephFS and activate fscrypt
> > 3) Copy a large directory into the CephFS mount
> > 4) After dozens of GBs transferred, you should observe the above kernel oops
>
> Could we have all of these details in the commit message?

Would this truly help future readers, or just create noise? The commit
message already explains the exact execution path to the
BUG_ON()/oops, which is what really matters; call traces are
secondary. I did not want to imply that readers cannot understand the
seriousness of the issue without a crash log. I will include these
details if the group consensus prefers it, but I am otherwise opposed.

Hope you and yours are well,
Sam

[1] See the "no regressions" rule:
https://docs.kernel.org/admin-guide/reporting-regressions.html

>
> Thanks,
> Slava.
>
> >
> > Warm regards,
> > Sam
> >
> >
> >
> >
> > >
> > > >
> > > > For now, just reset `rc` when redirtying the folio and prevent the
> > > > error from propagating. After this change, ceph_process_folio_batch() no
> > > > longer returns errors; its only remaining failure indicator is
> > > > `locked_pages == 0`, which the caller already handles correctly. The
> > > > next patch in this series addresses this.
> > > >
> > > > Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > > ---
> > > >  fs/ceph/addr.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > index 63b75d214210..3462df35d245 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > > >               rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> > > >                               folio);
> > > >               if (rc) {
> > > > +                     rc = 0;
> > >
> > > I like the fix but I would like to clarify the above questions at first.
> > >
> > > Thanks,
> > > Slava.
> > >
> > > >                       folio_redirty_for_writepage(wbc, folio);
> > > >                       folio_unlock(folio);
> > > >                       break;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch()
  2025-12-31  2:43 [PATCH 0/5] ceph: CephFS writeback correctness and performance fixes Sam Edwards
  2025-12-31  2:43 ` [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
@ 2025-12-31  2:43 ` Sam Edwards
  2026-01-05 20:36   ` Viacheslav Dubeyko
  2025-12-31  2:43 ` [PATCH 3/5] ceph: Free page array when ceph_submit_write fails Sam Edwards
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2025-12-31  2:43 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards

Following the previous patch, ceph_process_folio_batch() no longer
returns errors because the writeback loop cannot handle them.

Since this function already indicates failure to lock any pages by
leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
to handle abandonment of a locked batch, change the return type of
ceph_process_folio_batch() to `void` and remove the pathological goto in
the writeback loop. The lack of a return code emphasizes that
ceph_process_folio_batch() is designed to be abort-free: that is, once
it commits a folio for writeback, it will not later abandon it or
propagate an error for that folio.

Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3462df35d245..2b722916fb9b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1283,16 +1283,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
 }
 
 static
-int ceph_process_folio_batch(struct address_space *mapping,
-			     struct writeback_control *wbc,
-			     struct ceph_writeback_ctl *ceph_wbc)
+void ceph_process_folio_batch(struct address_space *mapping,
+			      struct writeback_control *wbc,
+			      struct ceph_writeback_ctl *ceph_wbc)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
 	struct ceph_client *cl = fsc->client;
 	struct folio *folio = NULL;
 	unsigned i;
-	int rc = 0;
+	int rc;
 
 	for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
 		folio = ceph_wbc->fbatch.folios[i];
@@ -1322,12 +1322,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = ceph_check_page_before_write(mapping, wbc,
 						  ceph_wbc, folio);
 		if (rc == -ENODATA) {
-			rc = 0;
 			folio_unlock(folio);
 			ceph_wbc->fbatch.folios[i] = NULL;
 			continue;
 		} else if (rc == -E2BIG) {
-			rc = 0;
 			folio_unlock(folio);
 			ceph_wbc->fbatch.folios[i] = NULL;
 			break;
@@ -1369,7 +1367,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
 				folio);
 		if (rc) {
-			rc = 0;
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;
@@ -1380,8 +1377,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
 	}
 
 	ceph_wbc->processed_in_fbatch = i;
-
-	return rc;
 }
 
 static inline
@@ -1685,10 +1680,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 			break;
 
 process_folio_batch:
-		rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
+		ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
 		ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
-		if (rc)
-			goto release_folios;
 
 		/* did we get anything? */
 		if (!ceph_wbc.locked_pages)
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re:  [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch()
  2025-12-31  2:43 ` [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch() Sam Edwards
@ 2026-01-05 20:36   ` Viacheslav Dubeyko
  2026-01-06  6:52     ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-05 20:36 UTC (permalink / raw)
  To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
  Cc: Milind Changire, ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> Following the previous patch, ceph_process_folio_batch() no longer
> returns errors because the writeback loop cannot handle them.

I am not completely convinced that we can remove returning error code here. What
if we don't process any folio in ceph_process_folio_batch(), then we cannot call
the ceph_submit_write(). It sounds to me that we need to have error code to jump
to release_folios in such case.

> 
> Since this function already indicates failure to lock any pages by
> leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way

Frankly speaking, I don't quite follow what do you mean by "this function
already indicates failure to lock any pages". What do you mean here?

> to handle abandonment of a locked batch, change the return type of
> ceph_process_folio_batch() to `void` and remove the pathological goto in
> the writeback loop. The lack of a return code emphasizes that
> ceph_process_folio_batch() is designed to be abort-free: that is, once
> it commits a folio for writeback, it will not later abandon it or
> propagate an error for that folio.

I think you need to explain your point more clear. Currently, I am not convinced
that this modification makes sense.

Thanks,
Slava.

> 
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
>  fs/ceph/addr.c | 17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 3462df35d245..2b722916fb9b 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1283,16 +1283,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
>  }
>  
>  static
> -int ceph_process_folio_batch(struct address_space *mapping,
> -			     struct writeback_control *wbc,
> -			     struct ceph_writeback_ctl *ceph_wbc)
> +void ceph_process_folio_batch(struct address_space *mapping,
> +			      struct writeback_control *wbc,
> +			      struct ceph_writeback_ctl *ceph_wbc)
>  {
>  	struct inode *inode = mapping->host;
>  	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
>  	struct ceph_client *cl = fsc->client;
>  	struct folio *folio = NULL;
>  	unsigned i;
> -	int rc = 0;
> +	int rc;
>  
>  	for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
>  		folio = ceph_wbc->fbatch.folios[i];
> @@ -1322,12 +1322,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
>  		rc = ceph_check_page_before_write(mapping, wbc,
>  						  ceph_wbc, folio);
>  		if (rc == -ENODATA) {
> -			rc = 0;
>  			folio_unlock(folio);
>  			ceph_wbc->fbatch.folios[i] = NULL;
>  			continue;
>  		} else if (rc == -E2BIG) {
> -			rc = 0;
>  			folio_unlock(folio);
>  			ceph_wbc->fbatch.folios[i] = NULL;
>  			break;
> @@ -1369,7 +1367,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
>  		rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
>  				folio);
>  		if (rc) {
> -			rc = 0;
>  			folio_redirty_for_writepage(wbc, folio);
>  			folio_unlock(folio);
>  			break;
> @@ -1380,8 +1377,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
>  	}
>  
>  	ceph_wbc->processed_in_fbatch = i;
> -
> -	return rc;
>  }
>  
>  static inline
> @@ -1685,10 +1680,8 @@ static int ceph_writepages_start(struct address_space *mapping,
>  			break;
>  
>  process_folio_batch:
> -		rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> +		ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
>  		ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
> -		if (rc)
> -			goto release_folios;
>  
>  		/* did we get anything? */
>  		if (!ceph_wbc.locked_pages)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch()
  2026-01-05 20:36   ` Viacheslav Dubeyko
@ 2026-01-06  6:52     ` Sam Edwards
  2026-01-06 22:47       ` Viacheslav Dubeyko
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2026-01-06  6:52 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Xiubo Li, idryomov@gmail.com, Milind Changire,
	ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Mon, Jan 5, 2026 at 12:36 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > Following the previous patch, ceph_process_folio_batch() no longer
> > returns errors because the writeback loop cannot handle them.
>

Hi Slava,

> I am not completely convinced that we can remove returning error code here. What
> if we don't process any folio in ceph_process_folio_batch(), then we cannot call
> the ceph_submit_write(). It sounds to me that we need to have error code to jump
> to release_folios in such case.

This goto is already present (search the comment "did we get anything?").

>
> >
> > Since this function already indicates failure to lock any pages by
> > leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
>
> Frankly speaking, I don't quite follow what do you mean by "this function
> already indicates failure to lock any pages". What do you mean here?

I feel like there's a language barrier here. I understand from your
homepage that you speak Russian. I believe the Russian translation of
what I'm trying to say is:

Эта функция уже сигнализирует о том, что ни одна страница не была
заблокирована, поскольку ceph_wbc.locked_pages остаётся равным 0.

It's likely that I didn't phrase the English version clearly enough.
Do you have a clearer phrasing I could use? This is the central point
of this patch, so it's crucial that it's well-understood.

>
> > to handle abandonment of a locked batch, change the return type of
> > ceph_process_folio_batch() to `void` and remove the pathological goto in
> > the writeback loop. The lack of a return code emphasizes that
> > ceph_process_folio_batch() is designed to be abort-free: that is, once
> > it commits a folio for writeback, it will not later abandon it or
> > propagate an error for that folio.
>
> I think you need to explain your point more clear. Currently, I am not convinced
> that this modification makes sense.

ACK; a good commit message needs to be clear to everyone. I'm not sure
where I went wrong in my wording, but I'll try to restate my thought
process; so maybe you can tell me where I lose you:
1) At this point in the series (after patch 1 is applied), there is no
remaining possible way for ceph_process_folio_batch() to return a
nonzero value. All possible codepaths result in rc=0.
2) Therefore, the `if` statement that checks the
ceph_process_folio_batch() return code is dead code.
3) Trying to `goto release_folios` when the page array is active
constitutes a bug. So the `if (!ceph_wbc.locked_pages) goto
release_folios;` condition is correct, but the `if (rc) goto
release_folios;` is incorrect. (It's dead code anyway, see #2 above.)
4) Therefore, delete the `if (rc) goto release_folios;` dead code and
rely solely on `if (!ceph_wbc.locked_pages) goto release_folios;`
5) Since we aren't using the return code of ceph_process_folio_batch()
-- a static function with no other callers -- it should not return a
status in the first place.
6) By design ceph_process_folio_batch() is a "best-effort" function:
it tries to lock as many pages as it *can* (and that might be 0!) and
returns once it can't proceed. It is *not* allowed to abort: that is,
it cannot commit some pages for writeback, then change its mind and
prevent writeback of the whole batch.
7) Removing the return code from ceph_process_folio_batch() does not
prevent ceph_writepages_start() from knowing if a failure happened on
the first folio. ceph_writepages_start() already checks whether
ceph_wbc.locked_pages == 0.
8) Removing the return code from ceph_process_folio_batch() *does*
prevent ceph_writepages_start() from knowing *what* went wrong when
the first folio failed, but ceph_writepages_start() wasn't doing
anything with that information anyway. It only cared about the error
status as a boolean.
9) Most importantly: This patch does NOT constitute a behavioral
change. It is removing unreachable (and, when reached, buggy)
codepaths.

Warm regards,
Sam



>
> Thanks,
> Slava.
>
> >
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> >  fs/ceph/addr.c | 17 +++++------------
> >  1 file changed, 5 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 3462df35d245..2b722916fb9b 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1283,16 +1283,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
> >  }
> >
> >  static
> > -int ceph_process_folio_batch(struct address_space *mapping,
> > -                          struct writeback_control *wbc,
> > -                          struct ceph_writeback_ctl *ceph_wbc)
> > +void ceph_process_folio_batch(struct address_space *mapping,
> > +                           struct writeback_control *wbc,
> > +                           struct ceph_writeback_ctl *ceph_wbc)
> >  {
> >       struct inode *inode = mapping->host;
> >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> >       struct ceph_client *cl = fsc->client;
> >       struct folio *folio = NULL;
> >       unsigned i;
> > -     int rc = 0;
> > +     int rc;
> >
> >       for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
> >               folio = ceph_wbc->fbatch.folios[i];
> > @@ -1322,12 +1322,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
> >               rc = ceph_check_page_before_write(mapping, wbc,
> >                                                 ceph_wbc, folio);
> >               if (rc == -ENODATA) {
> > -                     rc = 0;
> >                       folio_unlock(folio);
> >                       ceph_wbc->fbatch.folios[i] = NULL;
> >                       continue;
> >               } else if (rc == -E2BIG) {
> > -                     rc = 0;
> >                       folio_unlock(folio);
> >                       ceph_wbc->fbatch.folios[i] = NULL;
> >                       break;
> > @@ -1369,7 +1367,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> >               rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> >                               folio);
> >               if (rc) {
> > -                     rc = 0;
> >                       folio_redirty_for_writepage(wbc, folio);
> >                       folio_unlock(folio);
> >                       break;
> > @@ -1380,8 +1377,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> >       }
> >
> >       ceph_wbc->processed_in_fbatch = i;
> > -
> > -     return rc;
> >  }
> >
> >  static inline
> > @@ -1685,10 +1680,8 @@ static int ceph_writepages_start(struct address_space *mapping,
> >                       break;
> >
> >  process_folio_batch:
> > -             rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > +             ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> >               ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
> > -             if (rc)
> > -                     goto release_folios;
> >
> >               /* did we get anything? */
> >               if (!ceph_wbc.locked_pages)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch()
  2026-01-06  6:52     ` Sam Edwards
@ 2026-01-06 22:47       ` Viacheslav Dubeyko
  2026-01-07  0:15         ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-06 22:47 UTC (permalink / raw)
  To: cfsworks@gmail.com
  Cc: Milind Changire, Xiubo Li, idryomov@gmail.com, brauner@kernel.org,
	ceph-devel@vger.kernel.org, jlayton@kernel.org,
	linux-kernel@vger.kernel.org

On Mon, 2026-01-05 at 22:52 -0800, Sam Edwards wrote:
> On Mon, Jan 5, 2026 at 12:36 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> > 
> > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > Following the previous patch, ceph_process_folio_batch() no longer
> > > returns errors because the writeback loop cannot handle them.
> > 
> 
> Hi Slava,
> 
> > I am not completely convinced that we can remove returning error code here. What
> > if we don't process any folio in ceph_process_folio_batch(), then we cannot call
> > the ceph_submit_write(). It sounds to me that we need to have error code to jump
> > to release_folios in such case.
> 
> This goto is already present (search the comment "did we get anything?").
> 
> > 
> > > 
> > > Since this function already indicates failure to lock any pages by
> > > leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
> > 
> > Frankly speaking, I don't quite follow what do you mean by "this function
> > already indicates failure to lock any pages". What do you mean here?
> 
> I feel like there's a language barrier here. I understand from your
> homepage that you speak Russian. I believe the Russian translation of
> what I'm trying to say is:
> 
> Эта функция уже сигнализирует о том, что ни одна страница не была
> заблокирована, поскольку ceph_wbc.locked_pages остаётся равным 0.

It haven't made your statement more clear. :)

As far as I can see, this statement has no connection to the patch 2. It is more
relevant to the patch 3.

> 
> It's likely that I didn't phrase the English version clearly enough.
> Do you have a clearer phrasing I could use? This is the central point
> of this patch, so it's crucial that it's well-understood.
> 
> > 
> > > to handle abandonment of a locked batch, change the return type of
> > > ceph_process_folio_batch() to `void` and remove the pathological goto in
> > > the writeback loop. The lack of a return code emphasizes that
> > > ceph_process_folio_batch() is designed to be abort-free: that is, once
> > > it commits a folio for writeback, it will not later abandon it or
> > > propagate an error for that folio.
> > 
> > I think you need to explain your point more clear. Currently, I am not convinced
> > that this modification makes sense.
> 
> ACK; a good commit message needs to be clear to everyone. I'm not sure
> where I went wrong in my wording, but I'll try to restate my thought
> process; so maybe you can tell me where I lose you:
> 1) At this point in the series (after patch 1 is applied), there is no
> remaining possible way for ceph_process_folio_batch() to return a
> nonzero value. All possible codepaths result in rc=0.

I am still not convinced that patch 1 is correct. So, we should expect to
receive error code from move_dirty_folio_in_page_array(), especially for the
case if no one folio has been processed. And if no one folio has been processed,
then we need to return error code.

The ceph_process_folio_batch() is complex function and we need to have the way
to return the error code from internal function's logic to the caller's logic.
We cannot afford not to have the return error code from this function. Because
we need to be ready to release unprocessed folios if something was wrong in the
function's logic.

Thanks,
Slava.

> 2) Therefore, the `if` statement that checks the
> ceph_process_folio_batch() return code is dead code.
> 3) Trying to `goto release_folios` when the page array is active
> constitutes a bug. So the `if (!ceph_wbc.locked_pages) goto
> release_folios;` condition is correct, but the `if (rc) goto
> release_folios;` is incorrect. (It's dead code anyway, see #2 above.)
> 4) Therefore, delete the `if (rc) goto release_folios;` dead code and
> rely solely on `if (!ceph_wbc.locked_pages) goto release_folios;`
> 5) Since we aren't using the return code of ceph_process_folio_batch()
> -- a static function with no other callers -- it should not return a
> status in the first place.
> 6) By design ceph_process_folio_batch() is a "best-effort" function:
> it tries to lock as many pages as it *can* (and that might be 0!) and
> returns once it can't proceed. It is *not* allowed to abort: that is,
> it cannot commit some pages for writeback, then change its mind and
> prevent writeback of the whole batch.
> 7) Removing the return code from ceph_process_folio_batch() does not
> prevent ceph_writepages_start() from knowing if a failure happened on
> the first folio. ceph_writepages_start() already checks whether
> ceph_wbc.locked_pages == 0.
> 8) Removing the return code from ceph_process_folio_batch() *does*
> prevent ceph_writepages_start() from knowing *what* went wrong when
> the first folio failed, but ceph_writepages_start() wasn't doing
> anything with that information anyway. It only cared about the error
> status as a boolean.
> 9) Most importantly: This patch does NOT constitute a behavioral
> change. It is removing unreachable (and, when reached, buggy)
> codepaths.
> 
> Warm regards,
> Sam
> 
> 
> 
> > 
> > Thanks,
> > Slava.
> > 
> > > 
> > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > ---
> > >  fs/ceph/addr.c | 17 +++++------------
> > >  1 file changed, 5 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 3462df35d245..2b722916fb9b 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1283,16 +1283,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
> > >  }
> > > 
> > >  static
> > > -int ceph_process_folio_batch(struct address_space *mapping,
> > > -                          struct writeback_control *wbc,
> > > -                          struct ceph_writeback_ctl *ceph_wbc)
> > > +void ceph_process_folio_batch(struct address_space *mapping,
> > > +                           struct writeback_control *wbc,
> > > +                           struct ceph_writeback_ctl *ceph_wbc)
> > >  {
> > >       struct inode *inode = mapping->host;
> > >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > >       struct ceph_client *cl = fsc->client;
> > >       struct folio *folio = NULL;
> > >       unsigned i;
> > > -     int rc = 0;
> > > +     int rc;
> > > 
> > >       for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
> > >               folio = ceph_wbc->fbatch.folios[i];
> > > @@ -1322,12 +1322,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > >               rc = ceph_check_page_before_write(mapping, wbc,
> > >                                                 ceph_wbc, folio);
> > >               if (rc == -ENODATA) {
> > > -                     rc = 0;
> > >                       folio_unlock(folio);
> > >                       ceph_wbc->fbatch.folios[i] = NULL;
> > >                       continue;
> > >               } else if (rc == -E2BIG) {
> > > -                     rc = 0;
> > >                       folio_unlock(folio);
> > >                       ceph_wbc->fbatch.folios[i] = NULL;
> > >                       break;
> > > @@ -1369,7 +1367,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > >               rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> > >                               folio);
> > >               if (rc) {
> > > -                     rc = 0;
> > >                       folio_redirty_for_writepage(wbc, folio);
> > >                       folio_unlock(folio);
> > >                       break;
> > > @@ -1380,8 +1377,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > >       }
> > > 
> > >       ceph_wbc->processed_in_fbatch = i;
> > > -
> > > -     return rc;
> > >  }
> > > 
> > >  static inline
> > > @@ -1685,10 +1680,8 @@ static int ceph_writepages_start(struct address_space *mapping,
> > >                       break;
> > > 
> > >  process_folio_batch:
> > > -             rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > > +             ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > >               ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
> > > -             if (rc)
> > > -                     goto release_folios;
> > > 
> > >               /* did we get anything? */
> > >               if (!ceph_wbc.locked_pages)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch()
  2026-01-06 22:47       ` Viacheslav Dubeyko
@ 2026-01-07  0:15         ` Sam Edwards
  0 siblings, 0 replies; 24+ messages in thread
From: Sam Edwards @ 2026-01-07  0:15 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Milind Changire, Xiubo Li, idryomov@gmail.com, brauner@kernel.org,
	ceph-devel@vger.kernel.org, jlayton@kernel.org,
	linux-kernel@vger.kernel.org

On Tue, Jan 6, 2026 at 2:47 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Mon, 2026-01-05 at 22:52 -0800, Sam Edwards wrote:
> > On Mon, Jan 5, 2026 at 12:36 PM Viacheslav Dubeyko
> > <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > > Following the previous patch, ceph_process_folio_batch() no longer
> > > > returns errors because the writeback loop cannot handle them.
> > >
> >
> > Hi Slava,
> >
> > > I am not completely convinced that we can remove returning error code here. What
> > > if we don't process any folio in ceph_process_folio_batch(), then we cannot call
> > > the ceph_submit_write(). It sounds to me that we need to have error code to jump
> > > to release_folios in such case.
> >
> > This goto is already present (search the comment "did we get anything?").
> >
> > >
> > > >
> > > > Since this function already indicates failure to lock any pages by
> > > > leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
> > >
> > > Frankly speaking, I don't quite follow what do you mean by "this function
> > > already indicates failure to lock any pages". What do you mean here?
> >
> > I feel like there's a language barrier here. I understand from your
> > homepage that you speak Russian. I believe the Russian translation of
> > what I'm trying to say is:
> >
> > Эта функция уже сигнализирует о том, что ни одна страница не была
> > заблокирована, поскольку ceph_wbc.locked_pages остаётся равным 0.
>
> It haven't made your statement more clear. :)
>
> As far as I can see, this statement has no connection to the patch 2. It is more
> relevant to the patch 3.
>
> >
> > It's likely that I didn't phrase the English version clearly enough.
> > Do you have a clearer phrasing I could use? This is the central point
> > of this patch, so it's crucial that it's well-understood.
> >
> > >
> > > > to handle abandonment of a locked batch, change the return type of
> > > > ceph_process_folio_batch() to `void` and remove the pathological goto in
> > > > the writeback loop. The lack of a return code emphasizes that
> > > > ceph_process_folio_batch() is designed to be abort-free: that is, once
> > > > it commits a folio for writeback, it will not later abandon it or
> > > > propagate an error for that folio.
> > >
> > > I think you need to explain your point more clear. Currently, I am not convinced
> > > that this modification makes sense.
> >
> > ACK; a good commit message needs to be clear to everyone. I'm not sure
> > where I went wrong in my wording, but I'll try to restate my thought
> > process; so maybe you can tell me where I lose you:
> > 1) At this point in the series (after patch 1 is applied), there is no
> > remaining possible way for ceph_process_folio_batch() to return a
> > nonzero value. All possible codepaths result in rc=0.
>
> I am still not convinced that patch 1 is correct.

Then we should resolve patch 1 first before continuing with this
discussion. This patch is predicated on the correctness of patch 1, so
until that premise is agreed upon, any review of this patch is
necessarily blocked on that outcome.

If you have specific technical objections to patch 1, let’s address
those directly in that thread. Once we reach a consensus there, we can
continue the discussion of this patch on solid ground.


> So, we should expect to
> receive error code from move_dirty_folio_in_page_array(), especially for the
> case if no one folio has been processed. And if no one folio has been processed,
> then we need to return error code.
>
> The ceph_process_folio_batch() is complex function and we need to have the way
> to return the error code from internal function's logic to the caller's logic.
> We cannot afford not to have the return error code from this function. Because
> we need to be ready to release unprocessed folios if something was wrong in the
> function's logic.
>
> Thanks,
> Slava.
>
> > 2) Therefore, the `if` statement that checks the
> > ceph_process_folio_batch() return code is dead code.
> > 3) Trying to `goto release_folios` when the page array is active
> > constitutes a bug. So the `if (!ceph_wbc.locked_pages) goto
> > release_folios;` condition is correct, but the `if (rc) goto
> > release_folios;` is incorrect. (It's dead code anyway, see #2 above.)
> > 4) Therefore, delete the `if (rc) goto release_folios;` dead code and
> > rely solely on `if (!ceph_wbc.locked_pages) goto release_folios;`
> > 5) Since we aren't using the return code of ceph_process_folio_batch()
> > -- a static function with no other callers -- it should not return a
> > status in the first place.
> > 6) By design ceph_process_folio_batch() is a "best-effort" function:
> > it tries to lock as many pages as it *can* (and that might be 0!) and
> > returns once it can't proceed. It is *not* allowed to abort: that is,
> > it cannot commit some pages for writeback, then change its mind and
> > prevent writeback of the whole batch.
> > 7) Removing the return code from ceph_process_folio_batch() does not
> > prevent ceph_writepages_start() from knowing if a failure happened on
> > the first folio. ceph_writepages_start() already checks whether
> > ceph_wbc.locked_pages == 0.
> > 8) Removing the return code from ceph_process_folio_batch() *does*
> > prevent ceph_writepages_start() from knowing *what* went wrong when
> > the first folio failed, but ceph_writepages_start() wasn't doing
> > anything with that information anyway. It only cared about the error
> > status as a boolean.
> > 9) Most importantly: This patch does NOT constitute a behavioral
> > change. It is removing unreachable (and, when reached, buggy)
> > codepaths.
> >
> > Warm regards,
> > Sam
> >
> >
> >
> > >
> > > Thanks,
> > > Slava.
> > >
> > > >
> > > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > > ---
> > > >  fs/ceph/addr.c | 17 +++++------------
> > > >  1 file changed, 5 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > index 3462df35d245..2b722916fb9b 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1283,16 +1283,16 @@ static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
> > > >  }
> > > >
> > > >  static
> > > > -int ceph_process_folio_batch(struct address_space *mapping,
> > > > -                          struct writeback_control *wbc,
> > > > -                          struct ceph_writeback_ctl *ceph_wbc)
> > > > +void ceph_process_folio_batch(struct address_space *mapping,
> > > > +                           struct writeback_control *wbc,
> > > > +                           struct ceph_writeback_ctl *ceph_wbc)
> > > >  {
> > > >       struct inode *inode = mapping->host;
> > > >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > > >       struct ceph_client *cl = fsc->client;
> > > >       struct folio *folio = NULL;
> > > >       unsigned i;
> > > > -     int rc = 0;
> > > > +     int rc;
> > > >
> > > >       for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
> > > >               folio = ceph_wbc->fbatch.folios[i];
> > > > @@ -1322,12 +1322,10 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > > >               rc = ceph_check_page_before_write(mapping, wbc,
> > > >                                                 ceph_wbc, folio);
> > > >               if (rc == -ENODATA) {
> > > > -                     rc = 0;
> > > >                       folio_unlock(folio);
> > > >                       ceph_wbc->fbatch.folios[i] = NULL;
> > > >                       continue;
> > > >               } else if (rc == -E2BIG) {
> > > > -                     rc = 0;
> > > >                       folio_unlock(folio);
> > > >                       ceph_wbc->fbatch.folios[i] = NULL;
> > > >                       break;
> > > > @@ -1369,7 +1367,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > > >               rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> > > >                               folio);
> > > >               if (rc) {
> > > > -                     rc = 0;
> > > >                       folio_redirty_for_writepage(wbc, folio);
> > > >                       folio_unlock(folio);
> > > >                       break;
> > > > @@ -1380,8 +1377,6 @@ int ceph_process_folio_batch(struct address_space *mapping,
> > > >       }
> > > >
> > > >       ceph_wbc->processed_in_fbatch = i;
> > > > -
> > > > -     return rc;
> > > >  }
> > > >
> > > >  static inline
> > > > @@ -1685,10 +1680,8 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > >                       break;
> > > >
> > > >  process_folio_batch:
> > > > -             rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > > > +             ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > > >               ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
> > > > -             if (rc)
> > > > -                     goto release_folios;
> > > >
> > > >               /* did we get anything? */
> > > >               if (!ceph_wbc.locked_pages)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/5] ceph: Free page array when ceph_submit_write fails
  2025-12-31  2:43 [PATCH 0/5] ceph: CephFS writeback correctness and performance fixes Sam Edwards
  2025-12-31  2:43 ` [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
  2025-12-31  2:43 ` [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch() Sam Edwards
@ 2025-12-31  2:43 ` Sam Edwards
  2026-01-05 21:09   ` Viacheslav Dubeyko
  2025-12-31  2:43 ` [PATCH 4/5] ceph: Assert writeback loop invariants Sam Edwards
  2025-12-31  2:43 ` [PATCH 5/5] ceph: Fix write storm on fscrypted files Sam Edwards
  4 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2025-12-31  2:43 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable

If `locked_pages` is zero, the page array must not be allocated:
ceph_process_folio_batch() uses `locked_pages` to decide when to
allocate `pages`, and redundant allocations trigger
ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
writeback stall) or even a kernel panic. Consequently, the main loop in
ceph_writepages_start() assumes that the lifetime of `pages` is confined
to a single iteration.

The ceph_submit_write() function claims ownership of the page array on
success. But failures only redirty/unlock the pages and fail to free the
array, making the failure case in ceph_submit_write() fatal.

Free the page array in ceph_submit_write()'s error-handling 'if' block
so that the caller's invariant (that the array does not outlive the
iteration) is maintained unconditionally, allowing failures in
ceph_submit_write() to be recoverable as originally intended.

Fixes: 1551ec61dc55 ("ceph: introduce ceph_submit_write() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2b722916fb9b..91cc43950162 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1466,6 +1466,13 @@ int ceph_submit_write(struct address_space *mapping,
 			unlock_page(page);
 		}
 
+		if (ceph_wbc->from_pool) {
+			mempool_free(ceph_wbc->pages, ceph_wb_pagevec_pool);
+			ceph_wbc->from_pool = false;
+		} else
+			kfree(ceph_wbc->pages);
+		ceph_wbc->pages = NULL;
+
 		ceph_osdc_put_request(req);
 		return -EIO;
 	}
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re:  [PATCH 3/5] ceph: Free page array when ceph_submit_write fails
  2025-12-31  2:43 ` [PATCH 3/5] ceph: Free page array when ceph_submit_write fails Sam Edwards
@ 2026-01-05 21:09   ` Viacheslav Dubeyko
  2026-01-06  6:52     ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-05 21:09 UTC (permalink / raw)
  To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
  Cc: Milind Changire, stable@vger.kernel.org,
	ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> If `locked_pages` is zero, the page array must not be allocated:
> ceph_process_folio_batch() uses `locked_pages` to decide when to
> allocate `pages`, 
> 

I don't quite follow how this statement is relevant to the issue. If
`locked_pages` is zero, then ceph_submit_write() will not to be called. Do I
miss something here?

> and redundant allocations trigger
> ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> writeback stall) or even a kernel panic. Consequently, the main loop in
> ceph_writepages_start() assumes that the lifetime of `pages` is confined
> to a single iteration.

It will be great to see the reproducer script or application and call trace of
the issue. Could you please share the reproduction path and the call trace of
the issue?

> 
> The ceph_submit_write() function claims ownership of the page array on
> success. 
> 

As far as I can see, writepages_finish() should free the page array on success.

> But failures only redirty/unlock the pages and fail to free the
> array, making the failure case in ceph_submit_write() fatal.
> 
> Free the page array in ceph_submit_write()'s error-handling 'if' block
> so that the caller's invariant (that the array does not outlive the
> iteration) is maintained unconditionally, allowing failures in
> ceph_submit_write() to be recoverable as originally intended.
> 
> Fixes: 1551ec61dc55 ("ceph: introduce ceph_submit_write() method")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
>  fs/ceph/addr.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 2b722916fb9b..91cc43950162 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1466,6 +1466,13 @@ int ceph_submit_write(struct address_space *mapping,
>  			unlock_page(page);
>  		}
>  
> +		if (ceph_wbc->from_pool) {
> +			mempool_free(ceph_wbc->pages, ceph_wb_pagevec_pool);
> +			ceph_wbc->from_pool = false;
> +		} else
> +			kfree(ceph_wbc->pages);
> +		ceph_wbc->pages = NULL;

Probably, it makes sense to introduce a method ceph_free_page_array likewise to
__ceph_allocate_page_array() and to use for freeing page array in all places.

Could ceph_wbc->locked_pages be greater than zero but ceph_wbc->pages == NULL?

Thanks,
Slava.

> +
>  		ceph_osdc_put_request(req);
>  		return -EIO;
>  	}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] ceph: Free page array when ceph_submit_write fails
  2026-01-05 21:09   ` Viacheslav Dubeyko
@ 2026-01-06  6:52     ` Sam Edwards
  0 siblings, 0 replies; 24+ messages in thread
From: Sam Edwards @ 2026-01-06  6:52 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Xiubo Li, idryomov@gmail.com, Milind Changire,
	stable@vger.kernel.org, ceph-devel@vger.kernel.org,
	brauner@kernel.org, jlayton@kernel.org,
	linux-kernel@vger.kernel.org

On Mon, Jan 5, 2026 at 1:09 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > If `locked_pages` is zero, the page array must not be allocated:
> > ceph_process_folio_batch() uses `locked_pages` to decide when to
> > allocate `pages`,
> >
>

Hi Slava,

> I don't quite follow how this statement is relevant to the issue. If
> `locked_pages` is zero, then ceph_submit_write() will not to be called. Do I
> miss something here?

That statement is only informing that ceph_process_folio_batch() will
BUG() when locked_pages == 0 && pages != NULL. It establishes why
`pages` must be freed/NULLed before the next iteration of
ceph_writepages_start()'s loop (which sets locked_pages = 0).

>
> > and redundant allocations trigger
> > ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> > writeback stall) or even a kernel panic. Consequently, the main loop in
> > ceph_writepages_start() assumes that the lifetime of `pages` is confined
> > to a single iteration.
>
> It will be great to see the reproducer script or application and call trace of
> the issue. Could you please share the reproduction path and the call trace of
> the issue?

It's difficult to reproduce organically. It arises when
`!ceph_inc_osd_stopping_blocker(fsc->mdsc)`, which I understand can
only happen in a race. I used the fault injection framework to force
ceph_inc_osd_stopping_blocker() to fail.

The call trace is disinteresting. See my reply to your comments on
patch 1: it's the same trace.

>
> >
> > The ceph_submit_write() function claims ownership of the page array on
> > success.
> >
>
> As far as I can see, writepages_finish() should free the page array on success.

That's my understanding too; by "claims ownership of the page array" I
only mean that ceph_writepages_start() isn't responsible for cleaning
it up, once it calls ceph_submit_write().

>
> > But failures only redirty/unlock the pages and fail to free the
> > array, making the failure case in ceph_submit_write() fatal.
> >
> > Free the page array in ceph_submit_write()'s error-handling 'if' block
> > so that the caller's invariant (that the array does not outlive the
> > iteration) is maintained unconditionally, allowing failures in
> > ceph_submit_write() to be recoverable as originally intended.
> >
> > Fixes: 1551ec61dc55 ("ceph: introduce ceph_submit_write() method")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> >  fs/ceph/addr.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 2b722916fb9b..91cc43950162 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1466,6 +1466,13 @@ int ceph_submit_write(struct address_space *mapping,
> >                       unlock_page(page);
> >               }
> >
> > +             if (ceph_wbc->from_pool) {
> > +                     mempool_free(ceph_wbc->pages, ceph_wb_pagevec_pool);
> > +                     ceph_wbc->from_pool = false;
> > +             } else
> > +                     kfree(ceph_wbc->pages);
> > +             ceph_wbc->pages = NULL;
>
> Probably, it makes sense to introduce a method ceph_free_page_array likewise to
> __ceph_allocate_page_array() and to use for freeing page array in all places.

I like the suggestion but not the name. Instead of
ceph_free_page_array(), it should probably be called
ceph_discard_page_array(), because it is also redirtying the pages and
must not be used after successful writeback. (To me, "free" implies
success while "discard" implies failure.)

> Could ceph_wbc->locked_pages be greater than zero but ceph_wbc->pages == NULL?

ceph_wbc->locked_pages is the current array index into
ceph_wbc->pages, so they both need to be reset sometime before the
next iteration of ceph_writepages_start()'s loop.

Warm regards,
Sam



>
> Thanks,
> Slava.
>
> > +
> >               ceph_osdc_put_request(req);
> >               return -EIO;
> >       }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/5] ceph: Assert writeback loop invariants
  2025-12-31  2:43 [PATCH 0/5] ceph: CephFS writeback correctness and performance fixes Sam Edwards
                   ` (2 preceding siblings ...)
  2025-12-31  2:43 ` [PATCH 3/5] ceph: Free page array when ceph_submit_write fails Sam Edwards
@ 2025-12-31  2:43 ` Sam Edwards
  2026-01-05 22:28   ` Viacheslav Dubeyko
  2025-12-31  2:43 ` [PATCH 5/5] ceph: Fix write storm on fscrypted files Sam Edwards
  4 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2025-12-31  2:43 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards

If `locked_pages` is zero, the page array must not be allocated:
ceph_process_folio_batch() uses `locked_pages` to decide when to
allocate `pages`, and redundant allocations trigger
ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
writeback stall) or even a kernel panic. Consequently, the main loop in
ceph_writepages_start() assumes that the lifetime of `pages` is confined
to a single iteration.

This expectation is currently not clear enough, as evidenced by the
previous two patches which fix oopses caused by `pages` persisting into
the next loop iteration.

Use an explicit BUG_ON() at the top of the loop to assert the loop's
preexisting expectation that `pages` is cleaned up by the previous
iteration. Because this is closely tied to `locked_pages`, also make it
the previous iteration's responsibility to guarantee its reset, and
verify with a second new BUG_ON() instead of handling (and masking)
failures to do so.

Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 91cc43950162..b3569d44d510 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1669,7 +1669,9 @@ static int ceph_writepages_start(struct address_space *mapping,
 		tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);

 	while (!has_writeback_done(&ceph_wbc)) {
-		ceph_wbc.locked_pages = 0;
+		BUG_ON(ceph_wbc.locked_pages);
+		BUG_ON(ceph_wbc.pages);
+
 		ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;

 get_more_pages:
@@ -1703,11 +1705,10 @@ static int ceph_writepages_start(struct address_space *mapping,
 		}

 		rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
-		if (rc)
-			goto release_folios;
-
 		ceph_wbc.locked_pages = 0;
 		ceph_wbc.strip_unit_end = 0;
+		if (rc)
+			goto release_folios;

 		if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
 			ceph_wbc.nr_folios =
-- 
2.51.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re:  [PATCH 4/5] ceph: Assert writeback loop invariants
  2025-12-31  2:43 ` [PATCH 4/5] ceph: Assert writeback loop invariants Sam Edwards
@ 2026-01-05 22:28   ` Viacheslav Dubeyko
  2026-01-06  6:53     ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-05 22:28 UTC (permalink / raw)
  To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
  Cc: Milind Changire, ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> If `locked_pages` is zero, the page array must not be allocated:
> ceph_process_folio_batch() uses `locked_pages` to decide when to
> allocate `pages`, and redundant allocations trigger
> ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> writeback stall) or even a kernel panic. Consequently, the main loop in
> ceph_writepages_start() assumes that the lifetime of `pages` is confined
> to a single iteration.
> 
> This expectation is currently not clear enough, as evidenced by the
> previous two patches which fix oopses caused by `pages` persisting into
> the next loop iteration.
> 
> Use an explicit BUG_ON() at the top of the loop to assert the loop's
> preexisting expectation that `pages` is cleaned up by the previous
> iteration. Because this is closely tied to `locked_pages`, also make it
> the previous iteration's responsibility to guarantee its reset, and
> verify with a second new BUG_ON() instead of handling (and masking)
> failures to do so.
> 
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
>  fs/ceph/addr.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 91cc43950162..b3569d44d510 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1669,7 +1669,9 @@ static int ceph_writepages_start(struct address_space *mapping,
>  		tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
>  
>  	while (!has_writeback_done(&ceph_wbc)) {
> -		ceph_wbc.locked_pages = 0;
> +		BUG_ON(ceph_wbc.locked_pages);
> +		BUG_ON(ceph_wbc.pages);
> +

It's not good idea to introduce BUG_ON() in write pages logic. I am definitely
against these two BUG_ON() here.

>  		ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
>  
>  get_more_pages:
> @@ -1703,11 +1705,10 @@ static int ceph_writepages_start(struct address_space *mapping,
>  		}
>  
>  		rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
> -		if (rc)
> -			goto release_folios;
> -

Frankly speaking, its' completely not clear the from commit message why we move
this check. What's the problem is here? How this move can fix the issue?

Thanks,
Slava.

>  		ceph_wbc.locked_pages = 0;
>  		ceph_wbc.strip_unit_end = 0;
> +		if (rc)
> +			goto release_folios;
>  
>  		if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
>  			ceph_wbc.nr_folios =

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] ceph: Assert writeback loop invariants
  2026-01-05 22:28   ` Viacheslav Dubeyko
@ 2026-01-06  6:53     ` Sam Edwards
  2026-01-06 23:00       ` Viacheslav Dubeyko
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2026-01-06  6:53 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Xiubo Li, idryomov@gmail.com, Milind Changire,
	ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Mon, Jan 5, 2026 at 2:29 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > If `locked_pages` is zero, the page array must not be allocated:
> > ceph_process_folio_batch() uses `locked_pages` to decide when to
> > allocate `pages`, and redundant allocations trigger
> > ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> > writeback stall) or even a kernel panic. Consequently, the main loop in
> > ceph_writepages_start() assumes that the lifetime of `pages` is confined
> > to a single iteration.
> >
> > This expectation is currently not clear enough, as evidenced by the
> > previous two patches which fix oopses caused by `pages` persisting into
> > the next loop iteration.
> >
> > Use an explicit BUG_ON() at the top of the loop to assert the loop's
> > preexisting expectation that `pages` is cleaned up by the previous
> > iteration. Because this is closely tied to `locked_pages`, also make it
> > the previous iteration's responsibility to guarantee its reset, and
> > verify with a second new BUG_ON() instead of handling (and masking)
> > failures to do so.
> >
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> >  fs/ceph/addr.c | 9 +++++----
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 91cc43950162..b3569d44d510 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1669,7 +1669,9 @@ static int ceph_writepages_start(struct address_space *mapping,
> >               tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
> >
> >       while (!has_writeback_done(&ceph_wbc)) {
> > -             ceph_wbc.locked_pages = 0;
> > +             BUG_ON(ceph_wbc.locked_pages);
> > +             BUG_ON(ceph_wbc.pages);
> > +
>

Hi Slava,

> It's not good idea to introduce BUG_ON() in write pages logic. I am definitely
> against these two BUG_ON() here.

I share your distaste for BUG_ON() in writeback. However, the
BUG_ON(ceph_wbc.pages); already exists in ceph_allocate_page_array().
This patch is trying to catch that earlier, where it's easier to
troubleshoot. If I changed these to WARN_ON(), it would not prevent
the oops.

And the writeback code has a lot of BUG_ON() checks already (so I am
not "introducing" BUG_ON), but I suppose you could be saying "it's
already a problem, please don't make it worse."

If I introduce a ceph_discard_page_array() function (as discussed on
patch 4), I could call it at the top of the loop (to *ensure* a clean
state) instead of using BUG_ON() (to *assert* a clean state). I'd like
to hear from other reviewers which approach they'd prefer.

>
> >               ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
> >
> >  get_more_pages:
> > @@ -1703,11 +1705,10 @@ static int ceph_writepages_start(struct address_space *mapping,
> >               }
> >
> >               rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
> > -             if (rc)
> > -                     goto release_folios;
> > -
>
> Frankly speaking, its' completely not clear the from commit message why we move
> this check. What's the problem is here? How this move can fix the issue?

The diff makes it a little unclear, but I'm actually moving
ceph_wbc.{locked_pages,strip_unit_end} = 0; *above* the check (see
commit message: "also make it the previous iteration's responsibility
to guarantee [locked_pages is] reset") so that they happen
unconditionally. Git just happens to see it in terms of the if/goto
moving downward, not the assignments moving up.

Warm regards,
Sam


>
> Thanks,
> Slava.
>
> >               ceph_wbc.locked_pages = 0;
> >               ceph_wbc.strip_unit_end = 0;
> > +             if (rc)
> > +                     goto release_folios;
> >
> >               if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
> >                       ceph_wbc.nr_folios =

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 4/5] ceph: Assert writeback loop invariants
  2026-01-06  6:53     ` Sam Edwards
@ 2026-01-06 23:00       ` Viacheslav Dubeyko
  2026-01-07  0:33         ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-06 23:00 UTC (permalink / raw)
  To: cfsworks@gmail.com
  Cc: Milind Changire, Xiubo Li, idryomov@gmail.com, brauner@kernel.org,
	ceph-devel@vger.kernel.org, jlayton@kernel.org,
	linux-kernel@vger.kernel.org

On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
> On Mon, Jan 5, 2026 at 2:29 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > 
> > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > If `locked_pages` is zero, the page array must not be allocated:
> > > ceph_process_folio_batch() uses `locked_pages` to decide when to
> > > allocate `pages`, and redundant allocations trigger
> > > ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> > > writeback stall) or even a kernel panic. Consequently, the main loop in
> > > ceph_writepages_start() assumes that the lifetime of `pages` is confined
> > > to a single iteration.
> > > 
> > > This expectation is currently not clear enough, as evidenced by the
> > > previous two patches which fix oopses caused by `pages` persisting into
> > > the next loop iteration.
> > > 
> > > Use an explicit BUG_ON() at the top of the loop to assert the loop's
> > > preexisting expectation that `pages` is cleaned up by the previous
> > > iteration. Because this is closely tied to `locked_pages`, also make it
> > > the previous iteration's responsibility to guarantee its reset, and
> > > verify with a second new BUG_ON() instead of handling (and masking)
> > > failures to do so.
> > > 
> > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > ---
> > >  fs/ceph/addr.c | 9 +++++----
> > >  1 file changed, 5 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 91cc43950162..b3569d44d510 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1669,7 +1669,9 @@ static int ceph_writepages_start(struct address_space *mapping,
> > >               tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
> > > 
> > >       while (!has_writeback_done(&ceph_wbc)) {
> > > -             ceph_wbc.locked_pages = 0;
> > > +             BUG_ON(ceph_wbc.locked_pages);
> > > +             BUG_ON(ceph_wbc.pages);
> > > +
> > 
> 
> Hi Slava,
> 
> > It's not good idea to introduce BUG_ON() in write pages logic. I am definitely
> > against these two BUG_ON() here.
> 
> I share your distaste for BUG_ON() in writeback. However, the
> BUG_ON(ceph_wbc.pages); already exists in ceph_allocate_page_array().
> This patch is trying to catch that earlier, where it's easier to
> troubleshoot. If I changed these to WARN_ON(), it would not prevent
> the oops.
> 
> And the writeback code has a lot of BUG_ON() checks already (so I am
> not "introducing" BUG_ON), but I suppose you could be saying "it's
> already a problem, please don't make it worse."
> 

It looks really strange that you do at first:

-             ceph_wbc.locked_pages = 0;

and then you introduce two BUG_ON():

+             BUG_ON(ceph_wbc.locked_pages);
+             BUG_ON(ceph_wbc.pages);

But what's the point? It looks more natural simply to make initialization here:

              ceph_wbc.locked_pages = 0;
              ceph_wbc.strip_unit_end = 0;

What's wrong with it?

> If I introduce a ceph_discard_page_array() function (as discussed on
> patch 4), I could call it at the top of the loop (to *ensure* a clean
> state) instead of using BUG_ON() (to *assert* a clean state). I'd like
> to hear from other reviewers which approach they'd prefer.
> 
> > 
> > >               ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
> > > 
> > >  get_more_pages:
> > > @@ -1703,11 +1705,10 @@ static int ceph_writepages_start(struct address_space *mapping,
> > >               }
> > > 
> > >               rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
> > > -             if (rc)
> > > -                     goto release_folios;
> > > -
> > 
> > Frankly speaking, its' completely not clear the from commit message why we move
> > this check. What's the problem is here? How this move can fix the issue?
> 
> The diff makes it a little unclear, but I'm actually moving
> ceph_wbc.{locked_pages,strip_unit_end} = 0; *above* the check (see
> commit message: "also make it the previous iteration's responsibility
> to guarantee [locked_pages is] reset") so that they happen
> unconditionally. Git just happens to see it in terms of the if/goto
> moving downward, not the assignments moving up.

I simply don't see any explanation why we are moving this check. And what this
move is fixing. I believe it's really important to explain what this
modification is fixing.

Thanks,
Slava.

> 
> Warm regards,
> Sam
> 
> 
> > 
> > Thanks,
> > Slava.
> > 
> > >               ceph_wbc.locked_pages = 0;
> > >               ceph_wbc.strip_unit_end = 0;
> > > +             if (rc)
> > > +                     goto release_folios;
> > > 
> > >               if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
> > >                       ceph_wbc.nr_folios =

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] ceph: Assert writeback loop invariants
  2026-01-06 23:00       ` Viacheslav Dubeyko
@ 2026-01-07  0:33         ` Sam Edwards
  0 siblings, 0 replies; 24+ messages in thread
From: Sam Edwards @ 2026-01-07  0:33 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Milind Changire, Xiubo Li, idryomov@gmail.com, brauner@kernel.org,
	ceph-devel@vger.kernel.org, jlayton@kernel.org,
	linux-kernel@vger.kernel.org

On Tue, Jan 6, 2026 at 3:00 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
> > On Mon, Jan 5, 2026 at 2:29 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > > If `locked_pages` is zero, the page array must not be allocated:
> > > > ceph_process_folio_batch() uses `locked_pages` to decide when to
> > > > allocate `pages`, and redundant allocations trigger
> > > > ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> > > > writeback stall) or even a kernel panic. Consequently, the main loop in
> > > > ceph_writepages_start() assumes that the lifetime of `pages` is confined
> > > > to a single iteration.
> > > >
> > > > This expectation is currently not clear enough, as evidenced by the
> > > > previous two patches which fix oopses caused by `pages` persisting into
> > > > the next loop iteration.
> > > >
> > > > Use an explicit BUG_ON() at the top of the loop to assert the loop's
> > > > preexisting expectation that `pages` is cleaned up by the previous
> > > > iteration. Because this is closely tied to `locked_pages`, also make it
> > > > the previous iteration's responsibility to guarantee its reset, and
> > > > verify with a second new BUG_ON() instead of handling (and masking)
> > > > failures to do so.
> > > >
> > > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > > ---
> > > >  fs/ceph/addr.c | 9 +++++----
> > > >  1 file changed, 5 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > index 91cc43950162..b3569d44d510 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1669,7 +1669,9 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > >               tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
> > > >
> > > >       while (!has_writeback_done(&ceph_wbc)) {
> > > > -             ceph_wbc.locked_pages = 0;
> > > > +             BUG_ON(ceph_wbc.locked_pages);
> > > > +             BUG_ON(ceph_wbc.pages);
> > > > +
> > >
> >
> > Hi Slava,
> >
> > > It's not good idea to introduce BUG_ON() in write pages logic. I am definitely
> > > against these two BUG_ON() here.
> >
> > I share your distaste for BUG_ON() in writeback. However, the
> > BUG_ON(ceph_wbc.pages); already exists in ceph_allocate_page_array().
> > This patch is trying to catch that earlier, where it's easier to
> > troubleshoot. If I changed these to WARN_ON(), it would not prevent
> > the oops.
> >
> > And the writeback code has a lot of BUG_ON() checks already (so I am
> > not "introducing" BUG_ON), but I suppose you could be saying "it's
> > already a problem, please don't make it worse."
> >
>
> It looks really strange that you do at first:
>
> -             ceph_wbc.locked_pages = 0;
>
> and then you introduce two BUG_ON():
>
> +             BUG_ON(ceph_wbc.locked_pages);
> +             BUG_ON(ceph_wbc.pages);
>
> But what's the point? It looks more natural simply to make initialization here:
>
>               ceph_wbc.locked_pages = 0;
>               ceph_wbc.strip_unit_end = 0;
>
> What's wrong with it?

The problem is if that block runs at the top of the loop while
ceph_wbc.pages != NULL, the worker will oops in
ceph_allocate_page_array(). This is a particularly difficult oops to
diagnose. We should prevent it by carefully maintaining the loop's
invariants, but if prevention fails, the next best option is to oops
earlier, as close as possible to the actual bug.

>
> > If I introduce a ceph_discard_page_array() function (as discussed on
> > patch 4), I could call it at the top of the loop (to *ensure* a clean
> > state) instead of using BUG_ON() (to *assert* a clean state). I'd like
> > to hear from other reviewers which approach they'd prefer.
> >
> > >
> > > >               ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
> > > >
> > > >  get_more_pages:
> > > > @@ -1703,11 +1705,10 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > >               }
> > > >
> > > >               rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
> > > > -             if (rc)
> > > > -                     goto release_folios;
> > > > -
> > >
> > > Frankly speaking, its' completely not clear the from commit message why we move
> > > this check. What's the problem is here? How this move can fix the issue?
> >
> > The diff makes it a little unclear, but I'm actually moving
> > ceph_wbc.{locked_pages,strip_unit_end} = 0; *above* the check (see
> > commit message: "also make it the previous iteration's responsibility
> > to guarantee [locked_pages is] reset") so that they happen
> > unconditionally. Git just happens to see it in terms of the if/goto
> > moving downward, not the assignments moving up.
>
> I simply don't see any explanation why we are moving this check.

The check is not being moved; other lines are being moved above it,
and Git's diff algorithm made it look like the check moved.
This equivalent diff makes the actual change clearer:
         rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
+        ceph_wbc.locked_pages = 0;
+        ceph_wbc.strip_unit_end = 0;
         if (rc)
             goto release_folios;
-
-        ceph_wbc.locked_pages = 0;
-        ceph_wbc.strip_unit_end = 0;

         if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
             ceph_wbc.nr_folios =

> And what this
> move is fixing. I believe it's really important to explain what this
> modification is fixing.

This is not a bugfix; it's purely code cleanup -- more of the
"defensive programming" that we both like to see.

Cheers,
Sam

>
> Thanks,
> Slava.
>
> >
> > Warm regards,
> > Sam
> >
> >
> > >
> > > Thanks,
> > > Slava.
> > >
> > > >               ceph_wbc.locked_pages = 0;
> > > >               ceph_wbc.strip_unit_end = 0;
> > > > +             if (rc)
> > > > +                     goto release_folios;
> > > >
> > > >               if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
> > > >                       ceph_wbc.nr_folios =

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 5/5] ceph: Fix write storm on fscrypted files
  2025-12-31  2:43 [PATCH 0/5] ceph: CephFS writeback correctness and performance fixes Sam Edwards
                   ` (3 preceding siblings ...)
  2025-12-31  2:43 ` [PATCH 4/5] ceph: Assert writeback loop invariants Sam Edwards
@ 2025-12-31  2:43 ` Sam Edwards
  2026-01-05 22:34   ` Viacheslav Dubeyko
  4 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2025-12-31  2:43 UTC (permalink / raw)
  To: Xiubo Li, Ilya Dryomov
  Cc: Viacheslav Dubeyko, Christian Brauner, Milind Changire,
	Jeff Layton, ceph-devel, linux-kernel, Sam Edwards, stable

CephFS stores file data across multiple RADOS objects. An object is the
atomic unit of storage, so the writeback code must clean only folios
that belong to the same object with each OSD request.

CephFS also supports RAID0-style striping of file contents: if enabled,
each object stores multiple unbroken "stripe units" covering different
portions of the file; if disabled, a "stripe unit" is simply the whole
object. The stripe unit is (usually) reported as the inode's block size.

Though the writeback logic could, in principle, lock all dirty folios
belonging to the same object, its current design is to lock only a
single stripe unit at a time. Ever since this code was first written,
it has determined this size by checking the inode's block size.
However, the relatively-new fscrypt support needed to reduce the block
size for encrypted inodes to the crypto block size (see 'fixes' commit),
which causes an unnecessarily high number of write operations (~1024x as
many, with 4MiB objects) and grossly degraded performance.

Fix this (and clarify intent) by using i_layout.stripe_unit directly in
ceph_define_write_size() so that encrypted inodes are written back with
the same number of operations as if they were unencrypted.

Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3569d44d510..cb1da8e27c2b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;

 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
-- 
2.51.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re:  [PATCH 5/5] ceph: Fix write storm on fscrypted files
  2025-12-31  2:43 ` [PATCH 5/5] ceph: Fix write storm on fscrypted files Sam Edwards
@ 2026-01-05 22:34   ` Viacheslav Dubeyko
  2026-01-06  6:53     ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-05 22:34 UTC (permalink / raw)
  To: Xiubo Li, idryomov@gmail.com, cfsworks@gmail.com
  Cc: Milind Changire, stable@vger.kernel.org,
	ceph-devel@vger.kernel.org, brauner@kernel.org,
	jlayton@kernel.org, linux-kernel@vger.kernel.org

On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> CephFS stores file data across multiple RADOS objects. An object is the
> atomic unit of storage, so the writeback code must clean only folios
> that belong to the same object with each OSD request.
> 
> CephFS also supports RAID0-style striping of file contents: if enabled,
> each object stores multiple unbroken "stripe units" covering different
> portions of the file; if disabled, a "stripe unit" is simply the whole
> object. The stripe unit is (usually) reported as the inode's block size.
> 
> Though the writeback logic could, in principle, lock all dirty folios
> belonging to the same object, its current design is to lock only a
> single stripe unit at a time. Ever since this code was first written,
> it has determined this size by checking the inode's block size.
> However, the relatively-new fscrypt support needed to reduce the block
> size for encrypted inodes to the crypto block size (see 'fixes' commit),
> which causes an unnecessarily high number of write operations (~1024x as
> many, with 4MiB objects) and grossly degraded performance.

Do you have any benchmarking results that prove your point?

Thanks,
Slava.

> 
> Fix this (and clarify intent) by using i_layout.stripe_unit directly in
> ceph_define_write_size() so that encrypted inodes are written back with
> the same number of operations as if they were unencrypted.
> 
> Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
>  fs/ceph/addr.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index b3569d44d510..cb1da8e27c2b 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
>  {
>  	struct inode *inode = mapping->host;
>  	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> -	unsigned int wsize = i_blocksize(inode);
> +	struct ceph_inode_info *ci = ceph_inode(inode);
> +	unsigned int wsize = ci->i_layout.stripe_unit;
>  
>  	if (fsc->mount_options->wsize < wsize)
>  		wsize = fsc->mount_options->wsize;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] ceph: Fix write storm on fscrypted files
  2026-01-05 22:34   ` Viacheslav Dubeyko
@ 2026-01-06  6:53     ` Sam Edwards
  2026-01-06 23:11       ` Viacheslav Dubeyko
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Edwards @ 2026-01-06  6:53 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Xiubo Li, idryomov@gmail.com, Milind Changire,
	stable@vger.kernel.org, ceph-devel@vger.kernel.org,
	brauner@kernel.org, jlayton@kernel.org,
	linux-kernel@vger.kernel.org

On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > CephFS stores file data across multiple RADOS objects. An object is the
> > atomic unit of storage, so the writeback code must clean only folios
> > that belong to the same object with each OSD request.
> >
> > CephFS also supports RAID0-style striping of file contents: if enabled,
> > each object stores multiple unbroken "stripe units" covering different
> > portions of the file; if disabled, a "stripe unit" is simply the whole
> > object. The stripe unit is (usually) reported as the inode's block size.
> >
> > Though the writeback logic could, in principle, lock all dirty folios
> > belonging to the same object, its current design is to lock only a
> > single stripe unit at a time. Ever since this code was first written,
> > it has determined this size by checking the inode's block size.
> > However, the relatively-new fscrypt support needed to reduce the block
> > size for encrypted inodes to the crypto block size (see 'fixes' commit),
> > which causes an unnecessarily high number of write operations (~1024x as
> > many, with 4MiB objects) and grossly degraded performance.

Hi Slava,

> Do you have any benchmarking results that prove your point?

I haven't done any "real" benchmarking for this change. On my setup
(closer to a home server than a typical Ceph deployment), sequential
write throughput increased from ~1.7 to ~66 MB/s with this patch
applied. I don't consider this single datapoint representative, so
rather than presenting it as a general benchmark in the commit
message, I chose the qualitative wording "grossly degraded
performance." Actual impact will vary depending on workload, disk
type, OSD count, etc.

Those curious about the bug's performance impact in their environment
can find out without enabling fscrypt, using: mount -o wsize=4096

However, the core rationale for my claim is based on principles, not
on measurements: batching writes into fewer operations necessarily
spreads per-operation overhead across more bytes. So this change
removes an artificial per-op bottleneck on sequential write
performance. The exact impact varies, but the patch does improve
(fscrypt-enabled) write throughput in nearly every case.

Warm regards,
Sam


>
> Thanks,
> Slava.
>
> >
> > Fix this (and clarify intent) by using i_layout.stripe_unit directly in
> > ceph_define_write_size() so that encrypted inodes are written back with
> > the same number of operations as if they were unencrypted.
> >
> > Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> >  fs/ceph/addr.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index b3569d44d510..cb1da8e27c2b 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
> >  {
> >       struct inode *inode = mapping->host;
> >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > -     unsigned int wsize = i_blocksize(inode);
> > +     struct ceph_inode_info *ci = ceph_inode(inode);
> > +     unsigned int wsize = ci->i_layout.stripe_unit;
> >
> >       if (fsc->mount_options->wsize < wsize)
> >               wsize = fsc->mount_options->wsize;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 5/5] ceph: Fix write storm on fscrypted files
  2026-01-06  6:53     ` Sam Edwards
@ 2026-01-06 23:11       ` Viacheslav Dubeyko
  2026-01-07  0:05         ` Sam Edwards
  0 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-01-06 23:11 UTC (permalink / raw)
  To: cfsworks@gmail.com
  Cc: Xiubo Li, brauner@kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, jlayton@kernel.org, Milind Changire,
	idryomov@gmail.com, stable@vger.kernel.org

On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
> On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > 
> > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > CephFS stores file data across multiple RADOS objects. An object is the
> > > atomic unit of storage, so the writeback code must clean only folios
> > > that belong to the same object with each OSD request.
> > > 
> > > CephFS also supports RAID0-style striping of file contents: if enabled,
> > > each object stores multiple unbroken "stripe units" covering different
> > > portions of the file; if disabled, a "stripe unit" is simply the whole
> > > object. The stripe unit is (usually) reported as the inode's block size.
> > > 
> > > Though the writeback logic could, in principle, lock all dirty folios
> > > belonging to the same object, its current design is to lock only a
> > > single stripe unit at a time. Ever since this code was first written,
> > > it has determined this size by checking the inode's block size.
> > > However, the relatively-new fscrypt support needed to reduce the block
> > > size for encrypted inodes to the crypto block size (see 'fixes' commit),
> > > which causes an unnecessarily high number of write operations (~1024x as
> > > many, with 4MiB objects) and grossly degraded performance.
> 
> Hi Slava,
> 
> > Do you have any benchmarking results that prove your point?
> 
> I haven't done any "real" benchmarking for this change. On my setup
> (closer to a home server than a typical Ceph deployment), sequential
> write throughput increased from ~1.7 to ~66 MB/s with this patch
> applied. I don't consider this single datapoint representative, so
> rather than presenting it as a general benchmark in the commit
> message, I chose the qualitative wording "grossly degraded
> performance." Actual impact will vary depending on workload, disk
> type, OSD count, etc.
> 
> Those curious about the bug's performance impact in their environment
> can find out without enabling fscrypt, using: mount -o wsize=4096
> 
> However, the core rationale for my claim is based on principles, not
> on measurements: batching writes into fewer operations necessarily
> spreads per-operation overhead across more bytes. So this change
> removes an artificial per-op bottleneck on sequential write
> performance. The exact impact varies, but the patch does improve
> (fscrypt-enabled) write throughput in nearly every case.
> 

If you claim in commit message that "this patch fixes performance degradation",
then you MUST share the evidence (benchmarking results). Otherwise, you cannot
make such statements in commit message. Yes, it could be a good fix but please
don't make a promise of performance improvement. Because, end-users have very
different environments and workloads. And what could work on your side may not
work for other use-cases and environments. Potentially, you could describe your
environment, workload and to share your estimation/expectation of potential
performance improvement.

Thanks,
Slava.

> Warm regards,
> Sam
> 
> 
> > 
> > Thanks,
> > Slava.
> > 
> > > 
> > > Fix this (and clarify intent) by using i_layout.stripe_unit directly in
> > > ceph_define_write_size() so that encrypted inodes are written back with
> > > the same number of operations as if they were unencrypted.
> > > 
> > > Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > ---
> > >  fs/ceph/addr.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index b3569d44d510..cb1da8e27c2b 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
> > >  {
> > >       struct inode *inode = mapping->host;
> > >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > > -     unsigned int wsize = i_blocksize(inode);
> > > +     struct ceph_inode_info *ci = ceph_inode(inode);
> > > +     unsigned int wsize = ci->i_layout.stripe_unit;
> > > 
> > >       if (fsc->mount_options->wsize < wsize)
> > >               wsize = fsc->mount_options->wsize;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] ceph: Fix write storm on fscrypted files
  2026-01-06 23:11       ` Viacheslav Dubeyko
@ 2026-01-07  0:05         ` Sam Edwards
  0 siblings, 0 replies; 24+ messages in thread
From: Sam Edwards @ 2026-01-07  0:05 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Xiubo Li, brauner@kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, jlayton@kernel.org, Milind Changire,
	idryomov@gmail.com, stable@vger.kernel.org

On Tue, Jan 6, 2026 at 3:11 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
> > On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > > CephFS stores file data across multiple RADOS objects. An object is the
> > > > atomic unit of storage, so the writeback code must clean only folios
> > > > that belong to the same object with each OSD request.
> > > >
> > > > CephFS also supports RAID0-style striping of file contents: if enabled,
> > > > each object stores multiple unbroken "stripe units" covering different
> > > > portions of the file; if disabled, a "stripe unit" is simply the whole
> > > > object. The stripe unit is (usually) reported as the inode's block size.
> > > >
> > > > Though the writeback logic could, in principle, lock all dirty folios
> > > > belonging to the same object, its current design is to lock only a
> > > > single stripe unit at a time. Ever since this code was first written,
> > > > it has determined this size by checking the inode's block size.
> > > > However, the relatively-new fscrypt support needed to reduce the block
> > > > size for encrypted inodes to the crypto block size (see 'fixes' commit),
> > > > which causes an unnecessarily high number of write operations (~1024x as
> > > > many, with 4MiB objects) and grossly degraded performance.
> >
> > Hi Slava,
> >
> > > Do you have any benchmarking results that prove your point?
> >
> > I haven't done any "real" benchmarking for this change. On my setup
> > (closer to a home server than a typical Ceph deployment), sequential
> > write throughput increased from ~1.7 to ~66 MB/s with this patch
> > applied. I don't consider this single datapoint representative, so
> > rather than presenting it as a general benchmark in the commit
> > message, I chose the qualitative wording "grossly degraded
> > performance." Actual impact will vary depending on workload, disk
> > type, OSD count, etc.
> >
> > Those curious about the bug's performance impact in their environment
> > can find out without enabling fscrypt, using: mount -o wsize=4096
> >
> > However, the core rationale for my claim is based on principles, not
> > on measurements: batching writes into fewer operations necessarily
> > spreads per-operation overhead across more bytes. So this change
> > removes an artificial per-op bottleneck on sequential write
> > performance. The exact impact varies, but the patch does improve
> > (fscrypt-enabled) write throughput in nearly every case.
> >
>

Hi Slava,

> If you claim in commit message that "this patch fixes performance degradation",
> then you MUST share the evidence (benchmarking results). Otherwise, you cannot
> make such statements in commit message. Yes, it could be a good fix but please
> don't make a promise of performance improvement. Because, end-users have very
> different environments and workloads. And what could work on your side may not
> work for other use-cases and environments.

I agree with the underlying concern: I do not have a representative
environment, and it would be irresponsible to promise or quantify a
specific speedup. For that reason, the commit message does not claim
any particular improvement factor.

What it does say is that this patch fixes a known performance
degradation that artificially limits the write batch size. But that
statement is about correctness relative to previous behavior, not
about guaranteeing any specific performance outcome for end users.

> Potentially, you could describe your
> environment, workload and to share your estimation/expectation of potential
> performance improvement.

I don’t think that would be useful here. As you pointed out, any such
numbers would be highly workload- and environment-specific and would
not be representative or actionable. The purpose of this patch is
simply to remove an unintentional limit, not to advertise or promise
measurable gains.

Best,
Sam

>
> Thanks,
> Slava.
>
> > Warm regards,
> > Sam
> >
> >
> > >
> > > Thanks,
> > > Slava.
> > >
> > > >
> > > > Fix this (and clarify intent) by using i_layout.stripe_unit directly in
> > > > ceph_define_write_size() so that encrypted inodes are written back with
> > > > the same number of operations as if they were unencrypted.
> > > >
> > > > Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > > > ---
> > > >  fs/ceph/addr.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > index b3569d44d510..cb1da8e27c2b 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
> > > >  {
> > > >       struct inode *inode = mapping->host;
> > > >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > > > -     unsigned int wsize = i_blocksize(inode);
> > > > +     struct ceph_inode_info *ci = ceph_inode(inode);
> > > > +     unsigned int wsize = ci->i_layout.stripe_unit;
> > > >
> > > >       if (fsc->mount_options->wsize < wsize)
> > > >               wsize = fsc->mount_options->wsize;

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2026-01-07  0:33 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-31  2:43 [PATCH 0/5] ceph: CephFS writeback correctness and performance fixes Sam Edwards
2025-12-31  2:43 ` [PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors Sam Edwards
2026-01-05 20:23   ` Viacheslav Dubeyko
2026-01-06  6:52     ` Sam Edwards
2026-01-06 21:08       ` Viacheslav Dubeyko
2026-01-06 23:50         ` Sam Edwards
2025-12-31  2:43 ` [PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch() Sam Edwards
2026-01-05 20:36   ` Viacheslav Dubeyko
2026-01-06  6:52     ` Sam Edwards
2026-01-06 22:47       ` Viacheslav Dubeyko
2026-01-07  0:15         ` Sam Edwards
2025-12-31  2:43 ` [PATCH 3/5] ceph: Free page array when ceph_submit_write fails Sam Edwards
2026-01-05 21:09   ` Viacheslav Dubeyko
2026-01-06  6:52     ` Sam Edwards
2025-12-31  2:43 ` [PATCH 4/5] ceph: Assert writeback loop invariants Sam Edwards
2026-01-05 22:28   ` Viacheslav Dubeyko
2026-01-06  6:53     ` Sam Edwards
2026-01-06 23:00       ` Viacheslav Dubeyko
2026-01-07  0:33         ` Sam Edwards
2025-12-31  2:43 ` [PATCH 5/5] ceph: Fix write storm on fscrypted files Sam Edwards
2026-01-05 22:34   ` Viacheslav Dubeyko
2026-01-06  6:53     ` Sam Edwards
2026-01-06 23:11       ` Viacheslav Dubeyko
2026-01-07  0:05         ` Sam Edwards

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox