* [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails
@ 2026-03-15 23:25 Sam Edwards
2026-03-16 17:44 ` Viacheslav Dubeyko
0 siblings, 1 reply; 6+ messages in thread
From: Sam Edwards @ 2026-03-15 23:25 UTC (permalink / raw)
To: Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko
Cc: Milind Changire, Xiubo Li, Jeff Layton, ceph-devel, linux-kernel,
Sam Edwards, stable
move_dirty_folio_in_page_array() may fail if the file is encrypted, the
dirty folio is not the first in the batch, and it fails to allocate a
bounce buffer to hold the ciphertext. When that happens,
ceph_process_folio_batch() simply redirties the folio and flushes the
current batch -- it can retry that folio in a future batch.
However, if this failed folio is not contiguous with the last folio that
did make it into the batch, then ceph_process_folio_batch() has already
incremented `ceph_wbc->num_ops`; because it doesn't follow through and
add the discontiguous folio to the array, ceph_submit_write() -- which
expects that `ceph_wbc->num_ops` accurately reflects the number of
contiguous ranges (and therefore the required number of "write extent"
ops) in the writeback -- will panic the kernel:
BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
Fix this crash by decrementing `ceph_wbc->num_ops` back to the correct
value when move_dirty_folio_in_page_array() fails, but the folio already
started counting a new (i.e. still-empty) extent.
The defect corrected by this patch has existed since 2022 (see first
`Fixes:`), but another bug blocked multi-folio encrypted writeback until
recently (see second `Fixes:`). The second commit made it into 6.18.16,
6.19.6, and 7.0-rc1, unmasking the panic in those versions. This patch
therefore fixes a regression (panic) introduced by cac190c7674f.
Cc: stable@vger.kernel.org # v6.18+
Fixes: d55207717ded ("ceph: add encryption support to writepage and writepages")
Fixes: cac190c7674f ("ceph: fix write storm on fscrypted files")
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
fs/ceph/addr.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index e87b3bb94ee8..f366e159ffa6 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1366,6 +1366,10 @@ void ceph_process_folio_batch(struct address_space *mapping,
rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
folio);
if (rc) {
+ /* Did we just begin a new contiguous op? Nevermind! */
+ if (ceph_wbc->len == 0)
+ ceph_wbc->num_ops--;
+
folio_redirty_for_writepage(wbc, folio);
folio_unlock(folio);
break;
--
2.52.0
* Re: [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails
2026-03-15 23:25 [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails Sam Edwards
@ 2026-03-16 17:44 ` Viacheslav Dubeyko
2026-03-16 20:14 ` Sam Edwards
0 siblings, 1 reply; 6+ messages in thread
From: Viacheslav Dubeyko @ 2026-03-16 17:44 UTC (permalink / raw)
To: idryomov@gmail.com, Alex Markuze, cfsworks@gmail.com,
slava@dubeyko.com
Cc: Milind Changire, stable@vger.kernel.org, Xiubo Li,
jlayton@kernel.org, linux-kernel@vger.kernel.org,
ceph-devel@vger.kernel.org
On Sun, 2026-03-15 at 16:25 -0700, Sam Edwards wrote:
> move_dirty_folio_in_page_array() may fail if the file is encrypted, the
> dirty folio is not the first in the batch, and it fails to allocate a
> bounce buffer to hold the ciphertext. When that happens,
> ceph_process_folio_batch() simply redirties the folio and flushes the
> current batch -- it can retry that folio in a future batch.
>
How can this issue be reproduced? Do you have a reproduction script or anything
like that?
> However, if this failed folio is not contiguous with the last folio that
> did make it into the batch, then ceph_process_folio_batch() has already
> incremented `ceph_wbc->num_ops`; because it doesn't follow through and
> add the discontiguous folio to the array, ceph_submit_write() -- which
> expects that `ceph_wbc->num_ops` accurately reflects the number of
> contiguous ranges (and therefore the required number of "write extent"
> ops) in the writeback -- will panic the kernel:
>
> BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
I don't quite follow. We decrement ceph_wbc->num_ops, but BUG_ON() operates on
req->r_num_ops. How does req->r_num_ops receive the value of ceph_wbc->num_ops?
>
> Fix this crash by decrementing `ceph_wbc->num_ops` back to the correct
> value when move_dirty_folio_in_page_array() fails, but the folio already
> started counting a new (i.e. still-empty) extent.
>
> The defect corrected by this patch has existed since 2022 (see first
> `Fixes:`), but another bug blocked multi-folio encrypted writeback until
> recently (see second `Fixes:`). The second commit made it into 6.18.16,
> 6.19.6, and 7.0-rc1, unmasking the panic in those versions. This patch
> therefore fixes a regression (panic) introduced by cac190c7674f.
>
> Cc: stable@vger.kernel.org # v6.18+
> Fixes: d55207717ded ("ceph: add encryption support to writepage and writepages")
> Fixes: cac190c7674f ("ceph: fix write storm on fscrypted files")
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
> fs/ceph/addr.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index e87b3bb94ee8..f366e159ffa6 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1366,6 +1366,10 @@ void ceph_process_folio_batch(struct address_space *mapping,
> rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
> folio);
> if (rc) {
> + /* Did we just begin a new contiguous op? Nevermind! */
> + if (ceph_wbc->len == 0)
> + ceph_wbc->num_ops--;
> +
> folio_redirty_for_writepage(wbc, folio);
> folio_unlock(folio);
> break;
We change ceph_wbc->num_ops, ceph_wbc->offset, and ceph_wbc->len here:
} else if (!is_folio_index_contiguous(ceph_wbc, folio)) {
if (is_num_ops_too_big(ceph_wbc)) {
folio_redirty_for_writepage(wbc, folio);
folio_unlock(folio);
break;
}
ceph_wbc->num_ops++;
ceph_wbc->offset = (u64)folio_pos(folio);
ceph_wbc->len = 0;
}
First of all, technically speaking, move_dirty_folio_in_page_array() can fail
even when is_folio_index_contiguous() returns true. Do you mean that we don't
need to decrement ceph_wbc->num_ops in that case?
Secondly, do we need to correct ceph_wbc->offset?
Thanks,
Slava.
* Re: [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails
2026-03-16 17:44 ` Viacheslav Dubeyko
@ 2026-03-16 20:14 ` Sam Edwards
2026-03-17 18:51 ` Viacheslav Dubeyko
0 siblings, 1 reply; 6+ messages in thread
From: Sam Edwards @ 2026-03-16 20:14 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: idryomov@gmail.com, Alex Markuze, slava@dubeyko.com,
Milind Changire, stable@vger.kernel.org, Xiubo Li,
jlayton@kernel.org, linux-kernel@vger.kernel.org,
ceph-devel@vger.kernel.org
On Mon, Mar 16, 2026 at 10:44 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Sun, 2026-03-15 at 16:25 -0700, Sam Edwards wrote:
> > move_dirty_folio_in_page_array() may fail if the file is encrypted, the
> > dirty folio is not the first in the batch, and it fails to allocate a
> > bounce buffer to hold the ciphertext. When that happens,
> > ceph_process_folio_batch() simply redirties the folio and flushes the
> > current batch -- it can retry that folio in a future batch.
> >
>
> How can this issue be reproduced? Do you have a reproduction script or anything
> like that?
Good day Slava,
Is this question about the preceding paragraph? If so: that paragraph
is just describing current (and intended) behavior, not an issue.
If this is just a general question about the patch, then I don't know
of a way to trigger the issue in a short timeframe, but something like
this ought to work:
1. Create a reasonably-sized (e.g. 4GiB) fscrypt-protected file in CephFS
2. Put the CephFS client system under heavy memory pressure, so that
bounce page allocation is more likely to fail
3. Repeatedly write to the file in a 4KiB-written/4KiB-skipped
pattern, starting over upon getting to the end of the file
4. Wait for the system to panic, gradually ramping up the memory
pressure until it does
I run a workload that performs fairly random I/O atop CephFS+fscrypt.
Before this patch, I'd get a panic after about a day. After this
patch, I've been running for 4+ days without this particular issue
reappearing.
> > However, if this failed folio is not contiguous with the last folio that
> > did make it into the batch, then ceph_process_folio_batch() has already
> > incremented `ceph_wbc->num_ops`; because it doesn't follow through and
> > add the discontiguous folio to the array, ceph_submit_write() -- which
> > expects that `ceph_wbc->num_ops` accurately reflects the number of
> > contiguous ranges (and therefore the required number of "write extent"
> > ops) in the writeback -- will panic the kernel:
> >
> > BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
>
> I don't quite follow. We decrement ceph_wbc->num_ops, but BUG_ON() operates on
> req->r_num_ops. How does req->r_num_ops receive the value of ceph_wbc->num_ops?
ceph_submit_write() passes ceph_wbc->num_ops to ceph_osdc_new_request()...
> We change ceph_wbc->num_ops, ceph_wbc->offset, and ceph_wbc->len here:
>
> } else if (!is_folio_index_contiguous(ceph_wbc, folio)) {
> if (is_num_ops_too_big(ceph_wbc)) {
> folio_redirty_for_writepage(wbc, folio);
> folio_unlock(folio);
> break;
> }
>
> ceph_wbc->num_ops++;
> ceph_wbc->offset = (u64)folio_pos(folio);
> ceph_wbc->len = 0;
> }
>
> First of all, technically speaking, move_dirty_folio_in_page_array() can fail
> even when is_folio_index_contiguous() returns true. Do you mean that we don't
> need to decrement ceph_wbc->num_ops in that case?
Yes, exactly: as stated in the commit message, we only need to correct
the value "when move_dirty_folio_in_page_array() fails, but the folio
already started counting a new (i.e. still-empty) extent." The `len ==
0` test is checking for that new/still-empty condition.
> Secondly, do we need to correct ceph_wbc->offset?
No, we do not; the valid lifetime of offset/len ends when
ceph_process_folio_batch() returns. I'd even argue they don't belong
in ceph_wbc at all and should be local variables instead, but that's a
matter for a different patch.
Cheers,
Sam
* RE: [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails
2026-03-16 20:14 ` Sam Edwards
@ 2026-03-17 18:51 ` Viacheslav Dubeyko
2026-03-17 19:13 ` Sam Edwards
0 siblings, 1 reply; 6+ messages in thread
From: Viacheslav Dubeyko @ 2026-03-17 18:51 UTC (permalink / raw)
To: cfsworks@gmail.com
Cc: Xiubo Li, slava@dubeyko.com, ceph-devel@vger.kernel.org,
linux-kernel@vger.kernel.org, Alex Markuze, jlayton@kernel.org,
Milind Changire, idryomov@gmail.com, stable@vger.kernel.org
On Mon, 2026-03-16 at 13:14 -0700, Sam Edwards wrote:
> On Mon, Mar 16, 2026 at 10:44 AM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> >
> > On Sun, 2026-03-15 at 16:25 -0700, Sam Edwards wrote:
> > > move_dirty_folio_in_page_array() may fail if the file is encrypted, the
> > > dirty folio is not the first in the batch, and it fails to allocate a
> > > bounce buffer to hold the ciphertext. When that happens,
> > > ceph_process_folio_batch() simply redirties the folio and flushes the
> > > current batch -- it can retry that folio in a future batch.
> > >
> >
> > How can this issue be reproduced? Do you have a reproduction script or anything
> > like that?
>
> Good day Slava,
>
> Is this question about the preceding paragraph? If so: that paragraph
> is just describing current (and intended) behavior, not an issue.
>
> If this is just a general question about the patch, then I don't know
> of a way to trigger the issue in a short timeframe, but something like
> this ought to work:
> 1. Create a reasonably-sized (e.g. 4GiB) fscrypt-protected file in CephFS
> 2. Put the CephFS client system under heavy memory pressure, so that
> bounce page allocation is more likely to fail
> 3. Repeatedly write to the file in a 4KiB-written/4KiB-skipped
> pattern, starting over upon getting to the end of the file
> 4. Wait for the system to panic, gradually ramping up the memory
> pressure until it does
>
> I run a workload that performs fairly random I/O atop CephFS+fscrypt.
> Before this patch, I'd get a panic after about a day. After this
> patch, I've been running for 4+ days without this particular issue
> reappearing.
>
I think this is a good enough description of how the issue can be triggered, and
I believe that the commit message deserves to have this description.
Frankly speaking, I am trying to reproduce the issue [1]. Do you think that it
could be the same issue?
> > > However, if this failed folio is not contiguous with the last folio that
> > > did make it into the batch, then ceph_process_folio_batch() has already
> > > incremented `ceph_wbc->num_ops`; because it doesn't follow through and
> > > add the discontiguous folio to the array, ceph_submit_write() -- which
> > > expects that `ceph_wbc->num_ops` accurately reflects the number of
> > > contiguous ranges (and therefore the required number of "write extent"
> > > ops) in the writeback -- will panic the kernel:
> > >
> > > BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
I believe that it would be great to have a link to the particular location of
this code in the commit message.
> >
> > I don't quite follow. We decrement ceph_wbc->num_ops, but BUG_ON() operates on
> > req->r_num_ops. How does req->r_num_ops receive the value of ceph_wbc->num_ops?
>
> ceph_submit_write() passes ceph_wbc->num_ops to ceph_osdc_new_request()...
I think it makes sense to mention it in the commit message.
>
> > We change ceph_wbc->num_ops, ceph_wbc->offset, and ceph_wbc->len here:
> >
> > } else if (!is_folio_index_contiguous(ceph_wbc, folio)) {
> > if (is_num_ops_too_big(ceph_wbc)) {
> > folio_redirty_for_writepage(wbc, folio);
> > folio_unlock(folio);
> > break;
> > }
> >
> > ceph_wbc->num_ops++;
> > ceph_wbc->offset = (u64)folio_pos(folio);
> > ceph_wbc->len = 0;
> > }
> >
> > First of all, technically speaking, move_dirty_folio_in_page_array() can fail
> > even when is_folio_index_contiguous() returns true. Do you mean that we don't
> > need to decrement ceph_wbc->num_ops in that case?
>
> Yes, exactly: as stated in the commit message, we only need to correct
> the value "when move_dirty_folio_in_page_array() fails, but the folio
> already started counting a new (i.e. still-empty) extent." The `len ==
> 0` test is checking for that new/still-empty condition.
>
> > Secondly, do we need to correct ceph_wbc->offset?
>
> No, we do not; the valid lifetime of offset/len ends when
> ceph_process_folio_batch() returns. I'd even argue they don't belong
> in ceph_wbc at all and should be local variables instead, but that's a
> matter for a different patch.
>
I think that it makes sense to create an issue in the Ceph tracker and to add a
Closes: tag to the fix.
Thanks,
Slava.
[1] https://tracker.ceph.com/issues/74156
* Re: [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails
2026-03-17 18:51 ` Viacheslav Dubeyko
@ 2026-03-17 19:13 ` Sam Edwards
2026-03-17 19:48 ` Ilya Dryomov
0 siblings, 1 reply; 6+ messages in thread
From: Sam Edwards @ 2026-03-17 19:13 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: Xiubo Li, slava@dubeyko.com, ceph-devel@vger.kernel.org,
linux-kernel@vger.kernel.org, Alex Markuze, jlayton@kernel.org,
Milind Changire, idryomov@gmail.com, stable@vger.kernel.org
On Tue, Mar 17, 2026 at 11:51 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
> > If this is just a general question about the patch, then I don't know
> > of a way to trigger the issue in a short timeframe, but something like
> > this ought to work:
> > 1. Create a reasonably-sized (e.g. 4GiB) fscrypt-protected file in CephFS
> > 2. Put the CephFS client system under heavy memory pressure, so that
> > bounce page allocation is more likely to fail
> > 3. Repeatedly write to the file in a 4KiB-written/4KiB-skipped
> > pattern, starting over upon getting to the end of the file
> > 4. Wait for the system to panic, gradually ramping up the memory
> > pressure until it does
> >
> > I run a workload that performs fairly random I/O atop CephFS+fscrypt.
> > Before this patch, I'd get a panic after about a day. After this
> > patch, I've been running for 4+ days without this particular issue
> > reappearing.
> >
>
> I think this is a good enough description of how the issue can be triggered, and
> I believe that the commit message deserves to have this description.
Very well, I'll try to condense it in a way that makes it clear to
those trying to repro the crash without being overly verbose.
> Frankly speaking, I am trying to reproduce the issue [1]. Do you think that it
> could be the same issue?
Please double-check; the link you sent is to bug #74156, reported last
year. This regression was only introduced last month, so it couldn't
be the same issue. Did you send the wrong link?
> > > > BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
>
> I believe that it would be great to have a link to the particular location of
> this code in the commit message.
I strongly disagree: The location of the code changes with every
commit that adds/removes lines above it (including this patch) so such
a link would be rendered stale immediately. What is your reason for
believing the link is useful?
> > > I don't quite follow. We decrement ceph_wbc->num_ops, but BUG_ON() operates on
> > > req->r_num_ops. How does req->r_num_ops receive the value of ceph_wbc->num_ops?
> >
> > ceph_submit_write() passes ceph_wbc->num_ops to ceph_osdc_new_request()...
>
> I think it makes sense to mention it in the commit message.
NACK, that relationship is already memorialized in addr.c. But again
I'm interested to learn your reasoning.
> I think that it makes sense to create the issue in Ceph tracker and to add
> Closes to the fix.
I don't currently have a Ceph tracker account and don't think I can
add anything of substance to an issue report. Feel free to create the
issue on my behalf if it's important for Ceph's processes, and I can
Closes: tag it in v2.
Cheers,
Sam
* Re: [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails
2026-03-17 19:13 ` Sam Edwards
@ 2026-03-17 19:48 ` Ilya Dryomov
0 siblings, 0 replies; 6+ messages in thread
From: Ilya Dryomov @ 2026-03-17 19:48 UTC (permalink / raw)
To: Sam Edwards
Cc: Viacheslav Dubeyko, Xiubo Li, slava@dubeyko.com,
ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
Alex Markuze, jlayton@kernel.org, Milind Changire,
stable@vger.kernel.org
On Tue, Mar 17, 2026 at 8:13 PM Sam Edwards <cfsworks@gmail.com> wrote:
> > I think that it makes sense to create an issue in the Ceph tracker and to add a
> > Closes: tag to the fix.
>
> I don't currently have a Ceph tracker account and don't think I can
> add anything of substance to an issue report. Feel free to create the
> issue on my behalf if it's important for Ceph's processes, and I can
> Closes: tag it in v2.
Hi Sam,
It's not important. A tracker ticket is useful when the issue is reported
purely as a placeholder for investigation, log attachments,
etc. In this case you both reported the issue and proposed a fix for
it, so the patch description is the only thing that matters. Creating
a tracker ticket just for the sake of it isn't needed.
Thanks,
Ilya