linux-kernel.vger.kernel.org archive mirror
* [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()
@ 2025-08-27 18:17 Max Kellermann
  2025-08-27 19:07 ` Viacheslav Dubeyko
  2025-08-28 18:54 ` Viacheslav Dubeyko
  0 siblings, 2 replies; 7+ messages in thread
From: Max Kellermann @ 2025-08-27 18:17 UTC (permalink / raw)
  To: Slava.Dubeyko, xiubli, idryomov, amarkuze, ceph-devel,
	linux-kernel
  Cc: Max Kellermann, stable

The function ceph_process_folio_batch() sets folio_batch entries to
NULL, which is an illegal state.  Before folio_batch_release() crashes
due to this API violation, the function
ceph_shift_unused_folios_left() is supposed to remove those NULLs from
the array.
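
For illustration only (not part of the patch): a minimal userspace sketch of the kind of compaction described above. The `toy_batch` type and `toy_shift_unused_left()` are hypothetical stand-ins, not the kernel's folio_batch API or the actual ceph_shift_unused_folios_left() implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio_batch: a count plus a fixed array. */
#define TOY_BATCH_SIZE 8

struct toy_batch {
	unsigned int nr;
	void *entries[TOY_BATCH_SIZE];
};

/*
 * Move every non-NULL entry to the front (preserving order) and shrink
 * nr, so that no NULL slot remains inside [0, nr).
 */
static void toy_shift_unused_left(struct toy_batch *batch)
{
	unsigned int i, j = 0;

	assert(batch != NULL && batch->nr <= TOY_BATCH_SIZE);

	for (i = 0; i < batch->nr; i++) {
		if (!batch->entries[i])
			continue;	/* slot was consumed; drop it */
		batch->entries[j++] = batch->entries[i];
	}

	/* Clear the now-unused tail so no stale pointers linger. */
	for (i = j; i < batch->nr; i++)
		batch->entries[i] = NULL;

	batch->nr = j;
}
```

After such a pass, code that walks entries [0, nr) and dereferences each one no longer encounters a NULL.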

However, since commit ce80b76dd327 ("ceph: introduce
ceph_process_folio_batch() method"), this shifting doesn't happen
anymore because the "for" loop got moved to
ceph_process_folio_batch(), and now the `i` variable that remains in
ceph_writepages_start() doesn't get incremented anymore, making the
shifting effectively unreachable much of the time.

Later, commit 1551ec61dc55 ("ceph: introduce ceph_submit_write()
method") added more preconditions for doing the shift, replacing the
`i` check (with something that is still just as broken):

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for one of various
  reasons), shifting never happens

- if `processed_in_fbatch` is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

Since those two commits, any problem in ceph_process_folio_batch()
could crash the kernel, e.g. this way:

 BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410

The crash can be reproduced easily by changing the
ceph_check_page_before_write() return value to `-E2BIG`.

(Interestingly, the crash happens only if `huge_zero_folio` has
already been allocated; without `huge_zero_folio`,
is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
entries instead of dereferencing them.  That makes reproducing the bug
somewhat unreliable.  See
https://lore.kernel.org/20250826231626.218675-1-max.kellermann@ionos.com
for a discussion of this detail.)
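
For illustration only: a userspace model of why the NULL entries are sometimes skipped. It assumes, as described above, that the check is a plain pointer comparison against a global that stays NULL until the huge zero folio is allocated. All `toy_*` names are hypothetical and are not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	int refcount;
};

/* Models the global pointer: NULL until the huge zero folio exists. */
static struct toy_folio *toy_huge_zero_folio;

/* Plain pointer comparison, so NULL == NULL matches before allocation. */
static bool toy_is_huge_zero_folio(const struct toy_folio *folio)
{
	return toy_huge_zero_folio == folio;
}

/*
 * Walk the entries as a release loop would, dropping one reference per
 * folio.  Returns how many NULL entries would have been dereferenced
 * (the real code would oops instead of counting).
 */
static unsigned int toy_count_null_derefs(struct toy_folio **entries,
					  unsigned int nr)
{
	unsigned int i, bad = 0;

	assert(entries != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (toy_is_huge_zero_folio(entries[i]))
			continue;	/* treated as the special folio: skipped */
		if (!entries[i]) {
			bad++;		/* NULL would be dereferenced here */
			continue;
		}
		entries[i]->refcount--;
	}
	return bad;
}
```

In this model a NULL entry is silently skipped while the global is still NULL, and is only hit once the global points at a real object.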

My suggestion is to move the ceph_shift_unused_folios_left() call to
right after ceph_process_folio_batch() to ensure it always gets called to
fix up the illegal folio_batch state.
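
For illustration only: a simplified userspace model of the ordering change. `toy_process()` stands in for a ceph_process_folio_batch() call that consumes one entry and then fails; the real control flow in ceph_writepages_start() has more conditions than shown here, and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH_SIZE 4

struct toy_batch {
	unsigned int nr;
	void *entries[TOY_BATCH_SIZE];
};

/* Consume the first entry (leaving a NULL behind), then fail. */
static int toy_process(struct toy_batch *b)
{
	if (b->nr > 0)
		b->entries[0] = NULL;
	return -E2BIG;
}

/* Compact non-NULL entries to the front and shrink nr. */
static void toy_shift(struct toy_batch *b)
{
	unsigned int i, j = 0;

	assert(b->nr <= TOY_BATCH_SIZE);
	for (i = 0; i < b->nr; i++)
		if (b->entries[i])
			b->entries[j++] = b->entries[i];
	b->nr = j;
}

static bool toy_has_null(const struct toy_batch *b)
{
	unsigned int i;

	for (i = 0; i < b->nr; i++)
		if (!b->entries[i])
			return true;
	return false;
}

/* Returns true if the batch is safe to release (no NULL in [0, nr)). */
static bool toy_step(struct toy_batch *b, bool shift_unconditionally)
{
	int rc = toy_process(b);

	if (shift_unconditionally)
		toy_shift(b);		/* new ordering: always fix up */
	if (rc)
		goto release;
	if (!shift_unconditionally)
		toy_shift(b);		/* old ordering: success path only */
release:
	return !toy_has_null(b);
}
```

With the old ordering the error path reaches the release step with a NULL still in the batch; with the shift placed directly after processing, it does not.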

Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/aK4v548CId5GIKG1@swift.blarg.de/
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/ceph/addr.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8b202d789e93..8bc66b45dade 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1687,6 +1687,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 process_folio_batch:
 		rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
+		ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
 		if (rc)
 			goto release_folios;
 
@@ -1695,8 +1696,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 			goto release_folios;
 
 		if (ceph_wbc.processed_in_fbatch) {
-			ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
-
 			if (folio_batch_count(&ceph_wbc.fbatch) == 0 &&
 			    ceph_wbc.locked_pages < ceph_wbc.max_pages) {
 				doutc(cl, "reached end fbatch, trying for more\n");
-- 
2.47.2



* Re:  [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()
  2025-08-27 18:17 [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left() Max Kellermann
@ 2025-08-27 19:07 ` Viacheslav Dubeyko
  2025-08-28 18:54 ` Viacheslav Dubeyko
  1 sibling, 0 replies; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-08-27 19:07 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org, max.kellermann@ionos.com, Xiubo Li,
	idryomov@gmail.com, linux-kernel@vger.kernel.org, Alex Markuze
  Cc: stable@vger.kernel.org

On Wed, 2025-08-27 at 20:17 +0200, Max Kellermann wrote:
> [...]

Let us try to reproduce the issue and to test the patch.

Thanks,
Slava.


* Re:  [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()
  2025-08-27 18:17 [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left() Max Kellermann
  2025-08-27 19:07 ` Viacheslav Dubeyko
@ 2025-08-28 18:54 ` Viacheslav Dubeyko
  2025-08-28 19:05   ` Ilya Dryomov
  1 sibling, 1 reply; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-08-28 18:54 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org, max.kellermann@ionos.com, Xiubo Li,
	idryomov@gmail.com, linux-kernel@vger.kernel.org, Alex Markuze
  Cc: stable@vger.kernel.org

On Wed, 2025-08-27 at 20:17 +0200, Max Kellermann wrote:
> [...]
> 
> The crash can be reproduced easily by changing the
> ceph_check_page_before_write() return value to `-E2BIG`.
> 

I cannot reproduce the crash/issue. If ceph_check_page_before_write() returns
`-E2BIG`, then nothing happens. There is no crash, and no write operations can
be processed by the file system driver anymore. So, it doesn't look like a recipe
to reproduce the issue. I cannot confirm that the patch fixes the issue without
a clear way to reproduce it.

Could you please provide a clearer explanation of the reproduction path?

Thanks,
Slava.




* Re: [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()
  2025-08-28 18:54 ` Viacheslav Dubeyko
@ 2025-08-28 19:05   ` Ilya Dryomov
  2025-08-28 19:08     ` Viacheslav Dubeyko
  0 siblings, 1 reply; 7+ messages in thread
From: Ilya Dryomov @ 2025-08-28 19:05 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: ceph-devel@vger.kernel.org, max.kellermann@ionos.com, Xiubo Li,
	linux-kernel@vger.kernel.org, Alex Markuze,
	stable@vger.kernel.org

On Thu, Aug 28, 2025 at 8:55 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Wed, 2025-08-27 at 20:17 +0200, Max Kellermann wrote:
> > [...]
> >
> > The crash can be reproduced easily by changing the
> > ceph_check_page_before_write() return value to `-E2BIG`.
> >
>
> I cannot reproduce the crash/issue. If ceph_check_page_before_write() returns
> `-E2BIG`, then nothing happens. There is no crash, and no write operations can
> be processed by the file system driver anymore. So, it doesn't look like a recipe
> to reproduce the issue. I cannot confirm that the patch fixes the issue without
> a clear way to reproduce it.
>
> Could you please provide a clearer explanation of the reproduction path?

Hi Slava,

Was this bit taken into account?

  (Interestingly, the crash happens only if `huge_zero_folio` has
  already been allocated; without `huge_zero_folio`,
  is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
  entries instead of dereferencing them.  That makes reproducing the bug
  somewhat unreliable.  See
  https://lore.kernel.org/20250826231626.218675-1-max.kellermann@ionos.com
  for a discussion of this detail.)

Thanks,

                Ilya


* RE: [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()
  2025-08-28 19:05   ` Ilya Dryomov
@ 2025-08-28 19:08     ` Viacheslav Dubeyko
  2025-08-28 21:37       ` Max Kellermann
  0 siblings, 1 reply; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-08-28 19:08 UTC (permalink / raw)
  To: idryomov@gmail.com
  Cc: stable@vger.kernel.org, max.kellermann@ionos.com,
	ceph-devel@vger.kernel.org, Xiubo Li,
	linux-kernel@vger.kernel.org, Alex Markuze

On Thu, 2025-08-28 at 21:05 +0200, Ilya Dryomov wrote:
> On Thu, Aug 28, 2025 at 8:55 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> > [...]
> 
> Hi Slava,
> 
> Was this bit taken into account?
> 
>   (Interestingly, the crash happens only if `huge_zero_folio` has
>   already been allocated; without `huge_zero_folio`,
>   is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
>   entries instead of dereferencing them.  That makes reproducing the bug
>   somewhat unreliable.  See
>   https://lore.kernel.org/20250826231626.218675-1-max.kellermann@ionos.com  
>   for a discussion of this detail.)
> 
> 
Hi Ilya,

And what practical steps do you see to reproduce it? :)

Thanks,
Slava.


* Re: [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()
  2025-08-28 19:08     ` Viacheslav Dubeyko
@ 2025-08-28 21:37       ` Max Kellermann
  2025-09-04 21:43         ` Viacheslav Dubeyko
  0 siblings, 1 reply; 7+ messages in thread
From: Max Kellermann @ 2025-08-28 21:37 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: idryomov@gmail.com, stable@vger.kernel.org,
	ceph-devel@vger.kernel.org, Xiubo Li,
	linux-kernel@vger.kernel.org, Alex Markuze

On Thu, Aug 28, 2025 at 9:08 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
> And what practical steps do you see to reproduce it? :)

Apply the patch in the link. Did you read that thread/patch?


* RE: [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()
  2025-08-28 21:37       ` Max Kellermann
@ 2025-09-04 21:43         ` Viacheslav Dubeyko
  0 siblings, 0 replies; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-09-04 21:43 UTC (permalink / raw)
  To: max.kellermann@ionos.com
  Cc: Xiubo Li, Alex Markuze, idryomov@gmail.com,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org,
	ceph-devel@vger.kernel.org

On Thu, 2025-08-28 at 23:37 +0200, Max Kellermann wrote:
> On Thu, Aug 28, 2025 at 9:08 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> > And what practical steps do you see to reproduce it? :)
> 
> Apply the patch in the link. Did you read that thread/patch?

By applying the patch [1], enabling CONFIG_DEBUG_VM, and returning -E2BIG from
ceph_check_page_before_write(), I was able to reproduce this warning:

 [  123.147833] ------------[ cut here ]------------
 [  123.147861] WARNING: CPU: 5 PID: 72 at ./include/linux/huge_mm.h:482 folios_put_refs+0x4c2/0x600
 [  123.147900] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel kvm irqbypass joydev polyval_clmulni ghash_clmulni_intel aesni_intel rapl input_leds psmouse i2c_piix4 vga16fb pata_acpi bochs vgastate i2c_smbus serio_raw floppy qemu_fw_cfg mac_hid sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
 [  123.147988] CPU: 5 UID: 0 PID: 72 Comm: kworker/u32:2 Not tainted 6.17.0-rc4+ #9 PREEMPT(voluntary)
 [  123.147995] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
 [  123.148002] Workqueue: writeback wb_workfn (flush-ceph-1)
 [  123.148021] RIP: 0010:folios_put_refs+0x4c2/0x600
 [  123.148031] Code: cc c6 db 05 0f 85 19 01 00 00 48 81 c4 b8 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc <0f> 0b e9 1e fe ff ff e8 c2 fe 24 00 e9 da fb ff ff 4c 89 ef e8 b5
 [  123.148035] RSP: 0018:ffff888101c6f228 EFLAGS: 00010246
 [  123.148051] RAX: ffffed102038dea4 RBX: 0000000000000000 RCX: 0000000000000000
 [  123.148057] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888101c6f520
 [  123.148060] RBP: ffff888101c6f308 R08: 0000000000000000 R09: 0000000000000000
 [  123.148063] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 [  123.148066] R13: ffff888101c6f520 R14: 0000000000000000 R15: dffffc0000000000
 [  123.148069] FS:  0000000000000000(0000) GS:ffff88824a034000(0000) knlGS:0000000000000000
 [  123.148072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  123.148075] CR2: 0000700798000020 CR3: 0000000111b6a005 CR4: 0000000000772ef0
 [  123.148082] PKRU: 55555554
 [  123.148085] Call Trace:
 [  123.148088]  <TASK>
 [  123.148093]  ? __pfx_folios_put_refs+0x10/0x10
 [  123.148099]  ? __pfx_filemap_get_folios_tag+0x10/0x10
 [  123.148110]  __folio_batch_release+0x52/0xe0
 [  123.148115]  ceph_writepages_start+0x277a/0x45f0
 [  123.148129]  ? update_load_avg+0x1bd/0x1fe0
 [  123.148145]  ? dequeue_entity+0x3e5/0x1450
 [  123.148151]  ? ata_sff_qc_issue+0x443/0xa90
 [  123.148175]  ? kvm_sched_clock_read+0x11/0x20
 [  123.148198]  ? sched_clock_noinstr+0x9/0x10
 [  123.148203]  ? sched_clock+0x10/0x30
 [  123.148216]  ? __pfx_ceph_writepages_start+0x10/0x10
 [  123.148221]  ? psi_group_change+0x3fa/0x8a0
 [  123.148233]  ? __pfx_sched_clock_cpu+0x10/0x10
 [  123.148238]  ? set_next_entity+0x325/0xb40
 [  123.148245]  ? ncsi_channel_monitor.cold+0x36d/0x553
 [  123.148269]  ? __kasan_check_write+0x14/0x30
 [  123.148283]  ? _raw_spin_lock+0x82/0xf0
 [  123.148293]  ? __pfx__raw_spin_lock+0x10/0x10
 [  123.148298]  do_writepages+0x1e1/0x540
 [  123.148303]  ? do_writepages+0x1e1/0x540
 [  123.148308]  __writeback_single_inode+0xa7/0x940
 [  123.148312]  ? _raw_spin_unlock+0xe/0x40
 [  123.148315]  ? wbc_attach_and_unlock_inode+0x440/0x610
 [  123.148325]  ? __pfx_call_function_single_prep_ipi+0x10/0x10
 [  123.148336]  writeback_sb_inodes+0x563/0xe40
 [  123.148341]  ? __pfx_writeback_sb_inodes+0x10/0x10
 [  123.148348]  ? __pfx_move_expired_inodes+0x10/0x10
 [  123.148360]  __writeback_inodes_wb+0xbe/0x210
 [  123.148364]  wb_writeback+0x4e4/0x6f0
 [  123.148368]  ? __pfx_wb_writeback+0x10/0x10
 [  123.148416]  ? get_nr_dirty_inodes+0xdc/0x1e0
 [  123.148426]  wb_workfn+0x5a9/0xb30
 [  123.148430]  ? __pfx_wb_workfn+0x10/0x10
 [  123.148433]  ? __pfx___schedule+0x10/0x10
 [  123.148438]  ? __pfx__raw_spin_lock_irq+0x10/0x10
 [  123.148442]  process_one_work+0x611/0xe20
 [  123.148448]  ? __kasan_check_write+0x14/0x30
 [  123.148452]  worker_thread+0x7e3/0x1580
 [  123.148456]  ? __pfx_worker_thread+0x10/0x10
 [  123.148458]  kthread+0x381/0x7a0
 [  123.148463]  ? __pfx__raw_spin_lock_irq+0x10/0x10
 [  123.148466]  ? __pfx_kthread+0x10/0x10
 [  123.148468]  ? __kasan_check_write+0x14/0x30
 [  123.148471]  ? recalc_sigpending+0x160/0x220
 [  123.148478]  ? _raw_spin_unlock_irq+0xe/0x50
 [  123.148481]  ? calculate_sigpending+0x78/0xb0
 [  123.148484]  ? __pfx_kthread+0x10/0x10
 [  123.148487]  ret_from_fork+0x285/0x350
 [  123.148490]  ? __pfx_kthread+0x10/0x10
 [  123.148493]  ret_from_fork_asm+0x1a/0x30
 [  123.148499]  </TASK>
 [  123.148501] ---[ end trace 0000000000000000 ]---

The warning is eliminated by applying the suggested fix. The patch has been
tested with xfstests, and no regressions or issues have been detected.

Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

Thanks,
Slava.

[1]
https://lore.kernel.org/all/20250826231626.218675-1-max.kellermann@ionos.com/

