linux-kernel.vger.kernel.org archive mirror
* Re: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
       [not found]   ` <Z6-xg-p_mi3I1aMq@casper.infradead.org>
@ 2025-08-26 22:06     ` Max Kellermann
  2025-08-26 22:33       ` Viacheslav Dubeyko
  0 siblings, 1 reply; 5+ messages in thread
From: Max Kellermann @ 2025-08-26 22:06 UTC (permalink / raw)
  To: Matthew Wilcox, brauner
  Cc: Viacheslav Dubeyko, ceph-devel, idryomov, dhowells, linux-fsdevel,
	pdonnell, amarkuze, Slava.Dubeyko, linux-kernel

On 2025/02/14 22:11, Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Feb 04, 2025 at 04:02:48PM -0800, Viacheslav Dubeyko wrote:
> > This patch implements refactoring of ceph_submit_write()
> > and also it solves the second issue.
> > 
> > Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> 
> This kind of giant refactoring to solve a bug is a really bad idea.
> First, it's going to need to be backported to older kernels.  How far
> back?  You need to identify that with a Fixes: line.
> 
> It's also really hard to review and know whether it's right.  You might
> have introduced several new bugs while doing it.  In general, bugfixes
> first, refactor later.  I *think* this means we can do without 1/7 of
> the patches I resent earlier today, but it's really hard to be sure.

I'm very disappointed that nobody has listened to Matthew's complaint.
Viacheslav has only hand-waved it away with a lazy non-argument.

From Documentation/process/submitting-patches.rst:

 "you should not modify the moved code at all in the same patch which
 moves it"

Obviously, this patch set violates this rule.  There are lots of
semantic/behavior changes in the patches that move code around.

In the end, Christian Brauner merged this into Linux 6.15, and that
merge has wreaked havoc in our production clusters.  We had been
testing 6.15 for a month with no problems (after David Howells had
fixed yet another netfs regression that was stable-backported to 6.15,
ugh!), but when we updated a production cluster, all servers crashed
within a few hours, and our ops team had a very bad night.

This patch set is obviously bad.  It pretends to fix a bug, but really
rewrites almost everything in two patches documented as "introduce XY
method" with no real explanation for why Viacheslav has decided to do
it, instead of just fixing the bug (as Matthew asked him to).

Look at this line modified by patch "ceph: extend ceph_writeback_ctl
for ceph_writepages_start() refactoring":

> +             ceph_wbc.fbatch.folios[i] = NULL;

This sets a folio_batch element to NULL, which will, of course, crash
in folios_put_refs() (but only if the global huge zero page has
already been created).  Fortunately, there's code that removes all
NULL elements from the folio_batch array.  That is code that already
existed before Viacheslav's patch set (code which I already dislike
because it's a fragile mess that is just waiting to crash), and the
code was only being moved around.

Did I mention that I think this is a fragile mess?  Fast-forward to
Viacheslav's patch "ceph: introduce ceph_process_folio_batch() method"
which moves the NULL-setting loop to ceph_process_folio_batch().  Look
at this (untouched) piece of code after the ceph_process_folio_batch()
call:

>   if (i) {
>        unsigned j, n = 0;
>        /* shift unused page to beginning of fbatch */

Shifting only happens if at least one folio has been processed ("i"
was incremented).  But does it really happen?

No, the loop was moved to ceph_process_folio_batch(), and nobody ever
increments "i" anymore.  Voila, crash due to NULL pointer dereference:

 BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0 
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE 
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410

Viacheslav's third patch "ceph: introduce ceph_submit_write() method"
messes up the logic a bit more and makes it more fragile by hiding the
shifting code behind more conditions:

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for any of various
  reasons), shifting never happens

- if processed_in_fbatch is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

If the shifting doesn't happen, the kernel crashes (provided the huge
zero page exists; see above).  Obviously, nobody has ever looked
closely enough at the code.  I'm still new to Linux memory management
and file systems, but these problems were obvious when I first saw
this patch set (which was my candidate for the other 6.15 crashes,
which then turned out to be netfs regressions, not Ceph).

This whole patch set is a huge mess and has caused my team a good
amount of pain.  This could and should have been avoided, had only
somebody listened to Matthew.

(Also look at all those "checkpatch.pl" complaints on all patches in
this patch set.  There are many coding style violations.)

Can we please revert the whole patch set?  I don't think it's possible
to fix all the weird undocumented side effects that may cause more
crashes once we fix this one.  Refactoring the Ceph code sure is
necessary, it's not in a good shape, but it should be done more
carefully.  Some people (like me) depend on Ceph's stability, and this
mess is doing a disservice to Ceph's reputation.

Max

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-26 22:06     ` [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method Max Kellermann
@ 2025-08-26 22:33       ` Viacheslav Dubeyko
  2025-08-26 22:53         ` Max Kellermann
  0 siblings, 1 reply; 5+ messages in thread
From: Viacheslav Dubeyko @ 2025-08-26 22:33 UTC (permalink / raw)
  To: brauner@kernel.org, Patrick Donnelly,
	linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, slava@dubeyko.com, David Howells,
	Alex Markuze, willy@infradead.org, idryomov@gmail.com

On Wed, 2025-08-27 at 00:06 +0200, Max Kellermann wrote:
> On 2025/02/14 22:11, Matthew Wilcox <willy@infradead.org> wrote:
> > On Tue, Feb 04, 2025 at 04:02:48PM -0800, Viacheslav Dubeyko wrote:
> > > This patch implements refactoring of ceph_submit_write()
> > > and also it solves the second issue.
> > > 
> > > Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> > 
> > This kind of giant refactoring to solve a bug is a really bad idea.
> > First, it's going to need to be backported to older kernels.  How far
> > back?  You need to identify that with a Fixes: line.
> > 
> > It's also really hard to review and know whether it's right.  You might
> > have introduced several new bugs while doing it.  In general, bugfixes
> > first, refactor later.  I *think* this means we can do without 1/7 of
> > the patches I resent earlier today, but it's really hard to be sure.
> 
> I'm very disappointed that nobody has listened to Matthew's complaint.
> Viacheslav has only hand-waved it away with a lazy non-argument.
> 
> From Documentation/process/submitting-patches.rst:
> 
>  "you should not modify the moved code at all in the same patch which
>  moves it"
> 
> Obviously, this patch set violates this rule.  There are lots of
> semantic/behavior changes in the patches that move code around.
> 
> In the end, Christian Brauner merged this into Linux 6.15, and that
> merge has wreaked havoc in our production clusters.  We had been
> testing 6.15 for a month with no problems (after David Howells had
> fixed yet another netfs regression that was stable-backported to 6.15,
> ugh!), but when we updated a production cluster, all servers crashed
> within a few hours, and our ops team had a very bad night.
> 
> This patch set is obviously bad.  It pretends to fix a bug, but really
> rewrites almost everything in two patches documented as "introduce XY
> method" with no real explanation for why Viacheslav has decided to do
> it, instead of just fixing the bug (as Matthew asked him to).
> 
> Look at this line modified by patch "ceph: extend ceph_writeback_ctl
> for ceph_writepages_start() refactoring":
> 
> > +             ceph_wbc.fbatch.folios[i] = NULL;
> 
> This sets a folio_batch element to NULL, which will, of course, crash
> in folios_put_refs() (but only if the global huge zero page has
> already been created).  Fortunately, there's code that removes all
> NULL elements from the folio_batch array.  That is code that already
> existed before Viacheslav's patch set (code which I already dislike
> because it's a fragile mess that is just waiting to crash), and the
> code was only being moved around.
> 
> Did I mention that I think this is a fragile mess?  Fast-forward to
> Viacheslav's patch "ceph: introduce ceph_process_folio_batch() method"
> which moves the NULL-setting loop to ceph_process_folio_batch().  Look
> at this (untouched) piece of code after the ceph_process_folio_batch()
> call:
> 
> >   if (i) {
> >        unsigned j, n = 0;
> >        /* shift unused page to beginning of fbatch */
> 
> Shifting only happens if at least one folio has been processed ("i"
> was incremented).  But does it really happen?
> 
> No, the loop was moved to ceph_process_folio_batch(), and nobody ever
> increments "i" anymore.  Voila, crash due to NULL pointer dereference:
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000034
>  #PF: supervisor write access in kernel mode
>  #PF: error_code(0x0002) - not-present page
>  PGD 0 P4D 0 
>  Oops: Oops: 0002 [#1] SMP NOPTI
>  CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE 
>  Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
>  Workqueue: writeback wb_workfn (flush-ceph-1)
>  RIP: 0010:folios_put_refs+0x85/0x140
>  Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
>  RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
>  RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
>  RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
>  RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
>  R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
>  R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
>  FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
>  PKRU: 55555554
>  Call Trace:
>   <TASK>
>   ceph_writepages_start+0xeb9/0x1410
> 
> Viacheslav's third patch "ceph: introduce ceph_submit_write() method"
> messes up the logic a bit more and makes it more fragile by hiding the
> shifting code behind more conditions:
> 
> - if ceph_process_folio_batch() fails, shifting never happens
> 
> - if ceph_move_dirty_page_in_page_array() was never called (because
>   ceph_process_folio_batch() has returned early for any of various
>   reasons), shifting never happens
> 
> - if processed_in_fbatch is zero (because ceph_process_folio_batch()
>   has returned early for some of the reasons mentioned above or
>   because ceph_move_dirty_page_in_page_array() has failed), shifting
>   never happens
> 
> If the shifting doesn't happen, the kernel crashes (provided the huge
> zero page exists; see above).  Obviously, nobody has ever looked
> closely enough at the code.  I'm still new to Linux memory management
> and file systems, but these problems were obvious when I first saw
> this patch set (which was my candidate for the other 6.15 crashes,
> which then turned out to be netfs regressions, not Ceph).
> 
> This whole patch set is a huge mess and has caused my team a good
> amount of pain.  This could and should have been avoided, had only
> somebody listened to Matthew.
> 
> (Also look at all those "checkpatch.pl" complaints on all patches in
> this patch set.  There are many coding style violations.)
> 
> Can we please revert the whole patch set?  I don't think it's possible
> to fix all the weird undocumented side effects that may cause more
> crashes once we fix this one.  Refactoring the Ceph code sure is
> necessary, it's not in a good shape, but it should be done more
> carefully.  Some people (like me) depend on Ceph's stability, and this
> mess is doing a disservice to Ceph's reputation.

Of course, we can revert any patch. This patchset was sent not with the
goal of pure refactoring but to fix several bugs. Reverting means bringing
these bugs back. This patchset was available for review for a long time. So,
anybody was able to review it and to share comments. I am always happy to
rework and to make any patch better. From my point of view, reverting is not
the answer, and it makes sense to continue fixing bugs and to make the CephFS
code more stable.

Thanks,
Slava.

* Re: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-26 22:33       ` Viacheslav Dubeyko
@ 2025-08-26 22:53         ` Max Kellermann
  2025-08-27  3:06           ` Viacheslav Dubeyko
  0 siblings, 1 reply; 5+ messages in thread
From: Max Kellermann @ 2025-08-26 22:53 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: brauner@kernel.org, Patrick Donnelly,
	linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, slava@dubeyko.com, David Howells,
	Alex Markuze, willy@infradead.org, idryomov@gmail.com

On 2025/08/27 00:33, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> Of course, we can revert any patch. This patchset was sent not with the
> goal of pure refactoring but to fix several bugs. Reverting means bringing
> these bugs back.

You should have listened to Matthew and submitted separate minimal
bug-fixing patches, instead of posting huge patches which move code
around, change semantics and, buried somewhere deep inside, fix some
bug (while introducing new bugs).

> This patchset was available for review for a long time.

There was exactly one review, and no, you were not "happy to rework
and to make any patch better" - you openly rejected Matthew's
review.

> From my point of view, reverting is not the answer, and it makes sense
> to continue fixing bugs and to make the CephFS code more stable.

Your argument may sound right, but it is detached from the reality
I'm living in.

Your patches made Ceph less stable.  6.14 had one Ceph-related crash
every other week, but 6.15 with your patches made all servers crash
within hours.

The point is: the Linux kernel was better off without your patches.
Your patches may have fixed a bug, but they have introduced a dozen
new bugs, including one that very quickly crashes the whole kernel,
one that was obvious enough, yet nobody cared to read deeply after
you rejected Matthew's review.  Too bad no maintainer stopped you!

Of course, the bug that was fixed by your patch set should be fixed -
but not the way you did it.  Every aspect of your approach to fixing
the bug was bad.

The best way forward for you would be to revert this patch set and
write a minimal patch that only fixes the bug.  If you want to be
helpful here, please give this a try.

Max

* RE: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-26 22:53         ` Max Kellermann
@ 2025-08-27  3:06           ` Viacheslav Dubeyko
  2025-08-27  4:57             ` Max Kellermann
  0 siblings, 1 reply; 5+ messages in thread
From: Viacheslav Dubeyko @ 2025-08-27  3:06 UTC (permalink / raw)
  To: David Howells, brauner@kernel.org, slava@dubeyko.com,
	ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Alex Markuze, Patrick Donnelly, willy@infradead.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

On Wed, 2025-08-27 at 00:53 +0200, Max Kellermann wrote:
> On 2025/08/27 00:33, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > Of course, we can revert any patch. This patchset was sent not with the
> > goal of pure refactoring but to fix several bugs. Reverting means bringing
> > these bugs back.
> 
> You should have listened to Matthew and submitted separate minimal
> bug-fixing patches, instead of posting huge patches which move code
> around, change semantics and, buried somewhere deep inside, fix some
> bug (while introducing new bugs).
> 
> > This patchset was available for review for a long time.
> 
> There was exactly one review, and no, you were not "happy to rework
> and to make any patch better" - you openly rejected Matthew's
> review.
> 
> > From my point of view, reverting is not the answer, and it makes sense
> > to continue fixing bugs and to make the CephFS code more stable.
> 
> Your argument may sound right, but it is detached from the reality
> I'm living in.
> 
> Your patches made Ceph less stable.  6.14 had one Ceph-related crash
> every other week, but 6.15 with your patches made all servers crash
> within hours.
> 
> The point is: the Linux kernel was better off without your patches.
> Your patches may have fixed a bug, but they have introduced a dozen
> new bugs, including one that very quickly crashes the whole kernel,
> one that was obvious enough, yet nobody cared to read deeply after
> you rejected Matthew's review.  Too bad no maintainer stopped you!
> 
> Of course, the bug that was fixed by your patch set should be fixed -
> but not the way you did it.  Every aspect of your approach to fixing
> the bug was bad.
> 
> The best way forward for you would be to revert this patch set and
> write a minimal patch that only fixes the bug.  If you want to be
> helpful here, please give this a try.
> 
> 

This patchset has been tested with xfstests, and no issues were triggered. We
do not have enough Ceph-related test cases. So, you are welcome to add a
test case for the issue.

This issue was reported more than a month ago. I tried to reproduce it, but I
had no luck. So, if you are lucky enough, then simply share the patch or the
way to reproduce the issue. The patchset simply moved old code into dedicated
functions with the goal of managing complexity. So, the old code is still
there, and the patchset cannot have introduced "a dozen new bugs". Otherwise,
xfstests could easily reproduce this gazillion of bugs. :) Currently, we have
only one reported issue.

Open source is a space for collaboration and mutual respect. It is not a
space for offensive or bullying behavior.

Thanks,
Slava.

* Re: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-27  3:06           ` Viacheslav Dubeyko
@ 2025-08-27  4:57             ` Max Kellermann
  0 siblings, 0 replies; 5+ messages in thread
From: Max Kellermann @ 2025-08-27  4:57 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: David Howells, brauner@kernel.org, slava@dubeyko.com,
	ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Alex Markuze, Patrick Donnelly, willy@infradead.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

On 2025/08/27 05:06, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> Open source is a space for collaboration and mutual respect. It is not a
> space for offensive or bullying behavior.

This is now your third reply to a comment on your patch set (one to
Matthew and two to me), and you are still sidestepping the valid
criticism.  This time, instead of talking about how to fix it, you are
explaining how well you tested it, followed by this meta-comment.

No, for the bug discussion, it does not matter how well you tested
this.  If it's buggy, it's buggy.  For me, as an amateur, the bug was
obvious enough to see in the source code, and after my long
explanation, it should be easy enough for everybody else to see,
shouldn't it?  And no, I don't owe you a test case.  I explained the
exact conditions that need to occur for the crash to happen, and I
trust that you are capable of easily injecting such an error, aren't
you?

This is not what I call "collaboration", nor "respect"; to me it
rather looks like ignorance.  You were not collaborating with the
reviewer, and you are not respecting the bug report.  You just divert
the discussion instead of talking about the actual problems.


And, note: in both of your replies to me, you removed me from the
recipient list.  Everybody else you kept, you only removed me.  I had
to look up your replies on lore.kernel.org.  (You did not do that with
your replies to Matthew and David.)

What kind of funny behavior is that?

Max

end of thread, other threads:[~2025-08-27  4:57 UTC | newest]

Thread overview: 5+ messages
-- links below jump to the message on this page --
     [not found] <20250205000249.123054-1-slava@dubeyko.com>
     [not found] ` <20250205000249.123054-4-slava@dubeyko.com>
     [not found]   ` <Z6-xg-p_mi3I1aMq@casper.infradead.org>
2025-08-26 22:06     ` [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method Max Kellermann
2025-08-26 22:33       ` Viacheslav Dubeyko
2025-08-26 22:53         ` Max Kellermann
2025-08-27  3:06           ` Viacheslav Dubeyko
2025-08-27  4:57             ` Max Kellermann
