From: Sam Edwards <cfsworks@gmail.com>
To: Xiubo Li <xiubli@redhat.com>, Ilya Dryomov <idryomov@gmail.com>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>,
Christian Brauner <brauner@kernel.org>,
Milind Changire <mchangir@redhat.com>,
Jeff Layton <jlayton@kernel.org>,
ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
Sam Edwards <CFSworks@gmail.com>
Subject: [PATCH v2 0/6] ceph: CephFS writeback correctness and performance fixes
Date: Wed, 7 Jan 2026 13:01:33 -0800
Message-ID: <20260107210139.40554-1-CFSworks@gmail.com>

Hello list,
This is v2 of my series addressing several interrelated issues in CephFS
writeback: it fixes crashes, improves robustness, and corrects performance
behavior, particularly for fscrypted files. [1]
Changes v1->v2:
- Clarify patch #1's commit message to establish that failures on the first
folio are not possible.
- Add another patch to move the "clean up page array on abort" logic to a new
ceph_discard_page_array() function. (Thanks Slava!)
- Change the wording "grossly degraded performance" to instead read
"correspondingly degraded performance." This makes the causal relationship
clearer (that write throughput is limited much more significantly by write
op/s due to the bug) without making any claims (qualitative or otherwise)
about significance. (Thanks Slava!)
- Reset locked_pages = 0 immediately when the page array is discarded,
simplifying patch #5 ("ceph: Assert writeback loop invariants")
- Reword "as evidenced by the previous two patches which fix oopses" to
"as evidenced by two recent patches which fix oopses" and refer to the
patches by subject (being in the same series, I cannot refer to them by hash)
I received several items of feedback on v1 that I have chosen not to adopt --
mostly because I believe they run contrary to kernel norms around strong
contracts, avoiding redundancy, not masking bugs, and not introducing
regressions. (It is possible that I am mistaken about these norms; I may still
adopt these suggestions in a v3 if someone makes good points in their favor or
consensus overrules me.)
Feedback on v1 not adopted:
- "Patch #1, which fixes a crash in unreachable code, should be reordered after
patch #6 (#5 in v1), which fixes a bug that makes the code unreachable,
in order to simplify crash reproduction in review"
The order of the patchset represents the canonical commit order of the series.
Committing patch #6 before patch #1 would therefore introduce a regression, in
direct violation of longstanding kernel policy.
- "Patch #1 should not swallow errors from move_dirty_folio_in_page_array() if
they happen on the first folio."
It is not possible for move_dirty_folio_in_page_array() to encounter an error
on the first folio, and this is not likely to change in the future. Even if
such an error were to occur, the caller already expects
ceph_process_folio_batch() to "successfully" lock zero pages from time to time.
Swallowing errors in ceph_process_folio_batch() is consistent with its design.
- "Patch #1 should include the call trace and reproduction steps in the commit
message."
The commit message already explains the execution path to the failure, which is
what people really care about (the call trace is just a means to that end).
Due to compiler inlining, the call trace itself is largely opaque about why the
oops happens.
Reproduction steps are not particularly valuable for posterity, but a curious
party in the future can always refer back to mailing list discussions to find
them.
- "Patch #2 should not exist: it removes the return code from
ceph_process_folio_batch(), which is essential to indicate failures to the
caller."
The caller of ceph_process_folio_batch() only cares about success/failure as a
boolean: the exact error code is discarded. The error code is also redundant
with ceph_wbc.locked_pages, which indicates not only success/failure but also
the degree of success (how many pages were locked). That makes
ceph_wbc.locked_pages a more appealing single source of truth than the error
code.
Further, the error return mechanism is misleading to future programmers: it
implies that it is acceptable for ceph_process_folio_batch() to abort the
operation (after already selecting some folios for write). The caller cannot
handle this. Removing the return code altogether makes the contract explicit,
which is the central point of the patch.
- "Patch #5 (#4 in v1) should not introduce BUG_ON() in the writeback path."
The writeback path already contains BUG_ON() checks for several invariants,
consistent with kernel norms: check invariants, fail fast, and don't try to
tolerate ambiguity. Rather than introducing new failure possibilities into the
writeback loop, patch #5 catches violations of existing invariants sooner. Its
purpose is to tighten up the code and prevent regressions, not to fix any
particular bug.
- "Patch #6 (#5 in v1) should include benchmark results to support its claim of improved performance."
My environment is not very representative of a typical Ceph deployment, and
benchmarking is tough to get right. I am not prepared to stand behind any
particular estimated/expected speedup factor. Rather, the rationale for this
patch is a simple computer science principle: increasing the amount of useful
work done per operation reduces total overhead. I have changed the phrasing
"grossly degraded performance" to "correspondingly degraded performance" to
emphasize that the performance degradation follows from the bottleneck, without
implying that I'm making some kind of claim about magnitude.
Warm regards,
Sam
[1]: https://lore.kernel.org/ceph-devel/20251231024316.4643-1-CFSworks@gmail.com/T/
Sam Edwards (6):
ceph: Do not propagate page array emplacement errors as batch errors
ceph: Remove error return from ceph_process_folio_batch()
ceph: Free page array when ceph_submit_write fails
ceph: Split out page-array discarding to a function
ceph: Assert writeback loop invariants
ceph: Fix write storm on fscrypted files
fs/ceph/addr.c | 82 +++++++++++++++++++++++++++++---------------------
1 file changed, 48 insertions(+), 34 deletions(-)
--
2.51.2