* [RFC PATCH] libceph: Handle sparse-read replies lacking data length
From: Sam Edwards @ 2026-01-13  3:31 UTC
To: Xiubo Li, Ilya Dryomov, Jeff Layton; +Cc: ceph-devel, linux-kernel, Sam Edwards

When the OSD replies to a sparse-read request, but no extents matched
the read (because the object is empty, the read requested a region
backed by no extents, ...) it is expected to reply with two 32-bit
zeroes: one indicating that there are no extents, the other that the
total bytes read is zero.

In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
is in an EC pool), the OSD sends back only one 32-bit zero. The
sparse-read state machine will end up reading something else (such as
the data CRC in the footer) and get stuck in a retry loop like:

  libceph: [0] got 0 extents
  libceph: data len 142248331 != extent len 0
  libceph: osd0 (1)...:6801 socket error on read
  libceph: data len 142248331 != extent len 0
  libceph: osd0 (1)...:6801 socket error on read

This is probably a bug in the OSD, but even so, the kernel must handle
it to avoid misinterpreting replies and entering a retry loop.

Detect this condition when the extent count is zero by checking the
`payload_len` field of the op reply. If it is only big enough for the
extent count, conclude that the data length is omitted and skip to the
next op (which is what the state machine would have done immediately
upon reading and validating the data length, if it were present).

---

Hi list,

RFC: This patch is submitted for comment only. I've tested it for about
2 weeks now and am satisfied that it prevents the hang, but the current
approach decodes the entire op reply body while still in the
data-gathering step, which is suboptimal; feedback on cleaner
alternatives is welcome!

I have not searched for nor opened a report with Ceph proper; I'd like a
second pair of eyes to confirm that this is indeed an OSD bug before I
proceed with that.

Reproducer (Ceph 19.2.3, CephFS with an EC pool already created):
  mount -o sparseread ... /mnt/cephfs
  cd /mnt/cephfs
  mkdir ec/
  setfattr -n ceph.dir.layout.pool -v 'cephfs-data-ecpool' ec/
  echo 'Hello world' > ec/sparsely-packed
  truncate -s 1048576 ec/sparsely-packed
  # Read from a hole-backed region via sparse read
  dd if=ec/sparsely-packed bs=16 skip=10000 count=1 iflag=direct | xxd
  # The read hangs and triggers the retry loop described in the patch

Hope this works,
Sam

PS: I would also like to write a pair of patches to our messenger v1/v2
clients to check explicitly that sparse reads consume exactly the number
of bytes in the data section, as I see there have already been previous
bugs (including CVE-2023-52636) where the sparse-read machinery gets out
of sync with the incoming TCP stream. Has this already been proposed?

---
 net/ceph/osd_client.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1a7be2f615dc..e9e898a2415f 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5840,7 +5840,25 @@ static int osd_sparse_read(struct ceph_connection *con,
 			sr->sr_state = CEPH_SPARSE_READ_DATA_LEN;
 			break;
 		}
-		/* No extents? Read data len */
+
+		/*
+		 * No extents? Read data len (which we expect is 0) if present.
+		 *
+		 * Sometimes the OSD will omit this for zero-extent replies
+		 * (e.g. in Ceph 19.2.3 when the object is in an EC pool) which
+		 * is likely a bug in the OSD, but nonetheless we must handle
+		 * it to avoid misinterpreting the reply.
+		 */
+		struct MOSDOpReply m;
+		ret = decode_MOSDOpReply(con->in_msg, &m);
+		if (ret)
+			return ret;
+		if (m.outdata_len[o->o_sparse_op_idx] == sizeof(sr->sr_count)) {
+			dout("[%d] missing data length\n", o->o_osd);
+			sr->sr_state = CEPH_SPARSE_READ_HDR;
+			goto next_op;
+		}
+
 		fallthrough;
 	case CEPH_SPARSE_READ_DATA_LEN:
 		convert_extent_map(sr);
-- 
2.51.2
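For readers less familiar with the reply layout, the zero-extent case described in the commit message can be illustrated with a small userspace sketch. This is not kernel code and not the patch's method: the enum, `get_le32()` and `classify_zero_extent_reply()` are hypothetical names made up for illustration, and the sketch assumes only what the commit message states, namely that a well-formed zero-extent payload is two little-endian 32-bit zeroes (extent count, then data length) while the problematic form carries the extent count alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a zero-extent sparse-read payload. */
enum zero_extent_reply {
	REPLY_COMPLETE,         /* extent count 0, then data length 0 */
	REPLY_MISSING_DATA_LEN, /* extent count 0 and nothing after it */
	REPLY_UNEXPECTED,       /* anything this sketch does not cover */
};

/* Decode a little-endian 32-bit value from a byte buffer. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Decide from the payload length alone whether the data length field
 * follows a zero extent count. Payloads with extents are out of scope.
 */
static enum zero_extent_reply
classify_zero_extent_reply(const uint8_t *payload, size_t payload_len)
{
	assert(payload != NULL);

	if (payload_len < sizeof(uint32_t) || get_le32(payload) != 0)
		return REPLY_UNEXPECTED;
	if (payload_len == sizeof(uint32_t))
		return REPLY_MISSING_DATA_LEN;
	if (payload_len == 2 * sizeof(uint32_t) && get_le32(payload + 4) == 0)
		return REPLY_COMPLETE;
	return REPLY_UNEXPECTED;
}
```

The second branch corresponds to the `outdata_len == sizeof(sr->sr_count)` comparison in the patch: a payload exactly as long as the extent count leaves no room for a data length.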
* Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length
From: Ilya Dryomov @ 2026-01-13 17:26 UTC
To: Sam Edwards; +Cc: Xiubo Li, Jeff Layton, ceph-devel, linux-kernel

On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@gmail.com> wrote:
>
> When the OSD replies to a sparse-read request, but no extents matched
> the read (because the object is empty, the read requested a region
> backed by no extents, ...) it is expected to reply with two 32-bit
> zeroes: one indicating that there are no extents, the other that the
> total bytes read is zero.
>
> In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> is in an EC pool), the OSD sends back only one 32-bit zero. The
> sparse-read state machine will end up reading something else (such as
> the data CRC in the footer) and get stuck in a retry loop like:
>
>   libceph: [0] got 0 extents
>   libceph: data len 142248331 != extent len 0
>   libceph: osd0 (1)...:6801 socket error on read
>   libceph: data len 142248331 != extent len 0
>   libceph: osd0 (1)...:6801 socket error on read
>
> This is probably a bug in the OSD, but even so, the kernel must handle
> it to avoid misinterpreting replies and entering a retry loop.

Hi Sam,

Yes, this is definitely a bug in the OSD (and I also see another
related bug in the userspace client code above the OSD...). The
triggering condition is a sparse read beyond the end of an existing
object on an EC pool. 19.2.3 isn't the problem -- main branch is
affected as well.

If this was one of the common paths, I'd support adding some sort of
a workaround to "handle" this in the kernel client. However, sparse
reads are pretty useless on EC pools because they just get converted
into regular thick reads. Sparse reads offer potential benefits only
on replicated pools, but the kernel client doesn't use them by default
there either. The sparseread mount option that is necessary for the
reproducer to work isn't documented and was added purely for testing
purposes.

>
> Detect this condition when the extent count is zero by checking the
> `payload_len` field of the op reply. If it is only big enough for the
> extent count, conclude that the data length is omitted and skip to the
> next op (which is what the state machine would have done immediately
> upon reading and validating the data length, if it were present).
>
> ---
>
> Hi list,
>
> RFC: This patch is submitted for comment only. I've tested it for about
> 2 weeks now and am satisfied that it prevents the hang, but the current
> approach decodes the entire op reply body while still in the
> data-gathering step, which is suboptimal; feedback on cleaner
> alternatives is welcome!
>
> I have not searched for nor opened a report with Ceph proper; I'd like a
> second pair of eyes to confirm that this is indeed an OSD bug before I
> proceed with that.

Let me know if you want me to file a Ceph tracker ticket on your
behalf. I have a draft patch for the bug in the OSD and would link it
in the PR, crediting you as a reporter.

>
> Reproducer (Ceph 19.2.3, CephFS with an EC pool already created):
>   mount -o sparseread ... /mnt/cephfs
>   cd /mnt/cephfs
>   mkdir ec/
>   setfattr -n ceph.dir.layout.pool -v 'cephfs-data-ecpool' ec/
>   echo 'Hello world' > ec/sparsely-packed
>   truncate -s 1048576 ec/sparsely-packed
>   # Read from a hole-backed region via sparse read
>   dd if=ec/sparsely-packed bs=16 skip=10000 count=1 iflag=direct | xxd
>   # The read hangs and triggers the retry loop described in the patch
>
> Hope this works,
> Sam
>
> PS: I would also like to write a pair of patches to our messenger v1/v2
> clients to check explicitly that sparse reads consume exactly the number
> of bytes in the data section, as I see there have already been previous
> bugs (including CVE-2023-52636) where the sparse-read machinery gets out
> of sync with the incoming TCP stream. Has this already been proposed?

Not that I'm aware of. An additional safety net would be welcome as
long as it doesn't end up too invasive of course.

Thanks,

                Ilya
* Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length
From: Sam Edwards @ 2026-01-13 19:04 UTC
To: Ilya Dryomov; +Cc: Xiubo Li, Jeff Layton, ceph-devel, linux-kernel

On Tue, Jan 13, 2026 at 9:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> Hi Sam,

Hey Ilya,

> Yes, this is definitely a bug in the OSD (and I also see another
> related bug in the userspace client code above the OSD...). The
> triggering condition is a sparse read beyond the end of an existing
> object on an EC pool. 19.2.3 isn't the problem -- main branch is
> affected as well.
>
> If this was one of the common paths, I'd support adding some sort of
> a workaround to "handle" this in the kernel client. However, sparse
> reads are pretty useless on EC pools because they just get converted
> into regular thick reads. Sparse reads offer potential benefits only
> on replicated pools, but the kernel client doesn't use them by default
> there either. The sparseread mount option that is necessary for the
> reproducer to work isn't documented and was added purely for testing
> purposes.

Note that the kernel client forces sparse reads when using fscrypt
(see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
organically as a result. It may still make sense to apply a kernel
workaround.

On the other hand, it sounds like fscrypt+EC is a niche corner case,
we've now established that the OSD is definitely not following the
protocol, and working around this client-side is more involved than
just fixing this in the OSD. So I think simply telling affected users
to update their OSDs is also a reasonable way to handle this.

I'll defer to you.

> Let me know if you want me to file a Ceph tracker ticket on your
> behalf. I have a draft patch for the bug in the OSD and would link it
> in the PR, crediting you as a reporter.

Please do! I'm also interested in seeing the patch -- the OSD code is
pretty dense and I couldn't find the EC sparse read handler.

> > PS: I would also like to write a pair of patches to our messenger v1/v2
> > clients to check explicitly that sparse reads consume exactly the number
> > of bytes in the data section, as I see there have already been previous
> > bugs (including CVE-2023-52636) where the sparse-read machinery gets out
> > of sync with the incoming TCP stream. Has this already been proposed?
>
> Not that I'm aware of. An additional safety net would be welcome as
> long as it doesn't end up too invasive of course.

Time permitting, I'll see about fixing read_partial_message() to use
con->v1.in_base_pos consistently, use that to count data bytes consumed
in sparse reads, and fail with a more specific error_msg when a length
mismatch is detected. (I do not have a plan for messenger v2 yet.)

Regards,
Sam
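The byte-accounting safety net discussed above could look roughly like the following userspace sketch. It is only an illustration of the idea, not the planned messenger change: `struct data_budget`, `budget_init()`, `budget_consume()` and `budget_finish()` are hypothetical names, and the sketch assumes that the extent count, extent map and data length are all counted against the message's data section, which is what the retry-loop log suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tracker for bytes consumed from a message's data section. */
struct data_budget {
	size_t expected;       /* data section length announced by the header */
	size_t consumed;       /* bytes the sparse-read parser has asked for */
	const char *error_msg; /* set when a mismatch is detected */
};

static void budget_init(struct data_budget *b, size_t expected)
{
	b->expected = expected;
	b->consumed = 0;
	b->error_msg = NULL;
}

/* Account for a read of n bytes; fail if it would overrun the section. */
static int budget_consume(struct data_budget *b, size_t n)
{
	assert(b->consumed <= b->expected);
	if (n > b->expected - b->consumed) {
		b->error_msg = "sparse read overran data section";
		return -1;
	}
	b->consumed += n;
	return 0;
}

/* At the end of the message, every data section byte must be consumed. */
static int budget_finish(struct data_budget *b)
{
	if (b->consumed != b->expected) {
		b->error_msg = "sparse read left data section bytes unread";
		return -1;
	}
	return 0;
}
```

With a 4-byte data section (extent count only), consuming the count succeeds and the attempt to read a 4-byte data length fails immediately with a specific message, instead of silently reading into the footer.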
* Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length
From: Ilya Dryomov @ 2026-01-13 20:15 UTC
To: Sam Edwards; +Cc: Xiubo Li, Jeff Layton, ceph-devel, linux-kernel

On Tue, Jan 13, 2026 at 8:04 PM Sam Edwards <cfsworks@gmail.com> wrote:
> Note that the kernel client forces sparse reads when using fscrypt
> (see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
> organically as a result. It may still make sense to apply a kernel
> workaround.
>
> On the other hand, it sounds like fscrypt+EC is a niche corner case,
> we've now established that the OSD is definitely not following the
> protocol, and working around this client-side is more involved than
> just fixing this in the OSD. So I think simply telling affected users
> to update their OSDs is also a reasonable way to handle this.

fscrypt and EC can't be mixed -- fscrypt+EC doesn't really work. The
reason sparse reads are forced for fscrypt is that the client relies on
the sparseness metadata to be able to tell if a given 4K block in the
encrypted file is a hole (in the PUNCH_HOLE sense) or not. If it's
a hole, POSIX dictates that a read should return zeroes. On an EC pool
where sparse reads are degraded into regular thick reads by the OSD,
a hole in the middle of an object wouldn't ever be signaled. Instead,
the OSD would synthesize a bunch of zeroes and pass them to the client.
The client would then run them through the crypto engine (believing
it's a bona fide ciphertext) and return the resulting gibberish to the
user, thus violating POSIX and widespread assumptions about generic
filesystem behavior.

> Please do! I'm also interested in seeing the patch -- the OSD code is
> pretty dense and I couldn't find the EC sparse read handler.

https://github.com/ceph/ceph/pull/66912

Thanks,

                Ilya
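The hole-detection argument above can be made concrete with a short userspace sketch. It does not reproduce the kernel's fscrypt read path; `struct extent`, `CRYPT_BLOCK_SIZE` and `block_is_hole()` are hypothetical names, and the only assumptions are those stated in the message: the client looks at the extent map returned by a sparse read to decide whether a 4K block is a hole, and an EC pool answers with a thick read, which the sketch models as a single extent covering the whole requested range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CRYPT_BLOCK_SIZE 4096u /* the "4K block" from the discussion */

/* Hypothetical extent record: a run of bytes that is backed by data. */
struct extent {
	uint64_t off;
	uint64_t len;
};

/*
 * A block is treated as a hole only if no extent overlaps it. A hole
 * must be returned as zeroes and never passed to decryption.
 */
static bool block_is_hole(const struct extent *map, size_t count,
			  uint64_t block_off)
{
	uint64_t block_end = block_off + CRYPT_BLOCK_SIZE;

	assert(map != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		uint64_t ext_end = map[i].off + map[i].len;

		if (map[i].len > 0 && map[i].off < block_end &&
		    block_off < ext_end)
			return false;
	}
	return true;
}
```

With a genuinely sparse map such as `{0, 4096}`, the block at offset 4096 is reported as a hole. If the same range comes back as one thick extent `{0, 8192}` filled with synthesized zeroes, the same block looks like data, so it would be decrypted and turn into gibberish, which is the failure mode described above.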
* Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length
From: Sam Edwards @ 2026-01-14  1:28 UTC
To: Ilya Dryomov; +Cc: Xiubo Li, Jeff Layton, ceph-devel, linux-kernel

On Tue, Jan 13, 2026 at 12:15 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> fscrypt and EC can't be mixed -- fscrypt+EC doesn't really work. The
> reason sparse reads are forced for fscrypt is that the client relies on
> the sparseness metadata to be able to tell if a given 4K block in the
> encrypted file is a hole (in the PUNCH_HOLE sense) or not. If it's
> a hole, POSIX dictates that a read should return zeroes. On an EC pool
> where sparse reads are degraded into regular thick reads by the OSD,
> a hole in the middle of an object wouldn't ever be signaled. Instead,
> the OSD would synthesize a bunch of zeroes and pass them to the client.
> The client would then run them through the crypto engine (believing
> it's a bona fide ciphertext) and return the resulting gibberish to the
> user, thus violating POSIX and widespread assumptions about generic
> filesystem behavior.

Oof, thanks for the heads-up! Fortunately my workload tolerates
garbage in holes... with the occasional (now-explained) warning, that
is. :)

I don't see the fscrypt+EC limitation mentioned in the kernel nor Ceph
docs, so I'm guessing this is more a "known major limitation" than an
out-of-scope use case.

The CephFS client already blocks PUNCH_HOLE for encrypted inodes, but
by writing into the middle of an empty object, I was able to form a
hole organically and reproduce the garbage you describe.

EC is complex, so I wouldn't have been surprised if it simply didn't
have a way to store objects with holes at all. But I was caught off
guard to learn that the hard part of this problem is communicating the
hole to the client. My intuition was that the read path must already be
detecting "no data here" in order to synthesize filler zeroes, but it
sounds like that information doesn't survive as explicit metadata.
Clearly I have more to learn about the EC read pipeline.

Cheers,
Sam
* Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length
From: Ilya Dryomov @ 2026-01-14 10:23 UTC
To: Sam Edwards; +Cc: Xiubo Li, Jeff Layton, ceph-devel, linux-kernel

On Wed, Jan 14, 2026 at 2:28 AM Sam Edwards <cfsworks@gmail.com> wrote:
> Oof, thanks for the heads-up! Fortunately my workload tolerates
> garbage in holes... with the occasional (now-explained) warning, that
> is. :)
>
> I don't see the fscrypt+EC limitation mentioned in the kernel nor Ceph
> docs, so I'm guessing this is more a "known major limitation" than an
> out-of-scope use case.

Correct, it's tracked under https://tracker.ceph.com/issues/67507.

Thanks,

                Ilya