public inbox for linux-kernel@vger.kernel.org
* [RFC/BUG] Use a bounce buffer for mds client decryption
@ 2026-04-15  3:40 Sam Edwards
  2026-04-15 17:16 ` Alex Markuze
  2026-04-15 19:14 ` Viacheslav Dubeyko
  0 siblings, 2 replies; 6+ messages in thread
From: Sam Edwards @ 2026-04-15  3:40 UTC (permalink / raw)
  To: Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko
  Cc: ceph-devel, linux-kernel, Sam Edwards

Hi Ceph list,

This is a combination RFC and bug report. I'm holding off cleaning up the patch
until I get confirmation that this is the right approach. There are some edge
cases stemming from complexities in ceph_fname_to_usr(), making this patch into
more of a shotgun bugfix than I'd like.

I have thoroughly tested this patch: in my live (AArch64) environment, certain
workloads would trigger this particular oops->panic 3-5 times per day. Since I
applied this patch ~2 weeks ago, I have seen no further panics. However, I do
not believe my MDS contains any base64-only dentries, so that path is less
well exercised.

Bug description
---------------

On architectures that implement flush_dcache_page() (most of the non-x86 ones),
scatterwalk_done_dst() will attempt to flush each page of the destination when
a skcipher decryption completes. If passed a destination buffer that isn't in
the linear region, virt_to_page() yields a bogus `struct page *`, leading to a
kernel oops/panic.

For this reason, sg_set_buf() requires a linear buffer (and asserts as much
with a BUG_ON() when the kernel is built with CONFIG_DEBUG_SG). By extension,
an `fscrypt_str`'s `name` must point only to kmalloc() buffers.
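
For reference, sg_set_buf() in include/linux/scatterlist.h is essentially the
following (slightly condensed; exact code varies by kernel version):

```c
static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
			      unsigned int buflen)
{
#ifdef CONFIG_DEBUG_SG
	BUG_ON(!virt_addr_valid(buf));	/* fires for vmalloc addresses */
#endif
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
}
```

Without CONFIG_DEBUG_SG, a vmalloc address sails past the (absent) check and
virt_to_page() silently produces garbage.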

The MDS client's parse_reply_info_readdir() invokes fscrypt in-place, reusing
the message's `front` buffer. However, ceph_msg_new2() allocates the `front`
buffer with kvmalloc(), which attempts kmalloc() (linear region) allocation
first, then falls back to vmalloc() (vmalloc region) if that fails.

This means that when physical memory is fragmented, fscrypt is enabled, MDS
responses are sufficiently large, and the host architecture implements
flush_dcache_page(), the CephFS client will oops/panic.
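
For context, the trap is sprung in fscrypt's fname_decrypt(), which builds
scatterlists directly from the caller's buffers. A condensed, non-compilable
excerpt (from fs/crypto/fname.c; setup and error handling omitted, details
vary by kernel version):

```c
static int fname_decrypt(const struct inode *inode,
			 const struct fscrypt_str *iname,
			 struct fscrypt_str *oname)
{
	struct scatterlist src_sg, dst_sg;
	...
	/* Both buffers must be linear-mapped for virt_to_page() to be valid */
	sg_init_one(&src_sg, iname->name, iname->len);
	sg_init_one(&dst_sg, oname->name, oname->len);
	skcipher_request_set_crypt(req, &src_sg, &dst_sg, iname->len, &iv);
	return crypto_wait_req(crypto_skcipher_decrypt(req), &wait);
}
```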

Reproduction steps
------------------

1. Build the kernel with the following configs enabled:
- CONFIG_DEBUG_SG
  - This enables a BUG_ON() to confirm that only linear virt addresses are
    passed to sg_set_buf(); this is necessary to see the issue on x86
    (on ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE archs like arm/64, MIPS, RISC-V, etc.
     the oops will happen on flush)
- CONFIG_FAULT_INJECTION
- CONFIG_FAILSLAB
- CONFIG_FAULT_INJECTION_DEBUG_FS

2. Create a directory in CephFS, encrypt it with fscrypt, and populate it with
   10 files:
```
cd /path/to/cephfs
mkdir fscrypted
fscrypt encrypt fscrypted
seq 1 10 | while read i; do touch fscrypted/file$i; done
```

3. Enable fault injection on kmalloc, so that 1% of kvmalloc() attempts use the
   vmalloc() fallback:
```
echo -1 > /sys/kernel/debug/failslab/times
echo 0 > /sys/kernel/debug/failslab/interval
echo 1 > /sys/kernel/debug/failslab/probability
```

4. Repeatedly retrieve/decrypt the directory listing:
```
while true; do ls fscrypted >/dev/null; sysctl -q vm.drop_caches=1; done
```

5. Within 3-10 seconds, expect an oops like:
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 1474 Comm: kworker/0:2 Not tainted 7.0.0-rc6 #79 PREEMPT(lazy)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014
Workqueue: ceph-msgr ceph_con_workfn
RIP: 0010:sg_init_one+0x88/0xa0
Code: [[[SNIP SNIP]]]
RSP: 0018:ffffbb730042f618 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffbb73001591b7 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000020 R08: ffffbb73001591b7 R09: ffffbb730042f638
R10: ffff97e7c42cc630 R11: 0000000000000000 R12: ffffbb730042f638
R13: ffff97e7c42cc630 R14: ffffbb730042f658 R15: ffff97e7c3937000
FS:  0000000000000000(0000) GS:ffff97e865a61000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe46ff2f000 CR3: 0000000003bb9000 CR4: 0000000000350eb0
Call Trace:
 <TASK>
 fname_decrypt+0x10b/0x1b0
 fscrypt_fname_disk_to_usr+0x14d/0x1b0
 ceph_fname_to_usr+0x1df/0x2f0
 ? __pfx_crypto_cbc_encrypt+0x10/0x10
 ? crypto_lskcipher_crypt_sg+0xf7/0x140
 ? ceph_aes_crypt+0x1d7/0x2b0
 ? __kmalloc_noprof+0x19c/0x570
 ? ceph_decode_copy+0x13/0x30
 parse_reply_info+0x49d/0x9c0
 mds_dispatch+0x890/0x2000
 ceph_con_process_message+0x72/0x140
 ceph_con_v1_try_read+0xaf9/0x1cc0
 ? put_prev_task_fair+0x1d/0x40
 ? finish_task_switch.isra.0+0x90/0x2c0
 ceph_con_workfn+0x2e0/0x6e0
 process_one_work+0x192/0x3a0
 worker_thread+0x1aa/0x330
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xde/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d7/0x360
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

6. If unable to repro with 10 files, try again with 5, 15, and 20. The MDS
   readdir reply's `front` section needs to be a particular size: if it's one
   page or smaller, kvmalloc() does not attempt the vmalloc() fallback; if it's
   larger than KMALLOC_MAX_CACHE_SIZE (2 pages), kmalloc() doesn't respect
   failslab (apparently an oversight in failslab).
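
For reference, the relevant kvmalloc() fallback logic looks roughly like this
(condensed from mm/util.c; flag handling omitted, details vary by kernel
version):

```c
void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
	void *ret;

	/* First attempt: slab allocation, which is linear-mapped */
	ret = kmalloc_node(size, flags | __GFP_NOWARN, node);

	/* No vmalloc fallback for sub-page requests */
	if (ret || size <= PAGE_SIZE)
		return ret;

	/* Fallback: vmalloc region, where virt_to_page() is not valid */
	return __vmalloc_node(size, 1, flags, node,
			      __builtin_return_address(0));
}
```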

Summary of changes
------------------

All changes are in parse_reply_info_readdir():

- Move the `inode`/`ci` declaration/initialization out of the loop; they are
  loop-invariant anyway, and we need the directory's inode before entering the
  loop.

- Add a `struct fscrypt_str bname` to serve as a bounce buffer.

- If `*p` lies in the vmalloc region, call ceph_fname_alloc_buffer() to
  allocate the bounce buffer. This is very rare (it only happens when
  kmalloc() fails), and the call allocates nothing unless the directory has
  encryption enabled. The bounce buffer is reused for every filename in the
  readdir response, so it is allocated only once, before entering the loop.

- Let oname/ctext/tname be initialized normally, but if the bounce buffer is
  active (indicating it *must* be used):
  a) If there is an `altname`, then this is a binary blob of the ciphertext,
     and ceph_fname_to_usr() will use it directly; copy it into the bounce
     buffer.
  b) If there is no `altname`, then ceph_fname_to_usr() will perform base64
     decoding on `name` into `tname`. The bounce buffer need not be
     initialized, but it's easy enough to have the memcpy() in there
     unconditionally.

- Redirect all potentially-crypted pointers into the bounce buffer. The unused
  ones are ignored by ceph_fname_to_usr():
  - fname.ctext (ciphertext if `altname` is present)
  - tname.name (ciphertext if `altname` is absent; base64 decoder buffer)
  - oname.name (output for the plaintext filename)

- After calling ceph_fname_to_usr(), if it chose to keep `oname.name` pointed
  at the bounce buffer:
  - Restore `oname.name` to point to its original location in the `front`
    buffer (which survives past the end of this function)
  - Copy the plaintext back out of the bounce buffer to the long-lived buffer

- Finally, free the bounce buffer with ceph_fname_free_buffer(), which is a
  no-op if the buffer was never allocated.
---
 fs/ceph/mds_client.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..a583032c0041 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -414,6 +414,9 @@ static int parse_reply_info_readdir(void **p, void *end,
 {
 	struct ceph_mds_reply_info_parsed *info = &req->r_reply_info;
 	struct ceph_client *cl = req->r_mdsc->fsc->client;
+	struct fscrypt_str bname = FSTR_INIT(NULL, 0);
+	struct inode *inode = d_inode(req->r_dentry);
+	struct ceph_inode_info *ci = ceph_inode(inode);
 	u32 num, i = 0;
 	int err;
 
@@ -441,10 +444,18 @@ static int parse_reply_info_readdir(void **p, void *end,
 		goto bad;
 	}
 
+	/*
+	 * `p` points into the message's `front` buffer, which ceph_msg_new2()
+	 * allocates using kvmalloc(), so the buffer may end up outside of the
+	 * kmalloc() region -- but fscrypt_str.name must be in that region, so
+	 * use a bounce buffer in that case
+	 */
+	if (unlikely(is_vmalloc_addr(*p)))
+		if ((err = ceph_fname_alloc_buffer(inode, &bname)))
+			goto out_bad;
+
 	info->dir_nr = num;
 	while (num) {
-		struct inode *inode = d_inode(req->r_dentry);
-		struct ceph_inode_info *ci = ceph_inode(inode);
 		struct ceph_mds_reply_dir_entry *rde = info->dir_entries + i;
 		struct fscrypt_str tname = FSTR_INIT(NULL, 0);
 		struct fscrypt_str oname = FSTR_INIT(NULL, 0);
@@ -514,6 +525,13 @@ static int parse_reply_info_readdir(void **p, void *end,
 			oname.name = altname;
 			oname.len = altname_len;
 		}
+
+		if (bname.name) {
+			BUG_ON(oname.len > bname.len);
+			memcpy(bname.name, oname.name, oname.len);
+			fname.ctext = tname.name = oname.name = bname.name;
+		}
+
 		rde->is_nokey = false;
 		err = ceph_fname_to_usr(&fname, &tname, &oname, &rde->is_nokey);
 		if (err) {
@@ -521,6 +539,12 @@ static int parse_reply_info_readdir(void **p, void *end,
 				      _name_len, _name, err);
 			goto out_bad;
 		}
+
+		if (unlikely(oname.name == bname.name)) {
+			oname.name = (altname_len == 0) ? _name : altname;
+			memcpy(oname.name, bname.name, oname.len);
+		}
+
 		rde->name = oname.name;
 		rde->name_len = oname.len;
 
@@ -537,12 +561,14 @@ static int parse_reply_info_readdir(void **p, void *end,
 done:
 	/* Skip over any unrecognized fields */
 	*p = end;
+	ceph_fname_free_buffer(inode, &bname);
 	return 0;
 
 bad:
 	err = -EIO;
 out_bad:
 	pr_err_client(cl, "problem parsing dir contents %d\n", err);
+	ceph_fname_free_buffer(inode, &bname);
 	return err;
 }
 
-- 
2.52.0

