public inbox for linux-kernel@vger.kernel.org
* [RFC/BUG] Use a bounce buffer for mds client decryption
@ 2026-04-15  3:40 Sam Edwards
  2026-04-15 17:16 ` Alex Markuze
  2026-04-15 19:14 ` Viacheslav Dubeyko
  0 siblings, 2 replies; 6+ messages in thread
From: Sam Edwards @ 2026-04-15  3:40 UTC (permalink / raw)
  To: Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko
  Cc: ceph-devel, linux-kernel, Sam Edwards

Hi Ceph list,

This is a combination RFC and bug report. I'm holding off cleaning up the patch
until I get confirmation that this is the right approach. There are some edge
cases stemming from complexities in ceph_fname_to_usr(), making this patch into
more of a shotgun bugfix than I'd like.

I have thoroughly tested this patch: in my live (AArch64) environment, certain
workloads would trigger this particular oops->panic 3-5 times per day. Since I
applied this patch ~2 weeks ago, I have seen no further panics. However, I do
not believe my MDS contains any base64-only dentries, so that path is less
well exercised.

Bug description
---------------

On architectures that implement flush_dcache_page() (most of the non-x86 ones),
scatterwalk_done_dst() will attempt to flush each page of the destination when
a skcipher decryption completes. If passed a destination buffer that isn't in
the linear region, virt_to_page() yields a bogus `struct page *`, leading to a
kernel oops/panic.

For this reason, sg_set_buf() requires a linear buffer (and asserts as much
with a BUG_ON() when the kernel is built with CONFIG_DEBUG_SG). By extension,
an `fscrypt_str`'s `name` must point only to kmalloc() buffers.
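
For reference, sg_set_buf() in include/linux/scatterlist.h is essentially the
following (slightly condensed; exact code varies by kernel version):

```c
static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
			      unsigned int buflen)
{
#ifdef CONFIG_DEBUG_SG
	BUG_ON(!virt_addr_valid(buf));	/* fires for vmalloc addresses */
#endif
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
}
```

Without CONFIG_DEBUG_SG, a vmalloc address sails past the (absent) check and
virt_to_page() silently produces garbage.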

The MDS client's parse_reply_info_readdir() invokes fscrypt in-place, reusing
the message's `front` buffer. However, ceph_msg_new2() allocates the `front`
buffer with kvmalloc(), which attempts kmalloc() (linear region) allocation
first, then falls back to vmalloc() (vmalloc region) if that fails.

This means that when physical memory is fragmented, fscrypt is enabled, MDS
responses are sufficiently large, and the host architecture implements
flush_dcache_page(), the CephFS client will oops/panic.
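
For context, the trap is sprung in fscrypt's fname_decrypt(), which builds
scatterlists directly from the caller's buffers. A condensed, non-compilable
excerpt (from fs/crypto/fname.c; setup and error handling omitted, details
vary by kernel version):

```c
static int fname_decrypt(const struct inode *inode,
			 const struct fscrypt_str *iname,
			 struct fscrypt_str *oname)
{
	struct scatterlist src_sg, dst_sg;
	...
	/* Both buffers must be linear-mapped for virt_to_page() to be valid */
	sg_init_one(&src_sg, iname->name, iname->len);
	sg_init_one(&dst_sg, oname->name, oname->len);
	skcipher_request_set_crypt(req, &src_sg, &dst_sg, iname->len, &iv);
	return crypto_wait_req(crypto_skcipher_decrypt(req), &wait);
}
```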

Reproduction steps
------------------

1. Build the kernel with the following configs enabled:
- CONFIG_DEBUG_SG
  - This enables a BUG_ON() to confirm that only linear virt addresses are
    passed to sg_set_buf(); this is necessary to see the issue on x86
    (on ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE archs like arm/64, MIPS, RISC-V, etc.
     the oops will happen on flush)
- CONFIG_FAULT_INJECTION
- CONFIG_FAILSLAB
- CONFIG_FAULT_INJECTION_DEBUG_FS

2. Create a directory in CephFS, encrypt it with fscrypt, and populate it with
   10 files:
```
cd /path/to/cephfs
mkdir fscrypted
fscrypt encrypt fscrypted
seq 1 10 | while read i; do touch fscrypted/file$i; done
```

3. Enable fault injection on kmalloc, so that 1% of kvmalloc() attempts use the
   vmalloc() fallback:
```
echo -1 > /sys/kernel/debug/failslab/times
echo 0 > /sys/kernel/debug/failslab/interval
echo 1 > /sys/kernel/debug/failslab/probability
```

4. Repeatedly retrieve/decrypt the directory listing:
```
while true; do ls fscrypted >/dev/null; sysctl -q vm.drop_caches=1; done
```

5. Within 3-10 seconds, expect an oops like:
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 1474 Comm: kworker/0:2 Not tainted 7.0.0-rc6 #79 PREEMPT(lazy)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014
Workqueue: ceph-msgr ceph_con_workfn
RIP: 0010:sg_init_one+0x88/0xa0
Code: [[[SNIP SNIP]]]
RSP: 0018:ffffbb730042f618 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffbb73001591b7 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000020 R08: ffffbb73001591b7 R09: ffffbb730042f638
R10: ffff97e7c42cc630 R11: 0000000000000000 R12: ffffbb730042f638
R13: ffff97e7c42cc630 R14: ffffbb730042f658 R15: ffff97e7c3937000
FS:  0000000000000000(0000) GS:ffff97e865a61000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe46ff2f000 CR3: 0000000003bb9000 CR4: 0000000000350eb0
Call Trace:
 <TASK>
 fname_decrypt+0x10b/0x1b0
 fscrypt_fname_disk_to_usr+0x14d/0x1b0
 ceph_fname_to_usr+0x1df/0x2f0
 ? __pfx_crypto_cbc_encrypt+0x10/0x10
 ? crypto_lskcipher_crypt_sg+0xf7/0x140
 ? ceph_aes_crypt+0x1d7/0x2b0
 ? __kmalloc_noprof+0x19c/0x570
 ? ceph_decode_copy+0x13/0x30
 parse_reply_info+0x49d/0x9c0
 mds_dispatch+0x890/0x2000
 ceph_con_process_message+0x72/0x140
 ceph_con_v1_try_read+0xaf9/0x1cc0
 ? put_prev_task_fair+0x1d/0x40
 ? finish_task_switch.isra.0+0x90/0x2c0
 ceph_con_workfn+0x2e0/0x6e0
 process_one_work+0x192/0x3a0
 worker_thread+0x1aa/0x330
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xde/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d7/0x360
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

6. If unable to repro with 10 files, try again with 5, 15, and 20. The MDS
   readdir reply's `front` section needs to be a particular size: if it's one
   page or smaller, kvmalloc() does not attempt the vmalloc() fallback; if it's
   larger than KMALLOC_MAX_CACHE_SIZE (2 pages), kmalloc() doesn't respect
   failslab (apparently an oversight in failslab).
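
For reference, the relevant kvmalloc() fallback logic looks roughly like this
(condensed from mm/util.c; flag handling omitted, details vary by kernel
version):

```c
void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
	void *ret;

	/* First attempt: slab allocation, which is linear-mapped */
	ret = kmalloc_node(size, flags | __GFP_NOWARN, node);

	/* No vmalloc fallback for sub-page requests */
	if (ret || size <= PAGE_SIZE)
		return ret;

	/* Fallback: vmalloc region, where virt_to_page() is not valid */
	return __vmalloc_node(size, 1, flags, node,
			      __builtin_return_address(0));
}
```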

Summary of changes
------------------

All changes are in parse_reply_info_readdir():

- Move the `inode`/`ci` declaration/initialization out of the loop; they are
  loop-invariant anyway, and we need the directory's inode before entering the
  loop.

- Add a `struct fscrypt_str bname` to serve as a bounce buffer.

- If `*p` lies in the vmalloc region, call ceph_fname_alloc_buffer() to
  allocate the bounce buffer. This is very rare (it only happens when
  kmalloc() fails), and the call allocates nothing unless the directory has
  encryption enabled. The bounce buffer is reused for every filename in the
  readdir response, so it is allocated only once, before entering the loop.

- Let oname/ctext/tname be initialized normally, but if the bounce buffer is
  active (indicating it *must* be used):
  a) If there is an `altname`, then this is a binary blob of the ciphertext,
     and ceph_fname_to_usr() will use it directly; copy it into the bounce
     buffer.
  b) If there is no `altname`, then ceph_fname_to_usr() will perform base64
     decoding on `name` into `tname`. The bounce buffer need not be
     initialized, but it's easy enough to have the memcpy() in there
     unconditionally.

- Redirect all potentially-crypted pointers into the bounce buffer. The unused
  ones are ignored by ceph_fname_to_usr():
  - fname.ctext (ciphertext if `altname` is present)
  - tname.name (ciphertext if `altname` is absent; base64 decoder buffer)
  - oname.name (output for the plaintext filename)

- After calling ceph_fname_to_usr(), if it chose to keep `oname.name` pointed
  at the bounce buffer:
  - Restore `oname.name` to point to its original location in the `front`
    buffer (which survives past the end of this function)
  - Copy the plaintext back out of the bounce buffer to the long-lived buffer

- Finally, free the bounce buffer with ceph_fname_free_buffer(), which is a
  no-op if the buffer was never allocated.
---
 fs/ceph/mds_client.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..a583032c0041 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -414,6 +414,9 @@ static int parse_reply_info_readdir(void **p, void *end,
 {
 	struct ceph_mds_reply_info_parsed *info = &req->r_reply_info;
 	struct ceph_client *cl = req->r_mdsc->fsc->client;
+	struct fscrypt_str bname = FSTR_INIT(NULL, 0);
+	struct inode *inode = d_inode(req->r_dentry);
+	struct ceph_inode_info *ci = ceph_inode(inode);
 	u32 num, i = 0;
 	int err;
 
@@ -441,10 +444,18 @@ static int parse_reply_info_readdir(void **p, void *end,
 		goto bad;
 	}
 
+	/*
+	 * `p` points into the message's `front` buffer, which ceph_msg_new2()
+	 * allocates using kvmalloc(), so the buffer may end up outside of the
+	 * kmalloc() region -- but fscrypt_str.name must be in that region, so
+	 * use a bounce buffer in that case
+	 */
+	if (unlikely(is_vmalloc_addr(*p)))
+		if ((err = ceph_fname_alloc_buffer(inode, &bname)))
+			goto out_bad;
+
 	info->dir_nr = num;
 	while (num) {
-		struct inode *inode = d_inode(req->r_dentry);
-		struct ceph_inode_info *ci = ceph_inode(inode);
 		struct ceph_mds_reply_dir_entry *rde = info->dir_entries + i;
 		struct fscrypt_str tname = FSTR_INIT(NULL, 0);
 		struct fscrypt_str oname = FSTR_INIT(NULL, 0);
@@ -514,6 +525,13 @@ static int parse_reply_info_readdir(void **p, void *end,
 			oname.name = altname;
 			oname.len = altname_len;
 		}
+
+		if (bname.name) {
+			BUG_ON(oname.len > bname.len);
+			memcpy(bname.name, oname.name, oname.len);
+			fname.ctext = tname.name = oname.name = bname.name;
+		}
+
 		rde->is_nokey = false;
 		err = ceph_fname_to_usr(&fname, &tname, &oname, &rde->is_nokey);
 		if (err) {
@@ -521,6 +539,12 @@ static int parse_reply_info_readdir(void **p, void *end,
 				      _name_len, _name, err);
 			goto out_bad;
 		}
+
+		if (unlikely(oname.name == bname.name)) {
+			oname.name = (altname_len == 0) ? _name : altname;
+			memcpy(oname.name, bname.name, oname.len);
+		}
+
 		rde->name = oname.name;
 		rde->name_len = oname.len;
 
@@ -537,12 +561,14 @@ static int parse_reply_info_readdir(void **p, void *end,
 done:
 	/* Skip over any unrecognized fields */
 	*p = end;
+	ceph_fname_free_buffer(inode, &bname);
 	return 0;
 
 bad:
 	err = -EIO;
 out_bad:
 	pr_err_client(cl, "problem parsing dir contents %d\n", err);
+	ceph_fname_free_buffer(inode, &bname);
 	return err;
 }
 
-- 
2.52.0

