public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] nvme: fix memory corruption for passthrough metadata
       [not found] <CGME20231006135322epcas5p1c9acf38b04f35017181c715c706281dc@epcas5p1.samsung.com>
@ 2023-10-06 13:47 ` Kanchan Joshi
  2023-10-10  7:46   ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Kanchan Joshi @ 2023-10-06 13:47 UTC (permalink / raw)
  To: hch, kbusch, axboe, sagi
  Cc: linux-nvme, vincentfu, ankit.kumar, joshiiitr, cpgs,
	Kanchan Joshi, stable, Vincent Fu

User can specify a smaller meta buffer than what the device is
wired to update/access. Kernel makes a copy of the meta buffer into
which the device does DMA.
As a result, the device overwrites the unrelated kernel memory, causing
random kernel crashes.

Same issue is possible for extended-lba case also. When user specifies a
short unaligned buffer, the kernel makes a copy and uses that for DMA.

Detect these situations for sync/uring passthrough, and bail out.

Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device")
Cc: stable@vger.kernel.org

Reported-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---

Changes since v2:
- Handle extended-lba case: short unaligned buffer IO
- Reduce the scope of check to only well-known commands
- Do not check early. Move it deeper so that check gets executed less
  often
- Combine two patches into one.

Changes since v1:
- Revise the check to exclude PRACT=1 case

Random crash example:

[ 6815.014478] general protection fault, probably for non-canonical address 0x70e3cdbe9133b7a6: 0000 [#1] PREEMPT SMP PTI
[ 6815.014505] CPU: 1 PID: 434 Comm: systemd-timesyn Tainted: G           OE      6.4.0-rc3+ #5
[ 6815.014516] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 6815.014522] RIP: 0010:__kmem_cache_alloc_node+0x100/0x440
[ 6815.014551] Code: 48 85 c0 0f 84 fb 02 00 00 41 83 ff ff 74 10 48 8b 00 48 c1 e8 36 41 39 c7 0f 85 e5 02 00 00 41 8b 45 28 49 8b 7d 00 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9d b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
[ 6815.014559] RSP: 0018:ffffb510c0577d18 EFLAGS: 00010216
[ 6815.014569] RAX: 70e3cdbe9133b7a6 RBX: ffff8a9ec1042300 RCX: 0000000000000010
[ 6815.014575] RDX: 00000000048b0001 RSI: 0000000000000dc0 RDI: 0000000000037060
[ 6815.014581] RBP: ffffb510c0577d58 R08: ffffffffb9ffa280 R09: 0000000000000000
[ 6815.014586] R10: ffff8a9ecbcab1f0 R11: 0000000000000000 R12: 70e3cdbe9133b79e
[ 6815.014591] R13: ffff8a9ec1042300 R14: 0000000000000dc0 R15: 00000000ffffffff
[ 6815.014597] FS:  00007fce590d6940(0000) GS:ffff8a9f3dd00000(0000) knlGS:0000000000000000
[ 6815.014604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6815.014609] CR2: 00005579abbb6498 CR3: 000000000d9b0000 CR4: 00000000000006e0
[ 6815.014622] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6815.014627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6815.014632] Call Trace:
[ 6815.014650]  <TASK>
[ 6815.014655]  ? apparmor_sk_alloc_security+0x40/0x80
[ 6815.014673]  kmalloc_trace+0x2a/0xa0
[ 6815.014684]  apparmor_sk_alloc_security+0x40/0x80
[ 6815.014694]  security_sk_alloc+0x3f/0x60
[ 6815.014703]  sk_prot_alloc+0x75/0x110
[ 6815.014712]  sk_alloc+0x31/0x200
[ 6815.014721]  inet_create+0xd8/0x3a0
[ 6815.014734]  __sock_create+0x11b/0x220
[ 6815.014749]  __sys_socket_create.part.0+0x49/0x70
[ 6815.014756]  ? __secure_computing+0x94/0xf0
[ 6815.014768]  __sys_socket+0x3c/0xc0
[ 6815.014776]  __x64_sys_socket+0x1a/0x30
[ 6815.014783]  do_syscall_64+0x3b/0x90
[ 6815.014794]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 6815.014804] RIP: 0033:0x7fce59aa795b

 drivers/nvme/host/ioctl.c | 76 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index d8ff796fd5f2..8147750beff4 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -158,6 +158,67 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
 	return req;
 }
 
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch (opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	}
+	return false;
+}
+
+static bool nvme_validate_passthru_meta(struct nvme_ns *ns,
+					struct nvme_command *c,
+					__u32 meta_len,
+					unsigned data_len)
+{
+	/*
+	 * User may specify smaller meta-buffer with a larger data-buffer.
+	 * Driver allocated meta buffer will also be small.
+	 * Device can do larger dma into that, overwriting unrelated kernel
+	 * memory.
+	 * For extended lba case, if user provides a short unaligned buffer,
+	 * the device will corrupt kernel memory.
+	 * Avoid running into corruption in both situations.
+	 */
+	bool ext_lba = ns->features & NVME_NS_EXT_LBAS;
+	u16 nlb, control;
+	unsigned dlen, mlen;
+
+	/* Exclude commands that do not have nlb in cdw12 */
+	if (!nvme_nlb_in_cdw12(c->common.opcode))
+		return true;
+
+	control = upper_16_bits(le32_to_cpu(c->common.cdw12));
+	/* Exclude when meta transfer from/to host is not done */
+	if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
+		return true;
+
+	nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));
+	mlen = (nlb + 1) * ns->ms;
+
+	/* sanity for interleaved buffer */
+	if (ext_lba) {
+		dlen = (nlb + 1) << ns->lba_shift;
+		if (data_len < (dlen + mlen))
+			goto out_false;
+		return true;
+	}
+	/* sanity for separate meta buffer */
+	if (meta_len < mlen)
+		goto out_false;
+
+	return true;
+
+out_false:
+	dev_err(ns->ctrl->device,
+		"%s: metadata length is small!\n", current->comm);
+	return false;
+}
+
 static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
 		u32 meta_seed, void **metap, struct io_uring_cmd *ioucmd,
@@ -194,6 +255,12 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		bio_set_dev(bio, bdev);
 
 	if (bdev && meta_buffer && meta_len) {
+		if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
+					meta_len, bufflen)) {
+			ret = -EINVAL;
+			goto out_unmap;
+		}
+
 		meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
 				meta_seed);
 		if (IS_ERR(meta)) {
@@ -203,6 +270,15 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		*metap = meta;
 	}
 
+	/* guard another case when kernel memory is being used */
+	if (bio->bi_private && ns && ns->features & NVME_NS_EXT_LBAS) {
+		if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
+					meta_len, bufflen)) {
+			ret = -EINVAL;
+			goto out_unmap;
+		}
+	}
+
 	return ret;
 
 out_unmap:
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-06 13:47 ` [PATCH v3] nvme: fix memory corruption for passthrough metadata Kanchan Joshi
@ 2023-10-10  7:46   ` Christoph Hellwig
  2023-10-10 13:39     ` Kanchan Joshi
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-10  7:46 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: hch, kbusch, axboe, sagi, linux-nvme, vincentfu, ankit.kumar,
	joshiiitr, cpgs, stable, Vincent Fu

On Fri, Oct 06, 2023 at 07:17:06PM +0530, Kanchan Joshi wrote:
> Same issue is possible for extended-lba case also. When user specifies a
> short unaligned buffer, the kernel makes a copy and uses that for DMA.

I fail to understand the extent LBA case, and also from looking at the
code mixing it up with validation of the metadata_len seems very
confusion.  Can you try to clearly explain it and maybe split it into a
separate patch?

> Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device")

Is this really io_uring specific?  I think we also had the same issue
before and this should go back to adding metadata support to the
general passthrough ioctl?

> +static inline bool nvme_nlb_in_cdw12(u8 opcode)
> +{
> +	switch (opcode) {
> +	case nvme_cmd_read:
> +	case nvme_cmd_write:
> +	case nvme_cmd_compare:
> +	case nvme_cmd_zone_append:
> +		return true;
> +	}
> +	return false;
> +}

Nitpick: I find it nicer to read to have a switch that catches
everything with a default statement instead of falling out of it
for checks like this.  It's not making any different in practice
but just reads a little nicer.

> +	/* Exclude commands that do not have nlb in cdw12 */
> +	if (!nvme_nlb_in_cdw12(c->common.opcode))
> +		return true;

So we can still get exactly the same corruption for all commands that
are not known?  That's not a very safe way to deal with the issue..

> +	control = upper_16_bits(le32_to_cpu(c->common.cdw12));
> +	/* Exclude when meta transfer from/to host is not done */
> +	if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
> +		return true;
> +
> +	nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));

I'd use the rw field of the union and the typed control and length
fields to clean this up a bit.

>  	if (bdev && meta_buffer && meta_len) {
> +		if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
> +					meta_len, bufflen)) {
> +			ret = -EINVAL;
> +			goto out_unmap;
> +		}
> +
>  		meta = nvme_add_user_metadata(req, meta_buffer, meta_len,

I'd move the check into nvme_add_user_metadata to keep it out of the
hot path.

FYI: here is what I'd do for the external metadata only case:

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index d8ff796fd5f21d..bf22c7953856f5 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -96,6 +96,71 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
 	return (void __user *)ptrval;
 }
 
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch (opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * NVMe has no separate field to encode the metadata length expected
+ * (except when using SGLs).
+ *
+ * Because of that we can't allow to transfer arbitrary metadata, as
+ * a metadata buffer that is shorted than what the device expects for
+ * the command will lead to arbitrary kernel (if bounce buffering) or
+ * userspace (if not) memory corruption.
+ *
+ * Check that external metadata is only specified for the few commands
+ * where we know the length based of other fields, and that it fits
+ * the actual data transfer from/to the device.
+ */
+static bool nvme_validate_metadata_len(struct request *req, unsigned meta_len)
+{
+	struct nvme_ns *ns = req->q->queuedata;
+	struct nvme_command *c = nvme_req(req)->cmd;
+	u32 len_by_nlb;
+	
+	/* Exclude commands that do not have nlb in cdw12 */
+	if (!nvme_nlb_in_cdw12(c->common.opcode)) {
+		dev_err(ns->ctrl->device,
+			"unknown metadata command %c (%s).\n",
+			c->common.opcode, current->comm);
+		return false;
+	}
+
+	/*
+	 * Skip the check when PI is inserted or stripped and not transferred.
+	 */
+	if (ns->ms == ns->pi_size &&
+	    (c->rw.control & cpu_to_le16(NVME_RW_PRINFO_PRACT)))
+		return true;
+
+	if (ns->features & NVME_NS_EXT_LBAS) {
+		dev_err(ns->ctrl->device,
+			"requires extended LBAs for metadata (%s).\n",
+			current->comm);
+		return false;
+	}
+
+	len_by_nlb = (le16_to_cpu(c->rw.length) + 1) * ns->ms;
+	if (meta_len < len_by_nlb) {
+		dev_err(ns->ctrl->device,
+			"metadata length (%u instad of %u) is too small (%s).\n",
+			meta_len, len_by_nlb, current->comm);
+		return false;
+	}
+
+	return true;
+}
+
 static void *nvme_add_user_metadata(struct request *req, void __user *ubuf,
 		unsigned len, u32 seed)
 {
@@ -104,6 +169,9 @@ static void *nvme_add_user_metadata(struct request *req, void __user *ubuf,
 	void *buf;
 	struct bio *bio = req->bio;
 
+	if (!nvme_validate_metadata_len(req, len))
+		return ERR_PTR(-EINVAL);
+
 	buf = kmalloc(len, GFP_KERNEL);
 	if (!buf)
 		goto out;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-10  7:46   ` Christoph Hellwig
@ 2023-10-10 13:39     ` Kanchan Joshi
  2023-10-10 15:31       ` Clay Mayers
  2023-10-11  5:02       ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-10-10 13:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, kbusch, axboe, sagi, linux-nvme, vincentfu,
	ankit.kumar, cpgs, stable, Vincent Fu

On Tue, Oct 10, 2023 at 1:16 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Oct 06, 2023 at 07:17:06PM +0530, Kanchan Joshi wrote:
> > Same issue is possible for extended-lba case also. When user specifies a
> > short unaligned buffer, the kernel makes a copy and uses that for DMA.
>
> I fail to understand the extent LBA case, and also from looking at the
> code mixing it up with validation of the metadata_len seems very
> confusion.  Can you try to clearly explain it and maybe split it into a
> separate patch?

The case is for the single interleaved buffer with both data and
metadata. When the driver sends this buffer to blk_rq_map_user_iov(),
it may make a copy of it.
This kernel buffer will be used for DMA rather than user buffer. If
the user-buffer is short, the kernel buffer is also short.

Does this explanation help?
I can move the part to a separate patch.

> > Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device")
>
> Is this really io_uring specific?  I think we also had the same issue
> before and this should go back to adding metadata support to the
> general passthrough ioctl?

Yes, not io_uring specific.
Just that I was not sure on (i) whether to go back that far in
history, and (ii) what patch to tag.

> > +static inline bool nvme_nlb_in_cdw12(u8 opcode)
> > +{
> > +     switch (opcode) {
> > +     case nvme_cmd_read:
> > +     case nvme_cmd_write:
> > +     case nvme_cmd_compare:
> > +     case nvme_cmd_zone_append:
> > +             return true;
> > +     }
> > +     return false;
> > +}
>
> Nitpick: I find it nicer to read to have a switch that catches
> everything with a default statement instead of falling out of it
> for checks like this.  It's not making any different in practice
> but just reads a little nicer.

Sure, I can change it.

> > +     /* Exclude commands that do not have nlb in cdw12 */
> > +     if (!nvme_nlb_in_cdw12(c->common.opcode))
> > +             return true;
>
> So we can still get exactly the same corruption for all commands that
> are not known?  That's not a very safe way to deal with the issue..

Given the way things are in NVMe, I do not find a better way.
Maybe another day for commands that do (or can do) things very
differently for nlb and PI representation.

> > +     control = upper_16_bits(le32_to_cpu(c->common.cdw12));
> > +     /* Exclude when meta transfer from/to host is not done */
> > +     if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
> > +             return true;
> > +
> > +     nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));
>
> I'd use the rw field of the union and the typed control and length
> fields to clean this up a bit.
>
> >       if (bdev && meta_buffer && meta_len) {
> > +             if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
> > +                                     meta_len, bufflen)) {
> > +                     ret = -EINVAL;
> > +                     goto out_unmap;
> > +             }
> > +
> >               meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
>
> I'd move the check into nvme_add_user_metadata to keep it out of the
> hot path.
>
> FYI: here is what I'd do for the external metadata only case:

Since you have improvised comments too, I may just use this for the
next iteration.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-10 13:39     ` Kanchan Joshi
@ 2023-10-10 15:31       ` Clay Mayers
  2023-10-11  5:03         ` Christoph Hellwig
  2023-10-11  5:02       ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Clay Mayers @ 2023-10-10 15:31 UTC (permalink / raw)
  To: Kanchan Joshi, Christoph Hellwig
  Cc: Kanchan Joshi, kbusch@kernel.org, axboe@kernel.dk,
	sagi@grimberg.me, linux-nvme@lists.infradead.org,
	vincentfu@gmail.com, ankit.kumar@samsung.com, cpgs@samsung.com,
	stable@vger.kernel.org, Vincent Fu

On Tuesday, October 10, 2023 6:40 AM Kanchan Joshi <joshi.k@samsung.com> wrote:
> 
> On Tue, Oct 10, 2023 at 1:16 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Fri, Oct 06, 2023 at 07:17:06PM +0530, Kanchan Joshi wrote:
> > > Same issue is possible for extended-lba case also. When user specifies a
> > > short unaligned buffer, the kernel makes a copy and uses that for DMA.
> >
> > I fail to understand the extent LBA case, and also from looking at the
> > code mixing it up with validation of the metadata_len seems very
> > confusion.  Can you try to clearly explain it and maybe split it into a
> > separate patch?
> 
> The case is for the single interleaved buffer with both data and
> metadata. When the driver sends this buffer to blk_rq_map_user_iov(),
> it may make a copy of it.
> This kernel buffer will be used for DMA rather than user buffer. If
> the user-buffer is short, the kernel buffer is also short.
> 
> Does this explanation help?
> I can move the part to a separate patch.
> 
> > > Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on
> char-device")
> >
> > Is this really io_uring specific?  I think we also had the same issue
> > before and this should go back to adding metadata support to the
> > general passthrough ioctl?
> 
> Yes, not io_uring specific.
> Just that I was not sure on (i) whether to go back that far in
> history, and (ii) what patch to tag.
> 
> > > +static inline bool nvme_nlb_in_cdw12(u8 opcode)
> > > +{
> > > +     switch (opcode) {
> > > +     case nvme_cmd_read:
> > > +     case nvme_cmd_write:
> > > +     case nvme_cmd_compare:
> > > +     case nvme_cmd_zone_append:
> > > +             return true;
> > > +     }
> > > +     return false;
> > > +}
> >
> > Nitpick: I find it nicer to read to have a switch that catches
> > everything with a default statement instead of falling out of it
> > for checks like this.  It's not making any different in practice
> > but just reads a little nicer.
> 
> Sure, I can change it.
> 

What if the ns used the KV CS?  Store and retrieve are the same
op codes as nvme_cmd_write and nvme_cmd_read.

> > > +     /* Exclude commands that do not have nlb in cdw12 */
> > > +     if (!nvme_nlb_in_cdw12(c->common.opcode))
> > > +             return true;
> >
> > So we can still get exactly the same corruption for all commands that
> > are not known?  That's not a very safe way to deal with the issue..
> 
> Given the way things are in NVMe, I do not find a better way.
> Maybe another day for commands that do (or can do) things very
> differently for nlb and PI representation.
>
> > > +     control = upper_16_bits(le32_to_cpu(c->common.cdw12));
> > > +     /* Exclude when meta transfer from/to host is not done */
> > > +     if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
> > > +             return true;
> > > +
> > > +     nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));
> >
> > I'd use the rw field of the union and the typed control and length
> > fields to clean this up a bit.
> >
> > >       if (bdev && meta_buffer && meta_len) {
> > > +             if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
> > > +                                     meta_len, bufflen)) {
> > > +                     ret = -EINVAL;
> > > +                     goto out_unmap;
> > > +             }
> > > +
> > >               meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
> >
> > I'd move the check into nvme_add_user_metadata to keep it out of the
> > hot path.
> >
> > FYI: here is what I'd do for the external metadata only case:
> 
> Since you have improvised comments too, I may just use this for the
> next iteration.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-10 13:39     ` Kanchan Joshi
  2023-10-10 15:31       ` Clay Mayers
@ 2023-10-11  5:02       ` Christoph Hellwig
  2023-10-11  5:26         ` Kanchan Joshi
  2023-10-11 17:04         ` Keith Busch
  1 sibling, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-11  5:02 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, axboe, sagi, linux-nvme,
	vincentfu, ankit.kumar, cpgs, stable, Vincent Fu

On Tue, Oct 10, 2023 at 07:09:54PM +0530, Kanchan Joshi wrote:
> The case is for the single interleaved buffer with both data and
> metadata. When the driver sends this buffer to blk_rq_map_user_iov(),
> it may make a copy of it.
> This kernel buffer will be used for DMA rather than user buffer. If
> the user-buffer is short, the kernel buffer is also short.

Yes.  Note that we'll corrupt memory either way, so user vs kernel
does not matter.

> Does this explanation help?
> I can move the part to a separate patch.

Definitively separate function please, not sure if a separate
patch is required.

> Yes, not io_uring specific.
> Just that I was not sure on (i) whether to go back that far in
> history, and (ii) what patch to tag.

I think the one that adds the original problem is:

63263d60e0f9f37bfd5e6a1e83a62f0e62fc459f
Author: Keith Busch <kbusch@kernel.org>
Date:   Tue Aug 29 17:46:04 2017 -0400

    nvme: Use metadata for passthrough commands


> > > +     /* Exclude commands that do not have nlb in cdw12 */
> > > +     if (!nvme_nlb_in_cdw12(c->common.opcode))
> > > +             return true;
> >
> > So we can still get exactly the same corruption for all commands that
> > are not known?  That's not a very safe way to deal with the issue..
> 
> Given the way things are in NVMe, I do not find a better way.
> Maybe another day for commands that do (or can do) things very
> differently for nlb and PI representation.

Fixing just a subset of these problems is pointless.  If people want
to use metadata on vendor specific commands they need to work with
NVMe to figure out a generic way to pass the length.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-10 15:31       ` Clay Mayers
@ 2023-10-11  5:03         ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-11  5:03 UTC (permalink / raw)
  To: Clay Mayers
  Cc: Kanchan Joshi, Christoph Hellwig, Kanchan Joshi,
	kbusch@kernel.org, axboe@kernel.dk, sagi@grimberg.me,
	linux-nvme@lists.infradead.org, vincentfu@gmail.com,
	ankit.kumar@samsung.com, cpgs@samsung.com, stable@vger.kernel.org,
	Vincent Fu

On Tue, Oct 10, 2023 at 03:31:23PM +0000, Clay Mayers wrote:
> > Sure, I can change it.
> > 
> 
> What if the ns used the KV CS?  Store and retrieve are the same
> op codes as nvme_cmd_write and nvme_cmd_read.

Yeah, we also need a CSI check here.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-11  5:02       ` Christoph Hellwig
@ 2023-10-11  5:26         ` Kanchan Joshi
  2023-10-11  6:36           ` Christoph Hellwig
  2023-10-11 17:04         ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: Kanchan Joshi @ 2023-10-11  5:26 UTC (permalink / raw)
  To: Christoph Hellwig, Kanchan Joshi
  Cc: kbusch, axboe, sagi, linux-nvme, vincentfu, ankit.kumar, cpgs,
	stable, Vincent Fu

On 10/11/2023 10:32 AM, Christoph Hellwig wrote:
>> Just that I was not sure on (i) whether to go back that far in
>> history, and (ii) what patch to tag.
> 
> I think the one that adds the original problem is:
> 
> 63263d60e0f9f37bfd5e6a1e83a62f0e62fc459f
> Author: Keith Busch <kbusch@kernel.org>
> Date:   Tue Aug 29 17:46:04 2017 -0400
> 
>      nvme: Use metadata for passthrough commands

Thanks.

> 
>>>> +     /* Exclude commands that do not have nlb in cdw12 */
>>>> +     if (!nvme_nlb_in_cdw12(c->common.opcode))
>>>> +             return true;
>>>
>>> So we can still get exactly the same corruption for all commands that
>>> are not known?  That's not a very safe way to deal with the issue..
>>
>> Given the way things are in NVMe, I do not find a better way.
>> Maybe another day for commands that do (or can do) things very
>> differently for nlb and PI representation.
> 
> Fixing just a subset of these problems is pointless.  If people want
> to use metadata on vendor specific commands they need to work with
> NVMe to figure out a generic way to pass the length.

Do you suggest that vendor specific opcodes should be blocked here?



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-11  5:26         ` Kanchan Joshi
@ 2023-10-11  6:36           ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-11  6:36 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, axboe, sagi, linux-nvme,
	vincentfu, ankit.kumar, cpgs, stable, Vincent Fu

On Wed, Oct 11, 2023 at 10:56:44AM +0530, Kanchan Joshi wrote:
> > Fixing just a subset of these problems is pointless.  If people want
> > to use metadata on vendor specific commands they need to work with
> > NVMe to figure out a generic way to pass the length.
> 
> Do you suggest that vendor specific opcodes should be blocked here?

We have to block everything that we can't calculate the length for.
Otherwise you still leave the hole open.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-11  5:02       ` Christoph Hellwig
  2023-10-11  5:26         ` Kanchan Joshi
@ 2023-10-11 17:04         ` Keith Busch
  2023-10-12  4:36           ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Keith Busch @ 2023-10-11 17:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, Kanchan Joshi, axboe, sagi, linux-nvme, vincentfu,
	ankit.kumar, cpgs, stable, Vincent Fu

On Wed, Oct 11, 2023 at 07:02:54AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 10, 2023 at 07:09:54PM +0530, Kanchan Joshi wrote:
> > 
> > Given the way things are in NVMe, I do not find a better way.
> > Maybe another day for commands that do (or can do) things very
> > differently for nlb and PI representation.
> 
> Fixing just a subset of these problems is pointless.  If people want
> to use metadata on vendor specific commands they need to work with
> NVMe to figure out a generic way to pass the length.

NVMe already tried to solve that with NDT and NDM fields, but no vendor
implemented it. Maybe just require SGL's for passthrough IO since that
encodes the buffer sizes.

I don't think it's reasonable for the driver to decode every passthrough
command to validate the data lengths, or reject ones that we don't know
how to decode. SG_IO doesn't do that either.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-11 17:04         ` Keith Busch
@ 2023-10-12  4:36           ` Christoph Hellwig
  2023-10-12 15:31             ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-12  4:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, Kanchan Joshi, axboe, sagi,
	linux-nvme, vincentfu, ankit.kumar, cpgs, stable, Vincent Fu

On Wed, Oct 11, 2023 at 11:04:58AM -0600, Keith Busch wrote:
> > > Given the way things are in NVMe, I do not find a better way.
> > > Maybe another day for commands that do (or can do) things very
> > > differently for nlb and PI representation.
> > 
> > Fixing just a subset of these problems is pointless.  If people want
> > to use metadata on vendor specific commands they need to work with
> > NVMe to figure out a generic way to pass the length.
> 
> NVMe already tried to solve that with NDT and NDM fields, but no vendor
> implemented it. Maybe just require SGL's for passthrough IO since that
> encodes the buffer sizes.

How is that going to help us with metadata where neither we, nor most
devices support SGLs?

> I don't think it's reasonable for the driver to decode every passthrough
> command to validate the data lengths, or reject ones that we don't know
> how to decode. SG_IO doesn't do that either.

I don't want that either, but what can we do against a (possibly
unprivileged) user corrupting data?

SCSI has it much either because it has an explicit data transfer length
(outside the CDB) instead of trying to build it from information that
differs per opcode.  One of the many places where it shows that NVMe
is a very sloppy and badly thought out protocol.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-12  4:36           ` Christoph Hellwig
@ 2023-10-12 15:31             ` Keith Busch
  2023-10-12 15:46               ` Christoph Hellwig
  2023-10-13  2:19               ` Kanchan Joshi
  0 siblings, 2 replies; 15+ messages in thread
From: Keith Busch @ 2023-10-12 15:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, Kanchan Joshi, axboe, sagi, linux-nvme, vincentfu,
	ankit.kumar, cpgs, stable, Vincent Fu

On Thu, Oct 12, 2023 at 06:36:52AM +0200, Christoph Hellwig wrote:
> On Wed, Oct 11, 2023 at 11:04:58AM -0600, Keith Busch wrote:
> 
> > I don't think it's reasonable for the driver to decode every passthrough
> > command to validate the data lengths, or reject ones that we don't know
> > how to decode. SG_IO doesn't do that either.
> 
> I don't want that either, but what can we do against a (possibly
> unprivileged) user corrupting data?

The unpriviledged access is kind of recent. Maybe limit the scope of
decoding to that usage?

We've always known the interface can be misused to corrupt memory and/or
data, and it was always user responsibility to use this interface
reponsibly. We shouldn't disable something people have relied on for
over 10 years just because someone rediscovered ways to break it.

It's not like this is a "metadata" specific thing either; you can
provide short user space buffers and corrupt memory with regular admin
commands, and we have been able to that from day 1. But if you abuse
this interface, it was always your fault; the kernel never took
responsibility to sanity check your nvme input, and I think it's a bad
precedent to start doing it.
 
> SCSI has it much either because it has an explicit data transfer length
> (outside the CDB) instead of trying to build it from information that
> differs per opcode.  One of the many places where it shows that NVMe
> is a very sloppy and badly thought out protocol.

Yeah, implicit PRP length has often been reported as one of the early
protocol "regrets"...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-12 15:31             ` Keith Busch
@ 2023-10-12 15:46               ` Christoph Hellwig
  2023-10-13  2:19               ` Kanchan Joshi
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-12 15:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, Kanchan Joshi, axboe, sagi,
	linux-nvme, vincentfu, ankit.kumar, cpgs, stable, Vincent Fu

On Thu, Oct 12, 2023 at 09:31:38AM -0600, Keith Busch wrote:
> > I don't want that either, but what can we do against a (possibly
> > unprivileged) user corrupting data?
> 
> The unpriviledged access is kind of recent. Maybe limit the scope of
> decoding to that usage?

Let's just drop support for unpriviledged passthrough for now.  That's
easily backportable and gives us time to sort out what we can do.
Probably only allowing it when SGLs are in use, including a flag to
force using it.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-12 15:31             ` Keith Busch
  2023-10-12 15:46               ` Christoph Hellwig
@ 2023-10-13  2:19               ` Kanchan Joshi
  2023-10-13  4:38                 ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Kanchan Joshi @ 2023-10-13  2:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, sagi, linux-nvme,
	vincentfu, ankit.kumar, cpgs, stable, Vincent Fu

On Thu, Oct 12, 2023 at 9:01 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Oct 12, 2023 at 06:36:52AM +0200, Christoph Hellwig wrote:
> > On Wed, Oct 11, 2023 at 11:04:58AM -0600, Keith Busch wrote:
> >
> > > I don't think it's reasonable for the driver to decode every passthrough
> > > command to validate the data lengths, or reject ones that we don't know
> > > how to decode. SG_IO doesn't do that either.
> >
> > I don't want that either, but what can we do against a (possibly
> > unprivileged) user corrupting data?
>
> The unpriviledged access is kind of recent. Maybe limit the scope of
> decoding to that usage?

I can send an iteration today that takes this route.
Maybe that can be considered over dropping a useful feature.

> We've always known the interface can be misused to corrupt memory and/or
> data, and it was always user responsibility to use this interface
> reponsibly. We shouldn't disable something people have relied on for
> over 10 years just because someone rediscovered ways to break it.
>
> It's not like this is a "metadata" specific thing either; you can
> provide short user space buffers and corrupt memory with regular admin
> commands, and we have been able to that from day 1. But if you abuse
> this interface, it was always your fault; the kernel never took
> responsibility to sanity check your nvme input, and I think it's a bad
> precedent to start doing it.

In my mind, this was about dealing with the specific case when the
kernel memory is being used for device DMA.
We have just two cases: (i) separate meta buffer, and (ii) bounce
buffer for data (+metadata).
I had not planned sanity checks for user inputs for anything beyond that.
As opposed to being preventive (in all cases), it was about failing
only when we are certain that DMA will take place and it will corrupt
kernel memory.

In the long-term, it may be possible for the path to do away with
memory copies. The checks can disappear with that.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-13  2:19               ` Kanchan Joshi
@ 2023-10-13  4:38                 ` Christoph Hellwig
  2023-10-13  5:50                   ` Kanchan Joshi
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-10-13  4:38 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Keith Busch, Christoph Hellwig, Kanchan Joshi, axboe, sagi,
	linux-nvme, vincentfu, ankit.kumar, cpgs, stable, Vincent Fu

On Fri, Oct 13, 2023 at 07:49:19AM +0530, Kanchan Joshi wrote:
> > precedent to start doing it.
> In my mind, this was about dealing with the specific case when the
> kernel memory is being used for device DMA.
> We have just two cases: (i) separate meta buffer, and (ii) bounce
> buffer for data (+metadata).
> I had not planned sanity checks for user inputs for anything beyond that.
> As opposed to being preventive (in all cases), it was about failing
> only when we are certain that DMA will take place and it will corrupt
> kernel memory.
> 
> In the long-term, it may be possible for the path to do away with
> memory copies. The checks can disappear with that.

As soon as the user buffer is unaligned we need to bounce buffer,
including for the data buffer.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] nvme: fix memory corruption for passthrough metadata
  2023-10-13  4:38                 ` Christoph Hellwig
@ 2023-10-13  5:50                   ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-10-13  5:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Kanchan Joshi, axboe, sagi, linux-nvme, vincentfu,
	ankit.kumar, cpgs, stable, Vincent Fu

On Fri, Oct 13, 2023 at 10:08 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Oct 13, 2023 at 07:49:19AM +0530, Kanchan Joshi wrote:
> > > precedent to start doing it.
> > In my mind, this was about dealing with the specific case when the
> > kernel memory is being used for device DMA.
> > We have just two cases: (i) separate meta buffer, and (ii) bounce
> > buffer for data (+metadata).
> > I had not planned sanity checks for user inputs for anything beyond that.
> > As opposed to being preventive (in all cases), it was about failing
> > only when we are certain that DMA will take place and it will corrupt
> > kernel memory.
> >
> > In the long-term, it may be possible for the path to do away with
> > memory copies. The checks can disappear with that.
>
> As soon as the user buffer is unaligned we need to bounce buffer,
> including for the data buffer.

Yes, but that also sprinkles a bunch of checks and goes against the
theme of  and doing as minimal as possible (at least for passthrough).
Had the plain buffer (potentially unaligned) gone down, either it
would have worked or the device would not like it and user space would
have got the error anyway. No?

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-10-13  5:51 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CGME20231006135322epcas5p1c9acf38b04f35017181c715c706281dc@epcas5p1.samsung.com>
2023-10-06 13:47 ` [PATCH v3] nvme: fix memory corruption for passthrough metadata Kanchan Joshi
2023-10-10  7:46   ` Christoph Hellwig
2023-10-10 13:39     ` Kanchan Joshi
2023-10-10 15:31       ` Clay Mayers
2023-10-11  5:03         ` Christoph Hellwig
2023-10-11  5:02       ` Christoph Hellwig
2023-10-11  5:26         ` Kanchan Joshi
2023-10-11  6:36           ` Christoph Hellwig
2023-10-11 17:04         ` Keith Busch
2023-10-12  4:36           ` Christoph Hellwig
2023-10-12 15:31             ` Keith Busch
2023-10-12 15:46               ` Christoph Hellwig
2023-10-13  2:19               ` Kanchan Joshi
2023-10-13  4:38                 ` Christoph Hellwig
2023-10-13  5:50                   ` Kanchan Joshi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox