linux-block.vger.kernel.org archive mirror
* in-kernel verification of user PI?
@ 2025-01-29 12:46 ` Christoph Hellwig
  2025-01-29 14:23   ` Martin K. Petersen
  2025-01-31 10:34   ` Kanchan Joshi
  0 siblings, 2 replies; 9+ messages in thread
From: Christoph Hellwig @ 2025-01-29 12:46 UTC (permalink / raw)
  To: Kanchan Joshi, Anuj Gupta, Martin K. Petersen
  Cc: Keith Busch, Jens Axboe, linux-block

Hi all,

I've recently been reviewing the just merged io_uring support for
passing PI and metadata from userspace and reconciling it with my
fs PI design notes and prototype.

One thing that I noticed is that for PI passed from userspace the
kernel never verifies that the guard and ref tag match what we'd
expect.  I.e. if userspace passes incorrect information it can trigger
a command failure and thus the driver error handler, which is something
we don't usually allow for "regular" I/O.  Definitely not on files
but in general also not on the block device special files.  Also a
"random" reftag could cause some interesting integer overflows during
partition (or later file offset) remapping.

Shouldn't the kernel do verification of the guard/ref tags on writes
with PI data?
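
As a rough illustration of what such a verification pass involves, here
is a minimal userspace sketch in C. It assumes 8-byte T10 PI tuples
(16-bit guard, 16-bit app tag, 32-bit ref tag, all big endian), a CRC16
T10-DIF guard, and Type 1 style ref tags that start at a seed and
increment per protection interval. The names crc_t10dif(), pi_generate()
and pi_verify() are hypothetical and are not the kernel's helpers; the
bitwise CRC is chosen for clarity, not speed, and the app tag escape
handling is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define PI_TUPLE_SIZE 8 /* guard (2) + app tag (2) + ref tag (4), big endian */

/* Bitwise CRC16 T10-DIF: polynomial 0x8BB7, initial value 0, no reflection. */
static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Fill one PI tuple per interval; the ref tag is seed + interval index. */
static void pi_generate(const uint8_t *data, size_t interval, size_t n,
                        uint32_t seed, uint8_t *pi)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t guard = crc_t10dif(data + i * interval, interval);
        uint32_t ref = seed + (uint32_t)i;
        uint8_t *t = pi + i * PI_TUPLE_SIZE;

        t[0] = (uint8_t)(guard >> 8);
        t[1] = (uint8_t)(guard & 0xff);
        t[2] = 0; /* app tag unused in this sketch */
        t[3] = 0;
        t[4] = (uint8_t)(ref >> 24);
        t[5] = (uint8_t)(ref >> 16);
        t[6] = (uint8_t)(ref >> 8);
        t[7] = (uint8_t)ref;
    }
}

/*
 * Check guard and ref tag of every interval.
 * Returns -1 if all intervals match, else the index of the first bad one.
 */
static long pi_verify(const uint8_t *data, size_t interval, size_t n,
                      uint32_t seed, const uint8_t *pi)
{
    for (size_t i = 0; i < n; i++) {
        const uint8_t *t = pi + i * PI_TUPLE_SIZE;
        uint16_t guard = (uint16_t)((t[0] << 8) | t[1]);

        if (guard != crc_t10dif(data + i * interval, interval))
            return (long)i;
        if (get_be32(t + 4) != seed + (uint32_t)i)
            return (long)i;
    }
    return -1;
}
```

Note that pi_verify() reads every data byte and every PI tuple, which is
exactly the extra pass over the buffers that a check on the write path
would add.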

Also another thing is that right now the holder of a path or fd has no
idea what metadata it is supposed to pass.  For block device special
files finding the right sysfs directory is relatively straightforward
(but still annoying), but once a file is on a file system that becomes
impossible.  I think we'll need an ioctl that exposes the equivalent
of the integrity sysfs directory to make this usable by applications.

* Re: in-kernel verification of user PI?
  2025-01-29 12:46 ` in-kernel verification of user PI? Christoph Hellwig
@ 2025-01-29 14:23   ` Martin K. Petersen
  2025-01-29 15:26     ` Christoph Hellwig
  2025-01-31 10:34   ` Kanchan Joshi
  1 sibling, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2025-01-29 14:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, Anuj Gupta, Martin K. Petersen, Keith Busch,
	Jens Axboe, linux-block


Hi Christoph!

> One thing that I noticed is that for PI passed from userspace the
> kernel never verifies that the guard and ref tag match what we'd
> expect.

Doing a verification pass in the write hot path had a substantial
performance impact when I originally did this. Even remapping the ref
tag has an impact on cache. That's why DIX1.1 moved ref tag remapping to
the HBA so we could avoid touching the PI buffer altogether in the hot
path.

> I.e. if userspace passes incorrect information it can trigger a
> command failure and thus the driver error handler, which is something
> we don't usually allow for "regular" I/O.

Do you trigger EH in NVMe? For SCSI we just bubble the PI error up
without retrying.

> Shouldn't the kernel do verification of the guard/ref tags on writes
> with PI data?

I'd prefer to have things fail gracefully if a problem is identified by
the hardware. As opposed to adding a second CRC calculation pass to the
hot path.

> Also another thing is that right now the holder of a path or fd has no
> idea what metadata it is supposed to pass. For block device special
> files finding the right sysfs directory is relatively straightforward
> (but still annoying), but once a file is on a file system that becomes
> impossible. I think we'll need an ioctl that exposes the equivalent of
> the integrity sysfs directory to make this usable by applications.

I agree that poking around in sysfs and reading multiple files to
combine all the various parameters is painful. Totally in favor of an
ioctl to query the integrity format.
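
To make the "reading multiple files" pain concrete, here is a minimal C
sketch of what an application has to do today for a block device. It
assumes the integrity attributes format, tag_size and
protection_interval_bytes under a directory such as
/sys/block/<disk>/integrity; the struct integrity_info and both helper
names are hypothetical, and the caller still has to work out the right
directory for its fd on its own.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the values an application needs to combine. */
struct integrity_info {
    char format[32];              /* e.g. "T10-DIF-TYPE1-CRC" */
    unsigned long tag_size;       /* application tag bytes per interval */
    unsigned long interval_bytes; /* data bytes covered by one PI tuple */
};

/* Read the first line of <dir>/<name> into out, stripping the newline. */
static int read_attr(const char *dir, const char *name,
                     char *out, size_t outlen)
{
    char path[512];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(out, (int)outlen, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    out[strcspn(out, "\n")] = '\0';
    return 0;
}

/* Combine the separate attribute files; returns 0 on success, -1 on error. */
static int read_integrity_info(const char *dir, struct integrity_info *info)
{
    char buf[32];

    if (read_attr(dir, "format", info->format, sizeof(info->format)))
        return -1;
    if (read_attr(dir, "tag_size", buf, sizeof(buf)))
        return -1;
    info->tag_size = strtoul(buf, NULL, 10);
    if (read_attr(dir, "protection_interval_bytes", buf, sizeof(buf)))
        return -1;
    info->interval_bytes = strtoul(buf, NULL, 10);
    return 0;
}
```

A caller would pass something like "/sys/block/nvme0n1/integrity" as dir
(the disk name is only an example). A single ioctl on the fd would
replace both the path lookup and the three separate reads.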

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: in-kernel verification of user PI?
  2025-01-29 14:23   ` Martin K. Petersen
@ 2025-01-29 15:26     ` Christoph Hellwig
  2025-01-29 15:42       ` Martin K. Petersen
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-01-29 15:26 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Kanchan Joshi, Anuj Gupta, Keith Busch,
	Jens Axboe, linux-block

On Wed, Jan 29, 2025 at 09:23:37AM -0500, Martin K. Petersen wrote:
> Doing a verification pass in the write hot path had a substantial
> performance impact when I originally did this.

Oh yes, it absolutely will.  While the CRC implementations have gotten a
lot faster in recent years, there's still a cost.  It also touches a lot
of cache lines.

> Even remapping the ref
> tag has an impact on cache. That's why DIX1.1 moved ref tag remapping to
> the HBA so we could avoid touching the PI buffer altogether in the hot
> path.

As in supplying an offset for the ref tag somewhere in the HBA specific
per-command payload?  That's not implemented in Linux as far as I can
tell, or did I miss something?

> > I.e. if userspace passes incorrect information it can trigger a
> > command failure and thus the driver error handler, which is something
> > we don't usually allow for "regular" I/O.
> 
> Do you trigger EH in NVMe? For SCSI we just bubble the PI error up
> without retrying.

We don't have the EH thread from hell, but there is error handling yes.


* Re: in-kernel verification of user PI?
  2025-01-29 15:26     ` Christoph Hellwig
@ 2025-01-29 15:42       ` Martin K. Petersen
  2025-01-29 15:43         ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2025-01-29 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Kanchan Joshi, Anuj Gupta, Keith Busch,
	Jens Axboe, linux-block


Christoph,

> As in supplying an offset for the ref tag somewhere in the HBA specific
> per-command payload?  That's not implemented in Linux as far as I can
> tell, or did I miss something?

It fell by the wayside for various reasons. I would love to revive it,
all it did was skip the remapping step if a flag was set in the profile.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: in-kernel verification of user PI?
  2025-01-29 15:42       ` Martin K. Petersen
@ 2025-01-29 15:43         ` Christoph Hellwig
  2025-01-29 16:15           ` Martin K. Petersen
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-01-29 15:43 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Kanchan Joshi, Anuj Gupta, Keith Busch,
	Jens Axboe, linux-block

On Wed, Jan 29, 2025 at 10:42:04AM -0500, Martin K. Petersen wrote:
> 
> Christoph,
> 
> > As in supplying an offset for the ref tag somewhere in the HBA specific
> > per-command payload?  That's not implemented in Linux as far as I can
> > tell, or did I miss something?
> 
> It fell by the wayside for various reasons. I would love to revive it,
> all it did was skip the remapping step if a flag was set in the profile.

How much remapping could the hardware do?  Would this also work for
remapping an inode-relative ref tag?  Do we need to bring it into NVMe?


* Re: in-kernel verification of user PI?
  2025-01-29 15:43         ` Christoph Hellwig
@ 2025-01-29 16:15           ` Martin K. Petersen
  2025-01-30 13:02             ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2025-01-29 16:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Kanchan Joshi, Anuj Gupta, Keith Busch,
	Jens Axboe, linux-block


Christoph,

>> It fell by the wayside for various reasons. I would love to revive it,
>> all it did was skip the remapping step if a flag was set in the profile.
>
> How much remapping could the hardware do?  Would this also work for
> remapping an inode-relative ref tag?  Do we need to bring it into NVMe?

One of the reasons it lost momentum was that NVMe didn't do it for
ILBRT/EILBRT. Although of course NVMe doesn't really have an
intermediate HBA entity like SCSI. For SCSI it was natural for the HBA
to convert between what the host sees and what the disk sees, RAID
controllers do it all the time. NVMe didn't pick up that wrinkle.

With DIX1.1, you tell the HBA what to expect the first received ref tag
to be. That could be the application's file offset or whatever you want.
It's just the seed value chosen by the application when the PI was
generated. That's passed down the stack along with the PI buffer itself.

And then the controller ASIC uses that seed value to program the
register for validating the ref tags as it DMAs from host memory. On the
outbound side it uses a different value to seed the ref tag generation
register when sending the PI on to the drive. I.e. LBA for Type 1.

It's really just a matter of the device having separate, programmable
registers for ref tag verification and generation.

So in terms of NVMe, it's like having ILBRT and EILBRT specified at the
same time. The drive should use EILBRT for validating the ref tag in the
PI received from host and then use a separate ILBRT as the initial value
for the ref tag when writing the PI to media.

On reads it works the same way. The controller validates that the ref
tag read from media matches the LBA. And then uses the separate register
to generate a new ref tag initialized with the seed value requested by
the application.
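
The two-register scheme described above can be modelled in software
roughly as follows. This is only a toy C sketch under stated
assumptions: 8-byte PI tuples with a big-endian 32-bit ref tag at byte
offset 4, and a hypothetical reftag_convert() helper standing in for
what the controller would do while it DMAs the PI buffer. It is not
taken from any HBA firmware or from the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define PI_TUPLE_SIZE  8 /* guard (2) + app tag (2) + ref tag (4) */
#define REF_TAG_OFFSET 4 /* ref tag position inside one tuple */

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Model of separate "verify" and "generate" ref tag registers.
 *
 * Every incoming ref tag must equal verify_seed + interval index. If all
 * match, each one is rewritten to generate_seed + interval index.
 * Arithmetic is modulo 2^32, like a 32-bit ref tag.
 *
 * Returns -1 on success, else the index of the first mismatching interval;
 * on mismatch the buffer is left untouched.
 */
static long reftag_convert(uint8_t *pi, size_t n,
                           uint32_t verify_seed, uint32_t generate_seed)
{
    for (size_t i = 0; i < n; i++) {
        const uint8_t *ref = pi + i * PI_TUPLE_SIZE + REF_TAG_OFFSET;

        if (get_be32(ref) != verify_seed + (uint32_t)i)
            return (long)i;
    }
    for (size_t i = 0; i < n; i++) {
        uint8_t *ref = pi + i * PI_TUPLE_SIZE + REF_TAG_OFFSET;

        put_be32(ref, generate_seed + (uint32_t)i);
    }
    return -1;
}
```

On a write, verify_seed would be the application's seed and
generate_seed the low 32 bits of the LBA for Type 1; on a read the two
arguments are swapped. The host never touches the PI buffer in either
direction, which is the point of moving the remapping into the device.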

Not sure if we'd have room for both an EILBRT and an ILBRT in the same
command? Sounds like it would be difficult, especially with the larger
ref tags in NVMe. But I'm happy to pursue in NVMe if there is interest.
Because it did make a performance difference not having to touch the PI
buffer in the I/O path.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: in-kernel verification of user PI?
  2025-01-29 16:15           ` Martin K. Petersen
@ 2025-01-30 13:02             ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2025-01-30 13:02 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Kanchan Joshi, Anuj Gupta, Keith Busch,
	Jens Axboe, linux-block

On Wed, Jan 29, 2025 at 11:15:35AM -0500, Martin K. Petersen wrote:
> 
> Christoph,
> 
> >> It fell by the wayside for various reasons. I would love to revive it,
> >> all it did was skip the remapping step if a flag was set in the profile.
> >
> > How much remapping could the hardware do?  Would this also work for
> > remapping an inode-relative ref tag?  Do we need to bring it into NVMe?
> 
> One of the reasons it lost momentum was that NVMe didn't do it for
> ILBRT/EILBRT. Although of course NVMe doesn't really have an
> intermediate HBA entity like SCSI.

Or any kind of coherent architecture for PI..

> With DIX1.1, you tell the HBA what to expect the first received ref tag
> to be. That could be the application's file offset or whatever you want.
> It's just the seed value chosen by the application when the PI was
> generated. That's passed down the stack along with the PI buffer itself.

Yeah.  NVMe actually kinda supports this, but for zone append only as we
need that for PI with zone append.  But it is limited to remapping from
a starting reftag that is the zone start address, so it's not quite as
flexible.  Search for the PIREMAP bit in the ZNS spec.

> Not sure if we'd have room for both an EILBRT and an ILBRT in the same
> command? Sounds like it would be difficult, especially with the larger
> ref tags in NVMe. But I'm happy to pursue in NVMe if there is interest.
> Because it did make a performance difference not having to touch the PI
> buffer in the I/O path.

I guess you'd do it by treating type1 PI as actual type1 PI, that is
the ILBRT is derived from the LBA.  But I'd need to think more about
it, and without a clear customer use case it's probably not going to
happen in NVMe.


* Re: in-kernel verification of user PI?
  2025-01-29 12:46 ` in-kernel verification of user PI? Christoph Hellwig
  2025-01-29 14:23   ` Martin K. Petersen
@ 2025-01-31 10:34   ` Kanchan Joshi
  2025-01-31 10:36     ` Christoph Hellwig
  1 sibling, 1 reply; 9+ messages in thread
From: Kanchan Joshi @ 2025-01-31 10:34 UTC (permalink / raw)
  To: Christoph Hellwig, Anuj Gupta, Martin K. Petersen
  Cc: Keith Busch, Jens Axboe, linux-block

On 1/29/2025 6:16 PM, Christoph Hellwig wrote:
> Also another thing is that right now the holder of a path or fd has no
> idea what metadata it is supposed to pass.  For block device special
> files finding the right sysfs directory is relatively straightforward
> (but still annoying), but once a file is on a file system that becomes
> impossible.  I think we'll need an ioctl that exposes the equivalent
> of the integrity sysfs directory to make this usable by applications.

Are you thinking this ioctl to be on a regular file?

* Re: in-kernel verification of user PI?
  2025-01-31 10:34   ` Kanchan Joshi
@ 2025-01-31 10:36     ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2025-01-31 10:36 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, Anuj Gupta, Martin K. Petersen, Keith Busch,
	Jens Axboe, linux-block

On Fri, Jan 31, 2025 at 04:04:53PM +0530, Kanchan Joshi wrote:
> On 1/29/2025 6:16 PM, Christoph Hellwig wrote:
> > Also another thing is that right now the holder of a path or fd has no
> > idea what metadata it is supposed to pass.  For block device special
> > files finding the right sysfs directory is relatively straightforward
> > (but still annoying), but once a file is on a file system that becomes
> > impossible.  I think we'll need an ioctl that exposes the equivalent
> > of the integrity sysfs directory to make this usable by applications.
> 
> Are you thinking this ioctl to be on a regular file?

On anything that supports passing PI through io_uring.  So block devices
and (some) regular files.
