linux-block.vger.kernel.org archive mirror
From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Kanchan Joshi <joshi.k@samsung.com>
Cc: lsf-pc@lists.linux-foundation.org,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	josef@toxicpanda.com,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [Lsf-pc] [LSF/MM/BPF ATTEND][LSF/MM/BPF TOPIC] Meta/Integrity/PI improvements
Date: Mon, 26 Feb 2024 18:15:19 -0500	[thread overview]
Message-ID: <yq14jdu7t2u.fsf@ca-mkp.ca.oracle.com> (raw)
In-Reply-To: <aca1e970-9785-5ff4-807b-9f892af71741@samsung.com> (Kanchan Joshi's message of "Fri, 23 Feb 2024 01:03:01 +0530")


Kanchan,

> - Generic user interface that user-space can use to exchange meta. A
> new io_uring opcode IORING_OP_READ/WRITE_META - seems feasible for
> direct IO.

Yep. I'm interested in this too. Reviving this effort is near the top of
my todo list, so I'm happy to collaborate.

> NVMe SSD can do the offload when the host sends the PRACT bit. But in
> the driver, this is tied to global integrity disablement using
> CONFIG_BLK_DEV_INTEGRITY.

> So, the idea is to introduce a bio flag REQ_INTEGRITY_OFFLOAD
> that the filesystem can send. The block-integrity and NVMe driver do
> the rest to make the offload work.

Whether to have a block device do this is currently controlled by the
/sys/block/foo/integrity/{read_verify,write_generate} knobs. At least
for SCSI, protected transfers are always enabled between HBA and target
if both support it. If no integrity has been attached to an I/O by the
application/filesystem, the block layer will do so, controlled by the
sysfs knobs above. IOW, if the hardware is capable, protected transfers
should always be enabled, at least from the block layer down.

It's possible that things don't work quite that way with NVMe since, at
least for PCIe, the drive is both initiator and target. And NVMe also
missed quite a few DIX details in its PI implementation. It's been a
while since I messed with PI on NVMe; I'll have a look.

But in any case the intent for the Linux code was for protected
transfers to be enabled automatically when possible. If the block layer
protection is explicitly disabled, a filesystem can still trigger
protected transfers via the bip flags. So that capability should
definitely be exposed via io_uring.

> "Work is in progress to implement support for the data integrity
> extensions in btrfs, enabling the filesystem to use the application
> tag."

This didn't go anywhere for a couple of reasons:

 - Individual disk drives supported ATO (the app tag owner bit) but
   every storage array we worked with used the app tag space
   internally. And thus there were
   very few real-life situations where it would be possible to store
   additional information in each block.

   Back in the mid-2000s, putting enterprise data on individual disk
   drives was not considered acceptable. So implementing filesystem
   support that would only be usable on individual disk drives didn't
   seem worth the investment. Especially when the PI-for-ATA efforts
   were abandoned.

   Wrt. the app tag ownership situation in SCSI, the storage tag in the
   NVMe spec is a remedy for this, allowing the application to own part
   of the extra tag space and the storage device itself another part.

 - Our proposed use case for the app tag was to provide filesystems with
   back pointers without having to change the on-disk format.

   The use of 0xFFFF as an escape value in PI (an app tag of 0xFFFF
   tells the device not to check that block) meant that the caller had
   to be very careful about what to store in the app tag. Our prototype
   attached structs of metadata to each filesystem block (8 512-byte
   sectors * 2 bytes of app tag each, so 16 bytes of metadata per 4KB
   filesystem block). But none of those 2-byte chunks could contain the
   value 0xFFFF. That wasn't a great interface for filesystems that
   wanted to attach whatever data structure was important to them.

So between a very limited selection of hardware actually providing the
app tag space and a clunky interface for filesystems, the app tag just
never really took off. We ended up modifying it to be an access control
instead; see the app tag control mode page in SCSI.

Databases and many filesystems have means to protect blocks or extents.
And these means are often better at identifying the nature of read-time
problems than a CRC over each 512-byte LBA would be. So what made PI
interesting was the ability to catch problems at write time in case of a
bad partition remap, wrong buffer pointer, misordered blocks, etc. Once
the data is on media, the drive ECC is superior. And again, at read time
the database or application is often better equipped to identify
corruption than PI.

And consequently our interest focused on treating PI as something more
akin to a network checksum than a facility to protect data at rest on
media.

-- 
Martin K. Petersen	Oracle Linux Engineering


Thread overview: 16+ messages
2024-02-22 19:33 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF TOPIC] Meta/Integrity/PI improvements Kanchan Joshi
2024-02-22 20:08   ` Keith Busch
2024-02-23 12:41     ` Kanchan Joshi
2024-02-23 14:38   ` David Sterba
2024-02-26 23:15   ` Martin K. Petersen [this message]
2024-03-27 13:45     ` [Lsf-pc] " Kanchan Joshi
2024-03-28  0:30       ` Martin K. Petersen
2024-03-29 11:35         ` Kanchan Joshi
2024-04-03  2:10           ` Martin K. Petersen
2024-04-02 10:45     ` Dongyang Li
2024-04-02 11:37       ` Hannes Reinecke
2024-04-02 16:52       ` Kanchan Joshi
2024-04-03 12:40         ` Dongyang Li
2024-04-03 12:42           ` hch
2024-04-04  9:53             ` Dongyang Li
2024-04-05  6:12     ` Kent Overstreet
