* [PATCHv6] blk-integrity: support arbitrary buffer alignment
@ 2025-11-24 16:17 Keith Busch
2025-11-24 16:31 ` Martin K. Petersen
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Keith Busch @ 2025-11-24 16:17 UTC (permalink / raw)
To: linux-block, axboe, hch; +Cc: ebiggers, Keith Busch, Martin K. Petersen
From: Keith Busch <kbusch@kernel.org>
A bio segment may have partial interval block data with the rest continuing
into the next segments because direct-io data payloads only need to aligned in
memory to the device's DMA limits.
At the same time, the protection information may also be split in
multiple segments. The most likely way that may happen is if two
requests merge, or if we're directly using the io_uring user metadata.
The generate/verify helpers, however, only ever accessed the first bip_vec.
Further, the protection fields may be unaligned within the user space
buffer, or there may be odd additional opaque bytes in front of or
behind the protection information in the metadata region.
Change up the iteration to allow spanning multiple segments. This patch
is mostly a re-write of the protection information handling to allow any
arbitrary alignments, so it's probably easier to review the end result
rather than the diff.
Martin reports many SCSI controllers are not able to handle interval
data composed of multiple segments when PI is used, so this patch
introduces a new integrity limit that a low level driver can set to
indicate that it is capable of handling that; it defaults to off. The nvme
driver is the first one to enable it in this patch. Everyone else will
force DMA alignment to the logical block size to ensure interval data is
always aligned within a single segment.
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v5->v6:
Changed the new queue limit from a bool to a flag and took an
available bit for it.
Also ensure the limit is stacked as needed.
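For anyone taking up the suggestion to review the end result rather than
the diff, here is a rough sketch of the new per-interval flow (simplified,
not the actual patch code: the CRC64 and IP checksum variants, pi_offset
handling and the error paths are left out):

	/* accumulate one interval's guard across however many bvecs it spans */
	while (interval_remaining) {
		struct bio_vec bv = mp_bvec_iter_bvec(bio->bi_io_vec, data_iter);
		unsigned int len = min(interval_remaining, bv.bv_len);
		void *kaddr = bvec_kmap_local(&bv);

		csum = crc_t10dif_update(csum, kaddr, len);
		kunmap_local(kaddr);
		bvec_iter_advance_single(bio->bi_io_vec, &data_iter, len);
		interval_remaining -= len;
	}
	/*
	 * The PI tuple itself may likewise straddle bip_vec boundaries; in
	 * that case it is bounced through an on-stack union pi_tuple before
	 * the generate/verify step.
	 */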
block/blk-settings.c | 12 +-
block/t10-pi.c | 825 +++++++++++++++++++---------------
drivers/nvme/host/core.c | 1 +
include/linux/blk-integrity.h | 1 +
4 files changed, 473 insertions(+), 366 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 51401f08ce05b..86e04ea895ea0 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -198,11 +198,11 @@ static int blk_validate_integrity_limits(struct queue_limits *lim)
bi->interval_exp = ilog2(lim->logical_block_size);
/*
- * The PI generation / validation helpers do not expect intervals to
- * straddle multiple bio_vecs. Enforce alignment so that those are
+ * Some IO controllers can not handle data intervals straddling
+ * multiple bio_vecs. For those, enforce alignment so that those are
* never generated, and that each buffer is aligned as expected.
*/
- if (bi->csum_type) {
+ if (!(bi->flags & BLK_SPLIT_INTERVAL_CAPABLE) && bi->csum_type) {
lim->dma_alignment = max(lim->dma_alignment,
(1U << bi->interval_exp) - 1);
}
@@ -1001,10 +1001,14 @@ bool queue_limits_stack_integrity(struct queue_limits *t,
if ((ti->flags & BLK_INTEGRITY_REF_TAG) !=
(bi->flags & BLK_INTEGRITY_REF_TAG))
goto incompatible;
+ if ((ti->flags & BLK_SPLIT_INTERVAL_CAPABLE) &&
+ !(bi->flags & BLK_SPLIT_INTERVAL_CAPABLE))
+ ti->flags &= ~BLK_SPLIT_INTERVAL_CAPABLE;
} else {
ti->flags = BLK_INTEGRITY_STACKED;
ti->flags |= (bi->flags & BLK_INTEGRITY_DEVICE_CAPABLE) |
- (bi->flags & BLK_INTEGRITY_REF_TAG);
+ (bi->flags & BLK_INTEGRITY_REF_TAG) |
+ (bi->flags & BLK_SPLIT_INTERVAL_CAPABLE);
ti->csum_type = bi->csum_type;
ti->pi_tuple_size = bi->pi_tuple_size;
ti->metadata_size = bi->metadata_size;
diff --git a/block/t10-pi.c b/block/t10-pi.c
index 0c4ed97021460..a5756efc13e9d 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -12,462 +12,563 @@
#include <linux/unaligned.h>
#include "blk.h"
+#define APP_TAG_ESCAPE 0xffff
+#define REF_TAG_ESCAPE 0xffffffff
+
+/*
+ * This union is used for onstack allocations when the pi field is split across
+ * segments. blk_validate_integrity_limits() guarantees pi_tuple_size matches
+ * the sizeof one of these two types.
+ */
+union pi_tuple {
+ struct crc64_pi_tuple crc64_pi;
+ struct t10_pi_tuple t10_pi;
+};
+
struct blk_integrity_iter {
- void *prot_buf;
- void *data_buf;
- sector_t seed;
- unsigned int data_size;
- unsigned short interval;
- const char *disk_name;
+ struct bio *bio;
+ struct bio_integrity_payload *bip;
+ struct blk_integrity *bi;
+ struct bvec_iter data_iter;
+ struct bvec_iter prot_iter;
+ unsigned int interval_remaining;
+ u64 seed;
+ u64 csum;
};
-static __be16 t10_pi_csum(__be16 csum, void *data, unsigned int len,
- unsigned char csum_type)
+static void blk_calculate_guard(struct blk_integrity_iter *iter, void *data,
+ unsigned int len)
{
- if (csum_type == BLK_INTEGRITY_CSUM_IP)
- return (__force __be16)ip_compute_csum(data, len);
- return cpu_to_be16(crc_t10dif_update(be16_to_cpu(csum), data, len));
+ switch (iter->bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_CRC64:
+ iter->csum = crc64_nvme(iter->csum, data, len);
+ break;
+ case BLK_INTEGRITY_CSUM_CRC:
+ iter->csum = crc_t10dif_update(iter->csum, data, len);
+ break;
+ case BLK_INTEGRITY_CSUM_IP:
+ iter->csum = (__force u32)csum_partial(data, len,
+ (__force __wsum)iter->csum);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ iter->csum = U64_MAX;
+ break;
+ }
+}
+
+static void blk_integrity_csum_finish(struct blk_integrity_iter *iter)
+{
+ switch (iter->bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_IP:
+ iter->csum = (__force u16)csum_fold((__force __wsum)iter->csum);
+ break;
+ default:
+ break;
+ }
}
/*
- * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
- * 16 bit app tag, 32 bit reference tag. Type 3 does not define the ref
- * tag.
+ * Update the csum for formats that have metadata padding in front of the data
+ * integrity field
*/
-static void t10_pi_generate(struct blk_integrity_iter *iter,
- struct blk_integrity *bi)
+static void blk_integrity_csum_offset(struct blk_integrity_iter *iter)
+{
+ unsigned int offset = iter->bi->pi_offset;
+ struct bio_vec *bvec = iter->bip->bip_vec;
+
+ while (offset > 0) {
+ struct bio_vec pbv = mp_bvec_iter_bvec(bvec, iter->prot_iter);
+ unsigned int len = min(pbv.bv_len, offset);
+ void *prot_buf = bvec_kmap_local(&pbv);
+
+ blk_calculate_guard(iter, prot_buf, len);
+ kunmap_local(prot_buf);
+ offset -= len;
+ bvec_iter_advance_single(bvec, &iter->prot_iter, len);
+ }
+ blk_integrity_csum_finish(iter);
+}
+
+static void blk_integrity_copy_from_tuple(struct bio_integrity_payload *bip,
+ struct bvec_iter *iter, void *tuple,
+ unsigned int tuple_size)
{
- u8 offset = bi->pi_offset;
- unsigned int i;
+ void *prot_buf;
- for (i = 0 ; i < iter->data_size ; i += iter->interval) {
- struct t10_pi_tuple *pi = iter->prot_buf + offset;
+ while (tuple_size) {
+ struct bio_vec pbv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
+ unsigned int len = min(tuple_size, pbv.bv_len);
- pi->guard_tag = t10_pi_csum(0, iter->data_buf, iter->interval,
- bi->csum_type);
- if (offset)
- pi->guard_tag = t10_pi_csum(pi->guard_tag,
- iter->prot_buf, offset, bi->csum_type);
- pi->app_tag = 0;
+ prot_buf = bvec_kmap_local(&pbv);
+ memcpy(prot_buf, tuple, len);
+ kunmap_local(prot_buf);
- if (bi->flags & BLK_INTEGRITY_REF_TAG)
- pi->ref_tag = cpu_to_be32(lower_32_bits(iter->seed));
- else
- pi->ref_tag = 0;
+ bvec_iter_advance_single(bip->bip_vec, iter, len);
+ tuple_size -= len;
+ tuple += len;
+ }
+}
- iter->data_buf += iter->interval;
- iter->prot_buf += bi->metadata_size;
- iter->seed++;
+static void blk_integrity_copy_to_tuple(struct bio_integrity_payload *bip,
+ struct bvec_iter *iter, void *tuple,
+ unsigned int tuple_size)
+{
+ void *prot_buf;
+
+ while (tuple_size) {
+ struct bio_vec pbv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
+ unsigned int len = min(tuple_size, pbv.bv_len);
+
+ prot_buf = bvec_kmap_local(&pbv);
+ memcpy(tuple, prot_buf, len);
+ kunmap_local(prot_buf);
+
+ bvec_iter_advance_single(bip->bip_vec, iter, len);
+ tuple_size -= len;
+ tuple += len;
}
}
-static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
- struct blk_integrity *bi)
-{
- u8 offset = bi->pi_offset;
- unsigned int i;
-
- for (i = 0 ; i < iter->data_size ; i += iter->interval) {
- struct t10_pi_tuple *pi = iter->prot_buf + offset;
- __be16 csum;
-
- if (bi->flags & BLK_INTEGRITY_REF_TAG) {
- if (pi->app_tag == T10_PI_APP_ESCAPE)
- goto next;
-
- if (be32_to_cpu(pi->ref_tag) !=
- lower_32_bits(iter->seed)) {
- pr_err("%s: ref tag error at location %llu " \
- "(rcvd %u)\n", iter->disk_name,
- (unsigned long long)
- iter->seed, be32_to_cpu(pi->ref_tag));
- return BLK_STS_PROTECTION;
- }
- } else {
- if (pi->app_tag == T10_PI_APP_ESCAPE &&
- pi->ref_tag == T10_PI_REF_ESCAPE)
- goto next;
+static bool ext_pi_ref_escape(const u8 ref_tag[6])
+{
+ static const u8 ref_escape[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
+
+ return memcmp(ref_tag, ref_escape, sizeof(ref_escape)) == 0;
+}
+
+static blk_status_t blk_verify_ext_pi(struct blk_integrity_iter *iter,
+ struct crc64_pi_tuple *pi)
+{
+ u64 seed = lower_48_bits(iter->seed);
+ u64 guard = get_unaligned_be64(&pi->guard_tag);
+ u64 ref = get_unaligned_be48(pi->ref_tag);
+ u16 app = get_unaligned_be16(&pi->app_tag);
+
+ if (iter->bi->flags & BLK_INTEGRITY_REF_TAG) {
+ if (app == APP_TAG_ESCAPE)
+ return BLK_STS_OK;
+ if (ref != seed) {
+ pr_err("%s: ref tag error at location %llu (rcvd %llu)\n",
+ iter->bio->bi_bdev->bd_disk->disk_name, seed,
+ ref);
+ return BLK_STS_PROTECTION;
}
+ } else if (app == APP_TAG_ESCAPE && ext_pi_ref_escape(pi->ref_tag)) {
+ return BLK_STS_OK;
+ }
+
+ if (guard != iter->csum) {
+ pr_err("%s: guard tag error at sector %llu (rcvd %016llx, want %016llx)\n",
+ iter->bio->bi_bdev->bd_disk->disk_name, iter->seed,
+ guard, iter->csum);
+ return BLK_STS_PROTECTION;
+ }
- csum = t10_pi_csum(0, iter->data_buf, iter->interval,
- bi->csum_type);
- if (offset)
- csum = t10_pi_csum(csum, iter->prot_buf, offset,
- bi->csum_type);
-
- if (pi->guard_tag != csum) {
- pr_err("%s: guard tag error at sector %llu " \
- "(rcvd %04x, want %04x)\n", iter->disk_name,
- (unsigned long long)iter->seed,
- be16_to_cpu(pi->guard_tag), be16_to_cpu(csum));
+ return BLK_STS_OK;
+}
+
+static blk_status_t blk_verify_pi(struct blk_integrity_iter *iter,
+ struct t10_pi_tuple *pi, u16 guard)
+{
+ u32 seed = lower_32_bits(iter->seed);
+ u32 ref = get_unaligned_be32(&pi->ref_tag);
+ u16 app = get_unaligned_be16(&pi->app_tag);
+
+ if (iter->bi->flags & BLK_INTEGRITY_REF_TAG) {
+ if (app == APP_TAG_ESCAPE)
+ return BLK_STS_OK;
+ if (ref != seed) {
+ pr_err("%s: ref tag error at location %u (rcvd %u)\n",
+ iter->bio->bi_bdev->bd_disk->disk_name, seed,
+ ref);
return BLK_STS_PROTECTION;
}
+ } else if (app == APP_TAG_ESCAPE && ref == REF_TAG_ESCAPE) {
+ return BLK_STS_OK;
+ }
-next:
- iter->data_buf += iter->interval;
- iter->prot_buf += bi->metadata_size;
- iter->seed++;
+ if (guard != (u16)iter->csum) {
+ pr_err("%s: guard tag error at sector %llu (rcvd %04x, want %04x)\n",
+ iter->bio->bi_bdev->bd_disk->disk_name, iter->seed,
+ guard, (u16)iter->csum);
+ return BLK_STS_PROTECTION;
}
return BLK_STS_OK;
}
-/**
- * t10_pi_type1_prepare - prepare PI prior submitting request to device
- * @rq: request with PI that should be prepared
- *
- * For Type 1/Type 2, the virtual start sector is the one that was
- * originally submitted by the block layer for the ref_tag usage. Due to
- * partitioning, MD/DM cloning, etc. the actual physical start sector is
- * likely to be different. Remap protection information to match the
- * physical LBA.
- */
-static void t10_pi_type1_prepare(struct request *rq)
+static blk_status_t blk_verify_t10_pi(struct blk_integrity_iter *iter,
+ struct t10_pi_tuple *pi)
{
- struct blk_integrity *bi = &rq->q->limits.integrity;
- const int tuple_sz = bi->metadata_size;
- u32 ref_tag = t10_pi_ref_tag(rq);
- u8 offset = bi->pi_offset;
- struct bio *bio;
+ u16 guard = get_unaligned_be16(&pi->guard_tag);
- __rq_for_each_bio(bio, rq) {
- struct bio_integrity_payload *bip = bio_integrity(bio);
- u32 virt = bip_get_seed(bip) & 0xffffffff;
- struct bio_vec iv;
- struct bvec_iter iter;
+ return blk_verify_pi(iter, pi, guard);
+}
- /* Already remapped? */
- if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
- break;
+static blk_status_t blk_verify_ip_pi(struct blk_integrity_iter *iter,
+ struct t10_pi_tuple *pi)
+{
+ u16 guard = get_unaligned((u16 *)&pi->guard_tag);
- bip_for_each_vec(iv, bip, iter) {
- unsigned int j;
- void *p;
-
- p = bvec_kmap_local(&iv);
- for (j = 0; j < iv.bv_len; j += tuple_sz) {
- struct t10_pi_tuple *pi = p + offset;
-
- if (be32_to_cpu(pi->ref_tag) == virt)
- pi->ref_tag = cpu_to_be32(ref_tag);
- virt++;
- ref_tag++;
- p += tuple_sz;
- }
- kunmap_local(p);
- }
+ return blk_verify_pi(iter, pi, guard);
+}
- bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+static blk_status_t blk_integrity_verify(struct blk_integrity_iter *iter,
+ union pi_tuple *tuple)
+{
+ switch (iter->bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_CRC64:
+ return blk_verify_ext_pi(iter, &tuple->crc64_pi);
+ case BLK_INTEGRITY_CSUM_CRC:
+ return blk_verify_t10_pi(iter, &tuple->t10_pi);
+ case BLK_INTEGRITY_CSUM_IP:
+ return blk_verify_ip_pi(iter, &tuple->t10_pi);
+ default:
+ return BLK_STS_OK;
}
}
-/**
- * t10_pi_type1_complete - prepare PI prior returning request to the blk layer
- * @rq: request with PI that should be prepared
- * @nr_bytes: total bytes to prepare
- *
- * For Type 1/Type 2, the virtual start sector is the one that was
- * originally submitted by the block layer for the ref_tag usage. Due to
- * partitioning, MD/DM cloning, etc. the actual physical start sector is
- * likely to be different. Since the physical start sector was submitted
- * to the device, we should remap it back to virtual values expected by the
- * block layer.
- */
-static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
+static void blk_set_ext_pi(struct blk_integrity_iter *iter,
+ struct crc64_pi_tuple *pi)
{
- struct blk_integrity *bi = &rq->q->limits.integrity;
- unsigned intervals = nr_bytes >> bi->interval_exp;
- const int tuple_sz = bi->metadata_size;
- u32 ref_tag = t10_pi_ref_tag(rq);
- u8 offset = bi->pi_offset;
- struct bio *bio;
+ put_unaligned_be64(iter->csum, &pi->guard_tag);
+ put_unaligned_be16(0, &pi->app_tag);
+ put_unaligned_be48(iter->seed, &pi->ref_tag);
+}
- __rq_for_each_bio(bio, rq) {
- struct bio_integrity_payload *bip = bio_integrity(bio);
- u32 virt = bip_get_seed(bip) & 0xffffffff;
- struct bio_vec iv;
- struct bvec_iter iter;
-
- bip_for_each_vec(iv, bip, iter) {
- unsigned int j;
- void *p;
-
- p = bvec_kmap_local(&iv);
- for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
- struct t10_pi_tuple *pi = p + offset;
-
- if (be32_to_cpu(pi->ref_tag) == ref_tag)
- pi->ref_tag = cpu_to_be32(virt);
- virt++;
- ref_tag++;
- intervals--;
- p += tuple_sz;
- }
- kunmap_local(p);
- }
+static void blk_set_pi(struct blk_integrity_iter *iter,
+ struct t10_pi_tuple *pi, __be16 csum)
+{
+ put_unaligned(csum, &pi->guard_tag);
+ put_unaligned_be16(0, &pi->app_tag);
+ put_unaligned_be32(iter->seed, &pi->ref_tag);
+}
+
+static void blk_set_t10_pi(struct blk_integrity_iter *iter,
+ struct t10_pi_tuple *pi)
+{
+ blk_set_pi(iter, pi, cpu_to_be16((u16)iter->csum));
+}
+
+static void blk_set_ip_pi(struct blk_integrity_iter *iter,
+ struct t10_pi_tuple *pi)
+{
+ blk_set_pi(iter, pi, (__force __be16)(u16)iter->csum);
+}
+
+static void blk_integrity_set(struct blk_integrity_iter *iter,
+ union pi_tuple *tuple)
+{
+ switch (iter->bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_CRC64:
+ return blk_set_ext_pi(iter, &tuple->crc64_pi);
+ case BLK_INTEGRITY_CSUM_CRC:
+ return blk_set_t10_pi(iter, &tuple->t10_pi);
+ case BLK_INTEGRITY_CSUM_IP:
+ return blk_set_ip_pi(iter, &tuple->t10_pi);
+ default:
+ WARN_ON_ONCE(1);
+ return;
}
}
-static __be64 ext_pi_crc64(u64 crc, void *data, unsigned int len)
+static blk_status_t blk_integrity_interval(struct blk_integrity_iter *iter,
+ bool verify)
{
- return cpu_to_be64(crc64_nvme(crc, data, len));
+ blk_status_t ret = BLK_STS_OK;
+ union pi_tuple tuple;
+ void *ptuple = &tuple;
+ struct bio_vec pbv;
+
+ blk_integrity_csum_offset(iter);
+ pbv = mp_bvec_iter_bvec(iter->bip->bip_vec, iter->prot_iter);
+ if (pbv.bv_len >= iter->bi->pi_tuple_size) {
+ ptuple = bvec_kmap_local(&pbv);
+ bvec_iter_advance_single(iter->bip->bip_vec, &iter->prot_iter,
+ iter->bi->metadata_size - iter->bi->pi_offset);
+ } else if (verify) {
+ blk_integrity_copy_to_tuple(iter->bip, &iter->prot_iter,
+ ptuple, iter->bi->pi_tuple_size);
+ }
+
+ if (verify)
+ ret = blk_integrity_verify(iter, ptuple);
+ else
+ blk_integrity_set(iter, ptuple);
+
+ if (likely(ptuple != &tuple)) {
+ kunmap_local(ptuple);
+ } else if (!verify) {
+ blk_integrity_copy_from_tuple(iter->bip, &iter->prot_iter,
+ ptuple, iter->bi->pi_tuple_size);
+ }
+
+
+ iter->interval_remaining = 1 << iter->bi->interval_exp;
+ iter->csum = 0;
+ iter->seed++;
+
+ return ret;
}
-static void ext_pi_crc64_generate(struct blk_integrity_iter *iter,
- struct blk_integrity *bi)
+static void blk_integrity_iterate(struct bio *bio, struct bvec_iter *data_iter,
+ bool verify)
{
- u8 offset = bi->pi_offset;
- unsigned int i;
+ struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+ struct bio_integrity_payload *bip = bio_integrity(bio);
+ struct blk_integrity_iter iter = {
+ .bio = bio,
+ .bip = bip,
+ .bi = bi,
+ .data_iter = *data_iter,
+ .prot_iter = bip->bip_iter,
+ .interval_remaining = 1 << bi->interval_exp,
+ .seed = data_iter->bi_sector,
+ .csum = 0,
+ };
+ blk_status_t ret = BLK_STS_OK;
+
+ while (iter.data_iter.bi_size && ret == BLK_STS_OK) {
+ struct bio_vec bv = mp_bvec_iter_bvec(iter.bio->bi_io_vec,
+ iter.data_iter);
+ void *kaddr = bvec_kmap_local(&bv);
+ void *data = kaddr;
+ unsigned int len;
+
+ bvec_iter_advance_single(iter.bio->bi_io_vec, &iter.data_iter,
+ bv.bv_len);
+ while (bv.bv_len && ret == BLK_STS_OK) {
+ len = min(iter.interval_remaining, bv.bv_len);
+ blk_calculate_guard(&iter, data, len);
+ bv.bv_len -= len;
+ data += len;
+ iter.interval_remaining -= len;
+ if (!iter.interval_remaining)
+ ret = blk_integrity_interval(&iter, verify);
+ }
+ kunmap_local(kaddr);
+ }
+
+ if (ret)
+ bio->bi_status = ret;
+}
- for (i = 0 ; i < iter->data_size ; i += iter->interval) {
- struct crc64_pi_tuple *pi = iter->prot_buf + offset;
+void blk_integrity_generate(struct bio *bio)
+{
+ struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
- pi->guard_tag = ext_pi_crc64(0, iter->data_buf, iter->interval);
- if (offset)
- pi->guard_tag = ext_pi_crc64(be64_to_cpu(pi->guard_tag),
- iter->prot_buf, offset);
- pi->app_tag = 0;
+ switch (bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_CRC64:
+ case BLK_INTEGRITY_CSUM_CRC:
+ case BLK_INTEGRITY_CSUM_IP:
+ blk_integrity_iterate(bio, &bio->bi_iter, false);
+ break;
+ default:
+ break;
+ }
+}
- if (bi->flags & BLK_INTEGRITY_REF_TAG)
- put_unaligned_be48(iter->seed, pi->ref_tag);
- else
- put_unaligned_be48(0ULL, pi->ref_tag);
+void blk_integrity_verify_iter(struct bio *bio, struct bvec_iter *saved_iter)
+{
+ struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
- iter->data_buf += iter->interval;
- iter->prot_buf += bi->metadata_size;
- iter->seed++;
+ switch (bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_CRC64:
+ case BLK_INTEGRITY_CSUM_CRC:
+ case BLK_INTEGRITY_CSUM_IP:
+ blk_integrity_iterate(bio, saved_iter, true);
+ break;
+ default:
+ break;
}
}
-static bool ext_pi_ref_escape(const u8 ref_tag[6])
+/*
+ * Advance @iter past the protection offset for protection formats that
+ * contain front padding on the metadata region.
+ */
+static void blk_pi_advance_offset(struct blk_integrity *bi,
+ struct bio_integrity_payload *bip,
+ struct bvec_iter *iter)
{
- static const u8 ref_escape[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
+ unsigned int offset = bi->pi_offset;
- return memcmp(ref_tag, ref_escape, sizeof(ref_escape)) == 0;
+ while (offset > 0) {
+ struct bio_vec bv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
+ unsigned int len = min(bv.bv_len, offset);
+
+ bvec_iter_advance_single(bip->bip_vec, iter, len);
+ offset -= len;
+ }
}
-static blk_status_t ext_pi_crc64_verify(struct blk_integrity_iter *iter,
- struct blk_integrity *bi)
-{
- u8 offset = bi->pi_offset;
- unsigned int i;
-
- for (i = 0; i < iter->data_size; i += iter->interval) {
- struct crc64_pi_tuple *pi = iter->prot_buf + offset;
- u64 ref, seed;
- __be64 csum;
-
- if (bi->flags & BLK_INTEGRITY_REF_TAG) {
- if (pi->app_tag == T10_PI_APP_ESCAPE)
- goto next;
-
- ref = get_unaligned_be48(pi->ref_tag);
- seed = lower_48_bits(iter->seed);
- if (ref != seed) {
- pr_err("%s: ref tag error at location %llu (rcvd %llu)\n",
- iter->disk_name, seed, ref);
- return BLK_STS_PROTECTION;
- }
- } else {
- if (pi->app_tag == T10_PI_APP_ESCAPE &&
- ext_pi_ref_escape(pi->ref_tag))
- goto next;
- }
+static void *blk_tuple_remap_begin(union pi_tuple *tuple,
+ struct blk_integrity *bi,
+ struct bio_integrity_payload *bip,
+ struct bvec_iter *iter)
+{
+ struct bvec_iter titer;
+ struct bio_vec pbv;
- csum = ext_pi_crc64(0, iter->data_buf, iter->interval);
- if (offset)
- csum = ext_pi_crc64(be64_to_cpu(csum), iter->prot_buf,
- offset);
+ blk_pi_advance_offset(bi, bip, iter);
+ pbv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
+ if (likely(pbv.bv_len >= bi->pi_tuple_size))
+ return bvec_kmap_local(&pbv);
- if (pi->guard_tag != csum) {
- pr_err("%s: guard tag error at sector %llu " \
- "(rcvd %016llx, want %016llx)\n",
- iter->disk_name, (unsigned long long)iter->seed,
- be64_to_cpu(pi->guard_tag), be64_to_cpu(csum));
- return BLK_STS_PROTECTION;
- }
+ /*
+ * We need to preserve the state of the original iter for the
+ * copy_from_tuple at the end, so make a temp iter for here.
+ */
+ titer = *iter;
+ blk_integrity_copy_to_tuple(bip, &titer, tuple, bi->pi_tuple_size);
+ return tuple;
+}
-next:
- iter->data_buf += iter->interval;
- iter->prot_buf += bi->metadata_size;
- iter->seed++;
+static void blk_tuple_remap_end(union pi_tuple *tuple, void *ptuple,
+ struct blk_integrity *bi,
+ struct bio_integrity_payload *bip,
+ struct bvec_iter *iter)
+{
+ unsigned int len = bi->metadata_size - bi->pi_offset;
+
+ if (likely(ptuple != tuple)) {
+ kunmap_local(ptuple);
+ } else {
+ blk_integrity_copy_from_tuple(bip, iter, ptuple,
+ bi->pi_tuple_size);
+ len -= bi->pi_tuple_size;
}
- return BLK_STS_OK;
+ bvec_iter_advance(bip->bip_vec, iter, len);
}
-static void ext_pi_type1_prepare(struct request *rq)
+static void blk_set_ext_unmap_ref(struct crc64_pi_tuple *pi, u64 virt,
+ u64 ref_tag)
{
- struct blk_integrity *bi = &rq->q->limits.integrity;
- const int tuple_sz = bi->metadata_size;
- u64 ref_tag = ext_pi_ref_tag(rq);
- u8 offset = bi->pi_offset;
- struct bio *bio;
+ u64 ref = get_unaligned_be48(&pi->ref_tag);
- __rq_for_each_bio(bio, rq) {
- struct bio_integrity_payload *bip = bio_integrity(bio);
- u64 virt = lower_48_bits(bip_get_seed(bip));
- struct bio_vec iv;
- struct bvec_iter iter;
+ if (ref == lower_48_bits(ref_tag) && ref != lower_48_bits(virt))
+ put_unaligned_be48(virt, pi->ref_tag);
+}
- /* Already remapped? */
- if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
- break;
+static void blk_set_t10_unmap_ref(struct t10_pi_tuple *pi, u32 virt,
+ u32 ref_tag)
+{
+ u32 ref = get_unaligned_be32(&pi->ref_tag);
- bip_for_each_vec(iv, bip, iter) {
- unsigned int j;
- void *p;
-
- p = bvec_kmap_local(&iv);
- for (j = 0; j < iv.bv_len; j += tuple_sz) {
- struct crc64_pi_tuple *pi = p + offset;
- u64 ref = get_unaligned_be48(pi->ref_tag);
-
- if (ref == virt)
- put_unaligned_be48(ref_tag, pi->ref_tag);
- virt++;
- ref_tag++;
- p += tuple_sz;
- }
- kunmap_local(p);
- }
+ if (ref == ref_tag && ref != virt)
+ put_unaligned_be32(virt, &pi->ref_tag);
+}
- bip->bip_flags |= BIP_MAPPED_INTEGRITY;
+static void blk_reftag_remap_complete(struct blk_integrity *bi,
+ union pi_tuple *tuple, u64 virt, u64 ref)
+{
+ switch (bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_CRC64:
+ blk_set_ext_unmap_ref(&tuple->crc64_pi, virt, ref);
+ break;
+ case BLK_INTEGRITY_CSUM_CRC:
+ case BLK_INTEGRITY_CSUM_IP:
+ blk_set_t10_unmap_ref(&tuple->t10_pi, virt, ref);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
}
}
-static void ext_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
+static void blk_set_ext_map_ref(struct crc64_pi_tuple *pi, u64 virt,
+ u64 ref_tag)
{
- struct blk_integrity *bi = &rq->q->limits.integrity;
- unsigned intervals = nr_bytes >> bi->interval_exp;
- const int tuple_sz = bi->metadata_size;
- u64 ref_tag = ext_pi_ref_tag(rq);
- u8 offset = bi->pi_offset;
- struct bio *bio;
+ u64 ref = get_unaligned_be48(&pi->ref_tag);
- __rq_for_each_bio(bio, rq) {
- struct bio_integrity_payload *bip = bio_integrity(bio);
- u64 virt = lower_48_bits(bip_get_seed(bip));
- struct bio_vec iv;
- struct bvec_iter iter;
-
- bip_for_each_vec(iv, bip, iter) {
- unsigned int j;
- void *p;
-
- p = bvec_kmap_local(&iv);
- for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
- struct crc64_pi_tuple *pi = p + offset;
- u64 ref = get_unaligned_be48(pi->ref_tag);
-
- if (ref == ref_tag)
- put_unaligned_be48(virt, pi->ref_tag);
- virt++;
- ref_tag++;
- intervals--;
- p += tuple_sz;
- }
- kunmap_local(p);
- }
- }
+ if (ref == lower_48_bits(virt) && ref != ref_tag)
+ put_unaligned_be48(ref_tag, pi->ref_tag);
}
-void blk_integrity_generate(struct bio *bio)
+static void blk_set_t10_map_ref(struct t10_pi_tuple *pi, u32 virt, u32 ref_tag)
{
- struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
- struct bio_integrity_payload *bip = bio_integrity(bio);
- struct blk_integrity_iter iter;
- struct bvec_iter bviter;
- struct bio_vec bv;
-
- iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
- iter.interval = 1 << bi->interval_exp;
- iter.seed = bio->bi_iter.bi_sector;
- iter.prot_buf = bvec_virt(bip->bip_vec);
- bio_for_each_segment(bv, bio, bviter) {
- void *kaddr = bvec_kmap_local(&bv);
+ u32 ref = get_unaligned_be32(&pi->ref_tag);
- iter.data_buf = kaddr;
- iter.data_size = bv.bv_len;
- switch (bi->csum_type) {
- case BLK_INTEGRITY_CSUM_CRC64:
- ext_pi_crc64_generate(&iter, bi);
- break;
- case BLK_INTEGRITY_CSUM_CRC:
- case BLK_INTEGRITY_CSUM_IP:
- t10_pi_generate(&iter, bi);
- break;
- default:
- break;
- }
- kunmap_local(kaddr);
+ if (ref == virt && ref != ref_tag)
+ put_unaligned_be32(ref_tag, &pi->ref_tag);
+}
+
+static void blk_reftag_remap_prepare(struct blk_integrity *bi,
+ union pi_tuple *tuple,
+ u64 virt, u64 ref)
+{
+ switch (bi->csum_type) {
+ case BLK_INTEGRITY_CSUM_CRC64:
+ blk_set_ext_map_ref(&tuple->crc64_pi, virt, ref);
+ break;
+ case BLK_INTEGRITY_CSUM_CRC:
+ case BLK_INTEGRITY_CSUM_IP:
+ blk_set_t10_map_ref(&tuple->t10_pi, virt, ref);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
}
}
-void blk_integrity_verify_iter(struct bio *bio, struct bvec_iter *saved_iter)
+static void __blk_reftag_remap(struct bio *bio, struct blk_integrity *bi,
+ unsigned *intervals, u64 *ref, bool prep)
{
- struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
struct bio_integrity_payload *bip = bio_integrity(bio);
- struct blk_integrity_iter iter;
- struct bvec_iter bviter;
- struct bio_vec bv;
+ struct bvec_iter iter = bip->bip_iter;
+ u64 virt = bip_get_seed(bip);
+ union pi_tuple *ptuple;
+ union pi_tuple tuple;
- /*
- * At the moment verify is called bi_iter has been advanced during split
- * and completion, so use the copy created during submission here.
- */
- iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
- iter.interval = 1 << bi->interval_exp;
- iter.seed = saved_iter->bi_sector;
- iter.prot_buf = bvec_virt(bip->bip_vec);
- __bio_for_each_segment(bv, bio, bviter, *saved_iter) {
- void *kaddr = bvec_kmap_local(&bv);
- blk_status_t ret = BLK_STS_OK;
+ if (prep && bip->bip_flags & BIP_MAPPED_INTEGRITY) {
+ *ref += bio->bi_iter.bi_size >> bi->interval_exp;
+ return;
+ }
- iter.data_buf = kaddr;
- iter.data_size = bv.bv_len;
- switch (bi->csum_type) {
- case BLK_INTEGRITY_CSUM_CRC64:
- ret = ext_pi_crc64_verify(&iter, bi);
- break;
- case BLK_INTEGRITY_CSUM_CRC:
- case BLK_INTEGRITY_CSUM_IP:
- ret = t10_pi_verify(&iter, bi);
- break;
- default:
- break;
- }
- kunmap_local(kaddr);
+ while (iter.bi_size && *intervals) {
+ ptuple = blk_tuple_remap_begin(&tuple, bi, bip, &iter);
- if (ret) {
- bio->bi_status = ret;
- return;
- }
+ if (prep)
+ blk_reftag_remap_prepare(bi, ptuple, virt, *ref);
+ else
+ blk_reftag_remap_complete(bi, ptuple, virt, *ref);
+
+ blk_tuple_remap_end(&tuple, ptuple, bi, bip, &iter);
+ (*intervals)--;
+ (*ref)++;
+ virt++;
}
+
+ if (prep)
+ bip->bip_flags |= BIP_MAPPED_INTEGRITY;
}
-void blk_integrity_prepare(struct request *rq)
+static void blk_integrity_remap(struct request *rq, unsigned int nr_bytes,
+ bool prep)
{
struct blk_integrity *bi = &rq->q->limits.integrity;
+ u64 ref = blk_rq_pos(rq) >> (bi->interval_exp - SECTOR_SHIFT);
+ unsigned intervals = nr_bytes >> bi->interval_exp;
+ struct bio *bio;
if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
return;
- if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
- ext_pi_type1_prepare(rq);
- else
- t10_pi_type1_prepare(rq);
+ __rq_for_each_bio(bio, rq) {
+ __blk_reftag_remap(bio, bi, &intervals, &ref, prep);
+ if (!intervals)
+ break;
+ }
}
-void blk_integrity_complete(struct request *rq, unsigned int nr_bytes)
+void blk_integrity_prepare(struct request *rq)
{
- struct blk_integrity *bi = &rq->q->limits.integrity;
-
- if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
- return;
+ blk_integrity_remap(rq, blk_rq_bytes(rq), true);
+}
- if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
- ext_pi_type1_complete(rq, nr_bytes);
- else
- t10_pi_type1_complete(rq, nr_bytes);
+void blk_integrity_complete(struct request *rq, unsigned int nr_bytes)
+{
+ blk_integrity_remap(rq, nr_bytes, false);
}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7bf228df6001f..fa534f1d6b27a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1874,6 +1874,7 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
break;
}
+ bi->flags |= BLK_SPLIT_INTERVAL_CAPABLE;
bi->metadata_size = head->ms;
if (bi->csum_type) {
bi->pi_tuple_size = head->pi_size;
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index a6b84206eb94c..1f937373e56b6 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -19,6 +19,7 @@ enum blk_integrity_flags {
BLK_INTEGRITY_DEVICE_CAPABLE = 1 << 2,
BLK_INTEGRITY_REF_TAG = 1 << 3,
BLK_INTEGRITY_STACKED = 1 << 4,
+ BLK_SPLIT_INTERVAL_CAPABLE = 1 << 5,
};
const char *blk_integrity_profile_name(struct blk_integrity *bi);
--
2.47.3
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-24 16:17 [PATCHv6] blk-integrity: support arbitrary buffer alignment Keith Busch
@ 2025-11-24 16:31 ` Martin K. Petersen
2025-11-24 17:57 ` Keith Busch
2025-11-24 16:58 ` Christoph Hellwig
2025-11-24 21:34 ` Caleb Sander Mateos
2 siblings, 1 reply; 11+ messages in thread
From: Martin K. Petersen @ 2025-11-24 16:31 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, axboe, hch, ebiggers, Keith Busch,
Martin K. Petersen
Keith,
> A bio segment may have partial interval block data with the rest
> continuing into the next segments because direct-io data payloads only
> need to aligned in memory to the device's DMA limits.
No objections from me if NVMe needs this.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
--
Martin K. Petersen
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-24 16:31 ` Martin K. Petersen
@ 2025-11-24 17:57 ` Keith Busch
2025-11-24 20:49 ` Martin K. Petersen
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-11-24 17:57 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Keith Busch, linux-block, axboe, hch, ebiggers
On Mon, Nov 24, 2025 at 11:31:54AM -0500, Martin K. Petersen wrote:
> > A bio segment may have partial interval block data with the rest
> > continuing into the next segments because direct-io data payloads only
> > need to aligned in memory to the device's DMA limits.
>
> No objections from me if NVMe needs this.
Truthfully, the users I'm enabling rely on the NVMe io_uring passthrough
interface, so much of this code isn't even in their I/O path. I just want
to remove the dma_alignment limit when a csum is used because the incoming
data is naturally unaligned in memory relative to the LBA size. Most of
this patch is prep work to safely modify one line of code :) But I'm still
really happy with the result!
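To make the alignment point concrete, the case I care about is roughly the
following (illustrative userspace sketch, not from a real test; it assumes
a 4096B + PI LBA format and NVMe's small dma_alignment, with the O_DIRECT
fd setup omitted):

	/* fd: opened with O_DIRECT on the PI-formatted namespace */
	static int read_offset_buffer(int fd)
	{
		void *base;
		char *buf;

		if (posix_memalign(&base, 4096, 2 * 4096))
			return -1;
		/* fine for dma_alignment, but not logical block aligned */
		buf = (char *)base + 512;
		return pread(fd, buf, 4096, 0) == 4096 ? 0 : -1;
	}

Before this patch, blk_validate_integrity_limits() raised dma_alignment to
cover the full interval whenever a csum_type was set, so the pread above
would be rejected with EINVAL; with the new flag it takes the normal path
and the data interval simply spans two bvecs.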
This should fix a real bug too: requests with data integrity and merged
bios would fail without this. It's unlikely you could merge such bios
due to integrity segment and virtual boundary limits, so it's not
surprising issues haven't come up, but there are patches staged for 6.19
that make it easier to do that for capable NVMe devices.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-24 17:57 ` Keith Busch
@ 2025-11-24 20:49 ` Martin K. Petersen
0 siblings, 0 replies; 11+ messages in thread
From: Martin K. Petersen @ 2025-11-24 20:49 UTC (permalink / raw)
To: Keith Busch
Cc: Martin K. Petersen, Keith Busch, linux-block, axboe, hch,
ebiggers
Keith,
> Truthfully, the users I'm enabling rely on the NVMe io_uring passthrough
> interface, so much of this code isn't even in their I/O path. I just want
> to remove the dma_alignment limit when a csum is used because the incoming
> data is naturally unaligned in memory relative to the LBA size.
Yep, I understand completely.
--
Martin K. Petersen
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-24 16:17 [PATCHv6] blk-integrity: support arbitrary buffer alignment Keith Busch
2025-11-24 16:31 ` Martin K. Petersen
@ 2025-11-24 16:58 ` Christoph Hellwig
2025-11-24 21:34 ` Caleb Sander Mateos
2 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2025-11-24 16:58 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, axboe, hch, ebiggers, Keith Busch,
Martin K. Petersen
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-24 16:17 [PATCHv6] blk-integrity: support arbitrary buffer alignment Keith Busch
2025-11-24 16:31 ` Martin K. Petersen
2025-11-24 16:58 ` Christoph Hellwig
@ 2025-11-24 21:34 ` Caleb Sander Mateos
2025-11-24 22:56 ` Keith Busch
2 siblings, 1 reply; 11+ messages in thread
From: Caleb Sander Mateos @ 2025-11-24 21:34 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, axboe, hch, ebiggers, Keith Busch,
Martin K. Petersen
On Mon, Nov 24, 2025 at 8:18 AM Keith Busch <kbusch@meta.com> wrote:
>
> From: Keith Busch <kbusch@kernel.org>
>
> A bio segment may have partial interval block data with the rest continuing
> into the next segments because direct-io data payloads only need to aligned in
"need to be aligned"?
> memory to the device's DMA limits.
>
> At the same time, the protection information may also be split in
> multiple segments. The most likely way that may happen is if two
> requests merge, or if we're directly using the io_uring user metadata.
> The generate/verify helpers, however, only ever accessed the first bip_vec.
>
> Further, the protection fields may be unaligned within the user space
> buffer, or there may be odd additional opaque bytes in front of or
> behind the protection information in the metadata region.
>
> Change up the iteration to allow spanning multiple segments. This patch
> is mostly a re-write of the protection information handling to allow any
> arbitrary alignments, so it's probably easier to review the end result
> rather than the diff.
>
> Martin reports many SCSI controllers are not able to handle interval
> data composed of multiple segments when PI is used, so this patch
> introduces a new integrity limit that a low level driver can set to
> indicate that it is capable of handling that; it defaults to off. The nvme
> driver is the first one to enable it in this patch. Everyone else will
> force DMA alignment to the logical block size to ensure interval data is
> always aligned within a single segment.
>
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> v5->v6:
>
> Changed the new queue limit from a bool to a flag and took an
> available bit for it.
>
> Also ensure the limit is stacked as needed.
>
> block/blk-settings.c | 12 +-
> block/t10-pi.c | 825 +++++++++++++++++++---------------
> drivers/nvme/host/core.c | 1 +
> include/linux/blk-integrity.h | 1 +
> 4 files changed, 473 insertions(+), 366 deletions(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 51401f08ce05b..86e04ea895ea0 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -198,11 +198,11 @@ static int blk_validate_integrity_limits(struct queue_limits *lim)
> bi->interval_exp = ilog2(lim->logical_block_size);
>
> /*
> - * The PI generation / validation helpers do not expect intervals to
> - * straddle multiple bio_vecs. Enforce alignment so that those are
> + * Some IO controllers can not handle data intervals straddling
> + * multiple bio_vecs. For those, enforce alignment so that those are
> * never generated, and that each buffer is aligned as expected.
> */
> - if (bi->csum_type) {
> + if (!(bi->flags & BLK_SPLIT_INTERVAL_CAPABLE) && bi->csum_type) {
> lim->dma_alignment = max(lim->dma_alignment,
> (1U << bi->interval_exp) - 1);
> }
> @@ -1001,10 +1001,14 @@ bool queue_limits_stack_integrity(struct queue_limits *t,
> if ((ti->flags & BLK_INTEGRITY_REF_TAG) !=
> (bi->flags & BLK_INTEGRITY_REF_TAG))
> goto incompatible;
> + if ((ti->flags & BLK_SPLIT_INTERVAL_CAPABLE) &&
> + !(bi->flags & BLK_SPLIT_INTERVAL_CAPABLE))
> + ti->flags &= ~BLK_SPLIT_INTERVAL_CAPABLE;
> } else {
> ti->flags = BLK_INTEGRITY_STACKED;
> ti->flags |= (bi->flags & BLK_INTEGRITY_DEVICE_CAPABLE) |
> - (bi->flags & BLK_INTEGRITY_REF_TAG);
> + (bi->flags & BLK_INTEGRITY_REF_TAG) |
> + (bi->flags & BLK_SPLIT_INTERVAL_CAPABLE);
> ti->csum_type = bi->csum_type;
> ti->pi_tuple_size = bi->pi_tuple_size;
> ti->metadata_size = bi->metadata_size;
> diff --git a/block/t10-pi.c b/block/t10-pi.c
> index 0c4ed97021460..a5756efc13e9d 100644
> --- a/block/t10-pi.c
> +++ b/block/t10-pi.c
> @@ -12,462 +12,563 @@
> #include <linux/unaligned.h>
> #include "blk.h"
>
> +#define APP_TAG_ESCAPE 0xffff
> +#define REF_TAG_ESCAPE 0xffffffff
> +
> +/*
> + * This union is used for onstack allocations when the pi field is split across
> + * segments. blk_validate_integrity_limits() guarantees pi_tuple_size matches
> + * the sizeof one of these two types.
> + */
> +union pi_tuple {
> + struct crc64_pi_tuple crc64_pi;
> + struct t10_pi_tuple t10_pi;
> +};
> +
> struct blk_integrity_iter {
> - void *prot_buf;
> - void *data_buf;
> - sector_t seed;
> - unsigned int data_size;
> - unsigned short interval;
> - const char *disk_name;
> + struct bio *bio;
> + struct bio_integrity_payload *bip;
> + struct blk_integrity *bi;
> + struct bvec_iter data_iter;
> + struct bvec_iter prot_iter;
> + unsigned int interval_remaining;
> + u64 seed;
> + u64 csum;
> };
>
> -static __be16 t10_pi_csum(__be16 csum, void *data, unsigned int len,
> - unsigned char csum_type)
> +static void blk_calculate_guard(struct blk_integrity_iter *iter, void *data,
> + unsigned int len)
> {
> - if (csum_type == BLK_INTEGRITY_CSUM_IP)
> - return (__force __be16)ip_compute_csum(data, len);
> - return cpu_to_be16(crc_t10dif_update(be16_to_cpu(csum), data, len));
> + switch (iter->bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_CRC64:
> + iter->csum = crc64_nvme(iter->csum, data, len);
> + break;
> + case BLK_INTEGRITY_CSUM_CRC:
> + iter->csum = crc_t10dif_update(iter->csum, data, len);
> + break;
> + case BLK_INTEGRITY_CSUM_IP:
> + iter->csum = (__force u32)csum_partial(data, len,
> + (__force __wsum)iter->csum);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + iter->csum = U64_MAX;
> + break;
> + }
> +}
> +
> +static void blk_integrity_csum_finish(struct blk_integrity_iter *iter)
> +{
> + switch (iter->bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_IP:
> + iter->csum = (__force u16)csum_fold((__force __wsum)iter->csum);
> + break;
> + default:
> + break;
> + }
> }
>
> /*
> - * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
> - * 16 bit app tag, 32 bit reference tag. Type 3 does not define the ref
> - * tag.
> + * Update the csum for formats that have metadata padding in front of the data
> + * integrity field
> */
> -static void t10_pi_generate(struct blk_integrity_iter *iter,
> - struct blk_integrity *bi)
> +static void blk_integrity_csum_offset(struct blk_integrity_iter *iter)
> +{
> + unsigned int offset = iter->bi->pi_offset;
> + struct bio_vec *bvec = iter->bip->bip_vec;
> +
> + while (offset > 0) {
> + struct bio_vec pbv = mp_bvec_iter_bvec(bvec, iter->prot_iter);
> + unsigned int len = min(pbv.bv_len, offset);
> + void *prot_buf = bvec_kmap_local(&pbv);
Is it valid to use bvec_kmap_local() on a multi-page bvec? It calls
kmap_local_page() internally, which will only map the first page,
right?
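For reference, if I'm reading include/linux/bvec.h right, it is just:

	static inline void *bvec_kmap_local(struct bio_vec *bvec)
	{
		return kmap_local_page(bvec->bv_page) + bvec->bv_offset;
	}

which works out on !HIGHMEM configs where a multi-page bvec's pages are
physically contiguous and sit in the direct map, but would only map the
first page when HIGHMEM is in play.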
> +
> + blk_calculate_guard(iter, prot_buf, len);
> + kunmap_local(prot_buf);
> + offset -= len;
> + bvec_iter_advance_single(bvec, &iter->prot_iter, len);
> + }
> + blk_integrity_csum_finish(iter);
> +}
> +
> +static void blk_integrity_copy_from_tuple(struct bio_integrity_payload *bip,
> + struct bvec_iter *iter, void *tuple,
> + unsigned int tuple_size)
> {
> - u8 offset = bi->pi_offset;
> - unsigned int i;
> + void *prot_buf;
Declare this in the inner loop where it's used? Same comment for
blk_integrity_copy_to_tuple().
>
> - for (i = 0 ; i < iter->data_size ; i += iter->interval) {
> - struct t10_pi_tuple *pi = iter->prot_buf + offset;
> + while (tuple_size) {
> + struct bio_vec pbv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
> + unsigned int len = min(tuple_size, pbv.bv_len);
>
> - pi->guard_tag = t10_pi_csum(0, iter->data_buf, iter->interval,
> - bi->csum_type);
> - if (offset)
> - pi->guard_tag = t10_pi_csum(pi->guard_tag,
> - iter->prot_buf, offset, bi->csum_type);
> - pi->app_tag = 0;
> + prot_buf = bvec_kmap_local(&pbv);
> + memcpy(prot_buf, tuple, len);
> + kunmap_local(prot_buf);
>
> - if (bi->flags & BLK_INTEGRITY_REF_TAG)
> - pi->ref_tag = cpu_to_be32(lower_32_bits(iter->seed));
> - else
> - pi->ref_tag = 0;
> + bvec_iter_advance_single(bip->bip_vec, iter, len);
> + tuple_size -= len;
> + tuple += len;
> + }
> +}
>
> - iter->data_buf += iter->interval;
> - iter->prot_buf += bi->metadata_size;
> - iter->seed++;
> +static void blk_integrity_copy_to_tuple(struct bio_integrity_payload *bip,
> + struct bvec_iter *iter, void *tuple,
> + unsigned int tuple_size)
> +{
> + void *prot_buf;
> +
> + while (tuple_size) {
> + struct bio_vec pbv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
> + unsigned int len = min(tuple_size, pbv.bv_len);
> +
> + prot_buf = bvec_kmap_local(&pbv);
> + memcpy(tuple, prot_buf, len);
> + kunmap_local(prot_buf);
> +
> + bvec_iter_advance_single(bip->bip_vec, iter, len);
> + tuple_size -= len;
> + tuple += len;
> }
> }
>
> -static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
> - struct blk_integrity *bi)
> -{
> - u8 offset = bi->pi_offset;
> - unsigned int i;
> -
> - for (i = 0 ; i < iter->data_size ; i += iter->interval) {
> - struct t10_pi_tuple *pi = iter->prot_buf + offset;
> - __be16 csum;
> -
> - if (bi->flags & BLK_INTEGRITY_REF_TAG) {
> - if (pi->app_tag == T10_PI_APP_ESCAPE)
> - goto next;
> -
> - if (be32_to_cpu(pi->ref_tag) !=
> - lower_32_bits(iter->seed)) {
> - pr_err("%s: ref tag error at location %llu " \
> - "(rcvd %u)\n", iter->disk_name,
> - (unsigned long long)
> - iter->seed, be32_to_cpu(pi->ref_tag));
> - return BLK_STS_PROTECTION;
> - }
> - } else {
> - if (pi->app_tag == T10_PI_APP_ESCAPE &&
> - pi->ref_tag == T10_PI_REF_ESCAPE)
> - goto next;
> +static bool ext_pi_ref_escape(const u8 ref_tag[6])
> +{
> + static const u8 ref_escape[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
> +
> + return memcmp(ref_tag, ref_escape, sizeof(ref_escape)) == 0;
> +}
> +
> +static blk_status_t blk_verify_ext_pi(struct blk_integrity_iter *iter,
> + struct crc64_pi_tuple *pi)
> +{
> + u64 seed = lower_48_bits(iter->seed);
> + u64 guard = get_unaligned_be64(&pi->guard_tag);
> + u64 ref = get_unaligned_be48(pi->ref_tag);
> + u16 app = get_unaligned_be16(&pi->app_tag);
> +
> + if (iter->bi->flags & BLK_INTEGRITY_REF_TAG) {
> + if (app == APP_TAG_ESCAPE)
> + return BLK_STS_OK;
> + if (ref != seed) {
> + pr_err("%s: ref tag error at location %llu (rcvd %llu)\n",
> + iter->bio->bi_bdev->bd_disk->disk_name, seed,
> + ref);
> + return BLK_STS_PROTECTION;
> }
> + } else if (app == APP_TAG_ESCAPE && ext_pi_ref_escape(pi->ref_tag)) {
> + return BLK_STS_OK;
> + }
> +
> + if (guard != iter->csum) {
> + pr_err("%s: guard tag error at sector %llu (rcvd %016llx, want %016llx)\n",
> + iter->bio->bi_bdev->bd_disk->disk_name, iter->seed,
> + guard, iter->csum);
> + return BLK_STS_PROTECTION;
> + }
>
> - csum = t10_pi_csum(0, iter->data_buf, iter->interval,
> - bi->csum_type);
> - if (offset)
> - csum = t10_pi_csum(csum, iter->prot_buf, offset,
> - bi->csum_type);
> -
> - if (pi->guard_tag != csum) {
> - pr_err("%s: guard tag error at sector %llu " \
> - "(rcvd %04x, want %04x)\n", iter->disk_name,
> - (unsigned long long)iter->seed,
> - be16_to_cpu(pi->guard_tag), be16_to_cpu(csum));
> + return BLK_STS_OK;
> +}
> +
> +static blk_status_t blk_verify_pi(struct blk_integrity_iter *iter,
> + struct t10_pi_tuple *pi, u16 guard)
> +{
> + u32 seed = lower_32_bits(iter->seed);
> + u32 ref = get_unaligned_be32(&pi->ref_tag);
> + u16 app = get_unaligned_be16(&pi->app_tag);
> +
> + if (iter->bi->flags & BLK_INTEGRITY_REF_TAG) {
> + if (app == APP_TAG_ESCAPE)
> + return BLK_STS_OK;
> + if (ref != seed) {
> + pr_err("%s: ref tag error at location %u (rcvd %u)\n",
> + iter->bio->bi_bdev->bd_disk->disk_name, seed,
> + ref);
> return BLK_STS_PROTECTION;
> }
> + } else if (app == APP_TAG_ESCAPE && ref == REF_TAG_ESCAPE) {
> + return BLK_STS_OK;
> + }
>
> -next:
> - iter->data_buf += iter->interval;
> - iter->prot_buf += bi->metadata_size;
> - iter->seed++;
> + if (guard != (u16)iter->csum) {
> + pr_err("%s: guard tag error at sector %llu (rcvd %04x, want %04x)\n",
> + iter->bio->bi_bdev->bd_disk->disk_name, iter->seed,
> + guard, (u16)iter->csum);
> + return BLK_STS_PROTECTION;
> }
>
> return BLK_STS_OK;
> }
>
> -/**
> - * t10_pi_type1_prepare - prepare PI prior submitting request to device
> - * @rq: request with PI that should be prepared
> - *
> - * For Type 1/Type 2, the virtual start sector is the one that was
> - * originally submitted by the block layer for the ref_tag usage. Due to
> - * partitioning, MD/DM cloning, etc. the actual physical start sector is
> - * likely to be different. Remap protection information to match the
> - * physical LBA.
> - */
> -static void t10_pi_type1_prepare(struct request *rq)
> +static blk_status_t blk_verify_t10_pi(struct blk_integrity_iter *iter,
> + struct t10_pi_tuple *pi)
> {
> - struct blk_integrity *bi = &rq->q->limits.integrity;
> - const int tuple_sz = bi->metadata_size;
> - u32 ref_tag = t10_pi_ref_tag(rq);
> - u8 offset = bi->pi_offset;
> - struct bio *bio;
> + u16 guard = get_unaligned_be16(&pi->guard_tag);
>
> - __rq_for_each_bio(bio, rq) {
> - struct bio_integrity_payload *bip = bio_integrity(bio);
> - u32 virt = bip_get_seed(bip) & 0xffffffff;
> - struct bio_vec iv;
> - struct bvec_iter iter;
> + return blk_verify_pi(iter, pi, guard);
> +}
>
> - /* Already remapped? */
> - if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
> - break;
> +static blk_status_t blk_verify_ip_pi(struct blk_integrity_iter *iter,
> + struct t10_pi_tuple *pi)
> +{
> + u16 guard = get_unaligned((u16 *)&pi->guard_tag);
>
> - bip_for_each_vec(iv, bip, iter) {
> - unsigned int j;
> - void *p;
> -
> - p = bvec_kmap_local(&iv);
> - for (j = 0; j < iv.bv_len; j += tuple_sz) {
> - struct t10_pi_tuple *pi = p + offset;
> -
> - if (be32_to_cpu(pi->ref_tag) == virt)
> - pi->ref_tag = cpu_to_be32(ref_tag);
> - virt++;
> - ref_tag++;
> - p += tuple_sz;
> - }
> - kunmap_local(p);
> - }
> + return blk_verify_pi(iter, pi, guard);
> +}
>
> - bip->bip_flags |= BIP_MAPPED_INTEGRITY;
> +static blk_status_t blk_integrity_verify(struct blk_integrity_iter *iter,
> + union pi_tuple *tuple)
> +{
> + switch (iter->bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_CRC64:
> + return blk_verify_ext_pi(iter, &tuple->crc64_pi);
> + case BLK_INTEGRITY_CSUM_CRC:
> + return blk_verify_t10_pi(iter, &tuple->t10_pi);
> + case BLK_INTEGRITY_CSUM_IP:
> + return blk_verify_ip_pi(iter, &tuple->t10_pi);
> + default:
> + return BLK_STS_OK;
> }
> }
>
> -/**
> - * t10_pi_type1_complete - prepare PI prior returning request to the blk layer
> - * @rq: request with PI that should be prepared
> - * @nr_bytes: total bytes to prepare
> - *
> - * For Type 1/Type 2, the virtual start sector is the one that was
> - * originally submitted by the block layer for the ref_tag usage. Due to
> - * partitioning, MD/DM cloning, etc. the actual physical start sector is
> - * likely to be different. Since the physical start sector was submitted
> - * to the device, we should remap it back to virtual values expected by the
> - * block layer.
> - */
> -static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
> +static void blk_set_ext_pi(struct blk_integrity_iter *iter,
> + struct crc64_pi_tuple *pi)
> {
> - struct blk_integrity *bi = &rq->q->limits.integrity;
> - unsigned intervals = nr_bytes >> bi->interval_exp;
> - const int tuple_sz = bi->metadata_size;
> - u32 ref_tag = t10_pi_ref_tag(rq);
> - u8 offset = bi->pi_offset;
> - struct bio *bio;
> + put_unaligned_be64(iter->csum, &pi->guard_tag);
> + put_unaligned_be16(0, &pi->app_tag);
> + put_unaligned_be48(iter->seed, &pi->ref_tag);
> +}
>
> - __rq_for_each_bio(bio, rq) {
> - struct bio_integrity_payload *bip = bio_integrity(bio);
> - u32 virt = bip_get_seed(bip) & 0xffffffff;
> - struct bio_vec iv;
> - struct bvec_iter iter;
> -
> - bip_for_each_vec(iv, bip, iter) {
> - unsigned int j;
> - void *p;
> -
> - p = bvec_kmap_local(&iv);
> - for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
> - struct t10_pi_tuple *pi = p + offset;
> -
> - if (be32_to_cpu(pi->ref_tag) == ref_tag)
> - pi->ref_tag = cpu_to_be32(virt);
> - virt++;
> - ref_tag++;
> - intervals--;
> - p += tuple_sz;
> - }
> - kunmap_local(p);
> - }
> +static void blk_set_pi(struct blk_integrity_iter *iter,
> + struct t10_pi_tuple *pi, __be16 csum)
> +{
> + put_unaligned(csum, &pi->guard_tag);
> + put_unaligned_be16(0, &pi->app_tag);
> + put_unaligned_be32(iter->seed, &pi->ref_tag);
> +}
> +
> +static void blk_set_t10_pi(struct blk_integrity_iter *iter,
> + struct t10_pi_tuple *pi)
> +{
> + blk_set_pi(iter, pi, cpu_to_be16((u16)iter->csum));
> +}
> +
> +static void blk_set_ip_pi(struct blk_integrity_iter *iter,
> + struct t10_pi_tuple *pi)
> +{
> + blk_set_pi(iter, pi, (__force __be16)(u16)iter->csum);
> +}
> +
> +static void blk_integrity_set(struct blk_integrity_iter *iter,
> + union pi_tuple *tuple)
> +{
> + switch (iter->bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_CRC64:
> + return blk_set_ext_pi(iter, &tuple->crc64_pi);
> + case BLK_INTEGRITY_CSUM_CRC:
> + return blk_set_t10_pi(iter, &tuple->t10_pi);
> + case BLK_INTEGRITY_CSUM_IP:
> + return blk_set_ip_pi(iter, &tuple->t10_pi);
> + default:
> + WARN_ON_ONCE(1);
> + return;
> }
> }
>
> -static __be64 ext_pi_crc64(u64 crc, void *data, unsigned int len)
> +static blk_status_t blk_integrity_interval(struct blk_integrity_iter *iter,
> + bool verify)
> {
> - return cpu_to_be64(crc64_nvme(crc, data, len));
> + blk_status_t ret = BLK_STS_OK;
> + union pi_tuple tuple;
> + void *ptuple = &tuple;
> + struct bio_vec pbv;
> +
> + blk_integrity_csum_offset(iter);
> + pbv = mp_bvec_iter_bvec(iter->bip->bip_vec, iter->prot_iter);
> + if (pbv.bv_len >= iter->bi->pi_tuple_size) {
> + ptuple = bvec_kmap_local(&pbv);
> + bvec_iter_advance_single(iter->bip->bip_vec, &iter->prot_iter,
> + iter->bi->metadata_size - iter->bi->pi_offset);
> + } else if (verify) {
> + blk_integrity_copy_to_tuple(iter->bip, &iter->prot_iter,
> + ptuple, iter->bi->pi_tuple_size);
> + }
> +
> + if (verify)
> + ret = blk_integrity_verify(iter, ptuple);
> + else
> + blk_integrity_set(iter, ptuple);
> +
> + if (likely(ptuple != &tuple)) {
> + kunmap_local(ptuple);
> + } else if (!verify) {
> + blk_integrity_copy_from_tuple(iter->bip, &iter->prot_iter,
> + ptuple, iter->bi->pi_tuple_size);
> + }
> +
> +
> + iter->interval_remaining = 1 << iter->bi->interval_exp;
> + iter->csum = 0;
> + iter->seed++;
> +
> + return ret;
> }
>
> -static void ext_pi_crc64_generate(struct blk_integrity_iter *iter,
> - struct blk_integrity *bi)
> +static void blk_integrity_iterate(struct bio *bio, struct bvec_iter *data_iter,
> + bool verify)
> {
> - u8 offset = bi->pi_offset;
> - unsigned int i;
> + struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> + struct blk_integrity_iter iter = {
> + .bio = bio,
> + .bip = bip,
> + .bi = bi,
> + .data_iter = *data_iter,
> + .prot_iter = bip->bip_iter,
> + .interval_remaining = 1 << bi->interval_exp,
> + .seed = data_iter->bi_sector,
> + .csum = 0,
> + };
> + blk_status_t ret = BLK_STS_OK;
> +
> + while (iter.data_iter.bi_size && ret == BLK_STS_OK) {
> + struct bio_vec bv = mp_bvec_iter_bvec(iter.bio->bi_io_vec,
> + iter.data_iter);
> + void *kaddr = bvec_kmap_local(&bv);
> + void *data = kaddr;
> + unsigned int len;
> +
> + bvec_iter_advance_single(iter.bio->bi_io_vec, &iter.data_iter,
> + bv.bv_len);
> + while (bv.bv_len && ret == BLK_STS_OK) {
> + len = min(iter.interval_remaining, bv.bv_len);
> + blk_calculate_guard(&iter, data, len);
> + bv.bv_len -= len;
> + data += len;
> + iter.interval_remaining -= len;
> + if (!iter.interval_remaining)
> + ret = blk_integrity_interval(&iter, verify);
> + }
> + kunmap_local(kaddr);
> + }
> +
> + if (ret)
> + bio->bi_status = ret;
> +}
>
> - for (i = 0 ; i < iter->data_size ; i += iter->interval) {
> - struct crc64_pi_tuple *pi = iter->prot_buf + offset;
> +void blk_integrity_generate(struct bio *bio)
> +{
> + struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
>
> - pi->guard_tag = ext_pi_crc64(0, iter->data_buf, iter->interval);
> - if (offset)
> - pi->guard_tag = ext_pi_crc64(be64_to_cpu(pi->guard_tag),
> - iter->prot_buf, offset);
> - pi->app_tag = 0;
> + switch (bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_CRC64:
> + case BLK_INTEGRITY_CSUM_CRC:
> + case BLK_INTEGRITY_CSUM_IP:
> + blk_integrity_iterate(bio, &bio->bi_iter, false);
> + break;
> + default:
> + break;
> + }
> +}
>
> - if (bi->flags & BLK_INTEGRITY_REF_TAG)
> - put_unaligned_be48(iter->seed, pi->ref_tag);
> - else
> - put_unaligned_be48(0ULL, pi->ref_tag);
> +void blk_integrity_verify_iter(struct bio *bio, struct bvec_iter *saved_iter)
> +{
> + struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
>
> - iter->data_buf += iter->interval;
> - iter->prot_buf += bi->metadata_size;
> - iter->seed++;
> + switch (bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_CRC64:
> + case BLK_INTEGRITY_CSUM_CRC:
> + case BLK_INTEGRITY_CSUM_IP:
> + blk_integrity_iterate(bio, saved_iter, true);
> + break;
> + default:
> + break;
> }
> }
>
> -static bool ext_pi_ref_escape(const u8 ref_tag[6])
> +/*
> + * Advance @iter past the protection offset for protection formats that
> + * contain front padding on the metadata region.
> + */
> +static void blk_pi_advance_offset(struct blk_integrity *bi,
> + struct bio_integrity_payload *bip,
> + struct bvec_iter *iter)
> {
> - static const u8 ref_escape[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
> + unsigned int offset = bi->pi_offset;
>
> - return memcmp(ref_tag, ref_escape, sizeof(ref_escape)) == 0;
> + while (offset > 0) {
> + struct bio_vec bv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
> + unsigned int len = min(bv.bv_len, offset);
> +
> + bvec_iter_advance_single(bip->bip_vec, iter, len);
> + offset -= len;
> + }
> }
>
> -static blk_status_t ext_pi_crc64_verify(struct blk_integrity_iter *iter,
> - struct blk_integrity *bi)
> -{
> - u8 offset = bi->pi_offset;
> - unsigned int i;
> -
> - for (i = 0; i < iter->data_size; i += iter->interval) {
> - struct crc64_pi_tuple *pi = iter->prot_buf + offset;
> - u64 ref, seed;
> - __be64 csum;
> -
> - if (bi->flags & BLK_INTEGRITY_REF_TAG) {
> - if (pi->app_tag == T10_PI_APP_ESCAPE)
> - goto next;
> -
> - ref = get_unaligned_be48(pi->ref_tag);
> - seed = lower_48_bits(iter->seed);
> - if (ref != seed) {
> - pr_err("%s: ref tag error at location %llu (rcvd %llu)\n",
> - iter->disk_name, seed, ref);
> - return BLK_STS_PROTECTION;
> - }
> - } else {
> - if (pi->app_tag == T10_PI_APP_ESCAPE &&
> - ext_pi_ref_escape(pi->ref_tag))
> - goto next;
> - }
> +static void *blk_tuple_remap_begin(union pi_tuple *tuple,
> + struct blk_integrity *bi,
> + struct bio_integrity_payload *bip,
> + struct bvec_iter *iter)
> +{
> + struct bvec_iter titer;
> + struct bio_vec pbv;
>
> - csum = ext_pi_crc64(0, iter->data_buf, iter->interval);
> - if (offset)
> - csum = ext_pi_crc64(be64_to_cpu(csum), iter->prot_buf,
> - offset);
> + blk_pi_advance_offset(bi, bip, iter);
> + pbv = mp_bvec_iter_bvec(bip->bip_vec, *iter);
> + if (likely(pbv.bv_len >= bi->pi_tuple_size))
> + return bvec_kmap_local(&pbv);
>
> - if (pi->guard_tag != csum) {
> - pr_err("%s: guard tag error at sector %llu " \
> - "(rcvd %016llx, want %016llx)\n",
> - iter->disk_name, (unsigned long long)iter->seed,
> - be64_to_cpu(pi->guard_tag), be64_to_cpu(csum));
> - return BLK_STS_PROTECTION;
> - }
> + /*
> + * We need to preserve the state of the original iter for the
> + * copy_from_tuple at the end, so make a temp iter for here.
> + */
> + titer = *iter;
> + blk_integrity_copy_to_tuple(bip, &titer, tuple, bi->pi_tuple_size);
> + return tuple;
> +}
>
> -next:
> - iter->data_buf += iter->interval;
> - iter->prot_buf += bi->metadata_size;
> - iter->seed++;
> +static void blk_tuple_remap_end(union pi_tuple *tuple, void *ptuple,
> + struct blk_integrity *bi,
> + struct bio_integrity_payload *bip,
> + struct bvec_iter *iter)
> +{
> + unsigned int len = bi->metadata_size - bi->pi_offset;
> +
> + if (likely(ptuple != tuple)) {
> + kunmap_local(ptuple);
> + } else {
> + blk_integrity_copy_from_tuple(bip, iter, ptuple,
> + bi->pi_tuple_size);
> + len -= bi->pi_tuple_size;
> }
>
> - return BLK_STS_OK;
> + bvec_iter_advance(bip->bip_vec, iter, len);
> }
>
> -static void ext_pi_type1_prepare(struct request *rq)
> +static void blk_set_ext_unmap_ref(struct crc64_pi_tuple *pi, u64 virt,
> + u64 ref_tag)
> {
> - struct blk_integrity *bi = &rq->q->limits.integrity;
> - const int tuple_sz = bi->metadata_size;
> - u64 ref_tag = ext_pi_ref_tag(rq);
> - u8 offset = bi->pi_offset;
> - struct bio *bio;
> + u64 ref = get_unaligned_be48(&pi->ref_tag);
>
> - __rq_for_each_bio(bio, rq) {
> - struct bio_integrity_payload *bip = bio_integrity(bio);
> - u64 virt = lower_48_bits(bip_get_seed(bip));
> - struct bio_vec iv;
> - struct bvec_iter iter;
> + if (ref == lower_48_bits(ref_tag) && ref != lower_48_bits(virt))
> + put_unaligned_be48(virt, pi->ref_tag);
> +}
>
> - /* Already remapped? */
> - if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
> - break;
> +static void blk_set_t10_unmap_ref(struct t10_pi_tuple *pi, u32 virt,
> + u32 ref_tag)
> +{
> + u32 ref = get_unaligned_be32(&pi->ref_tag);
>
> - bip_for_each_vec(iv, bip, iter) {
> - unsigned int j;
> - void *p;
> -
> - p = bvec_kmap_local(&iv);
> - for (j = 0; j < iv.bv_len; j += tuple_sz) {
> - struct crc64_pi_tuple *pi = p + offset;
> - u64 ref = get_unaligned_be48(pi->ref_tag);
> -
> - if (ref == virt)
> - put_unaligned_be48(ref_tag, pi->ref_tag);
> - virt++;
> - ref_tag++;
> - p += tuple_sz;
> - }
> - kunmap_local(p);
> - }
> + if (ref == ref_tag && ref != virt)
> + put_unaligned_be32(virt, &pi->ref_tag);
> +}
>
> - bip->bip_flags |= BIP_MAPPED_INTEGRITY;
> +static void blk_reftag_remap_complete(struct blk_integrity *bi,
> + union pi_tuple *tuple, u64 virt, u64 ref)
> +{
> + switch (bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_CRC64:
> + blk_set_ext_unmap_ref(&tuple->crc64_pi, virt, ref);
> + break;
> + case BLK_INTEGRITY_CSUM_CRC:
> + case BLK_INTEGRITY_CSUM_IP:
> + blk_set_t10_unmap_ref(&tuple->t10_pi, virt, ref);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + break;
> }
> }
>
> -static void ext_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
> +static void blk_set_ext_map_ref(struct crc64_pi_tuple *pi, u64 virt,
> + u64 ref_tag)
> {
> - struct blk_integrity *bi = &rq->q->limits.integrity;
> - unsigned intervals = nr_bytes >> bi->interval_exp;
> - const int tuple_sz = bi->metadata_size;
> - u64 ref_tag = ext_pi_ref_tag(rq);
> - u8 offset = bi->pi_offset;
> - struct bio *bio;
> + u64 ref = get_unaligned_be48(&pi->ref_tag);
>
> - __rq_for_each_bio(bio, rq) {
> - struct bio_integrity_payload *bip = bio_integrity(bio);
> - u64 virt = lower_48_bits(bip_get_seed(bip));
> - struct bio_vec iv;
> - struct bvec_iter iter;
> -
> - bip_for_each_vec(iv, bip, iter) {
> - unsigned int j;
> - void *p;
> -
> - p = bvec_kmap_local(&iv);
> - for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
> - struct crc64_pi_tuple *pi = p + offset;
> - u64 ref = get_unaligned_be48(pi->ref_tag);
> -
> - if (ref == ref_tag)
> - put_unaligned_be48(virt, pi->ref_tag);
> - virt++;
> - ref_tag++;
> - intervals--;
> - p += tuple_sz;
> - }
> - kunmap_local(p);
> - }
> - }
> + if (ref == lower_48_bits(virt) && ref != ref_tag)
> + put_unaligned_be48(ref_tag, pi->ref_tag);
> }
>
> -void blk_integrity_generate(struct bio *bio)
> +static void blk_set_t10_map_ref(struct t10_pi_tuple *pi, u32 virt, u32 ref_tag)
> {
> - struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
> - struct bio_integrity_payload *bip = bio_integrity(bio);
> - struct blk_integrity_iter iter;
> - struct bvec_iter bviter;
> - struct bio_vec bv;
> -
> - iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
> - iter.interval = 1 << bi->interval_exp;
> - iter.seed = bio->bi_iter.bi_sector;
> - iter.prot_buf = bvec_virt(bip->bip_vec);
> - bio_for_each_segment(bv, bio, bviter) {
> - void *kaddr = bvec_kmap_local(&bv);
> + u32 ref = get_unaligned_be32(&pi->ref_tag);
>
> - iter.data_buf = kaddr;
> - iter.data_size = bv.bv_len;
> - switch (bi->csum_type) {
> - case BLK_INTEGRITY_CSUM_CRC64:
> - ext_pi_crc64_generate(&iter, bi);
> - break;
> - case BLK_INTEGRITY_CSUM_CRC:
> - case BLK_INTEGRITY_CSUM_IP:
> - t10_pi_generate(&iter, bi);
> - break;
> - default:
> - break;
> - }
> - kunmap_local(kaddr);
> + if (ref == virt && ref != ref_tag)
> + put_unaligned_be32(ref_tag, &pi->ref_tag);
> +}
> +
> +static void blk_reftag_remap_prepare(struct blk_integrity *bi,
> + union pi_tuple *tuple,
> + u64 virt, u64 ref)
> +{
> + switch (bi->csum_type) {
> + case BLK_INTEGRITY_CSUM_CRC64:
> + blk_set_ext_map_ref(&tuple->crc64_pi, virt, ref);
> + break;
> + case BLK_INTEGRITY_CSUM_CRC:
> + case BLK_INTEGRITY_CSUM_IP:
> + blk_set_t10_map_ref(&tuple->t10_pi, virt, ref);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + break;
> }
> }
>
> -void blk_integrity_verify_iter(struct bio *bio, struct bvec_iter *saved_iter)
> +static void __blk_reftag_remap(struct bio *bio, struct blk_integrity *bi,
> + unsigned *intervals, u64 *ref, bool prep)
> {
> - struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
> struct bio_integrity_payload *bip = bio_integrity(bio);
> - struct blk_integrity_iter iter;
> - struct bvec_iter bviter;
> - struct bio_vec bv;
> + struct bvec_iter iter = bip->bip_iter;
> + u64 virt = bip_get_seed(bip);
> + union pi_tuple *ptuple;
> + union pi_tuple tuple;
>
> - /*
> - * At the moment verify is called bi_iter has been advanced during split
> - * and completion, so use the copy created during submission here.
> - */
> - iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
> - iter.interval = 1 << bi->interval_exp;
> - iter.seed = saved_iter->bi_sector;
> - iter.prot_buf = bvec_virt(bip->bip_vec);
> - __bio_for_each_segment(bv, bio, bviter, *saved_iter) {
> - void *kaddr = bvec_kmap_local(&bv);
> - blk_status_t ret = BLK_STS_OK;
> + if (prep && bip->bip_flags & BIP_MAPPED_INTEGRITY) {
> + *ref += bio->bi_iter.bi_size >> bi->interval_exp;
> + return;
> + }
>
> - iter.data_buf = kaddr;
> - iter.data_size = bv.bv_len;
> - switch (bi->csum_type) {
> - case BLK_INTEGRITY_CSUM_CRC64:
> - ret = ext_pi_crc64_verify(&iter, bi);
> - break;
> - case BLK_INTEGRITY_CSUM_CRC:
> - case BLK_INTEGRITY_CSUM_IP:
> - ret = t10_pi_verify(&iter, bi);
> - break;
> - default:
> - break;
> - }
> - kunmap_local(kaddr);
> + while (iter.bi_size && *intervals) {
> + ptuple = blk_tuple_remap_begin(&tuple, bi, bip, &iter);
>
> - if (ret) {
> - bio->bi_status = ret;
> - return;
> - }
> + if (prep)
> + blk_reftag_remap_prepare(bi, ptuple, virt, *ref);
> + else
> + blk_reftag_remap_complete(bi, ptuple, virt, *ref);
> +
> + blk_tuple_remap_end(&tuple, ptuple, bi, bip, &iter);
> + (*intervals)--;
> + (*ref)++;
> + virt++;
> }
> +
> + if (prep)
> + bip->bip_flags |= BIP_MAPPED_INTEGRITY;
> }
>
> -void blk_integrity_prepare(struct request *rq)
> +static void blk_integrity_remap(struct request *rq, unsigned int nr_bytes,
> + bool prep)
> {
> struct blk_integrity *bi = &rq->q->limits.integrity;
> + u64 ref = blk_rq_pos(rq) >> (bi->interval_exp - SECTOR_SHIFT);
> + unsigned intervals = nr_bytes >> bi->interval_exp;
> + struct bio *bio;
>
> if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
> return;
>
> - if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
> - ext_pi_type1_prepare(rq);
> - else
> - t10_pi_type1_prepare(rq);
> + __rq_for_each_bio(bio, rq) {
> + __blk_reftag_remap(bio, bi, &intervals, &ref, prep);
> + if (!intervals)
> + break;
> + }
> }
>
> -void blk_integrity_complete(struct request *rq, unsigned int nr_bytes)
> +void blk_integrity_prepare(struct request *rq)
> {
> - struct blk_integrity *bi = &rq->q->limits.integrity;
> -
> - if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
> - return;
> + blk_integrity_remap(rq, blk_rq_bytes(rq), true);
> +}
>
> - if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
> - ext_pi_type1_complete(rq, nr_bytes);
> - else
> - t10_pi_type1_complete(rq, nr_bytes);
> +void blk_integrity_complete(struct request *rq, unsigned int nr_bytes)
> +{
> + blk_integrity_remap(rq, nr_bytes, false);
> }
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7bf228df6001f..fa534f1d6b27a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1874,6 +1874,7 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
> break;
> }
>
> + bi->flags |= BLK_SPLIT_INTERVAL_CAPABLE;
Can just be = instead of |= since bi is zeroed above.
Best,
Caleb
> bi->metadata_size = head->ms;
> if (bi->csum_type) {
> bi->pi_tuple_size = head->pi_size;
> diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
> index a6b84206eb94c..1f937373e56b6 100644
> --- a/include/linux/blk-integrity.h
> +++ b/include/linux/blk-integrity.h
> @@ -19,6 +19,7 @@ enum blk_integrity_flags {
> BLK_INTEGRITY_DEVICE_CAPABLE = 1 << 2,
> BLK_INTEGRITY_REF_TAG = 1 << 3,
> BLK_INTEGRITY_STACKED = 1 << 4,
> + BLK_SPLIT_INTERVAL_CAPABLE = 1 << 5,
> };
>
> const char *blk_integrity_profile_name(struct blk_integrity *bi);
> --
> 2.47.3
>
>
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-24 21:34 ` Caleb Sander Mateos
@ 2025-11-24 22:56 ` Keith Busch
2025-11-25 3:41 ` Caleb Sander Mateos
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-11-24 22:56 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: Keith Busch, linux-block, axboe, hch, ebiggers,
Martin K. Petersen
On Mon, Nov 24, 2025 at 01:34:03PM -0800, Caleb Sander Mateos wrote:
> On Mon, Nov 24, 2025 at 8:18 AM Keith Busch <kbusch@meta.com> wrote:
> >
> > From: Keith Busch <kbusch@kernel.org>
> >
> > A bio segment may have partial interval block data with the rest continuing
> > into the next segments because direct-io data payloads only need to aligned in
>
> "need to be aligned"?
In the original text, s/aligned/align
> > + while (offset > 0) {
> > + struct bio_vec pbv = mp_bvec_iter_bvec(bvec, iter->prot_iter);
> > + unsigned int len = min(pbv.bv_len, offset);
> > + void *prot_buf = bvec_kmap_local(&pbv);
>
> Is it valid to use bvec_kmap_local() on a multi-page bvec?
If it wasn't valid, there'd be a major problem with large folios.
> It calls
> kmap_local_page() internally, which will only map the first page,
> right?
Compare kmap_local_page() with kmap_local_folio(). They both resolve to
the same lower level mapping function, and folios have no problem
spanning pages.
> > @@ -1874,6 +1874,7 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
> > break;
> > }
> >
> > + bi->flags |= BLK_SPLIT_INTERVAL_CAPABLE;
>
> Can just be = instead of |= since bi is zeroed above.
It can, but this kind of syntax is a courtesy to future changes so
that you don't need to change this line later. You can also end a struct
initialization without a trailing ",", but that's just mean.
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-24 22:56 ` Keith Busch
@ 2025-11-25 3:41 ` Caleb Sander Mateos
2025-11-25 11:31 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Caleb Sander Mateos @ 2025-11-25 3:41 UTC (permalink / raw)
To: Keith Busch
Cc: Keith Busch, linux-block, axboe, hch, ebiggers,
Martin K. Petersen
On Mon, Nov 24, 2025 at 2:56 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Nov 24, 2025 at 01:34:03PM -0800, Caleb Sander Mateos wrote:
> > On Mon, Nov 24, 2025 at 8:18 AM Keith Busch <kbusch@meta.com> wrote:
> > >
> > > From: Keith Busch <kbusch@kernel.org>
> > >
> > > A bio segment may have partial interval block data with the rest continuing
> > > into the next segments because direct-io data payloads only need to aligned in
> >
> > "need to be aligned"?
>
> In the original text, s/aligned/align
That works.
>
> > > + while (offset > 0) {
> > > + struct bio_vec pbv = mp_bvec_iter_bvec(bvec, iter->prot_iter);
> > > + unsigned int len = min(pbv.bv_len, offset);
> > > + void *prot_buf = bvec_kmap_local(&pbv);
> >
> > Is it valid to use bvec_kmap_local() on a multi-page bvec?
>
> If it wasn't valid, there'd be a major problem with large folios.
>
> > It calls
> > kmap_local_page() internally, which will only map the first page,
> > right?
>
> Compare kmap_local_page() with kmap_local_folio(). They both resolve to
> the same lower level mapping function, and folios have no problem
> spanning pages.
Documentation/mm/highmem.rst seems to suggest that kmap_local_folio()
only maps the page that contains the specified offset, not the whole
folio:
    The only differences between them consist in the first taking a pointer
    to a struct page and the second taking a pointer to struct folio and the
    byte offset within the folio which identifies the page.
And this is consistent with the implementation, which passes
__kmap_local_page_prot() the page at the given offset into the folio.
Of course, both are no-ops for non-CONFIG_HIGHMEM systems, so it may
be difficult to observe the difference on most systems these days...
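For reference, the CONFIG_HIGHMEM definition in mm/highmem.c is roughly
this (quoting from memory, so the details may be slightly off):

void *kmap_local_folio(struct folio *folio, size_t offset)
{
        struct page *page = folio_page(folio, offset / PAGE_SIZE);

        /* maps exactly one page: the one containing @offset */
        return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
}

so it never establishes a mapping beyond the single page that contains
the requested offset.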
>
> > > @@ -1874,6 +1874,7 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
> > > break;
> > > }
> > >
> > > + bi->flags |= BLK_SPLIT_INTERVAL_CAPABLE;
> >
> > Can just be = instead of |= since bi is zeroed above.
>
> It can, but these kinds of syntax are a courtesty to future changes so
> that you don't need to change this line later. You can also end a struct
> initialization without a trailing "," too, but that's just mean.
Sure, either way is fine.
Best,
Caleb
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-25 3:41 ` Caleb Sander Mateos
@ 2025-11-25 11:31 ` Christoph Hellwig
2025-11-25 14:24 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2025-11-25 11:31 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: Keith Busch, Keith Busch, linux-block, axboe, hch, ebiggers,
Martin K. Petersen
On Mon, Nov 24, 2025 at 07:41:16PM -0800, Caleb Sander Mateos wrote:
> > Compare kmap_local_page() with kmap_local_folio(). They both resolve to
> > the same lower level mapping function, and folios have no problem
> > spanning pages.
>
> Documentation/mm/highmem.rst seems to suggest that kmap_local_folio()
> only maps the page that contains the specified offset, not the whole
> folio:
Yes. Additionally, multi-page bvecs aren't folios, they are just
physically contiguous ranges. So I don't think we could use the folio
helpers even if they worked here, as a single range might consist of
multiple folios.
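For reference, a bvec is nothing more than this (include/linux/bvec.h,
comments mine):

struct bio_vec {
        struct page     *bv_page;       /* first page of the range */
        unsigned int    bv_len;         /* may cover many pages/folios */
        unsigned int    bv_offset;      /* start offset within bv_page */
};

i.e. there is no folio anywhere in sight.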
(Sorry for not spotting this earlier)
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-25 11:31 ` Christoph Hellwig
@ 2025-11-25 14:24 ` Keith Busch
2025-11-25 16:46 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-11-25 14:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Caleb Sander Mateos, Keith Busch, linux-block, axboe, ebiggers,
Martin K. Petersen
On Tue, Nov 25, 2025 at 12:31:44PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 24, 2025 at 07:41:16PM -0800, Caleb Sander Mateos wrote:
> > > Compare kmap_local_page() with kmap_local_folio(). They both resolve to
> > > the same lower level mapping function, and folios have no problem
> > > spanning pages.
> >
> > Documentation/mm/highmem.rst seems to suggest that kmap_local_folio()
> > only maps the page that contains the specified offset, not the whole
> > folio:
>
> Yes. Additionally multi-page bvecs aren't folios, they just are
> physically contiguous ranges. So I don't think we could even use any
> folio helpers if they worked, as a single range might consist of multiple
> folios.
>
> (Sorry for not spotting this earlier)
So the solution is to replace mp_bvec_iter_bvec() with bvec_iter_bvec()
to ensure we don't cross pages? It's a little less efficient, but that's
not a big deal.
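i.e. just something like this at each kmap site, e.g. the one Caleb
pointed at (untested):

-               struct bio_vec pbv = mp_bvec_iter_bvec(bvec, iter->prot_iter);
+               struct bio_vec pbv = bvec_iter_bvec(bvec, iter->prot_iter);

with the same one-line substitution in blk_pi_advance_offset(),
blk_tuple_remap_begin() and the data-side loop. bvec_iter_bvec() clamps
the returned length at the page boundary, so each bvec_kmap_local() call
stays within a single page.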
I assumed mapping physically contiguous memory was contiguous in kernel
address space too, and that seems to work out in testing, but I don't
have CONFIG_HIGHMEM enabled where that might matter.
* Re: [PATCHv6] blk-integrity: support arbitrary buffer alignment
2025-11-25 14:24 ` Keith Busch
@ 2025-11-25 16:46 ` Christoph Hellwig
0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2025-11-25 16:46 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Caleb Sander Mateos, Keith Busch, linux-block,
axboe, ebiggers, Martin K. Petersen
On Tue, Nov 25, 2025 at 07:24:27AM -0700, Keith Busch wrote:
> So the solution is to replace mp_bvec_iter_bvec() with bvec_iter_bvec()
> to ensure we don't cross pages? It's a little less efficient, but that's
> not a big deal.
Yes. At least for CONFIG_HIGHMEM. Note that in many places I actually
found that open-coding these macros leads to nicer code, and with the
CONFIG_HIGHMEM ifdef this would need, that would probably be the case
here as well.
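Roughly the direction I mean, as a completely untested sketch: keep the
mp_ helper but open-code the single-page clamping that bvec_iter_bvec()
would otherwise do, and only when highmem can actually bite:

        struct bio_vec pbv = mp_bvec_iter_bvec(bvec, iter->prot_iter);

        if (IS_ENABLED(CONFIG_HIGHMEM)) {
                /* kmap_local_page() maps a single page, so trim the bvec
                 * to the page containing the current offset */
                pbv.bv_page += pbv.bv_offset / PAGE_SIZE;
                pbv.bv_offset = offset_in_page(pbv.bv_offset);
                pbv.bv_len = min_t(unsigned int, pbv.bv_len,
                                   PAGE_SIZE - pbv.bv_offset);
        }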
> I assumed mapping physically contiguous memory was contiguous in kernel
> address space too,
It is.
> and that seems to work out in testing, but I don't
> have CONFIG_HIGHMEM enabled where that might matter.
I keep an x86_32 VM around for that, but I usually only fire it up when
it's too late..
Thread overview: 11+ messages
2025-11-24 16:17 [PATCHv6] blk-integrity: support arbitrary buffer alignment Keith Busch
2025-11-24 16:31 ` Martin K. Petersen
2025-11-24 17:57 ` Keith Busch
2025-11-24 20:49 ` Martin K. Petersen
2025-11-24 16:58 ` Christoph Hellwig
2025-11-24 21:34 ` Caleb Sander Mateos
2025-11-24 22:56 ` Keith Busch
2025-11-25 3:41 ` Caleb Sander Mateos
2025-11-25 11:31 ` Christoph Hellwig
2025-11-25 14:24 ` Keith Busch
2025-11-25 16:46 ` Christoph Hellwig