* [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata
@ 2024-05-15 13:27 Mikulas Patocka
2024-05-15 13:28 ` [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator Mikulas Patocka
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-15 13:27 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz
Cc: linux-block, dm-devel, linux-nvme
Hi
Some NVMe devices may be formatted with extra 64 bytes of metadata per
sector.
Here I'm submitting for review dm-crypt patches that make it possible to
use per-sector metadata for authenticated encryption. With these patches,
dm-crypt can run directly on the top of a NVMe device, without using
dm-integrity. These patches roughly double write throughput, because there
is no write to the dm-integrity journal.
An example of how to use it (so far, there is no support in the userspace
cryptsetup tool):
# nvme format /dev/nvme1 -n 1 -lbaf=4
# dmsetup create cr --table '0 1048576 crypt
capi:authenc(hmac(sha256),cbc(aes))-essiv:sha256
01b11af6b55f76424fd53fb66667c301466b2eeaf0f39fd36d26e7fc4f52ade2de4228e996f5ae2fe817ce178e77079d28e4baaebffbcd3e16ae4f36ef217298
0 /dev/nvme1n1 0 2 integrity:32:aead sector_size:4096'
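The result can be double-checked with nvme-cli - after the format command above,
the in-use LBA format should report 64 bytes of metadata per 4096-byte sector
(output abridged; the device name and lbaf index are just an example and depend
on the drive):
# nvme id-ns /dev/nvme1n1 | grep lbaf
lbaf 4 : ms:64 lbads:12 rp:0 (in use)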
Please review it - I'd like to know whether detecting the presence of
per-sector metadata in crypt_integrity_ctr is correct or whether it should be
done differently.
Mikulas
* [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator
2024-05-15 13:27 [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata Mikulas Patocka
@ 2024-05-15 13:28 ` Mikulas Patocka
2024-05-16 2:30 ` Jens Axboe
2024-05-16 8:14 ` [RFC PATCH 1/2] " Ming Lei
2024-05-15 13:30 ` [RFC PATCH 2/2] dm-crypt: support per-sector NVMe metadata Mikulas Patocka
2024-05-27 22:12 ` [RFC PATCH 0/2] dm-crypt support for " Eric Wheeler
2 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-15 13:28 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz
Cc: linux-block, dm-devel, linux-nvme
If we allocate a bio that is larger than the NVMe maximum request size, attach
integrity metadata to it and send it to the NVMe subsystem, the integrity
metadata will be corrupted.
Splitting the bio works correctly. The function bio_split will clone the
bio, trim the iterator of the first bio and advance the iterator of the
second bio.
However, the function rq_integrity_vec has a bug - it returns the first
vector of the bio's metadata and completely disregards the metadata
iterator that was advanced when the bio was split. Thus, the second bio
uses the same metadata as the first bio and this leads to metadata
corruption.
This commit changes rq_integrity_vec so that it calls mp_bvec_iter_bvec
instead of returning the first vector. mp_bvec_iter_bvec takes the
iterator into account and returns the vector at the iterator's current position.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/nvme/host/pci.c | 6 +++---
include/linux/blk-integrity.h | 12 ++++++------
2 files changed, 9 insertions(+), 9 deletions(-)
Index: linux-2.6/drivers/nvme/host/pci.c
===================================================================
--- linux-2.6.orig/drivers/nvme/host/pci.c
+++ linux-2.6/drivers/nvme/host/pci.c
@@ -825,9 +825,9 @@ static blk_status_t nvme_map_metadata(st
struct nvme_command *cmnd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct bio_vec bv = rq_integrity_vec(req);
- iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req),
- rq_dma_dir(req), 0);
+ iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
if (dma_mapping_error(dev->dev, iod->meta_dma))
return BLK_STS_IOERR;
cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
@@ -966,7 +966,7 @@ static __always_inline void nvme_pci_unm
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
dma_unmap_page(dev->dev, iod->meta_dma,
- rq_integrity_vec(req)->bv_len, rq_dma_dir(req));
+ rq_integrity_vec(req).bv_len, rq_dma_dir(req));
}
if (blk_rq_nr_phys_segments(req))
Index: linux-2.6/include/linux/blk-integrity.h
===================================================================
--- linux-2.6.orig/include/linux/blk-integrity.h
+++ linux-2.6/include/linux/blk-integrity.h
@@ -109,11 +109,11 @@ static inline bool blk_integrity_rq(stru
* Return the first bvec that contains integrity data. Only drivers that are
* limited to a single integrity segment should use this helper.
*/
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
+static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
- if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
- return NULL;
- return rq->bio->bi_integrity->bip_vec;
+ WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1);
+ return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec,
+ rq->bio->bi_integrity->bip_iter);
}
#else /* CONFIG_BLK_DEV_INTEGRITY */
static inline int blk_rq_count_integrity_sg(struct request_queue *q,
@@ -177,9 +177,9 @@ static inline int blk_integrity_rq(struc
return 0;
}
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
+static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
- return NULL;
+ BUG();
}
#endif /* CONFIG_BLK_DEV_INTEGRITY */
#endif /* _LINUX_BLK_INTEGRITY_H */
* [RFC PATCH 2/2] dm-crypt: support per-sector NVMe metadata
2024-05-15 13:27 [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata Mikulas Patocka
2024-05-15 13:28 ` [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator Mikulas Patocka
@ 2024-05-15 13:30 ` Mikulas Patocka
2024-05-27 22:12 ` [RFC PATCH 0/2] dm-crypt support for " Eric Wheeler
2 siblings, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-15 13:30 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz
Cc: linux-block, dm-devel, linux-nvme
Support per-sector NVMe metadata in dm-crypt.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/md/dm-crypt.c | 53 ++++++++++++++++++++++++++++----------------------
1 file changed, 30 insertions(+), 23 deletions(-)
Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c
+++ linux-2.6/drivers/md/dm-crypt.c
@@ -211,7 +211,8 @@ struct crypt_config {
unsigned int integrity_tag_size;
unsigned int integrity_iv_size;
- unsigned int on_disk_tag_size;
+ unsigned int used_tag_size;
+ unsigned int tuple_size;
/*
* pool for per bio private data, crypto requests,
@@ -1148,14 +1149,14 @@ static int dm_crypt_integrity_io_alloc(s
unsigned int tag_len;
int ret;
- if (!bio_sectors(bio) || !io->cc->on_disk_tag_size)
+ if (!bio_sectors(bio) || !io->cc->tuple_size)
return 0;
bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
if (IS_ERR(bip))
return PTR_ERR(bip);
- tag_len = io->cc->on_disk_tag_size * (bio_sectors(bio) >> io->cc->sector_shift);
+ tag_len = io->cc->tuple_size * (bio_sectors(bio) >> io->cc->sector_shift);
bip->bip_iter.bi_sector = io->cc->start + io->sector;
@@ -1173,24 +1174,30 @@ static int crypt_integrity_ctr(struct cr
struct blk_integrity *bi = blk_get_integrity(cc->dev->bdev->bd_disk);
struct mapped_device *md = dm_table_get_md(ti->table);
+ if (!bi) {
+ ti->error = "No integrity profile.";
+ return -EINVAL;
+ }
+
/* From now we require underlying device with our integrity profile */
- if (!bi || strcasecmp(bi->profile->name, "DM-DIF-EXT-TAG")) {
+ if (strcasecmp(bi->profile->name, "DM-DIF-EXT-TAG") &&
+ strcasecmp(bi->profile->name, "nop")) {
ti->error = "Integrity profile not supported.";
return -EINVAL;
}
- if (bi->tag_size != cc->on_disk_tag_size ||
- bi->tuple_size != cc->on_disk_tag_size) {
+ if (bi->tuple_size < cc->used_tag_size) {
ti->error = "Integrity profile tag size mismatch.";
return -EINVAL;
}
+ cc->tuple_size = bi->tuple_size;
if (1 << bi->interval_exp != cc->sector_size) {
ti->error = "Integrity profile sector size mismatch.";
return -EINVAL;
}
if (crypt_integrity_aead(cc)) {
- cc->integrity_tag_size = cc->on_disk_tag_size - cc->integrity_iv_size;
+ cc->integrity_tag_size = cc->used_tag_size - cc->integrity_iv_size;
DMDEBUG("%s: Integrity AEAD, tag size %u, IV size %u.", dm_device_name(md),
cc->integrity_tag_size, cc->integrity_iv_size);
@@ -1202,7 +1209,7 @@ static int crypt_integrity_ctr(struct cr
DMDEBUG("%s: Additional per-sector space %u bytes for IV.", dm_device_name(md),
cc->integrity_iv_size);
- if ((cc->integrity_tag_size + cc->integrity_iv_size) != bi->tag_size) {
+ if ((cc->integrity_tag_size + cc->integrity_iv_size) > cc->tuple_size) {
ti->error = "Not enough space for integrity tag in the profile.";
return -EINVAL;
}
@@ -1281,7 +1288,7 @@ static void *tag_from_dmreq(struct crypt
struct dm_crypt_io *io = container_of(ctx, struct dm_crypt_io, ctx);
return &io->integrity_metadata[*org_tag_of_dmreq(cc, dmreq) *
- cc->on_disk_tag_size];
+ cc->tuple_size];
}
static void *iv_tag_from_dmreq(struct crypt_config *cc,
@@ -1362,9 +1369,9 @@ static int crypt_convert_block_aead(stru
aead_request_set_crypt(req, dmreq->sg_in, dmreq->sg_out,
cc->sector_size, iv);
r = crypto_aead_encrypt(req);
- if (cc->integrity_tag_size + cc->integrity_iv_size != cc->on_disk_tag_size)
+ if (cc->integrity_tag_size + cc->integrity_iv_size != cc->tuple_size)
memset(tag + cc->integrity_tag_size + cc->integrity_iv_size, 0,
- cc->on_disk_tag_size - (cc->integrity_tag_size + cc->integrity_iv_size));
+ cc->tuple_size - (cc->integrity_tag_size + cc->integrity_iv_size));
} else {
aead_request_set_crypt(req, dmreq->sg_in, dmreq->sg_out,
cc->sector_size + cc->integrity_tag_size, iv);
@@ -1794,7 +1801,7 @@ static void crypt_dec_pending(struct dm_
return;
if (likely(!io->ctx.aead_recheck) && unlikely(io->ctx.aead_failed) &&
- cc->on_disk_tag_size && bio_data_dir(base_bio) == READ) {
+ cc->used_tag_size && bio_data_dir(base_bio) == READ) {
io->ctx.aead_recheck = true;
io->ctx.aead_failed = false;
io->error = 0;
@@ -3173,7 +3180,7 @@ static int crypt_ctr_optional(struct dm_
ti->error = "Invalid integrity arguments";
return -EINVAL;
}
- cc->on_disk_tag_size = val;
+ cc->used_tag_size = val;
sval = strchr(opt_string + strlen("integrity:"), ':') + 1;
if (!strcasecmp(sval, "aead")) {
set_bit(CRYPT_MODE_INTEGRITY_AEAD, &cc->cipher_flags);
@@ -3384,12 +3391,12 @@ static int crypt_ctr(struct dm_target *t
if (ret)
goto bad;
- cc->tag_pool_max_sectors = POOL_ENTRY_SIZE / cc->on_disk_tag_size;
+ cc->tag_pool_max_sectors = POOL_ENTRY_SIZE / cc->tuple_size;
if (!cc->tag_pool_max_sectors)
cc->tag_pool_max_sectors = 1;
ret = mempool_init_kmalloc_pool(&cc->tag_pool, MIN_IOS,
- cc->tag_pool_max_sectors * cc->on_disk_tag_size);
+ cc->tag_pool_max_sectors * cc->tuple_size);
if (ret) {
ti->error = "Cannot allocate integrity tags mempool";
goto bad;
@@ -3464,7 +3471,7 @@ static int crypt_map(struct dm_target *t
* Check if bio is too large, split as needed.
*/
if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)) &&
- (bio_data_dir(bio) == WRITE || cc->on_disk_tag_size))
+ (bio_data_dir(bio) == WRITE || cc->used_tag_size))
dm_accept_partial_bio(bio, ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));
/*
@@ -3480,8 +3487,8 @@ static int crypt_map(struct dm_target *t
io = dm_per_bio_data(bio, cc->per_bio_data_size);
crypt_io_init(io, cc, bio, dm_target_offset(ti, bio->bi_iter.bi_sector));
- if (cc->on_disk_tag_size) {
- unsigned int tag_len = cc->on_disk_tag_size * (bio_sectors(bio) >> cc->sector_shift);
+ if (cc->tuple_size) {
+ unsigned int tag_len = cc->tuple_size * (bio_sectors(bio) >> cc->sector_shift);
if (unlikely(tag_len > KMALLOC_MAX_SIZE))
io->integrity_metadata = NULL;
@@ -3552,7 +3559,7 @@ static void crypt_status(struct dm_targe
num_feature_args += test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags);
num_feature_args += cc->sector_size != (1 << SECTOR_SHIFT);
num_feature_args += test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags);
- if (cc->on_disk_tag_size)
+ if (cc->used_tag_size)
num_feature_args++;
if (num_feature_args) {
DMEMIT(" %d", num_feature_args);
@@ -3566,8 +3573,8 @@ static void crypt_status(struct dm_targe
DMEMIT(" no_read_workqueue");
if (test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags))
DMEMIT(" no_write_workqueue");
- if (cc->on_disk_tag_size)
- DMEMIT(" integrity:%u:%s", cc->on_disk_tag_size, cc->cipher_auth);
+ if (cc->used_tag_size)
+ DMEMIT(" integrity:%u:%s", cc->used_tag_size, cc->cipher_auth);
if (cc->sector_size != (1 << SECTOR_SHIFT))
DMEMIT(" sector_size:%d", cc->sector_size);
if (test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags))
@@ -3588,9 +3595,9 @@ static void crypt_status(struct dm_targe
DMEMIT(",iv_large_sectors=%c", test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags) ?
'y' : 'n');
- if (cc->on_disk_tag_size)
+ if (cc->used_tag_size)
DMEMIT(",integrity_tag_size=%u,cipher_auth=%s",
- cc->on_disk_tag_size, cc->cipher_auth);
+ cc->used_tag_size, cc->cipher_auth);
if (cc->sector_size != (1 << SECTOR_SHIFT))
DMEMIT(",sector_size=%d", cc->sector_size);
if (cc->cipher_string)
* Re: [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator
2024-05-15 13:28 ` [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator Mikulas Patocka
@ 2024-05-16 2:30 ` Jens Axboe
2024-05-20 12:53 ` Mikulas Patocka
2024-05-23 14:58 ` [PATCH v2] " Mikulas Patocka
2024-05-16 8:14 ` [RFC PATCH 1/2] " Ming Lei
1 sibling, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2024-05-16 2:30 UTC (permalink / raw)
To: Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz
Cc: linux-block, dm-devel, linux-nvme
On 5/15/24 7:28 AM, Mikulas Patocka wrote:
> @@ -177,9 +177,9 @@ static inline int blk_integrity_rq(struc
> return 0;
> }
>
> -static inline struct bio_vec *rq_integrity_vec(struct request *rq)
> +static inline struct bio_vec rq_integrity_vec(struct request *rq)
> {
> - return NULL;
> + BUG();
> }
> #endif /* CONFIG_BLK_DEV_INTEGRITY */
> #endif /* _LINUX_BLK_INTEGRITY_H */
Let's please not do that. If it's not used outside of
CONFIG_BLK_DEV_INTEGRITY, it should just go away.
--
Jens Axboe
* Re: [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator
2024-05-15 13:28 ` [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator Mikulas Patocka
2024-05-16 2:30 ` Jens Axboe
@ 2024-05-16 8:14 ` Ming Lei
2024-05-20 12:42 ` Mikulas Patocka
1 sibling, 1 reply; 18+ messages in thread
From: Ming Lei @ 2024-05-16 8:14 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz, linux-block, dm-devel, linux-nvme
On Wed, May 15, 2024 at 03:28:11PM +0200, Mikulas Patocka wrote:
> If we allocate a bio that is larger than NVMe maximum request size, attach
> integrity metadata to it and send it to the NVMe subsystem, the integrity
> metadata will be corrupted.
>
> Splitting the bio works correctly. The function bio_split will clone the
> bio, trim the iterator of the first bio and advance the iterator of the
> second bio.
>
> However, the function rq_integrity_vec has a bug - it returns the first
> vector of the bio's metadata and completely disregards the metadata
> iterator that was advanced when the bio was split. Thus, the second bio
> uses the same metadata as the first bio and this leads to metadata
> corruption.
Wrt. NVMe, inside blk_mq_submit_bio(), bio_integrity_prep() is called after
the bio is split, so ->bi_integrity is actually allocated for every split bio, and I
am not sure the issue is related to bio splitting. Or is it related
to DM over NVMe?
However, rq_integrity_vec() may not work correctly in the case of a bio merge.
Thanks,
Ming
* Re: [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator
2024-05-16 8:14 ` [RFC PATCH 1/2] " Ming Lei
@ 2024-05-20 12:42 ` Mikulas Patocka
2024-05-20 13:19 ` Ming Lei
0 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-20 12:42 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz, linux-block, dm-devel, linux-nvme
On Thu, 16 May 2024, Ming Lei wrote:
> On Wed, May 15, 2024 at 03:28:11PM +0200, Mikulas Patocka wrote:
> > If we allocate a bio that is larger than NVMe maximum request size, attach
> > integrity metadata to it and send it to the NVMe subsystem, the integrity
> > metadata will be corrupted.
> >
> > Splitting the bio works correctly. The function bio_split will clone the
> > bio, trim the iterator of the first bio and advance the iterator of the
> > second bio.
> >
> > However, the function rq_integrity_vec has a bug - it returns the first
> > vector of the bio's metadata and completely disregards the metadata
> > iterator that was advanced when the bio was split. Thus, the second bio
> > uses the same metadata as the first bio and this leads to metadata
> > corruption.
>
> Wrt. NVMe, inside blk_mq_submit_bio(), bio_integrity_prep() is called after
> bio is split, ->bi_integrity is actually allocated for every split bio, so I
> am not sure if the issue is related with bio splitting. Or is it related
> with DM over NVMe?
I created a dm-crypt patch that stores authenticated data in the bio
integrity field:
https://patchwork.kernel.org/project/linux-block/patch/703ffbcf-2fa8-56aa-2219-10254af26ba5@redhat.com/
And that patch needs this bugfix.
Mikulas
> However, rq_integrity_vec() may not work correctly in case of bio merge.
>
>
> Thanks,
> Ming
>
* Re: [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator
2024-05-16 2:30 ` Jens Axboe
@ 2024-05-20 12:53 ` Mikulas Patocka
2024-05-23 14:58 ` [PATCH v2] " Mikulas Patocka
1 sibling, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-20 12:53 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Mike Snitzer,
Milan Broz, linux-block, dm-devel, linux-nvme
On Wed, 15 May 2024, Jens Axboe wrote:
> On 5/15/24 7:28 AM, Mikulas Patocka wrote:
> > @@ -177,9 +177,9 @@ static inline int blk_integrity_rq(struc
> > return 0;
> > }
> >
> > -static inline struct bio_vec *rq_integrity_vec(struct request *rq)
> > +static inline struct bio_vec rq_integrity_vec(struct request *rq)
> > {
> > - return NULL;
> > + BUG();
> > }
> > #endif /* CONFIG_BLK_DEV_INTEGRITY */
> > #endif /* _LINUX_BLK_INTEGRITY_H */
>
> Let's please not do that. If it's not used outside of
> CONFIG_BLK_DEV_INTEGRITY, it should just go away.
>
> --
> Jens Axboe
It can't go away - it is guarded by blk_integrity_rq (which always
returns 0 if compiled without CONFIG_BLK_DEV_INTEGRITY), so the compiler
will optimize out the calls to rq_integrity_vec. But we can't delete
rq_integrity_vec, because the source code references it.
Should rq_integrity_vec return an empty 'struct bio_vec' instead? Or should
we add more CONFIG_BLK_DEV_INTEGRITY tests to disable the call locations?
Mikulas
* Re: [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator
2024-05-20 12:42 ` Mikulas Patocka
@ 2024-05-20 13:19 ` Ming Lei
0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2024-05-20 13:19 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz, linux-block, dm-devel, linux-nvme,
ming.lei
On Mon, May 20, 2024 at 02:42:34PM +0200, Mikulas Patocka wrote:
>
>
> On Thu, 16 May 2024, Ming Lei wrote:
>
> > On Wed, May 15, 2024 at 03:28:11PM +0200, Mikulas Patocka wrote:
> > > If we allocate a bio that is larger than NVMe maximum request size, attach
> > > integrity metadata to it and send it to the NVMe subsystem, the integrity
> > > metadata will be corrupted.
> > >
> > > Splitting the bio works correctly. The function bio_split will clone the
> > > bio, trim the iterator of the first bio and advance the iterator of the
> > > second bio.
> > >
> > > However, the function rq_integrity_vec has a bug - it returns the first
> > > vector of the bio's metadata and completely disregards the metadata
> > > iterator that was advanced when the bio was split. Thus, the second bio
> > > uses the same metadata as the first bio and this leads to metadata
> > > corruption.
> >
> > Wrt. NVMe, inside blk_mq_submit_bio(), bio_integrity_prep() is called after
> > bio is split, ->bi_integrity is actually allocated for every split bio, so I
> > am not sure if the issue is related with bio splitting. Or is it related
> > with DM over NVMe?
>
> I created a dm-crypt patch that stores autenticated data in the bio
> integrity field:
> https://patchwork.kernel.org/project/linux-block/patch/703ffbcf-2fa8-56aa-2219-10254af26ba5@redhat.com/
>
> And that patch needs this bugfix.
OK, then please update the commit log with the dm-crypt use case, given there
is no such issue on plain NVMe.
BTW, bios won't be merged in the case of plain NVMe since there is a gap
between two NVMe bios' meta buffers, both of which are allocated from kmalloc().
However, is it possible for the split bios from dm-crypt to be merged
in the blk-mq code, because dm-crypt may have its own queue limits? If yes, I
guess this patch may not be enough. Otherwise, I think this patch is
good.
Thanks,
Ming
* [PATCH v2] block: change rq_integrity_vec to respect the iterator
2024-05-16 2:30 ` Jens Axboe
2024-05-20 12:53 ` Mikulas Patocka
@ 2024-05-23 14:58 ` Mikulas Patocka
2024-05-23 15:01 ` Jens Axboe
1 sibling, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-23 14:58 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Mike Snitzer,
Milan Broz, linux-block, dm-devel, linux-nvme
On Wed, 15 May 2024, Jens Axboe wrote:
> On 5/15/24 7:28 AM, Mikulas Patocka wrote:
> > @@ -177,9 +177,9 @@ static inline int blk_integrity_rq(struc
> > return 0;
> > }
> >
> > -static inline struct bio_vec *rq_integrity_vec(struct request *rq)
> > +static inline struct bio_vec rq_integrity_vec(struct request *rq)
> > {
> > - return NULL;
> > + BUG();
> > }
> > #endif /* CONFIG_BLK_DEV_INTEGRITY */
> > #endif /* _LINUX_BLK_INTEGRITY_H */
>
> Let's please not do that. If it's not used outside of
> CONFIG_BLK_DEV_INTEGRITY, it should just go away.
>
> --
> Jens Axboe
Here I'm resending the patch with the function rq_integrity_vec removed if
CONFIG_BLK_DEV_INTEGRITY is not defined.
Mikulas
From: Mikulas Patocka <mpatocka@redhat.com>
If we allocate a bio that is larger than the NVMe maximum request size, attach
integrity metadata to it and send it to the NVMe subsystem, the integrity
metadata will be corrupted.
Splitting the bio works correctly. The function bio_split will clone the
bio, trim the iterator of the first bio and advance the iterator of the
second bio.
However, the function rq_integrity_vec has a bug - it returns the first
vector of the bio's metadata and completely disregards the metadata
iterator that was advanced when the bio was split. Thus, the second bio
uses the same metadata as the first bio and this leads to metadata
corruption.
This commit changes rq_integrity_vec so that it calls mp_bvec_iter_bvec
instead of returning the first vector. mp_bvec_iter_bvec takes the
iterator into account and returns the vector at the iterator's current position.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/nvme/host/pci.c | 14 +++++++++++---
include/linux/blk-integrity.h | 12 ++++--------
2 files changed, 15 insertions(+), 11 deletions(-)
Index: linux-2.6/drivers/nvme/host/pci.c
===================================================================
--- linux-2.6.orig/drivers/nvme/host/pci.c
+++ linux-2.6/drivers/nvme/host/pci.c
@@ -821,18 +821,20 @@ out_free_sg:
return ret;
}
+#ifdef CONFIG_BLK_DEV_INTEGRITY
static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct bio_vec bv = rq_integrity_vec(req);
- iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req),
- rq_dma_dir(req), 0);
+ iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
if (dma_mapping_error(dev->dev, iod->meta_dma))
return BLK_STS_IOERR;
cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
return BLK_STS_OK;
}
+#endif
static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
{
@@ -853,16 +855,20 @@ static blk_status_t nvme_prep_rq(struct
goto out_free_cmd;
}
+#ifdef CONFIG_BLK_DEV_INTEGRITY
if (blk_integrity_rq(req)) {
ret = nvme_map_metadata(dev, req, &iod->cmd);
if (ret)
goto out_unmap_data;
}
+#endif
nvme_start_request(req);
return BLK_STS_OK;
+#ifdef CONFIG_BLK_DEV_INTEGRITY
out_unmap_data:
nvme_unmap_data(dev, req);
+#endif
out_free_cmd:
nvme_cleanup_cmd(req);
return ret;
@@ -962,12 +968,14 @@ static __always_inline void nvme_pci_unm
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct nvme_dev *dev = nvmeq->dev;
+#ifdef CONFIG_BLK_DEV_INTEGRITY
if (blk_integrity_rq(req)) {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
dma_unmap_page(dev->dev, iod->meta_dma,
- rq_integrity_vec(req)->bv_len, rq_dma_dir(req));
+ rq_integrity_vec(req).bv_len, rq_dma_dir(req));
}
+#endif
if (blk_rq_nr_phys_segments(req))
nvme_unmap_data(dev, req);
Index: linux-2.6/include/linux/blk-integrity.h
===================================================================
--- linux-2.6.orig/include/linux/blk-integrity.h
+++ linux-2.6/include/linux/blk-integrity.h
@@ -109,11 +109,11 @@ static inline bool blk_integrity_rq(stru
* Return the first bvec that contains integrity data. Only drivers that are
* limited to a single integrity segment should use this helper.
*/
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
+static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
- if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
- return NULL;
- return rq->bio->bi_integrity->bip_vec;
+ WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1);
+ return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec,
+ rq->bio->bi_integrity->bip_iter);
}
#else /* CONFIG_BLK_DEV_INTEGRITY */
static inline int blk_rq_count_integrity_sg(struct request_queue *q,
@@ -177,9 +177,5 @@ static inline int blk_integrity_rq(struc
return 0;
}
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
-{
- return NULL;
-}
#endif /* CONFIG_BLK_DEV_INTEGRITY */
#endif /* _LINUX_BLK_INTEGRITY_H */
* Re: [PATCH v2] block: change rq_integrity_vec to respect the iterator
2024-05-23 14:58 ` [PATCH v2] " Mikulas Patocka
@ 2024-05-23 15:01 ` Jens Axboe
2024-05-23 15:11 ` Mikulas Patocka
0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2024-05-23 15:01 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Mike Snitzer,
Milan Broz, linux-block, dm-devel, linux-nvme
On 5/23/24 8:58 AM, Mikulas Patocka wrote:
>
>
> On Wed, 15 May 2024, Jens Axboe wrote:
>
>> On 5/15/24 7:28 AM, Mikulas Patocka wrote:
>>> @@ -177,9 +177,9 @@ static inline int blk_integrity_rq(struc
>>> return 0;
>>> }
>>>
>>> -static inline struct bio_vec *rq_integrity_vec(struct request *rq)
>>> +static inline struct bio_vec rq_integrity_vec(struct request *rq)
>>> {
>>> - return NULL;
>>> + BUG();
>>> }
>>> #endif /* CONFIG_BLK_DEV_INTEGRITY */
>>> #endif /* _LINUX_BLK_INTEGRITY_H */
>>
>> Let's please not do that. If it's not used outside of
>> CONFIG_BLK_DEV_INTEGRITY, it should just go away.
>>
>> --
>> Jens Axboe
>
> Here I'm resending the patch with the function rq_integrity_vec removed if
> CONFIG_BLK_DEV_INTEGRITY is not defined.
That looks better - but can you please just post a full new series,
that's a lot easier to deal with and look at than adding a v2 of one
patch in the thread.
> @@ -853,16 +855,20 @@ static blk_status_t nvme_prep_rq(struct
> goto out_free_cmd;
> }
>
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> if (blk_integrity_rq(req)) {
> ret = nvme_map_metadata(dev, req, &iod->cmd);
> if (ret)
> goto out_unmap_data;
> }
> +#endif
if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && blk_integrity_rq(req)) {
?
> @@ -962,12 +968,14 @@ static __always_inline void nvme_pci_unm
> struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> struct nvme_dev *dev = nvmeq->dev;
>
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> if (blk_integrity_rq(req)) {
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
Ditto
> Index: linux-2.6/include/linux/blk-integrity.h
> ===================================================================
> --- linux-2.6.orig/include/linux/blk-integrity.h
> +++ linux-2.6/include/linux/blk-integrity.h
> @@ -109,11 +109,11 @@ static inline bool blk_integrity_rq(stru
> * Return the first bvec that contains integrity data. Only drivers that are
> * limited to a single integrity segment should use this helper.
> */
> -static inline struct bio_vec *rq_integrity_vec(struct request *rq)
> +static inline struct bio_vec rq_integrity_vec(struct request *rq)
> {
> - if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
> - return NULL;
> - return rq->bio->bi_integrity->bip_vec;
> + WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1);
> + return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec,
> + rq->bio->bi_integrity->bip_iter);
> }
Not clear why the return on integrity segments > 1 is removed?
--
Jens Axboe
* Re: [PATCH v2] block: change rq_integrity_vec to respect the iterator
2024-05-23 15:01 ` Jens Axboe
@ 2024-05-23 15:11 ` Mikulas Patocka
2024-05-23 15:22 ` Anuj gupta
2024-05-23 15:33 ` Jens Axboe
0 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-23 15:11 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Mike Snitzer,
Milan Broz, linux-block, dm-devel, linux-nvme
On Thu, 23 May 2024, Jens Axboe wrote:
> On 5/23/24 8:58 AM, Mikulas Patocka wrote:
> > Here I'm resending the patch with the function rq_integrity_vec removed if
> > CONFIG_BLK_DEV_INTEGRITY is not defined.
>
> That looks better - but can you please just post a full new series,
> that's a lot easier to deal with and look at than adding a v2 of one
> patch in the thread.
OK, I'll post both patches.
> > @@ -853,16 +855,20 @@ static blk_status_t nvme_prep_rq(struct
> > goto out_free_cmd;
> > }
> >
> > +#ifdef CONFIG_BLK_DEV_INTEGRITY
> > if (blk_integrity_rq(req)) {
> > ret = nvme_map_metadata(dev, req, &iod->cmd);
> > if (ret)
> > goto out_unmap_data;
> > }
> > +#endif
>
> if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && blk_integrity_rq(req)) {
>
> ?
That wouldn't work, because the calls to rq_integrity_vec need to be
eliminated by the preprocessor.
Should I change rq_integrity_vec to this? Then, we could get rid of the
ifdefs and let the optimizer remove all calls to rq_integrity_vec.
static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
	struct bio_vec bv = { };
	return bv;
}
> > @@ -962,12 +968,14 @@ static __always_inline void nvme_pci_unm
> > struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> > struct nvme_dev *dev = nvmeq->dev;
> >
> > +#ifdef CONFIG_BLK_DEV_INTEGRITY
> > if (blk_integrity_rq(req)) {
> > struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>
> Ditto
>
> > Index: linux-2.6/include/linux/blk-integrity.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/blk-integrity.h
> > +++ linux-2.6/include/linux/blk-integrity.h
> > @@ -109,11 +109,11 @@ static inline bool blk_integrity_rq(stru
> > * Return the first bvec that contains integrity data. Only drivers that are
> > * limited to a single integrity segment should use this helper.
> > */
> > -static inline struct bio_vec *rq_integrity_vec(struct request *rq)
> > +static inline struct bio_vec rq_integrity_vec(struct request *rq)
> > {
> > - if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
> > - return NULL;
> > - return rq->bio->bi_integrity->bip_vec;
> > + WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1);
> > + return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec,
> > + rq->bio->bi_integrity->bip_iter);
> > }
>
> Not clear why the return on integrity segments > 1 is removed?
Because we can't return NULL. But I can leave it there, print a warning
and return the first vector.
Mikulas
> --
> Jens Axboe
>
* Re: [PATCH v2] block: change rq_integrity_vec to respect the iterator
2024-05-23 15:11 ` Mikulas Patocka
@ 2024-05-23 15:22 ` Anuj gupta
2024-05-23 15:33 ` Jens Axboe
1 sibling, 0 replies; 18+ messages in thread
From: Anuj gupta @ 2024-05-23 15:22 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz, linux-block, dm-devel, linux-nvme
On Thu, May 23, 2024 at 8:41 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Thu, 23 May 2024, Jens Axboe wrote:
>
> > On 5/23/24 8:58 AM, Mikulas Patocka wrote:
>
> > > Here I'm resending the patch with the function rq_integrity_vec removed if
> > > CONFIG_BLK_DEV_INTEGRITY is not defined.
> >
> > That looks better - but can you please just post a full new series,
> > that's a lot easier to deal with and look at than adding a v2 of one
> > patch in the thread.
>
> OK, I'll post both patches.
>
> > > @@ -853,16 +855,20 @@ static blk_status_t nvme_prep_rq(struct
> > > goto out_free_cmd;
> > > }
> > >
> > > +#ifdef CONFIG_BLK_DEV_INTEGRITY
> > > if (blk_integrity_rq(req)) {
> > > ret = nvme_map_metadata(dev, req, &iod->cmd);
> > > if (ret)
> > > goto out_unmap_data;
> > > }
> > > +#endif
> >
> > if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && blk_integrity_rq(req)) {
> >
> > ?
>
> That wouldn't work, because the calls to rq_integrity_vec need to be
> eliminated by the preprocessor.
>
> Should I change rq_integrity_vec to this? Then, we could get rid of the
> ifdefs and let the optimizer remove all calls to rq_integrity_vec.
> static inline struct bio_vec rq_integrity_vec(struct request *rq)
> {
> struct bio_vec bv = { };
> return bv;
> }
This can be reduced to
return (struct bio_vec){ };
>
> > > @@ -962,12 +968,14 @@ static __always_inline void nvme_pci_unm
> > > struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> > > struct nvme_dev *dev = nvmeq->dev;
> > >
> > > +#ifdef CONFIG_BLK_DEV_INTEGRITY
> > > if (blk_integrity_rq(req)) {
> > > struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> >
> > Ditto
> >
> > > Index: linux-2.6/include/linux/blk-integrity.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/linux/blk-integrity.h
> > > +++ linux-2.6/include/linux/blk-integrity.h
> > > @@ -109,11 +109,11 @@ static inline bool blk_integrity_rq(stru
> > > * Return the first bvec that contains integrity data. Only drivers that are
> > > * limited to a single integrity segment should use this helper.
> > > */
> > > -static inline struct bio_vec *rq_integrity_vec(struct request *rq)
> > > +static inline struct bio_vec rq_integrity_vec(struct request *rq)
> > > {
> > > - if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
> > > - return NULL;
> > > - return rq->bio->bi_integrity->bip_vec;
> > > + WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1);
> > > + return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec,
> > > + rq->bio->bi_integrity->bip_iter);
> > > }
> >
> > Not clear why the return on integrity segments > 1 is removed?
>
> Because we can't return NULL. But I can leave it there, print a warning
> and return the first vector.
>
> Mikulas
>
> > --
> > Jens Axboe
> >
>
>
--
Anuj Gupta
* Re: [PATCH v2] block: change rq_integrity_vec to respect the iterator
2024-05-23 15:11 ` Mikulas Patocka
2024-05-23 15:22 ` Anuj gupta
@ 2024-05-23 15:33 ` Jens Axboe
2024-05-23 15:48 ` Mikulas Patocka
1 sibling, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2024-05-23 15:33 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Mike Snitzer,
Milan Broz, linux-block, dm-devel, linux-nvme
On 5/23/24 9:11 AM, Mikulas Patocka wrote:
>>> @@ -853,16 +855,20 @@ static blk_status_t nvme_prep_rq(struct
>>> goto out_free_cmd;
>>> }
>>>
>>> +#ifdef CONFIG_BLK_DEV_INTEGRITY
>>> if (blk_integrity_rq(req)) {
>>> ret = nvme_map_metadata(dev, req, &iod->cmd);
>>> if (ret)
>>> goto out_unmap_data;
>>> }
>>> +#endif
>>
>> if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && blk_integrity_rq(req)) {
>>
>> ?
>
> That wouldn't work, because the calls to rq_integrity_vec need to be
> eliminated by the preprocessor.
Why not just do this incremental? Cleans up the ifdef mess too, leaving
only the one actually using rq_integrity_vec in place.
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5f857cbc95c8..bd56416a7fa8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -821,10 +821,10 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
return ret;
}
-#ifdef CONFIG_BLK_DEV_INTEGRITY
static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
+#ifdef CONFIG_BLK_DEV_INTEGRITY
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct bio_vec bv = rq_integrity_vec(req);
@@ -832,9 +832,9 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
if (dma_mapping_error(dev->dev, iod->meta_dma))
return BLK_STS_IOERR;
cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
+#endif
return BLK_STS_OK;
}
-#endif
static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
{
@@ -855,20 +855,16 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
goto out_free_cmd;
}
-#ifdef CONFIG_BLK_DEV_INTEGRITY
- if (blk_integrity_rq(req)) {
+ if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && blk_integrity_rq(req)) {
ret = nvme_map_metadata(dev, req, &iod->cmd);
if (ret)
goto out_unmap_data;
}
-#endif
nvme_start_request(req);
return BLK_STS_OK;
-#ifdef CONFIG_BLK_DEV_INTEGRITY
out_unmap_data:
nvme_unmap_data(dev, req);
-#endif
out_free_cmd:
nvme_cleanup_cmd(req);
return ret;
> Should I change rq_integrity_vec to this? Then, we could get rid of the
> ifdefs and let the optimizer remove all calls to rq_integrity_vec.
> static inline struct bio_vec rq_integrity_vec(struct request *rq)
> {
> struct bio_vec bv = { };
> return bv;
> }
Only if that eliminates runtime checking for !INTEGRITY, which I don't
think it will.
--
Jens Axboe
* Re: [PATCH v2] block: change rq_integrity_vec to respect the iterator
2024-05-23 15:33 ` Jens Axboe
@ 2024-05-23 15:48 ` Mikulas Patocka
0 siblings, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-23 15:48 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Mike Snitzer,
Milan Broz, linux-block, dm-devel, linux-nvme
On Thu, 23 May 2024, Jens Axboe wrote:
> On 5/23/24 9:11 AM, Mikulas Patocka wrote:
> >>> @@ -853,16 +855,20 @@ static blk_status_t nvme_prep_rq(struct
> >>> goto out_free_cmd;
> >>> }
> >>>
> >>> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> >>> if (blk_integrity_rq(req)) {
> >>> ret = nvme_map_metadata(dev, req, &iod->cmd);
> >>> if (ret)
> >>> goto out_unmap_data;
> >>> }
> >>> +#endif
> >>
> >> if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && blk_integrity_rq(req)) {
> >>
> >> ?
> >
> > That wouldn't work, because the calls to rq_integrity_vec need to be
> > eliminated by the preprocessor.
>
> Why not just do this incremental? Cleans up the ifdef mess too, leaving
> only the one actually using rq_integrity_vec in place.
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 5f857cbc95c8..bd56416a7fa8 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -821,10 +821,10 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> return ret;
> }
>
> -#ifdef CONFIG_BLK_DEV_INTEGRITY
> static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
> struct nvme_command *cmnd)
> {
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> struct bio_vec bv = rq_integrity_vec(req);
>
> @@ -832,9 +832,9 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
> if (dma_mapping_error(dev->dev, iod->meta_dma))
> return BLK_STS_IOERR;
> cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
> +#endif
> return BLK_STS_OK;
> }
> -#endif
>
> static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
> {
> @@ -855,20 +855,16 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
> goto out_free_cmd;
> }
>
> -#ifdef CONFIG_BLK_DEV_INTEGRITY
> - if (blk_integrity_rq(req)) {
> + if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && blk_integrity_rq(req)) {
> ret = nvme_map_metadata(dev, req, &iod->cmd);
> if (ret)
> goto out_unmap_data;
> }
> -#endif
>
> nvme_start_request(req);
> return BLK_STS_OK;
> -#ifdef CONFIG_BLK_DEV_INTEGRITY
> out_unmap_data:
> nvme_unmap_data(dev, req);
> -#endif
> out_free_cmd:
> nvme_cleanup_cmd(req);
> return ret;
>
> > Should I change rq_integrity_vec to this? Then, we could get rid of the
> > ifdefs and let the optimizer remove all calls to rq_integrity_vec.
> > static inline struct bio_vec rq_integrity_vec(struct request *rq)
> > {
> > struct bio_vec bv = { };
> > return bv;
> > }
>
> Only if that eliminates runtime checking for !INTEGRITY, which I don't
> thin it will.
It will eliminate the ifdefs. If we are compiling without
CONFIG_BLK_DEV_INTEGRITY, blk_integrity_rq(req) is inline and always
returns 0. So the optimizer will remove anything guarded by "if
(blk_integrity_rq(req))" - including the calls to nvme_map_metadata and
rq_integrity_vec.
But we need to provide a dummy rq_integrity_vec for the compiler front-end.
The front-end doesn't know that blk_integrity_rq always returns zero.
So the patch will be smaller if we get rid of the ifdefs and provide a
dummy rq_integrity_vec.
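Something like this - only a sketch of the dummy for the !CONFIG_BLK_DEV_INTEGRITY
branch, based on the empty-bio_vec variant discussed above, not a tested patch:
static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
	/*
	 * Never reached at runtime: every caller is guarded by
	 * blk_integrity_rq(), which is constant 0 without
	 * CONFIG_BLK_DEV_INTEGRITY, so the optimizer drops the calls.
	 * This stub only satisfies the compiler front-end.
	 */
	return (struct bio_vec){ };
}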
Mikulas
>
>
> --
> Jens Axboe
>
* Re: [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata
2024-05-15 13:27 [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata Mikulas Patocka
2024-05-15 13:28 ` [RFC PATCH 1/2] block: change rq_integrity_vec to respect the iterator Mikulas Patocka
2024-05-15 13:30 ` [RFC PATCH 2/2] dm-crypt: support per-sector NVMe metadata Mikulas Patocka
@ 2024-05-27 22:12 ` Eric Wheeler
2024-05-28 7:25 ` Milan Broz
2024-05-28 11:16 ` Mikulas Patocka
2 siblings, 2 replies; 18+ messages in thread
From: Eric Wheeler @ 2024-05-27 22:12 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz, linux-block, dm-devel, linux-nvme
On Wed, 15 May 2024, Mikulas Patocka wrote:
> Hi
>
> Some NVMe devices may be formatted with extra 64 bytes of metadata per
> sector.
>
> Here I'm submitting for review dm-crypt patches that make it possible to
> use per-sector metadata for authenticated encryption. With these patches,
> dm-crypt can run directly on the top of a NVMe device, without using
> dm-integrity. These patches increase write throughput twice, because there
> is no write to the dm-integrity journal.
>
> An example how to use it (so far, there is no support in the userspace
> cryptsetup tool):
>
> # nvme format /dev/nvme1 -n 1 -lbaf=4
> # dmsetup create cr --table '0 1048576 crypt
> capi:authenc(hmac(sha256),cbc(aes))-essiv:sha256
> 01b11af6b55f76424fd53fb66667c301466b2eeaf0f39fd36d26e7fc4f52ade2de4228e996f5ae2fe817ce178e77079d28e4baaebffbcd3e16ae4f36ef217298
> 0 /dev/nvme1n1 0 2 integrity:32:aead sector_size:4096'
That's really an amazing feature, and I think your implementation is simple
and elegant. It somehow reminds me of the 520/528-byte sectors that big
commercial filers use, but in a way that Linux could use.
Questions:
- I see you are using 32 bytes of AEAD data (out of 64). Is the AEAD data always
32 bytes, or can it vary by crypto mechanism?
- What drive are you using? I am curious what your `nvme id-ns` output
looks like. Do you have 64 in the `ms` value?
# nvme id-ns /dev/nvme0n1 | grep lbaf
nlbaf : 0
nulbaf : 0
lbaf 0 : ms:0 lbads:9 rp:0 (in use)
^ ^512b
--
Eric Wheeler
>
> Please review it - I'd like to know whether detecting the presence of
> per-sector metadata in crypt_integrity_ctr is correct whether it should be
> done differently.
>
> Mikulas
>
>
>
* Re: [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata
2024-05-27 22:12 ` [RFC PATCH 0/2] dm-crypt support for " Eric Wheeler
@ 2024-05-28 7:25 ` Milan Broz
2024-05-28 23:55 ` Eric Wheeler
2024-05-28 11:16 ` Mikulas Patocka
1 sibling, 1 reply; 18+ messages in thread
From: Milan Broz @ 2024-05-28 7:25 UTC (permalink / raw)
To: Eric Wheeler, Mikulas Patocka
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, linux-block, dm-devel, linux-nvme
On 5/28/24 12:12 AM, Eric Wheeler wrote:
> On Wed, 15 May 2024, Mikulas Patocka wrote:
>> Hi
>>
>> Some NVMe devices may be formatted with extra 64 bytes of metadata per
>> sector.
>>
>> Here I'm submitting for review dm-crypt patches that make it possible to
>> use per-sector metadata for authenticated encryption. With these patches,
>> dm-crypt can run directly on the top of a NVMe device, without using
>> dm-integrity. These patches increase write throughput twice, because there
>> is no write to the dm-integrity journal.
>>
>> An example how to use it (so far, there is no support in the userspace
>> cryptsetup tool):
>>
>> # nvme format /dev/nvme1 -n 1 -lbaf=4
>> # dmsetup create cr --table '0 1048576 crypt
>> capi:authenc(hmac(sha256),cbc(aes))-essiv:sha256
>> 01b11af6b55f76424fd53fb66667c301466b2eeaf0f39fd36d26e7fc4f52ade2de4228e996f5ae2fe817ce178e77079d28e4baaebffbcd3e16ae4f36ef217298
>> 0 /dev/nvme1n1 0 2 integrity:32:aead sector_size:4096'
>
> Thats really an amazing feature, and I think your implementation is simple
> and elegant. Somehow reminds me of 520/528-byte sectors that big
> commercial filers use, but in a way the Linux could use.
>
> Questions:
>
> - I see you are using 32-bytes of AEAD data (out of 64). Is AEAD always
> 32-bytes, or can it vary by crypto mechanism?
Hi Eric,
I'll try to answer this question, as this is where we have been heading with dm-integrity+dm-crypt
since the beginning - replacing it with HW and atomic sector+metadata handling once
suitable HW becomes available.
Currently, dm-integrity allocates the exact space for any AEAD you want to construct
(cipher-xts/hctr2 + hmac) or for a native AEAD (my favourite is AEGIS here).
So it depends on the configuration; the only difference from dm-integrity is that the HW allocates
a fixed 64 bytes, so the crypto can use up to this space, but it should be completely
configurable in dm-crypt. IOW, the space really used can vary by crypto mechanism.
Definitely, it is now enough for a real AEAD compared to the legacy 512+8 DIF :)
Also, it opens a way to store something more (a sector context) in the metadata,
but that's an idea for the future (usable in fs encryption as well, I guess).
> - What drive are you using? I am curious what your `nvme id-ns` output
> looks like. Do you have 64 in the `ms` value?
>
> # nvme id-ns /dev/nvme0n1 | grep lbaf
> nlbaf : 0
> nulbaf : 0
> lbaf 0 : ms:0 lbads:9 rp:0 (in use)
> ^ ^512b
This is the major issue still - I think there are only enterprisey NVMe drives that
can do this.
Milan
* Re: [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata
2024-05-27 22:12 ` [RFC PATCH 0/2] dm-crypt support for " Eric Wheeler
2024-05-28 7:25 ` Milan Broz
@ 2024-05-28 11:16 ` Mikulas Patocka
1 sibling, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2024-05-28 11:16 UTC (permalink / raw)
To: Eric Wheeler
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Mike Snitzer, Milan Broz, linux-block, dm-devel, linux-nvme
On Mon, 27 May 2024, Eric Wheeler wrote:
> On Wed, 15 May 2024, Mikulas Patocka wrote:
> > Hi
> >
> > Some NVMe devices may be formatted with extra 64 bytes of metadata per
> > sector.
> >
> > Here I'm submitting for review dm-crypt patches that make it possible to
> > use per-sector metadata for authenticated encryption. With these patches,
> > dm-crypt can run directly on the top of a NVMe device, without using
> > dm-integrity. These patches increase write throughput twice, because there
> > is no write to the dm-integrity journal.
> >
> > An example how to use it (so far, there is no support in the userspace
> > cryptsetup tool):
> >
> > # nvme format /dev/nvme1 -n 1 -lbaf=4
> > # dmsetup create cr --table '0 1048576 crypt
> > capi:authenc(hmac(sha256),cbc(aes))-essiv:sha256
> > 01b11af6b55f76424fd53fb66667c301466b2eeaf0f39fd36d26e7fc4f52ade2de4228e996f5ae2fe817ce178e77079d28e4baaebffbcd3e16ae4f36ef217298
> > 0 /dev/nvme1n1 0 2 integrity:32:aead sector_size:4096'
>
> Thats really an amazing feature, and I think your implementation is simple
> and elegant. Somehow reminds me of 520/528-byte sectors that big
> commercial filers use, but in a way the Linux could use.
>
> Questions:
>
> - I see you are using 32-bytes of AEAD data (out of 64). Is AEAD always
> 32-bytes, or can it vary by crypto mechanism?
It varies - e.g. if you use hmac(sha512), the full 64 bytes will be used.
> - What drive are you using?
Western Digital SN840
WUS4BA119DSP3X3
> I am curious what your `nvme id-ns` output
> looks like. Do you have 64 in the `ms` value?
>
> # nvme id-ns /dev/nvme0n1 | grep lbaf
> nlbaf : 0
> nulbaf : 0
> lbaf 0 : ms:0 lbads:9 rp:0 (in use)
> ^ ^512b
Yes, I have this:
lbaf 0 : ms:0 lbads:9 rp:0
lbaf 1 : ms:8 lbads:9 rp:0
lbaf 2 : ms:0 lbads:12 rp:0
lbaf 3 : ms:8 lbads:12 rp:0
lbaf 4 : ms:64 lbads:12 rp:0 (in use)
Mikulas
> --
> Eric Wheeler
* Re: [RFC PATCH 0/2] dm-crypt support for per-sector NVMe metadata
2024-05-28 7:25 ` Milan Broz
@ 2024-05-28 23:55 ` Eric Wheeler
0 siblings, 0 replies; 18+ messages in thread
From: Eric Wheeler @ 2024-05-28 23:55 UTC (permalink / raw)
To: Milan Broz
Cc: Mikulas Patocka, Jens Axboe, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Mike Snitzer, linux-block, dm-devel, linux-nvme
On Tue, 28 May 2024, Milan Broz wrote:
> On 5/28/24 12:12 AM, Eric Wheeler wrote:
> > On Wed, 15 May 2024, Mikulas Patocka wrote:
> >> Hi
> >>
> >> Some NVMe devices may be formatted with extra 64 bytes of metadata per
> >> sector.
> >>
> >> Here I'm submitting for review dm-crypt patches that make it possible to
> >> use per-sector metadata for authenticated encryption. With these patches,
> >> dm-crypt can run directly on the top of a NVMe device, without using
> >> dm-integrity. These patches increase write throughput twice, because there
> >> is no write to the dm-integrity journal.
> >>
> >> An example how to use it (so far, there is no support in the userspace
> >> cryptsetup tool):
> >>
> >> # nvme format /dev/nvme1 -n 1 -lbaf=4
> >> # dmsetup create cr --table '0 1048576 crypt
> >> capi:authenc(hmac(sha256),cbc(aes))-essiv:sha256
> >> 01b11af6b55f76424fd53fb66667c301466b2eeaf0f39fd36d26e7fc4f52ade2de4228e996f5ae2fe817ce178e77079d28e4baaebffbcd3e16ae4f36ef217298
> >> 0 /dev/nvme1n1 0 2 integrity:32:aead sector_size:4096'
> >
> > Thats really an amazing feature, and I think your implementation is simple
> > and elegant. Somehow reminds me of 520/528-byte sectors that big
> > commercial filers use, but in a way the Linux could use.
> >
> > Questions:
> >
> > - I see you are using 32-bytes of AEAD data (out of 64). Is AEAD always
> > 32-bytes, or can it vary by crypto mechanism?
>
> Hi Eric,
>
> I'll try to answer this question as this is where we headed with
> dm-integrity+dm-crypt since the beginning - replace it with HW and
> atomic sector+metadata handling once suitable HW becomes available.
>
> Currently, dm-integrity allocates exact space for any AEAD you want to
> construct (cipher-xts/hctr2 + hmac) or for native AEAD (my favourite is
> AEGIS here).
Awesome.
> So it depends on configuration, the only difference to dm-integrity is
> that HW allocates fixed 64 bytes so that crypto can use up to this
> space, but it should be completely configurable in dm-crypt. IOW real
> used space can vary by crypto mechanism.
>
> Definitely, it is now enough for real AEAD compared to legacy 512+8 DIF :)
>
> Also, it opens a way to store something more (sector context) in metadata,
> but that's an idea for the future (usable in fs encryption as well, I guess).
Good idea: you could modify the SCSI layer (similarly to the NVMe metadata case) so
that the bio integrity payload data could be packed at the end of a sector for
drives that have DIF room. This would make it possible to use 520/528-byte
and 4160/4224-byte-sectored SAS drives in Linux for the first time ever.
Newer SAS drives like the Exos X22 can support the following sector
sizes:
512 + 0
512 + 8
512 + 16
4096 + 0
4096 + 64
4096 +128
> > - What drive are you using? I am curious what your `nvme id-ns` output
> > looks like. Do you have 64 in the `ms` value?
> >
> > # nvme id-ns /dev/nvme0n1 | grep lbaf
> > nlbaf : 0
> > nulbaf : 0
> > lbaf 0 : ms:0 lbads:9 rp:0 (in use)
> > ^ ^512b
>
> This is the major issue still - I think there are only enterprisey NVMe drives
> that can do this.
For now that is fine - we only use enterprise drives anyway, and it will be
great to use the integrity that the drives support natively. Coupled with
MDRAID, this solves hidden bitrot quite nicely at the block layer ... and
it may trickle down into desktop drives eventually.
-Eric
>
> Milan
>
>
>