* [RFC 0/5] Protection information pass-through for block devices
@ 2022-11-24 15:58 Dmitry Tihov
  2022-11-24 15:58 ` [RFC 1/5] docs/nvme: add new feature summary Dmitry Tihov
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Dmitry Tihov @ 2022-11-24 15:58 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, kbusch, its, linux, m.malygin, a.buev, ddtikhov

This patch set allows using End-to-End Data Protection in the NVMe
subsystem with integrity-capable host devices as the backend for NVMe
namespaces. The series is based on an io_uring kernel interface feature
that has not been merged upstream yet:
https://lore.kernel.org/linux-block/20220920144618.1111138-1-a.buev@yadro.com/
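
For reference, a minimal userspace sketch of how the proposed kernel
interface is driven (the IORING_OP_READV_PI opcode and the pi_addr/pi_len
SQE fields follow the linked series and patch 2 of this set, so they may
change before the kernel feature is merged; the buffer sizes and the
__io_uring_sqe cast are illustrative):

    /* Read 8 LBAs of 4096 bytes plus one 8-byte PI tuple per LBA. */
    struct iovec data = { .iov_base = data_buf, .iov_len = 8 * 4096 };
    struct iovec pi   = { .iov_base = pi_buf,   .iov_len = 8 * 8 };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_rw(IORING_OP_READV_PI, sqe, fd, &data, 1, 0);
    /* the extended SQE tail carries the separate integrity iovec */
    ((struct __io_uring_sqe *)sqe)->pi_addr = (__u64)(uintptr_t)&pi;
    ((struct __io_uring_sqe *)sqe)->pi_len  = 1;
    io_uring_submit(&ring);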

The main advantage of this approach is that it allows using the same
protection-information-enabled disks in multiple VMs concurrently. This
may be useful in cluster setups.

Please let me know what you think: is this kind of change appropriate for
QEMU upstream, what should be changed, etc.

Dmitry Tihov (5):
  docs/nvme: add new feature summary
  block: add transfer of protection information
  hw/nvme: add protection information pass parameter
  hw/nvme: implement pi pass read/write/wrz commands
  hw/nvme: extend pi pass capable commands

 block/file-posix.c           | 130 ++++++++++++-
 block/io_uring.c             | 109 ++++++++++-
 docs/system/devices/nvme.rst |  15 ++
 hw/nvme/ctrl.c               | 361 ++++++++++++++++++++++++++++++++---
 hw/nvme/dif.c                | 303 +++++++++++++++++++++++++++++
 hw/nvme/dif.h                |  18 ++
 hw/nvme/ns.c                 |  59 +++++-
 hw/nvme/nvme.h               |   2 +
 hw/nvme/trace-events         |   6 +
 include/block/block-common.h |   2 +
 include/block/raw-aio.h      |   3 +-
 include/qemu/iov.h           |   6 +
 util/iov.c                   |  24 +++
 13 files changed, 992 insertions(+), 46 deletions(-)

-- 
2.38.1




* [RFC 1/5] docs/nvme: add new feature summary
  2022-11-24 15:58 [RFC 0/5] Protection information pass-through for block devices Dmitry Tihov
@ 2022-11-24 15:58 ` Dmitry Tihov
  2022-11-24 15:58 ` [RFC 2/5] block: add transfer of protection information Dmitry Tihov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Dmitry Tihov @ 2022-11-24 15:58 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, kbusch, its, linux, m.malygin, a.buev, ddtikhov

Describe the use of the new protection information block-level passthrough
NVMe feature.

Signed-off-by: Dmitry Tihov <d.tihov@yadro.com>
---
 docs/system/devices/nvme.rst | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index 30f841ef62..7375379810 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -240,6 +240,21 @@ The virtual namespace device supports DIF- and DIX-based protection information
   metadata. Otherwise, the protection information is transferred as the last
   eight bytes.
 
+The virtual namespace device can also be backed by an integrity-capable host
+block device. Protection information is then passed between the guest and the
+host block device and checked by the host device itself instead of by QEMU.
+
+``pip=BOOL`` (default: ``off``)
+  Set to ``on`` to allow host block device protection information passthrough.
+
+To use this feature, the nvme-ns backend drive must use the Linux io_uring
+AIO backend and the host page cache must be bypassed, for example:
+
+.. code-block:: console
+
+   -drive file=/dev/nvme0n1,cache.direct=on,aio=io_uring,format=raw,if=none,id=dif_drive_0
+   -device nvme-ns,logical_block_size=4096,physical_block_size=4096,drive=dif_drive_0,pip=on
+
 Virtualization Enhancements and SR-IOV (Experimental Support)
 -------------------------------------------------------------
 
-- 
2.38.1




* [RFC 2/5] block: add transfer of protection information
  2022-11-24 15:58 [RFC 0/5] Protection information pass-through for block devices Dmitry Tihov
  2022-11-24 15:58 ` [RFC 1/5] docs/nvme: add new feature summary Dmitry Tihov
@ 2022-11-24 15:58 ` Dmitry Tihov
  2022-11-24 15:58 ` [RFC 3/5] hw/nvme: add protection information pass parameter Dmitry Tihov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Dmitry Tihov @ 2022-11-24 15:58 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, kbusch, its, linux, m.malygin, a.buev, ddtikhov

On Linux hosts, T10 protection information can be passed directly from
userspace to integrity-capable block devices using the io_uring API.
Discover integrity-capable block devices and support submitting I/O with
an integrity payload to such devices when it is present in the request.
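
For discovery, the standard block-integrity sysfs attributes are consulted.
A standalone sketch of the lookup that fill_pi_info() below implements
(nvme0n1 is just an example device name):

    #include <stdio.h>

    int main(void)
    {
        char fmt[32] = "";
        FILE *f = fopen("/sys/class/block/nvme0n1/integrity/format", "r");

        if (f && fgets(fmt, sizeof(fmt), f)) {
            /* "T10-DIF-TYPE1-CRC" maps to DPS type 1,
             * "T10-DIF-TYPE3-CRC" to type 3 */
            printf("integrity format: %s", fmt);
        }
        if (f) {
            fclose(f);
        }
        return 0;
    }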

Signed-off-by: Dmitry Tihov <d.tihov@yadro.com>
---
 block/file-posix.c           | 130 +++++++++++++++++++++++++++++++++--
 block/io_uring.c             | 109 +++++++++++++++++++++++++++--
 include/block/block-common.h |   2 +
 include/block/raw-aio.h      |   3 +-
 include/qemu/iov.h           |   6 ++
 util/iov.c                   |  24 +++++++
 6 files changed, 262 insertions(+), 12 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index b9647c5ffc..1eec7dd3cb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -152,6 +152,10 @@ typedef struct BDRVRawState {
     int perm_change_flags;
     BDRVReopenState *reopen_state;
 
+    /* DIF T10 Protection Information */
+    uint8_t t10_type;
+    uint64_t protection_interval_bytes;
+
     bool has_discard:1;
     bool has_write_zeroes:1;
     bool use_linux_aio:1;
@@ -2094,8 +2098,9 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
 #ifdef CONFIG_LINUX_IO_URING
     } else if (s->use_linux_io_uring) {
         LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
+        bool is_pi = (s->t10_type && qiov->dif.iov_len);
         assert(qiov->size == bytes);
-        return luring_co_submit(bs, aio, s->fd, offset, qiov, type);
+        return luring_co_submit(bs, aio, s->fd, offset, qiov, type, is_pi);
 #endif
 #ifdef CONFIG_LINUX_AIO
     } else if (s->use_linux_aio) {
@@ -2190,7 +2195,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
 #ifdef CONFIG_LINUX_IO_URING
     if (s->use_linux_io_uring) {
         LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
-        return luring_co_submit(bs, aio, s->fd, 0, NULL, QEMU_AIO_FLUSH);
+        return luring_co_submit(bs, aio, s->fd, 0, NULL, QEMU_AIO_FLUSH, false);
     }
 #endif
     return raw_thread_pool_submit(bs, handle_aiocb_flush, &acb);
@@ -3516,6 +3521,110 @@ static bool hdev_is_sg(BlockDriverState *bs)
     return false;
 }
 
+#if defined(CONFIG_LINUX_IO_URING)
+
+static int fill_pi_info(BlockDriverState *bs, Error **errp)
+{
+    BDRVRawState *s = bs->opaque;
+    int ret = 0, bytes;
+    uint64_t is_integrity_capable;
+    g_autofree char *sysfs_int_cap = NULL;
+    g_autofree char *sysfs_fmt = NULL;
+    g_autofree char *sysfs_bytes = NULL;
+    const char *str_int_cap;
+    const char *str_bytes;
+    int fd_fmt = -1, fd_bytes = -1, fd_int_cap = -1;
+    char buf[24] = {0};
+    g_autofree char *dev_name = g_path_get_basename(bs->filename);
+
+    str_int_cap = "/sys/class/block/%s/integrity/device_is_integrity_capable";
+    sysfs_int_cap = g_strdup_printf(str_int_cap, dev_name);
+    sysfs_fmt = g_strdup_printf("/sys/class/block/%s/integrity/format",
+                                dev_name);
+    str_bytes = "/sys/class/block/%s/integrity/protection_interval_bytes";
+    sysfs_bytes = g_strdup_printf(str_bytes, dev_name);
+
+    if (!(bs->open_flags & BDRV_O_NOCACHE)) {
+        goto out;
+    }
+
+    fd_int_cap = open(sysfs_int_cap, O_RDONLY);
+    if (fd_int_cap == -1) {
+        error_setg_errno(errp, errno, "Can not open %s integrity capability"
+                         " sysfs entry", dev_name);
+        ret = -errno;
+        goto out;
+    }
+    bytes = read(fd_int_cap, buf, sizeof(buf));
+    if (bytes < 0) {
+        error_setg_errno(errp, errno, "Can not read %s integrity capability"
+                         " sysfs entry", dev_name);
+        ret = -errno;
+        goto out;
+    }
+    is_integrity_capable = g_ascii_strtoull(buf, NULL, 10);
+    if (!is_integrity_capable) {
+        goto out;
+    }
+    memset(buf, 0, sizeof(buf));
+
+    fd_fmt = open(sysfs_fmt, O_RDONLY);
+    if (fd_fmt == -1) {
+        error_setg_errno(errp, errno, "Can not open %s integrity format"
+                         " sysfs entry", dev_name);
+        ret = -errno;
+        goto out;
+    }
+    bytes = read(fd_fmt, buf, sizeof(buf));
+    if (bytes < 0) {
+        error_setg_errno(errp, errno, "Can not read %s integrity format"
+                         " sysfs entry", dev_name);
+        ret = -errno;
+        goto out;
+    }
+    if (bytes > 0 && buf[bytes - 1] == '\n') {
+        buf[bytes - 1] = 0;
+    }
+    if (strcmp(buf, "T10-DIF-TYPE1-CRC") == 0) {
+        s->t10_type = 1;
+    } else if (strcmp(buf, "T10-DIF-TYPE3-CRC") == 0) {
+        s->t10_type = 3;
+    } else {
+        s->t10_type = 0;
+    }
+    memset(buf, 0, sizeof(buf));
+
+    fd_bytes = open(sysfs_bytes, O_RDONLY);
+    if (fd_bytes == -1) {
+        error_setg_errno(errp, errno, "Can not open %s protection interval"
+                         " bytes sysfs entry", dev_name);
+        ret = -errno;
+        goto out;
+    }
+    if (read(fd_bytes, buf, sizeof(buf)) < 0) {
+        error_setg_errno(errp, errno, "Can not read %s protection interval"
+                   " bytes sysfs entry", dev_name);
+        ret = -errno;
+        goto out;
+    }
+    s->protection_interval_bytes = g_ascii_strtoull(buf, NULL, 10);
+
+out:
+    if (fd_fmt != -1) {
+        close(fd_fmt);
+    }
+    if (fd_bytes != -1) {
+        close(fd_bytes);
+    }
+    if (fd_int_cap != -1) {
+        close(fd_int_cap);
+    }
+
+    return ret;
+}
+
+#endif
+
 static int hdev_open(BlockDriverState *bs, QDict *options, int flags,
                      Error **errp)
 {
@@ -3601,6 +3710,11 @@ hdev_open_Mac_error:
     /* Since this does ioctl the device must be already opened */
     bs->sg = hdev_is_sg(bs);
 
+#if defined(CONFIG_LINUX_IO_URING)
+    if (s->use_linux_io_uring) {
+        ret = fill_pi_info(bs, errp);
+    }
+#endif
     return ret;
 }
 
@@ -3668,6 +3782,14 @@ static coroutine_fn int hdev_co_pwrite_zeroes(BlockDriverState *bs,
     return raw_do_pwrite_zeroes(bs, offset, bytes, flags, true);
 }
 
+static int hdev_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    BDRVRawState *s = bs->opaque;
+    bdi->protection_interval = s->protection_interval_bytes;
+    bdi->protection_type = s->t10_type;
+    return 0;
+}
+
 static BlockDriver bdrv_host_device = {
     .format_name        = "host_device",
     .protocol_name        = "host_device",
@@ -3698,8 +3820,8 @@ static BlockDriver bdrv_host_device = {
     .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
     .bdrv_co_truncate       = raw_co_truncate,
-    .bdrv_getlength	= raw_getlength,
-    .bdrv_get_info = raw_get_info,
+    .bdrv_getlength         = raw_getlength,
+    .bdrv_get_info          = hdev_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
     .bdrv_get_specific_stats = hdev_get_specific_stats,
diff --git a/block/io_uring.c b/block/io_uring.c
index 973e15d876..ba9fec1145 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -21,6 +21,84 @@
 /* io_uring ring size */
 #define MAX_ENTRIES 128
 
+#define IORING_OP_READV_PI  (48)
+#define IORING_OP_WRITEV_PI (49)
+
+#pragma pack(push, 1)
+
+struct __io_uring_sqe {
+    __u8    opcode;     /* type of operation for this sqe */
+    __u8    flags;      /* IOSQE_ flags */
+    __u16   ioprio;     /* ioprio for the request */
+    __s32   fd;     /* file descriptor to do IO on */
+    union {
+        __u64   off;    /* offset into file */
+        __u64   addr2;
+    };
+    union {
+        __u64   addr;   /* pointer to buffer or iovecs */
+        __u64   splice_off_in;
+    };
+    __u32   len;        /* buffer size or number of iovecs */
+    union {
+        __kernel_rwf_t  rw_flags;
+        __u32       fsync_flags;
+        __u16       poll_events;    /* compatibility */
+        __u32       poll32_events;  /* word-reversed for BE */
+        __u32       sync_range_flags;
+        __u32       msg_flags;
+        __u32       timeout_flags;
+        __u32       accept_flags;
+        __u32       cancel_flags;
+        __u32       open_flags;
+        __u32       statx_flags;
+        __u32       fadvise_advice;
+        __u32       splice_flags;
+        __u32       rename_flags;
+        __u32       unlink_flags;
+        __u32       hardlink_flags;
+    };
+    __u64   user_data;  /* data to be passed back at completion time */
+    /* pack this to avoid bogus arm OABI complaints */
+    union {
+        /* index into fixed buffers, if used */
+        __u16   buf_index;
+        /* for grouped buffer selection */
+        __u16   buf_group;
+    } __attribute__((packed));
+    /* personality to use, if used */
+    __u16   personality;
+    union {
+        __s32   splice_fd_in;
+        __u32   file_index;
+    };
+    __u64   pi_addr;
+    __u32 pi_len;
+    __u32   __pad2[1];
+};
+
+#pragma pack(pop)
+
+static inline void __io_uring_prep_writev_pi(uint8_t op,
+    struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs,
+    unsigned nr_vecs, const struct iovec *pi_iovec, unsigned nr_pi_vecs,
+    off_t offset)
+{
+    io_uring_prep_rw(op, sqe, fd, iovecs, nr_vecs, offset);
+    ((struct __io_uring_sqe *)sqe)->pi_addr = (__u64)pi_iovec;
+    ((struct __io_uring_sqe *)sqe)->pi_len = nr_pi_vecs;
+}
+
+static inline void __io_uring_prep_readv_pi(uint8_t op,
+    struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs,
+    unsigned nr_vecs, const struct iovec *pi_iovec, unsigned nr_pi_vecs,
+    off_t offset)
+{
+    io_uring_prep_rw(op, sqe, fd, iovecs, nr_vecs, offset);
+    ((struct __io_uring_sqe *)sqe)->pi_addr = (__u64)pi_iovec;
+    ((struct __io_uring_sqe *)sqe)->pi_len = nr_pi_vecs;
+}
+
 typedef struct LuringAIOCB {
     Coroutine *co;
     struct io_uring_sqe sqeq;
@@ -330,24 +408,39 @@ void luring_io_unplug(BlockDriverState *bs, LuringState *s)
  * @s: AIO state
  * @offset: offset for request
  * @type: type of request
+ * @is_pi: is protection information attached
  *
  * Fetches sqes from ring, adds to pending queue and preps them
  *
  */
 static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
-                            uint64_t offset, int type)
+                            uint64_t offset, int type, bool is_pi)
 {
     int ret;
     struct io_uring_sqe *sqes = &luringcb->sqeq;
 
     switch (type) {
     case QEMU_AIO_WRITE:
-        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
-                             luringcb->qiov->niov, offset);
+        if (is_pi) {
+            __io_uring_prep_writev_pi(IORING_OP_WRITEV_PI, sqes, fd,
+                                      luringcb->qiov->iov,
+                                      luringcb->qiov->niov,
+                                      &luringcb->qiov->dif, 1, offset);
+        } else {
+            io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
+                                 luringcb->qiov->niov, offset);
+        }
         break;
     case QEMU_AIO_READ:
-        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
-                            luringcb->qiov->niov, offset);
+        if (is_pi) {
+            __io_uring_prep_readv_pi(IORING_OP_READV_PI, sqes, fd,
+                                     luringcb->qiov->iov,
+                                     luringcb->qiov->niov,
+                                     &luringcb->qiov->dif, 1, offset);
+        } else {
+            io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
+                                luringcb->qiov->niov, offset);
+        }
         break;
     case QEMU_AIO_FLUSH:
         io_uring_prep_fsync(sqes, fd, IORING_FSYNC_DATASYNC);
@@ -374,7 +467,8 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
 }
 
 int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
-                                  uint64_t offset, QEMUIOVector *qiov, int type)
+                                  uint64_t offset, QEMUIOVector *qiov, int type,
+                                  bool is_pi)
 {
     int ret;
     LuringAIOCB luringcb = {
@@ -383,9 +477,10 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
         .qiov       = qiov,
         .is_read    = (type == QEMU_AIO_READ),
     };
+
     trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0,
                            type);
-    ret = luring_do_submit(fd, &luringcb, s, offset, type);
+    ret = luring_do_submit(fd, &luringcb, s, offset, type, is_pi);
 
     if (ret < 0) {
         return ret;
diff --git a/include/block/block-common.h b/include/block/block-common.h
index 297704c1e9..1f283dbef8 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -59,6 +59,8 @@ typedef struct BlockDriverInfo {
      * True if this block driver only supports compressed writes
      */
     bool needs_compressed_writes;
+    uint8_t protection_type;
+    uint32_t protection_interval;
 } BlockDriverInfo;
 
 typedef struct BlockFragInfo {
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 21fc10c4c9..3f715b4bcc 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -65,7 +65,8 @@ typedef struct LuringState LuringState;
 LuringState *luring_init(Error **errp);
 void luring_cleanup(LuringState *s);
 int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
-                                uint64_t offset, QEMUIOVector *qiov, int type);
+                                uint64_t offset, QEMUIOVector *qiov, int type,
+                                bool is_pi);
 void luring_detach_aio_context(LuringState *s, AioContext *old_context);
 void luring_attach_aio_context(LuringState *s, AioContext *new_context);
 void luring_io_plug(BlockDriverState *bs, LuringState *s);
diff --git a/include/qemu/iov.h b/include/qemu/iov.h
index 9330746680..58ae2d1f51 100644
--- a/include/qemu/iov.h
+++ b/include/qemu/iov.h
@@ -181,6 +181,9 @@ typedef struct QEMUIOVector {
             size_t size;
         };
     };
+
+    /* T10 data integrity field */
+    struct iovec dif;
 } QEMUIOVector;
 
 QEMU_BUILD_BUG_ON(offsetof(QEMUIOVector, size) !=
@@ -229,6 +232,9 @@ int qemu_iovec_init_extended(
         void *tail_buf, size_t tail_len);
 void qemu_iovec_init_slice(QEMUIOVector *qiov, QEMUIOVector *source,
                            size_t offset, size_t len);
+void qemu_iovec_init_pi(QEMUIOVector *qiov, int alloc_hint,
+                        unsigned int lba_cnt);
+void qemu_iovec_destroy_pi(QEMUIOVector *qiov);
 int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t len);
 void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
 void qemu_iovec_concat(QEMUIOVector *dst,
diff --git a/util/iov.c b/util/iov.c
index b4be580022..f0e51d5e66 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -20,6 +20,7 @@
 #include "qemu/iov.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
+#include "qemu/memalign.h"
 
 size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
                          size_t offset, const void *buf, size_t bytes)
@@ -278,6 +279,8 @@ void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint)
     qiov->niov = 0;
     qiov->nalloc = alloc_hint;
     qiov->size = 0;
+    qiov->dif.iov_base = NULL;
+    qiov->dif.iov_len = 0;
 }
 
 void qemu_iovec_init_external(QEMUIOVector *qiov, struct iovec *iov, int niov)
@@ -292,6 +295,19 @@ void qemu_iovec_init_external(QEMUIOVector *qiov, struct iovec *iov, int niov)
         qiov->size += iov[i].iov_len;
 }
 
+void qemu_iovec_init_pi(QEMUIOVector *qiov, int alloc_hint,
+                        unsigned int lba_cnt)
+{
+    void *aligned_mem = NULL;
+    qemu_iovec_init(qiov, alloc_hint);
+
+    /* dif size is always 8 bytes */
+    qiov->dif.iov_len = lba_cnt << 3;
+
+    aligned_mem = qemu_memalign(qemu_real_host_page_size(), qiov->dif.iov_len);
+    qiov->dif.iov_base = memset(aligned_mem, 0, qiov->dif.iov_len);
+}
+
 void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len)
 {
     assert(qiov->nalloc != -1);
@@ -530,12 +546,20 @@ void qemu_iovec_destroy(QEMUIOVector *qiov)
     memset(qiov, 0, sizeof(*qiov));
 }
 
+void qemu_iovec_destroy_pi(QEMUIOVector *qiov)
+{
+    g_free(qiov->dif.iov_base);
+
+    qemu_iovec_destroy(qiov);
+}
+
 void qemu_iovec_reset(QEMUIOVector *qiov)
 {
     assert(qiov->nalloc != -1);
 
     qiov->niov = 0;
     qiov->size = 0;
+    qiov->dif.iov_len = 0;
 }
 
 size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
-- 
2.38.1




* [RFC 3/5] hw/nvme: add protection information pass parameter
  2022-11-24 15:58 [RFC 0/5] Protection information pass-through for block devices Dmitry Tihov
  2022-11-24 15:58 ` [RFC 1/5] docs/nvme: add new feature summary Dmitry Tihov
  2022-11-24 15:58 ` [RFC 2/5] block: add transfer of protection information Dmitry Tihov
@ 2022-11-24 15:58 ` Dmitry Tihov
  2022-11-24 15:58 ` [RFC 4/5] hw/nvme: implement pi pass read/write/wrz commands Dmitry Tihov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Dmitry Tihov @ 2022-11-24 15:58 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, kbusch, its, linux, m.malygin, a.buev, ddtikhov

Allow a namespace to enable pass-through of protection information between
the guest and an integrity-capable BlockBackend.

Signed-off-by: Dmitry Tihov <d.tihov@yadro.com>
---
 hw/nvme/ns.c   | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
 hw/nvme/nvme.h |  2 ++
 2 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 62a1f97be0..da0cff71f8 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -35,7 +35,11 @@ void nvme_ns_init_format(NvmeNamespace *ns)
     ns->lbaf = id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
     ns->lbasz = 1 << ns->lbaf.ds;
 
-    nlbas = ns->size / (ns->lbasz + ns->lbaf.ms);
+    if (ns->pip) {
+        nlbas = ns->size / (ns->lbasz);
+    } else {
+        nlbas = ns->size / (ns->lbasz + ns->lbaf.ms);
+    }
 
     id_ns->nsze = cpu_to_le64(nlbas);
 
@@ -60,17 +64,22 @@ static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
     static uint64_t ns_count;
     NvmeIdNs *id_ns = &ns->id_ns;
     NvmeIdNsNvm *id_ns_nvm = &ns->id_ns_nvm;
+    BlockDriverInfo bdi;
     uint8_t ds;
     uint16_t ms;
-    int i;
+    int i, ret;
+
+    ns->pip = ns->params.pip;
 
     ns->csi = NVME_CSI_NVM;
     ns->status = 0x0;
 
     ns->id_ns.dlfeat = 0x1;
 
-    /* support DULBE and I/O optimization fields */
-    id_ns->nsfeat |= (0x4 | 0x10);
+    if (!ns->pip) {
+        /* support DULBE and I/O optimization fields */
+        id_ns->nsfeat |= (0x4 | 0x10);
+    }
 
     if (ns->params.shared) {
         id_ns->nmic |= NVME_NMIC_NS_SHARED;
@@ -89,7 +98,11 @@ static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
     id_ns->eui64 = cpu_to_be64(ns->params.eui64);
 
     ds = 31 - clz32(ns->blkconf.logical_block_size);
-    ms = ns->params.ms;
+    if (ns->pip) {
+        ms = 8;
+    } else {
+        ms = ns->params.ms;
+    }
 
     id_ns->mc = NVME_ID_NS_MC_EXTENDED | NVME_ID_NS_MC_SEPARATE;
 
@@ -105,6 +118,14 @@ static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
 
     ns->pif = ns->params.pif;
 
+    if (ns->pip) {
+        ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
+        if (ret >= 0) {
+            id_ns->dps = bdi.protection_type;
+            ns->pif = NVME_PI_GUARD_16;
+        }
+    }
+
     static const NvmeLBAF lbaf[16] = {
         [0] = { .ds =  9           },
         [1] = { .ds =  9, .ms =  8 },
@@ -380,13 +401,38 @@ static void nvme_zoned_ns_shutdown(NvmeNamespace *ns)
 static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
 {
     unsigned int pi_size;
+    BlockDriverInfo bdi;
+    int ret;
 
     if (!ns->blkconf.blk) {
         error_setg(errp, "block backend not configured");
         return -1;
     }
 
-    if (ns->params.pi) {
+    if (ns->params.pip) {
+        if (ns->params.mset) {
+            error_setg(errp, "invalid mset parameter, metadata must be "
+                "stored in a separate buffer to use integrity passthrough");
+            return -1;
+        }
+        ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
+        if (ret < 0) {
+            error_setg(errp, "could not determine host block device"
+                       " integrity information");
+            return -1;
+        }
+        if (!bdi.protection_type) {
+            error_setg(errp, "nvme-ns backend block device does not"
+                       " support integrity passthrough");
+            return -1;
+        }
+        if (bdi.protection_interval != ns->blkconf.logical_block_size) {
+            error_setg(errp, "logical block size parameter (%u bytes) must be"
+                " equal to protection information interval (%u bytes)",
+                ns->blkconf.logical_block_size, bdi.protection_interval);
+            return -1;
+        }
+    } else if (ns->params.pi) {
         if (ns->params.pi > NVME_ID_NS_DPS_TYPE_3) {
             error_setg(errp, "invalid 'pi' value");
             return -1;
@@ -623,6 +669,7 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT8("pi", NvmeNamespace, params.pi, 0),
     DEFINE_PROP_UINT8("pil", NvmeNamespace, params.pil, 0),
     DEFINE_PROP_UINT8("pif", NvmeNamespace, params.pif, 0),
+    DEFINE_PROP_BOOL("pip", NvmeNamespace, params.pip, false),
     DEFINE_PROP_UINT16("mssrl", NvmeNamespace, params.mssrl, 128),
     DEFINE_PROP_UINT32("mcl", NvmeNamespace, params.mcl, 128),
     DEFINE_PROP_UINT8("msrc", NvmeNamespace, params.msrc, 127),
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 79f5c281c2..4876670d26 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -109,6 +109,7 @@ typedef struct NvmeNamespaceParams {
     uint8_t  pi;
     uint8_t  pil;
     uint8_t  pif;
+    bool     pip;
 
     uint16_t mssrl;
     uint32_t mcl;
@@ -143,6 +144,7 @@ typedef struct NvmeNamespace {
     uint16_t     status;
     int          attached;
     uint8_t      pif;
+    bool         pip;
 
     struct {
         uint16_t zrwas;
-- 
2.38.1




* [RFC 4/5] hw/nvme: implement pi pass read/write/wrz commands
  2022-11-24 15:58 [RFC 0/5] Protection information pass-through for block devices Dmitry Tihov
                   ` (2 preceding siblings ...)
  2022-11-24 15:58 ` [RFC 3/5] hw/nvme: add protection information pass parameter Dmitry Tihov
@ 2022-11-24 15:58 ` Dmitry Tihov
  2022-11-24 15:58 ` [RFC 5/5] hw/nvme: extend pi pass capable commands Dmitry Tihov
  2022-11-25  7:44 ` [RFC 0/5] Protection information pass-through for block devices Klaus Jensen
  5 siblings, 0 replies; 9+ messages in thread
From: Dmitry Tihov @ 2022-11-24 15:58 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, kbusch, its, linux, m.malygin, a.buev, ddtikhov

This patch adds the ability for the read, write and write zeroes commands
to submit a single request carrying both data and integrity directly to the
underlying device, using block-level transfer of protection information.
The block level supports only the type 1 and type 3 protection types: for
type 1 the guard and reference tags are always checked, while for type 3
the guard tag is always checked. This means the NVMe PRCHK field cannot
properly disable guard/reftag checking, so the error from the block level
is caught and reported for the cases where bits 02/00 are unset in PRCHK
but the guard/reftag is invalid. Also, because the application tag is never
checked by block devices, check it explicitly when bit 01 is set in PRCHK.
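
In sketch form, the completion-path policy described above (simplified; the
real logic is nvme_dif_pass_check() and nvme_dif_pass_apptag_check() in this
patch, and the guard_ok/reftag_ok/apptag_ok flags stand in for recomputing
the tags over the bounce buffer):

    static uint16_t pi_status(int ret, uint8_t prinfo, bool guard_ok,
                              bool reftag_ok, bool apptag_ok)
    {
        if (ret == -EILSEQ) {
            /* the block layer rejected the guard or reference tag */
            if (!guard_ok) {
                return (prinfo & NVME_PRINFO_PRCHK_GUARD) ?
                    NVME_E2E_GUARD_ERROR : NVME_INTERNAL_DEV_ERROR;
            }
            if (!reftag_ok) {
                return (prinfo & NVME_PRINFO_PRCHK_REF) ?
                    NVME_E2E_REF_ERROR : NVME_INTERNAL_DEV_ERROR;
            }
        }
        /* the block layer never checks the application tag */
        if ((prinfo & NVME_PRINFO_PRCHK_APP) && !apptag_ok) {
            return NVME_E2E_APP_ERROR;
        }
        return NVME_SUCCESS;
    }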

Signed-off-by: Dmitry Tihov <d.tihov@yadro.com>
---
 hw/nvme/ctrl.c       |  13 +-
 hw/nvme/dif.c        | 303 +++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/dif.h        |  18 +++
 hw/nvme/trace-events |   4 +
 4 files changed, 335 insertions(+), 3 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba0564..c646345bcc 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -2045,7 +2045,7 @@ static void nvme_rw_cb(void *opaque, int ret)
         goto out;
     }
 
-    if (ns->lbaf.ms) {
+    if (ns->lbaf.ms && !ns->pip) {
         NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
         uint64_t slba = le64_to_cpu(rw->slba);
         uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
@@ -3349,7 +3349,9 @@ static uint16_t nvme_read(NvmeCtrl *n, NvmeRequest *req)
         }
     }
 
-    if (NVME_ID_NS_DPS_TYPE(ns->id_ns.dps)) {
+    if (ns->pip) {
+        return nvme_dif_pass_rw(n, req);
+    } else if (NVME_ID_NS_DPS_TYPE(ns->id_ns.dps)) {
         return nvme_dif_rw(n, req);
     }
 
@@ -3379,6 +3381,7 @@ static uint16_t nvme_do_write(NvmeCtrl *n, NvmeRequest *req, bool append,
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
     uint16_t ctrl = le16_to_cpu(rw->control);
     uint8_t prinfo = NVME_RW_PRINFO(ctrl);
+    bool pract = !!(prinfo & NVME_PRINFO_PRACT);
     uint64_t data_size = nvme_l2b(ns, nlb);
     uint64_t mapped_size = data_size;
     uint64_t data_offset;
@@ -3483,7 +3486,11 @@ static uint16_t nvme_do_write(NvmeCtrl *n, NvmeRequest *req, bool append,
 
     data_offset = nvme_l2b(ns, slba);
 
-    if (NVME_ID_NS_DPS_TYPE(ns->id_ns.dps)) {
+    if (ns->pip) {
+        if (!wrz || pract) {
+            return nvme_dif_pass_rw(n, req);
+        }
+    } else if (NVME_ID_NS_DPS_TYPE(ns->id_ns.dps)) {
         return nvme_dif_rw(n, req);
     }
 
diff --git a/hw/nvme/dif.c b/hw/nvme/dif.c
index 63c44c86ab..0b562cf45a 100644
--- a/hw/nvme/dif.c
+++ b/hw/nvme/dif.c
@@ -11,6 +11,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "sysemu/block-backend.h"
+#include "qemu/memalign.h"
 
 #include "nvme.h"
 #include "dif.h"
@@ -714,3 +715,305 @@ err:
 
     return status;
 }
+
+void nvme_dif_pass_dump(NvmeNamespace *ns, uint8_t *mdata_buf, size_t mdata_len)
+{
+    size_t i;
+    uint8_t *end = mdata_buf + mdata_len;
+    for (i = 1; mdata_buf < end; ++i, mdata_buf += ns->lbaf.ms) {
+        NvmeDifTuple *mdata = (NvmeDifTuple *) mdata_buf;
+        trace_pci_nvme_dif_dump_pass_pi(i, be16_to_cpu(mdata->g16.guard),
+                                        be16_to_cpu(mdata->g16.apptag),
+                                        be32_to_cpu(mdata->g16.reftag));
+    }
+}
+
+static void nvme_dif_pass_read_cb(void *opaque, int ret)
+{
+    NvmeRequest *req = opaque;
+    NvmeCtrl *n = nvme_ctrl(req);
+    NvmeDifPassContext *ctx = req->opaque;
+    NvmeNamespace *ns = req->ns;
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    uint8_t prinfo = NVME_RW_PRINFO(le16_to_cpu(rw->control));
+    bool pract = !!(prinfo & NVME_PRINFO_PRACT);
+    uint16_t apptag = le16_to_cpu(rw->apptag);
+    uint16_t appmask = le16_to_cpu(rw->appmask);
+    uint32_t reftag = le32_to_cpu(rw->reftag);
+    uint64_t slba = le64_to_cpu(rw->slba);
+    uint16_t status;
+
+    trace_pci_nvme_dif_pass_read_cb(nvme_cid(req), ctx->iov.dif.iov_len >> 3);
+    if (trace_event_get_state_backends(TRACE_PCI_NVME_DIF_DUMP_PASS_PI)) {
+        nvme_dif_pass_dump(ns, ctx->iov.dif.iov_base, ctx->iov.dif.iov_len);
+    }
+
+    /* block layer returns EILSEQ in case of integrity check failure */
+    /* determine exact pi error and return status accordingly */
+    if (unlikely(ret == -EILSEQ)) {
+        req->status = nvme_dif_pass_check(ns, ctx->data.bounce, ctx->data.len,
+                          ctx->iov.dif.iov_base, prinfo, slba, reftag);
+        if (req->status) {
+            /* zero out ret to allow req->status passthrough */
+            ret = 0;
+        }
+        goto out;
+    }
+
+    if (ret) {
+        goto out;
+    }
+
+    status = nvme_dif_pass_apptag_check(ns, ctx->iov.dif.iov_base,
+                 ctx->iov.dif.iov_len, prinfo, apptag, appmask);
+    if (status) {
+        req->status = status;
+        goto out;
+    }
+
+    status = nvme_bounce_data(n, ctx->data.bounce, ctx->data.len,
+                              NVME_TX_DIRECTION_FROM_DEVICE, req);
+    if (status) {
+        req->status = status;
+        goto out;
+    }
+
+    if (!pract) {
+        status = nvme_bounce_mdata(n, ctx->iov.dif.iov_base,
+                     ctx->iov.dif.iov_len, NVME_TX_DIRECTION_FROM_DEVICE, req);
+        if (status) {
+            req->status = status;
+        }
+    }
+
+out:
+    qemu_iovec_destroy_pi(&ctx->iov);
+    g_free(ctx->data.bounce);
+    g_free(ctx);
+
+    nvme_rw_complete_cb(req, ret);
+}
+
+static void nvme_diff_pass_write_cb(void *opaque, int ret)
+{
+    NvmeRequest *req = opaque;
+    NvmeDifPassContext *ctx = req->opaque;
+    NvmeNamespace *ns = req->ns;
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    uint8_t prinfo = NVME_RW_PRINFO(le16_to_cpu(rw->control));
+    uint32_t reftag = le32_to_cpu(rw->reftag);
+    uint64_t slba = le64_to_cpu(rw->slba);
+
+    trace_pci_nvme_dif_pass_write_cb(nvme_cid(req), ctx->iov.dif.iov_len >> 3);
+    if (trace_event_get_state_backends(TRACE_PCI_NVME_DIF_DUMP_PASS_PI)) {
+        nvme_dif_pass_dump(ns, ctx->iov.dif.iov_base, ctx->iov.dif.iov_len);
+    }
+
+    /* block layer returns EILSEQ in case of integrity check failure */
+    /* determine exact pi error and return status accordingly */
+    if (unlikely(ret == -EILSEQ)) {
+        req->status = nvme_dif_pass_check(ns, ctx->data.bounce, ctx->data.len,
+                          ctx->iov.dif.iov_base, prinfo, slba, reftag);
+        if (req->status) {
+            /* zero out ret to allow req->status passthrough */
+            ret = 0;
+        }
+    }
+
+    qemu_iovec_destroy_pi(&ctx->iov);
+    g_free(ctx->data.bounce);
+    g_free(ctx);
+
+    nvme_rw_complete_cb(req, ret);
+}
+
+uint16_t nvme_dif_pass_rw(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    uint8_t prinfo = NVME_RW_PRINFO(le16_to_cpu(rw->control));
+    uint16_t apptag = le16_to_cpu(rw->apptag);
+    uint16_t appmask = le16_to_cpu(rw->appmask);
+    uint64_t reftag = le32_to_cpu(rw->reftag);
+    bool pract = !!(prinfo & NVME_PRINFO_PRACT);
+    NvmeNamespace *ns = req->ns;
+    BlockBackend *blk = ns->blkconf.blk;
+    bool wrz = rw->opcode == NVME_CMD_WRITE_ZEROES;
+    uint32_t nlb = le16_to_cpu(rw->nlb) + 1;
+    uint64_t slba = le64_to_cpu(rw->slba);
+    size_t len = nvme_l2b(ns, nlb);
+    int64_t offset = nvme_l2b(ns, slba);
+    NvmeDifPassContext *ctx;
+    uint16_t status;
+
+    trace_pci_nvme_dif_pass_rw(nvme_cid(req),
+        NVME_ID_NS_DPS_TYPE(ns->id_ns.dps), prinfo, apptag, appmask, reftag);
+
+    ctx = g_new0(NvmeDifPassContext, 1);
+    qemu_iovec_init_pi(&ctx->iov, 1, nlb);
+    ctx->data.len = len;
+    ctx->data.bounce = qemu_memalign(qemu_real_host_page_size(),
+                                     ctx->data.len);
+    qemu_iovec_add(&ctx->iov, ctx->data.bounce, ctx->data.len);
+
+    req->opaque = ctx;
+
+    status = nvme_check_prinfo(ns, prinfo, slba, reftag);
+    if (status) {
+        goto err;
+    }
+    status = nvme_map_dptr(n, &req->sg, len, &req->cmd);
+    if (status) {
+        goto err;
+    }
+
+    if (req->cmd.opcode == NVME_CMD_READ) {
+        block_acct_start(blk_get_stats(blk), &req->acct, ctx->iov.size,
+                         BLOCK_ACCT_READ);
+
+        req->aiocb = blk_aio_preadv(ns->blkconf.blk, offset, &ctx->iov, 0,
+                                    nvme_dif_pass_read_cb, req);
+
+        return NVME_NO_COMPLETE;
+    }
+
+    if (wrz) {
+
+        assert(pract);
+
+        if (prinfo & NVME_PRINFO_PRCHK_MASK) {
+            status = NVME_INVALID_PROT_INFO | NVME_DNR;
+            goto err;
+        }
+        uint8_t *mbuf, *end;
+
+        mbuf = ctx->iov.dif.iov_base;
+        end = mbuf + ctx->iov.dif.iov_len;
+
+        for (; mbuf < end; mbuf += ns->lbaf.ms) {
+            NvmeDifTuple *dif = (NvmeDifTuple *)(mbuf);
+
+            dif->g16.apptag = cpu_to_be16(apptag);
+            dif->g16.reftag = cpu_to_be32(reftag);
+
+            switch (NVME_ID_NS_DPS_TYPE(ns->id_ns.dps)) {
+            case NVME_ID_NS_DPS_TYPE_1:
+            case NVME_ID_NS_DPS_TYPE_2:
+                reftag++;
+            }
+        }
+        memset(ctx->data.bounce, 0, ctx->data.len);
+
+        req->aiocb = blk_aio_pwritev(ns->blkconf.blk, offset, &ctx->iov, 0,
+                                     nvme_diff_pass_write_cb, req);
+
+    } else {
+
+        status = nvme_bounce_data(n, ctx->data.bounce, ctx->data.len,
+                                  NVME_TX_DIRECTION_TO_DEVICE, req);
+        if (status) {
+            goto err;
+        }
+        if (pract) {
+            nvme_dif_pract_generate_dif(ns, ctx->data.bounce,
+                                        ctx->data.len, ctx->iov.dif.iov_base,
+                                        ctx->iov.dif.iov_len, apptag, &reftag);
+        } else {
+            status = nvme_bounce_mdata(n, ctx->iov.dif.iov_base,
+                         ctx->iov.dif.iov_len, NVME_TX_DIRECTION_TO_DEVICE,
+                         req);
+            if (status) {
+                goto err;
+            }
+            status = nvme_dif_pass_apptag_check(ns, ctx->iov.dif.iov_base,
+                         ctx->iov.dif.iov_len, prinfo, apptag, appmask);
+            if (status) {
+                goto err;
+            }
+        }
+
+        block_acct_start(blk_get_stats(blk), &req->acct, ctx->iov.size,
+                         BLOCK_ACCT_WRITE);
+
+        req->aiocb = blk_aio_pwritev(ns->blkconf.blk, offset, &ctx->iov, 0,
+                                     nvme_diff_pass_write_cb, req);
+
+    }
+
+    return NVME_NO_COMPLETE;
+
+err:
+    qemu_iovec_destroy_pi(&ctx->iov);
+    g_free(ctx->data.bounce);
+    g_free(ctx);
+
+    return status;
+}
+
+uint16_t nvme_dif_pass_check(NvmeNamespace *ns, uint8_t *buf, size_t len,
+                             uint8_t *mbuf, uint8_t prinfo, uint64_t slba,
+                             uint32_t reftag)
+{
+    Error *local_err = NULL;
+    uint16_t status;
+
+    status = nvme_check_prinfo(ns, prinfo, slba, reftag);
+    if (status) {
+        return status;
+    }
+
+    uint8_t *end = buf + len;
+
+    for (uint8_t *bufp = buf, *mbufp = mbuf; bufp < end; bufp += ns->lbasz,
+         mbufp += ns->lbaf.ms) {
+        NvmeDifTuple *dif = (NvmeDifTuple *)mbufp;
+
+        if (be16_to_cpu(dif->g16.guard) != crc16_t10dif(0x0, bufp, ns->lbasz)) {
+            if (prinfo & NVME_PRINFO_PRCHK_GUARD) {
+                return NVME_E2E_GUARD_ERROR;
+            } else {
+                error_setg(&local_err, "Nvme namespace %u, backed by %s"
+                           " drive, can not pass custom guard tag",
+                           nvme_nsid(ns), blk_name(ns->blkconf.blk));
+                error_report_err(local_err);
+                return NVME_INTERNAL_DEV_ERROR;
+            }
+        }
+
+        if (be32_to_cpu(dif->g16.reftag) != reftag) {
+            if (prinfo & NVME_PRINFO_PRCHK_REF) {
+                return NVME_E2E_REF_ERROR;
+            } else if (NVME_ID_NS_DPS_TYPE(ns->id_ns.dps) !=
+                       NVME_ID_NS_DPS_TYPE_3) {
+                error_setg(&local_err, "Nvme namespace %u, backed by %s"
+                           " drive can not pass custom ref tag",
+                           nvme_nsid(ns), blk_name(ns->blkconf.blk));
+                error_report_err(local_err);
+                return NVME_INTERNAL_DEV_ERROR;
+            }
+        }
+
+        if (NVME_ID_NS_DPS_TYPE(ns->id_ns.dps) != NVME_ID_NS_DPS_TYPE_3) {
+            reftag++;
+        }
+    }
+
+    return NVME_SUCCESS;
+}
+
+uint16_t nvme_dif_pass_apptag_check(NvmeNamespace *ns, uint8_t *mbuf,
+                                    size_t mlen, uint8_t prinfo,
+                                    uint16_t apptag, uint16_t appmask)
+{
+    if (prinfo & NVME_PRINFO_PRCHK_APP) {
+        uint8_t *end = mbuf + mlen;
+        for (uint8_t *mbufp = mbuf; mbufp < end; mbufp += ns->lbaf.ms) {
+            NvmeDifTuple *dif = (NvmeDifTuple *)mbufp;
+            if ((be16_to_cpu(dif->g16.apptag) & appmask) !=
+                (apptag & appmask)) {
+                return NVME_E2E_APP_ERROR;
+            }
+        }
+    }
+
+    return NVME_SUCCESS;
+}
diff --git a/hw/nvme/dif.h b/hw/nvme/dif.h
index f12e312250..08e3630461 100644
--- a/hw/nvme/dif.h
+++ b/hw/nvme/dif.h
@@ -188,4 +188,22 @@ uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, size_t len,
                         uint16_t appmask, uint64_t *reftag);
 uint16_t nvme_dif_rw(NvmeCtrl *n, NvmeRequest *req);
 
+typedef struct NvmeDifPassContext {
+    struct {
+        uint8_t *bounce;
+        size_t len;
+    } data;
+    QEMUIOVector iov;
+} NvmeDifPassContext;
+
+uint16_t nvme_dif_pass_rw(NvmeCtrl *n, NvmeRequest *req);
+void nvme_dif_pass_dump(NvmeNamespace *ns, uint8_t *mdata_buf,
+                        size_t mdata_len);
+uint16_t nvme_dif_pass_check(NvmeNamespace *ns, uint8_t *buf, size_t len,
+                             uint8_t *mbuf, uint8_t prinfo, uint64_t slba,
+                             uint32_t reftag);
+uint16_t nvme_dif_pass_apptag_check(NvmeNamespace *ns, uint8_t *mbuf,
+                                    size_t mlen, uint8_t prinfo,
+                                    uint16_t apptag, uint16_t appmask);
+
 #endif /* HW_NVME_DIF_H */
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index fccb79f489..259fa8ffa2 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -17,6 +17,10 @@ pci_nvme_write(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_misc_cb(uint16_t cid) "cid %"PRIu16""
 pci_nvme_dif_rw(uint8_t pract, uint8_t prinfo) "pract 0x%"PRIx8" prinfo 0x%"PRIx8""
+pci_nvme_dif_pass_rw(uint16_t cid, uint8_t type, uint8_t prinfo, uint16_t apptag, uint16_t appmask, uint32_t reftag) "cid %"PRIu16" type %"PRIu8" prinfo 0x%"PRIx8" apptag 0x%"PRIx16" appmask 0x%"PRIx16" reftag 0x%"PRIx32""
+pci_nvme_dif_pass_read_cb(uint16_t cid, size_t count) "cid %"PRIu16" number of DIF elements %zu"
+pci_nvme_dif_pass_write_cb(uint16_t cid, size_t count) "cid %"PRIu16" number of DIF elements %zu"
+pci_nvme_dif_dump_pass_pi(size_t dif_num, uint16_t guard, uint16_t apptag, uint32_t reftag) "DIF element %zu guard tag 0x%"PRIx16" apptag 0x%"PRIx16" reftag 0x%"PRIx32""
 pci_nvme_dif_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_dif_rw_mdata_in_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_dif_rw_mdata_out_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
-- 
2.38.1




* [RFC 5/5] hw/nvme: extend pi pass capable commands
  2022-11-24 15:58 [RFC 0/5] Protection information pass-through for block devices Dmitry Tihov
                   ` (3 preceding siblings ...)
  2022-11-24 15:58 ` [RFC 4/5] hw/nvme: implement pi pass read/write/wrz commands Dmitry Tihov
@ 2022-11-24 15:58 ` Dmitry Tihov
  2022-11-25  7:44 ` [RFC 0/5] Protection information pass-through for block devices Klaus Jensen
  5 siblings, 0 replies; 9+ messages in thread
From: Dmitry Tihov @ 2022-11-24 15:58 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, kbusch, its, linux, m.malygin, a.buev, ddtikhov

Add protection information block-level passthrough support to the compare,
dataset management, verify and copy NVMe commands.

Signed-off-by: Dmitry Tihov <d.tihov@yadro.com>
---
 hw/nvme/ctrl.c       | 348 +++++++++++++++++++++++++++++++++++++++----
 hw/nvme/trace-events |   2 +
 2 files changed, 325 insertions(+), 25 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index c646345bcc..950d773d59 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -197,6 +197,7 @@
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
 #include "migration/vmstate.h"
+#include "qemu/memalign.h"
 
 #include "nvme.h"
 #include "dif.h"
@@ -2168,6 +2169,50 @@ out:
     nvme_verify_cb(ctx, ret);
 }
 
+static void nvme_dif_pass_verify_cb(void *opaque, int ret)
+{
+    NvmeBounceContext *ctx = opaque;
+    NvmeRequest *req = ctx->req;
+    NvmeNamespace *ns = req->ns;
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    uint64_t slba = le64_to_cpu(rw->slba);
+    uint8_t prinfo = NVME_RW_PRINFO(le16_to_cpu(rw->control));
+    uint16_t apptag = le16_to_cpu(rw->apptag);
+    uint16_t appmask = le16_to_cpu(rw->appmask);
+    uint32_t reftag = le32_to_cpu(rw->reftag);
+
+    trace_pci_nvme_dif_pass_verify_cb(nvme_cid(req));
+    if (trace_event_get_state_backends(TRACE_PCI_NVME_DIF_DUMP_PASS_PI)) {
+        nvme_dif_pass_dump(ns, ctx->data.iov.dif.iov_base,
+                           ctx->data.iov.dif.iov_len);
+    }
+
+    if (unlikely(ret == -EILSEQ)) {
+        req->status = nvme_dif_pass_check(ns, ctx->data.bounce,
+                          ctx->data.iov.size, ctx->data.iov.dif.iov_base,
+                          prinfo, slba, reftag);
+        if (req->status) {
+            /* zero out ret to allow req->status passthrough */
+            ret = 0;
+        }
+        goto out;
+    }
+
+    if (ret) {
+        goto out;
+    }
+
+    req->status = nvme_dif_pass_apptag_check(ns, ctx->data.iov.dif.iov_base,
+                      ctx->data.iov.dif.iov_len, prinfo, apptag, appmask);
+
+out:
+    qemu_iovec_destroy_pi(&ctx->data.iov);
+    g_free(ctx->data.bounce);
+    g_free(ctx);
+
+    nvme_rw_complete_cb(req, ret);
+}
+
 struct nvme_compare_ctx {
     struct {
         QEMUIOVector iov;
@@ -2331,6 +2376,83 @@ out:
     nvme_enqueue_req_completion(nvme_cq(req), req);
 }
 
+static void nvme_dif_pass_compare_cb(void *opaque, int ret)
+{
+    NvmeRequest *req = opaque;
+    NvmeCtrl *n = nvme_ctrl(req);
+    NvmeNamespace *ns = req->ns;
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    uint64_t slba = le64_to_cpu(rw->slba);
+    uint32_t nlb = le16_to_cpu(rw->nlb) + 1;
+    size_t mlen = nvme_m2b(ns, nlb);
+    uint8_t prinfo = NVME_RW_PRINFO(le16_to_cpu(rw->control));
+    uint16_t apptag = le16_to_cpu(rw->apptag);
+    uint16_t appmask = le16_to_cpu(rw->appmask);
+    uint32_t reftag = le32_to_cpu(rw->reftag);
+    struct nvme_compare_ctx *ctx = req->opaque;
+    g_autofree uint8_t *buf = NULL;
+    uint16_t status;
+
+    trace_pci_nvme_dif_pass_compare_cb(nvme_cid(req));
+    if (trace_event_get_state_backends(TRACE_PCI_NVME_DIF_DUMP_PASS_PI)) {
+        nvme_dif_pass_dump(ns, ctx->data.iov.dif.iov_base,
+                           ctx->data.iov.dif.iov_len);
+    }
+
+    if (unlikely(ret == -EILSEQ)) {
+        status = nvme_dif_pass_check(ns, ctx->data.bounce, ctx->data.iov.size,
+                                     ctx->data.iov.dif.iov_base, prinfo, slba,
+                                     reftag);
+        if (status) {
+            /* zero out ret to allow req->status passthrough */
+            ret = 0;
+            req->status = status;
+        }
+        goto out;
+    }
+
+    if (ret) {
+        goto out;
+    }
+
+    status = nvme_dif_pass_apptag_check(ns, ctx->data.iov.dif.iov_base,
+                      ctx->data.iov.dif.iov_len, prinfo, apptag, appmask);
+    if (status) {
+        req->status = status;
+        goto out;
+    }
+
+    buf = g_malloc(ctx->data.iov.size);
+    status = nvme_bounce_data(n, buf, ctx->data.iov.size,
+                              NVME_TX_DIRECTION_TO_DEVICE, req);
+    if (status) {
+        req->status = status;
+        goto out;
+    }
+    if (memcmp(buf, ctx->data.bounce, ctx->data.iov.size)) {
+        req->status = NVME_CMP_FAILURE;
+        goto out;
+    }
+
+    ctx->mdata.bounce = g_malloc(mlen);
+    status = nvme_bounce_mdata(n, ctx->mdata.bounce, mlen,
+                               NVME_TX_DIRECTION_TO_DEVICE, req);
+    if (status) {
+        req->status = status;
+        goto out;
+    }
+    if (memcmp(ctx->mdata.bounce, ctx->data.iov.dif.iov_base, mlen)) {
+        req->status = NVME_CMP_FAILURE;
+    }
+
+out:
+    qemu_iovec_destroy_pi(&ctx->data.iov);
+    g_free(ctx->data.bounce);
+    g_free(ctx);
+
+    nvme_rw_complete_cb(req, ret);
+}
+
 typedef struct NvmeDSMAIOCB {
     BlockAIOCB common;
     BlockAIOCB *aiocb;
@@ -2395,7 +2517,7 @@ static void nvme_dsm_md_cb(void *opaque, int ret)
         goto done;
     }
 
-    if (!ns->lbaf.ms) {
+    if (!ns->lbaf.ms || ns->pip) {
         nvme_dsm_cb(iocb, 0);
         return;
     }
@@ -2556,19 +2678,35 @@ static uint16_t nvme_verify(NvmeCtrl *n, NvmeRequest *req)
         }
     }
 
-    ctx = g_new0(NvmeBounceContext, 1);
-    ctx->req = req;
+    if (ns->pip) {
+        ctx = g_new0(NvmeBounceContext, 1);
+        ctx->req = req;
 
-    ctx->data.bounce = g_malloc(len);
+        ctx->data.bounce = qemu_memalign(qemu_real_host_page_size(), len);
 
-    qemu_iovec_init(&ctx->data.iov, 1);
-    qemu_iovec_add(&ctx->data.iov, ctx->data.bounce, len);
+        qemu_iovec_init_pi(&ctx->data.iov, 1, nlb);
+        qemu_iovec_add(&ctx->data.iov, ctx->data.bounce, len);
 
-    block_acct_start(blk_get_stats(blk), &req->acct, ctx->data.iov.size,
-                     BLOCK_ACCT_READ);
+        block_acct_start(blk_get_stats(blk), &req->acct, ctx->data.iov.size,
+                         BLOCK_ACCT_READ);
+
+        req->aiocb = blk_aio_preadv(ns->blkconf.blk, offset, &ctx->data.iov, 0,
+                                    nvme_dif_pass_verify_cb, ctx);
+    } else {
+        ctx = g_new0(NvmeBounceContext, 1);
+        ctx->req = req;
+
+        ctx->data.bounce = g_malloc(len);
+
+        qemu_iovec_init(&ctx->data.iov, 1);
+        qemu_iovec_add(&ctx->data.iov, ctx->data.bounce, len);
+
+        block_acct_start(blk_get_stats(blk), &req->acct, ctx->data.iov.size,
+                         BLOCK_ACCT_READ);
 
-    req->aiocb = blk_aio_preadv(ns->blkconf.blk, offset, &ctx->data.iov, 0,
-                                nvme_verify_mdata_in_cb, ctx);
+        req->aiocb = blk_aio_preadv(ns->blkconf.blk, offset, &ctx->data.iov, 0,
+                                    nvme_verify_mdata_in_cb, ctx);
+    }
     return NVME_NO_COMPLETE;
 }
 
@@ -2625,7 +2763,11 @@ static void nvme_copy_bh(void *opaque)
         req->cqe.result = cpu_to_le32(iocb->idx);
     }
 
-    qemu_iovec_destroy(&iocb->iov);
+    if (ns->pip) {
+        qemu_iovec_destroy_pi(&iocb->iov);
+    } else {
+        qemu_iovec_destroy(&iocb->iov);
+    }
     g_free(iocb->bounce);
 
     qemu_bh_delete(iocb->bh);
@@ -2737,10 +2879,29 @@ static void nvme_copy_out_completed_cb(void *opaque, int ret)
     NvmeRequest *req = iocb->req;
     NvmeNamespace *ns = req->ns;
     uint32_t nlb;
+    uint16_t status;
 
     nvme_copy_source_range_parse(iocb->ranges, iocb->idx, iocb->format, NULL,
                                  &nlb, NULL, NULL, NULL);
 
+    if (ns->pip) {
+        if (iocb->iov.dif.iov_len) {
+            NvmeCopyCmd *copy = (NvmeCopyCmd *)&req->cmd;
+            uint64_t slba = le64_to_cpu(copy->sdlba);
+            uint16_t prinfo = ((copy->control[2] >> 2) & 0xf);
+            size_t len = nvme_l2b(ns, nlb);
+            if (unlikely(ret == -EILSEQ)) {
+                status = nvme_dif_pass_check(ns, iocb->bounce, len,
+                             iocb->iov.dif.iov_base, prinfo, slba,
+                             iocb->reftag);
+                if (status) {
+                    goto invalid;
+                }
+            }
+        }
+
+        iocb->reftag += nlb;
+    }
     if (ret < 0) {
         iocb->ret = ret;
         goto out;
@@ -2754,8 +2915,17 @@ static void nvme_copy_out_completed_cb(void *opaque, int ret)
 
     iocb->idx++;
     iocb->slba += nlb;
+
 out:
     nvme_copy_cb(iocb, iocb->ret);
+    return;
+
+invalid:
+    req->status = status;
+    iocb->aiocb = NULL;
+    if (iocb->bh) {
+        qemu_bh_schedule(iocb->bh);
+    }
 }
 
 static void nvme_copy_out_cb(void *opaque, int ret)
@@ -2900,6 +3070,99 @@ out:
     nvme_copy_cb(iocb, ret);
 }
 
+static void nvme_dif_pass_copy_cb(void *opaque, int ret)
+{
+    NvmeCopyAIOCB *iocb = opaque;
+    NvmeRequest *req = iocb->req;
+    NvmeNamespace *ns = req->ns;
+    NvmeCopyCmd *copy = (NvmeCopyCmd *)&req->cmd;
+    uint16_t prinfor = ((copy->control[0] >> 4) & 0xf);
+    uint16_t prinfow = ((copy->control[2] >> 2) & 0xf);
+    uint32_t nlb;
+    size_t len;
+    uint16_t status;
+    uint64_t slba;
+    uint16_t apptag;
+    uint16_t appmask;
+    uint64_t reftag;
+
+    nvme_copy_source_range_parse(iocb->ranges, iocb->idx, iocb->format, &slba,
+                                 &nlb, &apptag, &appmask, &reftag);
+    len = nvme_l2b(ns, nlb);
+
+    if (unlikely(ret == -EILSEQ)) {
+        status = nvme_dif_pass_check(ns, iocb->bounce, len,
+                                     iocb->iov.dif.iov_base, prinfor, slba,
+                                     reftag);
+        if (status) {
+            goto invalid;
+        }
+    }
+
+    if (ret < 0) {
+        iocb->ret = ret;
+        goto out;
+    } else if (iocb->ret < 0) {
+        goto out;
+    }
+
+    status = nvme_dif_pass_apptag_check(ns, iocb->iov.dif.iov_base,
+                                        nvme_m2b(ns, nlb), prinfor, apptag,
+                                        appmask);
+    if (status) {
+        goto invalid;
+    }
+
+    status = nvme_check_prinfo(ns, prinfow, iocb->slba, iocb->reftag);
+    if (status) {
+        goto invalid;
+    }
+    status = nvme_check_bounds(ns, iocb->slba, nlb);
+    if (status) {
+        goto invalid;
+    }
+
+    if (ns->params.zoned) {
+        status = nvme_check_zone_write(ns, iocb->zone, iocb->slba, nlb);
+        if (status) {
+            goto invalid;
+        }
+
+        iocb->zone->w_ptr += nlb;
+    }
+
+    if (prinfow & NVME_PRINFO_PRACT) {
+        qemu_iovec_reset(&iocb->iov);
+        qemu_iovec_add(&iocb->iov, iocb->bounce, len);
+    } else {
+        appmask = le16_to_cpu(copy->appmask);
+        apptag = le16_to_cpu(copy->apptag);
+        status = nvme_dif_pass_apptag_check(ns, iocb->iov.dif.iov_base,
+                                            nvme_m2b(ns, nlb), prinfow, apptag,
+                                            appmask);
+        if (status) {
+            goto invalid;
+        }
+    }
+    iocb->aiocb = blk_aio_pwritev(ns->blkconf.blk, nvme_l2b(ns, iocb->slba),
+                                  &iocb->iov, 0, nvme_copy_out_completed_cb,
+                                  iocb);
+
+    return;
+
+invalid:
+    req->status = status;
+    iocb->aiocb = NULL;
+    if (iocb->bh) {
+        qemu_bh_schedule(iocb->bh);
+    }
+
+    return;
+
+out:
+    nvme_copy_cb(iocb, ret);
+}
+
 static void nvme_copy_in_cb(void *opaque, int ret)
 {
     NvmeCopyAIOCB *iocb = opaque;
@@ -2943,6 +3206,7 @@ static void nvme_copy_cb(void *opaque, int ret)
     NvmeNamespace *ns = req->ns;
     uint64_t slba;
     uint32_t nlb;
+    uint64_t reftag;
     size_t len;
     uint16_t status;
 
@@ -2958,7 +3222,7 @@ static void nvme_copy_cb(void *opaque, int ret)
     }
 
     nvme_copy_source_range_parse(iocb->ranges, iocb->idx, iocb->format, &slba,
-                                 &nlb, NULL, NULL, NULL);
+                                 &nlb, NULL, NULL, &reftag);
     len = nvme_l2b(ns, nlb);
 
     trace_pci_nvme_copy_source_range(slba, nlb);
@@ -2990,8 +3254,21 @@ static void nvme_copy_cb(void *opaque, int ret)
     qemu_iovec_reset(&iocb->iov);
     qemu_iovec_add(&iocb->iov, iocb->bounce, len);
 
-    iocb->aiocb = blk_aio_preadv(ns->blkconf.blk, nvme_l2b(ns, slba),
-                                 &iocb->iov, 0, nvme_copy_in_cb, iocb);
+    if (ns->pip) {
+        NvmeCopyCmd *copy = (NvmeCopyCmd *)&req->cmd;
+        uint16_t prinfor = ((copy->control[0] >> 4) & 0xf);
+        status = nvme_check_prinfo(ns, prinfor, slba, reftag);
+        if (status) {
+            goto invalid;
+        }
+        iocb->iov.dif.iov_len = nvme_m2b(ns, nlb);
+        iocb->aiocb = blk_aio_preadv(ns->blkconf.blk, nvme_l2b(ns, slba),
+                                     &iocb->iov, 0, nvme_dif_pass_copy_cb,
+                                     iocb);
+    } else {
+        iocb->aiocb = blk_aio_preadv(ns->blkconf.blk, nvme_l2b(ns, slba),
+                                     &iocb->iov, 0, nvme_copy_in_cb, iocb);
+    }
     return;
 
 invalid:
@@ -3078,11 +3355,17 @@ static uint16_t nvme_copy(NvmeCtrl *n, NvmeRequest *req)
     iocb->idx = 0;
     iocb->reftag = le32_to_cpu(copy->reftag);
     iocb->reftag |= (uint64_t)le32_to_cpu(copy->cdw3) << 32;
-    iocb->bounce = g_malloc_n(le16_to_cpu(ns->id_ns.mssrl),
-                              ns->lbasz + ns->lbaf.ms);
-
-    qemu_iovec_init(&iocb->iov, 1);
 
+    if (ns->pip) {
+        qemu_iovec_init_pi(&iocb->iov, 1, le16_to_cpu(ns->id_ns.mssrl));
+        iocb->bounce = qemu_memalign(qemu_real_host_page_size(),
+                                     le16_to_cpu(ns->id_ns.mssrl) * ns->lbasz);
+    } else {
+        qemu_iovec_init(&iocb->iov, 1);
+        iocb->bounce = g_malloc_n(le16_to_cpu(ns->id_ns.mssrl),
+                                  ns->lbasz + ns->lbaf.ms);
+    }
+
     block_acct_start(blk_get_stats(ns->blkconf.blk), &iocb->acct.read, 0,
                      BLOCK_ACCT_READ);
     block_acct_start(blk_get_stats(ns->blkconf.blk), &iocb->acct.write, 0,
@@ -3145,18 +3430,31 @@ static uint16_t nvme_compare(NvmeCtrl *n, NvmeRequest *req)
         return status;
     }
 
-    ctx = g_new(struct nvme_compare_ctx, 1);
-    ctx->data.bounce = g_malloc(data_len);
+    if (ns->pip) {
+        ctx = g_new0(struct nvme_compare_ctx, 1);
+        ctx->data.bounce = qemu_memalign(qemu_real_host_page_size(), data_len);
+
+        req->opaque = ctx;
 
-    req->opaque = ctx;
+        qemu_iovec_init_pi(&ctx->data.iov, 1, nlb);
+        qemu_iovec_add(&ctx->data.iov, ctx->data.bounce, data_len);
+        block_acct_start(blk_get_stats(blk), &req->acct, data_len,
+                         BLOCK_ACCT_READ);
+        req->aiocb = blk_aio_preadv(blk, offset, &ctx->data.iov, 0,
+                                    nvme_dif_pass_compare_cb, req);
+    } else {
+        ctx = g_new(struct nvme_compare_ctx, 1);
+        ctx->data.bounce = g_malloc(data_len);
 
-    qemu_iovec_init(&ctx->data.iov, 1);
-    qemu_iovec_add(&ctx->data.iov, ctx->data.bounce, data_len);
+        req->opaque = ctx;
 
-    block_acct_start(blk_get_stats(blk), &req->acct, data_len,
-                     BLOCK_ACCT_READ);
-    req->aiocb = blk_aio_preadv(blk, offset, &ctx->data.iov, 0,
-                                nvme_compare_data_cb, req);
+        qemu_iovec_init(&ctx->data.iov, 1);
+        qemu_iovec_add(&ctx->data.iov, ctx->data.bounce, data_len);
+        block_acct_start(blk_get_stats(blk), &req->acct, data_len,
+                         BLOCK_ACCT_READ);
+        req->aiocb = blk_aio_preadv(blk, offset, &ctx->data.iov, 0,
+                                    nvme_compare_data_cb, req);
+    }
 
     return NVME_NO_COMPLETE;
 }
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index 259fa8ffa2..42c171ed72 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -41,12 +41,14 @@ pci_nvme_copy_out(uint64_t slba, uint32_t nlb) "slba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_verify(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_verify_mdata_in_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_verify_cb(uint16_t cid, uint8_t prinfo, uint16_t apptag, uint16_t appmask, uint32_t reftag) "cid %"PRIu16" prinfo 0x%"PRIx8" apptag 0x%"PRIx16" appmask 0x%"PRIx16" reftag 0x%"PRIx32""
+pci_nvme_dif_pass_verify_cb(uint16_t cid) "cid %"PRIu16""
 pci_nvme_rw_complete_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_block_status(int64_t offset, int64_t bytes, int64_t pnum, int ret, bool zeroed) "offset %"PRId64" bytes %"PRId64" pnum %"PRId64" ret 0x%x zeroed %d"
 pci_nvme_dsm(uint32_t nr, uint32_t attr) "nr %"PRIu32" attr 0x%"PRIx32""
 pci_nvme_dsm_deallocate(uint64_t slba, uint32_t nlb) "slba %"PRIu64" nlb %"PRIu32""
 pci_nvme_dsm_single_range_limit_exceeded(uint32_t nlb, uint32_t dmrsl) "nlb %"PRIu32" dmrsl %"PRIu32""
 pci_nvme_compare(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba 0x%"PRIx64" nlb %"PRIu32""
+pci_nvme_dif_pass_compare_cb(uint16_t cid) "cid %"PRIu16""
 pci_nvme_compare_data_cb(uint16_t cid) "cid %"PRIu16""
 pci_nvme_compare_mdata_cb(uint16_t cid) "cid %"PRIu16""
 pci_nvme_aio_discard_cb(uint16_t cid) "cid %"PRIu16""
-- 
2.38.1
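
For readers without patch 2 of the series at hand: the qemu_iovec_init_pi()
helper used throughout the hunks above is introduced by the util/iov.c
changes. Its implementation is not shown in this message, so the following
is only a sketch of the shape its call sites imply -- the `dif` field
layout and the eight-bytes-per-block sizing are assumptions, not code from
the patch:

/*
 * Sketch only -- inferred from the call sites above, not taken from
 * patch 2. QEMUIOVector is assumed to gain a `dif` struct iovec that
 * holds the protection information buffer alongside the data vectors,
 * sized at one 8-byte DIF tuple per logical block.
 */
#include "qemu/osdep.h"
#include "qemu/iov.h"
#include "qemu/memalign.h"

void qemu_iovec_init_pi(QEMUIOVector *qiov, int alloc_hint, unsigned blocks)
{
    qemu_iovec_init(qiov, alloc_hint);

    qiov->dif.iov_len = (size_t)blocks * 8;
    qiov->dif.iov_base = qemu_memalign(qemu_real_host_page_size(),
                                       qiov->dif.iov_len);
}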

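To try the pass-through end to end once both the kernel series and this
one are applied, the namespace must be backed by an integrity-capable
host block device opened with the io_uring AIO engine. An invocation
might look like the following -- the device path, ids and serial number
are placeholders:

qemu-system-x86_64 \
    -drive file=/dev/nvme1n1,if=none,id=nvm,format=raw,aio=io_uring,cache.direct=on \
    -device nvme,id=nvme-ctrl-0,serial=deadbeef \
    -device nvme-ns,drive=nvm,pip=on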

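The new pci_nvme_dif_pass_* trace events added above can be observed at
runtime through QEMU's tracing, for example with a pattern argument such
as:

qemu-system-x86_64 ... -trace "pci_nvme_dif_pass_*"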

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC 0/5] Protection information pass-through for block devices
  2022-11-24 15:58 [RFC 0/5] Protection information pass-through for block devices Dmitry Tihov
                   ` (4 preceding siblings ...)
  2022-11-24 15:58 ` [RFC 5/5] hw/nvme: extend pi pass capable commands Dmitry Tihov
@ 2022-11-25  7:44 ` Klaus Jensen
  2022-12-05  9:01   ` Dmitry Tihov
  5 siblings, 1 reply; 9+ messages in thread
From: Klaus Jensen @ 2022-11-25  7:44 UTC (permalink / raw)
  To: Dmitry Tihov
  Cc: qemu-block, qemu-devel, kbusch, linux, m.malygin, a.buev,
	ddtikhov, Kevin Wolf, Hanna Reitz

[-- Attachment #1: Type: text/plain, Size: 2020 bytes --]

+CC: block layer maintainers (Kevin, Hanna)

On Nov 24 18:58, Dmitry Tihov wrote:
> This patch set allows using End-to-End Data Protection in NVMe subsystem
> with integrity capable host devices as the NVMe namespaces backend.
> The patch series is based on io-uring kernel interface feature not merged
> to kernel upstream yet:
> https://lore.kernel.org/linux-block/20220920144618.1111138-1-a.buev@yadro.com/
> 
> The main advantage of this approach is that it allows using the
> same protection information enabled disks in multiple VMs
> concurrently. This may be useful in cluster setups.
> 
> Please let me know what do you think, are this kind of changes appropriate
> for QEMU upstream, what should be changed, etc.
> 
> Dmitry Tihov (5):
>   docs/nvme: add new feature summary
>   block: add transfer of protection information
>   hw/nvme: add protection information pass parameter
>   hw/nvme: implement pi pass read/write/wrz commands
>   hw/nvme: extend pi pass capable commands
> 
>  block/file-posix.c           | 130 ++++++++++++-
>  block/io_uring.c             | 109 ++++++++++-
>  docs/system/devices/nvme.rst |  15 ++
>  hw/nvme/ctrl.c               | 361 ++++++++++++++++++++++++++++++++---
>  hw/nvme/dif.c                | 303 +++++++++++++++++++++++++++++
>  hw/nvme/dif.h                |  18 ++
>  hw/nvme/ns.c                 |  59 +++++-
>  hw/nvme/nvme.h               |   2 +
>  hw/nvme/trace-events         |   6 +
>  include/block/block-common.h |   2 +
>  include/block/raw-aio.h      |   3 +-
>  include/qemu/iov.h           |   6 +
>  util/iov.c                   |  24 +++
>  13 files changed, 992 insertions(+), 46 deletions(-)
> 
> -- 
> 2.38.1
> 

Hi Dmitry,

Neat.

But this largely depends on how the API turns out in block/, and I am
not the right person to comment on that. It's great that you have an
example device that utilizes the API, but this needs comments from the
block layer maintainers before we consider it in hw/nvme.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 0/5] Protection information pass-through for block devices
  2022-11-25  7:44 ` [RFC 0/5] Protection information pass-through for block devices Klaus Jensen
@ 2022-12-05  9:01   ` Dmitry Tihov
  2022-12-05 12:47     ` Klaus Jensen
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Tihov @ 2022-12-05  9:01 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: qemu-block, qemu-devel, kbusch, linux, m.malygin, a.buev,
	ddtikhov, Kevin Wolf, Hanna Reitz

On Fri, Nov 25, 2022 at 08:44:18, Klaus Jensen wrote:
> +CC: block layer maintainers (Kevin, Hanna)
> 
> On Nov 24 18:58, Dmitry Tihov wrote:
> > This patch set allows using End-to-End Data Protection in NVMe subsystem
> > with integrity capable host devices as the NVMe namespaces backend.
> > The patch series is based on io-uring kernel interface feature not merged
> > to kernel upstream yet:
> > https://lore.kernel.org/linux-block/20220920144618.1111138-1-a.buev@yadro.com/
> > 
> > The main advantage of this approach is that it allows using the
> > same protection information enabled disks in multiple VMs
> > concurrently. This may be useful in cluster setups.
> > 
> > Please let me know what do you think, are this kind of changes appropriate
> > for QEMU upstream, what should be changed, etc.
> > 
> > Dmitry Tihov (5):
> >   docs/nvme: add new feature summary
> >   block: add transfer of protection information
> >   hw/nvme: add protection information pass parameter
> >   hw/nvme: implement pi pass read/write/wrz commands
> >   hw/nvme: extend pi pass capable commands
> > 
> >  block/file-posix.c           | 130 ++++++++++++-
> >  block/io_uring.c             | 109 ++++++++++-
> >  docs/system/devices/nvme.rst |  15 ++
> >  hw/nvme/ctrl.c               | 361 ++++++++++++++++++++++++++++++++---
> >  hw/nvme/dif.c                | 303 +++++++++++++++++++++++++++++
> >  hw/nvme/dif.h                |  18 ++
> >  hw/nvme/ns.c                 |  59 +++++-
> >  hw/nvme/nvme.h               |   2 +
> >  hw/nvme/trace-events         |   6 +
> >  include/block/block-common.h |   2 +
> >  include/block/raw-aio.h      |   3 +-
> >  include/qemu/iov.h           |   6 +
> >  util/iov.c                   |  24 +++
> >  13 files changed, 992 insertions(+), 46 deletions(-)
> > 
> > -- 
> > 2.38.1
> > 
> 
> Hi Dmitry,
> 
> Neat.
> 
> But this largely depends on how the API turns out in block/, and I am
> not the right person to comment on that. It's great that you have an
> example device that utilizes the API, but this needs comments from the
> block layer maintainers before we consider it in hw/nvme.

You mean the API in the QEMU block layer, right? Specifically the
second patch of this series. Should I send it as a separate RFC for
review by the block layer maintainers?



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 0/5] Protection information pass-through for block devices
  2022-12-05  9:01   ` Dmitry Tihov
@ 2022-12-05 12:47     ` Klaus Jensen
  0 siblings, 0 replies; 9+ messages in thread
From: Klaus Jensen @ 2022-12-05 12:47 UTC (permalink / raw)
  To: Dmitry Tihov
  Cc: qemu-block, qemu-devel, kbusch, linux, m.malygin, a.buev,
	ddtikhov, Kevin Wolf, Hanna Reitz

[-- Attachment #1: Type: text/plain, Size: 2864 bytes --]

On Dec  5 12:01, Dmitry Tihov wrote:
> On Fri, Nov 25, 2022 at 08:44:18, Klaus Jensen wrote:
> > +CC: block layer maintainers (Kevin, Hanna)
> > 
> > On Nov 24 18:58, Dmitry Tihov wrote:
> > > This patch set allows using End-to-End Data Protection in NVMe subsystem
> > > with integrity capable host devices as the NVMe namespaces backend.
> > > The patch series is based on io-uring kernel interface feature not merged
> > > to kernel upstream yet:
> > > https://lore.kernel.org/linux-block/20220920144618.1111138-1-a.buev@yadro.com/
> > > 
> > > The main advantage of this approach is that it allows using the
> > > same protection information enabled disks in multiple VMs
> > > concurrently. This may be useful in cluster setups.
> > > 
> > > Please let me know what do you think, are this kind of changes appropriate
> > > for QEMU upstream, what should be changed, etc.
> > > 
> > > Dmitry Tihov (5):
> > >   docs/nvme: add new feature summary
> > >   block: add transfer of protection information
> > >   hw/nvme: add protection information pass parameter
> > >   hw/nvme: implement pi pass read/write/wrz commands
> > >   hw/nvme: extend pi pass capable commands
> > > 
> > >  block/file-posix.c           | 130 ++++++++++++-
> > >  block/io_uring.c             | 109 ++++++++++-
> > >  docs/system/devices/nvme.rst |  15 ++
> > >  hw/nvme/ctrl.c               | 361 ++++++++++++++++++++++++++++++++---
> > >  hw/nvme/dif.c                | 303 +++++++++++++++++++++++++++++
> > >  hw/nvme/dif.h                |  18 ++
> > >  hw/nvme/ns.c                 |  59 +++++-
> > >  hw/nvme/nvme.h               |   2 +
> > >  hw/nvme/trace-events         |   6 +
> > >  include/block/block-common.h |   2 +
> > >  include/block/raw-aio.h      |   3 +-
> > >  include/qemu/iov.h           |   6 +
> > >  util/iov.c                   |  24 +++
> > >  13 files changed, 992 insertions(+), 46 deletions(-)
> > > 
> > > -- 
> > > 2.38.1
> > > 
> > 
> > Hi Dmitry,
> > 
> > Neat.
> > 
> > But this largely depends on how the API turns out in block/, and I am
> > not the right person to comment on that. It's great that you have an
> > example device that utilizes the API, but this needs comments from the
> > block layer maintainers before we consider it in hw/nvme.
> 
> You mean the API in the QEMU block layer, right? Specifically the
> second patch of this series. Should I send it as a separate RFC for
> review by the block layer maintainers?
> 

Yes, basically the block/ stuff.

Given the RFC status of this series, I see no problem in keeping it
as-is. Having it show how the API is potentially used in a device is
good.

I CC'ed the block maintainers to let them comment on it when they have
time. We are right up on a release, so expect some feedback as we start
the next development cycle post-release :)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-12-05 12:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-24 15:58 [RFC 0/5] Protection information pass-through for block devices Dmitry Tihov
2022-11-24 15:58 ` [RFC 1/5] docs/nvme: add new feature summary Dmitry Tihov
2022-11-24 15:58 ` [RFC 2/5] block: add transfer of protection information Dmitry Tihov
2022-11-24 15:58 ` [RFC 3/5] hw/nvme: add protection information pass parameter Dmitry Tihov
2022-11-24 15:58 ` [RFC 4/5] hw/nvme: implement pi pass read/write/wrz commands Dmitry Tihov
2022-11-24 15:58 ` [RFC 5/5] hw/nvme: extend pi pass capable commands Dmitry Tihov
2022-11-25  7:44 ` [RFC 0/5] Protection information pass-through for block devices Klaus Jensen
2022-12-05  9:01   ` Dmitry Tihov
2022-12-05 12:47     ` Klaus Jensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).