* [PATCH 0/3] iothread and irqfd support
From: Jinhao Fan @ 2022-08-26 11:18 UTC
To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan
This patch series adds support for using a separate iothread for NVMe
IO emulation, which opens the door to polling. The first two patches
implement support for irqfd, which solves thread-safety problems with
interrupt emulation outside the main loop thread.
Jinhao Fan (3):
hw/nvme: support irq(de)assertion with eventfd
hw/nvme: use KVM irqfd when available
hw/nvme: add iothread support
hw/nvme/ctrl.c | 335 +++++++++++++++++++++++++++++++++++++++----
hw/nvme/ns.c | 21 ++-
hw/nvme/nvme.h | 12 +-
hw/nvme/trace-events | 3 +
4 files changed, 342 insertions(+), 29 deletions(-)
--
2.25.1
* [PATCH 1/3] hw/nvme: support irq(de)assertion with eventfd
From: Jinhao Fan @ 2022-08-26 11:18 UTC
To: qemu-devel
Cc: its, kbusch, stefanha, Jinhao Fan, Klaus Jensen, open list:nvme
When the new option 'irq-eventfd' is turned on, the IO emulation code
signals an eventfd when it wants to (de)assert an IRQ, and the main
loop eventfd handler performs the actual (de)assertion. This paves the
way for iothread support, since QEMU's interrupt emulation is not
thread safe.
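As an aside, the pattern this builds on can be shown with a minimal
standalone sketch (plain eventfd(2), which QEMU's EventNotifier wraps;
all names here are illustrative, not code from this series):

/* A worker thread signals the eventfd; the main thread consumes it
 * and would do the actual IRQ work. Signals between reads coalesce
 * into a single wakeup, which is where the coalescing effect noted
 * below comes from. */
#include <sys/eventfd.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int efd;

static void *io_side(void *arg)
{
    uint64_t one = 1;
    write(efd, &one, sizeof(one));   /* like event_notifier_set() */
    write(efd, &one, sizeof(one));   /* a second signal coalesces */
    return NULL;
}

int main(void)
{
    pthread_t t;
    uint64_t count;

    efd = eventfd(0, 0);
    pthread_create(&t, NULL, io_side, NULL);
    pthread_join(t, NULL);

    /* main-loop side: one read drains every pending signal */
    read(efd, &count, sizeof(count));
    printf("assert irq once for %llu signal(s)\n",
           (unsigned long long)count);

    close(efd);
    return 0;
}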
Asserting and deasserting the IRQ via eventfd has performance
implications: for small queue depths it increases request latency, but
for large queue depths it effectively coalesces IRQs.
Comparison (KIOPS):

  QD            1    4   16   64
  QEMU         38  123  210  329
  irq-eventfd  32  106  240  364
Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
hw/nvme/ctrl.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------
hw/nvme/nvme.h | 3 ++
2 files changed, 106 insertions(+), 17 deletions(-)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba0564..51792f3955 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -526,34 +526,57 @@ static void nvme_irq_check(NvmeCtrl *n)
}
}
+static void nvme_irq_do_assert(NvmeCtrl *n, NvmeCQueue *cq)
+{
+ if (msix_enabled(&(n->parent_obj))) {
+ trace_pci_nvme_irq_msix(cq->vector);
+ msix_notify(&(n->parent_obj), cq->vector);
+ } else {
+ trace_pci_nvme_irq_pin();
+ assert(cq->vector < 32);
+ n->irq_status |= 1 << cq->vector;
+ nvme_irq_check(n);
+ }
+}
+
static void nvme_irq_assert(NvmeCtrl *n, NvmeCQueue *cq)
{
if (cq->irq_enabled) {
- if (msix_enabled(&(n->parent_obj))) {
- trace_pci_nvme_irq_msix(cq->vector);
- msix_notify(&(n->parent_obj), cq->vector);
+ if (cq->assert_notifier.initialized) {
+ event_notifier_set(&cq->assert_notifier);
} else {
- trace_pci_nvme_irq_pin();
- assert(cq->vector < 32);
- n->irq_status |= 1 << cq->vector;
- nvme_irq_check(n);
+ nvme_irq_do_assert(n, cq);
}
} else {
trace_pci_nvme_irq_masked();
}
}
+static void nvme_irq_do_deassert(NvmeCtrl *n, NvmeCQueue *cq)
+{
+ if (msix_enabled(&(n->parent_obj))) {
+ return;
+ } else {
+ assert(cq->vector < 32);
+ if (!n->cq_pending) {
+ n->irq_status &= ~(1 << cq->vector);
+ }
+ nvme_irq_check(n);
+ }
+}
+
static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
{
if (cq->irq_enabled) {
- if (msix_enabled(&(n->parent_obj))) {
- return;
+ if (cq->deassert_notifier.initialized) {
+ /*
+ * The deassert notifier will only be initialized when MSI-X is NOT
+ * in use. Therefore no need to worry about extra eventfd syscall
+ * for pin-based interrupts.
+ */
+ event_notifier_set(&cq->deassert_notifier);
} else {
- assert(cq->vector < 32);
- if (!n->cq_pending) {
- n->irq_status &= ~(1 << cq->vector);
- }
- nvme_irq_check(n);
+ nvme_irq_do_deassert(n, cq);
}
}
}
@@ -1338,6 +1361,50 @@ static void nvme_update_cq_head(NvmeCQueue *cq)
trace_pci_nvme_shadow_doorbell_cq(cq->cqid, cq->head);
}
+static void nvme_assert_notifier_read(EventNotifier *e)
+{
+ NvmeCQueue *cq = container_of(e, NvmeCQueue, assert_notifier);
+ if (event_notifier_test_and_clear(e)) {
+ nvme_irq_do_assert(cq->ctrl, cq);
+ }
+}
+
+static void nvme_deassert_notifier_read(EventNotifier *e)
+{
+ NvmeCQueue *cq = container_of(e, NvmeCQueue, deassert_notifier);
+ if (event_notifier_test_and_clear(e)) {
+ nvme_irq_do_deassert(cq->ctrl, cq);
+ }
+}
+
+static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
+{
+ int ret;
+
+ ret = event_notifier_init(&cq->assert_notifier, 0);
+ if (ret < 0) {
+ return;
+ }
+
+ event_notifier_set_handler(&cq->assert_notifier,
+ nvme_assert_notifier_read);
+
+ if (!msix_enabled(&n->parent_obj)) {
+ ret = event_notifier_init(&cq->deassert_notifier, 0);
+ if (ret < 0) {
+ event_notifier_set_handler(&cq->assert_notifier, NULL);
+ event_notifier_cleanup(&cq->assert_notifier);
+
+ return;
+ }
+
+ event_notifier_set_handler(&cq->deassert_notifier,
+ nvme_deassert_notifier_read);
+ }
+
+ return;
+}
+
static void nvme_post_cqes(void *opaque)
{
NvmeCQueue *cq = opaque;
@@ -1377,8 +1444,10 @@ static void nvme_post_cqes(void *opaque)
QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
}
if (cq->tail != cq->head) {
- if (cq->irq_enabled && !pending) {
- n->cq_pending++;
+ if (cq->irq_enabled) {
+ if (!pending) {
+ n->cq_pending++;
+ }
}
nvme_irq_assert(n, cq);
@@ -4705,6 +4774,14 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
event_notifier_set_handler(&cq->notifier, NULL);
event_notifier_cleanup(&cq->notifier);
}
+ if (cq->assert_notifier.initialized) {
+ event_notifier_set_handler(&cq->assert_notifier, NULL);
+ event_notifier_cleanup(&cq->assert_notifier);
+ }
+ if (cq->deassert_notifier.initialized) {
+ event_notifier_set_handler(&cq->deassert_notifier, NULL);
+ event_notifier_cleanup(&cq->deassert_notifier);
+ }
if (msix_enabled(&n->parent_obj)) {
msix_vector_unuse(&n->parent_obj, cq->vector);
}
@@ -4734,7 +4811,7 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
n->cq_pending--;
}
- nvme_irq_deassert(n, cq);
+ nvme_irq_do_deassert(n, cq);
trace_pci_nvme_del_cq(qid);
nvme_free_cq(cq, n);
return NVME_SUCCESS;
@@ -4772,6 +4849,14 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
}
n->cq[cqid] = cq;
cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+
+ /*
+ * Only enable irq eventfd for IO queues since we always emulate admin
+ * queue in main loop thread.
+ */
+ if (cqid && n->params.irq_eventfd) {
+ nvme_init_irq_notifier(n, cq);
+ }
}
static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
@@ -7671,6 +7756,7 @@ static Property nvme_props[] = {
DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
+ DEFINE_PROP_BOOL("irq-eventfd", NvmeCtrl, params.irq_eventfd, false),
DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
params.auto_transition_zones, true),
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 79f5c281c2..4850d3e965 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -398,6 +398,8 @@ typedef struct NvmeCQueue {
uint64_t ei_addr;
QEMUTimer *timer;
EventNotifier notifier;
+ EventNotifier assert_notifier;
+ EventNotifier deassert_notifier;
bool ioeventfd_enabled;
QTAILQ_HEAD(, NvmeSQueue) sq_list;
QTAILQ_HEAD(, NvmeRequest) req_list;
@@ -422,6 +424,7 @@ typedef struct NvmeParams {
bool auto_transition_zones;
bool legacy_cmb;
bool ioeventfd;
+ bool irq_eventfd;
uint8_t sriov_max_vfs;
uint16_t sriov_vq_flexible;
uint16_t sriov_vi_flexible;
--
2.25.1
* [PATCH 2/3] hw/nvme: use KVM irqfd when available
From: Jinhao Fan @ 2022-08-26 11:18 UTC
To: qemu-devel
Cc: its, kbusch, stefanha, Jinhao Fan, Klaus Jensen, open list:nvme
Use KVM's irqfd to send interrupts when possible. This approach is
thread safe. Moreover, it avoids the inter-thread communication
overhead of plain event notifiers, since the interrupt is injected by
the kernel in the same system call that writes the irqfd.
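For context, the kernel-side interface behind QEMU's
kvm_irqchip_add_irqfd_notifier_gsi() helper is the KVM_IRQFD ioctl. A
hedged sketch (vm_fd, gsi and attach_irqfd() are illustrative; it
assumes an MSI route already exists for gsi):

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Attach an eventfd to a GSI so that KVM injects the interrupt
 * in-kernel; writing 1 to the returned fd then raises the interrupt
 * without a return to userspace. */
static int attach_irqfd(int vm_fd, int gsi)
{
    struct kvm_irqfd irqfd;
    int efd = eventfd(0, 0);

    if (efd < 0) {
        return -1;
    }

    memset(&irqfd, 0, sizeof(irqfd));
    irqfd.fd = efd;
    irqfd.gsi = gsi;

    if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0) {
        return -1;
    }

    return efd;
}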
Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
hw/nvme/ctrl.c | 145 ++++++++++++++++++++++++++++++++++++++++++-
hw/nvme/nvme.h | 3 +
hw/nvme/trace-events | 3 +
3 files changed, 149 insertions(+), 2 deletions(-)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 51792f3955..396f3f0cdd 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -192,6 +192,7 @@
#include "qapi/error.h"
#include "qapi/visitor.h"
#include "sysemu/sysemu.h"
+#include "sysemu/kvm.h"
#include "sysemu/block-backend.h"
#include "sysemu/hostmem.h"
#include "hw/pci/msix.h"
@@ -1377,8 +1378,115 @@ static void nvme_deassert_notifier_read(EventNotifier *e)
}
}
+static int nvme_kvm_vector_use(NvmeCtrl *n, NvmeCQueue *cq, uint32_t vector)
+{
+ KVMRouteChange c = kvm_irqchip_begin_route_changes(kvm_state);
+ int ret;
+
+ ret = kvm_irqchip_add_msi_route(&c, vector, &n->parent_obj);
+ if (ret < 0) {
+ return ret;
+ }
+
+ kvm_irqchip_commit_route_changes(&c);
+
+ cq->virq = ret;
+
+ return 0;
+}
+
+static int nvme_kvm_vector_unmask(PCIDevice *pci_dev, unsigned vector,
+ MSIMessage msg)
+{
+ NvmeCtrl *n = NVME(pci_dev);
+ int ret;
+
+ trace_pci_nvme_irq_unmask(vector, msg.address, msg.data);
+
+ for (uint32_t i = 1; i <= n->params.max_ioqpairs; i++) {
+ NvmeCQueue *cq = n->cq[i];
+
+ if (!cq) {
+ continue;
+ }
+
+ if (cq->vector == vector) {
+ if (cq->msg.data != msg.data || cq->msg.address != msg.address) {
+ ret = kvm_irqchip_update_msi_route(kvm_state, cq->virq, msg,
+ pci_dev);
+ if (ret < 0) {
+ return ret;
+ }
+
+ kvm_irqchip_commit_routes(kvm_state);
+
+ cq->msg = msg;
+ }
+
+ ret = kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+ &cq->assert_notifier,
+ NULL, cq->virq);
+ if (ret < 0) {
+ return ret;
+ }
+ }
+ }
+
+ return 0;
+}
+
+static void nvme_kvm_vector_mask(PCIDevice *pci_dev, unsigned vector)
+{
+ NvmeCtrl *n = NVME(pci_dev);
+
+ trace_pci_nvme_irq_mask(vector);
+
+ for (uint32_t i = 1; i <= n->params.max_ioqpairs; i++) {
+ NvmeCQueue *cq = n->cq[i];
+
+ if (!cq) {
+ continue;
+ }
+
+ if (cq->vector == vector) {
+ kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+ &cq->assert_notifier,
+ cq->virq);
+ }
+ }
+}
+
+static void nvme_kvm_vector_poll(PCIDevice *pci_dev, unsigned int vector_start,
+ unsigned int vector_end)
+{
+ NvmeCtrl *n = NVME(pci_dev);
+
+ trace_pci_nvme_irq_poll(vector_start, vector_end);
+
+ for (uint32_t i = 1; i <= n->params.max_ioqpairs; i++) {
+ NvmeCQueue *cq = n->cq[i];
+
+ if (!cq) {
+ continue;
+ }
+
+ if (!msix_is_masked(pci_dev, cq->vector)) {
+ continue;
+ }
+
+ if (cq->vector >= vector_start && cq->vector <= vector_end) {
+ if (event_notifier_test_and_clear(&cq->assert_notifier)) {
+ msix_set_pending(pci_dev, cq->vector);
+ }
+ }
+ }
+}
+
+
static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
{
+ bool with_irqfd = msix_enabled(&n->parent_obj) &&
+ kvm_msi_via_irqfd_enabled();
int ret;
ret = event_notifier_init(&cq->assert_notifier, 0);
@@ -1386,12 +1494,27 @@ static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
return;
}
- event_notifier_set_handler(&cq->assert_notifier,
- nvme_assert_notifier_read);
+ if (with_irqfd) {
+ ret = nvme_kvm_vector_use(n, cq, cq->vector);
+ if (ret < 0) {
+ event_notifier_cleanup(&cq->assert_notifier);
+
+ return;
+ }
+ } else {
+ event_notifier_set_handler(&cq->assert_notifier,
+ nvme_assert_notifier_read);
+ }
if (!msix_enabled(&n->parent_obj)) {
ret = event_notifier_init(&cq->deassert_notifier, 0);
if (ret < 0) {
+ if (with_irqfd) {
+ kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+ &cq->assert_notifier,
+ cq->virq);
+ }
+
event_notifier_set_handler(&cq->assert_notifier, NULL);
event_notifier_cleanup(&cq->assert_notifier);
@@ -4764,6 +4887,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
{
+ bool with_irqfd = msix_enabled(&n->parent_obj) &&
+ kvm_msi_via_irqfd_enabled();
uint16_t offset = (cq->cqid << 3) + (1 << 2);
n->cq[cq->cqid] = NULL;
@@ -4775,6 +4900,12 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
event_notifier_cleanup(&cq->notifier);
}
if (cq->assert_notifier.initialized) {
+ if (with_irqfd) {
+ kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+ &cq->assert_notifier,
+ cq->virq);
+ kvm_irqchip_release_virq(kvm_state, cq->virq);
+ }
event_notifier_set_handler(&cq->assert_notifier, NULL);
event_notifier_cleanup(&cq->assert_notifier);
}
@@ -6528,6 +6659,9 @@ static int nvme_start_ctrl(NvmeCtrl *n)
uint32_t page_size = 1 << page_bits;
NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
+ bool with_irqfd = msix_enabled(&n->parent_obj) &&
+ kvm_msi_via_irqfd_enabled();
+
if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
trace_pci_nvme_err_startfail_virt_state(le16_to_cpu(sctrl->nvi),
le16_to_cpu(sctrl->nvq),
@@ -6617,6 +6751,12 @@ static int nvme_start_ctrl(NvmeCtrl *n)
nvme_select_iocs(n);
+ if (with_irqfd) {
+ return msix_set_vector_notifiers(PCI_DEVICE(n), nvme_kvm_vector_unmask,
+ nvme_kvm_vector_mask,
+ nvme_kvm_vector_poll);
+ }
+
return 0;
}
@@ -7734,6 +7874,7 @@ static void nvme_exit(PCIDevice *pci_dev)
pcie_sriov_pf_exit(pci_dev);
}
+ msix_unset_vector_notifiers(pci_dev);
msix_uninit(pci_dev, &n->bar0, &n->bar0);
memory_region_del_subregion(&n->bar0, &n->iomem);
}
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 4850d3e965..b0b986b024 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -20,6 +20,7 @@
#include "qemu/uuid.h"
#include "hw/pci/pci.h"
+#include "hw/pci/msi.h"
#include "hw/block/block.h"
#include "block/nvme.h"
@@ -396,10 +397,12 @@ typedef struct NvmeCQueue {
uint64_t dma_addr;
uint64_t db_addr;
uint64_t ei_addr;
+ int virq;
QEMUTimer *timer;
EventNotifier notifier;
EventNotifier assert_notifier;
EventNotifier deassert_notifier;
+ MSIMessage msg;
bool ioeventfd_enabled;
QTAILQ_HEAD(, NvmeSQueue) sq_list;
QTAILQ_HEAD(, NvmeRequest) req_list;
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index fccb79f489..b11fcf4a65 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -2,6 +2,9 @@
pci_nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
pci_nvme_irq_pin(void) "pulsing IRQ pin"
pci_nvme_irq_masked(void) "IRQ is masked"
+pci_nvme_irq_mask(uint32_t vector) "IRQ %u gets masked"
+pci_nvme_irq_unmask(uint32_t vector, uint64_t addr, uint32_t data) "IRQ %u gets unmasked, addr=0x%"PRIx64" data=0x%"PRIx32""
+pci_nvme_irq_poll(uint32_t vector_start, uint32_t vector_end) "IRQ poll, start=%"PRIu32" end=%"PRIu32""
pci_nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
pci_nvme_dbbuf_config(uint64_t dbs_addr, uint64_t eis_addr) "dbs_addr=0x%"PRIx64" eis_addr=0x%"PRIx64""
pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
--
2.25.1
* [PATCH 3/3] hw/nvme: add iothread support
From: Jinhao Fan @ 2022-08-26 11:18 UTC
To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan, open list:nvme
Add an option "iothread=x" to do emulation in a separate iothread.
This improves performance because QEMU's main loop is responsible for
a lot of other work, while the iothread is dedicated to NVMe emulation.
Moreover, emulating in an iothread opens the door to polling on SQ/CQ
doorbells, which I will bring up in a following patch.
The iothread can be enabled with:

  -object iothread,id=nvme0 \
  -device nvme,iothread=nvme0
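Internally, the option amounts to binding the device's timers and
event notifiers to the iothread's AioContext. A rough sketch of that
wiring (QEMU-internal APIs; the "nvme0" id and helper name are
illustrative):

#include "qemu/osdep.h"
#include "qemu/timer.h"
#include "block/aio.h"
#include "qapi/error.h"
#include "sysemu/iothread.h"

/* Timers (and event notifiers) attached to the iothread's AioContext
 * fire in the iothread rather than in the main loop. */
static QEMUTimer *nvme_timer_in_iothread(QEMUTimerCB *cb, void *opaque)
{
    IOThread *iot = iothread_create("nvme0", &error_abort);
    AioContext *ctx = iothread_get_aio_context(iot);

    return aio_timer_new(ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS, cb, opaque);
}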
Performance comparisons (KIOPS):

  QD         1    4   16   64
  QEMU      41  136  242  338
  iothread  53  155  245  309
Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
hw/nvme/ctrl.c | 74 +++++++++++++++++++++++++++++++++++++++++++++-----
hw/nvme/ns.c | 21 +++++++++++---
hw/nvme/nvme.h | 6 +++-
3 files changed, 89 insertions(+), 12 deletions(-)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 396f3f0cdd..24a367329d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4458,7 +4458,13 @@ static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
return ret;
}
- event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
+ if (cq->cqid) {
+ aio_set_event_notifier(n->ctx, &cq->notifier, true, nvme_cq_notifier,
+ NULL, NULL);
+ } else {
+ event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
+ }
+
memory_region_add_eventfd(&n->iomem,
0x1000 + offset, 4, false, 0, &cq->notifier);
@@ -4487,7 +4493,13 @@ static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
return ret;
}
- event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
+ if (sq->sqid) {
+ aio_set_event_notifier(n->ctx, &sq->notifier, true, nvme_sq_notifier,
+ NULL, NULL);
+ } else {
+ event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
+ }
+
memory_region_add_eventfd(&n->iomem,
0x1000 + offset, 4, false, 0, &sq->notifier);
@@ -4503,7 +4515,12 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
if (sq->ioeventfd_enabled) {
memory_region_del_eventfd(&n->iomem,
0x1000 + offset, 4, false, 0, &sq->notifier);
- event_notifier_set_handler(&sq->notifier, NULL);
+ if (sq->sqid) {
+ aio_set_event_notifier(n->ctx, &sq->notifier, true, NULL, NULL,
+ NULL);
+ } else {
+ event_notifier_set_handler(&sq->notifier, NULL);
+ }
event_notifier_cleanup(&sq->notifier);
}
g_free(sq->io_req);
@@ -4573,7 +4590,13 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
sq->io_req[i].sq = sq;
QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
}
- sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
+
+ if (sq->sqid) {
+ sq->timer = aio_timer_new(n->ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS,
+ nvme_process_sq, sq);
+ } else {
+ sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
+ }
if (n->dbbuf_enabled) {
sq->db_addr = n->dbbuf_dbs + (sqid << 3);
@@ -4896,7 +4919,12 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
if (cq->ioeventfd_enabled) {
memory_region_del_eventfd(&n->iomem,
0x1000 + offset, 4, false, 0, &cq->notifier);
- event_notifier_set_handler(&cq->notifier, NULL);
+ if (cq->cqid) {
+ aio_set_event_notifier(n->ctx, &cq->notifier, true, NULL, NULL,
+ NULL);
+ } else {
+ event_notifier_set_handler(&cq->notifier, NULL);
+ }
event_notifier_cleanup(&cq->notifier);
}
if (cq->assert_notifier.initialized) {
@@ -4979,7 +5007,13 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
}
}
n->cq[cqid] = cq;
- cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+
+ if (cq->cqid) {
+ cq->timer = aio_timer_new(n->ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS,
+ nvme_post_cqes, cq);
+ } else {
+ cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+ }
/*
* Only enable irq eventfd for IO queues since we always emulate admin
@@ -4988,6 +5022,13 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
if (cqid && n->params.irq_eventfd) {
nvme_init_irq_notifier(n, cq);
}
+
+ if (cq->cqid) {
+ cq->timer = aio_timer_new(n->ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS,
+ nvme_post_cqes, cq);
+ } else {
+ cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+ }
}
static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
@@ -7759,6 +7800,14 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
}
+
+ if (n->params.iothread) {
+ n->iothread = n->params.iothread;
+ object_ref(OBJECT(n->iothread));
+ n->ctx = iothread_get_aio_context(n->iothread);
+ } else {
+ n->ctx = qemu_get_aio_context();
+ }
}
static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
@@ -7831,7 +7880,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
ns = &n->namespace;
ns->params.nsid = 1;
- if (nvme_ns_setup(ns, errp)) {
+ if (nvme_ns_setup(ns, n->ctx, errp)) {
return;
}
@@ -7862,6 +7911,15 @@ static void nvme_exit(PCIDevice *pci_dev)
g_free(n->sq);
g_free(n->aer_reqs);
+ aio_context_acquire(n->ctx);
+ blk_set_aio_context(n->namespace.blkconf.blk, qemu_get_aio_context(), NULL);
+ aio_context_release(n->ctx);
+
+ if (n->iothread) {
+ object_unref(OBJECT(n->iothread));
+ n->iothread = NULL;
+ }
+
if (n->params.cmb_size_mb) {
g_free(n->cmb.buf);
}
@@ -7885,6 +7943,8 @@ static Property nvme_props[] = {
HostMemoryBackend *),
DEFINE_PROP_LINK("subsys", NvmeCtrl, subsys, TYPE_NVME_SUBSYS,
NvmeSubsystem *),
+ DEFINE_PROP_LINK("iothread", NvmeCtrl, params.iothread, TYPE_IOTHREAD,
+ IOThread *),
DEFINE_PROP_STRING("serial", NvmeCtrl, params.serial),
DEFINE_PROP_UINT32("cmb_size_mb", NvmeCtrl, params.cmb_size_mb, 0),
DEFINE_PROP_UINT32("num_queues", NvmeCtrl, params.num_queues, 0),
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 62a1f97be0..eb9141a67b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -146,9 +146,11 @@ lbaf_found:
return 0;
}
-static int nvme_ns_init_blk(NvmeNamespace *ns, Error **errp)
+static int nvme_ns_init_blk(NvmeNamespace *ns, AioContext *ctx, Error **errp)
{
bool read_only;
+ AioContext *old_context;
+ int ret;
if (!blkconf_blocksizes(&ns->blkconf, errp)) {
return -1;
@@ -170,6 +172,17 @@ static int nvme_ns_init_blk(NvmeNamespace *ns, Error **errp)
return -1;
}
+ old_context = blk_get_aio_context(ns->blkconf.blk);
+ aio_context_acquire(old_context);
+ ret = blk_set_aio_context(ns->blkconf.blk, ctx, errp);
+ aio_context_release(old_context);
+
+ if (ret) {
+ error_setg(errp, "Set AioContext on BlockBackend failed");
+ return ret;
+ }
+
+
return 0;
}
@@ -482,13 +495,13 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
return 0;
}
-int nvme_ns_setup(NvmeNamespace *ns, Error **errp)
+int nvme_ns_setup(NvmeNamespace *ns, AioContext *ctx, Error **errp)
{
if (nvme_ns_check_constraints(ns, errp)) {
return -1;
}
- if (nvme_ns_init_blk(ns, errp)) {
+ if (nvme_ns_init_blk(ns, ctx, errp)) {
return -1;
}
@@ -563,7 +576,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
}
}
- if (nvme_ns_setup(ns, errp)) {
+ if (nvme_ns_setup(ns, n->ctx, errp)) {
return;
}
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index b0b986b024..224b73e6c4 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -22,6 +22,7 @@
#include "hw/pci/pci.h"
#include "hw/pci/msi.h"
#include "hw/block/block.h"
+#include "sysemu/iothread.h"
#include "block/nvme.h"
@@ -276,7 +277,7 @@ static inline void nvme_aor_dec_active(NvmeNamespace *ns)
}
void nvme_ns_init_format(NvmeNamespace *ns);
-int nvme_ns_setup(NvmeNamespace *ns, Error **errp);
+int nvme_ns_setup(NvmeNamespace *ns, AioContext *ctx, Error **errp);
void nvme_ns_drain(NvmeNamespace *ns);
void nvme_ns_shutdown(NvmeNamespace *ns);
void nvme_ns_cleanup(NvmeNamespace *ns);
@@ -433,6 +434,7 @@ typedef struct NvmeParams {
uint16_t sriov_vi_flexible;
uint8_t sriov_max_vq_per_vf;
uint8_t sriov_max_vi_per_vf;
+ IOThread *iothread;
} NvmeParams;
typedef struct NvmeCtrl {
@@ -464,6 +466,8 @@ typedef struct NvmeCtrl {
uint64_t dbbuf_dbs;
uint64_t dbbuf_eis;
bool dbbuf_enabled;
+ IOThread *iothread;
+ AioContext *ctx;
struct {
MemoryRegion mem;
--
2.25.1
* Re: [PATCH 3/3] hw/nvme: add iothread support
From: Jinhao Fan @ 2022-08-26 15:10 UTC
To: qemu-devel; +Cc: Klaus Jensen, kbusch, stefanha, open list:nvme
at 7:18 PM, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:
> @@ -4979,7 +5007,13 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
> }
> }
> n->cq[cqid] = cq;
> - cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
> +
> + if (cq->cqid) {
> + cq->timer = aio_timer_new(n->ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS,
> + nvme_post_cqes, cq);
> + } else {
> + cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
> + }
>
> /*
> * Only enable irq eventfd for IO queues since we always emulate admin
> @@ -4988,6 +5022,13 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
> if (cqid && n->params.irq_eventfd) {
> nvme_init_irq_notifier(n, cq);
> }
> +
> + if (cq->cqid) {
> + cq->timer = aio_timer_new(n->ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS,
> + nvme_post_cqes, cq);
> + } else {
> + cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
> + }
> }
Duplicated initialization of cq->timer.
* Re: [PATCH 0/3] iothread and irqfd support
From: Stefan Hajnoczi @ 2022-10-12 14:28 UTC
To: Jinhao Fan; +Cc: qemu-devel, its, kbusch
On Fri, 26 Aug 2022 at 07:18, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:
>
> This patch series adds support for using a separate iothread for NVMe
> IO emulation, which opens the door to polling. The first two patches
> implement support for irqfd, which solves thread-safety problems with
> interrupt emulation outside the main loop thread.
>
> Jinhao Fan (3):
> hw/nvme: support irq(de)assertion with eventfd
> hw/nvme: use KVM irqfd when available
> hw/nvme: add iothread support
Hi,
What is the status of this series?
Stefan
* Re: [PATCH 0/3] iothread and irqfd support
From: Klaus Jensen @ 2022-10-12 14:39 UTC
To: Stefan Hajnoczi; +Cc: Jinhao Fan, qemu-devel, kbusch
On Oct 12 10:28, Stefan Hajnoczi wrote:
> On Fri, 26 Aug 2022 at 07:18, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:
> >
> > This patch series adds support for using a separate iothread for NVMe
> > IO emulation, which opens the door to polling. The first two patches
> > implement support for irqfd, which solves thread-safety problems with
> > interrupt emulation outside the main loop thread.
> >
> > Jinhao Fan (3):
> > hw/nvme: support irq(de)assertion with eventfd
> > hw/nvme: use KVM irqfd when available
> > hw/nvme: add iothread support
>
> Hi,
> What is the status of this series?
>
I have been meaning to pick it up, but I got side-tracked. The polling
performance drop needs to be addressed, as we discussed offline.
But the v4 looks pretty good and I can pick that up without the polling
support for now.
* Re: [PATCH 0/3] iothread and irqfd support
From: Jinhao Fan @ 2022-10-12 15:35 UTC
To: Klaus Jensen, Stefan Hajnoczi; +Cc: qemu-devel, kbusch
On 10/12/2022 10:39 PM, Klaus Jensen wrote:
> I have been meaning to pick it up, but I got side-tracked. The polling
> performance drop needs to be addressed, as we discussed offline.
>
> But the v4 looks pretty good and I can pick that up without the polling
> support for now.
I've been using the v4 without polling for my daily work. It worked
pretty well for my test workloads.
I'm not sure what needs to be done for the polling problem. I can try to
add a completion batching mechanism with BH, similar to virtio-blk. Do
you think this is the right direction?
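Something along these lines, roughly (QEMU-internal APIs; the names
are illustrative and the comment stands in for the real IRQ assertion
path):

#include "qemu/osdep.h"
#include "block/aio.h"

typedef struct {
    QEMUBH  *bh;
    unsigned pending;   /* completions since the last interrupt */
} CQBatch;

static void cq_batch_bh(void *opaque)
{
    CQBatch *batch = opaque;

    if (batch->pending) {
        batch->pending = 0;
        /* a single nvme_irq_assert() for the whole batch here */
    }
}

/* Called per completion; scheduling is idempotent until the BH runs,
 * so completions within one event-loop iteration share one IRQ. */
static void cq_batch_add(CQBatch *batch, AioContext *ctx)
{
    if (!batch->bh) {
        batch->bh = aio_bh_new(ctx, cq_batch_bh, batch);
    }
    batch->pending++;
    qemu_bh_schedule(batch->bh);
}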
* Re: [PATCH 0/3] iothread and irqfd support
From: Stefan Hajnoczi @ 2022-10-12 15:53 UTC
To: Jinhao Fan; +Cc: Klaus Jensen, qemu-devel, kbusch
virtio-blk's dataplane BH completion batching mechanism is not enabled
by default and the performance results are mixed. If you develop a
different mechanism from scratch I think there's a good chance it
would work better :).
This looks like a queuing theory problem to me. It should be possible
to model IOPS as a function of some parameters and then hopefully find
simple rules to optimize IOPS by adjusting some of the parameters at
runtime. I haven't looked into this much myself though, so I don't
have any concrete suggestion. The basic idea is that as long as events
occur at a minimum rate, they can be batched to maximize throughput
without sacrificing too much latency. If the rate drops, the device
cannot hold back events.
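A hypothetical sketch of such a rule, with purely illustrative
thresholds:

#include <stdbool.h>
#include <stdint.h>

enum {
    MAX_HOLDOFF_NS = 50 * 1000,   /* latency budget per batch */
    MIN_GAP_NS     = 10 * 1000,   /* larger gaps mean the rate dropped */
};

typedef struct {
    uint64_t first_ns;   /* arrival of the oldest unreported event */
    uint64_t last_ns;    /* arrival of the newest event */
    unsigned pending;
} Batch;

/* Called per event; returns true when the interrupt should fire now. */
static bool batch_should_fire(Batch *b, uint64_t now_ns)
{
    bool rate_dropped = b->pending && now_ns - b->last_ns > MIN_GAP_NS;

    if (!b->pending) {
        b->first_ns = now_ns;
    }
    b->pending++;
    b->last_ns = now_ns;

    if (rate_dropped || now_ns - b->first_ns > MAX_HOLDOFF_NS) {
        b->pending = 0;   /* rate too low or budget spent: flush */
        return true;
    }
    return false;
}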
Another place to look for inspiration is network cards. In Linux it's
common to use the NAPI framework to disable further receive interrupts
and then poll until the receive queue becomes empty. Transmit
completions can also be mitigated, but I'm not sure what the most
common approach is there.
Stefan
* Re: [PATCH 0/3] iothread and irqfd support
From: Jinhao Fan @ 2022-10-14 4:49 UTC
To: Stefan Hajnoczi; +Cc: Klaus Jensen, qemu-devel, Keith Busch
I also came across the Interrupt Coalescing feature in NVMe (5.21.1.8 in
Spec 1.4c). It seems to suggest an algorithm based on thresholds for the
maximum interrupt delay time and minimum number of coalesced interrupts.
Klaus, based on your experience, how does this algorithm work? Is it
commonly used by controllers and drivers?
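From my reading of the spec, the suggested rule would look roughly
like this (THR is 0's based, TIME is in 100 microsecond increments;
purely illustrative):

#include <stdbool.h>
#include <stdint.h>

struct irq_coalesce {
    uint8_t  thr;        /* aggregate THR + 1 entries (0's based) */
    uint8_t  time;       /* maximum delay, in 100 us increments */
    uint32_t aggregated;
    uint64_t first_ns;   /* arrival of the first held-back entry */
};

/* Called per completion queue entry; true means assert the interrupt. */
static bool irq_coalesce_fire(struct irq_coalesce *ic, uint64_t now_ns)
{
    if (ic->aggregated++ == 0) {
        ic->first_ns = now_ns;
    }

    if (ic->aggregated > ic->thr ||
        now_ns - ic->first_ns >= (uint64_t)ic->time * 100000) {
        ic->aggregated = 0;
        return true;
    }
    return false;       /* keep coalescing */
}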