qemu-devel.nongnu.org archive mirror
* [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
@ 2024-05-23 17:39 Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 01/13] exec/memtxattr: add process identifier to the transaction attributes Daniel Henrique Barboza
                   ` (14 more replies)
  0 siblings, 15 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

Hi,

This new version has changes throughout the code, most notably in
patch 3. The previous version is linked at [1].

* How it was tested *

This series was tested using an emulated QEMU RISC-V host booting a QEMU
KVM guest, passing through an emulated e1000 network card from the host
to the guest. I can provide more details (e.g. QEMU command lines) if
required, just let me know. For now this cover-letter is too much of an
essay as is.

The Linux kernel used for tests can be found here:

https://github.com/tjeznach/linux/tree/riscv_iommu_v6-rc3

This is a newer version of the following work from Tomasz: 

https://lore.kernel.org/linux-riscv/cover.1715708679.git.tjeznach@rivosinc.com/
("[PATCH v5 0/7] Linux RISC-V IOMMU Support")

The v5 wasn't enough for the testing being done. v6-rc3 did the trick.

Note that to test this work using riscv-iommu-pci we'll need to provide
the Rivos PCI ID in the command line. More details down below.

* Highlights of this version *

- patches removed from v2: platform driver (riscv-iommu-sys, former
patch 05) and the EDU changes (patches 14 and 15). The platform driver
will be sent later with a working example on the 'virt' machine,
either on a newer version of this series or via a follow-up series. We
already have a PoC on [2] created by Sunil. More tests are needed, so
it'll be left behind for now. The EDU changes will be sent separately
after I finish the doc changes that Frank cited in v2.

- patch 3 contains the bulk of changes made from v2. Please give special
attention to the following functions since this is entirely new code I
ended up adding:
 
 - riscv_iommu_report_fault()
 - riscv_iommu_validate_device_ctx() 
 - riscv_iommu_update_ipsr() 
 
  Aside from these helpers, most of the changes made in patch 3 were small
and localized.

- Red Hat PCI ID related changes. A new patch (4) that introduces a
generic RISC-V IOMMU PCI ID was added. This PCI ID was graciously given
to us by Red Hat and Gerd Hoffmann from their ID space. The
riscv-iommu-pci device now defaults to this PCI ID instead of Rivos PCI
ID. The device was changed slightly to allow vendor-id and device-id to
be set in the command-line, so it's now possible to use this reference
device as another RISC-V IOMMU PCI device to ease the burden of
testing/development.

  To instantiate the riscv-iommu-pci device using the previous Rivos PCI
ID, use the following cmd line:

  -device riscv-iommu-pci,vendor-id=0x1efd,device-id=0xedf1 

  I'm using these options to test the series with the existing Linux RISC-V
IOMMU support that uses just a Rivos ID to identify the device.


Series based on alistair/riscv-to-apply.next. It's also applicable on
current QEMU master. It can also be fetched from:

https://gitlab.com/danielhb/qemu/-/tree/riscv_iommu_v3
 

Patches missing reviews/acks: 3, 5, 9, 10, 11.

Changes from v2 [1]:
- patch 05 (hw/riscv: add riscv-iommu-sys platform device): dropped
  - will be reintroduced in a later review or as a follow-up series

- patches 14 and 15: dropped
  - will be sent separately

- patches 2, 3, 4 and 5:
  - removed all 'Ziommu' references

- patch 2:
  - added extra bits that patch 3 ended up using

- patch 3:
  - fixed blank line at EOF in hw/riscv/trace.h
  - added a riscv_iommu_report_fault() helper to report faults. The helper checks if
    a given fault is eligible to be reported if DTF is 1
  - Use riscv_iommu_report_fault() in riscv_iommu_ctx() and riscv_iommu_translate()
    to avoid code repetition
  - added a riscv_iommu_validate_device_ctx() helper to validate the device context
    as specified in "Device configuration checks" section. This helper is being used
    in riscv_iommu_ctx_fetch()
  - added a new riscv_iommu_update_ipsr() helper to handle IPSR updates
    in riscv_iommu_mmio_write()
  - riscv_iommu_msi_write() now reports a fault in all error paths
  - check for fctl.WSI before issuing an MSI interrupt in riscv_iommu_notify()
  - change riscv-iommu region name to 'riscv-iommu'
  - change address_space_init() name for PCI devices to 'name' instead of using TYPE_RISCV_IOMMU_PCI
  - changed riscv_iommu_mmio_ops min_access_size to 4
  - do not check for min and max sizes on riscv_iommu_mmio_write()
  - changed riscv_iommu_trap_ops  min_access_size to 4
  - removed IOMMU qemu_thread thread:
    - riscv_iommu_mmio_write() will now execute a riscv_iommu_process_fn by holding
      'core_lock'
  - init FSCR as zero explicitly
  - check for bus->iommu_opaque == NULL before calling pci_setup_iommu()

- patch 4 (new):
  - add Red Hat PCI RISC-V IOMMU ID

- patch 5 (former 4):
  - create vendor-id and device-id properties
  - set Red Hat PCI RISC-V IOMMU ID as default ID

- patch 8:
  - use IOMMU_NONE instead of '0' in relevant 'iot->perm = 0' instances

- patch 9:
  - add s-stage and g-stage steps in riscv_iommu_validate_device_ctx()
  - removed 'gpa' boolean from riscv_iommu_spa_fetch()
  - 'en_s' is no longer used for early MSI address match

- patch 10:
  - add ATS steps in riscv_iommu_validate_device_ctx()
  - check for 's->enable_ats' before adding RISCV_IOMMU_DC_TC_EN_ATS in device context
  - check for 's->enable_ats' before processing ATS commands in riscv_iommu_process_cq_tail()
  - remove ambiguous trace_riscv_iommu_ats() from riscv_iommu_translate()

- patch 11:
  - removed unused bits
  - added RISCV_IOMMU_TR_REQ_CTL_NW and RISCV_IOMMU_TR_RESPONSE_S
    bits
  - set IOMMUTLBEntry 'perm' using RISCV_IOMMU_TR_REQ_CTL_NW in riscv_iommu_process_dbg()
  - clear RISCV_IOMMU_TR_RESPONSE_S in riscv_iommu_process_dbg(). Added a comment talking about the (lack of) superpage support
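
The DTF check described for riscv_iommu_report_fault() in the patch 3
notes above can be sketched roughly as follows. This is only an
illustration, not the code from the series: fault_is_reportable() and the
CAUSE_*/DC_TC_DTF names are hypothetical, the cause numbers are copied
from riscv-iommu-bits.h in patch 2, and the set of causes that stay
reportable with DTF=1 is my reading of the fault cause table in the IOMMU
spec, so please check it against the spec before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cause codes, same values as enum riscv_iommu_fq_causes in patch 2. */
enum {
    CAUSE_DMA_DISABLED      = 256,
    CAUSE_DDT_LOAD_FAULT    = 257,
    CAUSE_DDT_INVALID       = 258,
    CAUSE_DDT_MISCONFIGURED = 259,
    CAUSE_DDT_CORRUPTED     = 268,
    CAUSE_INTERNAL_DP_ERROR = 272,
    CAUSE_MSI_WR_FAULT      = 273,
};

/* Bit 4 of the device context 'tc' word (RISCV_IOMMU_DC_TC_DTF). */
#define DC_TC_DTF (1ULL << 4)

/*
 * Hypothetical helper: returns true if a fault with the given cause
 * should be written to the fault queue for a device context whose
 * translation control word is 'tc'.
 */
static bool fault_is_reportable(uint64_t tc, uint32_t cause)
{
    if (!(tc & DC_TC_DTF)) {
        /* DTF clear: every fault is eligible. */
        return true;
    }

    /* DTF set: only this subset is assumed to remain reportable. */
    switch (cause) {
    case CAUSE_DMA_DISABLED:
    case CAUSE_DDT_LOAD_FAULT:
    case CAUSE_DDT_INVALID:
    case CAUSE_DDT_MISCONFIGURED:
    case CAUSE_DDT_CORRUPTED:
    case CAUSE_INTERNAL_DP_ERROR:
    case CAUSE_MSI_WR_FAULT:
        return true;
    default:
        return false;
    }
}
```

With this shape, a caller would bail out early when the helper returns
false and only then build the fault record, which keeps the filtering in
one place instead of repeating it in each error path.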
 
[1] https://lore.kernel.org/qemu-riscv/20240307160319.675044-1-dbarboza@ventanamicro.com/
[2] https://github.com/vlsunil/qemu/commits/acpi_rimt_poc_v1/

Andrew Jones (1):
  hw/riscv/riscv-iommu: Add another irq for mrif notifications

Daniel Henrique Barboza (3):
  pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device
  test/qtest: add riscv-iommu-pci tests
  qtest/riscv-iommu-test: add init queues test

Tomasz Jeznach (9):
  exec/memtxattr: add process identifier to the transaction attributes
  hw/riscv: add riscv-iommu-bits.h
  hw/riscv: add RISC-V IOMMU base emulation
  hw/riscv: add riscv-iommu-pci reference device
  hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug
  hw/riscv/riscv-iommu: add Address Translation Cache (IOATC)
  hw/riscv/riscv-iommu: add s-stage and g-stage support
  hw/riscv/riscv-iommu: add ATS support
  hw/riscv/riscv-iommu: add DBG support

 docs/specs/pci-ids.rst           |    2 +
 hw/riscv/Kconfig                 |    4 +
 hw/riscv/meson.build             |    1 +
 hw/riscv/riscv-iommu-bits.h      |  416 ++++++
 hw/riscv/riscv-iommu-pci.c       |  177 +++
 hw/riscv/riscv-iommu.c           | 2283 ++++++++++++++++++++++++++++++
 hw/riscv/riscv-iommu.h           |  146 ++
 hw/riscv/trace-events            |   15 +
 hw/riscv/trace.h                 |    1 +
 hw/riscv/virt.c                  |   33 +-
 include/exec/memattrs.h          |    5 +
 include/hw/pci/pci.h             |    1 +
 include/hw/riscv/iommu.h         |   36 +
 meson.build                      |    1 +
 tests/qtest/libqos/meson.build   |    4 +
 tests/qtest/libqos/riscv-iommu.c |   76 +
 tests/qtest/libqos/riscv-iommu.h |  100 ++
 tests/qtest/meson.build          |    1 +
 tests/qtest/riscv-iommu-test.c   |  234 +++
 19 files changed, 3535 insertions(+), 1 deletion(-)
 create mode 100644 hw/riscv/riscv-iommu-bits.h
 create mode 100644 hw/riscv/riscv-iommu-pci.c
 create mode 100644 hw/riscv/riscv-iommu.c
 create mode 100644 hw/riscv/riscv-iommu.h
 create mode 100644 hw/riscv/trace-events
 create mode 100644 hw/riscv/trace.h
 create mode 100644 include/hw/riscv/iommu.h
 create mode 100644 tests/qtest/libqos/riscv-iommu.c
 create mode 100644 tests/qtest/libqos/riscv-iommu.h
 create mode 100644 tests/qtest/riscv-iommu-test.c

-- 
2.44.0




* [PATCH v3 01/13] exec/memtxattr: add process identifier to the transaction attributes
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 02/13] hw/riscv: add riscv-iommu-bits.h Daniel Henrique Barboza
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang

From: Tomasz Jeznach <tjeznach@rivosinc.com>

Extend memory transaction attributes with process identifier to allow
per-request address translation logic to use requester_id / process_id
to identify memory mapping (e.g. enabling IOMMU w/ PASID translations).

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
 include/exec/memattrs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/exec/memattrs.h b/include/exec/memattrs.h
index 14cdd8d582..46d0725416 100644
--- a/include/exec/memattrs.h
+++ b/include/exec/memattrs.h
@@ -52,6 +52,11 @@ typedef struct MemTxAttrs {
     unsigned int memory:1;
     /* Requester ID (for MSI for example) */
     unsigned int requester_id:16;
+
+    /*
+     * PCI PASID support: Limited to 8 bits process identifier.
+     */
+    unsigned int pasid:8;
 } MemTxAttrs;
 
 /* Bus masters which don't specify any attributes will get this,
-- 
2.44.0
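
To illustrate the attribute added by this patch, here is a minimal
sketch of how a requester could tag a transaction with both identifiers.
DemoTxAttrs is a simplified stand-in for MemTxAttrs that carries only the
two fields discussed here, and demo_pasid_fits() / demo_make_attrs() are
hypothetical helpers, not QEMU API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for MemTxAttrs: only the fields relevant here. */
typedef struct DemoTxAttrs {
    unsigned int requester_id:16;   /* PCI requester ID */
    unsigned int pasid:8;           /* process identifier, 8 bits wide */
} DemoTxAttrs;

/* The bit-field is 8 bits wide, so larger identifiers do not fit. */
static bool demo_pasid_fits(uint32_t pasid)
{
    return pasid <= 0xFF;
}

/* Build attributes for a request; the caller must pass a PASID that fits. */
static DemoTxAttrs demo_make_attrs(uint16_t requester_id, uint32_t pasid)
{
    assert(demo_pasid_fits(pasid));

    DemoTxAttrs attrs = {
        .requester_id = requester_id,
        .pasid = pasid,
    };
    return attrs;
}
```

The explicit range check matters because assigning a wider value to an
8-bit bit-field silently drops the upper bits, which would make two
different process identifiers indistinguishable to the translation logic.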




* [PATCH v3 02/13] hw/riscv: add riscv-iommu-bits.h
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 01/13] exec/memtxattr: add process identifier to the transaction attributes Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-05-28  6:41   ` Eric Cheng
  2024-05-23 17:39 ` [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation Daniel Henrique Barboza
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

This header will be used by the RISC-V IOMMU emulation to be added
in the next patch. Due to its size it's being sent separately for
an easier review.

One thing to notice is that this header can be replaced by the future
Linux RISC-V IOMMU driver header, which would become a linux-header we
would import instead of keeping our own. The Linux implementation isn't
upstream yet so for now we'll have to manage riscv-iommu-bits.h.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
 hw/riscv/riscv-iommu-bits.h | 347 ++++++++++++++++++++++++++++++++++++
 1 file changed, 347 insertions(+)
 create mode 100644 hw/riscv/riscv-iommu-bits.h

diff --git a/hw/riscv/riscv-iommu-bits.h b/hw/riscv/riscv-iommu-bits.h
new file mode 100644
index 0000000000..f29b916acb
--- /dev/null
+++ b/hw/riscv/riscv-iommu-bits.h
@@ -0,0 +1,347 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2022-2023 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ * Copyright © 2023 RISC-V IOMMU Task Group
+ *
+ * RISC-V IOMMU - Register Layout and Data Structures.
+ *
+ * Based on the IOMMU spec version 1.0, 3/2023
+ * https://github.com/riscv-non-isa/riscv-iommu
+ */
+
+#ifndef HW_RISCV_IOMMU_BITS_H
+#define HW_RISCV_IOMMU_BITS_H
+
+#include "qemu/osdep.h"
+
+#define RISCV_IOMMU_SPEC_DOT_VER 0x010
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) (((~0ULL) >> (63 - (h) + (l))) << (l))
+#endif
+
+/*
+ * struct riscv_iommu_fq_record - Fault/Event Queue Record
+ * See section 3.2 for more info.
+ */
+struct riscv_iommu_fq_record {
+    uint64_t hdr;
+    uint64_t _reserved;
+    uint64_t iotval;
+    uint64_t iotval2;
+};
+/* Header fields */
+#define RISCV_IOMMU_FQ_HDR_CAUSE        GENMASK_ULL(11, 0)
+#define RISCV_IOMMU_FQ_HDR_PID          GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_FQ_HDR_PV           BIT_ULL(32)
+#define RISCV_IOMMU_FQ_HDR_TTYPE        GENMASK_ULL(39, 34)
+#define RISCV_IOMMU_FQ_HDR_DID          GENMASK_ULL(63, 40)
+
+/*
+ * struct riscv_iommu_pq_record - PCIe Page Request record
+ * For more information on the PCIe Page Request queue see chapter 3.3.
+ */
+struct riscv_iommu_pq_record {
+      uint64_t hdr;
+      uint64_t payload;
+};
+/* Header fields */
+#define RISCV_IOMMU_PREQ_HDR_PID        GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_PREQ_HDR_PV         BIT_ULL(32)
+#define RISCV_IOMMU_PREQ_HDR_PRIV       BIT_ULL(33)
+#define RISCV_IOMMU_PREQ_HDR_EXEC       BIT_ULL(34)
+#define RISCV_IOMMU_PREQ_HDR_DID        GENMASK_ULL(63, 40)
+/* Payload fields */
+#define RISCV_IOMMU_PREQ_PAYLOAD_M      GENMASK_ULL(2, 0)
+
+/* Common field positions */
+#define RISCV_IOMMU_PPN_FIELD           GENMASK_ULL(53, 10)
+#define RISCV_IOMMU_QUEUE_LOGSZ_FIELD   GENMASK_ULL(4, 0)
+#define RISCV_IOMMU_QUEUE_INDEX_FIELD   GENMASK_ULL(31, 0)
+#define RISCV_IOMMU_QUEUE_ENABLE        BIT(0)
+#define RISCV_IOMMU_QUEUE_INTR_ENABLE   BIT(1)
+#define RISCV_IOMMU_QUEUE_MEM_FAULT     BIT(8)
+#define RISCV_IOMMU_QUEUE_OVERFLOW      BIT(9)
+#define RISCV_IOMMU_QUEUE_ACTIVE        BIT(16)
+#define RISCV_IOMMU_QUEUE_BUSY          BIT(17)
+#define RISCV_IOMMU_ATP_PPN_FIELD       GENMASK_ULL(43, 0)
+#define RISCV_IOMMU_ATP_MODE_FIELD      GENMASK_ULL(63, 60)
+
+/* 5.3 IOMMU Capabilities (64bits) */
+#define RISCV_IOMMU_REG_CAP             0x0000
+#define RISCV_IOMMU_CAP_VERSION         GENMASK_ULL(7, 0)
+#define RISCV_IOMMU_CAP_MSI_FLAT        BIT_ULL(22)
+#define RISCV_IOMMU_CAP_MSI_MRIF        BIT_ULL(23)
+#define RISCV_IOMMU_CAP_T2GPA           BIT_ULL(26)
+#define RISCV_IOMMU_CAP_IGS             GENMASK_ULL(29, 28)
+#define RISCV_IOMMU_CAP_PAS             GENMASK_ULL(37, 32)
+#define RISCV_IOMMU_CAP_PD8             BIT_ULL(38)
+#define RISCV_IOMMU_CAP_PD17            BIT_ULL(39)
+#define RISCV_IOMMU_CAP_PD20            BIT_ULL(40)
+
+/* 5.4 Features control register (32bits) */
+#define RISCV_IOMMU_REG_FCTL            0x0008
+#define RISCV_IOMMU_FCTL_WSI            BIT(1)
+
+/* 5.5 Device-directory-table pointer (64bits) */
+#define RISCV_IOMMU_REG_DDTP            0x0010
+#define RISCV_IOMMU_DDTP_MODE           GENMASK_ULL(3, 0)
+#define RISCV_IOMMU_DDTP_BUSY           BIT_ULL(4)
+#define RISCV_IOMMU_DDTP_PPN            RISCV_IOMMU_PPN_FIELD
+
+enum riscv_iommu_ddtp_modes {
+    RISCV_IOMMU_DDTP_MODE_OFF = 0,
+    RISCV_IOMMU_DDTP_MODE_BARE = 1,
+    RISCV_IOMMU_DDTP_MODE_1LVL = 2,
+    RISCV_IOMMU_DDTP_MODE_2LVL = 3,
+    RISCV_IOMMU_DDTP_MODE_3LVL = 4,
+    RISCV_IOMMU_DDTP_MODE_MAX = 4
+};
+
+/* 5.6 Command Queue Base (64bits) */
+#define RISCV_IOMMU_REG_CQB             0x0018
+#define RISCV_IOMMU_CQB_LOG2SZ          RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_CQB_PPN             RISCV_IOMMU_PPN_FIELD
+
+/* 5.7 Command Queue head (32bits) */
+#define RISCV_IOMMU_REG_CQH             0x0020
+
+/* 5.8 Command Queue tail (32bits) */
+#define RISCV_IOMMU_REG_CQT             0x0024
+
+/* 5.9 Fault Queue Base (64bits) */
+#define RISCV_IOMMU_REG_FQB             0x0028
+#define RISCV_IOMMU_FQB_LOG2SZ          RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_FQB_PPN             RISCV_IOMMU_PPN_FIELD
+
+/* 5.10 Fault Queue Head (32bits) */
+#define RISCV_IOMMU_REG_FQH             0x0030
+
+/* 5.11 Fault Queue tail (32bits) */
+#define RISCV_IOMMU_REG_FQT             0x0034
+
+/* 5.12 Page Request Queue base (64bits) */
+#define RISCV_IOMMU_REG_PQB             0x0038
+#define RISCV_IOMMU_PQB_LOG2SZ          RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_PQB_PPN             RISCV_IOMMU_PPN_FIELD
+
+/* 5.13 Page Request Queue head (32bits) */
+#define RISCV_IOMMU_REG_PQH             0x0040
+
+/* 5.14 Page Request Queue tail (32bits) */
+#define RISCV_IOMMU_REG_PQT             0x0044
+
+/* 5.15 Command Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_CQCSR           0x0048
+#define RISCV_IOMMU_CQCSR_CQEN          RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_CQCSR_CIE           RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_CQCSR_CQMF          RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_CQCSR_CMD_TO        BIT(9)
+#define RISCV_IOMMU_CQCSR_CMD_ILL       BIT(10)
+#define RISCV_IOMMU_CQCSR_FENCE_W_IP    BIT(11)
+#define RISCV_IOMMU_CQCSR_CQON          RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_CQCSR_BUSY          RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.16 Fault Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_FQCSR           0x004C
+#define RISCV_IOMMU_FQCSR_FQEN          RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_FQCSR_FIE           RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_FQCSR_FQMF          RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_FQCSR_FQOF          RISCV_IOMMU_QUEUE_OVERFLOW
+#define RISCV_IOMMU_FQCSR_FQON          RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_FQCSR_BUSY          RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.17 Page Request Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_PQCSR           0x0050
+#define RISCV_IOMMU_PQCSR_PQEN          RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_PQCSR_PIE           RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_PQCSR_PQMF          RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_PQCSR_PQOF          RISCV_IOMMU_QUEUE_OVERFLOW
+#define RISCV_IOMMU_PQCSR_PQON          RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_PQCSR_BUSY          RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.18 Interrupt Pending Status (32bits) */
+#define RISCV_IOMMU_REG_IPSR            0x0054
+#define RISCV_IOMMU_IPSR_CIP            BIT(0)
+#define RISCV_IOMMU_IPSR_FIP            BIT(1)
+#define RISCV_IOMMU_IPSR_PIP            BIT(3)
+
+enum {
+    RISCV_IOMMU_INTR_CQ,
+    RISCV_IOMMU_INTR_FQ,
+    RISCV_IOMMU_INTR_PM,
+    RISCV_IOMMU_INTR_PQ,
+    RISCV_IOMMU_INTR_COUNT
+};
+
+/* 5.27 Interrupt cause to vector (64bits) */
+#define RISCV_IOMMU_REG_IVEC            0x02F8
+
+/* 5.28 MSI Configuration table (32 * 64bits) */
+#define RISCV_IOMMU_REG_MSI_CONFIG      0x0300
+
+#define RISCV_IOMMU_REG_SIZE           0x1000
+
+#define RISCV_IOMMU_DDTE_VALID          BIT_ULL(0)
+#define RISCV_IOMMU_DDTE_PPN            RISCV_IOMMU_PPN_FIELD
+
+/* Struct riscv_iommu_dc - Device Context - section 2.1 */
+struct riscv_iommu_dc {
+      uint64_t tc;
+      uint64_t iohgatp;
+      uint64_t ta;
+      uint64_t fsc;
+      uint64_t msiptp;
+      uint64_t msi_addr_mask;
+      uint64_t msi_addr_pattern;
+      uint64_t _reserved;
+};
+
+/* Translation control fields */
+#define RISCV_IOMMU_DC_TC_V             BIT_ULL(0)
+#define RISCV_IOMMU_DC_TC_EN_PRI        BIT_ULL(2)
+#define RISCV_IOMMU_DC_TC_T2GPA         BIT_ULL(3)
+#define RISCV_IOMMU_DC_TC_DTF           BIT_ULL(4)
+#define RISCV_IOMMU_DC_TC_PDTV          BIT_ULL(5)
+#define RISCV_IOMMU_DC_TC_PRPR          BIT_ULL(6)
+#define RISCV_IOMMU_DC_TC_DPE           BIT_ULL(9)
+#define RISCV_IOMMU_DC_TC_SBE           BIT_ULL(10)
+#define RISCV_IOMMU_DC_TC_SXL           BIT_ULL(11)
+
+/* Second-stage (aka G-stage) context fields */
+#define RISCV_IOMMU_DC_IOHGATP_PPN      RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_IOHGATP_GSCID    GENMASK_ULL(59, 44)
+#define RISCV_IOMMU_DC_IOHGATP_MODE     RISCV_IOMMU_ATP_MODE_FIELD
+
+enum riscv_iommu_dc_iohgatp_modes {
+    RISCV_IOMMU_DC_IOHGATP_MODE_BARE = 0,
+    RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4 = 8,
+    RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4 = 8,
+    RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4 = 9,
+    RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4 = 10
+};
+
+/* Translation attributes fields */
+#define RISCV_IOMMU_DC_TA_PSCID         GENMASK_ULL(31, 12)
+
+/* First-stage context fields */
+#define RISCV_IOMMU_DC_FSC_PPN          RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_FSC_MODE         RISCV_IOMMU_ATP_MODE_FIELD
+
+/* Generic I/O MMU command structure - check section 3.1 */
+struct riscv_iommu_command {
+    uint64_t dword0;
+    uint64_t dword1;
+};
+
+#define RISCV_IOMMU_CMD_OPCODE          GENMASK_ULL(6, 0)
+#define RISCV_IOMMU_CMD_FUNC            GENMASK_ULL(9, 7)
+
+#define RISCV_IOMMU_CMD_IOTINVAL_OPCODE         1
+#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA       0
+#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA      1
+#define RISCV_IOMMU_CMD_IOTINVAL_AV     BIT_ULL(10)
+#define RISCV_IOMMU_CMD_IOTINVAL_PSCID  GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_IOTINVAL_PSCV   BIT_ULL(32)
+#define RISCV_IOMMU_CMD_IOTINVAL_GV     BIT_ULL(33)
+#define RISCV_IOMMU_CMD_IOTINVAL_GSCID  GENMASK_ULL(59, 44)
+
+#define RISCV_IOMMU_CMD_IOFENCE_OPCODE          2
+#define RISCV_IOMMU_CMD_IOFENCE_FUNC_C          0
+#define RISCV_IOMMU_CMD_IOFENCE_AV      BIT_ULL(10)
+#define RISCV_IOMMU_CMD_IOFENCE_DATA    GENMASK_ULL(63, 32)
+
+#define RISCV_IOMMU_CMD_IODIR_OPCODE            3
+#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT    0
+#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT    1
+#define RISCV_IOMMU_CMD_IODIR_PID       GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_IODIR_DV        BIT_ULL(33)
+#define RISCV_IOMMU_CMD_IODIR_DID       GENMASK_ULL(63, 40)
+
+enum riscv_iommu_dc_fsc_atp_modes {
+    RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
+    RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
+    RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 = 8,
+    RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48 = 9,
+    RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57 = 10,
+    RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8 = 1,
+    RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17 = 2,
+    RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20 = 3
+};
+
+enum riscv_iommu_fq_causes {
+    RISCV_IOMMU_FQ_CAUSE_INST_FAULT           = 1,
+    RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED   = 4,
+    RISCV_IOMMU_FQ_CAUSE_RD_FAULT             = 5,
+    RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED   = 6,
+    RISCV_IOMMU_FQ_CAUSE_WR_FAULT             = 7,
+    RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S         = 12,
+    RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S           = 13,
+    RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S           = 15,
+    RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS        = 20,
+    RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS          = 21,
+    RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS          = 23,
+    RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED         = 256,
+    RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT       = 257,
+    RISCV_IOMMU_FQ_CAUSE_DDT_INVALID          = 258,
+    RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED    = 259,
+    RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED        = 260,
+    RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT       = 261,
+    RISCV_IOMMU_FQ_CAUSE_MSI_INVALID          = 262,
+    RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED    = 263,
+    RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT           = 264,
+    RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT       = 265,
+    RISCV_IOMMU_FQ_CAUSE_PDT_INVALID          = 266,
+    RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED    = 267,
+    RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED        = 268,
+    RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED        = 269,
+    RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED     = 270,
+    RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED      = 271,
+    RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR    = 272,
+    RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT         = 273,
+    RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED         = 274
+};
+
+/* MSI page table pointer */
+#define RISCV_IOMMU_DC_MSIPTP_PPN       RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_MSIPTP_MODE      RISCV_IOMMU_ATP_MODE_FIELD
+#define RISCV_IOMMU_DC_MSIPTP_MODE_OFF  0
+#define RISCV_IOMMU_DC_MSIPTP_MODE_FLAT 1
+
+/* Translation attributes fields */
+#define RISCV_IOMMU_PC_TA_V             BIT_ULL(0)
+
+/* First stage context fields */
+#define RISCV_IOMMU_PC_FSC_PPN          GENMASK_ULL(43, 0)
+
+enum riscv_iommu_fq_ttypes {
+    RISCV_IOMMU_FQ_TTYPE_NONE = 0,
+    RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH = 1,
+    RISCV_IOMMU_FQ_TTYPE_UADDR_RD = 2,
+    RISCV_IOMMU_FQ_TTYPE_UADDR_WR = 3,
+    RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
+    RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
+    RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
+    RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 8,
+};
+
+/* Fields on pte */
+#define RISCV_IOMMU_MSI_PTE_V           BIT_ULL(0)
+#define RISCV_IOMMU_MSI_PTE_M           GENMASK_ULL(2, 1)
+
+#define RISCV_IOMMU_MSI_PTE_M_MRIF      1
+#define RISCV_IOMMU_MSI_PTE_M_BASIC     3
+
+/* When M == 1 (MRIF mode) */
+#define RISCV_IOMMU_MSI_PTE_MRIF_ADDR   GENMASK_ULL(53, 7)
+/* When M == 3 (basic mode) */
+#define RISCV_IOMMU_MSI_PTE_PPN         RISCV_IOMMU_PPN_FIELD
+#define RISCV_IOMMU_MSI_PTE_C           BIT_ULL(63)
+
+/* Fields on mrif_info */
+#define RISCV_IOMMU_MSI_MRIF_NID        GENMASK_ULL(9, 0)
+#define RISCV_IOMMU_MSI_MRIF_NPPN       RISCV_IOMMU_PPN_FIELD
+#define RISCV_IOMMU_MSI_MRIF_NID_MSB    BIT_ULL(60)
+
+#endif /* HW_RISCV_IOMMU_BITS_H */
-- 
2.44.0
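
As a small illustration of how the mask definitions in this header are
meant to be consumed, the sketch below packs and unpacks a fault queue
record header. GENMASK_ULL and the three field masks are copied from the
patch (with shortened names); field_get() and field_set() are
hypothetical helpers written for this example and are not the accessors
the emulation itself uses.

```c
#include <assert.h>
#include <stdint.h>

/* Same definition as in riscv-iommu-bits.h. */
#define GENMASK_ULL(h, l) (((~0ULL) >> (63 - (h) + (l))) << (l))

/* Fault queue header fields, same bit ranges as RISCV_IOMMU_FQ_HDR_*. */
#define FQ_HDR_CAUSE GENMASK_ULL(11, 0)
#define FQ_HDR_TTYPE GENMASK_ULL(39, 34)
#define FQ_HDR_DID   GENMASK_ULL(63, 40)

/* Extract the field selected by 'mask', shifted down to bit 0. */
static uint64_t field_get(uint64_t reg, uint64_t mask)
{
    assert(mask != 0);
    uint64_t lsb = mask & (~mask + 1);   /* lowest set bit of the mask */
    return (reg & mask) / lsb;
}

/* Return 'reg' with the field selected by 'mask' replaced by 'val'. */
static uint64_t field_set(uint64_t reg, uint64_t mask, uint64_t val)
{
    assert(mask != 0);
    uint64_t lsb = mask & (~mask + 1);
    return (reg & ~mask) | ((val * lsb) & mask);
}
```

For example, starting from hdr = 0 and setting CAUSE to 258, TTYPE to 2
and DID to 0x1234 yields a header from which field_get() returns the same
three values. Dividing and multiplying by the lowest set bit of the mask
avoids having to keep a separate shift constant next to every mask.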




* [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 01/13] exec/memtxattr: add process identifier to the transaction attributes Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 02/13] hw/riscv: add riscv-iommu-bits.h Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-05-30  1:39   ` Eric Cheng
                     ` (2 more replies)
  2024-05-23 17:39 ` [PATCH v3 04/13] pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device Daniel Henrique Barboza
                   ` (11 subsequent siblings)
  14 siblings, 3 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Sebastien Boeuf,
	Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

The RISC-V IOMMU specification is now ratified as per the RISC-V
International process. The latest frozen specification can be found
at:

https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf

Add the foundation of the device emulation for RISC-V IOMMU, which
includes an IOMMU that has no capabilities but MSI interrupt support and
fault queue interfaces. We'll add more features incrementally in the
next patches.
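
For readers new to the fault queue interface mentioned above, the
producer side is a power-of-two ring indexed by head and tail registers.
The sketch below shows that pattern in isolation. DemoQueue and its
helpers are hypothetical names, the queue lives in a plain array instead
of guest memory, and the overflow flag only loosely mirrors the FQOF bit,
so treat it as a simplified model and not the code in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_LOG2SZ 2
#define DEMO_SIZE   (1u << DEMO_LOG2SZ)

typedef struct DemoQueue {
    uint32_t head;              /* consumer index, advanced by the driver */
    uint32_t tail;              /* producer index, advanced by the device */
    uint32_t mask;              /* size - 1, valid because size is 2^n */
    bool overflow;              /* sticky, loosely mirrors FQCSR.FQOF */
    uint64_t slots[DEMO_SIZE];  /* stand-in for records in guest memory */
} DemoQueue;

static void demo_queue_init(DemoQueue *q)
{
    q->head = 0;
    q->tail = 0;
    q->mask = DEMO_SIZE - 1;
    q->overflow = false;
}

/* Try to append one record; returns false if it had to be dropped. */
static bool demo_queue_push(DemoQueue *q, uint64_t rec)
{
    uint32_t next = (q->tail + 1) & q->mask;

    if (q->overflow) {
        /* Once overflowed, drop records until software clears the flag. */
        return false;
    }
    if (next == q->head) {
        /* Writing here would make a full queue look empty: flag it. */
        q->overflow = true;
        return false;
    }

    q->slots[q->tail] = rec;
    q->tail = next;
    return true;
}
```

One slot is always left unused so that head == tail can unambiguously
mean "empty", which is why a queue of four entries accepts only three
records before reporting overflow.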

Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
 hw/riscv/Kconfig         |    4 +
 hw/riscv/meson.build     |    1 +
 hw/riscv/riscv-iommu.c   | 1602 ++++++++++++++++++++++++++++++++++++++
 hw/riscv/riscv-iommu.h   |  141 ++++
 hw/riscv/trace-events    |   11 +
 hw/riscv/trace.h         |    1 +
 include/hw/riscv/iommu.h |   36 +
 meson.build              |    1 +
 8 files changed, 1797 insertions(+)
 create mode 100644 hw/riscv/riscv-iommu.c
 create mode 100644 hw/riscv/riscv-iommu.h
 create mode 100644 hw/riscv/trace-events
 create mode 100644 hw/riscv/trace.h
 create mode 100644 include/hw/riscv/iommu.h

diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
index a2030e3a6f..f69d6e3c8e 100644
--- a/hw/riscv/Kconfig
+++ b/hw/riscv/Kconfig
@@ -1,3 +1,6 @@
+config RISCV_IOMMU
+    bool
+
 config RISCV_NUMA
     bool
 
@@ -47,6 +50,7 @@ config RISCV_VIRT
     select SERIAL
     select RISCV_ACLINT
     select RISCV_APLIC
+    select RISCV_IOMMU
     select RISCV_IMSIC
     select SIFIVE_PLIC
     select SIFIVE_TEST
diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
index f872674093..cbc99c6e8e 100644
--- a/hw/riscv/meson.build
+++ b/hw/riscv/meson.build
@@ -10,5 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
 riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
 riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
 riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
+riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
 
 hw_arch += {'riscv': riscv_ss}
diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
new file mode 100644
index 0000000000..39b4ff1405
--- /dev/null
+++ b/hw/riscv/riscv-iommu.c
@@ -0,0 +1,1602 @@
+/*
+ * QEMU emulation of a RISC-V IOMMU
+ *
+ * Copyright (C) 2021-2023, Rivos Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qom/object.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci_device.h"
+#include "hw/qdev-properties.h"
+#include "hw/riscv/riscv_hart.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+#include "qemu/timer.h"
+
+#include "cpu_bits.h"
+#include "riscv-iommu.h"
+#include "riscv-iommu-bits.h"
+#include "trace.h"
+
+#define LIMIT_CACHE_CTX               (1U << 7)
+#define LIMIT_CACHE_IOT               (1U << 20)
+
+/* Physical page number conversions */
+#define PPN_PHYS(ppn)                 ((ppn) << TARGET_PAGE_BITS)
+#define PPN_DOWN(phy)                 ((phy) >> TARGET_PAGE_BITS)
+
+typedef struct RISCVIOMMUContext RISCVIOMMUContext;
+typedef struct RISCVIOMMUEntry RISCVIOMMUEntry;
+
+/* Device assigned I/O address space */
+struct RISCVIOMMUSpace {
+    IOMMUMemoryRegion iova_mr;  /* IOVA memory region for attached device */
+    AddressSpace iova_as;       /* IOVA address space for attached device */
+    RISCVIOMMUState *iommu;     /* Managing IOMMU device state */
+    uint32_t devid;             /* Requester identifier, AKA device_id */
+    bool notifier;              /* IOMMU unmap notifier enabled */
+    QLIST_ENTRY(RISCVIOMMUSpace) list;
+};
+
+/* Device translation context state. */
+struct RISCVIOMMUContext {
+    uint64_t devid:24;          /* Requester Id, AKA device_id */
+    uint64_t pasid:20;          /* Process Address Space ID */
+    uint64_t __rfu:20;          /* reserved */
+    uint64_t tc;                /* Translation Control */
+    uint64_t ta;                /* Translation Attributes */
+    uint64_t msi_addr_mask;     /* MSI filtering - address mask */
+    uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
+    uint64_t msiptp;            /* MSI redirection page table pointer */
+};
+
+/* IOMMU index for transactions without PASID specified. */
+#define RISCV_IOMMU_NOPASID 0
+
+static void riscv_iommu_notify(RISCVIOMMUState *s, int vec)
+{
+    const uint32_t fctl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FCTL);
+    uint32_t ipsr, ivec;
+
+    if (fctl & RISCV_IOMMU_FCTL_WSI || !s->notify) {
+        return;
+    }
+
+    ipsr = riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, (1 << vec), 0);
+    ivec = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_IVEC);
+
+    if (!(ipsr & (1 << vec))) {
+        s->notify(s, (ivec >> (vec * 4)) & 0x0F);
+    }
+}
+
+static void riscv_iommu_fault(RISCVIOMMUState *s,
+                              struct riscv_iommu_fq_record *ev)
+{
+    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
+    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQH) & s->fq_mask;
+    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQT) & s->fq_mask;
+    uint32_t next = (tail + 1) & s->fq_mask;
+    uint32_t devid = get_field(ev->hdr, RISCV_IOMMU_FQ_HDR_DID);
+
+    trace_riscv_iommu_flt(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
+                          PCI_FUNC(devid), ev->hdr, ev->iotval);
+
+    if (!(ctrl & RISCV_IOMMU_FQCSR_FQON) ||
+        !!(ctrl & (RISCV_IOMMU_FQCSR_FQOF | RISCV_IOMMU_FQCSR_FQMF))) {
+        return;
+    }
+
+    if (head == next) {
+        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
+                              RISCV_IOMMU_FQCSR_FQOF, 0);
+    } else {
+        dma_addr_t addr = s->fq_addr + tail * sizeof(*ev);
+        if (dma_memory_write(s->target_as, addr, ev, sizeof(*ev),
+                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
+            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
+                                  RISCV_IOMMU_FQCSR_FQMF, 0);
+        } else {
+            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_FQT, next);
+        }
+    }
+
+    if (ctrl & RISCV_IOMMU_FQCSR_FIE) {
+        riscv_iommu_notify(s, RISCV_IOMMU_INTR_FQ);
+    }
+}
+
+static void riscv_iommu_pri(RISCVIOMMUState *s,
+    struct riscv_iommu_pq_record *pr)
+{
+    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
+    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQH) & s->pq_mask;
+    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQT) & s->pq_mask;
+    uint32_t next = (tail + 1) & s->pq_mask;
+    uint32_t devid = get_field(pr->hdr, RISCV_IOMMU_PREQ_HDR_DID);
+
+    trace_riscv_iommu_pri(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
+                          PCI_FUNC(devid), pr->payload);
+
+    if (!(ctrl & RISCV_IOMMU_PQCSR_PQON) ||
+        !!(ctrl & (RISCV_IOMMU_PQCSR_PQOF | RISCV_IOMMU_PQCSR_PQMF))) {
+        return;
+    }
+
+    if (head == next) {
+        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
+                              RISCV_IOMMU_PQCSR_PQOF, 0);
+    } else {
+        dma_addr_t addr = s->pq_addr + tail * sizeof(*pr);
+        if (dma_memory_write(s->target_as, addr, pr, sizeof(*pr),
+                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
+            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
+                                  RISCV_IOMMU_PQCSR_PQMF, 0);
+        } else {
+            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_PQT, next);
+        }
+    }
+
+    if (ctrl & RISCV_IOMMU_PQCSR_PIE) {
+        riscv_iommu_notify(s, RISCV_IOMMU_INTR_PQ);
+    }
+}
+
+/* Portable implementation of pext_u64, bit-mask extraction. */
+static uint64_t _pext_u64(uint64_t val, uint64_t ext)
+{
+    uint64_t ret = 0;
+    uint64_t rot = 1;
+
+    while (ext) {
+        if (ext & 1) {
+            if (val & 1) {
+                ret |= rot;
+            }
+            rot <<= 1;
+        }
+        val >>= 1;
+        ext >>= 1;
+    }
+
+    return ret;
+}
+
+/* Check if GPA matches MSI/MRIF pattern. */
+static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
+    dma_addr_t gpa)
+{
+    if (get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE) !=
+        RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
+        return false; /* Invalid MSI/MRIF mode */
+    }
+
+    if ((PPN_DOWN(gpa) ^ ctx->msi_addr_pattern) & ~ctx->msi_addr_mask) {
+        return false; /* GPA not in MSI range defined by AIA IMSIC rules. */
+    }
+
+    return true;
+}
+
+/* RISC-V IOMMU Address Translation Lookup - Page Table Walk */
+static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
+    IOMMUTLBEntry *iotlb)
+{
+    /* Early check for MSI address match when IOVA == GPA */
+    if (iotlb->perm & IOMMU_WO &&
+        riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
+        iotlb->target_as = &s->trap_as;
+        iotlb->translated_addr = iotlb->iova;
+        iotlb->addr_mask = ~TARGET_PAGE_MASK;
+        return 0;
+    }
+
+    /* Exit early for pass-through mode. */
+    iotlb->translated_addr = iotlb->iova;
+    iotlb->addr_mask = ~TARGET_PAGE_MASK;
+    /* Allow R/W in pass-through mode */
+    iotlb->perm = IOMMU_RW;
+    return 0;
+}
+
+static void riscv_iommu_report_fault(RISCVIOMMUState *s,
+                                     RISCVIOMMUContext *ctx,
+                                     uint32_t fault_type, uint32_t cause,
+                                     bool pv,
+                                     uint64_t iotval, uint64_t iotval2)
+{
+    struct riscv_iommu_fq_record ev = { 0 };
+
+    if (ctx->tc & RISCV_IOMMU_DC_TC_DTF) {
+        switch (cause) {
+        case RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED:
+        case RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT:
+        case RISCV_IOMMU_FQ_CAUSE_DDT_INVALID:
+        case RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED:
+        case RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED:
+        case RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR:
+        case RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT:
+            break;
+        default:
+            /* DTF prevents reporting a fault for this given cause */
+            return;
+        }
+    }
+
+    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_CAUSE, cause);
+    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_TTYPE, fault_type);
+    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_DID, ctx->devid);
+    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PV, true);
+
+    if (pv) {
+        ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PID, ctx->pasid);
+    }
+
+    ev.iotval = iotval;
+    ev.iotval2 = iotval2;
+
+    riscv_iommu_fault(s, &ev);
+}
+
+/* Redirect MSI write for given GPA. */
+static MemTxResult riscv_iommu_msi_write(RISCVIOMMUState *s,
+    RISCVIOMMUContext *ctx, uint64_t gpa, uint64_t data,
+    unsigned size, MemTxAttrs attrs)
+{
+    MemTxResult res;
+    dma_addr_t addr;
+    uint64_t intn;
+    uint32_t n190;
+    uint64_t pte[2];
+    int fault_type = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
+    int cause;
+
+    if (!riscv_iommu_msi_check(s, ctx, gpa)) {
+        res = MEMTX_ACCESS_ERROR;
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
+        goto err;
+    }
+
+    /* Interrupt File Number */
+    intn = _pext_u64(PPN_DOWN(gpa), ctx->msi_addr_mask);
+    if (intn >= 256) {
+        /* Interrupt file number out of range */
+        res = MEMTX_ACCESS_ERROR;
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
+        goto err;
+    }
+
+    /* fetch MSI PTE */
+    addr = PPN_PHYS(get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_PPN));
+    addr = addr | (intn * sizeof(pte));
+    res = dma_memory_read(s->target_as, addr, &pte, sizeof(pte),
+            MEMTXATTRS_UNSPECIFIED);
+    if (res != MEMTX_OK) {
+        if (res == MEMTX_DECODE_ERROR) {
+            cause = RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED;
+        } else {
+            cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
+        }
+        goto err;
+    }
+
+    le64_to_cpus(&pte[0]);
+    le64_to_cpus(&pte[1]);
+
+    if (!(pte[0] & RISCV_IOMMU_MSI_PTE_V) || (pte[0] & RISCV_IOMMU_MSI_PTE_C)) {
+        /*
+         * The spec mentions that: "If msipte.C == 1, then further
+         * processing to interpret the PTE is implementation
+         * defined.". We'll abort with cause = 262 for this
+         * case too.
+         */
+        res = MEMTX_ACCESS_ERROR;
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_INVALID;
+        goto err;
+    }
+
+    switch (get_field(pte[0], RISCV_IOMMU_MSI_PTE_M)) {
+    case RISCV_IOMMU_MSI_PTE_M_BASIC:
+        /* MSI Pass-through mode */
+        addr = PPN_PHYS(get_field(pte[0], RISCV_IOMMU_MSI_PTE_PPN));
+        addr = addr | (gpa & ~TARGET_PAGE_MASK);
+
+        trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
+                              PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
+                              gpa, addr);
+
+        res = dma_memory_write(s->target_as, addr, &data, size, attrs);
+        if (res != MEMTX_OK) {
+            cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
+            goto err;
+        }
+
+        return MEMTX_OK;
+    case RISCV_IOMMU_MSI_PTE_M_MRIF:
+        /* MRIF mode, continue. */
+        break;
+    default:
+        res = MEMTX_ACCESS_ERROR;
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
+        goto err;
+    }
+
+    /*
+     * Report an error if the interrupt identity exceeds the maximum allowed
+     * for an IMSIC interrupt file (2047) or if the destination address is not
+     * 32-bit aligned. See IOMMU Specification, Chapter 2.3. MSI page tables.
+     */
+    if ((data > 2047) || (gpa & 3)) {
+        res = MEMTX_ACCESS_ERROR;
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
+        goto err;
+    }
+
+    /* MSI MRIF mode, non atomic pending bit update */
+
+    /* MRIF pending bit address */
+    addr = get_field(pte[0], RISCV_IOMMU_MSI_PTE_MRIF_ADDR) << 9;
+    addr = addr | ((data & 0x7c0) >> 3);
+
+    trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
+                          PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
+                          gpa, addr);
+
+    /* MRIF pending bit mask */
+    data = 1ULL << (data & 0x03f);
+    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
+    if (res != MEMTX_OK) {
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
+        goto err;
+    }
+
+    intn = intn | data;
+    res = dma_memory_write(s->target_as, addr, &intn, sizeof(intn), attrs);
+    if (res != MEMTX_OK) {
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
+        goto err;
+    }
+
+    /* Get MRIF enable bits */
+    addr = addr + sizeof(intn);
+    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
+    if (res != MEMTX_OK) {
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
+        goto err;
+    }
+
+    if (!(intn & data)) {
+        /* notification disabled, MRIF update completed. */
+        return MEMTX_OK;
+    }
+
+    /* Send notification message */
+    addr = PPN_PHYS(get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NPPN));
+    n190 = get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID) |
+          (get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID_MSB) << 10);
+
+    res = dma_memory_write(s->target_as, addr, &n190, sizeof(n190), attrs);
+    if (res != MEMTX_OK) {
+        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
+        goto err;
+    }
+
+    return MEMTX_OK;
+
+err:
+    riscv_iommu_report_fault(s, ctx, fault_type, cause,
+                             !!ctx->pasid, 0, 0);
+    return res;
+}
+
+/*
+ * Check device context configuration as described by the
+ * riscv-iommu spec section "Device-context configuration
+ * checks".
+ */
+static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
+                                            RISCVIOMMUContext *ctx)
+{
+    uint32_t msi_mode;
+
+    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
+        ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
+        return false;
+    }
+
+    if (!(s->cap & RISCV_IOMMU_CAP_T2GPA) &&
+        ctx->tc & RISCV_IOMMU_DC_TC_T2GPA) {
+        return false;
+    }
+
+    if (s->cap & RISCV_IOMMU_CAP_MSI_FLAT) {
+        msi_mode = get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE);
+
+        if (msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_OFF &&
+            msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
+            return false;
+        }
+    }
+
+    /*
+     * CAP_END is always zero (only one endianness). FCTL_BE is
+     * always zero (little-endian accesses). Thus TC_SBE must
+     * always be LE, i.e. zero.
+     */
+    if (ctx->tc & RISCV_IOMMU_DC_TC_SBE) {
+        return false;
+    }
+
+    return true;
+}
+
+/*
+ * RISC-V IOMMU Device Context Lookup - Device Directory Tree Walk
+ *
+ * @s         : IOMMU Device State
+ * @ctx       : Device Translation Context with devid and pasid set.
+ * @return    : success or fault code.
+ */
+static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
+{
+    const uint64_t ddtp = s->ddtp;
+    unsigned mode = get_field(ddtp, RISCV_IOMMU_DDTP_MODE);
+    dma_addr_t addr = PPN_PHYS(get_field(ddtp, RISCV_IOMMU_DDTP_PPN));
+    struct riscv_iommu_dc dc;
+    /* Device Context format: 0: extended (64 bytes) | 1: base (32 bytes) */
+    const int dc_fmt = !s->enable_msi;
+    const size_t dc_len = sizeof(dc) >> dc_fmt;
+    unsigned depth;
+    uint64_t de;
+
+    switch (mode) {
+    case RISCV_IOMMU_DDTP_MODE_OFF:
+        return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
+
+    case RISCV_IOMMU_DDTP_MODE_BARE:
+        /* mock up pass-through translation context */
+        ctx->tc = RISCV_IOMMU_DC_TC_V;
+        ctx->ta = 0;
+        ctx->msiptp = 0;
+        return 0;
+
+    case RISCV_IOMMU_DDTP_MODE_1LVL:
+        depth = 0;
+        break;
+
+    case RISCV_IOMMU_DDTP_MODE_2LVL:
+        depth = 1;
+        break;
+
+    case RISCV_IOMMU_DDTP_MODE_3LVL:
+        depth = 2;
+        break;
+
+    default:
+        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+    }
+
+    /*
+     * Check supported device id width (in bits).
+     * See IOMMU Specification, Chapter 6. Software guidelines.
+     * - if extended device-context format is used:
+     *   1LVL: 6, 2LVL: 15, 3LVL: 24
+     * - if base device-context format is used:
+     *   1LVL: 7, 2LVL: 16, 3LVL: 24
+     */
+    if (ctx->devid >= (1 << (depth * 9 + 6 + (dc_fmt && depth != 2)))) {
+        return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
+    }
+
+    /* Device directory tree walk */
+    for (; depth-- > 0; ) {
+        /*
+         * Select device id index bits based on device directory tree level
+         * and device context format.
+         * See IOMMU Specification, Chapter 2. Data Structures.
+         * - if extended device-context format is used:
+         *   device index: [23:15][14:6][5:0]
+         * - if base device-context format is used:
+         *   device index: [23:16][15:7][6:0]
+         */
+        const int split = depth * 9 + 6 + dc_fmt;
+        addr |= ((ctx->devid >> split) << 3) & ~TARGET_PAGE_MASK;
+        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
+                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
+            return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
+        }
+        le64_to_cpus(&de);
+        if (!(de & RISCV_IOMMU_DDTE_VALID)) {
+            /* invalid directory entry */
+            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
+        }
+        if (de & ~(RISCV_IOMMU_DDTE_PPN | RISCV_IOMMU_DDTE_VALID)) {
+            /* reserved bits set */
+            return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+        }
+        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_DDTE_PPN));
+    }
+
+    /* index into device context entry page */
+    addr |= (ctx->devid * dc_len) & ~TARGET_PAGE_MASK;
+
+    memset(&dc, 0, sizeof(dc));
+    if (dma_memory_read(s->target_as, addr, &dc, dc_len,
+                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
+        return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
+    }
+
+    /* Set translation context. */
+    ctx->tc = le64_to_cpu(dc.tc);
+    ctx->ta = le64_to_cpu(dc.ta);
+    ctx->msiptp = le64_to_cpu(dc.msiptp);
+    ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
+    ctx->msi_addr_pattern = le64_to_cpu(dc.msi_addr_pattern);
+
+    if (!riscv_iommu_validate_device_ctx(s, ctx)) {
+        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+    }
+
+    if (!(ctx->tc & RISCV_IOMMU_DC_TC_V)) {
+        return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
+    }
+
+    if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
+        if (ctx->pasid != RISCV_IOMMU_NOPASID) {
+            /* PASID is disabled */
+            return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
+        }
+        return 0;
+    }
+
+    /* FSC.TC.PDTV enabled */
+    if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
+        /* Invalid PDTP.MODE */
+        return RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED;
+    }
+
+    for (depth = mode - RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8; depth-- > 0; ) {
+        /*
+         * Select process id index bits based on process directory tree
+         * level. See IOMMU Specification, 2.2. Process-Directory-Table.
+         */
+        const int split = depth * 9 + 8;
+        addr |= ((ctx->pasid >> split) << 3) & ~TARGET_PAGE_MASK;
+        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
+                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
+            return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
+        }
+        le64_to_cpus(&de);
+        if (!(de & RISCV_IOMMU_PC_TA_V)) {
+            return RISCV_IOMMU_FQ_CAUSE_PDT_INVALID;
+        }
+        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_PC_FSC_PPN));
+    }
+
+    /* Leaf entry in PDT */
+    addr |= (ctx->pasid << 4) & ~TARGET_PAGE_MASK;
+    if (dma_memory_read(s->target_as, addr, &dc.ta, sizeof(uint64_t) * 2,
+                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
+        return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
+    }
+
+    /* Use FSC and TA from process directory entry. */
+    ctx->ta = le64_to_cpu(dc.ta);
+
+    return 0;
+}
+
+/* Translation Context cache support */
+static gboolean __ctx_equal(gconstpointer v1, gconstpointer v2)
+{
+    RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
+    RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
+    return c1->devid == c2->devid && c1->pasid == c2->pasid;
+}
+
+static guint __ctx_hash(gconstpointer v)
+{
+    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
+    /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
+    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
+}
+
+static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
+    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
+    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
+        ctx->devid == arg->devid &&
+        ctx->pasid == arg->pasid) {
+        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
+    }
+}
+
+static void __ctx_inval_devid(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
+    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
+    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
+        ctx->devid == arg->devid) {
+        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
+    }
+}
+
+static void __ctx_inval_all(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
+    if (ctx->tc & RISCV_IOMMU_DC_TC_V) {
+        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
+    }
+}
+
+static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
+    uint32_t devid, uint32_t pasid)
+{
+    GHashTable *ctx_cache;
+    RISCVIOMMUContext key = {
+        .devid = devid,
+        .pasid = pasid,
+    };
+    ctx_cache = g_hash_table_ref(s->ctx_cache);
+    g_hash_table_foreach(ctx_cache, func, &key);
+    g_hash_table_unref(ctx_cache);
+}
+
+/* Find or allocate translation context for a given {device_id, process_id} */
+static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
+    unsigned devid, unsigned pasid, void **ref)
+{
+    GHashTable *ctx_cache;
+    RISCVIOMMUContext *ctx;
+    RISCVIOMMUContext key = {
+        .devid = devid,
+        .pasid = pasid,
+    };
+
+    ctx_cache = g_hash_table_ref(s->ctx_cache);
+    ctx = g_hash_table_lookup(ctx_cache, &key);
+
+    if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
+        *ref = ctx_cache;
+        return ctx;
+    }
+
+    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
+        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
+                                          g_free, NULL);
+        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
+    }
+
+    ctx = g_new0(RISCVIOMMUContext, 1);
+    ctx->devid = devid;
+    ctx->pasid = pasid;
+
+    int fault = riscv_iommu_ctx_fetch(s, ctx);
+    if (!fault) {
+        g_hash_table_add(ctx_cache, ctx);
+        *ref = ctx_cache;
+        return ctx;
+    }
+
+    g_hash_table_unref(ctx_cache);
+    *ref = NULL;
+
+    riscv_iommu_report_fault(s, ctx, RISCV_IOMMU_FQ_TTYPE_UADDR_RD,
+                             fault, !!pasid, 0, 0);
+
+    g_free(ctx);
+    return NULL;
+}
+
+static void riscv_iommu_ctx_put(RISCVIOMMUState *s, void *ref)
+{
+    if (ref) {
+        g_hash_table_unref((GHashTable *)ref);
+    }
+}
+
+/* Find or allocate address space for a given device */
+static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
+{
+    RISCVIOMMUSpace *as;
+
+    /* FIXME: PCIe bus remapping for attached endpoints. */
+    devid |= s->bus << 8;
+
+    qemu_mutex_lock(&s->core_lock);
+    QLIST_FOREACH(as, &s->spaces, list) {
+        if (as->devid == devid) {
+            break;
+        }
+    }
+    qemu_mutex_unlock(&s->core_lock);
+
+    if (as == NULL) {
+        char name[64];
+        as = g_new0(RISCVIOMMUSpace, 1);
+
+        as->iommu = s;
+        as->devid = devid;
+
+        snprintf(name, sizeof(name), "riscv-iommu-%04x:%02x.%d-iova",
+            PCI_BUS_NUM(as->devid), PCI_SLOT(as->devid), PCI_FUNC(as->devid));
+
+        /* IOVA address space, untranslated addresses */
+        memory_region_init_iommu(&as->iova_mr, sizeof(as->iova_mr),
+            TYPE_RISCV_IOMMU_MEMORY_REGION,
+            OBJECT(as), "riscv_iommu", UINT64_MAX);
+        address_space_init(&as->iova_as, MEMORY_REGION(&as->iova_mr), name);
+
+        qemu_mutex_lock(&s->core_lock);
+        QLIST_INSERT_HEAD(&s->spaces, as, list);
+        qemu_mutex_unlock(&s->core_lock);
+
+        trace_riscv_iommu_new(s->parent_obj.id, PCI_BUS_NUM(as->devid),
+                PCI_SLOT(as->devid), PCI_FUNC(as->devid));
+    }
+    return &as->iova_as;
+}
+
+static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
+    IOMMUTLBEntry *iotlb)
+{
+    bool enable_pasid;
+    bool enable_pri;
+    int fault;
+
+    /*
+     * TC[32] is reserved for custom extensions, used here to temporarily
+     * enable automatic page-request generation for ATS queries.
+     */
+    enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
+    enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
+
+    /* Translate using device directory / page table information. */
+    fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
+
+    if (enable_pri && fault) {
+        struct riscv_iommu_pq_record pr = {0};
+        if (enable_pasid) {
+            pr.hdr = set_field(RISCV_IOMMU_PREQ_HDR_PV,
+                RISCV_IOMMU_PREQ_HDR_PID, ctx->pasid);
+        }
+        pr.hdr = set_field(pr.hdr, RISCV_IOMMU_PREQ_HDR_DID, ctx->devid);
+        pr.payload = (iotlb->iova & TARGET_PAGE_MASK) |
+                     RISCV_IOMMU_PREQ_PAYLOAD_M;
+        riscv_iommu_pri(s, &pr);
+        return fault;
+    }
+
+    if (fault) {
+        unsigned ttype;
+
+        if (iotlb->perm & IOMMU_RW) {
+            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
+        } else {
+            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
+        }
+
+        riscv_iommu_report_fault(s, ctx, ttype, fault, enable_pasid,
+                                 iotlb->iova, iotlb->translated_addr);
+        return fault;
+    }
+
+    return 0;
+}
+
+/* IOMMU Command Interface */
+static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
+    uint64_t addr, uint32_t data)
+{
+    /*
+     * ATS processing in this implementation of the IOMMU is synchronous,
+     * no need to wait for completions here.
+     */
+    if (!notify) {
+        return MEMTX_OK;
+    }
+
+    return dma_memory_write(s->target_as, addr, &data, sizeof(data),
+        MEMTXATTRS_UNSPECIFIED);
+}
+
+static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
+{
+    uint64_t old_ddtp = s->ddtp;
+    uint64_t new_ddtp = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_DDTP);
+    unsigned new_mode = get_field(new_ddtp, RISCV_IOMMU_DDTP_MODE);
+    unsigned old_mode = get_field(old_ddtp, RISCV_IOMMU_DDTP_MODE);
+    bool ok = false;
+
+    /*
+     * Check for allowed DDTP.MODE transitions:
+     * {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
+     * {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
+     */
+    if (new_mode == old_mode ||
+        new_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
+        new_mode == RISCV_IOMMU_DDTP_MODE_BARE) {
+        ok = true;
+    } else if (new_mode == RISCV_IOMMU_DDTP_MODE_1LVL ||
+               new_mode == RISCV_IOMMU_DDTP_MODE_2LVL ||
+               new_mode == RISCV_IOMMU_DDTP_MODE_3LVL) {
+        ok = old_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
+             old_mode == RISCV_IOMMU_DDTP_MODE_BARE;
+    }
+
+    if (ok) {
+        /* clear reserved and busy bits, report back sanitized version */
+        new_ddtp = set_field(new_ddtp & RISCV_IOMMU_DDTP_PPN,
+                             RISCV_IOMMU_DDTP_MODE, new_mode);
+    } else {
+        new_ddtp = old_ddtp;
+    }
+    s->ddtp = new_ddtp;
+
+    riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_DDTP, new_ddtp);
+}
+
+/* Command function and opcode field. */
+#define RISCV_IOMMU_CMD(func, op) (((func) << 7) | (op))
+
+static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
+{
+    struct riscv_iommu_command cmd;
+    MemTxResult res;
+    dma_addr_t addr;
+    uint32_t tail, head, ctrl;
+    uint64_t cmd_opcode;
+    GHFunc func;
+
+    ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
+    tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQT) & s->cq_mask;
+    head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQH) & s->cq_mask;
+
+    /* Check for pending error or queue processing disabled */
+    if (!(ctrl & RISCV_IOMMU_CQCSR_CQON) ||
+        !!(ctrl & (RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CQMF))) {
+        return;
+    }
+
+    while (tail != head) {
+        addr = s->cq_addr + head * sizeof(cmd);
+        res = dma_memory_read(s->target_as, addr, &cmd, sizeof(cmd),
+                              MEMTXATTRS_UNSPECIFIED);
+
+        if (res != MEMTX_OK) {
+            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
+                                  RISCV_IOMMU_CQCSR_CQMF, 0);
+            goto fault;
+        }
+
+        trace_riscv_iommu_cmd(s->parent_obj.id, cmd.dword0, cmd.dword1);
+
+        cmd_opcode = get_field(cmd.dword0,
+                               RISCV_IOMMU_CMD_OPCODE | RISCV_IOMMU_CMD_FUNC);
+
+        switch (cmd_opcode) {
+        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOFENCE_FUNC_C,
+                             RISCV_IOMMU_CMD_IOFENCE_OPCODE):
+            res = riscv_iommu_iofence(s,
+                cmd.dword0 & RISCV_IOMMU_CMD_IOFENCE_AV, cmd.dword1,
+                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOFENCE_DATA));
+
+            if (res != MEMTX_OK) {
+                riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
+                                      RISCV_IOMMU_CQCSR_CQMF, 0);
+                goto fault;
+            }
+            break;
+
+        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA,
+                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
+            if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
+                /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
+                goto cmd_ill;
+            }
+            /* translation cache not implemented yet */
+            break;
+
+        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
+                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
+            /* translation cache not implemented yet */
+            break;
+
+        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
+                             RISCV_IOMMU_CMD_IODIR_OPCODE):
+            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
+                /* invalidate all device context cache mappings */
+                func = __ctx_inval_all;
+            } else {
+                /* invalidate all device contexts matching DID */
+                func = __ctx_inval_devid;
+            }
+            riscv_iommu_ctx_inval(s, func,
+                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID), 0);
+            break;
+
+        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT,
+                             RISCV_IOMMU_CMD_IODIR_OPCODE):
+            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
+                /* illegal command arguments IODIR_PDT & DV == 0 */
+                goto cmd_ill;
+            } else {
+                func = __ctx_inval_devid_pasid;
+            }
+            riscv_iommu_ctx_inval(s, func,
+                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID),
+                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
+            break;
+
+        default:
+        cmd_ill:
+            /* Invalid instruction, do not advance instruction index. */
+            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
+                RISCV_IOMMU_CQCSR_CMD_ILL, 0);
+            goto fault;
+        }
+
+        /* Advance and update head pointer after command completes. */
+        head = (head + 1) & s->cq_mask;
+        riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_CQH, head);
+    }
+    return;
+
+fault:
+    if (ctrl & RISCV_IOMMU_CQCSR_CIE) {
+        riscv_iommu_notify(s, RISCV_IOMMU_INTR_CQ);
+    }
+}
+
+static void riscv_iommu_process_cq_control(RISCVIOMMUState *s)
+{
+    uint64_t base;
+    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
+    uint32_t ctrl_clr;
+    bool enable = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQEN);
+    bool active = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQON);
+
+    if (enable && !active) {
+        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_CQB);
+        s->cq_mask = (2ULL << get_field(base, RISCV_IOMMU_CQB_LOG2SZ)) - 1;
+        s->cq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_CQB_PPN));
+        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~s->cq_mask);
+        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQH], 0);
+        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQT], 0);
+        ctrl_set = RISCV_IOMMU_CQCSR_CQON;
+        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQMF |
+                   RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CMD_TO |
+                   RISCV_IOMMU_CQCSR_FENCE_W_IP;
+    } else if (!enable && active) {
+        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~0);
+        ctrl_set = 0;
+        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQON;
+    } else {
+        ctrl_set = 0;
+        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY;
+    }
+
+    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR, ctrl_set, ctrl_clr);
+}
+
+static void riscv_iommu_process_fq_control(RISCVIOMMUState *s)
+{
+    uint64_t base;
+    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
+    uint32_t ctrl_clr;
+    bool enable = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQEN);
+    bool active = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQON);
+
+    if (enable && !active) {
+        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_FQB);
+        s->fq_mask = (2ULL << get_field(base, RISCV_IOMMU_FQB_LOG2SZ)) - 1;
+        s->fq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_FQB_PPN));
+        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~s->fq_mask);
+        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQH], 0);
+        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQT], 0);
+        ctrl_set = RISCV_IOMMU_FQCSR_FQON;
+        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQMF |
+            RISCV_IOMMU_FQCSR_FQOF;
+    } else if (!enable && active) {
+        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~0);
+        ctrl_set = 0;
+        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQON;
+    } else {
+        ctrl_set = 0;
+        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY;
+    }
+
+    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR, ctrl_set, ctrl_clr);
+}
+
+static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
+{
+    uint64_t base;
+    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
+    uint32_t ctrl_clr;
+    bool enable = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQEN);
+    bool active = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQON);
+
+    if (enable && !active) {
+        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_PQB);
+        s->pq_mask = (2ULL << get_field(base, RISCV_IOMMU_PQB_LOG2SZ)) - 1;
+        s->pq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_PQB_PPN));
+        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~s->pq_mask);
+        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQH], 0);
+        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQT], 0);
+        ctrl_set = RISCV_IOMMU_PQCSR_PQON;
+        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQMF |
+            RISCV_IOMMU_PQCSR_PQOF;
+    } else if (!enable && active) {
+        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~0);
+        ctrl_set = 0;
+        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQON;
+    } else {
+        ctrl_set = 0;
+        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY;
+    }
+
+    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
+}
+
+typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
+
+static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
+{
+    uint32_t cqcsr, fqcsr, pqcsr;
+    uint32_t ipsr_set = 0;
+    uint32_t ipsr_clr = 0;
+
+    if (data & RISCV_IOMMU_IPSR_CIP) {
+        cqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
+
+        if (cqcsr & RISCV_IOMMU_CQCSR_CIE &&
+            (cqcsr & RISCV_IOMMU_CQCSR_FENCE_W_IP ||
+             cqcsr & RISCV_IOMMU_CQCSR_CMD_ILL ||
+             cqcsr & RISCV_IOMMU_CQCSR_CMD_TO ||
+             cqcsr & RISCV_IOMMU_CQCSR_CQMF)) {
+            ipsr_set |= RISCV_IOMMU_IPSR_CIP;
+        } else {
+            ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
+        }
+    } else {
+        ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
+    }
+
+    if (data & RISCV_IOMMU_IPSR_FIP) {
+        fqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
+
+        if (fqcsr & RISCV_IOMMU_FQCSR_FIE &&
+            (fqcsr & RISCV_IOMMU_FQCSR_FQOF ||
+             fqcsr & RISCV_IOMMU_FQCSR_FQMF)) {
+            ipsr_set |= RISCV_IOMMU_IPSR_FIP;
+        } else {
+            ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
+        }
+    } else {
+        ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
+    }
+
+    if (data & RISCV_IOMMU_IPSR_PIP) {
+        pqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
+
+        if (pqcsr & RISCV_IOMMU_PQCSR_PIE &&
+            (pqcsr & RISCV_IOMMU_PQCSR_PQOF ||
+             pqcsr & RISCV_IOMMU_PQCSR_PQMF)) {
+            ipsr_set |= RISCV_IOMMU_IPSR_PIP;
+        } else {
+            ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
+        }
+    } else {
+        ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
+    }
+
+    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, ipsr_set, ipsr_clr);
+}
+
+static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
+    uint64_t data, unsigned size, MemTxAttrs attrs)
+{
+    riscv_iommu_process_fn *process_fn = NULL;
+    RISCVIOMMUState *s = opaque;
+    uint32_t regb = addr & ~3;
+    uint32_t busy = 0;
+    uint64_t val = 0;
+
+    if ((addr & (size - 1)) != 0) {
+        /* Unsupported MMIO alignment or access size */
+        return MEMTX_ERROR;
+    }
+
+    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
+        /* Unsupported MMIO access location. */
+        return MEMTX_ACCESS_ERROR;
+    }
+
+    /* Track actionable MMIO write. */
+    switch (regb) {
+    case RISCV_IOMMU_REG_DDTP:
+    case RISCV_IOMMU_REG_DDTP + 4:
+        process_fn = riscv_iommu_process_ddtp;
+        regb = RISCV_IOMMU_REG_DDTP;
+        busy = RISCV_IOMMU_DDTP_BUSY;
+        break;
+
+    case RISCV_IOMMU_REG_CQT:
+        process_fn = riscv_iommu_process_cq_tail;
+        break;
+
+    case RISCV_IOMMU_REG_CQCSR:
+        process_fn = riscv_iommu_process_cq_control;
+        busy = RISCV_IOMMU_CQCSR_BUSY;
+        break;
+
+    case RISCV_IOMMU_REG_FQCSR:
+        process_fn = riscv_iommu_process_fq_control;
+        busy = RISCV_IOMMU_FQCSR_BUSY;
+        break;
+
+    case RISCV_IOMMU_REG_PQCSR:
+        process_fn = riscv_iommu_process_pq_control;
+        busy = RISCV_IOMMU_PQCSR_BUSY;
+        break;
+
+    case RISCV_IOMMU_REG_IPSR:
+        /*
+         * IPSR has a special update procedure. Execute it
+         * and exit.
+         */
+        if (size == 4) {
+            uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
+            uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
+            uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
+            stl_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
+        } else if (size == 8) {
+            uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
+            uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
+            uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
+            stq_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
+        }
+
+        riscv_iommu_update_ipsr(s, val);
+
+        return MEMTX_OK;
+
+    default:
+        break;
+    }
+
+    /*
+     * Register updates might not be synchronized with core logic.
+     * If system software writes to a register while its BUSY bit
+     * is set, the IOMMU behavior for those additional writes
+     * is UNSPECIFIED.
+     */
+    qemu_spin_lock(&s->regs_lock);
+    if (size == 1) {
+        uint8_t ro = s->regs_ro[addr];
+        uint8_t wc = s->regs_wc[addr];
+        uint8_t rw = s->regs_rw[addr];
+        s->regs_rw[addr] = ((rw & ro) | (data & ~ro)) & ~(data & wc);
+    } else if (size == 2) {
+        uint16_t ro = lduw_le_p(&s->regs_ro[addr]);
+        uint16_t wc = lduw_le_p(&s->regs_wc[addr]);
+        uint16_t rw = lduw_le_p(&s->regs_rw[addr]);
+        stw_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
+    } else if (size == 4) {
+        uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
+        uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
+        uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
+        stl_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
+    } else if (size == 8) {
+        uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
+        uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
+        uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
+        stq_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
+    }
+
+    /* Busy flag update, MSB 4-byte register. */
+    if (busy) {
+        uint32_t rw = ldl_le_p(&s->regs_rw[regb]);
+        stl_le_p(&s->regs_rw[regb], rw | busy);
+    }
+    qemu_spin_unlock(&s->regs_lock);
+
+    if (process_fn) {
+        qemu_mutex_lock(&s->core_lock);
+        process_fn(s);
+        qemu_mutex_unlock(&s->core_lock);
+    }
+
+    return MEMTX_OK;
+}
+
+static MemTxResult riscv_iommu_mmio_read(void *opaque, hwaddr addr,
+    uint64_t *data, unsigned size, MemTxAttrs attrs)
+{
+    RISCVIOMMUState *s = opaque;
+    uint64_t val = -1;
+    uint8_t *ptr;
+
+    if ((addr & (size - 1)) != 0) {
+        /* Unsupported MMIO alignment. */
+        return MEMTX_ERROR;
+    }
+
+    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
+        return MEMTX_ACCESS_ERROR;
+    }
+
+    ptr = &s->regs_rw[addr];
+
+    if (size == 1) {
+        val = (uint64_t)*ptr;
+    } else if (size == 2) {
+        val = lduw_le_p(ptr);
+    } else if (size == 4) {
+        val = ldl_le_p(ptr);
+    } else if (size == 8) {
+        val = ldq_le_p(ptr);
+    } else {
+        return MEMTX_ERROR;
+    }
+
+    *data = val;
+
+    return MEMTX_OK;
+}
+
+static const MemoryRegionOps riscv_iommu_mmio_ops = {
+    .read_with_attrs = riscv_iommu_mmio_read,
+    .write_with_attrs = riscv_iommu_mmio_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+        .unaligned = false,
+    },
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    }
+};
+
+/*
+ * Translations matching MSI pattern check are redirected to "riscv-iommu-trap"
+ * memory region as untranslated address, for additional MSI/MRIF interception
+ * by IOMMU interrupt remapping implementation.
+ * Note: Device emulation code generating an MSI is expected to provide valid
+ * memory transaction attributes with requester_id set.
+ */
+static MemTxResult riscv_iommu_trap_write(void *opaque, hwaddr addr,
+    uint64_t data, unsigned size, MemTxAttrs attrs)
+{
+    RISCVIOMMUState *s = (RISCVIOMMUState *)opaque;
+    RISCVIOMMUContext *ctx;
+    MemTxResult res;
+    void *ref;
+    uint32_t devid = attrs.requester_id;
+
+    if (attrs.unspecified) {
+        return MEMTX_ACCESS_ERROR;
+    }
+
+    /* FIXME: PCIe bus remapping for attached endpoints. */
+    devid |= s->bus << 8;
+
+    ctx = riscv_iommu_ctx(s, devid, 0, &ref);
+    if (ctx == NULL) {
+        res = MEMTX_ACCESS_ERROR;
+    } else {
+        res = riscv_iommu_msi_write(s, ctx, addr, data, size, attrs);
+    }
+    riscv_iommu_ctx_put(s, ref);
+    return res;
+}
+
+static MemTxResult riscv_iommu_trap_read(void *opaque, hwaddr addr,
+    uint64_t *data, unsigned size, MemTxAttrs attrs)
+{
+    return MEMTX_ACCESS_ERROR;
+}
+
+static const MemoryRegionOps riscv_iommu_trap_ops = {
+    .read_with_attrs = riscv_iommu_trap_read,
+    .write_with_attrs = riscv_iommu_trap_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+        .unaligned = true,
+    },
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    }
+};
+
+static void riscv_iommu_realize(DeviceState *dev, Error **errp)
+{
+    RISCVIOMMUState *s = RISCV_IOMMU(dev);
+
+    s->cap = s->version & RISCV_IOMMU_CAP_VERSION;
+    if (s->enable_msi) {
+        s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
+    }
+    /* Report QEMU target physical address space limits */
+    s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
+                       TARGET_PHYS_ADDR_SPACE_BITS);
+
+    /* TODO: method to report supported PASID bits */
+    s->pasid_bits = 8; /* restricted to size of MemTxAttrs.pasid */
+    s->cap |= RISCV_IOMMU_CAP_PD8;
+
+    /* Out-of-reset translation mode: OFF (DMA disabled) BARE (passthrough) */
+    s->ddtp = set_field(0, RISCV_IOMMU_DDTP_MODE, s->enable_off ?
+                        RISCV_IOMMU_DDTP_MODE_OFF : RISCV_IOMMU_DDTP_MODE_BARE);
+
+    /* register storage */
+    s->regs_rw = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
+    s->regs_ro = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
+    s->regs_wc = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
+
+    /* Mark all registers read-only */
+    memset(s->regs_ro, 0xff, RISCV_IOMMU_REG_SIZE);
+
+    /*
+     * Register complete MMIO space, including MSI/PBA registers.
+     * Note, PCIDevice implementation will add overlapping MR for MSI/PBA,
+     * managed directly by the PCIDevice implementation.
+     */
+    memory_region_init_io(&s->regs_mr, OBJECT(dev), &riscv_iommu_mmio_ops, s,
+        "riscv-iommu-regs", RISCV_IOMMU_REG_SIZE);
+
+    /* Set power-on register state */
+    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
+    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
+    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
+        ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
+    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
+        ~(RISCV_IOMMU_CQB_LOG2SZ | RISCV_IOMMU_CQB_PPN));
+    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQB],
+        ~(RISCV_IOMMU_FQB_LOG2SZ | RISCV_IOMMU_FQB_PPN));
+    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQB],
+        ~(RISCV_IOMMU_PQB_LOG2SZ | RISCV_IOMMU_PQB_PPN));
+    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQMF |
+        RISCV_IOMMU_CQCSR_CMD_TO | RISCV_IOMMU_CQCSR_CMD_ILL);
+    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQON |
+        RISCV_IOMMU_CQCSR_BUSY);
+    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQMF |
+        RISCV_IOMMU_FQCSR_FQOF);
+    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQON |
+        RISCV_IOMMU_FQCSR_BUSY);
+    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQMF |
+        RISCV_IOMMU_PQCSR_PQOF);
+    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQON |
+        RISCV_IOMMU_PQCSR_BUSY);
+    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
+    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
+    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
+
+    /* Memory region for downstream access, if specified. */
+    if (s->target_mr) {
+        s->target_as = g_new0(AddressSpace, 1);
+        address_space_init(s->target_as, s->target_mr,
+            "riscv-iommu-downstream");
+    } else {
+        /* Fallback to global system memory. */
+        s->target_as = &address_space_memory;
+    }
+
+    /* Memory region for untranslated MRIF/MSI writes */
+    memory_region_init_io(&s->trap_mr, OBJECT(dev), &riscv_iommu_trap_ops, s,
+            "riscv-iommu-trap", ~0ULL);
+    address_space_init(&s->trap_as, &s->trap_mr, "riscv-iommu-trap-as");
+
+    /* Device translation context cache */
+    s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
+                                         g_free, NULL);
+
+    s->iommus.le_next = NULL;
+    s->iommus.le_prev = NULL;
+    QLIST_INIT(&s->spaces);
+    qemu_mutex_init(&s->core_lock);
+    qemu_spin_init(&s->regs_lock);
+}
+
+static void riscv_iommu_unrealize(DeviceState *dev)
+{
+    RISCVIOMMUState *s = RISCV_IOMMU(dev);
+
+    qemu_mutex_destroy(&s->core_lock);
+    g_hash_table_unref(s->ctx_cache);
+}
+
+static Property riscv_iommu_properties[] = {
+    DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
+        RISCV_IOMMU_SPEC_DOT_VER),
+    DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
+    DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
+    DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
+    DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
+        TYPE_MEMORY_REGION, MemoryRegion *),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void riscv_iommu_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    /* internal device for riscv-iommu-{pci/sys}, not user-creatable */
+    dc->user_creatable = false;
+    dc->realize = riscv_iommu_realize;
+    dc->unrealize = riscv_iommu_unrealize;
+    device_class_set_props(dc, riscv_iommu_properties);
+}
+
+static const TypeInfo riscv_iommu_info = {
+    .name = TYPE_RISCV_IOMMU,
+    .parent = TYPE_DEVICE,
+    .instance_size = sizeof(RISCVIOMMUState),
+    .class_init = riscv_iommu_class_init,
+};
+
+static const char *IOMMU_FLAG_STR[] = {
+    "NA",
+    "RO",
+    "WR",
+    "RW",
+};
+
+/* RISC-V IOMMU Memory Region - Address Translation Space */
+static IOMMUTLBEntry riscv_iommu_memory_region_translate(
+    IOMMUMemoryRegion *iommu_mr, hwaddr addr,
+    IOMMUAccessFlags flag, int iommu_idx)
+{
+    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
+    RISCVIOMMUContext *ctx;
+    void *ref;
+    IOMMUTLBEntry iotlb = {
+        .iova = addr,
+        .target_as = as->iommu->target_as,
+        .addr_mask = ~0ULL,
+        .perm = flag,
+    };
+
+    ctx = riscv_iommu_ctx(as->iommu, as->devid, iommu_idx, &ref);
+    if (ctx == NULL) {
+        /* Translation disabled or invalid. */
+        iotlb.addr_mask = 0;
+        iotlb.perm = IOMMU_NONE;
+    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
+        /* Translation disabled or fault reported. */
+        iotlb.addr_mask = 0;
+        iotlb.perm = IOMMU_NONE;
+    }
+
+    /* Trace all DMA translations with original access flags. */
+    trace_riscv_iommu_dma(as->iommu->parent_obj.id, PCI_BUS_NUM(as->devid),
+                          PCI_SLOT(as->devid), PCI_FUNC(as->devid), iommu_idx,
+                          IOMMU_FLAG_STR[flag & IOMMU_RW], iotlb.iova,
+                          iotlb.translated_addr);
+
+    riscv_iommu_ctx_put(as->iommu, ref);
+
+    return iotlb;
+}
+
+static int riscv_iommu_memory_region_notify(
+    IOMMUMemoryRegion *iommu_mr, IOMMUNotifierFlag old,
+    IOMMUNotifierFlag new, Error **errp)
+{
+    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
+
+    if (old == IOMMU_NOTIFIER_NONE) {
+        as->notifier = true;
+        trace_riscv_iommu_notifier_add(iommu_mr->parent_obj.name);
+    } else if (new == IOMMU_NOTIFIER_NONE) {
+        as->notifier = false;
+        trace_riscv_iommu_notifier_del(iommu_mr->parent_obj.name);
+    }
+
+    return 0;
+}
+
+static inline bool pci_is_iommu(PCIDevice *pdev)
+{
+    return pci_get_word(pdev->config + PCI_CLASS_DEVICE) == 0x0806;
+}
+
+static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void *opaque, int devfn)
+{
+    RISCVIOMMUState *s = (RISCVIOMMUState *) opaque;
+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
+    AddressSpace *as = NULL;
+
+    if (pdev && pci_is_iommu(pdev)) {
+        return s->target_as;
+    }
+
+    /* Find first registered IOMMU device */
+    while (s->iommus.le_prev) {
+        s = *(s->iommus.le_prev);
+    }
+
+    /* Find first matching IOMMU */
+    while (s != NULL && as == NULL) {
+        as = riscv_iommu_space(s, PCI_BUILD_BDF(pci_bus_num(bus), devfn));
+        s = s->iommus.le_next;
+    }
+
+    return as ? as : &address_space_memory;
+}
+
+static const PCIIOMMUOps riscv_iommu_ops = {
+    .get_address_space = riscv_iommu_find_as,
+};
+
+void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
+        Error **errp)
+{
+    if (bus->iommu_ops &&
+        bus->iommu_ops->get_address_space == riscv_iommu_find_as) {
+        /* Allow multiple IOMMUs on the same PCIe bus, link known devices */
+        RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque;
+        QLIST_INSERT_AFTER(last, iommu, iommus);
+    } else if (!bus->iommu_ops && !bus->iommu_opaque) {
+        pci_setup_iommu(bus, &riscv_iommu_ops, iommu);
+    } else {
+        error_setg(errp, "can't register secondary IOMMU for PCI bus #%d",
+            pci_bus_num(bus));
+    }
+}
+
+static int riscv_iommu_memory_region_index(IOMMUMemoryRegion *iommu_mr,
+    MemTxAttrs attrs)
+{
+    return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid;
+}
+
+static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion *iommu_mr)
+{
+    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
+    return 1 << as->iommu->pasid_bits;
+}
+
+static void riscv_iommu_memory_region_init(ObjectClass *klass, void *data)
+{
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
+
+    imrc->translate = riscv_iommu_memory_region_translate;
+    imrc->notify_flag_changed = riscv_iommu_memory_region_notify;
+    imrc->attrs_to_index = riscv_iommu_memory_region_index;
+    imrc->num_indexes = riscv_iommu_memory_region_index_len;
+}
+
+static const TypeInfo riscv_iommu_memory_region_info = {
+    .parent = TYPE_IOMMU_MEMORY_REGION,
+    .name = TYPE_RISCV_IOMMU_MEMORY_REGION,
+    .class_init = riscv_iommu_memory_region_init,
+};
+
+static void riscv_iommu_register_mr_types(void)
+{
+    type_register_static(&riscv_iommu_memory_region_info);
+    type_register_static(&riscv_iommu_info);
+}
+
+type_init(riscv_iommu_register_mr_types);
diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
new file mode 100644
index 0000000000..31d3907d33
--- /dev/null
+++ b/hw/riscv/riscv-iommu.h
@@ -0,0 +1,141 @@
+/*
+ * QEMU emulation of a RISC-V IOMMU
+ *
+ * Copyright (C) 2022-2023 Rivos Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_RISCV_IOMMU_STATE_H
+#define HW_RISCV_IOMMU_STATE_H
+
+#include "qemu/osdep.h"
+#include "qom/object.h"
+
+#include "hw/riscv/iommu.h"
+
+struct RISCVIOMMUState {
+    /*< private >*/
+    DeviceState parent_obj;
+
+    /*< public >*/
+    uint32_t version;     /* Reported interface version number */
+    uint32_t pasid_bits;  /* process identifier width */
+    uint32_t bus;         /* PCI bus mapping for non-root endpoints */
+
+    uint64_t cap;         /* IOMMU supported capabilities */
+    uint64_t fctl;        /* IOMMU enabled features */
+
+    bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
+    bool enable_msi;      /* Enable MSI remapping */
+
+    /* IOMMU Internal State */
+    uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */
+
+    dma_addr_t cq_addr;   /* Command queue base physical address */
+    dma_addr_t fq_addr;   /* Fault/event queue base physical address */
+    dma_addr_t pq_addr;   /* Page request queue base physical address */
+
+    uint32_t cq_mask;     /* Command queue index bit mask */
+    uint32_t fq_mask;     /* Fault/event queue index bit mask */
+    uint32_t pq_mask;     /* Page request queue index bit mask */
+
+    /* interrupt notifier */
+    void (*notify)(RISCVIOMMUState *iommu, unsigned vector);
+
+    /* IOMMU State Machine */
+    QemuThread core_proc; /* Background processing thread */
+    QemuMutex core_lock;  /* Global IOMMU lock, used for cache/regs updates */
+    QemuCond core_cond;   /* Background processing wake up signal */
+    unsigned core_exec;   /* Processing thread execution actions */
+
+    /* IOMMU target address space */
+    AddressSpace *target_as;
+    MemoryRegion *target_mr;
+
+    /* MSI / MRIF access trap */
+    AddressSpace trap_as;
+    MemoryRegion trap_mr;
+
+    GHashTable *ctx_cache;          /* Device translation Context Cache */
+
+    /* MMIO Hardware Interface */
+    MemoryRegion regs_mr;
+    QemuSpin regs_lock;
+    uint8_t *regs_rw;  /* register state (user write) */
+    uint8_t *regs_wc;  /* write-1-to-clear mask */
+    uint8_t *regs_ro;  /* read-only mask */
+
+    QLIST_ENTRY(RISCVIOMMUState) iommus;
+    QLIST_HEAD(, RISCVIOMMUSpace) spaces;
+};
+
+void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
+         Error **errp);
+
+/* private helpers */
+
+/* Register helper functions */
+static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s,
+    unsigned idx, uint32_t set, uint32_t clr)
+{
+    uint32_t val;
+    qemu_spin_lock(&s->regs_lock);
+    val = ldl_le_p(s->regs_rw + idx);
+    stl_le_p(s->regs_rw + idx, (val & ~clr) | set);
+    qemu_spin_unlock(&s->regs_lock);
+    return val;
+}
+
+static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s,
+    unsigned idx, uint32_t set)
+{
+    qemu_spin_lock(&s->regs_lock);
+    stl_le_p(s->regs_rw + idx, set);
+    qemu_spin_unlock(&s->regs_lock);
+}
+
+static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s,
+    unsigned idx)
+{
+    return ldl_le_p(s->regs_rw + idx);
+}
+
+static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s,
+    unsigned idx, uint64_t set, uint64_t clr)
+{
+    uint64_t val;
+    qemu_spin_lock(&s->regs_lock);
+    val = ldq_le_p(s->regs_rw + idx);
+    stq_le_p(s->regs_rw + idx, (val & ~clr) | set);
+    qemu_spin_unlock(&s->regs_lock);
+    return val;
+}
+
+static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s,
+    unsigned idx, uint64_t set)
+{
+    qemu_spin_lock(&s->regs_lock);
+    stq_le_p(s->regs_rw + idx, set);
+    qemu_spin_unlock(&s->regs_lock);
+}
+
+static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s,
+    unsigned idx)
+{
+    return ldq_le_p(s->regs_rw + idx);
+}
+
+
+
+#endif
diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
new file mode 100644
index 0000000000..42a97caffa
--- /dev/null
+++ b/hw/riscv/trace-events
@@ -0,0 +1,11 @@
+# See documentation at docs/devel/tracing.rst
+
+# riscv-iommu.c
+riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) "%s: device attached %04x:%02x.%d"
+riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" iova: 0x%"PRIx64
+riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
+riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
+riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
+riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
+riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
+riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h
new file mode 100644
index 0000000000..8c0e3ca1f3
--- /dev/null
+++ b/hw/riscv/trace.h
@@ -0,0 +1 @@
+#include "trace/trace-hw_riscv.h"
diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h
new file mode 100644
index 0000000000..070ee69973
--- /dev/null
+++ b/include/hw/riscv/iommu.h
@@ -0,0 +1,36 @@
+/*
+ * QEMU emulation of a RISC-V IOMMU
+ *
+ * Copyright (C) 2022-2023 Rivos Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_RISCV_IOMMU_H
+#define HW_RISCV_IOMMU_H
+
+#include "qemu/osdep.h"
+#include "qom/object.h"
+
+#define TYPE_RISCV_IOMMU "riscv-iommu"
+OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU)
+typedef struct RISCVIOMMUState RISCVIOMMUState;
+
+#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr"
+typedef struct RISCVIOMMUSpace RISCVIOMMUSpace;
+
+#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci"
+OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI)
+typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci;
+
+#endif
diff --git a/meson.build b/meson.build
index a9de71d450..8099d8271c 100644
--- a/meson.build
+++ b/meson.build
@@ -3319,6 +3319,7 @@ if have_system
     'hw/pci-host',
     'hw/ppc',
     'hw/rtc',
+    'hw/riscv',
     'hw/s390x',
     'hw/scsi',
     'hw/sd',
-- 
2.44.0

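For readers following the register model in riscv_iommu_mmio_write()
above: every guest write is merged into regs_rw through the regs_ro
(read-only) and regs_wc (write-1-to-clear) masks. Below is a minimal
standalone sketch of that merge for a 32-bit register. The helper names
masked_write32() and masked_write32_demo() are illustrative only and are
not part of the patch; just the expression itself is taken from it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative helper (not in the patch): compute the value stored in
 * regs_rw after a guest write of 'data'.
 *   rw: value currently stored
 *   ro: mask of bits the guest cannot change
 *   wc: mask of write-1-to-clear bits
 */
static uint32_t masked_write32(uint32_t rw, uint32_t ro, uint32_t wc,
                               uint32_t data)
{
    return ((rw & ro) | (data & ~ro)) & ~(data & wc);
}

static void masked_write32_demo(void)
{
    /* bit 0: read/write, bit 1: read-only, bit 2: write-1-to-clear */
    const uint32_t ro = 0x2, wc = 0x4;

    /* Guest sets bit 0, writes 0 to the RO bit and 1 to the W1C bit. */
    assert(masked_write32(0x2, ro, wc, 0x5) == 0x3);

    /* Guest writes all zeroes: the RO bit survives, bit 0 is cleared. */
    assert(masked_write32(0x3, ro, wc, 0x0) == 0x2);
}
```

With these masks the guest cannot flip the read-only bit, and a 1
written to a write-1-to-clear bit never ends up set in the stored value.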

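The cq/fq/pq control handlers in the patch above derive the queue index
mask from the LOG2SZ field of the queue base register, then write the
inverted mask into regs_ro for the head or tail register so that only
the index bits remain guest-writable. A small sketch of that arithmetic
follows, assuming the field holds log2(entries) - 1 as the
"(2ULL << field) - 1" expression in the patch implies. All function
names are hypothetical, and queue_next() is an assumed wrap-around
helper that does not appear in this part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Index mask for a queue whose size field holds log2(entries) - 1. */
static uint32_t queue_index_mask(unsigned log2sz_field)
{
    /* Larger values would not fit the 32-bit mask used in this sketch. */
    assert(log2sz_field < 32);
    return (uint32_t)((2ULL << log2sz_field) - 1);
}

/* Read-only mask for a head/tail register: only index bits stay writable. */
static uint32_t queue_index_ro_mask(uint32_t index_mask)
{
    return ~index_mask;
}

/* Hypothetical helper: advance a queue index with wrap-around. */
static uint32_t queue_next(uint32_t idx, uint32_t index_mask)
{
    return (idx + 1) & index_mask;
}
```

For example, a field value of 3 describes a 16-entry queue: the index
mask is 0xf, the read-only mask is 0xfffffff0, and index 15 wraps to 0.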


* [PATCH v3 04/13] pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 05/13] hw/riscv: add riscv-iommu-pci reference device Daniel Henrique Barboza
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza,
	Gerd Hoffmann

The RISC-V IOMMU PCI device we're going to add next is a reference
implementation of the riscv-iommu spec [1], which provides for
implementing the IOMMU as a PCIe device.

However, RISC-V International (RVI), the entity that ratified the
riscv-iommu spec, didn't bother assigning a PCI ID for the IOMMU PCIe
implementation that the spec provides for. This puts us in an uncommon
situation: we want to add the reference IOMMU PCIe implementation
but we don't have a PCI ID for it.

Given that RVI doesn't provide a PCI ID for it, we reached out to Red Hat
and Gerd Hoffmann, and they were kind enough to give us a PCI ID for the
RISC-V IOMMU PCI reference device.

Thanks Red Hat and Gerd for this RISC-V IOMMU PCIe device ID.

[1] https://github.com/riscv-non-isa/riscv-iommu/releases/tag/v1.0.0

Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
 docs/specs/pci-ids.rst | 2 ++
 include/hw/pci/pci.h   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/docs/specs/pci-ids.rst b/docs/specs/pci-ids.rst
index c0a3dec2e7..a89a9d0939 100644
--- a/docs/specs/pci-ids.rst
+++ b/docs/specs/pci-ids.rst
@@ -94,6 +94,8 @@ PCI devices (other than virtio):
   PCI ACPI ERST device (``-device acpi-erst``)
 1b36:0013
   PCI UFS device (``-device ufs``)
+1b36:0014
+  PCI RISC-V IOMMU device
 
 All these devices are documented in :doc:`index`.
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index eaa3fc99d8..462aed1503 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -115,6 +115,7 @@ extern bool pci_available;
 #define PCI_DEVICE_ID_REDHAT_PVPANIC     0x0011
 #define PCI_DEVICE_ID_REDHAT_ACPI_ERST   0x0012
 #define PCI_DEVICE_ID_REDHAT_UFS         0x0013
+#define PCI_DEVICE_ID_REDHAT_RISCV_IOMMU 0x0014
 #define PCI_DEVICE_ID_REDHAT_QXL         0x0100
 
 #define FMT_PCIBUS                      PRIx64
-- 
2.44.0

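The ID pair documented by the patch above is 1b36:0014. As a rough
illustration of how code could recognize the device by that pair, here
is a standalone sketch. The two macros are redefined locally so that it
compiles on its own, and is_redhat_riscv_iommu() and pci_id_demo() are
hypothetical names, not functions from QEMU or the Linux driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the patch: vendor 1b36 (Red Hat), device 0014. */
#define PCI_VENDOR_ID_REDHAT             0x1b36
#define PCI_DEVICE_ID_REDHAT_RISCV_IOMMU 0x0014

/* Hypothetical matcher, e.g. for an entry in a probe table. */
static bool is_redhat_riscv_iommu(uint16_t vendor, uint16_t device)
{
    return vendor == PCI_VENDOR_ID_REDHAT &&
           device == PCI_DEVICE_ID_REDHAT_RISCV_IOMMU;
}

static void pci_id_demo(void)
{
    assert(is_redhat_riscv_iommu(0x1b36, 0x0014));
    /* 1b36:0013 is the UFS device listed just before it in pci-ids.rst. */
    assert(!is_redhat_riscv_iommu(0x1b36, 0x0013));
}
```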



* [PATCH v3 05/13] hw/riscv: add riscv-iommu-pci reference device
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (3 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 04/13] pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-06-09  8:53   ` Frank Chang
  2024-05-23 17:39 ` [PATCH v3 06/13] hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug Daniel Henrique Barboza
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

The RISC-V IOMMU can be modelled as a PCIe device following the
guidelines of the RISC-V IOMMU spec, chapter 7.1, "Integrating an IOMMU
as a PCIe device".

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
 hw/riscv/meson.build       |   2 +-
 hw/riscv/riscv-iommu-pci.c | 177 +++++++++++++++++++++++++++++++++++++
 2 files changed, 178 insertions(+), 1 deletion(-)
 create mode 100644 hw/riscv/riscv-iommu-pci.c

diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
index cbc99c6e8e..adbef8a9b2 100644
--- a/hw/riscv/meson.build
+++ b/hw/riscv/meson.build
@@ -10,6 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
 riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
 riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
 riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
-riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
+riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c', 'riscv-iommu-pci.c'))
 
 hw_arch += {'riscv': riscv_ss}
diff --git a/hw/riscv/riscv-iommu-pci.c b/hw/riscv/riscv-iommu-pci.c
new file mode 100644
index 0000000000..7635cc64ff
--- /dev/null
+++ b/hw/riscv/riscv-iommu-pci.c
@@ -0,0 +1,177 @@
+/*
+ * QEMU emulation of a RISC-V IOMMU
+ *
+ * Copyright (C) 2022-2023 Rivos Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/qdev-properties.h"
+#include "hw/riscv/riscv_hart.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/host-utils.h"
+#include "qom/object.h"
+
+#include "cpu_bits.h"
+#include "riscv-iommu.h"
+#include "riscv-iommu-bits.h"
+
+/* RISC-V IOMMU PCI Device Emulation */
+
+typedef struct RISCVIOMMUStatePci {
+    PCIDevice        pci;     /* Parent PCIe device state */
+    uint16_t         vendor_id;
+    uint16_t         device_id;
+    uint8_t          revision;
+    MemoryRegion     bar0;    /* PCI BAR (including MSI-x config) */
+    RISCVIOMMUState  iommu;   /* common IOMMU state */
+} RISCVIOMMUStatePci;
+
+/* interrupt delivery callback */
+static void riscv_iommu_pci_notify(RISCVIOMMUState *iommu, unsigned vector)
+{
+    RISCVIOMMUStatePci *s = container_of(iommu, RISCVIOMMUStatePci, iommu);
+
+    if (msix_enabled(&(s->pci))) {
+        msix_notify(&(s->pci), vector);
+    }
+}
+
+static void riscv_iommu_pci_realize(PCIDevice *dev, Error **errp)
+{
+    RISCVIOMMUStatePci *s = DO_UPCAST(RISCVIOMMUStatePci, pci, dev);
+    RISCVIOMMUState *iommu = &s->iommu;
+    uint8_t *pci_conf = dev->config;
+    Error *err = NULL;
+
+    pci_set_word(pci_conf + PCI_VENDOR_ID, s->vendor_id);
+    pci_set_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID, s->vendor_id);
+    pci_set_word(pci_conf + PCI_DEVICE_ID, s->device_id);
+    pci_set_word(pci_conf + PCI_SUBSYSTEM_ID, s->device_id);
+    pci_set_byte(pci_conf + PCI_REVISION_ID, s->revision);
+
+    /* Set device id for trace / debug */
+    DEVICE(iommu)->id = g_strdup_printf("%02x:%02x.%01x",
+        pci_dev_bus_num(dev), PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+    qdev_realize(DEVICE(iommu), NULL, errp);
+
+    memory_region_init(&s->bar0, OBJECT(s), "riscv-iommu-bar0",
+        QEMU_ALIGN_UP(memory_region_size(&iommu->regs_mr), TARGET_PAGE_SIZE));
+    memory_region_add_subregion(&s->bar0, 0, &iommu->regs_mr);
+
+    pcie_endpoint_cap_init(dev, 0);
+
+    pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+                     PCI_BASE_ADDRESS_MEM_TYPE_64, &s->bar0);
+
+    int ret = msix_init(dev, RISCV_IOMMU_INTR_COUNT,
+                        &s->bar0, 0, RISCV_IOMMU_REG_MSI_CONFIG,
+                        &s->bar0, 0, RISCV_IOMMU_REG_MSI_CONFIG + 256, 0, &err);
+
+    if (ret == -ENOTSUP) {
+        /*
+         * MSI-X is not supported by the platform.
+         * Driver should use timer/polling based notification handlers.
+         */
+        warn_report_err(err);
+    } else if (ret < 0) {
+        error_propagate(errp, err);
+        return;
+    } else {
+        /* Mark all allocated MSI-X vectors as used. */
+        msix_vector_use(dev, RISCV_IOMMU_INTR_CQ);
+        msix_vector_use(dev, RISCV_IOMMU_INTR_FQ);
+        msix_vector_use(dev, RISCV_IOMMU_INTR_PM);
+        msix_vector_use(dev, RISCV_IOMMU_INTR_PQ);
+        iommu->notify = riscv_iommu_pci_notify;
+    }
+
+    PCIBus *bus = pci_device_root_bus(dev);
+    if (!bus) {
+        error_setg(errp, "can't find PCIe root port for %02x:%02x.%x",
+            pci_bus_num(pci_get_bus(dev)), PCI_SLOT(dev->devfn),
+            PCI_FUNC(dev->devfn));
+        return;
+    }
+
+    riscv_iommu_pci_setup_iommu(iommu, bus, errp);
+}
+
+static void riscv_iommu_pci_exit(PCIDevice *pci_dev)
+{
+    pci_setup_iommu(pci_device_root_bus(pci_dev), NULL, NULL);
+}
+
+static const VMStateDescription riscv_iommu_vmstate = {
+    .name = "riscv-iommu",
+    .unmigratable = 1
+};
+
+static void riscv_iommu_pci_init(Object *obj)
+{
+    RISCVIOMMUStatePci *s = RISCV_IOMMU_PCI(obj);
+    RISCVIOMMUState *iommu = &s->iommu;
+
+    object_initialize_child(obj, "iommu", iommu, TYPE_RISCV_IOMMU);
+    qdev_alias_all_properties(DEVICE(iommu), obj);
+}
+
+static Property riscv_iommu_pci_properties[] = {
+    DEFINE_PROP_UINT16("vendor-id", RISCVIOMMUStatePci, vendor_id,
+                       PCI_VENDOR_ID_REDHAT),
+    DEFINE_PROP_UINT16("device-id", RISCVIOMMUStatePci, device_id,
+                       PCI_DEVICE_ID_REDHAT_RISCV_IOMMU),
+    DEFINE_PROP_UINT8("revision", RISCVIOMMUStatePci, revision, 0x01),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void riscv_iommu_pci_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+    k->realize = riscv_iommu_pci_realize;
+    k->exit = riscv_iommu_pci_exit;
+    k->class_id = 0x0806;
+    dc->desc = "RISCV-IOMMU DMA Remapping device";
+    dc->vmsd = &riscv_iommu_vmstate;
+    dc->hotpluggable = false;
+    dc->user_creatable = true;
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    device_class_set_props(dc, riscv_iommu_pci_properties);
+}
+
+static const TypeInfo riscv_iommu_pci = {
+    .name = TYPE_RISCV_IOMMU_PCI,
+    .parent = TYPE_PCI_DEVICE,
+    .class_init = riscv_iommu_pci_class_init,
+    .instance_init = riscv_iommu_pci_init,
+    .instance_size = sizeof(RISCVIOMMUStatePci),
+    .interfaces = (InterfaceInfo[]) {
+        { INTERFACE_PCIE_DEVICE },
+        { },
+    },
+};
+
+static void riscv_iommu_register_pci_types(void)
+{
+    type_register_static(&riscv_iommu_pci);
+}
+
+type_init(riscv_iommu_register_pci_types);
-- 
2.44.0




* [PATCH v3 06/13] hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (4 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 05/13] hw/riscv: add riscv-iommu-pci reference device Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 07/13] test/qtest: add riscv-iommu-pci tests Daniel Henrique Barboza
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

Generate a device tree entry for the riscv-iommu PCI device, and map
all other PCI device identifiers to that single IOMMU device instance.
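
As an illustration only (not part of the patch), the two "iommu-map"
ranges written by create_fdt_iommu() can be sketched in standalone C.
The struct and function names below are hypothetical; only the cell
values (0, phandle, 0, bdf and bdf + 1, phandle, bdf + 1, 0xffff - bdf)
are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One iommu-map entry: (rid-base, iommu phandle, iommu-base, length). */
struct iommu_map_entry {
    uint32_t rid_base;
    uint32_t phandle;
    uint32_t iommu_base;
    uint32_t length;
};

/*
 * Fill the two ranges emitted for an IOMMU located at 'bdf':
 * requester IDs below the IOMMU, and requester IDs above it.
 * The IOMMU's own bdf is left out of both ranges.
 */
static void build_iommu_map(uint16_t bdf, uint32_t phandle,
                            struct iommu_map_entry out[2])
{
    assert(out != NULL);
    out[0] = (struct iommu_map_entry){ 0, phandle, 0, bdf };
    out[1] = (struct iommu_map_entry){ bdf + 1u, phandle,
                                       bdf + 1u, 0xffffu - bdf };
}

/* Return 1 if requester ID 'rid' falls inside either range, else 0. */
static int rid_is_mapped(const struct iommu_map_entry map[2], uint32_t rid)
{
    for (int i = 0; i < 2; i++) {
        if (rid >= map[i].rid_base &&
            rid - map[i].rid_base < map[i].length) {
            return 1;
        }
    }
    return 0;
}
```

For example, with the IOMMU at slot 1, function 0 on bus 0 (bdf 0x0008),
the first range covers requester IDs 0 to 7 and the second covers 9 to
0xffff, so every device except the IOMMU itself is routed through it.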

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
 hw/riscv/virt.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index 4fdb660525..b6ebbb3baf 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -32,6 +32,7 @@
 #include "hw/core/sysbus-fdt.h"
 #include "target/riscv/pmu.h"
 #include "hw/riscv/riscv_hart.h"
+#include "hw/riscv/iommu.h"
 #include "hw/riscv/virt.h"
 #include "hw/riscv/boot.h"
 #include "hw/riscv/numa.h"
@@ -1006,6 +1007,30 @@ static void create_fdt_virtio_iommu(RISCVVirtState *s, uint16_t bdf)
                            bdf + 1, iommu_phandle, bdf + 1, 0xffff - bdf);
 }
 
+static void create_fdt_iommu(RISCVVirtState *s, uint16_t bdf)
+{
+    const char comp[] = "riscv,pci-iommu";
+    void *fdt = MACHINE(s)->fdt;
+    uint32_t iommu_phandle;
+    g_autofree char *iommu_node = NULL;
+    g_autofree char *pci_node = NULL;
+
+    pci_node = g_strdup_printf("/soc/pci@%lx",
+                               (long) virt_memmap[VIRT_PCIE_ECAM].base);
+    iommu_node = g_strdup_printf("%s/iommu@%x", pci_node, bdf);
+    iommu_phandle = qemu_fdt_alloc_phandle(fdt);
+    qemu_fdt_add_subnode(fdt, iommu_node);
+
+    qemu_fdt_setprop(fdt, iommu_node, "compatible", comp, sizeof(comp));
+    qemu_fdt_setprop_cell(fdt, iommu_node, "#iommu-cells", 1);
+    qemu_fdt_setprop_cell(fdt, iommu_node, "phandle", iommu_phandle);
+    qemu_fdt_setprop_cells(fdt, iommu_node, "reg",
+                           bdf << 8, 0, 0, 0, 0);
+    qemu_fdt_setprop_cells(fdt, pci_node, "iommu-map",
+                           0, iommu_phandle, 0, bdf,
+                           bdf + 1, iommu_phandle, bdf + 1, 0xffff - bdf);
+}
+
 static void finalize_fdt(RISCVVirtState *s)
 {
     uint32_t phandle = 1, irq_mmio_phandle = 1, msi_pcie_phandle = 1;
@@ -1712,9 +1737,11 @@ static HotplugHandler *virt_machine_get_hotplug_handler(MachineState *machine,
     MachineClass *mc = MACHINE_GET_CLASS(machine);
 
     if (device_is_dynamic_sysbus(mc, dev) ||
-        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
+        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) ||
+        object_dynamic_cast(OBJECT(dev), TYPE_RISCV_IOMMU_PCI)) {
         return HOTPLUG_HANDLER(machine);
     }
+
     return NULL;
 }
 
@@ -1735,6 +1762,10 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
     if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
         create_fdt_virtio_iommu(s, pci_get_bdf(PCI_DEVICE(dev)));
     }
+
+    if (object_dynamic_cast(OBJECT(dev), TYPE_RISCV_IOMMU_PCI)) {
+        create_fdt_iommu(s, pci_get_bdf(PCI_DEVICE(dev)));
+    }
 }
 
 static void virt_machine_class_init(ObjectClass *oc, void *data)
-- 
2.44.0




* [PATCH v3 07/13] test/qtest: add riscv-iommu-pci tests
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (5 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 06/13] hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 08/13] hw/riscv/riscv-iommu: add Address Translation Cache (IOATC) Daniel Henrique Barboza
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

To test the RISC-V IOMMU emulation we'll use its PCI representation.
Create a new 'riscv-iommu-pci' libqos device that will be present with
CONFIG_RISCV_IOMMU.  This config is only available for RISC-V, so this
device will only be consumed by the RISC-V libqos machine.

Start with basic tests: a PCI sanity check and a reset state register
test. The reset test was taken from the RISC-V IOMMU spec chapter 5.2,
"Reset behavior".

More tests will be added later.
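
For readers unfamiliar with the mask macros used by the reset test, here
is a minimal standalone sketch of how a register field is isolated.
GENMASK_ULL is copied from the new libqos header; field_get() and the
sample register values in the note below are hypothetical and only serve
the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same definition as in tests/qtest/libqos/riscv-iommu.h. */
#define GENMASK_ULL(h, l) (((~0ULL) >> (63 - (h) + (l))) << (l))

/*
 * Extract bits [h:l] of 'reg' and shift them down to bit 0.
 * Hypothetical helper, not a QEMU or libqos function.
 */
static uint64_t field_get(uint64_t reg, unsigned h, unsigned l)
{
    assert(h < 64 && l <= h);
    return (reg & GENMASK_ULL(h, l)) >> l;
}
```

With a made-up capabilities value of 0x1234567810, field_get(reg, 7, 0)
yields 0x10, which is the version value the reset test expects in
RISCV_IOMMU_CAP_VERSION.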

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
 tests/qtest/libqos/meson.build   |  4 ++
 tests/qtest/libqos/riscv-iommu.c | 76 ++++++++++++++++++++++++++
 tests/qtest/libqos/riscv-iommu.h | 71 ++++++++++++++++++++++++
 tests/qtest/meson.build          |  1 +
 tests/qtest/riscv-iommu-test.c   | 93 ++++++++++++++++++++++++++++++++
 5 files changed, 245 insertions(+)
 create mode 100644 tests/qtest/libqos/riscv-iommu.c
 create mode 100644 tests/qtest/libqos/riscv-iommu.h
 create mode 100644 tests/qtest/riscv-iommu-test.c

diff --git a/tests/qtest/libqos/meson.build b/tests/qtest/libqos/meson.build
index 3aed6efcb8..07fe20eacb 100644
--- a/tests/qtest/libqos/meson.build
+++ b/tests/qtest/libqos/meson.build
@@ -67,6 +67,10 @@ if have_virtfs
   libqos_srcs += files('virtio-9p.c', 'virtio-9p-client.c')
 endif
 
+if config_all_devices.has_key('CONFIG_RISCV_IOMMU')
+  libqos_srcs += files('riscv-iommu.c')
+endif
+
 libqos = static_library('qos', libqos_srcs + genh,
                         name_suffix: 'fa',
                         build_by_default: false)
diff --git a/tests/qtest/libqos/riscv-iommu.c b/tests/qtest/libqos/riscv-iommu.c
new file mode 100644
index 0000000000..01e3b31c0b
--- /dev/null
+++ b/tests/qtest/libqos/riscv-iommu.c
@@ -0,0 +1,76 @@
+/*
+ * libqos driver riscv-iommu-pci framework
+ *
+ * Copyright (c) 2024 Ventana Micro Systems Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at your
+ * option) any later version.  See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "../libqtest.h"
+#include "qemu/module.h"
+#include "qgraph.h"
+#include "pci.h"
+#include "riscv-iommu.h"
+
+static void *riscv_iommu_pci_get_driver(void *obj, const char *interface)
+{
+    QRISCVIOMMU *r_iommu_pci = obj;
+
+    if (!g_strcmp0(interface, "pci-device")) {
+        return &r_iommu_pci->dev;
+    }
+
+    fprintf(stderr, "%s not present in riscv_iommu_pci\n", interface);
+    g_assert_not_reached();
+}
+
+static void riscv_iommu_pci_start_hw(QOSGraphObject *obj)
+{
+    QRISCVIOMMU *pci = (QRISCVIOMMU *)obj;
+    qpci_device_enable(&pci->dev);
+}
+
+static void riscv_iommu_pci_destructor(QOSGraphObject *obj)
+{
+    QRISCVIOMMU *pci = (QRISCVIOMMU *)obj;
+    qpci_iounmap(&pci->dev, pci->reg_bar);
+}
+
+static void *riscv_iommu_pci_create(void *pci_bus, QGuestAllocator *alloc,
+                                    void *addr)
+{
+    QRISCVIOMMU *r_iommu_pci = g_new0(QRISCVIOMMU, 1);
+    QPCIBus *bus = pci_bus;
+
+    qpci_device_init(&r_iommu_pci->dev, bus, addr);
+    r_iommu_pci->reg_bar = qpci_iomap(&r_iommu_pci->dev, 0, NULL);
+
+    r_iommu_pci->obj.get_driver = riscv_iommu_pci_get_driver;
+    r_iommu_pci->obj.start_hw = riscv_iommu_pci_start_hw;
+    r_iommu_pci->obj.destructor = riscv_iommu_pci_destructor;
+    return &r_iommu_pci->obj;
+}
+
+static void riscv_iommu_pci_register_nodes(void)
+{
+    QPCIAddress addr = {
+        .vendor_id = RISCV_IOMMU_PCI_VENDOR_ID,
+        .device_id = RISCV_IOMMU_PCI_DEVICE_ID,
+        .devfn = QPCI_DEVFN(1, 0),
+    };
+
+    QOSGraphEdgeOptions opts = {
+        .extra_device_opts = "addr=01.0",
+    };
+
+    add_qpci_address(&opts, &addr);
+
+    qos_node_create_driver("riscv-iommu-pci", riscv_iommu_pci_create);
+    qos_node_produces("riscv-iommu-pci", "pci-device");
+    qos_node_consumes("riscv-iommu-pci", "pci-bus", &opts);
+}
+
+libqos_init(riscv_iommu_pci_register_nodes);
diff --git a/tests/qtest/libqos/riscv-iommu.h b/tests/qtest/libqos/riscv-iommu.h
new file mode 100644
index 0000000000..d123efb41f
--- /dev/null
+++ b/tests/qtest/libqos/riscv-iommu.h
@@ -0,0 +1,71 @@
+/*
+ * libqos driver riscv-iommu-pci framework
+ *
+ * Copyright (c) 2024 Ventana Micro Systems Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at your
+ * option) any later version.  See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef TESTS_LIBQOS_RISCV_IOMMU_H
+#define TESTS_LIBQOS_RISCV_IOMMU_H
+
+#include "qgraph.h"
+#include "pci.h"
+#include "qemu/bitops.h"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) (((~0ULL) >> (63 - (h) + (l))) << (l))
+#endif
+
+/*
+ * RISC-V IOMMU uses PCI_VENDOR_ID_REDHAT 0x1b36 and
+ * PCI_DEVICE_ID_REDHAT_RISCV_IOMMU 0x0014.
+ */
+#define RISCV_IOMMU_PCI_VENDOR_ID       0x1b36
+#define RISCV_IOMMU_PCI_DEVICE_ID       0x0014
+#define RISCV_IOMMU_PCI_DEVICE_CLASS    0x0806
+
+/* Common field positions */
+#define RISCV_IOMMU_QUEUE_ENABLE        BIT(0)
+#define RISCV_IOMMU_QUEUE_INTR_ENABLE   BIT(1)
+#define RISCV_IOMMU_QUEUE_MEM_FAULT     BIT(8)
+#define RISCV_IOMMU_QUEUE_ACTIVE        BIT(16)
+#define RISCV_IOMMU_QUEUE_BUSY          BIT(17)
+
+#define RISCV_IOMMU_REG_CAP             0x0000
+#define RISCV_IOMMU_CAP_VERSION         GENMASK_ULL(7, 0)
+
+#define RISCV_IOMMU_REG_DDTP            0x0010
+#define RISCV_IOMMU_DDTP_BUSY           BIT_ULL(4)
+#define RISCV_IOMMU_DDTP_MODE           GENMASK_ULL(3, 0)
+#define RISCV_IOMMU_DDTP_MODE_OFF       0
+
+#define RISCV_IOMMU_REG_CQCSR           0x0048
+#define RISCV_IOMMU_CQCSR_CQEN          RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_CQCSR_CIE           RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_CQCSR_CQON          RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_CQCSR_BUSY          RISCV_IOMMU_QUEUE_BUSY
+
+#define RISCV_IOMMU_REG_FQCSR           0x004C
+#define RISCV_IOMMU_FQCSR_FQEN          RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_FQCSR_FIE           RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_FQCSR_FQON          RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_FQCSR_BUSY          RISCV_IOMMU_QUEUE_BUSY
+
+#define RISCV_IOMMU_REG_PQCSR           0x0050
+#define RISCV_IOMMU_PQCSR_PQEN          RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_PQCSR_PIE           RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_PQCSR_PQON          RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_PQCSR_BUSY          RISCV_IOMMU_QUEUE_BUSY
+
+#define RISCV_IOMMU_REG_IPSR            0x0054
+
+typedef struct QRISCVIOMMU {
+    QOSGraphObject obj;
+    QPCIDevice dev;
+    QPCIBar reg_bar;
+} QRISCVIOMMU;
+
+#endif
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index 86293051dc..1b81db2807 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -293,6 +293,7 @@ qos_test_ss.add(
   'vmxnet3-test.c',
   'igb-test.c',
   'ufs-test.c',
+  'riscv-iommu-test.c',
 )
 
 if config_all_devices.has_key('CONFIG_VIRTIO_SERIAL')
diff --git a/tests/qtest/riscv-iommu-test.c b/tests/qtest/riscv-iommu-test.c
new file mode 100644
index 0000000000..7f0dbd0211
--- /dev/null
+++ b/tests/qtest/riscv-iommu-test.c
@@ -0,0 +1,93 @@
+/*
+ * QTest testcase for RISC-V IOMMU
+ *
+ * Copyright (c) 2024 Ventana Micro Systems Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at your
+ * option) any later version.  See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest-single.h"
+#include "qemu/module.h"
+#include "libqos/qgraph.h"
+#include "libqos/riscv-iommu.h"
+#include "hw/pci/pci_regs.h"
+
+static uint32_t riscv_iommu_read_reg32(QRISCVIOMMU *r_iommu, int reg_offset)
+{
+    uint32_t reg;
+
+    qpci_memread(&r_iommu->dev, r_iommu->reg_bar, reg_offset,
+                 &reg, sizeof(reg));
+    return reg;
+}
+
+static uint64_t riscv_iommu_read_reg64(QRISCVIOMMU *r_iommu, int reg_offset)
+{
+    uint64_t reg;
+
+    qpci_memread(&r_iommu->dev, r_iommu->reg_bar, reg_offset,
+                 &reg, sizeof(reg));
+    return reg;
+}
+
+static void test_pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
+{
+    QRISCVIOMMU *r_iommu = obj;
+    QPCIDevice *dev = &r_iommu->dev;
+    uint16_t vendorid, deviceid, classid;
+
+    vendorid = qpci_config_readw(dev, PCI_VENDOR_ID);
+    deviceid = qpci_config_readw(dev, PCI_DEVICE_ID);
+    classid = qpci_config_readw(dev, PCI_CLASS_DEVICE);
+
+    g_assert_cmpuint(vendorid, ==, RISCV_IOMMU_PCI_VENDOR_ID);
+    g_assert_cmpuint(deviceid, ==, RISCV_IOMMU_PCI_DEVICE_ID);
+    g_assert_cmpuint(classid, ==, RISCV_IOMMU_PCI_DEVICE_CLASS);
+}
+
+static void test_reg_reset(void *obj, void *data, QGuestAllocator *t_alloc)
+{
+    QRISCVIOMMU *r_iommu = obj;
+    uint64_t cap;
+    uint32_t reg;
+
+    cap = riscv_iommu_read_reg64(r_iommu, RISCV_IOMMU_REG_CAP);
+    g_assert_cmpuint(cap & RISCV_IOMMU_CAP_VERSION, ==, 0x10);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_CQCSR);
+    g_assert_cmpuint(reg & RISCV_IOMMU_CQCSR_CQEN, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_CQCSR_CIE, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_CQCSR_CQON, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_CQCSR_BUSY, ==, 0);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_FQCSR);
+    g_assert_cmpuint(reg & RISCV_IOMMU_FQCSR_FQEN, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_FQCSR_FIE, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_FQCSR_FQON, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_FQCSR_BUSY, ==, 0);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_PQCSR);
+    g_assert_cmpuint(reg & RISCV_IOMMU_PQCSR_PQEN, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_PQCSR_PIE, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_PQCSR_PQON, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_PQCSR_BUSY, ==, 0);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_DDTP);
+    g_assert_cmpuint(reg & RISCV_IOMMU_DDTP_BUSY, ==, 0);
+    g_assert_cmpuint(reg & RISCV_IOMMU_DDTP_MODE, ==,
+                     RISCV_IOMMU_DDTP_MODE_OFF);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_IPSR);
+    g_assert_cmpuint(reg, ==, 0);
+}
+
+static void register_riscv_iommu_test(void)
+{
+    qos_add_test("pci_config", "riscv-iommu-pci", test_pci_config, NULL);
+    qos_add_test("reg_reset", "riscv-iommu-pci", test_reg_reset, NULL);
+}
+
+libqos_init(register_riscv_iommu_test);
-- 
2.44.0




* [PATCH v3 08/13] hw/riscv/riscv-iommu: add Address Translation Cache (IOATC)
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (6 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 07/13] test/qtest: add riscv-iommu-pci tests Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-06-05 17:34   ` Tomasz Jeznach
  2024-05-23 17:39 ` [PATCH v3 09/13] hw/riscv/riscv-iommu: add s-stage and g-stage support Daniel Henrique Barboza
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

The RISC-V IOMMU spec anticipates that the IOMMU can use translation
caches to hold entries from the DDT. This patch implements all cache
commands that were previously marked as 'not implemented'.

The cache includes some artifacts that anticipate s-stage and g-stage
elements, although we don't support them yet. We'll introduce them
next.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
 hw/riscv/riscv-iommu.c | 189 ++++++++++++++++++++++++++++++++++++++++-
 hw/riscv/riscv-iommu.h |   2 +
 2 files changed, 187 insertions(+), 4 deletions(-)

diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index 39b4ff1405..abf6ae7726 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -63,6 +63,16 @@ struct RISCVIOMMUContext {
     uint64_t msiptp;            /* MSI redirection page table pointer */
 };
 
+/* Address translation cache entry */
+struct RISCVIOMMUEntry {
+    uint64_t iova:44;           /* IOVA Page Number */
+    uint64_t pscid:20;          /* Process Soft-Context identifier */
+    uint64_t phys:44;           /* Physical Page Number */
+    uint64_t gscid:16;          /* Guest Soft-Context identifier */
+    uint64_t perm:2;            /* IOMMU_RW flags */
+    uint64_t __rfu:2;
+};
+
 /* IOMMU index for transactions without PASID specified. */
 #define RISCV_IOMMU_NOPASID 0
 
@@ -751,13 +761,125 @@ static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
     return &as->iova_as;
 }
 
+/* Translation Object cache support */
+static gboolean __iot_equal(gconstpointer v1, gconstpointer v2)
+{
+    RISCVIOMMUEntry *t1 = (RISCVIOMMUEntry *) v1;
+    RISCVIOMMUEntry *t2 = (RISCVIOMMUEntry *) v2;
+    return t1->gscid == t2->gscid && t1->pscid == t2->pscid &&
+           t1->iova == t2->iova;
+}
+
+static guint __iot_hash(gconstpointer v)
+{
+    RISCVIOMMUEntry *t = (RISCVIOMMUEntry *) v;
+    return (guint)t->iova;
+}
+
+/* GV: 1 PSCV: 1 AV: 1 */
+static void __iot_inval_pscid_iova(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
+    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
+    if (iot->gscid == arg->gscid &&
+        iot->pscid == arg->pscid &&
+        iot->iova == arg->iova) {
+        iot->perm = IOMMU_NONE;
+    }
+}
+
+/* GV: 1 PSCV: 1 AV: 0 */
+static void __iot_inval_pscid(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
+    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
+    if (iot->gscid == arg->gscid &&
+        iot->pscid == arg->pscid) {
+        iot->perm = IOMMU_NONE;
+    }
+}
+
+/* GV: 1 GVMA: 1 */
+static void __iot_inval_gscid_gpa(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
+    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
+    if (iot->gscid == arg->gscid) {
+        /* simplified cache, no GPA matching */
+        iot->perm = IOMMU_NONE;
+    }
+}
+
+/* GV: 1 GVMA: 0 */
+static void __iot_inval_gscid(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
+    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
+    if (iot->gscid == arg->gscid) {
+        iot->perm = IOMMU_NONE;
+    }
+}
+
+/* GV: 0 */
+static void __iot_inval_all(gpointer key, gpointer value, gpointer data)
+{
+    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
+    iot->perm = IOMMU_NONE;
+}
+
+/* caller should keep ref-count for iot_cache object */
+static RISCVIOMMUEntry *riscv_iommu_iot_lookup(RISCVIOMMUContext *ctx,
+    GHashTable *iot_cache, hwaddr iova)
+{
+    RISCVIOMMUEntry key = {
+        .pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID),
+        .iova  = PPN_DOWN(iova),
+    };
+    return g_hash_table_lookup(iot_cache, &key);
+}
+
+/* caller should keep ref-count for iot_cache object */
+static void riscv_iommu_iot_update(RISCVIOMMUState *s,
+    GHashTable *iot_cache, RISCVIOMMUEntry *iot)
+{
+    if (!s->iot_limit) {
+        return;
+    }
+
+    if (g_hash_table_size(s->iot_cache) >= s->iot_limit) {
+        iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
+                                          g_free, NULL);
+        g_hash_table_unref(qatomic_xchg(&s->iot_cache, iot_cache));
+    }
+    g_hash_table_add(iot_cache, iot);
+}
+
+static void riscv_iommu_iot_inval(RISCVIOMMUState *s, GHFunc func,
+    uint32_t gscid, uint32_t pscid, hwaddr iova)
+{
+    GHashTable *iot_cache;
+    RISCVIOMMUEntry key = {
+        .gscid = gscid,
+        .pscid = pscid,
+        .iova  = PPN_DOWN(iova),
+    };
+
+    iot_cache = g_hash_table_ref(s->iot_cache);
+    g_hash_table_foreach(iot_cache, func, &key);
+    g_hash_table_unref(iot_cache);
+}
+
 static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
-    IOMMUTLBEntry *iotlb)
+    IOMMUTLBEntry *iotlb, bool enable_cache)
 {
+    RISCVIOMMUEntry *iot;
+    IOMMUAccessFlags perm;
     bool enable_pasid;
     bool enable_pri;
+    GHashTable *iot_cache;
     int fault;
 
+    iot_cache = g_hash_table_ref(s->iot_cache);
     /*
      * TC[32] is reserved for custom extensions, used here to temporarily
      * enable automatic page-request generation for ATS queries.
@@ -765,9 +887,36 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
     enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
     enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
 
+    iot = riscv_iommu_iot_lookup(ctx, iot_cache, iotlb->iova);
+    perm = iot ? iot->perm : IOMMU_NONE;
+    if (perm != IOMMU_NONE) {
+        iotlb->translated_addr = PPN_PHYS(iot->phys);
+        iotlb->addr_mask = ~TARGET_PAGE_MASK;
+        iotlb->perm = perm;
+        fault = 0;
+        goto done;
+    }
+
     /* Translate using device directory / page table information. */
     fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
 
+    if (!fault && iotlb->target_as == &s->trap_as) {
+        /* Do not cache trapped MSI translations */
+        goto done;
+    }
+
+    if (!fault && iotlb->translated_addr != iotlb->iova && enable_cache) {
+        iot = g_new0(RISCVIOMMUEntry, 1);
+        iot->iova = PPN_DOWN(iotlb->iova);
+        iot->phys = PPN_DOWN(iotlb->translated_addr);
+        iot->pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID);
+        iot->perm = iotlb->perm;
+        riscv_iommu_iot_update(s, iot_cache, iot);
+    }
+
+done:
+    g_hash_table_unref(iot_cache);
+
     if (enable_pri && fault) {
         struct riscv_iommu_pq_record pr = {0};
         if (enable_pasid) {
@@ -907,13 +1056,40 @@ static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
             if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
                 /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
                 goto cmd_ill;
+            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_GV)) {
+                /* invalidate all cache mappings */
+                func = __iot_inval_all;
+            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_AV)) {
+                /* invalidate cache matching GSCID */
+                func = __iot_inval_gscid;
+            } else {
+                /* invalidate cache matching GSCID and ADDR (GPA) */
+                func = __iot_inval_gscid_gpa;
             }
-            /* translation cache not implemented yet */
+            riscv_iommu_iot_inval(s, func,
+                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_GSCID), 0,
+                cmd.dword1 & TARGET_PAGE_MASK);
             break;
 
         case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
                              RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
-            /* translation cache not implemented yet */
+            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_GV)) {
+                /* invalidate all cache mappings, simplified model */
+                func = __iot_inval_all;
+            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV)) {
+                /* invalidate cache matching GSCID, simplified model */
+                func = __iot_inval_gscid;
+            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_AV)) {
+                /* invalidate cache matching GSCID and PSCID */
+                func = __iot_inval_pscid;
+            } else {
+                /* invalidate cache matching GSCID and PSCID and ADDR (IOVA) */
+                func = __iot_inval_pscid_iova;
+            }
+            riscv_iommu_iot_inval(s, func,
+                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_GSCID),
+                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_PSCID),
+                cmd.dword1 & TARGET_PAGE_MASK);
             break;
 
         case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
@@ -1410,6 +1586,8 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
     /* Device translation context cache */
     s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
                                          g_free, NULL);
+    s->iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
+                                         g_free, NULL);
 
     s->iommus.le_next = NULL;
     s->iommus.le_prev = NULL;
@@ -1423,6 +1601,7 @@ static void riscv_iommu_unrealize(DeviceState *dev)
     RISCVIOMMUState *s = RISCV_IOMMU(dev);
 
     qemu_mutex_destroy(&s->core_lock);
+    g_hash_table_unref(s->iot_cache);
     g_hash_table_unref(s->ctx_cache);
 }
 
@@ -1430,6 +1609,8 @@ static Property riscv_iommu_properties[] = {
     DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
         RISCV_IOMMU_SPEC_DOT_VER),
     DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
+    DEFINE_PROP_UINT32("ioatc-limit", RISCVIOMMUState, iot_limit,
+        LIMIT_CACHE_IOT),
     DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
     DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
     DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
@@ -1482,7 +1663,7 @@ static IOMMUTLBEntry riscv_iommu_memory_region_translate(
         /* Translation disabled or invalid. */
         iotlb.addr_mask = 0;
         iotlb.perm = IOMMU_NONE;
-    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
+    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb, true)) {
         /* Translation disabled or fault reported. */
         iotlb.addr_mask = 0;
         iotlb.perm = IOMMU_NONE;
diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
index 31d3907d33..3afee9f3e8 100644
--- a/hw/riscv/riscv-iommu.h
+++ b/hw/riscv/riscv-iommu.h
@@ -68,6 +68,8 @@ struct RISCVIOMMUState {
     MemoryRegion trap_mr;
 
     GHashTable *ctx_cache;          /* Device translation Context Cache */
+    GHashTable *iot_cache;          /* IO Translated Address Cache */
+    unsigned iot_limit;             /* IO Translation Cache size limit */
 
     /* MMIO Hardware Interface */
     MemoryRegion regs_mr;
-- 
2.44.0




* [PATCH v3 09/13] hw/riscv/riscv-iommu: add s-stage and g-stage support
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (7 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 08/13] hw/riscv/riscv-iommu: add Address Translation Cache (IOATC) Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-06-18 10:30   ` Jason Chien
  2024-05-23 17:39 ` [PATCH v3 10/13] hw/riscv/riscv-iommu: add ATS support Daniel Henrique Barboza
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

Add support for s-stage (sv32, sv39, sv48, sv57 caps) and g-stage
(sv32x4, sv39x4, sv48x4, sv57x4 caps). Most of the work is done in the
riscv_iommu_spa_fetch() function, which now has to consider how many
translation stages are needed when walking the page tables.
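
For readers following the walk, the per-level index computation can be
sketched on its own. This is a minimal illustration and not code from the
patch: pt_index() and PAGE_BITS are hypothetical names, 4 KiB pages are
assumed, and only the index arithmetic is mirrored (the 2 extra bits at the
root level model the "x4" widening of g-stage tables).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12 /* assumption: 4 KiB pages */

/*
 * Hypothetical helper: page table index for one step of the walk.
 * levels/ptidxbits follow the mode (e.g. 3 and 9 for Sv39), step 0 is the
 * root level, and g_stage selects the widened root index.
 */
static unsigned pt_index(uint64_t addr, unsigned levels, unsigned ptidxbits,
                         unsigned step, int g_stage)
{
    assert(step < levels);

    /* G-stage root tables are four times larger: 2 extra index bits. */
    const unsigned widened = (g_stage && step == 0) ? 2 : 0;
    const unsigned va_bits = ptidxbits + widened;
    /* Number of address bits resolved by the levels below this one. */
    const unsigned va_skip = PAGE_BITS + ptidxbits * (levels - 1 - step);

    return (unsigned)((addr >> va_skip) & ((1ULL << va_bits) - 1));
}
```

With Sv39 parameters (levels = 3, ptidxbits = 9), the address 0x12345678
yields indexes 0, 145 and 325 for steps 0, 1 and 2.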

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
 hw/riscv/riscv-iommu-bits.h |  11 ++
 hw/riscv/riscv-iommu.c      | 331 +++++++++++++++++++++++++++++++++++-
 hw/riscv/riscv-iommu.h      |   2 +
 3 files changed, 336 insertions(+), 8 deletions(-)

diff --git a/hw/riscv/riscv-iommu-bits.h b/hw/riscv/riscv-iommu-bits.h
index f29b916acb..a4def7b8ec 100644
--- a/hw/riscv/riscv-iommu-bits.h
+++ b/hw/riscv/riscv-iommu-bits.h
@@ -71,6 +71,14 @@ struct riscv_iommu_pq_record {
 /* 5.3 IOMMU Capabilities (64bits) */
 #define RISCV_IOMMU_REG_CAP             0x0000
 #define RISCV_IOMMU_CAP_VERSION         GENMASK_ULL(7, 0)
+#define RISCV_IOMMU_CAP_SV32            BIT_ULL(8)
+#define RISCV_IOMMU_CAP_SV39            BIT_ULL(9)
+#define RISCV_IOMMU_CAP_SV48            BIT_ULL(10)
+#define RISCV_IOMMU_CAP_SV57            BIT_ULL(11)
+#define RISCV_IOMMU_CAP_SV32X4          BIT_ULL(16)
+#define RISCV_IOMMU_CAP_SV39X4          BIT_ULL(17)
+#define RISCV_IOMMU_CAP_SV48X4          BIT_ULL(18)
+#define RISCV_IOMMU_CAP_SV57X4          BIT_ULL(19)
 #define RISCV_IOMMU_CAP_MSI_FLAT        BIT_ULL(22)
 #define RISCV_IOMMU_CAP_MSI_MRIF        BIT_ULL(23)
 #define RISCV_IOMMU_CAP_T2GPA           BIT_ULL(26)
@@ -83,6 +91,7 @@ struct riscv_iommu_pq_record {
 /* 5.4 Features control register (32bits) */
 #define RISCV_IOMMU_REG_FCTL            0x0008
 #define RISCV_IOMMU_FCTL_WSI            BIT(1)
+#define RISCV_IOMMU_FCTL_GXL            BIT(2)
 
 /* 5.5 Device-directory-table pointer (64bits) */
 #define RISCV_IOMMU_REG_DDTP            0x0010
@@ -205,6 +214,8 @@ struct riscv_iommu_dc {
 #define RISCV_IOMMU_DC_TC_DTF           BIT_ULL(4)
 #define RISCV_IOMMU_DC_TC_PDTV          BIT_ULL(5)
 #define RISCV_IOMMU_DC_TC_PRPR          BIT_ULL(6)
+#define RISCV_IOMMU_DC_TC_GADE          BIT_ULL(7)
+#define RISCV_IOMMU_DC_TC_SADE          BIT_ULL(8)
 #define RISCV_IOMMU_DC_TC_DPE           BIT_ULL(9)
 #define RISCV_IOMMU_DC_TC_SBE           BIT_ULL(10)
 #define RISCV_IOMMU_DC_TC_SXL           BIT_ULL(11)
diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index abf6ae7726..11c418b548 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -58,6 +58,8 @@ struct RISCVIOMMUContext {
     uint64_t __rfu:20;          /* reserved */
     uint64_t tc;                /* Translation Control */
     uint64_t ta;                /* Translation Attributes */
+    uint64_t satp;              /* S-Stage address translation and protection */
+    uint64_t gatp;              /* G-Stage address translation and protection */
     uint64_t msi_addr_mask;     /* MSI filtering - address mask */
     uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
     uint64_t msiptp;            /* MSI redirection page table pointer */
@@ -201,12 +203,45 @@ static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
     return true;
 }
 
-/* RISCV IOMMU Address Translation Lookup - Page Table Walk */
+/*
+ * RISCV IOMMU Address Translation Lookup - Page Table Walk
+ *
+ * Note: Code is based on get_physical_address() from target/riscv/cpu_helper.c.
+ * Both implementations could be merged into a single helper function in the
+ * future. They are kept separate for now because error reporting and flow
+ * specifics differ enough to warrant separate implementations.
+ *
+ * @s        : IOMMU Device State
+ * @ctx      : Translation context for device id and process address space id.
+ * @iotlb    : translation data: physical address and access mode.
+ * @return   : success or fault cause code.
+ */
 static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
     IOMMUTLBEntry *iotlb)
 {
+    dma_addr_t addr, base;
+    uint64_t satp, gatp, pte;
+    bool en_s, en_g;
+    struct {
+        unsigned char step;
+        unsigned char levels;
+        unsigned char ptidxbits;
+        unsigned char ptesize;
+    } sc[2];
+    /* Translation stage phase */
+    enum {
+        S_STAGE = 0,
+        G_STAGE = 1,
+    } pass;
+
+    satp = get_field(ctx->satp, RISCV_IOMMU_ATP_MODE_FIELD);
+    gatp = get_field(ctx->gatp, RISCV_IOMMU_ATP_MODE_FIELD);
+
+    en_s = satp != RISCV_IOMMU_DC_FSC_MODE_BARE;
+    en_g = gatp != RISCV_IOMMU_DC_IOHGATP_MODE_BARE;
+
     /* Early check for MSI address match when IOVA == GPA */
-    if (iotlb->perm & IOMMU_WO &&
+    if ((iotlb->perm & IOMMU_WO) &&
         riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
         iotlb->target_as = &s->trap_as;
         iotlb->translated_addr = iotlb->iova;
@@ -215,11 +250,196 @@ static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
     }
 
     /* Exit early for pass-through mode. */
-    iotlb->translated_addr = iotlb->iova;
-    iotlb->addr_mask = ~TARGET_PAGE_MASK;
-    /* Allow R/W in pass-through mode */
-    iotlb->perm = IOMMU_RW;
-    return 0;
+    if (!(en_s || en_g)) {
+        iotlb->translated_addr = iotlb->iova;
+        iotlb->addr_mask = ~TARGET_PAGE_MASK;
+        /* Allow R/W in pass-through mode */
+        iotlb->perm = IOMMU_RW;
+        return 0;
+    }
+
+    /* S/G translation parameters. */
+    for (pass = 0; pass < 2; pass++) {
+        uint32_t sv_mode;
+
+        sc[pass].step = 0;
+        if (pass ? (s->fctl & RISCV_IOMMU_FCTL_GXL) :
+            (ctx->tc & RISCV_IOMMU_DC_TC_SXL)) {
+            /* 32bit mode for GXL/SXL == 1 */
+            switch (pass ? gatp : satp) {
+            case RISCV_IOMMU_DC_IOHGATP_MODE_BARE:
+                sc[pass].levels    = 0;
+                sc[pass].ptidxbits = 0;
+                sc[pass].ptesize   = 0;
+                break;
+            case RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4:
+                sv_mode = pass ? RISCV_IOMMU_CAP_SV32X4 : RISCV_IOMMU_CAP_SV32;
+                if (!(s->cap & sv_mode)) {
+                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+                }
+                sc[pass].levels    = 2;
+                sc[pass].ptidxbits = 10;
+                sc[pass].ptesize   = 4;
+                break;
+            default:
+                return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+            }
+        } else {
+            /* 64bit mode for GXL/SXL == 0 */
+            switch (pass ? gatp : satp) {
+            case RISCV_IOMMU_DC_IOHGATP_MODE_BARE:
+                sc[pass].levels    = 0;
+                sc[pass].ptidxbits = 0;
+                sc[pass].ptesize   = 0;
+                break;
+            case RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4:
+                sv_mode = pass ? RISCV_IOMMU_CAP_SV39X4 : RISCV_IOMMU_CAP_SV39;
+                if (!(s->cap & sv_mode)) {
+                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+                }
+                sc[pass].levels    = 3;
+                sc[pass].ptidxbits = 9;
+                sc[pass].ptesize   = 8;
+                break;
+            case RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4:
+                sv_mode = pass ? RISCV_IOMMU_CAP_SV48X4 : RISCV_IOMMU_CAP_SV48;
+                if (!(s->cap & sv_mode)) {
+                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+                }
+                sc[pass].levels    = 4;
+                sc[pass].ptidxbits = 9;
+                sc[pass].ptesize   = 8;
+                break;
+            case RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4:
+                sv_mode = pass ? RISCV_IOMMU_CAP_SV57X4 : RISCV_IOMMU_CAP_SV57;
+                if (!(s->cap & sv_mode)) {
+                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+                }
+                sc[pass].levels    = 5;
+                sc[pass].ptidxbits = 9;
+                sc[pass].ptesize   = 8;
+                break;
+            default:
+                return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
+            }
+        }
+    }
+
+    /* S/G stages translation tables root pointers */
+    gatp = PPN_PHYS(get_field(ctx->gatp, RISCV_IOMMU_ATP_PPN_FIELD));
+    satp = PPN_PHYS(get_field(ctx->satp, RISCV_IOMMU_ATP_PPN_FIELD));
+    addr = (en_s && en_g) ? satp : iotlb->iova;
+    base = en_g ? gatp : satp;
+    pass = en_g ? G_STAGE : S_STAGE;
+
+    do {
+        const unsigned widened = (pass && !sc[pass].step) ? 2 : 0;
+        const unsigned va_bits = widened + sc[pass].ptidxbits;
+        const unsigned va_skip = TARGET_PAGE_BITS + sc[pass].ptidxbits *
+                                 (sc[pass].levels - 1 - sc[pass].step);
+        const unsigned idx = (addr >> va_skip) & ((1 << va_bits) - 1);
+        const dma_addr_t pte_addr = base + idx * sc[pass].ptesize;
+        const bool ade =
+            ctx->tc & (pass ? RISCV_IOMMU_DC_TC_GADE : RISCV_IOMMU_DC_TC_SADE);
+
+        /* Address range check before first level lookup */
+        if (!sc[pass].step) {
+            const uint64_t va_mask = (1ULL << (va_skip + va_bits)) - 1;
+            if ((addr & va_mask) != addr) {
+                return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
+            }
+        }
+
+        /* Read page table entry */
+        if (dma_memory_read(s->target_as, pte_addr, &pte,
+                sc[pass].ptesize, MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
+            return (iotlb->perm & IOMMU_WO) ? RISCV_IOMMU_FQ_CAUSE_WR_FAULT
+                                            : RISCV_IOMMU_FQ_CAUSE_RD_FAULT;
+        }
+
+        if (sc[pass].ptesize == 4) {
+            pte = (uint64_t) le32_to_cpu(*((uint32_t *)&pte));
+        } else {
+            pte = le64_to_cpu(pte);
+        }
+
+        sc[pass].step++;
+        hwaddr ppn = pte >> PTE_PPN_SHIFT;
+
+        if (!(pte & PTE_V)) {
+            break;                /* Invalid PTE */
+        } else if (!(pte & (PTE_R | PTE_W | PTE_X))) {
+            base = PPN_PHYS(ppn); /* Inner PTE, continue walking */
+        } else if ((pte & (PTE_R | PTE_W | PTE_X)) == PTE_W) {
+            break;                /* Reserved leaf PTE flags: PTE_W */
+        } else if ((pte & (PTE_R | PTE_W | PTE_X)) == (PTE_W | PTE_X)) {
+            break;                /* Reserved leaf PTE flags: PTE_W + PTE_X */
+        } else if (ppn & ((1ULL << (va_skip - TARGET_PAGE_BITS)) - 1)) {
+            break;                /* Misaligned PPN */
+        } else if ((iotlb->perm & IOMMU_RO) && !(pte & PTE_R)) {
+            break;                /* Read access check failed */
+        } else if ((iotlb->perm & IOMMU_WO) && !(pte & PTE_W)) {
+            break;                /* Write access check failed */
+        } else if ((iotlb->perm & IOMMU_RO) && !ade && !(pte & PTE_A)) {
+            break;                /* Access bit not set */
+        } else if ((iotlb->perm & IOMMU_WO) && !ade && !(pte & PTE_D)) {
+            break;                /* Dirty bit not set */
+        } else {
+            /* Leaf PTE, translation completed. */
+            sc[pass].step = sc[pass].levels;
+            base = PPN_PHYS(ppn) | (addr & ((1ULL << va_skip) - 1));
+            /* Update address mask based on smallest translation granularity */
+            iotlb->addr_mask &= (1ULL << va_skip) - 1;
+            /* Continue with S-Stage translation? */
+            if (pass && sc[0].step != sc[0].levels) {
+                pass = S_STAGE;
+                addr = iotlb->iova;
+                continue;
+            }
+            /* Translation phase completed (GPA or SPA) */
+            iotlb->translated_addr = base;
+            iotlb->perm = (pte & PTE_W) ? ((pte & PTE_R) ? IOMMU_RW : IOMMU_WO)
+                                                         : IOMMU_RO;
+
+            /* Check MSI GPA address match */
+            if (pass == S_STAGE && (iotlb->perm & IOMMU_WO) &&
+                riscv_iommu_msi_check(s, ctx, base)) {
+                /* Trap MSI writes and return GPA address. */
+                iotlb->target_as = &s->trap_as;
+                iotlb->addr_mask = ~TARGET_PAGE_MASK;
+                return 0;
+            }
+
+            /* Continue with G-Stage translation? */
+            if (!pass && en_g) {
+                pass = G_STAGE;
+                addr = base;
+                base = gatp;
+                sc[pass].step = 0;
+                continue;
+            }
+
+            return 0;
+        }
+
+        if (sc[pass].step == sc[pass].levels) {
+            break; /* Can't find leaf PTE */
+        }
+
+        /* Continue with G-Stage translation? */
+        if (!pass && en_g) {
+            pass = G_STAGE;
+            addr = base;
+            base = gatp;
+            sc[pass].step = 0;
+        }
+    } while (1);
+
+    return (iotlb->perm & IOMMU_WO) ?
+                (pass ? RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS :
+                        RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S) :
+                (pass ? RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS :
+                        RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S);
 }
 
 static void riscv_iommu_report_fault(RISCVIOMMUState *s,
@@ -420,7 +640,7 @@ err:
 static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
                                             RISCVIOMMUContext *ctx)
 {
-    uint32_t msi_mode;
+    uint32_t fsc_mode, msi_mode;
 
     if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
         ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
@@ -441,6 +661,58 @@ static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
         }
     }
 
+    fsc_mode = get_field(ctx->satp, RISCV_IOMMU_DC_FSC_MODE);
+
+    if (ctx->tc & RISCV_IOMMU_DC_TC_PDTV) {
+        switch (fsc_mode) {
+        case RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8:
+            if (!(s->cap & RISCV_IOMMU_CAP_PD8)) {
+                return false;
+            }
+            break;
+        case RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17:
+            if (!(s->cap & RISCV_IOMMU_CAP_PD17)) {
+                return false;
+            }
+            break;
+        case RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20:
+            if (!(s->cap & RISCV_IOMMU_CAP_PD20)) {
+                return false;
+            }
+            break;
+        }
+    } else {
+        /* DC.tc.PDTV is 0 */
+        if (ctx->tc & RISCV_IOMMU_DC_TC_DPE) {
+            return false;
+        }
+
+        if (ctx->tc & RISCV_IOMMU_DC_TC_SXL) {
+            if (fsc_mode == RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 &&
+                !(s->cap & RISCV_IOMMU_CAP_SV32)) {
+                return false;
+            }
+        } else {
+            switch (fsc_mode) {
+            case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39:
+                if (!(s->cap & RISCV_IOMMU_CAP_SV39)) {
+                    return false;
+                }
+                break;
+            case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48:
+                if (!(s->cap & RISCV_IOMMU_CAP_SV48)) {
+                    return false;
+                }
+                break;
+            case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57:
+                if (!(s->cap & RISCV_IOMMU_CAP_SV57)) {
+                    return false;
+                }
+                break;
+            }
+        }
+    }
+
     /*
      * CAP_END is always zero (only one endianess). FCTL_BE is
      * always zero (little-endian accesses). Thus TC_SBE must
@@ -478,6 +750,10 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
 
     case RISCV_IOMMU_DDTP_MODE_BARE:
         /* mock up pass-through translation context */
+        ctx->gatp = set_field(0, RISCV_IOMMU_ATP_MODE_FIELD,
+            RISCV_IOMMU_DC_IOHGATP_MODE_BARE);
+        ctx->satp = set_field(0, RISCV_IOMMU_ATP_MODE_FIELD,
+            RISCV_IOMMU_DC_FSC_MODE_BARE);
         ctx->tc = RISCV_IOMMU_DC_TC_V;
         ctx->ta = 0;
         ctx->msiptp = 0;
@@ -551,6 +827,8 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
 
     /* Set translation context. */
     ctx->tc = le64_to_cpu(dc.tc);
+    ctx->gatp = le64_to_cpu(dc.iohgatp);
+    ctx->satp = le64_to_cpu(dc.fsc);
     ctx->ta = le64_to_cpu(dc.ta);
     ctx->msiptp = le64_to_cpu(dc.msiptp);
     ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
@@ -564,14 +842,38 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
         return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
     }
 
+    /* FSC field checks */
+    mode = get_field(ctx->satp, RISCV_IOMMU_DC_FSC_MODE);
+    addr = PPN_PHYS(get_field(ctx->satp, RISCV_IOMMU_DC_FSC_PPN));
+
+    if (mode == RISCV_IOMMU_DC_FSC_MODE_BARE) {
+        /* No S-Stage translation, done. */
+        return 0;
+    }
+
     if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
         if (ctx->pasid != RISCV_IOMMU_NOPASID) {
             /* PASID is disabled */
             return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
         }
+        if (mode > RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57) {
+            /* Invalid translation mode */
+            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
+        }
         return 0;
     }
 
+    if (ctx->pasid == RISCV_IOMMU_NOPASID) {
+        if (!(ctx->tc & RISCV_IOMMU_DC_TC_DPE)) {
+            /* No default PASID enabled, set BARE mode */
+            ctx->satp = 0ULL;
+            return 0;
+        } else {
+            /* Use default PASID #0 */
+            ctx->pasid = 0;
+        }
+    }
+
     /* FSC.TC.PDTV enabled */
     if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
         /* Invalid PDTP.MODE */
@@ -605,6 +907,7 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
 
     /* Use FSC and TA from process directory entry. */
     ctx->ta = le64_to_cpu(dc.ta);
+    ctx->satp = le64_to_cpu(dc.fsc);
 
     return 0;
 }
@@ -832,6 +1135,7 @@ static RISCVIOMMUEntry *riscv_iommu_iot_lookup(RISCVIOMMUContext *ctx,
     GHashTable *iot_cache, hwaddr iova)
 {
     RISCVIOMMUEntry key = {
+        .gscid = get_field(ctx->gatp, RISCV_IOMMU_DC_IOHGATP_GSCID),
         .pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID),
         .iova  = PPN_DOWN(iova),
     };
@@ -909,6 +1213,7 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
         iot = g_new0(RISCVIOMMUEntry, 1);
         iot->iova = PPN_DOWN(iotlb->iova);
         iot->phys = PPN_DOWN(iotlb->translated_addr);
+        iot->gscid = get_field(ctx->gatp, RISCV_IOMMU_DC_IOHGATP_GSCID);
         iot->pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID);
         iot->perm = iotlb->perm;
         riscv_iommu_iot_update(s, iot_cache, iot);
@@ -1513,6 +1818,14 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
     if (s->enable_msi) {
         s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
     }
+    if (s->enable_s_stage) {
+        s->cap |= RISCV_IOMMU_CAP_SV32 | RISCV_IOMMU_CAP_SV39 |
+                  RISCV_IOMMU_CAP_SV48 | RISCV_IOMMU_CAP_SV57;
+    }
+    if (s->enable_g_stage) {
+        s->cap |= RISCV_IOMMU_CAP_SV32X4 | RISCV_IOMMU_CAP_SV39X4 |
+                  RISCV_IOMMU_CAP_SV48X4 | RISCV_IOMMU_CAP_SV57X4;
+    }
     /* Report QEMU target physical address space limits */
     s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
                        TARGET_PHYS_ADDR_SPACE_BITS);
@@ -1613,6 +1926,8 @@ static Property riscv_iommu_properties[] = {
         LIMIT_CACHE_IOT),
     DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
     DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
+    DEFINE_PROP_BOOL("s-stage", RISCVIOMMUState, enable_s_stage, TRUE),
+    DEFINE_PROP_BOOL("g-stage", RISCVIOMMUState, enable_g_stage, TRUE),
     DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
         TYPE_MEMORY_REGION, MemoryRegion *),
     DEFINE_PROP_END_OF_LIST(),
diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
index 3afee9f3e8..c24e3e4c16 100644
--- a/hw/riscv/riscv-iommu.h
+++ b/hw/riscv/riscv-iommu.h
@@ -38,6 +38,8 @@ struct RISCVIOMMUState {
 
     bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
     bool enable_msi;      /* Enable MSI remapping */
+    bool enable_s_stage;  /* Enable S/VS-Stage translation */
+    bool enable_g_stage;  /* Enable G-Stage translation */
 
     /* IOMMU Internal State */
     uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */
-- 
2.44.0




* [PATCH v3 10/13] hw/riscv/riscv-iommu: add ATS support
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (8 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 09/13] hw/riscv/riscv-iommu: add s-stage and g-stage support Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-06-09  9:06   ` Frank Chang
  2024-05-23 17:39 ` [PATCH v3 11/13] hw/riscv/riscv-iommu: add DBG support Daniel Henrique Barboza
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

Add PCIe Address Translation Services (ATS) capabilities to the IOMMU.
This will add support for ATS translation requests in Fault/Event
queues, Page-request queue and IOATC invalidations.
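
As a reading aid, the decoding of an ATS command header can be sketched in
isolation. This is an illustration under assumptions, not the patch's code:
the bit positions are taken from the RISCV_IOMMU_CMD_ATS_* definitions in
this patch, while GENMASK64(), field_get(), struct ats_target and
ats_decode() are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits h..l set, inclusive (hypothetical stand-in for GENMASK_ULL). */
#define GENMASK64(h, l) (((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

#define ATS_PID   GENMASK64(31, 12)
#define ATS_PV    (1ULL << 32)
#define ATS_DSV   (1ULL << 33)
#define ATS_RID   GENMASK64(55, 40)
#define ATS_DSEG  GENMASK64(63, 56)

/* Extract a field by dividing by the lowest set bit of its mask. */
static uint64_t field_get(uint64_t reg, uint64_t mask)
{
    assert(mask != 0);
    return (reg & mask) / (mask & ~(mask << 1));
}

struct ats_target {
    uint32_t devid;  /* requester id, optionally prefixed by the segment */
    uint32_t pasid;  /* process id, meaningful only when pv is set */
    bool pv;
};

/* Hypothetical decoder for dword0 of an ATS.INVAL / ATS.PRGR command. */
static struct ats_target ats_decode(uint64_t dword0)
{
    struct ats_target t;

    /* DSEG and RID are adjacent, so one combined mask gives a 24-bit id. */
    t.devid = (dword0 & ATS_DSV)
            ? (uint32_t)field_get(dword0, ATS_DSEG | ATS_RID)
            : (uint32_t)field_get(dword0, ATS_RID);
    t.pasid = (uint32_t)field_get(dword0, ATS_PID);
    t.pv = (dword0 & ATS_PV) != 0;
    return t;
}
```

When DSV is clear the segment bits are ignored, which matches the two
get_field() branches in riscv_iommu_ats().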

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
 hw/riscv/riscv-iommu-bits.h |  43 +++++++++++-
 hw/riscv/riscv-iommu.c      | 129 +++++++++++++++++++++++++++++++++++-
 hw/riscv/riscv-iommu.h      |   1 +
 hw/riscv/trace-events       |   3 +
 4 files changed, 173 insertions(+), 3 deletions(-)

diff --git a/hw/riscv/riscv-iommu-bits.h b/hw/riscv/riscv-iommu-bits.h
index a4def7b8ec..e253b29b16 100644
--- a/hw/riscv/riscv-iommu-bits.h
+++ b/hw/riscv/riscv-iommu-bits.h
@@ -81,6 +81,7 @@ struct riscv_iommu_pq_record {
 #define RISCV_IOMMU_CAP_SV57X4          BIT_ULL(19)
 #define RISCV_IOMMU_CAP_MSI_FLAT        BIT_ULL(22)
 #define RISCV_IOMMU_CAP_MSI_MRIF        BIT_ULL(23)
+#define RISCV_IOMMU_CAP_ATS             BIT_ULL(25)
 #define RISCV_IOMMU_CAP_T2GPA           BIT_ULL(26)
 #define RISCV_IOMMU_CAP_IGS             GENMASK_ULL(29, 28)
 #define RISCV_IOMMU_CAP_PAS             GENMASK_ULL(37, 32)
@@ -209,6 +210,7 @@ struct riscv_iommu_dc {
 
 /* Translation control fields */
 #define RISCV_IOMMU_DC_TC_V             BIT_ULL(0)
+#define RISCV_IOMMU_DC_TC_EN_ATS        BIT_ULL(1)
 #define RISCV_IOMMU_DC_TC_EN_PRI        BIT_ULL(2)
 #define RISCV_IOMMU_DC_TC_T2GPA         BIT_ULL(3)
 #define RISCV_IOMMU_DC_TC_DTF           BIT_ULL(4)
@@ -270,6 +272,20 @@ struct riscv_iommu_command {
 #define RISCV_IOMMU_CMD_IODIR_DV        BIT_ULL(33)
 #define RISCV_IOMMU_CMD_IODIR_DID       GENMASK_ULL(63, 40)
 
+/* 3.1.4 I/O MMU PCIe ATS */
+#define RISCV_IOMMU_CMD_ATS_OPCODE              4
+#define RISCV_IOMMU_CMD_ATS_FUNC_INVAL          0
+#define RISCV_IOMMU_CMD_ATS_FUNC_PRGR           1
+#define RISCV_IOMMU_CMD_ATS_PID         GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_ATS_PV          BIT_ULL(32)
+#define RISCV_IOMMU_CMD_ATS_DSV         BIT_ULL(33)
+#define RISCV_IOMMU_CMD_ATS_RID         GENMASK_ULL(55, 40)
+#define RISCV_IOMMU_CMD_ATS_DSEG        GENMASK_ULL(63, 56)
+/* dword1 is the ATS payload, two different payload types for INVAL and PRGR */
+
+/* ATS.PRGR payload */
+#define RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE      GENMASK_ULL(47, 44)
+
 enum riscv_iommu_dc_fsc_atp_modes {
     RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
     RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
@@ -334,7 +350,32 @@ enum riscv_iommu_fq_ttypes {
     RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
     RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
     RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
-    RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 8,
+    RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ = 8,
+    RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 9,
+};
+
+/* Header fields */
+#define RISCV_IOMMU_PREQ_HDR_PID        GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_PREQ_HDR_PV         BIT_ULL(32)
+#define RISCV_IOMMU_PREQ_HDR_PRIV       BIT_ULL(33)
+#define RISCV_IOMMU_PREQ_HDR_EXEC       BIT_ULL(34)
+#define RISCV_IOMMU_PREQ_HDR_DID        GENMASK_ULL(63, 40)
+
+/* Payload fields */
+#define RISCV_IOMMU_PREQ_PAYLOAD_R      BIT_ULL(0)
+#define RISCV_IOMMU_PREQ_PAYLOAD_W      BIT_ULL(1)
+#define RISCV_IOMMU_PREQ_PAYLOAD_L      BIT_ULL(2)
+#define RISCV_IOMMU_PREQ_PAYLOAD_M      GENMASK_ULL(2, 0)
+#define RISCV_IOMMU_PREQ_PRG_INDEX      GENMASK_ULL(11, 3)
+#define RISCV_IOMMU_PREQ_UADDR          GENMASK_ULL(63, 12)
+
+
+/*
+ * struct riscv_iommu_msi_pte - MSI Page Table Entry
+ */
+struct riscv_iommu_msi_pte {
+    uint64_t pte;
+    uint64_t mrif_info;
 };
 
 /* Fields on pte */
diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index 11c418b548..3516b82081 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -641,6 +641,20 @@ static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
                                             RISCVIOMMUContext *ctx)
 {
     uint32_t fsc_mode, msi_mode;
+    uint64_t gatp;
+
+    if (!(s->cap & RISCV_IOMMU_CAP_ATS) &&
+        (ctx->tc & RISCV_IOMMU_DC_TC_EN_ATS ||
+         ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI ||
+         ctx->tc & RISCV_IOMMU_DC_TC_PRPR)) {
+        return false;
+    }
+
+    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_ATS) &&
+        (ctx->tc & RISCV_IOMMU_DC_TC_T2GPA ||
+         ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI)) {
+        return false;
+    }
 
     if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
         ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
@@ -661,6 +675,12 @@ static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
         }
     }
 
+    gatp = get_field(ctx->gatp, RISCV_IOMMU_ATP_MODE_FIELD);
+    if (ctx->tc & RISCV_IOMMU_DC_TC_T2GPA &&
+        gatp == RISCV_IOMMU_DC_IOHGATP_MODE_BARE) {
+        return false;
+    }
+
     fsc_mode = get_field(ctx->satp, RISCV_IOMMU_DC_FSC_MODE);
 
     if (ctx->tc & RISCV_IOMMU_DC_TC_PDTV) {
@@ -754,7 +774,12 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
             RISCV_IOMMU_DC_IOHGATP_MODE_BARE);
         ctx->satp = set_field(0, RISCV_IOMMU_ATP_MODE_FIELD,
             RISCV_IOMMU_DC_FSC_MODE_BARE);
+
         ctx->tc = RISCV_IOMMU_DC_TC_V;
+        if (s->enable_ats) {
+            ctx->tc |= RISCV_IOMMU_DC_TC_EN_ATS;
+        }
+
         ctx->ta = 0;
         ctx->msiptp = 0;
         return 0;
@@ -1191,6 +1216,16 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
     enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
     enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
 
+    /* Check for ATS request. */
+    if (iotlb->perm == IOMMU_NONE) {
+        /* Check if ATS is disabled. */
+        if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_ATS)) {
+            enable_pri = false;
+            fault = RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
+            goto done;
+        }
+    }
+
     iot = riscv_iommu_iot_lookup(ctx, iot_cache, iotlb->iova);
     perm = iot ? iot->perm : IOMMU_NONE;
     if (perm != IOMMU_NONE) {
@@ -1236,11 +1271,11 @@ done:
     }
 
     if (fault) {
-        unsigned ttype;
+        unsigned ttype = RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ;
 
         if (iotlb->perm & IOMMU_RW) {
             ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
-        } else {
+        } else if (iotlb->perm & IOMMU_RO) {
             ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
         }
 
@@ -1268,6 +1303,73 @@ static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
         MEMTXATTRS_UNSPECIFIED);
 }
 
+static void riscv_iommu_ats(RISCVIOMMUState *s,
+    struct riscv_iommu_command *cmd, IOMMUNotifierFlag flag,
+    IOMMUAccessFlags perm,
+    void (*trace_fn)(const char *id))
+{
+    RISCVIOMMUSpace *as = NULL;
+    IOMMUNotifier *n;
+    IOMMUTLBEvent event;
+    uint32_t pasid;
+    uint32_t devid;
+    const bool pv = cmd->dword0 & RISCV_IOMMU_CMD_ATS_PV;
+
+    if (cmd->dword0 & RISCV_IOMMU_CMD_ATS_DSV) {
+        /* Use device segment and requester id */
+        devid = get_field(cmd->dword0,
+            RISCV_IOMMU_CMD_ATS_DSEG | RISCV_IOMMU_CMD_ATS_RID);
+    } else {
+        devid = get_field(cmd->dword0, RISCV_IOMMU_CMD_ATS_RID);
+    }
+
+    pasid = get_field(cmd->dword0, RISCV_IOMMU_CMD_ATS_PID);
+
+    qemu_mutex_lock(&s->core_lock);
+    QLIST_FOREACH(as, &s->spaces, list) {
+        if (as->devid == devid) {
+            break;
+        }
+    }
+    qemu_mutex_unlock(&s->core_lock);
+
+    if (!as || !as->notifier) {
+        return;
+    }
+
+    event.type = flag;
+    event.entry.perm = perm;
+    event.entry.target_as = s->target_as;
+
+    IOMMU_NOTIFIER_FOREACH(n, &as->iova_mr) {
+        if (!pv || n->iommu_idx == pasid) {
+            event.entry.iova = n->start;
+            event.entry.addr_mask = n->end - n->start;
+            trace_fn(as->iova_mr.parent_obj.name);
+            memory_region_notify_iommu_one(n, &event);
+        }
+    }
+}
+
+static void riscv_iommu_ats_inval(RISCVIOMMUState *s,
+    struct riscv_iommu_command *cmd)
+{
+    riscv_iommu_ats(s, cmd, IOMMU_NOTIFIER_DEVIOTLB_UNMAP, IOMMU_NONE,
+                    trace_riscv_iommu_ats_inval);
+}
+
+static void riscv_iommu_ats_prgr(RISCVIOMMUState *s,
+    struct riscv_iommu_command *cmd)
+{
+    unsigned resp_code = get_field(cmd->dword1,
+                                   RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE);
+
+    /* Using the access flag to carry response code information */
+    IOMMUAccessFlags perm = resp_code ? IOMMU_NONE : IOMMU_RW;
+    return riscv_iommu_ats(s, cmd, IOMMU_NOTIFIER_MAP, perm,
+                           trace_riscv_iommu_ats_prgr);
+}
+
 static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
 {
     uint64_t old_ddtp = s->ddtp;
@@ -1423,6 +1525,25 @@ static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
                 get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
             break;
 
+        /* ATS commands */
+        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_ATS_FUNC_INVAL,
+                             RISCV_IOMMU_CMD_ATS_OPCODE):
+            if (!s->enable_ats) {
+                goto cmd_ill;
+            }
+
+            riscv_iommu_ats_inval(s, &cmd);
+            break;
+
+        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_ATS_FUNC_PRGR,
+                             RISCV_IOMMU_CMD_ATS_OPCODE):
+            if (!s->enable_ats) {
+                goto cmd_ill;
+            }
+
+            riscv_iommu_ats_prgr(s, &cmd);
+            break;
+
         default:
         cmd_ill:
             /* Invalid instruction, do not advance instruction index. */
@@ -1818,6 +1939,9 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
     if (s->enable_msi) {
         s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
     }
+    if (s->enable_ats) {
+        s->cap |= RISCV_IOMMU_CAP_ATS;
+    }
     if (s->enable_s_stage) {
         s->cap |= RISCV_IOMMU_CAP_SV32 | RISCV_IOMMU_CAP_SV39 |
                   RISCV_IOMMU_CAP_SV48 | RISCV_IOMMU_CAP_SV57;
@@ -1925,6 +2049,7 @@ static Property riscv_iommu_properties[] = {
     DEFINE_PROP_UINT32("ioatc-limit", RISCVIOMMUState, iot_limit,
         LIMIT_CACHE_IOT),
     DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
+    DEFINE_PROP_BOOL("ats", RISCVIOMMUState, enable_ats, TRUE),
     DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
     DEFINE_PROP_BOOL("s-stage", RISCVIOMMUState, enable_s_stage, TRUE),
     DEFINE_PROP_BOOL("g-stage", RISCVIOMMUState, enable_g_stage, TRUE),
diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
index c24e3e4c16..26236c3cee 100644
--- a/hw/riscv/riscv-iommu.h
+++ b/hw/riscv/riscv-iommu.h
@@ -38,6 +38,7 @@ struct RISCVIOMMUState {
 
     bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
     bool enable_msi;      /* Enable MSI remapping */
+    bool enable_ats;      /* Enable ATS support */
     bool enable_s_stage;  /* Enable S/VS-Stage translation */
     bool enable_g_stage;  /* Enable G-Stage translation */
 
diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
index 42a97caffa..4b486b6420 100644
--- a/hw/riscv/trace-events
+++ b/hw/riscv/trace-events
@@ -9,3 +9,6 @@ riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iov
 riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
 riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
 riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
+riscv_iommu_ats(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: translate request %04x:%02x.%u iova: 0x%"PRIx64
+riscv_iommu_ats_inval(const char *id) "%s: dev-iotlb invalidate"
+riscv_iommu_ats_prgr(const char *id) "%s: dev-iotlb page request group response"
-- 
2.44.0




* [PATCH v3 11/13] hw/riscv/riscv-iommu: add DBG support
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (9 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 10/13] hw/riscv/riscv-iommu: add ATS support Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-06-09  9:09   ` Frank Chang
  2024-05-23 17:39 ` [PATCH v3 12/13] hw/riscv/riscv-iommu: Add another irq for mrif notifications Daniel Henrique Barboza
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Tomasz Jeznach <tjeznach@rivosinc.com>

DBG support adds three additional registers: tr_req_iova, tr_req_ctl and
tr_response.

The DBG cap is always enabled. No on/off toggle is provided for it.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
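Note (not part of the commit message): the sketch below illustrates how
a tr_response value could be composed for the fault and success cases.
It is a minimal illustration under assumptions, not the code in this
patch: the macro and function names are made up for the example, the
cause is placed at bit 10 as the patch does, and placing the PPN in
bits 53:10 is my reading of the register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; bit positions are assumptions for this sketch. */
#define TR_RESPONSE_FAULT   (1ULL << 0)
#define TR_RESPONSE_S       (1ULL << 9)
#define TR_RESPONSE_PPN     (((1ULL << 44) - 1) << 10)  /* assumed bits 53:10 */
#define PAGE_SHIFT_4K       12

/* Fault case: set the fault bit and carry the cause above bit 10. */
static uint64_t tr_response_fault(unsigned cause)
{
    return TR_RESPONSE_FAULT | ((uint64_t)cause << 10);
}

/* Success case: report the 4 KiB page number, never a superpage. */
static uint64_t tr_response_ok(uint64_t phys_addr)
{
    uint64_t ppn = phys_addr >> PAGE_SHIFT_4K;

    assert(ppn < (1ULL << 44));
    return ((ppn << 10) & TR_RESPONSE_PPN) & ~TR_RESPONSE_S;
}
```

Software would write tr_req_iova, then tr_req_ctl with go/busy set, and
read tr_response once go/busy reads back as zero.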
 hw/riscv/riscv-iommu-bits.h | 17 +++++++++++
 hw/riscv/riscv-iommu.c      | 59 +++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/hw/riscv/riscv-iommu-bits.h b/hw/riscv/riscv-iommu-bits.h
index e253b29b16..f143c4a926 100644
--- a/hw/riscv/riscv-iommu-bits.h
+++ b/hw/riscv/riscv-iommu-bits.h
@@ -84,6 +84,7 @@ struct riscv_iommu_pq_record {
 #define RISCV_IOMMU_CAP_ATS             BIT_ULL(25)
 #define RISCV_IOMMU_CAP_T2GPA           BIT_ULL(26)
 #define RISCV_IOMMU_CAP_IGS             GENMASK_ULL(29, 28)
+#define RISCV_IOMMU_CAP_DBG             BIT_ULL(31)
 #define RISCV_IOMMU_CAP_PAS             GENMASK_ULL(37, 32)
 #define RISCV_IOMMU_CAP_PD8             BIT_ULL(38)
 #define RISCV_IOMMU_CAP_PD17            BIT_ULL(39)
@@ -185,6 +186,22 @@ enum {
     RISCV_IOMMU_INTR_COUNT
 };
 
+/* 5.24 Translation request IOVA (64bits) */
+#define RISCV_IOMMU_REG_TR_REQ_IOVA     0x0258
+
+/* 5.25 Translation request control (64bits) */
+#define RISCV_IOMMU_REG_TR_REQ_CTL      0x0260
+#define RISCV_IOMMU_TR_REQ_CTL_GO_BUSY  BIT_ULL(0)
+#define RISCV_IOMMU_TR_REQ_CTL_NW       BIT_ULL(3)
+#define RISCV_IOMMU_TR_REQ_CTL_PID      GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_TR_REQ_CTL_DID      GENMASK_ULL(63, 40)
+
+/* 5.26 Translation request response (64bits) */
+#define RISCV_IOMMU_REG_TR_RESPONSE     0x0268
+#define RISCV_IOMMU_TR_RESPONSE_FAULT   BIT_ULL(0)
+#define RISCV_IOMMU_TR_RESPONSE_S       BIT_ULL(9)
+#define RISCV_IOMMU_TR_RESPONSE_PPN     RISCV_IOMMU_PPN_FIELD
+
 /* 5.27 Interrupt cause to vector (64bits) */
 #define RISCV_IOMMU_REG_IVEC            0x02F8
 
diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index 3516b82081..52f0851895 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -1655,6 +1655,50 @@ static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
     riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
 }
 
+static void riscv_iommu_process_dbg(RISCVIOMMUState *s)
+{
+    uint64_t iova = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_TR_REQ_IOVA);
+    uint64_t ctrl = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_TR_REQ_CTL);
+    unsigned devid = get_field(ctrl, RISCV_IOMMU_TR_REQ_CTL_DID);
+    unsigned pid = get_field(ctrl, RISCV_IOMMU_TR_REQ_CTL_PID);
+    RISCVIOMMUContext *ctx;
+    void *ref;
+
+    if (!(ctrl & RISCV_IOMMU_TR_REQ_CTL_GO_BUSY)) {
+        return;
+    }
+
+    ctx = riscv_iommu_ctx(s, devid, pid, &ref);
+    if (ctx == NULL) {
+        riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_TR_RESPONSE,
+                                 RISCV_IOMMU_TR_RESPONSE_FAULT |
+                                 (RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED << 10));
+    } else {
+        IOMMUTLBEntry iotlb = {
+            .iova = iova,
+            .perm = ctrl & RISCV_IOMMU_TR_REQ_CTL_NW ? IOMMU_RO : IOMMU_RW,
+            .addr_mask = ~0,
+            .target_as = NULL,
+        };
+        int fault = riscv_iommu_translate(s, ctx, &iotlb, false);
+        if (fault) {
+            iova = RISCV_IOMMU_TR_RESPONSE_FAULT | (((uint64_t) fault) << 10);
+        } else {
+            iova = iotlb.translated_addr & ~iotlb.addr_mask;
+            iova >>= TARGET_PAGE_BITS;
+            iova &= RISCV_IOMMU_TR_RESPONSE_PPN;
+
+            /* We do not support superpages (> 4 KiB) for now */
+            iova &= ~RISCV_IOMMU_TR_RESPONSE_S;
+        }
+        riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_TR_RESPONSE, iova);
+    }
+
+    riscv_iommu_reg_mod64(s, RISCV_IOMMU_REG_TR_REQ_CTL, 0,
+        RISCV_IOMMU_TR_REQ_CTL_GO_BUSY);
+    riscv_iommu_ctx_put(s, ref);
+}
+
 typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
 
 static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
@@ -1778,6 +1822,12 @@ static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
 
         return MEMTX_OK;
 
+    case RISCV_IOMMU_REG_TR_REQ_CTL:
+        process_fn = riscv_iommu_process_dbg;
+        regb = RISCV_IOMMU_REG_TR_REQ_CTL;
+        busy = RISCV_IOMMU_TR_REQ_CTL_GO_BUSY;
+        break;
+
     default:
         break;
     }
@@ -1950,6 +2000,9 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
         s->cap |= RISCV_IOMMU_CAP_SV32X4 | RISCV_IOMMU_CAP_SV39X4 |
                   RISCV_IOMMU_CAP_SV48X4 | RISCV_IOMMU_CAP_SV57X4;
     }
+    /* Enable translation debug interface */
+    s->cap |= RISCV_IOMMU_CAP_DBG;
+
     /* Report QEMU target physical address space limits */
     s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
                        TARGET_PHYS_ADDR_SPACE_BITS);
@@ -2004,6 +2057,12 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
     stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
     stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
     stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
+    /* Initialize debug registers if DBG is enabled. */
+    if (s->cap & RISCV_IOMMU_CAP_DBG) {
+        stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_TR_REQ_IOVA], 0);
+        stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_TR_REQ_CTL],
+            RISCV_IOMMU_TR_REQ_CTL_GO_BUSY);
+    }
 
     /* Memory region for downstream access, if specified. */
     if (s->target_mr) {
-- 
2.44.0




* [PATCH v3 12/13] hw/riscv/riscv-iommu: Add another irq for mrif notifications
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (10 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 11/13] hw/riscv/riscv-iommu: add DBG support Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-05-23 17:39 ` [PATCH v3 13/13] qtest/riscv-iommu-test: add init queues test Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

From: Andrew Jones <ajones@ventanamicro.com>

Add another IRQ for MRIF notifications, and add an MRIF notification
trace event.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
 hw/riscv/riscv-iommu-pci.c | 2 +-
 hw/riscv/riscv-iommu.c     | 1 +
 hw/riscv/trace-events      | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/riscv/riscv-iommu-pci.c b/hw/riscv/riscv-iommu-pci.c
index 7635cc64ff..ad3df8ffe6 100644
--- a/hw/riscv/riscv-iommu-pci.c
+++ b/hw/riscv/riscv-iommu-pci.c
@@ -80,7 +80,7 @@ static void riscv_iommu_pci_realize(PCIDevice *dev, Error **errp)
     pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
                      PCI_BASE_ADDRESS_MEM_TYPE_64, &s->bar0);
 
-    int ret = msix_init(dev, RISCV_IOMMU_INTR_COUNT,
+    int ret = msix_init(dev, RISCV_IOMMU_INTR_COUNT + 1,
                         &s->bar0, 0, RISCV_IOMMU_REG_MSI_CONFIG,
                         &s->bar0, 0, RISCV_IOMMU_REG_MSI_CONFIG + 256, 0, &err);
 
diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index 52f0851895..a27f56419a 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -623,6 +623,7 @@ static MemTxResult riscv_iommu_msi_write(RISCVIOMMUState *s,
         cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
         goto err;
     }
+    trace_riscv_iommu_mrif_notification(s->parent_obj.id, n190, addr);
 
     return MEMTX_OK;
 
diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
index 4b486b6420..d69719a27a 100644
--- a/hw/riscv/trace-events
+++ b/hw/riscv/trace-events
@@ -6,6 +6,7 @@ riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t rea
 riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
 riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
 riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
+riscv_iommu_mrif_notification(const char *id, uint32_t nid, uint64_t phys) "%s: sent MRIF notification 0x%x to 0x%"PRIx64
 riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
 riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
 riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
-- 
2.44.0




* [PATCH v3 13/13] qtest/riscv-iommu-test: add init queues test
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (11 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 12/13] hw/riscv/riscv-iommu: Add another irq for mrif notifications Daniel Henrique Barboza
@ 2024-05-23 17:39 ` Daniel Henrique Barboza
  2024-06-10  0:34 ` [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Alistair Francis
  2024-06-11  1:51 ` LIU Zhiwei
  14 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-05-23 17:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Daniel Henrique Barboza

Add an additional test to further exercise the IOMMU where we attempt to
initialize the command, fault and page-request queues.

These steps are taken from chapter 6.2 of the RISC-V IOMMU spec,
"Guidelines for initialization". It emulates what we expect from the
software/OS when initializing the IOMMU.

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Frank Chang <frank.chang@sifive.com>
---
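Note (not part of the commit message): a minimal sketch of how a queue
base register value (cqb/fqb/pqb) might be built from the field offsets
defined in this patch (PPN at bit 10, 44 bits wide; LOG2SZ-1 at bit 0,
5 bits wide). The helper names are hypothetical, and passing the
buffer's page number (address >> 12) as the PPN is an assumption of the
sketch. The point it illustrates is that a deposit-style helper returns
the updated value, so the result has to be assigned back.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical deposit-style helper: returns the updated value. */
static uint64_t field_deposit64(uint64_t value, int start, int length,
                                uint64_t fieldval)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    uint64_t mask = (~0ULL >> (64 - length)) << start;

    return (value & ~mask) | ((fieldval << start) & mask);
}

/* Compose a queue base value for a queue with 2^k entries. */
static uint64_t queue_base(uint64_t buf_addr, unsigned k)
{
    uint64_t reg = 0;

    reg = field_deposit64(reg, 10, 44, buf_addr >> 12);  /* PPN (assumed) */
    reg = field_deposit64(reg, 0, 5, k - 1);             /* LOG2SZ-1 */
    return reg;
}
```

For a 16-entry queue (k = 4) the low field holds 3, matching the k - 1
used in the test.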
 tests/qtest/libqos/riscv-iommu.h |  29 +++++++
 tests/qtest/riscv-iommu-test.c   | 141 +++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)

diff --git a/tests/qtest/libqos/riscv-iommu.h b/tests/qtest/libqos/riscv-iommu.h
index d123efb41f..c62ddedbac 100644
--- a/tests/qtest/libqos/riscv-iommu.h
+++ b/tests/qtest/libqos/riscv-iommu.h
@@ -62,6 +62,35 @@
 
 #define RISCV_IOMMU_REG_IPSR            0x0054
 
+#define RISCV_IOMMU_REG_IVEC            0x02F8
+#define RISCV_IOMMU_REG_IVEC_CIV        GENMASK_ULL(3, 0)
+#define RISCV_IOMMU_REG_IVEC_FIV        GENMASK_ULL(7, 4)
+#define RISCV_IOMMU_REG_IVEC_PIV        GENMASK_ULL(15, 12)
+
+#define RISCV_IOMMU_REG_CQB             0x0018
+#define RISCV_IOMMU_CQB_PPN_START       10
+#define RISCV_IOMMU_CQB_PPN_LEN         44
+#define RISCV_IOMMU_CQB_LOG2SZ_START    0
+#define RISCV_IOMMU_CQB_LOG2SZ_LEN      5
+
+#define RISCV_IOMMU_REG_CQT             0x0024
+
+#define RISCV_IOMMU_REG_FQB             0x0028
+#define RISCV_IOMMU_FQB_PPN_START       10
+#define RISCV_IOMMU_FQB_PPN_LEN         44
+#define RISCV_IOMMU_FQB_LOG2SZ_START    0
+#define RISCV_IOMMU_FQB_LOG2SZ_LEN      5
+
+#define RISCV_IOMMU_REG_FQT             0x0034
+
+#define RISCV_IOMMU_REG_PQB             0x0038
+#define RISCV_IOMMU_PQB_PPN_START       10
+#define RISCV_IOMMU_PQB_PPN_LEN         44
+#define RISCV_IOMMU_PQB_LOG2SZ_START    0
+#define RISCV_IOMMU_PQB_LOG2SZ_LEN      5
+
+#define RISCV_IOMMU_REG_PQT             0x0044
+
 typedef struct QRISCVIOMMU {
     QOSGraphObject obj;
     QPCIDevice dev;
diff --git a/tests/qtest/riscv-iommu-test.c b/tests/qtest/riscv-iommu-test.c
index 7f0dbd0211..9e2afcb4b9 100644
--- a/tests/qtest/riscv-iommu-test.c
+++ b/tests/qtest/riscv-iommu-test.c
@@ -33,6 +33,20 @@ static uint64_t riscv_iommu_read_reg64(QRISCVIOMMU *r_iommu, int reg_offset)
     return reg;
 }
 
+static void riscv_iommu_write_reg32(QRISCVIOMMU *r_iommu, int reg_offset,
+                                    uint32_t val)
+{
+    qpci_memwrite(&r_iommu->dev, r_iommu->reg_bar, reg_offset,
+                  &val, sizeof(val));
+}
+
+static void riscv_iommu_write_reg64(QRISCVIOMMU *r_iommu, int reg_offset,
+                                    uint64_t val)
+{
+    qpci_memwrite(&r_iommu->dev, r_iommu->reg_bar, reg_offset,
+                  &val, sizeof(val));
+}
+
 static void test_pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
     QRISCVIOMMU *r_iommu = obj;
@@ -84,10 +98,137 @@ static void test_reg_reset(void *obj, void *data, QGuestAllocator *t_alloc)
     g_assert_cmpuint(reg, ==, 0);
 }
 
+/*
+ * Common timeout-based poll for CQCSR, FQCSR and PQCSR. All
+ * their ON bits are mapped as RISCV_IOMMU_QUEUE_ACTIVE (16).
+ */
+static void qtest_wait_for_queue_active(QRISCVIOMMU *r_iommu,
+                                        uint32_t queue_csr)
+{
+    QTestState *qts = global_qtest;
+    guint64 timeout_us = 2 * 1000 * 1000;
+    gint64 start_time = g_get_monotonic_time();
+    uint32_t reg;
+
+    for (;;) {
+        qtest_clock_step(qts, 100);
+
+        reg = riscv_iommu_read_reg32(r_iommu, queue_csr);
+        if (reg & RISCV_IOMMU_QUEUE_ACTIVE) {
+            break;
+        }
+        g_assert(g_get_monotonic_time() - start_time <= timeout_us);
+    }
+}
+
+/*
+ * Goes through the queue activation procedures of chapter 6.2,
+ * "Guidelines for initialization", of the RISC-V IOMMU spec.
+ */
+static void test_iommu_init_queues(void *obj, void *data,
+                                   QGuestAllocator *t_alloc)
+{
+    QRISCVIOMMU *r_iommu = obj;
+    uint64_t reg64, q_addr;
+    uint32_t reg;
+    int k;
+
+    reg64 = riscv_iommu_read_reg64(r_iommu, RISCV_IOMMU_REG_CAP);
+    g_assert_cmpuint(reg64 & RISCV_IOMMU_CAP_VERSION, ==, 0x10);
+
+    /*
+     * Program the command queue. Write 0xF to civ, assert that
+     * we have 4 writable bits (k = 4). The number of entries N in the
+     * command queue is 2^4 = 16. We need to allocate an N*16-byte
+     * buffer and use it to set cqb.
+     */
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_IVEC,
+                            0xFFFF & RISCV_IOMMU_REG_IVEC_CIV);
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_IVEC);
+    g_assert_cmpuint(reg & RISCV_IOMMU_REG_IVEC_CIV, ==, 0xF);
+
+    q_addr = guest_alloc(t_alloc, 16 * 16);
+    reg64 = 0;
+    k = 4;
+    reg64 = deposit64(reg64, RISCV_IOMMU_CQB_PPN_START,
+                      RISCV_IOMMU_CQB_PPN_LEN, q_addr);
+    reg64 = deposit64(reg64, RISCV_IOMMU_CQB_LOG2SZ_START,
+                      RISCV_IOMMU_CQB_LOG2SZ_LEN, k - 1);
+    riscv_iommu_write_reg64(r_iommu, RISCV_IOMMU_REG_CQB, reg64);
+
+    /* cqt = 0, cqcsr.cqen = 1, poll cqcsr.cqon until it reads 1 */
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_CQT, 0);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_CQCSR);
+    reg |= RISCV_IOMMU_CQCSR_CQEN;
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_CQCSR, reg);
+
+    qtest_wait_for_queue_active(r_iommu, RISCV_IOMMU_REG_CQCSR);
+
+    /*
+     * Program the fault queue. Similar to the above:
+     * - Write 0xF to fiv, assert that we have 4 writable bits (k = 4)
+     * - Alloc a 16*32 bytes (instead of 16*16) buffer and use it to set
+     * fqb
+     */
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_IVEC,
+                            0xFFFF & RISCV_IOMMU_REG_IVEC_FIV);
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_IVEC);
+    g_assert_cmpuint(reg & RISCV_IOMMU_REG_IVEC_FIV, ==, 0xF0);
+
+    q_addr = guest_alloc(t_alloc, 16 * 32);
+    reg64 = 0;
+    k = 4;
+    reg64 = deposit64(reg64, RISCV_IOMMU_FQB_PPN_START,
+                      RISCV_IOMMU_FQB_PPN_LEN, q_addr);
+    reg64 = deposit64(reg64, RISCV_IOMMU_FQB_LOG2SZ_START,
+                      RISCV_IOMMU_FQB_LOG2SZ_LEN, k - 1);
+    riscv_iommu_write_reg64(r_iommu, RISCV_IOMMU_REG_FQB, reg64);
+
+    /* fqt = 0, fqcsr.fqen = 1, poll fqcsr.fqon until it reads 1 */
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_FQT, 0);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_FQCSR);
+    reg |= RISCV_IOMMU_FQCSR_FQEN;
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_FQCSR, reg);
+
+    qtest_wait_for_queue_active(r_iommu, RISCV_IOMMU_REG_FQCSR);
+
+    /*
+     * Program the page-request queue:
+     * - Write 0xF to piv, assert that we have 4 writable bits (k = 4)
+     * - Alloc a 16*16 bytes buffer and use it to set pqb.
+     */
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_IVEC,
+                            0xFFFF & RISCV_IOMMU_REG_IVEC_PIV);
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_IVEC);
+    g_assert_cmpuint(reg & RISCV_IOMMU_REG_IVEC_PIV, ==, 0xF000);
+
+    q_addr = guest_alloc(t_alloc, 16 * 16);
+    reg64 = 0;
+    k = 4;
+    reg64 = deposit64(reg64, RISCV_IOMMU_PQB_PPN_START,
+                      RISCV_IOMMU_PQB_PPN_LEN, q_addr);
+    reg64 = deposit64(reg64, RISCV_IOMMU_PQB_LOG2SZ_START,
+                      RISCV_IOMMU_PQB_LOG2SZ_LEN, k - 1);
+    riscv_iommu_write_reg64(r_iommu, RISCV_IOMMU_REG_PQB, reg64);
+
+    /* pqt = 0, pqcsr.pqen = 1, poll pqcsr.pqon until it reads 1 */
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_PQT, 0);
+
+    reg = riscv_iommu_read_reg32(r_iommu, RISCV_IOMMU_REG_PQCSR);
+    reg |= RISCV_IOMMU_PQCSR_PQEN;
+    riscv_iommu_write_reg32(r_iommu, RISCV_IOMMU_REG_PQCSR, reg);
+
+    qtest_wait_for_queue_active(r_iommu, RISCV_IOMMU_REG_PQCSR);
+}
+
 static void register_riscv_iommu_test(void)
 {
     qos_add_test("pci_config", "riscv-iommu-pci", test_pci_config, NULL);
     qos_add_test("reg_reset", "riscv-iommu-pci", test_reg_reset, NULL);
+    qos_add_test("iommu_init_queues", "riscv-iommu-pci",
+                 test_iommu_init_queues, NULL);
 }
 
 libqos_init(register_riscv_iommu_test);
-- 
2.44.0




* Re: [PATCH v3 02/13] hw/riscv: add riscv-iommu-bits.h
  2024-05-23 17:39 ` [PATCH v3 02/13] hw/riscv: add riscv-iommu-bits.h Daniel Henrique Barboza
@ 2024-05-28  6:41   ` Eric Cheng
  2024-06-05 22:21     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Cheng @ 2024-05-28  6:41 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang

On 5/24/2024 1:39 AM, Daniel Henrique Barboza wrote:
...
> +/* 5.4 Features control register (32bits) */
> +#define RISCV_IOMMU_REG_FCTL            0x0008

It looks like RISCV_IOMMU_FCTL_BE isn't supported?
If so, shouldn't it be implemented as read-only, along with the other 2 bits?

IIUC,

diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index 1b34d226f9..6a6bf1db98 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -2035,6 +2035,7 @@ static void riscv_iommu_realize(DeviceState *dev, Error 
**errp)
      /* Set power-on register state */
      stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
      stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
+    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FCTL], ~0);
      stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
          ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
      stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
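
For reference, a minimal sketch of the read-only mask semantics this
relies on, judging by how the DDTP mask is set up above: a set bit in
the mask keeps the current register bit, a cleared bit takes the
written value. The function name is hypothetical and this is not the
device's actual MMIO write path.

```c
#include <stdint.h>

/*
 * Apply a write to a register guarded by a read-only mask.
 * Bits set in ro_mask keep their old value; the rest take 'val'.
 */
static uint32_t reg_write_masked(uint32_t old, uint32_t val, uint32_t ro_mask)
{
    return (old & ro_mask) | (val & ~ro_mask);
}
```

With a mask of ~0 every guest write to the register is ignored, so the
register would keep reading as its reset value of 0.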


> +#define RISCV_IOMMU_FCTL_WSI            BIT(1)
> +
...




* Re: [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation
  2024-05-23 17:39 ` [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation Daniel Henrique Barboza
@ 2024-05-30  1:39   ` Eric Cheng
  2024-06-06 19:46     ` Daniel Henrique Barboza
  2024-06-11 16:15   ` Jason Chien
  2024-06-18 10:06   ` Jason Chien
  2 siblings, 1 reply; 38+ messages in thread
From: Eric Cheng @ 2024-05-30  1:39 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Sebastien Boeuf

On 5/24/2024 1:39 AM, Daniel Henrique Barboza wrote:
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
> 
> The RISC-V IOMMU specification is now ratified as-per the RISC-V
> international process. The latest frozen specifcation can be found
> at:
> 
> https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf
> 
> Add the foundation of the device emulation for RISC-V IOMMU, which
> includes an IOMMU that has no capabilities but MSI interrupt support and
> fault queue interfaces. We'll add add more features incrementally in the
						      ^^^  ^^^
repeated 'add'

> next patches.
> 
> Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---
>   hw/riscv/Kconfig         |    4 +
>   hw/riscv/meson.build     |    1 +
>   hw/riscv/riscv-iommu.c   | 1602 ++++++++++++++++++++++++++++++++++++++
>   hw/riscv/riscv-iommu.h   |  141 ++++
>   hw/riscv/trace-events    |   11 +
>   hw/riscv/trace.h         |    1 +
>   include/hw/riscv/iommu.h |   36 +
>   meson.build              |    1 +
>   8 files changed, 1797 insertions(+)
>   create mode 100644 hw/riscv/riscv-iommu.c
>   create mode 100644 hw/riscv/riscv-iommu.h
>   create mode 100644 hw/riscv/trace-events
>   create mode 100644 hw/riscv/trace.h
>   create mode 100644 include/hw/riscv/iommu.h
> 
> diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
> index a2030e3a6f..f69d6e3c8e 100644
> --- a/hw/riscv/Kconfig
> +++ b/hw/riscv/Kconfig
> @@ -1,3 +1,6 @@
> +config RISCV_IOMMU
> +    bool
> +
>   config RISCV_NUMA
>       bool
>   
> @@ -47,6 +50,7 @@ config RISCV_VIRT
>       select SERIAL
>       select RISCV_ACLINT
>       select RISCV_APLIC
> +    select RISCV_IOMMU
>       select RISCV_IMSIC
>       select SIFIVE_PLIC
>       select SIFIVE_TEST
> diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
> index f872674093..cbc99c6e8e 100644
> --- a/hw/riscv/meson.build
> +++ b/hw/riscv/meson.build
> @@ -10,5 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
>   riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
>   riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
>   riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
> +riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
>   
>   hw_arch += {'riscv': riscv_ss}
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> new file mode 100644
> index 0000000000..39b4ff1405
> --- /dev/null
> +++ b/hw/riscv/riscv-iommu.c
> @@ -0,0 +1,1602 @@
> +/*
> + * QEMU emulation of an RISC-V IOMMU
> + *
> + * Copyright (C) 2021-2023, Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +#include "hw/pci/pci_bus.h"
> +#include "hw/pci/pci_device.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/riscv/riscv_hart.h"
> +#include "migration/vmstate.h"
> +#include "qapi/error.h"
> +#include "qemu/timer.h"
> +
> +#include "cpu_bits.h"
> +#include "riscv-iommu.h"
> +#include "riscv-iommu-bits.h"
> +#include "trace.h"
> +
> +#define LIMIT_CACHE_CTX               (1U << 7)
> +#define LIMIT_CACHE_IOT               (1U << 20)
> +
> +/* Physical page number coversions */
> +#define PPN_PHYS(ppn)                 ((ppn) << TARGET_PAGE_BITS)
> +#define PPN_DOWN(phy)                 ((phy) >> TARGET_PAGE_BITS)
> +
> +typedef struct RISCVIOMMUContext RISCVIOMMUContext;
> +typedef struct RISCVIOMMUEntry RISCVIOMMUEntry;
> +
> +/* Device assigned I/O address space */
> +struct RISCVIOMMUSpace {
> +    IOMMUMemoryRegion iova_mr;  /* IOVA memory region for attached device */
> +    AddressSpace iova_as;       /* IOVA address space for attached device */
> +    RISCVIOMMUState *iommu;     /* Managing IOMMU device state */
> +    uint32_t devid;             /* Requester identifier, AKA device_id */
> +    bool notifier;              /* IOMMU unmap notifier enabled */
> +    QLIST_ENTRY(RISCVIOMMUSpace) list;
> +};
> +
> +/* Device translation context state. */
> +struct RISCVIOMMUContext {
> +    uint64_t devid:24;          /* Requester Id, AKA device_id */
> +    uint64_t pasid:20;          /* Process Address Space ID */
> +    uint64_t __rfu:20;          /* reserved */
> +    uint64_t tc;                /* Translation Control */
> +    uint64_t ta;                /* Translation Attributes */
> +    uint64_t msi_addr_mask;     /* MSI filtering - address mask */
> +    uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
> +    uint64_t msiptp;            /* MSI redirection page table pointer */
> +};

Can we alias (devid + pasid + __rfu) with a single key via a union? That would
make the key easy to compare, esp. since I assume functions like __ctx_equal()
are on the hot path.

Also, pasid is a PCI term. I suggest using the more general name from the spec:
process_id.

e.g. below (just compiled, not tested)

diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index 1b34d226f9..74011c7f1f 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -53,9 +53,12 @@ struct RISCVIOMMUSpace {

  /* Device translation context state. */
  struct RISCVIOMMUContext {
-    uint64_t devid:24;          /* Requester Id, AKA device_id */
-    uint64_t pasid:20;          /* Process Address Space ID */
-    uint64_t __rfu:20;          /* reserved */
+    union {
+        uint64_t devid:24,          /* Requester Id, AKA device_id */
+                 pasid:20,          /* Process Address Space ID */
+                 __rfu:20;          /* reserved */
+        uint64_t key;
+    };
      uint64_t tc;                /* Translation Control */
      uint64_t ta;                /* Translation Attributes */
      uint64_t satp;              /* S-Stage address translation and protection */
@@ -943,14 +946,14 @@ static gboolean __ctx_equal(gconstpointer v1, 
gconstpointer v2)
  {
      RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
      RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
-    return c1->devid == c2->devid && c1->pasid == c2->pasid;
+    return c1->key == c2->key;
  }

  static guint __ctx_hash(gconstpointer v)
  {
      RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
      /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
-    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
+    return (guint)ctx->key;
  }

  static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
@@ -958,8 +961,7 @@ static void __ctx_inval_devid_pasid(gpointer key, gpointer 
value, gpointer data)
      RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
      RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
      if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
-        ctx->devid == arg->devid &&
-        ctx->pasid == arg->pasid) {
+        ctx->key == arg->key) {
          ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
      }
  }
@@ -989,6 +991,7 @@ static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
      RISCVIOMMUContext key = {
          .devid = devid,
          .pasid = pasid,
+        .__rfu = 0,
      };
      ctx_cache = g_hash_table_ref(s->ctx_cache);
      g_hash_table_foreach(ctx_cache, func, &key);
@@ -1004,6 +1007,7 @@ static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
      RISCVIOMMUContext key = {
          .devid = devid,
          .pasid = pasid,
+        .__rfu = 0,
      };

      ctx_cache = g_hash_table_ref(s->ctx_cache);
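
To make the idea concrete outside the QEMU tree, here is a minimal standalone sketch of a bit-field key overlaid with a single 64-bit word. Everything named demo_* / DemoCtx is hypothetical and only stands in for RISCVIOMMUContext. It assumes the bit-fields sit in an anonymous struct so that they occupy distinct bits, and that the compiler accepts uint64_t bit-fields (GCC and Clang do). Bit-field layout is implementation-defined, so the sketch only relies on equality of the whole word, never on where each field lands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for RISCVIOMMUContext, reduced to the lookup key. */
typedef struct DemoCtx {
    union {
        struct {
            uint64_t devid:24;  /* requester ID */
            uint64_t pasid:20;  /* process address space ID */
            uint64_t rfu:20;    /* reserved, must be zero for key compares */
        };
        uint64_t key;           /* all 64 bits of the bit-fields above */
    };
} DemoCtx;

/* Build a key with the reserved bits explicitly cleared. */
static DemoCtx demo_ctx_make(uint32_t devid, uint32_t pasid)
{
    DemoCtx c = { .key = 0 };

    assert(devid < (1u << 24) && pasid < (1u << 20));
    c.devid = devid;
    c.pasid = pasid;
    return c;
}

/* One 64-bit compare replaces the two per-field compares. */
static bool demo_ctx_equal(const DemoCtx *a, const DemoCtx *b)
{
    return a->key == b->key;
}
```

The single compare is only valid if every producer of a key zeroes the reserved bits, which is why the lookup keys get an explicit zero for the reserved field.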


> +
> +/* IOMMU index for transactions without PASID specified. */
> +#define RISCV_IOMMU_NOPASID 0
> +
> +static void riscv_iommu_notify(RISCVIOMMUState *s, int vec)
> +{
> +    const uint32_t fctl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FCTL);
> +    uint32_t ipsr, ivec;
> +
> +    if (fctl & RISCV_IOMMU_FCTL_WSI || !s->notify) {
> +        return;
> +    }
> +
> +    ipsr = riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, (1 << vec), 0);
> +    ivec = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_IVEC);
> +
> +    if (!(ipsr & (1 << vec))) {
> +        s->notify(s, (ivec >> (vec * 4)) & 0x0F);
> +    }
> +}
> +
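
For readers following the vector lookup in riscv_iommu_notify(), a tiny standalone sketch of the nibble extraction may help. It assumes, as the shift in the quoted code implies, that each interrupt source owns one 4-bit field of the vector register; demo_ivec_lookup is a hypothetical name, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 4-bit vector number routed to interrupt source 'src'. */
static unsigned demo_ivec_lookup(uint32_t ivec, unsigned src)
{
    assert(src < 8);    /* at most eight 4-bit fields fit in 32 bits */
    return (ivec >> (src * 4)) & 0x0F;
}
```

With ivec = 0x3210, source 0 maps to vector 0, source 1 to vector 1, and so on.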
> +static void riscv_iommu_fault(RISCVIOMMUState *s,
> +                              struct riscv_iommu_fq_record *ev)
> +{
> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQH) & s->fq_mask;
> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQT) & s->fq_mask;
> +    uint32_t next = (tail + 1) & s->fq_mask;
> +    uint32_t devid = get_field(ev->hdr, RISCV_IOMMU_FQ_HDR_DID);
> +
> +    trace_riscv_iommu_flt(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
> +                          PCI_FUNC(devid), ev->hdr, ev->iotval);
> +
> +    if (!(ctrl & RISCV_IOMMU_FQCSR_FQON) ||
> +        !!(ctrl & (RISCV_IOMMU_FQCSR_FQOF | RISCV_IOMMU_FQCSR_FQMF))) {
> +        return;
> +    }
> +
> +    if (head == next) {
> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
> +                              RISCV_IOMMU_FQCSR_FQOF, 0);
> +    } else {
> +        dma_addr_t addr = s->fq_addr + tail * sizeof(*ev);
> +        if (dma_memory_write(s->target_as, addr, ev, sizeof(*ev),
> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
> +                                  RISCV_IOMMU_FQCSR_FQMF, 0);
> +        } else {
> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_FQT, next);
> +        }
> +    }
> +
> +    if (ctrl & RISCV_IOMMU_FQCSR_FIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_FQ);
> +    }
> +}
> +
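
The head/tail handling above is the usual power-of-two ring with one slot kept empty. A minimal sketch of just that check, with hypothetical names (demo_ring_push is not in the patch) and no DMA or register access:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical producer-side step for a ring of (mask + 1) slots. */
static bool demo_ring_push(uint32_t head, uint32_t *tail, uint32_t mask)
{
    uint32_t next;

    assert(((mask + 1) & mask) == 0);   /* mask must be 2^n - 1 */
    next = (*tail + 1) & mask;
    if (next == head) {
        return false;   /* full: the quoted code sets the overflow bit here */
    }
    *tail = next;       /* record written, publish the new tail */
    return true;
}
```

With mask = 3 and a consumer that never advances head, three pushes succeed and the fourth reports full, so a queue of N slots holds at most N - 1 records.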
> +static void riscv_iommu_pri(RISCVIOMMUState *s,
> +    struct riscv_iommu_pq_record *pr)
> +{
> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQH) & s->pq_mask;
> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQT) & s->pq_mask;
> +    uint32_t next = (tail + 1) & s->pq_mask;
> +    uint32_t devid = get_field(pr->hdr, RISCV_IOMMU_PREQ_HDR_DID);
> +
> +    trace_riscv_iommu_pri(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
> +                          PCI_FUNC(devid), pr->payload);
> +
> +    if (!(ctrl & RISCV_IOMMU_PQCSR_PQON) ||
> +        !!(ctrl & (RISCV_IOMMU_PQCSR_PQOF | RISCV_IOMMU_PQCSR_PQMF))) {
> +        return;
> +    }
> +
> +    if (head == next) {
> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
> +                              RISCV_IOMMU_PQCSR_PQOF, 0);
> +    } else {
> +        dma_addr_t addr = s->pq_addr + tail * sizeof(*pr);
> +        if (dma_memory_write(s->target_as, addr, pr, sizeof(*pr),
> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
> +                                  RISCV_IOMMU_PQCSR_PQMF, 0);
> +        } else {
> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_PQT, next);
> +        }
> +    }
> +
> +    if (ctrl & RISCV_IOMMU_PQCSR_PIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_PQ);
> +    }
> +}
> +
> +/* Portable implementation of pext_u64, bit-mask extraction. */
> +static uint64_t _pext_u64(uint64_t val, uint64_t ext)
> +{
> +    uint64_t ret = 0;
> +    uint64_t rot = 1;
> +
> +    while (ext) {
> +        if (ext & 1) {
> +            if (val & 1) {
> +                ret |= rot;
> +            }
> +            rot <<= 1;
> +        }
> +        val >>= 1;
> +        ext >>= 1;
> +    }
> +
> +    return ret;
> +}
> +
> +/* Check if GPA matches MSI/MRIF pattern. */
> +static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    dma_addr_t gpa)
> +{
> +    if (get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE) !=
> +        RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
> +        return false; /* Invalid MSI/MRIF mode */
> +    }
> +
> +    if ((PPN_DOWN(gpa) ^ ctx->msi_addr_pattern) & ~ctx->msi_addr_mask) {
> +        return false; /* GPA not in MSI range defined by AIA IMSIC rules. */
> +    }
> +
> +    return true;
> +}
> +
> +/* RISCV IOMMU Address Translation Lookup - Page Table Walk */
> +static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    IOMMUTLBEntry *iotlb)
> +{
> +    /* Early check for MSI address match when IOVA == GPA */
> +    if (iotlb->perm & IOMMU_WO &&
> +        riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
> +        iotlb->target_as = &s->trap_as;
> +        iotlb->translated_addr = iotlb->iova;
> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +        return 0;
> +    }
> +
> +    /* Exit early for pass-through mode. */
> +    iotlb->translated_addr = iotlb->iova;
> +    iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +    /* Allow R/W in pass-through mode */
> +    iotlb->perm = IOMMU_RW;
> +    return 0;
> +}
> +
> +static void riscv_iommu_report_fault(RISCVIOMMUState *s,
> +                                     RISCVIOMMUContext *ctx,
> +                                     uint32_t fault_type, uint32_t cause,
> +                                     bool pv,
> +                                     uint64_t iotval, uint64_t iotval2)
> +{
> +    struct riscv_iommu_fq_record ev = { 0 };
> +
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_DTF) {
> +        switch (cause) {
> +        case RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_INVALID:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED:
> +        case RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR:
> +        case RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT:
> +            break;
> +        default:
> +            /* DTF prevents reporting a fault for this given cause */
> +            return;
> +        }
> +    }
> +
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_CAUSE, cause);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_TTYPE, fault_type);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_DID, ctx->devid);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PV, true);
> +
> +    if (pv) {
> +        ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PID, ctx->pasid);
> +    }
> +
> +    ev.iotval = iotval;
> +    ev.iotval2 = iotval2;
> +
> +    riscv_iommu_fault(s, &ev);
> +}
> +
> +/* Redirect MSI write for given GPA. */
> +static MemTxResult riscv_iommu_msi_write(RISCVIOMMUState *s,
> +    RISCVIOMMUContext *ctx, uint64_t gpa, uint64_t data,
> +    unsigned size, MemTxAttrs attrs)
> +{
> +    MemTxResult res;
> +    dma_addr_t addr;
> +    uint64_t intn;
> +    uint32_t n190;
> +    uint64_t pte[2];
> +    int fault_type = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
> +    int cause;
> +
> +    if (!riscv_iommu_msi_check(s, ctx, gpa)) {
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    /* Interrupt File Number */
> +    intn = _pext_u64(PPN_DOWN(gpa), ctx->msi_addr_mask);
> +    if (intn >= 256) {
> +        /* Interrupt file number out of range */
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    /* fetch MSI PTE */
> +    addr = PPN_PHYS(get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_PPN));
> +    addr = addr | (intn * sizeof(pte));
> +    res = dma_memory_read(s->target_as, addr, &pte, sizeof(pte),
> +            MEMTXATTRS_UNSPECIFIED);
> +    if (res != MEMTX_OK) {
> +        if (res == MEMTX_DECODE_ERROR) {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED;
> +        } else {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        }
> +        goto err;
> +    }
> +
> +    le64_to_cpus(&pte[0]);
> +    le64_to_cpus(&pte[1]);
> +
> +    if (!(pte[0] & RISCV_IOMMU_MSI_PTE_V) || (pte[0] & RISCV_IOMMU_MSI_PTE_C)) {
> +        /*
> +         * The spec mentions that: "If msipte.C == 1, then further
> +         * processing to interpret the PTE is implementation
> +         * defined.". We'll abort with cause = 262 for this
> +         * case too.
> +         */
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_INVALID;
> +        goto err;
> +    }
> +
> +    switch (get_field(pte[0], RISCV_IOMMU_MSI_PTE_M)) {
> +    case RISCV_IOMMU_MSI_PTE_M_BASIC:
> +        /* MSI Pass-through mode */
> +        addr = PPN_PHYS(get_field(pte[0], RISCV_IOMMU_MSI_PTE_PPN));
> +        addr = addr | (gpa & TARGET_PAGE_MASK);
> +
> +        trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
> +                              PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
> +                              gpa, addr);
> +
> +        res = dma_memory_write(s->target_as, addr, &data, size, attrs);
> +        if (res != MEMTX_OK) {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +            goto err;
> +        }
> +
> +        return MEMTX_OK;
> +    case RISCV_IOMMU_MSI_PTE_M_MRIF:
> +        /* MRIF mode, continue. */
> +        break;
> +    default:
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
> +        goto err;
> +    }
> +
> +    /*
> +     * Report an error for interrupt identities exceeding the maximum allowed
> +     * for an IMSIC interrupt file (2047), or when the destination address is
> +     * not 32-bit aligned. See IOMMU Specification, Chapter 2.3. MSI page
> +     * tables.
> +     */
> +    if ((data > 2047) || (gpa & 3)) {
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
> +        goto err;
> +    }
> +
> +    /* MSI MRIF mode, non atomic pending bit update */
> +
> +    /* MRIF pending bit address */
> +    addr = get_field(pte[0], RISCV_IOMMU_MSI_PTE_MRIF_ADDR) << 9;
> +    addr = addr | ((data & 0x7c0) >> 3);
> +
> +    trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
> +                          PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
> +                          gpa, addr);
> +
> +    /* MRIF pending bit mask */
> +    data = 1ULL << (data & 0x03f);
> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    intn = intn | data;
> +    res = dma_memory_write(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +        goto err;
> +    }
> +
> +    /* Get MRIF enable bits */
> +    addr = addr + sizeof(intn);
> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    if (!(intn & data)) {
> +        /* notification disabled, MRIF update completed. */
> +        return MEMTX_OK;
> +    }
> +
> +    /* Send notification message */
> +    addr = PPN_PHYS(get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NPPN));
> +    n190 = get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID) |
> +          (get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID_MSB) << 10);
> +
> +    res = dma_memory_write(s->target_as, addr, &n190, sizeof(n190), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +        goto err;
> +    }
> +
> +    return MEMTX_OK;
> +
> +err:
> +    riscv_iommu_report_fault(s, ctx, fault_type, cause,
> +                             !!ctx->pasid, 0, 0);
> +    return res;
> +}
> +
> +/*
> + * Check device context configuration as described by the
> + * riscv-iommu spec section "Device-context configuration
> + * checks".
> + */
> +static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
> +                                            RISCVIOMMUContext *ctx)
> +{
> +    uint32_t msi_mode;
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
> +        ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
> +        return false;
> +    }
> +
> +    if (!(s->cap & RISCV_IOMMU_CAP_T2GPA) &&
> +        ctx->tc & RISCV_IOMMU_DC_TC_T2GPA) {
> +        return false;
> +    }
> +
> +    if (s->cap & RISCV_IOMMU_CAP_MSI_FLAT) {
> +        msi_mode = get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE);
> +
> +        if (msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_OFF &&
> +            msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
> +            return false;
> +        }
> +    }
> +
> +    /*
> +     * CAP_END is always zero (only one endianness). FCTL_BE is
> +     * always zero (little-endian accesses). Thus TC_SBE must
> +     * always be LE, i.e. zero.
> +     */
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_SBE) {
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +/*
> + * RISC-V IOMMU Device Context Lookup - Device Directory Tree Walk
> + *
> + * @s         : IOMMU Device State
> + * @ctx       : Device Translation Context with devid and pasid set.
> + * @return    : success or fault code.
> + */
> +static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
> +{
> +    const uint64_t ddtp = s->ddtp;
> +    unsigned mode = get_field(ddtp, RISCV_IOMMU_DDTP_MODE);
> +    dma_addr_t addr = PPN_PHYS(get_field(ddtp, RISCV_IOMMU_DDTP_PPN));
> +    struct riscv_iommu_dc dc;
> +    /* Device Context format: 0: extended (64 bytes) | 1: base (32 bytes) */
> +    const int dc_fmt = !s->enable_msi;
> +    const size_t dc_len = sizeof(dc) >> dc_fmt;
> +    unsigned depth;
> +    uint64_t de;
> +
> +    switch (mode) {
> +    case RISCV_IOMMU_DDTP_MODE_OFF:
> +        return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
> +
> +    case RISCV_IOMMU_DDTP_MODE_BARE:
> +        /* mock up pass-through translation context */
> +        ctx->tc = RISCV_IOMMU_DC_TC_V;
> +        ctx->ta = 0;
> +        ctx->msiptp = 0;
> +        return 0;
> +
> +    case RISCV_IOMMU_DDTP_MODE_1LVL:
> +        depth = 0;
> +        break;
> +
> +    case RISCV_IOMMU_DDTP_MODE_2LVL:
> +        depth = 1;
> +        break;
> +
> +    case RISCV_IOMMU_DDTP_MODE_3LVL:
> +        depth = 2;
> +        break;
> +
> +    default:
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +    }
> +
> +    /*
> +     * Check supported device id width (in bits).
> +     * See IOMMU Specification, Chapter 6. Software guidelines.
> +     * - if extended device-context format is used:
> +     *   1LVL: 6, 2LVL: 15, 3LVL: 24
> +     * - if base device-context format is used:
> +     *   1LVL: 7, 2LVL: 16, 3LVL: 24
> +     */
> +    if (ctx->devid >= (1 << (depth * 9 + 6 + (dc_fmt && depth != 2)))) {
> +        return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
> +    }
> +
> +    /* Device directory tree walk */
> +    for (; depth-- > 0; ) {
> +        /*
> +         * Select device id index bits based on device directory tree level
> +         * and device context format.
> +         * See IOMMU Specification, Chapter 2. Data Structures.
> +         * - if extended device-context format is used:
> +         *   device index: [23:15][14:6][5:0]
> +         * - if base device-context format is used:
> +         *   device index: [23:16][15:7][6:0]
> +         */
> +        const int split = depth * 9 + 6 + dc_fmt;
> +        addr |= ((ctx->devid >> split) << 3) & ~TARGET_PAGE_MASK;
> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
> +        }
> +        le64_to_cpus(&de);
> +        if (!(de & RISCV_IOMMU_DDTE_VALID)) {
> +            /* invalid directory entry */
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
> +        }
> +        if (de & ~(RISCV_IOMMU_DDTE_PPN | RISCV_IOMMU_DDTE_VALID)) {
> +            /* reserved bits set */
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +        }
> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_DDTE_PPN));
> +    }
> +
> +    /* index into device context entry page */
> +    addr |= (ctx->devid * dc_len) & ~TARGET_PAGE_MASK;
> +
> +    memset(&dc, 0, sizeof(dc));
> +    if (dma_memory_read(s->target_as, addr, &dc, dc_len,
> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
> +    }
> +
> +    /* Set translation context. */
> +    ctx->tc = le64_to_cpu(dc.tc);
> +    ctx->ta = le64_to_cpu(dc.ta);
> +    ctx->msiptp = le64_to_cpu(dc.msiptp);
> +    ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
> +    ctx->msi_addr_pattern = le64_to_cpu(dc.msi_addr_pattern);
> +
> +    if (!riscv_iommu_validate_device_ctx(s, ctx)) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +    }
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_V)) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
> +    }
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
> +        if (ctx->pasid != RISCV_IOMMU_NOPASID) {
> +            /* PASID is disabled */
> +            return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
> +        }
> +        return 0;
> +    }
> +
> +    /* FSC.TC.PDTV enabled */
> +    if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
> +        /* Invalid PDTP.MODE */
> +        return RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED;
> +    }
> +
> +    for (depth = mode - RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8; depth-- > 0; ) {
> +        /*
> +         * Select process id index bits based on process directory tree
> +         * level. See IOMMU Specification, 2.2. Process-Directory-Table.
> +         */
> +        const int split = depth * 9 + 8;
> +        addr |= ((ctx->pasid >> split) << 3) & ~TARGET_PAGE_MASK;
> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
> +        }
> +        le64_to_cpus(&de);
> +        if (!(de & RISCV_IOMMU_PC_TA_V)) {
> +            return RISCV_IOMMU_FQ_CAUSE_PDT_INVALID;
> +        }
> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_PC_FSC_PPN));
> +    }
> +
> +    /* Leaf entry in PDT */
> +    addr |= (ctx->pasid << 4) & ~TARGET_PAGE_MASK;
> +    if (dma_memory_read(s->target_as, addr, &dc.ta, sizeof(uint64_t) * 2,
> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +        return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
> +    }
> +
> +    /* Use FSC and TA from process directory entry. */
> +    ctx->ta = le64_to_cpu(dc.ta);
> +
> +    return 0;
> +}
> +
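
The device id width check in riscv_iommu_ctx_fetch() packs the table from its comment into one expression. A standalone sketch that spells it out, with hypothetical helper names and depth meaning 0, 1, 2 for 1LVL, 2LVL, 3LVL as in the quoted switch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: device_id bits usable at a given DDT depth. */
static unsigned demo_devid_bits(unsigned depth, bool base_fmt)
{
    assert(depth <= 2);
    /* 9 index bits per non-leaf level, 6 (extended) or 7 (base) in the leaf */
    return depth * 9 + 6 + (base_fmt && depth != 2);
}

/* Hypothetical helper: true if devid fits the configured tree. */
static bool demo_devid_ok(uint32_t devid, unsigned depth, bool base_fmt)
{
    return devid < (1u << demo_devid_bits(depth, base_fmt));
}
```

This reproduces the 6/15/24 and 7/16/24 widths listed in the comment; the 3LVL case is capped at 24 bits in both formats.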
> +/* Translation Context cache support */
> +static gboolean __ctx_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
> +    RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
> +    return c1->devid == c2->devid && c1->pasid == c2->pasid;
> +}
> +
> +static guint __ctx_hash(gconstpointer v)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
> +    /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
> +    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
> +}
> +
> +static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
> +        ctx->devid == arg->devid &&
> +        ctx->pasid == arg->pasid) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void __ctx_inval_devid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
> +        ctx->devid == arg->devid) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void __ctx_inval_all(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
> +    uint32_t devid, uint32_t pasid)
> +{
> +    GHashTable *ctx_cache;
> +    RISCVIOMMUContext key = {
> +        .devid = devid,
> +        .pasid = pasid,
> +    };
> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    g_hash_table_foreach(ctx_cache, func, &key);
> +    g_hash_table_unref(ctx_cache);
> +}
> +
> +/* Find or allocate translation context for a given {device_id, process_id} */
> +static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
> +    unsigned devid, unsigned pasid, void **ref)
> +{
> +    GHashTable *ctx_cache;
> +    RISCVIOMMUContext *ctx;
> +    RISCVIOMMUContext key = {
> +        .devid = devid,
> +        .pasid = pasid,
> +    };
> +
> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    ctx = g_hash_table_lookup(ctx_cache, &key);
> +
> +    if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
> +        *ref = ctx_cache;
> +        return ctx;
> +    }
> +
> +    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
> +        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> +                                          g_free, NULL);
> +        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
> +    }
> +
> +    ctx = g_new0(RISCVIOMMUContext, 1);
> +    ctx->devid = devid;
> +    ctx->pasid = pasid;
> +
> +    int fault = riscv_iommu_ctx_fetch(s, ctx);
> +    if (!fault) {
> +        g_hash_table_add(ctx_cache, ctx);
> +        *ref = ctx_cache;
> +        return ctx;
> +    }
> +
> +    g_hash_table_unref(ctx_cache);
> +    *ref = NULL;
> +
> +    riscv_iommu_report_fault(s, ctx, RISCV_IOMMU_FQ_TTYPE_UADDR_RD,
> +                             fault, !!pasid, 0, 0);
> +
> +    g_free(ctx);
> +    return NULL;
> +}
> +
> +static void riscv_iommu_ctx_put(RISCVIOMMUState *s, void *ref)
> +{
> +    if (ref) {
> +        g_hash_table_unref((GHashTable *)ref);
> +    }
> +}
> +
> +/* Find or allocate address space for a given device */
> +static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
> +{
> +    RISCVIOMMUSpace *as;
> +
> +    /* FIXME: PCIe bus remapping for attached endpoints. */
> +    devid |= s->bus << 8;
> +
> +    qemu_mutex_lock(&s->core_lock);
> +    QLIST_FOREACH(as, &s->spaces, list) {
> +        if (as->devid == devid) {
> +            break;
> +        }
> +    }
> +    qemu_mutex_unlock(&s->core_lock);
> +
> +    if (as == NULL) {
> +        char name[64];
> +        as = g_new0(RISCVIOMMUSpace, 1);
> +
> +        as->iommu = s;
> +        as->devid = devid;
> +
> +        snprintf(name, sizeof(name), "riscv-iommu-%04x:%02x.%d-iova",
> +            PCI_BUS_NUM(as->devid), PCI_SLOT(as->devid), PCI_FUNC(as->devid));
> +
> +        /* IOVA address space, untranslated addresses */
> +        memory_region_init_iommu(&as->iova_mr, sizeof(as->iova_mr),
> +            TYPE_RISCV_IOMMU_MEMORY_REGION,
> +            OBJECT(as), "riscv_iommu", UINT64_MAX);
> +        address_space_init(&as->iova_as, MEMORY_REGION(&as->iova_mr), name);
> +
> +        qemu_mutex_lock(&s->core_lock);
> +        QLIST_INSERT_HEAD(&s->spaces, as, list);
> +        qemu_mutex_unlock(&s->core_lock);
> +
> +        trace_riscv_iommu_new(s->parent_obj.id, PCI_BUS_NUM(as->devid),
> +                PCI_SLOT(as->devid), PCI_FUNC(as->devid));
> +    }
> +    return &as->iova_as;
> +}
> +
> +static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    IOMMUTLBEntry *iotlb)
> +{
> +    bool enable_pasid;
> +    bool enable_pri;
> +    int fault;
> +
> +    /*
> +     * TC[32] is reserved for custom extensions, used here to temporarily
> +     * enable automatic page-request generation for ATS queries.
> +     */
> +    enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
> +    enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
> +
> +    /* Translate using device directory / page table information. */
> +    fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
> +
> +    if (enable_pri && fault) {
> +        struct riscv_iommu_pq_record pr = {0};
> +        if (enable_pasid) {
> +            pr.hdr = set_field(RISCV_IOMMU_PREQ_HDR_PV,
> +                RISCV_IOMMU_PREQ_HDR_PID, ctx->pasid);
> +        }
> +        pr.hdr = set_field(pr.hdr, RISCV_IOMMU_PREQ_HDR_DID, ctx->devid);
> +        pr.payload = (iotlb->iova & TARGET_PAGE_MASK) |
> +                     RISCV_IOMMU_PREQ_PAYLOAD_M;
> +        riscv_iommu_pri(s, &pr);
> +        return fault;
> +    }
> +
> +    if (fault) {
> +        unsigned ttype;
> +
> +        if (iotlb->perm & IOMMU_RW) {
> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
> +        } else {
> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
> +        }
> +
> +        riscv_iommu_report_fault(s, ctx, ttype, fault, enable_pasid,
> +                                 iotlb->iova, iotlb->translated_addr);
> +        return fault;
> +    }
> +
> +    return 0;
> +}
> +
> +/* IOMMU Command Interface */
> +static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
> +    uint64_t addr, uint32_t data)
> +{
> +    /*
> +     * ATS processing in this implementation of the IOMMU is synchronous,
> +     * no need to wait for completions here.
> +     */
> +    if (!notify) {
> +        return MEMTX_OK;
> +    }
> +
> +    return dma_memory_write(s->target_as, addr, &data, sizeof(data),
> +        MEMTXATTRS_UNSPECIFIED);
> +}
> +
> +static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
> +{
> +    uint64_t old_ddtp = s->ddtp;
> +    uint64_t new_ddtp = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_DDTP);
> +    unsigned new_mode = get_field(new_ddtp, RISCV_IOMMU_DDTP_MODE);
> +    unsigned old_mode = get_field(old_ddtp, RISCV_IOMMU_DDTP_MODE);
> +    bool ok = false;
> +
> +    /*
> +     * Check for allowed DDTP.MODE transitions:
> +     * {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
> +     * {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
> +     */
> +    if (new_mode == old_mode ||
> +        new_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
> +        new_mode == RISCV_IOMMU_DDTP_MODE_BARE) {
> +        ok = true;
> +    } else if (new_mode == RISCV_IOMMU_DDTP_MODE_1LVL ||
> +               new_mode == RISCV_IOMMU_DDTP_MODE_2LVL ||
> +               new_mode == RISCV_IOMMU_DDTP_MODE_3LVL) {
> +        ok = old_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
> +             old_mode == RISCV_IOMMU_DDTP_MODE_BARE;
> +    }
> +
> +    if (ok) {
> +        /* clear reserved and busy bits, report back sanitized version */
> +        new_ddtp = set_field(new_ddtp & RISCV_IOMMU_DDTP_PPN,
> +                             RISCV_IOMMU_DDTP_MODE, new_mode);
> +    } else {
> +        new_ddtp = old_ddtp;
> +    }
> +    s->ddtp = new_ddtp;
> +
> +    riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_DDTP, new_ddtp);
> +}
> +
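
The transition rule in riscv_iommu_process_ddtp() can be read as a small predicate. A standalone sketch follows; the enum values are hypothetical placeholders and are not the RISCV_IOMMU_DDTP_MODE_* encodings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode numbering, for illustration only. */
enum demo_mode { DEMO_OFF, DEMO_BARE, DEMO_1LVL, DEMO_2LVL, DEMO_3LVL };

/*
 * Allowed transitions, per the quoted comment:
 *   {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
 *   {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
 * Rewriting the current mode is also accepted.
 */
static bool demo_ddtp_transition_ok(enum demo_mode old_mode,
                                    enum demo_mode new_mode)
{
    bool old_flat = old_mode == DEMO_OFF || old_mode == DEMO_BARE;
    bool new_flat = new_mode == DEMO_OFF || new_mode == DEMO_BARE;

    assert(old_mode <= DEMO_3LVL && new_mode <= DEMO_3LVL);
    if (new_mode == old_mode || new_flat) {
        return true;
    }
    return old_flat;    /* entering a tree mode requires OFF or BARE first */
}
```

So a guest has to go through OFF or BARE before switching, say, from 1LVL to 2LVL; a direct switch leaves the old DDTP value in place.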
> +/* Command function and opcode field. */
> +#define RISCV_IOMMU_CMD(func, op) (((func) << 7) | (op))
> +
> +static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
> +{
> +    struct riscv_iommu_command cmd;
> +    MemTxResult res;
> +    dma_addr_t addr;
> +    uint32_t tail, head, ctrl;
> +    uint64_t cmd_opcode;
> +    GHFunc func;
> +
> +    ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +    tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQT) & s->cq_mask;
> +    head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQH) & s->cq_mask;
> +
> +    /* Check for pending error or queue processing disabled */
> +    if (!(ctrl & RISCV_IOMMU_CQCSR_CQON) ||
> +        !!(ctrl & (RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CQMF))) {
> +        return;
> +    }
> +
> +    while (tail != head) {
> +        addr = s->cq_addr + head * sizeof(cmd);
> +        res = dma_memory_read(s->target_as, addr, &cmd, sizeof(cmd),
> +                              MEMTXATTRS_UNSPECIFIED);
> +
> +        if (res != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                                  RISCV_IOMMU_CQCSR_CQMF, 0);
> +            goto fault;
> +        }
> +
> +        trace_riscv_iommu_cmd(s->parent_obj.id, cmd.dword0, cmd.dword1);
> +
> +        cmd_opcode = get_field(cmd.dword0,
> +                               RISCV_IOMMU_CMD_OPCODE | RISCV_IOMMU_CMD_FUNC);
> +
> +        switch (cmd_opcode) {
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOFENCE_FUNC_C,
> +                             RISCV_IOMMU_CMD_IOFENCE_OPCODE):
> +            res = riscv_iommu_iofence(s,
> +                cmd.dword0 & RISCV_IOMMU_CMD_IOFENCE_AV, cmd.dword1,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOFENCE_DATA));
> +
> +            if (res != MEMTX_OK) {
> +                riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                                      RISCV_IOMMU_CQCSR_CQMF, 0);
> +                goto fault;
> +            }
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA,
> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
> +            if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
> +                /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
> +                goto cmd_ill;
> +            }
> +            /* translation cache not implemented yet */
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
> +            /* translation cache not implemented yet */
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
> +                /* invalidate all device context cache mappings */
> +                func = __ctx_inval_all;
> +            } else {
> +                /* invalidate all device context matching DID */
> +                func = __ctx_inval_devid;
> +            }
> +            riscv_iommu_ctx_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID), 0);
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT,
> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
> +                /* illegal command arguments IODIR_PDT & DV == 0 */
> +                goto cmd_ill;
> +            } else {
> +                func = __ctx_inval_devid_pasid;
> +            }
> +            riscv_iommu_ctx_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID),
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
> +            break;
> +
> +        default:
> +        cmd_ill:
> +            /* Invalid instruction, do not advance instruction index. */
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                RISCV_IOMMU_CQCSR_CMD_ILL, 0);
> +            goto fault;
> +        }
> +
> +        /* Advance and update head pointer after command completes. */
> +        head = (head + 1) & s->cq_mask;
> +        riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_CQH, head);
> +    }
> +    return;
> +
> +fault:
> +    if (ctrl & RISCV_IOMMU_CQCSR_CIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_CQ);
> +    }
> +}
> +
> +static void riscv_iommu_process_cq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_CQB);
> +        s->cq_mask = (2ULL << get_field(base, RISCV_IOMMU_CQB_LOG2SZ)) - 1;
> +        s->cq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_CQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~s->cq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQT], 0);
> +        ctrl_set = RISCV_IOMMU_CQCSR_CQON;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQMF |
> +                   RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CMD_TO |
> +                   RISCV_IOMMU_CQCSR_FENCE_W_IP;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +static void riscv_iommu_process_fq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_FQB);
> +        s->fq_mask = (2ULL << get_field(base, RISCV_IOMMU_FQB_LOG2SZ)) - 1;
> +        s->fq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_FQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~s->fq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQT], 0);
> +        ctrl_set = RISCV_IOMMU_FQCSR_FQON;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQMF |
> +            RISCV_IOMMU_FQCSR_FQOF;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_PQB);
> +        s->pq_mask = (2ULL << get_field(base, RISCV_IOMMU_PQB_LOG2SZ)) - 1;
> +        s->pq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_PQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~s->pq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQT], 0);
> +        ctrl_set = RISCV_IOMMU_PQCSR_PQON;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQMF |
> +            RISCV_IOMMU_PQCSR_PQOF;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
> +
> +static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
> +{
> +    uint32_t cqcsr, fqcsr, pqcsr;
> +    uint32_t ipsr_set = 0;
> +    uint32_t ipsr_clr = 0;
> +
> +    if (data & RISCV_IOMMU_IPSR_CIP) {
> +        cqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +
> +        if (cqcsr & RISCV_IOMMU_CQCSR_CIE &&
> +            (cqcsr & RISCV_IOMMU_CQCSR_FENCE_W_IP ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_ILL ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_TO ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_CIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
> +    }
> +
> +    if (data & RISCV_IOMMU_IPSR_FIP) {
> +        fqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +
> +        if (fqcsr & RISCV_IOMMU_FQCSR_FIE &&
> +            (fqcsr & RISCV_IOMMU_FQCSR_FQOF ||
> +             fqcsr & RISCV_IOMMU_FQCSR_FQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_FIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
> +    }
> +
> +    if (data & RISCV_IOMMU_IPSR_PIP) {
> +        pqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +
> +        if (pqcsr & RISCV_IOMMU_PQCSR_PIE &&
> +            (pqcsr & RISCV_IOMMU_PQCSR_PQOF ||
> +             pqcsr & RISCV_IOMMU_PQCSR_PQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_PIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, ipsr_set, ipsr_clr);
> +}
> +
> +static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
> +    uint64_t data, unsigned size, MemTxAttrs attrs)
> +{
> +    riscv_iommu_process_fn *process_fn = NULL;
> +    RISCVIOMMUState *s = opaque;
> +    uint32_t regb = addr & ~3;
> +    uint32_t busy = 0;
> +    uint64_t val = 0;
> +
> +    if ((addr & (size - 1)) != 0) {
> +        /* Unsupported MMIO alignment or access size */
> +        return MEMTX_ERROR;
> +    }
> +
> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
> +        /* Unsupported MMIO access location. */
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    /* Track actionable MMIO write. */
> +    switch (regb) {
> +    case RISCV_IOMMU_REG_DDTP:
> +    case RISCV_IOMMU_REG_DDTP + 4:
> +        process_fn = riscv_iommu_process_ddtp;
> +        regb = RISCV_IOMMU_REG_DDTP;
> +        busy = RISCV_IOMMU_DDTP_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_CQT:
> +        process_fn = riscv_iommu_process_cq_tail;
> +        break;
> +
> +    case RISCV_IOMMU_REG_CQCSR:
> +        process_fn = riscv_iommu_process_cq_control;
> +        busy = RISCV_IOMMU_CQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_FQCSR:
> +        process_fn = riscv_iommu_process_fq_control;
> +        busy = RISCV_IOMMU_FQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_PQCSR:
> +        process_fn = riscv_iommu_process_pq_control;
> +        busy = RISCV_IOMMU_PQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_IPSR:
> +        /*
> +         * IPSR has special procedures to update. Execute it
> +         * and exit.
> +         */
> +        if (size == 4) {
> +            uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
> +            uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
> +            uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
> +            stl_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +        } else if (size == 8) {
> +            uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
> +            uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
> +            uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
> +            stq_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +        }
> +
> +        riscv_iommu_update_ipsr(s, val);
> +
> +        return MEMTX_OK;
> +
> +    default:
> +        break;
> +    }
> +
> +    /*
> +     * Register updates might not be synchronized with core logic.
> +     * If system software updates a register while the relevant BUSY
> +     * bit is set, IOMMU behavior of additional writes to the register
> +     * is UNSPECIFIED.
> +     */
> +    qemu_spin_lock(&s->regs_lock);
> +    if (size == 1) {
> +        uint8_t ro = s->regs_ro[addr];
> +        uint8_t wc = s->regs_wc[addr];
> +        uint8_t rw = s->regs_rw[addr];
> +        s->regs_rw[addr] = ((rw & ro) | (data & ~ro)) & ~(data & wc);
> +    } else if (size == 2) {
> +        uint16_t ro = lduw_le_p(&s->regs_ro[addr]);
> +        uint16_t wc = lduw_le_p(&s->regs_wc[addr]);
> +        uint16_t rw = lduw_le_p(&s->regs_rw[addr]);
> +        stw_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    } else if (size == 4) {
> +        uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
> +        uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
> +        uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
> +        stl_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    } else if (size == 8) {
> +        uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
> +        uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
> +        uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
> +        stq_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    }
> +
> +    /* Busy flag update, MSB 4-byte register. */
> +    if (busy) {
> +        uint32_t rw = ldl_le_p(&s->regs_rw[regb]);
> +        stl_le_p(&s->regs_rw[regb], rw | busy);
> +    }
> +    qemu_spin_unlock(&s->regs_lock);
> +
> +    if (process_fn) {
> +        qemu_mutex_lock(&s->core_lock);
> +        process_fn(s);
> +        qemu_mutex_unlock(&s->core_lock);
> +    }
> +
> +    return MEMTX_OK;
> +}
> +
> +static MemTxResult riscv_iommu_mmio_read(void *opaque, hwaddr addr,
> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
> +{
> +    RISCVIOMMUState *s = opaque;
> +    uint64_t val = -1;
> +    uint8_t *ptr;
> +
> +    if ((addr & (size - 1)) != 0) {
> +        /* Unsupported MMIO alignment. */
> +        return MEMTX_ERROR;
> +    }
> +
> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    ptr = &s->regs_rw[addr];
> +
> +    if (size == 1) {
> +        val = (uint64_t)*ptr;
> +    } else if (size == 2) {
> +        val = lduw_le_p(ptr);
> +    } else if (size == 4) {
> +        val = ldl_le_p(ptr);
> +    } else if (size == 8) {
> +        val = ldq_le_p(ptr);
> +    } else {
> +        return MEMTX_ERROR;
> +    }
> +
> +    *data = val;
> +
> +    return MEMTX_OK;
> +}
> +
> +static const MemoryRegionOps riscv_iommu_mmio_ops = {
> +    .read_with_attrs = riscv_iommu_mmio_read,
> +    .write_with_attrs = riscv_iommu_mmio_write,
> +    .endianness = DEVICE_NATIVE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +        .unaligned = false,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +/*
> + * Translations matching MSI pattern check are redirected to "riscv-iommu-trap"
> + * memory region as untranslated address, for additional MSI/MRIF interception
> + * by IOMMU interrupt remapping implementation.
> + * Note: Device emulation code generating an MSI is expected to provide valid
> + * memory transaction attributes with requester_id set.
> + */
> +static MemTxResult riscv_iommu_trap_write(void *opaque, hwaddr addr,
> +    uint64_t data, unsigned size, MemTxAttrs attrs)
> +{
> +    RISCVIOMMUState* s = (RISCVIOMMUState *)opaque;
> +    RISCVIOMMUContext *ctx;
> +    MemTxResult res;
> +    void *ref;
> +    uint32_t devid = attrs.requester_id;
> +
> +    if (attrs.unspecified) {
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    /* FIXME: PCIe bus remapping for attached endpoints. */
> +    devid |= s->bus << 8;
> +
> +    ctx = riscv_iommu_ctx(s, devid, 0, &ref);
> +    if (ctx == NULL) {
> +        res = MEMTX_ACCESS_ERROR;
> +    } else {
> +        res = riscv_iommu_msi_write(s, ctx, addr, data, size, attrs);
> +    }
> +    riscv_iommu_ctx_put(s, ref);
> +    return res;
> +}
> +
> +static MemTxResult riscv_iommu_trap_read(void *opaque, hwaddr addr,
> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
> +{
> +    return MEMTX_ACCESS_ERROR;
> +}
> +
> +static const MemoryRegionOps riscv_iommu_trap_ops = {
> +    .read_with_attrs = riscv_iommu_trap_read,
> +    .write_with_attrs = riscv_iommu_trap_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +        .unaligned = true,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +static void riscv_iommu_realize(DeviceState *dev, Error **errp)
> +{
> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
> +
> +    s->cap = s->version & RISCV_IOMMU_CAP_VERSION;
> +    if (s->enable_msi) {
> +        s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
> +    }
> +    /* Report QEMU target physical address space limits */
> +    s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
> +                       TARGET_PHYS_ADDR_SPACE_BITS);
> +
> +    /* TODO: method to report supported PASID bits */
> +    s->pasid_bits = 8; /* restricted to size of MemTxAttrs.pasid */
> +    s->cap |= RISCV_IOMMU_CAP_PD8;
> +
> +    /* Out-of-reset translation mode: OFF (DMA disabled) BARE (passthrough) */
> +    s->ddtp = set_field(0, RISCV_IOMMU_DDTP_MODE, s->enable_off ?
> +                        RISCV_IOMMU_DDTP_MODE_OFF : RISCV_IOMMU_DDTP_MODE_BARE);
> +
> +    /* register storage */
> +    s->regs_rw = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +    s->regs_ro = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +    s->regs_wc = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +
> +    /* Mark all registers read-only */
> +    memset(s->regs_ro, 0xff, RISCV_IOMMU_REG_SIZE);
> +
> +    /*
> +     * Register complete MMIO space, including MSI/PBA registers.
> +     * Note, PCIDevice implementation will add overlapping MR for MSI/PBA,
> +     * managed directly by the PCIDevice implementation.
> +     */
> +    memory_region_init_io(&s->regs_mr, OBJECT(dev), &riscv_iommu_mmio_ops, s,
> +        "riscv-iommu-regs", RISCV_IOMMU_REG_SIZE);
> +
> +    /* Set power-on register state */
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
> +        ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
> +        ~(RISCV_IOMMU_CQB_LOG2SZ | RISCV_IOMMU_CQB_PPN));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQB],
> +        ~(RISCV_IOMMU_FQB_LOG2SZ | RISCV_IOMMU_FQB_PPN));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQB],
> +        ~(RISCV_IOMMU_PQB_LOG2SZ | RISCV_IOMMU_PQB_PPN));
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQMF |
> +        RISCV_IOMMU_CQCSR_CMD_TO | RISCV_IOMMU_CQCSR_CMD_ILL);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQON |
> +        RISCV_IOMMU_CQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQMF |
> +        RISCV_IOMMU_FQCSR_FQOF);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQON |
> +        RISCV_IOMMU_FQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQMF |
> +        RISCV_IOMMU_PQCSR_PQOF);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQON |
> +        RISCV_IOMMU_PQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
> +
> +    /* Memory region for downstream access, if specified. */
> +    if (s->target_mr) {
> +        s->target_as = g_new0(AddressSpace, 1);
> +        address_space_init(s->target_as, s->target_mr,
> +            "riscv-iommu-downstream");
> +    } else {
> +        /* Fallback to global system memory. */
> +        s->target_as = &address_space_memory;
> +    }
> +
> +    /* Memory region for untranslated MRIF/MSI writes */
> +    memory_region_init_io(&s->trap_mr, OBJECT(dev), &riscv_iommu_trap_ops, s,
> +            "riscv-iommu-trap", ~0ULL);
> +    address_space_init(&s->trap_as, &s->trap_mr, "riscv-iommu-trap-as");
> +
> +    /* Device translation context cache */
> +    s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> +                                         g_free, NULL);
> +
> +    s->iommus.le_next = NULL;
> +    s->iommus.le_prev = NULL;
> +    QLIST_INIT(&s->spaces);
> +    qemu_mutex_init(&s->core_lock);
> +    qemu_spin_init(&s->regs_lock);
> +}
> +
> +static void riscv_iommu_unrealize(DeviceState *dev)
> +{
> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
> +
> +    qemu_mutex_destroy(&s->core_lock);
> +    g_hash_table_unref(s->ctx_cache);
> +}
> +
> +static Property riscv_iommu_properties[] = {
> +    DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
> +        RISCV_IOMMU_SPEC_DOT_VER),
> +    DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
> +    DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
> +    DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
> +    DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
> +        TYPE_MEMORY_REGION, MemoryRegion *),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void riscv_iommu_class_init(ObjectClass *klass, void* data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    /* internal device for riscv-iommu-{pci/sys}, not user-creatable */
> +    dc->user_creatable = false;
> +    dc->realize = riscv_iommu_realize;
> +    dc->unrealize = riscv_iommu_unrealize;
> +    device_class_set_props(dc, riscv_iommu_properties);
> +}
> +
> +static const TypeInfo riscv_iommu_info = {
> +    .name = TYPE_RISCV_IOMMU,
> +    .parent = TYPE_DEVICE,
> +    .instance_size = sizeof(RISCVIOMMUState),
> +    .class_init = riscv_iommu_class_init,
> +};
> +
> +static const char *IOMMU_FLAG_STR[] = {
> +    "NA",
> +    "RO",
> +    "WR",
> +    "RW",
> +};
> +
> +/* RISC-V IOMMU Memory Region - Address Translation Space */
> +static IOMMUTLBEntry riscv_iommu_memory_region_translate(
> +    IOMMUMemoryRegion *iommu_mr, hwaddr addr,
> +    IOMMUAccessFlags flag, int iommu_idx)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +    RISCVIOMMUContext *ctx;
> +    void *ref;
> +    IOMMUTLBEntry iotlb = {
> +        .iova = addr,
> +        .target_as = as->iommu->target_as,
> +        .addr_mask = ~0ULL,
> +        .perm = flag,
> +    };
> +
> +    ctx = riscv_iommu_ctx(as->iommu, as->devid, iommu_idx, &ref);
> +    if (ctx == NULL) {
> +        /* Translation disabled or invalid. */
> +        iotlb.addr_mask = 0;
> +        iotlb.perm = IOMMU_NONE;
> +    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
> +        /* Translation disabled or fault reported. */
> +        iotlb.addr_mask = 0;
> +        iotlb.perm = IOMMU_NONE;
> +    }
> +
> +    /* Trace all DMA translations with original access flags. */
> +    trace_riscv_iommu_dma(as->iommu->parent_obj.id, PCI_BUS_NUM(as->devid),
> +                          PCI_SLOT(as->devid), PCI_FUNC(as->devid), iommu_idx,
> +                          IOMMU_FLAG_STR[flag & IOMMU_RW], iotlb.iova,
> +                          iotlb.translated_addr);
> +
> +    riscv_iommu_ctx_put(as->iommu, ref);
> +
> +    return iotlb;
> +}
> +
> +static int riscv_iommu_memory_region_notify(
> +    IOMMUMemoryRegion *iommu_mr, IOMMUNotifierFlag old,
> +    IOMMUNotifierFlag new, Error **errp)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +
> +    if (old == IOMMU_NOTIFIER_NONE) {
> +        as->notifier = true;
> +        trace_riscv_iommu_notifier_add(iommu_mr->parent_obj.name);
> +    } else if (new == IOMMU_NOTIFIER_NONE) {
> +        as->notifier = false;
> +        trace_riscv_iommu_notifier_del(iommu_mr->parent_obj.name);
> +    }
> +
> +    return 0;
> +}
> +
> +static inline bool pci_is_iommu(PCIDevice *pdev)
> +{
> +    return pci_get_word(pdev->config + PCI_CLASS_DEVICE) == 0x0806;
> +}
> +
> +static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void *opaque, int devfn)
> +{
> +    RISCVIOMMUState *s = (RISCVIOMMUState *) opaque;
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> +    AddressSpace *as = NULL;
> +
> +    if (pdev && pci_is_iommu(pdev)) {
> +        return s->target_as;
> +    }
> +
> +    /* Find first registered IOMMU device */
> +    while (s->iommus.le_prev) {
> +        s = *(s->iommus.le_prev);
> +    }
> +
> +    /* Find first matching IOMMU */
> +    while (s != NULL && as == NULL) {
> +        as = riscv_iommu_space(s, PCI_BUILD_BDF(pci_bus_num(bus), devfn));
> +        s = s->iommus.le_next;
> +    }
> +
> +    return as ? as : &address_space_memory;
> +}
> +
> +static const PCIIOMMUOps riscv_iommu_ops = {
> +    .get_address_space = riscv_iommu_find_as,
> +};
> +
> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
> +        Error **errp)
> +{
> +    if (bus->iommu_ops &&
> +        bus->iommu_ops->get_address_space == riscv_iommu_find_as) {
> +        /* Allow multiple IOMMUs on the same PCIe bus, link known devices */
> +        RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque;
> +        QLIST_INSERT_AFTER(last, iommu, iommus);
> +    } else if (!bus->iommu_ops && !bus->iommu_opaque) {
> +        pci_setup_iommu(bus, &riscv_iommu_ops, iommu);
> +    } else {
> +        error_setg(errp, "can't register secondary IOMMU for PCI bus #%d",
> +            pci_bus_num(bus));
> +    }
> +}
> +
> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion *iommu_mr,
> +    MemTxAttrs attrs)
> +{
> +    return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid;
> +}
> +
> +static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion *iommu_mr)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +    return 1 << as->iommu->pasid_bits;
> +}
> +
> +static void riscv_iommu_memory_region_init(ObjectClass *klass, void *data)
> +{
> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
> +
> +    imrc->translate = riscv_iommu_memory_region_translate;
> +    imrc->notify_flag_changed = riscv_iommu_memory_region_notify;
> +    imrc->attrs_to_index = riscv_iommu_memory_region_index;
> +    imrc->num_indexes = riscv_iommu_memory_region_index_len;
> +}
> +
> +static const TypeInfo riscv_iommu_memory_region_info = {
> +    .parent = TYPE_IOMMU_MEMORY_REGION,
> +    .name = TYPE_RISCV_IOMMU_MEMORY_REGION,
> +    .class_init = riscv_iommu_memory_region_init,
> +};
> +
> +static void riscv_iommu_register_mr_types(void)
> +{
> +    type_register_static(&riscv_iommu_memory_region_info);
> +    type_register_static(&riscv_iommu_info);
> +}
> +
> +type_init(riscv_iommu_register_mr_types);
> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
> new file mode 100644
> index 0000000000..31d3907d33
> --- /dev/null
> +++ b/hw/riscv/riscv-iommu.h
> @@ -0,0 +1,141 @@
> +/*
> + * QEMU emulation of a RISC-V IOMMU
> + *
> + * Copyright (C) 2022-2023 Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_RISCV_IOMMU_STATE_H
> +#define HW_RISCV_IOMMU_STATE_H
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +
> +#include "hw/riscv/iommu.h"
> +
> +struct RISCVIOMMUState {
> +    /*< private >*/
> +    DeviceState parent_obj;
> +
> +    /*< public >*/
> +    uint32_t version;     /* Reported interface version number */
> +    uint32_t pasid_bits;  /* process identifier width */
> +    uint32_t bus;         /* PCI bus mapping for non-root endpoints */
> +
> +    uint64_t cap;         /* IOMMU supported capabilities */
> +    uint64_t fctl;        /* IOMMU enabled features */
> +
> +    bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
> +    bool enable_msi;      /* Enable MSI remapping */
> +
> +    /* IOMMU Internal State */
> +    uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */
> +
> +    dma_addr_t cq_addr;   /* Command queue base physical address */
> +    dma_addr_t fq_addr;   /* Fault/event queue base physical address */
> +    dma_addr_t pq_addr;   /* Page request queue base physical address */
> +
> +    uint32_t cq_mask;     /* Command queue index bit mask */
> +    uint32_t fq_mask;     /* Fault/event queue index bit mask */
> +    uint32_t pq_mask;     /* Page request queue index bit mask */
> +
> +    /* interrupt notifier */
> +    void (*notify)(RISCVIOMMUState *iommu, unsigned vector);
> +
> +    /* IOMMU State Machine */
> +    QemuThread core_proc; /* Background processing thread */
> +    QemuMutex core_lock;  /* Global IOMMU lock, used for cache/regs updates */
> +    QemuCond core_cond;   /* Background processing wake up signal */
> +    unsigned core_exec;   /* Processing thread execution actions */
> +
> +    /* IOMMU target address space */
> +    AddressSpace *target_as;
> +    MemoryRegion *target_mr;
> +
> +    /* MSI / MRIF access trap */
> +    AddressSpace trap_as;
> +    MemoryRegion trap_mr;
> +
> +    GHashTable *ctx_cache;          /* Device translation Context Cache */
> +
> +    /* MMIO Hardware Interface */
> +    MemoryRegion regs_mr;
> +    QemuSpin regs_lock;
> +    uint8_t *regs_rw;  /* register state (user write) */
> +    uint8_t *regs_wc;  /* write-1-to-clear mask */
> +    uint8_t *regs_ro;  /* read-only mask */
> +
> +    QLIST_ENTRY(RISCVIOMMUState) iommus;
> +    QLIST_HEAD(, RISCVIOMMUSpace) spaces;
> +};
> +
> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
> +         Error **errp);
> +
> +/* private helpers */
> +
> +/* Register helper functions */
> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s,
> +    unsigned idx, uint32_t set, uint32_t clr)
> +{
> +    uint32_t val;
> +    qemu_spin_lock(&s->regs_lock);
> +    val = ldl_le_p(s->regs_rw + idx);
> +    stl_le_p(s->regs_rw + idx, (val & ~clr) | set);
> +    qemu_spin_unlock(&s->regs_lock);
> +    return val;
> +}
> +
> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s,
> +    unsigned idx, uint32_t set)
> +{
> +    qemu_spin_lock(&s->regs_lock);
> +    stl_le_p(s->regs_rw + idx, set);
> +    qemu_spin_unlock(&s->regs_lock);
> +}
> +
> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s,
> +    unsigned idx)
> +{
> +    return ldl_le_p(s->regs_rw + idx);
> +}
> +
> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s,
> +    unsigned idx, uint64_t set, uint64_t clr)
> +{
> +    uint64_t val;
> +    qemu_spin_lock(&s->regs_lock);
> +    val = ldq_le_p(s->regs_rw + idx);
> +    stq_le_p(s->regs_rw + idx, (val & ~clr) | set);
> +    qemu_spin_unlock(&s->regs_lock);
> +    return val;
> +}
> +
> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s,
> +    unsigned idx, uint64_t set)
> +{
> +    qemu_spin_lock(&s->regs_lock);
> +    stq_le_p(s->regs_rw + idx, set);
> +    qemu_spin_unlock(&s->regs_lock);
> +}
> +
> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s,
> +    unsigned idx)
> +{
> +    return ldq_le_p(s->regs_rw + idx);
> +}
> +
> +
> +
> +#endif
> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
> new file mode 100644
> index 0000000000..42a97caffa
> --- /dev/null
> +++ b/hw/riscv/trace-events
> @@ -0,0 +1,11 @@
> +# See documentation at docs/devel/tracing.rst
> +
> +# riscv-iommu.c
> +riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) "%s: device attached %04x:%02x.%d"
> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" iova: 0x%"PRIx64
> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h
> new file mode 100644
> index 0000000000..8c0e3ca1f3
> --- /dev/null
> +++ b/hw/riscv/trace.h
> @@ -0,0 +1 @@
> +#include "trace/trace-hw_riscv.h"
> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h
> new file mode 100644
> index 0000000000..070ee69973
> --- /dev/null
> +++ b/include/hw/riscv/iommu.h
> @@ -0,0 +1,36 @@
> +/*
> + * QEMU emulation of a RISC-V IOMMU
> + *
> + * Copyright (C) 2022-2023 Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_RISCV_IOMMU_H
> +#define HW_RISCV_IOMMU_H
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +
> +#define TYPE_RISCV_IOMMU "riscv-iommu"
> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU)
> +typedef struct RISCVIOMMUState RISCVIOMMUState;
> +
> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr"
> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace;
> +
> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci"
> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI)
> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci;
> +
> +#endif
> diff --git a/meson.build b/meson.build
> index a9de71d450..8099d8271c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -3319,6 +3319,7 @@ if have_system
>       'hw/pci-host',
>       'hw/ppc',
>       'hw/rtc',
> +    'hw/riscv',
>       'hw/s390x',
>       'hw/scsi',
>       'hw/sd',




* Re: [PATCH v3 08/13] hw/riscv/riscv-iommu: add Address Translation Cache (IOATC)
  2024-05-23 17:39 ` [PATCH v3 08/13] hw/riscv/riscv-iommu: add Address Translation Cache (IOATC) Daniel Henrique Barboza
@ 2024-06-05 17:34   ` Tomasz Jeznach
  2024-06-07  8:30     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 38+ messages in thread
From: Tomasz Jeznach @ 2024-06-05 17:34 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, ajones, frank.chang

Daniel,

Thank you for your upstreaming work!

I've synchronized the private branch with the v3 changes and noticed
that an important change is missing from this patchset. We need a
reader-writer lock around accesses to the GLib GHashTable, as it is
not MT-safe. The diff is included below and is also available on
GitHub [1], branch riscv_iommu_v4-rc1.

[1] link: https://github.com/tjeznach/qemu/tree/riscv_iommu_v4-rc1

Thanks!
- Tomasz Jeznach

diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
index a27f56419a..75c5d645fc 100644
--- a/hw/riscv/riscv-iommu.c
+++ b/hw/riscv/riscv-iommu.c
@@ -991,7 +991,9 @@ static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
         .pasid = pasid,
     };
     ctx_cache = g_hash_table_ref(s->ctx_cache);
+    pthread_rwlock_wrlock(&s->ctx_lock);
     g_hash_table_foreach(ctx_cache, func, &key);
+    pthread_rwlock_unlock(&s->ctx_lock);
     g_hash_table_unref(ctx_cache);
 }

@@ -1007,26 +1009,31 @@ static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
     };

     ctx_cache = g_hash_table_ref(s->ctx_cache);
+    pthread_rwlock_rdlock(&s->ctx_lock);
     ctx = g_hash_table_lookup(ctx_cache, &key);
+    pthread_rwlock_unlock(&s->ctx_lock);

     if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
         *ref = ctx_cache;
         return ctx;
     }

-    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
-        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
-                                          g_free, NULL);
-        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
-    }
-
     ctx = g_new0(RISCVIOMMUContext, 1);
     ctx->devid = devid;
     ctx->pasid = pasid;

     int fault = riscv_iommu_ctx_fetch(s, ctx);
     if (!fault) {
+        pthread_rwlock_wrlock(&s->ctx_lock);
+        if (g_hash_table_size(ctx_cache) >= LIMIT_CACHE_CTX) {
+            g_hash_table_unref(ctx_cache);
+            ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
+                                              g_free, NULL);
+            g_hash_table_ref(ctx_cache);
+            g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
+        }
         g_hash_table_add(ctx_cache, ctx);
+        pthread_rwlock_unlock(&s->ctx_lock);
         *ref = ctx_cache;
         return ctx;
     }
@@ -1176,12 +1183,14 @@ static void riscv_iommu_iot_update(RISCVIOMMUState *s,
         return;
     }

+    pthread_rwlock_wrlock(&s->iot_lock);
     if (g_hash_table_size(s->iot_cache) >= s->iot_limit) {
         iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
                                           g_free, NULL);
         g_hash_table_unref(qatomic_xchg(&s->iot_cache, iot_cache));
     }
     g_hash_table_add(iot_cache, iot);
+    pthread_rwlock_unlock(&s->iot_lock);
 }

 static void riscv_iommu_iot_inval(RISCVIOMMUState *s, GHFunc func,
@@ -1195,7 +1204,9 @@ static void riscv_iommu_iot_inval(RISCVIOMMUState *s, GHFunc func,
     };

     iot_cache = g_hash_table_ref(s->iot_cache);
+    pthread_rwlock_wrlock(&s->iot_lock);
     g_hash_table_foreach(iot_cache, func, &key);
+    pthread_rwlock_unlock(&s->iot_lock);
     g_hash_table_unref(iot_cache);
 }

@@ -1227,7 +1238,9 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
         }
     }

+    pthread_rwlock_rdlock(&s->iot_lock);
     iot = riscv_iommu_iot_lookup(ctx, iot_cache, iotlb->iova);
+    pthread_rwlock_unlock(&s->iot_lock);
     perm = iot ? iot->perm : IOMMU_NONE;
     if (perm != IOMMU_NONE) {
         iotlb->translated_addr = PPN_PHYS(iot->phys);
@@ -2085,6 +2098,8 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
                                          g_free, NULL);
     s->iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
                                          g_free, NULL);
+    pthread_rwlock_init(&s->ctx_lock, NULL);
+    pthread_rwlock_init(&s->iot_lock, NULL);

     s->iommus.le_next = NULL;
     s->iommus.le_prev = NULL;
diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
index 26236c3cee..041b3b9e05 100644
--- a/hw/riscv/riscv-iommu.h
+++ b/hw/riscv/riscv-iommu.h
@@ -71,7 +71,9 @@ struct RISCVIOMMUState {
     MemoryRegion trap_mr;

     GHashTable *ctx_cache;          /* Device translation Context Cache */
+    pthread_rwlock_t ctx_lock;      /* Device translation Cache update lock */
     GHashTable *iot_cache;          /* IO Translated Address Cache */
+    pthread_rwlock_t iot_lock;      /* IO TLB Cache update lock */
     unsigned iot_limit;             /* IO Translation Cache size limit */

     /* MMIO Hardware Interface */


On Thu, May 23, 2024 at 10:40 AM Daniel Henrique Barboza
<dbarboza@ventanamicro.com> wrote:
>
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> The RISC-V IOMMU spec allows the IOMMU to use translation caches to
> hold entries from the DDT. This patch implements all cache commands
> that were previously marked as 'not implemented'.
>
> The cache includes some artifacts that anticipate s-stage and g-stage
> elements, although we don't support them yet. We'll introduce them
> next.
>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> Reviewed-by: Frank Chang <frank.chang@sifive.com>
> ---
>  hw/riscv/riscv-iommu.c | 189 ++++++++++++++++++++++++++++++++++++++++-
>  hw/riscv/riscv-iommu.h |   2 +
>  2 files changed, 187 insertions(+), 4 deletions(-)
>
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> index 39b4ff1405..abf6ae7726 100644
> --- a/hw/riscv/riscv-iommu.c
> +++ b/hw/riscv/riscv-iommu.c
> @@ -63,6 +63,16 @@ struct RISCVIOMMUContext {
>      uint64_t msiptp;            /* MSI redirection page table pointer */
>  };
>
> +/* Address translation cache entry */
> +struct RISCVIOMMUEntry {
> +    uint64_t iova:44;           /* IOVA Page Number */
> +    uint64_t pscid:20;          /* Process Soft-Context identifier */
> +    uint64_t phys:44;           /* Physical Page Number */
> +    uint64_t gscid:16;          /* Guest Soft-Context identifier */
> +    uint64_t perm:2;            /* IOMMU_RW flags */
> +    uint64_t __rfu:2;
> +};
> +
>  /* IOMMU index for transactions without PASID specified. */
>  #define RISCV_IOMMU_NOPASID 0
>
> @@ -751,13 +761,125 @@ static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
>      return &as->iova_as;
>  }
>
> +/* Translation Object cache support */
> +static gboolean __iot_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    RISCVIOMMUEntry *t1 = (RISCVIOMMUEntry *) v1;
> +    RISCVIOMMUEntry *t2 = (RISCVIOMMUEntry *) v2;
> +    return t1->gscid == t2->gscid && t1->pscid == t2->pscid &&
> +           t1->iova == t2->iova;
> +}
> +
> +static guint __iot_hash(gconstpointer v)
> +{
> +    RISCVIOMMUEntry *t = (RISCVIOMMUEntry *) v;
> +    return (guint)t->iova;
> +}
> +
> +/* GV: 1 PSCV: 1 AV: 1 */
> +static void __iot_inval_pscid_iova(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
> +    if (iot->gscid == arg->gscid &&
> +        iot->pscid == arg->pscid &&
> +        iot->iova == arg->iova) {
> +        iot->perm = IOMMU_NONE;
> +    }
> +}
> +
> +/* GV: 1 PSCV: 1 AV: 0 */
> +static void __iot_inval_pscid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
> +    if (iot->gscid == arg->gscid &&
> +        iot->pscid == arg->pscid) {
> +        iot->perm = IOMMU_NONE;
> +    }
> +}
> +
> +/* GV: 1 GVMA: 1 */
> +static void __iot_inval_gscid_gpa(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
> +    if (iot->gscid == arg->gscid) {
> +        /* simplified cache, no GPA matching */
> +        iot->perm = IOMMU_NONE;
> +    }
> +}
> +
> +/* GV: 1 GVMA: 0 */
> +static void __iot_inval_gscid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
> +    if (iot->gscid == arg->gscid) {
> +        iot->perm = IOMMU_NONE;
> +    }
> +}
> +
> +/* GV: 0 */
> +static void __iot_inval_all(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
> +    iot->perm = IOMMU_NONE;
> +}
> +
> +/* caller should keep ref-count for iot_cache object */
> +static RISCVIOMMUEntry *riscv_iommu_iot_lookup(RISCVIOMMUContext *ctx,
> +    GHashTable *iot_cache, hwaddr iova)
> +{
> +    RISCVIOMMUEntry key = {
> +        .pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID),
> +        .iova  = PPN_DOWN(iova),
> +    };
> +    return g_hash_table_lookup(iot_cache, &key);
> +}
> +
> +/* caller should keep ref-count for iot_cache object */
> +static void riscv_iommu_iot_update(RISCVIOMMUState *s,
> +    GHashTable *iot_cache, RISCVIOMMUEntry *iot)
> +{
> +    if (!s->iot_limit) {
> +        return;
> +    }
> +
> +    if (g_hash_table_size(s->iot_cache) >= s->iot_limit) {
> +        iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
> +                                          g_free, NULL);
> +        g_hash_table_unref(qatomic_xchg(&s->iot_cache, iot_cache));
> +    }
> +    g_hash_table_add(iot_cache, iot);
> +}
> +
> +static void riscv_iommu_iot_inval(RISCVIOMMUState *s, GHFunc func,
> +    uint32_t gscid, uint32_t pscid, hwaddr iova)
> +{
> +    GHashTable *iot_cache;
> +    RISCVIOMMUEntry key = {
> +        .gscid = gscid,
> +        .pscid = pscid,
> +        .iova  = PPN_DOWN(iova),
> +    };
> +
> +    iot_cache = g_hash_table_ref(s->iot_cache);
> +    g_hash_table_foreach(iot_cache, func, &key);
> +    g_hash_table_unref(iot_cache);
> +}
> +
>  static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> -    IOMMUTLBEntry *iotlb)
> +    IOMMUTLBEntry *iotlb, bool enable_cache)
>  {
> +    RISCVIOMMUEntry *iot;
> +    IOMMUAccessFlags perm;
>      bool enable_pasid;
>      bool enable_pri;
> +    GHashTable *iot_cache;
>      int fault;
>
> +    iot_cache = g_hash_table_ref(s->iot_cache);
>      /*
>       * TC[32] is reserved for custom extensions, used here to temporarily
>       * enable automatic page-request generation for ATS queries.
> @@ -765,9 +887,36 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>      enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
>      enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
>
> +    iot = riscv_iommu_iot_lookup(ctx, iot_cache, iotlb->iova);
> +    perm = iot ? iot->perm : IOMMU_NONE;
> +    if (perm != IOMMU_NONE) {
> +        iotlb->translated_addr = PPN_PHYS(iot->phys);
> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +        iotlb->perm = perm;
> +        fault = 0;
> +        goto done;
> +    }
> +
>      /* Translate using device directory / page table information. */
>      fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
>
> +    if (!fault && iotlb->target_as == &s->trap_as) {
> +        /* Do not cache trapped MSI translations */
> +        goto done;
> +    }
> +
> +    if (!fault && iotlb->translated_addr != iotlb->iova && enable_cache) {
> +        iot = g_new0(RISCVIOMMUEntry, 1);
> +        iot->iova = PPN_DOWN(iotlb->iova);
> +        iot->phys = PPN_DOWN(iotlb->translated_addr);
> +        iot->pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID);
> +        iot->perm = iotlb->perm;
> +        riscv_iommu_iot_update(s, iot_cache, iot);
> +    }
> +
> +done:
> +    g_hash_table_unref(iot_cache);
> +
>      if (enable_pri && fault) {
>          struct riscv_iommu_pq_record pr = {0};
>          if (enable_pasid) {
> @@ -907,13 +1056,40 @@ static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
>              if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
>                  /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
>                  goto cmd_ill;
> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_GV)) {
> +                /* invalidate all cache mappings */
> +                func = __iot_inval_all;
> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_AV)) {
> +                /* invalidate cache matching GSCID */
> +                func = __iot_inval_gscid;
> +            } else {
> +                /* invalidate cache matching GSCID and ADDR (GPA) */
> +                func = __iot_inval_gscid_gpa;
>              }
> -            /* translation cache not implemented yet */
> +            riscv_iommu_iot_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_GSCID), 0,
> +                cmd.dword1 & TARGET_PAGE_MASK);
>              break;
>
>          case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
>                               RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
> -            /* translation cache not implemented yet */
> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_GV)) {
> +                /* invalidate all cache mappings, simplified model */
> +                func = __iot_inval_all;
> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV)) {
> +                /* invalidate cache matching GSCID, simplified model */
> +                func = __iot_inval_gscid;
> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_AV)) {
> +                /* invalidate cache matching GSCID and PSCID */
> +                func = __iot_inval_pscid;
> +            } else {
> +                /* invalidate cache matching GSCID and PSCID and ADDR (IOVA) */
> +                func = __iot_inval_pscid_iova;
> +            }
> +            riscv_iommu_iot_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_GSCID),
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_PSCID),
> +                cmd.dword1 & TARGET_PAGE_MASK);
>              break;
>
>          case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
> @@ -1410,6 +1586,8 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>      /* Device translation context cache */
>      s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
>                                           g_free, NULL);
> +    s->iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
> +                                         g_free, NULL);
>
>      s->iommus.le_next = NULL;
>      s->iommus.le_prev = NULL;
> @@ -1423,6 +1601,7 @@ static void riscv_iommu_unrealize(DeviceState *dev)
>      RISCVIOMMUState *s = RISCV_IOMMU(dev);
>
>      qemu_mutex_destroy(&s->core_lock);
> +    g_hash_table_unref(s->iot_cache);
>      g_hash_table_unref(s->ctx_cache);
>  }
>
> @@ -1430,6 +1609,8 @@ static Property riscv_iommu_properties[] = {
>      DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
>          RISCV_IOMMU_SPEC_DOT_VER),
>      DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
> +    DEFINE_PROP_UINT32("ioatc-limit", RISCVIOMMUState, iot_limit,
> +        LIMIT_CACHE_IOT),
>      DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
>      DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
>      DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
> @@ -1482,7 +1663,7 @@ static IOMMUTLBEntry riscv_iommu_memory_region_translate(
>          /* Translation disabled or invalid. */
>          iotlb.addr_mask = 0;
>          iotlb.perm = IOMMU_NONE;
> -    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
> +    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb, true)) {
>          /* Translation disabled or fault reported. */
>          iotlb.addr_mask = 0;
>          iotlb.perm = IOMMU_NONE;
> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
> index 31d3907d33..3afee9f3e8 100644
> --- a/hw/riscv/riscv-iommu.h
> +++ b/hw/riscv/riscv-iommu.h
> @@ -68,6 +68,8 @@ struct RISCVIOMMUState {
>      MemoryRegion trap_mr;
>
>      GHashTable *ctx_cache;          /* Device translation Context Cache */
> +    GHashTable *iot_cache;          /* IO Translated Address Cache */
> +    unsigned iot_limit;             /* IO Translation Cache size limit */
>
>      /* MMIO Hardware Interface */
>      MemoryRegion regs_mr;
> --
> 2.44.0
>



* Re: [PATCH v3 02/13] hw/riscv: add riscv-iommu-bits.h
  2024-05-28  6:41   ` Eric Cheng
@ 2024-06-05 22:21     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-05 22:21 UTC (permalink / raw)
  To: Eric Cheng, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang



On 5/28/24 3:41 AM, Eric Cheng wrote:
> On 5/24/2024 1:39 AM, Daniel Henrique Barboza wrote:
> ...
>> +/* 5.4 Features control register (32bits) */
>> +#define RISCV_IOMMU_REG_FCTL            0x0008
> 
> Looks like this doesn't support RISCV_IOMMU_FCTL_BE?
> If so, shouldn't it be implemented as read-only, along with the other 2 bits?


Good point. I just set RISCV_IOMMU_FCTL_BE in the regs_ro mask. I'll also
set FCTL_WSI, given that at the moment we do not have wired interrupt
support (the riscv-iommu sysbus device will support it later).
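
As a rough illustration of what a read-only mask means for a register
write, here is a small sketch. It assumes that a bit set in the mask
keeps its current value and that every other bit takes the written
value. The DEMO_* names, the bit position chosen for BE and the
demo_reg_write() helper are hypothetical and exist only for this
example; they are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical bit positions, for this sketch only. */
#define DEMO_FCTL_BE     (1u << 0)
#define DEMO_FCTL_WSI    (1u << 1)
#define DEMO_FCTL_OTHER  (1u << 2)

/*
 * Merge a guest write into a register value: bits set in ro_mask keep
 * their current value, all other bits take the written value.
 */
static uint32_t demo_reg_write(uint32_t cur, uint32_t ro_mask, uint32_t wr)
{
    return (cur & ro_mask) | (wr & ~ro_mask);
}
```

With BE and WSI in the mask and a current value of 0, a guest write of
all ones would leave both bits clear and only change the remaining
writable bit.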

FCTL_GSX is declared and used in patch 8 so we don't need to set it
as read-only. Thanks,


Daniel

> 
> IIUC,
> 
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> index 1b34d226f9..6a6bf1db98 100644
> --- a/hw/riscv/riscv-iommu.c
> +++ b/hw/riscv/riscv-iommu.c
> @@ -2035,6 +2035,7 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>       /* Set power-on register state */
>       stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
>       stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FCTL], ~0);
>       stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
>           ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
>       stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
> 
> 
>> +#define RISCV_IOMMU_FCTL_WSI            BIT(1)
>> +
> ...
> 



* Re: [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation
  2024-05-30  1:39   ` Eric Cheng
@ 2024-06-06 19:46     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-06 19:46 UTC (permalink / raw)
  To: Eric Cheng, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Sebastien Boeuf



On 5/29/24 10:39 PM, Eric Cheng wrote:
> On 5/24/2024 1:39 AM, Daniel Henrique Barboza wrote:
>> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>>
>> The RISC-V IOMMU specification is now ratified as per the RISC-V
>> International process. The latest frozen specification can be found
>> at:
>>
>> https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf
>>
>> Add the foundation of the device emulation for RISC-V IOMMU, which
>> includes an IOMMU that has no capabilities but MSI interrupt support and
>> fault queue interfaces. We'll add add more features incrementally in the
>                                ^^^  ^^^
> repeated 'add'

Fixed.

> 
>> next patches.
>>
>> Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
>> Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
>> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
>> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
>> ---
>>   hw/riscv/Kconfig         |    4 +
>>   hw/riscv/meson.build     |    1 +
>>   hw/riscv/riscv-iommu.c   | 1602 ++++++++++++++++++++++++++++++++++++++
>>   hw/riscv/riscv-iommu.h   |  141 ++++
>>   hw/riscv/trace-events    |   11 +
>>   hw/riscv/trace.h         |    1 +
>>   include/hw/riscv/iommu.h |   36 +
>>   meson.build              |    1 +
>>   8 files changed, 1797 insertions(+)
>>   create mode 100644 hw/riscv/riscv-iommu.c
>>   create mode 100644 hw/riscv/riscv-iommu.h
>>   create mode 100644 hw/riscv/trace-events
>>   create mode 100644 hw/riscv/trace.h
>>   create mode 100644 include/hw/riscv/iommu.h
>>
>> diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
>> index a2030e3a6f..f69d6e3c8e 100644
>> --- a/hw/riscv/Kconfig
>> +++ b/hw/riscv/Kconfig
>> @@ -1,3 +1,6 @@
>> +config RISCV_IOMMU
>> +    bool
>> +
>>   config RISCV_NUMA
>>       bool
>> @@ -47,6 +50,7 @@ config RISCV_VIRT
>>       select SERIAL
>>       select RISCV_ACLINT
>>       select RISCV_APLIC
>> +    select RISCV_IOMMU
>>       select RISCV_IMSIC
>>       select SIFIVE_PLIC
>>       select SIFIVE_TEST
>> diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
>> index f872674093..cbc99c6e8e 100644
>> --- a/hw/riscv/meson.build
>> +++ b/hw/riscv/meson.build
>> @@ -10,5 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
>>   riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
>>   riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
>>   riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
>> +riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
>>   hw_arch += {'riscv': riscv_ss}
>> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
>> new file mode 100644
>> index 0000000000..39b4ff1405
>> --- /dev/null
>> +++ b/hw/riscv/riscv-iommu.c
>> @@ -0,0 +1,1602 @@
>> +/*
>> + * QEMU emulation of an RISC-V IOMMU
>> + *
>> + * Copyright (C) 2021-2023, Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +#include "hw/pci/pci_bus.h"
>> +#include "hw/pci/pci_device.h"
>> +#include "hw/qdev-properties.h"
>> +#include "hw/riscv/riscv_hart.h"
>> +#include "migration/vmstate.h"
>> +#include "qapi/error.h"
>> +#include "qemu/timer.h"
>> +
>> +#include "cpu_bits.h"
>> +#include "riscv-iommu.h"
>> +#include "riscv-iommu-bits.h"
>> +#include "trace.h"
>> +
>> +#define LIMIT_CACHE_CTX               (1U << 7)
>> +#define LIMIT_CACHE_IOT               (1U << 20)
>> +
>> +/* Physical page number conversions */
>> +#define PPN_PHYS(ppn)                 ((ppn) << TARGET_PAGE_BITS)
>> +#define PPN_DOWN(phy)                 ((phy) >> TARGET_PAGE_BITS)
>> +
>> +typedef struct RISCVIOMMUContext RISCVIOMMUContext;
>> +typedef struct RISCVIOMMUEntry RISCVIOMMUEntry;
>> +
>> +/* Device assigned I/O address space */
>> +struct RISCVIOMMUSpace {
>> +    IOMMUMemoryRegion iova_mr;  /* IOVA memory region for attached device */
>> +    AddressSpace iova_as;       /* IOVA address space for attached device */
>> +    RISCVIOMMUState *iommu;     /* Managing IOMMU device state */
>> +    uint32_t devid;             /* Requester identifier, AKA device_id */
>> +    bool notifier;              /* IOMMU unmap notifier enabled */
>> +    QLIST_ENTRY(RISCVIOMMUSpace) list;
>> +};
>> +
>> +/* Device translation context state. */
>> +struct RISCVIOMMUContext {
>> +    uint64_t devid:24;          /* Requester Id, AKA device_id */
>> +    uint64_t pasid:20;          /* Process Address Space ID */
>> +    uint64_t __rfu:20;          /* reserved */
>> +    uint64_t tc;                /* Translation Control */
>> +    uint64_t ta;                /* Translation Attributes */
>> +    uint64_t msi_addr_mask;     /* MSI filtering - address mask */
>> +    uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
>> +    uint64_t msiptp;            /* MSI redirection page table pointer */
>> +};
> 
> Can we alias (devid + pasid + __rfu) with a union, so that the key can be compared easily? I assume functions like __ctx_equal() are on the hot path.


I don't think so. The reason is that the device context can have different
process ids with the same devid. The spec mentions GPU as an example of a
device that could have multiple device contexts for each process using it.

In fact, we're checking for both attributes at the same time in functions
such as riscv_iommu_ctx_fetch().
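
For reference, a minimal sketch of how a single 64-bit key could still
tell apart two contexts that share a devid but differ in process id.
The demo_ctx_key type and its helpers are hypothetical and not part of
the patch. The bit-fields are wrapped in a struct because members
declared directly in a union all share the same storage. Bit-field
layout is implementation-defined, so the sketch only relies on equal
fields producing equal keys within one build.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key type: the bit-fields alias one 64-bit word. */
union demo_ctx_key {
    struct {
        uint64_t devid:24;       /* requester id */
        uint64_t process_id:20;  /* process id (pasid in the patch) */
        uint64_t rfu:20;         /* reserved, must stay zero */
    } f;
    uint64_t key;
};

static union demo_ctx_key demo_ctx_key_make(uint32_t devid,
                                            uint32_t process_id)
{
    union demo_ctx_key k = { .key = 0 };  /* clears the reserved bits */

    k.f.devid = devid;
    k.f.process_id = process_id;
    return k;
}

/* One comparison covers devid and process_id at the same time. */
static bool demo_ctx_key_equal(union demo_ctx_key a, union demo_ctx_key b)
{
    return a.key == b.key;
}
```

The whole-word comparison is only valid if the reserved bits are zeroed
in every key, which is why the helper starts from a zeroed union.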

> 
> Also, pasid is a PCI-specific term. I suggest using the more general name from the spec: process_id.


The spec seems to use 'process_id' rather than PASID for this particular
use indeed. I'll rename it.



Thanks,

Daniel


> 
> e.g. below (just compiled, not tested)
> 
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> index 1b34d226f9..74011c7f1f 100644
> --- a/hw/riscv/riscv-iommu.c
> +++ b/hw/riscv/riscv-iommu.c
> @@ -53,9 +53,12 @@ struct RISCVIOMMUSpace {
> 
>   /* Device translation context state. */
>   struct RISCVIOMMUContext {
> -    uint64_t devid:24;          /* Requester Id, AKA device_id */
> -    uint64_t pasid:20;          /* Process Address Space ID */
> -    uint64_t __rfu:20;          /* reserved */
> +    union {
> +        uint64_t devid:24,          /* Requester Id, AKA device_id */
> +                 pasid:20,          /* Process Address Space ID */
> +                 __rfu:20;          /* reserved */
> +        uint64_t key;
> +    };
>       uint64_t tc;                /* Translation Control */
>       uint64_t ta;                /* Translation Attributes */
>       uint64_t satp;              /* S-Stage address translation and protection */
> @@ -943,14 +946,14 @@ static gboolean __ctx_equal(gconstpointer v1, gconstpointer v2)
>   {
>       RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
>       RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
> -    return c1->devid == c2->devid && c1->pasid == c2->pasid;
> +    return c1->key == c2->key;
>   }
> 
>   static guint __ctx_hash(gconstpointer v)
>   {
>       RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
>       /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
> -    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
> +    return (guint)ctx->key;
>   }
> 
>   static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
> @@ -958,8 +961,7 @@ static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
>       RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
>       RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
>       if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
> -        ctx->devid == arg->devid &&
> -        ctx->pasid == arg->pasid) {
> +        ctx->key == arg->key) {
>           ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
>       }
>   }
> @@ -989,6 +991,7 @@ static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
>       RISCVIOMMUContext key = {
>           .devid = devid,
>           .pasid = pasid,
> +        .__rfu = 0,
>       };
>       ctx_cache = g_hash_table_ref(s->ctx_cache);
>       g_hash_table_foreach(ctx_cache, func, &key);
> @@ -1004,6 +1007,7 @@ static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
>       RISCVIOMMUContext key = {
>           .devid = devid,
>           .pasid = pasid,
> +        .__rfu = 0,
>       };
> 
>       ctx_cache = g_hash_table_ref(s->ctx_cache);
> 
> 
>> +
>> +/* IOMMU index for transactions without PASID specified. */
>> +#define RISCV_IOMMU_NOPASID 0
>> +
>> +static void riscv_iommu_notify(RISCVIOMMUState *s, int vec)
>> +{
>> +    const uint32_t fctl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FCTL);
>> +    uint32_t ipsr, ivec;
>> +
>> +    if (fctl & RISCV_IOMMU_FCTL_WSI || !s->notify) {
>> +        return;
>> +    }
>> +
>> +    ipsr = riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, (1 << vec), 0);
>> +    ivec = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_IVEC);
>> +
>> +    if (!(ipsr & (1 << vec))) {
>> +        s->notify(s, (ivec >> (vec * 4)) & 0x0F);
>> +    }
>> +}
>> +
>> +static void riscv_iommu_fault(RISCVIOMMUState *s,
>> +                              struct riscv_iommu_fq_record *ev)
>> +{
>> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
>> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQH) & s->fq_mask;
>> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQT) & s->fq_mask;
>> +    uint32_t next = (tail + 1) & s->fq_mask;
>> +    uint32_t devid = get_field(ev->hdr, RISCV_IOMMU_FQ_HDR_DID);
>> +
>> +    trace_riscv_iommu_flt(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
>> +                          PCI_FUNC(devid), ev->hdr, ev->iotval);
>> +
>> +    if (!(ctrl & RISCV_IOMMU_FQCSR_FQON) ||
>> +        !!(ctrl & (RISCV_IOMMU_FQCSR_FQOF | RISCV_IOMMU_FQCSR_FQMF))) {
>> +        return;
>> +    }
>> +
>> +    if (head == next) {
>> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
>> +                              RISCV_IOMMU_FQCSR_FQOF, 0);
>> +    } else {
>> +        dma_addr_t addr = s->fq_addr + tail * sizeof(*ev);
>> +        if (dma_memory_write(s->target_as, addr, ev, sizeof(*ev),
>> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
>> +                                  RISCV_IOMMU_FQCSR_FQMF, 0);
>> +        } else {
>> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_FQT, next);
>> +        }
>> +    }
>> +
>> +    if (ctrl & RISCV_IOMMU_FQCSR_FIE) {
>> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_FQ);
>> +    }
>> +}
>> +
>> +static void riscv_iommu_pri(RISCVIOMMUState *s,
>> +    struct riscv_iommu_pq_record *pr)
>> +{
>> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
>> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQH) & s->pq_mask;
>> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQT) & s->pq_mask;
>> +    uint32_t next = (tail + 1) & s->pq_mask;
>> +    uint32_t devid = get_field(pr->hdr, RISCV_IOMMU_PREQ_HDR_DID);
>> +
>> +    trace_riscv_iommu_pri(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
>> +                          PCI_FUNC(devid), pr->payload);
>> +
>> +    if (!(ctrl & RISCV_IOMMU_PQCSR_PQON) ||
>> +        !!(ctrl & (RISCV_IOMMU_PQCSR_PQOF | RISCV_IOMMU_PQCSR_PQMF))) {
>> +        return;
>> +    }
>> +
>> +    if (head == next) {
>> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
>> +                              RISCV_IOMMU_PQCSR_PQOF, 0);
>> +    } else {
>> +        dma_addr_t addr = s->pq_addr + tail * sizeof(*pr);
>> +        if (dma_memory_write(s->target_as, addr, pr, sizeof(*pr),
>> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
>> +                                  RISCV_IOMMU_PQCSR_PQMF, 0);
>> +        } else {
>> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_PQT, next);
>> +        }
>> +    }
>> +
>> +    if (ctrl & RISCV_IOMMU_PQCSR_PIE) {
>> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_PQ);
>> +    }
>> +}
>> +
>> +/* Portable implementation of pext_u64, bit-mask extraction. */
>> +static uint64_t _pext_u64(uint64_t val, uint64_t ext)
>> +{
>> +    uint64_t ret = 0;
>> +    uint64_t rot = 1;
>> +
>> +    while (ext) {
>> +        if (ext & 1) {
>> +            if (val & 1) {
>> +                ret |= rot;
>> +            }
>> +            rot <<= 1;
>> +        }
>> +        val >>= 1;
>> +        ext >>= 1;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +/* Check if GPA matches MSI/MRIF pattern. */
>> +static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>> +    dma_addr_t gpa)
>> +{
>> +    if (get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE) !=
>> +        RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
>> +        return false; /* Invalid MSI/MRIF mode */
>> +    }
>> +
>> +    if ((PPN_DOWN(gpa) ^ ctx->msi_addr_pattern) & ~ctx->msi_addr_mask) {
>> +        return false; /* GPA not in MSI range defined by AIA IMSIC rules. */
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/* RISC-V IOMMU Address Translation Lookup - Page Table Walk */
>> +static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>> +    IOMMUTLBEntry *iotlb)
>> +{
>> +    /* Early check for MSI address match when IOVA == GPA */
>> +    if (iotlb->perm & IOMMU_WO &&
>> +        riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
>> +        iotlb->target_as = &s->trap_as;
>> +        iotlb->translated_addr = iotlb->iova;
>> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
>> +        return 0;
>> +    }
>> +
>> +    /* Exit early for pass-through mode. */
>> +    iotlb->translated_addr = iotlb->iova;
>> +    iotlb->addr_mask = ~TARGET_PAGE_MASK;
>> +    /* Allow R/W in pass-through mode */
>> +    iotlb->perm = IOMMU_RW;
>> +    return 0;
>> +}
>> +
>> +static void riscv_iommu_report_fault(RISCVIOMMUState *s,
>> +                                     RISCVIOMMUContext *ctx,
>> +                                     uint32_t fault_type, uint32_t cause,
>> +                                     bool pv,
>> +                                     uint64_t iotval, uint64_t iotval2)
>> +{
>> +    struct riscv_iommu_fq_record ev = { 0 };
>> +
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_DTF) {
>> +        switch (cause) {
>> +        case RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_INVALID:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED:
>> +        case RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR:
>> +        case RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT:
>> +            break;
>> +        default:
>> +            /* DTF prevents reporting a fault for this given cause */
>> +            return;
>> +        }
>> +    }
>> +
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_CAUSE, cause);
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_TTYPE, fault_type);
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_DID, ctx->devid);
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PV, true);
>> +
>> +    if (pv) {
>> +        ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PID, ctx->pasid);
>> +    }
>> +
>> +    ev.iotval = iotval;
>> +    ev.iotval2 = iotval2;
>> +
>> +    riscv_iommu_fault(s, &ev);
>> +}
>> +
>> +/* Redirect MSI write for given GPA. */
>> +static MemTxResult riscv_iommu_msi_write(RISCVIOMMUState *s,
>> +    RISCVIOMMUContext *ctx, uint64_t gpa, uint64_t data,
>> +    unsigned size, MemTxAttrs attrs)
>> +{
>> +    MemTxResult res;
>> +    dma_addr_t addr;
>> +    uint64_t intn;
>> +    uint32_t n190;
>> +    uint64_t pte[2];
>> +    int fault_type = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
>> +    int cause;
>> +
>> +    if (!riscv_iommu_msi_check(s, ctx, gpa)) {
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    /* Interrupt File Number */
>> +    intn = _pext_u64(PPN_DOWN(gpa), ctx->msi_addr_mask);
>> +    if (intn >= 256) {
>> +        /* Interrupt file number out of range */
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    /* fetch MSI PTE */
>> +    addr = PPN_PHYS(get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_PPN));
>> +    addr = addr | (intn * sizeof(pte));
>> +    res = dma_memory_read(s->target_as, addr, &pte, sizeof(pte),
>> +            MEMTXATTRS_UNSPECIFIED);
>> +    if (res != MEMTX_OK) {
>> +        if (res == MEMTX_DECODE_ERROR) {
>> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED;
>> +        } else {
>> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        }
>> +        goto err;
>> +    }
>> +
>> +    le64_to_cpus(&pte[0]);
>> +    le64_to_cpus(&pte[1]);
>> +
>> +    if (!(pte[0] & RISCV_IOMMU_MSI_PTE_V) || (pte[0] & RISCV_IOMMU_MSI_PTE_C)) {
>> +        /*
>> +         * The spec mentions that: "If msipte.C == 1, then further
>> +         * processing to interpret the PTE is implementation
>> +         * defined.". We'll abort with cause = 262 for this
>> +         * case too.
>> +         */
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_INVALID;
>> +        goto err;
>> +    }
>> +
>> +    switch (get_field(pte[0], RISCV_IOMMU_MSI_PTE_M)) {
>> +    case RISCV_IOMMU_MSI_PTE_M_BASIC:
>> +        /* MSI Pass-through mode */
>> +        addr = PPN_PHYS(get_field(pte[0], RISCV_IOMMU_MSI_PTE_PPN));
>> +        addr = addr | (gpa & TARGET_PAGE_MASK);
>> +
>> +        trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
>> +                              PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
>> +                              gpa, addr);
>> +
>> +        res = dma_memory_write(s->target_as, addr, &data, size, attrs);
>> +        if (res != MEMTX_OK) {
>> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
>> +            goto err;
>> +        }
>> +
>> +        return MEMTX_OK;
>> +    case RISCV_IOMMU_MSI_PTE_M_MRIF:
>> +        /* MRIF mode, continue. */
>> +        break;
>> +    default:
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
>> +        goto err;
>> +    }
>> +
>> +    /*
>> +     * Report an error if the interrupt identity exceeds the maximum allowed
>> +     * for an IMSIC interrupt file (2047) or the destination address is not
>> +     * 32-bit aligned. See IOMMU Specification, Chapter 2.3. MSI page tables.
>> +     */
>> +    if ((data > 2047) || (gpa & 3)) {
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
>> +        goto err;
>> +    }
>> +
>> +    /* MSI MRIF mode, non-atomic pending bit update */
>> +
>> +    /* MRIF pending bit address */
>> +    addr = get_field(pte[0], RISCV_IOMMU_MSI_PTE_MRIF_ADDR) << 9;
>> +    addr = addr | ((data & 0x7c0) >> 3);
>> +
>> +    trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
>> +                          PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
>> +                          gpa, addr);
>> +
>> +    /* MRIF pending bit mask */
>> +    data = 1ULL << (data & 0x03f);
>> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    intn = intn | data;
>> +    res = dma_memory_write(s->target_as, addr, &intn, sizeof(intn), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    /* Get MRIF enable bits */
>> +    addr = addr + sizeof(intn);
>> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    if (!(intn & data)) {
>> +        /* notification disabled, MRIF update completed. */
>> +        return MEMTX_OK;
>> +    }
>> +
>> +    /* Send notification message */
>> +    addr = PPN_PHYS(get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NPPN));
>> +    n190 = get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID) |
>> +          (get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID_MSB) << 10);
>> +
>> +    res = dma_memory_write(s->target_as, addr, &n190, sizeof(n190), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    return MEMTX_OK;
>> +
>> +err:
>> +    riscv_iommu_report_fault(s, ctx, fault_type, cause,
>> +                             !!ctx->pasid, 0, 0);
>> +    return res;
>> +}
>> +
>> +/*
>> + * Check device context configuration as described by the
>> + * riscv-iommu spec section "Device-context configuration
>> + * checks".
>> + */
>> +static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
>> +                                            RISCVIOMMUContext *ctx)
>> +{
>> +    uint32_t msi_mode;
>> +
>> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
>> +        ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
>> +        return false;
>> +    }
>> +
>> +    if (!(s->cap & RISCV_IOMMU_CAP_T2GPA) &&
>> +        ctx->tc & RISCV_IOMMU_DC_TC_T2GPA) {
>> +        return false;
>> +    }
>> +
>> +    if (s->cap & RISCV_IOMMU_CAP_MSI_FLAT) {
>> +        msi_mode = get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE);
>> +
>> +        if (msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_OFF &&
>> +            msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * CAP_END is always zero (only one endianness). FCTL_BE is
>> +     * always zero (little-endian accesses). Thus TC_SBE must
>> +     * always be LE, i.e. zero.
>> +     */
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_SBE) {
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * RISC-V IOMMU Device Context Lookup - Device Directory Tree Walk
>> + *
>> + * @s         : IOMMU Device State
>> + * @ctx       : Device Translation Context with devid and pasid set.
>> + * @return    : success or fault code.
>> + */
>> +static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>> +{
>> +    const uint64_t ddtp = s->ddtp;
>> +    unsigned mode = get_field(ddtp, RISCV_IOMMU_DDTP_MODE);
>> +    dma_addr_t addr = PPN_PHYS(get_field(ddtp, RISCV_IOMMU_DDTP_PPN));
>> +    struct riscv_iommu_dc dc;
>> +    /* Device Context format: 0: extended (64 bytes) | 1: base (32 bytes) */
>> +    const int dc_fmt = !s->enable_msi;
>> +    const size_t dc_len = sizeof(dc) >> dc_fmt;
>> +    unsigned depth;
>> +    uint64_t de;
>> +
>> +    switch (mode) {
>> +    case RISCV_IOMMU_DDTP_MODE_OFF:
>> +        return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_BARE:
>> +        /* mock up pass-through translation context */
>> +        ctx->tc = RISCV_IOMMU_DC_TC_V;
>> +        ctx->ta = 0;
>> +        ctx->msiptp = 0;
>> +        return 0;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_1LVL:
>> +        depth = 0;
>> +        break;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_2LVL:
>> +        depth = 1;
>> +        break;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_3LVL:
>> +        depth = 2;
>> +        break;
>> +
>> +    default:
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
>> +    }
>> +
>> +    /*
>> +     * Check supported device id width (in bits).
>> +     * See IOMMU Specification, Chapter 6. Software guidelines.
>> +     * - if extended device-context format is used:
>> +     *   1LVL: 6, 2LVL: 15, 3LVL: 24
>> +     * - if base device-context format is used:
>> +     *   1LVL: 7, 2LVL: 16, 3LVL: 24
>> +     */
>> +    if (ctx->devid >= (1 << (depth * 9 + 6 + (dc_fmt && depth != 2)))) {
>> +        return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
>> +    }
>> +
>> +    /* Device directory tree walk */
>> +    for (; depth-- > 0; ) {
>> +        /*
>> +         * Select device id index bits based on device directory tree level
>> +         * and device context format.
>> +         * See IOMMU Specification, Chapter 2. Data Structures.
>> +         * - if extended device-context format is used:
>> +         *   device index: [23:15][14:6][5:0]
>> +         * - if base device-context format is used:
>> +         *   device index: [23:16][15:7][6:0]
>> +         */
>> +        const int split = depth * 9 + 6 + dc_fmt;
>> +        addr |= ((ctx->devid >> split) << 3) & ~TARGET_PAGE_MASK;
>> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
>> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
>> +        }
>> +        le64_to_cpus(&de);
>> +        if (!(de & RISCV_IOMMU_DDTE_VALID)) {
>> +            /* invalid directory entry */
>> +            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
>> +        }
>> +        if (de & ~(RISCV_IOMMU_DDTE_PPN | RISCV_IOMMU_DDTE_VALID)) {
>> +            /* reserved bits set */
>> +            return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
>> +        }
>> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_DDTE_PPN));
>> +    }
>> +
>> +    /* index into device context entry page */
>> +    addr |= (ctx->devid * dc_len) & ~TARGET_PAGE_MASK;
>> +
>> +    memset(&dc, 0, sizeof(dc));
>> +    if (dma_memory_read(s->target_as, addr, &dc, dc_len,
>> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
>> +    }
>> +
>> +    /* Set translation context. */
>> +    ctx->tc = le64_to_cpu(dc.tc);
>> +    ctx->ta = le64_to_cpu(dc.ta);
>> +    ctx->msiptp = le64_to_cpu(dc.msiptp);
>> +    ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
>> +    ctx->msi_addr_pattern = le64_to_cpu(dc.msi_addr_pattern);
>> +
>> +    if (!riscv_iommu_validate_device_ctx(s, ctx)) {
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
>> +    }
>> +
>> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_V)) {
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
>> +    }
>> +
>> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
>> +        if (ctx->pasid != RISCV_IOMMU_NOPASID) {
>> +            /* PASID is disabled */
>> +            return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
>> +        }
>> +        return 0;
>> +    }
>> +
>> +    /* FSC.TC.PDTV enabled */
>> +    if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
>> +        /* Invalid PDTP.MODE */
>> +        return RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED;
>> +    }
>> +
>> +    for (depth = mode - RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8; depth-- > 0; ) {
>> +        /*
>> +         * Select process id index bits based on process directory tree
>> +         * level. See IOMMU Specification, 2.2. Process-Directory-Table.
>> +         */
>> +        const int split = depth * 9 + 8;
>> +        addr |= ((ctx->pasid >> split) << 3) & ~TARGET_PAGE_MASK;
>> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
>> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
>> +        }
>> +        le64_to_cpus(&de);
>> +        if (!(de & RISCV_IOMMU_PC_TA_V)) {
>> +            return RISCV_IOMMU_FQ_CAUSE_PDT_INVALID;
>> +        }
>> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_PC_FSC_PPN));
>> +    }
>> +
>> +    /* Leaf entry in PDT */
>> +    addr |= (ctx->pasid << 4) & ~TARGET_PAGE_MASK;
>> +    if (dma_memory_read(s->target_as, addr, &dc.ta, sizeof(uint64_t) * 2,
>> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +        return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
>> +    }
>> +
>> +    /* Use FSC and TA from process directory entry. */
>> +    ctx->ta = le64_to_cpu(dc.ta);
>> +
>> +    return 0;
>> +}
>> +
>> +/* Translation Context cache support */
>> +static gboolean __ctx_equal(gconstpointer v1, gconstpointer v2)
>> +{
>> +    RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
>> +    RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
>> +    return c1->devid == c2->devid && c1->pasid == c2->pasid;
>> +}
>> +
>> +static guint __ctx_hash(gconstpointer v)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
>> +    /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
>> +    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
>> +}
>> +
>> +static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
>> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
>> +        ctx->devid == arg->devid &&
>> +        ctx->pasid == arg->pasid) {
>> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
>> +    }
>> +}
>> +
>> +static void __ctx_inval_devid(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
>> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
>> +        ctx->devid == arg->devid) {
>> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
>> +    }
>> +}
>> +
>> +static void __ctx_inval_all(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V) {
>> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
>> +    }
>> +}
>> +
>> +static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
>> +    uint32_t devid, uint32_t pasid)
>> +{
>> +    GHashTable *ctx_cache;
>> +    RISCVIOMMUContext key = {
>> +        .devid = devid,
>> +        .pasid = pasid,
>> +    };
>> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
>> +    g_hash_table_foreach(ctx_cache, func, &key);
>> +    g_hash_table_unref(ctx_cache);
>> +}
>> +
>> +/* Find or allocate translation context for a given {device_id, process_id} */
>> +static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
>> +    unsigned devid, unsigned pasid, void **ref)
>> +{
>> +    GHashTable *ctx_cache;
>> +    RISCVIOMMUContext *ctx;
>> +    RISCVIOMMUContext key = {
>> +        .devid = devid,
>> +        .pasid = pasid,
>> +    };
>> +
>> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
>> +    ctx = g_hash_table_lookup(ctx_cache, &key);
>> +
>> +    if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
>> +        *ref = ctx_cache;
>> +        return ctx;
>> +    }
>> +
>> +    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
>> +        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
>> +                                          g_free, NULL);
>> +        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
>> +    }
>> +
>> +    ctx = g_new0(RISCVIOMMUContext, 1);
>> +    ctx->devid = devid;
>> +    ctx->pasid = pasid;
>> +
>> +    int fault = riscv_iommu_ctx_fetch(s, ctx);
>> +    if (!fault) {
>> +        g_hash_table_add(ctx_cache, ctx);
>> +        *ref = ctx_cache;
>> +        return ctx;
>> +    }
>> +
>> +    g_hash_table_unref(ctx_cache);
>> +    *ref = NULL;
>> +
>> +    riscv_iommu_report_fault(s, ctx, RISCV_IOMMU_FQ_TTYPE_UADDR_RD,
>> +                             fault, !!pasid, 0, 0);
>> +
>> +    g_free(ctx);
>> +    return NULL;
>> +}
>> +
>> +static void riscv_iommu_ctx_put(RISCVIOMMUState *s, void *ref)
>> +{
>> +    if (ref) {
>> +        g_hash_table_unref((GHashTable *)ref);
>> +    }
>> +}
>> +
>> +/* Find or allocate address space for a given device */
>> +static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
>> +{
>> +    RISCVIOMMUSpace *as;
>> +
>> +    /* FIXME: PCIe bus remapping for attached endpoints. */
>> +    devid |= s->bus << 8;
>> +
>> +    qemu_mutex_lock(&s->core_lock);
>> +    QLIST_FOREACH(as, &s->spaces, list) {
>> +        if (as->devid == devid) {
>> +            break;
>> +        }
>> +    }
>> +    qemu_mutex_unlock(&s->core_lock);
>> +
>> +    if (as == NULL) {
>> +        char name[64];
>> +        as = g_new0(RISCVIOMMUSpace, 1);
>> +
>> +        as->iommu = s;
>> +        as->devid = devid;
>> +
>> +        snprintf(name, sizeof(name), "riscv-iommu-%04x:%02x.%d-iova",
>> +            PCI_BUS_NUM(as->devid), PCI_SLOT(as->devid), PCI_FUNC(as->devid));
>> +
>> +        /* IOVA address space, untranslated addresses */
>> +        memory_region_init_iommu(&as->iova_mr, sizeof(as->iova_mr),
>> +            TYPE_RISCV_IOMMU_MEMORY_REGION,
>> +            OBJECT(as), "riscv_iommu", UINT64_MAX);
>> +        address_space_init(&as->iova_as, MEMORY_REGION(&as->iova_mr), name);
>> +
>> +        qemu_mutex_lock(&s->core_lock);
>> +        QLIST_INSERT_HEAD(&s->spaces, as, list);
>> +        qemu_mutex_unlock(&s->core_lock);
>> +
>> +        trace_riscv_iommu_new(s->parent_obj.id, PCI_BUS_NUM(as->devid),
>> +                PCI_SLOT(as->devid), PCI_FUNC(as->devid));
>> +    }
>> +    return &as->iova_as;
>> +}
>> +
>> +static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>> +    IOMMUTLBEntry *iotlb)
>> +{
>> +    bool enable_pasid;
>> +    bool enable_pri;
>> +    int fault;
>> +
>> +    /*
>> +     * TC[32] is reserved for custom extensions, used here to temporarily
>> +     * enable automatic page-request generation for ATS queries.
>> +     */
>> +    enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
>> +    enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
>> +
>> +    /* Translate using device directory / page table information. */
>> +    fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
>> +
>> +    if (enable_pri && fault) {
>> +        struct riscv_iommu_pq_record pr = {0};
>> +        if (enable_pasid) {
>> +            pr.hdr = set_field(RISCV_IOMMU_PREQ_HDR_PV,
>> +                RISCV_IOMMU_PREQ_HDR_PID, ctx->pasid);
>> +        }
>> +        pr.hdr = set_field(pr.hdr, RISCV_IOMMU_PREQ_HDR_DID, ctx->devid);
>> +        pr.payload = (iotlb->iova & TARGET_PAGE_MASK) |
>> +                     RISCV_IOMMU_PREQ_PAYLOAD_M;
>> +        riscv_iommu_pri(s, &pr);
>> +        return fault;
>> +    }
>> +
>> +    if (fault) {
>> +        unsigned ttype;
>> +
>> +        if (iotlb->perm & IOMMU_RW) {
>> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
>> +        } else {
>> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
>> +        }
>> +
>> +        riscv_iommu_report_fault(s, ctx, ttype, fault, enable_pasid,
>> +                                 iotlb->iova, iotlb->translated_addr);
>> +        return fault;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* IOMMU Command Interface */
>> +static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
>> +    uint64_t addr, uint32_t data)
>> +{
>> +    /*
>> +     * ATS processing in this implementation of the IOMMU is synchronous,
>> +     * no need to wait for completions here.
>> +     */
>> +    if (!notify) {
>> +        return MEMTX_OK;
>> +    }
>> +
>> +    return dma_memory_write(s->target_as, addr, &data, sizeof(data),
>> +        MEMTXATTRS_UNSPECIFIED);
>> +}
>> +
>> +static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
>> +{
>> +    uint64_t old_ddtp = s->ddtp;
>> +    uint64_t new_ddtp = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_DDTP);
>> +    unsigned new_mode = get_field(new_ddtp, RISCV_IOMMU_DDTP_MODE);
>> +    unsigned old_mode = get_field(old_ddtp, RISCV_IOMMU_DDTP_MODE);
>> +    bool ok = false;
>> +
>> +    /*
>> +     * Check for allowed DDTP.MODE transitions:
>> +     * {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
>> +     * {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
>> +     */
>> +    if (new_mode == old_mode ||
>> +        new_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
>> +        new_mode == RISCV_IOMMU_DDTP_MODE_BARE) {
>> +        ok = true;
>> +    } else if (new_mode == RISCV_IOMMU_DDTP_MODE_1LVL ||
>> +               new_mode == RISCV_IOMMU_DDTP_MODE_2LVL ||
>> +               new_mode == RISCV_IOMMU_DDTP_MODE_3LVL) {
>> +        ok = old_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
>> +             old_mode == RISCV_IOMMU_DDTP_MODE_BARE;
>> +    }
>> +
>> +    if (ok) {
>> +        /* clear reserved and busy bits, report back sanitized version */
>> +        new_ddtp = set_field(new_ddtp & RISCV_IOMMU_DDTP_PPN,
>> +                             RISCV_IOMMU_DDTP_MODE, new_mode);
>> +    } else {
>> +        new_ddtp = old_ddtp;
>> +    }
>> +    s->ddtp = new_ddtp;
>> +
>> +    riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_DDTP, new_ddtp);
>> +}
>> +
>> +/* Command function and opcode field. */
>> +#define RISCV_IOMMU_CMD(func, op) (((func) << 7) | (op))
>> +
>> +static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
>> +{
>> +    struct riscv_iommu_command cmd;
>> +    MemTxResult res;
>> +    dma_addr_t addr;
>> +    uint32_t tail, head, ctrl;
>> +    uint64_t cmd_opcode;
>> +    GHFunc func;
>> +
>> +    ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
>> +    tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQT) & s->cq_mask;
>> +    head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQH) & s->cq_mask;
>> +
>> +    /* Check for pending error or queue processing disabled */
>> +    if (!(ctrl & RISCV_IOMMU_CQCSR_CQON) ||
>> +        !!(ctrl & (RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CQMF))) {
>> +        return;
>> +    }
>> +
>> +    while (tail != head) {
>> +        addr = s->cq_addr  + head * sizeof(cmd);
>> +        res = dma_memory_read(s->target_as, addr, &cmd, sizeof(cmd),
>> +                              MEMTXATTRS_UNSPECIFIED);
>> +
>> +        if (res != MEMTX_OK) {
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
>> +                                  RISCV_IOMMU_CQCSR_CQMF, 0);
>> +            goto fault;
>> +        }
>> +
>> +        trace_riscv_iommu_cmd(s->parent_obj.id, cmd.dword0, cmd.dword1);
>> +
>> +        cmd_opcode = get_field(cmd.dword0,
>> +                               RISCV_IOMMU_CMD_OPCODE | RISCV_IOMMU_CMD_FUNC);
>> +
>> +        switch (cmd_opcode) {
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOFENCE_FUNC_C,
>> +                             RISCV_IOMMU_CMD_IOFENCE_OPCODE):
>> +            res = riscv_iommu_iofence(s,
>> +                cmd.dword0 & RISCV_IOMMU_CMD_IOFENCE_AV, cmd.dword1,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOFENCE_DATA));
>> +
>> +            if (res != MEMTX_OK) {
>> +                riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
>> +                                      RISCV_IOMMU_CQCSR_CQMF, 0);
>> +                goto fault;
>> +            }
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA,
>> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
>> +            if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
>> +                /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
>> +                goto cmd_ill;
>> +            }
>> +            /* translation cache not implemented yet */
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
>> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
>> +            /* translation cache not implemented yet */
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
>> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
>> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
>> +                /* invalidate all device context cache mappings */
>> +                func = __ctx_inval_all;
>> +            } else {
>> +                /* invalidate all device context matching DID */
>> +                func = __ctx_inval_devid;
>> +            }
>> +            riscv_iommu_ctx_inval(s, func,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID), 0);
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT,
>> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
>> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
>> +                /* illegal command arguments IODIR_PDT & DV == 0 */
>> +                goto cmd_ill;
>> +            } else {
>> +                func = __ctx_inval_devid_pasid;
>> +            }
>> +            riscv_iommu_ctx_inval(s, func,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID),
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
>> +            break;
>> +
>> +        default:
>> +        cmd_ill:
>> +            /* Invalid instruction, do not advance instruction index. */
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
>> +                RISCV_IOMMU_CQCSR_CMD_ILL, 0);
>> +            goto fault;
>> +        }
>> +
>> +        /* Advance and update head pointer after command completes. */
>> +        head = (head + 1) & s->cq_mask;
>> +        riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_CQH, head);
>> +    }
>> +    return;
>> +
>> +fault:
>> +    if (ctrl & RISCV_IOMMU_CQCSR_CIE) {
>> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_CQ);
>> +    }
>> +}
>> +
>> +static void riscv_iommu_process_cq_control(RISCVIOMMUState *s)
>> +{
>> +    uint64_t base;
>> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
>> +    uint32_t ctrl_clr;
>> +    bool enable = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQEN);
>> +    bool active = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQON);
>> +
>> +    if (enable && !active) {
>> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_CQB);
>> +        s->cq_mask = (2ULL << get_field(base, RISCV_IOMMU_CQB_LOG2SZ)) - 1;
>> +        s->cq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_CQB_PPN));
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~s->cq_mask);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQH], 0);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQT], 0);
>> +        ctrl_set = RISCV_IOMMU_CQCSR_CQON;
>> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQMF |
>> +                   RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CMD_TO |
>> +                   RISCV_IOMMU_CQCSR_FENCE_W_IP;
>> +    } else if (!enable && active) {
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~0);
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQON;
>> +    } else {
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR, ctrl_set, ctrl_clr);
>> +}
>> +
>> +static void riscv_iommu_process_fq_control(RISCVIOMMUState *s)
>> +{
>> +    uint64_t base;
>> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
>> +    uint32_t ctrl_clr;
>> +    bool enable = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQEN);
>> +    bool active = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQON);
>> +
>> +    if (enable && !active) {
>> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_FQB);
>> +        s->fq_mask = (2ULL << get_field(base, RISCV_IOMMU_FQB_LOG2SZ)) - 1;
>> +        s->fq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_FQB_PPN));
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~s->fq_mask);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQH], 0);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQT], 0);
>> +        ctrl_set = RISCV_IOMMU_FQCSR_FQON;
>> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQMF |
>> +            RISCV_IOMMU_FQCSR_FQOF;
>> +    } else if (!enable && active) {
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~0);
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQON;
>> +    } else {
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR, ctrl_set, ctrl_clr);
>> +}
>> +
>> +static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
>> +{
>> +    uint64_t base;
>> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
>> +    uint32_t ctrl_clr;
>> +    bool enable = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQEN);
>> +    bool active = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQON);
>> +
>> +    if (enable && !active) {
>> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_PQB);
>> +        s->pq_mask = (2ULL << get_field(base, RISCV_IOMMU_PQB_LOG2SZ)) - 1;
>> +        s->pq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_PQB_PPN));
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~s->pq_mask);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQH], 0);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQT], 0);
>> +        ctrl_set = RISCV_IOMMU_PQCSR_PQON;
>> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQMF |
>> +            RISCV_IOMMU_PQCSR_PQOF;
>> +    } else if (!enable && active) {
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~0);
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQON;
>> +    } else {
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
>> +}
>> +
>> +typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
>> +
>> +static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
>> +{
>> +    uint32_t cqcsr, fqcsr, pqcsr;
>> +    uint32_t ipsr_set = 0;
>> +    uint32_t ipsr_clr = 0;
>> +
>> +    if (data & RISCV_IOMMU_IPSR_CIP) {
>> +        cqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
>> +
>> +        if (cqcsr & RISCV_IOMMU_CQCSR_CIE &&
>> +            (cqcsr & RISCV_IOMMU_CQCSR_FENCE_W_IP ||
>> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_ILL ||
>> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_TO ||
>> +             cqcsr & RISCV_IOMMU_CQCSR_CQMF)) {
>> +            ipsr_set |= RISCV_IOMMU_IPSR_CIP;
>> +        } else {
>> +            ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
>> +        }
>> +    } else {
>> +        ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
>> +    }
>> +
>> +    if (data & RISCV_IOMMU_IPSR_FIP) {
>> +        fqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
>> +
>> +        if (fqcsr & RISCV_IOMMU_FQCSR_FIE &&
>> +            (fqcsr & RISCV_IOMMU_FQCSR_FQOF ||
>> +             fqcsr & RISCV_IOMMU_FQCSR_FQMF)) {
>> +            ipsr_set |= RISCV_IOMMU_IPSR_FIP;
>> +        } else {
>> +            ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
>> +        }
>> +    } else {
>> +        ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
>> +    }
>> +
>> +    if (data & RISCV_IOMMU_IPSR_PIP) {
>> +        pqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
>> +
>> +        if (pqcsr & RISCV_IOMMU_PQCSR_PIE &&
>> +            (pqcsr & RISCV_IOMMU_PQCSR_PQOF ||
>> +             pqcsr & RISCV_IOMMU_PQCSR_PQMF)) {
>> +            ipsr_set |= RISCV_IOMMU_IPSR_PIP;
>> +        } else {
>> +            ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
>> +        }
>> +    } else {
>> +        ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, ipsr_set, ipsr_clr);
>> +}
>> +
>> +static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
>> +    uint64_t data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    riscv_iommu_process_fn *process_fn = NULL;
>> +    RISCVIOMMUState *s = opaque;
>> +    uint32_t regb = addr & ~3;
>> +    uint32_t busy = 0;
>> +    uint64_t val = 0;
>> +
>> +    if ((addr & (size - 1)) != 0) {
>> +        /* Unsupported MMIO alignment or access size */
>> +        return MEMTX_ERROR;
>> +    }
>> +
>> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
>> +        /* Unsupported MMIO access location. */
>> +        return MEMTX_ACCESS_ERROR;
>> +    }
>> +
>> +    /* Track actionable MMIO write. */
>> +    switch (regb) {
>> +    case RISCV_IOMMU_REG_DDTP:
>> +    case RISCV_IOMMU_REG_DDTP + 4:
>> +        process_fn = riscv_iommu_process_ddtp;
>> +        regb = RISCV_IOMMU_REG_DDTP;
>> +        busy = RISCV_IOMMU_DDTP_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_CQT:
>> +        process_fn = riscv_iommu_process_cq_tail;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_CQCSR:
>> +        process_fn = riscv_iommu_process_cq_control;
>> +        busy = RISCV_IOMMU_CQCSR_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_FQCSR:
>> +        process_fn = riscv_iommu_process_fq_control;
>> +        busy = RISCV_IOMMU_FQCSR_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_PQCSR:
>> +        process_fn = riscv_iommu_process_pq_control;
>> +        busy = RISCV_IOMMU_PQCSR_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_IPSR:
>> +        /*
>> +         * IPSR has special procedures to update. Execute it
>> +         * and exit.
>> +         */
>> +        if (size == 4) {
>> +            uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
>> +            uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
>> +            uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
>> +            stl_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
>> +        } else if (size == 8) {
>> +            uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
>> +            uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
>> +            uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
>> +            stq_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
>> +        }
>> +
>> +        riscv_iommu_update_ipsr(s, val);
>> +
>> +        return MEMTX_OK;
>> +
>> +    default:
>> +        break;
>> +    }
>> +
>> +    /*
>> +     * Register updates might not be synchronized with the core logic.
>> +     * If system software writes a register while its BUSY bit is set,
>> +     * the IOMMU behavior for those additional writes is UNSPECIFIED.
>> +     */
>> +    qemu_spin_lock(&s->regs_lock);
>> +    qemu_spin_lock(&s->regs_lock);
>> +    if (size == 1) {
>> +        uint8_t ro = s->regs_ro[addr];
>> +        uint8_t wc = s->regs_wc[addr];
>> +        uint8_t rw = s->regs_rw[addr];
>> +        s->regs_rw[addr] = ((rw & ro) | (data & ~ro)) & ~(data & wc);
>> +    } else if (size == 2) {
>> +        uint16_t ro = lduw_le_p(&s->regs_ro[addr]);
>> +        uint16_t wc = lduw_le_p(&s->regs_wc[addr]);
>> +        uint16_t rw = lduw_le_p(&s->regs_rw[addr]);
>> +        stw_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
>> +    } else if (size == 4) {
>> +        uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
>> +        uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
>> +        uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
>> +        stl_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
>> +    } else if (size == 8) {
>> +        uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
>> +        uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
>> +        uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
>> +        stq_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
>> +    }
>> +
>> +    /* Busy flag update, MSB 4-byte register. */
>> +    if (busy) {
>> +        uint32_t rw = ldl_le_p(&s->regs_rw[regb]);
>> +        stl_le_p(&s->regs_rw[regb], rw | busy);
>> +    }
>> +    qemu_spin_unlock(&s->regs_lock);
>> +
>> +    if (process_fn) {
>> +        qemu_mutex_lock(&s->core_lock);
>> +        process_fn(s);
>> +        qemu_mutex_unlock(&s->core_lock);
>> +    }
>> +
>> +    return MEMTX_OK;
>> +}
>> +
>> +static MemTxResult riscv_iommu_mmio_read(void *opaque, hwaddr addr,
>> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    RISCVIOMMUState *s = opaque;
>> +    uint64_t val = -1;
>> +    uint8_t *ptr;
>> +
>> +    if ((addr & (size - 1)) != 0) {
>> +        /* Unsupported MMIO alignment. */
>> +        return MEMTX_ERROR;
>> +    }
>> +
>> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
>> +        return MEMTX_ACCESS_ERROR;
>> +    }
>> +
>> +    ptr = &s->regs_rw[addr];
>> +
>> +    if (size == 1) {
>> +        val = (uint64_t)*ptr;
>> +    } else if (size == 2) {
>> +        val = lduw_le_p(ptr);
>> +    } else if (size == 4) {
>> +        val = ldl_le_p(ptr);
>> +    } else if (size == 8) {
>> +        val = ldq_le_p(ptr);
>> +    } else {
>> +        return MEMTX_ERROR;
>> +    }
>> +
>> +    *data = val;
>> +
>> +    return MEMTX_OK;
>> +}
>> +
>> +static const MemoryRegionOps riscv_iommu_mmio_ops = {
>> +    .read_with_attrs = riscv_iommu_mmio_read,
>> +    .write_with_attrs = riscv_iommu_mmio_write,
>> +    .endianness = DEVICE_NATIVE_ENDIAN,
>> +    .impl = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +        .unaligned = false,
>> +    },
>> +    .valid = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +    }
>> +};
>> +
>> +/*
>> + * Translations matching the MSI pattern check are redirected to the
>> + * "riscv-iommu-trap" memory region as untranslated addresses, for additional
>> + * MSI/MRIF interception by the IOMMU interrupt remapping implementation.
>> + * Note: Device emulation code generating an MSI is expected to provide valid
>> + * memory transaction attributes with requester_id set.
>> + */
>> +static MemTxResult riscv_iommu_trap_write(void *opaque, hwaddr addr,
>> +    uint64_t data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    RISCVIOMMUState* s = (RISCVIOMMUState *)opaque;
>> +    RISCVIOMMUContext *ctx;
>> +    MemTxResult res;
>> +    void *ref;
>> +    uint32_t devid = attrs.requester_id;
>> +
>> +    if (attrs.unspecified) {
>> +        return MEMTX_ACCESS_ERROR;
>> +    }
>> +
>> +    /* FIXME: PCIe bus remapping for attached endpoints. */
>> +    devid |= s->bus << 8;
>> +
>> +    ctx = riscv_iommu_ctx(s, devid, 0, &ref);
>> +    if (ctx == NULL) {
>> +        res = MEMTX_ACCESS_ERROR;
>> +    } else {
>> +        res = riscv_iommu_msi_write(s, ctx, addr, data, size, attrs);
>> +    }
>> +    riscv_iommu_ctx_put(s, ref);
>> +    return res;
>> +}
>> +
>> +static MemTxResult riscv_iommu_trap_read(void *opaque, hwaddr addr,
>> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    return MEMTX_ACCESS_ERROR;
>> +}
>> +
>> +static const MemoryRegionOps riscv_iommu_trap_ops = {
>> +    .read_with_attrs = riscv_iommu_trap_read,
>> +    .write_with_attrs = riscv_iommu_trap_write,
>> +    .endianness = DEVICE_LITTLE_ENDIAN,
>> +    .impl = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +        .unaligned = true,
>> +    },
>> +    .valid = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +    }
>> +};
>> +
>> +static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>> +{
>> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
>> +
>> +    s->cap = s->version & RISCV_IOMMU_CAP_VERSION;
>> +    if (s->enable_msi) {
>> +        s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
>> +    }
>> +    /* Report QEMU target physical address space limits */
>> +    s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
>> +                       TARGET_PHYS_ADDR_SPACE_BITS);
>> +
>> +    /* TODO: method to report supported PASID bits */
>> +    s->pasid_bits = 8; /* restricted to size of MemTxAttrs.pasid */
>> +    s->cap |= RISCV_IOMMU_CAP_PD8;
>> +
>> +    /* Out-of-reset translation mode: OFF (DMA disabled) BARE (passthrough) */
>> +    s->ddtp = set_field(0, RISCV_IOMMU_DDTP_MODE, s->enable_off ?
>> +                        RISCV_IOMMU_DDTP_MODE_OFF : RISCV_IOMMU_DDTP_MODE_BARE);
>> +
>> +    /* register storage */
>> +    s->regs_rw = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
>> +    s->regs_ro = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
>> +    s->regs_wc = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
>> +
>> +    /* Mark all registers read-only */
>> +    memset(s->regs_ro, 0xff, RISCV_IOMMU_REG_SIZE);
>> +
>> +    /*
>> +     * Register complete MMIO space, including MSI/PBA registers.
>> +     * Note, PCIDevice implementation will add overlapping MR for MSI/PBA,
>> +     * managed directly by the PCIDevice implementation.
>> +     */
>> +    memory_region_init_io(&s->regs_mr, OBJECT(dev), &riscv_iommu_mmio_ops, s,
>> +        "riscv-iommu-regs", RISCV_IOMMU_REG_SIZE);
>> +
>> +    /* Set power-on register state */
>> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
>> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
>> +        ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
>> +        ~(RISCV_IOMMU_CQB_LOG2SZ | RISCV_IOMMU_CQB_PPN));
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQB],
>> +        ~(RISCV_IOMMU_FQB_LOG2SZ | RISCV_IOMMU_FQB_PPN));
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQB],
>> +        ~(RISCV_IOMMU_PQB_LOG2SZ | RISCV_IOMMU_PQB_PPN));
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQMF |
>> +        RISCV_IOMMU_CQCSR_CMD_TO | RISCV_IOMMU_CQCSR_CMD_ILL);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQON |
>> +        RISCV_IOMMU_CQCSR_BUSY);
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQMF |
>> +        RISCV_IOMMU_FQCSR_FQOF);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQON |
>> +        RISCV_IOMMU_FQCSR_BUSY);
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQMF |
>> +        RISCV_IOMMU_PQCSR_PQOF);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQON |
>> +        RISCV_IOMMU_PQCSR_BUSY);
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
>> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
>> +
>> +    /* Memory region for downstream access, if specified. */
>> +    if (s->target_mr) {
>> +        s->target_as = g_new0(AddressSpace, 1);
>> +        address_space_init(s->target_as, s->target_mr,
>> +            "riscv-iommu-downstream");
>> +    } else {
>> +        /* Fallback to global system memory. */
>> +        s->target_as = &address_space_memory;
>> +    }
>> +
>> +    /* Memory region for untranslated MRIF/MSI writes */
>> +    memory_region_init_io(&s->trap_mr, OBJECT(dev), &riscv_iommu_trap_ops, s,
>> +            "riscv-iommu-trap", ~0ULL);
>> +    address_space_init(&s->trap_as, &s->trap_mr, "riscv-iommu-trap-as");
>> +
>> +    /* Device translation context cache */
>> +    s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
>> +                                         g_free, NULL);
>> +
>> +    s->iommus.le_next = NULL;
>> +    s->iommus.le_prev = NULL;
>> +    QLIST_INIT(&s->spaces);
>> +    qemu_mutex_init(&s->core_lock);
>> +    qemu_spin_init(&s->regs_lock);
>> +}
>> +
>> +static void riscv_iommu_unrealize(DeviceState *dev)
>> +{
>> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
>> +
>> +    qemu_mutex_destroy(&s->core_lock);
>> +    g_hash_table_unref(s->ctx_cache);
>> +}
>> +
>> +static Property riscv_iommu_properties[] = {
>> +    DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
>> +        RISCV_IOMMU_SPEC_DOT_VER),
>> +    DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
>> +    DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
>> +    DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
>> +    DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
>> +        TYPE_MEMORY_REGION, MemoryRegion *),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void riscv_iommu_class_init(ObjectClass *klass, void* data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    /* internal device for riscv-iommu-{pci/sys}, not user-creatable */
>> +    dc->user_creatable = false;
>> +    dc->realize = riscv_iommu_realize;
>> +    dc->unrealize = riscv_iommu_unrealize;
>> +    device_class_set_props(dc, riscv_iommu_properties);
>> +}
>> +
>> +static const TypeInfo riscv_iommu_info = {
>> +    .name = TYPE_RISCV_IOMMU,
>> +    .parent = TYPE_DEVICE,
>> +    .instance_size = sizeof(RISCVIOMMUState),
>> +    .class_init = riscv_iommu_class_init,
>> +};
>> +
>> +static const char *IOMMU_FLAG_STR[] = {
>> +    "NA",
>> +    "RO",
>> +    "WR",
>> +    "RW",
>> +};
>> +
>> +/* RISC-V IOMMU Memory Region - Address Translation Space */
>> +static IOMMUTLBEntry riscv_iommu_memory_region_translate(
>> +    IOMMUMemoryRegion *iommu_mr, hwaddr addr,
>> +    IOMMUAccessFlags flag, int iommu_idx)
>> +{
>> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
>> +    RISCVIOMMUContext *ctx;
>> +    void *ref;
>> +    IOMMUTLBEntry iotlb = {
>> +        .iova = addr,
>> +        .target_as = as->iommu->target_as,
>> +        .addr_mask = ~0ULL,
>> +        .perm = flag,
>> +    };
>> +
>> +    ctx = riscv_iommu_ctx(as->iommu, as->devid, iommu_idx, &ref);
>> +    if (ctx == NULL) {
>> +        /* Translation disabled or invalid. */
>> +        iotlb.addr_mask = 0;
>> +        iotlb.perm = IOMMU_NONE;
>> +    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
>> +        /* Translation disabled or fault reported. */
>> +        iotlb.addr_mask = 0;
>> +        iotlb.perm = IOMMU_NONE;
>> +    }
>> +
>> +    /* Trace all dma translations with original access flags. */
>> +    trace_riscv_iommu_dma(as->iommu->parent_obj.id, PCI_BUS_NUM(as->devid),
>> +                          PCI_SLOT(as->devid), PCI_FUNC(as->devid), iommu_idx,
>> +                          IOMMU_FLAG_STR[flag & IOMMU_RW], iotlb.iova,
>> +                          iotlb.translated_addr);
>> +
>> +    riscv_iommu_ctx_put(as->iommu, ref);
>> +
>> +    return iotlb;
>> +}
>> +
>> +static int riscv_iommu_memory_region_notify(
>> +    IOMMUMemoryRegion *iommu_mr, IOMMUNotifierFlag old,
>> +    IOMMUNotifierFlag new, Error **errp)
>> +{
>> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
>> +
>> +    if (old == IOMMU_NOTIFIER_NONE) {
>> +        as->notifier = true;
>> +        trace_riscv_iommu_notifier_add(iommu_mr->parent_obj.name);
>> +    } else if (new == IOMMU_NOTIFIER_NONE) {
>> +        as->notifier = false;
>> +        trace_riscv_iommu_notifier_del(iommu_mr->parent_obj.name);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static inline bool pci_is_iommu(PCIDevice *pdev)
>> +{
>> +    return pci_get_word(pdev->config + PCI_CLASS_DEVICE) == 0x0806;
>> +}
>> +
>> +static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void *opaque, int devfn)
>> +{
>> +    RISCVIOMMUState *s = (RISCVIOMMUState *) opaque;
>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>> +    AddressSpace *as = NULL;
>> +
>> +    if (pdev && pci_is_iommu(pdev)) {
>> +        return s->target_as;
>> +    }
>> +
>> +    /* Find first registered IOMMU device */
>> +    while (s->iommus.le_prev) {
>> +        s = *(s->iommus.le_prev);
>> +    }
>> +
>> +    /* Find first matching IOMMU */
>> +    while (s != NULL && as == NULL) {
>> +        as = riscv_iommu_space(s, PCI_BUILD_BDF(pci_bus_num(bus), devfn));
>> +        s = s->iommus.le_next;
>> +    }
>> +
>> +    return as ? as : &address_space_memory;
>> +}
>> +
>> +static const PCIIOMMUOps riscv_iommu_ops = {
>> +    .get_address_space = riscv_iommu_find_as,
>> +};
>> +
>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
>> +        Error **errp)
>> +{
>> +    if (bus->iommu_ops &&
>> +        bus->iommu_ops->get_address_space == riscv_iommu_find_as) {
>> +        /* Allow multiple IOMMUs on the same PCIe bus, link known devices */
>> +        RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque;
>> +        QLIST_INSERT_AFTER(last, iommu, iommus);
>> +    } else if (!bus->iommu_ops && !bus->iommu_opaque) {
>> +        pci_setup_iommu(bus, &riscv_iommu_ops, iommu);
>> +    } else {
>> +        error_setg(errp, "can't register secondary IOMMU for PCI bus #%d",
>> +            pci_bus_num(bus));
>> +    }
>> +}
>> +
>> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion *iommu_mr,
>> +    MemTxAttrs attrs)
>> +{
>> +    return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid;
>> +}
>> +
>> +static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion *iommu_mr)
>> +{
>> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
>> +    return 1 << as->iommu->pasid_bits;
>> +}
>> +
>> +static void riscv_iommu_memory_region_init(ObjectClass *klass, void *data)
>> +{
>> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
>> +
>> +    imrc->translate = riscv_iommu_memory_region_translate;
>> +    imrc->notify_flag_changed = riscv_iommu_memory_region_notify;
>> +    imrc->attrs_to_index = riscv_iommu_memory_region_index;
>> +    imrc->num_indexes = riscv_iommu_memory_region_index_len;
>> +}
>> +
>> +static const TypeInfo riscv_iommu_memory_region_info = {
>> +    .parent = TYPE_IOMMU_MEMORY_REGION,
>> +    .name = TYPE_RISCV_IOMMU_MEMORY_REGION,
>> +    .class_init = riscv_iommu_memory_region_init,
>> +};
>> +
>> +static void riscv_iommu_register_mr_types(void)
>> +{
>> +    type_register_static(&riscv_iommu_memory_region_info);
>> +    type_register_static(&riscv_iommu_info);
>> +}
>> +
>> +type_init(riscv_iommu_register_mr_types);
>> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
>> new file mode 100644
>> index 0000000000..31d3907d33
>> --- /dev/null
>> +++ b/hw/riscv/riscv-iommu.h
>> @@ -0,0 +1,141 @@
>> +/*
>> + * QEMU emulation of a RISC-V IOMMU
>> + *
>> + * Copyright (C) 2022-2023 Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_RISCV_IOMMU_STATE_H
>> +#define HW_RISCV_IOMMU_STATE_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +
>> +#include "hw/riscv/iommu.h"
>> +
>> +struct RISCVIOMMUState {
>> +    /*< private >*/
>> +    DeviceState parent_obj;
>> +
>> +    /*< public >*/
>> +    uint32_t version;     /* Reported interface version number */
>> +    uint32_t pasid_bits;  /* process identifier width */
>> +    uint32_t bus;         /* PCI bus mapping for non-root endpoints */
>> +
>> +    uint64_t cap;         /* IOMMU supported capabilities */
>> +    uint64_t fctl;        /* IOMMU enabled features */
>> +
>> +    bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
>> +    bool enable_msi;      /* Enable MSI remapping */
>> +
>> +    /* IOMMU Internal State */
>> +    uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */
>> +
>> +    dma_addr_t cq_addr;   /* Command queue base physical address */
>> +    dma_addr_t fq_addr;   /* Fault/event queue base physical address */
>> +    dma_addr_t pq_addr;   /* Page request queue base physical address */
>> +
>> +    uint32_t cq_mask;     /* Command queue index bit mask */
>> +    uint32_t fq_mask;     /* Fault/event queue index bit mask */
>> +    uint32_t pq_mask;     /* Page request queue index bit mask */
>> +
>> +    /* interrupt notifier */
>> +    void (*notify)(RISCVIOMMUState *iommu, unsigned vector);
>> +
>> +    /* IOMMU State Machine */
>> +    QemuThread core_proc; /* Background processing thread */
>> +    QemuMutex core_lock;  /* Global IOMMU lock, used for cache/regs updates */
>> +    QemuCond core_cond;   /* Background processing wake up signal */
>> +    unsigned core_exec;   /* Processing thread execution actions */
>> +
>> +    /* IOMMU target address space */
>> +    AddressSpace *target_as;
>> +    MemoryRegion *target_mr;
>> +
>> +    /* MSI / MRIF access trap */
>> +    AddressSpace trap_as;
>> +    MemoryRegion trap_mr;
>> +
>> +    GHashTable *ctx_cache;          /* Device translation Context Cache */
>> +
>> +    /* MMIO Hardware Interface */
>> +    MemoryRegion regs_mr;
>> +    QemuSpin regs_lock;
>> +    uint8_t *regs_rw;  /* register state (user write) */
>> +    uint8_t *regs_wc;  /* write-1-to-clear mask */
>> +    uint8_t *regs_ro;  /* read-only mask */
>> +
>> +    QLIST_ENTRY(RISCVIOMMUState) iommus;
>> +    QLIST_HEAD(, RISCVIOMMUSpace) spaces;
>> +};
>> +
>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
>> +         Error **errp);
>> +
>> +/* private helpers */
>> +
>> +/* Register helper functions */
>> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s,
>> +    unsigned idx, uint32_t set, uint32_t clr)
>> +{
>> +    uint32_t val;
>> +    qemu_spin_lock(&s->regs_lock);
>> +    val = ldl_le_p(s->regs_rw + idx);
>> +    stl_le_p(s->regs_rw + idx, (val & ~clr) | set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +    return val;
>> +}
>> +
>> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s,
>> +    unsigned idx, uint32_t set)
>> +{
>> +    qemu_spin_lock(&s->regs_lock);
>> +    stl_le_p(s->regs_rw + idx, set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +}
>> +
>> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s,
>> +    unsigned idx)
>> +{
>> +    return ldl_le_p(s->regs_rw + idx);
>> +}
>> +
>> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s,
>> +    unsigned idx, uint64_t set, uint64_t clr)
>> +{
>> +    uint64_t val;
>> +    qemu_spin_lock(&s->regs_lock);
>> +    val = ldq_le_p(s->regs_rw + idx);
>> +    stq_le_p(s->regs_rw + idx, (val & ~clr) | set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +    return val;
>> +}
>> +
>> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s,
>> +    unsigned idx, uint64_t set)
>> +{
>> +    qemu_spin_lock(&s->regs_lock);
>> +    stq_le_p(s->regs_rw + idx, set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +}
>> +
>> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s,
>> +    unsigned idx)
>> +{
>> +    return ldq_le_p(s->regs_rw + idx);
>> +}
>> +
>> +
>> +
>> +#endif
>> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
>> new file mode 100644
>> index 0000000000..42a97caffa
>> --- /dev/null
>> +++ b/hw/riscv/trace-events
>> @@ -0,0 +1,11 @@
>> +# See documentation at docs/devel/tracing.rst
>> +
>> +# riscv-iommu.c
>> +riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) "%s: device attached %04x:%02x.%d"
>> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" iova: 0x%"PRIx64
>> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
>> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
>> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
>> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
>> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
>> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
>> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h
>> new file mode 100644
>> index 0000000000..8c0e3ca1f3
>> --- /dev/null
>> +++ b/hw/riscv/trace.h
>> @@ -0,0 +1 @@
>> +#include "trace/trace-hw_riscv.h"
>> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h
>> new file mode 100644
>> index 0000000000..070ee69973
>> --- /dev/null
>> +++ b/include/hw/riscv/iommu.h
>> @@ -0,0 +1,36 @@
>> +/*
>> + * QEMU emulation of a RISC-V IOMMU
>> + *
>> + * Copyright (C) 2022-2023 Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_RISCV_IOMMU_H
>> +#define HW_RISCV_IOMMU_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +
>> +#define TYPE_RISCV_IOMMU "riscv-iommu"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU)
>> +typedef struct RISCVIOMMUState RISCVIOMMUState;
>> +
>> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr"
>> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace;
>> +
>> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI)
>> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci;
>> +
>> +#endif
>> diff --git a/meson.build b/meson.build
>> index a9de71d450..8099d8271c 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -3319,6 +3319,7 @@ if have_system
>>       'hw/pci-host',
>>       'hw/ppc',
>>       'hw/rtc',
>> +    'hw/riscv',
>>       'hw/s390x',
>>       'hw/scsi',
>>       'hw/sd',
> 
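
As a reading aid for the register write path quoted above: every MMIO
write is combined with the current register value through the read-only
(regs_ro) and write-1-to-clear (regs_wc) masks, using the expression
((rw & ro) | (data & ~ro)) & ~(data & wc). The sketch below isolates that
expression for a 32-bit register. The helper name masked_write32 is made
up for illustration and is not part of the patch; only the expression
itself is taken from the quoted code.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): compute the new value of a
 * 32-bit register after a guest write.
 *
 *   rw   - current register contents
 *   ro   - mask of read-only bits (kept from rw, ignoring data)
 *   wc   - mask of write-1-to-clear bits
 *   data - value written by the guest
 */
static uint32_t masked_write32(uint32_t rw, uint32_t ro, uint32_t wc,
                               uint32_t data)
{
    /* Keep read-only bits, take the remaining bits from data ... */
    uint32_t merged = (rw & ro) | (data & ~ro);

    /* ... then clear every WC bit that the guest wrote as 1. */
    return merged & ~(data & wc);
}
```

With ro and wc both set to all ones, which is how realize() leaves IPSR,
writing a 1 clears the corresponding pending bit and every other bit is
left alone. With wc at zero, the call degenerates to a plain masked
write.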



* Re: [PATCH v3 08/13] hw/riscv/riscv-iommu: add Address Translation Cache (IOATC)
  2024-06-05 17:34   ` Tomasz Jeznach
@ 2024-06-07  8:30     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-07  8:30 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, ajones, frank.chang

Hi Tomasz,

On 6/5/24 2:34 PM, Tomasz Jeznach wrote:
> Daniel,
> 
> Thank you for your upstreaming work!

Glad to help!

> 
> I've synchronized the private branch with the v3 changes, and noticed
> there is an important change missing in this patchset. We need a
> reader-writer lock around accesses to the GHashTable caches, as GLib
> hash tables are not MT-safe. The diff is added below and is also
> available on github [1], branch riscv_iommu_v4-rc1.
> 
> [1] link: https://github.com/tjeznach/qemu/tree/riscv_iommu_v4-rc1


I just picked up the changes and squashed them into patches 3 and 8. Thanks!


Daniel

> 
> Thanks!
> - Tomasz Jeznach
> 
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> index a27f56419a..75c5d645fc 100644
> --- a/hw/riscv/riscv-iommu.c
> +++ b/hw/riscv/riscv-iommu.c
> @@ -991,7 +991,9 @@ static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
>           .pasid = pasid,
>       };
>       ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    pthread_rwlock_wrlock(&s->ctx_lock);
>       g_hash_table_foreach(ctx_cache, func, &key);
> +    pthread_rwlock_unlock(&s->ctx_lock);
>       g_hash_table_unref(ctx_cache);
>   }
> 
> @@ -1007,26 +1009,31 @@ static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
>       };
> 
>       ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    pthread_rwlock_rdlock(&s->ctx_lock);
>       ctx = g_hash_table_lookup(ctx_cache, &key);
> +    pthread_rwlock_unlock(&s->ctx_lock);
> 
>       if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
>           *ref = ctx_cache;
>           return ctx;
>       }
> 
> -    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
> -        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> -                                          g_free, NULL);
> -        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
> -    }
> -
>       ctx = g_new0(RISCVIOMMUContext, 1);
>       ctx->devid = devid;
>       ctx->pasid = pasid;
> 
>       int fault = riscv_iommu_ctx_fetch(s, ctx);
>       if (!fault) {
> +        pthread_rwlock_wrlock(&s->ctx_lock);
> +        if (g_hash_table_size(ctx_cache) >= LIMIT_CACHE_CTX) {
> +            g_hash_table_unref(ctx_cache);
> +            ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> +                                              g_free, NULL);
> +            g_hash_table_ref(ctx_cache);
> +            g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
> +        }
>           g_hash_table_add(ctx_cache, ctx);
> +        pthread_rwlock_unlock(&s->ctx_lock);
>           *ref = ctx_cache;
>           return ctx;
>       }
> @@ -1176,12 +1183,14 @@ static void riscv_iommu_iot_update(RISCVIOMMUState *s,
>           return;
>       }
> 
> +    pthread_rwlock_wrlock(&s->iot_lock);
>       if (g_hash_table_size(s->iot_cache) >= s->iot_limit) {
>           iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
>                                             g_free, NULL);
>           g_hash_table_unref(qatomic_xchg(&s->iot_cache, iot_cache));
>       }
>       g_hash_table_add(iot_cache, iot);
> +    pthread_rwlock_unlock(&s->iot_lock);
>   }
> 
>   static void riscv_iommu_iot_inval(RISCVIOMMUState *s, GHFunc func,
> @@ -1195,7 +1204,9 @@ static void
> riscv_iommu_iot_inval(RISCVIOMMUState *s, GHFunc func,
>       };
> 
>       iot_cache = g_hash_table_ref(s->iot_cache);
> +    pthread_rwlock_wrlock(&s->iot_lock);
>       g_hash_table_foreach(iot_cache, func, &key);
> +    pthread_rwlock_unlock(&s->iot_lock);
>       g_hash_table_unref(iot_cache);
>   }
> 
> @@ -1227,7 +1238,9 @@ static int riscv_iommu_translate(RISCVIOMMUState
> *s, RISCVIOMMUContext *ctx,
>           }
>       }
> 
> +    pthread_rwlock_rdlock(&s->iot_lock);
>       iot = riscv_iommu_iot_lookup(ctx, iot_cache, iotlb->iova);
> +    pthread_rwlock_unlock(&s->iot_lock);
>       perm = iot ? iot->perm : IOMMU_NONE;
>       if (perm != IOMMU_NONE) {
>           iotlb->translated_addr = PPN_PHYS(iot->phys);
> @@ -2085,6 +2098,8 @@ static void riscv_iommu_realize(DeviceState
> *dev, Error **errp)
>                                            g_free, NULL);
>       s->iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
>                                            g_free, NULL);
> +    pthread_rwlock_init(&s->ctx_lock, NULL);
> +    pthread_rwlock_init(&s->iot_lock, NULL);
> 
>       s->iommus.le_next = NULL;
>       s->iommus.le_prev = NULL;
> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
> index 26236c3cee..041b3b9e05 100644
> --- a/hw/riscv/riscv-iommu.h
> +++ b/hw/riscv/riscv-iommu.h
> @@ -71,7 +71,9 @@ struct RISCVIOMMUState {
>       MemoryRegion trap_mr;
> 
>       GHashTable *ctx_cache;          /* Device translation Context Cache */
> +    pthread_rwlock_t ctx_lock;      /* Device translation Cache update lock */
>       GHashTable *iot_cache;          /* IO Translated Address Cache */
> +    pthread_rwlock_t iot_lock;      /* IO TLB Cache update lock */
>       unsigned iot_limit;             /* IO Translation Cache size limit */
> 
>       /* MMIO Hardware Interface */
> 
> 
> On Thu, May 23, 2024 at 10:40 AM Daniel Henrique Barboza
> <dbarboza@ventanamicro.com> wrote:
>>
>> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>>
>> The RISC-V IOMMU spec allows the IOMMU to use translation caches to hold
>> entries from the DDT. This patch implements all cache commands that were
>> marked as 'not implemented'.
>>
>> The cache also includes some fields that anticipate s-stage and g-stage
>> elements, although we don't support them yet. We'll introduce them
>> next.
>>
>> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
>> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
>> Reviewed-by: Frank Chang <frank.chang@sifive.com>
>> ---
>>   hw/riscv/riscv-iommu.c | 189 ++++++++++++++++++++++++++++++++++++++++-
>>   hw/riscv/riscv-iommu.h |   2 +
>>   2 files changed, 187 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
>> index 39b4ff1405..abf6ae7726 100644
>> --- a/hw/riscv/riscv-iommu.c
>> +++ b/hw/riscv/riscv-iommu.c
>> @@ -63,6 +63,16 @@ struct RISCVIOMMUContext {
>>       uint64_t msiptp;            /* MSI redirection page table pointer */
>>   };
>>
>> +/* Address translation cache entry */
>> +struct RISCVIOMMUEntry {
>> +    uint64_t iova:44;           /* IOVA Page Number */
>> +    uint64_t pscid:20;          /* Process Soft-Context identifier */
>> +    uint64_t phys:44;           /* Physical Page Number */
>> +    uint64_t gscid:16;          /* Guest Soft-Context identifier */
>> +    uint64_t perm:2;            /* IOMMU_RW flags */
>> +    uint64_t __rfu:2;
>> +};
>> +
>>   /* IOMMU index for transactions without PASID specified. */
>>   #define RISCV_IOMMU_NOPASID 0
>>
>> @@ -751,13 +761,125 @@ static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
>>       return &as->iova_as;
>>   }
>>
>> +/* Translation Object cache support */
>> +static gboolean __iot_equal(gconstpointer v1, gconstpointer v2)
>> +{
>> +    RISCVIOMMUEntry *t1 = (RISCVIOMMUEntry *) v1;
>> +    RISCVIOMMUEntry *t2 = (RISCVIOMMUEntry *) v2;
>> +    return t1->gscid == t2->gscid && t1->pscid == t2->pscid &&
>> +           t1->iova == t2->iova;
>> +}
>> +
>> +static guint __iot_hash(gconstpointer v)
>> +{
>> +    RISCVIOMMUEntry *t = (RISCVIOMMUEntry *) v;
>> +    return (guint)t->iova;
>> +}
>> +
>> +/* GV: 1 PSCV: 1 AV: 1 */
>> +static void __iot_inval_pscid_iova(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
>> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
>> +    if (iot->gscid == arg->gscid &&
>> +        iot->pscid == arg->pscid &&
>> +        iot->iova == arg->iova) {
>> +        iot->perm = IOMMU_NONE;
>> +    }
>> +}
>> +
>> +/* GV: 1 PSCV: 1 AV: 0 */
>> +static void __iot_inval_pscid(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
>> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
>> +    if (iot->gscid == arg->gscid &&
>> +        iot->pscid == arg->pscid) {
>> +        iot->perm = IOMMU_NONE;
>> +    }
>> +}
>> +
>> +/* GV: 1 GVMA: 1 */
>> +static void __iot_inval_gscid_gpa(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
>> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
>> +    if (iot->gscid == arg->gscid) {
>> +        /* simplified cache, no GPA matching */
>> +        iot->perm = IOMMU_NONE;
>> +    }
>> +}
>> +
>> +/* GV: 1 GVMA: 0 */
>> +static void __iot_inval_gscid(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
>> +    RISCVIOMMUEntry *arg = (RISCVIOMMUEntry *) data;
>> +    if (iot->gscid == arg->gscid) {
>> +        iot->perm = IOMMU_NONE;
>> +    }
>> +}
>> +
>> +/* GV: 0 */
>> +static void __iot_inval_all(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUEntry *iot = (RISCVIOMMUEntry *) value;
>> +    iot->perm = IOMMU_NONE;
>> +}
>> +
>> +/* caller should keep ref-count for iot_cache object */
>> +static RISCVIOMMUEntry *riscv_iommu_iot_lookup(RISCVIOMMUContext *ctx,
>> +    GHashTable *iot_cache, hwaddr iova)
>> +{
>> +    RISCVIOMMUEntry key = {
>> +        .pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID),
>> +        .iova  = PPN_DOWN(iova),
>> +    };
>> +    return g_hash_table_lookup(iot_cache, &key);
>> +}
>> +
>> +/* caller should keep ref-count for iot_cache object */
>> +static void riscv_iommu_iot_update(RISCVIOMMUState *s,
>> +    GHashTable *iot_cache, RISCVIOMMUEntry *iot)
>> +{
>> +    if (!s->iot_limit) {
>> +        return;
>> +    }
>> +
>> +    if (g_hash_table_size(s->iot_cache) >= s->iot_limit) {
>> +        iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
>> +                                          g_free, NULL);
>> +        g_hash_table_unref(qatomic_xchg(&s->iot_cache, iot_cache));
>> +    }
>> +    g_hash_table_add(iot_cache, iot);
>> +}
>> +
>> +static void riscv_iommu_iot_inval(RISCVIOMMUState *s, GHFunc func,
>> +    uint32_t gscid, uint32_t pscid, hwaddr iova)
>> +{
>> +    GHashTable *iot_cache;
>> +    RISCVIOMMUEntry key = {
>> +        .gscid = gscid,
>> +        .pscid = pscid,
>> +        .iova  = PPN_DOWN(iova),
>> +    };
>> +
>> +    iot_cache = g_hash_table_ref(s->iot_cache);
>> +    g_hash_table_foreach(iot_cache, func, &key);
>> +    g_hash_table_unref(iot_cache);
>> +}
>> +
>>   static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>> -    IOMMUTLBEntry *iotlb)
>> +    IOMMUTLBEntry *iotlb, bool enable_cache)
>>   {
>> +    RISCVIOMMUEntry *iot;
>> +    IOMMUAccessFlags perm;
>>       bool enable_pasid;
>>       bool enable_pri;
>> +    GHashTable *iot_cache;
>>       int fault;
>>
>> +    iot_cache = g_hash_table_ref(s->iot_cache);
>>       /*
>>        * TC[32] is reserved for custom extensions, used here to temporarily
>>        * enable automatic page-request generation for ATS queries.
>> @@ -765,9 +887,36 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>>       enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
>>       enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
>>
>> +    iot = riscv_iommu_iot_lookup(ctx, iot_cache, iotlb->iova);
>> +    perm = iot ? iot->perm : IOMMU_NONE;
>> +    if (perm != IOMMU_NONE) {
>> +        iotlb->translated_addr = PPN_PHYS(iot->phys);
>> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
>> +        iotlb->perm = perm;
>> +        fault = 0;
>> +        goto done;
>> +    }
>> +
>>       /* Translate using device directory / page table information. */
>>       fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
>>
>> +    if (!fault && iotlb->target_as == &s->trap_as) {
>> +        /* Do not cache trapped MSI translations */
>> +        goto done;
>> +    }
>> +
>> +    if (!fault && iotlb->translated_addr != iotlb->iova && enable_cache) {
>> +        iot = g_new0(RISCVIOMMUEntry, 1);
>> +        iot->iova = PPN_DOWN(iotlb->iova);
>> +        iot->phys = PPN_DOWN(iotlb->translated_addr);
>> +        iot->pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID);
>> +        iot->perm = iotlb->perm;
>> +        riscv_iommu_iot_update(s, iot_cache, iot);
>> +    }
>> +
>> +done:
>> +    g_hash_table_unref(iot_cache);
>> +
>>       if (enable_pri && fault) {
>>           struct riscv_iommu_pq_record pr = {0};
>>           if (enable_pasid) {
>> @@ -907,13 +1056,40 @@ static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
>>               if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
>>                   /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
>>                   goto cmd_ill;
>> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_GV)) {
>> +                /* invalidate all cache mappings */
>> +                func = __iot_inval_all;
>> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_AV)) {
>> +                /* invalidate cache matching GSCID */
>> +                func = __iot_inval_gscid;
>> +            } else {
>> +                /* invalidate cache matching GSCID and ADDR (GPA) */
>> +                func = __iot_inval_gscid_gpa;
>>               }
>> -            /* translation cache not implemented yet */
>> +            riscv_iommu_iot_inval(s, func,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_GSCID), 0,
>> +                cmd.dword1 & TARGET_PAGE_MASK);
>>               break;
>>
>>           case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
>>                                RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
>> -            /* translation cache not implemented yet */
>> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_GV)) {
>> +                /* invalidate all cache mappings, simplified model */
>> +                func = __iot_inval_all;
>> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV)) {
>> +                /* invalidate cache matching GSCID, simplified model */
>> +                func = __iot_inval_gscid;
>> +            } else if (!(cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_AV)) {
>> +                /* invalidate cache matching GSCID and PSCID */
>> +                func = __iot_inval_pscid;
>> +            } else {
>> +                /* invalidate cache matching GSCID and PSCID and ADDR (IOVA) */
>> +                func = __iot_inval_pscid_iova;
>> +            }
>> +            riscv_iommu_iot_inval(s, func,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_GSCID),
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOTINVAL_PSCID),
>> +                cmd.dword1 & TARGET_PAGE_MASK);
>>               break;
>>
>>           case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
>> @@ -1410,6 +1586,8 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>>       /* Device translation context cache */
>>       s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
>>                                            g_free, NULL);
>> +    s->iot_cache = g_hash_table_new_full(__iot_hash, __iot_equal,
>> +                                         g_free, NULL);
>>
>>       s->iommus.le_next = NULL;
>>       s->iommus.le_prev = NULL;
>> @@ -1423,6 +1601,7 @@ static void riscv_iommu_unrealize(DeviceState *dev)
>>       RISCVIOMMUState *s = RISCV_IOMMU(dev);
>>
>>       qemu_mutex_destroy(&s->core_lock);
>> +    g_hash_table_unref(s->iot_cache);
>>       g_hash_table_unref(s->ctx_cache);
>>   }
>>
>> @@ -1430,6 +1609,8 @@ static Property riscv_iommu_properties[] = {
>>       DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
>>           RISCV_IOMMU_SPEC_DOT_VER),
>>       DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
>> +    DEFINE_PROP_UINT32("ioatc-limit", RISCVIOMMUState, iot_limit,
>> +        LIMIT_CACHE_IOT),
>>       DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
>>       DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
>>       DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
>> @@ -1482,7 +1663,7 @@ static IOMMUTLBEntry riscv_iommu_memory_region_translate(
>>           /* Translation disabled or invalid. */
>>           iotlb.addr_mask = 0;
>>           iotlb.perm = IOMMU_NONE;
>> -    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
>> +    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb, true)) {
>>           /* Translation disabled or fault reported. */
>>           iotlb.addr_mask = 0;
>>           iotlb.perm = IOMMU_NONE;
>> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
>> index 31d3907d33..3afee9f3e8 100644
>> --- a/hw/riscv/riscv-iommu.h
>> +++ b/hw/riscv/riscv-iommu.h
>> @@ -68,6 +68,8 @@ struct RISCVIOMMUState {
>>       MemoryRegion trap_mr;
>>
>>       GHashTable *ctx_cache;          /* Device translation Context Cache */
>> +    GHashTable *iot_cache;          /* IO Translated Address Cache */
>> +    unsigned iot_limit;             /* IO Translation Cache size limit */
>>
>>       /* MMIO Hardware Interface */
>>       MemoryRegion regs_mr;
>> --
>> 2.44.0
>>



* Re: [PATCH v3 05/13] hw/riscv: add riscv-iommu-pci reference device
  2024-05-23 17:39 ` [PATCH v3 05/13] hw/riscv: add riscv-iommu-pci reference device Daniel Henrique Barboza
@ 2024-06-09  8:53   ` Frank Chang
  0 siblings, 0 replies; 38+ messages in thread
From: Frank Chang @ 2024-06-09  8:53 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, tjeznach, ajones, frank.chang

Hi Daniel,

On Fri, May 24, 2024 at 1:42 AM Daniel Henrique Barboza <dbarboza@ventanamicro.com> wrote:
>
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> The RISC-V IOMMU can be modelled as a PCIe device following the
> guidelines of the RISC-V IOMMU spec, chapter 7.1, "Integrating an IOMMU
> as a PCIe device".
>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---
>  hw/riscv/meson.build       |   2 +-
>  hw/riscv/riscv-iommu-pci.c | 177 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 178 insertions(+), 1 deletion(-)
>  create mode 100644 hw/riscv/riscv-iommu-pci.c
>
> diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
> index cbc99c6e8e..adbef8a9b2 100644
> --- a/hw/riscv/meson.build
> +++ b/hw/riscv/meson.build
> @@ -10,6 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
>  riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
>  riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
>  riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
> -riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
> +riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c', 'riscv-iommu-pci.c'))
>
>  hw_arch += {'riscv': riscv_ss}
> diff --git a/hw/riscv/riscv-iommu-pci.c b/hw/riscv/riscv-iommu-pci.c
> new file mode 100644
> index 0000000000..7635cc64ff
> --- /dev/null
> +++ b/hw/riscv/riscv-iommu-pci.c
> @@ -0,0 +1,177 @@
> +/*
> + * QEMU emulation of a RISC-V IOMMU
> + *
> + * Copyright (C) 2022-2023 Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/pci/msi.h"
> +#include "hw/pci/msix.h"
> +#include "hw/pci/pci_bus.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/riscv/riscv_hart.h"
> +#include "migration/vmstate.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "qemu/host-utils.h"
> +#include "qom/object.h"
> +
> +#include "cpu_bits.h"
> +#include "riscv-iommu.h"
> +#include "riscv-iommu-bits.h"
> +
> +/* RISC-V IOMMU PCI Device Emulation */
> +
> +typedef struct RISCVIOMMUStatePci {
> +    PCIDevice        pci;     /* Parent PCIe device state */
> +    uint16_t         vendor_id;
> +    uint16_t         device_id;
> +    uint8_t          revision;
> +    MemoryRegion     bar0;    /* PCI BAR (including MSI-x config) */
> +    RISCVIOMMUState  iommu;   /* common IOMMU state */
> +} RISCVIOMMUStatePci;
> +
> +/* interrupt delivery callback */
> +static void riscv_iommu_pci_notify(RISCVIOMMUState *iommu, unsigned vector)
> +{
> +    RISCVIOMMUStatePci *s = container_of(iommu, RISCVIOMMUStatePci, iommu);
> +
> +    if (msix_enabled(&(s->pci))) {
> +        msix_notify(&(s->pci), vector);
> +    }
> +}
> +
> +static void riscv_iommu_pci_realize(PCIDevice *dev, Error **errp)
> +{
> +    RISCVIOMMUStatePci *s = DO_UPCAST(RISCVIOMMUStatePci, pci, dev);
> +    RISCVIOMMUState *iommu = &s->iommu;
> +    uint8_t *pci_conf = dev->config;
> +    Error *err = NULL;
> +
> +    pci_set_word(pci_conf + PCI_VENDOR_ID, s->vendor_id);
> +    pci_set_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID, s->vendor_id);
> +    pci_set_word(pci_conf + PCI_DEVICE_ID, s->device_id);
> +    pci_set_word(pci_conf + PCI_SUBSYSTEM_ID, s->device_id);
> +    pci_set_byte(pci_conf + PCI_REVISION_ID, s->revision);
> +
> +    /* Set device id for trace / debug */
> +    DEVICE(iommu)->id = g_strdup_printf("%02x:%02x.%01x",
> +        pci_dev_bus_num(dev), PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
> +    qdev_realize(DEVICE(iommu), NULL, errp);
> +
> +    memory_region_init(&s->bar0, OBJECT(s), "riscv-iommu-bar0",
> +        QEMU_ALIGN_UP(memory_region_size(&iommu->regs_mr), TARGET_PAGE_SIZE));
> +    memory_region_add_subregion(&s->bar0, 0, &iommu->regs_mr);
> +
> +    pcie_endpoint_cap_init(dev, 0);
> +
> +    pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> +                     PCI_BASE_ADDRESS_MEM_TYPE_64, &s->bar0);
> +
> +    int ret = msix_init(dev, RISCV_IOMMU_INTR_COUNT,
> +                        &s->bar0, 0, RISCV_IOMMU_REG_MSI_CONFIG,
> +                        &s->bar0, 0, RISCV_IOMMU_REG_MSI_CONFIG + 256, 0, &err);
> +
> +    if (ret == -ENOTSUP) {
> +        /*
> +         * MSI-x is not supported by the platform.
> +         * Driver should use timer/polling based notification handlers.
> +         */
> +        warn_report_err(err);
> +    } else if (ret < 0) {
> +        error_propagate(errp, err);
> +        return;
> +    } else {
> +        /* mark all allocated MSIx vectors as used. */
> +        msix_vector_use(dev, RISCV_IOMMU_INTR_CQ);
> +        msix_vector_use(dev, RISCV_IOMMU_INTR_FQ);
> +        msix_vector_use(dev, RISCV_IOMMU_INTR_PM);
> +        msix_vector_use(dev, RISCV_IOMMU_INTR_PQ);
> +        iommu->notify = riscv_iommu_pci_notify;
> +    }
> +
> +    PCIBus *bus = pci_device_root_bus(dev);
> +    if (!bus) {
> +        error_setg(errp, "can't find PCIe root port for %02x:%02x.%x",
> +            pci_bus_num(pci_get_bus(dev)), PCI_SLOT(dev->devfn),
> +            PCI_FUNC(dev->devfn));
> +        return;
> +    }
> +
> +    riscv_iommu_pci_setup_iommu(iommu, bus, errp);
> +}
> +
> +static void riscv_iommu_pci_exit(PCIDevice *pci_dev)
> +{
> +    pci_setup_iommu(pci_device_root_bus(pci_dev), NULL, NULL);
> +}
> +
> +static const VMStateDescription riscv_iommu_vmstate = {
> +    .name = "riscv-iommu",
> +    .unmigratable = 1
> +};
> +
> +static void riscv_iommu_pci_init(Object *obj)
> +{
> +    RISCVIOMMUStatePci *s = RISCV_IOMMU_PCI(obj);
> +    RISCVIOMMUState *iommu = &s->iommu;
> +
> +    object_initialize_child(obj, "iommu", iommu, TYPE_RISCV_IOMMU);
> +    qdev_alias_all_properties(DEVICE(iommu), obj);
> +}
> +
> +static Property riscv_iommu_pci_properties[] = {
> +    DEFINE_PROP_UINT16("vendor-id", RISCVIOMMUStatePci, vendor_id,
> +                       PCI_VENDOR_ID_REDHAT),
> +    DEFINE_PROP_UINT16("device-id", RISCVIOMMUStatePci, device_id,
> +                       PCI_DEVICE_ID_REDHAT_RISCV_IOMMU),
> +    DEFINE_PROP_UINT8("revision", RISCVIOMMUStatePci, revision, 0x01),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void riscv_iommu_pci_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> +
> +    k->realize = riscv_iommu_pci_realize;
> +    k->exit = riscv_iommu_pci_exit;
> +    k->class_id = 0x0806;

Replace 0x0806 with PCI_CLASS_SYSTEM_IOMMU.

Otherwise,
Reviewed-by: Frank Chang <frank.chang@sifive.com>
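
For context on the constant: the 16-bit class_id packs the PCI base class
in the upper byte and the sub-class in the lower byte, so 0x0806 reads as
base class 0x08 (system peripheral) with sub-class 0x06 (IOMMU). A minimal
standalone sketch of that decoding follows. The helper names
pci_base_class() and pci_sub_class() are hypothetical, and the #define is
only a local stand-in for whatever pci_ids.h provides in the tree:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local stand-in for the header definition (assumption: the tree's
 * pci_ids.h supplies this name). The value is the literal from the patch.
 */
#ifndef PCI_CLASS_SYSTEM_IOMMU
#define PCI_CLASS_SYSTEM_IOMMU 0x0806
#endif

/* Hypothetical helper: upper byte of class_id is the PCI base class. */
static inline uint8_t pci_base_class(uint16_t class_id)
{
    return (uint8_t)(class_id >> 8);
}

/* Hypothetical helper: lower byte of class_id is the PCI sub-class. */
static inline uint8_t pci_sub_class(uint16_t class_id)
{
    return (uint8_t)(class_id & 0xff);
}

/* The named constant must keep the value the patch hard-codes today. */
static_assert(PCI_CLASS_SYSTEM_IOMMU == 0x0806,
              "class code changed: expected system peripheral / IOMMU");
```

With the named constant, the assignment in class_init would read
"k->class_id = PCI_CLASS_SYSTEM_IOMMU;" with no change in behavior.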

> +    dc->desc = "RISCV-IOMMU DMA Remapping device";
> +    dc->vmsd = &riscv_iommu_vmstate;
> +    dc->hotpluggable = false;
> +    dc->user_creatable = true;
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    device_class_set_props(dc, riscv_iommu_pci_properties);
> +}
> +
> +static const TypeInfo riscv_iommu_pci = {
> +    .name = TYPE_RISCV_IOMMU_PCI,
> +    .parent = TYPE_PCI_DEVICE,
> +    .class_init = riscv_iommu_pci_class_init,
> +    .instance_init = riscv_iommu_pci_init,
> +    .instance_size = sizeof(RISCVIOMMUStatePci),
> +    .interfaces = (InterfaceInfo[]) {
> +        { INTERFACE_PCIE_DEVICE },
> +        { },
> +    },
> +};
> +
> +static void riscv_iommu_register_pci_types(void)
> +{
> +    type_register_static(&riscv_iommu_pci);
> +}
> +
> +type_init(riscv_iommu_register_pci_types);
> --
> 2.44.0
>
>



* Re: [PATCH v3 10/13] hw/riscv/riscv-iommu: add ATS support
  2024-05-23 17:39 ` [PATCH v3 10/13] hw/riscv/riscv-iommu: add ATS support Daniel Henrique Barboza
@ 2024-06-09  9:06   ` Frank Chang
  0 siblings, 0 replies; 38+ messages in thread
From: Frank Chang @ 2024-06-09  9:06 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, tjeznach, ajones, frank.chang

Reviewed-by: Frank Chang <frank.chang@sifive.com>

On Fri, May 24, 2024 at 1:41 AM Daniel Henrique Barboza <dbarboza@ventanamicro.com> wrote:
>
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> Add PCIe Address Translation Services (ATS) capabilities to the IOMMU.
> This will add support for ATS translation requests in Fault/Event
> queues, Page-request queue and IOATC invalidations.
>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---
>  hw/riscv/riscv-iommu-bits.h |  43 +++++++++++-
>  hw/riscv/riscv-iommu.c      | 129 +++++++++++++++++++++++++++++++++++-
>  hw/riscv/riscv-iommu.h      |   1 +
>  hw/riscv/trace-events       |   3 +
>  4 files changed, 173 insertions(+), 3 deletions(-)
>
> diff --git a/hw/riscv/riscv-iommu-bits.h b/hw/riscv/riscv-iommu-bits.h
> index a4def7b8ec..e253b29b16 100644
> --- a/hw/riscv/riscv-iommu-bits.h
> +++ b/hw/riscv/riscv-iommu-bits.h
> @@ -81,6 +81,7 @@ struct riscv_iommu_pq_record {
>  #define RISCV_IOMMU_CAP_SV57X4          BIT_ULL(19)
>  #define RISCV_IOMMU_CAP_MSI_FLAT        BIT_ULL(22)
>  #define RISCV_IOMMU_CAP_MSI_MRIF        BIT_ULL(23)
> +#define RISCV_IOMMU_CAP_ATS             BIT_ULL(25)
>  #define RISCV_IOMMU_CAP_T2GPA           BIT_ULL(26)
>  #define RISCV_IOMMU_CAP_IGS             GENMASK_ULL(29, 28)
>  #define RISCV_IOMMU_CAP_PAS             GENMASK_ULL(37, 32)
> @@ -209,6 +210,7 @@ struct riscv_iommu_dc {
>
>  /* Translation control fields */
>  #define RISCV_IOMMU_DC_TC_V             BIT_ULL(0)
> +#define RISCV_IOMMU_DC_TC_EN_ATS        BIT_ULL(1)
>  #define RISCV_IOMMU_DC_TC_EN_PRI        BIT_ULL(2)
>  #define RISCV_IOMMU_DC_TC_T2GPA         BIT_ULL(3)
>  #define RISCV_IOMMU_DC_TC_DTF           BIT_ULL(4)
> @@ -270,6 +272,20 @@ struct riscv_iommu_command {
>  #define RISCV_IOMMU_CMD_IODIR_DV        BIT_ULL(33)
>  #define RISCV_IOMMU_CMD_IODIR_DID       GENMASK_ULL(63, 40)
>
> +/* 3.1.4 I/O MMU PCIe ATS */
> +#define RISCV_IOMMU_CMD_ATS_OPCODE              4
> +#define RISCV_IOMMU_CMD_ATS_FUNC_INVAL          0
> +#define RISCV_IOMMU_CMD_ATS_FUNC_PRGR           1
> +#define RISCV_IOMMU_CMD_ATS_PID         GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_CMD_ATS_PV          BIT_ULL(32)
> +#define RISCV_IOMMU_CMD_ATS_DSV         BIT_ULL(33)
> +#define RISCV_IOMMU_CMD_ATS_RID         GENMASK_ULL(55, 40)
> +#define RISCV_IOMMU_CMD_ATS_DSEG        GENMASK_ULL(63, 56)
> +/* dword1 is the ATS payload, two different payload types for INVAL and PRGR */
> +
> +/* ATS.PRGR payload */
> +#define RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE      GENMASK_ULL(47, 44)
> +
>  enum riscv_iommu_dc_fsc_atp_modes {
>      RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
>      RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
> @@ -334,7 +350,32 @@ enum riscv_iommu_fq_ttypes {
>      RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
>      RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
>      RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
> -    RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 8,
> +    RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ = 8,
> +    RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 9,
> +};
> +
> +/* Header fields */
> +#define RISCV_IOMMU_PREQ_HDR_PID        GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_PREQ_HDR_PV         BIT_ULL(32)
> +#define RISCV_IOMMU_PREQ_HDR_PRIV       BIT_ULL(33)
> +#define RISCV_IOMMU_PREQ_HDR_EXEC       BIT_ULL(34)
> +#define RISCV_IOMMU_PREQ_HDR_DID        GENMASK_ULL(63, 40)
> +
> +/* Payload fields */
> +#define RISCV_IOMMU_PREQ_PAYLOAD_R      BIT_ULL(0)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_W      BIT_ULL(1)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_L      BIT_ULL(2)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_M      GENMASK_ULL(2, 0)
> +#define RISCV_IOMMU_PREQ_PRG_INDEX      GENMASK_ULL(11, 3)
> +#define RISCV_IOMMU_PREQ_UADDR          GENMASK_ULL(63, 12)
> +
> +
> +/*
> + * struct riscv_iommu_msi_pte - MSI Page Table Entry
> + */
> +struct riscv_iommu_msi_pte {
> +      uint64_t pte;
> +      uint64_t mrif_info;
>  };
>
>  /* Fields on pte */
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> index 11c418b548..3516b82081 100644
> --- a/hw/riscv/riscv-iommu.c
> +++ b/hw/riscv/riscv-iommu.c
> @@ -641,6 +641,20 @@ static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
>                                              RISCVIOMMUContext *ctx)
>  {
>      uint32_t fsc_mode, msi_mode;
> +    uint64_t gatp;
> +
> +    if (!(s->cap & RISCV_IOMMU_CAP_ATS) &&
> +        (ctx->tc & RISCV_IOMMU_DC_TC_EN_ATS ||
> +         ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI ||
> +         ctx->tc & RISCV_IOMMU_DC_TC_PRPR)) {
> +        return false;
> +    }
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_ATS) &&
> +        (ctx->tc & RISCV_IOMMU_DC_TC_T2GPA ||
> +         ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI)) {
> +        return false;
> +    }
>
>      if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
>          ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
> @@ -661,6 +675,12 @@ static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
>          }
>      }
>
> +    gatp = get_field(ctx->gatp, RISCV_IOMMU_ATP_MODE_FIELD);
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_T2GPA &&
> +        gatp == RISCV_IOMMU_DC_IOHGATP_MODE_BARE) {
> +        return false;
> +    }
> +
>      fsc_mode = get_field(ctx->satp, RISCV_IOMMU_DC_FSC_MODE);
>
>      if (ctx->tc & RISCV_IOMMU_DC_TC_PDTV) {
> @@ -754,7 +774,12 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>              RISCV_IOMMU_DC_IOHGATP_MODE_BARE);
>          ctx->satp = set_field(0, RISCV_IOMMU_ATP_MODE_FIELD,
>              RISCV_IOMMU_DC_FSC_MODE_BARE);
> +
>          ctx->tc = RISCV_IOMMU_DC_TC_V;
> +        if (s->enable_ats) {
> +            ctx->tc |= RISCV_IOMMU_DC_TC_EN_ATS;
> +        }
> +
>          ctx->ta = 0;
>          ctx->msiptp = 0;
>          return 0;
> @@ -1191,6 +1216,16 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>      enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
>      enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
>
> +    /* Check for ATS request. */
> +    if (iotlb->perm == IOMMU_NONE) {
> +        /* Check if ATS is disabled. */
> +        if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_ATS)) {
> +            enable_pri = false;
> +            fault = RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
> +            goto done;
> +        }
> +    }
> +
>      iot = riscv_iommu_iot_lookup(ctx, iot_cache, iotlb->iova);
>      perm = iot ? iot->perm : IOMMU_NONE;
>      if (perm != IOMMU_NONE) {
> @@ -1236,11 +1271,11 @@ done:
>      }
>
>      if (fault) {
> -        unsigned ttype;
> +        unsigned ttype = RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ;
>
>          if (iotlb->perm & IOMMU_RW) {
>              ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
> -        } else {
> +        } else if (iotlb->perm & IOMMU_RO) {
>              ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
>          }
>
> @@ -1268,6 +1303,73 @@ static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
>          MEMTXATTRS_UNSPECIFIED);
>  }
>
> +static void riscv_iommu_ats(RISCVIOMMUState *s,
> +    struct riscv_iommu_command *cmd, IOMMUNotifierFlag flag,
> +    IOMMUAccessFlags perm,
> +    void (*trace_fn)(const char *id))
> +{
> +    RISCVIOMMUSpace *as = NULL;
> +    IOMMUNotifier *n;
> +    IOMMUTLBEvent event;
> +    uint32_t pasid;
> +    uint32_t devid;
> +    const bool pv = cmd->dword0 & RISCV_IOMMU_CMD_ATS_PV;
> +
> +    if (cmd->dword0 & RISCV_IOMMU_CMD_ATS_DSV) {
> +        /* Use device segment and requester id */
> +        devid = get_field(cmd->dword0,
> +            RISCV_IOMMU_CMD_ATS_DSEG | RISCV_IOMMU_CMD_ATS_RID);
> +    } else {
> +        devid = get_field(cmd->dword0, RISCV_IOMMU_CMD_ATS_RID);
> +    }
> +
> +    pasid = get_field(cmd->dword0, RISCV_IOMMU_CMD_ATS_PID);
> +
> +    qemu_mutex_lock(&s->core_lock);
> +    QLIST_FOREACH(as, &s->spaces, list) {
> +        if (as->devid == devid) {
> +            break;
> +        }
> +    }
> +    qemu_mutex_unlock(&s->core_lock);
> +
> +    if (!as || !as->notifier) {
> +        return;
> +    }
> +
> +    event.type = flag;
> +    event.entry.perm = perm;
> +    event.entry.target_as = s->target_as;
> +
> +    IOMMU_NOTIFIER_FOREACH(n, &as->iova_mr) {
> +        if (!pv || n->iommu_idx == pasid) {
> +            event.entry.iova = n->start;
> +            event.entry.addr_mask = n->end - n->start;
> +            trace_fn(as->iova_mr.parent_obj.name);
> +            memory_region_notify_iommu_one(n, &event);
> +        }
> +    }
> +}
> +
> +static void riscv_iommu_ats_inval(RISCVIOMMUState *s,
> +    struct riscv_iommu_command *cmd)
> +{
> +    return riscv_iommu_ats(s, cmd, IOMMU_NOTIFIER_DEVIOTLB_UNMAP, IOMMU_NONE,
> +                           trace_riscv_iommu_ats_inval);
> +}
> +
> +static void riscv_iommu_ats_prgr(RISCVIOMMUState *s,
> +    struct riscv_iommu_command *cmd)
> +{
> +    unsigned resp_code = get_field(cmd->dword1,
> +                                   RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE);
> +
> +    /* Using the access flag to carry response code information */
> +    IOMMUAccessFlags perm = resp_code ? IOMMU_NONE : IOMMU_RW;
> +    return riscv_iommu_ats(s, cmd, IOMMU_NOTIFIER_MAP, perm,
> +                           trace_riscv_iommu_ats_prgr);
> +}
> +
>  static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
>  {
>      uint64_t old_ddtp = s->ddtp;
> @@ -1423,6 +1525,25 @@ static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
>                  get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
>              break;
>
> +        /* ATS commands */
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_ATS_FUNC_INVAL,
> +                             RISCV_IOMMU_CMD_ATS_OPCODE):
> +            if (!s->enable_ats) {
> +                goto cmd_ill;
> +            }
> +
> +            riscv_iommu_ats_inval(s, &cmd);
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_ATS_FUNC_PRGR,
> +                             RISCV_IOMMU_CMD_ATS_OPCODE):
> +            if (!s->enable_ats) {
> +                goto cmd_ill;
> +            }
> +
> +            riscv_iommu_ats_prgr(s, &cmd);
> +            break;
> +
>          default:
>          cmd_ill:
>              /* Invalid instruction, do not advance instruction index. */
> @@ -1818,6 +1939,9 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>      if (s->enable_msi) {
>          s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
>      }
> +    if (s->enable_ats) {
> +        s->cap |= RISCV_IOMMU_CAP_ATS;
> +    }
>      if (s->enable_s_stage) {
>          s->cap |= RISCV_IOMMU_CAP_SV32 | RISCV_IOMMU_CAP_SV39 |
>                    RISCV_IOMMU_CAP_SV48 | RISCV_IOMMU_CAP_SV57;
> @@ -1925,6 +2049,7 @@ static Property riscv_iommu_properties[] = {
>      DEFINE_PROP_UINT32("ioatc-limit", RISCVIOMMUState, iot_limit,
>          LIMIT_CACHE_IOT),
>      DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
> +    DEFINE_PROP_BOOL("ats", RISCVIOMMUState, enable_ats, TRUE),
>      DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
>      DEFINE_PROP_BOOL("s-stage", RISCVIOMMUState, enable_s_stage, TRUE),
>      DEFINE_PROP_BOOL("g-stage", RISCVIOMMUState, enable_g_stage, TRUE),
> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
> index c24e3e4c16..26236c3cee 100644
> --- a/hw/riscv/riscv-iommu.h
> +++ b/hw/riscv/riscv-iommu.h
> @@ -38,6 +38,7 @@ struct RISCVIOMMUState {
>
>      bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
>      bool enable_msi;      /* Enable MSI remapping */
> +    bool enable_ats;      /* Enable ATS support */
>      bool enable_s_stage;  /* Enable S/VS-Stage translation */
>      bool enable_g_stage;  /* Enable G-Stage translation */
>
> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
> index 42a97caffa..4b486b6420 100644
> --- a/hw/riscv/trace-events
> +++ b/hw/riscv/trace-events
> @@ -9,3 +9,6 @@ riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iov
>  riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
>  riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
>  riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
> +riscv_iommu_ats(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: translate request %04x:%02x.%u iova: 0x%"PRIx64
> +riscv_iommu_ats_inval(const char *id) "%s: dev-iotlb invalidate"
> +riscv_iommu_ats_prgr(const char *id) "%s: dev-iotlb page request group response"
> --
> 2.44.0
>
>
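
For anyone following the devid/pasid extraction in riscv_iommu_ats() above, here is a minimal standalone sketch of how a get_field()-style helper behaves when the DSEG and RID masks are OR'ed together. The field positions below (PID 31:12, PV bit 32, DSV bit 33, RID 55:40, DSEG 63:56) are my assumption of the ATS command layout and are not shown in the quoted hunk; field_get() and ats_cmd_devid() are hypothetical names, not the QEMU helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Bit helpers redefined locally so the sketch stands alone. */
#define BIT_ULL(n)         (1ULL << (n))
#define GENMASK_ULL(h, l)  (((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

/* Assumed layout of ATS command dword0 (not taken from the quoted hunk). */
#define CMD_ATS_PID   GENMASK_ULL(31, 12)
#define CMD_ATS_PV    BIT_ULL(32)
#define CMD_ATS_DSV   BIT_ULL(33)
#define CMD_ATS_RID   GENMASK_ULL(55, 40)
#define CMD_ATS_DSEG  GENMASK_ULL(63, 56)

/* Extract the field selected by a contiguous mask, shifted down to bit 0. */
static inline uint64_t field_get(uint64_t reg, uint64_t mask)
{
    assert(mask != 0);
    /* (mask & ~(mask << 1)) isolates the lowest set bit of the mask. */
    return (reg & mask) / (mask & ~(mask << 1));
}

/*
 * With DSV set, the device id is the segment concatenated with the
 * requester id; the two masks are adjacent, so OR'ing them yields one
 * contiguous 24-bit field. Without DSV only the 16-bit RID is used.
 */
static inline uint32_t ats_cmd_devid(uint64_t dword0)
{
    if (dword0 & CMD_ATS_DSV) {
        return (uint32_t)field_get(dword0, CMD_ATS_DSEG | CMD_ATS_RID);
    }
    return (uint32_t)field_get(dword0, CMD_ATS_RID);
}
```

The OR'ed mask only works because DSEG sits directly above RID; if the two fields were not adjacent, the combined mask would no longer be contiguous and the division trick would not produce a meaningful value.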



* Re: [PATCH v3 11/13] hw/riscv/riscv-iommu: add DBG support
  2024-05-23 17:39 ` [PATCH v3 11/13] hw/riscv/riscv-iommu: add DBG support Daniel Henrique Barboza
@ 2024-06-09  9:09   ` Frank Chang
  0 siblings, 0 replies; 38+ messages in thread
From: Frank Chang @ 2024-06-09  9:09 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, tjeznach, ajones, frank.chang

Reviewed-by: Frank Chang <frank.chang@sifive.com>

Daniel Henrique Barboza <dbarboza@ventanamicro.com> wrote on Fri, May 24, 2024 at 1:42 AM:
>
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> DBG support adds three additional registers: tr_req_iova, tr_req_ctl and
> tr_response.
>
> The DBG cap is always enabled. No on/off toggle is provided for it.
>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---
>  hw/riscv/riscv-iommu-bits.h | 17 +++++++++++
>  hw/riscv/riscv-iommu.c      | 59 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 76 insertions(+)
>
> diff --git a/hw/riscv/riscv-iommu-bits.h b/hw/riscv/riscv-iommu-bits.h
> index e253b29b16..f143c4a926 100644
> --- a/hw/riscv/riscv-iommu-bits.h
> +++ b/hw/riscv/riscv-iommu-bits.h
> @@ -84,6 +84,7 @@ struct riscv_iommu_pq_record {
>  #define RISCV_IOMMU_CAP_ATS             BIT_ULL(25)
>  #define RISCV_IOMMU_CAP_T2GPA           BIT_ULL(26)
>  #define RISCV_IOMMU_CAP_IGS             GENMASK_ULL(29, 28)
> +#define RISCV_IOMMU_CAP_DBG             BIT_ULL(31)
>  #define RISCV_IOMMU_CAP_PAS             GENMASK_ULL(37, 32)
>  #define RISCV_IOMMU_CAP_PD8             BIT_ULL(38)
>  #define RISCV_IOMMU_CAP_PD17            BIT_ULL(39)
> @@ -185,6 +186,22 @@ enum {
>      RISCV_IOMMU_INTR_COUNT
>  };
>
> +/* 5.24 Translation request IOVA (64bits) */
> +#define RISCV_IOMMU_REG_TR_REQ_IOVA     0x0258
> +
> +/* 5.25 Translation request control (64bits) */
> +#define RISCV_IOMMU_REG_TR_REQ_CTL      0x0260
> +#define RISCV_IOMMU_TR_REQ_CTL_GO_BUSY  BIT_ULL(0)
> +#define RISCV_IOMMU_TR_REQ_CTL_NW       BIT_ULL(3)
> +#define RISCV_IOMMU_TR_REQ_CTL_PID      GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_TR_REQ_CTL_DID      GENMASK_ULL(63, 40)
> +
> +/* 5.26 Translation request response (64bits) */
> +#define RISCV_IOMMU_REG_TR_RESPONSE     0x0268
> +#define RISCV_IOMMU_TR_RESPONSE_FAULT   BIT_ULL(0)
> +#define RISCV_IOMMU_TR_RESPONSE_S       BIT_ULL(9)
> +#define RISCV_IOMMU_TR_RESPONSE_PPN     RISCV_IOMMU_PPN_FIELD
> +
>  /* 5.27 Interrupt cause to vector (64bits) */
>  #define RISCV_IOMMU_REG_IVEC            0x02F8
>
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> index 3516b82081..52f0851895 100644
> --- a/hw/riscv/riscv-iommu.c
> +++ b/hw/riscv/riscv-iommu.c
> @@ -1655,6 +1655,50 @@ static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
>      riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
>  }
>
> +static void riscv_iommu_process_dbg(RISCVIOMMUState *s)
> +{
> +    uint64_t iova = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_TR_REQ_IOVA);
> +    uint64_t ctrl = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_TR_REQ_CTL);
> +    unsigned devid = get_field(ctrl, RISCV_IOMMU_TR_REQ_CTL_DID);
> +    unsigned pid = get_field(ctrl, RISCV_IOMMU_TR_REQ_CTL_PID);
> +    RISCVIOMMUContext *ctx;
> +    void *ref;
> +
> +    if (!(ctrl & RISCV_IOMMU_TR_REQ_CTL_GO_BUSY)) {
> +        return;
> +    }
> +
> +    ctx = riscv_iommu_ctx(s, devid, pid, &ref);
> +    if (ctx == NULL) {
> +        riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_TR_RESPONSE,
> +                                 RISCV_IOMMU_TR_RESPONSE_FAULT |
> +                                 (RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED << 10));
> +    } else {
> +        IOMMUTLBEntry iotlb = {
> +            .iova = iova,
> +            .perm = ctrl & RISCV_IOMMU_TR_REQ_CTL_NW ? IOMMU_RO : IOMMU_RW,
> +            .addr_mask = ~0,
> +            .target_as = NULL,
> +        };
> +        int fault = riscv_iommu_translate(s, ctx, &iotlb, false);
> +        if (fault) {
> +            iova = RISCV_IOMMU_TR_RESPONSE_FAULT | (((uint64_t) fault) << 10);
> +        } else {
> +            iova = iotlb.translated_addr & ~iotlb.addr_mask;
> +            iova >>= TARGET_PAGE_BITS;
> +            iova &= RISCV_IOMMU_TR_RESPONSE_PPN;
> +
> +            /* We do not support superpages (> 4kbs) for now */
> +            iova &= ~RISCV_IOMMU_TR_RESPONSE_S;
> +        }
> +        riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_TR_RESPONSE, iova);
> +    }
> +
> +    riscv_iommu_reg_mod64(s, RISCV_IOMMU_REG_TR_REQ_CTL, 0,
> +        RISCV_IOMMU_TR_REQ_CTL_GO_BUSY);
> +    riscv_iommu_ctx_put(s, ref);
> +}
> +
>  typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
>
>  static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
> @@ -1778,6 +1822,12 @@ static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
>
>          return MEMTX_OK;
>
> +    case RISCV_IOMMU_REG_TR_REQ_CTL:
> +        process_fn = riscv_iommu_process_dbg;
> +        regb = RISCV_IOMMU_REG_TR_REQ_CTL;
> +        busy = RISCV_IOMMU_TR_REQ_CTL_GO_BUSY;
> +        break;
> +
>      default:
>          break;
>      }
> @@ -1950,6 +2000,9 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>          s->cap |= RISCV_IOMMU_CAP_SV32X4 | RISCV_IOMMU_CAP_SV39X4 |
>                    RISCV_IOMMU_CAP_SV48X4 | RISCV_IOMMU_CAP_SV57X4;
>      }
> +    /* Enable translation debug interface */
> +    s->cap |= RISCV_IOMMU_CAP_DBG;
> +
>      /* Report QEMU target physical address space limits */
>      s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
>                         TARGET_PHYS_ADDR_SPACE_BITS);
> @@ -2004,6 +2057,12 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>      stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
>      stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
>      stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
> +    /* If debug registers enabled. */
> +    if (s->cap & RISCV_IOMMU_CAP_DBG) {
> +        stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_TR_REQ_IOVA], 0);
> +        stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_TR_REQ_CTL],
> +            RISCV_IOMMU_TR_REQ_CTL_GO_BUSY);
> +    }
>
>      /* Memory region for downstream access, if specified. */
>      if (s->target_mr) {
> --
> 2.44.0
>
>
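
As a reading aid for the tr_response handling above, a small standalone sketch of how a response value could be packed and unpacked. It assumes that the PPN field (RISCV_IOMMU_PPN_FIELD, not shown in the quoted hunk) spans bits 53:10 and that pages are 4 KiB; the tr_response_*() helpers are hypothetical names. This is one reading of the register layout, not a restatement of the hunk, which masks the shifted address in place rather than shifting the PPN up to bit 10.

```c
#include <assert.h>
#include <stdint.h>

/* Bit helpers redefined locally so the sketch stands alone. */
#define BIT_ULL(n)         (1ULL << (n))
#define GENMASK_ULL(h, l)  (((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

#define TR_RESPONSE_FAULT  BIT_ULL(0)
#define TR_RESPONSE_S      BIT_ULL(9)
/* Assumption: PPN occupies bits 53:10 of the response register. */
#define TR_RESPONSE_PPN    GENMASK_ULL(53, 10)
/* Assumption: 4 KiB base pages. */
#define PAGE_BITS          12

/* Faulting response: fault bit plus the cause code placed at bit 10,
 * mirroring the shift used in the quoted patch. */
static inline uint64_t tr_response_fault(unsigned cause)
{
    assert(cause != 0);
    return TR_RESPONSE_FAULT | ((uint64_t)cause << 10);
}

/* Successful response: page number placed into the PPN field, with the
 * superpage (S) bit left clear since only 4 KiB pages are modelled. */
static inline uint64_t tr_response_ok(uint64_t spa)
{
    uint64_t ppn = spa >> PAGE_BITS;
    return (ppn << 10) & TR_RESPONSE_PPN;
}

/* Recover the page-aligned physical address from a successful response. */
static inline uint64_t tr_response_spa(uint64_t resp)
{
    return ((resp & TR_RESPONSE_PPN) >> 10) << PAGE_BITS;
}
```

A guest-side debugger would typically check the fault bit first and only then decode the PPN, since the remaining bits are not meaningful when a fault is reported.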



* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (12 preceding siblings ...)
  2024-05-23 17:39 ` [PATCH v3 13/13] qtest/riscv-iommu-test: add init queues test Daniel Henrique Barboza
@ 2024-06-10  0:34 ` Alistair Francis
  2024-06-10 18:32   ` Andrew Jones
  2024-06-11  1:51 ` LIU Zhiwei
  14 siblings, 1 reply; 38+ messages in thread
From: Alistair Francis @ 2024-06-10  0:34 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, tjeznach, ajones, frank.chang

On Fri, May 24, 2024 at 3:43 AM Daniel Henrique Barboza
<dbarboza@ventanamicro.com> wrote:
>
> Hi,
>
> In this new version a lot of changes were made throughout all the code,
> most notably on patch 3. Link for the previous version is [1].
>
> * How it was tested *
>
> This series was tested using an emulated QEMU RISC-V host booting a QEMU
> KVM guest, passing through an emulated e1000 network card from the host
> to the guest. I can provide more details (e.g. QEMU command lines) if
> required, just let me know. For now this cover-letter is too much of an
> essay as is.

It would probably be helpful to document these somewhere, so others
can use them as a starting point for running this.

Alistair

>
> The Linux kernel used for tests can be found here:
>
> https://github.com/tjeznach/linux/tree/riscv_iommu_v6-rc3
>
> This is a newer version of the following work from Tomasz:
>
> https://lore.kernel.org/linux-riscv/cover.1715708679.git.tjeznach@rivosinc.com/
> ("[PATCH v5 0/7] Linux RISC-V IOMMU Support")
>
> The v5 wasn't enough for the testing being done. v6-rc3 did the trick.
>
> Note that to test this work using riscv-iommu-pci we'll need to provide
> the Rivos PCI ID in the command line. More details down below.
>
> * Highlights of this version *
>
> - patches removed from v2: platform driver (riscv-iommu-sys, former
> patch 05) and the EDU changes (patches 14 and 15). The platform driver
> will be sent later with a working example on the 'virt' machine,
> either on a newer version of this series or via a follow-up series. We
> already have a PoC on [2] created by Sunil. More tests are needed, so
> it'll be left behind for now. The EDU changes will be sent in separate
> after I finish the doc changes that Frank cited in v2.
>
> - patch 3 contains the bulk of changes made from v2. Please give special
> attention to the following functions since this is entirely new code I
> ended up adding:
>
>  - riscv_iommu_report_fault()
>  - riscv_iommu_validate_device_ctx()
>  - riscv_iommu_update_ipsr()
>
>   Aside from these helpers most of the changes made in this patch 3 were
> punctual.
>
> - Red HAT PCI ID related changes. A new patch (4) that introduces a
> generic RISC-V IOMMU PCI ID was added. This PCI ID was gracefully given
> to us by Red Hat and Gerd Hoffman from their ID space. The
> riscv-iommu-pci device now defaults to this PCI ID instead of Rivos PCI
> ID. The device was changed slightly to allow vendor-id and device-id to
> be set in the command-line, so it's now possible to use this reference
> device as another RISC-V IOMMU PCI device to ease the burden of
> testing/development.
>
>   To instantiate the riscv-iommu-pci device using the previous Rivos PCI
> ID, use the following cmd line:
>
>   -device riscv-iommu-pci,vendor-id=0x1efd,device-id=0xedf1
>
>   I'm using these options to test the series with the existing Linux RISC-V
> IOMMU support that uses just a Rivos ID to identify the device.
>
>
> Series based on alistair/riscv-to-apply.next. It's also applicable on
> current QEMU master. It can also be fetched from:
>
> https://gitlab.com/danielhb/qemu/-/tree/riscv_iommu_v3
>
>
> Patches missing reviews/acks: 3, 5, 9, 10, 11.
>
> Changes from v2 [1]:
> - patch 05 (hw/riscv: add riscv-iommu-sys platform device): dropped
>   - will be reintroduced in a later review or as a follow-up series
>
> - patches 14 and 15: dropped
>   - will be sent in separate
>
> - patches 2, 3, 4 and 5:
>   - removed all 'Ziommu' references
>
> - patch 2:
>   - added extra bits that patch 3 ended up using
>
> - patch 3:
>   - fixed blank line at EOF in hw/riscv/trace.h
>   - added a riscv_iommu_report_fault() helper to report faults. The helper checks if
>     a given fault is eligible to be reported if DTF is 1
>   - Use riscv_iommu_report_fault() in riscv_iommu_ctx() and riscv_iommu_translate()
>     to avoid code repetition
>   - added a riscv_iommu_validate_device_ctx() helper to validate the device context
>     as specified in "Device configuration checks" section. This helper is being used
>     in riscv_iommu_ctx_fetch()
>   - added a new riscv_iommu_update_ipsr() helper to handle IPSR updates
>     in riscv_iommu_mmio_write()
>   - riscv_iommmu_msi_write() now reports a fault in all error paths
>   - check for fctl.WSI before issuing a MSI interrupt in riscv_iommu_notify()
>   - change riscv-iommu region name to 'riscv-iommu'
>   - change address_space_init() name for PCI devices to 'name' instead of using TYPE_RISCV_IOMMU_PCI
>   - changed riscv_iommu_mmio_ops min_access_size to 4
>   - do not check for min and max sizes on riscv_iommu_mmio_write()
>   - changed riscv_iommu_trap_ops  min_access_size to 4
>   - removed IOMMU qemu_thread thread:
>     - riscv_iommu_mmio_write() will now execute a riscv_iommu_process_fn by holding
>       'core_lock'
>   - init FSCR as zero explicitly
>   - check for bus->iommu_opaque == NULL before calling pci_setup_iommu()
>
> - patch 4 (new):
>   - add Red-Hat PCI RISC-V IOMMU ID
>
> - patch 5 (former 4):
>   - create vendor-id and device-id properties
>   - set Red-hat PCI RISC-V IOMMU ID as default ID
>
> - patch 8:
>   - use IOMMU_NONE instead of '0' in relevant 'iot->perm = 0' instances
>
> - patch 9:
>   - add s-stage and g-stage steps in riscv_iommu_validate_device_ctx()
>   - removed 'gpa' boolean from riscv_iommu_spa_fetch()
>   - 'en_s' is no longer used for early MSI address match
>
> - patch 10:
>   - add ATS steps in riscv_iommu_validate_device_ctx()
>   - check for 's->enable_ats' before adding RISCV_IOMMU_DC_TC_EN_ATS in device context
>   - check for 's->enable_ats' before processing ATS commands in riscv_iommu_process_cq_tail()
>   - remove ambiguous trace_riscv_iommu_ats() from riscv_iommu_translate()
>
> - patch 11:
>   - removed unused bits
>   - added RISCV_IOMMU_TR_REQ_CTL_NW and RISCV_IOMMU_TR_RESPONSE_S
>     bits
>   - set IOMMUTLBEntry 'perm' using RISCV_IOMMU_TR_REQ_CTL_NW in riscv_iommu_process_dbg()
>   - clear RISCV_IOMMU_TR_RESPONSE_S in riscv_iommu_process_dbg(). Added a comment talking about the (lack of) superpage support
>
> [1] https://lore.kernel.org/qemu-riscv/20240307160319.675044-1-dbarboza@ventanamicro.com/
> [2] https://github.com/vlsunil/qemu/commits/acpi_rimt_poc_v1/
>
> Andrew Jones (1):
>   hw/riscv/riscv-iommu: Add another irq for mrif notifications
>
> Daniel Henrique Barboza (3):
>   pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device
>   test/qtest: add riscv-iommu-pci tests
>   qtest/riscv-iommu-test: add init queues test
>
> Tomasz Jeznach (9):
>   exec/memtxattr: add process identifier to the transaction attributes
>   hw/riscv: add riscv-iommu-bits.h
>   hw/riscv: add RISC-V IOMMU base emulation
>   hw/riscv: add riscv-iommu-pci reference device
>   hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug
>   hw/riscv/riscv-iommu: add Address Translation Cache (IOATC)
>   hw/riscv/riscv-iommu: add s-stage and g-stage support
>   hw/riscv/riscv-iommu: add ATS support
>   hw/riscv/riscv-iommu: add DBG support
>
>  docs/specs/pci-ids.rst           |    2 +
>  hw/riscv/Kconfig                 |    4 +
>  hw/riscv/meson.build             |    1 +
>  hw/riscv/riscv-iommu-bits.h      |  416 ++++++
>  hw/riscv/riscv-iommu-pci.c       |  177 +++
>  hw/riscv/riscv-iommu.c           | 2283 ++++++++++++++++++++++++++++++
>  hw/riscv/riscv-iommu.h           |  146 ++
>  hw/riscv/trace-events            |   15 +
>  hw/riscv/trace.h                 |    1 +
>  hw/riscv/virt.c                  |   33 +-
>  include/exec/memattrs.h          |    5 +
>  include/hw/pci/pci.h             |    1 +
>  include/hw/riscv/iommu.h         |   36 +
>  meson.build                      |    1 +
>  tests/qtest/libqos/meson.build   |    4 +
>  tests/qtest/libqos/riscv-iommu.c |   76 +
>  tests/qtest/libqos/riscv-iommu.h |  100 ++
>  tests/qtest/meson.build          |    1 +
>  tests/qtest/riscv-iommu-test.c   |  234 +++
>  19 files changed, 3535 insertions(+), 1 deletion(-)
>  create mode 100644 hw/riscv/riscv-iommu-bits.h
>  create mode 100644 hw/riscv/riscv-iommu-pci.c
>  create mode 100644 hw/riscv/riscv-iommu.c
>  create mode 100644 hw/riscv/riscv-iommu.h
>  create mode 100644 hw/riscv/trace-events
>  create mode 100644 hw/riscv/trace.h
>  create mode 100644 include/hw/riscv/iommu.h
>  create mode 100644 tests/qtest/libqos/riscv-iommu.c
>  create mode 100644 tests/qtest/libqos/riscv-iommu.h
>  create mode 100644 tests/qtest/riscv-iommu-test.c
>
> --
> 2.44.0
>
>



* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-06-10  0:34 ` [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Alistair Francis
@ 2024-06-10 18:32   ` Andrew Jones
  2024-06-10 19:16     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Jones @ 2024-06-10 18:32 UTC (permalink / raw)
  To: Alistair Francis, Daniel Henrique Barboza
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, tjeznach, frank.chang

On June 10, 2024 2:34:58 AM GMT+02:00, Alistair Francis <alistair23@gmail.com> wrote:
>On Fri, May 24, 2024 at 3:43 AM Daniel Henrique Barboza
><dbarboza@ventanamicro.com> wrote:
>>
>> Hi,
>>
>> In this new version a lot of changes were made throughout all the code,
>> most notably on patch 3. Link for the previous version is [1].
>>
>> * How it was tested *
>>
>> This series was tested using an emulated QEMU RISC-V host booting a QEMU
>> KVM guest, passing through an emulated e1000 network card from the host
>> to the guest. I can provide more details (e.g. QEMU command lines) if
>> required, just let me know. For now this cover-letter is too much of an
>> essay as is.
>
>It would probably be helpful to document these somewhere, so others
>can use them as a starting point for running this
>

I've written up a testing procedure which I shared internally with Daniel. I'll sanitize it and post it somewhere public.

Thanks,
drew

>Alistair
>
>>
>> The Linux kernel used for tests can be found here:
>>
>> https://github.com/tjeznach/linux/tree/riscv_iommu_v6-rc3
>>
>> This is a newer version of the following work from Tomasz:
>>
>> https://lore.kernel.org/linux-riscv/cover.1715708679.git.tjeznach@rivosinc.com/
>> ("[PATCH v5 0/7] Linux RISC-V IOMMU Support")
>>
>> The v5 wasn't enough for the testing being done. v6-rc3 did the trick.
>>
>> Note that to test this work using riscv-iommu-pci we'll need to provide
>> the Rivos PCI ID in the command line. More details down below.
>>
>> * Highlights of this version *
>>
>> - patches removed from v2: platform driver (riscv-iommu-sys, former
>> patch 05) and the EDU changes (patches 14 and 15). The platform driver
>> will be sent later with a working example on the 'virt' machine,
>> either on a newer version of this series or via a follow-up series. We
>> already have a PoC on [2] created by Sunil. More tests are needed, so
>> it'll be left behind for now. The EDU changes will be sent in separate
>> after I finish the doc changes that Frank cited in v2.
>>
>> - patch 3 contains the bulk of changes made from v2. Please give special
>> attention to the following functions since this is entirely new code I
>> ended up adding:
>>
>>  - riscv_iommu_report_fault()
>>  - riscv_iommu_validate_device_ctx()
>>  - riscv_iommu_update_ipsr()
>>
>>   Aside from these helpers most of the changes made in this patch 3 were
>> punctual.
>>
>> - Red HAT PCI ID related changes. A new patch (4) that introduces a
>> generic RISC-V IOMMU PCI ID was added. This PCI ID was gracefully given
>> to us by Red Hat and Gerd Hoffman from their ID space. The
>> riscv-iommu-pci device now defaults to this PCI ID instead of Rivos PCI
>> ID. The device was changed slightly to allow vendor-id and device-id to
>> be set in the command-line, so it's now possible to use this reference
>> device as another RISC-V IOMMU PCI device to ease the burden of
>> testing/development.
>>
>>   To instantiate the riscv-iommu-pci device using the previous Rivos PCI
>> ID, use the following cmd line:
>>
>>   -device riscv-iommu-pci,vendor-id=0x1efd,device-id=0xedf1
>>
>>   I'm using these options to test the series with the existing Linux RISC-V
>> IOMMU support that uses just a Rivos ID to identify the device.
>>
>>
>> Series based on alistair/riscv-to-apply.next. It's also applicable on
>> current QEMU master. It can also be fetched from:
>>
>> https://gitlab.com/danielhb/qemu/-/tree/riscv_iommu_v3
>>
>>
>> Patches missing reviews/acks: 3, 5, 9, 10, 11.
>>
>> Changes from v2 [1]:
>> - patch 05 (hw/riscv: add riscv-iommu-sys platform device): dropped
>>   - will be reintroduced in a later review or as a follow-up series
>>
>> - patches 14 and 15: dropped
>>   - will be sent in separate
>>
>> - patches 2, 3, 4 and 5:
>>   - removed all 'Ziommu' references
>>
>> - patch 2:
>>   - added extra bits that patch 3 ended up using
>>
>> - patch 3:
>>   - fixed blank line at EOF in hw/riscv/trace.h
>>   - added a riscv_iommu_report_fault() helper to report faults. The helper checks if
>>     a given fault is eligible to be reported if DTF is 1
>>   - Use riscv_iommu_report_fault() in riscv_iommu_ctx() and riscv_iommu_translate()
>>     to avoid code repetition
>>   - added a riscv_iommu_validate_device_ctx() helper to validate the device context
>>     as specified in "Device configuration checks" section. This helper is being used
>>     in riscv_iommu_ctx_fetch()
>>   - added a new riscv_iommu_update_ipsr() helper to handle IPSR updates
>>     in riscv_iommu_mmio_write()
>>   - riscv_iommmu_msi_write() now reports a fault in all error paths
>>   - check for fctl.WSI before issuing a MSI interrupt in riscv_iommu_notify()
>>   - change riscv-iommu region name to 'riscv-iommu'
>>   - change address_space_init() name for PCI devices to 'name' instead of using TYPE_RISCV_IOMMU_PCI
>>   - changed riscv_iommu_mmio_ops min_access_size to 4
>>   - do not check for min and max sizes on riscv_iommu_mmio_write()
>>   - changed riscv_iommu_trap_ops  min_access_size to 4
>>   - removed IOMMU qemu_thread thread:
>>     - riscv_iommu_mmio_write() will now execute a riscv_iommu_process_fn by holding
>>       'core_lock'
>>   - init FSCR as zero explicitly
>>   - check for bus->iommu_opaque == NULL before calling pci_setup_iommu()
>>
>> - patch 4 (new):
>>   - add Red-Hat PCI RISC-V IOMMU ID
>>
>> - patch 5 (former 4):
>>   - create vendor-id and device-id properties
>>   - set Red-hat PCI RISC-V IOMMU ID as default ID
>>
>> - patch 8:
>>   - use IOMMU_NONE instead of '0' in relevant 'iot->perm = 0' instances
>>
>> - patch 9:
>>   - add s-stage and g-stage steps in riscv_iommu_validate_device_ctx()
>>   - removed 'gpa' boolean from riscv_iommu_spa_fetch()
>>   - 'en_s' is no longer used for early MSI address match
>>
>> - patch 10:
>>   - add ATS steps in riscv_iommu_validate_device_ctx()
>>   - check for 's->enable_ats' before adding RISCV_IOMMU_DC_TC_EN_ATS in device context
>>   - check for 's->enable_ats' before processing ATS commands in riscv_iommu_process_cq_tail()
>>   - remove ambiguous trace_riscv_iommu_ats() from riscv_iommu_translate()
>>
>> - patch 11:
>>   - removed unused bits
>>   - added RISCV_IOMMU_TR_REQ_CTL_NW and RISCV_IOMMU_TR_RESPONSE_S
>>     bits
>>   - set IOMMUTLBEntry 'perm' using RISCV_IOMMU_TR_REQ_CTL_NW in riscv_iommu_process_dbg()
>>   - clear RISCV_IOMMU_TR_RESPONSE_S in riscv_iommu_process_dbg(). Added a comment talking about the (lack of) superpage support
>>
>> [1] https://lore.kernel.org/qemu-riscv/20240307160319.675044-1-dbarboza@ventanamicro.com/
>> [2] https://github.com/vlsunil/qemu/commits/acpi_rimt_poc_v1/
>>
>> Andrew Jones (1):
>>   hw/riscv/riscv-iommu: Add another irq for mrif notifications
>>
>> Daniel Henrique Barboza (3):
>>   pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device
>>   test/qtest: add riscv-iommu-pci tests
>>   qtest/riscv-iommu-test: add init queues test
>>
>> Tomasz Jeznach (9):
>>   exec/memtxattr: add process identifier to the transaction attributes
>>   hw/riscv: add riscv-iommu-bits.h
>>   hw/riscv: add RISC-V IOMMU base emulation
>>   hw/riscv: add riscv-iommu-pci reference device
>>   hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug
>>   hw/riscv/riscv-iommu: add Address Translation Cache (IOATC)
>>   hw/riscv/riscv-iommu: add s-stage and g-stage support
>>   hw/riscv/riscv-iommu: add ATS support
>>   hw/riscv/riscv-iommu: add DBG support
>>
>>  docs/specs/pci-ids.rst           |    2 +
>>  hw/riscv/Kconfig                 |    4 +
>>  hw/riscv/meson.build             |    1 +
>>  hw/riscv/riscv-iommu-bits.h      |  416 ++++++
>>  hw/riscv/riscv-iommu-pci.c       |  177 +++
>>  hw/riscv/riscv-iommu.c           | 2283 ++++++++++++++++++++++++++++++
>>  hw/riscv/riscv-iommu.h           |  146 ++
>>  hw/riscv/trace-events            |   15 +
>>  hw/riscv/trace.h                 |    1 +
>>  hw/riscv/virt.c                  |   33 +-
>>  include/exec/memattrs.h          |    5 +
>>  include/hw/pci/pci.h             |    1 +
>>  include/hw/riscv/iommu.h         |   36 +
>>  meson.build                      |    1 +
>>  tests/qtest/libqos/meson.build   |    4 +
>>  tests/qtest/libqos/riscv-iommu.c |   76 +
>>  tests/qtest/libqos/riscv-iommu.h |  100 ++
>>  tests/qtest/meson.build          |    1 +
>>  tests/qtest/riscv-iommu-test.c   |  234 +++
>>  19 files changed, 3535 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/riscv/riscv-iommu-bits.h
>>  create mode 100644 hw/riscv/riscv-iommu-pci.c
>>  create mode 100644 hw/riscv/riscv-iommu.c
>>  create mode 100644 hw/riscv/riscv-iommu.h
>>  create mode 100644 hw/riscv/trace-events
>>  create mode 100644 hw/riscv/trace.h
>>  create mode 100644 include/hw/riscv/iommu.h
>>  create mode 100644 tests/qtest/libqos/riscv-iommu.c
>>  create mode 100644 tests/qtest/libqos/riscv-iommu.h
>>  create mode 100644 tests/qtest/riscv-iommu-test.c
>>
>> --
>> 2.44.0
>>
>>




* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-06-10 18:32   ` Andrew Jones
@ 2024-06-10 19:16     ` Daniel Henrique Barboza
  2024-06-11  0:18       ` Alistair Francis
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-10 19:16 UTC (permalink / raw)
  To: Andrew Jones, Alistair Francis
  Cc: qemu-devel, qemu-riscv, alistair.francis, bmeng, liwei1518,
	zhiwei_liu, palmer, tjeznach, frank.chang



On 6/10/24 3:32 PM, Andrew Jones wrote:
> On June 10, 2024 2:34:58 AM GMT+02:00, Alistair Francis <alistair23@gmail.com> wrote:
>> On Fri, May 24, 2024 at 3:43 AM Daniel Henrique Barboza
>> <dbarboza@ventanamicro.com> wrote:
>>>
>>> Hi,
>>>
>>> In this new version a lot of changes were made throughout all the code,
>>> most notably on patch 3. Link for the previous version is [1].
>>>
>>> * How it was tested *
>>>
>>> This series was tested using an emulated QEMU RISC-V host booting a QEMU
>>> KVM guest, passing through an emulated e1000 network card from the host
>>> to the guest. I can provide more details (e.g. QEMU command lines) if
>>> required, just let me know. For now this cover-letter is too much of an
>>> essay as is.
>>
>> It would probably be helpful to document these somewhere, so others
>> can use them as a starting point for running this
>>
> 
> I've written up a testing procedure which I shared internally with Daniel. I'll sanitize it and post it somewhere public.
>

I can also add QEMU docs under docs/system/riscv, both as a
subsection of virt.rst and perhaps as a new doc that describes the
devices themselves (riscv-iommu-pci and, later on, riscv-iommu-sys).


Thanks,


Daniel
  
> Thanks,
> drew
> 
>> Alistair
>>



* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-06-10 19:16     ` Daniel Henrique Barboza
@ 2024-06-11  0:18       ` Alistair Francis
  0 siblings, 0 replies; 38+ messages in thread
From: Alistair Francis @ 2024-06-11  0:18 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Andrew Jones, qemu-devel, qemu-riscv, alistair.francis, bmeng,
	liwei1518, zhiwei_liu, palmer, tjeznach, frank.chang

On Tue, Jun 11, 2024 at 5:16 AM Daniel Henrique Barboza
<dbarboza@ventanamicro.com> wrote:
>
>
>
> On 6/10/24 3:32 PM, Andrew Jones wrote:
> > On June 10, 2024 2:34:58 AM GMT+02:00, Alistair Francis <alistair23@gmail.com> wrote:
> >> On Fri, May 24, 2024 at 3:43 AM Daniel Henrique Barboza
> >> <dbarboza@ventanamicro.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> In this new version a lot of changes were made throughout all the code,
> >>> most notably on patch 3. Link for the previous version is [1].
> >>>
> >>> * How it was tested *
> >>>
> >>> This series was tested using an emulated QEMU RISC-V host booting a QEMU
> >>> KVM guest, passing through an emulated e1000 network card from the host
> >>> to the guest. I can provide more details (e.g. QEMU command lines) if
> >>> required, just let me know. For now this cover-letter is too much of an
> >>> essay as is.
> >>
> >> It would probably be helpful to document these somewhere, so others
> >> can use them as a starting point for running this
> >>
> >
> > I've written up a testing procedure which I shared internally with Daniel. I'll sanitize it and post it somewhere public.
> >
>
> I can also add a QEMU docs under docs/system/riscv, both as a
> subsection of virt.rst and perhaps a new doc that describes the
> devices itself (riscv-iommu-pci and later on riscv-iommu-sys).

I think that would be great. Even if it isn't a simple "copy this
command and it works", it at least gives users a place to start
figuring out how to use this.

Alistair



* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
                   ` (13 preceding siblings ...)
  2024-06-10  0:34 ` [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Alistair Francis
@ 2024-06-11  1:51 ` LIU Zhiwei
  2024-06-11 10:13   ` Daniel Henrique Barboza
  14 siblings, 1 reply; 38+ messages in thread
From: LIU Zhiwei @ 2024-06-11  1:51 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, palmer, tjeznach,
	ajones, frank.chang

Hi Daniel,

I want to know if we can use the IOMMU and IOPMP at the same time.

Is the relationship between them more similar to that of MMU and sPMP,
or to that of MMU and PMP?

Thanks,
Zhiwei




* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-06-11  1:51 ` LIU Zhiwei
@ 2024-06-11 10:13   ` Daniel Henrique Barboza
  2024-06-12  7:50     ` LIU Zhiwei
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-11 10:13 UTC (permalink / raw)
  To: LIU Zhiwei, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, palmer, tjeznach,
	ajones, frank.chang

Hi Zhiwei,

On 6/10/24 10:51 PM, LIU Zhiwei wrote:
> Hi Daniel,
> 
> I want to know if we can use the IOMMU and IOPMP at the same time.

AFAIK we can. They're not mutually exclusive since they offer protection
and isolation at different layers/stages.


> 
> The relationship between them is more similar to MMU and sPMP or to MMU and PMP?

I'd say MMU and PMP, since the IOMMU can isolate devices regardless of
whether there is an s-mode context or not.
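
To illustrate the layering, here is a toy sketch in C. It is not QEMU
code: every name in it is hypothetical, and it assumes, by analogy with
MMU and PMP, that the IOPMP check is applied to the physical address
produced by the IOMMU translation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy IOMMU stage (hypothetical): maps one contiguous IOVA range to memory. */
struct toy_iommu {
    uint64_t iova_base;
    uint64_t phys_base;
    uint64_t size;
};

/* Toy IOPMP stage (hypothetical): one physical window the device may touch. */
struct toy_iopmp {
    uint64_t phys_base;
    uint64_t size;
};

/* Stage 1: translate an IOVA; fails if the IOVA is not mapped. */
static inline bool toy_iommu_translate(const struct toy_iommu *m,
                                       uint64_t iova, uint64_t *phys)
{
    if (iova < m->iova_base || iova - m->iova_base >= m->size) {
        return false;
    }
    *phys = m->phys_base + (iova - m->iova_base);
    return true;
}

/* Stage 2: permission check on the translated physical address. */
static inline bool toy_iopmp_allows(const struct toy_iopmp *p, uint64_t phys)
{
    return phys >= p->phys_base && phys - p->phys_base < p->size;
}

/* A DMA access succeeds only if both independent stages accept it. */
static inline bool toy_dma_access(const struct toy_iommu *m,
                                  const struct toy_iopmp *p, uint64_t iova)
{
    uint64_t phys;

    return toy_iommu_translate(m, iova, &phys) && toy_iopmp_allows(p, phys);
}
```

In this model an access can be rejected by either stage on its own,
which is what "not mutually exclusive" means above.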


Thanks,

Daniel




* Re: [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation
  2024-05-23 17:39 ` [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation Daniel Henrique Barboza
  2024-05-30  1:39   ` Eric Cheng
@ 2024-06-11 16:15   ` Jason Chien
  2024-06-12  9:53     ` Daniel Henrique Barboza
  2024-06-18 10:06   ` Jason Chien
  2 siblings, 1 reply; 38+ messages in thread
From: Jason Chien @ 2024-06-11 16:15 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Sebastien Boeuf

Hi Daniel,

On 2024/5/24 01:39 AM, Daniel Henrique Barboza wrote:
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> The RISC-V IOMMU specification is now ratified as per the RISC-V
> International process. The latest frozen specification can be found
> at:
>
> https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf
>
> Add the foundation of the device emulation for RISC-V IOMMU, which
> includes an IOMMU that has no capabilities but MSI interrupt support and
> fault queue interfaces. We'll add more features incrementally in the
> next patches.
>
> Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---
>   hw/riscv/Kconfig         |    4 +
>   hw/riscv/meson.build     |    1 +
>   hw/riscv/riscv-iommu.c   | 1602 ++++++++++++++++++++++++++++++++++++++
>   hw/riscv/riscv-iommu.h   |  141 ++++
>   hw/riscv/trace-events    |   11 +
>   hw/riscv/trace.h         |    1 +
>   include/hw/riscv/iommu.h |   36 +
>   meson.build              |    1 +
>   8 files changed, 1797 insertions(+)
>   create mode 100644 hw/riscv/riscv-iommu.c
>   create mode 100644 hw/riscv/riscv-iommu.h
>   create mode 100644 hw/riscv/trace-events
>   create mode 100644 hw/riscv/trace.h
>   create mode 100644 include/hw/riscv/iommu.h
>
> diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
> index a2030e3a6f..f69d6e3c8e 100644
> --- a/hw/riscv/Kconfig
> +++ b/hw/riscv/Kconfig
> @@ -1,3 +1,6 @@
> +config RISCV_IOMMU
> +    bool
> +
>   config RISCV_NUMA
>       bool
>   
> @@ -47,6 +50,7 @@ config RISCV_VIRT
>       select SERIAL
>       select RISCV_ACLINT
>       select RISCV_APLIC
> +    select RISCV_IOMMU
>       select RISCV_IMSIC
>       select SIFIVE_PLIC
>       select SIFIVE_TEST
> diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
> index f872674093..cbc99c6e8e 100644
> --- a/hw/riscv/meson.build
> +++ b/hw/riscv/meson.build
> @@ -10,5 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
>   riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
>   riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
>   riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
> +riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
>   
>   hw_arch += {'riscv': riscv_ss}
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> new file mode 100644
> index 0000000000..39b4ff1405
> --- /dev/null
> +++ b/hw/riscv/riscv-iommu.c
> @@ -0,0 +1,1602 @@
> +/*
> + * QEMU emulation of a RISC-V IOMMU
> + *
> + * Copyright (C) 2021-2023, Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +#include "hw/pci/pci_bus.h"
> +#include "hw/pci/pci_device.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/riscv/riscv_hart.h"
> +#include "migration/vmstate.h"
> +#include "qapi/error.h"
> +#include "qemu/timer.h"
> +
> +#include "cpu_bits.h"
> +#include "riscv-iommu.h"
> +#include "riscv-iommu-bits.h"
> +#include "trace.h"
> +
> +#define LIMIT_CACHE_CTX               (1U << 7)
> +#define LIMIT_CACHE_IOT               (1U << 20)
> +
> +/* Physical page number conversions */
> +#define PPN_PHYS(ppn)                 ((ppn) << TARGET_PAGE_BITS)
> +#define PPN_DOWN(phy)                 ((phy) >> TARGET_PAGE_BITS)
> +
> +typedef struct RISCVIOMMUContext RISCVIOMMUContext;
> +typedef struct RISCVIOMMUEntry RISCVIOMMUEntry;
> +
> +/* Device assigned I/O address space */
> +struct RISCVIOMMUSpace {
> +    IOMMUMemoryRegion iova_mr;  /* IOVA memory region for attached device */
> +    AddressSpace iova_as;       /* IOVA address space for attached device */
> +    RISCVIOMMUState *iommu;     /* Managing IOMMU device state */
> +    uint32_t devid;             /* Requester identifier, AKA device_id */
> +    bool notifier;              /* IOMMU unmap notifier enabled */
> +    QLIST_ENTRY(RISCVIOMMUSpace) list;
> +};
> +
> +/* Device translation context state. */
> +struct RISCVIOMMUContext {
> +    uint64_t devid:24;          /* Requester Id, AKA device_id */
> +    uint64_t pasid:20;          /* Process Address Space ID */
> +    uint64_t __rfu:20;          /* reserved */
> +    uint64_t tc;                /* Translation Control */
> +    uint64_t ta;                /* Translation Attributes */
> +    uint64_t msi_addr_mask;     /* MSI filtering - address mask */
> +    uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
> +    uint64_t msiptp;            /* MSI redirection page table pointer */
> +};
> +
> +/* IOMMU index for transactions without PASID specified. */
> +#define RISCV_IOMMU_NOPASID 0
> +
> +static void riscv_iommu_notify(RISCVIOMMUState *s, int vec)
> +{
> +    const uint32_t fctl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FCTL);
> +    uint32_t ipsr, ivec;
> +
> +    if (fctl & RISCV_IOMMU_FCTL_WSI || !s->notify) {
> +        return;
> +    }
> +
> +    ipsr = riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, (1 << vec), 0);
> +    ivec = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_IVEC);
> +
> +    if (!(ipsr & (1 << vec))) {
> +        s->notify(s, (ivec >> (vec * 4)) & 0x0F);
> +    }
> +}
> +
> +static void riscv_iommu_fault(RISCVIOMMUState *s,
> +                              struct riscv_iommu_fq_record *ev)
> +{
> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQH) & s->fq_mask;
> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQT) & s->fq_mask;
> +    uint32_t next = (tail + 1) & s->fq_mask;
> +    uint32_t devid = get_field(ev->hdr, RISCV_IOMMU_FQ_HDR_DID);
> +
> +    trace_riscv_iommu_flt(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
> +                          PCI_FUNC(devid), ev->hdr, ev->iotval);
> +
> +    if (!(ctrl & RISCV_IOMMU_FQCSR_FQON) ||
> +        !!(ctrl & (RISCV_IOMMU_FQCSR_FQOF | RISCV_IOMMU_FQCSR_FQMF))) {
> +        return;
> +    }
> +
> +    if (head == next) {
> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
> +                              RISCV_IOMMU_FQCSR_FQOF, 0);
> +    } else {
> +        dma_addr_t addr = s->fq_addr + tail * sizeof(*ev);
> +        if (dma_memory_write(s->target_as, addr, ev, sizeof(*ev),
> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
> +                                  RISCV_IOMMU_FQCSR_FQMF, 0);
> +        } else {
> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_FQT, next);
> +        }
> +    }
> +
> +    if (ctrl & RISCV_IOMMU_FQCSR_FIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_FQ);
> +    }
> +}
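
The head/tail/next arithmetic in riscv_iommu_fault() (and again in riscv_iommu_pri()) is the usual power-of-two ring buffer that keeps one slot unused to tell "full" from "empty". Below is a minimal standalone sketch of that index logic only, with no QEMU types. The names ring_sketch and ring_sketch_push are hypothetical and do not appear in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the queue indices kept in the FQH/FQT registers. */
struct ring_sketch {
    uint32_t head;  /* consumer index, advanced by software */
    uint32_t tail;  /* producer index, advanced by the device model */
    uint32_t mask;  /* number of entries - 1; entries is a power of two */
};

/*
 * Try to claim the slot at the current tail.
 * Returns false when advancing tail would make it equal to head, which is
 * the "head == next" overflow case in the patch; tail is left untouched.
 */
static bool ring_sketch_push(struct ring_sketch *r, uint32_t *slot)
{
    uint32_t next = (r->tail + 1) & r->mask;

    if (next == r->head) {
        return false;
    }
    *slot = r->tail;
    r->tail = next;
    return true;
}
```

With mask = 3 (four entries) and head = tail = 0, three pushes succeed and the fourth reports overflow, since one slot always stays free.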
> +
> +static void riscv_iommu_pri(RISCVIOMMUState *s,
> +    struct riscv_iommu_pq_record *pr)
> +{
> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQH) & s->pq_mask;
> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQT) & s->pq_mask;
> +    uint32_t next = (tail + 1) & s->pq_mask;
> +    uint32_t devid = get_field(pr->hdr, RISCV_IOMMU_PREQ_HDR_DID);
> +
> +    trace_riscv_iommu_pri(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
> +                          PCI_FUNC(devid), pr->payload);
> +
> +    if (!(ctrl & RISCV_IOMMU_PQCSR_PQON) ||
> +        !!(ctrl & (RISCV_IOMMU_PQCSR_PQOF | RISCV_IOMMU_PQCSR_PQMF))) {
> +        return;
> +    }
> +
> +    if (head == next) {
> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
> +                              RISCV_IOMMU_PQCSR_PQOF, 0);
> +    } else {
> +        dma_addr_t addr = s->pq_addr + tail * sizeof(*pr);
> +        if (dma_memory_write(s->target_as, addr, pr, sizeof(*pr),
> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
> +                                  RISCV_IOMMU_PQCSR_PQMF, 0);
> +        } else {
> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_PQT, next);
> +        }
> +    }
> +
> +    if (ctrl & RISCV_IOMMU_PQCSR_PIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_PQ);
> +    }
> +}
> +
> +/* Portable implementation of pext_u64, bit-mask extraction. */
> +static uint64_t _pext_u64(uint64_t val, uint64_t ext)
> +{
> +    uint64_t ret = 0;
> +    uint64_t rot = 1;
> +
> +    while (ext) {
> +        if (ext & 1) {
> +            if (val & 1) {
> +                ret |= rot;
> +            }
> +            rot <<= 1;
> +        }
> +        val >>= 1;
> +        ext >>= 1;
> +    }
> +
> +    return ret;
> +}
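
For readers unfamiliar with bit extraction, here is a small standalone sketch of the same loop with worked values. It gathers the bits of val selected by mask and packs them into the low bits of the result, which is how the interrupt file number is later derived from the page number and msi_addr_mask. The function name pext_u64_sketch and the sample values are illustrative assumptions, not taken from the patch.

```c
#include <stdint.h>

/* Pack the bits of val at the positions set in mask into the low bits. */
static uint64_t pext_u64_sketch(uint64_t val, uint64_t mask)
{
    uint64_t ret = 0;
    uint64_t out_bit = 1;   /* next free position in the result */

    while (mask) {
        if (mask & 1) {
            if (val & 1) {
                ret |= out_bit;
            }
            out_bit <<= 1;  /* a selected bit consumes one output position */
        }
        val >>= 1;
        mask >>= 1;
    }
    return ret;
}
```

For example, with mask 0x6C (bits 2, 3, 5 and 6) and val 0xB6, the selected bits are 1, 0, 1, 0 from low to high, so the result is 0x5. With mask 0x5 and val 0x4 the result is 0x2.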
> +
> +/* Check if GPA matches MSI/MRIF pattern. */
> +static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    dma_addr_t gpa)
> +{
> +    if (get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE) !=
> +        RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
> +        return false; /* Invalid MSI/MRIF mode */
> +    }
> +
> +    if ((PPN_DOWN(gpa) ^ ctx->msi_addr_pattern) & ~ctx->msi_addr_mask) {
> +        return false; /* GPA not in MSI range defined by AIA IMSIC rules. */
> +    }
> +
> +    return true;
> +}
> +
> +/* RISC-V IOMMU Address Translation Lookup - Page Table Walk */
> +static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    IOMMUTLBEntry *iotlb)
> +{
> +    /* Early check for MSI address match when IOVA == GPA */
> +    if (iotlb->perm & IOMMU_WO &&
> +        riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
> +        iotlb->target_as = &s->trap_as;
> +        iotlb->translated_addr = iotlb->iova;
> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +        return 0;
> +    }
> +
> +    /* Exit early for pass-through mode. */
> +    iotlb->translated_addr = iotlb->iova;
> +    iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +    /* Allow R/W in pass-through mode */
> +    iotlb->perm = IOMMU_RW;
> +    return 0;
> +}
> +
> +static void riscv_iommu_report_fault(RISCVIOMMUState *s,
> +                                     RISCVIOMMUContext *ctx,
> +                                     uint32_t fault_type, uint32_t cause,
> +                                     bool pv,
> +                                     uint64_t iotval, uint64_t iotval2)
> +{
> +    struct riscv_iommu_fq_record ev = { 0 };
> +
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_DTF) {
> +        switch (cause) {
> +        case RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_INVALID:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED:
> +        case RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR:
> +        case RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT:
> +            break;
> +        default:
> +            /* DTF prevents reporting a fault for this given cause */
> +            return;
> +        }
> +    }
> +
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_CAUSE, cause);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_TTYPE, fault_type);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_DID, ctx->devid);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PV, true);
> +
> +    if (pv) {
> +        ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PID, ctx->pasid);
> +    }
> +
> +    ev.iotval = iotval;
> +    ev.iotval2 = iotval2;
> +
> +    riscv_iommu_fault(s, &ev);
> +}
> +
> +/* Redirect MSI write for given GPA. */
> +static MemTxResult riscv_iommu_msi_write(RISCVIOMMUState *s,
> +    RISCVIOMMUContext *ctx, uint64_t gpa, uint64_t data,
> +    unsigned size, MemTxAttrs attrs)
> +{
> +    MemTxResult res;
> +    dma_addr_t addr;
> +    uint64_t intn;
> +    uint32_t n190;
> +    uint64_t pte[2];
> +    int fault_type = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
> +    int cause;
> +
> +    if (!riscv_iommu_msi_check(s, ctx, gpa)) {
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    /* Interrupt File Number */
> +    intn = _pext_u64(PPN_DOWN(gpa), ctx->msi_addr_mask);
> +    if (intn >= 256) {
> +        /* Interrupt file number out of range */
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    /* fetch MSI PTE */
> +    addr = PPN_PHYS(get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_PPN));
> +    addr = addr | (intn * sizeof(pte));
> +    res = dma_memory_read(s->target_as, addr, &pte, sizeof(pte),
> +            MEMTXATTRS_UNSPECIFIED);
> +    if (res != MEMTX_OK) {
> +        if (res == MEMTX_DECODE_ERROR) {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED;
> +        } else {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        }
> +        goto err;
> +    }
> +
> +    le64_to_cpus(&pte[0]);
> +    le64_to_cpus(&pte[1]);
> +
> +    if (!(pte[0] & RISCV_IOMMU_MSI_PTE_V) || (pte[0] & RISCV_IOMMU_MSI_PTE_C)) {
> +        /*
> +         * The spec mentions that: "If msipte.C == 1, then further
> +         * processing to interpret the PTE is implementation
> +         * defined.". We'll abort with cause = 262 for this
> +         * case too.
> +         */
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_INVALID;
> +        goto err;
> +    }
> +
> +    switch (get_field(pte[0], RISCV_IOMMU_MSI_PTE_M)) {
> +    case RISCV_IOMMU_MSI_PTE_M_BASIC:
> +        /* MSI Pass-through mode */
> +        addr = PPN_PHYS(get_field(pte[0], RISCV_IOMMU_MSI_PTE_PPN));
> +        addr = addr | (gpa & TARGET_PAGE_MASK);
> +
> +        trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
> +                              PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
> +                              gpa, addr);
> +
> +        res = dma_memory_write(s->target_as, addr, &data, size, attrs);
> +        if (res != MEMTX_OK) {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +            goto err;
> +        }
> +
> +        return MEMTX_OK;
> +    case RISCV_IOMMU_MSI_PTE_M_MRIF:
> +        /* MRIF mode, continue. */
> +        break;
> +    default:
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
> +        goto err;
> +    }
> +
> +    /*
> +     * Report an error for interrupt identities exceeding the maximum allowed
> +     * for an IMSIC interrupt file (2047) or a destination address that is not
> +     * 32-bit aligned. See IOMMU Specification, Chapter 2.3. MSI page tables.
> +     */
> +    if ((data > 2047) || (gpa & 3)) {
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
> +        goto err;
> +    }
> +
> +    /* MSI MRIF mode, non atomic pending bit update */
> +
> +    /* MRIF pending bit address */
> +    addr = get_field(pte[0], RISCV_IOMMU_MSI_PTE_MRIF_ADDR) << 9;
> +    addr = addr | ((data & 0x7c0) >> 3);
> +
> +    trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
> +                          PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
> +                          gpa, addr);
> +
> +    /* MRIF pending bit mask */
> +    data = 1ULL << (data & 0x03f);
> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    intn = intn | data;
> +    res = dma_memory_write(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +        goto err;
> +    }
> +
> +    /* Get MRIF enable bits */
> +    addr = addr + sizeof(intn);
> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    if (!(intn & data)) {
> +        /* notification disabled, MRIF update completed. */
> +        return MEMTX_OK;
> +    }
> +
> +    /* Send notification message */
> +    addr = PPN_PHYS(get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NPPN));
> +    n190 = get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID) |
> +          (get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID_MSB) << 10);
> +
> +    res = dma_memory_write(s->target_as, addr, &n190, sizeof(n190), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +        goto err;
> +    }
> +
> +    return MEMTX_OK;
> +
> +err:
> +    riscv_iommu_report_fault(s, ctx, fault_type, cause,
> +                             !!ctx->pasid, 0, 0);
> +    return res;
> +}
> +
> +/*
> + * Check device context configuration as described by the
> + * riscv-iommu spec section "Device-context configuration
> + * checks".
> + */
> +static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
> +                                            RISCVIOMMUContext *ctx)
> +{
> +    uint32_t msi_mode;
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
> +        ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
> +        return false;
> +    }
> +
> +    if (!(s->cap & RISCV_IOMMU_CAP_T2GPA) &&
> +        ctx->tc & RISCV_IOMMU_DC_TC_T2GPA) {
> +        return false;
> +    }
> +
> +    if (s->cap & RISCV_IOMMU_CAP_MSI_FLAT) {
> +        msi_mode = get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE);
> +
> +        if (msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_OFF &&
> +            msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
> +            return false;
> +        }
> +    }
> +
> +    /*
> +     * CAP_END is always zero (only one endianness). FCTL_BE is
> +     * always zero (little-endian accesses). Thus TC_SBE must
> +     * always be LE, i.e. zero.
> +     */
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_SBE) {
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +/*
> + * RISC-V IOMMU Device Context Lookup - Device Directory Tree Walk
> + *
> + * @s         : IOMMU Device State
> + * @ctx       : Device Translation Context with devid and pasid set.
> + * @return    : success or fault code.
> + */
> +static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
> +{
> +    const uint64_t ddtp = s->ddtp;
> +    unsigned mode = get_field(ddtp, RISCV_IOMMU_DDTP_MODE);
> +    dma_addr_t addr = PPN_PHYS(get_field(ddtp, RISCV_IOMMU_DDTP_PPN));
> +    struct riscv_iommu_dc dc;
> +    /* Device Context format: 0: extended (64 bytes) | 1: base (32 bytes) */
> +    const int dc_fmt = !s->enable_msi;
> +    const size_t dc_len = sizeof(dc) >> dc_fmt;
> +    unsigned depth;
> +    uint64_t de;
> +
> +    switch (mode) {
> +    case RISCV_IOMMU_DDTP_MODE_OFF:
> +        return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
> +
> +    case RISCV_IOMMU_DDTP_MODE_BARE:
> +        /* mock up pass-through translation context */
> +        ctx->tc = RISCV_IOMMU_DC_TC_V;
> +        ctx->ta = 0;
> +        ctx->msiptp = 0;
> +        return 0;
> +
> +    case RISCV_IOMMU_DDTP_MODE_1LVL:
> +        depth = 0;
> +        break;
> +
> +    case RISCV_IOMMU_DDTP_MODE_2LVL:
> +        depth = 1;
> +        break;
> +
> +    case RISCV_IOMMU_DDTP_MODE_3LVL:
> +        depth = 2;
> +        break;
> +
> +    default:
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +    }
> +
> +    /*
> +     * Check supported device id width (in bits).
> +     * See IOMMU Specification, Chapter 6. Software guidelines.
> +     * - if extended device-context format is used:
> +     *   1LVL: 6, 2LVL: 15, 3LVL: 24
> +     * - if base device-context format is used:
> +     *   1LVL: 7, 2LVL: 16, 3LVL: 24
> +     */
> +    if (ctx->devid >= (1 << (depth * 9 + 6 + (dc_fmt && depth != 2)))) {
> +        return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
> +    }
> +
> +    /* Device directory tree walk */
> +    for (; depth-- > 0; ) {
> +        /*
> +         * Select device id index bits based on device directory tree level
> +         * and device context format.
> +         * See IOMMU Specification, Chapter 2. Data Structures.
> +         * - if extended device-context format is used:
> +         *   device index: [23:15][14:6][5:0]
> +         * - if base device-context format is used:
> +         *   device index: [23:16][15:7][6:0]
> +         */
> +        const int split = depth * 9 + 6 + dc_fmt;
> +        addr |= ((ctx->devid >> split) << 3) & ~TARGET_PAGE_MASK;
> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
> +        }
> +        le64_to_cpus(&de);
> +        if (!(de & RISCV_IOMMU_DDTE_VALID)) {
> +            /* invalid directory entry */
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
> +        }
> +        if (de & ~(RISCV_IOMMU_DDTE_PPN | RISCV_IOMMU_DDTE_VALID)) {
> +            /* reserved bits set */
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +        }
> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_DDTE_PPN));
> +    }
> +
> +    /* index into device context entry page */
> +    addr |= (ctx->devid * dc_len) & ~TARGET_PAGE_MASK;
> +
> +    memset(&dc, 0, sizeof(dc));
> +    if (dma_memory_read(s->target_as, addr, &dc, dc_len,
> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
> +    }
> +
> +    /* Set translation context. */
> +    ctx->tc = le64_to_cpu(dc.tc);
> +    ctx->ta = le64_to_cpu(dc.ta);
> +    ctx->msiptp = le64_to_cpu(dc.msiptp);
> +    ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
> +    ctx->msi_addr_pattern = le64_to_cpu(dc.msi_addr_pattern);
> +
> +    if (!riscv_iommu_validate_device_ctx(s, ctx)) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +    }
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_V)) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
> +    }
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
> +        if (ctx->pasid != RISCV_IOMMU_NOPASID) {
> +            /* PASID is disabled */
> +            return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
> +        }
> +        return 0;
> +    }
> +
> +    /* FSC.TC.PDTV enabled */
> +    if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
> +        /* Invalid PDTP.MODE */
> +        return RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED;
> +    }
> +
> +    for (depth = mode - RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8; depth-- > 0; ) {
> +        /*
> +         * Select process id index bits based on process directory tree
> +         * level. See IOMMU Specification, 2.2. Process-Directory-Table.
> +         */
> +        const int split = depth * 9 + 8;
> +        addr |= ((ctx->pasid >> split) << 3) & ~TARGET_PAGE_MASK;
> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
> +        }
> +        le64_to_cpus(&de);
> +        if (!(de & RISCV_IOMMU_PC_TA_V)) {
> +            return RISCV_IOMMU_FQ_CAUSE_PDT_INVALID;
> +        }
> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_PC_FSC_PPN));
> +    }
> +
> +    /* Leaf entry in PDT */
> +    addr |= (ctx->pasid << 4) & ~TARGET_PAGE_MASK;
> +    if (dma_memory_read(s->target_as, addr, &dc.ta, sizeof(uint64_t) * 2,
> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +        return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
> +    }
> +
> +    /* Use FSC and TA from process directory entry. */
> +    ctx->ta = le64_to_cpu(dc.ta);
> +
> +    return 0;
> +}
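
The device id width check near the top of riscv_iommu_ctx_fetch() packs the table from its comment into one expression. Below is a hedged standalone sketch that spells the expression out so the six values can be checked by hand. The helper names ddt_devid_bits and ddt_devid_fits are hypothetical, and "levels" is 1, 2 or 3 for 1LVL, 2LVL and 3LVL (the patch uses depth = levels - 1).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Supported device id width in bits.
 * Extended format (64-byte DC): 6, 15, 24 bits for 1, 2, 3 levels.
 * Base format (32-byte DC):     7, 16, 24 bits for 1, 2, 3 levels.
 */
static unsigned ddt_devid_bits(unsigned levels, bool base_format)
{
    unsigned depth = levels - 1;

    /* The base format gains one bit, except at 3 levels (capped at 24). */
    return depth * 9 + 6 + ((base_format && depth != 2) ? 1 : 0);
}

/* True when devid is representable for the given mode and format. */
static bool ddt_devid_fits(uint32_t devid, unsigned levels, bool base_format)
{
    return devid < (1u << ddt_devid_bits(levels, base_format));
}
```

As an example, device id 64 does not fit a 1LVL table with the extended format (6 bits) but does fit with the base format (7 bits).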
> +
> +/* Translation Context cache support */
> +static gboolean __ctx_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
> +    RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
> +    return c1->devid == c2->devid && c1->pasid == c2->pasid;
> +}
> +
> +static guint __ctx_hash(gconstpointer v)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
> +    /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
> +    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
> +}
> +
> +static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
> +        ctx->devid == arg->devid &&
> +        ctx->pasid == arg->pasid) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void __ctx_inval_devid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
> +        ctx->devid == arg->devid) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void __ctx_inval_all(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
> +    uint32_t devid, uint32_t pasid)
> +{
> +    GHashTable *ctx_cache;
> +    RISCVIOMMUContext key = {
> +        .devid = devid,
> +        .pasid = pasid,
> +    };
> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    g_hash_table_foreach(ctx_cache, func, &key);
> +    g_hash_table_unref(ctx_cache);
> +}
> +
> +/* Find or allocate translation context for a given {device_id, process_id} */
> +static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
> +    unsigned devid, unsigned pasid, void **ref)
> +{
> +    GHashTable *ctx_cache;
> +    RISCVIOMMUContext *ctx;
> +    RISCVIOMMUContext key = {
> +        .devid = devid,
> +        .pasid = pasid,
> +    };
> +
> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    ctx = g_hash_table_lookup(ctx_cache, &key);
> +
> +    if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
> +        *ref = ctx_cache;
> +        return ctx;
> +    }
> +
> +    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
> +        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> +                                          g_free, NULL);
> +        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
> +    }
> +
> +    ctx = g_new0(RISCVIOMMUContext, 1);
> +    ctx->devid = devid;
> +    ctx->pasid = pasid;
> +
> +    int fault = riscv_iommu_ctx_fetch(s, ctx);
> +    if (!fault) {
> +        g_hash_table_add(ctx_cache, ctx);
> +        *ref = ctx_cache;
> +        return ctx;
> +    }
> +
> +    g_hash_table_unref(ctx_cache);
> +    *ref = NULL;
> +
> +    riscv_iommu_report_fault(s, ctx, RISCV_IOMMU_FQ_TTYPE_UADDR_RD,
> +                             fault, !!pasid, 0, 0);
> +
> +    g_free(ctx);
> +    return NULL;
> +}
> +
> +static void riscv_iommu_ctx_put(RISCVIOMMUState *s, void *ref)
> +{
> +    if (ref) {
> +        g_hash_table_unref((GHashTable *)ref);
> +    }
> +}
> +
> +/* Find or allocate address space for a given device */
> +static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
> +{
> +    RISCVIOMMUSpace *as;
> +
> +    /* FIXME: PCIe bus remapping for attached endpoints. */
> +    devid |= s->bus << 8;
> +
> +    qemu_mutex_lock(&s->core_lock);
> +    QLIST_FOREACH(as, &s->spaces, list) {
> +        if (as->devid == devid) {
> +            break;
> +        }
> +    }
> +    qemu_mutex_unlock(&s->core_lock);
> +
> +    if (as == NULL) {
> +        char name[64];
> +        as = g_new0(RISCVIOMMUSpace, 1);
> +
> +        as->iommu = s;
> +        as->devid = devid;
> +
> +        snprintf(name, sizeof(name), "riscv-iommu-%04x:%02x.%d-iova",
> +            PCI_BUS_NUM(as->devid), PCI_SLOT(as->devid), PCI_FUNC(as->devid));
> +
> +        /* IOVA address space, untranslated addresses */
> +        memory_region_init_iommu(&as->iova_mr, sizeof(as->iova_mr),
> +            TYPE_RISCV_IOMMU_MEMORY_REGION,
> +            OBJECT(as), "riscv_iommu", UINT64_MAX);
> +        address_space_init(&as->iova_as, MEMORY_REGION(&as->iova_mr), name);
> +
> +        qemu_mutex_lock(&s->core_lock);
> +        QLIST_INSERT_HEAD(&s->spaces, as, list);
> +        qemu_mutex_unlock(&s->core_lock);
> +
> +        trace_riscv_iommu_new(s->parent_obj.id, PCI_BUS_NUM(as->devid),
> +                PCI_SLOT(as->devid), PCI_FUNC(as->devid));
> +    }
> +    return &as->iova_as;
> +}
> +
> +static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    IOMMUTLBEntry *iotlb)
> +{
> +    bool enable_pasid;
> +    bool enable_pri;
> +    int fault;
> +
> +    /*
> +     * TC[32] is reserved for custom extensions, used here to temporarily
> +     * enable automatic page-request generation for ATS queries.
> +     */
> +    enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
> +    enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
> +
> +    /* Translate using device directory / page table information. */
> +    fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
> +
> +    if (enable_pri && fault) {
> +        struct riscv_iommu_pq_record pr = {0};
> +        if (enable_pasid) {
> +            pr.hdr = set_field(RISCV_IOMMU_PREQ_HDR_PV,
> +                RISCV_IOMMU_PREQ_HDR_PID, ctx->pasid);
> +        }
> +        pr.hdr = set_field(pr.hdr, RISCV_IOMMU_PREQ_HDR_DID, ctx->devid);
> +        pr.payload = (iotlb->iova & TARGET_PAGE_MASK) |
> +                     RISCV_IOMMU_PREQ_PAYLOAD_M;
> +        riscv_iommu_pri(s, &pr);
> +        return fault;
> +    }
> +
> +    if (fault) {
> +        unsigned ttype;
> +
> +        if (iotlb->perm & IOMMU_RW) {
> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
> +        } else {
> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
> +        }
> +
> +        riscv_iommu_report_fault(s, ctx, ttype, fault, enable_pasid,
> +                                 iotlb->iova, iotlb->translated_addr);
> +        return fault;
> +    }
> +
> +    return 0;
> +}
> +
> +/* IOMMU Command Interface */
> +static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
> +    uint64_t addr, uint32_t data)
> +{
> +    /*
> +     * ATS processing in this implementation of the IOMMU is synchronous,
> +     * no need to wait for completions here.
> +     */
> +    if (!notify) {
> +        return MEMTX_OK;
> +    }
> +
> +    return dma_memory_write(s->target_as, addr, &data, sizeof(data),
> +        MEMTXATTRS_UNSPECIFIED);
> +}
> +
> +static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
> +{
> +    uint64_t old_ddtp = s->ddtp;
> +    uint64_t new_ddtp = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_DDTP);
> +    unsigned new_mode = get_field(new_ddtp, RISCV_IOMMU_DDTP_MODE);
> +    unsigned old_mode = get_field(old_ddtp, RISCV_IOMMU_DDTP_MODE);
> +    bool ok = false;
> +
> +    /*
> +     * Check for allowed DDTP.MODE transitions:
> +     * {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
> +     * {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
> +     */
> +    if (new_mode == old_mode ||
> +        new_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
> +        new_mode == RISCV_IOMMU_DDTP_MODE_BARE) {
> +        ok = true;
> +    } else if (new_mode == RISCV_IOMMU_DDTP_MODE_1LVL ||
> +               new_mode == RISCV_IOMMU_DDTP_MODE_2LVL ||
> +               new_mode == RISCV_IOMMU_DDTP_MODE_3LVL) {
> +        ok = old_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
> +             old_mode == RISCV_IOMMU_DDTP_MODE_BARE;
> +    }
> +
> +    if (ok) {
> +        /* clear reserved and busy bits, report back sanitized version */
> +        new_ddtp = set_field(new_ddtp & RISCV_IOMMU_DDTP_PPN,
> +                             RISCV_IOMMU_DDTP_MODE, new_mode);
> +    } else {
> +        new_ddtp = old_ddtp;
> +    }
> +    s->ddtp = new_ddtp;
> +
> +    riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_DDTP, new_ddtp);
> +}
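
The DDTP.MODE transition rule in riscv_iommu_process_ddtp() can be read as a small predicate. The sketch below restates it in isolation, as one possible reading of the code above. The enum ddtp_mode_sketch and its numeric values are placeholders chosen for illustration; the real encodings come from riscv-iommu-bits.h, which is not shown here.

```c
#include <stdbool.h>

/* Placeholder mode values, for illustration only. */
enum ddtp_mode_sketch {
    MODE_OFF,
    MODE_BARE,
    MODE_1LVL,
    MODE_2LVL,
    MODE_3LVL,
    MODE_RESERVED,  /* stands in for any unsupported encoding */
};

static bool mode_is_lvl(enum ddtp_mode_sketch m)
{
    return m == MODE_1LVL || m == MODE_2LVL || m == MODE_3LVL;
}

/*
 * Allowed transitions, mirroring the comment in the patch:
 *   {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
 *   {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
 * Writing the current mode again is also accepted.
 */
static bool ddtp_transition_ok(enum ddtp_mode_sketch old_mode,
                               enum ddtp_mode_sketch new_mode)
{
    if (new_mode == old_mode ||
        new_mode == MODE_OFF || new_mode == MODE_BARE) {
        return true;
    }
    if (mode_is_lvl(new_mode)) {
        return old_mode == MODE_OFF || old_mode == MODE_BARE;
    }
    return false;
}
```

Under this reading, going from 1LVL straight to 2LVL is rejected and the old DDTP value is kept, while 2LVL to BARE and OFF to 3LVL are accepted.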
> +
> +/* Command function and opcode field. */
> +#define RISCV_IOMMU_CMD(func, op) (((func) << 7) | (op))
> +
> +static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
> +{
> +    struct riscv_iommu_command cmd;
> +    MemTxResult res;
> +    dma_addr_t addr;
> +    uint32_t tail, head, ctrl;
> +    uint64_t cmd_opcode;
> +    GHFunc func;
> +
> +    ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +    tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQT) & s->cq_mask;
> +    head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQH) & s->cq_mask;
> +
> +    /* Check for pending error or queue processing disabled */
> +    if (!(ctrl & RISCV_IOMMU_CQCSR_CQON) ||
> +        !!(ctrl & (RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CQMF))) {
> +        return;
> +    }
> +
> +    while (tail != head) {
> +        addr = s->cq_addr + head * sizeof(cmd);
> +        res = dma_memory_read(s->target_as, addr, &cmd, sizeof(cmd),
> +                              MEMTXATTRS_UNSPECIFIED);
> +
> +        if (res != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                                  RISCV_IOMMU_CQCSR_CQMF, 0);
> +            goto fault;
> +        }
> +
> +        trace_riscv_iommu_cmd(s->parent_obj.id, cmd.dword0, cmd.dword1);
> +
> +        cmd_opcode = get_field(cmd.dword0,
> +                               RISCV_IOMMU_CMD_OPCODE | RISCV_IOMMU_CMD_FUNC);
> +
> +        switch (cmd_opcode) {
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOFENCE_FUNC_C,
> +                             RISCV_IOMMU_CMD_IOFENCE_OPCODE):
> +            res = riscv_iommu_iofence(s,
> +                cmd.dword0 & RISCV_IOMMU_CMD_IOFENCE_AV, cmd.dword1,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOFENCE_DATA));
> +
> +            if (res != MEMTX_OK) {
> +                riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                                      RISCV_IOMMU_CQCSR_CQMF, 0);
> +                goto fault;
> +            }
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA,
> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
> +            if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
> +                /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
> +                goto cmd_ill;
> +            }
> +            /* translation cache not implemented yet */
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
> +            /* translation cache not implemented yet */
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
> +                /* invalidate all device context cache mappings */
> +                func = __ctx_inval_all;
> +            } else {
> +                /* invalidate all device context matching DID */
> +                func = __ctx_inval_devid;
> +            }
> +            riscv_iommu_ctx_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID), 0);
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT,
> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
> +                /* illegal command arguments IODIR_PDT & DV == 0 */
> +                goto cmd_ill;
> +            } else {
> +                func = __ctx_inval_devid_pasid;
> +            }
> +            riscv_iommu_ctx_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID),
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
> +            break;
> +
> +        default:
> +        cmd_ill:
> +            /* Invalid instruction, do not advance instruction index. */
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                RISCV_IOMMU_CQCSR_CMD_ILL, 0);
> +            goto fault;
> +        }
> +
> +        /* Advance and update head pointer after command completes. */
> +        head = (head + 1) & s->cq_mask;
> +        riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_CQH, head);
> +    }
> +    return;
> +
> +fault:
> +    if (ctrl & RISCV_IOMMU_CQCSR_CIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_CQ);
> +    }
> +}
> +
> +static void riscv_iommu_process_cq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_CQB);
> +        s->cq_mask = (2ULL << get_field(base, RISCV_IOMMU_CQB_LOG2SZ)) - 1;
> +        s->cq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_CQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~s->cq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQT], 0);
> +        ctrl_set = RISCV_IOMMU_CQCSR_CQON;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQMF |
> +                   RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CMD_TO |
> +                   RISCV_IOMMU_CQCSR_FENCE_W_IP;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR, ctrl_set, ctrl_clr);
> +}
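
A side note on the queue index arithmetic above: here is a minimal standalone sketch, assuming the LOG2SZ field holds log2(entries) - 1, which is what the `2ULL << field` expression in the patch implies. The helper names `queue_mask` and `queue_next` are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: index mask for a queue whose size field holds
 * log2(entries) - 1, mirroring "(2ULL << field) - 1" in the patch.
 */
static uint32_t queue_mask(unsigned log2sz_minus1)
{
    assert(log2sz_minus1 < 31);   /* keep the mask within 32 bits */
    return (uint32_t)((2ULL << log2sz_minus1) - 1);
}

/* Advance a head or tail index by one entry, wrapping at the queue size. */
static uint32_t queue_next(uint32_t index, uint32_t mask)
{
    return (index + 1) & mask;
}
```

With a field value of 3 the queue has 16 entries, the mask is 15, and advancing index 15 wraps to 0, which is the same update the command loop applies to `head`.
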
> +
> +static void riscv_iommu_process_fq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_FQB);
> +        s->fq_mask = (2ULL << get_field(base, RISCV_IOMMU_FQB_LOG2SZ)) - 1;
> +        s->fq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_FQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~s->fq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQT], 0);
> +        ctrl_set = RISCV_IOMMU_FQCSR_FQON;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQMF |
> +            RISCV_IOMMU_FQCSR_FQOF;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_PQB);
> +        s->pq_mask = (2ULL << get_field(base, RISCV_IOMMU_PQB_LOG2SZ)) - 1;
> +        s->pq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_PQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~s->pq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQT], 0);
> +        ctrl_set = RISCV_IOMMU_PQCSR_PQON;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQMF |
> +            RISCV_IOMMU_PQCSR_PQOF;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
> +
> +static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
> +{
> +    uint32_t cqcsr, fqcsr, pqcsr;
> +    uint32_t ipsr_set = 0;
> +    uint32_t ipsr_clr = 0;
> +
> +    if (data & RISCV_IOMMU_IPSR_CIP) {
> +        cqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +
> +        if (cqcsr & RISCV_IOMMU_CQCSR_CIE &&
> +            (cqcsr & RISCV_IOMMU_CQCSR_FENCE_W_IP ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_ILL ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_TO ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_CIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
> +    }
> +
> +    if (data & RISCV_IOMMU_IPSR_FIP) {
> +        fqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +
> +        if (fqcsr & RISCV_IOMMU_FQCSR_FIE &&
> +            (fqcsr & RISCV_IOMMU_FQCSR_FQOF ||
> +             fqcsr & RISCV_IOMMU_FQCSR_FQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_FIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
> +    }
> +
> +    if (data & RISCV_IOMMU_IPSR_PIP) {
> +        pqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +
> +        if (pqcsr & RISCV_IOMMU_PQCSR_PIE &&
> +            (pqcsr & RISCV_IOMMU_PQCSR_PQOF ||
> +             pqcsr & RISCV_IOMMU_PQCSR_PQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_PIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, ipsr_set, ipsr_clr);
> +}
> +
> +static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
> +    uint64_t data, unsigned size, MemTxAttrs attrs)
> +{
> +    riscv_iommu_process_fn *process_fn = NULL;
> +    RISCVIOMMUState *s = opaque;
> +    uint32_t regb = addr & ~3;
> +    uint32_t busy = 0;
> +    uint64_t val = 0;
> +
> +    if ((addr & (size - 1)) != 0) {
> +        /* Unsupported MMIO alignment or access size */
> +        return MEMTX_ERROR;
> +    }
> +
> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
> +        /* Unsupported MMIO access location. */
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    /* Track actionable MMIO write. */
> +    switch (regb) {
> +    case RISCV_IOMMU_REG_DDTP:
> +    case RISCV_IOMMU_REG_DDTP + 4:
> +        process_fn = riscv_iommu_process_ddtp;
> +        regb = RISCV_IOMMU_REG_DDTP;
> +        busy = RISCV_IOMMU_DDTP_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_CQT:
> +        process_fn = riscv_iommu_process_cq_tail;
> +        break;
> +
> +    case RISCV_IOMMU_REG_CQCSR:
> +        process_fn = riscv_iommu_process_cq_control;
> +        busy = RISCV_IOMMU_CQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_FQCSR:
> +        process_fn = riscv_iommu_process_fq_control;
> +        busy = RISCV_IOMMU_FQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_PQCSR:
> +        process_fn = riscv_iommu_process_pq_control;
> +        busy = RISCV_IOMMU_PQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_IPSR:
> +        /*
> +         * IPSR has special procedures to update. Execute it
> +         * and exit.
> +         */
> +        if (size == 4) {
> +            uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
> +            uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
> +            uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
> +            stl_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +        } else if (size == 8) {
> +            uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
> +            uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
> +            uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
> +            stq_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +        }
> +
> +        riscv_iommu_update_ipsr(s, val);
> +
> +        return MEMTX_OK;
> +
> +    default:
> +        break;
> +    }
> +
> +    /*
> +     * Registers update might be not synchronized with core logic.
> +     * If system software updates register when relevant BUSY bit
> +     * is set IOMMU behavior of additional writes to the register
> +     * is UNSPECIFIED.
> +     */
> +    qemu_spin_lock(&s->regs_lock);
> +    if (size == 1) {
> +        uint8_t ro = s->regs_ro[addr];
> +        uint8_t wc = s->regs_wc[addr];
> +        uint8_t rw = s->regs_rw[addr];
> +        s->regs_rw[addr] = ((rw & ro) | (data & ~ro)) & ~(data & wc);
> +    } else if (size == 2) {
> +        uint16_t ro = lduw_le_p(&s->regs_ro[addr]);
> +        uint16_t wc = lduw_le_p(&s->regs_wc[addr]);
> +        uint16_t rw = lduw_le_p(&s->regs_rw[addr]);
> +        stw_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    } else if (size == 4) {
> +        uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
> +        uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
> +        uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
> +        stl_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    } else if (size == 8) {
> +        uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
> +        uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
> +        uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
> +        stq_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    }
> +
> +    /* Busy flag update, MSB 4-byte register. */
> +    if (busy) {
> +        uint32_t rw = ldl_le_p(&s->regs_rw[regb]);
> +        stl_le_p(&s->regs_rw[regb], rw | busy);
> +    }
> +    qemu_spin_unlock(&s->regs_lock);
> +
> +    if (process_fn) {
> +        qemu_mutex_lock(&s->core_lock);
> +        process_fn(s);
> +        qemu_mutex_unlock(&s->core_lock);
> +    }
> +
> +    return MEMTX_OK;
> +}
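
For reference, the write-merge expression used for every access size above can be isolated as follows. This is only a sketch of that one expression for the 32-bit case; `reg_write32` is a hypothetical name, not a function in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Merge a guest write into a 32-bit register, using the same expression
 * as the patch:
 *   ro: bits set here keep their current value
 *   wc: bits set here are cleared when the guest writes 1 to them
 * A wc bit survives a write of 0 only if it is also set in ro;
 * otherwise the written 0 is stored like any other writable bit.
 */
static uint32_t reg_write32(uint32_t cur, uint32_t data,
                            uint32_t ro, uint32_t wc)
{
    return ((cur & ro) | (data & ~ro)) & ~(data & wc);
}
```

For example, with cur = 0xF0, data = 0x0F, ro = 0xF0 and wc = 0, the upper nibble is preserved and the lower nibble is taken from the write, giving 0xFF.
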
> +
> +static MemTxResult riscv_iommu_mmio_read(void *opaque, hwaddr addr,
> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
> +{
> +    RISCVIOMMUState *s = opaque;
> +    uint64_t val = -1;
> +    uint8_t *ptr;
> +
> +    if ((addr & (size - 1)) != 0) {
> +        /* Unsupported MMIO alignment. */
> +        return MEMTX_ERROR;
> +    }
> +
> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    ptr = &s->regs_rw[addr];
> +
> +    if (size == 1) {
> +        val = (uint64_t)*ptr;
> +    } else if (size == 2) {
> +        val = lduw_le_p(ptr);
> +    } else if (size == 4) {
> +        val = ldl_le_p(ptr);
> +    } else if (size == 8) {
> +        val = ldq_le_p(ptr);
> +    } else {
> +        return MEMTX_ERROR;
> +    }
> +
> +    *data = val;
> +
> +    return MEMTX_OK;
> +}
> +
> +static const MemoryRegionOps riscv_iommu_mmio_ops = {
> +    .read_with_attrs = riscv_iommu_mmio_read,
> +    .write_with_attrs = riscv_iommu_mmio_write,
> +    .endianness = DEVICE_NATIVE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +        .unaligned = false,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +/*
> + * Translations matching MSI pattern check are redirected to "riscv-iommu-trap"
> + * memory region as untranslated address, for additional MSI/MRIF interception
> + * by IOMMU interrupt remapping implementation.
> + * Note: Device emulation code generating an MSI is expected to provide valid
> + * memory transaction attributes with requester_id set.
> + */
> +static MemTxResult riscv_iommu_trap_write(void *opaque, hwaddr addr,
> +    uint64_t data, unsigned size, MemTxAttrs attrs)
> +{
> +    RISCVIOMMUState* s = (RISCVIOMMUState *)opaque;
> +    RISCVIOMMUContext *ctx;
> +    MemTxResult res;
> +    void *ref;
> +    uint32_t devid = attrs.requester_id;
> +
> +    if (attrs.unspecified) {
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    /* FIXME: PCIe bus remapping for attached endpoints. */
> +    devid |= s->bus << 8;
> +
> +    ctx = riscv_iommu_ctx(s, devid, 0, &ref);
> +    if (ctx == NULL) {
> +        res = MEMTX_ACCESS_ERROR;
> +    } else {
> +        res = riscv_iommu_msi_write(s, ctx, addr, data, size, attrs);
> +    }
> +    riscv_iommu_ctx_put(s, ref);
> +    return res;
> +}
> +
> +static MemTxResult riscv_iommu_trap_read(void *opaque, hwaddr addr,
> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
> +{
> +    return MEMTX_ACCESS_ERROR;
> +}
> +
> +static const MemoryRegionOps riscv_iommu_trap_ops = {
> +    .read_with_attrs = riscv_iommu_trap_read,
> +    .write_with_attrs = riscv_iommu_trap_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +        .unaligned = true,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +static void riscv_iommu_realize(DeviceState *dev, Error **errp)
> +{
> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
> +
> +    s->cap = s->version & RISCV_IOMMU_CAP_VERSION;
> +    if (s->enable_msi) {
> +        s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
> +    }
> +    /* Report QEMU target physical address space limits */
> +    s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
> +                       TARGET_PHYS_ADDR_SPACE_BITS);
> +
> +    /* TODO: method to report supported PASID bits */
> +    s->pasid_bits = 8; /* restricted to size of MemTxAttrs.pasid */
> +    s->cap |= RISCV_IOMMU_CAP_PD8;
> +
> +    /* Out-of-reset translation mode: OFF (DMA disabled) BARE (passthrough) */
> +    s->ddtp = set_field(0, RISCV_IOMMU_DDTP_MODE, s->enable_off ?
> +                        RISCV_IOMMU_DDTP_MODE_OFF : RISCV_IOMMU_DDTP_MODE_BARE);
> +
> +    /* register storage */
> +    s->regs_rw = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +    s->regs_ro = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +    s->regs_wc = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +
> +     /* Mark all registers read-only */
> +    memset(s->regs_ro, 0xff, RISCV_IOMMU_REG_SIZE);
> +
> +    /*
> +     * Register complete MMIO space, including MSI/PBA registers.
> +     * Note, PCIDevice implementation will add overlapping MR for MSI/PBA,
> +     * managed directly by the PCIDevice implementation.
> +     */
> +    memory_region_init_io(&s->regs_mr, OBJECT(dev), &riscv_iommu_mmio_ops, s,
> +        "riscv-iommu-regs", RISCV_IOMMU_REG_SIZE);
> +
> +    /* Set power-on register state */
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
> +        ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
> +        ~(RISCV_IOMMU_CQB_LOG2SZ | RISCV_IOMMU_CQB_PPN));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQB],
> +        ~(RISCV_IOMMU_FQB_LOG2SZ | RISCV_IOMMU_FQB_PPN));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQB],
> +        ~(RISCV_IOMMU_PQB_LOG2SZ | RISCV_IOMMU_PQB_PPN));
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQMF |
> +        RISCV_IOMMU_CQCSR_CMD_TO | RISCV_IOMMU_CQCSR_CMD_ILL);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQON |
> +        RISCV_IOMMU_CQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQMF |
> +        RISCV_IOMMU_FQCSR_FQOF);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQON |
> +        RISCV_IOMMU_FQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQMF |
> +        RISCV_IOMMU_PQCSR_PQOF);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQON |
> +        RISCV_IOMMU_PQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
> +
> +    /* Memory region for downstream access, if specified. */
> +    if (s->target_mr) {
> +        s->target_as = g_new0(AddressSpace, 1);
> +        address_space_init(s->target_as, s->target_mr,
> +            "riscv-iommu-downstream");
> +    } else {
> +        /* Fallback to global system memory. */
> +        s->target_as = &address_space_memory;
> +    }
> +
> +    /* Memory region for untranslated MRIF/MSI writes */
> +    memory_region_init_io(&s->trap_mr, OBJECT(dev), &riscv_iommu_trap_ops, s,
> +            "riscv-iommu-trap", ~0ULL);
> +    address_space_init(&s->trap_as, &s->trap_mr, "riscv-iommu-trap-as");
> +
> +    /* Device translation context cache */
> +    s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> +                                         g_free, NULL);
> +
> +    s->iommus.le_next = NULL;
> +    s->iommus.le_prev = NULL;
> +    QLIST_INIT(&s->spaces);
> +    qemu_mutex_init(&s->core_lock);
> +    qemu_spin_init(&s->regs_lock);
> +}
> +
> +static void riscv_iommu_unrealize(DeviceState *dev)
> +{
> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
> +
> +    qemu_mutex_destroy(&s->core_lock);
> +    g_hash_table_unref(s->ctx_cache);
> +}
> +
> +static Property riscv_iommu_properties[] = {
> +    DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
> +        RISCV_IOMMU_SPEC_DOT_VER),
> +    DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
> +    DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
> +    DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
> +    DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
> +        TYPE_MEMORY_REGION, MemoryRegion *),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void riscv_iommu_class_init(ObjectClass *klass, void* data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    /* internal device for riscv-iommu-{pci/sys}, not user-creatable */
> +    dc->user_creatable = false;
> +    dc->realize = riscv_iommu_realize;
> +    dc->unrealize = riscv_iommu_unrealize;
> +    device_class_set_props(dc, riscv_iommu_properties);
> +}
> +
> +static const TypeInfo riscv_iommu_info = {
> +    .name = TYPE_RISCV_IOMMU,
> +    .parent = TYPE_DEVICE,
> +    .instance_size = sizeof(RISCVIOMMUState),
> +    .class_init = riscv_iommu_class_init,
> +};
> +
> +static const char *IOMMU_FLAG_STR[] = {
> +    "NA",
> +    "RO",
> +    "WR",
> +    "RW",
> +};
> +
> +/* RISC-V IOMMU Memory Region - Address Translation Space */
> +static IOMMUTLBEntry riscv_iommu_memory_region_translate(
> +    IOMMUMemoryRegion *iommu_mr, hwaddr addr,
> +    IOMMUAccessFlags flag, int iommu_idx)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +    RISCVIOMMUContext *ctx;
> +    void *ref;
> +    IOMMUTLBEntry iotlb = {
> +        .iova = addr,
> +        .target_as = as->iommu->target_as,
> +        .addr_mask = ~0ULL,
> +        .perm = flag,
> +    };
> +
> +    ctx = riscv_iommu_ctx(as->iommu, as->devid, iommu_idx, &ref);
> +    if (ctx == NULL) {
> +        /* Translation disabled or invalid. */
> +        iotlb.addr_mask = 0;
> +        iotlb.perm = IOMMU_NONE;
> +    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
> +        /* Translation disabled or fault reported. */
> +        iotlb.addr_mask = 0;
> +        iotlb.perm = IOMMU_NONE;
> +    }
> +
> +    /* Trace all dma translations with original access flags. */
> +    trace_riscv_iommu_dma(as->iommu->parent_obj.id, PCI_BUS_NUM(as->devid),
> +                          PCI_SLOT(as->devid), PCI_FUNC(as->devid), iommu_idx,
> +                          IOMMU_FLAG_STR[flag & IOMMU_RW], iotlb.iova,
> +                          iotlb.translated_addr);
> +
> +    riscv_iommu_ctx_put(as->iommu, ref);
> +
> +    return iotlb;
> +}
> +
> +static int riscv_iommu_memory_region_notify(
> +    IOMMUMemoryRegion *iommu_mr, IOMMUNotifierFlag old,
> +    IOMMUNotifierFlag new, Error **errp)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +
> +    if (old == IOMMU_NOTIFIER_NONE) {
> +        as->notifier = true;
> +        trace_riscv_iommu_notifier_add(iommu_mr->parent_obj.name);
> +    } else if (new == IOMMU_NOTIFIER_NONE) {
> +        as->notifier = false;
> +        trace_riscv_iommu_notifier_del(iommu_mr->parent_obj.name);
> +    }
> +
> +    return 0;
> +}
> +
> +static inline bool pci_is_iommu(PCIDevice *pdev)
> +{
> +    return pci_get_word(pdev->config + PCI_CLASS_DEVICE) == 0x0806;
> +}
> +
> +static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void *opaque, int devfn)
> +{
> +    RISCVIOMMUState *s = (RISCVIOMMUState *) opaque;
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> +    AddressSpace *as = NULL;
> +
> +    if (pdev && pci_is_iommu(pdev)) {
> +        return s->target_as;
> +    }
> +
> +    /* Find first registered IOMMU device */
> +    while (s->iommus.le_prev) {
> +        s = *(s->iommus.le_prev);
> +    }
> +
> +    /* Find first matching IOMMU */
> +    while (s != NULL && as == NULL) {
> +        as = riscv_iommu_space(s, PCI_BUILD_BDF(pci_bus_num(bus), devfn));
> +        s = s->iommus.le_next;
> +    }
> +
> +    return as ? as : &address_space_memory;
> +}
> +
> +static const PCIIOMMUOps riscv_iommu_ops = {
> +    .get_address_space = riscv_iommu_find_as,
> +};
> +
> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
> +        Error **errp)
> +{
> +    if (bus->iommu_ops &&
> +        bus->iommu_ops->get_address_space == riscv_iommu_find_as) {
> +        /* Allow multiple IOMMUs on the same PCIe bus, link known devices */
> +        RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque;
> +        QLIST_INSERT_AFTER(last, iommu, iommus);
> +    } else if (!bus->iommu_ops && !bus->iommu_opaque) {
> +        pci_setup_iommu(bus, &riscv_iommu_ops, iommu);

We use the Designware PCIe host with the RISC-V IOMMU internally, and we
would like to point out a problem.

Both hw/riscv/riscv-iommu.c and hw/pci-host/designware.c call
pci_setup_iommu(). When hw/riscv/riscv-iommu.c calls it, the iommu_ops
set by the Designware PCIe host are overwritten, so the host's own
translation logic is lost and addresses are translated incorrectly.

I think a better option may be to expose a memory region property in
each PCIe host that specifies the target memory region to which the
host sends its requests. That way, the Designware PCIe host can finish
its own translation and then direct the request to the IOMMU memory
region of the RISC-V IOMMU.

The code below, based on riscv_iommu_v3, exposes a target memory region
property in the Designware PCIe host and directs inbound requests to
that region, which can be set to the IOMMU memory region of the RISC-V
IOMMU.

diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
index c25d50f1c6..6b6d4ac1aa 100644
--- a/hw/pci-host/designware.c
+++ b/hw/pci-host/designware.c
@@ -435,7 +435,7 @@ static void designware_pcie_root_realize(PCIDevice *dev, Error **errp)
          viewport->cr[0]   = DESIGNWARE_PCIE_ATU_TYPE_MEM;

          source      = &host->pci.address_space_root;
-        destination = get_system_memory();
+        destination = host->target_mr;
          direction   = "Inbound";

          /*
@@ -713,6 +713,10 @@ static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
                         "pcie-bus-address-space");
      pci_setup_iommu(pci->bus, &designware_iommu_ops, s);

+    if (!s->target_mr) {
+        s->target_mr = get_system_memory();
+    }
+
      qdev_realize(DEVICE(&s->root), BUS(pci->bus), &error_fatal);
  }

@@ -730,6 +734,12 @@ static const VMStateDescription vmstate_designware_pcie_host = {
      }
  };

+static Property designware_pcie_host_properties[] = {
+    DEFINE_PROP_LINK("target-mr", DesignwarePCIEHost, target_mr,
+                     TYPE_MEMORY_REGION, MemoryRegion *),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
  static void designware_pcie_host_class_init(ObjectClass *klass, void *data)
  {
      DeviceClass *dc = DEVICE_CLASS(klass);
@@ -740,6 +750,7 @@ static void designware_pcie_host_class_init(ObjectClass *klass, void *data)
      set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
      dc->fw_name = "pci";
      dc->vmsd = &vmstate_designware_pcie_host;
+    device_class_set_props(dc, designware_pcie_host_properties);
  }

  static void designware_pcie_host_init(Object *obj)
diff --git a/include/hw/pci-host/designware.h b/include/hw/pci-host/designware.h
index 908f3d946b..2530eacbb0 100644
--- a/include/hw/pci-host/designware.h
+++ b/include/hw/pci-host/designware.h
@@ -91,6 +91,7 @@ struct DesignwarePCIEHost {
      } pci;

      MemoryRegion mmio;
+    MemoryRegion *target_mr;
  };

  #endif /* DESIGNWARE_H */
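
To make the intent of the target-mr property concrete, here is a toy model of chained translation. It is a sketch only: `Stage`, `host_atu`, `toy_iommu` and `resolve` are hypothetical names with made-up address offsets, and none of this is QEMU code. It only illustrates that the host finishes its own translation before handing the request to whatever target it was given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One translation stage in the toy model; not a QEMU type. */
typedef struct Stage {
    uint64_t (*translate)(uint64_t addr);
    const struct Stage *target;   /* next stage, or NULL for system memory */
} Stage;

/* Toy host-bridge window: bus address 0x10000000 maps to 0x80000000. */
static uint64_t host_atu(uint64_t addr)
{
    return addr - 0x10000000ULL + 0x80000000ULL;
}

/* Toy IOMMU: a fixed offset stands in for a page-table walk. */
static uint64_t toy_iommu(uint64_t addr)
{
    return addr + 0x100000ULL;
}

/* Apply each stage in turn until the request reaches system memory. */
static uint64_t resolve(const Stage *s, uint64_t addr)
{
    while (s != NULL) {
        addr = s->translate(addr);
        s = s->target;
    }
    return addr;
}

static const Stage iommu_stage  = { toy_iommu, NULL };
static const Stage host_direct  = { host_atu, NULL };         /* target unset */
static const Stage host_chained = { host_atu, &iommu_stage }; /* target = IOMMU */
```

In this model `host_direct` corresponds to leaving target-mr unset, so the request goes straight to system memory, while `host_chained` corresponds to pointing target-mr at the IOMMU region, so both translations are applied in order.
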

We also need to set the requester ID (BDF) in the memory transaction
attributes when sending requests to the IOMMU memory region, so that
the source endpoint can be identified: all endpoints under the
Designware PCIe host write to the same IOMMU memory region.

diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..1fc64a2d1f 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -249,8 +249,11 @@ static inline MemTxResult pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
  static inline MemTxResult pci_dma_read(PCIDevice *dev, dma_addr_t addr,
                                         void *buf, dma_addr_t len)
  {
+    MemTxAttrs attrs = {};
+    attrs.requester_id = pci_requester_id(dev);
+
      return pci_dma_rw(dev, addr, buf, len,
-                      DMA_DIRECTION_TO_DEVICE, MEMTXATTRS_UNSPECIFIED);
+                      DMA_DIRECTION_TO_DEVICE, attrs);
  }

  /**
@@ -268,8 +271,11 @@ static inline MemTxResult pci_dma_read(PCIDevice *dev, dma_addr_t addr,
  static inline MemTxResult pci_dma_write(PCIDevice *dev, dma_addr_t addr,
                                          const void *buf, dma_addr_t len)
  {
+    MemTxAttrs attrs = {};
+    attrs.requester_id = pci_requester_id(dev);
+
      return pci_dma_rw(dev, addr, (void *) buf, len,
-                      DMA_DIRECTION_FROM_DEVICE, MEMTXATTRS_UNSPECIFIED);
+                      DMA_DIRECTION_FROM_DEVICE, attrs);
  }

  #define PCI_DMA_DEFINE_LDST(_l, _s, _bits) \
@@ -313,8 +319,11 @@ PCI_DMA_DEFINE_LDST(q_be, q_be, 64);
  static inline void *pci_dma_map(PCIDevice *dev, dma_addr_t addr,
                                  dma_addr_t *plen, DMADirection dir)
  {
+    MemTxAttrs attrs = {};
+    attrs.requester_id = pci_requester_id(dev);
+
      return dma_memory_map(pci_get_address_space(dev), addr, plen, dir,
-                          MEMTXATTRS_UNSPECIFIED);
+                          attrs);
  }
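
As a reminder of what the requester ID carries, the sketch below packs and unpacks the standard 16-bit PCI bus/device/function layout. The helper names are hypothetical and are not QEMU functions; the sketch only assumes the usual 8-bit bus, 5-bit device, 3-bit function split.

```c
#include <assert.h>
#include <stdint.h>

/* Pack bus/device/function into the 16-bit PCI requester ID layout. */
static uint16_t make_requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    assert(dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* Field extractors for the same layout. */
static uint8_t rid_bus(uint16_t rid) { return (uint8_t)(rid >> 8); }
static uint8_t rid_dev(uint16_t rid) { return (rid >> 3) & 0x1f; }
static uint8_t rid_fn(uint16_t rid)  { return rid & 0x7; }
```

With every endpoint writing to one shared IOMMU region, a value such as 0x0113 (bus 1, device 2, function 3) in the transaction attributes is what would let the receiving side select the right device context.
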

We would prefer that hw/riscv/riscv-iommu.c not call pci_setup_iommu(),
to avoid conflicting iommu_ops. Do you have any suggestions on this
issue?

Thanks.

> +    } else {
> +        error_setg(errp, "can't register secondary IOMMU for PCI bus #%d",
> +            pci_bus_num(bus));
> +    }
> +}
> +
> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion *iommu_mr,
> +    MemTxAttrs attrs)
> +{
> +    return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid;
> +}
> +
> +static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion *iommu_mr)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +    return 1 << as->iommu->pasid_bits;
> +}
> +
> +static void riscv_iommu_memory_region_init(ObjectClass *klass, void *data)
> +{
> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
> +
> +    imrc->translate = riscv_iommu_memory_region_translate;
> +    imrc->notify_flag_changed = riscv_iommu_memory_region_notify;
> +    imrc->attrs_to_index = riscv_iommu_memory_region_index;
> +    imrc->num_indexes = riscv_iommu_memory_region_index_len;
> +}
> +
> +static const TypeInfo riscv_iommu_memory_region_info = {
> +    .parent = TYPE_IOMMU_MEMORY_REGION,
> +    .name = TYPE_RISCV_IOMMU_MEMORY_REGION,
> +    .class_init = riscv_iommu_memory_region_init,
> +};
> +
> +static void riscv_iommu_register_mr_types(void)
> +{
> +    type_register_static(&riscv_iommu_memory_region_info);
> +    type_register_static(&riscv_iommu_info);
> +}
> +
> +type_init(riscv_iommu_register_mr_types);
> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
> new file mode 100644
> index 0000000000..31d3907d33
> --- /dev/null
> +++ b/hw/riscv/riscv-iommu.h
> @@ -0,0 +1,141 @@
> +/*
> + * QEMU emulation of an RISC-V IOMMU
> + *
> + * Copyright (C) 2022-2023 Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_RISCV_IOMMU_STATE_H
> +#define HW_RISCV_IOMMU_STATE_H
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +
> +#include "hw/riscv/iommu.h"
> +
> +struct RISCVIOMMUState {
> +    /*< private >*/
> +    DeviceState parent_obj;
> +
> +    /*< public >*/
> +    uint32_t version;     /* Reported interface version number */
> +    uint32_t pasid_bits;  /* process identifier width */
> +    uint32_t bus;         /* PCI bus mapping for non-root endpoints */
> +
> +    uint64_t cap;         /* IOMMU supported capabilities */
> +    uint64_t fctl;        /* IOMMU enabled features */
> +
> +    bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
> +    bool enable_msi;      /* Enable MSI remapping */
> +
> +    /* IOMMU Internal State */
> +    uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */
> +
> +    dma_addr_t cq_addr;   /* Command queue base physical address */
> +    dma_addr_t fq_addr;   /* Fault/event queue base physical address */
> +    dma_addr_t pq_addr;   /* Page request queue base physical address */
> +
> +    uint32_t cq_mask;     /* Command queue index bit mask */
> +    uint32_t fq_mask;     /* Fault/event queue index bit mask */
> +    uint32_t pq_mask;     /* Page request queue index bit mask */
> +
> +    /* interrupt notifier */
> +    void (*notify)(RISCVIOMMUState *iommu, unsigned vector);
> +
> +    /* IOMMU State Machine */
> +    QemuThread core_proc; /* Background processing thread */
> +    QemuMutex core_lock;  /* Global IOMMU lock, used for cache/regs updates */
> +    QemuCond core_cond;   /* Background processing wake up signal */
> +    unsigned core_exec;   /* Processing thread execution actions */
> +
> +    /* IOMMU target address space */
> +    AddressSpace *target_as;
> +    MemoryRegion *target_mr;
> +
> +    /* MSI / MRIF access trap */
> +    AddressSpace trap_as;
> +    MemoryRegion trap_mr;
> +
> +    GHashTable *ctx_cache;          /* Device translation Context Cache */
> +
> +    /* MMIO Hardware Interface */
> +    MemoryRegion regs_mr;
> +    QemuSpin regs_lock;
> +    uint8_t *regs_rw;  /* register state (user write) */
> +    uint8_t *regs_wc;  /* write-1-to-clear mask */
> +    uint8_t *regs_ro;  /* read-only mask */
> +
> +    QLIST_ENTRY(RISCVIOMMUState) iommus;
> +    QLIST_HEAD(, RISCVIOMMUSpace) spaces;
> +};
> +
> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
> +         Error **errp);
> +
> +/* private helpers */
> +
> +/* Register helper functions */
> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s,
> +    unsigned idx, uint32_t set, uint32_t clr)
> +{
> +    uint32_t val;
> +    qemu_spin_lock(&s->regs_lock);
> +    val = ldl_le_p(s->regs_rw + idx);
> +    stl_le_p(s->regs_rw + idx, (val & ~clr) | set);
> +    qemu_spin_unlock(&s->regs_lock);
> +    return val;
> +}
> +
> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s,
> +    unsigned idx, uint32_t set)
> +{
> +    qemu_spin_lock(&s->regs_lock);
> +    stl_le_p(s->regs_rw + idx, set);
> +    qemu_spin_unlock(&s->regs_lock);
> +}
> +
> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s,
> +    unsigned idx)
> +{
> +    return ldl_le_p(s->regs_rw + idx);
> +}
> +
> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s,
> +    unsigned idx, uint64_t set, uint64_t clr)
> +{
> +    uint64_t val;
> +    qemu_spin_lock(&s->regs_lock);
> +    val = ldq_le_p(s->regs_rw + idx);
> +    stq_le_p(s->regs_rw + idx, (val & ~clr) | set);
> +    qemu_spin_unlock(&s->regs_lock);
> +    return val;
> +}
> +
> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s,
> +    unsigned idx, uint64_t set)
> +{
> +    qemu_spin_lock(&s->regs_lock);
> +    stq_le_p(s->regs_rw + idx, set);
> +    qemu_spin_unlock(&s->regs_lock);
> +}
> +
> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s,
> +    unsigned idx)
> +{
> +    return ldq_le_p(s->regs_rw + idx);
> +}
> +
> +
> +
> +#endif
> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
> new file mode 100644
> index 0000000000..42a97caffa
> --- /dev/null
> +++ b/hw/riscv/trace-events
> @@ -0,0 +1,11 @@
> +# See documentation at docs/devel/tracing.rst
> +
> +# riscv-iommu.c
> +riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) "%s: device attached %04x:%02x.%d"
> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" iova: 0x%"PRIx64
> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h
> new file mode 100644
> index 0000000000..8c0e3ca1f3
> --- /dev/null
> +++ b/hw/riscv/trace.h
> @@ -0,0 +1 @@
> +#include "trace/trace-hw_riscv.h"
> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h
> new file mode 100644
> index 0000000000..070ee69973
> --- /dev/null
> +++ b/include/hw/riscv/iommu.h
> @@ -0,0 +1,36 @@
> +/*
> + * QEMU emulation of an RISC-V IOMMU
> + *
> + * Copyright (C) 2022-2023 Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_RISCV_IOMMU_H
> +#define HW_RISCV_IOMMU_H
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +
> +#define TYPE_RISCV_IOMMU "riscv-iommu"
> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU)
> +typedef struct RISCVIOMMUState RISCVIOMMUState;
> +
> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr"
> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace;
> +
> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci"
> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI)
> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci;
> +
> +#endif
> diff --git a/meson.build b/meson.build
> index a9de71d450..8099d8271c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -3319,6 +3319,7 @@ if have_system
>       'hw/pci-host',
>       'hw/ppc',
>       'hw/rtc',
> +    'hw/riscv',
>       'hw/s390x',
>       'hw/scsi',
>       'hw/sd',
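
The riscv_iommu_reg_mod32()/riscv_iommu_reg_mod64() helpers in the patch
above are read-modify-write primitives: they clear the 'clr' bits, OR in
the 'set' bits, and return the value the register held before the update.
Below is a minimal standalone sketch of that update rule. It is not the
patch's code: load32_le()/store32_le() are stand-ins for QEMU's
ldl_le_p()/stl_le_p(), reg_mod32() is an illustrative name, and the
spinlock is left out.

```c
#include <stdint.h>

/* Stand-in for ldl_le_p(): byte-wise, so host endianness does not matter. */
static uint32_t load32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Stand-in for stl_le_p(). */
static void store32_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/*
 * Same update rule as riscv_iommu_reg_mod32(), without the lock:
 * clear 'clr', then set 'set', and hand back the previous value.
 */
static uint32_t reg_mod32(uint8_t *regs, unsigned idx,
                          uint32_t set, uint32_t clr)
{
    uint32_t val = load32_le(regs + idx);

    store32_le(regs + idx, (val & ~clr) | set);
    return val;
}
```

With a register holding 0xf0, reg_mod32(regs, idx, 0x1, 0x30) returns 0xf0
and leaves 0xc1 in the register. Returning the old value lets a caller
detect which bits actually changed without a second read.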



* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-06-11 10:13   ` Daniel Henrique Barboza
@ 2024-06-12  7:50     ` LIU Zhiwei
  2024-06-12 12:10       ` Daniel Henrique Barboza
  0 siblings, 1 reply; 38+ messages in thread
From: LIU Zhiwei @ 2024-06-12  7:50 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, palmer, tjeznach,
	ajones, frank.chang


On 2024/6/11 18:13, Daniel Henrique Barboza wrote:
> Hi Zhiwei,
>
> On 6/10/24 10:51 PM, LIU Zhiwei wrote:
>> Hi Daniel,
>>
>> I want to know if we can use the IOMMU and IOPMP at the same time.
>
> AFAIK we can. They're not mutually exclusive since they offer protection
> and isolation at different layers/stages.

OK. Thanks. I will dive into more details.

I see that the IOMMU and IOPMP implementations on the mailing list both
set up an IOMMU for the PCI root bus. Is that right?

Thanks,
Zhiwei

>
>
>>
>> Is the relationship between them more similar to MMU and sPMP, or to
>> MMU and PMP?
>
> I'd say MMU and PMP since the IOMMU can isolate devices regardless of
> s-mode context or not.
>
>
> Thanks,
>
> Daniel
>
>>
>> Thanks,
>> Zhiwei
>>
>> On 2024/5/24 1:39, Daniel Henrique Barboza wrote:
>>> Hi,
>>>
>>> In this new version a lot of changes were made throughout all the code,
>>> most notably on patch 3. Link for the previous version is [1].
>>>
>>> * How it was tested *
>>>
>>> This series was tested using an emulated QEMU RISC-V host booting a QEMU
>>> KVM guest, passing through an emulated e1000 network card from the host
>>> to the guest. I can provide more details (e.g. QEMU command lines) if
>>> required, just let me know. For now this cover-letter is too much of an
>>> essay as is.
>>>
>>> The Linux kernel used for tests can be found here:
>>>
>>> https://github.com/tjeznach/linux/tree/riscv_iommu_v6-rc3
>>>
>>> This is a newer version of the following work from Tomasz:
>>>
>>> https://lore.kernel.org/linux-riscv/cover.1715708679.git.tjeznach@rivosinc.com/ 
>>>
>>> ("[PATCH v5 0/7] Linux RISC-V IOMMU Support")
>>>
>>> The v5 wasn't enough for the testing being done. v6-rc3 did the trick.
>>>
>>> Note that to test this work using riscv-iommu-pci we'll need to provide
>>> the Rivos PCI ID in the command line. More details down below.
>>>
>>> * Highlights of this version *
>>>
>>> - patches removed from v2: platform driver (riscv-iommu-sys, former
>>> patch 05) and the EDU changes (patches 14 and 15). The platform driver
>>> will be sent later with a working example on the 'virt' machine,
>>> either on a newer version of this series or via a follow-up series. We
>>> already have a PoC on [2] created by Sunil. More tests are needed, so
>>> it'll be left behind for now. The EDU changes will be sent separately
>>> after I finish the doc changes that Frank cited in v2.
>>>
>>> - patch 3 contains the bulk of changes made from v2. Please give special
>>> attention to the following functions since this is entirely new code I
>>> ended up adding:
>>>   - riscv_iommu_report_fault()
>>>   - riscv_iommu_validate_device_ctx()
>>>   - riscv_iommu_update_ipsr()
>>>    Aside from these helpers most of the changes made in this patch 3 were
>>> small and localized.
>>>
>>> - Red Hat PCI ID related changes. A new patch (4) that introduces a
>>> generic RISC-V IOMMU PCI ID was added. This PCI ID was graciously given
>>> to us by Red Hat and Gerd Hoffmann from their ID space. The
>>> riscv-iommu-pci device now defaults to this PCI ID instead of Rivos PCI
>>> ID. The device was changed slightly to allow vendor-id and device-id to
>>> be set in the command-line, so it's now possible to use this reference
>>> device as another RISC-V IOMMU PCI device to ease the burden of
>>> testing/development.
>>>
>>>    To instantiate the riscv-iommu-pci device using the previous Rivos PCI
>>> ID, use the following cmd line:
>>>
>>>    -device riscv-iommu-pci,vendor-id=0x1efd,device-id=0xedf1
>>>
>>>    I'm using these options to test the series with the existing Linux RISC-V
>>> IOMMU support that uses just a Rivos ID to identify the device.
>>>
>>>
>>> Series based on alistair/riscv-to-apply.next. It's also applicable on
>>> current QEMU master. It can also be fetched from:
>>>
>>> https://gitlab.com/danielhb/qemu/-/tree/riscv_iommu_v3
>>>
>>> Patches missing reviews/acks: 3, 5, 9, 10, 11.
>>>
>>> Changes from v2 [1]:
>>> - patch 05 (hw/riscv: add riscv-iommu-sys platform device): dropped
>>>    - will be reintroduced in a later review or as a follow-up series
>>>
>>> - patches 14 and 15: dropped
>>>    - will be sent separately
>>>
>>> - patches 2, 3, 4 and 5:
>>>    - removed all 'Ziommu' references
>>>
>>> - patch 2:
>>>    - added extra bits that patch 3 ended up using
>>>
>>> - patch 3:
>>>    - fixed blank line at EOF in hw/riscv/trace.h
>>>    - added a riscv_iommu_report_fault() helper to report faults. The helper
>>>      checks if a given fault is eligible to be reported if DTF is 1
>>>    - Use riscv_iommu_report_fault() in riscv_iommu_ctx() and
>>>      riscv_iommu_translate() to avoid code repetition
>>>    - added a riscv_iommu_validate_device_ctx() helper to validate the device
>>>      context as specified in "Device configuration checks" section. This
>>>      helper is being used in riscv_iommu_ctx_fetch()
>>>    - added a new riscv_iommu_update_ipsr() helper to handle IPSR updates
>>>      in riscv_iommu_mmio_write()
>>>    - riscv_iommmu_msi_write() now reports a fault in all error paths
>>>    - check for fctl.WSI before issuing a MSI interrupt in
>>>      riscv_iommu_notify()
>>>    - change riscv-iommu region name to 'riscv-iommu'
>>>    - change address_space_init() name for PCI devices to 'name' instead of
>>>      using TYPE_RISCV_IOMMU_PCI
>>>    - changed riscv_iommu_mmio_ops min_access_size to 4
>>>    - do not check for min and max sizes on riscv_iommu_mmio_write()
>>>    - changed riscv_iommu_trap_ops  min_access_size to 4
>>>    - removed IOMMU qemu_thread thread:
>>>      - riscv_iommu_mmio_write() will now execute a riscv_iommu_process_fn
>>>        by holding 'core_lock'
>>>    - init FSCR as zero explicitly
>>>    - check for bus->iommu_opaque == NULL before calling pci_setup_iommu()
>>>
>>> - patch 4 (new):
>>>    - add Red-Hat PCI RISC-V IOMMU ID
>>>
>>> - patch 5 (former 4):
>>>    - create vendor-id and device-id properties
>>>    - set Red-hat PCI RISC-V IOMMU ID as default ID
>>>
>>> - patch 8:
>>>    - use IOMMU_NONE instead of '0' in relevant 'iot->perm = 0' instances
>>>
>>> - patch 9:
>>>    - add s-stage and g-stage steps in riscv_iommu_validate_device_ctx()
>>>    - removed 'gpa' boolean from riscv_iommu_spa_fetch()
>>>    - 'en_s' is no longer used for early MSI address match
>>>
>>> - patch 10:
>>>    - add ATS steps in riscv_iommu_validate_device_ctx()
>>>    - check for 's->enable_ats' before adding RISCV_IOMMU_DC_TC_EN_ATS in
>>>      device context
>>>    - check for 's->enable_ats' before processing ATS commands in
>>>      riscv_iommu_process_cq_tail()
>>>    - remove ambiguous trace_riscv_iommu_ats() from riscv_iommu_translate()
>>>
>>> - patch 11:
>>>    - removed unused bits
>>>    - added RISCV_IOMMU_TR_REQ_CTL_NW and RISCV_IOMMU_TR_RESPONSE_S
>>>      bits
>>>    - set IOMMUTLBEntry 'perm' using RISCV_IOMMU_TR_REQ_CTL_NW in
>>>      riscv_iommu_process_dbg()
>>>    - clear RISCV_IOMMU_TR_RESPONSE_S in riscv_iommu_process_dbg(). Added a
>>>      comment talking about the (lack of) superpage support
>>>
>>> [1] https://lore.kernel.org/qemu-riscv/20240307160319.675044-1-dbarboza@ventanamicro.com/
>>> [2] https://github.com/vlsunil/qemu/commits/acpi_rimt_poc_v1/
>>>
>>> Andrew Jones (1):
>>>    hw/riscv/riscv-iommu: Add another irq for mrif notifications
>>>
>>> Daniel Henrique Barboza (3):
>>>    pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device
>>>    test/qtest: add riscv-iommu-pci tests
>>>    qtest/riscv-iommu-test: add init queues test
>>>
>>> Tomasz Jeznach (9):
>>>    exec/memtxattr: add process identifier to the transaction attributes
>>>    hw/riscv: add riscv-iommu-bits.h
>>>    hw/riscv: add RISC-V IOMMU base emulation
>>>    hw/riscv: add riscv-iommu-pci reference device
>>>    hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug
>>>    hw/riscv/riscv-iommu: add Address Translation Cache (IOATC)
>>>    hw/riscv/riscv-iommu: add s-stage and g-stage support
>>>    hw/riscv/riscv-iommu: add ATS support
>>>    hw/riscv/riscv-iommu: add DBG support
>>>
>>>   docs/specs/pci-ids.rst           |    2 +
>>>   hw/riscv/Kconfig                 |    4 +
>>>   hw/riscv/meson.build             |    1 +
>>>   hw/riscv/riscv-iommu-bits.h      |  416 ++++++
>>>   hw/riscv/riscv-iommu-pci.c       |  177 +++
>>>   hw/riscv/riscv-iommu.c           | 2283 ++++++++++++++++++++++++++++++
>>>   hw/riscv/riscv-iommu.h           |  146 ++
>>>   hw/riscv/trace-events            |   15 +
>>>   hw/riscv/trace.h                 |    1 +
>>>   hw/riscv/virt.c                  |   33 +-
>>>   include/exec/memattrs.h          |    5 +
>>>   include/hw/pci/pci.h             |    1 +
>>>   include/hw/riscv/iommu.h         |   36 +
>>>   meson.build                      |    1 +
>>>   tests/qtest/libqos/meson.build   |    4 +
>>>   tests/qtest/libqos/riscv-iommu.c |   76 +
>>>   tests/qtest/libqos/riscv-iommu.h |  100 ++
>>>   tests/qtest/meson.build          |    1 +
>>>   tests/qtest/riscv-iommu-test.c   |  234 +++
>>>   19 files changed, 3535 insertions(+), 1 deletion(-)
>>>   create mode 100644 hw/riscv/riscv-iommu-bits.h
>>>   create mode 100644 hw/riscv/riscv-iommu-pci.c
>>>   create mode 100644 hw/riscv/riscv-iommu.c
>>>   create mode 100644 hw/riscv/riscv-iommu.h
>>>   create mode 100644 hw/riscv/trace-events
>>>   create mode 100644 hw/riscv/trace.h
>>>   create mode 100644 include/hw/riscv/iommu.h
>>>   create mode 100644 tests/qtest/libqos/riscv-iommu.c
>>>   create mode 100644 tests/qtest/libqos/riscv-iommu.h
>>>   create mode 100644 tests/qtest/riscv-iommu-test.c
>>>



* Re: [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation
  2024-06-11 16:15   ` Jason Chien
@ 2024-06-12  9:53     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-12  9:53 UTC (permalink / raw)
  To: Jason Chien, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Sebastien Boeuf,
	Peter Maydell, Andrey Smirnov, Cedric Le Goater

Hi Jason,

(CCing designware folks and Cedric)

On 6/11/24 1:15 PM, Jason Chien wrote:
> Hi Daniel,
> 
> On 2024/5/24 上午 01:39, Daniel Henrique Barboza wrote:
>> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>>
>> The RISC-V IOMMU specification is now ratified as per the RISC-V
>> International process. The latest frozen specification can be found
>> at:
>>
>> https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf
>>
>> Add the foundation of the device emulation for RISC-V IOMMU, which
>> includes an IOMMU that has no capabilities but MSI interrupt support and
>> fault queue interfaces. We'll add more features incrementally in the
>> next patches.
>>
>> Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
>> Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
>> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
>> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
>> ---

[...]

>> +static const PCIIOMMUOps riscv_iommu_ops = {
>> +    .get_address_space = riscv_iommu_find_as,
>> +};
>> +
>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
>> +        Error **errp)
>> +{
>> +    if (bus->iommu_ops &&
>> +        bus->iommu_ops->get_address_space == riscv_iommu_find_as) {
>> +        /* Allow multiple IOMMUs on the same PCIe bus, link known devices */
>> +        RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque;
>> +        QLIST_INSERT_AFTER(last, iommu, iommus);
>> +    } else if (!bus->iommu_ops && !bus->iommu_opaque) {
>> +        pci_setup_iommu(bus, &riscv_iommu_ops, iommu);
> 
> We use Designware PCIe with RISCV IOMMU internally and there is a problem that we would like to point out.
> 
> Both hw/riscv/riscv-iommu.c and hw/pci-host/designware.c call pci_setup_iommu(). When pci_setup_iommu() is invoked in hw/riscv/riscv-iommu.c, the iommu_ops set by the Designware PCIe host are overwritten, so the PCIe host's translation logic is lost and translation is incorrect.
> 
> I think it may be a better choice to expose a memory region property in each PCIe host that specifies the target memory region the PCIe host should send requests to. That way, the Designware PCIe host can finish its own translation and then direct the request to the IOMMU memory region of the RISC-V IOMMU.
> 
> The below code, based on riscv_iommu_v3, exposes a target memory region property in Designware PCIe and directs inbound requests to the target memory which can be specified to be the IOMMU memory region of RISCV IOMMU.
> 
> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
> index c25d50f1c6..6b6d4ac1aa 100644
> --- a/hw/pci-host/designware.c
> +++ b/hw/pci-host/designware.c
> @@ -435,7 +435,7 @@ static void designware_pcie_root_realize(PCIDevice *dev, Error **errp)
>           viewport->cr[0]   = DESIGNWARE_PCIE_ATU_TYPE_MEM;
> 
>           source      = &host->pci.address_space_root;
> -        destination = get_system_memory();
> +        destination = host->target_mr;
>           direction   = "Inbound";
> 
>           /*
> @@ -713,6 +713,10 @@ static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>                          "pcie-bus-address-space");
>       pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
> 
> +    if (!s->target_mr) {
> +        s->target_mr = get_system_memory();
> +    }
> +
>       qdev_realize(DEVICE(&s->root), BUS(pci->bus), &error_fatal);
>   }
> 
> @@ -730,6 +734,12 @@ static const VMStateDescription vmstate_designware_pcie_host = {
>       }
>   };
> 
> +static Property designware_pcie_host_properties[] = {
> +    DEFINE_PROP_LINK("target-mr", DesignwarePCIEHost, target_mr,
> +                     TYPE_MEMORY_REGION, MemoryRegion *),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
>   static void designware_pcie_host_class_init(ObjectClass *klass, void *data)
>   {
>       DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -740,6 +750,7 @@ static void designware_pcie_host_class_init(ObjectClass *klass, void *data)
>       set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>       dc->fw_name = "pci";
>       dc->vmsd = &vmstate_designware_pcie_host;
> +    device_class_set_props(dc, designware_pcie_host_properties);
>   }
> 
>   static void designware_pcie_host_init(Object *obj)
> diff --git a/include/hw/pci-host/designware.h b/include/hw/pci-host/designware.h
> index 908f3d946b..2530eacbb0 100644
> --- a/include/hw/pci-host/designware.h
> +++ b/include/hw/pci-host/designware.h
> @@ -91,6 +91,7 @@ struct DesignwarePCIEHost {
>       } pci;
> 
>       MemoryRegion mmio;
> +    MemoryRegion *target_mr;
>   };
> 
>   #endif /* DESIGNWARE_H */
> 
> We also need to specify the requester id (BDF) in the memory attribute when sending requests to the IOMMU memory region in order to distinguish the source endpoint, since all endpoints under Designware PCIe host write to the same IOMMU memory region.
> 
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index d3dd0f64b2..1fc64a2d1f 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -249,8 +249,11 @@ static inline MemTxResult pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
>   static inline MemTxResult pci_dma_read(PCIDevice *dev, dma_addr_t addr,
>                                          void *buf, dma_addr_t len)
>   {
> +    MemTxAttrs attrs = {};
> +    attrs.requester_id = pci_requester_id(dev);
> +
>       return pci_dma_rw(dev, addr, buf, len,
> -                      DMA_DIRECTION_TO_DEVICE, MEMTXATTRS_UNSPECIFIED);
> +                      DMA_DIRECTION_TO_DEVICE, attrs);
>   }
> 
>   /**
> @@ -268,8 +271,11 @@ static inline MemTxResult pci_dma_read(PCIDevice *dev, dma_addr_t addr,
>   static inline MemTxResult pci_dma_write(PCIDevice *dev, dma_addr_t addr,
>                                           const void *buf, dma_addr_t len)
>   {
> +    MemTxAttrs attrs = {};
> +    attrs.requester_id = pci_requester_id(dev);
> +
>       return pci_dma_rw(dev, addr, (void *) buf, len,
> -                      DMA_DIRECTION_FROM_DEVICE, MEMTXATTRS_UNSPECIFIED);
> +                      DMA_DIRECTION_FROM_DEVICE, attrs);
>   }
> 
>   #define PCI_DMA_DEFINE_LDST(_l, _s, _bits) \
> @@ -313,8 +319,11 @@ PCI_DMA_DEFINE_LDST(q_be, q_be, 64);
>   static inline void *pci_dma_map(PCIDevice *dev, dma_addr_t addr,
>                                   dma_addr_t *plen, DMADirection dir)
>   {
> +    MemTxAttrs attrs = {};
> +    attrs.requester_id = pci_requester_id(dev);
> +
>       return dma_memory_map(pci_get_address_space(dev), addr, plen, dir,
> -                          MEMTXATTRS_UNSPECIFIED);
> +                          attrs);
>   }
> 
> We would prefer not to call pci_setup_iommu() in hw/riscv/riscv-iommu.c, to avoid iommu_ops conflicts. Do you have any suggestions on this issue?

As far as the riscv-iommu changes go, I'm OK with adding more properties
to the device to control whether it creates its own iommu_ops (as it
does today) or skips that step because another entity might provide
them.
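
To make the idea concrete, here is a rough standalone sketch of what such
an opt-out could look like. It uses plain C stand-ins rather than the QEMU
types, and everything in it is hypothetical: 'install_ops', Bus, Iommu,
IommuOps and iommu_attach() are illustration-only names, not existing QEMU
properties or APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PCIIOMMUOps. */
typedef struct IommuOps {
    const char *owner;
} IommuOps;

/* Hypothetical stand-in for the iommu_ops/iommu_opaque pair of a PCIBus. */
typedef struct Bus {
    const IommuOps *iommu_ops;
    void *iommu_opaque;
} Bus;

/* Hypothetical stand-in for RISCVIOMMUState with the new property. */
typedef struct Iommu {
    bool install_ops;   /* assumed property; true keeps today's behavior */
} Iommu;

static const IommuOps riscv_ops = { "riscv-iommu" };

/*
 * Returns true if this IOMMU installed its ops on the bus, false if the
 * bus was left untouched (either by request or because ops already exist).
 */
static bool iommu_attach(Iommu *iommu, Bus *bus)
{
    if (!iommu->install_ops) {
        return false;   /* another entity, e.g. the PCIe host, provides ops */
    }
    if (bus->iommu_ops || bus->iommu_opaque) {
        return false;   /* never overwrite ops that are already registered */
    }
    bus->iommu_ops = &riscv_ops;
    bus->iommu_opaque = iommu;
    return true;
}
```

With install_ops set to false, a bus that already carries the pci-host ops
is left exactly as it was, which is the case Jason describes.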

I can't comment much on the Designware changes. They seem sensible to me, but
I'm not acquainted with how the Designware pci-host works. I couldn't find
the docs for it either in a (lazy) search I just did.

I'd appreciate it if someone more knowledgeable about the Designware device
could comment.


Thanks,

Daniel


> 
> Thanks.
> 
>> +    } else {
>> +        error_setg(errp, "can't register secondary IOMMU for PCI bus #%d",
>> +            pci_bus_num(bus));
>> +    }
>> +}
>> +
>> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion *iommu_mr,
>> +    MemTxAttrs attrs)
>> +{
>> +    return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid;
>> +}
>> +
>> +static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion *iommu_mr)
>> +{
>> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
>> +    return 1 << as->iommu->pasid_bits;
>> +}
>> +
>> +static void riscv_iommu_memory_region_init(ObjectClass *klass, void *data)
>> +{
>> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
>> +
>> +    imrc->translate = riscv_iommu_memory_region_translate;
>> +    imrc->notify_flag_changed = riscv_iommu_memory_region_notify;
>> +    imrc->attrs_to_index = riscv_iommu_memory_region_index;
>> +    imrc->num_indexes = riscv_iommu_memory_region_index_len;
>> +}
>> +
>> +static const TypeInfo riscv_iommu_memory_region_info = {
>> +    .parent = TYPE_IOMMU_MEMORY_REGION,
>> +    .name = TYPE_RISCV_IOMMU_MEMORY_REGION,
>> +    .class_init = riscv_iommu_memory_region_init,
>> +};
>> +
>> +static void riscv_iommu_register_mr_types(void)
>> +{
>> +    type_register_static(&riscv_iommu_memory_region_info);
>> +    type_register_static(&riscv_iommu_info);
>> +}
>> +
>> +type_init(riscv_iommu_register_mr_types);
>> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
>> new file mode 100644
>> index 0000000000..31d3907d33
>> --- /dev/null
>> +++ b/hw/riscv/riscv-iommu.h
>> @@ -0,0 +1,141 @@
>> +/*
>> + * QEMU emulation of an RISC-V IOMMU
>> + *
>> + * Copyright (C) 2022-2023 Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_RISCV_IOMMU_STATE_H
>> +#define HW_RISCV_IOMMU_STATE_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +
>> +#include "hw/riscv/iommu.h"
>> +
>> +struct RISCVIOMMUState {
>> +    /*< private >*/
>> +    DeviceState parent_obj;
>> +
>> +    /*< public >*/
>> +    uint32_t version;     /* Reported interface version number */
>> +    uint32_t pasid_bits;  /* process identifier width */
>> +    uint32_t bus;         /* PCI bus mapping for non-root endpoints */
>> +
>> +    uint64_t cap;         /* IOMMU supported capabilities */
>> +    uint64_t fctl;        /* IOMMU enabled features */
>> +
>> +    bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
>> +    bool enable_msi;      /* Enable MSI remapping */
>> +
>> +    /* IOMMU Internal State */
>> +    uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */
>> +
>> +    dma_addr_t cq_addr;   /* Command queue base physical address */
>> +    dma_addr_t fq_addr;   /* Fault/event queue base physical address */
>> +    dma_addr_t pq_addr;   /* Page request queue base physical address */
>> +
>> +    uint32_t cq_mask;     /* Command queue index bit mask */
>> +    uint32_t fq_mask;     /* Fault/event queue index bit mask */
>> +    uint32_t pq_mask;     /* Page request queue index bit mask */
>> +
>> +    /* interrupt notifier */
>> +    void (*notify)(RISCVIOMMUState *iommu, unsigned vector);
>> +
>> +    /* IOMMU State Machine */
>> +    QemuThread core_proc; /* Background processing thread */
>> +    QemuMutex core_lock;  /* Global IOMMU lock, used for cache/regs updates */
>> +    QemuCond core_cond;   /* Background processing wake up signal */
>> +    unsigned core_exec;   /* Processing thread execution actions */
>> +
>> +    /* IOMMU target address space */
>> +    AddressSpace *target_as;
>> +    MemoryRegion *target_mr;
>> +
>> +    /* MSI / MRIF access trap */
>> +    AddressSpace trap_as;
>> +    MemoryRegion trap_mr;
>> +
>> +    GHashTable *ctx_cache;          /* Device translation Context Cache */
>> +
>> +    /* MMIO Hardware Interface */
>> +    MemoryRegion regs_mr;
>> +    QemuSpin regs_lock;
>> +    uint8_t *regs_rw;  /* register state (user write) */
>> +    uint8_t *regs_wc;  /* write-1-to-clear mask */
>> +    uint8_t *regs_ro;  /* read-only mask */
>> +
>> +    QLIST_ENTRY(RISCVIOMMUState) iommus;
>> +    QLIST_HEAD(, RISCVIOMMUSpace) spaces;
>> +};
>> +
>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
>> +         Error **errp);
>> +
>> +/* private helpers */
>> +
>> +/* Register helper functions */
>> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s,
>> +    unsigned idx, uint32_t set, uint32_t clr)
>> +{
>> +    uint32_t val;
>> +    qemu_spin_lock(&s->regs_lock);
>> +    val = ldl_le_p(s->regs_rw + idx);
>> +    stl_le_p(s->regs_rw + idx, (val & ~clr) | set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +    return val;
>> +}
>> +
>> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s,
>> +    unsigned idx, uint32_t set)
>> +{
>> +    qemu_spin_lock(&s->regs_lock);
>> +    stl_le_p(s->regs_rw + idx, set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +}
>> +
>> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s,
>> +    unsigned idx)
>> +{
>> +    return ldl_le_p(s->regs_rw + idx);
>> +}
>> +
>> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s,
>> +    unsigned idx, uint64_t set, uint64_t clr)
>> +{
>> +    uint64_t val;
>> +    qemu_spin_lock(&s->regs_lock);
>> +    val = ldq_le_p(s->regs_rw + idx);
>> +    stq_le_p(s->regs_rw + idx, (val & ~clr) | set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +    return val;
>> +}
>> +
>> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s,
>> +    unsigned idx, uint64_t set)
>> +{
>> +    qemu_spin_lock(&s->regs_lock);
>> +    stq_le_p(s->regs_rw + idx, set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +}
>> +
>> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s,
>> +    unsigned idx)
>> +{
>> +    return ldq_le_p(s->regs_rw + idx);
>> +}
>> +
>> +
>> +
>> +#endif
>> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
>> new file mode 100644
>> index 0000000000..42a97caffa
>> --- /dev/null
>> +++ b/hw/riscv/trace-events
>> @@ -0,0 +1,11 @@
>> +# See documentation at docs/devel/tracing.rst
>> +
>> +# riscv-iommu.c
>> +riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) "%s: device attached %04x:%02x.%d"
>> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" iova: 0x%"PRIx64
>> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
>> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
>> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
>> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
>> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
>> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
>> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h
>> new file mode 100644
>> index 0000000000..8c0e3ca1f3
>> --- /dev/null
>> +++ b/hw/riscv/trace.h
>> @@ -0,0 +1 @@
>> +#include "trace/trace-hw_riscv.h"
>> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h
>> new file mode 100644
>> index 0000000000..070ee69973
>> --- /dev/null
>> +++ b/include/hw/riscv/iommu.h
>> @@ -0,0 +1,36 @@
>> +/*
>> + * QEMU emulation of a RISC-V IOMMU
>> + *
>> + * Copyright (C) 2022-2023 Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_RISCV_IOMMU_H
>> +#define HW_RISCV_IOMMU_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +
>> +#define TYPE_RISCV_IOMMU "riscv-iommu"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU)
>> +typedef struct RISCVIOMMUState RISCVIOMMUState;
>> +
>> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr"
>> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace;
>> +
>> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI)
>> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci;
>> +
>> +#endif
>> diff --git a/meson.build b/meson.build
>> index a9de71d450..8099d8271c 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -3319,6 +3319,7 @@ if have_system
>>       'hw/pci-host',
>>       'hw/ppc',
>>       'hw/rtc',
>> +    'hw/riscv',
>>       'hw/s390x',
>>       'hw/scsi',
>>       'hw/sd',



* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-06-12  7:50     ` LIU Zhiwei
@ 2024-06-12 12:10       ` Daniel Henrique Barboza
  2024-06-14 13:22         ` LIU Zhiwei
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-12 12:10 UTC (permalink / raw)
  To: LIU Zhiwei, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, palmer, tjeznach,
	ajones, frank.chang



On 6/12/24 4:50 AM, LIU Zhiwei wrote:
> 
> On 2024/6/11 18:13, Daniel Henrique Barboza wrote:
>> Hi Zhiwei,
>>
>> On 6/10/24 10:51 PM, LIU Zhiwei wrote:
>>> Hi Daniel,
>>>
>>> I want to know if we can use the IOMMU and IOPMP at the same time.
>>
>> AFAIK we can. They're not mutually exclusive since they offer protection
>> and isolation at different layers/stages.
> 
> OK. Thanks. I will dive into more details.
> 
> I see the IOMMU and IOPMP implementations on the mailing list both set the IOMMU for the PCI root bus.
> Is that right?

For now the riscv-iommu-pci device must be placed at a root bus for the
sake of simplicity. We'll want to lift this restriction in the future as
the support matures.


Thanks,

Daniel

> 
> Thanks,
> Zhiwei
> 
>>
>>
>>>
>>> Is the relationship between them more similar to that of MMU and sPMP, or of MMU and PMP?
>>
>> I'd say MMU and PMP since the IOMMU can isolate devices regardless of
>> s-mode context or not.
>>
>>
>> Thanks,
>>
>> Daniel
>>
>>>
>>> Thanks,
>>> Zhiwei
>>>



* Re: [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support
  2024-06-12 12:10       ` Daniel Henrique Barboza
@ 2024-06-14 13:22         ` LIU Zhiwei
  0 siblings, 0 replies; 38+ messages in thread
From: LIU Zhiwei @ 2024-06-14 13:22 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, Ethan Chen
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, palmer, tjeznach,
	ajones, frank.chang


On 2024/6/12 20:10, Daniel Henrique Barboza wrote:
>
>
> On 6/12/24 4:50 AM, LIU Zhiwei wrote:
>>
>> On 2024/6/11 18:13, Daniel Henrique Barboza wrote:
>>> Hi Zhiwei,
>>>
>>> On 6/10/24 10:51 PM, LIU Zhiwei wrote:
>>>> Hi Daniel,
>>>>
>>>> I want to know if we can use the IOMMU and IOPMP at the same time.
>>>
>>> AFAIK we can. They're not mutually exclusive since they offer protection
>>> and isolation at different layers/stages.
>>
>> OK. Thanks. I will dive into more details.
>>
>> I see the IOMMU and IOPMP implementations on the mailing list both set the IOMMU for the PCI root bus.
>> Is that right?
>
> For now the riscv-iommu-pci device must be placed at a root bus for the
> sake of simplicity. 
Agree.
> We'll want to lift this restriction in the future as
> the support matures.

I think it's OK if it only belongs to the virt machine.

If we support both IOPMP and IOMMU for PCI, I think we should call
pci_setup_iommu() only once and use the same address-space lookup
function for both of them.

At least for now, I can't find a reason why they couldn't share the same
address-space lookup function.
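
To make the idea concrete, here is a rough standalone sketch. It is not the QEMU API: Bus, AddrSpace, SharedState, bus_setup_once() and shared_find_as() are all hypothetical names, and the only assumption taken from the discussion is that a bus holds a single lookup hook. The hook is installed once, and every later caller gets the same address space back for the same devfn.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; this is not the QEMU API. */
typedef struct AddrSpace {
    int devfn;
    int in_use;
} AddrSpace;

typedef AddrSpace *(*FindASFn)(void *opaque, int devfn);

typedef struct Bus {
    FindASFn find_as;   /* a bus holds exactly one lookup hook */
    void *opaque;
} Bus;

typedef struct SharedState {
    AddrSpace spaces[8];    /* small fixed table, enough for a sketch */
} SharedState;

/* One lookup for both models: same devfn always maps to the same space. */
static AddrSpace *shared_find_as(void *opaque, int devfn)
{
    SharedState *st = opaque;
    AddrSpace *free_slot = NULL;

    for (size_t i = 0; i < 8; i++) {
        AddrSpace *as = &st->spaces[i];
        if (as->in_use && as->devfn == devfn) {
            return as;
        }
        if (!as->in_use && free_slot == NULL) {
            free_slot = as;
        }
    }
    if (free_slot != NULL) {
        free_slot->in_use = 1;
        free_slot->devfn = devfn;
    }
    return free_slot;   /* NULL when the table is full */
}

/* Install the hook once; a second caller reuses what is already there. */
static int bus_setup_once(Bus *bus, FindASFn fn, void *opaque)
{
    if (bus->find_as != NULL) {
        return 0;       /* already installed, nothing to do */
    }
    bus->find_as = fn;
    bus->opaque = opaque;
    return 1;           /* installed by this call */
}
```

How the IOPMP check and the IOMMU translation would be chained behind that single address space is left open here.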

Thanks,
Zhiwei

>
>
> Thanks,
>
> Daniel
>
>>
>> Thanks,
>> Zhiwei
>>
>>>
>>>
>>>>
>>>> Is the relationship between them more similar to that of MMU and sPMP,
>>>> or of MMU and PMP?
>>>
>>> I'd say MMU and PMP since the IOMMU can isolate devices regardless of
>>> s-mode context or not.
>>>
>>>
>>> Thanks,
>>>
>>> Daniel
>>>
>>>>
>>>> Thanks,
>>>> Zhiwei
>>>>



* Re: [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation
  2024-05-23 17:39 ` [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation Daniel Henrique Barboza
  2024-05-30  1:39   ` Eric Cheng
  2024-06-11 16:15   ` Jason Chien
@ 2024-06-18 10:06   ` Jason Chien
  2024-06-18 15:15     ` Jason Chien
  2 siblings, 1 reply; 38+ messages in thread
From: Jason Chien @ 2024-06-18 10:06 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Sebastien Boeuf

Hi Daniel,

On 2024/5/24 01:39 AM, Daniel Henrique Barboza wrote:
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> The RISC-V IOMMU specification is now ratified as per the RISC-V
> International process. The latest frozen specification can be found
> at:
>
> https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf
>
> Add the foundation of the device emulation for RISC-V IOMMU, which
> includes an IOMMU that has no capabilities but MSI interrupt support and
> fault queue interfaces. We'll add more features incrementally in the
> next patches.
>
> Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---
>   hw/riscv/Kconfig         |    4 +
>   hw/riscv/meson.build     |    1 +
>   hw/riscv/riscv-iommu.c   | 1602 ++++++++++++++++++++++++++++++++++++++
>   hw/riscv/riscv-iommu.h   |  141 ++++
>   hw/riscv/trace-events    |   11 +
>   hw/riscv/trace.h         |    1 +
>   include/hw/riscv/iommu.h |   36 +
>   meson.build              |    1 +
>   8 files changed, 1797 insertions(+)
>   create mode 100644 hw/riscv/riscv-iommu.c
>   create mode 100644 hw/riscv/riscv-iommu.h
>   create mode 100644 hw/riscv/trace-events
>   create mode 100644 hw/riscv/trace.h
>   create mode 100644 include/hw/riscv/iommu.h
>
> diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
> index a2030e3a6f..f69d6e3c8e 100644
> --- a/hw/riscv/Kconfig
> +++ b/hw/riscv/Kconfig
> @@ -1,3 +1,6 @@
> +config RISCV_IOMMU
> +    bool
> +
>   config RISCV_NUMA
>       bool
>   
> @@ -47,6 +50,7 @@ config RISCV_VIRT
>       select SERIAL
>       select RISCV_ACLINT
>       select RISCV_APLIC
> +    select RISCV_IOMMU
>       select RISCV_IMSIC
>       select SIFIVE_PLIC
>       select SIFIVE_TEST
> diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
> index f872674093..cbc99c6e8e 100644
> --- a/hw/riscv/meson.build
> +++ b/hw/riscv/meson.build
> @@ -10,5 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
>   riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
>   riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
>   riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
> +riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
>   
>   hw_arch += {'riscv': riscv_ss}
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> new file mode 100644
> index 0000000000..39b4ff1405
> --- /dev/null
> +++ b/hw/riscv/riscv-iommu.c
> @@ -0,0 +1,1602 @@
> +/*
> + * QEMU emulation of a RISC-V IOMMU
> + *
> + * Copyright (C) 2021-2023, Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +#include "hw/pci/pci_bus.h"
> +#include "hw/pci/pci_device.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/riscv/riscv_hart.h"
> +#include "migration/vmstate.h"
> +#include "qapi/error.h"
> +#include "qemu/timer.h"
> +
> +#include "cpu_bits.h"
> +#include "riscv-iommu.h"
> +#include "riscv-iommu-bits.h"
> +#include "trace.h"
> +
> +#define LIMIT_CACHE_CTX               (1U << 7)
> +#define LIMIT_CACHE_IOT               (1U << 20)
> +
> +/* Physical page number conversions */
> +#define PPN_PHYS(ppn)                 ((ppn) << TARGET_PAGE_BITS)
> +#define PPN_DOWN(phy)                 ((phy) >> TARGET_PAGE_BITS)
> +
> +typedef struct RISCVIOMMUContext RISCVIOMMUContext;
> +typedef struct RISCVIOMMUEntry RISCVIOMMUEntry;
> +
> +/* Device assigned I/O address space */
> +struct RISCVIOMMUSpace {
> +    IOMMUMemoryRegion iova_mr;  /* IOVA memory region for attached device */
> +    AddressSpace iova_as;       /* IOVA address space for attached device */
> +    RISCVIOMMUState *iommu;     /* Managing IOMMU device state */
> +    uint32_t devid;             /* Requester identifier, AKA device_id */
> +    bool notifier;              /* IOMMU unmap notifier enabled */
> +    QLIST_ENTRY(RISCVIOMMUSpace) list;
> +};
> +
> +/* Device translation context state. */
> +struct RISCVIOMMUContext {
> +    uint64_t devid:24;          /* Requester Id, AKA device_id */
> +    uint64_t pasid:20;          /* Process Address Space ID */
> +    uint64_t __rfu:20;          /* reserved */
> +    uint64_t tc;                /* Translation Control */
> +    uint64_t ta;                /* Translation Attributes */
> +    uint64_t msi_addr_mask;     /* MSI filtering - address mask */
> +    uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
> +    uint64_t msiptp;            /* MSI redirection page table pointer */
> +};
> +
> +/* IOMMU index for transactions without PASID specified. */
> +#define RISCV_IOMMU_NOPASID 0
> +
> +static void riscv_iommu_notify(RISCVIOMMUState *s, int vec)
> +{
> +    const uint32_t fctl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FCTL);
> +    uint32_t ipsr, ivec;
> +
> +    if (fctl & RISCV_IOMMU_FCTL_WSI || !s->notify) {
> +        return;
> +    }
> +
> +    ipsr = riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, (1 << vec), 0);
> +    ivec = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_IVEC);
> +
> +    if (!(ipsr & (1 << vec))) {
> +        s->notify(s, (ivec >> (vec * 4)) & 0x0F);
> +    }
> +}
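
Two things happen in riscv_iommu_notify() that are easy to miss: the pending bit is set unconditionally, but the callback only fires when the bit was not already set, and the vector number comes from a 4-bit field of IVEC selected by the interrupt source. Below is a small standalone sketch of both steps. ivec_lookup() and ipsr_set_pending() are illustrative names and not part of the patch, and the register is a plain variable rather than the locked register file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pick the 4-bit vector number assigned to interrupt source 'src'. */
static unsigned ivec_lookup(uint32_t ivec, unsigned src)
{
    assert(src < 8);    /* eight 4-bit fields fit in 32 bits */
    return (ivec >> (src * 4)) & 0x0F;
}

/*
 * Mark source 'src' pending. Returns true only on a 0 -> 1 transition,
 * which is the case where the patch calls s->notify().
 */
static bool ipsr_set_pending(uint32_t *ipsr, unsigned src)
{
    uint32_t old = *ipsr;

    *ipsr = old | (1u << src);
    return !(old & (1u << src));
}
```

For example, with ivec = 0x5a31 the sources 0, 1, 2 and 3 map to vectors 1, 3, 10 and 5 respectively.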
> +
> +static void riscv_iommu_fault(RISCVIOMMUState *s,
> +                              struct riscv_iommu_fq_record *ev)
> +{
> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQH) & s->fq_mask;
> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQT) & s->fq_mask;
> +    uint32_t next = (tail + 1) & s->fq_mask;
> +    uint32_t devid = get_field(ev->hdr, RISCV_IOMMU_FQ_HDR_DID);
> +
> +    trace_riscv_iommu_flt(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
> +                          PCI_FUNC(devid), ev->hdr, ev->iotval);
> +
> +    if (!(ctrl & RISCV_IOMMU_FQCSR_FQON) ||
> +        !!(ctrl & (RISCV_IOMMU_FQCSR_FQOF | RISCV_IOMMU_FQCSR_FQMF))) {
> +        return;
> +    }
> +
> +    if (head == next) {
> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
> +                              RISCV_IOMMU_FQCSR_FQOF, 0);
> +    } else {
> +        dma_addr_t addr = s->fq_addr + tail * sizeof(*ev);
> +        if (dma_memory_write(s->target_as, addr, ev, sizeof(*ev),
> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
> +                                  RISCV_IOMMU_FQCSR_FQMF, 0);
> +        } else {
> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_FQT, next);
> +        }
> +    }
> +
> +    if (ctrl & RISCV_IOMMU_FQCSR_FIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_FQ);
> +    }
> +}
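
The queue handling in riscv_iommu_fault() is a power-of-two ring where one slot is always left empty: the queue counts as full when advancing the tail would make it equal to the head. Here is a minimal standalone sketch of that producer side. It assumes a fixed 4-entry ring in local memory instead of guest memory behind dma_memory_write(), and Ring/ring_push() are illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 4u     /* must be a power of two */

typedef struct Ring {
    uint64_t slots[RING_ENTRIES];
    uint32_t head;          /* consumer index, advanced by the driver */
    uint32_t tail;          /* producer index, advanced by the device */
    bool overflow;          /* sticky, in the spirit of FQCSR.FQOF */
} Ring;

/* Append one record; returns false if the record was dropped. */
static bool ring_push(Ring *r, uint64_t rec)
{
    const uint32_t mask = RING_ENTRIES - 1;
    uint32_t head = r->head & mask;
    uint32_t tail = r->tail & mask;
    uint32_t next = (tail + 1) & mask;

    if (r->overflow) {
        return false;       /* stay quiet until software clears the flag */
    }
    if (head == next) {
        r->overflow = true; /* full: one slot is always kept empty */
        return false;
    }
    r->slots[tail] = rec;
    r->tail = next;
    return true;
}
```

A 4-entry ring therefore holds at most 3 records; the fourth push sets the overflow flag and is dropped until the consumer advances the head and clears the flag.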
> +
> +static void riscv_iommu_pri(RISCVIOMMUState *s,
> +    struct riscv_iommu_pq_record *pr)
> +{
> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQH) & s->pq_mask;
> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQT) & s->pq_mask;
> +    uint32_t next = (tail + 1) & s->pq_mask;
> +    uint32_t devid = get_field(pr->hdr, RISCV_IOMMU_PREQ_HDR_DID);
> +
> +    trace_riscv_iommu_pri(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
> +                          PCI_FUNC(devid), pr->payload);
> +
> +    if (!(ctrl & RISCV_IOMMU_PQCSR_PQON) ||
> +        !!(ctrl & (RISCV_IOMMU_PQCSR_PQOF | RISCV_IOMMU_PQCSR_PQMF))) {
> +        return;
> +    }
> +
> +    if (head == next) {
> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
> +                              RISCV_IOMMU_PQCSR_PQOF, 0);
> +    } else {
> +        dma_addr_t addr = s->pq_addr + tail * sizeof(*pr);
> +        if (dma_memory_write(s->target_as, addr, pr, sizeof(*pr),
> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
> +                                  RISCV_IOMMU_PQCSR_PQMF, 0);
> +        } else {
> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_PQT, next);
> +        }
> +    }
> +
> +    if (ctrl & RISCV_IOMMU_PQCSR_PIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_PQ);
> +    }
> +}
> +
> +/* Portable implementation of pext_u64, bit-mask extraction. */
> +static uint64_t _pext_u64(uint64_t val, uint64_t ext)
> +{
> +    uint64_t ret = 0;
> +    uint64_t rot = 1;
> +
> +    while (ext) {
> +        if (ext & 1) {
> +            if (val & 1) {
> +                ret |= rot;
> +            }
> +            rot <<= 1;
> +        }
> +        val >>= 1;
> +        ext >>= 1;
> +    }
> +
> +    return ret;
> +}
> +
> +/* Check if GPA matches MSI/MRIF pattern. */
> +static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    dma_addr_t gpa)
> +{
> +    if (get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE) !=
> +        RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
> +        return false; /* Invalid MSI/MRIF mode */
> +    }
> +
> +    if ((PPN_DOWN(gpa) ^ ctx->msi_addr_pattern) & ~ctx->msi_addr_mask) {
> +        return false; /* GPA not in MSI range defined by AIA IMSIC rules. */
> +    }
> +
> +    return true;
> +}
> +
> +/* RISCV IOMMU Address Translation Lookup - Page Table Walk */
> +static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    IOMMUTLBEntry *iotlb)
> +{
> +    /* Early check for MSI address match when IOVA == GPA */
> +    if (iotlb->perm & IOMMU_WO &&
> +        riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
> +        iotlb->target_as = &s->trap_as;
> +        iotlb->translated_addr = iotlb->iova;
> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +        return 0;
> +    }
> +
> +    /* Exit early for pass-through mode. */
> +    iotlb->translated_addr = iotlb->iova;
> +    iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +    /* Allow R/W in pass-through mode */
> +    iotlb->perm = IOMMU_RW;
> +    return 0;
> +}
> +
> +static void riscv_iommu_report_fault(RISCVIOMMUState *s,
> +                                     RISCVIOMMUContext *ctx,
> +                                     uint32_t fault_type, uint32_t cause,
> +                                     bool pv,
> +                                     uint64_t iotval, uint64_t iotval2)
> +{
> +    struct riscv_iommu_fq_record ev = { 0 };
> +
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_DTF) {
> +        switch (cause) {
> +        case RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_INVALID:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED:
> +        case RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED:
> +        case RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR:
> +        case RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT:
> +            break;
> +        default:
> +            /* DTF prevents reporting a fault for this given cause */
> +            return;
> +        }
> +    }
> +
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_CAUSE, cause);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_TTYPE, fault_type);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_DID, ctx->devid);
> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PV, true);
> +
> +    if (pv) {
> +        ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PID, ctx->pasid);
> +    }
> +
> +    ev.iotval = iotval;
> +    ev.iotval2 = iotval2;
> +
> +    riscv_iommu_fault(s, &ev);
> +}
> +
> +/* Redirect MSI write for given GPA. */
> +static MemTxResult riscv_iommu_msi_write(RISCVIOMMUState *s,
> +    RISCVIOMMUContext *ctx, uint64_t gpa, uint64_t data,
> +    unsigned size, MemTxAttrs attrs)
> +{
> +    MemTxResult res;
> +    dma_addr_t addr;
> +    uint64_t intn;
> +    uint32_t n190;
> +    uint64_t pte[2];
> +    int fault_type = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
> +    int cause;
> +
> +    if (!riscv_iommu_msi_check(s, ctx, gpa)) {
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    /* Interrupt File Number */
> +    intn = _pext_u64(PPN_DOWN(gpa), ctx->msi_addr_mask);
> +    if (intn >= 256) {
> +        /* Interrupt file number out of range */
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    /* fetch MSI PTE */
> +    addr = PPN_PHYS(get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_PPN));
> +    addr = addr | (intn * sizeof(pte));
> +    res = dma_memory_read(s->target_as, addr, &pte, sizeof(pte),
> +            MEMTXATTRS_UNSPECIFIED);
> +    if (res != MEMTX_OK) {
> +        if (res == MEMTX_DECODE_ERROR) {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED;
> +        } else {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        }
> +        goto err;
> +    }
> +
> +    le64_to_cpus(&pte[0]);
> +    le64_to_cpus(&pte[1]);
> +
> +    if (!(pte[0] & RISCV_IOMMU_MSI_PTE_V) || (pte[0] & RISCV_IOMMU_MSI_PTE_C)) {
> +        /*
> +         * The spec mentions that: "If msipte.C == 1, then further
> +         * processing to interpret the PTE is implementation
> +         * defined.". We'll abort with cause = 262 for this
> +         * case too.
> +         */
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_INVALID;
> +        goto err;
> +    }
> +
> +    switch (get_field(pte[0], RISCV_IOMMU_MSI_PTE_M)) {
> +    case RISCV_IOMMU_MSI_PTE_M_BASIC:
> +        /* MSI Pass-through mode */
> +        addr = PPN_PHYS(get_field(pte[0], RISCV_IOMMU_MSI_PTE_PPN));
> +        addr = addr | (gpa & TARGET_PAGE_MASK);
> +
> +        trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
> +                              PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
> +                              gpa, addr);
> +
> +        res = dma_memory_write(s->target_as, addr, &data, size, attrs);
> +        if (res != MEMTX_OK) {
> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +            goto err;
> +        }
> +
> +        return MEMTX_OK;
> +    case RISCV_IOMMU_MSI_PTE_M_MRIF:
> +        /* MRIF mode, continue. */
> +        break;
> +    default:
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
> +        goto err;
> +    }
> +
> +    /*
> +     * Report an error for interrupt identities exceeding the maximum allowed
> +     * for an IMSIC interrupt file (2047) or destination address is not 32-bit
> +     * aligned. See IOMMU Specification, Chapter 2.3. MSI page tables.
> +     */
> +    if ((data > 2047) || (gpa & 3)) {
> +        res = MEMTX_ACCESS_ERROR;
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
> +        goto err;
> +    }
> +
> +    /* MSI MRIF mode, non atomic pending bit update */
> +
> +    /* MRIF pending bit address */
> +    addr = get_field(pte[0], RISCV_IOMMU_MSI_PTE_MRIF_ADDR) << 9;
> +    addr = addr | ((data & 0x7c0) >> 3);
> +
> +    trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
> +                          PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
> +                          gpa, addr);
> +
> +    /* MRIF pending bit mask */
> +    data = 1ULL << (data & 0x03f);
> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    intn = intn | data;
> +    res = dma_memory_write(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +        goto err;
> +    }
> +
> +    /* Get MRIF enable bits */
> +    addr = addr + sizeof(intn);
> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
> +        goto err;
> +    }
> +
> +    if (!(intn & data)) {
> +        /* notification disabled, MRIF update completed. */
> +        return MEMTX_OK;
> +    }
> +
> +    /* Send notification message */
> +    addr = PPN_PHYS(get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NPPN));
> +    n190 = get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID) |
> +          (get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID_MSB) << 10);
> +
> +    res = dma_memory_write(s->target_as, addr, &n190, sizeof(n190), attrs);
> +    if (res != MEMTX_OK) {
> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
> +        goto err;
> +    }
> +
> +    return MEMTX_OK;
> +
> +err:
> +    riscv_iommu_report_fault(s, ctx, fault_type, cause,
> +                             !!ctx->pasid, 0, 0);
> +    return res;
> +}
> +
> +/*
> + * Check device context configuration as described by the
> + * riscv-iommu spec section "Device-context configuration
> + * checks".
> + */
> +static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
> +                                            RISCVIOMMUContext *ctx)
> +{
> +    uint32_t msi_mode;
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
> +        ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
> +        return false;
> +    }
> +
> +    if (!(s->cap & RISCV_IOMMU_CAP_T2GPA) &&
> +        ctx->tc & RISCV_IOMMU_DC_TC_T2GPA) {
> +        return false;
> +    }
> +
> +    if (s->cap & RISCV_IOMMU_CAP_MSI_FLAT) {
> +        msi_mode = get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE);
> +
> +        if (msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_OFF &&
> +            msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
> +            return false;
> +        }
> +    }
> +
> +    /*
> +     * CAP_END is always zero (only one endianess). FCTL_BE is
> +     * always zero (little-endian accesses). Thus TC_SBE must
> +     * always be LE, i.e. zero.
> +     */
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_SBE) {
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +/*
> + * RISC-V IOMMU Device Context Loopkup - Device Directory Tree Walk
> + *
> + * @s         : IOMMU Device State
> + * @ctx       : Device Translation Context with devid and pasid set.
> + * @return    : success or fault code.
> + */
> +static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
> +{
> +    const uint64_t ddtp = s->ddtp;
> +    unsigned mode = get_field(ddtp, RISCV_IOMMU_DDTP_MODE);
> +    dma_addr_t addr = PPN_PHYS(get_field(ddtp, RISCV_IOMMU_DDTP_PPN));
> +    struct riscv_iommu_dc dc;
> +    /* Device Context format: 0: extended (64 bytes) | 1: base (32 bytes) */
> +    const int dc_fmt = !s->enable_msi;
> +    const size_t dc_len = sizeof(dc) >> dc_fmt;
> +    unsigned depth;
> +    uint64_t de;
> +
> +    switch (mode) {
> +    case RISCV_IOMMU_DDTP_MODE_OFF:
> +        return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
> +
> +    case RISCV_IOMMU_DDTP_MODE_BARE:
> +        /* mock up pass-through translation context */
> +        ctx->tc = RISCV_IOMMU_DC_TC_V;
> +        ctx->ta = 0;
> +        ctx->msiptp = 0;
> +        return 0;
> +
> +    case RISCV_IOMMU_DDTP_MODE_1LVL:
> +        depth = 0;
> +        break;
> +
> +    case RISCV_IOMMU_DDTP_MODE_2LVL:
> +        depth = 1;
> +        break;
> +
> +    case RISCV_IOMMU_DDTP_MODE_3LVL:
> +        depth = 2;
> +        break;
> +
> +    default:
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +    }
> +
> +    /*
> +     * Check supported device id width (in bits).
> +     * See IOMMU Specification, Chapter 6. Software guidelines.
> +     * - if extended device-context format is used:
> +     *   1LVL: 6, 2LVL: 15, 3LVL: 24
> +     * - if base device-context format is used:
> +     *   1LVL: 7, 2LVL: 16, 3LVL: 24
> +     */
> +    if (ctx->devid >= (1 << (depth * 9 + 6 + (dc_fmt && depth != 2)))) {
> +        return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
> +    }
> +
> +    /* Device directory tree walk */
> +    for (; depth-- > 0; ) {
> +        /*
> +         * Select device id index bits based on device directory tree level
> +         * and device context format.
> +         * See IOMMU Specification, Chapter 2. Data Structures.
> +         * - if extended device-context format is used:
> +         *   device index: [23:15][14:6][5:0]
> +         * - if base device-context format is used:
> +         *   device index: [23:16][15:7][6:0]
> +         */
> +        const int split = depth * 9 + 6 + dc_fmt;
> +        addr |= ((ctx->devid >> split) << 3) & ~TARGET_PAGE_MASK;
> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
> +        }
> +        le64_to_cpus(&de);
> +        if (!(de & RISCV_IOMMU_DDTE_VALID)) {
> +            /* invalid directory entry */
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
> +        }
> +        if (de & ~(RISCV_IOMMU_DDTE_PPN | RISCV_IOMMU_DDTE_VALID)) {
> +            /* reserved bits set */
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +        }
> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_DDTE_PPN));
> +    }
> +
> +    /* index into device context entry page */
> +    addr |= (ctx->devid * dc_len) & ~TARGET_PAGE_MASK;
> +
> +    memset(&dc, 0, sizeof(dc));
> +    if (dma_memory_read(s->target_as, addr, &dc, dc_len,
> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
> +    }
> +
> +    /* Set translation context. */
> +    ctx->tc = le64_to_cpu(dc.tc);
> +    ctx->ta = le64_to_cpu(dc.ta);
> +    ctx->msiptp = le64_to_cpu(dc.msiptp);
> +    ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
> +    ctx->msi_addr_pattern = le64_to_cpu(dc.msi_addr_pattern);
> +
> +    if (!riscv_iommu_validate_device_ctx(s, ctx)) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +    }
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_V)) {
> +        return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
> +    }
According to section 2.3.1 (items 9 and 10), DC.tc.V == 0 should be
checked before riscv_iommu_validate_device_ctx() is called.
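
To make the ordering concrete, here is a minimal standalone sketch, not
the patch's code. All SKETCH_* names, the bit positions and the 258/259
cause values are assumptions made for illustration only; the real
definitions are the RISCV_IOMMU_* ones in the patch. The validation
helper is reduced to the PRPR/EN_PRI rule alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the RISCV_IOMMU_DC_TC_* bits (assumed layout). */
#define SKETCH_TC_V      (1ULL << 0)
#define SKETCH_TC_EN_PRI (1ULL << 2)
#define SKETCH_TC_PRPR   (1ULL << 6)

/* Hypothetical cause codes; the numeric values are assumptions. */
enum sketch_dc_cause {
    SKETCH_DC_OK = 0,
    SKETCH_DDT_INVALID = 258,
    SKETCH_DDT_MISCONFIGURED = 259,
};

/* Reduced configuration check: PRPR set without EN_PRI is rejected. */
static bool sketch_validate_tc(uint64_t tc)
{
    return (tc & SKETCH_TC_EN_PRI) || !(tc & SKETCH_TC_PRPR);
}

/* Validity is tested first, so an invalid DC never reports "misconfigured". */
static int sketch_check_dc(uint64_t tc)
{
    if (!(tc & SKETCH_TC_V)) {
        return SKETCH_DDT_INVALID;
    }
    if (!sketch_validate_tc(tc)) {
        return SKETCH_DDT_MISCONFIGURED;
    }
    return SKETCH_DC_OK;
}

/* Usage: V == 0 with a bad PRPR setting is "not valid", not "misconfigured". */
void sketch_dc_demo(void)
{
    assert(sketch_check_dc(SKETCH_TC_PRPR) == SKETCH_DDT_INVALID);
    assert(sketch_check_dc(SKETCH_TC_V | SKETCH_TC_PRPR) ==
           SKETCH_DDT_MISCONFIGURED);
}
```

With the order used in the patch, the first case in sketch_dc_demo()
would return the "misconfigured" cause instead.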
> +
> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
> +        if (ctx->pasid != RISCV_IOMMU_NOPASID) {
> +            /* PASID is disabled */
> +            return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
> +        }
> +        return 0;
> +    }
> +
> +    /* FSC.TC.PDTV enabled */
> +    if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
> +        /* Invalid PDTP.MODE */
> +        return RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED;
> +    }
> +
> +    for (depth = mode - RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8; depth-- > 0; ) {
> +        /*
> +         * Select process id index bits based on process directory tree
> +         * level. See IOMMU Specification, 2.2. Process-Directory-Table.
> +         */
> +        const int split = depth * 9 + 8;
> +        addr |= ((ctx->pasid >> split) << 3) & ~TARGET_PAGE_MASK;
> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
> +        }
> +        le64_to_cpus(&de);
> +        if (!(de & RISCV_IOMMU_PC_TA_V)) {
> +            return RISCV_IOMMU_FQ_CAUSE_PDT_INVALID;
> +        }
> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_PC_FSC_PPN));
> +    }
> +
> +    /* Leaf entry in PDT */
> +    addr |= (ctx->pasid << 4) & ~TARGET_PAGE_MASK;
> +    if (dma_memory_read(s->target_as, addr, &dc.ta, sizeof(uint64_t) * 2,
> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +        return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
> +    }
> +
> +    /* Use FSC and TA from process directory entry. */
> +    ctx->ta = le64_to_cpu(dc.ta);
> +

According to section 2.3.2:

10. If PC.ta.V == 0, stop and report "PDT entry not valid" (cause = 266).
11. If the PC is misconfigured as determined by rules outlined in
Section 2.2.4 then stop and report "PDT entry misconfigured" (cause = 267).

Neither check seems to be done on the leaf entry before returning here.
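
A minimal standalone sketch of those two steps, under stated
assumptions: the SKETCH_* names are hypothetical, and the reserved-bit
mask is only a placeholder for the Section 2.2.4 rules, not the real
set of checks. The cause values 266 and 267 are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for RISCV_IOMMU_PC_TA_V (assumed to be bit 0). */
#define SKETCH_PC_TA_V    (1ULL << 0)
/* Placeholder for "reserved bits set"; assumed mask, for illustration only. */
#define SKETCH_PC_TA_RSVD (0x1FFULL << 3)

enum sketch_pc_cause {
    SKETCH_PC_OK = 0,
    SKETCH_PDT_INVALID = 266,       /* "PDT entry not valid" */
    SKETCH_PDT_MISCONFIGURED = 267, /* "PDT entry misconfigured" */
};

/* Step 10 (validity) runs before step 11 (misconfiguration). */
static int sketch_check_pc(uint64_t ta)
{
    if (!(ta & SKETCH_PC_TA_V)) {
        return SKETCH_PDT_INVALID;
    }
    if (ta & SKETCH_PC_TA_RSVD) {
        return SKETCH_PDT_MISCONFIGURED;
    }
    return SKETCH_PC_OK;
}

/* Usage: an all-zero leaf entry must fault instead of being accepted. */
void sketch_pc_demo(void)
{
    assert(sketch_check_pc(0) == SKETCH_PDT_INVALID);
    assert(sketch_check_pc(SKETCH_PC_TA_V) == SKETCH_PC_OK);
}
```

In the patch this would presumably sit right after the leaf PDT read,
returning the fault code instead of 0.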

> +    return 0;
> +}
> +
> +/* Translation Context cache support */
> +static gboolean __ctx_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
> +    RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
> +    return c1->devid == c2->devid && c1->pasid == c2->pasid;
> +}
> +
> +static guint __ctx_hash(gconstpointer v)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
> +    /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
> +    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
> +}
> +
> +static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
> +        ctx->devid == arg->devid &&
> +        ctx->pasid == arg->pasid) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void __ctx_inval_devid(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
> +        ctx->devid == arg->devid) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void __ctx_inval_all(gpointer key, gpointer value, gpointer data)
> +{
> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V) {
> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
> +    }
> +}
> +
> +static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
> +    uint32_t devid, uint32_t pasid)
> +{
> +    GHashTable *ctx_cache;
> +    RISCVIOMMUContext key = {
> +        .devid = devid,
> +        .pasid = pasid,
> +    };
> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    g_hash_table_foreach(ctx_cache, func, &key);
> +    g_hash_table_unref(ctx_cache);
> +}
> +
> +/* Find or allocate translation context for a given {device_id, process_id} */
> +static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
> +    unsigned devid, unsigned pasid, void **ref)
> +{
> +    GHashTable *ctx_cache;
> +    RISCVIOMMUContext *ctx;
> +    RISCVIOMMUContext key = {
> +        .devid = devid,
> +        .pasid = pasid,
> +    };
> +
> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
> +    ctx = g_hash_table_lookup(ctx_cache, &key);
> +
> +    if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
> +        *ref = ctx_cache;
> +        return ctx;
Is it possible that ddtp.iommu_mode is set to off or bare, which requires
no translation? It looks like the returned ctx will still be used to
perform translation in riscv_iommu_translate().
> +    }
> +
> +    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
> +        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> +                                          g_free, NULL);
> +        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
> +    }
> +
> +    ctx = g_new0(RISCVIOMMUContext, 1);
> +    ctx->devid = devid;
> +    ctx->pasid = pasid;
> +
> +    int fault = riscv_iommu_ctx_fetch(s, ctx);
> +    if (!fault) {
> +        g_hash_table_add(ctx_cache, ctx);
> +        *ref = ctx_cache;
> +        return ctx;
> +    }
> +
> +    g_hash_table_unref(ctx_cache);
> +    *ref = NULL;
> +
> +    riscv_iommu_report_fault(s, ctx, RISCV_IOMMU_FQ_TTYPE_UADDR_RD,
> +                             fault, !!pasid, 0, 0);
> +
> +    g_free(ctx);
> +    return NULL;
> +}
> +
> +static void riscv_iommu_ctx_put(RISCVIOMMUState *s, void *ref)
> +{
> +    if (ref) {
> +        g_hash_table_unref((GHashTable *)ref);
> +    }
> +}
> +
> +/* Find or allocate address space for a given device */
> +static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
> +{
> +    RISCVIOMMUSpace *as;
> +
> +    /* FIXME: PCIe bus remapping for attached endpoints. */
> +    devid |= s->bus << 8;
> +
> +    qemu_mutex_lock(&s->core_lock);
> +    QLIST_FOREACH(as, &s->spaces, list) {
> +        if (as->devid == devid) {
> +            break;
> +        }
> +    }
> +    qemu_mutex_unlock(&s->core_lock);
> +
> +    if (as == NULL) {
> +        char name[64];
> +        as = g_new0(RISCVIOMMUSpace, 1);
> +
> +        as->iommu = s;
> +        as->devid = devid;
> +
> +        snprintf(name, sizeof(name), "riscv-iommu-%04x:%02x.%d-iova",
> +            PCI_BUS_NUM(as->devid), PCI_SLOT(as->devid), PCI_FUNC(as->devid));
> +
> +        /* IOVA address space, untranslated addresses */
> +        memory_region_init_iommu(&as->iova_mr, sizeof(as->iova_mr),
> +            TYPE_RISCV_IOMMU_MEMORY_REGION,
> +            OBJECT(as), "riscv_iommu", UINT64_MAX);
> +        address_space_init(&as->iova_as, MEMORY_REGION(&as->iova_mr), name);
> +
> +        qemu_mutex_lock(&s->core_lock);
> +        QLIST_INSERT_HEAD(&s->spaces, as, list);
> +        qemu_mutex_unlock(&s->core_lock);
> +
> +        trace_riscv_iommu_new(s->parent_obj.id, PCI_BUS_NUM(as->devid),
> +                PCI_SLOT(as->devid), PCI_FUNC(as->devid));
> +    }
> +    return &as->iova_as;
> +}
> +
> +static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
> +    IOMMUTLBEntry *iotlb)
> +{
> +    bool enable_pasid;
> +    bool enable_pri;
> +    int fault;
> +
> +    /*
> +     * TC[32] is reserved for custom extensions, used here to temporarily
> +     * enable automatic page-request generation for ATS queries.
> +     */
> +    enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
> +    enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
> +
> +    /* Translate using device directory / page table information. */
> +    fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
> +
> +    if (enable_pri && fault) {
> +        struct riscv_iommu_pq_record pr = {0};
> +        if (enable_pasid) {
> +            pr.hdr = set_field(RISCV_IOMMU_PREQ_HDR_PV,
> +                RISCV_IOMMU_PREQ_HDR_PID, ctx->pasid);
> +        }
> +        pr.hdr = set_field(pr.hdr, RISCV_IOMMU_PREQ_HDR_DID, ctx->devid);
> +        pr.payload = (iotlb->iova & TARGET_PAGE_MASK) |
> +                     RISCV_IOMMU_PREQ_PAYLOAD_M;
> +        riscv_iommu_pri(s, &pr);
> +        return fault;
> +    }
> +
> +    if (fault) {
> +        unsigned ttype;
> +
> +        if (iotlb->perm & IOMMU_RW) {
> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
> +        } else {
> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
> +        }
> +
> +        riscv_iommu_report_fault(s, ctx, ttype, fault, enable_pasid,
> +                                 iotlb->iova, iotlb->translated_addr);
> +        return fault;
> +    }
> +
> +    return 0;
> +}
> +
> +/* IOMMU Command Interface */
> +static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
> +    uint64_t addr, uint32_t data)
> +{
> +    /*
> +     * ATS processing in this implementation of the IOMMU is synchronous,
> +     * no need to wait for completions here.
> +     */
> +    if (!notify) {
> +        return MEMTX_OK;
> +    }
> +
> +    return dma_memory_write(s->target_as, addr, &data, sizeof(data),
> +        MEMTXATTRS_UNSPECIFIED);
> +}
> +
> +static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
> +{
> +    uint64_t old_ddtp = s->ddtp;
> +    uint64_t new_ddtp = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_DDTP);
> +    unsigned new_mode = get_field(new_ddtp, RISCV_IOMMU_DDTP_MODE);
> +    unsigned old_mode = get_field(old_ddtp, RISCV_IOMMU_DDTP_MODE);
> +    bool ok = false;
> +
> +    /*
> +     * Check for allowed DDTP.MODE transitions:
> +     * {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
> +     * {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
> +     */
> +    if (new_mode == old_mode ||
> +        new_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
> +        new_mode == RISCV_IOMMU_DDTP_MODE_BARE) {
> +        ok = true;
> +    } else if (new_mode == RISCV_IOMMU_DDTP_MODE_1LVL ||
> +               new_mode == RISCV_IOMMU_DDTP_MODE_2LVL ||
> +               new_mode == RISCV_IOMMU_DDTP_MODE_3LVL) {
> +        ok = old_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
> +             old_mode == RISCV_IOMMU_DDTP_MODE_BARE;
> +    }
> +
> +    if (ok) {
> +        /* clear reserved and busy bits, report back sanitized version */
> +        new_ddtp = set_field(new_ddtp & RISCV_IOMMU_DDTP_PPN,
> +                             RISCV_IOMMU_DDTP_MODE, new_mode);
> +    } else {
> +        new_ddtp = old_ddtp;
> +    }
> +    s->ddtp = new_ddtp;
> +
> +    riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_DDTP, new_ddtp);
> +}
> +
> +/* Command function and opcode field. */
> +#define RISCV_IOMMU_CMD(func, op) (((func) << 7) | (op))
> +
> +static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
> +{
> +    struct riscv_iommu_command cmd;
> +    MemTxResult res;
> +    dma_addr_t addr;
> +    uint32_t tail, head, ctrl;
> +    uint64_t cmd_opcode;
> +    GHFunc func;
> +
> +    ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +    tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQT) & s->cq_mask;
> +    head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQH) & s->cq_mask;
> +
> +    /* Check for pending error or queue processing disabled */
> +    if (!(ctrl & RISCV_IOMMU_CQCSR_CQON) ||
> +        !!(ctrl & (RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CQMF))) {
> +        return;
> +    }
> +
> +    while (tail != head) {
> +        addr = s->cq_addr  + head * sizeof(cmd);
> +        res = dma_memory_read(s->target_as, addr, &cmd, sizeof(cmd),
> +                              MEMTXATTRS_UNSPECIFIED);
> +
> +        if (res != MEMTX_OK) {
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                                  RISCV_IOMMU_CQCSR_CQMF, 0);
> +            goto fault;
> +        }
> +
> +        trace_riscv_iommu_cmd(s->parent_obj.id, cmd.dword0, cmd.dword1);
> +
> +        cmd_opcode = get_field(cmd.dword0,
> +                               RISCV_IOMMU_CMD_OPCODE | RISCV_IOMMU_CMD_FUNC);
> +
> +        switch (cmd_opcode) {
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOFENCE_FUNC_C,
> +                             RISCV_IOMMU_CMD_IOFENCE_OPCODE):
> +            res = riscv_iommu_iofence(s,
> +                cmd.dword0 & RISCV_IOMMU_CMD_IOFENCE_AV, cmd.dword1,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOFENCE_DATA));
> +
> +            if (res != MEMTX_OK) {
> +                riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                                      RISCV_IOMMU_CQCSR_CQMF, 0);
> +                goto fault;
> +            }
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA,
> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
> +            if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
> +                /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
> +                goto cmd_ill;
> +            }
> +            /* translation cache not implemented yet */
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
> +            /* translation cache not implemented yet */
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
> +                /* invalidate all device context cache mappings */
> +                func = __ctx_inval_all;
> +            } else {
> +                /* invalidate all device context matching DID */
> +                func = __ctx_inval_devid;
> +            }
> +            riscv_iommu_ctx_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID), 0);
> +            break;
> +
> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT,
> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
> +                /* illegal command arguments IODIR_PDT & DV == 0 */
> +                goto cmd_ill;
> +            } else {
> +                func = __ctx_inval_devid_pasid;
> +            }
> +            riscv_iommu_ctx_inval(s, func,
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID),
> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
> +            break;
> +
> +        default:
> +        cmd_ill:
> +            /* Invalid instruction, do not advance instruction index. */
> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
> +                RISCV_IOMMU_CQCSR_CMD_ILL, 0);
> +            goto fault;
> +        }
> +
> +        /* Advance and update head pointer after command completes. */
> +        head = (head + 1) & s->cq_mask;
> +        riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_CQH, head);
> +    }
> +    return;
> +
> +fault:
> +    if (ctrl & RISCV_IOMMU_CQCSR_CIE) {
> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_CQ);
> +    }
> +}
> +
> +static void riscv_iommu_process_cq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_CQB);
> +        s->cq_mask = (2ULL << get_field(base, RISCV_IOMMU_CQB_LOG2SZ)) - 1;
> +        s->cq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_CQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~s->cq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQT], 0);
> +        ctrl_set = RISCV_IOMMU_CQCSR_CQON;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQMF |
> +                   RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CMD_TO |
> +                   RISCV_IOMMU_CQCSR_FENCE_W_IP;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +static void riscv_iommu_process_fq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_FQB);
> +        s->fq_mask = (2ULL << get_field(base, RISCV_IOMMU_FQB_LOG2SZ)) - 1;
> +        s->fq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_FQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~s->fq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQT], 0);
> +        ctrl_set = RISCV_IOMMU_FQCSR_FQON;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQMF |
> +            RISCV_IOMMU_FQCSR_FQOF;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
> +{
> +    uint64_t base;
> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +    uint32_t ctrl_clr;
> +    bool enable = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQEN);
> +    bool active = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQON);
> +
> +    if (enable && !active) {
> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_PQB);
> +        s->pq_mask = (2ULL << get_field(base, RISCV_IOMMU_PQB_LOG2SZ)) - 1;
> +        s->pq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_PQB_PPN));
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~s->pq_mask);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQH], 0);
> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQT], 0);
> +        ctrl_set = RISCV_IOMMU_PQCSR_PQON;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQMF |
> +            RISCV_IOMMU_PQCSR_PQOF;
> +    } else if (!enable && active) {
> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~0);
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQON;
> +    } else {
> +        ctrl_set = 0;
> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
> +}
> +
> +typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
> +
> +static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
> +{
> +    uint32_t cqcsr, fqcsr, pqcsr;
> +    uint32_t ipsr_set = 0;
> +    uint32_t ipsr_clr = 0;
> +
> +    if (data & RISCV_IOMMU_IPSR_CIP) {
> +        cqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
> +
> +        if (cqcsr & RISCV_IOMMU_CQCSR_CIE &&
> +            (cqcsr & RISCV_IOMMU_CQCSR_FENCE_W_IP ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_ILL ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_TO ||
> +             cqcsr & RISCV_IOMMU_CQCSR_CQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_CIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
> +    }
> +
> +    if (data & RISCV_IOMMU_IPSR_FIP) {
> +        fqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
> +
> +        if (fqcsr & RISCV_IOMMU_FQCSR_FIE &&
> +            (fqcsr & RISCV_IOMMU_FQCSR_FQOF ||
> +             fqcsr & RISCV_IOMMU_FQCSR_FQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_FIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
> +    }
> +
> +    if (data & RISCV_IOMMU_IPSR_PIP) {
> +        pqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
> +
> +        if (pqcsr & RISCV_IOMMU_PQCSR_PIE &&
> +            (pqcsr & RISCV_IOMMU_PQCSR_PQOF ||
> +             pqcsr & RISCV_IOMMU_PQCSR_PQMF)) {
> +            ipsr_set |= RISCV_IOMMU_IPSR_PIP;
> +        } else {
> +            ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
> +        }
> +    } else {
> +        ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
> +    }
> +
> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, ipsr_set, ipsr_clr);
> +}
> +
> +static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
> +    uint64_t data, unsigned size, MemTxAttrs attrs)
> +{
> +    riscv_iommu_process_fn *process_fn = NULL;
> +    RISCVIOMMUState *s = opaque;
> +    uint32_t regb = addr & ~3;
> +    uint32_t busy = 0;
> +    uint64_t val = 0;
> +
> +    if ((addr & (size - 1)) != 0) {
> +        /* Unsupported MMIO alignment or access size */
> +        return MEMTX_ERROR;
> +    }
> +
> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
> +        /* Unsupported MMIO access location. */
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    /* Track actionable MMIO write. */
> +    switch (regb) {
> +    case RISCV_IOMMU_REG_DDTP:
> +    case RISCV_IOMMU_REG_DDTP + 4:
> +        process_fn = riscv_iommu_process_ddtp;
> +        regb = RISCV_IOMMU_REG_DDTP;
> +        busy = RISCV_IOMMU_DDTP_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_CQT:
> +        process_fn = riscv_iommu_process_cq_tail;
> +        break;
> +
> +    case RISCV_IOMMU_REG_CQCSR:
> +        process_fn = riscv_iommu_process_cq_control;
> +        busy = RISCV_IOMMU_CQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_FQCSR:
> +        process_fn = riscv_iommu_process_fq_control;
> +        busy = RISCV_IOMMU_FQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_PQCSR:
> +        process_fn = riscv_iommu_process_pq_control;
> +        busy = RISCV_IOMMU_PQCSR_BUSY;
> +        break;
> +
> +    case RISCV_IOMMU_REG_IPSR:
> +        /*
> +         * IPSR has special procedures to update. Execute it
> +         * and exit.
> +         */
> +        if (size == 4) {
> +            uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
> +            uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
> +            uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
> +            stl_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +        } else if (size == 8) {
> +            uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
> +            uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
> +            uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
> +            stq_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +        }
> +
> +        riscv_iommu_update_ipsr(s, val);
> +
> +        return MEMTX_OK;
> +
> +    default:
> +        break;
> +    }
> +
> +    /*
> +     * Register updates might not be synchronized with core logic.
> +     * If system software updates a register while the relevant BUSY bit
> +     * is set, IOMMU behavior for additional writes to the register
> +     * is UNSPECIFIED.
> +     */
> +    qemu_spin_lock(&s->regs_lock);
> +    if (size == 1) {
> +        uint8_t ro = s->regs_ro[addr];
> +        uint8_t wc = s->regs_wc[addr];
> +        uint8_t rw = s->regs_rw[addr];
> +        s->regs_rw[addr] = ((rw & ro) | (data & ~ro)) & ~(data & wc);
> +    } else if (size == 2) {
> +        uint16_t ro = lduw_le_p(&s->regs_ro[addr]);
> +        uint16_t wc = lduw_le_p(&s->regs_wc[addr]);
> +        uint16_t rw = lduw_le_p(&s->regs_rw[addr]);
> +        stw_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    } else if (size == 4) {
> +        uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
> +        uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
> +        uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
> +        stl_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    } else if (size == 8) {
> +        uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
> +        uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
> +        uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
> +        stq_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & ~(data & wc));
> +    }
> +
> +    /* Busy flag update, MSB 4-byte register. */
> +    if (busy) {
> +        uint32_t rw = ldl_le_p(&s->regs_rw[regb]);
> +        stl_le_p(&s->regs_rw[regb], rw | busy);
> +    }
> +    qemu_spin_unlock(&s->regs_lock);
> +
> +    if (process_fn) {
> +        qemu_mutex_lock(&s->core_lock);
> +        process_fn(s);
> +        qemu_mutex_unlock(&s->core_lock);
> +    }
> +
> +    return MEMTX_OK;
> +}
> +
> +static MemTxResult riscv_iommu_mmio_read(void *opaque, hwaddr addr,
> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
> +{
> +    RISCVIOMMUState *s = opaque;
> +    uint64_t val = -1;
> +    uint8_t *ptr;
> +
> +    if ((addr & (size - 1)) != 0) {
> +        /* Unsupported MMIO alignment. */
> +        return MEMTX_ERROR;
> +    }
> +
> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    ptr = &s->regs_rw[addr];
> +
> +    if (size == 1) {
> +        val = (uint64_t)*ptr;
> +    } else if (size == 2) {
> +        val = lduw_le_p(ptr);
> +    } else if (size == 4) {
> +        val = ldl_le_p(ptr);
> +    } else if (size == 8) {
> +        val = ldq_le_p(ptr);
> +    } else {
> +        return MEMTX_ERROR;
> +    }
> +
> +    *data = val;
> +
> +    return MEMTX_OK;
> +}
> +
> +static const MemoryRegionOps riscv_iommu_mmio_ops = {
> +    .read_with_attrs = riscv_iommu_mmio_read,
> +    .write_with_attrs = riscv_iommu_mmio_write,
> +    .endianness = DEVICE_NATIVE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +        .unaligned = false,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +/*
> + * Translations matching MSI pattern check are redirected to "riscv-iommu-trap"
> + * memory region as untranslated address, for additional MSI/MRIF interception
> + * by IOMMU interrupt remapping implementation.
> + * Note: Device emulation code generating an MSI is expected to provide valid
> + * memory transaction attributes with requester_id set.
> + */
> +static MemTxResult riscv_iommu_trap_write(void *opaque, hwaddr addr,
> +    uint64_t data, unsigned size, MemTxAttrs attrs)
> +{
> +    RISCVIOMMUState* s = (RISCVIOMMUState *)opaque;
> +    RISCVIOMMUContext *ctx;
> +    MemTxResult res;
> +    void *ref;
> +    uint32_t devid = attrs.requester_id;
> +
> +    if (attrs.unspecified) {
> +        return MEMTX_ACCESS_ERROR;
> +    }
> +
> +    /* FIXME: PCIe bus remapping for attached endpoints. */
> +    devid |= s->bus << 8;
> +
> +    ctx = riscv_iommu_ctx(s, devid, 0, &ref);
> +    if (ctx == NULL) {
> +        res = MEMTX_ACCESS_ERROR;
> +    } else {
> +        res = riscv_iommu_msi_write(s, ctx, addr, data, size, attrs);
> +    }
> +    riscv_iommu_ctx_put(s, ref);
> +    return res;
> +}
> +
> +static MemTxResult riscv_iommu_trap_read(void *opaque, hwaddr addr,
> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
> +{
> +    return MEMTX_ACCESS_ERROR;
> +}
> +
> +static const MemoryRegionOps riscv_iommu_trap_ops = {
> +    .read_with_attrs = riscv_iommu_trap_read,
> +    .write_with_attrs = riscv_iommu_trap_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +        .unaligned = true,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +static void riscv_iommu_realize(DeviceState *dev, Error **errp)
> +{
> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
> +
> +    s->cap = s->version & RISCV_IOMMU_CAP_VERSION;
> +    if (s->enable_msi) {
> +        s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
> +    }
> +    /* Report QEMU target physical address space limits */
> +    s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
> +                       TARGET_PHYS_ADDR_SPACE_BITS);
> +
> +    /* TODO: method to report supported PASID bits */
> +    s->pasid_bits = 8; /* restricted to size of MemTxAttrs.pasid */
> +    s->cap |= RISCV_IOMMU_CAP_PD8;
> +
> +    /* Out-of-reset translation mode: OFF (DMA disabled) BARE (passthrough) */
> +    s->ddtp = set_field(0, RISCV_IOMMU_DDTP_MODE, s->enable_off ?
> +                        RISCV_IOMMU_DDTP_MODE_OFF : RISCV_IOMMU_DDTP_MODE_BARE);
> +
> +    /* register storage */
> +    s->regs_rw = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +    s->regs_ro = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +    s->regs_wc = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
> +
> +    /* Mark all registers read-only */
> +    memset(s->regs_ro, 0xff, RISCV_IOMMU_REG_SIZE);
> +
> +    /*
> +     * Register complete MMIO space, including MSI/PBA registers.
> +     * Note, PCIDevice implementation will add overlapping MR for MSI/PBA,
> +     * managed directly by the PCIDevice implementation.
> +     */
> +    memory_region_init_io(&s->regs_mr, OBJECT(dev), &riscv_iommu_mmio_ops, s,
> +        "riscv-iommu-regs", RISCV_IOMMU_REG_SIZE);
> +
> +    /* Set power-on register state */
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
> +        ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
> +        ~(RISCV_IOMMU_CQB_LOG2SZ | RISCV_IOMMU_CQB_PPN));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQB],
> +        ~(RISCV_IOMMU_FQB_LOG2SZ | RISCV_IOMMU_FQB_PPN));
> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQB],
> +        ~(RISCV_IOMMU_PQB_LOG2SZ | RISCV_IOMMU_PQB_PPN));
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQMF |
> +        RISCV_IOMMU_CQCSR_CMD_TO | RISCV_IOMMU_CQCSR_CMD_ILL);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQCSR], RISCV_IOMMU_CQCSR_CQON |
> +        RISCV_IOMMU_CQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQMF |
> +        RISCV_IOMMU_FQCSR_FQOF);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQCSR], RISCV_IOMMU_FQCSR_FQON |
> +        RISCV_IOMMU_FQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQMF |
> +        RISCV_IOMMU_PQCSR_PQOF);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQCSR], RISCV_IOMMU_PQCSR_PQON |
> +        RISCV_IOMMU_PQCSR_BUSY);
> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
> +
> +    /* Memory region for downstream access, if specified. */
> +    if (s->target_mr) {
> +        s->target_as = g_new0(AddressSpace, 1);
> +        address_space_init(s->target_as, s->target_mr,
> +            "riscv-iommu-downstream");
> +    } else {
> +        /* Fallback to global system memory. */
> +        s->target_as = &address_space_memory;
> +    }
> +
> +    /* Memory region for untranslated MRIF/MSI writes */
> +    memory_region_init_io(&s->trap_mr, OBJECT(dev), &riscv_iommu_trap_ops, s,
> +            "riscv-iommu-trap", ~0ULL);
> +    address_space_init(&s->trap_as, &s->trap_mr, "riscv-iommu-trap-as");
> +
> +    /* Device translation context cache */
> +    s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
> +                                         g_free, NULL);
> +
> +    s->iommus.le_next = NULL;
> +    s->iommus.le_prev = NULL;
> +    QLIST_INIT(&s->spaces);
> +    qemu_mutex_init(&s->core_lock);
> +    qemu_spin_init(&s->regs_lock);
> +}
> +
> +static void riscv_iommu_unrealize(DeviceState *dev)
> +{
> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
> +
> +    qemu_mutex_destroy(&s->core_lock);
> +    g_hash_table_unref(s->ctx_cache);
> +}
> +
> +static Property riscv_iommu_properties[] = {
> +    DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
> +        RISCV_IOMMU_SPEC_DOT_VER),
> +    DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
> +    DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
> +    DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
> +    DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
> +        TYPE_MEMORY_REGION, MemoryRegion *),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void riscv_iommu_class_init(ObjectClass *klass, void* data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    /* internal device for riscv-iommu-{pci/sys}, not user-creatable */
> +    dc->user_creatable = false;
> +    dc->realize = riscv_iommu_realize;
> +    dc->unrealize = riscv_iommu_unrealize;
> +    device_class_set_props(dc, riscv_iommu_properties);
> +}
> +
> +static const TypeInfo riscv_iommu_info = {
> +    .name = TYPE_RISCV_IOMMU,
> +    .parent = TYPE_DEVICE,
> +    .instance_size = sizeof(RISCVIOMMUState),
> +    .class_init = riscv_iommu_class_init,
> +};
> +
> +static const char *IOMMU_FLAG_STR[] = {
> +    "NA",
> +    "RO",
> +    "WR",
> +    "RW",
> +};
> +
> +/* RISC-V IOMMU Memory Region - Address Translation Space */
> +static IOMMUTLBEntry riscv_iommu_memory_region_translate(
> +    IOMMUMemoryRegion *iommu_mr, hwaddr addr,
> +    IOMMUAccessFlags flag, int iommu_idx)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +    RISCVIOMMUContext *ctx;
> +    void *ref;
> +    IOMMUTLBEntry iotlb = {
> +        .iova = addr,
> +        .target_as = as->iommu->target_as,
> +        .addr_mask = ~0ULL,
> +        .perm = flag,
> +    };
> +
> +    ctx = riscv_iommu_ctx(as->iommu, as->devid, iommu_idx, &ref);
> +    if (ctx == NULL) {
> +        /* Translation disabled or invalid. */
> +        iotlb.addr_mask = 0;
> +        iotlb.perm = IOMMU_NONE;
> +    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
> +        /* Translation disabled or fault reported. */
> +        iotlb.addr_mask = 0;
> +        iotlb.perm = IOMMU_NONE;
> +    }
> +
> +    /* Trace all dma translations with original access flags. */
> +    trace_riscv_iommu_dma(as->iommu->parent_obj.id, PCI_BUS_NUM(as->devid),
> +                          PCI_SLOT(as->devid), PCI_FUNC(as->devid), iommu_idx,
> +                          IOMMU_FLAG_STR[flag & IOMMU_RW], iotlb.iova,
> +                          iotlb.translated_addr);
> +
> +    riscv_iommu_ctx_put(as->iommu, ref);
> +
> +    return iotlb;
> +}
> +
> +static int riscv_iommu_memory_region_notify(
> +    IOMMUMemoryRegion *iommu_mr, IOMMUNotifierFlag old,
> +    IOMMUNotifierFlag new, Error **errp)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +
> +    if (old == IOMMU_NOTIFIER_NONE) {
> +        as->notifier = true;
> +        trace_riscv_iommu_notifier_add(iommu_mr->parent_obj.name);
> +    } else if (new == IOMMU_NOTIFIER_NONE) {
> +        as->notifier = false;
> +        trace_riscv_iommu_notifier_del(iommu_mr->parent_obj.name);
> +    }
> +
> +    return 0;
> +}
> +
> +static inline bool pci_is_iommu(PCIDevice *pdev)
> +{
> +    return pci_get_word(pdev->config + PCI_CLASS_DEVICE) == 0x0806;
> +}
> +
> +static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void *opaque, int devfn)
> +{
> +    RISCVIOMMUState *s = (RISCVIOMMUState *) opaque;
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> +    AddressSpace *as = NULL;
> +
> +    if (pdev && pci_is_iommu(pdev)) {
> +        return s->target_as;
> +    }
> +
> +    /* Find first registered IOMMU device */
> +    while (s->iommus.le_prev) {
> +        s = *(s->iommus.le_prev);
> +    }
> +
> +    /* Find first matching IOMMU */
> +    while (s != NULL && as == NULL) {
> +        as = riscv_iommu_space(s, PCI_BUILD_BDF(pci_bus_num(bus), devfn));
> +        s = s->iommus.le_next;
> +    }
> +
> +    return as ? as : &address_space_memory;
> +}
> +
> +static const PCIIOMMUOps riscv_iommu_ops = {
> +    .get_address_space = riscv_iommu_find_as,
> +};
> +
> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
> +        Error **errp)
> +{
> +    if (bus->iommu_ops &&
> +        bus->iommu_ops->get_address_space == riscv_iommu_find_as) {
> +        /* Allow multiple IOMMUs on the same PCIe bus, link known devices */
> +        RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque;
> +        QLIST_INSERT_AFTER(last, iommu, iommus);
> +    } else if (!bus->iommu_ops && !bus->iommu_opaque) {
> +        pci_setup_iommu(bus, &riscv_iommu_ops, iommu);
> +    } else {
> +        error_setg(errp, "can't register secondary IOMMU for PCI bus #%d",
> +            pci_bus_num(bus));
> +    }
> +}
> +
> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion *iommu_mr,
> +    MemTxAttrs attrs)
> +{
> +    return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid;
> +}
> +
> +static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion *iommu_mr)
> +{
> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, iova_mr);
> +    return 1 << as->iommu->pasid_bits;
> +}
> +
> +static void riscv_iommu_memory_region_init(ObjectClass *klass, void *data)
> +{
> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
> +
> +    imrc->translate = riscv_iommu_memory_region_translate;
> +    imrc->notify_flag_changed = riscv_iommu_memory_region_notify;
> +    imrc->attrs_to_index = riscv_iommu_memory_region_index;
> +    imrc->num_indexes = riscv_iommu_memory_region_index_len;
> +}
> +
> +static const TypeInfo riscv_iommu_memory_region_info = {
> +    .parent = TYPE_IOMMU_MEMORY_REGION,
> +    .name = TYPE_RISCV_IOMMU_MEMORY_REGION,
> +    .class_init = riscv_iommu_memory_region_init,
> +};
> +
> +static void riscv_iommu_register_mr_types(void)
> +{
> +    type_register_static(&riscv_iommu_memory_region_info);
> +    type_register_static(&riscv_iommu_info);
> +}
> +
> +type_init(riscv_iommu_register_mr_types);
> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
> new file mode 100644
> index 0000000000..31d3907d33
> --- /dev/null
> +++ b/hw/riscv/riscv-iommu.h
> @@ -0,0 +1,141 @@
> +/*
> + * QEMU emulation of a RISC-V IOMMU
> + *
> + * Copyright (C) 2022-2023 Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_RISCV_IOMMU_STATE_H
> +#define HW_RISCV_IOMMU_STATE_H
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +
> +#include "hw/riscv/iommu.h"
> +
> +struct RISCVIOMMUState {
> +    /*< private >*/
> +    DeviceState parent_obj;
> +
> +    /*< public >*/
> +    uint32_t version;     /* Reported interface version number */
> +    uint32_t pasid_bits;  /* process identifier width */
> +    uint32_t bus;         /* PCI bus mapping for non-root endpoints */
> +
> +    uint64_t cap;         /* IOMMU supported capabilities */
> +    uint64_t fctl;        /* IOMMU enabled features */
> +
> +    bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
> +    bool enable_msi;      /* Enable MSI remapping */
> +
> +    /* IOMMU Internal State */
> +    uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */
> +
> +    dma_addr_t cq_addr;   /* Command queue base physical address */
> +    dma_addr_t fq_addr;   /* Fault/event queue base physical address */
> +    dma_addr_t pq_addr;   /* Page request queue base physical address */
> +
> +    uint32_t cq_mask;     /* Command queue index bit mask */
> +    uint32_t fq_mask;     /* Fault/event queue index bit mask */
> +    uint32_t pq_mask;     /* Page request queue index bit mask */
> +
> +    /* interrupt notifier */
> +    void (*notify)(RISCVIOMMUState *iommu, unsigned vector);
> +
> +    /* IOMMU State Machine */
> +    QemuThread core_proc; /* Background processing thread */
> +    QemuMutex core_lock;  /* Global IOMMU lock, used for cache/regs updates */
> +    QemuCond core_cond;   /* Background processing wake up signal */
> +    unsigned core_exec;   /* Processing thread execution actions */
> +
> +    /* IOMMU target address space */
> +    AddressSpace *target_as;
> +    MemoryRegion *target_mr;
> +
> +    /* MSI / MRIF access trap */
> +    AddressSpace trap_as;
> +    MemoryRegion trap_mr;
> +
> +    GHashTable *ctx_cache;          /* Device translation Context Cache */
> +
> +    /* MMIO Hardware Interface */
> +    MemoryRegion regs_mr;
> +    QemuSpin regs_lock;
> +    uint8_t *regs_rw;  /* register state (user write) */
> +    uint8_t *regs_wc;  /* write-1-to-clear mask */
> +    uint8_t *regs_ro;  /* read-only mask */
> +
> +    QLIST_ENTRY(RISCVIOMMUState) iommus;
> +    QLIST_HEAD(, RISCVIOMMUSpace) spaces;
> +};
> +
> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
> +         Error **errp);
> +
> +/* private helpers */
> +
> +/* Register helper functions */
> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s,
> +    unsigned idx, uint32_t set, uint32_t clr)
> +{
> +    uint32_t val;
> +    qemu_spin_lock(&s->regs_lock);
> +    val = ldl_le_p(s->regs_rw + idx);
> +    stl_le_p(s->regs_rw + idx, (val & ~clr) | set);
> +    qemu_spin_unlock(&s->regs_lock);
> +    return val;
> +}
> +
> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s,
> +    unsigned idx, uint32_t set)
> +{
> +    qemu_spin_lock(&s->regs_lock);
> +    stl_le_p(s->regs_rw + idx, set);
> +    qemu_spin_unlock(&s->regs_lock);
> +}
> +
> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s,
> +    unsigned idx)
> +{
> +    return ldl_le_p(s->regs_rw + idx);
> +}
> +
> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s,
> +    unsigned idx, uint64_t set, uint64_t clr)
> +{
> +    uint64_t val;
> +    qemu_spin_lock(&s->regs_lock);
> +    val = ldq_le_p(s->regs_rw + idx);
> +    stq_le_p(s->regs_rw + idx, (val & ~clr) | set);
> +    qemu_spin_unlock(&s->regs_lock);
> +    return val;
> +}
> +
> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s,
> +    unsigned idx, uint64_t set)
> +{
> +    qemu_spin_lock(&s->regs_lock);
> +    stq_le_p(s->regs_rw + idx, set);
> +    qemu_spin_unlock(&s->regs_lock);
> +}
> +
> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s,
> +    unsigned idx)
> +{
> +    return ldq_le_p(s->regs_rw + idx);
> +}
> +
> +
> +
> +#endif
> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
> new file mode 100644
> index 0000000000..42a97caffa
> --- /dev/null
> +++ b/hw/riscv/trace-events
> @@ -0,0 +1,11 @@
> +# See documentation at docs/devel/tracing.rst
> +
> +# riscv-iommu.c
> +riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) "%s: device attached %04x:%02x.%d"
> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" iova: 0x%"PRIx64
> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h
> new file mode 100644
> index 0000000000..8c0e3ca1f3
> --- /dev/null
> +++ b/hw/riscv/trace.h
> @@ -0,0 +1 @@
> +#include "trace/trace-hw_riscv.h"
> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h
> new file mode 100644
> index 0000000000..070ee69973
> --- /dev/null
> +++ b/include/hw/riscv/iommu.h
> @@ -0,0 +1,36 @@
> +/*
> + * QEMU emulation of a RISC-V IOMMU
> + *
> + * Copyright (C) 2022-2023 Rivos Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_RISCV_IOMMU_H
> +#define HW_RISCV_IOMMU_H
> +
> +#include "qemu/osdep.h"
> +#include "qom/object.h"
> +
> +#define TYPE_RISCV_IOMMU "riscv-iommu"
> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU)
> +typedef struct RISCVIOMMUState RISCVIOMMUState;
> +
> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr"
> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace;
> +
> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci"
> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI)
> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci;
> +
> +#endif
> diff --git a/meson.build b/meson.build
> index a9de71d450..8099d8271c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -3319,6 +3319,7 @@ if have_system
>       'hw/pci-host',
>       'hw/ppc',
>       'hw/rtc',
> +    'hw/riscv',
>       'hw/s390x',
>       'hw/scsi',
>       'hw/sd',
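
For readers following the MMIO write path in the patch above, here is a minimal standalone sketch of the masked-write expression used in riscv_iommu_mmio_write() and of the queue index mask computed in the *_control() helpers. The names reg_masked_write32() and queue_index_mask() are hypothetical and do not exist in the patch. The second helper assumes the LOG2SZ field holds log2(entries) - 1, which is what the `2ULL << field` expression implies.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): combine the stored
 * register value with the value written by the guest, using the same
 * expression as riscv_iommu_mmio_write().
 *
 *   ro - bits set here keep their stored value on a write
 *   wc - bits set here are cleared when the written value has them set
 */
static uint32_t reg_masked_write32(uint32_t old, uint32_t data,
                                   uint32_t ro, uint32_t wc)
{
    return ((old & ro) | (data & ~ro)) & ~(data & wc);
}

/*
 * Hypothetical helper (not part of the patch): index mask for a queue
 * whose base register field encodes log2(entries) - 1 (an assumption
 * derived from the `2ULL << field` expression in the patch).
 */
static uint32_t queue_index_mask(unsigned log2sz_field)
{
    return (uint32_t)((2ULL << log2sz_field) - 1);
}
```

With this expression, a bit behaves as write-1-to-clear only if it is set in both masks, which matches the masks riscv_iommu_realize() sets up for IPSR (regs_ro left at 0xff, regs_wc set to ~0). For example, reg_masked_write32(0x3, 0x1, 0x3, 0x3) yields 0x2, and writing 0 with the same masks leaves 0x3 unchanged.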


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 09/13] hw/riscv/riscv-iommu: add s-stage and g-stage support
  2024-05-23 17:39 ` [PATCH v3 09/13] hw/riscv/riscv-iommu: add s-stage and g-stage support Daniel Henrique Barboza
@ 2024-06-18 10:30   ` Jason Chien
  2024-06-21 11:58     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Chien @ 2024-06-18 10:30 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang

Hi Daniel,

On 2024/5/24 01:39 AM, Daniel Henrique Barboza wrote:
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> Add support for s-stage (sv32, sv39, sv48, sv57 caps) and g-stage
> (sv32x4, sv39x4, sv48x4, sv57x4 caps). Most of the work is done in the
> riscv_iommu_spa_fetch() function that now has to consider how many
> translation stages we need to walk the page table.
>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---
>   hw/riscv/riscv-iommu-bits.h |  11 ++
>   hw/riscv/riscv-iommu.c      | 331 +++++++++++++++++++++++++++++++++++-
>   hw/riscv/riscv-iommu.h      |   2 +
>   3 files changed, 336 insertions(+), 8 deletions(-)
>
> diff --git a/hw/riscv/riscv-iommu-bits.h b/hw/riscv/riscv-iommu-bits.h
> index f29b916acb..a4def7b8ec 100644
> --- a/hw/riscv/riscv-iommu-bits.h
> +++ b/hw/riscv/riscv-iommu-bits.h
> @@ -71,6 +71,14 @@ struct riscv_iommu_pq_record {
>   /* 5.3 IOMMU Capabilities (64bits) */
>   #define RISCV_IOMMU_REG_CAP             0x0000
>   #define RISCV_IOMMU_CAP_VERSION         GENMASK_ULL(7, 0)
> +#define RISCV_IOMMU_CAP_SV32            BIT_ULL(8)
> +#define RISCV_IOMMU_CAP_SV39            BIT_ULL(9)
> +#define RISCV_IOMMU_CAP_SV48            BIT_ULL(10)
> +#define RISCV_IOMMU_CAP_SV57            BIT_ULL(11)
> +#define RISCV_IOMMU_CAP_SV32X4          BIT_ULL(16)
> +#define RISCV_IOMMU_CAP_SV39X4          BIT_ULL(17)
> +#define RISCV_IOMMU_CAP_SV48X4          BIT_ULL(18)
> +#define RISCV_IOMMU_CAP_SV57X4          BIT_ULL(19)
>   #define RISCV_IOMMU_CAP_MSI_FLAT        BIT_ULL(22)
>   #define RISCV_IOMMU_CAP_MSI_MRIF        BIT_ULL(23)
>   #define RISCV_IOMMU_CAP_T2GPA           BIT_ULL(26)
> @@ -83,6 +91,7 @@ struct riscv_iommu_pq_record {
>   /* 5.4 Features control register (32bits) */
>   #define RISCV_IOMMU_REG_FCTL            0x0008
>   #define RISCV_IOMMU_FCTL_WSI            BIT(1)
> +#define RISCV_IOMMU_FCTL_GXL            BIT(2)
>   
>   /* 5.5 Device-directory-table pointer (64bits) */
>   #define RISCV_IOMMU_REG_DDTP            0x0010
> @@ -205,6 +214,8 @@ struct riscv_iommu_dc {
>   #define RISCV_IOMMU_DC_TC_DTF           BIT_ULL(4)
>   #define RISCV_IOMMU_DC_TC_PDTV          BIT_ULL(5)
>   #define RISCV_IOMMU_DC_TC_PRPR          BIT_ULL(6)
> +#define RISCV_IOMMU_DC_TC_GADE          BIT_ULL(7)
> +#define RISCV_IOMMU_DC_TC_SADE          BIT_ULL(8)
>   #define RISCV_IOMMU_DC_TC_DPE           BIT_ULL(9)
>   #define RISCV_IOMMU_DC_TC_SBE           BIT_ULL(10)
>   #define RISCV_IOMMU_DC_TC_SXL           BIT_ULL(11)
> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
> index abf6ae7726..11c418b548 100644
> --- a/hw/riscv/riscv-iommu.c
> +++ b/hw/riscv/riscv-iommu.c
> @@ -58,6 +58,8 @@ struct RISCVIOMMUContext {
>       uint64_t __rfu:20;          /* reserved */
>       uint64_t tc;                /* Translation Control */
>       uint64_t ta;                /* Translation Attributes */
> +    uint64_t satp;              /* S-Stage address translation and protection */
> +    uint64_t gatp;              /* G-Stage address translation and protection */
>       uint64_t msi_addr_mask;     /* MSI filtering - address mask */
>       uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
>       uint64_t msiptp;            /* MSI redirection page table pointer */
> @@ -201,12 +203,45 @@ static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>       return true;
>   }
>   
> -/* RISCV IOMMU Address Translation Lookup - Page Table Walk */
> +/*
> + * RISCV IOMMU Address Translation Lookup - Page Table Walk
> + *
> + * Note: Code is based on get_physical_address() from target/riscv/cpu_helper.c
> + * Both implementations can be merged into a single helper function in the
> + * future. They are kept separate for now, as error reporting and flow
> + * specifics are sufficiently different to warrant separate implementations.
> + *
> + * @s        : IOMMU Device State
> + * @ctx      : Translation context for device id and process address space id.
> + * @iotlb    : translation data: physical address and access mode.
> + * @return   : success or fault cause code.
> + */
>   static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>       IOMMUTLBEntry *iotlb)
>   {
> +    dma_addr_t addr, base;
> +    uint64_t satp, gatp, pte;
> +    bool en_s, en_g;
> +    struct {
> +        unsigned char step;
> +        unsigned char levels;
> +        unsigned char ptidxbits;
> +        unsigned char ptesize;
> +    } sc[2];
> +    /* Translation stage phase */
> +    enum {
> +        S_STAGE = 0,
> +        G_STAGE = 1,
> +    } pass;
> +
> +    satp = get_field(ctx->satp, RISCV_IOMMU_ATP_MODE_FIELD);
> +    gatp = get_field(ctx->gatp, RISCV_IOMMU_ATP_MODE_FIELD);
> +
> +    en_s = satp != RISCV_IOMMU_DC_FSC_MODE_BARE;
> +    en_g = gatp != RISCV_IOMMU_DC_IOHGATP_MODE_BARE;
> +
>       /* Early check for MSI address match when IOVA == GPA */
> -    if (iotlb->perm & IOMMU_WO &&
> +    if ((iotlb->perm & IOMMU_WO) &&
>           riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
>           iotlb->target_as = &s->trap_as;
>           iotlb->translated_addr = iotlb->iova;
> @@ -215,11 +250,196 @@ static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>       }
>   
>       /* Exit early for pass-through mode. */
> -    iotlb->translated_addr = iotlb->iova;
> -    iotlb->addr_mask = ~TARGET_PAGE_MASK;
> -    /* Allow R/W in pass-through mode */
> -    iotlb->perm = IOMMU_RW;
> -    return 0;
> +    if (!(en_s || en_g)) {
> +        iotlb->translated_addr = iotlb->iova;
> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +        /* Allow R/W in pass-through mode */
> +        iotlb->perm = IOMMU_RW;
> +        return 0;
> +    }
> +
> +    /* S/G translation parameters. */
> +    for (pass = 0; pass < 2; pass++) {
> +        uint32_t sv_mode;
> +
> +        sc[pass].step = 0;
> +        if (pass ? (s->fctl & RISCV_IOMMU_FCTL_GXL) :
> +            (ctx->tc & RISCV_IOMMU_DC_TC_SXL)) {
> +            /* 32bit mode for GXL/SXL == 1 */
> +            switch (pass ? gatp : satp) {
> +            case RISCV_IOMMU_DC_IOHGATP_MODE_BARE:
> +                sc[pass].levels    = 0;
> +                sc[pass].ptidxbits = 0;
> +                sc[pass].ptesize   = 0;
> +                break;
> +            case RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4:
> +                sv_mode = pass ? RISCV_IOMMU_CAP_SV32X4 : RISCV_IOMMU_CAP_SV32;
> +                if (!(s->cap & sv_mode)) {
> +                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +                }
> +                sc[pass].levels    = 2;
> +                sc[pass].ptidxbits = 10;
> +                sc[pass].ptesize   = 4;
> +                break;
> +            default:
> +                return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +            }
> +        } else {
> +            /* 64bit mode for GXL/SXL == 0 */
> +            switch (pass ? gatp : satp) {
> +            case RISCV_IOMMU_DC_IOHGATP_MODE_BARE:
> +                sc[pass].levels    = 0;
> +                sc[pass].ptidxbits = 0;
> +                sc[pass].ptesize   = 0;
> +                break;
> +            case RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4:
> +                sv_mode = pass ? RISCV_IOMMU_CAP_SV39X4 : RISCV_IOMMU_CAP_SV39;
> +                if (!(s->cap & sv_mode)) {
> +                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +                }
> +                sc[pass].levels    = 3;
> +                sc[pass].ptidxbits = 9;
> +                sc[pass].ptesize   = 8;
> +                break;
> +            case RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4:
> +                sv_mode = pass ? RISCV_IOMMU_CAP_SV48X4 : RISCV_IOMMU_CAP_SV48;
> +                if (!(s->cap & sv_mode)) {
> +                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +                }
> +                sc[pass].levels    = 4;
> +                sc[pass].ptidxbits = 9;
> +                sc[pass].ptesize   = 8;
> +                break;
> +            case RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4:
> +                sv_mode = pass ? RISCV_IOMMU_CAP_SV57X4 : RISCV_IOMMU_CAP_SV57;
> +                if (!(s->cap & sv_mode)) {
> +                    return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +                }
> +                sc[pass].levels    = 5;
> +                sc[pass].ptidxbits = 9;
> +                sc[pass].ptesize   = 8;
> +                break;
> +            default:
> +                return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
> +            }
> +        }
> +    };
> +
> +    /* S/G stages translation tables root pointers */
> +    gatp = PPN_PHYS(get_field(ctx->gatp, RISCV_IOMMU_ATP_PPN_FIELD));
> +    satp = PPN_PHYS(get_field(ctx->satp, RISCV_IOMMU_ATP_PPN_FIELD));
> +    addr = (en_s && en_g) ? satp : iotlb->iova;
> +    base = en_g ? gatp : satp;
> +    pass = en_g ? G_STAGE : S_STAGE;
> +
> +    do {
> +        const unsigned widened = (pass && !sc[pass].step) ? 2 : 0;
> +        const unsigned va_bits = widened + sc[pass].ptidxbits;
> +        const unsigned va_skip = TARGET_PAGE_BITS + sc[pass].ptidxbits *
> +                                 (sc[pass].levels - 1 - sc[pass].step);
> +        const unsigned idx = (addr >> va_skip) & ((1 << va_bits) - 1);
> +        const dma_addr_t pte_addr = base + idx * sc[pass].ptesize;
> +        const bool ade =
> +            ctx->tc & (pass ? RISCV_IOMMU_DC_TC_GADE : RISCV_IOMMU_DC_TC_SADE);
> +
> +        /* Address range check before first level lookup */
> +        if (!sc[pass].step) {
> +            const uint64_t va_mask = (1ULL << (va_skip + va_bits)) - 1;
> +            if ((addr & va_mask) != addr) {
> +                return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
> +            }
> +        }
> +
> +        /* Read page table entry */
> +        if (dma_memory_read(s->target_as, pte_addr, &pte,
> +                sc[pass].ptesize, MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
> +            return (iotlb->perm & IOMMU_WO) ? RISCV_IOMMU_FQ_CAUSE_WR_FAULT
> +                                            : RISCV_IOMMU_FQ_CAUSE_RD_FAULT;
> +        }
> +
> +        if (sc[pass].ptesize == 4) {
> +            pte = (uint64_t) le32_to_cpu(*((uint32_t *)&pte));
> +        } else {
> +            pte = le64_to_cpu(pte);
> +        }
> +
> +        sc[pass].step++;
> +        hwaddr ppn = pte >> PTE_PPN_SHIFT;
> +
> +        if (!(pte & PTE_V)) {
> +            break;                /* Invalid PTE */
> +        } else if (!(pte & (PTE_R | PTE_W | PTE_X))) {
> +            base = PPN_PHYS(ppn); /* Inner PTE, continue walking */
> +        } else if ((pte & (PTE_R | PTE_W | PTE_X)) == PTE_W) {
> +            break;                /* Reserved leaf PTE flags: PTE_W */
> +        } else if ((pte & (PTE_R | PTE_W | PTE_X)) == (PTE_W | PTE_X)) {
> +            break;                /* Reserved leaf PTE flags: PTE_W + PTE_X */
> +        } else if (ppn & ((1ULL << (va_skip - TARGET_PAGE_BITS)) - 1)) {
> +            break;                /* Misaligned PPN */
> +        } else if ((iotlb->perm & IOMMU_RO) && !(pte & PTE_R)) {
> +            break;                /* Read access check failed */
> +        } else if ((iotlb->perm & IOMMU_WO) && !(pte & PTE_W)) {
> +            break;                /* Write access check failed */
> +        } else if ((iotlb->perm & IOMMU_RO) && !ade && !(pte & PTE_A)) {
> +            break;                /* Access bit not set */
> +        } else if ((iotlb->perm & IOMMU_WO) && !ade && !(pte & PTE_D)) {
> +            break;                /* Dirty bit not set */
> +        } else {
> +            /* Leaf PTE, translation completed. */
> +            sc[pass].step = sc[pass].levels;
> +            base = PPN_PHYS(ppn) | (addr & ((1ULL << va_skip) - 1));
> +            /* Update address mask based on smallest translation granularity */
> +            iotlb->addr_mask &= (1ULL << va_skip) - 1;
> +            /* Continue with S-Stage translation? */
> +            if (pass && sc[0].step != sc[0].levels) {
> +                pass = S_STAGE;
> +                addr = iotlb->iova;
> +                continue;
> +            }
> +            /* Translation phase completed (GPA or SPA) */
> +            iotlb->translated_addr = base;
> +            iotlb->perm = (pte & PTE_W) ? ((pte & PTE_R) ? IOMMU_RW : IOMMU_WO)
> +                                                         : IOMMU_RO;
> +
> +            /* Check MSI GPA address match */
> +            if (pass == S_STAGE && (iotlb->perm & IOMMU_WO) &&
> +                riscv_iommu_msi_check(s, ctx, base)) {
> +                /* Trap MSI writes and return GPA address. */
> +                iotlb->target_as = &s->trap_as;
> +                iotlb->addr_mask = ~TARGET_PAGE_MASK;
> +                return 0;
> +            }
> +
> +            /* Continue with G-Stage translation? */
> +            if (!pass && en_g) {
> +                pass = G_STAGE;
> +                addr = base;
> +                base = gatp;
> +                sc[pass].step = 0;
> +                continue;
> +            }
> +
> +            return 0;
> +        }
> +
> +        if (sc[pass].step == sc[pass].levels) {
> +            break; /* Can't find leaf PTE */
> +        }
> +
> +        /* Continue with G-Stage translation? */
> +        if (!pass && en_g) {
> +            pass = G_STAGE;
> +            addr = base;
> +            base = gatp;
> +            sc[pass].step = 0;
> +        }
> +    } while (1);
> +
> +    return (iotlb->perm & IOMMU_WO) ?
> +                (pass ? RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS :
> +                        RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S) :
> +                (pass ? RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS :
> +                        RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S);
>   }
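For anyone following the walk loop above, the per-step index computation
can be isolated as below. This is only a sketch in plain C, assuming
4 KiB pages. pt_index_sketch() and PAGE_BITS_SKETCH are names made up
for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, standing in for TARGET_PAGE_BITS. */
#define PAGE_BITS_SKETCH 12

/*
 * Page table index for one step of the walk.
 * levels/ptidxbits follow the sc[] parameters in the patch
 * (e.g. 3 and 9 for Sv39). For a G-stage walk the root table
 * index is 2 bits wider (the "x4" modes).
 */
static unsigned pt_index_sketch(uint64_t addr, unsigned levels,
                                unsigned ptidxbits, unsigned step,
                                int g_stage)
{
    const unsigned widened = (g_stage && step == 0) ? 2 : 0;
    const unsigned va_bits = widened + ptidxbits;
    const unsigned va_skip = PAGE_BITS_SKETCH +
                             ptidxbits * (levels - 1 - step);

    return (unsigned)((addr >> va_skip) & ((1ULL << va_bits) - 1));
}
```

With Sv39 parameters (levels = 3, ptidxbits = 9) the three steps pick
address bits [38:30], [29:21] and [20:12]. With g_stage set, step 0
picks bits [40:30] instead.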
>   
>   static void riscv_iommu_report_fault(RISCVIOMMUState *s,
> @@ -420,7 +640,7 @@ err:
>   static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
>                                               RISCVIOMMUContext *ctx)
>   {
> -    uint32_t msi_mode;
> +    uint32_t fsc_mode, msi_mode;
>   
>       if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
>           ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
> @@ -441,6 +661,58 @@ static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
>           }
>       }
>   
> +    fsc_mode = get_field(ctx->satp, RISCV_IOMMU_DC_FSC_MODE);
> +
> +    if (ctx->tc & RISCV_IOMMU_DC_TC_PDTV) {
> +        switch (fsc_mode) {
> +        case RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8:
> +            if (!(s->cap & RISCV_IOMMU_CAP_PD8)) {
> +                return false;
> +            }
> +            break;
> +        case RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17:
> +            if (!(s->cap & RISCV_IOMMU_CAP_PD17)) {
> +                return false;
> +            }
> +            break;
> +        case RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20:
> +            if (!(s->cap & RISCV_IOMMU_CAP_PD20)) {
> +                return false;
> +            }
> +            break;
> +        }
> +    } else {
> +        /* DC.tc.PDTV is 0 */
> +        if (ctx->tc & RISCV_IOMMU_DC_TC_DPE) {
> +            return false;
> +        }
> +
> +        if (ctx->tc & RISCV_IOMMU_DC_TC_SXL) {
> +            if (fsc_mode == RISCV_IOMMU_CAP_SV32 &&
> +                !(s->cap & RISCV_IOMMU_CAP_SV32)) {
> +                return false;
> +            }
> +        } else {
> +            switch (fsc_mode) {
> +            case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39:
> +                if (!(s->cap & RISCV_IOMMU_CAP_SV39)) {
> +                    return false;
> +                }
> +                break;
> +            case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48:
> +                if (!(s->cap & RISCV_IOMMU_CAP_SV48)) {
> +                    return false;
> +                }
> +            break;
> +            case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57:
> +                if (!(s->cap & RISCV_IOMMU_CAP_SV57)) {
> +                    return false;
> +                }
> +                break;
> +            }
> +        }
> +    }
> +
>       /*
>        * CAP_END is always zero (only one endianess). FCTL_BE is
>        * always zero (little-endian accesses). Thus TC_SBE must
> @@ -478,6 +750,10 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>   
>       case RISCV_IOMMU_DDTP_MODE_BARE:
>           /* mock up pass-through translation context */
> +        ctx->gatp = set_field(0, RISCV_IOMMU_ATP_MODE_FIELD,
> +            RISCV_IOMMU_DC_IOHGATP_MODE_BARE);
> +        ctx->satp = set_field(0, RISCV_IOMMU_ATP_MODE_FIELD,
> +            RISCV_IOMMU_DC_FSC_MODE_BARE);
>           ctx->tc = RISCV_IOMMU_DC_TC_V;
>           ctx->ta = 0;
>           ctx->msiptp = 0;
> @@ -551,6 +827,8 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>   
>       /* Set translation context. */
>       ctx->tc = le64_to_cpu(dc.tc);
> +    ctx->gatp = le64_to_cpu(dc.iohgatp);
> +    ctx->satp = le64_to_cpu(dc.fsc);
>       ctx->ta = le64_to_cpu(dc.ta);
>       ctx->msiptp = le64_to_cpu(dc.msiptp);
>       ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
> @@ -564,14 +842,38 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>           return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
>       }
>   
> +    /* FSC field checks */
> +    mode = get_field(ctx->satp, RISCV_IOMMU_DC_FSC_MODE);
> +    addr = PPN_PHYS(get_field(ctx->satp, RISCV_IOMMU_DC_FSC_PPN));
> +
> +    if (mode == RISCV_IOMMU_DC_FSC_MODE_BARE) {
According to section 2.3 of the spec, returning here skips some
necessary checks. I think this if block should be moved down, after the
"if (ctx->pasid == RISCV_IOMMU_NOPASID) {...}" block.
> +        /* No S-Stage translation, done. */
> +        return 0;
> +    }
> +
>       if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
>           if (ctx->pasid != RISCV_IOMMU_NOPASID) {
>               /* PASID is disabled */
>               return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
>           }
> +        if (mode > RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57) {
> +            /* Invalid translation mode */
> +            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
> +        }
>           return 0;
>       }
>   
> +    if (ctx->pasid == RISCV_IOMMU_NOPASID) {
> +        if (!(ctx->tc & RISCV_IOMMU_DC_TC_DPE)) {
> +            /* No default PASID enabled, set BARE mode */
> +            ctx->satp = 0ULL;
> +            return 0;
> +        } else {
> +            /* Use default PASID #0 */
> +            ctx->pasid = 0;
> +        }
> +    }
> +
The early return for Bare mode should go here instead.
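To make the ordering I have in mind concrete, here is a rough standalone
sketch. It is not QEMU code: struct ctx_sketch, fsc_checks_sketch() and
the enum values are simplified stand-ins I made up for illustration, and
the PASID handling is reduced to a single has_pasid flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins only; not the constants from the patch. */
enum { MODE_BARE = 0, MODE_MAX_PDTP = 3, MODE_MAX_IOSATP = 10 };
enum { FAULT_NONE = 0, FAULT_TTYPE_BLOCKED = 1, FAULT_DDT_INVALID = 2 };

struct ctx_sketch {
    bool pdtv;       /* stands in for DC.tc.PDTV */
    bool dpe;        /* stands in for DC.tc.DPE */
    bool has_pasid;  /* transaction carries a PASID */
    unsigned mode;   /* stands in for DC.fsc.MODE */
    bool walk_pdt;   /* set when the process directory would be walked */
};

static int fsc_checks_sketch(struct ctx_sketch *c)
{
    c->walk_pdt = false;

    if (!c->pdtv) {
        if (c->has_pasid) {
            return FAULT_TTYPE_BLOCKED;   /* PASID is disabled */
        }
        if (c->mode > MODE_MAX_IOSATP) {
            return FAULT_DDT_INVALID;     /* invalid translation mode */
        }
        return FAULT_NONE;
    }

    if (!c->has_pasid) {
        if (!c->dpe) {
            c->mode = MODE_BARE;          /* no default PASID */
            return FAULT_NONE;
        }
        /* otherwise fall through using default PASID #0 */
    }

    /* Bare check moved here, after the PDTV/PASID checks. */
    if (c->mode == MODE_BARE) {
        return FAULT_NONE;
    }
    if (c->mode > MODE_MAX_PDTP) {
        return FAULT_DDT_INVALID;         /* invalid PDTP.MODE */
    }

    c->walk_pdt = true;
    return FAULT_NONE;
}
```

With this ordering, a transaction that carries a PASID while PDTV is 0
is still blocked even when the mode is Bare, which the early return
would otherwise let through.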
>       /* FSC.TC.PDTV enabled */
>       if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
>           /* Invalid PDTP.MODE */
> @@ -605,6 +907,7 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>   
>       /* Use FSC and TA from process directory entry. */
>       ctx->ta = le64_to_cpu(dc.ta);
> +    ctx->satp = le64_to_cpu(dc.fsc);
>   
>       return 0;
>   }
> @@ -832,6 +1135,7 @@ static RISCVIOMMUEntry *riscv_iommu_iot_lookup(RISCVIOMMUContext *ctx,
>       GHashTable *iot_cache, hwaddr iova)
>   {
>       RISCVIOMMUEntry key = {
> +        .gscid = get_field(ctx->gatp, RISCV_IOMMU_DC_IOHGATP_GSCID),
>           .pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID),
>           .iova  = PPN_DOWN(iova),
>       };
> @@ -909,6 +1213,7 @@ static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>           iot = g_new0(RISCVIOMMUEntry, 1);
>           iot->iova = PPN_DOWN(iotlb->iova);
>           iot->phys = PPN_DOWN(iotlb->translated_addr);
> +        iot->gscid = get_field(ctx->gatp, RISCV_IOMMU_DC_IOHGATP_GSCID);
>           iot->pscid = get_field(ctx->ta, RISCV_IOMMU_DC_TA_PSCID);
>           iot->perm = iotlb->perm;
>           riscv_iommu_iot_update(s, iot_cache, iot);
> @@ -1513,6 +1818,14 @@ static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>       if (s->enable_msi) {
>           s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
>       }
> +    if (s->enable_s_stage) {
> +        s->cap |= RISCV_IOMMU_CAP_SV32 | RISCV_IOMMU_CAP_SV39 |
> +                  RISCV_IOMMU_CAP_SV48 | RISCV_IOMMU_CAP_SV57;
> +    }
> +    if (s->enable_g_stage) {
> +        s->cap |= RISCV_IOMMU_CAP_SV32X4 | RISCV_IOMMU_CAP_SV39X4 |
> +                  RISCV_IOMMU_CAP_SV48X4 | RISCV_IOMMU_CAP_SV57X4;
> +    }
>       /* Report QEMU target physical address space limits */
>       s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
>                          TARGET_PHYS_ADDR_SPACE_BITS);
> @@ -1613,6 +1926,8 @@ static Property riscv_iommu_properties[] = {
>           LIMIT_CACHE_IOT),
>       DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
>       DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
> +    DEFINE_PROP_BOOL("s-stage", RISCVIOMMUState, enable_s_stage, TRUE),
> +    DEFINE_PROP_BOOL("g-stage", RISCVIOMMUState, enable_g_stage, TRUE),
>       DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
>           TYPE_MEMORY_REGION, MemoryRegion *),
>       DEFINE_PROP_END_OF_LIST(),
> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
> index 3afee9f3e8..c24e3e4c16 100644
> --- a/hw/riscv/riscv-iommu.h
> +++ b/hw/riscv/riscv-iommu.h
> @@ -38,6 +38,8 @@ struct RISCVIOMMUState {
>   
>       bool enable_off;      /* Enable out-of-reset OFF mode (DMA disabled) */
>       bool enable_msi;      /* Enable MSI remapping */
> +    bool enable_s_stage;  /* Enable S/VS-Stage translation */
> +    bool enable_g_stage;  /* Enable G-Stage translation */
>   
>       /* IOMMU Internal State */
>       uint64_t ddtp;        /* Validated Device Directory Tree Root Pointer */



* Re: [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation
  2024-06-18 10:06   ` Jason Chien
@ 2024-06-18 15:15     ` Jason Chien
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Chien @ 2024-06-18 15:15 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang, Sebastien Boeuf


On 2024/6/18 06:06 PM, Jason Chien wrote:
> Hi Daniel,
>
> On 2024/5/24 01:39 AM, Daniel Henrique Barboza wrote:
>> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>>
>> The RISC-V IOMMU specification is now ratified as per the RISC-V
>> international process. The latest frozen specification can be found
>> at:
>>
>> https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf 
>>
>>
>> Add the foundation of the device emulation for RISC-V IOMMU, which
>> includes an IOMMU that has no capabilities but MSI interrupt support and
>> fault queue interfaces. We'll add more features incrementally in the
>> next patches.
>>
>> Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
>> Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
>> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
>> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
>> ---
>>   hw/riscv/Kconfig         |    4 +
>>   hw/riscv/meson.build     |    1 +
>>   hw/riscv/riscv-iommu.c   | 1602 ++++++++++++++++++++++++++++++++++++++
>>   hw/riscv/riscv-iommu.h   |  141 ++++
>>   hw/riscv/trace-events    |   11 +
>>   hw/riscv/trace.h         |    1 +
>>   include/hw/riscv/iommu.h |   36 +
>>   meson.build              |    1 +
>>   8 files changed, 1797 insertions(+)
>>   create mode 100644 hw/riscv/riscv-iommu.c
>>   create mode 100644 hw/riscv/riscv-iommu.h
>>   create mode 100644 hw/riscv/trace-events
>>   create mode 100644 hw/riscv/trace.h
>>   create mode 100644 include/hw/riscv/iommu.h
>>
>> diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
>> index a2030e3a6f..f69d6e3c8e 100644
>> --- a/hw/riscv/Kconfig
>> +++ b/hw/riscv/Kconfig
>> @@ -1,3 +1,6 @@
>> +config RISCV_IOMMU
>> +    bool
>> +
>>   config RISCV_NUMA
>>       bool
>>
>> @@ -47,6 +50,7 @@ config RISCV_VIRT
>>       select SERIAL
>>       select RISCV_ACLINT
>>       select RISCV_APLIC
>> +    select RISCV_IOMMU
>>       select RISCV_IMSIC
>>       select SIFIVE_PLIC
>>       select SIFIVE_TEST
>> diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
>> index f872674093..cbc99c6e8e 100644
>> --- a/hw/riscv/meson.build
>> +++ b/hw/riscv/meson.build
>> @@ -10,5 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
>>   riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
>>   riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
>>   riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
>> +riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
>>
>>   hw_arch += {'riscv': riscv_ss}
>> diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
>> new file mode 100644
>> index 0000000000..39b4ff1405
>> --- /dev/null
>> +++ b/hw/riscv/riscv-iommu.c
>> @@ -0,0 +1,1602 @@
>> +/*
>> + * QEMU emulation of an RISC-V IOMMU
>> + *
>> + * Copyright (C) 2021-2023, Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +#include "hw/pci/pci_bus.h"
>> +#include "hw/pci/pci_device.h"
>> +#include "hw/qdev-properties.h"
>> +#include "hw/riscv/riscv_hart.h"
>> +#include "migration/vmstate.h"
>> +#include "qapi/error.h"
>> +#include "qemu/timer.h"
>> +
>> +#include "cpu_bits.h"
>> +#include "riscv-iommu.h"
>> +#include "riscv-iommu-bits.h"
>> +#include "trace.h"
>> +
>> +#define LIMIT_CACHE_CTX               (1U << 7)
>> +#define LIMIT_CACHE_IOT               (1U << 20)
>> +
>> +/* Physical page number conversions */
>> +#define PPN_PHYS(ppn)                 ((ppn) << TARGET_PAGE_BITS)
>> +#define PPN_DOWN(phy)                 ((phy) >> TARGET_PAGE_BITS)
>> +
>> +typedef struct RISCVIOMMUContext RISCVIOMMUContext;
>> +typedef struct RISCVIOMMUEntry RISCVIOMMUEntry;
>> +
>> +/* Device assigned I/O address space */
>> +struct RISCVIOMMUSpace {
>> +    IOMMUMemoryRegion iova_mr;  /* IOVA memory region for attached device */
>> +    AddressSpace iova_as;       /* IOVA address space for attached device */
>> +    RISCVIOMMUState *iommu;     /* Managing IOMMU device state */
>> +    uint32_t devid;             /* Requester identifier, AKA device_id */
>> +    bool notifier;              /* IOMMU unmap notifier enabled */
>> +    QLIST_ENTRY(RISCVIOMMUSpace) list;
>> +};
>> +
>> +/* Device translation context state. */
>> +struct RISCVIOMMUContext {
>> +    uint64_t devid:24;          /* Requester Id, AKA device_id */
>> +    uint64_t pasid:20;          /* Process Address Space ID */
>> +    uint64_t __rfu:20;          /* reserved */
>> +    uint64_t tc;                /* Translation Control */
>> +    uint64_t ta;                /* Translation Attributes */
>> +    uint64_t msi_addr_mask;     /* MSI filtering - address mask */
>> +    uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
>> +    uint64_t msiptp;            /* MSI redirection page table pointer */
>> +};
>> +
>> +/* IOMMU index for transactions without PASID specified. */
>> +#define RISCV_IOMMU_NOPASID 0
>> +
>> +static void riscv_iommu_notify(RISCVIOMMUState *s, int vec)
>> +{
>> +    const uint32_t fctl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FCTL);
>> +    uint32_t ipsr, ivec;
>> +
>> +    if (fctl & RISCV_IOMMU_FCTL_WSI || !s->notify) {
>> +        return;
>> +    }
>> +
>> +    ipsr = riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, (1 << vec), 0);
>> +    ivec = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_IVEC);
>> +
>> +    if (!(ipsr & (1 << vec))) {
>> +        s->notify(s, (ivec >> (vec * 4)) & 0x0F);
>> +    }
>> +}
>> +
>> +static void riscv_iommu_fault(RISCVIOMMUState *s,
>> +                              struct riscv_iommu_fq_record *ev)
>> +{
>> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
>> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQH) & s->fq_mask;
>> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQT) & s->fq_mask;
>> +    uint32_t next = (tail + 1) & s->fq_mask;
>> +    uint32_t devid = get_field(ev->hdr, RISCV_IOMMU_FQ_HDR_DID);
>> +
>> +    trace_riscv_iommu_flt(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
>> +                          PCI_FUNC(devid), ev->hdr, ev->iotval);
>> +
>> +    if (!(ctrl & RISCV_IOMMU_FQCSR_FQON) ||
>> +        !!(ctrl & (RISCV_IOMMU_FQCSR_FQOF | RISCV_IOMMU_FQCSR_FQMF))) {
>> +        return;
>> +    }
>> +
>> +    if (head == next) {
>> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
>> +                              RISCV_IOMMU_FQCSR_FQOF, 0);
>> +    } else {
>> +        dma_addr_t addr = s->fq_addr + tail * sizeof(*ev);
>> +        if (dma_memory_write(s->target_as, addr, ev, sizeof(*ev),
>> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR,
>> +                                  RISCV_IOMMU_FQCSR_FQMF, 0);
>> +        } else {
>> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_FQT, next);
>> +        }
>> +    }
>> +
>> +    if (ctrl & RISCV_IOMMU_FQCSR_FIE) {
>> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_FQ);
>> +    }
>> +}
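As a side note on the head/tail handling above, the full-queue test can
be reduced to the small sketch below. It is a standalone illustration,
not the patch's code: struct ring_sketch and ring_push_sketch() are
hypothetical names, and the overflow flag merely stands in for the FQOF
status bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ring_sketch {
    uint32_t head;   /* consumer index, advanced by software */
    uint32_t tail;   /* producer index, advanced by the device */
    uint32_t mask;   /* number of entries minus one (power of two) */
    bool overflow;   /* stands in for the FQOF status bit */
};

/* Returns true if a record slot was claimed, false on overflow. */
static bool ring_push_sketch(struct ring_sketch *q)
{
    uint32_t tail = q->tail & q->mask;
    uint32_t next = (tail + 1) & q->mask;

    if ((q->head & q->mask) == next) {
        /* Full: advancing tail would make it equal to head. */
        q->overflow = true;
        return false;
    }

    /* The record would be written at index 'tail' here. */
    q->tail = next;
    return true;
}
```

Because "head == next" is the full condition, a queue with N slots holds
at most N - 1 records. That keeps "head == tail" unambiguous as the
empty state.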
>> +
>> +static void riscv_iommu_pri(RISCVIOMMUState *s,
>> +    struct riscv_iommu_pq_record *pr)
>> +{
>> +    uint32_t ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
>> +    uint32_t head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQH) & s->pq_mask;
>> +    uint32_t tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQT) & s->pq_mask;
>> +    uint32_t next = (tail + 1) & s->pq_mask;
>> +    uint32_t devid = get_field(pr->hdr, RISCV_IOMMU_PREQ_HDR_DID);
>> +
>> +    trace_riscv_iommu_pri(s->parent_obj.id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
>> +                          PCI_FUNC(devid), pr->payload);
>> +
>> +    if (!(ctrl & RISCV_IOMMU_PQCSR_PQON) ||
>> +        !!(ctrl & (RISCV_IOMMU_PQCSR_PQOF | RISCV_IOMMU_PQCSR_PQMF))) {
>> +        return;
>> +    }
>> +
>> +    if (head == next) {
>> +        riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
>> +                              RISCV_IOMMU_PQCSR_PQOF, 0);
>> +    } else {
>> +        dma_addr_t addr = s->pq_addr + tail * sizeof(*pr);
>> +        if (dma_memory_write(s->target_as, addr, pr, sizeof(*pr),
>> +                             MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR,
>> +                                  RISCV_IOMMU_PQCSR_PQMF, 0);
>> +        } else {
>> +            riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_PQT, next);
>> +        }
>> +    }
>> +
>> +    if (ctrl & RISCV_IOMMU_PQCSR_PIE) {
>> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_PQ);
>> +    }
>> +}
>> +
>> +/* Portable implementation of pext_u64, bit-mask extraction. */
>> +static uint64_t _pext_u64(uint64_t val, uint64_t ext)
>> +{
>> +    uint64_t ret = 0;
>> +    uint64_t rot = 1;
>> +
>> +    while (ext) {
>> +        if (ext & 1) {
>> +            if (val & 1) {
>> +                ret |= rot;
>> +            }
>> +            rot <<= 1;
>> +        }
>> +        val >>= 1;
>> +        ext >>= 1;
>> +    }
>> +
>> +    return ret;
>> +}
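For readers not familiar with pext, a small worked example may help.
The function body below mirrors the one quoted above. Only the name
pext_u64_sketch() is mine, so that it can be compiled on its own.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Gather the bits of 'val' selected by the mask 'ext' and pack them
 * into the low bits of the result, lowest selected bit first.
 */
static uint64_t pext_u64_sketch(uint64_t val, uint64_t ext)
{
    uint64_t ret = 0;
    uint64_t rot = 1;

    while (ext) {
        if (ext & 1) {
            if (val & 1) {
                ret |= rot;
            }
            rot <<= 1;   /* next output bit position */
        }
        val >>= 1;
        ext >>= 1;
    }

    return ret;
}
```

For val = 0xABCD and ext = 0x0F0F, the mask selects nibble 0 (0xD) and
nibble 2 (0xB), so the packed result is 0xBD. The interrupt file number
is extracted from the GPA page number in the same way, using
msi_addr_mask as the mask.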
>> +
>> +/* Check if GPA matches MSI/MRIF pattern. */
>> +static bool riscv_iommu_msi_check(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>> +    dma_addr_t gpa)
>> +{
>> +    if (get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE) !=
>> +        RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
>> +        return false; /* Invalid MSI/MRIF mode */
>> +    }
>> +
>> +    if ((PPN_DOWN(gpa) ^ ctx->msi_addr_pattern) & ~ctx->msi_addr_mask) {
>> +        return false; /* GPA not in MSI range defined by AIA IMSIC rules. */
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/* RISCV IOMMU Address Translation Lookup - Page Table Walk */
>> +static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>> +    IOMMUTLBEntry *iotlb)
>> +{
>> +    /* Early check for MSI address match when IOVA == GPA */
>> +    if (iotlb->perm & IOMMU_WO &&
>> +        riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
>> +        iotlb->target_as = &s->trap_as;
>> +        iotlb->translated_addr = iotlb->iova;
>> +        iotlb->addr_mask = ~TARGET_PAGE_MASK;
>> +        return 0;
>> +    }
>> +
>> +    /* Exit early for pass-through mode. */
>> +    iotlb->translated_addr = iotlb->iova;
>> +    iotlb->addr_mask = ~TARGET_PAGE_MASK;
>> +    /* Allow R/W in pass-through mode */
>> +    iotlb->perm = IOMMU_RW;
>> +    return 0;
>> +}
>> +
>> +static void riscv_iommu_report_fault(RISCVIOMMUState *s,
>> +                                     RISCVIOMMUContext *ctx,
>> +                                     uint32_t fault_type, uint32_t cause,
>> +                                     bool pv,
>> +                                     uint64_t iotval, uint64_t iotval2)
>> +{
>> +    struct riscv_iommu_fq_record ev = { 0 };
>> +
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_DTF) {
>> +        switch (cause) {
>> +        case RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_INVALID:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED:
>> +        case RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED:
>> +        case RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR:
>> +        case RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT:
>> +            break;
>> +        default:
>> +            /* DTF prevents reporting a fault for this given cause */
>> +            return;
>> +        }
>> +    }
>> +
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_CAUSE, cause);
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_TTYPE, fault_type);
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_DID, ctx->devid);
>> +    ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PV, true);
>> +
>> +    if (pv) {
>> +        ev.hdr = set_field(ev.hdr, RISCV_IOMMU_FQ_HDR_PID, ctx->pasid);
>> +    }
>> +
>> +    ev.iotval = iotval;
>> +    ev.iotval2 = iotval2;
>> +
>> +    riscv_iommu_fault(s, &ev);
>> +}
>> +
>> +/* Redirect MSI write for given GPA. */
>> +static MemTxResult riscv_iommu_msi_write(RISCVIOMMUState *s,
>> +    RISCVIOMMUContext *ctx, uint64_t gpa, uint64_t data,
>> +    unsigned size, MemTxAttrs attrs)
>> +{
>> +    MemTxResult res;
>> +    dma_addr_t addr;
>> +    uint64_t intn;
>> +    uint32_t n190;
>> +    uint64_t pte[2];
>> +    int fault_type = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
>> +    int cause;
>> +
>> +    if (!riscv_iommu_msi_check(s, ctx, gpa)) {
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    /* Interrupt File Number */
>> +    intn = _pext_u64(PPN_DOWN(gpa), ctx->msi_addr_mask);
>> +    if (intn >= 256) {
>> +        /* Interrupt file number out of range */
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    /* fetch MSI PTE */
>> +    addr = PPN_PHYS(get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_PPN));
>> +    addr = addr | (intn * sizeof(pte));
>> +    res = dma_memory_read(s->target_as, addr, &pte, sizeof(pte),
>> +            MEMTXATTRS_UNSPECIFIED);
>> +    if (res != MEMTX_OK) {
>> +        if (res == MEMTX_DECODE_ERROR) {
>> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED;
>> +        } else {
>> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        }
>> +        goto err;
>> +    }
>> +
>> +    le64_to_cpus(&pte[0]);
>> +    le64_to_cpus(&pte[1]);
>> +
>> +    if (!(pte[0] & RISCV_IOMMU_MSI_PTE_V) || (pte[0] & RISCV_IOMMU_MSI_PTE_C)) {
>> +        /*
>> +         * The spec mentions that: "If msipte.C == 1, then further
>> +         * processing to interpret the PTE is implementation
>> +         * defined.". We'll abort with cause = 262 for this
>> +         * case too.
>> +         */
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_INVALID;
>> +        goto err;
>> +    }
>> +
>> +    switch (get_field(pte[0], RISCV_IOMMU_MSI_PTE_M)) {
>> +    case RISCV_IOMMU_MSI_PTE_M_BASIC:
>> +        /* MSI Pass-through mode */
>> +        addr = PPN_PHYS(get_field(pte[0], RISCV_IOMMU_MSI_PTE_PPN));
>> +        addr = addr | (gpa & TARGET_PAGE_MASK);
>> +
>> +        trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
>> +                              PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
>> +                              gpa, addr);
>> +
>> +        res = dma_memory_write(s->target_as, addr, &data, size, attrs);
>> +        if (res != MEMTX_OK) {
>> +            cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
>> +            goto err;
>> +        }
>> +
>> +        return MEMTX_OK;
>> +    case RISCV_IOMMU_MSI_PTE_M_MRIF:
>> +        /* MRIF mode, continue. */
>> +        break;
>> +    default:
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
>> +        goto err;
>> +    }
>> +
>> +    /*
>> +     * Report an error for interrupt identities exceeding the maximum allowed
>> +     * for an IMSIC interrupt file (2047) or destination address is not 32-bit
>> +     * aligned. See IOMMU Specification, Chapter 2.3. MSI page tables.
>> +     */
>> +    if ((data > 2047) || (gpa & 3)) {
>> +        res = MEMTX_ACCESS_ERROR;
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED;
>> +        goto err;
>> +    }
>> +
>> +    /* MSI MRIF mode, non atomic pending bit update */
>> +
>> +    /* MRIF pending bit address */
>> +    addr = get_field(pte[0], RISCV_IOMMU_MSI_PTE_MRIF_ADDR) << 9;
>> +    addr = addr | ((data & 0x7c0) >> 3);
>> +
>> +    trace_riscv_iommu_msi(s->parent_obj.id, PCI_BUS_NUM(ctx->devid),
>> +                          PCI_SLOT(ctx->devid), PCI_FUNC(ctx->devid),
>> +                          gpa, addr);
>> +
>> +    /* MRIF pending bit mask */
>> +    data = 1ULL << (data & 0x03f);
>> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    intn = intn | data;
>> +    res = dma_memory_write(s->target_as, addr, &intn, sizeof(intn), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    /* Get MRIF enable bits */
>> +    addr = addr + sizeof(intn);
>> +    res = dma_memory_read(s->target_as, addr, &intn, sizeof(intn), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    if (!(intn & data)) {
>> +        /* notification disabled, MRIF update completed. */
>> +        return MEMTX_OK;
>> +    }
>> +
>> +    /* Send notification message */
>> +    addr = PPN_PHYS(get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NPPN));
>> +    n190 = get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID) |
>> +          (get_field(pte[1], RISCV_IOMMU_MSI_MRIF_NID_MSB) << 10);
>> +
>> +    res = dma_memory_write(s->target_as, addr, &n190, sizeof(n190), attrs);
>> +    if (res != MEMTX_OK) {
>> +        cause = RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT;
>> +        goto err;
>> +    }
>> +
>> +    return MEMTX_OK;
>> +
>> +err:
>> +    riscv_iommu_report_fault(s, ctx, fault_type, cause,
>> +                             !!ctx->pasid, 0, 0);
>> +    return res;
>> +}
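For anyone who wants to try the MSI address match and the interrupt file number extraction from riscv_iommu_msi_check() and riscv_iommu_msi_write() in isolation, here is a minimal sketch. It assumes 4 KiB pages (PPN = gpa >> 12) and uses a portable bit-extract loop in place of _pext_u64(). All sketch_* names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Portable parallel bit extract: gather the bits of val selected by mask
 * into the low-order bits of the result, preserving their order.
 */
uint64_t sketch_pext_u64(uint64_t val, uint64_t mask)
{
    uint64_t out = 0;
    unsigned pos = 0;

    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1); /* lowest set bit of mask */

        if (val & lowest) {
            out |= 1ULL << pos;
        }
        pos++;
        mask &= mask - 1; /* clear that bit and move on */
    }
    return out;
}

/* True if the page number of gpa matches pattern outside the masked bits. */
bool sketch_msi_match(uint64_t gpa, uint64_t pattern, uint64_t mask)
{
    return (((gpa >> 12) ^ pattern) & ~mask) == 0;
}

/* Interrupt file number: the masked page-number bits, packed together. */
uint64_t sketch_msi_file_number(uint64_t gpa, uint64_t mask)
{
    return sketch_pext_u64(gpa >> 12, mask);
}
```

With pattern 0x300000 and mask 0xff, a GPA of 0x300005000 matches and selects interrupt file 5, while 0x400005000 does not match.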
>> +
>> +/*
>> + * Check device context configuration as described by the
>> + * riscv-iommu spec section "Device-context configuration
>> + * checks".
>> + */
>> +static bool riscv_iommu_validate_device_ctx(RISCVIOMMUState *s,
>> +                                            RISCVIOMMUContext *ctx)
>> +{
>> +    uint32_t msi_mode;
>> +
>> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_EN_PRI) &&
>> +        ctx->tc & RISCV_IOMMU_DC_TC_PRPR) {
>> +        return false;
>> +    }
>> +
>> +    if (!(s->cap & RISCV_IOMMU_CAP_T2GPA) &&
>> +        ctx->tc & RISCV_IOMMU_DC_TC_T2GPA) {
>> +        return false;
>> +    }
>> +
>> +    if (s->cap & RISCV_IOMMU_CAP_MSI_FLAT) {
>> +        msi_mode = get_field(ctx->msiptp, RISCV_IOMMU_DC_MSIPTP_MODE);
>> +
>> +        if (msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_OFF &&
>> +            msi_mode != RISCV_IOMMU_DC_MSIPTP_MODE_FLAT) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * CAP_END is always zero (only one endianness). FCTL_BE is
>> +     * always zero (little-endian accesses). Thus TC_SBE must
>> +     * always be LE, i.e. zero.
>> +     */
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_SBE) {
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * RISC-V IOMMU Device Context Lookup - Device Directory Tree Walk
>> + *
>> + * @s         : IOMMU Device State
>> + * @ctx       : Device Translation Context with devid and pasid set.
>> + * @return    : success or fault code.
>> + */
>> +static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>> +{
>> +    const uint64_t ddtp = s->ddtp;
>> +    unsigned mode = get_field(ddtp, RISCV_IOMMU_DDTP_MODE);
>> +    dma_addr_t addr = PPN_PHYS(get_field(ddtp, RISCV_IOMMU_DDTP_PPN));
>> +    struct riscv_iommu_dc dc;
>> +    /* Device Context format: 0: extended (64 bytes) | 1: base (32 bytes) */
>> +    const int dc_fmt = !s->enable_msi;
>> +    const size_t dc_len = sizeof(dc) >> dc_fmt;
>> +    unsigned depth;
>> +    uint64_t de;
>> +
>> +    switch (mode) {
>> +    case RISCV_IOMMU_DDTP_MODE_OFF:
>> +        return RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_BARE:
>> +        /* mock up pass-through translation context */
>> +        ctx->tc = RISCV_IOMMU_DC_TC_V;
>> +        ctx->ta = 0;
>> +        ctx->msiptp = 0;
>> +        return 0;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_1LVL:
>> +        depth = 0;
>> +        break;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_2LVL:
>> +        depth = 1;
>> +        break;
>> +
>> +    case RISCV_IOMMU_DDTP_MODE_3LVL:
>> +        depth = 2;
>> +        break;
>> +
>> +    default:
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
>> +    }
>> +
>> +    /*
>> +     * Check supported device id width (in bits).
>> +     * See IOMMU Specification, Chapter 6. Software guidelines.
>> +     * - if extended device-context format is used:
>> +     *   1LVL: 6, 2LVL: 15, 3LVL: 24
>> +     * - if base device-context format is used:
>> +     *   1LVL: 7, 2LVL: 16, 3LVL: 24
>> +     */
>> +    if (ctx->devid >= (1 << (depth * 9 + 6 + (dc_fmt && depth != 2)))) {
>> +        return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
>> +    }
>> +
>> +    /* Device directory tree walk */
>> +    for (; depth-- > 0; ) {
>> +        /*
>> +         * Select device id index bits based on device directory tree level
>> +         * and device context format.
>> +         * See IOMMU Specification, Chapter 2. Data Structures.
>> +         * - if extended device-context format is used:
>> +         *   device index: [23:15][14:6][5:0]
>> +         * - if base device-context format is used:
>> +         *   device index: [23:16][15:7][6:0]
>> +         */
>> +        const int split = depth * 9 + 6 + dc_fmt;
>> +        addr |= ((ctx->devid >> split) << 3) & ~TARGET_PAGE_MASK;
>> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
>> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
>> +        }
>> +        le64_to_cpus(&de);
>> +        if (!(de & RISCV_IOMMU_DDTE_VALID)) {
>> +            /* invalid directory entry */
>> +            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
>> +        }
>> +        if (de & ~(RISCV_IOMMU_DDTE_PPN | RISCV_IOMMU_DDTE_VALID)) {
>> +            /* reserved bits set */
>> +            return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
>> +        }
>> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_DDTE_PPN));
>> +    }
>> +
>> +    /* index into device context entry page */
>> +    addr |= (ctx->devid * dc_len) & ~TARGET_PAGE_MASK;
>> +
>> +    memset(&dc, 0, sizeof(dc));
>> +    if (dma_memory_read(s->target_as, addr, &dc, dc_len,
>> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT;
>> +    }
>> +
>> +    /* Set translation context. */
>> +    ctx->tc = le64_to_cpu(dc.tc);
>> +    ctx->ta = le64_to_cpu(dc.ta);
>> +    ctx->msiptp = le64_to_cpu(dc.msiptp);
>> +    ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
>> +    ctx->msi_addr_pattern = le64_to_cpu(dc.msi_addr_pattern);
>> +
>> +    if (!riscv_iommu_validate_device_ctx(s, ctx)) {
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED;
>> +    }
>> +
>> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_V)) {
>> +        return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
>> +    }
> According to section 2.3.1 (items 9 and 10), DC.tc.V == 0 should be
> checked before riscv_iommu_validate_device_ctx() is called.
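To make the requested ordering concrete, here is a minimal sketch: test the valid bit first, and only then the configuration checks. It assumes V is bit 0 of DC.tc and that causes 258 and 259 are "DDT entry not valid" and "DDT entry misconfigured", which should be confirmed against the spec. The sketch_* and SKETCH_* names are hypothetical, and config_ok stands in for the result of riscv_iommu_validate_device_ctx().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real definitions live in the patch headers. */
#define SKETCH_DC_TC_V                  (1ULL << 0)
#define SKETCH_CAUSE_DDT_INVALID        258
#define SKETCH_CAUSE_DDT_MISCONFIGURED  259

/* Returns 0 on success or a fault cause, checking DC.tc.V first. */
int sketch_dc_check(uint64_t tc, bool config_ok)
{
    if (!(tc & SKETCH_DC_TC_V)) {
        /* An invalid entry is reported as such, whatever else it holds. */
        return SKETCH_CAUSE_DDT_INVALID;
    }
    if (!config_ok) {
        return SKETCH_CAUSE_DDT_MISCONFIGURED;
    }
    return 0;
}
```

The visible difference from the quoted code is the case of an invalid entry that also fails the configuration checks: it reports 258 here rather than 259.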
>> +
>> +    if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
>> +        if (ctx->pasid != RISCV_IOMMU_NOPASID) {
>> +            /* PASID is disabled */
>> +            return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
>> +        }
>> +        return 0;
>> +    }
>> +
>> +    /* FSC.TC.PDTV enabled */
>> +    if (mode > RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20) {
>> +        /* Invalid PDTP.MODE */
>> +        return RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED;
>> +    }
>> +
>> +    for (depth = mode - RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8; depth-- > 0; ) {
>> +        /*
>> +         * Select process id index bits based on process directory tree
>> +         * level. See IOMMU Specification, 2.2. Process-Directory-Table.
>> +         */
>> +        const int split = depth * 9 + 8;
>> +        addr |= ((ctx->pasid >> split) << 3) & ~TARGET_PAGE_MASK;
>> +        if (dma_memory_read(s->target_as, addr, &de, sizeof(de),
>> +                            MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +            return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
>> +        }
>> +        le64_to_cpus(&de);
>> +        if (!(de & RISCV_IOMMU_PC_TA_V)) {
>> +            return RISCV_IOMMU_FQ_CAUSE_PDT_INVALID;
>> +        }
>> +        addr = PPN_PHYS(get_field(de, RISCV_IOMMU_PC_FSC_PPN));
>> +    }
>> +
>> +    /* Leaf entry in PDT */
>> +    addr |= (ctx->pasid << 4) & ~TARGET_PAGE_MASK;
>> +    if (dma_memory_read(s->target_as, addr, &dc.ta, sizeof(uint64_t) * 2,
>> +                        MEMTXATTRS_UNSPECIFIED) != MEMTX_OK) {
>> +        return RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT;
>> +    }
>> +
>> +    /* Use FSC and TA from process directory entry. */
>> +    ctx->ta = le64_to_cpu(dc.ta);
>> +
>
> According to section 2.3.2, these two checks are missing here:
>
> 10. If PC.ta.V == 0, stop and report "PDT entry not valid" (cause = 266).
> 11. If the PC is misconfigured as determined by rules outlined in 
> Section 2.2.4 then stop and report "PDT entry misconfigured" (cause = 
> 267).
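A minimal sketch of the two steps quoted above, applied to the leaf process context after it has been loaded. It assumes V is bit 0 of PC.ta, and it reduces the Section 2.2.4 rules to a caller-supplied reserved-bits mask purely for illustration, since the real rules cover more than reserved bits. The sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; cause values are the ones quoted from the spec. */
#define SKETCH_PC_TA_V                  (1ULL << 0)
#define SKETCH_CAUSE_PDT_INVALID        266
#define SKETCH_CAUSE_PDT_MISCONFIGURED  267

/*
 * Returns 0 on success or a fault cause. reserved_mask marks the bits of
 * PC.ta that must be zero (a simplification of the misconfiguration rules).
 */
int sketch_pc_check(uint64_t ta, uint64_t reserved_mask)
{
    if (!(ta & SKETCH_PC_TA_V)) {
        return SKETCH_CAUSE_PDT_INVALID;        /* step 10 */
    }
    if (ta & reserved_mask) {
        return SKETCH_CAUSE_PDT_MISCONFIGURED;  /* step 11 */
    }
    return 0;
}
```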
>
>> +    return 0;
>> +}
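The device id width check near the top of riscv_iommu_ctx_fetch() packs the table from its comment into one expression. A minimal sketch that spells it out, assuming depth is 0, 1 or 2 for 1LVL, 2LVL and 3LVL as in the patch. The sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Supported device id width in bits for a directory depth (0..2).
 * The base format gains one bit except at the deepest level, where the
 * width is capped at 24 bits.
 */
unsigned sketch_devid_bits(unsigned depth, bool base_format)
{
    return depth * 9 + 6 + ((base_format && depth != 2) ? 1 : 0);
}

/* True if devid fits in the supported width. */
bool sketch_devid_in_range(uint32_t devid, unsigned depth, bool base_format)
{
    return devid < (1u << sketch_devid_bits(depth, base_format));
}
```

This gives 6, 15 and 24 bits for the extended format and 7, 16 and 24 bits for the base format, matching the comment in the patch.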
>> +
>> +/* Translation Context cache support */
>> +static gboolean __ctx_equal(gconstpointer v1, gconstpointer v2)
>> +{
>> +    RISCVIOMMUContext *c1 = (RISCVIOMMUContext *) v1;
>> +    RISCVIOMMUContext *c2 = (RISCVIOMMUContext *) v2;
>> +    return c1->devid == c2->devid && c1->pasid == c2->pasid;
>> +}
>> +
>> +static guint __ctx_hash(gconstpointer v)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) v;
>> +    /* Generate simple hash of (pasid, devid), assuming 24-bit wide devid */
>> +    return (guint)(ctx->devid) + ((guint)(ctx->pasid) << 24);
>> +}
>> +
>> +static void __ctx_inval_devid_pasid(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
>> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
>> +        ctx->devid == arg->devid &&
>> +        ctx->pasid == arg->pasid) {
>> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
>> +    }
>> +}
>> +
>> +static void __ctx_inval_devid(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
>> +    RISCVIOMMUContext *arg = (RISCVIOMMUContext *) data;
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V &&
>> +        ctx->devid == arg->devid) {
>> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
>> +    }
>> +}
>> +
>> +static void __ctx_inval_all(gpointer key, gpointer value, gpointer data)
>> +{
>> +    RISCVIOMMUContext *ctx = (RISCVIOMMUContext *) value;
>> +    if (ctx->tc & RISCV_IOMMU_DC_TC_V) {
>> +        ctx->tc &= ~RISCV_IOMMU_DC_TC_V;
>> +    }
>> +}
>> +
>> +static void riscv_iommu_ctx_inval(RISCVIOMMUState *s, GHFunc func,
>> +    uint32_t devid, uint32_t pasid)
>> +{
>> +    GHashTable *ctx_cache;
>> +    RISCVIOMMUContext key = {
>> +        .devid = devid,
>> +        .pasid = pasid,
>> +    };
>> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
>> +    g_hash_table_foreach(ctx_cache, func, &key);
>> +    g_hash_table_unref(ctx_cache);
>> +}
>> +
>> +/* Find or allocate translation context for a given {device_id, process_id} */
>> +static RISCVIOMMUContext *riscv_iommu_ctx(RISCVIOMMUState *s,
>> +    unsigned devid, unsigned pasid, void **ref)
>> +{
>> +    GHashTable *ctx_cache;
>> +    RISCVIOMMUContext *ctx;
>> +    RISCVIOMMUContext key = {
>> +        .devid = devid,
>> +        .pasid = pasid,
>> +    };
>> +
>> +    ctx_cache = g_hash_table_ref(s->ctx_cache);
>> +    ctx = g_hash_table_lookup(ctx_cache, &key);
>> +
>> +    if (ctx && (ctx->tc & RISCV_IOMMU_DC_TC_V)) {
>> +        *ref = ctx_cache;
>> +        return ctx;
> Is it possible that ddtp.iommu_mode is set to Off or Bare, which
> requires no translation? It looks like the returned ctx would still be
> used to perform translation in riscv_iommu_translate().
I have found the answer in the spec: software is responsible for
invalidating the cache. Please ignore this.
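To illustrate the point, here is a toy model of a cached context whose valid flag is cleared only by an explicit invalidation, so a stale entry keeps being reused until software invalidates it. This is a sketch under simplifying assumptions (a single entry, no hash table), and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_ctx {
    uint32_t devid;
    bool valid;        /* plays the role of the cached TC.V bit */
    unsigned fetches;  /* number of simulated directory walks */
};

/* Reuse a valid cached entry; otherwise simulate a fresh fetch. */
void sketch_ctx_get(struct sketch_ctx *c)
{
    if (!c->valid) {
        c->fetches++;
        c->valid = true;
    }
}

/* Effect of a software-issued invalidation for the given devid. */
void sketch_ctx_inval(struct sketch_ctx *c, uint32_t devid)
{
    if (c->devid == devid) {
        c->valid = false;
    }
}

/* Two lookups, optionally separated by an invalidation; returns fetches. */
unsigned sketch_ctx_demo(bool invalidate_between)
{
    struct sketch_ctx c = { .devid = 8, .valid = false, .fetches = 0 };

    sketch_ctx_get(&c);
    if (invalidate_between) {
        sketch_ctx_inval(&c, 8);
    }
    sketch_ctx_get(&c);
    return c.fetches;
}
```

Without the invalidation the second lookup hits the cached entry (one fetch in total); with it, the context is fetched again (two fetches).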
>> +    }
>> +
>> +    if (g_hash_table_size(s->ctx_cache) >= LIMIT_CACHE_CTX) {
>> +        ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
>> +                                          g_free, NULL);
>> +        g_hash_table_unref(qatomic_xchg(&s->ctx_cache, ctx_cache));
>> +    }
>> +
>> +    ctx = g_new0(RISCVIOMMUContext, 1);
>> +    ctx->devid = devid;
>> +    ctx->pasid = pasid;
>> +
>> +    int fault = riscv_iommu_ctx_fetch(s, ctx);
>> +    if (!fault) {
>> +        g_hash_table_add(ctx_cache, ctx);
>> +        *ref = ctx_cache;
>> +        return ctx;
>> +    }
>> +
>> +    g_hash_table_unref(ctx_cache);
>> +    *ref = NULL;
>> +
>> +    riscv_iommu_report_fault(s, ctx, RISCV_IOMMU_FQ_TTYPE_UADDR_RD,
>> +                             fault, !!pasid, 0, 0);
>> +
>> +    g_free(ctx);
>> +    return NULL;
>> +}
>> +
>> +static void riscv_iommu_ctx_put(RISCVIOMMUState *s, void *ref)
>> +{
>> +    if (ref) {
>> +        g_hash_table_unref((GHashTable *)ref);
>> +    }
>> +}
>> +
>> +/* Find or allocate address space for a given device */
>> +static AddressSpace *riscv_iommu_space(RISCVIOMMUState *s, uint32_t devid)
>> +{
>> +    RISCVIOMMUSpace *as;
>> +
>> +    /* FIXME: PCIe bus remapping for attached endpoints. */
>> +    devid |= s->bus << 8;
>> +
>> +    qemu_mutex_lock(&s->core_lock);
>> +    QLIST_FOREACH(as, &s->spaces, list) {
>> +        if (as->devid == devid) {
>> +            break;
>> +        }
>> +    }
>> +    qemu_mutex_unlock(&s->core_lock);
>> +
>> +    if (as == NULL) {
>> +        char name[64];
>> +        as = g_new0(RISCVIOMMUSpace, 1);
>> +
>> +        as->iommu = s;
>> +        as->devid = devid;
>> +
>> +        snprintf(name, sizeof(name), "riscv-iommu-%04x:%02x.%d-iova",
>> +            PCI_BUS_NUM(as->devid), PCI_SLOT(as->devid), PCI_FUNC(as->devid));
>> +
>> +        /* IOVA address space, untranslated addresses */
>> +        memory_region_init_iommu(&as->iova_mr, sizeof(as->iova_mr),
>> +            TYPE_RISCV_IOMMU_MEMORY_REGION,
>> +            OBJECT(as), "riscv_iommu", UINT64_MAX);
>> +        address_space_init(&as->iova_as, MEMORY_REGION(&as->iova_mr), name);
>> +
>> +        qemu_mutex_lock(&s->core_lock);
>> +        QLIST_INSERT_HEAD(&s->spaces, as, list);
>> +        qemu_mutex_unlock(&s->core_lock);
>> +
>> +        trace_riscv_iommu_new(s->parent_obj.id, PCI_BUS_NUM(as->devid),
>> +                PCI_SLOT(as->devid), PCI_FUNC(as->devid));
>> +    }
>> +    return &as->iova_as;
>> +}
>> +
>> +static int riscv_iommu_translate(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
>> +    IOMMUTLBEntry *iotlb)
>> +{
>> +    bool enable_pasid;
>> +    bool enable_pri;
>> +    int fault;
>> +
>> +    /*
>> +     * TC[32] is reserved for custom extensions, used here to temporarily
>> +     * enable automatic page-request generation for ATS queries.
>> +     */
>> +    enable_pri = (iotlb->perm == IOMMU_NONE) && (ctx->tc & BIT_ULL(32));
>> +    enable_pasid = (ctx->tc & RISCV_IOMMU_DC_TC_PDTV);
>> +
>> +    /* Translate using device directory / page table information. */
>> +    fault = riscv_iommu_spa_fetch(s, ctx, iotlb);
>> +
>> +    if (enable_pri && fault) {
>> +        struct riscv_iommu_pq_record pr = {0};
>> +        if (enable_pasid) {
>> +            pr.hdr = set_field(RISCV_IOMMU_PREQ_HDR_PV,
>> +                RISCV_IOMMU_PREQ_HDR_PID, ctx->pasid);
>> +        }
>> +        pr.hdr = set_field(pr.hdr, RISCV_IOMMU_PREQ_HDR_DID, ctx->devid);
>> +        pr.payload = (iotlb->iova & TARGET_PAGE_MASK) |
>> +                     RISCV_IOMMU_PREQ_PAYLOAD_M;
>> +        riscv_iommu_pri(s, &pr);
>> +        return fault;
>> +    }
>> +
>> +    if (fault) {
>> +        unsigned ttype;
>> +
>> +        if (iotlb->perm & IOMMU_RW) {
>> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_WR;
>> +        } else {
>> +            ttype = RISCV_IOMMU_FQ_TTYPE_UADDR_RD;
>> +        }
>> +
>> +        riscv_iommu_report_fault(s, ctx, ttype, fault, enable_pasid,
>> +                                 iotlb->iova, iotlb->translated_addr);
>> +        return fault;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* IOMMU Command Interface */
>> +static MemTxResult riscv_iommu_iofence(RISCVIOMMUState *s, bool notify,
>> +    uint64_t addr, uint32_t data)
>> +{
>> +    /*
>> +     * ATS processing in this implementation of the IOMMU is synchronous,
>> +     * no need to wait for completions here.
>> +     */
>> +    if (!notify) {
>> +        return MEMTX_OK;
>> +    }
>> +
>> +    return dma_memory_write(s->target_as, addr, &data, sizeof(data),
>> +        MEMTXATTRS_UNSPECIFIED);
>> +}
>> +
>> +static void riscv_iommu_process_ddtp(RISCVIOMMUState *s)
>> +{
>> +    uint64_t old_ddtp = s->ddtp;
>> +    uint64_t new_ddtp = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_DDTP);
>> +    unsigned new_mode = get_field(new_ddtp, RISCV_IOMMU_DDTP_MODE);
>> +    unsigned old_mode = get_field(old_ddtp, RISCV_IOMMU_DDTP_MODE);
>> +    bool ok = false;
>> +
>> +    /*
>> +     * Check for allowed DDTP.MODE transitions:
>> +     * {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
>> +     * {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
>> +     */
>> +    if (new_mode == old_mode ||
>> +        new_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
>> +        new_mode == RISCV_IOMMU_DDTP_MODE_BARE) {
>> +        ok = true;
>> +    } else if (new_mode == RISCV_IOMMU_DDTP_MODE_1LVL ||
>> +               new_mode == RISCV_IOMMU_DDTP_MODE_2LVL ||
>> +               new_mode == RISCV_IOMMU_DDTP_MODE_3LVL) {
>> +        ok = old_mode == RISCV_IOMMU_DDTP_MODE_OFF ||
>> +             old_mode == RISCV_IOMMU_DDTP_MODE_BARE;
>> +    }
>> +
>> +    if (ok) {
>> +        /* clear reserved and busy bits, report back sanitized version */
>> +        new_ddtp = set_field(new_ddtp & RISCV_IOMMU_DDTP_PPN,
>> +                             RISCV_IOMMU_DDTP_MODE, new_mode);
>> +    } else {
>> +        new_ddtp = old_ddtp;
>> +    }
>> +    s->ddtp = new_ddtp;
>> +
>> +    riscv_iommu_reg_set64(s, RISCV_IOMMU_REG_DDTP, new_ddtp);
>> +}
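The transition rule in riscv_iommu_process_ddtp() can be checked on its own. A minimal sketch mirroring the logic of the patch, assuming the DDTP mode encodings Off = 0, Bare = 1, 1LVL = 2, 2LVL = 3, 3LVL = 4, which should be verified against the patch headers. The sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed encodings of the DDTP mode field. */
enum sketch_ddtp_mode {
    SKETCH_MODE_OFF  = 0,
    SKETCH_MODE_BARE = 1,
    SKETCH_MODE_1LVL = 2,
    SKETCH_MODE_2LVL = 3,
    SKETCH_MODE_3LVL = 4,
};

/*
 * Allowed transitions, as listed in the patch comment:
 *   {OFF, BARE}        -> {OFF, BARE, 1LVL, 2LVL, 3LVL}
 *   {1LVL, 2LVL, 3LVL} -> {OFF, BARE}
 * Writing the current mode back is also accepted.
 */
bool sketch_ddtp_transition_ok(unsigned old_mode, unsigned new_mode)
{
    if (new_mode == old_mode ||
        new_mode == SKETCH_MODE_OFF ||
        new_mode == SKETCH_MODE_BARE) {
        return true;
    }
    if (new_mode == SKETCH_MODE_1LVL ||
        new_mode == SKETCH_MODE_2LVL ||
        new_mode == SKETCH_MODE_3LVL) {
        return old_mode == SKETCH_MODE_OFF || old_mode == SKETCH_MODE_BARE;
    }
    return false; /* unknown target mode */
}
```

Under these assumptions a direct 1LVL to 2LVL switch is rejected, so software has to go through Off or Bare first.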
>> +
>> +/* Command function and opcode field. */
>> +#define RISCV_IOMMU_CMD(func, op) (((func) << 7) | (op))
>> +
>> +static void riscv_iommu_process_cq_tail(RISCVIOMMUState *s)
>> +{
>> +    struct riscv_iommu_command cmd;
>> +    MemTxResult res;
>> +    dma_addr_t addr;
>> +    uint32_t tail, head, ctrl;
>> +    uint64_t cmd_opcode;
>> +    GHFunc func;
>> +
>> +    ctrl = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
>> +    tail = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQT) & s->cq_mask;
>> +    head = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQH) & s->cq_mask;
>> +
>> +    /* Check for pending error or queue processing disabled */
>> +    if (!(ctrl & RISCV_IOMMU_CQCSR_CQON) ||
>> +        !!(ctrl & (RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CQMF))) {
>> +        return;
>> +    }
>> +
>> +    while (tail != head) {
>> +        addr = s->cq_addr  + head * sizeof(cmd);
>> +        res = dma_memory_read(s->target_as, addr, &cmd, sizeof(cmd),
>> +                              MEMTXATTRS_UNSPECIFIED);
>> +
>> +        if (res != MEMTX_OK) {
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
>> +                                  RISCV_IOMMU_CQCSR_CQMF, 0);
>> +            goto fault;
>> +        }
>> +
>> +        trace_riscv_iommu_cmd(s->parent_obj.id, cmd.dword0, cmd.dword1);
>> +
>> +        cmd_opcode = get_field(cmd.dword0,
>> +                               RISCV_IOMMU_CMD_OPCODE | RISCV_IOMMU_CMD_FUNC);
>> +
>> +        switch (cmd_opcode) {
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOFENCE_FUNC_C,
>> +                             RISCV_IOMMU_CMD_IOFENCE_OPCODE):
>> +            res = riscv_iommu_iofence(s,
>> +                cmd.dword0 & RISCV_IOMMU_CMD_IOFENCE_AV, cmd.dword1,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IOFENCE_DATA));
>> +
>> +            if (res != MEMTX_OK) {
>> +                riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
>> +                                      RISCV_IOMMU_CQCSR_CQMF, 0);
>> +                goto fault;
>> +            }
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA,
>> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
>> +            if (cmd.dword0 & RISCV_IOMMU_CMD_IOTINVAL_PSCV) {
>> +                /* illegal command arguments IOTINVAL.GVMA & PSCV == 1 */
>> +                goto cmd_ill;
>> +            }
>> +            /* translation cache not implemented yet */
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA,
>> +                             RISCV_IOMMU_CMD_IOTINVAL_OPCODE):
>> +            /* translation cache not implemented yet */
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT,
>> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
>> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
>> +                /* invalidate all device context cache mappings */
>> +                func = __ctx_inval_all;
>> +            } else {
>> +                /* invalidate all device context matching DID */
>> +                func = __ctx_inval_devid;
>> +            }
>> +            riscv_iommu_ctx_inval(s, func,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID), 0);
>> +            break;
>> +
>> +        case RISCV_IOMMU_CMD(RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT,
>> +                             RISCV_IOMMU_CMD_IODIR_OPCODE):
>> +            if (!(cmd.dword0 & RISCV_IOMMU_CMD_IODIR_DV)) {
>> +                /* illegal command arguments IODIR_PDT & DV == 0 */
>> +                goto cmd_ill;
>> +            } else {
>> +                func = __ctx_inval_devid_pasid;
>> +            }
>> +            riscv_iommu_ctx_inval(s, func,
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_DID),
>> +                get_field(cmd.dword0, RISCV_IOMMU_CMD_IODIR_PID));
>> +            break;
>> +
>> +        default:
>> +        cmd_ill:
>> +            /* Invalid instruction, do not advance instruction index. */
>> +            riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR,
>> +                RISCV_IOMMU_CQCSR_CMD_ILL, 0);
>> +            goto fault;
>> +        }
>> +
>> +        /* Advance and update head pointer after command completes. */
>> +        head = (head + 1) & s->cq_mask;
>> +        riscv_iommu_reg_set32(s, RISCV_IOMMU_REG_CQH, head);
>> +    }
>> +    return;
>> +
>> +fault:
>> +    if (ctrl & RISCV_IOMMU_CQCSR_CIE) {
>> +        riscv_iommu_notify(s, RISCV_IOMMU_INTR_CQ);
>> +    }
>> +}
>> +
>> +static void riscv_iommu_process_cq_control(RISCVIOMMUState *s)
>> +{
>> +    uint64_t base;
>> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
>> +    uint32_t ctrl_clr;
>> +    bool enable = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQEN);
>> +    bool active = !!(ctrl_set & RISCV_IOMMU_CQCSR_CQON);
>> +
>> +    if (enable && !active) {
>> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_CQB);
>> +        s->cq_mask = (2ULL << get_field(base, RISCV_IOMMU_CQB_LOG2SZ)) - 1;
>> +        s->cq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_CQB_PPN));
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~s->cq_mask);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQH], 0);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_CQT], 0);
>> +        ctrl_set = RISCV_IOMMU_CQCSR_CQON;
>> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQMF |
>> +                   RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_CMD_TO |
>> +                   RISCV_IOMMU_CQCSR_FENCE_W_IP;
>> +    } else if (!enable && active) {
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQT], ~0);
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY | RISCV_IOMMU_CQCSR_CQON;
>> +    } else {
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_CQCSR_BUSY;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_CQCSR, ctrl_set, ctrl_clr);
>> +}
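The queue index arithmetic shared by the CQ, FQ and PQ control paths is small enough to sketch. The sketch below follows the expression used in the patch, (2ULL << field) - 1, so a field value of n yields 2^(n+1) entries, and the head advance mirrors the one in riscv_iommu_process_cq_tail(). The sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index mask for a queue whose size field holds log2(entries) - 1. */
uint32_t sketch_queue_mask(unsigned log2sz_field)
{
    return (uint32_t)((2ULL << log2sz_field) - 1);
}

/* Advance a head or tail index by one entry, wrapping at the queue size. */
uint32_t sketch_queue_next(uint32_t index, uint32_t mask)
{
    return (index + 1) & mask;
}
```

A field value of 3 gives a 16-entry queue with mask 0xf, and advancing from index 15 wraps to 0.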
>> +
>> +static void riscv_iommu_process_fq_control(RISCVIOMMUState *s)
>> +{
>> +    uint64_t base;
>> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
>> +    uint32_t ctrl_clr;
>> +    bool enable = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQEN);
>> +    bool active = !!(ctrl_set & RISCV_IOMMU_FQCSR_FQON);
>> +
>> +    if (enable && !active) {
>> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_FQB);
>> +        s->fq_mask = (2ULL << get_field(base, RISCV_IOMMU_FQB_LOG2SZ)) - 1;
>> +        s->fq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_FQB_PPN));
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~s->fq_mask);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQH], 0);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_FQT], 0);
>> +        ctrl_set = RISCV_IOMMU_FQCSR_FQON;
>> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQMF |
>> +            RISCV_IOMMU_FQCSR_FQOF;
>> +    } else if (!enable && active) {
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQH], ~0);
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY | RISCV_IOMMU_FQCSR_FQON;
>> +    } else {
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_FQCSR_BUSY;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_FQCSR, ctrl_set, ctrl_clr);
>> +}
>> +
>> +static void riscv_iommu_process_pq_control(RISCVIOMMUState *s)
>> +{
>> +    uint64_t base;
>> +    uint32_t ctrl_set = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
>> +    uint32_t ctrl_clr;
>> +    bool enable = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQEN);
>> +    bool active = !!(ctrl_set & RISCV_IOMMU_PQCSR_PQON);
>> +
>> +    if (enable && !active) {
>> +        base = riscv_iommu_reg_get64(s, RISCV_IOMMU_REG_PQB);
>> +        s->pq_mask = (2ULL << get_field(base, RISCV_IOMMU_PQB_LOG2SZ)) - 1;
>> +        s->pq_addr = PPN_PHYS(get_field(base, RISCV_IOMMU_PQB_PPN));
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~s->pq_mask);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQH], 0);
>> +        stl_le_p(&s->regs_rw[RISCV_IOMMU_REG_PQT], 0);
>> +        ctrl_set = RISCV_IOMMU_PQCSR_PQON;
>> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQMF |
>> +            RISCV_IOMMU_PQCSR_PQOF;
>> +    } else if (!enable && active) {
>> +        stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQH], ~0);
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY | RISCV_IOMMU_PQCSR_PQON;
>> +    } else {
>> +        ctrl_set = 0;
>> +        ctrl_clr = RISCV_IOMMU_PQCSR_BUSY;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_PQCSR, ctrl_set, ctrl_clr);
>> +}
>> +
>> +typedef void riscv_iommu_process_fn(RISCVIOMMUState *s);
>> +
>> +static void riscv_iommu_update_ipsr(RISCVIOMMUState *s, uint64_t data)
>> +{
>> +    uint32_t cqcsr, fqcsr, pqcsr;
>> +    uint32_t ipsr_set = 0;
>> +    uint32_t ipsr_clr = 0;
>> +
>> +    if (data & RISCV_IOMMU_IPSR_CIP) {
>> +        cqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_CQCSR);
>> +
>> +        if (cqcsr & RISCV_IOMMU_CQCSR_CIE &&
>> +            (cqcsr & RISCV_IOMMU_CQCSR_FENCE_W_IP ||
>> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_ILL ||
>> +             cqcsr & RISCV_IOMMU_CQCSR_CMD_TO ||
>> +             cqcsr & RISCV_IOMMU_CQCSR_CQMF)) {
>> +            ipsr_set |= RISCV_IOMMU_IPSR_CIP;
>> +        } else {
>> +            ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
>> +        }
>> +    } else {
>> +        ipsr_clr |= RISCV_IOMMU_IPSR_CIP;
>> +    }
>> +
>> +    if (data & RISCV_IOMMU_IPSR_FIP) {
>> +        fqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_FQCSR);
>> +
>> +        if (fqcsr & RISCV_IOMMU_FQCSR_FIE &&
>> +            (fqcsr & RISCV_IOMMU_FQCSR_FQOF ||
>> +             fqcsr & RISCV_IOMMU_FQCSR_FQMF)) {
>> +            ipsr_set |= RISCV_IOMMU_IPSR_FIP;
>> +        } else {
>> +            ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
>> +        }
>> +    } else {
>> +        ipsr_clr |= RISCV_IOMMU_IPSR_FIP;
>> +    }
>> +
>> +    if (data & RISCV_IOMMU_IPSR_PIP) {
>> +        pqcsr = riscv_iommu_reg_get32(s, RISCV_IOMMU_REG_PQCSR);
>> +
>> +        if (pqcsr & RISCV_IOMMU_PQCSR_PIE &&
>> +            (pqcsr & RISCV_IOMMU_PQCSR_PQOF ||
>> +             pqcsr & RISCV_IOMMU_PQCSR_PQMF)) {
>> +            ipsr_set |= RISCV_IOMMU_IPSR_PIP;
>> +        } else {
>> +            ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
>> +        }
>> +    } else {
>> +        ipsr_clr |= RISCV_IOMMU_IPSR_PIP;
>> +    }
>> +
>> +    riscv_iommu_reg_mod32(s, RISCV_IOMMU_REG_IPSR, ipsr_set, ipsr_clr);
>> +}
>> +
>> +static MemTxResult riscv_iommu_mmio_write(void *opaque, hwaddr addr,
>> +    uint64_t data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    riscv_iommu_process_fn *process_fn = NULL;
>> +    RISCVIOMMUState *s = opaque;
>> +    uint32_t regb = addr & ~3;
>> +    uint32_t busy = 0;
>> +    uint64_t val = 0;
>> +
>> +    if ((addr & (size - 1)) != 0) {
>> +        /* Unsupported MMIO alignment or access size */
>> +        return MEMTX_ERROR;
>> +    }
>> +
>> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
>> +        /* Unsupported MMIO access location. */
>> +        return MEMTX_ACCESS_ERROR;
>> +    }
>> +
>> +    /* Track actionable MMIO write. */
>> +    switch (regb) {
>> +    case RISCV_IOMMU_REG_DDTP:
>> +    case RISCV_IOMMU_REG_DDTP + 4:
>> +        process_fn = riscv_iommu_process_ddtp;
>> +        regb = RISCV_IOMMU_REG_DDTP;
>> +        busy = RISCV_IOMMU_DDTP_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_CQT:
>> +        process_fn = riscv_iommu_process_cq_tail;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_CQCSR:
>> +        process_fn = riscv_iommu_process_cq_control;
>> +        busy = RISCV_IOMMU_CQCSR_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_FQCSR:
>> +        process_fn = riscv_iommu_process_fq_control;
>> +        busy = RISCV_IOMMU_FQCSR_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_PQCSR:
>> +        process_fn = riscv_iommu_process_pq_control;
>> +        busy = RISCV_IOMMU_PQCSR_BUSY;
>> +        break;
>> +
>> +    case RISCV_IOMMU_REG_IPSR:
>> +        /*
>> +         * IPSR has special procedures to update. Execute it
>> +         * and exit.
>> +         */
>> +        if (size == 4) {
>> +            uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
>> +            uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
>> +            uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
>> +            stl_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
>> +        } else if (size == 8) {
>> +            uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
>> +            uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
>> +            uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
>> +            stq_le_p(&val, ((rw & ro) | (data & ~ro)) & ~(data & wc));
>> +        }
>> +
>> +        riscv_iommu_update_ipsr(s, val);
>> +
>> +        return MEMTX_OK;
>> +
>> +    default:
>> +        break;
>> +    }
>> +
>> +    /*
>> +     * Register updates might not be synchronized with the core logic.
>> +     * If system software updates a register while the relevant BUSY bit
>> +     * is set, the IOMMU behavior for additional writes to the register
>> +     * is UNSPECIFIED.
>> +     */
>> +    qemu_spin_lock(&s->regs_lock);
>> +    if (size == 1) {
>> +        uint8_t ro = s->regs_ro[addr];
>> +        uint8_t wc = s->regs_wc[addr];
>> +        uint8_t rw = s->regs_rw[addr];
>> +        s->regs_rw[addr] = ((rw & ro) | (data & ~ro)) & ~(data & wc);
>> +    } else if (size == 2) {
>> +        uint16_t ro = lduw_le_p(&s->regs_ro[addr]);
>> +        uint16_t wc = lduw_le_p(&s->regs_wc[addr]);
>> +        uint16_t rw = lduw_le_p(&s->regs_rw[addr]);
>> +        stw_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & 
>> ~(data & wc));
>> +    } else if (size == 4) {
>> +        uint32_t ro = ldl_le_p(&s->regs_ro[addr]);
>> +        uint32_t wc = ldl_le_p(&s->regs_wc[addr]);
>> +        uint32_t rw = ldl_le_p(&s->regs_rw[addr]);
>> +        stl_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & 
>> ~(data & wc));
>> +    } else if (size == 8) {
>> +        uint64_t ro = ldq_le_p(&s->regs_ro[addr]);
>> +        uint64_t wc = ldq_le_p(&s->regs_wc[addr]);
>> +        uint64_t rw = ldq_le_p(&s->regs_rw[addr]);
>> +        stq_le_p(&s->regs_rw[addr], ((rw & ro) | (data & ~ro)) & 
>> ~(data & wc));
>> +    }
>> +
>> +    /* Busy flag update, MSB 4-byte register. */
>> +    if (busy) {
>> +        uint32_t rw = ldl_le_p(&s->regs_rw[regb]);
>> +        stl_le_p(&s->regs_rw[regb], rw | busy);
>> +    }
>> +    qemu_spin_unlock(&s->regs_lock);
>> +
>> +    if (process_fn) {
>> +        qemu_mutex_lock(&s->core_lock);
>> +        process_fn(s);
>> +        qemu_mutex_unlock(&s->core_lock);
>> +    }
>> +
>> +    return MEMTX_OK;
>> +}
>> +
>> +static MemTxResult riscv_iommu_mmio_read(void *opaque, hwaddr addr,
>> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    RISCVIOMMUState *s = opaque;
>> +    uint64_t val = -1;
>> +    uint8_t *ptr;
>> +
>> +    if ((addr & (size - 1)) != 0) {
>> +        /* Unsupported MMIO alignment. */
>> +        return MEMTX_ERROR;
>> +    }
>> +
>> +    if (addr + size > RISCV_IOMMU_REG_MSI_CONFIG) {
>> +        return MEMTX_ACCESS_ERROR;
>> +    }
>> +
>> +    ptr = &s->regs_rw[addr];
>> +
>> +    if (size == 1) {
>> +        val = (uint64_t)*ptr;
>> +    } else if (size == 2) {
>> +        val = lduw_le_p(ptr);
>> +    } else if (size == 4) {
>> +        val = ldl_le_p(ptr);
>> +    } else if (size == 8) {
>> +        val = ldq_le_p(ptr);
>> +    } else {
>> +        return MEMTX_ERROR;
>> +    }
>> +
>> +    *data = val;
>> +
>> +    return MEMTX_OK;
>> +}
>> +
>> +static const MemoryRegionOps riscv_iommu_mmio_ops = {
>> +    .read_with_attrs = riscv_iommu_mmio_read,
>> +    .write_with_attrs = riscv_iommu_mmio_write,
>> +    .endianness = DEVICE_NATIVE_ENDIAN,
>> +    .impl = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +        .unaligned = false,
>> +    },
>> +    .valid = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +    }
>> +};
>> +
>> +/*
>> + * Translations matching MSI pattern check are redirected to "riscv-iommu-trap"
>> + * memory region as untranslated address, for additional MSI/MRIF interception
>> + * by IOMMU interrupt remapping implementation.
>> + * Note: Device emulation code generating an MSI is expected to provide valid
>> + * memory transaction attributes with requester_id set.
>> + */
>> +static MemTxResult riscv_iommu_trap_write(void *opaque, hwaddr addr,
>> +    uint64_t data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    RISCVIOMMUState* s = (RISCVIOMMUState *)opaque;
>> +    RISCVIOMMUContext *ctx;
>> +    MemTxResult res;
>> +    void *ref;
>> +    uint32_t devid = attrs.requester_id;
>> +
>> +    if (attrs.unspecified) {
>> +        return MEMTX_ACCESS_ERROR;
>> +    }
>> +
>> +    /* FIXME: PCIe bus remapping for attached endpoints. */
>> +    devid |= s->bus << 8;
>> +
>> +    ctx = riscv_iommu_ctx(s, devid, 0, &ref);
>> +    if (ctx == NULL) {
>> +        res = MEMTX_ACCESS_ERROR;
>> +    } else {
>> +        res = riscv_iommu_msi_write(s, ctx, addr, data, size, attrs);
>> +    }
>> +    riscv_iommu_ctx_put(s, ref);
>> +    return res;
>> +}
>> +
>> +static MemTxResult riscv_iommu_trap_read(void *opaque, hwaddr addr,
>> +    uint64_t *data, unsigned size, MemTxAttrs attrs)
>> +{
>> +    return MEMTX_ACCESS_ERROR;
>> +}
>> +
>> +static const MemoryRegionOps riscv_iommu_trap_ops = {
>> +    .read_with_attrs = riscv_iommu_trap_read,
>> +    .write_with_attrs = riscv_iommu_trap_write,
>> +    .endianness = DEVICE_LITTLE_ENDIAN,
>> +    .impl = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +        .unaligned = true,
>> +    },
>> +    .valid = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +    }
>> +};
>> +
>> +static void riscv_iommu_realize(DeviceState *dev, Error **errp)
>> +{
>> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
>> +
>> +    s->cap = s->version & RISCV_IOMMU_CAP_VERSION;
>> +    if (s->enable_msi) {
>> +        s->cap |= RISCV_IOMMU_CAP_MSI_FLAT | RISCV_IOMMU_CAP_MSI_MRIF;
>> +    }
>> +    /* Report QEMU target physical address space limits */
>> +    s->cap = set_field(s->cap, RISCV_IOMMU_CAP_PAS,
>> +                       TARGET_PHYS_ADDR_SPACE_BITS);
>> +
>> +    /* TODO: method to report supported PASID bits */
>> +    s->pasid_bits = 8; /* restricted to size of MemTxAttrs.pasid */
>> +    s->cap |= RISCV_IOMMU_CAP_PD8;
>> +
>> +    /* Out-of-reset translation mode: OFF (DMA disabled) or BARE (passthrough) */
>> +    s->ddtp = set_field(0, RISCV_IOMMU_DDTP_MODE, s->enable_off ?
>> +                        RISCV_IOMMU_DDTP_MODE_OFF : RISCV_IOMMU_DDTP_MODE_BARE);
>> +
>> +    /* register storage */
>> +    s->regs_rw = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
>> +    s->regs_ro = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
>> +    s->regs_wc = g_new0(uint8_t, RISCV_IOMMU_REG_SIZE);
>> +
>> +    /* Mark all registers read-only */
>> +    memset(s->regs_ro, 0xff, RISCV_IOMMU_REG_SIZE);
>> +
>> +    /*
>> +     * Register complete MMIO space, including MSI/PBA registers.
>> +     * Note, PCIDevice implementation will add overlapping MR for 
>> MSI/PBA,
>> +     * managed directly by the PCIDevice implementation.
>> +     */
>> +    memory_region_init_io(&s->regs_mr, OBJECT(dev), 
>> &riscv_iommu_mmio_ops, s,
>> +        "riscv-iommu-regs", RISCV_IOMMU_REG_SIZE);
>> +
>> +    /* Set power-on register state */
>> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_CAP], s->cap);
>> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_FCTL], 0);
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_DDTP],
>> +        ~(RISCV_IOMMU_DDTP_PPN | RISCV_IOMMU_DDTP_MODE));
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQB],
>> +        ~(RISCV_IOMMU_CQB_LOG2SZ | RISCV_IOMMU_CQB_PPN));
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQB],
>> +        ~(RISCV_IOMMU_FQB_LOG2SZ | RISCV_IOMMU_FQB_PPN));
>> +    stq_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQB],
>> +        ~(RISCV_IOMMU_PQB_LOG2SZ | RISCV_IOMMU_PQB_PPN));
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_CQCSR], 
>> RISCV_IOMMU_CQCSR_CQMF |
>> +        RISCV_IOMMU_CQCSR_CMD_TO | RISCV_IOMMU_CQCSR_CMD_ILL);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_CQCSR], 
>> RISCV_IOMMU_CQCSR_CQON |
>> +        RISCV_IOMMU_CQCSR_BUSY);
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_FQCSR], 
>> RISCV_IOMMU_FQCSR_FQMF |
>> +        RISCV_IOMMU_FQCSR_FQOF);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_FQCSR], 
>> RISCV_IOMMU_FQCSR_FQON |
>> +        RISCV_IOMMU_FQCSR_BUSY);
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_PQCSR], 
>> RISCV_IOMMU_PQCSR_PQMF |
>> +        RISCV_IOMMU_PQCSR_PQOF);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_PQCSR], 
>> RISCV_IOMMU_PQCSR_PQON |
>> +        RISCV_IOMMU_PQCSR_BUSY);
>> +    stl_le_p(&s->regs_wc[RISCV_IOMMU_REG_IPSR], ~0);
>> +    stl_le_p(&s->regs_ro[RISCV_IOMMU_REG_IVEC], 0);
>> +    stq_le_p(&s->regs_rw[RISCV_IOMMU_REG_DDTP], s->ddtp);
>> +
>> +    /* Memory region for downstream access, if specified. */
>> +    if (s->target_mr) {
>> +        s->target_as = g_new0(AddressSpace, 1);
>> +        address_space_init(s->target_as, s->target_mr,
>> +            "riscv-iommu-downstream");
>> +    } else {
>> +        /* Fallback to global system memory. */
>> +        s->target_as = &address_space_memory;
>> +    }
>> +
>> +    /* Memory region for untranslated MRIF/MSI writes */
>> +    memory_region_init_io(&s->trap_mr, OBJECT(dev), 
>> &riscv_iommu_trap_ops, s,
>> +            "riscv-iommu-trap", ~0ULL);
>> +    address_space_init(&s->trap_as, &s->trap_mr, 
>> "riscv-iommu-trap-as");
>> +
>> +    /* Device translation context cache */
>> +    s->ctx_cache = g_hash_table_new_full(__ctx_hash, __ctx_equal,
>> +                                         g_free, NULL);
>> +
>> +    s->iommus.le_next = NULL;
>> +    s->iommus.le_prev = NULL;
>> +    QLIST_INIT(&s->spaces);
>> +    qemu_mutex_init(&s->core_lock);
>> +    qemu_spin_init(&s->regs_lock);
>> +}
>> +
>> +static void riscv_iommu_unrealize(DeviceState *dev)
>> +{
>> +    RISCVIOMMUState *s = RISCV_IOMMU(dev);
>> +
>> +    qemu_mutex_destroy(&s->core_lock);
>> +    g_hash_table_unref(s->ctx_cache);
>> +}
>> +
>> +static Property riscv_iommu_properties[] = {
>> +    DEFINE_PROP_UINT32("version", RISCVIOMMUState, version,
>> +        RISCV_IOMMU_SPEC_DOT_VER),
>> +    DEFINE_PROP_UINT32("bus", RISCVIOMMUState, bus, 0x0),
>> +    DEFINE_PROP_BOOL("intremap", RISCVIOMMUState, enable_msi, TRUE),
>> +    DEFINE_PROP_BOOL("off", RISCVIOMMUState, enable_off, TRUE),
>> +    DEFINE_PROP_LINK("downstream-mr", RISCVIOMMUState, target_mr,
>> +        TYPE_MEMORY_REGION, MemoryRegion *),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void riscv_iommu_class_init(ObjectClass *klass, void* data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    /* internal device for riscv-iommu-{pci/sys}, not user-creatable */
>> +    dc->user_creatable = false;
>> +    dc->realize = riscv_iommu_realize;
>> +    dc->unrealize = riscv_iommu_unrealize;
>> +    device_class_set_props(dc, riscv_iommu_properties);
>> +}
>> +
>> +static const TypeInfo riscv_iommu_info = {
>> +    .name = TYPE_RISCV_IOMMU,
>> +    .parent = TYPE_DEVICE,
>> +    .instance_size = sizeof(RISCVIOMMUState),
>> +    .class_init = riscv_iommu_class_init,
>> +};
>> +
>> +static const char *IOMMU_FLAG_STR[] = {
>> +    "NA",
>> +    "RO",
>> +    "WR",
>> +    "RW",
>> +};
>> +
>> +/* RISC-V IOMMU Memory Region - Address Translation Space */
>> +static IOMMUTLBEntry riscv_iommu_memory_region_translate(
>> +    IOMMUMemoryRegion *iommu_mr, hwaddr addr,
>> +    IOMMUAccessFlags flag, int iommu_idx)
>> +{
>> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, 
>> iova_mr);
>> +    RISCVIOMMUContext *ctx;
>> +    void *ref;
>> +    IOMMUTLBEntry iotlb = {
>> +        .iova = addr,
>> +        .target_as = as->iommu->target_as,
>> +        .addr_mask = ~0ULL,
>> +        .perm = flag,
>> +    };
>> +
>> +    ctx = riscv_iommu_ctx(as->iommu, as->devid, iommu_idx, &ref);
>> +    if (ctx == NULL) {
>> +        /* Translation disabled or invalid. */
>> +        iotlb.addr_mask = 0;
>> +        iotlb.perm = IOMMU_NONE;
>> +    } else if (riscv_iommu_translate(as->iommu, ctx, &iotlb)) {
>> +        /* Translation disabled or fault reported. */
>> +        iotlb.addr_mask = 0;
>> +        iotlb.perm = IOMMU_NONE;
>> +    }
>> +
>> +    /* Trace all dma translations with original access flags. */
>> +    trace_riscv_iommu_dma(as->iommu->parent_obj.id, 
>> PCI_BUS_NUM(as->devid),
>> +                          PCI_SLOT(as->devid), PCI_FUNC(as->devid), 
>> iommu_idx,
>> +                          IOMMU_FLAG_STR[flag & IOMMU_RW], iotlb.iova,
>> +                          iotlb.translated_addr);
>> +
>> +    riscv_iommu_ctx_put(as->iommu, ref);
>> +
>> +    return iotlb;
>> +}
>> +
>> +static int riscv_iommu_memory_region_notify(
>> +    IOMMUMemoryRegion *iommu_mr, IOMMUNotifierFlag old,
>> +    IOMMUNotifierFlag new, Error **errp)
>> +{
>> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, 
>> iova_mr);
>> +
>> +    if (old == IOMMU_NOTIFIER_NONE) {
>> +        as->notifier = true;
>> +        trace_riscv_iommu_notifier_add(iommu_mr->parent_obj.name);
>> +    } else if (new == IOMMU_NOTIFIER_NONE) {
>> +        as->notifier = false;
>> +        trace_riscv_iommu_notifier_del(iommu_mr->parent_obj.name);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static inline bool pci_is_iommu(PCIDevice *pdev)
>> +{
>> +    return pci_get_word(pdev->config + PCI_CLASS_DEVICE) == 0x0806;
>> +}
>> +
>> +static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void *opaque, 
>> int devfn)
>> +{
>> +    RISCVIOMMUState *s = (RISCVIOMMUState *) opaque;
>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>> +    AddressSpace *as = NULL;
>> +
>> +    if (pdev && pci_is_iommu(pdev)) {
>> +        return s->target_as;
>> +    }
>> +
>> +    /* Find first registered IOMMU device */
>> +    while (s->iommus.le_prev) {
>> +        s = *(s->iommus.le_prev);
>> +    }
>> +
>> +    /* Find first matching IOMMU */
>> +    while (s != NULL && as == NULL) {
>> +        as = riscv_iommu_space(s, PCI_BUILD_BDF(pci_bus_num(bus), 
>> devfn));
>> +        s = s->iommus.le_next;
>> +    }
>> +
>> +    return as ? as : &address_space_memory;
>> +}
>> +
>> +static const PCIIOMMUOps riscv_iommu_ops = {
>> +    .get_address_space = riscv_iommu_find_as,
>> +};
>> +
>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
>> +        Error **errp)
>> +{
>> +    if (bus->iommu_ops &&
>> +        bus->iommu_ops->get_address_space == riscv_iommu_find_as) {
>> +        /* Allow multiple IOMMUs on the same PCIe bus, link known 
>> devices */
>> +        RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque;
>> +        QLIST_INSERT_AFTER(last, iommu, iommus);
>> +    } else if (!bus->iommu_ops && !bus->iommu_opaque) {
>> +        pci_setup_iommu(bus, &riscv_iommu_ops, iommu);
>> +    } else {
>> +        error_setg(errp, "can't register secondary IOMMU for PCI bus 
>> #%d",
>> +            pci_bus_num(bus));
>> +    }
>> +}
>> +
>> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion *iommu_mr,
>> +    MemTxAttrs attrs)
>> +{
>> +    return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid;
>> +}
>> +
>> +static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion 
>> *iommu_mr)
>> +{
>> +    RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, 
>> iova_mr);
>> +    return 1 << as->iommu->pasid_bits;
>> +}
>> +
>> +static void riscv_iommu_memory_region_init(ObjectClass *klass, void 
>> *data)
>> +{
>> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
>> +
>> +    imrc->translate = riscv_iommu_memory_region_translate;
>> +    imrc->notify_flag_changed = riscv_iommu_memory_region_notify;
>> +    imrc->attrs_to_index = riscv_iommu_memory_region_index;
>> +    imrc->num_indexes = riscv_iommu_memory_region_index_len;
>> +}
>> +
>> +static const TypeInfo riscv_iommu_memory_region_info = {
>> +    .parent = TYPE_IOMMU_MEMORY_REGION,
>> +    .name = TYPE_RISCV_IOMMU_MEMORY_REGION,
>> +    .class_init = riscv_iommu_memory_region_init,
>> +};
>> +
>> +static void riscv_iommu_register_mr_types(void)
>> +{
>> +    type_register_static(&riscv_iommu_memory_region_info);
>> +    type_register_static(&riscv_iommu_info);
>> +}
>> +
>> +type_init(riscv_iommu_register_mr_types);
>> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h
>> new file mode 100644
>> index 0000000000..31d3907d33
>> --- /dev/null
>> +++ b/hw/riscv/riscv-iommu.h
>> @@ -0,0 +1,141 @@
>> +/*
>> + * QEMU emulation of an RISC-V IOMMU
>> + *
>> + * Copyright (C) 2022-2023 Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License 
>> along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_RISCV_IOMMU_STATE_H
>> +#define HW_RISCV_IOMMU_STATE_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +
>> +#include "hw/riscv/iommu.h"
>> +
>> +struct RISCVIOMMUState {
>> +    /*< private >*/
>> +    DeviceState parent_obj;
>> +
>> +    /*< public >*/
>> +    uint32_t version;     /* Reported interface version number */
>> +    uint32_t pasid_bits;  /* process identifier width */
>> +    uint32_t bus;         /* PCI bus mapping for non-root endpoints */
>> +
>> +    uint64_t cap;         /* IOMMU supported capabilities */
>> +    uint64_t fctl;        /* IOMMU enabled features */
>> +
>> +    bool enable_off;      /* Enable out-of-reset OFF mode (DMA 
>> disabled) */
>> +    bool enable_msi;      /* Enable MSI remapping */
>> +
>> +    /* IOMMU Internal State */
>> +    uint64_t ddtp;        /* Validated Device Directory Tree Root 
>> Pointer */
>> +
>> +    dma_addr_t cq_addr;   /* Command queue base physical address */
>> +    dma_addr_t fq_addr;   /* Fault/event queue base physical address */
>> +    dma_addr_t pq_addr;   /* Page request queue base physical 
>> address */
>> +
>> +    uint32_t cq_mask;     /* Command queue index bit mask */
>> +    uint32_t fq_mask;     /* Fault/event queue index bit mask */
>> +    uint32_t pq_mask;     /* Page request queue index bit mask */
>> +
>> +    /* interrupt notifier */
>> +    void (*notify)(RISCVIOMMUState *iommu, unsigned vector);
>> +
>> +    /* IOMMU State Machine */
>> +    QemuThread core_proc; /* Background processing thread */
>> +    QemuMutex core_lock;  /* Global IOMMU lock, used for cache/regs 
>> updates */
>> +    QemuCond core_cond;   /* Background processing wake up signal */
>> +    unsigned core_exec;   /* Processing thread execution actions */
>> +
>> +    /* IOMMU target address space */
>> +    AddressSpace *target_as;
>> +    MemoryRegion *target_mr;
>> +
>> +    /* MSI / MRIF access trap */
>> +    AddressSpace trap_as;
>> +    MemoryRegion trap_mr;
>> +
>> +    GHashTable *ctx_cache;          /* Device translation Context 
>> Cache */
>> +
>> +    /* MMIO Hardware Interface */
>> +    MemoryRegion regs_mr;
>> +    QemuSpin regs_lock;
>> +    uint8_t *regs_rw;  /* register state (user write) */
>> +    uint8_t *regs_wc;  /* write-1-to-clear mask */
>> +    uint8_t *regs_ro;  /* read-only mask */
>> +
>> +    QLIST_ENTRY(RISCVIOMMUState) iommus;
>> +    QLIST_HEAD(, RISCVIOMMUSpace) spaces;
>> +};
>> +
>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus,
>> +         Error **errp);
>> +
>> +/* private helpers */
>> +
>> +/* Register helper functions */
>> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s,
>> +    unsigned idx, uint32_t set, uint32_t clr)
>> +{
>> +    uint32_t val;
>> +    qemu_spin_lock(&s->regs_lock);
>> +    val = ldl_le_p(s->regs_rw + idx);
>> +    stl_le_p(s->regs_rw + idx, (val & ~clr) | set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +    return val;
>> +}
>> +
>> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s,
>> +    unsigned idx, uint32_t set)
>> +{
>> +    qemu_spin_lock(&s->regs_lock);
>> +    stl_le_p(s->regs_rw + idx, set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +}
>> +
>> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s,
>> +    unsigned idx)
>> +{
>> +    return ldl_le_p(s->regs_rw + idx);
>> +}
>> +
>> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s,
>> +    unsigned idx, uint64_t set, uint64_t clr)
>> +{
>> +    uint64_t val;
>> +    qemu_spin_lock(&s->regs_lock);
>> +    val = ldq_le_p(s->regs_rw + idx);
>> +    stq_le_p(s->regs_rw + idx, (val & ~clr) | set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +    return val;
>> +}
>> +
>> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s,
>> +    unsigned idx, uint64_t set)
>> +{
>> +    qemu_spin_lock(&s->regs_lock);
>> +    stq_le_p(s->regs_rw + idx, set);
>> +    qemu_spin_unlock(&s->regs_lock);
>> +}
>> +
>> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s,
>> +    unsigned idx)
>> +{
>> +    return ldq_le_p(s->regs_rw + idx);
>> +}
>> +
>> +
>> +
>> +#endif
>> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events
>> new file mode 100644
>> index 0000000000..42a97caffa
>> --- /dev/null
>> +++ b/hw/riscv/trace-events
>> @@ -0,0 +1,11 @@
>> +# See documentation at docs/devel/tracing.rst
>> +
>> +# riscv-iommu.c
>> +riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) "%s: device attached %04x:%02x.%d"
>> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" iova: 0x%"PRIx64
>> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64
>> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64
>> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" -> 0x%"PRIx64
>> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command 0x%"PRIx64" 0x%"PRIx64
>> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier added"
>> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier removed"
>> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h
>> new file mode 100644
>> index 0000000000..8c0e3ca1f3
>> --- /dev/null
>> +++ b/hw/riscv/trace.h
>> @@ -0,0 +1 @@
>> +#include "trace/trace-hw_riscv.h"
>> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h
>> new file mode 100644
>> index 0000000000..070ee69973
>> --- /dev/null
>> +++ b/include/hw/riscv/iommu.h
>> @@ -0,0 +1,36 @@
>> +/*
>> + * QEMU emulation of an RISC-V IOMMU
>> + *
>> + * Copyright (C) 2022-2023 Rivos Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License 
>> along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_RISCV_IOMMU_H
>> +#define HW_RISCV_IOMMU_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "qom/object.h"
>> +
>> +#define TYPE_RISCV_IOMMU "riscv-iommu"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU)
>> +typedef struct RISCVIOMMUState RISCVIOMMUState;
>> +
>> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr"
>> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace;
>> +
>> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI)
>> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci;
>> +
>> +#endif
>> diff --git a/meson.build b/meson.build
>> index a9de71d450..8099d8271c 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -3319,6 +3319,7 @@ if have_system
>>       'hw/pci-host',
>>       'hw/ppc',
>>       'hw/rtc',
>> +    'hw/riscv',
>>       'hw/s390x',
>>       'hw/scsi',
>>       'hw/sd',
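
For readers following the register write path quoted above: the handler combines three byte arrays (regs_rw, regs_ro, regs_wc) with one expression, repeated per access size. The standalone sketch below isolates that expression for the 32-bit case. The function names and the mask values in the self-check are hypothetical and only illustrate the arithmetic; they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Merge a guest write into a 32-bit register image, using the same
 * expression as the quoted write handler:
 *   ro - bits the guest cannot change (kept from the current value)
 *   wc - write-1-to-clear bits (a 1 written here clears the bit)
 * Hypothetical helper name, for illustration only.
 */
static inline uint32_t reg_merge_write32(uint32_t rw, uint32_t ro,
                                         uint32_t wc, uint32_t data)
{
    uint32_t kept    = rw & ro;      /* read-only bits survive the write */
    uint32_t written = data & ~ro;   /* remaining bits come from the guest */
    uint32_t cleared = data & wc;    /* W1C bits the guest wrote as 1 */

    return (kept | written) & ~cleared;
}

/* Self-check with made-up masks: ro = 0xC0, wc = 0x30. */
static inline void reg_merge_selfcheck(void)
{
    /* Bits 7:6 are kept, bit 4 is written as 1 and therefore cleared. */
    assert(reg_merge_write32(0xF5, 0xC0, 0x30, 0x1A) == 0xCA);
    /* A read-only bit is not affected by writing 0 to it. */
    assert(reg_merge_write32(0x80, 0x80, 0x00, 0x00) == 0x80);
}
```

One property that follows directly from the expression: a bit set in wc but not in ro ends up 0 after any write that covers it, whether the guest wrote 0 or 1, because the written value replaces the old one before the W1C mask is applied.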



* Re: [PATCH v3 09/13] hw/riscv/riscv-iommu: add s-stage and g-stage support
  2024-06-18 10:30   ` Jason Chien
@ 2024-06-21 11:58     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-21 11:58 UTC (permalink / raw)
  To: Jason Chien, qemu-devel
  Cc: qemu-riscv, alistair.francis, bmeng, liwei1518, zhiwei_liu,
	palmer, tjeznach, ajones, frank.chang



On 6/18/24 7:30 AM, Jason Chien wrote:
> Hi Daniel,
> 
> On 2024/5/24 01:39 AM, Daniel Henrique Barboza wrote:
>> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>>
>> Add support for s-stage (sv32, sv39, sv48, sv57 caps) and g-stage
>> (sv32x4, sv39x4, sv48x4, sv57x4 caps). Most of the work is done in the
>> riscv_iommu_spa_fetch() function that now has to consider how many
>> translation stages we need to walk the page table.
>>
>> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
>> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
>> ---
>>   hw/riscv/riscv-iommu-bits.h |  11 ++
>>   hw/riscv/riscv-iommu.c      | 331 +++++++++++++++++++++++++++++++++++-
>>   hw/riscv/riscv-iommu.h      |   2 +
>>   3 files changed, 336 insertions(+), 8 deletions(-)
>>

(...)

>>       /* Set translation context. */
>>       ctx->tc = le64_to_cpu(dc.tc);
>> +    ctx->gatp = le64_to_cpu(dc.iohgatp);
>> +    ctx->satp = le64_to_cpu(dc.fsc);
>>       ctx->ta = le64_to_cpu(dc.ta);
>>       ctx->msiptp = le64_to_cpu(dc.msiptp);
>>       ctx->msi_addr_mask = le64_to_cpu(dc.msi_addr_mask);
>> @@ -564,14 +842,38 @@ static int riscv_iommu_ctx_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx)
>>           return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
>>       }
>> +    /* FSC field checks */
>> +    mode = get_field(ctx->satp, RISCV_IOMMU_DC_FSC_MODE);
>> +    addr = PPN_PHYS(get_field(ctx->satp, RISCV_IOMMU_DC_FSC_PPN));
>> +
>> +    if (mode == RISCV_IOMMU_DC_FSC_MODE_BARE) {
> According to section 2.3, if the function returns here, some necessary checks are skipped. I think this "if" block should be moved down, after the "if (ctx->pasid == RISCV_IOMMU_NOPASID) {...}" block.
>> +        /* No S-Stage translation, done. */
>> +        return 0;
>> +    }
>> +
>>       if (!(ctx->tc & RISCV_IOMMU_DC_TC_PDTV)) {
>>           if (ctx->pasid != RISCV_IOMMU_NOPASID) {
>>               /* PASID is disabled */
>>               return RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED;
>>           }
>> +        if (mode > RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57) {
>> +            /* Invalid translation mode */
>> +            return RISCV_IOMMU_FQ_CAUSE_DDT_INVALID;
>> +        }
>>           return 0;
>>       }
>> +    if (ctx->pasid == RISCV_IOMMU_NOPASID) {
>> +        if (!(ctx->tc & RISCV_IOMMU_DC_TC_DPE)) {
>> +            /* No default PASID enabled, set BARE mode */
>> +            ctx->satp = 0ULL;
>> +            return 0;
>> +        } else {
>> +            /* Use default PASID #0 */
>> +            ctx->pasid = 0;
>> +        }
>> +    }
>> +
> return if mode is bare.

I agree. I just moved the 'if (bare) return' check to this point, as you
suggested.
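
The resulting order of checks can be sketched as follows. This is a minimal standalone model, not the code from the series: the struct, the enum and the constant values are hypothetical stand-ins for RISCVIOMMUContext, the RISCV_IOMMU_FQ_CAUSE_* codes and the RISCV_IOMMU_DC_FSC_* modes, and FETCH_WALK_PDT only marks the point where the rest of riscv_iommu_ctx_fetch() would continue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's constants; values are illustrative. */
#define NOPASID    UINT32_MAX
#define MODE_BARE  0u
#define MODE_SV57  10u

enum fetch_result {
    FETCH_DONE = 0,        /* context usable, nothing more to look up */
    FETCH_WALK_PDT,        /* continue with the process directory lookup */
    FETCH_TTYPE_BLOCKED,   /* stands in for RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED */
    FETCH_DDT_INVALID,     /* stands in for RISCV_IOMMU_FQ_CAUSE_DDT_INVALID */
};

struct ctx_sketch {
    bool pdtv;       /* DC.tc.PDTV */
    bool dpe;        /* DC.tc.DPE */
    uint32_t pasid;  /* process id of the request, or NOPASID */
    uint32_t mode;   /* DC.fsc.MODE */
    uint64_t satp;   /* copy of DC.fsc */
};

static inline enum fetch_result fsc_checks(struct ctx_sketch *ctx)
{
    assert(ctx);

    if (!ctx->pdtv) {
        if (ctx->pasid != NOPASID) {
            return FETCH_TTYPE_BLOCKED;   /* PASID is disabled */
        }
        if (ctx->mode > MODE_SV57) {
            return FETCH_DDT_INVALID;     /* invalid translation mode */
        }
        return FETCH_DONE;
    }

    if (ctx->pasid == NOPASID) {
        if (!ctx->dpe) {
            ctx->satp = 0;                /* no default PASID: BARE mode */
            return FETCH_DONE;
        }
        ctx->pasid = 0;                   /* use default PASID #0 */
    }

    /* BARE check, now placed after the PASID handling. */
    if (ctx->mode == MODE_BARE) {
        return FETCH_DONE;
    }
    return FETCH_WALK_PDT;
}
```

With the BARE check placed first, as in v3, a device context with PDTV clear and BARE mode returns 0 even when the request carries a PASID. In the ordering above the same input reports the blocked transaction type instead.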


Thanks,


Daniel




end of thread, other threads:[~2024-06-21 11:59 UTC | newest]

Thread overview: 38+ messages
2024-05-23 17:39 [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 01/13] exec/memtxattr: add process identifier to the transaction attributes Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 02/13] hw/riscv: add riscv-iommu-bits.h Daniel Henrique Barboza
2024-05-28  6:41   ` Eric Cheng
2024-06-05 22:21     ` Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation Daniel Henrique Barboza
2024-05-30  1:39   ` Eric Cheng
2024-06-06 19:46     ` Daniel Henrique Barboza
2024-06-11 16:15   ` Jason Chien
2024-06-12  9:53     ` Daniel Henrique Barboza
2024-06-18 10:06   ` Jason Chien
2024-06-18 15:15     ` Jason Chien
2024-05-23 17:39 ` [PATCH v3 04/13] pci-ids.rst: add Red Hat pci-id for RISC-V IOMMU device Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 05/13] hw/riscv: add riscv-iommu-pci reference device Daniel Henrique Barboza
2024-06-09  8:53   ` Frank Chang
2024-05-23 17:39 ` [PATCH v3 06/13] hw/riscv/virt.c: support for RISC-V IOMMU PCIDevice hotplug Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 07/13] test/qtest: add riscv-iommu-pci tests Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 08/13] hw/riscv/riscv-iommu: add Address Translation Cache (IOATC) Daniel Henrique Barboza
2024-06-05 17:34   ` Tomasz Jeznach
2024-06-07  8:30     ` Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 09/13] hw/riscv/riscv-iommu: add s-stage and g-stage support Daniel Henrique Barboza
2024-06-18 10:30   ` Jason Chien
2024-06-21 11:58     ` Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 10/13] hw/riscv/riscv-iommu: add ATS support Daniel Henrique Barboza
2024-06-09  9:06   ` Frank Chang
2024-05-23 17:39 ` [PATCH v3 11/13] hw/riscv/riscv-iommu: add DBG support Daniel Henrique Barboza
2024-06-09  9:09   ` Frank Chang
2024-05-23 17:39 ` [PATCH v3 12/13] hw/riscv/riscv-iommu: Add another irq for mrif notifications Daniel Henrique Barboza
2024-05-23 17:39 ` [PATCH v3 13/13] qtest/riscv-iommu-test: add init queues test Daniel Henrique Barboza
2024-06-10  0:34 ` [PATCH v3 00/13] riscv: QEMU RISC-V IOMMU Support Alistair Francis
2024-06-10 18:32   ` Andrew Jones
2024-06-10 19:16     ` Daniel Henrique Barboza
2024-06-11  0:18       ` Alistair Francis
2024-06-11  1:51 ` LIU Zhiwei
2024-06-11 10:13   ` Daniel Henrique Barboza
2024-06-12  7:50     ` LIU Zhiwei
2024-06-12 12:10       ` Daniel Henrique Barboza
2024-06-14 13:22         ` LIU Zhiwei
