* [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization
@ 2014-08-29 7:17 Knut Omang
2014-08-29 7:17 ` [Qemu-devel] [PATCH 1/4] pci: Update pci_regs header Knut Omang
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: Knut Omang @ 2014-08-29 7:17 UTC
To: qemu-devel
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi,
Michael S. Tsirkin, Laszlo Ersek, Gabriel Somlo, Knut Omang,
Alexander Graf, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong
This patch set consists of two parts:
- The first two patches implement SR/IOV building blocks in pcie/pci.
I have held this patch back for a while because I haven't had a good
example test case to accompany it, but in the light of the latest
developments such as the discussion we have had around ARI and downstream
switches and root ports, I think it would be valuable to get this in,
to make it easier for people to experiment with creating devices with
many functions. Hopefully, I can also get some help to fix the hotplug
issues described below.
- The last two patches contain an example to illustrate the usage
of the new SR/IOV support. The example leverages the e1000
code and the fact that the registers of the E1000 and the PCIe-based
Intel 82576 Gigabit Ethernet Adapter (which supports SR/IOV) are quite similar,
but so far it considers few of the differences beyond the
bare minimum needed to trick the igb driver into loading to the point
where VFs can be enabled.
So you cannot yet use these PF or VF Ethernet devices to send Ethernet packets,
and the implementation does not in any way attempt to model the internals
of igb, such as the multiple queues or any multiplexing onto the same device.
It only instantiates the VFs as well as the PFs, mostly directly by inheritance
(and as separate devices) from E1000, but it should hopefully be relatively
easy to understand how to proceed to make "true" VFs.
It was also a nice exercise in using QOM.
The changes to E1000 to accommodate igb (patch 3) should be fairly
non-intrusive. Nevertheless, I suppose they should not be applied
unless they will eventually lead to a new derived device that is
different enough from E1000 to qualify for a separate set of source files.
So if someone with more detailed knowledge of the internal differences
between igb and e1000 on the functional level has input, that would be
great. An alternative would of course be to apply only the first two
patches and leave any usage examples for the future.
To test and see how it plays out, you can add something like this
to the command line:
-device ioh3420,slot=0,id=pcie_port.0
-device igb,mac=DE:AD:BE:EE:04:18,vlan=1,bus=pcie_port.0
-net tap,vlan=1
The Linux igb driver does not yet support changing the VF setup via sysfs
(the preferred way since kernel v3.8), so to see some VFs in the guest,
you need to set up a modprobe file with something like this:
options igb max_vfs=4
For some reason I don't yet understand, removal of the igb driver
does not cause a 0 write to the SR/IOV vf_enable bit to disable the VFs again, e.g.:
# lspci | grep '^01:00'
01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
01:00.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
01:00.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
01:00.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
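The function numbers in the listing above are what the SR/IOV routing ID rule would give if the PF advertised a First VF Offset of 1 and a VF Stride of 1. Those two values are an assumption inferred from the lspci output, not something stated in this series, and vf_routing_id() is a hypothetical helper name. A minimal sketch of the rule:

```c
#include <stdint.h>

/*
 * Routing ID (bus << 8 | devfn) of the n-th VF, counted from 1, derived
 * from the PF's routing ID and the First VF Offset / VF Stride fields of
 * its SR/IOV capability. Routing IDs are 16 bits wide, so the result is
 * truncated to 16 bits.
 */
static uint16_t vf_routing_id(uint16_t pf_rid, uint16_t first_vf_offset,
                              uint16_t vf_stride, uint16_t n)
{
    uint32_t rid = pf_rid + first_vf_offset + (uint32_t)(n - 1) * vf_stride;

    return (uint16_t)rid;
}
```

With the PF at 01:00.0 (routing ID 0x0100), offset 1 and stride 1, VF 1 lands on 0x0101 (01:00.1) and VF 4 on 0x0104 (01:00.4).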
Writing directly to vf_enable using setpci works, though after the later hotplug
changes it works a little too well: after the VF removals have succeeded,
a slot power down somehow gets triggered, which also removes the PF:
# setpci -s 01:00.0 168.b
0x19
# setpci -s 01:00.0 168.b=18
# lspci | grep '^01:00'
<nothing>
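The two setpci values above differ only in bit 0. A minimal sketch of that bit manipulation follows, assuming offset 0x168 is the low byte of the SR/IOV Control register (which the text above implies by calling it vf_enable). The bit names follow Linux's pci_regs.h, and the two helper functions are hypothetical:

```c
#include <stdint.h>

/* Bits of the SR/IOV Control register, named as in Linux's pci_regs.h. */
#define PCI_SRIOV_CTRL_VFE  0x01 /* VF Enable */
#define PCI_SRIOV_CTRL_VFM  0x02 /* VF Migration Enable */
#define PCI_SRIOV_CTRL_INTR 0x04 /* VF Migration Interrupt Enable */
#define PCI_SRIOV_CTRL_MSE  0x08 /* VF Memory Space Enable */
#define PCI_SRIOV_CTRL_ARI  0x10 /* ARI Capable Hierarchy */

/* Nonzero if the VF Enable bit is set in the given control byte. */
static int sriov_ctrl_vfs_enabled(uint8_t ctrl)
{
    return (ctrl & PCI_SRIOV_CTRL_VFE) != 0;
}

/* Clear only VF Enable and leave the other control bits untouched. */
static uint8_t sriov_ctrl_disable_vfs(uint8_t ctrl)
{
    return (uint8_t)(ctrl & ~PCI_SRIOV_CTRL_VFE);
}
```

Read this way, 0x19 is VFE | MSE | ARI, and writing 0x18 clears VF Enable while keeping VF Memory Space Enable and ARI Capable Hierarchy set.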
For similar reasons, when the igb device is attached directly to the root complex
(just remove the bus parameter above), attempts to enable VFs will fail with
qdev.c:89: bus_add_child: Assertion `bus->allow_hotplug' failed
I imagine this could for instance be a matter of defining a property "subfunction"
or similar that allows qdev (QOM?) to discriminate between main devices and devices
representing subfunctions of another main device. Suggestions on how to proceed with this
are welcome.
This patch set depends (by diff context only) on my patch:
[PATCH v2 1/4] pcie: Fix incorrect write to the ari capability next function field
and for stability on Paolo's
[PATCH] pci_bridge: manually destroy memory regions within PCIBridgeWindows
both of which Michael has pulled, but which are not in master yet.
Thanks,
Knut Omang (4):
pci: Update pci_regs header
pcie: Add support for Single Root I/O Virtualization (SR/IOV)
e1000: Refactor to allow subclassing from other source file
igb: Example code to illustrate the SR/IOV support.
hw/i386/kvm/pci-assign.c | 4 +-
hw/misc/vfio.c | 8 +-
hw/net/Makefile.objs | 2 +-
hw/net/e1000.c | 126 ++--------------
hw/net/e1000.h | 139 +++++++++++++++++
hw/net/igb.c | 293 ++++++++++++++++++++++++++++++++++++
hw/net/igb_regs.h | 27 ++++
hw/pci/msi.c | 4 -
hw/pci/msix.c | 2 +-
hw/pci/pci.c | 107 +++++++++----
hw/pci/pcie.c | 205 ++++++++++++++++++++++++-
include/hw/pci/pci.h | 6 +-
include/hw/pci/pci_regs.h | 371 ++++++++++++++++++++++++++++++++++------------
include/hw/pci/pcie.h | 26 ++++
14 files changed, 1072 insertions(+), 248 deletions(-)
create mode 100644 hw/net/e1000.h
create mode 100644 hw/net/igb.c
create mode 100644 hw/net/igb_regs.h
--
1.9.0
* [Qemu-devel] [PATCH 1/4] pci: Update pci_regs header
2014-08-29 7:17 [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization Knut Omang
@ 2014-08-29 7:17 ` Knut Omang
2014-08-29 7:17 ` [Qemu-devel] [PATCH 2/4] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Knut Omang @ 2014-08-29 7:17 UTC
To: qemu-devel
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi,
Michael S. Tsirkin, Laszlo Ersek, Gabriel Somlo, Knut Omang,
Alexander Graf, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong
Pull in the latest version from kernel v3.16.
msi: Remove local definitions that are now in pci_regs.h.
Also replace usage of PCI_MSIX_FLAGS_BIRMASK with PCI_MSIX_TABLE_BIR
to keep in sync with the kernel for future updates.
Signed-off-by: Knut Omang <knut.omang@oracle.com>
---
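For readers unfamiliar with the MSI-X Table Offset/BIR register, here is a minimal sketch of how the renamed mask splits it. The two macro values are the ones this patch adds to pci_regs.h; the helper functions are hypothetical and are not part of QEMU:

```c
#include <stdint.h>

#define PCI_MSIX_TABLE_BIR    0x00000007 /* BAR index */
#define PCI_MSIX_TABLE_OFFSET 0xfffffff8 /* Offset into specified BAR */

/* BAR number that holds the MSI-X table (low three bits). */
static unsigned int msix_table_bar(uint32_t table_reg)
{
    return table_reg & PCI_MSIX_TABLE_BIR;
}

/* 8-byte-aligned offset of the MSI-X table inside that BAR. */
static uint32_t msix_table_offset(uint32_t table_reg)
{
    return table_reg & PCI_MSIX_TABLE_OFFSET;
}
```

For a register value of 0x00002003 this gives BAR 3 and offset 0x2000. In 32 bits, ~PCI_MSIX_TABLE_BIR equals PCI_MSIX_TABLE_OFFSET, so the `& ~PCI_MSIX_TABLE_BIR` expressions in the hunks below compute the same offset.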
hw/i386/kvm/pci-assign.c | 4 +-
hw/misc/vfio.c | 8 +-
hw/pci/msi.c | 4 -
hw/pci/msix.c | 2 +-
include/hw/pci/pci_regs.h | 371 ++++++++++++++++++++++++++++++++++------------
5 files changed, 285 insertions(+), 104 deletions(-)
diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index 17c7d6dc..a2285c4 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -1321,8 +1321,8 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev, Error **errp)
PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
msix_table_entry = pci_get_long(pci_dev->config + pos + PCI_MSIX_TABLE);
- bar_nr = msix_table_entry & PCI_MSIX_FLAGS_BIRMASK;
- msix_table_entry &= ~PCI_MSIX_FLAGS_BIRMASK;
+ bar_nr = msix_table_entry & PCI_MSIX_TABLE_BIR;
+ msix_table_entry &= ~PCI_MSIX_TABLE_BIR;
dev->msix_table_addr = pci_region[bar_nr].base_addr + msix_table_entry;
dev->msix_max = msix_max;
}
diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 0617b70..984c9e2 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -2795,10 +2795,10 @@ static int vfio_early_setup_msix(VFIODevice *vdev)
pba = le32_to_cpu(pba);
vdev->msix = g_malloc0(sizeof(*(vdev->msix)));
- vdev->msix->table_bar = table & PCI_MSIX_FLAGS_BIRMASK;
- vdev->msix->table_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
- vdev->msix->pba_bar = pba & PCI_MSIX_FLAGS_BIRMASK;
- vdev->msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK;
+ vdev->msix->table_bar = table & PCI_MSIX_TABLE_BIR;
+ vdev->msix->table_offset = table & ~PCI_MSIX_TABLE_BIR;
+ vdev->msix->pba_bar = pba & PCI_MSIX_TABLE_BIR;
+ vdev->msix->pba_offset = pba & ~PCI_MSIX_TABLE_BIR;
vdev->msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
DPRINTF("%04x:%02x:%02x.%x "
diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 52d2313..1d25b62 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -21,10 +21,6 @@
#include "hw/pci/msi.h"
#include "qemu/range.h"
-/* Eventually those constants should go to Linux pci_regs.h */
-#define PCI_MSI_PENDING_32 0x10
-#define PCI_MSI_PENDING_64 0x14
-
/* PCI_MSI_ADDRESS_LO */
#define PCI_MSI_ADDRESS_LO_MASK (~0x3)
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 24de260..8206264 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -250,7 +250,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
ranges_overlap(table_offset, table_size, pba_offset, pba_size)) ||
table_offset + table_size > memory_region_size(table_bar) ||
pba_offset + pba_size > memory_region_size(pba_bar) ||
- (table_offset | pba_offset) & PCI_MSIX_FLAGS_BIRMASK) {
+ (table_offset | pba_offset) & PCI_MSIX_TABLE_BIR) {
return -EINVAL;
}
diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 56a404b..30db069 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -13,10 +13,10 @@
* PCI to PCI Bridge Specification
* PCI System Design Guide
*
- * For hypertransport information, please consult the following manuals
- * from http://www.hypertransport.org
+ * For HyperTransport information, please consult the following manuals
+ * from http://www.hypertransport.org
*
- * The Hypertransport I/O Link Specification
+ * The HyperTransport I/O Link Specification
*/
#ifndef LINUX_PCI_REGS_H
@@ -26,6 +26,7 @@
* Under PCI, each device has 256 bytes of configuration address space,
* of which the first 64 bytes are standardized as follows:
*/
+#define PCI_STD_HEADER_SIZEOF 64
#define PCI_VENDOR_ID 0x00 /* 16 bits */
#define PCI_DEVICE_ID 0x02 /* 16 bits */
#define PCI_COMMAND 0x04 /* 16 bits */
@@ -36,7 +37,7 @@
#define PCI_COMMAND_INVALIDATE 0x10 /* Use memory write and invalidate */
#define PCI_COMMAND_VGA_PALETTE 0x20 /* Enable palette snooping */
#define PCI_COMMAND_PARITY 0x40 /* Enable parity checking */
-#define PCI_COMMAND_WAIT 0x80 /* Enable address/data stepping */
+#define PCI_COMMAND_WAIT 0x80 /* Enable address/data stepping */
#define PCI_COMMAND_SERR 0x100 /* Enable SERR */
#define PCI_COMMAND_FAST_BACK 0x200 /* Enable back-to-back writes */
#define PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Emulation Disable */
@@ -44,7 +45,7 @@
#define PCI_STATUS 0x06 /* 16 bits */
#define PCI_STATUS_INTERRUPT 0x08 /* Interrupt status */
#define PCI_STATUS_CAP_LIST 0x10 /* Support Capability List */
-#define PCI_STATUS_66MHZ 0x20 /* Support 66 Mhz PCI 2.1 bus */
+#define PCI_STATUS_66MHZ 0x20 /* Support 66 MHz PCI 2.1 bus */
#define PCI_STATUS_UDF 0x40 /* Support User Definable Features [obsolete] */
#define PCI_STATUS_FAST_BACK 0x80 /* Accept fast-back to back */
#define PCI_STATUS_PARITY 0x100 /* Detected parity error */
@@ -125,7 +126,8 @@
#define PCI_IO_RANGE_TYPE_MASK 0x0fUL /* I/O bridging type */
#define PCI_IO_RANGE_TYPE_16 0x00
#define PCI_IO_RANGE_TYPE_32 0x01
-#define PCI_IO_RANGE_MASK (~0x0fUL)
+#define PCI_IO_RANGE_MASK (~0x0fUL) /* Standard 4K I/O windows */
+#define PCI_IO_1K_RANGE_MASK (~0x03UL) /* Intel 1K I/O windows */
#define PCI_SEC_STATUS 0x1e /* Secondary status register, only bit 14 used */
#define PCI_MEMORY_BASE 0x20 /* Memory range behind */
#define PCI_MEMORY_LIMIT 0x22
@@ -203,16 +205,18 @@
#define PCI_CAP_ID_CHSWP 0x06 /* CompactPCI HotSwap */
#define PCI_CAP_ID_PCIX 0x07 /* PCI-X */
#define PCI_CAP_ID_HT 0x08 /* HyperTransport */
-#define PCI_CAP_ID_VNDR 0x09 /* Vendor specific */
+#define PCI_CAP_ID_VNDR 0x09 /* Vendor-Specific */
#define PCI_CAP_ID_DBG 0x0A /* Debug port */
#define PCI_CAP_ID_CCRC 0x0B /* CompactPCI Central Resource Control */
-#define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */
+#define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */
#define PCI_CAP_ID_SSVID 0x0D /* Bridge subsystem vendor/device ID */
#define PCI_CAP_ID_AGP3 0x0E /* AGP Target PCI-PCI bridge */
-#define PCI_CAP_ID_EXP 0x10 /* PCI Express */
+#define PCI_CAP_ID_SECDEV 0x0F /* Secure Device */
+#define PCI_CAP_ID_EXP 0x10 /* PCI Express */
#define PCI_CAP_ID_MSIX 0x11 /* MSI-X */
-#define PCI_CAP_ID_SATA 0x12 /* Serial ATA */
+#define PCI_CAP_ID_SATA 0x12 /* SATA Data/Index Conf. */
#define PCI_CAP_ID_AF 0x13 /* PCI Advanced Features */
+#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
#define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */
#define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */
#define PCI_CAP_SIZEOF 4
@@ -264,8 +268,8 @@
#define PCI_AGP_COMMAND_RQ_MASK 0xff000000 /* Master: Maximum number of requests */
#define PCI_AGP_COMMAND_SBA 0x0200 /* Sideband addressing enabled */
#define PCI_AGP_COMMAND_AGP 0x0100 /* Allow processing of AGP transactions */
-#define PCI_AGP_COMMAND_64BIT 0x0020 /* Allow processing of 64-bit addresses */
-#define PCI_AGP_COMMAND_FW 0x0010 /* Force FW transfers */
+#define PCI_AGP_COMMAND_64BIT 0x0020 /* Allow processing of 64-bit addresses */
+#define PCI_AGP_COMMAND_FW 0x0010 /* Force FW transfers */
#define PCI_AGP_COMMAND_RATE4 0x0004 /* Use 4x rate */
#define PCI_AGP_COMMAND_RATE2 0x0002 /* Use 2x rate */
#define PCI_AGP_COMMAND_RATE1 0x0001 /* Use 1x rate */
@@ -277,6 +281,7 @@
#define PCI_VPD_ADDR_MASK 0x7fff /* Address mask */
#define PCI_VPD_ADDR_F 0x8000 /* Write 0, 1 indicates completion */
#define PCI_VPD_DATA 4 /* 32-bits of data returned here */
+#define PCI_CAP_VPD_SIZEOF 8
/* Slot Identification */
@@ -287,30 +292,36 @@
/* Message Signalled Interrupts registers */
-#define PCI_MSI_FLAGS 2 /* Various flags */
-#define PCI_MSI_FLAGS_64BIT 0x80 /* 64-bit addresses allowed */
-#define PCI_MSI_FLAGS_QSIZE 0x70 /* Message queue size configured */
-#define PCI_MSI_FLAGS_QMASK 0x0e /* Maximum queue size available */
-#define PCI_MSI_FLAGS_ENABLE 0x01 /* MSI feature enabled */
-#define PCI_MSI_FLAGS_MASKBIT 0x100 /* 64-bit mask bits allowed */
+#define PCI_MSI_FLAGS 2 /* Message Control */
+#define PCI_MSI_FLAGS_ENABLE 0x0001 /* MSI feature enabled */
+#define PCI_MSI_FLAGS_QMASK 0x000e /* Maximum queue size available */
+#define PCI_MSI_FLAGS_QSIZE 0x0070 /* Message queue size configured */
+#define PCI_MSI_FLAGS_64BIT 0x0080 /* 64-bit addresses allowed */
+#define PCI_MSI_FLAGS_MASKBIT 0x0100 /* Per-vector masking capable */
#define PCI_MSI_RFU 3 /* Rest of capability flags */
#define PCI_MSI_ADDRESS_LO 4 /* Lower 32 bits */
#define PCI_MSI_ADDRESS_HI 8 /* Upper 32 bits (if PCI_MSI_FLAGS_64BIT set) */
#define PCI_MSI_DATA_32 8 /* 16 bits of data for 32-bit devices */
#define PCI_MSI_MASK_32 12 /* Mask bits register for 32-bit devices */
+#define PCI_MSI_PENDING_32 16 /* Pending intrs for 32-bit devices */
#define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */
#define PCI_MSI_MASK_64 16 /* Mask bits register for 64-bit devices */
+#define PCI_MSI_PENDING_64 20 /* Pending intrs for 64-bit devices */
/* MSI-X registers */
-#define PCI_MSIX_FLAGS 2
-#define PCI_MSIX_FLAGS_QSIZE 0x7FF
-#define PCI_MSIX_FLAGS_ENABLE (1 << 15)
-#define PCI_MSIX_FLAGS_MASKALL (1 << 14)
-#define PCI_MSIX_TABLE 4
-#define PCI_MSIX_PBA 8
-#define PCI_MSIX_FLAGS_BIRMASK (7 << 0)
-
-/* MSI-X entry's format */
+#define PCI_MSIX_FLAGS 2 /* Message Control */
+#define PCI_MSIX_FLAGS_QSIZE 0x07FF /* Table size */
+#define PCI_MSIX_FLAGS_MASKALL 0x4000 /* Mask all vectors for this function */
+#define PCI_MSIX_FLAGS_ENABLE 0x8000 /* MSI-X enable */
+#define PCI_MSIX_TABLE 4 /* Table offset */
+#define PCI_MSIX_TABLE_BIR 0x00000007 /* BAR index */
+#define PCI_MSIX_TABLE_OFFSET 0xfffffff8 /* Offset into specified BAR */
+#define PCI_MSIX_PBA 8 /* Pending Bit Array offset */
+#define PCI_MSIX_PBA_BIR 0x00000007 /* BAR index */
+#define PCI_MSIX_PBA_OFFSET 0xfffffff8 /* Offset into specified BAR */
+#define PCI_CAP_MSIX_SIZEOF 12 /* size of MSIX registers */
+
+/* MSI-X Table entry format */
#define PCI_MSIX_ENTRY_SIZE 16
#define PCI_MSIX_ENTRY_LOWER_ADDR 0
#define PCI_MSIX_ENTRY_UPPER_ADDR 4
@@ -339,8 +350,9 @@
#define PCI_AF_CTRL_FLR 0x01
#define PCI_AF_STATUS 5
#define PCI_AF_STATUS_TP 0x01
+#define PCI_CAP_AF_SIZEOF 6 /* size of AF registers */
-/* PCI-X registers */
+/* PCI-X registers (Type 0 (non-bridge) devices) */
#define PCI_X_CMD 2 /* Modes & Features */
#define PCI_X_CMD_DPERR_E 0x0001 /* Data Parity Error Recovery Enable */
@@ -360,7 +372,7 @@
#define PCI_X_CMD_SPLIT_16 0x0060 /* Max 16 */
#define PCI_X_CMD_SPLIT_32 0x0070 /* Max 32 */
#define PCI_X_CMD_MAX_SPLIT 0x0070 /* Max Outstanding Split Transactions */
-#define PCI_X_CMD_VERSION(x) (((x) >> 12) & 3) /* Version */
+#define PCI_X_CMD_VERSION(x) (((x) >> 12) & 3) /* Version */
#define PCI_X_STATUS 4 /* PCI-X capabilities */
#define PCI_X_STATUS_DEVFN 0x000000ff /* A copy of devfn */
#define PCI_X_STATUS_BUS 0x0000ff00 /* A copy of bus nr */
@@ -375,11 +387,28 @@
#define PCI_X_STATUS_SPL_ERR 0x20000000 /* Rcvd Split Completion Error Msg */
#define PCI_X_STATUS_266MHZ 0x40000000 /* 266 MHz capable */
#define PCI_X_STATUS_533MHZ 0x80000000 /* 533 MHz capable */
+#define PCI_X_ECC_CSR 8 /* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0 8 /* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V1 24 /* size for Version 1 */
+#define PCI_CAP_PCIX_SIZEOF_V2 PCI_CAP_PCIX_SIZEOF_V1 /* Same for v2 */
+
+/* PCI-X registers (Type 1 (bridge) devices) */
+
+#define PCI_X_BRIDGE_SSTATUS 2 /* Secondary Status */
+#define PCI_X_SSTATUS_64BIT 0x0001 /* Secondary AD interface is 64 bits */
+#define PCI_X_SSTATUS_133MHZ 0x0002 /* 133 MHz capable */
+#define PCI_X_SSTATUS_FREQ 0x03c0 /* Secondary Bus Mode and Frequency */
+#define PCI_X_SSTATUS_VERS 0x3000 /* PCI-X Capability Version */
+#define PCI_X_SSTATUS_V1 0x1000 /* Mode 2, not Mode 1 */
+#define PCI_X_SSTATUS_V2 0x2000 /* Mode 1 or Modes 1 and 2 */
+#define PCI_X_SSTATUS_266MHZ 0x4000 /* 266 MHz capable */
+#define PCI_X_SSTATUS_533MHZ 0x8000 /* 533 MHz capable */
+#define PCI_X_BRIDGE_STATUS 4 /* Bridge Status */
/* PCI Bridge Subsystem ID registers */
-#define PCI_SSVID_VENDOR_ID 4 /* PCI-Bridge subsystem vendor id register */
-#define PCI_SSVID_DEVICE_ID 6 /* PCI-Bridge subsystem device id register */
+#define PCI_SSVID_VENDOR_ID 4 /* PCI Bridge subsystem vendor ID */
+#define PCI_SSVID_DEVICE_ID 6 /* PCI Bridge subsystem device ID */
/* PCI Express capability registers */
@@ -391,24 +420,24 @@
#define PCI_EXP_TYPE_ROOT_PORT 0x4 /* Root Port */
#define PCI_EXP_TYPE_UPSTREAM 0x5 /* Upstream Port */
#define PCI_EXP_TYPE_DOWNSTREAM 0x6 /* Downstream Port */
-#define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCI/PCI-X Bridge */
-#define PCI_EXP_TYPE_PCIE_BRIDGE 0x8 /* PCI/PCI-X to PCIE Bridge */
+#define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCIe to PCI/PCI-X Bridge */
+#define PCI_EXP_TYPE_PCIE_BRIDGE 0x8 /* PCI/PCI-X to PCIe Bridge */
#define PCI_EXP_TYPE_RC_END 0x9 /* Root Complex Integrated Endpoint */
-#define PCI_EXP_TYPE_RC_EC 0xa /* Root Complex Event Collector */
+#define PCI_EXP_TYPE_RC_EC 0xa /* Root Complex Event Collector */
#define PCI_EXP_FLAGS_SLOT 0x0100 /* Slot implemented */
#define PCI_EXP_FLAGS_IRQ 0x3e00 /* Interrupt message number */
#define PCI_EXP_DEVCAP 4 /* Device capabilities */
-#define PCI_EXP_DEVCAP_PAYLOAD 0x07 /* Max_Payload_Size */
-#define PCI_EXP_DEVCAP_PHANTOM 0x18 /* Phantom functions */
-#define PCI_EXP_DEVCAP_EXT_TAG 0x20 /* Extended tags */
-#define PCI_EXP_DEVCAP_L0S 0x1c0 /* L0s Acceptable Latency */
-#define PCI_EXP_DEVCAP_L1 0xe00 /* L1 Acceptable Latency */
-#define PCI_EXP_DEVCAP_ATN_BUT 0x1000 /* Attention Button Present */
-#define PCI_EXP_DEVCAP_ATN_IND 0x2000 /* Attention Indicator Present */
-#define PCI_EXP_DEVCAP_PWR_IND 0x4000 /* Power Indicator Present */
-#define PCI_EXP_DEVCAP_RBER 0x8000 /* Role-Based Error Reporting */
-#define PCI_EXP_DEVCAP_PWR_VAL 0x3fc0000 /* Slot Power Limit Value */
-#define PCI_EXP_DEVCAP_PWR_SCL 0xc000000 /* Slot Power Limit Scale */
+#define PCI_EXP_DEVCAP_PAYLOAD 0x00000007 /* Max_Payload_Size */
+#define PCI_EXP_DEVCAP_PHANTOM 0x00000018 /* Phantom functions */
+#define PCI_EXP_DEVCAP_EXT_TAG 0x00000020 /* Extended tags */
+#define PCI_EXP_DEVCAP_L0S 0x000001c0 /* L0s Acceptable Latency */
+#define PCI_EXP_DEVCAP_L1 0x00000e00 /* L1 Acceptable Latency */
+#define PCI_EXP_DEVCAP_ATN_BUT 0x00001000 /* Attention Button Present */
+#define PCI_EXP_DEVCAP_ATN_IND 0x00002000 /* Attention Indicator Present */
+#define PCI_EXP_DEVCAP_PWR_IND 0x00004000 /* Power Indicator Present */
+#define PCI_EXP_DEVCAP_RBER 0x00008000 /* Role-Based Error Reporting */
+#define PCI_EXP_DEVCAP_PWR_VAL 0x03fc0000 /* Slot Power Limit Value */
+#define PCI_EXP_DEVCAP_PWR_SCL 0x0c000000 /* Slot Power Limit Scale */
#define PCI_EXP_DEVCAP_FLR 0x10000000 /* Function Level Reset */
#define PCI_EXP_DEVCTL 8 /* Device Control */
#define PCI_EXP_DEVCTL_CERE 0x0001 /* Correctable Error Reporting En. */
@@ -424,45 +453,55 @@
#define PCI_EXP_DEVCTL_READRQ 0x7000 /* Max_Read_Request_Size */
#define PCI_EXP_DEVCTL_BCR_FLR 0x8000 /* Bridge Configuration Retry / FLR */
#define PCI_EXP_DEVSTA 10 /* Device Status */
-#define PCI_EXP_DEVSTA_CED 0x01 /* Correctable Error Detected */
-#define PCI_EXP_DEVSTA_NFED 0x02 /* Non-Fatal Error Detected */
-#define PCI_EXP_DEVSTA_FED 0x04 /* Fatal Error Detected */
-#define PCI_EXP_DEVSTA_URD 0x08 /* Unsupported Request Detected */
-#define PCI_EXP_DEVSTA_AUXPD 0x10 /* AUX Power Detected */
-#define PCI_EXP_DEVSTA_TRPND 0x20 /* Transactions Pending */
+#define PCI_EXP_DEVSTA_CED 0x0001 /* Correctable Error Detected */
+#define PCI_EXP_DEVSTA_NFED 0x0002 /* Non-Fatal Error Detected */
+#define PCI_EXP_DEVSTA_FED 0x0004 /* Fatal Error Detected */
+#define PCI_EXP_DEVSTA_URD 0x0008 /* Unsupported Request Detected */
+#define PCI_EXP_DEVSTA_AUXPD 0x0010 /* AUX Power Detected */
+#define PCI_EXP_DEVSTA_TRPND 0x0020 /* Transactions Pending */
#define PCI_EXP_LNKCAP 12 /* Link Capabilities */
#define PCI_EXP_LNKCAP_SLS 0x0000000f /* Supported Link Speeds */
+#define PCI_EXP_LNKCAP_SLS_2_5GB 0x00000001 /* LNKCAP2 SLS Vector bit 0 */
+#define PCI_EXP_LNKCAP_SLS_5_0GB 0x00000002 /* LNKCAP2 SLS Vector bit 1 */
#define PCI_EXP_LNKCAP_MLW 0x000003f0 /* Maximum Link Width */
#define PCI_EXP_LNKCAP_ASPMS 0x00000c00 /* ASPM Support */
#define PCI_EXP_LNKCAP_L0SEL 0x00007000 /* L0s Exit Latency */
#define PCI_EXP_LNKCAP_L1EL 0x00038000 /* L1 Exit Latency */
-#define PCI_EXP_LNKCAP_CLKPM 0x00040000 /* L1 Clock Power Management */
+#define PCI_EXP_LNKCAP_CLKPM 0x00040000 /* Clock Power Management */
#define PCI_EXP_LNKCAP_SDERC 0x00080000 /* Surprise Down Error Reporting Capable */
#define PCI_EXP_LNKCAP_DLLLARC 0x00100000 /* Data Link Layer Link Active Reporting Capable */
#define PCI_EXP_LNKCAP_LBNC 0x00200000 /* Link Bandwidth Notification Capability */
#define PCI_EXP_LNKCAP_PN 0xff000000 /* Port Number */
#define PCI_EXP_LNKCTL 16 /* Link Control */
#define PCI_EXP_LNKCTL_ASPMC 0x0003 /* ASPM Control */
+#define PCI_EXP_LNKCTL_ASPM_L0S 0x0001 /* L0s Enable */
+#define PCI_EXP_LNKCTL_ASPM_L1 0x0002 /* L1 Enable */
#define PCI_EXP_LNKCTL_RCB 0x0008 /* Read Completion Boundary */
#define PCI_EXP_LNKCTL_LD 0x0010 /* Link Disable */
#define PCI_EXP_LNKCTL_RL 0x0020 /* Retrain Link */
#define PCI_EXP_LNKCTL_CCC 0x0040 /* Common Clock Configuration */
#define PCI_EXP_LNKCTL_ES 0x0080 /* Extended Synch */
-#define PCI_EXP_LNKCTL_CLKREQ_EN 0x100 /* Enable clkreq */
+#define PCI_EXP_LNKCTL_CLKREQ_EN 0x0100 /* Enable clkreq */
#define PCI_EXP_LNKCTL_HAWD 0x0200 /* Hardware Autonomous Width Disable */
#define PCI_EXP_LNKCTL_LBMIE 0x0400 /* Link Bandwidth Management Interrupt Enable */
-#define PCI_EXP_LNKCTL_LABIE 0x0800 /* Lnk Autonomous Bandwidth Interrupt Enable */
+#define PCI_EXP_LNKCTL_LABIE 0x0800 /* Link Autonomous Bandwidth Interrupt Enable */
#define PCI_EXP_LNKSTA 18 /* Link Status */
#define PCI_EXP_LNKSTA_CLS 0x000f /* Current Link Speed */
-#define PCI_EXP_LNKSTA_CLS_2_5GB 0x01 /* Current Link Speed 2.5GT/s */
-#define PCI_EXP_LNKSTA_CLS_5_0GB 0x02 /* Current Link Speed 5.0GT/s */
-#define PCI_EXP_LNKSTA_NLW 0x03f0 /* Nogotiated Link Width */
+#define PCI_EXP_LNKSTA_CLS_2_5GB 0x0001 /* Current Link Speed 2.5GT/s */
+#define PCI_EXP_LNKSTA_CLS_5_0GB 0x0002 /* Current Link Speed 5.0GT/s */
+#define PCI_EXP_LNKSTA_CLS_8_0GB 0x0003 /* Current Link Speed 8.0GT/s */
+#define PCI_EXP_LNKSTA_NLW 0x03f0 /* Negotiated Link Width */
+#define PCI_EXP_LNKSTA_NLW_X1 0x0010 /* Current Link Width x1 */
+#define PCI_EXP_LNKSTA_NLW_X2 0x0020 /* Current Link Width x2 */
+#define PCI_EXP_LNKSTA_NLW_X4 0x0040 /* Current Link Width x4 */
+#define PCI_EXP_LNKSTA_NLW_X8 0x0080 /* Current Link Width x8 */
#define PCI_EXP_LNKSTA_NLW_SHIFT 4 /* start of NLW mask in link status */
#define PCI_EXP_LNKSTA_LT 0x0800 /* Link Training */
#define PCI_EXP_LNKSTA_SLC 0x1000 /* Slot Clock Configuration */
#define PCI_EXP_LNKSTA_DLLLA 0x2000 /* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS 0x4000 /* Link Bandwidth Management Status */
#define PCI_EXP_LNKSTA_LABS 0x8000 /* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 20 /* v1 endpoints end here */
#define PCI_EXP_SLTCAP 20 /* Slot Capabilities */
#define PCI_EXP_SLTCAP_ABP 0x00000001 /* Attention Button Present */
#define PCI_EXP_SLTCAP_PCP 0x00000002 /* Power Controller Present */
@@ -484,8 +523,16 @@
#define PCI_EXP_SLTCTL_CCIE 0x0010 /* Command Completed Interrupt Enable */
#define PCI_EXP_SLTCTL_HPIE 0x0020 /* Hot-Plug Interrupt Enable */
#define PCI_EXP_SLTCTL_AIC 0x00c0 /* Attention Indicator Control */
+#define PCI_EXP_SLTCTL_ATTN_IND_ON 0x0040 /* Attention Indicator on */
+#define PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator blinking */
+#define PCI_EXP_SLTCTL_ATTN_IND_OFF 0x00c0 /* Attention Indicator off */
#define PCI_EXP_SLTCTL_PIC 0x0300 /* Power Indicator Control */
+#define PCI_EXP_SLTCTL_PWR_IND_ON 0x0100 /* Power Indicator on */
+#define PCI_EXP_SLTCTL_PWR_IND_BLINK 0x0200 /* Power Indicator blinking */
+#define PCI_EXP_SLTCTL_PWR_IND_OFF 0x0300 /* Power Indicator off */
#define PCI_EXP_SLTCTL_PCC 0x0400 /* Power Controller Control */
+#define PCI_EXP_SLTCTL_PWR_ON 0x0000 /* Power On */
+#define PCI_EXP_SLTCTL_PWR_OFF 0x0400 /* Power Off */
#define PCI_EXP_SLTCTL_EIC 0x0800 /* Electromechanical Interlock Control */
#define PCI_EXP_SLTCTL_DLLSCE 0x1000 /* Data Link Layer State Changed Enable */
#define PCI_EXP_SLTSTA 26 /* Slot Status */
@@ -499,52 +546,93 @@
#define PCI_EXP_SLTSTA_EIS 0x0080 /* Electromechanical Interlock Status */
#define PCI_EXP_SLTSTA_DLLSC 0x0100 /* Data Link Layer State Changed */
#define PCI_EXP_RTCTL 28 /* Root Control */
-#define PCI_EXP_RTCTL_SECEE 0x01 /* System Error on Correctable Error */
-#define PCI_EXP_RTCTL_SENFEE 0x02 /* System Error on Non-Fatal Error */
-#define PCI_EXP_RTCTL_SEFEE 0x04 /* System Error on Fatal Error */
-#define PCI_EXP_RTCTL_PMEIE 0x08 /* PME Interrupt Enable */
-#define PCI_EXP_RTCTL_CRSSVE 0x10 /* CRS Software Visibility Enable */
+#define PCI_EXP_RTCTL_SECEE 0x0001 /* System Error on Correctable Error */
+#define PCI_EXP_RTCTL_SENFEE 0x0002 /* System Error on Non-Fatal Error */
+#define PCI_EXP_RTCTL_SEFEE 0x0004 /* System Error on Fatal Error */
+#define PCI_EXP_RTCTL_PMEIE 0x0008 /* PME Interrupt Enable */
+#define PCI_EXP_RTCTL_CRSSVE 0x0010 /* CRS Software Visibility Enable */
#define PCI_EXP_RTCAP 30 /* Root Capabilities */
#define PCI_EXP_RTSTA 32 /* Root Status */
-#define PCI_EXP_RTSTA_PME 0x10000 /* PME status */
-#define PCI_EXP_RTSTA_PENDING 0x20000 /* PME pending */
+#define PCI_EXP_RTSTA_PME 0x00010000 /* PME status */
+#define PCI_EXP_RTSTA_PENDING 0x00020000 /* PME pending */
+/*
+ * The Device Capabilities 2, Device Status 2, Device Control 2,
+ * Link Capabilities 2, Link Status 2, Link Control 2,
+ * Slot Capabilities 2, Slot Status 2, and Slot Control 2 registers
+ * are only present on devices with PCIe Capability version 2.
+ * Use pcie_capability_read_word() and similar interfaces to use them
+ * safely.
+ */
#define PCI_EXP_DEVCAP2 36 /* Device Capabilities 2 */
-#define PCI_EXP_DEVCAP2_ARI 0x20 /* Alternative Routing-ID */
-#define PCI_EXP_DEVCAP2_LTR 0x800 /* Latency tolerance reporting */
-#define PCI_EXP_OBFF_MASK 0xc0000 /* OBFF support mechanism */
-#define PCI_EXP_OBFF_MSG 0x40000 /* New message signaling */
-#define PCI_EXP_OBFF_WAKE 0x80000 /* Re-use WAKE# for OBFF */
+#define PCI_EXP_DEVCAP2_ARI 0x00000020 /* Alternative Routing-ID */
+#define PCI_EXP_DEVCAP2_LTR 0x00000800 /* Latency tolerance reporting */
+#define PCI_EXP_DEVCAP2_OBFF_MASK 0x000c0000 /* OBFF support mechanism */
+#define PCI_EXP_DEVCAP2_OBFF_MSG 0x00040000 /* New message signaling */
+#define PCI_EXP_DEVCAP2_OBFF_WAKE 0x00080000 /* Re-use WAKE# for OBFF */
#define PCI_EXP_DEVCTL2 40 /* Device Control 2 */
-#define PCI_EXP_DEVCTL2_ARI 0x20 /* Alternative Routing-ID */
-#define PCI_EXP_IDO_REQ_EN 0x100 /* ID-based ordering request enable */
-#define PCI_EXP_IDO_CMP_EN 0x200 /* ID-based ordering completion enable */
-#define PCI_EXP_LTR_EN 0x400 /* Latency tolerance reporting */
-#define PCI_EXP_OBFF_MSGA_EN 0x2000 /* OBFF enable with Message type A */
-#define PCI_EXP_OBFF_MSGB_EN 0x4000 /* OBFF enable with Message type B */
-#define PCI_EXP_OBFF_WAKE_EN 0x6000 /* OBFF using WAKE# signaling */
+#define PCI_EXP_DEVCTL2_COMP_TIMEOUT 0x000f /* Completion Timeout Value */
+#define PCI_EXP_DEVCTL2_ARI 0x0020 /* Alternative Routing-ID */
+#define PCI_EXP_DEVCTL2_IDO_REQ_EN 0x0100 /* Allow IDO for requests */
+#define PCI_EXP_DEVCTL2_IDO_CMP_EN 0x0200 /* Allow IDO for completions */
+#define PCI_EXP_DEVCTL2_LTR_EN 0x0400 /* Enable LTR mechanism */
+#define PCI_EXP_DEVCTL2_OBFF_MSGA_EN 0x2000 /* Enable OBFF Message type A */
+#define PCI_EXP_DEVCTL2_OBFF_MSGB_EN 0x4000 /* Enable OBFF Message type B */
+#define PCI_EXP_DEVCTL2_OBFF_WAKE_EN 0x6000 /* OBFF using WAKE# signaling */
+#define PCI_EXP_DEVSTA2 42 /* Device Status 2 */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 44 /* v2 endpoints end here */
+#define PCI_EXP_LNKCAP2 44 /* Link Capabilities 2 */
+#define PCI_EXP_LNKCAP2_SLS_2_5GB 0x00000002 /* Supported Speed 2.5GT/s */
+#define PCI_EXP_LNKCAP2_SLS_5_0GB 0x00000004 /* Supported Speed 5.0GT/s */
+#define PCI_EXP_LNKCAP2_SLS_8_0GB 0x00000008 /* Supported Speed 8.0GT/s */
+#define PCI_EXP_LNKCAP2_CROSSLINK 0x00000100 /* Crosslink supported */
#define PCI_EXP_LNKCTL2 48 /* Link Control 2 */
+#define PCI_EXP_LNKSTA2 50 /* Link Status 2 */
+#define PCI_EXP_SLTCAP2 52 /* Slot Capabilities 2 */
#define PCI_EXP_SLTCTL2 56 /* Slot Control 2 */
+#define PCI_EXP_SLTSTA2 58 /* Slot Status 2 */
/* Extended Capabilities (PCI-X 2.0 and Express) */
#define PCI_EXT_CAP_ID(header) (header & 0x0000ffff)
#define PCI_EXT_CAP_VER(header) ((header >> 16) & 0xf)
#define PCI_EXT_CAP_NEXT(header) ((header >> 20) & 0xffc)
-#define PCI_EXT_CAP_ID_ERR 1
-#define PCI_EXT_CAP_ID_VC 2
-#define PCI_EXT_CAP_ID_DSN 3
-#define PCI_EXT_CAP_ID_PWR 4
-#define PCI_EXT_CAP_ID_VNDR 11
-#define PCI_EXT_CAP_ID_ACS 13
-#define PCI_EXT_CAP_ID_ARI 14
-#define PCI_EXT_CAP_ID_ATS 15
-#define PCI_EXT_CAP_ID_SRIOV 16
-#define PCI_EXT_CAP_ID_LTR 24
+#define PCI_EXT_CAP_ID_ERR 0x01 /* Advanced Error Reporting */
+#define PCI_EXT_CAP_ID_VC 0x02 /* Virtual Channel Capability */
+#define PCI_EXT_CAP_ID_DSN 0x03 /* Device Serial Number */
+#define PCI_EXT_CAP_ID_PWR 0x04 /* Power Budgeting */
+#define PCI_EXT_CAP_ID_RCLD 0x05 /* Root Complex Link Declaration */
+#define PCI_EXT_CAP_ID_RCILC 0x06 /* Root Complex Internal Link Control */
+#define PCI_EXT_CAP_ID_RCEC 0x07 /* Root Complex Event Collector */
+#define PCI_EXT_CAP_ID_MFVC 0x08 /* Multi-Function VC Capability */
+#define PCI_EXT_CAP_ID_VC9 0x09 /* same as _VC */
+#define PCI_EXT_CAP_ID_RCRB 0x0A /* Root Complex RB? */
+#define PCI_EXT_CAP_ID_VNDR 0x0B /* Vendor-Specific */
+#define PCI_EXT_CAP_ID_CAC 0x0C /* Config Access - obsolete */
+#define PCI_EXT_CAP_ID_ACS 0x0D /* Access Control Services */
+#define PCI_EXT_CAP_ID_ARI 0x0E /* Alternate Routing ID */
+#define PCI_EXT_CAP_ID_ATS 0x0F /* Address Translation Services */
+#define PCI_EXT_CAP_ID_SRIOV 0x10 /* Single Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MRIOV 0x11 /* Multi Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MCAST 0x12 /* Multicast */
+#define PCI_EXT_CAP_ID_PRI 0x13 /* Page Request Interface */
+#define PCI_EXT_CAP_ID_AMD_XXX 0x14 /* Reserved for AMD */
+#define PCI_EXT_CAP_ID_REBAR 0x15 /* Resizable BAR */
+#define PCI_EXT_CAP_ID_DPA 0x16 /* Dynamic Power Allocation */
+#define PCI_EXT_CAP_ID_TPH 0x17 /* TPH Requester */
+#define PCI_EXT_CAP_ID_LTR 0x18 /* Latency Tolerance Reporting */
+#define PCI_EXT_CAP_ID_SECPCI 0x19 /* Secondary PCIe Capability */
+#define PCI_EXT_CAP_ID_PMUX 0x1A /* Protocol Multiplexing */
+#define PCI_EXT_CAP_ID_PASID 0x1B /* Process Address Space ID */
+#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PASID
+
+#define PCI_EXT_CAP_DSN_SIZEOF 12
+#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
/* Advanced Error Reporting */
#define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */
#define PCI_ERR_UNC_TRAIN 0x00000001 /* Training */
#define PCI_ERR_UNC_DLP 0x00000010 /* Data Link Protocol */
+#define PCI_ERR_UNC_SURPDN 0x00000020 /* Surprise Down */
#define PCI_ERR_UNC_POISON_TLP 0x00001000 /* Poisoned TLP */
#define PCI_ERR_UNC_FCP 0x00002000 /* Flow Control Protocol */
#define PCI_ERR_UNC_COMP_TIME 0x00004000 /* Completion Timeout */
@@ -554,6 +642,11 @@
#define PCI_ERR_UNC_MALF_TLP 0x00040000 /* Malformed TLP */
#define PCI_ERR_UNC_ECRC 0x00080000 /* ECRC Error Status */
#define PCI_ERR_UNC_UNSUP 0x00100000 /* Unsupported Request */
+#define PCI_ERR_UNC_ACSV 0x00200000 /* ACS Violation */
+#define PCI_ERR_UNC_INTN 0x00400000 /* internal error */
+#define PCI_ERR_UNC_MCBTLP 0x00800000 /* MC blocked TLP */
+#define PCI_ERR_UNC_ATOMEG 0x01000000 /* Atomic egress blocked */
+#define PCI_ERR_UNC_TLPPRE 0x02000000 /* TLP prefix blocked */
#define PCI_ERR_UNCOR_MASK 8 /* Uncorrectable Error Mask */
/* Same bits as above */
#define PCI_ERR_UNCOR_SEVER 12 /* Uncorrectable Error Severity */
@@ -564,6 +657,9 @@
#define PCI_ERR_COR_BAD_DLLP 0x00000080 /* Bad DLLP Status */
#define PCI_ERR_COR_REP_ROLL 0x00000100 /* REPLAY_NUM Rollover */
#define PCI_ERR_COR_REP_TIMER 0x00001000 /* Replay Timer Timeout */
+#define PCI_ERR_COR_ADV_NFAT 0x00002000 /* Advisory Non-Fatal */
+#define PCI_ERR_COR_INTERNAL 0x00004000 /* Corrected Internal */
+#define PCI_ERR_COR_LOG_OVER 0x00008000 /* Header Log Overflow */
#define PCI_ERR_COR_MASK 20 /* Correctable Error Mask */
/* Same bits as above */
#define PCI_ERR_CAP 24 /* Advanced Error Capabilities */
@@ -584,9 +680,9 @@
#define PCI_ERR_ROOT_COR_RCV 0x00000001 /* ERR_COR Received */
/* Multi ERR_COR Received */
#define PCI_ERR_ROOT_MULTI_COR_RCV 0x00000002
-/* ERR_FATAL/NONFATAL Recevied */
+/* ERR_FATAL/NONFATAL Received */
#define PCI_ERR_ROOT_UNCOR_RCV 0x00000004
-/* Multi ERR_FATAL/NONFATAL Recevied */
+/* Multi ERR_FATAL/NONFATAL Received */
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV 0x00000008
#define PCI_ERR_ROOT_FIRST_FATAL 0x00000010 /* First Fatal */
#define PCI_ERR_ROOT_NONFATAL_RCV 0x00000020 /* Non-Fatal Received */
@@ -594,13 +690,36 @@
#define PCI_ERR_ROOT_ERR_SRC 52 /* Error Source Identification */
/* Virtual Channel */
-#define PCI_VC_PORT_REG1 4
-#define PCI_VC_PORT_REG2 8
+#define PCI_VC_PORT_CAP1 4
+#define PCI_VC_CAP1_EVCC 0x00000007 /* extended VC count */
+#define PCI_VC_CAP1_LPEVCC 0x00000070 /* low prio extended VC count */
+#define PCI_VC_CAP1_ARB_SIZE 0x00000c00
+#define PCI_VC_PORT_CAP2 8
+#define PCI_VC_CAP2_32_PHASE 0x00000002
+#define PCI_VC_CAP2_64_PHASE 0x00000004
+#define PCI_VC_CAP2_128_PHASE 0x00000008
+#define PCI_VC_CAP2_ARB_OFF 0xff000000
#define PCI_VC_PORT_CTRL 12
+#define PCI_VC_PORT_CTRL_LOAD_TABLE 0x00000001
#define PCI_VC_PORT_STATUS 14
+#define PCI_VC_PORT_STATUS_TABLE 0x00000001
#define PCI_VC_RES_CAP 16
+#define PCI_VC_RES_CAP_32_PHASE 0x00000002
+#define PCI_VC_RES_CAP_64_PHASE 0x00000004
+#define PCI_VC_RES_CAP_128_PHASE 0x00000008
+#define PCI_VC_RES_CAP_128_PHASE_TB 0x00000010
+#define PCI_VC_RES_CAP_256_PHASE 0x00000020
+#define PCI_VC_RES_CAP_ARB_OFF 0xff000000
#define PCI_VC_RES_CTRL 20
+#define PCI_VC_RES_CTRL_LOAD_TABLE 0x00010000
+#define PCI_VC_RES_CTRL_ARB_SELECT 0x000e0000
+#define PCI_VC_RES_CTRL_ID 0x07000000
+#define PCI_VC_RES_CTRL_ENABLE 0x80000000
#define PCI_VC_RES_STATUS 26
+#define PCI_VC_RES_STATUS_TABLE 0x00000001
+#define PCI_VC_RES_STATUS_NEGO 0x00000002
+#define PCI_CAP_VC_BASE_SIZEOF 0x10
+#define PCI_CAP_VC_PER_VC_SIZEOF 0x0C
/* Power Budgeting */
#define PCI_PWR_DSR 4 /* Data Select Register */
@@ -613,9 +732,16 @@
#define PCI_PWR_DATA_RAIL(x) (((x) >> 18) & 7) /* Power Rail */
#define PCI_PWR_CAP 12 /* Capability */
#define PCI_PWR_CAP_BUDGET(x) ((x) & 1) /* Included in system budget */
+#define PCI_EXT_CAP_PWR_SIZEOF 16
+
+/* Vendor-Specific (VSEC, PCI_EXT_CAP_ID_VNDR) */
+#define PCI_VNDR_HEADER 4 /* Vendor-Specific Header */
+#define PCI_VNDR_HEADER_ID(x) ((x) & 0xffff)
+#define PCI_VNDR_HEADER_REV(x) (((x) >> 16) & 0xf)
+#define PCI_VNDR_HEADER_LEN(x) (((x) >> 20) & 0xfff)
/*
- * Hypertransport sub capability types
+ * HyperTransport sub capability types
*
* Unfortunately there are both 3 bit and 5 bit capability types defined
* in the HT spec, catering for that is a little messy. You probably don't
@@ -643,8 +769,10 @@
#define HT_CAPTYPE_DIRECT_ROUTE 0xB0 /* Direct routing configuration */
#define HT_CAPTYPE_VCSET 0xB8 /* Virtual Channel configuration */
#define HT_CAPTYPE_ERROR_RETRY 0xC0 /* Retry on error configuration */
-#define HT_CAPTYPE_GEN3 0xD0 /* Generation 3 hypertransport configuration */
-#define HT_CAPTYPE_PM 0xE0 /* Hypertransport powermanagement configuration */
+#define HT_CAPTYPE_GEN3 0xD0 /* Generation 3 HyperTransport configuration */
+#define HT_CAPTYPE_PM 0xE0 /* HyperTransport power management configuration */
+#define HT_CAP_SIZEOF_LONG 28 /* slave & primary */
+#define HT_CAP_SIZEOF_SHORT 24 /* host & secondary */
/* Alternative Routing-ID Interpretation */
#define PCI_ARI_CAP 0x04 /* ARI Capability Register */
@@ -655,6 +783,7 @@
#define PCI_ARI_CTRL_MFVC 0x0001 /* MFVC Function Groups Enable */
#define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */
#define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */
+#define PCI_EXT_CAP_ARI_SIZEOF 8
/* Address Translation Service */
#define PCI_ATS_CAP 0x04 /* ATS Capability Register */
@@ -664,6 +793,29 @@
#define PCI_ATS_CTRL_ENABLE 0x8000 /* ATS Enable */
#define PCI_ATS_CTRL_STU(x) ((x) & 0x1f) /* Smallest Translation Unit */
#define PCI_ATS_MIN_STU 12 /* shift of minimum STU block */
+#define PCI_EXT_CAP_ATS_SIZEOF 8
+
+/* Page Request Interface */
+#define PCI_PRI_CTRL 0x04 /* PRI control register */
+#define PCI_PRI_CTRL_ENABLE 0x01 /* Enable */
+#define PCI_PRI_CTRL_RESET 0x02 /* Reset */
+#define PCI_PRI_STATUS 0x06 /* PRI status register */
+#define PCI_PRI_STATUS_RF 0x001 /* Response Failure */
+#define PCI_PRI_STATUS_UPRGI 0x002 /* Unexpected PRG index */
+#define PCI_PRI_STATUS_STOPPED 0x100 /* PRI Stopped */
+#define PCI_PRI_MAX_REQ 0x08 /* PRI max reqs supported */
+#define PCI_PRI_ALLOC_REQ 0x0c /* PRI max reqs allowed */
+#define PCI_EXT_CAP_PRI_SIZEOF 16
+
+/* Process Address Space ID */
+#define PCI_PASID_CAP 0x04 /* PASID feature register */
+#define PCI_PASID_CAP_EXEC 0x02 /* Exec permissions Supported */
+#define PCI_PASID_CAP_PRIV 0x04 /* Privilege Mode Supported */
+#define PCI_PASID_CTRL 0x06 /* PASID control register */
+#define PCI_PASID_CTRL_ENABLE 0x01 /* Enable bit */
+#define PCI_PASID_CTRL_EXEC 0x02 /* Exec permissions Enable */
+#define PCI_PASID_CTRL_PRIV 0x04 /* Privilege Mode Enable */
+#define PCI_EXT_CAP_PASID_SIZEOF 8
/* Single Root I/O Virtualization */
#define PCI_SRIOV_CAP 0x04 /* SR-IOV Capabilities */
@@ -695,12 +847,14 @@
#define PCI_SRIOV_VFM_MI 0x1 /* Dormant.MigrateIn */
#define PCI_SRIOV_VFM_MO 0x2 /* Active.MigrateOut */
#define PCI_SRIOV_VFM_AV 0x3 /* Active.Available */
+#define PCI_EXT_CAP_SRIOV_SIZEOF 64
#define PCI_LTR_MAX_SNOOP_LAT 0x4
#define PCI_LTR_MAX_NOSNOOP_LAT 0x6
#define PCI_LTR_VALUE_MASK 0x000003ff
#define PCI_LTR_SCALE_MASK 0x00001c00
#define PCI_LTR_SCALE_SHIFT 10
+#define PCI_EXT_CAP_LTR_SIZEOF 8
/* Access Control Service */
#define PCI_ACS_CAP 0x04 /* ACS Capability Register */
@@ -711,7 +865,38 @@
#define PCI_ACS_UF 0x10 /* Upstream Forwarding */
#define PCI_ACS_EC 0x20 /* P2P Egress Control */
#define PCI_ACS_DT 0x40 /* Direct Translated P2P */
+#define PCI_ACS_EGRESS_BITS 0x05 /* ACS Egress Control Vector Size */
#define PCI_ACS_CTRL 0x06 /* ACS Control Register */
#define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
+#define PCI_VSEC_HDR 4 /* extended cap - vendor-specific */
+#define PCI_VSEC_HDR_LEN_SHIFT 20 /* shift for length field */
+
+/* SATA capability */
+#define PCI_SATA_REGS 4 /* SATA REGs specifier */
+#define PCI_SATA_REGS_MASK 0xF /* location - BAR#/inline */
+#define PCI_SATA_REGS_INLINE 0xF /* REGS in config space */
+#define PCI_SATA_SIZEOF_SHORT 8
+#define PCI_SATA_SIZEOF_LONG 16
+
+/* Resizable BARs */
+#define PCI_REBAR_CTRL 8 /* control register */
+#define PCI_REBAR_CTRL_NBAR_MASK (7 << 5) /* mask for # bars */
+#define PCI_REBAR_CTRL_NBAR_SHIFT 5 /* shift for # bars */
+
+/* Dynamic Power Allocation */
+#define PCI_DPA_CAP 4 /* capability register */
+#define PCI_DPA_CAP_SUBSTATE_MASK 0x1F /* # substates - 1 */
+#define PCI_DPA_BASE_SIZEOF 16 /* size with 0 substates */
+
+/* TPH Requester */
+#define PCI_TPH_CAP 4 /* capability register */
+#define PCI_TPH_CAP_LOC_MASK 0x600 /* location mask */
+#define PCI_TPH_LOC_NONE 0x000 /* no location */
+#define PCI_TPH_LOC_CAP 0x200 /* in capability */
+#define PCI_TPH_LOC_MSIX 0x400 /* in MSI-X */
+#define PCI_TPH_CAP_ST_MASK 0x07FF0000 /* st table mask */
+#define PCI_TPH_CAP_ST_SHIFT 16 /* st table shift */
+#define PCI_TPH_BASE_SIZEOF 12 /* size with no st table */
+
#endif /* LINUX_PCI_REGS_H */
--
1.9.0
* [Qemu-devel] [PATCH 2/4] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
From: Knut Omang @ 2014-08-29 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi,
Michael S. Tsirkin, Laszlo Ersek, Gabriel Somlo, Knut Omang,
Alexander Graf, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong
This patch provides the building blocks for creating an SR/IOV
PCIe Extended Capability header and creating and removing
SR/IOV Virtual Functions.
Signed-off-by: Knut Omang <knut.omang@oracle.com>
---
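Reading aid for pcie_sriov_config_write() in this patch: a write that covers
the SR/IOV control register creates the VFs when VF Enable goes from clear to
set, and unplugs them when it goes from set to clear. The fragment below is a
minimal standalone sketch of that decision, not QEMU code. The names
sriov_ctrl_action() and enum vf_action are hypothetical, and SRIOV_CTRL_VFE is
assumed to match PCI_SRIOV_CTRL_VFE (0x01) from pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror PCI_SRIOV_CTRL_VFE from pci_regs.h */
#define SRIOV_CTRL_VFE 0x01

/* Hypothetical result type: what the PF should do after a control write */
enum vf_action { VF_NONE, VF_CREATE, VF_RESET };

/*
 * live_vfs: number of VFs currently instantiated (dev->exp.num_vfs)
 * ctrl_val: value just written to the SR/IOV control register
 */
static inline enum vf_action sriov_ctrl_action(uint16_t live_vfs,
                                               uint16_t ctrl_val)
{
    bool vfe = (ctrl_val & SRIOV_CTRL_VFE) != 0;

    if (live_vfs && !vfe) {
        return VF_RESET;    /* VF Enable cleared while VFs exist */
    }
    if (!live_vfs && vfe) {
        return VF_CREATE;   /* VF Enable set and no VFs exist yet */
    }
    return VF_NONE;         /* no transition, nothing to do */
}
```

In this model, writing other control bits (for instance only the memory space
enable bit) while no VFs exist is a no-op, which matches the behaviour of the
patch as posted.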
hw/pci/pci.c | 107 +++++++++++++++++++-------
hw/pci/pcie.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++-
include/hw/pci/pci.h | 6 +-
include/hw/pci/pcie.h | 26 +++++++
4 files changed, 311 insertions(+), 33 deletions(-)
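Note on the VF BAR layout: the VF branch of pci_config_get_bar_addr() below
places VF n at the PF's VF BAR base plus n times the BAR size. The fragment
that follows is a minimal standalone sketch of that arithmetic, not QEMU code.
The helper name vf_bar_addr() and its parameters are made up for illustration,
and it assumes the VF offset and stride of 1 that pcie_sriov_init() sets as
defaults.

```c
#include <stdint.h>

/*
 * Hypothetical helper modelling the VF branch of pci_config_get_bar_addr().
 *
 * vf_bar_reg: raw value of the VF BAR register in the PF's SR/IOV capability
 *             (base address plus the low type/flag bits)
 * pf_devfn:   devfn of the physical function
 * vf_devfn:   devfn of the virtual function, assumed > pf_devfn
 * bar_size:   size of one VF BAR, assumed to be a power of two
 */
static inline uint64_t vf_bar_addr(uint64_t vf_bar_reg, uint8_t pf_devfn,
                                   uint8_t vf_devfn, uint64_t bar_size)
{
    /* With offset 1 and stride 1 the first VF sits at pf_devfn + 1 */
    uint32_t vf_num = (uint32_t)(vf_devfn - (pf_devfn + 1));

    /* Each VF gets its own bar_size-sized window after the base */
    uint64_t addr = vf_bar_reg + (uint64_t)vf_num * bar_size;

    /* Drop the low type/flag bits, as done for non-ROM BARs */
    return addr & ~(bar_size - 1);
}
```

For example, with the PF at devfn 0, a VF BAR register value of 0xfe000004
(64-bit memory type) and 16 KiB BARs, the VF at devfn 3 is VF number 2 and
would be mapped at 0xfe008000.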
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index daeaeac..071ab81 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -35,7 +35,6 @@
#include "hw/pci/msi.h"
#include "hw/pci/msix.h"
#include "exec/address-spaces.h"
-#include "hw/hotplug.h"
//#define DEBUG_PCI
#ifdef DEBUG_PCI
@@ -126,6 +125,9 @@ static int pci_bar(PCIDevice *d, int reg)
{
uint8_t type;
+ /* PCIe virtual functions do not have their own BARs */
+ assert(!d->exp.is_vf);
+
if (reg != PCI_ROM_SLOT)
return PCI_BASE_ADDRESS_0 + reg * 4;
@@ -184,22 +186,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
}
}
-static void pci_do_device_reset(PCIDevice *dev)
+static void pci_reset_regions(PCIDevice *dev)
{
int r;
+ if (dev->exp.is_vf) {
+ return;
+ }
- pci_device_deassert_intx(dev);
- assert(dev->irq_state == 0);
-
- /* Clear all writable bits */
- pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
- pci_get_word(dev->wmask + PCI_COMMAND) |
- pci_get_word(dev->w1cmask + PCI_COMMAND));
- pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
- pci_get_word(dev->wmask + PCI_STATUS) |
- pci_get_word(dev->w1cmask + PCI_STATUS));
- dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
- dev->config[PCI_INTERRUPT_LINE] = 0x0;
for (r = 0; r < PCI_NUM_REGIONS; ++r) {
PCIIORegion *region = &dev->io_regions[r];
if (!region->size) {
@@ -213,6 +206,27 @@ static void pci_do_device_reset(PCIDevice *dev)
pci_set_long(dev->config + pci_bar(dev, r), region->type);
}
}
+}
+
+static void pci_do_device_reset(PCIDevice *dev)
+{
+ qdev_reset_all(&dev->qdev);
+
+ dev->irq_state = 0;
+ pci_update_irq_status(dev);
+ pci_device_deassert_intx(dev);
+ assert(dev->irq_state == 0);
+
+ /* Clear all writable bits */
+ pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
+ pci_get_word(dev->wmask + PCI_COMMAND) |
+ pci_get_word(dev->w1cmask + PCI_COMMAND));
+ pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
+ pci_get_word(dev->wmask + PCI_STATUS) |
+ pci_get_word(dev->w1cmask + PCI_STATUS));
+ dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
+ dev->config[PCI_INTERRUPT_LINE] = 0x0;
+ pci_reset_regions(dev);
pci_update_mappings(dev);
msi_reset(dev);
@@ -734,6 +748,14 @@ static int pci_init_multifunction(PCIBus *bus, PCIDevice *dev)
dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
}
+ /* With SR/IOV and ARI, a later devfn that decodes as function 0
+ * is just another VF whose physical function sits in the initial
+ * function (0.0), so skip the multifunction checks for it.
+ */
+ if (dev->exp.pf && dev->exp.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+ return 0;
+ }
+
/*
* multifunction bit is interpreted in two ways as follows.
* - all functions must set the bit to 1.
@@ -920,6 +942,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
uint64_t wmask;
pcibus_t size = memory_region_size(memory);
+ assert(!pci_dev->exp.is_vf); /* VFs must use pcie_register_vf_bar */
assert(region_num >= 0);
assert(region_num < PCI_NUM_REGIONS);
if (size & (size-1)) {
@@ -1018,18 +1041,51 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
return pci_dev->io_regions[region_num].addr;
}
-static pcibus_t pci_bar_address(PCIDevice *d,
- int reg, uint8_t type, pcibus_t size)
+
+static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
+ uint8_t type, pcibus_t size)
+{
+ pcibus_t new_addr;
+ if (!d->exp.is_vf) {
+ int bar = pci_bar(d, reg);
+ if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+ new_addr = pci_get_quad(d->config + bar);
+ } else {
+ new_addr = pci_get_long(d->config + bar);
+ }
+ } else {
+ int bar = d->exp.pf->exp.sriov_cap + PCI_SRIOV_BAR + reg * 4;
+ uint32_t vf_num = d->devfn - (d->exp.pf->devfn + 1);
+ if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+ new_addr = pci_get_quad(d->exp.pf->config + bar);
+ } else {
+ new_addr = pci_get_long(d->exp.pf->config + bar);
+ }
+ new_addr += vf_num * size;
+
+ PCI_DPRINTF("%02d:%02d.%d: (vf %u) found config reg %d, "
+ "size 0x%" FMT_PCIBUS " addr 0x%" FMT_PCIBUS "\n",
+ pci_bus_num(d->bus), d->devfn >> 3, d->devfn & 7,
+ vf_num, reg, size, new_addr);
+ }
+ if (reg != PCI_ROM_SLOT) {
+ /* Mask flag bits; the ROM slot keeps its enable bit */
+ new_addr &= ~(size - 1);
+ }
+ return new_addr;
+}
+
+pcibus_t pci_bar_address(PCIDevice *d,
+ int reg, uint8_t type, pcibus_t size)
{
pcibus_t new_addr, last_addr;
- int bar = pci_bar(d, reg);
uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
if (type & PCI_BASE_ADDRESS_SPACE_IO) {
if (!(cmd & PCI_COMMAND_IO)) {
return PCI_BAR_UNMAPPED;
}
- new_addr = pci_get_long(d->config + bar) & ~(size - 1);
+ new_addr = pci_config_get_bar_addr(d, reg, type, size);
last_addr = new_addr + size - 1;
/* Check if 32 bit BAR wraps around explicitly.
* TODO: make priorities correct and remove this work around.
@@ -1043,11 +1099,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
if (!(cmd & PCI_COMMAND_MEMORY)) {
return PCI_BAR_UNMAPPED;
}
- if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
- new_addr = pci_get_quad(d->config + bar);
- } else {
- new_addr = pci_get_long(d->config + bar);
- }
+ new_addr = pci_config_get_bar_addr(d, reg, type, size);
/* the ROM slot has a specific enable bit */
if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
return PCI_BAR_UNMAPPED;
@@ -1146,9 +1198,10 @@ uint32_t pci_default_read_config(PCIDevice *d,
return le32_to_cpu(val);
}
-void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
+void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
{
int i, was_irq_disabled = pci_irq_disabled(d);
+ uint32_t val = val_in;
for (i = 0; i < l; val >>= 8, ++i) {
uint8_t wmask = d->wmask[addr + i];
@@ -1170,8 +1223,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
& PCI_COMMAND_MASTER);
}
- msi_write_config(d, addr, val, l);
- msix_write_config(d, addr, val, l);
+ msi_write_config(d, addr, val_in, l);
+ msix_write_config(d, addr, val_in, l);
+ pcie_sriov_config_write(d, addr, val_in, l);
}
/***********************************************************/
@@ -1776,7 +1830,6 @@ static int pci_qdev_init(DeviceState *qdev)
is_default_rom = true;
}
pci_add_option_rom(pci_dev, is_default_rom);
-
return 0;
}
diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 6cb6e0c..b7af693 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -254,7 +254,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
* Right now, only a device of function = 0 is allowed to be
* hot plugged/unplugged.
*/
- assert(PCI_FUNC(pci_dev->devfn) == 0);
+ assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_dev->exp.is_vf);
pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
PCI_EXP_SLTSTA_PDS);
@@ -266,10 +266,11 @@ void pcie_cap_slot_hot_unplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
Error **errp)
{
uint8_t *exp_cap;
+ PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
- pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev, &exp_cap, errp);
+ pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
- pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
+ pcie_cap_slot_push_attention_button(pdev);
}
/* pci express slot for pci express root/downstream port
@@ -409,7 +410,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
}
/*
- * If the slot is polulated, power indicator is off and power
+ * If the slot is populated, power indicator is off and power
* controller is off, it is safe to detach the devices.
*/
if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
@@ -633,3 +634,199 @@ void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn)
offset, PCI_ARI_SIZEOF);
pci_set_long(dev->config + offset + PCI_ARI_CAP, (nextfn & 0xff) << 8);
}
+
+
+
+/* SR/IOV */
+void pcie_sriov_init(PCIDevice *dev, uint16_t offset,
+ const char *vfname, uint16_t vf_dev_id,
+ uint16_t init_vfs, uint16_t total_vfs)
+{
+ uint8_t *cfg = dev->config + offset;
+ uint8_t *wmask;
+ pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
+ offset, PCI_EXT_CAP_SRIOV_SIZEOF);
+ dev->exp.sriov_cap = offset;
+ dev->exp.num_vfs = 0;
+ dev->exp.vfname = g_strdup(vfname);
+ dev->exp.vf = NULL;
+
+ /* set some sensible defaults - devices can override later */
+ pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, 0x1);
+ pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, 0x1);
+ pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
+ pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
+
+ /* Set up device ID and initial/total number of VFs available */
+ pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
+ pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
+ pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
+
+ /* Write enable control bits */
+ wmask = dev->wmask + offset;
+ pci_set_word(wmask + PCI_SRIOV_CTRL,
+ PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
+ pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
+
+ qdev_prop_set_bit(&dev->qdev, "multifunction", true);
+}
+
+
+void pcie_sriov_exit(PCIDevice *dev)
+{
+ PCIE_DPRINTF("\n");
+ pcie_sriov_reset_vfs(dev);
+}
+
+void pcie_sriov_init_bar(PCIDevice *dev, int region_num,
+ uint8_t type, dma_addr_t size)
+{
+ uint32_t addr;
+ uint64_t wmask;
+ uint16_t sriov_cap = dev->exp.sriov_cap;
+
+ assert(sriov_cap > 0);
+ assert(region_num >= 0);
+ assert(region_num < PCI_NUM_REGIONS);
+ assert(region_num != PCI_ROM_SLOT);
+
+ wmask = ~(size - 1);
+ addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
+
+ pci_set_long(dev->config + addr, type);
+ if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
+ type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+ pci_set_quad(dev->wmask + addr, wmask);
+ pci_set_quad(dev->cmask + addr, ~0ULL);
+ } else {
+ pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
+ pci_set_long(dev->cmask + addr, 0xffffffff);
+ }
+ dev->exp.vf_bar_type[region_num] = type;
+}
+
+void pcie_register_vf_bar(PCIDevice *dev, int region_num,
+ MemoryRegion *memory)
+{
+ PCIIORegion *r;
+ uint8_t type;
+ pcibus_t size = memory_region_size(memory);
+
+ assert(dev->exp.is_vf); /* PFs must use pci_register_bar */
+ assert(region_num >= 0);
+ assert(region_num < PCI_NUM_REGIONS);
+ type = dev->exp.pf->exp.vf_bar_type[region_num];
+
+ if (size & (size-1)) {
+ fprintf(stderr, "ERROR: PCI region size must be pow2 "
+ "type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
+ exit(1);
+ }
+
+ r = &dev->io_regions[region_num];
+ r->memory = memory;
+ r->address_space =
+ type & PCI_BASE_ADDRESS_SPACE_IO
+ ? dev->bus->address_space_io
+ : dev->bus->address_space_mem;
+ r->size = size;
+ r->type = type;
+
+ r->addr = pci_bar_address(dev, region_num, r->type, r->size);
+ if (r->addr != PCI_BAR_UNMAPPED) {
+ memory_region_add_subregion_overlap(r->address_space,
+ r->addr, r->memory, 1);
+ }
+}
+
+
+static PCIDevice *pcie_create_vf(PCIDevice *pf, int devfn, const char *name)
+{
+ int ret;
+ PCIDevice *dev = pci_create(pf->bus, devfn, name);
+ dev->exp.is_vf = true;
+ dev->exp.pf = pf;
+
+ ret = qdev_init(&dev->qdev);
+ if (ret)
+ return NULL;
+
+ /* set vid/did according to sr/iov spec - they are not used */
+ pci_config_set_vendor_id(dev->config, 0xffff);
+ pci_config_set_device_id(dev->config, 0xffff);
+ return dev;
+}
+
+
+void pcie_sriov_create_vfs(PCIDevice *dev)
+{
+ uint16_t num_vfs;
+ uint16_t i;
+ int32_t devfn = dev->devfn + 1;
+ uint16_t sriov_cap = dev->exp.sriov_cap;
+
+ assert(sriov_cap > 0);
+ num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
+
+ /* g_malloc() aborts on failure and returns NULL for a zero NumVFs */
+ dev->exp.vf = g_malloc(sizeof(PCIDevice *) * num_vfs);
+
+ PCIE_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
+ for (i = 0; i < num_vfs; i++) {
+ dev->exp.vf[i] = pcie_create_vf(dev, devfn++, dev->exp.vfname);
+ if (!dev->exp.vf[i]) {
+ PCIE_DEV_PRINTF(dev, "Failed to create VF %d\n", i);
+ num_vfs = i;
+ break;
+ }
+ }
+ dev->exp.num_vfs = num_vfs;
+}
+
+
+void pcie_sriov_reset_vfs(PCIDevice *dev)
+{
+ Error *local_err = NULL;
+ uint16_t num_vfs = dev->exp.num_vfs;
+ uint16_t i;
+ PCIE_DEV_PRINTF(dev, "Resetting %d vf devs\n", num_vfs);
+ for (i = 0; i < num_vfs; i++) {
+ qdev_unplug(&dev->exp.vf[i]->qdev, &local_err);
+ if (local_err) {
+ fprintf(stderr, "Failed to unplug: %s\n",
+ error_get_pretty(local_err));
+ error_free(local_err);
+ }
+ }
+ dev->exp.num_vfs = 0;
+}
+
+
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
+{
+ uint32_t off;
+ uint16_t sriov_cap = dev->exp.sriov_cap;
+
+ if (!sriov_cap || address < sriov_cap) {
+ return;
+ }
+ off = address - sriov_cap;
+ if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
+ return;
+ }
+
+ PCIE_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x len %d\n",
+ sriov_cap, off, val, len);
+
+ if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
+ if (dev->exp.num_vfs) {
+ if (!(val & PCI_SRIOV_CTRL_VFE)) {
+ pcie_sriov_reset_vfs(dev);
+ }
+ } else {
+ if (val & PCI_SRIOV_CTRL_VFE) {
+ pcie_sriov_create_vfs(dev);
+ }
+ }
+ }
+}
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index c352c7b..d36c68d 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -11,8 +11,6 @@
/* PCI includes legacy ISA access. */
#include "hw/isa/isa.h"
-#include "hw/pci/pcie.h"
-
/* PCI bus */
#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
@@ -127,6 +125,7 @@ enum {
#define QEMU_PCI_VGA_IO_HI_SIZE 0x20
#include "hw/pci/pci_regs.h"
+#include "hw/pci/pcie.h"
/* PCI HEADER_TYPE */
#define PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
@@ -413,6 +412,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
+pcibus_t pci_bar_address(PCIDevice *d,
+ int reg, uint8_t type, pcibus_t size);
+
static inline void
pci_set_byte(uint8_t *config, uint8_t val)
{
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index d139d58..b34d831 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -74,6 +74,15 @@ struct PCIExpressDevice {
/* AER */
uint16_t aer_cap;
PCIEAERLog aer_log;
+
+ /* SR/IOV */
+ uint16_t sriov_cap;
+ uint16_t num_vfs; /* Number of virtual functions created */
+ bool is_vf; /* Set if this device is a virtual function */
+ const char *vfname; /* Reference to the device type used for the VFs */
+ PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
+ PCIDevice *pf; /* Pointer back to owner physical function */
+ uint8_t vf_bar_type[PCI_NUM_REGIONS]; /* Store type for each VF bar */
};
#define COMPAT_PROP_PCP "power_controller_present"
@@ -115,6 +124,23 @@ void pcie_add_capability(PCIDevice *dev,
uint16_t offset, uint16_t size);
void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
+void pcie_sriov_init(PCIDevice *dev, uint16_t offset,
+ const char *vfname, uint16_t vf_dev_id,
+ uint16_t init_vfs, uint16_t total_vfs);
+void pcie_sriov_exit(PCIDevice *dev);
+
+/* Set up a VF bar in the SR/IOV bar area */
+void pcie_sriov_init_bar(PCIDevice *s, int region_num,
+ uint8_t type, dma_addr_t size);
+
+/* Instantiate a bar for a VF */
+void pcie_register_vf_bar(PCIDevice *pci_dev, int region_num,
+ MemoryRegion *memory);
+
+void pcie_sriov_create_vfs(PCIDevice *dev);
+void pcie_sriov_reset_vfs(PCIDevice *dev);
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
+ uint32_t val, int len);
extern const VMStateDescription vmstate_pcie_device;
--
1.9.0
* [Qemu-devel] [PATCH 3/4] e1000: Refactor to allow subclassing from other source file
From: Knut Omang @ 2014-08-29 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi,
Michael S. Tsirkin, Laszlo Ersek, Gabriel Somlo, Knut Omang,
Alexander Graf, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong
Signed-off-by: Knut Omang <knut.omang@oracle.com>
---
hw/net/e1000.c | 126 +++++++--------------------------------------------
hw/net/e1000.h | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 155 insertions(+), 110 deletions(-)
create mode 100644 hw/net/e1000.h
diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 272df00..7afdcad 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -35,6 +35,7 @@
#include "qemu/iov.h"
#include "e1000_regs.h"
+#include "e1000.h"
#define E1000_DEBUG
@@ -75,101 +76,6 @@ static int debugflags = DBGBIT(TXERR) | DBGBIT(GENERAL);
* Others never tested
*/
-typedef struct E1000State_st {
- /*< private >*/
- PCIDevice parent_obj;
- /*< public >*/
-
- NICState *nic;
- NICConf conf;
- MemoryRegion mmio;
- MemoryRegion io;
-
- uint32_t mac_reg[0x8000];
- uint16_t phy_reg[0x20];
- uint16_t eeprom_data[64];
-
- uint32_t rxbuf_size;
- uint32_t rxbuf_min_shift;
- struct e1000_tx {
- unsigned char header[256];
- unsigned char vlan_header[4];
- /* Fields vlan and data must not be reordered or separated. */
- unsigned char vlan[4];
- unsigned char data[0x10000];
- uint16_t size;
- unsigned char sum_needed;
- unsigned char vlan_needed;
- uint8_t ipcss;
- uint8_t ipcso;
- uint16_t ipcse;
- uint8_t tucss;
- uint8_t tucso;
- uint16_t tucse;
- uint8_t hdr_len;
- uint16_t mss;
- uint32_t paylen;
- uint16_t tso_frames;
- char tse;
- int8_t ip;
- int8_t tcp;
- char cptse; // current packet tse bit
- } tx;
-
- struct {
- uint32_t val_in; // shifted in from guest driver
- uint16_t bitnum_in;
- uint16_t bitnum_out;
- uint16_t reading;
- uint32_t old_eecd;
- } eecd_state;
-
- QEMUTimer *autoneg_timer;
-
- QEMUTimer *mit_timer; /* Mitigation timer. */
- bool mit_timer_on; /* Mitigation timer is running. */
- bool mit_irq_level; /* Tracks interrupt pin level. */
- uint32_t mit_ide; /* Tracks E1000_TXD_CMD_IDE bit. */
-
-/* Compatibility flags for migration to/from qemu 1.3.0 and older */
-#define E1000_FLAG_AUTONEG_BIT 0
-#define E1000_FLAG_MIT_BIT 1
-#define E1000_FLAG_AUTONEG (1 << E1000_FLAG_AUTONEG_BIT)
-#define E1000_FLAG_MIT (1 << E1000_FLAG_MIT_BIT)
- uint32_t compat_flags;
-} E1000State;
-
-typedef struct E1000BaseClass {
- PCIDeviceClass parent_class;
- uint16_t phy_id2;
-} E1000BaseClass;
-
-#define TYPE_E1000_BASE "e1000-base"
-
-#define E1000(obj) \
- OBJECT_CHECK(E1000State, (obj), TYPE_E1000_BASE)
-
-#define E1000_DEVICE_CLASS(klass) \
- OBJECT_CLASS_CHECK(E1000BaseClass, (klass), TYPE_E1000_BASE)
-#define E1000_DEVICE_GET_CLASS(obj) \
- OBJECT_GET_CLASS(E1000BaseClass, (obj), TYPE_E1000_BASE)
-
-#define defreg(x) x = (E1000_##x>>2)
-enum {
- defreg(CTRL), defreg(EECD), defreg(EERD), defreg(GPRC),
- defreg(GPTC), defreg(ICR), defreg(ICS), defreg(IMC),
- defreg(IMS), defreg(LEDCTL), defreg(MANC), defreg(MDIC),
- defreg(MPC), defreg(PBA), defreg(RCTL), defreg(RDBAH),
- defreg(RDBAL), defreg(RDH), defreg(RDLEN), defreg(RDT),
- defreg(STATUS), defreg(SWSM), defreg(TCTL), defreg(TDBAH),
- defreg(TDBAL), defreg(TDH), defreg(TDLEN), defreg(TDT),
- defreg(TORH), defreg(TORL), defreg(TOTH), defreg(TOTL),
- defreg(TPR), defreg(TPT), defreg(TXDCTL), defreg(WUFC),
- defreg(RA), defreg(MTA), defreg(CRCERRS),defreg(VFTA),
- defreg(VET), defreg(RDTR), defreg(RADV), defreg(TADV),
- defreg(ITR),
-};
-
static void
e1000_link_down(E1000State *s)
{
@@ -398,7 +304,7 @@ rxbufsize(uint32_t v)
return 2048;
}
-static void e1000_reset(void *opaque)
+void e1000_reset(void *opaque)
{
E1000State *d = opaque;
E1000BaseClass *edc = E1000_DEVICE_GET_CLASS(d);
@@ -1507,8 +1413,7 @@ e1000_cleanup(NetClientState *nc)
s->nic = NULL;
}
-static void
-pci_e1000_uninit(PCIDevice *dev)
+void pci_e1000_uninit(PCIDevice *dev)
{
E1000State *d = E1000(dev);
@@ -1529,10 +1434,11 @@ static NetClientInfo net_e1000_info = {
.link_status_changed = e1000_set_link_status,
};
-static int pci_e1000_init(PCIDevice *pci_dev)
+int pci_e1000_init(PCIDevice *pci_dev)
{
DeviceState *dev = DEVICE(pci_dev);
E1000State *d = E1000(pci_dev);
+ E1000BaseClass *edc = E1000_DEVICE_GET_CLASS(d);
PCIDeviceClass *pdc = PCI_DEVICE_GET_CLASS(pci_dev);
uint8_t *pci_conf;
uint16_t checksum = 0;
@@ -1548,9 +1454,12 @@ static int pci_e1000_init(PCIDevice *pci_dev)
e1000_mmio_setup(d);
- pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);
-
- pci_register_bar(pci_dev, 1, PCI_BASE_ADDRESS_SPACE_IO, &d->io);
+ if (pci_dev->exp.is_vf) {
+ pcie_register_vf_bar(pci_dev, 0, &d->mmio);
+ } else {
+ pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);
+ pci_register_bar(pci_dev, edc->io_bar, PCI_BASE_ADDRESS_SPACE_IO, &d->io);
+ }
memmove(d->eeprom_data, e1000_eeprom_template,
sizeof e1000_eeprom_template);
@@ -1592,14 +1501,7 @@ static Property e1000_properties[] = {
DEFINE_PROP_END_OF_LIST(),
};
-typedef struct E1000Info {
- const char *name;
- uint16_t device_id;
- uint8_t revision;
- uint16_t phy_id2;
-} E1000Info;
-
-static void e1000_class_init(ObjectClass *klass, void *data)
+void e1000_class_init(ObjectClass *klass, void *data)
{
DeviceClass *dc = DEVICE_CLASS(klass);
PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
@@ -1613,6 +1515,7 @@ static void e1000_class_init(ObjectClass *klass, void *data)
k->device_id = info->device_id;
k->revision = info->revision;
e->phy_id2 = info->phy_id2;
+ e->io_bar = info->io_bar;
k->class_id = PCI_CLASS_NETWORK_ETHERNET;
set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
dc->desc = "Intel Gigabit Ethernet";
@@ -1634,18 +1537,21 @@ static const E1000Info e1000_devices[] = {
.name = "e1000-82540em",
.device_id = E1000_DEV_ID_82540EM,
.revision = 0x03,
+ .io_bar = 1,
.phy_id2 = E1000_PHY_ID2_8254xx_DEFAULT,
},
{
.name = "e1000-82544gc",
.device_id = E1000_DEV_ID_82544GC_COPPER,
.revision = 0x03,
+ .io_bar = 1,
.phy_id2 = E1000_PHY_ID2_82544x,
},
{
.name = "e1000-82545em",
.device_id = E1000_DEV_ID_82545EM_COPPER,
.revision = 0x03,
+ .io_bar = 1,
.phy_id2 = E1000_PHY_ID2_8254xx_DEFAULT,
},
};
diff --git a/hw/net/e1000.h b/hw/net/e1000.h
new file mode 100644
index 0000000..f215122
--- /dev/null
+++ b/hw/net/e1000.h
@@ -0,0 +1,139 @@
+/*
+ * QEMU e1000 emulation
+ *
+ * Software developer's manual:
+ * http://download.intel.com/design/network/manuals/8254x_GBe_SDM.pdf
+ *
+ * Nir Peleg, Tutis Systems Ltd. for Qumranet Inc.
+ * Copyright (c) 2008 Qumranet
+ * Based on work done by:
+ * Copyright (c) 2007 Dan Aloni
+ * Copyright (c) 2004 Antony T Curtis
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "e1000_regs.h"
+
+typedef struct E1000State_st {
+ /*< private >*/
+ PCIDevice parent_obj;
+ /*< public >*/
+
+ NICState *nic;
+ NICConf conf;
+ MemoryRegion mmio;
+ MemoryRegion io;
+
+ uint32_t mac_reg[0x8000];
+ uint16_t phy_reg[0x20];
+ uint16_t eeprom_data[64];
+
+ uint32_t rxbuf_size;
+ uint32_t rxbuf_min_shift;
+ struct e1000_tx {
+ unsigned char header[256];
+ unsigned char vlan_header[4];
+ /* Fields vlan and data must not be reordered or separated. */
+ unsigned char vlan[4];
+ unsigned char data[0x10000];
+ uint16_t size;
+ unsigned char sum_needed;
+ unsigned char vlan_needed;
+ uint8_t ipcss;
+ uint8_t ipcso;
+ uint16_t ipcse;
+ uint8_t tucss;
+ uint8_t tucso;
+ uint16_t tucse;
+ uint8_t hdr_len;
+ uint16_t mss;
+ uint32_t paylen;
+ uint16_t tso_frames;
+ char tse;
+ int8_t ip;
+ int8_t tcp;
+ char cptse; // current packet tse bit
+ } tx;
+
+ struct {
+ uint32_t val_in; // shifted in from guest driver
+ uint16_t bitnum_in;
+ uint16_t bitnum_out;
+ uint16_t reading;
+ uint32_t old_eecd;
+ } eecd_state;
+
+ QEMUTimer *autoneg_timer;
+
+ QEMUTimer *mit_timer; /* Mitigation timer. */
+ bool mit_timer_on; /* Mitigation timer is running. */
+ bool mit_irq_level; /* Tracks interrupt pin level. */
+ uint32_t mit_ide; /* Tracks E1000_TXD_CMD_IDE bit. */
+
+/* Compatibility flags for migration to/from qemu 1.3.0 and older */
+#define E1000_FLAG_AUTONEG_BIT 0
+#define E1000_FLAG_MIT_BIT 1
+#define E1000_FLAG_AUTONEG (1 << E1000_FLAG_AUTONEG_BIT)
+#define E1000_FLAG_MIT (1 << E1000_FLAG_MIT_BIT)
+ uint32_t compat_flags;
+} E1000State;
+
+typedef struct E1000BaseClass {
+ PCIDeviceClass parent_class;
+ uint16_t phy_id2;
+ uint16_t io_bar; /* Which BAR register to use for the IO bar */
+} E1000BaseClass;
+
+#define TYPE_E1000_BASE "e1000-base"
+
+#define E1000(obj) \
+ OBJECT_CHECK(E1000State, (obj), TYPE_E1000_BASE)
+
+#define E1000_DEVICE_CLASS(klass) \
+ OBJECT_CLASS_CHECK(E1000BaseClass, (klass), TYPE_E1000_BASE)
+#define E1000_DEVICE_GET_CLASS(obj) \
+ OBJECT_GET_CLASS(E1000BaseClass, (obj), TYPE_E1000_BASE)
+
+#define defreg(x) x = (E1000_##x>>2)
+enum {
+ defreg(CTRL), defreg(EECD), defreg(EERD), defreg(GPRC),
+ defreg(GPTC), defreg(ICR), defreg(ICS), defreg(IMC),
+ defreg(IMS), defreg(LEDCTL), defreg(MANC), defreg(MDIC),
+ defreg(MPC), defreg(PBA), defreg(RCTL), defreg(RDBAH),
+ defreg(RDBAL), defreg(RDH), defreg(RDLEN), defreg(RDT),
+ defreg(STATUS), defreg(SWSM), defreg(TCTL), defreg(TDBAH),
+ defreg(TDBAL), defreg(TDH), defreg(TDLEN), defreg(TDT),
+ defreg(TORH), defreg(TORL), defreg(TOTH), defreg(TOTL),
+ defreg(TPR), defreg(TPT), defreg(TXDCTL), defreg(WUFC),
+ defreg(RA), defreg(MTA), defreg(CRCERRS),defreg(VFTA),
+ defreg(VET), defreg(RDTR), defreg(RADV), defreg(TADV),
+ defreg(ITR),
+};
+
+
+typedef struct E1000Info {
+ const char *name;
+ uint16_t device_id;
+ uint8_t revision;
+ uint8_t io_bar;
+ uint16_t phy_id2;
+} E1000Info;
+
+
+void e1000_class_init(ObjectClass *klass, void *data);
+
+int pci_e1000_init(PCIDevice *pci_dev);
+void pci_e1000_uninit(PCIDevice *dev);
+void e1000_reset(void *opaque);
--
1.9.0
* [Qemu-devel] [PATCH 4/4] igb: Example code to illustrate the SR/IOV support.
2014-08-29 7:17 [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization Knut Omang
` (2 preceding siblings ...)
2014-08-29 7:17 ` [Qemu-devel] [PATCH 3/4] e1000: Refactor to allow subclassing from other source file Knut Omang
@ 2014-08-29 7:17 ` Knut Omang
2014-08-29 16:17 ` [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization Michael S. Tsirkin
4 siblings, 0 replies; 9+ messages in thread
From: Knut Omang @ 2014-08-29 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi,
Michael S. Tsirkin, Laszlo Ersek, Gabriel Somlo, Knut Omang,
Alexander Graf, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong
This patch contains a fairly naive attempt to emulate an
Intel 82576 Gigabit Ethernet Adapter (igb/igbvf) for the purpose
of illustrating the use of the SR/IOV emulation features.
This first shot does not completely work as an Ethernet device,
as it is basically a simple subclassing of the e1000 base class
where the emphasis has been on adding the minimal functionality
needed to fool the igb/igbvf Linux drivers into loading and allow VFs
to show up, but it should be possible for someone familiar with
the details of the 82576 to extend it into something that works.
The current status is that the VFs (if enabled with a nonzero
value for the max_vfs parameter to the igb driver)
will load and present ethernet devices, but the PF fails to load
because it fails to detect valid flash content in BAR 1.
Signed-off-by: Knut Omang <knut.omang@oracle.com>
---
hw/net/Makefile.objs | 2 +-
hw/net/igb.c | 293 +++++++++++++++++++++++++++++++++++++++++++++++++++
hw/net/igb_regs.h | 27 +++++
3 files changed, 321 insertions(+), 1 deletion(-)
create mode 100644 hw/net/igb.c
create mode 100644 hw/net/igb_regs.h
diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
index ea93293..6832f12 100644
--- a/hw/net/Makefile.objs
+++ b/hw/net/Makefile.objs
@@ -6,7 +6,7 @@ common-obj-$(CONFIG_NE2000_PCI) += ne2000.o
common-obj-$(CONFIG_EEPRO100_PCI) += eepro100.o
common-obj-$(CONFIG_PCNET_PCI) += pcnet-pci.o
common-obj-$(CONFIG_PCNET_COMMON) += pcnet.o
-common-obj-$(CONFIG_E1000_PCI) += e1000.o
+common-obj-$(CONFIG_E1000_PCI) += e1000.o igb.o
common-obj-$(CONFIG_RTL8139_PCI) += rtl8139.o
common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet_tx_pkt.o vmxnet_rx_pkt.o
common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet3.o
diff --git a/hw/net/igb.c b/hw/net/igb.c
new file mode 100644
index 0000000..08ca1c2
--- /dev/null
+++ b/hw/net/igb.c
@@ -0,0 +1,293 @@
+/*
+ * igb.c
+ *
+ * Intel 82576 Gigabit Ethernet Adapter
+ * (SR/IOV capable PCIe ethernet device) emulation
+ *
+ * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/* NB! This implementation does not (yet) work as an ethernet device;
+ * it mainly serves as an example to demonstrate the emulated SR/IOV device
+ * building blocks!
+ *
+ * You should be able to load both the PF and VF drivers (igb/igbvf) on Linux
+ * if the igb driver is loaded with a nonzero max_vfs module parameter,
+ * but the PF fails to reset (probably because that is where the obvious
+ * similarities between e1000 and igb end).
+ */
+
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pcie.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "net/net.h"
+#include "igb_regs.h"
+#include "e1000.h"
+
+typedef struct IgbState {
+ /*< private >*/
+ E1000State parent_obj;
+
+ /*< public >*/
+ MemoryRegion flash;
+ MemoryRegion msix;
+} IgbState;
+
+
+
+
+static uint64_t igb_flash_read(void *opaque, hwaddr addr,
+ unsigned size)
+{
+ fprintf(stderr, "%s: addr %" HWADDR_PRIx " size %u (Not implemented)\n",
+ __func__, addr, size);
+ return 0;
+}
+
+static void igb_flash_write(void *opaque, hwaddr addr,
+ uint64_t val, unsigned size)
+{
+ fprintf(stderr, "%s: addr %" HWADDR_PRIx " size %u, value %" PRIx64
+ " (Not implemented)\n", __func__, addr, size, val);
+}
+
+static const MemoryRegionOps igb_flash_ops = {
+ .read = igb_flash_read,
+ .write = igb_flash_write,
+ .endianness = DEVICE_LITTLE_ENDIAN,
+ .impl = {
+ .min_access_size = 1,
+ .max_access_size = 8,
+ },
+};
+
+
+#define TYPE_IGB "igb"
+#define IGB(obj) \
+ OBJECT_CHECK(IgbState, (obj), TYPE_IGB)
+#define TYPE_IGBVF "igbvf"
+#define IGBVF(obj) \
+ OBJECT_CHECK(IgbState, (obj), TYPE_IGBVF)
+
+static int pci_igb_init(PCIDevice *d)
+{
+ int v;
+ IgbState *igb = IGB(d);
+ MemoryRegion *mr = &igb->flash;
+ int ret = pci_e1000_init(d);
+ if (ret) {
+ return ret;
+ }
+
+ memory_region_init_io(mr, OBJECT(d), &igb_flash_ops,
+ igb, "igb-flash", 0x20000);
+ pci_register_bar(d, 1, PCI_BASE_ADDRESS_SPACE_MEMORY, &igb->flash);
+
+ ret = msi_init(d, 0x40, 2, true, true);
+ if (ret < 0) {
+ goto err_msi;
+ }
+
+ mr = &igb->msix;
+ memory_region_init(mr, OBJECT(d), "igb-msix", 0x8000);
+ pci_register_bar(d, 3, PCI_BASE_ADDRESS_MEM_TYPE_64, mr);
+ ret = msix_init(d, IGB_MSIX_VECTORS_PF, mr, IGB_MSIX_BAR, 0, mr,
+ IGB_MSIX_BAR, 0x2000, 0x70);
+ if (ret) {
+ goto err_msix;
+ }
+
+ /* TBD: Only initialize the vectors used */
+ for (v = 0; v < IGB_MSIX_VECTORS_PF; v++) {
+ ret = msix_vector_use(d, v);
+ if (ret) {
+ goto err_pcie_cap;
+ }
+ }
+
+ ret = pcie_endpoint_cap_init(d, 0xa0);
+ if (ret < 0) {
+ goto err_pcie_cap;
+ }
+ ret = pcie_aer_init(d, 0x100);
+ if (ret < 0) {
+ goto err_aer;
+ }
+
+ pcie_ari_init(d, 0x150, 1);
+
+ pcie_sriov_init(d, IGB_CAP_SRIOV_OFFSET, "igbvf",
+ IGB_82576_VF_DEV_ID, IGB_TOTAL_VFS, IGB_TOTAL_VFS);
+
+ pcie_sriov_init_bar(d, 0, PCI_BASE_ADDRESS_MEM_TYPE_64, 0x8000);
+ pcie_sriov_init_bar(d, 3, PCI_BASE_ADDRESS_MEM_TYPE_64, 0x8000);
+
+ return 0;
+ err_aer:
+ pcie_cap_exit(d);
+ err_pcie_cap:
+ msix_unuse_all_vectors(d);
+ msix_uninit(d, mr, mr);
+ err_msix:
+ msi_uninit(d);
+ err_msi:
+ pci_e1000_uninit(d);
+ return ret;
+}
+
+
+static void pci_igb_uninit(PCIDevice *d)
+{
+ IgbState *igb = IGB(d);
+ MemoryRegion *mr = &igb->msix;
+
+ pcie_sriov_exit(d);
+ pcie_cap_exit(d);
+ msix_unuse_all_vectors(d);
+ msix_uninit(d, mr, mr);
+ msi_uninit(d);
+ pci_e1000_uninit(d);
+}
+
+static void igb_reset(DeviceState *dev)
+{
+ E1000State *ed = E1000(dev);
+ PCIDevice *d = PCI_DEVICE(dev);
+ pcie_sriov_reset_vfs(d);
+ e1000_reset(ed);
+}
+
+
+static int pci_igbvf_init(PCIDevice *d)
+{
+ int v;
+ IgbState *igb = IGBVF(d);
+ MemoryRegion *mr = &igb->flash;
+ int ret = pci_e1000_init(d);
+ if (ret) {
+ return ret;
+ }
+
+ mr = &igb->msix;
+ memory_region_init(mr, OBJECT(d), "igbvf-msix", 0x8000);
+ pcie_register_vf_bar(d, 3, mr);
+ ret = msix_init(d, IGB_MSIX_VECTORS_VF, mr, IGB_MSIX_BAR, 0, mr,
+ IGB_MSIX_BAR, 0x2000, 0x70);
+ if (ret) {
+ goto err_msix;
+ }
+
+ for (v = 0; v < IGB_MSIX_VECTORS_VF; v++) {
+ ret = msix_vector_use(d, v);
+ if (ret) {
+ goto err_pcie_cap;
+ }
+ }
+
+ ret = pcie_endpoint_cap_init(d, 0xa0);
+ if (ret < 0) {
+ goto err_pcie_cap;
+ }
+
+ ret = pcie_aer_init(d, 0x100);
+ if (ret < 0) {
+ goto err_aer;
+ }
+
+ pcie_ari_init(d, 0x150, 1);
+ return 0;
+
+ err_aer:
+ pcie_cap_exit(d);
+ err_pcie_cap:
+ msix_unuse_all_vectors(d);
+ msix_uninit(d, mr, mr);
+ err_msix:
+ pci_e1000_uninit(d);
+ return ret;
+}
+
+
+static void pci_igbvf_uninit(PCIDevice *d)
+{
+ IgbState *igb = IGBVF(d);
+ MemoryRegion *mr = &igb->msix;
+
+ pcie_cap_exit(d);
+ msix_uninit(d, mr, mr);
+ pci_e1000_uninit(d);
+}
+
+static void igb_class_init(ObjectClass *klass, void *data)
+{
+ PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+ DeviceClass *dc = DEVICE_CLASS(klass);
+
+ e1000_class_init(klass, data);
+ /* extend/modify some methods/settings: */
+ k->is_express = 1;
+ k->init = pci_igb_init;
+ k->exit = pci_igb_uninit;
+ dc->reset = igb_reset;
+}
+
+static void igbvf_class_init(ObjectClass *klass, void *data)
+{
+ PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+ e1000_class_init(klass, data);
+
+ /* extend/modify some methods/settings: */
+ k->is_express = 1;
+ k->romfile = NULL;
+ k->init = pci_igbvf_init;
+ k->exit = pci_igbvf_uninit;
+}
+
+static const E1000Info igb_device = {
+ .name = "igb",
+ .device_id = E1000_DEV_ID_82576,
+ .revision = 0x01,
+ .io_bar = 2,
+ .phy_id2 = I210_I_PHY_ID2,
+};
+
+static const E1000Info igbvf_device = {
+ .name = TYPE_IGBVF,
+ .device_id = IGB_82576_VF_DEV_ID,
+ .revision = 0x01,
+ .io_bar = (uint8_t)-1,
+ .phy_id2 = I210_I_PHY_ID2,
+};
+
+
+static const TypeInfo igb_info = {
+ .name = TYPE_IGB,
+ .parent = TYPE_E1000_BASE,
+ .class_data = (void *) &igb_device,
+ .class_init = igb_class_init,
+ .instance_size = sizeof(IgbState),
+};
+
+
+static const TypeInfo igbvf_info = {
+ .name = "igbvf",
+ .parent = TYPE_E1000_BASE,
+ .class_data = (void *) &igbvf_device,
+ .class_init = igbvf_class_init,
+ .instance_size = sizeof(IgbState),
+};
+
+
+static void igb_register_types(void)
+{
+ type_register_static(&igb_info);
+ type_register_static(&igbvf_info);
+}
+
+type_init(igb_register_types)
diff --git a/hw/net/igb_regs.h b/hw/net/igb_regs.h
new file mode 100644
index 0000000..496eb42
--- /dev/null
+++ b/hw/net/igb_regs.h
@@ -0,0 +1,27 @@
+/*
+ * igb_regs.h
+ *
+ * Intel 82576 Gigabit Ethernet Adapter
+ * (SR/IOV capable PCIe ethernet device) emulation
+ *
+ * Partly derived from kernel/drivers/net/ethernet/intel/igb/igb.h
+ *
+ * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "e1000_regs.h"
+
+/* Experimental igb device (with SR/IOV) IDs for PF and VF: */
+#define E1000_DEV_ID_82576 0x10C9
+#define IGB_82576_VF_DEV_ID 0x10CA
+
+#define I210_I_PHY_ID2 0x0c00
+#define IGB_MSIX_BAR 3
+#define IGB_TOTAL_VFS 8
+#define IGB_CAP_SRIOV_OFFSET 0x160
+#define IGB_MSIX_VECTORS_PF 3
+#define IGB_MSIX_VECTORS_VF 3
--
1.9.0
* Re: [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization
2014-08-29 7:17 [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization Knut Omang
` (3 preceding siblings ...)
2014-08-29 7:17 ` [Qemu-devel] [PATCH 4/4] igb: Example code to illustrate the SR/IOV support Knut Omang
@ 2014-08-29 16:17 ` Michael S. Tsirkin
2014-08-29 16:23 ` Knut Omang
4 siblings, 1 reply; 9+ messages in thread
From: Michael S. Tsirkin @ 2014-08-29 16:17 UTC (permalink / raw)
To: Knut Omang
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi, Laszlo Ersek,
Alexander Graf, qemu-devel, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong, Gabriel Somlo
On Fri, Aug 29, 2014 at 09:17:05AM +0200, Knut Omang wrote:
> This patch set consists of two parts:
>
> - The two first patches implements SR/IOV building blocks in pcie/pci.
> I have held this patch back for a while because I haven't had a good
> example test case to accompany it, but in the light of the latest
> developments such as the discussion we have had around ARI and downstream
> switches and root ports, I think it would be valuable to get this in,
> to make it easier for people to experiment with creating devices with
> many functions. Hopefully, I can also get some help to fix the hotplug
> issues described below.
>
> - The two last patches contains an example to illustrate the usage
> of the new SR/IOV support. The example leverages the e1000
> code and the fact that registers between E1000 and the PCIe based
> Intel 82576 Gigabit Ethernet Adapter (which supports SR/IOV) are quite similar,
> but so far without considering much of the differences beyond the
> bare minimum needed to trick the igb driver into loading to the point
> where VFs can be enabled.
>
> So you cannot yet use these PF or VF Ethernet devices to send ethernet packets,
> and the implementation do not in any way attempt to model the internals
> of igb such as the multiple queues or any multiplexing onto the same device,
> it only instantiates the VFs as well as the PFs mostly directly by inheritance
> (and as separate devices) from E1000, but it should hopefully be relatively
> easy to understand how to proceed to make "true" VFs.
> It was also a nice exercise in using QOM.
>
> The changes to E1000 to accommodate igb (patch 3) should be fairly
> non-intrusive, nevertheless I suppose it should not be applied
> unless it will eventually lead to a new derived device which is enough
> different from E1000 to qualify for a separate set of source files.
> So if someone with more detailed knowledge of the internal differences
> between igb and e1000 on the functional level might have input, that would be
> great, an alternative of course be to only apply the two first
> patches, and leave any usage examples for the future.
>
> To test and see how it plays out, you can add something like this
> to the command line:
>
> -device ioh3420,slot=0,id=pcie_port.0
> -device igb,mac=DE:AD:BE:EE:04:18,vlan=1,bus=pcie_port.0
> -net tap,vlan=1
>
> The Linux igb driver does not yet support changing the VF setup via sysfs
> (the preferred way since kernel v.3.8) so to see some VFs on the guest,
> you need to set up a modprobe file with something like this:
>
> options igb max_vfs=4
>
> For some reason I dont yet understand, removal of the igb driver
> does not cause a 0 write to sr/iov vf_enable to disable the VFs again, eg.
>
> # lspci | grep '^01:00'
> 01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
> 01:00.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 01:00.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 01:00.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 01:00.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
>
> Writing directly to vf_enable using setpci works though, but after the later hotplug
> changes, a little too well, as after the VF removals have succeded,
> some way a slot power down gets triggered, also removing the PF:
>
> # setpci -s 01:00.0 168.b
> 0x19
> # setpci -s 01:00.0 168.b=18
> # lspci | grep '^01:00'
> <nothing>
>
> For similar reasons attaching the igb device directly on the root complex (just remove
> bus parameter above) attempts to enable VFs will fail with
> qdev.c:89: bus_add_child: Assertion `bus->allow_hotplug' failed
>
> I imagine this could for instance be a matter of defining a property "subfunction"
> or similar that allows qdev (QOM?) to discriminate between main devices and devices
> representing subfunctions of another main device. Suggestions on how to proceed on this
> welcome.
>
> This patch depends (by diff context only) on my patch:
>
> [PATCH v2 1/4] pcie: Fix incorrect write to the ari capability next function field
>
> and for stability on Paolo's
>
> [PATCH] pci_bridge: manually destroy memory regions within PCIBridgeWindows
>
> both of which Michael has pulled, but which are not in master yet.
>
> Thanks,
So practically you would like patches 1 and 2 applied,
and 3 and 4 are RFC?
> Knut Omang (4):
> pci: Update pci_regs header
> pcie: Add support for Single Root I/O Virtualization (SR/IOV)
> e1000: Refactor to allow subclassing from other source file
> igb: Example code to illustrate the SR/IOV support.
>
> hw/i386/kvm/pci-assign.c | 4 +-
> hw/misc/vfio.c | 8 +-
> hw/net/Makefile.objs | 2 +-
> hw/net/e1000.c | 126 ++--------------
> hw/net/e1000.h | 139 +++++++++++++++++
> hw/net/igb.c | 293 ++++++++++++++++++++++++++++++++++++
> hw/net/igb_regs.h | 27 ++++
> hw/pci/msi.c | 4 -
> hw/pci/msix.c | 2 +-
> hw/pci/pci.c | 107 +++++++++----
> hw/pci/pcie.c | 205 ++++++++++++++++++++++++-
> include/hw/pci/pci.h | 6 +-
> include/hw/pci/pci_regs.h | 371 ++++++++++++++++++++++++++++++++++------------
> include/hw/pci/pcie.h | 26 ++++
> 14 files changed, 1072 insertions(+), 248 deletions(-)
> create mode 100644 hw/net/e1000.h
> create mode 100644 hw/net/igb.c
> create mode 100644 hw/net/igb_regs.h
>
> --
> 1.9.0
* Re: [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization
2014-08-29 16:17 ` [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization Michael S. Tsirkin
@ 2014-08-29 16:23 ` Knut Omang
0 siblings, 0 replies; 9+ messages in thread
From: Knut Omang @ 2014-08-29 16:23 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi, Laszlo Ersek,
Alexander Graf, qemu-devel, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong, Gabriel Somlo
On Fri, 2014-08-29 at 18:17 +0200, Michael S. Tsirkin wrote:
> On Fri, Aug 29, 2014 at 09:17:05AM +0200, Knut Omang wrote:
> > This patch set consists of two parts:
> >
> > - The two first patches implements SR/IOV building blocks in pcie/pci.
> > I have held this patch back for a while because I haven't had a good
> > example test case to accompany it, but in the light of the latest
> > developments such as the discussion we have had around ARI and downstream
> > switches and root ports, I think it would be valuable to get this in,
> > to make it easier for people to experiment with creating devices with
> > many functions. Hopefully, I can also get some help to fix the hotplug
> > issues described below.
> >
> > - The two last patches contains an example to illustrate the usage
> > of the new SR/IOV support. The example leverages the e1000
> > code and the fact that registers between E1000 and the PCIe based
> > Intel 82576 Gigabit Ethernet Adapter (which supports SR/IOV) are quite similar,
> > but so far without considering much of the differences beyond the
> > bare minimum needed to trick the igb driver into loading to the point
> > where VFs can be enabled.
> >
> > So you cannot yet use these PF or VF Ethernet devices to send ethernet packets,
> > and the implementation do not in any way attempt to model the internals
> > of igb such as the multiple queues or any multiplexing onto the same device,
> > it only instantiates the VFs as well as the PFs mostly directly by inheritance
> > (and as separate devices) from E1000, but it should hopefully be relatively
> > easy to understand how to proceed to make "true" VFs.
> > It was also a nice exercise in using QOM.
> >
> > The changes to E1000 to accommodate igb (patch 3) should be fairly
> > non-intrusive, nevertheless I suppose it should not be applied
> > unless it will eventually lead to a new derived device which is enough
> > different from E1000 to qualify for a separate set of source files.
> > So if someone with more detailed knowledge of the internal differences
> > between igb and e1000 on the functional level might have input, that would be
> > great, an alternative of course be to only apply the two first
> > patches, and leave any usage examples for the future.
> >
> > To test and see how it plays out, you can add something like this
> > to the command line:
> >
> > -device ioh3420,slot=0,id=pcie_port.0
> > -device igb,mac=DE:AD:BE:EE:04:18,vlan=1,bus=pcie_port.0
> > -net tap,vlan=1
> >
> > The Linux igb driver does not yet support changing the VF setup via sysfs
> > (the preferred way since kernel v.3.8) so to see some VFs on the guest,
> > you need to set up a modprobe file with something like this:
> >
> > options igb max_vfs=4
> >
> > For some reason I dont yet understand, removal of the igb driver
> > does not cause a 0 write to sr/iov vf_enable to disable the VFs again, eg.
> >
> > # lspci | grep '^01:00'
> > 01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
> > 01:00.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 01:00.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 01:00.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 01:00.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> >
> > Writing directly to vf_enable using setpci works though, but after the later hotplug
> > changes, a little too well, as after the VF removals have succeded,
> > some way a slot power down gets triggered, also removing the PF:
> >
> > # setpci -s 01:00.0 168.b
> > 0x19
> > # setpci -s 01:00.0 168.b=18
> > # lspci | grep '^01:00'
> > <nothing>
> >
> > For similar reasons attaching the igb device directly on the root complex (just remove
> > bus parameter above) attempts to enable VFs will fail with
> > qdev.c:89: bus_add_child: Assertion `bus->allow_hotplug' failed
> >
> > I imagine this could for instance be a matter of defining a property "subfunction"
> > or similar that allows qdev (QOM?) to discriminate between main devices and devices
> > representing subfunctions of another main device. Suggestions on how to proceed on this
> > welcome.
> >
> > This patch depends (by diff context only) on my patch:
> >
> > [PATCH v2 1/4] pcie: Fix incorrect write to the ari capability next function field
> >
> > and for stability on Paolo's
> >
> > [PATCH] pci_bridge: manually destroy memory regions within PCIBridgeWindows
> >
> > both of which Michael has pulled, but which are not in master yet.
> >
> > Thanks,
>
> So practically you would like patches 1 and 2 applied,
> and 3 and 4 are RFC?
Yes, I suppose that's a better way of formulating my intentions, thanks!
Knut
> > Knut Omang (4):
> > pci: Update pci_regs header
> > pcie: Add support for Single Root I/O Virtualization (SR/IOV)
> > e1000: Refactor to allow subclassing from other source file
> > igb: Example code to illustrate the SR/IOV support.
> >
> > hw/i386/kvm/pci-assign.c | 4 +-
> > hw/misc/vfio.c | 8 +-
> > hw/net/Makefile.objs | 2 +-
> > hw/net/e1000.c | 126 ++--------------
> > hw/net/e1000.h | 139 +++++++++++++++++
> > hw/net/igb.c | 293 ++++++++++++++++++++++++++++++++++++
> > hw/net/igb_regs.h | 27 ++++
> > hw/pci/msi.c | 4 -
> > hw/pci/msix.c | 2 +-
> > hw/pci/pci.c | 107 +++++++++----
> > hw/pci/pcie.c | 205 ++++++++++++++++++++++++-
> > include/hw/pci/pci.h | 6 +-
> > include/hw/pci/pci_regs.h | 371 ++++++++++++++++++++++++++++++++++------------
> > include/hw/pci/pcie.h | 26 ++++
> > 14 files changed, 1072 insertions(+), 248 deletions(-)
> > create mode 100644 hw/net/e1000.h
> > create mode 100644 hw/net/igb.c
> > create mode 100644 hw/net/igb_regs.h
> >
> > --
> > 1.9.0
* Re: [Qemu-devel] [PATCH 2/4] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
2014-08-29 7:17 ` [Qemu-devel] [PATCH 2/4] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
@ 2014-09-01 9:39 ` Michael S. Tsirkin
2014-09-01 18:34 ` Knut Omang
0 siblings, 1 reply; 9+ messages in thread
From: Michael S. Tsirkin @ 2014-09-01 9:39 UTC (permalink / raw)
To: Knut Omang
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi, Laszlo Ersek,
Alexander Graf, qemu-devel, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong, Gabriel Somlo
On Fri, Aug 29, 2014 at 09:17:07AM +0200, Knut Omang wrote:
> This patch provides the building blocks for creating an SR/IOV
> PCIe Extended Capability header and creating and removing
> SR/IOV Virtual Functions.
>
> Signed-off-by: Knut Omang <knut.omang@oracle.com>
> ---
> hw/pci/pci.c | 107 +++++++++++++++++++-------
> hw/pci/pcie.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++-
> include/hw/pci/pci.h | 6 +-
> include/hw/pci/pcie.h | 26 +++++++
> 4 files changed, 311 insertions(+), 33 deletions(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index daeaeac..071ab81 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -35,7 +35,6 @@
> #include "hw/pci/msi.h"
> #include "hw/pci/msix.h"
> #include "exec/address-spaces.h"
> -#include "hw/hotplug.h"
>
> //#define DEBUG_PCI
> #ifdef DEBUG_PCI
> @@ -126,6 +125,9 @@ static int pci_bar(PCIDevice *d, int reg)
> {
> uint8_t type;
>
> + /* PCIe virtual functions do not have their own BARs */
> + assert(!d->exp.is_vf);
> +
> if (reg != PCI_ROM_SLOT)
> return PCI_BASE_ADDRESS_0 + reg * 4;
>
> @@ -184,22 +186,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
> }
> }
>
> -static void pci_do_device_reset(PCIDevice *dev)
> +static void pci_reset_regions(PCIDevice *dev)
> {
> int r;
> + if (dev->exp.is_vf) {
> + return;
> + }
>
> - pci_device_deassert_intx(dev);
> - assert(dev->irq_state == 0);
> -
> - /* Clear all writable bits */
> - pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> - pci_get_word(dev->wmask + PCI_COMMAND) |
> - pci_get_word(dev->w1cmask + PCI_COMMAND));
> - pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> - pci_get_word(dev->wmask + PCI_STATUS) |
> - pci_get_word(dev->w1cmask + PCI_STATUS));
> - dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> - dev->config[PCI_INTERRUPT_LINE] = 0x0;
> for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> PCIIORegion *region = &dev->io_regions[r];
> if (!region->size) {
> @@ -213,6 +206,27 @@ static void pci_do_device_reset(PCIDevice *dev)
> pci_set_long(dev->config + pci_bar(dev, r), region->type);
> }
> }
> +}
> +
> +static void pci_do_device_reset(PCIDevice *dev)
> +{
> + qdev_reset_all(&dev->qdev);
> +
> + dev->irq_state = 0;
> + pci_update_irq_status(dev);
> + pci_device_deassert_intx(dev);
> + assert(dev->irq_state == 0);
> +
> + /* Clear all writable bits */
> + pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> + pci_get_word(dev->wmask + PCI_COMMAND) |
> + pci_get_word(dev->w1cmask + PCI_COMMAND));
> + pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> + pci_get_word(dev->wmask + PCI_STATUS) |
> + pci_get_word(dev->w1cmask + PCI_STATUS));
> + dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> + dev->config[PCI_INTERRUPT_LINE] = 0x0;
> + pci_reset_regions(dev);
> pci_update_mappings(dev);
>
> msi_reset(dev);
> @@ -734,6 +748,14 @@ static int pci_init_multifunction(PCIBus *bus, PCIDevice *dev)
> dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
> }
>
> + /* With SR/IOV and ARI, subsequent function 0's are only
> + * another VF which the physical function is placed in the initial
> + * function (0.0)
Couldn't parse this. Could you please reword?
> + */
> + if (dev->exp.pf && dev->exp.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> + return 0;
> + }
> +
> /*
> * multifunction bit is interpreted in two ways as follows.
> * - all functions must set the bit to 1.
> @@ -920,6 +942,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
> uint64_t wmask;
> pcibus_t size = memory_region_size(memory);
>
> + assert(!pci_dev->exp.is_vf); /* VFs must use pcie_register_vf_bar */
> assert(region_num >= 0);
> assert(region_num < PCI_NUM_REGIONS);
> if (size & (size-1)) {
> @@ -1018,18 +1041,51 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
> return pci_dev->io_regions[region_num].addr;
> }
>
> -static pcibus_t pci_bar_address(PCIDevice *d,
> - int reg, uint8_t type, pcibus_t size)
> +
> +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> + uint8_t type, pcibus_t size)
> +{
> + pcibus_t new_addr;
> + if (!d->exp.is_vf) {
> + int bar = pci_bar(d, reg);
> + if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> + new_addr = pci_get_quad(d->config + bar);
> + } else {
> + new_addr = pci_get_long(d->config + bar);
> + }
> + } else {
> + int bar = d->exp.pf->exp.sriov_cap + PCI_SRIOV_BAR + reg * 4;
> + uint32_t vf_num = d->devfn - (d->exp.pf->devfn + 1);
> + if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> + new_addr = pci_get_quad(d->exp.pf->config + bar);
> + } else {
> + new_addr = pci_get_long(d->exp.pf->config + bar);
> + }
> + new_addr += vf_num * size;
> +
> + PCI_DPRINTF("%02d:%02d.%d: (vf %d) found config reg %d, "
> + "size 0x%lx addr 0x%lx\n",
> + pci_bus_num(d->bus), d->devfn >> 3, d->devfn & 7,
> + vf_num, reg, size, new_addr);
I don't think this is needed; assuming it is, don't we want this
for all functions, not just vfs?
> + }
> + if (reg != PCI_ROM_SLOT) {
> + /* Preserve the rom enable bit */
> + new_addr &= ~(size - 1);
> + }
> + return new_addr;
> +}
> +
> +pcibus_t pci_bar_address(PCIDevice *d,
> + int reg, uint8_t type, pcibus_t size)
> {
> pcibus_t new_addr, last_addr;
> - int bar = pci_bar(d, reg);
> uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
>
> if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> if (!(cmd & PCI_COMMAND_IO)) {
> return PCI_BAR_UNMAPPED;
> }
> - new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> + new_addr = pci_config_get_bar_addr(d, reg, type, size);
Hmm this is IO, VFs don't have IO, do they? Change really necessary?
> last_addr = new_addr + size - 1;
> /* Check if 32 bit BAR wraps around explicitly.
> * TODO: make priorities correct and remove this work around.
> @@ -1043,11 +1099,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> if (!(cmd & PCI_COMMAND_MEMORY)) {
> return PCI_BAR_UNMAPPED;
> }
> - if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> - new_addr = pci_get_quad(d->config + bar);
> - } else {
> - new_addr = pci_get_long(d->config + bar);
> - }
> + new_addr = pci_config_get_bar_addr(d, reg, type, size);
> /* the ROM slot has a specific enable bit */
> if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
> return PCI_BAR_UNMAPPED;
> @@ -1146,9 +1198,10 @@ uint32_t pci_default_read_config(PCIDevice *d,
> return le32_to_cpu(val);
> }
>
> -void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
> +void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
> {
> int i, was_irq_disabled = pci_irq_disabled(d);
> + uint32_t val = val_in;
>
> for (i = 0; i < l; val >>= 8, ++i) {
> uint8_t wmask = d->wmask[addr + i];
> @@ -1170,8 +1223,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
> & PCI_COMMAND_MASTER);
> }
>
> - msi_write_config(d, addr, val, l);
> - msix_write_config(d, addr, val, l);
> + msi_write_config(d, addr, val_in, l);
> + msix_write_config(d, addr, val_in, l);
> + pcie_sriov_config_write(d, addr, val_in, l);
> }
Not sure I get this chunk.
A code comment might be helpful.
>
> /***********************************************************/
> @@ -1776,7 +1830,6 @@ static int pci_qdev_init(DeviceState *qdev)
> is_default_rom = true;
> }
> pci_add_option_rom(pci_dev, is_default_rom);
> -
> return 0;
> }
>
generally, don't make unrelated changes, review is
hard enough.
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index 6cb6e0c..b7af693 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -254,7 +254,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
> * Right now, only a device of function = 0 is allowed to be
> * hot plugged/unplugged.
> */
> - assert(PCI_FUNC(pci_dev->devfn) == 0);
> + assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_dev->exp.is_vf);
>
> pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
> PCI_EXP_SLTSTA_PDS);
> @@ -266,10 +266,11 @@ void pcie_cap_slot_hot_unplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
> Error **errp)
> {
> uint8_t *exp_cap;
> + PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
>
> - pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev, &exp_cap, errp);
> + pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
>
> - pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
> + pcie_cap_slot_push_attention_button(pdev);
> }
>
> /* pci express slot for pci express root/downstream port
> @@ -409,7 +410,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
> }
>
> /*
> - * If the slot is polulated, power indicator is off and power
> + * If the slot is populated, power indicator is off and power
> * controller is off, it is safe to detach the devices.
> */
> if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> @@ -633,3 +634,199 @@ void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn)
> offset, PCI_ARI_SIZEOF);
> pci_set_long(dev->config + offset + PCI_ARI_CAP, (nextfn & 0xff) << 8);
> }
> +
> +
> +
don't add > 1 empty line in a row.
> +/* SR/IOV */
this comment is really useless, function name makes it clear.
> +void pcie_sriov_init(PCIDevice *dev, uint16_t offset,
> + const char *vfname, uint16_t vf_dev_id,
> + uint16_t init_vfs, uint16_t total_vfs)
> +{
> + uint8_t *cfg = dev->config + offset;
> + uint8_t *wmask;
empty line here after variables.
> + pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> + offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> + dev->exp.sriov_cap = offset;
> + dev->exp.num_vfs = 0;
> + dev->exp.vfname = g_strdup(vfname);
I don't see a free for this field anywhere.
> + dev->exp.vf = NULL;
> +
> + /* set some sensible defaults - devices can override later */
do they in fact override?
> + pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, 0x1);
> + pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, 0x1);
> + pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
what's this magic constant?
needs a comment.
> + pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> +
> + /* Set up device ID and initial/total number of VFs available */
> + pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> + pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> + pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> +
> + /* Write enable control bits */
> + wmask = dev->wmask + offset;
> + pci_set_word(wmask + PCI_SRIOV_CTRL,
> + PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
> + pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> +
> + qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> +}
> +
> +
2 empty lines again
> +void pcie_sriov_exit(PCIDevice *dev)
> +{
> + PCIE_DPRINTF("\n");
> + pcie_sriov_reset_vfs(dev);
> +}
> +
> +void pcie_sriov_init_bar(PCIDevice *dev, int region_num,
> + uint8_t type, dma_addr_t size)
> +{
> + uint32_t addr;
> + uint64_t wmask;
> + uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> + assert(sriov_cap > 0);
> + assert(region_num >= 0);
> + assert(region_num < PCI_NUM_REGIONS);
> + assert(region_num != PCI_ROM_SLOT);
> +
> + wmask = ~(size - 1);
> + addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> +
> + pci_set_long(dev->config + addr, type);
> + if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> + type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> + pci_set_quad(dev->wmask + addr, wmask);
> + pci_set_quad(dev->cmask + addr, ~0ULL);
> + } else {
> + pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> + pci_set_long(dev->cmask + addr, 0xffffffff);
> + }
> + dev->exp.vf_bar_type[region_num] = type;
> +}
> +
> +void pcie_register_vf_bar(PCIDevice *dev, int region_num,
> + MemoryRegion *memory)
> +{
> + PCIIORegion *r;
> + uint8_t type;
> + pcibus_t size = memory_region_size(memory);
> +
> + assert(dev->exp.is_vf); /* PFs must use pci_register_bar */
> + assert(region_num >= 0);
> + assert(region_num < PCI_NUM_REGIONS);
> + type = dev->exp.pf->exp.vf_bar_type[region_num];
> +
> + if (size & (size-1)) {
> + fprintf(stderr, "ERROR: PCI region size must be pow2 "
power of 2
eschew abbreviation
> + "type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
> + exit(1);
> + }
> +
> + r = &dev->io_regions[region_num];
> + r->memory = memory;
> + r->address_space =
> + type & PCI_BASE_ADDRESS_SPACE_IO
> + ? dev->bus->address_space_io
> + : dev->bus->address_space_mem;
> + r->size = size;
> + r->type = type;
> +
> + r->addr = pci_bar_address(dev, region_num, r->type, r->size);
> + if (r->addr != PCI_BAR_UNMAPPED) {
> + memory_region_add_subregion_overlap(r->address_space,
> + r->addr, r->memory, 1);
> + }
> +}
> +
> +
> +static PCIDevice *pcie_create_vf(PCIDevice *pf, int devfn, const char *name)
> +{
> + int ret;
> + PCIDevice *dev = pci_create(pf->bus, devfn, name);
> + dev->exp.is_vf = true;
> + dev->exp.pf = pf;
> +
> + ret = qdev_init(&dev->qdev);
> + if (ret)
> + return NULL;
> +
> + /* set vid/did according to sr/iov spec - they are not used */
> + pci_config_set_vendor_id(dev->config, 0xffff);
> + pci_config_set_device_id(dev->config, 0xffff);
> + return dev;
> +}
> +
> +
> +void pcie_sriov_create_vfs(PCIDevice *dev)
> +{
> + uint16_t num_vfs;
> + uint16_t i;
> + int32_t devfn = dev->devfn + 1;
> + uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> + assert(sriov_cap > 0);
> + num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> +
> + dev->exp.vf = g_malloc(sizeof(PCIDevice*) * num_vfs);
Don't see a free to match this malloc call.
> + assert(dev->exp.vf);
> +
> + PCIE_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
> + for (i = 0; i < num_vfs; i++) {
> + dev->exp.vf[i] = pcie_create_vf(dev, devfn++, dev->exp.vfname);
> + if (!dev->exp.vf[i]) {
> + PCIE_DEV_PRINTF(dev, "Failed to create VF %d\n", i);
> + num_vfs = i;
> + break;
> + }
> + }
> + dev->exp.num_vfs = num_vfs;
> +}
> +
> +
> +void pcie_sriov_reset_vfs(PCIDevice *dev)
> +{
> + Error *local_err = NULL;
> + uint16_t num_vfs = dev->exp.num_vfs;
> + uint16_t i;
> + PCIE_DEV_PRINTF(dev, "Resetting %d vf devs\n", num_vfs);
> + for (i = 0; i < num_vfs; i++) {
> + qdev_unplug(&dev->exp.vf[i]->qdev, &local_err);
> + if (local_err) {
> + fprintf(stderr, "Failed to unplug: %s\n",
> + error_get_pretty(local_err));
> + error_free(local_err);
> + }
> + }
> + dev->exp.num_vfs = 0;
> +}
> +
> +
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> +{
> + uint32_t off;
> + uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> + if (!sriov_cap || address < sriov_cap) {
> + return;
> + }
> + off = address - sriov_cap;
> + if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> + return;
> + }
> +
> + PCIE_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x len %d\n",
> + sriov_cap, off, val, len);
> +
> + if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> + if (dev->exp.num_vfs) {
> + if (!(val & PCI_SRIOV_CTRL_VFE)) {
> + pcie_sriov_reset_vfs(dev);
> + }
> + } else {
> + if (val & PCI_SRIOV_CTRL_VFE) {
> + pcie_sriov_create_vfs(dev);
> + }
> + }
> + }
> +}
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index c352c7b..d36c68d 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -11,8 +11,6 @@
> /* PCI includes legacy ISA access. */
> #include "hw/isa/isa.h"
>
> -#include "hw/pci/pcie.h"
> -
> /* PCI bus */
>
> #define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
> @@ -127,6 +125,7 @@ enum {
> #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
>
> #include "hw/pci/pci_regs.h"
> +#include "hw/pci/pcie.h"
>
> /* PCI HEADER_TYPE */
> #define PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
why move it here?
> @@ -413,6 +412,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>
> +pcibus_t pci_bar_address(PCIDevice *d,
> + int reg, uint8_t type, pcibus_t size);
> +
> static inline void
> pci_set_byte(uint8_t *config, uint8_t val)
> {
> diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> index d139d58..b34d831 100644
> --- a/include/hw/pci/pcie.h
> +++ b/include/hw/pci/pcie.h
> @@ -74,6 +74,15 @@ struct PCIExpressDevice {
> /* AER */
> uint16_t aer_cap;
> PCIEAERLog aer_log;
> +
> + /* SR/IOV */
> + uint16_t sriov_cap;
> + uint16_t num_vfs; /* Number of virtual functions created */
> + bool is_vf; /* Set if this device is a virtual function */
Checking pf pointer not enough?
> + const char *vfname; /* Reference to the device type used for the VFs */
> + PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
> + PCIDevice *pf; /* Pointer back to owner physical function */
> + uint8_t vf_bar_type[PCI_NUM_REGIONS]; /* Store type for each VF bar */
There are two types of fields here: some are per-vf,
some are sriov fields in pf.
Pls make this clearer:
- create structures for each group
- name fields sriov_pf and sriov_vf accordingly.
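Something along these lines, perhaps. This is only a rough sketch of the split being suggested, not code from the patch: Dev stands in for PCIDevice, NUM_REGIONS for PCI_NUM_REGIONS, and the field and helper names (SriovPF, SriovVF, dev_is_vf) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGIONS 7            /* stand-in for PCI_NUM_REGIONS */

struct Dev;                      /* stand-in for PCIDevice */

/* State that only makes sense on a physical function */
typedef struct SriovPF {
    uint16_t num_vfs;            /* number of VFs currently created */
    char *vfname;                /* device type used for the VFs */
    struct Dev **vf;             /* array of num_vfs VF devices */
    uint8_t vf_bar_type[NUM_REGIONS];
} SriovPF;

/* State that only makes sense on a virtual function */
typedef struct SriovVF {
    struct Dev *pf;              /* owning physical function */
} SriovVF;

typedef struct Dev {
    uint16_t sriov_cap;
    SriovPF sriov_pf;
    SriovVF sriov_vf;
} Dev;

/* With the back pointer in place, a separate is_vf flag is redundant */
static bool dev_is_vf(const Dev *d)
{
    assert(d != NULL);
    return d->sriov_vf.pf != NULL;
}
```

With this layout a zero-initialized device is a non-VF by construction, and a VF is anything whose sriov_vf.pf has been set.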
> };
>
> #define COMPAT_PROP_PCP "power_controller_present"
> @@ -115,6 +124,23 @@ void pcie_add_capability(PCIDevice *dev,
> uint16_t offset, uint16_t size);
>
> void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> +void pcie_sriov_init(PCIDevice *dev, uint16_t offset,
> + const char *vfname, uint16_t vf_dev_id,
> + uint16_t init_vfs, uint16_t total_vfs);
> +void pcie_sriov_exit(PCIDevice *dev);
> +
Assumption is, this applies to pf only?
> +/* Set up a VF bar in the SR/IOV bar area */
> +void pcie_sriov_init_bar(PCIDevice *s, int region_num,
> + uint8_t type, dma_addr_t size);
> +
> +/* Instantiate a bar for a VF */
> +void pcie_register_vf_bar(PCIDevice *pci_dev, int region_num,
> + MemoryRegion *memory);
> +
Again, this isn't very clear. prefix everything with
pcie_sriov_pf_ and pcie_sriov_vf_ to make it clear
what each applies to.
> +void pcie_sriov_create_vfs(PCIDevice *dev);
Do we need to export this one?
> +void pcie_sriov_reset_vfs(PCIDevice *dev);
How about pcie_sriov_reset? Callers don't need to know
what you do internally I think.
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> + uint32_t val, int len);
>
> extern const VMStateDescription vmstate_pcie_device;
I'm not sure which functions apply to pfs, which to vfs, and which to
both. Function names should make this clear.
>
> --
> 1.9.0
* Re: [Qemu-devel] [PATCH 2/4] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
2014-09-01 9:39 ` Michael S. Tsirkin
@ 2014-09-01 18:34 ` Knut Omang
0 siblings, 0 replies; 9+ messages in thread
From: Knut Omang @ 2014-09-01 18:34 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Maydell, Peter Crosthwaite, Stefan Hajnoczi, Laszlo Ersek,
Alexander Graf, qemu-devel, Fabien Chouteau, Luiz Capitulino,
Beniamino Galvani, Alex Williamson, Gonglei (Arei), Jan Kiszka,
Anthony Liguori, Paolo Bonzini, Amos Kong, Gabriel Somlo
On Mon, 2014-09-01 at 12:39 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 29, 2014 at 09:17:07AM +0200, Knut Omang wrote:
> > This patch provides the building blocks for creating an SR/IOV
> > PCIe Extended Capability header and creating and removing
> > SR/IOV Virtual Functions.
> >
> > Signed-off-by: Knut Omang <knut.omang@oracle.com>
> > ---
> > hw/pci/pci.c | 107 +++++++++++++++++++-------
> > hw/pci/pcie.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > include/hw/pci/pci.h | 6 +-
> > include/hw/pci/pcie.h | 26 +++++++
> > 4 files changed, 311 insertions(+), 33 deletions(-)
> >
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index daeaeac..071ab81 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -35,7 +35,6 @@
> > #include "hw/pci/msi.h"
> > #include "hw/pci/msix.h"
> > #include "exec/address-spaces.h"
> > -#include "hw/hotplug.h"
> >
> > //#define DEBUG_PCI
> > #ifdef DEBUG_PCI
> > @@ -126,6 +125,9 @@ static int pci_bar(PCIDevice *d, int reg)
> > {
> > uint8_t type;
> >
> > + /* PCIe virtual functions do not have their own BARs */
> > + assert(!d->exp.is_vf);
> > +
> > if (reg != PCI_ROM_SLOT)
> > return PCI_BASE_ADDRESS_0 + reg * 4;
> >
> > @@ -184,22 +186,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
> > }
> > }
> >
> > -static void pci_do_device_reset(PCIDevice *dev)
> > +static void pci_reset_regions(PCIDevice *dev)
> > {
> > int r;
> > + if (dev->exp.is_vf) {
> > + return;
> > + }
> >
> > - pci_device_deassert_intx(dev);
> > - assert(dev->irq_state == 0);
> > -
> > - /* Clear all writable bits */
> > - pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > - pci_get_word(dev->wmask + PCI_COMMAND) |
> > - pci_get_word(dev->w1cmask + PCI_COMMAND));
> > - pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > - pci_get_word(dev->wmask + PCI_STATUS) |
> > - pci_get_word(dev->w1cmask + PCI_STATUS));
> > - dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > - dev->config[PCI_INTERRUPT_LINE] = 0x0;
> > for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> > PCIIORegion *region = &dev->io_regions[r];
> > if (!region->size) {
> > @@ -213,6 +206,27 @@ static void pci_do_device_reset(PCIDevice *dev)
> > pci_set_long(dev->config + pci_bar(dev, r), region->type);
> > }
> > }
> > +}
> > +
> > +static void pci_do_device_reset(PCIDevice *dev)
> > +{
> > + qdev_reset_all(&dev->qdev);
> > +
> > + dev->irq_state = 0;
> > + pci_update_irq_status(dev);
> > + pci_device_deassert_intx(dev);
> > + assert(dev->irq_state == 0);
> > +
> > + /* Clear all writable bits */
> > + pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > + pci_get_word(dev->wmask + PCI_COMMAND) |
> > + pci_get_word(dev->w1cmask + PCI_COMMAND));
> > + pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > + pci_get_word(dev->wmask + PCI_STATUS) |
> > + pci_get_word(dev->w1cmask + PCI_STATUS));
> > + dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > + dev->config[PCI_INTERRUPT_LINE] = 0x0;
> > + pci_reset_regions(dev);
> > pci_update_mappings(dev);
> >
> > msi_reset(dev);
> > @@ -734,6 +748,14 @@ static int pci_init_multifunction(PCIBus *bus, PCIDevice *dev)
> > dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
> > }
> >
> > + /* With SR/IOV and ARI, subsequent function 0's are only
> > + * another VF which the physical function is placed in the initial
> > + * function (0.0)
>
> Couldn't parse this. Could you please reword?
The main point is that if this is a VF (it references a PF), the checks
below for multifunction on function 0 do not apply, as it is really a
virtual function that happens to get a devfn which yields
function number 0 in the legacy PCI interpretation.
I realize I need (at least for the purpose of documentation)
to do some more work on VF_OFFSET and VF_STRIDE,
which I have assumed are both 1 for simplicity; see my comment below.
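To illustrate the devfn point, here is a minimal sketch (not the patch code; the helper names are hypothetical) of how the same 8-bit devfn reads under the legacy slot/function split versus ARI, assuming offset 1 and stride 1 as in this version of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Legacy interpretation: 5-bit device (slot) number, 3-bit function number */
static unsigned legacy_slot(uint8_t devfn) { return (devfn >> 3) & 0x1f; }
static unsigned legacy_func(uint8_t devfn) { return devfn & 0x07; }

/* ARI interpretation: all 8 bits form the function number */
static unsigned ari_func(uint8_t devfn) { return devfn; }

/* devfn of the VF with zero-based index, assuming offset 1 and stride 1 */
static uint8_t vf_devfn(uint8_t pf_devfn, unsigned index)
{
    assert(pf_devfn + 1u + index <= 0xff);
    return (uint8_t)(pf_devfn + 1u + index);
}
```

With the PF at devfn 0, the VF with index 7 lands on devfn 8, which the legacy view decodes as slot 1, function 0, even though under ARI it is simply function 8 of the same device.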
> > + */
> > + if (dev->exp.pf && dev->exp.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> > + return 0;
> > + }
> > +
> > /*
> > * multifunction bit is interpreted in two ways as follows.
> > * - all functions must set the bit to 1.
> > @@ -920,6 +942,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
> > uint64_t wmask;
> > pcibus_t size = memory_region_size(memory);
> >
> > + assert(!pci_dev->exp.is_vf); /* VFs must use pcie_register_vf_bar */
> > assert(region_num >= 0);
> > assert(region_num < PCI_NUM_REGIONS);
> > if (size & (size-1)) {
> > @@ -1018,18 +1041,51 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
> > return pci_dev->io_regions[region_num].addr;
> > }
> >
> > -static pcibus_t pci_bar_address(PCIDevice *d,
> > - int reg, uint8_t type, pcibus_t size)
> > +
> > +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> > + uint8_t type, pcibus_t size)
> > +{
> > + pcibus_t new_addr;
> > + if (!d->exp.is_vf) {
> > + int bar = pci_bar(d, reg);
> > + if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > + new_addr = pci_get_quad(d->config + bar);
> > + } else {
> > + new_addr = pci_get_long(d->config + bar);
> > + }
> > + } else {
> > + int bar = d->exp.pf->exp.sriov_cap + PCI_SRIOV_BAR + reg * 4;
> > + uint32_t vf_num = d->devfn - (d->exp.pf->devfn + 1);
> > + if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > + new_addr = pci_get_quad(d->exp.pf->config + bar);
> > + } else {
> > + new_addr = pci_get_long(d->exp.pf->config + bar);
> > + }
> > + new_addr += vf_num * size;
> > +
> > + PCI_DPRINTF("%02d:%02d.%d: (vf %d) found config reg %d, "
> > + "size 0x%lx addr 0x%lx\n",
> > + pci_bus_num(d->bus), d->devfn >> 3, d->devfn & 7,
> > + vf_num, reg, size, new_addr);
>
> I don't think this is needed; assuming it is, don't we want this
> for all functions, not just vfs?
It was useful for debugging the size and address calculations,
which are less obvious than in the plain PCI case since they are derived
from the VF bars in the PF, but I can remove it.
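For reference, the calculation boils down to the following. This is a simplified sketch under the same offset 1, stride 1 assumption, with hypothetical helper names; it masks the base before adding the per-VF offset, which gives the same result as the patch for power-of-two sizes:

```c
#include <assert.h>
#include <stdint.h>

/* Zero-based VF number derived from devfn, assuming offset 1 and stride 1 */
static uint32_t vf_number(uint8_t vf_devfn, uint8_t pf_devfn)
{
    assert(vf_devfn > pf_devfn);
    return (uint32_t)vf_devfn - ((uint32_t)pf_devfn + 1);
}

/*
 * Address of a VF's region: the VF BAR value stored in the PF's SR/IOV
 * capability, with the low type bits masked off, plus one region-sized
 * slice per preceding VF.
 */
static uint64_t vf_bar_addr(uint64_t pf_vf_bar, uint32_t vf_num, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
    return (pf_vf_bar & ~(size - 1)) + (uint64_t)vf_num * size;
}
```

So all VFs share one VF BAR in the PF, and each VF's region is a fixed-size slice of the window that BAR describes.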
> > + }
> > + if (reg != PCI_ROM_SLOT) {
> > + /* Preserve the rom enable bit */
> > + new_addr &= ~(size - 1);
> > + }
> > + return new_addr;
> > +}
> > +
> > +pcibus_t pci_bar_address(PCIDevice *d,
> > + int reg, uint8_t type, pcibus_t size)
> > {
> > pcibus_t new_addr, last_addr;
> > - int bar = pci_bar(d, reg);
> > uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
> >
> > if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> > if (!(cmd & PCI_COMMAND_IO)) {
> > return PCI_BAR_UNMAPPED;
> > }
> > - new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> > + new_addr = pci_config_get_bar_addr(d, reg, type, size);
>
> Hmm this is IO, VFs don't have IO, do they? Change really necessary?
I think it simplifies the code. Notice that the outer
bar variable is not needed elsewhere, so I would have had to
change something in that if() anyway.
> > last_addr = new_addr + size - 1;
> > /* Check if 32 bit BAR wraps around explicitly.
> > * TODO: make priorities correct and remove this work around.
> > @@ -1043,11 +1099,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> > if (!(cmd & PCI_COMMAND_MEMORY)) {
> > return PCI_BAR_UNMAPPED;
> > }
> > - if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > - new_addr = pci_get_quad(d->config + bar);
> > - } else {
> > - new_addr = pci_get_long(d->config + bar);
> > - }
> > + new_addr = pci_config_get_bar_addr(d, reg, type, size);
> > /* the ROM slot has a specific enable bit */
> > if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
> > return PCI_BAR_UNMAPPED;
> > @@ -1146,9 +1198,10 @@ uint32_t pci_default_read_config(PCIDevice *d,
> > return le32_to_cpu(val);
> > }
> >
> > -void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
> > +void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
> > {
> > int i, was_irq_disabled = pci_irq_disabled(d);
> > + uint32_t val = val_in;
> >
> > for (i = 0; i < l; val >>= 8, ++i) {
> > uint8_t wmask = d->wmask[addr + i];
> > @@ -1170,8 +1223,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
> > & PCI_COMMAND_MASTER);
> > }
> >
> > - msi_write_config(d, addr, val, l);
> > - msix_write_config(d, addr, val, l);
> > + msi_write_config(d, addr, val_in, l);
> > + msix_write_config(d, addr, val_in, l);
> > + pcie_sriov_config_write(d, addr, val_in, l);
> > }
>
>
> Not sure I get this chunk.
> A code comment might be helpful.
I believe I fixed a bug here - it should perhaps have been a separate patch
(should I factor it out, if you agree?)
This fix lost my attention as things started to work and I moved forward:
The old code would never propagate any set bits through to the msi/msix
write functions, as they get lost in the shift inside the for(..) above,
which is why I introduced a new local val and renamed the argument to
val_in.
Otherwise it is just another capability subscribing to config writes,
i.e. pcie_sriov_config_write only has an effect if the config
address matches a valid sriov capability, which is similar to how the
msi and msix calls are used.
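A stripped-down illustration of the effect, using the same loop shape as pci_default_write_config (this is only a sketch; consume_bytes is a made-up name, and the wmask handling is left out):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Store val byte by byte, shifting it down as we go, and return
 * whatever is left of val once the loop has finished.
 */
static uint32_t consume_bytes(uint32_t val, int len, uint8_t *out)
{
    assert(len >= 1 && len <= 4);
    for (int i = 0; i < len; val >>= 8, ++i) {
        out[i] = val & 0xff;
    }
    return val; /* for len == 4 every bit has been shifted out */
}
```

For a 4-byte write the value left after the loop is always 0, so anything called afterwards with the same variable sees no set bits; keeping the original argument in val_in avoids that.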
> >
> > /***********************************************************/
> > @@ -1776,7 +1830,6 @@ static int pci_qdev_init(DeviceState *qdev)
> > is_default_rom = true;
> > }
> > pci_add_option_rom(pci_dev, is_default_rom);
> > -
> > return 0;
> > }
> >
>
> generally, don't make unrelated changes, review is
> hard enough.
Sorry about that - I tried to look for unintended white space changes
but some seem to have slipped by anyway - the code went
through several rebases on the way.
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 6cb6e0c..b7af693 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -254,7 +254,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
> > * Right now, only a device of function = 0 is allowed to be
> > * hot plugged/unplugged.
> > */
> > - assert(PCI_FUNC(pci_dev->devfn) == 0);
> > + assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_dev->exp.is_vf);
> >
> > pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
> > PCI_EXP_SLTSTA_PDS);
> > @@ -266,10 +266,11 @@ void pcie_cap_slot_hot_unplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
> > Error **errp)
> > {
> > uint8_t *exp_cap;
> > + PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
> >
> > - pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev, &exp_cap, errp);
> > + pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
> >
> > - pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
> > + pcie_cap_slot_push_attention_button(pdev);
> > }
> >
> > /* pci express slot for pci express root/downstream port
> > @@ -409,7 +410,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
> > }
> >
> > /*
> > - * If the slot is polulated, power indicator is off and power
> > + * If the slot is populated, power indicator is off and power
> > * controller is off, it is safe to detach the devices.
> > */
> > if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> > @@ -633,3 +634,199 @@ void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn)
> > offset, PCI_ARI_SIZEOF);
> > pci_set_long(dev->config + offset + PCI_ARI_CAP, (nextfn & 0xff) << 8);
> > }
> > +
> > +
> > +
>
> don't add > 1 empty line in a row.
Got it - will clean up; same for the rest of the unintended white space
changes.
> > +/* SR/IOV */
>
> this comment is really useless, function name makes it clear.
I just tried to follow the style in the source file - e.g. /* ARI */
above.
Continuing with your suggestions for better datatypes below, I ended up
factoring the sriov code out into a separate source/header file pair
similar to the aer code instead, hope that's ok.
> > +void pcie_sriov_init(PCIDevice *dev, uint16_t offset,
> > + const char *vfname, uint16_t vf_dev_id,
> > + uint16_t init_vfs, uint16_t total_vfs)
> > +{
> > + uint8_t *cfg = dev->config + offset;
> > + uint8_t *wmask;
>
> empty line here after variables.
Good, I like that myself too;
the existing code uses both 0 and 1 lines.
> > + pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> > + offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> > + dev->exp.sriov_cap = offset;
> > + dev->exp.num_vfs = 0;
> > + dev->exp.vfname = g_strdup(vfname);
>
> I don't see a free for this field anywhere.
Ouch - will fix.
> > + dev->exp.vf = NULL;
> > +
> > + /* set some sensible defaults - devices can override later */
>
> do they in fact override?
Hmm - I realize I need to handle this better. At present it is
hard coded to work for stride 1, offset 1 - it should be relatively
easy to make it more generic.
If a device implementing SR/IOV wants stride > 1, or for instance has
more than 1 PF and so needs to start at offset > 1, or supports
more page sizes, it would have to set other values accordingly.
The igb I have experimented with in the lab actually sets stride 2 -
it also sets offset 0x80. That is, it implements two PFs at
functions 0 and 1, and the VFs from each PF end
up interleaved at every second function, starting at devfn 10.0.
In my igb example code, I deliberately just used the hard coded stride 1,
offset 1 - will fix that instead of documenting the misfeature.
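As a sketch of what the generic version would compute (the function name is hypothetical and n is zero-based here; this follows the offset/stride rule as I read the SR/IOV spec, not anything in the current patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Routing ID (bus << 8 | devfn) of the n-th VF of a PF, n counted from 0:
 * PF routing ID + First VF Offset + n * VF Stride.
 */
static uint16_t sriov_vf_rid(uint16_t pf_rid, uint16_t first_vf_offset,
                             uint16_t vf_stride, uint16_t n)
{
    uint32_t rid = (uint32_t)pf_rid + first_vf_offset + (uint32_t)n * vf_stride;

    assert(rid <= 0xffff); /* must still fit in a 16-bit routing ID */
    return (uint16_t)rid;
}
```

With the igb values (offset 0x80, stride 2), PF 0 gets VFs at devfn 0x80, 0x82, ... and PF 1 at 0x81, 0x83, ..., which is the interleaving described above; 0x80 decodes as 10.0 in the hex slot.function notation. With offset 1 and stride 1 it reduces to the pf devfn + 1 + n used in the patch.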
> > + pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, 0x1);
> > + pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, 0x1);
> > + pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
>
> what's this magic constant?
> needs a comment.
I agree, will add it - the mandatory set of page sizes to support.
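For what it's worth, the value decodes as follows, assuming the usual encoding where bit n of the register means a page size of 2^(n+12) bytes is supported (the helper name is made up for this sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the page size in bytes is advertised in the supported-sizes mask */
static bool sriov_pgsize_supported(uint32_t mask, uint64_t bytes)
{
    assert(bytes != 0);
    for (unsigned bit = 0; bit < 32; bit++) {
        if (((uint64_t)4096 << bit) == bytes) {
            return (mask >> bit) & 1;
        }
    }
    return false; /* not of the form 4096 << n */
}
```

Under that encoding, 0x553 has bits 0, 1, 4, 6, 8 and 10 set, i.e. 4K, 8K, 64K, 256K, 1M and 4M.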
> > + pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> > +
> > + /* Set up device ID and initial/total number of VFs available */
> > + pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> > + pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> > + pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> > +
> > + /* Write enable control bits */
> > + wmask = dev->wmask + offset;
> > + pci_set_word(wmask + PCI_SRIOV_CTRL,
> > + PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
> > + pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> > +
> > + qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> > +}
> > +
> > +
>
>
> 2 empty lines again
sorry, old habits...
> > +void pcie_sriov_exit(PCIDevice *dev)
> > +{
> > + PCIE_DPRINTF("\n");
> > + pcie_sriov_reset_vfs(dev);
> > +}
> > +
> > +void pcie_sriov_init_bar(PCIDevice *dev, int region_num,
> > + uint8_t type, dma_addr_t size)
> > +{
> > + uint32_t addr;
> > + uint64_t wmask;
> > + uint16_t sriov_cap = dev->exp.sriov_cap;
> > +
> > + assert(sriov_cap > 0);
> > + assert(region_num >= 0);
> > + assert(region_num < PCI_NUM_REGIONS);
> > + assert(region_num != PCI_ROM_SLOT);
> > +
> > + wmask = ~(size - 1);
> > + addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> > +
> > + pci_set_long(dev->config + addr, type);
> > + if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> > + type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > + pci_set_quad(dev->wmask + addr, wmask);
> > + pci_set_quad(dev->cmask + addr, ~0ULL);
> > + } else {
> > + pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> > + pci_set_long(dev->cmask + addr, 0xffffffff);
> > + }
> > + dev->exp.vf_bar_type[region_num] = type;
> > +}
> > +
> > +void pcie_register_vf_bar(PCIDevice *dev, int region_num,
> > + MemoryRegion *memory)
> > +{
> > + PCIIORegion *r;
> > + uint8_t type;
> > + pcibus_t size = memory_region_size(memory);
> > +
> > + assert(dev->exp.is_vf); /* PFs must use pci_register_bar */
> > + assert(region_num >= 0);
> > + assert(region_num < PCI_NUM_REGIONS);
> > + type = dev->exp.pf->exp.vf_bar_type[region_num];
> > +
> > + if (size & (size-1)) {
> > + fprintf(stderr, "ERROR: PCI region size must be pow2 "
>
> power of 2
>
> eschew abbreviation
agree - cut'n paste from pci.c:pci_register_bar() but nevertheless...
> > + "type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
> > + exit(1);
> > + }
> > +
> > + r = &dev->io_regions[region_num];
> > + r->memory = memory;
> > + r->address_space =
> > + type & PCI_BASE_ADDRESS_SPACE_IO
> > + ? dev->bus->address_space_io
> > + : dev->bus->address_space_mem;
> > + r->size = size;
> > + r->type = type;
> > +
> > + r->addr = pci_bar_address(dev, region_num, r->type, r->size);
> > + if (r->addr != PCI_BAR_UNMAPPED) {
> > + memory_region_add_subregion_overlap(r->address_space,
> > + r->addr, r->memory, 1);
> > + }
> > +}
> > +
> > +
> > +static PCIDevice *pcie_create_vf(PCIDevice *pf, int devfn, const char *name)
> > +{
> > + int ret;
> > + PCIDevice *dev = pci_create(pf->bus, devfn, name);
> > + dev->exp.is_vf = true;
> > + dev->exp.pf = pf;
> > +
> > + ret = qdev_init(&dev->qdev);
> > + if (ret)
> > + return NULL;
> > +
> > + /* set vid/did according to sr/iov spec - they are not used */
> > + pci_config_set_vendor_id(dev->config, 0xffff);
> > + pci_config_set_device_id(dev->config, 0xffff);
> > + return dev;
> > +}
> > +
> > +
> > +void pcie_sriov_create_vfs(PCIDevice *dev)
> > +{
> > + uint16_t num_vfs;
> > + uint16_t i;
> > + int32_t devfn = dev->devfn + 1;
> > + uint16_t sriov_cap = dev->exp.sriov_cap;
> > +
> > + assert(sriov_cap > 0);
> > + num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> > +
> > + dev->exp.vf = g_malloc(sizeof(PCIDevice*) * num_vfs);
>
> Don't see a free to match this malloc call.
You're quite right - will fix, thanks.
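A minimal sketch of the cleanup that would pair with the allocation above. It uses stand-in types and plain calloc/free so that it compiles outside QEMU; SriovPFSketch, sriov_pf_alloc_vfs() and sriov_pf_free_vfs() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the PF-side SR/IOV state. */
typedef struct SriovPFSketch {
    void **vf;         /* array of num_vfs VF device pointers */
    uint16_t num_vfs;  /* number of entries in vf */
} SriovPFSketch;

/* Allocate the VF pointer array; returns 0 on success, -1 on failure. */
static int sriov_pf_alloc_vfs(SriovPFSketch *pf, uint16_t num_vfs)
{
    /* Allocate at least one slot so a zero count never yields NULL. */
    pf->vf = calloc(num_vfs ? num_vfs : 1, sizeof(*pf->vf));
    if (!pf->vf) {
        return -1;
    }
    pf->num_vfs = num_vfs;
    return 0;
}

/* Release the array and clear the fields so a second call is harmless. */
static void sriov_pf_free_vfs(SriovPFSketch *pf)
{
    free(pf->vf);
    pf->vf = NULL;
    pf->num_vfs = 0;
}
```

In the patch itself the counterpart of g_malloc() would be g_free(), presumably called from the path that tears the VFs down.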
> > + assert(dev->exp.vf);
> > +
> > + PCIE_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
> > + for (i = 0; i < num_vfs; i++) {
> > + dev->exp.vf[i] = pcie_create_vf(dev, devfn++, dev->exp.vfname);
> > + if (!dev->exp.vf[i]) {
> > + PCIE_DEV_PRINTF(dev, "Failed to create VF %d\n", i);
> > + num_vfs = i;
> > + break;
> > + }
> > + }
> > + dev->exp.num_vfs = num_vfs;
> > +}
> > +
> > +
> > +void pcie_sriov_reset_vfs(PCIDevice *dev)
> > +{
> > + Error *local_err = NULL;
> > + uint16_t num_vfs = dev->exp.num_vfs;
> > + uint16_t i;
> > + PCIE_DEV_PRINTF(dev, "Resetting %d vf devs\n", num_vfs);
> > + for (i = 0; i < num_vfs; i++) {
> > + qdev_unplug(&dev->exp.vf[i]->qdev, &local_err);
> > + if (local_err) {
> > + fprintf(stderr, "Failed to unplug: %s\n",
> > + error_get_pretty(local_err));
> > + error_free(local_err);
> > + }
> > + }
> > + dev->exp.num_vfs = 0;
> > +}
> > +
> > +
> > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > +{
> > + uint32_t off;
> > + uint16_t sriov_cap = dev->exp.sriov_cap;
> > +
> > + if (!sriov_cap || address < sriov_cap) {
> > + return;
> > + }
> > + off = address - sriov_cap;
> > + if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> > + return;
> > + }
> > +
> > + PCIE_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x len %d\n",
> > + sriov_cap, off, val, len);
> > +
> > + if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > + if (dev->exp.num_vfs) {
> > + if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > + pcie_sriov_reset_vfs(dev);
> > + }
> > + } else {
> > + if (val & PCI_SRIOV_CTRL_VFE) {
> > + pcie_sriov_create_vfs(dev);
> > + }
> > + }
> > + }
> > +}
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index c352c7b..d36c68d 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -11,8 +11,6 @@
> > /* PCI includes legacy ISA access. */
> > #include "hw/isa/isa.h"
> >
> > -#include "hw/pci/pcie.h"
> > -
> > /* PCI bus */
> >
> > #define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
> > @@ -127,6 +125,7 @@ enum {
> > #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
> >
> > #include "hw/pci/pci_regs.h"
> > +#include "hw/pci/pcie.h"
> >
> > /* PCI HEADER_TYPE */
> > #define PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
>
> why move it here?
pcie.h now depends on PCI_NUM_REGIONS, which is defined in pci.h after the
old position of the pcie.h include.
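The ordering constraint can be reproduced in a single file. This is only a sketch: the two commented sections stand in for pci.h and pcie.h, the struct name is hypothetical, and the value of PCI_NUM_REGIONS is assumed for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the part of pci.h that must come first. */
#define PCI_NUM_REGIONS 7 /* value assumed for this sketch */

/*
 * Stand-in for pcie.h. The array bound below needs PCI_NUM_REGIONS,
 * so moving this struct above the #define would not compile.
 */
struct PCIExpressDeviceSketch {
    uint8_t vf_bar_type[PCI_NUM_REGIONS]; /* one type byte per BAR */
};
```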
> > @@ -413,6 +412,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> > AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> > void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
> >
> > +pcibus_t pci_bar_address(PCIDevice *d,
> > + int reg, uint8_t type, pcibus_t size);
> > +
> > static inline void
> > pci_set_byte(uint8_t *config, uint8_t val)
> > {
> > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > index d139d58..b34d831 100644
> > --- a/include/hw/pci/pcie.h
> > +++ b/include/hw/pci/pcie.h
> > @@ -74,6 +74,15 @@ struct PCIExpressDevice {
> > /* AER */
> > uint16_t aer_cap;
> > PCIEAERLog aer_log;
> > +
> > + /* SR/IOV */
> > + uint16_t sriov_cap;
> > + uint16_t num_vfs; /* Number of virtual functions created */
> > + bool is_vf; /* Set if this device is a virtual function */
>
> Checking pf pointer not enough?
It should be. I considered that, but kept is_vf to be
consistent with is_express and is_bridge.
I will eliminate it and add and use a pci_is_vf()
helper instead.
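A helper along these lines could look roughly as follows. This is a sketch with a stand-in device type (PCIDeviceSketch is hypothetical); it assumes that the pf back-pointer is NULL for every device that is not a VF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PCIDevice, reduced to the one field used. */
typedef struct PCIDeviceSketch PCIDeviceSketch;
struct PCIDeviceSketch {
    PCIDeviceSketch *pf; /* owning physical function, NULL for non-VFs */
};

/* A device is a VF exactly when it has an owning PF. */
static inline bool pci_is_vf(const PCIDeviceSketch *dev)
{
    return dev->pf != NULL;
}
```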
> > + const char *vfname; /* Reference to the device type used for the VFs */
> > + PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
> > + PCIDevice *pf; /* Pointer back to owner physical function */
> > + uint8_t vf_bar_type[PCI_NUM_REGIONS]; /* Store type for each VF bar */
>
> There are two types of fields here: some are per-vf,
> some are sriov fields in pf.
> Pls make this clearer:
> - create structures for each group
> - name fields sriov_pf and sriov_vf accordingly.
I was unsure about the desired number of nesting levels versus abstraction
for readability - will do as you suggest.
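One possible shape for the suggested split, as a sketch only. The struct names PCIESriovPF and PCIESriovVF, the stand-in PCIDeviceSketch, the constant SKETCH_NUM_REGIONS and the lookup helper are all hypothetical; the fields are the ones from the quoted hunk, grouped by which side owns them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_NUM_REGIONS = 7 }; /* assumed stand-in for PCI_NUM_REGIONS */

typedef struct PCIDeviceSketch PCIDeviceSketch;

/* State that only a physical function carries. */
typedef struct PCIESriovPF {
    uint16_t sriov_cap;      /* offset of the SR/IOV capability */
    uint16_t num_vfs;        /* number of VFs currently created */
    const char *vfname;      /* device type used for the VFs */
    PCIDeviceSketch **vf;    /* array of num_vfs VF devices */
    uint8_t vf_bar_type[SKETCH_NUM_REGIONS]; /* type of each VF BAR */
} PCIESriovPF;

/* State that only a virtual function carries. */
typedef struct PCIESriovVF {
    PCIDeviceSketch *pf;     /* owning physical function */
} PCIESriovVF;

struct PCIDeviceSketch {
    PCIESriovPF sriov_pf;
    PCIESriovVF sriov_vf;
};

/* A VF looks up its BAR type through the owning PF. */
static uint8_t pcie_sriov_vf_bar_type(const PCIDeviceSketch *vf,
                                      int region_num)
{
    assert(vf->sriov_vf.pf != NULL);
    assert(region_num >= 0 && region_num < SKETCH_NUM_REGIONS);
    return vf->sriov_vf.pf->sriov_pf.vf_bar_type[region_num];
}
```

With this grouping a field access shows at the call site whether PF or VF state is being touched.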
> > };
> >
> > #define COMPAT_PROP_PCP "power_controller_present"
> > @@ -115,6 +124,23 @@ void pcie_add_capability(PCIDevice *dev,
> > uint16_t offset, uint16_t size);
> >
> > void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> > +void pcie_sriov_init(PCIDevice *dev, uint16_t offset,
> > + const char *vfname, uint16_t vf_dev_id,
> > + uint16_t init_vfs, uint16_t total_vfs);
> > +void pcie_sriov_exit(PCIDevice *dev);
> > +
>
> Assumption is, this applies to pf only?
I see your concern - it is similar to the one I had with ARI forwarding vs.
the ARI capability. I will make it clearer!
> > +/* Set up a VF bar in the SR/IOV bar area */
> > +void pcie_sriov_init_bar(PCIDevice *s, int region_num,
> > + uint8_t type, dma_addr_t size);
> > +
> > +/* Instantiate a bar for a VF */
> > +void pcie_register_vf_bar(PCIDevice *pci_dev, int region_num,
> > + MemoryRegion *memory);
> > +
>
> Again, this isn't very clear. prefix everything with
> pcie_sriov_pf_ and pcie_sriov_vf_ to make it clear
> what each applies to.
Yes
> > +void pcie_sriov_create_vfs(PCIDevice *dev);
>
>
> Do we need to export this one?
>
> > +void pcie_sriov_reset_vfs(PCIDevice *dev);
Hmm, it is needed both as a result of a PF reset and during destruction of
a device (which does not necessarily yield a PCI reset), and I call it
from the igb code.
But your comment makes me realize that I should:
1) probably rename the create/reset_vfs functions to
pcie_sriov_pf_register/unregister_vfs() or similar, as that would better
reflect what they do;
2) call reset (unregister) unconditionally for SR/IOV devices (PFs) from
pci_unregister_device().
> How about pcie_sriov_reset? Callers don't need to know
> what you do internally I think.
Actually, it seems the external API can be made a lot simpler;
I am working on that.
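To illustrate the control flow discussed above, here is a rough sketch with stand-in types. All names (SriovPFSketch, the sriov_pf_* functions, SKETCH_SRIOV_CTRL_VFE) are hypothetical, and registering VFs is reduced to updating a counter.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SRIOV_CTRL_VFE 0x01 /* stand-in for the VF Enable bit */

typedef struct SriovPFSketch {
    uint16_t requested_vfs; /* value the guest wrote to NumVFs */
    uint16_t num_vfs;       /* VFs currently registered */
} SriovPFSketch;

static void sriov_pf_register_vfs(SriovPFSketch *pf)
{
    pf->num_vfs = pf->requested_vfs;
}

static void sriov_pf_unregister_vfs(SriovPFSketch *pf)
{
    pf->num_vfs = 0;
}

/* React to a write of the SR/IOV control register. */
static void sriov_pf_ctrl_write(SriovPFSketch *pf, uint16_t val)
{
    if (pf->num_vfs) {
        if (!(val & SKETCH_SRIOV_CTRL_VFE)) {
            sriov_pf_unregister_vfs(pf);
        }
    } else if (val & SKETCH_SRIOV_CTRL_VFE) {
        sriov_pf_register_vfs(pf);
    }
}

/* Device teardown unregisters unconditionally; no PCI reset is needed. */
static void sriov_pf_teardown(SriovPFSketch *pf)
{
    sriov_pf_unregister_vfs(pf);
}
```

Only the control-register write handler and the teardown hook would need to be visible to callers in this arrangement; the register/unregister pair could stay internal.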
> > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > + uint32_t val, int len);
> >
> > extern const VMStateDescription vmstate_pcie_device;
>
>
> I'm not sure which functions apply to pfs,vfs, and which to both
> pfs and vfs. function names should make this clear.
I get it.
Thanks a lot for your thorough review,
Knut
> > --
> > 1.9.0
end of thread, other threads:[~2014-09-01 18:35 UTC | newest]
Thread overview: 9+ messages
2014-08-29 7:17 [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization Knut Omang
2014-08-29 7:17 ` [Qemu-devel] [PATCH 1/4] pci: Update pci_regs header Knut Omang
2014-08-29 7:17 ` [Qemu-devel] [PATCH 2/4] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
2014-09-01 9:39 ` Michael S. Tsirkin
2014-09-01 18:34 ` Knut Omang
2014-08-29 7:17 ` [Qemu-devel] [PATCH 3/4] e1000: Refactor to allow subclassing from other source file Knut Omang
2014-08-29 7:17 ` [Qemu-devel] [PATCH 4/4] igb: Example code to illustrate the SR/IOV support Knut Omang
2014-08-29 16:17 ` [Qemu-devel] [PATCH 0/4] pcie: Add support for Single Root I/O Virtualization Michael S. Tsirkin
2014-08-29 16:23 ` Knut Omang