* [PATCH] vfio: selftests: Add driver for IGB QEMU device
@ 2026-05-11 21:18 Josh Hilke
2026-05-11 23:45 ` David Matlack
0 siblings, 1 reply; 11+ messages in thread
From: Josh Hilke @ 2026-05-11 21:18 UTC (permalink / raw)
To: Alex Williamson, David Matlack
Cc: Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm,
linux-kernel, linux-kselftest, Josh Hilke
Add a VFIO selftest driver for the Intel Gigabit Ethernet controller
(IGB). IGB is fully virtualized in QEMU [1] which makes it easy to run
VFIO selftests without needing any specific hardware.
IGB does not have a native memcpy operation, and memcpy is required for
all VFIO selftest drivers. However, IGB has a "loopback" mode which this
driver uses to implement memcpy. When IGB is in loopback mode, it
doesn't send data out to the network; instead, it sends data from the
Tx queue directly to the Rx queue, whose descriptors point at the
memcpy destination.
This driver passes all of the VFIO selftests in
tools/testing/selftests/vfio/ when running in QEMU. The driver has not
been tested using a real IGB device since the main goal of writing this
driver is to run VFIO selftests in QEMU without requiring any hardware.
The following command was used to test the driver. It runs
./tools/testing/selftests/vfio/vfio_pci_driver_test using virtme-ng [2],
which runs a kernel in a virtualized environment using QEMU with a
copy-on-write snapshot of the host filesystem.
vng \
--run arch/x86/boot/bzImage \
--user root \
--disable-microvm \
--memory 32G \
--cpus 8 \
--qemu-opts="-M q35,accel=kvm,kernel-irqchip=split" \
--qemu-opts="-device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on" \
--qemu-opts="-netdev user,id=net0 -device igb,netdev=net0,addr=09.0" \
--append "console=ttyS0 earlyprintk=ttyS0 intel_iommu=on iommu=pt" \
--exec "modprobe vfio-pci && \
./tools/testing/selftests/vfio/scripts/setup.sh 0000:00:09.0 && \
./tools/testing/selftests/vfio/scripts/run.sh ./tools/testing/selftests/vfio/vfio_pci_driver_test"
The code was written entirely by AI (Gemini) by feeding it the Intel
IGB specification, the code for the real IGB driver, and the QEMU
implementation of the IGB device. It took many iterations of prompting
to get the driver to pass all of the tests, and lots of de-slopping to
remove unnecessary code and make the code readable.
Code comments were also written by Gemini through iterative
prompting.
This patch is based on the kvm/queue branch.
[1] https://www.qemu.org/docs/master/system/devices/igb.html
[2] https://github.com/arighi/virtme-ng
Assisted-by: Gemini:gemini-3-pro-preview
Signed-off-by: Josh Hilke <jrhilke@google.com>
---
.../selftests/vfio/lib/drivers/igb/igb.c | 381 ++++++++++++++++++
.../vfio/lib/drivers/igb/registers.h | 103 +++++
tools/testing/selftests/vfio/lib/libvfio.mk | 1 +
.../selftests/vfio/lib/vfio_pci_driver.c | 2 +
4 files changed, 487 insertions(+)
create mode 100644 tools/testing/selftests/vfio/lib/drivers/igb/igb.c
create mode 100644 tools/testing/selftests/vfio/lib/drivers/igb/registers.h
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
new file mode 100644
index 000000000000..13c8429784ac
--- /dev/null
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -0,0 +1,381 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <unistd.h>
+#include <errno.h>
+#include <stdint.h>
+#include <linux/io.h>
+#include <linux/pci_regs.h>
+#include <libvfio/vfio_pci_device.h>
+
+#include "registers.h"
+
+#define PCI_VENDOR_ID_INTEL 0x8086
+#define PCI_DEVICE_ID_INTEL_IGB 0x10C9
+#define IGB_MAX_CHUNK_SIZE 1024
+#define MSIX_VECTOR 0
+#define RING_SIZE 4096 /* Number of descriptors in ring */
+
+struct igb_tx_desc {
+ union {
+ struct {
+ uint64_t buffer_addr; /* Address of descriptor's data buffer */
+ uint32_t cmd_type_len; /* Command/Type/Length */
+ uint32_t olinfo_status; /* Context/Buffer info */
+ } read;
+
+ struct {
+ uint64_t rsvd; /* Reserved */
+ uint32_t nxtseq_seed; /* Next sequence seed */
+ uint32_t status; /* Descriptor status */
+ } wb;
+ };
+};
+
+struct igb_rx_desc {
+ union {
+ struct {
+ uint64_t pkt_addr; /* Packet buffer address */
+ uint64_t hdr_addr; /* Header buffer address */
+ } read;
+ struct {
+ uint16_t pkt_info; /* RSS type, Packet type */
+ uint16_t hdr_info; /* Split Head, buf len */
+ uint32_t rss; /* RSS Hash */
+ uint32_t status_error; /* ext status/error */
+ uint16_t length; /* Packet length */
+ uint16_t vlan; /* VLAN tag */
+ } wb; /* writeback */
+ };
+};
+
+struct igb {
+ void *bar0;
+ struct igb_tx_desc *tx_ring;
+ struct igb_rx_desc *rx_ring;
+ uint32_t tx_tail;
+ uint32_t rx_tail;
+};
+
+static inline struct igb *to_igb_state(struct vfio_pci_device *device)
+{
+ return (struct igb *)device->driver.region.vaddr;
+}
+
+static inline void igb_write32(struct igb *igb, uint32_t reg, uint32_t val)
+{
+ writel(val, igb->bar0 + reg);
+}
+
+static inline uint32_t igb_read32(struct igb *igb, uint32_t reg)
+{
+ return readl(igb->bar0 + reg);
+}
+
+static int igb_write_phy(struct igb *igb, uint32_t offset, uint16_t data)
+{
+ uint32_t mdic;
+ int i;
+
+ mdic = (((uint32_t)data) |
+ (offset << IGB_MDIC_REG_SHIFT) |
+ (1 << IGB_MDIC_PHY_SHIFT) |
+ IGB_MDIC_OP_WRITE);
+
+ igb_write32(igb, IGB_MDIC, mdic);
+
+ for (i = 0; i < 1000; i++) {
+ usleep(50);
+ mdic = igb_read32(igb, IGB_MDIC);
+ if (mdic & IGB_MDIC_READY)
+ break;
+ }
+
+ if (!(mdic & IGB_MDIC_READY))
+ return -1;
+
+ if (mdic & IGB_MDIC_ERROR)
+ return -1;
+
+ return 0;
+}
+
+static int igb_read_phy(struct igb *igb, uint32_t offset, uint16_t *data)
+{
+ uint32_t mdic;
+ int i;
+
+ mdic = ((offset << IGB_MDIC_REG_SHIFT) |
+ (1 << IGB_MDIC_PHY_SHIFT) |
+ IGB_MDIC_OP_READ);
+
+ igb_write32(igb, IGB_MDIC, mdic);
+
+ for (i = 0; i < 1000; i++) {
+ usleep(50);
+ mdic = igb_read32(igb, IGB_MDIC);
+ if (mdic & IGB_MDIC_READY)
+ break;
+ }
+
+ if (!(mdic & IGB_MDIC_READY))
+ return -1;
+
+ if (mdic & IGB_MDIC_ERROR)
+ return -1;
+
+ *data = (uint16_t)mdic;
+ return 0;
+}
+
+static int igb_phy_setup_autoneg(struct igb *igb)
+{
+ int timeout_ms = 1000;
+ bool success = false;
+ uint16_t phy_status;
+ int ret;
+ int i;
+
+ /* Trigger auto-negotiation */
+ ret = igb_write_phy(igb, IGB_PHY_CTRL_REG_OFFSET,
+ IGB_PHY_CTRL_AN_ENABLE | IGB_PHY_CTRL_RESTART_AN);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < timeout_ms; i++) {
+ if (igb_read_phy(igb, IGB_PHY_STATUS_REG_OFFSET, &phy_status) == 0) {
+ success = !!(phy_status & IGB_PHY_STATUS_AN_COMP);
+ if (success)
+ break;
+ }
+ usleep(1000);
+ }
+
+ if (!success) {
+ printf("igb: Auto-negotiation did not complete in time\n");
+ return -ETIMEDOUT;
+ }
+
+ return 0;
+}
+
+static int igb_probe(struct vfio_pci_device *device)
+{
+ if (!vfio_pci_device_match(device, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IGB))
+ return -EINVAL;
+
+ return 0;
+}
+
+static void igb_init(struct vfio_pci_device *device)
+{
+ struct igb *igb = to_igb_state(device);
+ uint64_t iova_tx, iova_rx;
+ uint32_t ctrl, rctl;
+ uint16_t cmd_reg;
+ int retries;
+
+ VFIO_ASSERT_GE(device->driver.region.size,
+ sizeof(struct igb) + 2 * RING_SIZE * sizeof(struct igb_tx_desc));
+
+ /* Set up rings and calculate IOVAs */
+ igb->bar0 = device->bars[0].vaddr;
+
+ igb->tx_ring = (struct igb_tx_desc *)(igb + 1);
+ igb->rx_ring = (struct igb_rx_desc *)(igb->tx_ring + RING_SIZE);
+
+ iova_tx = to_iova(device, igb->tx_ring);
+ iova_rx = to_iova(device, igb->rx_ring);
+
+ /* Reset device and disable all interrupts */
+ igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST);
+ usleep(20000);
+ igb_write32(igb, IGB_IMC, 0xFFFFFFFF);
+
+ /* Signal that the driver is loaded */
+ ctrl = igb_read32(igb, IGB_CTRL_EXT);
+ ctrl |= IGB_CTRL_EXT_DRV_LOAD;
+ ctrl &= ~IGB_CTRL_EXT_LINK_MODE_MASK;
+ igb_write32(igb, IGB_CTRL_EXT, ctrl);
+
+ /* Enable PCI Bus Master. */
+ cmd_reg = vfio_pci_config_readw(device, PCI_COMMAND);
+ if (!(cmd_reg & PCI_COMMAND_MASTER)) {
+ cmd_reg |= (PCI_COMMAND_MASTER | PCI_COMMAND_MEMORY);
+ vfio_pci_config_writew(device, PCI_COMMAND, cmd_reg);
+ }
+
+ /* Trigger autonegotiation. This enables IGB to transmit data. */
+ if (igb_phy_setup_autoneg(igb))
+ return;
+
+ /* Configure TX and RX descriptor rings */
+ igb_write32(igb, IGB_TDBAL0, (uint32_t)iova_tx);
+ igb_write32(igb, IGB_TDBAH0, (uint32_t)(iova_tx >> 32));
+ igb_write32(igb, IGB_TDLEN0, RING_SIZE * sizeof(struct igb_tx_desc));
+ igb_write32(igb, IGB_TDH0, 0);
+ igb_write32(igb, IGB_TDT0, 0);
+ igb_write32(igb, IGB_TXDCTL0, IGB_TXDCTL0_Q_EN);
+
+ igb_write32(igb, IGB_RDBAL0, (uint32_t)iova_rx);
+ igb_write32(igb, IGB_RDBAH0, (uint32_t)(iova_rx >> 32));
+ igb_write32(igb, IGB_RDLEN0, RING_SIZE * sizeof(struct igb_rx_desc));
+ igb_write32(igb, IGB_RDH0, 0);
+ igb_write32(igb, IGB_RDT0, 0);
+ igb_write32(igb, IGB_RXDCTL0, IGB_RXDCTL0_Q_EN);
+
+ /* Wait for TX and RX queues to be enabled */
+ retries = 2000;
+ while (retries-- > 0) {
+ if ((igb_read32(igb, IGB_TXDCTL0) & IGB_TXDCTL0_Q_EN) &&
+ (igb_read32(igb, IGB_RXDCTL0) & IGB_RXDCTL0_Q_EN))
+ break;
+ usleep(10);
+ }
+
+ /* Enable Receiver and Transmitter */
+ rctl = IGB_RCTL_EN | /* Receiver Enable */
+ IGB_RCTL_UPE | /* Unicast Promiscuous (for dummy MAC) */
+ IGB_RCTL_LBM_MAC | /* MAC Loopback Mode */
+ IGB_RCTL_SECRC; /* Strip CRC (needed for memcmp) */
+ igb_write32(igb, IGB_RCTL, rctl);
+ igb_write32(igb, IGB_TCTL, IGB_TCTL_EN);
+
+ /* Enable MSI-X with 1 vector for the test */
+ vfio_pci_msix_enable(device, MSIX_VECTOR, 1);
+
+ /* Enable auto-masking of interrupts to avoid storms without a real ISR */
+ igb_write32(igb, IGB_GPIE, IGB_GPIE_EIAME);
+
+ /* Enable interrupts on vector 0 */
+ igb_write32(igb, IGB_EIMS, 1);
+
+ /* Map vector 0 to interrupt cause 0 and mark it valid */
+ igb_write32(igb, IGB_IVAR0, IGB_IVAR_VALID);
+
+ /* Initialize driver state and capability limits */
+ igb->tx_tail = 0;
+ igb->rx_tail = 0;
+
+ device->driver.max_memcpy_size = (RING_SIZE - 1) * IGB_MAX_CHUNK_SIZE;
+ device->driver.max_memcpy_count = RING_SIZE - 1;
+ device->driver.msi = MSIX_VECTOR;
+}
+
+static void igb_remove(struct vfio_pci_device *device)
+{
+ struct igb *igb = to_igb_state(device);
+
+ vfio_pci_msix_disable(device);
+ igb_write32(igb, IGB_RCTL, 0);
+ igb_write32(igb, IGB_TCTL, 0);
+ igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST);
+}
+
+static void igb_irq_disable(struct igb *igb)
+{
+ igb_write32(igb, IGB_EIMC, 1);
+}
+
+static void igb_irq_enable(struct igb *igb)
+{
+ igb_write32(igb, IGB_EIMS, 1);
+}
+
+static void igb_irq_clear(struct igb *igb)
+{
+ igb_read32(igb, IGB_EICR);
+}
+
+static void igb_memcpy_start(struct vfio_pci_device *device, iova_t src,
+ iova_t dst, u64 size, u64 count)
+{
+ struct igb *igb = to_igb_state(device);
+ iova_t curr_src, curr_dst;
+ struct igb_rx_desc *rx;
+ struct igb_tx_desc *tx;
+ u64 chunk_size;
+ u64 remaining;
+ uint32_t i;
+
+ igb_irq_disable(igb);
+
+ for (i = 0; i < count; i++) {
+ curr_src = src;
+ curr_dst = dst;
+ remaining = size;
+
+ while (remaining > 0) {
+ chunk_size = min_t(u64, remaining, IGB_MAX_CHUNK_SIZE);
+
+ tx = &igb->tx_ring[igb->tx_tail];
+ rx = &igb->rx_ring[igb->rx_tail];
+
+ memset(tx, 0, sizeof(struct igb_tx_desc));
+ memset(rx, 0, sizeof(struct igb_rx_desc));
+
+ rx->read.pkt_addr = curr_dst;
+ rx->read.hdr_addr = 0;
+ rx->wb.status_error = 0;
+
+ tx->read.buffer_addr = curr_src;
+ tx->read.cmd_type_len = (uint32_t)chunk_size;
+ tx->read.cmd_type_len |= (uint32_t)(IGB_TXD_CMD_EOP) << IGB_TXD_CMD_SHIFT;
+
+ /* Set to 0 to disable offloads and avoid needing a context descriptor */
+ tx->read.olinfo_status = 0;
+
+ curr_src += chunk_size;
+ curr_dst += chunk_size;
+ remaining -= chunk_size;
+
+ igb->tx_tail = (igb->tx_tail + 1) % RING_SIZE;
+ igb->rx_tail = (igb->rx_tail + 1) % RING_SIZE;
+ }
+ }
+
+ igb_write32(igb, IGB_RDT0, igb->rx_tail);
+ igb_write32(igb, IGB_TDT0, igb->tx_tail);
+}
+
+static int igb_memcpy_wait(struct vfio_pci_device *device)
+{
+ struct igb *igb = to_igb_state(device);
+ struct igb_rx_desc *rx;
+ uint32_t prev_tail;
+ int retries;
+
+ prev_tail = (igb->rx_tail + RING_SIZE - 1) % RING_SIZE;
+ rx = &igb->rx_ring[prev_tail];
+
+ retries = 100;
+ while (retries-- > 0) {
+ if (rx->wb.status_error & 1)
+ break;
+ usleep(10);
+ }
+
+ igb_irq_clear(igb);
+
+ igb_irq_enable(igb);
+
+ if (rx->wb.status_error & 1)
+ return 0;
+
+ return -ETIMEDOUT;
+}
+
+static void igb_send_msi(struct vfio_pci_device *device)
+{
+ struct igb *igb = to_igb_state(device);
+
+ igb_write32(igb, IGB_EICS, 1);
+}
+
+const struct vfio_pci_driver_ops igb_ops = {
+ .name = "igb",
+ .probe = igb_probe,
+ .init = igb_init,
+ .remove = igb_remove,
+ .memcpy_start = igb_memcpy_start,
+ .memcpy_wait = igb_memcpy_wait,
+ .send_msi = igb_send_msi,
+};
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
new file mode 100644
index 000000000000..e2756be2e9c0
--- /dev/null
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _IGB_REGISTERS_H_
+#define _IGB_REGISTERS_H_
+
+/* Register Offsets (Intel 82576EB Datasheet) */
+#define IGB_CTRL 0x00000 /* Device Control */
+#define IGB_STATUS 0x00008 /* Device Status */
+#define IGB_CTRL_EXT 0x00018 /* Extended Device Control */
+#define IGB_MDIC 0x00020 /* MDI Control */
+#define IGB_RCTL 0x00100 /* Receive Control */
+#define IGB_TCTL 0x00400 /* Transmit Control */
+#define IGB_SCTL 0x00420 /* SerDes Control */
+
+/* Interrupt Registers */
+#define IGB_IMC 0x0150C /* Interrupt Mask Clear */
+#define IGB_IVAR0 0x01700 /* Interrupt Vector Allocation Register 0 */
+
+/* Rx Ring 0 Registers */
+#define IGB_RDBAL0 0x0C000 /* Rx Desc Base Address Low */
+#define IGB_RDBAH0 0x0C004 /* Rx Desc Base Address High */
+#define IGB_RDLEN0 0x0C008 /* Rx Desc Length */
+#define IGB_RDH0 0x0C010 /* Rx Desc Head */
+#define IGB_RDT0 0x0C018 /* Rx Desc Tail */
+#define IGB_RXDCTL0 0x0C028 /* Rx Desc Control */
+
+/* Tx Ring 0 Registers */
+#define IGB_TDBAL0 0x0E000 /* Tx Desc Base Address Low */
+#define IGB_TDBAH0 0x0E004 /* Tx Desc Base Address High */
+#define IGB_TDLEN0 0x0E008 /* Tx Desc Length */
+#define IGB_TDH0 0x0E010 /* Tx Desc Head */
+#define IGB_TDT0 0x0E018 /* Tx Desc Tail */
+#define IGB_TXDCTL0 0x0E028 /* Tx Desc Control */
+
+/* Control Bit Definitions */
+/* CTRL */
+#define IGB_CTRL_RST (1 << 26) /* Device Reset */
+#define IGB_CTRL_EXT_LINK_MODE_MASK (3 << 22)
+
+/* CTRL_EXT */
+#define IGB_CTRL_EXT_DRV_LOAD (1 << 28) /* Driver Loaded */
+
+/* RCTL */
+#define IGB_RCTL_EN (1 << 1) /* Receiver Enable */
+#define IGB_RCTL_UPE (1 << 3) /* Unicast Promiscuous Enabled */
+#define IGB_RCTL_LPE (1 << 5) /* Long Packet Reception Enable */
+#define IGB_RCTL_LBM_MAC (1 << 6) /* Loopback Mode - MAC */
+#define IGB_RCTL_SECRC (1 << 26) /* Strip Ethernet CRC */
+
+/* TCTL */
+#define IGB_TCTL_EN (1 << 1) /* Transmit Enable */
+
+/* SCTL */
+#define IGB_SCTL_DISABLE_SERDES_LOOPBACK (1 << 6)
+
+/* MDIC */
+#define IGB_MDIC_OP_WRITE (1 << 26)
+#define IGB_MDIC_OP_READ (2 << 26)
+
+#define IGB_EICR 0x01580 /* Extended Interrupt Cause Read */
+
+#define IGB_RAH0 0x05404 /* Receive Address High 0 */
+#define IGB_VMOLR0 0x05AD0 /* VM Offload Layout Register 0 */
+
+#define IGB_VMOLR_LPE 0x00010000 /* Long Packet Enable */
+#define IGB_VMOLR_BAM 0x08000000 /* Broadcast Accept Mode */
+#define IGB_RAH_POOL_1 0x00040000 /* Pool 1 assignment */
+
+#define IGB_EIMS 0x01524 /* Extended Interrupt Mask Set */
+#define IGB_EICS 0x01520 /* Extended Interrupt Cause Set */
+#define IGB_EIMC 0x01528 /* Extended Interrupt Mask Clear */
+#define IGB_CTRL_GIO_MASTER_DISABLE (1 << 2) /* GIO Master Disable */
+#define IGB_STATUS_GIO_MASTER_ENABLE (1 << 19) /* GIO Master Enable */
+#define IGB_GPIE 0x01514 /* General Purpose Interrupt Enable */
+#define IGB_TXDCTL0_Q_EN (1 << 25) /* Transmit Queue Enable */
+#define IGB_RXDCTL0_Q_EN (1 << 25) /* Receive Queue Enable */
+#define IGB_MRQC 0x05818 /* Multiple Receive Queues Command */
+
+#define IGB_MDIC_PHY_SHIFT 21 /* PHY Address Shift */
+#define IGB_MDIC_REG_SHIFT 16 /* Register Address Shift */
+#define IGB_MDIC_READY (1 << 28) /* MDI Data Ready */
+#define IGB_MDIC_ERROR (1 << 29) /* MDI Error */
+
+#define IGB_PHY_CTRL_REG_OFFSET 0
+#define IGB_PHY_STATUS_REG_OFFSET 1
+#define IGB_PHY_STATUS_AN_COMP 0x0020 /* Auto-negotiation complete */
+#define IGB_PHY_CTRL_AN_ENABLE 0x1000 /* Auto-Negotiation Enable */
+#define IGB_PHY_CTRL_RESTART_AN 0x0200 /* Restart Auto-Negotiation */
+
+#define IGB_PHY_AUTONEG_ADV_REG_OFFSET 4 /* Auto-Negotiation Advertisement */
+#define IGB_PHY_1000T_CTRL_REG_OFFSET 9 /* 1000Base-T Control */
+#define IGB_CR_1000T_FD_CAPS 0x0200 /* Advertise 1000 Mbps Full Duplex */
+
+#define IGB_GPIE_EIAME 0x10 /* Extended Interrupt Auto Mask Enable */
+#define IGB_IVAR_VALID 0x80 /* Valid bit for IVAR register */
+
+#define IGB_TXD_CMD_EOP 0x01 /* End of Packet */
+#define IGB_TXD_CMD_IFCS 0x02 /* Insert FCS */
+#define IGB_TXD_CMD_RS 0x08 /* Report Status */
+#define IGB_TXD_CMD_SHIFT 24 /* Shift for command bits in cmd_type_len */
+#define IGB_TXD_CMD_LEGACY_FORMAT (1 << 20) /* Forces legacy descriptor format in QEMU */
+
+#endif /* _IGB_REGISTERS_H_ */
diff --git a/tools/testing/selftests/vfio/lib/libvfio.mk b/tools/testing/selftests/vfio/lib/libvfio.mk
index 9f47bceed16f..1f13cca04348 100644
--- a/tools/testing/selftests/vfio/lib/libvfio.mk
+++ b/tools/testing/selftests/vfio/lib/libvfio.mk
@@ -12,6 +12,7 @@ LIBVFIO_C += vfio_pci_driver.c
ifeq ($(ARCH:x86_64=x86),x86)
LIBVFIO_C += drivers/ioat/ioat.c
LIBVFIO_C += drivers/dsa/dsa.c
+LIBVFIO_C += drivers/igb/igb.c
endif
LIBVFIO_OUTPUT := $(OUTPUT)/libvfio
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c
index 6827f4a6febe..a5d0547132c4 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c
@@ -5,12 +5,14 @@
#ifdef __x86_64__
extern struct vfio_pci_driver_ops dsa_ops;
extern struct vfio_pci_driver_ops ioat_ops;
+extern struct vfio_pci_driver_ops igb_ops;
#endif
static struct vfio_pci_driver_ops *driver_ops[] = {
#ifdef __x86_64__
&dsa_ops,
&ioat_ops,
+ &igb_ops,
#endif
};
--
2.54.0.563.g4f69b47b94-goog
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device
  2026-05-11 21:18 [PATCH] vfio: selftests: Add driver for IGB QEMU device Josh Hilke
@ 2026-05-11 23:45 ` David Matlack
  2026-05-13  2:12   ` Josh Hilke
  0 siblings, 1 reply; 11+ messages in thread
From: David Matlack @ 2026-05-11 23:45 UTC (permalink / raw)
To: Josh Hilke
Cc: Alex Williamson, Vipin Sharma, Jason Gunthorpe, Shuah Khan,
	Tony Nguyen, kvm, linux-kernel, linux-kselftest

On 2026-05-11 09:18 PM, Josh Hilke wrote:
> Add a VFIO selftest driver for the Intel Gigabit Ethernet controller
> (IGB). IGB is fully virtualized in QEMU [1] which makes it easy to run
> VFIO selftests without needing any specific hardware.
[...]
> Assisted-by: Gemini:gemini-3-pro-preview
> Signed-off-by: Josh Hilke <jrhilke@google.com>

I just did a quick first pass and left some feedback. Overall I'm very
excited to have a driver for a device supported by QEMU!

> +#define PCI_VENDOR_ID_INTEL 0x8086
> +#define PCI_DEVICE_ID_INTEL_IGB 0x10C9

You should be able to include <linux/pci_id.h> for these.

> +#define IGB_MAX_CHUNK_SIZE 1024
> +#define MSIX_VECTOR 0
> +#define RING_SIZE 4096 /* Number of descriptors in ring */

What's the maximum ring size that IGB can support? The max chunk size
is quite small so using a larger ring will be helpful toward getting
the device to do a lot of memcpys for Live Update testing.

> +struct igb_tx_desc {
> +	union {
> +		struct {
> +			uint64_t buffer_addr; /* Address of descriptor's data buffer */
> +			uint32_t cmd_type_len; /* Command/Type/Length */
> +			uint32_t olinfo_status; /* Context/Buffer info */

Use kernel-style fixed width types (u64, u32, etc.).

> +	VFIO_ASSERT_GE(device->driver.region.size,
> +		       sizeof(struct igb) + 2 * RING_SIZE * sizeof(struct igb_tx_desc));

Please put these rings into struct igb directly rather than doing raw
pointer arithmetic and size calculations. e.g.

	struct igb {
		void *bar0;
		u32 tx_tail;
		u32 rx_tail;
		struct igb_tx_desc tx_ring[RING_SIZE];
		struct igb_rx_desc rx_ring[RING_SIZE];
	};

> +	/* Initialize driver state and capability limits */
> +	igb->tx_tail = 0;
> +	igb->rx_tail = 0;

This comment seems wrong.

> +	device->driver.max_memcpy_size = (RING_SIZE - 1) * IGB_MAX_CHUNK_SIZE;
> +	device->driver.max_memcpy_count = RING_SIZE - 1;

The device must be able to do max_memcpy_count copies of
max_memcpy_size, so max_memcpy_size should be IGB_MAX_CHUNK_SIZE. Once
you do this you can also get rid of the chunking loop in
memcpy_start().

> +			rx->read.pkt_addr = curr_dst;
> +			rx->read.hdr_addr = 0;
> +			rx->wb.status_error = 0;

Don't rx->read.hdr_addr and rx->wb.status_error overlap the same u64?
It's not a bug since both write 0 but it is weird.

> +	retries = 100;
> +	while (retries-- > 0) {
> +		if (rx->wb.status_error & 1)
> +			break;
> +		usleep(10);
> +	}

Why bail after a certain timeout? The test may have kicked off a large
count of memcpys. Is this for error detection?

> +	if (rx->wb.status_error & 1)
> +		return 0;
> +
> +	return -ETIMEDOUT;

Is there any other way the device signals errors other than failing to
set bit 0 of status_error? I'm especially curious how it handles
memcpy_from_unmapped_iova in vfio_pci_driver_test.c.
Address Shift */ > +#define IGB_MDIC_READY (1 << 28) /* MDI Data Ready */ > +#define IGB_MDIC_ERROR (1 << 29) /* MDI Error */ > + > +#define IGB_PHY_CTRL_REG_OFFSET 0 > +#define IGB_PHY_STATUS_REG_OFFSET 1 > +#define IGB_PHY_STATUS_AN_COMP 0x0020 /* Auto-negotiation complete */ > +#define IGB_PHY_CTRL_AN_ENABLE 0x1000 /* Auto-Negotiation Enable */ > +#define IGB_PHY_CTRL_RESTART_AN 0x0200 /* Restart Auto-Negotiation */ > + > +#define IGB_PHY_AUTONEG_ADV_REG_OFFSET 4 /* Auto-Negotiation Advertisement */ > +#define IGB_PHY_1000T_CTRL_REG_OFFSET 9 /* 1000Base-T Control */ > +#define IGB_CR_1000T_FD_CAPS 0x0200 /* Advertise 1000 Mbps Full Duplex */ > + > +#define IGB_GPIE_EIAME 0x10 /* Extended Interrupt Auto Mask Enable */ > +#define IGB_IVAR_VALID 0x80 /* Valid bit for IVAR register */ > + > +#define IGB_TXD_CMD_EOP 0x01 /* End of Packet */ > +#define IGB_TXD_CMD_IFCS 0x02 /* Insert FCS */ > +#define IGB_TXD_CMD_RS 0x08 /* Report Status */ > +#define IGB_TXD_CMD_SHIFT 24 /* Shift for command bits in cmd_type_len */ > +#define IGB_TXD_CMD_LEGACY_FORMAT (1 << 20) /* Forces legacy descriptor format in QEMU */ IGB_TXD_CMD_LEGACY_FORMAT appears to be unused. I didn't check the rest. Can you prune out the unused macros? Related: Are these macros and/or the descriptor struct available in any kernel headers that we could symlink into the VFIO selftests like we do for DSA? 
> + > +#endif /* _IGB_REGISTERS_H_ */ > diff --git a/tools/testing/selftests/vfio/lib/libvfio.mk b/tools/testing/selftests/vfio/lib/libvfio.mk > index 9f47bceed16f..1f13cca04348 100644 > --- a/tools/testing/selftests/vfio/lib/libvfio.mk > +++ b/tools/testing/selftests/vfio/lib/libvfio.mk > @@ -12,6 +12,7 @@ LIBVFIO_C += vfio_pci_driver.c > ifeq ($(ARCH:x86_64=x86),x86) > LIBVFIO_C += drivers/ioat/ioat.c > LIBVFIO_C += drivers/dsa/dsa.c > +LIBVFIO_C += drivers/igb/igb.c > endif > > LIBVFIO_OUTPUT := $(OUTPUT)/libvfio > diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > index 6827f4a6febe..a5d0547132c4 100644 > --- a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > +++ b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > @@ -5,12 +5,14 @@ > #ifdef __x86_64__ > extern struct vfio_pci_driver_ops dsa_ops; > extern struct vfio_pci_driver_ops ioat_ops; > +extern struct vfio_pci_driver_ops igb_ops; > #endif > > static struct vfio_pci_driver_ops *driver_ops[] = { > #ifdef __x86_64__ > &dsa_ops, > &ioat_ops, > + &igb_ops, > #endif > }; > > -- > 2.54.0.563.g4f69b47b94-goog > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-11 23:45 ` David Matlack @ 2026-05-13 2:12 ` Josh Hilke 2026-05-13 18:49 ` Josh Hilke 2026-05-13 23:29 ` David Matlack 0 siblings, 2 replies; 11+ messages in thread From: Josh Hilke @ 2026-05-13 2:12 UTC (permalink / raw) To: David Matlack Cc: Alex Williamson, Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > Add a VFIO selftest driver for the Intel Gigabit Ethernet controller > > (IGB). IGB is fully virtualized in QEMU [1] which makes it easy to run > > VFIO selftests without needing any specific hardware. > > > > IGB does not have a default memcpy operation, and memcpy is required for > > all VFIO selftest drivers. However, IGB has a "loopback" mode which is > > used in this driver to implement memcpy. When IGB is in loopback mode, > > it doesn't send data out to the network. Instead, it sends data from the > > Tx queue directly to the Rx queue, which points to the memcpy > > destination. > > > > This driver passes all of the VFIO selftests in > > tools/testing/selftests/vfio/ when running in QEMU. The driver has not > > been tested using a real IGB device since the main goal of writing this > > driver is to run VFIO selftests in QEMU without requiring any hardware. > > > > This command is used to test the driver. It runs > > ./tools/testing/selftests/vfio/vfio_pci_driver_test using virtme-ng [2], > > which runs a kernel in a virtualized environment using QEMU that has a > > copy-on-write snapshot of the host filesystem. 
> > > > vng \ > > --run arch/x86/boot/bzImage \ > > --user root \ > > --disable-microvm \ > > --memory 32G \ > > --cpus 8 \ > > --qemu-opts="-M q35,accel=kvm,kernel-irqchip=split" \ > > --qemu-opts="-device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on" \ > > --qemu-opts="-netdev user,id=net0 -device igb,netdev=net0,addr=09.0" \ > > --append "console=ttyS0 earlyprintk=ttyS0 intel_iommu=on iommu=pt" \ > > --exec "modprobe vfio-pci && \ > > ./tools/testing/selftests/vfio/scripts/setup.sh 0000:00:09.0 && \ > > ./tools/testing/selftests/vfio/scripts/run.sh ./tools/testing/selftests/vfio/vfio_pci_driver_test" > > > > Code was written entirely by AI (Gemini) by feeding it the Intel IGB > > specification, the code for the real IGB driver, and the QEMU > > implementation of the IGB device. It took many iterations of prompting > > to get the driver to pass all of the tests, and lots of de-slopping to > > remove unnecessary code and make the code readable. > > > > Code comments are also written by Gemini through iterative > > prompting. > > > > This patch is based on the kvm/queue branch. > > > > [1] https://www.qemu.org/docs/master/system/devices/igb.html > > [2] https://github.com/arighi/virtme-ng > > > > Assisted-by: Gemini:gemini-3-pro-preview > > Signed-off-by: Josh Hilke <jrhilke@google.com> > > I just did a quick first pass and left some feedback. Overall I'm very > excited to have a driver for a device supported by QEMU! 
> > > --- > > .../selftests/vfio/lib/drivers/igb/igb.c | 381 ++++++++++++++++++ > > .../vfio/lib/drivers/igb/registers.h | 103 +++++ > > tools/testing/selftests/vfio/lib/libvfio.mk | 1 + > > .../selftests/vfio/lib/vfio_pci_driver.c | 2 + > > 4 files changed, 487 insertions(+) > > create mode 100644 tools/testing/selftests/vfio/lib/drivers/igb/igb.c > > create mode 100644 tools/testing/selftests/vfio/lib/drivers/igb/registers.h > > > > diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c > > new file mode 100644 > > index 000000000000..13c8429784ac > > --- /dev/null > > +++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c > > @@ -0,0 +1,381 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +#include <unistd.h> > > +#include <errno.h> > > +#include <stdint.h> > > +#include <linux/io.h> > > +#include <linux/pci_regs.h> > > +#include <libvfio/vfio_pci_device.h> > > + > > +#include "registers.h" > > + > > +#define PCI_VENDOR_ID_INTEL 0x8086 > > +#define PCI_DEVICE_ID_INTEL_IGB 0x10C9 > > You should be able to include <linux/pci_id.h> for these. PCI_VENDOR_ID_INTEL is in <linux/pci_id.h>, so I can remove that one, but PCI_DEVICE_ID_INTEL_IGB is only defined in the private header, drivers/net/ethernet/intel/e1000/e1000_hw.h. I added more detail about private headers to your comment below about utilizing existing constants from IGB header files. > > > +#define IGB_MAX_CHUNK_SIZE 1024 > > +#define MSIX_VECTOR 0 > > +#define RING_SIZE 4096 /* Number of descriptors in ring */ > > What's the maximum ring size that IGB can support? The max chunk size is > quite small so using a larger ring will be helpful toward getting the > device to do a lot of memcpys for Live Update testing. I can dig into QEMU code and investigate the max ring size the QEMU IGB implementation can handle. 4096 is just the max ring size for the real IGB in drivers/net/ethernet/intel/igb/igb.h. 
What's our target for the maximum memcpy size? > > + > > +struct igb_tx_desc { > > + union { > > + struct { > > + uint64_t buffer_addr; /* Address of descriptor's data buffer */ > > + uint32_t cmd_type_len; /* Command/Type/Length */ > > + uint32_t olinfo_status; /* Context/Buffer info */ > > Use kernel-style fixed width types (u64, u32, etc.). Will do in v2. > > > + } read; > > + > > + struct { > > + uint64_t rsvd; /* Reserved */ > > + uint32_t nxtseq_seed; /* Next sequence seed */ > > + uint32_t status; /* Descriptor status */ > > + } wb; > > + }; > > +}; > > + > > +struct igb_rx_desc { > > + union { > > + struct { > > + uint64_t pkt_addr; /* Packet buffer address */ > > + uint64_t hdr_addr; /* Header buffer address */ > > + } read; > > + struct { > > + uint16_t pkt_info; /* RSS type, Packet type */ > > + uint16_t hdr_info; /* Split Head, buf len */ > > + uint32_t rss; /* RSS Hash */ > > + uint32_t status_error; /* ext status/error */ > > + uint16_t length; /* Packet length */ > > + uint16_t vlan; /* VLAN tag */ > > + } wb; /* writeback */ > > + }; > > +}; > > + > > +struct igb { > > + void *bar0; > > + struct igb_tx_desc *tx_ring; > > + struct igb_rx_desc *rx_ring; > > + uint32_t tx_tail; > > + uint32_t rx_tail; > > +}; > > + > > +static inline struct igb *to_igb_state(struct vfio_pci_device *device) > > +{ > > + return (struct igb *)device->driver.region.vaddr; > > +} > > + > > +static inline void igb_write32(struct igb *igb, uint32_t reg, uint32_t val) > > +{ > > + writel(val, igb->bar0 + reg); > > +} > > + > > +static inline uint32_t igb_read32(struct igb *igb, uint32_t reg) > > +{ > > + return readl(igb->bar0 + reg); > > +} > > + > > +static int igb_write_phy(struct igb *igb, uint32_t offset, uint16_t data) > > +{ > > + uint32_t mdic; > > + int i; > > + > > + mdic = (((uint32_t)data) | > > + (offset << IGB_MDIC_REG_SHIFT) | > > + (1 << IGB_MDIC_PHY_SHIFT) | > > + IGB_MDIC_OP_WRITE); > > + > > + igb_write32(igb, IGB_MDIC, mdic); > > + > > + for (i = 0; i < 
1000; i++) { > > + usleep(50); > > + mdic = igb_read32(igb, IGB_MDIC); > > + if (mdic & IGB_MDIC_READY) > > + break; > > + } > > + > > + if (!(mdic & IGB_MDIC_READY)) > > + return -1; > > + > > + if (mdic & IGB_MDIC_ERROR) > > + return -1; > > + > > + return 0; > > +} > > + > > +static int igb_read_phy(struct igb *igb, uint32_t offset, uint16_t *data) > > +{ > > + uint32_t mdic; > > + int i; > > + > > + mdic = ((offset << IGB_MDIC_REG_SHIFT) | > > + (1 << IGB_MDIC_PHY_SHIFT) | > > + IGB_MDIC_OP_READ); > > + > > + igb_write32(igb, IGB_MDIC, mdic); > > + > > + for (i = 0; i < 1000; i++) { > > + usleep(50); > > + mdic = igb_read32(igb, IGB_MDIC); > > + if (mdic & IGB_MDIC_READY) > > + break; > > + } > > + > > + if (!(mdic & IGB_MDIC_READY)) > > + return -1; > > + > > + if (mdic & IGB_MDIC_ERROR) > > + return -1; > > + > > + *data = (uint16_t)mdic; > > + return 0; > > +} > > + > > +static int igb_phy_setup_autoneg(struct igb *igb) > > +{ > > + int timeout_ms = 1000; > > + bool success = false; > > + uint16_t phy_status; > > + int ret; > > + int i; > > + > > + /* Trigger auto-negotiation */ > > + ret = igb_write_phy(igb, IGB_PHY_CTRL_REG_OFFSET, > > + IGB_PHY_CTRL_AN_ENABLE | IGB_PHY_CTRL_RESTART_AN); > > + if (ret) > > + return ret; > > + > > + for (i = 0; i < timeout_ms; i++) { > > + if (igb_read_phy(igb, IGB_PHY_STATUS_REG_OFFSET, &phy_status) == 0) { > > + success = !!(phy_status & IGB_PHY_STATUS_AN_COMP); > > + if (success) > > + break; > > + } > > + usleep(1000); > > + } > > + > > + if (!success) { > > + printf("igb: Auto-negotiation did not complete in time\n"); > > + return -ETIMEDOUT; > > + } > > + > > + return 0; > > +} > > + > > +static int igb_probe(struct vfio_pci_device *device) > > +{ > > + if (!vfio_pci_device_match(device, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IGB)) > > + return -EINVAL; > > + > > + return 0; > > +} > > + > > +static void igb_init(struct vfio_pci_device *device) > > +{ > > + struct igb *igb = to_igb_state(device); > > + uint64_t 
iova_tx, iova_rx; > > + uint32_t ctrl, rctl; > > + uint16_t cmd_reg; > > + int retries; > > + > > + VFIO_ASSERT_GE(device->driver.region.size, > > + sizeof(struct igb) + 2 * RING_SIZE * sizeof(struct igb_tx_desc)); > > Please put these rings into struct igb directly rather than doing raw > pointer arithmetic and size calculations. e.g. Will do in v2. > struct igb { > void *bar0; > u32 tx_tail; > u32 rx_tail; > struct igb_tx_desc tx_ring[RING_SIZE]; > struct igb_rx_desc rx_ring[RING_SIZE]; > }; > > > + > > + /* Set up rings and calculate IOVAs */ > > + igb->bar0 = device->bars[0].vaddr; > > + > > + igb->tx_ring = (struct igb_tx_desc *)(igb + 1); > > + igb->rx_ring = (struct igb_rx_desc *)(igb->tx_ring + RING_SIZE); > > + > > + iova_tx = to_iova(device, igb->tx_ring); > > + iova_rx = to_iova(device, igb->rx_ring); > > + > > + /* Reset device and disable all interrupts */ > > + igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST); > > + usleep(20000); > > + igb_write32(igb, IGB_IMC, 0xFFFFFFFF); > > + > > + /* Signal that the driver is loaded */ > > + ctrl = igb_read32(igb, IGB_CTRL_EXT); > > + ctrl |= IGB_CTRL_EXT_DRV_LOAD; > > + ctrl &= ~IGB_CTRL_EXT_LINK_MODE_MASK; > > + igb_write32(igb, IGB_CTRL_EXT, ctrl); > > + > > + /* Enable PCI Bus Master. */ > > + cmd_reg = vfio_pci_config_readw(device, PCI_COMMAND); > > + if (!(cmd_reg & PCI_COMMAND_MASTER)) { > > + cmd_reg |= (PCI_COMMAND_MASTER | PCI_COMMAND_MEMORY); > > + vfio_pci_config_writew(device, PCI_COMMAND, cmd_reg); > > + } > > + > > + /* Trigger autonegotiation. This enables IGB to transmit data. 
*/ > > + if (igb_phy_setup_autoneg(igb)) > > + return; > > + > > + /* Configure TX and RX descriptor rings */ > > + igb_write32(igb, IGB_TDBAL0, (uint32_t)iova_tx); > > + igb_write32(igb, IGB_TDBAH0, (uint32_t)(iova_tx >> 32)); > > + igb_write32(igb, IGB_TDLEN0, RING_SIZE * sizeof(struct igb_tx_desc)); > > + igb_write32(igb, IGB_TDH0, 0); > > + igb_write32(igb, IGB_TDT0, 0); > > + igb_write32(igb, IGB_TXDCTL0, IGB_TXDCTL0_Q_EN); > > + > > + igb_write32(igb, IGB_RDBAL0, (uint32_t)iova_rx); > > + igb_write32(igb, IGB_RDBAH0, (uint32_t)(iova_rx >> 32)); > > + igb_write32(igb, IGB_RDLEN0, RING_SIZE * sizeof(struct igb_rx_desc)); > > + igb_write32(igb, IGB_RDH0, 0); > > + igb_write32(igb, IGB_RDT0, 0); > > + igb_write32(igb, IGB_RXDCTL0, IGB_RXDCTL0_Q_EN); > > + > > + /* Wait for TX and RX queues to be enabled */ > > + retries = 2000; > > + while (retries-- > 0) { > > + if ((igb_read32(igb, IGB_TXDCTL0) & IGB_TXDCTL0_Q_EN) && > > + (igb_read32(igb, IGB_RXDCTL0) & IGB_RXDCTL0_Q_EN)) > > + break; > > + usleep(10); > > + } > > + > > + /* Enable Receiver and Transmitter */ > > + rctl = IGB_RCTL_EN | /* Receiver Enable */ > > + IGB_RCTL_UPE | /* Unicast Promiscuous (for dummy MAC) */ > > + IGB_RCTL_LBM_MAC | /* MAC Loopback Mode */ > > + IGB_RCTL_SECRC; /* Strip CRC (needed for memcmp) */ > > + igb_write32(igb, IGB_RCTL, rctl); > > + igb_write32(igb, IGB_TCTL, IGB_TCTL_EN); > > + > > + /* Enable MSI-X with 1 vector for the test */ > > + vfio_pci_msix_enable(device, MSIX_VECTOR, 1); > > + > > + /* Enable auto-masking of interrupts to avoid storms without a real ISR */ > > + igb_write32(igb, IGB_GPIE, IGB_GPIE_EIAME); > > + > > + /* Enable interrupts on vector 0 */ > > + igb_write32(igb, IGB_EIMS, 1); > > + > > + /* Map vector 0 to interrupt cause 0 and mark it valid */ > > + igb_write32(igb, IGB_IVAR0, IGB_IVAR_VALID); > > + > > + /* Initialize driver state and capability limits */ > > + igb->tx_tail = 0; > > + igb->rx_tail = 0; > > This comment seems wrong. 
Ah, yea, looks like some stuff got shuffled around during refactoring. I'll fix it in v2. > > + > > + device->driver.max_memcpy_size = (RING_SIZE - 1) * IGB_MAX_CHUNK_SIZE; > > + device->driver.max_memcpy_count = RING_SIZE - 1; > > The device must be able to do max_memcpy_count copies of > max_memcpy_size, so max_memcpy_size should be IGB_MAX_CHUNK_SIZE. > > Once you do this you can also get rid of the chunking loop in > memcpy_start(). Got it. Will do in v2. > > > + device->driver.msi = MSIX_VECTOR; > > +} > > + > > +static void igb_remove(struct vfio_pci_device *device) > > +{ > > + struct igb *igb = to_igb_state(device); > > + > > + vfio_pci_msix_disable(device); > > + igb_write32(igb, IGB_RCTL, 0); > > + igb_write32(igb, IGB_TCTL, 0); > > + igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST); > > +} > > + > > +static void igb_irq_disable(struct igb *igb) > > +{ > > + igb_write32(igb, IGB_EIMC, 1); > > +} > > + > > +static void igb_irq_enable(struct igb *igb) > > +{ > > + igb_write32(igb, IGB_EIMS, 1); > > +} > > + > > +static void igb_irq_clear(struct igb *igb) > > +{ > > + igb_read32(igb, IGB_EICR); > > +} > > + > > +static void igb_memcpy_start(struct vfio_pci_device *device, iova_t src, > > + iova_t dst, u64 size, u64 count) > > +{ > > + struct igb *igb = to_igb_state(device); > > + iova_t curr_src, curr_dst; > > + struct igb_rx_desc *rx; > > + struct igb_tx_desc *tx; > > + u64 chunk_size; > > + u64 remaining; > > + uint32_t i; > > + > > + igb_irq_disable(igb); > > + > > + for (i = 0; i < count; i++) { > > + curr_src = src; > > + curr_dst = dst; > > + remaining = size; > > + > > + while (remaining > 0) { > > + chunk_size = min_t(u64, remaining, IGB_MAX_CHUNK_SIZE); > > + > > + tx = &igb->tx_ring[igb->tx_tail]; > > + rx = &igb->rx_ring[igb->rx_tail]; > > + > > + memset(tx, 0, sizeof(struct igb_tx_desc)); > > + memset(rx, 0, sizeof(struct igb_rx_desc)); > > + > > + rx->read.pkt_addr = curr_dst; > > + rx->read.hdr_addr = 0; > > + 
rx->wb.status_error = 0; > > Don't rx->read.hdr_addr and rx->wb.status_error overlap the same u64? > It's not a bug since both write 0 but it is weird. That's correct, they overlap. I agree it's a bit weird -- it's a hardware optimization to save memory. This layout is also how the real descriptor (`e1000_adv_rx_desc`) is structured in drivers/net/ethernet/intel/igb/e1000_82575.h. The QEMU IGB implementation also assumes this layout for the bits. > > > + > > + tx->read.buffer_addr = curr_src; > > + tx->read.cmd_type_len = (uint32_t)chunk_size; > > + tx->read.cmd_type_len |= (uint32_t)(IGB_TXD_CMD_EOP) << IGB_TXD_CMD_SHIFT; > > + > > + /* Set to 0 to disable offloads and avoid needing a context descriptor */ > > + tx->read.olinfo_status = 0; > > + > > + curr_src += chunk_size; > > + curr_dst += chunk_size; > > + remaining -= chunk_size; > > + > > + igb->tx_tail = (igb->tx_tail + 1) % RING_SIZE; > > + igb->rx_tail = (igb->rx_tail + 1) % RING_SIZE; > > + } > > + } > > + > > + igb_write32(igb, IGB_RDT0, igb->rx_tail); > > + igb_write32(igb, IGB_TDT0, igb->tx_tail); > > +} > > + > > +static int igb_memcpy_wait(struct vfio_pci_device *device) > > +{ > > + struct igb *igb = to_igb_state(device); > > + struct igb_rx_desc *rx; > > + uint32_t prev_tail; > > + int retries; > > + > > + prev_tail = (igb->rx_tail + RING_SIZE - 1) % RING_SIZE; > > + rx = &igb->rx_ring[prev_tail]; > > + > > + retries = 100; > > + while (retries-- > 0) { > > + if (rx->wb.status_error & 1) > > + break; > > + usleep(10); > > + } > > Why bail after a certain timeout? The test may have kicked off a large > count of memcpys. Is this for error detection? The bailout was intended to detect errors during development. Shouldn't need it anymore. I'll remove it in v2. 
> > + > > + igb_irq_clear(igb); > > + > > + igb_irq_enable(igb); > > + > > + if (rx->wb.status_error & 1) > > + return 0; > > + > > + return -ETIMEDOUT; > > Is there any other way the device signals errors other than failing > to set bit 0 of status_error? I'm especially curious how it handles > memcpy_from_unmapped_iova in vfio_pci_driver_test.c. Yea, there are some relevant error bits in status_error, but unfortunately QEMU IGB doesn't emulate them: - E1000_RXDEXT_STATERR_LB: confirms that the packet actually traversed the loopback path - E1000_RXDEXT_STATERR_CE: detects bit flips during transmission - E1000_RXDEXT_STATERR_RXE: catch-all error bit for the receive engine (buffer overflows, etc.) The other error bits (E1000_RXDEXT_STATERR_IPE, E1000_RXDEXT_STATERR_TCPE) validate network packet integrity, which isn't relevant because the driver doesn't package the memcpy data within IP or TCP/UDP packets. Regarding the memcpy_from_unmapped_iova test, the test only cares that the device doesn't trigger an MSI. Normally, the QEMU IGB sends an MSI once it sets the descriptor done bit on completion of a memcpy operation. To prevent this, the driver disables interrupts before starting memcpy, and then re-enables them after the memcpy finishes. 
> > +} > > + > > +static void igb_send_msi(struct vfio_pci_device *device) > > +{ > > + struct igb *igb = to_igb_state(device); > > + > > + igb_write32(igb, IGB_EICS, 1); > > +} > > + > > +const struct vfio_pci_driver_ops igb_ops = { > > + .name = "igb", > > + .probe = igb_probe, > > + .init = igb_init, > > + .remove = igb_remove, > > + .memcpy_start = igb_memcpy_start, > > + .memcpy_wait = igb_memcpy_wait, > > + .send_msi = igb_send_msi, > > +}; > > diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h > > new file mode 100644 > > index 000000000000..e2756be2e9c0 > > --- /dev/null > > +++ b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h > > @@ -0,0 +1,103 @@ > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > + > > +#ifndef _IGB_REGISTERS_H_ > > +#define _IGB_REGISTERS_H_ > > + > > +/* Register Offsets (Intel 82576EB Datasheet) */ > > +#define IGB_CTRL 0x00000 /* Device Control */ > > +#define IGB_STATUS 0x00008 /* Device Status */ > > +#define IGB_CTRL_EXT 0x00018 /* Extended Device Control */ > > +#define IGB_MDIC 0x00020 /* MDI Control */ > > +#define IGB_RCTL 0x00100 /* Receive Control */ > > +#define IGB_TCTL 0x00400 /* Transmit Control */ > > +#define IGB_SCTL 0x00420 /* SerDes Control */ > > + > > +/* Interrupt Registers */ > > +#define IGB_IMC 0x0150C /* Interrupt Mask Clear */ > > +#define IGB_IVAR0 0x01700 /* Interrupt Vector Allocation Register 0 */ > > + > > +/* Rx Ring 0 Registers */ > > +#define IGB_RDBAL0 0x0C000 /* Rx Desc Base Address Low */ > > +#define IGB_RDBAH0 0x0C004 /* Rx Desc Base Address High */ > > +#define IGB_RDLEN0 0x0C008 /* Rx Desc Length */ > > +#define IGB_RDH0 0x0C010 /* Rx Desc Head */ > > +#define IGB_RDT0 0x0C018 /* Rx Desc Tail */ > > +#define IGB_RXDCTL0 0x0C028 /* Rx Desc Control */ > > + > > +/* Tx Ring 0 Registers */ > > +#define IGB_TDBAL0 0x0E000 /* Tx Desc Base Address Low */ > > +#define IGB_TDBAH0 0x0E004 /* Tx Desc Base Address 
High */ > > +#define IGB_TDLEN0 0x0E008 /* Tx Desc Length */ > > +#define IGB_TDH0 0x0E010 /* Tx Desc Head */ > > +#define IGB_TDT0 0x0E018 /* Tx Desc Tail */ > > +#define IGB_TXDCTL0 0x0E028 /* Tx Desc Control */ > > + > > +/* Control Bit Definitions */ > > +/* CTRL */ > > +#define IGB_CTRL_RST (1 << 26) /* Device Reset */ > > +#define IGB_CTRL_EXT_LINK_MODE_MASK (3 << 22) > > + > > +/* CTRL_EXT */ > > +#define IGB_CTRL_EXT_DRV_LOAD (1 << 28) /* Driver Loaded */ > > + > > +/* RCTL */ > > +#define IGB_RCTL_EN (1 << 1) /* Receiver Enable */ > > +#define IGB_RCTL_UPE (1 << 3) /* Unicast Promiscuous Enabled */ > > +#define IGB_RCTL_LPE (1 << 5) /* Long Packet Reception Enable */ > > +#define IGB_RCTL_LBM_MAC (1 << 6) /* Loopback Mode - MAC */ > > +#define IGB_RCTL_SECRC (1 << 26) /* Strip Ethernet CRC */ > > + > > +/* TCTL */ > > +#define IGB_TCTL_EN (1 << 1) /* Transmit Enable */ > > + > > +/* SCTL */ > > +#define IGB_SCTL_DISABLE_SERDES_LOOPBACK (1 << 6) > > + > > +/* MDIC */ > > +#define IGB_MDIC_OP_WRITE (1 << 26) > > +#define IGB_MDIC_OP_READ (2 << 26) > > + > > +#define IGB_EICR 0x01580 /* Extended Interrupt Cause Read */ > > + > > +#define IGB_RAH0 0x05404 /* Receive Address High 0 */ > > +#define IGB_VMOLR0 0x05AD0 /* VM Offload Layout Register 0 */ > > + > > +#define IGB_VMOLR_LPE 0x00010000 /* Long Packet Enable */ > > +#define IGB_VMOLR_BAM 0x08000000 /* Broadcast Accept Mode */ > > +#define IGB_RAH_POOL_1 0x00040000 /* Pool 1 assignment */ > > + > > +#define IGB_EIMS 0x01524 /* Extended Interrupt Mask Set */ > > +#define IGB_EICS 0x01520 /* Extended Interrupt Cause Set */ > > +#define IGB_EIMC 0x01528 /* Extended Interrupt Mask Clear */ > > +#define IGB_CTRL_GIO_MASTER_DISABLE (1 << 2) /* GIO Master Disable */ > > +#define IGB_STATUS_GIO_MASTER_ENABLE (1 << 19) /* GIO Master Enable */ > > +#define IGB_GPIE 0x01514 /* General Purpose Interrupt Enable */ > > +#define IGB_TXDCTL0_Q_EN (1 << 25) /* Transmit Queue Enable */ > > +#define IGB_RXDCTL0_Q_EN (1 << 
25) /* Receive Queue Enable */ > > +#define IGB_MRQC 0x05818 /* Multiple Receive Queues Command */ > > + > > +#define IGB_MDIC_PHY_SHIFT 21 /* PHY Address Shift */ > > +#define IGB_MDIC_REG_SHIFT 16 /* Register Address Shift */ > > +#define IGB_MDIC_READY (1 << 28) /* MDI Data Ready */ > > +#define IGB_MDIC_ERROR (1 << 29) /* MDI Error */ > > + > > +#define IGB_PHY_CTRL_REG_OFFSET 0 > > +#define IGB_PHY_STATUS_REG_OFFSET 1 > > +#define IGB_PHY_STATUS_AN_COMP 0x0020 /* Auto-negotiation complete */ > > +#define IGB_PHY_CTRL_AN_ENABLE 0x1000 /* Auto-Negotiation Enable */ > > +#define IGB_PHY_CTRL_RESTART_AN 0x0200 /* Restart Auto-Negotiation */ > > + > > +#define IGB_PHY_AUTONEG_ADV_REG_OFFSET 4 /* Auto-Negotiation Advertisement */ > > +#define IGB_PHY_1000T_CTRL_REG_OFFSET 9 /* 1000Base-T Control */ > > +#define IGB_CR_1000T_FD_CAPS 0x0200 /* Advertise 1000 Mbps Full Duplex */ > > + > > +#define IGB_GPIE_EIAME 0x10 /* Extended Interrupt Auto Mask Enable */ > > +#define IGB_IVAR_VALID 0x80 /* Valid bit for IVAR register */ > > + > > +#define IGB_TXD_CMD_EOP 0x01 /* End of Packet */ > > +#define IGB_TXD_CMD_IFCS 0x02 /* Insert FCS */ > > +#define IGB_TXD_CMD_RS 0x08 /* Report Status */ > > +#define IGB_TXD_CMD_SHIFT 24 /* Shift for command bits in cmd_type_len */ > > +#define IGB_TXD_CMD_LEGACY_FORMAT (1 << 20) /* Forces legacy descriptor format in QEMU */ > > IGB_TXD_CMD_LEGACY_FORMAT appears to be unused. I didn't check the rest. > Can you prune out the unused macros? Yep, will do! > > Related: Are these macros and/or the descriptor struct available in any > kernel headers that we could symlink into the VFIO selftests like we do > for DSA? They're in private headers under drivers/net/ethernet/intel/igb. Are you ok with linking in private headers? The DSA headers are public. 
> > + > > +#endif /* _IGB_REGISTERS_H_ */ > > diff --git a/tools/testing/selftests/vfio/lib/libvfio.mk b/tools/testing/selftests/vfio/lib/libvfio.mk > > index 9f47bceed16f..1f13cca04348 100644 > > --- a/tools/testing/selftests/vfio/lib/libvfio.mk > > +++ b/tools/testing/selftests/vfio/lib/libvfio.mk > > @@ -12,6 +12,7 @@ LIBVFIO_C += vfio_pci_driver.c > > ifeq ($(ARCH:x86_64=x86),x86) > > LIBVFIO_C += drivers/ioat/ioat.c > > LIBVFIO_C += drivers/dsa/dsa.c > > +LIBVFIO_C += drivers/igb/igb.c > > endif > > > > LIBVFIO_OUTPUT := $(OUTPUT)/libvfio > > diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > > index 6827f4a6febe..a5d0547132c4 100644 > > --- a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > > +++ b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > > @@ -5,12 +5,14 @@ > > #ifdef __x86_64__ > > extern struct vfio_pci_driver_ops dsa_ops; > > extern struct vfio_pci_driver_ops ioat_ops; > > +extern struct vfio_pci_driver_ops igb_ops; > > #endif > > > > static struct vfio_pci_driver_ops *driver_ops[] = { > > #ifdef __x86_64__ > > &dsa_ops, > > &ioat_ops, > > + &igb_ops, > > #endif > > }; > > > > -- > > 2.54.0.563.g4f69b47b94-goog > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-13 2:12 ` Josh Hilke @ 2026-05-13 18:49 ` Josh Hilke 2026-05-13 23:33 ` David Matlack 2026-05-13 23:29 ` David Matlack 1 sibling, 1 reply; 11+ messages in thread From: Josh Hilke @ 2026-05-13 18:49 UTC (permalink / raw) To: David Matlack Cc: Alex Williamson, Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest On Tue, May 12, 2026 at 7:12 PM Josh Hilke <jrhilke@google.com> wrote: > > On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > Add a VFIO selftest driver for the Intel Gigabit Ethernet controller > > > (IGB). IGB is fully virtualized in QEMU [1] which makes it easy to run > > > VFIO selftests without needing any specific hardware. > > > > > > IGB does not have a default memcpy operation, and memcpy is required for > > > all VFIO selftest drivers. However, IGB has a "loopback" mode which is > > > used in this driver to implement memcpy. When IGB is in loopback mode, > > > it doesn't send data out to the network. Instead, it sends data from the > > > Tx queue directly to the Rx queue, which points to the memcpy > > > destination. > > > > > > This driver passes all of the VFIO selftests in > > > tools/testing/selftests/vfio/ when running in QEMU. The driver has not > > > been tested using a real IGB device since the main goal of writing this > > > driver is to run VFIO selftests in QEMU without requiring any hardware. > > > > > > This command is used to test the driver. It runs > > > ./tools/testing/selftests/vfio/vfio_pci_driver_test using virtme-ng [2], > > > which runs a kernel in a virtualized environment using QEMU that has a > > > copy-on-write snapshot of the host filesystem. 
> > > > > > vng \ > > > --run arch/x86/boot/bzImage \ > > > --user root \ > > > --disable-microvm \ > > > --memory 32G \ > > > --cpus 8 \ > > > --qemu-opts="-M q35,accel=kvm,kernel-irqchip=split" \ > > > --qemu-opts="-device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on" \ > > > --qemu-opts="-netdev user,id=net0 -device igb,netdev=net0,addr=09.0" \ > > > --append "console=ttyS0 earlyprintk=ttyS0 intel_iommu=on iommu=pt" \ > > > --exec "modprobe vfio-pci && \ > > > ./tools/testing/selftests/vfio/scripts/setup.sh 0000:00:09.0 && \ > > > ./tools/testing/selftests/vfio/scripts/run.sh ./tools/testing/selftests/vfio/vfio_pci_driver_test" > > > > > > Code was written entirely by AI (Gemini) by feeding it the Intel IGB > > > specification, the code for the real IGB driver, and the QEMU > > > implementation of the IGB device. It took many iterations of prompting > > > to get the driver to pass all of the tests, and lots of de-slopping to > > > remove unnecessary code and make the code readable. > > > > > > Code comments were also written by Gemini through iterative > > > prompting. > > > > > > This patch is based on the kvm/queue branch. > > > > > > [1] https://www.qemu.org/docs/master/system/devices/igb.html > > > [2] https://github.com/arighi/virtme-ng > > > > > > Assisted-by: Gemini:gemini-3-pro-preview > > > Signed-off-by: Josh Hilke <jrhilke@google.com> > > > > I just did a quick first pass and left some feedback. Overall I'm very > > excited to have a driver for a device supported by QEMU! 
> > > > > --- > > > .../selftests/vfio/lib/drivers/igb/igb.c | 381 ++++++++++++++++++ > > > .../vfio/lib/drivers/igb/registers.h | 103 +++++ > > > tools/testing/selftests/vfio/lib/libvfio.mk | 1 + > > > .../selftests/vfio/lib/vfio_pci_driver.c | 2 + > > > 4 files changed, 487 insertions(+) > > > create mode 100644 tools/testing/selftests/vfio/lib/drivers/igb/igb.c > > > create mode 100644 tools/testing/selftests/vfio/lib/drivers/igb/registers.h > > > > > > diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c > > > new file mode 100644 > > > index 000000000000..13c8429784ac > > > --- /dev/null > > > +++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c > > > @@ -0,0 +1,381 @@ > > > +// SPDX-License-Identifier: GPL-2.0-only > > > +#include <unistd.h> > > > +#include <errno.h> > > > +#include <stdint.h> > > > +#include <linux/io.h> > > > +#include <linux/pci_regs.h> > > > +#include <libvfio/vfio_pci_device.h> > > > + > > > +#include "registers.h" > > > + > > > +#define PCI_VENDOR_ID_INTEL 0x8086 > > > +#define PCI_DEVICE_ID_INTEL_IGB 0x10C9 > > > > You should be able to include <linux/pci_ids.h> for these. > > PCI_VENDOR_ID_INTEL is in <linux/pci_ids.h>, so I can remove that one, > but PCI_DEVICE_ID_INTEL_IGB is only defined in the private header, > drivers/net/ethernet/intel/e1000/e1000_hw.h. I added more detail about > private headers to your comment below about utilizing existing > constants from IGB header files. > > > > > > +#define IGB_MAX_CHUNK_SIZE 1024 > > > +#define MSIX_VECTOR 0 > > > +#define RING_SIZE 4096 /* Number of descriptors in ring */ > > > > What's the maximum ring size that IGB can support? The max chunk size is > > quite small so using a larger ring will be helpful toward getting the > > device to do a lot of memcpys for Live Update testing. > > I can dig into QEMU code and investigate the max ring size the QEMU > IGB implementation can handle. 
4096 is just the max ring size for the real > IGB in drivers/net/ethernet/intel/igb/igb.h. What's our target for > the maximum memcpy size? > > > > + > > > +struct igb_tx_desc { > > > + union { > > > + struct { > > > + uint64_t buffer_addr; /* Address of descriptor's data buffer */ > > > + uint32_t cmd_type_len; /* Command/Type/Length */ > > > + uint32_t olinfo_status; /* Context/Buffer info */ > > > > Use kernel-style fixed width types (u64, u32, etc.). > > Will do in v2. > > > > > > + } read; > > > + > > > + struct { > > > + uint64_t rsvd; /* Reserved */ > > > + uint32_t nxtseq_seed; /* Next sequence seed */ > > > + uint32_t status; /* Descriptor status */ > > > + } wb; > > > + }; > > > +}; > > > + > > > +struct igb_rx_desc { > > > + union { > > > + struct { > > > + uint64_t pkt_addr; /* Packet buffer address */ > > > + uint64_t hdr_addr; /* Header buffer address */ > > > + } read; > > > + struct { > > > + uint16_t pkt_info; /* RSS type, Packet type */ > > > + uint16_t hdr_info; /* Split Head, buf len */ > > > + uint32_t rss; /* RSS Hash */ > > > + uint32_t status_error; /* ext status/error */ > > > + uint16_t length; /* Packet length */ > > > + uint16_t vlan; /* VLAN tag */ > > > + } wb; /* writeback */ > > > + }; > > > +}; > > > + > > > +struct igb { > > > + void *bar0; > > > + struct igb_tx_desc *tx_ring; > > > + struct igb_rx_desc *rx_ring; > > > + uint32_t tx_tail; > > > + uint32_t rx_tail; > > > +}; > > > + > > > +static inline struct igb *to_igb_state(struct vfio_pci_device *device) > > > +{ > > > + return (struct igb *)device->driver.region.vaddr; > > > +} > > > + > > > +static inline void igb_write32(struct igb *igb, uint32_t reg, uint32_t val) > > > +{ > > > + writel(val, igb->bar0 + reg); > > > +} > > > + > > > +static inline uint32_t igb_read32(struct igb *igb, uint32_t reg) > > > +{ > > > + return readl(igb->bar0 + reg); > > > +} > > > + > > > +static int igb_write_phy(struct igb *igb, uint32_t offset, uint16_t data) > > > +{ > > > + uint32_t mdic; > 
> > + int i; > > > + > > > + mdic = (((uint32_t)data) | > > > + (offset << IGB_MDIC_REG_SHIFT) | > > > + (1 << IGB_MDIC_PHY_SHIFT) | > > > + IGB_MDIC_OP_WRITE); > > > + > > > + igb_write32(igb, IGB_MDIC, mdic); > > > + > > > + for (i = 0; i < 1000; i++) { > > > + usleep(50); > > > + mdic = igb_read32(igb, IGB_MDIC); > > > + if (mdic & IGB_MDIC_READY) > > > + break; > > > + } > > > + > > > + if (!(mdic & IGB_MDIC_READY)) > > > + return -1; > > > + > > > + if (mdic & IGB_MDIC_ERROR) > > > + return -1; > > > + > > > + return 0; > > > +} > > > + > > > +static int igb_read_phy(struct igb *igb, uint32_t offset, uint16_t *data) > > > +{ > > > + uint32_t mdic; > > > + int i; > > > + > > > + mdic = ((offset << IGB_MDIC_REG_SHIFT) | > > > + (1 << IGB_MDIC_PHY_SHIFT) | > > > + IGB_MDIC_OP_READ); > > > + > > > + igb_write32(igb, IGB_MDIC, mdic); > > > + > > > + for (i = 0; i < 1000; i++) { > > > + usleep(50); > > > + mdic = igb_read32(igb, IGB_MDIC); > > > + if (mdic & IGB_MDIC_READY) > > > + break; > > > + } > > > + > > > + if (!(mdic & IGB_MDIC_READY)) > > > + return -1; > > > + > > > + if (mdic & IGB_MDIC_ERROR) > > > + return -1; > > > + > > > + *data = (uint16_t)mdic; > > > + return 0; > > > +} > > > + > > > +static int igb_phy_setup_autoneg(struct igb *igb) > > > +{ > > > + int timeout_ms = 1000; > > > + bool success = false; > > > + uint16_t phy_status; > > > + int ret; > > > + int i; > > > + > > > + /* Trigger auto-negotiation */ > > > + ret = igb_write_phy(igb, IGB_PHY_CTRL_REG_OFFSET, > > > + IGB_PHY_CTRL_AN_ENABLE | IGB_PHY_CTRL_RESTART_AN); > > > + if (ret) > > > + return ret; > > > + > > > + for (i = 0; i < timeout_ms; i++) { > > > + if (igb_read_phy(igb, IGB_PHY_STATUS_REG_OFFSET, &phy_status) == 0) { > > > + success = !!(phy_status & IGB_PHY_STATUS_AN_COMP); > > > + if (success) > > > + break; > > > + } > > > + usleep(1000); > > > + } > > > + > > > + if (!success) { > > > + printf("igb: Auto-negotiation did not complete in time\n"); > > > + return -ETIMEDOUT; > 
> > + } > > > + > > > + return 0; > > > +} > > > + > > > +static int igb_probe(struct vfio_pci_device *device) > > > +{ > > > + if (!vfio_pci_device_match(device, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IGB)) > > > + return -EINVAL; > > > + > > > + return 0; > > > +} > > > + > > > +static void igb_init(struct vfio_pci_device *device) > > > +{ > > > + struct igb *igb = to_igb_state(device); > > > + uint64_t iova_tx, iova_rx; > > > + uint32_t ctrl, rctl; > > > + uint16_t cmd_reg; > > > + int retries; > > > + > > > + VFIO_ASSERT_GE(device->driver.region.size, > > > + sizeof(struct igb) + 2 * RING_SIZE * sizeof(struct igb_tx_desc)); > > > > Please put these rings into struct igb directly rather than doing raw > > pointer arithmetic and size calculations. e.g. > > Will do in v2. > > > struct igb { > > void *bar0; > > u32 tx_tail; > > u32 rx_tail; > > struct igb_tx_desc tx_ring[RING_SIZE]; > > struct igb_rx_desc rx_ring[RING_SIZE]; > > }; > > > > > + > > > + /* Set up rings and calculate IOVAs */ > > > + igb->bar0 = device->bars[0].vaddr; > > > + > > > + igb->tx_ring = (struct igb_tx_desc *)(igb + 1); > > > + igb->rx_ring = (struct igb_rx_desc *)(igb->tx_ring + RING_SIZE); > > > + > > > + iova_tx = to_iova(device, igb->tx_ring); > > > + iova_rx = to_iova(device, igb->rx_ring); > > > + > > > + /* Reset device and disable all interrupts */ > > > + igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST); > > > + usleep(20000); > > > + igb_write32(igb, IGB_IMC, 0xFFFFFFFF); > > > + > > > + /* Signal that the driver is loaded */ > > > + ctrl = igb_read32(igb, IGB_CTRL_EXT); > > > + ctrl |= IGB_CTRL_EXT_DRV_LOAD; > > > + ctrl &= ~IGB_CTRL_EXT_LINK_MODE_MASK; > > > + igb_write32(igb, IGB_CTRL_EXT, ctrl); > > > + > > > + /* Enable PCI Bus Master. 
*/ > > > + cmd_reg = vfio_pci_config_readw(device, PCI_COMMAND); > > > + if (!(cmd_reg & PCI_COMMAND_MASTER)) { > > > + cmd_reg |= (PCI_COMMAND_MASTER | PCI_COMMAND_MEMORY); > > > + vfio_pci_config_writew(device, PCI_COMMAND, cmd_reg); > > > + } > > > + > > > + /* Trigger autonegotiation. This enables IGB to transmit data. */ > > > + if (igb_phy_setup_autoneg(igb)) > > > + return; > > > + > > > + /* Configure TX and RX descriptor rings */ > > > + igb_write32(igb, IGB_TDBAL0, (uint32_t)iova_tx); > > > + igb_write32(igb, IGB_TDBAH0, (uint32_t)(iova_tx >> 32)); > > > + igb_write32(igb, IGB_TDLEN0, RING_SIZE * sizeof(struct igb_tx_desc)); > > > + igb_write32(igb, IGB_TDH0, 0); > > > + igb_write32(igb, IGB_TDT0, 0); > > > + igb_write32(igb, IGB_TXDCTL0, IGB_TXDCTL0_Q_EN); > > > + > > > + igb_write32(igb, IGB_RDBAL0, (uint32_t)iova_rx); > > > + igb_write32(igb, IGB_RDBAH0, (uint32_t)(iova_rx >> 32)); > > > + igb_write32(igb, IGB_RDLEN0, RING_SIZE * sizeof(struct igb_rx_desc)); > > > + igb_write32(igb, IGB_RDH0, 0); > > > + igb_write32(igb, IGB_RDT0, 0); > > > + igb_write32(igb, IGB_RXDCTL0, IGB_RXDCTL0_Q_EN); > > > + > > > + /* Wait for TX and RX queues to be enabled */ > > > + retries = 2000; > > > + while (retries-- > 0) { > > > + if ((igb_read32(igb, IGB_TXDCTL0) & IGB_TXDCTL0_Q_EN) && > > > + (igb_read32(igb, IGB_RXDCTL0) & IGB_RXDCTL0_Q_EN)) > > > + break; > > > + usleep(10); > > > + } > > > + > > > + /* Enable Receiver and Transmitter */ > > > + rctl = IGB_RCTL_EN | /* Receiver Enable */ > > > + IGB_RCTL_UPE | /* Unicast Promiscuous (for dummy MAC) */ > > > + IGB_RCTL_LBM_MAC | /* MAC Loopback Mode */ > > > + IGB_RCTL_SECRC; /* Strip CRC (needed for memcmp) */ > > > + igb_write32(igb, IGB_RCTL, rctl); > > > + igb_write32(igb, IGB_TCTL, IGB_TCTL_EN); > > > + > > > + /* Enable MSI-X with 1 vector for the test */ > > > + vfio_pci_msix_enable(device, MSIX_VECTOR, 1); > > > + > > > + /* Enable auto-masking of interrupts to avoid storms without a real ISR */ > > > + 
igb_write32(igb, IGB_GPIE, IGB_GPIE_EIAME); > > > + > > > + /* Enable interrupts on vector 0 */ > > > + igb_write32(igb, IGB_EIMS, 1); > > > + > > > + /* Map vector 0 to interrupt cause 0 and mark it valid */ > > > + igb_write32(igb, IGB_IVAR0, IGB_IVAR_VALID); > > > + > > > + /* Initialize driver state and capability limits */ > > > + igb->tx_tail = 0; > > > + igb->rx_tail = 0; > > > > This comment seems wrong. > > Ah, yea, looks like some stuff got shuffled around during refactoring. > I'll fix it in v2. > > > > + > > > + device->driver.max_memcpy_size = (RING_SIZE - 1) * IGB_MAX_CHUNK_SIZE; > > > + device->driver.max_memcpy_count = RING_SIZE - 1; > > > > The device must be able to do max_memcpy_count copies of > > max_memcpy_size, so max_memcpy_size should be IGB_MAX_CHUNK_SIZE. > > > > Once you do this you can also get rid of the chunking loop in > > memcpy_start(). > > Got it. Will do in v2. > > > > > > + device->driver.msi = MSIX_VECTOR; > > > +} > > > + > > > +static void igb_remove(struct vfio_pci_device *device) > > > +{ > > > + struct igb *igb = to_igb_state(device); > > > + > > > + vfio_pci_msix_disable(device); > > > + igb_write32(igb, IGB_RCTL, 0); > > > + igb_write32(igb, IGB_TCTL, 0); > > > + igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST); > > > +} > > > + > > > +static void igb_irq_disable(struct igb *igb) > > > +{ > > > + igb_write32(igb, IGB_EIMC, 1); > > > +} > > > + > > > +static void igb_irq_enable(struct igb *igb) > > > +{ > > > + igb_write32(igb, IGB_EIMS, 1); > > > +} > > > + > > > +static void igb_irq_clear(struct igb *igb) > > > +{ > > > + igb_read32(igb, IGB_EICR); > > > +} > > > + > > > +static void igb_memcpy_start(struct vfio_pci_device *device, iova_t src, > > > + iova_t dst, u64 size, u64 count) > > > +{ > > > + struct igb *igb = to_igb_state(device); > > > + iova_t curr_src, curr_dst; > > > + struct igb_rx_desc *rx; > > > + struct igb_tx_desc *tx; > > > + u64 chunk_size; > > > + u64 remaining; > > > + uint32_t 
i; > > > + > > > + igb_irq_disable(igb); > > > + > > > + for (i = 0; i < count; i++) { > > > + curr_src = src; > > > + curr_dst = dst; > > > + remaining = size; > > > + > > > + while (remaining > 0) { > > > + chunk_size = min_t(u64, remaining, IGB_MAX_CHUNK_SIZE); > > > + > > > + tx = &igb->tx_ring[igb->tx_tail]; > > > + rx = &igb->rx_ring[igb->rx_tail]; > > > + > > > + memset(tx, 0, sizeof(struct igb_tx_desc)); > > > + memset(rx, 0, sizeof(struct igb_rx_desc)); > > > + > > > + rx->read.pkt_addr = curr_dst; > > > + rx->read.hdr_addr = 0; > > > + rx->wb.status_error = 0; > > > > Don't rx->read.hdr_addr and rx->wb.status_error overlap the same u64? > > It's not a bug since both write 0 but it is weird. > > That's correct, they overlap. I agree it's a bit weird -- it's a > hardware optimization to save memory. This layout is also how the real > descriptor (`e1000_adv_rx_desc`) is structured in > drivers/net/ethernet/intel/igb/e1000_82575.h. The QEMU IGB > implementation also assumes this layout for the bits. 
> > > > > > + > > > + tx->read.buffer_addr = curr_src; > > > + tx->read.cmd_type_len = (uint32_t)chunk_size; > > > + tx->read.cmd_type_len |= (uint32_t)(IGB_TXD_CMD_EOP) << IGB_TXD_CMD_SHIFT; > > > + > > > + /* Set to 0 to disable offloads and avoid needing a context descriptor */ > > > + tx->read.olinfo_status = 0; > > > + > > > + curr_src += chunk_size; > > > + curr_dst += chunk_size; > > > + remaining -= chunk_size; > > > + > > > + igb->tx_tail = (igb->tx_tail + 1) % RING_SIZE; > > > + igb->rx_tail = (igb->rx_tail + 1) % RING_SIZE; > > > + } > > > + } > > > + > > > + igb_write32(igb, IGB_RDT0, igb->rx_tail); > > > + igb_write32(igb, IGB_TDT0, igb->tx_tail); > > > +} > > > + > > > +static int igb_memcpy_wait(struct vfio_pci_device *device) > > > +{ > > > + struct igb *igb = to_igb_state(device); > > > + struct igb_rx_desc *rx; > > > + uint32_t prev_tail; > > > + int retries; > > > + > > > + prev_tail = (igb->rx_tail + RING_SIZE - 1) % RING_SIZE; > > > + rx = &igb->rx_ring[prev_tail]; > > > + > > > + retries = 100; > > > + while (retries-- > 0) { > > > + if (rx->wb.status_error & 1) > > > + break; > > > + usleep(10); > > > + } > > > > Why bail after a certain timeout? The test may have kicked off a large > > count of memcpys. Is this for error detection? > > The bailout was intended to detect errors during development. > Shouldn't need it anymore. I'll remove it in v2. Sorry, I forgot: we need the timeout to detect DMA errors for the memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU aborts the DMA operation. However, the QEMU IGB implementation does not set an error bit, so timing out is our only method for error detection. 
> > > > + > > > + igb_irq_clear(igb); > > > + > > > + igb_irq_enable(igb); > > > + > > > + if (rx->wb.status_error & 1) > > > + return 0; > > > + > > > + return -ETIMEDOUT; > > > > Is there any other way the device signals errors other than failing > > to set bit 0 of status_error? I'm especially curious how it handles > > memcpy_from_unmapped_iova in vfio_pci_driver_test.c. > > Yea, there are some relevant error bits in status_error, but > unfortunately QEMU IGB doesn't emulate them: > - E1000_RXDEXT_STATERR_LB: confirms that the packet actually traversed > the loopback path > - E1000_RXDEXT_STATERR_CE: detects bit flips during transmission > - E1000_RXDEXT_STATERR_RXE: catch-all error bit for the receive engine > (buffer overflows, etc.) > > The other error bits (E1000_RXDEXT_STATERR_IPE, > E1000_RXDEXT_STATERR_TCPE) validate network packet integrity, which > isn't relevant because the driver doesn't package the memcpy data > within IP or TCP/UDP packets. > > Regarding the memcpy_from_unmapped_iova test, the test only cares that > the device doesn't trigger an MSI. Normally, the QEMU IGB sends an MSI > once it sets the descriptor done bit on completion of a memcpy > operation. To prevent this, the driver disables interrupts before > starting memcpy, and then re-enables them after the memcpy finishes. 
> > > > +} > > > + > > > +static void igb_send_msi(struct vfio_pci_device *device) > > > +{ > > > + struct igb *igb = to_igb_state(device); > > > + > > > + igb_write32(igb, IGB_EICS, 1); > > > +} > > > + > > > +const struct vfio_pci_driver_ops igb_ops = { > > > + .name = "igb", > > > + .probe = igb_probe, > > > + .init = igb_init, > > > + .remove = igb_remove, > > > + .memcpy_start = igb_memcpy_start, > > > + .memcpy_wait = igb_memcpy_wait, > > > + .send_msi = igb_send_msi, > > > +}; > > > diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h > > > new file mode 100644 > > > index 000000000000..e2756be2e9c0 > > > --- /dev/null > > > +++ b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h > > > @@ -0,0 +1,103 @@ > > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > > + > > > +#ifndef _IGB_REGISTERS_H_ > > > +#define _IGB_REGISTERS_H_ > > > + > > > +/* Register Offsets (Intel 82576EB Datasheet) */ > > > +#define IGB_CTRL 0x00000 /* Device Control */ > > > +#define IGB_STATUS 0x00008 /* Device Status */ > > > +#define IGB_CTRL_EXT 0x00018 /* Extended Device Control */ > > > +#define IGB_MDIC 0x00020 /* MDI Control */ > > > +#define IGB_RCTL 0x00100 /* Receive Control */ > > > +#define IGB_TCTL 0x00400 /* Transmit Control */ > > > +#define IGB_SCTL 0x00420 /* SerDes Control */ > > > + > > > +/* Interrupt Registers */ > > > +#define IGB_IMC 0x0150C /* Interrupt Mask Clear */ > > > +#define IGB_IVAR0 0x01700 /* Interrupt Vector Allocation Register 0 */ > > > + > > > +/* Rx Ring 0 Registers */ > > > +#define IGB_RDBAL0 0x0C000 /* Rx Desc Base Address Low */ > > > +#define IGB_RDBAH0 0x0C004 /* Rx Desc Base Address High */ > > > +#define IGB_RDLEN0 0x0C008 /* Rx Desc Length */ > > > +#define IGB_RDH0 0x0C010 /* Rx Desc Head */ > > > +#define IGB_RDT0 0x0C018 /* Rx Desc Tail */ > > > +#define IGB_RXDCTL0 0x0C028 /* Rx Desc Control */ > > > + > > > +/* Tx Ring 0 Registers */ > > > +#define 
IGB_TDBAL0 0x0E000 /* Tx Desc Base Address Low */ > > > +#define IGB_TDBAH0 0x0E004 /* Tx Desc Base Address High */ > > > +#define IGB_TDLEN0 0x0E008 /* Tx Desc Length */ > > > +#define IGB_TDH0 0x0E010 /* Tx Desc Head */ > > > +#define IGB_TDT0 0x0E018 /* Tx Desc Tail */ > > > +#define IGB_TXDCTL0 0x0E028 /* Tx Desc Control */ > > > + > > > +/* Control Bit Definitions */ > > > +/* CTRL */ > > > +#define IGB_CTRL_RST (1 << 26) /* Device Reset */ > > > +#define IGB_CTRL_EXT_LINK_MODE_MASK (3 << 22) > > > + > > > +/* CTRL_EXT */ > > > +#define IGB_CTRL_EXT_DRV_LOAD (1 << 28) /* Driver Loaded */ > > > + > > > +/* RCTL */ > > > +#define IGB_RCTL_EN (1 << 1) /* Receiver Enable */ > > > +#define IGB_RCTL_UPE (1 << 3) /* Unicast Promiscuous Enabled */ > > > +#define IGB_RCTL_LPE (1 << 5) /* Long Packet Reception Enable */ > > > +#define IGB_RCTL_LBM_MAC (1 << 6) /* Loopback Mode - MAC */ > > > +#define IGB_RCTL_SECRC (1 << 26) /* Strip Ethernet CRC */ > > > + > > > +/* TCTL */ > > > +#define IGB_TCTL_EN (1 << 1) /* Transmit Enable */ > > > + > > > +/* SCTL */ > > > +#define IGB_SCTL_DISABLE_SERDES_LOOPBACK (1 << 6) > > > + > > > +/* MDIC */ > > > +#define IGB_MDIC_OP_WRITE (1 << 26) > > > +#define IGB_MDIC_OP_READ (2 << 26) > > > + > > > +#define IGB_EICR 0x01580 /* Extended Interrupt Cause Read */ > > > + > > > +#define IGB_RAH0 0x05404 /* Receive Address High 0 */ > > > +#define IGB_VMOLR0 0x05AD0 /* VM Offload Layout Register 0 */ > > > + > > > +#define IGB_VMOLR_LPE 0x00010000 /* Long Packet Enable */ > > > +#define IGB_VMOLR_BAM 0x08000000 /* Broadcast Accept Mode */ > > > +#define IGB_RAH_POOL_1 0x00040000 /* Pool 1 assignment */ > > > + > > > +#define IGB_EIMS 0x01524 /* Extended Interrupt Mask Set */ > > > +#define IGB_EICS 0x01520 /* Extended Interrupt Cause Set */ > > > +#define IGB_EIMC 0x01528 /* Extended Interrupt Mask Clear */ > > > +#define IGB_CTRL_GIO_MASTER_DISABLE (1 << 2) /* GIO Master Disable */ > > > +#define IGB_STATUS_GIO_MASTER_ENABLE (1 << 19) /* 
GIO Master Enable */ > > > +#define IGB_GPIE 0x01514 /* General Purpose Interrupt Enable */ > > > +#define IGB_TXDCTL0_Q_EN (1 << 25) /* Transmit Queue Enable */ > > > +#define IGB_RXDCTL0_Q_EN (1 << 25) /* Receive Queue Enable */ > > > +#define IGB_MRQC 0x05818 /* Multiple Receive Queues Command */ > > > + > > > +#define IGB_MDIC_PHY_SHIFT 21 /* PHY Address Shift */ > > > +#define IGB_MDIC_REG_SHIFT 16 /* Register Address Shift */ > > > +#define IGB_MDIC_READY (1 << 28) /* MDI Data Ready */ > > > +#define IGB_MDIC_ERROR (1 << 29) /* MDI Error */ > > > + > > > +#define IGB_PHY_CTRL_REG_OFFSET 0 > > > +#define IGB_PHY_STATUS_REG_OFFSET 1 > > > +#define IGB_PHY_STATUS_AN_COMP 0x0020 /* Auto-negotiation complete */ > > > +#define IGB_PHY_CTRL_AN_ENABLE 0x1000 /* Auto-Negotiation Enable */ > > > +#define IGB_PHY_CTRL_RESTART_AN 0x0200 /* Restart Auto-Negotiation */ > > > + > > > +#define IGB_PHY_AUTONEG_ADV_REG_OFFSET 4 /* Auto-Negotiation Advertisement */ > > > +#define IGB_PHY_1000T_CTRL_REG_OFFSET 9 /* 1000Base-T Control */ > > > +#define IGB_CR_1000T_FD_CAPS 0x0200 /* Advertise 1000 Mbps Full Duplex */ > > > + > > > +#define IGB_GPIE_EIAME 0x10 /* Extended Interrupt Auto Mask Enable */ > > > +#define IGB_IVAR_VALID 0x80 /* Valid bit for IVAR register */ > > > + > > > +#define IGB_TXD_CMD_EOP 0x01 /* End of Packet */ > > > +#define IGB_TXD_CMD_IFCS 0x02 /* Insert FCS */ > > > +#define IGB_TXD_CMD_RS 0x08 /* Report Status */ > > > +#define IGB_TXD_CMD_SHIFT 24 /* Shift for command bits in cmd_type_len */ > > > +#define IGB_TXD_CMD_LEGACY_FORMAT (1 << 20) /* Forces legacy descriptor format in QEMU */ > > > > IGB_TXD_CMD_LEGACY_FORMAT appears to be unused. I didn't check the rest. > > Can you prune out the unused macros? > > Yep, will do! > > > > > Related: Are these macros and/or the descriptor struct available in any > > kernel headers that we could symlink into the VFIO selftests like we do > > for DSA? 
> > They're in private headers under drivers/net/ethernet/intel/igb. Are > you ok with linking in private headers? The DSA headers are public. > > > > + > > > +#endif /* _IGB_REGISTERS_H_ */ > > > diff --git a/tools/testing/selftests/vfio/lib/libvfio.mk b/tools/testing/selftests/vfio/lib/libvfio.mk > > > index 9f47bceed16f..1f13cca04348 100644 > > > --- a/tools/testing/selftests/vfio/lib/libvfio.mk > > > +++ b/tools/testing/selftests/vfio/lib/libvfio.mk > > > @@ -12,6 +12,7 @@ LIBVFIO_C += vfio_pci_driver.c > > > ifeq ($(ARCH:x86_64=x86),x86) > > > LIBVFIO_C += drivers/ioat/ioat.c > > > LIBVFIO_C += drivers/dsa/dsa.c > > > +LIBVFIO_C += drivers/igb/igb.c > > > endif > > > > > > LIBVFIO_OUTPUT := $(OUTPUT)/libvfio > > > diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > > > index 6827f4a6febe..a5d0547132c4 100644 > > > --- a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > > > +++ b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c > > > @@ -5,12 +5,14 @@ > > > #ifdef __x86_64__ > > > extern struct vfio_pci_driver_ops dsa_ops; > > > extern struct vfio_pci_driver_ops ioat_ops; > > > +extern struct vfio_pci_driver_ops igb_ops; > > > #endif > > > > > > static struct vfio_pci_driver_ops *driver_ops[] = { > > > #ifdef __x86_64__ > > > &dsa_ops, > > > &ioat_ops, > > > + &igb_ops, > > > #endif > > > }; > > > > > > -- > > > 2.54.0.563.g4f69b47b94-goog > > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-13 18:49 ` Josh Hilke @ 2026-05-13 23:33 ` David Matlack 2026-05-14 16:28 ` Alex Williamson 0 siblings, 1 reply; 11+ messages in thread From: David Matlack @ 2026-05-13 23:33 UTC (permalink / raw) To: Josh Hilke Cc: Alex Williamson, Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest On 2026-05-13 11:49 AM, Josh Hilke wrote: > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <jrhilke@google.com> wrote: > > On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > > + retries = 100; > > > > + while (retries-- > 0) { > > > > + if (rx->wb.status_error & 1) > > > > + break; > > > > + usleep(10); > > > > + } > > > > > > Why bail after a certain timeout? The test may have kicked off a large > > > count of memcpys. Is this for error detection? > > > > The bailout was intended to detect errors during development. > > Shouldn't need it anymore. I'll remove it in v2. > > Sorry, I forgot: we need the timeout to detect DMA errors for the > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU > aborts the DMA operation. However, the QEMU IGB implementation does > not set an error bit, so timing out is our only method for error > detection. Hm... that's going to be tricky then. This means we would have to set the timeout to longer than the longest possible memcpy duration to avoid false negatives? That means we'll have to set the timeout to quite long. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-13 23:33 ` David Matlack @ 2026-05-14 16:28 ` Alex Williamson 2026-05-14 21:49 ` Josh Hilke 0 siblings, 1 reply; 11+ messages in thread From: Alex Williamson @ 2026-05-14 16:28 UTC (permalink / raw) To: David Matlack, Josh Hilke Cc: Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest, alex On Wed, 13 May 2026 23:33:02 +0000 David Matlack <dmatlack@google.com> wrote: > On 2026-05-13 11:49 AM, Josh Hilke wrote: > > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <jrhilke@google.com> wrote: > > > On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > > > > + retries = 100; > > > > > + while (retries-- > 0) { > > > > > + if (rx->wb.status_error & 1) > > > > > + break; > > > > > + usleep(10); > > > > > + } > > > > > > > > Why bail after a certain timeout? The test may have kicked off a large > > > > count of memcpys. Is this for error detection? > > > > > > The bailout was intended to detect errors during development. > > > Shouldn't need it anymore. I'll remove it in v2. > > > > Sorry, I forgot: we need the timeout to detect DMA errors for the > > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test > > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU > > aborts the DMA operation. However, the QEMU IGB implementation does > > not set an error bit, so timing out is our only method for error > > detection. > > Hm... that's going to be tricky then. This means we would have to set > the timeout to longer than the longest possible memcpy duration to avoid > false negatives? That means we'll have to set the timeout to quite long. FWIW, I had AI churn on trying to make this work on a physical 82576 as I have several of these in my local machines as sort of the de facto, readily available SR-IOV NIC. 
The AI got up to 30/35 tests passing but is currently stuck because the queues stall in the mix-and-match test when it's trying to DMA from an unmapped IOVA. So far none of the in-band methods to kick the queues seem to work; I'm not sure if we'll need to resort to an FLR. I'd be happy to send the changes it's made so far if you want to validate and incorporate, or have any thoughts on kicking it after the IOMMU fault. Some of the changes are related to timeouts, where QEMU loopback is actually faster than bare metal since the physical queues run at 1Gbps even in loopback mode. I'll also plant the seed that if we do have outstanding issues for a driver that binds to a real-world device, but only works on the emulated version of that device... how do we handle that? In part, I think it's emulated in QEMU because it is so ubiquitous. I'm also hoping to use the same device for the new SR-IOV selftests. Thanks, Alex ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-14 16:28 ` Alex Williamson @ 2026-05-14 21:49 ` Josh Hilke 2026-05-14 22:48 ` Alex Williamson 0 siblings, 1 reply; 11+ messages in thread From: Josh Hilke @ 2026-05-14 21:49 UTC (permalink / raw) To: Alex Williamson Cc: David Matlack, Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest On Thu, May 14, 2026 at 9:28 AM Alex Williamson <alex@shazbot.org> wrote: > > On Wed, 13 May 2026 23:33:02 +0000 > David Matlack <dmatlack@google.com> wrote: > > > On 2026-05-13 11:49 AM, Josh Hilke wrote: > > > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <jrhilke@google.com> wrote: > > > > On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > > > > > > + retries = 100; > > > > > > + while (retries-- > 0) { > > > > > > + if (rx->wb.status_error & 1) > > > > > > + break; > > > > > > + usleep(10); > > > > > > + } > > > > > > > > > > Why bail after a certain timeout? The test may have kicked off a large > > > > > count of memcpys. Is this for error detection? > > > > > > > > The bailout was intended to detect errors during development. > > > > Shouldn't need it anymore. I'll remove it in v2. > > > > > > Sorry, I forgot: we need the timeout to detect DMA errors for the > > > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test > > > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU > > > aborts the DMA operation. However, the QEMU IGB implementation does > > > not set an error bit, so timing out is our only method for error > > > detection. > > > > Hm... that's going to be tricky then. This means we would have to set > > the timeout to longer than the longest possible memcpy duration to avoid > > false negatives? That means we'll have to set the timeout to quite long. 
> > FWIW, I had AI churn on trying to make this work on a physical 82576 as > I have several of these in my local machines as sort of the de facto, > readily available SR-IOV NIC. The AI got up to 30/35 tests passing but > is currently stuck because the queues stall in the mix-and-match test when > it's trying to DMA from an unmapped IOVA. So far none of the in-band > methods to kick the queues seem to work; I'm not sure if we'll need to > resort to an FLR. > > I'd be happy to send the changes it's made so far if you want to > validate and incorporate, or have any thoughts on kicking it after the > IOMMU fault. Some of the changes are related to timeouts, where QEMU > loopback is actually faster than bare metal since the physical queues > run at 1Gbps even in loopback mode. > > I'll also plant the seed that if we do have outstanding issues for a > driver that binds to a real-world device, but only works on the > emulated version of that device... how do we handle that? In part, I > think it's emulated in QEMU because it is so ubiquitous. I'm also > hoping to use the same device for the new SR-IOV selftests. Thanks, > > Alex I'm glad you're interested in this as well! Unfortunately (and ironically) I don't have access to a physical device. Regarding driver support for the real vs. emulated device, I think we should prioritize supporting the emulated version. This approach unlocks the ability for anyone to run VFIO/Live Update tests without needing specific hardware. Once that's done, other folks can add patches to update the driver if they want to use the physical device. What do you think? I plan to add SR-IOV support to this driver in a separate series after the driver merges. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-14 21:49 ` Josh Hilke @ 2026-05-14 22:48 ` Alex Williamson 2026-05-14 22:58 ` David Matlack 2026-05-14 23:16 ` Josh Hilke 0 siblings, 2 replies; 11+ messages in thread From: Alex Williamson @ 2026-05-14 22:48 UTC (permalink / raw) To: Josh Hilke Cc: David Matlack, Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest, alex On Thu, 14 May 2026 14:49:28 -0700 Josh Hilke <jrhilke@google.com> wrote: > On Thu, May 14, 2026 at 9:28 AM Alex Williamson <alex@shazbot.org> wrote: > > > > On Wed, 13 May 2026 23:33:02 +0000 > > David Matlack <dmatlack@google.com> wrote: > > > > > On 2026-05-13 11:49 AM, Josh Hilke wrote: > > > > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <jrhilke@google.com> wrote: > > > > > On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > > > > > > > > + retries = 100; > > > > > > > + while (retries-- > 0) { > > > > > > > + if (rx->wb.status_error & 1) > > > > > > > + break; > > > > > > > + usleep(10); > > > > > > > + } > > > > > > > > > > > > Why bail after a certain timeout? The test may have kicked off a large > > > > > > count of memcpys. Is this for error detection? > > > > > > > > > > The bailout was intended to detect errors during development. > > > > > Shouldn't need it anymore. I'll remove it in v2. > > > > > > > > Sorry, I forgot: we need the timeout to detect DMA errors for the > > > > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test > > > > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU > > > > aborts the DMA operation. However, the QEMU IGB implementation does > > > > not set an error bit, so timing out is our only method for error > > > > detection. > > > > > > Hm... that's going to be tricky then. 
This means we would have to set > > > the timeout to longer than the longest possible memcpy duration to avoid > > > false negatives? That means we'll have to set the timeout to quite long. > > > > FWIW, I had AI churn on trying to make this work on a physical 82576 as > > I have several of these in my local machines as sort of the defacto, > > readily available SR-IOV NIC. The AI got up to 30/35 tests passing but > > is currently stuck that the queues stall in the mix-and-match test when > > it's trying to DMA from an unmapped IOVA. So far none of the in-band > > methods to kick the queues seem to work, I'm not sure if we'll need to > > resort to an FLR. > > > > I'd be happy to send the changes it's made so far if you want to > > validate and incorporate, or have any thoughts to kicking it after the > > IOMMU fault. Some of the changes are related to timeouts, where QEMU > > loopback is actually faster than bare metal since the physical queues > > run at 1Gbps even in loopback mode. > > > > I'll also plant the seed that if we do have outstanding issues for a > > driver that binds to a real world device, but only works on the > > emulated version of that device... how do we handle that? In part, I > > think it's emulated in QEMU because it is so ubiquitous. I'm also > > hoping to use the same device for the new SR-IOV selftests. Thanks, > > > > Alex > > I'm glad you're interested in this as well! > > Unfortunately (and ironically) I don't have access to a physical device. > > Regarding driver support for the real vs. emulated device, I think we > should prioritize supporting the emulated version. This approach > unlocks the ability for anyone to run VFIO/Live Update tests without > needing specific hardware. Once that's done, other folks can add > patches to update the driver if they want to use the physical device. > What do you think? > > I plan to add SR-IOV support to this driver in a separate series after > the driver merges. 
I've got it working now. I'll need to do some cleanup and verification,
but it's not too bad (imo): there are several places where QEMU
emulation only supports one mode and we don't fully configure the
device, or where we don't account for physical hardware limitations or
timing. The most significant change is in resolving the issue above:
once the queue gets wedged from DMA errors, VFIO_DEVICE_RESET
unfortunately seems to be the most straightforward mechanism to get it
unstuck. That may result in some initialization refactoring and
plumbing to restore the interrupt state. Thanks,

Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-14 22:48 ` Alex Williamson @ 2026-05-14 22:58 ` David Matlack 2026-05-14 23:16 ` Josh Hilke 1 sibling, 0 replies; 11+ messages in thread From: David Matlack @ 2026-05-14 22:58 UTC (permalink / raw) To: Alex Williamson Cc: Josh Hilke, Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest On 2026-05-14 04:48 PM, Alex Williamson wrote: > On Thu, 14 May 2026 14:49:28 -0700 > Josh Hilke <jrhilke@google.com> wrote: > > > On Thu, May 14, 2026 at 9:28 AM Alex Williamson <alex@shazbot.org> wrote: > > > > > > On Wed, 13 May 2026 23:33:02 +0000 > > > David Matlack <dmatlack@google.com> wrote: > > > > > > > On 2026-05-13 11:49 AM, Josh Hilke wrote: > > > > > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <jrhilke@google.com> wrote: > > > > > > On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > > > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > > > > > > > > > > + retries = 100; > > > > > > > > + while (retries-- > 0) { > > > > > > > > + if (rx->wb.status_error & 1) > > > > > > > > + break; > > > > > > > > + usleep(10); > > > > > > > > + } > > > > > > > > > > > > > > Why bail after a certain timeout? The test may have kicked off a large > > > > > > > count of memcpys. Is this for error detection? > > > > > > > > > > > > The bailout was intended to detect errors during development. > > > > > > Shouldn't need it anymore. I'll remove it in v2. > > > > > > > > > > Sorry, I forgot: we need the timeout to detect DMA errors for the > > > > > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test > > > > > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU > > > > > aborts the DMA operation. However, the QEMU IGB implementation does > > > > > not set an error bit, so timing out is our only method for error > > > > > detection. > > > > > > > > Hm... that's going to be tricky then. 
This means we would have to set > > > > the timeout to longer than the longest possible memcpy duration to avoid > > > > false negatives? That means we'll have to set the timeout to quite long. > > > > > > FWIW, I had AI churn on trying to make this work on a physical 82576 as > > > I have several of these in my local machines as sort of the defacto, > > > readily available SR-IOV NIC. The AI got up to 30/35 tests passing but > > > is currently stuck that the queues stall in the mix-and-match test when > > > it's trying to DMA from an unmapped IOVA. So far none of the in-band > > > methods to kick the queues seem to work, I'm not sure if we'll need to > > > resort to an FLR. > > > > > > I'd be happy to send the changes it's made so far if you want to > > > validate and incorporate, or have any thoughts to kicking it after the > > > IOMMU fault. Some of the changes are related to timeouts, where QEMU > > > loopback is actually faster than bare metal since the physical queues > > > run at 1Gbps even in loopback mode. > > > > > > I'll also plant the seed that if we do have outstanding issues for a > > > driver that binds to a real world device, but only works on the > > > emulated version of that device... how do we handle that? In part, I > > > think it's emulated in QEMU because it is so ubiquitous. I'm also > > > hoping to use the same device for the new SR-IOV selftests. Thanks, > > > > > > Alex > > > > I'm glad you're interested in this as well! > > > > Unfortunately (and ironically) I don't have access to a physical device. > > > > Regarding driver support for the real vs. emulated device, I think we > > should prioritize supporting the emulated version. This approach > > unlocks the ability for anyone to run VFIO/Live Update tests without > > needing specific hardware. Once that's done, other folks can add > > patches to update the driver if they want to use the physical device. > > What do you think? 
> >
> > I plan to add SR-IOV support to this driver in a separate series after
> > the driver merges.
>
> I've got it working now, I'll need to do some cleanup and verification,
> but it's not too bad (imo), several places where QEMU emulation only
> supports one mode and we don't fully configure the device or don't
> account for physical hardware limitations or timing. The most
> significant change is in resolving the issue above, that once the queue
> gets wedged from DMA errors, VFIO_DEVICE_RESET unfortunately seems to
> be the most straightforward mechanism to get it unstuck. That may
> result in some initialization refactoring and plumbing to restore the
> interrupt state. Thanks,

FWIW I had to do similar for the ioat driver, except it used a
device-specific reset mechanism instead of VFIO_DEVICE_RESET, so I
think this approach should be ok.

static void ioat_handle_error(struct vfio_pci_device *device)
{
	...
	ioat_reset(device);
}

static int ioat_memcpy_wait(struct vfio_pci_device *device)
{
	void *registers = ioat_channel_registers(device);
	u64 status;
	int r = 0;

	/* Wait until all operations complete. */
	for (;;) {
		status = ioat_channel_status(registers);
		...
		if (status == IOAT_CHANSTS_HALTED) {
			ioat_handle_error(device);
			return -1;
		}
	}
	...
}

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device 2026-05-14 22:48 ` Alex Williamson 2026-05-14 22:58 ` David Matlack @ 2026-05-14 23:16 ` Josh Hilke 1 sibling, 0 replies; 11+ messages in thread From: Josh Hilke @ 2026-05-14 23:16 UTC (permalink / raw) To: Alex Williamson Cc: David Matlack, Vipin Sharma, Jason Gunthorpe, Shuah Khan, Tony Nguyen, kvm, linux-kernel, linux-kselftest On Thu, May 14, 2026 at 3:48 PM Alex Williamson <alex@shazbot.org> wrote: > > On Thu, 14 May 2026 14:49:28 -0700 > Josh Hilke <jrhilke@google.com> wrote: > > > On Thu, May 14, 2026 at 9:28 AM Alex Williamson <alex@shazbot.org> wrote: > > > > > > On Wed, 13 May 2026 23:33:02 +0000 > > > David Matlack <dmatlack@google.com> wrote: > > > > > > > On 2026-05-13 11:49 AM, Josh Hilke wrote: > > > > > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <jrhilke@google.com> wrote: > > > > > > On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote: > > > > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > > > > > > > > > > + retries = 100; > > > > > > > > + while (retries-- > 0) { > > > > > > > > + if (rx->wb.status_error & 1) > > > > > > > > + break; > > > > > > > > + usleep(10); > > > > > > > > + } > > > > > > > > > > > > > > Why bail after a certain timeout? The test may have kicked off a large > > > > > > > count of memcpys. Is this for error detection? > > > > > > > > > > > > The bailout was intended to detect errors during development. > > > > > > Shouldn't need it anymore. I'll remove it in v2. > > > > > > > > > > Sorry, I forgot: we need the timeout to detect DMA errors for the > > > > > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test > > > > > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU > > > > > aborts the DMA operation. However, the QEMU IGB implementation does > > > > > not set an error bit, so timing out is our only method for error > > > > > detection. > > > > > > > > Hm... that's going to be tricky then. 
This means we would have to set > > > > the timeout to longer than the longest possible memcpy duration to avoid > > > > false negatives? That means we'll have to set the timeout to quite long. > > > > > > FWIW, I had AI churn on trying to make this work on a physical 82576 as > > > I have several of these in my local machines as sort of the defacto, > > > readily available SR-IOV NIC. The AI got up to 30/35 tests passing but > > > is currently stuck that the queues stall in the mix-and-match test when > > > it's trying to DMA from an unmapped IOVA. So far none of the in-band > > > methods to kick the queues seem to work, I'm not sure if we'll need to > > > resort to an FLR. > > > > > > I'd be happy to send the changes it's made so far if you want to > > > validate and incorporate, or have any thoughts to kicking it after the > > > IOMMU fault. Some of the changes are related to timeouts, where QEMU > > > loopback is actually faster than bare metal since the physical queues > > > run at 1Gbps even in loopback mode. > > > > > > I'll also plant the seed that if we do have outstanding issues for a > > > driver that binds to a real world device, but only works on the > > > emulated version of that device... how do we handle that? In part, I > > > think it's emulated in QEMU because it is so ubiquitous. I'm also > > > hoping to use the same device for the new SR-IOV selftests. Thanks, > > > > > > Alex > > > > I'm glad you're interested in this as well! > > > > Unfortunately (and ironically) I don't have access to a physical device. > > > > Regarding driver support for the real vs. emulated device, I think we > > should prioritize supporting the emulated version. This approach > > unlocks the ability for anyone to run VFIO/Live Update tests without > > needing specific hardware. Once that's done, other folks can add > > patches to update the driver if they want to use the physical device. > > What do you think? 
> > > > I plan to add SR-IOV support to this driver in a separate series after > > the driver merges. > > I've got it working now, I'll need to do some cleanup and verification, > but it's not too bad (imo), several places where QEMU emulation only > supports one mode and we don't fully configure the device or don't > account for physical hardware limitations or timing. The most > significant change is in resolving the issue above, that once the queue > gets wedged from DMA errors, VFIO_DEVICE_RESET unfortunately seems to > be the most straightforward mechanism to get it unstuck. That may > result in some initialization refactoring and plumbing to restore the > interrupt state. Thanks, > > Alex Alex, how do you want to incorporate your changes? Do you want to send out a v2 of this driver? Or should I continue implementing my v2, and then you send a follow-up patch? I'm fine with either option. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] vfio: selftests: Add driver for IGB QEMU device
  2026-05-13  2:12   ` Josh Hilke
  2026-05-13 18:49     ` Josh Hilke
@ 2026-05-13 23:29     ` David Matlack
  1 sibling, 0 replies; 11+ messages in thread
From: David Matlack @ 2026-05-13 23:29 UTC (permalink / raw)
  To: Josh Hilke
  Cc: Alex Williamson, Vipin Sharma, Jason Gunthorpe, Shuah Khan,
	Tony Nguyen, kvm, linux-kernel, linux-kselftest

On 2026-05-12 07:12 PM, Josh Hilke wrote:
> On Mon, May 11, 2026 at 4:45 PM David Matlack <dmatlack@google.com> wrote:
> > On 2026-05-11 09:18 PM, Josh Hilke wrote:
> > > +#define IGB_MAX_CHUNK_SIZE 1024
> > > +#define MSIX_VECTOR 0
> > > +#define RING_SIZE 4096 /* Number of descriptors in ring */
> >
> > What's the maximum ring size that IGB can support? The max chunk size is
> > quite small so using a larger ring will be helpful toward getting the
> > device to do a lot of memcpys for Live Update testing.
>
> I can dig into QEMU code and investigate the max ring size the QEMU
> IGB implementation can handle. 4096 is just the max ring size for the
> real IGB in drivers/net/ethernet/intel/igb/igb.h. What's our target for
> the maximum memcpy size?

I assume you mean memcpy count, not size. Ideally we want to be able to
get the device to do memcpys for multiple minutes to stress test Live
Update. So "as much as possible". There are 2 limits in practice:

 1. Device-specific limits.
 2. VFIO selftests memory overhead (e.g. for the tx/rx rings).

It looks like the driver programs the ring size into the device:

  igb_write32(igb, IGB_TDLEN0, RING_SIZE * sizeof(struct igb_tx_desc));

So I assume we can pick a larger RING_SIZE if we want? What's the
device limit (if any)? It might be just the width of the IGB_TDLEN0
register.

> > > +	rx->read.pkt_addr = curr_dst;
> > > +	rx->read.hdr_addr = 0;
> > > +	rx->wb.status_error = 0;
> >
> > Don't rx->read.hdr_addr and rx->wb.status_error overlap the same u64?
> > It's not a bug since both write 0 but it is weird.
>
> That's correct, they overlap.
> I agree it's a bit weird -- it's a
> hardware optimization to save memory. This layout is also how the real
> descriptor (`e1000_adv_rx_desc`) is structured in
> drivers/net/ethernet/intel/igb/e1000_82575.h. The QEMU IGB
> implementation also assumes this layout for the bits.

Oh, the layout is totally fine, and comes from the device specification
so it's not something we can change. I was more commenting on the code
here that does overlapping writes as being weird.

> > > +
> > > +	igb_irq_clear(igb);
> > > +
> > > +	igb_irq_enable(igb);
> > > +
> > > +	if (rx->wb.status_error & 1)
> > > +		return 0;
> > > +
> > > +	return -ETIMEDOUT;
> >
> > Is there any other way the device signals errors other than failing
> > to set bit 0 of status_error? I'm especially curious how it handles
> > memcpy_from_unmapped_iova in vfio_pci_driver_test.c.
>
> Yea, there are some relevant error bits in status_err, but
> unfortunately QEMU IGB doesn't emulate them:
> - E1000_RXDEXT_STATERR_LB: confirms that the packet actually traversed
>   the loopback path
> - E1000_RXDEXT_STATERR_CE: detects bit flips during transmission
> - E1000_RXDEXT_STATERR_RXE: catch-all error bit for the receive engine
>   (buffer overflows, etc.)
>
> The other error bits (E1000_RXDEXT_STATERR_IPE,
> E1000_RXDEXT_STATERR_TCPE) validate network packet integrity, which
> isn't relevant because the driver doesn't package the memcpy data
> within IP or TCP/UDP packets.
>
> Regarding the memcpy_from_unmapped_iova test, the test only cares that
> the device doesn't trigger an MSI. Normally, the QEMU IGB sends an MSI
> once it sets the descriptor done bit on completion of a memcpy
> operation. To prevent this, the driver disables interrupts before
> starting memcpy, and then re-enables them after the memcpy finishes.

memcpy_from_unmapped_iova triggers the device to do a DMA read from an
unmapped address during rx. I was asking how the device handles that
and communicates the error to the driver.
> > > +#define IGB_TXD_CMD_EOP   0x01 /* End of Packet */
> > > +#define IGB_TXD_CMD_IFCS  0x02 /* Insert FCS */
> > > +#define IGB_TXD_CMD_RS    0x08 /* Report Status */
> > > +#define IGB_TXD_CMD_SHIFT 24   /* Shift for command bits in cmd_type_len */
> > > +#define IGB_TXD_CMD_LEGACY_FORMAT (1 << 20) /* Forces legacy descriptor format in QEMU */
> >
> > IGB_TXD_CMD_LEGACY_FORMAT appears to be unused. I didn't check the rest.
> > Can you prune out the unused macros?
>
> Yep, will do!
> >
> > Related: Are these macros and/or the descriptor struct available in any
> > kernel headers that we could symlink into the VFIO selftests like we do
> > for DSA?
>
> They're in private headers under drivers/net/ethernet/intel/igb. Are
> you ok with linking in private headers? The DSA headers are public.

Yeah, we already symlink private headers into the VFIO selftests
drivers. But the constraint is that those headers have to be bare-bones
enough to compile in the selftests environment, which is different from
the kernel environment.

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2026-05-14 23:16 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2026-05-11 21:18 [PATCH] vfio: selftests: Add driver for IGB QEMU device Josh Hilke
2026-05-11 23:45 ` David Matlack
2026-05-13  2:12   ` Josh Hilke
2026-05-13 18:49     ` Josh Hilke
2026-05-13 23:33       ` David Matlack
2026-05-14 16:28         ` Alex Williamson
2026-05-14 21:49           ` Josh Hilke
2026-05-14 22:48             ` Alex Williamson
2026-05-14 22:58               ` David Matlack
2026-05-14 23:16               ` Josh Hilke
2026-05-13 23:29   ` David Matlack
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox