* [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility
@ 2026-05-15 22:03 Alex Williamson
2026-05-15 22:03 ` [PATCH 1/8] selftests/vfio: igb: Use PHY internal loopback on 82576 Alex Williamson
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
This series is based on Josh Hilke's initial igb selftest driver
posted as:
https://lore.kernel.org/all/20260511211839.2781731-1-jrhilke@google.com/
That posting validated the driver against QEMU's emulated igb only; it
has not been tested on physical 82576 hardware. Real 82576 silicon
rejects several of the shortcuts the submitted driver relies on (MAC
loopback, legacy TX descriptors, read-to-clear EICR, and autoneg-based
link bring-up), and the driver's fault recovery is unbounded. Adding
real-hardware coverage is the goal here.
Two of the changes in patch 1 add accommodations specifically for QEMU's emulated
igb (which does not implement PHY-register-0 bit 14 and does not drive
STATUS.LU from CTRL.SLU) so that Josh's "run selftests without
hardware" workflow continues to work. One of these (RCTL.LBM_MAC)
deviates from datasheet 8.10.1 guidance; empirically the bit has no
observable effect on real 82576 because MAC loopback is not
implemented (3.5.6.2). See patch 1 for both rationales.
1) selftests/vfio: igb: Use PHY internal loopback on 82576
Replace MAC-loopback-via-RCTL.LBM_MAC and PHY autonegotiation with
PHY internal loopback per datasheet 3.5.6.3.1. Force the MAC link
state via CTRL.FRCSPD/FRCDPX/SLU since the descriptor engine
otherwise waits for a real negotiated link. Keep RCTL.LBM_MAC
(deviates from 8.10.1 guidance but empirically inert on real
hardware per 3.5.6.2) and prefix the loopback setup with a
one-shot autoneg-restart PHY write (the next PHY write clears
autoneg-enable before autoneg can start, so this is a no-op on
real silicon); both are required by QEMU's emulation. Drop the
dead igb_read_phy() and its now-unused macros.
2) selftests/vfio: igb: Use advanced TX and RX descriptors
Program SRRCTL.DESCTYPE for advanced one-buffer receive
descriptors (datasheet 7.1.5.2, 8.10.2) and build advanced TX data
descriptors with DEXT/DTYP/IFCS/EOP/PAYLEN (7.2.2.3) rather than
the simplified legacy format the submitted driver used. Drop the
unused legacy TX descriptor macros.
3) selftests/vfio: igb: Program MSI-X interrupt routing
Configure GPIE.Multiple_MSIX and GPIE.EIAME (Table 7-47), EIAC
and EIAM for vector 0 (8.8.5, 8.8.6), and switch EICR clearing
from read-to-clear to write-to-clear (7.3.4.2 / 8.8.5 forbid
reading EICR while EIAC is programmed).
4) selftests/vfio: igb: Extend memcpy completion timeout for line-rate
hardware
The submitted 1 ms cap is well below the 32 ms line-rate floor for
a 4 MB transfer at 1 Gb/s. Bump to ~200 ms (6x margin).
5) selftests/vfio: igb: Disable PCIe completion timeout retries
Clear GCR.Completion_Timeout_Resend (datasheet 8.6.1) so the
intentional unmapped-IOVA tests do not generate an unbounded
stream of retried reads on real hardware.
6) selftests/vfio: Add vfio_pci_irq_reenable() helper
New libvfio helper that re-issues VFIO_DEVICE_SET_IRQS against
existing eventfds, for drivers that recover from
VFIO_DEVICE_RESET without disturbing user-side eventfds (and any
fd a test fixture may have cached).
7) selftests/vfio: igb: Factor hardware programming into igb_hw_init()
Pure refactor splitting igb_init() into a one-shot outer
(region-size check, BAR map, CTRL.RST, IMC, vfio_pci_msix_enable)
and a reusable inner that programs the registers CTRL.RST clears.
8) selftests/vfio: igb: Recover after DMA-read faults
Add igb_error_reset_and_reinit() and call it from
igb_memcpy_wait() on completion timeout. Datasheet 4.2.1.6.1
describes CTRL.RST as the recovery mechanism, but empirically
CTRL.RST alone leaves the descriptor engine wedged after a
DMA-read fault; the 82576 advertises PCIe FLR (datasheet
4.2.1.5.1) and VFIO_DEVICE_RESET drives it, so the recovery path
resets through VFIO_DEVICE_RESET before reinitializing.
Testing:
- Selftest builds clean at every commit (verified bisect-buildable).
- QEMU emulated igb via vng on a host kernel built with VFIO and
Intel IOMMU enabled: vfio_pci_driver_test 35/35 pass across all
four IOMMU mode permutations.
- Physical 82576 on Intel Alder Lake platform: vfio_pci_driver_test
35/35 pass across all four IOMMU mode permutations.
Assisted:
- Series developed primarily with Claude Opus 4.7 with additional
assistance from GPT 5.5. Spec references spot-checked and
cross-checked against multiple models.
Alex Williamson (8):
selftests/vfio: igb: Use PHY internal loopback on 82576
selftests/vfio: igb: Use advanced TX and RX descriptors
selftests/vfio: igb: Program MSI-X interrupt routing
selftests/vfio: igb: Extend memcpy completion timeout for line-rate hardware
selftests/vfio: igb: Disable PCIe completion timeout retries
selftests/vfio: Add vfio_pci_irq_reenable() helper
selftests/vfio: igb: Factor hardware programming into igb_hw_init()
selftests/vfio: igb: Recover after DMA-read faults
.../selftests/vfio/lib/drivers/igb/igb.c | 318 +++++++++++++-----
.../vfio/lib/drivers/igb/registers.h | 58 +++-
.../lib/include/libvfio/vfio_pci_device.h | 2 +
.../selftests/vfio/lib/vfio_pci_device.c | 22 ++
4 files changed, 302 insertions(+), 98 deletions(-)
--
2.51.0
* [PATCH 1/8] selftests/vfio: igb: Use PHY internal loopback on 82576
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
2026-05-15 22:03 ` [PATCH 2/8] selftests/vfio: igb: Use advanced TX and RX descriptors Alex Williamson
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
The submitted driver waits for PHY autonegotiation and then enables MAC
loopback via RCTL.LBM_MAC. QEMU's emulated igb tolerates this, but the
82576 datasheet rejects the MAC loopback path on real hardware:
Section 3.5.6.1 (Loopback Support / General): "Use PHY Loopback
instead of MAC Loopback on the 82576."
Section 3.5.6.2 (MAC Loopback): "MAC Loopback is not used on this
device."
Section 3.5.6.3.1 (Setting the 82576 to PHY loopback Mode): set PHY
control register bits 8 (duplex), 14 (loopback), clear bit 12
(autoneg enable), set the speed via bits 6 and 13. For 1Gb/s the
register value is 0x4140.
Section 8.10.1 (RCTL register): "When using the internal PHY, LBM
should remain set to 00b and the PHY instead configured for
loopback through the MDIO interface."
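As a sanity check on the register value quoted from section 3.5.6.3.1, the bits can be composed in C. This is a minimal standalone sketch: the PHY_CTRL_* names are illustrative stand-ins chosen for this example (the patch itself adds IGB_PHY_CTRL_* macros in registers.h), and the bit positions are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; bit positions as quoted from datasheet 3.5.6.3.1. */
#define PHY_CTRL_SPEED_MSB  (1u << 6)   /* set, with bit 13 clear: 1Gb/s */
#define PHY_CTRL_DUPLEX     (1u << 8)   /* full duplex */
#define PHY_CTRL_AN_ENABLE  (1u << 12)  /* autoneg enable, must stay clear */
#define PHY_CTRL_SPEED_LSB  (1u << 13)  /* clear for 1Gb/s */
#define PHY_CTRL_LOOPBACK   (1u << 14)  /* PHY internal loopback */

/* PHY register 0 value for 1Gb/s full-duplex internal loopback. */
static inline uint16_t phy_ctrl_loopback_1000fd(void)
{
	return (uint16_t)(PHY_CTRL_LOOPBACK | PHY_CTRL_DUPLEX |
			  PHY_CTRL_SPEED_MSB);
}

/* Compile-time check against the value quoted from the datasheet. */
static_assert((PHY_CTRL_LOOPBACK | PHY_CTRL_DUPLEX | PHY_CTRL_SPEED_MSB) ==
	      0x4140, "1Gb/s full-duplex loopback value");
```

Because autoneg-enable (bit 12) is absent from the composed value, writing it also turns autonegotiation off, which is what the datasheet sequence asks for.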
Replace igb_phy_setup_autoneg() with igb_setup_loopback() which:
- writes PHY register 0 with LOOPBACK | SPEED_1000 | FULL_DUPLEX
- forces the MAC into 1Gb/s full duplex via CTRL.FRCSPD, CTRL.FRCDPX,
CTRL.SPD_1000, CTRL.FD, CTRL.SLU; without forcing the MAC link
state, the descriptor engine does not run on real hardware
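The forced-link step listed above is a read-modify-write of CTRL. The sketch below models only the bit arithmetic on a plain integer, using the bit positions this patch adds to registers.h; the CTRL_* names and the ctrl_force_1000fd() helper are hypothetical, and no MMIO access is performed.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as defined by this patch in registers.h (names shortened). */
#define CTRL_FD        (1u << 0)   /* Full Duplex */
#define CTRL_SLU       (1u << 6)   /* Set Link Up */
#define CTRL_SPD_SEL   (3u << 8)   /* Speed Select mask */
#define CTRL_SPD_1000  (2u << 8)   /* Force 1000 Mb/s */
#define CTRL_FRCSPD    (1u << 11)  /* Force Speed */
#define CTRL_FRCDPX    (1u << 12)  /* Force Duplex */

/* The speed encoding must lie entirely inside the SPD_SEL field. */
static_assert((CTRL_SPD_1000 & ~CTRL_SPD_SEL) == 0, "speed fits in SPD_SEL");

/* Hypothetical helper: returns CTRL with the link forced to 1Gb/s FD. */
static inline uint32_t ctrl_force_1000fd(uint32_t ctrl)
{
	ctrl &= ~CTRL_SPD_SEL;  /* drop any previous speed selection */
	ctrl |= CTRL_FRCSPD | CTRL_FRCDPX | CTRL_SPD_1000 | CTRL_FD | CTRL_SLU;
	return ctrl;
}
```

Clearing SPD_SEL before OR-ing in the new speed matters when the field previously held a different value; unrelated CTRL bits pass through unchanged.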
PHY internal loopback (section 3.5.6.3) wraps data at the end of the
PHY datapath before the MDI, so the physical link state and cable
speed are irrelevant. This matches the kernel ethtool selftest path
in igb_integrated_phy_loopback()
(drivers/net/ethernet/intel/igb/igb_ethtool.c).
QEMU's igb emulation drives STATUS.LU exclusively from its autoneg-done
timer or a network-backend link-state change; the guest cannot set
STATUS.LU through CTRL.SLU. Its receive path checks STATUS.LU
(e1000x_hw_rx_enabled in hw/net/e1000x_common.c) and drops every
loopback frame until LU is set. Issue a one-shot autoneg-restart PHY
write at the top of igb_setup_loopback() to kick the timer; the
subsequent PHY write clears autoneg-enable, so on real hardware
autoneg never starts and the write is a no-op.
QEMU's igb also does not honor PHY register 0 bit 14 (PHY internal
loopback) and relies on RCTL.LBM_MAC to wrap TX descriptors back to
the RX queue. Datasheet 8.10.1 advises that LBM remain 00b when
using the internal PHY, but empirically setting LBM_MAC has no
observable effect on real 82576 (MAC loopback is not implemented per
3.5.6.2), so set it alongside PHY loopback as the actual loopback
mechanism under QEMU. With these two QEMU-only accommodations the
selftest works in both environments without environment-specific code
paths.
The submitted driver also wrote bit 14 (link disable) to PHY register
16 (port control). Datasheet 3.5.6.3.1 describes this as required for
10/100Mb/s but explicitly "not a must for 1G", and the kernel ethtool
selftest omits it. It is dropped here. Flagging the removal in case a
future regression bisects to this commit.
Remove igb_read_phy() and the PHY status macros it served, which become
unused: IGB_PHY_STATUS_REG_OFFSET, IGB_PHY_STATUS_AN_COMP. Keep
IGB_PHY_CTRL_AN_ENABLE and IGB_PHY_CTRL_AN_RESTART for the QEMU-only
autoneg kick; add the PHY ctrl loopback and CTRL force-link macros.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
.../selftests/vfio/lib/drivers/igb/igb.c | 126 ++++++++++--------
.../vfio/lib/drivers/igb/registers.h | 26 ++--
2 files changed, 89 insertions(+), 63 deletions(-)
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
index 13c8429784ac..ce2e2c90315e 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -98,61 +98,71 @@ static int igb_write_phy(struct igb *igb, uint32_t offset, uint16_t data)
return 0;
}
-static int igb_read_phy(struct igb *igb, uint32_t offset, uint16_t *data)
+/*
+ * Configure the device for PHY internal loopback per 82576 datasheet
+ * section 3.5.6.3.1. Force the PHY to 1Gb/s full duplex with loopback
+ * enabled, then force the MAC link state to match. Internal loopback
+ * wraps data at the end of the PHY datapath (section 3.5.6.3), so the
+ * physical link state is irrelevant.
+ *
+ * Section 3.5.6.1 directs to "Use PHY Loopback instead of MAC Loopback
+ * on the 82576", and section 3.5.6.2 states "MAC Loopback is not used
+ * on this device." RCTL.LBM_MAC is still set elsewhere as a QEMU-only
+ * accommodation; see the RCTL programming in the caller for the
+ * rationale.
+ */
+static int igb_setup_loopback(struct igb *igb)
{
- uint32_t mdic;
- int i;
-
- mdic = ((offset << IGB_MDIC_REG_SHIFT) |
- (1 << IGB_MDIC_PHY_SHIFT) |
- IGB_MDIC_OP_READ);
-
- igb_write32(igb, IGB_MDIC, mdic);
-
- for (i = 0; i < 1000; i++) {
- usleep(50);
- mdic = igb_read32(igb, IGB_MDIC);
- if (mdic & IGB_MDIC_READY)
- break;
- }
-
- if (!(mdic & IGB_MDIC_READY))
- return -1;
-
- if (mdic & IGB_MDIC_ERROR)
- return -1;
-
- *data = (uint16_t)mdic;
- return 0;
-}
-
-static int igb_phy_setup_autoneg(struct igb *igb)
-{
- int timeout_ms = 1000;
- bool success = false;
- uint16_t phy_status;
+ uint32_t ctrl;
int ret;
- int i;
- /* Trigger auto-negotiation */
+ /*
+ * Kick the autoneg machinery solely to bring STATUS.LU up under
+ * QEMU's igb emulation: QEMU only updates STATUS.LU via its
+ * autoneg-done timer, and without LU set its receive path
+ * (e1000x_hw_rx_enabled) drops every loopback frame. On real
+ * hardware autoneg cannot complete before the next PHY write
+ * below clears the autoneg-enable bit, so this is effectively a
+ * no-op there.
+ */
+ (void)igb_write_phy(igb, IGB_PHY_CTRL_REG_OFFSET,
+ IGB_PHY_CTRL_AN_ENABLE | IGB_PHY_CTRL_AN_RESTART);
+
+ /* PHY control: loopback + 1Gb/s full duplex, autoneg disabled. */
ret = igb_write_phy(igb, IGB_PHY_CTRL_REG_OFFSET,
- IGB_PHY_CTRL_AN_ENABLE | IGB_PHY_CTRL_RESTART_AN);
+ IGB_PHY_CTRL_LOOPBACK |
+ IGB_PHY_CTRL_SPEED_1000 |
+ IGB_PHY_CTRL_FULL_DUPLEX);
if (ret)
return ret;
- for (i = 0; i < timeout_ms; i++) {
- if (igb_read_phy(igb, IGB_PHY_STATUS_REG_OFFSET, &phy_status) == 0) {
- success = !!(phy_status & IGB_PHY_STATUS_AN_COMP);
- if (success)
- break;
- }
- usleep(1000);
- }
-
- if (!success) {
- printf("igb: Auto-negotiation did not complete in time\n");
- return -ETIMEDOUT;
- }
+ /*
+ * Brief delay before forcing the MAC, mirroring the kernel ethtool
+ * selftest in igb_integrated_phy_loopback(). Not specified by the
+ * datasheet, but empirically required by the kernel driver.
+ */
+ usleep(50000);
+
+ /*
+ * Force the MAC to 1Gb/s full duplex with link up. Without forcing
+ * the link state the descriptor engine does not run, since the chip
+ * normally waits for a real negotiated link.
+ */
+ ctrl = igb_read32(igb, IGB_CTRL);
+ ctrl &= ~IGB_CTRL_SPD_SEL;
+ ctrl |= IGB_CTRL_FRCSPD |
+ IGB_CTRL_FRCDPX |
+ IGB_CTRL_SPD_1000 |
+ IGB_CTRL_FD |
+ IGB_CTRL_SLU;
+ igb_write32(igb, IGB_CTRL, ctrl);
+
+ /*
+ * Settling delay matching the kernel ethtool selftest's msleep(500)
+ * at the tail of igb_integrated_phy_loopback(). Not specified by
+ * the datasheet; empirical, and inherited from the kernel driver.
+ */
+ usleep(500000);
return 0;
}
@@ -203,8 +213,8 @@ static void igb_init(struct vfio_pci_device *device)
vfio_pci_config_writew(device, PCI_COMMAND, cmd_reg);
}
- /* Trigger autonegotiation. This enables IGB to transmit data. */
- if (igb_phy_setup_autoneg(igb))
+ /* Configure PHY internal loopback for testing. */
+ if (igb_setup_loopback(igb))
return;
/* Configure TX and RX descriptor rings */
@@ -231,10 +241,22 @@ static void igb_init(struct vfio_pci_device *device)
usleep(10);
}
- /* Enable Receiver and Transmitter */
+ /*
+ * Enable Receiver and Transmitter. RCTL.LBM_MAC is set in addition
+ * to PHY loopback as a QEMU-only accommodation: QEMU's emulated igb
+ * does not honor PHY register 0 bit 14 (PHY internal loopback) and
+ * relies on RCTL.LBM_MAC to wrap TX descriptors back to the RX
+ * queue. Datasheet 8.10.1 (RCTL register) advises "When using the
+ * internal PHY, LBM should remain set to 00b", so setting LBM_MAC
+ * here deviates from datasheet guidance; empirically the bit has
+ * no observable effect on real 82576 hardware because MAC loopback
+ * is not implemented (datasheet 3.5.6.2). Setting both lets the
+ * selftest work on both real hardware and QEMU without conditional
+ * code paths.
+ */
rctl = IGB_RCTL_EN | /* Receiver Enable */
IGB_RCTL_UPE | /* Unicast Promiscuous (for dummy MAC) */
- IGB_RCTL_LBM_MAC | /* MAC Loopback Mode */
+ IGB_RCTL_LBM_MAC | /* MAC Loopback - for QEMU emulation only */
IGB_RCTL_SECRC; /* Strip CRC (needed for memcmp) */
igb_write32(igb, IGB_RCTL, rctl);
igb_write32(igb, IGB_TCTL, IGB_TCTL_EN);
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
index d2d402c9bd57..c00b5ae83ccb 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
@@ -36,7 +36,13 @@
/* Control Bit Definitions */
/* CTRL */
-#define IGB_CTRL_RST BIT(26) /* Device Reset */
+#define IGB_CTRL_FD BIT(0) /* Full Duplex */
+#define IGB_CTRL_SLU BIT(6) /* Set Link Up */
+#define IGB_CTRL_SPD_SEL (3 << 8) /* Speed Select Mask */
+#define IGB_CTRL_SPD_1000 (2 << 8) /* Force 1000 Mb/s */
+#define IGB_CTRL_FRCSPD BIT(11) /* Force Speed */
+#define IGB_CTRL_FRCDPX BIT(12) /* Force Duplex */
+#define IGB_CTRL_RST BIT(26) /* Device Reset */
#define IGB_CTRL_EXT_LINK_MODE_MASK (3 << 22)
/* CTRL_EXT */
@@ -46,7 +52,7 @@
#define IGB_RCTL_EN BIT(1) /* Receiver Enable */
#define IGB_RCTL_UPE BIT(3) /* Unicast Promiscuous Enabled */
#define IGB_RCTL_LPE BIT(5) /* Long Packet Reception Enable */
-#define IGB_RCTL_LBM_MAC BIT(6) /* Loopback Mode - MAC */
+#define IGB_RCTL_LBM_MAC BIT(6) /* Loopback Mode - MAC (set as QEMU-only accommodation) */
#define IGB_RCTL_SECRC BIT(26) /* Strip Ethernet CRC */
/* TCTL */
@@ -83,15 +89,13 @@
#define IGB_MDIC_READY BIT(28) /* MDI Data Ready */
#define IGB_MDIC_ERROR BIT(29) /* MDI Error */
-#define IGB_PHY_CTRL_REG_OFFSET 0
-#define IGB_PHY_STATUS_REG_OFFSET 1
-#define IGB_PHY_STATUS_AN_COMP 0x0020 /* Auto-negotiation complete */
-#define IGB_PHY_CTRL_AN_ENABLE 0x1000 /* Auto-Negotiation Enable */
-#define IGB_PHY_CTRL_RESTART_AN 0x0200 /* Restart Auto-Negotiation */
-
-#define IGB_PHY_AUTONEG_ADV_REG_OFFSET 4 /* Auto-Negotiation Advertisement */
-#define IGB_PHY_1000T_CTRL_REG_OFFSET 9 /* 1000Base-T Control */
-#define IGB_CR_1000T_FD_CAPS 0x0200 /* Advertise 1000 Mbps Full Duplex */
+/* PHY register 0 (Control), per 82576 datasheet section 3.5.6.3.1 */
+#define IGB_PHY_CTRL_REG_OFFSET 0
+#define IGB_PHY_CTRL_AN_RESTART 0x0200 /* bit 9 */
+#define IGB_PHY_CTRL_AN_ENABLE 0x1000 /* bit 12 */
+#define IGB_PHY_CTRL_SPEED_1000 0x0040 /* bit 6 set, bit 13 clear */
+#define IGB_PHY_CTRL_FULL_DUPLEX 0x0100 /* bit 8 */
+#define IGB_PHY_CTRL_LOOPBACK 0x4000 /* bit 14 */
#define IGB_GPIE_EIAME 0x10 /* Extended Interrupt Auto Mask Enable */
#define IGB_IVAR_VALID 0x80 /* Valid bit for IVAR register */
--
2.51.0
* [PATCH 2/8] selftests/vfio: igb: Use advanced TX and RX descriptors
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
2026-05-15 22:03 ` [PATCH 1/8] selftests/vfio: igb: Use PHY internal loopback on 82576 Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
2026-05-15 22:03 ` [PATCH 3/8] selftests/vfio: igb: Program MSI-X interrupt routing Alex Williamson
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
The submitted driver builds a partial legacy TX descriptor (just
DTALEN | CMD_EOP) and never programs SRRCTL.DESCTYPE. QEMU's emulated
igb tolerates this by treating descriptors as advanced regardless of
DESCTYPE, but real 82576 hardware does not.
For receive, 82576 datasheet section 7.1.5.2 states: "SRRCTL[n].DESCTYPE
must be set to a value other than 000b for the 82576 to write back the
special descriptors." struct igb_rx_desc matches the advanced
one-buffer writeback layout, so the test polls rx.wb.status_error,
which is only written in that layout. Section 8.10.2 places DESCTYPE
in SRRCTL bits 27:25; program it with 001b (advanced one-buffer).
For transmit, datasheet section 7.2.2.3 describes the advanced data
descriptor with DEXT (DCMD bit 5) marking the descriptor as advanced,
DTYP=0011b selecting the data descriptor, IFCS (DCMD bit 1) asking the
MAC to append the Ethernet FCS (without it the frame is dropped as
malformed), EOP (DCMD bit 0) marking end of packet, and PAYLEN in
olinfo_status[31:14] carrying the total payload size. Build this
descriptor in igb_memcpy_start().
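To make the field layout concrete, the two 32-bit words of the advanced data descriptor can be computed as below. This is an illustrative sketch rather than the driver code: struct adv_tx_words and adv_tx_build() are hypothetical names, and the bit positions are the ones this patch defines in registers.h.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions as defined by this patch (names shortened). */
#define ADVTXD_DTYP_DATA     (0x3u << 20)  /* DTYP=0011b: advanced data */
#define ADVTXD_DCMD_EOP      (1u << 24)    /* DCMD bit 0: End of Packet */
#define ADVTXD_DCMD_IFCS     (1u << 25)    /* DCMD bit 1: Insert FCS */
#define ADVTXD_DCMD_DEXT     (1u << 29)    /* DCMD bit 5: advanced format */
#define ADVTXD_PAYLEN_SHIFT  14            /* PAYLEN in olinfo_status[31:14] */

/* Hypothetical container for the two command words of one descriptor. */
struct adv_tx_words {
	uint32_t cmd_type_len;
	uint32_t olinfo_status;
};

/* Build the words for a single-buffer packet of len bytes. */
static inline struct adv_tx_words adv_tx_build(uint32_t len)
{
	struct adv_tx_words w;

	assert(len <= 0xFFFF);  /* DTALEN occupies bits 15:0 */
	w.cmd_type_len = len | ADVTXD_DTYP_DATA | ADVTXD_DCMD_DEXT |
			 ADVTXD_DCMD_IFCS | ADVTXD_DCMD_EOP;
	w.olinfo_status = len << ADVTXD_PAYLEN_SHIFT;
	return w;
}
```

For a 1024-byte chunk this yields cmd_type_len = 0x23300400 and olinfo_status = 0x01000000. With one buffer per packet, DTALEN and PAYLEN carry the same length.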
Remove the legacy CMD macros (IGB_TXD_CMD_EOP, IGB_TXD_CMD_IFCS,
IGB_TXD_CMD_RS, IGB_TXD_CMD_SHIFT, IGB_TXD_CMD_LEGACY_FORMAT) that
become unused, and add the SRRCTL register offset, DESCTYPE encoding,
and advanced TX descriptor field macros.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
.../selftests/vfio/lib/drivers/igb/igb.c | 36 ++++++++++++++++---
.../vfio/lib/drivers/igb/registers.h | 21 ++++++++---
2 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
index ce2e2c90315e..594e51ba29f5 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -230,6 +230,17 @@ static void igb_init(struct vfio_pci_device *device)
igb_write32(igb, IGB_RDLEN0, RING_SIZE * sizeof(struct igb_rx_desc));
igb_write32(igb, IGB_RDH0, 0);
igb_write32(igb, IGB_RDT0, 0);
+
+ /*
+ * Select the advanced one-buffer descriptor format. Per 82576
+ * datasheet section 7.1.5.2: "SRRCTL[n].DESCTYPE must be set to a
+ * value other than 000b for the 82576 to write back the special
+ * descriptors." struct igb_rx_desc matches the advanced one-buffer
+ * writeback layout (section 7.1.5.2), so polling rx.wb.status_error
+ * requires this format. Section 8.10.2 specifies DESCTYPE[27:25].
+ */
+ igb_write32(igb, IGB_SRRCTL0, IGB_SRRCTL_DESCTYPE_ADV_ONEBUF);
+
igb_write32(igb, IGB_RXDCTL0, IGB_RXDCTL0_Q_EN);
/* Wait for TX and RX queues to be enabled */
@@ -339,11 +350,26 @@ static void igb_memcpy_start(struct vfio_pci_device *device, iova_t src,
rx->wb.status_error = 0;
tx->read.buffer_addr = curr_src;
- tx->read.cmd_type_len = (uint32_t)chunk_size;
- tx->read.cmd_type_len |= (uint32_t)(IGB_TXD_CMD_EOP) << IGB_TXD_CMD_SHIFT;
-
- /* Set to 0 to disable offloads and avoid needing a context descriptor */
- tx->read.olinfo_status = 0;
+ /*
+ * Build an advanced data descriptor per 82576 datasheet
+ * section 7.2.2.3. DEXT marks the descriptor as advanced
+ * (required by hardware); DTYP=data selects the data
+ * descriptor; IFCS asks the MAC to append the Ethernet
+ * FCS (without it the frame is dropped as malformed);
+ * EOP marks end of packet. DTALEN is the buffer length
+ * in bits 15:0 of cmd_type_len.
+ */
+ tx->read.cmd_type_len = (uint32_t)chunk_size |
+ IGB_ADVTXD_DTYP_DATA |
+ IGB_ADVTXD_DCMD_DEXT |
+ IGB_ADVTXD_DCMD_IFCS |
+ IGB_ADVTXD_DCMD_EOP;
+ /*
+ * PAYLEN (section 7.2.2.3.11) is the total payload size
+ * in olinfo_status[31:14].
+ */
+ tx->read.olinfo_status =
+ (uint32_t)chunk_size << IGB_ADVTXD_PAYLEN_SHIFT;
curr_src += chunk_size;
curr_dst += chunk_size;
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
index c00b5ae83ccb..c44788642522 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
@@ -22,10 +22,14 @@
#define IGB_RDBAL0 0x0C000 /* Rx Desc Base Address Low */
#define IGB_RDBAH0 0x0C004 /* Rx Desc Base Address High */
#define IGB_RDLEN0 0x0C008 /* Rx Desc Length */
+#define IGB_SRRCTL0 0x0C00C /* Split and Replication Receive Control Q0 */
#define IGB_RDH0 0x0C010 /* Rx Desc Head */
#define IGB_RDT0 0x0C018 /* Rx Desc Tail */
#define IGB_RXDCTL0 0x0C028 /* Rx Desc Control */
+/* SRRCTL fields per 82576 datasheet section 8.10.2 */
+#define IGB_SRRCTL_DESCTYPE_ADV_ONEBUF (1u << 25) /* 001b: advanced one-buffer */
+
/* Tx Ring 0 Registers */
#define IGB_TDBAL0 0x0E000 /* Tx Desc Base Address Low */
#define IGB_TDBAH0 0x0E004 /* Tx Desc Base Address High */
@@ -100,10 +104,17 @@
#define IGB_GPIE_EIAME 0x10 /* Extended Interrupt Auto Mask Enable */
#define IGB_IVAR_VALID 0x80 /* Valid bit for IVAR register */
-#define IGB_TXD_CMD_EOP 0x01 /* End of Packet */
-#define IGB_TXD_CMD_IFCS 0x02 /* Insert FCS */
-#define IGB_TXD_CMD_RS 0x08 /* Report Status */
-#define IGB_TXD_CMD_SHIFT 24 /* Shift for command bits in cmd_type_len */
-#define IGB_TXD_CMD_LEGACY_FORMAT BIT(20) /* Forces legacy descriptor format in QEMU */
+/*
+ * Advanced TX Data Descriptor fields per 82576 datasheet section 7.2.2.3.
+ * The cmd_type_len word holds: DTALEN[15:0], MAC[19:18], DTYP[23:20],
+ * DCMD[31:24]. The olinfo_status word holds: STA[3:0], IDX[6:4],
+ * POPTS[13:8], PAYLEN[31:14].
+ */
+#define IGB_ADVTXD_DTYP_DATA (0x3u << 20) /* DTYP=0011b: advanced data */
+#define IGB_ADVTXD_DCMD_EOP (1u << 24) /* DCMD bit 0: End of Packet */
+#define IGB_ADVTXD_DCMD_IFCS (1u << 25) /* DCMD bit 1: Insert FCS */
+#define IGB_ADVTXD_DCMD_RS (1u << 27) /* DCMD bit 3: Report Status */
+#define IGB_ADVTXD_DCMD_DEXT (1u << 29) /* DCMD bit 5: 1b for advanced */
+#define IGB_ADVTXD_PAYLEN_SHIFT 14 /* PAYLEN bit position */
#endif /* _IGB_REGISTERS_H_ */
--
2.51.0
* [PATCH 3/8] selftests/vfio: igb: Program MSI-X interrupt routing
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
2026-05-15 22:03 ` [PATCH 1/8] selftests/vfio: igb: Use PHY internal loopback on 82576 Alex Williamson
2026-05-15 22:03 ` [PATCH 2/8] selftests/vfio: igb: Use advanced TX and RX descriptors Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
2026-05-15 22:03 ` [PATCH 4/8] selftests/vfio: igb: Extend memcpy completion timeout for line-rate hardware Alex Williamson
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
The submitted driver writes only GPIE.EIAME (with a register value of
0x10, which is actually GPIE.Multiple_MSIX, bit 4) and clears EICR by
reading it. On QEMU this works because the emulated loopback path is
synchronous and EICR is implemented as read-to-clear unconditionally.
Real 82576 hardware needs the full MSI-X programming sequence.
Per 82576 datasheet section 7.3.2.11 Table 7-47, MSI-X mode requires:
GPIE.Multiple_MSIX (bit 4): route causes through IVAR.
GPIE.EIAME (bit 30): apply EIAM on MSI-X assertion. Without EIAME,
section 7.3.2.11 specifies EIAM only takes effect on EICR
read/write, which is not the path used here.
Configure auto-clear and auto-mask for vector 0:
EIAC (section 8.8.5): auto-clear of EICR cause bit on MSI-X assertion.
EIAM (section 8.8.6): with EIAME set, auto-mask of EIMS on MSI-X
assertion. This guarantees one interrupt per memcpy batch and
prevents repeat delivery if the cause re-asserts before EIMS is
restored.
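The register values implied by the description above can be written out as follows. This is a sketch for checking the arithmetic only: struct msix_regs and msix_vec0_values() are hypothetical, and the bit positions are those cited from Table 7-47 and used by this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as cited above (names shortened for the example). */
#define GPIE_MULTIPLE_MSIX  (1u << 4)   /* route causes through IVAR */
#define GPIE_EIAME          (1u << 30)  /* apply EIAM on MSI-X assertion */
#define EICR_VEC0           (1u << 0)   /* cause/vector 0 */

/* The old macro value 0x10 is bit 4, i.e. Multiple_MSIX, not EIAME. */
static_assert(GPIE_MULTIPLE_MSIX == 0x10, "0x10 selects Multiple_MSIX");

/* Hypothetical snapshot of the values written during MSI-X setup. */
struct msix_regs {
	uint32_t gpie;
	uint32_t eiac;
	uint32_t eiam;
	uint32_t eims;
};

static inline struct msix_regs msix_vec0_values(void)
{
	struct msix_regs r = {
		.gpie = GPIE_MULTIPLE_MSIX | GPIE_EIAME,
		.eiac = EICR_VEC0,  /* auto-clear cause 0 */
		.eiam = EICR_VEC0,  /* auto-mask vector 0 */
		.eims = EICR_VEC0,  /* unmask vector 0 */
	};
	return r;
}
```

The resulting GPIE value is 0x40000010, compared with the 0x10 the submitted driver wrote.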
Replace the read-to-clear of EICR with write-to-clear. Section 8.8.5
states "If any bits are set in EIAC, the EICR register should not be
read", and section 7.3.4.3 cautions against read-to-clear in MSI-X
mode in general. Write-to-clear (section 7.3.4.2) is unconditional.
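For readers unfamiliar with write-to-clear registers, the semantics can be modeled in a few lines. This is a toy software model on a plain integer, not device code; w1c_write() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of a write-1-to-clear register: each bit set in val clears
 * the matching bit in reg, and bits written as 0 are left untouched.
 */
static inline uint32_t w1c_write(uint32_t reg, uint32_t val)
{
	return reg & ~val;
}

/* Writing 1 to bit 0 of 0b101 leaves only bit 2 set. */
static_assert((0x5u & ~0x1u) == 0x4u, "only the written bit is cleared");
```

Writing 0xFFFFFFFF, as igb_irq_clear() does in this patch, clears every pending cause without a read of the register.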
Replace the magic '1' values written to EIMS/EIMC with IGB_EICR_VEC0,
add the GPIE/EIAC/EIAM macros, and drop the wrong-valued IGB_GPIE_EIAME
macro (the new definition lives next to IGB_GPIE_MULTIPLE_MSIX).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
.../selftests/vfio/lib/drivers/igb/igb.c | 40 ++++++++++++++++---
.../vfio/lib/drivers/igb/registers.h | 9 ++++-
2 files changed, 41 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
index 594e51ba29f5..d44a08a36171 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -275,11 +275,32 @@ static void igb_init(struct vfio_pci_device *device)
/* Enable MSI-X with 1 vector for the test */
vfio_pci_msix_enable(device, MSIX_VECTOR, 1);
- /* Enable auto-masking of interrupts to avoid storms without a real ISR */
- igb_write32(igb, IGB_GPIE, IGB_GPIE_EIAME);
+ /*
+ * Program MSI-X interrupt routing per 82576 datasheet:
+ *
+ * GPIE (section 7.3.2.11, Table 7-47): set Multiple_MSIX (bit 4) to
+ * route interrupt causes through IVAR mapping, and EIAME (bit 30)
+ * to apply EIAM on MSI-X assertion (without EIAME, EIAM only
+ * applies on EICR read/write).
+ *
+ * EIAC (section 8.8.5): enable auto-clear of EICR for vector 0.
+ * Without auto-clear the cause stays set after delivery and the
+ * test can see spurious interrupts on the next memcpy batch.
+ *
+ * EIAM (section 8.8.6): enable auto-mask of EIMS for vector 0 on
+ * MSI-X assertion (effective because EIAME is set), so a single
+ * interrupt is delivered per memcpy batch even if the cause
+ * re-asserts before software re-enables the mask.
+ *
+ * IVAR (section 7.3.1.2, register definition in 8.8.13): map RX
+ * cause 0 to MSI-X vector 0 and mark the entry valid.
+ */
+ igb_write32(igb, IGB_GPIE, IGB_GPIE_MULTIPLE_MSIX | IGB_GPIE_EIAME);
+ igb_write32(igb, IGB_EIAC, IGB_EICR_VEC0);
+ igb_write32(igb, IGB_EIAM, IGB_EICR_VEC0);
/* Enable interrupts on vector 0 */
- igb_write32(igb, IGB_EIMS, 1);
+ igb_write32(igb, IGB_EIMS, IGB_EICR_VEC0);
/* Map vector 0 to interrupt cause 0 and mark it valid */
igb_write32(igb, IGB_IVAR0, IGB_IVAR_VALID);
@@ -305,17 +326,24 @@ static void igb_remove(struct vfio_pci_device *device)
static void igb_irq_disable(struct igb *igb)
{
- igb_write32(igb, IGB_EIMC, 1);
+ igb_write32(igb, IGB_EIMC, IGB_EICR_VEC0);
}
static void igb_irq_enable(struct igb *igb)
{
- igb_write32(igb, IGB_EIMS, 1);
+ igb_write32(igb, IGB_EIMS, IGB_EICR_VEC0);
}
static void igb_irq_clear(struct igb *igb)
{
- igb_read32(igb, IGB_EICR);
+ /*
+ * Use write-to-clear (datasheet 7.3.4.2). In MSI-X mode with EIAC
+ * programmed, section 8.8.5 explicitly states "If any bits are set
+ * in EIAC, the EICR register should not be read", which rules out
+ * the read-to-clear path in 7.3.4.3. Bits not in EIAC are still
+ * cleared by writing 1.
+ */
+ igb_write32(igb, IGB_EICR, 0xFFFFFFFF);
}
static void igb_memcpy_start(struct vfio_pci_device *device, iova_t src,
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
index c44788642522..139f1c2e6fdd 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
@@ -78,12 +78,18 @@
#define IGB_VMOLR_BAM 0x08000000 /* Broadcast Accept Mode */
#define IGB_RAH_POOL_1 0x00040000 /* Pool 1 assignment */
-#define IGB_EIMS 0x01524 /* Extended Interrupt Mask Set */
#define IGB_EICS 0x01520 /* Extended Interrupt Cause Set */
+#define IGB_EIMS 0x01524 /* Extended Interrupt Mask Set */
#define IGB_EIMC 0x01528 /* Extended Interrupt Mask Clear */
+#define IGB_EIAC 0x0152C /* Extended Interrupt Auto Clear */
+#define IGB_EIAM 0x01530 /* Extended Interrupt Auto Mask Enable */
+#define IGB_EICR_VEC0 BIT(0) /* MSI-X cause/vector 0 */
#define IGB_CTRL_GIO_MASTER_DISABLE BIT(2) /* GIO Master Disable */
#define IGB_STATUS_GIO_MASTER_ENABLE BIT(19) /* GIO Master Enable */
#define IGB_GPIE 0x01514 /* General Purpose Interrupt Enable */
+/* GPIE fields per 82576 datasheet section 7.3.2.11, Table 7-47 */
+#define IGB_GPIE_MULTIPLE_MSIX BIT(4) /* Multi-vector MSI-X mode */
+#define IGB_GPIE_EIAME BIT(30) /* Apply EIAM on MSI-X assertion */
#define IGB_TXDCTL0_Q_EN BIT(25) /* Transmit Queue Enable */
#define IGB_RXDCTL0_Q_EN BIT(25) /* Receive Queue Enable */
#define IGB_MRQC 0x05818 /* Multiple Receive Queues Command */
@@ -101,7 +107,6 @@
#define IGB_PHY_CTRL_FULL_DUPLEX 0x0100 /* bit 8 */
#define IGB_PHY_CTRL_LOOPBACK 0x4000 /* bit 14 */
-#define IGB_GPIE_EIAME 0x10 /* Extended Interrupt Auto Mask Enable */
#define IGB_IVAR_VALID 0x80 /* Valid bit for IVAR register */
/*
--
2.51.0
* [PATCH 4/8] selftests/vfio: igb: Extend memcpy completion timeout for line-rate hardware
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
` (2 preceding siblings ...)
2026-05-15 22:03 ` [PATCH 3/8] selftests/vfio: igb: Program MSI-X interrupt routing Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
2026-05-15 22:03 ` [PATCH 5/8] selftests/vfio: igb: Disable PCIe completion timeout retries Alex Williamson
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
The submitted driver waits at most 1 ms (100 * 10 us) for the last RX
descriptor to be written back. QEMU's emulated loopback is synchronous:
by the time igb_memcpy_wait() runs, the receive descriptors are already
written back. Real 82576 hardware processes the descriptor ring at
line rate, so the last writeback arrives only after the whole batch
has crossed the wire.
max_memcpy_size is (RING_SIZE - 1) * IGB_MAX_CHUNK_SIZE, approximately
4 MB, split into 4095 1 KB frames. At 1 Gb/s line rate (~125 MB/s),
4 MB takes ~32 ms on the wire, plus per-frame preamble, SFD,
inter-frame gap and FCS overhead (~3%) and descriptor fetch/writeback
latency. The 1 ms cap times out long before any valid transfer can
complete.
Wait up to ~200 ms (200 iterations * 1 ms) for descriptor writeback
before returning -ETIMEDOUT. That is roughly 6x the line-rate floor,
which leaves comfortable headroom for host scheduling jitter and slower
PCIe paths while still bounding the intentional invalid-DMA tests
(mix_and_match) so they recover quickly on real faults.
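
The line-rate estimate above can be reproduced with a short calculation.
This is only a sketch: RING_SIZE = 4096 and IGB_MAX_CHUNK_SIZE = 1024 are
inferred from the "4095 1 KB frames" figure, each chunk is assumed to be
the whole frame body with the FCS appended, and wire_time_us() and
worst_case_memcpy_us() are hypothetical names that do not exist in the
selftest driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, inferred from "4095 1 KB frames" in the text above. */
#define RING_SIZE          4096u
#define IGB_MAX_CHUNK_SIZE 1024u

/* Per-frame Ethernet overhead counted in the commit message, in bytes. */
#define PREAMBLE_SFD 8u   /* 7-byte preamble + 1-byte SFD */
#define IFG          12u  /* minimum inter-frame gap */
#define FCS          4u   /* frame check sequence */

/*
 * Microseconds needed to put `frames` frames of `payload` bytes each on
 * a link running at `bits_per_sec`, including per-frame overhead.
 * The result is rounded down.
 */
static uint64_t wire_time_us(uint64_t frames, uint64_t payload,
			     uint64_t bits_per_sec)
{
	uint64_t bits;

	assert(bits_per_sec > 0);
	bits = frames * (payload + PREAMBLE_SFD + IFG + FCS) * 8;
	return bits * 1000000u / bits_per_sec;
}

/* Largest valid memcpy: (RING_SIZE - 1) full-size chunks at 1 Gb/s. */
static uint64_t worst_case_memcpy_us(void)
{
	return wire_time_us(RING_SIZE - 1, IGB_MAX_CHUNK_SIZE, 1000000000ull);
}
```

Under these assumptions the largest memcpy occupies the wire for a little
over 34 ms, so the 200 ms budget is slightly under 6x that floor, before
descriptor fetch/writeback latency is counted.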
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
.../testing/selftests/vfio/lib/drivers/igb/igb.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
index d44a08a36171..2297382d7c26 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -422,11 +422,22 @@ static int igb_memcpy_wait(struct vfio_pci_device *device)
prev_tail = (igb->rx_tail + RING_SIZE - 1) % RING_SIZE;
rx = &igb->rx_ring[prev_tail];
- retries = 100;
+ /*
+ * Real 82576 hardware processes the descriptor ring at line rate.
+ * max_memcpy_size = (RING_SIZE - 1) * IGB_MAX_CHUNK_SIZE ~= 4 MB,
+ * split into 4095 1 KB frames. At 1 Gb/s (~125 MB/s) the worst
+ * valid memcpy takes ~32 ms on the wire, plus per-frame preamble,
+ * SFD, IFG and FCS overhead (~3%) and descriptor fetch/writeback
+ * latency. Wait up to ~200 ms before declaring the device hung;
+ * ~6x the line-rate floor leaves comfortable headroom for host
+ * scheduling jitter while keeping the intentional invalid-DMA
+ * tests bounded.
+ */
+ retries = 200;
while (retries-- > 0) {
if (rx->wb.status_error & 1)
break;
- usleep(10);
+ usleep(1000);
}
igb_irq_clear(igb);
--
2.51.0
* [PATCH 5/8] selftests/vfio: igb: Disable PCIe completion timeout retries
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
` (3 preceding siblings ...)
2026-05-15 22:03 ` [PATCH 4/8] selftests/vfio: igb: Extend memcpy completion timeout for line-rate hardware Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
2026-05-15 22:03 ` [PATCH 6/8] selftests/vfio: Add vfio_pci_irq_reenable() helper Alex Williamson
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
The mix_and_match test intentionally submits TX descriptors with an
unmapped source IOVA so that the DMA read fails. By default the 82576
re-sends the request after a PCIe completion timeout (datasheet section
8.6.1, GCR.Completion_Timeout_Resend, bit 16, initial value 1b). On
real hardware this turns a single fault into a stream of retried reads,
keeping PCIe AER and IOMMU error handling busy and interfering with
subsequent reset recovery.
Clear GCR.Completion_Timeout_Resend during device initialization so a
faulted read fails once and stays failed.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
tools/testing/selftests/vfio/lib/drivers/igb/igb.c | 12 ++++++++++++
.../selftests/vfio/lib/drivers/igb/registers.h | 2 ++
2 files changed, 14 insertions(+)
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
index 2297382d7c26..9f93ec7ba8bc 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -213,6 +213,18 @@ static void igb_init(struct vfio_pci_device *device)
vfio_pci_config_writew(device, PCI_COMMAND, cmd_reg);
}
+ /*
+ * Disable DMA re-send on PCIe completion timeout (82576 datasheet
+ * section 8.6.1, GCR.Completion_Timeout_Resend, bit 16). The
+ * mix_and_match test intentionally submits descriptors targeting
+ * unmapped IOVAs; with the default (set) value, the device keeps
+ * retrying the failed read indefinitely, which keeps PCIe AER and
+ * IOMMU error handling busy and interferes with reset recovery.
+ */
+ ctrl = igb_read32(igb, IGB_GCR);
+ ctrl &= ~IGB_GCR_CMPL_TMOUT_RESEND;
+ igb_write32(igb, IGB_GCR, ctrl);
+
/* Configure PHY internal loopback for testing. */
if (igb_setup_loopback(igb))
return;
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
index 139f1c2e6fdd..45f71dc26e24 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/registers.h
@@ -73,6 +73,8 @@
#define IGB_RAH0 0x05404 /* Receive Address High 0 */
#define IGB_VMOLR0 0x05AD0 /* VM Offload Layout Register 0 */
+#define IGB_GCR 0x05B00 /* PCIe Control */
+#define IGB_GCR_CMPL_TMOUT_RESEND BIT(16) /* Re-send on completion timeout */
#define IGB_VMOLR_LPE 0x00010000 /* Long Packet Enable */
#define IGB_VMOLR_BAM 0x08000000 /* Broadcast Accept Mode */
--
2.51.0
* [PATCH 6/8] selftests/vfio: Add vfio_pci_irq_reenable() helper
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
` (4 preceding siblings ...)
2026-05-15 22:03 ` [PATCH 5/8] selftests/vfio: igb: Disable PCIe completion timeout retries Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
2026-05-15 22:03 ` [PATCH 7/8] selftests/vfio: igb: Factor hardware programming into igb_hw_init() Alex Williamson
2026-05-15 22:03 ` [PATCH 8/8] selftests/vfio: igb: Recover after DMA-read faults Alex Williamson
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
Selftest drivers that recover from a fault by issuing VFIO_DEVICE_RESET
need to re-arm device interrupts afterwards. VFIO_DEVICE_RESET tears
down the kernel-side IRQ trigger so a subsequent VFIO_DEVICE_SET_IRQS
is required, but the user-side eventfds (and any fd cached in a test
fixture) are still valid and must be preserved.
vfio_pci_irq_enable() refuses to be called for vectors that already
have an eventfd (VFIO_ASSERT_LT), and vfio_pci_irq_disable() closes
all eventfds before resetting the trigger, so neither is suitable.
Add vfio_pci_irq_reenable(device, index, vector, count) which asserts
that the requested range has existing eventfds and re-issues
VFIO_DEVICE_SET_IRQS using them. Signature mirrors vfio_pci_irq_enable().
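
For readers unfamiliar with the underlying UAPI, the following sketch
shows roughly what "re-issuing VFIO_DEVICE_SET_IRQS with the existing
eventfds" amounts to at the ioctl level. It uses only struct vfio_irq_set
and the flag names from <linux/vfio.h>; build_eventfd_irq_set() is a
hypothetical helper written for illustration and is not the selftest
library's vfio_pci_irq_set().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <linux/vfio.h>

/*
 * Build a VFIO_DEVICE_SET_IRQS argument that binds `count` already-open
 * eventfds to vectors [start, start + count) of IRQ index `index`.
 * The fds are copied as-is, so nothing is closed or recreated.
 * Hypothetical helper; the caller frees the returned buffer.
 */
static struct vfio_irq_set *build_eventfd_irq_set(uint32_t index,
						  uint32_t start,
						  uint32_t count,
						  const int32_t *fds)
{
	size_t data_sz = count * sizeof(int32_t);
	struct vfio_irq_set *irq;

	assert(count > 0 && fds);

	irq = calloc(1, sizeof(*irq) + data_sz);
	if (!irq)
		return NULL;

	irq->argsz = sizeof(*irq) + data_sz;
	irq->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq->index = index;
	irq->start = start;
	irq->count = count;
	memcpy(irq->data, fds, data_sz);	/* same fds as before the reset */

	return irq;
}
```

The resulting buffer would be passed to ioctl(device_fd,
VFIO_DEVICE_SET_IRQS, irq), where device_fd is the VFIO device file
descriptor. Because the eventfds are reused rather than reopened, any fd
value cached by a test fixture stays valid.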
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
.../lib/include/libvfio/vfio_pci_device.h | 2 ++
.../selftests/vfio/lib/vfio_pci_device.c | 22 +++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
index 2858885a89bb..a362e2b2bfda 100644
--- a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
+++ b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
@@ -68,6 +68,8 @@ void vfio_pci_config_access(struct vfio_pci_device *device, bool write,
void vfio_pci_irq_enable(struct vfio_pci_device *device, u32 index,
u32 vector, int count);
void vfio_pci_irq_disable(struct vfio_pci_device *device, u32 index);
+void vfio_pci_irq_reenable(struct vfio_pci_device *device, u32 index,
+ u32 vector, int count);
void vfio_pci_irq_trigger(struct vfio_pci_device *device, u32 index, u32 vector);
static inline void fcntl_set_nonblock(int fd)
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
index fc75e04ef010..7b8394d0ac50 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
@@ -106,6 +106,28 @@ void vfio_pci_irq_disable(struct vfio_pci_device *device, u32 index)
vfio_pci_irq_set(device, index, 0, 0, NULL);
}
+/*
+ * Re-issue VFIO_DEVICE_SET_IRQS for an already-enabled vector range using
+ * the existing eventfds. Intended for drivers that need to re-arm device
+ * interrupts after a VFIO_DEVICE_RESET, which tears down the kernel-side
+ * IRQ trigger but leaves user-side eventfds intact. Recreating the
+ * eventfds would invalidate any test-fixture cache of the fd, so this
+ * helper deliberately preserves them.
+ */
+void vfio_pci_irq_reenable(struct vfio_pci_device *device, u32 index,
+ u32 vector, int count)
+{
+ int i;
+
+ check_supported_irq_index(index);
+
+ for (i = vector; i < vector + count; i++)
+ VFIO_ASSERT_GE(device->msi_eventfds[i], 0,
+ "vector %d eventfd not allocated\n", i);
+
+ vfio_pci_irq_set(device, index, vector, count, device->msi_eventfds + vector);
+}
+
static void vfio_pci_irq_get(struct vfio_pci_device *device, u32 index,
struct vfio_irq_info *irq_info)
{
--
2.51.0
* [PATCH 7/8] selftests/vfio: igb: Factor hardware programming into igb_hw_init()
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
` (5 preceding siblings ...)
2026-05-15 22:03 ` [PATCH 6/8] selftests/vfio: Add vfio_pci_irq_reenable() helper Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
2026-05-15 22:03 ` [PATCH 8/8] selftests/vfio: igb: Recover after DMA-read faults Alex Williamson
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
Split the device register programming out of igb_init() into a new
igb_hw_init() helper so that the same sequence can be re-run after a
VFIO_DEVICE_RESET to restore the registers that CTRL.RST clears. No
functional change for the initial path.
igb_init() now performs the one-shot setup: region size assertion, BAR
mapping, CTRL.RST + IMC mask-all to put the device into a known state,
and vfio_pci_msix_enable() to set up the kernel-side IRQ trigger.
igb_hw_init() does the rest: ring pointer setup and IOVA calc,
CTRL_EXT, PCI bus master, GCR, PHY loopback, descriptor rings, RCTL,
TCTL, GPIE/EIAC/EIAM/EIMS/IVAR, and driver-state initialization.
vfio_pci_msix_enable() moves from after RCTL/TCTL to before all
device-side programming. Its only side effects are the VFIO kernel
IRQ trigger setup and the PCI MSI-X capability bits in config space;
neither has any ordering dependency on the 82576 device register
writes performed in igb_hw_init(). Performing it once in igb_init()
keeps igb_hw_init() reusable from the reset recovery path (which uses
vfio_pci_irq_reenable() to re-arm the existing trigger).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
.../selftests/vfio/lib/drivers/igb/igb.c | 49 +++++++++++++------
1 file changed, 34 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
index 9f93ec7ba8bc..ef242ebd9d2e 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -175,7 +175,13 @@ static int igb_probe(struct vfio_pci_device *device)
return 0;
}
-static void igb_init(struct vfio_pci_device *device)
+/*
+ * Program the device into a usable state. Split out of igb_init() so it
+ * can be reused after a device reset to re-program the registers that
+ * CTRL.RST clears. Expects bar0 to be mapped and MSI-X already enabled
+ * via VFIO.
+ */
+static void igb_hw_init(struct vfio_pci_device *device)
{
struct igb *igb = to_igb_state(device);
uint64_t iova_tx, iova_rx;
@@ -183,23 +189,12 @@ static void igb_init(struct vfio_pci_device *device)
uint16_t cmd_reg;
int retries;
- VFIO_ASSERT_GE(device->driver.region.size,
- sizeof(struct igb) + 2 * RING_SIZE * sizeof(struct igb_tx_desc));
-
- /* Set up rings and calculate IOVAs */
- igb->bar0 = device->bars[0].vaddr;
-
igb->tx_ring = (struct igb_tx_desc *)(igb + 1);
igb->rx_ring = (struct igb_rx_desc *)(igb->tx_ring + RING_SIZE);
iova_tx = to_iova(device, igb->tx_ring);
iova_rx = to_iova(device, igb->rx_ring);
- /* Reset device and disable all interrupts */
- igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST);
- usleep(20000);
- igb_write32(igb, IGB_IMC, 0xFFFFFFFF);
-
/* Signal that the driver is loaded */
ctrl = igb_read32(igb, IGB_CTRL_EXT);
ctrl |= IGB_CTRL_EXT_DRV_LOAD;
@@ -284,9 +279,6 @@ static void igb_init(struct vfio_pci_device *device)
igb_write32(igb, IGB_RCTL, rctl);
igb_write32(igb, IGB_TCTL, IGB_TCTL_EN);
- /* Enable MSI-X with 1 vector for the test */
- vfio_pci_msix_enable(device, MSIX_VECTOR, 1);
-
/*
* Program MSI-X interrupt routing per 82576 datasheet:
*
@@ -326,6 +318,33 @@ static void igb_init(struct vfio_pci_device *device)
device->driver.msi = MSIX_VECTOR;
}
+static void igb_init(struct vfio_pci_device *device)
+{
+ struct igb *igb = to_igb_state(device);
+
+ VFIO_ASSERT_GE(device->driver.region.size,
+ sizeof(struct igb) + 2 * RING_SIZE * sizeof(struct igb_tx_desc));
+
+ igb->bar0 = device->bars[0].vaddr;
+
+ /* Reset device and disable all interrupts. */
+ igb_write32(igb, IGB_CTRL, igb_read32(igb, IGB_CTRL) | IGB_CTRL_RST);
+ usleep(20000);
+ igb_write32(igb, IGB_IMC, 0xFFFFFFFF);
+
+ /*
+ * Enable MSI-X via VFIO before device-side register programming.
+ * vfio_pci_msix_enable() only touches the VFIO IRQ machinery and the
+ * PCI MSI-X capability via config space; it has no ordering
+ * dependency on the device-side writes performed by igb_hw_init().
+ * Placing it here keeps igb_hw_init() reusable from the reset
+ * recovery path (which calls vfio_pci_irq_reenable() instead).
+ */
+ vfio_pci_msix_enable(device, MSIX_VECTOR, 1);
+
+ igb_hw_init(device);
+}
+
static void igb_remove(struct vfio_pci_device *device)
{
struct igb *igb = to_igb_state(device);
--
2.51.0
* [PATCH 8/8] selftests/vfio: igb: Recover after DMA-read faults
2026-05-15 22:03 [PATCH 0/8] selftests/vfio: igb: 82576 hardware compatibility Alex Williamson
` (6 preceding siblings ...)
2026-05-15 22:03 ` [PATCH 7/8] selftests/vfio: igb: Factor hardware programming into igb_hw_init() Alex Williamson
@ 2026-05-15 22:03 ` Alex Williamson
7 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2026-05-15 22:03 UTC (permalink / raw)
To: jrhilke
Cc: Alex Williamson, Alex Williamson, kvm, David Matlack,
linux-kernel, Jason Gunthorpe
The mix_and_match test intentionally submits a TX descriptor with an
unmapped source IOVA so that the DMA read fails. On real 82576
hardware the resulting fault leaves the descriptor engine unable to
service subsequent valid descriptors, so the next memcpy in the same
test iteration times out.
The 82576 datasheet (section 4.2.1.6.1) describes CTRL.RST as the
software mechanism to recover from a hung device. Empirically
CTRL.RST alone is not sufficient in this state: the visible queue
registers are reinitialized, but within the same process the next
valid memcpy still posts descriptors that make no TDH/TDT progress.
A fresh device open after the failure works, which suggests that a
reset scope broader than CTRL.RST is required. The 82576 advertises
PCIe FLR; VFIO_DEVICE_RESET drives FLR and supplies that scope while
preserving the selftest process and its DMA mappings.
Add igb_error_reset_and_reinit() implementing the recovery sequence:
issue VFIO_DEVICE_RESET, re-arm the kernel-side MSI-X trigger against
the still-valid eventfd via vfio_pci_irq_reenable() (this does not
touch the eventfd, which test fixtures may have cached), and
re-program the device via igb_hw_init(). FLR clears EICR and leaves
EIMS=0, so no explicit interrupt mask or cause writes are needed.
igb_hw_init() resets tx_tail/rx_tail to 0 and igb_memcpy_start() zeros
each descriptor before submission, so no ring memset is needed either.
Call this from igb_memcpy_wait() on completion timeout, preceded by a
10 ms delay so that PCIe/IOMMU/AER error handling triggered by the
just-observed DMA fault can release the device lock VFIO_DEVICE_RESET
contends for. The delay is heuristic and tied to the fault path, so
it lives at the call site rather than inside the reset helper. The
failed memcpy still returns -ETIMEDOUT; reset recovery only ensures
the next operation starts from a usable device state.
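
The control flow described above can be modelled without hardware. The
sketch below is an illustration only: struct mock_dev and the mock_*()
functions are hypothetical stand-ins for the real device,
vfio_pci_device_reset(), vfio_pci_irq_reenable() and igb_hw_init(), and
the polling loop and 10 ms delay are left out. It shows just the ordering
of the recovery steps and that the failing call still reports the
timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the device; records which steps ran. */
struct mock_dev {
	bool wedged;	/* descriptor engine stuck after a DMA fault */
	char log[8];	/* one letter per recovery step, in call order */
};

static void mock_log(struct mock_dev *d, char step)
{
	size_t n = strlen(d->log);

	assert(n + 1 < sizeof(d->log));
	d->log[n] = step;
}

/* Stand-ins for reset (FLR), IRQ re-arm and register reprogramming. */
static void mock_device_reset(struct mock_dev *d)
{
	mock_log(d, 'R');
	d->wedged = false;
}

static void mock_irq_reenable(struct mock_dev *d)
{
	mock_log(d, 'I');
}

static void mock_hw_init(struct mock_dev *d)
{
	mock_log(d, 'H');
}

/* Returns 0 on completion; on timeout recovers the device but still fails. */
static int mock_memcpy_wait(struct mock_dev *d)
{
	if (!d->wedged)
		return 0;

	mock_device_reset(d);
	mock_irq_reenable(d);
	mock_hw_init(d);

	return -ETIMEDOUT;
}
```

Starting from a zero-initialized struct mock_dev with wedged set, the
first mock_memcpy_wait() call returns -ETIMEDOUT after running reset, IRQ
re-arm and reinit in that order, and the following call succeeds.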
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
---
.../selftests/vfio/lib/drivers/igb/igb.c | 40 +++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
index ef242ebd9d2e..07d1a907f18a 100644
--- a/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
+++ b/tools/testing/selftests/vfio/lib/drivers/igb/igb.c
@@ -443,6 +443,28 @@ static void igb_memcpy_start(struct vfio_pci_device *device, iova_t src,
igb_write32(igb, IGB_TDT0, igb->tx_tail);
}
+/*
+ * Reset the device via VFIO_DEVICE_RESET (PCIe FLR on the 82576) and
+ * re-program it. VFIO_DEVICE_RESET tears down the kernel-side MSI-X
+ * trigger but leaves user-side eventfds intact, so re-arm the trigger
+ * via vfio_pci_irq_reenable() before reprogramming so any caller-cached
+ * eventfd remains valid.
+ *
+ * FLR clears device-side state to power-on reset values (datasheet
+ * 4.2.1.5.1: a PF FLR is "equivalent to a D0->D3->D0 transition"), so
+ * EIMS and EICR come back as 0 from their register-defined initial
+ * values, and igb_hw_init() resets tx_tail/rx_tail to 0. The next
+ * igb_memcpy_start() will memset each descriptor it touches before
+ * submission, so no explicit IMC/EICR writes or ring memsets are
+ * needed here.
+ */
+static void igb_error_reset_and_reinit(struct vfio_pci_device *device)
+{
+ vfio_pci_device_reset(device);
+ vfio_pci_irq_reenable(device, VFIO_PCI_MSIX_IRQ_INDEX, MSIX_VECTOR, 1);
+ igb_hw_init(device);
+}
+
static int igb_memcpy_wait(struct vfio_pci_device *device)
{
struct igb *igb = to_igb_state(device);
@@ -478,6 +500,24 @@ static int igb_memcpy_wait(struct vfio_pci_device *device)
if (rx->wb.status_error & 1)
return 0;
+ /*
+ * The descriptor never completed. On real 82576 hardware this
+ * typically follows a DMA-read fault from one of the intentional
+ * unmapped-IOVA tests; the fault leaves the descriptor engine
+ * unable to service subsequent valid descriptors. CTRL.RST alone
+ * reinitializes the queue registers but leaves the engine wedged
+ * for the current process, so a broader VFIO_DEVICE_RESET (FLR)
+ * is required.
+ *
+ * Delay before requesting reset so PCIe/IOMMU/AER error handling
+ * triggered by the just-observed DMA fault can release the device
+ * lock VFIO_DEVICE_RESET contends for. The 10 ms value is
+ * heuristic. The current memcpy still fails with -ETIMEDOUT;
+ * recovery only ensures the next memcpy starts from a usable state.
+ */
+ usleep(10000);
+ igb_error_reset_and_reinit(device);
+
return -ETIMEDOUT;
}
--
2.51.0