From: Petr Oros <poros@redhat.com>
To: netdev@vger.kernel.org
Cc: jacob.e.keller@intel.com,
Tony Nguyen <anthony.l.nguyen@intel.com>,
Przemek Kitszel <przemyslaw.kitszel@intel.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC iwl-next 0/4] iavf: fix VLAN filter state machine races
Date: Fri, 6 Mar 2026 14:12:22 +0100 [thread overview]
Message-ID: <76331edf-2963-4527-9f01-80fed3f6d49b@redhat.com> (raw)
In-Reply-To: <20260302114025.1017985-1-poros@redhat.com>
I used Claude Opus 4.6 to develop a stress-test suite with a primary
'break-it' objective targeting VF stability. The suite focuses on
aggressive edge cases, in particular cyclic VF migration between
network namespaces while VLAN filtering is active, a sequence known
to trigger state machine regressions. The following output
demonstrates the failure state on an unpatched iavf driver (prior to
this 'fix VLAN filter state machine races' series):
# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
iavf VLAN state machine test suite
================================================
VF1: enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6502
VF2: enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6502
PF: enp65s0f0np0 (0000:41:00.0)
MAX: 8 user VLANs per VF
================================================
PASS state: basic add/remove
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL state: 8 VLANs add/remove (only 7 created)
PASS state: VLAN persists across down/up
PASS state: 5 VLANs persist across down/up
PASS state: rapid add/del same VLAN x100
PASS state: add during remove (REMOVING race)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
PASS state: bulk 8 add then remove
PASS state: 20x rapid down/up with VLAN
PASS state: add VLAN while down
PASS state: remove VLAN while down
PASS state: down -> remove -> up
PASS state: add VLANs while down, verify all after up
PASS state: double add same VLAN (idempotent)
PASS state: double remove same VLAN
PASS state: interleaved add/remove different VIDs
PASS state: remove+re-add loop x50
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL state: stress 8 VLANs (fill to max) (expected 8, got 7)
PASS state: VLAN VID 1 (common edge case)
PASS state: VLAN VID 4094 (max)
PASS state: concurrent VLAN adds (4 parallel)
PASS state: concurrent VLAN deletes (4 parallel)
PASS state: add/del storm (200 ops, 5 VIDs)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL state: over-limit VLAN rejected, existing survive (fill:
expected 8, got 7)
PASS reset: VLANs recover after VF PCI FLR
PASS reset: 5 VLANs recover after VF PCI FLR
PASS reset: rapid VF resets x5 with VLANs
PASS reset: VLANs survive PF link flap
PASS reset: 5 VLANs survive PF link flap
PASS reset: VLANs survive 3x PF link flap
PASS reset: VLANs survive PF PCI FLR
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL reset: all 8 VLANs recover after VF FLR (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL reset: all 8 VLANs survive PF link flap (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL reset: all 8 VLANs survive PF PCI FLR (VLAN 107 gone)
PASS reset: FLR during VLAN add/del (race)
PASS reset: VF driver unbind/bind cycle
PASS ping: basic VLAN traffic
PASS ping: 5 VLANs simultaneously
PASS ping: survives VF down/up
PASS ping: survives 10x rapid VF flap
PASS ping: survives VF PCI FLR
PASS ping: survives PF link flap
PASS ping: survives PF PCI FLR
PASS ping: stable while adding/removing other VLANs
PASS ping: all 3 VLANs work after down/up
PASS ping: parallel VLAN churn from both VFs
PASS ping: VLANs work after rapid add/del churn
PASS ping: VLANs survive repeated NS move cycle
PASS ping: all VLANs survive PF link flap
PASS ping: VLAN isolation (no cross-VLAN leakage)
PASS ping: traffic works with spoofchk enabled
PASS ping: port VLAN (PF-assigned pvid)
PASS dmesg: no call traces / BUGs / stalls
================================================
PASS 46 | FAIL 6 | SKIP 0 | TOTAL 52
================================================
RESULT: FAIL -- check dmesg
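For reference, the namespace-migration pattern the suite cycles through
looks roughly like this. This is a dry-run sketch, not the actual
selftest: the interface and namespace names are placeholders, and the
commands are printed rather than executed so the sketch stands alone:

```shell
# Dry-run sketch of cyclic VF migration between network namespaces
# with a VLAN filter active. Names are placeholders; run() only
# prints each command instead of executing it.
vf=enp65s0f0v0
ns=iavf-t1
run() { echo "+ $*"; }          # replace the body with "$@" to execute

for i in 1 2 3 4 5; do
    run ip link set "$vf" netns "$ns"
    run ip netns exec "$ns" ip link add link "$vf" name "$vf.100" \
        type vlan id 100
    run ip netns exec "$ns" ip link set "$vf.100" up
    run ip netns exec "$ns" ip link del "$vf.100"
    run ip netns exec "$ns" ip link set "$vf" netns 1   # back to init ns
done
```

Each iteration forces the VF through netns moves and VLAN add/del while
filters are in flight, which is what shakes out the races below.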
The underlying failures stem from the VF and PF losing synchronization
over VLAN filter state: the VF drops a filter from its local tracking
before the PF has confirmed the removal, so a racing re-add can be
silently lost. During rapid configuration cycles the driver then fails
to converge on a consistent hardware state, which is why VLAN 107
intermittently disappears in the output above.
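To illustrate the race in isolation, here is a toy model of the filter
lifecycle. The state names are simplified assumptions, not the driver's
actual enum: the point is only that forgetting a filter before the PF
confirms its removal makes a concurrent re-add vanish:

```shell
# Toy model of the VF VLAN filter lifecycle (states are simplified
# assumptions, not the driver's actual enum values).
state=ACTIVE
remove_filter()      { state=REMOVING; }
pf_confirms_remove() { [ "$state" = REMOVING ] && state=GONE; }
# Unpatched behaviour: the re-add is dropped because the VF already
# considers the filter dead before the PF has confirmed the removal.
readd_unpatched()    { :; }
# Patched behaviour: the filter stays tracked until PF confirmation,
# so a re-add can transition it back to ADDING.
readd_patched()      { [ "$state" = REMOVING ] && state=ADDING; }

state=ACTIVE; remove_filter; readd_unpatched; pf_confirms_remove
unpatched=$state                 # filter is lost
state=ACTIVE; remove_filter; readd_patched
pf_confirms_remove               # no-op: filter is no longer REMOVING
patched=$state                   # re-add survives
echo "unpatched: $unpatched, patched: $patched"
```

In the model the unpatched run ends with the filter GONE while the
patched run ends with it back in ADDING, matching the fix in 3/4
(wait for PF confirmation before removing VLAN filters).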
...................
Patched kernel:
# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
iavf VLAN state machine test suite
================================================
VF1: enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6573
VF2: enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6573
PF: enp65s0f0np0 (0000:41:00.0)
MAX: 8 user VLANs per VF
================================================
PASS state: basic add/remove
PASS state: 8 VLANs add/remove
PASS state: VLAN persists across down/up
PASS state: 5 VLANs persist across down/up
PASS state: rapid add/del same VLAN x100
PASS state: add during remove (REMOVING race)
PASS state: bulk 8 add then remove
PASS state: 20x rapid down/up with VLAN
PASS state: add VLAN while down
PASS state: remove VLAN while down
PASS state: down -> remove -> up
PASS state: add VLANs while down, verify all after up
PASS state: double add same VLAN (idempotent)
PASS state: double remove same VLAN
PASS state: interleaved add/remove different VIDs
PASS state: remove+re-add loop x50
PASS state: stress 8 VLANs (fill to max)
PASS state: VLAN VID 1 (common edge case)
PASS state: VLAN VID 4094 (max)
PASS state: concurrent VLAN adds (4 parallel)
PASS state: concurrent VLAN deletes (4 parallel)
PASS state: add/del storm (200 ops, 5 VIDs)
PASS state: over-limit VLAN rejected, existing survive
PASS reset: VLANs recover after VF PCI FLR
PASS reset: 5 VLANs recover after VF PCI FLR
PASS reset: rapid VF resets x5 with VLANs
PASS reset: VLANs survive PF link flap
PASS reset: 5 VLANs survive PF link flap
PASS reset: VLANs survive 3x PF link flap
PASS reset: VLANs survive PF PCI FLR
PASS reset: all 8 VLANs recover after VF FLR
PASS reset: all 8 VLANs survive PF link flap
PASS reset: all 8 VLANs survive PF PCI FLR
PASS reset: FLR during VLAN add/del (race)
PASS reset: VF driver unbind/bind cycle
PASS ping: basic VLAN traffic
PASS ping: 5 VLANs simultaneously
PASS ping: survives VF down/up
PASS ping: survives 10x rapid VF flap
PASS ping: survives VF PCI FLR
PASS ping: survives PF link flap
PASS ping: survives PF PCI FLR
PASS ping: stable while adding/removing other VLANs
PASS ping: all 3 VLANs work after down/up
PASS ping: parallel VLAN churn from both VFs
PASS ping: VLANs work after rapid add/del churn
PASS ping: VLANs survive repeated NS move cycle
PASS ping: all VLANs survive PF link flap
PASS ping: VLAN isolation (no cross-VLAN leakage)
PASS ping: traffic works with spoofchk enabled
PASS ping: port VLAN (PF-assigned pvid)
PASS dmesg: no call traces / BUGs / stalls
================================================
PASS 52 | FAIL 0 | SKIP 0 | TOTAL 52
================================================
RESULT: OK
Additionally, interface up/down performance with active VLAN
filtering is significantly improved. The previous bottleneck was the
synchronous VLAN filtering cycle (VF -> PF -> HW -> PF -> VF), which
used an AdminQ round trip per VLAN update and introduced substantial
latency.
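As a back-of-envelope illustration (the per-cycle latency here is a
made-up figure, and treating the improvement as one amortized cycle is
an assumption): a synchronous per-VLAN cycle scales linearly with the
number of filters, while amortizing the round trips keeps the cost
roughly constant:

```shell
# Back-of-envelope cost of per-VLAN synchronous AdminQ round trips.
# rtt_ms is a hypothetical per-cycle latency, not a measured value.
rtt_ms=1
nvlans=8
sync_ms=$((nvlans * rtt_ms))   # one VF->PF->HW->PF->VF cycle per VLAN
batch_ms=$rtt_ms               # one cycle covering all VLANs
echo "per-VLAN: ${sync_ms} ms, amortized: ${batch_ms} ms"
```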
Test suite:
https://github.com/torvalds/linux/commit/5c60850c33da80a1c2497fb6bc31f956316197a9
Regards,
Petr
Thread overview (8+ messages):
2026-03-02 11:40 [PATCH RFC iwl-next 0/4] iavf: fix VLAN filter state machine races Petr Oros
2026-03-02 11:40 ` [PATCH RFC iwl-next 1/4] iavf: rename IAVF_VLAN_IS_NEW to IAVF_VLAN_ADDING Petr Oros
2026-03-16 11:34 ` [Intel-wired-lan] " Loktionov, Aleksandr
2026-03-02 11:40 ` [PATCH RFC iwl-next 2/4] iavf: stop removing VLAN filters from PF on interface down Petr Oros
2026-03-16 11:35 ` [Intel-wired-lan] " Loktionov, Aleksandr
2026-03-02 11:40 ` [PATCH RFC iwl-next 3/4] iavf: wait for PF confirmation before removing VLAN filters Petr Oros
2026-03-02 11:40 ` [PATCH RFC iwl-next 4/4] iavf: harden VLAN filter state machine race handling Petr Oros
2026-03-06 13:12 ` Petr Oros [this message]