public inbox for netdev@vger.kernel.org
From: Petr Oros <poros@redhat.com>
To: netdev@vger.kernel.org
Cc: jacob.e.keller@intel.com,
	Tony Nguyen <anthony.l.nguyen@intel.com>,
	Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC iwl-next 0/4] iavf: fix VLAN filter state machine races
Date: Fri, 6 Mar 2026 14:12:22 +0100	[thread overview]
Message-ID: <76331edf-2963-4527-9f01-80fed3f6d49b@redhat.com> (raw)
In-Reply-To: <20260302114025.1017985-1-poros@redhat.com>

I used Claude Opus 4.6 to develop a stress-test suite with a
primary 'break-it' objective targeting VF stability. The suite focuses
on aggressive edge cases, in particular cyclic VF migration between
network namespaces while VLAN filtering is active, a sequence known
to trigger state machine regressions. The following output
demonstrates the failure state on an unpatched iavf driver (prior to
the 'fix VLAN filter state machine races' series):

# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
   iavf VLAN state machine test suite
================================================
   VF1:  enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6502
   VF2:  enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6502
   PF:   enp65s0f0np0 (0000:41:00.0)
   MAX:  8 user VLANs per VF
================================================
   PASS  state: basic add/remove
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
   FAIL  state: 8 VLANs add/remove  (only 7 created)
   PASS  state: VLAN persists across down/up
   PASS  state: 5 VLANs persist across down/up
   PASS  state: rapid add/del same VLAN x100
   PASS  state: add during remove (REMOVING race)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
   PASS  state: bulk 8 add then remove
   PASS  state: 20x rapid down/up with VLAN
   PASS  state: add VLAN while down
   PASS  state: remove VLAN while down
   PASS  state: down -> remove -> up
   PASS  state: add VLANs while down, verify all after up
   PASS  state: double add same VLAN (idempotent)
   PASS  state: double remove same VLAN
   PASS  state: interleaved add/remove different VIDs
   PASS  state: remove+re-add loop x50
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
   FAIL  state: stress 8 VLANs (fill to max)  (expected 8, got 7)
   PASS  state: VLAN VID 1 (common edge case)
   PASS  state: VLAN VID 4094 (max)
   PASS  state: concurrent VLAN adds (4 parallel)
   PASS  state: concurrent VLAN deletes (4 parallel)
   PASS  state: add/del storm (200 ops, 5 VIDs)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
   FAIL  state: over-limit VLAN rejected, existing survive  (fill: expected 8, got 7)
   PASS  reset: VLANs recover after VF PCI FLR
   PASS  reset: 5 VLANs recover after VF PCI FLR
   PASS  reset: rapid VF resets x5 with VLANs
   PASS  reset: VLANs survive PF link flap
   PASS  reset: 5 VLANs survive PF link flap
   PASS  reset: VLANs survive 3x PF link flap
   PASS  reset: VLANs survive PF PCI FLR
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
   FAIL  reset: all 8 VLANs recover after VF FLR  (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
   FAIL  reset: all 8 VLANs survive PF link flap  (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
   FAIL  reset: all 8 VLANs survive PF PCI FLR  (VLAN 107 gone)
   PASS  reset: FLR during VLAN add/del (race)
   PASS  reset: VF driver unbind/bind cycle
   PASS  ping: basic VLAN traffic
   PASS  ping: 5 VLANs simultaneously
   PASS  ping: survives VF down/up
   PASS  ping: survives 10x rapid VF flap
   PASS  ping: survives VF PCI FLR
   PASS  ping: survives PF link flap
   PASS  ping: survives PF PCI FLR
   PASS  ping: stable while adding/removing other VLANs
   PASS  ping: all 3 VLANs work after down/up
   PASS  ping: parallel VLAN churn from both VFs
   PASS  ping: VLANs work after rapid add/del churn
   PASS  ping: VLANs survive repeated NS move cycle
   PASS  ping: all VLANs survive PF link flap
   PASS  ping: VLAN isolation (no cross-VLAN leakage)
   PASS  ping: traffic works with spoofchk enabled
   PASS  ping: port VLAN (PF-assigned pvid)
   PASS  dmesg: no call traces / BUGs / stalls

================================================
   PASS 46  |  FAIL 6  |  SKIP 0  |  TOTAL 52
================================================
   RESULT: FAIL  -- check dmesg


The underlying failures stem from a breakdown in state synchronization
between the VF and the PF. Under rapid configuration cycles the driver
loses track of which filters the PF has actually programmed, leaving
the hardware state inconsistent; this is what produces the -EIO on
VLAN add and the missing vlan interfaces seen above.
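For illustration, the fill-to-max sequence that trips the race can be
sketched roughly as below. The device name and VID range are
assumptions taken from the log above, not part of the actual selftest:

```shell
#!/bin/sh
# Sketch: fill a VF to its 8-VLAN limit as fast as possible, then
# verify that every vlan netdev was actually created.
VF=enp65s0f0v0          # assumed VF netdev name (matches the log above)
VIDS=$(seq 101 108)     # 8 user VLANs, an assumed VID range

for vid in $VIDS; do
    ip link add link "$VF" name "$VF.$vid" type vlan id "$vid"
    ip link set "$VF.$vid" up
done

# Count how many vlan netdevs actually exist after the burst.
created=0
for vid in $VIDS; do
    ip link show "$VF.$vid" >/dev/null 2>&1 && created=$((created + 1))
done

echo "created $created of 8"
```

On the unpatched driver the log above suggests this can report 7
instead of 8, with one add failing with an I/O error.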

...................

Patched kernel:

# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
   iavf VLAN state machine test suite
================================================
   VF1:  enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6573
   VF2:  enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6573
   PF:   enp65s0f0np0 (0000:41:00.0)
   MAX:  8 user VLANs per VF
================================================
   PASS  state: basic add/remove
   PASS  state: 8 VLANs add/remove
   PASS  state: VLAN persists across down/up
   PASS  state: 5 VLANs persist across down/up
   PASS  state: rapid add/del same VLAN x100
   PASS  state: add during remove (REMOVING race)
   PASS  state: bulk 8 add then remove
   PASS  state: 20x rapid down/up with VLAN
   PASS  state: add VLAN while down
   PASS  state: remove VLAN while down
   PASS  state: down -> remove -> up
   PASS  state: add VLANs while down, verify all after up
   PASS  state: double add same VLAN (idempotent)
   PASS  state: double remove same VLAN
   PASS  state: interleaved add/remove different VIDs
   PASS  state: remove+re-add loop x50
   PASS  state: stress 8 VLANs (fill to max)
   PASS  state: VLAN VID 1 (common edge case)
   PASS  state: VLAN VID 4094 (max)
   PASS  state: concurrent VLAN adds (4 parallel)
   PASS  state: concurrent VLAN deletes (4 parallel)
   PASS  state: add/del storm (200 ops, 5 VIDs)
   PASS  state: over-limit VLAN rejected, existing survive
   PASS  reset: VLANs recover after VF PCI FLR
   PASS  reset: 5 VLANs recover after VF PCI FLR
   PASS  reset: rapid VF resets x5 with VLANs
   PASS  reset: VLANs survive PF link flap
   PASS  reset: 5 VLANs survive PF link flap
   PASS  reset: VLANs survive 3x PF link flap
   PASS  reset: VLANs survive PF PCI FLR
   PASS  reset: all 8 VLANs recover after VF FLR
   PASS  reset: all 8 VLANs survive PF link flap
   PASS  reset: all 8 VLANs survive PF PCI FLR
   PASS  reset: FLR during VLAN add/del (race)
   PASS  reset: VF driver unbind/bind cycle
   PASS  ping: basic VLAN traffic
   PASS  ping: 5 VLANs simultaneously
   PASS  ping: survives VF down/up
   PASS  ping: survives 10x rapid VF flap
   PASS  ping: survives VF PCI FLR
   PASS  ping: survives PF link flap
   PASS  ping: survives PF PCI FLR
   PASS  ping: stable while adding/removing other VLANs
   PASS  ping: all 3 VLANs work after down/up
   PASS  ping: parallel VLAN churn from both VFs
   PASS  ping: VLANs work after rapid add/del churn
   PASS  ping: VLANs survive repeated NS move cycle
   PASS  ping: all VLANs survive PF link flap
   PASS  ping: VLAN isolation (no cross-VLAN leakage)
   PASS  ping: traffic works with spoofchk enabled
   PASS  ping: port VLAN (PF-assigned pvid)
   PASS  dmesg: no call traces / BUGs / stalls

================================================
   PASS 52  |  FAIL 0  |  SKIP 0  |  TOTAL 52
================================================
   RESULT: OK

Additionally, interface up/down performance with active VLAN
filtering is significantly improved. The previous bottleneck was a
synchronous VLAN filtering cycle (VF -> PF -> HW -> PF -> VF) that
used the AdminQ for per-VLAN updates, introducing substantial
latency on every interface transition.
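A rough way to observe that latency is to time a down/up cycle with
the filters in place. This is a sketch, not the measurement used for
the series; the device name and VID range are assumptions:

```shell
#!/bin/sh
# Sketch: time an interface down/up cycle with 8 VLAN filters active.
# With synchronous per-VLAN AdminQ round-trips, the cycle time grows
# with the number of configured filters.
VF=enp65s0f0v0                       # assumed VF netdev name
for vid in $(seq 101 108); do        # recreate an 8-VLAN setup
    ip link add link "$VF" name "$VF.$vid" type vlan id "$vid" 2>/dev/null
done

time sh -c "ip link set $VF down; ip link set $VF up"
```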

Test suite:

https://github.com/torvalds/linux/commit/5c60850c33da80a1c2497fb6bc31f956316197a9 


Regards,

Petr



Thread overview: 8+ messages
2026-03-02 11:40 [PATCH RFC iwl-next 0/4] iavf: fix VLAN filter state machine races Petr Oros
2026-03-02 11:40 ` [PATCH RFC iwl-next 1/4] iavf: rename IAVF_VLAN_IS_NEW to IAVF_VLAN_ADDING Petr Oros
2026-03-16 11:34   ` [Intel-wired-lan] " Loktionov, Aleksandr
2026-03-02 11:40 ` [PATCH RFC iwl-next 2/4] iavf: stop removing VLAN filters from PF on interface down Petr Oros
2026-03-16 11:35   ` [Intel-wired-lan] " Loktionov, Aleksandr
2026-03-02 11:40 ` [PATCH RFC iwl-next 3/4] iavf: wait for PF confirmation before removing VLAN filters Petr Oros
2026-03-02 11:40 ` [PATCH RFC iwl-next 4/4] iavf: harden VLAN filter state machine race handling Petr Oros
2026-03-06 13:12 ` Petr Oros [this message]
