From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "Bjorn Helgaas" <bhelgaas@google.com>,
"Jürgen Groß" <jgross@suse.com>,
"Roger Pau Monné" <roger.pau@citrix.com>,
"Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
xen-devel <xen-devel@lists.xenproject.org>,
linux-kernel@vger.kernel.org, regressions@lists.linux.dev,
"Felix Fietkau" <nbd@nbd.name>,
"Lorenzo Bianconi" <lorenzo@kernel.org>,
"Ryder Lee" <ryder.lee@mediatek.com>,
linux-pci@vger.kernel.org
Subject: Re: Config space access to Mediatek MT7922 doesn't work after device reset in Xen PV dom0 (regression, Linux 6.12)
Date: Wed, 29 Jan 2025 04:22:43 +0100 [thread overview]
Message-ID: <Z5mfA32bvEn6yD-C@mail-itl> (raw)
In-Reply-To: <20250129030315.GA392478@bhelgaas>
[-- Attachment #1: Type: text/plain, Size: 10845 bytes --]
On Tue, Jan 28, 2025 at 09:03:15PM -0600, Bjorn Helgaas wrote:
> [+cc linux-pci]
>
> On Wed, Jan 29, 2025 at 03:10:49AM +0100, Marek Marczykowski-Górecki wrote:
> > On Tue, Jan 28, 2025 at 07:15:26PM -0600, Bjorn Helgaas wrote:
> > > On Fri, Jan 17, 2025 at 01:05:30PM +0100, Marek Marczykowski-Górecki wrote:
> > > > After updating PV dom0 to Linux 6.12, The Mediatek MT7922 device reports
> > > > all 0xff when accessing its config space. This happens only after device
> > > > reset (which is also triggered when binding the device to the
> > > > xen-pciback driver).
> > >
> > > Thanks for the report and for all the debugging you've already done!
> > >
> > > > Reproducer:
> > > >
> > > > # lspci -xs 01:00.0
> > > > 01:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter
> > > > 00: c3 14 16 06 00 00 10 00 00 00 80 02 10 00 00 00
> > > > ...
> > > > # echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
> > > > # lspci -xs 01:00.0
> > > > 01:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter
> > > > 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > >
> > > > The same operation done on Linux 6.12 running without Xen works fine.
> > > >
> > > > git bisect points at:
> > > >
> > > > commit d591f6804e7e1310881c9224d72247a2b65039af
> > > > Author: Bjorn Helgaas <bhelgaas@google.com>
> > > > Date: Tue Aug 27 18:48:46 2024 -0500
> > > >
> > > > PCI: Wait for device readiness with Configuration RRS
> > > >
> > > > part of that commit:
> > > > @@ -1311,9 +1320,15 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> > > > return -ENOTTY;
> > > > }
> > > >
> > > > - pci_read_config_dword(dev, PCI_COMMAND, &id);
> > > > - if (!PCI_POSSIBLE_ERROR(id))
> > > > - break;
> > > > + if (root && root->config_crs_sv) {
> > > > + pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> > > > + if (!pci_bus_crs_vendor_id(id))
> > > > + break;
> > > > + } else {
> > > > + pci_read_config_dword(dev, PCI_COMMAND, &id);
> > > > + if (!PCI_POSSIBLE_ERROR(id))
> > > > + break;
> > > > + }
> > > >
> > > >
> > > > Adding some debugging, the PCI_VENDOR_ID read in pci_dev_wait() returns
> > > > initially 0xffffffff. If I extend the condition with
> > > > "&& !PCI_POSSIBLE_ERROR(id)", then the issue disappear. But reading the
> > > > patch description, it would break VF.
> > > > I'm not sure where the issue is, but given it breaks only when running
> > > > with Xen, I guess something is wrong with "Configuration RRS Software
> > > > Visibility" in that case.
> > >
> > > I'm missing something. If you get 0xffffffff, that is not the 0x0001
> > > Vendor ID, so pci_dev_wait() should exit immediately.
> >
> > I'm not sure what is going on there either, but my _guess_ is that the
> > loop exits too early due to the above. And it makes some further actions
> > to fail.
>
> When RRS SV is enabled, reading PCI_VENDOR_ID should always return
> 0x0001 (if the device isn't ready and responds with RRS status) or the
> valid Vendor ID. I don't think it should ever return 0xffff (unless
> the device is powered off, unplugged, or broken, of course).
Maybe it isn't really enabled when Xen is involved?
By looking at lspci of the bridge for this device, I do see RootCtl: ...
CRSVisible+, but maybe there is something else needed too?
Just in case, full lspci -vvvs 2.2 (the bridge):
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge (prog-if 00 [Normal decode])
Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin ? routed to IRQ 102
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 0000f000-00000fff [disabled] [32-bit]
Memory behind bridge: 90b00000-90bfffff [size=1M] [32-bit]
Prefetchable memory behind bridge: 8010900000-80109fffff [size=1M] [32-bit]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag+ RBE+ TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #4, Speed 16GT/s, Width x1, ASPM L1, Exit Latency L1 <64us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x1
TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #0, PowerLimit 75W; Interlock- NoCompl+
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet- LinkState+
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee08000 Data: 4000
Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=150us
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
T_CommonMode=10us LTR1.2_Threshold=166912ns
L1SubCtl2: T_PwrOn=150us
Capabilities: [400 v1] Data Link Feature <?>
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [440 v1] Lane Margining at the Receiver
PortCap: Uses Driver-
PortSta: MargReady- MargSoftReady-
Kernel driver in use: pcieport
>
> > > But the log at
> > > https://github.com/QubesOS/qubes-issues/issues/9689#issuecomment-2582927149
> > > says it *doesn't* exit and eventually times out.
> >
> > Note this log is from "working" kernel, so that timeout must be
> > something else.
>
> I saw it was labeled "NO BUG" but I'm not sure it's labeled correctly
> since there are no interesting messages from the "BUG PRESENT" part.
> Awfully funny coincidence if it's unrelated.
The timeout thing I have seen before, possibly also on a different
hardware (although I'm not 100% sure), I think it's a different issue.
> > > And the lspci above shows ~0 data for much of the header, even though
> > > the device must be ready by then.
> > >
> > > I don't have any good ideas, but since the problem only happens with
> > > Xen, and it seems to affect more than just the Vendor ID, maybe you
> > > could instrument xen_pcibk_config_read() and see if there's something
> > > wonky going on there?
> >
> > This one is used when pcifront (from a different PV VM) is asking pciback
> > to read something. I see the issue even before starting any other VM and
> > not even attaching the device to the xen-pciback driver...
>
> The report claims the problem only happens with Xen. I'm not a Xen
> person, and I don't know how to find the relevant config accessors.
> The snippets of kernel messages I see at [1] all mention pciback, so
> that's my only clue of where to look. Bottom line, I have no idea
> what the config accessor path is, and maybe we could learn something
> by looking at whatever it is.
AFAIK there are no separate config accessors under Xen dom0, the default
ones are used. xen-pcifront takes over PCI config space access (and few
more) only in a domU (and only for PV), when PCI passthrough is used.
Here, it didn't went that far...
But then, Xen may intercept such access [2]. If I read it right, it
should allow all access (is_hardware_domain(dom0)==true, and also the
device is not on ro_map - otherwise reset wouldn't work at all).
>
> [1] https://github.com/QubesOS/qubes-issues/issues/9689#issuecomment-2582927149
[2] https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;h=70150c27227661baa253af8693ff00f2ab640a98;hb=HEAD#l295
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
next prev parent reply other threads:[~2025-01-29 3:22 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-17 12:05 Config space access to Mediatek MT7922 doesn't work after device reset in Xen PV dom0 (regression, Linux 6.12) Marek Marczykowski-Górecki
2025-01-29 1:15 ` Bjorn Helgaas
2025-01-29 2:10 ` Marek Marczykowski-Górecki
2025-01-29 3:03 ` Bjorn Helgaas
2025-01-29 3:22 ` Marek Marczykowski-Górecki [this message]
2025-01-29 3:40 ` Bjorn Helgaas
2025-01-29 3:47 ` Marek Marczykowski-Górecki
2025-01-29 13:32 ` Bjorn Helgaas
2025-01-29 13:52 ` Jan Beulich
2025-01-29 14:50 ` Bjorn Helgaas
2025-01-29 9:17 ` Jan Beulich
2025-01-29 11:53 ` Marek Marczykowski-Górecki
2025-01-29 12:49 ` Jan Beulich
2025-01-29 13:28 ` Bjorn Helgaas
2025-01-29 13:54 ` Jan Beulich
2025-01-29 18:48 ` Bjorn Helgaas
2025-01-30 4:55 ` Marek Marczykowski-Górecki
2025-01-30 9:30 ` Jan Beulich
2025-01-30 21:31 ` Bjorn Helgaas
2025-01-31 7:13 ` Jan Beulich
2025-01-31 8:36 ` Marek Marczykowski-Górecki
2025-02-05 22:14 ` Marek Marczykowski-Górecki
2025-02-07 22:00 ` Bjorn Helgaas
2025-02-07 22:10 ` Marek Marczykowski-Górecki
2025-02-07 22:23 ` Bjorn Helgaas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z5mfA32bvEn6yD-C@mail-itl \
--to=marmarek@invisiblethingslab.com \
--cc=bhelgaas@google.com \
--cc=boris.ostrovsky@oracle.com \
--cc=helgaas@kernel.org \
--cc=jgross@suse.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lorenzo@kernel.org \
--cc=nbd@nbd.name \
--cc=regressions@lists.linux.dev \
--cc=roger.pau@citrix.com \
--cc=ryder.lee@mediatek.com \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.