From: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
Alex Williamson <alex.williamson@redhat.com>,
Leon Romanovsky <leon@kernel.org>, Jason Gunthorpe <jgg@ziepe.ca>
Subject: Re: Write to srvio_numvfs triggers kernel panic
Date: Sat, 7 May 2022 10:22:32 +0000
Message-ID: <87v8uhlk1w.fsf@epam.com>
In-Reply-To: <20220506201722.GA555374@bhelgaas>
Hello Bjorn,
Bjorn Helgaas <helgaas@kernel.org> writes:
> [+cc Alex, Leon, Jason]
>
> On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
>>
>> Hello,
>>
>> I have encountered an issue when PCI code tries to use both fields in
>>
>> union {
>> struct pci_sriov *sriov; /* PF: SR-IOV info */
>> struct pci_dev *physfn; /* VF: related PF */
>> };
>>
>> (which are part of struct pci_dev) at the same time.
>>
>> The symptoms are as follows:
>>
>> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
>>
>> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
>> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
>> Unable to handle kernel paging request at virtual address 0001000200000010
>> Mem abort info:
>> ESR = 0x96000004
>> EC = 0x25: DABT (current EL), IL = 32 bits
>> SET = 0, FnV = 0
>> EA = 0, S1PTW = 0
>> Data abort info:
>> ISV = 0, ISS = 0x00000004
>> CM = 0, WnR = 0
>> [0001000200000010] address between user and kernel address ranges
>> Internal error: Oops: 96000004 [#1] PREEMPT SMP
>> Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
>> nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
>> pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
>> CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
>> Hardware name: XENVM-4.17 (DT)
>> pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
>> pc : pcie_aspm_get_link+0x90/0xcc
>> lr : pcie_aspm_get_link+0x8c/0xcc
>> sp : ffff8000130d39c0
>> x29: ffff8000130d39c0 x28: 00000000000001a4
>> x27: 00000000ffffee4b x26: ffff80001164f560
>> x25: 0000000000000000 x24: 0000000000000000
>> x23: ffff80001164f660 x22: 0000000000000000
>> x21: ffff000003f08000 x20: ffff800010db37d8
>> x19: ffff000004b8e780 x18: ffffffffffffffff
>> x17: 0000000000000000 x16: 00000000deadbeef
>> x15: ffff8000930d36c7 x14: 0000000000000006
>> x13: ffff8000115c2710 x12: 000000000000081c
>> x11: 00000000000002b4 x10: ffff8000115c2710
>> x9 : ffff8000115c2710 x8 : 00000000ffffefff
>> x7 : ffff80001161a710 x6 : ffff80001161a710
>> x5 : ffff00003fdad900 x4 : 0000000000000000
>> x3 : 0000000000000000 x2 : 0000000000000000
>> x1 : ffff000003c51c80 x0 : 0001000200000000
>> Call trace:
>> pcie_aspm_get_link+0x90/0xcc
>> aspm_ctrl_attrs_are_visible+0x30/0xc0
>> internal_create_group+0xd0/0x3cc
>> internal_create_groups.part.0+0x4c/0xc0
>> sysfs_create_groups+0x20/0x34
>> device_add+0x2b4/0x760
>> pci_device_add+0x814/0x854
>> pci_iov_add_virtfn+0x240/0x2f0
>> sriov_enable+0x1f8/0x474
>> pci_sriov_configure_simple+0x38/0x90
>> sriov_numvfs_store+0xa4/0x1a0
>> dev_attr_store+0x1c/0x30
>> sysfs_kf_write+0x48/0x60
>> kernfs_fop_write_iter+0x118/0x1ac
>> new_sync_write+0xe8/0x184
>> vfs_write+0x23c/0x2a0
>> ksys_write+0x68/0xf4
>> __arm64_sys_write+0x20/0x2c
>> el0_svc_common.constprop.0+0x78/0x1a0
>> do_el0_svc+0x28/0x94
>> el0_svc+0x14/0x20
>> el0_sync_handler+0xa4/0x130
>> el0_sync+0x180/0x1c0
>> Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)
>>
>>
>> Debugging showed the following:
>>
>> pci_iov_add_virtfn() allocates new struct pci_dev:
>>
>> virtfn = pci_alloc_dev(bus);
>> and sets physfn:
>> virtfn->is_virtfn = 1;
>> virtfn->physfn = pci_dev_get(dev);
>>
>> then we will get into sriov_init() via the following call path:
>>
>> pci_device_add(virtfn, virtfn->bus);
>> pci_init_capabilities(dev);
>> pci_iov_init(dev);
>> sriov_init(dev, pos);
>
> We called pci_device_add() with the VF. pci_iov_init() only calls
> sriov_init() if it finds an SR-IOV capability on the device:
>
> pci_iov_init(struct pci_dev *dev)
> pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
> if (pos)
> return sriov_init(dev, pos);
>
> So this means the VF must have an SR-IOV capability, which sounds a
> little dubious. From PCIe r6.0:
[...]
Yes, I dug into debugging and came to the same conclusion. I'm still
investigating this, but it looks like my PCIe controller (DesignWare-based)
reads the configuration space of the VF incorrectly: instead of accessing
the VF's config space, it returns the PF's.
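To illustrate the check Bjorn describes, here is a minimal user-space sketch
that walks the extended capability list in a 4 KiB config-space dump and
reports whether an SR-IOV capability is present. It is not the kernel's
pci_find_ext_capability(); the helper names (cfg_read32, find_ext_cap,
vf_config_looks_like_pf) and the idea of working on a raw dump are
assumptions made for illustration. Only the header layout (ID in bits 15:0,
next offset in bits 31:20, list starting at 0x100) and the SR-IOV capability
ID 0x10 come from the PCIe specification.

```c
#include <stdint.h>
#include <string.h>

#define CFG_SPACE_SIZE   4096
#define EXT_CAP_START    0x100
#define EXT_CAP_ID_SRIOV 0x10   /* same value as PCI_EXT_CAP_ID_SRIOV */

/* Read a 32-bit word from a config-space dump held in host byte order.
 * For a raw dump of a real device this assumes a little-endian host. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int off)
{
    uint32_t v;

    memcpy(&v, cfg + off, sizeof(v));
    return v;
}

/* Return the offset of extended capability cap_id, or 0 if it is absent. */
static unsigned int find_ext_cap(const uint8_t *cfg, unsigned int cap_id)
{
    unsigned int pos = EXT_CAP_START;
    /* Bound the walk so a corrupted, looping list cannot hang us. */
    int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;

    while (pos >= EXT_CAP_START && pos <= CFG_SPACE_SIZE - 4 && ttl-- > 0) {
        uint32_t hdr = cfg_read32(cfg, pos);

        if (hdr == 0 || hdr == 0xffffffffu)
            break;                      /* empty list or failed read */
        if ((hdr & 0xffff) == cap_id)
            return pos;
        pos = (hdr >> 20) & 0xffc;      /* next capability offset */
    }
    return 0;
}

/* A VF that exposes an SR-IOV capability is suspicious: one possible
 * explanation is that the read was served from the PF's config space. */
static int vf_config_looks_like_pf(const uint8_t *vf_cfg)
{
    return find_ext_cap(vf_cfg, EXT_CAP_ID_SRIOV) != 0;
}
```

Running this over dumps of 01:00.0 and 01:00.2 would be one way to see
whether both functions return the same capability list.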
>
> Can you supply the output of "sudo lspci -vv" for your system?
Sure:
root@spider:~# lspci -vv
00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 189
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: 30000000-301fffff [size=2M]
Prefetchable memory behind bridge: [disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
Address: 0000000004030040 Data: 0000
Masking: fffffffe Pending: 00000000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (ok), Width x2 (ok)
TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
RootCap: CRSVisible-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn- NFERptEn- FERptEn-
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [19c v1] Lane Margining at the Receiver <?>
Capabilities: [1bc v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [304 v1] Data Link Feature <?>
Capabilities: [310 v1] Precision Time Measurement
PTMCap: Requester:+ Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Kernel driver in use: pcieport
Kernel modules: pci_endpoint_test
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a809
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 0
NUMA node: 0
Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00 [8/5710]
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00004000
PBA: BAR=0 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: a824
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [3a4 v1] Data Link Feature <?>
Kernel driver in use: nvme
Kernel modules: nvme
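As a cross-check on the SR-IOV capability above, the address of the first VF
seen in the log (0000:01:00.2) follows from "VF offset: 2, stride: 1". The
sketch below is a user-space illustration of that arithmetic, in which the
VF routing ID is the PF routing ID plus First VF Offset plus the VF index
times VF Stride. The function names make_devfn and vf_rid are hypothetical
and are not kernel helpers.

```c
#include <stdint.h>

/* Pack device and function numbers the way a conventional devfn is laid
 * out: 5 bits of device number, 3 bits of function number. */
static uint8_t make_devfn(unsigned int dev, unsigned int fn)
{
    return (uint8_t)(((dev & 0x1f) << 3) | (fn & 0x07));
}

/* Routing ID (bus in bits 15:8, devfn in bits 7:0) of VF number vf_id,
 * counting from 0, given the PF address and the offset and stride values
 * from the PF's SR-IOV capability. */
static uint16_t vf_rid(uint8_t pf_bus, uint8_t pf_devfn,
                       uint16_t offset, uint16_t stride, unsigned int vf_id)
{
    uint16_t pf_rid = (uint16_t)((pf_bus << 8) | pf_devfn);

    return (uint16_t)(pf_rid + offset + stride * vf_id);
}
```

For the PF at 01:00.0 with offset 2 and stride 1, VF 0 gets routing ID
0x0102, that is bus 01, devfn 0x02, which is the 01:00.2 reported when
sriov_numvfs is written.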
> It could be that the device has an SR-IOV capability when it
> shouldn't. But even if it does, Linux could tolerate that better
> than it does today.
>
Agreed. I can create a simple patch that checks for is_virtfn
in sriov_init(). But what should the code do if it is set?
>> sriov_init() overwrites value in the union:
>> dev->sriov = iov; <<<<<---- There
>> dev->is_physfn = 1;
>>
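The overwrite and one possible guard can be modeled in user space as follows.
This is only a sketch under stated assumptions: struct pci_dev and struct
pci_sriov are reduced to the fields discussed here, the function names
sriov_init_unguarded and sriov_init_guarded are hypothetical, and returning
-EINVAL is just one option, since what to do when is_virtfn is set is still
an open question in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures (illustration only). */
struct pci_sriov {
    int total_vfs;
};

struct pci_dev {
    unsigned int is_physfn:1;
    unsigned int is_virtfn:1;
    union {
        struct pci_sriov *sriov;   /* PF: SR-IOV info */
        struct pci_dev *physfn;    /* VF: related PF */
    };
};

/* Mirrors the two assignments quoted from sriov_init(): if dev is a VF,
 * storing iov destroys the physfn pointer that shares the same storage. */
static void sriov_init_unguarded(struct pci_dev *dev, struct pci_sriov *iov)
{
    dev->sriov = iov;
    dev->is_physfn = 1;
}

/* Same assignments, but refuse to treat a VF as a PF.
 * Assumption: rejecting with -EINVAL; other policies are possible. */
static int sriov_init_guarded(struct pci_dev *dev, struct pci_sriov *iov)
{
    if (dev->is_virtfn)
        return -EINVAL;

    dev->sriov = iov;
    dev->is_physfn = 1;
    return 0;
}
```

With the guard, a VF keeps its physfn pointer intact, so later code that
follows physfn no longer reads a struct pci_sriov as if it were a
struct pci_dev.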
--
Volodymyr Babchuk at EPAM