linux-pci.vger.kernel.org archive mirror
From: Yishai Hadas <yishaih@dev.mellanox.co.il>
To: Don Dutile <ddutile@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	"Pandarathil, Vijaymohan R" <vijaymohan.pandarathil@hp.com>,
	Myron Stowe <myron.stowe@redhat.com>,
	"linux-rdma (linux-rdma@vger.kernel.org)"
	<linux-rdma@vger.kernel.org>,
	"yishaih@mellanox.com" <yishaih@mellanox.com>,
	liranl@mellanox.com,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: PCI/AER: AER in SRIOV environment
Date: Tue, 24 Jun 2014 01:44:37 +0300
Message-ID: <53A8ADD5.7030207@dev.mellanox.co.il>
In-Reply-To: <53A88A32.4010406@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 5823 bytes --]

On 6/23/2014 11:12 PM, Don Dutile wrote:
> On 06/23/2014 03:09 PM, Bjorn Helgaas wrote:
>> [+cc linux-pci, Don]
>>
> Adding Alex Williamson in case he can add more to this conversation...
>
>> On Mon, Jun 23, 2014 at 8:29 AM, Yishai Hadas
>> <yishaih@dev.mellanox.co.il> wrote:
>>> Hi Vijay,
>>> Trying to add AER support for Mellanox NIC in SRIOV environment, while
>>> evaluating/testing encountered a problem which led me to your
>>> patch accepted as part of kernel 3.8, commit ID
>>> "918b4053184c0ca22236e70e299c5343eea35304".
>>>
>>> Have some concerns/questions on:
>>> When working in SRIOV environment VFs may be un-attached, having no
>>> driver assigned to, or may be attached to Virtual machine to work in some
>>> pass-through mode.
>>> Once working in KVM setup there is pci-stub driver which is loaded in the
>>> HYP/PF for a given attached VF.
> huh? 'loaded in the hyp/pf? .... um, loaded in the host, and a VF is
> detached from its host driver -- a VF can be used in the host w/o any
> virtualization, i.e., that's how guest VM is driving the VF: as if it
> was used by a guest (host) OS directly -- and attached to pci-stub
> driver, when assigned to a KVM guest in pre-VFIO days/ways.
> If VFIO used, then VF is attached to vfio-pci driver.
>
>>>
>>> I'm using the aer-inject kernel module and its corresponding aer-inject tool
>>> to simulate an error in the HYP.
>>> In both cases your commit will cause the AER recovery to fail as there is no
>>> driver assigned to PF's VFs that supports AER, comparing the code before
>>> your change.
>>>
> Without VFIO, I believe that's correct. There was no AER-to-VF support
> pre-VFIO days.
> I believe with the recent VFIO support,
> and modifications to KVM, an AER that is associated with an assigned VF will
> force the crash/halt of the KVM guest -- can't depend on a guest VF driver
> clearing the AER in the hyp/host -- guest isn't privileged enough to clear
> the error.
> So, crashing the guest is the simple option at the moment, to contain the
> error.
> Alex: do I have that (vfio aer default) correct, or is that still
> site-under-construction?
     What about the case where the VF is not attached to a KVM guest and
     has no driver loaded on the host? From code review and some testing,
     recovery fails in that case because there is no AER-aware driver for
     the VF. What is the expected solution?
     Is any special QEMU setup needed to activate the VFIO support? I would
     like to try it for the case where the VF is attached to a guest.
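
To make the failure mode concrete, here is a rough Python model of how the per-device "votes" could be merged during the bus walk. It is only a sketch of the behaviour as I understand it, not the kernel code: the enum members mirror the PCI_ERS_RESULT_* names, while merge_vote() and walk_bus() are hypothetical helpers, and a device without an AER-aware driver is assumed to always vote NO_AER_DRIVER.

```python
from enum import Enum


class Vote(Enum):
    # Names mirror the kernel's PCI_ERS_RESULT_* values; numbering is arbitrary.
    NONE = 1
    CAN_RECOVER = 2
    NEED_RESET = 3
    DISCONNECT = 4
    RECOVERED = 5
    NO_AER_DRIVER = 6


def merge_vote(orig, new):
    """Combine the running result with one device's vote (simplified model)."""
    if new is Vote.NO_AER_DRIVER:
        # A single device without an AER-aware driver poisons the result.
        return Vote.NO_AER_DRIVER
    if new is Vote.NONE:
        return orig
    if orig in (Vote.CAN_RECOVER, Vote.RECOVERED):
        return new
    if orig is Vote.DISCONNECT and new is Vote.NEED_RESET:
        return Vote.NEED_RESET
    return orig


def walk_bus(devices):
    """devices: iterable of (name, vote); vote is None when the device has
    no AER-aware driver (assumption: such a device votes NO_AER_DRIVER)."""
    result = Vote.CAN_RECOVER
    for _name, vote in devices:
        result = merge_vote(result, Vote.NO_AER_DRIVER if vote is None else vote)
    return result


pf_only = walk_bus([("21:00.0", Vote.CAN_RECOVER)])
pf_and_vf = walk_bus([("21:00.0", Vote.CAN_RECOVER), ("21:00.1", None)])
```

In this model pf_only ends up as CAN_RECOVER, while pf_and_vf ends up as NO_AER_DRIVER regardless of the order in which the two functions are visited, which matches what I see: recovery fails while the driverless VF is present.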
>
>>> How such cases should work ?  my expectation was that the PF will get the
>>> error detected message then will recognize whether
>>> issue is its own or one of its VFs
> The AER packet will have the tag of the VF in if it was the source of the
> error; so the PF will never see it; although one could argue it should be
> 'promoted' to the PF if PF/VF needs to clear some state it has wrt the VF
> (the SRIOV spec is lacking of info in this space); _but_, VFIO resets the
> VF (sets FLR bit) when the device is deassigned and before re-attachment to
> the host, so that should clear out any state btwn PF & VF ('should' ...
> famous last words...).
     In my test I used the aer-inject tool to simulate an error on the bus
     that both the PF and the VF reside on, setting the function number to
     that of the PF. It looks like both should be called by the AER driver
     as part of pci_walk_bus(). As mentioned, I got a call only on the PF,
     and recovery failed because the VF has no AER-aware driver; once I
     removed the VF, recovery succeeded.
     I believe the packet should include some information about the source
     of the error, shouldn't it?
     In addition, ixgbe_error_detected() in the upstream ixgbe source code
     appears to contain code running on the PF that checks whether the
     source was a VF.
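
For reference, the source of an error is identified by a 16-bit requester ID, which can be split into bus, device and function. The Python sketch below shows that split, both in the conventional layout and with ARI, where the low 8 bits are all function number. decode_source_id() is a hypothetical helper written for this mail, not a kernel or aer-inject API.

```python
def decode_source_id(rid, ari=False):
    """Split a 16-bit PCIe requester/source ID into (bus, device, function).

    Conventional layout: bus[15:8], device[7:3], function[2:0].
    With ARI the device field is implicitly 0 and function is bits [7:0].
    """
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("source ID must fit in 16 bits")
    bus = rid >> 8
    if ari:
        return bus, 0, rid & 0xFF
    return bus, (rid >> 3) & 0x1F, rid & 0x07


# The VF from the attached lspci output, 21:00.1, as a source ID.
vf_bdf = decode_source_id(0x2101)
```

With the values from my setup, 0x2100 decodes to the PF (21:00.0) and 0x2101 to the VF (21:00.1), so a handler that receives the source ID could in principle tell the two apart.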

     By the way: when I tried to simulate a VF error using its function
     number, I got the error below:
     "Error: Failed to write, Inappropriate ioctl for device"
     Any idea what causes this error?
>
>>
>> I'm really not an AER expert, so help me understand this question of
>> recognizing whether an error is associated with a PF or a VF.
>>
>> In terms of hardware, it looks like the device that detects an error
>> logs some information and sends an Error Message upstream.  The Root
>> Complex receives the message, captures the source ID from the Error
>> Message, and may generate an interrupt.  I expect this source ID can
>> be either a PF or a VF; there's no requirement that a VF error must be
>> reported as though it's from the PF, is there?
>>
>>> and work accordingly, in current code
>>> looks like recovery failed as part of "voting" once there is no AER handler
>>> assigned to the VFs.
>>
>> The commit you mentioned has to do with PCI_ERS_RESULT_NO_AER_DRIVER.
>> We use pci_walk_bus() to figure out whether all the devices in a
>> subtree have a driver.  What subtree is involved here?  I would expect
>> the VFs to be siblings of the PF, not children of it, so I'm not sure
>> where things went wrong.
> Well, VFs could be on virtual busses (ARI turned on), so not necessarily a
> sibling to PF ... and then we have the problem in PCI code of not being able
> to traverse these virtual busses (in some cases; not sure if pci_walk_bus(),
> which is going down the tree vs up the tree, has any problems here w/VFs on
> virtual busses).
>
>>
>> Can you collect "lspci -vvv" output and maybe add some debug so we can
>> see exactly where the error is detected and what devices we're looking
>> at to conclude that one of them doesn't have a driver?
     lspci -vvv output for both the PF and the VF is attached. It shows
     that the VF (21:00.1) has no driver loaded, unlike the PF (Kernel
     driver in use: mlx4_core).
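
As a cross-check on where the VF lands, the SR-IOV capability in the attached output reports "VF offset: 1, stride: 1". The small Python sketch below applies the usual routing-ID arithmetic (VF n, counted from 1, sits at PF routing ID + offset + (n - 1) * stride); vf_routing_id() and format_bdf() are hypothetical helper names used only for this illustration.

```python
def vf_routing_id(pf_rid, first_vf_offset, vf_stride, n):
    """Routing ID of the n-th VF (n counted from 1) of a PF."""
    if n < 1:
        raise ValueError("VF numbers start at 1")
    return (pf_rid + first_vf_offset + (n - 1) * vf_stride) & 0xFFFF


def format_bdf(rid):
    """Render a routing ID in the bus:device.function form that lspci prints."""
    return f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x07:x}"


# PF is 21:00.0; offset and stride are taken from the attached lspci output.
first_vf = format_bdf(vf_routing_id(0x2100, 1, 1, 1))
```

For this device the first VF comes out as 21:00.1, i.e. on the same bus as the PF, which is why both functions are visited by the same bus walk.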
>>
>> Bjorn
>>
>


[-- Attachment #2: lspci.txt --]
[-- Type: text/plain, Size: 7889 bytes --]

21:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Subsystem: Hewlett-Packard Company Device 18cf
        Physical Slot: 2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 64
        Region 0: Memory at fbf00000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at f8000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
                Product Name: HP ConnectX-3 Mezz
                Read-only fields:
                        [PN] Part number: 644161-B21
                        [EC] Engineering changes: C4
                        [SN] Serial number: IL224202VW
                        [V0] Vendor specific: HP IB FDR/EN 10/40Gb 2P 544M Adptr
                        [RV] Reserved: checksum good, 0 byte(s) reserved
                Read/write fields:
                        [V1] Vendor specific: N/A
                        [YA] Asset tag: N/A
                        [RW] Read-write area: 102 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 252 byte(s) free
                End
        Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
                Vector table: BAR=0 offset=0007c000
                PBA: BAR=0 offset=0007d000
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [148 v1] Device Serial Number 24-be-05-ff-ff-8b-6b-d0
        Capabilities: [108 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 16, Total VFs: 16, Number of VFs: 1, Function Dependency Link: 00
                VF offset: 1, stride: 1, Device ID: 1004
                Supported Page Size: 000007ff, System Page Size: 00000001
                Region 2: Memory at 00000000d8000000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [154 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [18c v1] #19
        Kernel driver in use: mlx4_core
        Kernel modules: mlx4_core
		
21:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
        Subsystem: Hewlett-Packard Company Device 61b0
        Physical Slot: 2
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 2: [virtual] Memory at d8000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM unknown, Latency L0 <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [9c] MSI-X: Enable- Count=256 Masked-
                Vector table: BAR=2 offset=00002000
                PBA: BAR=2 offset=00003000
        Capabilities: [40] #00 [0000]
        Kernel modules: mlx4_core
		
		
