linux-doc.vger.kernel.org archive mirror
From: poza@codeaurora.org
To: rajatxjain@gmail.com
Cc: Rajat Jain <rajatja@google.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Philippe Ombredanne <pombredanne@nexb.com>,
	Kate Stewart <kstewart@linuxfoundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Frederick Lawler <fred@fredlawl.com>,
	"Busch, Keith" <keith.busch@intel.com>,
	Gabriele Paoloni <gabriele.paoloni@huawei.com>,
	Alexandru Gagniuc <mr.nuke.me@gmail.com>,
	Thomas Tai <thomas.tai@oracle.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	linux-pci <linux-pci@vger.kernel.org>,
	linux-doc <linux-doc@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jes Sorensen <jsorensen@fb.com>, Kyle McMartin <jkkm@fb.com>
Subject: Re: [PATCH v2 5/5] Documentation/ABI: Add details of PCI AER statistics
Date: Thu, 21 Jun 2018 14:49:17 +0530
Message-ID: <52d480b22e70713966039443931c2697@codeaurora.org>
In-Reply-To: <CAA93t1r2fG+U9av2jj8jKrVfHrc_UABOS3DVO-UfHgcsJKyv7g@mail.gmail.com>

On 2018-06-19 22:01, Rajat Jain wrote:
> On Mon, Jun 18, 2018 at 11:03 PM,  <poza@codeaurora.org> wrote:
>> On 2018-06-19 05:41, Rajat Jain wrote:
>>> 
>>> Hello,
>>> 
>>> On Sat, Jun 16, 2018 at 10:24 PM <poza@codeaurora.org> wrote:
>>>> 
>>>> 
>>>> On 2018-05-23 23:28, Rajat Jain wrote:
>>>> > Add the PCI AER statistics details to
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > and provide a pointer to it in
>>>> > Documentation/PCI/pcieaer-howto.txt
>>>> >
>>>> > Signed-off-by: Rajat Jain <rajatja@google.com>
>>>> > ---
>>>> > v2: Move the documentation to Documentation/ABI/
>>>> >
>>>> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
>>>> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
>>>> >  2 files changed, 108 insertions(+)
>>>> >  create mode 100644
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > new file mode 100644
>>>> > index 000000000000..f55c389290ac
>>>> > --- /dev/null
>>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > @@ -0,0 +1,103 @@
>>>> > +==========================
>>>> > +PCIe Device AER statistics
>>>> > +==========================
>>>> > +These attributes show up under all the devices that are AER capable.
>>>> > These
>>>> > +statistical counters indicate the errors "as seen/reported by the
>>>> > device".
>>>> > +Note that this may mean that if an end point is causing problems, the
>>>> > AER
>>>> > +counters may increment at its link partner (e.g. root port) because
>>>> > the
>>>> > +errors will be "seen" / reported by the link partner and not the
>>>> > +problematic end point itself (which may report all counters as 0 as it
>>>> > never
>>>> > +saw any problems).
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of correctable errors seen and reported by
>>>> > this
>>>> > +             PCI device using ERR_COR.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of uncorrectable fatal errors seen and
>>>> > reported
>>>> > +             by this PCI device using ERR_FATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of uncorrectable non-fatal errors seen and
>>>> > reported
>>>> > +             by this PCI device using ERR_NONFATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Breakdown of correctable errors seen and reported by
>>>> > this
>>>> > +             PCI device using ERR_COR. A sample result looks like
>>>> > this:
>>>> > +-----------------------------------------
>>>> > +Receiver Error = 0x174
>>>> > +Bad TLP = 0x19
>>>> > +Bad DLLP = 0x3
>>>> > +REPLAY_NUM Rollover = 0x0
>>>> > +Replay Timer Timeout = 0x1
>>>> > +Advisory Non-Fatal = 0x0
>>>> > +Corrected Internal Error = 0x0
>>>> > +Header Log Overflow = 0x0
>>>> > +-----------------------------------------
>>>> Why display these in hex? Decimal is easier to read, since these are counters.
>>> 
>>> 
>>> Have no particular preference. Since these can be potentially large
>>> numbers, just had a random thought that hex might make it more
>>> concise. I can change to decimal if that is preferable.
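
Whichever base is chosen, a userspace reader can accept both. Below is a
minimal Python sketch, assuming the "Name = value" line format shown in the
sample above; the helper name parse_aer_breakdown is hypothetical and not
part of the patch.

```python
def parse_aer_breakdown(text):
    """Parse 'Name = value' lines into a dict mapping name -> int.

    Assumes the sample format quoted above. Values may be hex (0x...)
    or plain decimal; int(..., 0) infers the base from the prefix.
    """
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # skip separator rows and blank lines
        counters[name.strip()] = int(value.strip(), 0)
    return counters


# Sample text copied from the quoted documentation (a subset of the rows).
SAMPLE = """\
-----------------------------------------
Receiver Error = 0x174
Bad TLP = 0x19
Bad DLLP = 0x3
Replay Timer Timeout = 0x1
-----------------------------------------
"""

if __name__ == "__main__":
    for name, count in parse_aer_breakdown(SAMPLE).items():
        print(f"{name}: {count}")
```

With this approach the hex-versus-decimal choice only affects how the raw
file reads to a human, not what tooling has to do.
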
>>> 
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Breakdown of uncorrectable errors seen and reported by
>>>> > this
>>>> > +             PCI device using ERR_FATAL or ERR_NONFATAL. A sample
>>>> > result
>>>> > +             looks like this:
>>>> > +-----------------------------------------
>>>> > +Undefined = 0x0
>>>> > +Data Link Protocol = 0x0
>>>> > +Surprise Down Error = 0x0
>>>> > +Poisoned TLP = 0x0
>>>> > +Flow Control Protocol = 0x0
>>>> > +Completion Timeout = 0x0
>>>> > +Completer Abort = 0x0
>>>> > +Unexpected Completion = 0x0
>>>> > +Receiver Overflow = 0x0
>>>> > +Malformed TLP = 0x0
>>>> > +ECRC = 0x0
>>>> > +Unsupported Request = 0x0
>>>> > +ACS Violation = 0x0
>>>> > +Uncorrectable Internal Error = 0x0
>>>> > +MC Blocked TLP = 0x0
>>>> > +AtomicOp Egress Blocked = 0x0
>>>> > +TLP Prefix Blocked Error = 0x0
>>>> > +-----------------------------------------
>>>> > +
>>>> > +============================
>>>> > +PCIe Rootport AER statistics
>>>> > +============================
>>>> > +These attributes show up only under the rootports that are AER capable.
>>>> > These
>>>> > +indicate the number of error messages as "reported to" the rootport.
>>>> > Please note
>>>> > +that the rootports also transmit (internally) the ERR_* messages for
>>>> > errors seen
>>>> > +by the internal rootport PCI device, so these counters include them
>>>> > and are
>>>> > +thus cumulative of all the error messages on the PCI hierarchy
>>>> > originating
>>>> > +at that root port.
>>>> 
>>>> what about switches and bridges ?
>>> 
>>> 
>>> What about them? AIUI, the switches forward the ERR_ messages from
>>> downstream devices to the rootport, like they do with standard
>>> messages. They can potentially generate their own ERR_ message and
>>> that would be reported no different than other end point devices.
>> 
>> 
>> 
>> Yes, what I meant to ask is this: an ERR_FATAL message coming from an EP
>> can be contained by a switch, and the error handling code treats the
>> error as contained by the switch irrespective of AER or DPC, so it
>> assumes the problem could be with the switch/bridge upstream link.
>>
>> Hence the pci_dev of the switch is where you should increment your
>> counters. Of course ERR_FATAL would have traversed up to the RP, but
>> that doesn't mean you should account for the error there.
> 
> In this case, for the pci_dev for the rootport:
> - rootport_total_fatal_errors will be incremented (since it will get 
> ERR_FATAL)
> - dev_total_fatal_errors will not be incremented.

OK, but my confusion is this: should you not be incrementing the counter
against the pci_dev of the switch, and not the RP? The problem was with
the upstream link of the EP (e.g. the switch).

> 
> The dev_total_fatal_errors will be incremented only for the pci device
> identified by the "Error Source Identification Register" in the PCIe
> spec. Does this help clarify?
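
A minimal Python sketch of the accounting as described in the reply above,
under stated assumptions: the class AerModel, its method, and the BDF
strings are hypothetical illustrations, not the kernel implementation. It
only models which counter moves when a root port receives ERR_FATAL and the
Error Source Identification Register names some device.

```python
from collections import Counter


class AerModel:
    """Toy model of the two fatal-error counters discussed in this thread."""

    def __init__(self):
        # Per-device counters, keyed by an illustrative BDF string.
        self.dev_total_fatal_errs = Counter()
        self.rootport_total_fatal_errs = Counter()

    def on_err_fatal(self, root_port, source_dev):
        """Root port `root_port` received ERR_FATAL; the Error Source
        Identification Register identifies `source_dev` as the reporter."""
        # The root port counts every ERR_FATAL message it is informed of.
        self.rootport_total_fatal_errs[root_port] += 1
        # Only the identified source device gets its dev_* counter bumped.
        self.dev_total_fatal_errs[source_dev] += 1


if __name__ == "__main__":
    model = AerModel()
    # Hypothetical topology: root port 00:1c.0, switch port 02:00.0 below it.
    model.on_err_fatal(root_port="00:1c.0", source_dev="02:00.0")
    print(dict(model.rootport_total_fatal_errs))
    print(dict(model.dev_total_fatal_errs))
```

In this model the switch's pci_dev is charged whenever the switch is the
device named as the error source, while the root port's rootport_* counter
moves in every case.
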

> 
>> 
>> 
>>> 
>>>> Also, can you give some idea of the difference between
>>>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that
>>>> both are on the same pci_dev)?
>>> 
>>> 
>>> For a pci_dev representing the rootport:
>>> 
>>> dev_total_fatal_errors = how many times this PCI device *experienced*
>>> a fatal problem on its own (i.e. either link issues while talking to
>>> its link partner, or some internal errors).
>>> 
>>> rootport_total_fatal_errors = how many times this rootport was
>>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
>>> that originates at it (can be any link further downstream). This
>>> includes the dev_total_fatal_errors also, because any errors detected
>>> by the rootport are also "informed" to itself via ERR_* messages. In
>>> reality, this is just the total number of ERR_FATAL messages received
>>> at the rootport. This sysfs attribute will only exist for root ports.
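
Since the rootport total is described above as including the root port's
own dev total, the difference between the two gives the ERR_FATAL messages
that came from further downstream. A minimal Python sketch, assuming the
attribute names from the patch and assuming each total file holds a single
number (decimal or 0x-prefixed hex); the helper names are hypothetical.

```python
from pathlib import Path

# Attribute names as listed in the quoted ABI document.
TOTAL_ATTRS = (
    "dev_total_cor_errs",
    "dev_total_nonfatal_errs",
    "dev_total_fatal_errs",
    "rootport_total_cor_errs",
    "rootport_total_nonfatal_errs",
    "rootport_total_fatal_errs",
)


def read_aer_totals(dev_dir):
    """Read the total counters under <dev_dir>/aer_stats.

    Attributes that do not exist (e.g. rootport_* on a non-root-port
    device) are simply left out of the result.
    """
    stats_dir = Path(dev_dir) / "aer_stats"
    totals = {}
    for attr in TOTAL_ATTRS:
        path = stats_dir / attr
        if path.is_file():
            # Assumption: one number per file; base inferred from prefix.
            totals[attr] = int(path.read_text().strip(), 0)
    return totals


def downstream_fatal_errs(totals):
    """ERR_FATAL messages received by a root port that it did not
    generate itself, per the explanation quoted above."""
    return totals["rootport_total_fatal_errs"] - totals["dev_total_fatal_errs"]
```

On a real system dev_dir would be a path of the form
/sys/bus/pci/devices/<dev>, as given in the documentation.
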
>>> 
>>>> 
>>>> So rootport_total_fatal_errs gives me an idea of how many times
>>>> things have failed under this pci_dev?
>>> 
>>> 
>>> Yes, as above.
>>> 
>>>> That means the number of downstream link problems. But I am still
>>>> trying to make sense of how it could be used, since we don't have
>>>> BDF information associated with the number of errors anywhere
>>>> (except in these AER print messages).
>>>> 
>>> 
>>> Agreed. That is a limitation. The challenges are more record keeping,
>>> a more complicated sysfs representation, and, given that PCI devices
>>> may come and go, knowing that it is the same device before we collate
>>> its stats.
>>> 
>>>> 
>>>> As for dev_total_fatal_errs: as you mentioned above, for a problematic
>>>> EP, say the root port reports it and does dev_total_fatal_errs++.
>>>> Does it also do root-port_total_fatal_errs++ in that scenario?
>>> 
>>> 
>>> Yes, as above, it will also do root-port_total_fatal_errs++ for the
>>> root port of that hierarchy.
>>> 
>>> Thanks,
>>> 
>>> Rajat
>>> 
>>>> 
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_COR messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_NONFATAL messages reported to
>>>> > rootport.
>>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>>>> > b/Documentation/PCI/pcieaer-howto.txt
>>>> > index acd0dddd6bb8..91b6e677cb8c 100644
>>>> > --- a/Documentation/PCI/pcieaer-howto.txt
>>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>>>> > device who sends
>>>> >  the error message to root port. Pls. refer to pci express specs for
>>>> >  other fields.
>>>> >
>>>> > +2.4 AER Statistics / Counters
>>>> > +
>>>> > +When PCIe AER errors are captured, the counters / statistics are also
>>>> > exposed
>>>> > +in the form of sysfs attributes, which are documented at
>>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> >  3. Developer Guide
