All of lore.kernel.org
 help / color / mirror / Atom feed
From: poza@codeaurora.org
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	Philippe Ombredanne <pombredanne@nexb.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Kate Stewart <kstewart@linuxfoundation.org>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Dongdong Liu <liudongdong3@huawei.com>,
	Keith Busch <keith.busch@intel.com>, Wei Zhang <wzhang@fb.com>,
	Sinan Kaya <okaya@codeaurora.org>,
	Timur Tabi <timur@codeaurora.org>
Subject: Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits
Date: Wed, 18 Jul 2018 14:43:46 +0530	[thread overview]
Message-ID: <e0296f79855257d6960a63471c53fbf1@codeaurora.org> (raw)
In-Reply-To: <20180717213601.GC128988@bhelgaas-glaptop.roam.corp.google.com>

On 2018-07-18 03:06, Bjorn Helgaas wrote:
> On Tue, Jul 17, 2018 at 02:03:29PM -0500, Bjorn Helgaas wrote:
>> Hi Oza,
>> 
>> Thanks for doing this!
>> 
>> On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote:
>> > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
>> > callbacks in case of ERR_NONFATAL.
>> 
>> IIRC, the current strategy is:
>> 
>>   ERR_COR: log only
>>   ERR_NONFATAL: call driver callbacks (pci_error_handlers)
>>   ERR_FATAL: remove device and re-enumerate
>> 
>> So these slot_reset callbacks are only used for ERR_NONFATAL, which
>> are all uncorrectable errors, of course.
>> 
>> This patch makes it so that when the slot_reset callbacks call
>> pci_cleanup_aer_uncorrect_error_status(), we only clear the
>> bits set by ERR_NONFATAL events (this is determined by
>> PCI_ERR_UNCOR_SEVER).
>> 
>> That makes good sense to me.  All these status bits are RW1CS, so they
>> will be preserved across a reset but will be cleared when we
>> re-enumerate, in this path:
>> 
>>   pci_init_capabilities
>>     pci_aer_init
>>       pci_cleanup_aer_error_status_regs
>>       # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits
>> 
>> > AER uncorrectable error status should take severity into account in order
>> > to clear the bits, so that ERR_NONFATAL path does not clear the bit which
>> > are marked with severity fatal.
>> 
>> Two comments:
>> 
>> 1) Can you split this into two patches, one that changes
>> pci_cleanup_aer_uncorrect_error_status() so it looks like the error
>> clearing code in aer_error_resume(), and a second that factors out the
>> duplicate code?
>> 
>> 2) Maybe use "sev" or "sever" instead of "mask" for the local
>> variable, since there is also a separate PCI_ERR_UNCOR_MASK register,
>> which is not involved here.
> 
> Let me back up a little here: I'm not asking you to do the things
> below here.  They're just possible future things, so we can think
> about them after this series.  And the things above are things I can
> easily do myself.  So no action required from you, unless you think
> I'm on the wrong track :)

I agree with your points, and have taken them into account for future 
series reference as well.

what about PATCH-2 of this series ?
that clears ERR_FATAL bits, but as you said, during re-enumeration
pci_init_capabilities
      pci_aer_init
        pci_cleanup_aer_error_status_regs
        # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits

but that should clear the ERR_FATAL of the devices beneath.

PATCH2: we are doing it for BRIDGE where we think where ERR_FATAL was 
reported by bridge and the problem is with downstream link.
if ((service == PCIE_PORT_SERVICE_AER) &&
	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
		/*
		 * If the error is reported by a bridge, we think this error
		 * is related to the downstream link of the bridge, so we
		 * do error recovery on all subordinates of the bridge instead
		 * of the bridge and clear the error status of the bridge.
		 */
		pci_cleanup_aer_uncorrect_error_status(dev);
	}


so overall, I think all the patches are required, if you have comments 
please let me know.
so far I see that, no action is required from me.

> 
>> 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer
>> really describes what this does.  Something like
>> "pci_aer_clear_nonfatal_status()" would be more descriptive.  But I
>> see you have a subsequent patch (which I haven't looked at yet) that
>> is related to this.
>> 
>> 4) I don't think the driver slot_reset callbacks should be responsible
>> for clearing these AER status bits.  Can we clear them somewhere in
>> the pcie_do_nonfatal_recovery() path and remove these calls from the
>> drivers?
>> 
>> 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per
>> device when handling an error.  We currently read it three times:
>> 
>>   aer_isr
>>     aer_isr_one_error
>>       find_source_device
>>         find_device_iter
>>           is_error_source
>>             read PCI_ERR_UNCOR_STATUS              # 1
>>       aer_process_err_devices
>>         get_device_error_info(e_info->dev[i])
>>           read PCI_ERR_UNCOR_STATUS                # 2
>>         handle_error_source
>>           pcie_do_nonfatal_recovery
>>             ...
>>               report_slot_reset
>>                 driver->err_handler->slot_reset
>>                   pci_cleanup_aer_uncorrect_error_status
>>                     read PCI_ERR_UNCOR_STATUS      # 3
>> 
>> OK, that was more than two comments :)

  reply	other threads:[~2018-07-18  9:13 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Oza Pawandeep
2018-07-17 19:03   ` Bjorn Helgaas
2018-07-17 21:36     ` Bjorn Helgaas
2018-07-18  9:13       ` poza [this message]
2018-06-22  9:58 ` [PATCH v2 2/6] PCI/AER: Clear uncorrectable fatal error status bits Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 3/6] PCI/ERR: Cleanup ERR_FATAL of error broadcast Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 4/6] PCI/AER: Clear device error status bits during ERR_FATAL and ERR_NONFATAL Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 5/6] PCI/AER: Fix correctable status bits clearing in device register Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset() Oza Pawandeep
2018-07-18 19:10   ` Bjorn Helgaas
2018-07-05 15:12 ` [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL poza

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e0296f79855257d6960a63471c53fbf1@codeaurora.org \
    --to=poza@codeaurora.org \
    --cc=bhelgaas@google.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=helgaas@kernel.org \
    --cc=keith.busch@intel.com \
    --cc=kstewart@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=liudongdong3@huawei.com \
    --cc=okaya@codeaurora.org \
    --cc=pombredanne@nexb.com \
    --cc=tglx@linutronix.de \
    --cc=timur@codeaurora.org \
    --cc=wzhang@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.