From: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
To: "Oliver O'Halloran" <oohall@gmail.com>
Cc: linuxppc-dev <linuxppc-dev@ozlabs.org>
Subject: Re: [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.
Date: Mon, 18 Oct 2021 22:58:08 +0530
Message-ID: <20211018172808.efx5bxniscy2lcj4@in.ibm.com>
In-Reply-To: <CAOSf1CG4H2GrxV5C=55vcNue4taSAkgFOUg32yVspgw9+meDAg@mail.gmail.com>
On 2021-05-07 10:41:46 Fri, Oliver O'Halloran wrote:
> On Fri, May 7, 2021 at 3:43 AM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
> >
> > When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> > state as temporarily unavailable. In this case, per PAPR, rtas call
> > ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5)
> > and OS has to wait until that recovery is complete. During this state the
> > slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR
> > connector empty which leads to assumption that the device has been
> > hot-removed. This results into no EEH recovery on this device and it stays
> > in failed state forever.
> >
> > This patch fixes this issue by skipping slot presence check only if device
> > PE state is temporarily unavailable(5).
> >
> > Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
> > ---
> > * snip*
> >
> > /*
> > * It should be corner case that the parent PE has been
> > diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> > index 3eff6a4888e79..a0913768f33de 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
> > return;
> > }
> >
> > + /*
> > + * When PE's state is temporarily unavailable, the slot
> > + * presence check returns as DR connector empty.
>
> That sounds like a bug in either RTAS or the hotplug slot driver (or
> both). The presence check is there largely to filter out events that
> we can guarantee are not recoverable (i.e. surprise hot-unplug). In
> every other case (especially if we can't determine the state) we
> should be going down the recovery path. If the hotplug slot driver is
> incorrectly reporting the card has been removed then you should be
> fixing the slot driver.
Thanks, Oliver, for the comment.
phyp has since fixed the issue where it incorrectly reported that the
card had been removed. With the phyp fix, the slot presence check
'get-sensor-state(dr-entity-sense)' returns an extended busy error (9902)
until phyp has recovered the PHB. Once the PHB is recovered,
get-sensor-state() returns success with the correct presence status.
But now we have a different problem. The Linux RTAS call interface
rtas_get_sensor() retries the RTAS call on an extended delay return code
(9902) until the return value is either success (0) or error (-1). This
causes the EEH handler to block in the presence check
(rtas_get_sensor()) for ~6 seconds before it can notify the network
driver that an error has been detected, so that the driver stops any
active operations. With no I/O traffic this causes no problem and EEH
recovery works fine. With I/O traffic running, however, the network
driver continues its operation during these 6 seconds and hits a
timeout (netdev watchdog). On a timeout, the network driver enters FFDC
capture mode and its reset path, assuming the PCI device is in a fatal
condition. This causes EEH recovery to fail and sometimes leads to a
system hang or crash.
------------
[52732.244731] DEBUG: ibm_read_slot_reset_state2()
[52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
[52732.244798] DEBUG: in eeh_slot_presence_check
[52732.244804] DEBUG: error state check
[52732.244807] DEBUG: Is slot hotpluggable
[52732.244810] DEBUG: hotpluggable ops ?
[52732.244953] DEBUG: Calling ops->get_adapter_status
[52732.244958] DEBUG: calling rpaphp_get_sensor_state
[52736.564262] ------------[ cut here ]------------
[52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
[52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
[...]
[52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
[52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
------------
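To make the difference between the blocking and the non-blocking sensor
read concrete, here is a small user-space model. It is only a sketch and
not kernel code: sim_fw, sim_fw_get_sensor(), sim_get_sensor_blocking()
and sim_get_sensor_fast() are hypothetical names invented for this
illustration, and the mock firmware simply answers 9902 a fixed number
of times before succeeding. The delay hint assumes the usual reading of
the extended delay codes, 10^(rc - 9900) ms, so 9902 means 100 ms.

```c
/* Return codes modeled on the values quoted in this mail. */
#define SIM_SUCCESS          0
#define SIM_EXT_DELAY_MIN 9900
#define SIM_EXT_DELAY_MAX 9905

/*
 * Hypothetical mock firmware: answers "extended busy" (9902) for the
 * first busy_calls invocations, then succeeds and reports presence = 1.
 */
struct sim_fw {
	int busy_calls;   /* remaining calls that will return 9902 */
	int calls_made;   /* total calls seen so far */
};

static int sim_fw_get_sensor(struct sim_fw *fw, int *state)
{
	fw->calls_made++;
	if (fw->busy_calls > 0) {
		fw->busy_calls--;
		return 9902;
	}
	*state = 1;
	return SIM_SUCCESS;
}

/* Delay hint for an extended delay code: 10^(rc - 9900) milliseconds. */
static unsigned int sim_delay_ms(int rc)
{
	unsigned int ms = 1;

	for (int i = rc - SIM_EXT_DELAY_MIN; i > 0; i--)
		ms *= 10;
	return ms;
}

/*
 * Blocking flavour: keep retrying while the firmware reports an
 * extended delay. The time is only accumulated, not actually slept.
 */
static int sim_get_sensor_blocking(struct sim_fw *fw, int *state,
				   unsigned int *waited_ms)
{
	*waited_ms = 0;
	for (;;) {
		int rc = sim_fw_get_sensor(fw, state);

		if (rc < SIM_EXT_DELAY_MIN || rc > SIM_EXT_DELAY_MAX)
			return rc;
		*waited_ms += sim_delay_ms(rc);
	}
}

/* Fast flavour: a single call; the busy code is handed to the caller. */
static int sim_get_sensor_fast(struct sim_fw *fw, int *state)
{
	return sim_fw_get_sensor(fw, state);
}
```

With a mock that stays busy for 60 calls (a number picked only to match
the ~6 seconds above), the blocking variant accumulates 60 x 100 ms
before it returns, while the fast variant returns 9902 after one call
and leaves the decision to the caller.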
I am working on ways to fix this and am looking at the two options
below. More ideas are welcome.
1. There is an alternate call, rtas_get_sensor_fast(), that does not use
rtas_busy_delay() and returns immediately with an error code. Using
rtas_get_sensor_fast() for the slot presence check fixes the above issue
and EEH recovery works fine. However, the hotplug_slot_ops struct has no
provision for a quick adapter status check that could be used to call
rtas_get_sensor_fast().
2. Another option is to move the slot presence check to after the
network driver has been told that an error was detected. This also
fixes the issue. However, I still need to verify the hotplug case: if
the slot is empty, we inform the driver to resume and skip the recovery.
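The ordering proposed in option 2 can be sketched as a small user-space
model. This is an illustration under assumptions, not the
eeh_handle_normal_event() code: every sim_* name is hypothetical, and
the handler only records the order of steps. An unknown slot state is
routed to recovery, following Oliver's point that anything other than a
confirmed removal should go down the recovery path.

```c
#include <stddef.h>

/* Possible outcomes of the (potentially slow) slot presence check. */
enum sim_presence { SIM_SLOT_PRESENT, SIM_SLOT_EMPTY, SIM_SLOT_UNKNOWN };

/* Steps the modeled handler can take, recorded in order. */
enum sim_step {
	SIM_STEP_REPORT_ERROR,       /* tell the driver an error was detected */
	SIM_STEP_PRESENCE_CHECK,     /* may block while firmware is busy */
	SIM_STEP_RECOVER,            /* normal recovery path */
	SIM_STEP_RESUME_NO_RECOVERY, /* slot empty: resume, skip recovery */
};

#define SIM_MAX_STEPS 4

struct sim_trace {
	enum sim_step steps[SIM_MAX_STEPS];
	size_t count;
};

static void sim_record(struct sim_trace *t, enum sim_step s)
{
	if (t->count < SIM_MAX_STEPS)
		t->steps[t->count++] = s;
}

/*
 * Option 2 ordering: the driver hears about the error before the
 * presence check runs, so it has stopped I/O even if the check stalls.
 */
static void sim_handle_event(enum sim_presence presence, struct sim_trace *t)
{
	t->count = 0;
	sim_record(t, SIM_STEP_REPORT_ERROR);
	sim_record(t, SIM_STEP_PRESENCE_CHECK);
	if (presence == SIM_SLOT_EMPTY)
		sim_record(t, SIM_STEP_RESUME_NO_RECOVERY);
	else
		sim_record(t, SIM_STEP_RECOVER);
}
```

Only a confirmed empty slot skips recovery in this model; the open
question from option 2, what the driver needs to see in that case, is
reduced here to a single recorded step.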
Let me know what you think about the above options, and whether there is
a better way to fix this.
Thanks,
-Mahesh.
Thread overview: 3+ messages
2021-05-06 17:43 [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable Mahesh Salgaonkar
2021-05-07 0:41 ` Oliver O'Halloran
2021-10-18 17:28 ` Mahesh J Salgaonkar [this message]