From: Manivannan Sadhasivam <mani@kernel.org>
To: Jeffrey Hugo <quic_jhugo@quicinc.com>
Cc: quic_pkanojiy@quicinc.com, quic_carlv@quicinc.com,
mhi@lists.linux.dev, linux-arm-msm@vger.kernel.org
Subject: Re: [PATCH v2] bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state
Date: Tue, 30 Jan 2024 13:45:19 +0530 [thread overview]
Message-ID: <20240130081519.GI32821@thinkpad> (raw)
In-Reply-To: <20240112180800.536733-1-quic_jhugo@quicinc.com>
On Fri, Jan 12, 2024 at 11:08:00AM -0700, Jeffrey Hugo wrote:
> When processing a SYSERR, if the device does not respond to the MHI_RESET
> from the host, the host will be stuck in a difficult to recover state.
> The host will remain in MHI_PM_SYS_ERR_PROCESS and not clean up the host
> channels. Clients will not be notified of the SYSERR via the destruction
> of their channel devices, which means clients may think that the device is
> still up. Subsequent SYSERR events such as a device fatal error will not
> be processed as the state machine cannot transition from PROCESS back to
> DETECT. The only way to recover from this is to unload the mhi module
> (wipe the state machine state) or for the mhi controller to initiate
> SHUTDOWN.
>
> This issue was discovered by stress testing soc_reset events on AIC100
> via the sysfs node.
>
> soc_reset is processed entirely in hardware. When the register write
> hits the endpoint hardware, it causes the soc to reset without firmware
> involvement. In stress testing, there is a rare race where soc_reset N
> will cause the soc to reset and PBL to signal SYSERR (fatal error). If
> soc_reset N+1 is triggered before PBL can process the MHI_RESET from the
> host, then the soc will reset again, and re-run PBL from the beginning.
> This will cause PBL to lose all state. PBL will be waiting for the host
> to respond to the new syserr, but host will be stuck expecting the
> previous MHI_RESET to be processed.
>
> Additionally, the AMSS EE firmware (QSM) was hacked to synthetically
> reproduce the issue by simulating a FW hang after the QSM issued a
> SYSERR. In this case, soc_reset would not recover the device.
>
> For this failure case, to recover the device, we need a state similar to
> PROCESS, but can transition to DETECT. There is not a viable existing
> state to use. POR has the needed transitions, but assumes the device is
> in a good state and could allow the host to attempt to use the device.
> Allowing PROCESS to transition to DETECT invites the possibility of
> parallel SYSERR processing which could get the host and device out of
> sync.
>
> Thus, invent a new state - MHI_PM_SYS_ERR_FAIL
>
> This essentially a holding state. It allows us to clean up the host
> elements that are based on the old state of the device (channels), but
> does not allow us to directly advance back to an operational state. It
> does allow the detection and processing of another SYSERR which may
> recover the device, or allows the controller to do a clean shutdown.
>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
- Mani
> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> ---
>
> v2:
> -Add additional details about issue discovery to commit text
>
> drivers/bus/mhi/host/init.c | 1 +
> drivers/bus/mhi/host/internal.h | 9 ++++++---
> drivers/bus/mhi/host/pm.c | 20 +++++++++++++++++---
> 3 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
> index 15c1740a2c88..cca2731bc98b 100644
> --- a/drivers/bus/mhi/host/init.c
> +++ b/drivers/bus/mhi/host/init.c
> @@ -62,6 +62,7 @@ static const char * const mhi_pm_state_str[] = {
> [MHI_PM_STATE_FW_DL_ERR] = "Firmware Download Error",
> [MHI_PM_STATE_SYS_ERR_DETECT] = "SYS ERROR Detect",
> [MHI_PM_STATE_SYS_ERR_PROCESS] = "SYS ERROR Process",
> + [MHI_PM_STATE_SYS_ERR_FAIL] = "SYS ERROR Failure",
> [MHI_PM_STATE_SHUTDOWN_PROCESS] = "SHUTDOWN Process",
> [MHI_PM_STATE_LD_ERR_FATAL_DETECT] = "Linkdown or Error Fatal Detect",
> };
> diff --git a/drivers/bus/mhi/host/internal.h b/drivers/bus/mhi/host/internal.h
> index 2e139e76de4c..d2858236af52 100644
> --- a/drivers/bus/mhi/host/internal.h
> +++ b/drivers/bus/mhi/host/internal.h
> @@ -88,6 +88,7 @@ enum mhi_pm_state {
> MHI_PM_STATE_FW_DL_ERR,
> MHI_PM_STATE_SYS_ERR_DETECT,
> MHI_PM_STATE_SYS_ERR_PROCESS,
> + MHI_PM_STATE_SYS_ERR_FAIL,
> MHI_PM_STATE_SHUTDOWN_PROCESS,
> MHI_PM_STATE_LD_ERR_FATAL_DETECT,
> MHI_PM_STATE_MAX
> @@ -104,14 +105,16 @@ enum mhi_pm_state {
> #define MHI_PM_FW_DL_ERR BIT(7)
> #define MHI_PM_SYS_ERR_DETECT BIT(8)
> #define MHI_PM_SYS_ERR_PROCESS BIT(9)
> -#define MHI_PM_SHUTDOWN_PROCESS BIT(10)
> +#define MHI_PM_SYS_ERR_FAIL BIT(10)
> +#define MHI_PM_SHUTDOWN_PROCESS BIT(11)
> /* link not accessible */
> -#define MHI_PM_LD_ERR_FATAL_DETECT BIT(11)
> +#define MHI_PM_LD_ERR_FATAL_DETECT BIT(12)
>
> #define MHI_REG_ACCESS_VALID(pm_state) ((pm_state & (MHI_PM_POR | MHI_PM_M0 | \
> MHI_PM_M2 | MHI_PM_M3_ENTER | MHI_PM_M3_EXIT | \
> MHI_PM_SYS_ERR_DETECT | MHI_PM_SYS_ERR_PROCESS | \
> - MHI_PM_SHUTDOWN_PROCESS | MHI_PM_FW_DL_ERR)))
> + MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS | \
> + MHI_PM_FW_DL_ERR)))
> #define MHI_PM_IN_ERROR_STATE(pm_state) (pm_state >= MHI_PM_FW_DL_ERR)
> #define MHI_PM_IN_FATAL_STATE(pm_state) (pm_state == MHI_PM_LD_ERR_FATAL_DETECT)
> #define MHI_DB_ACCESS_VALID(mhi_cntrl) (mhi_cntrl->pm_state & mhi_cntrl->db_access)
> diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
> index 8a4362d75fc4..27f8a40f288c 100644
> --- a/drivers/bus/mhi/host/pm.c
> +++ b/drivers/bus/mhi/host/pm.c
> @@ -36,7 +36,10 @@
> * M0 <--> M0
> * M0 -> FW_DL_ERR
> * M0 -> M3_ENTER -> M3 -> M3_EXIT --> M0
> - * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS --> POR
> + * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS
> + * SYS_ERR_PROCESS -> SYS_ERR_FAIL
> + * SYS_ERR_FAIL -> SYS_ERR_DETECT
> + * SYS_ERR_PROCESS --> POR
> * L2: SHUTDOWN_PROCESS -> LD_ERR_FATAL_DETECT
> * SHUTDOWN_PROCESS -> DISABLE
> * L3: LD_ERR_FATAL_DETECT <--> LD_ERR_FATAL_DETECT
> @@ -93,7 +96,12 @@ static const struct mhi_pm_transitions dev_state_transitions[] = {
> },
> {
> MHI_PM_SYS_ERR_PROCESS,
> - MHI_PM_POR | MHI_PM_SHUTDOWN_PROCESS |
> + MHI_PM_POR | MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS |
> + MHI_PM_LD_ERR_FATAL_DETECT
> + },
> + {
> + MHI_PM_SYS_ERR_FAIL,
> + MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> MHI_PM_LD_ERR_FATAL_DETECT
> },
> /* L2 States */
> @@ -624,7 +632,13 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
> !in_reset, timeout);
> if (!ret || in_reset) {
> dev_err(dev, "Device failed to exit MHI Reset state\n");
> - goto exit_sys_error_transition;
> + write_lock_irq(&mhi_cntrl->pm_lock);
> + cur_state = mhi_tryset_pm_state(mhi_cntrl,
> + MHI_PM_SYS_ERR_FAIL);
> + write_unlock_irq(&mhi_cntrl->pm_lock);
> + /* Shutdown may have occurred, otherwise cleanup now */
> + if (cur_state != MHI_PM_SYS_ERR_FAIL)
> + goto exit_sys_error_transition;
> }
>
> /*
> --
> 2.34.1
>
>
--
மணிவண்ணன் சதாசிவம்
next prev parent reply other threads:[~2024-01-30 8:15 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-12 18:08 [PATCH v2] bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state Jeffrey Hugo
2024-01-30 8:15 ` Manivannan Sadhasivam [this message]
2024-01-30 8:18 ` Manivannan Sadhasivam
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240130081519.GI32821@thinkpad \
--to=mani@kernel.org \
--cc=linux-arm-msm@vger.kernel.org \
--cc=mhi@lists.linux.dev \
--cc=quic_carlv@quicinc.com \
--cc=quic_jhugo@quicinc.com \
--cc=quic_pkanojiy@quicinc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).