Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mathieu Poirier <mathieu.poirier@linaro.org>
To: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: Bjorn Andersson <andersson@kernel.org>,
	linux-remoteproc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT
Date: Fri, 25 Oct 2024 09:08:03 -0600	[thread overview]
Message-ID: <Zxu0U9U+BvTo20Zq@p14s> (raw)
In-Reply-To: <ZxtShfshsuyVzGx3@hu-mojha-hyd.qualcomm.com>

On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> > Hi Mukesh,
> > 
> > On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > recover it results in NULL pointer dereference issue in
> > > qcom_glink_smem_unregister().
> > > 
> > > There is other side to this issue if we want to fix this via adding a
> > > NULL check on glink->edge which does not guarantees that the remoteproc
> > > will recover in second call from Process B as it has failed in the first
> > > Process A during SMC shutdown call and may again fail at the same call
> > > and rproc can not recover for such case.
> > > 
> > > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > > remoteproc and the only way to recover from it via system restart.
> > > 
> > > 	Process-A                			Process-B
> > > 
> > >   fatal error interrupt happens
> > > 
> > >   rproc_crash_handler_work()
> > >     mutex_lock_interruptible(&rproc->lock);
> > >     ...
> > > 
> > >        rproc->state = RPROC_CRASHED;
> > >     ...
> > >     mutex_unlock(&rproc->lock);
> > > 
> > >     rproc_trigger_recovery()
> > >      mutex_lock_interruptible(&rproc->lock);
> > > 
> > >       adsp_stop()
> > >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > >       remoteproc remoteproc3: can't stop rproc: -22
> > >      mutex_unlock(&rproc->lock);
> > 
> > Ok, that can happen.
> > 
> > > 
> > > 						echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > > 						recovery_store()
> > > 						 rproc_trigger_recovery()
> > > 						  mutex_lock_interruptible(&rproc->lock);
> > > 						   rproc_stop()
> > > 						    glink_subdev_stop()
> > > 						      qcom_glink_smem_unregister() ==|
> > >                                                                                      |
> > >                                                                                      V
> > 
> > I am missing some information here but I will _assume_ this is caused by
> > glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> > process A.  Instead of adding a new state to the core I think a better idea
> > would be to add a check for a NULL value on @smem in
> > qcom_glink_smem_unregister().  This is a problem that should be fixed in the
> > driver rather than the core.
> > 
> > [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
> 
> 
> I did the same here [1] but after discussion with Bjorn, realized that
> remoteproc might not even recover and may fail in the second attempt as
> well and only way is reboot of the machine.

Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
manual intervention is needed.  I don't see why another state needs to be added.

> 
> [1]
> https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
> 
> > 
> > > 						      Unable to handle kernel NULL pointer dereference
> > >                                                                 at virtual address 0000000000000358
> > > 
> > > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> > > ---
> > > Changes in v3:
> > >  - Fix kernel test reported error.
> > > 
> > > Changes in v2:
> > >  - Removed NULL pointer check instead added a new state to signify
> > >    non-recoverable state of remoteproc.
> > > 
> > >  drivers/remoteproc/remoteproc_core.c  | 3 ++-
> > >  drivers/remoteproc/remoteproc_sysfs.c | 1 +
> > >  include/linux/remoteproc.h            | 5 ++++-
> > >  3 files changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > index f276956f2c5c..c4e14503b971 100644
> > > --- a/drivers/remoteproc/remoteproc_core.c
> > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > >  	/* power off the remote processor */
> > >  	ret = rproc->ops->stop(rproc);
> > >  	if (ret) {
> > > +		rproc->state = RPROC_DEFUNCT;
> > >  		dev_err(dev, "can't stop rproc: %d\n", ret);
> > >  		return ret;
> > >  	}
> > > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > >  		return ret;
> > >  
> > >  	/* State could have changed before we got the mutex */
> > > -	if (rproc->state != RPROC_CRASHED)
> > > +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> > >  		goto unlock_mutex;
> > 
> > The problem is that rproc_trigger_recovery() an only be called once for a
> > remoteproc, something that modifies the state machine and may introduce backward
> > compatibility issues for other remote processor implementations.
> > 
> 
> I missed one more point to add here which i tried to highlight in second
> version[2] that setting of RPROC_DEFUNCT should happen for this case
> from vendor remoteproc driver and not at the core and that should take
> care of the backward compatibility.
> 
> [2]
> https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/
> 
> -Mukesh
> 
> > Thanks,
> > Mathieu
> > 
> > >  
> > >  	dev_err(dev, "recovering %s\n", rproc->name);
> > > diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> > > index 138e752c5e4e..5f722b4576b2 100644
> > > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> > >  	[RPROC_DELETED]		= "deleted",
> > >  	[RPROC_ATTACHED]	= "attached",
> > >  	[RPROC_DETACHED]	= "detached",
> > > +	[RPROC_DEFUNCT]		= "defunct",
> > >  	[RPROC_LAST]		= "invalid",
> > >  };
> > >  
> > > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > > index b4795698d8c2..3e4ba06c6a9a 100644
> > > --- a/include/linux/remoteproc.h
> > > +++ b/include/linux/remoteproc.h
> > > @@ -417,6 +417,8 @@ struct rproc_ops {
> > >   *			has attached to it
> > >   * @RPROC_DETACHED:	device has been booted by another entity and waiting
> > >   *			for the core to attach to it
> > > + * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
> > > + * 			requests and can only recover on system restart.
> > >   * @RPROC_LAST:		just keep this one at the end
> > >   *
> > >   * Please note that the values of these states are used as indices
> > > @@ -433,7 +435,8 @@ enum rproc_state {
> > >  	RPROC_DELETED	= 4,
> > >  	RPROC_ATTACHED	= 5,
> > >  	RPROC_DETACHED	= 6,
> > > -	RPROC_LAST	= 7,
> > > +	RPROC_DEFUNCT	= 7,
> > > +	RPROC_LAST	= 8,
> > >  };
> > >  
> > >  /**
> > > -- 
> > > 2.34.1
> > >

next prev parent reply	other threads:[~2024-10-25 15:08 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-16  4:55 [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT Mukesh Ojha
2024-10-16 16:04 ` anish kumar
2024-10-21 15:12 ` Mathieu Poirier
2024-10-25  8:10   ` Mukesh Ojha
2024-10-25 15:08     ` Mathieu Poirier [this message]
2024-10-25 15:39       ` Mukesh Ojha

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zxu0U9U+BvTo20Zq@p14s \
    --to=mathieu.poirier@linaro.org \
    --cc=andersson@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-remoteproc@vger.kernel.org \
    --cc=quic_mojha@quicinc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.