* [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT
@ 2024-10-16 4:55 Mukesh Ojha
2024-10-16 16:04 ` anish kumar
2024-10-21 15:12 ` Mathieu Poirier
0 siblings, 2 replies; 6+ messages in thread
From: Mukesh Ojha @ 2024-10-16 4:55 UTC
To: Bjorn Andersson, Mathieu Poirier
Cc: linux-remoteproc, linux-kernel, Mukesh Ojha
glink_subdev_stop() can be called more than once for the same remoteproc.
If rproc_stop() fails in Process A, the rproc is left in the RPROC_CRASHED
state. A later write to the recovery sysfs attribute (recovery_store())
from user space in Process B then calls rproc_trigger_recovery() on the
same remoteproc, which results in a NULL pointer dereference in
qcom_glink_smem_unregister().
Adding a NULL check on glink->edge would avoid the dereference, but it
does not guarantee that the remoteproc recovers on the second attempt
from Process B: the first attempt in Process A failed during the SMC
shutdown call, and the second attempt may fail at the same call, leaving
the rproc unrecoverable.
Add a new rproc state, RPROC_DEFUNCT, to mark a remoteproc as
non-recoverable; the only way to recover from it is a system restart.
  Process-A                                Process-B

  fatal error interrupt happens

  rproc_crash_handler_work()
    mutex_lock_interruptible(&rproc->lock);
    ...
    rproc->state = RPROC_CRASHED;
    ...
    mutex_unlock(&rproc->lock);

    rproc_trigger_recovery()
      mutex_lock_interruptible(&rproc->lock);

        adsp_stop()
        qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
        remoteproc remoteproc3: can't stop rproc: -22
      mutex_unlock(&rproc->lock);

                                           echo enabled > /sys/class/remoteproc/remoteprocX/recovery
                                           recovery_store()
                                             rproc_trigger_recovery()
                                               mutex_lock_interruptible(&rproc->lock);
                                               rproc_stop()
                                                 glink_subdev_stop()
                                                   qcom_glink_smem_unregister() ==|
                                                                                  |
                                                                                  V
                                                   Unable to handle kernel NULL pointer dereference
                                                   at virtual address 0000000000000358
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
---
Changes in v3:
- Fix kernel test reported error.
Changes in v2:
- Removed NULL pointer check instead added a new state to signify
non-recoverable state of remoteproc.
drivers/remoteproc/remoteproc_core.c | 3 ++-
drivers/remoteproc/remoteproc_sysfs.c | 1 +
include/linux/remoteproc.h | 5 ++++-
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index f276956f2c5c..c4e14503b971 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
/* power off the remote processor */
ret = rproc->ops->stop(rproc);
if (ret) {
+ rproc->state = RPROC_DEFUNCT;
dev_err(dev, "can't stop rproc: %d\n", ret);
return ret;
}
@@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
return ret;
/* State could have changed before we got the mutex */
- if (rproc->state != RPROC_CRASHED)
+ if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
goto unlock_mutex;
dev_err(dev, "recovering %s\n", rproc->name);
diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
index 138e752c5e4e..5f722b4576b2 100644
--- a/drivers/remoteproc/remoteproc_sysfs.c
+++ b/drivers/remoteproc/remoteproc_sysfs.c
@@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
[RPROC_DELETED] = "deleted",
[RPROC_ATTACHED] = "attached",
[RPROC_DETACHED] = "detached",
+ [RPROC_DEFUNCT] = "defunct",
[RPROC_LAST] = "invalid",
};
diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
index b4795698d8c2..3e4ba06c6a9a 100644
--- a/include/linux/remoteproc.h
+++ b/include/linux/remoteproc.h
@@ -417,6 +417,8 @@ struct rproc_ops {
* has attached to it
* @RPROC_DETACHED: device has been booted by another entity and waiting
* for the core to attach to it
+ * @RPROC_DEFUNCT: device neither crashed nor responding to any of the
+ * requests and can only recover on system restart.
* @RPROC_LAST: just keep this one at the end
*
* Please note that the values of these states are used as indices
@@ -433,7 +435,8 @@ enum rproc_state {
RPROC_DELETED = 4,
RPROC_ATTACHED = 5,
RPROC_DETACHED = 6,
- RPROC_LAST = 7,
+ RPROC_DEFUNCT = 7,
+ RPROC_LAST = 8,
};
/**
--
2.34.1
* Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT
2024-10-16 4:55 [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT Mukesh Ojha
@ 2024-10-16 16:04 ` anish kumar
2024-10-21 15:12 ` Mathieu Poirier
1 sibling, 0 replies; 6+ messages in thread
From: anish kumar @ 2024-10-16 16:04 UTC
To: Mukesh Ojha
Cc: Bjorn Andersson, Mathieu Poirier, linux-remoteproc, linux-kernel
On Tue, Oct 15, 2024 at 9:57 PM Mukesh Ojha <quic_mojha@quicinc.com> wrote:
>
> Multiple call to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails from Process-A that leaves the rproc state to
> RPROC_CRASHED state later a call to recovery_store from user space in
> Process B triggers rproc_trigger_recovery() of the same remoteproc to
> recover it results in NULL pointer dereference issue in
> qcom_glink_smem_unregister().
>
> There is other side to this issue if we want to fix this via adding a
> NULL check on glink->edge which does not guarantee that the remoteproc
> will recover in second call from Process B as it has failed in the first
> Process A during SMC shutdown call and may again fail at the same call
> and rproc can not recover for such case.
What guarantees that the second stop will also fail? I feel this
should be handled in user space: if rproc calls are failing, there is a
bigger issue, and user space should decide what to do if it keeps
happening. Also, why not set this DEFUNCT state from the other callbacks
too, since every callback from the core into the rproc driver can fail?
>
> Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
Even if this state is present, ultimately it will be up to user space to
decide what to do, right?
> remoteproc and the only way to recover from it via system restart.
>
> Process-A Process-B
>
> fatal error interrupt happens
>
> rproc_crash_handler_work()
> mutex_lock_interruptible(&rproc->lock);
> ...
>
> rproc->state = RPROC_CRASHED;
> ...
> mutex_unlock(&rproc->lock);
>
> rproc_trigger_recovery()
> mutex_lock_interruptible(&rproc->lock);
>
> adsp_stop()
> qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> remoteproc remoteproc3: can't stop rproc: -22
> mutex_unlock(&rproc->lock);
>
> echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> recovery_store()
> rproc_trigger_recovery()
> mutex_lock_interruptible(&rproc->lock);
> rproc_stop()
> glink_subdev_stop()
> qcom_glink_smem_unregister() ==|
> |
> V
> Unable to handle kernel NULL pointer dereference
> at virtual address 0000000000000358
>
> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> ---
> Changes in v3:
> - Fix kernel test reported error.
>
> Changes in v2:
> - Removed NULL pointer check instead added a new state to signify
> non-recoverable state of remoteproc.
>
> drivers/remoteproc/remoteproc_core.c | 3 ++-
> drivers/remoteproc/remoteproc_sysfs.c | 1 +
> include/linux/remoteproc.h | 5 ++++-
> 3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index f276956f2c5c..c4e14503b971 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> /* power off the remote processor */
> ret = rproc->ops->stop(rproc);
> if (ret) {
> + rproc->state = RPROC_DEFUNCT;
> dev_err(dev, "can't stop rproc: %d\n", ret);
> return ret;
> }
> @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> return ret;
>
> /* State could have changed before we got the mutex */
> - if (rproc->state != RPROC_CRASHED)
> + if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> goto unlock_mutex;
>
> dev_err(dev, "recovering %s\n", rproc->name);
> diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> index 138e752c5e4e..5f722b4576b2 100644
> --- a/drivers/remoteproc/remoteproc_sysfs.c
> +++ b/drivers/remoteproc/remoteproc_sysfs.c
> @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> [RPROC_DELETED] = "deleted",
> [RPROC_ATTACHED] = "attached",
> [RPROC_DETACHED] = "detached",
> + [RPROC_DEFUNCT] = "defunct",
> [RPROC_LAST] = "invalid",
> };
>
> diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> index b4795698d8c2..3e4ba06c6a9a 100644
> --- a/include/linux/remoteproc.h
> +++ b/include/linux/remoteproc.h
> @@ -417,6 +417,8 @@ struct rproc_ops {
> * has attached to it
> * @RPROC_DETACHED: device has been booted by another entity and waiting
> * for the core to attach to it
> + * @RPROC_DEFUNCT: device neither crashed nor responding to any of the
> + * requests and can only recover on system restart.
> * @RPROC_LAST: just keep this one at the end
> *
> * Please note that the values of these states are used as indices
> @@ -433,7 +435,8 @@ enum rproc_state {
> RPROC_DELETED = 4,
> RPROC_ATTACHED = 5,
> RPROC_DETACHED = 6,
> - RPROC_LAST = 7,
> + RPROC_DEFUNCT = 7,
> + RPROC_LAST = 8,
> };
>
> /**
> --
> 2.34.1
>
>
* Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT
2024-10-16 4:55 [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT Mukesh Ojha
2024-10-16 16:04 ` anish kumar
@ 2024-10-21 15:12 ` Mathieu Poirier
2024-10-25 8:10 ` Mukesh Ojha
1 sibling, 1 reply; 6+ messages in thread
From: Mathieu Poirier @ 2024-10-21 15:12 UTC
To: Mukesh Ojha; +Cc: Bjorn Andersson, linux-remoteproc, linux-kernel
Hi Mukesh,
On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> Multiple call to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails from Process-A that leaves the rproc state to
> RPROC_CRASHED state later a call to recovery_store from user space in
> Process B triggers rproc_trigger_recovery() of the same remoteproc to
> recover it results in NULL pointer dereference issue in
> qcom_glink_smem_unregister().
>
> There is other side to this issue if we want to fix this via adding a
> NULL check on glink->edge which does not guarantee that the remoteproc
> will recover in second call from Process B as it has failed in the first
> Process A during SMC shutdown call and may again fail at the same call
> and rproc can not recover for such case.
>
> Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> remoteproc and the only way to recover from it via system restart.
>
> Process-A Process-B
>
> fatal error interrupt happens
>
> rproc_crash_handler_work()
> mutex_lock_interruptible(&rproc->lock);
> ...
>
> rproc->state = RPROC_CRASHED;
> ...
> mutex_unlock(&rproc->lock);
>
> rproc_trigger_recovery()
> mutex_lock_interruptible(&rproc->lock);
>
> adsp_stop()
> qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> remoteproc remoteproc3: can't stop rproc: -22
> mutex_unlock(&rproc->lock);
Ok, that can happen.
>
> echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> recovery_store()
> rproc_trigger_recovery()
> mutex_lock_interruptible(&rproc->lock);
> rproc_stop()
> glink_subdev_stop()
> qcom_glink_smem_unregister() ==|
> |
> V
I am missing some information here but I will _assume_ this is caused by
glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
process A. Instead of adding a new state to the core I think a better idea
would be to add a check for a NULL value on @smem in
qcom_glink_smem_unregister(). This is a problem that should be fixed in the
driver rather than the core.
[1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
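
For illustration only, a driver-side guard of the kind suggested above could look like the following self-contained sketch. The types `fake_smem` and `fake_glink_subdev` and both functions are hypothetical stand-ins, not the real qcom_glink_smem or qcom_common.c code; the sketch only shows how a NULL check turns a repeated stop call into a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's SMEM-backed glink handle. */
struct fake_smem {
	int registered;
};

/* Hypothetical stand-in for the glink subdevice; edge is NULL once stopped. */
struct fake_glink_subdev {
	struct fake_smem *edge;
	int unregister_calls;	/* counts real teardown work, for the demo */
};

/* Tear down the handle; tolerate NULL so a repeated stop does nothing. */
static void fake_smem_unregister(struct fake_glink_subdev *glink,
				 struct fake_smem *smem)
{
	if (!smem)
		return;

	smem->registered = 0;
	free(smem);
	glink->unregister_calls++;
}

/* Stop path: unregister, then clear the pointer so later calls see NULL. */
static void fake_glink_subdev_stop(struct fake_glink_subdev *glink)
{
	assert(glink != NULL);

	fake_smem_unregister(glink, glink->edge);
	glink->edge = NULL;
}
```

Calling `fake_glink_subdev_stop()` twice on the same object performs the teardown once and returns early the second time. Note that this only removes the crash; it says nothing about whether the remote processor can actually be recovered afterwards.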
> Unable to handle kernel NULL pointer dereference
> at virtual address 0000000000000358
>
> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> ---
> Changes in v3:
> - Fix kernel test reported error.
>
> Changes in v2:
> - Removed NULL pointer check instead added a new state to signify
> non-recoverable state of remoteproc.
>
> drivers/remoteproc/remoteproc_core.c | 3 ++-
> drivers/remoteproc/remoteproc_sysfs.c | 1 +
> include/linux/remoteproc.h | 5 ++++-
> 3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index f276956f2c5c..c4e14503b971 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> /* power off the remote processor */
> ret = rproc->ops->stop(rproc);
> if (ret) {
> + rproc->state = RPROC_DEFUNCT;
> dev_err(dev, "can't stop rproc: %d\n", ret);
> return ret;
> }
> @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> return ret;
>
> /* State could have changed before we got the mutex */
> - if (rproc->state != RPROC_CRASHED)
> + if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> goto unlock_mutex;
The problem is that, with this change, rproc_trigger_recovery() can only be
called once for a remoteproc, which modifies the state machine and may introduce
backward compatibility issues for other remote processor implementations.
Thanks,
Mathieu
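
A side note on the condition in the hunk above: because RPROC_DEFUNCT and RPROC_CRASHED are distinct enum values, the added `== RPROC_DEFUNCT` test is already implied by `!= RPROC_CRASHED`. The sketch below checks this over every state value. The enum mirrors the patched header; values 4 to 8 appear in the diff, while the names and values for 0 to 3 are not shown in this thread and are assumed from the upstream header. The helper function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirror of the patched enum. Values 4..8 are in the diff; 0..3 are assumed. */
enum rproc_state {
	RPROC_OFFLINE	= 0,
	RPROC_SUSPENDED	= 1,
	RPROC_RUNNING	= 2,
	RPROC_CRASHED	= 3,
	RPROC_DELETED	= 4,
	RPROC_ATTACHED	= 5,
	RPROC_DETACHED	= 6,
	RPROC_DEFUNCT	= 7,
	RPROC_LAST	= 8,
};

static_assert(RPROC_DEFUNCT != RPROC_CRASHED, "states must be distinct");

/* Condition as written in the patch: true means "skip recovery". */
static bool skip_recovery_patched(enum rproc_state state)
{
	return state == RPROC_DEFUNCT || state != RPROC_CRASHED;
}

/* Condition before the patch. */
static bool skip_recovery_original(enum rproc_state state)
{
	return state != RPROC_CRASHED;
}

/* Returns true when both conditions agree for every state value. */
static bool conditions_agree(void)
{
	for (int s = RPROC_OFFLINE; s <= RPROC_LAST; s++) {
		enum rproc_state state = (enum rproc_state)s;

		if (skip_recovery_patched(state) != skip_recovery_original(state))
			return false;
	}
	return true;
}
```

Under these assumptions the two conditions are equivalent, so the behavioral change comes from rproc_stop() moving the state away from RPROC_CRASHED on failure, not from the extra comparison.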
>
> dev_err(dev, "recovering %s\n", rproc->name);
> diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> index 138e752c5e4e..5f722b4576b2 100644
> --- a/drivers/remoteproc/remoteproc_sysfs.c
> +++ b/drivers/remoteproc/remoteproc_sysfs.c
> @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> [RPROC_DELETED] = "deleted",
> [RPROC_ATTACHED] = "attached",
> [RPROC_DETACHED] = "detached",
> + [RPROC_DEFUNCT] = "defunct",
> [RPROC_LAST] = "invalid",
> };
>
> diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> index b4795698d8c2..3e4ba06c6a9a 100644
> --- a/include/linux/remoteproc.h
> +++ b/include/linux/remoteproc.h
> @@ -417,6 +417,8 @@ struct rproc_ops {
> * has attached to it
> * @RPROC_DETACHED: device has been booted by another entity and waiting
> * for the core to attach to it
> + * @RPROC_DEFUNCT: device neither crashed nor responding to any of the
> + * requests and can only recover on system restart.
> * @RPROC_LAST: just keep this one at the end
> *
> * Please note that the values of these states are used as indices
> @@ -433,7 +435,8 @@ enum rproc_state {
> RPROC_DELETED = 4,
> RPROC_ATTACHED = 5,
> RPROC_DETACHED = 6,
> - RPROC_LAST = 7,
> + RPROC_DEFUNCT = 7,
> + RPROC_LAST = 8,
> };
>
> /**
> --
> 2.34.1
>
* Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT
2024-10-21 15:12 ` Mathieu Poirier
@ 2024-10-25 8:10 ` Mukesh Ojha
2024-10-25 15:08 ` Mathieu Poirier
0 siblings, 1 reply; 6+ messages in thread
From: Mukesh Ojha @ 2024-10-25 8:10 UTC
To: Mathieu Poirier; +Cc: Bjorn Andersson, linux-remoteproc, linux-kernel
On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> Hi Mukesh,
>
> On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > if rproc_stop() fails from Process-A that leaves the rproc state to
> > RPROC_CRASHED state later a call to recovery_store from user space in
> > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > recover it results in NULL pointer dereference issue in
> > qcom_glink_smem_unregister().
> >
> > There is other side to this issue if we want to fix this via adding a
> > NULL check on glink->edge which does not guarantee that the remoteproc
> > will recover in second call from Process B as it has failed in the first
> > Process A during SMC shutdown call and may again fail at the same call
> > and rproc can not recover for such case.
> >
> > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > remoteproc and the only way to recover from it via system restart.
> >
> > Process-A Process-B
> >
> > fatal error interrupt happens
> >
> > rproc_crash_handler_work()
> > mutex_lock_interruptible(&rproc->lock);
> > ...
> >
> > rproc->state = RPROC_CRASHED;
> > ...
> > mutex_unlock(&rproc->lock);
> >
> > rproc_trigger_recovery()
> > mutex_lock_interruptible(&rproc->lock);
> >
> > adsp_stop()
> > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > remoteproc remoteproc3: can't stop rproc: -22
> > mutex_unlock(&rproc->lock);
>
> Ok, that can happen.
>
> >
> > echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > recovery_store()
> > rproc_trigger_recovery()
> > mutex_lock_interruptible(&rproc->lock);
> > rproc_stop()
> > glink_subdev_stop()
> > qcom_glink_smem_unregister() ==|
> > |
> > V
>
> I am missing some information here but I will _assume_ this is caused by
> glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> process A. Instead of adding a new state to the core I think a better idea
> would be to add a check for a NULL value on @smem in
> qcom_glink_smem_unregister(). This is a problem that should be fixed in the
> driver rather than the core.
>
> [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
I did the same here [1], but after discussing it with Bjorn I realized
that the remoteproc might not recover at all: the second attempt may fail
as well, and the only way out is to reboot the machine.
[1]
https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
>
> > Unable to handle kernel NULL pointer dereference
> > at virtual address 0000000000000358
> >
> > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> > ---
> > Changes in v3:
> > - Fix kernel test reported error.
> >
> > Changes in v2:
> > - Removed NULL pointer check instead added a new state to signify
> > non-recoverable state of remoteproc.
> >
> > drivers/remoteproc/remoteproc_core.c | 3 ++-
> > drivers/remoteproc/remoteproc_sysfs.c | 1 +
> > include/linux/remoteproc.h | 5 ++++-
> > 3 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > index f276956f2c5c..c4e14503b971 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > /* power off the remote processor */
> > ret = rproc->ops->stop(rproc);
> > if (ret) {
> > + rproc->state = RPROC_DEFUNCT;
> > dev_err(dev, "can't stop rproc: %d\n", ret);
> > return ret;
> > }
> > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > return ret;
> >
> > /* State could have changed before we got the mutex */
> > - if (rproc->state != RPROC_CRASHED)
> > + if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> > goto unlock_mutex;
>
> The problem is that rproc_trigger_recovery() can only be called once for a
> remoteproc, something that modifies the state machine and may introduce backward
> compatibility issues for other remote processor implementations.
>
One more point I missed here, which I tried to highlight in the second
version [2]: for this case, RPROC_DEFUNCT should be set by the vendor
remoteproc driver and not by the core, which should take care of
backward compatibility.
[2]
https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/
-Mukesh
> Thanks,
> Mathieu
>
> >
> > dev_err(dev, "recovering %s\n", rproc->name);
> > diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> > index 138e752c5e4e..5f722b4576b2 100644
> > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> > [RPROC_DELETED] = "deleted",
> > [RPROC_ATTACHED] = "attached",
> > [RPROC_DETACHED] = "detached",
> > + [RPROC_DEFUNCT] = "defunct",
> > [RPROC_LAST] = "invalid",
> > };
> >
> > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > index b4795698d8c2..3e4ba06c6a9a 100644
> > --- a/include/linux/remoteproc.h
> > +++ b/include/linux/remoteproc.h
> > @@ -417,6 +417,8 @@ struct rproc_ops {
> > * has attached to it
> > * @RPROC_DETACHED: device has been booted by another entity and waiting
> > * for the core to attach to it
> > + * @RPROC_DEFUNCT: device neither crashed nor responding to any of the
> > + * requests and can only recover on system restart.
> > * @RPROC_LAST: just keep this one at the end
> > *
> > * Please note that the values of these states are used as indices
> > @@ -433,7 +435,8 @@ enum rproc_state {
> > RPROC_DELETED = 4,
> > RPROC_ATTACHED = 5,
> > RPROC_DETACHED = 6,
> > - RPROC_LAST = 7,
> > + RPROC_DEFUNCT = 7,
> > + RPROC_LAST = 8,
> > };
> >
> > /**
> > --
> > 2.34.1
> >
* Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT
2024-10-25 8:10 ` Mukesh Ojha
@ 2024-10-25 15:08 ` Mathieu Poirier
2024-10-25 15:39 ` Mukesh Ojha
0 siblings, 1 reply; 6+ messages in thread
From: Mathieu Poirier @ 2024-10-25 15:08 UTC
To: Mukesh Ojha; +Cc: Bjorn Andersson, linux-remoteproc, linux-kernel
On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> > Hi Mukesh,
> >
> > On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > recover it results in NULL pointer dereference issue in
> > > qcom_glink_smem_unregister().
> > >
> > > There is other side to this issue if we want to fix this via adding a
> > > > NULL check on glink->edge which does not guarantee that the remoteproc
> > > will recover in second call from Process B as it has failed in the first
> > > Process A during SMC shutdown call and may again fail at the same call
> > > and rproc can not recover for such case.
> > >
> > > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > > remoteproc and the only way to recover from it via system restart.
> > >
> > > Process-A Process-B
> > >
> > > fatal error interrupt happens
> > >
> > > rproc_crash_handler_work()
> > > mutex_lock_interruptible(&rproc->lock);
> > > ...
> > >
> > > rproc->state = RPROC_CRASHED;
> > > ...
> > > mutex_unlock(&rproc->lock);
> > >
> > > rproc_trigger_recovery()
> > > mutex_lock_interruptible(&rproc->lock);
> > >
> > > adsp_stop()
> > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > remoteproc remoteproc3: can't stop rproc: -22
> > > mutex_unlock(&rproc->lock);
> >
> > Ok, that can happen.
> >
> > >
> > > echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > > recovery_store()
> > > rproc_trigger_recovery()
> > > mutex_lock_interruptible(&rproc->lock);
> > > rproc_stop()
> > > glink_subdev_stop()
> > > qcom_glink_smem_unregister() ==|
> > > |
> > > V
> >
> > I am missing some information here but I will _assume_ this is caused by
> > glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> > process A. Instead of adding a new state to the core I think a better idea
> > would be to add a check for a NULL value on @smem in
> > qcom_glink_smem_unregister(). This is a problem that should be fixed in the
> > driver rather than the core.
> >
> > [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
>
>
> I did the same here [1] but after discussion with Bjorn, realized that
> remoteproc might not even recover and may fail in the second attempt as
> well and only way is reboot of the machine.
Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
manual intervention is needed. I don't see why another state needs to be added.
>
> [1]
> https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
>
> >
> > > Unable to handle kernel NULL pointer dereference
> > > at virtual address 0000000000000358
> > >
> > > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> > > ---
> > > Changes in v3:
> > > - Fix kernel test reported error.
> > >
> > > Changes in v2:
> > > - Removed NULL pointer check instead added a new state to signify
> > > non-recoverable state of remoteproc.
> > >
> > > drivers/remoteproc/remoteproc_core.c | 3 ++-
> > > drivers/remoteproc/remoteproc_sysfs.c | 1 +
> > > include/linux/remoteproc.h | 5 ++++-
> > > 3 files changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > index f276956f2c5c..c4e14503b971 100644
> > > --- a/drivers/remoteproc/remoteproc_core.c
> > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > > /* power off the remote processor */
> > > ret = rproc->ops->stop(rproc);
> > > if (ret) {
> > > + rproc->state = RPROC_DEFUNCT;
> > > dev_err(dev, "can't stop rproc: %d\n", ret);
> > > return ret;
> > > }
> > > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > > return ret;
> > >
> > > /* State could have changed before we got the mutex */
> > > - if (rproc->state != RPROC_CRASHED)
> > > + if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> > > goto unlock_mutex;
> >
> > > The problem is that rproc_trigger_recovery() can only be called once for a
> > remoteproc, something that modifies the state machine and may introduce backward
> > compatibility issues for other remote processor implementations.
> >
>
> I missed one more point to add here which i tried to highlight in second
> version[2] that setting of RPROC_DEFUNCT should happen for this case
> from vendor remoteproc driver and not at the core and that should take
> care of the backward compatibility.
>
> [2]
> https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/
>
> -Mukesh
>
> > Thanks,
> > Mathieu
> >
> > >
> > > dev_err(dev, "recovering %s\n", rproc->name);
> > > diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> > > index 138e752c5e4e..5f722b4576b2 100644
> > > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> > > [RPROC_DELETED] = "deleted",
> > > [RPROC_ATTACHED] = "attached",
> > > [RPROC_DETACHED] = "detached",
> > > + [RPROC_DEFUNCT] = "defunct",
> > > [RPROC_LAST] = "invalid",
> > > };
> > >
> > > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > > index b4795698d8c2..3e4ba06c6a9a 100644
> > > --- a/include/linux/remoteproc.h
> > > +++ b/include/linux/remoteproc.h
> > > @@ -417,6 +417,8 @@ struct rproc_ops {
> > > * has attached to it
> > > * @RPROC_DETACHED: device has been booted by another entity and waiting
> > > * for the core to attach to it
> > > + * @RPROC_DEFUNCT: device neither crashed nor responding to any of the
> > > + * requests and can only recover on system restart.
> > > * @RPROC_LAST: just keep this one at the end
> > > *
> > > * Please note that the values of these states are used as indices
> > > @@ -433,7 +435,8 @@ enum rproc_state {
> > > RPROC_DELETED = 4,
> > > RPROC_ATTACHED = 5,
> > > RPROC_DETACHED = 6,
> > > - RPROC_LAST = 7,
> > > + RPROC_DEFUNCT = 7,
> > > + RPROC_LAST = 8,
> > > };
> > >
> > > /**
> > > --
> > > 2.34.1
> > >
* Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT
2024-10-25 15:08 ` Mathieu Poirier
@ 2024-10-25 15:39 ` Mukesh Ojha
0 siblings, 0 replies; 6+ messages in thread
From: Mukesh Ojha @ 2024-10-25 15:39 UTC
To: Mathieu Poirier; +Cc: Bjorn Andersson, linux-remoteproc, linux-kernel
On Fri, Oct 25, 2024 at 09:08:03AM -0600, Mathieu Poirier wrote:
> On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> > On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> > > Hi Mukesh,
> > >
> > > On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > > recover it results in NULL pointer dereference issue in
> > > > qcom_glink_smem_unregister().
> > > >
> > > > There is other side to this issue if we want to fix this via adding a
> > > > > NULL check on glink->edge which does not guarantee that the remoteproc
> > > > will recover in second call from Process B as it has failed in the first
> > > > Process A during SMC shutdown call and may again fail at the same call
> > > > and rproc can not recover for such case.
> > > >
> > > > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > > > remoteproc and the only way to recover from it via system restart.
> > > >
> > > > Process-A Process-B
> > > >
> > > > fatal error interrupt happens
> > > >
> > > > rproc_crash_handler_work()
> > > > mutex_lock_interruptible(&rproc->lock);
> > > > ...
> > > >
> > > > rproc->state = RPROC_CRASHED;
> > > > ...
> > > > mutex_unlock(&rproc->lock);
> > > >
> > > > rproc_trigger_recovery()
> > > > mutex_lock_interruptible(&rproc->lock);
> > > >
> > > > adsp_stop()
> > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > > remoteproc remoteproc3: can't stop rproc: -22
> > > > mutex_unlock(&rproc->lock);
> > >
> > > Ok, that can happen.
> > >
> > > >
> > > > echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > > > recovery_store()
> > > > rproc_trigger_recovery()
> > > > mutex_lock_interruptible(&rproc->lock);
> > > > rproc_stop()
> > > > glink_subdev_stop()
> > > > qcom_glink_smem_unregister() ==|
> > > > |
> > > > V
> > >
> > > I am missing some information here but I will _assume_ this is caused by
> > > glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> > > process A. Instead of adding a new state to the core I think a better idea
> > > would be to add a check for a NULL value on @smem in
> > > qcom_glink_smem_unregister(). This is a problem that should be fixed in the
> > > driver rather than the core.
> > >
> > > [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
> >
> >
> > I did the same here [1] but after discussion with Bjorn, realized that
> > remoteproc might not even recover and may fail in the second attempt as
> > well and only way is reboot of the machine.
>
> Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
> manual intervention is needed. I don't see why another state needs to be added.
Is that really true? With recovery disabled, any rproc crash leaves the
rproc in the RPROC_CRASHED state, and enabling recovery can then bring it
back online. If the recovery attempt itself fails, however, the rproc can
be put into the RPROC_DEFUNCT state.
-Mukesh
>
> >
> > [1]
> > https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
> >
> > >
> > > > Unable to handle kernel NULL pointer dereference
> > > > at virtual address 0000000000000358
> > > >
> > > > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> > > > ---
> > > > Changes in v3:
> > > > - Fix kernel test reported error.
> > > >
> > > > Changes in v2:
> > > > - Removed NULL pointer check instead added a new state to signify
> > > > non-recoverable state of remoteproc.
> > > >
> > > > drivers/remoteproc/remoteproc_core.c | 3 ++-
> > > > drivers/remoteproc/remoteproc_sysfs.c | 1 +
> > > > include/linux/remoteproc.h | 5 ++++-
> > > > 3 files changed, 7 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > > index f276956f2c5c..c4e14503b971 100644
> > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > > > /* power off the remote processor */
> > > > ret = rproc->ops->stop(rproc);
> > > > if (ret) {
> > > > + rproc->state = RPROC_DEFUNCT;
> > > > dev_err(dev, "can't stop rproc: %d\n", ret);
> > > > return ret;
> > > > }
> > > > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > > > return ret;
> > > >
> > > > /* State could have changed before we got the mutex */
> > > > - if (rproc->state != RPROC_CRASHED)
> > > > + if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> > > > goto unlock_mutex;
> > >
> > > > The problem is that rproc_trigger_recovery() can only be called once for a
> > > > remoteproc, something that modifies the state machine and may introduce backward
> > > > compatibility issues for other remote processor implementations.
> > >
> >
> > I missed one more point here, which I tried to highlight in the second
> > version [2]: for this case RPROC_DEFUNCT should be set from the vendor
> > remoteproc driver and not from the core, and that should take care of
> > the backward compatibility.
> >
> > [2]
> > https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/
> >
> > -Mukesh
> >
> > > Thanks,
> > > Mathieu
> > >
> > > >
> > > > dev_err(dev, "recovering %s\n", rproc->name);
> > > > diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> > > > index 138e752c5e4e..5f722b4576b2 100644
> > > > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > > > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > > > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> > > > [RPROC_DELETED] = "deleted",
> > > > [RPROC_ATTACHED] = "attached",
> > > > [RPROC_DETACHED] = "detached",
> > > > + [RPROC_DEFUNCT] = "defunct",
> > > > [RPROC_LAST] = "invalid",
> > > > };
> > > >
> > > > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > > > index b4795698d8c2..3e4ba06c6a9a 100644
> > > > --- a/include/linux/remoteproc.h
> > > > +++ b/include/linux/remoteproc.h
> > > > @@ -417,6 +417,8 @@ struct rproc_ops {
> > > > * has attached to it
> > > > * @RPROC_DETACHED: device has been booted by another entity and waiting
> > > > * for the core to attach to it
> > > > + * @RPROC_DEFUNCT: device neither crashed nor responding to any of the
> > > > + * requests and can only recover on system restart.
> > > > * @RPROC_LAST: just keep this one at the end
> > > > *
> > > > * Please note that the values of these states are used as indices
> > > > @@ -433,7 +435,8 @@ enum rproc_state {
> > > > RPROC_DELETED = 4,
> > > > RPROC_ATTACHED = 5,
> > > > RPROC_DETACHED = 6,
> > > > - RPROC_LAST = 7,
> > > > + RPROC_DEFUNCT = 7,
> > > > + RPROC_LAST = 8,
> > > > };
> > > >
> > > > /**
> > > > --
> > > > 2.34.1
> > > >
2024-10-16 4:55 [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT Mukesh Ojha
2024-10-16 16:04 ` anish kumar
2024-10-21 15:12 ` Mathieu Poirier
2024-10-25 8:10 ` Mukesh Ojha
2024-10-25 15:08 ` Mathieu Poirier
2024-10-25 15:39 ` Mukesh Ojha