[PATCH 0/2] soc: qcom: pmic_glink: Resolve failures to bring up pmic_glink


* [PATCH 0/2] soc: qcom: pmic_glink: Resolve failures to bring up pmic_glink
@ 2024-10-22  4:17 Bjorn Andersson
  2024-10-22  4:17 ` [PATCH 1/2] rpmsg: glink: Handle rejected intent request better Bjorn Andersson
  2024-10-22  4:17 ` [PATCH 2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections Bjorn Andersson
  0 siblings, 2 replies; 6+ messages in thread
From: Bjorn Andersson @ 2024-10-22  4:17 UTC (permalink / raw)
  To: Bjorn Andersson, Mathieu Poirier, Chris Lew, Konrad Dybcio,
	Johan Hovold
  Cc: linux-arm-msm, Bjorn Andersson, linux-remoteproc, linux-kernel,
	Bjorn Andersson, stable

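With the transition of pd-mapper into the kernel, the timing was altered
such that on some targets the initial rpmsg_send() requests from
pmic_glink clients are attempted before the firmware has announced
intents, and the firmware rejects the resulting intent requests.

Fix this by making GLINK report rejected intent requests as -EAGAIN and
by retrying the send in pmic_glink until intents become available.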

Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
Bjorn Andersson (2):
      rpmsg: glink: Handle rejected intent request better
      soc: qcom: pmic_glink: Handle GLINK intent allocation rejections

 drivers/rpmsg/qcom_glink_native.c | 10 +++++++---
 drivers/soc/qcom/pmic_glink.c     | 18 +++++++++++++++---
 2 files changed, 22 insertions(+), 6 deletions(-)
---
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
change-id: 20241022-pmic-glink-ecancelled-d899a9ca0358

Best regards,
-- 
Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>



* [PATCH 1/2] rpmsg: glink: Handle rejected intent request better
  2024-10-22  4:17 [PATCH 0/2] soc: qcom: pmic_glink: Resolve failures to bring up pmic_glink Bjorn Andersson
@ 2024-10-22  4:17 ` Bjorn Andersson
  2024-10-22 15:39   ` Johan Hovold
  2024-10-22  4:17 ` [PATCH 2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections Bjorn Andersson
  1 sibling, 1 reply; 6+ messages in thread
From: Bjorn Andersson @ 2024-10-22  4:17 UTC (permalink / raw)
  To: Bjorn Andersson, Mathieu Poirier, Chris Lew, Konrad Dybcio,
	Johan Hovold
  Cc: linux-arm-msm, Bjorn Andersson, linux-remoteproc, linux-kernel,
	Bjorn Andersson, stable

The initial implementation of request intent response handling dealt
with two outcomes: granted allocations, and all other cases being
considered -ECANCELED (likely from "cancelling the operation as the
remote is going down").

But on some channels intent allocation is not supported; instead the
remote will pre-allocate and announce a fixed number of intents for the
sender to use. If rpmsg_send() is invoked on such a channel before any
intents have been announced, an intent request will be issued, and as
this comes back rejected the call fails with -ECANCELED.

Given that this is reported in the same way as the remote being shut
down, there's no way for the client to differentiate the two cases.

In line with the original GLINK design, change the return value to
-EAGAIN for the case where the remote rejects an intent allocation
request.

It's tempting to handle this case in the GLINK core, as we expect
intents to show up in this case. But there's no way to distinguish
between this case and a rejection for a too big allocation, nor is it
possible to predict if a currently used (and seemingly suitable) intent
will be returned for reuse or not. As such, returning the error to the
client and allowing it to react seems to be the only sensible solution.

In addition to this, commit 'c05dfce0b89e ("rpmsg: glink: Wait for
intent, not just request ack")' changed the logic such that the code
always waits for an intent request response and an intent. This works out
in most cases, but in the event that an intent request is rejected and no
further intent arrives (e.g. client asks for a too big intent), the code
will stall for 10 seconds and then return -ETIMEDOUT instead of a more
suitable error.

That commit also meant that intent requests racing with the shutdown of
the remote are exposed to this same problem, unless some intent
happens to arrive. A patch for this was developed and posted by Sarannya
S [1], and has been incorporated here.

To summarize, the intent request can end in 4 ways:
- Timeout, no response arrived => return -ETIMEDOUT
- Abort TX, the edge is going away => return -ECANCELED
- Intent request was rejected => return -EAGAIN
- Intent request was accepted, and an intent arrived => return 0
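
The four outcomes above can be modelled outside the kernel. The following
is a minimal userspace sketch, not the driver code: the struct and both
function names are hypothetical, and the convention that req_result is -1
until a response arrives is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the state the wait condition inspects. */
struct intent_wait_state {
	int  req_result;      /* assumed: -1 no response yet, 0 rejected, >0 granted */
	bool intent_received; /* an intent arrived after the request */
	bool abort_tx;        /* the edge is going away */
};

/*
 * Wake-up condition: stop waiting on a rejection, on a grant that has
 * been followed by an intent, or when transmission is being aborted.
 */
bool intent_wait_done(const struct intent_wait_state *s)
{
	assert(s != NULL);
	return s->req_result == 0 ||
	       (s->req_result > 0 && s->intent_received) ||
	       s->abort_tx;
}

/*
 * Map the final state to a return value. timed_out models the wait
 * expiring before intent_wait_done() became true.
 */
int intent_request_status(const struct intent_wait_state *s, bool timed_out)
{
	assert(s != NULL);
	if (timed_out)
		return -ETIMEDOUT;
	if (s->abort_tx)
		return -ECANCELED;
	return s->req_result ? 0 : -EAGAIN;
}
```

Checking abort_tx before the request result means a shutdown that races
with a late grant is still reported as -ECANCELED, which matches the
ordering of the branches in the diff below.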

This patch was developed with input from Sarannya S, Deepak Kumar Singh,
and Chris Lew.

[1] https://lore.kernel.org/all/20240925072328.1163183-1-quic_deesin@quicinc.com/

Fixes: c05dfce0b89e ("rpmsg: glink: Wait for intent, not just request ack")
Cc: stable@vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
 drivers/rpmsg/qcom_glink_native.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/rpmsg/qcom_glink_native.c b/drivers/rpmsg/qcom_glink_native.c
index 0b2f290069080638581a13b3a580054d31e176c2..d3af1dfa3c7d71b95dda911dfc7ad844679359d6 100644
--- a/drivers/rpmsg/qcom_glink_native.c
+++ b/drivers/rpmsg/qcom_glink_native.c
@@ -1440,14 +1440,18 @@ static int qcom_glink_request_intent(struct qcom_glink *glink,
 		goto unlock;

 	ret = wait_event_timeout(channel->intent_req_wq,
-				 READ_ONCE(channel->intent_req_result) >= 0 &&
-				 READ_ONCE(channel->intent_received),
+				 READ_ONCE(channel->intent_req_result) == 0 ||
+				 (READ_ONCE(channel->intent_req_result) > 0 &&
+				  READ_ONCE(channel->intent_received)) ||
+				 glink->abort_tx,
 				 10 * HZ);
 	if (!ret) {
 		dev_err(glink->dev, "intent request timed out\n");
 		ret = -ETIMEDOUT;
+	} else if (glink->abort_tx) {
+		ret = -ECANCELED;
 	} else {
-		ret = READ_ONCE(channel->intent_req_result) ? 0 : -ECANCELED;
+		ret = READ_ONCE(channel->intent_req_result) ? 0 : -EAGAIN;
 	}

 unlock:

-- 
2.43.0


* [PATCH 2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections
  2024-10-22  4:17 [PATCH 0/2] soc: qcom: pmic_glink: Resolve failures to bring up pmic_glink Bjorn Andersson
  2024-10-22  4:17 ` [PATCH 1/2] rpmsg: glink: Handle rejected intent request better Bjorn Andersson
@ 2024-10-22  4:17 ` Bjorn Andersson
  2024-10-22 15:30   ` Johan Hovold
  1 sibling, 1 reply; 6+ messages in thread
From: Bjorn Andersson @ 2024-10-22  4:17 UTC (permalink / raw)
  To: Bjorn Andersson, Mathieu Poirier, Chris Lew, Konrad Dybcio,
	Johan Hovold
  Cc: linux-arm-msm, Bjorn Andersson, linux-remoteproc, linux-kernel,
	Bjorn Andersson, stable

Some versions of the pmic_glink firmware do not allow dynamic GLINK
intent allocations, so attempting to send a message before the firmware
has allocated its receive buffers and announced these intent allocations
will fail. When this happens something like this shows up in the log:

	[    9.799719] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
	[    9.812446] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
	[    9.831796] ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125

GLINK has been updated to distinguish between the cases where the remote
is going down (-ECANCELED) and the intent allocation being rejected
(-EAGAIN).

Retry the send until intent buffers become available, or an actual
error occurs.

To avoid infinitely waiting for the firmware in the event that this
misbehaves and no intents arrive, an arbitrary 10 second timeout is
used.
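
The bounded retry described above can be sketched in plain C with the
send, clock and sleep operations injected as callbacks, so the behaviour
can be exercised without hardware. All names here (retry_ops,
send_with_retry, fake_remote) are hypothetical. The pause between
attempts is an addition for illustration and is not part of the patch as
posted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical hooks. In the driver these roles are played by
 * rpmsg_send(), jiffies and a sleep primitive.
 */
struct retry_ops {
	int  (*send)(void *ctx);              /* 0, -EAGAIN or another -errno */
	long (*now_ms)(void *ctx);            /* monotonic time in milliseconds */
	void (*sleep_ms)(void *ctx, long ms); /* pause between attempts */
};

/* Retry while the remote answers -EAGAIN, giving up after timeout_ms. */
int send_with_retry(const struct retry_ops *ops, void *ctx,
		    long timeout_ms, long delay_ms)
{
	long start;
	int ret;

	assert(ops && ops->send && ops->now_ms && ops->sleep_ms);
	start = ops->now_ms(ctx);

	for (;;) {
		ret = ops->send(ctx);
		if (ret != -EAGAIN)
			return ret;           /* success or a real error */
		if (ops->now_ms(ctx) - start >= timeout_ms)
			return -ETIMEDOUT;    /* no intent ever became available */
		ops->sleep_ms(ctx, delay_ms); /* fixed pause before retrying */
	}
}

/* Simulated remote: rejects every send until its clock reaches ready_at_ms. */
struct fake_remote {
	long clock_ms;
	long ready_at_ms;
	int  attempts;
};

static int fake_send(void *ctx)
{
	struct fake_remote *r = ctx;

	r->attempts++;
	return r->clock_ms >= r->ready_at_ms ? 0 : -EAGAIN;
}

static long fake_now(void *ctx)
{
	return ((struct fake_remote *)ctx)->clock_ms;
}

/* Sleeping only advances the simulated clock. */
static void fake_sleep(void *ctx, long ms)
{
	((struct fake_remote *)ctx)->clock_ms += ms;
}

const struct retry_ops fake_ops = { fake_send, fake_now, fake_sleep };
```

With a fake_remote that becomes ready at 250 ms and a 100 ms delay, the
call succeeds on the fourth attempt. A remote that never becomes ready
yields -ETIMEDOUT once the timeout has elapsed, rather than leaking
-EAGAIN to the caller.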

This patch was developed with input from Chris Lew.

Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t
Cc: stable@vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
 drivers/soc/qcom/pmic_glink.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
index 9606222993fd78e80d776ea299cad024a0197e91..221639f3da149da1f967dbc769a97d327ffd6c63 100644
--- a/drivers/soc/qcom/pmic_glink.c
+++ b/drivers/soc/qcom/pmic_glink.c
@@ -13,6 +13,8 @@
 #include <linux/soc/qcom/pmic_glink.h>
 #include <linux/spinlock.h>
 
+#define PMIC_GLINK_SEND_TIMEOUT (10*HZ)
+
 enum {
 	PMIC_GLINK_CLIENT_BATT = 0,
 	PMIC_GLINK_CLIENT_ALTMODE,
@@ -112,13 +114,23 @@ EXPORT_SYMBOL_GPL(pmic_glink_client_register);
 int pmic_glink_send(struct pmic_glink_client *client, void *data, size_t len)
 {
 	struct pmic_glink *pg = client->pg;
+	unsigned long start;
+	bool timeout_reached = false;
 	int ret;
 
 	mutex_lock(&pg->state_lock);
-	if (!pg->ept)
+	if (!pg->ept) {
 		ret = -ECONNRESET;
-	else
-		ret = rpmsg_send(pg->ept, data, len);
+	} else {
+		start = jiffies;
+		do {
+			timeout_reached = time_after(jiffies, start + PMIC_GLINK_SEND_TIMEOUT);
+			ret = rpmsg_send(pg->ept, data, len);
+		} while (ret == -EAGAIN && !timeout_reached);
+
+		if (ret == -EAGAIN && timeout_reached)
+			ret = -ETIMEDOUT;
+	}
 	mutex_unlock(&pg->state_lock);
 
 	return ret;

-- 
2.43.0



* Re: [PATCH 2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections
  2024-10-22  4:17 ` [PATCH 2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections Bjorn Andersson
@ 2024-10-22 15:30   ` Johan Hovold
  2024-10-23 21:13     ` Bjorn Andersson
  0 siblings, 1 reply; 6+ messages in thread
From: Johan Hovold @ 2024-10-22 15:30 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Bjorn Andersson, Mathieu Poirier, Chris Lew, Konrad Dybcio,
	linux-arm-msm, Bjorn Andersson, linux-remoteproc, linux-kernel,
	stable

On Tue, Oct 22, 2024 at 04:17:12AM +0000, Bjorn Andersson wrote:
> Some versions of the pmic_glink firmware does not allow dynamic GLINK
> intent allocations, attempting to send a message before the firmware has
> allocated its receive buffers and announced these intent allocations
> will fail. When this happens something like this showns up in the log:
> 
> 	[    9.799719] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
> 	[    9.812446] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
> 	[    9.831796] ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125

I think you should drop the time stamps here, and also add the battery
notification error to make the patch easier to find when searching for
these errors:

	qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications

> GLINK has been updated to distinguish between the cases where the remote
> is going down (-ECANCELLED) and the intent allocation being rejected
> (-EAGAIN).
> 
> Retry the send until intent buffers becomes available, or an actual
> error occur.
> 
> To avoid infinitely waiting for the firmware in the event that this
> misbehaves and no intents arrive, an arbitrary 10 second timeout is
> used.
> 
> This patch was developed with input from Chris Lew.
> 
> Reported-by: Johan Hovold <johan@kernel.org>
> Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t

This indeed seems to fix the -ECANCELED related errors I reported above,
but the audio probe failure still remains as expected:

	PDR: avs/audio get domain list txn wait failed: -110
	PDR: service lookup for avs/audio failed: -110

I hit it on the third reboot and then again after another 75 reboots
(and have never seen it with the user space pd-mapper over several
hundred boots).

Do you guys have any theories as to what is causing the above with the
in-kernel pd-mapper (beyond the obvious changes in timing)?

> Cc: stable@vger.kernel.org

Can you add a Fixes tag here?

This patch depends on the former, but that is not necessarily obvious
for someone backporting this (and the previous patch is only going to be
backported to 6.4).

Perhaps you can use the stable tag dependency annotation or even mark
the previous patch so that it is backported far enough.

> Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>

Tested-by: Johan Hovold <johan+linaro@kernel.org>
	
> ---
>  drivers/soc/qcom/pmic_glink.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
> index 9606222993fd78e80d776ea299cad024a0197e91..221639f3da149da1f967dbc769a97d327ffd6c63 100644
> --- a/drivers/soc/qcom/pmic_glink.c
> +++ b/drivers/soc/qcom/pmic_glink.c
> @@ -13,6 +13,8 @@
>  #include <linux/soc/qcom/pmic_glink.h>
>  #include <linux/spinlock.h>
>  
> +#define PMIC_GLINK_SEND_TIMEOUT (10*HZ)

nit: spaces around *

Ten seconds seems a little excessive; are there any reasons for not
picking something shorter like 5 s (also used by USB but that comes from
spec)?

> +
>  enum {
>  	PMIC_GLINK_CLIENT_BATT = 0,
>  	PMIC_GLINK_CLIENT_ALTMODE,
> @@ -112,13 +114,23 @@ EXPORT_SYMBOL_GPL(pmic_glink_client_register);
>  int pmic_glink_send(struct pmic_glink_client *client, void *data, size_t len)
>  {
>  	struct pmic_glink *pg = client->pg;
> +	unsigned long start;
> +	bool timeout_reached = false;

No need to initialise.

>  	int ret;
>  
>  	mutex_lock(&pg->state_lock);
> -	if (!pg->ept)
> +	if (!pg->ept) {
>  		ret = -ECONNRESET;
> -	else
> -		ret = rpmsg_send(pg->ept, data, len);
> +	} else {
> +		start = jiffies;
> +		do {
> +			timeout_reached = time_after(jiffies, start + PMIC_GLINK_SEND_TIMEOUT);
> +			ret = rpmsg_send(pg->ept, data, len);

Add a delay here to avoid hammering the remote side with requests in a
tight loop for 10 s?

> +		} while (ret == -EAGAIN && !timeout_reached);
> +
> +		if (ret == -EAGAIN && timeout_reached)
> +			ret = -ETIMEDOUT;
> +	}
>  	mutex_unlock(&pg->state_lock);
>  
>  	return ret;

Looks good to me otherwise: 

Reviewed-by: Johan Hovold <johan+linaro@kernel.org>

Johan


* Re: [PATCH 1/2] rpmsg: glink: Handle rejected intent request better
  2024-10-22  4:17 ` [PATCH 1/2] rpmsg: glink: Handle rejected intent request better Bjorn Andersson
@ 2024-10-22 15:39   ` Johan Hovold
  0 siblings, 0 replies; 6+ messages in thread
From: Johan Hovold @ 2024-10-22 15:39 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Bjorn Andersson, Mathieu Poirier, Chris Lew, Konrad Dybcio,
	linux-arm-msm, Bjorn Andersson, linux-remoteproc, linux-kernel,
	stable

On Tue, Oct 22, 2024 at 04:17:11AM +0000, Bjorn Andersson wrote:
> The initial implementation of request intent response handling dealt
> with two outcomes; granted allocations, and all other cases being
> considered -ECANCELLED (likely from "cancelling the operation as the
> remote is going down").

For the benefit of casual reviewers and contributors, could you add an
introductory comment about what "intents" are?

> But on some channels intent allocation is not supported, instead the
> remote will pre-allocate and announce a fixed number of intents for the
> sender to use. If for such channels an rpmsg_send() is being invoked
> before any channels have been announced, an intent request will be
> issued and as this comes back rejected the call is failed with
> -ECANCELLED.

It's actually the one L -ECANCELED

s/is failed/fails/ ?
 
> Given that this is reported in the same way as the remote being shut
> down, there's no way for the client to differentiate the two cases.
> 
> In line with the original GLINK design, change the return value to
> -EAGAIN for the case where the remote rejects an intent allocation
> request.
> 
> It's tempting to handle this case in the GLINK core, as we expect
> intents to show up in this case. But there's no way to distinguish
> between this case and a rejection for a too big allocation, nor is it
> possible to predict if a currently used (and seeminly suitable) intent

seemingly

> will be returned for reuse or not. As such, returning the error to the
> client and allow it to react seems to be the only sensible solution.

s/allow/allowing/ ?

> In addition to this, commit 'c05dfce0b89e ("rpmsg: glink: Wait for
> intent, not just request ack")' changed the logic such that the code
> always wait for an intent request response and an intent. This works out
> in most cases, but in the event that a intent request is rejected and no

an intent

> further intent arrives (e.g. client asks for a too big intent), the code
> will stall for 10 seconds and then return -ETIMEDOUT; instead of a more
> suitable error.
> 
> This change also resulted in intent requests racing with the shutdown of
> the remote would be exposed to this same problem, unless some intent
> happens to arrive. A patch for this was developed and posted by Sarannya
> S [1], and has been incorporated here.
> 
> To summarize, the intent request can end in 4 ways:
> - Timeout, no response arrived => return -ETIMEDOUT
> - Abort TX, the edge is going away => return -ECANCELLED
> - Intent request was rejected => return -EAGAIN
> - Intent request was accepted, and an intent arrived => return 0
> 
> This patch was developed with input from Sarannya S, Deepak Kumar Singh,
> and Chris Lew.
> 
> [1] https://lore.kernel.org/all/20240925072328.1163183-1-quic_deesin@quicinc.com/
> 
> Fixes: c05dfce0b89e ("rpmsg: glink: Wait for intent, not just request ack")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>

Nit picks aside, this was all nice and clear.

Tested-by: Johan Hovold <johan+linaro@kernel.org>


* Re: [PATCH 2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections
  2024-10-22 15:30   ` Johan Hovold
@ 2024-10-23 21:13     ` Bjorn Andersson
  0 siblings, 0 replies; 6+ messages in thread
From: Bjorn Andersson @ 2024-10-23 21:13 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Bjorn Andersson, Mathieu Poirier, Chris Lew, Konrad Dybcio,
	linux-arm-msm, Bjorn Andersson, linux-remoteproc, linux-kernel,
	stable

On Tue, Oct 22, 2024 at 05:30:55PM GMT, Johan Hovold wrote:
> On Tue, Oct 22, 2024 at 04:17:12AM +0000, Bjorn Andersson wrote:
[..]
> > Reported-by: Johan Hovold <johan@kernel.org>
> > Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t
> 
> This indeed seems to fix the -ECANCELED related errors I reported above,
> but the audio probe failure still remains as expected:
> 
> 	PDR: avs/audio get domain list txn wait failed: -110
> 	PDR: service lookup for avs/audio failed: -110
> 
> I hit it on the third reboot and then again after another 75 reboots
> (and have never seen it with the user space pd-mapper over several
> hundred boots).
> 
> Do you guys have any theories as to what is causing the above with the
> in-kernel pd-mapper (beyond the obvious changes in timing)?
> 

Not yet. This would be a timeout in a completely different codepath.

I'm trying to figure out a better way to reproduce this than just
restarting the whole machine...

Thanks for the review.

Regards,
Bjorn
