* [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: jack21 @ 2025-01-23  8:33 UTC
To: sudeep.holla, cristian.marussi
Cc: arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

From: Huangjie <huangjie1663@phytium.com.cn>

spin_until_cond() does not actually hold a spinlock, so on a preemptible
kernel a false timeout may be observed when the task is preempted after
spin_until_cond() returns.

Check the status again when the timeout is reached to prevent an
incorrect timeout judgement.

Signed-off-by: Huangjie <huangjie1663@phytium.com.cn>
---
 drivers/firmware/arm_scmi/driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/arm_scmi/driver.c b/drivers/firmware/arm_scmi/driver.c
index 1b5fb2c4ce86..10b049fe5fd0 100644
--- a/drivers/firmware/arm_scmi/driver.c
+++ b/drivers/firmware/arm_scmi/driver.c
@@ -1246,7 +1246,8 @@ static int scmi_wait_for_reply(struct device *dev, const struct scmi_desc *desc,
 
 	spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop));
 
-	if (ktime_after(ktime_get(), stop)) {
+	if (ktime_after(ktime_get(), stop) &&
+	    !info->desc->ops->poll_done(cinfo, xfer)) {
 		dev_err(dev,
 			"timed out in resp(caller: %pS) - polling\n",
 			(void *)_RET_IP_);
-- 
2.34.1
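For context on the commit message's claim: spin_until_cond() is a plain
relaxed busy-wait, with no lock held and no preemption disabled, roughly of
the following shape (paraphrased from memory of include/linux/processor.h;
the exact in-tree definition may differ):

/*
 * Spin while the condition is false; nothing here disables preemption,
 * so the caller can be scheduled out at any point, including right
 * after the condition finally becomes true.
 */
#define spin_until_cond(cond)					\
do {								\
	if (unlikely(!(cond))) {				\
		spin_begin();					\
		do {						\
			spin_cpu_relax();			\
		} while (!(cond));				\
		spin_end();					\
	}							\
} while (0)

If the task is preempted between the condition becoming true and the
subsequent ktime_after() check, the deadline can expire even though the
reply was actually received in time, which is the false timeout the patch
is addressing.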
* Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: Dan Carpenter @ 2025-01-23 10:38 UTC
To: jack21
Cc: sudeep.holla, cristian.marussi, arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

s/dirvers/drivers/

On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> From: Huangjie <huangjie1663@phytium.com.cn>
> 
> spin_until_cond() does not actually hold a spinlock, so on a preemptible
> kernel a false timeout may be observed when the task is preempted after
> spin_until_cond() returns.
> 
> Check the status again when the timeout is reached to prevent an
> incorrect timeout judgement.
> 

I suspect the real issue is that we exit the spin loop when
try_wait_for_completion(&xfer->done) is true. Probably we should add
that as a Fixes tag?:

Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages")

Btw, scmi_xfer_done_no_timeout() has a confusing name, because it does
time out. Was the "_no" an accident?

regards,
dan carpenter
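For reference, the loop condition Dan mentions, scmi_xfer_done_no_timeout(),
boils down to three exit conditions: the reply becoming visible via
poll_done(), xfer->done having been (forcibly) completed, or the deadline
passing. A reconstructed sketch of its shape, based only on the calls named
in this thread (the in-tree version may differ in detail):

static bool scmi_xfer_done_no_timeout(struct scmi_chan_info *cinfo,
				      struct scmi_xfer *xfer, ktime_t stop)
{
	struct scmi_info *info = handle_to_scmi_info(cinfo->handle);

	/*
	 * Stop busy-waiting when the reply is observable, when xfer->done
	 * was completed for us (see the out-of-order discussion below), or
	 * when the deadline has passed -- despite the "_no_timeout" name.
	 */
	return info->desc->ops->poll_done(cinfo, xfer) ||
	       try_wait_for_completion(&xfer->done) ||
	       ktime_after(ktime_get(), stop);
}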
* Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: Cristian Marussi @ 2025-01-23 19:48 UTC
To: Dan Carpenter
Cc: jack21, sudeep.holla, cristian.marussi, arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> s/dirvers/drivers/
> 
> On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > From: Huangjie <huangjie1663@phytium.com.cn>
> > 
> > spin_until_cond() does not actually hold a spinlock, so on a preemptible
> > kernel a false timeout may be observed when the task is preempted after
> > spin_until_cond() returns.
> > 
> > Check the status again when the timeout is reached to prevent an
> > incorrect timeout judgement.
> > 

Hi,

probably another not-so-short email of mine :P ...

> 
> I suspect the real issue is that we exit the spin loop when
> try_wait_for_completion(&xfer->done) is true. Probably we should add
> that as a Fixes tag?:
> 

The kernel SCMI stack, acting as an SCMI agent, has to cope with the
possible (even though rare) scenario of receiving out-of-order messages
when dealing with async commands...

...IOW, what do we do if, after having issued an async command, the
delayed response is received BEFORE the corresponding immediate reply to
the initial request: such a wicked (and rare) situation could be the
result of a misbehaving platform server, OR simply of parallelization of
activities on the A2P and P2A channels of some transports, i.e. the
platform sent the reply and the delayed response in the correct order but
the transport delivered them OoO, since they travel on two distinct
physical channels... hard but not impossible.

On the kernel side we address this OoO scenario by assuming that, if a
valid delayed response to an in-flight async command is received on the
P2A channel before the immediate A2P reply, the transaction itself is
good and we can make progress by just logging and swallowing the missing
immediate reply, which will probably arrive later, and carrying on
processing the delayed response.

In order to do that we maintain a per-message state machine, and inside
scmi_msg_response_validate(), when we detect the OoO condition, we force
the wait for the immediate reply to terminate by issuing a
complete(xfer->done)...

...this works straight away for the non-polled IRQ transactions, since it
causes the wait_for_completion() to terminate cleanly, BUT in order to
also cut short the busy-wait in the polling case we need that additional
try_wait_for_completion()... so as not to spin forever for a message that
we no longer care about...
[ note that any late-arriving immediate reply will then be discarded at
that point ]

For this reason I think that this patch, while correctly checking the
poll_done() condition when the spin loop terminates for the reason
explained in the commit message, should also check whether the loop was
instead forcibly terminated by the OoO scenario... otherwise we will end
up treating the polling as timed out, while it was actually forcibly
terminated by the OoO state machine's complete() and we just want to
ignore the missing message...

So what about checking this instead (untested):

+	if (!completion_done(&xfer->done) &&
+	    !info->desc->ops->poll_done(cinfo, xfer))

...
so that we determine that the polling has timed out only when:

 - the immediate reply has NOT been received [!poll_done()]
   AND
 - we were NOT forcibly made to ignore the missing reply in an OoO
   scenario [!completion_done()]

NOTE THAT xfer->done on the polling path is ONLY completed in case of OoO.

> Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages")
> 
> Btw, scmi_xfer_done_no_timeout() has a confusing name, because it does
> time out. Was the "_no" an accident?
> 

Agreed. Not sure of the history of the bad naming...

Thanks,
Cristian
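Putting the hunk from the original patch together with the check Cristian
suggests, the polling tail of scmi_wait_for_reply() would read roughly as
follows (an untested sketch; the dev_err() text and variable names come
from the patch above, while the ret assignment is an assumption about the
surrounding error handling):

	spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop));

	/*
	 * Report a timeout only if the reply is still not visible AND the
	 * busy-wait was not cut short by the OoO state machine completing
	 * xfer->done on the delayed-response path.
	 */
	if (!completion_done(&xfer->done) &&
	    !info->desc->ops->poll_done(cinfo, xfer)) {
		dev_err(dev,
			"timed out in resp(caller: %pS) - polling\n",
			(void *)_RET_IP_);
		ret = -ETIMEDOUT;	/* assumed error path */
	}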
* Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: Sudeep Holla @ 2025-02-11 15:37 UTC
To: jack21
Cc: Cristian Marussi, Dan Carpenter, arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

On Thu, Jan 23, 2025 at 07:48:53PM +0000, Cristian Marussi wrote:
> On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> > s/dirvers/drivers/
> > 
> > On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > > From: Huangjie <huangjie1663@phytium.com.cn>
> > > 
> > > spin_until_cond() does not actually hold a spinlock, so on a preemptible
> > > kernel a false timeout may be observed when the task is preempted after
> > > spin_until_cond() returns.
> > > 
> > > Check the status again when the timeout is reached to prevent an
> > > incorrect timeout judgement.
> > > 
> 
> Hi,
> 
> probably another not-so-short email of mine :P ...
> 
> > 
> > I suspect the real issue is that we exit the spin loop when
> > try_wait_for_completion(&xfer->done) is true. Probably we should add
> > that as a Fixes tag?:
> > 
> 
> The kernel SCMI stack, acting as an SCMI agent, has to cope with the
> possible (even though rare) scenario of receiving out-of-order messages
> when dealing with async commands...
> 
> ...IOW, what do we do if, after having issued an async command, the
> delayed response is received BEFORE the corresponding immediate reply to
> the initial request: such a wicked (and rare) situation could be the
> result of a misbehaving platform server, OR simply of parallelization of
> activities on the A2P and P2A channels of some transports, i.e. the
> platform sent the reply and the delayed response in the correct order but
> the transport delivered them OoO, since they travel on two distinct
> physical channels... hard but not impossible.
> 
> On the kernel side we address this OoO scenario by assuming that, if a
> valid delayed response to an in-flight async command is received on the
> P2A channel before the immediate A2P reply, the transaction itself is
> good and we can make progress by just logging and swallowing the missing
> immediate reply, which will probably arrive later, and carrying on
> processing the delayed response.
> 
> In order to do that we maintain a per-message state machine, and inside
> scmi_msg_response_validate(), when we detect the OoO condition, we force
> the wait for the immediate reply to terminate by issuing a
> complete(xfer->done)...
> 
> ...this works straight away for the non-polled IRQ transactions, since it
> causes the wait_for_completion() to terminate cleanly, BUT in order to
> also cut short the busy-wait in the polling case we need that additional
> try_wait_for_completion()... so as not to spin forever for a message that
> we no longer care about...
> [ note that any late-arriving immediate reply will then be discarded at
> that point ]
> 
> For this reason I think that this patch, while correctly checking the
> poll_done() condition when the spin loop terminates for the reason
> explained in the commit message, should also check whether the loop was
> instead forcibly terminated by the OoO scenario... otherwise we will end
> up treating the polling as timed out, while it was actually forcibly
> terminated by the OoO state machine's complete() and we just want to
> ignore the missing message...
> 
> So what about checking this instead (untested):
> 
> +	if (!completion_done(&xfer->done) &&
> +	    !info->desc->ops->poll_done(cinfo, xfer))

Were you able to check this?
I am waiting to hear back from you on this.

-- 
Regards,
Sudeep