* [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: jack21 @ 2025-01-23  8:33 UTC
To: sudeep.holla, cristian.marussi
Cc: arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

From: Huangjie <huangjie1663@phytium.com.cn>

spin_until_cond() does not actually hold a spinlock, so on a preemptible
kernel a false timeout may be observed when the task is preempted after
spin_until_cond() returns.

Check the status again when the timeout is reached to prevent an
incorrect timeout judgement.

Signed-off-by: Huangjie <huangjie1663@phytium.com.cn>
---
 drivers/firmware/arm_scmi/driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/arm_scmi/driver.c b/drivers/firmware/arm_scmi/driver.c
index 1b5fb2c4ce86..10b049fe5fd0 100644
--- a/drivers/firmware/arm_scmi/driver.c
+++ b/drivers/firmware/arm_scmi/driver.c
@@ -1246,7 +1246,8 @@ static int scmi_wait_for_reply(struct device *dev, const struct scmi_desc *desc,
 
 	spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop));
 
-	if (ktime_after(ktime_get(), stop)) {
+	if (ktime_after(ktime_get(), stop) &&
+	    !info->desc->ops->poll_done(cinfo, xfer)) {
 		dev_err(dev,
 			"timed out in resp(caller: %pS) - polling\n",
 			(void *)_RET_IP_);
-- 
2.34.1
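For context on the commit message's claim: spin_until_cond() is a plain
relaxed busy-wait, with no lock held and no preemption disabled, roughly of
the following shape (paraphrased from memory of include/linux/processor.h;
the exact in-tree definition may differ):

/*
 * Spin while the condition is false; nothing here disables preemption,
 * so the caller can be scheduled out at any point, including right
 * after the condition finally becomes true.
 */
#define spin_until_cond(cond)					\
do {								\
	if (unlikely(!(cond))) {				\
		spin_begin();					\
		do {						\
			spin_cpu_relax();			\
		} while (!(cond));				\
		spin_end();					\
	}							\
} while (0)

If the task is preempted between the condition becoming true and the
subsequent ktime_after() check, the deadline can expire even though the
reply was actually received in time, which is the false timeout the patch
is addressing.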
* Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: Dan Carpenter @ 2025-01-23 10:38 UTC
To: jack21
Cc: sudeep.holla, cristian.marussi, arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

s/dirvers/drivers/

On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> From: Huangjie <huangjie1663@phytium.com.cn>
> 
> spin_until_cond() does not actually hold a spinlock, so on a preemptible
> kernel a false timeout may be observed when the task is preempted after
> spin_until_cond() returns.
> 
> Check the status again when the timeout is reached to prevent an
> incorrect timeout judgement.
> 

I suspect the real issue is that we exit the spin loop when
try_wait_for_completion(&xfer->done) is true. Probably we should add
that as a Fixes tag?:

Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages")

Btw, scmi_xfer_done_no_timeout() has a confusing name, because it does
time out. Was the "_no" an accident?

regards,
dan carpenter
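For reference, the loop condition Dan mentions, scmi_xfer_done_no_timeout(),
boils down to three exit conditions: the reply becoming visible via
poll_done(), xfer->done having been (forcibly) completed, or the deadline
passing. A reconstructed sketch of its shape, based only on the calls named
in this thread (the in-tree version may differ in detail):

static bool scmi_xfer_done_no_timeout(struct scmi_chan_info *cinfo,
				      struct scmi_xfer *xfer, ktime_t stop)
{
	struct scmi_info *info = handle_to_scmi_info(cinfo->handle);

	/*
	 * Stop busy-waiting when the reply is observable, when xfer->done
	 * was completed for us (see the out-of-order discussion below), or
	 * when the deadline has passed -- despite the "_no_timeout" name.
	 */
	return info->desc->ops->poll_done(cinfo, xfer) ||
	       try_wait_for_completion(&xfer->done) ||
	       ktime_after(ktime_get(), stop);
}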
* Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: Cristian Marussi @ 2025-01-23 19:48 UTC
To: Dan Carpenter
Cc: jack21, sudeep.holla, cristian.marussi, arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> s/dirvers/drivers/
> 
> On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > From: Huangjie <huangjie1663@phytium.com.cn>
> > 
> > spin_until_cond() does not actually hold a spinlock, so on a preemptible
> > kernel a false timeout may be observed when the task is preempted after
> > spin_until_cond() returns.
> > 
> > Check the status again when the timeout is reached to prevent an
> > incorrect timeout judgement.
> > 

Hi,

probably another not-so-short email of mine :P ...

> 
> I suspect the real issue is that we exit the spin loop when
> try_wait_for_completion(&xfer->done) is true. Probably we should add
> that as a Fixes tag?:
> 

The kernel SCMI stack, acting as an SCMI agent, has to cope with the
possible (even though rare) scenario of receiving out-of-order messages
when dealing with async commands...

...IOW, what do we do if, after having issued an async command, the
delayed response is received BEFORE the corresponding immediate reply to
the initial request: such a wicked (and rare) situation could be the
result of a misbehaving platform server, OR simply of parallelization of
activities on the A2P and P2A channels of some transports, i.e. the
platform sent the reply and the delayed response in the correct order but
the transport delivered them OoO, since they travel on two distinct
physical channels... hard but not impossible.

On the kernel side we address this OoO scenario by assuming that, if a
valid delayed response to an in-flight async command is received on the
P2A channel before the immediate A2P reply, the transaction itself is
good and we can make progress by just logging and swallowing the missing
immediate reply, which will probably arrive later, and carrying on
processing the delayed response.

In order to do that we maintain a per-message state machine, and inside
scmi_msg_response_validate(), when we detect the OoO condition, we force
the wait for the immediate reply to terminate by issuing a
complete(xfer->done)...

...this works straight away for the non-polled IRQ transactions, since it
causes the wait_for_completion() to terminate cleanly, BUT in order to
also cut short the busy-wait in the polling case we need that additional
try_wait_for_completion()... so as not to spin forever for a message that
we no longer care about...
[ note that any late-arriving immediate reply will then be discarded at
that point ]

For this reason I think that this patch, while correctly checking the
poll_done() condition when the spin loop terminates for the reason
explained in the commit message, should also check whether the loop was
instead forcibly terminated by the OoO scenario... otherwise we will end
up treating the polling as timed out, while it was actually forcibly
terminated by the OoO state machine's complete() and we just want to
ignore the missing message...

So what about checking this instead (untested):

+	if (!completion_done(&xfer->done) &&
+	    !info->desc->ops->poll_done(cinfo, xfer))

...
so that we determine that the polling has timed out only when:

 - the immediate reply has NOT been received [!poll_done()]
   AND
 - we were NOT forcibly made to ignore the missing reply in an OoO
   scenario [!completion_done()]

NOTE THAT xfer->done on the polling path is ONLY completed in case of OoO.

> Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages")
> 
> Btw, scmi_xfer_done_no_timeout() has a confusing name, because it does
> time out. Was the "_no" an accident?
> 

Agreed. Not sure of the history of the bad naming...

Thanks,
Cristian
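Putting the hunk from the original patch together with the check Cristian
suggests, the polling tail of scmi_wait_for_reply() would read roughly as
follows (an untested sketch; the dev_err() text and variable names come
from the patch above, while the ret assignment is an assumption about the
surrounding error handling):

	spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop));

	/*
	 * Report a timeout only if the reply is still not visible AND the
	 * busy-wait was not cut short by the OoO state machine completing
	 * xfer->done on the delayed-response path.
	 */
	if (!completion_done(&xfer->done) &&
	    !info->desc->ops->poll_done(cinfo, xfer)) {
		dev_err(dev,
			"timed out in resp(caller: %pS) - polling\n",
			(void *)_RET_IP_);
		ret = -ETIMEDOUT;	/* assumed error path */
	}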
* Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

From: Sudeep Holla @ 2025-02-11 15:37 UTC
To: jack21
Cc: Cristian Marussi, Dan Carpenter, arm-scmi, linux-arm-kernel, linux-kernel, Huangjie

On Thu, Jan 23, 2025 at 07:48:53PM +0000, Cristian Marussi wrote:
> On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> > s/dirvers/drivers/
> > 
> > On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > > From: Huangjie <huangjie1663@phytium.com.cn>
> > > 
> > > spin_until_cond() does not actually hold a spinlock, so on a preemptible
> > > kernel a false timeout may be observed when the task is preempted after
> > > spin_until_cond() returns.
> > > 
> > > Check the status again when the timeout is reached to prevent an
> > > incorrect timeout judgement.
> > > 
> 
> Hi,
> 
> probably another not-so-short email of mine :P ...
> 
> > 
> > I suspect the real issue is that we exit the spin loop when
> > try_wait_for_completion(&xfer->done) is true. Probably we should add
> > that as a Fixes tag?:
> > 
> 
> The kernel SCMI stack, acting as an SCMI agent, has to cope with the
> possible (even though rare) scenario of receiving out-of-order messages
> when dealing with async commands...
> 
> ...IOW, what do we do if, after having issued an async command, the
> delayed response is received BEFORE the corresponding immediate reply to
> the initial request: such a wicked (and rare) situation could be the
> result of a misbehaving platform server, OR simply of parallelization of
> activities on the A2P and P2A channels of some transports, i.e. the
> platform sent the reply and the delayed response in the correct order but
> the transport delivered them OoO, since they travel on two distinct
> physical channels... hard but not impossible.
> 
> On the kernel side we address this OoO scenario by assuming that, if a
> valid delayed response to an in-flight async command is received on the
> P2A channel before the immediate A2P reply, the transaction itself is
> good and we can make progress by just logging and swallowing the missing
> immediate reply, which will probably arrive later, and carrying on
> processing the delayed response.
> 
> In order to do that we maintain a per-message state machine, and inside
> scmi_msg_response_validate(), when we detect the OoO condition, we force
> the wait for the immediate reply to terminate by issuing a
> complete(xfer->done)...
> 
> ...this works straight away for the non-polled IRQ transactions, since it
> causes the wait_for_completion() to terminate cleanly, BUT in order to
> also cut short the busy-wait in the polling case we need that additional
> try_wait_for_completion()... so as not to spin forever for a message that
> we no longer care about...
> [ note that any late-arriving immediate reply will then be discarded at
> that point ]
> 
> For this reason I think that this patch, while correctly checking the
> poll_done() condition when the spin loop terminates for the reason
> explained in the commit message, should also check whether the loop was
> instead forcibly terminated by the OoO scenario... otherwise we will end
> up treating the polling as timed out, while it was actually forcibly
> terminated by the OoO state machine's complete() and we just want to
> ignore the missing message...
> 
> So what about checking this instead (untested):
> 
> +	if (!completion_done(&xfer->done) &&
> +	    !info->desc->ops->poll_done(cinfo, xfer))

Were you able to check this?
I am waiting to hear back from you on this.

-- 
Regards,
Sudeep