Date: Wed, 9 Oct 2024 09:32:42 +0100
From: Cristian Marussi
To: Justin Chen
Cc: Cristian Marussi, Sudeep Holla, arm-scmi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, peng.fan@nxp.com,
	bcm-kernel-feedback-list@broadcom.com, florian.fainelli@broadcom.com
Subject: Re: [PATCH] firmware: arm_scmi: Queue in scmi layer for mailbox implementation
References: <20241004221257.2888603-1-justin.chen@broadcom.com>
	<1ad5c4e9-9f98-40ab-afa4-a7939781e8cc@broadcom.com>

On Tue, Oct 08, 2024 at 12:23:28PM -0700, Justin Chen wrote:
> On 10/8/24 6:37 AM, Cristian Marussi wrote:
> > On Tue, Oct 08, 2024 at 02:34:59PM +0100, Sudeep Holla wrote:
> > > On Tue, Oct 08, 2024 at 02:23:00PM +0100, Sudeep Holla wrote:
> > > > On Tue, Oct 08, 2024 at 01:10:39PM +0100, Cristian Marussi wrote:
> > > > > On Mon, Oct 07, 2024 at 10:58:47AM -0700, Justin Chen wrote:
> > > > > > Thanks for the response. I'll try to elaborate.
> > > > > >
> > > > > > When comparing the SMC and mailbox transports, we noticed that
> > > > > > the mailbox transport times out much more quickly under load.
> > > > > > Originally we thought this was latency in the mailbox
> > > > > > implementation, but after debugging we noticed some weird
> > > > > > behavior: we saw SCMI transactions timing out before the
> > > > > > mailbox had even transmitted the message.
> > > > > >
> > > > > > The issue lies in the SCMI layer, in the do_xfer() function of
> > > > > > drivers/firmware/arm_scmi/driver.c.
> > > > > >
> > > > > > The fundamental issue is that send_message() blocks for the SMC
> > > > > > transport but does not block for the mailbox transport. So, if
> > > > > > send_message() doesn't block, we can have multiple messages
> > > > > > waiting at scmi_wait_for_message_response().
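
To make the flow concrete, the relevant skeleton of do_xfer() looks
roughly like this (condensed and paraphrased from
drivers/firmware/arm_scmi/driver.c, with setup, tracing and most error
handling elided; not verbatim mainline code):

static int do_xfer(const struct scmi_protocol_handle *ph,
		   struct scmi_xfer *xfer)
{
	int ret;
	struct scmi_info *info;		/* looked up from ph, elided */
	struct scmi_chan_info *cinfo;	/* tx channel, lookup elided */

	/*
	 * For SMC this returns only once the firmware call has completed;
	 * for mailbox it returns as soon as the message has been queued
	 * via mbox_send_message(), possibly behind other requests.
	 */
	ret = info->desc->ops->send_message(cinfo, xfer);
	if (ret < 0)
		return ret;

	/*
	 * The timeout clock starts here for every caller, whether or not
	 * its message has actually reached the platform yet, hence the
	 * drift when sends are queued but not yet transmitted.
	 */
	ret = scmi_wait_for_message_response(cinfo, xfer);

	if (info->desc->ops->mark_txdone)
		info->desc->ops->mark_txdone(cinfo, ret, xfer);

	return ret;
}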
> > > > >
> > > > > oh...yes...now I can see it...tx_prepare is really never called,
> > > > > given how the mailbox subsystem de-queues messages one at a
> > > > > time...so we end up waiting for a reply to a message that is
> > > > > still to be sent...so the in-flight message is really NOT
> > > > > corrupted, because the next ones remain pending until the reply
> > > > > in the shmem is read back, BUT the timeout will drift away if
> > > > > multiple in-flight messages are pending to be sent...
> > > >
> > > > Indeed.
> > > >
> > > > > > SMC looks like this:
> > > > > >
> > > > > > CPU #0, SCMI message 0 -> calls send_message(), then calls
> > > > > > scmi_wait_for_message_response(); times out after 30ms.
> > > > > > CPU #1, SCMI message 1 -> blocks at send_message(), waiting
> > > > > > for SCMI message 0 to complete.
> > > > > >
> > > > > > Mailbox looks like this:
> > > > > >
> > > > > > CPU #0, SCMI message 0 -> calls send_message(); the mailbox
> > > > > > layer queues up the message, sees no message outgoing, and
> > > > > > sends it. The CPU waits at scmi_wait_for_message_response();
> > > > > > times out after 30ms.
> > > > > > CPU #1, SCMI message 1 -> calls send_message(); the mailbox
> > > > > > layer queues up the message, sees a message pending, and holds
> > > > > > the message in its queue. The CPU waits at
> > > > > > scmi_wait_for_message_response(); times out after 30ms.
> > > > > >
> > > > > > Let's say the transport takes 25ms per message. The first
> > > > > > message would succeed, but the second would be left with only
> > > > > > 5ms of its 30ms budget by the time it is actually transmitted,
> > > > > > and so it would time out.
> > > > > >
> > > > > > Hopefully this makes sense.
> > > > >
> > > > > Yes, of course, thanks for reporting this and taking the time
> > > > > to explain...
> > > > >
> > > > > ...in general the patch LGTM...I think your patch is good also
> > > > > because it could be easily backported as a fix...can you add a
> > > > > Fixes tag in your next version?
> > > >
> > > > Are you seeing this issue a lot? IOW, do we need this to be
> > > > backported?
> > >
> 
> I wouldn't say a lot, but we are seeing it with standard use of our
> devices running over an extended amount of time. Yes, we would like
> this backported.
> 
> > > > > Also, can you explain the issue and the solution in more detail
> > > > > in the commit message...that will help with getting it merged
> > > > > as a fix in the stables...
> > > > >
> > > > > ...for the future (definitely NOT in this series) we could
> > > > > probably think about getting rid of the sleeping mutex in favour
> > > > > of some other non-sleeping form of mutual exclusion around the
> > > > > channel (like in the SMC transport), enabling (optionally)
> > > > > atomic transmission support, AND also reviewing whether the
> > > > > shmem layer busy-waiting in tx_prepare is needed at all
> > > > > anymore...
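
For context, the queueing that the patch moves into the SCMI layer
amounts to serializing senders on the channel, roughly as sketched
below; the chan_lock field name and the exact placement here are
illustrative assumptions, not the literal patch:

#include <linux/mailbox_client.h>
#include <linux/mutex.h>

#include "common.h"	/* struct scmi_chan_info, struct scmi_xfer */

struct scmi_mailbox {
	struct mbox_client cl;
	struct mbox_chan *chan;
	struct scmi_shared_mem __iomem *shmem;
	struct mutex chan_lock;	/* assumed name: one sender owns the channel */
};

static int mailbox_send_message(struct scmi_chan_info *cinfo,
				struct scmi_xfer *xfer)
{
	struct scmi_mailbox *smbox = cinfo->transport_info;
	int ret;

	/*
	 * Queue in the SCMI layer rather than in the mailbox framework:
	 * block here until the previous transfer is done, so that the
	 * timeout in scmi_wait_for_message_response() only starts ticking
	 * once this message really owns the channel.
	 */
	mutex_lock(&smbox->chan_lock);

	ret = mbox_send_message(smbox->chan, xfer);
	/* mbox_send_message() returns a non-negative value on success */
	if (ret < 0) {
		mutex_unlock(&smbox->chan_lock);
		return ret;
	}

	return 0;
}

static void mailbox_mark_txdone(struct scmi_chan_info *cinfo, int ret,
				struct scmi_xfer *xfer)
{
	struct scmi_mailbox *smbox = cinfo->transport_info;

	mbox_client_txdone(smbox->chan, ret);

	/* Transfer done (or timed out): hand the channel to the next sender */
	mutex_unlock(&smbox->chan_lock);
}

With something like that in place, a second sender blocks in
send_message(), just as on the SMC transport, instead of accruing
timeout drift inside the mailbox framework's queue.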
> > > > >
> > > > Agreed. If we are locking the channel in SCMI, we can drop the
> > > > busy-waiting in tx_prepare, and the associated details in the
> > > > comment, as this locking voids that. It is better to have both
> > > > changes in the same patch, to make the relation between them
> > > > clear.
> > >
> > > Actually, scratch that last point. The waiting in tx_prepare, until
> > > the platform marks the shmem free for the agent to use, is still
> > > needed. One use case is when the agent/OS times out but the platform
> > > continues to process and eventually releases the shmem. Sorry, I
> > > completely forgot about that.
> > >
> > 
> > Yes, indeed, it is the mechanism by which we avoid forcibly
> > reclaiming the shmem if the transmission times out...and we should
> > keep it, to avoid corruption of newer messages by late replies from
> > earlier ones that have timed out.
> > 
> 
> Yup. I saw an interesting interaction from this. Since modifying the
> shmem and ringing the doorbell are often two different tasks, the
> modification of the shmem can race with the platform's processing of
> timed-out messages. This usually leads to an early ACK and a spurious
> interrupt. Mostly harmless. We did see lockups when multiple timeouts
> occurred, but it was unclear whether this was an issue in the SCMI
> transport layer or in our driver/platform.
> 

Are you talking about something similar to this:

https://lore.kernel.org/all/20231220172112.763539-1-cristian.marussi@arm.com/

...reported as a side effect of a spurious IRQ on late timed-out
replies? It should have been fixed by the above commit in v6.8.

Thanks,
Cristian
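
For reference, the tx_prepare protection discussed above is a bounded
busy-wait on the channel-free flag in the shared memory, roughly as
below (paraphrased from drivers/firmware/arm_scmi/shmem.c; the
timeout_ms parameter and the factor-of-two bound are approximations of
the mainline behaviour, not verbatim code):

#include <linux/bits.h>
#include <linux/io.h>
#include <linux/ktime.h>
#include <linux/processor.h>
#include <linux/types.h>

/* Shared-memory layout per the SCMI spec, as in shmem.c */
struct scmi_shared_mem {
	__le32 reserved;
	__le32 channel_status;
#define SCMI_SHMEM_CHAN_STAT_CHANNEL_ERROR	BIT(1)
#define SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE	BIT(0)
	__le32 reserved1[2];
	__le32 flags;
	__le32 length;
	__le32 msg_header;
	u8 msg_payload[];
};

static void shmem_tx_prepare_wait(struct scmi_shared_mem __iomem *shmem,
				  unsigned int timeout_ms)
{
	/* Give up after twice the channel timeout rather than spin forever */
	ktime_t stop = ktime_add_ms(ktime_get(), 2 * timeout_ms);

	/*
	 * The channel should normally be free by now; if the agent timed
	 * out on the previous message while the platform was still
	 * processing it, wait for the platform to release the shmem before
	 * writing a new payload, otherwise the late reply and the new
	 * request could corrupt each other.
	 */
	spin_until_cond((ioread32(&shmem->channel_status) &
			 SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE) ||
			ktime_after(ktime_get(), stop));

	/*
	 * ...the real tx_prepare then marks the channel busy and writes
	 * the header and payload into the shmem...
	 */
}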