From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4D53C7F7C4;
	Wed, 24 Jan 2024 17:22:11 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1706116931; cv=none; b=rfDPAmaj1SN0xGkjEb9LtL7uNgiYXJij9QwQN2YtVJD8gig+y0OtbmFFYSuqzDLWAdfspr9S6sNe82b7g3E+k5EYt5n9T/lo8WLZaJUyelKjiy/bYpTTX++Y8BzSo21cl7D0zWqM1ItjvmesPtW31/gDqLTfJ8phAQsDv8nDenU=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1706116931; c=relaxed/simple;
	bh=S1PhJcaN20LtKXfwXDa/CLbEgNUETEehcQgb+fXeUA4=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=TzA3wPxPaUx019Me2QrBdVZNElSm+7Y4+t1wTEZRqp3/5CxAOxlrRCu/TQfDpo7kuVBKca8/lnm9Dkxv49e6th0Ddz68+89D3UwJQKrEeJ7UqqFabDbKHypPBGLlLz8m1rJeBPYaaM6vKYb9p0mlWamEUuYeQmNbZdjzq0NNc0g=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3DE2AC433F1;
	Wed, 24 Jan 2024 17:22:08 +0000 (UTC)
Date: Wed, 24 Jan 2024 17:22:05 +0000
From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Niklas Schnelle <schnelle@linux.ibm.com>,
	Leon Romanovsky <leon@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-rdma@vger.kernel.org, llvm@lists.linux.dev,
	Michael Guralnik <michaelgur@mellanox.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>
Subject: Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Message-ID: <ZbFHPTUaBmbHYnwx@arm.com>
References: <ZXBWXodu2rxQ7Szz@arm.com>
 <20231206125919.GP2692119@nvidia.com>
 <20240116185121.GB980613@nvidia.com>
 <ZafISDVeAA1swx2I@FVFF77S0Q05N.cambridge.arm.com>
 <20240117123618.GD734935@nvidia.com>
 <ZafWIsrjvk--JdDn@FVFF77S0Q05N.cambridge.arm.com>
 <ZbAj34vdVuMrmdFD@arm.com>
 <ZbD2n-BKGbDgMfsB@FVFF77S0Q05N.cambridge.arm.com>
 <ZbEFPbT7vl6HN4lk@arm.com>
 <20240124132719.GF1455070@nvidia.com>
Precedence: bulk
X-Mailing-List: linux-arch@vger.kernel.org
List-Id: <linux-arch.vger.kernel.org>
List-Subscribe: <mailto:linux-arch+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-arch+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20240124132719.GF1455070@nvidia.com>

On Wed, Jan 24, 2024 at 09:27:19AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 24, 2024 at 12:40:29PM +0000, Catalin Marinas wrote:
> 
> > > Just to be clear, that means we should drop this patch ("arm64/io: add
> > > memcpy_toio_64") for now, right?
> > 
> > In its current form yes, but that doesn't mean that memcpy_toio_64()
> > cannot be reworked differently.
> 
> I gave up on touching memcpy_toio_64(), it doesn't work very well
> because of the weak alignment
> 
> Instead I followed your suggestion to fix __iowrite64_copy()

I forgot the details. Was it to introduce an __iowrite512_copy()
function or to simply use __iowrite64_copy() with a count of 8?

> There are only a couple of places that use this API:
[...]
> __iowrite64_copy() has a sufficient API that the compiler can inline
> the STP block as this patch did.
> 
> I experimented with having memcpy_toio_64() invoke __iowrite64_copy(),
> but it did not look very nice. Maybe there is a possible performance
> win there, I don't know.

Just invoking __iowrite64_copy() with a count of 8 wouldn't work well
even if we have the writeq generating STR with an offset (well, it also
increments the pointers, so I don't think Mark's optimisation would
help). The copy implies loads and these would be interleaved with stores
and potentially get in the way of write combining. An
__iowrite512_copy() or maybe even an optimised __iowrite64_copy() for
count 8 could do the loads first followed by the stores. You can use a
special path in __iowrite64_copy() with __builtin_contant_p().

You can try with an arm64 specific __iowrite64_copy() and see how it
goes (together with Mark's patch):

void __iowrite64_copy(void __iomem *to, const void *from,
		      size_t count)
{
	u64 __iomem *dst = to;
	const u64 *src = from;
	const u64 *end = src + count;

	/*
	 * Try a 64-byte write, the CPUs tend to write-combine them.
	 */
	if (__builtin_contant_p(count) && count == 8) {
		__raw_writeq(*src, dst);
		__raw_writeq(*(src + 1), dst + 1);
		__raw_writeq(*(src + 2), dst + 2);
		__raw_writeq(*(src + 3), dst + 3);
		__raw_writeq(*(src + 4), dst + 4);
		__raw_writeq(*(src + 5), dst + 5);
		__raw_writeq(*(src + 6), dst + 6);
		__raw_writeq(*(src + 7), dst + 7);
		return;
	}

	while (src < end)
		__raw_writeq(*src++, dst++);
}
EXPORT_SYMBOL_GPL(__iowrite64_copy);

What we don't have is inlining of __iowrite64_copy() but if we need that
we can move away from a weak symbol to a static inline.

Give this a go and see if it you get write-combining in your hardware.
If the loads interleaves with stores get in the way, maybe we can resort
to inline asm.

-- 
Catalin

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id A4D7FC46CD2
	for <linux-arm-kernel@archiver.kernel.org>; Wed, 24 Jan 2024 17:22:40 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=bombadil.20210309; h=Sender:
	Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
	List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References:
	Message-ID:Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description:
	Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:
	List-Owner; bh=5CxBR3l7fQKFd3uPP2l5u6jjjQo9aJ4qiHLU7s6oRKk=; b=R5Dp1Wcw1N+fFt
	KrPruwUjYUEuTGSZ33q6ZmX61g/Lh7RwyxW8kYdOYi9A3PtuMeqhcD6f0cNRJaf6zWK52vEdP6wvU
	mS0kJ/cWKLcH8rebv3V6hvhhbQmfBlNyx81xkeBtxYnbnkyDuF8bAgdd2gfrGq5IVqz4c26DU7KBy
	/WGlyhrYkXmJlEWBri7FztyJTlvGPldew4KodoB/mGEXIslQ3U1myJO6buy0oj87sztDgPz461PVJ
	0Vh4e6YhIwiPPoJ4TMLuoJl+muWwhXahhyZXDsuXN0LdYnxoMKMnSiYay0rusKlYlBglIm7L46d6P
	FDbrHXJRfE2i10+BRuww==;
Received: from localhost ([::1] helo=bombadil.infradead.org)
	by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux))
	id 1rSgx5-004RJj-0C;
	Wed, 24 Jan 2024 17:22:15 +0000
Received: from dfw.source.kernel.org ([2604:1380:4641:c500::1])
	by bombadil.infradead.org with esmtps (Exim 4.96 #2 (Red Hat Linux))
	id 1rSgx1-004RI2-2s
	for linux-arm-kernel@lists.infradead.org;
	Wed, 24 Jan 2024 17:22:13 +0000
Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58])
	by dfw.source.kernel.org (Postfix) with ESMTP id 2896A61DAB;
	Wed, 24 Jan 2024 17:22:11 +0000 (UTC)
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3DE2AC433F1;
	Wed, 24 Jan 2024 17:22:08 +0000 (UTC)
Date: Wed, 24 Jan 2024 17:22:05 +0000
From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Niklas Schnelle <schnelle@linux.ibm.com>,
	Leon Romanovsky <leon@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-rdma@vger.kernel.org, llvm@lists.linux.dev,
	Michael Guralnik <michaelgur@mellanox.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>
Subject: Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Message-ID: <ZbFHPTUaBmbHYnwx@arm.com>
References: <ZXBWXodu2rxQ7Szz@arm.com>
 <20231206125919.GP2692119@nvidia.com>
 <20240116185121.GB980613@nvidia.com>
 <ZafISDVeAA1swx2I@FVFF77S0Q05N.cambridge.arm.com>
 <20240117123618.GD734935@nvidia.com>
 <ZafWIsrjvk--JdDn@FVFF77S0Q05N.cambridge.arm.com>
 <ZbAj34vdVuMrmdFD@arm.com>
 <ZbD2n-BKGbDgMfsB@FVFF77S0Q05N.cambridge.arm.com>
 <ZbEFPbT7vl6HN4lk@arm.com>
 <20240124132719.GF1455070@nvidia.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20240124132719.GF1455070@nvidia.com>
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20240124_092212_016613_28ABBC63 
X-CRM114-Status: GOOD (  23.31  )
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org

On Wed, Jan 24, 2024 at 09:27:19AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 24, 2024 at 12:40:29PM +0000, Catalin Marinas wrote:
> 
> > > Just to be clear, that means we should drop this patch ("arm64/io: add
> > > memcpy_toio_64") for now, right?
> > 
> > In its current form yes, but that doesn't mean that memcpy_toio_64()
> > cannot be reworked differently.
> 
> I gave up on touching memcpy_toio_64(), it doesn't work very well
> because of the weak alignment
> 
> Instead I followed your suggestion to fix __iowrite64_copy()

I forgot the details. Was it to introduce an __iowrite512_copy()
function or to simply use __iowrite64_copy() with a count of 8?

> There are only a couple of places that use this API:
[...]
> __iowrite64_copy() has a sufficient API that the compiler can inline
> the STP block as this patch did.
> 
> I experimented with having memcpy_toio_64() invoke __iowrite64_copy(),
> but it did not look very nice. Maybe there is a possible performance
> win there, I don't know.

Just invoking __iowrite64_copy() with a count of 8 wouldn't work well
even if we have the writeq generating STR with an offset (well, it also
increments the pointers, so I don't think Mark's optimisation would
help). The copy implies loads and these would be interleaved with stores
and potentially get in the way of write combining. An
__iowrite512_copy() or maybe even an optimised __iowrite64_copy() for
count 8 could do the loads first followed by the stores. You can use a
special path in __iowrite64_copy() with __builtin_contant_p().

You can try with an arm64 specific __iowrite64_copy() and see how it
goes (together with Mark's patch):

void __iowrite64_copy(void __iomem *to, const void *from,
		      size_t count)
{
	u64 __iomem *dst = to;
	const u64 *src = from;
	const u64 *end = src + count;

	/*
	 * Try a 64-byte write, the CPUs tend to write-combine them.
	 */
	if (__builtin_contant_p(count) && count == 8) {
		__raw_writeq(*src, dst);
		__raw_writeq(*(src + 1), dst + 1);
		__raw_writeq(*(src + 2), dst + 2);
		__raw_writeq(*(src + 3), dst + 3);
		__raw_writeq(*(src + 4), dst + 4);
		__raw_writeq(*(src + 5), dst + 5);
		__raw_writeq(*(src + 6), dst + 6);
		__raw_writeq(*(src + 7), dst + 7);
		return;
	}

	while (src < end)
		__raw_writeq(*src++, dst++);
}
EXPORT_SYMBOL_GPL(__iowrite64_copy);

What we don't have is inlining of __iowrite64_copy() but if we need that
we can move away from a weak symbol to a static inline.

Give this a go and see if it you get write-combining in your hardware.
If the loads interleaves with stores get in the way, maybe we can resort
to inline asm.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel