From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
Leon Romanovsky <leon@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
linux-rdma@vger.kernel.org, llvm@lists.linux.dev,
Michael Guralnik <michaelgur@mellanox.com>,
Nathan Chancellor <nathan@kernel.org>,
Nick Desaulniers <ndesaulniers@google.com>,
Will Deacon <will@kernel.org>
Subject: Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Date: Tue, 5 Dec 2023 17:21:27 +0000 [thread overview]
Message-ID: <ZW9cF0ALVwgvcQMy@arm.com> (raw)
In-Reply-To: <20231204182330.GK1493156@nvidia.com>
On Mon, Dec 04, 2023 at 02:23:30PM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 05:31:47PM +0000, Catalin Marinas wrote:
> > Personally I'd optimise the mempcy_toio() arm64 implementation to do
> > STPs if the alignment is right (like we do for classic memcpy()).
> > There's a slight overhead for alignment checking but I suspect it would
> > be lost as long as you can get the write-combining. Not sure whether the
> > interspersed reads in memcpy_toio() would somehow prevent the
> > write-combining.
>
> I understand on these new CPUs anything other than a block of
> contiguous STPs is risky to break the WC. I was told we should not
> have any loads between them.
Classic memcpy does similar tricks with four LDPs in a row before
starting to issue the STPs (though there are new LDPs for the next
data in-between). But that was tuned for cacheable memory, not sure
if something similar would behave well on Normal-NC memory.
> So we can't just update memcpy_toio to optimize a 128 bit store
> variant like memcpy might. We actually need a special case just for 64
> byte.
>
> IMHO it does not look good as the chance any existing callers can use
> this optmized 64B path is probably small, but everyone has to pay the
> costs to check for it.
I don't think the cost of the check is noticeable and there are several
places where the copy goes beyond 64 bytes. It may be worth a try.
> I also would not do this on x86 - Pathscale apparently decided the
> needed special __iowrite*_copy() things to actually make this work on
> xome x86 systems - I'm very leary to change x86 stuff away from the 64
> bit copy loopw we know works already on x86.
>
> IMHO encoding the alignment expectation in the API is best, especially
> since this is typically a performance path.
The slight downside of a __iowrite512_copy() API is that, if we follow
the 32/64 semantics, it would need the source buffer aligned. Maybe we
can document it to 64-bit alignment only rather than 512.
> > A memcpy_toio_64() can use the new ST64B instruction if available or
> > fall back to memcpy_toio() on arm64. It should also have the DGH
> > instruction (io_stop_wc()) but only if falling back to classic
> > memcpy_toio(). We don't need DGH with ST64B.
>
> I'm told it is problematic, something about ST64B not working with
> NORMAL_NC.
Last time I checked it was meant to work on Normal-NC (not cacheable
though). That's on page 285 of the Arm ARM J.a.
> Also in a future ST64B world we are going to see HW start relying on
> large TLPs, not just being an optional performance win. To my mind it
> makes more sense that there is an API that guarantees a large TLP or
> oops. We really don't want an automatic fallback to memcpy.
We can't guarantee those large TLPs without the ST64B instructions, so
it needs to be more of a QoS aspect than correctness.
--
Catalin
WARNING: multiple messages have this Message-ID (diff)
From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
Leon Romanovsky <leon@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
linux-rdma@vger.kernel.org, llvm@lists.linux.dev,
Michael Guralnik <michaelgur@mellanox.com>,
Nathan Chancellor <nathan@kernel.org>,
Nick Desaulniers <ndesaulniers@google.com>,
Will Deacon <will@kernel.org>
Subject: Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Date: Tue, 5 Dec 2023 17:21:27 +0000 [thread overview]
Message-ID: <ZW9cF0ALVwgvcQMy@arm.com> (raw)
In-Reply-To: <20231204182330.GK1493156@nvidia.com>
On Mon, Dec 04, 2023 at 02:23:30PM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 05:31:47PM +0000, Catalin Marinas wrote:
> > Personally I'd optimise the mempcy_toio() arm64 implementation to do
> > STPs if the alignment is right (like we do for classic memcpy()).
> > There's a slight overhead for alignment checking but I suspect it would
> > be lost as long as you can get the write-combining. Not sure whether the
> > interspersed reads in memcpy_toio() would somehow prevent the
> > write-combining.
>
> I understand on these new CPUs anything other than a block of
> contiguous STPs is risky to break the WC. I was told we should not
> have any loads between them.
Classic memcpy does similar tricks with four LDPs in a row before
starting to issue the STPs (though there are new LDPs for the next
data in-between). But that was tuned for cacheable memory, not sure
if something similar would behave well on Normal-NC memory.
> So we can't just update memcpy_toio to optimize a 128 bit store
> variant like memcpy might. We actually need a special case just for 64
> byte.
>
> IMHO it does not look good as the chance any existing callers can use
> this optmized 64B path is probably small, but everyone has to pay the
> costs to check for it.
I don't think the cost of the check is noticeable and there are several
places where the copy goes beyond 64 bytes. It may be worth a try.
> I also would not do this on x86 - Pathscale apparently decided the
> needed special __iowrite*_copy() things to actually make this work on
> xome x86 systems - I'm very leary to change x86 stuff away from the 64
> bit copy loopw we know works already on x86.
>
> IMHO encoding the alignment expectation in the API is best, especially
> since this is typically a performance path.
The slight downside of a __iowrite512_copy() API is that, if we follow
the 32/64 semantics, it would need the source buffer aligned. Maybe we
can document it to 64-bit alignment only rather than 512.
> > A memcpy_toio_64() can use the new ST64B instruction if available or
> > fall back to memcpy_toio() on arm64. It should also have the DGH
> > instruction (io_stop_wc()) but only if falling back to classic
> > memcpy_toio(). We don't need DGH with ST64B.
>
> I'm told it is problematic, something about ST64B not working with
> NORMAL_NC.
Last time I checked it was meant to work on Normal-NC (not cacheable
though). That's on page 285 of the Arm ARM J.a.
> Also in a future ST64B world we are going to see HW start relying on
> large TLPs, not just being an optional performance win. To my mind it
> makes more sense that there is an API that guarantees a large TLP or
> oops. We really don't want an automatic fallback to memcpy.
We can't guarantee those large TLPs without the ST64B instructions, so
it needs to be more of a QoS aspect than correctness.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2023-12-05 17:21 UTC|newest]
Thread overview: 136+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-23 19:04 [PATCH rdma-next 0/2] Add and use memcpy_toio_64() Leon Romanovsky
2023-11-23 19:04 ` Leon Romanovsky
2023-11-23 19:04 ` [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64 Leon Romanovsky
2023-11-23 19:04 ` Leon Romanovsky
2023-11-24 10:16 ` Mark Rutland
2023-11-24 10:16 ` Mark Rutland
2023-11-24 12:23 ` Jason Gunthorpe
2023-11-24 12:23 ` Jason Gunthorpe
2023-11-27 12:42 ` Catalin Marinas
2023-11-27 12:42 ` Catalin Marinas
2023-11-27 13:45 ` Jason Gunthorpe
2023-11-27 13:45 ` Jason Gunthorpe
2023-12-04 17:31 ` Catalin Marinas
2023-12-04 17:31 ` Catalin Marinas
2023-12-04 18:23 ` Jason Gunthorpe
2023-12-04 18:23 ` Jason Gunthorpe
2023-12-05 17:21 ` Catalin Marinas [this message]
2023-12-05 17:21 ` Catalin Marinas
2023-12-05 17:51 ` Jason Gunthorpe
2023-12-05 17:51 ` Jason Gunthorpe
2023-12-05 19:34 ` Catalin Marinas
2023-12-05 19:34 ` Catalin Marinas
2023-12-05 19:51 ` Jason Gunthorpe
2023-12-05 19:51 ` Jason Gunthorpe
2023-12-06 11:09 ` Catalin Marinas
2023-12-06 11:09 ` Catalin Marinas
2023-12-06 12:59 ` Jason Gunthorpe
2023-12-06 12:59 ` Jason Gunthorpe
2024-01-16 18:51 ` Jason Gunthorpe
2024-01-16 18:51 ` Jason Gunthorpe
2024-01-17 12:30 ` Mark Rutland
2024-01-17 12:30 ` Mark Rutland
2024-01-17 12:36 ` Jason Gunthorpe
2024-01-17 12:36 ` Jason Gunthorpe
2024-01-17 12:41 ` Jason Gunthorpe
2024-01-17 12:41 ` Jason Gunthorpe
2024-01-17 13:29 ` Mark Rutland
2024-01-17 13:29 ` Mark Rutland
2024-01-23 20:38 ` Catalin Marinas
2024-01-23 20:38 ` Catalin Marinas
2024-01-24 1:27 ` Jason Gunthorpe
2024-01-24 1:27 ` Jason Gunthorpe
2024-01-24 8:26 ` Marc Zyngier
2024-01-24 8:26 ` Marc Zyngier
2024-01-24 13:06 ` Jason Gunthorpe
2024-01-24 13:06 ` Jason Gunthorpe
2024-01-24 13:32 ` Marc Zyngier
2024-01-24 13:32 ` Marc Zyngier
2024-01-24 15:52 ` Jason Gunthorpe
2024-01-24 15:52 ` Jason Gunthorpe
2024-01-24 17:54 ` Catalin Marinas
2024-01-24 17:54 ` Catalin Marinas
2024-01-25 1:29 ` Jason Gunthorpe
2024-01-25 1:29 ` Jason Gunthorpe
2024-01-26 16:15 ` Catalin Marinas
2024-01-26 16:15 ` Catalin Marinas
2024-01-26 17:09 ` Jason Gunthorpe
2024-01-26 17:09 ` Jason Gunthorpe
2024-01-24 11:38 ` Mark Rutland
2024-01-24 11:38 ` Mark Rutland
2024-01-24 12:40 ` Catalin Marinas
2024-01-24 12:40 ` Catalin Marinas
2024-01-24 13:27 ` Jason Gunthorpe
2024-01-24 13:27 ` Jason Gunthorpe
2024-01-24 17:22 ` Catalin Marinas
2024-01-24 17:22 ` Catalin Marinas
2024-01-24 19:26 ` Jason Gunthorpe
2024-01-24 19:26 ` Jason Gunthorpe
2024-01-25 17:43 ` Jason Gunthorpe
2024-01-25 17:43 ` Jason Gunthorpe
2024-01-26 14:56 ` Catalin Marinas
2024-01-26 14:56 ` Catalin Marinas
2024-01-26 15:24 ` Jason Gunthorpe
2024-01-26 15:24 ` Jason Gunthorpe
2024-01-17 14:07 ` Mark Rutland
2024-01-17 14:07 ` Mark Rutland
2024-01-17 15:28 ` Jason Gunthorpe
2024-01-17 15:28 ` Jason Gunthorpe
2024-01-17 16:05 ` Will Deacon
2024-01-17 16:05 ` Will Deacon
2024-01-18 16:18 ` Jason Gunthorpe
2024-01-18 16:18 ` Jason Gunthorpe
2024-01-24 11:31 ` Mark Rutland
2024-01-24 11:31 ` Mark Rutland
2023-11-24 12:58 ` Robin Murphy
2023-11-24 12:58 ` Robin Murphy
2023-11-24 13:45 ` Jason Gunthorpe
2023-11-24 13:45 ` Jason Gunthorpe
2023-11-24 15:32 ` Robin Murphy
2023-11-24 15:32 ` Robin Murphy
2023-11-24 14:10 ` Niklas Schnelle
2023-11-24 14:10 ` Niklas Schnelle
2023-11-24 14:20 ` Jason Gunthorpe
2023-11-24 14:20 ` Jason Gunthorpe
2023-11-24 14:48 ` Niklas Schnelle
2023-11-24 14:48 ` Niklas Schnelle
2023-11-24 14:53 ` Niklas Schnelle
2023-11-24 14:53 ` Niklas Schnelle
2023-11-24 14:55 ` Jason Gunthorpe
2023-11-24 14:55 ` Jason Gunthorpe
2023-11-24 15:59 ` Niklas Schnelle
2023-11-24 15:59 ` Niklas Schnelle
2023-11-24 16:06 ` Jason Gunthorpe
2023-11-24 16:06 ` Jason Gunthorpe
2023-11-27 17:43 ` Niklas Schnelle
2023-11-27 17:43 ` Niklas Schnelle
2023-11-27 17:51 ` Jason Gunthorpe
2023-11-27 17:51 ` Jason Gunthorpe
2023-11-28 16:28 ` Niklas Schnelle
2023-11-28 16:28 ` Niklas Schnelle
2024-01-16 17:33 ` Jason Gunthorpe
2024-01-16 17:33 ` Jason Gunthorpe
2024-01-17 13:20 ` Niklas Schnelle
2024-01-17 13:20 ` Niklas Schnelle
2024-01-17 13:26 ` Jason Gunthorpe
2024-01-17 13:26 ` Jason Gunthorpe
2024-01-17 17:55 ` Jason Gunthorpe
2024-01-17 17:55 ` Jason Gunthorpe
2024-01-18 13:46 ` Niklas Schnelle
2024-01-18 13:46 ` Niklas Schnelle
2024-01-18 14:00 ` Jason Gunthorpe
2024-01-18 14:00 ` Jason Gunthorpe
2024-01-18 15:59 ` Niklas Schnelle
2024-01-18 15:59 ` Niklas Schnelle
2024-01-18 16:21 ` Jason Gunthorpe
2024-01-18 16:21 ` Jason Gunthorpe
2024-01-18 16:25 ` Niklas Schnelle
2024-01-18 16:25 ` Niklas Schnelle
2024-01-19 11:52 ` Niklas Schnelle
2024-01-19 11:52 ` Niklas Schnelle
2024-02-16 12:09 ` Niklas Schnelle
2024-02-16 12:09 ` Niklas Schnelle
2024-02-16 12:39 ` Jason Gunthorpe
2024-02-16 12:39 ` Jason Gunthorpe
2023-11-23 19:04 ` [PATCH rdma-next 2/2] IB/mlx5: Use memcpy_toio_64() for write combining stores Leon Romanovsky
2023-11-23 19:04 ` Leon Romanovsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZW9cF0ALVwgvcQMy@arm.com \
--to=catalin.marinas@arm.com \
--cc=arnd@arndb.de \
--cc=jgg@nvidia.com \
--cc=leon@kernel.org \
--cc=linux-arch@vger.kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-rdma@vger.kernel.org \
--cc=llvm@lists.linux.dev \
--cc=mark.rutland@arm.com \
--cc=michaelgur@mellanox.com \
--cc=nathan@kernel.org \
--cc=ndesaulniers@google.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.