All of lore.kernel.org
 help / color / mirror / Atom feed
From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Leon Romanovsky <leon@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-rdma@vger.kernel.org, llvm@lists.linux.dev,
	Michael Guralnik <michaelgur@mellanox.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	Will Deacon <will@kernel.org>
Subject: Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Date: Mon, 4 Dec 2023 17:31:47 +0000	[thread overview]
Message-ID: <ZW4NAzI_jvwoq8dL@arm.com> (raw)
In-Reply-To: <20231127134505.GI436702@nvidia.com>

On Mon, Nov 27, 2023 at 09:45:05AM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 27, 2023 at 12:42:41PM +0000, Catalin Marinas wrote:
> > > > What's the actual requirement here? Is this just for performance?
> > > 
> > > Yes, just performance.
> > 
> > Do you have any rough numbers (percentage)? It's highly
> > microarchitecture-dependent until we get the ST64B instruction.
> 
> The current C code is an open coded store loop. The kernel does 250
> tries and measures if any one of them succeeds to combine.
> 
> On x86, and older ARM cores we see that 100% of the time at least 1 in
> 250 tries succeeds.
> 
> With the new CPU cores we see more like 9 out of 10 time there are 0
> in 250 tries that succeed. Ie we can go thousands of times without
> seeing any successful WC combine.
> 
> The STP block brings it back to 100% of the time 1 in 250 succeed.

That's a bit confusing to me: 1 in 250 succeeding is still pretty rare.
But I guess what your benchmark says is that at least 1 succeeded to
write-combine and it might as well be all 250 tries. It's more
interesting to see if there's actual performance gain in real-world
traffic, not just some artificial benchmark (I may have misunderstood
your numbers above).

> However, in userspace we have long been using ST4 to create a
> single-instruction 64 byte store on ARM64. As far as I know this is
> highly reliable. I don't have direct data on the STP configuration.

Personally I'd optimise the mempcy_toio() arm64 implementation to do
STPs if the alignment is right (like we do for classic memcpy()).
There's a slight overhead for alignment checking but I suspect it would
be lost as long as you can get the write-combining. Not sure whether the
interspersed reads in memcpy_toio() would somehow prevent the
write-combining.

A memcpy_toio_64() can use the new ST64B instruction if available or
fall back to memcpy_toio() on arm64. It should also have the DGH
instruction (io_stop_wc()) but only if falling back to classic
memcpy_toio(). We don't need DGH with ST64B.

> > More of a bike-shedding, I wonder whether the __iowrite*_copy()
> > semantics are better suited for what you need in terms of ordering (not
> > that mempcy_toio() to Normal NC memory gives us any ordering).
> 
> I have the same remark I gave to Niklas, this does not require
> alignment or an exact 64 byte size. It was clearly made to support WC
> stores since Pathscale did it, but I don't see this mapping nicely to
> the future 64 byte store instructions are we getting.

As above, I'd suggest just using memcpy_toio() as a fallback if ST64B is
not available.

> We could name it __iowrite512_copy() if that makes more sense?

I've been thinking at the __iowrite*_copy() and these also take a
'count' argument. I assume in this instance we don't really need one, so
it's just additional overhead (more like API clutter, I doubt it makes
much difference for performance). I'd say just stick to the
mempcy_toio_64() but have the io_stop_wc() inside this function as we
won't need it with ST64B.

Well, unless someone has a better name for this function.

-- 
Catalin

WARNING: multiple messages have this Message-ID (diff)
From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Leon Romanovsky <leon@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-rdma@vger.kernel.org, llvm@lists.linux.dev,
	Michael Guralnik <michaelgur@mellanox.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	Will Deacon <will@kernel.org>
Subject: Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Date: Mon, 4 Dec 2023 17:31:47 +0000	[thread overview]
Message-ID: <ZW4NAzI_jvwoq8dL@arm.com> (raw)
In-Reply-To: <20231127134505.GI436702@nvidia.com>

On Mon, Nov 27, 2023 at 09:45:05AM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 27, 2023 at 12:42:41PM +0000, Catalin Marinas wrote:
> > > > What's the actual requirement here? Is this just for performance?
> > > 
> > > Yes, just performance.
> > 
> > Do you have any rough numbers (percentage)? It's highly
> > microarchitecture-dependent until we get the ST64B instruction.
> 
> The current C code is an open coded store loop. The kernel does 250
> tries and measures if any one of them succeeds to combine.
> 
> On x86, and older ARM cores we see that 100% of the time at least 1 in
> 250 tries succeeds.
> 
> With the new CPU cores we see more like 9 out of 10 time there are 0
> in 250 tries that succeed. Ie we can go thousands of times without
> seeing any successful WC combine.
> 
> The STP block brings it back to 100% of the time 1 in 250 succeed.

That's a bit confusing to me: 1 in 250 succeeding is still pretty rare.
But I guess what your benchmark says is that at least 1 succeeded to
write-combine and it might as well be all 250 tries. It's more
interesting to see if there's actual performance gain in real-world
traffic, not just some artificial benchmark (I may have misunderstood
your numbers above).

> However, in userspace we have long been using ST4 to create a
> single-instruction 64 byte store on ARM64. As far as I know this is
> highly reliable. I don't have direct data on the STP configuration.

Personally I'd optimise the mempcy_toio() arm64 implementation to do
STPs if the alignment is right (like we do for classic memcpy()).
There's a slight overhead for alignment checking but I suspect it would
be lost as long as you can get the write-combining. Not sure whether the
interspersed reads in memcpy_toio() would somehow prevent the
write-combining.

A memcpy_toio_64() can use the new ST64B instruction if available or
fall back to memcpy_toio() on arm64. It should also have the DGH
instruction (io_stop_wc()) but only if falling back to classic
memcpy_toio(). We don't need DGH with ST64B.

> > More of a bike-shedding, I wonder whether the __iowrite*_copy()
> > semantics are better suited for what you need in terms of ordering (not
> > that mempcy_toio() to Normal NC memory gives us any ordering).
> 
> I have the same remark I gave to Niklas, this does not require
> alignment or an exact 64 byte size. It was clearly made to support WC
> stores since Pathscale did it, but I don't see this mapping nicely to
> the future 64 byte store instructions are we getting.

As above, I'd suggest just using memcpy_toio() as a fallback if ST64B is
not available.

> We could name it __iowrite512_copy() if that makes more sense?

I've been thinking at the __iowrite*_copy() and these also take a
'count' argument. I assume in this instance we don't really need one, so
it's just additional overhead (more like API clutter, I doubt it makes
much difference for performance). I'd say just stick to the
mempcy_toio_64() but have the io_stop_wc() inside this function as we
won't need it with ST64B.

Well, unless someone has a better name for this function.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2023-12-04 17:31 UTC|newest]

Thread overview: 136+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-23 19:04 [PATCH rdma-next 0/2] Add and use memcpy_toio_64() Leon Romanovsky
2023-11-23 19:04 ` Leon Romanovsky
2023-11-23 19:04 ` [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64 Leon Romanovsky
2023-11-23 19:04   ` Leon Romanovsky
2023-11-24 10:16   ` Mark Rutland
2023-11-24 10:16     ` Mark Rutland
2023-11-24 12:23     ` Jason Gunthorpe
2023-11-24 12:23       ` Jason Gunthorpe
2023-11-27 12:42       ` Catalin Marinas
2023-11-27 12:42         ` Catalin Marinas
2023-11-27 13:45         ` Jason Gunthorpe
2023-11-27 13:45           ` Jason Gunthorpe
2023-12-04 17:31           ` Catalin Marinas [this message]
2023-12-04 17:31             ` Catalin Marinas
2023-12-04 18:23             ` Jason Gunthorpe
2023-12-04 18:23               ` Jason Gunthorpe
2023-12-05 17:21               ` Catalin Marinas
2023-12-05 17:21                 ` Catalin Marinas
2023-12-05 17:51                 ` Jason Gunthorpe
2023-12-05 17:51                   ` Jason Gunthorpe
2023-12-05 19:34                   ` Catalin Marinas
2023-12-05 19:34                     ` Catalin Marinas
2023-12-05 19:51                     ` Jason Gunthorpe
2023-12-05 19:51                       ` Jason Gunthorpe
2023-12-06 11:09                       ` Catalin Marinas
2023-12-06 11:09                         ` Catalin Marinas
2023-12-06 12:59                         ` Jason Gunthorpe
2023-12-06 12:59                           ` Jason Gunthorpe
2024-01-16 18:51                           ` Jason Gunthorpe
2024-01-16 18:51                             ` Jason Gunthorpe
2024-01-17 12:30                             ` Mark Rutland
2024-01-17 12:30                               ` Mark Rutland
2024-01-17 12:36                               ` Jason Gunthorpe
2024-01-17 12:36                                 ` Jason Gunthorpe
2024-01-17 12:41                                 ` Jason Gunthorpe
2024-01-17 12:41                                   ` Jason Gunthorpe
2024-01-17 13:29                                 ` Mark Rutland
2024-01-17 13:29                                   ` Mark Rutland
2024-01-23 20:38                                   ` Catalin Marinas
2024-01-23 20:38                                     ` Catalin Marinas
2024-01-24  1:27                                     ` Jason Gunthorpe
2024-01-24  1:27                                       ` Jason Gunthorpe
2024-01-24  8:26                                       ` Marc Zyngier
2024-01-24  8:26                                         ` Marc Zyngier
2024-01-24 13:06                                         ` Jason Gunthorpe
2024-01-24 13:06                                           ` Jason Gunthorpe
2024-01-24 13:32                                           ` Marc Zyngier
2024-01-24 13:32                                             ` Marc Zyngier
2024-01-24 15:52                                             ` Jason Gunthorpe
2024-01-24 15:52                                               ` Jason Gunthorpe
2024-01-24 17:54                                               ` Catalin Marinas
2024-01-24 17:54                                                 ` Catalin Marinas
2024-01-25  1:29                                                 ` Jason Gunthorpe
2024-01-25  1:29                                                   ` Jason Gunthorpe
2024-01-26 16:15                                                   ` Catalin Marinas
2024-01-26 16:15                                                     ` Catalin Marinas
2024-01-26 17:09                                                     ` Jason Gunthorpe
2024-01-26 17:09                                                       ` Jason Gunthorpe
2024-01-24 11:38                                     ` Mark Rutland
2024-01-24 11:38                                       ` Mark Rutland
2024-01-24 12:40                                       ` Catalin Marinas
2024-01-24 12:40                                         ` Catalin Marinas
2024-01-24 13:27                                         ` Jason Gunthorpe
2024-01-24 13:27                                           ` Jason Gunthorpe
2024-01-24 17:22                                           ` Catalin Marinas
2024-01-24 17:22                                             ` Catalin Marinas
2024-01-24 19:26                                             ` Jason Gunthorpe
2024-01-24 19:26                                               ` Jason Gunthorpe
2024-01-25 17:43                                               ` Jason Gunthorpe
2024-01-25 17:43                                                 ` Jason Gunthorpe
2024-01-26 14:56                                                 ` Catalin Marinas
2024-01-26 14:56                                                   ` Catalin Marinas
2024-01-26 15:24                                                   ` Jason Gunthorpe
2024-01-26 15:24                                                     ` Jason Gunthorpe
2024-01-17 14:07                               ` Mark Rutland
2024-01-17 14:07                                 ` Mark Rutland
2024-01-17 15:28                                 ` Jason Gunthorpe
2024-01-17 15:28                                   ` Jason Gunthorpe
2024-01-17 16:05                                   ` Will Deacon
2024-01-17 16:05                                     ` Will Deacon
2024-01-18 16:18                                     ` Jason Gunthorpe
2024-01-18 16:18                                       ` Jason Gunthorpe
2024-01-24 11:31                                       ` Mark Rutland
2024-01-24 11:31                                         ` Mark Rutland
2023-11-24 12:58   ` Robin Murphy
2023-11-24 12:58     ` Robin Murphy
2023-11-24 13:45     ` Jason Gunthorpe
2023-11-24 13:45       ` Jason Gunthorpe
2023-11-24 15:32       ` Robin Murphy
2023-11-24 15:32         ` Robin Murphy
2023-11-24 14:10   ` Niklas Schnelle
2023-11-24 14:10     ` Niklas Schnelle
2023-11-24 14:20     ` Jason Gunthorpe
2023-11-24 14:20       ` Jason Gunthorpe
2023-11-24 14:48       ` Niklas Schnelle
2023-11-24 14:48         ` Niklas Schnelle
2023-11-24 14:53         ` Niklas Schnelle
2023-11-24 14:53           ` Niklas Schnelle
2023-11-24 14:55         ` Jason Gunthorpe
2023-11-24 14:55           ` Jason Gunthorpe
2023-11-24 15:59           ` Niklas Schnelle
2023-11-24 15:59             ` Niklas Schnelle
2023-11-24 16:06             ` Jason Gunthorpe
2023-11-24 16:06               ` Jason Gunthorpe
2023-11-27 17:43               ` Niklas Schnelle
2023-11-27 17:43                 ` Niklas Schnelle
2023-11-27 17:51                 ` Jason Gunthorpe
2023-11-27 17:51                   ` Jason Gunthorpe
2023-11-28 16:28                   ` Niklas Schnelle
2023-11-28 16:28                     ` Niklas Schnelle
2024-01-16 17:33                     ` Jason Gunthorpe
2024-01-16 17:33                       ` Jason Gunthorpe
2024-01-17 13:20                       ` Niklas Schnelle
2024-01-17 13:20                         ` Niklas Schnelle
2024-01-17 13:26                         ` Jason Gunthorpe
2024-01-17 13:26                           ` Jason Gunthorpe
2024-01-17 17:55                           ` Jason Gunthorpe
2024-01-17 17:55                             ` Jason Gunthorpe
2024-01-18 13:46                             ` Niklas Schnelle
2024-01-18 13:46                               ` Niklas Schnelle
2024-01-18 14:00                               ` Jason Gunthorpe
2024-01-18 14:00                                 ` Jason Gunthorpe
2024-01-18 15:59                                 ` Niklas Schnelle
2024-01-18 15:59                                   ` Niklas Schnelle
2024-01-18 16:21                                   ` Jason Gunthorpe
2024-01-18 16:21                                     ` Jason Gunthorpe
2024-01-18 16:25                                     ` Niklas Schnelle
2024-01-18 16:25                                       ` Niklas Schnelle
2024-01-19 11:52                                       ` Niklas Schnelle
2024-01-19 11:52                                         ` Niklas Schnelle
2024-02-16 12:09                                   ` Niklas Schnelle
2024-02-16 12:09                                     ` Niklas Schnelle
2024-02-16 12:39                                     ` Jason Gunthorpe
2024-02-16 12:39                                       ` Jason Gunthorpe
2023-11-23 19:04 ` [PATCH rdma-next 2/2] IB/mlx5: Use memcpy_toio_64() for write combining stores Leon Romanovsky
2023-11-23 19:04   ` Leon Romanovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZW4NAzI_jvwoq8dL@arm.com \
    --to=catalin.marinas@arm.com \
    --cc=arnd@arndb.de \
    --cc=jgg@nvidia.com \
    --cc=leon@kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=llvm@lists.linux.dev \
    --cc=mark.rutland@arm.com \
    --cc=michaelgur@mellanox.com \
    --cc=nathan@kernel.org \
    --cc=ndesaulniers@google.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.