* [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
@ 2025-09-15 8:35 Tariq Toukan
2025-09-15 22:18 ` Nathan Chancellor
0 siblings, 1 reply; 10+ messages in thread
From: Tariq Toukan @ 2025-09-15 8:35 UTC (permalink / raw)
To: Catalin Marinas, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Sabrina Dubroca, netdev, linux-rdma, linux-kernel, Gal Pressman,
Leon Romanovsky, Jason Gunthorpe, Michael Guralnik, Moshe Shemesh,
Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nathan Chancellor, Nick Desaulniers, Salil Mehta, Sven Schnelle,
Thomas Gleixner, x86, Yisen Zhuang, Arnd Bergmann,
Leon Romanovsky, linux-arch, linux-arm-kernel, Mark Rutland,
Michael Guralnik, patches, Niklas Schnelle, Jijie Shao,
Patrisious Haddad
From: Patrisious Haddad <phaddad@nvidia.com>
Write combining is an optimization feature in CPUs that is frequently
used by modern devices to generate 32 or 64 byte TLPs at the PCIe level.
These large TLPs allow certain optimizations in the driver to HW
communication that improve performance. As WC is unpredictable and
optional the HW designs all tolerate cases where combining doesn't
happen and simply experience a performance degradation.
Unfortunately many virtualization environments on all architectures have
done things that completely disable WC inside the VM with no generic way
to detect this. For example WC was fully blocked in ARM64 KVM until
commit 8c47ce3e1d2c ("KVM: arm64: Set io memory s2 pte as normalnc for
vfio pci device").
Trying to use WC when it is known not to work has a measurable
performance cost (~5%). Long ago mlx5 developed an boot time algorithm
to test if WC is available or not by using unique mlx5 HW features to
measure how many large TLPs the device is receiving. The SW generates a
large number of combining opportunities and if any succeed then WC is
declared working.
In mlx5 the WC optimization feature is never used by the kernel except
for the boot time test. The WC is only used by userspace in rdma-core.
Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining
implementation that is very unreliable compared to pretty much
everything prior. This is being fixed architecturally in new CPUs with a
new ST64B instruction, but current shipping devices suffer this problem.
Unreliable means the SW can present thousands of combining opportunities
and the HW will not combine for any of them, which creates a performance
degradation, and critically fails the mlx5 boot test. However, the CPU
is very sensitive to the instruction sequence used, with the better
options being sufficiently good that the performance loss from the
unreliable CPU is not measurable.
Broadly there are several options, from worst to best:
1) A C loop doing a u64 memcpy.
This was used prior to commit ef302283ddfc
("IB/mlx5: Use __iowrite64_copy() for write combining stores")
and failed almost all the time on Grace CPUs.
2) ARM64 assembly with consecutive 8 byte stores. This was implemented
as an arch-generic __iowriteXX_copy() family of functions suitable
for performance use in drivers for WC. commit ead79118dae6
("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the
ARM implementation.
3) ARM64 assembly with consecutive 16 byte stores. This was rejected
from kernel use over fears of virtualization failures. Common ARM
VMMs will crash if STP is used against emulated memory.
4) A single NEON store instruction. Userspace has used this option for a
very long time, it performs well.
5) For future silicon the new ST64B instruction is guaranteed to
generate a 64 byte TLP 100% of the time
The past upgrade from #1 to #2 was thought to be sufficient to solve
this problem. However, more testing on more systems shows that #3 is
still problematic at a low frequency and the kernel test fails.
Thus, make the mlx5 use the same instructions as userspace during the
boot time WC self test. This way the WC test matches the userspace and
will properly detect the ability of HW to support the WC workload that
userspace will generate. While #4 still has imperfect combining
performance, it is substantially better than #2, and does actually give
a performance win to applications. Self-test failures with #2 are like
3/10 boots, on some systems, #4 has never seen a boot failure.
There is no real general use case for a NEON based WC flow in the
kernel. This is not suitable for any performance path work as getting
into/out of a NEON context is fairly expensive compared to the gain of
WC. Future CPUs are going to fix this issue by using an new ARM
instruction and __iowriteXX_copy() will be updated to use that
automatically, probably using the ALTERNATES mechanism.
Since this problem is constrained to mlx5's unique situation of needing
a non-performance code path to duplicate what mlx5 userspace is doing as
a matter of self-testing, implement it as a one line inline assembly in
the driver directly.
Lastly, this was concluded from the discussion with ARM maintainers
which confirms that this is the best approach for the solution:
https://lore.kernel.org/r/aHqN_hpJl84T1Usi@arm.com
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/Makefile | 6 +++++
.../mlx5/core/lib/wc_neon_iowrite64_copy.c | 14 +++++++++++
.../mlx5/core/lib/wc_neon_iowrite64_copy.h | 12 ++++++++++
drivers/net/ethernet/mellanox/mlx5/core/wc.c | 24 +++++++++++++++++--
4 files changed, 54 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.h
Find V1 here:
https://lore.kernel.org/all/1757572873-602396-1-git-send-email-tariqt@nvidia.com/
V2: no changes, extended the CC list.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index d77696f46eb5..06d0eb190816 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -176,3 +176,9 @@ mlx5_core-$(CONFIG_PCIE_TPH) += lib/st.o
obj-$(CONFIG_MLX5_DPLL) += mlx5_dpll.o
mlx5_dpll-y := dpll.o
+
+#
+# NEON WC specific for mlx5
+#
+mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
+FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c
new file mode 100644
index 000000000000..8c07d2040607
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include "lib/wc_neon_iowrite64_copy.h"
+
+void mlx5_wc_neon_iowrite64_copy(void __iomem *to, const void *from)
+{
+ asm volatile
+ ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
+ "st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%1]"
+ :
+ : "r"(from), "r"(to)
+ : "memory", "v0", "v1", "v2", "v3");
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.h
new file mode 100644
index 000000000000..ff2a2e263190
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5_LIB_WC_NEON_H__
+#define __MLX5_LIB_WC_NEON_H__
+
+/* Executes a 64 byte copy between the two provided pointers via ARM neon
+ * instruction.
+ */
+void mlx5_wc_neon_iowrite64_copy(void __iomem *to, const void *from);
+
+#endif /* __MLX5_LIB_WC_NEON_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wc.c b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
index 2f0316616fa4..44c2ac953ea2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
@@ -7,6 +7,11 @@
#include "mlx5_core.h"
#include "wq.h"
+#ifdef CONFIG_KERNEL_MODE_NEON
+#include "lib/wc_neon_iowrite64_copy.h"
+#include <asm/neon.h>
+#endif
+
#define TEST_WC_NUM_WQES 255
#define TEST_WC_LOG_CQ_SZ (order_base_2(TEST_WC_NUM_WQES))
#define TEST_WC_SQ_LOG_WQ_SZ TEST_WC_LOG_CQ_SZ
@@ -249,6 +254,22 @@ static int mlx5_wc_create_sq(struct mlx5_core_dev *mdev, struct mlx5_wc_sq *sq)
return err;
}
+static void mlx5_iowrite64_copy(struct mlx5_wc_sq *sq, __be32 mmio_wqe[16],
+ size_t mmio_wqe_size)
+{
+#ifdef CONFIG_KERNEL_MODE_NEON
+ if (cpu_has_neon()) {
+ kernel_neon_begin();
+ mlx5_wc_neon_iowrite64_copy(sq->bfreg.map + sq->bfreg.offset,
+ mmio_wqe);
+ kernel_neon_end();
+ return;
+ }
+#endif
+ __iowrite64_copy(sq->bfreg.map + sq->bfreg.offset, mmio_wqe,
+ mmio_wqe_size / 8);
+}
+
static void mlx5_wc_destroy_sq(struct mlx5_wc_sq *sq)
{
mlx5_core_destroy_sq(sq->cq.mdev, sq->sqn);
@@ -288,8 +309,7 @@ static void mlx5_wc_post_nop(struct mlx5_wc_sq *sq, bool signaled)
*/
wmb();
- __iowrite64_copy(sq->bfreg.map + sq->bfreg.offset, mmio_wqe,
- sizeof(mmio_wqe) / 8);
+ mlx5_iowrite64_copy(sq, mmio_wqe, sizeof(mmio_wqe));
sq->bfreg.offset ^= buf_size;
}
base-commit: 1f24a240974589ce42f70502ccb3ff3f5189d69a
--
2.31.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-15 8:35 [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs Tariq Toukan
@ 2025-09-15 22:18 ` Nathan Chancellor
2025-09-15 22:27 ` Nathan Chancellor
0 siblings, 1 reply; 10+ messages in thread
From: Nathan Chancellor @ 2025-09-15 22:18 UTC (permalink / raw)
To: Tariq Toukan
Cc: Catalin Marinas, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Saeed Mahameed, Leon Romanovsky,
Mark Bloch, Sabrina Dubroca, netdev, linux-rdma, linux-kernel,
Gal Pressman, Leon Romanovsky, Jason Gunthorpe, Michael Guralnik,
Moshe Shemesh, Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nick Desaulniers, Salil Mehta, Sven Schnelle, Thomas Gleixner,
x86, Yisen Zhuang, Arnd Bergmann, Leon Romanovsky, linux-arch,
linux-arm-kernel, Mark Rutland, Michael Guralnik, patches,
Niklas Schnelle, Jijie Shao, Patrisious Haddad
On Mon, Sep 15, 2025 at 11:35:08AM +0300, Tariq Toukan wrote:
...
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index d77696f46eb5..06d0eb190816 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -176,3 +176,9 @@ mlx5_core-$(CONFIG_PCIE_TPH) += lib/st.o
>
> obj-$(CONFIG_MLX5_DPLL) += mlx5_dpll.o
> mlx5_dpll-y := dpll.o
> +
> +#
> +# NEON WC specific for mlx5
> +#
> +mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
> +FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
Does this work as is? I think this needs to be CFLAGS instead of FLAGS
but I did not test to verify.
Cheers,
Nathan
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-15 22:18 ` Nathan Chancellor
@ 2025-09-15 22:27 ` Nathan Chancellor
2025-09-15 22:48 ` Jason Gunthorpe
0 siblings, 1 reply; 10+ messages in thread
From: Nathan Chancellor @ 2025-09-15 22:27 UTC (permalink / raw)
To: Tariq Toukan
Cc: Catalin Marinas, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Saeed Mahameed, Leon Romanovsky,
Mark Bloch, Sabrina Dubroca, netdev, linux-rdma, linux-kernel,
Gal Pressman, Leon Romanovsky, Jason Gunthorpe, Michael Guralnik,
Moshe Shemesh, Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nick Desaulniers, Salil Mehta, Sven Schnelle, Thomas Gleixner,
x86, Yisen Zhuang, Arnd Bergmann, Leon Romanovsky, linux-arch,
linux-arm-kernel, Mark Rutland, Michael Guralnik, patches,
Niklas Schnelle, Jijie Shao, Patrisious Haddad
On Mon, Sep 15, 2025 at 03:18:59PM -0700, Nathan Chancellor wrote:
> On Mon, Sep 15, 2025 at 11:35:08AM +0300, Tariq Toukan wrote:
> ...
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > index d77696f46eb5..06d0eb190816 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > @@ -176,3 +176,9 @@ mlx5_core-$(CONFIG_PCIE_TPH) += lib/st.o
> >
> > obj-$(CONFIG_MLX5_DPLL) += mlx5_dpll.o
> > mlx5_dpll-y := dpll.o
> > +
> > +#
> > +# NEON WC specific for mlx5
> > +#
> > +mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
> > +FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
>
> Does this work as is? I think this needs to be CFLAGS instead of FLAGS
> but I did not test to verify.
Also, Documentation/core-api/floating-point.rst states that code should
also use CFLAGS_REMOVE_ for CC_FLAGS_NO_FPU as well as adding
CC_FLAGS_FPU.
CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
Cheers,
Nathan
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-15 22:27 ` Nathan Chancellor
@ 2025-09-15 22:48 ` Jason Gunthorpe
2025-09-15 23:15 ` Nathan Chancellor
0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2025-09-15 22:48 UTC (permalink / raw)
To: Nathan Chancellor
Cc: Tariq Toukan, Catalin Marinas, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
Leon Romanovsky, Mark Bloch, Sabrina Dubroca, netdev, linux-rdma,
linux-kernel, Gal Pressman, Leon Romanovsky, Michael Guralnik,
Moshe Shemesh, Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nick Desaulniers, Salil Mehta, Sven Schnelle, Thomas Gleixner,
x86, Yisen Zhuang, Arnd Bergmann, Leon Romanovsky, linux-arch,
linux-arm-kernel, Mark Rutland, Michael Guralnik, patches,
Niklas Schnelle, Jijie Shao, Patrisious Haddad
On Mon, Sep 15, 2025 at 03:27:58PM -0700, Nathan Chancellor wrote:
> On Mon, Sep 15, 2025 at 03:18:59PM -0700, Nathan Chancellor wrote:
> > On Mon, Sep 15, 2025 at 11:35:08AM +0300, Tariq Toukan wrote:
> > ...
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > > index d77696f46eb5..06d0eb190816 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > > @@ -176,3 +176,9 @@ mlx5_core-$(CONFIG_PCIE_TPH) += lib/st.o
> > >
> > > obj-$(CONFIG_MLX5_DPLL) += mlx5_dpll.o
> > > mlx5_dpll-y := dpll.o
> > > +
> > > +#
> > > +# NEON WC specific for mlx5
> > > +#
> > > +mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
> > > +FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
> >
> > Does this work as is? I think this needs to be CFLAGS instead of FLAGS
> > but I did not test to verify.
>
> Also, Documentation/core-api/floating-point.rst states that code should
> also use CFLAGS_REMOVE_ for CC_FLAGS_NO_FPU as well as adding
> CC_FLAGS_FPU.
>
> CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
I wondered if you needed the seperate compilation unit at all since it
it all done with inline assembly.. Since the makefile seems to have a
typo, it suggests you don't need the compilation unit and it could
just be a little inline protected by CONFIG_KERNEL_MODE_NEON.
Jason
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-15 22:48 ` Jason Gunthorpe
@ 2025-09-15 23:15 ` Nathan Chancellor
2025-09-16 8:39 ` Patrisious Haddad
0 siblings, 1 reply; 10+ messages in thread
From: Nathan Chancellor @ 2025-09-15 23:15 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tariq Toukan, Catalin Marinas, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
Leon Romanovsky, Mark Bloch, Sabrina Dubroca, netdev, linux-rdma,
linux-kernel, Gal Pressman, Leon Romanovsky, Michael Guralnik,
Moshe Shemesh, Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nick Desaulniers, Salil Mehta, Sven Schnelle, Thomas Gleixner,
x86, Yisen Zhuang, Arnd Bergmann, Leon Romanovsky, linux-arch,
linux-arm-kernel, Mark Rutland, Michael Guralnik, patches,
Niklas Schnelle, Jijie Shao, Patrisious Haddad
On Mon, Sep 15, 2025 at 07:48:10PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 15, 2025 at 03:27:58PM -0700, Nathan Chancellor wrote:
> > On Mon, Sep 15, 2025 at 03:18:59PM -0700, Nathan Chancellor wrote:
> > > On Mon, Sep 15, 2025 at 11:35:08AM +0300, Tariq Toukan wrote:
> > > ...
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > > > index d77696f46eb5..06d0eb190816 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > > > @@ -176,3 +176,9 @@ mlx5_core-$(CONFIG_PCIE_TPH) += lib/st.o
> > > >
> > > > obj-$(CONFIG_MLX5_DPLL) += mlx5_dpll.o
> > > > mlx5_dpll-y := dpll.o
> > > > +
> > > > +#
> > > > +# NEON WC specific for mlx5
> > > > +#
> > > > +mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
> > > > +FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
> > >
> > > Does this work as is? I think this needs to be CFLAGS instead of FLAGS
> > > but I did not test to verify.
> >
> > Also, Documentation/core-api/floating-point.rst states that code should
> > also use CFLAGS_REMOVE_ for CC_FLAGS_NO_FPU as well as adding
> > CC_FLAGS_FPU.
> >
> > CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
>
> I wondered if you needed the seperate compilation unit at all since it
> it all done with inline assembly.. Since the makefile seems to have a
> typo, it suggests you don't need the compilation unit and it could
> just be a little inline protected by CONFIG_KERNEL_MODE_NEON.
Hmmm, clang rejects the current patch
drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c:9:3: error: instruction requires: neon
9 | ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
| ^
<inline asm>:1:2: note: instantiated into assembly here
1 | ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x19]
| ^
drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c:9:48: error: instruction requires: neon
9 | ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
| ^
<inline asm>:2:2: note: instantiated into assembly here
2 | st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x20]
| ^
while GCC accepts it... It looks like GCC's -mgeneral-regs-only only
impacts the compiler using floating-point and SIMD registers after [1]
in GCC 6.x, whereas clang's restriction is on both the compiler and
assembler. Perhaps clang should be adjusted to match but its behavior
seems more desirable for the kernel to ensure floating-point code is
properly separated and called between kernel_fpu_{begin,end}(). This
error is resolved with the following diff.
[1]: https://gcc.gnu.org/cgit/gcc/commit/?id=7d9425d46b58e69667300331aa55ebddddcceaeb
Cheers,
Nathan
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 06d0eb190816..a85fc21419d8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -181,4 +181,5 @@ mlx5_dpll-y := dpll.o
# NEON WC specific for mlx5
#
mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
-FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
+CFLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
+CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-15 23:15 ` Nathan Chancellor
@ 2025-09-16 8:39 ` Patrisious Haddad
2025-09-16 8:58 ` Arnd Bergmann
0 siblings, 1 reply; 10+ messages in thread
From: Patrisious Haddad @ 2025-09-16 8:39 UTC (permalink / raw)
To: Nathan Chancellor, Jason Gunthorpe
Cc: Tariq Toukan, Catalin Marinas, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
Leon Romanovsky, Mark Bloch, Sabrina Dubroca, netdev, linux-rdma,
linux-kernel, Gal Pressman, Leon Romanovsky, Michael Guralnik,
Moshe Shemesh, Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nick Desaulniers, Salil Mehta, Sven Schnelle, Thomas Gleixner,
x86, Yisen Zhuang, Arnd Bergmann, Leon Romanovsky, linux-arch,
linux-arm-kernel, Mark Rutland, Michael Guralnik, patches,
Niklas Schnelle, Jijie Shao
On 9/16/2025 2:15 AM, Nathan Chancellor wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, Sep 15, 2025 at 07:48:10PM -0300, Jason Gunthorpe wrote:
>> On Mon, Sep 15, 2025 at 03:27:58PM -0700, Nathan Chancellor wrote:
>>> On Mon, Sep 15, 2025 at 03:18:59PM -0700, Nathan Chancellor wrote:
>>>> On Mon, Sep 15, 2025 at 11:35:08AM +0300, Tariq Toukan wrote:
>>>> ...
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
>>>>> index d77696f46eb5..06d0eb190816 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
>>>>> @@ -176,3 +176,9 @@ mlx5_core-$(CONFIG_PCIE_TPH) += lib/st.o
>>>>>
>>>>> obj-$(CONFIG_MLX5_DPLL) += mlx5_dpll.o
>>>>> mlx5_dpll-y := dpll.o
>>>>> +
>>>>> +#
>>>>> +# NEON WC specific for mlx5
>>>>> +#
>>>>> +mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
>>>>> +FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
>>>> Does this work as is? I think this needs to be CFLAGS instead of FLAGS
>>>> but I did not test to verify.
>>> Also, Documentation/core-api/floating-point.rst states that code should
>>> also use CFLAGS_REMOVE_ for CC_FLAGS_NO_FPU as well as adding
>>> CC_FLAGS_FPU.
>>>
>>> CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
>> I wondered if you needed the seperate compilation unit at all since it
>> it all done with inline assembly.. Since the makefile seems to have a
>> typo, it suggests you don't need the compilation unit and it could
>> just be a little inline protected by CONFIG_KERNEL_MODE_NEON.
There is difference between what actually compiles and the effect of
these flags on actual performance/assembly translation. To avoid finding
that the hard way I prefer to stick to their documentation which does as
Natan described below,
a separate compilation unit between begin and end and the correct flags
- and eventually that was what I tested , I missed to re-test this post
finishing my code review - thinking my changes were only cosmetic ...
> Hmmm, clang rejects the current patch
>
> drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c:9:3: error: instruction requires: neon
> 9 | ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
> | ^
> <inline asm>:1:2: note: instantiated into assembly here
> 1 | ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x19]
> | ^
> drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c:9:48: error: instruction requires: neon
> 9 | ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
> | ^
> <inline asm>:2:2: note: instantiated into assembly here
> 2 | st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x20]
> | ^
>
> while GCC accepts it... It looks like GCC's -mgeneral-regs-only only
> impacts the compiler using floating-point and SIMD registers after [1]
> in GCC 6.x, whereas clang's restriction is on both the compiler and
> assembler. Perhaps clang should be adjusted to match but its behavior
> seems more desirable for the kernel to ensure floating-point code is
> properly separated and called between kernel_fpu_{begin,end}(). This
> error is resolved with the following diff.
>
> [1]: https://gcc.gnu.org/cgit/gcc/commit/?id=7d9425d46b58e69667300331aa55ebddddcceaeb
>
> Cheers,
> Nathan
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index 06d0eb190816..a85fc21419d8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -181,4 +181,5 @@ mlx5_dpll-y := dpll.o
> # NEON WC specific for mlx5
> #
> mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
> -FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
> +CFLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
> +CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
You are spot on, I checked my patchset and the actual tested code
(performance wise) beyond compilation used the following code:
ifeq ($(ARCH),arm64)
CFLAGS_lib/neon_iowrite64_copy.o += -ffreestanding
CFLAGS_REMOVE_lib/neon_iowrite64_copy.o += -mgeneral-regs-only
endif
Which is actually equivalent to the diff you sent, Thanks for the
heads-up will fix and resend.
Thanks, Patrisious.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-16 8:39 ` Patrisious Haddad
@ 2025-09-16 8:58 ` Arnd Bergmann
2025-09-16 9:47 ` Patrisious Haddad
0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2025-09-16 8:58 UTC (permalink / raw)
To: Patrisious Haddad, Nathan Chancellor, Jason Gunthorpe
Cc: Tariq Toukan, Catalin Marinas, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Andrew Lunn, David S . Miller, Saeed Mahameed,
Leon Romanovsky, Mark Bloch, Sabrina Dubroca, Netdev, linux-rdma,
linux-kernel, Gal Pressman, Leon Romanovsky, Michael Guralnik,
Moshe Shemesh, Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nick Desaulniers, Salil Mehta, Sven Schnelle, Thomas Gleixner,
x86, Yisen Zhuang, Leon Romanovsky, Linux-Arch, linux-arm-kernel,
Mark Rutland, Michael Guralnik, patches, Niklas Schnelle,
Jijie Shao
On Tue, Sep 16, 2025, at 10:39, Patrisious Haddad wrote:
> On 9/16/2025 2:15 AM, Nathan Chancellor wrote:
>> External email: Use caution opening links or attachments
>
> ifeq ($(ARCH),arm64)
> CFLAGS_lib/neon_iowrite64_copy.o += -ffreestanding
> CFLAGS_REMOVE_lib/neon_iowrite64_copy.o += -mgeneral-regs-only
> endif
>
> Which is actually equivalent to the diff you sent, Thanks for the
> heads-up will fix and resend.
>
I think it's better to handle this inside of the inline asm itself
by adding
".arch_extension simd;\n\t"
at the start of it.
Arnd
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-16 8:58 ` Arnd Bergmann
@ 2025-09-16 9:47 ` Patrisious Haddad
2025-09-16 12:27 ` Jason Gunthorpe
0 siblings, 1 reply; 10+ messages in thread
From: Patrisious Haddad @ 2025-09-16 9:47 UTC (permalink / raw)
To: Arnd Bergmann, Nathan Chancellor, Jason Gunthorpe
Cc: Tariq Toukan, Catalin Marinas, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Andrew Lunn, David S . Miller, Saeed Mahameed,
Leon Romanovsky, Mark Bloch, Sabrina Dubroca, Netdev, linux-rdma,
linux-kernel, Gal Pressman, Leon Romanovsky, Michael Guralnik,
Moshe Shemesh, Will Deacon, Alexander Gordeev, Andrew Morton,
Christian Borntraeger, Borislav Petkov, Dave Hansen,
Gerald Schaefer, Vasily Gorbik, Heiko Carstens, H. Peter Anvin,
Justin Stitt, linux-s390, llvm, Ingo Molnar, Bill Wendling,
Nick Desaulniers, Salil Mehta, Sven Schnelle, Thomas Gleixner,
x86, Yisen Zhuang, Leon Romanovsky, Linux-Arch, linux-arm-kernel,
Mark Rutland, Michael Guralnik, patches, Niklas Schnelle,
Jijie Shao
On 9/16/2025 11:58 AM, Arnd Bergmann wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, Sep 16, 2025, at 10:39, Patrisious Haddad wrote:
>> On 9/16/2025 2:15 AM, Nathan Chancellor wrote:
>>> External email: Use caution opening links or attachments
>> ifeq ($(ARCH),arm64)
>> CFLAGS_lib/neon_iowrite64_copy.o += -ffreestanding
>> CFLAGS_REMOVE_lib/neon_iowrite64_copy.o += -mgeneral-regs-only
>> endif
>>
>> Which is actually equivalent to the diff you sent, Thanks for the
>> heads-up will fix and resend.
>>
> I think it's better to handle this inside of the inline asm itself
> by adding
>
> ".arch_extension simd;\n\t"
>
> at the start of it.
I don't get it why and how is that better - than using the correct CC
flags for neon, also are you suggesting this in addition or instead of
the CC flags fix natan sent above ?
Using the correct CC flags by itself is sufficient and correct - and I
don't see other neon users using the ".arch_extension simd" you
mentioned , so why do you think it is needed here ?
and I'm assuming you meant to add it like so? (incase it is really
needed - I'm still not convinced it is ...)
diff --git
a/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c
b/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c
index 8c07d2040607..cde3d11909a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c
@@ -6,7 +6,8 @@
void mlx5_wc_neon_iowrite64_copy(void __iomem *to, const void *from)
{
asm volatile
- ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
+ (".arch_extension simd;\n\t"
+ "ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]"
"st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%1]"
:
: "r"(from), "r"(to)
Patrisious.
>
> Arnd
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-16 9:47 ` Patrisious Haddad
@ 2025-09-16 12:27 ` Jason Gunthorpe
2025-09-16 13:13 ` Patrisious Haddad
0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 12:27 UTC (permalink / raw)
To: Patrisious Haddad
Cc: Arnd Bergmann, Nathan Chancellor, Tariq Toukan, Catalin Marinas,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S . Miller, Saeed Mahameed, Leon Romanovsky, Mark Bloch,
Sabrina Dubroca, Netdev, linux-rdma, linux-kernel, Gal Pressman,
Leon Romanovsky, Michael Guralnik, Moshe Shemesh, Will Deacon,
Alexander Gordeev, Andrew Morton, Christian Borntraeger,
Borislav Petkov, Dave Hansen, Gerald Schaefer, Vasily Gorbik,
Heiko Carstens, H. Peter Anvin, Justin Stitt, linux-s390, llvm,
Ingo Molnar, Bill Wendling, Nick Desaulniers, Salil Mehta,
Sven Schnelle, Thomas Gleixner, x86, Yisen Zhuang,
Leon Romanovsky, Linux-Arch, linux-arm-kernel, Mark Rutland,
Michael Guralnik, patches, Niklas Schnelle, Jijie Shao
On Tue, Sep 16, 2025 at 12:47:17PM +0300, Patrisious Haddad wrote:
> Using the correct CC flags by itself is sufficient and correct - and I don't
> see other neon users using the ".arch_extension simd" you mentioned , so why
> do you think it is needed here ?
If you do it as Arnd suggests then we don't need a another file,
makefile changes and so on. You can just drop a 5 line inline right
before it is used.
Jason
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
2025-09-16 12:27 ` Jason Gunthorpe
@ 2025-09-16 13:13 ` Patrisious Haddad
0 siblings, 0 replies; 10+ messages in thread
From: Patrisious Haddad @ 2025-09-16 13:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Arnd Bergmann, Nathan Chancellor, Tariq Toukan, Catalin Marinas,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S . Miller, Saeed Mahameed, Leon Romanovsky, Mark Bloch,
Sabrina Dubroca, Netdev, linux-rdma, linux-kernel, Gal Pressman,
Leon Romanovsky, Michael Guralnik, Moshe Shemesh, Will Deacon,
Alexander Gordeev, Andrew Morton, Christian Borntraeger,
Borislav Petkov, Dave Hansen, Gerald Schaefer, Vasily Gorbik,
Heiko Carstens, H. Peter Anvin, Justin Stitt, linux-s390, llvm,
Ingo Molnar, Bill Wendling, Nick Desaulniers, Salil Mehta,
Sven Schnelle, Thomas Gleixner, x86, Yisen Zhuang,
Leon Romanovsky, Linux-Arch, linux-arm-kernel, Mark Rutland,
Michael Guralnik, patches, Niklas Schnelle, Jijie Shao
On 9/16/2025 3:27 PM, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 12:47:17PM +0300, Patrisious Haddad wrote:
>
>> Using the correct CC flags by itself is sufficient and correct - and I don't
>> see other neon users using the ".arch_extension simd" you mentioned , so why
>> do you think it is needed here ?
> If you do it as Arnd suggests then we don't need a another file,
> makefile changes and so on. You can just drop a 5 line inline right
> before it is used.
I see, thanks for clarifying that, will try it and test it locally, if
that does work , and gives identical results then indeed that it is way
better/cleaner, will send V3 as such after testing.
Patrisious.
>
> Jason
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-09-16 13:13 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-15 8:35 [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs Tariq Toukan
2025-09-15 22:18 ` Nathan Chancellor
2025-09-15 22:27 ` Nathan Chancellor
2025-09-15 22:48 ` Jason Gunthorpe
2025-09-15 23:15 ` Nathan Chancellor
2025-09-16 8:39 ` Patrisious Haddad
2025-09-16 8:58 ` Arnd Bergmann
2025-09-16 9:47 ` Patrisious Haddad
2025-09-16 12:27 ` Jason Gunthorpe
2025-09-16 13:13 ` Patrisious Haddad
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).