From: Jason Gunthorpe <jgg@nvidia.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Niklas Schnelle <schnelle@linux.ibm.com>,
	Leon Romanovsky <leon@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-rdma@vger.kernel.org, llvm@lists.linux.dev,
	Michael Guralnik <michaelgur@mellanox.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>
Subject: Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Date: Wed, 24 Jan 2024 15:26:34 -0400
Message-ID: <20240124192634.GJ1455070@nvidia.com>
In-Reply-To: <ZbFHPTUaBmbHYnwx@arm.com>

On Wed, Jan 24, 2024 at 05:22:05PM +0000, Catalin Marinas wrote:
> On Wed, Jan 24, 2024 at 09:27:19AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 24, 2024 at 12:40:29PM +0000, Catalin Marinas wrote:
> > 
> > > > Just to be clear, that means we should drop this patch ("arm64/io: add
> > > > memcpy_toio_64") for now, right?
> > > 
> > > In its current form yes, but that doesn't mean that memcpy_toio_64()
> > > cannot be reworked differently.
> > 
> > I gave up on touching memcpy_toio_64(); it doesn't work very well
> > because of the weak alignment guarantees.
> > 
> > Instead I followed your suggestion to fix __iowrite64_copy()
> 
> I forgot the details. Was it to introduce an __iowrite512_copy()
> function or to simply use __iowrite64_copy() with a count of 8?

Count of 8

> Just invoking __iowrite64_copy() with a count of 8 wouldn't work well
> even if we have the writeq generating STR with an offset (well, it also
> increments the pointers, so I don't think Mark's optimisation would
> help). The copy implies loads and these would be interleaved with stores
> and potentially get in the way of write combining. An
> __iowrite512_copy() or maybe even an optimised __iowrite64_copy() for
> count 8 could do the loads first followed by the stores. You can use a
> special path in __iowrite64_copy() with __builtin_constant_p().

I did exactly the latter like this:

static inline void __const_memcpy_toio_aligned64(volatile u64 __iomem *to,
						 const u64 *from, size_t count)
{
	/*
	 * Consecutive STPs with no interleaved loads give the CPU the
	 * best chance of merging the whole block into one WC write.
	 */
	switch (count) {
	case 8:
		asm volatile("stp %x0, %x1, [%8, #16 * 0]\n"
			     "stp %x2, %x3, [%8, #16 * 1]\n"
			     "stp %x4, %x5, [%8, #16 * 2]\n"
			     "stp %x6, %x7, [%8, #16 * 3]\n"
			     :
			     : "rZ"(from[0]), "rZ"(from[1]), "rZ"(from[2]),
			       "rZ"(from[3]), "rZ"(from[4]), "rZ"(from[5]),
			       "rZ"(from[6]), "rZ"(from[7]), "r"(to));
		break;
	case 4:
		asm volatile("stp %x0, %x1, [%4, #16 * 0]\n"
			     "stp %x2, %x3, [%4, #16 * 1]\n"
			     :
			     : "rZ"(from[0]), "rZ"(from[1]), "rZ"(from[2]),
			       "rZ"(from[3]), "r"(to));
		break;
	case 2:
		asm volatile("stp %x0, %x1, [%2, #16 * 0]\n"
			     :
			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
		break;
	case 1:
		__raw_writeq(*from, to);
		break;
	default:
		BUILD_BUG();
	}
}

void __iowrite64_copy_full(void __iomem *to, const void *from, size_t count);

static inline void __const_iowrite64_copy(void __iomem *to, const void *from,
					  size_t count)
{
	if (count == 8 || count == 4 || count == 2 || count == 1) {
		__const_memcpy_toio_aligned64(to, from, count);
		dgh(); /* DGH: hint that no more stores will merge, push out the WC buffer */
	} else {
		__iowrite64_copy_full(to, from, count);
	}
}

#define __iowrite64_copy(to, from, count)                  \
	(__builtin_constant_p(count) ?                     \
		 __const_iowrite64_copy(to, from, count) : \
		 __iowrite64_copy_full(to, from, count))
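
So a constant-count caller, e.g. a 64-byte doorbell write, collapses
straight to the STP block plus the DGH (hypothetical call site,
db_reg/wqe stand in for the driver's real variables):

	__iowrite64_copy(db_reg, wqe, 8);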

And the out-of-line __iowrite64_copy_full() generates good assembly
that loops over 8/4/2/1-sized blocks.
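
The full version is shaped roughly like this (a sketch of the idea
only; the exact code is in the branch linked below):

void __iowrite64_copy_full(void __iomem *to, const void *from, size_t count)
{
	volatile u64 __iomem *dst = to;
	const u64 *src = from;

	/*
	 * Peel off 64-byte chunks, then 32/16/8-byte remainders, so
	 * every store group goes through the STP paths above.
	 */
	while (count >= 8) {
		__const_memcpy_toio_aligned64(dst, src, 8);
		dst += 8;
		src += 8;
		count -= 8;
	}
	if (count >= 4) {
		__const_memcpy_toio_aligned64(dst, src, 4);
		dst += 4;
		src += 4;
		count -= 4;
	}
	if (count >= 2) {
		__const_memcpy_toio_aligned64(dst, src, 2);
		dst += 2;
		src += 2;
		count -= 2;
	}
	if (count)
		__raw_writeq(*src, dst);
	dgh();
}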

I was going to send it out yesterday but am waiting for some
conclusion on the STP.

https://github.com/jgunthorpe/linux/commits/mlx5_wc/

> void __iowrite64_copy(void __iomem *to, const void *from,
> 		      size_t count)
> {
> 	u64 __iomem *dst = to;
> 	const u64 *src = from;
> 	const u64 *end = src + count;
> 
> 	/*
> 	 * Try a 64-byte write, the CPUs tend to write-combine them.
> 	 */
> 	if (__builtin_constant_p(count) && count == 8) {
> 		__raw_writeq(*src, dst);
> 		__raw_writeq(*(src + 1), dst + 1);
> 		__raw_writeq(*(src + 2), dst + 2);
> 		__raw_writeq(*(src + 3), dst + 3);
> 		__raw_writeq(*(src + 4), dst + 4);
> 		__raw_writeq(*(src + 5), dst + 5);
> 		__raw_writeq(*(src + 6), dst + 6);
> 		__raw_writeq(*(src + 7), dst + 7);
> 		return;
> 	}

I already looked at this; clang with the "Qo" constraint does:

ffffffc08086e6ec:       f9400029        ldr     x9, [x1]
ffffffc08086e6f0:       91002008        add     x8, x0, #0x8
ffffffc08086e6f4:       f9000009        str     x9, [x0]
ffffffc08086e6f8:       f9400429        ldr     x9, [x1, #8]
ffffffc08086e6fc:       f9000109        str     x9, [x8]
ffffffc08086e700:       91004008        add     x8, x0, #0x10
ffffffc08086e704:       f9400829        ldr     x9, [x1, #16]
ffffffc08086e708:       f9000109        str     x9, [x8]
ffffffc08086e70c:       91006008        add     x8, x0, #0x18
ffffffc08086e710:       f9400c29        ldr     x9, [x1, #24]
ffffffc08086e714:       f9000109        str     x9, [x8]
ffffffc08086e718:       91008008        add     x8, x0, #0x20
ffffffc08086e71c:       f9401029        ldr     x9, [x1, #32]
ffffffc08086e720:       f9000109        str     x9, [x8]
ffffffc08086e724:       9100a008        add     x8, x0, #0x28
ffffffc08086e728:       f9401429        ldr     x9, [x1, #40]
ffffffc08086e72c:       f9000109        str     x9, [x8]
ffffffc08086e730:       9100c008        add     x8, x0, #0x30
ffffffc08086e734:       f9401829        ldr     x9, [x1, #48]
ffffffc08086e738:       f9000109        str     x9, [x8]
ffffffc08086e73c:       f9401c28        ldr     x8, [x1, #56]
ffffffc08086e740:       9100e009        add     x9, x0, #0x38
ffffffc08086e744:       f9000128        str     x8, [x9]

Which is not good. Gcc does better, but is not perfect.
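
For context, the "Qo" experiment makes writeq take a memory operand
instead of an address register, roughly like this (a sketch of Mark's
suggested form, not the final patch):

static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
	volatile u64 __iomem *ptr = addr;

	/*
	 * A memory operand with "Qo" lets the compiler use
	 * register+immediate addressing; the plain "r"(addr) form
	 * forces it to materialize every offset with an add.
	 */
	asm volatile("str %x1, %0" : "=Qo"(*ptr) : "rZ"(val));
}

gcc makes use of the offset form; clang appears to fall back to "Q"
(base register only), which is where all the adds above come from.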

> What we don't have is inlining of __iowrite64_copy() but if we need that
> we can move away from a weak symbol to a static inline.

Yes, I did this as well. It helps s390 and x86 nicely too.

> Give this a go and see if you get write-combining in your hardware.
> If the loads interleaved with the stores get in the way, maybe we can
> resort to inline asm.

For reference, the actual assembly (see post_send_nop()) that fails is:

   13534:       d503201f        nop
   13538:       93407ea1        sxtw    x1, w21
   1353c:       f100403f        cmp     x1, #0x10
   13540:       54000488        b.hi    135d0 <post_send_nop.isra.0+0x260>  // b.pmore
   13544:       a9408a63        ldp     x3, x2, [x19, #8]
   13548:       f84086c4        ldr     x4, [x22], #8
   1354c:       f9400042        ldr     x2, [x2]
   13550:       8b030283        add     x3, x20, x3
   13554:       8b030042        add     x2, x2, x3
   13558:       f9000044        str     x4, [x2]
   1355c:       91002294        add     x20, x20, #0x8
   13560:       11000ab5        add     w21, w21, #0x2
   13564:       f101029f        cmp     x20, #0x40
   13568:       54fffe81        b.ne    13538 <post_send_nop.isra.0+0x1c8>  // b.any
   1356c:       d50320df        hint    #0x6

Not very good code from the compiler (the main issue is that it
reloads the dest pointer every iteration), but still, all those loads
come from memory that was recently touched, so they should be in-cache
most of the time. It isn't as though we are sitting around waiting for
a lengthy dcache fill and timing out the WC buffer.

However, it is 136 instructions, so it feels like the issue may be
that the write-combining buffer auto-flushes in fewer. Maybe it
auto-flushes after 128/64/32/16/8 cycles now. I know there has been
tension between reducing WC latency and maximizing aggregation.

The suggestion that it should avoid interleaved instructions and use
STP came from our CPU architecture team.

The assembly from this series that I have been able to get tested, and
that does work, is this:

ffffffc08086ec84:       d5033e9f        dsb     st
ffffffc08086ec88:       f941de6b        ldr     x11, [x19, #952]
ffffffc08086ec8c:       f941da6c        ldr     x12, [x19, #944]
ffffffc08086ec90:       f940016b        ldr     x11, [x11]
ffffffc08086ec94:       8b0c016b        add     x11, x11, x12
ffffffc08086ec98:       a9002969        stp     x9, x10, [x11]
ffffffc08086ec9c:       a9012168        stp     x8, x8, [x11, #16]
ffffffc08086eca0:       a9022168        stp     x8, x8, [x11, #32]
ffffffc08086eca4:       a9032168        stp     x8, x8, [x11, #48]
ffffffc08086eca8:       d50320df        hint    #0x6

The revised __iowrite64_copy() version also creates this assembly.

The ST4-based variant in userspace also works.
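
For reference, that one is shaped like this with the NEON intrinsics
(a sketch assuming a write-combining mapping of the doorbell page;
st4_copy_64b is a hypothetical name):

#include <arm_neon.h>

/*
 * LD4 de-interleaves and ST4 re-interleaves, so the pair copies the
 * 64 bytes unchanged while emitting a single ST4 store to the device.
 */
static inline void st4_copy_64b(void *bf_reg, const void *src)
{
	vst4q_u64((uint64_t *)bf_reg, vld4q_u64((const uint64_t *)src));
}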

Remember, there are two related topics here: mlx5 would like a high
frequency of large TLP generation, but doesn't care about raw
performance. If the 24 instructions clang generates achieve that, then
great.

hns/broadcom/others need the large TLP and also care about
performance. In that case the STP block is the best we can do in the
kernel, as ST4 is off the table.

I would like the architecture code to do a good job on performance,
since it is a generic API for all drivers.

Regarding the 8x STR option, I have to get that tested.

Jason
