From: Jason Gunthorpe <jgg@nvidia.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Justin Stitt <justinstitt@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Leon Romanovsky <leon@kernel.org>,
	linux-rdma@vger.kernel.org, linux-s390@vger.kernel.org,
	llvm@lists.linux.dev, Ingo Molnar <mingo@redhat.com>,
	Bill Wendling <morbo@google.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	netdev@vger.kernel.org, Paolo Abeni <pabeni@redhat.com>,
	Salil Mehta <salil.mehta@huawei.com>,
	Jijie Shao <shaojijie@huawei.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	x86@kernel.org, Yisen Zhuang <yisen.zhuang@huawei.com>,
	Arnd Bergmann <arnd@arndb.de>,
	Leon Romanovsky <leonro@mellanox.com>,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Mark Rutland <mark.rutland@arm.com>,
	Michael Guralnik <michaelgur@mellanox.com>,
	patches@lists.linux.dev, Niklas Schnelle <schnelle@linux.ibm.com>,
	Will Deacon <will@kernel.org>
Subject: Re: [PATCH 4/6] arm64/io: Provide a WC friendly __iowriteXX_copy()
Date: Wed, 28 Feb 2024 19:06:16 -0400
Message-ID: <20240228230616.GS13330@nvidia.com>
In-Reply-To: <Zd27XtDg_NDzLXg-@arm.com>

On Tue, Feb 27, 2024 at 10:37:18AM +0000, Catalin Marinas wrote:
> On Tue, Feb 20, 2024 at 09:17:08PM -0400, Jason Gunthorpe wrote:
> > +/*
> > + * This generates a memcpy that works on a from/to address which is aligned to
> > + * bits. Count is in terms of the number of bits sized quantities to copy. It
> > + * optimizes to use the STR groupings when possible so that it is WC friendly.
> > + */
> > +#define memcpy_toio_aligned(to, from, count, bits)                        \
> > +	({                                                                \
> > +		volatile u##bits __iomem *_to = to;                       \
> > +		const u##bits *_from = from;                              \
> > +		size_t _count = count;                                    \
> > +		const u##bits *_end_from = _from + ALIGN_DOWN(_count, 8); \
> > +                                                                          \
> > +		for (; _from < _end_from; _from += 8, _to += 8)           \
> > +			__const_memcpy_toio_aligned##bits(_to, _from, 8); \
> > +		if ((_count % 8) >= 4) {                                  \
> > +			__const_memcpy_toio_aligned##bits(_to, _from, 4); \
> > +			_from += 4;                                       \
> > +			_to += 4;                                         \
> > +		}                                                         \
> > +		if ((_count % 4) >= 2) {                                  \
> > +			__const_memcpy_toio_aligned##bits(_to, _from, 2); \
> > +			_from += 2;                                       \
> > +			_to += 2;                                         \
> > +		}                                                         \
> > +		if (_count % 2)                                           \
> > +			__const_memcpy_toio_aligned##bits(_to, _from, 1); \
> > +	})
> 
> Do we actually need all this if count is not constant? If it's not
> performance critical anywhere, I'd rather copy the generic
> implementation, it's easier to read.

Which generic version?

The point is to maximize WC effects with non-constant values, so I
think we do need something like this, i.e. we can't just fall back to
looping over 64-bit stores one at a time.

If we don't use the large block stores we know we get very poor WC
behavior. So at least the 8 and 4 constant value sections are
needed. At that point you may as well just do 4 and 2 instead of
another loop.
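
To make the decomposition concrete, here is a rough userspace sketch of
the same control flow for the 64-bit case. It is not the patch code:
copy_block64() and copy_wc_friendly64() are made-up names, and
copy_block64() is only a plain-C stand-in for
__const_memcpy_toio_aligned64(), which in the patch is what lets the
compiler emit a contiguous group of STRs for a constant size. The
returned block count exists only so the behavior can be checked.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for __const_memcpy_toio_aligned64(). With a
 * constant n the real helper can become one contiguous group of stores;
 * here it is just a simple loop so the sketch runs in userspace.
 */
static inline void copy_block64(volatile uint64_t *to, const uint64_t *from,
				size_t n)
{
	for (size_t i = 0; i < n; i++)
		to[i] = from[i];
}

/*
 * Copy count 64-bit words: a loop of 8-word blocks, then at most one
 * 4-word, one 2-word and one 1-word block for the tail.
 * Returns the number of blocks issued (illustration only).
 */
static size_t copy_wc_friendly64(volatile uint64_t *to, const uint64_t *from,
				 size_t count)
{
	/* Same effect as ALIGN_DOWN(count, 8) in the macro. */
	const uint64_t *end = from + (count & ~(size_t)7);
	size_t blocks = 0;

	for (; from < end; from += 8, to += 8, blocks++)
		copy_block64(to, from, 8);
	if ((count % 8) >= 4) {
		copy_block64(to, from, 4);
		from += 4;
		to += 4;
		blocks++;
	}
	if ((count % 4) >= 2) {
		copy_block64(to, from, 2);
		from += 2;
		to += 2;
		blocks++;
	}
	if (count % 2) {
		copy_block64(to, from, 1);
		blocks++;
	}
	return blocks;
}
```

With count = 15 this issues four blocks (8 + 4 + 2 + 1) where a plain
loop would issue fifteen separate single-word stores.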

Most places I know of that use this are performance paths; the entire
iocopy infrastructure was introduced as an x86 performance
optimization.

Jason