* [PATCH 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead
@ 2018-09-27 23:23 Mark Hairgrove
2018-09-27 23:23 ` [PATCH 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Mark Hairgrove @ 2018-09-27 23:23 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab
When ATS is used in a process, all CPU TLB invalidates in that process
also trigger ATSD invalidates via mmu_notifiers. This additional overhead
is noticeable in applications which do heavy memory allocation or page
migration among nodes, particularly to and from GPUs.
This patch set reduces that overhead by rearranging how the ATSDs are
issued and by using size-based ATSD invalidates.
Mark Hairgrove (3):
powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
powerpc/powernv/npu: Use size-based ATSD invalidates
powerpc/powernv/npu: Remove atsd_threshold debugfs setting
arch/powerpc/platforms/powernv/npu-dma.c | 177 ++++++++++++++---------------
1 files changed, 85 insertions(+), 92 deletions(-)
--
1.7.2.5
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
2018-09-27 23:23 [PATCH 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead Mark Hairgrove
@ 2018-09-27 23:23 ` Mark Hairgrove
2018-10-02 7:23 ` Alistair Popple
2018-09-27 23:23 ` [PATCH 2/3] powerpc/powernv/npu: Use size-based " Mark Hairgrove
2018-09-27 23:23 ` [PATCH 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting Mark Hairgrove
2 siblings, 1 reply; 10+ messages in thread
From: Mark Hairgrove @ 2018-09-27 23:23 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab
There are two types of ATSDs issued to the NPU: invalidates targeting a
specific virtual address and invalidates targeting the whole address
space. In both cases prior to this change, the sequence was:
for each NPU
- Write the target address to the XTS_ATSD_AVA register
- EIEIO
- Write the launch value to issue the ATSD
First, a target address is not required when invalidating the whole
address space, so that write and the EIEIO have been removed. The AP
(size) field in the launch is not needed either.
Second, for per-address invalidates the above sequence is inefficient in
the common case of multiple NPUs because an EIEIO is issued per NPU. This
unnecessarily forces the launches of later ATSDs to be ordered with the
launches of earlier ones. The new sequence only issues a single EIEIO:
for each NPU
- Write the target address to the XTS_ATSD_AVA register
EIEIO
for each NPU
- Write the launch value to issue the ATSD
Performance results were gathered using a microbenchmark which creates a
1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
strides across the allocation.
With only a single NPU active (one GPU) the difference is in the noise for
both types of invalidates (+/-1%).
With two NPUs active (on a 6-GPU system) the effect is more noticeable:
mprotect rate (GB/s)
Stride Before After Speedup
64K 5.9 6.5 10%
1M 31.2 33.4 7%
2M 36.3 38.7 7%
4M 322.6 356.7 11%
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
arch/powerpc/platforms/powernv/npu-dma.c | 99 ++++++++++++++---------------
1 files changed, 48 insertions(+), 51 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 8006c54..c8f438a 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -454,79 +454,76 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
}
/* MMIO ATSD register offsets */
-#define XTS_ATSD_AVA 1
-#define XTS_ATSD_STAT 2
+#define XTS_ATSD_LAUNCH 0
+#define XTS_ATSD_AVA 1
+#define XTS_ATSD_STAT 2
-static void mmio_launch_invalidate(struct mmio_atsd_reg *mmio_atsd_reg,
- unsigned long launch, unsigned long va)
+static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long psize,
+ bool flush)
{
- struct npu *npu = mmio_atsd_reg->npu;
- int reg = mmio_atsd_reg->reg;
+ unsigned long launch = 0;
- __raw_writeq_be(va, npu->mmio_atsd_regs[reg] + XTS_ATSD_AVA);
- eieio();
- __raw_writeq_be(launch, npu->mmio_atsd_regs[reg]);
+ if (psize == MMU_PAGE_COUNT) {
+ /* IS set to invalidate entire matching PID */
+ launch |= PPC_BIT(12);
+ } else {
+ /* AP set to invalidate region of psize */
+ launch |= (u64)mmu_get_ap(psize) << PPC_BITLSHIFT(17);
+ }
+
+ /* PRS set to process-scoped */
+ launch |= PPC_BIT(13);
+
+ /* PID */
+ launch |= pid << PPC_BITLSHIFT(38);
+
+ /* No flush */
+ launch |= !flush << PPC_BITLSHIFT(39);
+
+ return launch;
}
-static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
- unsigned long pid, bool flush)
+static void mmio_atsd_regs_write(struct mmio_atsd_reg
+ mmio_atsd_reg[NV_MAX_NPUS], unsigned long offset,
+ unsigned long val)
{
- int i;
- unsigned long launch;
+ struct npu *npu;
+ int i, reg;
for (i = 0; i <= max_npu2_index; i++) {
- if (mmio_atsd_reg[i].reg < 0)
+ reg = mmio_atsd_reg[i].reg;
+ if (reg < 0)
continue;
- /* IS set to invalidate matching PID */
- launch = PPC_BIT(12);
-
- /* PRS set to process-scoped */
- launch |= PPC_BIT(13);
-
- /* AP */
- launch |= (u64)
- mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
-
- /* PID */
- launch |= pid << PPC_BITLSHIFT(38);
+ npu = mmio_atsd_reg[i].npu;
+ __raw_writeq_be(val, npu->mmio_atsd_regs[reg] + offset);
+ }
+}
- /* No flush */
- launch |= !flush << PPC_BITLSHIFT(39);
+static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
+ unsigned long pid, bool flush)
+{
+ unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT, flush);
- /* Invalidating the entire process doesn't use a va */
- mmio_launch_invalidate(&mmio_atsd_reg[i], launch, 0);
- }
+ /* Invalidating the entire process doesn't use a va */
+ mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
}
static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
unsigned long va, unsigned long pid, bool flush)
{
- int i;
unsigned long launch;
- for (i = 0; i <= max_npu2_index; i++) {
- if (mmio_atsd_reg[i].reg < 0)
- continue;
-
- /* IS set to invalidate target VA */
- launch = 0;
+ launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
- /* PRS set to process scoped */
- launch |= PPC_BIT(13);
+ /* Write all VAs first */
+ mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
- /* AP */
- launch |= (u64)
- mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
-
- /* PID */
- launch |= pid << PPC_BITLSHIFT(38);
-
- /* No flush */
- launch |= !flush << PPC_BITLSHIFT(39);
+ /* Issue one barrier for all address writes */
+ eieio();
- mmio_launch_invalidate(&mmio_atsd_reg[i], launch, va);
- }
+ /* Launch */
+ mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
}
#define mn_to_npu_context(x) container_of(x, struct npu_context, mn)
--
1.7.2.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates
2018-09-27 23:23 [PATCH 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead Mark Hairgrove
2018-09-27 23:23 ` [PATCH 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
@ 2018-09-27 23:23 ` Mark Hairgrove
2018-10-02 7:11 ` Alistair Popple
2018-09-27 23:23 ` [PATCH 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting Mark Hairgrove
2 siblings, 1 reply; 10+ messages in thread
From: Mark Hairgrove @ 2018-09-27 23:23 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold which defaulted to 2M. Invalidates that size or smaller
would issue per-page invalidates for the whole range.
The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU
page size, so reducing the number of invalidates by 32x in that case is a
clear improvement.
ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.
To show the benefit here are some performance numbers from a
microbenchmark which creates a 1G allocation then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.
One NPU (1 GPU):
mprotect rate (GB/s)
Stride Before After Speedup
64K 5.3 5.6 5%
1M 39.3 57.4 46%
2M 49.7 82.6 66%
4M 286.6 285.7 0%
Two NPUs (6 GPUs):
mprotect rate (GB/s)
Stride Before After Speedup
64K 6.5 7.4 13%
1M 33.4 67.9 103%
2M 38.7 93.1 141%
4M 356.7 354.6 -1%
Anything over 2M is roughly the same as before since both cases issue a
single ATSD.
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
arch/powerpc/platforms/powernv/npu-dma.c | 71 +++++++++++++++++-------------
1 files changed, 40 insertions(+), 31 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index c8f438a..e471a1a 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -509,15 +509,14 @@ static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
}
-static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
- unsigned long va, unsigned long pid, bool flush)
+static void mmio_invalidate_range(struct mmio_atsd_reg
+ mmio_atsd_reg[NV_MAX_NPUS], unsigned long pid,
+ unsigned long start, unsigned long psize, bool flush)
{
- unsigned long launch;
-
- launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
+ unsigned long launch = get_atsd_launch_val(pid, psize, flush);
/* Write all VAs first */
- mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
+ mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, start);
/* Issue one barrier for all address writes */
eieio();
@@ -608,15 +607,38 @@ static void release_atsd_reg(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS])
}
}
+#define PAGE_64K (64UL * 1024)
+#define PAGE_2M (2UL * 1024 * 1024)
+#define PAGE_1G (1UL * 1024 * 1024 * 1024)
+
/*
- * Invalidate either a single address or an entire PID depending on
- * the value of va.
+ * Invalidate a virtual address range
*/
-static void mmio_invalidate(struct npu_context *npu_context, int va,
- unsigned long address, bool flush)
+static void mmio_invalidate(struct npu_context *npu_context,
+ unsigned long start, unsigned long size, bool flush)
{
struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
unsigned long pid = npu_context->mm->context.id;
+ unsigned long atsd_start = 0;
+ unsigned long end = start + size - 1;
+ int atsd_psize = MMU_PAGE_COUNT;
+
+ /*
+ * Convert the input range into one of the supported sizes. If the range
+ * doesn't fit, use the next larger supported size. Invalidation latency
+ * is high, so over-invalidation is preferred to issuing multiple
+ * invalidates.
+ */
+ if (size == PAGE_64K) {
+ atsd_start = start;
+ atsd_psize = MMU_PAGE_64K;
+ } else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
+ atsd_start = ALIGN_DOWN(start, PAGE_2M);
+ atsd_psize = MMU_PAGE_2M;
+ } else if (ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G)) {
+ atsd_start = ALIGN_DOWN(start, PAGE_1G);
+ atsd_psize = MMU_PAGE_1G;
+ }
if (npu_context->nmmu_flush)
/*
@@ -631,10 +653,12 @@ static void mmio_invalidate(struct npu_context *npu_context, int va,
* an invalidate.
*/
acquire_atsd_reg(npu_context, mmio_atsd_reg);
- if (va)
- mmio_invalidate_va(mmio_atsd_reg, address, pid, flush);
- else
+
+ if (atsd_psize == MMU_PAGE_COUNT)
mmio_invalidate_pid(mmio_atsd_reg, pid, flush);
+ else
+ mmio_invalidate_range(mmio_atsd_reg, pid, atsd_start,
+ atsd_psize, flush);
mmio_invalidate_wait(mmio_atsd_reg);
if (flush) {
@@ -664,7 +688,7 @@ static void pnv_npu2_mn_release(struct mmu_notifier *mn,
* There should be no more translation requests for this PID, but we
* need to ensure any entries for it are removed from the TLB.
*/
- mmio_invalidate(npu_context, 0, 0, true);
+ mmio_invalidate(npu_context, 0, ~0UL, true);
}
static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
@@ -673,8 +697,7 @@ static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
pte_t pte)
{
struct npu_context *npu_context = mn_to_npu_context(mn);
-
- mmio_invalidate(npu_context, 1, address, true);
+ mmio_invalidate(npu_context, address, PAGE_SIZE, true);
}
static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
@@ -682,21 +705,7 @@ static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
unsigned long start, unsigned long end)
{
struct npu_context *npu_context = mn_to_npu_context(mn);
- unsigned long address;
-
- if (end - start > atsd_threshold) {
- /*
- * Just invalidate the entire PID if the address range is too
- * large.
- */
- mmio_invalidate(npu_context, 0, 0, true);
- } else {
- for (address = start; address < end; address += PAGE_SIZE)
- mmio_invalidate(npu_context, 1, address, false);
-
- /* Do the flush only on the final addess == end */
- mmio_invalidate(npu_context, 1, address, true);
- }
+ mmio_invalidate(npu_context, start, end - start, true);
}
static const struct mmu_notifier_ops nv_nmmu_notifier_ops = {
--
1.7.2.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting
2018-09-27 23:23 [PATCH 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead Mark Hairgrove
2018-09-27 23:23 ` [PATCH 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
2018-09-27 23:23 ` [PATCH 2/3] powerpc/powernv/npu: Use size-based " Mark Hairgrove
@ 2018-09-27 23:23 ` Mark Hairgrove
2018-10-02 7:12 ` Alistair Popple
2 siblings, 1 reply; 10+ messages in thread
From: Mark Hairgrove @ 2018-09-27 23:23 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab
This threshold is no longer used now that all invalidates issue a single
ATSD to each active NPU.
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
arch/powerpc/platforms/powernv/npu-dma.c | 13 -------------
1 files changed, 0 insertions(+), 13 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index e471a1a..426df1b 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -42,14 +42,6 @@
static DEFINE_SPINLOCK(npu_context_lock);
/*
- * When an address shootdown range exceeds this threshold we invalidate the
- * entire TLB on the GPU for the given PID rather than each specific address in
- * the range.
- */
-static uint64_t atsd_threshold = 2 * 1024 * 1024;
-static struct dentry *atsd_threshold_dentry;
-
-/*
* Other types of TCE cache invalidation are not functional in the
* hardware.
*/
@@ -968,11 +960,6 @@ int pnv_npu2_init(struct pnv_phb *phb)
static int npu_index;
uint64_t rc = 0;
- if (!atsd_threshold_dentry) {
- atsd_threshold_dentry = debugfs_create_x64("atsd_threshold",
- 0600, powerpc_debugfs_root, &atsd_threshold);
- }
-
phb->npu.nmmu_flush =
of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
for_each_child_of_node(phb->hose->dn, dn) {
--
1.7.2.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates
2018-09-27 23:23 ` [PATCH 2/3] powerpc/powernv/npu: Use size-based " Mark Hairgrove
@ 2018-10-02 7:11 ` Alistair Popple
2018-10-03 1:10 ` Mark Hairgrove
0 siblings, 1 reply; 10+ messages in thread
From: Alistair Popple @ 2018-10-02 7:11 UTC (permalink / raw)
To: Mark Hairgrove; +Cc: linuxppc-dev, Reza Arbab
Thanks Mark,
Looks like some worthwhile improvments to be had. I've added a couple of
comments inline below.
> +#define PAGE_64K (64UL * 1024) +#define PAGE_2M (2UL * 1024 * 1024) +#define
> PAGE_1G (1UL * 1024 * 1024 * 1024)
include/linux/sizes.h includes definitions for SZ_64K, SZ_2M, SZ_1G, etc. so
unless they're redefined here for some reason I personally think it's cleaner to
use those.
> /*
> - * Invalidate either a single address or an entire PID depending on
> - * the value of va.
> + * Invalidate a virtual address range
> */
> -static void mmio_invalidate(struct npu_context *npu_context, int va,
> - unsigned long address, bool flush)
> +static void mmio_invalidate(struct npu_context *npu_context,
> + unsigned long start, unsigned long size, bool flush)
With this optimisation every caller of mmio_invalidate() sets flush == true so
it no longer appears to be used. We should drop it as a parameter unless you
think there might be some reason to use it in future?
Therefore we could also drop it as a parameter to get_atsd_launch_val(),
mmio_invalidate_pid() and mmio_invalidate_range() as well as I couldn't find any
callers of those that set it to anything other than true.
> struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
> unsigned long pid = npu_context->mm->context.id;
> + unsigned long atsd_start = 0;
> + unsigned long end = start + size - 1;
> + int atsd_psize = MMU_PAGE_COUNT;
> +
> + /*
> + * Convert the input range into one of the supported sizes. If the range
> + * doesn't fit, use the next larger supported size. Invalidation latency
> + * is high, so over-invalidation is preferred to issuing multiple
> + * invalidates.
> + */
> + if (size == PAGE_64K) {
We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
would invalidate the entire GPU TLB for a the given PID on those systems. Could
we change the above check to `if (size <= PAGE_64K)` to avoid this?
> + atsd_start = start;
Which would also require:
atsd_start = ALIGN_DOWN(start, PAGE_64K);
> + atsd_psize = MMU_PAGE_64K;
> + } else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
boundary? For example invalidating a 128K (ie. 2x64K pages) range with start ==
0x1f0000 and end == 0x210000 would result in an invalidation of the range 0x0 -
0x200000 incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.
> + atsd_start = ALIGN_DOWN(start, PAGE_2M);
> + atsd_psize = MMU_PAGE_2M;
> + } else if (ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G)) {
Ditto.
> + atsd_start = ALIGN_DOWN(start, PAGE_1G);
> + atsd_psize = MMU_PAGE_1G;
> + }
>
- Alistair
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting
2018-09-27 23:23 ` [PATCH 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting Mark Hairgrove
@ 2018-10-02 7:12 ` Alistair Popple
0 siblings, 0 replies; 10+ messages in thread
From: Alistair Popple @ 2018-10-02 7:12 UTC (permalink / raw)
To: Mark Hairgrove; +Cc: linuxppc-dev, Reza Arbab
Looks good to me, thanks!
Reviewed-by: Alistair Popple <alistair@popple.id.au>
On Thursday, 27 September 2018 4:23:11 PM AEST Mark Hairgrove wrote:
> This threshold is no longer used now that all invalidates issue a single
> ATSD to each active NPU.
>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> ---
> arch/powerpc/platforms/powernv/npu-dma.c | 13 -------------
> 1 files changed, 0 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index e471a1a..426df1b 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -42,14 +42,6 @@
> static DEFINE_SPINLOCK(npu_context_lock);
>
> /*
> - * When an address shootdown range exceeds this threshold we invalidate the
> - * entire TLB on the GPU for the given PID rather than each specific address in
> - * the range.
> - */
> -static uint64_t atsd_threshold = 2 * 1024 * 1024;
> -static struct dentry *atsd_threshold_dentry;
> -
> -/*
> * Other types of TCE cache invalidation are not functional in the
> * hardware.
> */
> @@ -968,11 +960,6 @@ int pnv_npu2_init(struct pnv_phb *phb)
> static int npu_index;
> uint64_t rc = 0;
>
> - if (!atsd_threshold_dentry) {
> - atsd_threshold_dentry = debugfs_create_x64("atsd_threshold",
> - 0600, powerpc_debugfs_root, &atsd_threshold);
> - }
> -
> phb->npu.nmmu_flush =
> of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
> for_each_child_of_node(phb->hose->dn, dn) {
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
2018-09-27 23:23 ` [PATCH 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
@ 2018-10-02 7:23 ` Alistair Popple
0 siblings, 0 replies; 10+ messages in thread
From: Alistair Popple @ 2018-10-02 7:23 UTC (permalink / raw)
To: Mark Hairgrove; +Cc: linuxppc-dev, Reza Arbab
Thanks, this looks good to me.
Reviewed-by: Alistair Popple <alistair@popple.id.au>
On Thursday, 27 September 2018 4:23:09 PM AEST Mark Hairgrove wrote:
> There are two types of ATSDs issued to the NPU: invalidates targeting a
> specific virtual address and invalidates targeting the whole address
> space. In both cases prior to this change, the sequence was:
>
> for each NPU
> - Write the target address to the XTS_ATSD_AVA register
> - EIEIO
> - Write the launch value to issue the ATSD
>
> First, a target address is not required when invalidating the whole
> address space, so that write and the EIEIO have been removed. The AP
> (size) field in the launch is not needed either.
>
> Second, for per-address invalidates the above sequence is inefficient in
> the common case of multiple NPUs because an EIEIO is issued per NPU. This
> unnecessarily forces the launches of later ATSDs to be ordered with the
> launches of earlier ones. The new sequence only issues a single EIEIO:
>
> for each NPU
> - Write the target address to the XTS_ATSD_AVA register
> EIEIO
> for each NPU
> - Write the launch value to issue the ATSD
>
> Performance results were gathered using a microbenchmark which creates a
> 1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
> strides across the allocation.
>
> With only a single NPU active (one GPU) the difference is in the noise for
> both types of invalidates (+/-1%).
>
> With two NPUs active (on a 6-GPU system) the effect is more noticeable:
>
> mprotect rate (GB/s)
> Stride Before After Speedup
> 64K 5.9 6.5 10%
> 1M 31.2 33.4 7%
> 2M 36.3 38.7 7%
> 4M 322.6 356.7 11%
>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> ---
> arch/powerpc/platforms/powernv/npu-dma.c | 99 ++++++++++++++---------------
> 1 files changed, 48 insertions(+), 51 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 8006c54..c8f438a 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -454,79 +454,76 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
> }
>
> /* MMIO ATSD register offsets */
> -#define XTS_ATSD_AVA 1
> -#define XTS_ATSD_STAT 2
> +#define XTS_ATSD_LAUNCH 0
> +#define XTS_ATSD_AVA 1
> +#define XTS_ATSD_STAT 2
>
> -static void mmio_launch_invalidate(struct mmio_atsd_reg *mmio_atsd_reg,
> - unsigned long launch, unsigned long va)
> +static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long psize,
> + bool flush)
> {
> - struct npu *npu = mmio_atsd_reg->npu;
> - int reg = mmio_atsd_reg->reg;
> + unsigned long launch = 0;
>
> - __raw_writeq_be(va, npu->mmio_atsd_regs[reg] + XTS_ATSD_AVA);
> - eieio();
> - __raw_writeq_be(launch, npu->mmio_atsd_regs[reg]);
> + if (psize == MMU_PAGE_COUNT) {
> + /* IS set to invalidate entire matching PID */
> + launch |= PPC_BIT(12);
> + } else {
> + /* AP set to invalidate region of psize */
> + launch |= (u64)mmu_get_ap(psize) << PPC_BITLSHIFT(17);
> + }
> +
> + /* PRS set to process-scoped */
> + launch |= PPC_BIT(13);
> +
> + /* PID */
> + launch |= pid << PPC_BITLSHIFT(38);
> +
> + /* No flush */
> + launch |= !flush << PPC_BITLSHIFT(39);
> +
> + return launch;
> }
>
> -static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
> - unsigned long pid, bool flush)
> +static void mmio_atsd_regs_write(struct mmio_atsd_reg
> + mmio_atsd_reg[NV_MAX_NPUS], unsigned long offset,
> + unsigned long val)
> {
> - int i;
> - unsigned long launch;
> + struct npu *npu;
> + int i, reg;
>
> for (i = 0; i <= max_npu2_index; i++) {
> - if (mmio_atsd_reg[i].reg < 0)
> + reg = mmio_atsd_reg[i].reg;
> + if (reg < 0)
> continue;
>
> - /* IS set to invalidate matching PID */
> - launch = PPC_BIT(12);
> -
> - /* PRS set to process-scoped */
> - launch |= PPC_BIT(13);
> -
> - /* AP */
> - launch |= (u64)
> - mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
> -
> - /* PID */
> - launch |= pid << PPC_BITLSHIFT(38);
> + npu = mmio_atsd_reg[i].npu;
> + __raw_writeq_be(val, npu->mmio_atsd_regs[reg] + offset);
> + }
> +}
>
> - /* No flush */
> - launch |= !flush << PPC_BITLSHIFT(39);
> +static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
> + unsigned long pid, bool flush)
> +{
> + unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT, flush);
>
> - /* Invalidating the entire process doesn't use a va */
> - mmio_launch_invalidate(&mmio_atsd_reg[i], launch, 0);
> - }
> + /* Invalidating the entire process doesn't use a va */
> + mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
> }
>
> static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
> unsigned long va, unsigned long pid, bool flush)
> {
> - int i;
> unsigned long launch;
>
> - for (i = 0; i <= max_npu2_index; i++) {
> - if (mmio_atsd_reg[i].reg < 0)
> - continue;
> -
> - /* IS set to invalidate target VA */
> - launch = 0;
> + launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
>
> - /* PRS set to process scoped */
> - launch |= PPC_BIT(13);
> + /* Write all VAs first */
> + mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
>
> - /* AP */
> - launch |= (u64)
> - mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
> -
> - /* PID */
> - launch |= pid << PPC_BITLSHIFT(38);
> -
> - /* No flush */
> - launch |= !flush << PPC_BITLSHIFT(39);
> + /* Issue one barrier for all address writes */
> + eieio();
>
> - mmio_launch_invalidate(&mmio_atsd_reg[i], launch, va);
> - }
> + /* Launch */
> + mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
> }
>
> #define mn_to_npu_context(x) container_of(x, struct npu_context, mn)
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates
2018-10-02 7:11 ` Alistair Popple
@ 2018-10-03 1:10 ` Mark Hairgrove
2018-10-03 2:27 ` Alistair Popple
0 siblings, 1 reply; 10+ messages in thread
From: Mark Hairgrove @ 2018-10-03 1:10 UTC (permalink / raw)
To: Alistair Popple; +Cc: linuxppc-dev, Reza Arbab
Thanks for the review. Comments below.
On Tue, 2 Oct 2018, Alistair Popple wrote:
> Thanks Mark,
>
> Looks like some worthwhile improvments to be had. I've added a couple of
> comments inline below.
>
> > +#define PAGE_64K (64UL * 1024) +#define PAGE_2M (2UL * 1024 * 1024) +#define
> > PAGE_1G (1UL * 1024 * 1024 * 1024)
>
> include/linux/sizes.h includes definitions for SZ_64K, SZ_2M, SZ_1G, etc. so
> unless they're redefined here for some reason I personally think it's cleaner to
> use those.
Agreed, will fix. Thanks for the pointer.
>
> > /*
> > - * Invalidate either a single address or an entire PID depending on
> > - * the value of va.
> > + * Invalidate a virtual address range
> > */
> > -static void mmio_invalidate(struct npu_context *npu_context, int va,
> > - unsigned long address, bool flush)
> > +static void mmio_invalidate(struct npu_context *npu_context,
> > + unsigned long start, unsigned long size, bool flush)
>
> With this optimisation every caller of mmio_invalidate() sets flush == true so
> it no longer appears to be used. We should drop it as a parameter unless you
> think there might be some reason to use it in future?
>
> Therefore we could also drop it as a parameter to get_atsd_launch_val(),
> mmio_invalidate_pid() and mmio_invalidate_range() as well as I couldn't find any
> callers of those that set it to anything other than true.
Yeah, good catch. I'll simplify all of those.
>
> > struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
> > unsigned long pid = npu_context->mm->context.id;
> > + unsigned long atsd_start = 0;
> > + unsigned long end = start + size - 1;
> > + int atsd_psize = MMU_PAGE_COUNT;
> > +
> > + /*
> > + * Convert the input range into one of the supported sizes. If the range
> > + * doesn't fit, use the next larger supported size. Invalidation latency
> > + * is high, so over-invalidation is preferred to issuing multiple
> > + * invalidates.
> > + */
> > + if (size == PAGE_64K) {
>
> We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
> would invalidate the entire GPU TLB for a the given PID on those systems. Could
> we change the above check to `if (size <= PAGE_64K)` to avoid this?
PPC supports 4K pages but the GPU ATS implementation does not. For that
reason I didn't bother handling invalidates smaller than 64K. I'll add a
comment on that.
I don't know that this requirement is enforced anywhere though. I could
add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
would be useful.
>
> > + atsd_start = start;
>
> Which would also require:
>
> atsd_start = ALIGN_DOWN(start, PAGE_64K);
>
> > + atsd_psize = MMU_PAGE_64K;
> > + } else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
>
> Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
> boundary? For example invalidating a 128K (ie. 2x64K pages) range with start ==
> 0x1f0000 and end == 0x210000 would result in an invalidation of the range 0x0 -
> 0x200000 incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.
In this example:
start 0x1f0000
size 0x020000
end (start + size - 1) 0x20ffff
ALIGN_DOWN(start, PAGE_2M) 0x000000
ALIGN_DOWN(end, PAGE_2M) 0x200000
Since ALIGN_DOWN(start, PAGE_2M) != ALIGN_DOWN(end, PAGE_2M), the
condition fails and we move to the 1G clause. Then
ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G) == 0, so we
invalidate the range [0, 1G).
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates
2018-10-03 1:10 ` Mark Hairgrove
@ 2018-10-03 2:27 ` Alistair Popple
2018-10-03 18:33 ` Mark Hairgrove
0 siblings, 1 reply; 10+ messages in thread
From: Alistair Popple @ 2018-10-03 2:27 UTC (permalink / raw)
To: Mark Hairgrove; +Cc: linuxppc-dev, Reza Arbab
> >
> > We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
> > would invalidate the entire GPU TLB for a the given PID on those systems. Could
> > we change the above check to `if (size <= PAGE_64K)` to avoid this?
>
> PPC supports 4K pages but the GPU ATS implementation does not. For that
> reason I didn't bother handling invalidates smaller than 64K. I'll add a
> comment on that.
Interesting, I was not aware of that limitation. Do you know if it is a
SW/driver limitation or a HW limitation?
> I don't know that this requirement is enforced anywhere though. I could
> add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
> would be useful.
Given it's a static kernel build parameter perhaps it makes more sense to do the
check as part of the driver build in a conftest rather than a runtime failure?
> >
> > > + atsd_start = start;
> >
> > Which would also require:
> >
> > atsd_start = ALIGN_DOWN(start, PAGE_64K);
> >
> > > + atsd_psize = MMU_PAGE_64K;
> > > + } else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
> >
> > Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
> > boundary? For example invalidating a 128K (ie. 2x64K pages) range with start ==
> > 0x1f0000 and end == 0x210000 would result in an invalidation of the range 0x0 -
> > 0x200000 incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.
>
> In this example:
> start 0x1f0000
> size 0x020000
> end (start + size - 1) 0x20ffff
> ALIGN_DOWN(start, PAGE_2M) 0x000000
> ALIGN_DOWN(end, PAGE_2M) 0x200000
>
> Since ALIGN_DOWN(start, PAGE_2M) != ALIGN_DOWN(end, PAGE_2M), the
> condition fails and we move to the 1G clause. Then
> ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G) == 0, so we
> invalidate the range [0, 1G).
Oh yeah, sorry that makes sense and looks good to me.
- Alistair
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates
2018-10-03 2:27 ` Alistair Popple
@ 2018-10-03 18:33 ` Mark Hairgrove
0 siblings, 0 replies; 10+ messages in thread
From: Mark Hairgrove @ 2018-10-03 18:33 UTC (permalink / raw)
To: Alistair Popple; +Cc: linuxppc-dev, Reza Arbab
On Wed, 3 Oct 2018, Alistair Popple wrote:
> > >
> > > We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
> > > would invalidate the entire GPU TLB for a the given PID on those systems. Could
> > > we change the above check to `if (size <= PAGE_64K)` to avoid this?
> >
> > PPC supports 4K pages but the GPU ATS implementation does not. For that
> > reason I didn't bother handling invalidates smaller than 64K. I'll add a
> > comment on that.
>
> Interesting, I was not aware of that limitation. Do you know if it is a
> SW/driver limitation or a HW limitation?
HW.
>
> > I don't know that this requirement is enforced anywhere though. I could
> > add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
> > would be useful.
>
> Given it's a static kernel build parameter perhaps it makes more sense to do the
> check as part of the driver build in a conftest rather than a runtime failure?
Sure, we can do it in the driver. I'll just stick with a comment for the
kernel.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2018-10-03 18:36 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-09-27 23:23 [PATCH 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead Mark Hairgrove
2018-09-27 23:23 ` [PATCH 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
2018-10-02 7:23 ` Alistair Popple
2018-09-27 23:23 ` [PATCH 2/3] powerpc/powernv/npu: Use size-based " Mark Hairgrove
2018-10-02 7:11 ` Alistair Popple
2018-10-03 1:10 ` Mark Hairgrove
2018-10-03 2:27 ` Alistair Popple
2018-10-03 18:33 ` Mark Hairgrove
2018-09-27 23:23 ` [PATCH 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting Mark Hairgrove
2018-10-02 7:12 ` Alistair Popple
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.