Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

From: Catalin Marinas <catalin.marinas@arm.com>
To: Yicong Yang <yangyicong@huawei.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org, x86@kernel.org,
	will@kernel.org, anshuman.khandual@arm.com,
	linux-doc@vger.kernel.org, corbet@lwn.net, peterz@infradead.org,
	arnd@arndb.de, punit.agrawal@bytedance.com,
	linux-kernel@vger.kernel.org, darren@os.amperecomputing.com,
	yangyicong@hisilicon.com, huzhanyuan@oppo.com,
	lipeifeng@oppo.com, zhangshiming@oppo.com, guojian@oppo.com,
	realmz6@gmail.com, linux-mips@vger.kernel.org,
	openrisc@lists.librecores.org, linuxppc-dev@lists.ozlabs.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	Barry Song <21cnbao@gmail.com>,
	wangkefeng.wang@huawei.com, xhao@linux.alibaba.com,
	prime.zeng@hisilicon.com, Barry Song <v-songbaohua@oppo.com>,
	Nadav Amit <namit@vmware.com>, Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
Date: Thu, 5 Jan 2023 18:14:58 +0000
Message-ID: <Y7cToj5mWd1ZbMyQ@arm.com>
In-Reply-To: <20221117082648.47526-3-yangyicong@huawei.com>

On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> It has been tested on 4-, 8- and 128-CPU platforms and is beneficial on
> large systems, but may bring no improvement on small systems such as a
> 4-CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on
> CONFIG_EXPERT at this stage and disable it on systems with fewer than
> 8 CPUs. Users can adjust this threshold for their own platforms via
> CONFIG_NR_CPUS_FOR_BATCHED_TLB.

What's the overhead of such batching on systems with 4 or fewer CPUs? If
it isn't noticeable, I'd rather have it always on than some number
chosen on whichever SoC you tested.

Another option would be to make this a sysctl tunable.

>  .../features/vm/TLB/arch-support.txt          |  2 +-
>  arch/arm64/Kconfig                            |  6 +++
>  arch/arm64/include/asm/tlbbatch.h             | 12 +++++
>  arch/arm64/include/asm/tlbflush.h             | 52 ++++++++++++++++++-
>  arch/x86/include/asm/tlbflush.h               |  5 +-
>  include/linux/mm_types_task.h                 |  4 +-
>  mm/rmap.c                                     | 10 ++--

Please keep any function prototype changes in a preparatory patch so
that the arm64 one only introduces the arch specific changes. Easier to
review.

> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	/*
> +	 * Batched TLB flushing has proved beneficial on systems with a large
> +	 * number of CPUs, especially those with more than 8. TLB shootdown
> +	 * is cheap on small systems, which may not need this feature. So use
> +	 * a threshold for enabling it, to avoid potential side effects on
> +	 * these platforms.
> +	 */
> +	if (num_online_cpus() < CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
> +		return false;

The x86 implementation tracks the cpumask of where a task has run. We
don't have such tracking on arm64 and I don't think it matters. As
noticed/described in this series, the bottleneck is the actual DSB
synchronisation (which sends a DVM Sync message to all the other CPUs
and waits for a DVM Complete response). So I think it makes sense not to
bother with an mm_cpumask(). What this patch aims to optimise is
actually the number of DSBs issued on an SMP system by
ptep_clear_flush().
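
To put rough numbers on the DSB argument, here is a minimal userspace
sketch. It is not the kernel code: it assumes a toy cost model in which
every unmapped page needs one TLBI and the only expensive step is the
DSB. The function names and the batch_size parameter are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: count how many DSB-like barriers are
 * needed to unmap nr_pages pages. All names here are hypothetical.
 */

/* Unbatched: every page issues a TLBI followed by its own DSB. */
static size_t count_dsbs_unbatched(size_t nr_pages)
{
	return nr_pages;
}

/*
 * Batched: a TLBI is still issued per page, but one DSB is deferred
 * until batch_size pages have accumulated (or the pages run out).
 */
static size_t count_dsbs_batched(size_t nr_pages, size_t batch_size)
{
	assert(batch_size > 0);
	/* Round up: a partial final batch still needs one DSB. */
	return (nr_pages + batch_size - 1) / batch_size;
}
```

Under this model the number of TLBIs is unchanged; only the number of
barriers that have to wait for the other CPUs goes down.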

The DVM is not an architected concept (well, it's part of AMBA AXI). I'd
be curious to know how such a patch behaves on Apple's M1/M2 hardware. My
preference would be to have this always on for num_online_cpus() > 1 if
there's no overhead.
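
The difference between the two policies can be sketched in userspace as
below. This is only a model under assumptions: the function names are
hypothetical, online_cpus stands in for num_online_cpus(), threshold
stands in for the Kconfig value, and it assumes the rest of
arch_tlbbatch_should_defer() (not quoted above) would return true.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the two policies, not kernel code. */

/* As posted: defer only when the online CPU count reaches a threshold. */
static bool should_defer_threshold(unsigned int online_cpus,
				   unsigned int threshold)
{
	return online_cpus >= threshold;
}

/* Alternative: defer whenever more than one CPU is online. */
static bool should_defer_smp(unsigned int online_cpus)
{
	return online_cpus > 1;
}
```

With a threshold of 8, a 4-CPU system gets no batching under the first
policy but does under the second, which is exactly the case where the
overhead question needs an answer.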

-- 
Catalin

Thread overview:
2022-11-17  8:26 [PATCH v7 0/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
2022-11-17  8:26 ` [PATCH v7 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Yicong Yang
2022-11-29 23:23   ` Andrew Morton
2022-11-30  2:23     ` Yicong Yang
2022-11-30  2:57       ` Anshuman Khandual
2022-11-17  8:26 ` [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
2022-11-23 14:07   ` Anshuman Khandual
2023-01-05 18:14   ` Catalin Marinas [this message]
2023-01-08 10:48     ` Barry Song
2023-01-09 17:19       ` Catalin Marinas
2023-01-09 21:28         ` Barry Song
2022-11-29 11:09 ` [PATCH v7 0/2] " Yicong Yang
