Re: [External] [PATCH v5 0/2] arm64: support batched/deferred tlb shootdown during page reclamation

From: Punit Agrawal <punit.agrawal@bytedance.com>
To: Yicong Yang <yangyicong@huawei.com>
Cc: <akpm@linux-foundation.org>, <linux-mm@kvack.org>,
	<linux-arm-kernel@lists.infradead.org>, <x86@kernel.org>,
	<catalin.marinas@arm.com>, <will@kernel.org>,
	<anshuman.khandual@arm.com>, <linux-doc@vger.kernel.org>,
	<corbet@lwn.net>, <peterz@infradead.org>, <arnd@arndb.de>,
	<punit.agrawal@bytedance.com>, <linux-kernel@vger.kernel.org>,
	<darren@os.amperecomputing.com>, <yangyicong@hisilicon.com>,
	<huzhanyuan@oppo.com>, <lipeifeng@oppo.com>,
	<zhangshiming@oppo.com>, <guojian@oppo.com>, <realmz6@gmail.com>,
	<linux-mips@vger.kernel.org>, <openrisc@lists.librecores.org>,
	<linuxppc-dev@lists.ozlabs.org>,
	<linux-riscv@lists.infradead.org>, <linux-s390@vger.kernel.org>,
	Barry Song <21cnbao@gmail.com>, <wangkefeng.wang@huawei.com>,
	<xhao@linux.alibaba.com>, <prime.zeng@hisilicon.com>
Subject: Re: [External] [PATCH v5 0/2] arm64: support batched/deferred tlb shootdown during page reclamation
Date: Fri, 11 Nov 2022 10:17:09 +0000
Message-ID: <87pmdtztga.fsf@stealth>
In-Reply-To: <20221028081255.19157-1-yangyicong@huawei.com> (Yicong Yang's message of "Fri, 28 Oct 2022 16:12:53 +0800")

Yicong Yang <yangyicong@huawei.com> writes:

> From: Yicong Yang <yangyicong@hisilicon.com>
>
> Though ARM64 has the hardware to do tlb shootdown, the hardware
> broadcasting is not free.
> A simplest micro benchmark shows even on snapdragon 888 with only
> 8 cores, the overhead for ptep_clear_flush is huge even for paging
> out one page mapped by only one process:
> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>
> While pages are mapped by multiple processes or HW has more CPUs,
> the cost should become even higher due to the bad scalability of
> tlb shootdown.
>
> The same benchmark can result in 16.99% CPU consumption on ARM64
> server with around 100 cores according to Yicong's test on patch
> 4/4.
>
> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
> 1. only send tlbi instructions in the first stage -
> 	arch_tlbbatch_add_mm()
> 2. wait for the completion of tlbi by dsb while doing tlbbatch
> 	sync in arch_tlbbatch_flush()
> Testing on snapdragon shows the overhead of ptep_clear_flush
> is removed by the patchset. The micro benchmark becomes 5% faster
> even for one page mapped by single process on snapdragon 888.
>
> With this support we're possible to do more optimization for memory
> reclamation and migration[*].
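
To make the two stages in the cover letter concrete, here is a toy
user-space model of the idea. It is only a sketch that counts
operations: it is not kernel code and not taken from the patches. The
names flush_stats, model_tlbi(), model_dsb(), reclaim_sync_each() and
reclaim_batched() are hypothetical, and the comments only indicate which
role arch_tlbbatch_add_mm() and arch_tlbbatch_flush() play in the
description above.

```c
#include <assert.h>

/* Toy model, NOT kernel code: it only counts operations. */
struct flush_stats {
	int tlbi_issued;	/* broadcast invalidations sent */
	int dsb_waits;		/* times the CPU waited for completion */
};

/* Hypothetical stand-in for issuing one tlbi without waiting for it. */
static void model_tlbi(struct flush_stats *s)
{
	s->tlbi_issued++;
}

/* Hypothetical stand-in for a dsb that waits for outstanding tlbi. */
static void model_dsb(struct flush_stats *s)
{
	s->dsb_waits++;
}

/* Unbatched path: every unmapped page issues a tlbi and then waits. */
static struct flush_stats reclaim_sync_each(int npages)
{
	struct flush_stats s = { 0, 0 };

	assert(npages >= 0);
	for (int i = 0; i < npages; i++) {
		model_tlbi(&s);
		model_dsb(&s);
	}
	return s;
}

/* Batched path: stage 1 only issues tlbi, stage 2 waits once. */
static struct flush_stats reclaim_batched(int npages)
{
	struct flush_stats s = { 0, 0 };

	assert(npages >= 0);
	for (int i = 0; i < npages; i++)
		model_tlbi(&s);		/* role of arch_tlbbatch_add_mm() */
	if (npages > 0)
		model_dsb(&s);		/* role of arch_tlbbatch_flush() */
	return s;
}
```

In this model both paths issue the same number of invalidations for 32
pages, but the unbatched path waits 32 times while the batched path
waits once. The model says nothing about how expensive each wait is on
real hardware.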

I applied the patches on v6.1-rc4 and saw the drop in
ptep_clear_flush() in the perf report when running the test program from
Patch 2. The tests were done on an rk3399-based system, with benefits
visible when running the tests on either of the clusters.

So, for the series,

Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>

Thanks,
Punit

[...]

Thread overview:
2022-10-28  8:12 [PATCH v5 0/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
2022-10-28  8:12 ` [PATCH v5 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Yicong Yang
2022-10-28  8:12 ` [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
2022-11-14  3:29   ` Anshuman Khandual
2022-11-14  8:46     ` Yicong Yang
2022-11-14 14:19       ` Anshuman Khandual
2022-11-15  3:34         ` Yicong Yang
2022-11-14  8:00   ` haoxin
2022-11-11 10:17 ` Punit Agrawal [this message]
