From: Byungchul Park <byungchul@sk.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kernel_team@skhynix.com, akpm@linux-foundation.org,
	namit@vmware.com, xhao@linux.alibaba.com,
	mgorman@techsingularity.net, hughd@google.com,
	willy@infradead.org, david@redhat.com, peterz@infradead.org,
	luto@kernel.org, dave.hansen@linux.intel.com
Subject: Re: [RFC 2/2] mm: Defer TLB flush by keeping both src and dst folios at migration
Date: Wed, 16 Aug 2023 09:13:07 +0900
Message-ID: <20230816001307.GA44941@system.software.com>
In-Reply-To: <877cpx9jsx.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Tue, Aug 15, 2023 at 09:27:26AM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@sk.com> writes:
>
> > Implementation of CONFIG_MIGRC, which stands for 'Migration Read Copy'.
> >
> > We always face migration overhead at either promotion or demotion
> > while working with tiered memory, e.g. CXL memory, and found that TLB
> > shootdown is a large part of it that is worth eliminating if possible.
> >
> > Fortunately, a TLB flush can be deferred or even skipped if both the
> > source and destination folios of a migration are kept until all
> > required TLB flushes have been done. This works only if the target
> > PTE entries are read-only or, more precisely, do not have write
> > permission. Otherwise the folio might get messed up.
> >
> > To achieve that:
> >
> > 1. For folios that have only non-writable TLB entries, avoid the
> > TLB flush by keeping both the source and destination folios during
> > migration. The flush will be handled later at a better time.
> >
> > 2. When any non-writable TLB entry changes to writable, e.g. through
> > the fault handler, give up the CONFIG_MIGRC mechanism and perform
> > the required TLB flush right away.
> >
> > 3. TLB flushes can be skipped if all TLB flushes required to free the
> > duplicated folios have already been done for any reason; they do not
> > have to come from migrations.
> >
> > 4. Adjust the watermark check routine, __zone_watermark_ok(), by the
> > number of duplicated folios, because those folios can be freed
> > and obtained right away through appropriate TLB flushes.
> >
> > 5. Perform the TLB flushes and free the duplicated folios pending the
> > flushes if the page allocation routine is in trouble due to memory
> > pressure, and do so even more aggressively for high-order allocations.
>
> Is the optimization restricted for page migration only? Can it be used
> for other places? Like page reclaiming?

Just to make sure, are you talking about item (5) in the description? For
now, it's performed at the beginning of __alloc_pages_slowpath(), that is,
before page reclaiming. Do you think it'd be meaningful to perform it during
page reclaiming? Or do you mean something else?

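For illustration only, items (4) and (5) can be modelled in userspace
roughly as below. This is a toy sketch under simplifying assumptions, not
code from the patch: struct toy_zone and the toy_*() helpers are made-up
names, the watermark is a single number, and the TLB flush itself is only
a comment.

```c
#include <stdbool.h>

/* Toy userspace model; all names here are hypothetical. */
struct toy_zone {
	long free_pages;	/* pages currently on the free lists */
	long migrc_pending;	/* source folios kept until their TLB flush */
	long watermark;		/* free pages the allocation must leave */
};

/* Item (4): count the pending folios as obtainable in the watermark check. */
static bool toy_watermark_ok(const struct toy_zone *z)
{
	return z->free_pages + z->migrc_pending > z->watermark;
}

/* Item (5): after the deferred flush, the pending folios become free. */
static long toy_flush_and_free(struct toy_zone *z)
{
	long freed = z->migrc_pending;

	/* The real code would perform the required TLB flushes here. */
	z->free_pages += freed;
	z->migrc_pending = 0;
	return freed;
}

/* Models the start of the slow path: flush first, then recheck. */
static bool toy_alloc_slowpath(struct toy_zone *z)
{
	toy_flush_and_free(z);
	return z->free_pages > z->watermark;
}
```

With 10 free pages, 50 pending folios and a watermark of 32, the adjusted
check passes, and the slow path turns the pending folios into real free
pages before anything has to be reclaimed.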
> > The measurement result:
> >
> > Architecture - x86_64
> > QEMU - kvm enabled, host cpu, 2nodes((4cpus, 2GB)+(cpuless, 6GB))
> > Linux Kernel - v6.4, numa balancing tiering on, demotion enabled
> > Benchmark - XSBench with no parameter changed
> >
> > run 'perf stat' using events:
> > (FYI, process wide result ~= system wide result(-a option))
> > 1) itlb.itlb_flush
> > 2) tlb_flush.dtlb_thread
> > 3) tlb_flush.stlb_any
> >
> > run 'cat /proc/vmstat' and pick up:
> > 1) pgdemote_kswapd
> > 2) numa_pages_migrated
> > 3) pgmigrate_success
> > 4) nr_tlb_remote_flush
> > 5) nr_tlb_remote_flush_received
> > 6) nr_tlb_local_flush_all
> > 7) nr_tlb_local_flush_one
> >
> > BEFORE - mainline v6.4
> > ==========================================
> >
> > $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
> >
> > Performance counter stats for './XSBench':
> >
> >            426856      itlb.itlb_flush
> >           6900414      tlb_flush.dtlb_thread
> >           7303137      tlb_flush.stlb_any
> >
> >      33.500486566 seconds time elapsed
> >      92.852128000 seconds user
> >      10.526718000 seconds sys
> >
> > $ cat /proc/vmstat
> >
> > ...
> > pgdemote_kswapd 1052596
> > numa_pages_migrated 1052359
> > pgmigrate_success 2161846
> > nr_tlb_remote_flush 72370
> > nr_tlb_remote_flush_received 213711
> > nr_tlb_local_flush_all 3385
> > nr_tlb_local_flush_one 198679
> > ...
> >
> > AFTER - mainline v6.4 + CONFIG_MIGRC
> > ==========================================
> >
> > $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
> >
> > Performance counter stats for './XSBench':
> >
> >            179537      itlb.itlb_flush
> >           6131135      tlb_flush.dtlb_thread
> >           6920979      tlb_flush.stlb_any
>
> It appears that the number of "itlb.itlb_flush" changes much, but not
> for other 2 events. Because the text segment of the executable file is
> mapped as read-only? And most other pages are mapped read-write?

Yes, for this benchmark, XSBench. I didn't notice that either until I
checked it with the perf events.

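To put numbers on that, the relative reductions can be worked out from the
BEFORE and AFTER counters quoted in this mail. The helper below is just a
small sketch for that arithmetic; reduction_pct() and print_reductions()
are names made up for this example.

```c
#include <stdio.h>

/* Relative reduction in percent between a BEFORE and an AFTER count. */
static double reduction_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}

/* Print the reduction for the three perf events quoted in this mail. */
static void print_reductions(void)
{
	printf("itlb.itlb_flush       %.1f%%\n",
	       reduction_pct(426856, 179537));
	printf("tlb_flush.dtlb_thread %.1f%%\n",
	       reduction_pct(6900414, 6131135));
	printf("tlb_flush.stlb_any    %.1f%%\n",
	       reduction_pct(7303137, 6920979));
}
```

That works out to roughly 58% fewer itlb.itlb_flush events, against about
11% for tlb_flush.dtlb_thread and about 5% for tlb_flush.stlb_any.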
> >      30.396700625 seconds time elapsed
> >      80.331252000 seconds user
> >      10.303761000 seconds sys
> >
> > $ cat /proc/vmstat
> >
> > ...
> > pgdemote_kswapd 1044602
> > numa_pages_migrated 1044202
> > pgmigrate_success 2157808
> > nr_tlb_remote_flush 30453
> > nr_tlb_remote_flush_received 88840
> > nr_tlb_local_flush_all 3039
> > nr_tlb_local_flush_one 198875
> > ...
> >
> > Signed-off-by: Byungchul Park <byungchul@sk.com>
[...]
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 306a3d1a0fa6..3be66d3eabd2 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -228,6 +228,10 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> >  #endif
> > +#ifdef CONFIG_MIGRC
> > +	struct llist_node migrc_node;
> > +	unsigned int migrc_state;
> > +#endif
>
> We cannot enlarge "struct page".

This is what I was worried about. Do you have a better idea? I don't think
these fields fit into page_ext or anything similar.

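As a rough illustration of the size concern, the userspace sketch below
compares a placeholder structure with and without the two added fields.
It is not the real struct page layout: struct toy_page is a hypothetical
eight-word stand-in, and struct toy_llist_node only mirrors the fact that
struct llist_node is a single next pointer.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct llist_node: a single next pointer. */
struct toy_llist_node {
	struct toy_llist_node *next;
};

/* Hypothetical eight-word placeholder, NOT the real struct page. */
struct toy_page {
	unsigned long words[8];
};

/* The same placeholder plus the two fields the patch adds. */
struct toy_page_migrc {
	unsigned long words[8];
	struct toy_llist_node migrc_node;
	unsigned int migrc_state;
};

/* Number of bytes the two extra fields add, including any padding. */
static size_t toy_page_growth(void)
{
	return sizeof(struct toy_page_migrc) - sizeof(struct toy_page);
}
```

The growth is at least one pointer plus one unsigned int, and alignment
padding can make it larger; that cost is paid once per page frame.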

	Byungchul