linux-mm.kvack.org archive mirror
From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kirill@shutemov.name, mhocko@kernel.org,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
	Ankur Arora <ankur.a.arora@oracle.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Tony Luck <tony.luck@intel.com>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Xiaoyao Li <xiaoyao.li@intel.com>,
	Fenghua Yu <fenghua.yu@intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
Date: Wed, 14 Oct 2020 01:32:58 -0700	[thread overview]
Message-ID: <20201014083300.19077-8-ankur.a.arora@oracle.com> (raw)
In-Reply-To: <20201014083300.19077-1-ankur.a.arora@oracle.com>

System:           Oracle X6-2
CPU:              2 nodes * 10 cores/node * 2 threads/core
		  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
Memory:           256 GB evenly split between nodes
Microcode:        0xb00002e
scaling_governor: performance
L3 size:          25MB
intel_pstate/no_turbo: 1

Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
(X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
              -----------------------   -----------------------     -------
     size       BW        (   pstdev)          BW   (   pstdev)

     16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
    128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
   1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
   4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%
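
For context, the x86-64-movnt variant boils down to a loop of 8-byte MOVNTI
stores. A minimal user-space sketch of such a loop (illustration only, not the
series' actual memset_movnti() assembly; assumes dst is 8-byte aligned and len
is a multiple of 8):

 #include <stdint.h>
 #include <stddef.h>

 /*
  * Fill a buffer with 8-byte non-temporal stores. MOVNTI bypasses the
  * cache hierarchy, so the written lines are not allocated into L1/L2/L3;
  * the trailing SFENCE orders the weakly-ordered NT stores against
  * subsequent stores.
  */
 static void memset_nt_sketch(void *dst, uint64_t pattern, size_t len)
 {
	uint64_t *p = dst;
	size_t i;

	for (i = 0; i < len / sizeof(*p); i++)
		asm volatile("movnti %1, %0" : "=m" (p[i]) : "r" (pattern));

	asm volatile("sfence" ::: "memory");
 }

This also matches the shape of the table above: at 16MB the REP STOSB numbers
presumably benefit from the writes fitting within the 25MB L3, while at sizes
well beyond L3 the non-temporal path sustains roughly twice the bandwidth.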

The next workload exercises the page-clearing path directly, by faulting in an
anonymous mmap region backed by 1GB pages. This is similar to the creation
phase of pinned guests in QEMU.

$ cat pf-test.c
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <linux/mman.h>

 #define HPAGE_BITS 30

 int main(int argc, char **argv) {
	unsigned long len, offset = 0;
	unsigned long numpages, i;
	char *base;

	if (argc < 2)
		return 1;

	len = atoi(argv[1]);	/* In GB */
	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
	            MAP_PRIVATE | MAP_ANONYMOUS |
	            MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Fault in each 1GB page by touching one byte in it. */
	for (i = 0; i < numpages; i++) {
	        *((volatile char *)base + offset) = *(base + offset);
	        offset += 1UL << HPAGE_BITS;
	}

	return 0;
 }

The run below uses a 128GB region, but since this is a single-threaded
O(n) workload the exact region size is not material.

Page-clearing throughput for clear_page_erms(): 3.72 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    74,799,496,556      cpu-cycles                #    2.176 GHz                      ( +-  2.22% )  (29.41%)
     1,474,615,023      instructions              #    0.02  insn per cycle           ( +-  0.23% )  (35.29%)
     2,148,580,131      cache-references          #   62.502 M/sec                    ( +-  0.02% )  (35.29%)
        71,736,985      cache-misses              #    3.339 % of all cache refs      ( +-  0.94% )  (35.29%)
       433,713,165      branch-instructions       #   12.617 M/sec                    ( +-  0.15% )  (35.30%)
         1,008,251      branch-misses             #    0.23% of all branches          ( +-  1.88% )  (35.30%)
     3,406,821,966      bus-cycles                #   99.104 M/sec                    ( +-  2.22% )  (23.53%)
     2,156,059,110      L1-dcache-load-misses     #  445.35% of all L1-dcache accesses  ( +-  0.01% )  (23.53%)
       484,128,243      L1-dcache-loads           #   14.083 M/sec                    ( +-  0.22% )  (23.53%)
           944,216      LLC-loads                 #    0.027 M/sec                    ( +-  7.41% )  (23.53%)
           537,989      LLC-load-misses           #   56.98% of all LL-cache accesses  ( +- 13.64% )  (23.53%)
     2,150,138,476      LLC-stores                #   62.547 M/sec                    ( +-  0.01% )  (11.76%)
        69,598,760      LLC-store-misses          #    2.025 M/sec                    ( +-  0.47% )  (11.76%)
       483,923,875      dTLB-loads                #   14.077 M/sec                    ( +-  0.21% )  (17.64%)
             1,892      dTLB-load-misses          #    0.00% of all dTLB cache accesses  ( +- 30.63% )  (23.53%)
     4,799,154,980      dTLB-stores               #  139.606 M/sec                    ( +-  0.03% )  (23.53%)
                90      dTLB-store-misses         #    0.003 K/sec                    ( +- 35.92% )  (23.53%)

            34.377 +- 0.760 seconds time elapsed  ( +-  2.21% )

Page-clearing throughput with clear_page_nt(): 11.78 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    23,699,446,603      cpu-cycles                #    2.182 GHz                      ( +-  0.01% )  (23.53%)
    24,794,548,512      instructions              #    1.05  insn per cycle           ( +-  0.00% )  (29.41%)
           432,775      cache-references          #    0.040 M/sec                    ( +-  3.96% )  (29.41%)
            75,580      cache-misses              #   17.464 % of all cache refs      ( +- 51.42% )  (29.41%)
     2,492,858,290      branch-instructions       #  229.475 M/sec                    ( +-  0.00% )  (29.42%)
        34,016,826      branch-misses             #    1.36% of all branches          ( +-  0.04% )  (29.42%)
     1,078,468,643      bus-cycles                #   99.276 M/sec                    ( +-  0.01% )  (23.53%)
           717,228      L1-dcache-load-misses     #    0.20% of all L1-dcache accesses  ( +-  3.77% )  (23.53%)
       351,999,535      L1-dcache-loads           #   32.403 M/sec                    ( +-  0.04% )  (23.53%)
            75,988      LLC-loads                 #    0.007 M/sec                    ( +-  4.20% )  (23.53%)
            24,503      LLC-load-misses           #   32.25% of all LL-cache accesses  ( +- 53.30% )  (23.53%)
            57,283      LLC-stores                #    0.005 M/sec                    ( +-  2.15% )  (11.76%)
            19,738      LLC-store-misses          #    0.002 M/sec                    ( +- 46.55% )  (11.76%)
       351,836,498      dTLB-loads                #   32.388 M/sec                    ( +-  0.04% )  (17.65%)
             1,171      dTLB-load-misses          #    0.00% of all dTLB cache accesses  ( +- 42.68% )  (23.53%)
    17,385,579,725      dTLB-stores               # 1600.392 M/sec                    ( +-  0.00% )  (23.53%)
               200      dTLB-store-misses         #    0.018 K/sec                    ( +- 10.63% )  (23.53%)

         10.863678 +- 0.000804 seconds time elapsed  ( +-  0.01% )

L1-dcache-load-misses (L1D.REPLACEMENT) is substantially lower, which
suggests that, as expected, the non-temporal path avoids write-allocate
and RFO traffic.

Note that the IPC, instruction counts, etc. are quite different, but that
is just an artifact of switching from a single 'REP; STOSB' per PAGE_SIZE
region to a MOVNTI loop.
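
For contrast, the cached path amounts to a single REP STOSB instruction per
4KB page, versus ~512 8-byte MOVNTI stores plus loop overhead on the
non-temporal path. A rough user-space sketch of the former (illustration only,
not the kernel's clear_page_erms() assembly):

 /*
  * Clear a 4KB page with a single REP STOSB: the stores are ordinary
  * write-back stores that travel through the cache hierarchy, unlike
  * the MOVNTI loop sketched earlier.
  */
 static void clear_4k_stosb_sketch(void *page)
 {
	void *d = page;
	unsigned long cnt = 4096;

	asm volatile("rep stosb"
		     : "+D" (d), "+c" (cnt)
		     : "a" (0)
		     : "memory");
 }

The ~17.4 billion dTLB-stores in the MOVNTI run are consistent with
128GB / 8 bytes ~= 17.2 billion individual stores.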

The page-clearing BW is substantially higher (~100% or more), so enable
X86_FEATURE_NT_GOOD for Intel Broadwellx.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/intel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 59a1e3ce3f14..161028c1dee0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -662,6 +662,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
 	if (c->x86 == 6)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
+		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
 #else
 	/*
 	 * Names for the Pentium II/Celeron processors
-- 
2.9.3



Thread overview: 29+ messages
2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
2020-10-14  8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
2020-10-14  8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
2020-10-14  8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
2020-10-14  8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
2020-10-14 19:56   ` Borislav Petkov
2020-10-14 21:11     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
2020-10-14 11:10   ` kernel test robot
2020-10-14 13:04   ` kernel test robot
2020-10-14 15:45   ` Andy Lutomirski
2020-10-14 19:58     ` Borislav Petkov
2020-10-14 21:07       ` Andy Lutomirski
2020-10-14 21:12         ` Borislav Petkov
2020-10-15  3:37           ` Ankur Arora
2020-10-15 10:35             ` Borislav Petkov
2020-10-15 21:20               ` Ankur Arora
2020-10-16 18:21                 ` Borislav Petkov
2020-10-15  3:21         ` Ankur Arora
2020-10-15 10:40           ` Borislav Petkov
2020-10-15 21:40             ` Ankur Arora
2020-10-14 20:54     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
2020-10-14 15:28   ` Ingo Molnar
2020-10-14 19:15     ` Ankur Arora
2020-10-14  8:32 ` Ankur Arora [this message]
2020-10-14 15:31   ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ingo Molnar
2020-10-14 19:23     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora
