qemu-devel.nongnu.org archive mirror
From: Robert Foley <robert.foley@linaro.org>
To: qemu-devel@nongnu.org
Cc: robert.foley@linaro.org, cota@braap.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	peter.puhov@linaro.org, alex.bennee@linaro.org,
	Richard Henderson <rth@twiddle.net>
Subject: [PATCH v10 73/73] cputlb: queue async flush jobs without the BQL
Date: Wed, 17 Jun 2020 17:02:31 -0400	[thread overview]
Message-ID: <20200617210231.4393-74-robert.foley@linaro.org> (raw)
In-Reply-To: <20200617210231.4393-1-robert.foley@linaro.org>

From: "Emilio G. Cota" <cota@braap.org>

Queueing TLB flush jobs with async_run_on_cpu_no_bql instead of
async_run_on_cpu yields sizable scalability improvements, as the
results below show.

Host: Two Intel Xeon Silver 4114 20-core CPUs at 2.20 GHz

VM: Ubuntu 18.04 ppc64

                   Speedup vs a single thread for kernel build                  
                                                                               
  7 +-----------------------------------------------------------------------+  
    |         +          +         +         +         +          +         |  
    |                                    ###########       baseline ******* |  
    |                               #####           ####   cpu lock ####### |  
    |                             ##                    ####                |  
  6 |-+                         ##                          ##            +-|  
    |                         ##                              ####          |  
    |                       ##                                    ###       |  
    |                     ##        *****                            #      |  
    |                   ##      ****     ***                          #     |  
    |                 ##     ***            *                               |  
  5 |-+             ##    ***                ****                         +-|  
    |              #  ****                       **                         |  
    |             # **                             **                       |  
    |             #*                                 **                     |  
    |          #*                                          **               |  
    |         #*                                             *              |  
    |         #                                               ******        |  
    |        #                                                      **      |  
    |       #                                                         *     |  
  3 |-+     #                                                             +-|  
    |      #                                                                |  
    |      #                                                                |  
    |     #                                                                 |  
    |     #                                                                 |  
  2 |-+  #                                                                +-|  
    |    #                                                                  |  
    |   #                                                                   |  
    |   #                                                                   |  
    |  #                                                                    |  
    |  #      +          +         +         +         +          +         |  
  1 +-----------------------------------------------------------------------+  
    0         5          10        15        20        25         30        35  
                                   Guest vCPUs  
Pictures are also here:
https://drive.google.com/file/d/1ASg5XyP9hNfN9VysXC3qe5s9QSJlwFAt/view?usp=sharing

Some notes:
- baseline corresponds to the commit before this series
- cpu-lock is this series

Single-threaded performance is only lightly affected. Results below
are for a debian aarch64 bootup+test run with the entire series
applied, on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)         #    0.998 CPUs utilized
      ( +-  0.06% )
    30,659,870,302      cycles                    #    4.218 GHz
      ( +-  0.06% )
    54,790,540,051      instructions              #    1.79  insns per cycle
      ( +-  0.05% )
     9,796,441,380      branches                  # 1347.695 M/sec
      ( +-  0.05% )
       165,132,201      branch-misses             #    1.69% of all branches
      ( +-  0.12% )

       7.287011656 seconds time elapsed
 ( +-  0.10% )

- After:

       7375.924053      task-clock (msec)         #    0.998 CPUs utilized
      ( +-  0.13% )
    31,107,548,846      cycles                    #    4.217 GHz
      ( +-  0.12% )
    55,355,668,947      instructions              #    1.78  insns per cycle
      ( +-  0.05% )
     9,929,917,664      branches                  # 1346.261 M/sec
      ( +-  0.04% )
       166,547,442      branch-misses             #    1.68% of all branches
      ( +-  0.09% )

       7.389068145 seconds time elapsed
 ( +-  0.13% )

That is, a ~1.4% slowdown in elapsed time.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
[Updated the speedup chart results for re-based series.]
Signed-off-by: Robert Foley <robert.foley@linaro.org>
---
 accel/tcg/cputlb.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 1e815357c7..7f75054643 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -299,7 +299,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
 
     CPU_FOREACH(cpu) {
         if (cpu != src) {
-            async_run_on_cpu(cpu, fn, d);
+            async_run_on_cpu_no_bql(cpu, fn, d);
         }
     }
 }
@@ -367,8 +367,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
     tlb_debug("mmu_idx: 0x%" PRIx16 "\n", idxmap);
 
     if (cpu->created && !qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
-                         RUN_ON_CPU_HOST_INT(idxmap));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
+                                RUN_ON_CPU_HOST_INT(idxmap));
     } else {
         tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
     }
@@ -562,7 +562,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
          * we can stuff idxmap into the low TARGET_PAGE_BITS, avoid
          * allocating memory for this operation.
          */
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_1,
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_1,
                          RUN_ON_CPU_TARGET_PTR(addr | idxmap));
     } else {
         TLBFlushPageByMMUIdxData *d = g_new(TLBFlushPageByMMUIdxData, 1);
@@ -570,7 +570,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
         /* Otherwise allocate a structure, freed by the worker.  */
         d->addr = addr;
         d->idxmap = idxmap;
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_2,
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_2,
                          RUN_ON_CPU_HOST_PTR(d));
     }
 }
-- 
2.17.1


