From: "Emilio G. Cota" <cota@braap.org>
To: Richard Henderson <richard.henderson@linaro.org>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH for-4.0 00/17] tcg: Move softmmu out-of-line
Date: Thu, 15 Nov 2018 20:13:38 -0500
Message-ID: <20181116011338.GB17566@flamenco>
In-Reply-To: <06e66024-1abb-e5b7-591c-3633b5cb3e31@linaro.org>

On Thu, Nov 15, 2018 at 23:04:50 +0100, Richard Henderson wrote:
> On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> > - Segfault in code_gen_buffer. This one I don't have a fix for,
> >   but it's *much* easier to reproduce when -tb-size is very small,
> >   e.g. "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
> >   So at first I thought the code cache flushing was the problem,
> >   but I don't see how that could be, at least from a TCGContext
> >   viewpoint -- I agree that clearing the hash table in
> >   tcg_region_assign is a good place to do so.
> 
> Ho hum.
> 
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 639f0b2728..115ea186e5 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1831,10 +1831,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      existing_tb = tb_link_page(tb, phys_pc, phys_page2);
>      /* if the TB already exists, discard what we just translated */
>      if (unlikely(existing_tb != tb)) {
> -        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
> -
> -        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> -        atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
>          return existing_tb;
>      }
>      tcg_tb_insert(tb);
> 
> We can't easily undo the hash table insert, and for a relatively rare
> occurrence it's not worth the effort.

Nice catch! Everything works now =D

In the bootup+shutdown aarch64 test with -smp 12, we end up
discarding ~2500 TBs. That's ~439 KiB (2500 * 180 bytes) of space
for code that we do not waste; note that I'm assuming 180 host
bytes per TB, which is the average reported by info jit.

We can still discard most of these by increasing a counter every
time we insert a new element into the OOL table, and checking
this counter before/after tcg_gen_code. (Note that checking
g_hash_table_size before/after is not enough, because we might
have replaced an existing item from the table.)
Then, we discard a TB iff no OOL thunk was generated. (Diff below.)
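
To illustrate why a monotonic counter works where a size check does not, here is a minimal standalone sketch. It is not QEMU code: struct thunk_table, thunk_table_replace() and can_discard() are hypothetical names, and the array-backed table only mimics the insert-or-replace behaviour of g_hash_table_replace().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THUNKS 16

/* Toy stand-in for the OOL thunk table (hypothetical, not QEMU's). */
struct thunk_table {
    int keys[MAX_THUNKS];
    size_t size;       /* distinct keys, analogous to g_hash_table_size() */
    size_t n_inserts;  /* bumped on every insert *or* replace */
};

/* Insert key, or replace the existing entry for key. */
static void thunk_table_replace(struct thunk_table *t, int key)
{
    size_t i;

    t->n_inserts++;
    for (i = 0; i < t->size; i++) {
        if (t->keys[i] == key) {
            return;  /* replaced in place: size does not change */
        }
    }
    assert(t->size < MAX_THUNKS);
    t->keys[t->size++] = key;
}

/*
 * The code just generated can be thrown away only if no thunk was
 * emitted since the snapshot was taken.
 */
static bool can_discard(const struct thunk_table *t, size_t snapshot)
{
    return t->n_inserts == snapshot;
}
```

Taking a snapshot of n_inserts, then replacing a key that is already present, leaves size unchanged but makes can_discard() return false, which is the case a size comparison would miss.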

This allows us to discard most TBs; in the example above,
we end up *not* discarding only ~70 TBs, that is, we keep
only 70/2500 = 2.8% of the TBs that we'd discard without OOL.

Performance-wise it doesn't make a difference for -smp 1:

Host: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (5 runs):

- Before (3.1.0-rc1):

      14351.436177      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.24% )
    49,963,260,126      cycles                    #    3.481 GHz                      ( +-  0.22% )  (83.32%)
    26,047,650,654      stalled-cycles-frontend   #   52.13% frontend cycles idle     ( +-  0.29% )  (83.34%)
    19,717,480,482      stalled-cycles-backend    #   39.46% backend  cycles idle     ( +-  0.27% )  (66.67%)
    59,278,011,067      instructions              #    1.19  insns per cycle        
                                                  #    0.44  stalled cycles per insn  ( +-  0.17% )  (83.34%)
    10,632,601,608      branches                  #  740.874 M/sec                    ( +-  0.17% )  (83.34%)
       236,153,469      branch-misses             #    2.22% of all branches          ( +-  0.16% )  (83.35%)

      14.382847823 seconds time elapsed                                          ( +-  0.25% )

- After this series (with the fixes we've discussed):

      13256.198927      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.04% )
    46,146,457,353      cycles                    #    3.481 GHz                      ( +-  0.08% )  (83.34%)
    22,632,342,565      stalled-cycles-frontend   #   49.04% frontend cycles idle     ( +-  0.12% )  (83.35%)
    16,534,690,741      stalled-cycles-backend    #   35.83% backend  cycles idle     ( +-  0.15% )  (66.67%)
    58,047,832,548      instructions              #    1.26  insns per cycle        
                                                  #    0.39  stalled cycles per insn  ( +-  0.18% )  (83.34%)
    11,031,634,880      branches                  #  832.187 M/sec                    ( +-  0.12% )  (83.33%)
       210,593,929      branch-misses             #    1.91% of all branches          ( +-  0.30% )  (83.33%)

      13.285023783 seconds time elapsed                                          ( +-  0.05% )

- After the fixup below:

      13240.889734      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.19% )
    46,074,292,775      cycles                    #    3.480 GHz                      ( +-  0.12% )  (83.35%)
    22,670,132,770      stalled-cycles-frontend   #   49.20% frontend cycles idle     ( +-  0.17% )  (83.35%)
    16,598,822,504      stalled-cycles-backend    #   36.03% backend  cycles idle     ( +-  0.26% )  (66.66%)
    57,796,083,344      instructions              #    1.25  insns per cycle        
                                                  #    0.39  stalled cycles per insn  ( +-  0.16% )  (83.34%)
    11,002,340,174      branches                  #  830.937 M/sec                    ( +-  0.11% )  (83.35%)
       211,023,549      branch-misses             #    1.92% of all branches          ( +-  0.22% )  (83.32%)

      13.264499034 seconds time elapsed                                          ( +-  0.19% )
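
For comparing the three runs, a small helper like the one below can turn two perf counters into a relative change. This is only a sketch for reading the numbers above; rel_change_pct() is a hypothetical name and is not part of perf or QEMU.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent (negative = lower). */
static double rel_change_pct(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Fed with the task-clock values above, this gives roughly -7.6% for the series against 3.1.0-rc1, and about -0.1% for the fixup against the series, which is within the reported +-0.19% run-to-run variation.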

I'll now generate some more perf numbers that we could include in the
commit logs.

Thanks,

		Emilio

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 115ea18..15f7d4e 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1678,6 +1678,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
     int gen_code_size, search_size;
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    size_t n_ool_thunks;
+#endif
 #ifdef CONFIG_PROFILER
     TCGProfile *prof = &tcg_ctx->prof;
     int64_t ti;
@@ -1744,6 +1747,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     ti = profile_getclock();
 #endif
 
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    n_ool_thunks = tcg_ctx->n_ool_thunks;
+#endif
+
     /* ??? Overflow could be handled better here.  In particular, we
        don't need to re-do gen_intermediate_code, nor should we re-do
        the tcg optimization currently hidden inside tcg_gen_code.  All
@@ -1831,6 +1838,18 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     existing_tb = tb_link_page(tb, phys_pc, phys_page2);
     /* if the TB already exists, discard what we just translated */
     if (unlikely(existing_tb != tb)) {
+        bool discard = true;
+
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+        /* only discard the TB if we didn't generate an OOL thunk */
+        discard = tcg_ctx->n_ool_thunks == n_ool_thunks;
+#endif
+        if (discard) {
+            uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+            orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+            atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
+        }
         return existing_tb;
     }
     tcg_tb_insert(tb);
diff --git a/tcg/tcg-ldst-ool.inc.c b/tcg/tcg-ldst-ool.inc.c
index 8fb6550..61da060 100644
--- a/tcg/tcg-ldst-ool.inc.c
+++ b/tcg/tcg-ldst-ool.inc.c
@@ -69,6 +69,7 @@ static bool tcg_out_ldst_ool_finalize(TCGContext *s)
 
         /* Remember the thunk for next time.  */
         g_hash_table_replace(s->ldst_ool_thunks, key, dest);
+        s->n_ool_thunks++;
 
         /* The new thunk must be in range.  */
         ok = patch_reloc(lb->label, lb->reloc, (intptr_t)dest, lb->addend);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1255d2a..d4f07a6 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -709,6 +709,7 @@ struct TCGContext {
 #ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
     QSIMPLEQ_HEAD(ldst_labels, TCGLabelQemuLdstOol) ldst_ool_labels;
     GHashTable *ldst_ool_thunks;
+    size_t n_ool_thunks;
 #endif
 #ifdef TCG_TARGET_NEED_POOL_LABELS
     struct TCGLabelPoolData *pool_labels;
