Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Ilya Leoshkevich <iii@linux.ibm.com>
To: Richard Henderson <richard.henderson@linaro.org>, qemu-devel@nongnu.org
Cc: thuth@redhat.com
Subject: Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
Date: Wed, 07 Dec 2022 23:09:07 +0100	[thread overview]
Message-ID: <d3672f3db80fa495e9af23680f0926901bb99b86.camel@linux.ibm.com> (raw)
In-Reply-To: <8a3965f7-f830-6343-be15-4e16b20655fd@linaro.org>

On Tue, 2022-12-06 at 16:22 -0600, Richard Henderson wrote:
> On 12/6/22 13:29, Ilya Leoshkevich wrote:
> > On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:
> > > This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
> > > several follow-up patches.  The primary motivation is to reduce
> > > the
> > > less-tested code paths, pre-z10.  Secondarily, this allows the
> > > unconditional use of TCG_TARGET_HAS_direct_jump, which might be
> > > more
> > > important for performance than any slight increase in code size.
> > > 
> > > Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> > > ---
> > >   tcg/s390x/tcg-target.h     |   2 +-
> > >   tcg/s390x/tcg-target.c.inc | 176 +++++-------------------------
> > > -------
> > >   2 files changed, 23 insertions(+), 155 deletions(-)
> > 
> > Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
> > 
> > I have a few questions/ideas for the future below.  
> > > 
...

> > 
> > > -    /* Use the constant pool if USE_REG_TB, but not for small
> > > constants.  */
> > > -    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
> > > -        if (type == TCG_TYPE_I32) {
> > > -            tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
> > > -        } else {
> > > -            tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
> > > -        }
> > > -    } else if (USE_REG_TB) {
> > > -        tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE,
> > > 0);
> > > -        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
> > > -                       tcg_tbrel_diff(s, NULL));
> > > +    tcg_out_movi(s, type, TCG_TMP0, val);
> > > +    if (type == TCG_TYPE_I32) {
> > > +        tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
> > >       } else {
> > > -        /* Perform the xor by parts.  */
> > > -        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
> > > -        if (val & 0xffffffff) {
> > > -            tcg_out_insn(s, RIL, XILF, dest, val);
> > > -        }
> > > -        if (val > 0xffffffff) {
> > > -            tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
> > > -        }
> > > +        tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
> > >       }
> > >   }
> > 
> > Wouldn't it be worth keeping XILF/XIFH here?
> 
> I don't know.  It's difficult for me to guess whether a dependency
> chain like
> 
>      val -> xor -> xor
> 
> (3 insns with serial dependencies) is better than
> 
>      val   --> xor
>      load  -/
> 
> (3 insns, but only one serial dependency) is better.  But there may
> also be instruction 
> fusion going on at the micro-architectural level, so that there's
> really only one xor.
> 
> If you have suggestions, I'm all ears.

I ran some experiments, and it turned out you were right: xilf+xihf is
slower exactly because of dependency chains.
So no need to change anything here and sorry for the noise.

> > I don't have any numbers right now, but it looks more
> > compact/efficient
> > than a load + XGR.
> 
> If we assume general-instruction-extension-facility (z10?), LGRL +
> XGR is smaller than 
> XILF + XIFH, ignoring the constant pool entry which might be shared,
> and modulo the µarch 
> questions above.
> 
> 
> > Same for OGR above; I even wonder if both implementations could be
> > unified.
> 
> Sadly not, because of OILL et al.  There are no 16-bit xor immediate
> insns.
> 
> > > +        /*
> > > +         * branch displacement must be aligned for atomic
> > > patching;
> > > +         * see if we need to add extra nop before branch
> > > +         */
> > > +        if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
> > > +            tcg_out16(s, NOP);
> > >           }
> > > +        tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
> > > +        s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
> > > +        tcg_out32(s, 0);
> > >           set_jmp_reset_offset(s, a0);
> > 
> > This seems to work in practice, but I don't think patching branch
> > offsets is allowed by PoP (in a multi-threaded environment). For
> > example, I had to do [2] in order to work around this limitation in
> > ftrace.
> 
> Really?  How does the processor distinguish between overwriting
> opcode/condition vs 
> overwriting immediate operand when invalidating cached instructions?

The problem is that nothing in PoP prevents a CPU from fetching
instruction bytes one by one, in random order and more than one time.
It's not that this is necessarily going to happen, rather, it's just
that this has never been verified for all instructions and formally
stated. The observation that I use in the ftrace patch is not
formalized either, but it's specific to a single instruction and should
hold in practice for the existing hardware to the best of my knowledge.

> If overwriting operand truly isn't correct, then I think we have to
> use indirect branch 
> always for goto_tb.
> 
> > A third benefit seems to be that we now have one more register to
> > allocate.
> 
> Yes.  It's call clobbered, so it isn't live so often, but sometimes.
> 
> 
> r~

next prev parent reply	other threads:[~2022-12-07 22:09 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
2022-12-02  6:51 ` [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2 Richard Henderson
2022-12-06 15:49   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB Richard Henderson
2022-12-06 19:29   ` Ilya Leoshkevich
2022-12-06 22:22     ` Richard Henderson
2022-12-07  0:42       ` Richard Henderson
2022-12-08 14:04         ` Ilya Leoshkevich
2022-12-07  7:45       ` Thomas Huth
2022-12-07 14:55         ` Richard Henderson
2022-12-07 20:40           ` Ilya Leoshkevich
2022-12-07 21:20             ` Christian Borntraeger
2022-12-07 22:09       ` Ilya Leoshkevich [this message]
2022-12-02  6:51 ` [PATCH v3 03/13] tcg/s390x: Use LARL+AGHI for odd addresses Richard Henderson
2022-12-06 19:42   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 04/13] tcg/s390x: Distinguish RRF-a and RRF-c formats Richard Henderson
2022-12-06 19:45   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 05/13] tcg/s390x: Distinguish RIE formats Richard Henderson
2022-12-06 19:47   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions Richard Henderson
2022-12-06 20:02   ` Ilya Leoshkevich
2022-12-06 20:20     ` Richard Henderson
2022-12-02  6:51 ` [PATCH v3 07/13] tcg/s390x: Support MIE2 MGRK instruction Richard Henderson
2022-12-06 20:02   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 08/13] tcg/s390x: Support MIE3 logical operations Richard Henderson
2022-12-06 20:08   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 09/13] tcg/s390x: Create tgen_cmp2 to simplify movcond Richard Henderson
2022-12-06 20:14   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 10/13] tcg/s390x: Generalize movcond implementation Richard Henderson
2022-12-06 20:39   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 11/13] tcg/s390x: Support SELGR instruction in movcond Richard Henderson
2022-12-06 20:41   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 12/13] tcg/s390x: Use tgen_movcond_int in tgen_clz Richard Henderson
2022-12-06 20:49   ` Ilya Leoshkevich
2022-12-02  6:52 ` [PATCH v3 13/13] tcg/s390x: Implement ctpop operation Richard Henderson
2022-12-06 21:10   ` Ilya Leoshkevich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d3672f3db80fa495e9af23680f0926901bb99b86.camel@linux.ibm.com \
    --to=iii@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).