From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20160714202940.18399-1-bobby.prani@gmail.com> <558fdb52-fe3e-2841-cc67-3ec2744c0224@redhat.com>
From: Pranith Kumar
In-reply-to: <558fdb52-fe3e-2841-cc67-3ec2744c0224@redhat.com>
Date: Tue, 19 Jul 2016 14:55:15 -0400
Message-ID: <87zipde9v0.fsf@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
To: Paolo Bonzini
Cc: Richard Henderson, "open list:All patches CC here", serge.fdrv@gmail.com, alex.bennee@linaro.org

Paolo Bonzini writes:

> On 14/07/2016 22:29, Pranith Kumar wrote:
>> +        } else if (curr_mb_type == TCG_BAR_STRL &&
>> +                   prev_mb_type == TCG_BAR_LDAQ) {
>> +            /* Consecutive load-acquire and store-release barriers
>> +             * can be merged into one stronger SC barrier
>> +             * ldaq; strl => ld; mb; st
>> +             */
>> +            args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> +            tcg_op_remove(s, prev_op);
>
> Is this really an optimization?  For example the processor could reorder
> "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2".  It cannot do this
> if you change ldaq1/strl2 to ld1/mb/st2.
>
> On x86 for example a memory fence costs ~50 clock cycles, while normal
> loads and stores are of course faster.
>
> Of course this is useful if your target doesn't have ldaq/strl
> instructions.  In this case, however, you probably want to lower ldaq to
> "ld;mb" and strl to "mb;st"; the other optimizations then will remove
> the unnecessary barrier.
>

I agree that this is a conservative optimization. The problem is that
currently, even for architectures which have ldaq/strl instructions, the TCG
backend does not generate them; it just generates plain loads and stores.
I guess we didn't need to, since TCG was single-threaded before MTTCG.

I am trying to add support for generating these instructions on AArch64. Once
this is done, we can disable the above optimization.

--
Pranith