From: Andrew Cooper
Subject: Re: [PATCH 2/3] x86: use CLFLUSHOPT when available
Date: Wed, 10 Feb 2016 16:27:06 +0000
Message-ID: <56BB64DA.4040409@citrix.com>
In-Reply-To: <56BB67C602000078000D0A83@prv-mh.provo.novell.com>
To: Jan Beulich
Cc: xen-devel, Keir Fraser

On 10/02/16 15:39, Jan Beulich wrote:
>>>> On 10.02.16 at 16:03, wrote:
>> On 10/02/16 12:57, Jan Beulich wrote:
>>> Also drop an unnecessary va adjustment in the code being touched.
>>>
>>> Signed-off-by: Jan Beulich
>>>
>>> --- a/xen/arch/x86/flushtlb.c
>>> +++ b/xen/arch/x86/flushtlb.c
>>> @@ -139,10 +139,12 @@ unsigned int flush_area_local(const void
>>> c->x86_clflush_size && c->x86_cache_size && sz &&
>>> ((sz >> 10) < c->x86_cache_size) )
>>> {
>>> - va = (const void *)((unsigned long)va & ~(sz - 1));
>>> + alternative(ASM_NOP3, "sfence", X86_FEATURE_CLFLUSHOPT);
>> Why separate? This would be better in the lower alternative(), with one
>> single nop making up the difference in length. That way, processors
>> without CLFLUSHOPT don't suffer the 1 cycle instruction decode stall
>> from the redundant rex prefix.
> Why would we want the fence inside the loop - a single fence is
> sufficient for the entire flush.

Ah yes - of course.

>
> Also if we're worried about the REX decode, this could easily be a
> NOP instead, just that I'm not certain which one in the end is less
> decode overhead.

A redundant prefix will generally have a lower overhead than a full new
instruction.

Reviewed-by: Andrew Cooper
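
For readers following the fence placement discussed above, here is a minimal C sketch of "one fence for the whole flush, then one flush per cache line". It is not the Xen code: `flush_range`, `fake_fence`, `fake_flush_line`, and `struct flush_stats` are hypothetical names, the loop shape is an assumption, and counters stand in for the real SFENCE and CLFLUSH/CLFLUSHOPT instructions so the example runs on any machine. The only thing taken from the quoted hunk is that the fence sits ahead of the per-line work and is conditional on CLFLUSHOPT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for the real instructions. */
struct flush_stats {
    unsigned int fences; /* how often the SFENCE stand-in ran */
    unsigned int lines;  /* how often the per-line flush stand-in ran */
};

/* Stand-in for SFENCE (assumption: real code would emit the instruction). */
static void fake_fence(struct flush_stats *st)
{
    st->fences++;
}

/* Stand-in for CLFLUSH / CLFLUSHOPT on the line containing addr. */
static void fake_flush_line(struct flush_stats *st, const void *addr)
{
    (void)addr; /* a real flush would use the address as a memory operand */
    st->lines++;
}

/*
 * Flush sz bytes starting at va, stepping by line_size.
 * The fence is issued once, outside the loop, and only when the
 * (hypothetical) have_clflushopt flag is set.
 */
static void flush_range(struct flush_stats *st, const void *va, size_t sz,
                        size_t line_size, int have_clflushopt)
{
    assert(line_size != 0);

    if (have_clflushopt)
        fake_fence(st); /* single fence for the entire flush */

    for (size_t i = 0; i < sz; i += line_size)
        fake_flush_line(st, (const char *)va + i);
}
```

Calling `flush_range` on a 256-byte buffer with a 64-byte line size runs the per-line stand-in four times and the fence stand-in once (or not at all when `have_clflushopt` is 0), which is the point made in the reply: the fence cost does not scale with the number of lines.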