From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:48938) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bYDFI-00006P-2A for qemu-devel@nongnu.org; Fri, 12 Aug 2016 10:16:08 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bYDEy-0004It-FW for qemu-devel@nongnu.org; Fri, 12 Aug 2016 10:15:37 -0400
Sender: Richard Henderson
References: <1470133216-6758-1-git-send-email-vijay.kilari@gmail.com> <1470133216-6758-3-git-send-email-vijay.kilari@gmail.com> <540c4c1d-1590-4c64-116a-e6b41bd3475e@twiddle.net>
From: Richard Henderson
Date: Fri, 12 Aug 2016 14:20:43 +0100
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH v1 2/2] utils: Add prefetch for Thunderx platform
To: Vijay Kilari
Cc: qemu-arm , Peter Maydell , Paolo Bonzini , "prasun.kapoor" , QEMU Developers , Vijaya Kumar K

On 08/12/2016 12:32 PM, Vijay Kilari wrote:
> On Sat, Aug 6, 2016 at 3:47 PM, Richard Henderson wrote:
>> On 08/02/2016 03:50 PM, vijay.kilari@gmail.com wrote:
>>>
>>> +#define VEC_PREFETCH(base, index) \
>>> +        asm volatile ("prfm pldl1strm, [%x[a]]\n" : :
>>> [a]"r"(&base[(index)]))
>>
>>
>> Is this not __builtin_prefetch(base + index) ?
>>
>> I.e. you can define this generically for all targets.
>
> __builtin_prefetch() is available only in gcc 5.3 for arm64.

So? You can't really defend the position that you care about aarch64 code
quality if you're using gcc 4.x. Essentially all of the performance work has
been done for later versions.

>> I'll note that you're also prefetching too much, off the end of the block,
>> and that you're probably not prefetching far enough. You'd need to break
>> off the last iteration(s) of the loop.
>>
>> I'll note that you're also prefetching too close. The loop operates on
>> 8*vecsize units. In the case of aarch64, 128 byte units. Both i+32 and
>
> 128 unit is specific to thunder. I will move this to thunder
> specific function

No, you misunderstand.

While it's true that thunderx is unique among aarch64 implementations in
having a 128-byte cacheline size, the "128" I mention above has nothing to do
with that. The loop is operating on BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
bytes, which is defined above as 8 * sizeof(vector), which happens to be 128.


r~
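
For illustration only, here is a rough sketch of the two points made above: a
target-independent VEC_PREFETCH built on __builtin_prefetch, and a scan loop
whose final iterations are split off so that nothing is prefetched past the
end of the buffer. This is not the QEMU code. UNROLL_BYTES stands in for
BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR (assumed to be 128 here), and
PREFETCH_AHEAD, block_nonzero and find_nonzero_block are hypothetical names
with an arbitrary, untuned prefetch distance.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic prefetch via the GCC/Clang builtin: rw = 0 (read),
 * locality = 0 (streaming, little temporal reuse). */
#define VEC_PREFETCH(base, index) __builtin_prefetch(&(base)[(index)], 0, 0)

#define UNROLL_BYTES   128  /* assumption: 8 * sizeof(vector) */
#define PREFETCH_AHEAD 4    /* assumption: distance in blocks, needs tuning */

/* Return nonzero if any byte in one UNROLL_BYTES block is nonzero. */
static int block_nonzero(const uint8_t *p)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < UNROLL_BYTES; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, p + i, sizeof(w));  /* alignment-safe load */
        acc |= w;
    }
    return acc != 0;
}

/* Return the offset of the first block containing a nonzero byte,
 * or len if the buffer is all zero.  len must be a multiple of
 * UNROLL_BYTES. */
static size_t find_nonzero_block(const uint8_t *buf, size_t len)
{
    size_t nblocks = len / UNROLL_BYTES;
    /* Only blocks that still have PREFETCH_AHEAD blocks after them
     * issue a prefetch; the remainder is handled by the tail loop. */
    size_t main_blocks = nblocks > PREFETCH_AHEAD ? nblocks - PREFETCH_AHEAD : 0;
    size_t b;

    for (b = 0; b < main_blocks; b++) {
        VEC_PREFETCH(buf, (b + PREFETCH_AHEAD) * UNROLL_BYTES);
        if (block_nonzero(buf + b * UNROLL_BYTES)) {
            return b * UNROLL_BYTES;
        }
    }
    /* Last iterations: no prefetch, so nothing beyond buf + len is touched. */
    for (; b < nblocks; b++) {
        if (block_nonzero(buf + b * UNROLL_BYTES)) {
            return b * UNROLL_BYTES;
        }
    }
    return len;
}
```

The prefetch distance is counted in whole loop iterations, not in bytes within
the current one, which is the distinction between "too close" and "far enough"
made above. Whether 4 blocks is a sensible distance on any given core is a
matter for measurement.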