From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bruce Richardson
Subject: Re: [PATCH v3] Implement memcmp using SIMD intrinsics
Date: Fri, 12 Jun 2015 10:03:35 +0100
Message-ID: <20150612090334.GA496@bricha3-MOBL3>
References: <1431979303-1346-1-git-send-email-rkerur@gmail.com> <20150612083056.GA18090@domone>
In-Reply-To: <20150612083056.GA18090@domone>
To: Ondřej Bílka
Cc: dev@dpdk.org
List-Id: patches and discussions about DPDK

On Fri, Jun 12, 2015 at 10:30:56AM +0200, Ondřej Bílka wrote:
> On Mon, May 18, 2015 at 01:01:42PM -0700, Ravi Kerur wrote:
> > Background:
> > After preliminary discussion with John (Zhihong) and Tim from Intel it was
> > decided that it would be beneficial to use AVX/SSE intrinsics for memcmp
> > similar to memcpy that had been implemeneted. In addition, we decided to use
> > librte_hash as a test candidate to test both functionality and performance.
> >
> > Further discussions lead to complete functionality implementation of memory
> > comparison and v3 code reflects that.
> >
> > Test was conducted on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu 14.04,
> > x86_64, 16GB DDR3 system.
> >
> > Ravi Kerur (1):
> >   Implement memcmp using Intel SIMD instrinsics.
>
> As my previous mail got lost I am resending it.
>
> In short you shouldn't
> use sse2/avx2 for memcmp at all. In 95% of calls you find inequality in
> first 8 bytes so sse2 adds just unnecessary overhead versus checking
> these with.
>
>  190:	48 8b 4e 08          	mov    0x8(%rsi),%rcx
>  194:	48 39 4f 08          	cmp    %rcx,0x8(%rdi)
>  198:	75 f3                	jne    18d
>
> Also as you have full memcmp does in your gcc optimize out
> if (memcmp(x,y))
> like in mine?
>
> So run also implementation below in your benchmark, my guess is it will
> be faster.
>

Thanks for the contribution. It's very informative!

/Bruce
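
For anyone reading this in the archive: the early-exit idea in the quoted text (settle most calls with one plain 64-bit compare of the first 8 bytes before doing any wider work) might look roughly like the sketch below. This is not the implementation from Ondřej's mail, which is not quoted here, and it is not code from the patch. The name memcmp_first8 is hypothetical, and falling back to libc memcmp() for everything past the fast path is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Compares the first 8 bytes with a single 64-bit compare and only
 * looks at the remainder when those bytes are equal.
 */
int memcmp_first8(const void *a, const void *b, size_t n)
{
	const unsigned char *pa = a;
	const unsigned char *pb = b;

	if (n >= 8) {
		uint64_t x, y;

		/* memcpy keeps the loads safe for unaligned pointers;
		 * compilers typically turn each into a single mov. */
		memcpy(&x, pa, sizeof(x));
		memcpy(&y, pb, sizeof(y));

		if (x != y)
			/* Mismatch is inside the first 8 bytes: let a
			 * bytewise compare decide the sign, which stays
			 * correct on little-endian machines. */
			return memcmp(pa, pb, 8);

		/* First 8 bytes are equal, compare only the remainder. */
		return memcmp(pa + 8, pb + 8, n - 8);
	}

	/* Short inputs: nothing to gain from the fast path. */
	return memcmp(pa, pb, n);
}

/* Small self-check showing the intended behaviour. */
void memcmp_first8_selfcheck(void)
{
	assert(memcmp_first8("abcdefgh", "abcdefgh", 8) == 0);
	assert(memcmp_first8("abcdefgh1", "abcdefgh2", 9) < 0);
}
```

Whether this beats the SSE/AVX version depends on how often the first 8 bytes really differ in the workload being measured, so it would need to go through the same librte_hash benchmark as the patch.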