From: Bernhard Voelker
Date: Fri, 23 Oct 2015 12:59:18 +0200
Subject: Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
To: Eric Blake, Pádraig Brady, Paolo Bonzini, coreutils@gnu.org
Cc: Rusty Russell, "qemu-devel@nongnu.org"
Message-ID: <562A1306.2070902@bernhard-voelker.de>
In-Reply-To: <562906DD.5040501@redhat.com>

On 10/22/2015 05:55 PM, Eric Blake wrote:
> On 10/22/2015 09:47 AM, Bernhard Voelker wrote:
>
>>> Also I suspect the extra conditions involved in using longs
>>> for just the first 16 bytes would outweigh the benefits?
>>> I.E. the first simple loop probably breaks early, and if not
>>> has the added benefit of "priming the pumps" for the subsequent memcmp().
>>
>> what about spending some 16 bytes of memory and do the memcmp on the whole
>> buffer?
>>
>>    static unsigned char p[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
>>    return 0 == memcmp (p, buf, bufsize);
>
> Won't work over the whole bufsize for anything larger than 16 unless you
> do repeated memcmp()s.
>
> Or are you suggesting that the first 16-byte head validation be done
> against a static buffer via one memcmp(), followed by the other
> overlap-self memcmp() for the rest of the buffer?  But I suspect that
> for short lengths, it is more efficient to do an unrolled loop than to
> make a function call (where the function call itself will probably just
> do an unrolled loop on the short length).  You want the short case to be
> fast, and the real speedup comes by delegating as much of the long case
> as possible to the system memcmp() optimizations.

Of course, you're completely right.  My example above was over-simplified
and therefore plain wrong, sorry.

Aiming at tools like dd(1), I played a bit with the idea of a known-zeroed
buffer in front of the real payload data, i.e. having a buffer of 16 + 64k
where the first 16 bytes are all NULs, thus being able to immediately use
the overlap-self memcmp() with the payload starting at offset 16.

Tests showed that you are right with your other suspicion, too:
for small buffer sizes, the overhead of calling memcmp() makes this
approach less efficient than Rusty's way.

Therefore +1 for Padraig's patch.

Have a nice day,
Berny
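
For illustration, the two variants discussed in this thread could be
sketched roughly as follows.  This is only a minimal sketch and not the
code from the patch: the names is_all_zero(), payload_is_all_zero() and
ZERO_PREFIX are hypothetical, and the 16-byte head length is simply the
value mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name: number of head bytes checked (or pre-zeroed). */
enum { ZERO_PREFIX = 16 };

/* Variant 1: check the first 16 bytes with a simple loop, then let
   memcmp() compare the buffer against itself shifted by 16 bytes.
   If the head is zero and buf[i] == buf[i + 16] for all remaining i,
   every byte must be zero.  */
static bool
is_all_zero (const void *buf, size_t len)
{
  const unsigned char *p = buf;
  size_t i;

  for (i = 0; i < ZERO_PREFIX; i++)
    {
      if (i == len)
        return true;            /* short buffer, all bytes were zero */
      if (p[i] != 0)
        return false;
    }

  /* memcmp() only reads, so the overlapping ranges are fine.  */
  return memcmp (p, p + ZERO_PREFIX, len - ZERO_PREFIX) == 0;
}

/* Variant 2 (assumption: the caller guarantees that BASE starts with
   ZERO_PREFIX zero bytes, followed by PAYLOAD_LEN bytes of payload).
   The head loop is skipped entirely.  */
static bool
payload_is_all_zero (const unsigned char *base, size_t payload_len)
{
  return memcmp (base, base + ZERO_PREFIX, payload_len) == 0;
}
```

The second variant trades 16 extra bytes of buffer for the head loop,
but it always pays the memcmp() call overhead, which is what hurts for
small sizes.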