From: Paolo Bonzini
Date: Thu, 22 Oct 2015 17:31:30 +0200
Subject: Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
To: Pádraig Brady, Eric Blake, coreutils@gnu.org
Cc: Rusty Russell, "qemu-devel@nongnu.org"
Message-ID: <56290152.7010408@redhat.com>
In-Reply-To: <5628FE20.80802@draigBrady.com>

On 22/10/2015 17:17, Pádraig Brady wrote:
>> Nice trick indeed.  On the other hand, the first 16 bytes are enough to
>> rule out 99.99% (number out of thin air) of the non-zero blocks, so
>> that's where you want to optimize.  Checking them an unsigned long at a
>> time, or fetching a few unsigned longs and ORing them together, would
>> probably be the best of both worlds, because you then only use the FPU
>> in the rare case of a zero buffer.
>
> Note the above does break early if non-zero is detected in the first 16 bytes.

Yes, but it loops unnecessarily if the non-zero byte is the third or fourth.

> Also I suspect the extra conditions involved in using longs
> for just the first 16 bytes would outweigh the benefits?

Only if your machine cannot do unaligned loads.  If it can, you can
align the length instead of the buffer.  memcmp will take care of
aligning the buffer (with some luck it won't have to, e.g. if buf is
0x12340002 and length = 4094).  On x86, unaligned "unsigned long" loads
are basically free as long as they don't cross a cache line.

> BTW Rusty has a benchmark framework for this, as referenced from:
> http://rusty.ozlabs.org/?p=560

I missed his benchmark framework, so I wrote another one; here it is:

https://gist.githubusercontent.com/bonzini/9a95b0e02d1ceb60af9e/raw/7bc42ddccdb6c42fea3db58e0539d0443d0e6dc6/memeqzero.c

Paolo
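
---

For concreteness, here is a minimal C sketch of the combined approach discussed
in this thread.  It is not the actual coreutils/QEMU patch nor Rusty's code;
the function name, the exact 16-byte threshold, and the use of memcpy for the
unaligned loads are illustrative assumptions.  The idea is to OR the first 16
bytes together as two 8-byte loads to reject almost every non-zero buffer
early, and only then fall back to the memcmp-against-itself trick for the tail.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper, not the submitted patch.  Returns true if the
 * first LEN bytes of BUF are all zero. */
static bool is_all_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t a, b;

    /* Small buffers: a plain byte loop is cheapest. */
    if (len < 16) {
        while (len--) {
            if (*p++) {
                return false;
            }
        }
        return true;
    }

    /* Check the first 16 bytes as two 8-byte loads; memcpy keeps the
     * loads legal even when BUF is not 8-byte aligned. */
    memcpy(&a, p, 8);
    memcpy(&b, p + 8, 8);
    if (a | b) {
        return false;
    }

    /* The first 16 bytes are known to be zero, so the whole buffer is
     * zero iff buf[16..len) equals buf[0..len-16); memcmp does the
     * heavy lifting (and any alignment work) for large buffers. */
    return memcmp(p, p + 16, len - 16) == 0;
}

The self-comparison works because, once the first 16 bytes are zero,
memcmp(p, p + 16, len - 16) == 0 implies p[i] == p[i - 16] for every
i >= 16, i.e. every byte in the buffer is zero.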