From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51285)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1anSnW-0001sn-G3
	for qemu-devel@nongnu.org; Tue, 05 Apr 2016 11:21:46 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1anSnS-000424-BM
	for qemu-devel@nongnu.org; Tue, 05 Apr 2016 11:21:42 -0400
References: <1459777195-7907-1-git-send-email-vijayak@caviumnetworks.com>
	<1459777195-7907-3-git-send-email-vijayak@caviumnetworks.com>
	<CAFEAcA_755uEOZoGj0SCtmT-5PyLMntkK3wZeuHvZ_kPG9A1aQ@mail.gmail.com>
From: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <5703D7EE.2000005@redhat.com>
Date: Tue, 5 Apr 2016 17:21:18 +0200
MIME-Version: 1.0
In-Reply-To: <CAFEAcA_755uEOZoGj0SCtmT-5PyLMntkK3wZeuHvZ_kPG9A1aQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [RFC PATCH v1 2/2] target-arm: Use Neon for zero
	checking
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Peter Maydell <peter.maydell@linaro.org>, Vijaya Kumar K <vijayak@caviumnetworks.com>
Cc: Prasun Kapoor <Prasun.Kapoor@caviumnetworks.com>, Vijay <vijayak@cavium.com>, qemu-arm <qemu-arm@nongnu.org>, QEMU Developers <qemu-devel@nongnu.org>, Vijay Kilari <vijay.kilari@gmail.com>


On 05/04/2016 16:36, Peter Maydell wrote:
>> > +
>> > +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR_NEON 16
>> > +
>> > +/*
>> > + * Zero page/buffer checking using SIMD(Neon)
>> > + */
>> > +
>> > +static bool
>> > +can_use_buffer_find_nonzero_offset_neon(const void *buf, size_t len=
)
>> > +{
>> > +    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR_NEON
>> > +                   * sizeof(NEON_VECTYPE)) =3D=3D 0
>> > +            && ((uintptr_t) buf) % sizeof(NEON_VECTYPE) =3D=3D 0);
>> > +}
>> > +
>> > +static size_t buffer_find_nonzero_offset_neon(const void *buf, size=
_t len)
>> > +{
>> > +    size_t i;
>> > +    NEON_VECTYPE d0, d1, d2, d3, d4, d5, d6;
>> > +    NEON_VECTYPE d7, d8, d9, d10, d11, d12, d13, d14;
>> > +    uint64_t const *data =3D buf;
>> > +
>> > +    assert(can_use_buffer_find_nonzero_offset_neon(buf, len));
>> > +    len /=3D sizeof(unsigned long);
>> > +
>> > +    for (i =3D 0; i < len; i +=3D 32) {
>> > +        d0 =3D NEON_LOAD_N_ORR(data[i], data[i + 2]);
>> > +        d1 =3D NEON_LOAD_N_ORR(data[i + 4], data[i + 6]);
>> > +        d2 =3D NEON_LOAD_N_ORR(data[i + 8], data[i + 10]);
>> > +        d3 =3D NEON_LOAD_N_ORR(data[i + 12], data[i + 14]);
>> > +        d4 =3D NEON_ORR(d0, d1);
>> > +        d5 =3D NEON_ORR(d2, d3);
>> > +        d6 =3D NEON_ORR(d4, d5);
>> > +
>> > +        d7 =3D NEON_LOAD_N_ORR(data[i + 16], data[i + 18]);
>> > +        d8 =3D NEON_LOAD_N_ORR(data[i + 20], data[i + 22]);
>> > +        d9 =3D NEON_LOAD_N_ORR(data[i + 24], data[i + 26]);
>> > +        d10 =3D NEON_LOAD_N_ORR(data[i + 28], data[i + 30]);
>> > +        d11 =3D NEON_ORR(d7, d8);
>> > +        d12 =3D NEON_ORR(d9, d10);
>> > +        d13 =3D NEON_ORR(d11, d12);
>> > +
>> > +        d14 =3D NEON_ORR(d6, d13);
>> > +        if (NEON_EQ_ZERO(d14)) {
>> > +            break;
>> > +        }
>> > +    }
> Both the other optimised find_nonzero implementations in this
> file have two loops, not just one. Is it OK that this
> implementation has only a single loop?
>=20
> Paolo: do you know why we have two loops in the other
> implementations?

Because usually the first one or two iterations are enough to exit the
function if the page is nonzero.  It's measurably slower to go through
the unrolled loop in that case.  On the other hand, once the first few
iterations found only zero bytes, the buffer is very likely entirely
zero and the unrolled loop helps.

But in theory it should be enough to add a new #elif branch like this:

#include "arm_neon.h"
#define VECTYPE   uint64x2_t
#define VEC_OR(a, b) ((a) | (b))
#define ALL_EQ(a, b) /* ??? :) */

around the

/* vector definitions */

comment in util/cutils.c.  GCC should do everything else.

Paolo