From: Max Filippov
Date: Sun, 30 Oct 2011 16:06:07 +0400
Subject: Re: [Qemu-devel] Why some ARM NEON helper functions need mask?
To: qemu-devel@nongnu.org
Cc: 陳韋任
Message-Id: <201110301606.08786.jcmvbkbc@gmail.com>
In-Reply-To: <20111030113908.GA18904@cs.nctu.edu.tw>
References: <20111030113908.GA18904@cs.nctu.edu.tw>

> I am looking into QEMU's implementation for ARM NEON instructions
> (target-arm/neon_helper.c). Some helper functions will do mask
> operation, neon_add_u8, for example. I thought simply adding a and b
> is enough and can't figure out why the mask operation is needed.

These are SIMD instructions acting on independent data 'lanes' packed
into a larger data item. Lane operations must not interfere with each
other.

> ---
> uint32_t HELPER(neon_add_u8)(uint32_t a, uint32_t b)
> {
>     uint32_t mask;
>1:   mask = (a ^ b) & 0x80808080u;
>2:   a &= ~0x80808080u;
>3:   b &= ~0x80808080u;
>4:   return (a + b) ^ mask;
> }
> ---

In your example there are four 8-bit lanes packed into a 32-bit word.
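If we add the whole 32-bit words, care must be taken to prevent carry
propagation between the lanes. This is done by clearing the top bit of
each 8-bit operand (steps 2 and 3): each lane then holds at most 0x7f,
so a lane sum is at most 0xfe and never carries into the next lane.
The top bits are summed modulo 2 separately (step 1) and then XORed
back into the result (step 4), where bit 7 of each lane already holds
the carry out of its lower seven bits.

Thanks.
-- 
Max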
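
For anyone who wants to try this outside QEMU, here is a minimal
standalone sketch of the steps described above. It is not the QEMU
source: the HELPER() macro is dropped, and the names add_u8x4_naive,
add_u8x4_masked, add_u8x4_reference and self_check are made up for
illustration. The reference function simply adds each lane on its own,
so the masked and naive versions can be compared against it.

```c
#include <assert.h>
#include <stdint.h>

/* Naive version (hypothetical name): one plain 32-bit add.
 * A carry out of one 8-bit lane spills into the lane above it. */
uint32_t add_u8x4_naive(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Masked version (hypothetical name), same arithmetic as the quoted
 * neon_add_u8 helper. */
uint32_t add_u8x4_masked(uint32_t a, uint32_t b)
{
    uint32_t mask = (a ^ b) & 0x80808080u; /* step 1: top bits summed modulo 2 */
    a &= ~0x80808080u;                     /* step 2: clear top bit of each lane of a */
    b &= ~0x80808080u;                     /* step 3: clear top bit of each lane of b */
    return (a + b) ^ mask;                 /* step 4: fold the top bits back in */
}

/* Reference (hypothetical name): add each lane separately and
 * truncate the result to 8 bits. */
uint32_t add_u8x4_reference(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t lane = ((a >> shift) + (b >> shift)) & 0xffu;
        result |= lane << shift;
    }
    return result;
}

/* Lanes of a: 01 ff 80 f0, lanes of b: 01 01 80 20.
 * Three of the four lanes overflow, so the naive add is wrong. */
void self_check(void)
{
    uint32_t a = 0x01ff80f0u, b = 0x01018020u;
    assert(add_u8x4_reference(a, b) == 0x02000010u);
    assert(add_u8x4_masked(a, b) == 0x02000010u);
    assert(add_u8x4_naive(a, b) == 0x03010110u); /* carries leaked upward */
}
```

Calling self_check() from a main() of your own should pass silently.
In the naive result the carries out of lanes 0, 1 and 2 each show up
as an extra 1 in the lane above, which is exactly the interference
the mask prevents.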