From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:60888)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1UFN7r-0003iX-K5
	for qemu-devel@nongnu.org; Tue, 12 Mar 2013 07:12:18 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1UFN7m-0000rI-Oq
	for qemu-devel@nongnu.org; Tue, 12 Mar 2013 07:12:11 -0400
Received: from mx1.redhat.com ([209.132.183.28]:2459)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1UFN7m-0000rA-Gh
	for qemu-devel@nongnu.org; Tue, 12 Mar 2013 07:12:06 -0400
Message-ID: <513F0D7E.4010901@redhat.com>
Date: Tue, 12 Mar 2013 12:11:58 +0100
From: Paolo Bonzini <pbonzini@redhat.com>
MIME-Version: 1.0
References: <513F08BF.4040209@dlhnet.de>
In-Reply-To: <513F08BF.4040209@dlhnet.de>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Peter Lieven <pl@dlhnet.de>
Cc: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@gmail.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, peter.maydell@linaro.org

On 12/03/2013 11:51, Peter Lieven wrote:
> Hi,
> 
> a second patch to optimize live migration. I have generated some
> artificial load
> testing for zero pages. Ordinary dup or non dup pages are not affected.
> 
> savings for zero pages (test case):
>  non SSE2:    30s -> 26s
>  SSE2:        27s -> 21s
> 
> optionally I would suggest optimizing buffer_is_zero to use SSE2 if addr
> is 16 byte aligned and length is 128 byte aligned.
> in this case bdrv functions could also benefit from it.
> 
> Peter
> 
> diff --git a/arch_init.c b/arch_init.c
> index 98e2bc6..e1051e6 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -164,9 +164,37 @@ int qemu_read_default_config_files(bool userconfig)
>      return 0;
>  }
> 
> -static int is_dup_page(uint8_t *page)
> +#if __SSE2__
> +static int is_zero_page_sse2(u_int8_t *page)
>  {
>      VECTYPE *p = (VECTYPE *)page;
> +    VECTYPE zero = _mm_setzero_si128();
> +    int i;
> +    for (i = 0; i < (TARGET_PAGE_SIZE / sizeof(VECTYPE)); i+=8) {
> +               VECTYPE tmp0 = _mm_or_si128(p[i+0],p[i+1]);
> +               VECTYPE tmp1 = _mm_or_si128(p[i+2],p[i+3]);
> +               VECTYPE tmp2 = _mm_or_si128(p[i+4],p[i+5]);
> +               VECTYPE tmp3 = _mm_or_si128(p[i+6],p[i+7]);
> +               VECTYPE tmp01 = _mm_or_si128(tmp0,tmp1);
> +               VECTYPE tmp23 = _mm_or_si128(tmp2,tmp3);

You can use the normal C "|" operator here instead of _mm_or_si128; the
result will then be portable to Altivec or !SSE2 as well.
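
A minimal sketch of that idea, assuming a compiler with the GCC/Clang
vector extensions.  The names vec128, PAGE_BYTES and
page_is_zero_generic are made up for illustration and are not QEMU's
VECTYPE or TARGET_PAGE_SIZE; the page is assumed to be 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VECTYPE: a 128-bit vector of two 64-bit
 * lanes.  may_alias lets us read a byte buffer through this type. */
typedef uint64_t vec128 __attribute__((vector_size(16), may_alias));

#define PAGE_BYTES 4096 /* assumed page size for this sketch */

/* Returns 1 if the (16-byte aligned) page is all zeroes, else 0. */
static int page_is_zero_generic(const void *page)
{
    const vec128 *p = (const vec128 *)page;
    size_t i;

    /* Same 8-vector (128-byte) unrolling as the patch, but using the
     * plain "|" operator so the compiler picks SSE2, Altivec or
     * scalar code as appropriate. */
    for (i = 0; i < PAGE_BYTES / sizeof(vec128); i += 8) {
        vec128 t0 = p[i + 0] | p[i + 1];
        vec128 t1 = p[i + 2] | p[i + 3];
        vec128 t2 = p[i + 4] | p[i + 5];
        vec128 t3 = p[i + 6] | p[i + 7];
        vec128 all = (t0 | t1) | (t2 | t3);

        /* Reduce the two lanes to a scalar and test for any set bit. */
        if ((all[0] | all[1]) != 0) {
            return 0;
        }
    }
    return 1;
}
```

Whether this generates code as good as the explicit intrinsics would
have to be checked on each target.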

The problem is that find_zero_bit has a known case in which there are a
lot of zero bytes, namely the final passes of migration.  For
is_dup_page, on the other hand, it is reasonable to assume that:

* zero pages remain zero, and thus are only processed once

* non-zero pages are modified often, and thus are processed multiple times.

Your patch adds overhead in the case where a page is non-zero, which
will be the common case in any non-artificial benchmark.  It _is_
possible that the net result is positive because you warm the cache with
the first 128 bytes of the page.  But without more benchmarking, it is
reasonable to optimize is_dup_page for the case where the for loop runs
for only a few iterations.
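
To illustrate what "few iterations" means, here is a rough scalar
sketch of a duplicate-page check that bails out at the first mismatch.
It is not the code in arch_init.c; SKETCH_PAGE_SIZE and
is_dup_page_sketch are hypothetical names, and it uses 64-bit words
rather than VECTYPE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size for this sketch */

/* Returns 1 if every byte of the page equals the first byte, else 0. */
static int is_dup_page_sketch(const uint8_t *page)
{
    /* Broadcast the first byte into all eight bytes of a word. */
    const uint64_t pattern = page[0] * UINT64_C(0x0101010101010101);
    size_t off;

    for (off = 0; off < SKETCH_PAGE_SIZE; off += sizeof(uint64_t)) {
        uint64_t word;

        /* memcpy avoids alignment and aliasing problems. */
        memcpy(&word, page + off, sizeof(word));

        /* A typical non-dup page differs early, so the loop usually
         * exits after one or two iterations. */
        if (word != pattern) {
            return 0;
        }
    }
    return 1;
}
```

For a page with real data the cost is one or two compares, while a
version that always ORs 128 bytes before testing pays for the whole
block even when the first word already decides the answer.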

Paolo