[Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform
@ 2016-08-16 12:02 vijay.kilari
  2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: vijay.kilari @ 2016-08-16 12:02 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini, rth
  Cc: p.fedin, qemu-devel, Prasun.Kapoor, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>

The CPU MIDR_EL1 register is exposed to userspace for arm64
with the below patch.
https://lkml.org/lkml/2016/7/8/467

Thunderx platform requires explicit prefetch instruction to
provide prefetch hint. Using MIDR_EL1 information, provided
by above kernel patch, prefetch is executed if the platform
is Thunderx.

The results of live migration time improvement is provided
in commit message of patch 2.

Note: Check for size of while prefetching beyond page is
not added. Making this check is counter productive on
performance of live migration.

v1 => v2:
   - Rename util/cpuinfo.c as util/aarch64-cpuid.c
   - Introduced header file include/qemu/aarch64-cpuid.h
   - Place all arch specific code under define __aarch64__ and
     CONFIG_LINUX.
   - Used builtin_prefetch() to add prefetch instruction.
   - Moved arch specific changes out of generic code
   - Dropped prefetching 5th cache line.

Vijaya Kumar K (2):
  utils: Add helper to read arm MIDR_EL1 register
  utils: Add prefetch for Thunderx platform

 include/qemu/aarch64-cpuid.h |  9 +++++
 util/Makefile.objs           |  1 +
 util/aarch64-cpuid.c         | 94 ++++++++++++++++++++++++++++++++++++++++++++
 util/cutils.c                | 31 +++++++++++++++
 4 files changed, 135 insertions(+)
 create mode 100644 include/qemu/aarch64-cpuid.h
 create mode 100644 util/aarch64-cpuid.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-16 12:02 [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform vijay.kilari
@ 2016-08-16 12:02 ` vijay.kilari
  2016-08-17 13:39   ` Paolo Bonzini
  2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform vijay.kilari
  2016-08-16 16:02 ` [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization " no-reply
  2 siblings, 1 reply; 19+ messages in thread
From: vijay.kilari @ 2016-08-16 12:02 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini, rth
  Cc: p.fedin, qemu-devel, Prasun.Kapoor, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>

Add helper API to read MIDR_EL1 registers to fetch
cpu identification information. This helps in
adding errata's and architecture specific features.

This is implemented only for arm architecture.

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
---
 include/qemu/aarch64-cpuid.h |  9 +++++
 util/Makefile.objs           |  1 +
 util/aarch64-cpuid.c         | 94 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 104 insertions(+)

diff --git a/include/qemu/aarch64-cpuid.h b/include/qemu/aarch64-cpuid.h
new file mode 100644
index 0000000..3c11057
--- /dev/null
+++ b/include/qemu/aarch64-cpuid.h
@@ -0,0 +1,9 @@
+#ifndef QEMU_AARCH64_CPUID_H
+#define QEMU_AARCH64_CPUID_H
+
+#if defined (__aarch64__)
+uint64_t get_aarch64_cpu_id(void);
+bool is_thunderx_pass2_cpu(void);
+#endif
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 96cb1e0..aa07bc3 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -35,3 +35,4 @@ util-obj-y += log.o
 util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
+util-obj-y += aarch64-cpuid.o
diff --git a/util/aarch64-cpuid.c b/util/aarch64-cpuid.c
new file mode 100644
index 0000000..42af704
--- /dev/null
+++ b/util/aarch64-cpuid.c
@@ -0,0 +1,94 @@
+/*
+ * Dealing with arm cpu identification information.
+ *
+ * Copyright (C) 2016 Cavium, Inc.
+ *
+ * Authors:
+ *  Vijaya Kumar K <Vijaya.Kumar@cavium.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1
+ * or later.  See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/cutils.h"
+#include "qemu/aarch64-cpuid.h"
+
+#if defined (__aarch64__)
+#define MIDR_IMPLEMENTER_SHIFT  24
+#define MIDR_IMPLEMENTER_MASK   (0xffULL << MIDR_IMPLEMENTER_SHIFT)
+#define MIDR_ARCHITECTURE_SHIFT 16
+#define MIDR_ARCHITECTURE_MASK  (0xf << MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_PARTNUM_SHIFT      4
+#define MIDR_PARTNUM_MASK       (0xfff << MIDR_PARTNUM_SHIFT)
+
+#define MIDR_CPU_PART(imp, partnum) \
+        (((imp)                 << MIDR_IMPLEMENTER_SHIFT)  | \
+        (0xf                    << MIDR_ARCHITECTURE_SHIFT) | \
+        ((partnum)              << MIDR_PARTNUM_SHIFT))
+
+#define ARM_CPU_IMP_CAVIUM        0x43
+#define CAVIUM_CPU_PART_THUNDERX  0x0A1
+
+#define MIDR_THUNDERX_PASS2  \
+               MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
+#define CPU_MODEL_MASK  (MIDR_IMPLEMENTER_MASK | MIDR_ARCHITECTURE_MASK | \
+                         MIDR_PARTNUM_MASK)
+
+static uint64_t qemu_read_aarch64_midr_el1(void)
+{
+#ifdef CONFIG_LINUX
+    const char *file = "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";
+    char *buf;
+    uint64_t midr = 0;
+
+#define BUF_SIZE 32
+    buf = g_malloc0(BUF_SIZE);
+    if (!buf) {
+        return 0;
+    }
+
+    if (!g_file_get_contents(file, &buf, 0, NULL)) {
+        goto out;
+    }
+
+    if (qemu_strtoull(buf, NULL, 0, &midr) < 0) {
+        goto out;
+    }
+
+out:
+    g_free(buf);
+
+    return midr;
+#else
+    return 0;
+#endif
+}
+
+static bool is_thunderx_cpu;
+static uint64_t aarch64_midr_val;
+uint64_t get_aarch64_cpu_id(void)
+{
+#ifdef CONFIG_LINUX
+    static bool cpu_info_read;
+
+    if (unlikely(!cpu_info_read)) {
+        aarch64_midr_val = qemu_read_aarch64_midr_el1();
+        aarch64_midr_val &= CPU_MODEL_MASK;
+        cpu_info_read = 1;
+        if (aarch64_midr_val == MIDR_THUNDERX_PASS2) {
+            is_thunderx_cpu = 1;
+        }
+    }
+    return aarch64_midr_val;
+#else
+    return 0;
+#endif
+}
+
+bool is_thunderx_pass2_cpu(void)
+{
+   return is_thunderx_cpu;
+}
+#endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform
  2016-08-16 12:02 [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform vijay.kilari
  2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
@ 2016-08-16 12:02 ` vijay.kilari
  2016-08-16 18:02   ` Richard Henderson
  2016-08-16 16:02 ` [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization " no-reply
  2 siblings, 1 reply; 19+ messages in thread
From: vijay.kilari @ 2016-08-16 12:02 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini, rth
  Cc: p.fedin, qemu-devel, Prasun.Kapoor, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>

Thunderx pass2 chip requires explicit prefetch
instruction to give prefetch hint.

To speed up live migration on Thunderx platform,
prefetch instruction is added in zero buffer check
function.

The below results show live migration time improvement
with prefetch instruction with 1K and 4K page size.
VM with 4 VCPUs, 8GB RAM is migrated.

1K page size, no prefetch
=========================
Migration status: completed
total time: 13012 milliseconds
downtime: 10 milliseconds
setup: 15 milliseconds
transferred ram: 268131 kbytes
throughput: 168.84 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 8338072 pages
skipped: 0 pages
normal: 193335 pages
normal bytes: 193335 kbytes
dirty sync count: 4

1K page size with prefetch
=========================
Migration status: completed
total time: 7493 milliseconds
downtime: 71 milliseconds
setup: 16 milliseconds
transferred ram: 269666 kbytes
throughput: 294.88 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 8340596 pages
skipped: 0 pages
normal: 194837 pages
normal bytes: 194837 kbytes
dirty sync count: 3

4K page size with no prefetch
=============================
Migration status: completed
total time: 10456 milliseconds
downtime: 49 milliseconds
setup: 5 milliseconds
transferred ram: 231726 kbytes
throughput: 181.59 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2079914 pages
skipped: 0 pages
normal: 53257 pages
normal bytes: 213028 kbytes
dirty sync count: 3

4K page size with prefetch
==========================
Migration status: completed
total time: 3937 milliseconds
downtime: 23 milliseconds
setup: 5 milliseconds
transferred ram: 229283 kbytes
throughput: 477.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2079775 pages
skipped: 0 pages
normal: 52648 pages
normal bytes: 210592 kbytes
dirty sync count: 3

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
---
 util/cutils.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/util/cutils.c b/util/cutils.c
index 7505fda..342d1e3 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -186,11 +186,14 @@ int qemu_fdatasync(int fd)
 #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
 #elif defined(__aarch64__)
 #include "arm_neon.h"
+#include "qemu/aarch64-cpuid.h"
 #define VECTYPE        uint64x2_t
 #define ALL_EQ(v1, v2) \
         ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
          (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
 #define VEC_OR(v1, v2) ((v1) | (v2))
+#define VEC_PREFETCH(base, index) \
+        __builtin_prefetch(&base[index], 0, 0);
 #else
 #define VECTYPE        unsigned long
 #define SPLAT(p)       (*(p) * (~0UL / 255))
@@ -200,6 +203,29 @@ int qemu_fdatasync(int fd)
 
 #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
 
+static inline void prefetch_vector(const VECTYPE *p, int index)
+{
+#if defined(__aarch64__)
+    get_aarch64_cpu_id();
+    if (is_thunderx_pass2_cpu()) {
+        /* Prefetch first 3 cache lines */
+        VEC_PREFETCH(p, index + BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR);
+        VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 2));
+        VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 3));
+    }
+#endif
+}
+
+static inline void prefetch_vector_loop(const VECTYPE *p, int index)
+{
+#if defined(__aarch64__)
+    if (is_thunderx_pass2_cpu()) {
+        /* Prefetch 4 cache lines ahead from index */
+        VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 4));
+    }
+#endif
+}
+
 static bool
 can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
 {
@@ -246,9 +272,14 @@ static size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
         }
     }
 
+    prefetch_vector(p, 0);
+
     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
          i < len / sizeof(VECTYPE);
          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+
+        prefetch_vector_loop(p, i);
+
         VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
         VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
         VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform
  2016-08-16 12:02 [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform vijay.kilari
  2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
  2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform vijay.kilari
@ 2016-08-16 16:02 ` no-reply
  2016-08-17 13:40   ` Paolo Bonzini
  2 siblings, 1 reply; 19+ messages in thread
From: no-reply @ 2016-08-16 16:02 UTC (permalink / raw)
  To: vijay.kilari
  Cc: famz, qemu-arm, peter.maydell, pbonzini, rth, Prasun.Kapoor,
	p.fedin, qemu-devel, Vijaya.Kumar

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Message-id: 1471348968-4614-1-git-send-email-vijay.kilari@gmail.com
Subject: [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git show --no-patch --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1471348633-22174-1-git-send-email-ehabkost@redhat.com -> patchew/1471348633-22174-1-git-send-email-ehabkost@redhat.com
 * [new tag]         patchew/1471348968-4614-1-git-send-email-vijay.kilari@gmail.com -> patchew/1471348968-4614-1-git-send-email-vijay.kilari@gmail.com
Switched to a new branch 'test'
58a4317 utils: Add prefetch for Thunderx platform
5898005 utils: Add helper to read arm MIDR_EL1 register

=== OUTPUT BEGIN ===
Checking PATCH 1/2: utils: Add helper to read arm MIDR_EL1 register...
ERROR: space prohibited between function name and open parenthesis '('
#24: FILE: include/qemu/aarch64-cpuid.h:4:
+#if defined (__aarch64__)

ERROR: architecture specific defines should be avoided
#24: FILE: include/qemu/aarch64-cpuid.h:4:
+#if defined (__aarch64__)

ERROR: space prohibited between function name and open parenthesis '('
#62: FILE: util/aarch64-cpuid.c:18:
+#if defined (__aarch64__)

ERROR: architecture specific defines should be avoided
#62: FILE: util/aarch64-cpuid.c:18:
+#if defined (__aarch64__)

WARNING: line over 80 characters
#86: FILE: util/aarch64-cpuid.c:42:
+    const char *file = "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";

total: 4 errors, 1 warnings, 107 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 2/2: utils: Add prefetch for Thunderx platform...
ERROR: architecture specific defines should be avoided
#109: FILE: util/cutils.c:208:
+#if defined(__aarch64__)

ERROR: architecture specific defines should be avoided
#122: FILE: util/cutils.c:221:
+#if defined(__aarch64__)

total: 2 errors, 0 warnings, 57 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform
  2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform vijay.kilari
@ 2016-08-16 18:02   ` Richard Henderson
  2016-08-16 23:45     ` Vijay Kilari
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2016-08-16 18:02 UTC (permalink / raw)
  To: vijay.kilari, qemu-arm, peter.maydell, pbonzini
  Cc: p.fedin, qemu-devel, Prasun.Kapoor, Vijaya Kumar K

On 08/16/2016 05:02 AM, vijay.kilari@gmail.com wrote:
> +static inline void prefetch_vector_loop(const VECTYPE *p, int index)
> +{
> +#if defined(__aarch64__)
> +    if (is_thunderx_pass2_cpu()) {
> +        /* Prefetch 4 cache lines ahead from index */
> +        VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 4));
> +    }
> +#endif
> +}

Oh come now.  This is even worse than before.  A function call protecting a 
mere prefetch within the main body of an inner loop?

Did you not understand what I was asking for?


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform
  2016-08-16 18:02   ` Richard Henderson
@ 2016-08-16 23:45     ` Vijay Kilari
  2016-08-17 15:34       ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Kilari @ 2016-08-16 23:45 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-arm, Peter Maydell, Paolo Bonzini, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K

On Tue, Aug 16, 2016 at 11:32 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 08/16/2016 05:02 AM, vijay.kilari@gmail.com wrote:
>>
>> +static inline void prefetch_vector_loop(const VECTYPE *p, int index)
>> +{
>> +#if defined(__aarch64__)
>> +    if (is_thunderx_pass2_cpu()) {
>> +        /* Prefetch 4 cache lines ahead from index */
>> +        VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> * 4));
>> +    }
>> +#endif
>> +}
>
>
> Oh come now.  This is even worse than before.  A function call protecting a
> mere prefetch within the main body of an inner loop?
>
> Did you not understand what I was asking for?

No, Could you please detail the problem?.

>
>
> r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
@ 2016-08-17 13:39   ` Paolo Bonzini
  2016-08-18  7:56     ` Vijay Kilari
  0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2016-08-17 13:39 UTC (permalink / raw)
  To: vijay.kilari, qemu-arm, peter.maydell, rth
  Cc: p.fedin, qemu-devel, Prasun.Kapoor, Vijaya Kumar K



On 16/08/2016 14:02, vijay.kilari@gmail.com wrote:
> From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> 
> Add helper API to read MIDR_EL1 registers to fetch
> cpu identification information. This helps in
> adding errata's and architecture specific features.
> 
> This is implemented only for arm architecture.
> 
> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> ---
>  include/qemu/aarch64-cpuid.h |  9 +++++
>  util/Makefile.objs           |  1 +
>  util/aarch64-cpuid.c         | 94 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 104 insertions(+)
> 
> diff --git a/include/qemu/aarch64-cpuid.h b/include/qemu/aarch64-cpuid.h
> new file mode 100644
> index 0000000..3c11057
> --- /dev/null
> +++ b/include/qemu/aarch64-cpuid.h
> @@ -0,0 +1,9 @@
> +#ifndef QEMU_AARCH64_CPUID_H
> +#define QEMU_AARCH64_CPUID_H
> +
> +#if defined (__aarch64__)
> +uint64_t get_aarch64_cpu_id(void);
> +bool is_thunderx_pass2_cpu(void);
> +#endif
> +
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 96cb1e0..aa07bc3 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -35,3 +35,4 @@ util-obj-y += log.o
>  util-obj-y += qdist.o
>  util-obj-y += qht.o
>  util-obj-y += range.o
> +util-obj-y += aarch64-cpuid.o
> diff --git a/util/aarch64-cpuid.c b/util/aarch64-cpuid.c
> new file mode 100644
> index 0000000..42af704
> --- /dev/null
> +++ b/util/aarch64-cpuid.c
> @@ -0,0 +1,94 @@
> +/*
> + * Dealing with arm cpu identification information.
> + *
> + * Copyright (C) 2016 Cavium, Inc.
> + *
> + * Authors:
> + *  Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1
> + * or later.  See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/cutils.h"
> +#include "qemu/aarch64-cpuid.h"
> +
> +#if defined (__aarch64__)
> +#define MIDR_IMPLEMENTER_SHIFT  24
> +#define MIDR_IMPLEMENTER_MASK   (0xffULL << MIDR_IMPLEMENTER_SHIFT)
> +#define MIDR_ARCHITECTURE_SHIFT 16
> +#define MIDR_ARCHITECTURE_MASK  (0xf << MIDR_ARCHITECTURE_SHIFT)
> +#define MIDR_PARTNUM_SHIFT      4
> +#define MIDR_PARTNUM_MASK       (0xfff << MIDR_PARTNUM_SHIFT)
> +
> +#define MIDR_CPU_PART(imp, partnum) \
> +        (((imp)                 << MIDR_IMPLEMENTER_SHIFT)  | \
> +        (0xf                    << MIDR_ARCHITECTURE_SHIFT) | \
> +        ((partnum)              << MIDR_PARTNUM_SHIFT))
> +
> +#define ARM_CPU_IMP_CAVIUM        0x43
> +#define CAVIUM_CPU_PART_THUNDERX  0x0A1
> +
> +#define MIDR_THUNDERX_PASS2  \
> +               MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
> +#define CPU_MODEL_MASK  (MIDR_IMPLEMENTER_MASK | MIDR_ARCHITECTURE_MASK | \
> +                         MIDR_PARTNUM_MASK)
> +
> +static uint64_t qemu_read_aarch64_midr_el1(void)
> +{
> +#ifdef CONFIG_LINUX
> +    const char *file = "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";
> +    char *buf;
> +    uint64_t midr = 0;
> +
> +#define BUF_SIZE 32
> +    buf = g_malloc0(BUF_SIZE);
> +    if (!buf) {
> +        return 0;
> +    }
> +
> +    if (!g_file_get_contents(file, &buf, 0, NULL)) {
> +        goto out;
> +    }
> +
> +    if (qemu_strtoull(buf, NULL, 0, &midr) < 0) {
> +        goto out;
> +    }
> +
> +out:
> +    g_free(buf);
> +
> +    return midr;
> +#else
> +    return 0;
> +#endif
> +}
> +
> +static bool is_thunderx_cpu;
> +static uint64_t aarch64_midr_val;
> +uint64_t get_aarch64_cpu_id(void)
> +{
> +#ifdef CONFIG_LINUX
> +    static bool cpu_info_read;
> +
> +    if (unlikely(!cpu_info_read)) {
> +        aarch64_midr_val = qemu_read_aarch64_midr_el1();
> +        aarch64_midr_val &= CPU_MODEL_MASK;
> +        cpu_info_read = 1;
> +        if (aarch64_midr_val == MIDR_THUNDERX_PASS2) {
> +            is_thunderx_cpu = 1;
> +        }
> +    }
> +    return aarch64_midr_val;
> +#else
> +    return 0;
> +#endif
> +}
> +
> +bool is_thunderx_pass2_cpu(void)
> +{
> +   return is_thunderx_cpu;

This can be:

   return get_aarch64_cpu_id() == MIDR_THUNDERX_PASS2;

without the is_thunderx_cpu variable.

Paolo

> +}
> +#endif
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform
  2016-08-16 16:02 ` [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization " no-reply
@ 2016-08-17 13:40   ` Paolo Bonzini
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2016-08-17 13:40 UTC (permalink / raw)
  To: qemu-devel, vijay.kilari
  Cc: famz, qemu-arm, peter.maydell, rth, Prasun.Kapoor, p.fedin,
	Vijaya.Kumar



On 16/08/2016 18:02, no-reply@ec2-52-6-146-230.compute-1.amazonaws.com
wrote:
> ERROR: architecture specific defines should be avoided
> #24: FILE: include/qemu/aarch64-cpuid.h:4:
> +#if defined (__aarch64__)
> 
> ERROR: architecture specific defines should be avoided
> #62: FILE: util/aarch64-cpuid.c:18:
> +#if defined (__aarch64__)

You can ignore these two, but the others have to be fixed (just use #ifdef).

Paolo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform
  2016-08-16 23:45     ` Vijay Kilari
@ 2016-08-17 15:34       ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2016-08-17 15:34 UTC (permalink / raw)
  To: Vijay Kilari
  Cc: qemu-arm, Peter Maydell, Paolo Bonzini, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K

On 08/16/2016 04:45 PM, Vijay Kilari wrote:
> On Tue, Aug 16, 2016 at 11:32 PM, Richard Henderson <rth@twiddle.net> wrote:
>> On 08/16/2016 05:02 AM, vijay.kilari@gmail.com wrote:
>>>
>>> +static inline void prefetch_vector_loop(const VECTYPE *p, int index)
>>> +{
>>> +#if defined(__aarch64__)
>>> +    if (is_thunderx_pass2_cpu()) {
>>> +        /* Prefetch 4 cache lines ahead from index */
>>> +        VEC_PREFETCH(p, index + (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>> * 4));
>>> +    }
>>> +#endif
>>> +}
>>
>>
>> Oh come now.  This is even worse than before.  A function call protecting a
>> mere prefetch within the main body of an inner loop?
>>
>> Did you not understand what I was asking for?
>
> No, Could you please detail the problem?.

The thunderx check, *if it even needs to exist at all*, must happen outside the 
loop.  Preferably not more than once, at startup time.

I strongly suspect that you do not need any check at all.  That even for cpus 
which automatically detect the streaming loop, adding a prefetch will not hurt.

You should repeat your same benchmark, with and without the prefetch, on (1) an 
A57 or suchlike, and (2) an x86 of some variety.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-17 13:39   ` Paolo Bonzini
@ 2016-08-18  7:56     ` Vijay Kilari
  2016-08-18  8:50       ` Paolo Bonzini
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Kilari @ 2016-08-18  7:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-arm, Peter Maydell, Richard Henderson, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K

On Wed, Aug 17, 2016 at 7:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 16/08/2016 14:02, vijay.kilari@gmail.com wrote:
>> From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>>
>> Add helper API to read MIDR_EL1 registers to fetch
>> cpu identification information. This helps in
>> adding errata's and architecture specific features.
>>
>> This is implemented only for arm architecture.
>>
>> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>> ---
>>  include/qemu/aarch64-cpuid.h |  9 +++++
>>  util/Makefile.objs           |  1 +
>>  util/aarch64-cpuid.c         | 94 ++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 104 insertions(+)
>>
>> diff --git a/include/qemu/aarch64-cpuid.h b/include/qemu/aarch64-cpuid.h
>> new file mode 100644
>> index 0000000..3c11057
>> --- /dev/null
>> +++ b/include/qemu/aarch64-cpuid.h
>> @@ -0,0 +1,9 @@
>> +#ifndef QEMU_AARCH64_CPUID_H
>> +#define QEMU_AARCH64_CPUID_H
>> +
>> +#if defined (__aarch64__)
>> +uint64_t get_aarch64_cpu_id(void);
>> +bool is_thunderx_pass2_cpu(void);
>> +#endif
>> +
>> +#endif
>> diff --git a/util/Makefile.objs b/util/Makefile.objs
>> index 96cb1e0..aa07bc3 100644
>> --- a/util/Makefile.objs
>> +++ b/util/Makefile.objs
>> @@ -35,3 +35,4 @@ util-obj-y += log.o
>>  util-obj-y += qdist.o
>>  util-obj-y += qht.o
>>  util-obj-y += range.o
>> +util-obj-y += aarch64-cpuid.o
>> diff --git a/util/aarch64-cpuid.c b/util/aarch64-cpuid.c
>> new file mode 100644
>> index 0000000..42af704
>> --- /dev/null
>> +++ b/util/aarch64-cpuid.c
>> @@ -0,0 +1,94 @@
>> +/*
>> + * Dealing with arm cpu identification information.
>> + *
>> + * Copyright (C) 2016 Cavium, Inc.
>> + *
>> + * Authors:
>> + *  Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2.1
>> + * or later.  See the COPYING.LIB file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +#include "qemu/cutils.h"
>> +#include "qemu/aarch64-cpuid.h"
>> +
>> +#if defined (__aarch64__)
>> +#define MIDR_IMPLEMENTER_SHIFT  24
>> +#define MIDR_IMPLEMENTER_MASK   (0xffULL << MIDR_IMPLEMENTER_SHIFT)
>> +#define MIDR_ARCHITECTURE_SHIFT 16
>> +#define MIDR_ARCHITECTURE_MASK  (0xf << MIDR_ARCHITECTURE_SHIFT)
>> +#define MIDR_PARTNUM_SHIFT      4
>> +#define MIDR_PARTNUM_MASK       (0xfff << MIDR_PARTNUM_SHIFT)
>> +
>> +#define MIDR_CPU_PART(imp, partnum) \
>> +        (((imp)                 << MIDR_IMPLEMENTER_SHIFT)  | \
>> +        (0xf                    << MIDR_ARCHITECTURE_SHIFT) | \
>> +        ((partnum)              << MIDR_PARTNUM_SHIFT))
>> +
>> +#define ARM_CPU_IMP_CAVIUM        0x43
>> +#define CAVIUM_CPU_PART_THUNDERX  0x0A1
>> +
>> +#define MIDR_THUNDERX_PASS2  \
>> +               MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
>> +#define CPU_MODEL_MASK  (MIDR_IMPLEMENTER_MASK | MIDR_ARCHITECTURE_MASK | \
>> +                         MIDR_PARTNUM_MASK)
>> +
>> +static uint64_t qemu_read_aarch64_midr_el1(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    const char *file = "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";
>> +    char *buf;
>> +    uint64_t midr = 0;
>> +
>> +#define BUF_SIZE 32
>> +    buf = g_malloc0(BUF_SIZE);
>> +    if (!buf) {
>> +        return 0;
>> +    }
>> +
>> +    if (!g_file_get_contents(file, &buf, 0, NULL)) {
>> +        goto out;
>> +    }
>> +
>> +    if (qemu_strtoull(buf, NULL, 0, &midr) < 0) {
>> +        goto out;
>> +    }
>> +
>> +out:
>> +    g_free(buf);
>> +
>> +    return midr;
>> +#else
>> +    return 0;
>> +#endif
>> +}
>> +
>> +static bool is_thunderx_cpu;
>> +static uint64_t aarch64_midr_val;
>> +uint64_t get_aarch64_cpu_id(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    static bool cpu_info_read;
>> +
>> +    if (unlikely(!cpu_info_read)) {
>> +        aarch64_midr_val = qemu_read_aarch64_midr_el1();
>> +        aarch64_midr_val &= CPU_MODEL_MASK;
>> +        cpu_info_read = 1;
>> +        if (aarch64_midr_val == MIDR_THUNDERX_PASS2) {
>> +            is_thunderx_cpu = 1;
>> +        }
>> +    }
>> +    return aarch64_midr_val;
>> +#else
>> +    return 0;
>> +#endif
>> +}
>> +
>> +bool is_thunderx_pass2_cpu(void)
>> +{
>> +   return is_thunderx_cpu;
>
> This can be:
>
>    return get_aarch64_cpu_id() == MIDR_THUNDERX_PASS2;
>
> without the is_thunderx_cpu variable.

The get_aarch_cpu_id() has check " if (unlikely(!cpu_info_read)) ".
If we call get_aarch_cpu_id() from is_thunderx_pass2_cpu() which is
called from inside the loop, we will be adding one additional check.

What I observed is having extra check inside the loop is adding 100 to
200ms overhead
on live migration time. So I added this variable extra is_thunderx_cpu
static variable
to make it simple single check.

>
> Paolo
>
>> +}
>> +#endif
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18  7:56     ` Vijay Kilari
@ 2016-08-18  8:50       ` Paolo Bonzini
  2016-08-18  9:01         ` Vijay Kilari
  0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2016-08-18  8:50 UTC (permalink / raw)
  To: Vijay Kilari
  Cc: qemu-arm, Peter Maydell, Richard Henderson, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K



On 18/08/2016 09:56, Vijay Kilari wrote:
> The get_aarch_cpu_id() has check " if (unlikely(!cpu_info_read)) ".
> If we call get_aarch_cpu_id() from is_thunderx_pass2_cpu() which is
> called from inside the loop, we will be adding one additional check.

On the other hand, you are making an assumption that the caller of
is_thunderx_pass2_cpu() calls get_aarch64_cpu_id() first, and not
documenting it anywhere.

And given that you shouldn't call _any_ function from inside such a hot
loop, your solution is inferior on both counts.

Paolo

> What I observed is having extra check inside the loop is adding 100 to
> 200ms overhead
> on live migration time. So I added this variable extra is_thunderx_cpu
> static variable
> to make it simple single check.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18  8:50       ` Paolo Bonzini
@ 2016-08-18  9:01         ` Vijay Kilari
  2016-08-18  9:39           ` Paolo Bonzini
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Kilari @ 2016-08-18  9:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-arm, Peter Maydell, Richard Henderson, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K

On Thu, Aug 18, 2016 at 2:20 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 18/08/2016 09:56, Vijay Kilari wrote:
>> The get_aarch_cpu_id() has check " if (unlikely(!cpu_info_read)) ".
>> If we call get_aarch_cpu_id() from is_thunderx_pass2_cpu() which is
>> called from inside the loop, we will be adding one additional check.
>
> On the other hand, you are making an assumption that the caller of
> is_thunderx_pass2_cpu() calls get_aarch64_cpu_id() first, and not
> documenting it anywhere.
>
> And given that you shouldn't call _any_ function from inside such a hot
> loop, your solution is inferior on both counts.

Yes, but I could not think of better way to get rid of this check. However
as Richard suggested (in another email), to drop this check and let prefetch
be called for all the arm64 architectures. But I don't have any other
arm64 platform
to check the impact of it.

>
> Paolo
>
>> What I observed is having extra check inside the loop is adding 100 to
>> 200ms overhead
>> on live migration time. So I added this variable extra is_thunderx_cpu
>> static variable
>> to make it simple single check.
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18  9:01         ` Vijay Kilari
@ 2016-08-18  9:39           ` Paolo Bonzini
  2016-08-18 14:04             ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2016-08-18  9:39 UTC (permalink / raw)
  To: Vijay Kilari
  Cc: qemu-arm, Peter Maydell, Richard Henderson, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K



On 18/08/2016 11:01, Vijay Kilari wrote:
> On Thu, Aug 18, 2016 at 2:20 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 18/08/2016 09:56, Vijay Kilari wrote:
>>> The get_aarch_cpu_id() has check " if (unlikely(!cpu_info_read)) ".
>>> If we call get_aarch_cpu_id() from is_thunderx_pass2_cpu() which is
>>> called from inside the loop, we will be adding one additional check.
>>
>> On the other hand, you are making an assumption that the caller of
>> is_thunderx_pass2_cpu() calls get_aarch64_cpu_id() first, and not
>> documenting it anywhere.
>>
>> And given that you shouldn't call _any_ function from inside such a hot
>> loop, your solution is inferior on both counts.
> 
> Yes, but I could not think of better way to get rid of this check.

    bool need_aa64_prefetch = is_thunderx_pass2();
    for (...) {
         if (need_aa64_prefetch) {
             ...
         }
    }

The check on cpu_info_read is done just once.

Paolo

 However
> as Richard suggested (in another email), to drop this check and let prefetch
> be called for all the arm64 architectures. But I don't have any other
> arm64 platform
> to check the impact of it.
> 
>>
>> Paolo
>>
>>> What I observed is having extra check inside the loop is adding 100 to
>>> 200ms overhead
>>> on live migration time. So I added this variable extra is_thunderx_cpu
>>> static variable
>>> to make it simple single check.
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18  9:39           ` Paolo Bonzini
@ 2016-08-18 14:04             ` Richard Henderson
  2016-08-18 14:14               ` Peter Maydell
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2016-08-18 14:04 UTC (permalink / raw)
  To: Paolo Bonzini, Vijay Kilari
  Cc: qemu-arm, Peter Maydell, Pavel Fedin, QEMU Developers,
	prasun.kapoor, Vijaya Kumar K

On 08/18/2016 02:39 AM, Paolo Bonzini wrote:
>
>
> On 18/08/2016 11:01, Vijay Kilari wrote:
>> On Thu, Aug 18, 2016 at 2:20 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 18/08/2016 09:56, Vijay Kilari wrote:
>>>> The get_aarch_cpu_id() has check " if (unlikely(!cpu_info_read)) ".
>>>> If we call get_aarch_cpu_id() from is_thunderx_pass2_cpu() which is
>>>> called from inside the loop, we will be adding one additional check.
>>>
>>> On the other hand, you are making an assumption that the caller of
>>> is_thunderx_pass2_cpu() calls get_aarch64_cpu_id() first, and not
>>> documenting it anywhere.
>>>
>>> And given that you shouldn't call _any_ function from inside such a hot
>>> loop, your solution is inferior on both counts.
>>
>> Yes, but I could not think of better way to get rid of this check.
>
>     bool need_aa64_prefetch = is_thunderx_pass2();
>     for (...) {
>          if (need_aa64_prefetch) {
>              ...
>          }
>     }
>
> The check on cpu_info_read is done just once.

Supposing a check is required at all, this is still inferior to either

(1) If completely outside the loop,

   if (is_thunderx_pass2()) {
       for (...)
         ...
   } else {
       for (...)
   }

or (2) ifunc, so that we only check once, not every invocation.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18 14:04             ` Richard Henderson
@ 2016-08-18 14:14               ` Peter Maydell
  2016-08-18 14:46                 ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Maydell @ 2016-08-18 14:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Paolo Bonzini, Vijay Kilari, qemu-arm, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K

On 18 August 2016 at 15:04, Richard Henderson <rth@twiddle.net> wrote:
> or (2) ifunc

While we're on the subject, can somebody explain to me why we
use ifuncs at all? I couldn't work out why it would be better than
just using a straightforward function pointer -- when I tried single
stepping through things the ifunc approach still seemed to indirect
through some table or other so it wasn't actually resolving to
a direct function call anyway.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18 14:14               ` Peter Maydell
@ 2016-08-18 14:46                 ` Richard Henderson
  2016-08-18 14:56                   ` Peter Maydell
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2016-08-18 14:46 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Paolo Bonzini, Vijay Kilari, qemu-arm, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K

On 08/18/2016 07:14 AM, Peter Maydell wrote:
> On 18 August 2016 at 15:04, Richard Henderson <rth@twiddle.net> wrote:
>> or (2) ifunc
>
> While we're on the subject, can somebody explain to me why we
> use ifuncs at all? I couldn't work out why it would be better than
> just using a straightforward function pointer -- when I tried single
> stepping through things the ifunc approach still seemed to indirect
> through some table or other so it wasn't actually resolving to
> a direct function call anyway.

No reason, I suppose.

It's particularly helpful for libraries, where we don't really want the 
overhead of the initialization when it's not used.

But (1) we don't have many of these and (2) we really don't care *that* much 
about startup time.

So a simple function pointer initialized by a constructor has the same effect.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18 14:46                 ` Richard Henderson
@ 2016-08-18 14:56                   ` Peter Maydell
  2016-08-19  9:05                     ` Vijay Kilari
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Maydell @ 2016-08-18 14:56 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Paolo Bonzini, Vijay Kilari, qemu-arm, Pavel Fedin,
	QEMU Developers, prasun.kapoor, Vijaya Kumar K

On 18 August 2016 at 15:46, Richard Henderson <rth@twiddle.net> wrote:
> On 08/18/2016 07:14 AM, Peter Maydell wrote:
>> While we're on the subject, can somebody explain to me why we
>> use ifuncs at all? I couldn't work out why it would be better than
>> just using a straightforward function pointer -- when I tried single
>> stepping through things the ifunc approach still seemed to indirect
>> through some table or other so it wasn't actually resolving to
>> a direct function call anyway.

> No reason, I suppose.
>
> It's particularly helpful for libraries, where we don't really want the
> overhead of the initialization when it's not used.

Ah, I see.

> But (1) we don't have many of these and (2) we really don't care *that* much
> about startup time.
>
> So a simple function pointer initialized by a constructor has the same
> effect.

That seems like it would be a worthwhile change since
(a) I think it's easier to understand than ifunc magic
(b) it means we don't unnecessarily restrict ourselves to a libc
with ifunc support (musl libc doesn't do ifuncs, for instance)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-18 14:56                   ` Peter Maydell
@ 2016-08-19  9:05                     ` Vijay Kilari
  2016-08-19 14:57                       ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Kilari @ 2016-08-19  9:05 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, Paolo Bonzini, qemu-arm, Pavel Fedin,
	QEMU Developers, Vijaya Kumar K

On Thu, Aug 18, 2016 at 8:26 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 18 August 2016 at 15:46, Richard Henderson <rth@twiddle.net> wrote:
>> On 08/18/2016 07:14 AM, Peter Maydell wrote:
>>> While we're on the subject, can somebody explain to me why we
>>> use ifuncs at all? I couldn't work out why it would be better than
>>> just using a straightforward function pointer -- when I tried single
>>> stepping through things the ifunc approach still seemed to indirect
>>> through some table or other so it wasn't actually resolving to
>>> a direct function call anyway.
>
>> No reason, I suppose.
>>
>> It's particularly helpful for libraries, where we don't really want the
>> overhead of the initialization when it's not used.
>
> Ah, I see.
>
>> But (1) we don't have many of these and (2) we really don't care *that* much
>> about startup time.
>>
>> So a simple function pointer initialized by a constructor has the same
>> effect.
>

 The cutils does not have any initialization function that can init
function/constructor pointer
for zero_check function.

Also creating separate function with most of repeated code for prefetch does
not look good. So suggest to put check for prefetch outside the for loop and
code for loop with and without prefetch

I profiled and found that a single check inside the loop is adding 100ms delay
for 8GB RAM migration. So moving check outside the loop is enough.

Ex:

   if (need_prefetch()) {

       prefetch_vector(p, 0);

        for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
             i < len / sizeof(VECTYPE);
             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {

            prefetch_vector_loop(p, i);

            VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
            VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
            VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
            VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
           VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
           VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
            if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
                break;
            }
        }

} else {

        for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
             i < len / sizeof(VECTYPE);
             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {

            VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
            VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
            VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
            VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
           VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
           VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
            if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
                break;
            }
        }
}

Also,  If you want to make prefetch common for all arm64 platforms,
Then thunder cache line is 128 bytes so the prefetch is performed
at 128 byte index. If the platform has 64 byte cache line, then this
prefetch will fill only 64 byte line instead of 128 bytes required for the loop.

> That seems like it would be a worthwhile change since
> (a) I think it's easier to understand than ifunc magic
> (b) it means we don't unnecessarily restrict ourselves to a libc
> with ifunc support (musl libc doesn't do ifuncs, for instance)
>
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
  2016-08-19  9:05                     ` Vijay Kilari
@ 2016-08-19 14:57                       ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2016-08-19 14:57 UTC (permalink / raw)
  To: Vijay Kilari, Peter Maydell
  Cc: Paolo Bonzini, qemu-arm, Pavel Fedin, QEMU Developers,
	Vijaya Kumar K

On 08/19/2016 02:05 AM, Vijay Kilari wrote:
> On Thu, Aug 18, 2016 at 8:26 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 18 August 2016 at 15:46, Richard Henderson <rth@twiddle.net> wrote:
>>> On 08/18/2016 07:14 AM, Peter Maydell wrote:
>>>> While we're on the subject, can somebody explain to me why we
>>>> use ifuncs at all? I couldn't work out why it would be better than
>>>> just using a straightforward function pointer -- when I tried single
>>>> stepping through things the ifunc approach still seemed to indirect
>>>> through some table or other so it wasn't actually resolving to
>>>> a direct function call anyway.
>>
>>> No reason, I suppose.
>>>
>>> It's particularly helpful for libraries, where we don't really want the
>>> overhead of the initialization when it's not used.
>>
>> Ah, I see.
>>
>>> But (1) we don't have many of these and (2) we really don't care *that* much
>>> about startup time.
>>>
>>> So a simple function pointer initialized by a constructor has the same
>>> effect.
>>
>
>  The cutils does not have any initialization function that can init
> function/constructor pointer
> for zero_check function.

static void __attribute__((constructor)) init_buffer_find_nonzero(void)
{
    ...
}

> Also creating separate function with most of repeated code for prefetch does
> not look good.

Why do you say that?

> So suggest to put check for prefetch outside the for loop and
> code for loop with and without prefetch

You're duplicating the inner loop either way, so that can't be your objection 
to creating a separate function.

> I profiled and found that a single check inside the loop is adding 100ms delay
> for 8GB RAM migration.

That's about what I expected.

> Also,  If you want to make prefetch common for all arm64 platforms,
> Then thunder cache line is 128 bytes so the prefetch is performed
> at 128 byte index. If the platform has 64 byte cache line, then this
> prefetch will fill only 64 byte line instead of 128 bytes required for the loop.

Yes, I had thought of that.

It would make sense to create two versions, that prefetch for and iterate over, 
cacheline sizes of 64 and 128 (I don't know of any other common sizes).

Preferably, we should then use sysconf(_SC_LEVEL1_DCACHE_LINESIZE) within the 
init function above to choose the appropriate version.

But I see that glibc doesn't currently implement that for aarch64, so we do 
want to have a fallback.  I know that the "official" cache line data isn't 
(easily) available to userspace, but a close proxy is the size described by 
dczid_el0.  That seems much better than groveling through a file under /sys.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-08-19 14:57 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-08-16 12:02 [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization for Thunderx platform vijay.kilari
2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
2016-08-17 13:39   ` Paolo Bonzini
2016-08-18  7:56     ` Vijay Kilari
2016-08-18  8:50       ` Paolo Bonzini
2016-08-18  9:01         ` Vijay Kilari
2016-08-18  9:39           ` Paolo Bonzini
2016-08-18 14:04             ` Richard Henderson
2016-08-18 14:14               ` Peter Maydell
2016-08-18 14:46                 ` Richard Henderson
2016-08-18 14:56                   ` Peter Maydell
2016-08-19  9:05                     ` Vijay Kilari
2016-08-19 14:57                       ` Richard Henderson
2016-08-16 12:02 ` [Qemu-devel] [RFC PATCH v2 2/2] utils: Add prefetch for Thunderx platform vijay.kilari
2016-08-16 18:02   ` Richard Henderson
2016-08-16 23:45     ` Vijay Kilari
2016-08-17 15:34       ` Richard Henderson
2016-08-16 16:02 ` [Qemu-devel] [RFC PATCH v2 0/2] Live migration optimization " no-reply
2016-08-17 13:40   ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).