From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57712)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1ZOw5V-0005aP-0Z
	for qemu-devel@nongnu.org; Mon, 10 Aug 2015 19:02:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1ZOw5R-0001nA-Pj
	for qemu-devel@nongnu.org; Mon, 10 Aug 2015 19:02:36 -0400
Received: from greensocs.com ([193.104.36.180]:34923)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1ZOw5R-0001mg-Ar
	for qemu-devel@nongnu.org; Mon, 10 Aug 2015 19:02:33 -0400
Message-ID: <55C92D7E.80604@greensocs.com>
Date: Tue, 11 Aug 2015 01:02:22 +0200
From: Frederic Konrad <fred.konrad@greensocs.com>
MIME-Version: 1.0
References: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com>
	<87bnefgfgx.fsf@linaro.org>
In-Reply-To: <87bnefgfgx.fsf@linaro.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: =?UTF-8?B?QWxleCBCZW5uw6ll?= <alex.bennee@linaro.org>
Cc: mttcg@listserver.greensocs.com, mark.burton@greensocs.com, qemu-devel@nongnu.org, a.rigo@virtualopensystems.com, guillaume.delbergue@greensocs.com, pbonzini@redhat.com

On 10/08/2015 20:34, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This is the 7th round of the MTTCG patch series.
>>
>>
>> It can be cloned from:
>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
> I'm not seeing this yet, did you remember to push?
Oops, sorry, done!
>
>> This patch set tries to address the different issues in the global picture of
>> MTTCG, presented on the wiki.
>>
>> == Needed patch for our work ==
>>
>> Some preliminaries are needed for our work:
>>   * current_cpu doesn't make sense in MTTCG, so a tcg_executing flag is added to
>>     the CPUState.
>>   * We need to run some work safely when all VCPUs are outside their execution
>>     loop. This is done with the async_run_safe_work_on_cpu function introduced
>>     in this series.
>>   * A QemuSpin lock is introduced (POSIX only for now) to allow faster handling of
>>     atomic instructions.
>>
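For readers who have not looked at the spin lock patch yet, here is a minimal sketch of the idea. It assumes C11 atomics, which is not necessarily how the patch implements it, and DemoSpin and the demo_* helpers are made-up names for illustration, not the QemuSpin API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a spin lock; not the QemuSpin type. */
typedef struct {
    atomic_flag locked;
} DemoSpin;

static inline void demo_spin_init(DemoSpin *s)
{
    atomic_flag_clear(&s->locked);
}

static inline void demo_spin_lock(DemoSpin *s)
{
    /* Busy-wait until the flag was clear before we set it. */
    while (atomic_flag_test_and_set_explicit(&s->locked,
                                             memory_order_acquire)) {
        /* spin */
    }
}

/* Returns true if the lock was taken, false if it was already held. */
static inline bool demo_spin_trylock(DemoSpin *s)
{
    return !atomic_flag_test_and_set_explicit(&s->locked,
                                              memory_order_acquire);
}

static inline void demo_spin_unlock(DemoSpin *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}

/* Example critical section: add n to a shared counter under the lock. */
static inline long demo_counter_add(DemoSpin *s, long *counter, long n)
{
    long result;

    assert(counter != NULL);
    demo_spin_lock(s);
    *counter += n;
    result = *counter;
    demo_spin_unlock(s);
    return result;
}
```

A spin lock busy-waits instead of sleeping, so it only pays off when the critical section is very short, as it is around an emulated atomic instruction.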
>> == Code generation and cache ==
>>
>> As QEMU stands, there is no protection at all against two threads attempting to
>> generate code at the same time or modifying a TranslationBlock.
>> The "protect TBContext with tb_lock" patch addresses the issue of code generation
>> and makes all the tb_* functions thread safe (except tb_flush).
>> This raised the question of one or multiple caches. We chose to use one
>> unified cache because it's easier as a first step and since the structure of
>> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
>> don't see the benefit of having two pools of tbs.
>>
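To illustrate the "one shared pool plus a per-CPU jump cache" layout, here is a rough sketch. It assumes a pthread mutex in the role of tb_lock. DemoTB, DemoCPU, demo_tb_find and the sizes are invented for the example and are much simpler than the real structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_CACHE_SIZE 16   /* illustrative sizes, not QEMU's */
#define POOL_SIZE      64

typedef struct DemoTB {
    uint32_t pc;
    int valid;
} DemoTB;

/* Single shared pool of translated blocks, guarded by demo_tb_lock. */
static DemoTB pool[POOL_SIZE];
static size_t pool_used;
static pthread_mutex_t demo_tb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each CPU keeps its own small lookup cache, touched only by its thread. */
typedef struct DemoCPU {
    DemoTB *jmp_cache[JMP_CACHE_SIZE];
} DemoCPU;

/* Linear search of the shared pool; the caller must hold demo_tb_lock. */
static DemoTB *pool_find_locked(uint32_t pc)
{
    for (size_t i = 0; i < pool_used; i++) {
        if (pool[i].valid && pool[i].pc == pc) {
            return &pool[i];
        }
    }
    return NULL;
}

static DemoTB *demo_tb_find(DemoCPU *cpu, uint32_t pc)
{
    size_t slot = pc % JMP_CACHE_SIZE;
    DemoTB *tb;

    assert(cpu != NULL);

    /* Fast path: per-CPU cache, no lock taken. */
    tb = cpu->jmp_cache[slot];
    if (tb && tb->valid && tb->pc == pc) {
        return tb;
    }

    /* Slow path: look up or "generate" the block under the lock. */
    pthread_mutex_lock(&demo_tb_lock);
    tb = pool_find_locked(pc);
    if (!tb && pool_used < POOL_SIZE) {
        tb = &pool[pool_used++];
        tb->pc = pc;
        tb->valid = 1;
    }
    pthread_mutex_unlock(&demo_tb_lock);

    if (tb) {
        cpu->jmp_cache[slot] = tb;
    }
    return tb;   /* NULL means the pool is full; real code would flush */
}
```

The lock-free fast path is only safe here because nothing in the sketch ever invalidates a block.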
>> == Dirty tracking ==
>>
>> Protecting the IOs:
>> To allow all VCPU threads to run at the same time we need to drop the
>> global_mutex as soon as possible. The I/O accesses need to take the mutex. This is
>> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>> is upstreamed.
>>
>> Invalidation of TranslationBlocks:
>> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
>> its jump cache itself as it is in CPUState, so that can be handled by a simple
>> call to async_run_on_cpu. However, tb_invalidate also writes to the
>> TranslationBlock, which is shared as we have only one pool.
>> Hence this part of invalidate requires all VCPUs to exit before it can be done.
>> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
>>
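The "run only when every VCPU is outside its execution loop" rule can be sketched roughly as below. This is a synchronous toy version using a pthread mutex and condition variable, whereas the series queues the work asynchronously. All demo_* names and the two counters are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>

typedef void (*demo_work_fn)(void *opaque);

static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_cond  = PTHREAD_COND_INITIALIZER;
static int vcpus_executing;     /* VCPUs currently inside their exec loop */
static int safe_work_pending;   /* nonzero while safe work waits or runs */

/* Called by a VCPU thread before it starts executing translated code. */
static void demo_exec_enter(void)
{
    pthread_mutex_lock(&work_mutex);
    while (safe_work_pending) {
        pthread_cond_wait(&work_cond, &work_mutex);
    }
    vcpus_executing++;
    pthread_mutex_unlock(&work_mutex);
}

/* Called by a VCPU thread when it leaves its execution loop. */
static void demo_exec_exit(void)
{
    pthread_mutex_lock(&work_mutex);
    assert(vcpus_executing > 0);
    vcpus_executing--;
    pthread_cond_broadcast(&work_cond);
    pthread_mutex_unlock(&work_mutex);
}

/*
 * Run fn once no VCPU is executing. The caller must itself be outside
 * the execution loop, otherwise it would wait for itself forever.
 * A real implementation would also kick the VCPUs out of their loops.
 */
static void demo_run_safe_work(demo_work_fn fn, void *opaque)
{
    pthread_mutex_lock(&work_mutex);
    safe_work_pending++;
    while (vcpus_executing > 0) {
        pthread_cond_wait(&work_cond, &work_mutex);
    }
    fn(opaque);                 /* runs with work_mutex held */
    safe_work_pending--;
    pthread_cond_broadcast(&work_cond);
    pthread_mutex_unlock(&work_mutex);
}

/* Sample work item: count how many times it was called. */
static void demo_count_call(void *opaque)
{
    ++*(int *)opaque;
}
```

While safe work is pending, demo_exec_enter blocks, so no VCPU can re-enter translated code until the shared TranslationBlock has been updated.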
>> == Atomic instruction ==
>>
>> For now only ARM on x64 is supported by using a cmpxchg instruction.
>> Specifically, the limitation of this approach is that it is harder to support
>> 64-bit ARM on a host architecture that is multi-core but only supports 32-bit
>> cmpxchg (we believe this could be the case for some PPC cores). For now this
>> case is not correctly handled. The existing atomic patch will attempt to execute
>> the 64-bit cmpxchg functionality in a non-thread-safe fashion. Our intention is
>> to provide a new multi-thread ARM atomic patch for 64-bit ARM on effective 32-bit
>> hosts.
>> This atomic instruction part has been tested with Alexander's atomic stress repo
>> available here:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>
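As a rough illustration of checking the exclusive value with a compare-and-swap, here is a minimal sketch. It assumes a 32-bit value and C11 atomics. DemoExclusive, demo_ldrex and demo_strex are hypothetical names, and the actual helper in the series may look quite different.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per-CPU record of the last load-exclusive (hypothetical layout). */
typedef struct {
    _Atomic uint32_t *addr;  /* address marked by the load-exclusive */
    uint32_t val;            /* value observed at that time */
    bool valid;              /* false once the monitor is cleared */
} DemoExclusive;

/* Emulated load-exclusive: remember the address and the value read. */
static uint32_t demo_ldrex(DemoExclusive *ex, _Atomic uint32_t *addr)
{
    assert(ex != NULL && addr != NULL);
    ex->addr = addr;
    ex->val = atomic_load(addr);
    ex->valid = true;
    return ex->val;
}

/*
 * Emulated store-exclusive: the store succeeds only if memory still
 * holds the value seen by the load-exclusive. Returns 0 on success and
 * 1 on failure, matching the STREX status convention.
 */
static uint32_t demo_strex(DemoExclusive *ex, _Atomic uint32_t *addr,
                           uint32_t newval)
{
    uint32_t expected;
    bool ok;

    assert(ex != NULL && addr != NULL);
    expected = ex->val;
    ok = ex->valid && ex->addr == addr &&
         atomic_compare_exchange_strong(addr, &expected, newval);
    ex->valid = false;       /* any store-exclusive clears the monitor */
    return ok ? 0 : 1;
}
```

Because the check is value based, a store-exclusive in this sketch still succeeds if another thread changed the location and then restored the old value in between (the ABA case), which a hardware exclusive monitor would catch.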
>> The execution is a little slower than upstream, probably because the different
>> VCPUs fight for the mutex. Swapping arm_exclusive_lock from a mutex to a spin lock
>> reduces the difference considerably.
>>
>> == Testing ==
>>
>> A simple double dhrystone test in SMP 2 with vexpress-a15 in a Linux guest shows
>> a good performance progression: it takes roughly 18s upstream to complete vs
>> 10s with MTTCG.
>>
>> Testing image is available here:
>> https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3
>>
>> Then simply:
>> ./configure --target-list=arm-softmmu
>> make -j8
>> ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage \
>>     -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic \
>>     --append "console=ttyAMA0"
>>
>> login: root
>>
>> The dhrystone command is the last one in the history.
>> "dhrystone 10000000 & dhrystone 10000000"
>>
>> The atomic spinlock benchmark from Alexander shows that atomics basically work.
>> Just follow the instructions here:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>
>> == Known issues ==
>>
>> * GDB stub:
>>    The GDB stub is not tested right now; it will probably require some changes to
>>    work.
>>
>> * deadlock on exit:
>>    When exiting QEMU with Ctrl-C, some VCPU threads are not able to exit and continue
>>    execution.
>>    http://git.greensocs.com/fkonrad/mttcg/issues/1
>>
>> * memory_region_rom_device_set_romd from pflash01 just crashes the TCG code.
>>    Strangely, this happens only with "-smp 4" and 2 CPUs in the DTB.
>>    http://git.greensocs.com/fkonrad/mttcg/issues/2
>>
>> Changes V6 -> V7:
>>    * global_lock:
>>       * Don't protect softmmu read/write helper as it's now done in
>>         address_space_rw.
>>    * tcg_exec_flag:
>>       * Make the flag atomically test and set through an API.
>>    * introduce async_safe_work:
>>       * move qemu_cpu_kick_thread to avoid prototype declaration.
>>       * use the work_mutex.
>>    * async_work:
>>       * protect it with a mutex (work_mutex) against concurrent access.
>>    * tb_lock:
>>       * protect tcg_malloc_internal as well.
>>    * signal the VCPU even if current_cpu is NULL.
>>    * added PSCI patch.
>>    * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899).
>>
>> Changes V5 -> V6:
>>    * Introduce async_safe_work to do the tb_flush and some part of tb_invalidate.
>>    * Introduce QemuSpin from Guillaume, which allows faster atomic instructions
>>      (6s to pass Alexander's atomic test instead of 30s before).
>>    * Don't take tb_lock before tb_find_fast.
>>    * Handle tb_flush with async_safe_work.
>>    * Handle tb_invalidate with async_work and async_safe_work.
>>    * Drop the tlb_flush_request mechanism and use async_work as well.
>>    * Fix the wrong length in atomic patch.
>>    * Fix the wrong return address for exception in atomic patch.
>>
>> Alex Bennée (1):
>>    target-arm/psci.c: wake up sleeping CPUs (MTTCG)
>>
>> Guillaume Delbergue (1):
>>    add support for spin lock on POSIX systems exclusively
>>
>> KONRAD Frederic (17):
>>    cpus: protect queued_work_* with work_mutex.
>>    cpus: add tcg_exec_flag.
>>    cpus: introduce async_run_safe_work_on_cpu.
>>    replace spinlock by QemuMutex.
>>    remove unused spinlock.
>>    protect TBContext with tb_lock.
>>    tcg: remove tcg_halt_cond global variable.
>>    Drop global lock during TCG code execution
>>    cpu: remove exit_request global.
>>    tcg: switch on multithread.
>>    Use atomic cmpxchg to atomically check the exclusive value in a STREX
>>    add a callback when tb_invalidate is called.
>>    cpu: introduce tlb_flush*_all.
>>    arm: use tlb_flush*_all
>>    translate-all: introduces tb_flush_safe.
>>    translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
>>    mttcg: signal the associated cpu anyway.
>>
>>   cpu-exec.c                  |  98 +++++++++------
>>   cpus.c                      | 295 +++++++++++++++++++++++++-------------------
>>   cputlb.c                    |  81 ++++++++++++
>>   include/exec/exec-all.h     |   8 +-
>>   include/exec/spinlock.h     |  49 --------
>>   include/qemu/thread-posix.h |   4 +
>>   include/qemu/thread-win32.h |   4 +
>>   include/qemu/thread.h       |   7 ++
>>   include/qom/cpu.h           |  57 +++++++++
>>   linux-user/main.c           |   6 +-
>>   qom/cpu.c                   |  20 +++
>>   target-arm/cpu.c            |  21 ++++
>>   target-arm/cpu.h            |   6 +
>>   target-arm/helper.c         |  58 +++------
>>   target-arm/helper.h         |   4 +
>>   target-arm/op_helper.c      | 128 ++++++++++++++++++-
>>   target-arm/psci.c           |   2 +
>>   target-arm/translate.c      | 101 +++------------
>>   target-i386/mem_helper.c    |  16 ++-
>>   target-i386/misc_helper.c   |  27 +++-
>>   tcg/i386/tcg-target.c       |   8 ++
>>   tcg/tcg.h                   |  14 ++-
>>   translate-all.c             | 217 +++++++++++++++++++++++++++-----
>>   util/qemu-thread-posix.c    |  45 +++++++
>>   util/qemu-thread-win32.c    |  30 +++++
>>   vl.c                        |   6 +
>>   26 files changed, 934 insertions(+), 378 deletions(-)
>>   delete mode 100644 include/exec/spinlock.h