From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:55830)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1ZPXoV-0000WD-Sr
	for qemu-devel@nongnu.org; Wed, 12 Aug 2015 11:19:41 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1ZPXoR-0007DB-5V
	for qemu-devel@nongnu.org; Wed, 12 Aug 2015 11:19:35 -0400
Received: from greensocs.com ([193.104.36.180]:58492)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1ZPXoQ-0007Co-Q5
	for qemu-devel@nongnu.org; Wed, 12 Aug 2015 11:19:31 -0400
Message-ID: <55CB63FA.1070200@greensocs.com>
Date: Wed, 12 Aug 2015 17:19:22 +0200
From: Frederic Konrad <fred.konrad@greensocs.com>
MIME-Version: 1.0
References: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com>	<55C9EE60.80004@redhat.com>
	<55C9FFC0.7040502@greensocs.com>
In-Reply-To: <55C9FFC0.7040502@greensocs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Paolo Bonzini <pbonzini@redhat.com>, qemu-devel@nongnu.org, mttcg@listserver.greensocs.com
Cc: mark.burton@greensocs.com, alex.bennee@linaro.org, a.rigo@virtualopensystems.com, guillaume.delbergue@greensocs.com

On 11/08/2015 15:59, Frederic Konrad wrote:
> On 11/08/2015 14:45, Paolo Bonzini wrote:
>> On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
>>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>>
>>> This is the 7th round of the MTTCG patch series.
> Thanks to look at this.
>> Here is a list of issues that I found:
>>
>> - tb_lock usage in tb_find_fast is complicated and introduces the need
>> for other complicated code such as the tb_invalidate callback. Instead=
,
>> the tb locking should reuse the cpu-exec.c code for user-mode emulatio=
n,
>> with additional locking in the spots identified by Fred.
> The reason for this is that locking around tb_find_fast just kills the=20
> performance.
>
>>
>> - tb_lock uses a recursive lock, but this is not necessary.  Did I eve=
r
>> say I dislike recursive mutexes? :)  The wrappers
>> tb_lock()/tb_unlock()/tb_lock_reset() can catch recursive locking for
>> us, so it's not hard to do without it.
> True this is definitely not nice but seems some code eg: tb_invalidate=20
> are called from
> different place with or without the tb_lock. We might probably be able=20
> to lock the
> caller.
>
>>
>> - code_bitmap is not protected by any mutex
>> (tb_invalidate_phys_page_fast is called with the iothread mutex taken,
>> but other users of code_bitmap do not use it).  Writes should be
>> protected by the tb_lock, reads by either tb_lock or RCU.
> Yes good point I missed this one.
>
>> - memory barriers are probably requested around accesses to
>> ->exit_request.  ->thread_kicked also needs to be accessed with atomic=
s,
>> because async_run_{,safe_}on_cpu can be called outside the big QEMU=20
>> lock.
>>
>> - the whole signal-based qemu_cpu_kick can just go away.  Just setting
>> tcg_exit_req and exit_request will kick the TCG thread.  The hairy Win=
32
>> SuspendThread/ResumeThread goes away too.  I suggest doing it now,
>> because proving it unnecessary is easier than proving it correct.
> Just setting tcg_exit_req and exit_request and signal the=20
> cpu->halt_cond I guess?
BTW that affect KVM as well. Seems this mechanism is used as well with
qemu_cpu_kick_self().. Which is a little strange as it seems the SIGIPI=20
trigger a
dummy signal handler?

     memset(&sigact, 0, sizeof(sigact));
     sigact.sa_handler =3D dummy_signal;
     sigaction(SIG_IPI, &sigact, NULL);


>
>> - user-mode emulation is broken (does not compile)
>>
>> - the big QEMU lock is not taken anywhere for MMIO accesses that requi=
re
>> it (i.e. basically all of them)
> Isn't that handled by prepare_mmio_access?
>
>>
>> - some code wants to be called _outside_ the big QEMU lock, for exampl=
e
>> because it longjmps back to cpu_exec.  For example, I suspect that the
>> notdirty callbacks must be marked with=20
>> memory_region_clear_global_locking.
>
> Fred
>>
>> I've started looking at them (and documenting the locking conventions
>> for functions), and I hope to post it to some git repo later this week=
.
>>
>> Paolo
>>
>>> It can be cloned from:
>>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
>>>
>>> This patch-set try to address the different issues in the global=20
>>> picture of
>>> MTTCG, presented on the wiki.
>>>
>>> =3D=3D Needed patch for our work =3D=3D
>>>
>>> Some preliminaries are needed for our work:
>>>   * current_cpu doesn't make sense in mttcg so a tcg_executing flag=20
>>> is added to
>>>     the CPUState.
>>>   * We need to run some work safely when all VCPUs are outside their=20
>>> execution
>>>     loop. This is done with the async_run_safe_work_on_cpu function=20
>>> introduced
>>>     in this series.
>>>   * QemuSpin lock is introduced (on posix only yet) to allow a=20
>>> faster handling of
>>>     atomic instruction.
>>>
>>> =3D=3D Code generation and cache =3D=3D
>>>
>>> As Qemu stands, there is no protection at all against two threads=20
>>> attempting to
>>> generate code at the same time or modifying a TranslationBlock.
>>> The "protect TBContext with tb_lock" patch address the issue of code=20
>>> generation
>>> and makes all the tb_* function thread safe (except tb_flush).
>>> This raised the question of one or multiple caches. We choosed to=20
>>> use one
>>> unified cache because it's easier as a first step and since the=20
>>> structure of
>>> QEMU effectively has a =E2=80=98local=E2=80=99 cache per CPU in the f=
orm of the jump=20
>>> cache, we
>>> don't see the benefit of having two pools of tbs.
>>>
>>> =3D=3D Dirty tracking =3D=3D
>>>
>>> Protecting the IOs:
>>> To allows all VCPUs threads to run at the same time we need to drop t=
he
>>> global_mutex as soon as possible. The io access need to take the=20
>>> mutex. This is
>>> likely to change when=20
>>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>>> will be upstreamed.
>>>
>>> Invalidation of TranslationBlocks:
>>> We can have all VCPUs running during an invalidation. Each VCPU is=20
>>> able to clean
>>> it's jump cache itself as it is in CPUState so that can be handled=20
>>> by a simple
>>> call to async_run_on_cpu. However tb_invalidate also writes to the
>>> TranslationBlock which is shared as we have only one pool.
>>> Hence this part of invalidate requires all VCPUs to exit before it=20
>>> can be done.
>>> Hence the async_run_safe_work_on_cpu is introduced to handle this cas=
e.
>>>
>>> =3D=3D Atomic instruction =3D=3D
>>>
>>> For now only ARM on x64 is supported by using an cmpxchg instruction.
>>> Specifically the limitation of this approach is that it is harder to=20
>>> support
>>> 64bit ARM on a host architecture that is multi-core, but only=20
>>> supports 32 bit
>>> cmpxchg (we believe this could be the case for some PPC cores).  For=20
>>> now this
>>> case is not correctly handled. The existing atomic patch will=20
>>> attempt to execute
>>> the 64 bit cmpxchg functionality in a non thread safe fashion. Our=20
>>> intention is
>>> to provide a new multi-thread ARM atomic patch for 64bit ARM on=20
>>> effective 32bit
>>> hosts.
>>> This atomic instruction part has been tested with Alexander's atomic=20
>>> stress repo
>>> available here:
>>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>>
>>> The execution is a little slower than upstream probably because of=20
>>> the different
>>> VCPU fight for the mutex. Swaping arm_exclusive_lock from mutex to=20
>>> spin_lock
>>> reduce considerably the difference.
>>>
>>> =3D=3D Testing =3D=3D
>>>
>>> A simple double dhrystone test in SMP 2 with vexpress-a15 in a linux=20
>>> guest show
>>> a good performance progression: it takes basically 18s upstream to=20
>>> complete vs
>>> 10s with MTTCG.
>>>
>>> Testing image is available here:
>>> https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3
>>>
>>> Then simply:
>>> ./configure --target-list=3Darm-softmmu
>>> make -j8
>>> ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage
>>> -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic
>>> --append "console=3DttyAMA0"
>>>
>>> login: root
>>>
>>> The dhrystone command is the last one in the history.
>>> "dhrystone 10000000 & dhrystone 10000000"
>>>
>>> The atomic spinlock benchmark from Alexander shows that atomic=20
>>> basically work.
>>> Just follow the instruction here:
>>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>>
>>> =3D=3D Known issues =3D=3D
>>>
>>> * GDB stub:
>>>    GDB stub is not tested right now it will probably requires some=20
>>> changes to
>>>    work.
>>>
>>> * deadlock on exit:
>>>    When exiting QEMU Ctrl-C some VCPU's thread are not able to exit=20
>>> and continue
>>>    execution.
>>>    http://git.greensocs.com/fkonrad/mttcg/issues/1
>>>
>>> * memory_region_rom_device_set_romd from pflash01 just crashes the=20
>>> TCG code.
>>>    Strangely this happen only with "-smp 4" and 2 in the DTB.
>>>    http://git.greensocs.com/fkonrad/mttcg/issues/2
>>>
>>> Changes V6 -> V7:
>>>    * global_lock:
>>>       * Don't protect softmmu read/write helper as it's now done in
>>>         adress_space_rw.
>>>    * tcg_exec_flag:
>>>       * Make the flag atomically test and set through an API.
>>>    * introduce async_safe_work:
>>>       * move qemu_cpu_kick_thread to avoid prototype declaration.
>>>       * use the work_mutex.
>>>    * async_work:
>>>       * protect it with a mutex (work_mutex) against concurent access=
.
>>>    * tb_lock:
>>>       * protect tcg_malloc_internal as well.
>>>    * signal the VCPU even of current_cpu is NULL.
>>>    * added PSCI patch.
>>>    * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899)=
.
>>>
>>> Changes V5 -> V6:
>>>    * Introduce async_safe_work to do the tb_flush and some part of=20
>>> tb_invalidate.
>>>    * Introduce QemuSpin from Guillaume which allow a faster atomic=20
>>> instruction
>>>      (6s to pass Alexander's atomic test instead of 30s before).
>>>    * Don't take tb_lock before tb_find_fast.
>>>    * Handle tb_flush with async_safe_work.
>>>    * Handle tb_invalidate with async_work and async_safe_work.
>>>    * Drop the tlb_flush_request mechanism and use async_work as well.
>>>    * Fix the wrong lenght in atomic patch.
>>>    * Fix the wrong return address for exception in atomic patch.
>>>
>>> Alex Benn=C3=A9e (1):
>>>    target-arm/psci.c: wake up sleeping CPUs (MTTCG)
>>>
>>> Guillaume Delbergue (1):
>>>    add support for spin lock on POSIX systems exclusively
>>>
>>> KONRAD Frederic (17):
>>>    cpus: protect queued_work_* with work_mutex.
>>>    cpus: add tcg_exec_flag.
>>>    cpus: introduce async_run_safe_work_on_cpu.
>>>    replace spinlock by QemuMutex.
>>>    remove unused spinlock.
>>>    protect TBContext with tb_lock.
>>>    tcg: remove tcg_halt_cond global variable.
>>>    Drop global lock during TCG code execution
>>>    cpu: remove exit_request global.
>>>    tcg: switch on multithread.
>>>    Use atomic cmpxchg to atomically check the exclusive value in a=20
>>> STREX
>>>    add a callback when tb_invalidate is called.
>>>    cpu: introduce tlb_flush*_all.
>>>    arm: use tlb_flush*_all
>>>    translate-all: introduces tb_flush_safe.
>>>    translate-all: (wip) use tb_flush_safe when we can't alloc more tb=
.
>>>    mttcg: signal the associated cpu anyway.
>>>
>>>   cpu-exec.c                  |  98 +++++++++------
>>>   cpus.c                      | 295=20
>>> +++++++++++++++++++++++++-------------------
>>>   cputlb.c                    |  81 ++++++++++++
>>>   include/exec/exec-all.h     |   8 +-
>>>   include/exec/spinlock.h     |  49 --------
>>>   include/qemu/thread-posix.h |   4 +
>>>   include/qemu/thread-win32.h |   4 +
>>>   include/qemu/thread.h       |   7 ++
>>>   include/qom/cpu.h           |  57 +++++++++
>>>   linux-user/main.c           |   6 +-
>>>   qom/cpu.c                   |  20 +++
>>>   target-arm/cpu.c            |  21 ++++
>>>   target-arm/cpu.h            |   6 +
>>>   target-arm/helper.c         |  58 +++------
>>>   target-arm/helper.h         |   4 +
>>>   target-arm/op_helper.c      | 128 ++++++++++++++++++-
>>>   target-arm/psci.c           |   2 +
>>>   target-arm/translate.c      | 101 +++------------
>>>   target-i386/mem_helper.c    |  16 ++-
>>>   target-i386/misc_helper.c   |  27 +++-
>>>   tcg/i386/tcg-target.c       |   8 ++
>>>   tcg/tcg.h                   |  14 ++-
>>>   translate-all.c             | 217 +++++++++++++++++++++++++++-----
>>>   util/qemu-thread-posix.c    |  45 +++++++
>>>   util/qemu-thread-win32.c    |  30 +++++
>>>   vl.c                        |   6 +
>>>   26 files changed, 934 insertions(+), 378 deletions(-)
>>>   delete mode 100644 include/exec/spinlock.h
>>>
>
>