From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50125)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1YktjU-0003DV-QO
	for qemu-devel@nongnu.org; Wed, 22 Apr 2015 08:26:26 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1YktjP-00017Q-QT
	for qemu-devel@nongnu.org; Wed, 22 Apr 2015 08:26:24 -0400
Received: from greensocs.com ([193.104.36.180]:54231)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1YktjP-000172-Fw
	for qemu-devel@nongnu.org; Wed, 22 Apr 2015 08:26:19 -0400
Message-ID: <55379366.9070304@greensocs.com>
Date: Wed, 22 Apr 2015 14:26:14 +0200
From: Frederic Konrad <fred.konrad@greensocs.com>
MIME-Version: 1.0
References: <1421428797-23697-1-git-send-email-fred.konrad@greensocs.com>
	<87wq22bvmd.fsf@linaro.org> <551532CF.600@greensocs.com>
	<0ACA9B57-1DC6-4C75-8FA4-DEAA701B1E94@greensocs.com>
	<CAFEAcA8FTNWbWfjfVd31e53=UCjctM5vCExqh6cqh2i8ePZ1Yg@mail.gmail.com>
	<5527F444.7060808@greensocs.com>
In-Reply-To: <5527F444.7060808@greensocs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.
Reply-To: Frederic Konrad <fred.konrad@greensocs.com>
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Peter Maydell <peter.maydell@linaro.org>, Mark Burton <mark.burton@greensocs.com>
Cc: mttcg@listserver.greensocs.com, Jan Kiszka <jan.kiszka@siemens.com>, QEMU Developers <qemu-devel@nongnu.org>, Alexander Graf <agraf@suse.de>, Paolo Bonzini <pbonzini@redhat.com>, =?UTF-8?B?QWxleCBCZW5uw6ll?= <alex.bennee@linaro.org>

On 10/04/2015 18:03, Frederic Konrad wrote:
> On 30/03/2015 23:46, Peter Maydell wrote:
>> On 30 March 2015 at 07:52, Mark Burton <mark.burton@greensocs.com>=20
>> wrote:
>>> So - Fred is unwilling to send the patch set as it stands, because=20
>>> frankly this part is totally broken.
>>>
>>> There is an independent patch set that needs splitting out which=20
>>> deals with just the atomic instruction issue - specifically for ARM=20
>>> (though I guess it=E2=80=99s applicable across the board)=E2=80=A6
>>>
>>> So - in short - I HOPE to get the patch set onto the reflector=20
>>> sometime next week, and I=E2=80=99m sorry for the delay.
>> What I really want to see is not so much the patch set
>> but the design sketch I asked for that lists the
>> various data structures and indicates which ones
>> are going to be per-cpu, which ones will be shared
>> (and with what locking), etc.
>>
>> -- PMM

Does that makes sense?

BTW here is the repository:
git clone git@git.greensocs.com:fkonrad/mttcg.git -b multi_tcg_v4

Thanks,
Fred

> Hi everybody,
> Hi Peter,
>
> I tried to recap what we did, how it "works" and what the status:
>
> All the mechanism are basically unchanged.
>
> A lot of TCG structures are not thread safe.
> And all TCG threads can run at the same times and sometimes want to=20
> generate
> code at the same time.
>
> Translation block related structure:
>
> struct TBContext {
>
>     TranslationBlock *tbs;
>     TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
>     int nb_tbs;
>     /* any access to the tbs or the page table must use this lock */
>     QemuMutex tb_lock;
>
>     /* statistics */
>     int tb_flush_count;
>     int tb_phys_invalidate_count;
>
>     int tb_invalidated_flag;
> };
>
> This structure is used in TCGContext: TBContext tb_ctx;
>
> "tbs" is basically where the translated block are stored and=20
> tb_phys_hash an
> hash table to find them quickly.
>
> There are two solutions to prevent thread issues:
>   A/ Just have two tb_ctx.
>   B/ Share it between CPUs and protect the tb_ctx access.
>
> We took the second solution so all CPUs can benefit of the translated T=
B.
> TBContext is written almost everywhere in translate-all.c.
> When there are too much tbs a tb_flush occurs and destroy the array.=20
> We don't
> handle this case right now.
> tb_lock is already used by user-mode code, so we just convert it to=20
> QemuMutex so
> we can reuse it in system-mode.
>
> struct TCGContext {
>     uint8_t *pool_cur, *pool_end;
>     TCGPool *pool_first, *pool_current, *pool_first_large;
>     TCGLabel *labels;
>     int nb_labels;
>     int nb_globals;
>     int nb_temps;
>
>     /* goto_tb support */
>     tcg_insn_unit *code_buf;
>     uintptr_t *tb_next;
>     uint16_t *tb_next_offset;
>     uint16_t *tb_jmp_offset; /* !=3D NULL if USE_DIRECT_JUMP */
>
>     /* liveness analysis */
>     uint16_t *op_dead_args; /* for each operation, each bit tells if th=
e
>                                corresponding argument is dead */
>     uint8_t *op_sync_args;  /* for each operation, each bit tells if th=
e
>                                corresponding output argument needs to b=
e
>                                sync to memory. */
>
>     /* tells in which temporary a given register is. It does not take
>        into account fixed registers */
>     int reg_to_temp[TCG_TARGET_NB_REGS];
>     TCGRegSet reserved_regs;
>     intptr_t current_frame_offset;
>     intptr_t frame_start;
>     intptr_t frame_end;
>     int frame_reg;
>
>     tcg_insn_unit *code_ptr;
>     TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
>     TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
>
>     GHashTable *helpers;
>
> #ifdef CONFIG_PROFILER
>     /* profiling info */
>     int64_t tb_count1;
>     int64_t tb_count;
>     int64_t op_count; /* total insn count */
>     int op_count_max; /* max insn per TB */
>     int64_t temp_count;
>     int temp_count_max;
>     int64_t del_op_count;
>     int64_t code_in_len;
>     int64_t code_out_len;
>     int64_t interm_time;
>     int64_t code_time;
>     int64_t la_time;
>     int64_t opt_time;
>     int64_t restore_count;
>     int64_t restore_time;
> #endif
>
> #ifdef CONFIG_DEBUG_TCG
>     int temps_in_use;
>     int goto_tb_issue_mask;
> #endif
>
>     uint16_t gen_opc_buf[OPC_BUF_SIZE];
>     TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
>
>     uint16_t *gen_opc_ptr;
>     TCGArg *gen_opparam_ptr;
>     target_ulong gen_opc_pc[OPC_BUF_SIZE];
>     uint16_t gen_opc_icount[OPC_BUF_SIZE];
>     uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
>
>     /* Code generation.  Note that we specifically do not use=20
> tcg_insn_unit
>        here, because there's too much arithmetic throughout that relies
>        on addition and subtraction working on bytes.  Rely on the GCC
>        extension that allows arithmetic on void*.  */
>     int code_gen_max_blocks;
>     void *code_gen_prologue;
>     void *code_gen_buffer;
>     size_t code_gen_buffer_size;
>     /* threshold to flush the translated code buffer */
>     size_t code_gen_buffer_max_size;
>     void *code_gen_ptr;
>
>     TBContext tb_ctx;
>
>     /* The TCGBackendData structure is private to tcg-target.c. */
>     struct TCGBackendData *be;
> };
>
> This structure is used to translate the TBs.
> The easier solution was to protect the generation of the code to only=20
> allow one
> CPU to generate code at a time. This is normal as we don't want double=20
> generated
> tb in the pool anyway. This is achieved with the tb_lock used above.
>
> TLB:
>
> TLB seems to be CPU dependant, so it is not really a problem as in our
> implementation one CPU =3D one pthread. But sometimes a CPU wants to=20
> flush TLB,
> through an instruction for example. It is very likely an other CPU in=20
> an other
> thread is executing code at the same time. That's why we choose to=20
> create a
> tlb_flush_mechanism:
> When a CPU wants to flush it asks and wait all CPU to exit TCG and=20
> then exit
> itself. This can be reused for tb_invalidate and or tb_flush as well.
>
> Atomic instructions:
>
> Atomic instructions are quite hard to implement.
> The TranslationBlock implementing the atomic instruction can't be=20
> interrupted
> during the execution (eg: by an interrupt or a signal) cmpxchg64=20
> helper is used
> for that.
>
> QEMU's global lock:
>
> TCG thread take the lock during code execution. This is not ok for=20
> multi-thread
> because that means only one thread will be running at a time. That's=20
> why we took
> Jan's patch to allow TCG to run without the lock and take it when neede=
d.
>
> What is the status:
>
>  * We can start a vexpress-a15 simulation with two A15 and run two=20
> dhrystones at
>    a time, the performance are increased it's quite stable.
>
> What is missing:
>
>  * tb_flush is not implemented correctly.
>  * PageDesc structure is not protected the patch which introduced a=20
> first_tb
>    array was not the right approach and is removed. This implies that
>    tb_invalidate is broken.
>
> For both issues we plan to use the same mechanism as tlb_flush:=20
> exiting all the
> CPU, flushing, invalidating and let them continue. A generic mechanism=20
> must be
> implemented for that.
>
> Known issues:
>
>  * GDB stub is broken because it uses tb_invalidate and we didn't=20
> implement that
>    for now, and there are probably other issues.
>  * SMP > 2 crashes, probably because of tb_invalidate as well.
>  * We don't know the status of the user code, which is probably broken=20
> by our
>    changes.
>