Message-ID: <5527F444.7060808@greensocs.com>
Date: Fri, 10 Apr 2015 18:03:16 +0200
From: Frederic Konrad
Subject: Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.
To: Peter Maydell, Mark Burton
Cc: mttcg@listserver.greensocs.com, Jan Kiszka, QEMU Developers, Alexander Graf, Paolo Bonzini, Alex Bennée

On 30/03/2015 23:46, Peter Maydell wrote:
> On 30 March 2015 at 07:52, Mark Burton wrote:
>> So - Fred is unwilling to send the patch set as it stands, because frankly this part is totally broken.
>>
>> There is an independent patch set that needs splitting out which deals with just the atomic instruction issue - specifically for ARM (though I guess it's applicable across the board)…
>>
>> So - in short - I HOPE to get the patch set onto the reflector sometime next week, and I'm sorry for the delay.
> What I really want to see is not so much the patch set
> but the design sketch I asked for that lists the
> various data structures and indicates which ones
> are going to be per-cpu, which ones will be shared
> (and with what locking), etc.
>
> -- PMM

Hi everybody,
Hi Peter,

I tried to recap what we did, how it "works" and what the status is.

All the mechanisms are basically unchanged. A lot of TCG structures are not
thread safe, and all TCG threads can run at the same time and sometimes want
to generate code at the same time.

Translation block related structure:

struct TBContext {

    TranslationBlock *tbs;
    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
    int nb_tbs;
    /* any access to the tbs or the page table must use this lock */
    QemuMutex tb_lock;

    /* statistics */
    int tb_flush_count;
    int tb_phys_invalidate_count;

    int tb_invalidated_flag;
};

This structure is used in TCGContext:

    TBContext tb_ctx;

"tbs" is basically where the translated blocks are stored, and tb_phys_hash
is a hash table to find them quickly.

There are two solutions to prevent threading issues:
A/ Just have two tb_ctx (one per CPU).
B/ Share it between CPUs and protect the tb_ctx accesses.

We took the second solution so all CPUs can benefit from the translated TBs.
TBContext is written to almost everywhere in translate-all.c.

When there are too many TBs, a tb_flush occurs and destroys the array. We
don't handle this case right now.

tb_lock is already used by user-mode code, so we just converted it to a
QemuMutex so we can reuse it in system mode.
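To make the shared-TBContext option concrete, here is a minimal sketch (not
taken from the patches) of how a vCPU thread could look up or generate a TB
under tb_lock. find_tb_in_phys_hash() is a hypothetical placeholder for the
tb_phys_hash lookup; tb_gen_code() is the usual translate-all.c entry point:

static TranslationBlock *lookup_or_translate(CPUState *cpu, target_ulong pc,
                                             target_ulong cs_base, int flags)
{
    TranslationBlock *tb;

    /* tb_lock serialises every access to tb_ctx (hash table, tbs array)
       as well as the code generation itself. */
    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);

    tb = find_tb_in_phys_hash(cpu, pc, cs_base, flags); /* hypothetical */
    if (!tb) {
        /* Only one thread generates code at a time, so the same TB can
           never end up twice in the pool. */
        tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
    }

    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
    return tb;
}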
The TCG context structure:

struct TCGContext {
    uint8_t *pool_cur, *pool_end;
    TCGPool *pool_first, *pool_current, *pool_first_large;
    TCGLabel *labels;
    int nb_labels;
    int nb_globals;
    int nb_temps;

    /* goto_tb support */
    tcg_insn_unit *code_buf;
    uintptr_t *tb_next;
    uint16_t *tb_next_offset;
    uint16_t *tb_jmp_offset; /* != NULL if USE_DIRECT_JUMP */

    /* liveness analysis */
    uint16_t *op_dead_args; /* for each operation, each bit tells if the
                               corresponding argument is dead */
    uint8_t *op_sync_args;  /* for each operation, each bit tells if the
                               corresponding output argument needs to be
                               sync to memory. */

    /* tells in which temporary a given register is. It does not take
       into account fixed registers */
    int reg_to_temp[TCG_TARGET_NB_REGS];
    TCGRegSet reserved_regs;
    intptr_t current_frame_offset;
    intptr_t frame_start;
    intptr_t frame_end;
    int frame_reg;

    tcg_insn_unit *code_ptr;
    TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
    TCGTempSet free_temps[TCG_TYPE_COUNT * 2];

    GHashTable *helpers;

#ifdef CONFIG_PROFILER
    /* profiling info */
    int64_t tb_count1;
    int64_t tb_count;
    int64_t op_count; /* total insn count */
    int op_count_max; /* max insn per TB */
    int64_t temp_count;
    int temp_count_max;
    int64_t del_op_count;
    int64_t code_in_len;
    int64_t code_out_len;
    int64_t interm_time;
    int64_t code_time;
    int64_t la_time;
    int64_t opt_time;
    int64_t restore_count;
    int64_t restore_time;
#endif

#ifdef CONFIG_DEBUG_TCG
    int temps_in_use;
    int goto_tb_issue_mask;
#endif

    uint16_t gen_opc_buf[OPC_BUF_SIZE];
    TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];

    uint16_t *gen_opc_ptr;
    TCGArg *gen_opparam_ptr;

    target_ulong gen_opc_pc[OPC_BUF_SIZE];
    uint16_t gen_opc_icount[OPC_BUF_SIZE];
    uint8_t gen_opc_instr_start[OPC_BUF_SIZE];

    /* Code generation.  Note that we specifically do not use tcg_insn_unit
       here, because there's too much arithmetic throughout that relies
       on addition and subtraction working on bytes.  Rely on the GCC
       extension that allows arithmetic on void*. */
    int code_gen_max_blocks;
    void *code_gen_prologue;
    void *code_gen_buffer;
    size_t code_gen_buffer_size;
    /* threshold to flush the translated code buffer */
    size_t code_gen_buffer_max_size;
    void *code_gen_ptr;

    TBContext tb_ctx;

    /* The TCGBackendData structure is private to tcg-target.c. */
    struct TCGBackendData *be;
};

This structure is used to translate the TBs.

The simplest solution was to protect code generation so that only one CPU can
generate code at a time. This is fine, as we don't want the same TB to be
generated twice in the pool anyway. This is achieved with the tb_lock used
above.

TLB:

The TLB seems to be CPU-dependent, so it is not really a problem, as in our
implementation one CPU = one pthread. But sometimes a CPU wants to flush the
TLB, through an instruction for example, and it is very likely that another
CPU in another thread is executing code at the same time. That's why we chose
to create a tlb_flush mechanism:

When a CPU wants to flush, it asks all CPUs to exit TCG, waits for them, and
then exits itself (a rough sketch of this idea is given below, before the
status section).

This can be reused for tb_invalidate and/or tb_flush as well.

Atomic instructions:

Atomic instructions are quite hard to implement. The TranslationBlock
implementing the atomic instruction can't be interrupted during its execution
(e.g. by an interrupt or a signal); a cmpxchg64 helper is used for that (see
the second sketch below).

QEMU's global lock:

TCG threads take the lock during code execution. This is not OK for
multi-threading, because it means only one thread would be running at a time.
That's why we took Jan's patch to allow TCG to run without the lock and take
it only when needed.
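Here is a rough sketch of the "exit everybody, then flush" idea described in
the TLB section above. It is only an illustration, not the actual patch:
tcg_exit_all_cpus_and_wait() and tcg_resume_all_cpus() are hypothetical helper
names standing for the real synchronisation primitives of the series.

static void tlb_flush_all_safe(CPUState *requesting_cpu)
{
    CPUState *cpu;

    /* 1. Ask every vCPU thread to leave the TCG execution loop (e.g. with
          cpu_exit()) and wait until they all have. */
    tcg_exit_all_cpus_and_wait(requesting_cpu);   /* hypothetical */

    /* 2. Nobody is executing translated code any more, so the flush (or a
          tb_invalidate/tb_flush) is now safe. */
    CPU_FOREACH(cpu) {
        tlb_flush(cpu, 1);
    }

    /* 3. Let the vCPU threads resume. */
    tcg_resume_all_cpus();                        /* hypothetical */
}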
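For the atomic instructions, the direction is to funnel the guest's
read-compare-write through one helper that performs it with a single host
atomic operation, so no other vCPU thread can slip a store in between. The
sketch below is again only an illustration: helper_mttcg_cmpxchg64() is a
hypothetical name, and g2h() addressing is the user-mode shortcut (a softmmu
version would have to go through the TLB instead):

uint64_t helper_mttcg_cmpxchg64(CPUArchState *env, target_ulong addr,
                                uint64_t expected, uint64_t desired)
{
    /* Translate the guest address to a host pointer (user-mode only). */
    uint64_t *host_ptr = (uint64_t *)g2h(addr);

    /* The compare-and-swap is done atomically by the host; the value that
       was in memory before the operation is returned to the guest. */
    return __sync_val_compare_and_swap(host_ptr, expected, desired);
}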
What is the status:

* We can start a vexpress-a15 simulation with two A15s and run two dhrystones
  at a time; performance is improved and it is quite stable.

What is missing:

* tb_flush is not implemented correctly.
* The PageDesc structure is not protected: the patch which introduced a
  first_tb array was not the right approach and has been removed. This implies
  that tb_invalidate is broken.

For both issues we plan to use the same mechanism as for tlb_flush: exit all
the CPUs, flush, invalidate, and let them continue. A generic mechanism must
be implemented for that.

Known issues:

* The GDB stub is broken because it uses tb_invalidate, which we haven't
  implemented for now; there are probably other issues as well.
* SMP > 2 crashes, probably because of tb_invalidate as well.
* We don't know the status of the user-mode code, which is probably broken by
  our changes.