Message-ID: <5527F444.7060808@greensocs.com>
Date: Fri, 10 Apr 2015 18:03:16 +0200
From: Frederic Konrad
Subject: Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.
To: Peter Maydell, Mark Burton
Cc: mttcg@listserver.greensocs.com, Jan Kiszka, QEMU Developers, Alexander Graf, Paolo Bonzini, Alex Bennée

On 30/03/2015 23:46, Peter Maydell wrote:
> On 30 March 2015 at 07:52, Mark Burton wrote:
>> So - Fred is unwilling to send the patch set as it stands, because frankly this part is totally broken.
>>
>> There is an independent patch set that needs splitting out which deals with just the atomic instruction issue - specifically for ARM (though I guess it's applicable across the board)…
>>
>> So - in short - I HOPE to get the patch set onto the reflector sometime next week, and I'm sorry for the delay.
> What I really want to see is not so much the patch set
> but the design sketch I asked for that lists the
> various data structures and indicates which ones
> are going to be per-cpu, which ones will be shared
> (and with what locking), etc.
>
> -- PMM

Hi everybody,
Hi Peter,

I tried to recap what we did, how it "works" and what the status is.

All the mechanisms are basically unchanged. A lot of TCG structures are not
thread safe, and all TCG threads can run at the same time and sometimes want
to generate code at the same time.

Translation block related structure:

struct TBContext {

    TranslationBlock *tbs;
    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
    int nb_tbs;
    /* any access to the tbs or the page table must use this lock */
    QemuMutex tb_lock;

    /* statistics */
    int tb_flush_count;
    int tb_phys_invalidate_count;

    int tb_invalidated_flag;
};

This structure is used in TCGContext:

    TBContext tb_ctx;

"tbs" is basically where the translated blocks are stored, and tb_phys_hash
is a hash table to find them quickly.

There are two solutions to prevent threading issues:
A/ Just have two tb_ctx (one per CPU).
B/ Share it between CPUs and protect the tb_ctx accesses.

We took the second solution so all CPUs can benefit from the translated TBs.
TBContext is written to almost everywhere in translate-all.c.

When there are too many TBs, a tb_flush occurs and destroys the array. We
don't handle this case right now.

tb_lock is already used by user-mode code, so we just converted it to a
QemuMutex so we can reuse it in system mode.
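To make the shared-TBContext option concrete, here is a minimal sketch (not
taken from the patches) of how a vCPU thread could look up or generate a TB
under tb_lock. find_tb_in_phys_hash() is a hypothetical placeholder for the
tb_phys_hash lookup; tb_gen_code() is the usual translate-all.c entry point:

static TranslationBlock *lookup_or_translate(CPUState *cpu, target_ulong pc,
                                             target_ulong cs_base, int flags)
{
    TranslationBlock *tb;

    /* tb_lock serialises every access to tb_ctx (hash table, tbs array)
       as well as the code generation itself. */
    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);

    tb = find_tb_in_phys_hash(cpu, pc, cs_base, flags); /* hypothetical */
    if (!tb) {
        /* Only one thread generates code at a time, so the same TB can
           never end up twice in the pool. */
        tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
    }

    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
    return tb;
}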
The TCG context structure:

struct TCGContext {
    uint8_t *pool_cur, *pool_end;
    TCGPool *pool_first, *pool_current, *pool_first_large;
    TCGLabel *labels;
    int nb_labels;
    int nb_globals;
    int nb_temps;

    /* goto_tb support */
    tcg_insn_unit *code_buf;
    uintptr_t *tb_next;
    uint16_t *tb_next_offset;
    uint16_t *tb_jmp_offset; /* != NULL if USE_DIRECT_JUMP */

    /* liveness analysis */
    uint16_t *op_dead_args; /* for each operation, each bit tells if the
                               corresponding argument is dead */
    uint8_t *op_sync_args;  /* for each operation, each bit tells if the
                               corresponding output argument needs to be
                               sync to memory. */

    /* tells in which temporary a given register is. It does not take
       into account fixed registers */
    int reg_to_temp[TCG_TARGET_NB_REGS];
    TCGRegSet reserved_regs;
    intptr_t current_frame_offset;
    intptr_t frame_start;
    intptr_t frame_end;
    int frame_reg;

    tcg_insn_unit *code_ptr;
    TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
    TCGTempSet free_temps[TCG_TYPE_COUNT * 2];

    GHashTable *helpers;

#ifdef CONFIG_PROFILER
    /* profiling info */
    int64_t tb_count1;
    int64_t tb_count;
    int64_t op_count; /* total insn count */
    int op_count_max; /* max insn per TB */
    int64_t temp_count;
    int temp_count_max;
    int64_t del_op_count;
    int64_t code_in_len;
    int64_t code_out_len;
    int64_t interm_time;
    int64_t code_time;
    int64_t la_time;
    int64_t opt_time;
    int64_t restore_count;
    int64_t restore_time;
#endif

#ifdef CONFIG_DEBUG_TCG
    int temps_in_use;
    int goto_tb_issue_mask;
#endif

    uint16_t gen_opc_buf[OPC_BUF_SIZE];
    TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];

    uint16_t *gen_opc_ptr;
    TCGArg *gen_opparam_ptr;

    target_ulong gen_opc_pc[OPC_BUF_SIZE];
    uint16_t gen_opc_icount[OPC_BUF_SIZE];
    uint8_t gen_opc_instr_start[OPC_BUF_SIZE];

    /* Code generation.  Note that we specifically do not use tcg_insn_unit
       here, because there's too much arithmetic throughout that relies
       on addition and subtraction working on bytes.  Rely on the GCC
       extension that allows arithmetic on void*. */
    int code_gen_max_blocks;
    void *code_gen_prologue;
    void *code_gen_buffer;
    size_t code_gen_buffer_size;
    /* threshold to flush the translated code buffer */
    size_t code_gen_buffer_max_size;
    void *code_gen_ptr;

    TBContext tb_ctx;

    /* The TCGBackendData structure is private to tcg-target.c. */
    struct TCGBackendData *be;
};

This structure is used to translate the TBs.

The simplest solution was to protect code generation so that only one CPU can
generate code at a time. This is fine, as we don't want the same TB to be
generated twice in the pool anyway. This is achieved with the tb_lock used
above.

TLB:

The TLB seems to be CPU-dependent, so it is not really a problem, as in our
implementation one CPU = one pthread. But sometimes a CPU wants to flush the
TLB, through an instruction for example, and it is very likely that another
CPU in another thread is executing code at the same time. That's why we chose
to create a tlb_flush mechanism:

When a CPU wants to flush, it asks all CPUs to exit TCG, waits for them, and
then exits itself (a rough sketch of this idea is given below, before the
status section).

This can be reused for tb_invalidate and/or tb_flush as well.

Atomic instructions:

Atomic instructions are quite hard to implement. The TranslationBlock
implementing the atomic instruction can't be interrupted during its execution
(e.g. by an interrupt or a signal); a cmpxchg64 helper is used for that (see
the second sketch below).

QEMU's global lock:

TCG threads take the lock during code execution. This is not OK for
multi-threading, because it means only one thread would be running at a time.
That's why we took Jan's patch to allow TCG to run without the lock and take
it only when needed.
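Here is a rough sketch of the "exit everybody, then flush" idea described in
the TLB section above. It is only an illustration, not the actual patch:
tcg_exit_all_cpus_and_wait() and tcg_resume_all_cpus() are hypothetical helper
names standing for the real synchronisation primitives of the series.

static void tlb_flush_all_safe(CPUState *requesting_cpu)
{
    CPUState *cpu;

    /* 1. Ask every vCPU thread to leave the TCG execution loop (e.g. with
          cpu_exit()) and wait until they all have. */
    tcg_exit_all_cpus_and_wait(requesting_cpu);   /* hypothetical */

    /* 2. Nobody is executing translated code any more, so the flush (or a
          tb_invalidate/tb_flush) is now safe. */
    CPU_FOREACH(cpu) {
        tlb_flush(cpu, 1);
    }

    /* 3. Let the vCPU threads resume. */
    tcg_resume_all_cpus();                        /* hypothetical */
}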
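For the atomic instructions, the direction is to funnel the guest's
read-compare-write through one helper that performs it with a single host
atomic operation, so no other vCPU thread can slip a store in between. The
sketch below is again only an illustration: helper_mttcg_cmpxchg64() is a
hypothetical name, and g2h() addressing is the user-mode shortcut (a softmmu
version would have to go through the TLB instead):

uint64_t helper_mttcg_cmpxchg64(CPUArchState *env, target_ulong addr,
                                uint64_t expected, uint64_t desired)
{
    /* Translate the guest address to a host pointer (user-mode only). */
    uint64_t *host_ptr = (uint64_t *)g2h(addr);

    /* The compare-and-swap is done atomically by the host; the value that
       was in memory before the operation is returned to the guest. */
    return __sync_val_compare_and_swap(host_ptr, expected, desired);
}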
What is the status:

* We can start a vexpress-a15 simulation with two A15s and run two dhrystones
  at a time; performance is improved and it is quite stable.

What is missing:

* tb_flush is not implemented correctly.
* The PageDesc structure is not protected: the patch which introduced a
  first_tb array was not the right approach and has been removed. This implies
  that tb_invalidate is broken.

For both issues we plan to use the same mechanism as for tlb_flush: exit all
the CPUs, flush, invalidate, and let them continue. A generic mechanism must
be implemented for that.

Known issues:

* The GDB stub is broken because it uses tb_invalidate, which we haven't
  implemented for now; there are probably other issues as well.
* SMP > 2 crashes, probably because of tb_invalidate as well.
* We don't know the status of the user-mode code, which is probably broken by
  our changes.