From: Alex Bennée
To: Ard Biesheuvel
Cc: Peter Maydell, qemu-arm@nongnu.org, Richard Henderson
Subject: Re: regression in TCG emulation of VTBL neon instruction
Date: Wed, 04 Nov 2020 20:36:59 +0000
Message-ID: <87lffgc104.fsf@linaro.org>
References: <87v9elax60.fsf@linaro.org> <87pn4taufd.fsf@linaro.org>

Ard Biesheuvel writes:

> On Wed, 4 Nov 2020 at 18:50, Peter Maydell wrote:
>>
>> On Wed, 4 Nov 2020 at 17:44, Alex Bennée wrote:
>> > Just checking - what host are you on?
>> >
>
> model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti
> ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
> rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves
> dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear
> flush_l1d

Eyeballing hackbox2 which has:

  model name : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
  flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
  pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
  pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
  xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
  ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1
  sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
  rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3
  invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi
  flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
  invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
  clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
  xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida
  arat pln pts pku ospke md_clear flush_l1d

Seems to have avx512 but the avx1 and avx2 stuff is common which will
make use of more registers in the generated code:

    if (have_avx1) {
        tcg_target_available_regs[TCG_TYPE_V64] = ALL_VECTOR_REGS;
        tcg_target_available_regs[TCG_TYPE_V128] = ALL_VECTOR_REGS;
    }
    if (have_avx2) {
        tcg_target_available_regs[TCG_TYPE_V256] = ALL_VECTOR_REGS;
    }

>
>
>> Oh, good question -- what the TCG backend emits as vector
>> operations or not will depend on the host CPU (eg whether
>> it supports AVX1/AVX2/etc).
>>
>> If the test case can be cut down to a Linux userspace
>> program that can be run under the qemu-arm single-binary
>> emulator that will probably also be easier to debug than
>> "boot whole guest kernel and wait for it to get to a selftest".
>>
>
> Sure. The code can be found at [0]
>
> The sequence in question is
>
> # r4 between -31 and 0
> # q4-q5 holding 32 bytes of cipher stream
>
> adr lr, .Lpermute + 32
> add lr, lr, r4
> vld1.8 {q2-q3}, [lr]
>
> vtbl.8 d4, {q4-q5}, d4
> vtbl.8 d5, {q4-q5}, d5
> vtbl.8 d6, {q4-q5}, d6
> vtbl.8 d7, {q4-q5}, d7
>
> .Lpermute:
> .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
> .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
> .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
> .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
> .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
> .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
> .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
> .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
>
> This is essentially a bytewise rotate function operating on a 32 byte
> vector (the patch explains the purpose)
>
> Using GDB to single step through the code, I noticed that d6 and d7
> turn up as all zeroes.
>
>
> [0] https://lore.kernel.org/linux-arm-kernel/20201103162809.28167-1-ardb@kernel.org/

-- 
Alex Bennée
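
For reference, the quoted VTBL sequence can be modelled in a few lines of Python to see what d4-d7 ought to contain. This is only a sketch of the architectural behaviour (a byte-wise table lookup where an out-of-range index yields 0), not QEMU code; the helper names `vtbl` and `rotate_via_vtbl` and the sample `stream` value are made up for illustration.

```python
# Reference model of the quoted NEON sequence. Not QEMU code; the helper
# names and the sample input are hypothetical, chosen for illustration.

def vtbl(table: bytes, indices: bytes) -> bytes:
    """Byte-wise table lookup: an out-of-range index produces 0."""
    return bytes(table[i] if i < len(table) else 0 for i in indices)

# .Lpermute from the quoted assembly: 0x00..0x1f, repeated twice.
PERMUTE = bytes(range(32)) * 2

def rotate_via_vtbl(stream: bytes, r4: int) -> bytes:
    """Model of the adr/add/vld1.8 index load followed by four vtbl.8."""
    assert len(stream) == 32 and -31 <= r4 <= 0
    start = 32 + r4                    # adr lr, .Lpermute + 32; add lr, lr, r4
    idx = PERMUTE[start:start + 32]    # vld1.8 {q2-q3}, [lr]
    # One vtbl.8 per 8-byte D register (d4..d7), all using the 32-byte table.
    return b"".join(vtbl(stream, idx[k:k + 8]) for k in range(0, 32, 8))

stream = bytes(range(1, 33))           # stand-in for q4-q5, all bytes non-zero
out = rotate_via_vtbl(stream, -5)
d6_d7 = out[16:32]                     # upper half of the result
```

Every index loaded from `.Lpermute` is below 32 for any r4 in the stated range, so in this model d6 and d7 can only contain zero bytes where the cipher stream itself does.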