regression in TCG emulation of VTBL neon instruction

* regression in TCG emulation of VTBL neon instruction
@ 2020-11-02  7:54 Ard Biesheuvel
  2020-11-04 16:45 ` Alex Bennée
  0 siblings, 1 reply; 10+ messages in thread
From: Ard Biesheuvel @ 2020-11-02  7:54 UTC (permalink / raw)
  To: qemu-arm, Peter Maydell, Alex Bennée, Richard Henderson

Hello all,

I spotted an issue with the TCG emulation of VTBL instructions in 32-bit mode.

It seems that when using the 4 register version, indexes in the range
[0x10 .. 0x1f] are not handled correctly, and I end up with all zero
vectors in the output.

For example, I am optimizing Linux's NEON ChaCha20 implementation to
use overlapping loads and stores, and this requires the final cipher
stream block to be shifted accordingly, using a sequence such as

vtbl.8 d4, {q4-q5}, d4
vtbl.8 d5, {q4-q5}, d5
vtbl.8 d6, {q4-q5}, d6
vtbl.8 d7, {q4-q5}, d7

where q4-q5 contain 32 bytes of cipher stream, and d4-d7 contain a set
of permutation vectors, where each value is in the range [0x0, 0x1f].

The above works fine with older QEMU and KVM, but with recent QEMU,
this fails, seemingly because d6 and d7 always turn up as all zeros.

This can be reproduced by running the zImage I prepared [0] as follows:

qemu-system-aarch64 -M virt -cpu cortex-a15 -m 2048 -net none
-nographic -kernel arch/arm/boot/zImage

and it will print the following (somewhere halfway down the kernel
log) on the affected builds of QEMU:

alg: skcipher: chacha20-neon encryption test failed (wrong result) on
test vector 1, cfg="in-place"
alg: skcipher: xchacha20-neon encryption test failed (wrong result) on
test vector 1, cfg="in-place"
alg: skcipher: xchacha12-neon encryption test failed (wrong result) on
test vector 1, cfg="in-place"

[0] https://people.linaro.org/~ard.biesheuvel/qemu-tcg-vtbl/zImage

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-02  7:54 regression in TCG emulation of VTBL neon instruction Ard Biesheuvel
@ 2020-11-04 16:45 ` Alex Bennée
  2020-11-04 17:02   ` Ard Biesheuvel
  0 siblings, 1 reply; 10+ messages in thread
From: Alex Bennée @ 2020-11-04 16:45 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: qemu-arm, Peter Maydell, Richard Henderson


Ard Biesheuvel <ardb@kernel.org> writes:

> Hello all,
>
> I spotted an issue with the TCG emulation of VTBL instructions in 32-bit mode.
>
> It seems that when using the 4 register version, indexes in the range
> [0x10 .. 0x1f] are not handled correctly, and I end up with all zero
> vectors in the output.
>
> For example, I am optimizing Linux's NEON ChaCha20 implementation to
> use overlapping loads and stores, and this requires the final cipher
> stream block to be shifted accordingly, using a sequence such as
>
> vtbl.8 d4, {q4-q5}, d4
> vtbl.8 d5, {q4-q5}, d5
> vtbl.8 d6, {q4-q5}, d6
> vtbl.8 d7, {q4-q5}, d7
>
> where q4-q5 contain 32 bytes of cipher stream, and d4-d7 contain a set
> of permutation vectors, where each value is in the range [0x0, 0x1f].
>
> The above works fine with older QEMU and KVM, but with recent QEMU,
> this fails, seemingly because d6 and d7 always turn up as all zeros.
>
> This can be reproduced by running the zImage I prepared [0] as follows:
>
> qemu-system-aarch64 -M virt -cpu cortex-a15 -m 2048 -net none
> -nographic -kernel arch/arm/boot/zImage
>
> and it will print the following (somewhere halfway down the kernel
> log) on the affected builds of QEMU:
>
> alg: skcipher: chacha20-neon encryption test failed (wrong result) on
> test vector 1, cfg="in-place"
> alg: skcipher: xchacha20-neon encryption test failed (wrong result) on
> test vector 1, cfg="in-place"
> alg: skcipher: xchacha12-neon encryption test failed (wrong result) on
> test vector 1, cfg="in-place"

I get:

[    8.974879] testing speed of sync chacha20 (chacha20-neon) encryption
[    8.975230] tcrypt: test 0 (256 bit key, 16 byte blocks): 351309 operations in 1 seconds (5620944 bytes)
[    9.967242] tcrypt: test 1 (256 bit key, 64 byte blocks): 383886 operations in 1 seconds (24568704 bytes)
[   10.967103] tcrypt: test 2 (256 bit key, 256 byte blocks): 109213 operations in 1 seconds (27958528 bytes)
[   11.967164] tcrypt: test 3 (256 bit key, 1024 byte blocks): 29061 operations in 1 seconds (29758464 bytes)
[   12.967165] tcrypt: test 4 (256 bit key, 1420 byte blocks): 19577 operations in 1 seconds (27799340 bytes)
[   13.967147] tcrypt: test 5 (256 bit key, 4096 byte blocks): 7217 operations in 1 seconds (29560832 bytes)
[   14.972354] input: gpio-keys as /devices/platform/gpio-keys/input/input0
[   14.977272] uart-pl011 9000000.pl011: no DMA platform data
[   14.980208] VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6
[   14.980431] Please append a correct "root=" boot option; here are the available partitions:

I wonder if it was a transient bug when stuff was converted to
decodetree and got fixed up later? Tested on HEAD @ 4c5b97bfd and @
e46912b66.

>
>
>
> [0] https://people.linaro.org/~ard.biesheuvel/qemu-tcg-vtbl/zImage


-- 
Alex Bennée

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 16:45 ` Alex Bennée
@ 2020-11-04 17:02   ` Ard Biesheuvel
  2020-11-04 17:44     ` Alex Bennée
  0 siblings, 1 reply; 10+ messages in thread
From: Ard Biesheuvel @ 2020-11-04 17:02 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-arm, Peter Maydell, Richard Henderson

On Wed, 4 Nov 2020 at 17:45, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Ard Biesheuvel <ardb@kernel.org> writes:
>
> > Hello all,
> >
> > I spotted an issue with the TCG emulation of VTBL instructions in 32-bit mode.
> >
> > It seems that when using the 4 register version, indexes in the range
> > [0x10 .. 0x1f] are not handled correctly, and I end up with all zero
> > vectors in the output.
> >
> > For example, I am optimizing Linux's NEON ChaCha20 implementation to
> > use overlapping loads and stores, and this requires the final cipher
> > stream block to be shifted accordingly, using a sequence such as
> >
> > vtbl.8 d4, {q4-q5}, d4
> > vtbl.8 d5, {q4-q5}, d5
> > vtbl.8 d6, {q4-q5}, d6
> > vtbl.8 d7, {q4-q5}, d7
> >
> > where q4-q5 contain 32 bytes of cipher stream, and d4-d7 contain a set
> > of permutation vectors, where each value is in the range [0x0, 0x1f].
> >
> > The above works fine with older QEMU and KVM, but with recent QEMU,
> > this fails, seemingly because d6 and d7 always turn up as all zeros.
> >
> > This can be reproduced by running the zImage I prepared [0] as follows:
> >
> > qemu-system-aarch64 -M virt -cpu cortex-a15 -m 2048 -net none
> > -nographic -kernel arch/arm/boot/zImage
> >
> > and it will print the following (somewhere halfway down the kernel
> > log) on the affected builds of QEMU:
> >
> > alg: skcipher: chacha20-neon encryption test failed (wrong result) on
> > test vector 1, cfg="in-place"
> > alg: skcipher: xchacha20-neon encryption test failed (wrong result) on
> > test vector 1, cfg="in-place"
> > alg: skcipher: xchacha12-neon encryption test failed (wrong result) on
> > test vector 1, cfg="in-place"
>
> I get:
>
> [    8.974879] testing speed of sync chacha20 (chacha20-neon) encryption
> [    8.975230] tcrypt: test 0 (256 bit key, 16 byte blocks): 351309 operations in 1 seconds (5620944 bytes)
> [    9.967242] tcrypt: test 1 (256 bit key, 64 byte blocks): 383886 operations in 1 seconds (24568704 bytes)
> [   10.967103] tcrypt: test 2 (256 bit key, 256 byte blocks): 109213 operations in 1 seconds (27958528 bytes)
> [   11.967164] tcrypt: test 3 (256 bit key, 1024 byte blocks): 29061 operations in 1 seconds (29758464 bytes)
> [   12.967165] tcrypt: test 4 (256 bit key, 1420 byte blocks): 19577 operations in 1 seconds (27799340 bytes)
> [   13.967147] tcrypt: test 5 (256 bit key, 4096 byte blocks): 7217 operations in 1 seconds (29560832 bytes)
> [   14.972354] input: gpio-keys as /devices/platform/gpio-keys/input/input0
> [   14.977272] uart-pl011 9000000.pl011: no DMA platform data
> [   14.980208] VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6
> [   14.980431] Please append a correct "root=" boot option; here are the available partitions:
>
> I wonder if it was a transient bug when stuff was converted to
> decodetree and got fixed up later? Tested on HEAD @ 4c5b97bfd and @
> e46912b66.
>

I am seeing the issue on 700d20b49e303549 *and* on e46912b66f50b2d8,
after a clean rebuild.

Weird.

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 17:02   ` Ard Biesheuvel
@ 2020-11-04 17:44     ` Alex Bennée
  2020-11-04 17:50       ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Alex Bennée @ 2020-11-04 17:44 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: qemu-arm, Peter Maydell, Richard Henderson


Ard Biesheuvel <ardb@kernel.org> writes:

> On Wed, 4 Nov 2020 at 17:45, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Ard Biesheuvel <ardb@kernel.org> writes:
>>
>> > Hello all,
>> >
>> > I spotted an issue with the TCG emulation of VTBL instructions in 32-bit mode.
>> >
>> > It seems that when using the 4 register version, indexes in the range
>> > [0x10 .. 0x1f] are not handled correctly, and I end up with all zero
>> > vectors in the output.
>> >
>> > For example, I am optimizing Linux's NEON ChaCha20 implementation to
>> > use overlapping loads and stores, and this requires the final cipher
>> > stream block to be shifted accordingly, using a sequence such as
>> >
>> > vtbl.8 d4, {q4-q5}, d4
>> > vtbl.8 d5, {q4-q5}, d5
>> > vtbl.8 d6, {q4-q5}, d6
>> > vtbl.8 d7, {q4-q5}, d7
>> >
>> > where q4-q5 contain 32 bytes of cipher stream, and d4-d7 contain a set
>> > of permutation vectors, where each value is in the range [0x0, 0x1f].
>> >
>> > The above works fine with older QEMU and KVM, but with recent QEMU,
>> > this fails, seemingly because d6 and d7 always turn up as all zeros.
>> >
>> > This can be reproduced by running the zImage I prepared [0] as follows:
>> >
>> > qemu-system-aarch64 -M virt -cpu cortex-a15 -m 2048 -net none
>> > -nographic -kernel arch/arm/boot/zImage
>> >
>> > and it will print the following (somewhere halfway down the kernel
>> > log) on the affected builds of QEMU:
>> >
>> > alg: skcipher: chacha20-neon encryption test failed (wrong result) on
>> > test vector 1, cfg="in-place"
>> > alg: skcipher: xchacha20-neon encryption test failed (wrong result) on
>> > test vector 1, cfg="in-place"
>> > alg: skcipher: xchacha12-neon encryption test failed (wrong result) on
>> > test vector 1, cfg="in-place"
>>
>> I get:
>>
>> [    8.974879] testing speed of sync chacha20 (chacha20-neon) encryption
>> [    8.975230] tcrypt: test 0 (256 bit key, 16 byte blocks): 351309 operations in 1 seconds (5620944 bytes)
>> [    9.967242] tcrypt: test 1 (256 bit key, 64 byte blocks): 383886 operations in 1 seconds (24568704 bytes)
>> [   10.967103] tcrypt: test 2 (256 bit key, 256 byte blocks): 109213 operations in 1 seconds (27958528 bytes)
>> [   11.967164] tcrypt: test 3 (256 bit key, 1024 byte blocks): 29061 operations in 1 seconds (29758464 bytes)
>> [   12.967165] tcrypt: test 4 (256 bit key, 1420 byte blocks): 19577 operations in 1 seconds (27799340 bytes)
>> [   13.967147] tcrypt: test 5 (256 bit key, 4096 byte blocks): 7217 operations in 1 seconds (29560832 bytes)
>> [   14.972354] input: gpio-keys as /devices/platform/gpio-keys/input/input0
>> [   14.977272] uart-pl011 9000000.pl011: no DMA platform data
>> [   14.980208] VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6
>> [   14.980431] Please append a correct "root=" boot option; here are the available partitions:
>>
>> I wonder if it was a transient bug when stuff was converted to
>> decodetree and got fixed up later? Tested on HEAD @ 4c5b97bfd and @
>> e46912b66.
>>
>
> I am seeing the issue on 700d20b49e303549 *and* on e46912b66f50b2d8,
> after a clean rebuild.

Just checking - what host are you on?

-- 
Alex Bennée

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 17:44     ` Alex Bennée
@ 2020-11-04 17:50       ` Peter Maydell
  2020-11-04 18:01         ` Ard Biesheuvel
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2020-11-04 17:50 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Ard Biesheuvel, qemu-arm, Richard Henderson

On Wed, 4 Nov 2020 at 17:44, Alex Bennée <alex.bennee@linaro.org> wrote:
> Just checking - what host are you on?

Oh, good question -- what the TCG backend emits as vector
operations or not will depend on the host CPU (eg whether
it supports AVX1/AVX2/etc).

If the test case can be cut down to a Linux userspace
program that can be run under the qemu-arm single-binary
emulator that will probably also be easier to debug than
"boot whole guest kernel and wait for it to get to a selftest".

thanks
-- PMM

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 17:50       ` Peter Maydell
@ 2020-11-04 18:01         ` Ard Biesheuvel
  2020-11-04 19:22           ` Ard Biesheuvel
  2020-11-04 20:36           ` Alex Bennée
  0 siblings, 2 replies; 10+ messages in thread
From: Ard Biesheuvel @ 2020-11-04 18:01 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Alex Bennée, qemu-arm, Richard Henderson

On Wed, 4 Nov 2020 at 18:50, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Wed, 4 Nov 2020 at 17:44, Alex Bennée <alex.bennee@linaro.org> wrote:
> > Just checking - what host are you on?
>

model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti
ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad
fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves
dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear
flush_l1d


> Oh, good question -- what the TCG backend emits as vector
> operations or not will depend on the host CPU (eg whether
> it supports AVX1/AVX2/etc).
>
> If the test case can be cut down to a Linux userspace
> program that can be run under the qemu-arm single-binary
> emulator that will probably also be easier to debug than
> "boot whole guest kernel and wait for it to get to a selftest".
>

Sure. The code can be found at [0]

The sequence in question is

# r4 between -31 and 0
# q4-q5 holding 32 bytes of cipher stream

adr lr, .Lpermute + 32
add lr, lr, r4
vld1.8 {q2-q3}, [lr]

vtbl.8 d4, {q4-q5}, d4
vtbl.8 d5, {q4-q5}, d5
vtbl.8 d6, {q4-q5}, d6
vtbl.8 d7, {q4-q5}, d7

.Lpermute:
 .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
 .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
 .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
 .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
 .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
 .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
 .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
 .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f

This is essentially a bytewise rotate function operating on a 32 byte
vector (the patch explains the purpose)

Using GDB to single step through the code, I noticed that d6 and d7
turn up as all zeroes.


[0] https://lore.kernel.org/linux-arm-kernel/20201103162809.28167-1-ardb@kernel.org/

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 18:01         ` Ard Biesheuvel
@ 2020-11-04 19:22           ` Ard Biesheuvel
  2020-11-04 20:36           ` Alex Bennée
  1 sibling, 0 replies; 10+ messages in thread
From: Ard Biesheuvel @ 2020-11-04 19:22 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Alex Bennée, qemu-arm, Richard Henderson

On Wed, 4 Nov 2020 at 19:01, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Wed, 4 Nov 2020 at 18:50, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Wed, 4 Nov 2020 at 17:44, Alex Bennée <alex.bennee@linaro.org> wrote:
> > > Just checking - what host are you on?
> >
>
> model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti
> ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
> rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves
> dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear
> flush_l1d
>
>
> > Oh, good question -- what the TCG backend emits as vector
> > operations or not will depend on the host CPU (eg whether
> > it supports AVX1/AVX2/etc).
> >
> > If the test case can be cut down to a Linux userspace
> > program that can be run under the qemu-arm single-binary
> > emulator that will probably also be easier to debug than
> > "boot whole guest kernel and wait for it to get to a selftest".
> >
>
> Sure. The code can be found at [0]
>
> The sequence in question is
>
> # r4 between -31 and 0
> # q4-q5 holding 32 bytes of cipher stream
>
> adr lr, .Lpermute + 32
> add lr, lr, r4
> vld1.8 {q2-q3}, [lr]
>
> vtbl.8 d4, {q4-q5}, d4
> vtbl.8 d5, {q4-q5}, d5
> vtbl.8 d6, {q4-q5}, d6
> vtbl.8 d7, {q4-q5}, d7
>
> .Lpermute:
>  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
>  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
>  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
>  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
>  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
>  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
>  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
>  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
>
> This is essentially a bytewise rotate function operating on a 32 byte
> vector (the patch explains the purpose)
>
> Using GDB to single step through the code, I noticed that d6 and d7
> turn up as all zeroes.
>
>
> [0] https://lore.kernel.org/linux-arm-kernel/20201103162809.28167-1-ardb@kernel.org/

OK, I could not reproduce with qemu-arm. However, I did find out that
the issue only occurs when using qemu-system-aarch64, not when using
qemu-system-arm.

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 18:01         ` Ard Biesheuvel
  2020-11-04 19:22           ` Ard Biesheuvel
@ 2020-11-04 20:36           ` Alex Bennée
  2020-11-04 23:18             ` Ard Biesheuvel
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Bennée @ 2020-11-04 20:36 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Peter Maydell, qemu-arm


Ard Biesheuvel <ardb@kernel.org> writes:

> On Wed, 4 Nov 2020 at 18:50, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Wed, 4 Nov 2020 at 17:44, Alex Bennée <alex.bennee@linaro.org> wrote:
>> > Just checking - what host are you on?
>>
>
> model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti
> ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
> rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves
> dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear
> flush_l1d

Eyeballing hackbox2 which has:

model name      : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid
dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

Seems to have avx512 but the avx1 and avx2 stuff is common which will
make use of more registers in the generated code:

    if (have_avx1) {
        tcg_target_available_regs[TCG_TYPE_V64] = ALL_VECTOR_REGS;
        tcg_target_available_regs[TCG_TYPE_V128] = ALL_VECTOR_REGS;
    }
    if (have_avx2) {
        tcg_target_available_regs[TCG_TYPE_V256] = ALL_VECTOR_REGS;
    }

>
>
>> Oh, good question -- what the TCG backend emits as vector
>> operations or not will depend on the host CPU (eg whether
>> it supports AVX1/AVX2/etc).
>>
>> If the test case can be cut down to a Linux userspace
>> program that can be run under the qemu-arm single-binary
>> emulator that will probably also be easier to debug than
>> "boot whole guest kernel and wait for it to get to a selftest".
>>
>
> Sure. The code can be found at [0]
>
> The sequence in question is
>
> # r4 between -31 and 0
> # q4-q5 holding 32 bytes of cipher stream
>
> adr lr, .Lpermute + 32
> add lr, lr, r4
> vld1.8 {q2-q3}, [lr]
>
> vtbl.8 d4, {q4-q5}, d4
> vtbl.8 d5, {q4-q5}, d5
> vtbl.8 d6, {q4-q5}, d6
> vtbl.8 d7, {q4-q5}, d7
>
> .Lpermute:
>  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
>  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
>  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
>  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
>  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
>  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
>  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
>  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
>
> This is essentially a bytewise rotate function operating on a 32 byte
> vector (the patch explains the purpose)
>
> Using GDB to single step through the code, I noticed that d6 and d7
> turn up as all zeroes.
>
>
> [0] https://lore.kernel.org/linux-arm-kernel/20201103162809.28167-1-ardb@kernel.org/


-- 
Alex Bennée

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 20:36           ` Alex Bennée
@ 2020-11-04 23:18             ` Ard Biesheuvel
  2020-11-05  3:47               ` Richard Henderson
  0 siblings, 1 reply; 10+ messages in thread
From: Ard Biesheuvel @ 2020-11-04 23:18 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Peter Maydell, qemu-arm@nongnu.org, Richard Henderson

On Wed, 4 Nov 2020 at 21:37, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Ard Biesheuvel <ardb@kernel.org> writes:
>
> > On Wed, 4 Nov 2020 at 18:50, Peter Maydell <peter.maydell@linaro.org> wrote:
> >>
> >> On Wed, 4 Nov 2020 at 17:44, Alex Bennée <alex.bennee@linaro.org> wrote:
> >> > Just checking - what host are you on?
> >>
> >
> > model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> > pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> > pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> > xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
> > ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
> > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> > rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti
> > ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad
> > fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
> > rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves
> > dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear
> > flush_l1d
>
> Eyeballing hackbox2 which has:
>
> model name      : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid
> dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
>
> Seems to have avx512 but the avx1 and avx2 stuff is common which will
> make use of more registers in the generated code:
>
>     if (have_avx1) {
>         tcg_target_available_regs[TCG_TYPE_V64] = ALL_VECTOR_REGS;
>         tcg_target_available_regs[TCG_TYPE_V128] = ALL_VECTOR_REGS;
>     }
>     if (have_avx2) {
>         tcg_target_available_regs[TCG_TYPE_V256] = ALL_VECTOR_REGS;
>     }
>
> >
> >
> >> Oh, good question -- what the TCG backend emits as vector
> >> operations or not will depend on the host CPU (eg whether
> >> it supports AVX1/AVX2/etc).
> >>
> >> If the test case can be cut down to a Linux userspace
> >> program that can be run under the qemu-arm single-binary
> >> emulator that will probably also be easier to debug than
> >> "boot whole guest kernel and wait for it to get to a selftest".
> >>
> >
> > Sure. The code can be found at [0]
> >
> > The sequence in question is
> >
> > # r4 between -31 and 0
> > # q4-q5 holding 32 bytes of cipher stream
> >
> > adr lr, .Lpermute + 32
> > add lr, lr, r4
> > vld1.8 {q2-q3}, [lr]
> >
> > vtbl.8 d4, {q4-q5}, d4
> > vtbl.8 d5, {q4-q5}, d5
> > vtbl.8 d6, {q4-q5}, d6
> > vtbl.8 d7, {q4-q5}, d7
> >
> > .Lpermute:
> >  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
> >  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
> >  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
> >  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
> >  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
> >  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
> >  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
> >  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
> >
> > This is essentially a bytewise rotate function operating on a 32 byte
> > vector (the patch explains the purpose)
> >
> > Using GDB to single step through the code, I noticed that d6 and d7
> > turn up as all zeroes.
> >
> >
> > [0] https://lore.kernel.org/linux-arm-kernel/20201103162809.28167-1-ardb@kernel.org/
>
>

So comparing qemu-system-aarch64 and qemu-system-arm running in GDB gives me:

qemu-system-arm:

(gdb) b helper_neon_tbl if maxindex==32
Breakpoint 1 at 0x60e250: file ../target/arm/op_helper.c, line 73.
(gdb) r -M virt -cpu cortex-a15 -m 2048 -net none -nographic -kernel
arch/arm/boot/zImage
Starting program: /home/ardbie01/build/qemu/build/qemu-system-arm -M
virt -cpu cortex-a15 -m 2048 -net none -nographic -kernel
arch/arm/boot/zImage

(gdb) x/8x table
0x555556e6d390: 0xbb75b15a 0xdb0107ff 0x560fe329 0x980e8754
0x555556e6d3a0: 0x08e58eb7 0x814e8602 0x2654e32c 0x979ff7d2

whereas qemu-system-aarch64 gives me

(gdb) x/8x table
0x555556ff8c20: 0xbb75b15a 0xdb0107ff 0x560fe329 0x980e8754
0x555556ff8c30: 0x00000000 0x00000000 0x00000000 0x00000000

Looking at HELPER(neon_tbl)(), it seems to me that casting void *vn to
uint64_t* and indexing it as an array fails to account for the SVE
view of the registers. This also explains why qemu-system-arm works
and qemu-system-aarch64 doesn't.

* Re: regression in TCG emulation of VTBL neon instruction
  2020-11-04 23:18             ` Ard Biesheuvel
@ 2020-11-05  3:47               ` Richard Henderson
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Henderson @ 2020-11-05  3:47 UTC (permalink / raw)
  To: Ard Biesheuvel, Alex Bennée; +Cc: Peter Maydell, qemu-arm@nongnu.org

On 11/4/20 3:18 PM, Ard Biesheuvel wrote:
> Looking at HELPER(neon_tbl)(), it seems to me that casting void *vn to
> uint64_t* and indexing it as an array fails to account for the SVE
> view of the registers. This also explains why qemu-system-arm works
> and qemu-system-aarch64 doesn't.

Yep, you're right.  There was a semi-recent change here, but it merely moved
the point at which we treated this as an array.

There's a different tbl version for aarch64, helper_simd_tbl, which might be
done correctly for both.  I'll investigate more tomorrow.


r~
