* Re: rcutorture’s init segfaults in ppc64le VM
2022-02-08 12:12 ` Paul Menzel
@ 2022-02-08 12:27 ` Paul Menzel
2022-02-11 1:48 ` Michael Ellerman
2022-03-10 2:37 ` Zhouyi Zhou
2 siblings, 0 replies; 15+ messages in thread
From: Paul Menzel @ 2022-02-08 12:27 UTC
To: Michael Ellerman, Paul E. McKenney
Cc: rcu, Zhouyi Zhou, linuxppc-dev, Willy Tarreau
[Correct sha1 for test for 2022.02.01-21.52.37]
Am 08.02.22 um 13:12 schrieb Paul Menzel:
> Dear Michael,
>
>
> Thank you for looking into this.
>
> Am 08.02.22 um 11:09 schrieb Michael Ellerman:
>> Paul Menzel writes:
>
> […]
>
>>> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
>>> 5.17-rc2+ with rcutorture tests
>>
>> I'm not sure if that's the host kernel version or the version you're
>> using of rcutorture? Can you tell us the sha1 of your host kernel and of
>> the tree you're running rcutorture from?
>
> The host system runs Linux 5.17-rc1+ started with kexec. Unfortunately,
> I am unable to find the exact sha1.
>
> $ more /proc/version
> Linux version 5.17.0-rc1+ (pmenzel@flughafenberlinbrandenburgwillybrandt.molgen.mpg.de) (Ubuntu clang version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28 17:13:04 CET 2022
>
> The Linux tree, from where I run rcutorture from, is at commit
> dfd42facf1e4 (Linux 5.17-rc3) with four patches on top:
>
> $ git log --oneline -6
> 207cec79e752 (HEAD -> master, origin/master, origin/HEAD) Problems with rcutorture on ppc64le: allmodconfig(2) and other failures
> 8c82f96fbe57 ata: libata-sata: improve sata_link_debounce()
> a447541d925f ata: libata-sata: remove debounce delay by default
> afd84e1eeafc ata: libata-sata: introduce struct sata_deb_timing
> f4caf7e48b75 ata: libata-sata: Simplify sata_link_resume() interface
> dfd42facf1e4 (tag: v5.17-rc3) Linux 5.17-rc3
I was able to reproduce this with the above, but the report and the
attached logs at the end are from:
$ git log --oneline -6 b37a34a8cf5a
b37a34a8cf5a Problems with rcutorture on ppc64le: allmodconfig(2) and other failures
9a78ddead89a ata: libata-sata: improve sata_link_debounce()
567da2eaf099 ata: libata-sata: remove debounce delay by default
70ae61851660 ata: libata-sata: introduce struct sata_deb_timing
9ebb6433d9c3 ata: libata-sata: Simplify sata_link_resume() interface
26291c54e111 (tag: v5.17-rc2) Linux 5.17-rc2
>>> $ tools/testing/selftests/rcutorture/bin/torture.sh --duration 10
>>>
>>> the built init
>>>
>>> $ file tools/testing/selftests/rcutorture/initrd/init
>>> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=0ded0e45649184a296f30d611f7a03cc51ecb616, for GNU/Linux 3.10.0, stripped
>>
>> Mine looks pretty much identical:
>>
>> $ file tools/testing/selftests/rcutorture/initrd/init
>> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=86078bf6e5d54ab0860d36aa9a65d52818b972c8, for GNU/Linux 3.10.0, stripped
>>
>>> segfaults in QEMU. From one of the log files
>>
>> But mine doesn't segfault, it runs fine and the test completes.
>>
>> What qemu version are you using?
>>
>> I tried 4.2.1 and 6.2.0, both worked.
>
> $ qemu-system-ppc64le --version
> QEMU emulator version 6.0.0 (Debian 1:6.0+dfsg-2expubuntu1.1)
> Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
>
>>> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/console.log
>>>
>
> Sorry, that was the wrong path/test. The correct one for the excerpt
> below is:
>
>
> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/console.log
>
>
> (For TREE03, QEMU does not start the Linux kernel at all, that means no
> output after:
>
> Booting Linux via __start() @ 0x0000000000400000 ...
> )
>
>>> [ 1.119803][ T1] Run /init as init process
>>> [ 1.122011][ T1] init[1]: segfault (11) at f0656d90 nip 10000a18 lr 0 code 1 in init[10000000+d0000]
>>> [ 1.124863][ T1] init[1]: code: 2c2903e7 f9210030 4081ff84 4bffff58 00000000 01000000 00000580 3c40100f
>>> [ 1.128823][ T1] init[1]: code: 38427c00 7c290b78 782106e4 38000000 <f821ff81> 7c0803a6 f8010000 e9028010
>>
>> The disassembly from 3c40100f is:
>> lis r2,4111
>> addi r2,r2,31744
>> mr r9,r1
>> rldicr r1,r1,0,59
>> li r0,0
>> stdu r1,-128(r1) <- fault
>> mtlr r0
>> std r0,0(r1)
>> ld r8,-32752(r2)
>>
>>
>> I think you'll find that's the code at the ELF entry point. You can
>> check with:
>>
>> $ readelf -e tools/testing/selftests/rcutorture/initrd/init | grep Entry
>> Entry point address: 0x10000c0c
>>
>> $ objdump -d tools/testing/selftests/rcutorture/initrd/init | grep -m 1 -A 8 10000c0c
>> 10000c0c: 0e 10 40 3c lis r2,4110
>> 10000c10: 00 7b 42 38 addi r2,r2,31488
>> 10000c14: 78 0b 29 7c mr r9,r1
>> 10000c18: e4 06 21 78 rldicr r1,r1,0,59
>> 10000c1c: 00 00 00 38 li r0,0
>> 10000c20: 81 ff 21 f8 stdu r1,-128(r1)
>> 10000c24: a6 03 08 7c mtlr r0
>> 10000c28: 00 00 01 f8 std r0,0(r1)
>> 10000c2c: 10 80 02 e9 ld r8,-32752(r2)
>>
>> The fault you're seeing is the first store using the stack pointer (r1),
>> which is setup by the kernel.
>>
>> The fault address f0656d90 is weirdly low, the stack should be up near
>> 128TB.
>>
>> I'm not sure how we end up with a bad r1.
>>
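For reference, the arithmetic behind this observation can be sketched as follows. This is only an illustration in Python: the fault address and the two instructions are taken from the log and disassembly quoted above, "128TB" is assumed to mean 2**47 bytes, and the variable names are made up for the example.

```python
# Values taken from the console log and disassembly quoted above.
FAULT_ADDR = 0xF0656D90   # address reported by "segfault (11) at ..."
FRAME_SIZE = 128          # stdu r1,-128(r1) stores at r1 - 128

# Assumption for illustration: "128TB" means 2**47 bytes.
STACK_TOP_HINT = 128 * 2**40

# stdu faulted at r1 - 128, so r1 after "rldicr r1,r1,0,59" was:
aligned_sp = FAULT_ADDR + FRAME_SIZE

# rldicr r1,r1,0,59 keeps bits 0..59 (IBM bit numbering), i.e. it clears
# the low four bits, so the aligned stack pointer is a multiple of 16.
is_16_byte_aligned = aligned_sp % 16 == 0

print(f"aligned r1       = {aligned_sp:#x}")
print(f"16-byte aligned  : {is_16_byte_aligned}")
print(f"fits in 32 bits  : {aligned_sp < 2**32}")
print(f"fraction of 128TB: {aligned_sp / STACK_TOP_HINT:.6f}")
```

The value is consistent with the alignment done by the entry code, but it sits below 4 GiB, which is far from where a 64-bit process stack would normally be placed. The sketch only restates the numbers; it does not explain where the bad r1 comes from.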
>> Can you dump some info about the kernel that was built, something like:
>>
>> $ file /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/vmlinux
>>
>> And maybe paste/attach the full log, maybe there's a clue somewhere.
>
> You can now download the content of
> `/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01`
> [1, 65 MB].
>
> Can you reproduce the segmentation fault with the line below?
>
> $ qemu-system-ppc64 -enable-kvm -nographic -smp cores=1,threads=8 \
> -net none -enable-kvm -M pseries -nodefaults -device spapr-vscsi -serial stdio -m 512 \
> -kernel /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/vmlinux \
> -append "debug_boot_weak_hash panic=-1 console=ttyS0 \
> torture.disable_onoff_at_boot locktorture.onoff_interval=3 \
> locktorture.onoff_holdoff=30 locktorture.stat_interval=15 \
> locktorture.shutdown_secs=60 locktorture.verbose=1"
>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
* Re: rcutorture’s init segfaults in ppc64le VM
2022-02-08 12:12 ` Paul Menzel
2022-02-08 12:27 ` Paul Menzel
@ 2022-02-11 1:48 ` Michael Ellerman
2022-02-11 14:19 ` Paul Menzel
2022-03-10 2:37 ` Zhouyi Zhou
2 siblings, 1 reply; 15+ messages in thread
From: Michael Ellerman @ 2022-02-11 1:48 UTC
To: Paul Menzel, Paul E. McKenney
Cc: rcu, Zhouyi Zhou, linuxppc-dev, Willy Tarreau
Paul Menzel <pmenzel@molgen.mpg.de> writes:
> Am 08.02.22 um 11:09 schrieb Michael Ellerman:
>> Paul Menzel writes:
>
> […]
>
>>> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
>>> 5.17-rc2+ with rcutorture tests
>>
>> I'm not sure if that's the host kernel version or the version you're
>> using of rcutorture? Can you tell us the sha1 of your host kernel and of
>> the tree you're running rcutorture from?
>
> The host system runs Linux 5.17-rc1+ started with kexec. Unfortunately,
> I am unable to find the exact sha1.
>
> $ more /proc/version
> Linux version 5.17.0-rc1+ (pmenzel@flughafenberlinbrandenburgwillybrandt.molgen.mpg.de) (Ubuntu clang version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28 17:13:04 CET 2022
OK. In general rc1 kernels can have issues, so it might be worth
rebooting the host into either v5.17-rc3 or a distro or stable kernel.
Just to rule out any issues on the host.
> The Linux tree, from where I run rcutorture from, is at commit
> dfd42facf1e4 (Linux 5.17-rc3) with four patches on top:
>
> $ git log --oneline -6
> 207cec79e752 (HEAD -> master, origin/master, origin/HEAD) Problems with rcutorture on ppc64le: allmodconfig(2) and other failures
> 8c82f96fbe57 ata: libata-sata: improve sata_link_debounce()
> a447541d925f ata: libata-sata: remove debounce delay by default
> afd84e1eeafc ata: libata-sata: introduce struct sata_deb_timing
> f4caf7e48b75 ata: libata-sata: Simplify sata_link_resume() interface
> dfd42facf1e4 (tag: v5.17-rc3) Linux 5.17-rc3
>
>>> $ tools/testing/selftests/rcutorture/bin/torture.sh --duration 10
>>>
>>> the built init
>>>
>>> $ file tools/testing/selftests/rcutorture/initrd/init
>>> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=0ded0e45649184a296f30d611f7a03cc51ecb616, for GNU/Linux 3.10.0, stripped
>>
>> Mine looks pretty much identical:
>>
>> $ file tools/testing/selftests/rcutorture/initrd/init
>> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=86078bf6e5d54ab0860d36aa9a65d52818b972c8, for GNU/Linux 3.10.0, stripped
>>
>>> segfaults in QEMU. From one of the log files
>>
>> But mine doesn't segfault, it runs fine and the test completes.
>>
>> What qemu version are you using?
>>
>> I tried 4.2.1 and 6.2.0, both worked.
>
> $ qemu-system-ppc64le --version
> QEMU emulator version 6.0.0 (Debian 1:6.0+dfsg-2expubuntu1.1)
> Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
OK, that's one difference between our setups. I'd be surprised if it
explains this bug, but I guess anything's possible.
>>> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/console.log
>
> Sorry, that was the wrong path/test. The correct one for the excerpt
> below is:
>
>
> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/console.log
>
> (For TREE03, QEMU does not start the Linux kernel at all, that means no
> output after:
>
> Booting Linux via __start() @ 0x0000000000400000 ...
OK yeah I see that too.
Removing "threadirqs" from tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
seems to fix it.
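One way to try this without editing the file by hand is sketched below in Python. The helper name `drop_boot_param` and the sample parameter string are assumptions for illustration; the actual contents of TREE03.boot may differ.

```python
def drop_boot_param(boot_args: str, name: str) -> str:
    """Remove a kernel boot parameter (bare or name=value) from a
    whitespace-separated parameter string."""
    kept = [tok for tok in boot_args.split()
            if tok != name and not tok.startswith(name + "=")]
    return " ".join(kept)


# Sample parameters for illustration only (hypothetical file contents).
sample = "rcutree.kthread_prio=2 threadirqs tree.use_softirq=0"

print(drop_boot_param(sample, "threadirqs"))
```

The helper matches whole tokens, so a parameter whose name merely contains "threadirqs" as a substring would be left alone.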
I still see some preempt-related warnings; we clearly have some bugs
with preempt enabled.
> You can now download the content of
> `/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01`
> [1, 65 MB].
>
> Can you reproduce the segmentation fault with the line below?
>
> $ qemu-system-ppc64 -enable-kvm -nographic -smp cores=1,threads=8 \
> -net none -enable-kvm -M pseries -nodefaults -device spapr-vscsi -serial stdio -m 512 \
> -kernel /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/vmlinux \
> -append "debug_boot_weak_hash panic=-1 console=ttyS0 \
> torture.disable_onoff_at_boot locktorture.onoff_interval=3 \
> locktorture.onoff_holdoff=30 locktorture.stat_interval=15 \
> locktorture.shutdown_secs=60 locktorture.verbose=1"
That works fine for me: it boots, runs the test, then shuts down.
I assume you see the segfault on every boot, not intermittently?
So the differences between our setups are the host kernel and the qemu
version. Can you try a different host kernel easily?
The other thing would be to try a different qemu version, you might need
to build from source, but it's not that hard :)
cheers
* Re: rcutorture’s init segfaults in ppc64le VM
2022-02-11 1:48 ` Michael Ellerman
@ 2022-02-11 14:19 ` Paul Menzel
2022-02-11 15:42 ` Paul Menzel
0 siblings, 1 reply; 15+ messages in thread
From: Paul Menzel @ 2022-02-11 14:19 UTC
To: Michael Ellerman, Paul E. McKenney
Cc: rcu, Zhouyi Zhou, linuxppc-dev, Willy Tarreau
Dear Michael,
Am 11.02.22 um 02:48 schrieb Michael Ellerman:
> Paul Menzel writes:
>> Am 08.02.22 um 11:09 schrieb Michael Ellerman:
>>> Paul Menzel writes:
>>
>> […]
>>
>>>> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
>>>> 5.17-rc2+ with rcutorture tests
>>>
>>> I'm not sure if that's the host kernel version or the version you're
>>> using of rcutorture? Can you tell us the sha1 of your host kernel and of
>>> the tree you're running rcutorture from?
>>
>> The host system runs Linux 5.17-rc1+ started with kexec. Unfortunately,
>> I am unable to find the exact sha1.
>>
>> $ more /proc/version
>> Linux version 5.17.0-rc1+ (x@eddb.molgen.mpg.de) (Ubuntu clang version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28 17:13:04 CET 2022
>
> OK. In general rc1 kernels can have issues, so it might be worth
> rebooting the host into either v5.17-rc3 or a distro or stable kernel.
> Just to rule out any issues on the host.
Yes, that was a good test. It works with Ubuntu’s 5.13 Linux kernel.
$ more /proc/version
Linux version 5.13.0-28-generic (buildd@bos02-ppc64el-013) (gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.37) #31-Ubuntu SMP Thu Jan 13 17:40:19 UTC 2022
I have to do more tests, but it could be LLVM/clang related.
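To make the comparison of the two host kernels explicit, here is a minimal Python sketch that labels the compiler from a /proc/version string. The function name `kernel_compiler` is made up for this example, and the matching is a rough heuristic rather than a documented format; the two strings are the ones quoted in this message.

```python
import re


def kernel_compiler(proc_version: str) -> str:
    """Return a rough compiler label for a /proc/version string."""
    if re.search(r"\bclang version\b", proc_version):
        return "clang"
    if re.search(r"\bgcc\b", proc_version, re.IGNORECASE):
        return "gcc"
    return "unknown"


failing_host = ("Linux version 5.17.0-rc1+ (x@eddb.molgen.mpg.de) "
                "(Ubuntu clang version 13.0.0-2, LLD 13.0.0) "
                "#1 SMP Fri Jan 28 17:13:04 CET 2022")
working_host = ("Linux version 5.13.0-28-generic (buildd@bos02-ppc64el-013) "
                "(gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0, "
                "GNU ld (GNU Binutils for Ubuntu) 2.37) "
                "#31-Ubuntu SMP Thu Jan 13 17:40:19 UTC 2022")

print(kernel_compiler(failing_host))
print(kernel_compiler(working_host))
```

Note that the two hosts differ in both kernel version and compiler, so this comparison alone does not isolate the cause.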
>> The Linux tree, from where I run rcutorture from, is at commit
>> dfd42facf1e4 (Linux 5.17-rc3) with four patches on top:
>>
>> $ git log --oneline -6
>> 207cec79e752 (HEAD -> master, origin/master, origin/HEAD) Problems with rcutorture on ppc64le: allmodconfig(2) and other failures
>> 8c82f96fbe57 ata: libata-sata: improve sata_link_debounce()
>> a447541d925f ata: libata-sata: remove debounce delay by default
>> afd84e1eeafc ata: libata-sata: introduce struct sata_deb_timing
>> f4caf7e48b75 ata: libata-sata: Simplify sata_link_resume() interface
>> dfd42facf1e4 (tag: v5.17-rc3) Linux 5.17-rc3
>>
>>>> $ tools/testing/selftests/rcutorture/bin/torture.sh --duration 10
>>>>
>>>> the built init
>>>>
>>>> $ file tools/testing/selftests/rcutorture/initrd/init
>>>> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=0ded0e45649184a296f30d611f7a03cc51ecb616, for GNU/Linux 3.10.0, stripped
>>>
>>> Mine looks pretty much identical:
>>>
>>> $ file tools/testing/selftests/rcutorture/initrd/init
>>> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=86078bf6e5d54ab0860d36aa9a65d52818b972c8, for GNU/Linux 3.10.0, stripped
>>>
>>>> segfaults in QEMU. From one of the log files
>>>
>>> But mine doesn't segfault, it runs fine and the test completes.
>>>
>>> What qemu version are you using?
>>>
>>> I tried 4.2.1 and 6.2.0, both worked.
>>
>> $ qemu-system-ppc64le --version
>> QEMU emulator version 6.0.0 (Debian 1:6.0+dfsg-2expubuntu1.1)
>> Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
>
> OK, that's one difference between our setups, but I'd be surprised if it
> explains this bug, but I guess anything's possible.
>
>>>> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/console.log
>>
>> Sorry, that was the wrong path/test. The correct one for the excerpt
>> below is:
>>
>>
>> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/console.log
>>
>> (For TREE03, QEMU does not start the Linux kernel at all, that means no
>> output after:
>>
>> Booting Linux via __start() @ 0x0000000000400000 ...
>
> OK yeah I see that too.
>
> Removing "threadirqs" from tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
> seems to fix it.
Nice find. I have no idea what that means, though.
> I still see some preempt related warnings, we clearly have some bugs
> with preempt enabled.
>
>> You can now download the content of
>> `/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01`
>> [1, 65 MB].
>>
>> Can you reproduce the segmentation fault with the line below?
>>
>> $ qemu-system-ppc64 -enable-kvm -nographic -smp cores=1,threads=8 \
>> -net none -enable-kvm -M pseries -nodefaults -device spapr-vscsi -serial stdio -m 512 \
>> -kernel /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/vmlinux \
>> -append "debug_boot_weak_hash panic=-1 console=ttyS0 \
>> torture.disable_onoff_at_boot locktorture.onoff_interval=3 \
>> locktorture.onoff_holdoff=30 locktorture.stat_interval=15 \
>> locktorture.shutdown_secs=60 locktorture.verbose=1"
>
> That works fine for me, boots and runs the test, then shuts down.
>
> I assume you see the segfault on every boot, not intermittently?
>
> So the differences between our setups are the host kernel and the qemu
> version. Can you try a different host kernel easily?
>
> The other thing would be to try a different qemu version, you might need
> to build from source, but it's not that hard :)
Indeed. I needed to find a current Meson, but then it didn’t make a
difference; as found out above, the problem is related to the host Linux kernel.
Kind regards,
Paul
* Re: rcutorture’s init segfaults in ppc64le VM
2022-02-11 14:19 ` Paul Menzel
@ 2022-02-11 15:42 ` Paul Menzel
0 siblings, 0 replies; 15+ messages in thread
From: Paul Menzel @ 2022-02-11 15:42 UTC
To: Michael Ellerman, Paul E. McKenney
Cc: rcu, Zhouyi Zhou, linuxppc-dev, Willy Tarreau
Dear Michael,
Am 11.02.22 um 15:19 schrieb Paul Menzel:
> Am 11.02.22 um 02:48 schrieb Michael Ellerman:
>> Paul Menzel writes:
>>> Am 08.02.22 um 11:09 schrieb Michael Ellerman:
>>>> Paul Menzel writes:
>>>
>>> […]
>>>
>>>>> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
>>>>> 5.17-rc2+ with rcutorture tests
>>>>
>>>> I'm not sure if that's the host kernel version or the version you're
>>>> using of rcutorture? Can you tell us the sha1 of your host kernel
>>>> and of the tree you're running rcutorture from?
>>>
>>> The host system runs Linux 5.17-rc1+ started with kexec. Unfortunately,
>>> I am unable to find the exact sha1.
>>>
>>> $ more /proc/version
>>> Linux version 5.17.0-rc1+ (x@eddb.molgen.mpg.de) (Ubuntu clang version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28 17:13:04 CET 2022
>>
>> OK. In general rc1 kernels can have issues, so it might be worth
>> rebooting the host into either v5.17-rc3 or a distro or stable kernel.
>> Just to rule out any issues on the host.
>
> Yes, that was a good test. It works with Ubuntu’s 5.13 Linux kernel.
>
> $ more /proc/version
> Linux version 5.13.0-28-generic (buildd@bos02-ppc64el-013) (gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.37) #31-Ubuntu SMP Thu Jan 13 17:40:19 UTC 2022
>
> I have to do more tests, but it could be LLVM/clang related.
Building commit f1baf68e1383 (Merge tag 'net-5.17-rc4' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net) with the ata
patches on top with GCC, I am unable to reproduce the issue. Previously, I
had built it with:
make -j100 LLVM=1 LLVM_IAS=0 bindeb-pkg
[…]
Kind regards,
Paul
* Re: rcutorture’s init segfaults in ppc64le VM
2022-02-08 12:12 ` Paul Menzel
2022-02-08 12:27 ` Paul Menzel
2022-02-11 1:48 ` Michael Ellerman
@ 2022-03-10 2:37 ` Zhouyi Zhou
2022-03-10 4:48 ` Paul E. McKenney
2022-03-10 8:10 ` Paul Menzel
2 siblings, 2 replies; 15+ messages in thread
From: Zhouyi Zhou @ 2022-03-10 2:37 UTC
To: Paul Menzel; +Cc: rcu, linuxppc-dev, Willy Tarreau, Paul E. McKenney
Dear Paul
I tried to reproduce the bug in a ppc64 VM at Oregon State University
using the vmlinux extracted from
https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
The ppc64 VM, in which I run qemu without hardware acceleration, runs:
Linux version 5.4.0-100-generic (buildd@bos02-ppc64el-021) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #113-Ubuntu SMP Thu Feb 3 18:43:11 UTC 2022 (Ubuntu 5.4.0-100.113-generic 5.4.166)
The qemu command I used to test:
$ cd /tmp/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01
$ qemu-system-ppc64 -nographic -smp cores=2,threads=1 -net none -M pseries \
-nodefaults -device spapr-vscsi -serial file:/tmp/console.log \
-m 512 -kernel ./vmlinux -append "debug_boot_weak_hash panic=-1 \
console=ttyS0 rcutorture.onoff_interval=200 \
rcutorture.onoff_holdoff=30 rcutree.gp_preinit_delay=12 \
rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 \
rcutree.kthread_prio=2 threadirqs tree.use_softirq=0 \
rcutorture.n_barrier_cbs=4 rcutorture.stat_interval=15 \
rcutorture.shutdown_secs=1800 rcutorture.test_no_idle_hz=1 \
rcutorture.verbose=1"
The console.log is uploaded to:
http://154.223.142.244/logs/20220310/console.paul.log
The log tells us that an illegal instruction causes the trouble:
[ 4.246387][ T1] init[1]: illegal instruction (4) at 1002c308 nip 1002c308 lr 10001684 code 1 in init[10000000+d0000]
[ 4.251400][ T1] init[1]: code: f90d88c0 f92a0008 f9480008 7c2004ac 2c2d0000 f9490000 386d88d0 380000e8
[ 4.253416][ T1] init[1]: code: 41820098 e92d8f98 75290010 4182008c <44000001> 2c2d0000 60000000 8902f438
Meanwhile, a vmlinux that I compiled myself runs smoothly.
Then I modified mkinitrd.sh to make init crash on purpose:
http://154.223.142.244/logs/20220310/mkinitrd.sh
The log tells us it is a segfault (instead of an illegal instruction):
http://154.223.142.244/logs/20220310/console.zhouyi.log
Then I used gdb to debug init on the host:
ubuntu@zhouzhouyi-1:~/newkernel/linux-next$ gdb tools/testing/selftests/rcutorture/initrd/init
(gdb) run
Starting program: /home/ubuntu/newkernel/linux-next/tools/testing/selftests/rcutorture/initrd/init
Program received signal SIGSEGV, Segmentation fault.
0x0000000010000b2c in ?? ()
(gdb) x/10i $pc
=> 0x10000b2c: stw r9,0(r9)
0x10000b30: trap
0x10000b34: .long 0x0
0x10000b38: .long 0x0
0x10000b3c: .long 0x0
0x10000b40: lis r2,4110
0x10000b44: addi r2,r2,31488
0x10000b48: mr r9,r1
0x10000b4c: rldicr r1,r1,0,59
0x10000b50: li r0,0
(gdb) p $r9
$1 = 0
(gdb) x/30x $pc - 0x30
0x10000afc: 0x38840040 0x387f0040 0xf8010040 0x48026919
0x10000b0c: 0x60000000 0xe8010040 0x7c0803a6 0x4bffff24
0x10000b1c: 0x00000000 0x01000000 0x00000180 0x39200000
0x10000b2c: 0x91290000 0x7fe00008 0x00000000 0x00000000
which matches the hex content of
http://154.223.142.244/logs/20220310/console.zhouyi.log:
[ 5.077431][ T1] init[1]: segfault (11) at 0 nip 10000b2c lr 10001024 code 1 in init[10000000+d0000]
[ 5.087167][ T1] init[1]: code: 38840040 387f0040 f8010040 48026919 60000000 e8010040 7c0803a6 4bffff24
[ 5.093987][ T1] init[1]: code: 00000000 01000000 00000180 39200000 <91290000> 7fe00008 00000000 00000000
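The match can also be checked mechanically. The following is a minimal Python sketch, not part of the test setup: the two dumps are copied from the gdb session and the console log above, and the helper name `decode_d_form` is made up for this example. It compares the words and splits the faulting word into its D-form fields (primary opcode 36 is stw in the Power ISA).

```python
# Words printed by gdb "x/30x $pc - 0x30" (the 16 words shown above).
GDB_DUMP = """
0x10000afc: 0x38840040 0x387f0040 0xf8010040 0x48026919
0x10000b0c: 0x60000000 0xe8010040 0x7c0803a6 0x4bffff24
0x10000b1c: 0x00000000 0x01000000 0x00000180 0x39200000
0x10000b2c: 0x91290000 0x7fe00008 0x00000000 0x00000000
"""

# "code:" words from the guest console log; <...> marks the faulting word.
KERNEL_CODE = """
38840040 387f0040 f8010040 48026919 60000000 e8010040 7c0803a6 4bffff24
00000000 01000000 00000180 39200000 <91290000> 7fe00008 00000000 00000000
"""

gdb_words = []
for line in GDB_DUMP.strip().splitlines():
    _addr, words = line.split(":", 1)      # drop the address column
    gdb_words += [int(w, 16) for w in words.split()]

kernel_tokens = KERNEL_CODE.split()
fault_index = next(i for i, t in enumerate(kernel_tokens) if t.startswith("<"))
kernel_words = [int(t.strip("<>"), 16) for t in kernel_tokens]


def decode_d_form(word):
    """Split a 32-bit D-form word into (opcode, rt_or_rs, ra, signed d)."""
    opcode = word >> 26
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    d = word & 0xFFFF
    if d & 0x8000:                         # sign-extend the 16-bit field
        d -= 0x10000
    return opcode, rt, ra, d


print("dumps match:", gdb_words == kernel_words)
print("faulting word:", hex(kernel_words[fault_index]))
print("decoded:", decode_d_form(kernel_words[fault_index]))
```

The decoded fields correspond to the `stw r9,0(r9)` that gdb shows at the faulting pc.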
Conclusion: something might be going wrong when init is packed into
vmlinux in your environment.
I will continue to investigate this interesting problem with you.
Thanks
Kind Regards
Zhouyi
On Tue, Feb 8, 2022 at 8:12 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Michael,
>
>
> Thank you for looking into this.
>
> Am 08.02.22 um 11:09 schrieb Michael Ellerman:
> > Paul Menzel writes:
>
> […]
>
> >> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
> >> 5.17-rc2+ with rcutorture tests
> >
> > I'm not sure if that's the host kernel version or the version you're
> > using of rcutorture? Can you tell us the sha1 of your host kernel and of
> > the tree you're running rcutorture from?
>
> The host system runs Linux 5.17-rc1+ started with kexec. Unfortunately,
> I am unable to find the exact sha1.
>
> $ more /proc/version
> Linux version 5.17.0-rc1+ (pmenzel@flughafenberlinbrandenburgwillybrandt.molgen.mpg.de) (Ubuntu clang version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28 17:13:04 CET 2022
>
> The Linux tree, from where I run rcutorture from, is at commit
> dfd42facf1e4 (Linux 5.17-rc3) with four patches on top:
>
> $ git log --oneline -6
> 207cec79e752 (HEAD -> master, origin/master, origin/HEAD) Problems with rcutorture on ppc64le: allmodconfig(2) and other failures
> 8c82f96fbe57 ata: libata-sata: improve sata_link_debounce()
> a447541d925f ata: libata-sata: remove debounce delay by default
> afd84e1eeafc ata: libata-sata: introduce struct sata_deb_timing
> f4caf7e48b75 ata: libata-sata: Simplify sata_link_resume() interface
> dfd42facf1e4 (tag: v5.17-rc3) Linux 5.17-rc3
>
> >> $ tools/testing/selftests/rcutorture/bin/torture.sh --duration 10
> >>
> >> the built init
> >>
> >> $ file tools/testing/selftests/rcutorture/initrd/init
> >> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=0ded0e45649184a296f30d611f7a03cc51ecb616, for GNU/Linux 3.10.0, stripped
> >
> > Mine looks pretty much identical:
> >
> > $ file tools/testing/selftests/rcutorture/initrd/init
> > tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=86078bf6e5d54ab0860d36aa9a65d52818b972c8, for GNU/Linux 3.10.0, stripped
> >
> >> segfaults in QEMU. From one of the log files
> >
> > But mine doesn't segfault, it runs fine and the test completes.
> >
> > What qemu version are you using?
> >
> > I tried 4.2.1 and 6.2.0, both worked.
>
> $ qemu-system-ppc64le --version
> QEMU emulator version 6.0.0 (Debian 1:6.0+dfsg-2expubuntu1.1)
> Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
>
> >> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/console.log
>
> Sorry, that was the wrong path/test. The correct one for the excerpt
> below is:
>
>
> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/console.log
>
> (For TREE03, QEMU does not start the Linux kernel at all, that means no
> output after:
>
> Booting Linux via __start() @ 0x0000000000400000 ...
> )
>
> >> [ 1.119803][ T1] Run /init as init process
> >> [ 1.122011][ T1] init[1]: segfault (11) at f0656d90 nip 10000a18 lr 0 code 1 in init[10000000+d0000]
> >> [ 1.124863][ T1] init[1]: code: 2c2903e7 f9210030 4081ff84 4bffff58 00000000 01000000 00000580 3c40100f
> >> [ 1.128823][ T1] init[1]: code: 38427c00 7c290b78 782106e4 38000000 <f821ff81> 7c0803a6 f8010000 e9028010
> >
> > The disassembly from 3c40100f is:
> > lis r2,4111
> > addi r2,r2,31744
> > mr r9,r1
> > rldicr r1,r1,0,59
> > li r0,0
> > stdu r1,-128(r1) <- fault
> > mtlr r0
> > std r0,0(r1)
> > ld r8,-32752(r2)
> >
> >
> > I think you'll find that's the code at the ELF entry point. You can
> > check with:
> >
> > $ readelf -e tools/testing/selftests/rcutorture/initrd/init | grep Entry
> > Entry point address: 0x10000c0c
> >
> > $ objdump -d tools/testing/selftests/rcutorture/initrd/init | grep -m 1 -A 8 10000c0c
> > 10000c0c: 0e 10 40 3c lis r2,4110
> > 10000c10: 00 7b 42 38 addi r2,r2,31488
> > 10000c14: 78 0b 29 7c mr r9,r1
> > 10000c18: e4 06 21 78 rldicr r1,r1,0,59
> > 10000c1c: 00 00 00 38 li r0,0
> > 10000c20: 81 ff 21 f8 stdu r1,-128(r1)
> > 10000c24: a6 03 08 7c mtlr r0
> > 10000c28: 00 00 01 f8 std r0,0(r1)
> > 10000c2c: 10 80 02 e9 ld r8,-32752(r2)
> >
> > The fault you're seeing is the first store using the stack pointer (r1),
> > which is setup by the kernel.
> >
> > The fault address f0656d90 is weirdly low, the stack should be up near 128TB.
> >
> > I'm not sure how we end up with a bad r1.
> >
> > Can you dump some info about the kernel that was built, something like:
> >
> > $ file /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/vmlinux
> >
> > And maybe paste/attach the full log, maybe there's a clue somewhere.
>
> You can now download the content of
> `/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01`
> [1, 65 MB].
>
> Can you reproduce the segmentation fault with the line below?
>
> $ qemu-system-ppc64 -enable-kvm -nographic -smp cores=1,threads=8 \
> -net none -enable-kvm -M pseries -nodefaults -device spapr-vscsi -serial stdio -m 512 \
> -kernel /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/vmlinux \
> -append "debug_boot_weak_hash panic=-1 console=ttyS0 \
> torture.disable_onoff_at_boot locktorture.onoff_interval=3 \
> locktorture.onoff_holdoff=30 locktorture.stat_interval=15 \
> locktorture.shutdown_secs=60 locktorture.verbose=1"
>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
* Re: rcutorture’s init segfaults in ppc64le VM
2022-03-10 2:37 ` Zhouyi Zhou
@ 2022-03-10 4:48 ` Paul E. McKenney
2022-03-10 8:10 ` Paul Menzel
1 sibling, 0 replies; 15+ messages in thread
From: Paul E. McKenney @ 2022-03-10 4:48 UTC
To: Zhouyi Zhou; +Cc: rcu, Paul Menzel, linuxppc-dev, Willy Tarreau
On Thu, Mar 10, 2022 at 10:37:12AM +0800, Zhouyi Zhou wrote:
> Dear Paul
>
> I try to reproduce the bug in ppc64 VM in Oregon State University
> using the vmlinux extracted from
> https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
>
> the ppc64 VM in which I run the qemu without hardware acceleration is:
> Linux version 5.4.0-100-generic (buildd@bos02-ppc64el-021) (gcc
> version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #113-Ubuntu SMP Thu Feb
> 3 18:43:11 UTC 2022 (Ubuntu 5.4.0-100.113-generic 5.4.166)
>
>
> The qemu command I use to test:
> cd /tmp/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01$
> $qemu-system-ppc64 -nographic -smp cores=2,threads=1 -net none -M
> pseries -nodefaults -device spapr-vscsi -serial file:/tmp/console.log
> -m 512 -kernel ./vmlinux -append "debug_boot_weak_hash panic=-1
> console=ttyS0 rcutorture.onoff_interval=200
> rcutorture.onoff_holdoff=30 rcutree.gp_preinit_delay=12
> rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3
> rcutree.kthread_prio=2 threadirqs tree.use_softirq=0
> rcutorture.n_barrier_cbs=4 rcutorture.stat_interval=15
> rcutorture.shutdown_secs=1800 rcutorture.test_no_idle_hz=1
> rcutorture.verbose=1"
>
> The console.log is uploaded to:
> http://154.223.142.244/logs/20220310/console.paul.log
> The log tells us it is illegal instruction that causes the trouble:
> [ 4.246387][ T1] init[1]: illegal instruction (4) at 1002c308
> nip 1002c308 lr 10001684 code 1 in init[10000000+d0000]
> [ 4.251400][ T1] init[1]: code: f90d88c0 f92a0008 f9480008
> 7c2004ac 2c2d0000 f9490000 386d88d0 380000e8
> [ 4.253416][ T1] init[1]: code: 41820098 e92d8f98 75290010
> 4182008c <44000001> 2c2d0000 60000000 8902f438
>
>
> Meanwhile, the vmlinux compiled by myself runs smoothly.
>
> Then I modify mkinitrd.sh to let it panic manually:
> http://154.223.142.244/logs/20220310/mkinitrd.sh
> The log tells us it is a segfault (instead of a illegal instruction):
> http://154.223.142.244/logs/20220310/console.zhouyi.log
>
> Then I use gdb to debug the init in host:
> ubuntu@zhouzhouyi-1:~/newkernel/linux-next$ gdb
> tools/testing/selftests/rcutorture/initrd/init
> (gdb) run
> Starting program:
> /home/ubuntu/newkernel/linux-next/tools/testing/selftests/rcutorture/initrd/init
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000010000b2c in ?? ()
> (gdb) x/10i $pc
> => 0x10000b2c: stw r9,0(r9)
> 0x10000b30: trap
> 0x10000b34: .long 0x0
> 0x10000b38: .long 0x0
> 0x10000b3c: .long 0x0
> 0x10000b40: lis r2,4110
> 0x10000b44: addi r2,r2,31488
> 0x10000b48: mr r9,r1
> 0x10000b4c: rldicr r1,r1,0,59
> 0x10000b50: li r0,0
> (gdb) p $r9
> $1 = 0
> (gdb) x/30x $pc - 0x30
> 0x10000afc: 0x38840040 0x387f0040 0xf8010040 0x48026919
> 0x10000b0c: 0x60000000 0xe8010040 0x7c0803a6 0x4bffff24
> 0x10000b1c: 0x00000000 0x01000000 0x00000180 0x39200000
> 0x10000b2c: 0x91290000 0x7fe00008 0x00000000 0x00000000
> which matches the hex content of
> http://154.223.142.244/logs/20220310/console.zhouyi.log:
> [ 5.077431][ T1] init[1]: segfault (11) at 0 nip 10000b2c lr 10001024 code 1 in init[10000000+d0000]
> [ 5.087167][ T1] init[1]: code: 38840040 387f0040 f8010040 48026919 60000000 e8010040 7c0803a6 4bffff24
> [ 5.093987][ T1] init[1]: code: 00000000 01000000 00000180 39200000 <91290000> 7fe00008 00000000 00000000
>
>
> Conclusions: there might be something wrong when packing the init into
> vmlinux in your environment.
Quite possibly! Or the compiler might not be being invoked properly
by the mkinitrd.sh script.
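One cheap way to rule that out is to look at the ELF header of the init
binary before it gets packed into vmlinux. The following is only a
minimal sketch in Python, not part of the rcutorture scripts; the
function name elf64_summary is hypothetical, and it only understands
64-bit ELF headers. It reports the byte order, the e_machine value
(21 is EM_PPC64) and the entry point, which is the same information
that file and readelf -e show elsewhere in this thread.

```python
import struct

EM_PPC64 = 21  # e_machine value for 64-bit PowerPC in the ELF specification


def elf64_summary(header):
    """Summarise the first 32 bytes of an ELF64 file, or return None."""
    # Require the ELF magic and EI_CLASS == 2 (64-bit objects).
    if len(header) < 32 or header[:4] != b"\x7fELF" or header[4] != 2:
        return None
    # EI_DATA: 1 means little-endian, 2 means big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_type, e_machine, e_version and e_entry follow the 16-byte e_ident.
    _e_type, e_machine, _e_version, e_entry = struct.unpack_from(
        endian + "HHIQ", header, 16
    )
    return {
        "little_endian": header[5] == 1,
        "machine": e_machine,
        "entry": e_entry,
    }
```

Feeding it the first 64 bytes of
tools/testing/selftests/rcutorture/initrd/init should give
little_endian True and machine 21 for a ppc64le build; anything else
would point at the compiler invocation.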
> I will continue to do research on this interesting problem with you.
Please let me know how it goes!
Thanx, Paul
> Thanks
> Kind Regards
> Zhouyi
>
>
>
> On Tue, Feb 8, 2022 at 8:12 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> >
> > Dear Michael,
> >
> >
> > Thank you for looking into this.
> >
> > On 08.02.22 at 11:09, Michael Ellerman wrote:
> > > Paul Menzel writes:
> >
> > […]
> >
> > >> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
> > >> 5.17-rc2+ with rcutorture tests
> > >
> > > I'm not sure if that's the host kernel version or the version you're
> > > using of rcutorture? Can you tell us the sha1 of your host kernel and of
> > > the tree you're running rcutorture from?
> >
> > The host system runs Linux 5.17-rc1+ started with kexec. Unfortunately,
> > I am unable to find the exact sha1.
> >
> > $ more /proc/version
> > Linux version 5.17.0-rc1+ (pmenzel@flughafenberlinbrandenburgwillybrandt.molgen.mpg.de) (Ubuntu clang version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28 17:13:04 CET 2022
> >
> > The Linux tree, from where I run rcutorture from, is at commit
> > dfd42facf1e4 (Linux 5.17-rc3) with four patches on top:
> >
> > $ git log --oneline -6
> > 207cec79e752 (HEAD -> master, origin/master, origin/HEAD) Problems with rcutorture on ppc64le: allmodconfig(2) and other failures
> > 8c82f96fbe57 ata: libata-sata: improve sata_link_debounce()
> > a447541d925f ata: libata-sata: remove debounce delay by default
> > afd84e1eeafc ata: libata-sata: introduce struct sata_deb_timing
> > f4caf7e48b75 ata: libata-sata: Simplify sata_link_resume() interface
> > dfd42facf1e4 (tag: v5.17-rc3) Linux 5.17-rc3
> >
> > >> $ tools/testing/selftests/rcutorture/bin/torture.sh --duration 10
> > >>
> > >> the built init
> > >>
> > >> $ file tools/testing/selftests/rcutorture/initrd/init
> > >> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=0ded0e45649184a296f30d611f7a03cc51ecb616, for GNU/Linux 3.10.0, stripped
> > >
> > > Mine looks pretty much identical:
> > >
> > > $ file tools/testing/selftests/rcutorture/initrd/init
> > > tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically linked, BuildID[sha1]=86078bf6e5d54ab0860d36aa9a65d52818b972c8, for GNU/Linux 3.10.0, stripped
> > >
> > >> segfaults in QEMU. From one of the log files
> > >
> > > But mine doesn't segfault, it runs fine and the test completes.
> > >
> > > What qemu version are you using?
> > >
> > > I tried 4.2.1 and 6.2.0, both worked.
> >
> > $ qemu-system-ppc64le --version
> > QEMU emulator version 6.0.0 (Debian 1:6.0+dfsg-2expubuntu1.1)
> > Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
> >
> > >> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/console.log
> >
> > Sorry, that was the wrong path/test. The correct one for the excerpt
> > below is:
> >
> >
> > /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/console.log
> >
> > (For TREE03, QEMU does not start the Linux kernel at all, that means no
> > output after:
> >
> > Booting Linux via __start() @ 0x0000000000400000 ...
> > )
> >
> > >> [ 1.119803][ T1] Run /init as init process
> > >> [ 1.122011][ T1] init[1]: segfault (11) at f0656d90 nip 10000a18 lr 0 code 1 in init[10000000+d0000]
> > >> [ 1.124863][ T1] init[1]: code: 2c2903e7 f9210030 4081ff84 4bffff58 00000000 01000000 00000580 3c40100f
> > >> [ 1.128823][ T1] init[1]: code: 38427c00 7c290b78 782106e4 38000000 <f821ff81> 7c0803a6 f8010000 e9028010
> > >
> > > The disassembly from 3c40100f is:
> > > lis r2,4111
> > > addi r2,r2,31744
> > > mr r9,r1
> > > rldicr r1,r1,0,59
> > > li r0,0
> > > stdu r1,-128(r1) <- fault
> > > mtlr r0
> > > std r0,0(r1)
> > > ld r8,-32752(r2)
> > >
> > >
> > > I think you'll find that's the code at the ELF entry point. You can
> > > check with:
> > >
> > > $ readelf -e tools/testing/selftests/rcutorture/initrd/init | grep Entry
> > > Entry point address: 0x10000c0c
> > >
> > > $ objdump -d tools/testing/selftests/rcutorture/initrd/init | grep -m 1 -A 8 10000c0c
> > > 10000c0c: 0e 10 40 3c lis r2,4110
> > > 10000c10: 00 7b 42 38 addi r2,r2,31488
> > > 10000c14: 78 0b 29 7c mr r9,r1
> > > 10000c18: e4 06 21 78 rldicr r1,r1,0,59
> > > 10000c1c: 00 00 00 38 li r0,0
> > > 10000c20: 81 ff 21 f8 stdu r1,-128(r1)
> > > 10000c24: a6 03 08 7c mtlr r0
> > > 10000c28: 00 00 01 f8 std r0,0(r1)
> > > 10000c2c: 10 80 02 e9 ld r8,-32752(r2)
> > >
> > > The fault you're seeing is the first store using the stack pointer (r1),
> > > which is setup by the kernel.
> > >
> > > The fault address f0656d90 is weirdly low, the stack should be up near 128TB.
> > >
> > > I'm not sure how we end up with a bad r1.
> > >
> > > Can you dump some info about the kernel that was built, something like:
> > >
> > > $ file /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/vmlinux
> > >
> > > And maybe paste/attach the full log, maybe there's a clue somewhere.
> >
> > You can now download the content of
> > `/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01`
> > [1, 65 MB].
> >
> > Can you reproduce the segmentation fault with the line below?
> >
> > $ qemu-system-ppc64 -enable-kvm -nographic -smp cores=1,threads=8
> > -net none -enable-kvm -M pseries -nodefaults -device spapr-vscsi -serial
> > stdio -m 512 -kernel
> > /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01/vmlinux
> > -append "debug_boot_weak_hash panic=-1 console=ttyS0
> > torture.disable_onoff_at_boot locktorture.onoff_interval=3
> > locktorture.onoff_holdoff=30 locktorture.stat_interval=15
> > locktorture.shutdown_secs=60 locktorture.verbose=1"
> >
> >
> > Kind regards,
> >
> > Paul
> >
> >
> > [1]:
> > https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
* Re: rcutorture’s init segfaults in ppc64le VM
2022-03-10 2:37 ` Zhouyi Zhou
2022-03-10 4:48 ` Paul E. McKenney
@ 2022-03-10 8:10 ` Paul Menzel
2022-03-10 22:13 ` Zhouyi Zhou
1 sibling, 1 reply; 15+ messages in thread
From: Paul Menzel @ 2022-03-10 8:10 UTC (permalink / raw)
To: Zhouyi Zhou; +Cc: rcu, linuxppc-dev, Willy Tarreau, Paul E. McKenney
Dear Zhouyi,
Thank you for still looking into this.
On 10.03.22 at 03:37, Zhouyi Zhou wrote:
> I try to reproduce the bug in ppc64 VM in Oregon State University
> using the vmlinux extracted from
> https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
>
> the ppc64 VM in which I run the qemu without hardware acceleration is:
> Linux version 5.4.0-100-generic (buildd@bos02-ppc64el-021) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #113-Ubuntu SMP Thu Feb 3 18:43:11 UTC 2022 (Ubuntu 5.4.0-100.113-generic 5.4.166)
>
>
> The qemu command I use to test:
> cd /tmp/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01$
> $qemu-system-ppc64 -nographic -smp cores=2,threads=1 -net none -M
> pseries -nodefaults -device spapr-vscsi -serial file:/tmp/console.log
> -m 512 -kernel ./vmlinux -append "debug_boot_weak_hash panic=-1
> console=ttyS0 rcutorture.onoff_interval=200
> rcutorture.onoff_holdoff=30 rcutree.gp_preinit_delay=12
> rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3
> rcutree.kthread_prio=2 threadirqs tree.use_softirq=0
> rcutorture.n_barrier_cbs=4 rcutorture.stat_interval=15
> rcutorture.shutdown_secs=1800 rcutorture.test_no_idle_hz=1
> rcutorture.verbose=1"
>
> The console.log is uploaded to:
> http://154.223.142.244/logs/20220310/console.paul.log
> The log tells us it is an illegal instruction that causes the trouble:
> [ 4.246387][ T1] init[1]: illegal instruction (4) at 1002c308 nip 1002c308 lr 10001684 code 1 in init[10000000+d0000]
> [ 4.251400][ T1] init[1]: code: f90d88c0 f92a0008 f9480008 7c2004ac 2c2d0000 f9490000 386d88d0 380000e8
> [ 4.253416][ T1] init[1]: code: 41820098 e92d8f98 75290010 4182008c <44000001> 2c2d0000 60000000 8902f438
>
>
> Meanwhile, the vmlinux compiled by myself runs smoothly.
How did you build it? Using GCC or clang? I forget whether the problem
was reproducible only when the host Linux kernel was built with clang,
or only when the VM kernel was.
> Then I modify mkinitrd.sh to let it panic manually:
> http://154.223.142.244/logs/20220310/mkinitrd.sh
I only see the change:
-
+ int *ptr = 0;
+ *ptr = 0;
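That change can be matched by hand against the words
"39200000 <91290000>" in the code: line of console.zhouyi.log. A
minimal sketch in Python, assuming only the handful of D-form and
DS-form opcodes needed in this thread (the helper name decode and the
small opcode table are hypothetical, not taken from any existing tool):

```python
def sign16(value):
    """Interpret a 16-bit field as a signed displacement."""
    return value - 0x10000 if value & 0x8000 else value


# Primary opcodes of the D-form instructions seen in this thread.
D_FORM = {14: "addi", 15: "addis", 36: "stw"}
# Primary opcode 62 is DS-form; the low two bits select the instruction.
DS_FORM_62 = {0: "std", 1: "stdu"}


def decode(word):
    """Return (mnemonic, rt_or_rs, ra, displacement), or None if unknown."""
    opcode = word >> 26
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    low = word & 0xFFFF
    if opcode in D_FORM:
        return (D_FORM[opcode], rt, ra, sign16(low))
    if opcode == 62 and (low & 3) in DS_FORM_62:
        # The displacement is the low 16 bits with the two XO bits cleared.
        return (DS_FORM_62[low & 3], rt, ra, sign16(low & ~3))
    return None
```

With this, 0x39200000 decodes to addi r9,0,0 (that is, li r9,0) and
0x91290000 to stw r9,0(r9), the store through the null pointer. The
same helper gives stdu r1,-128(r1) for the f821ff81 word from the
original report.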
> The log tells us it is a segfault (instead of an illegal instruction):
> http://154.223.142.244/logs/20220310/console.zhouyi.log
>
> Then I use gdb to debug the init in host:
> ubuntu@zhouzhouyi-1:~/newkernel/linux-next$ gdb
> tools/testing/selftests/rcutorture/initrd/init
> (gdb) run
> Starting program:
> /home/ubuntu/newkernel/linux-next/tools/testing/selftests/rcutorture/initrd/init
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000010000b2c in ?? ()
> (gdb) x/10i $pc
> => 0x10000b2c: stw r9,0(r9)
> 0x10000b30: trap
> 0x10000b34: .long 0x0
> 0x10000b38: .long 0x0
> 0x10000b3c: .long 0x0
> 0x10000b40: lis r2,4110
> 0x10000b44: addi r2,r2,31488
> 0x10000b48: mr r9,r1
> 0x10000b4c: rldicr r1,r1,0,59
> 0x10000b50: li r0,0
> (gdb) p $r9
> $1 = 0
> (gdb) x/30x $pc - 0x30
> 0x10000afc: 0x38840040 0x387f0040 0xf8010040 0x48026919
> 0x10000b0c: 0x60000000 0xe8010040 0x7c0803a6 0x4bffff24
> 0x10000b1c: 0x00000000 0x01000000 0x00000180 0x39200000
> 0x10000b2c: 0x91290000 0x7fe00008 0x00000000 0x00000000
> which matches the hex content of
> http://154.223.142.244/logs/20220310/console.zhouyi.log:
> [ 5.077431][ T1] init[1]: segfault (11) at 0 nip 10000b2c lr 10001024 code 1 in init[10000000+d0000]
> [ 5.087167][ T1] init[1]: code: 38840040 387f0040 f8010040 48026919 60000000 e8010040 7c0803a6 4bffff24
> [ 5.093987][ T1] init[1]: code: 00000000 01000000 00000180 39200000 <91290000> 7fe00008 00000000 00000000
>
>
> Conclusions: there might be something wrong when packing the init into
> vmlinux in your environment.
>
> I will continue to do research on this interesting problem with you.
As I wrote, I think it’s a problem with LLVM/clang. Unfortunately, I
won’t be able to retest before next week.
Kind regards,
Paul
* Re: rcutorture’s init segfaults in ppc64le VM
2022-03-10 8:10 ` Paul Menzel
@ 2022-03-10 22:13 ` Zhouyi Zhou
0 siblings, 0 replies; 15+ messages in thread
From: Zhouyi Zhou @ 2022-03-10 22:13 UTC (permalink / raw)
To: Paul Menzel; +Cc: rcu, linuxppc-dev, Willy Tarreau, Paul E. McKenney
Dear Paul,
On Thu, Mar 10, 2022 at 4:10 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Zhouyi,
>
>
> Thank you for still looking into this.
You are very welcome ;-)
>
>
> On 10.03.22 at 03:37, Zhouyi Zhou wrote:
>
> > I try to reproduce the bug in ppc64 VM in Oregon State University
> > using the vmlinux extracted from
> > https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
> >
> > the ppc64 VM in which I run the qemu without hardware acceleration is:
> > Linux version 5.4.0-100-generic (buildd@bos02-ppc64el-021) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #113-Ubuntu SMP Thu Feb 3 18:43:11 UTC 2022 (Ubuntu 5.4.0-100.113-generic 5.4.166)
> >
> >
> > The qemu command I use to test:
> > cd /tmp/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01$
> > $qemu-system-ppc64 -nographic -smp cores=2,threads=1 -net none -M
> > pseries -nodefaults -device spapr-vscsi -serial file:/tmp/console.log
> > -m 512 -kernel ./vmlinux -append "debug_boot_weak_hash panic=-1
> > console=ttyS0 rcutorture.onoff_interval=200
> > rcutorture.onoff_holdoff=30 rcutree.gp_preinit_delay=12
> > rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3
> > rcutree.kthread_prio=2 threadirqs tree.use_softirq=0
> > rcutorture.n_barrier_cbs=4 rcutorture.stat_interval=15
> > rcutorture.shutdown_secs=1800 rcutorture.test_no_idle_hz=1
> > rcutorture.verbose=1"
> >
> > The console.log is uploaded to:
> > http://154.223.142.244/logs/20220310/console.paul.log
> > The log tells us it is an illegal instruction that causes the trouble:
> > [ 4.246387][ T1] init[1]: illegal instruction (4) at 1002c308 nip 1002c308 lr 10001684 code 1 in init[10000000+d0000]
> > [ 4.251400][ T1] init[1]: code: f90d88c0 f92a0008 f9480008 7c2004ac 2c2d0000 f9490000 386d88d0 380000e8
> > [ 4.253416][ T1] init[1]: code: 41820098 e92d8f98 75290010 4182008c <44000001> 2c2d0000 60000000 8902f438
> >
> >
> > Meanwhile, the vmlinux compiled by myself runs smoothly.
>
> How did you build it? Using GCC or clang? I forgot, if the problem was
I built vmlinux with both GCC and clang. Both of the resulting kernels
run smoothly.
> only reproducible if the host Linux kernel was built with clang or the
> VM kernel.
Yes, I remember this too. The dependence on how the host Linux kernel
is built makes things more complex.
>
> > Then I modify mkinitrd.sh to let it panic manually:
> > http://154.223.142.244/logs/20220310/mkinitrd.sh
>
> I only see the change:
>
> -
> + int *ptr = 0;
> + *ptr = 0;
>
Yes, I trigger the segfault deliberately.
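To compare the console lines with the gdb dump, the nip can be turned
into an offset inside init and the word marked with <...> can be pulled
out of the code: line. This is only a rough sketch in Python; the names
parse_fault and faulting_word are hypothetical, and the regular
expression is written against the line format seen in the two logs of
this thread, not against any documented kernel interface.

```python
import re

# Matches the tail of lines such as:
#   init[1]: segfault (11) at 0 nip 10000b2c lr 10001024 code 1 in init[10000000+d0000]
FAULT_RE = re.compile(
    r"nip (?P<nip>[0-9a-f]+) lr (?P<lr>[0-9a-f]+) code \d+ "
    r"in (?P<name>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_fault(line):
    """Extract nip, lr and the offset of nip inside the mapped binary."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    nip = int(match["nip"], 16)
    base = int(match["base"], 16)
    return {
        "name": match["name"],
        "nip": nip,
        "lr": int(match["lr"], 16),
        "offset": nip - base,
    }


def faulting_word(code_line):
    """Return the instruction word the kernel marked with <...>, if any."""
    match = re.search(r"<([0-9a-f]{8})>", code_line)
    return int(match.group(1), 16) if match else None
```

For the segfault line from console.zhouyi.log this gives an offset of
0xb2c, and the marked word 0x91290000 is the same word gdb shows at
0x10000b2c.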
> > The log tells us it is a segfault (instead of an illegal instruction):
> > http://154.223.142.244/logs/20220310/console.zhouyi.log
> >
> > Then I use gdb to debug the init in host:
> > ubuntu@zhouzhouyi-1:~/newkernel/linux-next$ gdb
> > tools/testing/selftests/rcutorture/initrd/init
> > (gdb) run
> > Starting program:
> > /home/ubuntu/newkernel/linux-next/tools/testing/selftests/rcutorture/initrd/init
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x0000000010000b2c in ?? ()
> > (gdb) x/10i $pc
> > => 0x10000b2c: stw r9,0(r9)
> > 0x10000b30: trap
> > 0x10000b34: .long 0x0
> > 0x10000b38: .long 0x0
> > 0x10000b3c: .long 0x0
> > 0x10000b40: lis r2,4110
> > 0x10000b44: addi r2,r2,31488
> > 0x10000b48: mr r9,r1
> > 0x10000b4c: rldicr r1,r1,0,59
> > 0x10000b50: li r0,0
> > (gdb) p $r9
> > $1 = 0
> > (gdb) x/30x $pc - 0x30
> > 0x10000afc: 0x38840040 0x387f0040 0xf8010040 0x48026919
> > 0x10000b0c: 0x60000000 0xe8010040 0x7c0803a6 0x4bffff24
> > 0x10000b1c: 0x00000000 0x01000000 0x00000180 0x39200000
> > 0x10000b2c: 0x91290000 0x7fe00008 0x00000000 0x00000000
> > which matches the hex content of
> > http://154.223.142.244/logs/20220310/console.zhouyi.log:
> > [ 5.077431][ T1] init[1]: segfault (11) at 0 nip 10000b2c lr 10001024 code 1 in init[10000000+d0000]
> > [ 5.087167][ T1] init[1]: code: 38840040 387f0040 f8010040 48026919 60000000 e8010040 7c0803a6 4bffff24
> > [ 5.093987][ T1] init[1]: code: 00000000 01000000 00000180 39200000 <91290000> 7fe00008 00000000 00000000
> >
> >
> > Conclusions: there might be something wrong when packing the init into
> > vmlinux in your environment.
> >
> > I will continue to do research on this interesting problem with you.
>
> As written I think it’s a problem with LLVM/clang. Unfortunately, I
> won’t be able to retest before next week.
Roger that, no need to hurry ;-)
Kind regards
Zhouyi
> Kind regards,
>
> Paul