From: Sean Anderson <seanga2@gmail.com>
To: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Lukas Auer <lukas.auer@aisec.fraunhofer.de>,
U-Boot Mailing List <u-boot@lists.denx.de>,
Atish Patra <atishp@atishpatra.org>,
Anup Patel <anup@brainfault.org>, Bin Meng <bmeng.cn@gmail.com>,
Leo Liang <ycliang@andestech.com>, rick <rick@andestech.com>,
Nikita Shubin <nikita.shubin@maquefel.me>,
Rick Chen <rickchen36@gmail.com>
Subject: Re: RISCV: the machanism of available_harts may cause other harts boot failure
Date: Mon, 5 Sep 2022 11:45:53 -0400 [thread overview]
Message-ID: <f3a26099-4605-258d-ed01-43afd75b85e4@gmail.com> (raw)
In-Reply-To: <53ef4762-eb1d-043c-69de-a621eb3806d2@gmx.de>
On 9/5/22 11:41 AM, Heinrich Schuchardt wrote:
> On 9/5/22 17:30, Sean Anderson wrote:
>> On 9/5/22 3:47 AM, Nikita Shubin wrote:
>>> Hi Rick!
>>>
>>> On Mon, 5 Sep 2022 14:22:41 +0800
>>> Rick Chen <rickchen36@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When I free-run an SMP system, I once hit a failure case where some
>>>> harts didn't boot to the kernel shell successfully.
>>>> However, it can't be reproduced anymore, even though I have tried many
>>>> times.
>>>>
>>>> But when I set a breakpoint while debugging with GDB, it triggers the
>>>> failure case each time.
>>>
>>> If a hart fails to register itself in available_harts before the main
>>> hart reaches send_ipi_many:
>>> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/lib/smp.c#L50
>>>
>>> it will never exit secondary_hart_loop:
>>> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/cpu/start.S#L433
>>> since no IPI will be sent to it.
>
> Can we call send_ipi_many() again when booting?
AFAIK we do; see arch/riscv/lib/bootm.c and arch/riscv/lib/spl.c
> Do we need to call it before booting?
Yes. We also call it when relocating (in SPL and U-Boot proper).
>>>
>>> This might be exactly your case.
>>
>> When working on the IPI mechanism, I considered this possibility. However,
>> there's really no way to know how long to wait. On normal systems, the boot
>> hart is going to do a lot of work before calling send_ipi_many, and the
>> other harts just have to make it through ~100 instructions. So I figured we
>> would never run into this issue.
>>
>> We might not even need the mask... the only direct reason we might is for
>> OpenSBI, as spl_invoke_opensbi is the only function which uses the wait
>> parameter.
>>
>>>> I think the available_harts mechanism does not guarantee that the
>>>> SMP system boots successfully.
>>>> Maybe we should think of a better way to handle SMP booting, or just
>>>> remove it?
>>>
>>> I haven't experienced any unexplained problem with hart_lottery or
>>> available_harts_lock unless:
>>>
>>> 1) harts are started non-simultaneously
>>> 2) SPL/U-Boot is in some kind of TCM, OCRAM, etc. which is not cleared
>>> on reset, leaving available_harts dirty
>>
>> XIP, of course, has this problem every time and simply doesn't use the mask.
>> I remember thinking a lot about how to deal with this, but I never ended
>> up sending a patch because I didn't have an XIP system.
>>
>> --Sean
>>
>>> 3) something is wrong with atomics
>>>
>>> Also there might be something wrong with IPI send/receive.
>>>
>>>>
>>>> Thread 8 hit Breakpoint 1, harts_early_init ()
>>>>
>>>> (gdb) c
>>>> Continuing.
>>>> [Switching to Thread 7]
>>>>
>>>> Thread 7 hit Breakpoint 1, harts_early_init ()
>>>>
>>>> (gdb)
>>>> Continuing.
>>>> [Switching to Thread 6]
>>>>
>>>> Thread 6 hit Breakpoint 1, harts_early_init ()
>>>>
>>>> (gdb)
>>>> Continuing.
>>>> [Switching to Thread 5]
>>>>
>>>> Thread 5 hit Breakpoint 1, harts_early_init ()
>>>>
>>>> (gdb)
>>>> Continuing.
>>>> [Switching to Thread 4]
>>>>
>>>> Thread 4 hit Breakpoint 1, harts_early_init ()
>>>>
>>>> (gdb)
>>>> Continuing.
>>>> [Switching to Thread 3]
>>>>
>>>> Thread 3 hit Breakpoint 1, harts_early_init ()
>>>> (gdb)
>>>> Continuing.
>>>> [Switching to Thread 2]
>>>>
>>>> Thread 2 hit Breakpoint 1, harts_early_init ()
>>>> (gdb)
>>>> Continuing.
>>>> [Switching to Thread 1]
>>>>
>>>> Thread 1 hit Breakpoint 1, harts_early_init ()
>>>> (gdb)
>>>> Continuing.
>>>> [Switching to Thread 5]
>>>>
>>>>
>>>> Thread 5 hit Breakpoint 3, 0x0000000001200000 in ?? ()
>>>> (gdb) info threads
>>>>   Id  Target Id          Frame
>>>>   1   Thread 1 (hart 1)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>>   2   Thread 2 (hart 2)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>>   3   Thread 3 (hart 3)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>>   4   Thread 4 (hart 4)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>> * 5   Thread 5 (hart 5)  0x0000000001200000 in ?? ()
>>>>   6   Thread 6 (hart 6)  0x000000000000b650 in ?? ()
>>>>   7   Thread 7 (hart 7)  0x000000000000b650 in ?? ()
>>>>   8   Thread 8 (hart 8)  0x0000000000005fa0 in ?? ()
>>>> (gdb) c
>>>> Continuing.
>>>
>>> Do all the "offline" harts remain in the SPL/U-Boot secondary_hart_loop?
>>>
>>>>
>>>>
>>>>
>>>> [ 0.175619] smp: Bringing up secondary CPUs ...
>>>> [ 1.230474] CPU1: failed to come online
>>>> [ 2.282349] CPU2: failed to come online
>>>> [ 3.334394] CPU3: failed to come online
>>>> [ 4.386783] CPU4: failed to come online
>>>> [ 4.427829] smp: Brought up 1 node, 4 CPUs
>>>>
>>>>
>>>> /root # cat /proc/cpuinfo
>>>> processor : 0
>>>> hart : 4
>>>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>> mmu : sv39
>>>>
>>>> processor : 5
>>>> hart : 5
>>>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>> mmu : sv39
>>>>
>>>> processor : 6
>>>> hart : 6
>>>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>> mmu : sv39
>>>>
>>>> processor : 7
>>>> hart : 7
>>>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>> mmu : sv39
>>>>
>>>> /root #
>>>>
>>>> Thanks,
>>>> Rick
>>>
>>
>
next prev parent reply other threads:[~2022-09-05 15:46 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-05 6:22 RISCV: the machanism of available_harts may cause other harts boot failure Rick Chen
2022-09-05 7:47 ` Nikita Shubin
2022-09-05 15:30 ` Sean Anderson
2022-09-05 15:41 ` Heinrich Schuchardt
2022-09-05 15:45 ` Sean Anderson [this message]
2022-09-05 16:00 ` Heinrich Schuchardt
2022-09-05 16:14 ` Sean Anderson
2022-09-05 16:30 ` Heinrich Schuchardt
2022-09-05 17:10 ` Nikita Shubin
2022-09-06 1:51 ` Rick Chen