Date: Mon, 5 Sep 2022 20:10:36 +0300
From: Nikita Shubin
To: Sean Anderson
Cc: Rick Chen, Lukas Auer, U-Boot Mailing List, Heinrich Schuchardt, Atish Patra, Anup Patel, Bin Meng, Leo Liang, rick
Subject: Re: RISCV: the machanism of available_harts may cause other harts boot failure
Message-ID: <20220905201036.156bd9fd@redslave.neermore.group>
References: <20220905104735.5c2a260d@redslave.neermore.group>

On Mon, 5 Sep 2022 11:30:38 -0400
Sean Anderson wrote:

> On 9/5/22 3:47 AM, Nikita Shubin wrote:
> > Hi Rick!
> >
> > On Mon, 5 Sep 2022 14:22:41 +0800
> > Rick Chen wrote:
> >
> >> Hi,
> >>
> >> When I free-run a SMP system, I once hit a failure case where some
> >> harts didn't boot to the kernel shell successfully.
> >> However it can't be duplicated anymore even if I try many times.
> >>
> >> But when I set a break during debugging with GDB, it can trigger
> >> the failure case each time.
> >
> > If hart fails to register itself to available_harts before
> > send_ipi_many is hit by the main hart:
> > https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/lib/smp.c#L50
> >
> > it won't exit the secondary_hart_loop:
> > https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/cpu/start.S#L433
> > As no ipi will be sent to it.
> >
> > This might be exactly your case.
>
> When working on the IPI mechanism, I considered this possibility.
> However, there's really no way to know how long to wait. On normal
> systems, the boot hart is going to do a lot of work before calling
> send_ipi_many, and the other harts just have to make it through ~100
> instructions. So I figured we would never run into this issue.
>
> We might not even need the mask... the only direct reason we might is
> for OpenSBI, as spl_invoke_opensbi is the only function which uses
> the wait parameter.

Actually, I think available_harts duplicates information that is
already in the device tree, so we can:

1) drop registering harts in start.S (and the related lock) completely
2) fill gd->arch.available_harts in send_ipi_many relying on the
   device tree, and also make riscv_send_ipi non-fatal
3) move this procedure to the very end, just before
   spl_invoke_opensbi
4) maybe even wrap all of the above in some CONFIG option which
   enforces checking that harts are alive, otherwise just pass the
   device tree hart count

>
> >> I think the mechanism of available_harts does not provide a method
> >> that guarantees the success of the SMP system.
> >> Maybe we shall think of a better way for the SMP booting or just
> >> remove it ?
> >
> > I haven't experienced any unexplained problem with hart_lottery or
> > available_harts_lock unless:
> >
> > 1) harts are started non-simultaneously
> > 2) SPL/U-Boot is in some kind of TCM, OCRAM, etc... which is not
> > cleared on reset which leaves available_harts dirty
>
> XIP, of course, has this problem every time and just doesn't use the
> mask.
> I remember thinking a lot about how to deal with this, but I
> never ended up sending a patch because I didn't have a XIP system.

It can be partly emulated by setting up the SPL region as read-only
via PMP before start.

>
> --Sean
>
> > 3) something is wrong with atomics
> >
> > Also there might be something wrong with IPI send/receive.
> >
> >>
> >> Thread 8 hit Breakpoint 1, harts_early_init ()
> >>
> >> (gdb) c
> >> Continuing.
> >> [Switching to Thread 7]
> >>
> >> Thread 7 hit Breakpoint 1, harts_early_init ()
> >>
> >> (gdb)
> >> Continuing.
> >> [Switching to Thread 6]
> >>
> >> Thread 6 hit Breakpoint 1, harts_early_init ()
> >>
> >> (gdb)
> >> Continuing.
> >> [Switching to Thread 5]
> >>
> >> Thread 5 hit Breakpoint 1, harts_early_init ()
> >>
> >> (gdb)
> >> Continuing.
> >> [Switching to Thread 4]
> >>
> >> Thread 4 hit Breakpoint 1, harts_early_init ()
> >>
> >> (gdb)
> >> Continuing.
> >> [Switching to Thread 3]
> >>
> >> Thread 3 hit Breakpoint 1, harts_early_init ()
> >> (gdb)
> >> Continuing.
> >> [Switching to Thread 2]
> >>
> >> Thread 2 hit Breakpoint 1, harts_early_init ()
> >> (gdb)
> >> Continuing.
> >> [Switching to Thread 1]
> >>
> >> Thread 1 hit Breakpoint 1, harts_early_init ()
> >> (gdb)
> >> Continuing.
> >> [Switching to Thread 5]
> >>
> >>
> >> Thread 5 hit Breakpoint 3, 0x0000000001200000 in ?? ()
> >> (gdb) info threads
> >>   Id   Target Id          Frame
> >>   1    Thread 1 (hart 1)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
> >>   2    Thread 2 (hart 2)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
> >>   3    Thread 3 (hart 3)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
> >>   4    Thread 4 (hart 4)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
> >> * 5    Thread 5 (hart 5)  0x0000000001200000 in ?? ()
> >>   6    Thread 6 (hart 6)  0x000000000000b650 in ?? ()
> >>   7    Thread 7 (hart 7)  0x000000000000b650 in ?? ()
> >>   8    Thread 8 (hart 8)  0x0000000000005fa0 in ?? ()
> >> (gdb) c
> >> Continuing.
> >
> > Do all the "offline" harts remain in the SPL/U-Boot
> > secondary_hart_loop?
> >>
> >>
> >>
> >> [ 0.175619] smp: Bringing up secondary CPUs ...
> >> [ 1.230474] CPU1: failed to come online
> >> [ 2.282349] CPU2: failed to come online
> >> [ 3.334394] CPU3: failed to come online
> >> [ 4.386783] CPU4: failed to come online
> >> [ 4.427829] smp: Brought up 1 node, 4 CPUs
> >>
> >>
> >> /root # cat /proc/cpuinfo
> >> processor : 0
> >> hart : 4
> >> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
> >> mmu : sv39
> >>
> >> processor : 5
> >> hart : 5
> >> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
> >> mmu : sv39
> >>
> >> processor : 6
> >> hart : 6
> >> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
> >> mmu : sv39
> >>
> >> processor : 7
> >> hart : 7
> >> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
> >> mmu : sv39
> >>
> >> /root #
> >>
> >> Thanks,
> >> Rick
> >
>
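
To make the idea of deriving the hart mask from the device tree and
making a failed IPI non-fatal more concrete, here is a rough,
standalone sketch. It is not U-Boot code: struct dt_cpu,
harts_from_dt(), send_ipi_many_sketch(), demo_send_ipi() and MAX_HARTS
are all hypothetical names made up for illustration, and the real
change would use the existing device tree helpers and
riscv_send_ipi() instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HARTS 64 /* hypothetical limit, one bit per hart in the mask */

/* Hypothetical stand-in for one cpu node under /cpus in the device tree. */
struct dt_cpu {
	unsigned int hartid; /* value of the "reg" property */
	bool enabled;        /* node status is "okay" */
};

/*
 * Build the hart mask from device tree data only, so that no hart has to
 * register itself in start.S before the boot hart looks at the mask.
 */
static uint64_t harts_from_dt(const struct dt_cpu *cpus, size_t n)
{
	uint64_t mask = 0;

	for (size_t i = 0; i < n; i++) {
		if (cpus[i].enabled && cpus[i].hartid < MAX_HARTS)
			mask |= UINT64_C(1) << cpus[i].hartid;
	}
	return mask;
}

/* Hypothetical IPI backend: 0 on success, negative value on failure. */
typedef int (*send_ipi_fn)(unsigned int hartid);

/*
 * Send an IPI to every hart in mask except the boot hart. A failed send is
 * skipped instead of aborting the loop; the return value is the mask of
 * harts for which the send succeeded.
 */
static uint64_t send_ipi_many_sketch(uint64_t mask, unsigned int boot_hart,
				     send_ipi_fn send)
{
	uint64_t reached = 0;

	for (unsigned int h = 0; h < MAX_HARTS; h++) {
		if (h == boot_hart || !(mask & (UINT64_C(1) << h)))
			continue;
		if (send(h) == 0)
			reached |= UINT64_C(1) << h;
	}
	return reached;
}

/* Demo backend for this sketch only: pretends hart 3 cannot be reached. */
static int demo_send_ipi(unsigned int hartid)
{
	return hartid == 3 ? -1 : 0;
}
```

With something like this, the caller could compare the returned mask
against the device tree mask just before spl_invoke_opensbi and decide
(perhaps under a CONFIG option) whether a missing hart is an error or
only worth a warning.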