From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 54686C433EF for ; Thu, 31 Mar 2022 20:00:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241141AbiCaUB6 (ORCPT ); Thu, 31 Mar 2022 16:01:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57446 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240855AbiCaUBz (ORCPT ); Thu, 31 Mar 2022 16:01:55 -0400 Received: from 1wt.eu (wtarreau.pck.nerim.net [62.212.114.60]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 7ABFB17335E for ; Thu, 31 Mar 2022 13:00:07 -0700 (PDT) Received: (from willy@localhost) by pcw.home.local (8.15.2/8.15.2/Submit) id 22VJxwec026309; Thu, 31 Mar 2022 21:59:58 +0200 Date: Thu, 31 Mar 2022 21:59:58 +0200 From: Willy Tarreau To: Ben Greear Cc: Linux Kernel Mailing List Subject: Re: Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00 since 5.17 Message-ID: <20220331195958.GA26284@1wt.eu> References: <20220331034343.GC23200@1wt.eu> <794a9c23-d3b9-c454-8f78-760060f0b9f2@candelatech.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <794a9c23-d3b9-c454-8f78-760060f0b9f2@candelatech.com> User-Agent: Mutt/1.10.1 (2018-07-13) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 31, 2022 at 11:36:17AM -0700, Ben Greear wrote: > On 3/30/22 8:43 PM, Willy Tarreau wrote: > > Hi Ben, > > > > On Wed, Mar 30, 2022 at 02:27:56PM -0700, Ben Greear wrote: > > > Run /init as init process > > > Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00 > > > /init: error whitsc: Refined TSC clocksource calibration: 2903.996 MHz > > > le loading shareCPU: 2 PID: 1 Comm: init Not tainted 5.17.0+ #12 > > > d libraries: libclocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x29dc020bb13, max_idle_ns: 440795273180 ns > > > rt.so.1: cannot Hardware name: Default string Default string/SKYBAY, BIOS 5.12 08/04/2020 > > > open shared objeCall Trace: > > > ct file: No such > > > file or directo dump_stack_lvl+0x47/0x5c > > > ry > > > > The coincidence between this error about your userland "libclocksource" > > and the messages about the clock sources being refined makes me wonder > > if there could be an error experienced during this lib's initialization > > at a moment where the list of clocksources appears empty or opening one > > of the /sys file is temporarily refused. I suspect that making a much > > larger or much smaller initrd could change the initialization order > > enough to prevent such an event from happening, but that sounds a bit > > odd :-/ > > > > Willy > > For whatever reason, it was quite reproducible yesterday. I notice that it > often (50+% of the time) failed on soft reboot, but I don't think it failed > a single time when I then went and powered it down fully and powered it back on. > > So possibly it is some un-initialized memory somewhere that is exacerbating > some problem. > > I will keep a watch on these errors and see if they always related to libclocksource. > Looks like 'rt.so.1' is what it cannot find though? So maybe nothing particular to > do with /sys? I really don't know, not knowing these libs. We could imagine that the former only falls back on the latter when failing to find what it needs in /sys. Of course I'm not trying to invent a scenario, just to find a few rational explanations to what you're observing, as it's certainly strange. Maybe other tests could involve trying the reset button and also kexec to introduce some variations in the reboot methods. It sort-of reminds me a BIOS bug I faced a decade ago by which a server would randomly hang on soft reboot, and I finally figured that the BIOS couldn't handle reboots triggered by CPUs other than CPU0. The workaround I proposed then was to always use "taskset -c 0 reboot" and that one would never fail. Maybe you're observing sort of a variation of this, where some devices are not correctly initialized or your initrd gets some parts corrupted when loaded from the wrong CPU, just guessing. Cheers, Willy