Date: Tue, 2 May 2023 17:23:45 +0200
From: Petr Mladek
To: Douglas Anderson
Cc: Andrew Morton, Mark Rutland, Randy Dunlap, Will Deacon,
	Catalin Marinas, Sumit Garg, Daniel Thompson, Ian Rogers,
	ravi.v.shankar@intel.com, Marc Zyngier,
	linux-perf-users@vger.kernel.org,
	Stephane Eranian, kgdb-bugreport@lists.sourceforge.net,
	ito-yuichi@fujitsu.com, linux-arm-kernel@lists.infradead.org,
	Stephen Boyd, Masayoshi Mizuma, ricardo.neri@intel.com,
	Lecopzer Chen, Chen-Yu Tsai, Andi Kleen, Colin Cross,
	Matthias Kaehlcke, Guenter Roeck, Tzung-Bi Shih,
	Alexander Potapenko, AngeloGioacchino Del Regno,
	Geert Uytterhoeven, Juergen Gross, Kees Cook, Laurent Dufour,
	Liam Howlett, Masahiro Yamada, Matthias Brugger, Michael Ellerman,
	Miguel Ojeda, Nathan Chancellor, Nick Desaulniers,
	"Paul E. McKenney", Sami Tolvanen, Vlastimil Babka, Zhaoyang Huang,
	Zhen Lei, linux-kernel@vger.kernel.org,
	linux-mediatek@lists.infradead.org
Subject: cpu hotplug : was: Re: [PATCH v3] hardlockup: detect hard lockups using secondary (buddy) CPUs
References: <20230501082341.v3.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid>
In-Reply-To: <20230501082341.v3.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid>
X-Mailing-List: linux-perf-users@vger.kernel.org

On Mon 2023-05-01 08:24:46, Douglas Anderson wrote:
> From: Colin Cross
>
> Implement a hardlockup detector that doesn't need any extra
> arch-specific support code to detect lockups. Instead of using
> something arch-specific we will use the buddy system, where each CPU
> watches out for another one. Specifically, each CPU will use its
> softlockup hrtimer to check that the next CPU is processing hrtimer
> interrupts by verifying that a counter is increasing.
>
> --- /dev/null
> +++ b/kernel/watchdog_buddy_cpu.c
> +int watchdog_nmi_enable(unsigned int cpu)
> +{
> +	/*
> +	 * The new CPU will be marked online before the first hrtimer interrupt
> +	 * runs on it.

It does not need to be the first hrtimer interrupt. The CPU might have
been offlined and onlined repeatedly, so the counter might have any value.

> +	 * If another CPU tests for a hardlockup on the new CPU
> +	 * before it has run its first hrtimer, it will get a false positive.
> +	 * Touch the watchdog on the new CPU to delay the first check for at
> +	 * least 3 sampling periods to guarantee one hrtimer has run on the new
> +	 * CPU.
> +	 */
> +	per_cpu(watchdog_touch, cpu) = true;

We should also touch the next_cpu:

	/*
	 * We are going to check the next CPU. Our watchdog_hrtimer
	 * need not be zero if the CPU has already been online earlier.
	 * Touch the watchdog on the next CPU to avoid a false positive
	 * if we try to check it in less than 3 interrupts.
	 */
	next_cpu = watchdog_next_cpu(cpu);
	if (next_cpu < nr_cpu_ids)
		per_cpu(watchdog_touch, next_cpu) = true;

An alternative would be to clear watchdog_hrtimer. But that would
also affect the softlockup detector.

> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_set_cpu(cpu, &watchdog_cpus);
> +	return 0;
> +}
> +
> +void watchdog_nmi_disable(unsigned int cpu)
> +{
> +	unsigned int next_cpu = watchdog_next_cpu(cpu);
> +
> +	/*
> +	 * Offlining this CPU will cause the CPU before this one to start
> +	 * checking the one after this one. If this CPU just finished checking
> +	 * the next CPU and updating hrtimer_interrupts_saved, and then the
> +	 * previous CPU checks it within one sample period, it will trigger a
> +	 * false positive. Touch the watchdog on the next CPU to prevent it.
> +	 */
> +	if (next_cpu < nr_cpu_ids)
> +		per_cpu(watchdog_touch, next_cpu) = true;
> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_clear_cpu(cpu, &watchdog_cpus);
> +}
> +

Best Regards,
Petr
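
PS: The online-path false positive discussed above can be reproduced
in a small userspace model. This is only a sketch, not kernel code.
Every name in it (struct sim_cpu, sim_next_cpu(), sim_hrtimer(),
sim_online(), sim_false_positive(), SIM_NR_CPUS) is made up for
illustration, and it assumes that a CPU checks its buddy on every
third hrtimer interrupt and that a touched buddy is skipped once.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NR_CPUS 3 /* hypothetical tiny system */

struct sim_cpu {
	bool online;
	bool touch;               /* stands in for watchdog_touch */
	unsigned long irqs;       /* per-CPU hrtimer interrupt counter */
	unsigned long irqs_saved; /* stands in for hrtimer_interrupts_saved */
};

static struct sim_cpu sim[SIM_NR_CPUS];

/* Next online CPU after @cpu (wrapping), or SIM_NR_CPUS if none. */
static unsigned int sim_next_cpu(unsigned int cpu)
{
	for (unsigned int i = 1; i < SIM_NR_CPUS; i++) {
		unsigned int c = (cpu + i) % SIM_NR_CPUS;

		if (sim[c].online)
			return c;
	}
	return SIM_NR_CPUS;
}

/* One hrtimer interrupt on @cpu; returns true if it reports a lockup. */
static bool sim_hrtimer(unsigned int cpu)
{
	unsigned int next;

	assert(cpu < SIM_NR_CPUS && sim[cpu].online);

	sim[cpu].irqs++;
	if (sim[cpu].irqs % 3 != 0) /* assumed: check every 3rd interrupt */
		return false;

	next = sim_next_cpu(cpu);
	if (next >= SIM_NR_CPUS)
		return false;

	if (sim[next].touch) { /* touched: skip this round */
		sim[next].touch = false;
		return false;
	}

	if (sim[next].irqs == sim[next].irqs_saved)
		return true; /* counter did not move since last check */

	sim[next].irqs_saved = sim[next].irqs;
	return false;
}

/* Bring @cpu online; optionally touch the CPU it is going to check. */
static void sim_online(unsigned int cpu, bool touch_next)
{
	sim[cpu].touch = true;
	if (touch_next) {
		unsigned int next = sim_next_cpu(cpu);

		if (next < SIM_NR_CPUS)
			sim[next].touch = true;
	}
	sim[cpu].online = true;
}

/* Replays the race; returns true if a healthy CPU 2 gets reported. */
static bool sim_false_positive(bool touch_next)
{
	for (unsigned int c = 0; c < SIM_NR_CPUS; c++)
		sim[c] = (struct sim_cpu){ 0 };

	sim[0].online = true;
	sim[2].online = true;
	sim[0].irqs = 2;       /* CPU 0's next interrupt is a checking one */
	sim[1].irqs = 2;       /* stale value from an earlier online period */
	sim[2].irqs = 7;       /* CPU 2 is healthy and has been ticking */
	sim[2].irqs_saved = 6;

	sim_hrtimer(0);        /* CPU 0 checks CPU 2 and saves its counter */
	sim_online(1, touch_next);
	return sim_hrtimer(1); /* CPU 1 checks CPU 2 before it ticks again */
}
```

Calling sim_false_positive(false) replays the problem: CPU 1 comes
online with a stale counter, its very first interrupt is a checking
one, and it sees the counter that CPU 0 has just saved for CPU 2.
Calling sim_false_positive(true) touches the next CPU as well, so that
first check is skipped.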