Date: Tue, 2 May 2023 17:23:45 +0200
From: Petr Mladek
To: Douglas Anderson
Cc: Andrew Morton, Mark Rutland, Randy Dunlap, Will Deacon, Catalin Marinas,
 Sumit Garg, Daniel Thompson, Ian Rogers, ravi.v.shankar@intel.com,
 Marc Zyngier, linux-perf-users@vger.kernel.org, Stephane Eranian,
 kgdb-bugreport@lists.sourceforge.net, ito-yuichi@fujitsu.com,
 linux-arm-kernel@lists.infradead.org, Stephen Boyd, Masayoshi Mizuma,
 ricardo.neri@intel.com, Lecopzer Chen, Chen-Yu Tsai, Andi Kleen, Colin Cross,
 Matthias Kaehlcke, Guenter Roeck, Tzung-Bi Shih, Alexander Potapenko,
 AngeloGioacchino Del Regno, Geert Uytterhoeven, Juergen Gross, Kees Cook,
 Laurent Dufour, Liam Howlett, Masahiro Yamada, Matthias Brugger,
 Michael Ellerman, Miguel Ojeda, Nathan Chancellor, Nick Desaulniers,
 "Paul E. McKenney", Sami Tolvanen, Vlastimil Babka, Zhaoyang Huang, Zhen Lei,
 linux-kernel@vger.kernel.org, linux-mediatek@lists.infradead.org
Subject: cpu hotplug : was: Re: [PATCH v3] hardlockup: detect hard lockups using secondary (buddy) CPUs
In-Reply-To: <20230501082341.v3.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid>

On Mon 2023-05-01 08:24:46, Douglas Anderson wrote:
> From: Colin Cross
>
> Implement a hardlockup detector that doesn't need any extra
> arch-specific support code to detect lockups. Instead of using
> something arch-specific we will use the buddy system, where each CPU
> watches out for another one. Specifically, each CPU will use its
> softlockup hrtimer to check that the next CPU is processing hrtimer
> interrupts by verifying that a counter is increasing.
>
> --- /dev/null
> +++ b/kernel/watchdog_buddy_cpu.c
> +int watchdog_nmi_enable(unsigned int cpu)
> +{
> +	/*
> +	 * The new CPU will be marked online before the first hrtimer interrupt
> +	 * runs on it.

It does not need to be the first hrtimer interrupt: the CPU might have
been offlined and onlined repeatedly, so the counter might have any value.

> +	 * If another CPU tests for a hardlockup on the new CPU
> +	 * before it has run its first hrtimer, it will get a false positive.
> +	 * Touch the watchdog on the new CPU to delay the first check for at
> +	 * least 3 sampling periods to guarantee one hrtimer has run on the new
> +	 * CPU.
> +	 */
> +	per_cpu(watchdog_touch, cpu) = true;

We should also touch the next CPU:

	/*
	 * We are going to check the next CPU. Our watchdog_hrtimer
	 * need not be zero if the CPU has already been online earlier.
	 * Touch the watchdog on the next CPU to avoid a false positive
	 * if we try to check it in less than 3 interrupts.
	 */
	next_cpu = watchdog_next_cpu(cpu);
	if (next_cpu < nr_cpu_ids)
		per_cpu(watchdog_touch, next_cpu) = true;

An alternative would be to clear watchdog_hrtimer, but that would also
affect the softlockup detector.

> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_set_cpu(cpu, &watchdog_cpus);
> +	return 0;
> +}
> +
> +void watchdog_nmi_disable(unsigned int cpu)
> +{
> +	unsigned int next_cpu = watchdog_next_cpu(cpu);
> +
> +	/*
> +	 * Offlining this CPU will cause the CPU before this one to start
> +	 * checking the one after this one. If this CPU just finished checking
> +	 * the next CPU and updating hrtimer_interrupts_saved, and then the
> +	 * previous CPU checks it within one sample period, it will trigger a
> +	 * false positive. Touch the watchdog on the next CPU to prevent it.
> +	 */
> +	if (next_cpu < nr_cpu_ids)
> +		per_cpu(watchdog_touch, next_cpu) = true;
> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_clear_cpu(cpu, &watchdog_cpus);
> +}
> +

Best Regards,
Petr