Date: Thu, 5 Mar 2026 14:46:56 +0100
From: Petr Mladek
To: mrungta@google.com
Cc: Jonathan Corbet, Jinchao Wang, Yunhui Cui, Stephane Eranian,
	Ian Rogers, Li Huafei, Feng Tang, Max Kellermann, Douglas Anderson,
	Andrew Morton, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH 3/4] watchdog/hardlockup: improve buddy system detection timeliness
References: <20260212-hardlockup-watchdog-fixes-v1-0-745f1dce04c3@google.com>
	<20260212-hardlockup-watchdog-fixes-v1-3-745f1dce04c3@google.com>
In-Reply-To: <20260212-hardlockup-watchdog-fixes-v1-3-745f1dce04c3@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 2026-02-12 14:12:12, Mayank Rungta via B4 Relay wrote:
> From: Mayank Rungta
>
> Currently, the buddy system only performs checks every 3rd sample. With
> a 4-second interval. If a check window is missed, the next check occurs
> 12 seconds later, potentially delaying hard lockup detection for up to
> 24 seconds.
>
> Modify the buddy system to perform checks at every interval (4s).
> Introduce a missed-interrupt threshold to maintain the existing grace
> period while reducing the detection window to 8-12 seconds.
>
> Best and worst case detection scenarios:
>
> Before (12s check window):
> - Best case: Lockup occurs after first check but just before heartbeat
>   interval. Detected in ~8s (8s till next check).
> - Worst case: Lockup occurs just after a check.
>   Detected in ~24s (missed check + 12s till next check + 12s logic).
>
> After (4s check window with threshold of 3):
> - Best case: Lockup occurs just before a check.
>   Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
> - Worst case: Lockup occurs just after a check.
>   Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).

One might argue that the interval <8s,24s> is not much worse than
<6s,20s> achieved by the perf detector. But I personally like that
the spread of <8s,12s> is lower, so the result is more predictable.
And it is relatively cheap.

People might have a different opinion. But I am fine with this change.

> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -163,8 +171,13 @@ static bool is_hardlockup(unsigned int cpu)
>  {
>  	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
>
> -	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> -		return true;
> +	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) {
> +		per_cpu(hrtimer_interrupts_missed, cpu)++;
> +		if (per_cpu(hrtimer_interrupts_missed, cpu) >= watchdog_hardlockup_miss_thresh)

This would return true for every check when missed >= 3.
As a result, the hardlockup would be reported every 4s.

I would keep the 12s cadence and change this to:

		if (per_cpu(hrtimer_interrupts_missed, cpu) % watchdog_hardlockup_miss_thresh == 0)

> +			return true;
> +
> +		return false;
> +	}
>
>  	/*
>  	 * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
> --- a/kernel/watchdog_buddy.c
> +++ b/kernel/watchdog_buddy.c
> @@ -86,14 +87,6 @@ void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
>  {
>  	unsigned int next_cpu;
>
> -	/*
> -	 * Test for hardlockups every 3 samples. The sample period is
> -	 * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
> -	 * watchdog_thresh (over by 20%).
> -	 */
> -	if (hrtimer_interrupts % 3 != 0)
> -		return;

It would be symmetric with the "% 3" above.
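
For illustration only, here is a minimal userspace sketch (not kernel
code) comparing the ">=" condition from the patch with the "%" variant.
It assumes a threshold of 3, a 4s check interval, and a CPU that stays
locked up so that every check is a miss. The names MISS_THRESH,
INTERVAL_S, report_ge(), report_mod() and count_reports() are
hypothetical helpers made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MISS_THRESH 3	/* assumed value of watchdog_hardlockup_miss_thresh */
#define INTERVAL_S  4	/* assumed check interval in seconds */

/* Condition from the patch: true on every check once the threshold is hit. */
bool report_ge(int missed)
{
	return missed >= MISS_THRESH;
}

/* Modulo condition: true only on every MISS_THRESH-th missed check. */
bool report_mod(int missed)
{
	return missed % MISS_THRESH == 0;
}

/*
 * Count the reports produced over 'checks' consecutive missed checks.
 * The counter is incremented before it is tested, as in the quoted hunk,
 * so the first tested value is 1.
 */
int count_reports(bool (*report)(int), int checks)
{
	int missed, reports = 0;

	assert(checks >= 0);
	for (missed = 1; missed <= checks; missed++)
		if (report(missed))
			reports++;
	return reports;
}
```

Over 9 missed checks (9 * INTERVAL_S = 36s under these assumptions),
count_reports(report_ge, 9) gives 7, i.e. one report every 4s from the
third miss on, while count_reports(report_mod, 9) gives 3, i.e. one
report every 12s.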
> -
>  	/* check for a hardlockup on the next CPU */
>  	next_cpu = watchdog_next_cpu(smp_processor_id());
>  	if (next_cpu >= nr_cpu_ids)

Best Regards,
Petr