From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
To: Tony Luck <tony.luck@intel.com>,
Dave Hansen <dave.hansen@intel.com>,
"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
Reinette Chatre <reinette.chatre@intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Len Brown <len.brown@intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>,
Andi Kleen <ak@linux.intel.com>,
Ricardo Neri <ricardo.neri-calderon@linux.intel.com>,
Ricardo Neri <ricardo.neri@intel.com>,
Stephane Eranian <eranian@google.com>,
linux-kernel@vger.kernel.org, iommu@lists.linux-foundation.org,
linuxppc-dev@lists.ozlabs.org
Subject: [PATCH v7 20/24] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
Date: Wed, 1 Mar 2023 15:47:49 -0800
Message-ID: <20230301234753.28582-21-ricardo.neri-calderon@linux.intel.com>
In-Reply-To: <20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com>

It is not possible to determine the source of a non-maskable interrupt
(NMI) on x86. When dealing with an HPET channel, the only direct method to
determine whether it caused an NMI would be to read the Interrupt Status
register.

Reading HPET registers is slow and, therefore, not to be done while in NMI
context. Also, the interrupt status bit is not available if the HPET
channel is programmed to deliver an MSI interrupt.

An indirect way to infer whether the HPET channel is the source of an NMI
is to use the time-stamp counter (TSC). Compute the value that the TSC is
expected to have at the next interrupt of the HPET channel and compare it
with the value it has when the interrupt does happen. Let the difference
between the two be tsc_next_error. If tsc_next_error is less than a certain
value, assume that the HPET channel of the detector is the source of the
NMI.

Below is a table that characterizes tsc_next_error in a collection of
systems. The error is expressed in microseconds as well as a percentage of
tsc_delta: the computed number of TSC counts between two consecutive
interrupts of the HPET channel.

The table summarizes the error of 4096 interrupts of the HPET channel in
two experiments: a) since the system booted and b) ignoring the first 5
minutes after boot.

The maximum observed error in a) is 0.198%. For b) the maximum error is
0.045%.

Allow a maximum tsc_next_error that is twice as big as the maximum error
observed in these experiments: 0.4% of tsc_delta.

watchdog_thresh            1s                  10s                 60s
tsc_next_error         %        us         %        us         %        us

AMD EPYC 7742 64-Core Processor
max(abs(a))      0.04517    451.74   0.00171    171.04   0.00034    201.89
max(abs(b))      0.04517    451.74   0.00171    171.04   0.00034    201.89

Intel(R) Xeon(R) CPU E7-8890 - INTEL_FAM6_HASWELL_X
max(abs(a))      0.00811     81.15   0.00462    462.40   0.00014     81.65
max(abs(b))      0.00811     81.15   0.00084     84.31   0.00014     81.65

Intel(R) Xeon(R) Platinum 8170M - INTEL_FAM6_SKYLAKE_X
max(abs(a))      0.10530   1053.04   0.01324   1324.27   0.00407   2443.25
max(abs(b))      0.01166    116.59   0.00114    114.11   0.00024    143.47

Intel(R) Xeon(R) CPU E5-2699A v4 - INTEL_FAM6_BROADWELL_X
max(abs(a))      0.00010     99.34   0.00099     98.83   0.00016     97.50
max(abs(b))      0.00010     99.34   0.00099     98.83   0.00016     97.50

Intel(R) Xeon(R) Gold 5318H - INTEL_FAM6_COOPERLAKE_X
max(abs(a))      0.11262   1126.17   0.01109   1109.17   0.00409   2455.73
max(abs(b))      0.01073    107.31   0.00109    109.02   0.00019    115.34

Intel(R) Xeon(R) Platinum 8360Y - INTEL_FAM6_ICELAKE_X
max(abs(a))      0.19853   1985.30   0.00784    783.53  -0.00017   -104.77
max(abs(b))      0.01550    155.02   0.00158    157.56   0.00020    117.74

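For illustration only, and not part of this patch: a minimal user-space
sketch of the window check described above. struct hld_window,
hld_window_arm() and hld_window_hit() are hypothetical names. The caller
passes in TSC readings instead of calling rdtsc(), and the check uses plain
unsigned arithmetic rather than time_in_range64().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the detector state; not the kernel's struct. */
struct hld_window {
	uint64_t tsc_next;       /* expected TSC value at the next HPET NMI */
	uint64_t tsc_next_error; /* allowed deviation, in TSC counts */
};

/* Arm the window: the timer fires thresh_s seconds after tsc_now. */
void hld_window_arm(struct hld_window *w, uint64_t tsc_now,
		    uint64_t thresh_s, uint64_t tsc_khz)
{
	uint64_t tsc_delta;

	assert(thresh_s > 0 && tsc_khz > 0);

	/* TSC counts between two consecutive interrupts. */
	tsc_delta = thresh_s * tsc_khz * 1000ULL;
	w->tsc_next = tsc_now + tsc_delta;

	/* tsc_delta / 256 is about 0.39% of tsc_delta. */
	w->tsc_next_error = tsc_delta >> 8;
}

/* True if tsc_now is within +/- tsc_next_error of the expected value. */
bool hld_window_hit(const struct hld_window *w, uint64_t tsc_now)
{
	uint64_t lo = w->tsc_next - w->tsc_next_error;

	/* Unsigned subtraction keeps the test valid across a 64-bit wrap. */
	return (tsc_now - lo) <= 2 * w->tsc_next_error;
}
```

With tsc_khz = 2000000 and a 1 second threshold, tsc_delta is 2,000,000,000
counts and the window is +/- 7,812,500 counts, i.e. about 3.9 ms on either
side of the expected expiration.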
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
NOTE: The error characterization data is repeated here from the cover
letter.
---
Changes since v6:
* Fixed bug when checking the error window. Now check for an error
which is +/-0.4% of the actual TSC value, not +/-0.2%.
Changes since v5:
* Reworked is_hpet_hld_interrupt() to reduce indentation.
* Use time_in_range64() to compare the actual TSC value vs the expected
value. This makes it more readable. (Tony)
* Reduced the error window of the expected TSC value at the time of the
HPET channel expiration.
* Described better the heuristics used to determine if the HPET channel
caused the NMI. (Tony)
* Added a table to characterize the error in the expected TSC value when
the HPET channel fires.
* Removed references to groups of monitored CPUs. Instead, use tsc_khz
directly.
Changes since v4:
* Compute the TSC expected value at the next HPET interrupt based on the
number of monitored packages and not the number of monitored CPUs.
Changes since v3:
* None
Changes since v2:
* Reworked condition to check if the expected TSC value is within the
error margin to avoid an unnecessary conditional. (Peter Zijlstra)
* Removed TSC error margin from struct hld_data; use a global variable
instead. (Peter Zijlstra)
Changes since v1:
* Introduced this patch.
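For readers unfamiliar with the comparison helper mentioned in the v5
changes: a user-space sketch of the wrap-safe range test that
time_in_range64() is understood to perform. after_eq64() and in_range64()
are stand-in names, not the kernel macros, and the unsigned-to-signed cast
assumes the two's-complement conversion that common compilers provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a is at or after b, even if the counter wrapped in between. */
bool after_eq64(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) >= 0;
}

/* True if a lies in [lo, hi], where hi may have wrapped past zero. */
bool in_range64(uint64_t a, uint64_t lo, uint64_t hi)
{
	return after_eq64(a, lo) && after_eq64(hi, a);
}
```

The signed difference is what allows a window whose upper bound has wrapped
past zero to still contain values on both sides of the wrap.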
---
arch/x86/include/asm/hpet.h | 3 ++
arch/x86/kernel/watchdog_hld_hpet.c | 58 +++++++++++++++++++++++++++--
2 files changed, 58 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index c88901744848..af0a504b5cff 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -113,6 +113,8 @@ static inline int is_hpet_enabled(void) { return 0; }
* @channel: HPET channel assigned to the detector
* @channe_priv: Private data of the assigned channel
* @ticks_per_second: Frequency of the HPET timer
+ * @tsc_next: Estimated value of the TSC at the next
+ * HPET timer interrupt
* @irq: IRQ number assigned to the HPET channel
* @handling_cpu: CPU handling the HPET interrupt
* @monitored_cpumask: CPUs monitored by the hardlockup detector
@@ -124,6 +126,7 @@ struct hpet_hld_data {
u32 channel;
struct hpet_channel *channel_priv;
u64 ticks_per_second;
+ u64 tsc_next;
int irq;
u32 handling_cpu;
cpumask_var_t monitored_cpumask;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index b583d3180ae0..a03126e02eda 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -12,6 +12,11 @@
* (offline CPUs also get the NMI but they "ignore" it). A cpumask is used to
* specify whether a CPU must check for hardlockups.
*
+ * It is not possible to determine the source of an NMI. Instead, we calculate
+ * the value that the TSC counter should have when the next HPET NMI occurs. If
+ * it has the calculated value +/- 0.4%, we conclude that the HPET channel is the
+ * source of the NMI.
+ *
* The NMI also disturbs isolated CPUs. The detector fails to initialize if
* tick_nohz_full is enabled.
*/
@@ -34,6 +39,7 @@
#include "apic/local.h"
static struct hpet_hld_data *hld_data;
+static u64 tsc_next_error;
static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
{
@@ -65,12 +71,39 @@ static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
* Reprogram the timer to expire in watchdog_thresh seconds in the future.
* If the timer supports periodic mode, it is not kicked unless @force is
* true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value.
*/
static void kick_timer(struct hpet_hld_data *hdata, bool force)
{
- u64 new_compare, count, period = 0;
+ u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
+
+ tsc_curr = rdtsc();
+
+ /*
+ * Compute the delta between the value of the TSC now and the value
+ * it will have the next time the HPET channel fires.
+ */
+ tsc_delta = watchdog_thresh * tsc_khz * 1000L;
+ hdata->tsc_next = tsc_curr + tsc_delta;
+
+ /*
+ * Define an error window between the expected TSC value and the actual
+ * value it will have the next time the HPET channel fires. Define this
+ * error as percentage of tsc_delta.
+ *
+ * The systems that have been tested so far exhibit an error of 0.05%
+ * of the expected TSC value once the system is up and running. Systems
+ * that refine tsc_khz exhibit a larger initial error up to 0.2%. To be
+ * safe, allow a maximum error of ~0.4% (i.e., tsc_delta / 256).
+ */
+ tsc_next_error = tsc_delta >> 8;
- /* Kick the timer only when needed. */
+ /*
+ * We must always compute the expected TSC value. Kick the timer only
+ * when needed.
+ */
if (!force && hdata->has_periodic)
return;
@@ -133,12 +166,31 @@ static void enable_timer(struct hpet_hld_data *hdata)
* is_hpet_hld_interrupt() - Check if the HPET channel caused the interrupt
* @hdata: A data structure describing the HPET channel
*
+ * Determining the sources of NMIs is not possible. Furthermore, we have
+ * programmed the HPET channel for MSI delivery, which does not have a
+ * status bit. Also, reading HPET registers is slow.
+ *
+ * Instead, we just assume that an NMI delivered within a time window
+ * of when the HPET was expected to fire probably came from the HPET.
+ *
+ * The window is estimated using the TSC counter. Check the comments in
+ * kick_timer() for details on the size of the time window.
+ *
* Returns:
* True if the HPET watchdog timer caused the interrupt. False otherwise.
*/
static bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata)
{
- return false;
+ u64 tsc_curr, tsc_curr_min, tsc_curr_max;
+
+ if (smp_processor_id() != hdata->handling_cpu)
+ return false;
+
+ tsc_curr = rdtsc();
+ tsc_curr_min = tsc_curr - tsc_next_error;
+ tsc_curr_max = tsc_curr + tsc_next_error;
+
+ return time_in_range64(hdata->tsc_next, tsc_curr_min, tsc_curr_max);
}
/**
--
2.25.1