public inbox for linux-rt-users@vger.kernel.org
* [PATCH v3] rt-tests: hwlatdetect: Add MTBF calculation
@ 2026-02-20 18:24 Costa Shulyupin
  2026-02-24 23:44 ` John Kacur
  0 siblings, 1 reply; 3+ messages in thread
From: Costa Shulyupin @ 2026-02-20 18:24 UTC
  To: linux-rt-users; +Cc: Bart Wensley, John Kacur, Clark Williams, Costa Shulyupin

Hwlatdetect reports the number of latency spikes but provides no
information about how frequently they occur over time. This makes it
difficult to compare results: a test with 10 spikes over 1 hour is very
different from one with 10 spikes over 24 hours, yet both show
'spikes = 10'.

Add Mean Time Between Failures (MTBF) calculation to quantify spike
frequency.

By definition, MTBF = total operating time / number of failures.

When the failure interval is large relative to test duration, this
formula is biased.  For example, imagine stable periodic failures.  The
total operating time will include time before the first failure and
after the last failure.  These intervals are determined by when the test
starts and stops, not by the system's failure behavior, which adds
measurement bias.  The resulting MTBF will vary between runs even for
stable periodic failures.
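
The bias can be illustrated with a minimal sketch, assuming a
hypothetical system that fails exactly every 100 seconds and a 250-second
test whose start time is shifted. The helper names and all numbers are
illustrative only and are not part of hwlatdetect.

```python
def failures_in_window(period, start, duration):
    """Timestamps of failures occurring at multiples of `period`
    within the half-open test window [start, start + duration)."""
    t = -(-start // period) * period  # first multiple of period >= start
    stamps = []
    while t < start + duration:
        stamps.append(t)
        t += period
    return stamps


def naive_mtbf(period, start, duration):
    """Textbook MTBF: total operating time / number of failures."""
    n = len(failures_in_window(period, start, duration))
    return duration / n if n else None


# Same failure behavior, same test length, different start offsets.
results = {start: naive_mtbf(100, start, 250) for start in (0, 10, 60)}
for start, mtbf in results.items():
    print(f"start={start:3d}s  naive MTBF={mtbf:.2f}s")
```

Although the failure period never changes, the naive result depends on
where the test window happens to fall relative to the failures.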

To reduce this bias, calculate MTBF using only the time between the
first and last failure, divided by the number of intervals (failures
minus one):

    MTBF = (timestamp of last failure - timestamp of first failure)
           / (number of failures - 1)

In hwlatdetect, the failures are called samples.
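
As a standalone sketch of the interval-based formula, assuming the
timestamps arrive in chronological order and may be strings (the patch
converts them with float()). The function name interval_mtbf is
hypothetical and does not exist in hwlatdetect.

```python
def interval_mtbf(timestamps):
    """Mean time between failures using only the span between the first
    and the last failure, divided by the number of intervals.

    Returns None when fewer than two failures were recorded, because no
    interval can be measured in that case.
    """
    if len(timestamps) < 2:
        return None
    first = float(timestamps[0])
    last = float(timestamps[-1])
    return (last - first) / (len(timestamps) - 1)


# Periodic failures every 100 s give the same result regardless of how
# much idle time precedes the first or follows the last failure.
print(interval_mtbf(["100.5", "200.5", "300.5"]))
```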

This metric enables meaningful comparison of real-time performance
consistency across different test runs, hardware configurations, and
kernel versions.  It can be considered a KPI for real-time stability,
relevant for certification and SLA evaluation.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---

v3:
- Fix formatting
- Make first and last instance variables
v2:
- Use another more stable calculation of MTBF
---
 src/hwlatdetect/hwlatdetect.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/hwlatdetect/hwlatdetect.py b/src/hwlatdetect/hwlatdetect.py
index 38671724f3e3..bb12019ef2d6 100755
--- a/src/hwlatdetect/hwlatdetect.py
+++ b/src/hwlatdetect/hwlatdetect.py
@@ -306,6 +306,8 @@     def __init__(self):
             raise DetectorNotAvailable("hwlat", "hwlat tracer not available")
         self.type = "tracer"
         self.samples = []
+        self.first = None
+        self.last = None
         self.set("enable", 0)
         self.set('current_tracer', 'hwlat')
 
@@ -338,6 +340,8 @@     def detect(self):
                 pollcnt += 1
                 val = self.get_sample()
                 while val:
+                    self.first = self.first or val.timestamp
+                    self.last = val.timestamp
                     self.samples.append(val)
                     if watch:
                         val.display()
@@ -559,6 +563,9 @@     def cleanup(self):
     exceeding = detect.get("count")
     info(f"Samples exceeding threshold: {exceeding}")
 
+    if exceeding > 1:
+        info(f"MTBF: {(float(detect.last) - float(detect.first)) / (exceeding - 1):.3f} seconds")
+
     if detect.have_msr:
         finishsmi = detect.getsmicounts()
         total_smis = 0
-- 
2.53.0



* Re: [PATCH v3] rt-tests: hwlatdetect: Add MTBF calculation
  2026-02-20 18:24 [PATCH v3] rt-tests: hwlatdetect: Add MTBF calculation Costa Shulyupin
@ 2026-02-24 23:44 ` John Kacur
  2026-02-25  9:25   ` Costa Shulyupin
  0 siblings, 1 reply; 3+ messages in thread
From: John Kacur @ 2026-02-24 23:44 UTC
  To: Costa Shulyupin; +Cc: linux-rt-users, Bart Wensley, John Kacur, Clark Williams



On Fri, 20 Feb 2026, Costa Shulyupin wrote:

8< -----------

>  
> +    if exceeding > 1:
> +        info(f"MTBF: {(float(detect.last) - float(detect.first)) / (exceeding - 1):.3f} seconds")
> +

With newer kernels you can get multiple latency events per timestamp
(i.e. a count), but you don't know when they occurred; the timestamp
records only the occurrence of the first event.
Does it therefore make more sense to measure the mean time between
samples (distinct timestamp entries)?

Does the following make more sense? 

if len(detect.samples) > 1:
    info(f"MTBF: {(float(detect.last) - float(detect.first)) / (len(detect.samples) - 1):.3f} seconds")

>      if detect.have_msr:
>          finishsmi = detect.getsmicounts()
>          total_smis = 0
> -- 

John


* Re: [PATCH v3] rt-tests: hwlatdetect: Add MTBF calculation
  2026-02-24 23:44 ` John Kacur
@ 2026-02-25  9:25   ` Costa Shulyupin
  0 siblings, 0 replies; 3+ messages in thread
From: Costa Shulyupin @ 2026-02-25  9:25 UTC
  To: John Kacur; +Cc: linux-rt-users, John Kacur

On Wed, 25 Feb 2026 at 01:45, John Kacur <jkacur@gmail.com> wrote:
> With newer kernels you can get multiple latency events per timestamp
> (i.e. a count), but you don't know when they occurred; the timestamp
> records only the occurrence of the first event.
> Does it therefore make more sense to measure the mean time between
> samples (distinct timestamp entries)?
>
> Does the following make more sense?
>
> if len(detect.samples) > 1:
>     info(f"MTBF: {(float(detect.last) - float(detect.first)) / (len(detect.samples) - 1):.3f} seconds")

I disagree. Naturally, a longer sampling width increases the error.
Using the number of samples instead of the number of failures might
appear more deterministic, since it avoids the resolution error inherent
in the width, but it fails to estimate MTBF. With several failures per
sample width, such a metric becomes proportional to the duration of the
sampling width, whereas a true MTBF should remain independent of it.
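
The two divisors under discussion can be compared in a small sketch,
assuming each sample is a hypothetical (timestamp, count) pair in
chronological order. This data layout and both function names are
illustrative assumptions, not the hwlatdetect data structures.

```python
def span(samples):
    """Time between the first and the last sample timestamp."""
    return samples[-1][0] - samples[0][0]


def mtbf_by_failures(samples):
    """Divide by the number of failure intervals (sum of counts - 1)."""
    total = sum(count for _, count in samples)
    if total < 2 or len(samples) < 2:
        return None
    return span(samples) / (total - 1)


def mtbf_by_samples(samples):
    """Divide by the number of sample intervals (distinct entries - 1)."""
    if len(samples) < 2:
        return None
    return span(samples) / (len(samples) - 1)


# Three samples, 10 s apart, each reporting three latency events.
samples = [(0.0, 3), (10.0, 3), (20.0, 3)]
print(mtbf_by_failures(samples))  # 20 s over 8 failure intervals
print(mtbf_by_samples(samples))   # 20 s over 2 sample intervals
```

With several events per sample, the sample-based figure tracks the
spacing of the samples rather than the spacing of the events.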

Your question has led me to consider how non-sampling periods
contribute to the error margin. The adjusted MTBF should therefore be
compensated by multiplying it by the ratio of window to width. I’ll
send an updated patch that includes this compensation factor.

Thanks
Costa


