public inbox for linux-rt-users@vger.kernel.org
* [PATCH v4] hwlatdetect: Add MTBF calculation
@ 2026-02-25 13:32 Costa Shulyupin
From: Costa Shulyupin @ 2026-02-25 13:32 UTC
  To: linux-rt-users; +Cc: John Kacur, Costa Shulyupin, Luis Claudio R. Goncalves

Hwlatdetect reports the number of latency spikes but provides no
information about their frequency distribution over time. This makes it
difficult to compare results - a test with 10 spikes over 1 hour is very
different from 10 spikes over 24 hours, but both show 'spikes = 10'.

Add Mean Time Between Failures (MTBF) calculation to quantify spike
frequency.

By definition, MTBF = total operating time / number of failures.

When the failure interval is large relative to test duration, this
formula is biased.  For example, imagine stable periodic failures.  The
total operating time will include time before the first failure and
after the last failure.  These intervals are determined by when the test
starts and stops, not by the system's failure behavior, which adds
measurement bias.  The resulting MTBF will vary between runs even for
stable periodic failures.

To reduce this bias, calculate MTBF using only the time between the
first and last failure, divided by the number of intervals (failures
minus one).
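
The effect can be illustrated with a small sketch. It is not part of the
patch: the helper names and the numbers (a failure every 100 seconds, a
250-second test) are assumptions chosen for illustration only. Shifting
the moment the test starts changes the result of the textbook formula,
while the first-to-last calculation stays the same.

```python
def failure_times(period, offset, duration):
    """Timestamps of strictly periodic failures observed in [0, duration)."""
    times = []
    t = offset
    while t < duration:
        times.append(t)
        t += period
    return times


def naive_mtbf(times, duration):
    """Textbook definition: total operating time / number of failures."""
    return duration / len(times)


def interval_mtbf(times):
    """Time between first and last failure / number of intervals."""
    return (times[-1] - times[0]) / (len(times) - 1)


# Same failure period, two different test start phases (hypothetical values).
for offset in (10.0, 60.0):
    times = failure_times(100.0, offset, 250.0)
    print(f"offset={offset}: naive={naive_mtbf(times, 250.0):.1f} "
          f"interval={interval_mtbf(times):.1f}")
```

With these inputs the textbook value moves between roughly 83 and 125
seconds depending on the start phase, while the interval-based value is
100 seconds in both runs.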

Additionally, hwlatdetect only samples during the 'width' period within
each 'window' cycle. The non-sampling periods contribute to the error
margin. The MTBF is therefore adjusted by multiplying it by the ratio
of window to width:

MTBF = (timestamp of last failure - timestamp of first failure) * window
/ ((number of failures - 1) * width)
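
As a standalone sketch, the formula above can be written as follows. The
function name and the sample numbers are assumptions for illustration;
in the patch itself the inputs come from the detector object. Window and
width only need to share a unit, because just their ratio is used.

```python
def adjusted_mtbf(first, last, failures, window, width):
    """MTBF as stated in the formula above.

    first, last: timestamps of the first and last failure, in seconds
    failures:    total number of failures observed
    window:      length of one sampling cycle
    width:       sampling time inside each window (same unit as window)

    Returns None when fewer than two failures were seen, since no
    interval between failures exists in that case.
    """
    if failures < 2:
        return None
    return (last - first) * window / ((failures - 1) * width)


# Hypothetical run: 4 failures spread over 600 seconds, sampling during
# half of each window.
print(adjusted_mtbf(100.0, 700.0, 4, 1000000, 500000))  # 400.0
```

Returning None for fewer than two failures mirrors the 'exceeding > 1'
guard in the patch, which skips the MTBF line in that case.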

In hwlatdetect, the failures are called samples. The failure count is
the sum of counters from all samples.

This metric enables meaningful comparison of hardware latency across
different test runs, hardware configurations, and kernel versions.  It
can be considered a KPI for real-time stability, relevant for
certification and SLA evaluation.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---

v4: Adjust MTBF by multiplying by the ratio of window to width
v3:
- Fix formatting
- Make first and last instance variables
v2:
- Use another more stable calculation of MTBF
---
 src/hwlatdetect/hwlatdetect.py | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/hwlatdetect/hwlatdetect.py b/src/hwlatdetect/hwlatdetect.py
index 38671724f3e3..abfbb954fe75 100755
--- a/src/hwlatdetect/hwlatdetect.py
+++ b/src/hwlatdetect/hwlatdetect.py
@@ -306,6 +306,8 @@     def __init__(self):
             raise DetectorNotAvailable("hwlat", "hwlat tracer not available")
         self.type = "tracer"
         self.samples = []
+        self.first = None
+        self.last = None
         self.set("enable", 0)
         self.set('current_tracer', 'hwlat')
 
@@ -338,6 +340,8 @@     def detect(self):
                 pollcnt += 1
                 val = self.get_sample()
                 while val:
+                    self.first = self.first or val.timestamp
+                    self.last = val.timestamp
                     self.samples.append(val)
                     if watch:
                         val.display()
@@ -559,6 +563,11 @@     def cleanup(self):
     exceeding = detect.get("count")
     info(f"Samples exceeding threshold: {exceeding}")
 
+    if exceeding > 1:
+        mtbf = ((float(detect.last) - float(detect.first)) * int(detect.get('window'))
+                / ((exceeding - 1) * int(detect.get('width'))))
+        info(f"MTBF: {mtbf:.3f} seconds")
+
     if detect.have_msr:
         finishsmi = detect.getsmicounts()
         total_smis = 0
-- 
2.53.0



* Re: [PATCH v4] hwlatdetect: Add MTBF calculation
  2026-02-25 13:32 [PATCH v4] hwlatdetect: Add MTBF calculation Costa Shulyupin
@ 2026-02-26 20:37 ` John Kacur
From: John Kacur @ 2026-02-26 20:37 UTC
  To: Costa Shulyupin; +Cc: linux-rt-users, John Kacur, Luis Claudio R. Goncalves



Signed-off-by: John Kacur <jkacur@redhat.com>
