[PATCH v2] rt-tests: hwlatdetect: Add MTBF calculation

* [PATCH v2] rt-tests: hwlatdetect: Add MTBF calculation
@ 2026-01-06  9:10 Costa Shulyupin
  2026-01-06 13:28 ` Luis Claudio R. Goncalves
  2026-02-19 20:59 ` John Kacur
  0 siblings, 2 replies; 4+ messages in thread
From: Costa Shulyupin @ 2026-01-06  9:10 UTC (permalink / raw)
  To: linux-rt-users, John Kacur; +Cc: Bart Wensley, Clark Williams, Costa Shulyupin

Hwlatdetect reports the number of latency spikes but provides no
information about their frequency distribution over time. This makes it
difficult to compare results - a test with 10 spikes over 1 hour is very
different from 10 spikes over 24 hours, but both show 'spikes = 10'.

Add Mean Time Between Failures (MTBF) calculation to quantify spike
frequency.

By definition MTBF = total operating time / number of failures.

When the failure interval is large relative to test duration, this
formula is biased.  For example, imagine stable periodic failures.  The
total operating time will include time before the first failure and
after the last failure.  These intervals are determined by when the test
starts and stops, not by the system’s failure behavior, which adds
measurement bias.  The resulting MTBF will vary between runs even for
stable periodic failures.

To reduce this bias, calculate MTBF using only the time between the
first and last failure, divided by the number of intervals (failures
minus one):

MTBF = (timestamp of last failure - timestamp of first failure)
       / (number of failures - 1)
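
As a rough illustration of the bias described above, here is a minimal
sketch (not part of the patch) that compares both estimates on a strictly
periodic failure train. The function names, the 600-second period and the
2000-second test window are assumptions chosen only for illustration.

```python
def mtbf_total_time(duration, timestamps):
    """Textbook estimate: total operating time / number of failures."""
    return duration / len(timestamps)


def mtbf_intervals(timestamps):
    """Estimate used in this patch: (last - first) / (failures - 1)."""
    return (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)


def periodic_failures(offset, period, duration):
    """Failures every `period` seconds, the first one at `offset`."""
    return list(range(offset, duration, period))


# Hypothetical numbers: one failure every 600 s, observed for 2000 s.
# Only the phase of the first failure relative to test start differs.
DURATION, PERIOD = 2000, 600
for offset in (0, 500):
    ts = periodic_failures(offset, PERIOD, DURATION)
    print(offset, mtbf_total_time(DURATION, ts), mtbf_intervals(ts))
```

With these made-up inputs the total-time estimate depends on where the
test window happens to start, while the interval-based estimate stays at
the true period of 600 seconds in both runs.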

In hwlatdetect, the failures are called samples.

This metric enables meaningful comparison of real-time performance
consistency across different test runs, hardware configurations, and
kernel versions.  It can be considered a KPI for real-time stability,
relevant for certification and SLA evaluation.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---

Changed in v2:
- Use a more stable MTBF calculation based on the time between the
  first and last sample

 src/hwlatdetect/hwlatdetect.py | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/hwlatdetect/hwlatdetect.py b/src/hwlatdetect/hwlatdetect.py
index 68f312db639f..6d9db9aec933 100755
--- a/src/hwlatdetect/hwlatdetect.py
+++ b/src/hwlatdetect/hwlatdetect.py
@@ -19,6 +19,7 @@
 debugging = False
 quiet = False
 watch = False
+first = last = 0

 def debug(dstr):
@@ -306,6 +307,10 @@     def detect(self):
                 pollcnt += 1
                 val = self.get_sample()
                 while val:
+                    global first, last
+                    if not first:
+                        first = val.timestamp
+                    last = val.timestamp
                     self.samples.append(val)
                     if watch:
                         val.display()
@@ -527,6 +532,9 @@     def cleanup(self):
     exceeding = detect.get("count")
     info(f"Samples exceeding threshold: {exceeding}")

+    if exceeding > 1:
+        info(f"MTBF: {(float(last)-float(first))/ (exceeding - 1):.3f} seconds")
+
     if detect.have_msr:
         finishsmi = detect.getsmicounts()
         total_smis = 0
-- 
2.52.0

* Re: [PATCH v2] rt-tests: hwlatdetect: Add MTBF calculation
From: Luis Claudio R. Goncalves @ 2026-01-06 13:28 UTC (permalink / raw)
  To: Costa Shulyupin; +Cc: linux-rt-users, John Kacur, Bart Wensley, Clark Williams

On Tue, Jan 06, 2026 at 11:10:23AM +0200, Costa Shulyupin wrote:
> Hwlatdetect reports the number of latency spikes but provides no
> information about their frequency distribution over time. This makes it
> difficult to compare results - a test with 10 spikes over 1 hour is very
> different from 10 spikes over 24 hours, but both show 'spikes = 10'.
> 
> Add Mean Time Between Failures (MTBF) calculation to quantify spike
> frequency.
> 
> By definition MTBF = total operating time / number of failures.
> 
> When the failure interval is large relative to test duration, this
> formula is biased.  For example, imagine stable periodic failures.  The
> total operating time will include time before the first failure and
> after the last failure.  These intervals are determined by when the test
> starts and stops, not by the system’s failure behavior, which adds
> measurement bias.  The resulting MTBF will vary between runs even for
> stable periodic failures.
> 
> To reduce this bias, calculate MTBF using only the time between the
> first and last failure, divided by the number of intervals (failures
> minus one):
> 
> MTBF = (timestamp of last failure - timestamp of first failure)
> / (number of failures - 1)

Hi Costa!

I do like the idea of using the MTBF to better understand the latency
spike patterns. But I have the impression that using the Mean (way better than
average to avoid bias) could still be misleading. Let me give you a few
examples of the scenarios I have in mind:

Say that you see 10 failures during a 6h test, the first 9 of them happening
within the first hour and the last one happening 5h into the test. That
would greatly skew the MTBF result and wrongly characterize the most common
and periodic failures in that system. (Please keep in mind that I am
familiar with the concept of MTBF - My question is whether the MTBF alone
is enough information)

Another example would be a 30-minute test where you have a latency spike
every 11 minutes (inspired by NTP + efi-rtc for illustration purposes) and
at the last minute you get a spurious latency spike that happens once in a
lifetime. Is the MTBF a good representation of the failure (latency) profile
of that system?

If what I described above makes sense, showing both the mean and
median of the interval between latency spikes would be helpful.
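
To make the first scenario concrete, here is a minimal sketch using
Python's statistics module. The timestamps are made up for illustration
(nine spikes 6 minutes apart, then one spike 5 h into the test); they are
not measured data, and the helper name intervals() is hypothetical.

```python
from statistics import mean, median


def intervals(timestamps):
    """Time deltas between consecutive spikes (timestamps in seconds)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


# Hypothetical timestamps: 9 spikes, 360 s apart, within the first hour,
# then a single late spike 5 h (18000 s) into the test.
spikes = [360 * i for i in range(1, 10)] + [18000]

deltas = intervals(spikes)
# The mean of the intervals equals (last - first) / (count - 1).
print(f"mean interval:   {mean(deltas):.1f} s")
print(f"median interval: {median(deltas):.1f} s")
```

For this made-up input the mean interval is 1960 s (about 33 minutes),
while the median is 360 s, which is the interval that actually dominates
the run.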

Again, having the MTBF as you proposed in the code is already helpful and a
new data point. I was just wondering if adding a tad bit of extra data
would help.

For future enhancements, maybe even classifying the periodic latency spikes
per duration or a histogram could be interesting. E.g.: the 200~300us latency
spikes happen every ~15 minutes, the latency spikes between 2~3ms happen
every 11 minutes, ...
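
A minimal sketch of what such a classification could look like, assuming
each sample is reduced to a (timestamp in seconds, latency in us) pair.
The function name, the bucket edges and the sample values are all
hypothetical and only meant to illustrate the idea, not to propose an
implementation for hwlatdetect.

```python
from collections import defaultdict
from statistics import median


def intervals_by_bucket(samples, edges):
    """Group (timestamp_s, latency_us) samples by latency bucket and
    return the median interval between consecutive spikes per bucket.

    `edges` is a sorted list of bucket upper bounds in microseconds;
    anything above the last edge lands in a final open-ended bucket.
    """
    buckets = defaultdict(list)
    for ts, latency in samples:
        idx = next((i for i, e in enumerate(edges) if latency <= e), len(edges))
        buckets[idx].append(ts)

    result = {}
    for idx, stamps in sorted(buckets.items()):
        stamps.sort()
        deltas = [b - a for a, b in zip(stamps, stamps[1:])]
        # A bucket with a single spike has no interval to report.
        result[idx] = median(deltas) if deltas else None
    return result


# Made-up samples: ~250 us spikes every 900 s, ~2500 us spikes every 660 s.
samples = [(900 * i, 250) for i in range(1, 5)]
samples += [(660 * i, 2500) for i in range(1, 6)]

# Buckets: <= 300 us, <= 1000 us, > 1000 us
print(intervals_by_bucket(samples, edges=[300, 1000]))
```

Buckets that receive no samples are simply absent from the result, so
the output only lists latency ranges that were actually observed.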

But that only makes sense if identifying/classifying the latency patterns is
as helpful as I think it is for debugging.

Best regards,
Luis


* Re: [PATCH v2] rt-tests: hwlatdetect: Add MTBF calculation
From: Costa Shulyupin @ 2026-01-06 17:21 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves
  Cc: linux-rt-users, John Kacur, Bart Wensley, Clark Williams

On Tue, 6 Jan 2026 at 15:28, Luis Claudio R. Goncalves
<lgoncalv@redhat.com> wrote:
> But that only makes sense if identifying/classifying the latency patterns is
> as helpful as I think it is for debugging.

But I don’t think this is very useful, since we can barely fix it.
I once observed strange patterns on a preproduction machine.
In theory, it is even possible to process the failures with an FFT.

The next patch I plan to submit will print the time delta between samples.
This patch depends on a previously submitted one, which has not yet
been accepted.

Thanks.
Costa


* Re: [PATCH v2] rt-tests: hwlatdetect: Add MTBF calculation
From: John Kacur @ 2026-02-19 20:59 UTC (permalink / raw)
  To: Costa Shulyupin; +Cc: linux-rt-users, John Kacur, Bart Wensley, Clark Williams


On Tue, 6 Jan 2026, Costa Shulyupin wrote:

> Hwlatdetect reports the number of latency spikes but provides no
> information about their frequency distribution over time. This makes it
> difficult to compare results - a test with 10 spikes over 1 hour is very
> different from 10 spikes over 24 hours, but both show 'spikes = 10'.
> 
> Add Mean Time Between Failures (MTBF) calculation to quantify spike
> frequency.
> 
> By definition MTBF = total operating time / number of failures.
> 
> When the failure interval is large relative to test duration, this
> formula is biased.  For example, imagine stable periodic failures.  The
> total operating time will include time before the first failure and
> after the last failure.  These intervals are determined by when the test
> starts and stops, not by the system’s failure behavior, which adds
> measurement bias.  The resulting MTBF will vary between runs even for
> stable periodic failures.
> 
> To reduce this bias, calculate MTBF using only the time between the
> first and last failure, divided by the number of intervals (failures
> minus one):
> 
> MTBF = (timestamp of last failure - timestamp of first failure)
> / (number of failures - 1)
> 
> In hwlatdetect, the failures are called samples.
> 
> This metric enables meaningful comparison of real-time performance
> consistency across different test runs, hardware configurations, and
> kernel versions.  It can be considered a KPI for real-time stability,
> relevant for certification and SLA evaluation.
> 
> ---
> 
> Changed in v2:
> - Use another more stable calculation of MTBF
> 
> Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
> ---
>  src/hwlatdetect/hwlatdetect.py | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/src/hwlatdetect/hwlatdetect.py b/src/hwlatdetect/hwlatdetect.py
> index 68f312db639f..6d9db9aec933 100755
> --- a/src/hwlatdetect/hwlatdetect.py
> +++ b/src/hwlatdetect/hwlatdetect.py
> @@ -19,6 +19,7 @@
>  debugging = False
>  quiet = False
>  watch = False
> +first = last = 0
>  
>  
>  def debug(dstr):
> @@ -306,6 +307,10 @@     def detect(self):
>                  pollcnt += 1
>                  val = self.get_sample()
>                  while val:
> +                    global first, last
> +                    if not first:
> +                        first = val.timestamp
> +                    last = val.timestamp
>                      self.samples.append(val)
>                      if watch:
>                          val.display()
> @@ -527,6 +532,9 @@     def cleanup(self):
>      exceeding = detect.get("count")
>      info(f"Samples exceeding threshold: {exceeding}")
>  
> +    if exceeding > 1:
> +        info(f"MTBF: {(float(last)-float(first))/ (exceeding - 1):.3f} seconds")

exceeding = detect.get("count"), but we now have a count that isn't the
same as the number of samples. You might have to rework this for newer
kernels.

> +
>      if detect.have_msr:
>          finishsmi = detect.getsmicounts()
>          total_smis = 0
> -- 
> 2.52.0
> 
> 
> 
