[PATCH v4] hwlatdetect: Add MTBF calculation

From: Costa Shulyupin <costa.shul@redhat.com>
To: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: John Kacur <jkacur@redhat.com>,
	Costa Shulyupin <costa.shul@redhat.com>,
	"Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Subject: [PATCH v4] hwlatdetect: Add MTBF calculation
Date: Wed, 25 Feb 2026 15:32:32 +0200
Message-ID: <20260225133232.756608-1-costa.shul@redhat.com>

Hwlatdetect reports the number of latency spikes but provides no
information about how often they occur. This makes results difficult to
compare: a test with 10 spikes over 1 hour is very different from one
with 10 spikes over 24 hours, yet both show 'spikes = 10'.

Add Mean Time Between Failures (MTBF) calculation to quantify spike
frequency.

By definition, MTBF = total operating time / number of failures.

When the failure interval is large relative to the test duration, this
formula is biased.  For example, consider failures that recur at a
stable period.  The total operating time includes the time before the
first failure and after the last failure.  These intervals are
determined by when the test starts and stops, not by the system's
failure behavior, which adds measurement bias.  The resulting MTBF will
vary between runs even though the failure period is constant.

To reduce this bias, calculate MTBF using only the time between the
first and last failure, divided by the number of intervals (failures
minus one).
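
The bias can be illustrated with a small sketch.  It is not part of the
patch: the helper names and the numbers (a failure every 40 seconds, a
100 second run) are hypothetical and chosen only to show how the start
offset changes the naive result while the first-to-last estimate stays
at the true period.

```python
def naive_mtbf(duration, failures):
    """Total operating time divided by the number of failures."""
    return duration / len(failures)


def interval_mtbf(failures):
    """Time from first to last failure divided by the number of intervals."""
    return (failures[-1] - failures[0]) / (len(failures) - 1)


def periodic_failures(period, offset, duration):
    """Failure timestamps every `period` seconds, the first at `offset`."""
    return list(range(offset, duration, period))


# Hypothetical runs: same failure period, different start offsets.
run_a = periodic_failures(period=40, offset=5, duration=100)   # [5, 45, 85]
run_b = periodic_failures(period=40, offset=30, duration=100)  # [30, 70]

for run in (run_a, run_b):
    print(f"naive={naive_mtbf(100, run):.2f} interval={interval_mtbf(run):.2f}")
# prints:
# naive=33.33 interval=40.00
# naive=50.00 interval=40.00
```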

Additionally, hwlatdetect only samples during the 'width' period within
each 'window' cycle. The non-sampling periods contribute to the error
margin. The MTBF is therefore adjusted by multiplying it by the ratio
of window to width:

MTBF = (timestamp of last failure - timestamp of first failure) * window
/ ((number of failures - 1) * width)
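
As a sanity check, the formula above can be written as a standalone
function.  This is only a sketch: the function name, argument names and
example values are assumptions for illustration, and window and width
only need to share the same unit.

```python
def adjusted_mtbf(first, last, failures, window, width):
    """MTBF from the first-to-last failure span, scaled by window/width.

    Returns None when fewer than two failures were seen, since no
    interval between failures exists in that case.
    """
    if failures < 2:
        return None
    return (last - first) * window / ((failures - 1) * width)


# Hypothetical run: 11 failures spread over 100 seconds, with the
# detector sampling for half of each window.
example = adjusted_mtbf(first=10.0, last=110.0, failures=11,
                        window=1000000, width=500000)
print(f"MTBF: {example:.3f} seconds")  # prints "MTBF: 20.000 seconds"
```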

In hwlatdetect, the failures are called samples. The failure count is
the sum of counters from all samples.

This metric enables meaningful comparison of hardware latency across
different test runs, hardware configurations, and kernel versions.  It
can be considered a KPI for real-time stability, relevant for
certification and SLA evaluation.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---

v4:
- Adjust MTBF by multiplying by the ratio of window to width
v3:
- Fix formatting
- Make first and last instance variables
v2:
- Use a more stable calculation of MTBF
---
 src/hwlatdetect/hwlatdetect.py | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/hwlatdetect/hwlatdetect.py b/src/hwlatdetect/hwlatdetect.py
index 38671724f3e3..abfbb954fe75 100755
--- a/src/hwlatdetect/hwlatdetect.py
+++ b/src/hwlatdetect/hwlatdetect.py
@@ -306,6 +306,8 @@     def __init__(self):
             raise DetectorNotAvailable("hwlat", "hwlat tracer not available")
         self.type = "tracer"
         self.samples = []
+        self.first = None
+        self.last = None
         self.set("enable", 0)
         self.set('current_tracer', 'hwlat')

@@ -338,6 +340,8 @@     def detect(self):
                 pollcnt += 1
                 val = self.get_sample()
                 while val:
+                    self.first = self.first or val.timestamp
+                    self.last = val.timestamp
                     self.samples.append(val)
                     if watch:
                         val.display()
@@ -559,6 +563,11 @@     def cleanup(self):
     exceeding = detect.get("count")
     info(f"Samples exceeding threshold: {exceeding}")

+    if exceeding > 1:
+        mtbf = ((float(detect.last) - float(detect.first)) * int(detect.get('window'))
+                / ((exceeding - 1) * int(detect.get('width'))))
+        info(f"MTBF: {mtbf:.3f} seconds")
+
     if detect.have_msr:
         finishsmi = detect.getsmicounts()
         total_smis = 0
-- 
2.53.0

Thread overview: 2+ messages
2026-02-25 13:32 Costa Shulyupin [this message]
2026-02-26 20:37 ` [PATCH v4] hwlatdetect: Add MTBF calculation John Kacur
