From: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
To: Costa Shulyupin <costa.shul@redhat.com>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>,
John Kacur <jkacur@redhat.com>,
Bart Wensley <bwensley@redhat.com>,
Clark Williams <williams@redhat.com>
Subject: Re: [PATCH v2] rt-tests: hwlatdetect: Add MTBF calculation
Date: Tue, 6 Jan 2026 10:28:27 -0300
Message-ID: <aV0N-zTclZAwy1Fv@redhat.com>
In-Reply-To: <20260106091023.1375803-1-costa.shul@redhat.com>
On Tue, Jan 06, 2026 at 11:10:23AM +0200, Costa Shulyupin wrote:
> Hwlatdetect reports the number of latency spikes but provides no
> information about their frequency distribution over time. This makes it
> difficult to compare results - a test with 10 spikes over 1 hour is very
> different from 10 spikes over 24 hours, but both show 'spikes = 10'.
>
> Add Mean Time Between Failures (MTBF) calculation to quantify spike
> frequency.
>
> By definition MTBF = total operating time / number of failures.
>
> When the failure interval is large relative to test duration, this
> formula is biased. For example, imagine stable periodic failures. The
> total operating time will include time before the first failure and
> after the last failure. These intervals are determined by when the test
> starts and stops, not by the system’s failure behavior, which adds
> measurement bias. The resulting MTBF will vary between runs even for
> stable periodic failures.
>
> To reduce this bias, calculate MTBF using only the time between the
> first and last failure, divided by the number of intervals (failures
> minus one):
>
> MTBF = (timestamp of last failure - timestamp of first failure)
> / (number of failures - 1)
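
The bias described above can be sketched with strictly periodic failures and two different test start offsets. This is only an illustration: the helper names, the 660 s period, and the 1800 s duration are assumptions, not part of the patch.

```python
def failure_times(period, offset, duration):
    """Timestamps (seconds) of strictly periodic failures inside [0, duration)."""
    return list(range(offset, duration, period))


def mtbf_total(times, duration):
    """Textbook form: total operating time / number of failures."""
    return duration / len(times)


def mtbf_interval(times):
    """Proposed form: (last - first) / (number of failures - 1)."""
    return (times[-1] - times[0]) / (len(times) - 1)


# Same system (one failure every 660 s), same 1800 s test,
# only the phase of the test start differs.
for offset in (100, 500):
    times = failure_times(660, offset, 1800)
    print(offset, mtbf_total(times, 1800), mtbf_interval(times))
```

With these made-up numbers the total-time form changes with the offset, while the interval form stays at the true period.
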
Hi Costa!
I do like the idea of using the MTBF to better understand the latency
spike patterns. But I have the impression that using the Mean (way better
than average to avoid bias) could still be misleading. Let me give you a few
examples of the scenarios I have doubts about:
Say that you see 10 failures during a 6h test, the first 9 of them happening
within the first hour and the last one happening 5h into the test. That
would greatly skew the MTBF result and wrongly characterize the most common
and periodic failures in that system. (Please keep in mind that I am
familiar with the concept of MTBF; my question is whether the MTBF alone
is enough information.)
Another example would be a 30-minute test where you have a latency spike
every 11 minutes (inspired by NTP + efi-rtc for illustration purposes) and
in the last minute you get a spurious latency spike that happens once in a
lifetime. Is the MTBF a good representation of the failure (latency) profile
of that system?
If what I described above makes sense, showing both the mean and
median of the interval between latency spikes would be helpful.
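
A minimal sketch of the mean-versus-median comparison for the two scenarios above. The timestamps are hypothetical values (in minutes) chosen only to match the descriptions, and `interval_stats` is an illustrative helper, not something from hwlatdetect.

```python
from statistics import mean, median


def interval_stats(timestamps):
    """Return (mean, median) of the gaps between consecutive spikes,
    or None when there are fewer than two spikes."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps:
        return None
    return mean(gaps), median(gaps)


# Scenario 1 (assumed data): 9 spikes in the first hour, one more at 5h.
scenario1 = [5, 10, 15, 20, 25, 30, 35, 40, 45, 300]

# Scenario 2 (assumed data): a spike every 11 minutes in a 30-minute
# test, plus one spurious spike in the last minute.
scenario2 = [4, 15, 26, 29]

for name, data in (("scenario1", scenario1), ("scenario2", scenario2)):
    avg, med = interval_stats(data)
    print(f"{name}: mean={avg:.2f} min, median={med:.2f} min")
```

In both made-up data sets the median stays at the dominant period (5 and 11 minutes), while the mean is pulled away by the single outlier interval. With only a handful of spikes the median is also fragile, so this is a hint rather than a guarantee.
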
Again, having the MTBF as you proposed in the code is already helpful and a
new data point. I was just wondering if adding a tad bit of extra data
would help.
For future enhancements, maybe even classifying the periodic latency spikes
by duration, or adding a histogram, could be interesting. E.g., the 200~300us
latency spikes happen every ~15 minutes, the latency spikes between 2~3ms
happen every 11 minutes, ...
But that only makes sense if identifying/classifying the latency patterns is
as helpful as I think it is for debugging.
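
A rough sketch of that classification idea, assuming each sample is reduced to a hypothetical (timestamp in seconds, latency in microseconds) pair. The bucket bounds, the data, and the function name are made up for illustration and are not taken from hwlatdetect.

```python
from collections import defaultdict
from statistics import median


def median_gap_per_bucket(samples, buckets):
    """Group samples into latency buckets [low, high) and return the
    median gap between consecutive spikes for each bucket that has
    at least two samples."""
    grouped = defaultdict(list)
    for ts, latency in samples:
        for low, high in buckets:
            if low <= latency < high:
                grouped[(low, high)].append(ts)
                break
    result = {}
    for bucket, stamps in grouped.items():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        if gaps:
            result[bucket] = median(gaps)
    return result


# Assumed data: ~250us spikes every 900 s and ~2.5ms spikes every 660 s.
samples = [
    (0, 250), (900, 260), (1800, 240), (2700, 255),
    (100, 2500), (760, 2400), (1420, 2600), (2080, 2550),
]
buckets = [(200, 300), (2000, 3000)]  # microseconds, hypothetical bounds

for (low, high), gap in median_gap_per_bucket(samples, buckets).items():
    print(f"{low}~{high}us spikes: median gap {gap} s")
```
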
Best regards,
Luis
> In hwlatdetect, the failures are called samples.
>
> This metric enables meaningful comparison of real-time performance
> consistency across different test runs, hardware configurations, and
> kernel versions. It can be considered a KPI for real-time stability,
> relevant for certification and SLA evaluation.
>
> ---
>
> Changed in v2:
> - Use another more stable calculation of MTBF
>
> Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
> ---
> src/hwlatdetect/hwlatdetect.py | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/src/hwlatdetect/hwlatdetect.py b/src/hwlatdetect/hwlatdetect.py
> index 68f312db639f..6d9db9aec933 100755
> --- a/src/hwlatdetect/hwlatdetect.py
> +++ b/src/hwlatdetect/hwlatdetect.py
> @@ -19,6 +19,7 @@
> debugging = False
> quiet = False
> watch = False
> +first = last = 0
>
>
> def debug(dstr):
> @@ -306,6 +307,10 @@ def detect(self):
> pollcnt += 1
> val = self.get_sample()
> while val:
> + global first, last
> + if not first:
> + first = val.timestamp
> + last = val.timestamp
> self.samples.append(val)
> if watch:
> val.display()
> @@ -527,6 +532,9 @@ def cleanup(self):
> exceeding = detect.get("count")
> info(f"Samples exceeding threshold: {exceeding}")
>
> + if exceeding > 1:
> + info(f"MTBF: {(float(last)-float(first))/ (exceeding - 1):.3f} seconds")
> +
> if detect.have_msr:
> finishsmi = detect.getsmicounts()
> total_smis = 0
> --
> 2.52.0
>
>
---end quoted text---
Thread overview: 4+ messages
2026-01-06 9:10 [PATCH v2] rt-tests: hwlatdetect: Add MTBF calculation Costa Shulyupin
2026-01-06 13:28 ` Luis Claudio R. Goncalves [this message]
2026-01-06 17:21 ` Costa Shulyupin
2026-02-19 20:59 ` John Kacur