public inbox for linux-rt-users@vger.kernel.org
From: John Kacur <jkacur@gmail.com>
To: Costa Shulyupin <costa.shul@redhat.com>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>,
	 John Kacur <jkacur@redhat.com>,
	 "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Subject: Re: [PATCH v4] hwlatdetect: Add MTBF calculation
Date: Thu, 26 Feb 2026 15:37:28 -0500 (EST)
Message-ID: <29ea4c85-1ce6-ada8-3cdf-dc6ca79f432c@gmail.com>
In-Reply-To: <20260225133232.756608-1-costa.shul@redhat.com>



On Wed, 25 Feb 2026, Costa Shulyupin wrote:

> Hwlatdetect reports the number of latency spikes but provides no
> information about their frequency distribution over time. This makes it
> difficult to compare results - a test with 10 spikes over 1 hour is very
> different from 10 spikes over 24 hours, but both show 'spikes = 10'.
> 
> Add Mean Time Between Failures (MTBF) calculation to quantify spike
> frequency.
> 
> By definition MTBF = total operating time / number of failures.
> 
> When the failure interval is large relative to test duration, this
> formula is biased.  For example, imagine stable periodic failures.  The
> total operating time will include time before the first failure and
> after the last failure.  These intervals are determined by when the test
> starts and stops, not by the system's failure behavior, which adds
> measurement bias.  The resulting MTBF will vary between runs even for
> stable periodic failures.
> 
> To reduce this bias, calculate MTBF using only the time between the
> first and last failure, divided by the number of intervals (failures
> minus one).
> 
> Additionally, hwlatdetect only samples during the 'width' period within
> each 'window' cycle. The non-sampling periods contribute to the error
> margin. The MTBF is therefore adjusted by multiplying it by the ratio
> of window to width:
> 
> MTBF = (timestamp of last failure - timestamp of first failure) * window
> / ((number of failures - 1) * width)
> 
> In hwlatdetect, the failures are called samples. The failure count is
> the sum of counters from all samples.
> 
> This metric enables meaningful comparison of hardware latency across
> different test runs, hardware configurations, and kernel versions.  It
> can be considered a KPI for real-time stability, relevant for
> certification and SLA evaluation.
> 
> Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
> ---
> 
> v4: Adjust MTBF by multiplying by the ratio of window to width
> v3:
> - Fix formatting
> - Make first and last instance variables
> v2:
> - Use another more stable calculation of MTBF
> ---
>  src/hwlatdetect/hwlatdetect.py | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/src/hwlatdetect/hwlatdetect.py b/src/hwlatdetect/hwlatdetect.py
> index 38671724f3e3..abfbb954fe75 100755
> --- a/src/hwlatdetect/hwlatdetect.py
> +++ b/src/hwlatdetect/hwlatdetect.py
> @@ -306,6 +306,8 @@     def __init__(self):
>              raise DetectorNotAvailable("hwlat", "hwlat tracer not available")
>          self.type = "tracer"
>          self.samples = []
> +        self.first = None
> +        self.last = None
>          self.set("enable", 0)
>          self.set('current_tracer', 'hwlat')
>  
> @@ -338,6 +340,8 @@     def detect(self):
>                  pollcnt += 1
>                  val = self.get_sample()
>                  while val:
> +                    self.first = self.first or val.timestamp
> +                    self.last = val.timestamp
>                      self.samples.append(val)
>                      if watch:
>                          val.display()
> @@ -559,6 +563,11 @@     def cleanup(self):
>      exceeding = detect.get("count")
>      info(f"Samples exceeding threshold: {exceeding}")
>  
> +    if exceeding > 1:
> +        mtbf = ((float(detect.last) - float(detect.first)) * int(detect.get('window'))
> +                / ((exceeding - 1) * int(detect.get('width'))))
> +        info(f"MTBF: {mtbf:.3f} seconds")
> +
>      if detect.have_msr:
>          finishsmi = detect.getsmicounts()
>          total_smis = 0
> -- 
> 2.53.0
> 
> 
> 
Signed-off-by: John Kacur <jkacur@redhat.com>
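
For readers following the thread, here is a minimal standalone sketch of
the calculation described in the quoted commit message. It is not the code
from the patch: the function name mtbf_seconds and its parameters are
hypothetical, and it assumes timestamps are given in seconds in arrival
order, with window and width in the same unit as each other.

```python
def mtbf_seconds(timestamps, counts, window, width):
    """Estimate MTBF as described in the quoted commit message.

    timestamps: sample timestamps in seconds, in arrival order (assumed)
    counts:     per-sample failure counters (assumed one per timestamp)
    window:     hwlat window length (same unit as width)
    width:      hwlat sampling width (same unit as window)

    Returns None when fewer than two failures were seen, because
    there is no interval to average over.
    """
    failures = sum(counts)
    if failures < 2 or not timestamps:
        return None
    # Use only the span between the first and last failure, so the idle
    # time before the first and after the last spike is excluded.
    span = float(timestamps[-1]) - float(timestamps[0])
    # Divide by the number of intervals (failures - 1), then scale by
    # window / width as the commit message specifies.
    return span * window / ((failures - 1) * width)


if __name__ == "__main__":
    # Ten single-count samples, 400 s apart, window twice the width.
    stamps = [float(i * 400) for i in range(10)]
    print(mtbf_seconds(stamps, [1] * 10, 1000000, 500000))
```

With these illustrative inputs the span is 3600 s across 9 intervals and
the window/width ratio is 2, so the function returns 800.0.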
