From: Crystal Wood <crwood@redhat.com>
To: Tomas Glozar <tglozar@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
John Kacur <jkacur@redhat.com>,
Luis Goncalves <lgoncalv@redhat.com>,
Costa Shulyupin <costa.shul@redhat.com>,
Wander Lairson Costa <wander@redhat.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-trace-kernel <linux-trace-kernel@vger.kernel.org>
Subject: Re: [PATCH] tracing/osnoise: Add option to align tlat threads
Date: Mon, 02 Mar 2026 21:21:37 -0600 [thread overview]
Message-ID: <238e4851e45a505d7974443c4fe703e61a0fbf55.camel@redhat.com> (raw)
In-Reply-To: <CAP4=nvQcQQCeyi00WchwhrecOQCxoN3gHiuZD63KZb-30JzTYw@mail.gmail.com>
On Mon, 2026-03-02 at 09:48 +0100, Tomas Glozar wrote:
> so 28. 2. 2026 v 0:50 odesílatel Crystal Wood <crwood@redhat.com> napsal:
> > > Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> > > corresponding setting osnoise/timerlat_align_us.
> > >
> > > This option sets the alignment of wakeup times between different
> > > timerlat threads, similarly to cyclictest's -A/--aligned option. If
> > > TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> > > records its first wake-up time. Each following thread sets its first
> > > wake-up time to a fixed offset from the recorded time, and incremenets
> > > it by the same offset.
> >
> > Why not just set the initial timer expiration to be
> > "period + cpu * align_us"? Then you wouldn't need any interaction
> > between CPUs.
>
> "period + cpu * align_us" wouldn't quite do it, for two reasons:
>
> 1. The wake-up timers are set to absolute time, and are incremented by
> "period" (once or multiple times, if the timer is significantly
> delayed) each cycle. What can be done as an alternative to what v1
> does is this: record the current time when starting the timerlat
> tracer (I need to reset align_next to zero anyway even with the v1
> design, that is a bug in the patch), and increment from that.
I was only talking about doing this for the initial expiration, not on
increment.
> 2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
> run on CPUs 0 and 100 with alignment 10, suddenly, the space between
> the threads becomes 1000us, which is equivalent to 0us. I would need
> to go through the cpuset and assign numbers from 0 to n to each CPU.
> That would guarantee a fixed spacing of the threads independent of
> when the threads wake up in the first cycle (unlike the v1 design),
> but it would make the implementation more complex, since I would have
> to store the numbers.
Right, I was thinking of just a few CPUs missing not being a big deal,
but on big systems with only a few CPUs running the test it does
matter.
Instead of assigning numbers, could you just loop over each CPU's
tlat->abs_period and set the initial expiration with appropriate
offset (prior to starting any of the threads)? Then the thread would
not need to care about anything other than the usual increment.
> If I implemented both of those ideas, the interaction between the CPUs
> can indeed be gotten rid of. I'm not sure if it is a better solution,
> though. Another motivation of recording the first thread wake-up was
> that when using user threads, the first thread might be created some
> time after the tracer is enabled, and I did not want to have a large
> gap that would have to be corrected by the while loop at the end of
> wait_next_period().
What is the actual concerning impact here?
If we want to be really paranoid we could check for pending signals
during the loop, in case userspace delayed so long (with a very short
period) that the user has a hard time with ctrl+c and such... but that
could happen already if userspace does something silly (e.g. stopped
by a debugger?) between loops.
> > > kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
> > > 1 file changed, 33 insertions(+), 1 deletion(-)
> >
> > Documentation needs to be updated as well.
> >
> > Should mention that updating align_us while the timer is running won't
> > take effect immediately (unlike period, which does).
> >
>
> Good idea, thanks! In general, I'm not expecting the user to change
> timerlat parameters during a measurement - but it is supported, and
> should be documented.
Maybe better to phrase it as "not guaranteed to take effect
immediately" :-)
> > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > > index dee610e465b9..df1d4529d226 100644
> > > --- a/kernel/trace/trace_osnoise.c
> > > +++ b/kernel/trace/trace_osnoise.c
> > > @@ -58,6 +58,7 @@ enum osnoise_options_index {
> > > OSN_PANIC_ON_STOP,
> > > OSN_PREEMPT_DISABLE,
> > > OSN_IRQ_DISABLE,
> > > + OSN_TIMERLAT_ALIGN,
> > > OSN_MAX
> > > };
> > >
> > > @@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
> > > "OSNOISE_WORKLOAD",
> > > "PANIC_ON_STOP",
> > > "OSNOISE_PREEMPT_DISABLE",
> > > - "OSNOISE_IRQ_DISABLE" };
> > > + "OSNOISE_IRQ_DISABLE",
> > > + "TIMERLAT_ALIGN" };
> >
> > Do we really need a flag for this, or can we just interpret a non-zero
> > align_us value as enabling the feature?
> >
>
> Yes, we need a flag for this, because a zero alignment is a common use case.
>
> I used it in cyclictest to measure the overhead of a large number of
> threads waking up at the same time. Similarly, a non-zero alignment
> will get rid of most of that overhead. Without alignment set, the
> thread wake-ups offsets are semi-random, depending on how the threads
> wake up, which might lead to inconsistent results where one run has
> good numbers and another run bad numbers, since the alignment is
> determined in the first cycle.
OK, I was viewing the staggering as the main point, but I see how the
alignment itself helps too.
Is there a use case for not always doing the alignment?
Other than people asking why their numbers suddenly got worse...
> > I'm already unclear about the existing purpose of next_abs_period, but
> > if it has any use at all shouldn't it be to avoid writing intermediate
> > values like this back to tlat?
> >
>
> next_abs_period is basically just the ktime_t variant of
> tlat->abs_period for local calculations of the next period inside
> wait_next_period(). Its only purpose is the ktime_compare() call that
> increments tlat->abs_period by the period until it lands into the
> future, if it happens to be in the past. This is necessary to do for
> both a regular cycle (which might take long due to noise) and the
> first cycle with alignment (because the other thread's first wake up
> might be late), so it has to be set in the new code as well,
> otherwise, the while loop won't see the time is in the past.
Oh, I missed the unit difference (though it's basically just a typedef
at this point).
> I agree that this part of the code is confusing. There is also a
> field, timerlat_variables.rel_period (tlat->rel_period), that is not
> used anywhere, since the relative period is pulled out of
> osnoise_variables. Something like this would be easier to read and
> comprehend, IMHO:
Yeah, I noticed that as well... we should remove it if we're not going
to use it.
> /*
> * wait_next_period - Wait for the next period for timerlat
> */
> static int wait_next_period(struct timerlat_variables *tlat)
> {
> ktime_t now;
> u64 rel_period = osnoise_data.timerlat_period * 1000;
>
> now = hrtimer_cb_get_time(&tlat->timer);
>
> /*
> * Set the next abs_period.
> */
> tlat->abs_period += rel_period;
>
> /*
> * If the new abs_period is in the past, skip the activation.
> */
> while (ktime_compare(now, ns_to_ktime(tlat->abs_period) > 0) {
> next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> }
>
> set_current_state(TASK_INTERRUPTIBLE);
>
> hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
> schedule();
> return 1;
> }
>
> (Excluding the changes from this patch.) What do you think?
Why can't we just make tlat->abs_period and every other time variable
in this file be ktime_t? Other than atomic stuff if we do go that
route.
Not saying that that should hold up this patch, just an idea to simplify things.
-Crystal
next prev parent reply other threads:[~2026-03-03 3:21 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-27 15:04 [PATCH] tracing/osnoise: Add option to align tlat threads Tomas Glozar
2026-02-27 15:52 ` Steven Rostedt
2026-03-02 7:09 ` Tomas Glozar
2026-02-27 23:49 ` Crystal Wood
2026-03-02 8:48 ` Tomas Glozar
2026-03-03 3:21 ` Crystal Wood [this message]
2026-03-06 15:15 ` Tomas Glozar
2026-03-06 22:45 ` Crystal Wood
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=238e4851e45a505d7974443c4fe703e61a0fbf55.camel@redhat.com \
--to=crwood@redhat.com \
--cc=costa.shul@redhat.com \
--cc=jkacur@redhat.com \
--cc=lgoncalv@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mhiramat@kernel.org \
--cc=rostedt@goodmis.org \
--cc=tglozar@redhat.com \
--cc=wander@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox