From: Gabriele Monaco <gmonaco@redhat.com>
To: Nam Cao <namcao@linutronix.de>
Cc: linux-kernel@vger.kernel.org,
	Steven Rostedt <rostedt@goodmis.org>,
	 linux-trace-kernel@vger.kernel.org,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Tomas Glozar <tglozar@redhat.com>, Juri Lelli <jlelli@redhat.com>
Subject: Re: [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions
Date: Mon, 19 May 2025 13:13:01 +0200
Message-ID: <cdcab46480cbfe2c3320e060ea9603adbd04956c.camel@redhat.com>
In-Reply-To: <20250519103809.nuEUQbVl@linutronix.de>



On Mon, 2025-05-19 at 12:38 +0200, Nam Cao wrote:
> On Mon, May 19, 2025 at 12:28:12PM +0200, Gabriele Monaco wrote:
> > Mmh, although definitely unlikely, I'm thinking of a case in which
> > the event starts on one CPU and, at the same time, we see events in
> > IRQ and on another CPU, let's say continuously. Nothing forbids
> > that, between any two consecutive try_cmpxchg calls, another
> > CPU/context changes the next state (making the local try_cmpxchg
> > fail).
> > In practice I've never seen it go into the second iteration, as the
> > critical section is really tiny, but I'm not sure we can guarantee
> > this never happens.
> > Or am I missing something?
> 
> I have a feeling that you missed my point. I agree that the retrying
> is needed, because we may race with another CPU/context.
> 
> What I am proposing is that we drop MAX_DA_RETRY_RACING_EVENTS and
> just keep retrying until we succeed.
> 
> And that's safe to do, because the maximum number of retries is the
> number of tasks contending with us to set the monitor's state, so we
> know we won't be retrying for long.

I get this point; what I mean is: can we really guarantee that the
number of contending tasks (or contexts) is finite?
In other words, try_cmpxchg guarantees that one and only one actor
wins every time, but it cannot guarantee that all actors eventually
win: an actor /could/ be hanging there forever.
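
To make sure we're talking about the same loop, here's a rough sketch
of the bounded variant (illustrative only: model_get_next_state(), the
handler name and the value of the bound are made-up placeholders, not
the actual da monitor code):

	#include <linux/atomic.h>

	/* made-up bound and helper, just for the example */
	#define MAX_DA_RETRY_RACING_EVENTS	3
	static int model_get_next_state(int cur, int event);

	/* Hypothetical handler advancing the monitor state for 'event'. */
	static bool da_handle_event_bounded(atomic_t *state, int event)
	{
		int retries = MAX_DA_RETRY_RACING_EVENTS;
		int cur = atomic_read(state);
		int next;

		do {
			/* recompute from the freshest state we have seen */
			next = model_get_next_state(cur, event);
			if (next < 0)
				return false;	/* event invalid in 'cur' */
			/*
			 * On failure, try_cmpxchg() updates 'cur' with the
			 * value another CPU/context just installed, so the
			 * next iteration starts from the new state.
			 */
		} while (!atomic_try_cmpxchg(state, &cur, next) && --retries);

		return retries != 0;	/* false: gave up after N races */
	}

Your proposal, as I understand it, is the same loop without the
--retries part, which only terminates once this context finally wins
the cmpxchg.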

This handler runs for each event in the monitor, and tracepoint
handlers can be interrupted as well as run in interrupt context (where,
of course, they cannot be interrupted), so contenders can nest: e.g. a
handler in task context interrupted by a hardirq, with softirqs
possibly running on the way out, each of which may itself hit monitor
events on the same CPU. I don't think the number of actors is bounded
by the number of CPUs.
I agree this situation is extremely unlikely, but in an exotic scenario
where one CPU is sufficiently slower than the others (e.g. in a VM), I
believe the critical section can stay open long enough for this to
actually happen.

I'm not so much afraid of infinite loops as of RV introducing
unbounded latency that is very hard to track down and comes without
any reporting.
Chances are, since tracepoints and the actual traced events are not
atomic, that by the time the delayed context finally /wins/, the RV
event is no longer current, so we may see an error anyway.
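
Roughly the interleaving I have in mind (hypothetical timing):

	CPU0 (slow, task context)      CPU1 / IRQ contexts
	-------------------------      -------------------
	read state S
	                               event: S  -> S'  (cmpxchg wins)
	cmpxchg on S fails, re-read
	                               event: S' -> S'' (cmpxchg wins)
	cmpxchg on S' fails, re-read
	...                            (repeats while CPU1 keeps
	                                generating events)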

Does this make sense to you, or am I making it more complex than it
should be?

Thanks,
Gabriele


Thread overview: 34+ messages
     [not found] <20250514084314.57976-1-gmonaco@redhat.com>
2025-05-14  8:43 ` [RFC PATCH v2 01/12] tools/rv: Do not skip idle in trace Gabriele Monaco
2025-05-14  8:43 ` [RFC PATCH v2 02/12] tools/rv: Stop gracefully also on SIGTERM Gabriele Monaco
2025-05-14  8:43 ` [RFC PATCH v2 03/12] rv: Add da_handle_start_run_event_ to per-task monitors Gabriele Monaco
2025-05-19  8:02   ` Nam Cao
2025-05-14  8:43 ` [RFC PATCH v2 04/12] rv: Remove trailing whitespace from tracepoint string Gabriele Monaco
2025-05-19  8:05   ` Nam Cao
2025-05-14  8:43 ` [RFC PATCH v2 05/12] rv: Return init error when registering monitors Gabriele Monaco
2025-05-19  8:06   ` Nam Cao
2025-05-14  8:43 ` [RFC PATCH v2 06/12] sched: Adapt sched tracepoints for RV task model Gabriele Monaco
2025-05-19  8:29   ` Nam Cao
2025-05-19  8:41     ` Gabriele Monaco
2025-05-19  8:43       ` Nam Cao
2025-05-14  8:43 ` [RFC PATCH v2 07/12] rv: Adapt the sco monitor to the new set_state Gabriele Monaco
2025-05-19  8:42   ` Nam Cao
2025-05-19  9:04     ` Gabriele Monaco
2025-05-14  8:43 ` [RFC PATCH v2 08/12] rv: Extend and adapt snroc model Gabriele Monaco
2025-05-14  8:43 ` [RFC PATCH v2 09/12] rv: Replace tss monitor with more complete sts Gabriele Monaco
2025-06-24  7:36   ` Nam Cao
2025-06-24 14:44     ` Gabriele Monaco
2025-06-24 15:50       ` Nam Cao
2025-06-24 19:31         ` Steven Rostedt
2025-06-27 15:02           ` Nam Cao
2025-05-14  8:43 ` [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions Gabriele Monaco
2025-05-19  9:06   ` Nam Cao
2025-05-19 10:28     ` Gabriele Monaco
2025-05-19 10:38       ` Nam Cao
2025-05-19 11:13         ` Gabriele Monaco [this message]
2025-05-21  6:58           ` Nam Cao
2025-05-14  8:43 ` [RFC PATCH v2 11/12] rv: Add nrp and sssw per-task monitors Gabriele Monaco
2025-05-14  8:43 ` [RFC PATCH v2 12/12] rv: Add opid per-cpu monitor Gabriele Monaco
2025-05-27 13:37   ` Nam Cao
2025-05-27 14:35     ` Gabriele Monaco
2025-05-27 14:50       ` Nam Cao
2025-05-28 11:27         ` Gabriele Monaco
