From: Nam Cao <namcao@linutronix.de>
To: Gabriele Monaco <gmonaco@redhat.com>
Cc: linux-kernel@vger.kernel.org,
	Steven Rostedt <rostedt@goodmis.org>,
	linux-trace-kernel@vger.kernel.org,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Tomas Glozar <tglozar@redhat.com>, Juri Lelli <jlelli@redhat.com>
Subject: Re: [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions
Date: Wed, 21 May 2025 08:58:00 +0200
Message-ID: <20250521065800.jgtPyWd7@linutronix.de>
In-Reply-To: <cdcab46480cbfe2c3320e060ea9603adbd04956c.camel@redhat.com>

On Mon, May 19, 2025 at 01:13:01PM +0200, Gabriele Monaco wrote:
> 
> 
> On Mon, 2025-05-19 at 12:38 +0200, Nam Cao wrote:
> > On Mon, May 19, 2025 at 12:28:12PM +0200, Gabriele Monaco wrote:
> > > Mmh, although definitely unlikely, I'm thinking of a case in which the
> > > event starts on one CPU and at the same time we see events in IRQ and
> > > on another CPU, let's say continuously. Nothing forbids that between
> > > any two consecutive try_cmpxchg another CPU/context changes the next
> > > state (making the local try_cmpxchg fail).
> > > In practice I've never seen it going on the second iteration, as the
> > > critical section is really tiny, but I'm not sure we can guarantee this
> > > never happens.
> > > Or am I missing something?
> > 
> > I have a feeling that you missed my point. I agree that the retrying is
> > needed, because we may race with another.
> > 
> > What I am proposing is that we drop the MAX_DA_RETRY_RACING_EVENTS, and
> > just keep retrying until we succeed.
> > 
> > And that's safe to do, because the maximum number of retries is the
> > number of tasks contending with us to set the monitor's state. So we
> > know we won't be retrying for long.
> 
> I get this point, what I mean is: can we really guarantee the number of
> contending tasks (or contexts) is finite?
> In other words, the try_cmpxchg guarantees 1 and only 1 actor wins
> every time, but cannot guarantee all actors will eventually win, an
> actor /could/ be hanging there forever.
> 
> This handler is running for each event in the monitor and tracepoint
> handlers can be interrupted as well as run in interrupt context (where
> of course they cannot be interrupted). I don't think the number of
> actors is bounded by the number of CPUs.
> I see this situation is extremely unlikely, but in an exotic scenario
> where a CPU is sufficiently slower than others (e.g. in a VM) I believe
> we can see this critical section large enough for this to potentially
> happen.
> 
> I'm not quite afraid of infinite loops, but rather RV introducing
> unbounded latency very hard to track and without any reporting.
> Chances are, since tracepoints and actual traced events are not atomic,
> that by the time this delayed context /wins/ the RV event is no longer
> current, so we may see an error already.
> 
> Does it make sense to you or am I making it more complex than it should
> be?

Right, I can see that being a problem. But I don't know enough about it to
comment further, so do as you think best. Maybe someone else can help.

Best regards,
Nam
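
For readers following the thread, the bounded-retry idea under discussion
can be sketched in userspace C. This is only an illustration, not the code
from the patch: it assumes C11 atomic_compare_exchange_strong() as a
stand-in for the kernel's try_cmpxchg(), and the names next_state(),
handle_event_bounded(), racer_fn, racer_budget and limited_racer() are all
hypothetical. The racer hook exists only to make a competing CPU/context
reproducible in a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical hook that simulates another CPU/context racing with us. */
typedef void (*racer_fn)(atomic_int *state);

/* Hypothetical two-state model: every event toggles between 0 and 1. */
int next_state(int curr)
{
	return 1 - curr;
}

/*
 * Try to commit one transition with at most max_retries attempts.
 * Returns the number of attempts used on success, or -1 if every
 * attempt lost the race (the case a retry cap would report).
 */
int handle_event_bounded(atomic_int *state, int max_retries, racer_fn racer)
{
	int curr = atomic_load(state);

	assert(max_retries > 0);
	for (int attempt = 1; attempt <= max_retries; attempt++) {
		int next = next_state(curr);

		if (racer)
			racer(state);	/* window in which another context may win */

		/* On failure, curr is refreshed with the value that beat us. */
		if (atomic_compare_exchange_strong(state, &curr, next))
			return attempt;
	}
	return -1;
}

/* Simulated contender: changes the state while its budget lasts. */
int racer_budget;

void limited_racer(atomic_int *state)
{
	if (racer_budget > 0) {
		racer_budget--;
		atomic_fetch_xor(state, 1);
	}
}
```

With a contender that interferes once, the call succeeds on its second
attempt. With a contender whose budget exceeds the cap, every attempt
fails and the function returns -1 instead of spinning. An uncapped loop
would, in this model, keep retrying for as long as the contender keeps
winning, which is the unbounded-latency concern raised above.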

