Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention

linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Yongliang Gao <leonylgao@gmail.com>
Cc: rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org, frankjpliu@tencent.com,
	Yongliang Gao <leonylgao@tencent.com>,
	Huang Cun <cunhuang@tencent.com>
Subject: Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention
Date: Thu, 13 Nov 2025 15:15:15 +0100	[thread overview]
Message-ID: <20251113141515.iZSIDK0T@linutronix.de> (raw)
In-Reply-To: <CAJxhyqCyB3-CyDKgPtP-EoC=G9cWAYgLvse003+i2n6U4Pgv1w@mail.gmail.com>

On 2025-11-13 19:13:23 [+0800], Yongliang Gao wrote:
> Hi Sebastian,
Hi Yongliang,

> Thank you for your review and the thoughtful questions.
> 
> 1. Performance Data
> We encountered this issue in a production environment with 288 cores
> where enabling set_ftrace_pid caused system CPU usage (sys%) to
> increase from 10% to over 90%. In our 92-core VM test environment:
> 
> Before patch (spinlock):
> - Without filtering: cs=2395401/s, sys%=7%
> - With filtering: cs=1828261/s, sys%=40%
> 
> After patch (seqlock):
> - Without filtering: cs=2397032/s, sys%=6%
> - With filtering: cs=2398922/s, sys%=6%
> 
> The seqlock approach eliminates the pid_list->lock contention that was
> previously causing sys% to increase from 7% to 40%.
> 
> 2. Reader Retry Behavior
> Yes, if the write side is continuously busy, the reader might spin and
> retry. However, in practice:
> - Writes are infrequent (only when setting ftrace_pid filter or during
> task fork/exit with function-fork enabled)
> - For readers, trace_pid_list_is_set() is called on every task switch,
> which can occur at a very high frequency.

See.

> 3. Result Accuracy
> You're correct that the result might change immediately after the
> read. For trace_ignore_this_task(), we don't require absolute
> accuracy. Slight race conditions (where a task might be traced or not
> in borderline cases) are acceptable.

I don't see why RCU work shouldn't work here.
If a pid is removed then it might result that a chunk cleared/ removed
then upper_chunk/ lower_chunk can become NULL. The buffer itself can be
reused and point to something else. It could lear to a false outcome in
test_bit(). This is handled by read_seqcount_retry().

You could assign upper1, upper2, to NULL/ buffer as now and the removal
(in put_lower_chunk(), put_upper_chunk()) delay to RCU so it happens
after the grace period. That would allow you to iterate over it in
trace_pid_list_is_set() lockless since the buffer will not disappear and
will not be reused for another task until after all reader left.

Additionally it would guarantee that the buffer is not released in
trace_pid_list_free(). I don't see how the seqcount ensures that the
buffer is not gone. I mean you could have a reader and the retry would
force you to do another loop but before that happens you dereference the
upper_chunk pointer which could be reused.

So I *think* the RCU approach should be doable and cover this.

> Best regards,
> Yongliang

Sebastian

next prev parent reply	other threads:[~2025-11-13 14:15 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-13  0:02 [PATCH v3] trace/pid_list: optimize pid_list->lock contention Yongliang Gao
2025-11-13  0:21 ` Steven Rostedt
2025-11-13  7:34 ` Sebastian Andrzej Siewior
2025-11-13 11:13   ` Yongliang Gao
2025-11-13 14:15     ` Sebastian Andrzej Siewior [this message]
2025-11-13 15:05       ` Steven Rostedt
2025-11-13 15:17         ` Sebastian Andrzej Siewior
2025-11-13 15:24           ` Steven Rostedt
2025-11-13 15:35             ` Sebastian Andrzej Siewior
2025-11-13 15:51               ` Steven Rostedt
2025-11-13 16:07                 ` Sebastian Andrzej Siewior
2025-11-13 16:14                   ` Steven Rostedt
2025-11-13 15:35   ` Steven Rostedt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251113141515.iZSIDK0T@linutronix.de \
    --to=bigeasy@linutronix.de \
    --cc=cunhuang@tencent.com \
    --cc=frankjpliu@tencent.com \
    --cc=leonylgao@gmail.com \
    --cc=leonylgao@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).