Date: Sat, 18 Apr 2026 18:36:57 -0400
From: Steven Rostedt <rostedt@goodmis.org>
To: "Paul E. McKenney"
Cc: LKML, Frederic Weisbecker, Joel Fernandes, Eric Dumazet,
 Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, Yao Kai,
 Peter Zijlstra, Thomas Gleixner
Subject: Re: [WARNING] RCU stall in sock_def_readable()
Message-ID: <20260418183657.19c00883@fedora>
In-Reply-To: <61264010-608d-46dd-884c-2448b65a57ed@paulmck-laptop>
References: <20260415132722.788bbdcf@fedora>
 <20260417084313.010864e8@fedora>
 <20260417093025.38faf68d@fedora>
 <61264010-608d-46dd-884c-2448b65a57ed@paulmck-laptop>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 17 Apr 2026 12:03:22 -0700
"Paul E. McKenney" wrote:

> Huh. One other possibility is that CONFIG_PREEMPT_NONE=y and its friend
> CONFIG_PREEMPT_VOLUNTARY=y responded to RCU calls for help at any call
> to cond_resched() or (in the voluntary case) might_sleep(). Isn't it
> the case that with CONFIG_PREEMPT_LAZY=y, nothing happens until some
> other task shows up on that CPU? Maybe RCU needs to be more aggressive
> about invoking resched_cpu() when the grace period starts getting a bit
> long in the tooth?

Perhaps. Note, I was still able to trigger the issue with PREEMPT_FULL.

But anyway, I think I may have found the culprit. It looks to be a
__wake_up_common() that is waking up too many tasks:

 hackbench_64-4695 [001] .....  328.224508: function: consume_skb <-- unix_stream_read_generic
 hackbench_64-4695 [001] .....  328.224508: function: skb_release_head_state <-- consume_skb
 hackbench_64-4695 [001] .....  328.224508: function: unix_destruct_scm <-- skb_release_head_state
 hackbench_64-4695 [001] .....  328.224508: function: put_pid <-- unix_destruct_scm
 hackbench_64-4695 [001] .....  328.224508: function: sock_wfree <-- unix_destruct_scm
 hackbench_64-4695 [001] .....  328.224509: function: unix_write_space <-- sock_wfree
 hackbench_64-4695 [001] .....  328.224509: function: __rcu_read_lock <-- unix_write_space
 hackbench_64-4695 [001] .....  328.224509: function: __wake_up_sync_key <-- unix_write_space
 hackbench_64-4695 [001] .....  328.224510: function: _raw_spin_lock_irqsave <-- __wake_up_sync_key
 hackbench_64-4695 [001] d....  328.224510: function: preempt_count_add <-- _raw_spin_lock_irqsave
 hackbench_64-4695 [001] d..1.  328.224510: function: do_raw_spin_lock <-- _raw_spin_lock_irqsave
 hackbench_64-4695 [001] d..1.  328.224511: function: __wake_up_common <-- __wake_up_sync_key
 hackbench_64-4695 [001] d..1.  328.224511: function: autoremove_wake_function <-- __wake_up_common
 hackbench_64-4695 [001] d..1.  328.224511: function: default_wake_function <-- autoremove_wake_function
 hackbench_64-4695 [001] d..1.  328.224512: function: try_to_wake_up <-- autoremove_wake_function

The above __wake_up_common() goes into a very long loop, so much so that
LAZY_NEED_RESCHED is triggered (look at the 'L' in the flags field of
the trace):

 hackbench_64-4695 [001] d..4.  328.224582: function: vruntime_eligible <-- __pick_eevdf
 hackbench_64-4695 [001] d..4.  328.224582: function: vruntime_eligible <-- __pick_eevdf
 hackbench_64-4695 [001] d..4.  328.224584: function: resched_curr_lazy <-- wakeup_preempt
 hackbench_64-4695 [001] d..4.  328.224584: function: __resched_curr <-- wakeup_preempt
 hackbench_64-4695 [001] dL.4.  328.224584: function: __srcu_check_read_flavor <-- ttwu_do_activate
 hackbench_64-4695 [001] dL.4.  328.224585: function: __srcu_check_read_flavor <-- ttwu_do_activate
 hackbench_64-4695 [001] dL.4.  328.224585: function: _raw_spin_unlock <-- try_to_wake_up
 hackbench_64-4695 [001] dL.4.  328.224585: function: do_raw_spin_unlock <-- _raw_spin_unlock
 hackbench_64-4695 [001] dL.4.  328.224586: function: preempt_count_sub <-- _raw_spin_unlock
 hackbench_64-4695 [001] dL.3.  328.224586: function: _raw_spin_unlock_irqrestore <-- try_to_wake_up
 hackbench_64-4695 [001] dL.3.  328.224586: function: do_raw_spin_unlock <-- _raw_spin_unlock_irqrestore
 hackbench_64-4695 [001] dL.3.  328.224586: function: preempt_count_sub <-- _raw_spin_unlock_irqrestore
 hackbench_64-4695 [001] dL.2.  328.224586: function: preempt_count_sub <-- try_to_wake_up
 hackbench_64-4695 [001] dL.1.  328.224587: function: autoremove_wake_function <-- __wake_up_common
 hackbench_64-4695 [001] dL.1.  328.224587: function: default_wake_function <-- autoremove_wake_function
 hackbench_64-4695 [001] dL.1.  328.224587: function: try_to_wake_up <-- autoremove_wake_function
 hackbench_64-4695 [001] dL.1.  328.224588: function: preempt_count_add <-- try_to_wake_up
 hackbench_64-4695 [001] dL.2.  328.224588: function: _raw_spin_lock_irqsave <-- try_to_wake_up
 hackbench_64-4695 [001] dL.2.  328.224588: function: preempt_count_add <-- _raw_spin_lock_irqsave
 hackbench_64-4695 [001] dL.3.  328.224588: function: do_raw_spin_lock <-- _raw_spin_lock_irqsave
 hackbench_64-4695 [001] dL.3.  328.224589: function: select_task_rq_fair <-- try_to_wake_up
 hackbench_64-4695 [001] dL.3.  328.224589: function: __rcu_read_lock <-- select_task_rq_fair
 hackbench_64-4695 [001] dL.3.  328.224589: function: __rcu_read_unlock <-- select_task_rq_fair
 hackbench_64-4695 [001] dL.3.  328.224590: function: ttwu_queue_wakelist <-- try_to_wake_up
 hackbench_64-4695 [001] dL.3.  328.224590: function: __smp_call_single_queue <-- ttwu_queue_wakelist

The above goes on for a very long time, until the next tick is hit,
which causes a full NEED_RESCHED to happen:

 hackbench_64-4695 [001] dLh1.  328.229455: function: sched_tick <-- update_process_times
 hackbench_64-4695 [001] dLh1.  328.229456: function: arch_scale_freq_tick <-- sched_tick
 hackbench_64-4695 [001] dLh1.  328.229456: function: preempt_count_add <-- sched_tick
 hackbench_64-4695 [001] dLh2.  328.229457: function: _raw_spin_lock_nested <-- sched_tick
 hackbench_64-4695 [001] dLh2.  328.229457: function: preempt_count_add <-- _raw_spin_lock_nested
 hackbench_64-4695 [001] dLh3.  328.229458: function: do_raw_spin_lock <-- sched_tick
 hackbench_64-4695 [001] dLh3.  328.229470: function: preempt_count_sub <-- sched_tick
 hackbench_64-4695 [001] dLh2.  328.229470: function: update_rq_clock <-- sched_tick
 hackbench_64-4695 [001] dLh2.  328.229471: function: arch_scale_cpu_capacity <-- update_rq_clock
 hackbench_64-4695 [001] dLh2.  328.229472: function: __resched_curr <-- sched_tick
 hackbench_64-4695 [001] dNh2.  328.229472: function: task_tick_fair <-- sched_tick
 hackbench_64-4695 [001] dNh2.  328.229473: function: update_curr <-- task_tick_fair
 hackbench_64-4695 [001] dNh2.  328.229473: function: update_se <-- update_curr
 hackbench_64-4695 [001] dNh2.  328.229474: function: cpuacct_charge <-- update_se
 hackbench_64-4695 [001] dNh2.  328.229474: function: __cgroup_account_cputime <-- update_se
 hackbench_64-4695 [001] dNh2.  328.229475: function: preempt_count_add <-- __cgroup_account_cputime
 hackbench_64-4695 [001] dNh3.  328.229476: function: css_rstat_updated <-- __cgroup_account_cputime
 hackbench_64-4695 [001] dNh3.  328.229476: function: preempt_count_sub <-- __cgroup_account_cputime
 hackbench_64-4695 [001] dNh2.  328.229477: function: dl_server_update <-- update_curr
 hackbench_64-4695 [001] dNh2.  328.229477: function: update_curr_dl_se <-- update_curr
 hackbench_64-4695 [001] dNh2.  328.229478: function: resched_curr_lazy <-- update_curr
 hackbench_64-4695 [001] dNh2.  328.229478: function: __resched_curr <-- update_curr

So this task finally gets scheduled out (while holding the RCU read lock):

static void unix_write_space(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	if (unix_writable(sk, READ_ONCE(sk->sk_state))) {
		wq = rcu_dereference(sk->sk_wq);
		if (skwq_has_sleeper(wq))
			wake_up_interruptible_sync_poll(&wq->wait,
				EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND);
		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
	}
	rcu_read_unlock();
}

And because the system is very busy, it doesn't get scheduled back in for
a long time, and RCU complains about a stall:

[  357.679707] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  357.810980] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-3): P4695/1:b..l
[  357.813601] rcu: 	(detected by 2, t=6535 jiffies, g=4125, q=100 ncpus=4)
[  357.816256] task:hackbench_64 state:R running task stack:0 pid:4695 tgid:4695 ppid:4690 task_flags:0x400000 flags:0x00080000
[  357.821176] Call Trace:
[  357.822382]
[  357.823474]  __schedule+0x4ac/0x12f0
[  357.825104]  preempt_schedule_common+0x26/0xe0
[  357.826988]  ? preempt_schedule_thunk+0x16/0x30
[  357.828930]  preempt_schedule_thunk+0x16/0x30
[  357.830788]  ? _raw_spin_unlock_irqrestore+0x39/0x70
[  357.832901]  _raw_spin_unlock_irqrestore+0x5d/0x70
[  357.834884]  sock_def_readable+0x9c/0x2b0
[  357.836633]  unix_stream_sendmsg+0x2d7/0x710
[  357.838423]  sock_write_iter+0x185/0x190
[  357.840084]  vfs_write+0x457/0x5b0
[  357.841568]  ksys_write+0xc8/0xf0
[  357.842914]  do_syscall_64+0x117/0x1660
[  357.844413]  ? irqentry_exit+0xd9/0x690

Now I guess the question is, how do we fix this? Before PREEMPT_LAZY, an
rcu_read_lock() would never schedule out. Now, long loops under RCU can
schedule out, which can be flagged as an RCU stall. Should we not
schedule out when rcu_read_lock() is held?

-- Steve