From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 17 Apr 2026 09:30:25 -0400
From: Steven Rostedt <rostedt@goodmis.org>
To: "Paul E. McKenney"
Cc: LKML, Frederic Weisbecker, Joel Fernandes, Eric Dumazet,
 Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, Yao Kai,
 Peter Zijlstra, Thomas Gleixner
Subject: Re: [WARNING] RCU stall in sock_def_readable()
Message-ID: <20260417093025.38faf68d@fedora>
In-Reply-To: <20260417084313.010864e8@fedora>
References: <20260415132722.788bbdcf@fedora>
 <20260417084313.010864e8@fedora>
X-Mailer: Claws Mail 4.3.1 (GTK 3.24.52; x86_64-redhat-linux-gnu)
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Fri, 17 Apr 2026 08:43:13 -0400
Steven Rostedt wrote:

> On Thu, 16 Apr 2026 17:16:11 -0700
> "Paul E. McKenney" wrote:
> 
> > One "hail Mary" thought is to revert this guy and see if it helps:
> > 
> > d41e37f26b31 ("rcu: Fix rcu_read_unlock() deadloop due to softirq")
> > 
> > This commit fixes a bug, so we cannot revert it in mainline, but there
> > is some reason to believe that there are more bugs beyond the one that
> > it fixed, and it might have (through no fault of its own) made those
> > other bugs more probable.
> > 
> > Worth a try, anyway!
> 
> Hail Marys are worth a try, but the reason they call it a hail Mary is
> because it is unlikely to succeed :-p
> 
> run test ssh -t root@tracetest "trace-cmd record -p function -e syscalls /work/c/hackbench_64 50"
> ssh -t root@tracetest "trace-cmd record -p function -e syscalls /work/c/hackbench_64 50"
...
> [ 209.590500] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 209.592620] rcu:   Tasks blocked on level-0 rcu_node (CPUs 0-3): P3151/1:b..l
> [ 209.595266] rcu:   (detected by 0, t=6502 jiffies, g=29673, q=186 ncpus=4)
> [ 209.597557] task:hackbench_64 state:R running task stack:0 pid:3151 tgid:3151 ppid:3144 task_flags:0x400000 flags:0x00080000
> [ 209.601871] Call Trace:
> [ 209.602852]  <TASK>
> [ 209.603752]  __schedule+0x4ac/0x12f0
> [ 209.605172]  preempt_schedule_common+0x26/0xe0
> [ 209.606755]  ? preempt_schedule_thunk+0x16/0x30
> [ 209.608337]  preempt_schedule_thunk+0x16/0x30
> [ 209.609973]  ? _raw_spin_unlock_irqrestore+0x39/0x70
> [ 209.611688]  _raw_spin_unlock_irqrestore+0x5d/0x70
> [ 209.613408]  sock_def_readable+0x9c/0x2b0
> [ 209.614841]  unix_stream_sendmsg+0x2d7/0x710
> [ 209.616420]  sock_write_iter+0x185/0x190
> [ 209.617934]  vfs_write+0x457/0x5b0
> [ 209.619242]  ksys_write+0xc8/0xf0
> [ 209.620532]  do_syscall_64+0x117/0x1660
> [ 209.621936]  ? irqentry_exit+0xd9/0x690
> [ 209.623319]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 209.625199] RIP: 0033:0x7f603e8e5190
> [ 209.626628] RSP: 002b:00007ffd003f99c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> [ 209.629304] RAX: ffffffffffffffda RBX: 00007ffd003f9b58 RCX: 00007f603e8e5190
> [ 209.631710] RDX: 0000000000000001 RSI: 00007ffd003f99ef RDI: 0000000000000006
> [ 209.634200] RBP: 00007ffd003f9a40 R08: 0011861580000000 R09: 0000000000000000
> [ 209.636638] R10: 00007f603e8064d0 R11: 0000000000000202 R12: 0000000000000000
> [ 209.639050] R13: 00007ffd003f9b70 R14: 00005637df126dd8 R15: 00007f603ea10020
> [ 209.641600]  </TASK>
> Detected kernel crash!
> 
> That was with the revert :-(

I went and looked at the configs that it used to see if anything changed.
One thing that stands out is that it used to use CONFIG_PREEMPT_VOLUNTARY,
and now it's using CONFIG_PREEMPT_LAZY. I'm thinking that because
preemption now doesn't happen until tasks go back to user space (and
kernel threads do not get preempted at all), this could have delayed the
RCU threads much longer.

I'm not sure why the stack trace is always the same. Maybe that's where
the biggest delay is caused by hackbench?

I'm going to switch it over to PREEMPT_FULL and see if that makes the
warning go away.

Oh, and when I logged into this box, I noticed that it had triggered an
OOM due to memory not being freed up fast enough.

All that said, my config is full of a lot of debugging that has high
overhead, which makes this issue much more prominent. It may not even be
something to worry about. If switching to PREEMPT_FULL fixes it, that
may be all I do.

Configs that cause overhead:

  PROVE_LOCKING

  FTRACE_RECORD_RECURSION - keeps track of function trace recursion.

  RING_BUFFER_VALIDATE_TIME_DELTAS - this causes a big overhead with
    tracing, as it tests the timestamps of every event. It requires
    walking the sub-buffer page and adding up the time deltas of each
    event to make sure they match the current event's timestamp. That's
    an O(n^2) operation on the number of events in the sub-buffer.

With the above overhead, I do consider this one of those "Patient:
Doctor, it hurts when I do this. Doctor: Then don't do that" moments.
But this test has been running for years with no issues other than
catching cases where the timestamp did get out of sync. Hence, I don't
want to stop testing this. But if I can find the culprit, I can modify
the test to avoid failing due to it.

-- Steve