* [BUG] printk/nbcon can use RCU illegally prior to CPU online
@ 2024-07-18 15:41 Andrew Halaney
2024-07-18 15:57 ` Steven Rostedt
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Halaney @ 2024-07-18 15:41 UTC (permalink / raw)
To: pmladek, rostedt, john.ogness, senozhatsky, debarbos, williams,
jlelli
Cc: linux-rt-users
Hi,
Recently I was made aware of an issue when running 6.10.0-rc6-rt11+
(with a !PREEMPT_RT config, although I'm not sure it matters here).
Its easy to reproduce, just printk on a CPU that's coming online (I got
pointed to a real splat, but this suffices to reproduce).
For example:
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0c35207320cb..eb75a7cffe31 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -274,6 +274,10 @@ static void notrace start_secondary(void *unused)
cpuhp_ap_sync_alive();
cpu_init();
+
+ /* Let's printk() and see if RCU whines */
+ printk(KERN_ERR "RCU, what do you think?!\n");
+
fpu__init_cpu();
rcutree_report_cpu_starting(raw_smp_processor_id());
x86_cpuinit.early_percpu_clock_init();
Results in the following on boot:
[ 2.376218] .... node #0, CPUs: #46
[ 2.377276] .... node #1, CPUs: #47
[ 1.890157] RCU, what do you think?!
[ 1.890157] =============================
[ 1.890157] WARNING: suspicious RCU usage
[ 1.890157] 6.10.0-rc6-rt11+ #12 Not tainted
[ 1.890157] -----------------------------
[ 1.890157] kernel/printk/nbcon.c:1157 suspicious rcu_dereference_check() usage!
[ 1.890157]
other info that might help us debug this:
[ 1.890157]
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
[ 1.890157] 2 locks held by swapper/1/0:
[ 1.890157] #0: ffffffff8b87efd0 (console_srcu){....}-{0:0}, at: console_srcu_read_lock+0x30/0x60
[ 1.890157] #1: ffffffff8bc04050 (rcu_read_lock){....}-{1:3}, at: nbcon_wake_threads+0x4d/0x190
[ 1.890157]
stack backtrace:
[ 1.890157] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.10.0-rc6-rt11+ #12
[ 1.890157] Hardware name: Dell Inc. PowerEdge R660/0NN3RP, BIOS 2.1.5 03/14/2024
[ 1.890157] Call Trace:
[ 1.890157] <TASK>
[ 1.890157] dump_stack_lvl+0x86/0xc0
[ 1.890157] lockdep_rcu_suspicious+0x154/0x1a0
[ 1.890157] ? nbcon_wake_threads+0x4d/0x190
[ 1.890157] nbcon_wake_threads+0x176/0x190
[ 1.890157] vprintk_emit+0x170/0x450
[ 1.890157] _printk+0x5d/0x80
[ 1.890157] ? initialize_tlbstate_and_flush+0xb5/0x1d0
[ 1.890157] start_secondary+0x29/0xb0
[ 1.890157] common_startup_64+0x13e/0x140
[ 1.890157] </TASK>
Basically, we're using RCU during a printk() (rcuwait_has_sleeper()
in nbcon_wake_threads()) to figure out if we need to wake a rcuwait'ing thread
with an irq_work + rcuwait_wake_up() dance, and that "figuring out" with RCU is
not allowed prior to a CPU coming online.
I scratched my head and read up on RCU some more, but I don't have any solutions
brainstormed up yet that are proper seeming (can't just blindly use SRCU with rcuwait,
not sure if we should try scheduling an irq_work always until we're "RCU online"
via checking with something akin to rcu_cpu_online(), etc)...
Anyone more familiar with this neck of the woods have a good tip on how
to handle a printk()/nbcon prior to a CPU coming online?
Thanks,
Andrew
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-18 15:41 [BUG] printk/nbcon can use RCU illegally prior to CPU online Andrew Halaney
@ 2024-07-18 15:57 ` Steven Rostedt
2024-07-18 18:13 ` Andrew Halaney
0 siblings, 1 reply; 12+ messages in thread
From: Steven Rostedt @ 2024-07-18 15:57 UTC (permalink / raw)
To: Andrew Halaney
Cc: pmladek, john.ogness, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Thu, 18 Jul 2024 10:41:25 -0500
Andrew Halaney <ahalaney@redhat.com> wrote:
> Hi,
>
> Recently I was made aware of an issue when running 6.10.0-rc6-rt11+
> (with a !PREEMPT_RT config, although I'm not sure it matters here).
>
> Its easy to reproduce, just printk on a CPU that's coming online (I got
> pointed to a real splat, but this suffices to reproduce).
I'm more interested in the real splat. Where was the printk() that
triggered it?
> For example:
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 0c35207320cb..eb75a7cffe31 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -274,6 +274,10 @@ static void notrace start_secondary(void *unused)
> cpuhp_ap_sync_alive();
>
> cpu_init();
> +
> + /* Let's printk() and see if RCU whines */
> + printk(KERN_ERR "RCU, what do you think?!\n");
> +
prink() is normally OK to use on boot up before SMP starts. That's
because it has nothing to race against. But this is the secondary CPU,
where we are entering SMP. RCU has to be active to use pretty much
anything (printk, tracing, etc).
This is why I'm interested in knowing what the real splat was. It could
be that it is used in an incorrect location.
But if you are just debugging, then you should be safely able to ignore
this.
-- Steve
> fpu__init_cpu();
> rcutree_report_cpu_starting(raw_smp_processor_id());
> x86_cpuinit.early_percpu_clock_init();
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-18 15:57 ` Steven Rostedt
@ 2024-07-18 18:13 ` Andrew Halaney
2024-07-18 19:24 ` Steven Rostedt
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Andrew Halaney @ 2024-07-18 18:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: pmladek, john.ogness, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Thu, Jul 18, 2024 at 11:57:04AM GMT, Steven Rostedt wrote:
> On Thu, 18 Jul 2024 10:41:25 -0500
> Andrew Halaney <ahalaney@redhat.com> wrote:
>
> > Hi,
> >
> > Recently I was made aware of an issue when running 6.10.0-rc6-rt11+
> > (with a !PREEMPT_RT config, although I'm not sure it matters here).
> >
> > Its easy to reproduce, just printk on a CPU that's coming online (I got
> > pointed to a real splat, but this suffices to reproduce).
>
> I'm more interested in the real splat. Where was the printk() that
> triggered it?
I'll be honest, I only considered this from a printk() point of view so
far and pretty much ignored the original splat (as someone else is
looking at that). Here's the original splat from a centos kernel, I have
yet to look into it other than "we took an exception prior to RCU coming
online for this CPU" and me naively thinking that printk() should work
everywhere. I'll post the splat here for now before considering it more,
but it sounds like my expectation is wrong and that if something goes
awry and needs to printk prior to the CPU coming online, then this RCU
warning is sort of expected?:
[ 13.643007] smp: Bringing up secondary CPUs ...
[ 13.644422] smpboot: x86: Booting SMP configuration:
[ 13.644431] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34
[ 13.680691] Callback from call_rcu_tasks_rude() invoked.
[ 13.684308] #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55
[ 13.720218] .... node #1, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95 #96 #97 #98 #99
[ 13.783247] Callback from call_rcu_tasks() invoked.
[ 13.785941] #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111
[ 5.061942] x2apic lock mismatch between BSP and AP.
[ 5.061942] unchecked MSR access error: WRMSR to 0x83f (tried to write 0x00000000000000f6) at rIP: 0xffffffff877434d7 (native_apic_msr_write+0x27/0x40)
[ 5.061942] Call Trace:
[ 5.061942] <TASK>
[ 5.061942] ? show_trace_log_lvl+0x1c4/0x2df
[ 5.061942] ? show_trace_log_lvl+0x1c4/0x2df
[ 5.061942] ? arch_irq_work_raise+0x75/0xb0
[ 5.061942] ? native_apic_msr_write+0x27/0x40
[ 5.061942] ? ex_handler_msr.cold+0xb3/0x151
[ 5.061942] ? fixup_exception+0x8d1/0xab0
[ 5.061942] ? gp_try_fixup_and_notify+0x20/0x50
[ 5.061942] ? exc_general_protection+0xf1/0x220
[ 5.061942] ? asm_exc_general_protection+0x22/0x30
[ 5.061942] ? native_apic_msr_write+0x27/0x40
[ 5.061942] arch_irq_work_raise+0x75/0xb0
[ 5.061942] irq_work_queue+0x2f/0x50
[ 5.061942] nbcon_wake_threads+0x196/0x270
[ 5.061942] vprintk_emit+0x183/0x400
[ 5.061942] _printk+0xb2/0xe7
[ 5.061942] ? _pfx_printk+0x10/0x10
[ 5.061942] ? __pfx_lock_acquire+0x10/0x10
[ 5.061942] lockdep_rcu_suspicious+0x12d/0x200
[ 5.061942] nbcon_wake_threads+0x1e1/0x270
[ 5.061942] vprintk_emit+0x183/0x400
[ 5.061942] _printk+0xb2/0xe7
[ 5.061942] ? _pfx_printk+0x10/0x10
[ 5.061942] ? setup_ghcb+0xa/0x120
[ 5.061942] ? cpu_init_exception_handling+0x5bc/0x7a0
[ 5.061942] x2apic_setup.cold+0xc/0x11
[ 5.061942] cpu_init+0xe9/0x3a0
[ 5.061942] ? __pfx_cpu_init+0x10/0x10
[ 5.061942] start_secondary+0x6d/0x290
[ 5.061942] ? __pfx_start_secondary+0x10/0x10
[ 5.061942] ? set_bringup_idt_handler.constprop.0+0x98/0xc0
[ 5.061942] ? soft_restart_cpu+0x15/0x15
[ 5.061942] secondary_startup_64_no_verify+0x18f/0x19b
[ 5.061942] </TASK>
[ 5.061942] =============================
[ 5.061942] WARNING: suspicious RCU usage
[ 5.061942] 5.14.0-467.el9.x86_64+debug #1 Not tainted
[ 5.061942] -----------------------------
[ 5.061942] kernel/printk/nbcon.c:1118 suspicious rcu_dereference_check() usage!
[ 5.061942]
other info that might help us debug this:
[ 5.061942]
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
[ 5.061942] 2 locks held by swapper/1/0:
[ 5.061942] #0: ffffffff8b721a10 (console_srcu)
{....}-{0:0}, at: console_srcu_read_lock+0x44/0x60
[ 5.061942] #1: ffffffff8b735080 (rcu_read_lock){....}
-
{1:2}
, at: nbcon_wake_threads+0x11d/0x270
[ 5.061942]
stack backtrace:
[ 5.061942] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.14.0-467.el9.x86_64+debug #1
[ 5.061942] Hardware name: HPE ProLiant DL380 Gen11/ProLiant DL380 Gen11, BIOS 1.32 03/23/2023
[ 5.061942] Call Trace:
[ 5.061942] <TASK>
[ 5.061942] dump_stack_lvl+0x57/0x81
[ 5.061942] lockdep_rcu_suspicious.cold+0x96/0xfc
[ 5.061942] nbcon_wake_threads+0x1e1/0x270
[ 5.061942] vprintk_emit+0x183/0x400
[ 5.061942] _printk+0xb2/0xe7
[ 5.061942] ? _pfx_printk+0x10/0x10
[ 5.061942] ? setup_ghcb+0xa/0x120
[ 5.061942] ? cpu_init_exception_handling+0x5bc/0x7a0
[ 5.061942] x2apic_setup.cold+0xc/0x11
[ 5.061942] cpu_init+0xe9/0x3a0
[ 5.061942] ? __pfx_cpu_init+0x10/0x10
[ 5.061942] start_secondary+0x6d/0x290
[ 5.061942] ? __pfx_start_secondary+0x10/0x10
[ 5.061942] ? set_bringup_idt_handler.constprop.0+0x98/0xc0
[ 5.061942] ? soft_restart_cpu+0x15/0x15
[ 5.061942] secondary_startup_64_no_verify+0x18f/0x19b
[ 5.061942] </TASK>
[ 5.061942] numa_add_cpu cpu 1 node 0: mask now 0-1
[ 5.061942] x2apic lock mismatch between BSP and AP.
[ 5.061942] numa_add_cpu cpu 2 node 0: mask now 0-2
[ 5.061942] x2apic lock mismatch between BSP and AP.
[ 5.061942] numa_add_cpu cpu 3 node 0: mask now 0-3
[ 5.061942] x2apic lock mismatch between BSP and AP.
>
> > For example:
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 0c35207320cb..eb75a7cffe31 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -274,6 +274,10 @@ static void notrace start_secondary(void *unused)
> > cpuhp_ap_sync_alive();
> >
> > cpu_init();
> > +
> > + /* Let's printk() and see if RCU whines */
> > + printk(KERN_ERR "RCU, what do you think?!\n");
> > +
>
> prink() is normally OK to use on boot up before SMP starts. That's
> because it has nothing to race against. But this is the secondary CPU,
> where we are entering SMP. RCU has to be active to use pretty much
> anything (printk, tracing, etc).
>
> This is why I'm interested in knowing what the real splat was. It could
> be that it is used in an incorrect location.
>
> But if you are just debugging, then you should be safely able to ignore
> this.
Thanks for the explanation.
Please see above for the splat actual splat from a centos kernel (which
is the main reason I didn't want to send the original writeup with the
downstream splat).
Thanks,
Andrew
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-18 18:13 ` Andrew Halaney
@ 2024-07-18 19:24 ` Steven Rostedt
2024-07-18 19:34 ` Andrew Halaney
2024-07-19 7:55 ` John Ogness
2024-07-19 8:26 ` John Ogness
2 siblings, 1 reply; 12+ messages in thread
From: Steven Rostedt @ 2024-07-18 19:24 UTC (permalink / raw)
To: Andrew Halaney
Cc: pmladek, john.ogness, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Thu, 18 Jul 2024 13:13:02 -0500
Andrew Halaney <ahalaney@redhat.com> wrote:
> I'll be honest, I only considered this from a printk() point of view so
> far and pretty much ignored the original splat (as someone else is
> looking at that). Here's the original splat from a centos kernel, I have
> yet to look into it other than "we took an exception prior to RCU coming
> online for this CPU" and me naively thinking that printk() should work
> everywhere. I'll post the splat here for now before considering it more,
> but it sounds like my expectation is wrong and that if something goes
> awry and needs to printk prior to the CPU coming online, then this RCU
> warning is sort of expected?:
Note, printk() is most definitely not safe in all contexts. It can
cause issues if done in NMI, or even called while holding an scheduler
rq lock. And now we are trying to make sure all tracing/printk is done
while RCU is watching.
>
> [ 13.643007] smp: Bringing up secondary CPUs ...
> [ 13.644422] smpboot: x86: Booting SMP configuration:
> [ 13.644431] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34
> [ 13.680691] Callback from call_rcu_tasks_rude() invoked.
> [ 13.684308] #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55
> [ 13.720218] .... node #1, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95 #96 #97 #98 #99
> [ 13.783247] Callback from call_rcu_tasks() invoked.
> [ 13.785941] #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111
> [ 5.061942] x2apic lock mismatch between BSP and AP.
> [ 5.061942] =============================
> [ 5.061942] WARNING: suspicious RCU usage
> [ 5.061942] 5.14.0-467.el9.x86_64+debug #1 Not tainted
> [ 5.061942] -----------------------------
> [ 5.061942] kernel/printk/nbcon.c:1118 suspicious rcu_dereference_check() usage!
You said this was a centos kernel. Well that explains why the above
says 5.14 and has a reference to kernel/printk/nbcon.c, which doesn't
exist in 5.14.
Looks like a bad backport.
-- Steve
> [ 5.061942]
> other info that might help us debug this:
>
> [ 5.061942]
> RCU used illegally from offline CPU!
> rcu_scheduler_active = 1, debug_locks = 1
> [ 5.061942] 2 locks held by swapper/1/0:
> [ 5.061942] #0: ffffffff8b721a10 (console_srcu)
> {....}-{0:0}, at: console_srcu_read_lock+0x44/0x60
> [ 5.061942] #1: ffffffff8b735080 (rcu_read_lock){....}
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-18 19:24 ` Steven Rostedt
@ 2024-07-18 19:34 ` Andrew Halaney
2024-07-18 20:17 ` Steven Rostedt
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Halaney @ 2024-07-18 19:34 UTC (permalink / raw)
To: Steven Rostedt
Cc: pmladek, john.ogness, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Thu, Jul 18, 2024 at 03:24:47PM GMT, Steven Rostedt wrote:
> On Thu, 18 Jul 2024 13:13:02 -0500
> Andrew Halaney <ahalaney@redhat.com> wrote:
>
>
> > I'll be honest, I only considered this from a printk() point of view so
> > far and pretty much ignored the original splat (as someone else is
> > looking at that). Here's the original splat from a centos kernel, I have
> > yet to look into it other than "we took an exception prior to RCU coming
> > online for this CPU" and me naively thinking that printk() should work
> > everywhere. I'll post the splat here for now before considering it more,
> > but it sounds like my expectation is wrong and that if something goes
> > awry and needs to printk prior to the CPU coming online, then this RCU
> > warning is sort of expected?:
>
> Note, printk() is most definitely not safe in all contexts. It can
> cause issues if done in NMI, or even called while holding an scheduler
> rq lock. And now we are trying to make sure all tracing/printk is done
> while RCU is watching.
>
> >
> > [ 13.643007] smp: Bringing up secondary CPUs ...
> > [ 13.644422] smpboot: x86: Booting SMP configuration:
> > [ 13.644431] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34
> > [ 13.680691] Callback from call_rcu_tasks_rude() invoked.
> > [ 13.684308] #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55
> > [ 13.720218] .... node #1, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95 #96 #97 #98 #99
> > [ 13.783247] Callback from call_rcu_tasks() invoked.
> > [ 13.785941] #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111
> > [ 5.061942] x2apic lock mismatch between BSP and AP.
>
> > [ 5.061942] =============================
> > [ 5.061942] WARNING: suspicious RCU usage
> > [ 5.061942] 5.14.0-467.el9.x86_64+debug #1 Not tainted
> > [ 5.061942] -----------------------------
> > [ 5.061942] kernel/printk/nbcon.c:1118 suspicious rcu_dereference_check() usage!
>
> You said this was a centos kernel. Well that explains why the above
> says 5.14 and has a reference to kernel/printk/nbcon.c, which doesn't
> exist in 5.14.
>
> Looks like a bad backport.
>
Heh, exactly why I made a "reproducer" for the printk/RCU side of the
splat with linux-rt-devel before reporting.
So my takeaway is that printk is not considered safe to use in all
contexts, including during SMP bringup (because RCU's at play and not
yet ready to be used on those CPUs). So if we do printk for reasons we
can't control (say some firmware problem causing an exception) and try
to print you could get the RCU warning about printk/nbcon usage in
addition to the actual error, and that's just something to accept.
Ideally though we don't printk() in those situations... so if its badly
behaving code calling printk() that's a different story.
Let me know if you think that recap is bogus.
Thanks,
Andrew
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-18 19:34 ` Andrew Halaney
@ 2024-07-18 20:17 ` Steven Rostedt
0 siblings, 0 replies; 12+ messages in thread
From: Steven Rostedt @ 2024-07-18 20:17 UTC (permalink / raw)
To: Andrew Halaney
Cc: pmladek, john.ogness, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Thu, 18 Jul 2024 14:34:41 -0500
Andrew Halaney <ahalaney@redhat.com> wrote:
> Heh, exactly why I made a "reproducer" for the printk/RCU side of the
> splat with linux-rt-devel before reporting.
>
> So my takeaway is that printk is not considered safe to use in all
> contexts, including during SMP bringup (because RCU's at play and not
> yet ready to be used on those CPUs). So if we do printk for reasons we
> can't control (say some firmware problem causing an exception) and try
> to print you could get the RCU warning about printk/nbcon usage in
> addition to the actual error, and that's just something to accept.
Well, in most cases people don't run a debug kernel in production and
if this printk was triggered in production, that warning would not show
up (it's only there when lockdep is enabled).
>
> Ideally though we don't printk() in those situations... so if its badly
> behaving code calling printk() that's a different story.
Right. printk() does get called in locations it shouldn't simply due to
a bug happening. But then, it's just a best effort to get information
to the user.
>
> Let me know if you think that recap is bogus.
I think you summarized it well.
-- Steve
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-18 18:13 ` Andrew Halaney
2024-07-18 19:24 ` Steven Rostedt
@ 2024-07-19 7:55 ` John Ogness
2024-07-19 14:09 ` Andrew Halaney
2024-07-19 8:26 ` John Ogness
2 siblings, 1 reply; 12+ messages in thread
From: John Ogness @ 2024-07-19 7:55 UTC (permalink / raw)
To: Andrew Halaney, Steven Rostedt
Cc: pmladek, senozhatsky, debarbos, williams, jlelli, linux-rt-users
On 2024-07-18, Andrew Halaney <ahalaney@redhat.com> wrote:
> I'll be honest, I only considered this from a printk() point of view
> so far and pretty much ignored the original splat (as someone else is
> looking at that). Here's the original splat from a centos kernel, I
> have yet to look into it other than "we took an exception prior to RCU
> coming online for this CPU"
Thanks for posting the splat. This is indeed a printk()
issue. vprintk_emit() or nbcon_wake_threads() need to be aware of the
RCU state and act appropriately. I will look into this.
> and me naively thinking that printk() should work everywhere.
Your thinking is not naive. It is correct (for PREEMPT_RT). printk()
stores messages locklessly into its ringbuffer and must work in all
cases (including NMI and scheduler). And AFAIK it does.
However, printk() is also responsible for reliably triggering or
performing console printing. This separate responsibility is tricky. For
PREEMPT_RT the problem is supposed to be solved. Obviously it still has
some issues, so thank you for the report.
For PREEMPT_RT, if you can put a printk() anywhere after the Linux
banner and cause a problem, the printk-folks need to know about it.
John Ogness
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-19 7:55 ` John Ogness
@ 2024-07-19 14:09 ` Andrew Halaney
2024-07-19 14:55 ` John Ogness
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Halaney @ 2024-07-19 14:09 UTC (permalink / raw)
To: John Ogness
Cc: Steven Rostedt, pmladek, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Fri, Jul 19, 2024 at 10:01:25AM GMT, John Ogness wrote:
> On 2024-07-18, Andrew Halaney <ahalaney@redhat.com> wrote:
> > I'll be honest, I only considered this from a printk() point of view
> > so far and pretty much ignored the original splat (as someone else is
> > looking at that). Here's the original splat from a centos kernel, I
> > have yet to look into it other than "we took an exception prior to RCU
> > coming online for this CPU"
>
> Thanks for posting the splat. This is indeed a printk()
> issue. vprintk_emit() or nbcon_wake_threads() need to be aware of the
> RCU state and act appropriately. I will look into this.
>
> > and me naively thinking that printk() should work everywhere.
>
> Your thinking is not naive. It is correct (for PREEMPT_RT). printk()
> stores messages locklessly into its ringbuffer and must work in all
> cases (including NMI and scheduler). And AFAIK it does.
>
> However, printk() is also responsible for reliably triggering or
> performing console printing. This separate responsibility is tricky. For
> PREEMPT_RT the problem is supposed to be solved. Obviously it still has
> some issues, so thank you for the report.
>
> For PREEMPT_RT, if you can put a printk() anywhere after the Linux
> banner and cause a problem, the printk-folks need to know about it.
Just to be sure I'm following correctly, when you say PREEMPT_RT above,
are you meaning "the printk/nbcon/console solution in PREEMPT (independent of
the current CONFIG_PREEMPT* setting)"? Or do you mean that + CONFIG_PREEMPT_RT?
Thanks,
Andrew
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-19 14:09 ` Andrew Halaney
@ 2024-07-19 14:55 ` John Ogness
2024-07-19 15:02 ` Andrew Halaney
0 siblings, 1 reply; 12+ messages in thread
From: John Ogness @ 2024-07-19 14:55 UTC (permalink / raw)
To: Andrew Halaney
Cc: Steven Rostedt, pmladek, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On 2024-07-19, Andrew Halaney <ahalaney@redhat.com> wrote:
>> Your thinking is not naive. It is correct (for PREEMPT_RT). printk()
>> stores messages locklessly into its ringbuffer and must work in all
>> cases (including NMI and scheduler). And AFAIK it does.
>>
>> However, printk() is also responsible for reliably triggering or
>> performing console printing. This separate responsibility is tricky. For
>> PREEMPT_RT the problem is supposed to be solved. Obviously it still has
>> some issues, so thank you for the report.
>>
>> For PREEMPT_RT, if you can put a printk() anywhere after the Linux
>> banner and cause a problem, the printk-folks need to know about it.
>
> Just to be sure I'm following correctly, when you say PREEMPT_RT above,
> are you meaning "the printk/nbcon/console solution in PREEMPT (independent of
> the current CONFIG_PREEMPT* setting)"? Or do you mean that + CONFIG_PREEMPT_RT?
First, the _complete_ printk/nbcon/console solution must be present in
the kernel (most likely by simply applying the PREEMPT_RT patchset).
Then, either
1. Enabling CONFIG_PREEMPT_RT
or
2. Enabling some other preemption model but only using nbcon consoles
(which currently is only the 8250 uart driver)
For !CONFIG_PREEMPT_RT using legacy consoles the
non-interference/reliability properties of the console printing is not
changed from current mainline. This is on purpose.
John
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-19 14:55 ` John Ogness
@ 2024-07-19 15:02 ` Andrew Halaney
0 siblings, 0 replies; 12+ messages in thread
From: Andrew Halaney @ 2024-07-19 15:02 UTC (permalink / raw)
To: John Ogness
Cc: Steven Rostedt, pmladek, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Fri, Jul 19, 2024 at 05:01:10PM GMT, John Ogness wrote:
> On 2024-07-19, Andrew Halaney <ahalaney@redhat.com> wrote:
> >> Your thinking is not naive. It is correct (for PREEMPT_RT). printk()
> >> stores messages locklessly into its ringbuffer and must work in all
> >> cases (including NMI and scheduler). And AFAIK it does.
> >>
> >> However, printk() is also responsible for reliably triggering or
> >> performing console printing. This separate responsibility is tricky. For
> >> PREEMPT_RT the problem is supposed to be solved. Obviously it still has
> >> some issues, so thank you for the report.
> >>
> >> For PREEMPT_RT, if you can put a printk() anywhere after the Linux
> >> banner and cause a problem, the printk-folks need to know about it.
> >
> > Just to be sure I'm following correctly, when you say PREEMPT_RT above,
> > are you meaning "the printk/nbcon/console solution in PREEMPT (independent of
s/PREEMPT/PREEMPT_RT patches/
bad spot for a typo on my part...
> > the current CONFIG_PREEMPT* setting)"? Or do you mean that + CONFIG_PREEMPT_RT?
>
> First, the _complete_ printk/nbcon/console solution must be present in
> the kernel (most likely by simply applying the PREEMPT_RT patchset).
>
> Then, either
>
> 1. Enabling CONFIG_PREEMPT_RT
>
> or
>
> 2. Enabling some other preemption model but only using nbcon consoles
> (which currently is only the 8250 uart driver)
>
> For !CONFIG_PREEMPT_RT using legacy consoles the
> non-interference/reliability properties of the console printing is not
> changed from current mainline. This is on purpose.
Thanks for clarifying!
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-18 18:13 ` Andrew Halaney
2024-07-18 19:24 ` Steven Rostedt
2024-07-19 7:55 ` John Ogness
@ 2024-07-19 8:26 ` John Ogness
2024-07-19 16:01 ` Andrew Halaney
2 siblings, 1 reply; 12+ messages in thread
From: John Ogness @ 2024-07-19 8:26 UTC (permalink / raw)
To: Andrew Halaney, Steven Rostedt
Cc: pmladek, senozhatsky, debarbos, williams, jlelli, linux-rt-users
On 2024-07-18, Andrew Halaney <ahalaney@redhat.com> wrote:
> [ 5.061942] =============================
> [ 5.061942] WARNING: suspicious RCU usage
> [ 5.061942] 5.14.0-467.el9.x86_64+debug #1 Not tainted
> [ 5.061942] -----------------------------
> [ 5.061942] kernel/printk/nbcon.c:1118 suspicious rcu_dereference_check() usage!
In Petr Mladek's review of my latest printk-threading series, he pointed
out [0] that I was using the wrong RCU API in rcuwait_has_sleeper().
The patch below was his suggestion, which I was already folding into the
next version of my series.
John Ogness
[0] https://lore.kernel.org/lkml/ZmsLcu2tHmCfLiu0@pathway.suse.cz
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 0951c2c7fb1b..da462661876c 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1181,9 +1181,6 @@ static void nbcon_irq_work(struct irq_work *irq_work)
static inline bool rcuwait_has_sleeper(struct rcuwait *w)
{
- bool has_sleeper;
-
- rcu_read_lock();
/*
* Guarantee any new records can be seen by tasks preparing to wait
* before this context checks if the rcuwait is empty.
@@ -1196,10 +1193,7 @@ static inline bool rcuwait_has_sleeper(struct rcuwait *w)
* This pairs with nbcon_kthread_func:A.
*/
smp_mb(); /* LMM(rcuwait_has_sleeper:A) */
- has_sleeper = !!rcu_dereference(w->task);
- rcu_read_unlock();
-
- return has_sleeper;
+ return rcuwait_active(w);
}
/**
^ permalink raw reply related [flat|nested] 12+ messages in thread* Re: [BUG] printk/nbcon can use RCU illegally prior to CPU online
2024-07-19 8:26 ` John Ogness
@ 2024-07-19 16:01 ` Andrew Halaney
0 siblings, 0 replies; 12+ messages in thread
From: Andrew Halaney @ 2024-07-19 16:01 UTC (permalink / raw)
To: John Ogness
Cc: Steven Rostedt, pmladek, senozhatsky, debarbos, williams, jlelli,
linux-rt-users
On Fri, Jul 19, 2024 at 10:32:32AM GMT, John Ogness wrote:
> On 2024-07-18, Andrew Halaney <ahalaney@redhat.com> wrote:
> > [ 5.061942] =============================
> > [ 5.061942] WARNING: suspicious RCU usage
> > [ 5.061942] 5.14.0-467.el9.x86_64+debug #1 Not tainted
> > [ 5.061942] -----------------------------
> > [ 5.061942] kernel/printk/nbcon.c:1118 suspicious rcu_dereference_check() usage!
>
> In Petr Mladek's review of my latest printk-threading series, he pointed
> out [0] that I was using the wrong RCU API in rcuwait_has_sleeper().
>
> The patch below was his suggestion, which I was already folding into the
> next version of my series.
>
> John Ogness
>
> [0] https://lore.kernel.org/lkml/ZmsLcu2tHmCfLiu0@pathway.suse.cz
>
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index 0951c2c7fb1b..da462661876c 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1181,9 +1181,6 @@ static void nbcon_irq_work(struct irq_work *irq_work)
>
> static inline bool rcuwait_has_sleeper(struct rcuwait *w)
> {
> - bool has_sleeper;
> -
> - rcu_read_lock();
> /*
> * Guarantee any new records can be seen by tasks preparing to wait
> * before this context checks if the rcuwait is empty.
> @@ -1196,10 +1193,7 @@ static inline bool rcuwait_has_sleeper(struct rcuwait *w)
> * This pairs with nbcon_kthread_func:A.
> */
> smp_mb(); /* LMM(rcuwait_has_sleeper:A) */
> - has_sleeper = !!rcu_dereference(w->task);
> - rcu_read_unlock();
> -
> - return has_sleeper;
> + return rcuwait_active(w);
> }
>
> /**
>
Nice, TIL about that. That seems like a good change to me, and
prevents the RCU splat from nbcon.
Thanks,
Andrew
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-07-19 16:01 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-18 15:41 [BUG] printk/nbcon can use RCU illegally prior to CPU online Andrew Halaney
2024-07-18 15:57 ` Steven Rostedt
2024-07-18 18:13 ` Andrew Halaney
2024-07-18 19:24 ` Steven Rostedt
2024-07-18 19:34 ` Andrew Halaney
2024-07-18 20:17 ` Steven Rostedt
2024-07-19 7:55 ` John Ogness
2024-07-19 14:09 ` Andrew Halaney
2024-07-19 14:55 ` John Ogness
2024-07-19 15:02 ` Andrew Halaney
2024-07-19 8:26 ` John Ogness
2024-07-19 16:01 ` Andrew Halaney
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox