public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [BUG -next] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100
@ 2025-11-13  0:52 Paul E. McKenney
  2025-11-13  7:37 ` Petr Mladek
  0 siblings, 1 reply; 3+ messages in thread
From: Paul E. McKenney @ 2025-11-13  0:52 UTC (permalink / raw)
  To: linux-kernel, linux-next
  Cc: d-tatianin, pmladek, john.ogness, sfr, rostedt, senozhatsky

Hello!

Some rcutorture runs on next-20251110 hit the following error on x86:

WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100, CPU#0: rcu_torture_sta/63

This happens in about 20-25% of the rcutorture runs, and is the
WARN_ON_ONCE(1) in the "else" clause of get_data().  There was no
rcutorture scenario that failed to reproduce this bug, so I am guessing
that the various .config files will not provide useful information.
Please see the end of this email for a representative splat, which is
usually rcutorture printing out something or another.  (Which, in its
defense, has worked just fine in the past.)

Bisection converged on this commit:

67e1b0052f6b ("printk_ringbuffer: don't needlessly wrap data blocks around")

Reverting this commit suppressed (or at least hugely reduced the
probability of) the WARN_ON_ONCE().

The SRCU-T, SRCU-U, and TREE09 scenarios hit this most frequently at
about double the base rate, but are CONFIG_SMP=n builds.  The RUDE01
scenario was the most productive CONFIG_SMP=y scenario.  Reproduce as
follows, where "N" is the number of CPUs on your system divided by three,
rounded down:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*RUDE01"

Or if you can do CONFIG_SMP=n, the following works, where "N" is the
number of CPUs on your system:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*SRCU-T"

Or please tell me what debug I should enable on my runs.

							Thanx, Paul

------------------------------------------------------------------------

[ 1676.478083] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100, CPU#0: rcu_torture_sta/63
[ 1676.478092] Modules linked in:
[ 1676.478099] CPU: 0 UID: 0 PID: 63 Comm: rcu_torture_sta Not tainted 6.18.0-rc4-next-20251110-dirty #6903 PREEMPT(full)
[ 1676.478104] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 1676.478107] RIP: 0010:get_data+0xb3/0x100
[ 1676.478110] Code: 39 cf 75 50 48 8b 76 08 48 8d 4e 07 48 83 e1 f8 48 39 ce 75 4a 83 fa 07 76 50 83 ea 08 48 83 c0 08 41 89 11 c3 cc cc cc cc 90 <0f> 0b 90 31 c0 c3 cc cc cc cc 48 c7 c0 ff ff ff ff 44 29 c2 48 d3
[ 1676.478114] RSP: 0018:ffffb5e0c021bb98 EFLAGS: 00010006
[ 1676.478118] RAX: 0000000000000000 RBX: ffffffffbe27c560 RCX: 0000000000000012
[ 1676.478121] RDX: 0000000000000000 RSI: ffffffffbe342060 RDI: ffffffffbe27c590
[ 1676.478123] RBP: ffffb5e0c021bc78 R08: ffffffffffffff98 R09: ffffb5e0c021bbd0
[ 1676.478126] R10: 00003fffffffffff R11: ffffffffbe342058 R12: ffffb5e0c021bc60
[ 1676.478128] R13: 00000000ffffee71 R14: 000000000000003f R15: 0000000000000400
[ 1676.478133] FS:  0000000000000000(0000) GS:ffff9d0b605f7000(0000) knlGS:0000000000000000
[ 1676.478136] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1676.478139] CR2: ffff9d0b0ac01000 CR3: 0000000001ed2000 CR4: 00000000000006f0
[ 1676.478141] Call Trace:
[ 1676.478155]  <TASK>
[ 1676.478157]  prb_reserve_in_last+0x190/0x4d0
[ 1676.478163]  ? vsnprintf+0x11/0x4a0
[ 1676.478176]  vprintk_store+0x41e/0x530
[ 1676.478191]  vprintk_emit+0x8f/0x360
[ 1676.478199]  _printk+0x56/0x70
[ 1676.478207]  rcu_torture_stats_print+0x2fe/0x780
[ 1676.478218]  ? lock_release+0xc6/0x290
[ 1676.478222]  ? __pfx_rcu_torture_stats+0x10/0x10
[ 1676.478237]  ? __pfx_rcu_torture_stats+0x10/0x10
[ 1676.478241]  rcu_torture_stats+0x25/0x70
[ 1676.478245]  kthread+0x102/0x200
[ 1676.478252]  ? __pfx_kthread+0x10/0x10
[ 1676.478258]  ret_from_fork+0x23f/0x280
[ 1676.478266]  ? __pfx_kthread+0x10/0x10
[ 1676.478270]  ret_from_fork_asm+0x1a/0x30
[ 1676.478285]  </TASK>
[ 1676.478287] irq event stamp: 54788
[ 1676.478289] hardirqs last  enabled at (54787): [<ffffffffbc933c3d>] __up_console_sem+0x4d/0x60
[ 1676.478293] hardirqs last disabled at (54788): [<ffffffffbc935ecc>] vprintk_store+0x3cc/0x530
[ 1676.478297] softirqs last  enabled at (54672): [<ffffffffbc89fb8e>] handle_softirqs+0x2ee/0x3b0
[ 1676.478303] softirqs last disabled at (54659): [<ffffffffbc89fdb1>] __irq_exit_rcu+0xa1/0xc0

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG -next] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100
  2025-11-13  0:52 [BUG -next] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100 Paul E. McKenney
@ 2025-11-13  7:37 ` Petr Mladek
  2025-11-13 17:09   ` Paul E. McKenney
  0 siblings, 1 reply; 3+ messages in thread
From: Petr Mladek @ 2025-11-13  7:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, linux-next, d-tatianin, john.ogness, sfr, rostedt,
	senozhatsky

Hi Paul,

first, thanks a lot for reporting the regression.

On Wed 2025-11-12 16:52:16, Paul E. McKenney wrote:
> Hello!
> 
> Some rcutorture runs on next-20251110 hit the following error on x86:
> 
> WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100, CPU#0: rcu_torture_sta/63
> 
> This happens in about 20-25% of the rcutorture runs, and is the
> WARN_ON_ONCE(1) in the "else" clause of get_data().  There was no
> rcutorture scenario that failed to reproduce this bug, so I am guessing
> that the various .config files will not provide useful information.
> Please see the end of this email for a representative splat, which is
> usually rcutorture printing out something or another.  (Which, in its
> defense, has worked just fine in the past.)
> 
> Bisection converged on this commit:
> 
> 67e1b0052f6b ("printk_ringbuffer: don't needlessly wrap data blocks around")
> 
> Reverting this commit suppressed (or at least hugely reduced the
> probability of) the WARN_ON_ONCE().
> 
> The SRCU-T, SRCU-U, and TREE09 scenarios hit this most frequently at
> about double the base rate, but are CONFIG_SMP=n builds.  The RUDE01
> scenario was the most productive CONFIG_SMP=y scenario.  Reproduce as
> follows, where "N" is the number of CPUs on your system divided by three,
> rounded down:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*RUDE01"
> 
> Or if you can do CONFIG_SMP=n, the following works, where "N" is the
> number of CPUs on your system:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*SRCU-T"
> 
> Or please tell me what debug I should enable on my runs.

The problem was reported by two test robots last week. It happens when
a message fits exactly up to the last byte before the ring buffer gets
wrapped for the first time. It is interesting that you have seen
so frequently (in about 20-25% rcutorture runs).

Anyway, I have pushed a fix on Monday. It is the commit
cc3bad11de6e0d601 ("printk_ringbuffer: Fix check of
valid data size when blk_lpos overflows"), see
https://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git/commit/?h=for-6.19&id=cc3bad11de6e0d6012460487903e7167d3e73957

Thanks a lot for so exhaustive report. And I am sorry that you
probably spent a lot of time with it.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG -next] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100
  2025-11-13  7:37 ` Petr Mladek
@ 2025-11-13 17:09   ` Paul E. McKenney
  0 siblings, 0 replies; 3+ messages in thread
From: Paul E. McKenney @ 2025-11-13 17:09 UTC (permalink / raw)
  To: Petr Mladek
  Cc: linux-kernel, linux-next, d-tatianin, john.ogness, sfr, rostedt,
	senozhatsky

On Thu, Nov 13, 2025 at 08:37:15AM +0100, Petr Mladek wrote:
> Hi Paul,
> 
> first, thanks a lot for reporting the regression.
> 
> On Wed 2025-11-12 16:52:16, Paul E. McKenney wrote:
> > Hello!
> > 
> > Some rcutorture runs on next-20251110 hit the following error on x86:
> > 
> > WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100, CPU#0: rcu_torture_sta/63
> > 
> > This happens in about 20-25% of the rcutorture runs, and is the
> > WARN_ON_ONCE(1) in the "else" clause of get_data().  There was no
> > rcutorture scenario that failed to reproduce this bug, so I am guessing
> > that the various .config files will not provide useful information.
> > Please see the end of this email for a representative splat, which is
> > usually rcutorture printing out something or another.  (Which, in its
> > defense, has worked just fine in the past.)
> > 
> > Bisection converged on this commit:
> > 
> > 67e1b0052f6b ("printk_ringbuffer: don't needlessly wrap data blocks around")
> > 
> > Reverting this commit suppressed (or at least hugely reduced the
> > probability of) the WARN_ON_ONCE().
> > 
> > The SRCU-T, SRCU-U, and TREE09 scenarios hit this most frequently at
> > about double the base rate, but are CONFIG_SMP=n builds.  The RUDE01
> > scenario was the most productive CONFIG_SMP=y scenario.  Reproduce as
> > follows, where "N" is the number of CPUs on your system divided by three,
> > rounded down:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*RUDE01"
> > 
> > Or if you can do CONFIG_SMP=n, the following works, where "N" is the
> > number of CPUs on your system:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*SRCU-T"
> > 
> > Or please tell me what debug I should enable on my runs.
> 
> The problem was reported by two test robots last week. It happens when
> a message fits exactly up to the last byte before the ring buffer gets
> wrapped for the first time. It is interesting that you have seen
> so frequently (in about 20-25% rcutorture runs).
> 
> Anyway, I have pushed a fix on Monday. It is the commit
> cc3bad11de6e0d601 ("printk_ringbuffer: Fix check of
> valid data size when blk_lpos overflows"), see
> https://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git/commit/?h=for-6.19&id=cc3bad11de6e0d6012460487903e7167d3e73957

Even better!  Thank you for the fix.

> Thanks a lot for so exhaustive report. And I am sorry that you
> probably spent a lot of time with it.

Well, actually, it was the first time that I turned "git bisect run"
loose on a full (and fully scripted) remote RCU run.  Each step involved
checking out 20 systems from the test group, building 20 kernels,
downloading the build products to each of the 20 systems, running each
of 286 guest OSes (15 each for 19 of the kernels and one instance of
the large one) spread over the 20 systems, waiting for them to finish,
uploading the test results, returning the systems to the test group,
analyzing them, and reporting either success (all runs succeeded) or
failure (at least one failure across the 286 kernels.  Then my grepping
through the run results directory to get you the failure rate.  Of course,
that failure rate indicates that I could have done the bisection more
quickly and with much less hardware, but that would have required me to
stop the other things I was doing and actually think about this.

Each step took somewhere between 90 minutes and two hours on a total of
1600 CPUs, and all ~11 bisection steps completed without my intervention.
Thus far, neither the test grid, the systems, the scripting, nor git
bisect have complained about my having wasted their time, but what with
AI they probably soon will do so.

I am somewhat surprised that it all went through without something
breaking, but I guess that we all get lucky from time to time.  ;-)

So not a lot of work for me, which is a good thing, given that I had
lots of other distractions, including another much more troublesome
bisection on ARM that actually found a bug that had not yet been fixed.
A trivial bug, admittedly, but such is life!

And again, thank you for so quickly fixing this!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-11-13 17:09 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-13  0:52 [BUG -next] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100 Paul E. McKenney
2025-11-13  7:37 ` Petr Mladek
2025-11-13 17:09   ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox