[Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)
@ 2022-12-09 23:34 Joel Fernandes
  2022-12-10  3:22 ` Joel Fernandes
  0 siblings, 1 reply; 7+ messages in thread
From: Joel Fernandes @ 2022-12-09 23:34 UTC (permalink / raw)
  To: rcu, Zqiang; +Cc: paulmck

Hello,

I see this an hour into the run for TREE03 on v4.19:
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60

Checking to see if Qiang has any thoughts as I saw him comment about a similar
issue in [1].

[ 3243.844445] ------------[ cut here ]------------^M
[ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
[ 3243.851585] Modules linked in:^M
[ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
[ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
[ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
[ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
[ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
[ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
[ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
[ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
[ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
[ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
[ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
[ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
[ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
[ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
[ 3243.883660] Call Trace:^M
[ 3243.884925]  kthread_unpark+0x52/0x60^M
[ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
[ 3243.886341]  cpu_startup_entry+0x62/0x70^M
[ 3243.887063]  start_secondary+0x186/0x1b0^M
[ 3243.887783]  secondary_startup_64+0xa4/0xb0^M

[1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)
  2022-12-09 23:34 [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches) Joel Fernandes
@ 2022-12-10  3:22 ` Joel Fernandes
  2022-12-12 11:45   ` Zhang, Qiang1
  0 siblings, 1 reply; 7+ messages in thread
From: Joel Fernandes @ 2022-12-10  3:22 UTC (permalink / raw)
  To: rcu, Zqiang; +Cc: paulmck

On Fri, Dec 9, 2022 at 11:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hello,
>
> I see this an hour into the run for TREE03 on v4.19:
> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60
>
> Checking to see if Qiang has any thoughts as I saw him comment about a similar
> issue in [1].
>
> [ 3243.844445] ------------[ cut here ]------------^M
> [ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
> [ 3243.851585] Modules linked in:^M
> [ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
> [ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
> [ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
> [ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
> [ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
> [ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
> [ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
> [ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
> [ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
> [ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
> [ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
> [ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
> [ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> [ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> [ 3243.883660] Call Trace:^M
> [ 3243.884925]  kthread_unpark+0x52/0x60^M
> [ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
> [ 3243.886341]  cpu_startup_entry+0x62/0x70^M
> [ 3243.887063]  start_secondary+0x186/0x1b0^M
> [ 3243.887783]  secondary_startup_64+0xa4/0xb0^M
>
> [1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ
>

This set of scheduler patches seems to make it go away, however I am
running a long weekend test to collect more data:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/torture/v4.19.fixes.wip.120922

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)
  2022-12-10  3:22 ` Joel Fernandes
@ 2022-12-12 11:45   ` Zhang, Qiang1
  2022-12-12 14:05     ` Joel Fernandes
  0 siblings, 1 reply; 7+ messages in thread
From: Zhang, Qiang1 @ 2022-12-12 11:45 UTC (permalink / raw)
  To: Joel Fernandes, rcu@vger.kernel.org; +Cc: paulmck@kernel.org

On Fri, Dec 9, 2022 at 11:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hello,
>
> I see this an hour into the run for TREE03 on v4.19:
> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60
>
> Checking to see if Qiang has any thoughts as I saw him comment about a similar
> issue in [1].
>
> [ 3243.844445] ------------[ cut here ]------------^M
> [ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
> [ 3243.851585] Modules linked in:^M
> [ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
> [ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
> [ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
> [ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
> [ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
> [ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
> [ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
> [ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
> [ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
> [ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
> [ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
> [ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
> [ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> [ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> [ 3243.883660] Call Trace:^M
> [ 3243.884925]  kthread_unpark+0x52/0x60^M
> [ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
> [ 3243.886341]  cpu_startup_entry+0x62/0x70^M
> [ 3243.887063]  start_secondary+0x186/0x1b0^M
> [ 3243.887783]  secondary_startup_64+0xa4/0xb0^M
>
> [1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ
>
>
>This set of scheduler patches seems to make it go away, however I am
>running a long weekend test to collect more data:
>https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/torture/v4.19.fixes.wip.120922


Hi Joel

I also used the link above to do the TREE03 test, looks like the problem still exists:

5834 [  966.016205]  kthread_unpark+0x50/0x60^M
 5835 [  966.016440]  cpuhp_online_idle+0x31/0x50^M
 5836 [  966.016678]  cpu_startup_entry+0x65/0x70^M
 5837 [  966.016913]  start_secondary+0x18b/0x1c0^M
 5838 [  966.017148]  secondary_startup_64+0xa4/0xb0^M
 5839 [  966.017407] ---[ end trace 636953f76b8055db ]---^M
 5840 [  966.017691] migration/1     R  running task    14952    18      2 0x80000000 last_sleep: 961405359084.  last_runna      ble: 961700816631^M
 5841 [  966.018406] Call Trace:^M
 5842 [  966.018559]  __schedule+0x75f/0x1320^M
 5843 [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
 5844 [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
 5845 [  966.019302]  ? sort_range+0x20/0x20^M
 5846 [  966.019515]  preempt_schedule_common+0x32/0x80^M
 5847 [  966.019776]  ___preempt_schedule+0x16/0x18^M
 5848 [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
 5849 [  966.020275]  smpboot_thread_fn+0x195/0x230^M
 5850 [  966.020525]  kthread+0x139/0x160^M
 5851 [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
 5852 [  966.021014]  ret_from_fork+0x35/0x40^M

From the calltrace, the state of migration kthreads is not correct, 
I'm also looking at what's going wrong.

Thanks
Zqiang

>
>Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)
  2022-12-12 11:45   ` Zhang, Qiang1
@ 2022-12-12 14:05     ` Joel Fernandes
  2022-12-13 13:21       ` Zhang, Qiang1
  0 siblings, 1 reply; 7+ messages in thread
From: Joel Fernandes @ 2022-12-12 14:05 UTC (permalink / raw)
  To: Zhang, Qiang1; +Cc: rcu, paulmck



> On Dec 12, 2022, at 6:45 AM, Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
> 
> On Fri, Dec 9, 2022 at 11:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>> 
>> Hello,
>> 
>> I see this an hour into the run for TREE03 on v4.19:
>> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60
>> 
>> Checking to see if Qiang has any thoughts as I saw him comment about a similar
>> issue in [1].
>> 
>> [ 3243.844445] ------------[ cut here ]------------^M
>> [ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
>> [ 3243.851585] Modules linked in:^M
>> [ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
>> [ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
>> [ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
>> [ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
>> [ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
>> [ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
>> [ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
>> [ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
>> [ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
>> [ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
>> [ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
>> [ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>> [ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
>> [ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
>> [ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
>> [ 3243.883660] Call Trace:^M
>> [ 3243.884925]  kthread_unpark+0x52/0x60^M
>> [ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
>> [ 3243.886341]  cpu_startup_entry+0x62/0x70^M
>> [ 3243.887063]  start_secondary+0x186/0x1b0^M
>> [ 3243.887783]  secondary_startup_64+0xa4/0xb0^M
>> 
>> [1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ
>> 
>> 
>> This set of scheduler patches seems to make it go away, however I am
>> running a long weekend test to collect more data:
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/torture/v4.19.fixes.wip.120922
> 
> 
> Hi Joel
> 
> I also used the link above to do the TREE03 test, looks like the problem still exists:
> 
> 5834 [  966.016205]  kthread_unpark+0x50/0x60^M
> 5835 [  966.016440]  cpuhp_online_idle+0x31/0x50^M
> 5836 [  966.016678]  cpu_startup_entry+0x65/0x70^M
> 5837 [  966.016913]  start_secondary+0x18b/0x1c0^M
> 5838 [  966.017148]  secondary_startup_64+0xa4/0xb0^M
> 5839 [  966.017407] ---[ end trace 636953f76b8055db ]---^M
> 5840 [  966.017691] migration/1     R  running task    14952    18      2 0x80000000 last_sleep: 961405359084.  last_runna      ble: 961700816631^M
> 5841 [  966.018406] Call Trace:^M
> 5842 [  966.018559]  __schedule+0x75f/0x1320^M
> 5843 [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> 5844 [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> 5845 [  966.019302]  ? sort_range+0x20/0x20^M
> 5846 [  966.019515]  preempt_schedule_common+0x32/0x80^M
> 5847 [  966.019776]  ___preempt_schedule+0x16/0x18^M
> 5848 [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> 5849 [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> 5850 [  966.020525]  kthread+0x139/0x160^M
> 5851 [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> 5852 [  966.021014]  ret_from_fork+0x35/0x40^M
> 
> From the calltrace, the state of migration kthreads is not correct, 
> I'm also looking at what's going wrong.

Thanks for checking. I also noticed that the state of those threads is running instead of parked, causing the warning.

- Joel



> Thanks
> Zqiang
> 
>> 
>> Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)
  2022-12-12 14:05     ` Joel Fernandes
@ 2022-12-13 13:21       ` Zhang, Qiang1
  2022-12-13 15:54         ` Joel Fernandes
  0 siblings, 1 reply; 7+ messages in thread
From: Zhang, Qiang1 @ 2022-12-13 13:21 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu@vger.kernel.org, paulmck@kernel.org

> On Dec 12, 2022, at 6:45 AM, Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
> 
> On Fri, Dec 9, 2022 at 11:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>> 
>> Hello,
>> 
>> I see this an hour into the run for TREE03 on v4.19:
>> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60
>> 
>> Checking to see if Qiang has any thoughts as I saw him comment about a similar
>> issue in [1].
>> 
>> [ 3243.844445] ------------[ cut here ]------------^M
>> [ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
>> [ 3243.851585] Modules linked in:^M
>> [ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
>> [ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
>> [ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
>> [ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
>> [ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
>> [ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
>> [ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
>> [ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
>> [ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
>> [ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
>> [ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
>> [ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>> [ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
>> [ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
>> [ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
>> [ 3243.883660] Call Trace:^M
>> [ 3243.884925]  kthread_unpark+0x52/0x60^M
>> [ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
>> [ 3243.886341]  cpu_startup_entry+0x62/0x70^M
>> [ 3243.887063]  start_secondary+0x186/0x1b0^M
>> [ 3243.887783]  secondary_startup_64+0xa4/0xb0^M
>> 
>> [1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ
>> 
>> 
>> This set of scheduler patches seems to make it go away, however I am
>> running a long weekend test to collect more data:
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/torture/v4.19.fixes.wip.120922
> 
> 
> Hi Joel
> 
> I also used the link above to do the TREE03 test, looks like the problem still exists:
> 
> 5834 [  966.016205]  kthread_unpark+0x50/0x60^M
> 5835 [  966.016440]  cpuhp_online_idle+0x31/0x50^M
> 5836 [  966.016678]  cpu_startup_entry+0x65/0x70^M
> 5837 [  966.016913]  start_secondary+0x18b/0x1c0^M
> 5838 [  966.017148]  secondary_startup_64+0xa4/0xb0^M
> 5839 [  966.017407] ---[ end trace 636953f76b8055db ]---^M
> 5840 [  966.017691] migration/1     R  running task    14952    18      2 0x80000000 last_sleep: 961405359084.  last_runna      ble: 961700816631^M
> 5841 [  966.018406] Call Trace:^M
> 5842 [  966.018559]  __schedule+0x75f/0x1320^M
> 5843 [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> 5844 [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> 5845 [  966.019302]  ? sort_range+0x20/0x20^M
> 5846 [  966.019515]  preempt_schedule_common+0x32/0x80^M
> 5847 [  966.019776]  ___preempt_schedule+0x16/0x18^M
> 5848 [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> 5849 [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> 5850 [  966.020525]  kthread+0x139/0x160^M
> 5851 [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> 5852 [  966.021014]  ret_from_fork+0x35/0x40^M
> 
> From the calltrace, the state of migration kthreads is not correct, 
> I'm also looking at what's going wrong.
>
>Thanks for checking. I also noticed that the state of those threads is running instead of parked, causing the warning.
>
>- Joel

Hi Joel

I find the TREE03 enable CONFIG_SCHED_CORE = y, under this condition, if the cpu goes offline, 
we will directly select the idle kthread, which leads to it preempting the migration kthread.
through calltrace, it can also be found , the  migration/1  kthread is preempted and schedule out.

[  966.017691] migration/1     R  running task    14952   
[  966.018559]  __schedule+0x75f/0x1320^M
[  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
[  966.019042]  ? ___preempt_schedule+0x16/0x18^M
[  966.019302]  ? sort_range+0x20/0x20^M
[  966.019515]  preempt_schedule_common+0x32/0x80^M
[  966.019776]  ___preempt_schedule+0x16/0x18^M
[  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
[  966.020275]  smpboot_thread_fn+0x195/0x230^M
[  966.020525]  kthread+0x139/0x160^M
[  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
[  966.021014]  ret_from_fork+0x35/0x40^M

The following modification can fix this warning(this change comes from the new kernel version).

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0acbc7706d71..1c76226a2c26 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4109,8 +4109,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
        bool fi_before = false;

        cpu = cpu_of(rq);
-       if (cpu_is_offline(cpu))
-               return idle_sched_class.pick_next_task(rq, prev, rf);
+       if (cpu_is_offline(cpu)) {
+               rq->core_pick = NULL;
+               return __pick_next_task(rq, prev, rf);
+       }


Thanks
Zqiang

>
>
>
> Thanks
> Zqiang
> 
>> 
>> Thanks.

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)
  2022-12-13 13:21       ` Zhang, Qiang1
@ 2022-12-13 15:54         ` Joel Fernandes
  2022-12-14  8:33           ` Zhang, Qiang1
  0 siblings, 1 reply; 7+ messages in thread
From: Joel Fernandes @ 2022-12-13 15:54 UTC (permalink / raw)
  To: Zhang, Qiang1; +Cc: rcu@vger.kernel.org, paulmck@kernel.org

On Tue, Dec 13, 2022 at 8:21 AM Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
>
> > On Dec 12, 2022, at 6:45 AM, Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
> >
> > On Fri, Dec 9, 2022 at 11:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>
> >> Hello,
> >>
> >> I see this an hour into the run for TREE03 on v4.19:
> >> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60
> >>
> >> Checking to see if Qiang has any thoughts as I saw him comment about a similar
> >> issue in [1].
> >>
> >> [ 3243.844445] ------------[ cut here ]------------^M
> >> [ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
> >> [ 3243.851585] Modules linked in:^M
> >> [ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
> >> [ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
> >> [ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
> >> [ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
> >> [ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
> >> [ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
> >> [ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
> >> [ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
> >> [ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
> >> [ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
> >> [ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
> >> [ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> >> [ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
> >> [ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> >> [ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> >> [ 3243.883660] Call Trace:^M
> >> [ 3243.884925]  kthread_unpark+0x52/0x60^M
> >> [ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
> >> [ 3243.886341]  cpu_startup_entry+0x62/0x70^M
> >> [ 3243.887063]  start_secondary+0x186/0x1b0^M
> >> [ 3243.887783]  secondary_startup_64+0xa4/0xb0^M
> >>
> >> [1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ
> >>
> >>
> >> This set of scheduler patches seems to make it go away, however I am
> >> running a long weekend test to collect more data:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/torture/v4.19.fixes.wip.120922
> >
> >
> > Hi Joel
> >
> > I also used the link above to do the TREE03 test, looks like the problem still exists:
> >
> > 5834 [  966.016205]  kthread_unpark+0x50/0x60^M
> > 5835 [  966.016440]  cpuhp_online_idle+0x31/0x50^M
> > 5836 [  966.016678]  cpu_startup_entry+0x65/0x70^M
> > 5837 [  966.016913]  start_secondary+0x18b/0x1c0^M
> > 5838 [  966.017148]  secondary_startup_64+0xa4/0xb0^M
> > 5839 [  966.017407] ---[ end trace 636953f76b8055db ]---^M
> > 5840 [  966.017691] migration/1     R  running task    14952    18      2 0x80000000 last_sleep: 961405359084.  last_runna      ble: 961700816631^M
> > 5841 [  966.018406] Call Trace:^M
> > 5842 [  966.018559]  __schedule+0x75f/0x1320^M
> > 5843 [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> > 5844 [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> > 5845 [  966.019302]  ? sort_range+0x20/0x20^M
> > 5846 [  966.019515]  preempt_schedule_common+0x32/0x80^M
> > 5847 [  966.019776]  ___preempt_schedule+0x16/0x18^M
> > 5848 [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> > 5849 [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> > 5850 [  966.020525]  kthread+0x139/0x160^M
> > 5851 [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> > 5852 [  966.021014]  ret_from_fork+0x35/0x40^M
> >
> > From the calltrace, the state of migration kthreads is not correct,
> > I'm also looking at what's going wrong.
> >
> >Thanks for checking. I also noticed that the state of those threads is running instead of parked, causing the warning.
> >
> >- Joel
>
> Hi Joel
>
> I find the TREE03 enable CONFIG_SCHED_CORE = y, under this condition, if the cpu goes offline,
> we will directly select the idle kthread, which leads to it preempting the migration kthread.
> through calltrace, it can also be found , the  migration/1  kthread is preempted and schedule out.
>
> [  966.017691] migration/1     R  running task    14952
> [  966.018559]  __schedule+0x75f/0x1320^M
> [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> [  966.019302]  ? sort_range+0x20/0x20^M
> [  966.019515]  preempt_schedule_common+0x32/0x80^M
> [  966.019776]  ___preempt_schedule+0x16/0x18^M
> [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> [  966.020525]  kthread+0x139/0x160^M
> [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> [  966.021014]  ret_from_fork+0x35/0x40^M
>
> The following modification can fix this warning(this change comes from the new kernel version).
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0acbc7706d71..1c76226a2c26 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4109,8 +4109,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>         bool fi_before = false;
>
>         cpu = cpu_of(rq);
> -       if (cpu_is_offline(cpu))
> -               return idle_sched_class.pick_next_task(rq, prev, rf);
> +       if (cpu_is_offline(cpu)) {
> +               rq->core_pick = NULL;
> +               return __pick_next_task(rq, prev, rf);
> +       }

Wow hats off at that debugging and thank you so much, I will test this
more and let you know. I think CONFIG_SCHED_CORE might be enabled by
default on the ChromeOS 4.19 kernel, which is why we see it and the
regular 4.19 stable kernel does not.

Anyway thank you again and kudos on the awesome debug. I will keep you
posted on my testing.

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)
  2022-12-13 15:54         ` Joel Fernandes
@ 2022-12-14  8:33           ` Zhang, Qiang1
  0 siblings, 0 replies; 7+ messages in thread
From: Zhang, Qiang1 @ 2022-12-14  8:33 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu@vger.kernel.org, paulmck@kernel.org

On Tue, Dec 13, 2022 at 8:21 AM Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
>
> > On Dec 12, 2022, at 6:45 AM, Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
> >
> > On Fri, Dec 9, 2022 at 11:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>
> >> Hello,
> >>
> >> I see this an hour into the run for TREE03 on v4.19:
> >> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60
> >>
> >> Checking to see if Qiang has any thoughts as I saw him comment about a similar
> >> issue in [1].
> >>
> >> [ 3243.844445] ------------[ cut here ]------------^M
> >> [ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
> >> [ 3243.851585] Modules linked in:^M
> >> [ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
> >> [ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
> >> [ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
> >> [ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
> >> [ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
> >> [ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
> >> [ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
> >> [ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
> >> [ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
> >> [ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
> >> [ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
> >> [ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> >> [ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
> >> [ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> >> [ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> >> [ 3243.883660] Call Trace:^M
> >> [ 3243.884925]  kthread_unpark+0x52/0x60^M
> >> [ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
> >> [ 3243.886341]  cpu_startup_entry+0x62/0x70^M
> >> [ 3243.887063]  start_secondary+0x186/0x1b0^M
> >> [ 3243.887783]  secondary_startup_64+0xa4/0xb0^M
> >>
> >> [1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ
> >>
> >>
> >> This set of scheduler patches seems to make it go away, however I am
> >> running a long weekend test to collect more data:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/torture/v4.19.fixes.wip.120922
> >
> >
> > Hi Joel
> >
> > I also used the link above to do the TREE03 test, looks like the problem still exists:
> >
> > 5834 [  966.016205]  kthread_unpark+0x50/0x60^M
> > 5835 [  966.016440]  cpuhp_online_idle+0x31/0x50^M
> > 5836 [  966.016678]  cpu_startup_entry+0x65/0x70^M
> > 5837 [  966.016913]  start_secondary+0x18b/0x1c0^M
> > 5838 [  966.017148]  secondary_startup_64+0xa4/0xb0^M
> > 5839 [  966.017407] ---[ end trace 636953f76b8055db ]---^M
> > 5840 [  966.017691] migration/1     R  running task    14952    18      2 0x80000000 last_sleep: 961405359084.  last_runna      ble: 961700816631^M
> > 5841 [  966.018406] Call Trace:^M
> > 5842 [  966.018559]  __schedule+0x75f/0x1320^M
> > 5843 [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> > 5844 [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> > 5845 [  966.019302]  ? sort_range+0x20/0x20^M
> > 5846 [  966.019515]  preempt_schedule_common+0x32/0x80^M
> > 5847 [  966.019776]  ___preempt_schedule+0x16/0x18^M
> > 5848 [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> > 5849 [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> > 5850 [  966.020525]  kthread+0x139/0x160^M
> > 5851 [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> > 5852 [  966.021014]  ret_from_fork+0x35/0x40^M
> >
> > From the calltrace, the state of migration kthreads is not correct,
> > I'm also looking at what's going wrong.
> >
> >Thanks for checking. I also noticed that the state of those threads is running instead of parked, causing the warning.
> >
> >- Joel
>
> Hi Joel
>
> I find the TREE03 enable CONFIG_SCHED_CORE = y, under this condition, if the cpu goes offline,
> we will directly select the idle kthread, which leads to it preempting the migration kthread.
> through calltrace, it can also be found , the  migration/1  kthread is preempted and schedule out.
>
> [  966.017691] migration/1     R  running task    14952
> [  966.018559]  __schedule+0x75f/0x1320^M
> [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> [  966.019302]  ? sort_range+0x20/0x20^M
> [  966.019515]  preempt_schedule_common+0x32/0x80^M
> [  966.019776]  ___preempt_schedule+0x16/0x18^M
> [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> [  966.020525]  kthread+0x139/0x160^M
> [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> [  966.021014]  ret_from_fork+0x35/0x40^M
>
> The following modification can fix this warning(this change comes from the new kernel version).
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0acbc7706d71..1c76226a2c26 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4109,8 +4109,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>         bool fi_before = false;
>
>         cpu = cpu_of(rq);
> -       if (cpu_is_offline(cpu))
> -               return idle_sched_class.pick_next_task(rq, prev, rf);
> +       if (cpu_is_offline(cpu)) {
> +               rq->core_pick = NULL;
> +               return __pick_next_task(rq, prev, rf);
> +       }
>
>Wow hats off at that debugging and thank you so much, I will test this
>more and let you know. I think CONFIG_SCHED_CORE might be enabled by
>default on the ChromeOS 4.19 kernel, which is why we see it and the
>regular 4.19 stable kernel does not.
>
>Anyway thank you again and kudos on the awesome debug. I will keep you
>posted on my testing.

No problem, most of my work is also done on older kernel versions.

Thanks
Zqiang

>
>Thanks,
>
> - Joel

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-12-14  8:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-12-09 23:34 [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches) Joel Fernandes
2022-12-10  3:22 ` Joel Fernandes
2022-12-12 11:45   ` Zhang, Qiang1
2022-12-12 14:05     ` Joel Fernandes
2022-12-13 13:21       ` Zhang, Qiang1
2022-12-13 15:54         ` Joel Fernandes
2022-12-14  8:33           ` Zhang, Qiang1

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.